Holistic AI Agent Graph August 1st

1. Wrap the JSON list of runs as a trace object (the format we defined)

2. Put the trace files in a folder of your choice, e.g. traces (the command in step 3 reads from temp)

3. Preprocess the traces with the command below (run from the agent-graph repo)


uv run python agentgraph/input/text_processing/trace_preprocessor.py --traces temp --output preprocessed

4. Generate the agent graph with the command below


uv run python evaluation/extract_agentgraph.py --method pydantic_hybrid --input preprocessed/input-context.json --output preprocessed/pydantic-hybrid.py

5. Visualise the generated agent graph by serving the portable AgentGraph visualiser


serve evaluation/visualiser
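
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration: the trace format we defined is not reproduced in these notes, so the keys "trace_id" and "runs" and both function names are placeholders, not the real schema.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


def wrap_runs_as_trace(runs: list, trace_id: Optional[str] = None) -> dict:
    """Wrap a flat JSON list of runs in a single trace object.

    The keys "trace_id" and "runs" are placeholders, not the agreed schema.
    """
    return {"trace_id": trace_id or str(uuid.uuid4()), "runs": runs}


def save_trace(trace: dict, folder: str = "traces") -> Path:
    """Write the trace to <folder>/<trace_id>.json and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{trace['trace_id']}.json"
    # default=str keeps UUID and datetime values serialisable.
    path.write_text(json.dumps(trace, indent=2, default=str), encoding="utf-8")
    return path
```

The folder passed to save_trace should be the same one given to --traces in step 3.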

What to change

Tool ideas: line, RAG, rule based

Validation: image, rule based

Graph: networkx
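
As a rough illustration of the rule-based validation idea, the sketch below checks an extracted graph for a few structural problems in plain Python. The field names "id", "source" and "target" and the three rules are assumptions made for the example, not the actual AgentGraph schema or validator; the same checks could also be expressed on a networkx graph.

```python
def validate_graph(entities: list, relations: list) -> list:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    ids = [e["id"] for e in entities]
    known = set(ids)

    # Rule 1: entity ids must be unique.
    seen = set()
    for entity_id in ids:
        if entity_id in seen:
            problems.append(f"duplicate entity id: {entity_id}")
        seen.add(entity_id)

    # Rule 2: every relation must connect two known entities.
    connected = set()
    for rel in relations:
        for end in ("source", "target"):
            if rel[end] not in known:
                problems.append(f"relation {rel['id']} has unknown {end}: {rel[end]}")
            else:
                connected.add(rel[end])

    # Rule 3: flag entities that no relation touches.
    for entity_id in sorted(known - connected):
        problems.append(f"isolated entity: {entity_id}")
    return problems
```

A validator of this kind returns findings rather than raising, so the list can be fed back to the model or logged next to the metrics.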



from __future__ import annotations  # lets Run refer to itself in child_runs

from datetime import datetime
from decimal import Decimal
from typing import Any, Optional, Union
from uuid import UUID

from pydantic import BaseModel, Field, PrivateAttr

# Attachments, AttachmentInfo and _default_extra are defined elsewhere
# and are not shown in these notes.


class RunBase(BaseModel):
    id: UUID
    name: str
    start_time: datetime
    run_type: str
    end_time: Optional[datetime] = None
    extra: Optional[dict] = Field(default_factory=_default_extra)
    error: Optional[str] = None
    serialized: Optional[dict] = None
    events: Optional[list[dict]] = None
    inputs: dict = Field(default_factory=dict)
    outputs: Optional[dict] = None
    reference_example_id: Optional[UUID] = None
    parent_run_id: Optional[UUID] = None
    tags: Optional[list[str]] = None
    attachments: Union[Attachments, dict[str, AttachmentInfo]] = Field(
        default_factory=dict
    )

    @property
    def metadata(self) -> dict[str, Any]:
        """Retrieve the metadata (if any)."""
        if self.extra is None:
            self.extra = {}
        return self.extra.setdefault("metadata", {})

    @property
    def revision_id(self) -> Optional[UUID]:
        """Retrieve the revision ID (if any)."""
        return self.metadata.get("revision_id")

    def __repr__(self):
        """Return a string representation of the RunBase object."""
        return f"{self.__class__}(id={self.id}, name='{self.name}', run_type='{self.run_type}')"

    class Config:
        """Configuration class for the schema."""
        arbitrary_types_allowed = True


class Run(RunBase):
    session_id: Optional[UUID] = None
    child_run_ids: Optional[list[UUID]] = None
    child_runs: Optional[list[Run]] = None
    feedback_stats: Optional[dict[str, Any]] = None
    app_path: Optional[str] = None
    manifest_id: Optional[UUID] = None
    status: Optional[str] = None
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    total_tokens: Optional[int] = None
    first_token_time: Optional[datetime] = None
    total_cost: Optional[Decimal] = None
    prompt_cost: Optional[Decimal] = None
    completion_cost: Optional[Decimal] = None
    parent_run_ids: Optional[list[UUID]] = None
    trace_id: UUID
    dotted_order: str = Field(default="")
    in_dataset: Optional[bool] = None
    _host_url: Optional[str] = PrivateAttr(default=None)

    @property
    def url(self) -> Optional[str]:
        """URL of this run within the app."""
        if self._host_url and self.app_path:
            return f"{self._host_url}{self.app_path}"
        return None
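
Since each run carries id, parent_run_id and start_time, a flat list of runs can be regrouped into a tree before preprocessing. The sketch below works on plain dicts using those three field names from the schema above; the function name build_run_tree is made up for the example and is not part of the agent-graph repo.

```python
from collections import defaultdict


def build_run_tree(runs: list) -> tuple:
    """Group flat run dicts into root ids and a parent -> children mapping."""
    known_ids = {str(run["id"]) for run in runs}
    children = defaultdict(list)
    roots = []
    # Sort by start_time so siblings come out in execution order.
    for run in sorted(runs, key=lambda r: r["start_time"]):
        run_id = str(run["id"])
        parent = run.get("parent_run_id")
        # A run is a root if it has no parent or its parent is not in this list.
        if parent is None or str(parent) not in known_ids:
            roots.append(run_id)
        else:
            children[str(parent)].append(run_id)
    return roots, dict(children)
```

Runs whose parent is missing from the list are treated as roots, so a partially exported trace still yields a usable forest.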

Whole run with 49 samples (gpt-4o-mini)

With the same model and the same prompt, performance stays roughly constant as the sample size grows, regardless of the number of steps or the token usage (at least on the limited data available). Changing the prompt may improve results, but there is still an upper bound on performance.

Significant differences only appear when preprocessing the trace.

External tools or data processing algorithms are the only ways to improve results with a fixed model.

Never judge based on a single result: the results differ randomly, with high variance. Rely only on multiple visualisations and metrics.
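
One way to act on this is to repeat each configuration several times and compare the spread as well as the mean. The sketch below is a crude rule of thumb, not a significance test; the function names and the "gap larger than the standard deviation" threshold are assumptions for illustration.

```python
from statistics import mean, stdev


def summarise_repeats(scores: list) -> dict:
    """Summarise the scores of repeated runs of one configuration."""
    if len(scores) < 2:
        raise ValueError("need at least two repeated runs to estimate spread")
    return {
        "n": len(scores),
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


def clearly_better(a: list, b: list) -> bool:
    """True if a's mean beats b's by more than the larger standard deviation."""
    sa, sb = summarise_repeats(a), summarise_repeats(b)
    return sa["mean"] - sb["mean"] > max(sa["stdev"], sb["stdev"])
```

With only one run per configuration, summarise_repeats raises, which is the point: a single score gives no estimate of the variance.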


Total datasets: 49
Successful: 49
Failed: 0

📈 METHOD PERFORMANCE:

hybrid_method:
   Success rate: 49/49
   Avg exact score: 0.212
   Avg semantic score: 0.664
   Avg processing time: 58.58s
   Avg matching score: 0.455
   💰 Avg tokens per run: 279395
   💰 Avg cost per run: $0.0488
   💰 Total cost: $2.3927
   🤖 Model: unknown

direct_llm_method:
   Success rate: 49/49
   Avg exact score: 0.250
   Avg semantic score: 0.661
   Avg processing time: 42.51s
   Avg matching score: 0.480
   💰 Avg tokens per run: 43024
   💰 Avg cost per run: $0.0070
   💰 Total cost: $0.3372
   🤖 Model: gpt-4o-mini

pydantic_hybrid_method:
   Success rate: 49/49
   Avg exact score: 0.239
   Avg semantic score: 0.643
   Avg processing time: 63.37s
   Avg matching score: 0.474
   💰 Avg tokens per run: 46116
   💰 Avg cost per run: $0.0083
   💰 Total cost: $0.3881
   🤖 Model: unknown
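
To compare methods without reading the blocks by eye, the report text above can be parsed into a dictionary. This is a minimal sketch that assumes the layout shown in these notes (a "<name>_method:" line followed by "Avg ..." and "Total cost" lines); parse_report is a made-up helper, not part of the evaluation scripts.

```python
import re

# A method header such as "hybrid_method:".
METHOD_RE = re.compile(r"^(\w+_method):\s*$")
# A numeric metric such as "Avg processing time: 58.58s" or "Total cost: $2.3927",
# optionally preceded by an emoji.
METRIC_RE = re.compile(r"^\W*(Avg [\w ]+|Total cost):\s*\$?([\d.]+)s?\s*$")


def parse_report(text: str) -> dict:
    """Map each method name to its numeric metrics."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = METHOD_RE.match(line)
        if m:
            # If a method name repeats, later values overwrite earlier ones.
            current = results.setdefault(m.group(1), {})
            continue
        m = METRIC_RE.match(line)
        if m and current is not None:
            current[m.group(1)] = float(m.group(2))
    return results
```

On the 49-sample run above, dividing the parsed "Avg tokens per run" of hybrid_method (279395) by that of direct_llm_method (43024) shows the hybrid method using roughly 6.5 times the tokens for a similar matching score.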

GPT-4o


hybrid_method:
   Success rate: 49/49
   Avg exact score: 0.212
   Avg semantic score: 0.664
   Avg processing time: 58.58s
   Avg matching score: 0.455
   💰 Avg tokens per run: 279395
   💰 Avg cost per run: $0.0488
   💰 Total cost: $2.3927
   🤖 Model: unknown

pydantic_hybrid_method:
   Success rate: 49/49
   Avg exact score: 0.239
   Avg semantic score: 0.643
   Avg processing time: 63.37s
   Avg matching score: 0.474
   💰 Avg tokens per run: 46116
   💰 Avg cost per run: $0.0083
   💰 Total cost: $0.3881
   🤖 Model: unknown

OpenAI Agent


Total datasets: 10
Successful: 10
Failed: 0

📈 METHOD PERFORMANCE:

openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.329
   Avg semantic score: 0.720
   Avg processing time: 46.68s
   Avg matching score: 0.509
   💰 Token usage: Not available


hybrid_method:
   Success rate: 10/10
   Avg exact score: 0.268
   Avg semantic score: 0.704
   Avg processing time: 44.13s
   Avg matching score: 0.487
   💰 Avg tokens per run: 79385
   💰 Avg cost per run: $0.0166
   💰 Total cost: $0.1659
   🤖 Model: unknown
   
pydantic_hybrid_method:
   Success rate: 10/10
   Avg exact score: 0.325
   Avg semantic score: 0.665
   Avg processing time: 49.07s
   Avg matching score: 0.524
   💰 Avg tokens per run: 14567
   💰 Avg cost per run: $0.0032
   💰 Total cost: $0.0320
   🤖 Model: unknown

Two step agent

baseline single validation agent


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.248
   Avg semantic score: 0.628
   Avg processing time: 19.01s
   Avg matching score: 0.446
   💰 Token usage: Not available

two step

direct → validation only


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.367
   Avg semantic score: 0.750
   Avg processing time: 73.79s
   Avg matching score: 0.564
   💰 Token usage: Not available

only add validation


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.401
   Avg semantic score: 0.712
   Avg processing time: 473.85s
   Avg matching score: 0.565
   💰 Token usage: Not available

external iterative addition strategy without addition (whole generate)


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.390
   Avg semantic score: 0.737
   Avg processing time: 57.66s
   Avg matching score: 0.597
   💰 Token usage: Not available

interval validation tool without addition


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.388
   Avg semantic score: 0.726
   Avg processing time: 64.93s
   Avg matching score: 0.576
   💰 Token usage: Not available

corrected add and validation tool


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.368
   Avg semantic score: 0.725
   Avg processing time: 54.57s
   Avg matching score: 0.573
   💰 Token usage: Not available


openai_agent_method:
   Success rate: 10/10
   Avg exact score: 0.386
   Avg semantic score: 0.732
   Avg processing time: 29.55s
   Avg matching score: 0.582
   💰 Token usage: Not available

direct llm


direct_llm_method:
   Success rate: 10/10
   Avg exact score: 0.365
   Avg semantic score: 0.727
   Avg processing time: 26.10s
   Avg matching score: 0.545
   💰 Avg tokens per run: 12796
   💰 Avg cost per run: $0.0025
   💰 Total cost: $0.0248

Production


crew_agent_method:
   Success rate: 10/10
   Avg exact score: 0.077
   Avg semantic score: 0.541
   Avg processing time: 69.81s
   Avg matching score: 0.383
   💰 Token usage: Not available

slice


crew_agent_method:
   Success rate: 10/10
   Avg exact score: 0.076
   Avg semantic score: 0.602
   Avg processing time: 66.89s
   Avg matching score: 0.423
   💰 Token usage: Not available

proper slice