
Holistic AI Agent Graph August 1st

Date
2025 Aug 1 0:0 → 2025 Aug 7 0:0
Created by
Seonglae Cho
Created time
2025 Aug 1 15:18
Last edited by
Seonglae Cho
Last edited time
2025 Aug 8 0:36
Refs
  1. Wrap the JSON list of runs as a trace object (in the format we defined)
  2. Put the trace files in any folder (the command below uses temp)
  3. Preprocess the traces with the command below (run from the agent-graph repo)
     uv run python agentgraph/input/text_processing/trace_preprocessor.py --traces temp --output preprocessed
  4. Generate the agent graph with the command below
     uv run python evaluation/extract_agentgraph.py --method pydantic_hybrid --input preprocessed/input-context.json --output preprocessed/pydantic-hybrid.py
  5. Visualise the generated agent graph by serving the portable AgentGraph visualiser
     serve evaluation/visualiser
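Step 1 above can be sketched roughly as follows. The notes do not show the trace format that was defined, so the `trace_id` and `runs` keys, and the helper names `wrap_runs_as_trace` and `wrap_file`, are hypothetical placeholders; swap in the real schema before using this.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


def wrap_runs_as_trace(runs: list, trace_id: Optional[str] = None) -> dict:
    """Wrap a list of run dicts in a single trace object.

    ASSUMPTION: the keys "trace_id" and "runs" are placeholders, not the
    project's actual trace schema.
    """
    if trace_id is None:
        # Prefer a trace_id already carried by the runs; otherwise make one up.
        trace_id = next(
            (str(run["trace_id"]) for run in runs if run.get("trace_id")),
            str(uuid.uuid4()),
        )
    return {"trace_id": trace_id, "runs": runs}


def wrap_file(src: Path, dst_dir: Path) -> Path:
    """Read a JSON list of runs from src and write the wrapped trace into dst_dir."""
    runs = json.loads(src.read_text(encoding="utf-8"))
    if not isinstance(runs, list):
        raise ValueError(f"{src} does not contain a JSON list of runs")
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    dst.write_text(json.dumps(wrap_runs_as_trace(runs), indent=2), encoding="utf-8")
    return dst
```

Calling `wrap_file(Path("runs.json"), Path("temp"))` would place the wrapped file in the folder that the preprocessing command reads from; `runs.json` is an example file name only.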

What to change Tool ideas

  • Search
      • line
      • RAG
      • rule based
  • Validation
      • image
      • rule based
  • Graph
      • networkx
from __future__ import annotations

from datetime import datetime
from decimal import Decimal
from typing import Any, Optional, Union
from uuid import UUID

from pydantic import BaseModel, Field, PrivateAttr

# Attachments, AttachmentInfo and _default_extra are defined elsewhere in the
# original module and are not shown in these notes.


class RunBase(BaseModel):
    id: UUID
    name: str
    start_time: datetime
    run_type: str
    end_time: Optional[datetime] = None
    extra: Optional[dict] = Field(default_factory=_default_extra)
    error: Optional[str] = None
    serialized: Optional[dict] = None
    events: Optional[list[dict]] = None
    inputs: dict = Field(default_factory=dict)
    outputs: Optional[dict] = None
    reference_example_id: Optional[UUID] = None
    parent_run_id: Optional[UUID] = None
    tags: Optional[list[str]] = None
    attachments: Union[Attachments, dict[str, AttachmentInfo]] = Field(
        default_factory=dict
    )

    @property
    def metadata(self) -> dict[str, Any]:
        """Retrieve the metadata (if any)."""
        if self.extra is None:
            self.extra = {}
        return self.extra.setdefault("metadata", {})

    @property
    def revision_id(self) -> Optional[UUID]:
        """Retrieve the revision ID (if any)."""
        return self.metadata.get("revision_id")

    def __repr__(self):
        """Return a string representation of the RunBase object."""
        return f"{self.__class__}(id={self.id}, name='{self.name}', run_type='{self.run_type}')"

    class Config:
        """Configuration class for the schema."""

        arbitrary_types_allowed = True


class Run(RunBase):
    session_id: Optional[UUID] = None
    child_run_ids: Optional[list[UUID]] = None
    child_runs: Optional[list[Run]] = None
    feedback_stats: Optional[dict[str, Any]] = None
    app_path: Optional[str] = None
    manifest_id: Optional[UUID] = None
    status: Optional[str] = None
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    total_tokens: Optional[int] = None
    first_token_time: Optional[datetime] = None
    total_cost: Optional[Decimal] = None
    prompt_cost: Optional[Decimal] = None
    completion_cost: Optional[Decimal] = None
    parent_run_ids: Optional[list[UUID]] = None
    trace_id: UUID
    dotted_order: str = Field(default="")
    in_dataset: Optional[bool] = None
    _host_url: Optional[str] = PrivateAttr(default=None)

    @property
    def url(self) -> Optional[str]:
        """URL of this run within the app."""
        if self._host_url and self.app_path:
            return f"{self._host_url}{self.app_path}"
        return None
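Since each run carries `id` and `parent_run_id`, a flat list of runs can be turned back into a call tree before any graph extraction. The following is a minimal sketch on plain dicts using only those schema fields plus `name`, `run_type`, `dotted_order` and `start_time`; the helper names `build_run_tree` and `render_tree` are hypothetical, and using `dotted_order` as a plain sort key is an assumption about how that field orders runs.

```python
from collections import defaultdict


def build_run_tree(runs: list) -> tuple:
    """Return (root ids, {parent id: [child ids]}) from flat run dicts."""
    ids = {str(run["id"]) for run in runs}
    children = defaultdict(list)
    roots = []
    # ASSUMPTION: dotted_order sorts runs in execution order; fall back to start_time.
    ordered = sorted(
        runs,
        key=lambda run: (run.get("dotted_order") or "", str(run.get("start_time") or "")),
    )
    for run in ordered:
        run_id = str(run["id"])
        parent = run.get("parent_run_id")
        if parent is not None and str(parent) in ids:
            children[str(parent)].append(run_id)
        else:
            # No parent, or the parent is missing from this list: treat as a root.
            roots.append(run_id)
    return roots, dict(children)


def render_tree(runs: list) -> list:
    """Render the run tree as indented 'name (run_type)' lines."""
    by_id = {str(run["id"]): run for run in runs}
    roots, children = build_run_tree(runs)
    lines = []

    def visit(run_id: str, depth: int) -> None:
        run = by_id[run_id]
        lines.append(f"{'  ' * depth}{run['name']} ({run['run_type']})")
        for child_id in children.get(run_id, []):
            visit(child_id, depth + 1)

    for root_id in roots:
        visit(root_id, 0)
    return lines
```

The same parent-to-children mapping could be loaded into networkx as directed edges if graph algorithms are needed; the standard-library version is used here to keep the sketch dependency-free.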

Whole run with 49 samples (gpt-4o-mini)

  • With the same model and the same prompt, performance stays roughly constant as the sample size increases, regardless of the number of steps or token usage (on the limited data available). Changing the prompt can produce better results, but there is still an upper bound on performance.
  • Significant differences only appear when the trace is preprocessed.
  • With a fixed model, external tools or data processing algorithms are the only ways to improve results.
  • Never judge based on a single result; results vary randomly with high variance. Rely only on multiple visualisations and metrics.
Total datasets: 49
Successful: 49
Failed: 0
📈 METHOD PERFORMANCE:
hybrid_method:
  Success rate: 49/49
  Avg exact score: 0.212
  Avg semantic score: 0.664
  Avg processing time: 58.58s
  Avg matching score: 0.455
  💰 Avg tokens per run: 279395
  💰 Avg cost per run: $0.0488
  💰 Total cost: $2.3927
  🤖 Model: unknown
direct_llm_method:
  Success rate: 49/49
  Avg exact score: 0.250
  Avg semantic score: 0.661
  Avg processing time: 42.51s
  Avg matching score: 0.480
  💰 Avg tokens per run: 43024
  💰 Avg cost per run: $0.0070
  💰 Total cost: $0.3372
  🤖 Model: gpt-4o-mini
pydantic_hybrid_method:
  Success rate: 49/49
  Avg exact score: 0.239
  Avg semantic score: 0.643
  Avg processing time: 63.37s
  Avg matching score: 0.474
  💰 Avg tokens per run: 46116
  💰 Avg cost per run: $0.0083
  💰 Total cost: $0.3881
  🤖 Model: unknown
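The point about not judging from a single result can be made concrete by repeating each configuration several times and comparing means against their spread. This is a rough sketch, not a formal significance test; the function names and the threshold `k` are hypothetical, and any scores passed in must come from real repeated runs.

```python
from statistics import mean, stdev


def summarise(scores: list) -> dict:
    """Mean, sample standard deviation and standard error of repeated scores."""
    if len(scores) < 2:
        raise ValueError("need at least two repeated runs to estimate variation")
    sd = stdev(scores)
    return {
        "n": len(scores),
        "mean": mean(scores),
        "stdev": sd,
        "stderr": sd / len(scores) ** 0.5,
    }


def clearly_different(a: list, b: list, k: float = 2.0) -> bool:
    """Crude check: do the two means differ by more than k combined standard errors?"""
    sa, sb = summarise(a), summarise(b)
    combined = (sa["stderr"] ** 2 + sb["stderr"] ** 2) ** 0.5
    return abs(sa["mean"] - sb["mean"]) > k * combined
```

With only one run per configuration, `summarise` refuses to answer, which is the intended behaviour: a single score gives no estimate of the variation.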

GPT 4o

hybrid_method:
  Success rate: 49/49
  Avg exact score: 0.212
  Avg semantic score: 0.664
  Avg processing time: 58.58s
  Avg matching score: 0.455
  💰 Avg tokens per run: 279395
  💰 Avg cost per run: $0.0488
  💰 Total cost: $2.3927
  🤖 Model: unknown
pydantic_hybrid_method:
  Success rate: 49/49
  Avg exact score: 0.239
  Avg semantic score: 0.643
  Avg processing time: 63.37s
  Avg matching score: 0.474
  💰 Avg tokens per run: 46116
  💰 Avg cost per run: $0.0083
  💰 Total cost: $0.3881
  🤖 Model: unknown

OpenAI Agent

Total datasets: 10
Successful: 10
Failed: 0
📈 METHOD PERFORMANCE:
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.329
  Avg semantic score: 0.720
  Avg processing time: 46.68s
  Avg matching score: 0.509
  💰 Token usage: Not available
hybrid_method:
  Success rate: 10/10
  Avg exact score: 0.268
  Avg semantic score: 0.704
  Avg processing time: 44.13s
  Avg matching score: 0.487
  💰 Avg tokens per run: 79385
  💰 Avg cost per run: $0.0166
  💰 Total cost: $0.1659
  🤖 Model: unknown
pydantic_hybrid_method:
  Success rate: 10/10
  Avg exact score: 0.325
  Avg semantic score: 0.665
  Avg processing time: 49.07s
  Avg matching score: 0.524
  💰 Avg tokens per run: 14567
  💰 Avg cost per run: $0.0032
  💰 Total cost: $0.0320
  🤖 Model: unknown
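To compare methods across these summaries without reading the numbers by eye, the printed text can be parsed into a dictionary. The sketch below assumes the summary format shown above (a `*_method:` header followed by `Avg ... score:` entries); `parse_summary` and `best_by` are hypothetical helper names, not part of the evaluation script.

```python
import re

# A method header such as "openai_agent_method:".
METHOD_RE = re.compile(r"(\w+_method):")
# The three averaged scores printed for each method.
METRIC_RE = re.compile(r"Avg (exact|semantic|matching) score: ([0-9]+(?:\.[0-9]+)?)")


def parse_summary(text: str) -> dict:
    """Map each method name to its average exact/semantic/matching scores."""
    results = {}
    headers = list(METHOD_RE.finditer(text))
    for i, header in enumerate(headers):
        # A method's block runs until the next method header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[header.end():end]
        results[header.group(1)] = {
            name: float(value) for name, value in METRIC_RE.findall(block)
        }
    return results


def best_by(results: dict, metric: str) -> str:
    """Name of the method with the highest value for the given metric."""
    return max(results, key=lambda method: results[method].get(metric, float("-inf")))
```

Because the patterns do not depend on line breaks, the parser works on both the one-metric-per-line output and a flattened copy of it.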

Two step agent

baseline single validation agent
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.248
  Avg semantic score: 0.628
  Avg processing time: 19.01s
  Avg matching score: 0.446
  💰 Token usage: Not available
two step
direct → validation only
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.367
  Avg semantic score: 0.750
  Avg processing time: 73.79s
  Avg matching score: 0.564
  💰 Token usage: Not available
only add validation
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.401
  Avg semantic score: 0.712
  Avg processing time: 473.85s
  Avg matching score: 0.565
  💰 Token usage: Not available
external iterative addition strategy without addition (whole generate)
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.390
  Avg semantic score: 0.737
  Avg processing time: 57.66s
  Avg matching score: 0.597
  💰 Token usage: Not available
interval validation tool without addition
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.388
  Avg semantic score: 0.726
  Avg processing time: 64.93s
  Avg matching score: 0.576
  💰 Token usage: Not available
corrected add and validation tool
openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.368
  Avg semantic score: 0.725
  Avg processing time: 54.57s
  Avg matching score: 0.573
  💰 Token usage: Not available

openai_agent_method:
  Success rate: 10/10
  Avg exact score: 0.386
  Avg semantic score: 0.732
  Avg processing time: 29.55s
  Avg matching score: 0.582
  💰 Token usage: Not available

direct llm

direct_llm_method:
  Success rate: 10/10
  Avg exact score: 0.365
  Avg semantic score: 0.727
  Avg processing time: 26.10s
  Avg matching score: 0.545
  💰 Avg tokens per run: 12796
  💰 Avg cost per run: $0.0025
  💰 Total cost: $0.0248

Production

crew_agent_method:
  Success rate: 10/10
  Avg exact score: 0.077
  Avg semantic score: 0.541
  Avg processing time: 69.81s
  Avg matching score: 0.383
  💰 Token usage: Not available
slice
crew_agent_method:
  Success rate: 10/10
  Avg exact score: 0.076
  Avg semantic score: 0.602
  Avg processing time: 66.89s
  Avg matching score: 0.423
  💰 Token usage: Not available
proper slice

Recommendations