- First day PR - Evaluation script feature with screenshots
- Second day PR - Improved prompt evaluation comparison, including relative superiority against the manual LLM judge comparison
Research note
Open Deep Research ICML
OpenSSF as an evaluation
- should run OpenSSF locally
- GitHub API token
improve prompt based on the evaluation
RRF prompt enhancement - not that complex, so just mention it in the paper
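If RRF in the note above refers to reciprocal rank fusion (an assumption; the note does not spell it out), the idea is small enough to sketch. The function name `rrf_fuse`, the constant `k = 60`, and the sample result lists are illustrative and not taken from the project.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from two search queries.
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
print(fused)
```

Items that appear near the top of several lists rise above items that appear in only one, which is why it could be described briefly in the paper rather than as a separate contribution.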
change search API
doing multiple experiment on frameworks
How to catch unstable generation from a metrics perspective
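One possible way to make unstable generation visible in the metrics is to repeat the same evaluation several times and report the spread next to the mean. This is a minimal sketch, assuming each run yields one numeric score per library; `stability_report`, the `max_std` threshold, and the sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def stability_report(scores_by_library, max_std=0.5):
    """Summarize repeated runs and flag libraries whose scores vary too much."""
    report = {}
    for name, scores in scores_by_library.items():
        spread = pstdev(scores)  # population standard deviation across runs
        report[name] = {
            "mean": mean(scores),
            "std": spread,
            "unstable": spread > max_std,
        }
    return report


# Made-up scores from three repeated runs per library.
runs = {"lib_a": [3, 3, 3], "lib_b": [1, 3, 5]}
report = stability_report(runs)
```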
human feedback
- iteration
- web search
Is there anything I can help with for the paper writing? I can make a PR today
- paper work is due next week
feedback this week
work on next week
Paper
risk score is weird
popular libraries had higher alignment scores since it is easy to find information about them on the web
bold the highest overall score in the table
change the title: AgentOSSF SSFAgent
case study example appendix from result
all library references
key parts
- section writer
- query writer
- plan writer
What should be prioritized?
- Automating table format or output in the source code should be first priority, as some data points like GitHub stars, license information, and Active Maintenance status are not directly extracted by the current workflow
- Adding more libraries to the evaluation
- Implementing automated submission pipeline using GitHub Actions with cron jobs for better scalability
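For the first priority above, the GitHub REST API repository response carries `stargazers_count`, `license.spdx_id`, `pushed_at`, and `archived`, which could feed the table directly. The sketch below only converts an already-fetched payload into a row; the function name `repo_row`, the 90-day activity window, and the sample payload are assumptions for illustration, not the project's workflow.

```python
from datetime import datetime, timedelta, timezone


def repo_row(payload, now=None, active_days=90):
    """Turn a GitHub repository JSON payload into a flat table row."""
    now = now or datetime.now(timezone.utc)
    # GitHub timestamps end in "Z"; normalize so fromisoformat accepts them.
    pushed = datetime.fromisoformat(payload["pushed_at"].replace("Z", "+00:00"))
    license_info = payload.get("license") or {}
    recently_pushed = now - pushed <= timedelta(days=active_days)
    return {
        "name": payload["full_name"],
        "stars": payload["stargazers_count"],
        "license": license_info.get("spdx_id", "unknown"),
        "active": (not payload.get("archived", False)) and recently_pushed,
    }


# Made-up payload shaped like a GitHub repository response.
sample = {
    "full_name": "example/repo",
    "stargazers_count": 1234,
    "license": {"spdx_id": "MIT"},
    "pushed_at": "2025-01-10T12:00:00Z",
    "archived": False,
}
row = repo_row(sample, now=datetime(2025, 2, 1, tzinfo=timezone.utc))
```

Keeping the fetch step separate from the row conversion would let a scheduled job reuse the same function without changes.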
Leaderboard
Remove url link error
Column changes caused errors even when a type list was passed; resolved by using markdown for all of them
fix double row error due to the double language
very long page error - maybe gradio? or huggingface space
library type (framework) icon visualization or legend
If there are multiple versions, change to JSON containing a list
change the github readme
Paper
If content needs to be added
add the insights I wrote to the results section
remove the first-page footnote, since the references take up space
add the cost for each report: 0.1 dollars
cache scorecard tool
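A minimal sketch of what caching the scorecard tool could look like, assuming results are JSON-serializable and keyed by repository. `cached_scorecard`, the `run_scorecard` callable, the one-day TTL, and the fake runner with its placeholder score are all hypothetical; the real tool call would be passed in where the fake runner is.

```python
import json
import tempfile
import time
from pathlib import Path


def cached_scorecard(repo, run_scorecard, cache_dir, ttl_seconds=86400):
    """Return a fresh cached result for `repo`, else call `run_scorecard` and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (repo.replace("/", "__") + ".json")
    if path.exists() and time.time() - path.stat().st_mtime < ttl_seconds:
        return json.loads(path.read_text())
    result = run_scorecard(repo)
    path.write_text(json.dumps(result))
    return result


calls = []


def fake_runner(repo):
    # Stand-in for the real scorecard call; the score is a placeholder value.
    calls.append(repo)
    return {"repo": repo, "score": 0.0}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_scorecard("example/repo", fake_runner, tmp)
    second = cached_scorecard("example/repo", fake_runner, tmp)  # served from disk
```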
Candidates
ML Frameworks
- PyTorch
- TensorFlow
- JAX
- Candle ML
Agents Framework
- CrewAI
- LangGraph
- Composio
- Agent Development Kit
- SmolAgents
- MetaGPT
- Pydantic AI
App Agent
- Browser Use
- Stagehand
Prompt Engineering
- LangChain
- LlamaIndex
Inference Engine
- SGLang
- vLLM
- TensorRT
- TGI
- ONNX
Score Metrics: License, Security, Maintenance, Dependencies, Regulatory, Overall. Model Metrics: Model Coverage, Model Seeking.

| Category | Name | License | Security | Maintenance | Dependencies | Regulatory | Overall | Model Coverage | Model Seeking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ML Frameworks | PyTorch | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 88% (15/17) | 8 |
| | JAX | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 61% (11/18) | 12 |
| | TensorFlow | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 80% (12/15) | 5 |
| | ONNX | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 88% (14/16) | 5 |
| | Candle ML | ⭐ | ⭐ | ⭐ | ⭐ | ⭐ | ⭐ | 71% (10/14) | 12 |
| Agents Framework | CrewAI | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 71% (10/14) | 13 |
| | LangGraph | ⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | 78% (14/18) | 7 |
| | Composio | ⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | 67% (10/15) | 5 |
| | Agent Development Kit | ⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐ | ⭐ | 71% (10/14) | 7 |
| | SmolAgents | ⭐⭐⭐⭐⭐ | ⭐ | ⭐ | ⭐ | ⭐ | ⭐ | 73% (11/15) | 9 |
| | MetaGPT | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 57% (8/14) | 7 |
| | Pydantic AI | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 88% (15/17) | 10 |
| App Agent | Browser Use | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 88% (15/17) | 7 |
| | Stagehand | ⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐ | 47% (7/15) | 6 |
| Prompt Engineering | LangChain | ⭐⭐⭐⭐⭐ | ⭐ | ⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 72% (13/18) | 19 |
| | LlamaIndex | ⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐ | 47% (8/17) | 7 |
| Inference Engine | SGLang | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 73% (11/15) | 5 |
| | vLLM | ⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | 73% (11/15) | 7 |
| | TensorRT | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 69% (11/16) | 5 |
| | TGI | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐⭐ | 72% (13/18) | 6 |
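
The Model Coverage column reads as a matched-over-total ratio, for example 15 of 17 shown as 88%. A small sketch of that formatting follows; `format_coverage` is a hypothetical helper, the half-up rounding is an assumption, and what counts as a matched item is defined by the evaluation itself rather than here.

```python
def format_coverage(matched, total):
    """Format a coverage ratio like '88% (15/17)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = int(100 * matched / total + 0.5)  # round half up to a whole percent
    return f"{percent}% ({matched}/{total})"


pytorch = format_coverage(15, 17)
jax = format_coverage(11, 18)
print(pytorch, jax)
```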
Open Deep Research ICMLs
| Name | Baseline Alignment | Novelty Yield | Trust Score | License | Security | Maintenance | Dependencies | Regulatory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 88.24% | 8 | | 5 | 1 | 3 | 1 | 3 |
| | 61.11% | 12 | | 5 | 3 | 4 | 1 | 1 |
| | 72.22% | 5 | | 5 | 1 | 3 | 1 | 3 |
| | 87.5% | 5 | | 5 | 1 | 3 | 1 | 1 |
| | 76.47% | 4 | | 5 | 1 | 4 | 1 | 3 |





Seonglae Cho