For each modification (an experiment), the system runs benchmark tests to validate it empirically and archives the result (knowledge accumulation). This automates the research cycle. Rather than optimizing toward a single goal, it focuses on discovering new ideas and algorithms through exploratory research. Instead of staying within a fixed, human-designed structure, the AI modifies and validates its own code, allowing self-improvement to continue indefinitely. The aim is cumulative, open-ended development, similar to scientific discovery.

Open-ended exploration: the system creates diverse 'stepping stones' that diversify the search paths and help it escape local optima.
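The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system's actual implementation: `Agent`, `benchmark`, `self_modify`, and `run` are hypothetical names, the agent's "code" is reduced to a list of numbers, and the benchmark is a made-up scoring function. The point it tries to show is the archive: every evaluated variant is kept, and parents are sampled from the whole archive rather than only from the current best, so weaker variants can still serve as stepping stones.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    code: List[float]             # toy stand-in for the agent's source code
    score: float = 0.0            # benchmark result for this variant
    parent: Optional[int] = None  # archive index of the parent, None for the root


def benchmark(agent: Agent) -> float:
    """Hypothetical benchmark: higher is better, maximum 1.0 at code == [1, 1, 1]."""
    return 1.0 / (1.0 + sum((x - 1.0) ** 2 for x in agent.code))


def self_modify(parent: Agent, rng: random.Random) -> List[float]:
    """Hypothetical self-modification: a random perturbation of the parent's code."""
    return [x + rng.gauss(0.0, 0.3) for x in parent.code]


def run(generations: int = 200, seed: int = 0) -> List[Agent]:
    rng = random.Random(seed)
    root = Agent(code=[0.0, 0.0, 0.0])
    root.score = benchmark(root)
    archive = [root]
    for _ in range(generations):
        # Sample a parent from the WHOLE archive, weighted by score.
        # The small constant keeps low-scoring variants selectable.
        weights = [a.score + 0.05 for a in archive]
        idx = rng.choices(range(len(archive)), weights=weights, k=1)[0]
        child = Agent(code=self_modify(archive[idx], rng), parent=idx)
        child.score = benchmark(child)  # empirical validation
        archive.append(child)           # archive every result, not only improvements
    return archive


if __name__ == "__main__":
    archive = run()
    best = max(archive, key=lambda a: a.score)
    print(f"archive size: {len(archive)}, best score: {best.score:.3f}")
```

Following the `parent` indices from the best agent back to the root gives the lineage of modifications that produced it. A greedy variant that only ever mutates the current best would be simpler, but it would discard the stepping stones that the open-ended approach relies on.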
Results
The key improvements the system discovered include tool refinement and multi-patch evaluation.
- SWE-bench success rate 20% → 50%
- Polyglot success rate 14.2% → 30.7%
The comparison is not entirely fair, since the baseline had the self-improvement and open-ended elements removed. Even so, this is one of the few cases where an evolutionary algorithm has proven effective in AI agent and LLM applications.