AI Generated
If you go to ChatGPT and ask it for a joke, you will notice that it only knows a few jokes. The models are silently collapsed: any single output looks fine in isolation, but across many samples the diversity and richness you want are missing. Synthetic datasets silently degrade in the same way, so the hard part is making sure you maintain the entropy of your dataset. Nevertheless, synthetic data is absolutely the future; we just have to tread carefully.
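One cheap way to watch for the collapse described above is to measure the Shannon entropy of a model's sampled outputs: a collapsed model recycles a handful of responses, so entropy stays low no matter how many samples you draw. A minimal sketch (the joke strings and the `sample_entropy` helper are illustrative, not from any library):

```python
from collections import Counter
import math

def sample_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over samples.
    A collapsed model repeats a few outputs, driving entropy toward 0."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A collapsed model keeps returning the same joke with minor variations...
collapsed = ["joke A", "joke A", "joke B", "joke A"]
# ...while a healthy dataset spreads probability mass over distinct outputs.
diverse = ["joke A", "joke B", "joke C", "joke D"]

print(sample_entropy(collapsed))  # ~0.81 bits
print(sample_entropy(diverse))    # 2.0 bits, the maximum for 4 samples
```

In practice you would compute this over normalized n-grams or embeddings rather than exact strings, and track it across generations of synthetic data to catch silent degradation early.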
Transformer models are very powerful at extrapolation, so there is a possibility that they won't face the limitations of human-generated data.
Synthetic Datasets
Synthetic Dataset Metrics
AI models collapse when trained on recursively generated data
No Priors Ep. 80 | With Andrej Karpathy from OpenAI and Tesla
Andrej Karpathy joins Sarah and Elad on this week's episode of No Priors. Andrej, a founding team member of OpenAI and the former leader of Tesla Autopilot, needs no introduction. In this episode, Andrej discusses the evolution of self-driving cars, comparing Tesla's and Waymo's approaches, and the technical challenges ahead. They also cover Tesla's Optimus humanoid robot, the bottlenecks of AI development today, and how AI capabilities could be further integrated with human cognition. Andrej shares more about his new venture, Eureka Labs, along with his insights into AI-driven education and what young people should study to prepare for the reality ahead.
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Karpathy
Show Notes:
0:00 Introduction
0:33 Evolution of self-driving cars
2:23 The Tesla vs. Waymo approach to self-driving
6:32 Training Optimus with automotive models
10:26 Reasoning behind the humanoid form factor
13:22 Existing challenges in robotics
16:12 Bottlenecks of AI progress
20:27 Parallels between human cognition and AI models
22:12 Merging human cognition with AI capabilities
27:10 Building high performance small models
30:33 Andrej’s current work in AI-enabled education
36:17 How AI-driven education reshapes knowledge networks and status
41:26 Eureka Labs
42:25 What young people should study to prepare for the future
https://www.youtube.com/watch?v=hM_h0UA7upI

Persona-based synthetic chat data creation with
persona-hub
tencent-ailab • Updated 2024 Jul 10 11:46
LLMs aren't "trained on the internet" anymore | Hacker News
And Phi-3 is something else, even from relatively limited time playing with it, so that’s useful signal for anyone who hadn’t looked at it yet. Wildly cool stuff.
https://news.ycombinator.com/item?id=40549021&utm_source=pytorchkr&ref=pytorchkr
instruction generation
Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators
Orca-AgentInstruct, from Microsoft Research, can generate diverse, high-quality synthetic data at scale to post-train and fine-tune base LLMs for expanded capabilities, continual learning, and increased performance.
https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/


Seonglae Cho