AI Generated
If you go to ChatGPT and ask it for a joke, you will notice that it only knows a few jokes. The models are silently collapsed: any single output looks fine in isolation, but across many samples the diversity and richness you want are missing. Synthetic datasets silently degrade in the same way, so the hard part is making sure you maintain the entropy of your dataset. Nevertheless, synthetic data is absolutely the future; we just have to tread carefully.
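One cheap way to watch for the collapse described above is to measure the Shannon entropy of a model's sampled outputs: a collapsed model recycles a handful of responses, so entropy stays low no matter how many samples you draw. A minimal sketch (the joke strings and the `sample_entropy` helper are illustrative, not from any library):

```python
from collections import Counter
import math

def sample_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over samples.
    A collapsed model repeats a few outputs, driving entropy toward 0."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A collapsed model keeps returning the same joke with minor variations...
collapsed = ["joke A", "joke A", "joke B", "joke A"]
# ...while a healthy dataset spreads probability mass over distinct outputs.
diverse = ["joke A", "joke B", "joke C", "joke D"]

print(sample_entropy(collapsed))  # ~0.81 bits
print(sample_entropy(diverse))    # 2.0 bits, the maximum for 4 samples
```

In practice you would compute this over normalized n-grams or embeddings rather than exact strings, and track it across generations of synthetic data to catch silent degradation early.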
Transformer models are very powerful at extrapolation, so there is a possibility that they won't face the limitations of human-generated data.
Synthetic Datasets
Synthetic Dataset Metrics
AI models collapse when trained on recursively generated data
No Priors Ep. 80 | With Andrej Karpathy from OpenAI and Tesla
Andrej Karpathy joins Sarah and Elad on this week's episode of No Priors. Andrej, a founding team member of OpenAI and the former leader of Tesla Autopilot, needs no introduction. In this episode, Andrej discusses the evolution of self-driving cars, comparing Tesla's and Waymo's approaches, and the technical challenges ahead. They also cover Tesla's Optimus humanoid robot, the bottlenecks of AI development today, and how AI capabilities could be further integrated with human cognition. Andrej shares more about his new venture, Eureka Labs, along with his insights into AI-driven education and what young people should study to prepare for the reality ahead.
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Karpathy
Show Notes:
0:00 Introduction
0:33 Evolution of self-driving cars
2:23 The Tesla vs. Waymo approach to self-driving
6:32 Training Optimus with automotive models
10:26 Reasoning behind the humanoid form factor
13:22 Existing challenges in robotics
16:12 Bottlenecks of AI progress
20:27 Parallels between human cognition and AI models
22:12 Merging human cognition with AI capabilities
27:10 Building high performance small models
30:33 Andrej’s current work in AI-enabled education
36:17 How AI-driven education reshapes knowledge networks and status
41:26 Eureka Labs
42:25 What young people should study to prepare for the future
https://www.youtube.com/watch?v=hM_h0UA7upI

Persona-based synthetic chat data creation with
persona-hub
tencent-ailab • Updated 2024 Jul 10 11:46
LLMs aren't "trained on the internet" anymore | Hacker News
And Phi-3 is something else, even from relatively limited time playing with it, so that’s useful signal for anyone who hadn’t looked at it yet. Wildly cool stuff.
https://news.ycombinator.com/item?id=40549021&utm_source=pytorchkr&ref=pytorchkr
instruction generation
Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators
Orca-AgentInstruct, from Microsoft Research, can generate diverse, high-quality synthetic data at scale to post-train and fine-tune base LLMs for expanded capabilities, continual learning, and increased performance.
https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/


Seonglae Cho