Datasets for AI come in three types, each matching a training stage
- Background information - Pretraining
- Problems with solutions - SFT (supervised fine-tuning)
- Practice problems - Reinforcement Learning
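One way to picture the three types is by the shape of a single record in each. The sketch below is only illustrative: the class names, the field names, and the idea that a practice problem carries a checker function instead of a worked solution are all assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PretrainingExample:
    """Background information: plain exposition to read (hypothetical schema)."""
    text: str


@dataclass
class SFTExample:
    """Problem with a worked solution to imitate (hypothetical schema)."""
    prompt: str
    solution: str


@dataclass
class RLExample:
    """Practice problem: only a way to score an attempt is given (assumption)."""
    prompt: str
    check: Callable[[str], bool]


def training_stage(example) -> str:
    """Map a record to the training stage named in the list above."""
    if isinstance(example, PretrainingExample):
        return "Pretraining"
    if isinstance(example, SFTExample):
        return "SFT"
    if isinstance(example, RLExample):
        return "Reinforcement Learning"
    raise TypeError(f"unknown example type: {type(example).__name__}")


background = PretrainingExample(text="Addition combines two numbers into a sum.")
worked = SFTExample(prompt="What is 2 + 2?", solution="2 + 2 = 4")
practice = RLExample(prompt="What is 3 + 5?", check=lambda answer: answer.strip() == "8")
```

Here `practice.check("8")` plays the role of a reward signal: the attempt is graded, but no solution is shown in the record.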
We typically say that a dataset is high-dimensional if the number of data points N is
smaller than the dimensionality D.
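The N < D condition can be checked directly on a table of samples. This is a minimal sketch, assuming the dataset is a list of equal-length feature rows; the function name `is_high_dimensional` and the toy data are made up for illustration.

```python
def is_high_dimensional(data):
    """Return True when the number of data points N is smaller than the dimensionality D."""
    n = len(data)                   # N: number of data points (rows)
    d = len(data[0]) if n else 0    # D: number of features per point (columns)
    return n < d


# Two samples with five features each: N = 2, D = 5
wide = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 0.9, 0.8, 0.7, 0.6],
]

# Three samples with two features each: N = 3, D = 2
tall = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
```

Under this definition `wide` counts as high-dimensional and `tall` does not.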
- not cheatable
- large degree of intra-class variability
Datasets
Dataset Usages
I Built an AI Chatbot Based On My Favorite Podcast
In the future, any time you look up information you're going to use a chatbot. This applies to every piece of information you interact with day to day: personal, organizational, and cultural.
https://every.to/superorganizers/i-trained-a-gpt-3-chatbot-on-every-episode-of-my-favorite-podcast

20 Open Datasets for Natural Language Processing
Natural language processing is a significant part of machine learning use cases, but it requires a lot of data and some deftly handled…
https://odsc.medium.com/20-open-datasets-for-natural-language-processing-538fbfaf8e38

Andrej Karpathy on Twitter / X
We have to take the LLMs to school. When you open any textbook, you'll see three major types of information: 1. Background information / exposition. The meat of the textbook that explains concepts. As you attend over it, your brain is training on that data. This is equivalent… (Andrej Karpathy, @karpathy, January 30, 2025)
https://x.com/karpathy/status/1885026028428681698


Seonglae Cho