AI Generated
If you ask ChatGPT to tell you a joke, you will notice that it only knows a few jokes. The models are silently collapsed: when you look at any single output, you see just one example, so the collapse is not visible. When you use a synthetically generated dataset, results silently get worse, whereas what you want is diversity and richness. You therefore have to make sure you maintain the entropy of your dataset, and that is the hard part. Nevertheless, synthetic data is absolutely the future, and we must treat it carefully.
Because Transformer models are very strong at extrapolation, it is also possible that they will not run into the limits of human-generated data.
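The point about maintaining entropy can be made concrete with a few dataset-level diversity checks. The following is a minimal sketch, not an established metric suite: the function names (`token_entropy`, `distinct_n`, `duplicate_rate`), the whitespace tokenization, and the sample sentences are all assumptions chosen for illustration. The idea is that collapse is invisible in any single sample but shows up once you measure the whole set.

```python
import math
from collections import Counter


def token_entropy(texts):
    """Shannon entropy (in bits) of the unigram token distribution over all texts."""
    # Whitespace tokenization is an assumption; a real tokenizer could be swapped in.
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_n(texts, n=2):
    """Fraction of n-grams in the dataset that are unique (between 0 and 1)."""
    ngrams = []
    for t in texts:
        toks = t.lower().split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def duplicate_rate(texts):
    """Fraction of samples that are exact repeats of another sample."""
    if not texts:
        return 0.0
    return 1 - len(set(texts)) / len(texts)


# Hypothetical data: a "collapsed" set repeats one joke, a "diverse" set does not.
collapsed = ["why did the chicken cross the road"] * 4
diverse = [
    "why did the chicken cross the road",
    "a horse walks into a bar",
    "knock knock who is there",
    "my cat writes better code than me",
]

for name, data in [("collapsed", collapsed), ("diverse", diverse)]:
    print(name, token_entropy(data), distinct_n(data), duplicate_rate(data))
```

Each sample in `collapsed` looks fine on its own; only the aggregate numbers reveal the problem. Tracking such values across successive synthetic generations is one possible way to notice entropy loss early, though which thresholds matter depends on the task.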
Synthetic Datasets
Synthetic Dataset Metrics
AI models collapse when trained on recursively generated data
Persona Chat AI based Synthetic data creation with generation instruction