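Exam format: multiple choice and short answer; seminar content is not covered
One simple question from the last topic, Tool learning
- BERT cannot generate sequences; it converts the input into a single fixed-length vector (used for classification) via the CLS token
- Unlike GPT-2 pre-training, it is used in a feature-based way: the embedding weights are kept fixed and only the added layers are trained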
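To make the CLS-token point concrete, here is a minimal sketch in plain Python. The hidden states, weights, and function names are made up for illustration and do not come from any library; the only assumption taken from the notes is that the vector at the CLS position is used as the fixed-length representation for classification.

```python
def cls_vector(hidden_states):
    """Return the hidden state at position 0, where the [CLS] token sits."""
    return hidden_states[0]


def linear_logits(vector, weights, bias):
    """Apply a linear classification layer: one logit per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# Toy encoder output (assumed values): 4 tokens, hidden size 3.
hidden_states = [
    [0.5, -1.0, 2.0],  # [CLS]
    [0.1, 0.2, 0.3],
    [0.0, 1.0, 0.0],
    [0.9, 0.9, 0.9],   # [SEP]
]
# Added classification layer (assumed values): 2 classes x hidden size 3.
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0]

pooled = cls_vector(hidden_states)             # fixed length, whatever the input length
logits = linear_logits(pooled, weights, bias)  # one score per class
predicted_class = max(range(len(logits)), key=logits.__getitem__)
```

However long the input is, `pooled` always has the hidden size as its length, which is why this setup fits classification rather than sequence generation.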
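The professor tends to re-ask questions that many students got wrong on the midterm, so make sure the concepts are clear
Exam predictions
- Probably one open-ended question asking for an idea that uses LLMs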
GPT2
Paradigm shift
Word Vectors + Task specific architectures → Multi layer RNN → Pre-trained transformers + Fine-tuning
- Limitations of pre-training ➔ fine-tuning: one fine-tuned model per task, so you end up with many “copies” of the same model
- The model only overfits to the training distribution and does not work properly on out-of-distribution samples
- High benchmark performance means the model solved that dataset, not the task (spurious correlation)
- Scaling up: scaling laws
- In-context learning as meta-learning: in-context learning is in charge of the inner loop while SGD is responsible for the outer loop
- Larger Models Learn Better In-Context

In-context learning based on few-shot examples
Unlike fine-tuning, the model is only trained once for all downstream tasks.
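A minimal sketch of what "few-shot in context" means in practice: the demonstrations are placed in the prompt text and no weights are updated. The `Input:`/`Output:` template and the function name are assumptions chosen for illustration, not a format prescribed by the lecture.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a prompt: task description, K demonstrations, then the query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French."
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

few_shot = build_few_shot_prompt(task, examples, "peppermint")      # K = 2
one_shot = build_few_shot_prompt(task, examples[:1], "peppermint")  # K = 1
zero_shot = build_few_shot_prompt(task, [], "peppermint")           # K = 0
```

The same frozen model would receive all three prompts; only the number of demonstrations in the context changes.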
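Difference between in-context learning (recognition) and the earlier adaptation approach
- Pre-training and fine-tuning is adaptation (weights are updated for each task); in-context learning is recognition (the task is inferred from the prompt with no weight update)
Datasets and metrics for GPT-3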
- Perplexity (Language Modeling)
- LAMBADA (Predict last word)
- HellaSwag (ending)
- StoryCloze (ending)
- Question answering: Natural Questions, WebQuestions, TriviaQA
- Translation (translating into English works better than translating from English)
- Winograd-style tasks: a reading comprehension test of which word a pronoun refers to
- Common Sense Reasoning: OpenBookQA, PIQA, ARC
- Reading Comprehension (CoQA, QuAC, DROP, RACE, SQuADv2): GPT-3 performs poorly
- SuperGLUE
- Natural Language Inference: ability to understand the relationship between two sentences (GPT-3 performs poorly)
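Since perplexity is the first metric in the list above, a short sketch of how it is computed may help: it is the exponential of the average negative log-likelihood per token. The function name and the toy log-probabilities are assumptions for illustration; a real evaluation would take the log-probabilities from the model.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy case: the model assigns probability 0.25 to each of 8 tokens,
# i.e. it is as uncertain as a uniform choice among 4 options.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
```

Lower is better: a model that assigns probability 1 to every token has perplexity 1.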
Because the training dataset is huge, GPT-3 does not overfit to test data it has seen before: the performance drop when seen samples are removed from the test set is small.
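The "remove seen samples from the test set" check can be sketched as an n-gram overlap filter. This is a simplified illustration, not the exact procedure used for GPT-3: the whitespace tokenization, the default `n=3`, and the function names are all assumptions.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def split_clean_and_seen(test_samples, training_corpus, n=3):
    """Separate test samples that share any n-gram with the training data."""
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc, n)
    clean, seen = [], []
    for sample in test_samples:
        if ngrams(sample, n) & train_ngrams:
            seen.append(sample)
        else:
            clean.append(sample)
    return clean, seen


training_corpus = ["the quick brown fox jumps over the lazy dog"]
test_samples = ["a quick brown fox appeared", "large models learn in context"]
clean, seen = split_clean_and_seen(test_samples, training_corpus)
```

Comparing the score on `clean` with the score on the full test set then gives the performance drop mentioned above.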

Arithmetic
Is GPT-3 Just Memorizing Tables? No!
Word Manipulation
- Cycle letters in word (CL)
- Anagrams of all but the first and last k letters (A1, A2 for k=1,2)
- Random insertion in word (RI)
- Reversed words (RW)
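The four word-manipulation tasks above can be generated with a few lines of Python. This is a sketch of how such corrupted inputs could be produced; the function names and the set of inserted symbols are assumptions, and the model's job would be to recover the original word.

```python
import random


def cycle_letters(word, shift):
    """CL: rotate the letters of the word by `shift` positions."""
    if not word:
        return word
    shift %= len(word)
    return word[shift:] + word[:shift]


def anagram_inner(word, k, rng):
    """A1/A2: shuffle all letters except the first and last k."""
    if len(word) <= 2 * k:
        return word
    middle = list(word[k:len(word) - k])
    rng.shuffle(middle)
    return word[:k] + "".join(middle) + word[len(word) - k:]


def random_insertion(word, rng, symbols=" .,!-"):
    """RI: insert one random symbol between each pair of adjacent letters."""
    if not word:
        return word
    out = []
    for ch in word[:-1]:
        out.append(ch)
        out.append(rng.choice(symbols))
    out.append(word[-1])
    return "".join(out)


def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
cl = cycle_letters("inevitably", 8)
a1 = anagram_inner("corruption", 1, rng)
a2 = anagram_inner("corruption", 2, rng)
ri = random_insertion("succession", rng)
rw = reverse_word("objects")
```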

Qualitative Tasks
- News article generation
Limitations
- Limited common sense
- Poor one-shot and zero-shot performance
- Lack of grounding

Commonsense Reasoning
Knowledge Graph, Knowledge Base
- ConceptNet (semantic relations) + ATOMIC (if-then knowledge) = ATOMIC 2020
- COMET (Commonsense Transformers), VisualCOMET
Benchmarks
- WinoGrande (WG), a large-scale Winograd Schema Challenge dataset
- Choice of Plausible Alternatives (COPA): Commonsense causality, Visual COPA
- CosmosQA: commonsense machine comprehension that also requires reasoning with background knowledge
- CommonsenseQA(CSQA)
- Social Intelligence QA (SocialIQA): questions about social events drawn from ATOMIC
Symbolic Knowledge Distillation (Symbolic Distillation)
From General Language Models to Commonsense Models
Machine-to-corpus-to-machine pipeline does not require human-authored knowledge
- Loose teacher with critic model
Naive knowledge distillation trains the student model to match the teacher's probabilities over the whole output space, which is intractable when the outputs are text sequences.
- Distill a symbolic knowledge graph
- Distill only a selective aspect of the teacher model
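The machine-to-corpus step with a loose teacher and a critic can be sketched as a generate-then-filter loop. Everything here is a toy stand-in: `toy_teacher` and `toy_critic` are hypothetical placeholders for a large language model and a trained critic, the threshold value is arbitrary, and the `xEffect` relation is used only as an illustrative ATOMIC-style label.

```python
def distill_corpus(events, teacher, critic, threshold=0.5):
    """Generate candidate triples with the teacher, keep those the critic accepts."""
    corpus = []
    for event in events:
        for relation, inference in teacher(event):
            if critic(event, relation, inference) >= threshold:
                corpus.append((event, relation, inference))
    return corpus


def toy_teacher(event):
    """Hypothetical loose teacher: returns some good and some bad candidates."""
    return [("xEffect", "gets tired"), ("xEffect", "")]


def toy_critic(event, relation, inference):
    """Hypothetical critic: scores how acceptable a candidate triple is."""
    return 0.9 if inference else 0.1


corpus = distill_corpus(["PersonX runs a marathon"], toy_teacher, toy_critic)
```

The filtered `corpus` is the symbolic knowledge graph; a smaller student model would then be trained on these triples as ordinary text, instead of being trained to match the teacher's probabilities.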
LLM as Clinical Reasoner
Medical Chain-of-Thought Distillation
Dialogue Systems
Task-oriented Dialogue System, Open-domain Dialogue System
- Persona-Grounded Dialogue
- EmpatheticDialogues (benchmark)
- Long Term Conversation
- Multi Session Chat
- BlenderBot (search)
- Multi-modal Chatbot
Seonglae Cho