YSU NLP Final

Creator
Seonglae Cho
Created
2024 May 29 4:26
Edited
2024 Jun 19 4:49

Multiple choice and short answer, seminar content excluded

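One simple question from the last topic, Tool learning
  • BERT
    Cannot generate sequences; it converts the input into a single fixed-length vector via the [CLS] token (classification)
  • Unlike GPT-2 pre-training, it is feature-based: the embedding weights were left as they were and only the additional layers were trained
The professor tends to re-ask questions that many people got wrong on the midterm, so get the concepts clear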

Expected exam questions

  • Probably one open-ended question asking for an idea that uses LLMs
 

GPT2

Paradigm shift

Word Vectors + Task specific architectures → Multi layer RNN → Pre-trained transformers + Fine-tuning
  • Limitations of Pre-training ➔ Fine-Tuning: fine-tuning per task ends up with many “copies” of the same model
  • The model only overfits to the training distribution and does not work properly on out-of-distribution samples
  • High benchmark performance means the dataset was solved, not the task (spurious correlation)
  1. Scaling up
    Scaling Law
  1. In-context Learning
    Meta Learning
    (in-context learning is in charge of the inner loop, while SGD is responsible for the outer loop)
      • Larger Models Learn Better In-Context
      In-context learning is based on few-shot demonstrations given in the prompt
      Unlike fine-tuning, the model is only trained once for all downstream tasks.
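
A minimal sketch of how a few-shot prompt for in-context learning might be assembled. The function name `build_few_shot_prompt` and the "Input:/Output:" layout are assumptions made for illustration, not a format prescribed by these notes; the point is only that the demonstrations live in the prompt and no weights are updated.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Format K demonstrations plus one query into a single prompt string."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final item leaves the output blank for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical demonstrations for a translation task (K = 2).
demos = [("cheese", "fromage"), ("sea otter", "loutre de mer")]
prompt = build_few_shot_prompt("Translate English to French.", demos, "peppermint")
print(prompt)
```

Passing an empty `examples` list gives the zero-shot case, and a single pair gives the one-shot case.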

Difference between In-context Learning (Recognition) and the earlier Adaptation

(Pre-training and Fine-tuning): Adaptation
 
 
 

Dataset or metrics for GPT3

  • Perplexity (Language Modeling)
  • LAMBADA (Predict last word)
  • HellaSwag (ending)
  • StoryCloze (ending)
  • Natural Questions, WebQuestions, TriviaQA
  • Translation Task (better when translating into English than from English)
  • Winograd-Style Tasks: determine which word a pronoun refers to
  • Common Sense Reasoning: OpenBookQA, PIQA, ARC
  • Reading Comprehension (CoQA, QuAC, DROP, RACE, SQuADv2): GPT-3 performs poorly
  • SuperGLUE
  • Natural Language Inference: ability to understand the relationship between two sentences (GPT-3 performs poorly)
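
For the perplexity metric at the top of this list, a small sketch may help: perplexity is the exponential of the average negative log-likelihood per token. The function name and the toy input are assumptions for illustration; in practice the per-token log-probabilities would come from the language model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy case: every token gets probability 0.25, so the model is as uncertain
# as a uniform choice among 4 options and the perplexity is close to 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```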
Because the training dataset is so large, GPT-3 does not appear to overfit to test data it has already seen: the performance drop when seen samples are removed from the test set is small.
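
One way to picture this check is to split the test set into "seen" and "clean" examples by n-gram overlap with the training text, then compare scores on the two subsets. This is a rough sketch under assumptions: whitespace tokenization, the helper names, and the window size `n` are all illustrative choices, not details taken from these notes.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows (empty if the text is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def split_clean_and_seen(test_examples, training_text, n):
    """Mark a test example as 'seen' if it shares any n-gram with the training text."""
    train_grams = ngrams(training_text.split(), n)
    clean, seen = [], []
    for example in test_examples:
        overlap = ngrams(example.split(), n) & train_grams
        (seen if overlap else clean).append(example)
    return clean, seen


# Toy data with a small window so the overlap is visible.
training_text = "the cat sat on the mat today"
tests = ["the cat sat quietly", "a dog ran far away"]
clean, seen = split_clean_and_seen(tests, training_text, n=3)
print(clean, seen)
```

If accuracy on `clean` is close to accuracy on the full test set, memorized test items are not what drives the result.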

Arithmetic

Is GPT-3 Just Memorizing Tables? No!

Word Manipulation

  1. Cycle letters in word (CL)
  1. Anagrams of all but the first and last k letters (A1, A2 for k=1,2)
  1. Random insertion in word (RI)
  1. Reversed words (RW)
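
The word manipulation tasks listed above can be sketched as string transformations: the model is shown the scrambled word and has to recover the original. The function names, the symbol set used for random insertion, and the seeded `random.Random` are assumptions for illustration.

```python
import random


def cycle_letters(word, shift):
    """CL: rotate the letters of a non-empty word by `shift` positions."""
    shift %= len(word)
    return word[shift:] + word[:shift]


def anagram_keep_ends(word, k, rng):
    """A1/A2: shuffle all letters except the first and last k (k >= 1)."""
    if k < 1 or len(word) <= 2 * k:
        return word
    middle = list(word[k:-k])
    rng.shuffle(middle)
    return word[:k] + "".join(middle) + word[-k:]


def random_insertion(word, rng, symbols=" .,!?-"):
    """RI: insert one random symbol between each pair of adjacent letters."""
    if not word:
        return word
    out = []
    for ch in word[:-1]:
        out.append(ch)
        out.append(rng.choice(symbols))
    out.append(word[-1])
    return "".join(out)


def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(cycle_letters("inevitably", 3))
print(anagram_keep_ends("corruption", 1, rng))
print(random_insertion("succession", rng))
print(reverse_word("objects"))
```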

Qualitative Tasks

  • News article generation

Limitation

  • Limited common sense
  • Poor one-shot and zero-shot performance
  • Lack of grounding

Commonsense Reasoning

Knowledge Graph, Knowledge Base
  • ConceptNet semantic relations + ATOMIC if-then = ATOMIC 2020
  • COMET Commonsense Transformers, VisualCOMET

Benchmark

  • WinoGrande (WG), a large-scale Winograd Schema Challenge-style benchmark
  • Choice of Plausible Alternatives (COPA): Commonsense causality, Visual COPA
  • CosmosQA: Commonsense Machine Comprehension but also reasoning with background knowledge
  • CommonsenseQA(CSQA)
  • Social Intelligence QA (SocialIQA) about social events from ATOMIC

Symbolic Knowledge Distillation
Symbolic Distillation

From General Language Models to Commonsense Models
Machine-to-corpus-to-machine pipeline does not require human-authored knowledge
  • Loose teacher with critic model
Naive knowledge distillation trains the student model to match the teacher's probabilities over the whole output space, which is intractable for text generation.
  • Distill a symbolic knowledge graph
  • Distill only a selective aspect of the teacher model
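
The machine-to-corpus-to-machine idea above can be sketched as a filter loop: a loose teacher over-generates candidate statements, a critic keeps only the plausible ones, and the surviving corpus is what the student is later trained on. Everything here is a toy assumption: `toy_teacher` and `toy_critic` are hypothetical stand-ins for real models, and the threshold value is arbitrary.

```python
def distill_symbolic_knowledge(prompts, teacher_generate, critic_score, threshold):
    """Return (prompt, candidate) pairs that the critic accepts."""
    corpus = []
    for prompt in prompts:
        for candidate in teacher_generate(prompt):
            if critic_score(prompt, candidate) >= threshold:
                corpus.append((prompt, candidate))
    return corpus


# Toy stand-ins so the sketch runs; a real teacher and critic would be models.
def toy_teacher(prompt):
    return [f"{prompt} feels tired", f"{prompt} purple seven"]


def toy_critic(prompt, candidate):
    return 0.9 if "feels" in candidate else 0.1


corpus = distill_symbolic_knowledge(
    ["X runs a marathon, so X"], toy_teacher, toy_critic, threshold=0.5
)
print(corpus)
```

Because the student only sees the filtered text corpus, it never has to match the teacher's full output distribution.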

LLM as Clinical Reasoner

Medical Chain-of-Thought Distillation

Dialogue Systems

Task-oriented Dialogue System, Open-domain Dialogue System
  • Persona-Grounded Dialogue
  • EmpatheticDialogues (benchmark)
  • Long Term Conversation
  • Multi Session Chat
  • BlenderBot (search)
  • Multi-modal Chatbot
     
     
     

     
