The basic structure is that the 4 Python files in the root folder launch the various batch jobs.
A brief outline of how the source is organized should help in understanding the resrer project.
index_ctx.py
dataset- Step 0. Embedding passages and indexing them into the vector DB (Milvus)
hf_data.py
upload- Step 1. Creating and uploading the QA training dataset using API prompting
train.py
train- Step 3. Training from the base model and uploading the trained model to Huggingface
qa_pipeline.py
dataset- Step 4. Running the QA pipeline, evaluating it, and uploading the summary
use_tiktoken.py
- split - r1
- split raw Wikipedia documents into passages
- count - r2
- for debugging
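
The split and count steps above can be sketched roughly as follows. This is a minimal illustration, not the code in use_tiktoken.py: the function names, the 100-token passage size, and the toy whitespace tokenizer are all assumptions made for the example. With tiktoken installed, the encode and decode arguments could be taken from an encoding object, for example enc = tiktoken.get_encoding("cl100k_base") and then enc.encode and enc.decode.

```python
from typing import Callable, Dict, List, Sequence


def split_into_passages(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[Sequence[int]], str],
    max_tokens: int = 100,  # assumed passage size, not taken from the project
) -> List[str]:
    """Split text into consecutive passages of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = encode(text)
    # Walk the token list in fixed-size windows and decode each window back to text.
    return [
        decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


def count_tokens(text: str, encode: Callable[[str], List[int]]) -> int:
    """Return the number of tokens in text (handy when debugging passage sizes)."""
    return len(encode(text))


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer so the sketch runs without tiktoken."""

    def __init__(self) -> None:
        self.word_to_id: Dict[str, int] = {}
        self.id_to_word: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            # Assign a new id the first time a word is seen.
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.id_to_word)
                self.id_to_word.append(word)
            ids.append(self.word_to_id[word])
        return ids

    def decode(self, ids: Sequence[int]) -> str:
        return " ".join(self.id_to_word[i] for i in ids)


if __name__ == "__main__":
    tok = WhitespaceTokenizer()
    document = "Seoul is the capital of South Korea and its largest city"
    for passage in split_into_passages(document, tok.encode, tok.decode, max_tokens=4):
        print(count_tokens(passage, tok.encode), passage)
```

Passing the tokenizer in as a pair of functions keeps the splitting logic independent of any particular tokenizer, so the same sketch works with a real subword encoding or with the toy one shown here.
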
Task
- Indexing Wikipedia data for retrieval
- Creating the (passages, summary) dataset
- Training the Summarizer model
- Running and evaluating the QA pipeline
Seonglae Cho