ReSRer 11/09 research notes


apt update
apt install zsh git-lfs libssl-dev net-tools iputils-ping gcc -y
source ~/.rye/env
source ~/.cargo/env

git clone https://github.com/seonglae/ReSRer

TEI

Rust

rye

Zsh Init

path


~/.cargo/bin

env


~/.rye/env

torch

Issue: TypeError: issubclass() arg 1 must be a class

Updated 2023 Nov 29 15:1


# CUDA 11.8
pip uninstall torch torch-tensorrt torchtext torchvision torchdata
pip3 install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install .
# for train
# pip install packaging flash-attn --no-build-isolation
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
# here
pip install pydantic==1.10.11
# or for training? (optional if error)
pip install --force-reinstall typing-extensions==4.5.0
pip uninstall deepspeed
pip install deepspeed
pip uninstall -y apex


# 2.0, CUDA 12.1
pip uninstall torchtext torchvision torchdata
pip3 install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install .
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
# if this errors (undefined Transformer Engine symbol), uninstall Transformer Engine, because only FlashAttention-2 is needed

tgi
Anaconda Get Started
TGI


PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh
conda create -n text-generation-inference python=3.9
conda activate text-generation-inference

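Issues

CUDA not working inside Docker was fixed by a restart.

As an aside, Docker Desktop has gained more features such as containerd, Wasm, and Kubernetes.
Need to check whether GPUs work under Kubernetes.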

[Bug] `multilingual-e5-small` missing token_type_ids
Updated 2023 Sep 20 18:9

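e5 issue: the tokenizer is different, so I added code for it.
I was using e5 with 4096,
but because of a length limit from the May 2020 Windows 10 update, I truncate the input to 3000.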
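
The truncation step above could look roughly like the following. This is a minimal sketch, assuming the 3000 limit is counted in characters (the note does not say whether it means characters or tokens); `MAX_CHARS`, `truncate_text`, and `truncate_batch` are hypothetical names, not taken from the ReSRer code.

```python
from typing import Iterable, List

# Assumption: the limit of 3000 is measured in characters, not tokens.
MAX_CHARS = 3000


def truncate_text(text: str, limit: int = MAX_CHARS) -> str:
    """Return `text` cut to at most `limit` characters."""
    return text[:limit]


def truncate_batch(texts: Iterable[str], limit: int = MAX_CHARS) -> List[str]:
    """Truncate every text in a batch before it is sent to the embedder."""
    return [truncate_text(t, limit) for t in texts]


if __name__ == "__main__":
    batch = ["short passage", "x" * 5000]
    print([len(t) for t in truncate_batch(batch)])
```

If the limit is really in tokens, the same shape applies, with the slice done on the tokenizer output instead of the raw string.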

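vessl is complete garbage; no idea why we signed a contract for something like this…

Got fed up, so for now I am running it locally on the Milvus DB instance, since for DPR only the embeddings need to be moved and that works on CPU.

After moving 1,685,000 entries, it died from running out of memory.

Maybe because they keep piling up in a queue??

Reduced 5000 to 1000, but then the send time should shrink accordingly too (not exactly; the time dropped by about 2.5x, so it did seem to go down roughly 2x), and memory still grows.

Also, the ids should overlap, yet it keeps growing. Hmm, what is going on?

Still, glad to have properly inserted into Milvus for the first time. The reason the schema was not being created properly was that there is a separate create_collection_by_schema function.

It is not a Milvus problem: monitoring shows that memory keeps increasing as the stream comes in.

Why does memory keep growing even though it is a stream? dataset bat

The problem was converting the returned value into a list.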
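
The difference between the two patterns can be sketched as below. This is an illustration only, not the actual ReSRer code: `fake_stream` stands in for the streamed dataset and `iter_batches` is a hypothetical helper. Calling `list(...)` on the whole stream keeps every row alive at once, while pulling fixed-size batches with `itertools.islice` holds only one batch at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` items without materializing the stream."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))  # only this batch is held in memory
        if not batch:
            return
        yield batch


def fake_stream(n: int) -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for a streamed dataset of n rows."""
    for i in range(n):
        yield {"id": i}


# Memory-hungry pattern: rows = list(fake_stream(n)) keeps all n rows alive.
# Streaming pattern: each batch can be garbage-collected after it is sent.
total = 0
for batch in iter_batches(fake_stream(2500), 1000):
    total += len(batch)  # stand-in for inserting the batch into the vector DB
print(total)
```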

Index not being created

Caused by passing the parameter dict incorrectly
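
For reference, a sketch of the dict shape that a Milvus index expects. The index type, metric, and `nlist` value here are placeholders, since the notes do not record which ones were used; `build_index_params` and `check_index_params` are hypothetical helpers, and the field name `embedding` in the comment is an assumption.

```python
from typing import Any, Dict

REQUIRED_KEYS = ("index_type", "metric_type", "params")


def build_index_params(index_type: str = "IVF_FLAT",
                       metric_type: str = "IP",
                       nlist: int = 1024) -> Dict[str, Any]:
    """Build an index parameter dict with the three top-level keys."""
    return {
        "index_type": index_type,
        "metric_type": metric_type,
        "params": {"nlist": nlist},  # index-specific settings go in a nested dict
    }


def check_index_params(index_params: Dict[str, Any]) -> Dict[str, Any]:
    """Fail early if the dict is flat or is missing a top-level key."""
    missing = [k for k in REQUIRED_KEYS if k not in index_params]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(index_params["params"], dict):
        raise TypeError("'params' must be a nested dict")
    return index_params


index_params = check_index_params(build_index_params())
# With pymilvus this would then be passed along the lines of:
#   collection.create_index(field_name="embedding", index_params=index_params)
# where "embedding" is an assumed field name.
print(index_params)
```

A common way to get this wrong is to put `nlist` at the top level instead of inside `params`; the check above rejects a dict with no `params` key at all.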

pm2 plus

Adopting it is the key

CUDA vessl error

cannot import name '_get_privateuse1_backend_name'


>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/ReSRer/.venv/lib/python3.11/site-packages/torch/__init__.py", line 1119, in <module>
    from ._tensor import Tensor
  File "/root/ReSRer/.venv/lib/python3.11/site-packages/torch/_tensor.py", line 12, in <module>
    import torch.utils.hooks as hooks
  File "/root/ReSRer/.venv/lib/python3.11/site-packages/torch/utils/__init__.py", line 6, in <module>
    from .backend_registration import rename_privateuse1_backend, generate_methods_for_privateuse1_backend
  File "/root/ReSRer/.venv/lib/python3.11/site-packages/torch/utils/backend_registration.py", line 2, in <module>
    from torch._C import _rename_privateuse1_backend, _get_privateuse1_backend_name
ImportError: cannot import name '_get_privateuse1_backend_name' from 'torch._C' (/root/ReSRer/.venv/lib/python3.11/site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so)
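
One possible cause, which is an assumption and not something confirmed in these notes, is that the Python files of torch and its compiled `_C` extension come from different installs that are both visible to the interpreter. The sketch below only lists every directory on the search path that contains a given package so duplicates can be spotted; `find_package_copies` is a hypothetical helper.

```python
import sys
from pathlib import Path
from typing import Iterable, List, Optional


def find_package_copies(name: str,
                        search_paths: Optional[Iterable[str]] = None) -> List[str]:
    """Return every distinct directory on the search path that holds package `name`."""
    paths = sys.path if search_paths is None else search_paths
    found: List[str] = []
    seen = set()
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        pkg_dir = (Path(entry or ".") / name).resolve()
        if pkg_dir in seen:
            continue
        seen.add(pkg_dir)
        if (pkg_dir / "__init__.py").is_file():
            found.append(str(pkg_dir))
    return found


copies = find_package_copies("torch")
# More than one entry suggests two installs may be shadowing each other.
print(copies)
```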

Just used the root Python instead.

https://app.pm2.io/bucket/654cc9d8a9a22dc4e120b42c/backend/overview/servers

https://console.cloud.google.com/compute/instancesDetail/zones/asia-southeast1-b/instances/vector-db?project=text-project-396607&tab=monitoring&pageState=("duration":("groupValue":"PT1H","customValue":null))

http://34.126.139.238:8000/#/collections

A quota increase request has been submitted.


Token count {
'~1024': 5320881, 
'1024~2048': 693911, 
'2048~4096': 300935, 
'4096~8192': 106221, 
'8192~16384': 30611, 
'16384~32768': 4812, 
'32768~65536': 1253, 
'65536~128000': 46, 
'128000~': 0
}
Text count {
'0~1024': 2751539, 
'1024~2048': 1310778, 
'2048~4096': 1179150, 
'4096~8192': 722101, 
'8192~16384': 329062, 
'16384~32768': 121237, 
'32768~65536': 36894, 
'65536~': 7909
}
Token percent {
'~1024': '82.38%', 
'1024~2048': '10.74%', 
'2048~4096': '4.66%', 
'4096~8192': '1.64%', 
'8192~16384': '0.47%', 
'16384~32768': '0.07%', 
'32768~65536': '0.02%', 
'65536~128000': '0.00%', 
'128000~': '0.00%'
}
Text percent {
'0~1024': '42.60%', 
'1024~2048': '20.29%', 
'2048~4096': '18.26%', 
'4096~8192': '11.18%', 
'8192~16384': '5.09%', 
'16384~32768': '1.88%', 
'32768~65536': '0.57%', 
'65536~': '0.12%'
}
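
A minimal sketch of how bucketed counts and percentages in this shape could be produced. The script that generated the numbers is not part of these notes, so the function names are hypothetical and the bucket boundaries are assumed to be half-open, so that a length of exactly 1024 falls into `1024~2048`.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Sequence


def make_labels(edges: Sequence[int]) -> List[str]:
    """Labels such as '~1024', '1024~2048', ..., '128000~' for sorted edges."""
    labels = [f"~{edges[0]}"]
    labels += [f"{lo}~{hi}" for lo, hi in zip(edges, edges[1:])]
    labels.append(f"{edges[-1]}~")
    return labels


def length_histogram(lengths: Iterable[int], edges: Sequence[int]) -> Dict[str, int]:
    """Count how many lengths fall into each bucket (assumed half-open: lo <= n < hi)."""
    labels = make_labels(edges)
    counts = dict.fromkeys(labels, 0)
    for n in lengths:
        counts[labels[bisect_right(edges, n)]] += 1
    return counts


def to_percent(counts: Dict[str, int]) -> Dict[str, str]:
    """Convert counts into percentage strings with two decimals."""
    total = sum(counts.values())
    return {k: f"{(100 * v / total if total else 0.0):.2f}%" for k, v in counts.items()}


if __name__ == "__main__":
    token_edges = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 128000]
    counts = length_histogram([10, 1024, 1500, 5000], token_edges)
    print(counts)
    print(to_percent(counts))
```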


Token count {'~128': 2625007, '128~256': 18370607, '256~512': 19066, '512~1024': 571, '1024~2048': 47, '2048~4096': 2, '4096~8192': 0, '8192~16384': 0, '16384~32768': 0, '32768~65536': 0, '65536~128000': 0, '128000~': 0}
Text count {'~512': 86519, '512~1024': 20927180, '1024~2048': 1557, '2048~4096': 43, '4096~8192': 1, '8192~16384': 0, '16384~32768': 0, '32768~65536': 0, '65536~': 0}
Token percent {'~128': '12.49%', '128~256': '87.42%', '256~512': '0.09%', '512~1024': '0.00%', '1024~2048': '0.00%', '2048~4096': '0.00%', '4096~8192': '0.00%', '8192~16384': '0.00%', '16384~32768': '0.00%', '32768~65536': '0.00%', '65536~128000': '0.00%', '128000~': '0.00%'}
Text percent {'~512': '0.41%', '512~1024': '99.58%', '1024~2048': '0.01%', '2048~4096': '0.00%', '4096~8192': '0.00%', '8192~16384': '0.00%', '16384~32768': '0.00%', '32768~65536': '0.00%', '65536~': '0.00%'}