Huggingface Datasets Streaming

Creator
Seonglae Cho
Created
2023 Oct 19 18:1
Edited
2025 Apr 15 22:36

Huggingface Batch Processing

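A side effect of streaming is that there is no tqdm progress bar, since an iterable dataset has no known length.
HTTP request timeout errors sometimes occur, so streaming is not recommended for long-running jobs.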
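If a long job still has to stream, one way to soften the timeout problem is to reopen the stream and continue from the number of items already consumed. The sketch below is only an illustration: `iterate_with_retry` and `make_stream` are hypothetical names, not part of the `datasets` library, and the default exception types are an assumption that should be adjusted to whatever error the job actually raises.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def iterate_with_retry(
    make_stream: Callable[[int], Iterable[T]],
    max_retries: int = 3,
    backoff_seconds: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> Iterator[T]:
    """Yield items from make_stream(offset), reopening the stream after transient errors.

    make_stream is a hypothetical factory: it receives the number of items
    already yielded and must return an iterable that starts after them.
    """
    consumed = 0  # items already handed to the caller
    failures = 0  # consecutive failures since the last successful item
    while True:
        try:
            for item in make_stream(consumed):
                consumed += 1
                failures = 0
                yield item
            return  # stream finished normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up after too many consecutive failures
            time.sleep(backoff_seconds * failures)  # simple linear backoff


# Possible use with a streamed dataset (assumption: skip() is cheap enough here):
# make_stream = lambda offset: load_dataset(dataset_id, split='train', streaming=True).skip(offset)
# for example in iterate_with_retry(make_stream):
#     ...
```

Skipping still reads through the earlier items, so a restart late in a large dataset is slow, and the order only lines up if the stream is deterministic (same seed for any shuffle).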
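import torch
from datasets import load_dataset

# streaming=True already returns an IterableDataset, so to_iterable_dataset() is not called here;
# to_iterable_dataset(num_shards=128) applies to a regular Dataset loaded without streaming.
dataset = load_dataset(dataset_id, split='train', streaming=True)
shuffled = dataset.shuffle(seed=42, buffer_size=100_000)
dataloader = torch.utils.data.DataLoader(shuffled, num_workers=4)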
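The `buffer_size` argument matters because a stream cannot be shuffled globally: only a window of examples is held in memory and sampled from. The following is a simplified pure-Python sketch of that idea, not the library's actual implementation (which also shuffles shard order); `buffered_shuffle` is a hypothetical name used only for illustration.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 42) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is yielded yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]   # emit a random buffered item
        buffer[idx] = item  # replace it with the incoming item
    rng.shuffle(buffer)     # drain what is left in random order
    yield from buffer


shuffled_ids = list(buffered_shuffle(range(20), buffer_size=5))
```

An item can appear only a bounded distance earlier than its original position (less than `buffer_size`), so a small buffer gives a weak shuffle, while a large one such as `100_000` costs more memory and a longer fill phase before the first example arrives.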
