Hugging Face Batch Processing
Side effect: no tqdm progress bar is shown.
HTTP request timeout errors sometimes occur, so this is not recommended for long-running jobs.
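import torch
from datasets import load_dataset

# streaming=True already returns an IterableDataset; to_iterable_dataset(num_shards=128) exists only on a non-streaming Dataset
dataset = load_dataset(dataset_id, split='train', streaming=True)
shuffled = dataset.shuffle(seed=42, buffer_size=100_000)  # buffer-based approximate shuffle
dataloader = torch.utils.data.DataLoader(shuffled, num_workers=4)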
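
The shuffle(seed=42, buffer_size=100_000) call above does not shuffle the whole dataset; it shuffles approximately, through a fixed-size buffer. Below is a minimal pure-Python sketch of that idea, to show why a larger buffer_size mixes better. The name buffered_shuffle is hypothetical and this is not the library's actual implementation, only an illustration under that assumption.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream with a fixed-size buffer (illustrative sketch)."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left in the buffer, in random order.
    rng.shuffle(buffer)
    yield from buffer


# Every element appears exactly once; only the order changes.
out = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
```

With buffer_size=1 the stream comes out in its original order, and with a buffer at least as large as the dataset the result is a full shuffle, so buffer_size trades memory for shuffle quality.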