Parquet
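import torch
from datasets import load_dataset

# Load as a map-style Dataset: to_iterable_dataset() is not defined on a dataset loaded with streaming=True
dataset = load_dataset(dataset_id, split='train')
iterable = dataset.to_iterable_dataset(num_shards=128)  # 128 shards that DataLoader workers can split among themselves
shuffled = iterable.shuffle(seed=42, buffer_size=100_000)  # approximate shuffle through a 100,000-example buffer
dataloader = torch.utils.data.DataLoader(shuffled, num_workers=4)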
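The buffer_size argument above controls how far an example can move from its original position. A minimal pure-Python sketch of the buffer-based shuffling idea follows; it is a simplified illustration under my own assumptions, not the library's actual implementation, and the name buffered_shuffle is hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (hypothetical helper)."""
    rng = random.Random(seed)
    buffer: list[T] = []
    for item in stream:
        # Fill the buffer before emitting anything.
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit the remaining buffered elements in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

With a small buffer, each output can only come from the few elements read so far, so the result stays close to the input order. A larger buffer gives a more thorough shuffle at the cost of memory.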
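import datasets
from datasets import load_dataset, Dataset

# datasets.config.IN_MEMORY_MAX_SIZE =   (size limit in bytes for loading a dataset into memory; value not given in this note)
Dataset.from_dict()  # pass a dict mapping column names to lists of values
dataset.train_test_split(test_size=0.0005, seed=2357, shuffle=True)  # returns a DatasetDict with 'train' and 'test' splits
dataset.select(range(100))  # keep only the first 100 rows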
Hugging Face Datasets Usage
Create a dataset
https://huggingface.co/docs/datasets/create_dataset
Community Eval
Community Evals: Because we're done trusting black-box leaderboards over the community
https://huggingface.co/blog/community-evals

Seonglae Cho