The Pile

Creator

Seonglae Cho

Created

2022 Jul 12 17:3

Editor

Seonglae Cho

Edited

2025 Mar 2 15:27

Refs

Size

TB

Multilingual

Multilingual

The two most comparable publicly available datasets to the Pile are CC-100 and C4/mC4. C4 is comparably sized to the Pile, while mC4 and CC-100 are larger, multilingual datasets.

The Pile is an 825 GiB diverse, open-source language modelling dataset that combines 22 smaller, high-quality datasets. The Pile is hosted by the Eye.
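The 825 GiB figure is a binary unit, so it does not map directly onto decimal GB or TB. The sketch below is only a unit conversion; the helper name `gib_to_units` is made up for this note and is not part of any Pile tooling.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)


def gib_to_units(size_gib: float) -> dict:
    """Convert a size in GiB to bytes, decimal GB/TB, and binary TiB."""
    size_bytes = size_gib * GIB
    return {
        "bytes": size_bytes,
        "GB": size_bytes / 10**9,   # decimal gigabytes
        "TB": size_bytes / 10**12,  # decimal terabytes
        "TiB": size_gib / 1024,     # binary tebibytes
    }


pile = gib_to_units(825)
print(f"{pile['GB']:.1f} GB, {pile['TB']:.3f} TB, {pile['TiB']:.3f} TiB")
# prints: 885.8 GB, 0.886 TB, 0.806 TiB
```

So the Pile is a little under 0.9 TB in decimal units.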

https://pile.eleuther.ai/

Small subset of the Pile:

NeelNanda/pile-10k · Datasets at Hugging Face

https://huggingface.co/datasets/NeelNanda/pile-10k
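A minimal sketch of pulling this subset with the Hugging Face `datasets` library and summarising it. It assumes the dataset exposes a `train` split and a `text` column, which is worth checking on the dataset page. `load_pile_10k` and `text_stats` are hypothetical helper names introduced here.

```python
from typing import Iterable, Mapping


def load_pile_10k():
    """Download the subset from the Hugging Face Hub (needs network access).

    Assumes a "train" split exists; check the dataset card if this fails.
    """
    from datasets import load_dataset  # lazy import: only needed for the download

    return load_dataset("NeelNanda/pile-10k", split="train")


def text_stats(records: Iterable[Mapping], field: str = "text") -> dict:
    """Count documents and compute mean character length of one text field."""
    lengths = [len(record[field]) for record in records]
    mean_chars = sum(lengths) / len(lengths) if lengths else 0.0
    return {"documents": len(lengths), "mean_chars": mean_chars}


# Offline demonstration on two stand-in records (not real Pile data).
sample = [{"text": "abc"}, {"text": "a"}]
print(text_stats(sample))
```

With network access, `text_stats(load_pile_10k())` should give the same summary for the real subset, since iterating a `datasets.Dataset` yields one dict per row.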
