The Pile

Creator: Seonglae Cho
Created: 2022 Jul 12 17:3
Editor: Seonglae Cho
Edited: 2025 Mar 2 15:27
Size: TB
Multilingual: Multilingual
The two most comparable publicly available datasets to The Pile are CC-100 and C4/mC4. C4 is comparably sized to The Pile, while mC4 and CC-100 are larger, multilingual datasets.

The Pile
The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together. The Pile is hosted by the Eye.
https://pile.eleuther.ai/
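
The 825 GiB figure uses binary prefixes, while the Size field above is given in TB. A minimal sketch of the conversion, using only the 825 GiB value quoted above (the variable and function names are illustrative, not from the source):

```python
# Convert the quoted 825 GiB into decimal GB/TB and binary TiB.
GIB = 2**30   # bytes in one gibibyte (binary prefix)
TIB = 2**40   # bytes in one tebibyte (binary prefix)
GB = 10**9    # bytes in one gigabyte (decimal prefix)
TB = 10**12   # bytes in one terabyte (decimal prefix)

pile_bytes = 825 * GIB


def convert(n_bytes: int) -> dict:
    """Express a byte count in GB, TB and TiB."""
    return {"GB": n_bytes / GB, "TB": n_bytes / TB, "TiB": n_bytes / TIB}


sizes = convert(pile_bytes)
print(f"{pile_bytes:,} bytes")
for unit, value in sizes.items():
    print(f"{value:.3f} {unit}")
```

So 825 GiB is a little under 0.9 TB in decimal units, or about 0.8 TiB.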
Small subset from Neel Nanda
NeelNanda/pile-10k · Datasets at Hugging Face
https://huggingface.co/datasets/NeelNanda/pile-10k
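
A hedged sketch of how the small subset could be inspected with the Hugging Face `datasets` library. The `"train"` split and the `"text"` column name are assumptions about how this dataset is laid out, and `summarize` and `load_pile_10k` are hypothetical helper names. The download helper is defined but not called, so the block runs offline on a tiny stand-in list.

```python
from statistics import mean
from typing import Iterable, Mapping


def summarize(records: Iterable[Mapping[str, str]], text_key: str = "text") -> dict:
    """Count documents and report character-length statistics."""
    lengths = [len(record[text_key]) for record in records]
    return {
        "documents": len(lengths),
        "mean_chars": mean(lengths) if lengths else 0.0,
        "max_chars": max(lengths, default=0),
    }


def load_pile_10k():
    """Fetch the subset from the Hugging Face Hub.

    Needs network access and `pip install datasets`.
    Assumes a "train" split whose rows carry a "text" field.
    """
    from datasets import load_dataset  # lazy import: summarize() works without it

    return load_dataset("NeelNanda/pile-10k", split="train")


# In-memory stand-in so the sketch runs without a download.
demo = [{"text": "hello world"}, {"text": "The Pile"}]
stats = summarize(demo)
print(stats)
```

To run it on the real data, pass the result of `load_pile_10k()` to `summarize` in place of `demo`.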
 
 

Backlinks

Leo Gao, CC-100, SAE Dataset, Monosemanticity

Copyright Seonglae Cho