The two publicly available datasets most comparable to the Pile are CC-100 and C4/mC4. C4 is comparable in size to the Pile, while mC4 and CC-100 are larger, multilingual datasets.

The Pile: an 825 GiB diverse, open-source language modelling dataset that combines 22 smaller, high-quality datasets. The Pile is hosted by the Eye. https://pile.eleuther.ai/

A small version from Neel Nanda: NeelNanda/pile-10k on Hugging Face. https://huggingface.co/datasets/NeelNanda/pile-10k