
Web Dataset

Creator: Seonglae Cho
Created: 2023 Oct 14 10:55
Editor: Seonglae Cho
Edited: 2025 Mar 2 15:22
Refs: Crawling
Web Datasets
FineWeb (Size: GB)
The Pile (Size: TB)
OpenWebText (Size: GB)
RefinedWeb (Size: TB)
RedPajama (Size: TB)
C4 (Size: TB)
CC-100 (Size: TB)
CommonCrawl (Size: PB)
CulturaX (Size: TB)
UpvoteWeb (Size: GB)
WebText (Size: GB)

CS324

stanford-cs324.github.io
https://stanford-cs324.github.io/winter2022/lectures/data/

YouTube Transcript

PleIAs/YouTube-Commons · Datasets at Hugging Face
https://huggingface.co/datasets/PleIAs/YouTube-Commons

Generative Crawling agents

arxiv.org
https://arxiv.org/pdf/2404.12753.pdf
 
 

Copyright Seonglae Cho