Dask

Creator
Seonglae Cho
Created
2021 Jul 10 5:41
Edited
2025 May 29 0:02

Library for parallel computing in Python with interfaces mimicking the pandas DataFrame, NumPy Array, and PySpark.

PyTorch and Dask can be combined to handle large-scale data processing and model training effectively. Dask is a flexible parallel computing library for analytics that scales from a single CPU to thousands of nodes. It lets PyTorch pipelines work with datasets far larger than memory by loading and processing them in parallel, accelerating data preparation.
  • Scalability: Handle datasets larger than your available memory.
  • Parallel Computing: Leverage multiple cores for faster computation.
  • Familiar Syntax: Use a syntax similar to Pandas, minimizing the learning curve.
  • Memory Efficiency: Dask operates on out-of-core arrays, DataFrames, and lists.

CSV

 
 

Architecture

Understanding Dask Architecture: Client, Scheduler, Workers
In A short introduction to Dask for Pandas developers, we looked at how the fundamental components of Dask work. We examined the Dask dataframe and some other data structures that Dask uses internally. Now we'll zoom out and see how the higher-level components of Dask work, and how its client, scheduler, and workers share data and instructions.
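The client/scheduler/worker split can be exercised locally with the `dask.distributed` package: a `Client` started without arguments spins up an in-process scheduler and workers, submits tasks to the scheduler, and gathers results back. A minimal sketch (the `square` function is illustrative):

```python
from dask.distributed import Client

# An in-process scheduler and two workers (a LocalCluster under the hood);
# in production the scheduler and workers would run on separate machines.
client = Client(processes=False, n_workers=2, threads_per_worker=1)

def square(x):
    return x * x

# The client ships tasks to the scheduler, which assigns them to workers.
futures = client.map(square, range(5))
results = client.gather(futures)
print(results)  # [0, 1, 4, 9, 16]
client.close()
```

The same code works against a remote cluster by passing the scheduler's address to `Client(...)`, which is the point of the architecture: the client-side API stays the same regardless of where the workers run.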

Homepage

Dask: Scalable analytics in Python
Dask uses existing Python APIs and data structures to make it easy to switch from NumPy, pandas, and scikit-learn to their Dask-powered equivalents. You don't have to completely rewrite your code or retrain to scale up.
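The API parity is direct enough to show side by side: a `dask.array` mirrors a NumPy array, with an extra `chunks` argument that controls how it is split for parallel, out-of-core computation.

```python
import numpy as np
import dask.array as da

# The NumPy version: one in-memory array.
x_np = np.ones((1000, 1000))

# The Dask equivalent: same API, but split into 2x2 chunks that can be
# computed in parallel and never need to be held in memory all at once.
x_da = da.ones((1000, 1000), chunks=(500, 500))

np_sum = x_np.sum()               # eager
da_sum = x_da.sum().compute()     # lazy graph, then parallel execution
print(np_sum, da_sum)  # 1000000.0 1000000.0
```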

Recommendations