ParaLLM: 1600+ tok/s on a MacBook - William Brown
Recently I’ve been doing some LLM finetuning experiments on my MacBook using MLX, and found that there wasn’t really a great way to take advantage of parallel inference for evaluating outputs locally. For single-stream applications like chat interfaces, this isn’t a big deal – both llama.cpp and MLXServer run quite fast on Apple devices. But if you’re trying to sample a large number of outputs at once, either for evaluating a training run or for “agent-flavored” applications, neither of them really offers a speedup in terms of total throughput (at least from what I’ve been able to test). If you’re on a CUDA machine, you’d use something like vLLM, which is a more “production-grade” solution for achieving high tok/s throughput with parallel requests, but it doesn’t work on a Mac.
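To make the gap concrete, here is a rough sketch (not from the post): the single-stream `mlx_lm` loop below is the standard way to sample locally today, while the commented-out batched call is the kind of interface you'd want instead. The `batch_generate` name and signature, and the model path, are illustrative assumptions.

```python
from mlx_lm import load, generate

# Example model path; any MLX-converted model would do.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

prompts = [f"Write a haiku about the number {i}." for i in range(32)]

# Single-stream: one prompt at a time, so total throughput stays at
# chat-style decoding speed and 32 prompts take roughly 32x as long.
outputs = [generate(model, tokenizer, prompt=p, max_tokens=100) for p in prompts]

# Batched (hypothetical signature): decode all prompts together so that
# total tok/s scales with the batch size rather than staying flat.
# outputs = batch_generate(model, tokenizer, prompts=prompts, max_tokens=100)
```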
https://willcb.com/blog/parallm/