ParaLLM: 1600+ tok/s on a MacBook - William Brown
Recently I’ve been doing some LLM finetuning experiments on my MacBook using MLX, and found that there wasn’t really a great way to take advantage of parallel inference for evaluating outputs locally. For single-stream applications like chat interfaces, this isn’t a big deal – both llama.cpp and MLXServer run quite fast on Apple devices. But if you’re trying to sample a large number of outputs at once, either for evaluating a training run or for “agent-flavored” applications, neither of them really offers a speedup in terms of total throughput (at least from what I’ve been able to test). If you’re on a CUDA machine, you’d use something like vLLM, which is a more “production-grade” solution for achieving high tok/s throughput with parallel requests, but it doesn’t work on a Mac.
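To make the gap concrete, here is a rough sketch (not from the post): the single-stream `mlx_lm` loop below is the standard way to sample locally today, while the commented-out batched call is the kind of interface you'd want instead. The `batch_generate` name and signature, and the model path, are illustrative assumptions.

```python
from mlx_lm import load, generate

# Example model path; any MLX-converted model would do.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

prompts = [f"Write a haiku about the number {i}." for i in range(32)]

# Single-stream: one prompt at a time, so total throughput stays at
# chat-style decoding speed and 32 prompts take roughly 32x as long.
outputs = [generate(model, tokenizer, prompt=p, max_tokens=100) for p in prompts]

# Batched (hypothetical signature): decode all prompts together so that
# total tok/s scales with the batch size rather than staying flat.
# outputs = batch_generate(model, tokenizer, prompts=prompts, max_tokens=100)
```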
https://willcb.com/blog/parallm/