MLX

Creator
Seonglae Cho
Created
2023 Dec 8 11:52
Editor
Seonglae Cho
Edited
2024 Jul 3 15:30

An array framework for Apple silicon

MLX Usages
MLX Distributed Communication
MLX Example
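MLX exposes a NumPy-like Python API with lazy evaluation and unified memory, so the same arrays are usable from both CPU and GPU without explicit copies. A minimal sketch of the basic workflow (illustrative only):

```python
import mlx.core as mx

# Arrays live in unified memory; no host/device transfer is needed.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# Operations build a lazy computation graph; nothing runs yet.
c = mx.matmul(a, b) + 1.0

# mx.eval forces evaluation on the default device (the GPU on Apple silicon).
mx.eval(c)
print(c.shape, c.dtype)
```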

KV Cache

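A key-value cache stores the attention keys and values of already-processed tokens so each decoding step only computes attention for the newest token instead of re-encoding the whole sequence. A minimal illustrative sketch in MLX (not the mlx_lm implementation):

```python
import mlx.core as mx

class KVCache:
    """Append new keys/values along the sequence axis so earlier
    tokens are not recomputed on every decoding step."""

    def __init__(self):
        self.keys = None     # (batch, n_heads, seq_len, head_dim)
        self.values = None

    def update(self, new_k, new_v):
        if self.keys is None:
            self.keys, self.values = new_k, new_v
        else:
            self.keys = mx.concatenate([self.keys, new_k], axis=2)
            self.values = mx.concatenate([self.values, new_v], axis=2)
        return self.keys, self.values
```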
ParaLLM: 1600+ tok/s on a MacBook - William Brown
Recently I’ve been doing some LLM finetuning experiments on my MacBook using MLX, and found that there wasn’t really a great way to take advantage of parallel inference for evaluating outputs locally. For single-stream applications like chat interfaces, this isn’t a big deal – both llama.cpp and MLXServer run quite fast on Apple devices. But if you’re trying to sample a large number of outputs at once, either for evaluating a training run or for “agent-flavored” applications, neither of them really offer a speedup in terms of total throughput (at least from what I’ve been able to test). If you’re on a CUDA machine, you’d use something like vLLM, which is a more “production-grade” solution for achieving high tok/s throughput with parallel requests, but it doesn’t work on a Mac.
https://willcb.com/blog/parallm/
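The throughput gain described above comes from decoding many sequences in lockstep: each forward pass produces logits for the whole batch, and one token is sampled per sequence per step. The sketch below uses a toy stand-in for the model (random logits) purely to show the batched decode loop; it is not ParaLLM's implementation:

```python
import mlx.core as mx

def toy_forward(tokens):
    # Stand-in for a real transformer step with a KV cache;
    # returns random logits of shape (batch, vocab) so the loop runs.
    batch, vocab = tokens.shape[0], 32000
    return mx.random.normal((batch, vocab))

def batched_greedy_decode(prompts, max_new_tokens=8):
    # prompts: (batch, prompt_len) int32 token ids, decoded in parallel.
    tokens = prompts
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens)                       # (batch, vocab)
        next_tokens = mx.argmax(logits, axis=-1)           # one token per sequence
        next_tokens = mx.expand_dims(next_tokens.astype(mx.int32), axis=1)
        tokens = mx.concatenate([tokens, next_tokens], axis=1)
        mx.eval(tokens)                                    # force the lazy graph
    return tokens

out = batched_greedy_decode(mx.zeros((4, 3), dtype=mx.int32))
print(out.shape)  # (4, 11): four sequences advance together each step
```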

Copyright Seonglae Cho