## New features
- 🔥 New models!
  - DeepSeek V2
  - DeepSeek V3 and R1
  - MiniCpm-O 2.6
- 🧮 Imatrix quantization
- ⚙️ Automatic device mapping
- BNB quantization
- Support blockwise FP8 dequantization and FP8 on Metal
- Integrate the llguidance library (@mmoskal)
- Metal PagedAttention
- Many fixes and improvements from contributors!
## Breaking changes
- The Rust device mapping API has changed.
## MSRV
The MSRV of this release is 1.83.0.
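Downstream crates that depend on this release can record the same requirement with Cargo's `rust-version` field, so users on an older toolchain get a clear error up front (a minimal sketch; the package name and version are illustrative):

```toml
# Cargo.toml of a downstream crate (package name is illustrative)
[package]
name = "my-app"
version = "0.1.0"
edition = "2021"
# Cargo refuses to build with a toolchain older than this
rust-version = "1.83"
```

Cargo has enforced `rust-version` at build time since 1.56, turning a confusing mid-compile failure into an immediate, actionable message.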
## What's Changed
- Use CUDA_COMPUTE_CAP if nvidia-smi not found by @EricLBuehler in #944
- fix(docs): fix broken link by @sammcj in #945
- Better diffusion interactive mode by @EricLBuehler in #948
- Implement Imatrix for ISQ by @EricLBuehler in #949
- Support imatrix quantization for vision models by @EricLBuehler in #950
- Perplexity calculations with imatrix by @EricLBuehler in #952
- set minimum rustc version to 1.82 by @mmoskal in #957
- Fix append_sliding_window by @EricLBuehler in #958
- Fix completion api behavior of best_of by @EricLBuehler in #959
- Ensure support for cuda cc 5.3 by @EricLBuehler in #960
- Improve test speeds on Windows by @EricLBuehler in #961
- use llguidance library for constraints (including json schemas) by @mmoskal in #899
- Fix metal fp8 quantization by @EricLBuehler in #962
- Fix example gguf_locally to match chat template requirements by @msk in #966
- Bitsandbytes quantization: loading and kernels by @EricLBuehler in #967
- updated the tokenizers dependency of core to 0.21 by @vkomenda in #975
- Remove outdated binaries mention in the readme by @BafS in #973
- Improve error handling by @cdoko in #974
- Add None check to prevent panic in evict_all_to_cpu in prefix_cacher.rs by @cdoko in #979
- Include start offset for metal bitwise ops by @EricLBuehler in #978
- Fail fast on TcpListener bind errors by @cdoko in #982
- Inplace softmax long-seqlen attention optimizations by @EricLBuehler in #984
- Fix cuda cublaslt when using vllama mask by @EricLBuehler in #985
- Add cross attn quantization for mllama by @EricLBuehler in #987
- fix mistralrs-server ignoring interactive_mode arg by @haricot in #990
- Adding streaming function to mistralrs server. by @Narsil in #986
- Fixes for bnb and more apis in mistralrs-quant by @EricLBuehler in #972
- Support send + sync in loader by @EricLBuehler in #991
- More vllama optimizations by @EricLBuehler in #992
- Update docs by @EricLBuehler in #993
- Use metal autorelease to optimize memory usage by @EricLBuehler in #996
- Partial Fix for Sliding Window Attention by @cdoko in #994
- Only dep on objc when building on metal by @EricLBuehler in #998
- Prefix cacher v2 by @EricLBuehler in #1000
- Add `--cpu` flag to `mistralrs-server` by @cdoko in #997
- Metal PagedAttention support by @EricLBuehler in #1001
- Fix cross attention + prefix cacher v2 support by @EricLBuehler in #1006
- Support for normal cache for mllama, phi3v, qwen2vl by @EricLBuehler in #1007
- Cleaner creation of dummy pa input metadata by @EricLBuehler in #1014
- Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models by @guoqingbao in #1009
- Support device mapping for Paged Attention by @cdoko in #1011
- Prefix cacher fixes by @EricLBuehler in #1018
- More fixes for the prefix cacher by @EricLBuehler in #1019
- Support uqff for idefics3 by @EricLBuehler in #1020
- Prepare for v0.3.5 by @EricLBuehler in #1021
- Cleaner pipeline no prefix cache setting by @EricLBuehler in #1022
- Support uqff load/save for idefics3 by @EricLBuehler in #1023
- Update license for 2025 by @EricLBuehler in #1024
- Implement DeepSeekV2 by @EricLBuehler in #1010
- Use cudarc fork to fix CUDA build on Windows by @EricLBuehler in #1032
- Fix metal paged attn phi3 by @EricLBuehler in #1033
- Use float8 mistralrs_cudarc_fork feature by @EricLBuehler in #1034
- Patch prefix caching to fix incorrect outputs by @EricLBuehler in #1035
- Allocate paged attn cache as empty instead of zeros by @EricLBuehler in #1036
- Remove ug and cudarc transient dep by @EricLBuehler in #1037
- Rename MemoryGpuConfig::Amount->MbAmount by @EricLBuehler in #1038
- CUDA dequant kernels conditional compilation by @EricLBuehler in #1039
- F16 support for mllama, introduce FloatInfo by @EricLBuehler in #1041
- Automatic device mapping support by @EricLBuehler in #1042
- Support automatic device mapping for gguf models by @EricLBuehler in #1044
- Support loading models without ISQ using device map by @EricLBuehler in #1045
- Fix GGUF auto device mapping by @EricLBuehler in #1047
- More efficient loading of safetensors when casting by @EricLBuehler in #1048
- Fix Loading and Running on CPU by @cdoko in #1052
- Work on better device mapping for mllama by @EricLBuehler in #1049
- Mention interactive mode or server port in readme for gguf by @EricLBuehler in #1055
- Fix panic in mistralrs-server by @cdoko in #981
- Include device memory avail in device map err by @EricLBuehler in #1060
- Fix `--cpu` on cuda by @cdoko in #1056
- Improve pagedattn support in mistralrs bench by @EricLBuehler in #1063
- Paged attention support for multi gpu by @EricLBuehler in #1059
- Ergonomic automatic device mapping support by @EricLBuehler in #1054
- Examples for automatic device mapping by @EricLBuehler in #1065
- Fix metal pagedattn half8 vec impl by @EricLBuehler in #1067
- Improve support for GGUF auto device map by @EricLBuehler in #1069
- Fix missing field in idefics3 during loading by @EricLBuehler in #1070
- Fix missing field in idefics3 during loading by @EricLBuehler in #1072
- Fix paged attention for vision models on multiple devices by @cdoko in #1071
- Fixes for idefics3 and idefics2 by @EricLBuehler in #1073
- Improve automatic device map by @EricLBuehler in #1076
- Implement the DeepSeekV3 model (support full DeepSeek R1) by @EricLBuehler in #1077
- Don't print GGUF model metadata when silent=true by @Jeadie in #1079
- Allow `ChatCompletionChunkResponse` (and therefore streaming) to have `Usage` by @Jeadie in #1078
- Support loading blockwise quantized fp8 by @EricLBuehler in #1080
- Implement MiniCpm-O 2.6 by @EricLBuehler in #1074
- Bump version to v0.4.0 by @EricLBuehler in #1081
## New Contributors
- @sammcj made their first contribution in #945
- @mmoskal made their first contribution in #957
- @vkomenda made their first contribution in #975
- @BafS made their first contribution in #973
- @cdoko made their first contribution in #974
- @Narsil made their first contribution in #986
**Full Changelog**: v0.3.4...v0.4.0