## New features
- 🔥 New models!
  - DeepSeek V2
  - DeepSeek V3 and R1
  - MiniCpm-O 2.6
- 🧮 Imatrix quantization
- ⚙️ Automatic device mapping
- BNB quantization
- Support blockwise FP8 dequantization and FP8 on Metal
- Integrate the llguidance library (@mmoskal)
- Metal PagedAttention
- Many fixes and improvements from contributors!
## Breaking changes
- The Rust device mapping API has changed.
## MSRV
The MSRV of this release is 1.83.0.
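Downstream crates that depend on this release can record the same requirement with Cargo's `rust-version` field, so users on an older toolchain get a clear error up front (a minimal sketch; the package name and version are illustrative):

```toml
# Cargo.toml of a downstream crate (package name is illustrative)
[package]
name = "my-app"
version = "0.1.0"
edition = "2021"
# Cargo refuses to build with a toolchain older than this
rust-version = "1.83"
```

Cargo has enforced `rust-version` at build time since 1.56, turning a confusing mid-compile failure into an immediate, actionable message.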
## What's Changed
- Use CUDA_COMPUTE_CAP if nvidia-smi not found by @EricLBuehler in #944
- fix(docs): fix broken link by @sammcj in #945
- Better diffusion interactive mode by @EricLBuehler in #948
- Implement Imatrix for ISQ by @EricLBuehler in #949
- Support imatrix quantization for vision models by @EricLBuehler in #950
- Perplexity calculations with imatrix by @EricLBuehler in #952
- set minimum rustc version to 1.82 by @mmoskal in #957
- Fix append_sliding_window by @EricLBuehler in #958
- Fix completion api behavior of best_of by @EricLBuehler in #959
- Ensure support for cuda cc 5.3 by @EricLBuehler in #960
- Improve test speeds on Windows by @EricLBuehler in #961
- use llguidance library for constraints (including json schemas) by @mmoskal in #899
- Fix metal fp8 quantization by @EricLBuehler in #962
- Fix example gguf_locally to match chat template requirements by @msk in #966
- Bitsandbytes quantization: loading and kernels by @EricLBuehler in #967
- updated the tokenizers dependency of core to 0.21 by @vkomenda in #975
- Remove outdated binaries mention in the readme by @BafS in #973
- Improve error handling by @cdoko in #974
- Add None check to prevent panic in evict_all_to_cpu in prefix_cacher.rs by @cdoko in #979
- Include start offset for metal bitwise ops by @EricLBuehler in #978
- Fail fast on TcpListener bind errors by @cdoko in #982
- Inplace softmax long-seqlen attention optimizations by @EricLBuehler in #984
- Fix cuda cublaslt when using vllama mask by @EricLBuehler in #985
- Add cross attn quantization for mllama by @EricLBuehler in #987
- fix mistralrs-server ignoring interactive_mode arg by @haricot in #990
- Adding streaming function to mistralrs server. by @Narsil in #986
- Fixes for bnb and more apis in mistralrs-quant by @EricLBuehler in #972
- Support send + sync in loader by @EricLBuehler in #991
- More vllama optimizations by @EricLBuehler in #992
- Update docs by @EricLBuehler in #993
- Use metal autorelease to optimize memory usage by @EricLBuehler in #996
- Partial Fix for Sliding Window Attention by @cdoko in #994
- Only dep on objc when building on metal by @EricLBuehler in #998
- Prefix cacher v2 by @EricLBuehler in #1000
- Add `--cpu` flag to `mistralrs-server` by @cdoko in #997
- Metal PagedAttention support by @EricLBuehler in #1001
- Fix cross attention + prefix cacher v2 support by @EricLBuehler in #1006
- Support for normal cache for mllama, phi3v, qwen2vl by @EricLBuehler in #1007
- Cleaner creation of dummy pa input metadata by @EricLBuehler in #1014
- Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models by @guoqingbao in #1009
- Support device mapping for Paged Attention by @cdoko in #1011
- Prefix cacher fixes by @EricLBuehler in #1018
- More fixes for the prefix cacher by @EricLBuehler in #1019
- Support uqff for idefics3 by @EricLBuehler in #1020
- Prepare for v0.3.5 by @EricLBuehler in #1021
- Cleaner pipeline no prefix cache setting by @EricLBuehler in #1022
- Support uqff load/save for idefics3 by @EricLBuehler in #1023
- Update license for 2025 by @EricLBuehler in #1024
- Implement DeepSeekV2 by @EricLBuehler in #1010
- Use cudarc fork to fix CUDA build on Windows by @EricLBuehler in #1032
- Fix metal paged attn phi3 by @EricLBuehler in #1033
- Use float8 mistralrs_cudarc_fork feature by @EricLBuehler in #1034
- Patch prefix caching to fix incorrect outputs by @EricLBuehler in #1035
- Allocate paged attn cache as empty instead of zeros by @EricLBuehler in #1036
- Remove ug and cudarc transient dep by @EricLBuehler in #1037
- Rename MemoryGpuConfig::Amount->MbAmount by @EricLBuehler in #1038
- CUDA dequant kernels conditional compilation by @EricLBuehler in #1039
- F16 support for mllama, introduce FloatInfo by @EricLBuehler in #1041
- Automatic device mapping support by @EricLBuehler in #1042
- Support automatic device mapping for gguf models by @EricLBuehler in #1044
- Support loading models without ISQ using device map by @EricLBuehler in #1045
- Fix GGUF auto device mapping by @EricLBuehler in #1047
- More efficient loading of safetensors when casting by @EricLBuehler in #1048
- Fix Loading and Running on CPU by @cdoko in #1052
- Work on better device mapping for mllama by @EricLBuehler in #1049
- Mention interactive mode or server port in readme for gguf by @EricLBuehler in #1055
- Fix panic in mistralrs-server by @cdoko in #981
- Include device memory avail in device map err by @EricLBuehler in #1060
- Fix `--cpu` on cuda by @cdoko in #1056
- Improve pagedattn support in mistralrs bench by @EricLBuehler in #1063
- Paged attention support for multi gpu by @EricLBuehler in #1059
- Ergonomic automatic device mapping support by @EricLBuehler in #1054
- Examples for automatic device mapping by @EricLBuehler in #1065
- Fix metal pagedattn half8 vec impl by @EricLBuehler in #1067
- Improve support for GGUF auto device map by @EricLBuehler in #1069
- Fix missing field in idefics3 during loading by @EricLBuehler in #1070
- Fix missing field in idefics3 during loading by @EricLBuehler in #1072
- Fix paged attention for vision models on multiple devices by @cdoko in #1071
- Fixes for idefics3 and idefics2 by @EricLBuehler in #1073
- Improve automatic device map by @EricLBuehler in #1076
- Implement the DeepSeekV3 model (support full DeepSeek R1) by @EricLBuehler in #1077
- Don't print GGUF model metadata when silent=true by @Jeadie in #1079
- Allow `ChatCompletionChunkResponse` (and therefore streaming) to have `Usage` by @Jeadie in #1078
- Support loading blockwise quantized fp8 by @EricLBuehler in #1080
- Implement MiniCpm-O 2.6 by @EricLBuehler in #1074
- Bump version to v0.4.0 by @EricLBuehler in #1081
## New Contributors
- @sammcj made their first contribution in #945
- @mmoskal made their first contribution in #957
- @vkomenda made their first contribution in #975
- @BafS made their first contribution in #973
- @cdoko made their first contribution in #974
- @Narsil made their first contribution in #986
**Full Changelog**: v0.3.4...v0.4.0