ollama/llama/patches
Jesse Gross afaf7ce8c3 ggml: Enable op_offload to improve partial offload performance
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.

The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).

An existing heuristic dynamically switches between these two
modes, but in practice it lacks the information to make that
decision accurately. This change supplies authoritative batch-size
data so the check works, giving the best of both worlds.

Fixes #12037
2025-10-30 13:53:10 -07:00
.gitignore update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0001-ggml-backend-malloc-and-free-using-the-same-compiler.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0002-pretokenizer.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0003-clip-unicode.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0004-solar-pro.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0005-fix-deepseek-deseret-regex.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0006-maintain-ordering-for-rules-for-grammar.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0007-sort-devices-by-score.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0008-add-phony-target-ggml-cpu-for-all-cpu-variants.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0009-remove-amx.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0010-fix-string-arr-kv-loading.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0011-ollama-debug-tensor.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0012-add-ollama-vocab-for-grammar-support.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0013-add-argsort-and-cuda-copy-for-i32.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0014-graph-memory-reporting-on-failure.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0015-ggml-Export-GPU-UUIDs.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0016-add-C-API-for-mtmd_input_text.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0017-no-power-throttling-win32-with-gnuc.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0018-BF16-macos-version-guard.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0019-ggml-Add-batch-size-hint.patch ggml: Enable op_offload to improve partial offload performance 2025-10-30 13:53:10 -07:00
0020-Disable-ggml-blas-on-macos-v13-and-older.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0021-fix-mtmd-audio.cpp-build-on-windows.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0022-ggml-No-alloc-mode.patch ggml: Enable op_offload to improve partial offload performance 2025-10-30 13:53:10 -07:00
0023-decode-disable-output_all.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0024-ggml-Enable-resetting-backend-devices.patch logs: fix bogus "0 MiB free" log line (#12590) 2025-10-14 11:26:28 -07:00
0025-harden-uncaught-exception-registration.patch harden uncaught exception registration (#12120) 2025-09-02 09:43:55 -07:00
0026-GPU-discovery-enhancements.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0027-NVML-fallback-for-unified-memory-GPUs.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0028-CUDA-Changing-the-CUDA-scheduling-strategy-to-spin-1.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0029-report-LoadLibrary-failures.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0032-interleave-multi-rope.patch interleaved mrope (#12807) 2025-10-30 11:29:00 -07:00