ollama/llama/patches
Jesse Gross afaf7ce8c3 ggml: Enable op_offload to improve partial offload performance
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.

The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).

An existing heuristic dynamically switches between these two
modes, but in practice it lacks the information to make that
decision accurately. This change supplies authoritative batch-size
data so the check works, giving the best of both worlds.

Fixes #12037
2025-10-30 13:53:10 -07:00
.gitignore update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0001-ggml-backend-malloc-and-free-using-the-same-compiler.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0002-pretokenizer.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0003-clip-unicode.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0004-solar-pro.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0005-fix-deepseek-deseret-regex.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0006-maintain-ordering-for-rules-for-grammar.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0007-sort-devices-by-score.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0008-add-phony-target-ggml-cpu-for-all-cpu-variants.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0009-remove-amx.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0010-fix-string-arr-kv-loading.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0011-ollama-debug-tensor.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0012-add-ollama-vocab-for-grammar-support.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0013-add-argsort-and-cuda-copy-for-i32.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0014-graph-memory-reporting-on-failure.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0015-ggml-Export-GPU-UUIDs.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0016-add-C-API-for-mtmd_input_text.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0017-no-power-throttling-win32-with-gnuc.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0018-BF16-macos-version-guard.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0019-ggml-Add-batch-size-hint.patch ggml: Enable op_offload to improve partial offload performance 2025-10-30 13:53:10 -07:00
0020-Disable-ggml-blas-on-macos-v13-and-older.patch Update GGML to b6646 (#12245) 2025-10-02 14:47:10 -07:00
0021-fix-mtmd-audio.cpp-build-on-windows.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0022-ggml-No-alloc-mode.patch ggml: Enable op_offload to improve partial offload performance 2025-10-30 13:53:10 -07:00
0023-decode-disable-output_all.patch Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552) 2025-10-13 15:26:18 -07:00
0024-ggml-Enable-resetting-backend-devices.patch logs: fix bogus "0 MiB free" log line (#12590) 2025-10-14 11:26:28 -07:00
0025-harden-uncaught-exception-registration.patch harden uncaught exception registration (#12120) 2025-09-02 09:43:55 -07:00
0026-GPU-discovery-enhancements.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0027-NVML-fallback-for-unified-memory-GPUs.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0028-CUDA-Changing-the-CUDA-scheduling-strategy-to-spin-1.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0029-report-LoadLibrary-failures.patch Fix vulkan PCI ID and ID handling (#12775) 2025-10-28 15:15:35 -07:00
0032-interleave-multi-rope.patch interleaved mrope (#12807) 2025-10-30 11:29:00 -07:00