ollama/llama/patches
Jesse Gross 9d97e6a9f1 ggml: Avoid allocating CUDA primary context on unused GPUs
The recent memory management changes made all GPUs visible to the
runner, regardless of whether they are ultimately used. This caused
CUDA to allocate a primary context (~300 MB of VRAM) on each GPU,
for each model. This is unnecessary, so we can both avoid touching
GPUs that we exclude in the early stage of allocation and free the
memory for any that we touch but don't use.

The issue will continue to exist for the old engine, since it touches
all devices during initialization.
2025-08-27 16:24:18 -07:00
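Illustrative only: the patch itself works through the ggml backend device layer (see 0024-ggml-Enable-resetting-backend-devices.patch below), but the underlying mechanism can be sketched with plain CUDA driver API calls. The function name here is hypothetical; it shows how a primary context created on a probed-but-unused device could be torn down to reclaim its VRAM.

```c
#include <cuda.h>
#include <stdio.h>

/* Hypothetical helper: reset the CUDA primary context on a device that was
 * touched during probing but excluded from the final allocation, returning
 * its ~300 MB VRAM reservation. Assumes cuInit() has already run, as it
 * would inside the CUDA backend. */
void reset_unused_cuda_device(int ordinal) {
    CUdevice dev;
    unsigned int flags;
    int active;

    if (cuDeviceGet(&dev, ordinal) != CUDA_SUCCESS) {
        return;
    }

    /* Only act if a primary context was actually created on this device. */
    if (cuDevicePrimaryCtxGetState(dev, &flags, &active) != CUDA_SUCCESS || !active) {
        return;
    }

    /* Destroy the primary context and free the VRAM it holds. */
    if (cuDevicePrimaryCtxReset(dev) == CUDA_SUCCESS) {
        printf("released primary context on CUDA device %d\n", ordinal);
    }
}
```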
.gitignore update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0001-ggml-backend-malloc-and-free-using-the-same-compiler.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0002-pretokenizer.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0003-clip-unicode.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0004-solar-pro.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0005-fix-deepseek-deseret-regex.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0006-maintain-ordering-for-rules-for-grammar.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0007-sort-devices-by-score.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0008-add-phony-target-ggml-cpu-for-all-cpu-variants.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0009-remove-amx.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0010-fix-string-arr-kv-loading.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0011-ollama-debug-tensor.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0012-add-ollama-vocab-for-grammar-support.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0013-add-argsort-and-cuda-copy-for-i32.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0014-graph-memory-reporting-on-failure.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0015-ggml-Export-GPU-UUIDs.patch update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
0016-add-C-API-for-mtmd_input_text.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0017-no-power-throttling-win32-with-gnuc.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0018-BF16-macos-version-guard.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0019-Enable-CUDA-Graphs-for-gemma3n.patch disable output_all (#11959) 2025-08-18 17:45:40 -07:00
0020-Disable-ggml-blas-on-macos-v13-and-older.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0021-fix-mtmd-audio.cpp-build-on-windows.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0022-ggml-No-alloc-mode.patch llm: New memory management 2025-08-14 15:24:01 -07:00
0023-decode-disable-output_all.patch disable output_all (#11959) 2025-08-18 17:45:40 -07:00
0024-ggml-Enable-resetting-backend-devices.patch ggml: Avoid allocating CUDA primary context on unused GPUs 2025-08-27 16:24:18 -07:00