ollama

Commit Graph

Author	SHA1	Message	Date
Inforithmics	c4d8c75e54	merge fixes	2025-10-04 15:27:52 +02:00
Inforithmics	ac6ba7d44b	Merge remote-tracking branch 'upstream/main' into VulkanV3Update	2025-10-04 14:53:59 +02:00
Daniel Hiltgen	c68f367ef6	Update GGML to b6646 (#12245) Notable EOLs with this change: - MacOS v12 and v13 are no longer supported (v14+ required) - AMD gfx900 and gfx906 are no longer supported	2025-10-02 14:47:10 -07:00
Daniel Hiltgen	bc8909fb38	Use runners for GPU discovery (#12090) This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code, and removed support for the llama runner.	2025-10-01 15:12:32 -07:00
Jesse Gross	3d0b1734c0	ggml: Preallocate CUDA pool memory The GGML CUDA backend allocates additional memory for intermediate results during calculation. This memory isn't currently allocated during worst case graph reservation and therefore not included in scheduling. This means that as these buffers potentially grow with context length, we could crash. This extends the memory allocation system down layer from the GGML graph to the CUDA layer, preallocating the worst case memory there as well. Fixes #11753	2025-09-30 15:04:43 -07:00
Jesse Gross	efaee8c2d6	ggml: Backport scale kernel fixes The GGML scale kernel uses signed 32-bit ints to represent the number of elements in the tensor. For large images, mistral-small3.2 overflows this, triggering CUDA errors due to negative arguments. Currently, this can happen when the user passes a large image to mistral-small3.2. However, with upcoming changes to reserve CUDA memory, it happens every time mistral-small is loaded as we reserve using a worst case batch. This patch is part of an upstream GGML commit and should be removed after GGML is updated past 0a1b398 "ggml: add ops for WAN video model (cuda && cpu) (#15669)". Fixes #10388	2025-09-30 15:04:43 -07:00
Jesse Gross	734b57da0e	ggml: Remove allocation status reporting For each memory allocation we report the size of the (attempted) allocation and whether it succeeded or failed. The latter status reporting proved to be not that useful in practice as systems such as Windows can automatically overflow from VRAM into RAM, resulting in successful allocations even when there isn't enough memory where we wanted. As a result, this information is only used for debug logging, which isn't worthwhile enough for the amount of code. It also isn't fully accurate, as multiple allocations may result in partial failures.	2025-09-30 15:04:43 -07:00
Daniel Hiltgen	5c18fb456c	fix vulkan ids to be underlying	2025-09-24 15:48:35 -07:00
Daniel Hiltgen	c86af47ac0	WIP - wire up Vulkan with the new engine based discovery Not a complete implementation - free VRAM is better, but not accurate on windows	2025-09-24 10:49:39 -07:00
Daniel Hiltgen	3a8ee62bd5	Merge remote-tracking branch 'inforithmics/vulkanV3' into engine_based_discovery_with_vulkan	2025-09-21 14:04:22 -07:00
Daniel Hiltgen	f761292516	Use runners for GPU discovery This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code, and removed support for the llama runner.	2025-09-21 13:53:24 -07:00
Inforithmics	0d4f3341c3	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-09-16 22:15:31 +02:00
Inforithmics	eb7b5ce9f4	Fix patches apply	2025-09-16 22:14:05 +02:00
Michael Yang	ad95d5b30b	use split activations when possible (#12293) * use ggml__split activations when possible forward qkv	2025-09-16 09:51:19 -07:00
Michael Yang	3f6642f6fc	model: implement bert in ollama engine (#9080) * fix truncate * s/SentencePieceModel/SentencePiece/ * bert * wordpiece * refactor pooling * more tokenizers * normalize embeddings	2025-09-15 15:35:59 -07:00
Masato Nakasaka	dd853c4040	modified UUID code inside ggml	2025-09-10 14:45:12 +09:00
Inforithmics	08bec121eb	Remove Code not in llama.cpp	2025-09-10 00:09:17 +02:00
Inforithmics	d97c2ab8b9	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-09-06 20:16:05 +02:00
Xiaodong Ye	603d3ab0ca	vulkan: get GPU ID (ollama v0.11.5) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-06 20:11:06 +02:00
Michael Yang	fb92b61754	logutil: add Trace and TraceContext helpers (#12110)	2025-09-02 13:09:12 -07:00
Daniel Hiltgen	0cc90a8186	harden uncaught exception registration (#12120)	2025-09-02 09:43:55 -07:00
Daniel Hiltgen	517807cdf2	perf: build graph for next batch async to keep GPU busy (#11863) * perf: build graph for next batch in parallel to keep GPU busy This refactors the main run loop of the ollama runner to perform the main GPU intensive tasks (Compute+Floats) in a goroutine so we can prepare the next batch in parallel to reduce the amount of time the GPU stalls waiting for the next batch of work. * tests: tune integration tests for ollama engine This tunes the integration tests to focus more on models supported by the new engine.	2025-08-29 14:20:28 -07:00
Jesse Gross	9d97e6a9f1	ggml: Avoid allocating CUDA primary context on unused GPUs The recent memory management changes caused all GPUs to be visible to the runner, regardless of whether they are ultimately used. This caused CUDA devices to allocate a primary context (~300 MB VRAM) on each GPU, for each model. This is unnecessary, so we can both avoid touching GPUs that we exclude in the early stage of allocation and free the memory for any that we touch but don't use. The issue will continue to exist for the old engine, since it touches all devices during initialization.	2025-08-27 16:24:18 -07:00
Michael Yang	59412fbb43	convert(gptoss): mxfp4 to ggml layout to avoid jit conversion (#12018) * convert: return bytes written * ggml flavor mxfp4 * simplify jit conversion * comment	2025-08-26 16:41:02 -07:00
Jesse Gross	05ccb17c6e	kvcache: Use Cast instead of Copy for flash attention masks Flash attention kernels require the mask of the KV cache be a F16 rather than an F32. We can use the GGML operation ggml_cast to do this rather than doing it ourselves, which allows reuse of a preallocated buffer in the graph rather than allocating a new one for each batch. This improves token generation performance with flash attention by 10-30% (with gpt-oss). This also makes performance with flash attention better than without it, as expected.	2025-08-19 12:36:28 -07:00
Daniel Hiltgen	6eaf194b85	fix arm linux build when HWCAP2_SVE2 undefined (#11908)	2025-08-14 16:38:53 -07:00
Jesse Gross	d5a0d8d904	llm: New memory management This changes the memory allocation strategy from upfront estimation to tracking actual allocations done by the engine and reacting to that. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.	2025-08-14 15:24:01 -07:00
Inforithmics	834a66689e	Update Vulkan backend to e54d41befcc1575f4c898c5ff4ef43970cead75f	2025-08-15 00:18:18 +02:00
Inforithmics	199458944f	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-08-15 00:06:53 +02:00
Michael Yang	1a19df1f3a	update vendored llama.cpp and ggml (#11823 ) * TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch This will be redone once my branch is merged upstream in llama.cpp * feat: Update all patches There are a number that are no longer needed at all: - 0003-embeddings: Embeddings entirely overhauled on master - 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely overhauled on master - 0019-metal-add-mean-kernel-14267: Merged upstream - 0020-CUDA-add-mean-operation-14313: Merged upstream * feat: Sync llama.cpp and ggml * fix: Update rsync-filter for all moved/new/removed files * fix: Add files missing from sync * fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs * fix: Add ggml files missing from sync * fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files * fix: Remove mtmd main cpp files * fix: Add missing include in sampling_ext.cpp * fix: Update llama.go to use mtmd instead of clip/llava * fix: Add patch for mtmd_input_text * chore: Ignore .patched in the patch directory fix: Fix support for arch-specific ggml-cpu source files with new arrangement In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific implementations were split out into a nested tree structure under ggml-cpu/arch. This conflicts with standard CGO layout where all arch-specific source files are expected to live in the same directory as the parent go module and use suffixes based on GOOS and GOARCH. As such, there were really two options for getting this to work: 1. Add a patch on top of the GGML sync to rearrange the files to match the GO layout convention 2. Use CGO directives to conditionally include the nested source files in the compilation units This commit does (2) in order to minimize the set of changes needed on top of the upstream file layout. To get this to work, there are two key things needed: 1. 
In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in the preprocessor directives 2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to explicitly include the .c|.cpp files for the given architecture from the nested directory * fix: Use mtmd_helper to correctly load the bitmap for the image * fix: Apply patch for mtmd_text_input * fix: Add missing stb to llama.cpp rsync-filter * fix: Add sync'ed stb vendored header * fix: Use c++17 and include vendor for go wrapper modules * fix: Update patch 0015 for upstream implementation of uuid * feat: Bump to the latest tip of the branch * fix: Update patches for bump * feat: Bump back to the central repo and point at the latest master This includes granite 4 and a number of other model architectures! * fix: Revert changes to ggml export GPU UUID patch * fix: Add patch for GGML_VERSION and GGML_COMMIT constants * feat: Sync all patched code * build: Include cmake/common.cmake in ggml sync * build: Add top-level include for GNUINstallDirs in CMakeLists.txt This is used to populate CMAKE_INSTALL_BINDIR * fix: Add a patch to avoid power throttling API on non-msvc windows builds * fix: Sync patch changes for ggml-cpu.c * feat: Bump llama.cpp to 4a4f42 This picks up support for Kimi K2 and PLaMO-2 * feat: Sync llama.cpp * fix: Handle multi-chunk image encodings from mtmd * fix: Re-number patches after merge with `main` * feat: Bump to 41e78c in the makefile * fix: Fix Solar and argsort/copy patches after bump * fix: Remove Gemma3n CUDA Graphs patch It was implemented upstream: https://github.com/ggml-org/llama.cpp/pull/14741 * feat: Sync llama.cpp / ggml after latest bump * build: Remove unnecessary CFLAGS definitions in cpu.go * fix: Remove unnecessary additions in the rsync-filter * fix: Remove unused vendored code for chat template parsing * Revert "fix: Remove Gemma3n CUDA Graphs patch" This reverts commit `d724caced3`. 
* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394 * fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n * unwind mxfp4 patch Prepare to bump ggml with their impl for mxfp4 * bump * fix windows build error * Convert tensors at load time Repack the mxfp4 tensors as ggmls kernels expect them to be. * convert mlp bf16 to f32 * buffer the conversion better * reshape earlier * openai swiglu * add ids * split qkv, gate_up * fix nested alt tags * fast attention * remove debug messages * fix lint * remove redundant test * remap values only if source/target are different * add back i32->i32 copy * refactor cpu quants * clean up vendor * update patch instructions * clean up patches * remove webgpu * update mem * also handle gpt-oss * revert convert changes --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-08-14 14:42:58 -07:00
Inforithmics	6543213e6f	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-08-13 23:50:00 +02:00
youzichuan	bb71654ebe	chore: fix some inconsistent function name in comment Signed-off-by: youzichuan <youzichuan6@outlook.com>	2025-08-13 09:50:27 -07:00
Inforithmics	eaf42a646c	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-08-13 08:27:22 +02:00
Jesse Gross	a343ae53a4	ggml: Use ordinal IDs for AMD GPUs on Linux when UUID is unavailable Some AMD GPUs do not provide UUIDs and report only "XX". In these cases, we should use the ordinal ID as an alternate identifier. This is the same as we always need to do on Windows for AMD. In addition, this prints out the ID for each GPU when enumerating them for easier debugging in the future.	2025-08-12 16:56:14 -07:00
Inforithmics	60a015e8c3	Revert changes in ggml.go	2025-08-10 16:09:44 +02:00
Inforithmics	1edbfd0559	Revert changes in ggml.go	2025-08-10 16:07:24 +02:00
Inforithmics	fd4480a848	Fixed duplicate sync in ggml.go	2025-08-10 16:05:09 +02:00
Inforithmics	2e7452be71	Update Vulkan Code to de4c07f93783a1a96456a44dc16b9db538ee1618	2025-08-10 16:01:07 +02:00
Inforithmics	f8ed1541ed	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-08-09 21:59:30 +02:00
Jesse Gross	79f6376f5b	ggml: No-alloc mode Callers can set a backend buffer type to be no-alloc, meaning that it does not allocate memory for tensors or operations. This can be used for calculating memory requirements. Tensors and graphs must be recreated with no-alloc set to false before loading data. Defaults to false for newly created backend buffer types.	2025-08-08 14:57:13 -07:00
Jesse Gross	756c78cfc7	ggml: Support closing backends In order to iteratively find the best memory allocation, we need to be able to free backend memory so we can try again.	2025-08-08 14:57:13 -07:00
Jesse Gross	d7f4f788d1	ggml: Use GGML's typedef'ed pointer types For many backend data structures, GGML defines a typedef of a pointer type and returns these from functions. In most cases, CGo understands that these are interchangeable but some parts of Go (such as generics) think they are two different types. We should prefer the form that GGML uses.	2025-08-08 14:57:13 -07:00
Daniel Hiltgen	fa8be9e35c	clean up debugging (#11756)	2025-08-06 13:31:22 -07:00
Michael Yang	fa7776fd24	gpt-oss (#11672) * bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Devon Rifkin <drifkin@drifkin.net>	2025-08-05 12:21:16 -07:00
Daniel Hiltgen	25911a6e6b	mac: disable bf16 on unsupported OS versions (#11585) Support for bf16 was added in MacOS v14+ and attempting to enable on older versions causes runtime failures.	2025-07-30 08:50:54 -07:00
Oliver Simons	ea85e27bbd	Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525) * Enable CUDA Graphs for gemma3n. Similar to https://github.com/ggml-org/llama.cpp/pull/14741, though ollama has a slightly different model graph than llama.cpp which requires different workaround checks. * Remove residual check by reshaping differently in gemma3n model This should make the heuristics more robust	2025-07-29 12:37:06 -07:00
Michael Yang	b4fe3adc0a	compile bf16 support into ggml-metal (#11430)	2025-07-16 17:32:57 -07:00
Jesse Gross	acef9b4c1b	ggml: Use assigned layers when reporting loading stats Reporting params.NumGPULayers can be misleading because it is the requested number of layers, not the actual number that is loaded. While they are often the same, there are cases where they might mismatch, such as if the GPU backend is missing.	2025-07-11 14:21:50 -07:00
Jesse Gross	9a43994c45	ggml: Disable unused pipeline parallelism We're not currently using it, even in cases where we could. Disabling it improves generation performance by 10-30% with multiple GPUs.	2025-07-11 13:30:05 -07:00
Jesse Gross	35fda7b4af	ggml: Report ordinal IDs for AMD GPUs on Windows We don't get valid UUIDs for AMD GPUs on Windows, so the best option is to use the ordinal IDs. This brings us in line with what we currently do on the Ollama server - the only exception is AMD GPUs on Linux, which falls back to using ordinal IDs. The GGML implementation has no fallback but it doesn't appear to occur for any of the GPUs that we support. It's also possible that there are collisions between ordinal IDs for different libraries - however the only places where we use them are AMD on Windows and Metal on Mac, which can never occur on the same system.	2025-07-09 10:35:31 -07:00
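
Commit efaee8c2d6 above attributes the CUDA errors to an element count that no longer fits in a signed 32-bit int. The sketch below only illustrates that arithmetic in Go; it is not the GGML kernel or the backported fix. The helper names and the tensor shape are hypothetical, with the shape chosen solely so that the count exceeds the 32-bit limit.

```go
package main

import (
	"fmt"
	"math"
)

// elementCount multiplies the dimensions in 64 bits, so the product itself
// does not wrap for the sizes used here. (Hypothetical helper.)
func elementCount(dims ...int64) int64 {
	n := int64(1)
	for _, d := range dims {
		n *= d
	}
	return n
}

// fitsInt32 reports whether n can be passed through a signed 32-bit
// parameter without changing value. (Hypothetical helper.)
func fitsInt32(n int64) bool {
	return n >= math.MinInt32 && n <= math.MaxInt32
}

func main() {
	// Hypothetical shape, not taken from any real model.
	n := elementCount(4096, 4096, 160)
	fmt.Println("elements:", n)
	fmt.Println("fits in int32:", fitsInt32(n))
	// Truncating to 32 bits yields a negative value, the kind of argument
	// the commit message describes as triggering CUDA errors.
	fmt.Println("as int32:", int32(n))
}
```

A range check like fitsInt32 before narrowing is one way to surface the problem early; per the commit message, the upstream change fixes it inside the scale kernel instead.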

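Commits a343ae53a4 and 35fda7b4af above describe falling back to the ordinal ID when an AMD GPU does not provide a usable UUID. A minimal sketch of that selection rule in Go follows. The function name is hypothetical and the actual change lives in the GGML backend code, not in Go. The "XX" placeholder comes from the commit message; treating an empty string the same way is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// deviceID returns the identifier to use for a GPU: its UUID when one is
// reported, otherwise its ordinal index rendered as a string.
// "XX" is the placeholder named in the commit message; the empty-string
// case is an assumption added for this sketch.
func deviceID(uuid string, ordinal int) string {
	if uuid == "" || uuid == "XX" {
		return strconv.Itoa(ordinal)
	}
	return uuid
}

func main() {
	// Hypothetical UUID value, for illustration only.
	fmt.Println(deviceID("GPU-0123abcd", 0))
	fmt.Println(deviceID("XX", 1))
}
```

As commit 35fda7b4af notes, ordinal IDs from different libraries could collide, so a fallback like this is only safe where a single library enumerates the devices.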
170 Commits
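
Commit fa7776fd24 above states that gptoss needs a minimum context length of 8192, that higher values are honored, and that lower values are silently reset. The Go sketch below shows one way such a clamp could look. The constant and function names are hypothetical and are not taken from the Ollama server code.

```go
package main

import "fmt"

// minGptossContext is the minimum context length named in the commit
// message. The identifier itself is hypothetical.
const minGptossContext = 8192

// effectiveContext raises requests below the minimum to the minimum and
// leaves larger requests unchanged. (Hypothetical helper.)
func effectiveContext(requested int) int {
	if requested < minGptossContext {
		return minGptossContext
	}
	return requested
}

func main() {
	fmt.Println(effectiveContext(4096))  // raised to the minimum
	fmt.Println(effectiveContext(16384)) // kept as requested
}
```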