ollama

Commit Graph

Author	SHA1	Message	Date
Gabe Goodhart	e6a22f20d1	Merge remote-tracking branch 'origin/main' into GraniteFour * origin/main: docs: update modelfile.md to reflect current default num_ctx (#11189) ggml: Use assigned layers when reporting loading stats ggml: Disable unused pipeline parallelism Only load supported models on new engine (#11362)	2025-07-15 14:50:19 -06:00
Gabe Goodhart	5305e2ad14	feat: Sync llama.cpp Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-07-15 14:50:01 -06:00
Gabe Goodhart	91e4b10d40	fix: Sync patch changes for ggml-cpu.c Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-07-11 16:01:15 -06:00
Jesse Gross	acef9b4c1b	ggml: Use assigned layers when reporting loading stats Reporting params.NumGPULayers can be misleading because it is the requested number of layers, not the actual number that is loaded. While they are often the same, there are cases where they might mismatch, such as if the GPU backend is missing.	2025-07-11 14:21:50 -07:00
Jesse Gross	9a43994c45	ggml: Disable unused pipeline parallelism We're not currently using it, even in cases where we could. Disabling it improves generation performance by 10-30% with multiple GPUs.	2025-07-11 13:30:05 -07:00
Gabe Goodhart	81d821ba9b	build: Include cmake/common.cmake in ggml sync Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-07-11 13:25:01 -06:00
Gabe Goodhart	bf1b261611	feat: Sync all patched code Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-07-11 11:44:18 -06:00
Gabe Goodhart	e61826c180	Merge remote-tracking branch 'origin/main' into GraniteFour * origin/main: ggml: Report ordinal IDs for AMD GPUs on Windows doc: add MacOS docs (#11334) Reduce default parallelism to 1 (#11330) API/CLI context enhancements (#11331) add `tool_name` to api.md (#11326) template: add tool result compatibility (#11294) ci: modularization (#11324) Revert "ggml: Temporarily disable reporting UUIDs" readme: update Ollama icon size int: add performance integration tests (#11173) doc: add NVIDIA blackwell to supported list (#11307) Update base image to Ubuntu 24.04 LTS (#9681) doc: Update link for mac install (#11288) mimic logs for layers on new engine (#11278) readme: add NativeMind to community integrations (#11242) tools: fix parsing tool calls with empty arguments, missing required fields (#11233) readme: add ollama-bash-toolshed to community integrations (#11224)	2025-07-10 14:01:24 -06:00
Jesse Gross	35fda7b4af	ggml: Report ordinal IDs for AMD GPUs on Windows We don't get valid UUIDs for AMD GPUs on Windows, so the best option is to use the ordinal IDs. This brings us in line with what we currently do on the Ollama server - the only exception is AMD GPUs on Linux, which falls back to using ordinal IDs. The GGML implementation has no fallback but it doesn't appear to occur for any of the GPUs that we support. It's also possible that there are collisions between ordinal IDs for different libraries - however the only places where we use them are AMD on Windows and Metal on Mac, which can never occur on the same system.	2025-07-09 10:35:31 -07:00
Jesse Gross	592d21e7db	Revert "ggml: Temporarily disable reporting UUIDs" The root cause was an unclean upgrade - this code is fine. This reverts commit `45f216a9c7`.	2025-07-07 11:31:02 -07:00
Daniel Hiltgen	2c4ce40334	mimic logs for layers on new engine (#11278) This adds some extra logs to make the new engine a bit more consistent with the llama engine.	2025-07-02 16:38:36 -07:00
Gabe Goodhart	dbd8ee2654	fix: Fix support for arch-specific ggml-cpu source files with new arrangement In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific implementations were split out into a nested tree structure under ggml-cpu/arch. This conflicts with standard CGO layout where all arch-specific source files are expected to live in the same directory as the parent go module and use suffixes based on GOOS and GOARCH. As such, there were really two options for getting this to work: 1. Add a patch on top of the GGML sync to rearrange the files to match the GO layout convention 2. Use CGO directives to conditionally include the nested source files in the compilation units This commit does (2) in order to minimize the set of changes needed on top of the upstream file layout. To get this to work, there are two key things needed: 1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in the preprocessor directives 2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to explicitly include the .c|.cpp files for the given architecture from the nested directory Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-06-27 17:08:56 -06:00
Gabe Goodhart	85aba511ec	fix: Add ggml files missing from sync Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-06-27 17:06:05 -06:00
Gabe Goodhart	62af160d82	fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-06-27 17:05:39 -06:00
Gabe Goodhart	2613f5da2d	feat: Sync llama.cpp and ggml Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-06-27 17:01:24 -06:00
Jesse Gross	45f216a9c7	ggml: Temporarily disable reporting UUIDs This is causing segfaults, so disable it. Currently UUIDs are only used for debugging purposes, although they are planned to be used in additional ways in the future. Bug #11211	2025-06-27 11:27:22 -07:00
Michael Yang	73b642e6f3	add new gemma model (#11204) * update patches * cherry pick metal mean kernel * cherry pick cuda mean kernel * gemma3n	2025-06-25 21:47:09 -07:00
Daniel Hiltgen	1c6669e64c	Re-remove cuda v11 (#10694) * Re-remove cuda v11 Revert the revert - drop v11 support requiring drivers newer than Feb 23 This reverts commit `c6bcdc4223`. * Simplify layout With only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling) * distinct sbsa variant for linux arm64 This avoids accidentally trying to load the sbsa cuda libraries on a jetson system which results in crashes. * temporarily prevent rocm+cuda mixed loading	2025-06-23 14:07:00 -07:00
Jesse Gross	87b7af6cee	ggml: Check return status for computation. We don't check the return status after computing the graph, which can silently lead to bad outputs if we try to keep going and future computation succeeds. This appears to happen in certain cases on Apple M2 devices. Fixes #11070	2025-06-19 17:12:49 -07:00
Jeffrey Morgan	6baf1e31e2	Revert "Revert "ggml: Export GPU UUIDs" (#11115)" (#11117) Reverts PR #11115. The original change was mistakenly reverted instead of #10822	2025-06-18 07:30:49 -07:00
Jeffrey Morgan	ed567ef43b	Revert "ggml: Export GPU UUIDs" (#11115) This reverts commit `aaa7818000`.	2025-06-18 05:45:00 -07:00
Jesse Gross	aaa7818000	ggml: Export GPU UUIDs This enables matching up devices and information reported by the backend with system management libraries such as nvml to get accurate free memory reporting.	2025-05-29 14:01:26 -07:00
Jesse Gross	1f371ea92f	ml: Panic rather than return error on tensor allocation failure FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics. Empty and Zeros directly panic if they can't allocate memory. This makes things consistent by panicking for the first two cases, removing a fair amount of error handling code. This is also consistent with how Go typically handles these situations.	2025-05-22 14:38:09 -07:00
Jesse Gross	73d6a82cce	ollamarunner: Memory usage reporting This provides granular information about the backend memory allocations required by the runner: - Per backend - Per layer - Weights, cache and graph - Allocation status This can be used for debugging and validating memory estimates.	2025-05-22 14:38:09 -07:00
Jesse Gross	6db8a3771c	ggml: Report graph memory for failed allocations GGML has a function to report the allocated size of a backend buffer. However, this returns 0 if we tried to allocate a buffer and it failed. For memory management purposes, it's important to know how much we were trying to allocate. This extends the API to report attempted sizes for all buffers and whether it succeeded.	2025-05-22 14:38:09 -07:00
Michael Yang	e0ed984cde	feat: qwen3 dense and sparse models (#10708) * feat: qwen3 dense * feat: qwen3moe * fix llama4 moe	2025-05-21 10:21:07 -07:00
Michael Yang	375839ea2d	chore: disable debug in binary libraries (#10788)	2025-05-21 09:39:38 -07:00
Michael Yang	9ed8bf14cb	ml: add more rope options (#10775)	2025-05-20 15:51:08 -07:00
Jesse Gross	94ab428e3f	ggml: Separate tensor load from backend creation Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them to be two steps: - Create backend, including enumerating tensors and memory allocation - Loading tensor data This allows more flexibility in managing model loading.	2025-05-19 09:54:22 -07:00
Michael Yang	ef202789fa	fix pixel values padding (#10718) * panic if trying to pad 4d * fix pixel values padding	2025-05-15 13:44:44 -07:00
Bruce MacDonald	0aa8b371dd	model: add Qwen2.5-VL support (#10385)	2025-05-13 20:58:02 -07:00
Michael Yang	23125648b8	chore: update mllama to use ollama engine (#10637)	2025-05-13 17:36:02 -07:00
Jeffrey Morgan	0cefd46f23	llama: update to commit de4c07f93 (#10655)	2025-05-12 12:17:26 -07:00
Michael Yang	f95a1f2bef	feat: add trace log level (#10650) reduce prompt log to trace level	2025-05-12 11:43:00 -07:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363) * Move quantization logic to GGML via new backend This moves the model-aware logic to Go code and calls GGML's quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
Jeffrey Morgan	913905028b	all: fix cgo compiler warnings on windows (#10563)	2025-05-05 08:02:39 -07:00
Jesse Gross	a6ef73f4f2	ggml: Fix race that resulted in "context canceled" when loading Successfully completing processing with an errgroup cancels the associated context. However, we also have a goroutine that is checking for cancelation of the context. As a result, there is a race where the goroutine can pick up the cancelation and report an error, replacing the successful error message. To avoid that, this replaces the goroutine with a cancelation check when we are reading files. This also has the advantage of stopping all reads relatively quickly on error and also ensuring that there are no outstanding I/O operations when we return in this case. The downside is that if a file read blocks forever (for example, over the network) then cancelation of the context effectively won't be honored. However, this is also true for other smaller files we read and the tensors are read in small chunks (128K), so it's consistent and better on balance overall.	2025-05-02 13:43:25 -07:00
Jesse Gross	c2f5d6662b	ollamarunner: Re-enable worst case graph preallocation. Worst case graph preallocation was disabled by `a27462b` "ollamarunner: Temporarily disable worst case graph preallocation" since it caused crashes with large batches when not using the GPU. This backports upstream llama.cpp commit f057808 "ggml: Don't assert fail when tensor data changes (#13222)", which fixes the underlying bug and allows reverting the previous workaround.	2025-05-02 12:22:47 -07:00
Jeffrey Morgan	8dd12c873d	llama: update to commit e1e8e099 (#10513)	2025-05-01 18:24:09 -07:00
Daniel Hiltgen	718eda1b3e	Narrow set of paths we load GGML from (#10485) Users may have other incompatible GGML installs on their systems. This will prevent us from trying to load them from the path.	2025-04-30 11:25:22 -07:00
Michael Yang	f0c66e6dea	llama4	2025-04-25 16:59:20 -07:00
Jeffrey Morgan	e9e5f61c45	llama: update to commit 2016f07b (#10352)	2025-04-24 17:26:02 -07:00
Michael Yang	40b8fdbdca	arange	2025-04-18 11:45:44 -07:00
Jeffrey Morgan	dc264be6ff	ml: add missing cmake property and remove additional CMakeLists.txt (#10310)	2025-04-16 18:56:29 -07:00
Jeffrey Morgan	943464ccb8	llama: update to commit 71e90e88 (#10192)	2025-04-16 15:14:01 -07:00
Jesse Gross	ccb7eb8135	ggml: Free ggml_backend_buffer_t when releasing buffer When ggml_backend_buffer_free() is called, the device memory is released but not all backends consistently release the actual ggml_backend_buffer_t in system RAM, causing a memory leak. Bug #10040	2025-04-15 15:29:58 -07:00
Jesse Gross	f50d691254	ggml: Fix memory leak on input tensors For every forward pass through the model, we need to allocate input tensors: tokens, images, positions, outputs and masks. These get allocated in system memory. However, when we close the context that the tensors were allocated through, the metadata gets freed but the actual backend memory does not. This results in a significant memory leak. This makes it so that all the memory allocated through a context gets freed when it is closed. Fixes #10040	2025-04-11 11:13:22 -07:00
Jesse Gross	34c3b68fc8	ggml: Don't allocate CPU buffers as CUDA Host buffers Allocating (and in particular, freeing) memory from CUDA host buffers is expensive and can cause a significant performance hit if we do it for every token. Using normal system memory avoids this issue and also gives the OS more flexibility to manage it. There is no performance impact from this patch directly (either positive or negative) but it makes a difference once we start freeing memory correctly.	2025-04-11 11:13:22 -07:00
Jesse Gross	f33ccd5d27	ggml: Use pointer receivers for Context Context is currently mixed between pointer and value receivers. Change this to be all pointer receivers so we don't have to reason about whether the things we are updating in the struct will be retained.	2025-04-11 11:13:22 -07:00
Jesse Gross	bc108b9ad6	ggml: Log filesystem errors Sometimes loading the GGUF file fails with: panic: context canceled This is probably a filesystem error but it doesn't provide any information about what happened.	2025-04-11 11:13:06 -07:00

125 Commits