Commit Graph

125 Commits

youzichuan
89f7077c93 chore: fix inconsistent function names in comments
Signed-off-by: youzichuan <youzichuan6@outlook.com>
2025-12-29 06:39:52 -06:00
Jesse Gross
858b91f1e5 ggml: Use ordinal IDs for AMD GPUs on Linux when UUID is unavailable
Some AMD GPUs do not provide UUIDs and report only "XX". In these
cases, we should use the ordinal ID as an alternate identifier, which
matches what we already have to do on Windows for AMD GPUs.

In addition, this prints out the ID for each GPU when enumerating
them for easier debugging in the future.
2025-12-29 06:39:52 -06:00
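
For illustration, a minimal Go sketch of the fallback described in the commit above. The "XX" sentinel comes from the commit message; the function name, formatting, and device fields are hypothetical, not the actual enumeration code.

```go
package main

import "fmt"

// deviceID picks a stable identifier for a GPU, preferring the UUID and
// falling back to the ordinal index when the driver reports no usable UUID.
func deviceID(uuid string, ordinal int) string {
	if uuid == "" || uuid == "XX" {
		return fmt.Sprintf("%d", ordinal)
	}
	return uuid
}

func main() {
	fmt.Println(deviceID("GPU-8a1e0b2c", 0)) // UUID available: use it
	fmt.Println(deviceID("XX", 1))           // no usable UUID: fall back to "1"
}
```
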
Jesse Gross
3d990dc451 ggml: No-alloc mode
Callers can set a backend buffer type to be no-alloc, meaning that
it does not allocate memory for tensors or operations. This can
be used for calculating memory requirements. Tensors and graphs
must be recreated with no-alloc set to false before loading data.

Defaults to false for newly created backend buffer types.
2025-12-29 06:39:51 -06:00
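
A rough sketch of how a no-alloc measurement pass might be used, assuming a hypothetical bufferType rather than the real backend bindings: the first pass only records sizes, and everything is recreated with no-alloc off before data is loaded.

```go
package main

import "fmt"

// bufferType is a hypothetical stand-in: in no-alloc mode it only records how
// much memory tensors would need instead of allocating it.
type bufferType struct {
	noAlloc  bool
	reserved int64 // bytes that were (or would have been) allocated
}

func (bt *bufferType) alloc(size int64) {
	bt.reserved += size
	if bt.noAlloc {
		return // measurement pass: no real allocation happens
	}
	// ... real allocation would happen here ...
}

func main() {
	// Pass 1: create the buffer type with no-alloc set to true and walk the
	// model to find out how much memory it needs.
	measure := &bufferType{noAlloc: true}
	for _, size := range []int64{1 << 20, 4 << 20, 16 << 20} { // illustrative tensor sizes
		measure.alloc(size)
	}
	fmt.Printf("model would need %d bytes\n", measure.reserved)

	// Pass 2: recreate tensors and graphs with no-alloc set to false before
	// loading any data, as the commit message requires.
	live := &bufferType{noAlloc: false}
	live.alloc(measure.reserved)
}
```
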
Jesse Gross
bcd5507f4b ggml: Support closing backends
In order to iteratively find the best memory allocation, we need to
be able to free backend memory so we can try again.
2025-12-29 06:39:51 -06:00
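
The retry loop that closable backends enable might look roughly like this sketch; the backend type, its methods, and the byte-size "layouts" are all hypothetical, standing in for trying progressively smaller allocations.

```go
package main

import (
	"errors"
	"fmt"
)

// backend is a hypothetical stand-in; the point is only the try/close/retry shape.
type backend struct{ budget, need int64 }

func (b *backend) allocate() bool { return b.need <= b.budget }
func (b *backend) Close()         { /* free backend memory so we can try again */ }

// findFit walks candidate layouts (here just byte sizes) from most to least
// demanding, closing each backend that fails so its memory can be reused.
func findFit(budget int64, candidates []int64) (int64, error) {
	for _, need := range candidates {
		b := &backend{budget: budget, need: need}
		if b.allocate() {
			return need, nil
		}
		b.Close()
	}
	return 0, errors.New("no candidate layout fits in available memory")
}

func main() {
	got, err := findFit(8<<30, []int64{12 << 30, 9 << 30, 6 << 30})
	fmt.Println(got, err) // 6442450944 <nil>
}
```
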
Jesse Gross
723dfd2a33 ggml: Use GGML's typedef'ed pointer types
For many backend data structures, GGML defines a typedef of a pointer
type and returns these from functions. In most cases, CGo understands
that these are interchangeable, but some parts of Go (such as generics)
treat them as two different types. We should prefer the form that
GGML uses.
2025-12-29 06:39:51 -06:00
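
A pure-Go analogue of the issue (not the actual cgo-generated types): GGML declares typedefs like `typedef struct ggml_backend_buffer * ggml_backend_buffer_t;`, which cgo surfaces as a defined Go type whose underlying type is the pointer. Plain assignment between the two works, but generic instantiation treats them as distinct types.

```go
package main

type buffer struct{ freed bool }

// bufferT mimics how cgo surfaces a C typedef of a pointer:
// a defined Go type whose underlying type is *buffer.
type bufferT *buffer

// closeAll is generic over a single element type T.
func closeAll[T any](items []T, fn func(T)) {
	for _, it := range items {
		fn(it)
	}
}

func freeBuffer(b *buffer) { b.freed = true }

func main() {
	bufs := []bufferT{&buffer{}}

	// closeAll(bufs, freeBuffer) // does not compile: T is inferred as bufferT,
	//                            // and func(*buffer) is not assignable to func(bufferT)

	// Plain assignment between *buffer and bufferT is fine, so wrapping works,
	// but it is simpler to use the typedef'ed form consistently.
	closeAll(bufs, func(b bufferT) { freeBuffer(b) })
}
```
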
Daniel Hiltgen
cb241cab63 clean up debugging (#11756) 2025-12-29 06:39:49 -06:00
Michael Yang
ed2e8a9022 gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking whether the function is null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-12-29 06:39:48 -06:00
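
For orientation on the MXFP4 format mentioned above: it packs 32 elements into a block with one shared power-of-two scale (E8M0) and 32 four-bit E2M1 values. The sketch below decodes one such block; the value table is the standard OCP MX FP4 set, but the byte layout (nibble order, placement of the scale byte) is an assumption here and may not match GGML's storage format.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 holds the eight non-negative FP4 (E2M1) values; bit 3 of a code is the sign.
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

func decodeE2M1(code byte) float32 {
	v := e2m1[code&0x7]
	if code&0x8 != 0 {
		return -v
	}
	return v
}

// decodeMXFP4Block expands one 32-element block: a shared E8M0 scale byte and
// 16 bytes holding two 4-bit codes each. Nibble order is an assumption.
func decodeMXFP4Block(scale byte, packed [16]byte) [32]float32 {
	s := float32(math.Exp2(float64(int(scale)) - 127)) // E8M0: 2^(scale-127)
	var out [32]float32
	for i, b := range packed {
		out[2*i] = decodeE2M1(b&0x0F) * s
		out[2*i+1] = decodeE2M1(b>>4) * s
	}
	return out
}

func main() {
	var packed [16]byte
	packed[0] = 0x29 // low nibble 0x9 => -0.5, high nibble 0x2 => 1.0
	out := decodeMXFP4Block(127, packed)
	fmt.Println(out[0], out[1]) // scale 2^(127-127)=1, so: -0.5 1
}
```
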
Daniel Hiltgen
5038e33776 mac: disable bf16 on unsupported OS versions (#11585)
Support for bf16 was added in macOS v14+, and attempting to enable it
on older versions causes runtime failures.
2025-12-29 06:39:47 -06:00
Oliver Simons
1ee3fe46f3 Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
2025-12-29 06:39:47 -06:00
Michael Yang
7221b90fe1 compile bf16 support into ggml-metal (#11430) 2025-12-29 06:39:43 -06:00
Jesse Gross
b47aa7e75a ggml: Use assigned layers when reporting loading stats
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, there are cases where they can differ,
such as when the GPU backend is missing.
2025-12-29 06:39:42 -06:00
Jesse Gross
015e39a8be ggml: Disable unused pipeline parallelism
We're not currently using it, even in cases where we could. Disabling
it improves generation performance by 10-30% with multiple GPUs.
2025-12-29 06:39:42 -06:00
Jesse Gross
387cb031b3 ggml: Report ordinal IDs for AMD GPUs on Windows
We don't get valid UUIDs for AMD GPUs on Windows, so the best option
is to use the ordinal IDs. This brings us in line with what we currently
do on the Ollama server - the only exception there is AMD GPUs on Linux,
which fall back to ordinal IDs. The GGML implementation has no such fallback,
but that case doesn't appear to occur for any of the GPUs that we support.

It's also possible that there are collisions between ordinal IDs for
different libraries - however the only places where we use them are
AMD on Windows and Metal on Mac, which can never occur on the same
system.
2025-12-29 06:39:42 -06:00
Jesse Gross
5f139b96ab Revert "ggml: Temporarily disable reporting UUIDs"
The root cause was an unclean upgrade - this code is fine.

This reverts commit 45f216a9c7.
2025-12-29 06:39:41 -06:00
Daniel Hiltgen
e897624123 mimic logs for layers on new engine (#11278)
This adds some extra logs to make the new engine a bit more consistent
with the llama engine.
2025-12-29 06:39:39 -06:00
Jesse Gross
872d190c8f ggml: Temporarily disable reporting UUIDs
This is causing segfaults, so disable it. Currently UUIDs are only
used for debugging purposes, although they are planned to be used in
additional ways in the future.

Bug #11211
2025-12-29 06:39:39 -06:00
Michael Yang
801564fa8b add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-12-29 06:39:38 -06:00
Daniel Hiltgen
29ec3ddf9a Re-remove cuda v11 (#10694)
* Re-remove cuda v11

Revert the revert - drop v11 support requiring drivers newer than Feb 23

This reverts commit c6bcdc4223.

* Simplify layout

With only one version of the GPU libraries, we can simplify things down somewhat.  (Jetsons still require special handling)

* distinct sbsa variant for linux arm64

This avoids accidentally trying to load the sbsa cuda libraries on
a jetson system which results in crashes.

* temporary prevent rocm+cuda mixed loading
2025-12-29 06:38:18 -06:00
Jesse Gross
290d4c2c6c ggml: Check return status for computation.
We don't check the return status after computing the graph, which
can silently lead to bad outputs if we try to keep going and future
computation succeeds. This appears to happen in certain cases on
Apple M2 devices.

Fixes #11070
2025-12-29 06:38:17 -06:00
Jeffrey Morgan
5e3fb4744b Revert "Revert "ggml: Export GPU UUIDs" (#11115)" (#11117)
Reverts PR #11115. The original change was mistakenly reverted instead of #10822
2025-12-29 06:38:16 -06:00
Jeffrey Morgan
c5237d9462 Revert "ggml: Export GPU UUIDs" (#11115)
This reverts commit aaa7818000.
2025-12-29 06:38:16 -06:00
Jesse Gross
0b9c6cb497 ggml: Export GPU UUIDs
This enables matching up devices and information reported by the backend
with system management libraries such as nvml to get accurate free
memory reporting.
2025-12-29 06:38:10 -06:00
Jesse Gross
d1ed4b17ef ml: Panic rather than return error on tensor allocation failure
FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.

In many cases, the caller can't really handle the error and panics.

Empty and Zeros directly panic if they can't allocate memory.

This makes things consistent by panicking in the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
2025-12-29 06:38:06 -06:00
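
A sketch of the convention described above, using a hypothetical helper rather than the real FromFloatSlice signature: on a shape mismatch it panics instead of returning an error, matching how Empty and Zeros already behave.

```go
package main

import "fmt"

// fromFloatSlice is a hypothetical stand-in: it validates that the shape matches
// the data and panics on mismatch, since callers generally cannot recover anyway.
func fromFloatSlice(data []float32, shape ...int) []float32 {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Sprintf("shape %v (%d elements) does not match data length %d", shape, n, len(data)))
	}
	// ... copy data into a newly allocated input tensor ...
	return data
}

func main() {
	_ = fromFloatSlice(make([]float32, 6), 2, 3) // ok
	// fromFloatSlice(make([]float32, 5), 2, 3)  // would panic: 6 != 5
}
```
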
Jesse Gross
6e68feda00 ollamarunner: Memory usage reporting
This provides granular information about the backend memory allocations
required by the runner:
 - Per backend
 - Per layer
 - Weights, cache and graph
 - Allocation status

This can be used for debugging and validating memory estimates.
2025-12-29 06:38:06 -06:00
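
The granularity listed above might be modeled along these lines; the struct and field names are hypothetical, not the runner's actual types.

```go
package main

import "fmt"

// layerMemory captures the per-layer breakdown described in the commit message.
type layerMemory struct {
	Weights, Cache, Graph uint64 // bytes
	Allocated             bool   // whether the allocation actually succeeded
}

// backendMemory groups layers by the backend they were assigned to.
type backendMemory struct {
	Backend string        // e.g. "CUDA0" or "CPU"
	Layers  []layerMemory // one entry per model layer assigned to this backend
}

func main() {
	r := backendMemory{
		Backend: "CUDA0",
		Layers:  []layerMemory{{Weights: 512 << 20, Cache: 64 << 20, Graph: 32 << 20, Allocated: true}},
	}
	for i, l := range r.Layers {
		fmt.Printf("%s layer %d: weights=%d cache=%d graph=%d ok=%v\n",
			r.Backend, i, l.Weights, l.Cache, l.Graph, l.Allocated)
	}
}
```
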
Jesse Gross
b3de134eda ggml: Report graph memory for failed allocations
GGML has a function to report the allocated size of a backend buffer.
However, this returns 0 if we tried to allocate a buffer and it failed.
For memory management purposes, it's important to know how much we were
trying to allocate. This extends the API to report attempted sizes for
all buffers and whether the allocation succeeded.
2025-12-29 06:38:06 -06:00
Michael Yang
9215b190fa feat: qwen3 dense and sparse models (#10708)
* feat: qwen3 dense
* feat: qwen3moe
* fix llama4 moe
2025-12-29 06:38:04 -06:00
Michael Yang
02fd383448 chore: disable debug in binary libraries (#10788) 2025-12-29 06:38:04 -06:00
Michael Yang
20dcadf7e8 ml: add more rope options (#10775) 2025-12-29 06:38:03 -06:00
Jesse Gross
7b9ab4cb32 ggml: Separate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-12-29 06:38:02 -06:00
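
A sketch of the two-step split with hypothetical names: the constructor enumerates tensors and allocates memory, and a separate call streams in the tensor data.

```go
package main

import "fmt"

// backend is a hypothetical stand-in for the split described above.
type backend struct {
	tensors []string // enumerated at creation time
	loaded  bool
}

// newBackend does the fast part: enumerate tensors and allocate memory,
// but do not read any tensor data yet.
func newBackend(names []string) *backend {
	return &backend{tensors: names}
}

// Load does the slow part: read tensor data into the already-allocated buffers.
// Keeping it separate lets callers decide when to pay this cost.
func (b *backend) Load() error {
	for range b.tensors {
		// ... read one tensor's data from disk ...
	}
	b.loaded = true
	return nil
}

func main() {
	b := newBackend([]string{"tok_embd.weight", "output.weight"})
	fmt.Println("created; loaded =", b.loaded)
	_ = b.Load()
	fmt.Println("after Load; loaded =", b.loaded)
}
```
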
Michael Yang
4e77815773 fix pixel values padding (#10718)
* panic if trying to pad 4d

* fix pixel values padding
2025-12-29 06:38:00 -06:00
Bruce MacDonald
558b0f5fe9 model: add Qwen2.5-VL support (#10385) 2025-12-29 06:37:59 -06:00
Michael Yang
4d12503049 chore: update mllama to use ollama engine (#10637) 2025-12-29 06:37:59 -06:00
Jeffrey Morgan
9163ed39d1 llama: update to commit de4c07f93 (#10655) 2025-12-29 06:37:57 -06:00
Michael Yang
7085a3f89b feat: add trace log level (#10650)
reduce prompt log to trace level
2025-12-29 06:37:57 -06:00
Daniel Hiltgen
39ca55a1ba Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-12-29 06:37:52 -06:00
Jeffrey Morgan
9a44e41802 all: fix cgo compiler warnings on windows (#10563) 2025-12-29 06:37:51 -06:00
Jesse Gross
86eea6770e ggml: Fix race that resulted in "context canceled" when loading
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the successful result.

To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and ensuring that there are
no outstanding I/O operations when we return in this case.

The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
2025-12-29 06:37:50 -06:00
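
A sketch of the replacement pattern, using the stdlib errgroup and an explicit ctx.Err() check inside the read loop; the chunk size follows the commit message, while the reader plumbing is illustrative.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"

	"golang.org/x/sync/errgroup"
)

// loadAll reads every tensor in small chunks, checking for cancelation before
// each read rather than in a separate watcher goroutine. A successful Wait()
// cancels ctx, but by then no goroutine is still looking at it.
func loadAll(readers []io.Reader) error {
	g, ctx := errgroup.WithContext(context.Background())
	for _, r := range readers {
		r := r
		g.Go(func() error {
			buf := make([]byte, 128<<10) // 128K chunks, as in the commit message
			for {
				if err := ctx.Err(); err != nil {
					return err // another goroutine failed; stop promptly
				}
				n, err := r.Read(buf)
				if n > 0 {
					// ... copy the chunk into the backend buffer ...
				}
				if err == io.EOF {
					return nil
				}
				if err != nil {
					return err
				}
			}
		})
	}
	return g.Wait()
}

func main() {
	fmt.Println(loadAll([]io.Reader{bytes.NewReader(make([]byte, 1<<20))})) // <nil>
}
```
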
Jesse Gross
cec8a9dee0 ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-12-29 06:37:50 -06:00
Jeffrey Morgan
723fec1b25 llama: update to commit e1e8e099 (#10513) 2025-12-29 06:37:49 -06:00
Daniel Hiltgen
098fe2f7f7 Narrow set of paths we load GGML from (#10485)
Users may have other incompatible GGML installs on their systems.
This will prevent us from trying to load them from the path.
2025-12-29 06:37:47 -06:00
Michael Yang
0f5c45e19d llama4 2025-12-29 06:37:44 -06:00
Jeffrey Morgan
85d3f71c02 llama: update to commit 2016f07b (#10352) 2025-12-29 06:37:42 -06:00
Michael Yang
c916dd67bf arange 2025-12-29 06:37:40 -06:00
Jeffrey Morgan
8c08f74532 ml: add missing cmake property and remove additional CMakeLists.txt (#10310) 2025-12-29 06:37:39 -06:00
Jeffrey Morgan
3824c0803b llama: update to commit 71e90e88 (#10192) 2025-12-29 06:37:39 -06:00
Jesse Gross
abb8f89af9 ggml: Free ggml_backend_buffer_t when releasing buffer
When ggml_backend_buffer_free() is called, the device memory
is released but not all backends consistently release the actual
ggml_backend_buffer_t in system RAM, causing a memory leak.

Bug #10040
2025-12-29 06:37:37 -06:00
Jesse Gross
f50d691254 ggml: Fix memory leak on input tensors
For every forward pass through the model, we need to allocate input
tensors: tokens, images, positions, outputs and masks. These get
allocated in system memory.

However, when we close the context that the tensors were allocated
through, the metadata gets freed but the actual backend memory does
not. This results in a significant memory leak.

This makes it so that all the memory allocated through a context
gets freed when it is closed.

Fixes #10040
2025-04-11 11:13:22 -07:00
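
The fix amounts to having the context track every buffer it hands out and release them all in Close; a minimal sketch of that idea, with hypothetical types rather than the actual ml.Context API.

```go
package main

import "fmt"

type buffer struct{ size int }

// inputContext is a hypothetical stand-in: every allocation made through the
// context is recorded so Close can free the backend memory, not just the metadata.
type inputContext struct {
	buffers []*buffer
}

func (c *inputContext) newTensor(size int) *buffer {
	b := &buffer{size: size}
	c.buffers = append(c.buffers, b) // remember it so Close can release it
	return b
}

func (c *inputContext) Close() {
	for _, b := range c.buffers {
		_ = b // ... free the backend buffer here ...
	}
	c.buffers = nil
}

func main() {
	ctx := &inputContext{}
	ctx.newTensor(4096)    // tokens, positions, masks, etc. for one forward pass
	ctx.newTensor(1 << 20)
	fmt.Println("tracked buffers:", len(ctx.buffers))
	ctx.Close()
	fmt.Println("after close:", len(ctx.buffers))
}
```
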
Jesse Gross
34c3b68fc8 ggml: Don't allocate CPU buffers as CUDA Host buffers
Allocating (and in particular, freeing) memory from CUDA host buffers
is expensive and can cause a significant performance hit if we do
it for every token. Using normal system memory avoids this issue
and also gives the OS more flexibility to manage it.

There is no performance impact from this patch directly (either
positive or negative) but it makes a difference once we start
freeing memory correctly.
2025-04-11 11:13:22 -07:00
Jesse Gross
f33ccd5d27 ggml: Use pointer receivers for Context
Context is currently mixed between pointer and value receivers. Change
this to be all pointer receivers so we don't have to reason about whether
the things we are updating in the struct will be retained.
2025-04-11 11:13:22 -07:00
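
The receiver issue the commit addresses, in miniature; the struct here is illustrative, not the real Context.

```go
package main

import "fmt"

type ctx struct{ nodes int }

// Value receiver: the update is made to a copy and silently lost.
func (c ctx) addNodeByValue() { c.nodes++ }

// Pointer receiver: the update is retained in the caller's struct.
func (c *ctx) addNode() { c.nodes++ }

func main() {
	c := &ctx{}
	c.addNodeByValue()
	c.addNode()
	fmt.Println(c.nodes) // 1: only the pointer-receiver update stuck
}
```
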
Jesse Gross
bc108b9ad6 ggml: Log filesystem errors
Sometimes loading the GGUF file fails with:
panic: context canceled

This is probably a filesystem error but it doesn't provide any
information about what happened.
2025-04-11 11:13:06 -07:00