Commit Graph

90 Commits

Author SHA1 Message Date
Jeffrey Morgan 913905028b
all: fix cgo compiler warnings on windows (#10563) 2025-05-05 08:02:39 -07:00
Jesse Gross a6ef73f4f2 ggml: Fix race that resulted in "context canceled" when loading
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the sucessful error message.

To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and also ensuring that there are
no outstanding I/O operations when we return in this case.

The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
2025-05-02 13:43:25 -07:00
Jesse Gross c2f5d6662b ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-05-02 12:22:47 -07:00
Jeffrey Morgan 8dd12c873d
llama: update to commit e1e8e099 (#10513) 2025-05-01 18:24:09 -07:00
Daniel Hiltgen 718eda1b3e
Narrow set of paths we load GGML from (#10485)
Users may have other incompatible GGML installs on their systems.
This will prevent us from trying to load them from the path.
2025-04-30 11:25:22 -07:00
Michael Yang f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Jeffrey Morgan e9e5f61c45
llama: update to commit 2016f07b (#10352) 2025-04-24 17:26:02 -07:00
Michael Yang 40b8fdbdca arange 2025-04-18 11:45:44 -07:00
Jeffrey Morgan dc264be6ff
ml: add missing cmake property and remove additional CMakeLists.txt (#10310) 2025-04-16 18:56:29 -07:00
Jeffrey Morgan 943464ccb8
llama: update to commit 71e90e88 (#10192) 2025-04-16 15:14:01 -07:00
Jesse Gross ccb7eb8135 ggml: Free ggml_backend_buffer_t when releasing buffer
When ggml_backend_buffer_free() is called, the device memory
is released but not all backends consistently release the actual
ggml_backend_buffer_t in system RAM, causing a memory leak.

Bug #10040
2025-04-15 15:29:58 -07:00
Jesse Gross f50d691254 ggml: Fix memory leak on input tensors
For every forward pass through the model, we need to allocate input
tensors: tokens, images, positions, outputs and masks. These get
allocated in system memory.

However, when we close the context that the tensors were allocated
through, the metadata gets freed but the actual backend memory does
not. This results in a significant memory leak.

This makes it so that all the memory allocated through a context
gets freed when it is closed.

Fixes #10040
2025-04-11 11:13:22 -07:00
Jesse Gross 34c3b68fc8 ggml: Don't allocate CPU buffers as CUDA Host buffers
Allocating (and in particular, freeing) memory from CUDA host buffers
is expensive and can cause a significant performance hit if we do
it for every token. Using normal system memory avoids this issue
and also gives the OS more flexibility to manage it.

There is no performance impact from this patch directly (either
positive or negative) but it makes a difference once we start
freeing memory correctly.
2025-04-11 11:13:22 -07:00
Jesse Gross f33ccd5d27 ggml: Use pointer receivers for Context
Context is currently mixed between pointer and value receivers. Change
this to be all pointer receivers so don't have to reason about whether
the things we are updating in the struct will be retained.
2025-04-11 11:13:22 -07:00
Jesse Gross bc108b9ad6 ggml: Log filesystem errors
Sometimes loading the GGUF file fails with:
panic: context canceled

This is probably a filesystem error but it doesn't provide any
information about what happened.
2025-04-11 11:13:06 -07:00
Jesse Gross dbb149e6f7 ollamarunner: Preallocate worst case graph at startup
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.

Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
2025-04-08 10:01:28 -07:00
Jesse Gross a807985e59 ggml: Check for OOM and return as Go errors
If there is a CUDA OOM, we currently don't check the return value
and will evetually segfault. This checks for the problem and generates
a Go error. At the moment, this will still result in a panic but having
the error is the first step to being able to handle it more gracefully.
2025-04-08 10:01:28 -07:00
Daniel Hipke 0f3f9e353d
ml/backend/ggml: create a new file descriptor for tensor (#10133)
improves model loading times on network-based filesystems
such as GCS fuse by creating a dedicated file descriptor for each
section of the file being read, reducing seeking
2025-04-04 17:04:24 -07:00
Bruce MacDonald 6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Michael Yang 3b96a93672 fs: move ml.Config to fs package 2025-04-03 13:12:24 -07:00
Jesse Gross 01aa788722 ml: Remove Output from Context interface
Model implementations should use Input for all of their tensors
supplied to the model. This includes tensors that relate to the
outputs, which is confusing since there is also an Output funciton.

Since Output is only used internally in GGML and not used by any
model implementations, we can remove it from the interface to
reduce confusion.
2025-03-27 12:19:43 -07:00
saman-amd ead27aa9fe
Add gfx1200 & gfx1201 support on linux (#9878) 2025-03-27 07:35:19 -07:00
Michael Yang 74bd09652d ml/backend/ggml: load tensors in 32KiB chunks 2025-03-21 14:43:52 -07:00
Bruce MacDonald df94175a0f
ggml: return error on failure to read tensor data (#9872)
When converting a ggml model if there is a failure to read tensor data a nil error value was being returned. It should be assigned to the actual error from reading.
2025-03-18 16:51:33 -07:00
Michael Yang 021dcf089d
Merge pull request #9824 from ollama/mxyng/sched
conditionally enable parallel pipelines
2025-03-17 15:41:37 -07:00
Jeffrey Morgan 364629b8d6
ml/backend/ggml: allocate memory with malloc when loading model (#9822) 2025-03-17 13:32:40 -07:00
Michael Yang 4561fff36e conditionally enable parallel pipelines 2025-03-17 09:46:07 -07:00
shane.xb.qian 30d7a59ba8 ollama-debug.c: change 'ld' to 'PRIi64'
* macOS has different definition per info from @mxyng
2025-03-13 17:10:37 +08:00
shane.xb.qian 85ab552028 ollama-debug.c: correct mistype
Signed-off-by: shane.xb.qian <shane.qian@foxmail.com>
2025-03-12 22:32:30 +08:00
Michael Yang 63a394068c use 2d pooling 2025-03-11 14:49:20 -07:00
Michael Yang c5cbe4fc2a fallback to cpu 2025-03-11 14:49:19 -07:00
Michael Yang 9e4642e9b3 ollama debug tensor 2025-03-11 14:49:19 -07:00
Michael Yang 6b0486c216 duplicate token_embd to output 2025-03-11 14:49:19 -07:00
Michael Yang 8934324b72 use fast attention 2025-03-11 14:49:18 -07:00
Michael Yang 0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Michael Yang 4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Jesse Gross 4100ed7bdd ml: Add support for quantized KV cache
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
Jesse Gross 25f9b152f9 ggml-backend: Ensure allocation meet backend requirements
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
Jesse Gross 98272fbd58 additional review comments 2025-03-07 14:08:21 -08:00
Michael Yang b27e8f3f10 ml/backend/ggml: use backend buffer type
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang 45df786f09 comments 2025-03-07 14:08:21 -08:00
Michael Yang daaf42e4a4 ml/backend/ggml: clean up 2025-03-07 14:08:21 -08:00
Michael Yang 2dc60d4620 ml/backend/ggml: offload vision to cpu
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang b5312f30e8 ml/backend/ggml: handle tensor split 2025-03-07 14:08:21 -08:00
Michael Yang 26c2e0bd35 ml/backend/ggml: handle user specified cpu offloading 2025-03-07 14:08:21 -08:00
Michael Yang bf920883d5 ml/backend/ggml: set cpu n_threads 2025-03-07 14:08:21 -08:00
Michael Yang 7bae7fa5ce ml/backend/ggml: create tensor on specific backend
some tensors should be created on specific backends to reduce number of
copies and improve performance
2025-03-07 14:08:21 -08:00
Michael Yang 764e199d67 kvcache: create cache ctx per layer
each cache layer creates and maintains its own context instead of using
a large context for all layers
2025-03-07 14:08:21 -08:00
Michael Yang bfce55db3d model: load non-repeated tensors into multiple backends
some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
2025-03-07 14:08:21 -08:00