Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, there are cases where they might not match,
such as if the GPU backend is missing.
We don't get valid UUIDs for AMD GPUs on Windows, so the best option
is to use the ordinal IDs. This brings us in line with what we currently
do on the Ollama server - the only exception is AMD GPUs on Linux, which
fall back to using ordinal IDs. The GGML implementation has no fallback,
but the missing-UUID case doesn't appear to occur for any of the GPUs that we support.
It's also possible that there are collisions between ordinal IDs for
different libraries - however the only places where we use them are
AMD on Windows and Metal on Mac, which can never occur on the same
system.
This is causing segfaults, so disable it. Currently UUIDs are only
used for debugging purposes, although they are planned to be used in
additional ways in the future.
Bug #11211
* Re-remove cuda v11
Revert the revert - drop v11 support requiring drivers newer than Feb 23
This reverts commit c6bcdc4223.
* Simplify layout
With only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling)
* distinct sbsa variant for linux arm64
This avoids accidentally trying to load the sbsa cuda libraries on
a Jetson system, which results in crashes.
* temporarily prevent rocm+cuda mixed loading
We don't check the return status after computing the graph, which
can silently lead to bad outputs if we try to keep going and future
computation succeeds. This appears to happen in certain cases on
Apple M2 devices.
Fixes #11070
This enables matching up devices and information reported by the backend
with system management libraries such as nvml to get accurate free
memory reporting.
FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.
In many cases, the caller can't really handle the error and panics.
Empty and Zeros directly panic if they can't allocate memory.
This makes things consistent by panicking for the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
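As a rough sketch of that convention (not Ollama's actual API; the types and helper below are illustrative stand-ins), the constructors simply panic when the shape and data disagree or when allocation can't proceed:

```go
// Sketch only: illustrates the panic-on-failure convention described above.
// The Tensor type and helpers are hypothetical stand-ins, not Ollama's API.
package ml

import "fmt"

type Tensor struct {
	data  []float32
	shape []int
}

func numElements(shape []int) int {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return n
}

// FromFloatSlice panics if the data doesn't match the shape, since the
// caller can't usefully recover and this is system memory, not VRAM.
func FromFloatSlice(s []float32, shape ...int) Tensor {
	if len(s) != numElements(shape) {
		panic(fmt.Sprintf("data length %d does not match shape %v", len(s), shape))
	}
	return Tensor{data: append([]float32(nil), s...), shape: shape}
}

// Zeros panics on allocation failure as well, matching Empty.
func Zeros(shape ...int) Tensor {
	return Tensor{data: make([]float32, numElements(shape)), shape: shape}
}
```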
This provides granular information about the backend memory allocations
required by the runner:
- Per backend
- Per layer
- Weights, cache and graph
- Allocation status
This can be used for debugging and validating memory estimates.
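A possible shape for such a report, purely illustrative (the type and field names below are hypothetical, not the real API):

```go
// Illustrative only: one way the per-backend memory report could be
// structured. Names are hypothetical, not Ollama's actual types.
package main

import "fmt"

type Allocation struct {
	Size      uint64 // bytes requested
	Allocated bool   // whether the allocation succeeded
}

type LayerMemory struct {
	Weights Allocation
	Cache   Allocation
}

type BackendMemory struct {
	Name   string
	Layers []LayerMemory
	Graph  Allocation
}

func main() {
	m := BackendMemory{
		Name:   "CUDA0",
		Layers: []LayerMemory{{Weights: Allocation{Size: 512 << 20, Allocated: true}}},
		Graph:  Allocation{Size: 128 << 20, Allocated: true},
	}
	fmt.Printf("%+v\n", m)
}
```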
GGML has a function to report the allocated size of a backend buffer.
However, this returns 0 if we tried to allocate a buffer and it failed.
For memory management purposes, it's important to know how much we were
trying to allocate. This extends the API to report attempted sizes for
all buffers and whether it succeeded.
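A minimal sketch of the idea, with hypothetical names: record the requested size before attempting the allocation, so it is still known when the allocation fails.

```go
// Sketch only: allocBuffer and Buffer are stand-ins, not the GGML API.
// The requested size is recorded even when the allocation fails.
package main

import (
	"errors"
	"fmt"
)

type Buffer struct {
	Requested uint64 // size we tried to allocate, even on failure
	Allocated bool
	data      []byte
}

func allocBuffer(size uint64) ([]byte, error) {
	if size > 1<<32 { // pretend anything over 4 GiB fails
		return nil, errors.New("out of memory")
	}
	return make([]byte, size), nil
}

func newBuffer(size uint64) Buffer {
	b := Buffer{Requested: size}
	if data, err := allocBuffer(size); err == nil {
		b.data = data
		b.Allocated = true
	}
	return b
}

func main() {
	b := newBuffer(16 << 20)
	fmt.Println(b.Requested, b.Allocated)
}
```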
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data
This allows more flexibility in managing model loading.
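A sketch of the two-step pattern under hypothetical names (the real constructor and load method differ in detail): creation enumerates tensors and allocates memory, and a separate call reads the tensor data with progress reporting.

```go
// Hypothetical two-step API: names and types are illustrative only.
package main

import (
	"context"
	"fmt"
)

type Backend struct {
	tensors map[string]int // tensor name -> size; placeholder for real metadata
}

// New enumerates tensors and allocates memory, but does not read tensor data.
func New(path string) (*Backend, error) {
	return &Backend{tensors: map[string]int{"token_embd.weight": 1 << 20}}, nil
}

// Load reads tensor data as a separate step, reporting progress as it goes.
func (b *Backend) Load(ctx context.Context, progress func(float32)) error {
	i := 0
	for range b.tensors {
		i++
		progress(float32(i) / float32(len(b.tensors)))
	}
	return nil
}

func main() {
	b, err := New("model.gguf")
	if err != nil {
		panic(err)
	}
	if err := b.Load(context.Background(), func(p float32) { fmt.Printf("%.0f%%\n", p*100) }); err != nil {
		panic(err)
	}
}
```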
* Move quantization logic to GGML via new backend
This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the successful result.
To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and also ensuring that there are
no outstanding I/O operations when we return in this case.
The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
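A sketch of the approach using only the standard library: poll the context between chunked reads instead of running a watcher goroutine that can race with successful completion. The function name and copy loop are illustrative.

```go
// Illustrative: check ctx between chunked reads rather than using a
// watcher goroutine that races with successful completion.
package tensorio

import (
	"context"
	"io"
)

const chunkSize = 128 * 1024 // tensors are read in small chunks

func copyWithCancel(ctx context.Context, dst io.Writer, src io.Reader) error {
	buf := make([]byte, chunkSize)
	for {
		if err := ctx.Err(); err != nil {
			return err // cancelation observed between reads
		}
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```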
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.
This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
When ggml_backend_buffer_free() is called, the device memory
is released but not all backends consistently release the actual
ggml_backend_buffer_t in system RAM, causing a memory leak.
Bug #10040
For every forward pass through the model, we need to allocate input
tensors: tokens, images, positions, outputs and masks. These get
allocated in system memory.
However, when we close the context that the tensors were allocated
through, the metadata gets freed but the actual backend memory does
not. This results in a significant memory leak.
This makes it so that all the memory allocated through a context
gets freed when it is closed.
Fixes #10040
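A sketch of the ownership pattern, with hypothetical names: the context records every buffer it allocates and releases the backend memory in Close.

```go
// Hypothetical sketch of context-owned allocations: every buffer created
// through the context is tracked and released when the context closes.
package mlctx

type buffer struct {
	freed bool
}

func (b *buffer) free() { b.freed = true } // stand-in for releasing backend memory

type Context struct {
	buffers []*buffer
}

// Alloc records the buffer so Close can release it later.
func (c *Context) Alloc(size int) *buffer {
	b := &buffer{}
	c.buffers = append(c.buffers, b)
	return b
}

// Close frees the backend memory for everything allocated via this context,
// not just the Go-side metadata.
func (c *Context) Close() {
	for _, b := range c.buffers {
		b.free()
	}
	c.buffers = nil
}
```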
Allocating (and in particular, freeing) memory from CUDA host buffers
is expensive and can cause a significant performance hit if we do
it for every token. Using normal system memory avoids this issue
and also gives the OS more flexibility to manage it.
There is no performance impact from this patch directly (either
positive or negative) but it makes a difference once we start
freeing memory correctly.
Context is currently mixed between pointer and value receivers. Change
this to be all pointer receivers so we don't have to reason about whether
the things we are updating in the struct will be retained.
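A toy illustration (not the real Context type) of why the receiver kind matters: with a value receiver the update is made to a copy and lost.

```go
// Toy example of value vs pointer receivers; not the real Context type.
package main

import "fmt"

type ctxVal struct{ buffers int }

func (c ctxVal) addBuffer() { c.buffers++ } // modifies a copy; the change is lost

type ctxPtr struct{ buffers int }

func (c *ctxPtr) addBuffer() { c.buffers++ } // modifies the shared struct

func main() {
	v := ctxVal{}
	v.addBuffer()
	p := &ctxPtr{}
	p.addBuffer()
	fmt.Println(v.buffers, p.buffers) // 0 1
}
```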
Sometimes loading the GGUF file fails with:
panic: context canceled
This is probably a filesystem error but it doesn't provide any
information about what happened.
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.
This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.
Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
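Conceptually, the change amounts to something like the following hypothetical sketch: run one maximum-size batch at startup so cache and graph allocation failures surface immediately rather than mid-inference.

```go
// Conceptual sketch of the reserve-at-startup flow; all names are hypothetical.
package runner

import "context"

type model interface {
	// Forward runs one batch; a failed cache or graph allocation surfaces here.
	Forward(ctx context.Context, batchSize, contextLen int) error
}

// reserveWorstCase runs a maximum-size text batch once at startup so that
// the full cache and graph are allocated (or fail) before serving requests.
func reserveWorstCase(ctx context.Context, m model, maxBatch, maxContext int) error {
	return m.Forward(ctx, maxBatch, maxContext)
}
```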
If there is a CUDA OOM, we currently don't check the return value
and will eventually segfault. This checks for the problem and generates
a Go error. At the moment, this will still result in a panic but having
the error is the first step to being able to handle it more gracefully.
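A sketch of the shape of the fix, with hypothetical names standing in for the real GGML compute call: check the returned status and surface it as a Go error.

```go
// Sketch of surfacing a failed graph compute as a Go error instead of
// continuing and eventually segfaulting. computeGraph is a stand-in for
// the real backend call, not the GGML API.
package compute

import "fmt"

type status int

const (
	statusSuccess status = iota
	statusAllocFailed
)

func computeGraph() status { return statusSuccess } // stand-in for the backend call

func Compute() error {
	if st := computeGraph(); st != statusSuccess {
		return fmt.Errorf("graph compute failed with status %d (likely out of memory)", st)
	}
	return nil
}
```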
Improves model loading times on network-based filesystems
such as GCS FUSE by creating a dedicated file descriptor for each
section of the file being read, reducing seeking.
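A sketch of the per-section reader idea using only the standard library: each section gets its own file descriptor, so reads within a section stay sequential instead of seeking across other sections. Names are illustrative.

```go
// Illustrative: give each file section its own descriptor so reads within
// a section stay sequential instead of seeking back and forth.
package sections

import (
	"io"
	"os"
)

// openSection returns an independent reader over [offset, offset+size).
// The caller closes the returned file when done with the section.
func openSection(path string, offset, size int64) (io.Reader, *os.File, error) {
	f, err := os.Open(path) // separate fd per section
	if err != nil {
		return nil, nil, err
	}
	return io.NewSectionReader(f, offset, size), f, nil
}
```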
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
Model implementations should use Input for all of their tensors
supplied to the model. This includes tensors that relate to the
outputs, which is confusing since there is also an Output function.
Since Output is only used internally in GGML and not used by any
model implementations, we can remove it from the interface to
reduce confusion.
* readme: add Ellama to list of community integrations (#9800)
* readme: add screenpipe to community integrations (#9786)
* Add support for ROCm gfx1151 (#9773)
* conditionally enable parallel pipelines
* sample: make mutations in transforms explicit (#9743)
* updated minP to use early exit making use of sorted tokens
* ml/backend/ggml: allocate memory with malloc when loading model (#9822)
* runner: remove cache prompt flag from ollama runner (#9826)
We do not need to bypass the prompt caching in the ollama runner yet, as
only embedding models need to bypass it. When embedding
models are implemented, they can skip initializing this cache completely.
* ollamarunner: Check for minBatch of context space when shifting
Models can specify that a group of inputs needs to be handled as a single
batch. However, context shifting didn't respect this and could trigger
a break anyway. In this case, we should instead trigger a context
shift earlier so that it occurs before the grouped batch.
Note that there are still some corner cases:
- A long prompt that exceeds the context window can get truncated
in the middle of an image. With the current models, this will
result in the model not recognizing the image at all, which is
pretty much the expected result with truncation.
- The context window is set to less than the minimum batch size. The
only solution to this is to refuse to load the model with these
settings. However, this can never occur with current models and
default settings.
Since users are unlikely to run into these scenarios, fixing them is
left as a follow up.
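A sketch of the earlier-shift check, with hypothetical field names: if fewer than minBatch slots remain in the context, shift before starting the grouped batch rather than splitting it.

```go
// Hypothetical sketch: shift the context early so a grouped batch is never
// split. numCtx, numPast and minBatch are illustrative fields, not the
// runner's actual state.
package runner

type sequence struct {
	numCtx   int // context window size
	numPast  int // tokens already in the cache
	minBatch int // inputs that must stay in one batch
}

// needsShift reports whether a context shift must happen before
// processing the next grouped batch.
func needsShift(s sequence) bool {
	return s.numCtx-s.numPast < s.minBatch
}
```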
* Applied latest patches from McBane87
See this for details: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2708820861
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
* Add ability to enable flash attention on vulkan (#4)
* discover: add flash attention handling for vulkan
* envconfig: fix typo in config.go
As part of the process, some code was refactored and I added a new field,
FlashAttention, to GpuInfo, since the previous solution didn't allow for a
granular check via Vulkan extensions. As a side effect, this now allows
for granular per-device FA support checking in other places.
---------
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
Co-authored-by: zeo <108888572+zeozeozeo@users.noreply.github.com>
Co-authored-by: Louis Beaumont <louis.beaumont@gmail.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Nikita <50599445+nasrally@users.noreply.github.com>
This enables the runner to report progress back to the Ollama server,
both for showing status to the user and also to prevent the server
from killing the runner if it thinks things have stalled.
Most of the infrastructure was already there, this extends it to
be available to the backends.
When converting a GGML model, a failure to read tensor data resulted in a nil error value being returned instead of the actual read error.
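A minimal sketch of the bug class being fixed (readTensor is a stand-in, not the actual converter code): propagate the read error instead of returning nil alongside bad data.

```go
// Sketch only: propagate the read error rather than returning nil.
package convert

import "io"

func readTensor(r io.Reader, buf []byte) error {
	if _, err := io.ReadFull(r, buf); err != nil {
		return err // previously a nil error was returned here
	}
	return nil
}
```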
some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
use a strategy similar to llama.cpp's for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
- output backend system info when initializing the backend. this ensures
this information is always present without needing to be called
explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.
Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.
The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.
Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.
To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
- for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.
This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
update Context.Forward to accept multiple tensors to match
Context.Compute signature
update Context.Forward to return Context such that it can be chained
with Context.Compute
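A simplified stand-in for the interface (not the real definitions) showing the chained form this enables:

```go
// Simplified stand-in for the Context interface to show the chained form.
package ml

type Tensor interface{}

type Context interface {
	Forward(...Tensor) Context // variadic, returns Context so it can be chained
	Compute(...Tensor)
}

// Example usage (assuming ctx is a Context and a, b are Tensors):
//
//	ctx.Forward(a, b).Compute(a, b)
```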
During work on our new registry client, I ran into frustrations with CI
where a misspelling in a comment caused the linter to fail, which caused
the tests to not run, which caused the build to not be cached, which
caused the next run to be slow, which caused me to be sad.
This commit addresses these issues, and pulls in some helpful changes
we've had in CI on ollama.com for some time now.
They are:
* Always run tests, even if the other checks fail.
Tests are the most important part of CI, and should always run. Failures
in tests can be correlated with failures in other checks, and can help
surface the root cause of the failure sooner. This is especially
important when the failure is platform specific, and the tests are not
platform independent.
* Check that `go generate` is clean.
This prevents 'go generate' abuse regressions. This codebase used to use
it to generate platform specific binary build artifacts. Let's make sure
that does not happen again and this powerful tool is used correctly, and
the generated code is checked in.
Also, while adding the `go generate` check, it was revealed that the
generated metal code was putting dates in the comments, resulting in
non-deterministic builds. This is a bad practice, and this commit fixes
that. Git tells us the most important date: the commit date along with
other associated changes.
* Check that `go mod tidy` is clean.
A new job to check that `go mod tidy` is clean was added, to prevent
easily preventable merge conflicts or go.mod changes being deferred to a
future PR that is unrelated to the change that caused the go.mod to
change.
* More robust caching.
We now cache the go build cache, and the go mod download cache
independently. This is because the download cache contains zips that can
be unpacked in parallel faster than they can be fetched and extracted by
tar. This speeds up the build significantly.
The linter is hostile enough. It does not need to also punish us with
longer build times due to small failures like misspellings.