Commit Graph

46 Commits

Author SHA1 Message Date
Michael Yang
b643362f9f gptoss: fix memory calc (#11700) 2025-12-29 06:39:49 -06:00
Jesse Gross
ae8a041461 ggml: Prevent KV cache quantization on gpt-oss
KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.

The model definition does not call flash attention, so it works
regardless of the setting, but the cache will pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
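
An illustrative sketch of the idea (made-up names, not the actual Ollama symbols): resolve the flash attention flag first, then derive the KV cache type from it, so a quantized cache is never paired with a model that cannot run the flash attention kernel.

```go
// resolveCacheSettings is a sketch only: decide flash attention before the
// cache type so downstream settings stay consistent.
func resolveCacheSettings(modelSupportsFlashAttn, userWantsFlashAttn bool, requestedKVType string) (flashAttn bool, kvType string) {
	flashAttn = userWantsFlashAttn && modelSupportsFlashAttn
	if !flashAttn {
		// Quantized KV caches (e.g. q8_0) depend on the flash attention
		// kernel; fall back to f16 before anything downstream is derived.
		return false, "f16"
	}
	return true, requestedKVType
}
```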
2025-12-29 06:39:48 -06:00
Michael Yang
ed2e8a9022 gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
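
For reference, a rough illustration of the format (a hand-written sketch, not the ggml kernel code): each block of 32 weights stores one shared power-of-two scale (E8M0) plus 32 packed 4-bit (E2M1) elements, i.e. 17 bytes per block, and dequantization is a table lookup times the block scale.

```go
package main

import (
	"fmt"
	"math"
)

// fp4Values are the magnitudes an FP4 (E2M1) element can represent.
var fp4Values = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeMXFP4Block dequantizes one 32-element MXFP4 block: a shared E8M0
// scale (a power of two) plus 16 bytes of packed 4-bit elements. The exact
// byte ordering ggml uses may differ; this only illustrates the format.
func decodeMXFP4Block(scale byte, qs [16]byte) [32]float32 {
	s := float32(math.Ldexp(1, int(scale)-127))
	var out [32]float32
	for i, b := range qs {
		out[2*i] = fp4(b&0x0F) * s
		out[2*i+1] = fp4(b>>4) * s
	}
	return out
}

// fp4 decodes a single E2M1 nibble: bit 3 is the sign, the low 3 bits index
// into the magnitude table.
func fp4(nibble byte) float32 {
	v := fp4Values[nibble&0x7]
	if nibble&0x8 != 0 {
		return -v
	}
	return v
}

func main() {
	var qs [16]byte
	qs[0] = 0x9<<4 | 0x2             // nibbles for 1.0 and -0.5
	out := decodeMXFP4Block(128, qs) // block scale 2^(128-127) = 2
	fmt.Println(out[0], out[1])      // 2 -1
}
```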

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system).

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on macOS v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors.  Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms,
but lower values will be silently reset.
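
In other words (an illustrative snippet, not the actual server code), the requested context is floored at 8192:

```go
// clampContext sketches the behavior: requests below the model's minimum
// usable context are silently raised to that minimum.
func clampContext(requested int) int {
	const minContext = 8192 // gpt-oss needs at least this much to work well
	if requested < minContext {
		return minContext
	}
	return requested
}
```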

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel, so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
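
Roughly (hypothetical names, simplified from the real estimate): the requested context already includes the parallel-sequence factor, while a fixed sliding window does not, so it has to be scaled explicitly.

```go
// kvPositions sketches the per-model cache length used in the estimate.
func kvPositions(numCtx, slidingWindow, numParallel int) int {
	if slidingWindow > 0 {
		// numCtx was already multiplied by numParallel upstream; the fixed
		// window size was not, so scale it here.
		return slidingWindow * numParallel
	}
	return numCtx
}
```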

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-12-29 06:39:48 -06:00
Jeffrey Morgan
10119ec2ee fs/ggml: add multiplier in graph estimates (#11208) 2025-12-29 06:39:38 -06:00
Jeffrey Morgan
84998ae4ba fs/ggml: add missing architecture to OllamaEngineRequired() (#11206) 2025-12-29 06:39:38 -06:00
Michael Yang
801564fa8b add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-12-29 06:39:38 -06:00
Devon Rifkin
558c1920fa ggml: fix crash for array head counts
If it's an array, it uses the max value in the array

If array values for head counts become more popular, we can consider a
more invasive change like #10225 to calculate more accurate estimates.

Fixes: #9984
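
A sketch of the fallback (simplified types; the real GGUF KV accessors differ): when the head-count metadata is a per-layer array, the estimate uses its maximum entry so no layer is undersized.

```go
// maxHeadCount handles a head-count value that is either a scalar or an
// array with one entry per layer.
func maxHeadCount(v any) uint64 {
	switch v := v.(type) {
	case uint64:
		return v
	case []uint64:
		var best uint64
		for _, n := range v {
			if n > best {
				best = n
			}
		}
		return best
	}
	return 0
}
```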
2025-12-29 06:39:34 -06:00
Michael Yang
4585d231ee Reapply "feat: incremental gguf parser (#10822)" (#11114) (#11119)
* Reapply "feat: incremental gguf parser (#10822)" (#11114)

This reverts commit a6e64fbdf2.

* fix older ggufs
2025-12-29 06:38:17 -06:00
Jeffrey Morgan
4f1588bc37 Revert "feat: incremental gguf parser (#10822)" (#11114)
This reverts commit 6b04cad7e8.
2025-12-29 06:38:16 -06:00
Michael Yang
142efb91b1 gguf: fix write order (#11068)
* ggml: test write gguf order
* ggml: fix write tensor order
2025-12-29 06:38:15 -06:00
Michael Yang
2c6f1dc9c8 feat: incremental gguf parser (#10822)
* incremental gguf parser
* gguf: update test to not rely on gguf on disk
* re-use existing create gguf
* read capabilities from gguf kv
* kv exists
* update tests
* s/doneFunc/successFunc/g
* new buffered reader

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-12-29 06:38:14 -06:00
Jesse Gross
7b9ab4cb32 ggml: Separate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them into two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Load tensor data

This allows more flexibility in managing model loading.
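
Conceptually (illustrative names, not the actual ml.Backend API), loading becomes two calls instead of one:

```go
// Backend stands in for the real backend type: construction enumerates
// tensors and allocates buffers; Load streams the tensor data in afterwards.
type Backend interface {
	Load(progress func(float32)) error
}

func loadModel(newBackend func(path string) (Backend, error), path string) (Backend, error) {
	b, err := newBackend(path) // step 1: fast — create backend, allocate memory
	if err != nil {
		return nil, err
	}
	if err := b.Load(func(progress float32) {}); err != nil { // step 2: slow — copy tensor data
		return nil, err
	}
	return b, nil
}
```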
2025-12-29 06:38:02 -06:00
Bruce MacDonald
c38d583c99 ggml: update qwen25vl vision size estimate (#10711) 2025-12-29 06:38:00 -06:00
Bruce MacDonald
558b0f5fe9 model: add Qwen2.5-VL support (#10385) 2025-12-29 06:37:59 -06:00
Michael Yang
4d12503049 chore: update mllama to use ollama engine (#10637) 2025-12-29 06:37:59 -06:00
Daniel Hiltgen
0132148534 Follow up to #10363 (#10647)
The quantization PR didn't block all unsupported file types; this PR
fixes that. It also updates the API docs to reflect the now reduced
set of supported types.
2025-12-29 06:37:57 -06:00
Daniel Hiltgen
671bf5dcf8 fix data race in WriteGGUF (#10598)
err in the goroutine should not be shared with the outer scope
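
The pattern in miniature (illustrative, not the actual WriteGGUF code): the goroutine must use its own err instead of assigning to one the outer scope also reads.

```go
// writeAsync runs a fallible write in a goroutine and hands its error back
// over a channel, so no err variable is shared across goroutines.
func writeAsync(write func() error) error {
	done := make(chan error, 1)
	go func() {
		err := write() // local to the goroutine; not the caller's err
		done <- err
	}()
	// ... the caller can do other work here ...
	return <-done
}
```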
2025-12-29 06:37:53 -06:00
Daniel Hiltgen
39ca55a1ba Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-12-29 06:37:52 -06:00
Jesse Gross
48b6465aff ggml: Reduce log level of "key not found"
Most of the time this is not an error.
2025-12-29 06:37:52 -06:00
Michael Yang
79646ad87d fix: write gguf padding (#10510)
* add gguf_test

* fix padding

padding was being added to the offset but not to the running count
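
A sketch of the corrected bookkeeping (GGUF aligns tensor data, 32 bytes by default): the pad bytes have to advance the running total as well as the next tensor's offset.

```go
// layoutTensors computes aligned tensor offsets; the running total includes
// both tensor bytes and the padding written between tensors.
func layoutTensors(sizes []uint64) []uint64 {
	const alignment = 32
	offsets := make([]uint64, 0, len(sizes))
	var total uint64
	for _, size := range sizes {
		total += (alignment - total%alignment) % alignment // count the pad bytes too
		offsets = append(offsets, total)
		total += size
	}
	return offsets
}
```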
2025-12-29 06:37:48 -06:00
Michael Yang
049aa30191 memory 2025-12-29 06:37:45 -06:00
Michael Yang
0f5c45e19d llama4 2025-12-29 06:37:44 -06:00
Michael Yang
8a86190fd4 fix parameter count 2025-12-29 06:37:44 -06:00
Michael Yang
49f807737a default slice values 2025-12-29 06:37:44 -06:00
Michael Yang
51e64c8f69 update comment 2025-12-29 06:37:43 -06:00
Michael Yang
84a6567dee fix token type 2025-12-29 06:37:43 -06:00
Michael Yang
5a8e641272 zero means zero
using a default of 1024 when zero is requested is confusing, since most
callers seem to assume 0 means do not read any data
2025-12-29 06:37:43 -06:00
Michael Yang
96618f6344 generic ggml.array 2025-12-29 06:37:42 -06:00
Michael Yang
584c3176d2 convert: change to colmajor 2025-12-29 06:37:42 -06:00
Michael Yang
c6e2cf38b8 fix write gguf padding 2025-12-29 06:37:37 -06:00
Bruce MacDonald
6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Michael Yang
3b96a93672 fs: move ml.Config to fs package 2025-04-03 13:12:24 -07:00
Jesse Gross
f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self-attention and cross-attention
layers; at the moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this into account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
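
Simplified, the per-layer estimate looks something like this (hypothetical helper; the real code also factors in cache type, batch size, and GQA head counts):

```go
// kvBytesForLayer sizes one layer's K and V caches by the number of positions
// that layer actually keeps, so sliding-window layers come out much smaller
// than full-attention layers instead of being averaged together.
func kvBytesForLayer(slidingWindow, numCtx, headsKV, headDim, bytesPerElem uint64) uint64 {
	positions := numCtx
	if slidingWindow > 0 && slidingWindow < numCtx {
		positions = slidingWindow
	}
	return 2 * positions * headsKV * headDim * bytesPerElem // K + V
}
```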
2025-03-26 13:16:03 -07:00
Michael Yang
4ea4d2b189 Merge pull request #9703 from ollama/mxyng/gemma3-memory
count gemma3 vision tensors
2025-03-13 16:56:34 -07:00
Michael Yang
8d76fa23ef count non-repeating vision layers 2025-03-13 16:53:29 -07:00
Michael Yang
65b88c544f fix divide by zero 2025-03-13 16:35:00 -07:00
Michael Yang
a422ba39c9 roughly count gemma3 graph
the largest operation by far is (q @ k), so just count that for
simplicity
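
As a back-of-the-envelope version of that estimate (not the exact formula in the code): the attention scores from q @ k form a heads x batch-tokens x context tensor of f32, which dominates the graph.

```go
// attnScoreBytes approximates the size of the q @ k result, the largest
// intermediate tensor in the graph.
func attnScoreBytes(numHeads, batchTokens, numCtx uint64) uint64 {
	const f32Size = 4
	return numHeads * batchTokens * numCtx * f32Size
}
```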
2025-03-13 16:35:00 -07:00
Michael Yang
d2ec22371e count all vision tensors 2025-03-13 16:35:00 -07:00
Michael Yang
033cec232a count gemma3 vision tensors 2025-03-13 16:34:42 -07:00
Patrick Devine
4bed739259 add verbose mode to the show command (#9640)
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
2025-03-13 14:24:27 -07:00
Daniel Hiltgen
ab39e08eb9 llm: auto detect models that require Ollama Engine (#1) 2025-03-11 14:49:20 -07:00
Patrick Devine
5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Daniel Hiltgen
1fdb351c37 New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older C++ server tokenization flow, which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of using the cgo
schema-to-grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00
Michael Yang
53d2990d9b model: add bos token if configured 2025-02-27 21:04:59 +00:00
Michael Yang
b16367b4b2 fix: add back bf16 support
this was accidentally removed when moving fs/ggml from its previous
location
2025-02-25 19:26:14 +00:00
Michael Yang
58245413f4 next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models, built around three high-level interfaces and a number of smaller helper interfaces; a condensed sketch of how they fit together follows the list below.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
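
A heavily condensed sketch of how the three pieces relate (illustrative signatures only; the real interfaces in `model/model.go` and `ml/backend.go` have more methods):

```go
// Tensor exposes tensor operations; Backend loads weights and hands them to
// the model; a Model builds its forward pass out of those operations.
type Tensor interface {
	Mulmat(other Tensor) Tensor // one of many ops
}

type Backend interface {
	Get(name string) Tensor // look up a loaded weight by name
}

type Model interface {
	Forward(b Backend, tokens []int32) (Tensor, error)
}
```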

This is the first implementation of the new engine. Follow-up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00