Commit Graph

160 Commits

Author SHA1 Message Date
Devon Rifkin e7f56ef3d8 harmony: remove special casing in routes.go
Now that we have a built-in parser abstraction, which was introduced in
<https://github.com/ollama/ollama/pull/12248>, we can modify our harmony
parser to match this and then get rid of nearly all of the
harmony-specific logic in routes.go. We do have a small amount of
code that turns the parser on by default if the architecture matches and
no other built-in parser was provided.

The built-in parser interface was modified in order to handle harmony's
prefill and tool name translation requirements.
2025-09-18 14:55:59 -07:00
Michael Yang 564b558c92
fix(llama): other llama flavours (#12308)
* fix(llama): rope scale

* spm llama

* skip moe models

* cleanup
2025-09-17 12:12:21 -07:00
Devon Rifkin fa1c987a29
Merge pull request #12248 from ollama/drifkin/qwen3-coder-parsing
add qwen3-coder tool support
2025-09-16 10:21:43 -07:00
Michael Yang ad95d5b30b
use split activations when possible (#12293)
* use ggml_*_split activations when possible

* forward qkv
2025-09-16 09:51:19 -07:00
Michael Yang c253433d68
embed: cleanup (#12299)
* cleanup

* use pooling.TypeNone

* pooling test
2025-09-16 09:48:42 -07:00
Michael Yang 3f6642f6fc
model: implement bert in ollama engine (#9080)
* fix truncate

* s/SentencePieceModel/SentencePiece/

* bert

* wordpiece

* refactor pooling

* more tokenizers

* normalize embeddings
2025-09-15 15:35:59 -07:00
Michael Yang 6f7117145f
batch: use tensors for outputs (#12185)
this cleans up the model interface slightly without too much impact in
other areas
2025-09-15 14:33:06 -07:00
Devon Rifkin 472feec2ff address comments 2025-09-15 11:46:25 -07:00
Devon Rifkin 47991940d4 add qwen3-coder tool support
The format qwen3-coder uses is relatively unique, both in rendering and
in parsing. To implement parsing, I wrote a custom parser in similar
style to harmony. For the rendering, I found that the logic would be
much more difficult to follow in a template, so I introduced the concept
of a built-in renderer that uses go code, rather than a template to
generate prompts.

I set us up for future built-in parsers and renderers by making it so
they can be specified in a Modelfile like so:

```
RENDERER "qwen3-coder"
PARSER "qwen3-coder"
```

These need to be provided explicitly because the architecture alone is
not enough to understand what format the model expects to receive, and
what format we expect it to output (e.g., qwen3-coder is `qwen3moe`,
which includes other qwen3-family models as well)

I haven't converted harmony to be one of these "built-ins" yet, since
some of it is in flux with the changes @ParthSareen has been making to
move harmony to the runner. It is likely that many other built-ins will
need to move to the runner as well, but I'm able to slightly defer that
decision since qwen3-coder doesn't have thinking (and therefore doesn't
need to be in the runner to make structured outputs work). I expect to
unify harmony with this approach very soon.

Whether a particular model supports tools or thinking was previously
inferred from templates, but without a template we now also use the
parser itself to declare what it supports. If we have future models that
re-use the same parsing format, but have different capabilities, we'll
want to parameterize them and give them different names to be specified
as a `PARSER`.

Misc changes:

- I worked on the renderer by diffing outputs from the reference
  implementation and ours. To make it easier to do this, I extended
  <https://github.com/ollama/ollama/pull/11875> to also support
  returning the prompt via the openai compat layer
2025-09-15 11:33:47 -07:00
Michael Yang 5994e8e8fd
embedding gemma model (#12181)
* ollama: add embeddings
2025-09-04 09:09:07 -07:00
Michael Yang fb92b61754
logutil: add Trace and TraceContext helpers (#12110) 2025-09-02 13:09:12 -07:00
Daniel Hiltgen 517807cdf2
perf: build graph for next batch async to keep GPU busy (#11863)
* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.
2025-08-29 14:20:28 -07:00
Michael Yang 30fb7e19f8
remove extra field attr (#11205) 2025-08-25 09:58:16 -07:00
Devon Rifkin 463a6caad8 model: add bpe roundtripping tests 2025-08-19 22:05:48 -07:00
Devon Rifkin fc5fb09f51 model: fix boundary in bpe
0x007e is a tilde and was getting adjusted (+0x00a2) to 0x0120 in the
encode, but then in the decode it was getting adjusted down (-0x0100) to
0x0020. The boundary for the +0x00a2 case has been adjusted to fix this

Fixes: #11966
2025-08-19 18:34:49 -07:00
Michael Yang 1a19df1f3a
update vendored llama.cpp and ggml (#11823)
* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with standard CGO layout where all
arch-specific source files are expected to live in the same directory as
the parent go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
GO layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
the preprocessor directives
2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
explicitly include the .c|.cpp files for the given architecture from the
nested directory

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use c++17 and include vendor for go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the cenral repo and point at the latest master

This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUINstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394

* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors as ggmls kernels expect them to be.

* convert mlp bf16 to f32

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-08-14 14:42:58 -07:00
Michael Yang fa7776fd24
gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors.  Checking for the function being null
seems to be the simplest way to condittionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-08-05 12:21:16 -07:00
Oliver Simons ea85e27bbd
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
2025-07-29 12:37:06 -07:00
Daniel Hiltgen f8a6e88819
Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-07-11 12:21:54 -07:00
Michael Yang 4129af9205
chore: cleanup comments + unused vars (#11225) 2025-06-27 11:45:33 -07:00
Michael Yang 73b642e6f3
add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-06-25 21:47:09 -07:00
Jeffrey Morgan 9e125d884c
model: treat 'user defined' tokens as special tokens (#11077) 2025-06-16 16:03:16 -07:00
Michael Yang 2e77aa1ae7
use nn.Linear in place of ml.Tensor (#11049)
while nn.Linear.Forward isn't applicable for sparse MLP, it's still
a nice container for the tensors
2025-06-11 12:10:15 -07:00
Devon Rifkin 5f57b0ef42
add thinking support to the api and cli (#10584)
- Both `/api/generate` and `/api/chat` now accept a `"think"`
  option that allows specifying whether thinking mode should be on or
  not
- Templates get passed this new option so, e.g., qwen3's template can
  put `/think` or `/no_think` in the system prompt depending on the
  value of the setting
- Models' thinking support is inferred by inspecting model templates.
  The prefix and suffix the parser uses to identify thinking support is
  also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
  existing API consumers. If the `"think"` option is not specified, the
  behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
  in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
  or `--think=false` to control thinking, or during an interactive
  session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
  it easy to use thinking in scripting scenarios like
  `ollama run qwen3 --think --hidethinking "my question here"` where you
  just want to see the answer but still want the benefits of thinking
  models
2025-05-28 19:38:52 -07:00
Jesse Gross 1f371ea92f ml: Panic rather than return error on tensor allocation failure
FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.

In many cases, the caller can't really handle the error and panics.

Empty and Zeros directly panic if they can't allocate memory.

This makes things consistent by panicing for the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
2025-05-22 14:38:09 -07:00
Michael Yang adff143bcd
fix: mllama quality (#10807)
* fix mllama convert

- transform attn_gate and ffn_gate
- swap attention heads for vision models

* fix mllama

the mlp gate which was applied in the wrong place
2025-05-22 11:30:49 -07:00
Michael Yang c890011322
feat: port qwen2 model (#10782) 2025-05-21 10:21:24 -07:00
Michael Yang e0ed984cde
feat: qwen3 dense and sparse models (#10708)
* feat: qwen3 dense
* feat: qwen3moe
* fix llama4 moe
2025-05-21 10:21:07 -07:00
Michael Yang 69b2fe9282
fix: qwen25vl assign samebatch in multimodal input (#10789)
setting samebatch on the vision start token is problematic because it
will be shared with other inputs that also use images. this will cause
the input to be cached and the runner will not see SameBatch. SameBatch
will also be incorrect since it may be for a different image.

assigning samebatch to the input tokens resolves this by ensure it's
assigned correctly to inputs corresponding to the image.

not setting same batch correctly may cause panics during inference since
images are no longer guaranteed to be in the same batch.
2025-05-21 09:39:20 -07:00
Michael Yang 9ed8bf14cb
ml: add more rope options (#10775) 2025-05-20 15:51:08 -07:00
Michael Yang ff180c3466
fix llama and mistral3 models (#10774)
* fix llama model

* fix mistral3.1 model

do not set default vision layers
2025-05-19 15:06:35 -07:00
Jesse Gross 94ab428e3f ggml: Seperate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-05-19 09:54:22 -07:00
Michael Yang 333e360422
model: handle multiple eos tokens (#10577)
* get eos_token_id from generation_config.json

* refactor

* include both ids and strings in trace

* comments

* remove special case for gemma3 special vocab (#10743)
2025-05-16 13:40:23 -07:00
Jesse Gross 3c14461d5d ollamarunner: Separate text and multimodal graphs
For some multimodal models (such as gemma3), we create a single
graph that generates the image embedding and then use this in the
text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple
batches. This can arise if the embedding is larger than the batch size.
In these cases (as with llama4), we would like to create views that
are more appropriately sized. However, if we do this then the original
source tensor is used in multiple graphs, which isn't allowed. To
avoid that problem, models with this pattern compute the embedding
tensor on first use and recreate the individual views. There is no
longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The
logic of computing tensors on demand is moved to the runner, so models
no longer have to worry about this. It also gives the runner visibility
into the multimodal tensors, which is important for memory management.
2025-05-15 13:46:20 -07:00
Michael Yang ef202789fa
fix pixel values padding (#10718)
* panic if trying to pad 4d

* fix pixel values padding
2025-05-15 13:44:44 -07:00
Bruce MacDonald 0aa8b371dd
model: add Qwen2.5-VL support (#10385) 2025-05-13 20:58:02 -07:00
Michael Yang 23125648b8
chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
Michael Yang 526b2ed102
fix vocabulary (#10679) 2025-05-12 17:29:46 -07:00
Bruce MacDonald a7240c6d63
models: remove unused qwen2vl processing (#10677) 2025-05-12 16:08:42 -07:00
Michael Yang f95a1f2bef
feat: add trace log level (#10650)
reduce prompt log to trace level
2025-05-12 11:43:00 -07:00
Michael Yang 5cfc1c39f3
model: fix build (#10416) 2025-04-25 19:24:48 -07:00
Michael Yang 7ba9fa9c7d fixes for maverick 2025-04-25 16:59:20 -07:00
Michael Yang 8bf11b84c1 chunked attention 2025-04-25 16:59:20 -07:00
Michael Yang 470af8ab89 connect vision to text 2025-04-25 16:59:20 -07:00
Michael Yang 178761aef3 image processing
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-04-25 16:59:20 -07:00
Michael Yang f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang d26c18e25c fix token type 2025-04-25 16:59:01 -07:00
Parth Sareen a53d744b01
llama: remove model loading for grammar (#10096) 2025-04-24 11:51:19 -07:00
Michael Yang 40b8fdbdca arange 2025-04-18 11:45:44 -07:00
Jesse Gross dbb149e6f7 ollamarunner: Preallocate worst case graph at startup
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.

Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
2025-04-08 10:01:28 -07:00
Bruce MacDonald 6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Michael Yang 3b96a93672 fs: move ml.Config to fs package 2025-04-03 13:12:24 -07:00
Jeffrey Morgan b51e0f397c
model: fix issues with spm tokenizer for Gemma 3 (#10081) 2025-04-02 13:22:56 -07:00
Michael Yang 74bd09652d ml/backend/ggml: load tensors in 32KiB chunks 2025-03-21 14:43:52 -07:00
Jesse Gross 0fbfcf3c9c model: Pass input tensor instead of raw data to models
Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of tensor but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
2025-03-20 13:28:13 -07:00
Jesse Gross 0c220935bd input: Rename Options to Batch
Options is no longer very descriptive of this struct.
2025-03-20 13:28:13 -07:00
Jesse Gross b078dd157c gemma2: Remove second call to Rows
Looks like a merge conflict that broke the model.
2025-03-19 17:28:49 -07:00
Jeffrey Morgan da0e345200
ml: use input context for extracting outputs (#9875) 2025-03-18 18:08:19 -07:00
Jesse Gross 282bfaaa95 ollamarunner: Use a separate context per multimodal input
Currently there is a single context per sequence, shared all by
all multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.

This changes to use a separate context for each image, ensuring
that available resource limits are consistent.
2025-03-14 15:38:54 -07:00
Jesse Gross 9679f40146 ml: Allow models to constrain inputs to a single batch
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes #9697
2025-03-14 15:38:54 -07:00
Michael Yang 3e102b7dad
Update model/model.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2025-03-13 13:11:52 -07:00
Michael Yang 5e2e0b46b1 fix: error if image requested without vision model 2025-03-13 10:52:09 -07:00
Bruce MacDonald a70820daa0
models/gemma3: remove final logit softcap (#9692)
Softcap isn't in the whitepaper/implementation for the language model so we should remove it. There is no discernible difference in output with it removed.
2025-03-12 10:17:57 -07:00
jmorganca 83f0ec8269 all: address linter errors 2025-03-11 14:49:20 -07:00
jmorganca fb4664fcec model: add more spm tokenizer tests 2025-03-11 14:49:20 -07:00
jmorganca 20e3593863 model: validate left and right pairs before merging them 2025-03-11 14:49:20 -07:00
Michael Yang 63a394068c use 2d pooling 2025-03-11 14:49:20 -07:00
jmorganca 11bfa62796 add trailing \n\n after <end_of_image> to match reference implementation 2025-03-11 14:49:20 -07:00
jmorganca f63e62e546 reduce kernel size, add TODO for loading from config 2025-03-11 14:49:20 -07:00
jmorganca 65b0f329d1 Revert "Allow models to force a new batch"
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
2025-03-11 14:49:20 -07:00
Jesse Gross 06007c0a18 Allow models to force a new batch
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
Jesse Gross a8e83a7654 Disable causal attention based on batch index
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
Jesse Gross 2c40c4d35e Fix follow up images and images split across batches 2025-03-11 14:49:19 -07:00
Michael Yang e95278932b use non-causal mask only for image positions 2025-03-11 14:49:19 -07:00
Michael Yang 9d2a20a763 use non-causal mask for inputs with images 2025-03-11 14:49:19 -07:00
Michael Yang 6b32a2d549 compat with upstream gguf 2025-03-11 14:49:19 -07:00
Michael Yang f888912870 fix vision encoder 2025-03-11 14:49:19 -07:00
Patrick Devine 9b54267e69 fix configs 2025-03-11 14:49:19 -07:00
Michael Yang 46bb0169c4 update model 2025-03-11 14:49:19 -07:00
Michael Yang 8934324b72 use fast attention 2025-03-11 14:49:18 -07:00
Jesse Gross 0e886595bf Fix tests and drift from main 2025-03-11 14:49:18 -07:00
Patrick Devine c62861f4fa fix conversion 2025-03-11 14:49:18 -07:00
Michael Yang 0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Jesse Gross 4346c2409d fix drift from main 2025-03-11 14:49:18 -07:00
Michael Yang 4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Jesse Gross a1cda80bcb model: Update encoder cache to use multimodal input processing handler
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
Jesse Gross 0daaaef8c9 ollamarunner: Quiet debug logging and panic on unimplemented features
Debug logging of every token has previously caused test timeouts
on slower machines.
2025-03-07 18:38:02 -08:00
Jesse Gross 98272fbd58 additional review comments 2025-03-07 14:08:21 -08:00
Michael Yang 7bae7fa5ce ml/backend/ggml: create tensor on specific backend
some tensors should be created on specific backends to reduce number of
copies and improve performance
2025-03-07 14:08:21 -08:00
Michael Yang bab6f34dc0 ml/backend/ggml: update model loading for hybrid/multi backends
use a similar strategy as llama.cpp for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
2025-03-07 14:08:21 -08:00
Jesse Gross a7e63b82be ollamarunner: Improve multimodal input handling
Various vision models have different requirements for how they
receive their inputs. For example:
 - Mllama wants images together with text and the image embeddings
   don't themselves have positions or get stored in the main KV cache
 - Llava-style models feed in embeddings similar to tokens and
   images correspond to a varying number of tokens in the cache.

In addition, the strategy for providing inputs must support batching
and multiple sequences, which are managed by the runner. At the same
time, we want to keep data handling fully in the model so that new
architectures are not bottlenecked by runner code which does not
understand their particular requirements.

This provides a method for models to edit the input stream so that
it meets their needs while still being in a format that the runner
understands. This allows the runner to avoid special processing
for different models.

In addition, this fixes a regression where non-vision models may
try to incorrectly interpret images.
2025-03-06 16:54:16 -08:00
Jesse Gross b70fc4d51e model: Don't unconditionally add special tokens
We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.
2025-03-06 16:54:16 -08:00
Daniel Hiltgen 1fdb351c37
New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00
Jesse Gross 854a9195f3 attention: Remove unnecessary contiguous operations
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
 - for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
2025-03-01 20:53:23 -08:00
Michael Yang 3e8b8a1933 ml: update Context.Forward interface
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
2025-02-27 22:27:16 +00:00
Michael Yang 53d2990d9b model: add bos token if configured 2025-02-27 21:04:59 +00:00
Jesse Gross f53f4198c3 ml: Abstract attention out of model definitions
There are two benefits to doing this:
 - Provide a library function that models can use, reducing code for
   each model implementation
 - Enables a single place to drop in optimized implementations of
   attention based on the backend or other factors. One is provided for
   GGML.

On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-02-21 13:16:21 -08:00
Jesse Gross 5c5535c064 models: Prune unused outputs earlier in the forward pass
Currently Rows is called as the last step in a model computation
to get the values for the output tokens. However, if we move it
earlier in the process then we can trim out computations that
never get used. This is similar to how models are defined in
llama.cpp.

Changing the model definition in this way improves token generation
performance by approximately 8%.
2025-02-20 14:49:47 -08:00
Jesse Gross bd6a7d5e64 ollamarunner: Pass runner performance parameters to backends
Currently the following parameters are in the runner but not used:
 - numGPULayers
 - mainGPU
 - threads
 - tensorSplit

This passes them through to the backend, which is where they would
actually get used. However, the GGML backend does not yet do anything
with them.
2025-02-20 13:27:57 -08:00