With #12181, there's now support for embeddings in the ollama engine.
This is done by mutating the architecture and appending _embed when an
embedding model is detected. However, this introduced a bug: if an
embedding model is run on top of an existing ollama engine model that
has no embedding implementation (e.g. llama4), it passes the initial
architecture support check but fails when actually loaded.
There are currently two entrypoints for creating a model. Previously,
the second entrypoint was necessary because calling model.New would
also load the model. Since #11818 this is no longer the case, so merge
them to reduce complexity.
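A minimal sketch of the merged entrypoint under that assumption (type and signature are illustrative, not the real API):

```go
package model

import "fmt"

// Model stands in for the engine's model interface.
type Model interface{}

// hypothetical registry of constructors keyed by architecture
var architectures = map[string]func() (Model, error){}

// New constructs a model for the given architecture. Since #11818 it no
// longer loads weights, so the separate "construct without loading"
// entrypoint is unnecessary and both callers can share this one.
func New(arch string) (Model, error) {
	ctor, ok := architectures[arch]
	if !ok {
		return nil, fmt.Errorf("unsupported architecture %q", arch)
	}
	return ctor()
}
```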
With the addition of CUDA v13, on a clean setup the level of parallelism
was overwhelming Docker Desktop and crashing compilers. This limits each
build stage to 8 parallel jobs, with the ability to override if you have
many more cores available.
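As a sketch of the mechanism (the build-arg name is hypothetical; the real Dockerfile may differ), the cap can be a per-stage build argument:

```dockerfile
# default to 8 parallel jobs per build stage; override for bigger machines
ARG PARALLEL=8
RUN cmake --build build --parallel ${PARALLEL}
```

Overriding on a machine with many cores would then look like `docker build --build-arg PARALLEL=32 .`.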
This change moves back to converting bf16 vision weights to fp16,
specifically when their names start with "v." (such as v.blk.0.attn_k.weight).
This fixes a bug where converted models fail on image inputs because
they try to call `im2col`, which doesn't have a bf16 kernel in ggml.
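A minimal sketch of the check involved (helper name is illustrative, not the actual converter code):

```go
package convert

import "strings"

// targetDType shows the intent: vision tower tensors (names beginning
// with "v.", e.g. "v.blk.0.attn_k.weight") are written as fp16 because
// ggml's im2col has no bf16 kernel.
func targetDType(name, srcDType string) string {
	if strings.HasPrefix(name, "v.") && srcDType == "BF16" {
		return "F16"
	}
	return srcDType
}
```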
The format qwen3-coder uses is relatively unique, both in rendering and
in parsing. To implement parsing, I wrote a custom parser in a style
similar to harmony's. For the rendering, I found that the logic would be
much more difficult to follow in a template, so I introduced the concept
of a built-in renderer that uses go code, rather than a template to
generate prompts.
I set us up for future built-in parsers and renderers by making it so
they can be specified in a Modelfile like so:
```
RENDERER "qwen3-coder"
PARSER "qwen3-coder"
```
These need to be provided explicitly because the architecture alone is
not enough to determine what format the model expects to receive and
what format we expect it to output (e.g., qwen3-coder's architecture is
`qwen3moe`, which also covers other qwen3-family models).
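To make the wiring concrete, here's a hypothetical sketch of how named built-ins might be registered and looked up (not the actual API):

```go
package builtin

import "fmt"

// Message stands in for ollama's chat message type.
type Message struct {
	Role    string
	Content string
}

// A Renderer turns a conversation into a prompt string in Go code rather
// than via a template; a Parser does the inverse for model output.
type Renderer func(msgs []Message) (string, error)
type Parser func(output string) ([]Message, error)

// registries keyed by the names given in the Modelfile,
// e.g. RENDERER "qwen3-coder" / PARSER "qwen3-coder"
var (
	renderers = map[string]Renderer{}
	parsers   = map[string]Parser{}
)

func RendererFor(name string) (Renderer, error) {
	r, ok := renderers[name]
	if !ok {
		return nil, fmt.Errorf("unknown renderer %q", name)
	}
	return r, nil
}
```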
I haven't converted harmony to be one of these "built-ins" yet, since
some of it is in flux with the changes @ParthSareen has been making to
move harmony to the runner. It is likely that many other built-ins will
need to move to the runner as well, but I'm able to slightly defer that
decision since qwen3-coder doesn't have thinking (and therefore doesn't
need to be in the runner to make structured outputs work). I expect to
unify harmony with this approach very soon.
Whether a particular model supports tools or thinking was previously
inferred from its template, but since these models have no template, the
parser itself now declares what it supports. If we have future models that
re-use the same parsing format, but have different capabilities, we'll
want to parameterize them and give them different names to be specified
as a `PARSER`.
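One way that declaration could look, as a hedged sketch (interface and field names are hypothetical):

```go
package builtin

// Capabilities declares what a built-in parser supports, replacing the
// template-based inference described above.
type Capabilities struct {
	Tools    bool
	Thinking bool
}

// A parser implementing this interface reports its capabilities; a
// future model reusing the same format with different capabilities
// would register under a different name and report different values.
type CapabilityReporter interface {
	Capabilities() Capabilities
}
```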
Misc changes:
- I worked on the renderer by diffing our outputs against the reference
  implementation's. To make this easier, I extended
  <https://github.com/ollama/ollama/pull/11875> to also support
  returning the prompt via the OpenAI compat layer.
Ollama's recent update to the llama.cpp engine broke image handling for all models that require a slice schema. After the update, the value of numTokens isn't always the length of the sliced image embed, but rather the final length of the schema. This caused the image embed to not be included correctly during slice processing.
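As a loose illustration of the mismatch (types and names are hypothetical), copying the embed by its own length rather than numTokens avoids the bug:

```go
package main

import "fmt"

// imageEmbed stands in for the sliced image embedding: numTokens now
// reflects the final schema length, which may not equal len(rows).
type imageEmbed struct {
	rows      [][]float32
	numTokens int
}

func appendToBatch(batch [][]float32, e imageEmbed) [][]float32 {
	// copy by the embed's own length, not e.numTokens
	return append(batch, e.rows...)
}

func main() {
	e := imageEmbed{rows: make([][]float32, 4), numTokens: 6}
	fmt.Println(len(appendToBatch(nil, e))) // 4 rows, though numTokens is 6
}
```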