Commit Graph

4431 Commits

Author SHA1 Message Date
Devon Rifkin afe0c10dbc
openai: always provide reasoning
We were not passing along thinking when content was nil (as opposed
to an empty string).

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and passing an empty string are distinct.
2025-12-29 06:39:50 -06:00
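
A minimal illustration of the nil-vs-empty distinction described above (the field names are illustrative, not ollama's actual openai compat types): with a pointer-typed content field, an omitted value unmarshals to nil while an explicit empty string does not, and thinking must be forwarded in both cases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type chatMessage struct {
	Role     string  `json:"role"`
	Content  *string `json:"content,omitempty"` // nil when the client omits it entirely
	Thinking string  `json:"thinking,omitempty"`
}

func main() {
	var omitted, empty chatMessage
	json.Unmarshal([]byte(`{"role":"assistant","thinking":"step 1..."}`), &omitted)
	json.Unmarshal([]byte(`{"role":"assistant","content":"","thinking":"step 1..."}`), &empty)

	// Content == nil (not passed) and *Content == "" (empty string) are distinct
	// states; the thinking/reasoning field should be forwarded in both.
	fmt.Println(omitted.Content == nil, *empty.Content == "") // true true
}
```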
Devon Rifkin 1a7d34231f
openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id`, we inspect previous messages,
look for a matching tool call ID, and use its name.

Issue: https://github.com/ollama/ollama/issues/11704
2025-12-29 06:39:50 -06:00
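
A hedged sketch of the lookup described in this commit: given a role=tool message that only carries a tool_call_id, walk earlier messages and recover the function name from the matching tool call. The type and field names below are illustrative, not ollama's actual openai compat types.

```go
type toolCall struct {
	ID       string
	Function struct{ Name string }
}

type message struct {
	Role       string
	Name       string // legacy field from the OpenAI completions API
	ToolCallID string
	ToolCalls  []toolCall
}

// toolNameFor prefers the legacy Name field and otherwise searches prior
// messages for a tool call whose ID matches the tool message's ToolCallID.
func toolNameFor(toolMsg message, history []message) string {
	if toolMsg.Name != "" {
		return toolMsg.Name
	}
	for _, m := range history {
		for _, tc := range m.ToolCalls {
			if tc.ID == toolMsg.ToolCallID {
				return tc.Function.Name
			}
		}
	}
	return "" // no matching tool call found
}
```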
Patrick Devine 45eabc3083
docs: update the faq (#11760) 2025-12-29 06:39:50 -06:00
Devon Rifkin ae9664c01d
openai: allow for content _and_ tool calls in the same message
Previously, our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present; tool calls and content can be provided together.

Fixes: https://github.com/ollama/ollama/issues/11704
2025-12-29 06:39:50 -06:00
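
For reference, a rough sketch of the message shape the compat layer must now accept, with text content and tool calls side by side; the structs loosely mirror the OpenAI wire format and the names are illustrative.

```go
type openAIToolCall struct {
	ID       string `json:"id"`
	Type     string `json:"type"`
	Function struct {
		Name      string `json:"name"`
		Arguments string `json:"arguments"`
	} `json:"function"`
}

type openAIMessage struct {
	Role      string           `json:"role"`
	Content   *string          `json:"content,omitempty"` // optional only when tool calls are present
	ToolCalls []openAIToolCall `json:"tool_calls,omitempty"`
	// A valid assistant message may populate both Content and ToolCalls;
	// conversion must copy both rather than treating them as exclusive.
}
```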
Daniel Hiltgen cb241cab63
clean up debugging (#11756) 2025-12-29 06:39:49 -06:00
Gao feng 3e2a98ad55
Update downloading to pulling in api.md (#11170)
update api.md to make it consistent with the code.
https://github.com/ollama/ollama/blob/main/server/download.go#L447
2025-12-29 06:39:49 -06:00
Parth Sareen 179bbf2640
docs: update turbo model name (#11707) 2025-12-29 06:39:49 -06:00
Devon Rifkin c9304f161a
tools: support anyOf types
AFAIK gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
assumed that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in Go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of JSON Schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., Zed's
ollama integration w/ gpt-oss). Probably the most urgent gap is proper
array support.
2025-12-29 06:39:49 -06:00
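
A simplified sketch of the recursive idea behind `toTypeScriptType()`: when a schema node uses `anyOf`, render each alternative recursively and join them with `|`; otherwise map the plain `type` field. This is an assumption about the general shape, not the actual implementation, which handles more of JSON Schema.

```go
import "strings"

type schema struct {
	Type  string
	AnyOf []schema
}

func toTypeScriptType(s schema) string {
	if len(s.AnyOf) > 0 {
		parts := make([]string, 0, len(s.AnyOf))
		for _, alt := range s.AnyOf {
			parts = append(parts, toTypeScriptType(alt)) // recursion handles nested anyOf
		}
		return strings.Join(parts, " | ")
	}
	switch s.Type {
	case "integer", "number":
		return "number"
	case "string", "boolean":
		return s.Type
	case "null":
		return "null"
	default:
		return "any"
	}
}
```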
Daniel Hiltgen e5b777a8d9
win: static link msvc libs (#11612)
This should help reduce the runtime dependencies on Windows.
2025-12-29 06:39:49 -06:00
Michael Yang b643362f9f
gptoss: fix memory calc (#11700) 2025-12-29 06:39:49 -06:00
Jeffrey Morgan 063d3e8163
docs: add docs for Ollama Turbo (#11687) 2025-12-29 06:39:48 -06:00
Jesse Gross ae8a041461
ggml: Prevent kv cache quantization on gpt-oss
KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.

The model definition does not call flash attention, so it works
regardless of the setting, but the cache will still pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
2025-12-29 06:39:48 -06:00
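
A minimal sketch of the ordering fix described above, with illustrative names: resolve whether flash attention is usable before deriving dependent settings, so a quantized KV cache type is dropped when flash attention is off.

```go
// resolveCacheType falls back to the unquantized default when flash
// attention (which KV cache quantization depends on) is unavailable.
func resolveCacheType(flashAttnSupported bool, requested string) string {
	if !flashAttnSupported && requested != "" && requested != "f16" {
		return "f16" // silently drop the quantized cache type
	}
	return requested
}
```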
Michael Yang ed2e8a9022
gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-12-29 06:39:48 -06:00
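
To make the MXFP4 item above concrete, here is a rough dequantization sketch based on the OCP Microscaling (MX) spec: each block holds one shared E8M0 scale byte plus 32 FP4 (E2M1) elements packed two per byte. The packing order and layout here are illustrative and are not the layout used by the actual CPU/CUDA/Metal kernels.

```go
import "math"

// fp4Values maps the 3 magnitude bits of an E2M1 element to its value;
// the high bit of each nibble is the sign.
var fp4Values = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// dequantMXFP4Block expands one 17-byte block (1 scale byte + 16 packed
// bytes) into 32 float32 values.
func dequantMXFP4Block(block [17]byte) [32]float32 {
	var out [32]float32
	scale := float32(math.Exp2(float64(int(block[0]) - 127))) // shared E8M0 scale
	for i := 0; i < 16; i++ {
		lo, hi := block[1+i]&0x0f, block[1+i]>>4
		for j, nib := range []byte{lo, hi} {
			v := fp4Values[nib&0x7]
			if nib&0x8 != 0 {
				v = -v
			}
			out[2*i+j] = v * scale
		}
	}
	return out
}
```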
Jesse Gross 275510ddf5
kvcache: Log contents of cache when unable to find a slot
There is a bug when using sliding window attention where we run
out of KV cache slots. This is likely due to not correctly removing
all of the entries as they slide out of range. This adds additional
logging when this occurs to track down the source.

Bug #10127
2025-12-29 06:39:48 -06:00
Jesse Gross c24014a55d
kvcache: Enable SWA to retain additional entries
Models that use sliding window attention can only resume a sequence
from the cache if it falls within the saved windows. This works well
if the next message picks up where the old one left off. However, it
generally prevents a partial prefix match unless the entire conversation
falls within the sliding window.

This can be a problem with reasoning models where the traces are
supposed to be removed from future messages, forcing the entire
history to be re-evaluated.

This change allows models to specify that a larger amount of the
history be retained in memory, to allow more partial resumption.
It still respects the window that the model was trained on for
token generation.
2025-12-29 06:39:48 -06:00
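
A hedged sketch of the retention idea, assuming a hypothetical knob name: keep more history in the cache than the trained sliding window so a prefix can still be resumed, while attention itself continues to mask to the trained window.

```go
// cacheRetention returns how many entries to keep per sequence. swaMemory
// is a hypothetical configuration value, not the kvcache package's actual
// option name.
func cacheRetention(trainedWindow, swaMemory int32) int32 {
	if swaMemory > trainedWindow {
		return swaMemory // retain extra entries purely for prefix reuse
	}
	return trainedWindow // generation still only attends within this window
}
```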
Sajal Kulshreshtha b923797e99
fixing broken AMD driver link (#11579) 2025-12-29 06:39:47 -06:00
Daniel Hiltgen 612a87dc69
Revert "CI: switch back to x86 macos builder" (#11588)
This reverts commit 9d071e6089.
2025-12-29 06:39:47 -06:00
Daniel Hiltgen 5038e33776
mac: disable bf16 on unsupported OS versions (#11585)
Support for bf16 was added in macOS v14+, and attempting to enable it
on older versions causes runtime failures.
2025-12-29 06:39:47 -06:00
Daniel Hiltgen 1d064a0e20
CI: switch back to x86 macos builder (#11572) 2025-12-29 06:39:47 -06:00
Oliver Simons 1ee3fe46f3
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
2025-12-29 06:39:47 -06:00
Jesse Gross 279e632945
kvcache: Don't shift empty batches
When we context shift, we delete half the context and apply RoPE
with an offset to the other half. We used to RoPE across the entire
context in a single pass with a zero offset for the deleted
section. With the change to shifting in batches, we can skip any
batches where all of the offsets would be zero. This typically
reduces the number of operations by half.
2025-12-29 06:39:47 -06:00
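
A minimal sketch of the skip check described above (names illustrative): when shifting the cache in batch-sized chunks, a chunk whose RoPE offsets are all zero is a no-op and can be skipped.

```go
// needsShift reports whether any entry in this chunk actually moves.
func needsShift(offsets []int32) bool {
	for _, off := range offsets {
		if off != 0 {
			return true
		}
	}
	return false // deleted/empty region: skip the RoPE call for this chunk
}
```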
Yoshi 9bd69d0110
docs: fix typos and remove trailing whitespaces (#11554) 2025-12-29 06:39:46 -06:00
Mayan EDMS 4975cc042e
readme: add Mayan EDMS to community integrations (#11543) 2025-12-29 06:39:46 -06:00
Jesse Gross cdceaff4e1
kvcache: Group shift operations into batches
Currently, when we need to do a shift on the cache, it is one
RoPE operation on the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass since the forward pass is working in batches.
Since we don't consider shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.

By limiting the size of the RoPE calls to batch size chunks, we
ensure that the shift will never exceed the size of the forward
pass, since the forward pass will also contain a RoPE of the same
size. This does not have a significant impact on performance since
RoPE is a math operation that is mostly proportional to the size
of its inputs.

In theory, defrag could have the same issue since it also creates a
compute graph outside of the forward pass; however, since it only
performs copies, it does not require any working space.
2025-12-29 06:39:46 -06:00
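
A minimal sketch of the chunking, with illustrative names: instead of one RoPE over the whole cache per layer, walk it in batch-sized windows so the shift graph never exceeds the forward pass.

```go
// shiftInBatches applies shiftChunk to batch-sized windows of the cache.
func shiftInBatches(cacheLen, batchSize int, shiftChunk func(start, end int)) {
	for start := 0; start < cacheLen; start += batchSize {
		end := start + batchSize
		if end > cacheLen {
			end = cacheLen
		}
		shiftChunk(start, end) // one bounded RoPE call per chunk
	}
}
```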
Ruyut 9574ed9bb7
CONTRIBUTING: fix typo in commit message example (#11528) 2025-12-29 06:39:46 -06:00
Patrick Devine 0ab1b140af
cli: catch upstream errors gracefully (#11512) 2025-12-29 06:39:46 -06:00
Jeffrey Morgan d9a78742ad
tools: loosen tool argument parsing (#11509) 2025-12-29 06:39:45 -06:00
minxinyi a35d1c358f
server: use slices.Equal to simplify code (#11502) 2025-12-29 06:39:45 -06:00
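For reference, the stdlib helper this cleanup leans on: `slices.Equal` (Go 1.21+) replaces a hand-rolled length-and-element comparison loop. The wrapper name below is hypothetical.

```go
import "slices"

// sameList is a hypothetical example; slices.Equal returns true only when
// both slices have the same length and equal elements in order.
func sameList(a, b []string) bool {
	return slices.Equal(a, b)
}
```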
Michael Yang 26cd61e41f
s#x/exp/maps#maps# (#11506) 2025-12-29 06:39:45 -06:00
Patrick Devine 95f5d9d6da
Fix GetModelInfo (#11496)
---------

Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:45 -06:00
ycomiti f5319ac72b
Update linux.md (#11462) 2025-12-29 06:39:45 -06:00
Stefan Wärting 59b034f040
readme: add GMAI - Gradle Managed to community integrations (#11461) 2025-12-29 06:39:44 -06:00
Jeffrey Morgan 30ec10cb05
tools: fix parsing issue when a tool name is a substring of another (#11456)
Co-authored-by: frob <rick+github@frob.com.au>
2025-12-29 06:39:44 -06:00
zmldndx ffa61a51fc
readme: update argo description to support deep research (#11455) 2025-12-29 06:39:44 -06:00
Daniel Hiltgen 5274cd2ead
ci: switch mac builder to arm64 (#11379)
The macos-13 runner is x86, while macos-13-xlarge is arm64.
2025-12-29 06:39:44 -06:00
frob a1a350b608
docs: add the no-Modelfile function of `ollama create` (#9077) 2025-12-29 06:39:44 -06:00
frob b2a00a0d2a
openai: allow openai endpoint to accept webp images (#11412)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:44 -06:00
Haiyue Wang 2e57f92b0c
readme: update the llama.cpp github link (#11427) 2025-12-29 06:39:43 -06:00
Michael Yang 7221b90fe1
compile bf16 support into ggml-metal (#11430) 2025-12-29 06:39:43 -06:00
Parth Sareen 1c48526e2e
cmd: add default assistant role to message construction (#11431) 2025-12-29 06:39:43 -06:00
Bruce MacDonald 9e9238103d
api: fix unreachable status err (#11423)
StatusError was unreachable: the client always checked for error messages in the response body first, and the server always includes error messages with HTTP error status codes.
2025-12-29 06:39:43 -06:00
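
A hedged sketch of the pattern at issue, with illustrative types: decode an error message from the response body when one is present, and only fall back to a bare status-code error when the body carries none.

```go
import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func checkResponse(resp *http.Response) error {
	if resp.StatusCode < 400 {
		return nil
	}
	body, _ := io.ReadAll(resp.Body)
	var apiErr struct {
		Error string `json:"error"`
	}
	if json.Unmarshal(body, &apiErr) == nil && apiErr.Error != "" {
		return fmt.Errorf("%s", apiErr.Error) // server-provided message wins
	}
	return fmt.Errorf("request failed: status %d", resp.StatusCode) // fallback status error
}
```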
Marcelo Fornet 8c885fe5eb
docs: fix typo in macos.md (#11425) 2025-12-29 06:39:43 -06:00
先知 43cacd9309
docs: update modelfile.md to reflect current default num_ctx (#11189)
As of commit 44b466eeb2, the default context length has been increased to 4096.
2025-12-29 06:39:43 -06:00
Jesse Gross b47aa7e75a
ggml: Use assigned layers when reporting loading stats
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, there are cases where they might differ,
such as if the GPU backend is missing.
2025-12-29 06:39:42 -06:00
Jesse Gross 015e39a8be
ggml: Disable unused pipeline parallelism
We're not currently using it, even in cases where we could. Disabling
it improves generation performance by 10-30% with multiple GPUs.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 39cec5338a
Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-12-29 06:39:42 -06:00
Jesse Gross 387cb031b3
ggml: Report ordinal IDs for AMD GPUs on Windows
We don't get valid UUIDs for AMD GPUs on Windows, so the best option
is to use the ordinal IDs. This brings us in line with what we currently
do on the Ollama server - the only exception is AMD GPUs on Linux, which
fall back to using ordinal IDs. The GGML implementation has no fallback,
but this case doesn't appear to occur for any of the GPUs that we support.

It's also possible that there are collisions between ordinal IDs for
different libraries - however the only places where we use them are
AMD on Windows and Metal on Mac, which can never occur on the same
system.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 50e4df359b
doc: add MacOS docs (#11334)
also removes stale model dir instructions for windows
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 4fcc030739
Reduce default parallelism to 1 (#11330)
The current scheduler algorithm of picking the parallelism based on available
VRAM complicates the upcoming dynamic layer memory allocation algorithm. This
changes the default to 1, with the intent going forward that parallelism is
explicit and will no longer be dynamically determined. Removal of the dynamic
logic will come in a follow-up.
2025-12-29 06:39:41 -06:00
Daniel Hiltgen 1c94c9919b
API/CLI context enhancements (#11331)
* API: expose context size of loaded models

* CLI: add context UX

This adds a column in the ps output to show the model's context size.
2025-12-29 06:39:41 -06:00