Commit Graph

4408 Commits

Author SHA1 Message Date
Jesse Gross cdceaff4e1
kvcache: Group shift operations into batches
Currently, when we need to shift the cache, it is done as a single
RoPE operation over the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass, since the forward pass works in batches.
Since we don't account for shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.

By limiting each RoPE call to batch-size chunks, we ensure that the
shift will never exceed the size of the forward pass, since the
forward pass also contains a RoPE of the same size. This does not
have a significant impact on performance, since RoPE is a math
operation whose cost is mostly proportional to the size of its
inputs (see the sketch below).

In theory, defrag could have the same issue, since it also creates a
compute graph outside of the forward pass; however, since it consists
only of copies, it does not require any working space.
2025-12-29 06:39:46 -06:00
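A minimal Go sketch of the chunking described above (hypothetical names, not the actual kvcache implementation), assuming a callback that builds one RoPE operation per chunk:

  package main

  import "fmt"

  // shiftChunked applies a cache shift in chunks of at most batchSize
  // positions, so the compute graph built for the shift never exceeds
  // the one built for a forward pass of the same batch size.
  func shiftChunked(cacheLen, batchSize int, applyRoPE func(offset, n int)) {
  	for offset := 0; offset < cacheLen; offset += batchSize {
  		n := batchSize
  		if cacheLen-offset < n {
  			n = cacheLen - offset
  		}
  		applyRoPE(offset, n) // one RoPE call per chunk, not one for the whole cache
  	}
  }

  func main() {
  	// A 10-entry cache with batch size 4 shifts in chunks of 4, 4, and 2.
  	shiftChunked(10, 4, func(offset, n int) {
  		fmt.Printf("RoPE over cache[%d:%d]\n", offset, offset+n)
  	})
  }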
Ruyut 9574ed9bb7
CONTRIBUTING: fix typo in commit message example (#11528) 2025-12-29 06:39:46 -06:00
Patrick Devine 0ab1b140af
cli: catch upstream errors gracefully (#11512) 2025-12-29 06:39:46 -06:00
Jeffrey Morgan d9a78742ad
tools: loosen tool argument parsing (#11509) 2025-12-29 06:39:45 -06:00
minxinyi a35d1c358f
server: use slices.Equal to simplify code (#11502) 2025-12-29 06:39:45 -06:00
Michael Yang 26cd61e41f
s#x/exp/maps#maps# (#11506) 2025-12-29 06:39:45 -06:00
Patrick Devine 95f5d9d6da
Fix GetModelInfo (#11496)
---------

Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:45 -06:00
ycomiti f5319ac72b
Update linux.md (#11462) 2025-12-29 06:39:45 -06:00
Stefan Wärting 59b034f040
readme: add GMAI - Gradle Managed AI to community integrations (#11461) 2025-12-29 06:39:44 -06:00
Jeffrey Morgan 30ec10cb05
tools: fix parsing issue when a tool name is a substring of another (#11456)
Co-authored-by: frob <rick+github@frob.com.au>
2025-12-29 06:39:44 -06:00
zmldndx ffa61a51fc
readme: update argo description to support deep research (#11455) 2025-12-29 06:39:44 -06:00
Daniel Hiltgen 5274cd2ead
ci: switch mac builder to arm64 (#11379)
The macos-13 runner is x86, while macos-13-xlarge is arm64
2025-12-29 06:39:44 -06:00
frob a1a350b608
docs: add the no-Modelfile function of `ollama create` (#9077) 2025-12-29 06:39:44 -06:00
frob b2a00a0d2a
openai: allow openai endpoint to accept webp images (#11412)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:44 -06:00
Haiyue Wang 2e57f92b0c
readme: update the llama.cpp github link (#11427) 2025-12-29 06:39:43 -06:00
Michael Yang 7221b90fe1
compile bf16 support into ggml-metal (#11430) 2025-12-29 06:39:43 -06:00
Parth Sareen 1c48526e2e
cmd: add default assistant role to message construction (#11431) 2025-12-29 06:39:43 -06:00
Bruce MacDonald 9e9238103d
api: fix unreachable status err (#11423)
StatusError was unreachable: the client always checked for error messages in the response body first, and the server always includes an error message alongside HTTP error status codes (see the sketch below).
2025-12-29 06:39:43 -06:00
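A minimal Go sketch of why the status-only path was dead code (hypothetical names, assuming the body-first error check described above):

  package main

  import (
  	"encoding/json"
  	"fmt"
  )

  type errorResponse struct {
  	Error string `json:"error"`
  }

  // checkError mirrors the ordering described above: the body is decoded
  // first, and since the server always includes an error message with an
  // error status code, the status-only branch can never be reached.
  func checkError(statusCode int, body []byte) error {
  	var er errorResponse
  	if json.Unmarshal(body, &er) == nil && er.Error != "" {
  		return fmt.Errorf("%s", er.Error) // always taken on errors
  	}
  	if statusCode >= 400 {
  		return fmt.Errorf("status %d", statusCode) // unreachable in practice
  	}
  	return nil
  }

  func main() {
  	fmt.Println(checkError(404, []byte(`{"error":"model not found"}`)))
  }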
Marcelo Fornet 8c885fe5eb
docs: fix typo in macos.md (#11425) 2025-12-29 06:39:43 -06:00
先知 43cacd9309
docs: update modelfile.md to reflect current default num_ctx (#11189)
As of commit 44b466eeb2, the default context length has been increased to 4096.
2025-12-29 06:39:43 -06:00
Jesse Gross b47aa7e75a
ggml: Use assigned layers when reporting loading stats
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, they can differ, for example if the
GPU backend is missing (see the sketch below).
2025-12-29 06:39:42 -06:00
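A small Go sketch of the reporting change (hypothetical names; the real code reports the backend's assigned layer count):

  package main

  import "log"

  // reportLoad logs the number of layers actually assigned to the GPU
  // rather than the number requested; the two can differ, for example
  // when the GPU backend is missing.
  func reportLoad(requested, assigned, total int) {
  	if assigned != requested {
  		log.Printf("offloaded %d/%d layers to GPU (requested %d)", assigned, total, requested)
  		return
  	}
  	log.Printf("offloaded %d/%d layers to GPU", assigned, total)
  }

  func main() {
  	reportLoad(33, 0, 33) // GPU backend missing: report 0 loaded, not the 33 requested
  }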
Jesse Gross 015e39a8be
ggml: Disable unused pipeline parallelism
We're not currently using it, even in cases where we could. Disabling
it improves generation performance by 10-30% with multiple GPUs.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 39cec5338a
Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-12-29 06:39:42 -06:00
Jesse Gross 387cb031b3
ggml: Report ordinal IDs for AMD GPUs on Windows
We don't get valid UUIDs for AMD GPUs on Windows, so the best option
is to use the ordinal IDs. This brings us in line with what we currently
do on the Ollama server - the only exception is AMD GPUs on Linux, which
fall back to using ordinal IDs. The GGML implementation has no such
fallback, but the situation doesn't appear to occur for any of the GPUs
that we support (see the sketch below).

It's also possible that there are collisions between ordinal IDs for
different libraries; however, the only places where we use them are
AMD on Windows and Metal on Mac, which can never occur on the same
system.
2025-12-29 06:39:42 -06:00
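A hedged Go sketch of the ID selection described above (hypothetical names, not the actual server code):

  package main

  import (
  	"fmt"
  	"strconv"
  )

  // gpuID prefers a valid UUID and falls back to the ordinal index,
  // matching the server behavior described above. The GGML path has
  // no such fallback.
  func gpuID(uuid string, ordinal int) string {
  	if uuid != "" {
  		return uuid
  	}
  	return strconv.Itoa(ordinal)
  }

  func main() {
  	fmt.Println(gpuID("", 0))         // AMD on Windows: no valid UUID, report "0"
  	fmt.Println(gpuID("GPU-1234", 1)) // a valid UUID wins
  }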
Daniel Hiltgen 50e4df359b
doc: add MacOS docs (#11334)
also removes stale model directory instructions for Windows
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 4fcc030739
Reduce default parallelism to 1 (#11330)
The current scheduler algorithm of picking the parallelism based on available
VRAM complicates the upcoming dynamic layer memory allocation algorithm. This
changes the default to 1, with the intent going forward that parallelism is
explicit and will no longer be dynamically determined (see the sketch below).
Removal of the dynamic logic will come in a follow-up.
2025-12-29 06:39:41 -06:00
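A tiny Go sketch of the new default (hypothetical names; the point is that parallelism is explicit, not derived from VRAM):

  package main

  import "fmt"

  // pickParallelism returns the explicitly requested value if set,
  // otherwise 1; the VRAM-based dynamic choice is gone.
  func pickParallelism(explicit int) int {
  	if explicit > 0 {
  		return explicit
  	}
  	return 1
  }

  func main() {
  	fmt.Println(pickParallelism(0)) // unset: defaults to 1
  	fmt.Println(pickParallelism(4)) // explicit: honored as-is
  }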
Daniel Hiltgen 1c94c9919b
API/CLI context enhancements (#11331)
* API: expose context size of loaded models

* CLI: add context UX

This adds a column to the ps output to show the model's context size.
2025-12-29 06:39:41 -06:00
Parth Sareen 25f6571f34
add `tool_name` to api.md (#11326) 2025-12-29 06:39:41 -06:00
Parth Sareen 1efadee48c
template: add tool result compatibility (#11294) 2025-12-29 06:39:41 -06:00
Daniel Hiltgen fc4cb04cb9
ci: modularization (#11324)
switch a few constants to variables
2025-12-29 06:39:41 -06:00
Jesse Gross 5f139b96ab
Revert "ggml: Temporarily disable reporting UUIDs"
The root cause was an unclean upgrade - this code is fine.

This reverts commit 45f216a9c7.
2025-12-29 06:39:41 -06:00
Jeffrey Morgan ca3520de87
readme: update Ollama icon size 2025-12-29 06:39:40 -06:00
Daniel Hiltgen 55a4a37c3a
int: add performance integration tests (#11173)
usage example:
  go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 | tee int.log
  cat int.log | grep MODEL_PERF_HEADER | cut -f2- -d: > perf.csv
  cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
2025-12-29 06:39:40 -06:00
Daniel Hiltgen ba750172ca
doc: add NVIDIA blackwell to supported list (#11307) 2025-12-29 06:39:40 -06:00
Vincent RAMPAL 35bf6c0a41
Update base image to Ubuntu 24.04 LTS (#9681) 2025-12-29 06:39:40 -06:00
Daniel Hiltgen b23d28b549
doc: Update link for mac install (#11288)
Favor the dmg now.
2025-12-29 06:39:40 -06:00
Daniel Hiltgen e897624123
mimic logs for layers on new engine (#11278)
This adds some extra logs to make the new engine a bit more consistent
with the llama engine.
2025-12-29 06:39:39 -06:00
XuKecheng a3e4bb7f58
readme: add NativeMind to community integrations (#11242) 2025-12-29 06:39:39 -06:00
Jeffrey Morgan 9cf8ef9371
tools: fix parsing tool calls with empty arguments, missing required fields (#11233) 2025-12-29 06:39:39 -06:00
Attogram Project 96be53fe6c
readme: add ollama-bash-toolshed to community integrations (#11224) 2025-12-29 06:39:39 -06:00
Michael Yang 1cdab47113
chore: cleanup comments + unused vars (#11225) 2025-12-29 06:39:39 -06:00
Jesse Gross 872d190c8f
ggml: Temporarily disable reporting UUIDs
This is causing segfaults, so disable it. Currently UUIDs are only
used for debugging purposes, although they are planned to be used in
additional ways in the future.

Bug #11211
2025-12-29 06:39:39 -06:00
Michael Yang 8f2099306f
skip quantizing per_layer_token_embd (#11207)
This tensor isn't compatible with CUDA when quantized to q4_K, so skip it (see the sketch below).
2025-12-29 06:39:38 -06:00
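A one-function Go sketch of the skip rule (hypothetical name; the real check lives in the quantization path):

  package main

  import (
  	"fmt"
  	"strings"
  )

  // shouldQuantize reports whether a tensor may be quantized to q4_K;
  // per_layer_token_embd is kept unquantized for CUDA compatibility.
  func shouldQuantize(name string) bool {
  	return !strings.Contains(name, "per_layer_token_embd")
  }

  func main() {
  	fmt.Println(shouldQuantize("blk.0.attn_q.weight"))          // true
  	fmt.Println(shouldQuantize("per_layer_token_embd.weight")) // false
  }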
Daniel Hiltgen 59112600d1
ci: multi-stage release process (#11001) 2025-12-29 06:39:38 -06:00
Jeffrey Morgan 10119ec2ee
fs/ggml: add multiplier in graph estimates (#11208) 2025-12-29 06:39:38 -06:00
Jeffrey Morgan 84998ae4ba
fs/ggml: add missing architecture to OllamaEngineRequired() (#11206) 2025-12-29 06:39:38 -06:00
Michael Yang 801564fa8b
add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-12-29 06:39:38 -06:00
Daniel Hiltgen d6253f09c2
ci: arm sbsa fixes (#11194) 2025-12-29 06:39:37 -06:00
Daniel Hiltgen 9cf1db79b4
ci: include dependencies 2025-12-29 06:39:37 -06:00
Daniel Hiltgen 46654149c9
ci: pick up arm sbsa cuda libs (#11192) 2025-12-29 06:39:37 -06:00