Commit Graph

4395 Commits

Author SHA1 Message Date
frob
b2a00a0d2a openai: allow openai endpoint to accept webp images (#11412)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:44 -06:00
Haiyue Wang
2e57f92b0c readme: update the llama.cpp github link (#11427) 2025-12-29 06:39:43 -06:00
Michael Yang
7221b90fe1 compile bf16 support into ggml-metal (#11430) 2025-12-29 06:39:43 -06:00
Parth Sareen
1c48526e2e cmd: add default assistant role to message construction (#11431) 2025-12-29 06:39:43 -06:00
Bruce MacDonald
9e9238103d api: fix unreachable status err (#11423)
StatusError was unreachable: the client always checked for an error message in the response body first, and the server always includes an error message alongside HTTP error status codes.
2025-12-29 06:39:43 -06:00
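
A minimal Go sketch of the precedence described above (illustrative names, not the actual client code): when the server always includes an error message in the response body, a client that decodes the body first never reaches its StatusError fallback.

  package client

  import (
      "encoding/json"
      "errors"
      "fmt"
      "io"
      "net/http"
  )

  // checkError mimics the described behavior: the body error message is
  // checked first, so the status-code fallback below is effectively dead
  // code whenever the server populates the body on errors.
  func checkError(resp *http.Response) error {
      body, _ := io.ReadAll(resp.Body)
      var apiErr struct {
          Error string `json:"error"`
      }
      if err := json.Unmarshal(body, &apiErr); err == nil && apiErr.Error != "" {
          return errors.New(apiErr.Error) // always taken on server errors
      }
      if resp.StatusCode >= http.StatusBadRequest {
          return fmt.Errorf("unexpected status %d", resp.StatusCode) // unreachable in practice
      }
      return nil
  }
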
Marcelo Fornet
8c885fe5eb docs: fix typo in macos.md (#11425) 2025-12-29 06:39:43 -06:00
先知
43cacd9309 docs: update modelfile.md to reflect current default num_ctx (#11189)
As of commit 44b466eeb2, the default context length has been increased to 4096.
2025-12-29 06:39:43 -06:00
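
For reference, the 4096 default can still be overridden per request through the documented options.num_ctx field; a minimal Go sketch (model name illustrative):

  package main

  import (
      "bytes"
      "encoding/json"
      "net/http"
  )

  func main() {
      // request a larger context than the new 4096 default
      payload, _ := json.Marshal(map[string]any{
          "model":   "llama3", // illustrative model name
          "prompt":  "hello",
          "options": map[string]any{"num_ctx": 8192},
      })
      http.Post("http://localhost:11434/api/generate", "application/json",
          bytes.NewReader(payload))
  }
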
Jesse Gross
b47aa7e75a ggml: Use assigned layers when reporting loading stats
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, there are cases where they can differ,
such as when the GPU backend is missing.
2025-12-29 06:39:42 -06:00
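
A sketch of the reporting change, with illustrative types: count the layers actually assigned to a GPU backend instead of echoing the requested params.NumGPULayers.

  package report

  // Layer is an illustrative stand-in for a model layer and the backend
  // it was assigned to ("gpu" or "cpu").
  type Layer struct{ Backend string }

  // loadedGPULayers returns the number of layers actually on the GPU,
  // which can be lower than the requested count, e.g. when the GPU
  // backend failed to load.
  func loadedGPULayers(layers []Layer) int {
      n := 0
      for _, l := range layers {
          if l.Backend == "gpu" {
              n++
          }
      }
      return n
  }
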
Jesse Gross
015e39a8be ggml: Disable unused pipeline parallelism
We're not currently using it, even in cases where we could. Disabling
it improves generation performance by 10-30% with multiple GPUs.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen
39cec5338a Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load it

* int: testcase for all library models
2025-12-29 06:39:42 -06:00
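
A sketch of the pre-load check, assuming a hypothetical set of supported architectures (the real list lives in the engine itself):

  package engine

  import "fmt"

  // supportedArchitectures is illustrative only.
  var supportedArchitectures = map[string]bool{
      "llama":  true,
      "gemma3": true,
  }

  // verifySupported rejects a model before any load is attempted.
  func verifySupported(arch string) error {
      if !supportedArchitectures[arch] {
          return fmt.Errorf("architecture %q is not supported by the new engine", arch)
      }
      return nil
  }
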
Jesse Gross
387cb031b3 ggml: Report ordinal IDs for AMD GPUs on Windows
We don't get valid UUIDs for AMD GPUs on Windows, so the best option
is to use the ordinal IDs. This brings us in line with what we currently
do on the Ollama server - the only exception is AMD GPUs on Linux, which
fall back to using ordinal IDs. The GGML implementation has no such
fallback, but a missing UUID doesn't appear to occur for any of the GPUs
that we support.

It's also possible that there are collisions between ordinal IDs for
different libraries - however the only places where we use them are
AMD on Windows and Metal on Mac, which can never occur on the same
system.
2025-12-29 06:39:42 -06:00
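
The selection rule reduces to "prefer a valid UUID, otherwise fall back to the ordinal"; a minimal sketch with illustrative names:

  package gpu

  import "strconv"

  // deviceID prefers a valid UUID and falls back to the ordinal index,
  // as described for AMD on Windows and Metal on macOS.
  func deviceID(uuid string, ordinal int) string {
      if uuid != "" {
          return uuid
      }
      return strconv.Itoa(ordinal)
  }
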
Daniel Hiltgen
50e4df359b doc: add MacOS docs (#11334)
Also removes stale model directory instructions for Windows.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen
4fcc030739 Reduce default parallelism to 1 (#11330)
The current scheduler algorithm of picking the parallelism based on available
VRAM complicates the upcoming dynamic layer memory allocation algorithm. This
changes the default to 1, with the intent going forward that parallelism is
explicit and will no longer be dynamically determined. Removal of the dynamic
logic will come in a follow-up.
2025-12-29 06:39:41 -06:00
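
A sketch of the new behavior (OLLAMA_NUM_PARALLEL is the documented override; the surrounding names are illustrative):

  package sched

  import (
      "os"
      "strconv"
  )

  // numParallel returns 1 unless the user explicitly asks for more;
  // no VRAM-based dynamic choice is made anymore.
  func numParallel() int {
      if v := os.Getenv("OLLAMA_NUM_PARALLEL"); v != "" {
          if n, err := strconv.Atoi(v); err == nil && n > 0 {
              return n
          }
      }
      return 1 // new default
  }
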
Daniel Hiltgen
1c94c9919b API/CLI context enhancements (#11331)
* API: expose context size of loaded models

* CLI: add context UX

This adds a column to the ps output to show the model's context size.
2025-12-29 06:39:41 -06:00
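
A sketch of consuming the new field, assuming the context size is exposed on /api/ps under a context_length key (the field name is an assumption here; check the API docs):

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
  )

  func main() {
      resp, err := http.Get("http://localhost:11434/api/ps")
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()
      var ps struct {
          Models []map[string]any `json:"models"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
          panic(err)
      }
      for _, m := range ps.Models {
          // "context_length" is assumed, not confirmed by this log
          fmt.Println(m["name"], m["context_length"])
      }
  }
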
Parth Sareen
25f6571f34 add tool_name to api.md (#11326) 2025-12-29 06:39:41 -06:00
Parth Sareen
1efadee48c template: add tool result compatibility (#11294) 2025-12-29 06:39:41 -06:00
Daniel Hiltgen
fc4cb04cb9 ci: modularization (#11324)
switch a few constants to variables
2025-12-29 06:39:41 -06:00
Jesse Gross
5f139b96ab Revert "ggml: Temporarily disable reporting UUIDs"
The root cause was an unclean upgrade - this code is fine.

This reverts commit 45f216a9c7.
2025-12-29 06:39:41 -06:00
Jeffrey Morgan
ca3520de87 readme: update Ollama icon size 2025-12-29 06:39:40 -06:00
Daniel Hiltgen
55a4a37c3a int: add performance integration tests (#11173)
usage example:
  go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 | tee int.log
  cat int.log | grep MODEL_PERF_HEADER | cut -f2- -d: > perf.csv
  cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
2025-12-29 06:39:40 -06:00
Daniel Hiltgen
ba750172ca doc: add NVIDIA blackwell to supported list (#11307) 2025-12-29 06:39:40 -06:00
Vincent RAMPAL
35bf6c0a41 Update base image to Ubuntu 24.04 LTS (#9681) 2025-12-29 06:39:40 -06:00
Daniel Hiltgen
b23d28b549 doc: Update link for mac install (#11288)
Favor the dmg now.
2025-12-29 06:39:40 -06:00
Daniel Hiltgen
e897624123 mimic logs for layers on new engine (#11278)
This adds some extra logs to make the new engine a bit more consistent
with the llama engine.
2025-12-29 06:39:39 -06:00
XuKecheng
a3e4bb7f58 readme: add NativeMind to community integrations (#11242) 2025-12-29 06:39:39 -06:00
Jeffrey Morgan
9cf8ef9371 tools: fix parsing tool calls with empty arguments, missing required fields (#11233) 2025-12-29 06:39:39 -06:00
Attogram Project
96be53fe6c readme: add ollama-bash-toolshed to community integrations (#11224) 2025-12-29 06:39:39 -06:00
Michael Yang
1cdab47113 chore: cleanup comments + unused vars (#11225) 2025-12-29 06:39:39 -06:00
Jesse Gross
872d190c8f ggml: Temporarily disable reporting UUIDs
This is causing segfaults, so disable it. Currently UUIDs are only
used for debugging purposes, although they are planned to be used in
additional ways in the future.

Bug #11211
2025-12-29 06:39:39 -06:00
Michael Yang
8f2099306f skip quantizing per_layer_token_embd (#11207)
This tensor isn't compatible with CUDA when quantized to q4_K, so skip it.
2025-12-29 06:39:38 -06:00
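
A sketch of the skip, with illustrative types: tensors whose name matches per_layer_token_embd keep their original type instead of being quantized to q4_K.

  package quant

  import "strings"

  // targetType returns the quantization target for a tensor, leaving
  // per_layer_token_embd untouched since q4_K breaks it on CUDA.
  func targetType(name, requested, original string) string {
      if strings.Contains(name, "per_layer_token_embd") {
          return original // skip quantization for this tensor
      }
      return requested
  }
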
Daniel Hiltgen
59112600d1 ci: multi-stage release process (#11001) 2025-12-29 06:39:38 -06:00
Jeffrey Morgan
10119ec2ee fs/ggml: add multiplier in graph estimates (#11208) 2025-12-29 06:39:38 -06:00
Jeffrey Morgan
84998ae4ba fs/ggml: add missing architecture to OllamaEngineRequired() (#11206) 2025-12-29 06:39:38 -06:00
Michael Yang
801564fa8b add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-12-29 06:39:38 -06:00
Daniel Hiltgen
d6253f09c2 ci: arm sbsa fixes (#11194) 2025-12-29 06:39:37 -06:00
Daniel Hiltgen
9cf1db79b4 ci: include dependencies 2025-12-29 06:39:37 -06:00
Daniel Hiltgen
46654149c9 ci: pick up arm sbsa cuda libs (#11192) 2025-12-29 06:39:37 -06:00
Daniel Hiltgen
138c973d8f ci: recombine linux amd64 binaries (#11188)
Glue the rocm and archive builds back together.
2025-12-29 06:39:37 -06:00
Devon Rifkin
dd8d037c16 load arrays with up to 1024 elements when estimating
This mirrors the old behavior before #10382
2025-12-29 06:39:37 -06:00
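
The cap amounts to truncating array reads during estimation; a sketch with illustrative names (the 1024 constant is from the message):

  package estimate

  // capForEstimate limits how much of a GGUF array is materialized when
  // it is only needed for a memory estimate, mirroring the pre-#10382 cap.
  func capForEstimate(arrayLen int) int {
      const maxElems = 1024
      if arrayLen > maxElems {
          return maxElems
      }
      return arrayLen
  }
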
Devon Rifkin
558c1920fa ggml: fix crash for array head counts
If it's an array, it uses the max value in the array

If array values for head counts become more popular, we can consider a
more invasive change like #10225 to calculate more accurate estimates.

Fixes: #9984
2025-12-29 06:39:34 -06:00
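
A sketch of the described handling, with illustrative types: scalar head counts pass through, arrays contribute their maximum.

  package estimate

  // headCount accepts either a scalar head count or a per-layer array
  // and returns the value used for the estimate (the max, per the fix).
  func headCount(v any) uint64 {
      switch h := v.(type) {
      case uint64:
          return h
      case []uint64:
          var best uint64
          for _, x := range h {
              if x > best {
                  best = x
              }
          }
          return best
      }
      return 0 // unknown encoding; caller decides how to handle it
  }
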
Daniel Hiltgen
b9b179fe00 ci: rocm parallel builds on windows (#11187)
The preset CMAKE_HIP_FLAGS isn't getting used on Windows.
This passes the parallel flag in through the C/CXX flags, along
with suppression of some log-spew warnings to quiet down the build.
2025-12-29 06:38:19 -06:00
Daniel Hiltgen
38f92e7332 CI: switch windows to vs 2022 (#11184)
* CI: switch windows to vs 2022

* ci: fix regex match
2025-12-29 06:38:18 -06:00
Daniel Hiltgen
c012d1805b avoid context overflow (#11175)
For smaller context models, make sure we do not exceed the training size.
2025-12-29 06:38:18 -06:00
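
The guard amounts to clamping the requested context to the training context; a sketch with illustrative names:

  package server

  // clampContext keeps the effective context within the model's
  // training context, avoiding overflow on smaller-context models.
  func clampContext(requested, train int) int {
      if train > 0 && requested > train {
          return train
      }
      return requested
  }
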
Daniel Hiltgen
29ec3ddf9a Re-remove cuda v11 (#10694)
* Re-remove cuda v11

Revert the revert - drop v11 support, requiring drivers newer than Feb '23

This reverts commit c6bcdc4223.

* Simplify layout

With only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling.)

* distinct sbsa variant for linux arm64

This avoids accidentally trying to load the sbsa CUDA libraries on
a Jetson system, which results in crashes.

* temporarily prevent rocm+cuda mixed loading
2025-12-29 06:38:18 -06:00
AJ
d8b03acc1a readme: add ai-hub to community integrations (#11169) 2025-12-29 06:38:18 -06:00
Daniel Hiltgen
95571375dd build speedups (#11142)
Enable parallel building of the GPU architectures.
2025-12-29 06:38:18 -06:00
Michael Yang
69ee842b6e convert: utility for merging tensors (#11069) 2025-12-29 06:38:17 -06:00
Michael Yang
4585d231ee Reapply "feat: incremental gguf parser (#10822)" (#11114) (#11119)
* Reapply "feat: incremental gguf parser (#10822)" (#11114)

This reverts commit a6e64fbdf2.

* fix older ggufs
2025-12-29 06:38:17 -06:00
Jesse Gross
290d4c2c6c ggml: Check return status for computation.
We don't check the return status after computing the graph, which
can silently lead to bad outputs if we try to keep going and future
computation succeeds. This appears to happen in certain cases on
Apple M2 devices.

Fixes #11070
2025-12-29 06:38:17 -06:00
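
A sketch of the missing check, assuming cgo bindings to ggml (ggml_backend_graph_compute returns a ggml_status); this is a fragment, not a complete file:

  // inside a cgo file with the ggml headers available
  status := C.ggml_backend_graph_compute(backend, graph)
  if status != C.GGML_STATUS_SUCCESS {
      panic(fmt.Errorf("graph computation failed (status %d)", status))
  }
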
Daniel Hiltgen
29b668e649 int: add coverage for older models (#11137)
Verified these fail on 0.9.1 and pass on HEAD.
2025-12-29 06:38:17 -06:00