Commit Graph

4886 Commits

Grace 809b9c68fa
Add Tool Call ID (#12956)
* routes/types: add tool call id

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2025-11-04 16:43:33 -08:00
Daniel Hiltgen ba8c035846
log: instrument CPU discovery timing (#12960) 2025-11-04 16:23:37 -08:00
Daniel Hiltgen 27f1fde413
discovery: only retry AMD GPUs (#12894)
* discovery: only retry AMD GPUs

CUDA and Vulkan don't crash on unsupported devices, so retry isn't necessary.
This also refactors the code to shift the Library-specific logic into the ml
package.

* review comments
2025-11-04 15:33:46 -08:00
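A minimal Go sketch of the retry policy described in the commit above; `Device`, `Library`, and `needsRetry` are illustrative names, not Ollama's actual ml package API:

```go
// Hypothetical sketch of a per-library retry policy: only ROCm/AMD
// devices are retried after a failed probe, since CUDA and Vulkan
// fail gracefully on unsupported devices. All names are illustrative.
package main

import "fmt"

type Device struct {
	ID      string
	Library string // e.g. "rocm", "cuda", "vulkan"
}

// needsRetry reports whether a failed discovery probe for this
// device is worth retrying with a narrower device filter.
func needsRetry(d Device) bool {
	return d.Library == "rocm" // AMD can crash the probe; others fail cleanly
}

func main() {
	for _, d := range []Device{{"0", "rocm"}, {"1", "cuda"}} {
		fmt.Printf("device %s (%s): retry=%v\n", d.ID, d.Library, needsRetry(d))
	}
}
```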
virajwad 220e133fca
vulkan: Add memory detection for Intel GPU using DXGI+PDH (#12664)
* PDH free memory skeleton

* Add PDH printing

* Add LUID support for Vulkan

* Wire LUID from ggml-vulkan to the mem-dxgi-pdh file

* Fix to ggml-impl

* Continue skeleton

* Implement ggml_dxgi_pdh_get_device_memory

* Fix comments

* Fix: report values in bytes rather than GB

* Add ifdefs to support only Windows, not Linux

* Modify error codes

* Finish the ggml_dxgi_pdh_init() function

* Complete ggml_dxgi_pdh_release()

* Formatting changes; make functions static

* Fix build errors

* Fix Go build error

* Fix LUIDs so they match between DXGI and Vulkan

* Fix free memory reporting (was copying by value; changed to a reference)

* Keep only dxgi1_2.h

* Modifications based on PR feedback

* Fix merge conflicts (2) and fix the desc1.description printout

* Move the DXGI + PDH API calls before the vendor-specific library calls

* Change from 3 samples to 1 sample for PDH

* Modify when old_mode is set

* Fix macOS build

* Fix release and returns for other vendors

* Add patch file
2025-11-04 14:11:55 -08:00
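A conceptual Go sketch of the DXGI+PDH approach from the commit above: match a Vulkan device to its DXGI adapter by LUID, then derive free memory from the usage counter PDH reports for that adapter. The struct fields and `freeVRAM` helper are illustrative stand-ins, not the real ggml code:

```go
// Illustrative sketch only: free VRAM is the DXGI-reported budget minus
// the PDH-reported dedicated usage for the adapter whose LUID matches
// the Vulkan device.
package main

import "fmt"

type adapter struct {
	luid           string // DXGI adapter LUID, e.g. "0x9C42"
	dedicatedBytes uint64 // budget reported by DXGI
	usedBytes      uint64 // dedicated usage sampled via PDH
}

// freeVRAM matches the Vulkan device LUID against DXGI adapters and
// derives free memory from the sampled usage counter.
func freeVRAM(vulkanLUID string, adapters []adapter) (uint64, bool) {
	for _, a := range adapters {
		if a.luid != vulkanLUID {
			continue
		}
		if a.usedBytes > a.dedicatedBytes {
			return 0, true // counter raced past the budget; clamp to zero
		}
		return a.dedicatedBytes - a.usedBytes, true
	}
	return 0, false // no adapter with a matching LUID
}

func main() {
	adapters := []adapter{{luid: "0x9C42", dedicatedBytes: 16 << 30, usedBytes: 4 << 30}}
	if free, ok := freeVRAM("0x9C42", adapters); ok {
		fmt.Printf("free VRAM: %d GiB\n", free>>30)
	}
}
```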
Daniel Hiltgen d3b4b9970a
app: add code for macOS and Windows apps under 'app' (#12933)
* app: add code for macOS and Windows apps under 'app'

* app: add readme

* app: windows and linux only for now

* ci: fix ui CI validation

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2025-11-04 11:40:17 -08:00
Daniel Hiltgen a4770107a6
vulkan: enable flash attention (#12937)
Also adjusts the Vulkan Windows build pattern to match recent changes in other backends
so incremental builds are faster.
2025-11-04 10:31:22 -08:00
Jesse Gross ef549d513c ggml: Increase maximum graph size
The initial implementation of qwen3-vl:235b exceeded the maximum graph
size based on the number of tensors. Although this was later fixed
through the use of the mrope operation, we are close to the limit in
some cases. This updates the limit to track llama.cpp's current usage of GGML.
2025-11-03 16:05:37 -08:00
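A sketch of the kind of limit being tracked; the constants below are made up for illustration and are not GGML's actual values:

```go
// Illustrative sketch: the maximum graph size is derived from the
// tensor count rather than a fixed constant, mirroring (not copying)
// how llama.cpp sizes its GGML graphs.
package main

import "fmt"

func maxGraphNodes(numTensors int) int {
	const floor = 8192 // assumed minimum; the real value may differ
	if n := 5 * numTensors; n > floor {
		return n
	}
	return floor
}

func main() {
	fmt.Println(maxGraphNodes(400))  // small model: the floor wins
	fmt.Println(maxGraphNodes(4000)) // qwen3-vl-sized: scales with tensors
}
```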
Rajath Bail d2158ca6f4
readme: add Hillnote to community integrations (#12929) 2025-11-03 12:55:04 -08:00
Michael Yang ce3eb0a315
chore(gptoss): cleanup dead code (#12932) 2025-11-03 11:27:15 -08:00
Ryan Coleman 60829f7ec6
readme: add Strands Agents to community integrations (#11740) 2025-11-02 16:01:28 -08:00
Attogram Project 9a50fd584c
readme: add Ollama Bash Lib to community integrations (#12235) 2025-11-02 15:44:56 -08:00
Jesse Gross 392a270261 ggml: Avoid cudaMemsetAsync during memory fitting
We pass invalid pointers when we check the size of the required
compute graph before fitting. Some CUDA APIs validate these pointers
but we can simply skip them during this phase. cudaMemsetAsync is one
of these that we weren't skipping, though previously we never took the
code path that used it. Now that op_offload is enabled, we can hit it
in memory-pressured situations.
2025-10-31 15:23:28 -07:00
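A conceptual Go sketch of the fitting phase described above: while only measuring graph size, buffer pointers are fake, so any call that would touch or validate them must be skipped. The `buffer` type is a stand-in for the real CUDA backend logic:

```go
// Illustrative sketch: during the measurement ("fitting") pass, a
// memset-style call returns early instead of touching memory, just as
// cudaMemsetAsync must be skipped when pointers are not yet real.
package main

import "fmt"

type buffer struct {
	fitting bool // true while estimating memory, before real allocation
	data    []byte
}

func (b *buffer) memset(v byte) {
	if b.fitting {
		return // pointer may be invalid during fitting; skip the write
	}
	for i := range b.data {
		b.data[i] = v
	}
}

func main() {
	b := &buffer{fitting: true}
	b.memset(0) // no-op during the measurement pass
	fmt.Println("fitting pass completed without touching memory")
}
```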
Daniel Hiltgen 3bee3af6ed
cpu: always ensure LibOllamaPath included (#12890)
In CPU-only setups, LibOllamaPath was omitted, causing the
ggml-cpu-XXX libraries not to be loaded during inference.
2025-10-31 14:37:29 -07:00
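A small Go sketch of the fix's shape; `LibOllamaPath`, its location, and `libraryPaths` are simplified stand-ins for the real discovery code:

```go
// Illustrative sketch: the ollama library directory is always appended
// to the search path so the ggml-cpu-* backends can be found even when
// no GPU library directories were discovered.
package main

import (
	"fmt"
	"path/filepath"
	"slices"
)

var LibOllamaPath = filepath.Join("/usr/local/lib", "ollama") // assumed location

func libraryPaths(gpuDirs []string) []string {
	if !slices.Contains(gpuDirs, LibOllamaPath) {
		gpuDirs = append(gpuDirs, LibOllamaPath)
	}
	return gpuDirs
}

func main() {
	fmt.Println(libraryPaths(nil)) // CPU-only: still includes LibOllamaPath
}
```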
Daniel Hiltgen 83537993d7
logs: catch rocm errors (#12888)
This will help bubble up more crash errors.
2025-10-31 09:54:25 -07:00
nicole pardal 7dd4862a89
embeddings: removed redundant TestAPIEmbeddings test (#12863)
This PR removes a redundant test from TestAPIEmbeddings.
The contents of this test already exist in embed_test.go and model_arch_test.go.
2025-10-30 17:12:33 -07:00
Daniel Hiltgen db973c8fc2
win: avoid ID mixups on refresh (#12869)
On Windows, AMD IDs are numeric and can reorder based on the filter environment.
By passing the filter env on a full discovery refresh, we only look at the actual
devices and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was
incorrectly used to populate the dGPU.
2025-10-30 15:12:14 -07:00
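A Go sketch of applying a device-filter environment value on refresh so unsupported iGPUs are never probed; the ID format and `visibleDevices` helper are placeholders, not the exact mechanism Ollama uses for AMD on Windows:

```go
// Illustrative sketch: keep only the numeric device IDs listed in the
// filter, e.g. filter "0,2" hides device "1" (an unsupported iGPU).
package main

import (
	"fmt"
	"strings"
)

func visibleDevices(all []string, filter string) []string {
	if filter == "" {
		return all
	}
	keep := map[string]bool{}
	for _, id := range strings.Split(filter, ",") {
		keep[strings.TrimSpace(id)] = true
	}
	var out []string
	for _, id := range all {
		if keep[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	fmt.Println(visibleDevices([]string{"0", "1", "2"}, "0,2")) // [0 2]
}
```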
Jesse Gross afaf7ce8c3 ggml: Enable op_offload to improve partial offload performance
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.

The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).

There is an existing heuristic to dynamically switch between these
two modes but in practice it doesn't have enough information to
accurately make that decision. This adds authoritative data to make
the check work to get the best of both worlds.

Fixes #12037
2025-10-30 13:53:10 -07:00
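A Go sketch of the batch-size heuristic described above: with a partial offload, small batches stay on the CPU and large ones are shipped to the GPU. The threshold constant is illustrative; per the commit, the real decision uses authoritative data rather than a bare constant:

```go
// Illustrative sketch of op_offload's effect: choose where to run the
// computation based on batch size when the model is partially offloaded.
package main

import "fmt"

func runOnGPU(batchTokens int, partialOffload bool) bool {
	if !partialOffload {
		return true // fully offloaded: always GPU
	}
	const threshold = 32 // assumed crossover point, for illustration only
	return batchTokens >= threshold
}

func main() {
	fmt.Println(runOnGPU(1, true))   // token generation: CPU
	fmt.Println(runOnGPU(512, true)) // prompt processing: GPU
}
```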
Jesse Gross 26465fb85f ollamarunner: Worst case batch for token generation
We currently allocate the worst-case batch only for max-sized
batches, which corresponds to prompt processing. However,
there are some cases where the generated graph is different
for small and large batches. To ensure that we don't need
to allocate memory later after layout has taken place, we
should run the worst case batch both ways and take the larger
amount of memory.

This does not noticeably affect loading speed as the most expensive
part of this logic is from image processing and that does not
occur during token generation.
2025-10-30 13:53:10 -07:00
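A Go sketch of the reservation rule above: measure the worst-case graph for a prompt-processing batch and for a single-token generation batch, then reserve the larger. The sizes below are fabricated for illustration:

```go
// Illustrative sketch: small and large batches can produce different
// graphs, so neither is guaranteed to need less memory than the other;
// reserve the max of both measurements.
package main

import "fmt"

// measured stands in for running graph construction on a worst-case batch.
var measured = map[int]uint64{
	512: 5 << 20, // prompt-processing batch
	1:   6 << 20, // token-generation batch: larger in this example
}

func reserveBytes(maxBatch int) uint64 {
	prompt, gen := measured[maxBatch], measured[1]
	if gen > prompt {
		return gen
	}
	return prompt
}

func main() {
	fmt.Printf("reserving %d bytes\n", reserveBytes(512))
}
```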
Daniel Hiltgen 88236bc05f
win: use copy for subprocess logs (#12864)
Windows gets confused when we try to hand the stderr file descriptor to
subprocess children. Copying the output instead ensures the log output
always shows up.
2025-10-30 13:22:00 -07:00
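A minimal Go sketch of the copy approach: rather than sharing our stderr handle with the child, read the child's stderr through a pipe and copy it to our own output. The `go version` child here is just a stand-in process:

```go
// Illustrative sketch: pipe the subprocess's stderr and copy it to ours
// instead of handing over the file descriptor directly.
package main

import (
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("go", "version") // any subprocess works here
	stderr, err := cmd.StderrPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	// copy until EOF before Wait, per the os/exec pipe contract
	if _, err := io.Copy(os.Stderr, stderr); err != nil {
		log.Fatal(err)
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```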
Patrick Devine 76eb7d0fff
testing: test more models with tool calling (#12867) 2025-10-30 13:19:21 -07:00
Michael Yang f67a6df110
interleaved mrope (#12807)
* ml(ggml): mrope
* interleave mrope
2025-10-30 11:29:00 -07:00
Michael Yang 75e75d9afe
qwen3vl: enable flash attention by default (#12862) 2025-10-30 10:51:37 -07:00
Michael Yang ed78e127d0
fix(cmd): unload model before removal (#12832)
This change fixes two bugs with `ollama rm`:

1. Before a model is removed, it is first stopped, but this only
   happened for the first argument and was skipped for all other models.
2. Models were unloaded indiscriminately, which errors for cloud models
   and should be skipped.
2025-10-30 10:41:49 -07:00
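A Go sketch of the corrected flow: every named model is stopped before deletion, and cloud models are skipped when unloading. `stopModel`, `deleteModel`, and `isCloud` are illustrative stand-ins, not Ollama's CLI internals:

```go
// Illustrative sketch of the fixed `ollama rm` loop.
package main

import "fmt"

func isCloud(model string) bool { return model == "qwen3-coder:480b-cloud" } // stand-in

func stopModel(model string) error { fmt.Println("stopping", model); return nil }

func deleteModel(model string) error { fmt.Println("deleting", model); return nil }

func removeAll(models []string) error {
	for _, m := range models { // previously only the first argument was stopped
		if !isCloud(m) { // unloading a cloud model errors, so skip it
			if err := stopModel(m); err != nil {
				return err
			}
		}
		if err := deleteModel(m); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	removeAll([]string{"llama3.2", "qwen3-coder:480b-cloud"})
}
```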
Michael Yang d432ade714
fix: qwen2.5vl, qwen3vl composite image (#12841)
This change fixes images with an alpha channel by overlaying the image
onto a white background.
2025-10-30 10:33:19 -07:00
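A self-contained Go sketch of the same idea using the standard library's `image/draw`: fill with opaque white, then composite the source over it. This illustrates the technique, not Ollama's exact image pipeline:

```go
// Flatten an image with an alpha channel onto a white background.
package main

import (
	"fmt"
	"image"
	"image/color"
	"image/draw"
)

func flattenOnWhite(src image.Image) *image.RGBA {
	b := src.Bounds()
	dst := image.NewRGBA(b)
	// fill with opaque white, then composite the source over it
	draw.Draw(dst, b, &image.Uniform{color.White}, image.Point{}, draw.Src)
	draw.Draw(dst, b, src, b.Min, draw.Over)
	return dst
}

func main() {
	src := image.NewNRGBA(image.Rect(0, 0, 2, 2)) // fully transparent input
	out := flattenOnWhite(src)
	fmt.Println(out.At(0, 0)) // white, not black, after flattening
}
```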
Michael Yang 06b3422d5f
tests: add tests and docs for commonly used ops (#12844)
* mulmat
* permute
2025-10-30 10:32:45 -07:00
Athiban Sharon cbe1cf06c4
Update README.md (#12822)
Fixed broken docs links
2025-10-30 13:14:39 -04:00
Grace 0a2d92081b
Removing whitespace between Thinking and Content in Qwen3VL (#12838)
Trims extra whitespace at the end/beginning of content.
2025-10-29 15:14:28 -07:00
Daniel Hiltgen c88647104d
int: harden server lifecycle (#12835)
This should reduce zombie processes during integration runs.
2025-10-29 11:50:56 -07:00
Patrick Devine 05aff4a4f1
tests: fix embeddinggemma integration test (#12830) 2025-10-29 11:07:28 -07:00
Michael Yang 0d140bd1af
fix: conv2d bias (#12834) 2025-10-29 11:03:43 -07:00
Jeffrey Morgan 93e45f0f0d
docs: temporarily restore api.md and cleanup docs paths (#12818) 2025-10-28 23:25:48 -07:00
Jeffrey Morgan a342160803
docs: fix root api documentation page (#12813) 2025-10-28 19:17:54 -07:00
Jeffrey Morgan f6c29409dc
docs: add new cloud model + fix openai redirect (#12812) 2025-10-28 19:09:07 -07:00
Michael Yang 7d25b9e194
feat(model): add qwen3vl (#12665) 2025-10-28 17:39:47 -07:00
Patrick Devine 36d64fb531
embed: add distance correlation test for library embed models (#12796) 2025-10-28 16:57:27 -07:00
Parth Sareen d828517e78
docs: update readme and links (#12809) 2025-10-28 16:20:02 -07:00
Daniel Hiltgen 14977a9350
Fix vulkan PCI ID and ID handling (#12775)
* Fix vulkan PCI ID and ID handling

Intel GPUs may not report PCI IDs, which was leading to incorrect overlap
detection. This switches to using the existing PCI IDs. AMD GPUs claim not to
report PCI IDs but actually do, so we try anyway, as this is required for ADLX
to find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this
also switches Vulkan to UUID-based IDs. The GPU discovery patches have been
squashed into a single patch to simplify future rebases.

* review comments
2025-10-28 15:15:35 -07:00
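A Go sketch of preferring stable UUID-based device IDs over numeric indices, which can reorder between discovery passes. The field names are illustrative, not the discovery package's actual types:

```go
// Illustrative sketch: use the driver-reported UUID as the scheduling
// ID when available, since enumeration order is unstable.
package main

import "fmt"

type gpu struct {
	index int    // enumeration order: unstable across refreshes
	uuid  string // stable identifier from the driver, may be empty
}

func deviceID(g gpu) string {
	if g.uuid != "" {
		return g.uuid
	}
	return fmt.Sprintf("%d", g.index) // fallback for drivers without UUIDs
}

func main() {
	fmt.Println(deviceID(gpu{index: 0, uuid: "GPU-9f8d7c6b"}))
	fmt.Println(deviceID(gpu{index: 1})) // no UUID reported
}
```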
Patrick Devine 29f63f37c8
Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)
This reverts commit 5d347f6d6f.
2025-10-28 14:49:14 -07:00
Parth Sareen 3d99d9779a
docs: add docs for docs.ollama.com (#12805) 2025-10-28 13:18:48 -07:00
Parth Sareen 6d02a43a75
docs: rename to mdx to setup docs site (#12804) 2025-10-28 13:04:31 -07:00
Parth Sareen 5483497d7a
Revert "docs: add reference to docs.ollama.com (#12800)" (#12803)
This reverts commit 934dd9e196.
2025-10-28 12:52:49 -07:00
Parth Sareen 934dd9e196
docs: add reference to docs.ollama.com (#12800) 2025-10-28 12:44:02 -07:00
Michael Yang 1188f408dd
s/From*Slice/From*s/ (#12255) 2025-10-28 12:08:49 -07:00
nicole pardal 15c7d30d9a
embedding tests: added check against exact base64 string (#12790) 2025-10-28 10:37:20 -07:00
Devon Rifkin 9862317174
Merge pull request #12793 from ollama/drifkin/12792_renderer-parser-from
create: inherit FROM model's renderer/parser
2025-10-28 00:15:46 -07:00
Michael Yang ec9eb28f4c
gemma3: make embedding non-causal (#12297) 2025-10-27 19:54:08 -07:00
Devon Rifkin 1bdd816910 create: inherit FROM model's renderer/parser
On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
get propagated to a new model created with a `req.From` parameter. This
is easily triggered via `ollama run qwen3-coder`, then running some save
command like `/save qwen3-coder-custom`.

Added a regression test for this, and then opened the config for the
"from" model in order to use its renderer/parser as a default for the
new model. This fixes both the CLI and API-based creates.

Fixes: https://github.com/ollama/ollama/issues/12792
2025-10-27 15:14:19 -07:00
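A Go sketch of the defaulting rule from this commit: a model created FROM another model inherits the base's renderer and parser unless the new Modelfile overrides them. The `config` struct is illustrative, not Ollama's actual model config type:

```go
// Illustrative sketch of renderer/parser inheritance on create.
package main

import "fmt"

type config struct {
	Renderer, Parser string
}

func newConfig(req, from config) config {
	if req.Renderer == "" {
		req.Renderer = from.Renderer // inherit from the FROM model
	}
	if req.Parser == "" {
		req.Parser = from.Parser
	}
	return req
}

func main() {
	base := config{Renderer: "qwen3-coder", Parser: "qwen3-coder"}
	fmt.Println(newConfig(config{}, base)) // /save without overrides
}
```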
nicole pardal 5d347f6d6f
server: Consolidate embedding truncation in runner (#12730)
Currently, checking the length of embedding prompts to ensure they fit
in the context window (and possibly truncating them) occurs in two
places: the Ollama server and the runner. This can lead to
inconsistencies in both the checks and the reported number of tokens
processed. Since we have to do this processing in the runner anyway,
this consolidates all of the logic there.
2025-10-27 11:59:12 -07:00
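A Go sketch of the consolidated check: one place decides whether an embedding prompt fits the context window and truncates it, so the server and runner cannot disagree on token counts. The `truncate` helper is illustrative:

```go
// Illustrative sketch: length check and truncation in a single place.
package main

import (
	"errors"
	"fmt"
)

func truncate(tokens []int, numCtx int, allow bool) ([]int, error) {
	if len(tokens) <= numCtx {
		return tokens, nil
	}
	if !allow {
		return nil, errors.New("input exceeds context window")
	}
	return tokens[:numCtx], nil // keep the leading tokens
}

func main() {
	toks := make([]int, 10)
	out, err := truncate(toks, 8, true)
	fmt.Println(len(out), err) // 8 <nil>
}
```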
Patrick Devine b97eb2b858
cloud: set the proxy content-type to the same as local models (#12759) 2025-10-25 10:57:10 -07:00
Jesse Gross ad6f6a1d29 llm: Change memory allocation backoff from exponential to incremental
If we create a memory layout that should fit based on reported free VRAM
but allocation still fails, we start applying a backoff. This reduces the
assumed free VRAM by an exponential percentage (1%, 2%, 4%, ...). However,
the points chosen tend to be too dense at the beginning and too sparse at
the end. Therefore, this switches to an incremental backoff (10%, 20%,
30%, ...).
2025-10-23 12:58:31 -07:00
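A Go sketch contrasting the two backoff schedules from this commit, the old exponential steps versus the new incremental ones used to shrink assumed-free VRAM after a failed allocation:

```go
// Compare the exponential (1%, 2%, 4%, ...) and incremental
// (10%, 20%, 30%, ...) backoff schedules.
package main

import "fmt"

func exponential(attempt int) float64 { return 0.01 * float64(1<<attempt) }

func incremental(attempt int) float64 { return 0.10 * float64(attempt+1) }

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: exponential=%.0f%% incremental=%.0f%%\n",
			attempt, 100*exponential(attempt), 100*incremental(attempt))
	}
}
```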