Commit Graph

4844 Commits

Author SHA1 Message Date
Michael Yang 18087f2ec7 Revert "use llama runner for qwen3 (#12556)"
This reverts commit 3d32249c74.
2025-10-13 13:30:30 -07:00
Michael Yang 6c833d5f8d fix(qwen3): deepseek distill
DeepSeek's Qwen3 distill uses a different RoPE scheme, so support both
2025-10-13 13:30:30 -07:00
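A minimal sketch of what supporting both schemes could look like: dispatch on the model's metadata rather than assuming a single RoPE variant. The config type, the "deepseek-distill" key, and the stub paths are illustrative assumptions, not Ollama's actual model code.

```go
package main

import "fmt"

type ropeConfig struct {
	Scheme string // e.g. "qwen3" or "deepseek-distill" (hypothetical values)
}

// ropePath picks which RoPE implementation to run based on model metadata.
func ropePath(cfg ropeConfig) string {
	switch cfg.Scheme {
	case "deepseek-distill":
		return "DeepSeek distill RoPE path"
	default:
		return "default Qwen3 RoPE path"
	}
}

func main() {
	fmt.Println(ropePath(ropeConfig{Scheme: "qwen3"}))
	fmt.Println(ropePath(ropeConfig{Scheme: "deepseek-distill"}))
}
```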
Jeffrey Morgan 6544e14735 Reapply "add truncate and shift parameters" (#12582) 2025-10-11 16:06:14 -07:00
Devon Rifkin 5db8a818a1 Merge pull request #12581 from ollama/drifkin/renderer-api-generate
routes: fix built-in renderers for `api/generate`
2025-10-11 14:10:23 -07:00
Devon Rifkin 6db8da9958 routes: fix built-in renderers for `api/generate`
When `api/generate` builds up a message array and generates the prompt,
it now goes through the same function as `api/chat`, for consistency.
That shared function is where the optional built-in renderers hook in to
bypass templates, which was missing for `api/generate` before this
change.

Closes: #12578
2025-10-11 13:57:43 -07:00
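A minimal sketch of the shared-path idea: both endpoints build a message slice and feed it through one prompt function, so the built-in renderer hook applies to both. The `Message` type, `renderPrompt`, and `builtinRenderer` are illustrative names, not the real routes package.

```go
package main

import "fmt"

type Message struct{ Role, Content string }

// renderPrompt is the single place where either a built-in renderer or a
// template (elided here) turns messages into a prompt string.
func renderPrompt(msgs []Message, builtinRenderer func([]Message) string) string {
	if builtinRenderer != nil {
		return builtinRenderer(msgs) // bypasses templates
	}
	return fmt.Sprintf("%v", msgs) // stand-in for template rendering
}

func main() {
	// api/generate now builds messages and calls the same function api/chat uses.
	msgs := []Message{{Role: "user", Content: "hello"}}
	fmt.Println(renderPrompt(msgs, nil))
}
```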
frob 0c68ec8d6a discover: fix typo (#12565) 2025-10-11 12:06:02 -07:00
Daniel Hiltgen 70d9e363e1 doc: remove AMD EOL GPUs (#12567) 2025-10-10 17:16:29 -07:00
Michael Yang 1a2feb2a97 ollamarunner: fix deadlock
Sending on hardErrCh will deadlock since forwardBatch is blocked on
computeStartedCh, which never gets sent to. Since the only response to
hardErrCh is to panic, just panic instead.
2025-10-10 16:49:57 -07:00
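A toy illustration of the deadlock shape and the fix: sending a hard error on a channel whose receiver is itself blocked hangs forever, whereas panicking surfaces the error immediately. Only the channel names come from the commit message; everything else is illustrative.

```go
package main

import "errors"

// reportHardError used to send on hardErrCh; if the receiving goroutine is
// blocked elsewhere (as forwardBatch was on computeStartedCh), that send
// never completes. Since the handler's only action was to panic anyway,
// panicking directly avoids the deadlock.
func reportHardError(err error) {
	panic(err)
}

func main() {
	defer func() { _ = recover() }() // keep this demo from actually crashing
	reportHardError(errors.New("unrecoverable runner error"))
}
```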
Daniel Hiltgen aab2190420 implement nvml for linux (#12517)
* implement nvml for linux

* Improve scheduler logging when VRAM doesn't recover
2025-10-10 15:15:56 -07:00
Michael Yang 629db9dc43 comment split 2025-10-10 13:25:34 -07:00
Michael Yang e0cd511661 fix test 2025-10-10 13:25:34 -07:00
Michael Yang 207332078f fix lint 2025-10-10 13:25:34 -07:00
Michael Yang 93085127f4 convert: slice gate_up weight 2025-10-10 13:25:34 -07:00
Michael Yang c00fa9cc2b convert: split gate_up bias 2025-10-10 13:25:34 -07:00
yajianggroup df411c4b02 refactor: use testing.B.Loop
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
2025-10-10 13:25:29 -07:00
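For reference, this is the `testing.B.Loop` pattern (Go 1.24+) that the refactor moves benchmarks to, shown as a standalone `_test.go` sketch rather than an actual benchmark from the repo.

```go
package demo

import (
	"strings"
	"testing"
)

func BenchmarkJoin(b *testing.B) {
	parts := []string{"a", "b", "c"}
	// b.Loop replaces the classic `for i := 0; i < b.N; i++` loop and
	// handles iteration counting and timing itself.
	for b.Loop() {
		_ = strings.Join(parts, ",")
	}
}
```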
Jeffrey Morgan 3d32249c74 use llama runner for qwen3 (#12556) 2025-10-09 19:08:21 -07:00
Patrick Devine d681cd7c29 thinking: allow `"think": false` for non-thinking models (#12555) 2025-10-09 18:46:00 -07:00
shengxinjing 47298fce39 refactor: use builtin max and min 2025-10-09 16:17:52 -07:00
shengxinjing 4a48937ef1 refactor: use builtin max and min 2025-10-09 16:17:52 -07:00
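The shape of this refactor: Go 1.21's built-in `max` and `min` replace hand-rolled helpers. The `oldMax` helper below is shown only for contrast and is not taken from the codebase.

```go
package main

import "fmt"

// oldMax is the kind of hand-rolled helper the builtins make unnecessary.
func oldMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(oldMax(3, 7)) // 7
	fmt.Println(max(3, 7))    // 7, via the Go 1.21 builtin
	fmt.Println(min(3, 7))    // 3
}
```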
Michael Yang 967a82f52f ollamarunner: measure only active time 2025-10-09 15:44:04 -07:00
Michael Yang bbbc73d637 llamarunner: update metrics
This change updates how metrics are collected. Until now, performance
metrics, specifically the initial input processing and subsequent
generation durations, were collected by taking timestamps at sequence
creation, first token generation, and generation completion. The
processing duration was computed as first token generation minus
sequence creation, while the generation duration was generation
completion minus first token generation.

While this approach is an accurate end-to-end measure of processing and
generation, it's not comparable to other tools, which only measure the
active (i.e. decode) duration.

This change updates the metrics to only capture decode duration so they
can be compared more directly to other tools.
2025-10-09 15:44:04 -07:00
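A minimal sketch of decode-only timing as described above: accumulate only the time spent inside decode calls rather than wall-clock time from sequence creation to completion. The `sequence` struct and `decodeStep` are illustrative, not the runner's actual types.

```go
package main

import (
	"fmt"
	"time"
)

type sequence struct {
	decodeTotal time.Duration // active decode time only
}

func (s *sequence) decodeStep() {
	start := time.Now()
	time.Sleep(2 * time.Millisecond) // stand-in for the real decode work
	s.decodeTotal += time.Since(start)
}

func main() {
	s := &sequence{}
	for i := 0; i < 3; i++ {
		s.decodeStep()
		time.Sleep(5 * time.Millisecond) // idle/wait time is not counted
	}
	fmt.Println("active decode time:", s.decodeTotal)
}
```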
Daniel Hiltgen 15e3611d3d logs: quiet down context canceled on completion and scheduler noise (#12553)
* logs: quiet down context canceled on completion

If the client closes the connection before Completion finishes, we were
logging at error level, implying the runner had crashed, which was misleading.

time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"

* quiet down scheduler log error on expected case

Since we don't hold the lock while performing memory load calculations, other
runners can unload in parallel, so finding no runner to unload is a valid scenario
which we shouldn't log at error level.
2025-10-09 10:37:47 -07:00
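A sketch of the logging change described above: treat `context.Canceled` from a client disconnect as expected and log it below error level. The `log/slog` calls are standard; the surrounding handler is illustrative, not the server's actual code.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logCompletionError demotes the expected client-disconnect case.
func logCompletionError(err error) {
	if errors.Is(err, context.Canceled) {
		// The client went away; the runner did not crash.
		slog.Debug("completion canceled by client")
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	logCompletionError(context.Canceled)
	logCompletionError(errors.New("runner crashed"))
}
```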
Parth Sareen 77060d462c routes: structured outputs for gpt-oss (#12460) 2025-10-08 19:13:38 -07:00
Patrick Devine 1b91d4dda1 openai: change the reasoning_effort field to also take none 2025-10-08 18:21:01 -07:00
Jeffrey Morgan 7d965258ce Revert "add truncate and shift parameters (#12519)" (#12545)
This reverts commit 6a62b894c7.
2025-10-08 17:57:57 -07:00
Jeffrey Morgan 6a62b894c7 add truncate and shift parameters (#12519) 2025-10-08 17:05:05 -07:00
Patrick Devine 90d429f5a8 thinking: turn on thinking mode for all reasoning models (#12533) 2025-10-08 16:50:13 -07:00
Jesse Gross 1fc35f1260 kvcache: Clean up sliding window state with independent batches
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that
are out of the cache's window each time we start a new forward pass.

The cache storage needs to handle the window size for each sequence
plus the batch size, since the batch needs to attend to the full
window size. This means that we have greater than a window size
stored while processing the batch.

When the next batch comes, we are currently only looking at the
sequences in the incoming batch to slide the window forward.
However, we also need to clean up the other sequences that might
be occupying space in the batch processing buffer to ensure each
sequence is only using its window size of storage. Failure to do
this can result in "no kv cache slot found" errors.

Fixes: #10127
2025-10-08 16:43:14 -07:00
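A toy model of the fix: when a new forward pass starts, trim cached positions for every sequence, not just those in the incoming batch, so no sequence holds more than its window of cache slots. The data structures are illustrative, not the real kvcache package.

```go
package main

import "fmt"

const windowSize = 4

// cache maps a sequence ID to its cached token positions.
type cache map[int][]int

// slideAll trims every sequence to its window, not just those in the batch.
func (c cache) slideAll() {
	for seq, positions := range c {
		if extra := len(positions) - windowSize; extra > 0 {
			c[seq] = positions[extra:] // drop positions outside the window
		}
	}
}

func main() {
	c := cache{
		1: {0, 1, 2, 3, 4, 5},    // present in the incoming batch
		2: {0, 1, 2, 3, 4, 5, 6}, // idle, but still over its window
	}
	c.slideAll()
	fmt.Println(c) // map[1:[2 3 4 5] 2:[3 4 5 6]]
}
```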
Jesse Gross aa45f7ce27 discover: Disable flash attention for Jetson Xavier (CC 7.2)
GGML picks the wrong kernel and these systems fail with:
Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437:
ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu
was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
2025-10-08 09:56:15 -07:00
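A minimal sketch of the guard this implies: skip flash attention when a device reports CUDA compute capability 7.2 (Jetson Xavier). The device struct and check are illustrative, not Ollama's actual discovery code.

```go
package main

import "fmt"

type cudaDevice struct {
	Major, Minor int // CUDA compute capability
}

// flashAttentionSupported reports whether flash attention should be enabled.
func flashAttentionSupported(d cudaDevice) bool {
	if d.Major == 7 && d.Minor == 2 {
		return false // GGML picks an incompatible kernel on CC 7.2 (Xavier)
	}
	return true
}

func main() {
	fmt.Println(flashAttentionSupported(cudaDevice{Major: 7, Minor: 2})) // false
	fmt.Println(flashAttentionSupported(cudaDevice{Major: 8, Minor: 6})) // true
}
```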
Daniel Hiltgen 4e5d862ec4 Integration test tuning (#12492)
Remove some flaky scenarios, and switch to chat for better reliability
2025-10-08 09:51:25 -07:00
Daniel Hiltgen 303be9304c docs: improve accuracy of LLM library docs (#12530) 2025-10-07 16:21:07 -07:00
Daniel Hiltgen bd15eba4e4 Bring back escape valve for llm libraries and fix Jetpack6 crash (#12529)
* Bring back escape valve for llm libraries

If the new discovery logic picks the wrong library, this gives users the
ability to force a specific one using the same pattern as before. This
can also potentially speed up bootstrap discovery if one of the libraries
takes a long time to load and ultimately binds to no devices. For example,
unsupported AMD iGPUs can sometimes take a while to discover and rule out.

* Bypass extra discovery on jetpack systems

On at least Jetpack6, cuda_v12 appears to expose the iGPU but crashes later in
cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
2025-10-07 16:06:14 -07:00
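A sketch of the escape-valve idea: if the user forces a specific LLM library via an environment variable, skip bootstrap discovery and use it. The variable name follows the older OLLAMA_LLM_LIBRARY pattern the commit references; treat both the name and the selection logic here as assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// pickLibrary returns the forced library if the escape valve is set,
// otherwise the first discovered one.
func pickLibrary(discovered []string) string {
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" {
		return forced // bypass bootstrap discovery entirely
	}
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}

func main() {
	fmt.Println(pickLibrary([]string{"cuda_v12", "cpu"}))
}
```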
Devon Rifkin bc71278670 Merge pull request #12509 from ollama/drifkin/oai-compat-refactor
openai: refactor to split compat layer and middleware
2025-10-06 16:22:08 -07:00
Daniel Hiltgen 918231931c win: fix build script (#12513) 2025-10-06 14:46:45 -07:00
Daniel Hiltgen 04c1849878 discovery: prevent dup OLLAMA_LIBRARY_PATH (#12514)
This variable isn't currently documented or intended as something the user can
override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were doubling
it in the subprocess environment, which will cause problems with the new
bootstrap discovery logic.
2025-10-06 14:36:44 -07:00
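A minimal sketch of the dedup idea: when building the subprocess environment, don't append OLLAMA_LIBRARY_PATH if it is already present. The helper is illustrative, not the actual discovery code.

```go
package main

import (
	"fmt"
	"strings"
)

// withLibraryPath appends OLLAMA_LIBRARY_PATH only if it isn't already set.
func withLibraryPath(env []string, value string) []string {
	for _, e := range env {
		if strings.HasPrefix(e, "OLLAMA_LIBRARY_PATH=") {
			return env // already present (e.g. set by the user); don't double it
		}
	}
	return append(env, "OLLAMA_LIBRARY_PATH="+value)
}

func main() {
	env := []string{"OLLAMA_LIBRARY_PATH=/opt/ollama/lib"}
	fmt.Println(withLibraryPath(env, "/opt/ollama/lib")) // still one entry
}
```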
Devon Rifkin 2c2f4deaa9 openai: refactor to split compat layer and middleware
This makes the core openai compat layer independent of the middleware
that adapts it to our particular gin routes
2025-10-05 14:18:56 -07:00
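A sketch of the split described above: a pure conversion layer with no gin dependency, plus a thin HTTP/middleware layer (not shown) that adapts it to the routes. Types and function names are illustrative, not the actual openai package.

```go
package main

import "fmt"

// Compat layer: plain data in, plain data out, no HTTP or gin types.
type ChatCompletionRequest struct{ Model string }
type ChatRequest struct{ Model string }

func fromOpenAI(r ChatCompletionRequest) ChatRequest {
	return ChatRequest{Model: r.Model}
}

// A separate middleware layer would only do HTTP plumbing: decode the body,
// call fromOpenAI, and hand the result to the gin route handler.

func main() {
	fmt.Println(fromOpenAI(ChatCompletionRequest{Model: "llama3"}))
}
```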
Daniel Hiltgen 292767afb4 CI: fix win arm build (#12502)
Resolve a subtle ErrorAction stickiness difference between the x86 and ARM builder setups
2025-10-04 11:46:45 -07:00
Daniel Hiltgen ae5e0f0889 CI: replace clang compiler for windows (#12495) 2025-10-04 09:18:42 -07:00
Jesse Gross 19e6796eac llm: Support KV cache quantization with gpt-oss
With the new version of GGML in #12245, KV cache quantization
no longer causes a fallback to CPU.
2025-10-03 16:31:58 -07:00
Grace 33801c1597 Fixed Deepseek2 adding nil tensor error 2025-10-03 14:20:06 -07:00
Daniel Hiltgen e4340667e3 Workaround broken NVIDIA iGPU free VRAM data (#12490)
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU
systems, as they only return the kernel's actual free memory and ignore
buff/cache allocations, which on a typical system quickly fill up
most of the free system memory. As a result, we incorrectly conclude
there's very little memory available for GPU allocations.
2025-10-03 12:17:21 -07:00
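An illustrative look at why the reported number is misleading on an iGPU: the kernel's "free" figure excludes reclaimable buff/cache, while MemAvailable accounts for it. This sketch only reads both values from /proc/meminfo for comparison; it is not the workaround Ollama ships.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		// MemFree ignores reclaimable buff/cache; MemAvailable does not,
		// so on a busy system it is typically much larger.
		if strings.HasPrefix(line, "MemFree:") || strings.HasPrefix(line, "MemAvailable:") {
			fmt.Println(line)
		}
	}
}
```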
Patrick Devine 2fa1e92a99 test: add template error test (#12489) 2025-10-03 12:05:34 -07:00
Daniel Hiltgen 07e36761c3 ci: place rocm windows in correct runner dir (#12487) 2025-10-03 07:28:40 -07:00
Daniel Hiltgen c29fb007c0 CI: temporarily disable clang install (#12486)
This will likely yield builds that have problems with unicode characters,
but at least we can start testing the release while we try to find an
alternate clang compiler for Windows, or until mingw ships a fixed version.
2025-10-02 20:31:18 -07:00
Daniel Hiltgen 730ed6e9e1 ci: fix windows build (#12485) 2025-10-02 19:16:01 -07:00
Daniel Hiltgen dc06601677 ci: fix windows build (#12484) 2025-10-02 18:59:26 -07:00
Patrick Devine 1ed2881ef0 templates: fix crash in improperly defined templates (#12483) 2025-10-02 17:25:55 -07:00
Jesse Gross 0bda72892c llm: Enable flash attention by default for qwen3 and qwen3moe 2025-10-02 17:04:10 -07:00
Daniel Hiltgen 55ca827267 AMD: block running on unsupported gfx900/gfx906 (#12481) 2025-10-02 16:53:05 -07:00
Daniel Hiltgen c68f367ef6 Update GGML to b6646 (#12245)
Notable EOLs with this change:
- MacOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
2025-10-02 14:47:10 -07:00