Commit Graph

4844 Commits

Author SHA1 Message Date
Michael Yang 18087f2ec7 Revert "use llama runner for qwen3 (#12556)"
This reverts commit 3d32249c74.
2025-10-13 13:30:30 -07:00
Michael Yang 6c833d5f8d fix(qwen3): deepseek distill
DeepSeek's Qwen3 distill uses a different RoPE scheme, so support both
2025-10-13 13:30:30 -07:00
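A minimal sketch of what supporting both schemes could look like: dispatch on the model's metadata rather than assuming a single RoPE variant. The config type, the "deepseek-distill" key, and the stub paths are illustrative assumptions, not Ollama's actual model code.

```go
package main

import "fmt"

type ropeConfig struct {
	Scheme string // e.g. "qwen3" or "deepseek-distill" (hypothetical values)
}

// ropePath picks which RoPE implementation to run based on model metadata.
func ropePath(cfg ropeConfig) string {
	switch cfg.Scheme {
	case "deepseek-distill":
		return "DeepSeek distill RoPE path"
	default:
		return "default Qwen3 RoPE path"
	}
}

func main() {
	fmt.Println(ropePath(ropeConfig{Scheme: "qwen3"}))
	fmt.Println(ropePath(ropeConfig{Scheme: "deepseek-distill"}))
}
```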
Jeffrey Morgan 6544e14735 Reapply "add truncate and shift parameters" (#12582) 2025-10-11 16:06:14 -07:00
Devon Rifkin 5db8a818a1 Merge pull request #12581 from ollama/drifkin/renderer-api-generate
routes: fix built-in renderers for `api/generate`
2025-10-11 14:10:23 -07:00
Devon Rifkin 6db8da9958 routes: fix built-in renderers for `api/generate`
When `api/generate` builds up a message array and generates the prompt,
it now goes through the same function as `api/chat`, for consistency.
That shared function is where the optional built-in renderers hook in to
bypass templates, which was missing for `api/generate` before this
change.

Closes: #12578
2025-10-11 13:57:43 -07:00
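A minimal sketch of the shared-path idea: both endpoints build a message slice and feed it through one prompt function, so the built-in renderer hook applies to both. The `Message` type, `renderPrompt`, and `builtinRenderer` are illustrative names, not the real routes package.

```go
package main

import "fmt"

type Message struct{ Role, Content string }

// renderPrompt is the single place where either a built-in renderer or a
// template (elided here) turns messages into a prompt string.
func renderPrompt(msgs []Message, builtinRenderer func([]Message) string) string {
	if builtinRenderer != nil {
		return builtinRenderer(msgs) // bypasses templates
	}
	return fmt.Sprintf("%v", msgs) // stand-in for template rendering
}

func main() {
	// api/generate now builds messages and calls the same function api/chat uses.
	msgs := []Message{{Role: "user", Content: "hello"}}
	fmt.Println(renderPrompt(msgs, nil))
}
```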
frob 0c68ec8d6a discover: fix typo (#12565) 2025-10-11 12:06:02 -07:00
Daniel Hiltgen 70d9e363e1 doc: remove AMD EOL GPUs (#12567) 2025-10-10 17:16:29 -07:00
Michael Yang 1a2feb2a97 ollamarunner: fix deadlock
Sending on hardErrCh will deadlock since forwardBatch is blocked on
computeStartedCh, which never gets sent to. Since the only response to
hardErrCh is to panic, just panic instead.
2025-10-10 16:49:57 -07:00
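A toy illustration of the deadlock shape and the fix: sending a hard error on a channel whose receiver is itself blocked hangs forever, whereas panicking surfaces the error immediately. Only the channel names come from the commit message; everything else is illustrative.

```go
package main

import "errors"

// reportHardError used to send on hardErrCh; if the receiving goroutine is
// blocked elsewhere (as forwardBatch was on computeStartedCh), that send
// never completes. Since the handler's only action was to panic anyway,
// panicking directly avoids the deadlock.
func reportHardError(err error) {
	panic(err)
}

func main() {
	defer func() { _ = recover() }() // keep this demo from actually crashing
	reportHardError(errors.New("unrecoverable runner error"))
}
```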
Daniel Hiltgen aab2190420 implement nvml for linux (#12517)
* implement nvml for linux

* Improve scheduler logging when VRAM doesn't recover
2025-10-10 15:15:56 -07:00
Michael Yang 629db9dc43 comment split 2025-10-10 13:25:34 -07:00
Michael Yang e0cd511661 fix test 2025-10-10 13:25:34 -07:00
Michael Yang 207332078f fix lint 2025-10-10 13:25:34 -07:00
Michael Yang 93085127f4 convert: slice gate_up weight 2025-10-10 13:25:34 -07:00
Michael Yang c00fa9cc2b convert: split gate_up bias 2025-10-10 13:25:34 -07:00
yajianggroup df411c4b02 refactor: use testing.B.Loop
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
2025-10-10 13:25:29 -07:00
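For reference, this is the `testing.B.Loop` pattern (Go 1.24+) that the refactor moves benchmarks to, shown as a standalone `_test.go` sketch rather than an actual benchmark from the repo.

```go
package demo

import (
	"strings"
	"testing"
)

func BenchmarkJoin(b *testing.B) {
	parts := []string{"a", "b", "c"}
	// b.Loop replaces the classic `for i := 0; i < b.N; i++` loop and
	// handles iteration counting and timing itself.
	for b.Loop() {
		_ = strings.Join(parts, ",")
	}
}
```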
Jeffrey Morgan 3d32249c74 use llama runner for qwen3 (#12556) 2025-10-09 19:08:21 -07:00
Patrick Devine d681cd7c29 thinking: allow `"think": false` for non-thinking models (#12555) 2025-10-09 18:46:00 -07:00
shengxinjing 47298fce39 refactor: use builtin max and min 2025-10-09 16:17:52 -07:00
shengxinjing 4a48937ef1 refactor: use builtin max and min 2025-10-09 16:17:52 -07:00
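The shape of this refactor: Go 1.21's built-in `max` and `min` replace hand-rolled helpers. The `oldMax` helper below is shown only for contrast and is not taken from the codebase.

```go
package main

import "fmt"

// oldMax is the kind of hand-rolled helper the builtins make unnecessary.
func oldMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(oldMax(3, 7)) // 7
	fmt.Println(max(3, 7))    // 7, via the Go 1.21 builtin
	fmt.Println(min(3, 7))    // 3
}
```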
Michael Yang 967a82f52f ollamarunner: measure only active time 2025-10-09 15:44:04 -07:00
Michael Yang bbbc73d637 llamarunner: update metrics
This change updates how metrics are collected. Until now, performance
metrics, specifically the initial input processing and subsequent
generation durations, were collected by taking timestamps at sequence
creation, first token generation, and generation completion. The
processing duration was computed as first token generation minus
sequence creation, while the generation duration was generation
completion minus first token generation.

While this approach is an accurate end-to-end measure of processing and
generation, it's not comparable to other tools, which only measure the
active (i.e. decode) duration.

This change updates the metrics to only capture decode duration so they
can be compared more directly to other tools.
2025-10-09 15:44:04 -07:00
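A minimal sketch of decode-only timing as described above: accumulate only the time spent inside decode calls rather than wall-clock time from sequence creation to completion. The `sequence` struct and `decodeStep` are illustrative, not the runner's actual types.

```go
package main

import (
	"fmt"
	"time"
)

type sequence struct {
	decodeTotal time.Duration // active decode time only
}

func (s *sequence) decodeStep() {
	start := time.Now()
	time.Sleep(2 * time.Millisecond) // stand-in for the real decode work
	s.decodeTotal += time.Since(start)
}

func main() {
	s := &sequence{}
	for i := 0; i < 3; i++ {
		s.decodeStep()
		time.Sleep(5 * time.Millisecond) // idle/wait time is not counted
	}
	fmt.Println("active decode time:", s.decodeTotal)
}
```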
Daniel Hiltgen 15e3611d3d logs: quiet down context canceled on completion and scheduler noise (#12553)
* logs: quiet down context canceled on completion

If the client closes the connection before Completion finishes, we were
logging at error level, implying the runner had crashed, which was misleading.

time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"

* quiet down scheduler log error on expected case

Since we don't hold the lock while performing memory load calculations, other
runners can unload in parallel, so finding no runner to unload is a valid scenario
which we shouldn't log at error level.
2025-10-09 10:37:47 -07:00
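A sketch of the logging change described above: treat `context.Canceled` from a client disconnect as expected and log it below error level. The `log/slog` calls are standard; the surrounding handler is illustrative, not the server's actual code.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logCompletionError demotes the expected client-disconnect case.
func logCompletionError(err error) {
	if errors.Is(err, context.Canceled) {
		// The client went away; the runner did not crash.
		slog.Debug("completion canceled by client")
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	logCompletionError(context.Canceled)
	logCompletionError(errors.New("runner crashed"))
}
```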
Parth Sareen 77060d462c routes: structured outputs for gpt-oss (#12460) 2025-10-08 19:13:38 -07:00
Patrick Devine 1b91d4dda1 openai: change the reasoning_effort field to also take none 2025-10-08 18:21:01 -07:00
Jeffrey Morgan 7d965258ce Revert "add truncate and shift parameters (#12519)" (#12545)
This reverts commit 6a62b894c7.
2025-10-08 17:57:57 -07:00
Jeffrey Morgan 6a62b894c7 add truncate and shift parameters (#12519) 2025-10-08 17:05:05 -07:00
Patrick Devine 90d429f5a8 thinking: turn on thinking mode for all reasoning models (#12533) 2025-10-08 16:50:13 -07:00
Jesse Gross 1fc35f1260 kvcache: Clean up sliding window state with independent batches
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that
are out of the cache's window each time we start a new forward pass.

The cache storage needs to handle the window size for each sequence
plus the batch size, since the batch needs to attend to the full
window size. This means that we have greater than a window size
stored while processing the batch.

When the next batch comes, we are currently only looking at the
sequences in the incoming batch to slide the window forward.
However, we also need to clean up the other sequences that might
be occupying space in the batch processing buffer to ensure each
sequence is only using its window size of storage. Failure to do
this can result in "no kv cache slot found" errors.

Fixes: #10127
2025-10-08 16:43:14 -07:00
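A toy model of the fix: when a new forward pass starts, trim cached positions for every sequence, not just those in the incoming batch, so no sequence holds more than its window of cache slots. The data structures are illustrative, not the real kvcache package.

```go
package main

import "fmt"

const windowSize = 4

// cache maps a sequence ID to its cached token positions.
type cache map[int][]int

// slideAll trims every sequence to its window, not just those in the batch.
func (c cache) slideAll() {
	for seq, positions := range c {
		if extra := len(positions) - windowSize; extra > 0 {
			c[seq] = positions[extra:] // drop positions outside the window
		}
	}
}

func main() {
	c := cache{
		1: {0, 1, 2, 3, 4, 5},    // present in the incoming batch
		2: {0, 1, 2, 3, 4, 5, 6}, // idle, but still over its window
	}
	c.slideAll()
	fmt.Println(c) // map[1:[2 3 4 5] 2:[3 4 5 6]]
}
```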
Jesse Gross aa45f7ce27 discover: Disable flash attention for Jetson Xavier (CC 7.2)
GGML picks the wrong kernel and these systems fail with:
Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437:
ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu
was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
2025-10-08 09:56:15 -07:00
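A minimal sketch of the guard this implies: skip flash attention when a device reports CUDA compute capability 7.2 (Jetson Xavier). The device struct and check are illustrative, not Ollama's actual discovery code.

```go
package main

import "fmt"

type cudaDevice struct {
	Major, Minor int // CUDA compute capability
}

// flashAttentionSupported reports whether flash attention should be enabled.
func flashAttentionSupported(d cudaDevice) bool {
	if d.Major == 7 && d.Minor == 2 {
		return false // GGML picks an incompatible kernel on CC 7.2 (Xavier)
	}
	return true
}

func main() {
	fmt.Println(flashAttentionSupported(cudaDevice{Major: 7, Minor: 2})) // false
	fmt.Println(flashAttentionSupported(cudaDevice{Major: 8, Minor: 6})) // true
}
```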
Daniel Hiltgen 4e5d862ec4 Integration test tuning (#12492)
Remove some flaky scenarios, and switch to chat for better reliability
2025-10-08 09:51:25 -07:00
Daniel Hiltgen 303be9304c docs: improve accuracy of LLM library docs (#12530) 2025-10-07 16:21:07 -07:00
Daniel Hiltgen bd15eba4e4 Bring back escape valve for llm libraries and fix Jetpack6 crash (#12529)
* Bring back escape valve for llm libraries

If the new discovery logic picks the wrong library, this gives users the
ability to force a specific one using the same pattern as before. This
can also potentially speed up bootstrap discovery if one of the libraries
takes a long time to load and ultimately binds to no devices. For example,
unsupported AMD iGPUs can sometimes take a while to discover and rule out.

* Bypass extra discovery on jetpack systems

On at least Jetpack6, cuda_v12 appears to expose the iGPU but crashes later in
cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
2025-10-07 16:06:14 -07:00
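A sketch of the escape-valve idea: if the user forces a specific LLM library via an environment variable, skip bootstrap discovery and use it. The variable name follows the older OLLAMA_LLM_LIBRARY pattern the commit references; treat both the name and the selection logic here as assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// pickLibrary returns the forced library if the escape valve is set,
// otherwise the first discovered one.
func pickLibrary(discovered []string) string {
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" {
		return forced // bypass bootstrap discovery entirely
	}
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}

func main() {
	fmt.Println(pickLibrary([]string{"cuda_v12", "cpu"}))
}
```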
Devon Rifkin bc71278670 Merge pull request #12509 from ollama/drifkin/oai-compat-refactor
openai: refactor to split compat layer and middleware
2025-10-06 16:22:08 -07:00
Daniel Hiltgen 918231931c win: fix build script (#12513) 2025-10-06 14:46:45 -07:00
Daniel Hiltgen 04c1849878 discovery: prevent dup OLLAMA_LIBRARY_PATH (#12514)
This variable isn't currently documented or intended as something the user can
override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were doubling
it in the subprocess environment, which will cause problems with the new
bootstrap discovery logic.
2025-10-06 14:36:44 -07:00
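A minimal sketch of the dedup idea: when building the subprocess environment, don't append OLLAMA_LIBRARY_PATH if it is already present. The helper is illustrative, not the actual discovery code.

```go
package main

import (
	"fmt"
	"strings"
)

// withLibraryPath appends OLLAMA_LIBRARY_PATH only if it isn't already set.
func withLibraryPath(env []string, value string) []string {
	for _, e := range env {
		if strings.HasPrefix(e, "OLLAMA_LIBRARY_PATH=") {
			return env // already present (e.g. set by the user); don't double it
		}
	}
	return append(env, "OLLAMA_LIBRARY_PATH="+value)
}

func main() {
	env := []string{"OLLAMA_LIBRARY_PATH=/opt/ollama/lib"}
	fmt.Println(withLibraryPath(env, "/opt/ollama/lib")) // still one entry
}
```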
Devon Rifkin 2c2f4deaa9 openai: refactor to split compat layer and middleware
This makes the core openai compat layer independent of the middleware
that adapts it to our particular gin routes
2025-10-05 14:18:56 -07:00
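A sketch of the split described above: a pure conversion layer with no gin dependency, plus a thin HTTP/middleware layer (not shown) that adapts it to the routes. Types and function names are illustrative, not the actual openai package.

```go
package main

import "fmt"

// Compat layer: plain data in, plain data out, no HTTP or gin types.
type ChatCompletionRequest struct{ Model string }
type ChatRequest struct{ Model string }

func fromOpenAI(r ChatCompletionRequest) ChatRequest {
	return ChatRequest{Model: r.Model}
}

// A separate middleware layer would only do HTTP plumbing: decode the body,
// call fromOpenAI, and hand the result to the gin route handler.

func main() {
	fmt.Println(fromOpenAI(ChatCompletionRequest{Model: "llama3"}))
}
```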
Daniel Hiltgen 292767afb4 CI: fix win arm build (#12502)
Resolve a subtle ErrorAction stickiness difference between the x86 and ARM builder setups
2025-10-04 11:46:45 -07:00
Daniel Hiltgen ae5e0f0889 CI: replace clang compiler for windows (#12495) 2025-10-04 09:18:42 -07:00
Jesse Gross 19e6796eac llm: Support KV cache quantization with gpt-oss
With the new version of GGML in #12245, KV cache quantization
no longer causes a fallback to CPU.
2025-10-03 16:31:58 -07:00
Grace 33801c1597 Fixed Deepseek2 adding nil tensor error 2025-10-03 14:20:06 -07:00
Daniel Hiltgen e4340667e3 Workaround broken NVIDIA iGPU free VRAM data (#12490)
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU
systems, as they only return the kernel's actual free memory and ignore
buff/cache allocations, which on a typical system quickly fill up
most of the free system memory. As a result, we incorrectly conclude
there's very little memory available for GPU allocations.
2025-10-03 12:17:21 -07:00
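An illustrative look at why the reported number is misleading on an iGPU: the kernel's "free" figure excludes reclaimable buff/cache, while MemAvailable accounts for it. This sketch only reads both values from /proc/meminfo for comparison; it is not the workaround Ollama ships.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		// MemFree ignores reclaimable buff/cache; MemAvailable does not,
		// so on a busy system it is typically much larger.
		if strings.HasPrefix(line, "MemFree:") || strings.HasPrefix(line, "MemAvailable:") {
			fmt.Println(line)
		}
	}
}
```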
Patrick Devine 2fa1e92a99 test: add template error test (#12489) 2025-10-03 12:05:34 -07:00
Daniel Hiltgen 07e36761c3 ci: place rocm windows in correct runner dir (#12487) 2025-10-03 07:28:40 -07:00
Daniel Hiltgen c29fb007c0 CI: temporarily disable clang install (#12486)
This will likely yield builds that have problems with unicode characters,
but at least we can start testing the release while we try to find an
alternate clang compiler for Windows, or until mingw ships a fixed version.
2025-10-02 20:31:18 -07:00
Daniel Hiltgen 730ed6e9e1 ci: fix windows build (#12485) 2025-10-02 19:16:01 -07:00
Daniel Hiltgen dc06601677 ci: fix windows build (#12484) 2025-10-02 18:59:26 -07:00
Patrick Devine 1ed2881ef0 templates: fix crash in improperly defined templates (#12483) 2025-10-02 17:25:55 -07:00
Jesse Gross 0bda72892c llm: Enable flash attention by default for qwen3 and qwen3moe 2025-10-02 17:04:10 -07:00
Daniel Hiltgen 55ca827267 AMD: block running on unsupported gfx900/gfx906 (#12481) 2025-10-02 16:53:05 -07:00
Daniel Hiltgen c68f367ef6 Update GGML to b6646 (#12245)
Notable EOLs with this change:
- MacOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
2025-10-02 14:47:10 -07:00