ollama

Author	SHA1	Message	Date
Michael Yang	564b558c92	fix(llama): other llama flavours (#12308) * fix(llama): rope scale * spm llama * skip moe models * cleanup	2025-09-17 12:12:21 -07:00
Michael Yang	a417ac97ee	prefer ollama engine for qwen3 (#12310)	2025-09-17 09:48:21 -07:00
Nakasaka, Masato	ac9d59cf69	Fixed wrong structure ID	2025-09-17 16:59:23 +09:00
Nakasaka, Masato	45430ded4b	Fixed missing members in Vulkan header; also added zero clear for some structs	2025-09-17 16:04:43 +09:00
Nakasaka, Masato	6cf4e0a7c8	added missing NL	2025-09-17 15:21:24 +09:00
Nakasaka, Masato	73441c9780	Removed unneeded function call. Somehow removing this call fixed the crash that occurred when the Vulkan header was removed	2025-09-17 15:11:13 +09:00
Nakasaka, Masato	882278a258	Merge remote-tracking branch 'vk-upstream/vulkanV3' into remove-vulkan-header	2025-09-17 09:24:06 +09:00
russcoss	05d53457af	refactor: use the built-in max/min to simplify the code (#12280) Signed-off-by: russcoss <russcoss@outlook.com>	2025-09-16 17:14:21 -07:00
Michael Yang	b225508c9b	logutil: fix source field (#12279)	2025-09-16 16:18:07 -07:00
Inforithmics	176d30744e	fixing lint error	2025-09-16 22:48:24 +02:00
Inforithmics	0d4f3341c3	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-09-16 22:15:31 +02:00
Inforithmics	eb7b5ce9f4	Fix patches apply	2025-09-16 22:14:05 +02:00
Devon Rifkin	fa1c987a29	Merge pull request #12248 from ollama/drifkin/qwen3-coder-parsing: add qwen3-coder tool support	2025-09-16 10:21:43 -07:00
Michael Yang	ad95d5b30b	use split activations when possible (#12293) * use ggml__split activations when possible forward qkv	2025-09-16 09:51:19 -07:00
Michael Yang	c253433d68	embed: cleanup (#12299) * cleanup * use pooling.TypeNone * pooling test	2025-09-16 09:48:42 -07:00
Beshoy Girgis	a1cff89b30	fix: fix CUDA detection for older GPUs (#12300) Prioritize GPU compute capability over driver version to ensure Pascal GPUs (CC 6.1) use compatible CUDA v12 libraries instead of v13.	2025-09-16 07:47:06 -07:00
Nakasaka, Masato	7a6b09ebae	Removed unused code. Fix linter error in CI	2025-09-16 17:18:49 +09:00
Masato Nakasaka	ede4081253	Fix compile error in Mac. Metal is preferred so we're disabling Vulkan for now	2025-09-16 17:00:17 +09:00
Nakasaka, Masato	da466f4f86	Copied minimal definition from vulkan header	2025-09-16 15:05:54 +09:00
Daniel Hiltgen	93c64ea1b1	doc: show how to clear the cgo cache (#12298)	2025-09-15 15:45:35 -07:00
Michael Yang	3f6642f6fc	model: implement bert in ollama engine (#9080) * fix truncate * s/SentencePieceModel/SentencePiece/ * bert * wordpiece * refactor pooling * more tokenizers * normalize embeddings	2025-09-15 15:35:59 -07:00
Michael Yang	6f7117145f	batch: use tensors for outputs (#12185) this cleans up the model interface slightly without too much impact in other areas	2025-09-15 14:33:06 -07:00
Devon Rifkin	472feec2ff	address comments	2025-09-15 11:46:25 -07:00
Devon Rifkin	47991940d4	add qwen3-coder tool support. The format qwen3-coder uses is relatively unique, both in rendering and in parsing. To implement parsing, I wrote a custom parser in similar style to harmony. For the rendering, I found that the logic would be much more difficult to follow in a template, so I introduced the concept of a built-in renderer that uses go code, rather than a template to generate prompts. I set us up for future built-in parsers and renderers by making it so they can be specified in a Modelfile like so: ``` RENDERER "qwen3-coder" PARSER "qwen3-coder" ``` These need to be provided explicitly because the architecture alone is not enough to understand what format the model expects to receive, and what format we expect it to output (e.g., qwen3-coder is `qwen3moe`, which includes other qwen3-family models as well). I haven't converted harmony to be one of these "built-ins" yet, since some of it is in flux with the changes @ParthSareen has been making to move harmony to the runner. It is likely that many other built-ins will need to move to the runner as well, but I'm able to slightly defer that decision since qwen3-coder doesn't have thinking (and therefore doesn't need to be in the runner to make structured outputs work). I expect to unify harmony with this approach very soon. Whether a particular model supports tools or thinking was previously inferred from templates, but without a template we now also use the parser itself to declare what it supports. If we have future models that re-use the same parsing format, but have different capabilities, we'll want to parameterize them and give them different names to be specified as a `PARSER`. Misc changes: - I worked on the renderer by diffing outputs from the reference implementation and ours. To make it easier to do this, I extended <https://github.com/ollama/ollama/pull/11875> to also support returning the prompt via the openai compat layer	2025-09-15 11:33:47 -07:00
jmorganca	92b96d54ef	Revert "runner: move harmony to runner (#12052)" This reverts commit `1a558f98e2`. (tags: v0.11.11, v0.11.11-rc2, v0.11.11-rc3)	2025-09-12 20:40:14 -03:00
jmorganca	9d56e63dbf	Revert "runner: simplify parser entrypoints in runner (#12233)" This reverts commit `8d6fffaead`.	2025-09-12 20:40:14 -03:00
tc-mb	053092185e	Fix image cannot be seen with slice image on llama engine. Ollama's recent engine update, llama.cpp, caused all models requiring a slice schema to not display images. As a result, the value of numTokens isn't always the length of the sliced image embed, but rather the end length of the schema. This causes the image embed to not be correctly included during all slice processing.	2025-09-12 16:25:12 -07:00
Daniel Hiltgen	44a6792873	tests: tighten up a few flaky tests (#12271) Sometimes the context test results are pure emojis. Thanksgiving has too much variability, so swap for a more straightforward prompt.	2025-09-12 13:59:34 -07:00
Inforithmics	bdfae41e7b	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-09-12 22:18:42 +02:00
Daniel Hiltgen	e4ce68311a	cuda: remove compression for better compatibility (#12259) This retains compatibility with driver 531 and up at the trade-off of space. (tag: v0.11.11-rc1)	2025-09-12 07:59:14 -07:00
Inforithmics	5053b2e351	Fix Patch	2025-09-12 08:13:17 +02:00
Jesse Gross	26214125e8	ollamarunner: Suppress stack trace during memory allocation. Allocation failures can be a normal part of new memory estimates, so we shouldn't print a stack trace in this case.	2025-09-11 14:30:31 -07:00
Daniel Hiltgen	61fb912ca4	CI: fix windows cuda build (#12246) * ci: adjust cuda component list v13 has a different breakdown of the components required to build ollama * review comments (tag: v0.11.11-rc0)	2025-09-11 12:25:26 -07:00
Jesse Gross	aba1575315	llm: Don't try to load split vision models in the Ollama engine. If a model with a split vision projector is loaded in the Ollama engine, the projector will be ignored and the model will hallucinate a response. Instead, fall back and try to load the model in the llama engine.	2025-09-11 11:41:55 -07:00
Jesse Gross	eb10390de9	llm: Enable new memory estimates by default. New memory estimates (see #11090 for more information) are now enabled automatically for all models running on the Ollama engine, improving both stability and performance through more accurate sizing and allocation. Models running on the llama engine will continue to use the original style of memory estimation.	2025-09-11 11:21:53 -07:00
Michael Yang	feb18cd710	feat: add dimensions field to embed requests (#12242) * feat: add field to truncate embeddings * add openai embeddings for dimensions	2025-09-11 10:36:10 -07:00
fengyuchuanshen	8a7e2055d2	cmd: use slices.Contains to simplify code (#12249)	2025-09-11 09:57:31 -07:00
Inforithmics	69ed26c93b	Merge remote-tracking branch 'upstream/main' into vulkanV3	2025-09-11 18:30:21 +02:00
Thomas Stocker	0db9fb4ad4	Merge pull request #4 from rillomas/fix-vulkan-uuid: added Vulkan API to get correct Device UUID	2025-09-11 16:24:15 +02:00
Jesse Gross	29ddfc2cab	ggml: Disable flash attention for gemma2. Our new engine implementation of gemma2 doesn't support flash attention, which means that it also doesn't support KV cache quantization. Currently, it is possible to turn these two on, which will result in a crash.	2025-09-10 16:40:45 -07:00
Jesse Gross	71cb86af3e	llm: Remove unneeded warning with flash attention enabled. If flash attention is enabled without KV cache quantization, we will currently always get this warning: level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""	2025-09-10 16:40:45 -07:00
CarbonatedWater.org	5198956372	docs: add ollama-co2 to community integrations (#12230)	2025-09-10 16:37:10 -07:00
Daniel Hiltgen	17a023f34b	Add v12 + v13 cuda support (#12000) * Add support for upcoming NVIDIA Jetsons The latest Jetsons with JetPack 7 are moving to an SBSA compatible model and will not require building a JetPack specific variant. * cuda: bring back dual versions This adds back dual CUDA versions for our releases, with v11 and v13 to cover a broad set of GPUs and driver versions. * win: break up native builds in build_windows.ps1 * v11 build working on windows and linux * switch to cuda v12.8 not JIT * Set CUDA compression to size * enhance manual install linux docs	2025-09-10 12:05:18 -07:00
Parth Sareen	8d6fffaead	runner: simplify parser entrypoints in runner (#12233)	2025-09-10 11:24:42 -07:00
Masato Nakasaka	dd853c4040	modified UUID code inside ggml	2025-09-10 14:45:12 +09:00
Masato Nakasaka	f4add77fc3	Merge branch 'vulkanV3' into fix-vulkan-uuid	2025-09-10 13:36:06 +09:00
Inforithmics	08bec121eb	Remove Code not in llama.cpp	2025-09-10 00:09:17 +02:00
Inforithmics	d5cecee907	Fix GPU ID Patch	2025-09-09 23:47:08 +02:00
Parth Sareen	20b53eaa72	tests: add tool calling integration test (#12232)	2025-09-09 14:01:11 -07:00
Daniel Hiltgen	6745182885	tests: reduce stress on CPU to 2 models (#12161) * tests: reduce stress on CPU to 2 models This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently * tests: allow slow systems to pass on timeout If a slow system is still streaming a response, and the response will pass validation, don't fail just because the system is slow. * test: unload embedding models more quickly	2025-09-09 09:32:15 -07:00


4715 Commits