Commit Graph

4533 Commits

Author SHA1 Message Date
Inforithmics 49c4d154ae Enable Vulkan Flash attention in FlashAttentionSupported 2025-08-12 21:55:19 +02:00
Inforithmics e6da524ab7 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-12 21:51:39 +02:00
Inforithmics 2244f304d7 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-12 21:43:10 +02:00
Michael Yang d0cf6c8281
fix(openai): handle reasoning_effort (#11868) 2025-08-12 11:02:01 -07:00
Jesse Gross 8f4ec9ab28 discover: CPU supports flash attention
We already run flash attention on CPUs in cases where we have
partial offloading, but were disabling it when running on pure CPU,
which is unnecessary.
2025-08-11 15:00:34 -07:00
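The corrected check described above can be sketched as a predicate over the discovered devices. This is an illustrative reimplementation, not Ollama's actual `discover` code; `deviceInfo` and `gpuSupportsFlashAttention` are hypothetical stand-ins.

```go
package main

import "fmt"

// deviceInfo is a hypothetical stand-in for Ollama's discover package types.
type deviceInfo struct {
	Library string // e.g. "cpu", "cuda", "rocm"
}

// flashAttentionSupported sketches the corrected logic: flash attention
// already runs on the CPU during partial offload, so a pure-CPU
// configuration should not disable it. Only GPUs need a capability check.
func flashAttentionSupported(devices []deviceInfo) bool {
	for _, d := range devices {
		if d.Library != "cpu" && !gpuSupportsFlashAttention(d) {
			return false
		}
	}
	return true // CPU-only (or all GPUs capable) => supported
}

// gpuSupportsFlashAttention is a placeholder for the per-GPU capability test.
func gpuSupportsFlashAttention(d deviceInfo) bool { return true }

func main() {
	fmt.Println(flashAttentionSupported([]deviceInfo{{Library: "cpu"}})) // true
}
```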
Devon Rifkin dbfd7bd027
Merge pull request #11861 from ollama/drifkin/fix-parsing-error
server: fix error when parsing bad harmony tool calls
2025-08-11 14:59:57 -07:00
Devon Rifkin ee04dbba51 server: fix error when parsing bad harmony tool calls
Thanks @moll for reporting!

Fixes: #11781
2025-08-11 14:09:13 -07:00
Daniel Andersen ea7657b54a
sched: Add support for grouping GPUs (#10678)
This patch modifies Ollama to allow grouping GPUs and memory-fitting the requested model to a group, instead of the former algorithm of distributing the model over all available GPUs.

Benefits:
 - Lower amount of (PCIe) bus communication between GPUs, especially when they are not very high speed
 - Allows unallocated GPUs to enter power-saving mode
 - Significantly reduces VRAM allocation when using more than 2 GPUs in a system
 - Due to the reduced memory allocation, more models can run simultaneously
2025-08-11 13:59:38 -07:00
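The grouping idea from #10678 can be sketched as picking the smallest set of GPUs whose combined free VRAM fits the model, leaving the rest idle. This is a simplified illustration, assuming a hypothetical `gpu` type rather than Ollama's real scheduler structures.

```go
package main

import (
	"fmt"
	"sort"
)

// gpu is a hypothetical stand-in for the scheduler's GPU info type.
type gpu struct {
	ID       string
	FreeVRAM uint64 // bytes
}

// pickGroup sketches the grouping approach: instead of spreading a model
// across every available GPU, select the smallest group whose combined free
// VRAM fits the model, so unused GPUs can power down.
func pickGroup(gpus []gpu, need uint64) []gpu {
	// Prefer larger GPUs first so the group stays small.
	sorted := append([]gpu(nil), gpus...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].FreeVRAM > sorted[j].FreeVRAM })

	var group []gpu
	var total uint64
	for _, g := range sorted {
		group = append(group, g)
		total += g.FreeVRAM
		if total >= need {
			return group
		}
	}
	return nil // does not fit even on all GPUs combined
}

func main() {
	gpus := []gpu{{"0", 8 << 30}, {"1", 24 << 30}, {"2", 8 << 30}}
	// A 20 GiB model fits on GPU 1 alone, so the group has one member.
	fmt.Println(len(pickGroup(gpus, 20<<30))) // 1
}
```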
Inforithmics 0c27f472e7 Remove commented out code 2025-08-11 18:52:43 +02:00
Inforithmics e3627b2832 Add vulkan to Windows Build script 2025-08-11 18:39:10 +02:00
Inforithmics d1f74e17d4 Update gpu.go 2025-08-10 21:28:59 +02:00
Inforithmics f6dd7070de vk_check_flash_attention 0 means supported 2025-08-10 21:22:26 +02:00
Inforithmics ee24b967f1 fixed flash attention logic enabling 2025-08-10 19:57:14 +02:00
Inforithmics a1393414ce revert remove parenthesis 2025-08-10 17:54:13 +02:00
Inforithmics 5270c4c5f7 enable flash attention on vulkan 2025-08-10 16:53:13 +02:00
Inforithmics 60a015e8c3 Revert changes in ggml.go 2025-08-10 16:09:44 +02:00
Inforithmics 1edbfd0559 Revert changes in ggml.go 2025-08-10 16:07:24 +02:00
Inforithmics fd4480a848 Fixed duplicate sync in ggml.go 2025-08-10 16:05:09 +02:00
Inforithmics 2e7452be71 Update Vulkan Code to de4c07f93783a1a96456a44dc16b9db538ee1618 2025-08-10 16:01:07 +02:00
Michael Vorburger 2c776f0780
CONTRIBUTING: Explicitly note docs:... as a good example (#11755) 2025-08-09 18:12:30 -07:00
Thomas Stocker bc5c3fb213
Revert vulkan copy changes in Dockerfile 2025-08-09 22:45:52 +02:00
Thomas Stocker fa13b8de45
Revert some unintended changes in Dockerfile 2025-08-09 22:43:12 +02:00
Thomas Stocker d03fc13d36
Revert changes in Makefile.sync 2025-08-09 22:38:37 +02:00
Thomas Stocker a6d0d6c6ff
Revert changes in runner.go 2025-08-09 22:35:20 +02:00
Thomas Stocker 0ddb64db1f
Revert changes in transforms_test.go 2025-08-09 22:33:42 +02:00
Thomas Stocker 29b1ed0077
Revert whitespace changes in gpu.go 2025-08-09 22:30:13 +02:00
Thomas Stocker 57270767ac
Remove flashattention setting gpu.go 2025-08-09 22:26:54 +02:00
Thomas Stocker 42463fbb7f
Revert changes in amd_linux.go 2025-08-09 22:24:33 +02:00
Thomas Stocker 89ac91099d
Revert changes in amd_linux.go 2025-08-09 22:23:00 +02:00
Thomas Stocker 47bff3e532
Revert 2025-08-09 22:15:54 +02:00
Thomas Stocker 643b1c505e
Revert Readme changes 2025-08-09 22:14:54 +02:00
Inforithmics f8ed1541ed Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-09 21:59:30 +02:00
Jesse Gross 79f6376f5b ggml: No-alloc mode
Callers can set a backend buffer type to be no-alloc, meaning that
it does not allocate memory for tensors or operations. This can
be used for calculating memory requirements. Tensors and graphs
must be recreated with no-alloc set to false before loading data.

Defaults to false for newly created backend buffer types.
2025-08-08 14:57:13 -07:00
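The no-alloc pattern the commit describes can be sketched in pure Go: a buffer type whose allocation path only tallies bytes when the no-alloc flag is set. The real implementation wraps GGML via CGo; `bufferType` and its fields here are illustrative only.

```go
package main

import "fmt"

// bufferType sketches the no-alloc idea: when noAlloc is set, Alloc records
// how many bytes a tensor would need instead of reserving memory. Tensors
// must be recreated with noAlloc=false before loading real data.
type bufferType struct {
	noAlloc  bool
	reserved uint64
}

func (bt *bufferType) Alloc(n uint64) []byte {
	bt.reserved += n
	if bt.noAlloc {
		return nil // measuring only; no memory is committed
	}
	return make([]byte, n)
}

func main() {
	// First pass: measure the memory requirement without allocating.
	measure := &bufferType{noAlloc: true}
	measure.Alloc(1 << 20)
	measure.Alloc(4 << 20)
	fmt.Println(measure.reserved) // 5242880 (5 MiB), nothing committed
}
```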
Jesse Gross 756c78cfc7 ggml: Support closing backends
In order to iteratively find the best memory allocation, we need to
be able to free backend memory so we can try again.
2025-08-08 14:57:13 -07:00
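The iterative search that closable backends enable can be sketched as a try/close/retry loop. Everything here (`backend`, the 512 MiB per-layer cost, `tryLoad`) is a hypothetical simplification of the real allocation logic.

```go
package main

import (
	"errors"
	"fmt"
)

// backend is a hypothetical stand-in; Close frees backend memory so the
// next allocation attempt starts from a clean slate.
type backend struct{ budget uint64 }

func (b *backend) Close() { /* free device buffers here */ }

var errNoFit = errors.New("allocation does not fit")

// tryLoad pretends each layer costs 512 MiB of the backend's budget.
func tryLoad(b *backend, layers uint64) error {
	if layers*512<<20 > b.budget {
		return errNoFit
	}
	return nil
}

// bestLayers sketches the search: try an allocation, and on failure close
// the backend to free its memory before retrying with fewer layers.
func bestLayers(budget, max uint64) uint64 {
	for n := max; n > 0; n-- {
		b := &backend{budget: budget}
		if err := tryLoad(b, n); err == nil {
			return n
		}
		b.Close() // free memory so the next attempt can proceed
	}
	return 0
}

func main() {
	fmt.Println(bestLayers(8<<30, 32)) // 16 layers at 512 MiB each fill 8 GiB
}
```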
Jesse Gross d7f4f788d1 ggml: Use GGML's typedef'ed pointer types
For many backend data structures, GGML defines a typedef of a pointer
type and returns these from functions. In most cases, CGo understands
that these are interchangeable, but some parts of Go (such as generics)
think they are two different types. We should prefer the form that
GGML uses.
2025-08-08 14:57:13 -07:00
Daniel Hiltgen 114c3f2265
tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Jesse Gross f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with a limited amount of VRAM, there is a significant
performance hit with this increased context. In these cases, we
switch to the Ollama default of 4k.
2025-08-07 14:23:55 -07:00
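The VRAM-based fallback can be sketched as a simple threshold check. The 8 GiB cutoff below is illustrative, not the value the commit actually ships.

```go
package main

import "fmt"

// defaultContextLength sketches the fallback: gpt-oss prefers at least an
// 8k context, but on low-VRAM GPUs we drop back to Ollama's 4k default.
// The lowVRAM threshold is an assumption for illustration.
func defaultContextLength(freeVRAM uint64) int {
	const lowVRAM = 8 << 30 // 8 GiB, illustrative cutoff
	if freeVRAM < lowVRAM {
		return 4096
	}
	return 8192
}

func main() {
	fmt.Println(defaultContextLength(6 << 30))  // small GPU -> 4096
	fmt.Println(defaultContextLength(24 << 30)) // ample VRAM -> 8192
}
```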
Devon Rifkin aa9d889522
Merge pull request #11765 from ollama/drifkin/thinking-without-content
openai: always provide reasoning
2025-08-06 19:02:23 -07:00
Devon Rifkin 735c41f9ca openai: always provide reasoning
We were missing passing along thinking if content was nil (as opposed
to empty string)

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and empty string are distinct
2025-08-06 18:54:20 -07:00
Devon Rifkin 223a619468
Merge pull request #11761 from ollama/drifkin/openai-tool-names
openai: when converting role=tool messages, propagate the tool name
2025-08-06 17:53:25 -07:00
Devon Rifkin 759dd78dd6 openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id` we inspect previous messages
and look for a matching tool call ID and grab its name

Issue: https://github.com/ollama/ollama/issues/11704
2025-08-06 17:00:24 -07:00
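The lookup the commit describes can be sketched as: prefer an explicit `name`, otherwise scan earlier messages for a tool call whose ID matches `tool_call_id` and take its name. Types here are simplified stand-ins for the OpenAI-compat message structs.

```go
package main

import "fmt"

type toolCall struct{ ID, Name string }

// chatMessage is a simplified stand-in for the OpenAI-compat message type.
type chatMessage struct {
	Role       string
	Name       string // legacy completions-API field
	ToolCallID string
	ToolCalls  []toolCall
}

// toolName resolves the tool's name for a role=tool message: the explicit
// Name field wins; otherwise match ToolCallID against prior tool calls.
func toolName(msg chatMessage, history []chatMessage) string {
	if msg.Name != "" {
		return msg.Name
	}
	for _, prev := range history {
		for _, tc := range prev.ToolCalls {
			if tc.ID == msg.ToolCallID {
				return tc.Name
			}
		}
	}
	return ""
}

func main() {
	history := []chatMessage{{Role: "assistant", ToolCalls: []toolCall{{ID: "call_1", Name: "get_weather"}}}}
	fmt.Println(toolName(chatMessage{Role: "tool", ToolCallID: "call_1"}, history)) // get_weather
}
```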
Patrick Devine 44bc36d063
docs: update the faq (#11760) 2025-08-06 16:55:57 -07:00
Devon Rifkin 8f14e1f5f6
Merge pull request #11759 from ollama/drifkin/oai-tool-calling
openai: allow for content _and_ tool calls in the same message
2025-08-06 16:11:31 -07:00
Devon Rifkin 203c137810 openai: allow for content _and_ tool calls in the same message
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together

Fixes: https://github.com/ollama/ollama/issues/11704
2025-08-06 15:50:30 -07:00
Daniel Hiltgen fa8be9e35c
clean up debugging (#11756) 2025-08-06 13:31:22 -07:00
Gao feng 8a75e9ee15
Update downloading to pulling in api.md (#11170)
Update api.md to make it consistent with the code.
https://github.com/ollama/ollama/blob/main/server/download.go#L447
2025-08-06 11:33:09 -07:00
Parth Sareen 4742e12c23
docs: update turbo model name (#11707) 2025-08-05 17:29:08 -07:00
Devon Rifkin 2d06977ade
Merge pull request #11705 from ollama/drifkin/fn-schema
tools: support anyOf types
2025-08-05 17:02:42 -07:00
Devon Rifkin 30f8a68c4c tools: support anyOf types
afaik gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
was assuming that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of json schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., zed's
ollama integration w/ gpt-oss). Probably the most urgent is proper array
support
2025-08-05 16:46:24 -07:00
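The recursive mapping that `toTypeScriptType()` performs can be sketched as follows. This is an illustrative reimplementation over a minimal slice of JSON Schema, not the function in Ollama's tools package: `anyOf` branches become a TypeScript union, so templates no longer need to assume a plain `type` field.

```go
package main

import (
	"fmt"
	"strings"
)

// schema is a minimal slice of JSON Schema, enough to show the idea.
type schema struct {
	Type  string
	AnyOf []schema
	Items *schema
}

// toTypeScriptType recursively renders a schema as a TypeScript type:
// anyOf -> union, array -> T[], integer -> number, empty -> any.
func toTypeScriptType(s schema) string {
	switch {
	case len(s.AnyOf) > 0:
		parts := make([]string, 0, len(s.AnyOf))
		for _, sub := range s.AnyOf {
			parts = append(parts, toTypeScriptType(sub))
		}
		return strings.Join(parts, " | ")
	case s.Type == "array" && s.Items != nil:
		return toTypeScriptType(*s.Items) + "[]"
	case s.Type == "integer":
		return "number"
	case s.Type == "":
		return "any"
	default:
		return s.Type // "string", "number", "boolean", ...
	}
}

func main() {
	s := schema{AnyOf: []schema{{Type: "string"}, {Type: "integer"}}}
	fmt.Println(toTypeScriptType(s)) // string | number
}
```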
Daniel Hiltgen e378e33421
win: static link msvc libs (#11612)
This should help reduce the runtime dependencies on windows.
2025-08-05 16:10:42 -07:00