Commit Graph

4783 Commits

Author SHA1 Message Date
Inforithmics
49c4d154ae Enable Vulkan Flash attention in FlashAttentionSupported 2025-08-12 21:55:19 +02:00
Inforithmics
e6da524ab7 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-12 21:51:39 +02:00
Inforithmics
2244f304d7 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-12 21:43:10 +02:00
Michael Yang
d0cf6c8281 fix(openai): handle reasoning_effort (#11868) 2025-08-12 11:02:01 -07:00
Jesse Gross
8f4ec9ab28 discover: CPU supports flash attention
We already run flash attention on CPUs in cases where we have
partial offloading but were disabling it if running on pure CPU,
which is unnecessary.
2025-08-11 15:00:34 -07:00
Devon Rifkin
dbfd7bd027 Merge pull request #11861 from ollama/drifkin/fix-parsing-error
server: fix error when parsing bad harmony tool calls
2025-08-11 14:59:57 -07:00
Devon Rifkin
ee04dbba51 server: fix error when parsing bad harmony tool calls
Thanks @moll for reporting!

Fixes: #11781
2025-08-11 14:09:13 -07:00
Daniel Andersen
ea7657b54a sched: Add support for grouping GPUs (#10678)
This patch modifies Ollama to group GPUs so that the requested model fits in the group's combined memory, instead of the former algorithm of using either one GPU or distributing over all available GPUs.

Benefits:
 - Lower amount of (PCIe-)bus communication between GPUs - especially when they are not very high speed
 - Allowing unallocated GPUs to get into power-saving mode.
 - Significantly reduce VRAM allocation when using more than 2 GPUs in a system
 - Due to the reduced memory allocation, you can run more models simultaneously.
2025-08-11 13:59:38 -07:00
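The grouping idea described in the commit above can be illustrated with a minimal Go sketch. This is not the scheduler code from the patch: the `gpu` type, the `smallestFittingGroup` function, and the greedy "largest free memory first" strategy are all assumptions made for illustration.

```go
package main

import "sort"

// gpu is a hypothetical, simplified view of a device; it is not Ollama's type.
type gpu struct {
	ID       string
	FreeVRAM uint64 // free memory in bytes
}

// smallestFittingGroup returns the fewest GPUs whose combined free VRAM
// covers required, preferring the devices with the most free memory.
// It returns nil if even all GPUs together are too small.
func smallestFittingGroup(gpus []gpu, required uint64) []gpu {
	sorted := append([]gpu(nil), gpus...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].FreeVRAM > sorted[j].FreeVRAM
	})
	var total uint64
	for i, g := range sorted {
		total += g.FreeVRAM
		if total >= required {
			return sorted[:i+1]
		}
	}
	return nil
}
```

GPUs left out of the returned group stay unallocated, which is what would let them drop into a power-saving state and leave room for other models.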
Inforithmics
0c27f472e7 Remove commented out code 2025-08-11 18:52:43 +02:00
Inforithmics
e3627b2832 Add vulkan to Windows Build script 2025-08-11 18:39:10 +02:00
Inforithmics
d1f74e17d4 Update gpu.go 2025-08-10 21:28:59 +02:00
Inforithmics
f6dd7070de vk_check_flash_attention 0 means supported 2025-08-10 21:22:26 +02:00
Inforithmics
ee24b967f1 fixed flash attention logic enabling 2025-08-10 19:57:14 +02:00
Inforithmics
a1393414ce revert remove parenthesis 2025-08-10 17:54:13 +02:00
Inforithmics
5270c4c5f7 enable flash attention on vulkan 2025-08-10 16:53:13 +02:00
Inforithmics
60a015e8c3 Revert changes in ggml.go 2025-08-10 16:09:44 +02:00
Inforithmics
1edbfd0559 Revert changes in ggml.go 2025-08-10 16:07:24 +02:00
Inforithmics
fd4480a848 Fixed duplicate sync in ggml.go 2025-08-10 16:05:09 +02:00
Inforithmics
2e7452be71 Update Vulkan Code to de4c07f93783a1a96456a44dc16b9db538ee1618 2025-08-10 16:01:07 +02:00
Michael Vorburger
2c776f0780 CONTRIBUTING: Explicitly note docs:... as a good example (#11755) 2025-08-09 18:12:30 -07:00
Thomas Stocker
bc5c3fb213 Revert vulkan copy changes in Dockerfile 2025-08-09 22:45:52 +02:00
Thomas Stocker
fa13b8de45 Revert some unintended changes in Dockerfile 2025-08-09 22:43:12 +02:00
Thomas Stocker
d03fc13d36 Revert changes in Makefile.sync 2025-08-09 22:38:37 +02:00
Thomas Stocker
a6d0d6c6ff Revert changes in runner.go 2025-08-09 22:35:20 +02:00
Thomas Stocker
0ddb64db1f Revert changes in transforms_test.go 2025-08-09 22:33:42 +02:00
Thomas Stocker
29b1ed0077 Revert whitespace changes in gpu.go 2025-08-09 22:30:13 +02:00
Thomas Stocker
57270767ac Remove flashattention setting gpu.go 2025-08-09 22:26:54 +02:00
Thomas Stocker
42463fbb7f Revert changes in amd_linux.go 2025-08-09 22:24:33 +02:00
Thomas Stocker
89ac91099d Revert changes in amd_linux.go 2025-08-09 22:23:00 +02:00
Thomas Stocker
47bff3e532 Revert 2025-08-09 22:15:54 +02:00
Thomas Stocker
643b1c505e Revert Readme changes 2025-08-09 22:14:54 +02:00
Inforithmics
f8ed1541ed Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-09 21:59:30 +02:00
Jesse Gross
79f6376f5b ggml: No-alloc mode
Callers can set a backend buffer type to be no-alloc, meaning that
it does not allocate memory for tensors or operations. This can
be used for calculating memory requirements. Tensors and graphs
must be recreated with no-alloc set to false before loading data.

Defaults to false for newly created backend buffer types.
2025-08-08 14:57:13 -07:00
Jesse Gross
756c78cfc7 ggml: Support closing backends
In order to iteratively find the best memory allocation, we need to
be able to free backend memory so we can try again.
2025-08-08 14:57:13 -07:00
Jesse Gross
d7f4f788d1 ggml: Use GGML's typedef'ed pointer types
For many backend data structures, GGML defines a typedef of a pointer
type and returns these from functions. In most cases, CGo understands
that these are interchangeable but some parts of Go (such as generics)
think they are two different types. We should prefer the form that
GGML uses.
2025-08-08 14:57:13 -07:00
Daniel Hiltgen
114c3f2265 tests: add integration coverage for gpt-oss (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Jesse Gross
f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with limited amount of VRAM, there is a significant
performance hit to this increased context. In these cases, we
switch to the Ollama default of 4k
v0.11.4
2025-08-07 14:23:55 -07:00
Devon Rifkin
aa9d889522 Merge pull request #11765 from ollama/drifkin/thinking-without-content
openai: always provide reasoning
v0.11.4-rc0
2025-08-06 19:02:23 -07:00
Devon Rifkin
735c41f9ca openai: always provide reasoning
We were not passing thinking along when content was nil (as opposed
to an empty string)

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, omitting it and passing an empty string are distinct
2025-08-06 18:54:20 -07:00
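The distinction between a missing content field and an empty string, mentioned in the commit above, can be shown with a small Go sketch. The `message` type and `contentState` function are hypothetical, and typing `Content` as `any` is an assumption; the real type in the compat layer may differ.

```go
package main

import "encoding/json"

// message is a hypothetical stand-in for a chat message; the real type differs.
type message struct {
	Role    string `json:"role"`
	Content any    `json:"content"` // assumption: loosely typed so absence is visible
}

// contentState reports whether the "content" field was absent (or null),
// an empty string, a non-empty string, or some other JSON value.
func contentState(raw string) (string, error) {
	var m message
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return "", err
	}
	switch c := m.Content.(type) {
	case nil:
		return "missing", nil
	case string:
		if c == "" {
			return "empty", nil
		}
		return "text", nil
	default:
		return "other", nil
	}
}
```

Code that only checks for an empty string would treat the "missing" case differently from the "empty" case, which is the kind of gap the commit describes.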
Devon Rifkin
223a619468 Merge pull request #11761 from ollama/drifkin/openai-tool-names
openai: when converting role=tool messages, propagate the tool name
2025-08-06 17:53:25 -07:00
Devon Rifkin
759dd78dd6 openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id` we inspect previous messages
and look for a matching tool call ID and grab its name

Issue: https://github.com/ollama/ollama/issues/11704
2025-08-06 17:00:24 -07:00
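The lookup described in the commit above might look roughly like the following Go sketch. The `toolCall` and `chatMessage` types and the `toolNameFor` function are hypothetical, and giving the legacy `name` field precedence over `tool_call_id` is an assumption, not something the commit message states.

```go
package main

// toolCall and chatMessage are hypothetical, simplified types for illustration.
type toolCall struct {
	ID   string
	Name string
}

type chatMessage struct {
	Role       string
	Name       string // legacy field some clients still send
	ToolCallID string
	ToolCalls  []toolCall
}

// toolNameFor resolves the tool name for msgs[i], a role=tool message.
// It uses the legacy name if present (an assumed precedence); otherwise it
// scans earlier messages, newest first, for a tool call with a matching ID.
// It returns "" when no name can be found.
func toolNameFor(msgs []chatMessage, i int) string {
	m := msgs[i]
	if m.Name != "" {
		return m.Name
	}
	if m.ToolCallID == "" {
		return ""
	}
	for j := i - 1; j >= 0; j-- {
		for _, tc := range msgs[j].ToolCalls {
			if tc.ID == m.ToolCallID {
				return tc.Name
			}
		}
	}
	return ""
}
```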
Patrick Devine
44bc36d063 docs: update the faq (#11760) 2025-08-06 16:55:57 -07:00
Devon Rifkin
8f14e1f5f6 Merge pull request #11759 from ollama/drifkin/oai-tool-calling
openai: allow for content _and_ tool calls in the same message
2025-08-06 16:11:31 -07:00
Devon Rifkin
203c137810 openai: allow for content _and_ tool calls in the same message
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together

Fixes: https://github.com/ollama/ollama/issues/11704
2025-08-06 15:50:30 -07:00
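The rule stated in the commit above (content may be omitted only when tool calls are present, and both may appear together) can be expressed as a small check. This is a hedged Go sketch; the `call` and `assistantMessage` types and the `validateAssistantMessage` function are hypothetical and do not reflect the actual compat-layer code.

```go
package main

import "errors"

// call and assistantMessage are hypothetical, simplified types.
type call struct {
	Name string
}

type assistantMessage struct {
	Content   *string // nil means the field was not provided
	ToolCalls []call
}

// validateAssistantMessage accepts content only, tool calls only, or both.
// It rejects a message that provides neither.
func validateAssistantMessage(m assistantMessage) error {
	if m.Content == nil && len(m.ToolCalls) == 0 {
		return errors.New("message needs content, tool calls, or both")
	}
	return nil
}
```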
Daniel Hiltgen
fa8be9e35c clean up debugging (#11756) 2025-08-06 13:31:22 -07:00
Gao feng
8a75e9ee15 Update downloading to pulling in api.md (#11170)
Update api.md to make it consistent with the code.
https://github.com/ollama/ollama/blob/main/server/download.go#L447
2025-08-06 11:33:09 -07:00
Parth Sareen
4742e12c23 docs: update turbo model name (#11707) v0.11.3 2025-08-05 17:29:08 -07:00
Devon Rifkin
2d06977ade Merge pull request #11705 from ollama/drifkin/fn-schema
tools: support anyOf types
v0.11.3-rc0
2025-08-05 17:02:42 -07:00
Devon Rifkin
30f8a68c4c tools: support anyOf types
afaik gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
was assuming that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of json schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., zed's
ollama integration w/ gpt-oss). Probably the most urgent is proper array
support
2025-08-05 16:46:24 -07:00
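The recursive handling of `anyOf` described in the commit above can be sketched in Go as follows. Only the function name `toTypeScriptType` comes from the commit message; the `property` type, the type mapping, and the array handling are assumptions for illustration and may not match the real implementation.

```go
package main

import "strings"

// property is a hypothetical, trimmed-down JSON Schema node.
type property struct {
	Type  string
	AnyOf []property
	Items *property // element schema for arrays (illustrative only)
}

// toTypeScriptType renders a schema node as a TypeScript-style type string.
// anyOf alternatives become a union, and each alternative is rendered
// recursively, so nested anyOf definitions are handled as well.
func toTypeScriptType(p property) string {
	if len(p.AnyOf) > 0 {
		parts := make([]string, 0, len(p.AnyOf))
		for _, alt := range p.AnyOf {
			parts = append(parts, toTypeScriptType(alt))
		}
		return strings.Join(parts, " | ")
	}
	switch p.Type {
	case "string", "number", "boolean", "null":
		return p.Type
	case "integer":
		return "number"
	case "array":
		if p.Items == nil {
			return "any[]"
		}
		elem := toTypeScriptType(*p.Items)
		if len(p.Items.AnyOf) > 1 {
			elem = "(" + elem + ")" // parenthesize unions before the [] suffix
		}
		return elem + "[]"
	case "object":
		return "Record<string, any>"
	default:
		return "any"
	}
}
```

Keeping the recursion in Go rather than in the template means a template only needs a single call per parameter, which matches the motivation given in the commit.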
Daniel Hiltgen
e378e33421 win: static link msvc libs (#11612)
This should help reduce the runtime dependencies on windows.
2025-08-05 16:10:42 -07:00