Commit Graph

4543 Commits

Author SHA1 Message Date
Inforithmics d71c83f2ba Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-14 22:11:08 +02:00
Daniel Hiltgen 7ccfd97a93
doc: clarify both rocm and main bundle necessary (#11900)
Some users expect the rocm bundles to be self-sufficient, but they are designed to be additive.
2025-08-14 12:54:55 -07:00
Daniel Hiltgen c385ca8672
test: add valid responses (#11902)
Some of the new models need a few more valid responses to pass.
2025-08-14 11:07:13 -07:00
Daniel Hiltgen 837379a94c
discovery: fix cudart driver version (#11614)
We prefer the nvcuda library, which reports driver versions. When we
dropped cuda v11, we added a safety check for too-old drivers. What
we missed was that the cudart fallback discovery logic didn't have the
driver version wired up. This fixes cudart discovery to expose the
driver version as well, so we no longer reject all GPUs when nvcuda doesn't work.
2025-08-13 15:43:33 -07:00
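
A minimal Go sketch of the fallback described above; `nvcudaLookup`, `cudartLookup`, and `gpuInfo` are hypothetical stand-ins, not Ollama's actual discovery API:

```go
package main

import (
	"errors"
	"fmt"
)

// gpuInfo is an illustrative stand-in for Ollama's GPU info struct.
type gpuInfo struct {
	ID          string
	DriverMajor int
	DriverMinor int
}

// nvcudaLookup and cudartLookup are hypothetical stubs for the two
// discovery paths; the real code queries the CUDA libraries.
func nvcudaLookup() ([]gpuInfo, error) { return nil, errors.New("nvcuda unavailable") }

func cudartLookup() ([]gpuInfo, error) {
	// the fix: the cudart path must also fill in DriverMajor/DriverMinor
	return []gpuInfo{{ID: "GPU-0", DriverMajor: 12, DriverMinor: 4}}, nil
}

func main() {
	gpus, err := nvcudaLookup()
	if err != nil {
		gpus, err = cudartLookup() // fallback path
		if err != nil {
			return
		}
	}
	for _, g := range gpus {
		// before the fix, a zero driver version here tripped the
		// too-old-driver check and every GPU was rejected
		if g.DriverMajor >= 12 {
			fmt.Printf("GPU %s driver %d.%d accepted\n", g.ID, g.DriverMajor, g.DriverMinor)
		}
	}
}
```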
Daniel Hiltgen a24f90604f
int: adjust a few models for integration tests (#11872) 2025-08-13 15:42:36 -07:00
Daniel Hiltgen dc5a645434
cuda: leverage JIT for smaller footprint (#11635)
Prior to this change, our official binaries contained both JIT PTX code and
the cubin binary code for our chosen compute capabilities. This change
switches to compiling only the PTX code and relying on JIT at runtime to
generate the cubin specific to the user's GPU. The cubins are cached
on the user's system, so they should only see a small lag on the very
first model load for a given Ollama release. This also adds the first
generation of Blackwell GPUs so they aren't reliant on the Hopper PTX.

This change reduces ggml-cuda.dll from 1.2 GB to 460 MB.
2025-08-13 15:42:16 -07:00
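
For context, NVIDIA's JIT cache honors documented environment variables (CUDA_CACHE_PATH, CUDA_CACHE_MAXSIZE); a hedged Go sketch of enlarging the cache so recompiled cubins aren't evicted between releases (the size chosen is illustrative):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// NVIDIA's JIT compile cache is controlled by standard env vars.
	// Enlarging it keeps cubins for multiple releases cached, so the
	// one-time JIT lag described above happens only once per release.
	if os.Getenv("CUDA_CACHE_MAXSIZE") == "" {
		os.Setenv("CUDA_CACHE_MAXSIZE", fmt.Sprint(2<<30)) // 2 GiB; illustrative size
	}
	// CUDA_CACHE_PATH can relocate the cache, e.g. to a faster disk.
	fmt.Println("JIT cache override:", os.Getenv("CUDA_CACHE_PATH"))
}
```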
Inforithmics 6543213e6f Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-13 23:50:00 +02:00
youzichuan bb71654ebe chore: fix some inconsistent function names in comments
Signed-off-by: youzichuan <youzichuan6@outlook.com>
2025-08-13 09:50:27 -07:00
Inforithmics eaf42a646c Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-13 08:27:22 +02:00
Jesse Gross a343ae53a4 ggml: Use ordinal IDs for AMD GPUs on Linux when UUID is unavailable
Some AMD GPUs do not provide UUIDs and report only "XX". In these
cases, we should use the ordinal ID as an alternate identifier.
This matches what we already have to do on Windows for AMD.

In addition, this prints out the ID for each GPU when enumerating
them for easier debugging in the future.
2025-08-12 16:56:14 -07:00
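
A minimal sketch of the ordinal fallback, assuming a hypothetical `gpuID` helper:

```go
package main

import (
	"fmt"
	"strings"
)

// gpuID sketches the selection logic: prefer the UUID, but fall back to
// the ordinal index when the driver reports a placeholder like "XX".
func gpuID(ordinal int, uuid string) string {
	trimmed := strings.TrimSpace(uuid)
	if trimmed == "" || strings.Trim(trimmed, "X") == "" {
		return fmt.Sprint(ordinal) // ordinal fallback, as on Windows
	}
	return trimmed
}

func main() {
	// enumerate and print each GPU's chosen ID for easier debugging
	for i, uuid := range []string{"GPU-6f3c...", "XX"} {
		fmt.Printf("gpu %d -> id %q\n", i, gpuID(i, uuid))
	}
}
```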
Inforithmics 49c4d154ae Enable Vulkan Flash attention in FlashAttentionSupported 2025-08-12 21:55:19 +02:00
Inforithmics e6da524ab7 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-12 21:51:39 +02:00
Inforithmics 2244f304d7 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-12 21:43:10 +02:00
Michael Yang d0cf6c8281
fix(openai): handle reasoning_effort (#11868) 2025-08-12 11:02:01 -07:00
Jesse Gross 8f4ec9ab28 discover: CPU supports flash attention
We already run flash attention on CPUs in cases where we have
partial offloading, but we were disabling it when running on pure CPU,
which is unnecessary.
2025-08-11 15:00:34 -07:00
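
A hedged sketch of the corrected capability check; the library names and the `flashAttentionSupported` signature are illustrative, not Ollama's actual code:

```go
package main

import "fmt"

// flashAttentionSupported sketches the corrected check: flash attention
// already runs on the CPU during partial offload, so a pure-CPU run
// should not disable it either.
func flashAttentionSupported(library string) bool {
	switch library {
	case "cuda", "rocm", "metal", "cpu": // "cpu" is the newly allowed case
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(flashAttentionSupported("cpu")) // true after this change
}
```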
Devon Rifkin dbfd7bd027
Merge pull request #11861 from ollama/drifkin/fix-parsing-error
server: fix error when parsing bad harmony tool calls
2025-08-11 14:59:57 -07:00
Devon Rifkin ee04dbba51 server: fix error when parsing bad harmony tool calls
Thanks @moll for reporting!

Fixes: #11781
2025-08-11 14:09:13 -07:00
Daniel Andersen ea7657b54a
sched: Add support for grouping GPUs (#10678)
This patch modifies Ollama to allow grouping GPUs and memory-fitting the requested model to a group, instead of the former algorithm of using one GPU or distributing over all available GPUs.

Benefits:
 - Less (PCIe) bus communication between GPUs, especially when the links are not very fast
 - Unallocated GPUs can enter power-saving mode
 - Significantly reduced VRAM allocation when using more than two GPUs in a system
 - Due to the reduced memory allocation, more models can run simultaneously
2025-08-11 13:59:38 -07:00
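
A minimal sketch of the grouping idea under an assumed greedy largest-first policy; `smallestFittingGroup` and its policy are illustrative, not the patch's exact algorithm:

```go
package main

import (
	"fmt"
	"sort"
)

// smallestFittingGroup picks the smallest set of GPUs whose combined
// free VRAM fits the model, rather than spreading it across every GPU.
func smallestFittingGroup(freeVRAM []uint64, need uint64) []int {
	idx := make([]int, len(freeVRAM))
	for i := range idx {
		idx[i] = i
	}
	// try the largest GPUs first so the group stays small
	sort.Slice(idx, func(a, b int) bool { return freeVRAM[idx[a]] > freeVRAM[idx[b]] })

	var group []int
	var total uint64
	for _, i := range idx {
		group = append(group, i)
		total += freeVRAM[i]
		if total >= need {
			return group // remaining GPUs stay idle and can power down
		}
	}
	return nil // model does not fit on any grouping
}

func main() {
	fmt.Println(smallestFittingGroup([]uint64{8 << 30, 24 << 30, 8 << 30}, 20<<30)) // [1]
}
```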
Inforithmics 0c27f472e7 Remove commented out code 2025-08-11 18:52:43 +02:00
Inforithmics e3627b2832 Add vulkan to Windows Build script 2025-08-11 18:39:10 +02:00
Inforithmics d1f74e17d4 Update gpu.go 2025-08-10 21:28:59 +02:00
Inforithmics f6dd7070de vk_check_flash_attention 0 means supported 2025-08-10 21:22:26 +02:00
Inforithmics ee24b967f1 fixed flash attention logic enabling 2025-08-10 19:57:14 +02:00
Inforithmics a1393414ce revert remove parenthesis 2025-08-10 17:54:13 +02:00
Inforithmics 5270c4c5f7 enable flash attention on vulkan 2025-08-10 16:53:13 +02:00
Inforithmics 60a015e8c3 Revert changes in ggml.go 2025-08-10 16:09:44 +02:00
Inforithmics 1edbfd0559 Revert changes in ggml.go 2025-08-10 16:07:24 +02:00
Inforithmics fd4480a848 Fixed duplicate sync in ggml.go 2025-08-10 16:05:09 +02:00
Inforithmics 2e7452be71 Update Vulkan Code to de4c07f93783a1a96456a44dc16b9db538ee1618 2025-08-10 16:01:07 +02:00
Michael Vorburger 2c776f0780
CONTRIBUTING: Explicitly note docs:... as a good example (#11755) 2025-08-09 18:12:30 -07:00
Thomas Stocker bc5c3fb213
Revert vulkan copy changes in Dockerfile 2025-08-09 22:45:52 +02:00
Thomas Stocker fa13b8de45
Revert some unintended changes in Dockerfile 2025-08-09 22:43:12 +02:00
Thomas Stocker d03fc13d36
Revert changes in Makefile.sync 2025-08-09 22:38:37 +02:00
Thomas Stocker a6d0d6c6ff
Revert changes in runner.go 2025-08-09 22:35:20 +02:00
Thomas Stocker 0ddb64db1f
Revert changes in transforms_test.go 2025-08-09 22:33:42 +02:00
Thomas Stocker 29b1ed0077
Revert whitespace changes in gpu.go 2025-08-09 22:30:13 +02:00
Thomas Stocker 57270767ac
Remove flash attention setting in gpu.go 2025-08-09 22:26:54 +02:00
Thomas Stocker 42463fbb7f
Revert changes in amd_linux.go 2025-08-09 22:24:33 +02:00
Thomas Stocker 89ac91099d
Revert changes in amd_linux.go 2025-08-09 22:23:00 +02:00
Thomas Stocker 47bff3e532
Revert 2025-08-09 22:15:54 +02:00
Thomas Stocker 643b1c505e
Revert README changes 2025-08-09 22:14:54 +02:00
Inforithmics f8ed1541ed Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-09 21:59:30 +02:00
Jesse Gross 79f6376f5b ggml: No-alloc mode
Callers can set a backend buffer type to be no-alloc, meaning that
it does not allocate memory for tensors or operations. This can
be used for calculating memory requirements. Tensors and graphs
must be recreated with no-alloc set to false before loading data.

Defaults to false for newly created backend buffer types.
2025-08-08 14:57:13 -07:00
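
A minimal sketch of the two-pass pattern this enables; `bufferType` and its fields are illustrative, not GGML's API:

```go
package main

import "fmt"

// bufferType is an illustrative stand-in for a backend buffer type
// carrying the no-alloc flag described above.
type bufferType struct {
	noAlloc   bool
	allocated uint64
}

// allocTensor reserves memory, or in no-alloc mode merely tallies how
// much would be needed.
func (bt *bufferType) allocTensor(size uint64) {
	bt.allocated += size // in no-alloc mode this is pure bookkeeping
}

func main() {
	// pass 1: measure requirements without allocating device memory
	measure := &bufferType{noAlloc: true}
	for _, sz := range []uint64{1 << 20, 4 << 20} {
		measure.allocTensor(sz)
	}
	fmt.Printf("would need %d bytes\n", measure.allocated)

	// pass 2: recreate with noAlloc=false (the default) before loading data
	live := &bufferType{}
	live.allocTensor(1 << 20)
}
```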
Jesse Gross 756c78cfc7 ggml: Support closing backends
In order to iteratively find the best memory allocation, we need to
be able to free backend memory so we can try again.
2025-08-08 14:57:13 -07:00
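
A hedged sketch of the retry loop this enables; `backend`, `open`, and the layer budget are invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// backend is an illustrative stand-in; Close frees backend memory so a
// new allocation attempt can be made.
type backend struct{ layers int }

func (b *backend) Close() {} // releases buffers in the real implementation

// open pretends only 32 offloaded layers fit in memory.
func open(layers int) (*backend, error) {
	if layers > 32 {
		return nil, errors.New("out of memory")
	}
	return &backend{layers: layers}, nil
}

func main() {
	// iteratively search for the best allocation: on failure, free the
	// backend memory and retry with fewer offloaded layers
	for layers := 48; layers > 0; layers-- {
		b, err := open(layers)
		if err != nil {
			continue
		}
		fmt.Println("offloading", b.layers, "layers")
		b.Close()
		break
	}
}
```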
Jesse Gross d7f4f788d1 ggml: Use GGML's typedef'ed pointer types
For many backend data structures, GGML defines a typedef of a pointer
type and returns these from functions. In most cases, CGo understands
that these are interchangeable but some parts of Go (such as generics)
think they are two different types. We should prefer the form that
GGML uses.
2025-08-08 14:57:13 -07:00
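
The distinction can be illustrated in pure Go; the types below are stand-ins for the cgo-generated ones:

```go
package main

import "fmt"

// A named pointer type (akin to the typedefs cgo generates for GGML
// handles) is distinct from the raw pointer form, even though values
// convert freely between them.
type backend struct{ id int }

type backendAlias = *backend // type alias: fully interchangeable
type backendT *backend       // named type: distinct to generics

func describe[T any](v T) string { return fmt.Sprintf("%T", v) }

func main() {
	b := &backend{id: 1}
	fmt.Println(describe[backendAlias](b)) // *main.backend
	fmt.Println(describe[backendT](b))     // main.backendT: a separate instantiation
}
```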
Daniel Hiltgen 114c3f2265
tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Jesse Gross f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with a limited amount of VRAM, there is a significant
performance hit with this increased context. In these cases, we
switch to the Ollama default of 4k.
2025-08-07 14:23:55 -07:00
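
A minimal sketch of the policy; the VRAM threshold is hypothetical, not the value Ollama uses:

```go
package main

import "fmt"

// defaultContext sketches the policy above: prefer 8k context for
// gpt-oss, but fall back to Ollama's 4k default when VRAM is tight.
func defaultContext(freeVRAM uint64) int {
	const lowVRAM = 8 << 30 // hypothetical 8 GiB cutoff
	if freeVRAM < lowVRAM {
		return 4096
	}
	return 8192
}

func main() {
	fmt.Println(defaultContext(6 << 30))  // 4096 on a small GPU
	fmt.Println(defaultContext(24 << 30)) // 8192 otherwise
}
```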
Devon Rifkin aa9d889522
Merge pull request #11765 from ollama/drifkin/thinking-without-content
openai: always provide reasoning
2025-08-06 19:02:23 -07:00
Devon Rifkin 735c41f9ca openai: always provide reasoning
We were failing to pass along thinking when content was nil (as opposed
to an empty string).

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and passing an empty string are distinct.
2025-08-06 18:54:20 -07:00
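
A small sketch of why nil and empty-string content serialize differently; the `message` type here is illustrative, not Ollama's actual OpenAI-compat struct:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// With Content as a pointer, a nil value and an empty string marshal
// differently, so "not passed" and "" are distinct cases and reasoning
// must be emitted in both.
type message struct {
	Content   *string `json:"content,omitempty"`
	Reasoning string  `json:"reasoning,omitempty"`
}

func main() {
	empty := ""
	withContent, _ := json.Marshal(message{Content: &empty, Reasoning: "step 1..."})
	noContent, _ := json.Marshal(message{Content: nil, Reasoning: "step 1..."})
	fmt.Println(string(withContent)) // {"content":"","reasoning":"step 1..."}
	fmt.Println(string(noContent))   // {"reasoning":"step 1..."} -- content omitted
}
```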
Devon Rifkin 223a619468
Merge pull request #11761 from ollama/drifkin/openai-tool-names
openai: when converting role=tool messages, propagate the tool name
2025-08-06 17:53:25 -07:00