Commit Graph

4788 Commits

Author SHA1 Message Date
Inforithmics
9ac9f3a952 fixed formatting 2025-10-04 16:32:39 +02:00
Inforithmics
b2aba4ea83 fixed build 2025-10-04 16:26:03 +02:00
Inforithmics
06528d66aa fixing build 2025-10-04 16:22:55 +02:00
Inforithmics
75f65bcdbf merge fixes 2025-10-04 16:11:34 +02:00
Inforithmics
1e46db8748 fixed build 2025-10-04 15:44:23 +02:00
Inforithmics
c4d8c75e54 merge fixes 2025-10-04 15:27:52 +02:00
Inforithmics
294b179688 merge fixes 2025-10-04 15:20:33 +02:00
Inforithmics
f567cc59d4 fix build 2025-10-04 15:08:18 +02:00
Inforithmics
e6c28916e1 Merge branch 'vulkanV3' into VulkanV3Update 2025-10-04 14:59:30 +02:00
Inforithmics
ac6ba7d44b Merge remote-tracking branch 'upstream/main' into VulkanV3Update 2025-10-04 14:53:59 +02:00
Jesse Gross
19e6796eac llm: Support KV cache quantization with gpt-oss
With the new version of GGML in #12245, KV cache quantization
no longer causes a fallback to CPU.
2025-10-03 16:31:58 -07:00
Grace
33801c1597 Fixed Deepseek2 adding nil tensor error 2025-10-03 14:20:06 -07:00
Daniel Hiltgen
e4340667e3 Workaround broken NVIDIA iGPU free VRAM data (#12490)
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU
systems: they return only the kernel's actual free memory and ignore
buff/cache allocations, which on a typical system quickly fill up
most of the free system memory. As a result, we incorrectly conclude
there's very little memory available for GPU allocations.
2025-10-03 12:17:21 -07:00
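
On Linux systems, the kernel's `MemAvailable` figure in `/proc/meminfo` folds in reclaimable buff/cache, which is why it is a better signal than a raw free-memory query on unified-memory iGPUs. A minimal Go sketch of reading it, illustrative rather than the actual patch:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memAvailableKB reads MemAvailable from /proc/meminfo, which counts
// reclaimable buff/cache pages that a plain "free memory" query ignores.
func memAvailableKB() (int64, error) {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "MemAvailable:") {
			fields := strings.Fields(line)
			if len(fields) >= 2 {
				return strconv.ParseInt(fields[1], 10, 64)
			}
		}
	}
	return 0, fmt.Errorf("MemAvailable not found")
}

func main() {
	kb, err := memAvailableKB()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("available: %d kB\n", kb)
}
```
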
Patrick Devine
2fa1e92a99 test: add template error test (#12489) 2025-10-03 12:05:34 -07:00
Daniel Hiltgen
07e36761c3 ci: place rocm windows in correct runner dir (#12487) v0.12.4-rc4 2025-10-03 07:28:40 -07:00
Daniel Hiltgen
c29fb007c0 CI: temporarily disable clang install (#12486)
This will likely yield builds that have problems with unicode characters,
but at least we can start testing the release until we find an
alternate clang compiler for Windows or MinGW ships a fixed version.
v0.12.4-rc3
2025-10-02 20:31:18 -07:00
Daniel Hiltgen
730ed6e9e1 ci: fix windows build (#12485) v0.12.4-rc2 2025-10-02 19:16:01 -07:00
Daniel Hiltgen
dc06601677 ci: fix windows build (#12484) v0.12.4-rc1 2025-10-02 18:59:26 -07:00
Patrick Devine
1ed2881ef0 templates: fix crash in improperly defined templates (#12483) 2025-10-02 17:25:55 -07:00
Jesse Gross
0bda72892c llm: Enable flash attention by default for qwen3 and qwen3moe v0.12.4-rc0 2025-10-02 17:04:10 -07:00
Daniel Hiltgen
55ca827267 AMD: block running on unsupported gfx900/gfx906 (#12481) 2025-10-02 16:53:05 -07:00
Daniel Hiltgen
c68f367ef6 Update GGML to b6646 (#12245)
Notable EOLs with this change:
- macOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
2025-10-02 14:47:10 -07:00
Jesse Gross
fdb109469f llm: Allow overriding flash attention setting
As we automatically enable flash attention for more models, there
are likely some cases where we get it wrong. This allows setting
OLLAMA_FLASH_ATTENTION=0 to disable it, even for models that usually
have flash attention.
2025-10-02 12:07:20 -07:00
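
A minimal sketch of the override semantics described above, assuming a simple tri-state read of the environment variable; the real envconfig plumbing may differ:

```go
package main

import (
	"fmt"
	"os"
)

// flashAttentionEnabled applies the override semantics described above:
// an explicit OLLAMA_FLASH_ATTENTION value wins over the per-model default.
func flashAttentionEnabled(modelDefault bool) bool {
	switch os.Getenv("OLLAMA_FLASH_ATTENTION") {
	case "0", "false":
		return false // explicit opt-out, even for models that default to on
	case "1", "true":
		return true
	default:
		return modelDefault // unset: keep the model's default
	}
}

func main() {
	os.Setenv("OLLAMA_FLASH_ATTENTION", "0")
	fmt.Println(flashAttentionEnabled(true)) // false: the override wins
}
```
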
Daniel Hiltgen
05a43e078a fix panic on bootstrapDevices (#12475)
Wrong index variable was used.
2025-10-01 17:39:29 -07:00
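
The general shape of this bug class, as an illustration rather than the actual bootstrapDevices code: indexing an inner slice with the outer loop's variable panics once the outer index outgrows a shorter inner slice.

```go
package main

import "fmt"

func main() {
	devices := [][]string{{"gpu0", "gpu1"}, {"gpu2"}}
	for i := range devices {
		group := devices[i]
		for j := range group {
			fmt.Println(group[j]) // correct: the inner index addresses the inner slice
			// group[i] would panic here once i outgrows the shorter inner slice
		}
	}
}
```
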
Daniel Hiltgen
bc8909fb38 Use runners for GPU discovery (#12090)
This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runners capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable, which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
2025-10-01 15:12:32 -07:00
Devon Rifkin
6b50f2b9cd Merge pull request #12461 from ollama/drifkin/qwen3-coder-tweaks
qwen3-coder: fix tool definition type rendering
2025-09-30 19:47:44 -07:00
Michael Yang
35ac4eb12c fix keep alive
this reference to keep alive was missed in #12041, so chat has a
different behaviour than generate
2025-09-30 17:22:28 -07:00
Jesse Gross
3d0b1734c0 ggml: Preallocate CUDA pool memory
The GGML CUDA backend allocates additional memory for intermediate
results during calculation. This memory isn't currently allocated
during worst case graph reservation and therefore not included in
scheduling. This means that as these buffers potentially grow
with context length, we could crash.

This extends the memory allocation system down a layer from the GGML
graph to the CUDA layer, preallocating the worst case memory there
as well.

Fixes #11753
2025-09-30 15:04:43 -07:00
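
The idea, sketched generically (the real change lives in the CUDA backend, and these names are illustrative): size the pool for the worst case during the reservation pass, so runtime requests can never outgrow it.

```go
package main

import "fmt"

// pool stands in for the backend's scratch pool; reserving for the worst
// case up front means later requests never trigger a new allocation.
type pool struct{ buf []byte }

func (p *pool) reserve(worstCase int) {
	if cap(p.buf) < worstCase {
		p.buf = make([]byte, 0, worstCase) // sized once, at scheduling time
	}
}

func (p *pool) alloc(n int) []byte {
	if n > cap(p.buf) {
		panic("pool exhausted") // cannot happen if worstCase was honest
	}
	return p.buf[:n]
}

func main() {
	var p pool
	p.reserve(1 << 20)             // worst-case graph reservation
	fmt.Println(len(p.alloc(512))) // runtime requests fit within the reservation
}
```
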
Jesse Gross
efaee8c2d6 ggml: Backport scale kernel fixes
The GGML scale kernel uses signed 32-bit ints to represent
the number of elements in the tensor. For large images,
mistral-small3.2 overflows this, triggering CUDA errors due
to negative arguments.

Currently, this can happen when the user passes a large image
to mistral-small3.2. However, with upcoming changes to reserve
CUDA memory, it happens every time mistral-small is loaded as
we reserve using a worst case batch.

This patch is part of an upstream GGML commit and should be removed
after GGML is updated past 0a1b398 "ggml: add ops for WAN video model
(cuda && cpu) (#15669)".

Fixes #10388
2025-09-30 15:04:43 -07:00
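
The failure mode in miniature: element counts for large tensors can exceed `MaxInt32`, so a kernel parameter declared as a signed 32-bit int wraps negative. A small Go illustration with a made-up count:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// a large image tensor's element count can exceed MaxInt32 (2147483647)
	elems := int64(3_000_000_000)
	fmt.Println(elems > math.MaxInt32) // true
	// a kernel parameter declared as a signed 32-bit int wraps negative,
	// which is what triggered the CUDA errors described above
	fmt.Println(int32(elems)) // -1294967296
}
```
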
Jesse Gross
734b57da0e ggml: Remove allocation status reporting
For each memory allocation we report the size of the (attempted)
allocation and whether it succeeded or failed. The latter status
reporting proved to be not that useful in practice as systems
such as Windows can automatically overflow from VRAM into RAM,
resulting in successful allocations even when there isn't
enough memory where we wanted.

As a result, this information is only used for debug logging,
which isn't worthwhile enough for the amount of code. It
also isn't fully accurate, as multiple allocations may result
in partial failures.
2025-09-30 15:04:43 -07:00
Devon Rifkin
83021fcf0f qwen3-coder: fix tool definition type rendering 2025-09-30 15:03:15 -07:00
Michael Yang
0469861d9d build: call find_package to instantiate library paths 2025-09-30 13:12:46 -07:00
Inforithmics
8619ad6838 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-09-29 22:38:49 +02:00
Inforithmics
a7ddd0e2ae gofumpt fix 2025-09-26 22:15:58 +02:00
羊撅撅
c47154c08d fix: correct condition for AMDGPU_TARGETS filtering logic (#12412) 2025-09-26 11:38:47 -07:00
Patrick Devine
b04e46da3e bugfix: restore the current runOptions if loading fails in the CLI (#12402)
There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
  1. it will not restore the current model settings if the current model is a thinking model; and
  2. it will crash if the current model is a non-thinking model

This bug fix saves the current runOptions and then restores them if the model load
doesn't happen. It also fixes the crash happening for non-thinking models.
v0.12.3
2025-09-25 18:30:45 -07:00
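
A minimal sketch of the save-then-restore pattern this fix describes; `runOptions` and `loadModel` here are simplified stand-ins for the CLI's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

type runOptions struct {
	Model    string
	Thinking bool
}

// loadModel stands in for the CLI's model load; it fails for unknown names.
func loadModel(name string, opts *runOptions) error {
	if name == "missing-model" {
		return errors.New("model not found")
	}
	opts.Model = name
	return nil
}

// tryLoad snapshots the current options and restores them if the load
// fails, so a bad /load leaves the session exactly as it was.
func tryLoad(name string, opts *runOptions) error {
	saved := *opts
	if err := loadModel(name, opts); err != nil {
		*opts = saved
		return err
	}
	return nil
}

func main() {
	opts := runOptions{Model: "qwen3", Thinking: true}
	if err := tryLoad("missing-model", &opts); err != nil {
		fmt.Println(err, "- still using", opts.Model) // settings intact
	}
}
```
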
Devon Rifkin
34efbbd3f0 Merge pull request #12417 from ollama/drifkin/qwen3-coder-unicode
parsers: fix unicode handling for qwen3-coder
2025-09-25 15:56:34 -07:00
Devon Rifkin
05ba4ca1f4 parsers: fix unicode handling for qwen3-coder
When trimming whitespace at the end of every chunk, we were iterating
backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the
multi-byte character ✅ (`"\u2705"`), which is represented in utf-8 as
the three bytes `0xE2 0x9C 0x85`. It happens that `0x85` is NEL, which
passes `unicode.IsSpace()`. Because we were iterating byte-by-byte, this
caused us to mistakenly slice in the middle of the rune, removing `0x85`
and leaving `0xE2 0x9C`, which beyond being the incorrect place to
slice, is not even a valid utf-8 character.

`trailingWhitespaceLen()` was modified to count from the end in a
rune-aware way. Tests with various multibyte unicode characters were
also added.

Fixes: #12414
2025-09-25 15:47:46 -07:00
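
A minimal sketch of the rune-aware approach, assuming a shape like the following; the actual `trailingWhitespaceLen()` in the parsers package may differ in detail:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// trailingWhitespaceLen counts trailing whitespace bytes by walking
// backwards rune-by-rune, so multi-byte characters such as "\u2705"
// (0xE2 0x9C 0x85) are never sliced in the middle.
func trailingWhitespaceLen(s string) int {
	n := 0
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		if !unicode.IsSpace(r) {
			break
		}
		n += size
		s = s[:len(s)-size]
	}
	return n
}

func main() {
	fmt.Println(trailingWhitespaceLen("ok \u2705"))   // 0: the rune is not whitespace
	fmt.Println(trailingWhitespaceLen("ok \u2705  ")) // 2: only the real spaces count
}
```
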
Patrick Devine
5a56ff3cf0 cli: add device signin flow when doing ollama push (#12405) 2025-09-25 15:04:43 -07:00
Gabe Goodhart
2fba04b5fb tools: handle the case where a tool call sends "arguments" or "parameters" as a serialized json string (#12413) 2025-09-25 14:37:39 -07:00
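
One way to express that tolerance, as a hedged sketch rather than the actual change: try the object form first, then fall back to decoding a JSON-encoded string and parsing its contents:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseToolArgs accepts "arguments" either as a JSON object or as a
// string containing serialized JSON, as some models emit.
func parseToolArgs(raw json.RawMessage) (map[string]any, error) {
	var args map[string]any
	if err := json.Unmarshal(raw, &args); err == nil {
		return args, nil // normal case: already an object
	}
	var s string
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	if err := json.Unmarshal([]byte(s), &args); err != nil {
		return nil, err
	}
	return args, nil // string case: decode, then parse the payload
}

func main() {
	obj := json.RawMessage(`{"city":"Berlin"}`)
	str := json.RawMessage(`"{\"city\":\"Berlin\"}"`)
	for _, raw := range []json.RawMessage{obj, str} {
		args, err := parseToolArgs(raw)
		fmt.Println(args, err)
	}
}
```
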
Daniel Hiltgen
5647ac91b2 test: harden integration tests for slow start
If the server takes a while to start up, block
tests from starting until it's online to avoid
setting large timeouts in individual test cases.
2025-09-25 10:50:00 -07:00
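
A sketch of the gating idea with hypothetical names: poll the server once in a setup helper before any test runs, instead of padding every test's timeout:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForServer blocks until url responds or the timeout elapses; the
// helper name and polling interval are illustrative, not the test suite's.
func waitForServer(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil // server is online; tests may start
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("server at %s not ready after %s", url, timeout)
}

func main() {
	if err := waitForServer("http://127.0.0.1:11434", 2*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```
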
Daniel Hiltgen
936c6d6be1 win: fix CPU query buffer handling
Try in a short loop until we get the size right.
2025-09-25 10:50:00 -07:00
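
The grow-and-retry pattern in outline, with a hypothetical `queryCPUInfo` standing in for the Windows call: query with the current buffer and double it whenever the API reports the size was wrong.

```go
package main

import (
	"errors"
	"fmt"
)

var errBufferTooSmall = errors.New("buffer too small")

// queryCPUInfo is a hypothetical stand-in for a Windows query that fails
// when the supplied buffer cannot hold the result.
func queryCPUInfo(buf []byte) (int, error) {
	const need = 1000
	if len(buf) < need {
		return 0, errBufferTooSmall
	}
	return need, nil
}

func main() {
	buf := make([]byte, 256)
	for i := 0; i < 4; i++ { // a short, bounded loop, as the commit describes
		n, err := queryCPUInfo(buf)
		if err == nil {
			fmt.Println("got", n, "bytes")
			return
		}
		buf = make([]byte, len(buf)*2) // grow and retry
	}
	fmt.Println("giving up")
}
```
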
Inforithmics
82f0c7e6a5 ask for supported first 2025-09-25 08:47:04 +02:00
Inforithmics
05bdfedb56 Handle GGML_VK_VISIBLE_DEVICES 2025-09-25 08:23:13 +02:00
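
Presumably this filters device enumeration by a comma-separated index list; a hedged sketch of such parsing, with illustrative details:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// visibleVulkanDevices filters device indices by GGML_VK_VISIBLE_DEVICES,
// assumed here to be a comma-separated list such as "0,2"; empty means all.
func visibleVulkanDevices(total int) []int {
	env := os.Getenv("GGML_VK_VISIBLE_DEVICES")
	if env == "" {
		all := make([]int, total)
		for i := range all {
			all[i] = i
		}
		return all
	}
	var out []int
	for _, part := range strings.Split(env, ",") {
		if idx, err := strconv.Atoi(strings.TrimSpace(part)); err == nil && idx >= 0 && idx < total {
			out = append(out, idx)
		}
	}
	return out
}

func main() {
	os.Setenv("GGML_VK_VISIBLE_DEVICES", "0,2")
	fmt.Println(visibleVulkanDevices(3)) // [0 2]
}
```
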
Inforithmics
a7e2d21f59 vk_check_flash_attention is not needed (coopmat2, coopmat, and scalar implementations exist) 2025-09-25 06:33:15 +02:00
Inforithmics
a2f2d41d89 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-09-25 03:22:25 +02:00
Inforithmics
3a45922c01 Test if Vulkan device is supported 2025-09-25 03:22:01 +02:00
Daniel Hiltgen
5f9f312bdb fix - give bootstrapping more time on slow systems 2025-09-24 16:25:56 -07:00
Daniel Hiltgen
5c18fb456c fix vulkan ids to use the underlying device ids 2025-09-24 15:48:35 -07:00
Grace
fbd82ba5bb Grace/deepseek v3 migration (#12385)
* init deepseek model file

* temp removal of flash attention implementation

* shapes and proper, can make a pass

* query, key, value have good cosine similarity, but the max diff is a bit high

* Attention block is working! ** with eager for now, have not added the mask line

* working MoE at around 0.95 cosine sim

* added cosine similarity function (see the sketch after this entry)

* Starting end to end structure

* Trying (and failing) to get rope to work, going to test full thing on tater

* running on tater36... just not the right outputs

* we have the right values for rope... but its still not working?

* change Extrapolation Factor to 1

* removed adding residuals twice, removed normalization from shared expert, refactored Norms (Attention, MLP) to be outside the (Attention, MLP) blocks and in the Transformer block instead, add cache setLayer

* Temporary modelfiles for cpu

* change kpass intermediate step to kv, two layer outputs [0,1] look fine

* this calls for 16 chicken nuggets

* whoops

* cleaning up code

* delete stuff we dont need

* getting rid of debug statements for llama cpp

* working with long contexts

* fix long context view error

* reverting some changes I made to files that are not a part of the PR

* Added proper tokenizer for deepseek3

* clean up model and go test

* remove Modelfile

* not passing the tests

* whoops

* how to pass the ci tests

* resolving some of the comments

* rename

* linted and renamed deepseek3 -> deepseek2

* remove name go

* addressed changes - main change was adopting qwen3 naming scheme

* I cannot with linters

* clean up logs

* clean up logs

---------

Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain>
Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local>
Co-authored-by: graceguo <graceguo@tater36.localdomain>
2025-09-24 15:19:47 -07:00
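
The "added cosine similarity function" bullet refers to checking ported layers against a reference implementation; a minimal sketch of such a check, with illustrative names and data:

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity compares two activation vectors; values near 1.0 mean
// the ported layer tracks the reference implementation closely.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) {
		panic("length mismatch")
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	ref := []float32{0.12, -0.90, 0.41}
	got := []float32{0.11, -0.92, 0.40}
	fmt.Printf("%.4f\n", cosineSimilarity(ref, got)) // close to 1.0
}
```
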