Commit Graph

4431 Commits

Author SHA1 Message Date
Devon Rifkin afe0c10dbc
openai: always provide reasoning
We were not passing along thinking when content was nil (as opposed
to an empty string).

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and passing an empty string are distinct.
2025-12-29 06:39:50 -06:00
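
A minimal illustration of the nil-vs-empty distinction described above (the field names are illustrative, not ollama's actual openai compat types): with a pointer-typed content field, an omitted value unmarshals to nil while an explicit empty string does not, and thinking must be forwarded in both cases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type chatMessage struct {
	Role     string  `json:"role"`
	Content  *string `json:"content,omitempty"` // nil when the client omits it entirely
	Thinking string  `json:"thinking,omitempty"`
}

func main() {
	var omitted, empty chatMessage
	json.Unmarshal([]byte(`{"role":"assistant","thinking":"step 1..."}`), &omitted)
	json.Unmarshal([]byte(`{"role":"assistant","content":"","thinking":"step 1..."}`), &empty)

	// Content == nil (not passed) and *Content == "" (empty string) are distinct
	// states; the thinking/reasoning field should be forwarded in both.
	fmt.Println(omitted.Content == nil, *empty.Content == "") // true true
}
```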
Devon Rifkin 1a7d34231f
openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id`, we inspect previous messages,
look for a matching tool call ID, and use its name.

Issue: https://github.com/ollama/ollama/issues/11704
2025-12-29 06:39:50 -06:00
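
A hedged sketch of the lookup described in this commit: given a role=tool message that only carries a tool_call_id, walk earlier messages and recover the function name from the matching tool call. The type and field names below are illustrative, not ollama's actual openai compat types.

```go
type toolCall struct {
	ID       string
	Function struct{ Name string }
}

type message struct {
	Role       string
	Name       string // legacy field from the OpenAI completions API
	ToolCallID string
	ToolCalls  []toolCall
}

// toolNameFor prefers the legacy Name field and otherwise searches prior
// messages for a tool call whose ID matches the tool message's ToolCallID.
func toolNameFor(toolMsg message, history []message) string {
	if toolMsg.Name != "" {
		return toolMsg.Name
	}
	for _, m := range history {
		for _, tc := range m.ToolCalls {
			if tc.ID == toolMsg.ToolCallID {
				return tc.Function.Name
			}
		}
	}
	return "" // no matching tool call found
}
```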
Patrick Devine 45eabc3083
docs: update the faq (#11760) 2025-12-29 06:39:50 -06:00
Devon Rifkin ae9664c01d
openai: allow for content _and_ tool calls in the same message
Previously, our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present; tool calls and content can be provided together.

Fixes: https://github.com/ollama/ollama/issues/11704
2025-12-29 06:39:50 -06:00
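
For reference, a rough sketch of the message shape the compat layer must now accept, with text content and tool calls side by side; the structs loosely mirror the OpenAI wire format and the names are illustrative.

```go
type openAIToolCall struct {
	ID       string `json:"id"`
	Type     string `json:"type"`
	Function struct {
		Name      string `json:"name"`
		Arguments string `json:"arguments"`
	} `json:"function"`
}

type openAIMessage struct {
	Role      string           `json:"role"`
	Content   *string          `json:"content,omitempty"` // optional only when tool calls are present
	ToolCalls []openAIToolCall `json:"tool_calls,omitempty"`
	// A valid assistant message may populate both Content and ToolCalls;
	// conversion must copy both rather than treating them as exclusive.
}
```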
Daniel Hiltgen cb241cab63
clean up debugging (#11756) 2025-12-29 06:39:49 -06:00
Gao feng 3e2a98ad55
Update downloading to pulling in api.md (#11170)
update api.md to make it consistent with the code.
https://github.com/ollama/ollama/blob/main/server/download.go#L447
2025-12-29 06:39:49 -06:00
Parth Sareen 179bbf2640
docs: update turbo model name (#11707) 2025-12-29 06:39:49 -06:00
Devon Rifkin c9304f161a
tools: support anyOf types
AFAIK gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
assumed that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in Go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of JSON Schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., Zed's
ollama integration w/ gpt-oss). Probably the most urgent gap is proper
array support.
2025-12-29 06:39:49 -06:00
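
A simplified sketch of the recursive idea behind `toTypeScriptType()`: when a schema node uses `anyOf`, render each alternative recursively and join them with `|`; otherwise map the plain `type` field. This is an assumption about the general shape, not the actual implementation, which handles more of JSON Schema.

```go
import "strings"

type schema struct {
	Type  string
	AnyOf []schema
}

func toTypeScriptType(s schema) string {
	if len(s.AnyOf) > 0 {
		parts := make([]string, 0, len(s.AnyOf))
		for _, alt := range s.AnyOf {
			parts = append(parts, toTypeScriptType(alt)) // recursion handles nested anyOf
		}
		return strings.Join(parts, " | ")
	}
	switch s.Type {
	case "integer", "number":
		return "number"
	case "string", "boolean":
		return s.Type
	case "null":
		return "null"
	default:
		return "any"
	}
}
```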
Daniel Hiltgen e5b777a8d9
win: static link msvc libs (#11612)
This should help reduce the runtime dependencies on Windows.
2025-12-29 06:39:49 -06:00
Michael Yang b643362f9f
gptoss: fix memory calc (#11700) 2025-12-29 06:39:49 -06:00
Jeffrey Morgan 063d3e8163
docs: add docs for Ollama Turbo (#11687) 2025-12-29 06:39:48 -06:00
Jesse Gross ae8a041461
ggml: Prevent kv cache quantization on gpt-oss
KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.

The model definition does not call flash attention, so it works
regardless of the setting, but the cache will still pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
2025-12-29 06:39:48 -06:00
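
A minimal sketch of the ordering fix described above, with illustrative names: resolve whether flash attention is usable before deriving dependent settings, so a quantized KV cache type is dropped when flash attention is off.

```go
// resolveCacheType falls back to the unquantized default when flash
// attention (which KV cache quantization depends on) is unavailable.
func resolveCacheType(flashAttnSupported bool, requested string) string {
	if !flashAttnSupported && requested != "" && requested != "f16" {
		return "f16" // silently drop the quantized cache type
	}
	return requested
}
```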
Michael Yang ed2e8a9022
gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-12-29 06:39:48 -06:00
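
To make the MXFP4 item above concrete, here is a rough dequantization sketch based on the OCP Microscaling (MX) spec: each block holds one shared E8M0 scale byte plus 32 FP4 (E2M1) elements packed two per byte. The packing order and layout here are illustrative and are not the layout used by the actual CPU/CUDA/Metal kernels.

```go
import "math"

// fp4Values maps the 3 magnitude bits of an E2M1 element to its value;
// the high bit of each nibble is the sign.
var fp4Values = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// dequantMXFP4Block expands one 17-byte block (1 scale byte + 16 packed
// bytes) into 32 float32 values.
func dequantMXFP4Block(block [17]byte) [32]float32 {
	var out [32]float32
	scale := float32(math.Exp2(float64(int(block[0]) - 127))) // shared E8M0 scale
	for i := 0; i < 16; i++ {
		lo, hi := block[1+i]&0x0f, block[1+i]>>4
		for j, nib := range []byte{lo, hi} {
			v := fp4Values[nib&0x7]
			if nib&0x8 != 0 {
				v = -v
			}
			out[2*i+j] = v * scale
		}
	}
	return out
}
```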
Jesse Gross 275510ddf5
kvcache: Log contents of cache when unable to find a slot
There is a bug when using sliding window attention where we run
out of KV cache slots. This is likely due to not correctly removing
all of the entries as they slide out of range. This adds additional
logging when this occurs to track down the source.

Bug #10127
2025-12-29 06:39:48 -06:00
Jesse Gross c24014a55d
kvcache: Enable SWA to retain additional entries
Models that use sliding window attention can only resume a sequence
from the cache if it falls within the saved windows. This works well
if the next message picks up where the old one left off. However, it
generally prevents a partial prefix match unless the entire conversation
falls within the sliding window.

This can be a problem with reasoning models where the traces are
supposed to be removed from future messages, forcing the entire
history to be re-evaluated.

This change allows models to specify that a larger amount of the
history be retained in memory, to allow more partial resumption.
It still respects the window that the model was trained on for
token generation.
2025-12-29 06:39:48 -06:00
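
A hedged sketch of the retention idea, assuming a hypothetical knob name: keep more history in the cache than the trained sliding window so a prefix can still be resumed, while attention itself continues to mask to the trained window.

```go
// cacheRetention returns how many entries to keep per sequence. swaMemory
// is a hypothetical configuration value, not the kvcache package's actual
// option name.
func cacheRetention(trainedWindow, swaMemory int32) int32 {
	if swaMemory > trainedWindow {
		return swaMemory // retain extra entries purely for prefix reuse
	}
	return trainedWindow // generation still only attends within this window
}
```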
Sajal Kulshreshtha b923797e99
fixing broken AMD driver link (#11579) 2025-12-29 06:39:47 -06:00
Daniel Hiltgen 612a87dc69
Revert "CI: switch back to x86 macos builder" (#11588)
This reverts commit 9d071e6089.
2025-12-29 06:39:47 -06:00
Daniel Hiltgen 5038e33776
mac: disable bf16 on unsupported OS versions (#11585)
Support for bf16 was added in macOS v14+, and attempting to enable it
on older versions causes runtime failures.
2025-12-29 06:39:47 -06:00
Daniel Hiltgen 1d064a0e20
CI: switch back to x86 macos builder (#11572) 2025-12-29 06:39:47 -06:00
Oliver Simons 1ee3fe46f3
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
2025-12-29 06:39:47 -06:00
Jesse Gross 279e632945
kvcache: Don't shift empty batches
When we context shift, we delete half the context and apply RoPE
with an offset to the other half. We used to RoPE across the entire
context in a single pass with a zero offset for the deleted
section. With the change to shifting in batches, we can skip any
batches where all of the offsets would be zero. This typically
reduces the number of operations by half.
2025-12-29 06:39:47 -06:00
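
A minimal sketch of the skip check described above (names illustrative): when shifting the cache in batch-sized chunks, a chunk whose RoPE offsets are all zero is a no-op and can be skipped.

```go
// needsShift reports whether any entry in this chunk actually moves.
func needsShift(offsets []int32) bool {
	for _, off := range offsets {
		if off != 0 {
			return true
		}
	}
	return false // deleted/empty region: skip the RoPE call for this chunk
}
```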
Yoshi 9bd69d0110
docs: fix typos and remove trailing whitespaces (#11554) 2025-12-29 06:39:46 -06:00
Mayan EDMS 4975cc042e
readme: add Mayan EDMS to community integrations (#11543) 2025-12-29 06:39:46 -06:00
Jesse Gross cdceaff4e1
kvcache: Group shift operations into batches
Currently, when we need to do a shift on the cache, it is one
RoPE operation on the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass since the forward pass is working in batches.
Since we don't consider shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.

By limiting the size of the RoPE calls to batch size chunks, we
ensure that the shift will never exceed the size of the forward
pass, since the forward pass will also contain a RoPE of the same
size. This does not have a significant impact on performance since
RoPE is a math operation that is mostly proportional to the size
of its inputs.

In theory, defrag could have the same issue since it also creates a
compute graph outside of the forward pass; however, since it only
performs copies, it does not require any working space.
2025-12-29 06:39:46 -06:00
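
A minimal sketch of the chunking, with illustrative names: instead of one RoPE over the whole cache per layer, walk it in batch-sized windows so the shift graph never exceeds the forward pass.

```go
// shiftInBatches applies shiftChunk to batch-sized windows of the cache.
func shiftInBatches(cacheLen, batchSize int, shiftChunk func(start, end int)) {
	for start := 0; start < cacheLen; start += batchSize {
		end := start + batchSize
		if end > cacheLen {
			end = cacheLen
		}
		shiftChunk(start, end) // one bounded RoPE call per chunk
	}
}
```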
Ruyut 9574ed9bb7
CONTRIBUTING: fix typo in commit message example (#11528) 2025-12-29 06:39:46 -06:00
Patrick Devine 0ab1b140af
cli: catch upstream errors gracefully (#11512) 2025-12-29 06:39:46 -06:00
Jeffrey Morgan d9a78742ad
tools: loosen tool argument parsing (#11509) 2025-12-29 06:39:45 -06:00
minxinyi a35d1c358f
server: use slices.Equal to simplify code (#11502) 2025-12-29 06:39:45 -06:00
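For reference, the stdlib helper this cleanup leans on: `slices.Equal` (Go 1.21+) replaces a hand-rolled length-and-element comparison loop. The wrapper name below is hypothetical.

```go
import "slices"

// sameList is a hypothetical example; slices.Equal returns true only when
// both slices have the same length and equal elements in order.
func sameList(a, b []string) bool {
	return slices.Equal(a, b)
}
```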
Michael Yang 26cd61e41f
s#x/exp/maps#maps# (#11506) 2025-12-29 06:39:45 -06:00
Patrick Devine 95f5d9d6da
Fix GetModelInfo (#11496)
---------

Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:45 -06:00
ycomiti f5319ac72b
Update linux.md (#11462) 2025-12-29 06:39:45 -06:00
Stefan Wärting 59b034f040
readme: add GMAI - Gradle Managed to community integrations (#11461) 2025-12-29 06:39:44 -06:00
Jeffrey Morgan 30ec10cb05
tools: fix parsing issue when a tool name is a substring of another (#11456)
Co-authored-by: frob <rick+github@frob.com.au>
2025-12-29 06:39:44 -06:00
zmldndx ffa61a51fc
readme: update argo description to support deep research (#11455) 2025-12-29 06:39:44 -06:00
Daniel Hiltgen 5274cd2ead
ci: switch mac builder to arm64 (#11379)
The macos-13 runner is x86, while macos-13-xlarge is arm64.
2025-12-29 06:39:44 -06:00
frob a1a350b608
docs: add the no-Modelfile function of `ollama create` (#9077) 2025-12-29 06:39:44 -06:00
frob b2a00a0d2a
openai: allow openai endpoint to accept webp images (#11412)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:39:44 -06:00
Haiyue Wang 2e57f92b0c
readme: update the llama.cpp github link (#11427) 2025-12-29 06:39:43 -06:00
Michael Yang 7221b90fe1
compile bf16 support into ggml-metal (#11430) 2025-12-29 06:39:43 -06:00
Parth Sareen 1c48526e2e
cmd: add default assistant role to message construction (#11431) 2025-12-29 06:39:43 -06:00
Bruce MacDonald 9e9238103d
api: fix unreachable status err (#11423)
StatusError was unreachable: the client always checked for error messages in the response body first, and the server always includes error messages with HTTP error status codes.
2025-12-29 06:39:43 -06:00
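
A hedged sketch of the pattern at issue, with illustrative types: decode an error message from the response body when one is present, and only fall back to a bare status-code error when the body carries none.

```go
import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func checkResponse(resp *http.Response) error {
	if resp.StatusCode < 400 {
		return nil
	}
	body, _ := io.ReadAll(resp.Body)
	var apiErr struct {
		Error string `json:"error"`
	}
	if json.Unmarshal(body, &apiErr) == nil && apiErr.Error != "" {
		return fmt.Errorf("%s", apiErr.Error) // server-provided message wins
	}
	return fmt.Errorf("request failed: status %d", resp.StatusCode) // fallback status error
}
```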
Marcelo Fornet 8c885fe5eb
docs: fix typo in macos.md (#11425) 2025-12-29 06:39:43 -06:00
先知 43cacd9309
docs: update modelfile.md to reflect current default num_ctx (#11189)
As of commit 44b466eeb2, the default context length has been increased to 4096.
2025-12-29 06:39:43 -06:00
Jesse Gross b47aa7e75a
ggml: Use assigned layers when reporting loading stats
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, there are cases where they might differ,
such as if the GPU backend is missing.
2025-12-29 06:39:42 -06:00
Jesse Gross 015e39a8be
ggml: Disable unused pipeline parallelism
We're not currently using it, even in cases where we could. Disabling
it improves generation performance by 10-30% with multiple GPUs.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 39cec5338a
Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-12-29 06:39:42 -06:00
Jesse Gross 387cb031b3
ggml: Report ordinal IDs for AMD GPUs on Windows
We don't get valid UUIDs for AMD GPUs on Windows, so the best option
is to use the ordinal IDs. This brings us in line with what we currently
do on the Ollama server - the only exception is AMD GPUs on Linux, which
fall back to using ordinal IDs. The GGML implementation has no fallback,
but this case doesn't appear to occur for any of the GPUs that we support.

It's also possible that there are collisions between ordinal IDs for
different libraries - however the only places where we use them are
AMD on Windows and Metal on Mac, which can never occur on the same
system.
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 50e4df359b
doc: add MacOS docs (#11334)
also removes stale model dir instructions for windows
2025-12-29 06:39:42 -06:00
Daniel Hiltgen 4fcc030739
Reduce default parallelism to 1 (#11330)
The current scheduler algorithm of picking the parallelism based on available
VRAM complicates the upcoming dynamic layer memory allocation algorithm. This
changes the default to 1, with the intent going forward that parallelism is
explicit and will no longer be dynamically determined. Removal of the dynamic
logic will come in a follow-up.
2025-12-29 06:39:41 -06:00
Daniel Hiltgen 1c94c9919b
API/CLI context enhancements (#11331)
* API: expose context size of loaded models

* CLI: add context UX

This adds a column in the ps output to show the model's context size.
2025-12-29 06:39:41 -06:00