Commit Graph

4469 Commits

Author SHA1 Message Date
Gabe Goodhart d724caced3 fix: Remove Gemma3n CUDA Graphs patch
It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-30 13:55:21 -04:00
Gabe Goodhart 94912ec7dd fix: Fix Solar and argsort/copy patches after bump
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-30 13:54:38 -04:00
Gabe Goodhart 8fbeb68858 feat: Bump to 41e78c in the makefile
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-30 13:54:03 -04:00
Gabe Goodhart 70d2f70dd3 fix: Re-number patches after merge with `main`
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-30 13:02:08 -04:00
Gabe Goodhart c22e9c9bbd Merge remote-tracking branch 'origin/main' into GraniteFour
* origin/main:
Revert "CI: switch back to x86 macos builder" (#11588)
mac: disable bf16 on unsupported OS versions (#11585)
CI: switch back to x86 macos builder (#11572)
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
kvcache: Don't shift empty batches
docs: fix typos and remove trailing whitespaces (#11554)

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-30 13:01:12 -04:00
Daniel Hiltgen 6dcc5dfb9c
Revert "CI: switch back to x86 macos builder" (#11588)
This reverts commit 9d071e6089.
2025-07-30 08:56:01 -07:00
Daniel Hiltgen 25911a6e6b
mac: disable bf16 on unsupported OS versions (#11585)
Support for bf16 was added in macOS 14; attempting to enable
it on older versions causes runtime failures.
2025-07-30 08:50:54 -07:00
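The gate described above can be sketched as a version check (a minimal sketch with a hypothetical `bf16Supported` helper, not ollama's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bf16Supported reports whether bf16 Metal kernels should be enabled
// for a macOS product version string (e.g. from `sw_vers -productVersion`).
// bf16 support arrived in macOS 14, so older versions must fall back
// to avoid runtime failures.
func bf16Supported(productVersion string) bool {
	major, err := strconv.Atoi(strings.SplitN(productVersion, ".", 2)[0])
	if err != nil {
		return false // unparseable version: be conservative
	}
	return major >= 14
}

func main() {
	for _, v := range []string{"13.6.7", "14.5", "15.0"} {
		fmt.Printf("%s -> %v\n", v, bf16Supported(v))
	}
}
```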
Daniel Hiltgen 8afa6e83f2
CI: switch back to x86 macos builder (#11572) 2025-07-29 16:41:25 -07:00
Oliver Simons ea85e27bbd
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp, which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
2025-07-29 12:37:06 -07:00
Jesse Gross c116a7523d kvcache: Don't shift empty batches
When we context shift, we delete half the context and apply RoPE
with an offset to the other half. We used to RoPE across the entire
context in a single pass with a zero offset for the deleted
section. With the change to shifting in batches, we can skip any
batches where all of the offsets would be zero. This typically
reduces the number of operations by half.
2025-07-29 12:32:22 -07:00
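The skip logic described above can be sketched as follows (hypothetical names, assuming a per-position offset slice; not ollama's actual kvcache API):

```go
package main

import "fmt"

// shiftBatches walks the context-shift offsets in batch-size chunks
// and skips any batch whose offsets are all zero (the deleted half of
// the context). It returns how many batches actually need a RoPE call.
func shiftBatches(offsets []int, batchSize int) (applied int) {
	for start := 0; start < len(offsets); start += batchSize {
		end := min(start+batchSize, len(offsets))
		allZero := true
		for _, off := range offsets[start:end] {
			if off != 0 {
				allZero = false
				break
			}
		}
		if allZero {
			continue // nothing to rotate in this batch
		}
		applied++ // stand-in for the per-batch RoPE operation
	}
	return applied
}

func main() {
	// First half of the context deleted (offset 0), second half
	// shifted back by 512 positions.
	offsets := make([]int, 1024)
	for i := 512; i < 1024; i++ {
		offsets[i] = -512
	}
	fmt.Println(shiftBatches(offsets, 256)) // only 2 of 4 batches need RoPE
}
```

As the commit notes, for the typical "delete half, shift half" case this halves the number of operations.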
Yoshi 3515cc377c
docs: fix typos and remove trailing whitespaces (#11554) 2025-07-28 11:19:13 -07:00
Gabe Goodhart 74d1f478e3 fix: Handle multi-chunk image encodings from mtmd
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-28 11:44:35 -04:00
Gabe Goodhart 444c2bf248 Merge remote-tracking branch 'origin/main' into GraniteFour
* origin/main:
readme: add Mayan EDMS to community integrations (#11543)
kvcache: Group shift operations into batches
CONTRIBUTING: fix typo in commit message example (#11528)
2025-07-28 10:33:49 -04:00
Mayan EDMS bbf66c0b96
readme: add Mayan EDMS to community integrations (#11543) 2025-07-27 15:02:52 -07:00
Jesse Gross 764be7480f kvcache: Group shift operations into batches
Currently, when we need to do a shift on the cache, it is one
RoPE operation on the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass since the forward pass is working in batches.
Since we don't consider shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.

By limiting the size of the RoPE calls to batch size chunks, we
ensure that the shift will never exceed the size of the forward
pass, since the forward pass will also contain a RoPE of the same
size. This does not have a significant impact on performance since
RoPE is a math operation that is mostly proportional to the size
of its inputs.

In theory, defrag could have the same issue since it also creates a
compute graph outside of the forward pass; however, since it only
performs copies, it does not require any working space.
2025-07-25 16:50:27 -07:00
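The chunking described above can be sketched as follows (a hypothetical `ropeChunks` helper, not ollama's real API): instead of one RoPE op over the whole cache, the shift is split into chunks no larger than the forward-pass batch size, so the shift graph never needs more working memory than a forward pass.

```go
package main

import "fmt"

// ropeChunks returns the sizes of the RoPE calls needed to shift a
// cache of cacheLen positions, capping each call at batchSize so the
// shift graph never exceeds the forward pass.
func ropeChunks(cacheLen, batchSize int) []int {
	var chunks []int
	for remaining := cacheLen; remaining > 0; remaining -= batchSize {
		chunks = append(chunks, min(batchSize, remaining))
	}
	return chunks
}

func main() {
	fmt.Println(ropeChunks(4096, 512)) // eight 512-position RoPE calls
	fmt.Println(ropeChunks(1000, 512)) // [512 488]
}
```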
Ruyut b72e5adb14
CONTRIBUTING: fix typo in commit message example (#11528) 2025-07-25 14:24:06 -07:00
Gabe Goodhart 11a0d7376c Merge remote-tracking branch 'origin/main' into GraniteFour
* origin/main:
cli: catch upstream errors gracefully (#11512)
tools: loosen tool argument parsing (#11509)
server: use slices.Equal to simplify code (#11502)
s#x/exp/maps#maps# (#11506)
Fix GetModelInfo (#11496)
Update linux.md (#11462)
2025-07-25 09:50:47 -06:00
Patrick Devine 80b538e312
cli: catch upstream errors gracefully (#11512) 2025-07-23 22:16:55 -07:00
Jeffrey Morgan 4f8a0166cc
tools: loosen tool argument parsing (#11509) 2025-07-23 21:21:29 -07:00
minxinyi 1e6eab5c33
server: use slices.Equal to simplify code (#11502) 2025-07-23 14:25:39 -07:00
Michael Yang 6c733bf0a6
s#x/exp/maps#maps# (#11506) 2025-07-23 13:23:32 -07:00
Patrick Devine 3bac5cba60
Fix GetModelInfo (#11496)
---------

Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-07-22 13:40:47 -07:00
ycomiti 4151ef8cf7
Update linux.md (#11462) 2025-07-22 11:17:31 -07:00
Gabe Goodhart 895d5563df Merge remote-tracking branch 'origin/main' into GraniteFour
* origin/main:
readme: add GMAI - Gradle Managed to community integrations (#11461)
tools: fix parsing issue when a tool name is a substring of another (#11456)
readme: update argo description to support deep research (#11455)
ci: switch mac builder to arm64 (#11379)
docs: add the no-Modelfile function of `ollama create` (#9077)
openai: allow openai endpoint to accept webp images (#11412)
readme: update the llama.cpp github link (#11427)
compile bf16 support into ggml-metal (#11430)
cmd: add default assistant role to message construction (#11431)
api: fix unreachable status err (#11423)
docs: fix typo in macos.md (#11425)
2025-07-21 15:04:52 -06:00
Stefan Wärting 82da19c634
readme: add GMAI - Gradle Managed to community integrations (#11461) 2025-07-20 14:55:47 -07:00
Jeffrey Morgan bdd9d22dfd
tools: fix parsing issue when a tool name is a substring of another (#11456)
Co-authored-by: frob <rick+github@frob.com.au>
2025-07-20 14:55:14 -07:00
zmldndx 5fc38d042f
readme: update argo description to support deep research (#11455) 2025-07-19 13:29:38 -07:00
Daniel Hiltgen 191d94289d
ci: switch mac builder to arm64 (#11379)
The macos-13 runner is x86, while macos-13-xlarge is arm64
2025-07-17 07:33:44 -07:00
frob 802ad16ce4
docs: add the no-Modelfile function of `ollama create` (#9077) 2025-07-16 22:16:10 -07:00
frob 5e67f4f90e
openai: allow openai endpoint to accept webp images (#11412)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-07-16 21:31:49 -07:00
Haiyue Wang e840ccb523
readme: update the llama.cpp github link (#11427) 2025-07-16 21:20:28 -07:00
Michael Yang b4fe3adc0a
compile bf16 support into ggml-metal (#11430) 2025-07-16 17:32:57 -07:00
Parth Sareen d73f8aa8c3
cmd: add default assistant role to message construction (#11431) 2025-07-16 11:18:16 -07:00
Bruce MacDonald 92c2e8a56c
api: fix unreachable status err (#11423)
StatusError was unreachable: the client always checks for error messages in the response body first, and the server always includes error messages with HTTP error status codes.
2025-07-16 11:03:28 -07:00
Marcelo Fornet 2e3fd86d48
docs: fix typo in macos.md (#11425) 2025-07-16 10:50:46 -07:00
Gabe Goodhart e6a22f20d1 Merge remote-tracking branch 'origin/main' into GraniteFour
* origin/main:
docs: update modelfile.md to reflect current default num_ctx (#11189)
ggml: Use assigned layers when reporting loading stats
ggml: Disable unused pipeline parallelism
Only load supported models on new engine (#11362)
2025-07-15 14:50:19 -06:00
Gabe Goodhart 5305e2ad14 feat: Sync llama.cpp
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-15 14:50:01 -06:00
Gabe Goodhart 4f462a9f67 feat: Bump llama.cpp to 4a4f42
This picks up support for Kimi K2 and PLaMO-2

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-15 14:49:15 -06:00
先知 4261a3b0b2
docs: update modelfile.md to reflect current default num_ctx (#11189)
As of commit 44b466eeb2, the default context length was increased to 4096.
2025-07-11 15:15:00 -07:00
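A model that needs a larger window can override the 4096-token default in its Modelfile (a minimal sketch; the base model name is illustrative):

```
# Override the default 4096-token context window
FROM llama3
PARAMETER num_ctx 8192
```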
Gabe Goodhart 91e4b10d40 fix: Sync patch changes for ggml-cpu.c
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 16:01:15 -06:00
Gabe Goodhart 0beea04b52 fix: Add a patch to avoid power throttling API on non-msvc windows builds
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 16:00:49 -06:00
Jesse Gross acef9b4c1b ggml: Use assigned layers when reporting loading stats
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, there are cases where they might not match,
such as if the GPU backend is missing.
2025-07-11 14:21:50 -07:00
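The requested-versus-assigned distinction above can be sketched like this (hypothetical `assignLayers` helper, not ollama's real loader):

```go
package main

import "fmt"

// assignLayers returns how many layers are actually placed on the
// GPU: the caller may request any number, but the assignment is
// capped by the model's layer count and drops to zero when no GPU
// backend is available. Loading stats should report this value,
// not the request.
func assignLayers(requested, totalLayers int, gpuAvailable bool) int {
	if !gpuAvailable {
		return 0 // no GPU backend: every layer stays on the CPU
	}
	return min(requested, totalLayers)
}

func main() {
	requested := 33
	assigned := assignLayers(requested, 32, false)
	// Reporting `requested` here would misleadingly print 33.
	fmt.Printf("offloaded %d/32 layers to GPU\n", assigned)
}
```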
Jesse Gross 9a43994c45 ggml: Disable unused pipeline parallelism
We're not currently using it, even in cases where we could. Disabling
it improves generation performance by 10-30% with multiple GPUs.
2025-07-11 13:30:05 -07:00
Gabe Goodhart e8a303a701 build: Add top-level include for GNUInstallDirs in CMakeLists.txt
This is used to populate CMAKE_INSTALL_BINDIR

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 13:44:10 -06:00
Gabe Goodhart 81d821ba9b build: Include cmake/common.cmake in ggml sync
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 13:25:01 -06:00
Daniel Hiltgen f8a6e88819
Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-07-11 12:21:54 -07:00
Gabe Goodhart bf1b261611 feat: Sync all patched code
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 11:44:18 -06:00
Gabe Goodhart 3020c462da fix: Add patch for GGML_VERSION and GGML_COMMIT constants
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 11:43:14 -06:00
Gabe Goodhart d7f98e0673 fix: Revert changes to ggml export GPU UUID patch
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 11:42:26 -06:00
Gabe Goodhart 111434ab39 feat: Bump back to the central repo and point at the latest master
This includes granite 4 and a number of other model architectures!

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-07-11 10:43:22 -06:00