Commit Graph

4233 Commits

Author SHA1 Message Date
Devon Rifkin 4f231cd13e
server: send 405 instead of 404 for unallowed methods (#10275)
Fixes: #5483
2025-12-29 06:37:53 -06:00
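A minimal net/http sketch of the semantics involved (not the project's actual routing code): a route that exists but does not support the request method should answer 405 Method Not Allowed, with an Allow header, rather than 404.

```go
package main

import (
	"log"
	"net/http"
)

// generateHandler answers wrong-method requests with 405 and an Allow
// header; 404 would wrongly suggest the route itself does not exist.
func generateHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.Header().Set("Allow", http.MethodPost)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/api/generate", generateHandler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```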
Michael Yang 2bc6ee16e0
server: remove internal cmd (#10595) 2025-12-29 06:37:53 -06:00
Daniel Hiltgen 39ca55a1ba
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-12-29 06:37:52 -06:00
Michael Yang 2f1eb0fcce
discover: fix compiler warnings (#10572) 2025-12-29 06:37:52 -06:00
Jeffrey Morgan 13c66584a5
api: remove unused or unsupported api options (#10574)
Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first in a series of PRs to clean up the API options.
2025-12-29 06:37:52 -06:00
Michael Yang 71167fb878
create blobs in parallel (#10135)
* default max term height
* error on out-of-tree files
2025-12-29 06:37:52 -06:00
Jesse Gross 48b6465aff
ggml: Reduce log level of "key not found"
Most of the time this is not an error.
2025-12-29 06:37:52 -06:00
Daniel Hiltgen efcc69e96f
win: lint fix (#10571) 2025-12-29 06:37:51 -06:00
Ashok Gelal c833610871
Hide empty terminal window (#8668)
This hides the blank LlamaServer terminal window when chatting outside of the terminal (for example, with an app like Msty). It has no other side effects when Ollama is invoked the regular way.
2025-12-29 06:37:51 -06:00
Jeffrey Morgan 7e9f243a0d
server: fix panic when runner.Options is nil (#10566) 2025-12-29 06:37:51 -06:00
Jeffrey Morgan 9a44e41802
all: fix cgo compiler warnings on windows (#10563) 2025-12-29 06:37:51 -06:00
湛露先生 0bffcc8cc4
check file close errors and ensure files are closed. (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-12-29 06:37:51 -06:00
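The pattern behind this change, sketched with a hypothetical helper: for writers, Close can report the final flush error, so its return value should be checked rather than discarded.

```go
package main

import (
	"log"
	"os"
)

// writeFile surfaces errors from Close: for writers, Close can report
// the final flush failure, so discarding its result can hide data loss.
func writeFile(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close() // best effort; the write error takes precedence
		return err
	}
	return f.Close()
}

func main() {
	if err := writeFile("out.txt", []byte("hello\n")); err != nil {
		log.Fatal(err)
	}
}
```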
Daniel Hiltgen d0904ea7f1
win: ensure ollama paths come first (#10549)
For all search-path env vars, make sure our dirs come first
to avoid potentially finding other, incompatible libraries
on the user's system.

Also fixes a minor build script glitch for Windows ROCm
2025-12-29 06:37:50 -06:00
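A hypothetical sketch of the principle (the helper and install path are illustrative, not the actual launch code): prepend our directories to search-path variables so they take precedence over anything else on the system.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// prependPath puts our directories ahead of everything already on a
// search-path variable so they are found first.
func prependPath(envKey string, dirs ...string) {
	entries := append(dirs, filepath.SplitList(os.Getenv(envKey))...)
	os.Setenv(envKey, strings.Join(entries, string(os.PathListSeparator)))
}

func main() {
	prependPath("PATH", `C:\Program Files\Ollama\lib`) // hypothetical install dir
	fmt.Println(os.Getenv("PATH"))
}
```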
Daniel Hiltgen cf9f00182d
sched: logging improvements (#10550)
This enhances logging in the scheduler. The initial "waiting for server" log
no longer claims an error state; it now reports "not responding", which
better reflects the actual state. Runners now have slog wiring to report
more details about the runner, including the PID.
2025-12-29 06:37:50 -06:00
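A rough sketch of the slog wiring described, with hypothetical field names and a placeholder subprocess standing in for a runner:

```go
package main

import (
	"log/slog"
	"os/exec"
)

func main() {
	cmd := exec.Command("sleep", "5") // stands in for a runner subprocess
	if err := cmd.Start(); err != nil {
		slog.Error("failed to start runner", "error", err)
		return
	}
	// Derive a logger that carries runner details on every line.
	logger := slog.Default().With("pid", cmd.Process.Pid)
	logger.Info("runner not responding yet") // neutral, not an error claim
	_ = cmd.Wait()
}
```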
aritra saha 541a8575f0
readme: add llama 4 models (#10530) 2025-12-29 06:37:50 -06:00
Jesse Gross 86eea6770e
ggml: Fix race that resulted in "context canceled" when loading
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that checks for
cancelation of the context. As a result, there is a race where the
goroutine can pick up the cancelation and report an error, replacing
the successful result.

To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and also ensuring that there are
no outstanding I/O operations when we return in this case.

The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
2025-12-29 06:37:50 -06:00
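A minimal sketch of the race class described, assuming golang.org/x/sync/errgroup: Wait cancels the derived context even on success, so a watcher goroutine can observe the cancelation and report a spurious error; checking ctx.Err() inline between chunked reads avoids that.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"golang.org/x/sync/errgroup"
)

// readChunks reads a file in small chunks, checking for cancelation
// between reads instead of relying on a watcher goroutine.
func readChunks(ctx context.Context, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, 128*1024) // tensors are read in 128K chunks
	for {
		if err := ctx.Err(); err != nil {
			return err // canceled: stop all reads promptly
		}
		if _, err := f.Read(buf); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
	}
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())
	g.Go(func() error { return readChunks(ctx, os.Args[0]) })
	fmt.Println(g.Wait()) // Wait cancels ctx even on success
}
```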
Jesse Gross cec8a9dee0
ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-12-29 06:37:50 -06:00
Harsh Nevse cc21d627df
readme: update link to langchain in community integrations (#10465) 2025-12-29 06:37:49 -06:00
Jeffrey Morgan 723fec1b25
llama: update to commit e1e8e099 (#10513) 2025-12-29 06:37:49 -06:00
frob cf79e19403
image: add vision capability for projector-based models (#10509)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:37:49 -06:00
Jesse Gross 2276f7f089
kvcache: Log batch size if we can't find a slot
In some cases, we can't find a cache slot when using sliding window
attention. It would be helpful in this case (and others) to know what
the batch size is.

Bug #10127
2025-12-29 06:37:49 -06:00
Jesse Gross 597f6cd3a9
ollamarunner: Fix memory leak when processing images
The context (and therefore the associated input tensors) was not being
properly closed when images were processed. We were trying to close
them, but the cleanup closed over an empty list, preventing anything
from actually being freed.

Fixes #10434
2025-12-29 06:37:49 -06:00
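A hypothetical illustration of the bug class (not the runner's actual code): a cleanup that captures the slice value while it is still empty frees nothing, while closing over the variable sees the final list.

```go
package main

import "fmt"

type tensor struct{ name string }

func (t *tensor) Close() { fmt.Println("freed", t.name) }

func processImages() {
	var inputs []*tensor

	// Buggy variant: passing the slice by value here snapshots it while
	// it is still empty, so the deferred loop frees nothing:
	//   defer func(ts []*tensor) { for _, t := range ts { t.Close() } }(inputs)

	// Fixed variant: close over the variable so the defer sees the
	// final contents of the slice.
	defer func() {
		for _, t := range inputs {
			t.Close()
		}
	}()

	inputs = append(inputs, &tensor{"image embedding"})
}

func main() { processImages() }
```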
AliAhmedNada dda786304e
readme: add Jirapt project to community integrations (#10522) 2025-12-29 06:37:48 -06:00
aritra saha 33bcef045a
readme: change granite3.2 to granite3.3 (#10525)
Update the model list in the readme
2025-12-29 06:37:48 -06:00
Michael Yang 79646ad87d
fix: write gguf padding (#10510)
* add gguf_test

* fix padding

padding was being added to the offset but not to the running count
2025-12-29 06:37:48 -06:00
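A hypothetical sketch of the invariant (not the actual GGUF writer): alignment padding must advance the recorded offset and the running byte count together, or later offsets drift out of sync with the data.

```go
package main

import (
	"fmt"
	"io"
)

const alignment = 32 // GGUF's default tensor-data alignment

func padding(n int64) int64 { return (alignment - n%alignment) % alignment }

// writeTensor pads to the next alignment boundary, advancing both the
// recorded offset and the running byte count together; the bug class
// fixed above is updating one but not the other.
func writeTensor(w io.Writer, written *int64, data []byte) (int64, error) {
	p := padding(*written)
	if _, err := w.Write(make([]byte, p)); err != nil {
		return 0, err
	}
	*written += p // keep the running count in step with the offset
	offset := *written
	n, err := w.Write(data)
	*written += int64(n)
	return offset, err
}

func main() {
	var written int64
	writeTensor(io.Discard, &written, make([]byte, 7))
	off, _ := writeTensor(io.Discard, &written, make([]byte, 5))
	fmt.Println("second tensor at offset", off, "total written", written) // 32 37
}
```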
Devon Rifkin 55803ceb35
strip out thinking tags in message history for qwen3 & r1 (#10490)
* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support, where we'll make reasoning
configurable and parse out thinking/reasoning tags to provide them
to the caller. These models expect no thinking tags in the message
history, so this should improve quality

* parse model names instead of hacky prefix check
2025-12-29 06:37:48 -06:00
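A simplified sketch of the stripping step, assuming the <think>...</think> markers these models emit; the server's real parsing is more involved.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkRE matches a whole <think>...</think> block, across newlines.
var thinkRE = regexp.MustCompile(`(?s)<think>.*?</think>`)

func stripThinking(content string) string {
	return strings.TrimSpace(thinkRE.ReplaceAllString(content, ""))
}

func main() {
	fmt.Println(stripThinking("<think>working it out...</think>The answer is 4."))
}
```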
Daniel Hiltgen fee7c406aa
Fix "Stopping..." scheduler hang (#10487)
* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
2025-12-29 06:37:48 -06:00
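A generic illustration of lock-order inversion (the names are hypothetical, not the scheduler's actual locks): two goroutines taking the same pair of mutexes in opposite orders can deadlock, and the fix is a single agreed order, sketched here.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	schedMu  sync.Mutex // protects scheduler state
	runnerMu sync.Mutex // protects a loaded runner
)

// Deadlock recipe: one goroutine takes schedMu then runnerMu while
// updating free-space info, another takes runnerMu then schedMu while
// unloading. Both functions below use the same order: sched -> runner.
func updateFreeSpace() {
	schedMu.Lock()
	defer schedMu.Unlock()
	runnerMu.Lock()
	defer runnerMu.Unlock()
	fmt.Println("updated free space")
}

func unloadModel() {
	schedMu.Lock() // same order as updateFreeSpace, never the reverse
	defer schedMu.Unlock()
	runnerMu.Lock()
	defer runnerMu.Unlock()
	fmt.Println("unloaded model")
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); updateFreeSpace() }()
	go func() { defer wg.Done(); unloadModel() }()
	wg.Wait()
}
```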
Daniel Hiltgen 098fe2f7f7
Narrow set of paths we load GGML from (#10485)
Users may have other incompatible GGML installs on their systems.
This will prevent us from trying to load them from the path.
2025-12-29 06:37:47 -06:00
Shahin R 5234d73611
readme: add link to lumina, a lightweight React frontend client (#10378) 2025-12-29 06:37:47 -06:00
batuhankadioglu 6e74d8d222
all: update several golang.org/x packages (#10436) 2025-12-29 06:37:47 -06:00
Daniel Hiltgen 4d8621629c
integration: fix embedding tests error handling (#10478)
The cleanup routine from InitServerConnection should run in the test case's defer to properly detect failures and report the server logs.
2025-12-29 06:37:47 -06:00
Jesse Gross 13d497db4c
ollamarunner: Temporarily disable worst case graph preallocation
When we later have a large batch running purely on a CPU, this
results in the error:
GGML_ASSERT(talloc->buffer_id >= 0)

Disabling this means that we will incrementally reallocate memory
as the graph grows.

Fixes #10410
2025-12-29 06:37:46 -06:00
crStiv 02a3285b60
readme: fix typos (#10399) 2025-12-29 06:37:46 -06:00
Devon Rifkin 528bd3077a
lower default num parallel to 2
This is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral, though: even though the old 4x2k limit and the new 2x4k limit are memory-equivalent, the 1x fallback is larger with 4k.
2025-12-29 06:37:46 -06:00
Devon Rifkin b963dd868b
config: update default context length to 4096 2025-12-29 06:37:46 -06:00
Devon Rifkin 5a7c6c363e
Revert "increase default context length to 4096 (#10364)"
This reverts commit 424f648632.
2025-12-29 06:37:46 -06:00
Michael Yang b236fcc9bf
model: fix build (#10416) 2025-12-29 06:37:45 -06:00
Michael Yang 049aa30191
memory 2025-12-29 06:37:45 -06:00
Michael Yang 644d6c5256
fixes for maverick 2025-12-29 06:37:45 -06:00
Michael Yang d2d5c5e6d5
chunked attention 2025-12-29 06:37:45 -06:00
Michael Yang b7f628b9e8
connect vision to text 2025-12-29 06:37:45 -06:00
Michael Yang b875952e67
image processing
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-29 06:37:44 -06:00
Michael Yang 0f5c45e19d
llama4 2025-12-29 06:37:44 -06:00
Michael Yang 371560df26
fix test 2025-12-29 06:37:44 -06:00
Michael Yang a0d77f1dbe
explicitly decode maxarraysize 1024 2025-12-29 06:37:44 -06:00
Michael Yang 8a86190fd4
fix parameter count 2025-12-29 06:37:44 -06:00
Michael Yang 49f807737a
default slice values 2025-12-29 06:37:44 -06:00
Michael Yang 51e64c8f69
update comment 2025-12-29 06:37:43 -06:00
Michael Yang 84a6567dee
fix token type 2025-12-29 06:37:43 -06:00
Michael Yang 5a8e641272
zero means zero
using a default of 1024 when zero is requested is confusing, since
most calls seem to assume 0 means do not read any data
2025-12-29 06:37:43 -06:00
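A hypothetical sketch of the convention adopted here: zero means zero, and the fallback default needs a distinct sentinel instead of overloading 0.

```go
package main

import "fmt"

const defaultMaxArraySize = 1024

// maxArraySize treats a negative request as "use the default" so that
// 0 can keep its natural meaning: read no array data at all.
func maxArraySize(requested int) int {
	if requested < 0 {
		return defaultMaxArraySize
	}
	return requested // zero means zero
}

func main() {
	fmt.Println(maxArraySize(-1)) // 1024
	fmt.Println(maxArraySize(0))  // 0
}
```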