Commit Graph

4325 Commits

Author SHA1 Message Date
Ashok Gelal
c833610871 Hide empty terminal window (#8668)
This hides the LlamaServer blank window when chatting outside of the terminal (for example, with an app like Msty). It has no other side effects when invoking it the regular way.
2025-12-29 06:37:51 -06:00
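A minimal sketch of the usual way a console window is suppressed when spawning a subprocess from Go on Windows; the flag and struct fields are from the standard library, but the helper and its name are illustrative, not necessarily the change in #8668.

```go
//go:build windows

package winutil

import (
	"os/exec"
	"syscall"
)

// startHidden launches a child process without a visible console window.
func startHidden(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		HideWindow:    true,       // hide any window the child would show
		CreationFlags: 0x08000000, // CREATE_NO_WINDOW: no console at all
	}
	return cmd, cmd.Start()
}
```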
Jeffrey Morgan
7e9f243a0d server: fix panic when runner.Options is nil (#10566) 2025-12-29 06:37:51 -06:00
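A nil-pointer panic like this is typically fixed with a guard before the dereference; a sketch with hypothetical types (the real server structs differ):

```go
package example

// Hypothetical shapes, for illustration only.
type Options struct{ NumCtx int }
type Runner struct{ Options *Options }

// numCtx falls back to a default instead of panicking when the
// runner's Options was never populated.
func numCtx(r *Runner, fallback int) int {
	if r == nil || r.Options == nil {
		return fallback
	}
	return r.Options.NumCtx
}
```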
Jeffrey Morgan
9a44e41802 all: fix cgo compiler warnings on windows (#10563) 2025-12-29 06:37:51 -06:00
湛露先生
0bffcc8cc4 file close check and close. (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-12-29 06:37:51 -06:00
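Checking the error from Close matters most for writable files, where buffered data may not reach disk until close; a common Go pattern (an illustration of the idea, not the exact diff):

```go
package example

import "os"

func writeFile(path string, data []byte) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		// Close can surface a real write failure; don't silently drop it.
		if cerr := f.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	_, err = f.Write(data)
	return err
}
```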
Daniel Hiltgen
d0904ea7f1 win: ensure ollama paths come first (#10549)
For all search-path env vars, make sure our dirs come first
to avoid potentially finding other incompatible libraries
on the user's system.

Also fixes a minor build script glitch for windows rocm
2025-12-29 06:37:50 -06:00
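The pattern described is prepending our directories to each search-path variable rather than appending; a sketch using PATH (the helper and its name are illustrative):

```go
package example

import (
	"os"
	"path/filepath"
)

// prependToPath puts dir ahead of everything else already on PATH so
// bundled libraries are found before incompatible system copies.
func prependToPath(dir string) error {
	dir = filepath.Clean(dir)
	return os.Setenv("PATH", dir+string(os.PathListSeparator)+os.Getenv("PATH"))
}
```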
Daniel Hiltgen
cf9f00182d sched: logging improvements (#10550)
This enhances our logging in the scheduler. The initial "waiting for server" log
no longer claims an initial error state (now "not responding", which better reflects
the actual state). Runners now have slog wiring to report more details about the
runner, including the PID.
2025-12-29 06:37:50 -06:00
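A sketch of the kind of slog wiring the commit describes: attach runner attributes once so every subsequent log line carries them (the attribute names here are assumptions, not the actual fields):

```go
package example

import (
	"log/slog"
	"os/exec"
)

// runnerLogger must be built after cmd.Start(), once the PID exists.
func runnerLogger(model string, cmd *exec.Cmd) *slog.Logger {
	return slog.Default().With("model", model, "pid", cmd.Process.Pid)
}

// Usage: runnerLogger(m, cmd).Info("waiting for llama runner to start responding")
```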
aritra saha
541a8575f0 readme: add llama 4 models (#10530) 2025-12-29 06:37:50 -06:00
Jesse Gross
86eea6770e ggml: Fix race that resulted in "context canceled" when loading
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the successful result.

To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and ensuring that there are
no outstanding I/O operations when we return in this case.

The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for the other, smaller files we read,
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
2025-12-29 06:37:50 -06:00
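For background, errgroup.WithContext cancels its context when Wait returns, even after success, which is what set up the race. A sketch of the replacement pattern, under stated assumptions (chunk size from the message above; the reader setup is hypothetical): poll the context between reads instead of watching it from a separate goroutine.

```go
package example

import (
	"context"
	"io"

	"golang.org/x/sync/errgroup"
)

func loadTensors(ctx context.Context, readers []io.Reader) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, r := range readers {
		r := r // capture for the goroutine (pre-Go 1.22 safety)
		g.Go(func() error {
			buf := make([]byte, 128*1024) // tensors are read in small chunks
			for {
				// Checking between chunks replaces the watcher goroutine,
				// which could otherwise observe the cancelation errgroup
				// performs after a successful Wait and report it as an error.
				if err := ctx.Err(); err != nil {
					return err
				}
				if _, err := r.Read(buf); err != nil {
					if err == io.EOF {
						return nil
					}
					return err
				}
			}
		})
	}
	return g.Wait()
}
```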
Jesse Gross
cec8a9dee0 ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-12-29 06:37:50 -06:00
Harsh Nevse
cc21d627df readme: update link to langchain in community integrations (#10465) 2025-12-29 06:37:49 -06:00
Jeffrey Morgan
723fec1b25 llama: update to commit e1e8e099 (#10513) 2025-12-29 06:37:49 -06:00
frob
cf79e19403 image: add vision capability for projector-based models (#10509)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-12-29 06:37:49 -06:00
Jesse Gross
2276f7f089 kvcache: Log batch size if we can't find a slot
In some cases, we can't find a cache slot when using sliding window
attention. It would be helpful in this case (and others) to know what
the batch size is.

Bug #10127
2025-12-29 06:37:49 -06:00
Jesse Gross
597f6cd3a9 ollamarunner: Fix memory leak when processing images
The context (and therefore associated input tensors) was not being
properly closed when images were being processed. We were trying to
close them, but in reality we were closing over an empty list, which
prevented anything from actually being freed.

Fixes #10434
2025-12-29 06:37:49 -06:00
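The bug described is a classic Go pitfall: a deferred call's arguments are evaluated at defer time, so deferring cleanup over a slice that is still empty frees nothing. A sketch with hypothetical names:

```go
package example

import "fmt"

type tensor struct{ id int }

func freeAll(ts []tensor) { fmt.Println("freeing", len(ts), "tensors") }

func process() {
	var inputs []tensor
	// BUG: `defer freeAll(inputs)` would snapshot the empty slice here.
	// Fix: a closure reads inputs as it is when the defer actually runs.
	defer func() { freeAll(inputs) }()

	inputs = append(inputs, tensor{1}, tensor{2}) // e.g. image tensors
}
```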
AliAhmedNada
dda786304e readme: add Jirapt project to community integrations (#10522) 2025-12-29 06:37:48 -06:00
aritra saha
33bcef045a readme: change granite3.2 to granite3.3 (#10525)
Update the model list in the readme.
2025-12-29 06:37:48 -06:00
Michael Yang
79646ad87d fix: write gguf padding (#10510)
* add gguf_test

* fix padding

Padding was being added to the offset but not to the running count.
2025-12-29 06:37:48 -06:00
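The invariant behind the fix: when tensor data is aligned, the padding that advances the offset must also be added to the running byte count, or the two drift apart. A sketch of the arithmetic (function names hypothetical; GGUF commonly aligns tensor data to 32 bytes):

```go
package example

// padTo rounds n up to the next multiple of align.
func padTo(n, align int64) int64 {
	return (n + align - 1) / align * align
}

// place returns the aligned start of the next tensor plus the updated
// running count; both must advance by padding+size together.
func place(offset, count, size, align int64) (next, total int64) {
	start := padTo(offset, align)
	pad := start - offset
	return start + size, count + pad + size
}
```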
Devon Rifkin
55803ceb35 strip out thinking tags in message history for qwen3 & r1 (#10490)
* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support, where we'll make reasoning
configurable, parse out thinking/reasoning tags, and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality.

* parse model names instead of hacky prefix check
2025-12-29 06:37:48 -06:00
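A sketch of the stripping step, assuming the <think>...</think> tag convention these models use; note the real change also gates this on parsed model names rather than a prefix check:

```go
package example

import "regexp"

// thinkRE matches a whole <think>...</think> block, newlines included.
var thinkRE = regexp.MustCompile(`(?s)<think>.*?</think>\s*`)

// stripThinking removes reasoning blocks from prior assistant turns
// before the history is rendered back into the prompt.
func stripThinking(content string) string {
	return thinkRE.ReplaceAllString(content, "")
}
```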
Daniel Hiltgen
fee7c406aa Fix "Stopping..." scheduler hang (#10487)
* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, the scheduler could get into a deadlock
while trying to update free space information as a model
was trying to unload.
2025-12-29 06:37:48 -06:00
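A lock order inversion needs two locks taken in opposite orders on two paths; the standard fix is a single global acquisition order. A sketch with hypothetical locks standing in for the scheduler's state:

```go
package example

import "sync"

var (
	schedMu  sync.Mutex // protects scheduler bookkeeping
	runnerMu sync.Mutex // protects one loaded runner
)

// Deadlock recipe: the free-space updater holds schedMu and wants
// runnerMu while the unloader holds runnerMu and wants schedMu.
// Breaking the cycle: every path acquires schedMu before runnerMu.
func updateFreeSpace() {
	schedMu.Lock()
	defer schedMu.Unlock()
	runnerMu.Lock()
	defer runnerMu.Unlock()
	// ... read the runner's memory usage ...
}
```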
Daniel Hiltgen
098fe2f7f7 Narrow set of paths we load GGML from (#10485)
Users may have other incompatible GGML installs on their systems.
This will prevent us from trying to load them from the path.
2025-12-29 06:37:47 -06:00
Shahin R
5234d73611 readme: add link to lumina, a lightweight React frontend client (#10378) 2025-12-29 06:37:47 -06:00
batuhankadioglu
6e74d8d222 all: update several golang.org/x packages (#10436) 2025-12-29 06:37:47 -06:00
Daniel Hiltgen
4d8621629c integration: fix embedding tests error handling (#10478)
The cleanup routine from InitServerconnection should run in the defer of the test case to properly detect failures and report the server logs.
2025-12-29 06:37:47 -06:00
Jesse Gross
13d497db4c ollamarunner: Temporarily disable worst case graph preallocation
When we later have a large batch running purely on the CPU, this
results in the error:
GGML_ASSERT(talloc->buffer_id >= 0)

Disabling this means that we will incrementally reallocate memory
as the graph grows.

Fixes #10410
2025-12-29 06:37:46 -06:00
crStiv
02a3285b60 readme: fix typos (#10399) 2025-12-29 06:37:46 -06:00
Devon Rifkin
528bd3077a lower default num parallel to 2
this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
2025-12-29 06:37:46 -06:00
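For concreteness, using the defaults these commits reference: the old defaults allowed 4 slots × 2k tokens = 8k tokens of context, and the new ones 2 × 4k = 8k, so the steady-state memory budgets match; but when the server falls back to a single slot, that slot is now 4k tokens rather than 2k, which is the non-neutral part.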
Devon Rifkin
b963dd868b config: update default context length to 4096 2025-12-29 06:37:46 -06:00
Devon Rifkin
5a7c6c363e Revert "increase default context length to 4096 (#10364)"
This reverts commit 424f648632.
2025-12-29 06:37:46 -06:00
Michael Yang
b236fcc9bf model: fix build (#10416) 2025-12-29 06:37:45 -06:00
Michael Yang
049aa30191 memory 2025-12-29 06:37:45 -06:00
Michael Yang
644d6c5256 fixes for maverick 2025-12-29 06:37:45 -06:00
Michael Yang
d2d5c5e6d5 chunked attention 2025-12-29 06:37:45 -06:00
Michael Yang
b7f628b9e8 connect vision to text 2025-12-29 06:37:45 -06:00
Michael Yang
b875952e67 image processing
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-29 06:37:44 -06:00
Michael Yang
0f5c45e19d llama4 2025-12-29 06:37:44 -06:00
Michael Yang
371560df26 fix test 2025-12-29 06:37:44 -06:00
Michael Yang
a0d77f1dbe explicitly decode maxarraysize 1024 2025-12-29 06:37:44 -06:00
Michael Yang
8a86190fd4 fix parameter count 2025-12-29 06:37:44 -06:00
Michael Yang
49f807737a default slice values 2025-12-29 06:37:44 -06:00
Michael Yang
51e64c8f69 update comment 2025-12-29 06:37:43 -06:00
Michael Yang
84a6567dee fix token type 2025-12-29 06:37:43 -06:00
Michael Yang
5a8e641272 zero means zero
Using a default of 1024 when asking for zero is confusing, since most
calls seem to assume 0 means do not read any data.
2025-12-29 06:37:43 -06:00
Michael Yang
f0c5b48f7b convert: use -1 for read all 2025-12-29 06:37:43 -06:00
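A sketch of the count convention these two commits converge on: 0 reads nothing and -1 reads everything (the decoder shown is illustrative, not the actual convert code):

```go
package example

import "io"

const readAll = -1

func readN(r io.Reader, n int) ([]byte, error) {
	switch {
	case n == readAll:
		return io.ReadAll(r) // -1: read to EOF
	case n == 0:
		return nil, nil // zero means zero: read nothing
	default:
		buf := make([]byte, n)
		_, err := io.ReadFull(r, buf)
		return buf, err
	}
}
```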
Michael Yang
96618f6344 generic ggml.array 2025-12-29 06:37:42 -06:00
Michael Yang
5e0d7e9332 fix superfluous call to WriteHeader
The first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. Once WriteHeader
has been called, subsequent calls have no effect. Write is called when
JSON-encoding progressUpdateJSON{}, so calls to
http.ResponseWriter.WriteHeader after the first encode are useless and
produce a warning:

http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
2025-12-29 06:37:42 -06:00
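A minimal reproduction of the behavior the message describes, using net/http as documented (the struct field is an assumption):

```go
package example

import (
	"encoding/json"
	"net/http"
)

type progressUpdateJSON struct {
	Status string `json:"status"`
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Encode writes to w; the first Write implicitly calls
	// WriteHeader(http.StatusOK) if it hasn't been called yet.
	_ = json.NewEncoder(w).Encode(progressUpdateJSON{Status: "pulling"})

	// Any WriteHeader from here on is a no-op and logs:
	//   http: superfluous response.WriteHeader call ...
	// A non-200 status must be set before the first Write.
}
```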
Michael Yang
584c3176d2 convert: change to colmajor 2025-12-29 06:37:42 -06:00
Michael Yang
4f01385151 ci: silence deprecated gpu targets warning 2025-12-29 06:37:42 -06:00
Jeffrey Morgan
85d3f71c02 llama: update to commit 2016f07b (#10352) 2025-12-29 06:37:42 -06:00
Parth Sareen
83e848fcb8 server: improve spacing for JSON grammar (#10131) 2025-12-29 06:37:41 -06:00
Parth Sareen
7cf4c146bc llama: remove model loading for grammar (#10096) 2025-12-29 06:37:41 -06:00