Michael Yang
9e4642e9b3
ollama debug tensor
2025-03-11 14:49:19 -07:00
Michael Yang
6b0486c216
duplicate token_embd to output
2025-03-11 14:49:19 -07:00
Michael Yang
d368c039f0
skip repacking vision tensors
2025-03-11 14:49:19 -07:00
Patrick Devine
9b54267e69
fix configs
2025-03-11 14:49:19 -07:00
Michael Yang
46bb0169c4
update model
2025-03-11 14:49:19 -07:00
Michael Yang
8934324b72
use fast attention
2025-03-11 14:49:18 -07:00
Jesse Gross
0e886595bf
Fix tests and drift from main
2025-03-11 14:49:18 -07:00
Patrick Devine
c62861f4fa
fix conversion
2025-03-11 14:49:18 -07:00
Michael Yang
0df1800436
set non-causal attention
2025-03-11 14:49:18 -07:00
Patrick Devine
631fecc6d9
temporary workaround for converting spm
2025-03-11 14:49:18 -07:00
Jesse Gross
4346c2409d
fix drift from main
2025-03-11 14:49:18 -07:00
Michael Yang
4b037a97dc
add gemma vision encoder
2025-03-11 14:49:17 -07:00
Patrick Devine
5f74d1fd47
gemma2 impl
2025-03-11 14:35:08 -07:00
Daniel Hiltgen
4dcf80167a
Build release for windows with local script ( #9636 )
2025-03-11 08:34:20 -07:00
Vadim Grinco
9cb4ad02e2
This is no longer needed
...
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-11 14:34:17 +01:00
Vadim Grinco
6b1f84e171
Merging the latest stable ( #2 )
...
* Applied 00-fix-vulkan-building.patch
* Implemented vulkan backend based on the work done by whyvl, Dts0, McBane87 and others
Tested on AMD Ryzen 7 8845HS w/ Radeon 780M Graphics with ROCm disabled
```
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-03-11T13:00:40.793Z level=INFO source=gpu.go:199 msg="vulkan: load libvulkan and libcap ok"
time=2025-03-11T13:00:40.877Z level=INFO source=gpu.go:421 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:443 msg="amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install "
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:348 msg="unable to verify rocm library: no suitable rocm found, falling back to CPU"
time=2025-03-11T13:00:40.879Z level=INFO source=types.go:137 msg="inference compute" id=0 library=vulkan variant="" compute=1.3 driver=1.3 name="AMD Radeon Graphics (RADV GFX1103_R1)" total="15.6 GiB" available="15.6 GiB"
```
```
# ollama run phi4:14b
>>> /set verbose
Set 'verbose' mode.
>>> how's it going?
Hello! I'm here to help you with any questions or tasks you have. How can I assist you today? 😊
total duration: 3.341959745s
load duration: 18.165612ms
prompt eval count: 15 token(s)
prompt eval duration: 475ms
prompt eval rate: 31.58 tokens/s
eval count: 26 token(s)
eval duration: 2.846s
eval rate: 9.14 tokens/s
>>>
```
2025-03-11 14:09:47 +01:00
Michael Yang
26a26998fb
Merge pull request #9590 from ollama/mxyng/dump-pad
...
fix: pad tensor item if ge zero
2025-03-10 16:34:55 -07:00
Michael Yang
9926eae015
fix: pad tensor item if ge zero
...
this produces nicer output since both positive and negative values
produce the same width
2025-03-10 16:18:12 -07:00
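A minimal Go sketch of the padding idea described in the commit above: printing non-negative values with a leading space so they occupy the same width as negative ones. The helper name is illustrative, not the actual ollama dump code.
```go
package main

import "fmt"

// padItem formats a tensor element so non-negative and negative values line up:
// the "% " flag reserves a space where the minus sign would otherwise go.
func padItem(v float32) string {
	return fmt.Sprintf("% .4f", v)
}

func main() {
	fmt.Println(padItem(1.2345))  // " 1.2345"
	fmt.Println(padItem(-1.2345)) // "-1.2345"
}
```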
Vincent Koc
8585b7b151
docs: add opik to observability integrations ( #9626 )
2025-03-10 16:15:10 -07:00
Parth Sareen
7e34f4fbfa
sample: add numerical stability to temperature/softmax transform ( #9631 )
2025-03-10 14:43:53 -07:00
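The subject above refers to a standard stabilization trick; a hedged Go sketch of temperature scaling followed by a max-subtracted softmax (illustrative only, not the sampler code from #9631):
```go
package main

import (
	"fmt"
	"math"
)

// softmaxWithTemperature divides logits by the temperature (temp must be > 0)
// and subtracts the maximum before exponentiating, so exp never overflows.
func softmaxWithTemperature(logits []float64, temp float64) []float64 {
	maxLogit := math.Inf(-1)
	for _, l := range logits {
		if l/temp > maxLogit {
			maxLogit = l / temp
		}
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		probs[i] = math.Exp(l/temp - maxLogit)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	fmt.Println(softmaxWithTemperature([]float64{1000, 999, 998}, 0.7))
}
```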
Michael Yang
fe776293f7
Merge pull request #9569 from dwt/patch-1
...
Better WantedBy declaration
2025-03-10 14:09:37 -07:00
frob
d8a5d96b98
docs: Add OLLAMA_CONTEXT_LENGTH to FAQ. ( #9545 )
2025-03-10 11:02:54 -07:00
Xiaowei Zhu
757668c42f
docs: add SwiftChat ( #9540 )
2025-03-10 11:01:09 -07:00
Sam
96ec8afd09
docs(tool): add mcp-llm ( #9537 )
2025-03-10 09:52:02 -07:00
Jeffrey Morgan
e093db92c4
sample: temporarily use grammars for constrained generation in new engine ( #9586 )
2025-03-10 16:17:39 +01:00
Vadim Grinco
31606b2feb
Merged in the right direction
...
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-10 12:51:49 +01:00
Vadim Grinco
b14dd68fee
Fixed the "detached head" issues
...
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-10 12:51:49 +01:00
Vadim Grinco
cff62cc6c2
Merge branch 'ollama_vulkan_stable' into grinco-vulkan
2025-03-10 12:39:00 +01:00
Vadim Grinco
98f699773a
Applied 00-fix-vulkan-building.patch
...
Work done by McBane87 here: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-10 12:34:37 +01:00
Vadim Grinco
e648126fe9
Merge branch 'ollama_vanilla_stable' into ollama_vulkan_stable
2025-03-10 12:29:52 +01:00
Jesse Gross
a1cda80bcb
model: Update encoder cache to use multimodal input processing handler
...
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.
Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.
Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
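A hypothetical sketch of the idea in the commit above: once each multimodal object carries an explicit position in the input stream, the encoder cache can evict it when processing has moved past it, without breaking batches. All names here are illustrative, not the actual package layout.
```go
package main

import "fmt"

// entry pairs a cached multimodal object with the explicit position it
// occupies in the input stream.
type entry struct {
	object   string // stands in for an image embedding
	position int32  // explicit position in the input stream
}

// evict drops cached objects whose position is older than the cutoff, which is
// what an explicit position makes possible without inferring it from batch breaks.
func evict(entries []entry, cutoff int32) []entry {
	kept := entries[:0]
	for _, e := range entries {
		if e.position >= cutoff {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	cache := []entry{{"image-a", 3}, {"image-b", 42}}
	fmt.Println(evict(cache, 10)) // keeps only image-b
}
```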
Vadim Grinco
42bac5cadd
This version works well
...
built based on this: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-09 23:23:26 +01:00
Vadim Grinco
81465ca374
Installing rocm library
...
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-09 20:42:32 +01:00
Jesse Gross
4614fafae0
ollamarunner: Don't panic for unimplemented features at runtime.
...
It's ok to fail on startup but we shouldn't panic during runtime
based on user input. Downgrade the panic to a warning.
2025-03-08 18:58:18 -08:00
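A small sketch of the downgrade described above, assuming the runner logs through log/slog (the function and option names are illustrative):
```go
package main

import "log/slog"

// applyUnimplementedOption previously would panic; per the commit above the
// runner instead logs a warning and ignores the unsupported option at runtime.
func applyUnimplementedOption(name string) {
	slog.Warn("option is not yet implemented, ignoring", "option", name)
}

func main() {
	applyUnimplementedOption("mirostat")
}
```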
Vadim Grinco
189cbb40a6
Updated dockerfile
...
https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871
Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-08 19:40:53 +01:00
Vadim Grinco
747898df04
Merge pull request #1 from ollama/main
...
Merged from ollama/main
2025-03-08 08:56:12 +01:00
Jesse Gross
4100ed7bdd
ml: Add support for quantized KV cache
...
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
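A hedged sketch of the constraint stated above: a quantized KV cache type is only honored when flash attention is enabled, otherwise falling back to the unquantized default. Names are illustrative, not the server's real configuration API.
```go
package main

import "fmt"

// effectiveKVCacheType falls back to f16 when a quantized cache type is
// requested without flash attention, mirroring the requirement above.
func effectiveKVCacheType(requested string, flashAttention bool) string {
	if requested == "" || requested == "f16" {
		return "f16"
	}
	if !flashAttention {
		return "f16" // quantized KV cache needs flash attention
	}
	return requested
}

func main() {
	fmt.Println(effectiveKVCacheType("q8_0", false)) // f16
	fmt.Println(effectiveKVCacheType("q8_0", true))  // q8_0
}
```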
Jesse Gross
f52b2615ef
kvcache: Set context for shift offsets
2025-03-07 18:43:39 -08:00
Jesse Gross
25f9b152f9
ggml-backend: Ensure allocations meet backend requirements
...
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
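The commit above concerns buffer-size alignment; a minimal sketch of rounding an allocation size up to a backend's alignment requirement (illustrative, not the ggml binding itself):
```go
package main

import "fmt"

// roundUp grows size to the next multiple of align so the allocation satisfies
// a backend's alignment requirement; align of zero leaves size unchanged.
func roundUp(size, align uint64) uint64 {
	if align == 0 {
		return size
	}
	return (size + align - 1) / align * align
}

func main() {
	fmt.Println(roundUp(1000, 256)) // 1024
}
```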
Jesse Gross
6da8b6a879
kvcache: Support non-causal attention
...
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
2025-03-07 18:39:27 -08:00
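A tiny sketch of the distinction above: with causal attention a query position attends only to itself and earlier key positions, while disabling causality leaves every position visible. This is illustrative, not the kvcache package's API.
```go
package main

import "fmt"

// allowed reports whether query position i may attend to key position j.
func allowed(causal bool, i, j int) bool {
	if !causal {
		return true // non-causal: every position is visible
	}
	return j <= i // causal: only current and earlier positions
}

func main() {
	fmt.Println(allowed(true, 1, 3))  // false
	fmt.Println(allowed(false, 1, 3)) // true
}
```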
Jesse Gross
0daaaef8c9
ollamarunner: Quiet debug logging and panic on unimplemented features
...
Debug logging of every token has previously caused test timeouts
on slower machines.
2025-03-07 18:38:02 -08:00
Jesse Gross
98272fbd58
additional review comments
2025-03-07 14:08:21 -08:00
Michael Yang
b27e8f3f10
ml/backend/ggml: use backend buffer type
...
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang
45df786f09
comments
2025-03-07 14:08:21 -08:00
Michael Yang
daaf42e4a4
ml/backend/ggml: clean up
2025-03-07 14:08:21 -08:00
Michael Yang
2dc60d4620
ml/backend/ggml: offload vision to cpu
...
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang
b5312f30e8
ml/backend/ggml: handle tensor split
2025-03-07 14:08:21 -08:00
Michael Yang
26c2e0bd35
ml/backend/ggml: handle user specified cpu offloading
2025-03-07 14:08:21 -08:00
Michael Yang
bf920883d5
ml/backend/ggml: set cpu n_threads
2025-03-07 14:08:21 -08:00
Michael Yang
58b9ec1f6b
kvcache: update tests
2025-03-07 14:08:21 -08:00