Commit Graph

3860 Commits

Author SHA1 Message Date
Jesse Gross 60830695c2 ggml-backend: Ensure data is available after async computation
We need to sync before retrieving data after async computation.
It is also important to ensure that the Go buffer is not moved by
the GC across function calls so we do a synchronous copy.
2025-02-13 17:09:26 -08:00
Jesse Gross 01d9a46854 ggml-backend: Let GGML allocate context memory
Passing in a Go buffer is not safe because the garbage collector could
free or move the memory while the context is still open. However, if
we pass in the size and a nil pointer then GGML will allocate it from
the C side.
2025-02-13 17:09:26 -08:00
Jesse Gross d773b7d671 backend: API to support full precision matmul
Most tensor backends try to optimize performance by using a lower
precision for matmuls. However, some operations (such as kq) on
some models are sensitive to this and require full precision.
2025-02-13 17:09:26 -08:00
Jesse Gross 4d4463b2bd backend: Support graph computation that does not return an output
There are two cases where we may not have an output after computing:
 - Prompt processing where the length of the input exceeds the batch
   size
 - Internal memory management operations such as cache defrag and shift
2025-02-13 17:09:26 -08:00
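The first case described in 4d4463b2bd can be illustrated with a small Go sketch. The names `batch` and `splitBatches` are hypothetical and are not taken from the Ollama source. The sketch only shows that when a prompt is longer than the batch size, every batch except the last can be computed without producing an output.

```go
package main

import "fmt"

// batch is a hypothetical unit of work: a slice of prompt tokens plus a flag
// saying whether the caller needs an output after computing it.
type batch struct {
	tokens      []int
	needsOutput bool
}

// splitBatches cuts a prompt into chunks of at most batchSize tokens.
// Only the final chunk is marked as needing an output.
func splitBatches(prompt []int, batchSize int) []batch {
	if batchSize <= 0 {
		return nil
	}
	var out []batch
	for start := 0; start < len(prompt); start += batchSize {
		end := start + batchSize
		if end > len(prompt) {
			end = len(prompt)
		}
		out = append(out, batch{
			tokens:      prompt[start:end],
			needsOutput: end == len(prompt),
		})
	}
	return out
}

func main() {
	for i, b := range splitBatches([]int{1, 2, 3, 4, 5}, 2) {
		fmt.Printf("batch %d: tokens=%v needsOutput=%v\n", i, b.tokens, b.needsOutput)
	}
}
```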
Jesse Gross 0e38297f87 backend: Consistently use int (vs. int64) for tensor shapes
Currently there is a mixture of int and int64 used when dealing with
tensor dimensions and shapes, which causes unnecessary conversions -
they all should be the same type.

In general, most interfaces (such as PyTorch) use int64 for
generality but most implementations (such as CUDA) use int32 for
performance. There isn't much benefit to us in being more flexible
than the implementations we are likely to run on.

In addition, as a practical matter, a model with a tensor with a single
dimension larger than 32 bits is unlikely to run on a 32-bit machine.
2025-02-13 17:09:26 -08:00
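As a rough illustration of the reasoning in 0e38297f87, the Go sketch below keeps shapes as plain `int` everywhere and converts only at the boundary to a narrower type. The helper names `numElements` and `toInt32Dims` are assumptions made for this example and do not come from the Ollama code.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// numElements multiplies the dimensions of a shape using plain int,
// so callers never need to convert between int and int64.
func numElements(shape []int) int {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return n
}

// toInt32Dims converts a shape to int32 values, as a hypothetical
// lower-level kernel might expect. Dimensions that are negative or do
// not fit in 32 bits are rejected instead of being silently truncated.
func toInt32Dims(shape []int) ([]int32, error) {
	dims := make([]int32, len(shape))
	for i, d := range shape {
		if d < 0 || d > math.MaxInt32 {
			return nil, errors.New("dimension out of int32 range")
		}
		dims[i] = int32(d)
	}
	return dims, nil
}

func main() {
	shape := []int{2, 3, 4}
	dims, err := toInt32Dims(shape)
	fmt.Println(numElements(shape), dims, err)
}
```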
Jesse Gross 7e13f568dc backend: Don't return an error on Close
It is not common to return errors with close/free operations - most
people won't check them and even if they did there's probably not much
that they can do. It's better to not give implementations false expectations.
2025-02-13 17:09:26 -08:00
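The design choice in 7e13f568dc might look like the following in Go. This is a minimal sketch with hypothetical types (`Context`, `scratch`, `use`), not the actual interface from the repository. It shows a `Close` method with no return value, which leaves the caller with nothing to check or discard.

```go
package main

import "fmt"

// Context is a hypothetical resource interface. Close has no return value,
// so callers are not led to expect an error they could act on.
type Context interface {
	Close()
}

// scratch is a toy resource that owns a byte buffer.
type scratch struct {
	buf    []byte
	closed bool
}

func newScratch(n int) *scratch {
	return &scratch{buf: make([]byte, n)}
}

// Close releases the buffer. Calling it more than once is harmless.
func (s *scratch) Close() {
	s.buf = nil
	s.closed = true
}

// use runs work and always closes the resource afterwards.
func use(c Context, work func()) {
	defer c.Close() // no error to check or ignore
	work()
}

func main() {
	s := newScratch(8)
	use(s, func() { fmt.Println("buffer length:", len(s.buf)) })
	fmt.Println("closed:", s.closed)
}
```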
Michael Yang 58245413f4
next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00
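To make the relationship between the three interfaces named in 58245413f4 concrete, here is a heavily simplified Go sketch. The method sets below are assumptions chosen for illustration and are not the signatures in `model/model.go` or `ml/backend.go`. The toy types `vec`, `mapBackend`, and `biasModel` are likewise invented for this example.

```go
package main

import "fmt"

// Tensor is a stand-in for a tensor and its operations.
type Tensor interface {
	Shape() []int
	Floats() []float32
	Add(other Tensor) Tensor
}

// Backend is a stand-in for a tensor library that owns loaded weights
// and hands them to models by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a stand-in for a model architecture with a forward pass.
type Model interface {
	Forward(b Backend, input Tensor) Tensor
}

// vec is a toy one-dimensional tensor.
type vec []float32

func (v vec) Shape() []int      { return []int{len(v)} }
func (v vec) Floats() []float32 { return v }
func (v vec) Add(other Tensor) Tensor {
	o := other.Floats()
	if len(o) != len(v) {
		panic("shape mismatch")
	}
	out := make(vec, len(v))
	for i := range v {
		out[i] = v[i] + o[i]
	}
	return out
}

// mapBackend is a toy backend that keeps tensors in a map.
type mapBackend map[string]Tensor

func (m mapBackend) Get(name string) Tensor { return m[name] }

// biasModel is a toy model whose forward pass adds a "bias" tensor.
type biasModel struct{}

func (biasModel) Forward(b Backend, input Tensor) Tensor {
	return input.Add(b.Get("bias"))
}

func main() {
	var m Model = biasModel{}
	out := m.Forward(mapBackend{"bias": vec{1, 1}}, vec{2, 3})
	fmt.Println(out.Shape(), out.Floats())
}
```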
Bùi Đức Nhật 8cf16063a5
docs: add ollamazing to the README.md (#9075) 2025-02-13 10:47:09 -08:00
frob 3a4449e2f1
docs: add H200 as supported device. (#9076)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-02-13 10:44:23 -08:00
Anuraag (Rag) Agrawal 10d59d5f90
openai: finish_reason as tool_calls for streaming with tools (#7963) 2025-02-13 10:20:12 -08:00
Jeffrey Morgan a4f69a0191
build: add -DGGML_CUDA_NO_PEER_COPY=ON for rocm builds on windows (#9060) 2025-02-13 00:23:17 -08:00
Clinton 82658c3eec
readme: add Homebrew to package managers section (#9052) 2025-02-12 11:17:39 -08:00
bloominstrong 378d6e1e6a
docs: fix nix package link (#9045)
removing the channel tag from the url so it will always go to the current stable channel.
2025-02-12 09:16:26 -08:00
Hugues Chocart afa55bc70c
doc: fix link for Abso (#9043) 2025-02-12 09:15:08 -08:00
Michael Yang 49df03da9a
fix: harden backend loading (#9024)
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
2025-02-11 15:36:53 -08:00
Hugues Chocart 0189bdd0b7
readme: add Abso SDK to community integrations (#8973) 2025-02-11 00:14:45 -08:00
Jeffrey Morgan f4711da7bd
ml/backend/ggml: fix crash on dlopen for non-AVX systems (#8976) 2025-02-10 09:52:12 -08:00
Hugues Chocart 38117fba83
readme: add Lunary to observability community integrations (#8975) 2025-02-09 22:08:46 -08:00
Michael Yang 1f766c36fb
ci: use windows-2022 to sign and bundle (#8941)
ollama requires vcruntime140_1.dll which isn't found on 2019. previously
the job used the windows runner (2019) but it explicitly installs
2022 to build the app. since the sign job doesn't actually build
anything, it can use the windows-2022 runner instead.
2025-02-08 13:07:00 -08:00
Qusai Ismael 484a99e428
docs: add LocalLLM app to community integrations (#8953) 2025-02-08 12:28:01 -08:00
DravenK ec6121c331
docs: ollama zig community lib (#8688) 2025-02-08 11:10:47 -08:00
Jeffrey Morgan b86c0a1500
docs: link directly to latest release page for tdm-gcc (#8939) 2025-02-08 00:21:10 -08:00
Guddu Kumar 7e402ebb8c
readme: add deepseek to supported models 2025-02-07 11:28:28 -08:00
Azis Alvriyanto b901a712c6
docs: improve syntax highlighting in code blocks (#8854) 2025-02-07 09:55:07 -08:00
Michael Yang abb8dd57f8
add gfx instinct gpus (#8933) 2025-02-07 09:51:22 -08:00
Leisure Linux a400df48c0
docs: include port in faq.md OLLAMA_HOST examples (#8905) 2025-02-06 18:45:09 -08:00
annilq 6ab4ba4c26
readme: add React Native client to community integrations (#8877) 2025-02-06 17:15:48 -08:00
CosmicEventHorizon e8d4eb3e68
readme: add ChibiChat to community integrations (#8883) 2025-02-06 16:08:46 -08:00
Michael Yang ae7e368f75
build(rocm): add numa, elf (#8900) 2025-02-06 15:46:30 -08:00
oslook 31acd1ebf9
readme: add Ollama Chat WebUI for Docker to community integrations (#8084) 2025-02-06 15:41:02 -08:00
Michael Yang 9a4757ae66
build(rocm): add tinfo (#8899) 2025-02-06 15:08:12 -08:00
Abhinav Pant 7814019708
docs: add step for removing libraries in linux.md (#8897) 2025-02-06 14:54:58 -08:00
Michael Yang b698f9a0d8
build: add missing dependencies (#8896) 2025-02-06 13:12:16 -08:00
Azis Alvriyanto 32285a6d19
format: rename test file from byte_test.go to bytes_test.go (#8865) 2025-02-06 13:06:15 -08:00
Michael Yang 1c198977ec
ci: fix linux archive (#8862)
the find returns intermediate directories which pulls the parent
directories. it also omits files under lib/ollama.

switch back to globbing
2025-02-05 19:45:58 -08:00
zyphixor 330b6c50b0
readme: add simple-discord-ai to community integrations (#8659) 2025-02-05 18:35:04 -08:00
Diego Pereira 928911bc68
runner: avoid buffer overwrite when generating multiple embeddings (#8714)
Shield the code processing the embedding result
from subsequent calls that may overwrite the same
buffer to process a second input when retrieving
model embeddings.
2025-02-05 16:53:33 -08:00
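The class of bug addressed by 928911bc68 can be reproduced in a few lines of Go. The `engine` type and its methods are hypothetical and only model a producer that reuses one internal buffer. The actual fix in the runner may differ. The sketch contrasts returning an aliasing slice with copying the result out before the next call can overwrite it.

```go
package main

import "fmt"

// engine is a hypothetical embedding producer that reuses one internal
// buffer across calls.
type engine struct {
	buf []float32
}

func newEngine(dim int) *engine {
	return &engine{buf: make([]float32, dim)}
}

// embedUnsafe fills the shared buffer and returns a slice that aliases it.
// A later call overwrites the values seen by earlier callers.
func (e *engine) embedUnsafe(seed float32) []float32 {
	for i := range e.buf {
		e.buf[i] = seed + float32(i)
	}
	return e.buf
}

// embed copies the result out of the shared buffer before returning,
// so each caller owns its own data.
func (e *engine) embed(seed float32) []float32 {
	res := e.embedUnsafe(seed)
	out := make([]float32, len(res))
	copy(out, res)
	return out
}

func main() {
	e := newEngine(2)

	a := e.embedUnsafe(1)
	b := e.embedUnsafe(10)
	fmt.Println("aliased:", a, b) // a now shows the second result

	c := e.embed(1)
	d := e.embed(10)
	fmt.Println("copied: ", c, d)
}
```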
Michael Yang 5b446cc815
chore: update gitattributes (#8860)
* chore: update gitattributes
* chore: add build info source
2025-02-05 16:37:18 -08:00
Daniel Lok 451c1596af
readme: add MLflow Tracing as an observability integration (#8811) 2025-02-05 16:04:24 -08:00
Michael Yang 932bded12f chore: add optional field for server logs 2025-02-05 15:55:32 -08:00
Michael Yang 070ad913ac ci: fix linux archive 2025-02-05 15:08:02 -08:00
Azis Alvriyanto 8d8b9f83ae
format: byte formatting test coverage (#8692)
Removed redundant checks and streamlined the switch-case structure.
Added test cases for both HumanBytes and HumanBytes2 to cover a wide range of scenarios.
2025-02-05 12:23:07 -08:00
Jeffrey Morgan f00d359a67
docs: add section in development.md on library detection (#8855) 2025-02-05 11:16:27 -08:00
Yashwanth A 291def6adb
server: increase timeout in stall detection from 5s to 30s (#8831)
In some cases, downloads slow due to disk i/o or other factors,
causing the download to restart a part. This causes the download
to "reverse" in percent completion. By increasing the timeout to 30s,
this should happen less frequently.
2025-02-05 10:00:26 -08:00
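The effect of the timeout change in 291def6adb can be sketched in Go as follows. The constant and function names are assumptions made for this example and are not the server's actual code. Only the 5s and 30s values come from the commit message. The sketch shows that a part which has been idle for 10 seconds counts as stalled under the old timeout but not under the new one.

```go
package main

import (
	"fmt"
	"time"
)

// Timeout values taken from the commit message; the names are hypothetical.
const (
	oldStallTimeout = 5 * time.Second
	newStallTimeout = 30 * time.Second
)

// isStalled reports whether a download part that has made no progress
// for the given idle duration should be treated as stalled.
func isStalled(idle, timeout time.Duration) bool {
	return idle >= timeout
}

func main() {
	idle := 10 * time.Second // e.g. a slow disk flush
	fmt.Println("old timeout, restart part:", isStalled(idle, oldStallTimeout))
	fmt.Println("new timeout, restart part:", isStalled(idle, newStallTimeout))
}
```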
Jeffrey Morgan cd3fbf1c49
llama: use dynamic backend loading for mllama and clip (#8835) 2025-02-05 09:46:56 -08:00
Jeffrey Morgan c852b8e021
server: always print upload/download part info (#8832) 2025-02-04 19:30:49 -08:00
William d8932c55e7
server: fix out of bounds exception on model download (#8746) 2025-02-04 18:52:47 -08:00
Michael Yang 63f0269f7f ci: split docker build by platform
this improves build reliability and concurrency
2025-02-04 17:04:27 -08:00
Jeffrey Morgan 4759ecae19
ml/backend/ggml: fix library loading on macOS amd64 (#8827) 2025-02-04 15:05:39 -08:00
Michael Yang 65b7ecac7b fix extra quote 2025-02-04 08:35:30 -08:00