Commit Graph

109 Commits

Author SHA1 Message Date
Bruce MacDonald 9ceee25d8b chunk vision outputs 2025-05-12 13:49:44 -07:00
Bruce MacDonald 661bf04696 add picture prefix 2025-05-12 13:49:44 -07:00
Bruce MacDonald 9876c8453a update exported functions for tests 2025-05-12 13:49:43 -07:00
Bruce MacDonald 16b13e0cfc Revert "ropeTheta should be 1e5"
This reverts commit cc1638b26763eae7daddd44e3975a885671ef9d3.

This reverts commit b32385591307e2d33a8f43ce1626b529d2dac83e.
2025-05-12 13:49:43 -07:00
Bruce MacDonald 45f96e898d ropeTheta should be 1e5 2025-05-12 13:49:43 -07:00
Bruce MacDonald 7c555d394c simplify patch creation 2025-05-12 13:49:43 -07:00
Bruce MacDonald 39ee6d2bd0 ranges for lint 2025-05-12 13:49:43 -07:00
Bruce MacDonald 47705b5168 simplify rope changes 2025-05-12 13:49:43 -07:00
Michael Yang 698a92aa4a reverse window 2025-05-12 13:49:43 -07:00
Michael Yang 150c499cae use silu 2025-05-12 13:49:43 -07:00
Bruce MacDonald b68af0370f move sdpa to model forward pass 2025-05-12 13:49:43 -07:00
Bruce MacDonald ca981c8a49 full attn block indexes should be []int32 2025-05-12 13:49:43 -07:00
Bruce MacDonald b3da8a319e Update model_vision.go 2025-05-12 13:49:42 -07:00
Bruce MacDonald 359e1d5b19 full attention layers 2025-05-12 13:49:42 -07:00
Bruce MacDonald ff1f74534b block attention 2025-05-12 13:49:42 -07:00
Bruce MacDonald 104f802df1 remove todos 2025-05-12 13:49:42 -07:00
Bruce MacDonald eed0ac2948 clean up vision model forward pass 2025-05-12 13:49:42 -07:00
Bruce MacDonald fcfad744ff fix patch merger 2025-05-12 13:49:42 -07:00
Michael Yang fb3c16f2a2 window index 2025-05-12 13:49:42 -07:00
Michael Yang ee869f35e4 fix image processing
Python's built-in `round()` rounds to the nearest even number when the value is exactly halfway between two integers (banker's rounding)

https://docs.python.org/3/library/functions.html#round
2025-05-12 13:49:42 -07:00
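The rounding mismatch behind this fix, as a minimal runnable Go sketch (illustrative only, not the repository's image-processing code): Go's `math.Round` rounds halfway values away from zero, while Python's `round()` rounds them to the nearest even number, so a direct port needs `math.RoundToEven` to match the Python reference.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Go's math.Round rounds half away from zero; Python's round() and
	// Go's math.RoundToEven round half to even ("banker's rounding").
	for _, v := range []float64{0.5, 1.5, 2.5} {
		fmt.Printf("v=%.1f  Round=%v  RoundToEven=%v\n",
			v, math.Round(v), math.RoundToEven(v))
	}
	// v=0.5  Round=1  RoundToEven=0
	// v=1.5  Round=2  RoundToEven=2
	// v=2.5  Round=3  RoundToEven=2
}
```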
Michael Yang ff5d1a3dc0 duplicate input embeddings 2025-05-12 13:49:42 -07:00
Michael Yang 88b231f903 use maxgridsize 2025-05-12 13:49:42 -07:00
Michael Yang 7e920c8d75 fix: patch merger and convert
convert:
- split patch embedding
- split qkv

remove duplicate PatchMerger
2025-05-12 13:49:42 -07:00
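One way the qkv split described above can look during conversion, as a hedged sketch: it assumes the fused weight concatenates the Q, K, and V projections along the output dimension, and the function name is illustrative, not the converter's actual API.

```go
package convert

// splitQKV slices a fused attention weight into separate Q, K, and V
// tensors, assuming the fused layout stacks Q rows, then K rows, then
// V rows along the output dimension.
func splitQKV(fused []float32, embedDim, qDim, kvDim int) (q, k, v []float32) {
	qLen := qDim * embedDim
	kvLen := kvDim * embedDim
	return fused[:qLen], fused[qLen : qLen+kvLen], fused[qLen+kvLen : qLen+2*kvLen]
}
```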
Bruce MacDonald dd8c619fba fixes after rebase 2025-05-12 13:49:42 -07:00
Bruce MacDonald 2af76d0e7a default to 32 for vision block count 2025-05-12 13:49:42 -07:00
Bruce MacDonald 8d901825f0 reshape cos and sin 2025-05-12 13:49:41 -07:00
Bruce MacDonald 04936b719f Update model_vision.go 2025-05-12 13:49:41 -07:00
Bruce MacDonald 0f0136d419 simplify by doing operations in Go rather than with tensors
Co-Authored-By: Michael Yang <2372640+mxyng@users.noreply.github.com>
2025-05-12 13:49:41 -07:00
Bruce MacDonald 80498f76de fix build 2025-05-12 13:49:41 -07:00
Bruce MacDonald f8b48aa784 Delete model_external_test.go 2025-05-12 13:49:41 -07:00
Bruce MacDonald 5ff0d538b0 wip: implementing rope 2025-05-12 13:49:41 -07:00
Bruce MacDonald eedc969c35 grid refactor 2025-05-12 13:49:41 -07:00
Bruce MacDonald 963531215e update convert 2025-05-12 13:49:41 -07:00
Bruce MacDonald 3fe090f447 get patch embedding vals from config 2025-05-12 13:49:41 -07:00
Bruce MacDonald 1704072746 patch embeddings 2025-05-12 13:49:41 -07:00
Bruce MacDonald c1f9bcb4dd restructure
image processing

Update model.go

Update model.go

Update model.go

no projector

no projector

vision model scaffold

...

...

wip

...

rebase

fix patch merger

tidy

...

Update model_vision.go

server: do not attempt to parse offset file as gguf

This logic was causing issues when importing a GGUF that had padding at the end of the file: the valid GGUF would be read, but the parser would then try to read the trailing offset as a second GGUF file, which is incorrect. (A sketch of the check follows this commit entry.)

Update process_image_test.go

apply norm

prompt processing

prompt processing

fix post tokenize

fix gguf padding + populate the split patch embeddings

...

...

another shot at patch embeddings

...

patch embedding

Update model_vision.go

split pixels
2025-05-12 13:49:41 -07:00
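A hedged sketch of the padding check described in the squashed message above ("do not attempt to parse offset file as gguf"); the helper name and layout assumptions are illustrative, not Ollama's actual parser. It relies only on the fact that every GGUF file opens with the four-byte magic "GGUF".

```go
package main

import (
	"bytes"
	"fmt"
)

// ggufMagic is the four-byte magic that opens every GGUF file.
var ggufMagic = []byte("GGUF")

// nextIsGGUF reports whether the bytes remaining after a decoded GGUF
// plausibly start another GGUF file, rather than being trailing padding.
func nextIsGGUF(rest []byte) bool {
	return len(rest) >= len(ggufMagic) && bytes.HasPrefix(rest, ggufMagic)
}

func main() {
	rest := make([]byte, 32) // zero padding left at the end of an import
	if !nextIsGGUF(rest) {
		fmt.Println("trailing bytes are padding, not a second GGUF; stop parsing")
	}
}
```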
Bruce MacDonald 198b1e6db9 text model forward pass 2025-05-12 13:49:41 -07:00
Bruce MacDonald 51ad65f831 ml: structured rope config to allow specifying context len
This commit refactors the Rotary Position Embedding (RoPE) implementation across the codebase to use a structured configuration approach instead of individual parameters.

Key changes:
- Add new RoPEConfig struct with fields for dimension, type, base frequency, and scaling
- Add RopeType enum to formalize different RoPE implementation variants
- Add YarnConfig struct and related configuration for YaRN (Yet Another RoPE extensioN) context extension
- Update RoPE method signature across all tensor interfaces and implementations
- Refactor all model implementations (llama, gemma2, gemma3, mllama) to use the new configuration structure

This change improves code organization, makes the RoPE configuration more explicit, and provides better support for different RoPE variants and context extension methods.
2025-05-12 13:49:41 -07:00
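The commit message names RoPEConfig, RopeType, and YarnConfig; here is a hedged sketch of the shape such a structured config might take. Field and constant names beyond those three types are assumptions for illustration, not Ollama's actual API.

```go
package ml

// RopeType formalizes the RoPE implementation variants.
type RopeType int

const (
	RopeTypeStandard RopeType = iota // illustrative variant names
	RopeTypeNeoX
)

// YarnConfig carries YaRN (Yet Another RoPE extensioN) parameters used
// to extend the usable context length beyond the training context.
type YarnConfig struct {
	OriginalContextLength uint32
	ExtrapolationFactor   float32
	AttentionFactor       float32
}

// RoPEConfig groups the parameters that were previously passed to RoPE
// individually: rotation dimension, variant, base frequency, and scaling.
type RoPEConfig struct {
	Dim   uint32      // number of dimensions to rotate
	Type  RopeType    // implementation variant
	Base  float32     // base frequency, e.g. 1e5
	Scale float32     // frequency scaling factor
	YaRN  *YarnConfig // optional context-extension settings
}
```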
Michael Yang f95a1f2bef feat: add trace log level (#10650)
reduce prompt log to trace level
2025-05-12 11:43:00 -07:00
Michael Yang 5cfc1c39f3 model: fix build (#10416) 2025-04-25 19:24:48 -07:00
Michael Yang 7ba9fa9c7d fixes for maverick 2025-04-25 16:59:20 -07:00
Michael Yang 8bf11b84c1 chunked attention 2025-04-25 16:59:20 -07:00
Michael Yang 470af8ab89 connect vision to text 2025-04-25 16:59:20 -07:00
Michael Yang 178761aef3 image processing
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-04-25 16:59:20 -07:00
Michael Yang f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang d26c18e25c fix token type 2025-04-25 16:59:01 -07:00
Parth Sareen a53d744b01 llama: remove model loading for grammar (#10096) 2025-04-24 11:51:19 -07:00
Michael Yang 40b8fdbdca arange 2025-04-18 11:45:44 -07:00
Jesse Gross dbb149e6f7 ollamarunner: Preallocate worst case graph at startup
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.

Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
2025-04-08 10:01:28 -07:00
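The fail-fast idea in this commit, as a runnable hedged sketch: verify the worst-case memory need at startup so a shortfall surfaces immediately. The sizing formula and every constant here are illustrative assumptions, not the runner's actual accounting.

```go
package main

import "fmt"

// worstCaseBytes estimates the memory a maximum-size forward pass needs:
// a full KV cache at the longest supported sequence plus a rough graph
// cost for the largest batch. All constants here are illustrative.
func worstCaseBytes(layers, maxSeq, kvHeads, headDim, batch int64) int64 {
	kvCache := 2 * layers * maxSeq * kvHeads * headDim * 2 // K+V, fp16
	graph := batch * (4 << 20)                             // rough per-batch-entry graph cost
	return kvCache + graph
}

func main() {
	need := worstCaseBytes(32, 8192, 8, 128, 512)
	budget := int64(8) << 30 // free VRAM reported by the backend at startup
	// Check the worst case up front: fail here, at startup, rather than
	// mid-inference after another application has claimed the VRAM.
	if need > budget {
		panic(fmt.Sprintf("worst-case graph needs %d bytes, only %d available", need, budget))
	}
	fmt.Printf("worst case %d bytes fits; safe to serve requests\n", need)
}
```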
Bruce MacDonald 6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab that releases open-source models. This updates the forward pass of llama-architecture models to support both Llama and Mistral models by accounting for additional metadata present in Mistral models and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
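A hedged sketch of the dimension lookup the message describes: derive the output-projection width from model metadata when the extra Mistral keys are present, otherwise fall back to the embedding size. The key names and fallback are assumptions for illustration, not the runner's actual metadata keys.

```go
package model

// outputProjectionDim derives the attention output-projection width from
// model metadata when explicit head sizing is present, falling back to
// the embedding size as llama-architecture models assume.
func outputProjectionDim(meta map[string]uint32, embedDim uint32) uint32 {
	headDim, okDim := meta["attention.head_dim"]
	headCount, okCount := meta["attention.head_count"]
	if okDim && okCount {
		return headDim * headCount // explicit head sizing, Mistral-style
	}
	return embedDim // default: project back to the embedding dimension
}
```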