ollama

Jesse Gross f53f4198c3 ml: Abstract attention out of model definitions

There are two benefits to doing this:
 - Provides a library function that models can use, reducing the code
   needed in each model implementation
 - Provides a single place to drop in optimized implementations of
   attention, selected by backend or other factors. One such
   implementation is provided for GGML.

On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

2025-02-21 13:16:21 -08:00

imageproc

imageproc mllama refactor (#7537)

2024-12-14 19:50:15 -08:00

models

ml: Abstract attention out of model definitions

2025-02-21 13:16:21 -08:00

testdata

next ollama runner (#7913)

2025-02-13 16:31:21 -08:00

model_test.go

Runner for Ollama engine