ollama/ml/backend/ggml
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, and these are normally handled by
the cache, which vision encoders don't use.

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
2025-12-04 15:19:06 -08:00
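
A minimal Go sketch of the fallback idea described in the commit message: instead of refusing to fuse when the input hasn't been prepared by the cache, the backend satisfies the kernel's padding requirement itself and still takes the fused path. The names flashAttnPadding, padKV, and attention, and the block size of 256, are assumptions for illustration, not the actual ollama/ggml API.

```go
package main

import "fmt"

// flashAttnPadding is an assumed block size the fused kernel prefers
// for the key/value sequence length.
const flashAttnPadding = 256

type tensor struct {
	seqLen int
	data   []float32
}

// padKV grows the key/value sequence dimension to the next multiple of
// flashAttnPadding, conceptually masking the extra positions.
func padKV(t tensor) tensor {
	padded := ((t.seqLen + flashAttnPadding - 1) / flashAttnPadding) * flashAttnPadding
	return tensor{seqLen: padded, data: append(t.data, make([]float32, padded-t.seqLen)...)}
}

// attention chooses between the fused and unfused paths. The padding
// requirement is "softened" into an optimization: unpadded inputs (as
// produced by vision encoders, which bypass the cache) are padded here
// rather than falling back to non-fused operations.
func attention(q, k, v tensor, flashSupported bool) string {
	if !flashSupported {
		return "unfused path: scale -> matmul -> softmax -> matmul"
	}
	if k.seqLen%flashAttnPadding != 0 {
		k = padKV(k)
		v = padKV(v)
	}
	return fmt.Sprintf("fused flash attention: %d queries over %d padded KV positions (%d values)",
		q.seqLen, k.seqLen, len(v.data))
}

func main() {
	// e.g. ViT patch tokens plus CLS token, with no cache to pad them.
	q := tensor{seqLen: 577, data: make([]float32, 577)}
	fmt.Println(attention(q, q, q, true))
}
```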
..
ggml ggml update to b7108 (#12992) 2025-12-03 19:43:29 -08:00
ggml.go ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
ggml_test.go ml: add slice operation (#12870) 2025-11-13 13:28:21 -08:00
quantization.go chore: fix some inconsistent function name in comment 2025-08-13 09:50:27 -07:00
threads.go ollama debug tensor 2025-03-11 14:49:19 -07:00
threads_debug.go ollama debug tensor 2025-03-11 14:49:19 -07:00