ollama/ml/nn
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision components of multimodal models typically already
call the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, and these are normally performed
by the cache, which vision encoders don't use.
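For context, here is a minimal sketch of how a vision-encoder block calls nn.Attention without a KV cache (the cache argument is nil). The signatures and struct tags loosely follow the public ollama ml and ml/nn packages, but head reshaping and model-specific details are omitted, so treat this as an approximation rather than actual model code.

```go
package vision

import (
	"math"

	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
)

// SelfAttention is a simplified vision-encoder attention block.
type SelfAttention struct {
	Query  *nn.Linear `gguf:"attn_q"`
	Key    *nn.Linear `gguf:"attn_k"`
	Value  *nn.Linear `gguf:"attn_v"`
	Output *nn.Linear `gguf:"attn_out"`
}

// Forward projects the hidden states to Q/K/V and calls nn.Attention with a
// nil cache, since vision encoders keep no KV cache. Head reshaping is
// omitted for brevity.
func (sa *SelfAttention) Forward(ctx ml.Context, hiddenStates ml.Tensor, headDim int) ml.Tensor {
	query := sa.Query.Forward(ctx, hiddenStates)
	key := sa.Key.Forward(ctx, hiddenStates)
	value := sa.Value.Forward(ctx, hiddenStates)

	// With no cache, fused-kernel requirements such as padding were never
	// applied, which previously forced the non-fused operations.
	attention := nn.Attention(ctx, query, key, value, 1.0/math.Sqrt(float64(headDim)), nil)

	return sa.Output.Forward(ctx, attention)
}
```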

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
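The fallback can be pictured as a simple preference order: use the fused flash-attention kernel when its constraints hold, otherwise produce the same result with non-fused operations. The sketch below is self-contained and entirely hypothetical: the types, the block-size constraint, and the function names are assumptions for illustration, not the backend's real API.

```go
package fallbackdemo

// Tensor is a stand-in for the backend's tensor type (hypothetical).
type Tensor struct{}

// Kernel computes an attention output from Q, K, V.
type Kernel func(q, k, v Tensor) Tensor

// meetsFusedRequirements reports whether the fused kernel's assumed
// constraint holds, e.g. the key/value length is padded to its block size.
func meetsFusedRequirements(kvLen, blockSize int) bool {
	return blockSize > 0 && kvLen%blockSize == 0
}

// Attention prefers the fused flash-attention kernel when possible and
// otherwise falls back to the non-fused path. The requirement becomes an
// optimization: inputs that don't satisfy it still work, they just take
// the slower path.
func Attention(q, k, v Tensor, kvLen, blockSize int, fused, nonFused Kernel) Tensor {
	if meetsFusedRequirements(kvLen, blockSize) {
		return fused(q, k, v)
	}
	return nonFused(q, k, v)
}
```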
2025-12-04 15:19:06 -08:00
fast ml: add more rope options (#10775) 2025-05-20 15:51:08 -07:00
pooling chore: update models to use slice/chunk/chunksections (#12934) 2025-11-13 15:20:12 -08:00
rope ggml update to b6840 (#12791) 2025-11-06 10:19:22 -08:00
attention.go ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
convolution.go fix: conv2d bias (#12834) 2025-10-29 11:03:43 -07:00
embedding.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00
linear.go update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
normalization.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00