ollama/ml/nn
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision components of multimodal models typically already
call the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, and these are normally performed
by the cache, which vision encoders don't use.
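For context, here is a minimal sketch of how a vision-encoder block calls nn.Attention without a KV cache (the cache argument is nil). The signatures and struct tags loosely follow the public ollama ml and ml/nn packages, but head reshaping and model-specific details are omitted, so treat this as an approximation rather than actual model code.

```go
package vision

import (
	"math"

	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
)

// SelfAttention is a simplified vision-encoder attention block.
type SelfAttention struct {
	Query  *nn.Linear `gguf:"attn_q"`
	Key    *nn.Linear `gguf:"attn_k"`
	Value  *nn.Linear `gguf:"attn_v"`
	Output *nn.Linear `gguf:"attn_out"`
}

// Forward projects the hidden states to Q/K/V and calls nn.Attention with a
// nil cache, since vision encoders keep no KV cache. Head reshaping is
// omitted for brevity.
func (sa *SelfAttention) Forward(ctx ml.Context, hiddenStates ml.Tensor, headDim int) ml.Tensor {
	query := sa.Query.Forward(ctx, hiddenStates)
	key := sa.Key.Forward(ctx, hiddenStates)
	value := sa.Value.Forward(ctx, hiddenStates)

	// With no cache, fused-kernel requirements such as padding were never
	// applied, which previously forced the non-fused operations.
	attention := nn.Attention(ctx, query, key, value, 1.0/math.Sqrt(float64(headDim)), nil)

	return sa.Output.Forward(ctx, attention)
}
```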

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
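The fallback can be pictured as a simple preference order: use the fused flash-attention kernel when its constraints hold, otherwise produce the same result with non-fused operations. The sketch below is self-contained and entirely hypothetical: the types, the block-size constraint, and the function names are assumptions for illustration, not the backend's real API.

```go
package fallbackdemo

// Tensor is a stand-in for the backend's tensor type (hypothetical).
type Tensor struct{}

// Kernel computes an attention output from Q, K, V.
type Kernel func(q, k, v Tensor) Tensor

// meetsFusedRequirements reports whether the fused kernel's assumed
// constraint holds, e.g. the key/value length is padded to its block size.
func meetsFusedRequirements(kvLen, blockSize int) bool {
	return blockSize > 0 && kvLen%blockSize == 0
}

// Attention prefers the fused flash-attention kernel when possible and
// otherwise falls back to the non-fused path. The requirement becomes an
// optimization: inputs that don't satisfy it still work, they just take
// the slower path.
func Attention(q, k, v Tensor, kvLen, blockSize int, fused, nonFused Kernel) Tensor {
	if meetsFusedRequirements(kvLen, blockSize) {
		return fused(q, k, v)
	}
	return nonFused(q, k, v)
}
```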
2025-12-04 15:19:06 -08:00
fast ml: add more rope options (#10775) 2025-05-20 15:51:08 -07:00
pooling chore: update models to use slice/chunk/chunksections (#12934) 2025-11-13 15:20:12 -08:00
rope ggml update to b6840 (#12791) 2025-11-06 10:19:22 -08:00
attention.go ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
convolution.go fix: conv2d bias (#12834) 2025-10-29 11:03:43 -07:00
embedding.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00
linear.go update vendored llama.cpp and ggml (#11823) 2025-08-14 14:42:58 -07:00
normalization.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00