ollama/ml/backend/ggml
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, and these are normally handled by
the cache, which vision encoders don't use.

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
2025-12-04 15:19:06 -08:00
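
A minimal Go sketch of the fallback idea described in the commit message: instead of refusing to fuse when the input hasn't been prepared by the cache, the backend satisfies the kernel's padding requirement itself and still takes the fused path. The names flashAttnPadding, padKV, and attention, and the block size of 256, are assumptions for illustration, not the actual ollama/ggml API.

```go
package main

import "fmt"

// flashAttnPadding is an assumed block size the fused kernel prefers
// for the key/value sequence length.
const flashAttnPadding = 256

type tensor struct {
	seqLen int
	data   []float32
}

// padKV grows the key/value sequence dimension to the next multiple of
// flashAttnPadding, conceptually masking the extra positions.
func padKV(t tensor) tensor {
	padded := ((t.seqLen + flashAttnPadding - 1) / flashAttnPadding) * flashAttnPadding
	return tensor{seqLen: padded, data: append(t.data, make([]float32, padded-t.seqLen)...)}
}

// attention chooses between the fused and unfused paths. The padding
// requirement is "softened" into an optimization: unpadded inputs (as
// produced by vision encoders, which bypass the cache) are padded here
// rather than falling back to non-fused operations.
func attention(q, k, v tensor, flashSupported bool) string {
	if !flashSupported {
		return "unfused path: scale -> matmul -> softmax -> matmul"
	}
	if k.seqLen%flashAttnPadding != 0 {
		k = padKV(k)
		v = padKV(v)
	}
	return fmt.Sprintf("fused flash attention: %d queries over %d padded KV positions (%d values)",
		q.seqLen, k.seqLen, len(v.data))
}

func main() {
	// e.g. ViT patch tokens plus CLS token, with no cache to pad them.
	q := tensor{seqLen: 577, data: make([]float32, 577)}
	fmt.Println(attention(q, q, q, true))
}
```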
..
ggml ggml update to b7108 (#12992) 2025-12-03 19:43:29 -08:00
ggml.go ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
ggml_test.go ml: add slice operation (#12870) 2025-11-13 13:28:21 -08:00
quantization.go chore: fix some inconsistent function name in comment 2025-08-13 09:50:27 -07:00
threads.go ollama debug tensor 2025-03-11 14:49:19 -07:00
threads_debug.go ollama debug tensor 2025-03-11 14:49:19 -07:00