ollama/ml
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, which are normally handled by the
cache, and vision encoders don't use the cache.

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
2025-12-04 15:19:06 -08:00
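The Go sketch below illustrates this "requirement softened into an optimization" idea. It is a minimal, self-contained example: `kvPadding`, `fusedAttention`, and `attention` are hypothetical names, the padding/mask handling stands in for whatever the backend-specific fused kernels actually require, and none of this is the real ollama/ml or ggml API.

```go
// Minimal, self-contained sketch of the idea behind the change: treat the
// fused kernel's input requirements as something the backend can satisfy
// itself (a free fast path when already met) rather than a hard precondition
// that forces lowering to non-fused operations. All names here are
// illustrative assumptions, not the actual ollama/ml API.
package main

import (
	"fmt"
	"math"
)

// kvPadding is an assumed alignment the fused flash-attention kernel wants.
// In the real backend this padding is normally performed by the KV cache,
// which vision encoders do not use.
const kvPadding = 4

// fusedAttention stands in for the backend's fused kernel. It assumes
// len(k) is a multiple of kvPadding and that padded positions are masked out.
func fusedAttention(q, k, v [][]float64, mask []float64) [][]float64 {
	d := float64(len(q[0]))
	out := make([][]float64, len(q))
	for i, qi := range q {
		// scores_j = q_i · k_j / sqrt(d) + mask_j
		scores := make([]float64, len(k))
		maxS := math.Inf(-1)
		for j, kj := range k {
			s := 0.0
			for t := range qi {
				s += qi[t] * kj[t]
			}
			scores[j] = s/math.Sqrt(d) + mask[j]
			if scores[j] > maxS {
				maxS = scores[j]
			}
		}
		// softmax over the scores
		sum := 0.0
		for j := range scores {
			scores[j] = math.Exp(scores[j] - maxS)
			sum += scores[j]
		}
		// out_i = sum_j softmax(scores)_j * v_j
		out[i] = make([]float64, len(v[0]))
		for j, vj := range v {
			w := scores[j] / sum
			for t := range vj {
				out[i][t] += w * vj[t]
			}
		}
	}
	return out
}

// attention is the fallback path: if the keys/values are not already padded
// (as happens without a cache), pad and mask them here so the fused kernel
// can still be used, instead of lowering to non-fused operations.
func attention(q, k, v [][]float64) [][]float64 {
	n := len(k)
	padded := (n + kvPadding - 1) / kvPadding * kvPadding
	mask := make([]float64, padded)
	for j := n; j < padded; j++ {
		k = append(k, make([]float64, len(k[0])))
		v = append(v, make([]float64, len(v[0])))
		mask[j] = math.Inf(-1) // padded positions contribute nothing
	}
	return fusedAttention(q, k, v, mask)
}

func main() {
	q := [][]float64{{1, 0}, {0, 1}}
	k := [][]float64{{1, 0}, {0, 1}, {1, 1}} // 3 keys: not a multiple of kvPadding
	v := [][]float64{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println(attention(q, k, v)) // pads and masks, then uses the fused path
}
```

When the key/value length already meets the assumed alignment (as it would when the cache performs the padding), the wrapper adds no work; otherwise it pads and masks on the fly so the fused kernel can still run rather than falling back to non-fused operations.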
backend ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
nn ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
backend.go ggml: Enable flash attention for vision encoders 2025-12-04 15:19:06 -08:00
device.go CUDA: filter devices on secondary discovery (#13317) 2025-12-03 12:58:16 -08:00
path.go cpu: always ensure LibOllamaPath included (#12890) 2025-10-31 14:37:29 -07:00