ollama

Jesse Gross 05ccb17c6e kvcache: Use Cast instead of Copy for flash attention masks

Flash attention kernels require the KV cache mask to be F16 rather
than F32. We can use the GGML operation ggml_cast to do the
conversion rather than doing it ourselves, which allows the graph to
reuse a preallocated buffer instead of allocating a new one for each
batch. This improves token generation performance with flash
attention by 10-30% (measured with gpt-oss). It also makes performance
with flash attention better than without it, as expected.

2025-08-19 12:36:28 -07:00

backend

kvcache: Use Cast instead of Copy for flash attention masks

2025-08-19 12:36:28 -07:00

update vendored llama.cpp and ggml (#11823)

2025-08-14 14:42:58 -07:00

backend.go

kvcache: Use Cast instead of Copy for flash attention masks

2025-08-19 12:36:28 -07:00