If the user has not explicitly enabled or disabled flash attention, enable it automatically when the model supports it and enabling it would not trigger a fallback to the CPU. This covers text, vision, and embedding models, and also handles KV cache quantization automatically (which requires flash attention). If a model never calls the fast fused attention operation, this is detected and any operations that depend on it are disabled.
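A minimal sketch of that decision flow, with hypothetical names: `resolveFlashAttention`, `resolveKVCacheType`, and `usesFusedAttention` stand in for the real checks in the `ggml` package, and the graph is modeled as a flat list of op names rather than an actual compute graph.

```go
package main

import "fmt"

// resolveFlashAttention decides whether flash attention should be enabled.
// userSetting is nil when the user has not explicitly enabled or disabled it;
// modelSupported and cpuFallback stand in for the real capability checks.
func resolveFlashAttention(userSetting *bool, modelSupported, cpuFallback bool) bool {
	if userSetting != nil {
		return *userSetting // an explicit user choice always wins
	}
	// Auto-enable only when the model supports flash attention and enabling
	// it would not force the attention operation to fall back to the CPU.
	return modelSupported && !cpuFallback
}

// resolveKVCacheType downgrades a quantized KV cache type to f16 when flash
// attention is off, since KV cache quantization requires flash attention.
func resolveKVCacheType(flashAttention bool, requested string) string {
	if !flashAttention && requested != "" && requested != "f16" {
		return "f16"
	}
	return requested
}

// usesFusedAttention reports whether the fused attention operation appears
// in a built compute graph (modeled here as a list of op names).
func usesFusedAttention(ops []string) bool {
	for _, op := range ops {
		if op == "FLASH_ATTN_EXT" {
			return true
		}
	}
	return false
}

func main() {
	fa := resolveFlashAttention(nil, true, false) // no explicit user setting
	fmt.Println("flash attention:", fa)           // true

	fmt.Println("kv cache type:", resolveKVCacheType(fa, "q8_0")) // q8_0

	// A model whose graph never schedules the fused op loses any
	// flash-attention-dependent features, even if it was enabled up front.
	ops := []string{"MUL_MAT", "SOFT_MAX", "MUL_MAT"}
	if fa && !usesFusedAttention(ops) {
		fa = false
		fmt.Println("fused attention not used; disabling dependent features")
	}
}
```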
Files:

- `ggml`
- `ggml.go`
- `ggml_test.go`
- `quantization.go`
- `threads.go`
- `threads_debug.go`