If a user hasn't explicitly enabled or disabled flash attention, enable it automatically when the model supports it and doing so would not trigger a fallback to CPU. This applies to text, vision, and embedding models, and also handles KV cache quantization automatically (which requires flash attention). If a model does not call the fast fused attention operation, this is detected and any operations that depend on flash attention are disabled.
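As a rough sketch of the decision logic described above, the user's choice can be modeled as a tri-state setting that falls through to auto-detection when unset. The type and function names below are hypothetical illustrations, not the actual code:

```go
package main

import "fmt"

// FlashAttentionSetting models the user's tri-state choice (hypothetical name).
type FlashAttentionSetting int

const (
	FlashAttentionUnset FlashAttentionSetting = iota
	FlashAttentionOn
	FlashAttentionOff
)

// resolveFlashAttention decides whether flash attention should be enabled.
// supported reports whether the model supports the fused attention operation;
// cpuFallback reports whether enabling it would force execution on the CPU.
func resolveFlashAttention(user FlashAttentionSetting, supported, cpuFallback bool) bool {
	switch user {
	case FlashAttentionOn:
		return true
	case FlashAttentionOff:
		return false
	default:
		// Auto mode: enable only when the model supports flash attention
		// and it would not trigger a fallback to CPU.
		return supported && !cpuFallback
	}
}

func main() {
	fmt.Println(resolveFlashAttention(FlashAttentionUnset, true, false)) // true
	fmt.Println(resolveFlashAttention(FlashAttentionUnset, true, true))  // false
}
```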