If the user has not explicitly enabled or disabled flash attention, enable it automatically when the model supports it and doing so would not trigger a fallback to CPU. This applies to text, vision, and embedding models, and includes automatic handling of KV cache quantization (which requires flash attention). If a model never calls the fast fused attention operation, this is detected and any operations that depend on it are disabled.
Files:

- cache.go
- causal.go
- causal_test.go
- encoder.go
- wrapper.go
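
Roughly, the decision logic described above might look like the following minimal sketch. All identifiers here (`resolveFlashAttention`, `modelInfo`, `resolveKVCacheType`) are hypothetical illustrations, not names taken from the files listed; the key points are that an explicit user setting always wins, auto-enablement requires both model support and no CPU fallback, and a quantized KV cache type such as `q8_0` must fall back to `f16` when flash attention ends up disabled.

```go
package main

import "fmt"

// modelInfo is a hypothetical summary of what the heuristic needs to know.
type modelInfo struct {
	supportsFlashAttn bool // model implements the fused attention operation
	fitsOnGPU         bool // enabling flash attention won't force a CPU fallback
}

// resolveFlashAttention decides whether flash attention should be enabled.
// userSetting is nil when the user left the option unset.
func resolveFlashAttention(userSetting *bool, m modelInfo) bool {
	if userSetting != nil {
		return *userSetting // an explicit user choice always wins
	}
	// Auto-enable only when the model supports it and enabling it
	// would not trigger a fallback to CPU.
	return m.supportsFlashAttn && m.fitsOnGPU
}

// resolveKVCacheType falls back to f16 when flash attention is off, since
// quantized KV caches require the fused attention path.
func resolveKVCacheType(requested string, flashAttn bool) string {
	if !flashAttn && requested != "f16" {
		return "f16"
	}
	return requested
}

func main() {
	m := modelInfo{supportsFlashAttn: true, fitsOnGPU: true}
	fa := resolveFlashAttention(nil, m)
	fmt.Println(fa)                             // true: auto-enabled
	fmt.Println(resolveKVCacheType("q8_0", fa)) // "q8_0": kept, flash attention is on
}
```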