If the user has not explicitly enabled or disabled flash attention, it is automatically enabled when the model supports it and enabling it would not trigger a fallback to CPU. This applies to text, vision, and embedding models, and includes automatic handling of KV cache quantization (which requires flash attention). If a model does not use the fast fused attention operation, this is detected and any operations that depend on it are disabled. A sketch of the decision logic follows below.
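A minimal sketch of that decision, assuming hypothetical inputs (`modelSupports`, `cpuFallback`, `kvQuantRequested`) and a `resolveFlashAttention` helper that are illustrative only, not the actual functions in this change:

```go
package llm

import "log/slog"

type flashAttnSetting int

const (
	flashAttnAuto flashAttnSetting = iota // user did not set the option
	flashAttnOn                           // user explicitly enabled
	flashAttnOff                          // user explicitly disabled
)

// resolveFlashAttention decides whether flash attention is passed to the
// runner. An explicit user choice always wins; in the auto case it is
// enabled only when the model supports it and enabling it would not push
// the load back onto the CPU. (Hypothetical helper for illustration.)
func resolveFlashAttention(userSetting flashAttnSetting, modelSupports, cpuFallback, kvQuantRequested bool) (useFlashAttn, useKVQuant bool) {
	switch userSetting {
	case flashAttnOn:
		useFlashAttn = true
	case flashAttnOff:
		useFlashAttn = false
	default: // auto
		useFlashAttn = modelSupports && !cpuFallback
	}

	// KV cache quantization requires flash attention; fall back to the
	// default (unquantized) cache type when it is unavailable.
	if kvQuantRequested && !useFlashAttn {
		slog.Warn("kv cache quantization requires flash attention; using default cache type")
		return useFlashAttn, false
	}
	return useFlashAttn, kvQuantRequested
}
```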
| File |
|---|
| llm_darwin.go |
| llm_linux.go |
| llm_windows.go |
| server.go |
| server_test.go |
| status.go |