ollama/runner
Jesse Gross 21aa666a1e ml: Enable support for flash attention
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled.

Flash attention can be used in the same situations as with the llama engine and is enabled by the user in the same way.
2025-03-01 20:53:23 -08:00
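As a rough illustration of the padding requirement mentioned in the commit message: the usable KV cache length is rounded up to the alignment the kernel expects. This is a minimal sketch, not the commit's actual implementation; the pad value of 256 is an assumption (a typical alignment for GGML's flash attention path), and the real cache code lives under ollamarunner.

package main

import "fmt"

// padCacheLen rounds a KV cache length up to the next multiple of pad.
// Sketch only: the concrete padding requirement comes from the GGML
// flash attention kernel and is assumed here, not taken from this commit.
func padCacheLen(n, pad int) int {
	return ((n + pad - 1) / pad) * pad
}

func main() {
	fmt.Println(padCacheLen(1000, 256)) // prints 1024
}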
common/        Runner for Ollama engine                  2025-02-13 17:09:26 -08:00
llamarunner/   runner: defer context cancel              2025-02-28 22:27:28 +00:00
ollamarunner/  ml: Enable support for flash attention    2025-03-01 20:53:23 -08:00
README.md      Runner for Ollama engine                  2025-02-13 17:09:26 -08:00
runner.go      Runner for Ollama engine                  2025-02-13 17:09:26 -08:00

README.md

runner

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
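The same request can be issued from Go. A minimal sketch: only the endpoint and the "prompt" field come from the README; the shape of the completion response is not documented here, so the body is printed raw.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// POST the same JSON payload as the curl example above.
	body := bytes.NewBufferString(`{"prompt": "hi"}`)
	resp, err := http.Post("http://localhost:8080/completion", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print the raw response; this README does not document its format.
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}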

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
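A corresponding Go client sketch for the embedding endpoint. The "embedding" response field is an assumption modeled on similar runners, not something this README specifies; only the endpoint and the "prompt" field come from the example above.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// embeddingResponse assumes the server replies with {"embedding": [...]};
// that field name is a guess, not documented in this README.
type embeddingResponse struct {
	Embedding []float32 `json:"embedding"`
}

func main() {
	body := bytes.NewBufferString(`{"prompt": "turn me into an embedding"}`)
	resp, err := http.Post("http://localhost:8080/embedding", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var er embeddingResponse
	if err := json.NewDecoder(resp.Body).Decode(&er); err != nil {
		panic(err)
	}
	fmt.Println(len(er.Embedding), "dimensions")
}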