
Latest commit 9679f40146 by Jesse Gross, 2025-03-14 15:38:54 -07:00:
ml: Allow models to constrain inputs to a single batch

Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes #9697
common	Runner for Ollama engine	2025-02-13 17:09:26 -08:00
llamarunner	llm: remove internal subprocess req and resp types (#9324)	2025-03-14 15:21:53 -07:00
ollamarunner	ml: Allow models to constrain inputs to a single batch	2025-03-14 15:38:54 -07:00
README.md	Runner for Ollama engine	2025-02-13 17:09:26 -08:00
runner.go	Runner for Ollama engine	2025-02-13 17:09:26 -08:00

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding