Files

Bruce MacDonald 0c1041ad85 runner: default to greedy sampler for performance (#9407)

As we are adding support for weighted sampling we have seen some performance
regressions, so we are bypassing the sampler logic for now and defaulting to
greedy until we can benchmark the new sampler logic.

2025-02-27 16:41:20 -08:00

common

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

llamarunner

runner: simplify tensor split parsing

2025-02-27 18:36:46 +00:00

ollamarunner

runner: default to greedy sampler for performance (#9407)

2025-02-27 16:41:20 -08:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding