Files

nicole pardal 5d347f6d6f server: Consolidate embedding truncation in runner (#12730)

Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possibly truncating them) occurs in
two places: the Ollama server and the runner. This can lead to
inconsistencies in both the checks and the reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.

2025-10-27 11:59:12 -07:00

common

chore: fix some inconsistent function name in comment

2025-08-13 09:50:27 -07:00

llamarunner

server: Consolidate embedding truncation in runner (#12730)

2025-10-27 11:59:12 -07:00

ollamarunner

server: Consolidate embedding truncation in runner (#12730)

2025-10-27 11:59:12 -07:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding