Compare commits

..

13 Commits

Author SHA1 Message Date
Bruce MacDonald
81888abbe4 wip: apply gbnf vocab to logits 2025-03-06 21:44:52 -08:00
Michael Yang
05a01fdecb ml/backend/ggml: consolidate system info logging
- output backend system info when initializing the backend. this ensures
  this information is always present without needing to be called
  explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
2025-03-04 15:14:31 -08:00
aritra saha
8fe6f69f28 docs: add granite-3.2 to the readme 2025-03-04 11:10:56 -08:00
Daniel Hiltgen
1fdb351c37 New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00
Blake Mizerany
7a01ad7614 server/internal/registry: reintroduce pruning on model deletion (#9489)
This reintroduces aggressive pruning on model deletion as a temporary
measure until a more controlled garbage collection (GC) mechanism is
implemented.

Issues with the current approach:

1. Users may accidentally delete a model (`ollama rm llama3.3` instead
   of `ollama rm llama3.2`), requiring a full re-download unless another
   model references the same blobs.

2. Users may assume a deleted model is still referenced elsewhere, but
   due to prior updates or deletions, the references no longer exist,
   leading to unnecessary re-downloads.

Soon, we should implement a structured GC mechanism to retain
unreferenced blobs for a configurable period before removal, which will
run on "ollama rm" and other commands we deem appropriate.

Users that want to immediately remove unreferenced blobs can use a new
prune command that will allow them to specify the age and class of blobs
to remove.

Example usage:

    # Run basic blob GC
    $ ollama prune

    # Remove unreferenced blobs older than 7 days
    $ ollama prune --age 7d

    # Remove all blobs, referenced or not, older than 7 days (and their manifests?)
    $ ollama prune --age 7d --all

    # Remove all unreferenced blobs immediately
    $ ollama prune --age 0

    # Remove all blobs
    $ ollama prune --age 0 --all

This should provide a safer and more predictable cleanup process.
2025-03-03 19:11:16 -08:00
Blake Mizerany
55ab9f371a server/.../backoff,syncs: don't break builds without synctest (#9484)
Previously, developers without the synctest experiment enabled would see
build failures when running tests in some server/internal/internal
packages using the synctest package. This change makes the transition to
use of the package less painful by guarding the use of the synctest
package with build tags.

synctest is enabled in CI. If a new change will break a synctest
package, it will break in CI, even if it does not break locally.

The developer docs have been updated to help with any confusion about
why package tests pass locally but fail in CI.
2025-03-03 16:45:40 -08:00
KindBrave
fefbf8f74b docs: add Ollama Android Chat community integration 2025-03-03 16:38:32 -08:00
Michael Yang
b428ddd796 docker: use go version from go.mod 2025-03-03 13:02:02 -08:00
Michael Yang
ba7d31240e fix: own lib/ollama directory
expand backend loading error handling to catch more problems and log
them instead of panicking
2025-03-03 13:01:18 -08:00
CYJiang
d25efe3954 cmd: add default err return for stop (#9458) 2025-03-03 12:13:41 -08:00
Mark
36dfb906bb docs: don't use self-closing tag for anchor element (#9456) 2025-03-03 11:56:34 -08:00
aritra saha
a6f0f908b9 docs: update phi3-mini to phi4-mini (#9424)
* Update README.md

removed phi 3 mini and added phi4-mini

* Update README.md

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-03-03 11:09:21 -08:00
İbrahim Çetin
3b1ddb2b3a docs: add reins to community integrations (#9411) 2025-03-03 11:06:30 -08:00
34 changed files with 890 additions and 408 deletions

View File

@@ -86,10 +86,11 @@ RUN --mount=type=cache,target=/root/.ccache \
&& cmake --install build --component CUDA --strip --parallel 8
FROM base AS build
ARG GOVERSION=1.23.4
RUN curl -fsSL https://golang.org/dl/go${GOVERSION}.linux-$(case $(uname -m) in x86_64) echo amd64 ;; aarch64) echo arm64 ;; esac).tar.gz | tar xz -C /usr/local
ENV PATH=/usr/local/go/bin:$PATH
WORKDIR /go/src/github.com/ollama/ollama
COPY go.mod go.sum .
RUN curl -fsSL https://golang.org/dl/go$(awk '/^go/ { print $2 }' go.mod).linux-$(case $(uname -m) in x86_64) echo amd64 ;; aarch64) echo arm64 ;; esac).tar.gz | tar xz -C /usr/local
ENV PATH=/usr/local/go/bin:$PATH
RUN go mod download
COPY . .
ARG GOFLAGS="'-ldflags=-w -s'"
ENV CGO_ENABLED=1

View File

@@ -1,5 +1,5 @@
<div align="center">
  <a href="https://ollama.com" />
  <a href="https://ollama.com">
<img alt="ollama" height="200px" src="https://github.com/ollama/ollama/assets/3325447/0d0b44e2-8f4a-4e99-9b52-a5c1c741c8f7">
</a>
</div>
@@ -64,7 +64,7 @@ Here are some example models that can be downloaded:
| Llama 3.1 | 8B | 4.7GB | `ollama run llama3.1` |
| Llama 3.1 | 405B | 231GB | `ollama run llama3.1:405b` |
| Phi 4 | 14B | 9.1GB | `ollama run phi4` |
| Phi 3 Mini | 3.8B | 2.3GB | `ollama run phi3` |
| Phi 4 Mini | 3.8B | 2.5GB | `ollama run phi4-mini` |
| Gemma 2 | 2B | 1.6GB | `ollama run gemma2:2b` |
| Gemma 2 | 9B | 5.5GB | `ollama run gemma2` |
| Gemma 2 | 27B | 16GB | `ollama run gemma2:27b` |
@@ -75,7 +75,7 @@ Here are some example models that can be downloaded:
| Code Llama | 7B | 3.8GB | `ollama run codellama` |
| Llama 2 Uncensored | 7B | 3.8GB | `ollama run llama2-uncensored` |
| LLaVA | 7B | 4.5GB | `ollama run llava` |
| Solar | 10.7B | 6.1GB | `ollama run solar` |
| Granite-3.2 | 8B | 4.9GB | `ollama run granite3.2` |
> [!NOTE]
> You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
@@ -388,6 +388,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [LangBot](https://github.com/RockChinQ/LangBot) (LLM-based instant messaging bots platform, with Agents, RAG features, supports multiple platforms)
- [1Panel](https://github.com/1Panel-dev/1Panel/) (Web-based Linux Server Management Tool)
- [AstrBot](https://github.com/Soulter/AstrBot/) (User-friendly LLM-based multi-platform chatbot with a WebUI, supporting RAG, LLM agents, and plugins integration)
- [Reins](https://github.com/ibrahimcetin/reins) (Easily tweak parameters, customize system prompts per chat, and enhance your AI experiments with reasoning model support.)
### Cloud
@@ -512,6 +513,8 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [Maid](https://github.com/Mobile-Artificial-Intelligence/maid)
- [Ollama App](https://github.com/JHubi1/ollama-app) (Modern and easy-to-use multi-platform client for Ollama)
- [ConfiChat](https://github.com/1runeberg/confichat) (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)
- [Ollama Android Chat](https://github.com/sunshine0523/OllamaServer) (No need for Termux, start the Ollama service with one click on an Android device)
- [Reins](https://github.com/ibrahimcetin/reins) (Easily tweak parameters, customize system prompts per chat, and enhance your AI experiments with reasoning model support.)
### Extensions & Plugins

View File

@@ -34,7 +34,6 @@ import (
"github.com/ollama/ollama/api"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/llama"
"github.com/ollama/ollama/parser"
"github.com/ollama/ollama/progress"
"github.com/ollama/ollama/runner"
@@ -256,6 +255,7 @@ func StopHandler(cmd *cobra.Command, args []string) error {
if strings.Contains(err.Error(), "not found") {
return fmt.Errorf("couldn't find model \"%s\" to stop", args[0])
}
return err
}
return nil
}
@@ -338,10 +338,16 @@ func RunHandler(cmd *cobra.Command, args []string) error {
return err
}
// TODO(jessegross): We should either find another way to know if this is
// a vision model or remove the logic. Also consider that other modalities will
// need different behavior anyways.
opts.MultiModal = len(info.ProjectorInfo) != 0 || envconfig.NewEngine()
if len(info.ProjectorInfo) != 0 {
opts.MultiModal = true
}
for k := range info.ModelInfo {
if strings.Contains(k, ".vision.") {
opts.MultiModal = true
break
}
}
opts.ParentModel = info.Details.ParentModel
if interactive {
@@ -1274,7 +1280,6 @@ func NewCLI() *cobra.Command {
runnerCmd := &cobra.Command{
Use: "runner",
Short: llama.PrintSystemInfo(),
Hidden: true,
RunE: func(cmd *cobra.Command, args []string) error {
return runner.Execute(os.Args[1:])

View File

@@ -118,6 +118,35 @@ To run tests, use `go test`:
go test ./...
```
> NOTE: In rare circumstances, you may need to change a package that uses the new
> "synctest" package in go1.24.
>
> If you do not have the "synctest" package enabled, any build or test failures
> caused by your change(s) will not show up locally, but they will break CI.
>
> If you see failures in CI, you can either keep pushing changes to see if the
> CI build passes, or you can enable the "synctest" package locally to see the
> failures before pushing.
>
> To enable the "synctest" package for testing, run the following command:
>
> ```shell
> GOEXPERIMENT=synctest go test ./...
> ```
>
> If you wish to enable synctest for all go commands, you can set the
> `GOEXPERIMENT` environment variable in your shell profile or by using:
>
> ```shell
> go env -w GOEXPERIMENT=synctest
> ```
>
> This will enable the "synctest" package for all go commands without needing
> to set it in every shell session.
>
> The synctest package is not required for production builds.
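For reference, the files guarded in this change only need the experiment build tag at the top; a minimal sketch of such a guarded test follows (the package name and test body are illustrative, and it assumes the experimental `testing/synctest` API in go1.24):

```go
//go:build goexperiment.synctest

package backoff

import (
	"testing"
	"testing/synctest"
	"time"
)

// Illustrative only: inside synctest.Run the clock is virtualized, so the
// sleep advances fake time rather than blocking for a real second.
func TestFakeClock(t *testing.T) {
	synctest.Run(func() {
		start := time.Now()
		time.Sleep(time.Second)
		if time.Since(start) < time.Second {
			t.Errorf("expected at least one second of virtual time, got %v", time.Since(start))
		}
	})
}
```

Without `GOEXPERIMENT=synctest`, the build tag excludes the file entirely, which is why such failures only surface in CI.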
## Library detection
Ollama looks for acceleration libraries in the following paths relative to the `ollama` executable:

View File

@@ -565,6 +565,43 @@ func (f GGML) GraphSize(context, batch uint64, kvCacheType string) (kv, partialO
return
}
func (llm GGML) VisionGraphSize() (weights, graphSize uint64) {
switch llm.KV().Architecture() {
case "mllama":
for _, layer := range llm.Tensors().GroupLayers()["v"] {
weights += layer.Size()
}
kv := func(n string) uint64 {
if v, ok := llm.KV()["mllama.vision."+n].(uint32); ok {
return uint64(v)
}
return 0
}
imageSize := kv("image_size")
maxNumTiles := kv("max_num_tiles")
embeddingLength := kv("embedding_length")
headCount := kv("attention.head_count")
numPatches := (imageSize / kv("patch_size")) * (imageSize / kv("patch_size"))
if _, ok := llm.Tensors().GroupLayers()["v"]["class_embd"]; ok {
numPatches++
}
numPaddedPatches := numPatches + 8 - (numPatches%8)%8
graphSize = 4 * (8 +
imageSize*imageSize*kv("num_channels")*maxNumTiles +
embeddingLength*numPatches*maxNumTiles +
9*embeddingLength*numPaddedPatches*maxNumTiles +
numPaddedPatches*maxNumTiles*numPaddedPatches*maxNumTiles*headCount)
}
return weights, graphSize
}
// SupportsKVCacheType checks if the requested cache type is supported
func (f GGML) SupportsKVCacheType(cacheType string) bool {
return slices.Contains([]string{"f16", "q8_0", "q4_0"}, cacheType)
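For intuition, here is a small sketch of the padding arithmetic used in `VisionGraphSize` above; the metadata values are illustrative stand-ins, not read from a real model:

```go
package main

import "fmt"

func main() {
	// Illustrative mllama-style metadata values (hypothetical for this example).
	imageSize := uint64(560)
	patchSize := uint64(14)

	numPatches := (imageSize / patchSize) * (imageSize / patchSize) // 40*40 = 1600
	numPatches++                                                    // +1 when a class embedding tensor is present

	// Same expression as above: pad the patch count toward a multiple of 8.
	numPaddedPatches := numPatches + 8 - (numPatches%8)%8
	fmt.Println(numPatches, numPaddedPatches) // 1601 1608
}
```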

135
llama/grammar.go Normal file
View File

@@ -0,0 +1,135 @@
package llama
/*
#cgo CFLAGS: -std=c11
#cgo CXXFLAGS: -std=c++17
#cgo CPPFLAGS: -I${SRCDIR}/../llama/llama.cpp/include
#cgo CPPFLAGS: -I${SRCDIR}/../llama/llama.cpp/common
#cgo CPPFLAGS: -I${SRCDIR}/../llama/llama.cpp/src
#cgo CPPFLAGS: -I${SRCDIR}
#include <stdlib.h>
#include <stdbool.h>
#include "llama.h"
#include "grammar_ext.h"
// Helper function to handle Go string arrays to C
static char** makeCharArray(int size) {
return (char**)malloc(size * sizeof(char*));
}
static void setArrayString(char** a, int i, const char* s) {
a[i] = (char*)s;
}
static void freeCharArray(char** a, int size) {
free(a);
}
*/
import "C"
import (
"errors"
"runtime"
"unsafe"
)
// Grammar represents the interface for grammar-based sampling
type Grammar interface {
Apply(logits []float32) ([]float32, error)
Close() error
}
// CGrammar is a wrapper around the C++ grammar implementation
type CGrammar struct {
grammar *C.struct_llama_grammar
model *C.struct_llama_model
closed bool
}
// NewGrammarWithTokens creates a new grammar using a custom vocabulary defined by tokens
func NewGrammarWithTokens(grammarStr, grammarRoot string, tokens []string) (Grammar, error) {
if grammarStr == "" {
return nil, errors.New("empty grammar string")
}
if len(tokens) == 0 {
return nil, errors.New("empty token list")
}
// Create C array of strings for tokens
cTokens := C.makeCharArray(C.int(len(tokens)))
defer C.freeCharArray(cTokens, C.int(len(tokens)))
// Convert Go strings to C strings and set them in the array
cStrings := make([]*C.char, len(tokens))
for i, token := range tokens {
cStrings[i] = C.CString(token)
C.setArrayString(cTokens, C.int(i), cStrings[i])
}
// Create vocabulary from tokens
cVocab := C.vocab_bridge_from_tokens((**C.char)(unsafe.Pointer(cTokens)), C.int(len(tokens)))
// Free the C strings after creating the vocab
for _, str := range cStrings {
C.free(unsafe.Pointer(str))
}
if cVocab == nil {
return nil, errors.New("failed to create vocabulary from tokens")
}
// Make sure to free the vocabulary when we're done
defer C.vocab_bridge_free(cVocab)
cGrammarStr := C.CString(grammarStr)
defer C.free(unsafe.Pointer(cGrammarStr))
cGrammarRoot := C.CString(grammarRoot)
defer C.free(unsafe.Pointer(cGrammarRoot))
// Create grammar using our C wrapper function with the correct signature
grammar := C.grammar_create_from_string(cVocab, cGrammarStr, cGrammarRoot)
if grammar == nil {
return nil, errors.New("failed to initialize grammar")
}
cg := &CGrammar{
grammar: grammar,
closed: false,
}
// Set up finalizer to free resources when the object is garbage collected
runtime.SetFinalizer(cg, func(g *CGrammar) {
g.Close()
})
return cg, nil
}
// Apply applies grammar constraints to logits
func (g *CGrammar) Apply(logits []float32) ([]float32, error) {
if g.closed || g.grammar == nil {
return nil, errors.New("grammar not initialized or already closed")
}
// Create a copy of logits to modify
result := make([]float32, len(logits))
copy(result, logits)
// Apply grammar constraints using our C wrapper function
C.grammar_apply_to_logits(g.grammar, (*C.float)(&result[0]), C.int(len(result)))
return result, nil
}
// Close releases resources associated with the grammar
func (g *CGrammar) Close() error {
if !g.closed && g.grammar != nil {
C.grammar_free(g.grammar)
g.grammar = nil
g.closed = true
}
return nil
}
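A usage sketch for this wrapper; the token list, grammar string, and logits below are illustrative, and error handling is abbreviated:

```go
package main

import (
	"fmt"

	"github.com/ollama/ollama/llama"
)

func main() {
	// Illustrative vocabulary and GBNF grammar; a real caller would pass the
	// model's token strings and a schema-derived grammar.
	tokens := []string{"<s>", "</s>", "yes", "no"}
	grammar, err := llama.NewGrammarWithTokens(`root ::= "yes" | "no"`, "root", tokens)
	if err != nil {
		panic(err)
	}
	defer grammar.Close()

	// Apply returns a copy of the logits with grammar constraints applied.
	logits := []float32{0.1, 0.2, 2.5, 1.3}
	constrained, err := grammar.Apply(logits)
	if err != nil {
		panic(err)
	}
	fmt.Println(constrained)
}
```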

83
llama/grammar_ext.cpp vendored Normal file
View File

@@ -0,0 +1,83 @@
#include <stdlib.h>
#include <string>
#include <vector>
#include <cstdint>
#include <stdexcept>
#include "llama-sampling.h"
#include "llama-grammar.h"
#include "llama-vocab.h"
#include "grammar_ext.h"
extern "C" {
struct llama_grammar* grammar_create_from_string(const struct llama_vocab* vocab, const char* grammar_str, const char* grammar_root) {
try {
// Initialize grammar sampler directly with the model
struct llama_sampler* sampler = llama_sampler_init_grammar(vocab, grammar_str, grammar_root);
if (!sampler) {
return nullptr;
}
// Cast the sampler to a grammar and return it
return (struct llama_grammar*)sampler;
} catch (const std::exception &err) {
return nullptr;
}
}
void grammar_apply_to_logits(struct llama_grammar* grammar, float* logits, int n_logits) {
if (!grammar || !logits || n_logits <= 0) {
return;
}
// Create token data array for the grammar application
llama_token_data* token_data = (llama_token_data*)malloc(n_logits * sizeof(llama_token_data));
if (!token_data) {
return;
}
// Initialize token data from logits
for (int i = 0; i < n_logits; i++) {
token_data[i].id = i;
token_data[i].logit = logits[i];
token_data[i].p = 0.0f;
}
// Create token data array structure
llama_token_data_array arr = {
.data = token_data,
.size = (size_t)n_logits,
.sorted = false,
.selected = -1
};
// Apply grammar constraints to the token data array
llama_grammar_apply_impl(*grammar, &arr);
// Copy back the modified logits
for (int i = 0; i < n_logits; i++) {
logits[i] = token_data[i].logit;
}
free(token_data);
}
void grammar_free(struct llama_grammar* grammar) {
if (grammar) {
// Free the grammar as a sampler
llama_sampler_free((struct llama_sampler*)grammar);
}
}
struct llama_vocab* vocab_bridge_from_tokens(const char** tokens, int n_tokens) {
// Call the C++ function from llama-vocab.cpp
return llama_vocab_from_tokens(tokens, n_tokens);
}
void vocab_bridge_free(struct llama_vocab* vocab) {
// Call the C++ function from llama-vocab.cpp
llama_vocab_free(vocab);
}
} // extern "C"

33
llama/grammar_ext.h vendored Normal file
View File

@@ -0,0 +1,33 @@
#ifndef GRAMMAR_EXT_H
#define GRAMMAR_EXT_H
#include "llama.h"
#ifdef __cplusplus
extern "C" {
#endif
// Forward declarations
struct llama_grammar;
struct llama_vocab;
// Create a new grammar from a string (returns a grammar implemented as a sampler)
struct llama_grammar* grammar_create_from_string(const struct llama_vocab* vocab, const char* grammar_str, const char* grammar_root);
// Apply grammar constraints to logits
void grammar_apply_to_logits(struct llama_grammar* grammar, float* logits, int n_logits);
// Free grammar resources (frees the underlying sampler)
void grammar_free(struct llama_grammar* grammar);
// C wrapper for llama_vocab_from_tokens
struct llama_vocab* vocab_bridge_from_tokens(const char** tokens, int n_tokens);
// C wrapper for llama_vocab_free
void vocab_bridge_free(struct llama_vocab* vocab);
#ifdef __cplusplus
}
#endif
#endif // GRAMMAR_EXT_H

View File

@@ -18,21 +18,10 @@ package llama
#include "mllama.h"
#include "sampling_ext.h"
#include "grammar_ext.h"
extern bool llamaProgressCallback(float progress, void *user_data);
extern void llamaLog(int level, char* text, void* user_data);
typedef enum {COMP_UNKNOWN,COMP_GCC,COMP_CLANG} COMPILER;
COMPILER inline get_compiler() {
#if defined(__clang__)
return COMP_CLANG;
#elif defined(__GNUC__)
return COMP_GCC;
#else
return UNKNOWN_COMPILER;
#endif
}
*/
import "C"
@@ -72,19 +61,6 @@ func BackendInit() {
C.llama_backend_init()
}
func PrintSystemInfo() string {
var compiler string
switch C.get_compiler() {
case C.COMP_UNKNOWN:
compiler = "cgo(unknown_compiler)"
case C.COMP_GCC:
compiler = "cgo(gcc)"
case C.COMP_CLANG:
compiler = "cgo(clang)"
}
return C.GoString(C.llama_print_system_info()) + compiler
}
func GetModelArch(modelPath string) (string, error) {
mp := C.CString(modelPath)
defer C.free(unsafe.Pointer(mp))

View File

@@ -1,69 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Michael Yang <mxyng@pm.me>
Date: Tue, 11 Feb 2025 14:06:36 -0800
Subject: [PATCH] try/catch backend load
---
ggml/src/ggml-backend-reg.cpp | 45 ++++++++++++++++++-----------------
1 file changed, 23 insertions(+), 22 deletions(-)
diff --git a/ggml/src/ggml-backend-reg.cpp b/ggml/src/ggml-backend-reg.cpp
index 98d5e14d..1c19129a 100644
--- a/ggml/src/ggml-backend-reg.cpp
+++ b/ggml/src/ggml-backend-reg.cpp
@@ -512,32 +512,33 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
}
fs::directory_iterator dir_it(search_path, fs::directory_options::skip_permission_denied);
for (const auto & entry : dir_it) {
- if (entry.is_regular_file()) {
- std::wstring filename = entry.path().filename().wstring();
- std::wstring ext = entry.path().extension().wstring();
- if (filename.find(file_prefix) == 0 && ext == backend_filename_suffix()) {
- dl_handle_ptr handle { dl_load_library(entry.path().wstring()) };
- if (!handle && !silent) {
- GGML_LOG_ERROR("%s: failed to load %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
- }
- if (handle) {
+ try {
+ if (entry.is_regular_file()) {
+ std::wstring filename = entry.path().filename().wstring();
+ std::wstring ext = entry.path().extension().wstring();
+ if (filename.find(file_prefix) == 0 && ext == backend_filename_suffix()) {
+ dl_handle_ptr handle { dl_load_library(entry.path().wstring()) };
+ if (!handle) {
+ GGML_LOG_ERROR("%s: failed to load %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
+ continue;
+ }
+
auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
- if (score_fn) {
- int s = score_fn();
-#ifndef NDEBUG
- GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str(), s);
-#endif
- if (s > best_score) {
- best_score = s;
- best_path = entry.path().wstring();
- }
- } else {
- if (!silent) {
- GGML_LOG_INFO("%s: failed to find ggml_backend_score in %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
- }
+ if (!score_fn) {
+ GGML_LOG_DEBUG("%s: failed to find ggml_backend_score in %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
+ continue;
+ }
+
+ int s = score_fn();
+ GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str(), s);
+ if (s > best_score) {
+ best_score = s;
+ best_path = entry.path().wstring();
}
}
}
+ } catch (const std::exception & e) {
+ GGML_LOG_ERROR("%s: failed to load %s: %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str(), e.what());
}
}
}

View File

@@ -4,11 +4,11 @@ Date: Sun, 16 Feb 2025 20:00:22 -0500
Subject: [PATCH] use std::filesystem::path instead of wstring
---
ggml/src/ggml-backend-reg.cpp | 144 ++++++++++++++--------------------
1 file changed, 58 insertions(+), 86 deletions(-)
ggml/src/ggml-backend-reg.cpp | 199 +++++++++++++++-------------------
1 file changed, 88 insertions(+), 111 deletions(-)
diff --git a/ggml/src/ggml-backend-reg.cpp b/ggml/src/ggml-backend-reg.cpp
index 1c19129a..c854e6bb 100644
index 98d5e14d..799af5f3 100644
--- a/ggml/src/ggml-backend-reg.cpp
+++ b/ggml/src/ggml-backend-reg.cpp
@@ -66,26 +66,6 @@
@@ -264,47 +264,55 @@ index 1c19129a..c854e6bb 100644
for (const auto & search_path : search_paths) {
if (!fs::exists(search_path)) {
continue;
@@ -514,31 +486,31 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
@@ -513,29 +485,26 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
fs::directory_iterator dir_it(search_path, fs::directory_options::skip_permission_denied);
for (const auto & entry : dir_it) {
try {
if (entry.is_regular_file()) {
- std::wstring filename = entry.path().filename().wstring();
- std::wstring ext = entry.path().extension().wstring();
+ std::string filename = entry.path().filename().string();
+ std::string ext = entry.path().extension().string();
if (filename.find(file_prefix) == 0 && ext == backend_filename_suffix()) {
- dl_handle_ptr handle { dl_load_library(entry.path().wstring()) };
+ dl_handle_ptr handle { dl_load_library(entry.path()) };
if (!handle) {
- GGML_LOG_ERROR("%s: failed to load %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
+ GGML_LOG_ERROR("%s: failed to load %s\n", __func__, path_to_string(entry.path()).c_str());
continue;
}
auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
if (!score_fn) {
- GGML_LOG_DEBUG("%s: failed to find ggml_backend_score in %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
+ GGML_LOG_DEBUG("%s: failed to find ggml_backend_score in %s\n", __func__, path_to_string(entry.path()).c_str());
continue;
}
int s = score_fn();
- GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str(), s);
+ GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, path_to_string(entry.path()).c_str(), s);
if (s > best_score) {
best_score = s;
- best_path = entry.path().wstring();
+ best_path = entry.path();
}
if (entry.is_regular_file()) {
- std::wstring filename = entry.path().filename().wstring();
- std::wstring ext = entry.path().extension().wstring();
+ std::string filename = entry.path().filename().string();
+ std::string ext = entry.path().extension().string();
if (filename.find(file_prefix) == 0 && ext == backend_filename_suffix()) {
- dl_handle_ptr handle { dl_load_library(entry.path().wstring()) };
- if (!handle && !silent) {
- GGML_LOG_ERROR("%s: failed to load %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
+ dl_handle_ptr handle { dl_load_library(entry.path()) };
+ if (!handle) {
+ GGML_LOG_ERROR("%s: failed to load %s\n", __func__, path_to_string(entry.path()).c_str());
+ continue;
}
- if (handle) {
- auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
- if (score_fn) {
- int s = score_fn();
-#ifndef NDEBUG
- GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str(), s);
-#endif
- if (s > best_score) {
- best_score = s;
- best_path = entry.path().wstring();
- }
- } else {
- if (!silent) {
- GGML_LOG_INFO("%s: failed to find ggml_backend_score in %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str());
- }
- }
+
+ auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
+ if (!score_fn) {
+ GGML_LOG_DEBUG("%s: failed to find ggml_backend_score in %s\n", __func__, path_to_string(entry.path()).c_str());
+ continue;
+ }
+
+ int s = score_fn();
+ GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, path_to_string(entry.path()).c_str(), s);
+ if (s > best_score) {
+ best_score = s;
+ best_path = entry.path();
}
}
} catch (const std::exception & e) {
- GGML_LOG_ERROR("%s: failed to load %s: %s\n", __func__, utf16_to_utf8(entry.path().wstring()).c_str(), e.what());
+ GGML_LOG_ERROR("%s: failed to load %s: %s\n", __func__, path_to_string(entry.path()).c_str(), e.what());
}
}
}
@@ -546,7 +518,7 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
@@ -545,7 +514,7 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
if (best_score == 0) {
// try to load the base backend
for (const auto & search_path : search_paths) {
@@ -313,3 +321,49 @@ index 1c19129a..c854e6bb 100644
if (fs::exists(path)) {
return get_reg().load_backend(path, silent);
}
@@ -560,6 +529,14 @@ void ggml_backend_load_all() {
ggml_backend_load_all_from_path(nullptr);
}
+static void ggml_backend_try_load_best(const char * name, bool silent, const char * user_search_path) {
+ try {
+ ggml_backend_load_best(name, silent, user_search_path);
+ } catch (const std::exception & e) {
+ GGML_LOG_DEBUG("%s: failed to load %s: %s\n", __func__, name, e.what());
+ }
+}
+
void ggml_backend_load_all_from_path(const char * dir_path) {
#ifdef NDEBUG
bool silent = true;
@@ -567,18 +544,18 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
bool silent = false;
#endif
- ggml_backend_load_best("blas", silent, dir_path);
- ggml_backend_load_best("cann", silent, dir_path);
- ggml_backend_load_best("cuda", silent, dir_path);
- ggml_backend_load_best("hip", silent, dir_path);
- ggml_backend_load_best("kompute", silent, dir_path);
- ggml_backend_load_best("metal", silent, dir_path);
- ggml_backend_load_best("rpc", silent, dir_path);
- ggml_backend_load_best("sycl", silent, dir_path);
- ggml_backend_load_best("vulkan", silent, dir_path);
- ggml_backend_load_best("opencl", silent, dir_path);
- ggml_backend_load_best("musa", silent, dir_path);
- ggml_backend_load_best("cpu", silent, dir_path);
+ ggml_backend_try_load_best("blas", silent, dir_path);
+ ggml_backend_try_load_best("cann", silent, dir_path);
+ ggml_backend_try_load_best("cuda", silent, dir_path);
+ ggml_backend_try_load_best("hip", silent, dir_path);
+ ggml_backend_try_load_best("kompute", silent, dir_path);
+ ggml_backend_try_load_best("metal", silent, dir_path);
+ ggml_backend_try_load_best("rpc", silent, dir_path);
+ ggml_backend_try_load_best("sycl", silent, dir_path);
+ ggml_backend_try_load_best("vulkan", silent, dir_path);
+ ggml_backend_try_load_best("opencl", silent, dir_path);
+ ggml_backend_try_load_best("musa", silent, dir_path);
+ ggml_backend_try_load_best("cpu", silent, dir_path);
// check the environment variable GGML_BACKEND_PATH to load an out-of-tree backend
const char * backend_path = std::getenv("GGML_BACKEND_PATH");
if (backend_path) {

View File

@@ -0,0 +1,117 @@
From 668a974433edccf2c5fcc2192c39aed601e575f2 Mon Sep 17 00:00:00 2001
From: Bruce MacDonald <brucewmacdonald@gmail.com>
Date: Thu, 6 Mar 2025 21:07:06 -0800
Subject: [PATCH] expose llama_vocab from tokens
---
llama/llama.cpp/src/llama-vocab.cpp | 73 +++++++++++++++++++++++++++++
llama/llama.cpp/src/llama-vocab.h | 11 ++++-
2 files changed, 83 insertions(+), 1 deletion(-)
diff --git a/llama/llama.cpp/src/llama-vocab.cpp b/llama/llama.cpp/src/llama-vocab.cpp
index c7ff28be..ad6e7ad8 100644
--- a/llama/llama.cpp/src/llama-vocab.cpp
+++ b/llama/llama.cpp/src/llama-vocab.cpp
@@ -3253,3 +3253,76 @@ int32_t llama_detokenize(
return vocab->detokenize(tokens, n_tokens, text, text_len_max, remove_special, unparse_special);
}
+struct llama_vocab *llama_vocab_from_tokens(const char **tokens, int n_tokens)
+{
+ if (!tokens || n_tokens <= 0)
+ {
+ return nullptr;
+ }
+
+ try
+ {
+ // Create a new vocabulary instance
+ llama_vocab *vocab = new llama_vocab();
+ vocab->pimpl = std::make_unique<llama_vocab::impl>(*vocab);
+
+ // Resize the token data vectors
+ vocab->pimpl->id_to_token.resize(n_tokens);
+
+ // Create mappings for all tokens
+ for (int i = 0; i < n_tokens; i++)
+ {
+ std::string word = tokens[i];
+ if (word.empty())
+ {
+ word = "[EMPTY_" + std::to_string(i) + "]";
+ }
+
+ // Add to token mappings
+ vocab->pimpl->token_to_id[word] = i;
+
+ // Set up token data
+ auto &token_data = vocab->pimpl->id_to_token[i];
+ token_data.text = std::move(word);
+ token_data.score = 0.0f; // Default score
+ token_data.attr = LLAMA_TOKEN_ATTR_NORMAL;
+
+ // Detect special tokens
+ if (word == "<s>" || word == "<bos>")
+ {
+ vocab->pimpl->special_bos_id = i;
+ }
+ else if (word == "</s>" || word == "<eos>" || word == "<|endoftext|>")
+ {
+ vocab->pimpl->special_eos_id = i;
+ vocab->pimpl->special_eog_ids.insert(i);
+ }
+ else if (word == "<unk>")
+ {
+ vocab->pimpl->special_unk_id = i;
+ }
+ }
+
+ // Initialize the token-to-piece cache
+ vocab->pimpl->cache_token_to_piece.resize(n_tokens);
+ for (int i = 0; i < n_tokens; i++)
+ {
+ vocab->pimpl->cache_token_to_piece[i] = vocab->pimpl->id_to_token[i].text;
+ }
+
+ return vocab;
+ }
+ catch (const std::exception &err)
+ {
+ return nullptr;
+ }
+}
+
+// Helper function to free the vocab
+void llama_vocab_free(struct llama_vocab *vocab)
+{
+ if (vocab)
+ {
+ delete vocab;
+ }
+}
\ No newline at end of file
diff --git a/llama/llama.cpp/src/llama-vocab.h b/llama/llama.cpp/src/llama-vocab.h
index 5ce35521..eceb28f3 100644
--- a/llama/llama.cpp/src/llama-vocab.h
+++ b/llama/llama.cpp/src/llama-vocab.h
@@ -119,7 +119,16 @@ struct llama_vocab {
void print_info() const;
-private:
struct impl;
std::unique_ptr<impl> pimpl;
};
+
+// Create a vocabulary from an array of token strings
+// tokens: Array of token strings
+// n_tokens: Number of tokens in the array
+// Returns: A new llama_vocab instance, or nullptr on failure
+// The caller is responsible for freeing the vocabulary using llama_vocab_free
+LLAMA_API struct llama_vocab * llama_vocab_from_tokens(const char ** tokens, int n_tokens);
+
+// Free a vocabulary created with llama_vocab_from_tokens
+LLAMA_API void llama_vocab_free(struct llama_vocab * vocab);
--
2.39.3 (Apple Git-145)

View File

@@ -115,6 +115,9 @@ func EstimateGPULayers(gpus []discover.GpuInfo, f *ggml.GGML, projectors []strin
// multimodal models require at least 2048 context
opts.NumCtx = max(opts.NumCtx, 2048)
}
if projectorWeights == 0 && projectorGraph == 0 {
projectorWeights, projectorGraph = f.VisionGraphSize()
}
layers := f.Tensors().GroupLayers()
// add one layer worth of memory as a buffer

View File

@@ -30,6 +30,7 @@ import (
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/fs/ggml"
"github.com/ollama/ollama/llama"
"github.com/ollama/ollama/model"
)
type LlamaServer interface {
@@ -54,8 +55,15 @@ type llmServer struct {
options api.Options
numParallel int
modelPath string
modelLock sync.Mutex // Temporary until we switch fully to Go server
model *llama.Model // If non-nil, the runner is a new Go server
// llamaModel is an instance of the cgo llama.cpp model definition
// nil if this server is running the new engine
llamaModel *llama.Model
llamaModelLock sync.Mutex
// textProcessor handles text encoding/decoding for the model in the Ollama engine
// nil if this server is running the llama.cpp based engine
textProcessor model.TextProcessor
estimate MemoryEstimate
totalLayers uint64
@@ -89,7 +97,7 @@ func LoadModel(model string, maxArraySize int) (*ggml.GGML, error) {
// NewLlamaServer will run a server for the given GPUs
// The gpu list must be a single family.
func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapters, projectors []string, opts api.Options, numParallel int) (LlamaServer, error) {
func NewLlamaServer(gpus discover.GpuInfoList, modelPath string, f *ggml.GGML, adapters, projectors []string, opts api.Options, numParallel int) (LlamaServer, error) {
systemInfo := discover.GetSystemInfo()
systemTotalMemory := systemInfo.System.TotalMemory
systemFreeMemory := systemInfo.System.FreeMemory
@@ -130,7 +138,7 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
slog.Info("offload", "", estimate)
params := []string{
"--model", model,
"--model", modelPath,
"--ctx-size", strconv.Itoa(opts.NumCtx),
"--batch-size", strconv.Itoa(opts.NumBatch),
}
@@ -153,11 +161,6 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
}
}
if len(projectors) > 0 {
// TODO: applying multiple projectors is not supported by the llama.cpp server yet
params = append(params, "--mmproj", projectors[0])
}
defaultThreads := systemInfo.GetOptimalThreadCount()
if opts.NumThread > 0 {
params = append(params, "--threads", strconv.Itoa(opts.NumThread))
@@ -257,6 +260,34 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
}
}
slog.Debug("compatible gpu libraries", "compatible", compatible)
exe, err := os.Executable()
if err != nil {
return nil, fmt.Errorf("unable to lookup executable path: %w", err)
}
if eval, err := filepath.EvalSymlinks(exe); err == nil {
exe = eval
}
var llamaModel *llama.Model
var textProcessor model.TextProcessor
if envconfig.NewEngine() {
textProcessor, err = model.NewTextProcessor(modelPath)
if err != nil {
// To prepare for opt-out mode, instead of treating this as an error, we fallback to the old runner
slog.Debug("model not yet supported by Ollama engine, switching to compatibility mode", "model", modelPath, "error", err)
}
}
if textProcessor == nil {
llamaModel, err = llama.LoadModelFromFile(modelPath, llama.ModelParams{VocabOnly: true})
if err != nil {
return nil, err
}
}
if len(projectors) > 0 && llamaModel != nil {
params = append(params, "--mmproj", projectors[0])
}
// iterate through compatible GPU libraries such as 'cuda_v12', 'cuda_v11', 'rocm', etc.
// adding each library's respective path to the LD_LIBRARY_PATH, until finally running
@@ -275,7 +306,9 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
port = rand.Intn(65535-49152) + 49152 // get a random port in the ephemeral range
}
finalParams := []string{"runner"}
if envconfig.NewEngine() {
if textProcessor != nil {
// New engine
// TODO - if we have failure to load scenarios, add logic to retry with the old runner
finalParams = append(finalParams, "--ollama-engine")
}
finalParams = append(finalParams, params...)
@@ -315,28 +348,20 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
// finally, add the root library path
libraryPaths = append(libraryPaths, discover.LibOllamaPath)
exe, err := os.Executable()
if err != nil {
return nil, fmt.Errorf("unable to lookup executable path: %w", err)
}
if eval, err := filepath.EvalSymlinks(exe); err == nil {
exe = eval
}
// TODO - once fully switched to the Go runner, load the model here for tokenize/detokenize cgo access
s := &llmServer{
port: port,
cmd: exec.Command(exe, finalParams...),
status: NewStatusWriter(os.Stderr),
options: opts,
modelPath: model,
estimate: estimate,
numParallel: numParallel,
sem: semaphore.NewWeighted(int64(numParallel)),
totalLayers: f.KV().BlockCount() + 1,
gpus: gpus,
done: make(chan error, 1),
port: port,
cmd: exec.Command(exe, finalParams...),
status: NewStatusWriter(os.Stderr),
options: opts,
modelPath: modelPath,
llamaModel: llamaModel,
textProcessor: textProcessor,
estimate: estimate,
numParallel: numParallel,
sem: semaphore.NewWeighted(int64(numParallel)),
totalLayers: f.KV().BlockCount() + 1,
gpus: gpus,
done: make(chan error, 1),
}
s.cmd.Env = os.Environ()
@@ -405,6 +430,9 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
}
err := fmt.Errorf("error starting runner: %v %s", err, msg)
if len(compatible) == 0 {
if llamaModel != nil {
llama.FreeModel(llamaModel)
}
return nil, err
}
@@ -701,24 +729,29 @@ func (s *llmServer) Completion(ctx context.Context, req CompletionRequest, fn fu
}
if len(req.Format) > 0 {
switch string(req.Format) {
case `null`, `""`:
// Field was set, but "missing" a value. We accept
// these as "not set".
break
case `"json"`:
request["grammar"] = grammarJSON
default:
if req.Format[0] != '{' {
return fmt.Errorf("invalid format: %q; expected \"json\" or a valid JSON Schema object", req.Format)
}
format := string(req.Format)
if format != `null` && format != `""` {
if s.textProcessor != nil {
// New engine handles this on the backend
request["format"] = req.Format
} else {
// old engine
switch format {
case `"json"`:
request["grammar"] = grammarJSON
default:
if req.Format[0] != '{' {
return fmt.Errorf("invalid format: %q; expected \"json\" or a valid JSON Schema object", req.Format)
}
// User provided a JSON schema
g := llama.SchemaToGrammar(req.Format)
if g == nil {
return fmt.Errorf("invalid JSON schema in format")
// User provided a JSON schema
g := llama.SchemaToGrammar(req.Format)
if g == nil {
return fmt.Errorf("invalid JSON schema in format")
}
request["grammar"] = string(g)
}
}
request["grammar"] = string(g)
}
}
@@ -933,64 +966,25 @@ type TokenizeResponse struct {
}
func (s *llmServer) Tokenize(ctx context.Context, content string) ([]int, error) {
s.modelLock.Lock()
defer s.modelLock.Unlock()
if s.model != nil {
return s.model.Tokenize(content, false, true)
}
s.llamaModelLock.Lock()
defer s.llamaModelLock.Unlock()
// Make sure the server is ready
status, err := s.getServerStatus(ctx)
if err != nil {
return nil, err
} else if status != ServerStatusReady && status != ServerStatusNoSlotsAvailable {
return nil, fmt.Errorf("unexpected server status: %s", status.ToString())
if s.llamaModel != nil {
return s.llamaModel.Tokenize(content, false, true)
}
data, err := json.Marshal(TokenizeRequest{Content: content})
if err != nil {
return nil, fmt.Errorf("marshaling encode data: %w", err)
}
req, err := http.NewRequestWithContext(ctx, http.MethodPost, fmt.Sprintf("http://127.0.0.1:%d/tokenize", s.port), bytes.NewBuffer(data))
if err != nil {
return nil, fmt.Errorf("encode request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
return nil, fmt.Errorf("do encode request: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode == http.StatusNotFound {
if s.model == nil {
slog.Debug("new runner detected, loading model for cgo tokenization")
m, err := llama.LoadModelFromFile(s.modelPath, llama.ModelParams{VocabOnly: true})
if err != nil {
return nil, err
}
s.model = m
if s.textProcessor != nil {
tokens, err := s.textProcessor.Encode(content)
if err != nil {
return nil, err
}
return s.model.Tokenize(content, false, true)
toks := make([]int, len(tokens))
for i, t := range tokens {
toks[i] = int(t)
}
return toks, nil
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("read encode request: %w", err)
}
if resp.StatusCode >= 400 {
log.Printf("llm encode error: %s", body)
return nil, fmt.Errorf("%s", body)
}
var encoded TokenizeResponse
if err := json.Unmarshal(body, &encoded); err != nil {
return nil, fmt.Errorf("unmarshal encode response: %w", err)
}
return encoded.Tokens, nil
// not reached
return nil, fmt.Errorf("no tokenizer configured")
}
type DetokenizeRequest struct {
@@ -1002,80 +996,38 @@ type DetokenizeResponse struct {
}
func (s *llmServer) Detokenize(ctx context.Context, tokens []int) (string, error) {
s.modelLock.Lock()
defer s.modelLock.Unlock()
if s.model != nil {
s.llamaModelLock.Lock()
defer s.llamaModelLock.Unlock()
if s.llamaModel != nil {
var resp string
for _, token := range tokens {
resp += s.model.TokenToPiece(token)
resp += s.llamaModel.TokenToPiece(token)
}
return resp, nil
}
// Make sure the server is ready
status, err := s.getServerStatus(ctx)
if err != nil {
return "", err
} else if status != ServerStatusReady && status != ServerStatusNoSlotsAvailable {
return "", fmt.Errorf("unexpected server status: %s", status.ToString())
}
data, err := json.Marshal(DetokenizeRequest{Tokens: tokens})
if err != nil {
return "", fmt.Errorf("marshaling decode data: %w", err)
}
req, err := http.NewRequestWithContext(ctx, http.MethodPost, fmt.Sprintf("http://127.0.0.1:%d/detokenize", s.port), bytes.NewBuffer(data))
if err != nil {
return "", fmt.Errorf("decode request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
return "", fmt.Errorf("do decode request: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode == http.StatusNotFound {
if s.model == nil {
slog.Debug("new runner detected, loading model for cgo tokenization")
m, err := llama.LoadModelFromFile(s.modelPath, llama.ModelParams{VocabOnly: true})
if err != nil {
return "", err
}
s.model = m
if s.textProcessor != nil {
toks := make([]int32, len(tokens))
for i, t := range tokens {
toks[i] = int32(t)
}
var resp string
for _, token := range tokens {
resp += s.model.TokenToPiece(token)
content, err := s.textProcessor.Decode(toks)
if err != nil {
return "", err
}
return resp, nil
return content, nil
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("read decode request: %w", err)
}
if resp.StatusCode >= 400 {
log.Printf("llm decode error: %s", body)
return "", fmt.Errorf("%s", body)
}
var decoded DetokenizeResponse
if err := json.Unmarshal(body, &decoded); err != nil {
return "", fmt.Errorf("unmarshal encode response: %w", err)
}
return decoded.Content, nil
// not reached
return "", fmt.Errorf("no tokenizer configured")
}
func (s *llmServer) Close() error {
s.modelLock.Lock()
if s.model != nil {
llama.FreeModel(s.model)
s.model = nil
s.llamaModelLock.Lock()
if s.llamaModel != nil {
llama.FreeModel(s.llamaModel)
s.llamaModel = nil
}
s.modelLock.Unlock()
s.llamaModelLock.Unlock()
if s.cmd != nil {
slog.Debug("stopping llama server")

View File

@@ -24,7 +24,6 @@ type Backend interface {
Config() Config
Get(name string) Tensor
NewContext() Context
SystemInfo() string
}
// BackendCacheConfig should be implemented by backends that need special output

View File

@@ -1,27 +1,11 @@
package ggml
/*
#cgo CPPFLAGS: -I${SRCDIR}/ggml/include
#include <stdlib.h>
#include <stdint.h>
#include "ggml.h"
#include "ggml-cpu.h"
#include "ggml-backend.h"
static struct ggml_backend_feature * getBackendFeatures(void *fp, ggml_backend_reg_t reg) {return ((ggml_backend_get_features_t)(fp))(reg);}
static struct ggml_backend_feature * getNextBackendFeatures(struct ggml_backend_feature * feature) { return &feature[1];}
typedef enum {COMP_UNKNOWN,COMP_GCC,COMP_CLANG} COMPILER;
COMPILER inline get_compiler() {
#if defined(__clang__)
return COMP_CLANG;
#elif defined(__GNUC__)
return COMP_GCC;
#else
return UNKNOWN_COMPILER;
#endif
}
*/
// #cgo CPPFLAGS: -I${SRCDIR}/ggml/include
// #include <stdlib.h>
// #include <stdint.h>
// #include "ggml.h"
// #include "ggml-cpu.h"
// #include "ggml-backend.h"
import "C"
import (
@@ -729,34 +713,3 @@ func (t *Tensor) ScaledDotProductAttention(ctx ml.Context, key, value, mask ml.T
return kqv.Permute(ctx, 0, 2, 1, 3).Contiguous(ctx)
}
}
func (b *Backend) SystemInfo() string {
var compiler string
switch C.get_compiler() {
case C.COMP_UNKNOWN:
compiler = "cgo(unknown_compiler)"
case C.COMP_GCC:
compiler = "cgo(gcc)"
case C.COMP_CLANG:
compiler = "cgo(clang)"
}
var s string
for i := range C.ggml_backend_reg_count() {
reg := C.ggml_backend_reg_get(i)
fName := C.CString("ggml_backend_get_features")
defer C.free(unsafe.Pointer(fName))
get_features_fn := C.ggml_backend_reg_get_proc_address(reg, fName)
if get_features_fn != nil {
s += C.GoString(C.ggml_backend_reg_name(reg))
s += " : "
for features := C.getBackendFeatures(get_features_fn, reg); features.name != nil; features = C.getNextBackendFeatures(features) {
s += C.GoString(features.name)
s += " = "
s += C.GoString(features.value)
s += " | "
}
}
}
return s + compiler
}

View File

@@ -484,33 +484,29 @@ static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent,
}
fs::directory_iterator dir_it(search_path, fs::directory_options::skip_permission_denied);
for (const auto & entry : dir_it) {
try {
if (entry.is_regular_file()) {
std::string filename = entry.path().filename().string();
std::string ext = entry.path().extension().string();
if (filename.find(file_prefix) == 0 && ext == backend_filename_suffix()) {
dl_handle_ptr handle { dl_load_library(entry.path()) };
if (!handle) {
GGML_LOG_ERROR("%s: failed to load %s\n", __func__, path_to_string(entry.path()).c_str());
continue;
}
if (entry.is_regular_file()) {
std::string filename = entry.path().filename().string();
std::string ext = entry.path().extension().string();
if (filename.find(file_prefix) == 0 && ext == backend_filename_suffix()) {
dl_handle_ptr handle { dl_load_library(entry.path()) };
if (!handle) {
GGML_LOG_ERROR("%s: failed to load %s\n", __func__, path_to_string(entry.path()).c_str());
continue;
}
auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
if (!score_fn) {
GGML_LOG_DEBUG("%s: failed to find ggml_backend_score in %s\n", __func__, path_to_string(entry.path()).c_str());
continue;
}
auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
if (!score_fn) {
GGML_LOG_DEBUG("%s: failed to find ggml_backend_score in %s\n", __func__, path_to_string(entry.path()).c_str());
continue;
}
int s = score_fn();
GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, path_to_string(entry.path()).c_str(), s);
if (s > best_score) {
best_score = s;
best_path = entry.path();
}
int s = score_fn();
GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, path_to_string(entry.path()).c_str(), s);
if (s > best_score) {
best_score = s;
best_path = entry.path();
}
}
} catch (const std::exception & e) {
GGML_LOG_ERROR("%s: failed to load %s: %s\n", __func__, path_to_string(entry.path()).c_str(), e.what());
}
}
}
@@ -533,6 +529,14 @@ void ggml_backend_load_all() {
ggml_backend_load_all_from_path(nullptr);
}
static void ggml_backend_try_load_best(const char * name, bool silent, const char * user_search_path) {
try {
ggml_backend_load_best(name, silent, user_search_path);
} catch (const std::exception & e) {
GGML_LOG_DEBUG("%s: failed to load %s: %s\n", __func__, name, e.what());
}
}
void ggml_backend_load_all_from_path(const char * dir_path) {
#ifdef NDEBUG
bool silent = true;
@@ -540,18 +544,18 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
bool silent = false;
#endif
ggml_backend_load_best("blas", silent, dir_path);
ggml_backend_load_best("cann", silent, dir_path);
ggml_backend_load_best("cuda", silent, dir_path);
ggml_backend_load_best("hip", silent, dir_path);
ggml_backend_load_best("kompute", silent, dir_path);
ggml_backend_load_best("metal", silent, dir_path);
ggml_backend_load_best("rpc", silent, dir_path);
ggml_backend_load_best("sycl", silent, dir_path);
ggml_backend_load_best("vulkan", silent, dir_path);
ggml_backend_load_best("opencl", silent, dir_path);
ggml_backend_load_best("musa", silent, dir_path);
ggml_backend_load_best("cpu", silent, dir_path);
ggml_backend_try_load_best("blas", silent, dir_path);
ggml_backend_try_load_best("cann", silent, dir_path);
ggml_backend_try_load_best("cuda", silent, dir_path);
ggml_backend_try_load_best("hip", silent, dir_path);
ggml_backend_try_load_best("kompute", silent, dir_path);
ggml_backend_try_load_best("metal", silent, dir_path);
ggml_backend_try_load_best("rpc", silent, dir_path);
ggml_backend_try_load_best("sycl", silent, dir_path);
ggml_backend_try_load_best("vulkan", silent, dir_path);
ggml_backend_try_load_best("opencl", silent, dir_path);
ggml_backend_try_load_best("musa", silent, dir_path);
ggml_backend_try_load_best("cpu", silent, dir_path);
// check the environment variable GGML_BACKEND_PATH to load an out-of-tree backend
const char * backend_path = std::getenv("GGML_BACKEND_PATH");
if (backend_path) {

View File

@@ -7,6 +7,20 @@ package ggml
// #include <stdlib.h>
// #include "ggml-backend.h"
// extern void sink(int level, char *text, void *user_data);
// static struct ggml_backend_feature * first_feature(ggml_backend_get_features_t fp, ggml_backend_reg_t reg) { return fp(reg); }
// static struct ggml_backend_feature * next_feature(struct ggml_backend_feature * feature) { return &feature[1]; }
/*
typedef enum { COMPILER_CLANG, COMPILER_GNUC, COMPILER_UNKNOWN } COMPILER;
static COMPILER compiler_name(void) {
#if defined(__clang__)
return COMPILER_CLANG;
#elif defined(__GNUC__)
return COMPILER_GNUC;
#else
return COMPILER_UNKNOWN;
#endif
}
*/
import "C"
import (
@@ -16,6 +30,7 @@ import (
"os"
"path/filepath"
"runtime"
"strconv"
"strings"
"sync"
"unsafe"
@@ -90,4 +105,43 @@ var OnceLoad = sync.OnceFunc(func() {
visited[abspath] = struct{}{}
}
}
slog.Info("system", "", system{})
})
type system struct{}
func (system) LogValue() slog.Value {
var attrs []slog.Attr
names := make(map[string]int)
for i := range C.ggml_backend_dev_count() {
r := C.ggml_backend_dev_backend_reg(C.ggml_backend_dev_get(i))
func() {
fName := C.CString("ggml_backend_get_features")
defer C.free(unsafe.Pointer(fName))
if fn := C.ggml_backend_reg_get_proc_address(r, fName); fn != nil {
var features []any
for f := C.first_feature(C.ggml_backend_get_features_t(fn), r); f.name != nil; f = C.next_feature(f) {
features = append(features, C.GoString(f.name), C.GoString(f.value))
}
name := C.GoString(C.ggml_backend_reg_name(r))
attrs = append(attrs, slog.Group(name+"."+strconv.Itoa(names[name]), features...))
names[name] += 1
}
}()
}
switch C.compiler_name() {
case C.COMPILER_CLANG:
attrs = append(attrs, slog.String("compiler", "cgo(clang)"))
case C.COMPILER_GNUC:
attrs = append(attrs, slog.String("compiler", "cgo(gcc)"))
default:
attrs = append(attrs, slog.String("compiler", "cgo(unknown)"))
}
return slog.GroupValue(attrs...)
}

View File

@@ -16,6 +16,7 @@ import (
_ "golang.org/x/image/tiff"
_ "golang.org/x/image/webp"
fs "github.com/ollama/ollama/fs/ggml"
"github.com/ollama/ollama/kvcache"
"github.com/ollama/ollama/ml"
_ "github.com/ollama/ollama/ml/backend"
@@ -100,6 +101,36 @@ func New(modelPath string, params ml.BackendParams) (Model, error) {
return m, nil
}
func NewTextProcessor(s string) (TextProcessor, error) {
r, err := os.Open(s)
if err != nil {
return nil, err
}
defer r.Close()
meta, _, err := fs.Decode(r, -1)
if err != nil {
return nil, err
}
return getTextProcessor(meta.KV())
}
func getTextProcessor(kv fs.KV) (TextProcessor, error) {
arch := kv.Architecture()
f, ok := models[arch]
if !ok {
return nil, fmt.Errorf("unsupported model architecture %q", arch)
}
m, err := f(kv)
if err != nil {
return nil, err
}
tp, ok := m.(TextProcessor)
if !ok {
return nil, fmt.Errorf("%v is not a TextProcessor", m)
}
return tp, nil
}
func populateFields(base Base, v reflect.Value, tags ...Tag) reflect.Value {
t := v.Type()

View File

@@ -3,9 +3,11 @@ package model
import (
"reflect"
"slices"
"strings"
"testing"
"github.com/google/go-cmp/cmp"
fs "github.com/ollama/ollama/fs/ggml"
"github.com/ollama/ollama/ml"
"github.com/ollama/ollama/ml/backend/ggml"
"github.com/ollama/ollama/ml/nn"
@@ -134,3 +136,40 @@ func TestPopulateFieldsAlternateName(t *testing.T) {
t.Errorf("populateFields() set incorrect values (-want +got):\n%s", diff)
}
}
func TestGetTextProcessor(t *testing.T) {
tp, err := getTextProcessor(fs.KV{})
if err == nil {
t.Error("expected error")
} else if !strings.Contains(err.Error(), "unsupported model architecture") {
t.Errorf("unexpected error: %v", err)
} else if tp != nil {
t.Error("expected nil tp")
}
models["dummy"] = func(ml.Config) (Model, error) {
return notTextProcessorModel{}, nil
}
tp, err = getTextProcessor(fs.KV{"general.architecture": "dummy"})
if err == nil {
t.Error("expected error")
} else if !strings.Contains(err.Error(), "not a TextProcessor") {
t.Errorf("unexpected error: %v", err)
} else if tp != nil {
t.Error("expected nil tp")
}
}
type notTextProcessorModel struct{}
func (notTextProcessorModel) Forward(ml.Context, Options) (ml.Tensor, error) {
panic("unimplemented")
}
func (notTextProcessorModel) Backend() ml.Backend {
panic("unimplemented")
}
func (notTextProcessorModel) Config() config {
panic("unimplemented")
}

View File

@@ -1,7 +1,9 @@
package llama
import (
"fmt"
"math"
"strings"
"github.com/ollama/ollama/kvcache"
"github.com/ollama/ollama/ml"
@@ -29,6 +31,10 @@ type Model struct {
}
func New(c ml.Config) (model.Model, error) {
if !strings.EqualFold(c.String("tokenizer.ggml.model"), "gpt2") {
return nil, fmt.Errorf("tokenizer %s not yet supported", c.String("tokenizer.ggml.model"))
}
m := Model{
BytePairEncoding: model.NewBytePairEncoding(
c.String("tokenizer.ggml.pretokenizer", `(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+`),

View File

@@ -1,6 +1,8 @@
package mllama
import (
"fmt"
"github.com/ollama/ollama/kvcache"
"github.com/ollama/ollama/ml"
"github.com/ollama/ollama/ml/nn"
@@ -25,6 +27,10 @@ const (
)
func New(c ml.Config) (model.Model, error) {
// Verify unified config
if c.Uint("vision.block_count") == 0 {
return nil, fmt.Errorf("non-unified vision model not supported")
}
m := Model{
BytePairEncoding: model.NewBytePairEncoding(
c.String("tokenizer.ggml.pretokenizer", `(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+`),

View File

@@ -931,7 +931,6 @@ func Execute(args []string) error {
slog.Info("starting go runner")
llama.BackendInit()
slog.Info("system", "info", llama.PrintSystemInfo(), "threads", *threads)
server := &Server{
batchSize: *batchSize,

View File

@@ -428,7 +428,8 @@ func (s *Server) processBatch() error {
// sample a token
vocabSize := len(logits) / len(options.Outputs)
// TODO: need access to vocab to apply grammar
// token = sampler.Grammar.Apply(logits)
token, err := seq.sampler.Sample(logits[seq.iBatch*vocabSize : (seq.iBatch+1)*vocabSize])
if err != nil {
return fmt.Errorf("failed to sample token: %w", err)
@@ -575,6 +576,13 @@ func (s *Server) completion(w http.ResponseWriter, r *http.Request) {
return
}
// TODO: if grammar is provided, load it
// if req.Grammar != "" {
// grammar := llama.NewGrammarWithTokens(req.Grammar, "root", s.model.Vocabulary)
// }
// defer grammar.Close()
// sampler := sample.WithGrammar(sample.Greedy(), grammar)
seq, err := s.NewSequence(req.Prompt, req.Images, NewSequenceParams{
numPredict: req.NumPredict,
stop: req.Stop,
@@ -786,8 +794,6 @@ func (s *Server) loadModel(
panic(err)
}
slog.Info("system", "info", s.model.Backend().SystemInfo(), "threads", params.NumThreads)
// TODO(jessegross): LoRA loading
if lpath.String() != "" {
panic("loras are not yet implemented")

View File

@@ -4,6 +4,7 @@ import (
"errors"
"math"
"github.com/ollama/ollama/llama"
"golang.org/x/exp/rand"
"gonum.org/v1/gonum/stat/sampleuv"
)
@@ -57,12 +58,24 @@ func (s weighted) Sample(logits []float32) (int32, error) {
return -1, errors.New("weighted sampler failed, no valid token found")
}
type greedy struct{}
type greedy struct {
grammar llama.Grammar
}
func Greedy() Sampler {
return greedy{}
}
func WithGrammar(s Sampler, grammar llama.Grammar) Sampler {
switch t := s.(type) {
case greedy:
t.grammar = grammar
return t
default:
return s
}
}
// Sample returns the index of the maximum value in logits.
func (s greedy) Sample(logits []float32) (int32, error) {
if len(logits) == 0 {
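Tying this back to the commented-out wiring in the runner above, here is a sketch of how the grammar-constrained greedy sampler might be used once it is hooked up; this path is still WIP in this change, and the `github.com/ollama/ollama/sample` import path is assumed:

```go
package main

import (
	"fmt"

	"github.com/ollama/ollama/llama"
	"github.com/ollama/ollama/sample"
)

func main() {
	// Illustrative stand-ins; the runner would use the request grammar and the
	// model's vocabulary instead.
	vocab := []string{"<s>", "</s>", "yes", "no"}
	grammar, err := llama.NewGrammarWithTokens(`root ::= "yes" | "no"`, "root", vocab)
	if err != nil {
		panic(err)
	}
	defer grammar.Close()

	// WithGrammar attaches the grammar to the greedy sampler; whether Sample
	// applies it to the logits is part of the WIP in this branch.
	sampler := sample.WithGrammar(sample.Greedy(), grammar)
	token, err := sampler.Sample([]float32{0.1, 0.2, 2.5, 1.3})
	if err != nil {
		panic(err)
	}
	fmt.Println("sampled token id:", token)
}
```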

View File

@@ -77,11 +77,12 @@ if [ -d "$OLLAMA_INSTALL_DIR/lib/ollama" ] ; then
fi
status "Installing ollama to $OLLAMA_INSTALL_DIR"
$SUDO install -o0 -g0 -m755 -d $BINDIR
$SUDO install -o0 -g0 -m755 -d "$OLLAMA_INSTALL_DIR"
$SUDO install -o0 -g0 -m755 -d "$OLLAMA_INSTALL_DIR/lib/ollama"
status "Downloading Linux ${ARCH} bundle"
curl --fail --show-error --location --progress-bar \
"https://ollama.com/download/ollama-linux-${ARCH}.tgz${VER_PARAM}" | \
$SUDO tar -xzf - -C "$OLLAMA_INSTALL_DIR"
if [ "$OLLAMA_INSTALL_DIR/bin/ollama" != "$BINDIR/ollama" ] ; then
status "Making ollama accessible in the PATH in $BINDIR"
$SUDO ln -sf "$OLLAMA_INSTALL_DIR/ollama" "$BINDIR/ollama"

View File

@@ -1,3 +1,5 @@
//go:build goexperiment.synctest
package backoff
import (

View File

@@ -1,3 +1,5 @@
//go:build goexperiment.synctest
package syncs
import (

View File

@@ -31,6 +31,10 @@ type Local struct {
// Fallback, if set, is used to handle requests that are not handled by
// this handler.
Fallback http.Handler
// Prune, if set, is called to prune the local disk cache after a model
// is deleted.
Prune func() error // optional
}
// serverError is like ollama.Error, but with a Status field for the HTTP
@@ -204,7 +208,10 @@ func (s *Local) handleDelete(_ http.ResponseWriter, r *http.Request) error {
if !ok {
return &serverError{404, "not_found", "model not found"}
}
return nil
if s.Prune == nil {
return nil
}
return s.Prune()
}
func decodeUserJSON[T any](r io.Reader) (T, error) {

View File

@@ -10,7 +10,6 @@ import (
"strings"
"github.com/ollama/ollama/api"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/llm"
"github.com/ollama/ollama/model/models/mllama"
"github.com/ollama/ollama/template"
@@ -93,7 +92,7 @@ func chatPrompt(ctx context.Context, m *Model, tokenize tokenizeFunc, opts *api.
var imgData llm.ImageData
if isMllama {
if envconfig.NewEngine() {
if len(m.ProjectorPaths) == 0 {
imgData = llm.ImageData{
ID: len(images),
Data: i,

View File

@@ -205,7 +205,7 @@ func (s *Server) GenerateHandler(c *gin.Context) {
images := make([]llm.ImageData, len(req.Images))
for i := range req.Images {
if isMllama && !envconfig.NewEngine() {
if isMllama && len(model.ProjectorPaths) > 0 {
data, opts, err := mllama.Preprocess(bytes.NewReader(req.Images[i]))
if err != nil {
c.AbortWithStatusJSON(http.StatusInternalServerError, gin.H{"error": "error processing image"})
@@ -1199,6 +1199,8 @@ func (s *Server) GenerateRoutes(rc *ollama.Registry) (http.Handler, error) {
Client: rc,
Logger: slog.Default(), // TODO(bmizerany): Take a logger, do not use slog.Default()
Fallback: r,
Prune: PruneLayers,
}
return rs, nil