fix gemma2-2b conversion

readme: add vnc-lm discord bot community integration (#6644)
llm: use json.hpp from common (#6642)
2024-09-04 16:59:23 -07:00 · 2024-09-04 19:46:02 -04:00 · 2024-09-04 19:34:42 -04:00 · 2024-09-04 17:26:02 -04:00 · 2024-09-04 14:45:09 -04:00 · 2024-09-04 14:19:41 -04:00
21 changed files with 99 additions and 25050 deletions
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -18,7 +18,7 @@ See the [development documentation](./docs/development.md) for instructions on h

 * New features: new features (e.g. API fields, environment variables) add surface area to Ollama and make it harder to maintain in the long run as they cannot be removed without potentially breaking users in the future.
 * Refactoring: large code improvements are important, but can be harder or take longer to review and merge.
-* Documentation: small updates to fill in or dorrect missing documentation is helpful, however large documentation additions can be hard to maintain over time.
+* Documentation: small updates to fill in or correct missing documentation is helpful, however large documentation additions can be hard to maintain over time.

 ### Issues that may not be accepted

--- a/20
+++ b/20
@@ -21,7 +21,7 @@ COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
 ARG CUDA_V11_ARCHITECTURES
-ENV GOARCH amd64 
+ENV GOARCH amd64
 RUN --mount=type=cache,target=/root/.ccache \
    OLLAMA_SKIP_STATIC_GENERATE=1 \
    OLLAMA_SKIP_CPU_GENERATE=1 \
@@ -38,7 +38,7 @@ COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
 ARG CUDA_V12_ARCHITECTURES
-ENV GOARCH amd64 
+ENV GOARCH amd64
 RUN --mount=type=cache,target=/root/.ccache \
    OLLAMA_SKIP_STATIC_GENERATE=1 \
    OLLAMA_SKIP_CPU_GENERATE=1 \
@@ -56,7 +56,7 @@ COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
 ARG CUDA_V11_ARCHITECTURES
-ENV GOARCH arm64 
+ENV GOARCH arm64
 RUN OLLAMA_SKIP_STATIC_GENERATE=1 \
    OLLAMA_SKIP_CPU_GENERATE=1 \
    CMAKE_CUDA_ARCHITECTURES="${CUDA_V11_ARCHITECTURES}" \
@@ -72,7 +72,7 @@ COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
 ARG CUDA_V12_ARCHITECTURES
-ENV GOARCH arm64 
+ENV GOARCH arm64
 RUN --mount=type=cache,target=/root/.ccache \
    OLLAMA_SKIP_STATIC_GENERATE=1 \
    OLLAMA_SKIP_CPU_GENERATE=1 \
@@ -92,7 +92,7 @@ COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
 ARG AMDGPU_TARGETS
-ENV GOARCH amd64 
+ENV GOARCH amd64
 RUN --mount=type=cache,target=/root/.ccache \
    OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 bash gen_linux.sh
 RUN mkdir -p ../../dist/linux-amd64-rocm/lib/ollama && \
@@ -107,7 +107,7 @@ ENV PATH /opt/rh/devtoolset-10/root/usr/bin:$PATH
 COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 ARG OLLAMA_CUSTOM_CPU_DEFS
 ARG CGO_CFLAGS
-ENV GOARCH amd64 
+ENV GOARCH amd64
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate

 FROM --platform=linux/amd64 cpu-builder-amd64 AS static-build-amd64
@@ -181,17 +181,19 @@ RUN --mount=type=cache,target=/root/.ccache \
 # Strip out ROCm dependencies to keep the primary image lean
 FROM --platform=linux/amd64 ubuntu:22.04 as amd64-libs-without-rocm
 COPY --from=build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/lib/ /scratch/
-RUN cd /scratch/ollama/ && rm -rf rocblas libamd* libdrm* libroc* libhip* libhsa* 
+RUN cd /scratch/ollama/ && rm -rf rocblas libamd* libdrm* libroc* libhip* libhsa*

 # Runtime stages
 FROM --platform=linux/amd64 ubuntu:22.04 as runtime-amd64
 COPY --from=amd64-libs-without-rocm /scratch/ /lib/
-RUN apt-get update && apt-get install -y ca-certificates
+RUN apt-get update && apt-get install -y ca-certificates && \
+    apt-get clean && rm -rf /var/lib/apt/lists/*
 COPY --from=build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/bin/ /bin/

 FROM --platform=linux/arm64 ubuntu:22.04 as runtime-arm64
 COPY --from=build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/
-RUN apt-get update && apt-get install -y ca-certificates
+RUN apt-get update && apt-get install -y ca-certificates && \
+    apt-get clean && rm -rf /var/lib/apt/lists/*
 COPY --from=build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/bin/ /bin/

 # Radeon images are much larger so we keep it distinct from the CPU/CUDA image
--- a/README.md
+++ b/README.md
@@ -296,12 +296,20 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [OllamaSpring](https://github.com/CrazyNeil/OllamaSpring) (Ollama Client for macOS)
 - [LLocal.in](https://github.com/kartikm7/llocal) (Easy to use Electron Desktop Client for Ollama)
 - [Ollama with Google Mesop](https://github.com/rapidarchitect/ollama_mesop/) (Mesop Chat Client implementation with Ollama)
+- [Painting Droid](https://github.com/mateuszmigas/painting-droid) (Painting app with AI integrations)
 - [Kerlig AI](https://www.kerlig.com/) (AI writing assistant for macOS)
 - [AI Studio](https://github.com/MindWorkAI/AI-Studio)
 - [Sidellama](https://github.com/gyopak/sidellama) (browser-based LLM client)
 - [LLMStack](https://github.com/trypromptly/LLMStack) (No-code multi-agent framework to build LLM agents and workflows)
 - [BoltAI for Mac](https://boltai.com) (AI Chat Client for Mac)
 - [Harbor](https://github.com/av/harbor) (Containerized LLM Toolkit with Ollama as default backend)
+- [Go-CREW](https://www.jonathanhecl.com/go-crew/) (Powerful Offline RAG in Golang)
+- [PartCAD](https://github.com/openvmp/partcad/) (CAD model generation with OpenSCAD and CadQuery)
+- [Ollama4j Web UI](https://github.com/ollama4j/ollama4j-web-ui) - Java-based Web UI for Ollama built with Vaadin, Spring Boot and Ollama4j
+- [PyOllaMx](https://github.com/kspviswa/pyOllaMx) - macOS application capable of chatting with both Ollama and Apple MLX models.
+- [Claude Dev](https://github.com/saoudrizwan/claude-dev) - VSCode extension for multi-file/whole-repo coding
+- [Cherry Studio](https://github.com/kangfenmao/cherry-studio) (Desktop client with Ollama support)
+- [ConfiChat](https://github.com/1runeberg/confichat) (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)

 ### Terminal

@@ -349,11 +357,12 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [LangChainRust](https://github.com/Abraxas-365/langchain-rust) with [example](https://github.com/Abraxas-365/langchain-rust/blob/main/examples/llm_ollama.rs)
 - [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/llm/ollama.html)
 - [LiteLLM](https://github.com/BerriAI/litellm)
+- [OllamaFarm for Go](https://github.com/presbrey/ollamafarm)
 - [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)
 - [Ollama for Ruby](https://github.com/gbaptista/ollama-ai)
 - [Ollama-rs for Rust](https://github.com/pepperoni21/ollama-rs)
 - [Ollama-hpp for C++](https://github.com/jmont-dev/ollama-hpp)
-- [Ollama4j for Java](https://github.com/amithkoujalgi/ollama4j)
+- [Ollama4j for Java](https://github.com/ollama4j/ollama4j)
 - [ModelFusion Typescript Library](https://modelfusion.dev/integration/model-provider/ollama)
 - [OllamaKit for Swift](https://github.com/kevinhermawan/OllamaKit)
 - [Ollama for Dart](https://github.com/breitburg/dart-ollama)
@@ -370,11 +379,15 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [Portkey](https://portkey.ai/docs/welcome/integration-guides/ollama)
 - [PromptingTools.jl](https://github.com/svilupp/PromptingTools.jl) with an [example](https://svilupp.github.io/PromptingTools.jl/dev/examples/working_with_ollama)
 - [LlamaScript](https://github.com/Project-Llama/llamascript)
+- [Gollm](https://docs.gollm.co/examples/ollama-example)
+- [Ollamaclient for Golang](https://github.com/xyproto/ollamaclient)
+- [High-level function abstraction in Go](https://gitlab.com/tozd/go/fun)

 ### Mobile

 - [Enchanted](https://github.com/AugustDev/enchanted)
 - [Maid](https://github.com/Mobile-Artificial-Intelligence/maid)
+- [ConfiChat](https://github.com/1runeberg/confichat) (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)

 ### Extensions & Plugins

@@ -404,6 +417,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [Discord-Ollama Chat Bot](https://github.com/kevinthedang/discord-ollama) (Generalized TypeScript Discord Bot w/ Tuning Documentation)
 - [Discord AI chat/moderation bot](https://github.com/rapmd73/Companion) Chat/moderation bot written in python. Uses Ollama to create personalities.
 - [Headless Ollama](https://github.com/nischalj10/headless-ollama) (Scripts to automatically install ollama client & models on any OS for apps that depends on ollama server)
+- [vnc-lm](https://github.com/jk011ru/vnc-lm) (A containerized Discord bot with support for attachments and web links)

 ### Supported backends

--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -726,14 +726,17 @@ func ShowHandler(cmd *cobra.Command, args []string) error {
 }

 func showInfo(resp *api.ShowResponse) {
-	arch := resp.ModelInfo["general.architecture"].(string)
-
 	modelData := [][]string{
-		{"arch", arch},
 		{"parameters", resp.Details.ParameterSize},
 		{"quantization", resp.Details.QuantizationLevel},
-		{"context length", fmt.Sprintf("%v", resp.ModelInfo[fmt.Sprintf("%s.context_length", arch)].(float64))},
-		{"embedding length", fmt.Sprintf("%v", resp.ModelInfo[fmt.Sprintf("%s.embedding_length", arch)].(float64))},
+	}
+	if resp.ModelInfo != nil {
+		arch := resp.ModelInfo["general.architecture"].(string)
+		modelData = append(modelData,
+			[]string{"arch", arch},
+			[]string{"context length", fmt.Sprintf("%v", resp.ModelInfo[fmt.Sprintf("%s.context_length", arch)].(float64))},
+			[]string{"embedding length", fmt.Sprintf("%v", resp.ModelInfo[fmt.Sprintf("%s.embedding_length", arch)].(float64))},
+		)
 	}

 	mainTableData := [][]string{
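
Review note: the new `nil` check covers a response without `ModelInfo`, but the unchecked type assertions inside the block can still panic when a key is absent or holds another type. Below is a minimal sketch of a comma-ok variant. The helper name `modelInfoRows` and the plain `map[string]any` standing in for `resp.ModelInfo` are assumptions made for illustration, not code from this change.

```go
package main

import "fmt"

// modelInfoRows is a hypothetical helper, not part of the Ollama codebase.
// It builds the optional table rows from a metadata map using comma-ok
// assertions, so a missing or mistyped key skips the row instead of panicking.
func modelInfoRows(info map[string]any) [][]string {
	var rows [][]string
	arch, ok := info["general.architecture"].(string)
	if !ok {
		return rows // nil map, missing key, or non-string value
	}
	rows = append(rows, []string{"arch", arch})
	for _, f := range []struct{ label, key string }{
		{"context length", arch + ".context_length"},
		{"embedding length", arch + ".embedding_length"},
	} {
		if v, ok := info[f.key].(float64); ok {
			rows = append(rows, []string{f.label, fmt.Sprintf("%v", v)})
		}
	}
	return rows
}

func main() {
	// A nil map is safe to index in Go; only the assertion needs guarding.
	fmt.Println(modelInfoRows(nil))
	fmt.Println(modelInfoRows(map[string]any{
		"general.architecture":  "gemma2",
		"gemma2.context_length": float64(8192),
	}))
}
```

Whether silently skipping a row is preferable to surfacing an error is a design choice for the maintainers; the sketch only shows the non-panicking form.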
--- a/convert/convert_gemma2.go
+++ b/convert/convert_gemma2.go
@@ -34,10 +34,20 @@ func (p *gemma2Model) KV(t *Tokenizer) llm.KV {
 }

 func (p *gemma2Model) Replacements() []string {
-	return append(
-		p.gemmaModel.Replacements(),
+	return []string{
+		"model.embed_tokens", "token_embd",
+		"model.norm", "output_norm",
+		"model.layers", "blk",
+		"input_layernorm", "attn_norm",
+		"self_attn.q_proj", "attn_q",
+		"self_attn.k_proj", "attn_k",
+		"self_attn.v_proj", "attn_v",
+		"self_attn.o_proj", "attn_output",
+		"mlp.gate_proj", "ffn_gate",
+		"mlp.down_proj", "ffn_down",
+		"mlp.up_proj", "ffn_up",
 		"post_attention_layernorm", "post_attention_norm",
 		"pre_feedforward_layernorm", "ffn_norm",
 		"post_feedforward_layernorm", "post_ffw_norm",
-	)
+	}
 }
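
Review note: `Replacements()` returns a flat slice of old/new pairs. The sketch below shows how such pairs could rename tensor names, assuming they are passed to the standard library's `strings.NewReplacer`; that wiring is an assumption about the converter, and `renameTensor` is a hypothetical helper. It also shows that when the same old string appears twice, the earlier pair wins, which is one plausible reason to list the gemma2 pairs explicitly rather than append to an inherited list.

```go
package main

import (
	"fmt"
	"strings"
)

// renameTensor is a hypothetical helper. It applies old/new pairs with
// strings.NewReplacer, which compares candidates in argument order and
// never rescans text it has already replaced.
func renameTensor(pairs []string, name string) string {
	return strings.NewReplacer(pairs...).Replace(name)
}

func main() {
	// A subset of the pairs listed in the hunk above.
	pairs := []string{
		"model.layers", "blk",
		"self_attn.q_proj", "attn_q",
		"post_feedforward_layernorm", "post_ffw_norm",
	}
	fmt.Println(renameTensor(pairs, "model.layers.0.self_attn.q_proj.weight"))
	fmt.Println(renameTensor(pairs, "model.layers.3.post_feedforward_layernorm.weight"))

	// Illustrative only: a duplicated old string; the first pair shadows the second.
	shadowed := []string{
		"post_attention_layernorm", "ffn_norm",
		"post_attention_layernorm", "post_attention_norm",
	}
	fmt.Println(renameTensor(shadowed, "post_attention_layernorm"))
}
```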
--- a/convert/convert_test.go
+++ b/convert/convert_test.go
@@ -96,6 +96,7 @@ func TestConvertModel(t *testing.T) {
 		"Mistral-7B-Instruct-v0.2",
 		"Mixtral-8x7B-Instruct-v0.1",
 		"gemma-2b-it",
+		"gemma-2-2b-it",
 		// microsoft/Phi-3-mini-128-instruct@d548c233192db00165d842bf8edff054bb3212f8
 		"Phi-3-mini-128k-instruct",
 		"all-MiniLM-L6-v2",
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -194,6 +194,8 @@ Refer to the section [above](#how-do-i-configure-ollama-server) for how to set e

 If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory.

+> Note: on Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama <directory>`.
+
 Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.

 ## How can I use Ollama in Visual Studio Code?
--- a/docs/linux.md
+++ b/docs/linux.md
@@ -35,10 +35,11 @@ curl -fsSL https://ollama.com/download/ollama-linux-amd64-rocm.tgz | sudo tar zx

 ### Adding Ollama as a startup service (recommended)

-Create a user for Ollama:
+Create a user and group for Ollama:

 ```bash
-sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
+sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
+sudo usermod -a -G ollama $(whoami)
 ```

 Create a service file in `/etc/systemd/system/ollama.service`:
@@ -54,6 +55,7 @@ User=ollama
 Group=ollama
 Restart=always
 RestartSec=3
+Environment="PATH=$PATH"

 [Install]
 WantedBy=default.target
@@ -83,10 +85,11 @@ Make sure to install ROCm v6

 ### Start Ollama

-Start Ollama using `systemd`:
+Start Ollama and verify it is running:

 ```bash
 sudo systemctl start ollama
+sudo systemctl status ollama
 ```

 ## Update
--- a/docs/modelfile.md
+++ b/docs/modelfile.md
@@ -128,10 +128,10 @@ Currently supported model architectures:
 #### Build from a GGUF file

 ```modelfile
-FROM ./ollama-model.bin
+FROM ./ollama-model.gguf
 ```

-The GGUF bin file location should be specified as an absolute path or relative to the `Modelfile` location.
+The GGUF file location should be specified as an absolute path or relative to the `Modelfile` location.


 ### PARAMETER
@@ -208,7 +208,7 @@ Currently supported Safetensor adapters:
 #### GGUF adapter

 ```modelfile
-ADAPTER ./ollama-lora.bin
+ADAPTER ./ollama-lora.gguf
 ```

 ### LICENSE
--- a/gpu/cuda_common.go
+++ b/gpu/cuda_common.go
@@ -57,7 +57,7 @@ func cudaVariant(gpuInfo CudaGPUInfo) string {
 		}
 	}

-	if gpuInfo.computeMajor < 6 || gpuInfo.DriverMajor < 12 {
+	if gpuInfo.computeMajor < 6 || gpuInfo.DriverMajor < 12 || (gpuInfo.DriverMajor == 12 && gpuInfo.DriverMinor == 0) {
 		return "v11"
 	}
 	return "v12"
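
Review note: the added clause routes driver version 12.0 to the `v11` runner while later 12.x drivers still get `v12`. A minimal sketch of just the condition visible in this hunk follows. `pickVariant` is a hypothetical stand-in that takes plain integers, and it ignores whatever the elided earlier part of `cudaVariant` does.

```go
package main

import "fmt"

// pickVariant is a hypothetical stand-in for the final condition of
// cudaVariant shown in the hunk; it does not model the elided code above it.
func pickVariant(computeMajor, driverMajor, driverMinor int) string {
	if computeMajor < 6 || driverMajor < 12 || (driverMajor == 12 && driverMinor == 0) {
		return "v11"
	}
	return "v12"
}

func main() {
	cases := []struct{ compute, major, minor int }{
		{8, 12, 0}, // driver 12.0: falls back to v11 after this change
		{8, 12, 4}, // newer 12.x driver
		{5, 12, 4}, // compute capability below 6
		{8, 11, 8}, // driver older than 12
	}
	for _, c := range cases {
		fmt.Printf("compute=%d driver=%d.%d -> %s\n",
			c.compute, c.major, c.minor, pickVariant(c.compute, c.major, c.minor))
	}
}
```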
--- a/llm/ext_server/CMakeLists.txt
+++ b/llm/ext_server/CMakeLists.txt
@@ -2,7 +2,7 @@ set(TARGET ollama_llama_server)
 option(LLAMA_SERVER_VERBOSE "Build verbose logging option for Server" ON)
 set(LLAMA_SERVER_LDFLAGS $ENV{LLAMA_SERVER_LDFLAGS})
 include_directories(${CMAKE_CURRENT_SOURCE_DIR})
-add_executable(${TARGET} server.cpp utils.hpp json.hpp httplib.h)
+add_executable(${TARGET} server.cpp utils.hpp httplib.h)
 install(TARGETS ${TARGET} RUNTIME)
 target_compile_definitions(${TARGET} PRIVATE
    SERVER_VERBOSE=$<BOOL:${LLAMA_SERVER_VERBOSE}>
--- a/llm/ext_server/json.hpp
+++ b/llm/ext_server/json.hpp
--- a/llm/ext_server/server.cpp
+++ b/llm/ext_server/server.cpp
@@ -262,7 +262,7 @@ struct server_slot {
       char buffer[512];
        double t_token = t_prompt_processing / n_prompt_tokens_processed;
        double n_tokens_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed;
-        sprintf(buffer, "prompt eval time     = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
+        snprintf(buffer, sizeof(buffer), "prompt eval time     = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
                t_prompt_processing, n_prompt_tokens_processed,
                t_token, n_tokens_second);
        LOG_DEBUG(buffer, {
@@ -276,7 +276,7 @@ struct server_slot {

        t_token = t_token_generation / n_decoded;
        n_tokens_second = 1e3 / t_token_generation * n_decoded;
-        sprintf(buffer, "generation eval time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)",
+        snprintf(buffer, sizeof(buffer), "generation eval time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)",
                t_token_generation, n_decoded,
                t_token, n_tokens_second);
        LOG_DEBUG(buffer, {
@@ -288,7 +288,7 @@ struct server_slot {
            {"n_tokens_second",    n_tokens_second},
        });

-        sprintf(buffer, "          total time = %10.2f ms", t_prompt_processing + t_token_generation);
+        snprintf(buffer, sizeof(buffer), "          total time = %10.2f ms", t_prompt_processing + t_token_generation);
        LOG_DEBUG(buffer, {
            {"slot_id",             id},
            {"task_id",             task_id},
@@ -425,7 +425,7 @@ struct llama_server_context

        n_ctx = llama_n_ctx(ctx);

-        add_bos_token = llama_should_add_bos_token(model);
+        add_bos_token = llama_add_bos_token(model);

        return true;
    }
@@ -1031,7 +1031,7 @@ struct llama_server_context
                continue;
            }

-            if (!llava_image_embed_make_with_clip_img(clp_ctx, params.n_threads, img.img_data, &img.image_embedding, &img.image_tokens)) {
+            if (!llava_image_embed_make_with_clip_img(clp_ctx, params.cpuparams.n_threads, img.img_data, &img.image_embedding, &img.image_tokens)) {
                LOG_TEE("Error processing the given image");
                return false;
            }
@@ -2014,7 +2014,7 @@ static void server_print_usage(const char *argv0, const gpt_params &params,
    printf("options:\n");
    printf("  -h, --help                show this help message and exit\n");
    printf("  -v, --verbose             verbose output (default: %s)\n", server_verbose ? "enabled" : "disabled");
-    printf("  -t N, --threads N         number of threads to use during computation (default: %d)\n", params.n_threads);
+    printf("  -t N, --threads N         number of threads to use during computation (default: %d)\n", params.cpuparams.n_threads);
    printf("  -tb N, --threads-batch N  number of threads to use during batch and prompt processing (default: same as --threads)\n");
    printf("  --threads-http N          number of threads in the http server pool to process requests (default: max(hardware concurrency - 1, --parallel N + 2))\n");
    printf("  -c N, --ctx-size N        size of the prompt context (default: %d)\n", params.n_ctx);
@@ -2287,7 +2287,7 @@ static void server_params_parse(int argc, char **argv, server_params &sparams, g
                invalid_param = true;
                break;
            }
-            params.n_threads = std::stoi(argv[i]);
+            params.cpuparams.n_threads = std::stoi(argv[i]);
        }
        else if (arg == "--grp-attn-n" || arg == "-gan")
        {
@@ -2315,7 +2315,7 @@ static void server_params_parse(int argc, char **argv, server_params &sparams, g
                invalid_param = true;
                break;
            }
-            params.n_threads_batch = std::stoi(argv[i]);
+            params.cpuparams_batch.n_threads = std::stoi(argv[i]);
        }
        else if (arg == "--threads-http")
        {
@@ -2626,6 +2626,11 @@ static void server_params_parse(int argc, char **argv, server_params &sparams, g
        params.kv_overrides.back().key[0] = 0;
    }

+    postprocess_cpu_params(params.cpuparams, nullptr);
+    postprocess_cpu_params(params.cpuparams_batch, &params.cpuparams);
+    postprocess_cpu_params(params.draft_cpuparams, &params.cpuparams);
+    postprocess_cpu_params(params.draft_cpuparams_batch, &params.cpuparams_batch);
+
    if (invalid_param)
    {
        fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
@@ -2775,8 +2780,8 @@ int main(int argc, char **argv) {
                            {"commit", LLAMA_COMMIT}});

    LOG_INFO("system info", {
-                                {"n_threads", params.n_threads},
-                                {"n_threads_batch", params.n_threads_batch},
+                                {"n_threads", params.cpuparams.n_threads},
+                                {"n_threads_batch", params.cpuparams_batch.n_threads},
                                {"total_threads", std::thread::hardware_concurrency()},
                                {"system_info", llama_print_system_info()},
                            });
--- a/llm/generate/gen_darwin.sh
+++ b/llm/generate/gen_darwin.sh
@@ -19,7 +19,7 @@ sign() {
    fi
 }

-COMMON_DARWIN_DEFS="-DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off"
+COMMON_DARWIN_DEFS="-DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DGGML_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off"

 case "${GOARCH}" in
 "amd64")
--- a/llm/llama.cpp
+++ b/llm/llama.cpp
--- a/llm/patches/05-default-pretokenizer.diff
+++ b/llm/patches/05-default-pretokenizer.diff
@@ -1,8 +1,8 @@
 diff --git a/src/llama.cpp b/src/llama.cpp
-index a207451f..2ddf431d 100644
+index 88355971..dd7d41ed 100644
 --- a/src/llama.cpp
 +++ b/src/llama.cpp
-@@ -5347,16 +5347,7 @@ static void llm_load_vocab(
+@@ -6083,16 +6083,7 @@ static void llm_load_vocab(
         if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
             vocab.tokenizer_add_space_prefix = false;
             vocab.tokenizer_clean_spaces = true;
@@ -20,9 +20,9 @@ index a207451f..2ddf431d 100644
                 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
             } else if (
                     tokenizer_pre == "llama3"   ||
-@@ -5443,7 +5434,8 @@ static void llm_load_vocab(
-                 tokenizer_pre == "codeshell") {
-                 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_CODESHELL;
+@@ -6188,7 +6179,8 @@ static void llm_load_vocab(
+                 tokenizer_pre == "exaone") {
+                 vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_EXAONE;
             } else {
 -                throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
 +                LLAMA_LOG_WARN("%s: missing or unrecognized pre-tokenizer type, using: 'default'\n", __func__);
--- a/llm/patches/06-embeddings.diff
+++ b/llm/patches/06-embeddings.diff
@@ -1,37 +1,36 @@
 diff --git a/src/llama.cpp b/src/llama.cpp
-index 1fe2b9f7..a43312a7 100644
+index 88355971..d7db689b 100644
 --- a/src/llama.cpp
 +++ b/src/llama.cpp
-@@ -13689,7 +13689,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
+@@ -15906,7 +15906,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
     const auto n_embd  = hparams.n_embd;
 
     // TODO: use a per-batch flag for logits presence instead
 -    const bool has_logits = !cparams.embeddings;
 +    const bool has_logits =  cparams.causal_attn;
-     const bool has_embd   =  lctx.is_encoding || (cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE));
+     const bool has_embd   =  cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE);
 
     const size_t logits_size = has_logits ? n_vocab*n_outputs_max : 0;
-@@ -13959,17 +13959,25 @@ static int llama_decode_internal(
+@@ -16175,20 +16175,23 @@ static int llama_decode_internal(
             // no output
             res  = nullptr;
             embd = nullptr;
 -        } else if (cparams.embeddings) {
--            res = nullptr; // do not extract logits for embedding case
--            embd = gf->nodes[gf->n_nodes - 1];
--            if (strcmp(embd->name, "result_embd_pooled") != 0) {
--                embd = gf->nodes[gf->n_nodes - 2];
+-            res  = nullptr; // do not extract logits for embedding case
+-            embd = nullptr;
 +        }
 +
 +        if (cparams.embeddings) {
-+            for (int i = gf->n_nodes - 1; i >= 0; --i) {
+             for (int i = gf->n_nodes - 1; i >= 0; --i) {
+-                if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
+-                    embd = gf->nodes[i];
 +                embd = gf->nodes[i];
 +                if (strcmp(embd->name, "result_embd_pooled") == 0) {
-+                    break;
-+                }
+                     break;
+                 }
             }
-             GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor");
--        } else {
-+         } else {
+-            GGML_ASSERT(embd != nullptr && "missing embeddings tensor");
+         } else {
             embd = nullptr; // do not extract embeddings when not needed
             GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor");
         }
@@ -39,7 +38,6 @@ index 1fe2b9f7..a43312a7 100644
 +        if (!cparams.causal_attn) {
 +            res = nullptr; // do not extract logits when not needed
 +        }
-+
         // LLAMA_LOG_INFO("graph build time: %.3f ms (%d nodes, %d leafs)\n", (ggml_time_us() - t_start_us)/1000.0, gf->n_nodes, gf->n_leafs);
 
         ggml_backend_sched_alloc_graph(lctx.sched, gf);
--- a/llm/patches/09-lora.diff
+++ b/llm/patches/09-lora.diff
@@ -1,350 +0,0 @@
-diff --git a/common/common.cpp b/common/common.cpp
-index 2e8374d5..70d0afde 100644
---- a/common/common.cpp
-+++ b/common/common.cpp
-@@ -2110,9 +2110,21 @@ struct llama_init_result llama_init_from_gpt_params(gpt_params & params) {
-         loaded_la.adapter = llama_lora_adapter_init(model, la.path.c_str());
-         if (loaded_la.adapter == nullptr) {
-             fprintf(stderr, "%s: error: failed to apply lora adapter '%s'\n", __func__, la.path.c_str());
--            llama_free(lctx);
--            llama_free_model(model);
--            return iparams;
-+
-+            // if that fails, try loading as ggla for compatibility
-+            int err = llama_model_apply_lora_from_file(model,
-+                                                    la.path.c_str(),
-+                                                    la.scale,
-+                                                    nullptr,
-+                                                    params.n_threads);
-+            if (err != 0) {
-+                fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
-+                llama_free(lctx);
-+                llama_free_model(model);
-+                return iparams;
-+            } else {
-+                break;
-+            }
-         }
-         iparams.lora_adapters.push_back(loaded_la); // copy to list of loaded adapters
-     }
-diff --git a/include/llama.h b/include/llama.h
-index 93fd77ca..b0fb37a6 100644
---- a/include/llama.h
-+++ b/include/llama.h
-@@ -1160,6 +1160,20 @@ extern "C" {
- 
-     LLAMA_API void llama_dump_timing_info_yaml(FILE * stream, const struct llama_context * ctx);
- 
-+    // Apply a LoRA adapter to a loaded model
-+    // path_base_model is the path to a higher quality model to use as a base for
-+    // the layers modified by the adapter. Can be NULL to use the current loaded model.
-+    // The model needs to be reloaded before applying a new adapter, otherwise the adapter
-+    // will be applied on top of the previous one
-+    // Returns 0 on success
-+    LLAMA_API int32_t llama_model_apply_lora_from_file(
-+            const struct llama_model * model,
-+                            const char * path_lora,
-+                                float   scale,
-+                            const char * path_base_model,
-+                                int32_t   n_threads);
-+
-+
- #ifdef __cplusplus
- }
- #endif
-diff --git a/src/llama.cpp b/src/llama.cpp
-index 80a0dd0f..9d7b0e17 100644
---- a/src/llama.cpp
-+++ b/src/llama.cpp
-@@ -21880,3 +21880,290 @@ static void llama_log_callback_default(ggml_log_level level, const char * text,
-     fputs(text, stderr);
-     fflush(stderr);
- }
-+
-+static int llama_apply_lora_from_file_internal(
-+    const struct llama_model & model, const char * path_lora, float scale, const char * path_base_model, int n_threads
-+) {
-+    LLAMA_LOG_INFO("%s: applying lora adapter from '%s' - please wait ...\n", __func__, path_lora);
-+
-+    const int64_t t_start_lora_us = ggml_time_us();
-+
-+    llama_file fin(path_lora, "rb");
-+
-+    // verify magic and version
-+    {
-+        uint32_t magic = fin.read_u32();
-+        if (magic != LLAMA_FILE_MAGIC_GGLA) {
-+            LLAMA_LOG_ERROR("%s: bad file magic\n", __func__);
-+            return 1;
-+        }
-+
-+        uint32_t format_version = fin.read_u32();
-+        if (format_version != 1) {
-+            LLAMA_LOG_ERROR("%s: unsupported file version\n", __func__ );
-+            return 1;
-+        }
-+    }
-+
-+    int32_t lora_r = fin.read_u32();
-+    int32_t lora_alpha = fin.read_u32();
-+    float scaling = scale * (float)lora_alpha / (float)lora_r;
-+
-+    LLAMA_LOG_INFO("%s: r = %d, alpha = %d, scaling = %.2f\n", __func__, lora_r, lora_alpha, scaling);
-+
-+    // load base model
-+    std::unique_ptr<llama_model_loader> ml;
-+    if (path_base_model) {
-+        LLAMA_LOG_INFO("%s: loading base model from '%s'\n", __func__, path_base_model);
-+        ml.reset(new llama_model_loader(path_base_model, /*use_mmap*/ true, /*check_tensors*/ false, /*kv_overrides*/ nullptr));
-+        ml->init_mappings(/*prefetch*/ false); // no prefetching
-+    }
-+
-+    struct tensor_meta {
-+        std::string name;
-+        ggml_type type;
-+        int32_t ne[2];
-+        size_t offset;
-+    };
-+    std::map<std::string, tensor_meta> tensor_meta_map;
-+
-+    // load all tensor meta
-+    while (true) {
-+        if (fin.tell() == fin.size) {
-+            // eof
-+            break;
-+        }
-+
-+        int32_t n_dims;
-+        int32_t name_len;
-+        int32_t ftype;
-+
-+        fin.read_raw(&n_dims, sizeof(n_dims));
-+        fin.read_raw(&name_len, sizeof(name_len));
-+        fin.read_raw(&ftype, sizeof(ftype));
-+
-+        if (n_dims != 1 && n_dims != 2) {
-+            LLAMA_LOG_ERROR("%s: unsupported tensor dimension %d\n", __func__, n_dims);
-+            return 1;
-+        }
-+
-+        int32_t ne[2] = { 1, 1 };
-+        for (int i = 0; i < n_dims; ++i) {
-+            fin.read_raw(&ne[i], sizeof(ne[i]));
-+        }
-+
-+        std::string name;
-+        {
-+            GGML_ASSERT(name_len < GGML_MAX_NAME);
-+            char buf[GGML_MAX_NAME];
-+            fin.read_raw(buf, name_len);
-+            name = std::string(buf, name_len);
-+        }
-+
-+        // check for lora suffix
-+        std::string lora_suffix;
-+        if (name.length() > 6) {
-+            lora_suffix = name.substr(name.length() - 6);
-+        }
-+        if (lora_suffix != ".loraA" && lora_suffix != ".loraB") {
-+            LLAMA_LOG_ERROR("%s: error: '%s' is not a lora tensor\n", __func__, name.c_str());
-+            return 1;
-+        }
-+
-+        // tensor type
-+        ggml_type wtype;
-+        switch (ftype) {
-+            case 0: wtype = GGML_TYPE_F32;  break;
-+            case 1: wtype = GGML_TYPE_F16;  break;
-+            default:
-+                    {
-+                        LLAMA_LOG_ERROR("%s: invalid tensor data type '%d'\n",
-+                                __func__, ftype);
-+                        return 1;
-+                    }
-+        }
-+
-+        // data offset
-+        size_t offset = fin.tell();
-+        offset = (offset + 31) & -32;
-+
-+        // skip tensor data
-+        fin.seek(offset + ggml_row_size(wtype, ne[0]) * ne[1], SEEK_SET);
-+
-+        tensor_meta_map.emplace(name, tensor_meta{ name, wtype, { ne[0], ne[1] }, offset });
-+    }
-+
-+    bool warned = false;
-+    int n_tensors = 0;
-+
-+    // apply
-+    ggml_backend_t backend_cpu = ggml_backend_cpu_init();
-+    if (backend_cpu == nullptr) {
-+        LLAMA_LOG_ERROR("%s: error: failed to initialize cpu backend\n", __func__);
-+        return 1;
-+    }
-+    ggml_backend_cpu_set_n_threads(backend_cpu, n_threads);
-+
-+    std::vector<no_init<uint8_t>> read_buf;
-+    for (const auto & it : model.tensors_by_name) {
-+        const std::string & base_name = it.first;
-+        ggml_tensor * model_t = it.second;
-+
-+        if (tensor_meta_map.find(base_name + ".loraA") == tensor_meta_map.end() ||
-+            tensor_meta_map.find(base_name + ".loraB") == tensor_meta_map.end()) {
-+            continue;
-+        }
-+
-+        tensor_meta & metaA = tensor_meta_map.at(base_name + ".loraA");
-+        tensor_meta & metaB = tensor_meta_map.at(base_name + ".loraB");
-+
-+        ggml_init_params lora_init_params = {
-+            /* .mem_size   */ ggml_tensor_overhead()*128 + ggml_graph_overhead(),
-+            /* .mem_buffer */ nullptr,
-+            /* .no_alloc   */ true,
-+        };
-+        ggml_context * lora_ctx = ggml_init(lora_init_params);
-+        if (lora_ctx == nullptr) {
-+            LLAMA_LOG_ERROR("%s: error: failed to initialize lora context\n", __func__);
-+            ggml_backend_free(backend_cpu);
-+            return 1;
-+        }
-+
-+        // create tensors
-+        ggml_tensor * loraA = ggml_new_tensor_2d(lora_ctx, metaA.type, metaA.ne[0], metaA.ne[1]);
-+        ggml_tensor * loraB = ggml_new_tensor_2d(lora_ctx, metaB.type, metaB.ne[0], metaB.ne[1]);
-+        ggml_set_name(loraA, metaA.name.c_str());
-+        ggml_set_name(loraB, metaB.name.c_str());
-+
-+        ggml_tensor * base_t;
-+        if (ml) {
-+            if (!ml->get_tensor_meta(base_name.c_str())) {
-+                LLAMA_LOG_ERROR("%s: error: tensor '%s' not found in base model\n", __func__, base_name.c_str());
-+                return 1;
-+            }
-+            base_t = ggml_dup_tensor(lora_ctx, ml->get_tensor_meta(base_name.c_str()));
-+        } else {
-+            base_t = ggml_dup_tensor(lora_ctx, model_t);
-+        }
-+        ggml_set_name(base_t, base_name.c_str());
-+
-+        // allocate in backend buffer
-+        ggml_backend_buffer_t lora_buf = ggml_backend_alloc_ctx_tensors_from_buft(lora_ctx, ggml_backend_cpu_buffer_type());
-+        if (lora_buf == nullptr) {
-+            LLAMA_LOG_ERROR("%s: error: failed to allocate lora tensors\n", __func__);
-+            return 1;
-+        }
-+
-+        // load tensor data
-+        auto load_tensor = [&read_buf, &fin](const tensor_meta & tensor_meta, ggml_tensor * tensor) {
-+            read_buf.resize(ggml_nbytes(tensor));
-+            fin.seek(tensor_meta.offset, SEEK_SET);
-+            fin.read_raw(read_buf.data(), ggml_nbytes(tensor));
-+            ggml_backend_tensor_set(tensor, read_buf.data(), 0, read_buf.size());
-+        };
-+        load_tensor(metaA, loraA);
-+        load_tensor(metaB, loraB);
-+
-+        // load base model tensor data
-+        if (ml) {
-+            ml->load_data_for(base_t);
-+        } else {
-+            ggml_backend_tensor_copy(model_t, base_t);
-+        }
-+
-+        if (ggml_is_quantized(base_t->type) && !warned) {
-+            LLAMA_LOG_WARN("%s: warning: using a lora adapter with a quantized model may result in poor quality, "
-+                            "use a f16 or f32 base model with --lora-base\n", __func__);
-+            warned = true;
-+        }
-+
-+        if (base_t->ne[0] != loraA->ne[1] || base_t->ne[1] != loraB->ne[1]) {
-+            LLAMA_LOG_ERROR("%s: incompatible tensor dimensions (%" PRId64 " and %" PRId64 ");"
-+                            " are you sure that this adapter is for this model?\n", __func__, base_t->ne[0], loraA->ne[1]);
-+            ggml_free(lora_ctx);
-+            ggml_backend_buffer_free(lora_buf);
-+            ggml_backend_free(backend_cpu);
-+            return 1;
-+        }
-+
-+        auto build_lora_graph = [&]() {
-+            // w = w + BA*s
-+            ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB);
-+            ggml_set_name(BA, "BA");
-+
-+            if (scaling != 1.0f) {
-+                BA = ggml_scale(lora_ctx, BA, scaling);
-+                ggml_set_name(BA, "BA_scaled");
-+            }
-+
-+            ggml_tensor * r;
-+            r = ggml_add_inplace(lora_ctx, base_t, BA);
-+            ggml_set_name(r, "r_add");
-+
-+            if (base_t->type != model_t->type) {
-+                // convert the result to the model type
-+                r = ggml_cast(lora_ctx, r, model_t->type);
-+                ggml_set_name(r, "r_cast");
-+            }
-+
-+            return r;
-+        };
-+
-+        ggml_cgraph * gf = ggml_new_graph(lora_ctx);
-+        ggml_tensor * r = build_lora_graph();
-+        ggml_build_forward_expand(gf, r);
-+
-+        ggml_backend_buffer_t graph_buf = ggml_backend_alloc_ctx_tensors_from_buft(lora_ctx, ggml_backend_cpu_buffer_type());
-+        if (graph_buf == nullptr) {
-+            LLAMA_LOG_ERROR("%s: error: failed to allocate graph tensors\n", __func__);
-+            ggml_free(lora_ctx);
-+            ggml_backend_buffer_free(lora_buf);
-+            ggml_backend_free(backend_cpu);
-+            return 1;
-+        }
-+
-+        ggml_backend_graph_compute(backend_cpu, gf);
-+
-+        ggml_backend_tensor_set(model_t, r->data, 0, ggml_nbytes(r));
-+
-+#if 0
-+        // TODO: use scheduler with fallback to CPU for less copies between CPU and GPU
-+        //ggml_backend_sched_t sched = ggml_backend_sched_new(backends.data(), backends.size(), GGML_DEFAULT_GRAPH_SIZE);
-+
-+        // sched compute
-+        ggml_build_forward_expand(gf, build_graph());
-+        ggml_backend_sched_init_measure(sched, gf);
-+
-+        // create the graph again, since the previous one was destroyed by the measure
-+        ggml_graph_clear(gf);
-+        ggml_build_forward_expand(gf, build_graph());
-+        ggml_backend_sched_graph_compute(sched, gf);
-+        ggml_backend_sched_free(sched);
-+#endif
-+
-+        ggml_backend_buffer_free(lora_buf);
-+        ggml_backend_buffer_free(graph_buf);
-+        ggml_free(lora_ctx);
-+
-+        n_tensors++;
-+        if (n_tensors % 4 == 0) {
-+            LLAMA_LOG_INFO(".");
-+        }
-+    }
-+
-+    ggml_backend_free(backend_cpu);
-+
-+    const int64_t t_lora_us = ggml_time_us() - t_start_lora_us;
-+    LLAMA_LOG_INFO(" done (%.2f ms)\n", t_lora_us / 1000.0);
-+
-+    return 0;
-+}
-+
-+int32_t llama_model_apply_lora_from_file(const struct llama_model * model, const char * path_lora, float scale, const char * path_base_model, int32_t n_threads) {
-+    try {
-+        return llama_apply_lora_from_file_internal(*model, path_lora, scale, path_base_model, n_threads);
-+    } catch (const std::exception & err) {
-+        LLAMA_LOG_ERROR("%s: failed to apply lora adapter: %s\n", __func__, err.what());
-+        return 1;
-+    }
-+}
-\ No newline at end of file
--- a/llm/patches/11-phi3-sliding-window.diff
+++ b/llm/patches/11-phi3-sliding-window.diff
@@ -1,43 +0,0 @@
-From 6eedae4cf2fcc8015dac79cb3f28f61fcabacab2 Mon Sep 17 00:00:00 2001
-From: Michael Yang <mxyng@pm.me>
-Date: Wed, 31 Jul 2024 14:57:04 -0700
-Subject: [PATCH] phi3 sliding window
-
----
- src/llama.cpp | 6 +++---
- 1 file changed, 3 insertions(+), 3 deletions(-)
-
-diff --git a/src/llama.cpp b/src/llama.cpp
-index a207451f..f2872d4e 100644
---- a/src/llama.cpp
-+++ b/src/llama.cpp
-@@ -4893,7 +4893,7 @@ static void llm_load_hparams(
-             } break;
-         case LLM_ARCH_PHI3:
-             {
--                ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa);
-+                ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
-                 ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
- 
-                 switch (hparams.n_layer) {
-@@ -10762,7 +10762,7 @@ struct llm_build_context {
-         struct ggml_tensor * inp_pos = build_inp_pos();
- 
-         // KQ_mask (mask for 1 head, it will be broadcasted to all heads)
--        struct ggml_tensor * KQ_mask_swa = build_inp_KQ_mask_swa();
-+        struct ggml_tensor * KQ_mask = hparams.n_swa > 0 ? build_inp_KQ_mask_swa() : build_inp_KQ_mask();
- 
-         for (int il = 0; il < n_layer; ++il) {
-             auto residual = inpL;
-@@ -10820,7 +10820,7 @@ struct llm_build_context {
- 
-                 cur = llm_build_kv(ctx0, lctx, kv_self, gf,
-                         model.layers[il].wo, model.layers[il].bo,
--                        Kcur, Vcur, Qcur, KQ_mask_swa, n_tokens, kv_head, n_kv, 1.0f, cb, il);
-+                        Kcur, Vcur, Qcur, KQ_mask, n_tokens, kv_head, n_kv, 1.0f, cb, il);
-             }
- 
-             if (il == n_layer - 1) {
--- 
-2.45.2
-
--- a/llm/server.go
+++ b/llm/server.go
@@ -98,7 +98,7 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr
 		systemTotalMemory = systemMemInfo.TotalMemory
 		systemFreeMemory = systemMemInfo.FreeMemory
 		systemSwapFreeMemory = systemMemInfo.FreeSwap
-		slog.Debug("system memory", "total", format.HumanBytes2(systemTotalMemory), "free", format.HumanBytes2(systemFreeMemory), "free_swap", format.HumanBytes2(systemSwapFreeMemory))
+		slog.Info("system memory", "total", format.HumanBytes2(systemTotalMemory), "free", format.HumanBytes2(systemFreeMemory), "free_swap", format.HumanBytes2(systemSwapFreeMemory))
 	}

 	// If the user wants zero GPU layers, reset the gpu list to be CPU/system ram info
--- a/scripts/install.sh
+++ b/scripts/install.sh
@@ -38,7 +38,7 @@ IS_WSL2=false
 KERN=$(uname -r)
 case "$KERN" in
    *icrosoft*WSL2 | *icrosoft*wsl2) IS_WSL2=true;;
-    *icrosoft) error "Microsoft WSL1 is not currently supported. Please upgrade to WSL2 with 'wsl --set-version <distro> 2'" ;;
+    *icrosoft) error "Microsoft WSL1 is not currently supported. Please use WSL2 with 'wsl --set-version <distro> 2'" ;;
    *) ;;
 esac
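The kernel check in the install.sh hunk above relies on shell glob patterns over the output of `uname -r`. As a minimal sketch of how those patterns sort kernel strings, the snippet below wraps the same `case` arms in a helper. The name `classify_kernel` and the sample release strings are assumptions for illustration only; neither appears in install.sh.

```shell
#!/bin/sh
# classify_kernel is a hypothetical helper, not part of install.sh.
# It applies the same glob patterns as the case statement in the hunk above.
classify_kernel() {
    case "$1" in
        # "icrosoft" followed later by a trailing WSL2/wsl2 suffix
        *icrosoft*WSL2 | *icrosoft*wsl2) echo "wsl2" ;;
        # string ends in "icrosoft" with no WSL2 suffix: treated as WSL1
        *icrosoft) echo "wsl1" ;;
        # anything else is left alone
        *) echo "other" ;;
    esac
}

# Illustrative release strings (assumed, not taken from the diff).
classify_kernel "5.15.90.1-microsoft-standard-WSL2"   # prints "wsl2"
classify_kernel "4.4.0-19041-Microsoft"               # prints "wsl1"
classify_kernel "6.8.0-41-generic"                    # prints "other"
```

The patterns start at `icrosoft` rather than `Microsoft`, so one arm matches either capitalization of the first letter. The arms are order-sensitive: the WSL2 arm must come first, although in practice a string ending in `WSL2` cannot also end in `icrosoft`.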
Author	SHA1	Message	Date
Patrick Devine	db8c944498	fix gemma2-2b conversion	2024-09-04 16:59:23 -07:00
jk011ru	b3554778bd	readme: add vnc-lm discord bot community integration (#6644)	2024-09-04 19:46:02 -04:00
Pascal Patry	bbe7b96ded	llm: use json.hpp from common (#6642)	2024-09-04 19:34:42 -04:00
Rune Berg	c18ff18b2c	readme: add confichat to community integrations (#6378)	2024-09-04 17:26:02 -04:00
Tomoya Fujita	133770a548	docs: add group to manual Linux isntructions and verify service is running (#6430)	2024-09-04 14:45:09 -04:00
Teïlo M	f36ebfb478	readme: add gollm to the list of community libraries (#6099)	2024-09-04 14:19:41 -04:00
亢奋猫	5b55379651	readme: add Cherry Studio to community integrations (#6633)	2024-09-04 10:53:36 -04:00
Mitar	93eb43d020	readme: add Go fun package (#6421)	2024-09-04 10:52:46 -04:00
Carter	369479cc30	docs: fix spelling error (#6391) change "dorrect" to "correct"	2024-09-04 09:42:33 -04:00
Erkin Alp Güney	7d89e48f5c	install.sh: update instructions to use WSL2 (#6450)	2024-09-04 09:34:53 -04:00
Sam	27bcce6d9f	readme: add claude-dev to community integrations (#6630)	2024-09-04 09:32:26 -04:00
Viz	491fc312ae	readme: add PyOllaMx project (#6624)	2024-09-03 23:10:53 -04:00
Jeffrey Morgan	5e2653f9fe	llm: update llama.cpp commit to 8962422 (#6618)	2024-09-03 21:12:39 -04:00
Daniel Hiltgen	f29b167e1a	Use cuda v11 for driver 525 and older (#6620) It looks like driver 525 (aka, cuda driver 12.0) has problems with the cuda v12 library we compile against, so run v11 on those older drivers if detected.	2024-09-03 17:15:31 -07:00
Daniel Hiltgen	037a4d103e	Log system memory at info (#6617) On systems with low system memory, we can hit allocation failures that are difficult to diagnose without debug logs. This will make it easier to spot.	2024-09-03 14:55:20 -07:00
Mateusz Migas	50c05d57e0	readme: add Painting Droid community integration (#5514)	2024-09-03 16:15:54 -04:00
Amith Koujalgi	35159de18a	readme: update Ollama4j link and add link to Ollama4j Web UI (#6608)	2024-09-03 16:08:50 -04:00
FellowTraveler	94fff5805f	Fix sprintf to snprintf (#5664) /Users/au/src/ollama/llm/ext_server/server.cpp:289:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.	2024-09-03 09:32:59 -07:00
OpenVMP	14d5093cd0	readme: add PartCAD tool to readme for generating 3D CAD models using Ollama (#6605)	2024-09-03 12:28:01 -04:00
R0CKSTAR	9df5f0e8e4	Reduce docker image size (#5847) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2024-09-03 09:25:31 -07:00
presbrey	ad3eb00bee	readme: add OllamaFarm project (#6508)	2024-09-02 16:05:36 -04:00
Jonathan Hecl	bfc2d61549	readme: add go-crew and Ollamaclient projects (#6583)	2024-09-02 15:34:26 -04:00
SnoopyTlion	741affdfd6	docs: update faq.md for OLLAMA_MODELS env var permissions (#6587)	2024-09-02 15:31:29 -04:00
Vimal Kumar	5f7b4a5e30	fix(cmd): show info may have nil ModelInfo (#6579)	2024-08-31 21:12:17 -07:00
rayfiyo	1aad838707	docs: update GGUF examples and references (#6577)	2024-08-31 19:34:25 -07:00