runner.go: Check for zero length images

If we get a request with a zero length image, it will result in an out-of-bounds error when we pass the data to the image encoder.
14 changed files with 38 additions and 29 deletions
--- a/api/types.go
+++ b/api/types.go
@@ -236,7 +236,7 @@ type Runner struct {
 	NumGPU    int   `json:"num_gpu,omitempty"`
 	MainGPU   int   `json:"main_gpu,omitempty"`
 	LowVRAM   bool  `json:"low_vram,omitempty"`
-	F16KV     bool  `json:"f16_kv,omitempty"`
+	F16KV     bool  `json:"f16_kv,omitempty"` // Deprecated: This option is ignored
 	LogitsAll bool  `json:"logits_all,omitempty"`
 	VocabOnly bool  `json:"vocab_only,omitempty"`
 	UseMMap   *bool `json:"use_mmap,omitempty"`
@@ -613,7 +613,6 @@ func DefaultOptions() Options {
 			NumGPU:    -1, // -1 here indicates that NumGPU should be set dynamically
 			NumThread: 0,  // let the runtime decide
 			LowVRAM:   false,
-			F16KV:     true,
 			UseMLock:  false,
 			UseMMap:   nil,
 		},
--- a/app/ollama.iss
+++ b/app/ollama.iss
@@ -136,7 +136,7 @@ Type: filesandordirs; Name: "{%TEMP}\ollama*"
 Type: filesandordirs; Name: "{%LOCALAPPDATA}\Programs\Ollama"

 [Messages]
-WizardReady=Ollama Windows Preview
+WizardReady=Ollama
 ReadyLabel1=%nLet's get you up and running with your own large language models.
 SetupAppRunningError=Another Ollama installer is running.%n%nPlease cancel or finish the other installer, then click OK to continue with this install, or Cancel to exit.

--- a/discover/gpu_info_nvcuda.c
+++ b/discover/gpu_info_nvcuda.c
@@ -4,6 +4,7 @@
 #include "gpu_info_nvcuda.h"

 void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
+  LOG(resp->ch.verbose, "initializing %s\n", nvcuda_lib_path);
  CUresult ret;
  resp->err = NULL;
  resp->num_devices = 0;
@@ -57,8 +58,10 @@ void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
      resp->cudaErr = -1;
      return;
    }
+    LOG(resp->ch.verbose, "dlsym: %s - %p\n", l[i].s, *l[i].p);
  }

+  LOG(resp->ch.verbose, "calling cuInit\n");
  ret = (*resp->ch.cuInit)(0);
  if (ret != CUDA_SUCCESS) {
    LOG(resp->ch.verbose, "cuInit err: %d\n", ret);
@@ -75,15 +78,18 @@ void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
  resp->ch.driver_minor = 0;

  // Report driver version if we're in verbose mode, ignore errors
+  LOG(resp->ch.verbose, "calling cuDriverGetVersion\n");
  ret = (*resp->ch.cuDriverGetVersion)(&version);
  if (ret != CUDA_SUCCESS) {
    LOG(resp->ch.verbose, "cuDriverGetVersion failed: %d\n", ret);
  } else {
+    LOG(resp->ch.verbose, "raw version 0x%x\n", version);
    resp->ch.driver_major = version / 1000;
    resp->ch.driver_minor = (version - (resp->ch.driver_major * 1000)) / 10;
    LOG(resp->ch.verbose, "CUDA driver version: %d.%d\n", resp->ch.driver_major, resp->ch.driver_minor);
  }

+  LOG(resp->ch.verbose, "calling cuDeviceGetCount\n");
  ret = (*resp->ch.cuDeviceGetCount)(&resp->num_devices);
  if (ret != CUDA_SUCCESS) {
    LOG(resp->ch.verbose, "cuDeviceGetCount err: %d\n", ret);
@@ -94,6 +100,7 @@ void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
    resp->cudaErr = ret;
    return;
  }
+  LOG(resp->ch.verbose, "device count %d\n", resp->num_devices);
 }

 const int buflen = 256;
--- a/docs/api.md
+++ b/docs/api.md
@@ -355,7 +355,6 @@ curl http://localhost:11434/api/generate -d '{
    "num_gpu": 1,
    "main_gpu": 0,
    "low_vram": false,
-    "f16_kv": true,
    "vocab_only": false,
    "use_mmap": true,
    "use_mlock": false,
--- a/docs/development.md
+++ b/docs/development.md
@@ -108,7 +108,7 @@ Custom CPU settings are not currently supported in the new Go server build but w

 #### Containerized Linux Build

-If you have Docker available, you can build linux binaries with `OLLAMA_NEW_RUNNERS=1 ./scripts/build_linux.sh` which has the CUDA and ROCm dependencies included. The resulting binary is placed in `./dist`
+If you have Docker available, you can build linux binaries with `./scripts/build_linux.sh` which has the CUDA and ROCm dependencies included. The resulting binary is placed in `./dist`

 ### Windows

--- a/docs/tutorials/langchainpy.md
+++ b/docs/tutorials/langchainpy.md
@@ -10,7 +10,7 @@ This sounds like a typical censored response, but even llama2-uncensored gives a

 So let's figure out how we can use **LangChain** with Ollama to ask our question to the actual document, the Odyssey by Homer, using Python.

-Let's start by asking a simple question that we can get an answer to from the **Llama2** model using **Ollama**. First, we need to install the **LangChain** package:
+Let's start by asking a simple question that we can get an answer to from the **Llama3** model using **Ollama**. First, we need to install the **LangChain** package:

 `pip install langchain_community`

--- a/llama/make/Makefile.rocm
+++ b/llama/make/Makefile.rocm
@@ -58,6 +58,8 @@ endif
 GPU_COMPILER_CUFLAGS = \
 	$(GPU_COMPILER_FPIC) \
 	$(addprefix -m,$(GPU_RUNNER_CPU_FLAGS)) \
+	-mf16c \
+	-mfma \
 	-parallel-jobs=2 \
 	-c \
 	-O3 \
@@ -77,6 +79,9 @@ GPU_COMPILER_CUFLAGS = \
 	-D_CRT_SECURE_NO_WARNINGS \
 	-D_GNU_SOURCE \
 	-D_XOPEN_SOURCE=600 \
+	-DUSE_PROF_API=1 \
+	-std=gnu++14 \
+	-x hip \
 	-mllvm=-amdgpu-early-inline-all=true \
 	-mllvm=-amdgpu-function-calls=false \
 	-Wno-expansion-to-defined \
@@ -87,6 +92,12 @@ GPU_COMPILER_CUFLAGS = \
 	-Wno-unused-result \
 	-I.

+# Workaround buggy P2P copy on some windows multi-GPU setups
+# This workaround breaks linux systems with small system RAM, so only enable on windows
+ifeq ($(OS),windows)
+	GPU_COMPILER_CUFLAGS += -DGGML_CUDA_NO_PEER_COPY=1
+endif
+
 include make/gpu.make

 # Adjust the rules from gpu.make to handle the ROCm dependencies properly
--- a/llama/make/gpu.make
+++ b/llama/make/gpu.make
@@ -85,7 +85,7 @@ $(RUNNERS_BUILD_DIR)/$(GPU_RUNNER_NAME)/ollama_llama_server$(EXE_EXT): $(RUNNERS
 	GOARCH=$(ARCH) CGO_LDFLAGS="$(TARGET_CGO_LDFLAGS)" go build -buildmode=pie  $(GPU_GOFLAGS) -trimpath -tags $(subst $(space),$(comma),$(GPU_RUNNER_CPU_FLAGS) $(GPU_RUNNER_GO_TAGS)) -o $@ ./runner
 $(RUNNERS_BUILD_DIR)/$(GPU_RUNNER_NAME)/$(SHARED_PREFIX)ggml_$(GPU_RUNNER_NAME).$(SHARED_EXT): $(GPU_RUNNER_OBJS) $(DIST_GPU_RUNNER_LIB_DEPS) $(COMMON_HDRS) $(GPU_RUNNER_HDRS)
 	@-mkdir -p $(dir $@)
-	$(CCACHE) $(GPU_COMPILER) --shared $(GPU_RUNNER_DRIVER_LIB_LINK) -L${DIST_GPU_RUNNER_DEPS_DIR} $(foreach lib, $(GPU_RUNNER_LIBS_SHORT), -l$(lib)) $(GPU_RUNNER_OBJS) -o $@
+	$(CCACHE) $(GPU_COMPILER) --shared -L$(GPU_LIB_DIR) $(GPU_RUNNER_DRIVER_LIB_LINK) -L${DIST_GPU_RUNNER_DEPS_DIR} $(foreach lib, $(GPU_RUNNER_LIBS_SHORT), -l$(lib)) $(GPU_RUNNER_OBJS) -o $@

 # Distribution targets
 $(RUNNERS_DIST_DIR)/%: $(RUNNERS_BUILD_DIR)/%
--- a/llama/runner/image.go
+++ b/llama/runner/image.go
@@ -68,6 +68,10 @@ func (c *ImageContext) NewEmbed(llamaContext *llama.Context, data []byte, aspect
 		return nil, nil
 	}

+	if len(data) <= 0 {
+		return nil, errors.New("received zero length image")
+	}
+
 	hash := c.hashImage(data)

 	c.mu.Lock()
--- a/llama/runner/runner.go
+++ b/llama/runner/runner.go
@@ -837,14 +837,8 @@ func main() {
 	mlock := flag.Bool("mlock", false, "force system to keep model in RAM rather than swapping or compressing")
 	tensorSplit := flag.String("tensor-split", "", "fraction of the model to offload to each GPU, comma-separated list of proportions")
 	multiUserCache := flag.Bool("multiuser-cache", false, "optimize input cache algorithm for multiple users")
-	// Expose requirements as a JSON output to stdout
 	requirements := flag.Bool("requirements", false, "print json requirement information")

-	// These are either ignored by llama.cpp or have no significance to us
-	_ = flag.Bool("embedding", false, "enable embedding vector output (default: disabled)")
-	_ = flag.Bool("log-disable", false, "disables logging to a file")
-	_ = flag.Bool("memory-f32", false, "use f32 instead of f16 for memory key+value (default: disabled) not recommended: doubles context memory required and no measurable increase in quality")
-
 	flag.Parse()
 	if *requirements {
 		printRequirements(os.Stdout)
--- a/llm/server.go
+++ b/llm/server.go
@@ -186,7 +186,6 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, ggml *GGML, adapter
 		"--model", model,
 		"--ctx-size", strconv.Itoa(opts.NumCtx),
 		"--batch-size", strconv.Itoa(opts.NumBatch),
-		"--embedding",
 	}

 	if opts.NumGPU >= 0 {
@@ -218,10 +217,6 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, ggml *GGML, adapter
 		params = append(params, "--threads", strconv.Itoa(defaultThreads))
 	}

-	if !opts.F16KV {
-		params = append(params, "--memory-f32")
-	}
-
 	flashAttnEnabled := envconfig.FlashAttention()

 	for _, g := range gpus {
--- a/parser/parser_test.go
+++ b/parser/parser_test.go
@@ -440,7 +440,6 @@ func TestParseFileParameters(t *testing.T) {
 		"num_gpu 1":                    {"num_gpu", "1"},
 		"main_gpu 1":                   {"main_gpu", "1"},
 		"low_vram true":                {"low_vram", "true"},
-		"f16_kv true":                  {"f16_kv", "true"},
 		"logits_all true":              {"logits_all", "true"},
 		"vocab_only true":              {"vocab_only", "true"},
 		"use_mmap true":                {"use_mmap", "true"},
--- a/scripts/build_darwin.sh
+++ b/scripts/build_darwin.sh
@@ -6,17 +6,18 @@ set -e

 mkdir -p dist

+# These require Xcode v13 or older to target MacOS v11
+# If installed to an alternate location use the following to enable
+# export SDKROOT=/Applications/Xcode_12.5.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
+# export DEVELOPER_DIR=/Applications/Xcode_12.5.1.app/Contents/Developer
+export CGO_CFLAGS=-mmacosx-version-min=11.3
+export CGO_CXXFLAGS=-mmacosx-version-min=11.3
+export CGO_LDFLAGS=-mmacosx-version-min=11.3
+
 for TARGETARCH in arm64 amd64; do
    echo "Building Go runner darwin $TARGETARCH"
    rm -rf llama/build
    GOOS=darwin ARCH=$TARGETARCH GOARCH=$TARGETARCH make -C llama -j 8
-    # These require Xcode v13 or older to target MacOS v11
-    # If installed to an alternate location use the following to enable
-    # export SDKROOT=/Applications/Xcode_12.5.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
-    # export DEVELOPER_DIR=/Applications/Xcode_12.5.1.app/Contents/Developer
-    export CGO_CFLAGS=-mmacosx-version-min=11.3
-    export CGO_CXXFLAGS=-mmacosx-version-min=11.3
-    export CGO_LDFLAGS=-mmacosx-version-min=11.3
    CGO_ENABLED=1 GOOS=darwin GOARCH=$TARGETARCH go build -trimpath -o dist/ollama-darwin-$TARGETARCH
    CGO_ENABLED=1 GOOS=darwin GOARCH=$TARGETARCH go build -trimpath -cover -o dist/ollama-darwin-$TARGETARCH-cov
 done
--- a/server/sched.go
+++ b/server/sched.go
@@ -130,11 +130,11 @@ func (s *Scheduler) processPending(ctx context.Context) {
 				continue
 			}
 			numParallel := int(envconfig.NumParallel())
-			// TODO (jmorganca): multimodal models don't support parallel yet
+			// TODO (jmorganca): mllama doesn't support parallel yet
 			// see https://github.com/ollama/ollama/issues/4165
-			if len(pending.model.ProjectorPaths) > 0 && numParallel != 1 {
+			if checkMllamaModelFamily(pending.model) && numParallel != 1 {
 				numParallel = 1
-				slog.Warn("multimodal models don't support parallel requests yet")
+				slog.Warn("mllama doesn't support parallel requests yet")
 			}

 			for {
Author	SHA1	Message	Date
Jesse Gross	c2e8cbaa14	runner.go: Check for zero length images If we get a request with a zero length image, it will result in an out-of-bounds error when we pass the data to the image encoder.	2024-11-08 09:39:32 -08:00
Edward J. Schwartz	771fab1dd8	docs: update langchainpy.md with proper model name (#7527)	2024-11-08 09:36:17 -08:00
Daniel Hiltgen	3a5239e6bf	Set macos min version for all architectures (#7579)	2024-11-08 09:27:04 -08:00
Daniel Hiltgen	3d25e7bf8c	win: remove preview title from installer (#7529) This should have been in #7347 but was overlooked.	2024-11-07 14:26:47 -08:00
Daniel Hiltgen	1618700c5a	Workaround buggy P2P ROCm copy on windows (#7466) This enables the workaround code only for windows which should help windows users with multiple AMD GPUs	2024-11-07 14:26:31 -08:00
Daniel Hiltgen	b111aa5a91	Debug logging for nvcuda init (#7532) Some users are reporting crashes during nvcuda.dll initialization on windows. This should help narrow down where things are going bad.	2024-11-07 14:25:53 -08:00
Daniel Hiltgen	9e83e550e1	Align rocm compiler flags (#7467) Bring consistency with the old generate script behavior	2024-11-07 10:20:50 -08:00
Daniel Hiltgen	fc2a0715df	Be explicit for gpu library link dir (#7560) On linux nvcc isn't automatically linking to the same cuda version.	2024-11-07 09:20:40 -08:00
Jesse Gross	3020d2dc58	docs: OLLAMA_NEW_RUNNERS no longer exists	2024-11-06 14:39:02 -08:00
Jesse Gross	a909417602	runner.go: Remove unused arguments Now that server.cpp is gone, we don't need to keep passing arguments that were only ignored and only kept for compatibility.	2024-11-06 13:32:18 -08:00
Jesse Gross	6cd566872b	sched: Lift parallel restriction for multimodal models except mllama The Go runner does not have a problem with supporting parallel requests for most multimodal models. Now that we won't be potentially falling back to server.cpp, this restriction can be lifted. However, the new mllama model can't support parallel requests, so we will need to keep a restriction for that.	2024-11-06 13:32:18 -08:00
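
The parallel-request restriction described in the last commit above can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's actual code: `effectiveParallel` is a hypothetical helper, and the `isMllama` flag stands in for whatever model-family check the scheduler performs.

```go
package main

import "fmt"

// effectiveParallel returns the number of parallel request slots to use.
// isMllama is a hypothetical stand-in for the scheduler's model-family
// check; only mllama models are clamped to a single slot.
func effectiveParallel(requested int, isMllama bool) int {
	if isMllama && requested != 1 {
		return 1
	}
	return requested
}

func main() {
	// Other models, including other multimodal ones, keep the requested value.
	fmt.Println(effectiveParallel(4, false))
	// mllama is limited to one request at a time.
	fmt.Println(effectiveParallel(4, true))
}
```

Keying the clamp on the model family rather than on the presence of projector files is what lifts the restriction for the other multimodal models.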