ollama

Commit Graph

Author	SHA1	Message	Date
Daniel Hiltgen	d632e23fba	Add Windows arm64 support to official builds (#5712 ) * Unified arm/x86 windows installer This adjusts the installer payloads to be architecture aware so we can cary both amd64 and arm64 binaries in the installer, and install only the applicable architecture at install time. * Include arm64 in official windows build * Harden schedule test for slow windows timers This test seems to be a bit flaky on windows, so give it more time to converge	2024-09-20 13:09:38 -07:00
Michael Yang	504a410f02	llm: add solar pro (preview) (#6846 )	2024-09-17 18:11:26 -07:00
Michael Yang	7bd7b02712	make patches git am-able raw diffs can be applied using `git apply` but not with `git am`. git patches, e.g. through `git format-patch` are both apply-able and am-able	2024-09-17 15:26:40 -07:00
Daniel Hiltgen	56b9af336a	Fix incremental builds on linux (#6780 ) scripts: fix incremental builds on linux or similar	2024-09-13 08:24:08 -07:00
Daniel Hiltgen	fda0d3be52	Use GOARCH for build dirs (#6779 ) Corrects x86_64 vs amd64 discrepancy	2024-09-12 16:38:05 -07:00
Daniel Hiltgen	cd5c8f6471	Optimize container images for startup (#6547 ) * Optimize container images for startup This change adjusts how to handle runner payloads to support container builds where we keep them extracted in the filesystem. This makes it easier to optimize the cpu/cuda vs cpu/rocm images for size, and should result in faster startup times for container images. * Refactor payload logic and add buildx support for faster builds * Move payloads around * Review comments * Converge to buildx based helper scripts * Use docker buildx action for release	2024-09-12 12:10:30 -07:00
Jesse Gross	93ac3760cb	runner: Flush pending responses before returning If there are any pending reponses (such as from potential stop tokens) then we should send them back before ending the sequence. Otherwise, we can be missing tokens at the end of a response. Fixes #6707	2024-09-11 16:39:32 -07:00
Daniel Hiltgen	4a8069f9c4	Quiet down dockers new lint warnings (#6716 ) * Quiet down dockers new lint warnings Docker has recently added lint warnings to build. This cleans up those warnings. * Fix go lint regression	2024-09-09 17:22:20 -07:00
Daniel Hiltgen	56318fb365	Improve logging on GPU too small (#6666 ) When we determine a GPU is too small for any layers, it's not always clear why. This will help troubleshoot those scenarios.	2024-09-06 08:29:36 -07:00
Daniel Hiltgen	6719097649	llm: make load time stall duration configurable via OLLAMA_LOAD_TIMEOUT With the new very large parameter models, some users are willing to wait for a very long time for models to load.	2024-09-05 14:00:08 -07:00
Daniel Hiltgen	b05c9e83d9	Introduce GPU Overhead env var (#5922 ) Provide a mechanism for users to set aside an amount of VRAM on each GPU to make room for other applications they want to start after Ollama, or workaround memory prediction bugs	2024-09-05 13:46:35 -07:00
Michael Yang	bf612cd608	Merge pull request #6260 from ollama/mxyng/mem llama3.1 memory	2024-09-05 13:22:08 -07:00
Pascal Patry	bbe7b96ded	llm: use json.hpp from common (#6642 )	2024-09-04 19:34:42 -04:00
Jeffrey Morgan	5e2653f9fe	llm: update llama.cpp commit to 8962422 (#6618 )	2024-09-03 21:12:39 -04:00
Daniel Hiltgen	037a4d103e	Log system memory at info (#6617 ) On systems with low system memory, we can hit allocation failures that are difficult to diagnose without debug logs. This will make it easier to spot.	2024-09-03 14:55:20 -07:00
FellowTraveler	94fff5805f	Fix sprintf to snprintf (#5664 ) /Users/au/src/ollama/llm/ext_server/server.cpp:289:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.	2024-09-03 09:32:59 -07:00
Michael Yang	11018196e0	remove any unneeded build artifacts	2024-08-29 13:40:47 -07:00
Sean Khatiri	397cae7962	llm: fix typo in comment (#6530 )	2024-08-27 13:28:29 -07:00
Daniel Hiltgen	0f92b19bec	Only enable numa on CPUs (#6484 ) The numa flag may be having a performance impact on multi-socket systems with GPU loads	2024-08-24 17:24:50 -07:00
Patrick Devine	0c819e167b	convert safetensor adapters into GGUF (#6327 )	2024-08-23 11:29:56 -07:00
Daniel Hiltgen	0b03b9c32f	llm: Align cmake define for cuda no peer copy (#6455 ) Define changed recently and this slipped through the cracks with the old name.	2024-08-23 11:20:39 -07:00
Daniel Hiltgen	90ca84172c	Fix embeddings memory corruption (#6467 ) * Fix embeddings memory corruption The patch was leading to a buffer overrun corruption. Once removed though, parallism in server.cpp lead to hitting an assert due to slot/seq IDs being >= token count. To work around this, only use slot 0 for embeddings. * Fix embed integration test assumption The token eval count has changed with recent llama.cpp bumps (0.3.5+)	2024-08-22 14:51:42 -07:00
Michael Yang	77903ab8b4	llama3.1	2024-08-21 11:49:31 -07:00
Daniel Hiltgen	a017cf2fea	Split rocm back out of bundle (#6432 ) We're over budget for github's maximum release artifact size with rocm + 2 cuda versions. This splits rocm back out as a discrete artifact, but keeps the layout so it can be extracted into the same location as the main bundle.	2024-08-20 07:26:38 -07:00
Daniel Hiltgen	f9e31da946	Review comments	2024-08-19 10:36:15 -07:00
Daniel Hiltgen	88bb9e3328	Adjust layout to bin+lib/ollama	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	927d98a6cd	Add windows cuda v12 + v11 support	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	d470ebe78b	Add Jetson cuda variants for arm This adds new variants for arm64 specific to Jetson platforms	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	c7bcb00319	Wire up ccache and pigz in the docker based build This should help speed things up a little	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	74d45f0102	Refactor linux packaging This adjusts linux to follow a similar model to windows with a discrete archive (zip/tgz) to cary the primary executable, and dependent libraries. Runners are still carried as payloads inside the main binary Darwin retain the payload model where the go binary is fully self contained.	2024-08-19 09:38:53 -07:00
Michael Yang	6ffb5cb017	add conversion for microsoft phi 3 mini/medium 4k, 128	2024-08-12 15:13:29 -07:00
Jeffrey Morgan	15c2d8fe14	server: parallelize embeddings in API web handler instead of in subprocess runner (#6220 ) For simplicity, perform parallelization of embedding requests in the API handler instead of offloading this to the subprocess runner. This keeps the scheduling story simpler as it builds on existing parallel requests, similar to existing text completion functionality.	2024-08-11 11:57:10 -07:00
Daniel Hiltgen	25906d72d1	llm: prevent loading too large models on windows (#5926 ) Don't allow loading models that would lead to memory exhaustion (across vram, system memory and disk paging). This check was already applied on Linux but should also be applied on Windows as well.	2024-08-11 11:30:20 -07:00
Daniel Hiltgen	2473bdba5e	Merge pull request #6182 from dhiltgen/more_patterns Catch one more error log	2024-08-08 12:33:17 -07:00
Michael Yang	2003d60159	llama3.1 memory	2024-08-08 11:18:13 -07:00
Jeffrey Morgan	de4fc29773	llm: reserve required number of slots for embeddings (#6219 )	2024-08-06 23:20:49 -04:00
Jeffrey Morgan	e04c7012c2	update llama.cpp submodule to `1e6f6554` (#6208 )	2024-08-06 15:11:45 -04:00
royjhan	86b907f82a	sort batch results (#6189 )	2024-08-05 16:55:34 -07:00
Daniel Hiltgen	f457d63400	Implement linux NUMA detection If the system has multiple numa nodes, enable numa support in llama.cpp If we detect numactl in the path, use that, else use the basic "distribute" mode.	2024-08-05 12:56:20 -07:00
Daniel Hiltgen	04210aa6dd	Catch one more error log	2024-08-05 09:28:07 -07:00
Michael Yang	6a07344786	line feed	2024-08-04 17:25:41 -07:00
Michael Yang	b732beba6a	lint	2024-08-01 17:06:06 -07:00
Michael Yang	0ff42e84b0	Merge pull request #4756 from ollama/mxyng/convert2 refactor convert	2024-08-01 14:16:30 -07:00
Michael Yang	df993fa37b	comments	2024-07-31 15:58:55 -07:00
Michael Yang	5e9db9fb0b	refactor convert	2024-07-31 15:58:33 -07:00
Michael Yang	0f3271db88	patches: phi3 default sliding window attention	2024-07-31 14:58:34 -07:00
Michael Yang	6b252918fb	update convert test to check result data	2024-07-31 10:59:38 -07:00
Michael Yang	5c1912769e	Merge pull request #5473 from ollama/mxyng/environ fix: environ lookup	2024-07-31 10:18:05 -07:00
jmorganca	afa8d6e9d5	patch gemma support	2024-07-30 18:07:29 -07:00
royjhan	1b44d873e7	Add Metrics to `api\embed` response (#5709 ) * add prompt tokens to embed response * rm slog * metrics * types * prompt n * clean up * reset submodule * update tests * test name * list metrics	2024-07-30 13:12:21 -07:00
Jeffrey Morgan	68ee42f995	update llama.cpp submodule to `6eeaeba1` (#6039 )	2024-07-29 13:20:26 -07:00
Tibor Schmidt	f3d7a481b7	feat: add support for min_p (resolve #1142 ) (#1825 )	2024-07-27 14:37:40 -07:00
Jeffrey Morgan	f2a96c7d77	llm: keep patch for llama 3 rope factors (#5987 )	2024-07-26 15:20:52 -07:00
Daniel Hiltgen	e12fff8810	Enable windows error dialog for subprocess startup Make sure if something goes wrong spawning the process, the user gets enough info to be able to try to self correct, or at least file a bug with details so we can fix it. Once the process starts, we immediately change back to the recommended setting to prevent the blocking dialog. This ensures if the model fails to load (OOM, unsupported model type, etc.) the process will exit quickly and we can scan the stdout/stderr of the subprocess for the reason to report via API.	2024-07-22 14:07:27 -07:00
Michael Yang	e2c3f6b3e2	string	2024-07-22 11:27:52 -07:00
Michael Yang	55cd3ddcca	bool	2024-07-22 11:27:21 -07:00
Michael Yang	35b89b2eab	rfc: dynamic environ lookup	2024-07-22 11:25:30 -07:00
Daniel Hiltgen	5784c05397	Merge pull request #5854 from dhiltgen/win_exit_status Refine error reporting for subprocess crash	2024-07-22 10:40:22 -07:00
Jeffrey Morgan	f8fedbda20	Update llama.cpp submodule commit to `d94c6e0c` (#5805 )	2024-07-22 12:42:00 -04:00
Daniel Hiltgen	a3c20e3f18	Refine error reporting for subprocess crash On windows, the exit status winds up being the search term many users search for and end up piling in on issues that are unrelated. This refines the reporting so that if we have a more detailed message we'll suppress the exit status portion of the message.	2024-07-22 08:52:16 -07:00
Jeffrey Morgan	5534f2cc6a	llm: consider `head_dim` in llama arch (#5817 )	2024-07-20 21:48:12 -04:00
Daniel Hiltgen	283948c83b	Adjust windows ROCm discovery The v5 hip library returns unsupported GPUs which wont enumerate at inference time in the runner so this makes sure we align discovery. The gfx906 cards are no longer supported so we shouldn't compile with that GPU type as it wont enumerate at runtime.	2024-07-20 15:17:50 -07:00
Jeffrey Morgan	1475eab95f	add patch for tekken (#5807 )	2024-07-20 13:41:21 -04:00
Michael Yang	4a565cbf94	add chat and generate tests with mock runner	2024-07-16 09:39:31 -07:00
royjhan	b9f5e16c80	Introduce `/api/embed` endpoint supporting batch embedding (#5127 ) * Initial Batch Embedding * Revert "Initial Batch Embedding" This reverts commit `c22d54895a`. * Initial Draft * mock up notes * api/embed draft * add server function * check normalization * clean up * normalization * playing around with truncate stuff * Truncation * Truncation * move normalization to go * Integration Test Template * Truncation Integration Tests * Clean up * use float32 * move normalize * move normalize test * refactoring * integration float32 * input handling and handler testing * Refactoring of legacy and new * clear comments * merge conflicts * touches * embedding type 64 * merge conflicts * fix hanging on single string * refactoring * test values * set context length * clean up * testing clean up * testing clean up * remove function closure * Revert "remove function closure" This reverts commit `55d48c6ed1`. * remove function closure * remove redundant error check * clean up * more clean up * clean up	2024-07-15 12:14:24 -07:00
Jeffrey Morgan	ef98803d63	llm: looser checks for minimum memory (#5677 )	2024-07-13 09:20:05 -07:00
Josh	10e768826c	fix: quant err message (#5616 )	2024-07-11 17:24:29 -07:00
Jeffrey Morgan	c4cf8ad559	llm: avoid loading model if system memory is too small (#5637 ) * llm: avoid loading model if system memory is too small * update log * Instrument swap free space On linux and windows, expose how much swap space is available so we can take that into consideration when scheduling models * use `systemSwapFreeMemory` in check --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2024-07-11 16:42:57 -07:00
Jeffrey Morgan	791650ddef	sched: only error when over-allocating system memory (#5626 )	2024-07-11 00:53:12 -07:00
Jeffrey Morgan	efbf41ed81	llm: dont link cuda with compat libs (#5621 )	2024-07-10 20:01:52 -07:00
Michael Yang	37a570f962	Merge pull request #5612 from ollama/mxyng/mem chatglm graph	2024-07-10 14:18:33 -07:00
Michael Yang	5a739ff4cb	chatglm graph	2024-07-10 13:43:47 -07:00
Jeffrey Morgan	4e262eb2a8	remove `GGML_CUDA_FORCE_MMQ=on` from build (#5588 )	2024-07-10 13:17:13 -07:00
Daniel Hiltgen	b50c818623	Merge pull request #5607 from dhiltgen/win_rocm_v6 Bump ROCm on windows to 6.1.2	2024-07-10 12:47:10 -07:00
Daniel Hiltgen	1f50356e8e	Bump ROCm on windows to 6.1.2 This also adjusts our algorithm to favor our bundled ROCm. I've confirmed VRAM reporting still doesn't work properly so we can't yet enable concurrency by default.	2024-07-10 11:01:22 -07:00
Daniel Hiltgen	22c81f62ec	Remove duplicate merge glitch	2024-07-10 09:01:33 -07:00
Daniel Hiltgen	2d1e3c3229	Merge pull request #5503 from dhiltgen/dual_rocm Workaround broken ROCm p2p copy	2024-07-09 15:44:16 -07:00
Daniel Hiltgen	b51e3b63ac	Statically link c++ and thread lib This makes sure we statically link the c++ and thread library on windows to avoid unnecessary runtime dependencies on non-standard DLLs	2024-07-09 11:34:30 -07:00
Michael Yang	9bbddc37a7	Merge pull request #5126 from ollama/mxyng/messages update message processing	2024-07-09 09:20:44 -07:00
Daniel Hiltgen	0bacb30007	Workaround broken ROCm p2p copy Enable the build flag for llama.cpp to use CPU copy for multi-GPU scenarios.	2024-07-08 09:40:52 -07:00
Jeffrey Morgan	53da2c6965	llm: remove ambiguous comment when putting upper limit on predictions to avoid infinite generation (#5535 )	2024-07-07 14:32:05 -04:00
Jeffrey Morgan	d8def1ff94	llm: allow gemma 2 to context shift (#5534 )	2024-07-07 13:41:51 -04:00
Jeffrey Morgan	571dc61955	Update llama.cpp submodule to `a8db2a9c` (#5530 )	2024-07-07 13:03:09 -04:00
Jeffrey Morgan	0e09c380fc	llm: print caching notices in debug only (#5533 )	2024-07-07 12:38:04 -04:00
Jeffrey Morgan	4607c70641	llm: add `-DBUILD_SHARED_LIBS=off` to common cpu cmake flags (#5520 )	2024-07-06 18:58:16 -04:00
jmorganca	a08f20d910	release: remove unwanted mingw dll.a files	2024-07-06 15:21:15 -04:00
jmorganca	6cea036027	Revert "llm: only statically link libstdc++" This reverts commit `5796bfc401`.	2024-07-06 15:10:48 -04:00
jmorganca	5796bfc401	llm: only statically link libstdc++	2024-07-06 14:06:20 -04:00
jmorganca	f1a379aa56	llm: statically link pthread and stdc++ dependencies in windows build	2024-07-06 12:54:02 -04:00
jmorganca	9ae146993e	llm: add `GGML_STATIC` flag to windows static lib	2024-07-06 03:27:05 -04:00
Jeffrey Morgan	e0348d3fe8	llm: add `COMMON_DARWIN_DEFS` to arm static build (#5513 )	2024-07-05 22:42:42 -04:00
Jeffrey Morgan	2cc854f8cb	llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511 ) * Revert "fix cmake build (#5505)" This reverts commit `4fd5f3526a`. * llm: fix missing dylibs by restoring old build behavior * crlf -> lf	2024-07-05 21:48:31 -04:00
Jeffrey Morgan	5304b765b2	llm: put back old include dir (#5507 ) * llm: put back old include dir * llm: update link paths for old submodule commits	2024-07-05 19:34:21 -04:00
Jeffrey Morgan	4fd5f3526a	fix cmake build (#5505 )	2024-07-05 19:07:01 -04:00
Michael Yang	ac7a842e55	fix model reloading ensure runtime model changes (template, system prompt, messages, options) are captured on model updates without needing to reload the server	2024-07-05 13:17:25 -07:00
Jeffrey Morgan	78fb33dd07	fix typo in cgo directives in `llm.go` (#5501 )	2024-07-05 15:18:36 -04:00
Jeffrey Morgan	8f8e736b13	update llama.cpp submodule to `d7fd29f` (#5475 )	2024-07-05 13:25:58 -04:00
Jeffrey Morgan	d89454de80	Use slot with cached prompt instead of least recently used (#5492 ) * Use common prefix to select slot * actually report `longest`	2024-07-05 12:32:47 -04:00
Jeffrey Morgan	e9188e971a	Fix assert on small embedding inputs (#5491 ) * Fix assert on small embedding inputs * Update llm/patches/09-pooling.diff	2024-07-05 11:20:57 -04:00
Daniel Hiltgen	02c24d3d01	Merge pull request #5466 from dhiltgen/fix_clip_unicode Fix clip model loading with unicode paths	2024-07-05 08:16:58 -07:00

1 2 3 4 5 ...

694 Commits