ollama

Author	SHA1	Message	Date
royjhan	a5f23d766e	Merge branch 'main' into royh-batchembed	2024-07-03 11:20:24 -07:00
Roy Han	95e46eeedf	move normalize test	2024-07-03 09:45:42 -07:00
Michael Yang	65a5040e09	fix generate template	2024-07-02 16:42:17 -07:00
royjhan	d626b99b54	OpenAI: v1/completions compatibility (#5209) * OpenAI v1 models * Refactor Writers * Add Test Co-Authored-By: Attila Kerekes * Credit Co-Author Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com> * Empty List Testing * Use Namespace for Ownedby * Update Test * Add back envconfig * v1/models docs * Use ModelName Parser * Test Names * Remove Docs * Clean Up * Test name Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * Add Middleware for Chat and List * Completions Endpoint * Testing Cleanup * Test with Fatal * Add functionality to chat test * Rename function * float types * type cleanup * cleaning * more cleaning * Extra test cases * merge conflicts * merge conflicts * merge conflicts * merge conflicts * cleaning * cleaning --------- Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com> Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2024-07-02 16:01:45 -07:00
Michael Yang	dddb58a38b	Merge pull request #5051 from ollama/mxyng/capabilities add model capabilities	2024-07-02 14:26:07 -07:00
Michael Yang	400056e154	Merge pull request #5420 from ollama/mxyng/insecure-path err on insecure path	2024-07-02 14:03:23 -07:00
royjhan	996bb1b85e	OpenAI: /v1/models and /v1/models/{model} compatibility (#5007) * OpenAI v1 models * Refactor Writers * Add Test Co-Authored-By: Attila Kerekes * Credit Co-Author Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com> * Empty List Testing * Use Namespace for Ownedby * Update Test * Add back envconfig * v1/models docs * Use ModelName Parser * Test Names * Remove Docs * Clean Up * Test name Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * Add Middleware for Chat and List * Testing Cleanup * Test with Fatal * Add functionality to chat test * OpenAI: /v1/models/{model} compatibility (#5028) * Retrieve Model * OpenAI Delete Model * Retrieve Middleware * Remove Delete from Branch * Update Test * Middleware Test File * Function name * Cleanup * Test Update * Test Update --------- Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com> Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2024-07-02 11:50:56 -07:00
Roy Han	3d060e0ae9	move normalize	2024-07-02 10:35:02 -07:00
Roy Han	00a4cb26ca	use float32	2024-07-02 10:30:29 -07:00
Roy Han	512e0a7bde	Clean up	2024-07-01 16:29:54 -07:00
Roy Han	1a0c8b363c	Truncation Integration Tests	2024-07-01 16:26:30 -07:00
Michael Yang	88bcd79bb9	err on insecure path	2024-07-01 15:55:59 -07:00
Roy Han	aee25acb5b	move normalization to go	2024-07-01 14:10:58 -07:00
Roy Han	9c32b6b9ed	Truncation	2024-07-01 11:59:44 -07:00
Roy Han	1daac52651	Truncation	2024-07-01 11:55:16 -07:00
Michael Yang	da8e2a0447	use kvs to detect embedding models	2024-07-01 10:47:43 -07:00
Michael Yang	a30915bde1	add capabilities	2024-07-01 10:47:43 -07:00
Michael Yang	58e3fff311	rename templates to template	2024-07-01 10:40:54 -07:00
Michael Yang	3f0b309ad4	remove ManifestV2	2024-07-01 10:40:54 -07:00
Daniel Hiltgen	cff3f44f4a	Fix case for NumCtx	2024-07-01 09:43:59 -07:00
Daniel Hiltgen	3518aaef33	Merge pull request #4218 from dhiltgen/auto_parallel Enable concurrency by default	2024-07-01 08:32:29 -07:00
Roy Han	80c1a3f812	playing around with truncate stuff	2024-06-28 18:17:09 -07:00
Roy Han	c111d8bb51	normalization	2024-06-28 17:19:04 -07:00
Roy Han	5213c12354	clean up	2024-06-28 15:26:58 -07:00
Roy Han	b9c74df37b	check normalization	2024-06-28 15:10:58 -07:00
Roy Han	49e341147d	add server function	2024-06-28 15:03:53 -07:00
Roy Han	c406fa7a4c	api/embed draft	2024-06-28 14:54:21 -07:00
Michael Yang	123a722a6f	zip: prevent extracting files into parent dirs (#5314)	2024-06-26 21:38:21 -07:00
Roy Han	ff191d7cba	Initial Draft	2024-06-25 13:29:47 -07:00
Blake Mizerany	cb42e607c5	llm: speed up gguf decoding by a lot (#5246) Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow: * Too many allocations when decoding strings * Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O. The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3. This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.	2024-06-24 21:47:52 -07:00
Roy Han	0f87628b6d	Revert "Initial Batch Embedding" This reverts commit `c22d54895a`.	2024-06-24 15:26:05 -07:00
Daniel Hiltgen	642cee1342	Sort the ps output Provide consistent ordering for the ps command - longest duration listed first	2024-06-21 15:59:41 -07:00
Daniel Hiltgen	9929751cc8	Disable concurrency for AMD + Windows Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt-in but will need to pay attention to model sizes otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.	2024-06-21 15:45:05 -07:00
Daniel Hiltgen	17b7186cd7	Enable concurrency by default This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these by the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small VRAM GPUs so this change also refines the algorithm so that when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.	2024-06-21 15:45:05 -07:00
Michael Yang	e835ef1836	fix: quantization with template	2024-06-21 13:39:25 -07:00
royjhan	fedf71635e	Extend api/show and ollama show to return more model info (#4881) * API Show Extended * Initial Draft of Information Co-Authored-By: Patrick Devine <pdevine@sonic.net> * Clean Up * Descriptive arg error messages and other fixes * Second Draft of Show with Projectors Included * Remove Chat Template * Touches * Prevent wrapping from files * Verbose functionality * Docs * Address Feedback * Lint * Resolve Conflicts * Function Name * Tests for api/show model info * Show Test File * Add Projector Test * Clean routes * Projector Check * Move Show Test * Touches * Doc update --------- Co-authored-by: Patrick Devine <pdevine@sonic.net>	2024-06-19 14:19:02 -07:00
Roy Han	c22d54895a	Initial Batch Embedding	2024-06-18 17:34:36 -07:00
royjhan	89c79bec8c	Add ModifiedAt Field to /api/show (#5033) * Add Mod Time to Show * Error Handling	2024-06-15 20:53:56 -07:00
Daniel Hiltgen	45cacbaf05	Merge pull request #4517 from dhiltgen/gpu_incremental Enhanced GPU discovery and multi-gpu support with concurrency	2024-06-14 15:35:00 -07:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	ff4f0cbd1d	Prevent multiple concurrent loads on the same gpus While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait to avoid races that lead to OOMs	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	fc37c192ae	Refine CPU load behavior with system memory visibility	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	434dfe30c5	Reintroduce nvidia nvml library for windows This library will give us the most reliable free VRAM reporting on windows to enable concurrent model scheduling.	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	48702dd149	Harden unload for empty runners	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	5e8ff556cb	Support forced spreading for multi GPU Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	6fd04ca922	Improve multi-gpu handling at the limit Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block	2024-06-14 14:51:40 -07:00
Jeffrey Morgan	dd7c9ebeaf	server: longer timeout in `TestRequests` (#5046)	2024-06-14 09:48:25 -07:00
Patrick Devine	94618b2365	add OLLAMA_MODELS to envconfig (#5029)	2024-06-13 12:52:03 -07:00
Jeffrey Morgan	1fd236d177	server: remove jwt decoding error (#5027)	2024-06-13 11:21:15 -07:00
Michael Yang	c16f8af911	fix: multiple templates when creating from model multiple templates may appear in a model if a model is created from another model that 1) has an autodetected template and 2) defines a custom template	2024-06-12 13:35:49 -07:00

630 Commits