Compare commits

2 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Matt Williams | fed3843be2 | update to resolve jmorganca comments (Signed-off-by: Matt Williams <m@technovangelist.com>) | 2024-01-04 12:58:07 -08:00 |
| Matt Williams | 01d4047ed3 | add faq about quant and context (Signed-off-by: Matt Williams <m@technovangelist.com>) | 2024-01-04 09:45:13 -08:00 |


@@ -8,12 +8,6 @@ To upgrade Ollama, run the installation process again. On the Mac, click the Oll
Review the [Troubleshooting](./troubleshooting.md) docs for more about using logs.
## What are the components of Ollama that need to be running to work with the CLI, the API, and 3rd party tools?
At the heart of Ollama there are two main components: the server and the client. Even when everything runs on a single machine, there is a server running as a service, or background process, and some sort of client. Often that client is the CLI. For instance, `ollama run llama2` starts the CLI. You will often see this referred to as the REPL, a tool where you can work with Ollama interactively. You can run the server yourself with `ollama serve`, but we recommend letting the service run instead. The Ollama installer script for Linux adds a systemd service to your machine that runs `ollama serve` as the `ollama` user. On macOS, running `ollama` starts the Ollama menu bar app, which runs the service.
The Ollama service is what actually loads the model and processes your requests. It also serves the API that all clients use, including our REPL and any 3rd party tools. There are some tools that require adding some environment variables to make the service more accessible in different ways. You can learn more about configuring those below.
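The requests that service handles are plain HTTP with JSON bodies. As a minimal sketch of what any client, including the REPL, ultimately sends (assuming the default address `localhost:11434` and the `/api/generate` route; both are defaults, not requirements), the body of a generate request can be built like this:

```python
import json

# Default address of the local Ollama service (an assumption; it is configurable).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str, stream: bool = False) -> str:
    """Build the JSON body a client POSTs to the generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

print(build_generate_request("llama2", "Why is the sky blue?"))
```

Any 3rd party tool talks to this same endpoint, which is why the service must be running no matter which client you use.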
## How do I use Ollama server environment variables on Mac?
On macOS, Ollama runs in the background and is managed by the menu bar app. To add environment variables, you will need to run Ollama manually instead.
@@ -118,3 +112,26 @@ This can impact both installing Ollama, as well as downloading models.
Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` in the left panel. Find the `vEthernet (WSL)` adapter, right-click and select `Properties`.
Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. *Disable* both of these properties.
## What does the q in the model tag mean? What is quantization?
Whenever you pull a model without a tag, Ollama will actually pull the q4_0 quantization of the model. You can verify this on the tags page. On https://ollama.ai/library/llama2/tags you can see that the hash for the latest tag matches the hash for the 7b model. ![quant hashes](https://github.com/jmorganca/ollama/assets/633681/814b1b78-8205-4845-89f9-e671b3b96085)
Looking at that page for any model, you can see several quantization options available. Quantization is a method of compression that allows the model to fit in less space and thus use less RAM and VRAM on your machine.
At a high level, a model is made of an enormous collection of nodes that determine how to generate text. These nodes are connected at different levels by weights. The training process adjusts these weights so the model reliably produces the right output.
Most of the source models we use start with weights that are 32-bit floating-point numbers. Those weights, together with another set of values called biases, make up the model's parameters. So a source model with 7 billion parameters has 7 billion 32-bit floating-point numbers, plus a description of all the nodes and more. That adds up to needing at least 28 gigabytes of memory to load, if you choose to load one of those source models.
Quantization turns those 32-bit floating-point weights into much smaller integers. The number next to the q indicates the bit width of the weights. So a q4 model has those 32-bit floats converted into 4-bit integers. A 4-bit quantization takes up the space of 7 billion 4-bit integers, plus a little overhead, which comes out to almost 4 gigabytes. Obviously, some information is lost going from 28GB down to 4GB, but it turns out that in most cases the loss isn't really noticeable. In fact, even the 2-bit quantization, which fits in less than 3GB, can be very useful.
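The arithmetic above is easy to check. A small back-of-envelope sketch (weights only, ignoring the per-model overhead mentioned in the text):

```python
def weight_size_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in gigabytes."""
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> gigabytes

n = 7_000_000_000  # a 7B-parameter model
for bits in (32, 16, 4, 2):
    print(f"{bits:>2}-bit: {weight_size_gb(n, bits):.2f} GB")
# 32-bit gives 28.00 GB, 4-bit gives 3.50 GB, matching the figures above.
```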
There are three major sets of quantizations you will see in the Ollama library of models: **fp16**, models with just a q and a number, like **q4_0**, and models with a **K** in the tag. The **fp16** model has been converted from the 32-bit source weights down to 16-bit. This is about half the size of the 32-bit source model and is the largest quantization we deliver in the library. The **q4_0**, **q4_1**, **q5_0**, etc. models use the two original quantization methods.
The models with a **K** are often referred to as K-quants. This method produces models of similar quality at smaller sizes than the original methods. Essentially, it finds clusters of weights and quantizes them together, allowing for higher precision while using the same bit widths as the regular quantization options. But this requires a set of maps the model uses to recover the original values, which adds computational cost. You may see some impact on the speed of models with K-quants compared to the regular quantizations.
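The core idea, storing small integers plus per-group recovery values, can be illustrated with a toy block quantizer. This is a deliberate simplification, not the actual K-quant algorithm from llama.cpp: each block of weights is mapped to 4-bit integers plus a scale and offset used to recover approximate originals.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to small unsigned ints plus a scale and offset."""
    levels = (1 << bits) - 1               # 15 distinct steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # avoid zero scale for constant blocks
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_block(q, scale, lo):
    """Recover approximate weights; error is at most half a quantization step."""
    return [v * scale + lo for v in q]

block = [0.12, -0.33, 0.05, 0.41, -0.08, 0.27]
q, scale, lo = quantize_block(block)
restored = dequantize_block(q, scale, lo)
```

The extra `scale` and `lo` per block are the "maps" referred to above: they cost storage and a little compute on every lookup, in exchange for better precision than one global scale.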
## What is context, can I increase it, and why doesn't every model support a huge context?
Context refers to the size of the input you can send to a model and still get sensible output back. Many models have a context size of 2048 tokens. It's sometimes possible to give a model more using the **num_ctx** parameter, but the answers start to degrade, because the model was trained to work within that context size. Newer models have been able to increase the context size using different methods. This increase in context size results in a corresponding increase in the memory required, sometimes by orders of magnitude.
> [!WARNING]
> Currently, over-allocating context size may result in model quality or stability issues.
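If you do want to experiment with a larger context, **num_ctx** can be set in a Modelfile. A minimal sketch, assuming a base model whose architecture actually supports the larger window:

```
FROM llama2
PARAMETER num_ctx 4096
```

Keep the warning above in mind: setting the parameter does not change what the model was trained on, and memory use grows with the window.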