Title: Running Open-Source LLMs
Open-source models are crucial for anyone interested in artificial intelligence, from practitioners and citizen data scientists to developers building proof-of-concept prototypes. These models are freely available, meaning we can download, use, and modify them without worrying about licensing fees or restrictions. This accessibility allows us to experiment and innovate without a large budget or special permissions, making it easier to bring our ideas to life and test new concepts quickly.
Transparency & Possibility for Fine-tuning. Using open-source models also means we can see how they work. This transparency is important because it allows us to understand a model's strengths and weaknesses and make informed decisions about how to use it effectively. For some models, we can also fine-tune the weights to better fit our specific needs, whether we are working on a small project or developing a new application. This flexibility is particularly valuable for developers who need to adapt models to different scenarios or improve them based on real-world feedback.
Confidentiality. Another significant advantage of open-source models is confidentiality. When we run these models locally, our data remains on the hosting server or local machine, meaning we don’t have to send sensitive information to external servers. This is especially important for projects that involve personal or proprietary data, as it helps protect privacy and maintain compliance with data protection regulations.
There are various frameworks we can use to run open-source models locally or in a server environment. The table below compares three popular ones.
| Feature / Framework | Ollama | HuggingFace TGI | llama.cpp |
|---|---|---|---|
| Ease of Installation | Simple installation via downloadable packages for macOS, Linux, and WSL2 (Windows). User-friendly CLI; can also run as a Docker image. | Requires setting up Python environments and dependencies; a more involved setup. | Requires compilation from source; can also run as a Docker image. |
| Supported Models | Supports a variety of open-source LLMs with seamless integration; requires compatible (i.e., quantized) model variants. | Extensive model support via the HuggingFace Hub, including proprietary models. | Primarily optimized for Meta's LLaMA models and compatible variants. |
| Performance Optimization | Optimized for both CPU and GPU usage, with options for memory allocation and quantization. | Leverages HuggingFace's optimizations; supports GPU acceleration with proper setup. | Highly optimized for CPU performance, even on lower-end hardware; GPU support is experimental. |
| Scalability | Suitable for both local and server deployments with easy scaling options. | Designed for scalable deployments, including multi-instance setups and cloud integration. | Best suited for single-instance deployments; limited scalability features. |
| Customization & Extensibility | Supports model customization and easy switching between different LLMs; covers common customizations. | Highly customizable with access to a wide range of tools and integrations via the HuggingFace ecosystem. | Focuses on performance and efficiency for specific models; customization tends to require more complex configuration. |
While every framework has its strengths and shortcomings, we will use Ollama throughout this tutorial. It provides a good balance between ease of use on a local machine and the ability to scale to more serious usage, which makes it a good starting point for exploring open-source models. Beyond this initial phase, we encourage you to delve deeper into the capabilities of Ollama and the other frameworks as you become more comfortable.
Running large language models locally used to be a hassle, with lots of instance and GPU management eating up resources. For example, the smallest Llama 2 model is 13 GB, which means most models with more than 7 billion parameters can't fit on a typical laptop GPU. In comparison, models with capabilities similar to GPT-4 often have a few hundred billion parameters. For instance, the most capable variant of Llama 3.1 has 405 billion parameters and is therefore known as Llama 3.1 405B.
This is where quantization comes in. Quantization is like compressing a large file to make it smaller without losing much of its original quality: it reduces the size of the model by storing its weights with lower numerical precision, using simpler, smaller numbers instead of very precise ones. By reducing the weights to 4 bits, the Llama 2 7B chat model shrinks to just 3.8 GB, making it possible to run on a regular laptop with limited resources, without significantly affecting its performance.
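To make this concrete, here is a rough back-of-the-envelope sketch of how bits per weight translate into model size. These are weight-only estimates; real files such as the 13 GB and 3.8 GB figures above also include some overhead, so treat the numbers as approximations rather than exact sizes.

def model_size_gb(num_params_billion: float, bits_per_weight: int) -> float:
    # Weight-only estimate: parameters * bits per weight, converted to gigabytes.
    total_bits = num_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

# A 7B model at 16-bit precision vs. 4-bit quantization
print(f"7B @ 16-bit: {model_size_gb(7, 16):.1f} GB")  # ~14.0 GB
print(f"7B @  4-bit: {model_size_gb(7, 4):.1f} GB")   # ~3.5 GB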
Ollama got its name back when it began by supporting Llama 2, but it has since expanded to include models such as Mistral and Phi-2. Ollama makes it easy to get started running LLMs on our own hardware with very little setup time.
Ollama provides a library of models at https://ollama.com/library (all in GGUF format, because Ollama is built on llama.cpp) that have been cleaned up and made ready to use. The list is well maintained, with clear descriptions of each model and of the various quantizations and sizes available for the same model.
As for hardware, because Ollama supports quantization (4-bit by default), the hardware requirements are generally low, making it well suited to end-user devices like our laptops. As a rule of thumb, Ollama suggests at least 8 GB of RAM to run the 7B models, 16 GB for the 13B models, and 32 GB for the 33B models; we can extrapolate for other sizes accordingly.
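As a quick sanity check against that rule of thumb, the snippet below compares the machine's total RAM with the suggested minimum for a given model size. The use of psutil and the exact lookup table are illustrative assumptions, not an official Ollama requirement.

import psutil  # assumption: psutil is installed (pip install psutil)

# Ollama's published rule of thumb: suggested RAM in GB per model size
SUGGESTED_RAM_GB = {"7b": 8, "13b": 16, "33b": 32}

def enough_ram(model_size: str) -> bool:
    # Compare total system memory against the suggested minimum.
    total_gb = psutil.virtual_memory().total / 1e9
    needed = SUGGESTED_RAM_GB[model_size]
    print(f"Total RAM: {total_gb:.1f} GB, suggested for {model_size}: {needed} GB")
    return total_gb >= needed

enough_ram("7b")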
Ollama also offers an OpenAI-compatible API server, which means you can use it with little to no change to your existing code (see the code example in 5.3 Interacting with the Model).
With Ollama and the model set up, you can now run the LLM locally.
Please refer to the list of models directly supported by Ollama at https://ollama.com/library.
In this tutorial, we will use Gemma 2, a family of lightweight open models built from the same research and methodology used to create the Gemini models. You can find out more about the Gemma 2 models in the Ollama library linked above.
Use the following command to start the model (the exact command may vary; consult Ollama's documentation). If the model has not already been downloaded, this command will pull it first and then run it.
ollama run gemma2
To check whether the model is running:
ollama ps
This command lists the models that are currently loaded and running.
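You can also check the local Ollama server programmatically. The sketch below queries Ollama's REST API on its default port (11434); the /api/tags endpoint lists the models that have been downloaded locally. It assumes the Ollama server is already running on the same machine.

import json
from urllib.request import urlopen

# Ollama's REST API listens on port 11434 by default.
# /api/tags returns the models available locally.
with urlopen("http://localhost:11434/api/tags") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    print(model["name"])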
Ollama offers an API interface to interact with the model programmatically. The example below uses the OpenAI Python client against Ollama's OpenAI-compatible endpoint.
from openai import OpenAI

# Point the OpenAI client at the local Ollama server (OpenAI-compatible endpoint).
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused by Ollama
)

response = client.chat.completions.create(
    model="gemma2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The LA Dodgers won in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
)

print(response.choices[0].message.content)
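For longer responses, you may prefer to stream tokens as they are generated rather than waiting for the full completion. The sketch below assumes the same local Ollama server and gemma2 model as above and uses the OpenAI client's streaming interface.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# stream=True yields chunks as the model generates them.
stream = client.chat.completions.create(
    model="gemma2",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()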
While many LLMs available for download are often described as open source, the reality is more nuanced. Some LLMs may have their source code readily accessible, while others do not. Some provide their weights for free, and others do not. Additionally, some offer datasets and explain how the LLM was trained, whereas others do not provide this information. Generally, most allow free use of the LLM, but with certain conditions, such as restricting usage to research purposes only. Following Sau Sheong's definition in this Medium article, we use the term open LLM to refer to any model that is not a fully closed-source LLM, like GPT-4 or Gemini.
To further enhance your experience and troubleshoot more complex issues, refer to the following resources: