Title: Tokens

  • Tokens
  • Key Parameters for LLMs
  • LLMs and Hallucination
  • Prompting Techniques for Builders
  • Hands-on Walkthrough and Tasks

What's a Token?


  • ✦ For every Large Language Model (LLM), "tokens" play a crucial role. They are the smallest units of text that the model can understand and manipulate.

    • Think of tokens as the building blocks of a sentence.
    • They can represent a word, a part of a word, or even a punctuation mark.
      • For instance, in the sentence "She loves ice-cream", there would be five tokens: "She", "loves", "ice", "-", and "cream".
    • The models learn the statistical relationships between these tokens and use them to produce the next token in a sequence (a code sketch after this list shows how to inspect a tokenizer's output).
      • Different models use different tokenization processes.
      • For example, OpenAI's GPT-4 likely uses a different tokenization process compared to Gemini from Google. 
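To make this concrete, here is a minimal sketch using OpenAI's tiktoken library (the same library used in the practical tasks later in this section) to inspect how a sentence is split into tokens. The exact split and token IDs are model-specific and may differ from the five-token example above:

import tiktoken

# Load the tokenizer associated with a given model
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

# Encode the text into token IDs, then decode each ID back into its text piece
token_ids = encoding.encode("She loves ice-cream")
print(token_ids)                                  # model-specific token IDs
print([encoding.decode([t]) for t in token_ids])  # the text behind each token
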
Hint

A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. This translates to roughly ¾ of a word (so 100 tokens ≈ 75 words). Note that this is only useful for a rough estimation; each model provider offers tools that count tokens accurately. A quick character-based sketch of this rule follows below.
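As an illustration only, the rule of thumb can be written as a quick character-based estimate; rough_token_estimate below is a hypothetical helper, not a real tokenizer:

# Rough rule-of-thumb estimate only: ~4 characters per token for common English.
# This is NOT a real tokenizer; use the provider's tool for accurate counts.
def rough_token_estimate(text: str) -> int:
    return max(1, len(text) // 4)

print(rough_token_estimate("She loves ice-cream"))  # 19 characters // 4 -> 4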




LLMs have Token Limits

  • ✦ In the early days of Large Language Models (LLMs), which feels like a long time ago even though it was just early 2023, counting tokens was critical because these models could only handle a limited number of tokens.
  • ✦ However, newer models, such as gpt-4-turbo, gpt-4o, and gemini 1.5 pro onwards, can process a significantly larger number of tokens, reducing the need for strict token counting.
  • ✦ Below are some of the latest models, at the time of writing, and their token limits.
    - The concept of the maximum tokens that the models can handle is also often known as "Context Window".
    - Note that the "Context Window" for the table below includes both the input and output tokens.
| MODEL | DESCRIPTION (by the respective companies) | CONTEXT WINDOW |
| --- | --- | --- |
| OpenAI Models (overview of models) | | |
| gpt-4-0125-preview | The latest GPT-4 model. Returns a maximum of 4,096 output tokens. | 128,000 tokens |
| gpt-3.5-turbo-0125 | The latest GPT-3.5 Turbo model, with higher accuracy at responding in requested formats. Returns a maximum of 4,096 output tokens. | 16,385 tokens |
| Claude Models (models overview) | | |
| claude-3-5-sonnet | Highest level of intelligence and capability (among Claude models). Returns a maximum of 8,192 output tokens. | 200,000 tokens |
| claude-3-opus | Powerful model for highly complex tasks. Top-level performance, intelligence, fluency, and understanding. Returns a maximum of 4,096 output tokens. | 200,000 tokens |
| Google Gemini (models overview) | | |
| gemini 1.5 flash | Gemini 1.5 Flash is a fast and versatile multimodal model for scaling across diverse tasks. | 1,000,000 tokens |
| gemini 1.5 pro | Gemini 1.5 Pro is a mid-size multimodal model that is optimized for a wide range of reasoning tasks. 1.5 Pro can process large amounts of data at once. | 2,000,000 tokens |
Please refer to the model providers' official websites for the latest information, as the context window for the same model may be updated over time (e.g., gemini 1.5).

  • ✦ It’s important to note that some models may have different token limits for input and output.
    • This means that while a model might be able to accept a large number of tokens as input, it might only be able to generate a smaller number of tokens as output.
    • Therefore, understanding the token limits of a specific model is still crucial; a simple pre-flight check is sketched after this list.
  • ✦ Furthermore, for open-source models, especially smaller ones that prioritize speed, token counts remain very important.
    • These models often have stricter token limits due to their focus on efficiency and speed.
    • Therefore, efficient token management is still a key consideration when working with these models.
    • It helps ensure that the models operate within their capacity and deliver results quickly.
    • Besides counting tokens programmatically with code, which we will be using in our practical tasks, we can also use the web-based tool at https://platform.openai.com/tokenizer
    • You can also try out the tool directly below, by entering your sample prompt into the text box.
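Here is a minimal sketch of the pre-flight check mentioned above. The two limit constants are illustrative assumptions; substitute the documented values for your model:

import tiktoken

# Illustrative limits only; check your model's documentation for the real values.
CONTEXT_WINDOW = 128_000   # total tokens (input + output) the model can handle
MAX_OUTPUT_TOKENS = 4_096  # maximum tokens the model can generate

def fits_in_context(prompt: str, model: str = "gpt-4o-mini") -> bool:
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = len(encoding.encode(prompt))
    # Reserve room for the output so input + output stays within the window
    return prompt_tokens + MAX_OUTPUT_TOKENS <= CONTEXT_WINDOW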



Tokenizer Widget for OpenAI Models

You can use the tool below to understand how a piece of text might be tokenized by a language model, and to see the total token count of that piece of text.

If the embedded webpage below does not display properly, use this link.

Note: This widget currently does not support gpt-4o and gpt-4o-mini




Tokens & Cost

  • ✦ For many LLMs, pricing is based on the number of tokens processed.

    • By understanding tokens, you can better manage your usage of the model, optimizing costs and ensuring efficient use of resources.
    • Below are some pricing tables for the different models from OpenAI.
      • Prices are typically quoted in units of either “per 1M tokens” or “per 1K tokens”.
      • You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. For example, this paragraph is about 35 tokens.
  • ✦ Below is the pricing table for OpenAI's GPT models, for reference:

| Model | Pricing for Input Tokens | Pricing for Output Tokens |
| --- | --- | --- |
| gpt-4o | $5.00 / 1M input tokens | $15.00 / 1M output tokens |
| gpt-4o-mini | $0.150 / 1M input tokens | $0.600 / 1M output tokens |
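As a quick worked example, the sketch below turns the gpt-4o-mini prices above into a per-call cost estimate; estimate_cost is a hypothetical helper for illustration:

# Minimal sketch: estimate the cost of one API call from its token counts,
# using the gpt-4o-mini prices in the table above (USD per 1M tokens).
INPUT_PRICE_PER_1M = 0.150
OUTPUT_PRICE_PER_1M = 0.600

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_1M
            + output_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

# e.g., a 1,000-token prompt with a 500-token response
print(f"${estimate_cost(1_000, 500):.6f}")  # $0.000450
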
What is the difference between gpt-4o-mini and gpt-4o-mini-2024-07-18?

The name gpt-4o-mini serves as a generic reference to the latest model in this class. gpt-4o-mini-2024-07-18 is the fully declared name of the specific version released on July 18, 2024. 

This naming convention helps distinguish between different versions and updates of the model, ensuring clarity and precision when referring to a particular release.

For the training content in this Bootcamp, we can safely use the generic name gpt-4o-mini which points to the latest model in our notebooks. For more info, visit Models - OpenAI API

Please always refer to the official page for the latest pricing.

The prices above are accurate at the time of writing.
Official Pricing Page: https://openai.com/pricing




Estimate Token Counts in Code

We can use the code below to estimate the token count of the prompt that we will send to the LLM.

# This simplified function estimates the number of tokens in the given "text"
# ⚠️ This is a simplified implementation that should only be used for a rough estimation

import tiktoken

def count_tokens(text):
    encoding = tiktoken.encoding_for_model('gpt-4o-mini')
    return len(encoding.encode(text))
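
For example, with a hypothetical prompt:

prompt = "Tell me a joke about tokens."
print(count_tokens(prompt))  # prints the estimated token count of the prompt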

While this function can be used to calculate the tokens in both the prompt and the output, it DOES NOT automatically count the tokens in the output generated by the LLM.
  • ✦ To calculate the token count of the generated output, the generated text needs to be passed to this function in a separate function call.
  • ✦ For controlling the length of the output, see the 'max_tokens' parameter explained in 2. Key Parameters for LLMs.
  • ✦ While the above code is sufficient for approximating token counts, if you need more accurate token counts for the prompt, please refer to the code below:

    • We recommend using this function for calculating tokens in actual projects

      • This is especially useful if the API calls involve lengthy multi-turn chats between the LLM and the users
    • Don't worry about understanding this function line-by-line; it's a utility tool

      • The core logic really boils down to encoding.encode(value) in the last few lines of the code
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")

    tokens_per_message = 3
    tokens_per_name = 1
   
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

# For more details, see https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
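
For example, here is how the function above might be called on a short, hypothetical chat history:

# Example usage with a short, hypothetical chat history
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many tokens is this conversation?"},
]
print(num_tokens_from_messages(messages, model="gpt-3.5-turbo"))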