Title: LLMs Do Not Have Memory

  • LLMs Do Not Have Memory
  • Prompting Techniques for Better Reasoning
  • Multi-action within a Prompt
  • Prompt Chaining
  • Exception Handling
  • Hands-on Walkthrough and Tasks


LLMs are Stateless

  • ✦ By default, LLMs are stateless — meaning each incoming query (i.e., each time the LLM is triggered to generate the text response) is processed independently of other interactions. The only thing that matters is the current input, nothing else.

  • ✦ There are many applications, such as chatbots, where remembering previous interactions is essential. Here, we will see how to enable conversations with LLMs as if the LLM remembers the previous exchanges.

    - Notice that in the example below, when the second input is sent to the LLM, the output is not relevant to the previous interaction (e.g., when running `get_completion()` twice in a row).
    

    • To make the LLM engage in a "conversation", we need to send along all the previous prompts and responses (i.e., those components highlighted in the BLUE region in the image below).
    • In the example below, the input & output of the first interaction are sent together with the second prompt (i.e., "Which are healthy?"), as shown in the sketch after this list.
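
Below is a minimal sketch of both behaviours. It assumes the get_completion and get_completion_by_messages helper functions defined later in this section; the prompts and the replies in the comments are illustrative only.

# Stateless: two independent calls - the second call knows nothing about the first
reply_1 = get_completion("List some Fun Activities")   # e.g. "Spa, Hiking, Surfing, and Gaming"
reply_2 = get_completion("Which are healthy?")          # the model cannot tell what "which" refers to

# Conversational: re-send the earlier prompt and response together with the new question
messages = [
    {"role": "user", "content": "List some Fun Activities"},
    {"role": "assistant", "content": reply_1},
    {"role": "user", "content": "Which are healthy?"},
]
reply_3 = get_completion_by_messages(messages)          # now the model can answer in context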




Implementation in Python

  • ✦ Below is the helper function that we have been using.
    • Pay attention to the messages object in the function.
    • That's the key for implementing the conversational-like interaction with the LLM.
from openai import OpenAI

client = OpenAI()  # the API key is read from the OPENAI_API_KEY environment variable

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content
    • messages is a list object where each item is a message.
    • A message object can be one of three types:
      • A. prompt from the user
      • B. response from the LLM (a.k.a. the AI assistant)
      • C. 🆕 system message:
What is "System Message"
  • The system message helps set the behavior of the assistant.

  • For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation.

    • The instructions in the system message can guide the model’s tone, style, and content of the responses.
    • However, note that the system message is optional and the model’s behavior without a system message is likely to be similar to using a generic message such as "You are a helpful assistant."

    • It’s also important to note that the system message is considered a ‘soft’ instruction, meaning the model will try to follow it, but it is not a strict rule.

An example of messages with all these keys is shown below:

messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List some Fun Activities"},
    {"role": "assistant", "content": "Spa, Hiking, Surfing, and Gaming"},
    {"role": "user", "content": "Which are healthy?"}
]
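
To see the "soft instruction" effect of the system message in practice, you can swap in a different persona. The system message below is just a hypothetical example; the exact replies will vary.

messages=[
    {"role": "system", "content": "You are a terse fitness coach. Answer in one short sentence."},
    {"role": "user", "content": "List some Fun Activities"},
    {"role": "assistant", "content": "Spa, Hiking, Surfing, and Gaming"},
    {"role": "user", "content": "Which are healthy?"}
]
# With this system message, the reply will tend to be short and fitness-focused
# (e.g. "Hiking and surfing."), compared to the more elaborate answer you would
# get with the generic "You are a helpful assistant." system message above.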

Another example

messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
]

Below is an illustration of the flow of messages between the different "roles"


  • 💡 By exposing messages as one of the helper function's parameters, we now have a more flexible function, get_completion_by_messages, where you can compose the messages object yourself instead of just passing in the user prompt (a usage example follows the function below).
def get_completion_by_messages(messages, model="gpt-3.5-turbo", temperature=0, top_p=1.0, max_tokens=1024, n=1):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        n=n  # number of completions to generate for each prompt
    )
    return response.choices[0].message.content
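
For example, you can pass the "Fun Activities" conversation from earlier straight into this function; the reply shown in the comment is only indicative.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List some Fun Activities"},
    {"role": "assistant", "content": "Spa, Hiking, Surfing, and Gaming"},
    {"role": "user", "content": "Which are healthy?"}
]

reply = get_completion_by_messages(messages)
print(reply)  # e.g. "Hiking and surfing are the healthier options, as both involve physical exercise."
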
Try out the practical examples in Weekly Tasks - Week 03



Potential Implications of Bigger Messages

You have probably guessed the implications of continuously stacking messages in the messages parameter across subsequent API calls. While it unlocks more contextually aware and engaging interactions, there are trade-offs to consider in terms of resource utilization and performance. Let's delve into three key areas where these trade-offs become apparent:

  1. Increased Token Consumption:

    • Longer Context: Each message you add to the messages list contributes to a longer conversation history that the model needs to process. This directly increases the number of tokens consumed in each API call.

    • Token Billing: Most LLM providers price API usage by the token. As your message history grows, so does the cost of each API call. For lengthy conversations or applications with frequent interactions, this can become a considerable factor.

  2. Context Window Limits:

    • Finite Capacity: Language models have a limited "context window", meaning they can only hold and process a certain number of tokens at once.

    • Truncation Risk: If the total number of tokens in your messages list exceeds the model's context window, the request may fail or the earliest messages will have to be dropped. Either way, crucial context can be lost, affecting the model's ability to provide accurate and coherent responses (a token-counting sketch follows this list).

  3. Potential for Increased Latency:

    • Processing Overhead: As the message history grows, the model requires more time to process and understand the accumulated context. This can lead to a noticeable increase in response latency, especially for models with larger context windows or when dealing with computationally intensive tasks.
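
One way to keep an eye on points 1 and 2 is to count the tokens in the messages list before sending it. Below is a rough sketch using OpenAI's tiktoken tokenizer; the per-message overhead constant and the 4,096-token budget are assumptions you should adjust to your actual model.

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Rough estimate of how many tokens a messages list will consume."""
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 4   # assumption: approximate formatting overhead per message
    total = 0
    for message in messages:
        total += tokens_per_message
        for value in message.values():
            total += len(encoding.encode(value))
    return total

CONTEXT_WINDOW = 4096        # assumption: replace with your model's actual context window

history_tokens = num_tokens_from_messages(messages)   # messages = the conversation history so far
print(f"Current history uses roughly {history_tokens} tokens")

if history_tokens > CONTEXT_WINDOW - 1024:             # leave headroom for the model's reply
    print("Warning: conversation history is close to the context window limit")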

Mitigation Strategies:

  • ✦ It's crucial to implement strategies to manage conversation history effectively. This could involve the following (a rough sketch of the first two strategies appears after this list):

    • Summarization: Summarize previous messages to condense information while preserving key context.

    • Selective Retention: Retain only the most relevant messages, discarding less important ones.

    • Session Segmentation: Divide long conversations into logical segments and clear the context window periodically.

    • Token-Efficient Models: Consider using models specifically designed for handling longer contexts, as they may offer a larger context window or more efficient token usage.
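
As a rough illustration of the first two strategies, the sketch below trims old messages to a token budget (selective retention) and, optionally, condenses older turns into a summary (summarization). The function names, the token budget, and the summarization prompt are all hypothetical; num_tokens_from_messages and get_completion_by_messages are the helpers sketched earlier.

def trim_history(messages, max_tokens=2000):
    """Selective retention: keep the system message(s) and the most recent turns
    that fit within an assumed token budget."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    kept = []
    for message in reversed(other_msgs):               # walk backwards from the newest turn
        candidate = system_msgs + [message] + kept
        if num_tokens_from_messages(candidate) > max_tokens:
            break
        kept.insert(0, message)
    return system_msgs + kept

def summarize_history(messages):
    """Summarization: ask the LLM to condense earlier turns into one short note."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    summary = get_completion_by_messages([
        {"role": "user", "content": f"Summarize this conversation in 3 sentences:\n{transcript}"}
    ])
    return [{"role": "system", "content": f"Summary of the earlier conversation: {summary}"}]

A common pattern is to summarize everything except the last few turns, prepend the summary as a system message, and then send only the trimmed, summarized history with each new API call.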