icon: LiNotebookTabs
Title: LLMs Do Not Have Memory
✦ By default, LLMs are stateless: each incoming query (i.e., each time the LLM is triggered to generate a text response) is processed independently of all other interactions. The only thing that matters is the current input, nothing else.
✦ There are many applications, such as chatbots, where remembering previous interactions is very important. Here, we will look at how to hold conversations with an LLM as if it remembers the earlier exchanges.
- Notice that in the example below, when the second input is sent to the LLM (e.g., by running `get_completion()` again), the output is not relevant to the previous interaction, i.e., the earlier prompt and response (the components highlighted in the BLUE region in the image below).

Let's revisit the `get_completion()` helper function that we have been using. Notice the `messages` object in the function.

```python
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content
```
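For instance, here is a minimal sketch of the problem (assuming the `get_completion()` helper above and an already-configured OpenAI API key; the outputs shown are illustrative):

```python
# First call: give the model a fact.
print(get_completion("My name is Alex. Please remember my name."))
# e.g., "Sure, Alex! I'll remember your name."

# Second call: a brand-new, independent request.
print(get_completion("What is my name?"))
# The model has no memory of the first call, so it cannot answer reliably,
# e.g., "I'm sorry, but I don't have access to that information."
```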
`messages` is a list object where each item is a message. Each `message` object can be one of three types (roles):

- system: The system message helps set the behavior of the assistant. For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation.
- user: The user messages carry the questions or instructions that the assistant should respond to.
- assistant: The assistant messages hold the model's earlier replies (or examples you write yourself to demonstrate the desired behavior).

An example of a `messages` list that uses all these roles is shown below:
```python
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List some Fun Activities"},
    {"role": "assistant", "content": "Spa, Hiking, Surfing, and Gaming"},
    {"role": "user", "content": "Which are healthy?"}
]
```
Another example:

```python
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
]
```
Below is an illustration of how the messages flow between the different "roles".
By accepting `messages` as one of the helper function's parameters, we now have a more flexible function, `get_completion_by_messages()`, where you can compose the `messages` object yourself instead of just passing in the "user prompt".

```python
def get_completion_by_messages(messages, model="gpt-3.5-turbo", temperature=0, top_p=1.0, max_tokens=1024, n=1):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        n=n,
    )
    return response.choices[0].message.content
```
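As a sketch of how this enables "memory" (assuming `client` is an initialized `openai.OpenAI()` client and the function above is defined; the printed outputs are illustrative), we replay the whole history on every call and append each new turn to it:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
]

# First turn
reply = get_completion_by_messages(messages)
print(reply)  # e.g., "The Los Angeles Dodgers won the World Series in 2020."

# Append the assistant's reply and the follow-up question,
# then send the *whole* history back to the model.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Where was it played?"})
print(get_completion_by_messages(messages))
# The model can now resolve "it" because the earlier turns are in its context.
```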
You have probably guessed the implications of continuously stacking messages in the `messages` parameter across subsequent API calls. While it unlocks more contextually aware and engaging interactions, there is a trade-off to consider concerning resource utilization and performance. Let's delve into three key areas where these trade-offs become apparent:
Increased Token Consumption:
Longer Context: Each message you add to the messages list contributes to a longer conversation history that the model needs to process. This directly increases the number of tokens consumed in each API call.
Token Billing: Most LLMs' pricing model is based on token usage. As your message history grows, so does the cost of each API call. For lengthy conversations or applications with frequent interactions, this can become a considerable factor.
Context Window Limits:
Finite Capacity: Language models have a limited "context window", meaning they can only hold and process a certain number of tokens at once.
Truncation Risk: If the total number of tokens in your messages list exceeds the model's context window, the request will fail unless the earliest messages are dropped or truncated. This can lead to a loss of crucial context and affect the model's ability to provide accurate and coherent responses.
Potential for Increased Latency:
Longer Processing Time: A longer message history means more tokens for the model to process before it can respond, which can increase response times for lengthy conversations. The token-counting sketch below shows how quickly the history grows.
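To get a rough feel for this growth, here is a small sketch (assuming the `tiktoken` package is installed; the ~4-token per-message overhead is an approximation, and `estimate_tokens` is a hypothetical helper, not part of any library):

```python
import tiktoken

def estimate_tokens(messages, model="gpt-3.5-turbo"):
    """Rough token estimate for a list of chat messages."""
    encoding = tiktoken.encoding_for_model(model)
    total = 0
    for message in messages:
        total += 4  # approximate per-message overhead (role markers, separators)
        total += len(encoding.encode(message["content"]))
    return total

# Every turn appended to `messages` makes each subsequent call more expensive.
print(estimate_tokens([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
]))
```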
Mitigation Strategies:
✦ It's crucial to implement strategies to manage conversation history effectively. This could involve:
Summarization: Summarize previous messages to condense information while preserving key context.
Selective Retention: Retain only the most relevant (or most recent) messages, discarding less important ones; a small sketch of this approach follows this list.
Session Segmentation: Divide long conversations into logical segments and clear the context window periodically.
Token-Efficient Models: Consider using models specifically designed for handling longer contexts, as they may offer a larger context window or more efficient token usage.
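For illustration, here is a minimal sketch of the selective-retention idea, reusing the hypothetical `estimate_tokens()` helper from the earlier sketch to keep the history within an arbitrary token budget:

```python
def trim_history(messages, max_tokens=3000, model="gpt-3.5-turbo"):
    """Keep the system message(s) plus the most recent turns that fit the budget."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    trimmed = []
    # Walk backwards from the newest message, keeping turns while they still fit.
    for message in reversed(other_msgs):
        candidate = system_msgs + [message] + trimmed
        if estimate_tokens(candidate, model=model) > max_tokens:
            break
        trimmed.insert(0, message)

    return system_msgs + trimmed

# Trim before every API call so the request stays within the context window:
# messages = trim_history(messages)
# reply = get_completion_by_messages(messages)
```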