icon: LiWrench
Title: Applying Embeddings
Here is the sample data used in the use cases below:
To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.
from openai.embeddings_utils import get_embedding, cosine_similarity
def search_reviews(df, product_description, n=3, pprint=True):
embedding = get_embedding(product_description, model='text-embedding-3-small')
df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
res = df.sort_values('similarities', ascending=False).head(n)
return res
res = search_reviews(df, 'delicious beans', n=3)
The size of the embeddings varies with the complexity of the underlying model. In order to visualize this high dimensional data we use the t-SNE algorithm to transform the data into two dimensions.
The individual reviews are coloured based on the star rating which the reviewer has given:
The visualization seems to have produced roughly 3 clusters, one of which has mostly negative reviews.
This code is a way to visualize the relationship between different Amazon reviews based on their embeddings and scores. The t-SNE algorithm
is particularly good at preserving local structure in high-dimensional data, making it a popular choice for tasks like this.
import pandas as pd
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib
df = pd.read_csv('output/embedded_1k_reviews.csv')
matrix = df.ada_embedding.apply(eval).to_list()
# Create a t-SNE model and transform the data
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)
colors = ["red", "darkorange", "gold", "turquiose", "darkgreen"]
x = [x for x,y in vis_dims]
y = [y for x,y in vis_dims]
color_indices = df.Score.values - 1
colormap = matplotlib.colors.ListedColormap(colors)
plt.scatter(x, y, c=color_indices, cmap=colormap, alpha=0.3)
plt.title("Amazon ratings visualized in language using t-SNE")
β¦ An embedding serves as a versatile free-text feature encoder within a machine learning model.
β¦ Advantages over Traditional Methods:
β¦ The provided code segment splits the data into a training set and a testing set, which will be utilized for regression and classification use cases
β¦ This time, instead of having the algorithm predict a value anywhere between 1 and 5, we will attempt to classify the exact number of stars for a review into 5 buckets, ranging from 1 to 5 stars.
β¦ After the training, the model learns to predict 1 and 5-star reviews much better than the more nuanced reviews (2-4 stars), likely due to more extreme sentiment expression.
We can use embeddings for zero shot classification without any labeled training data.
from openai.embeddings_utils import cosine_similarity, get_embedding
df= df[df.Score!=3]
df['sentiment'] = df.Score.replace({1:'negative', 2:'negative', 4:'positive', 5:'positive'})
labels = ['negative', 'positive']
label_embeddings = [get_embedding(label, model=model) for label in labels]
def label_score(review_embedding, label_embeddings):
return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])
prediction = 'positive' if label_score('Sample Review', label_embeddings) > 0 else 'negative'
Clustering is one way of making sense of a large volume of textual data. Embeddings are useful for this task, as they provide semantically meaningful vector representations of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.
In this example, we discover four distinct clusters: one focusing on dog food, one on negative reviews, and two on positive reviews.
import numpy as np
from sklearn.cluster import KMeans
matrix = np.vstack(df.ada_embedding.values)
n_clusters = 4
kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_
We can obtain a user embedding by averaging over all of their reviews. Similarly, we can obtain a product embedding by averaging over all the reviews about that product. In order to showcase the usefulness of this approach we use a subset of 50k reviews to cover more reviews per user and per product.
We evaluate the usefulness of these embeddings on a separate test set, where we plot similarity of the user and product embedding as a function of the rating. Interestingly, based on this approach, even before the user receives the product we can predict better than random whether they would like the product.
user_embeddings = df.groupby('UserId').ada_embedding.apply(np.mean)
prod_embeddings = df.groupby('ProductId').ada_embedding.apply(np.mean)
β¦ After seeing some of these example use cases, you might think, βwhy should I care about these text embedding things? Canβt I just make use GPT-4 to analyze the text for me?
β¦ Techniques like Retrieval Augmented Generated (RAG) or Fine-tuning allow tailoring the LLMs to specific problem domains.
β¦ However, itβs important to recognize that these systems are still in their early stages. - Building a robust LLM system presents challenges such as high computational costs, security risks associated with large language models, unpredictable responses, and even hallucinations.
β¦ On the other hand, text embeddings have a long history, are lightweight, and deterministic.
Leveraging embeddings simplifies and reduces the cost of building LLM systems while retaining substantial value. By pre-computing text embeddings, you can significantly accelerate the training and inference process of LLMs. This leads to lower computational costs and faster development cycles. Additionally, embeddings capture semantic and syntactic information about text, providing a strong foundation for LLM performance.
It should be another tool in the NLP toolkit, allowing for efficient similarity search, clustering, and other tasks. Embeddings excel at capturing semantic and syntactic relationships between texts. This makes them invaluable for tasks like finding similar documents, grouping related content, and understanding the overall structure of a text corpus. By combining embeddings with LLMs, you can create more powerful and versatile applications.