Title: Applying Embeddings

Use Cases of Embeddings

  • ✦ Embeddings are commonly used for (but not limited to):
    • Search (where results are ranked by relevance to a query string)
    • Clustering (where text strings are grouped by similarity)
    • Recommendations (where items with related text strings are recommended)
    • Anomaly detection (where outliers with little relatedness are identified)
    • Diversity measurement (where similarity distributions are analyzed)
    • Classification (where text strings are classified by their most similar label)
This note provides an overview of the various use cases.
  • ✦ Therefore, only the core part of the code is shown.
  • ✦ We will go through some of these use cases in detail in our Jupyter Notebook.
  • ✦ For use cases not covered in our Jupyter Notebook, you can find the detailed implementation by clicking the links inserted at the end of each use case below.
  • ✦ You don't need to understand the code in every use case below.
    • The primary objective is for us to be aware of the potential use cases of embeddings
    • and to have an intuition of how embeddings are used in such use cases.
    • You can delve deeper into the use cases that are potentially relevant to your project.



Here is the sample data used in the use cases below:


To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.

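# Note: `openai.embeddings_utils` ships only with older (pre-1.0) releases of the `openai` package.
from openai.embeddings_utils import get_embedding, cosine_similarity

def search_reviews(df, product_description, n=3, pprint=True):
   # Embed the query string with the same model that produced df.ada_embedding
   embedding = get_embedding(product_description, model='text-embedding-3-small')
   # Score every review against the query, then keep the n best matches
   df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res

# df is assumed to already hold one embedding per review in the `ada_embedding` column
res = search_reviews(df, 'delicious beans', n=3)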
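
The ranking step can also be tried without any API call. The sketch below is only an illustration: it uses hand-made 3-dimensional vectors in place of real embeddings, and the helper name `rank_by_similarity` is hypothetical rather than part of any library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec, doc_vecs, n=3):
    """Return (index, score) pairs for the n documents most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = np.argsort(scores)[::-1][:n]  # highest score first
    return [(int(i), scores[i]) for i in order]

# Toy "embeddings" (made-up values, not real model output).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

top = rank_by_similarity(query_vec, doc_vecs, n=2)
for idx, score in top:
    print(f"document {idx}: similarity {score:.3f}")
```

With real data, `doc_vecs` would be the stored review embeddings and `query_vec` the embedding of the search string.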



Visualizing Complex Data

The size of the embeddings varies with the complexity of the underlying model. In order to visualize this high dimensional data we use the t-SNE algorithm to transform the data into two dimensions.

The individual reviews are coloured based on the star rating which the reviewer has given:

  • 1-star: red
  • 2-star: dark orange
  • 3-star: gold
  • 4-star: turquoise
  • 5-star: dark green

The visualization seems to have produced roughly 3 clusters, one of which has mostly negative reviews.

This code is a way to visualize the relationship between different Amazon reviews based on their embeddings and scores. The t-SNE algorithm is particularly good at preserving local structure in high-dimensional data, making it a popular choice for tasks like this.

import pandas as pd
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib

df = pd.read_csv('output/embedded_1k_reviews.csv')
matrix = df.ada_embedding.apply(eval).to_list()

# Create a t-SNE model and transform the data
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)

colors = ["red", "darkorange", "gold", "turquoise", "darkgreen"]
x = [x for x,y in vis_dims]
y = [y for x,y in vis_dims]
color_indices = df.Score.values - 1

colormap = matplotlib.colors.ListedColormap(colors)
plt.scatter(x, y, c=color_indices, cmap=colormap, alpha=0.3)
plt.title("Amazon ratings visualized in language using t-SNE")



Embedding as a text feature encoder for ML algorithms

  • ✦ An embedding serves as a versatile free-text feature encoder within a machine learning model.

    • When some of the relevant inputs are free text, incorporating embeddings can improve the performance of many machine learning models.
    • Additionally, embeddings can be employed as categorical feature encoders, especially when dealing with numerous and meaningful categorical variable names (such as job titles).
    • Embeddings transform text into meaningful numerical representations that capture semantic relationships between words or phrases.
  • ✦ Advantages over Traditional Methods:

    1. Superior to One-Hot Encoding: Imagine representing job titles like "Software Engineer" and "Data Scientist" with one-hot encoding. You'd end up with a sparse and high-dimensional vector space where these titles are treated as completely unrelated entities. Embeddings, however, can capture the inherent similarity between these roles, leading to better model performance.
    2. Overcoming Challenges of Direct NLP Processing: Traditional NLP techniques often involve complex pipelines with tasks like tokenization, stemming, and part-of-speech tagging. These pipelines can be brittle and computationally expensive. Embeddings offer a more efficient and robust alternative by condensing textual information into dense vectors.
  • ✦ The provided code segment splits the data into a training set and a testing set, which will be utilized for regression and classification use cases
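
A train/test split of this kind might look like the sketch below. It assumes scikit-learn's `train_test_split`, and it uses random vectors and random scores as stand-ins for the embedded reviews; in the real dataset, `X` would come from the `ada_embedding` column and `y` from `Score`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in data (assumption): 100 fake 8-dimensional "embeddings" and star ratings 1-5.
X = rng.normal(size=(100, 8))
y = rng.integers(1, 6, size=100)

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```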


A) Use Embeddings as Feature(s) in a Regression Model

  • ✦ Because embeddings carry a lot of semantic information, the prediction is likely to be decent even without large amounts of data.
  • ✦ We assume that the score (the target variable) is a continuous variable between 1 and 5, and allow the algorithm to predict any floating point value.
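
As a rough sketch of the regression setup, the example below fits a random forest regressor on embedding-like features. The choice of `RandomForestRegressor` is an assumption made for illustration, and the data is synthetic: the "score" is generated from the first feature so that there is something to learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings (assumption): 300 vectors of dimension 16.
X = rng.normal(size=(300, 16))
# Synthetic continuous score in [1, 5], driven mostly by the first dimension.
y = np.clip(3 + 1.5 * X[:, 0] + rng.normal(scale=0.3, size=300), 1, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The regressor may predict any floating point value, not just whole stars.
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(X_train, y_train)
preds = rfr.predict(X_test)

mae = mean_absolute_error(y_test, preds)
# Baseline for comparison: always predict the mean training score.
baseline_mae = mean_absolute_error(y_test, np.full_like(y_test, y_train.mean()))
print(f"MAE: {mae:.2f} (mean-score baseline: {baseline_mae:.2f})")
```

Comparing against the mean-score baseline gives a quick sense of whether the features carry any useful signal.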

B) Use Embeddings as Feature(s) in a Classification Model

  • ✦ This time, instead of having the algorithm predict a value anywhere between 1 and 5, we will attempt to classify the exact number of stars for a review into 5 buckets, ranging from 1 to 5 stars.

  • ✦ After training, the model predicts 1-star and 5-star reviews much better than the more nuanced reviews (2 to 4 stars), likely because extreme sentiment is expressed more clearly.
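
The classification variant can be sketched the same way. Again, `RandomForestClassifier` is an assumed model choice and the data is synthetic: star labels are derived by bucketing the first feature, so the numbers printed here say nothing about real reviews.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings (assumption): 500 vectors of dimension 16.
X = rng.normal(size=(500, 16))
# Synthetic star labels 1-5, obtained by cutting the first dimension into 5 buckets.
y = np.digitize(X[:, 0], bins=[-1.0, -0.3, 0.3, 1.0]) + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

# Per-class precision and recall show which star ratings are hardest to predict.
print(classification_report(y_test, preds, zero_division=0))
```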




Zero-Shot Classification

We can use embeddings for zero-shot classification without any labeled training data.

  • ✦ For each class, we embed the class name or a short description of the class.
  • ✦ To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity.
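
from openai.embeddings_utils import cosine_similarity, get_embedding

model = 'text-embedding-3-small'  # same model as in the search example

# Drop neutral 3-star reviews and map the remaining scores to a binary sentiment
df = df[df.Score != 3].copy()
df['sentiment'] = df.Score.replace({1: 'negative', 2: 'negative', 4: 'positive', 5: 'positive'})

labels = ['negative', 'positive']
label_embeddings = [get_embedding(label, model=model) for label in labels]

def label_score(review_embedding, label_embeddings):
   # Positive when the review is closer to 'positive' than to 'negative'
   return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])

# Embed the review text first: label_score expects an embedding, not a raw string
review_embedding = get_embedding('Sample Review', model=model)
prediction = 'positive' if label_score(review_embedding, label_embeddings) > 0 else 'negative'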
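
The same idea extends to more than two classes by picking the label with the highest similarity. The sketch below is an illustration only: the vectors are made up by hand rather than produced by an embedding model, and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np

def zero_shot_classify(text_vec, label_vecs):
    """Return (label, score) for the label embedding most similar to text_vec."""
    text_vec = np.asarray(text_vec, dtype=float)
    best_label, best_score = None, -np.inf
    for label, vec in label_vecs.items():
        vec = np.asarray(vec, dtype=float)
        score = np.dot(text_vec, vec) / (np.linalg.norm(text_vec) * np.linalg.norm(vec))
        if score > best_score:
            best_label, best_score = label, float(score)
    return best_label, best_score

# Hand-made stand-ins for label embeddings (assumption, not real model output).
label_vecs = {
    "negative": [-1.0, 0.2, 0.0],
    "neutral": [0.0, 1.0, 0.0],
    "positive": [1.0, 0.2, 0.0],
}
review_vec = [0.8, 0.3, 0.1]  # stand-in for the embedding of a new review

label, score = zero_shot_classify(review_vec, label_vecs)
print(label, round(score, 3))
```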

Clustering

Clustering is one way of making sense of a large volume of textual data. Embeddings are useful for this task, as they provide semantically meaningful vector representations of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.

In this example, we discover four distinct clusters: one focusing on dog food, one on negative reviews, and two on positive reviews.

import numpy as np
from sklearn.cluster import KMeans

matrix = np.vstack(df.ada_embedding.values)
n_clusters = 4

kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_



Recommendations

We can obtain a user embedding by averaging over all of their reviews. Similarly, we can obtain a product embedding by averaging over all the reviews about that product. In order to showcase the usefulness of this approach we use a subset of 50k reviews to cover more reviews per user and per product.

We evaluate the usefulness of these embeddings on a separate test set, where we plot similarity of the user and product embedding as a function of the rating. Interestingly, based on this approach, even before the user receives the product we can predict better than random whether they would like the product.

user_embeddings = df.groupby('UserId').ada_embedding.apply(np.mean)
prod_embeddings = df.groupby('ProductId').ada_embedding.apply(np.mean)
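
To make the averaging step explicit, here is a small self-contained sketch. It assumes each row stores its embedding as a list of floats, uses a made-up four-row table with 2-dimensional vectors, and stacks each group's vectors before taking the mean. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_embedding(vectors):
    """Average a collection of equal-length vectors element-wise."""
    return np.vstack(list(vectors)).mean(axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up reviews with tiny 2-dimensional "embeddings" (assumption).
reviews = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u2"],
    "ProductId": ["p1", "p2", "p1", "p3"],
    "ada_embedding": [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]],
})

# One averaged vector per user and per product.
user_embeddings = {
    user: mean_embedding(group["ada_embedding"])
    for user, group in reviews.groupby("UserId")
}
prod_embeddings = {
    prod: mean_embedding(group["ada_embedding"])
    for prod, group in reviews.groupby("ProductId")
}

# A higher similarity suggests the user is more likely to rate the product well.
for prod in ("p2", "p3"):
    sim = cosine_similarity(user_embeddings["u1"], prod_embeddings[prod])
    print(f"u1 vs {prod}: {sim:.3f}")
```

In the evaluation described above, this user-product similarity would be computed on held-out reviews and plotted against the actual rating.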



Why Can't I just use GPT-4 directly?

  • ✦ After seeing some of these example use cases, you might think, "Why should I care about these text embedding things? Can't I just use GPT-4 to analyze the text for me?"

  • ✦ Techniques like Retrieval-Augmented Generation (RAG) or fine-tuning allow tailoring LLMs to specific problem domains.

  • ✦ However, it's important to recognize that these systems are still in their early stages. Building a robust LLM system presents challenges such as high computational costs, security risks associated with large language models, unpredictable responses, and even hallucinations.

  • ✦ On the other hand, text embeddings have a long history and are lightweight and deterministic.

    • Leveraging embeddings can simplify LLM systems and reduce the cost of building them while retaining substantial value. Because text embeddings can be pre-computed and stored, similarity lookups at query time are fast and cheap, which lowers computational costs and shortens development cycles. Embeddings also capture semantic and syntactic information about text, which gives LLM-based applications a strong foundation.

    • Embeddings should be another tool in the NLP toolkit, allowing for efficient similarity search, clustering, and other tasks. They are invaluable for finding similar documents, grouping related content, and understanding the overall structure of a text corpus. By combining embeddings with LLMs, you can create more powerful and versatile applications.