Title: Applying Embeddings

Getting Embeddings

Below is our new helper function, which returns embeddings for the text (a single string or a list of strings) passed to it.

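from openai import OpenAI

# Assumes the OPENAI_API_KEY environment variable is set
client = OpenAI()

def get_embedding(input, model='text-embedding-3-small', dimensions=None):
    # `input` may be a single string or a list of strings
    response = client.embeddings.create(
        input=input,
        model=model,
        dimensions=dimensions
    )
    # One embedding (a list of floats) is returned per input text
    return [x.embedding for x in response.data]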
  • ✦ The function accepts either of two models:
    • text-embedding-3-small, which produces embeddings with 1536 dimensions by default
    • text-embedding-3-large, which produces embeddings with 3072 dimensions by default

Usage is priced per input token. The table below shows roughly how many pages of text can be processed per US dollar (assuming ~800 tokens per page):

MODEL                     ~PAGES PER US DOLLAR   PERFORMANCE ON MTEB EVAL   MAX INPUT (TOKENS)
text-embedding-3-small    62,500                 62.3%                      8191
text-embedding-3-large    9,615                  64.6%                      8191
text-embedding-ada-002    12,500                 61.0%                      8191
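
The pages-per-dollar figures can be converted back into an implied price per million tokens. The sketch below is only arithmetic on the table's numbers and the stated ~800 tokens per page. The function name is illustrative, and the results are approximations derived from the table rather than official prices.

```python
TOKENS_PER_PAGE = 800  # assumption stated in the text above

# Approximate pages per US dollar, copied from the table
PAGES_PER_DOLLAR = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}


def implied_price_per_million_tokens(pages_per_dollar, tokens_per_page=TOKENS_PER_PAGE):
    """Convert a pages-per-dollar figure into US dollars per 1M tokens."""
    tokens_per_dollar = pages_per_dollar * tokens_per_page
    return 1_000_000 / tokens_per_dollar


for model, pages in PAGES_PER_DOLLAR.items():
    price = implied_price_per_million_tokens(pages)
    print(f"{model}: ~${price:.2f} per 1M tokens")
```

The same function can be reused with a different tokens-per-page estimate if your documents are denser or sparser than 800 tokens per page.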



OpenAI's Note on "Reducing Embedding Dimensions"

Using larger embeddings, for example storing them in a vector store for retrieval, generally costs more and consumes more compute, memory and storage than using smaller embeddings.

OpenAI's new embedding models, text-embedding-3-large and text-embedding-3-small, both allow builders to trade off performance against the cost of using embeddings.

  • ✦ Specifically, builders can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter.

  • ✦ For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding (one of OpenAI's older embedding models) with a size of 1,536.

  • ✦ In general, using the dimensions parameter when creating the embedding is the suggested approach. The code below shows how the helper function is called with dimensions set to 512.

# Helper Function for Getting Embeddings
def get_embedding(input, model='text-embedding-3-small', dimensions=None):
    response = client.embeddings.create(
        input=input,
        model=model,
        dimensions=dimensions
    )
    return [x.embedding for x in response.data]

# Calling the function
text = "Python developers prefer snake_case for variable naming"
# Returns a list with one embedding (512 numbers) for the single input text
embeddings = get_embedding(text, dimensions=512)
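
To make "shortening" concrete, the sketch below trims a vector by hand and rescales it to unit length. It is an illustration, not the API's internal implementation: `shorten_embedding` is a hypothetical helper, the six-number vector is made-up toy data, and re-normalizing after truncation is an assumption added here so that cosine or dot-product comparisons remain meaningful.

```python
import math


def shorten_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale to unit length."""
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(v * v for v in truncated))
    if norm == 0:
        return truncated  # an all-zero vector cannot be normalized
    return [v / norm for v in truncated]


# Toy vector standing in for a real embedding (hypothetical values)
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1]
short = shorten_embedding(full, dimensions=4)

print(len(short))                            # number of values kept
print(math.sqrt(sum(v * v for v in short)))  # length of the shortened vector
```

In practice, passing dimensions to the API, as in the call above, is the simpler route; a manual version like this is mainly useful when embeddings have already been stored at full size.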



Visualizing Embeddings

  • ✦ Visualizing data beyond three dimensions is inherently difficult due to our limited spatial intuition.
    • When working with complex embeddings, such as those produced by Large Language Models (LLMs) or other high-dimensional representations, it becomes practically impossible to visualize them directly in their original form.
    • One effective approach to make these embeddings more interpretable for humans is dimensionality reduction.
    • Techniques like Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) allow us to compress the data into a lower-dimensional space, typically two dimensions, while preserving as much of its structure as possible.
    • By doing so, we can create scatter plots or heatmaps that reveal patterns, clusters, and relationships, making it easier for us to grasp the underlying information.

Understanding UMAP

Uniform Manifold Approximation and Projection (UMAP) is a powerful dimensionality reduction technique that can be used to compress and visualize high-dimensional data in a lower-dimensional space.

  • ✦ UMAP aims to preserve the local structure of the data while retaining more of the global structure than many comparable non-linear techniques (such as t-SNE), making it an excellent tool for exploratory data analysis.

How Does UMAP Work?

UMAP operates in two main steps:

  1. In the first step, UMAP constructs a high-dimensional graph of the data.
    • It does this by considering each data point and its nearest neighbors in the high-dimensional space.
    • The distance between each point and its neighbors is calculated using a distance metric (such as Euclidean distance), and these distances are used to construct a weighted graph.
  2. In the second step, UMAP optimizes a low-dimensional graph to be as structurally similar as possible to the high-dimensional graph.
    • It uses a force-directed graph layout algorithm to optimize the positions of the points in the low-dimensional space.
    • The goal is to minimize the difference between the high-dimensional and low-dimensional representations of the data.
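
The first step can be illustrated with a plain k-nearest-neighbor graph. This is a simplified sketch, not UMAP's actual implementation: umap-learn uses approximate neighbor search and converts distances into fuzzy membership weights, whereas this version does a brute-force search and uses the raw Euclidean distance as the edge weight. The function name and toy points are illustrative.

```python
import math


def knn_graph(points, k):
    """Return {index: [(neighbor_index, distance), ...]} for the k nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point
        distances = [
            (j, math.dist(p, q)) for j, q in enumerate(points) if j != i
        ]
        distances.sort(key=lambda pair: pair[1])
        graph[i] = distances[:k]
    return graph


# Two well-separated groups of toy 3-dimensional points
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
]

for node, neighbors in knn_graph(points, k=2).items():
    print(node, [(j, round(d, 3)) for j, d in neighbors])
```

Each point's neighbors fall inside its own group, which is the kind of local structure the second step then tries to reproduce in the low-dimensional layout.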

Why Use UMAP?

UMAP has several advantages over other dimensionality reduction techniques:

  1. Preservation of Structure: UMAP aims to preserve both the local and global structure of the data. This means that clusters of similar data points, and to a reasonable extent the broader relationships between these clusters, are maintained in the lower-dimensional space.

  2. Scalability: UMAP is highly scalable and can handle large datasets efficiently.

  3. Flexibility: UMAP is not limited to just visualization. It can also be used for general non-linear dimension reduction tasks, making it a versatile tool for many data analysis tasks.


Using UMAP in Python

The UMAP algorithm is implemented in the umap-learn package in Python. Here's a simple example of how to use it:

import umap
import numpy as np

# Assume embeddings is your high-dimensional data
embeddings = np.random.rand(100, 50)

reducer = umap.UMAP()
umap_embeddings = reducer.fit_transform(embeddings)

In this example, umap.UMAP() creates a UMAP object, and fit_transform() fits the model to the data and then transforms the data to a lower-dimensional representation. The result, umap_embeddings, is an array with one row per data point and two columns (here, shape (100, 2)), since UMAP reduces to two dimensions by default.

In conclusion, UMAP is a powerful tool for data analysts dealing with high-dimensional data. It offers a way to visualize and understand the structure of the data, making it an invaluable tool in the data analyst's toolkit.


Compare and contrast UMAP with PCA

You may have learnt about Principal Component Analysis (PCA) in the Data Champions Bootcamp or in other machine learning or statistical analysis courses. Here we look at why UMAP is often a better choice than PCA, especially for complex data.

  1. Linearity vs Non-linearity: PCA is a linear dimension reduction technique. It works well when the data lies along a linear subspace, but it may not capture complex structures in the data. On the other hand, UMAP is a non-linear dimension reduction technique. It can capture more complex structures in the data, making it more suitable for high-dimensional data where the structure is not linear.

  2. Preservation of Structure: PCA aims to preserve the variance in the data. It projects the data onto the directions (principal components) where the variance is maximized. However, it makes no explicit attempt to keep neighboring points close together, so local neighborhoods can be distorted when the structure is non-linear. UMAP, on the other hand, aims to preserve both the local and global structure of the data. It tries to keep points that are close together in the high-dimensional space close together in the lower-dimensional projection.

  3. Scalability: PCA is usually the faster of the two, although exact PCA becomes expensive when both the number of features and the number of samples are very large. UMAP requires more computation, but it relies on approximate nearest-neighbor search and still handles large datasets with many features efficiently.

  4. Interpretability: The principal components in PCA are combinations of the original features, which can be interpreted in terms of the original features. This is not the case with UMAP, as it uses a more complex algorithm to reduce dimensionality, which might not be as easily interpretable.

In summary, while PCA is a good choice for linear data and when interpretability is important, UMAP is more suitable for complex, high-dimensional data where preserving the structure of the data is crucial.

import numpy as np
import pandas as pd
import umap # For compressing high-dimensional data (many columns) into lower-dimensional data (e.g. 2 columns) 
import matplotlib.pyplot as plt
import seaborn as sns # For data visualization

# New Helper Function
def get_projected_embeddings(embeddings, random_state=0):
    reducer = umap.UMAP(random_state=random_state).fit(embeddings)
    embeddings_2d_array = reducer.transform(embeddings)
    return pd.DataFrame(embeddings_2d_array, columns=['x', 'y'])

💡 Explanation:
  • def get_projected_embeddings(embeddings, random_state=0): 
    • This line defines the function and its parameters.
    • The function takes in two arguments: embeddings (your high-dimensional data) and random_state (a seed for the random number generator, which ensures that the results are reproducible).
  • reducer = umap.UMAP(random_state=random_state).fit(embeddings) 
    • This line creates a UMAP object and fits it to your data.
    • The fit method learns the structure of the data.
  • embeddings_2d_array = reducer.transform(embeddings) 
    • This line transforms the high-dimensional data into a lower-dimensional space.
    • The transformed data is stored in embeddings_2d_array.
  • return pd.DataFrame(embeddings_2d_array, columns=['x', 'y']) 
    • This line converts the lower-dimensional data into a pandas DataFrame for easier manipulation and returns it.
    • The DataFrame has two columns, 'x' and 'y', which represent the two dimensions of the reduced data.

Below is an example of using the new helper function and then visualizing its output with a scatterplot:




Understand Distance between Embeddings

Since embeddings capture semantic information, they allow us to compare a pair of texts based on their vector representations.

  • ✦ One very common way to compare a pair of embeddings is to measure the distance between them.

    • The distance between two vectors measures their relatedness.
    • Small distances suggest high relatedness.
    • Large distances suggest low relatedness.
  • ✦ With the distance between a pair of embeddings, we can then apply the distance in many other use cases such as:

    • Identify texts that are semantically close to a target text, by finding the texts that have a short distance to (i.e., are closer to) the target text.
    • Identify outliers, by finding the datapoints that are furthest away from the rest of the typical datapoints.
    • Identify clusters, by grouping datapoints that are located close to each other into distinct groups.
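
The first two use cases can be sketched with toy data. The 2-dimensional vectors and labels below are invented for illustration (real embeddings have hundreds or thousands of dimensions), both helper functions are hypothetical, and Euclidean distance is used only because it is the simplest to read.

```python
import math

# Made-up 2-dimensional "embeddings" for illustration only
embeddings = {
    "cat": (1.0, 1.0),
    "kitten": (1.1, 0.9),
    "dog": (1.5, 1.2),
    "invoice": (9.0, -4.0),
}


def closest_to(target, vectors):
    """Return the other labels sorted from nearest to furthest from `target`."""
    others = [label for label in vectors if label != target]
    return sorted(others, key=lambda label: math.dist(vectors[target], vectors[label]))


def furthest_from_centroid(vectors):
    """Return the label whose vector lies furthest from the mean of all vectors."""
    dims = len(next(iter(vectors.values())))
    centroid = [sum(v[d] for v in vectors.values()) / len(vectors) for d in range(dims)]
    return max(vectors, key=lambda label: math.dist(vectors[label], centroid))


print(closest_to("cat", embeddings))       # labels ordered by distance to "cat"
print(furthest_from_centroid(embeddings))  # a simple outlier candidate
```

Distance from the centroid is only one simple way to flag an outlier; it is used here because it needs no extra libraries.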

Cosine Similarity

Cosine similarity is one of the most common, and often the default, methods for measuring how close a pair of embeddings is.

  • ✦ It measures the cosine of the angle between two vectors.
    • If the vectors are identical, the angle is 0 and the cosine similarity is 1.
    • If the vectors are orthogonal, the angle is 90 degrees and the cosine similarity is 0, indicating no similarity.
  • ✦ It quantifies how similar or aligned two vectors are in a high-dimensional space.
  • ✦ In Python, you can use the cosine_similarity function from the sklearn.metrics.pairwise module to calculate cosine similarity.

    • In the context of LLMs, we often rely on LLM frameworks such as LangChain, which handle low-level operations such as calculating the distance behind the scenes, so that we can focus on the logic of our applications.
    • It's rare that we will need to write the Python code for calculating cosine similarity on our own.
  • ✦ Cosine similarity is particularly useful for LLM embeddings because it effectively captures the semantic similarity between text documents.

    • It's robust to the high dimensionality of LLM embeddings and is relatively efficient to compute, making it a popular choice for measuring the distance between LLM embeddings.
  • ✦ For a production-level retriever that needs to search over many vectors quickly, it is generally suggested to use a vector database.
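
The two reference cases described in this section (identical and orthogonal vectors) can be checked with a few lines of standard-library Python. This is a minimal sketch for intuition only: the function below is hand-written, uses toy vectors, and is not the scikit-learn function mentioned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # identical vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [3.0, 6.0]))  # same direction, different length
```

The third call illustrates that cosine similarity depends only on the direction of the vectors, not on their length.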


The Perils of Embeddings: Protecting Sensitive Information

While embeddings offer significant advantages in various applications, they also pose substantial risks to privacy and data security.

Embeddings are essentially numerical representations of text data, and despite their seemingly abstract nature, they can encode sensitive information about individuals or organizations.


Risk of Disclosing Embeddings

  • ✦ Embeddings Contain Sensitive Information:

    • Embeddings derived from sensitive data are equally sensitive.
    • Despite their appearance as cryptic numbers, embeddings encode private details.
  • ✦ Inversion Attacks:

    • Researchers have demonstrated the ability to reverse-engineer embeddings back into their original text form through embedding inversion attacks.
    • Attackers can exploit this technique to recover sensitive information from seemingly harmless numerical representations.

Handling Embeddings with Care

  • ✦ Privacy Implications:

    • Organizations must acknowledge that embeddings are susceptible to privacy risks.
    • Protecting embeddings is crucial, especially when they represent confidential information.
  • ✦ Balancing Utility and Privacy:

    • While embeddings enhance AI capabilities, it is essential to find a balance between utility and privacy.
    • Robust security measures and awareness are necessary to prevent accidental information leakage.