Title: Applying Embeddings
This is our new helper function, which returns the embeddings for a list of texts passed to it.
# `client` is assumed to be an already-initialised OpenAI client
def get_embedding(input, model='text-embedding-3-small', dimensions=None):
    response = client.embeddings.create(
        input=input,
        model=model,
        dimensions=dimensions
    )
    return [x.embedding for x in response.data]
OpenAI's newer embedding models are:
- `text-embedding-3-small`, which produces embeddings with 1536 dimensions by default
- `text-embedding-3-large`, which produces embeddings with 3072 dimensions by default

Usage is priced per input token. Below is an example of how many pages of text can be processed per US dollar (assuming ~800 tokens per page):
| Model | ~ Pages per US dollar | Performance on MTEB eval | Max input (tokens) |
|---|---|---|---|
| text-embedding-3-small | 62,500 | 62.3% | 8191 |
| text-embedding-3-large | 9,615 | 64.6% | 8191 |
| text-embedding-ada-002 | 12,500 | 61.0% | 8191 |
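The "pages per US dollar" figures can be turned back into a rough price per token. The sketch below only rearranges the numbers in the table under the same ~800 tokens-per-page assumption; the helper names `implied_usd_per_million_tokens` and `estimate_cost_usd` are hypothetical, and the results are approximations rather than official prices.

```python
# Back out the price per 1M tokens implied by the table's "pages per US dollar"
# figures, using the same ~800 tokens-per-page assumption as the text.
TOKENS_PER_PAGE = 800  # assumption taken from the text, not a measured value

pages_per_usd = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}

def implied_usd_per_million_tokens(pages, tokens_per_page=TOKENS_PER_PAGE):
    """Price of 1M tokens implied by how many pages one dollar buys."""
    tokens_per_usd = pages * tokens_per_page
    return 1_000_000 / tokens_per_usd

def estimate_cost_usd(num_pages, model):
    """Rough cost of embedding `num_pages` pages with the given model."""
    return num_pages / pages_per_usd[model]

for model, pages in pages_per_usd.items():
    price = implied_usd_per_million_tokens(pages)
    print(f"{model}: ~${price:.2f} per 1M tokens")
```

For instance, `estimate_cost_usd(125_000, "text-embedding-3-small")` gives a rough cost of 2 US dollars for 125,000 pages.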
Using larger embeddings, for example storing them in a vector store for retrieval, generally costs more and consumes more compute, memory and storage than using smaller embeddings.
OpenAI's new embedding models, `text-embedding-3-large` and `text-embedding-3-small`, both allow builders to trade off performance against the cost of using embeddings.
✦ Specifically, builders can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the `dimensions` API parameter.
✦ For example, on the MTEB benchmark, a `text-embedding-3-large` embedding can be shortened to a size of 256 while still outperforming an unshortened `text-embedding-ada-002` embedding (one of OpenAI's older embedding models) with a size of 1,536.
✦ In general, using the `dimensions` parameter when creating the embedding is the suggested approach. The code below shows how the helper function is called with `dimensions` set to 512.
# Helper Function for Getting Embeddings
def get_embedding(input, model='text-embedding-3-small', dimensions=None):
    response = client.embeddings.create(
        input=input,
        model=model,
        dimensions=dimensions
    )
    return [x.embedding for x in response.data]

# Calling the function
text = "Python developers prefer snake_case for variable naming"
embeddings = get_embedding(text, dimensions=512)
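When the `dimensions` parameter cannot be used (for instance, when full-size embeddings have already been stored), a similar shortening can be approximated by hand: keep the leading values and rescale the vector back to unit length. The sketch below is a minimal illustration of that idea, not the API's internal implementation. `shorten_embedding` is a hypothetical helper name, and a random vector stands in for a real embedding so that no API call is needed.

```python
import numpy as np

def shorten_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    truncated = np.asarray(embedding, dtype=float)[:dimensions]
    norm = np.linalg.norm(truncated)
    if norm == 0:
        return truncated  # avoid dividing by zero for an all-zero vector
    return truncated / norm

# Stand-in for a real 1536-dimensional embedding (random values, for illustration only)
rng = np.random.default_rng(0)
full_embedding = rng.normal(size=1536)

short_embedding = shorten_embedding(full_embedding, 512)
print(short_embedding.shape)            # (512,)
print(np.linalg.norm(short_embedding))  # approximately 1.0
```

Rescaling matters because cosine similarity and dot-product comparisons usually assume unit-length vectors, and a truncated vector is no longer unit length.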
Uniform Manifold Approximation and Projection (UMAP) is a powerful dimensionality reduction technique that can be used to compress and visualize high-dimensional data in a lower-dimensional space.
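UMAP operates in two main steps:
1. It builds a weighted nearest-neighbour graph that represents the data in the high-dimensional space.
2. It optimises a low-dimensional layout of the points so that its graph is as structurally similar as possible to the high-dimensional one.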
UMAP has several advantages over other dimensionality reduction techniques:
Preservation of Structure: UMAP aims to preserve the local structure of the data while retaining much of its global structure. This means that clusters of similar data points, and to a large extent the broader relationships between these clusters, are maintained in the lower-dimensional space.
Scalability: UMAP is highly scalable and can handle large datasets efficiently.
Flexibility: UMAP is not limited to just visualization. It can also be used for general non-linear dimension reduction tasks, making it a versatile tool for many data analysis tasks.
The UMAP algorithm is implemented in the `umap-learn` package in Python. Here's a simple example of how to use it:
import umap
import numpy as np
# Assume embeddings is your high-dimensional data
embeddings = np.random.rand(100, 50)
reducer = umap.UMAP()
umap_embeddings = reducer.fit_transform(embeddings)
In this example, `umap.UMAP()` creates a UMAP object, and `fit_transform()` fits the model to the data and then transforms the data to a lower-dimensional representation. The result, `umap_embeddings`, is a 2D array holding the lower-dimensional embedding of each data point (two columns by default).
In conclusion, UMAP is a powerful tool for data analysts dealing with high-dimensional data. It offers a way to visualize and understand the structure of the data, making it an invaluable tool in the data analyst's toolkit.
You may have learnt about Principal Component Analysis (PCA) in the Data Champions Bootcamp or in other machine learning or statistical analysis courses. Here we look at why UMAP is often a better choice than PCA, especially when it comes to complex data.
Linearity vs Non-linearity: PCA is a linear dimension reduction technique. It works well when the data lies along a linear subspace, but it may not capture complex structures in the data. On the other hand, UMAP is a non-linear dimension reduction technique. It can capture more complex structures in the data, making it more suitable for high-dimensional data where the structure is not linear.
Preservation of Structure: PCA aims to preserve the variance in the data. It projects the data onto the directions (principal components) where the variance is maximized. However, it does not explicitly try to keep nearby points close together, so fine-grained local structure can be lost. UMAP, on the other hand, aims to preserve the local structure of the data while retaining much of its global structure. It tries to keep points that are near each other in the high-dimensional space near each other in the lower-dimensional projection.
Scalability: PCA is computationally cheap, and randomized or incremental variants cope well with large datasets. UMAP is more expensive than PCA, but it relies on approximate nearest-neighbour search and so still scales to datasets with many features and many samples.
Interpretability: The principal components in PCA are combinations of the original features, which can be interpreted in terms of the original features. This is not the case with UMAP, as it uses a more complex algorithm to reduce dimensionality, which might not be as easily interpretable.
In summary, while PCA is a good choice for linear data and when interpretability is important, UMAP is more suitable for complex, high-dimensional data where preserving the structure of the data is crucial.
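To make the linearity and interpretability points concrete, here is a minimal PCA sketch that uses only NumPy's SVD. It is an illustration under stated assumptions rather than a replacement for a library implementation: `pca_project` is a hypothetical helper name, and random data stands in for real embeddings.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project data onto its top principal components using SVD."""
    centered = data - data.mean(axis=0)      # PCA works on mean-centred data
    # Rows of vt are the principal directions, ordered by explained variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]           # shape: (n_components, n_features)
    projected = centered @ components.T      # a purely linear map of the inputs
    explained_variance = singular_values[:n_components] ** 2 / (len(data) - 1)
    return projected, components, explained_variance

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50))  # stand-in for 100 embeddings with 50 dimensions

projected, components, explained_variance = pca_project(data)
print(projected.shape)   # (100, 2)
print(components.shape)  # (2, 50): each component is a weighting of the 50 original features
```

Each row of `components` assigns one weight to every original feature, which is what makes PCA interpretable. The projection itself is a single matrix multiplication, which is why PCA cannot bend around non-linear structure the way UMAP can.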
import numpy as np
import pandas as pd
import umap # For compressing high-dimensional data (many columns) into lower-dimensional data (e.g. 2 columns)
import matplotlib.pyplot as plt
import seaborn as sns # For data visualization
# New Helper Function
def get_projected_embeddings(embeddings, random_state=0):
    reducer = umap.UMAP(random_state=random_state).fit(embeddings)
    embeddings_2d_array = reducer.transform(embeddings)
    return pd.DataFrame(embeddings_2d_array, columns=['x', 'y'])
Below is an example of using the new helper function and then visualizing its output with a scatterplot:
Since embeddings capture semantic information, they allow us to compare a pair of texts based on their vector representations.
✦ One very common way to compare a pair of embeddings is to measure the distance between them.
✦ Once we have the distance between a pair of embeddings, we can apply it in many other use cases such as:
Cosine similarity is one of the most common methods, and often the default, for measuring how close a pair of embeddings is.
import numpy as np

# Define two vectors A and B
A = np.array([1, 2, 3])  # Example vector A
B = np.array([4, 5, 6])  # Example vector B

# Define a function to calculate cosine similarity
def cosine_similarity(vector_a, vector_b):
    # Calculate the dot product of A and B
    dot_product = np.dot(vector_a, vector_b)
    # Calculate the L2 norm (magnitude) of A and B.
    # The L2 norm (also known as the Euclidean norm) of a vector is the
    # square root of the sum of the squares of its components.
    # - The Euclidean norm provides a straightforward measure of the magnitude of a vector.
    # - It captures how "big" or "long" a vector is, regardless of its direction.
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # Calculate cosine similarity
    cosine_sim = dot_product / (norm_a * norm_b)
    return cosine_sim

# Calculate and print the cosine similarity between A and B
cos_sim = cosine_similarity(A, B)
print(f"The cosine similarity between A and B is: {cos_sim}")
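Written as a formula, the function above computes the following. It is worked through here for the two example vectors $A = (1, 2, 3)$ and $B = (4, 5, 6)$:

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6}{\sqrt{1^2 + 2^2 + 3^2} \, \sqrt{4^2 + 5^2 + 6^2}}
  = \frac{32}{\sqrt{14} \, \sqrt{77}}
  \approx 0.9746
```

A value close to 1 means the two vectors point in almost the same direction, even though their lengths differ.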
✦ In Python, you can use the `cosine_similarity` function from the `sklearn.metrics.pairwise` module to calculate cosine similarity.
✦ In practice, we often use frameworks such as `LangChain` that handle low-level operations such as calculating the distance behind the scenes, so that we can focus on the logic of our applications and rarely need to calculate cosine similarity on our own.
✦ Cosine similarity is particularly useful for LLM embeddings because it compares the direction of two vectors rather than their length, which captures the semantic similarity between text documents well.
✦ For a production-level retriever that needs to search over many vectors quickly, it is generally suggested to use a vector database.
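Before reaching for a vector database, it can help to see what a retriever does at its core. The sketch below is a brute-force search that scores every stored vector against the query by cosine similarity and returns the best matches. `top_k_similar` is a hypothetical helper name, and random vectors stand in for real embeddings.

```python
import numpy as np

def top_k_similar(query, corpus, k=3):
    """Return indices and cosine similarities of the k corpus vectors closest to the query."""
    # Normalise to unit length so that a dot product equals cosine similarity
    query_unit = query / np.linalg.norm(query)
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    similarities = corpus_unit @ query_unit            # one score per corpus vector
    top_indices = np.argsort(similarities)[::-1][:k]   # highest similarity first
    return top_indices, similarities[top_indices]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 512))              # stand-in for 1,000 stored embeddings
query = corpus[42] + 0.01 * rng.normal(size=512)   # a slightly perturbed copy of item 42

indices, scores = top_k_similar(query, corpus)
print(indices[0])  # 42, since the query is a near-copy of that vector
```

This exhaustive scan grows linearly with the number of stored vectors. A vector database typically replaces it with an index so that search stays fast as the collection grows.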
While embeddings offer significant advantages in various applications, they also pose substantial risks to privacy and data security.
Embeddings are essentially numerical representations of text data, and despite their seemingly abstract nature, they can encode sensitive information about individuals or organizations.
✦ Embeddings Contain Sensitive Information:
✦ Inversion Attacks:
✦ Privacy Implications:
✦ Balancing Utility and Privacy: