Title: Embeddings

  • Embeddings
  • Handling Embeddings
  • Applying Embeddings
  • Retrieval Augmented Generation (RAG)
  • Hands-on Walkthrough and Tasks

What are Embeddings

  • ✦ Embeddings are a type of representation that bridges the human understanding of language to that of a machine.

    • In the context of Large Language Models (LLMs), to be specific, we are dealing with text embeddings.
    • There are other types of embeddings, such as image, audio, and video embeddings.
    • Embeddings are a powerful technique in machine learning that allows us to represent data in a lower-dimensional space while preserving its semantic meaning.
    • This approach has revolutionized various fields, including natural language processing (NLP), computer vision, and more.
  • ✦ Embeddings are a distributed representation of text, and arguably one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

    • Large language models like GPT-4, Gemini, or BERT use word embeddings as the first layer of the model. Admittedly, BERT is not that "large" compared to the other two, but it is still considered a significant advancement in natural language processing.

    • These models convert each word into a dense vector and feed it into the model. The models then use these vectors to predict the next word in a sentence (in the case of GPT-4) or to understand the context of a word (in the case of BERT).

    • These models are trained on a large corpus of text, so they learn the semantic meaning of words. For example, the word “king” is closer in this space to “queen” than it is to “apple”.

    • They are representations of text in an N-dimensional space, where words that have similar meanings have similar representations.

      • The text is translated into numbers, specifically into vectors.
      • That's why articles often describe embeddings as vectors too.
      • Essentially, a text embedding is a vector (i.e., a list) of floating-point numbers.
      • In other words, it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together.
    • The number of values in a text embedding — known as its “dimension” — depends on the embedding technique (the process of producing the vector), as well as how much information you want it to convey.

    • The example below shows a vector with 8 dimensions.

    • The table below shows common models along with the dimensions of their embeddings:

| Model                         | Embedding Dimension | Max Input Tokens |
|-------------------------------|---------------------|------------------|
| BERT-Base                     | 768                 | 512              |
| BERT-Large                    | 1024                | 512              |
| GPT-2                         | 768                 | 1024             |
| GPT-3                         | 768                 | 2048             |
| RoBERTa-Base                  | 768                 | 512              |
| RoBERTa-Large                 | 1024                | 512              |
| DistilBERT                    | 768                 | 512              |
| OpenAI text-embedding-3-small | 1536                | 8191             |
| OpenAI text-embedding-3-large | 3072                | 8191             |
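To make the idea of "a list of floating-point numbers" concrete, here is a minimal sketch. The values are made up for illustration; a real model would return vectors with the dimensions listed in the table above.

```python
# A made-up 8-dimensional text embedding (illustrative values only; a real
# model such as text-embedding-3-small would return 1536 floats).
embedding = [0.012, -0.031, 0.254, 0.081, -0.173, 0.334, -0.027, 0.101]

# The embedding's "dimension" is simply the number of values it holds.
dimension = len(embedding)
print(dimension)  # 8
```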



Visualize Embeddings

  • ✦ Let’s try to visualize the concept. Imagine that we have a collection of sentences that we’ve turned into vectors using a dense embedding technique.
    • If we simplify these vectors, which have hundreds of dimensions, down to just two dimensions, we can plot them on a two-dimensional grid.
    • For example, consider these seven pieces of text:
in_1 = "Flamingo spotted at the bird park"

in_2 = "Sea otter seen playing at the marine park"

in_3 = "Baby panda born at the city zoo"

in_4 = "Python developers prefer snake_case for variable naming"

in_5 = "New JavaScript framework aims to simplify coding"

in_6 = "C++ developers appreciate the power of OOP"

in_7 = "Java is a popular choice for enterprise applications"


list_of_input_texts = [in_1, in_2, in_3, in_4, in_5, in_6, in_7]
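To turn these texts into vectors, each one is sent to an embedding model. The sketch below is runnable offline: it uses a deterministic stand-in embedder (pseudo-random values, so no semantic meaning), with the real OpenAI call shown only in comments as an assumption, since it would require the `openai` package and an API key.

```python
import hashlib
import random

EMBED_DIM = 1536  # output dimension of OpenAI's text-embedding-3-small

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model so this sketch runs offline:
    # it returns a deterministic pseudo-random vector per text.
    # A real call would look roughly like this (assumes the `openai`
    # package and an OPENAI_API_KEY environment variable):
    #   client = openai.OpenAI()
    #   resp = client.embeddings.create(
    #       model="text-embedding-3-small", input=[text])
    #   return resp.data[0].embedding
    rng = random.Random(hashlib.sha256(text.encode()).hexdigest())
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]

# Two of the seven texts from above, for brevity.
vectors = [embed(t) for t in ["Flamingo spotted at the bird park",
                              "Baby panda born at the city zoo"]]
print(len(vectors), len(vectors[0]))  # 2 texts, 1536 values each
```

The stand-in is deterministic (same text always yields the same vector), which mirrors the behavior you would expect from a real embedding model at a fixed version.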

  • ✦ Each of the 7 texts will be converted into a vector (again, for our purposes you can think of a vector as a list). The diagram below shows the first text converted into a vector. Imagine that each of the 7 texts has its own vector of 1536 numerical values; here we assume we are using OpenAI's text-embedding-3-small.


  • ✦ The diagram below shows the graph after we simplify the 7 vectors down to 2 dimensions and plot them on the x and y axes.
    • Observe the distances between the different texts.
    • Although the text that starts with "Python developers prefer snake_case" contains two animal words (Python and snake), its embedding is further away from the three data points that are truly about real animals.
    • It is closer to the other three data points, which are about programming/coding.
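The notion of "closer" here is usually measured with cosine similarity between the vectors. A minimal sketch, using made-up 2-D points standing in for the reduced embeddings:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 2-D points standing in for the reduced embeddings.
panda  = [0.9, 0.2]   # "Baby panda born at the city zoo"
otter  = [0.8, 0.3]   # "Sea otter seen playing at the marine park"
python = [0.1, 0.9]   # "Python developers prefer snake_case ..."

# The two animal sentences are more similar to each other
# than either is to the coding sentence.
assert cosine_similarity(panda, otter) > cosine_similarity(panda, python)
```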

We will discuss how we convert the 1536 dimensions down to just 2 dimensions in the later part of Topic 4, Visualizing Embeddings
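As a preview, one common way to reduce high-dimensional vectors to 2 dimensions is PCA. The sketch below implements PCA via NumPy's SVD on random stand-in vectors (not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(7, 1536))  # stand-ins for the 7 real embeddings

# PCA via SVD: center the data, then project onto the top-2 directions.
centered = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ Vt[:2].T      # shape (7, 2): x/y coordinates to plot

print(points_2d.shape)  # (7, 2)
```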



Why are Embeddings Important

  • ✦ The straightforward reason is that they reduce data dimensionality and address a primary constraint: the need for speed.

    • As AI’s capabilities continue to grow, scaling automation can face speed and cost constraints. This is where the recent rise in interest in Embeddings becomes significant.
    • The main driver behind these techniques is the demand for speed, especially when processing large volumes of text data.
    • This is particularly pertinent for large language models like the GPT series, whether they are closed or open-sourced, where the efficient processing of enormous amounts of text is vital.
    • Embeddings serve as engineering tools to tackle the challenge of processing large-scale text swiftly and cost-effectively.

  • ✦ The initial phase of any Large Language Model (LLM) training is the most crucial: the neural network is constructed from a vast amount of data with an extensive number of features (let’s refer to them as details).

    • Language, which text represents, has many dimensions that are hard to specify or structurally quantify, including sentiment, grammar, meaning, and objects, to mention just a few.
    • The more dimensions there are, the more challenging it is for computers to analyze and learn from the data. This is where embeddings come in.
    • Data scientists employ embeddings to depict high-dimensional data in a low-dimensional space.
    • Think of embeddings as summaries.
      • They take high-dimensional data and condense it into a smaller, more manageable form, like picking out the key points from a long text.
      • This makes it easier and faster for AI models to process and understand the information. Just like summarizing a book saves you time and effort, embeddings help AI models work more efficiently.
      • Reducing the number of features while still capturing important patterns and relationships is the job of embeddings.
      • They allow AI models to learn and make predictions faster and with less computing power.



Embeddings Are Evolving

Embedding models have been used for a long time, primarily for training other LLMs or ML models.

The introduction of Retrieval Augmented Generation (RAG) and subsequently of Vector Store Databases has shed new light on these models.

They have a few common issues:

  1. They have a context length limit, just like Large Language Models.
  2. They usually excel at only one language (English).
  3. High-dimensional vectors are typically required for optimal results.
  4. They are usually trained for a specific task (text, image, or audio).

As research progressed, new state-of-the-art (text) embedding models began producing embeddings with increasingly higher output dimensions, meaning each input text is represented using more values. While this improves performance, it comes at the cost of efficiency and speed. Researchers were therefore motivated to create embedding models whose embeddings could be reasonably reduced in size without significantly sacrificing performance.
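For example, models trained this way (such as OpenAI's text-embedding-3 family, which exposes a `dimensions` parameter on its API) let you keep only the first values of an embedding and re-normalize. A minimal NumPy sketch of that truncate-and-renormalize step, using a random stand-in vector rather than a real embedding:

```python
import numpy as np

def shorten(embedding: np.ndarray, dim: int) -> np.ndarray:
    # Keep only the first `dim` values, then rescale to unit length so
    # cosine-similarity comparisons remain meaningful.
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(1).normal(size=3072)  # stand-in for a 3072-d embedding
short = shorten(full, 256)

print(short.shape)  # (256,)
```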