icon: LiNotebook
Title: Embeddings
✦ Embeddings are a type of representation that bridges the human understanding of language to that of a machine.
✦ They are a distributed representation of text, and perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.
Large language models like GPT-4, Gemini, and BERT use word embeddings as the first layer of the model. (BERT is not that "large" compared to the other two, but it is still considered a significant advancement in natural language processing.)
These models convert each word into a dense vector and feed it into the network. The vectors are then used to predict the next word in a sentence (in the case of GPT-4) or to understand the context of a word (in the case of BERT).
These models are trained on a large corpus of text, so they learn the semantic meaning of words. For example, the word “king” is closer in this space to “queen” than it is to “apple”.
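That closeness is usually measured with cosine similarity. The vectors below are made up purely for illustration (real embeddings are learned and have hundreds or thousands of dimensions), but they show how the comparison works:

```python
import math

# Illustrative 4-dimensional vectors, hand-written for this example.
# Real embeddings are produced by a trained model, not written by hand.
king  = [0.9, 0.8, 0.1, 0.3]
queen = [0.8, 0.9, 0.2, 0.3]
apple = [0.1, 0.2, 0.9, 0.7]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(king, queen))  # high
print(cosine_similarity(king, apple))  # noticeably lower
```

With these toy values, "king" and "queen" point in nearly the same direction, while "king" and "apple" do not, mirroring the semantic distances a trained model learns.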
They are representations of text in an N-dimensional space where words with similar meanings have similar representations.
The number of values in a text embedding — known as its “dimension” — depends on the embedding technique (the process of producing the vector), as well as how much information you want it to convey.
The embedding below shows a vector with 8 dimensions.
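For illustration, an 8-dimensional embedding might look like the following. The values here are invented; real ones come from a trained model:

```python
# A purely illustrative 8-dimensional embedding vector;
# the actual values would be produced by an embedding model.
embedding = [-0.021, 0.134, 0.089, -0.270, 0.051, 0.192, -0.044, 0.108]
print(len(embedding))  # the embedding's dimension: 8
```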
The table below shows common models along with the dimensions of their embeddings:
| Model | Embedding Dimension | Max Input Tokens |
|---|---|---|
| BERT-Base | 768 | 512 |
| BERT-Large | 1024 | 512 |
| GPT-2 | 768 | 1024 |
| GPT-3 | 768 | 2048 |
| RoBERTa-Base | 768 | 512 |
| RoBERTa-Large | 1024 | 512 |
| DistilBERT | 768 | 512 |
| OpenAI text-embedding-3-small | 1536 | 8191 |
| OpenAI text-embedding-3-large | 3072 | 8191 |
```python
in_1 = "Flamingo spotted at the bird park"
in_2 = "Sea otter seen playing at the marine park"
in_3 = "Baby panda born at the city zoo"
in_4 = "Python developers prefer snake_case for variable naming"
in_5 = "New JavaScript framework aims to simplify coding"
in_6 = "C++ developers appreciate the power of OOP"
in_7 = "Java is a popular choice for enterprise applications"

list_of_input_texts = [in_1, in_2, in_3, in_4, in_5, in_6, in_7]
```

These inputs can then be embedded with a model such as OpenAI's `text-embedding-3-small`.
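As a rough sketch of what comparing embeddings looks like, the snippet below substitutes a toy bag-of-words vector for a real learned embedding (an actual pipeline would call an embedding model such as `text-embedding-3-small`, which returns a dense 1536-dimensional vector per text). Even this crude stand-in shows the mechanics: texts that share vocabulary end up with higher cosine similarity.

```python
import math
from collections import Counter

# The same headlines as defined above.
list_of_input_texts = [
    "Flamingo spotted at the bird park",
    "Sea otter seen playing at the marine park",
    "Baby panda born at the city zoo",
    "Python developers prefer snake_case for variable naming",
    "New JavaScript framework aims to simplify coding",
    "C++ developers appreciate the power of OOP",
    "Java is a popular choice for enterprise applications",
]

def bow_vector(text):
    # Toy "embedding": a sparse bag-of-words count vector.
    # A real embedding model would return a dense learned vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

vecs = [bow_vector(t) for t in list_of_input_texts]

# The two "park" headlines overlap more with each other
# than an animal headline does with a programming headline.
print(cosine(vecs[0], vecs[1]))  # in_1 vs in_2: non-zero overlap
print(cosine(vecs[0], vecs[3]))  # in_1 vs in_4: no shared words
```

A learned embedding would also place `in_3` (no shared words with `in_1` beyond function words) near the other animal headlines, something a bag-of-words vector cannot do; that semantic grouping is exactly what embedding models are trained for.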
✦ The straightforward reason is that they reduce data dimensionality and address the primary issue: the need for speed.
✦ The initial phase of any Large Language Model (LLM) training is the most crucial: the neural network is constructed from a vast amount of data with an extensive number of features (let’s refer to them as details).
Embedding models have been used for a long time, primarily for training other LLMs or ML models.
The introduction of Retrieval Augmented Generation (RAG) and, subsequently, of vector store databases has shed new light on these models.
They do, however, share a common issue: as research progressed, new state-of-the-art (text) embedding models began producing embeddings with increasingly higher output dimensions, meaning each input text is represented using more values. While this improves performance, it comes at the cost of efficiency and speed. Researchers were therefore motivated to create embedding models whose embeddings could be reasonably reduced in size without significantly sacrificing performance.
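One such reduction can be sketched as simply truncating the embedding to its leading dimensions and re-normalizing it to unit length. This only preserves meaning for models explicitly trained to front-load information into the earlier dimensions (OpenAI's `text-embedding-3` models, for instance, expose a `dimensions` parameter for this purpose); the helper below is an illustrative sketch, not any particular library's API.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and re-normalize to unit length.

    Illustrative helper: this kind of truncation is only safe for
    embedding models trained to support shortened embeddings.
    """
    short = vec[:dim]
    norm = math.sqrt(sum(v * v for v in short))
    return [v / norm for v in short]

full = [0.5, -0.1, 0.3, 0.2, -0.4, 0.1, 0.0, 0.2]  # made-up 8-dim embedding
small = truncate_embedding(full, 4)
print(len(small))                              # 4
print(round(sum(v * v for v in small), 6))     # 1.0 (unit length)
```

Halving the dimension this way halves storage and roughly halves similarity-computation cost, which is exactly the efficiency/performance trade-off described above.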