Text embedding

Creator: Seonglae Cho
Created: 2022 Apr 3 14:39
Edited: 2025 Sep 1 22:54

Embeddings are arrays of floating point numbers that represent the semantic meaning of a piece of content

Text encoding is a broader concept that does not require preserving the original semantics, such as
One Hot encoding
The advantage of vectorization is that it enables operations such as addition, subtraction, and multiplication.
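As a minimal sketch of what that means in practice (the toy NumPy vectors and the word-analogy below are illustrative, not from a real model):

```python
import numpy as np

# Toy 3-dimensional embeddings (illustrative values, not from a real model)
king  = np.array([0.8, 0.6, 0.1])
man   = np.array([0.7, 0.1, 0.2])
woman = np.array([0.6, 0.2, 0.8])

# Arithmetic directly on embeddings: the classic word-analogy style operation
query = king - man + woman

def cosine(a, b):
    # Similarity between two embeddings is just a normalized dot product
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(query)                 # a new vector living in the same space
print(cosine(query, woman))  # how close the result is to another embedding
```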
We can distinguish text embedding models by the parameters below:
  • Multi-lingual support
  • Context window size - max passage size
  • Model size - computing resource
  • Embedding vector length - storage resource
Embedding-based search is much more powerful than traditional keyword search, making it promising for high-quality content discovery and LLM augmentation.
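A minimal semantic-search sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (any embedding model would work the same way; the documents and query are made up):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any embedding model works; all-MiniLM-L6-v2 is just a small, common default
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset a forgotten password",
    "Quarterly revenue grew by 12 percent",
    "Tips for improving deep sleep quality",
]
query = "I can't log into my account"

# Normalized embeddings make cosine similarity a plain dot product
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(documents[best], scores[best])  # matches by meaning, not by shared keywords
```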
Text embedding Notion
 
 
Text Embedding Models
 
 
Sentence embedding Methods
 
 
 

Leaderboard

OpenAI, AWS, and other commercial embeddings have much larger context windows, such as 8192 tokens.
Performance varies significantly between tasks, so decisions shouldn't be made solely based on overall leaderboard performance.

What is an embedding

google-deepmind/limit

Single-vector embeddings have structural constraints in handling all top-k combinations for arbitrary instructional and inferential queries. A hybrid approach combining multi-vector models, sparse retrieval, and re-rankers is necessary. Intuitively, this is because compressing "all relationships" into a single point (vector) geometrically requires dividing the vector space into numerous regions. However, the number of regions achievable in a given dimension is bounded (via the sign-rank), while the number of regions needed to represent all top-k combinations grows combinatorially with the number of documents. Therefore, when the number of top-k combinations exceeds the "space partitioning capacity" allowed by the dimension, some combinations cannot be correctly represented.
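As a back-of-the-envelope count (illustrative, not the paper's exact bound), the number of distinct top-k document sets a retriever may need to realize over n documents is

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

For example, n = 1000 documents with k = 2 already gives 499,500 combinations, each of which must correspond to some query vector ranking exactly those two documents above all others under the same fixed document embeddings.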
This limitation is theoretically connected to the sign-rank of the qrel matrix A: for any given embedding dimension d, there exist qrel combinations that cannot be represented. This was empirically confirmed through free-embedding experiments, which directly optimize the query/document vectors on the test set itself, so any failure reflects the capacity of the dimension rather than the training data.
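A minimal sketch of such a free-embedding check, assuming PyTorch; the toy qrel matrix and hyperparameters are illustrative, not the paper's setup:

```python
import torch

def free_embedding_fit(qrels: torch.Tensor, d: int, steps: int = 2000, lr: float = 0.05):
    """Directly optimize query/document vectors to reproduce a binary qrel matrix."""
    n_q, n_doc = qrels.shape
    q = torch.randn(n_q, d, requires_grad=True)
    p = torch.randn(n_doc, d, requires_grad=True)
    opt = torch.optim.Adam([q, p], lr=lr)
    for _ in range(steps):
        scores = q @ p.T  # dot-product retrieval scores
        loss = torch.nn.functional.binary_cross_entropy_with_logits(scores, qrels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q.detach(), p.detach()

# Toy qrel matrix: 6 queries covering every pair (k=2) of 4 documents
qrels = torch.zeros(6, 4)
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
for i, (a, b) in enumerate(pairs):
    qrels[i, a] = qrels[i, b] = 1.0

q, p = free_embedding_fit(qrels, d=2)
top2 = (q @ p.T).topk(2, dim=1).indices  # does each query's top-2 match its qrels?
print(top2)
```

If a combination cannot be recovered even in this best-case setup, no encoder producing vectors of that dimension could represent it either.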
The proposed LIMIT dataset implements this theory in natural language, designed to create maximally dense combinations (increased graph density). State-of-the-art single-vector embeddings (GritLM, Qwen3, Gemini Embeddings, etc.) fail significantly with Recall@100 < 20%.
Hence sparse models, with their very high (effectively unbounded) dimensionality, perform almost perfectly on LIMIT. Multi-vector approaches do better than single-vector ones but are not a complete solution.
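For contrast, a sparse (lexical) retrieval sketch, assuming scikit-learn's TfidfVectorizer; its effective dimensionality equals the vocabulary size, which is why such models sidestep the low-dimensional bottleneck (documents and query are toy examples, not from the LIMIT dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Jon likes quokkas and apples",
    "Maria likes quokkas and tennis",
    "Ahmed likes apples and tennis",
]
query = "Who likes quokkas and apples?"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)       # shape: (n_docs, vocab_size), sparse
query_vec = vectorizer.transform([query])

# TfidfVectorizer l2-normalizes rows by default, so the dot product is cosine similarity
scores = (doc_matrix @ query_vec.T).toarray().ravel()
ranked = scores.argsort()[::-1]
print([documents[i] for i in ranked])
```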
 
 
 

Recommendations