Code Retrieval

Creator: Seonglae Cho
Created: 2024 Mar 22 9:23
Edited: 2026 Feb 18 16:25
Code and natural language are not semantically similar, so embeddings of raw code match natural-language queries poorly. Semantic search over a codebase is easier if the code is first translated to natural language before generating embedding vectors.
Code Embedding Models
 
 
 
Track changes only, using a Merkle Tree. Find similar indexes using SimHash, which generates a similarity hash representing the entire codebase. Generating a semantic search index for a large repository takes hours, which degrades agent performance. Codebases within an organization are very similar to each other (92% on average), so existing indexes can be reused: run a vector search against other indexes within the team and, if one is sufficiently similar, copy and use that existing index.
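A minimal sketch of the two ideas above, not Cursor's actual implementation: a Merkle root over file hashes whose value changes when any file changes, and a SimHash fingerprint whose Hamming distance approximates how much two codebases overlap. The function names, the choice of SHA-256, the 64-bit fingerprint, and the use of whole-file hashes as SimHash features are all assumptions made for illustration.

```python
import hashlib


def merkle_root(files: dict[str, bytes]) -> str:
    """Root hash over path-sorted leaves; it changes whenever any file changes."""
    level = [
        hashlib.sha256(path.encode() + b"\0" + hashlib.sha256(data).digest()).hexdigest()
        for path, data in sorted(files.items())
    ]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def simhash(files: dict[str, bytes], bits: int = 64) -> int:
    """SimHash fingerprint using each file's content hash as one feature."""
    counts = [0] * bits
    for data in files.values():
        h = int.from_bytes(hashlib.sha256(data).digest()[: bits // 8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set when more features voted 1 than 0 at that position.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of fingerprint bits that agree (1.0 means identical fingerprints)."""
    return 1 - bin(a ^ b).count("1") / bits


# Hypothetical repositories: repo_b differs from repo_a in a single file.
repo_a = {f"src/mod{i}.py": f"def f{i}(): return {i}\n".encode() for i in range(50)}
repo_b = dict(repo_a)
repo_b["src/mod0.py"] = b"def f0(): return 'changed'\n"

print(merkle_root(repo_a) == merkle_root(repo_b))  # roots differ after one edit
print(similarity(simhash(repo_a), simhash(repo_b)))
```

In this sketch the Merkle root answers "did anything change?" cheaply, while the SimHash comparison answers "is this codebase close enough to one that is already indexed?". A real system would compare subtree hashes to locate the changed files rather than only the root.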
Securely indexing large codebases · Cursor
By securely reusing a teammate's existing index, we cut time-to-first-query from hours to seconds on the largest repos.

Code search grep.app

Code Search | Grep by Vercel
Search for code, files, and paths across half a million public GitHub repositories.
Codebases are uniquely hard to search semantically
Cosine similarity in code vs. text.
Generating similarities for code generation
Got it, so maybe a ‘hybrid’ approach? I.e. encode the code snippets as class/function/interface name + parameter_names + docstrings as a ‘syntactic’ embedding, and then use a code2seq or the like to generate embeddings based on their AST paths (and get the ‘semantic’ meaning as well). Then whatever the user prompts, I can generate an embedding based off of his prompt (whether a textual description or code) and see if I get some good similarity results for relevant coding snippets. Does this make...
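The "syntactic" half of the hybrid approach quoted above can be sketched with Python's standard `ast` module: each function or class is reduced to its name, parameter names, and docstring, producing a text string that could then be passed to an embedding model. The helper name `describe_definitions` and the output format are hypothetical, and the AST-path ("semantic") half and the embedding call itself are not shown.

```python
import ast


def describe_definitions(source: str) -> list[str]:
    """Return one text description per function or class found in `source`."""
    descriptions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(arg.arg for arg in node.args.args)
            doc = ast.get_docstring(node) or "no docstring"
            descriptions.append(f"function {node.name}({params}): {doc}")
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node) or "no docstring"
            descriptions.append(f"class {node.name}: {doc}")
    return descriptions


# Hypothetical snippet to index.
snippet = '''
def add(a, b):
    """Add two numbers."""
    return a + b
'''

for line in describe_definitions(snippet):
    print(line)
```

Embedding these descriptions instead of the raw source is one way to bring code closer to the natural-language queries a user is likely to type.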
 
 

 
