Code and natural language are not semantically similar: semantic search over codebases works better if the code is first translated into natural language before generating embedding vectors.
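A minimal sketch of why this helps, using a toy bag-of-words embedding and a hand-written stand-in for the LLM-generated summary (the `embed` function and the `summary` string are illustrative assumptions, not any particular model):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

code = "def qsort(xs): return xs if len(xs) < 2 else ..."
# Stand-in for an LLM-generated natural-language translation of the code.
summary = "sort a list of numbers with recursive quicksort"

query = "how do I sort a list"
raw_sim = cosine(embed(query), embed(code))    # query vs. raw code: no shared tokens
nl_sim = cosine(embed(query), embed(summary))  # query vs. NL translation
```

The natural-language query shares no surface tokens with the code (`qsort`, `xs`) but overlaps heavily with its natural-language translation, so `nl_sim > raw_sim`.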
Code Embedding Models
Track changes with a Merkle tree only. Find similar indexes using SimHash, generating a similarity hash that represents the entire codebase. Large repositories take hours to index for semantic search → degrading agent performance. Codebases within an organization are highly similar to each other (92% on average) → reuse existing indexes. Vector search against other indexes within the team → if sufficiently similar, copy and reuse the existing index.
Securely indexing large codebases · Cursor
By securely reusing a teammate's existing index, we cut time-to-first-query from hours to seconds on the largest repos.
https://cursor.com/blog/secure-codebase-indexing
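The SimHash step can be sketched as follows — a generic SimHash over token multisets, not Cursor's actual implementation, with made-up "repos" as token lists:

```python
import hashlib

def simhash(tokens, bits: int = 64) -> int:
    # Classic SimHash: near-identical token multisets produce
    # fingerprints with a small Hamming distance.
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.sha256(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Toy "codebases" as token lists (e.g. per-file content hashes).
repo_a = [f"file{i}" for i in range(100)]
repo_b = repo_a[:92] + [f"changed{i}" for i in range(8)]  # ~92% shared
repo_c = [f"unrelated{i}" for i in range(100)]

d_similar = hamming(simhash(repo_a), simhash(repo_b))
d_unrelated = hamming(simhash(repo_a), simhash(repo_c))
# d_similar is far smaller → repo_b could reuse repo_a's existing index.
```

Because the fingerprint is a single integer per codebase, comparing against every teammate's index is cheap; only a sufficiently small Hamming distance triggers the copy-and-reuse path.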

Code search · grep.app
Code Search | Grep by Vercel
Search for code, files, and paths across half a million public GitHub repositories.
https://grep.app/

Codebases are uniquely hard to search semantically
Cosine similarity in code vs. text.
https://www.greptile.com/blog/semantic

Generating similarities for code generation
Got it, so maybe a 'hybrid' approach? I.e. encode the code snippets as class/function/interface name + parameter_names + docstrings as a 'syntactic' embedding, then use code2seq or the like to generate embeddings from their AST paths (capturing the 'semantic' meaning as well). Then, whatever the user prompts, I can generate an embedding from the prompt (whether a textual description or code) and check for good similarity against relevant code snippets. Does this make...
https://community.openai.com/t/generating-similarities-for-code-generation/276894/9
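A rough sketch of that hybrid idea using Python's stdlib `ast` module: the 'syntactic' part is name + parameter names + docstring words, and parent→child node-type pairs stand in for code2seq-style AST paths (a crude substitute for code2seq, not the real model):

```python
import ast
from collections import Counter

def syntactic_tokens(source: str) -> Counter:
    # "Syntactic" features: function name, parameter names, docstring words.
    toks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            toks.append(node.name)
            toks += [a.arg for a in node.args.args]
            doc = ast.get_docstring(node)
            if doc:
                toks += doc.lower().split()
    return Counter(toks)

def ast_path_tokens(source: str) -> Counter:
    # Crude stand-in for code2seq features: parent→child node-type pairs.
    toks = []
    for node in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(node):
            toks.append(f"{type(node).__name__}>{type(child).__name__}")
    return Counter(toks)

def hybrid_embedding(source: str) -> Counter:
    # Concatenate both feature spaces; prefixes keep the keys disjoint.
    emb = Counter({f"syn:{k}": v for k, v in syntactic_tokens(source).items()})
    emb.update({f"ast:{k}": v for k, v in ast_path_tokens(source).items()})
    return emb

snippet = '''
def binary_search(items, target):
    """Find target index in sorted items."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        lo, hi = (mid + 1, hi) if items[mid] < target else (lo, mid - 1)
    return -1
'''
emb = hybrid_embedding(snippet)
```

A textual query then matches the `syn:` features, while structurally similar code matches the `ast:` features, so both kinds of prompt can retrieve the snippet.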


Seonglae Cho