Code and natural language are not semantically similar: semantic search over codebases works better if the code is first translated into natural language before generating embedding vectors.
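A minimal sketch of why this helps, using a toy bag-of-words embedding and a hand-written stand-in for the LLM-generated summary (the `embed` function and the `summary` string are illustrative assumptions, not any particular model):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

code = "def qsort(xs): return xs if len(xs) < 2 else ..."
# Stand-in for an LLM-generated natural-language translation of the code.
summary = "sort a list of numbers with recursive quicksort"

query = "how do I sort a list"
raw_sim = cosine(embed(query), embed(code))    # query vs. raw code: no shared tokens
nl_sim = cosine(embed(query), embed(summary))  # query vs. NL translation
```

The natural-language query shares no surface tokens with the code (`qsort`, `xs`) but overlaps heavily with its natural-language translation, so `nl_sim > raw_sim`.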
Code Embedding Models
Track changes with a Merkle tree only. Find similar indexes using SimHash, generating a similarity hash that represents the entire codebase. Large repositories take hours to index for semantic search → degrading agent performance. Codebases within an organization are highly similar to each other (92% on average) → reuse existing indexes. Vector search against other indexes within the team → if sufficiently similar, copy and reuse the existing index.
Securely indexing large codebases · Cursor
By securely reusing a teammate's existing index, we cut time-to-first-query from hours to seconds on the largest repos.
https://cursor.com/blog/secure-codebase-indexing
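The SimHash step can be sketched as follows — a generic SimHash over token multisets, not Cursor's actual implementation, with made-up "repos" as token lists:

```python
import hashlib

def simhash(tokens, bits: int = 64) -> int:
    # Classic SimHash: near-identical token multisets produce
    # fingerprints with a small Hamming distance.
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.sha256(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Toy "codebases" as token lists (e.g. per-file content hashes).
repo_a = [f"file{i}" for i in range(100)]
repo_b = repo_a[:92] + [f"changed{i}" for i in range(8)]  # ~92% shared
repo_c = [f"unrelated{i}" for i in range(100)]

d_similar = hamming(simhash(repo_a), simhash(repo_b))
d_unrelated = hamming(simhash(repo_a), simhash(repo_c))
# d_similar is far smaller → repo_b could reuse repo_a's existing index.
```

Because the fingerprint is a single integer per codebase, comparing against every teammate's index is cheap; only a sufficiently small Hamming distance triggers the copy-and-reuse path.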

Code search · grep.app
Code Search | Grep by Vercel
Search for code, files, and paths across half a million public GitHub repositories.
https://grep.app/

Codebases are uniquely hard to search semantically
Cosine similarity in code vs. text.
https://www.greptile.com/blog/semantic

Generating similarities for code generation
Got it, so maybe a 'hybrid' approach? I.e. encode the code snippets as class/function/interface name + parameter_names + docstrings as a 'syntactic' embedding, then use code2seq or the like to generate embeddings from their AST paths (capturing the 'semantic' meaning as well). Then, whatever the user prompts, I can generate an embedding from the prompt (whether a textual description or code) and check for good similarity against relevant code snippets. Does this make...
https://community.openai.com/t/generating-similarities-for-code-generation/276894/9
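A rough sketch of that hybrid idea using Python's stdlib `ast` module: the 'syntactic' part is name + parameter names + docstring words, and parent→child node-type pairs stand in for code2seq-style AST paths (a crude substitute for code2seq, not the real model):

```python
import ast
from collections import Counter

def syntactic_tokens(source: str) -> Counter:
    # "Syntactic" features: function name, parameter names, docstring words.
    toks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            toks.append(node.name)
            toks += [a.arg for a in node.args.args]
            doc = ast.get_docstring(node)
            if doc:
                toks += doc.lower().split()
    return Counter(toks)

def ast_path_tokens(source: str) -> Counter:
    # Crude stand-in for code2seq features: parent→child node-type pairs.
    toks = []
    for node in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(node):
            toks.append(f"{type(node).__name__}>{type(child).__name__}")
    return Counter(toks)

def hybrid_embedding(source: str) -> Counter:
    # Concatenate both feature spaces; prefixes keep the keys disjoint.
    emb = Counter({f"syn:{k}": v for k, v in syntactic_tokens(source).items()})
    emb.update({f"ast:{k}": v for k, v in ast_path_tokens(source).items()})
    return emb

snippet = '''
def binary_search(items, target):
    """Find target index in sorted items."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        lo, hi = (mid + 1, hi) if items[mid] < target else (lo, mid - 1)
    return -1
'''
emb = hybrid_embedding(snippet)
```

A textual query then matches the `syn:` features, while structurally similar code matches the `ast:` features, so both kinds of prompt can retrieve the snippet.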


Seonglae Cho