Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Semantic Search

Semantic search finds code by meaning rather than by name or text pattern. It answers queries like “authentication middleware”, “retry with exponential backoff”, or “parse JSON from HTTP response” — without knowing what the relevant functions are called.

It complements symbol tools: use symbol tools when you know the name, semantic search when you know the concept.

How It Works

Three steps happen when you call semantic_search:

1. Chunking — The first time index(action: build) runs, every source file is split into chunks whose size is derived from the configured model’s context window (roughly max_tokens × 3 chars/token at 85 % utilisation). Splits follow language structure: each top-level function, method, or class becomes its own chunk. When a container (an impl block, a class) exceeds the budget, it is recursively split into one chunk per inner method, plus a header chunk for the container signature. The plain-text fallback path handles languages without tree-sitter support. Each chunk records its 1-indexed start and end line so results link back to exact source locations.

2. Embedding — Each chunk is converted to a vector (a list of floating-point numbers) by the configured embedding model. Semantically similar text produces vectors that point in similar directions in high-dimensional space. The model is selected by onboarding based on your hardware (Ollama availability, GPU, RAM); ollama:nomic-embed-text is the default when Ollama is running, and local:JinaEmbeddingsV2BaseCode is used on CPU-only machines. See Embedding Backends to change it manually. The vectors are stored in .codescout/embeddings.db.

3. Search — Your query is embedded with the same model and compared to every stored chunk using cosine similarity. The closest chunks are returned, ranked by score.

The index is incremental. On subsequent index(action: build) calls, only files that changed since the last run are re-embedded — detected via git diff, then file mtime, then SHA-256 as a fallback chain.

Similarity Scores

Results include a score between 0 and 1:

ScoreMeaning
> 0.85Almost certainly what you’re looking for
0.70 – 0.85Likely relevant — worth inspecting
0.50 – 0.70Tangentially related
< 0.50Probably noise

Code embeddings score lower than prose embeddings for the same conceptual similarity — a score of 0.75 in a code search is strong. Do not compare scores across different embedding models; they are not on the same scale.

You know…Use
The exact namesymbols(pattern)
The file it’s insymbols(path)
A text fragmentgrep(regex)
The concept, not the namesemantic_search(query)
The concept, inside a librarysemantic_search(query, scope: "lib:<name>")

Semantic search is slowest of these options (it embeds your query at call time and scans all stored vectors). Prefer symbol tools when you know the name.

Index Lifecycle

Build the index once before first use:

{ "tool": "index(action: build)", "arguments": {} }

Check its health:

{ "tool": "workspace(action: status)", "arguments": {} }

The index is stored in .codescout/embeddings.db and excluded from version control by default. Each team member builds their own local copy.

Drift detection: workspace(action: status) can report per-file drift scores — a measure of how much file content has changed since it was last indexed. Pass threshold to surface files with high drift:

{ "tool": "workspace(action: status)", "arguments": { "threshold": 0.3 } }

Switching embedding models invalidates the entire index — all chunks must be re-embedded. See Embedding Backends for model selection guidance.

Further Reading