
Embeddings

codescout uses embeddings for semantic search — finding code by meaning rather than exact text matches. This guide covers how to configure the embedding backend.

⚠ This page describes the pre-v0.12 single-service embedding model and is being phased out. As of v0.12 the default substrate is the Retrieval Stack (Qdrant + dense embedder + sparse SPLADE + cross-encoder reranker, configured via CODESCOUT_* environment variables, not [embeddings] in project.toml). The [embeddings] config block still loads but only the model = "local:..." path is honoured — and only when the binary was built with the local-embed Cargo feature.

If you are setting up a fresh install: read Retrieval Stack instead. It covers the docker-compose stack, Ollama / llama.cpp / OpenAI integration, and the benchmark we used to pick defaults.

If you are upgrading from <v0.12: the model / url / api_key fields in project.toml no longer drive search. Run codescout migrate-memories to move legacy memory data into Qdrant, then bring up the stack.
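
A typical upgrade sequence, assuming the docker-compose file from the Retrieval Stack guide is already in place, might look like:

# move legacy memory data into Qdrant
codescout migrate-memories

# bring up the Retrieval Stack
docker compose up -d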

The remainder of this page is kept as a reference for the legacy code path; treat it as historical.

Quick Start

codescout works out of the box with a bundled embedding model. No setup needed.

On the first index(action: build), codescout downloads all-MiniLM-L6-v2 (~22 MB, quantized) to ~/.cache/huggingface/hub/ and runs it locally via ONNX. This is a one-time download.

# .codescout/project.toml (default — no changes needed)
[embeddings]
model = "local:AllMiniLML6V2Q"

This is fine for single-project use or getting started. For better performance with multiple projects, consider a dedicated embedding server.

The bundled model loads into memory per codescout instance. With multiple projects open, this duplicates memory (~22 MB each for the default model). A dedicated embedding server avoids this:

  • One process serves all codescout instances
  • No memory duplication — the model loads once
  • Faster queries — the model stays warm
  • Model freedom — use any model and quantization

Configuration

Point codescout at your server with two fields:

[embeddings]
model = "nomic-embed-text-v1.5"          # model name (sent in API request)
url = "http://127.0.0.1:43300/v1"        # your server's base URL
# api_key = "optional-key"               # or set EMBED_API_KEY env var

The url field works with any server implementing the OpenAI /v1/embeddings API. codescout normalizes the URL automatically — all of these are equivalent:

  • http://127.0.0.1:43300
  • http://127.0.0.1:43300/v1
  • http://127.0.0.1:43300/v1/embeddings
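
Whichever form you write, requests go to the /v1/embeddings path. A quick way to confirm your server answers there (the model name and payload are just examples):

curl http://127.0.0.1:43300/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text-v1.5","input":["fn main() {}"]}'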

Setup Examples

llama.cpp

Download a GGUF model and start the server:

# Download (example: nomic-embed-text quantized)
wget https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf

# Start server
llama-server -m nomic-embed-text-v1.5.Q8_0.gguf --embeddings --port 43300
[embeddings]
model = "nomic-embed-text-v1.5"
url = "http://127.0.0.1:43300/v1"

Ollama

ollama pull nomic-embed-text
ollama serve  # if not already running
[embeddings]
model = "nomic-embed-text"
url = "http://127.0.0.1:11434/v1"

vLLM

vllm serve nomic-ai/nomic-embed-text-v1.5 --task embed --port 43300
[embeddings]
model = "nomic-embed-text-v1.5"
url = "http://127.0.0.1:43300/v1"

TEI (HuggingFace Text Embeddings Inference)

docker run -p 43300:80 ghcr.io/huggingface/text-embeddings-inference \
  --model-id nomic-ai/nomic-embed-text-v1.5
[embeddings]
model = "nomic-embed-text-v1.5"
url = "http://127.0.0.1:43300/v1"

OpenAI

[embeddings]
model = "text-embedding-3-small"
url = "https://api.openai.com/v1"
api_key = "sk-..."  # or set EMBED_API_KEY env var

Configuration Reference

[embeddings] fields

| Field | Type | Default | Description |
|---|---|---|---|
| model | string | "local:AllMiniLML6V2Q" | Model name. With url: sent in API body. Without url: prefix determines backend. |
| url | string | (none) | Base URL for any OpenAI-compatible /v1/embeddings endpoint. |
| api_key | string | (none) | API key sent as Bearer token. Also available via EMBED_API_KEY env var. |
| drift_detection_enabled | bool | true | Track how much code meaning changes between index builds. |
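
Putting the fields together, a fully specified block might look like this (values are illustrative):

[embeddings]
model = "nomic-embed-text-v1.5"
url = "http://127.0.0.1:43300/v1"
api_key = "example-key"           # or set EMBED_API_KEY instead
drift_detection_enabled = true    # default; set to false to disable drift tracking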

Resolution Order

When codescout needs to embed text, it resolves the backend in this order:

  1. url is set → use it as an OpenAI-compatible endpoint
  2. model starts with local: → bundled ONNX model via fastembed
  3. model starts with ollama: → Ollama API (deprecated — use url instead)
  4. model starts with openai: → OpenAI API with OPENAI_API_KEY
  5. No url, no prefix → try as a local model name, then error with suggestions
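
For example, these two minimal configs hit different rules: the first resolves via rule 1, the second via rule 2.

# rule 1: url is set, so it is used as an OpenAI-compatible endpoint
# and the model name is simply sent in the API request
[embeddings]
model = "nomic-embed-text"
url = "http://127.0.0.1:11434/v1"

# rule 2: no url, and the local: prefix selects the bundled ONNX model via fastembed
[embeddings]
model = "local:AllMiniLML6V2Q"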

Environment Variables

| Variable | Description |
|---|---|
| EMBED_API_KEY | API key for the embedding endpoint (alternative to config field) |
| OPENAI_API_KEY | OpenAI API key (used with openai: prefix) |
| OLLAMA_HOST | Ollama daemon URL (deprecated — use url field) |
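
For example, to keep the key out of project.toml, set it in the environment instead of the api_key field (the key value is a placeholder):

export EMBED_API_KEY="example-key"

[embeddings]
model = "text-embedding-3-small"
url = "https://api.openai.com/v1"
# no api_key here; it is read from EMBED_API_KEY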

Model Recommendations

Minimum recommended: 768 dimensions for good code search quality.

| Model | Dims | Download | Context | Best For |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | ~158 MB (Q) / ~547 MB | 8192 | General purpose, good quality |
| jina-embeddings-v2-base-en | 768 | ~300 MB | 8192 | Code-specialized |
| bge-m3 | 1024 | ~1.2 GB | 8192 | Best quality, needs external server |
| CodeSage-small-v2 | 1024 | ~500 MB | | Purpose-built for code retrieval |
| text-embedding-3-small | 1536 | API only | 8191 | OpenAI hosted, no self-hosting |

Bundled Local Models

These work with the local: prefix (no server needed):

| Model ID | Dims | Size | Context | Notes |
|---|---|---|---|---|
| NomicEmbedTextV15Q | 768 | ~158 MB | 8192 | General purpose, good quality |
| NomicEmbedTextV15 | 768 | ~547 MB | 8192 | Full precision variant |
| JinaEmbeddingsV2BaseCode | 768 | ~300 MB | 8192 | Code-specialized |
| AllMiniLML6V2Q | 384 | ~22 MB | 256 | Default — bundled, zero-config |
| AllMiniLML6V2 | 384 | ~90 MB | 256 | Full precision, lightweight |
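
To pick a different bundled model, change only the model value; for example, the code-specialized variant (this still requires a binary built with the local-embed feature, per the notice at the top of this page):

[embeddings]
model = "local:JinaEmbeddingsV2BaseCode"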

How It Works

  1. AST-aware chunking — tree-sitter extracts top-level definitions (functions, classes, structs). Each chunk is a complete semantic unit, not an arbitrary text window.

  2. Chunk size auto-derived — codescout calculates chunk size from the model’s context window. No manual tuning needed.

  3. Vector storage — embeddings are upserted into Qdrant’s code_chunks collection over gRPC (default localhost:6334). Both a dense and a sparse vector are stored per chunk; query-time hybrid search fuses them via RRF inside Qdrant. See Hybrid Dense + Sparse Retrieval for the topology.

  4. Bundled model lifecycle — when using the local: prefix (compile-time local-embed feature), the ONNX model is loaded lazily on first semantic_search or index(action="build"), cached for 5 minutes, then unloaded to free memory. The default substrate is the HTTP dense embedder service, not the bundled ONNX path.
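
One way to sanity-check step 3 is to ask Qdrant about the code_chunks collection over its HTTP API, assuming the stack also exposes Qdrant's default HTTP port 6333 alongside gRPC on 6334:

curl http://localhost:6333/collections/code_chunks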

Choosing a Model

Not sure which model to use? See the Embedding Model Comparison for benchmark results across three models, real-world usage data, and recommendations.

TL;DR: The default (local:AllMiniLML6V2Q) is within 2 points of the best model on a 60-point benchmark, indexes 21x faster, and requires zero setup. Keep it unless you have a specific reason to change.

Troubleshooting

Model mismatch after changing config

If you change the model or url after indexing, the stored vectors are incompatible. Rebuild the index:

index(action: build, force: true)

Endpoint unreachable

Check that the server is running and the URL is correct:

curl http://127.0.0.1:43300/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":["test"]}'

Corporate proxy blocking downloads

The bundled model downloads from HuggingFace. If your proxy blocks this:

  1. Download the model on an unrestricted machine
  2. Copy to ~/.cache/huggingface/hub/models--nomic-ai--nomic-embed-text-v1.5/
  3. Or use an external server instead (set url)
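
A possible way to do steps 1 and 2, using huggingface-cli on the unrestricted machine and scp to move the cached files (hostnames and paths are illustrative):

# on the unrestricted machine: download into the local HuggingFace cache
huggingface-cli download nomic-ai/nomic-embed-text-v1.5

# copy the cached model to the restricted machine
scp -r ~/.cache/huggingface/hub/models--nomic-ai--nomic-embed-text-v1.5 \
    user@restricted-host:~/.cache/huggingface/hub/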

Migration from Prefix Syntax

The ollama: prefix is deprecated and will be removed in a future version. Migrate to the url field:

# Before (deprecated)
[embeddings]
model = "ollama:nomic-embed-text"
# After
[embeddings]
model = "nomic-embed-text"
url = "http://localhost:11434/v1"

The custom: prefix has been removed. Migrate to the url field:

# Before (removed)
[embeddings]
model = "custom:my-model@http://my-server:8080"
# After
[embeddings]
model = "my-model"
url = "http://my-server:8080/v1"