The Retrieval Stack
As of v0.12, codescout’s default retrieval substrate is a network-attached stack (Qdrant + three embedding services), not the in-process local-embed path. The
local-embed Cargo feature still exists for air-gapped use, but it is no longer the default and no longer the path the team benchmarks against. If you upgrade from <0.12 and want to keep working, you must either bring up the stack or rebuild with --features local-embed and accept the older sqlite-vec code path. See Migration from local-embed below.
What runs where
| Service | Default port | Image / binary | Role |
|---|---|---|---|
| Qdrant | 6334 (gRPC), 6333 (HTTP) | qdrant/qdrant:v1.17.0 | Vector storage. Two collections: code_chunks, memories. |
| Dense embedder | 48081 (HTTP) | llama.cpp:server running CodeRankEmbed-Q4_K_M.gguf (default) | Text → 768-dim dense vector. Speaks TEI protocol; switchable to OpenAI protocol for Ollama / OpenAI / Anthropic-compatible endpoints. |
| Sparse SPLADE | 48084 (HTTP) | text-embeddings-inference running prithivida/Splade_PP_en_v1 | Text → sparse vector for lexical complement. |
| Reranker | 48083 (HTTP) | text-embeddings-inference running BAAI/bge-reranker-base (CPU) or bge-reranker-v2-m3 (GPU) | Cross-encoder pairwise re-rank of fused candidates. |
codescout connects to these services on 127.0.0.1. There is no per-project
substrate — the stack is shared across all projects on a machine.
Bring up the stack
# CPU profile (default — works on any Linux/macOS machine, ~3 GB RAM idle):
docker compose --profile cpu up -d
# GPU profile (CUDA — uses NVIDIA runtime, ~2.5 GB VRAM idle):
docker compose --profile gpu up -d
The dense embedder needs a GGUF model file. First-run setup:
mkdir -p models
cd models
huggingface-cli download nomic-ai/CodeRankEmbed-GGUF \
CodeRankEmbed-Q4_K_M.gguf --local-dir .
# Or: wget https://huggingface.co/nomic-ai/CodeRankEmbed-GGUF/resolve/main/CodeRankEmbed-Q4_K_M.gguf
If your models/ directory is somewhere else, set CODESCOUT_MODEL_DIR before
docker compose up.
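For example (the path below is illustrative, not a required location):

```shell
# Hypothetical model directory on a shared data drive — substitute your own path.
export CODESCOUT_MODEL_DIR=/data/ml-models
docker compose --profile cpu up -d
```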
Verify everything is healthy:
docker compose ps # all services "healthy"
curl -fsS http://127.0.0.1:48081/health # dense
curl -fsS http://127.0.0.1:48083/health # reranker
curl -fsS http://127.0.0.1:48084/health # sparse
curl -fsS http://127.0.0.1:6333/healthz # qdrant
AMD ROCm profile (docker compose --profile amd)
The amd profile in docker-compose.yml runs every leg of the retrieval
stack on the GPU: dense embedder, cross-encoder reranker, and sparse SPLADE
all share the AMD device. Qdrant runs alongside on CPU as usual. This is the
recommended path on any workstation with an AMD GPU and ROCm 7.x installed
on the host.
Bring up:
docker compose --profile amd up -d
Topology when using the amd profile:
| Service | Port | Image | Notes |
|---|---|---|---|
| Qdrant | 6333 HTTP / 6334 gRPC | qdrant/qdrant:v1.17.0 | Shared across all profiles. |
| Dense (dense-amd) | 48081 | rocm/llama.cpp:llama.cpp-b6652.amd0_rocm7.0.0_ubuntu24.04_server | llama-server --embedding --pooling mean, CodeRankEmbed-Q4_K_M.gguf. |
| Reranker (reranker-amd) | 48083 | same image | llama-server --reranking --pooling rank, bge-reranker-v2-m3-Q4_K_M.gguf. |
| Sparse (sparse-amd) | 48084 | codescout/sparse-amd:tei-1588129f93 (built locally) | TEI-on-ROCm running SPLADE-PP_en_v1. See SPLADE on ROCm. |
Required model files in ${CODESCOUT_MODEL_DIR:-./models}:
huggingface-cli download nomic-ai/CodeRankEmbed-GGUF \
CodeRankEmbed-Q4_K_M.gguf --local-dir ./models # ~90 MB
huggingface-cli download gpustack/bge-reranker-v2-m3-GGUF \
bge-reranker-v2-m3-Q4_K_M.gguf --local-dir ./models # ~419 MB
The SPLADE model is pulled by the sparse-amd container at first launch
into the huggingface-cache volume; no manual download needed.
Host requirements:
- AMD GPU (RX 7xxx / MI series), gfx1100+ recommended
- ROCm 7.x installed on host (kernel driver + /dev/kfd, /dev/dri)
- User in the video and render groups
The compose service declares devices: [/dev/kfd, /dev/dri] and
group_add: ["44", "992"] (numeric video/render GIDs — the rocm/pytorch
sparse image lacks a render group entry, so group names don’t resolve).
No NVIDIA-style runtime extension needed; AMD exposes the GPU via standard
Linux character devices.
Wire codescout: copy .env.amd (in the repo root) to .env. It sets the
ports above plus the protocol selectors required to talk to llama-server’s
/v1/embeddings and /v1/rerank:
CODESCOUT_EMBEDDER_PROTOCOL=llama-server # /v1/embeddings, not TEI's /embed
CODESCOUT_RERANKER_PROTOCOL=llama-server # Cohere-shape /rerank
Why dense + reranker use llama.cpp instead of TEI:
- TEI’s ROCm path is fragile and lags upstream; rocm/llama.cpp is AMD-built.
- Same binary serves the dense embedder and the cross-encoder reranker (--reranking mode), so one image covers two services.
Why sparse uses TEI: SPLADE is an MLM-style model with no llama.cpp implementation, and CPU latency saturates 32 cores on a full reindex of a 21k-chunk project. Building TEI from source against ROCm 7.1 + PyTorch 2.8 (see SPLADE on ROCm) puts SPLADE on the GPU and drops a full reindex from “minutes of CPU melting” to ~6 m 36 s at 121 % CPU.
How codescout finds the stack
codescout reads endpoints from environment variables and falls back to the defaults above:
| Env | Default | Effect |
|---|---|---|
| CODESCOUT_QDRANT_URL | http://127.0.0.1:6334 | Qdrant gRPC URL |
| CODESCOUT_EMBEDDER_URL | http://127.0.0.1:48081 | Dense embedder base URL |
| CODESCOUT_RERANKER_URL | http://127.0.0.1:48083 | Reranker base URL |
| CODESCOUT_SPARSE_URL | http://127.0.0.1:48084 | Sparse SPLADE base URL |
| CODESCOUT_EMBEDDER_PROTOCOL | tei | tei (TEI/llama-server native) or openai/llama-server (Ollama, OpenAI, Anthropic-compatible) |
| CODESCOUT_EMBEDDER_MODEL_NAME | (empty) | Model id sent in OpenAI-protocol JSON payloads |
| CODESCOUT_QUERY_PREFIX | (empty) | Prepended to query text only. Required by some asymmetric models (e.g. Nomic, BGE-large). |
| CODESCOUT_RERANKER_PROTOCOL | tei | tei (HuggingFace TEI) or llama-server/infinity/cohere (Cohere-shape /rerank, used by llama-server --reranking) |
| CODESCOUT_RERANKER_MODEL | (unset) | Override the reranker model id (Infinity-protocol only) |
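Because every endpoint is an env var, the stack does not have to be local. A sketch of pointing codescout at a stack on another machine (the hostname is illustrative; all four services must be reachable from the client):

```shell
# Hypothetical shared GPU box on the LAN — substitute your own host.
export CODESCOUT_QDRANT_URL=http://gpu-box.lan:6334
export CODESCOUT_EMBEDDER_URL=http://gpu-box.lan:48081
export CODESCOUT_RERANKER_URL=http://gpu-box.lan:48083
export CODESCOUT_SPARSE_URL=http://gpu-box.lan:48084
```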
Using Ollama / llama.cpp / OpenAI as the dense embedder
The shipped stack uses llama.cpp:server for the dense leg, but the dense
service is just an HTTP endpoint behind CODESCOUT_EMBEDDER_URL. Any
TEI-compatible or OpenAI-compatible server will work.
Ollama
Ollama exposes an OpenAI-compatible embeddings endpoint at
http://localhost:11434/v1. Pull a model and point codescout at it:
ollama pull nomic-embed-text # or any model with /api/embeddings
export CODESCOUT_EMBEDDER_URL=http://127.0.0.1:11434
export CODESCOUT_EMBEDDER_PROTOCOL=openai
export CODESCOUT_EMBEDDER_MODEL_NAME=nomic-embed-text
# Optional — Nomic needs a query prefix for asymmetric search:
export CODESCOUT_QUERY_PREFIX="search_query: "
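Before re-indexing, you can sanity-check the endpoint directly. This assumes Ollama’s standard OpenAI-compatible embeddings route:

```shell
# Expect a JSON body containing an embedding array; dimension depends on the model.
curl -fsS http://127.0.0.1:11434/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text", "input": "fn main() {}"}'
```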
You still need Qdrant, the reranker, and the sparse service running from the
docker-compose stack; Ollama only replaces the dense leg. Stop the now-unused
compose dense-cpu or dense-gpu container:
docker compose --profile cpu stop dense-cpu
llama.cpp (standalone)
If you already run llama-server outside docker, the same approach applies:
llama-server -m ~/models/CodeRankEmbed-Q4_K_M.gguf \
--port 48081 --embedding --pooling mean --ctx-size 8192
…then leave CODESCOUT_EMBEDDER_URL and CODESCOUT_EMBEDDER_PROTOCOL at
their defaults. The compose dense-* service is just a packaged version of
this command — see docker-compose.yml for the full flag list.
OpenAI / Anthropic-compatible APIs
export CODESCOUT_EMBEDDER_URL=https://api.openai.com/v1
export CODESCOUT_EMBEDDER_PROTOCOL=openai
export CODESCOUT_EMBEDDER_MODEL_NAME=text-embedding-3-small
# (codescout reads OPENAI_API_KEY from the environment automatically)
Cost: a full index of a ~10k-file Rust project is roughly 8 M tokens at ~768-dim. Budget accordingly.
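Back-of-envelope math, assuming text-embedding-3-small’s published price of $0.02 per 1 M tokens (an assumption; check current pricing before relying on it):

```shell
# 8 M tokens at an assumed $0.02 per million tokens.
awk 'BEGIN { printf "$%.2f\n", 8000000 / 1000000 * 0.02 }'   # → $0.16
```

Re-indexing after large refactors multiplies this, so a hosted embedder is cheap for one-off indexes but worth watching on busy repos.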
How we chose the components — benchmark summary
A 75-query retrieval benchmark was run across ~15 candidate stacks on a
pinned worktree of this repo. The full history lives in
docs/trackers/retrieval-benchmark.md.
Headline results below — all measured on the same query set at
bm25_boost=5.0, mode=code, with cross-encoder rerank enabled unless
noted.
Dense embedder
| Model | Quantization | Query prefix | Score (out of 75) | Notes |
|---|---|---|---|---|
| CodeRankEmbed | Q4_K_M (90 MB) | none | 37 | Champion. Best on env-var / identifier-bag queries. Q4 loses asymmetric subspace if a prefix is forced. |
| CodeRankEmbed | f16 (~550 MB) | required | 34 | f16 with prefix peaked three points below Q4 no-prefix. |
| jina-embeddings-v2-base-code | (native) | none | 36 | Strong general-code model; +2 vs jina without sparse fusion. |
| Nomic Embed Code 7B | Q4 | required | 24 | “Claimed CoIR SOTA” failed on real-world queries — bigger is not better. |
| Tavily-stack baseline (CodeRank, no rerank, sqlite-vec + tantivy) | Q4_K_M | none | 28 | Reference point for the legacy substrate we replaced. |
Why Q4 over f16: Q4_K_M scores higher than f16 in our query set when no prefix is set, and runs in ≤1 GB RAM. The f16 advantage only appears when the model’s asymmetric query prefix is enabled, and even then it caps three points below Q4 no-prefix. We default to Q4 no-prefix.
Sparse leg
We initially shipped a local Tantivy BM25 leg. It scored similarly to
SPLADE on lexical queries but was a maintenance burden (tantivy compile
time, on-disk index drift, separate rebuild step) and could not run as a
service. We migrated to SPLADE-PP_en_v1 via TEI — same conceptual role,
runs as a container, no per-project index. The benchmark showed sparse
fusion gives +2 points over dense-only at bm25_boost=5.0.
Reranker
| Model | Protocol | T5 (real-usage tier, /15) | Full /75 | Latency (p95) |
|---|---|---|---|---|
| bge-reranker-v2-m3 | TEI | 10 | 37 | ~80 ms (GPU) |
| bge-reranker-base | TEI | 9 | 35 | ~250 ms (CPU) |
| jina-rerank-v2 | Infinity | 11 | 38 (jina-v2 dense), 36 (CodeRank Q4 dense) | ~120 ms |
bge-v2-m3 wins on the full suite and is the default. jina-rerank-v2 lifts
the T5 (real-usage) tier by +1 every time but loses on long natural-language
queries. The protocol toggle (CODESCOUT_RERANKER_PROTOCOL=infinity) lets
you swap with a single env var — no rebuild needed.
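A minimal swap, assuming an Infinity-protocol server is already serving the model on the reranker port (the model id is illustrative; use whatever id your server registers):

```shell
# Switch the reranker leg to an Infinity-protocol endpoint — no rebuild needed.
export CODESCOUT_RERANKER_PROTOCOL=infinity
export CODESCOUT_RERANKER_MODEL=jina-rerank-v2   # id as registered on your server
```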
Stack-wide latency (champion config)
| Stage | CPU profile | GPU profile |
|---|---|---|
| Dense embed (single query) | ~30 ms | ~5 ms |
| Sparse embed (single query) | ~80 ms | ~30 ms |
| Qdrant hybrid search (RRF) | ~10 ms | ~10 ms |
| Cross-encoder rerank (top-20) | ~250 ms | ~80 ms |
| End-to-end semantic_search | ~370 ms | ~125 ms |
Indexing throughput on the codescout repo itself (~3.5 k chunks):
| Profile | Wall time | Throughput |
|---|---|---|
| CPU | ~45 s | ~80 chunks/s |
| GPU | ~12 s | ~290 chunks/s |
Migration from local-embed
If you have a .codescout/embeddings/project.db from a pre-v0.12 install:
# 1. Stand up the stack (see above)
# 2. Re-embed legacy memories into Qdrant:
codescout migrate-memories --dry-run # preview
codescout migrate-memories # execute
# 3. Re-index your project:
codescout index
The legacy sqlite-vec file is no longer read after migration. You can delete
it once you’ve verified memory recall works against the new substrate.
If you cannot run the stack (air-gapped, embedded environment), build with
local-embed:
cargo install codescout --no-default-features --features local-embed,http,librarian
This restores the in-process ONNX + fastembed path. Note: the network
retrieval pipeline (sparse fusion, cross-encoder rerank) is not available in
this mode — semantic_search falls back to pure dense vector scoring.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| semantic_search returns “stack unreachable” | dense/sparse/rerank/qdrant container not running | docker compose ps then start the missing profile |
| Empty results despite indexed data | wrong project_id namespace | workspace status to confirm the active project_id; codescout index --force to rebuild |
| Slow first query (10+ s) | model warmup on cold container | normal — subsequent queries hit the loaded model |
| migrate-memories reports “db not found” | legacy file at unexpected path | pass --db-path /path/to/embeddings.db explicitly |