Configuration Reference
Everything in Pathfinder is driven by a single pathfinder.yaml file. This is the full annotated reference.
server
Top-level server identity and session management.
- name — Identifies the server in MCP tool descriptions. Required.
- version — Semantic version string. Required.
- max_sessions_per_ip — Rate limiting. Prevents a single IP from opening too many concurrent sessions. Optional, defaults to 20.
- session_ttl_minutes — Sessions with no activity for this many minutes are cleaned up. Optional, defaults to 30.
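Putting the four fields together, a minimal server block might look like this (key nesting assumed from the field names above; values are illustrative):

```yaml
server:
  name: pathfinder-docs     # shown in MCP tool descriptions
  version: 1.2.0            # semantic version string
  max_sessions_per_ip: 20   # optional, default 20
  session_ttl_minutes: 30   # optional, default 30
```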
sources
Define where your content lives. Each source becomes a virtual filesystem that agents can explore.
- type — Determines chunking strategy.
`markdown` splits on headings and uses token-based chunks. `code` splits on function boundaries and uses line-based chunks. `raw-text` treats content as plain text with token-based chunks. `html` parses HTML structure and extracts text content with token-based chunks. `document` extracts text from PDF and DOCX files with token-based chunks.
- category — Optional tag applied to all chunks from this source. Sources with `category: faq` contribute to the `/faq.txt` endpoint and can be queried via knowledge tools.
- repo — If provided, Pathfinder clones and updates from this repo. Omit for local directories.
- version — Tags all indexed chunks with this version string. Used for version-filtered search queries.
- base_url — Base URL prepended to the derived slug when generating document links. Works together with `url_derivation`.
- url_derivation — Controls how file paths are transformed into URL slugs that get appended to `base_url`. Each step is applied in order:
  - strip_prefix — Remove this prefix from the file path before generating the URL. For example, `"docs/"` turns `docs/guide/auth.mdx` into `guide/auth.mdx`.
  - strip_suffix — Remove the file extension. For example, `".mdx"` turns `guide/auth.mdx` into `guide/auth`.
  - strip_route_groups — When `true`, removes Next.js-style route group segments like `(marketing)` or `(guides)` from the path.
  - strip_index — When `true`, removes a trailing `/index` from the path (e.g. `guide/auth/index` becomes `guide/auth`).
- chunk — Use `target_tokens`/`overlap_tokens` for markdown and raw-text sources. Use `target_lines`/`overlap_lines` for code sources.
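Combining these fields, a single source entry could be sketched as follows (the `name` and `path` keys are assumptions — this reference implies sources are named and can point at local directories, but doesn't show the exact keys):

```yaml
sources:
  - name: docs                    # referenced by tools by name
    type: markdown                # chunking strategy
    repo: https://github.com/acme/docs   # omit for a local directory
    version: v2                   # tags chunks for version-filtered search
    base_url: https://example.com/docs
    url_derivation:
      strip_prefix: "docs/"
      strip_suffix: ".mdx"
      strip_route_groups: true
      strip_index: true
    chunk:
      target_tokens: 500          # token-based for markdown sources
      overlap_tokens: 50
```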
Source type: markdown
Splits content on Markdown headings and uses token-based chunks. Best for documentation written in .md or .mdx files. Configure with target_tokens and overlap_tokens.
Source type: code
Splits on function/class boundaries and uses line-based chunks. Best for source code. Configure with target_lines and overlap_lines.
Source type: raw-text
Treats content as unstructured plain text with token-based chunks. Use when content has no Markdown or code structure.
Source type: html
Parses HTML structure, extracts text content, and uses token-based chunks. Added in v1.4. Best for static sites, rendered documentation, or any HTML content.
Source type: slack
Indexes Slack threads as Q&A pairs. An LLM distills each qualifying thread into a question-answer pair with a confidence score. All pairs are stored; confidence filtering happens at query time.
- channels — Array of Slack channel IDs to monitor. The bot must be a member of each channel.
- confidence_threshold — Minimum confidence score (0-1) for Q&A pairs to appear in query results. Pairs below this threshold are still stored but filtered out at query time.
- trigger_emoji — When a user reacts with this emoji on a thread, it triggers immediate reindexing of that thread. Useful for curating high-quality answers.
- min_thread_replies — Threads with fewer replies are skipped during indexing.
- distiller_model — The OpenAI model used to distill threads into Q&A pairs. Defaults to `gpt-4o-mini`.
Required environment variables: SLACK_BOT_TOKEN, SLACK_SIGNING_SECRET (for emoji-triggered webhook verification).
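A Slack source sketch using the fields above (channel ID and threshold values are illustrative):

```yaml
sources:
  - name: support-slack
    type: slack
    category: faq                  # expose pairs via knowledge tools and /faq.txt
    channels: ["C0123456789"]      # bot must be a member of each channel
    confidence_threshold: 0.7      # applied at query time; all pairs are stored
    trigger_emoji: white_check_mark  # reaction triggers immediate reindexing
    min_thread_replies: 2          # skip shorter threads
    distiller_model: gpt-4o-mini   # default
```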
Source type: discord
Indexes Discord channels as Q&A pairs. Supports two channel types with different extraction strategies.
- guild_id — The Discord server ID.
- channels — Array of channel objects, each with an `id` and `type`.
  - type: text — Text channels use LLM distillation (same as Slack). Threads are distilled into Q&A pairs with confidence scores.
  - type: forum — Forum channels extract Q&A directly from the post title (question) and replies (answer). Confidence is always 1.0 since the structure is explicit.
- confidence_threshold — Minimum confidence for query results. Forum posts always pass (confidence 1.0).
- min_thread_replies — Applies to text channel threads only.
- distiller_model — OpenAI model for thread distillation on text channels. Optional, defaults to `gpt-4o-mini`.
Discord does not support emoji-triggered reindexing; that would require a persistent Gateway WebSocket connection, which Pathfinder does not maintain.
Required environment variables: DISCORD_BOT_TOKEN, DISCORD_PUBLIC_KEY (for webhook Ed25519 verification). The bot needs the MESSAGE_CONTENT privileged intent and Read Message History + View Channels permissions.
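A Discord source sketch showing both channel types (IDs are placeholders):

```yaml
sources:
  - name: community-discord
    type: discord
    category: faq
    guild_id: "987654321098765432"
    channels:
      - id: "111111111111111111"
        type: text                 # LLM-distilled, confidence scored
      - id: "222222222222222222"
        type: forum                # title = question, replies = answer, confidence 1.0
    confidence_threshold: 0.7      # forum posts always pass
    min_thread_replies: 2          # text channels only
```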
Source type: notion
Index Notion pages and database entries as searchable markdown documents. Blocks are recursively converted to markdown, and database entry properties are serialized as YAML frontmatter.
- root_pages — Array of Notion page IDs. Each page and its children (up to `max_depth`) are indexed. Optional, defaults to `[]`.
- databases — Array of Notion database IDs. All entries in each database are indexed with their properties. Optional, defaults to `[]`.
- max_depth — Maximum depth for recursive child page discovery. Range 1-20, default 5.
- include_properties — When true, database entry properties (Status, Priority, Tags, etc.) are prepended as YAML frontmatter to the page content. Default true.
If both root_pages and databases are empty, all pages accessible to the integration token are indexed. Requires NOTION_TOKEN environment variable.
By default, Notion sources are indexed as documents and referenced by search tools. To also make them available via knowledge tools and the /faq.txt endpoint, add category: faq to the source config — useful for Notion databases structured as Q&A or FAQ collections.
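A Notion source sketch combining these options (the page and database IDs are hypothetical placeholders):

```yaml
sources:
  - name: team-handbook
    type: notion
    category: faq                   # optional: expose via knowledge tools and /faq.txt
    root_pages: ["page-id-goes-here"]      # hypothetical page ID
    databases: ["database-id-goes-here"]   # hypothetical database ID
    max_depth: 5                    # recursion limit, range 1-20
    include_properties: true        # properties become YAML frontmatter
```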
Note: The chunk config is available on all source types. For Slack and Discord, the Q&A chunker produces one chunk per Q&A pair (chunk settings have no effect). For Notion, the markdown chunker respects target_tokens and overlap_tokens.
Source type: document
Index PDF and DOCX files. Requires optional peer dependencies: npm install pdf-parse for PDF support and npm install mammoth for DOCX support. Uses token-based chunks like markdown sources.
- PDF support — Requires the `pdf-parse` peer dependency. Text is extracted page by page. Scanned PDFs (image-only pages with no extractable text) are detected and skipped with a warning.
- DOCX support — Requires the `mammoth` peer dependency. Document content is converted to plain text for chunking.
- max_file_size — Maximum file size in bytes. Defaults to 10 MB (10485760 bytes) for document sources. Files over the limit are skipped with a warning.
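A document source sketch (the `path` key is an assumption for pointing at a local directory):

```yaml
sources:
  - name: whitepapers
    type: document
    path: ./files/whitepapers   # assumed key for local content
    max_file_size: 10485760     # 10 MB, the default for document sources
    chunk:
      target_tokens: 500        # token-based, like markdown sources
      overlap_tokens: 50
```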
url_derivation example
Given a file at docs/(guides)/getting-started/index.mdx with the following config:
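A plausible config for this example, assembled from the `url_derivation` steps documented above (`base_url` is a placeholder):

```yaml
base_url: https://example.com/docs
url_derivation:
  strip_prefix: "docs/"
  strip_suffix: ".mdx"
  strip_route_groups: true
  strip_index: true
```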
The derivation steps produce:
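Assuming a config with `strip_prefix: "docs/"`, `strip_suffix: ".mdx"`, `strip_route_groups: true`, and `strip_index: true`, applying the steps in order gives:

```text
docs/(guides)/getting-started/index.mdx
→ strip_prefix "docs/":  (guides)/getting-started/index.mdx
→ strip_suffix ".mdx":   (guides)/getting-started/index
→ strip_route_groups:    getting-started/index
→ strip_index:           getting-started
```

The final slug `getting-started` is appended to `base_url` to produce the document link.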
tools
Tools are what agents actually call. Four types: search, bash, collect, and knowledge.
search
Semantic search over embedded content. Requires a database and embedding config.
Search Modes
The search_mode field controls how search queries are executed:
- vector (default) — cosine similarity on embeddings. Best for semantic/conceptual queries.
- keyword — PostgreSQL full-text search (tsvector/tsquery). Best for exact terms, error codes, and technical identifiers. Does not require an embedding call at query time.
- hybrid — runs both vector and keyword searches in parallel, then merges results using Reciprocal Rank Fusion (RRF, k=60). Best overall recall for mixed query types.
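A search tool sketch (the `name`, `description`, and `sources` keys are assumptions inferred from how tools and sources are referenced elsewhere in this reference):

```yaml
tools:
  - type: search
    name: search_docs
    description: Semantic search over the documentation
    sources: [docs]
    search_mode: hybrid   # vector (default) | keyword | hybrid
```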
bash
Filesystem exploration with find, grep, cat, ls, head. Works with no database.
- grep_strategy — `memory` passes grep through to bash unchanged. `vector` intercepts grep and runs semantic search instead. `hybrid` runs bash grep first and falls back to semantic search if grep returns no results.
- workspace — When true, agents can write files to `/workspace/`. Requires a persistent volume in production (see the Deploy Guide).
- session_state — When true, `cd` persists across tool calls within a session.
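A bash tool sketch combining these options:

```yaml
tools:
  - type: bash
    grep_strategy: hybrid   # memory | vector | hybrid
    workspace: true         # allow writes under /workspace/
    session_state: true     # cd persists within a session
```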
collect
Data collection tools. Agents submit structured data based on a JSON schema you define. Submitted data is stored in the local PostgreSQL database in the collected_data table, with a tool_name column and a data JSONB column containing the fields from your schema. What you do with the collected data is up to you — query it directly, build dashboards, pipe it to analytics, or export it. Pathfinder stores it; you decide how to use it.
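A collect tool sketch; the `schema` key and its JSON-schema-style shape are assumptions, and the field names are purely illustrative:

```yaml
tools:
  - type: collect
    name: report_feedback
    description: Collect structured feedback from agents
    schema:                        # assumed key; defines the JSONB `data` shape
      type: object
      properties:
        page: { type: string }
        rating: { type: integer }
        comment: { type: string }
      required: [page, rating]
```

Each submission lands as a row in `collected_data` with `tool_name: report_feedback` and the schema fields in the `data` column.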
knowledge
FAQ and knowledge base tool. Exposes Q&A pairs from FAQ-category sources (Slack, Discord) to agents. Supports two modes: browse (no query, returns recent pairs) and search (semantic query over pairs).
- sources — Array of source names. These sources should have `category: faq` set.
- min_confidence — Overrides the per-source confidence threshold for this tool. Q&A pairs below this score are filtered from results.
- default_limit — Number of results returned when the agent doesn't specify a limit.
- max_limit — Upper bound on results the agent can request.
When called without a query, the tool returns recent Q&A pairs (browse mode). When called with a query, it performs semantic search over the pairs.
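A knowledge tool sketch referencing FAQ-category sources (source names are illustrative):

```yaml
tools:
  - type: knowledge
    name: faq
    sources: [support-slack, community-discord]   # sources with category: faq
    min_confidence: 0.75   # overrides per-source confidence_threshold
    default_limit: 5
    max_limit: 20
```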
embedding
Required when using search tools. Configures how content is embedded for vector search.
Provider options
Three embedding providers are supported. OPENAI_API_KEY is only required when using the openai provider.
OpenAI (default) — uses the OpenAI embeddings API. Requires OPENAI_API_KEY.
Ollama — calls a local Ollama instance over HTTP. No API key needed. Requires Ollama running with the model pulled (ollama pull nomic-embed-text).
Local — runs @xenova/transformers in-process. Zero external dependencies, CPU-only. Install the optional peer dependency: npm install @xenova/transformers.
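The three providers might be configured as follows; the `provider`/`model` key names are assumptions, and the model names are common defaults for each provider rather than values confirmed by this reference:

```yaml
embedding:
  provider: openai
  model: text-embedding-3-small   # requires OPENAI_API_KEY

# Alternatively:
# embedding:
#   provider: ollama
#   model: nomic-embed-text       # requires: ollama pull nomic-embed-text
#
# embedding:
#   provider: local
#   model: Xenova/all-MiniLM-L6-v2   # requires: npm install @xenova/transformers
```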
Dimension mismatch detection
If you change embedding providers or models on an existing database, Pathfinder warns at startup when the configured dimensions don't match the existing vector index. You'll need to reindex to use the new dimensions.
indexing
Required when using search tools. Controls automatic reindexing behavior.
webhook
Optional. Triggers targeted reindexing when you push to GitHub.
Point a GitHub webhook at https://your-server/webhooks/github with the push event. Set GITHUB_WEBHOOK_SECRET to match the webhook secret.
analytics
Optional. Enables query logging and the built-in analytics dashboard at /analytics.
- enabled — When `true`, Pathfinder logs queries and serves the analytics dashboard at `/analytics`. Default `false`.
- log_queries — When `true`, all search queries are logged with latency and result counts. Default `true` when analytics is enabled.
- token — Bearer token for authenticating requests to `/api/analytics/*` endpoints. Set it via the `ANALYTICS_TOKEN` environment variable. Required when analytics is enabled.
- retention_days — How long to keep analytics data before automatic cleanup. Default 90 days.
The analytics dashboard provides top queries, empty result tracking, and latency metrics. API endpoints: /api/analytics/summary, /api/analytics/queries, /api/analytics/empty-queries.
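An analytics block sketch (whether the YAML supports `${...}` environment interpolation is an assumption; the token may instead be read directly from `ANALYTICS_TOKEN`):

```yaml
analytics:
  enabled: true
  log_queries: true
  token: ${ANALYTICS_TOKEN}   # bearer token for /api/analytics/* endpoints
  retention_days: 90
```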
Example Configs
Bash-only (no database)
The simplest setup. No API keys, no database. Agents explore docs with shell commands.
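A minimal sketch of this setup, assuming local sources take a `path` key:

```yaml
server:
  name: acme-docs
  version: 1.0.0
sources:
  - name: docs
    type: markdown
    path: ./docs             # local directory, no repo clone
tools:
  - type: bash
    grep_strategy: memory    # plain grep, no embeddings or database needed
```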
Full stack with semantic search
Bash exploration plus RAG search. Requires Postgres and an OpenAI API key.
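A sketch of this setup; tool key names follow the same assumptions as the tool sections above, and the repo URL is a placeholder:

```yaml
server:
  name: acme-docs
  version: 1.0.0
sources:
  - name: docs
    type: markdown
    repo: https://github.com/acme/docs
    chunk:
      target_tokens: 500
      overlap_tokens: 50
tools:
  - type: bash
    grep_strategy: hybrid    # bash grep with semantic fallback
  - type: search
    name: search_docs
    sources: [docs]
    search_mode: hybrid
embedding:
  provider: openai
  model: text-embedding-3-small   # requires OPENAI_API_KEY
```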
Multi-repo
Index docs from multiple repositories into a single Pathfinder instance.
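A sketch with three repos feeding one search tool (repo URLs are placeholders):

```yaml
sources:
  - name: product-docs
    type: markdown
    repo: https://github.com/acme/product-docs
  - name: api-reference
    type: markdown
    repo: https://github.com/acme/api-docs
  - name: sdk
    type: code
    repo: https://github.com/acme/sdk
    chunk:
      target_lines: 80        # line-based for code sources
      overlap_lines: 10
tools:
  - type: search
    name: search_all
    sources: [product-docs, api-reference, sdk]
```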
Multi-version
Serve multiple versions of the same docs. Agents can filter search results by version.
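A sketch using the per-source `version` field documented above; how each source selects a branch or tag of the repo is not specified in this reference, so that detail is omitted:

```yaml
sources:
  - name: docs-v1
    type: markdown
    repo: https://github.com/acme/docs
    version: v1               # tags chunks for version-filtered search
  - name: docs-v2
    type: markdown
    repo: https://github.com/acme/docs
    version: v2
tools:
  - type: search
    name: search_docs
    sources: [docs-v1, docs-v2]   # agents can filter results by version
```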