Configuration Reference
Everything in Pathfinder is driven by a single
pathfinder.yaml file. This is the full annotated
reference.
server
Top-level server identity and session management.
- name — Identifies the server in MCP tool descriptions. Required.
- version — Semantic version string. Required.
- max_sessions — Global cap on total concurrent sessions across all IPs. When exceeded, new connections receive a 503 with a descriptive JSON body (error: "capacity_exceeded", totalSessions, maxSessions, retryAfterSeconds, contact) and a Retry-After header. A warning is logged when sessions exceed 80% of this cap. Optional, defaults to 1000.
- max_sessions_per_ip — Per-IP rate limiting. Prevents a single IP from opening too many concurrent sessions. Optional, defaults to 20.
- session_ttl_minutes — Idle timeout for active sessions (sessions that have invoked at least one tool). Sessions with no activity for this many minutes are cleaned up. Optional, defaults to 30.
- session_unused_ttl_minutes — Idle timeout for unused sessions (sessions that connected but never invoked a tool). Designed to shed idle MCP connections quickly while keeping active sessions alive longer. Set to 15 by default to match Railway's 15-minute SSE connection hard limit. Optional.
- allowlist — Array of IPv4/IPv6 addresses or CIDR ranges that bypass max_sessions_per_ip entirely. Use for trusted crawlers (e.g. 160.79.106.35, the Anthropic Assistant) or internal health-probe ranges. Optional, defaults to empty. Entries are matched against the per-request client IP. Behind a reverse proxy this requires trust_proxy: true; with the default trust_proxy: false the server only sees the proxy's TCP socket address and no upstream client IP will ever match. Directly-exposed servers use the socket peer and need no extra config. Clients that hit the limit receive a 429 with a descriptive JSON body (error, reason, limit, currentCount, retryAfterSeconds, contact) and a Retry-After header.
- trust_proxy — Whether to honor the X-Forwarded-For header for client-IP resolution (used by rate limiting, allowlist checks, tracing, and analytics). When true, the server populates req.ip from the leftmost entry of X-Forwarded-For. This is REQUIRED when the server runs behind a reverse proxy (Railway, Fly, Nginx, etc.) that sets X-Forwarded-For. Optional, defaults to false. ⚠️ Security: only enable this when the proxy discards any client-supplied X-Forwarded-For AND sets its own trusted value. If the proxy passes through client-supplied X-Forwarded-For, attackers can send X-Forwarded-For: 160.79.106.35 to be seen as an allowlisted IP and bypass the rate limiter entirely. When false (default), X-Forwarded-For is ignored and the server uses the TCP socket's peer address.
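As a rough illustration of how trust_proxy and allowlist interact, the following Python sketch resolves the client IP and checks it against allowlist entries. The function names are hypothetical and this is not Pathfinder's actual implementation; it only mirrors the behaviour described in the list above.

```python
import ipaddress


def resolve_client_ip(socket_peer, x_forwarded_for, trust_proxy=False):
    """Pick the IP used for rate limiting and allowlist checks (illustrative)."""
    if trust_proxy and x_forwarded_for:
        # trust_proxy: true -> use the leftmost X-Forwarded-For entry.
        return x_forwarded_for.split(",")[0].strip()
    # trust_proxy: false -> the header is ignored; use the TCP socket peer.
    return socket_peer


def is_allowlisted(client_ip, allowlist):
    """Return True if client_ip matches any address or CIDR range in allowlist."""
    addr = ipaddress.ip_address(client_ip)
    for entry in allowlist:
        # A bare address such as "160.79.106.35" is treated as a /32 (or /128) network.
        network = ipaddress.ip_network(entry, strict=False)
        if addr in network:  # always False when IP versions differ
            return True
    return False
```

With trust_proxy left at false behind a proxy, resolve_client_ip returns the proxy's own address, which is why no upstream client IP can match an allowlist entry in that setup.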
sources
Define where your content lives. Each source becomes a virtual filesystem that agents can explore.
- type — Determines chunking strategy. markdown splits on headings and uses token-based chunks. code splits on function boundaries and uses line-based chunks. raw-text treats content as plain text with token-based chunks. html parses HTML structure and extracts text content with token-based chunks. document extracts text from PDF and DOCX files with token-based chunks.
- category — Optional tag applied to all chunks from this source. Sources with category: faq contribute to the /faq.txt endpoint and can be queried via knowledge tools.
- repo — If provided, Pathfinder clones and updates from this repo. Omit for local directories.
- version — Tags all indexed chunks with this version string. Used for version-filtered search queries.
- base_url — Base URL prepended to the derived slug when generating document links. Works together with url_derivation.
- url_derivation — Controls how file paths are transformed into URL slugs that get appended to base_url. Each step is applied in order:
  - strip_prefix — Remove this prefix from the file path before generating the URL. For example, "docs/" turns docs/guide/auth.mdx into guide/auth.mdx.
  - strip_suffix — Remove the file extension. For example, ".mdx" turns guide/auth.mdx into guide/auth.
  - strip_route_groups — When true, removes Next.js-style route group segments like (marketing) or (guides) from the path.
  - strip_index — When true, removes a trailing /index from the path (e.g. guide/auth/index becomes guide/auth).
- chunk — Use target_tokens/overlap_tokens for markdown and raw-text sources. Use target_lines/overlap_lines for code sources.
Note: The chunk config is available on all source
types, but the effect varies. Markdown, raw-text, HTML, and document
sources use token-based chunks
(target_tokens/overlap_tokens). Code
sources use line-based chunks
(target_lines/overlap_lines). For Slack
and Discord, the Q&A chunker produces one chunk per Q&A pair
and chunk settings have no effect. For Notion, the markdown chunker
respects target_tokens and
overlap_tokens.
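To make the target/overlap settings concrete, here is a minimal Python sketch of fixed-size chunking with overlap. It works on any list of units (tokens or lines) and is only an approximation: Pathfinder's real chunkers also split on headings or function boundaries, and the function name is hypothetical.

```python
def chunk_with_overlap(units, target, overlap):
    """Split units into windows of `target` items, each sharing `overlap` items
    with the previous window (illustrative only)."""
    if target <= 0:
        raise ValueError("target must be positive")
    if not 0 <= overlap < target:
        raise ValueError("overlap must be >= 0 and smaller than target")

    step = target - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + target])
        if start + target >= len(units):
            break  # the last window already reached the end
    return chunks
```

For token-based sources the units would be tokens (target_tokens/overlap_tokens); for code sources they would be lines (target_lines/overlap_lines).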
Source type: markdown
Splits content on Markdown headings and uses token-based chunks. Best
for documentation written in .md or
.mdx files. Configure with target_tokens and
overlap_tokens.
Source type: code
Splits on function/class boundaries and uses line-based chunks. Best
for source code. Configure with target_lines and
overlap_lines.
Source type: raw-text
Treats content as unstructured plain text with token-based chunks. Use when content has no Markdown or code structure.
Source type: html
Parses HTML structure, extracts text content, and uses token-based chunks. Best for static sites, rendered documentation, or any HTML content.
Source type: slack
Indexes Slack threads as Q&A pairs. An LLM distills each qualifying thread into a question-answer pair with a confidence score. All pairs are stored; confidence filtering happens at query time.
- channels — Array of Slack channel IDs to monitor. The bot must be a member of each channel.
- confidence_threshold — Minimum confidence score (0-1) for Q&A pairs to appear in query results. Pairs below this threshold are still stored but filtered out at query time.
- trigger_emoji — When a user reacts with this emoji on a thread, it triggers immediate reindexing of that thread. Useful for curating high-quality answers.
- min_thread_replies — Threads with fewer replies are skipped during indexing.
- distiller_model — The OpenAI model used to distill threads into Q&A pairs. Defaults to gpt-4o-mini.
Required environment variables: SLACK_BOT_TOKEN,
SLACK_SIGNING_SECRET (for emoji-triggered webhook
verification).
Source type: discord
Indexes Discord channels as Q&A pairs. Supports two channel types with different extraction strategies.
- guild_id — The Discord server ID.
- channels — Array of channel objects, each with an id and type.
- type: text — Text channels use LLM distillation (same as Slack). Threads are distilled into Q&A pairs with confidence scores.
- type: forum — Forum channels extract Q&A directly from the post title (question) and replies (answer). Confidence is always 1.0 since the structure is explicit.
- confidence_threshold — Minimum confidence for query results. Forum posts always pass (confidence 1.0).
- min_thread_replies — Applies to text channel threads only.
- distiller_model — OpenAI model for thread distillation on text channels. Optional, defaults to gpt-4o-mini.
Discord does not support emoji-triggered reindexing (requires Gateway WebSocket, which Pathfinder does not maintain).
Required environment variables: DISCORD_BOT_TOKEN,
DISCORD_PUBLIC_KEY (for webhook Ed25519 verification).
The bot needs the MESSAGE_CONTENT privileged intent and
Read Message History +
View Channels permissions.
Source type: notion
Index Notion pages and database entries as searchable markdown documents. Blocks are recursively converted to markdown, and database entry properties are serialized as YAML frontmatter.
- root_pages — Array of Notion page IDs. Each page and its children (up to max_depth) are indexed. Optional, defaults to [].
- databases — Array of Notion database IDs. All entries in each database are indexed with their properties. Optional, defaults to [].
- max_depth — Maximum depth for recursive child page discovery. Range 1-20, default 5.
- include_properties — When true, database entry properties (Status, Priority, Tags, etc.) are prepended as YAML frontmatter to the page content. Default true.
If both root_pages and databases are empty,
all pages accessible to the integration token are indexed. Requires
NOTION_TOKEN environment variable.
By default, Notion sources are indexed as documents and referenced by
search tools. To also make them available via
knowledge tools and the /faq.txt endpoint,
add category: faq to the source config — useful for
Notion databases structured as Q&A or FAQ collections.
Source type: document
Index PDF and DOCX files. Requires optional peer dependencies:
npm install pdf-parse for PDF support and
npm install mammoth for DOCX support. Uses token-based
chunks like markdown sources.
- PDF support — Requires the pdf-parse peer dependency. Text is extracted page-by-page. Scanned PDFs (image-only pages with no extractable text) are detected and skipped with a warning.
- DOCX support — Requires the mammoth peer dependency. Document content is converted to plain text for chunking.
- max_file_size — Defaults to 10MB (10485760 bytes) for document sources. Files larger than this are skipped with a warning.
url_derivation example
Given a file at
docs/(guides)/getting-started/index.mdx with the
following config:
The derivation steps produce:
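The four derivation steps can be sketched in Python as follows. This is an illustration based on the option descriptions in the sources section, not Pathfinder's own code; the function name and the option values used in the usage note (strip_prefix "docs/", strip_suffix ".mdx", both boolean options enabled) are assumptions.

```python
import re


def derive_slug(path, strip_prefix="", strip_suffix="",
                strip_route_groups=False, strip_index=False):
    """Apply the url_derivation steps in their documented order (illustrative)."""
    # 1. strip_prefix: drop a leading directory such as "docs/".
    if strip_prefix and path.startswith(strip_prefix):
        path = path[len(strip_prefix):]
    # 2. strip_suffix: drop the file extension such as ".mdx".
    if strip_suffix and path.endswith(strip_suffix):
        path = path[:-len(strip_suffix)]
    # 3. strip_route_groups: drop whole segments wrapped in parentheses.
    if strip_route_groups:
        segments = [s for s in path.split("/") if not re.fullmatch(r"\(.+\)", s)]
        path = "/".join(segments)
    # 4. strip_index: drop a trailing "/index" (or a bare "index").
    if strip_index:
        if path == "index":
            path = ""
        elif path.endswith("/index"):
            path = path[:-len("/index")]
    return path
```

Under those assumed settings, docs/(guides)/getting-started/index.mdx passes through (guides)/getting-started/index.mdx, (guides)/getting-started/index, and getting-started/index before ending as getting-started, which is then appended to base_url.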
tools
Tools are what agents actually call. Four types: search,
bash, collect, and knowledge.
search
Semantic search over embedded content. Requires a database and embedding config.
Search Modes
The search_mode field controls how search queries are
executed:
- vector (default) — cosine similarity on embeddings. Best for semantic/conceptual queries.
- keyword — PostgreSQL full-text search (tsvector/tsquery). Best for exact terms, error codes, and technical identifiers. Does not require an embedding call at query time.
- hybrid — runs both vector and keyword searches in parallel, then merges results using Reciprocal Rank Fusion (RRF, k=60). Best overall recall for mixed query types.
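For readers unfamiliar with Reciprocal Rank Fusion, the merge step in hybrid mode can be sketched as below. This is a generic RRF implementation with k=60, not Pathfinder's code; the function name and the use of plain document IDs are assumptions made for the example.

```python
def rrf_merge(rankings, k=60):
    """Merge several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based. Higher scores rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document that appears in both the vector and the keyword results accumulates two contributions, so it tends to outrank documents found by only one of the searches.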
bash
Filesystem exploration with find, grep, cat, ls, head. Works with no database.
- grep_strategy — memory passes grep through to bash unchanged. vector intercepts grep and runs semantic search. hybrid runs bash grep first, falls back to semantic if no results.
- workspace — When true, agents can write files to /workspace/. Requires a persistent volume in production (see Deploy Guide).
- session_state — When true, cd persists across tool calls within a session.
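The three grep_strategy values amount to a small dispatch rule, sketched here in Python. The two callables stand in for the real bash grep and semantic search, and both they and the function name are hypothetical.

```python
def run_grep(strategy, pattern, bash_grep, semantic_search):
    """Dispatch a grep call according to grep_strategy (illustrative only).

    bash_grep and semantic_search are caller-supplied functions that take a
    pattern and return a list of matches.
    """
    if strategy == "memory":
        return bash_grep(pattern)
    if strategy == "vector":
        return semantic_search(pattern)
    if strategy == "hybrid":
        results = bash_grep(pattern)
        # Fall back to semantic search only when plain grep finds nothing.
        return results if results else semantic_search(pattern)
    raise ValueError(f"unknown grep_strategy: {strategy!r}")
```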
collect
Data collection tools. Agents submit structured data based on a JSON
schema you define. Submitted data is stored in the local PostgreSQL
database in the collected_data table, with a
tool_name column and a data JSONB column
containing the fields from your schema. What you do with the collected
data is up to you — query it directly, build dashboards, pipe it to
analytics, or export it. Pathfinder stores it; you decide how to use
it.
knowledge
FAQ and knowledge base tool. Exposes Q&A pairs from FAQ-category sources (Slack, Discord) to agents. Supports two modes: browse (no query, returns recent pairs) and search (semantic query over pairs).
- sources — Array of source names. These sources should have category: faq set.
- min_confidence — Override the per-source confidence threshold for this tool. Q&A pairs below this score are filtered from results.
- default_limit — Number of results returned when the agent doesn't specify a limit.
- max_limit — Upper bound on results the agent can request.
When called without a query, the tool returns recent Q&A pairs (browse mode). When called with a query, it performs semantic search over the pairs.
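The interaction between min_confidence, default_limit, and max_limit can be sketched as follows. The pair structure (a dict with a "confidence" key) and the function name are assumptions made for the example, not Pathfinder's internal types.

```python
def select_pairs(pairs, min_confidence, default_limit, max_limit, limit=None):
    """Filter Q&A pairs by confidence and cap the result count (illustrative)."""
    # Use default_limit when the agent did not ask for a specific limit.
    requested = default_limit if limit is None else limit
    # Never return more than max_limit, and never a negative count.
    effective = max(0, min(requested, max_limit))
    # Pairs below the threshold stay stored but are hidden from results.
    kept = [p for p in pairs if p["confidence"] >= min_confidence]
    return kept[:effective]
```

Because filtering happens at query time, lowering min_confidence later would surface pairs that were already stored, without reindexing.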
embedding
Required when using search tools. Configures how content is embedded for vector search.
Provider options
Three embedding providers are supported.
OPENAI_API_KEY is only required when using the
openai provider.
OpenAI (default) — uses the OpenAI embeddings API.
Requires OPENAI_API_KEY.
Ollama — calls a local Ollama instance over HTTP. No
API key needed. Requires
Ollama running with the model pulled
(ollama pull nomic-embed-text).
Local — runs
@xenova/transformers in-process. Zero external
dependencies, CPU-only. Install the optional peer dependency:
npm install @xenova/transformers.
Dimension mismatch detection
If you change embedding providers or models on an existing database, Pathfinder warns at startup when the configured dimensions don't match the existing vector index. You'll need to reindex to use the new dimensions.
indexing
Required when using search tools. Controls automatic reindexing behavior.
webhook
Optional. Triggers targeted reindexing when you push to GitHub.
Point a GitHub webhook at
https://your-server/webhooks/github with the
push event. Set GITHUB_WEBHOOK_SECRET to
match the webhook secret.
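GitHub signs each webhook delivery with an HMAC-SHA256 of the raw request body and sends it in the X-Hub-Signature-256 header as sha256=<hex digest>. The Python sketch below shows how such a signature can be checked against GITHUB_WEBHOOK_SECRET; it illustrates the general technique, and whether Pathfinder verifies deliveries in exactly this way is an assumption.

```python
import hashlib
import hmac


def verify_github_signature(secret, body, signature_header):
    """Check a GitHub X-Hub-Signature-256 header against the raw body bytes."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_header)
```

The check must run on the raw body bytes as received; re-serializing parsed JSON can change the bytes and make a valid signature fail.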
analytics
Optional. Enables query logging and the built-in analytics dashboard
at /analytics.
- enabled — When true, Pathfinder logs queries and serves the analytics dashboard at /analytics. Default false.
- log_queries — When true, all search queries are logged with latency and result counts. Default true (when analytics is enabled).
- token — Bearer token for authenticating requests to /api/analytics/* endpoints. Prefer the ANALYTICS_TOKEN environment variable; set analytics.token only if you need the value inline (YAML is loaded as-is, with no ${VAR} interpolation). One of analytics.token or the ANALYTICS_TOKEN env var is required when analytics is enabled.
- retention_days — How long to keep analytics data before automatic cleanup. Default 90 days.
The analytics dashboard provides top queries, empty result tracking,
and latency metrics. API endpoints:
/api/analytics/summary,
/api/analytics/queries,
/api/analytics/empty-queries.
Authentication
Pathfinder runs an anonymous OAuth 2.1 flow for MCP clients
automatically — no config needed beyond the
MCP_JWT_SECRET environment variable documented in the
Deployment Guide. No user
accounts, no sign-up, no dashboard — clients that perform the
handshake receive a token whose subject is always
anonymous. This satisfies MCP clients that require OAuth
(claude.ai, newer Claude Code builds) while keeping Pathfinder a pure
knowledge server.
OAuth endpoints
All standard OAuth 2.1 + dynamic client registration endpoints are served automatically:
- /.well-known/oauth-protected-resource — resource metadata (RFC 9728).
- /.well-known/oauth-authorization-server — authorization server metadata (RFC 8414).
- /register — dynamic client registration (RFC 7591). Clients register themselves on first connect.
- /authorize — authorization endpoint. Renders a one-click approve page and redirects with a code.
- /token — token endpoint. Exchanges codes or refresh tokens for access tokens.
- /revoke — token revocation (RFC 7009).
Bearer auth on /mcp and /sse
Both transport endpoints use opportunistic bearer authentication:
- Requests without an Authorization header pass through unchanged, which is useful for clients that don't speak OAuth.
- Requests with a valid bearer token attach the token claims to the session.
- Requests with an invalid or expired bearer are rejected with 401 Unauthorized and a WWW-Authenticate challenge pointing at the resource metadata endpoint.
Clients that see the 401 challenge automatically discover the OAuth
server, register, and retry with a valid token. Rotating
MCP_JWT_SECRET invalidates all issued tokens at once —
clients re-authenticate transparently on the next request.
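The three bearer-auth outcomes can be sketched as a small decision function. This is an illustration, not Pathfinder's middleware: the function name, the verify_token callable, and the https://your-server host are placeholders, and the challenge uses the resource_metadata parameter defined in RFC 9728.

```python
# Placeholder host; the path is the resource metadata endpoint listed under OAuth endpoints.
RESOURCE_METADATA_URL = "https://your-server/.well-known/oauth-protected-resource"


def check_bearer(authorization, verify_token):
    """Return (status, claims, headers) for an incoming request (illustrative).

    verify_token is a caller-supplied function that returns the token claims,
    or None when the token is invalid or expired.
    """
    if authorization is None:
        # No Authorization header: pass through unauthenticated.
        return 200, None, {}
    scheme, _, token = authorization.partition(" ")
    claims = verify_token(token) if scheme.lower() == "bearer" and token else None
    if claims is None:
        challenge = f'Bearer resource_metadata="{RESOURCE_METADATA_URL}"'
        return 401, None, {"WWW-Authenticate": challenge}
    # Valid token: the claims are attached to the session.
    return 200, claims, {}
```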
Example Configs
Bash-only (no database)
The simplest setup. No API keys, no database. Agents explore docs with shell commands.
Full stack with semantic search
Bash exploration plus RAG search. Requires Postgres and an OpenAI API key.
Multi-repo
Index docs from multiple repositories into a single Pathfinder instance.
Multi-version
Serve multiple versions of the same docs. Agents can filter search results by version.