Configuration Reference

Everything in Pathfinder is driven by a single pathfinder.yaml file. This is the full annotated reference.

server

Top-level server identity and session management.

```yaml
server:
  name: my-project-docs        # Server name, exposed in MCP metadata
  version: 1.0.0               # Server version string
  max_sessions_per_ip: 20      # Max concurrent MCP sessions per IP (default: 20)
  session_ttl_minutes: 30      # Idle session timeout in minutes (default: 30)
```

sources

Define where your content lives. Each source becomes a virtual filesystem that agents can explore.

```yaml
sources:
  - name: docs                 # Unique name, referenced by tools
    type: markdown             # markdown | code | raw-text | html | document | slack | discord | notion
    repo: https://github.com/org/repo.git  # Git repo URL (optional for local paths)
    branch: main               # Git branch to track (default: default branch)
    path: docs/                # Directory within the repo to index
    version: v2                # Version tag for multi-version docs (optional)
    file_patterns:             # Glob patterns for files to include
      - "**/*.mdx"
      - "**/*.md"
    exclude_patterns:          # Glob patterns to exclude (optional)
      - "**/node_modules/**"
      - "**/_internal/**"
    skip_dirs:                 # Directory names to skip entirely (optional)
      - .git
      - node_modules
    max_file_size: 100000      # Max file size in bytes to index (optional)
    category: faq              # Optional. Marks content as FAQ for /faq.txt and knowledge tools
    base_url: https://docs.example.com  # Base URL for generating doc links (optional)
    url_derivation:            # How to derive URLs from file paths (optional)
      strip_prefix: docs/
      strip_suffix: .mdx
      strip_route_groups: true
      strip_index: true
    chunk:                     # How to split content for embeddings
      target_tokens: 600       # For markdown/raw-text sources
      overlap_tokens: 50       # Token overlap between chunks
      target_lines: 100        # For code sources
      overlap_lines: 10        # Line overlap between chunks
```

Source type: markdown

Splits content on Markdown headings and uses token-based chunks. Best for documentation written in .md or .mdx files. Configure with target_tokens and overlap_tokens.

Source type: code

Splits on function/class boundaries and uses line-based chunks. Best for source code. Configure with target_lines and overlap_lines.

Source type: raw-text

Treats content as unstructured plain text with token-based chunks. Use when content has no Markdown or code structure.

Source type: html

Parses HTML structure, extracts text content, and uses token-based chunks. Added in v1.4. Best for static sites, rendered documentation, or any HTML content.
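A minimal html source might look like the following sketch; the source name, path, and glob are illustrative, and the fields shown are the common file-source fields from the reference above:

```yaml
sources:
  - name: site
    type: html
    path: ./public             # Directory of rendered HTML (illustrative)
    file_patterns:
      - "**/*.html"
    chunk:
      target_tokens: 600
      overlap_tokens: 50
```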

Source type: slack

Indexes Slack threads as Q&A pairs. An LLM distills each qualifying thread into a question-answer pair with a confidence score. All pairs are stored; confidence filtering happens at query time.

```yaml
sources:
  - name: community-support
    type: slack
    category: faq
    channels:                  # Channel IDs to index
      - C06ABC123
      - C06DEF456
    confidence_threshold: 0.7  # Min confidence for query results (0-1)
    trigger_emoji: white_check_mark  # Emoji reaction triggers immediate reindex of thread
    min_thread_replies: 2      # Minimum replies for a thread to be indexed
    distiller_model: gpt-4o-mini  # LLM model for distillation (default: gpt-4o-mini)
```

Required environment variables: SLACK_BOT_TOKEN, SLACK_SIGNING_SECRET (for emoji-triggered webhook verification).

Source type: discord

Indexes Discord channels as Q&A pairs. Supports two channel types with different extraction strategies.

```yaml
sources:
  - name: discord-support
    type: discord
    category: faq
    guild_id: "1234567890"     # Discord server (guild) ID
    channels:
      - id: "1111111111"       # Text channel: LLM distillation
        type: text
      - id: "2222222222"       # Forum channel: direct Q&A extraction
        type: forum
    confidence_threshold: 0.7  # Min confidence for query results (0-1)
    min_thread_replies: 2      # Minimum replies for text threads
```

Discord does not support emoji-triggered reindexing (requires Gateway WebSocket, which Pathfinder does not maintain).

Required environment variables: DISCORD_BOT_TOKEN, DISCORD_PUBLIC_KEY (for webhook Ed25519 verification). The bot needs the MESSAGE_CONTENT privileged intent and Read Message History + View Channels permissions.

Source type: notion

Indexes Notion pages and database entries as searchable markdown documents. Blocks are recursively converted to markdown, and database entry properties are serialized as YAML frontmatter.

```yaml
sources:
  - name: notion-wiki
    type: notion
    root_pages:                # Notion page IDs to index (including children)
      - "abc123def456..."
    databases:                 # Database IDs; all entries indexed
      - "789xyz..."
    max_depth: 5               # How deep to recurse into child pages (1-20, default: 5)
    include_properties: true   # Serialize database properties as YAML frontmatter (default: true)
    chunk:
      target_tokens: 600
      overlap_tokens: 50
```

If both root_pages and databases are empty, all pages accessible to the integration token are indexed. Requires NOTION_TOKEN environment variable.

By default, Notion sources are indexed as documents and referenced by search tools. To also make them available via knowledge tools and the /faq.txt endpoint, add category: faq to the source config — useful for Notion databases structured as Q&A or FAQ collections.
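For example, a hypothetical Notion database holding Q&A entries could be exposed this way (source name and database ID are illustrative):

```yaml
sources:
  - name: notion-faq
    type: notion
    category: faq              # Also expose via knowledge tools and /faq.txt
    databases:
      - "789xyz..."            # A database structured as Q&A
```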

Note: The chunk config is available on all source types. For Slack and Discord, the Q&A chunker produces one chunk per Q&A pair (chunk settings have no effect). For Notion, the markdown chunker respects target_tokens and overlap_tokens.

Source type: document

Indexes PDF and DOCX files. Requires optional peer dependencies: npm install pdf-parse for PDF support and npm install mammoth for DOCX support. Uses token-based chunks like markdown sources.

```yaml
sources:
  - name: manuals
    type: document
    path: ./documents
    file_patterns:
      - "**/*.pdf"
      - "**/*.docx"
    max_file_size: 10485760    # 10 MB default limit
    chunk:
      target_tokens: 600
      overlap_tokens: 50
```

url_derivation example

Given a file at docs/(guides)/getting-started/index.mdx with the following config:

```yaml
base_url: https://docs.example.com/
url_derivation:
  strip_prefix: docs/
  strip_suffix: .mdx
  strip_route_groups: true
  strip_index: true
```

The derivation steps produce:

```text
# Original file path
docs/(guides)/getting-started/index.mdx

# After strip_prefix: "docs/"
(guides)/getting-started/index.mdx

# After strip_suffix: ".mdx"
(guides)/getting-started/index

# After strip_route_groups: true
getting-started/index

# After strip_index: true
getting-started

# Final URL = base_url + slug
https://docs.example.com/getting-started
```

tools

Tools are what agents actually call. Four types: search, bash, collect, and knowledge.

search

Semantic search over embedded content. Requires a database and embedding config.

```yaml
tools:
  - name: search-docs          # Tool name exposed to agents
    type: search
    description: Search the docs  # Description shown to agents
    source: docs               # Which source to search
    default_limit: 5           # Default number of results
    max_limit: 20              # Maximum results an agent can request
    result_format: docs        # docs | code | raw
    min_score: 0.3             # Minimum cosine similarity threshold (0-1, optional)
    search_mode: vector        # vector (default) | keyword | hybrid
```

Search Modes

The search_mode field controls how search queries are executed:

- vector (default): semantic similarity search over embeddings
- keyword: lexical full-text matching
- hybrid: combines vector and keyword results
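For example, a search tool configured for combined retrieval (the other fields repeat the reference values above):

```yaml
tools:
  - name: search-docs
    type: search
    description: Search the docs
    source: docs
    search_mode: hybrid        # Combine vector and keyword retrieval
```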

bash

Filesystem exploration with find, grep, cat, ls, head. Works with no database.

```yaml
tools:
  - name: explore-docs         # Tool name exposed to agents
    type: bash
    description: Explore the docs  # Description shown to agents
    sources: [docs]            # Sources to mount (can be multiple)
    bash:
      session_state: true      # Persist CWD across commands
      grep_strategy: hybrid    # memory | vector | hybrid
      virtual_files: true      # Expose qmd and related as virtual commands
      workspace: true          # Enable writable /workspace directory
      max_file_size: 100000    # Max file size for cat output (bytes)
      cache:
        max_entries: 100       # Max cached command results
        ttl_seconds: 300       # Cache TTL in seconds
```

collect

Data collection tools. Agents submit structured data matching a JSON schema you define. Submissions are stored in the local PostgreSQL database in the collected_data table, with a tool_name column and a data JSONB column containing the fields from your schema. What you do with the collected data is up to you: query it directly, build dashboards, pipe it to analytics, or export it.

```yaml
tools:
  - name: submit-feedback
    type: collect
    description: Report search quality or broken links
    response: Thanks for the feedback!
    schema:
      rating:
        type: enum
        description: How helpful was the result?
        required: true
        values: [helpful, somewhat, not_helpful]
      comment:
        type: string
        description: Optional details
```
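Because submissions land in a plain Postgres table, any client can read them back. A minimal sketch of the read side for the submit-feedback schema above; the collected_data table and its tool_name/data columns are described above, while the SQL itself is illustrative:

```typescript
// Illustrative SQL for reading submit-feedback rows out of
// Pathfinder's collected_data table (tool_name text, data JSONB).
const feedbackSql = `
  SELECT data->>'rating'  AS rating,
         data->>'comment' AS comment
  FROM collected_data
  WHERE tool_name = 'submit-feedback'
`;

// Run with any Postgres client, e.g. node-postgres:
//   const { rows } = await client.query(feedbackSql);
```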

knowledge

FAQ and knowledge base tool. Exposes Q&A pairs from FAQ-category sources (Slack, Discord) to agents. Supports two modes: browse (no query, returns recent pairs) and search (semantic query over pairs).

```yaml
tools:
  - name: knowledge-base
    type: knowledge
    description: Search community Q&A knowledge base
    sources: [community-support, discord-support]  # Sources with category: faq
    min_confidence: 0.7        # Override source-level confidence threshold
    default_limit: 10          # Default number of results
    max_limit: 50              # Maximum results an agent can request
```

When called without a query, the tool returns recent Q&A pairs (browse mode). When called with a query, it performs semantic search over the pairs.

embedding

Required when using search tools. Configures how content is embedded for vector search.

```yaml
embedding:
  provider: openai             # openai | ollama | local
  model: text-embedding-3-small  # Model name (provider-specific)
  dimensions: 1536             # Vector dimensions (must match model)
```

Provider options

Three embedding providers are supported. OPENAI_API_KEY is only required when using the openai provider.

OpenAI (default) — uses the OpenAI embeddings API. Requires OPENAI_API_KEY.

```yaml
embedding:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536
```

Ollama — calls a local Ollama instance over HTTP. No API key needed. Requires Ollama running with the model pulled (ollama pull nomic-embed-text).

```yaml
embedding:
  provider: ollama
  model: nomic-embed-text
  dimensions: 768
  base_url: http://localhost:11434  # Optional, this is the default
```

Local — runs @xenova/transformers in-process. Zero external dependencies, CPU-only. Install the optional peer dependency: npm install @xenova/transformers.

```yaml
embedding:
  provider: local
  model: Xenova/all-MiniLM-L6-v2
  dimensions: 384
```

Dimension mismatch detection

If you change embedding providers or models on an existing database, Pathfinder warns at startup when the configured dimensions don't match the existing vector index. You'll need to reindex to use the new dimensions.
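For example, switching from the OpenAI provider to the local one changes the vector size, so the existing index no longer matches (provider/model/dimension values are taken from the provider examples above):

```yaml
# Before: 1536-dimension vectors
embedding:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536

# After: 384-dimension vectors; triggers the startup warning until you reindex
embedding:
  provider: local
  model: Xenova/all-MiniLM-L6-v2
  dimensions: 384
```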

indexing

Required when using search tools. Controls automatic reindexing behavior.

```yaml
indexing:
  auto_reindex: true           # Enable daily automatic reindexing
  reindex_hour_utc: 3          # Hour (0-23 UTC) to run daily reindex
  stale_threshold_hours: 24    # Re-embed chunks older than this
```

webhook

Optional. Triggers targeted reindexing when you push to GitHub.

```yaml
webhook:
  repo_sources:                # Map GitHub repo to source names
    "your-org/your-repo": [docs]
  path_triggers:               # Only reindex when these paths change
    docs: ["docs/"]
```

Point a GitHub webhook at https://your-server/webhooks/github with the push event. Set GITHUB_WEBHOOK_SECRET to match the webhook secret.

analytics

Optional. Enables query logging and the built-in analytics dashboard at /analytics.

```yaml
analytics:
  enabled: false               # Enable analytics (default: false)
  log_queries: true            # Log all search queries (default: true)
  token: ${ANALYTICS_TOKEN}    # Bearer token for /api/analytics endpoints
  retention_days: 90           # Days to retain data (default: 90)
```

The analytics dashboard provides top queries, empty result tracking, and latency metrics. API endpoints: /api/analytics/summary, /api/analytics/queries, /api/analytics/empty-queries.
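The API endpoints take the configured token as a bearer credential. A hypothetical helper for building those requests; the /api/analytics/* paths and bearer scheme come from the docs above, while the helper, server URL, and token are illustrative:

```typescript
// Build the URL and auth header for a Pathfinder analytics endpoint.
function analyticsRequest(
  baseUrl: string,
  token: string,
  endpoint: "summary" | "queries" | "empty-queries",
) {
  return {
    url: `${baseUrl}/api/analytics/${endpoint}`,
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Usage: fetch(req.url, { headers: req.headers })
const req = analyticsRequest("https://your-server", "s3cret", "summary");
```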

Example Configs

Bash-only (no database)

The simplest setup. No API keys, no database. Agents explore docs with shell commands.

```yaml
server:
  name: my-docs
  version: 1.0.0

sources:
  - name: docs
    type: markdown
    path: ./docs
    file_patterns: ["**/*.md", "**/*.mdx"]

tools:
  - name: explore-docs
    type: bash
    description: Explore project documentation
    sources: [docs]
    bash:
      session_state: true
```

Full stack with semantic search

Bash exploration plus RAG search. Requires Postgres and an OpenAI API key.

```yaml
server:
  name: my-docs
  version: 1.0.0

sources:
  - name: docs
    type: markdown
    repo: https://github.com/org/repo.git
    path: docs/
    file_patterns: ["**/*.mdx", "**/*.md"]
    chunk:
      target_tokens: 600
      overlap_tokens: 50

tools:
  - name: search-docs
    type: search
    description: Semantic search over documentation
    source: docs
    default_limit: 5
    max_limit: 20
    result_format: docs
  - name: explore-docs
    type: bash
    description: Explore documentation with shell commands
    sources: [docs]
    bash:
      session_state: true
      grep_strategy: hybrid
      workspace: true

embedding:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536

indexing:
  auto_reindex: true
  reindex_hour_utc: 3
  stale_threshold_hours: 24
```

Multi-repo

Index docs from multiple repositories into a single Pathfinder instance.

```yaml
server:
  name: platform-docs
  version: 1.0.0

sources:
  - name: api-docs
    type: markdown
    repo: https://github.com/org/api.git
    path: docs/
    file_patterns: ["**/*.md"]
    chunk: { target_tokens: 600, overlap_tokens: 50 }
  - name: sdk-docs
    type: markdown
    repo: https://github.com/org/sdk.git
    path: docs/
    file_patterns: ["**/*.mdx"]
    chunk: { target_tokens: 600, overlap_tokens: 50 }
  - name: sdk-source
    type: code
    repo: https://github.com/org/sdk.git
    path: src/
    file_patterns: ["**/*.ts"]
    chunk: { target_lines: 100, overlap_lines: 10 }

tools:
  - name: search-api
    type: search
    description: Search API documentation
    source: api-docs
    default_limit: 5
    max_limit: 20
    result_format: docs
  - name: search-sdk
    type: search
    description: Search SDK documentation and source
    source: sdk-docs
    default_limit: 5
    max_limit: 20
    result_format: docs
  - name: explore-all
    type: bash
    description: Explore all documentation and source
    sources: [api-docs, sdk-docs, sdk-source]
    bash:
      session_state: true
      grep_strategy: hybrid

embedding:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536

indexing:
  auto_reindex: true
  reindex_hour_utc: 3
  stale_threshold_hours: 24
```

Multi-version

Serve multiple versions of the same docs. Agents can filter search results by version.

```yaml
server:
  name: versioned-docs
  version: 1.0.0

sources:
  - name: docs-v1
    type: markdown
    repo: https://github.com/org/repo.git
    branch: v1
    path: docs/
    version: v1
    file_patterns: ["**/*.mdx"]
    chunk: { target_tokens: 600, overlap_tokens: 50 }
  - name: docs-v2
    type: markdown
    repo: https://github.com/org/repo.git
    branch: v2
    path: docs/
    version: v2
    file_patterns: ["**/*.mdx"]
    chunk: { target_tokens: 600, overlap_tokens: 50 }

tools:
  - name: search-docs
    type: search
    description: Search docs (specify version parameter to filter)
    source: docs-v2
    default_limit: 5
    max_limit: 20
    result_format: docs
  - name: explore-docs
    type: bash
    description: Explore documentation filesystem
    sources: [docs-v1, docs-v2]
    bash:
      session_state: true

embedding:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536

indexing:
  auto_reindex: true
  reindex_hour_utc: 3
  stale_threshold_hours: 24
```