swarmvault ingest

Ingest a local file, directory, or URL into canonical source storage.

Usage

swarmvault ingest <input> [--review] [--guide] [--no-guide] [--answers-file <path>] [--repo-root <path>] [--include <glob...>] [--exclude <glob...>] [--max-files <n>] [--include-third-party] [--include-resources] [--include-generated] [--no-gitignore] [--no-swarmvaultignore] [--video] [--no-include-assets] [--max-asset-size <bytes>] [--commit]

Arguments

  • <input> - A local file path, local directory path, or URL

Options

  • --no-include-assets - Skip downloading remote image assets when ingesting a URL
  • --max-asset-size <bytes> - Cap the bytes fetched for a single remote image asset during URL ingest
  • --review - Compile immediately after ingest and stage a source-scoped review artifact plus approval bundle
  • --guide - Compile immediately after ingest and create the stronger guided-session flow: source brief, resumable session, source review, source guide, and an approval bundle whose target pages follow the active profile
  • --no-guide - Skip guided mode even when profile.guidedIngestDefault enables it in config
  • --answers-file <path> - Pre-answer the guided-session questions for --guide without waiting for an interactive prompt
  • --repo-root <path> - Override the detected repo root when ingesting a directory
  • --include <glob...> - Only ingest files matching one or more glob patterns during directory ingest
  • --exclude <glob...> - Skip files matching one or more glob patterns during directory ingest
  • --max-files <n> - Cap the number of files imported from a directory
  • --include-third-party - Include files classified as third-party dependency material
  • --include-resources - Include files classified as bundled resources such as app assets
  • --include-generated - Include files classified as generated output
  • --no-gitignore - Ignore .gitignore rules during directory ingest
  • --no-swarmvaultignore - Ignore .swarmvaultignore rules during directory ingest
  • --video - Treat a URL input as a public video, extract audio with yt-dlp, and transcribe through tasks.audioProvider
  • --commit - Stage wiki/ and state/ changes and commit them when the vault root is inside a git worktree
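When ingest is driven from a script, the options above can be assembled into an argument list before the CLI is invoked. The following is a minimal sketch, not part of SwarmVault: `build_ingest_args` is a hypothetical helper, it assumes `swarmvault` is on PATH, and it assumes that `<glob...>` accepts several values after a single flag.

```python
import shlex
from typing import List, Optional, Sequence


def build_ingest_args(
    source: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    max_files: Optional[int] = None,
    commit: bool = False,
) -> List[str]:
    """Assemble an argv list for `swarmvault ingest` (hypothetical helper)."""
    # The input goes first so variadic glob options cannot swallow it.
    args = ["swarmvault", "ingest", source]
    if include:
        # Assumption: <glob...> takes multiple values after one flag.
        args += ["--include", *include]
    if exclude:
        args += ["--exclude", *exclude]
    if max_files is not None:
        args += ["--max-files", str(max_files)]
    if commit:
        args.append("--commit")
    return args


if __name__ == "__main__":
    argv = build_ingest_args(
        "./apps/api",
        include=["**/*.ts"],
        exclude=["**/*.test.ts"],
        max_files=500,
    )
    # shlex.join quotes the globs so the shell does not expand them.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run` directly avoids shell glob expansion altogether; the quoted string form is only useful for logging or copy-paste.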

Examples

swarmvault ingest ./research-paper.pdf
swarmvault ingest ./research-paper.pdf --commit
swarmvault ingest ./customer-call.srt --guide
swarmvault source session transcript-or-session-id
swarmvault ingest ./mailbox.mbox --guide
swarmvault ingest ./calendar.ics
swarmvault ingest ./slack-export.zip --guide
swarmvault ingest ./brief.docx
swarmvault ingest ./book.epub
swarmvault ingest ./customer-call.mp3
swarmvault ingest ./product-demo.mp4
swarmvault ingest ./dataset.csv
swarmvault ingest ./workbook.xlsx
swarmvault ingest ./deck.pptx
swarmvault ingest ./notes/meeting.md
swarmvault ingest ./apps/api
swarmvault ingest ./apps/api --include '**/*.ts' --exclude '**/*.test.ts'
swarmvault ingest ./ios-app --include-resources
swarmvault ingest ./monorepo --include-third-party --include-generated
swarmvault ingest https://example.com/article
swarmvault ingest https://www.youtube.com/watch?v=dQw4w9WgXcQ
swarmvault ingest --video https://example.com/product-demo.mp4
swarmvault ingest https://example.com/article --max-asset-size 2097152

What It Does

  1. Detects whether the input is a file, directory, or URL
  2. Recursively walks directories, respecting .gitignore, .swarmvaultignore, and SwarmVault's built-in ignore set by default
  3. Classifies repo files as first_party, third_party, resource, or generated
  4. Detects MIME type and source kind for each imported file
  5. Extracts text content when possible, including chapter-split EPUB, tabular previews for CSV/TSV and XLSX, slide text plus notes for PPTX, timestamped transcripts, Slack export conversations, mailbox messages, calendar events, provider-backed audio/video transcripts, and direct YouTube transcripts
  6. Writes immutable source copies under raw/sources/
  7. For remote HTML and markdown URLs, downloads referenced remote images into raw/assets/<sourceId>/ by default
  8. Rewrites stored markdown image references to local relative asset paths when those assets are downloaded
  9. Stores extracted text under state/extracts/<sourceId>.md when available
  10. Stores extractor metadata under state/extracts/<sourceId>.json
  11. Writes manifests to state/manifests/
  12. When --review is set, runs one compile and stages a source review page under wiki/outputs/source-reviews/
  13. When guided mode is enabled, either by --guide or by profile.guidedIngestDefault, writes a source brief, creates a resumable session under wiki/outputs/source-sessions/, stages a source review plus source guide, and creates a clearly labeled approval bundle; the bundle targets canonical pages only when the active profile uses guidedSessionMode: "canonical_review"
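Step 1 in the list above, deciding whether the input is a file, directory, or URL, can be pictured roughly as follows. This is an illustrative sketch only and not SwarmVault's implementation: `classify_input` is a hypothetical name, and treating only http and https as URL schemes is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(value: str) -> str:
    """Return "url", "directory", or "file" for an ingest input (illustrative)."""
    parsed = urlparse(value)
    # Assumption: only http(s) inputs with a host count as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    path = Path(value)
    if path.is_dir():
        return "directory"
    if path.is_file():
        return "file"
    raise FileNotFoundError(f"input is neither a URL nor an existing path: {value}")


if __name__ == "__main__":
    print(classify_input("https://example.com/article"))
    print(classify_input("."))
```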

Directory ingest also records repoRelativePath and sourceClass on imported manifests. Compile later uses those paths to build state/code-index.json, resolve local imports across the repo tree, and keep graph reports focused on first-party material by default.
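The relationship between sourceClass and the --include-third-party, --include-resources, and --include-generated flags can be sketched as a simple filter. This assumes that only first_party files are imported when no flag is given, which the flag descriptions imply but do not state outright. The helper names and the sample entries are made up for illustration; only the field names repoRelativePath and sourceClass come from this page.

```python
from typing import Dict, Iterable, List, Set


def allowed_source_classes(
    include_third_party: bool = False,
    include_resources: bool = False,
    include_generated: bool = False,
) -> Set[str]:
    """Map the --include-* flags to the source classes a directory ingest keeps."""
    allowed = {"first_party"}  # assumption: first-party files are always imported
    if include_third_party:
        allowed.add("third_party")
    if include_resources:
        allowed.add("resource")
    if include_generated:
        allowed.add("generated")
    return allowed


def keep_entries(entries: Iterable[Dict[str, str]], allowed: Set[str]) -> List[str]:
    """Return the repoRelativePath of every entry whose sourceClass is allowed."""
    return [e["repoRelativePath"] for e in entries if e["sourceClass"] in allowed]


if __name__ == "__main__":
    # Made-up sample entries; real classification is done by SwarmVault.
    sample = [
        {"repoRelativePath": "src/server.ts", "sourceClass": "first_party"},
        {"repoRelativePath": "vendor/lib/index.js", "sourceClass": "third_party"},
        {"repoRelativePath": "dist/server.js", "sourceClass": "generated"},
    ]
    print(keep_entries(sample, allowed_source_classes()))
    print(keep_entries(sample, allowed_source_classes(include_generated=True)))
```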

Interactive file and directory ingest emits bounded progress on stderr when running in a TTY, including the active file and processed content size. JSON, MCP, watch, and CI-style flows stay quiet. Parser or grammar compatibility failures stay local to the affected source and surface as diagnostics instead of aborting unrelated code analysis.

When --commit is set, SwarmVault stages wiki/ and state/ changes only. Canonical raw/ source copies stay untouched in git unless you stage them separately. Outside a git worktree, --commit becomes a no-op.
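To predict whether --commit will do anything for a given vault, a script can ask git whether the vault root sits inside a worktree. The sketch below uses the standard `git rev-parse --is-inside-work-tree` command; SwarmVault's own detection may differ, and `inside_git_worktree` is a hypothetical helper name.

```python
import subprocess


def inside_git_worktree(path: str) -> bool:
    """Return True when `path` is inside a git worktree, False otherwise."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--is-inside-work-tree"],
            cwd=path,
            capture_output=True,
            text=True,
            check=False,
        )
    except (FileNotFoundError, NotADirectoryError):
        # git is not installed, or the directory does not exist.
        return False
    # git prints "true" and exits 0 only inside a worktree.
    return result.returncode == 0 and result.stdout.strip() == "true"


if __name__ == "__main__":
    if inside_git_worktree("."):
        print("--commit should stage wiki/ and state/ changes")
    else:
        print("--commit is expected to be a no-op here")
```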

Audio and video files use tasks.audioProvider when you configure a provider with audio capability. Local video extraction shells out to ffmpeg; public video URL extraction with --video shells out to yt-dlp. Without the provider or extractor binary, SwarmVault still ingests the source and records an explicit extraction warning instead of failing. Supported YouTube URLs bypass generic HTML ingest and go straight through transcript capture without requiring a model provider.
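Because a missing extractor binary produces a warning rather than a failure, it can be worth checking for ffmpeg and yt-dlp before a large ingest. The following sketch is an assumption-laden pre-flight check, not SwarmVault behavior: the helper names are hypothetical, and the extension set covers only the video extensions listed on this page, not every video/* input.

```python
import shutil
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Video extensions named under Supported Formats; other video/* inputs exist.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".m4v", ".mkv", ".avi"}


def required_extractor(source: str, video_flag: bool = False) -> Optional[str]:
    """Return the external binary an ingest would shell out to, if any."""
    is_url = urlparse(source).scheme in ("http", "https")
    if is_url:
        # Public video URLs use yt-dlp only when --video is passed.
        return "yt-dlp" if video_flag else None
    suffix = PurePosixPath(source).suffix.lower()
    return "ffmpeg" if suffix in VIDEO_EXTENSIONS else None


def extractor_missing(source: str, video_flag: bool = False) -> bool:
    """True when the needed binary is not on PATH."""
    tool = required_extractor(source, video_flag)
    return tool is not None and shutil.which(tool) is None


if __name__ == "__main__":
    for src, flag in [("./product-demo.mp4", False),
                      ("https://example.com/product-demo.mp4", True)]:
        state = "missing" if extractor_missing(src, flag) else "ok"
        print(f"{src}: {required_extractor(src, flag)} ({state})")
```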

When --review or guided mode is in play, the output artifacts are intentionally split by role:

Artifact | Created by | Purpose
Source brief | source add, ingest (always) | Auto summary written to wiki/outputs/source-briefs/
Source review | source review, source add --guide, ingest --review, ingest --guide | Lighter staged assessment in wiki/outputs/source-reviews/
Source guide | source guide, source add --guide, ingest --guide | Guided walkthrough with approval-bundled updates in wiki/outputs/source-guides/
Source session | source session, source add --guide, ingest --guide | Resumable workflow state in wiki/outputs/source-sessions/ and state/source-sessions/

Supported Formats

  • Markdown (.md, .mdx)
  • Plain text and reStructuredText (.txt, .rst, .rest)
  • PDF (.pdf)
  • Word documents, including macro-enabled and template variants (.docx, .docm, .dotx, .dotm)
  • Rich Text Format (.rtf)
  • OpenDocument text, slides, and spreadsheets (.odt, .odp, .ods)
  • EPUB books (.epub)
  • CSV and TSV datasets (.csv, .tsv)
  • Excel workbooks, including macro-enabled, binary, and legacy formats (.xlsx, .xlsm, .xlsb, .xls, .xltx, .xltm)
  • PowerPoint decks, including macro-enabled and template variants (.pptx, .pptm, .potx, .potm)
  • Jupyter notebooks (.ipynb)
  • BibTeX libraries (.bib)
  • Org-mode documents (.org)
  • AsciiDoc (.adoc, .asciidoc)
  • Transcript files (.srt, .vtt)
  • Slack export archives or extracted Slack export directories
  • Email (.eml, .mbox)
  • Calendar exports (.ics)
  • Audio files (.mp3, .wav, .m4a, .aac, .ogg, .webm, and other audio/* inputs) via tasks.audioProvider
  • Video files (.mp4, .mov, .m4v, .mkv, .avi, and other video/* inputs) via ffmpeg audio extraction plus tasks.audioProvider
  • HTML files and URLs
  • YouTube transcript URLs (youtube.com/watch, youtu.be, youtube.com/embed, youtube.com/shorts)
  • Public video URLs with --video via yt-dlp audio extraction plus tasks.audioProvider
  • Images (.png, .jpg, .jpeg, .gif, .webp, .bmp, .tif, .tiff, .svg, .ico, .heic, .heif, .avif, .jxl)
  • Structured config and data (.json, .jsonc, .json5, .toml, .yaml, .yml, .xml, .ini, .conf, .cfg, .properties, .env) with schema-hint previews
  • Developer manifests and marker files (package.json, tsconfig.json, Cargo.toml, pyproject.toml, go.mod, go.sum, Dockerfile, Makefile, LICENSE, .gitignore, .editorconfig, .npmrc, and similar) via content-sniffed text ingest — plain-text developer files are never silently dropped as binary
  • JavaScript and TypeScript (.js, .mjs, .cjs, .jsx, .ts, .mts, .cts, .tsx)
  • Bash and shell scripts (.sh, .bash, .zsh, plus executable shebang scripts without an extension including #!/usr/bin/env node|python|ruby|bash|zsh variants)
  • Python (.py)
  • Go (.go)
  • Rust (.rs)
  • Java (.java)
  • C# (.cs)
  • C and C++ (.c, .cc, .cpp, .cxx, .h, .hh, .hpp, .hxx)
  • PHP (.php)
  • Ruby (.rb)
  • PowerShell (.ps1, .psm1, .psd1)
  • Kotlin (.kt, .kts)
  • Scala (.scala, .sc)
  • Dart and Flutter repos (.dart)
  • Lua (.lua)
  • Zig (.zig)
  • Elixir (.ex, .exs)
  • OCaml (.ml, .mli)
  • Objective-C (.m, .mm — note: .h headers resolve to C/C++ because ObjC and C++ headers are textually identical)
  • ReScript (.res, .resi)
  • Solidity (.sol)
  • Vue single-file components (.vue)
  • Svelte single-file components (.svelte)
  • Julia (.jl) via packaged WASM grammar extraction
  • Verilog/SystemVerilog (.v, .vh, .sv, .svh) via packaged WASM grammar extraction
  • R (.r, .R), ingested with an explicit parser-asset diagnostic until a safe packaged grammar is available
  • CSS (.css)
  • SQL (.sql) — parser-backed table/view extraction plus read/write/join/reference graph edges
  • HTML (.html, .htm) — local HTML files are parsed for custom elements, id-bearing anchors, and <link>/<script> imports
  • Other files as binary blobs
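The YouTube URL shapes in the list above (youtube.com/watch, youtu.be, youtube.com/embed, youtube.com/shorts) can be recognized with a small predicate like the one below. This is a rough sketch for scripts that want to predict which path a URL will take; it is not SwarmVault's matcher, `is_youtube_transcript_url` is a hypothetical name, and hosts other than youtube.com, www.youtube.com, and youtu.be are deliberately not handled.

```python
from urllib.parse import urlparse


def is_youtube_transcript_url(url: str) -> bool:
    """Return True for the YouTube URL shapes listed under Supported Formats."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short links carry the video ID as the path, e.g. /<id>.
        return len(parsed.path) > 1
    if host == "youtube.com":
        return parsed.path == "/watch" or parsed.path.startswith(("/embed/", "/shorts/"))
    return False


if __name__ == "__main__":
    for candidate in ("https://www.youtube.com/watch?v=dQw4w9WgXcQ",
                      "https://example.com/article"):
        print(candidate, is_youtube_transcript_url(candidate))
```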

Output

Prints the generated source ID to stdout for single-manifest file and URL ingest.
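A script can capture that source ID by reading stdout, since interactive progress goes to stderr. The sketch below assumes stdout holds exactly one non-empty line containing the ID, which this page does not guarantee; `parse_source_id` and `ingest_and_get_source_id` are hypothetical helpers, and `swarmvault` is assumed to be on PATH.

```python
import subprocess


def parse_source_id(stdout: str) -> str:
    """Extract a single source ID from captured stdout (assumed one-line output)."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if len(lines) != 1:
        # Grouped or directory ingest prints a summary instead of one ID.
        raise ValueError(f"expected one line with a source ID, got {len(lines)}")
    return lines[0]


def ingest_and_get_source_id(path: str) -> str:
    """Run a single-file ingest and return the printed source ID."""
    result = subprocess.run(
        ["swarmvault", "ingest", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_source_id(result.stdout)
```

For example, `ingest_and_get_source_id("./research-paper.pdf")` would return whatever ID the CLI printed for that file.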

For grouped or container inputs (EPUB, mailbox files, calendar files, Slack exports) and for directory ingest, SwarmVault prints a JSON or text summary of created, updated, unchanged, and removed manifests.

When remote assets are localized, the manifest also records them as attachments, and Obsidian plus the local graph workspace can render those local files directly.

PDF, DOCX, EPUB, CSV/TSV, XLSX, PPTX, transcript, Slack export, email, calendar, audio, video, and YouTube ingest all write extraction sidecars before compile. Image ingest uses the configured visionProvider for structured OCR/diagram extraction when a real multimodal provider is available. If image, audio, or video extraction cannot run, SwarmVault still ingests the source and records an explicit warning in the extraction sidecar instead of treating the source as an unexplained empty blob.
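To inspect the extraction sidecars for a source, a script only needs the vault root and the source ID, using the state/extracts/<sourceId>.md and state/extracts/<sourceId>.json locations described under What It Does. The sketch below only locates the files; it does not parse the JSON, because the sidecar schema is not documented on this page. Helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def extract_sidecar_paths(vault_root: str, source_id: str) -> Dict[str, Path]:
    """Return the expected sidecar locations for one source."""
    base = Path(vault_root) / "state" / "extracts"
    return {
        "text": base / f"{source_id}.md",        # extracted text, when available
        "metadata": base / f"{source_id}.json",  # extractor metadata
    }


def describe_extraction(vault_root: str, source_id: str) -> str:
    """Summarize which sidecars exist on disk for a source."""
    paths = extract_sidecar_paths(vault_root, source_id)
    has_text = paths["text"].is_file()
    has_metadata = paths["metadata"].is_file()
    if has_text and has_metadata:
        return "extracted text and metadata present"
    if has_metadata:
        return "metadata only; check it for an extraction warning"
    return "no extraction sidecars found"


if __name__ == "__main__":
    # "example-source-id" is a placeholder, not a real SwarmVault ID.
    print(describe_extraction(".", "example-source-id"))
```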