Sources & Manifests

Every piece of input to SwarmVault is a source. Each source gets an immutable manifest that tracks its metadata.

Source Manifest

When you run `swarmvault ingest`, a manifest is created with the following fields:

  • sourceId — Stable slug plus content-hash prefix for the canonical source record
  • originalPath — Where a file-based source came from
  • url — The original URL for URL-based sources
  • mimeType — Detected content type
  • sourceKind — One of markdown, text, pdf, image, html, docx, epub, csv, xlsx, pptx, odt, odp, ods, jupyter, data, bibtex, rtf, org, asciidoc, transcript, chat_export, email, calendar, binary, or code
  • language — Code language for ingested code sources when applicable
  • contentHash — SHA-256 of the raw content
  • sourceGroupId / sourceGroupTitle — Shared grouping for one-to-many ingests such as EPUB chapters
  • partIndex / partCount / partTitle — Per-part metadata for grouped sources
  • storedPath — Canonical immutable source copy under raw/sources/
  • extractedTextPath — Canonical extracted markdown/text when available
  • attachments — Localized asset files under raw/assets/<sourceId>/ when ingest copied sidecar or remote image references
  • createdAt / updatedAt — Timestamps for the source manifest record
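
As a rough illustration of how `contentHash` and `sourceId` relate, the sketch below hashes raw bytes with SHA-256 and builds an identifier from a slug plus a hash prefix. Only the field names come from the list above. The dataclass shape, the slug rules, the 8-character prefix length, the example `storedPath`, and the helper names `content_hash` and `make_source_id` are assumptions made for illustration, not SwarmVault's actual implementation.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceManifest:
    # Field names follow the manifest list; types and optionality are assumed.
    sourceId: str
    sourceKind: str
    mimeType: str
    contentHash: str
    storedPath: str
    originalPath: Optional[str] = None
    url: Optional[str] = None


def content_hash(raw: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the raw content."""
    return hashlib.sha256(raw).hexdigest()


def make_source_id(title: str, raw: bytes, prefix_len: int = 8) -> str:
    """Hypothetical sourceId: lowercase slug plus a content-hash prefix."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}-{content_hash(raw)[:prefix_len]}"


raw = b"# Hello\n"
source_id = make_source_id("Hello World.md", raw)
manifest = SourceManifest(
    sourceId=source_id,
    sourceKind="markdown",
    mimeType="text/markdown",
    contentHash=content_hash(raw),
    storedPath=f"raw/sources/{source_id}.md",  # file naming is assumed
    originalPath="notes/Hello World.md",
)
```

Because the identifier in this sketch includes a prefix of the content hash, two files with the same name but different bytes get different identifiers.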

Text Extraction

SwarmVault extracts text based on content type:

  • HTML — Processed with Mozilla Readability, converted to Markdown via Turndown
  • Remote HTML/Markdown URLs — Remote images can be downloaded into raw/assets/ and rewritten to local relative links in stored markdown
  • PDF — Local text extraction with pdfjs-dist
  • Word documents (`.docx`, `.docm`, `.dotx`, `.dotm`) — Local text and metadata extraction across modern, macro-enabled, and template variants
  • Rich Text (`.rtf`) — Parser-backed RTF walk into plain-text paragraphs
  • OpenDocument (ODT / ODP / ODS) — Local archive parsing with text, slide, and sheet extraction
  • EPUB — Local archive parsing with chapter-split HTML-to-markdown extraction
  • CSV / TSV — Local tabular summaries with bounded previews and column hints
  • Excel workbooks (`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xltx`, `.xltm`) — Local workbook parsing with bounded sheet previews (modern, macro-enabled, binary, and legacy BIFF8 formats)
  • PowerPoint decks (`.pptx`, `.pptm`, `.potx`, `.potm`) — Local slide and speaker-note extraction across macro-enabled and template variants
  • Jupyter notebooks (`.ipynb`) — Local cell and output extraction
  • BibTeX (`.bib`) — Parser-backed citation entry extraction
  • Org-mode (`.org`) — AST-backed headline, list, and block extraction
  • AsciiDoc (`.adoc`, `.asciidoc`) — Asciidoctor-backed section and metadata extraction
  • Transcript files (`.srt`, `.vtt`) — Local timestamped transcript extraction
  • Slack export archives or extracted directories — Local channel/day conversation extraction
  • Email (`.eml`, `.mbox`) — Local message extraction and mailbox expansion
  • Calendar (`.ics`) — Local VEVENT expansion
  • Markdown — Used as-is
  • Plain text and `.rst` — Used as text, with lightweight .rst heading and directive normalization
  • Config / data (`.json`, `.jsonc`, `.json5`, `.toml`, `.yaml`, `.yml`, `.xml`, `.ini`, `.conf`, `.cfg`, `.properties`, `.env`) — Stored with structured previews and key/value schema hints
  • Developer manifests (`package.json`, `tsconfig.json`, `Cargo.toml`, `pyproject.toml`, `go.mod`, `go.sum`, `Dockerfile`, `Makefile`, `LICENSE`, `.gitignore`, `.editorconfig`, `.npmrc`, and similar) — Content-sniffed via istextorbinary so plaintext developer files are never silently dropped as binary
  • Code — Stored as text, then parsed during compile into module and symbol facts. Covers .js, .mjs, .cjs, .jsx, .ts, .mts, .cts, .tsx, .sh, .bash, .zsh, .py, .go, .rs, .java, .kt, .kts, .scala, .sc, .dart, .lua, .zig, .cs, .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .hxx, .php, .rb, .ps1, .psm1, .psd1, .ex, .exs, .ml, .mli, .m, .mm, .res, .resi, .sol, .vue, .css, .html, .htm, plus extensionless executable scripts with #!/usr/bin/env node|python|ruby|bash|zsh shebangs
  • Images (`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff`, `.svg`, `.ico`, `.heic`, `.heif`, `.avif`, `.jxl`) — Analyzed via vision provider (if configured)
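
The routing described above can be pictured as a lookup by extension, followed by a shebang check for extensionless scripts and a text-versus-binary sniff for everything else. The sketch below is a simplified stand-in under several assumptions: the table covers only a handful of the listed extensions, `looks_textual` is a crude substitute for a real sniffer such as istextorbinary, and the order of the checks, the `"text"` fallback kind, and the function names are hypothetical.

```python
import codecs
import re
from pathlib import Path

# Partial table drawn from the extraction list; not the full mapping.
KIND_BY_EXT = {
    ".md": "markdown",
    ".pdf": "pdf",
    ".docx": "docx",
    ".epub": "epub",
    ".csv": "csv",
    ".ipynb": "jupyter",
    ".bib": "bibtex",
    ".rtf": "rtf",
    ".org": "org",
    ".adoc": "asciidoc",
    ".srt": "transcript",
    ".vtt": "transcript",
    ".eml": "email",
    ".ics": "calendar",
    ".png": "image",
    ".py": "code",
    ".ts": "code",
}

# Shebangs named in the list above; the exact matching rule is an assumption.
SHEBANG = re.compile(rb"#!/usr/bin/env (node|python|ruby|bash|zsh)\b")


def looks_textual(head: bytes) -> bool:
    """Crude sniff: no NUL bytes and the leading bytes decode as UTF-8."""
    if b"\x00" in head:
        return False
    try:
        # final=False tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True


def classify(name: str, head: bytes = b"") -> str:
    """Pick a source kind from the file name and its first bytes."""
    ext = Path(name).suffix.lower()
    if ext in KIND_BY_EXT:
        return KIND_BY_EXT[ext]
    if not ext and SHEBANG.match(head):
        return "code"
    if head and looks_textual(head):
        return "text"  # plaintext developer files are kept, not dropped
    return "binary"
```

For example, `classify("Makefile", b"all:\n\techo hi\n")` falls through the extension table and the shebang check, then lands on the text fallback instead of being treated as binary.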

Immutability

Raw sources in raw/ are never modified after ingestion. This ensures reproducible compilation and reliable provenance tracking.
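
One way to check this guarantee from the outside is to recompute the SHA-256 of a stored copy and compare it with the `contentHash` recorded in its manifest. The sketch below assumes the hash is stored as a lowercase hex string and that you have already read `storedPath` and `contentHash` from the manifest. The helper names are hypothetical and not part of SwarmVault.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large sources are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(stored_path: Path, expected_hash: str) -> bool:
    """True when the stored copy still matches the recorded content hash."""
    return sha256_file(stored_path) == expected_hash.lower()
```

A mismatch would mean the file under raw/ no longer matches what was ingested, so anything compiled from it could not be reproduced from the recorded manifest.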