Sources & Manifests

Every piece of input to SwarmVault is a source. Each source gets an immutable manifest that tracks its metadata.

Source Manifest

Running `swarmvault ingest` creates a manifest with the following fields:

sourceId — Stable slug plus content-hash prefix for the canonical source record
originalPath — Where a file-based source came from
url — The original URL for URL-based sources
mimeType — Detected content type
sourceKind — markdown, text, pdf, image, html, docx, epub, csv, xlsx, pptx, odt, odp, ods, jupyter, data, bibtex, rtf, org, asciidoc, transcript, chat_export, email, calendar, binary, or code
language — Code language for ingested code sources when applicable
contentHash — SHA-256 of the raw content
sourceGroupId / sourceGroupTitle — Shared grouping for one-to-many ingests such as EPUB chapters
partIndex / partCount / partTitle — Per-part metadata for grouped sources
storedPath — Canonical immutable source copy under raw/sources/
extractedTextPath — Canonical extracted markdown/text when available
attachments — Localized asset files under raw/assets/<sourceId>/ when ingest copied sidecar or remote image references
createdAt / updatedAt — Timestamps for the source manifest record
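
The field list above can be read as a record type. The following is a minimal sketch, not SwarmVault's published schema: the field names and the `sourceKind` values come from the list, while the TypeScript types, the choice of which fields are optional, the string-array shape of `attachments`, and every value in `example` are assumptions made for illustration.

```typescript
// Kinds copied from the sourceKind entry above.
type SourceKind =
  | "markdown" | "text" | "pdf" | "image" | "html" | "docx" | "epub" | "csv"
  | "xlsx" | "pptx" | "odt" | "odp" | "ods" | "jupyter" | "data" | "bibtex"
  | "rtf" | "org" | "asciidoc" | "transcript" | "chat_export" | "email"
  | "calendar" | "binary" | "code";

// Hypothetical manifest shape. Types and optionality are assumptions.
interface SourceManifest {
  sourceId: string;            // stable slug plus content-hash prefix
  originalPath?: string;       // file-based sources only
  url?: string;                // URL-based sources only
  mimeType: string;
  sourceKind: SourceKind;
  language?: string;           // code sources only
  contentHash: string;         // SHA-256 of the raw content, hex-encoded here
  sourceGroupId?: string;      // grouped ingests such as EPUB chapters
  sourceGroupTitle?: string;
  partIndex?: number;
  partCount?: number;
  partTitle?: string;
  storedPath: string;          // under raw/sources/
  extractedTextPath?: string;
  attachments?: string[];      // assumed to be paths under raw/assets/<sourceId>/
  createdAt: string;           // assumed ISO 8601 strings
  updatedAt: string;
}

// Illustrative values only. The hash is the SHA-256 of empty input, and the
// 8-character prefix length in sourceId is an assumption.
const example: SourceManifest = {
  sourceId: "empty-note-e3b0c442",
  originalPath: "notes/empty-note.md",
  mimeType: "text/markdown",
  sourceKind: "markdown",
  contentHash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  storedPath: "raw/sources/empty-note-e3b0c442.md",
  createdAt: "2025-01-01T00:00:00.000Z",
  updatedAt: "2025-01-01T00:00:00.000Z",
};
```

Modelling `originalPath` and `url` as optional reflects that a source is described as either file-based or URL-based; a stricter model could use a discriminated union instead.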

Text Extraction

SwarmVault extracts text based on content type:

HTML — Processed with Mozilla Readability, converted to Markdown via Turndown
Remote HTML/Markdown URLs — Remote images can be downloaded into raw/assets/ and rewritten to local relative links in stored markdown
PDF — Local text extraction with pdfjs-dist
Word documents (`.docx`, `.docm`, `.dotx`, `.dotm`) — Local text and metadata extraction across modern, macro-enabled, and template variants
Rich Text (`.rtf`) — Parser-backed RTF walk into plain-text paragraphs
OpenDocument (ODT / ODP / ODS) — Local archive parsing with text, slide, and sheet extraction
EPUB — Local archive parsing with chapter-split HTML-to-markdown extraction
CSV / TSV — Local tabular summaries with bounded previews and column hints
Excel workbooks (`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xltx`, `.xltm`) — Local workbook parsing with bounded sheet previews (modern, macro-enabled, binary, and legacy BIFF8 formats)
PowerPoint decks (`.pptx`, `.pptm`, `.potx`, `.potm`) — Local slide and speaker-note extraction across modern, macro-enabled, and template variants
Jupyter notebooks (`.ipynb`) — Local cell and output extraction
BibTeX (`.bib`) — Parser-backed citation entry extraction
Org-mode (`.org`) — AST-backed headline, list, and block extraction
AsciiDoc (`.adoc`, `.asciidoc`) — Asciidoctor-backed section and metadata extraction
Transcript files (`.srt`, `.vtt`) — Local timestamped transcript extraction
Slack export archives or extracted directories — Local channel/day conversation extraction
Email (`.eml`, `.mbox`) — Local message extraction and mailbox expansion
Calendar (`.ics`) — Local VEVENT expansion
Markdown — Used as-is
Plain text and `.rst` — Used as text, with lightweight `.rst` heading and directive normalization
Config / data (`.json`, `.jsonc`, `.json5`, `.toml`, `.yaml`, `.yml`, `.xml`, `.ini`, `.conf`, `.cfg`, `.properties`, `.env`) — Stored with structured previews and key/value schema hints
Developer manifests (`package.json`, `tsconfig.json`, `Cargo.toml`, `pyproject.toml`, `go.mod`, `go.sum`, `Dockerfile`, `Makefile`, `LICENSE`, `.gitignore`, `.editorconfig`, `.npmrc`, and similar) — Content-sniffed via istextorbinary so plaintext developer files are never silently dropped as binary
Code — Stored as text, then parsed during compile into module and symbol facts. Covers `.js`, `.mjs`, `.cjs`, `.jsx`, `.ts`, `.mts`, `.cts`, `.tsx`, `.sh`, `.bash`, `.zsh`, `.py`, `.go`, `.rs`, `.java`, `.kt`, `.kts`, `.scala`, `.sc`, `.dart`, `.lua`, `.zig`, `.cs`, `.c`, `.cc`, `.cpp`, `.cxx`, `.h`, `.hh`, `.hpp`, `.hxx`, `.php`, `.rb`, `.ps1`, `.psm1`, `.psd1`, `.ex`, `.exs`, `.ml`, `.mli`, `.m`, `.mm`, `.res`, `.resi`, `.sol`, `.vue`, `.css`, `.html`, `.htm`, plus extensionless executable scripts with `#!/usr/bin/env node|python|ruby|bash|zsh` shebangs
Images (`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff`, `.svg`, `.ico`, `.heic`, `.heif`, `.avif`, `.jxl`) — Analyzed via a vision provider, if one is configured
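
The list above amounts to a dispatch from file type to extraction strategy. The sketch below shows one way such a lookup could work, assuming a plain extension table plus a shebang check for extensionless scripts. `KIND_BY_EXTENSION`, `SHEBANG`, and `guessSourceKind` are hypothetical names, the table covers only a handful of the listed extensions (`.md` for Markdown is itself an assumption), and this is not SwarmVault's actual detection logic, which also content-sniffs files.

```typescript
// Hypothetical, deliberately partial extension table.
const KIND_BY_EXTENSION: Record<string, string> = {
  ".md": "markdown",
  ".pdf": "pdf",
  ".docx": "docx",
  ".rtf": "rtf",
  ".epub": "epub",
  ".csv": "csv",
  ".ipynb": "jupyter",
  ".bib": "bibtex",
  ".org": "org",
  ".adoc": "asciidoc",
  ".srt": "transcript",
  ".vtt": "transcript",
  ".eml": "email",
  ".ics": "calendar",
  ".ts": "code",
  ".py": "code",
  ".png": "image",
};

// Matches the env-style shebangs named in the Code entry above.
const SHEBANG = /^#!\/usr\/bin\/env\s+(node|python|ruby|bash|zsh)\b/;

// Returns a kind, or null when the caller would need to fall back to
// content sniffing (not modelled here).
function guessSourceKind(fileName: string, firstLine: string = ""): string | null {
  const dot = fileName.lastIndexOf(".");
  // A leading dot (e.g. ".gitignore") is treated as "no extension".
  const ext = dot > 0 ? fileName.slice(dot).toLowerCase() : "";
  const kind = KIND_BY_EXTENSION[ext];
  if (kind !== undefined) return kind;
  if (ext === "" && SHEBANG.test(firstLine)) return "code";
  return null;
}
```

For instance, `guessSourceKind("deploy", "#!/usr/bin/env bash")` takes the shebang branch, while an extensionless file without a recognized shebang returns `null` so that a later sniffing step can decide between text and binary.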

Immutability

Raw sources in raw/ are never modified after ingestion. This ensures reproducible compilation and reliable provenance tracking.
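
Because the manifest records a SHA-256 `contentHash`, immutability can in principle be checked by re-hashing the stored copy. The following is a minimal sketch using Node's built-in `crypto` module; `sha256Hex` and `isUnmodified` are hypothetical helpers, and hex encoding of the digest is an assumption about how the hash is stored.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the raw content as lowercase hex.
function sha256Hex(raw: Uint8Array | string): string {
  return createHash("sha256").update(raw).digest("hex");
}

// Hypothetical check: true when the stored bytes still hash to the value
// recorded in the manifest at ingest time.
function isUnmodified(
  storedBytes: Uint8Array | string,
  manifest: { contentHash: string },
): boolean {
  return sha256Hex(storedBytes) === manifest.contentHash;
}

// Usage with in-memory content; a real check would read the file at storedPath.
const recorded = { contentHash: sha256Hex("abc") };
const stillIntact = isUnmodified("abc", recorded);   // true
const tampered = isUnmodified("abcd", recorded);     // false
```

A mismatch would indicate that the file under `raw/` changed after ingestion, which is exactly the situation the immutability rule is meant to exclude.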