Sources & Manifests
Every piece of input to SwarmVault is a source. Each source gets an immutable manifest that tracks its metadata.
Source Manifest
When you run `swarmvault ingest`, a manifest is created with:
- sourceId — Stable slug plus content-hash prefix for the canonical source record
- originalPath — Where a file-based source came from
- url — The original URL for URL-based sources
- mimeType — Detected content type
- sourceKind — `markdown`, `text`, `pdf`, `image`, `html`, `docx`, `epub`, `csv`, `xlsx`, `pptx`, `odt`, `odp`, `ods`, `jupyter`, `data`, `bibtex`, `rtf`, `org`, `asciidoc`, `transcript`, `chat_export`, `email`, `calendar`, `binary`, or `code`
- language — Code language for ingested code sources when applicable
- contentHash — SHA-256 of the raw content
- sourceGroupId / sourceGroupTitle — Shared grouping for one-to-many ingests such as EPUB chapters
- partIndex / partCount / partTitle — Per-part metadata for grouped sources
- storedPath — Canonical immutable source copy under `raw/sources/`
- extractedTextPath — Canonical extracted markdown/text when available
- attachments — Localized asset files under `raw/assets/<sourceId>/` when ingest copied sidecar or remote image references
- createdAt / updatedAt — Timestamps for the source manifest record
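To illustrate how `contentHash` and `sourceId` relate, the sketch below computes a SHA-256 digest of the raw bytes and joins a slug with a prefix of that digest. The slug rule, the 8-character prefix length, and the helper names (`content_hash`, `slugify`, `make_source_id`) are assumptions made for this example and may not match what SwarmVault does internally.

```python
import hashlib
import re


def content_hash(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw source bytes."""
    return hashlib.sha256(raw).hexdigest()


def slugify(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens (assumed rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "source"


def make_source_id(name: str, raw: bytes, prefix_len: int = 8) -> str:
    """Join a slug with a content-hash prefix; prefix_len is a hypothetical choice."""
    return f"{slugify(name)}-{content_hash(raw)[:prefix_len]}"


# Build a partial, illustrative manifest record for a small Markdown source.
raw = b"# Notes\n\nHello, vault.\n"
manifest = {
    "sourceId": make_source_id("Meeting Notes.md", raw),
    "originalPath": "docs/Meeting Notes.md",
    "sourceKind": "markdown",
    "contentHash": content_hash(raw),
}
```

Because the identifier depends only on the name and the bytes, ingesting the same content under the same name yields the same `sourceId` in this sketch, which is one way to read "stable" in the field description.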
Text Extraction
SwarmVault extracts text based on content type:
- HTML — Processed with Mozilla Readability, converted to Markdown via Turndown
- Remote HTML/Markdown URLs — Remote images can be downloaded into `raw/assets/` and rewritten to local relative links in stored markdown
- PDF — Local text extraction with `pdfjs-dist`
- Word documents (`.docx`, `.docm`, `.dotx`, `.dotm`) — Local text and metadata extraction across modern, macro-enabled, and template variants
- Rich Text (`.rtf`) — Parser-backed RTF walk into plain-text paragraphs
- OpenDocument (ODT / ODP / ODS) — Local archive parsing with text, slide, and sheet extraction
- EPUB — Local archive parsing with chapter-split HTML-to-markdown extraction
- CSV / TSV — Local tabular summaries with bounded previews and column hints
- Excel workbooks (`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xltx`, `.xltm`) — Local workbook parsing with bounded sheet previews (modern, macro-enabled, binary, and legacy BIFF8 formats)
- PowerPoint decks (`.pptx`, `.pptm`, `.potx`, `.potm`) — Local slide and speaker-note extraction across macro-enabled and template variants
- Jupyter notebooks (`.ipynb`) — Local cell and output extraction
- BibTeX (`.bib`) — Parser-backed citation entry extraction
- Org-mode (`.org`) — AST-backed headline, list, and block extraction
- AsciiDoc (`.adoc`, `.asciidoc`) — Asciidoctor-backed section and metadata extraction
- Transcript files (`.srt`, `.vtt`) — Local timestamped transcript extraction
- Slack export archives or extracted directories — Local channel/day conversation extraction
- Email (`.eml`, `.mbox`) — Local message extraction and mailbox expansion
- Calendar (`.ics`) — Local VEVENT expansion
- Markdown — Used as-is
- Plain text and `.rst` — Used as text, with lightweight `.rst` heading and directive normalization
- Config / data (`.json`, `.jsonc`, `.json5`, `.toml`, `.yaml`, `.yml`, `.xml`, `.ini`, `.conf`, `.cfg`, `.properties`, `.env`) — Stored with structured previews and key/value schema hints
- Developer manifests (`package.json`, `tsconfig.json`, `Cargo.toml`, `pyproject.toml`, `go.mod`, `go.sum`, `Dockerfile`, `Makefile`, `LICENSE`, `.gitignore`, `.editorconfig`, `.npmrc`, and similar) — Content-sniffed via `istextorbinary` so plaintext developer files are never silently dropped as binary
- Code — Stored as text, then parsed during compile into module and symbol facts. Covers `.js`, `.mjs`, `.cjs`, `.jsx`, `.ts`, `.mts`, `.cts`, `.tsx`, `.sh`, `.bash`, `.zsh`, `.py`, `.go`, `.rs`, `.java`, `.kt`, `.kts`, `.scala`, `.sc`, `.dart`, `.lua`, `.zig`, `.cs`, `.c`, `.cc`, `.cpp`, `.cxx`, `.h`, `.hh`, `.hpp`, `.hxx`, `.php`, `.rb`, `.ps1`, `.psm1`, `.psd1`, `.ex`, `.exs`, `.ml`, `.mli`, `.m`, `.mm`, `.res`, `.resi`, `.sol`, `.vue`, `.css`, `.html`, `.htm`, plus extensionless executable scripts with `#!/usr/bin/env node|python|ruby|bash|zsh` shebangs
- Images (`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff`, `.svg`, `.ico`, `.heic`, `.heif`, `.avif`, `.jxl`) — Analyzed via vision provider (if configured)
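The list above amounts to a dispatch on file type. A minimal sketch of that idea follows: look up the extension first, then fall back to the shebang for extensionless scripts. The table is a small, hand-picked subset, the extension-to-kind pairings are inferred from the list rather than taken from SwarmVault's code, and `guess_source_kind` is a hypothetical helper. Real ingestion also content-sniffs files, which this sketch skips.

```python
import re
from pathlib import Path

# Illustrative subset only; pairings are inferred from the list above (assumption).
KIND_BY_EXTENSION = {
    ".md": "markdown",
    ".pdf": "pdf",
    ".docx": "docx",
    ".epub": "epub",
    ".csv": "csv",
    ".ipynb": "jupyter",
    ".bib": "bibtex",
    ".srt": "transcript",
    ".vtt": "transcript",
    ".eml": "email",
    ".ics": "calendar",
    ".py": "code",
    ".ts": "code",
}

# Matches the shebang forms named in the list: node, python, ruby, bash, zsh.
SHEBANG = re.compile(r"^#!/usr/bin/env (node|python|ruby|bash|zsh)\b")


def guess_source_kind(path: str, first_line: str = "") -> str:
    """Guess a sourceKind from the extension, then from a shebang line."""
    ext = Path(path).suffix.lower()
    if ext in KIND_BY_EXTENSION:
        return KIND_BY_EXTENSION[ext]
    if not ext and SHEBANG.match(first_line):
        return "code"
    # Simplification: anything unrecognized is treated as binary here.
    return "binary"


kind = guess_source_kind("bin/deploy", "#!/usr/bin/env bash")
```

Checking the extension before reading any content keeps the common case cheap; the shebang check only runs for files that have no extension at all.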
Immutability
Raw sources in raw/ are never modified after ingestion. This ensures reproducible compilation and reliable provenance tracking.
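One practical consequence is that a stored copy can be checked against the `contentHash` recorded in its manifest. The sketch below shows such a check, assuming the stored file is byte-identical to the raw content that was hashed. `verify_stored_copy` is a hypothetical helper, not a SwarmVault command or API, and the temporary file stands in for a real file under `raw/sources/`.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_stored_copy(stored_path: Path, expected_hash: str) -> bool:
    """Return True when the file's SHA-256 digest equals the recorded contentHash."""
    digest = hashlib.sha256(stored_path.read_bytes()).hexdigest()
    return digest == expected_hash


# Demo with a temporary file in place of a real stored source.
with tempfile.TemporaryDirectory() as tmp:
    stored = Path(tmp) / "example.md"
    stored.write_bytes(b"hello")
    recorded = hashlib.sha256(b"hello").hexdigest()

    unchanged = verify_stored_copy(stored, recorded)  # file still matches

    stored.write_bytes(b"hello, edited")  # simulate an out-of-band edit
    tampered = verify_stored_copy(stored, recorded)  # mismatch is detected
```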