Sources & Manifests

Every piece of input to SwarmVault is a source. Each source gets an immutable manifest that tracks its metadata.

Source Manifest

Running swarmvault ingest creates a manifest for the source with the following fields:

  • sourceId — Content-based hash for deduplication
  • originalPath — Where the source came from (file path or URL)
  • mimeType — Detected content type
  • contentHash — SHA-256 of the raw content
  • ingestedAt — Timestamp of ingestion
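
As an illustration of how these fields relate, here is a minimal Python sketch that assembles a manifest-like record. It is not SwarmVault's implementation: the function name build_manifest, the choice to derive sourceId from a 16-character prefix of the SHA-256 digest, and the extension-based MIME guess are all assumptions made for the example.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone


def build_manifest(original_path: str, raw: bytes) -> dict:
    """Build a manifest-like dict for one source (illustrative only)."""
    content_hash = hashlib.sha256(raw).hexdigest()
    # Assumption: sourceId is a short prefix of the content hash.
    source_id = content_hash[:16]
    # Assumption: guess the type from the file extension; a real
    # detector may inspect the content instead.
    mime_type, _ = mimetypes.guess_type(original_path)
    return {
        "sourceId": source_id,
        "originalPath": original_path,
        "mimeType": mime_type or "application/octet-stream",
        "contentHash": content_hash,
        "ingestedAt": datetime.now(timezone.utc).isoformat(),
    }


a = build_manifest("notes/intro.txt", b"hello swarm")
b = build_manifest("copy/intro.txt", b"hello swarm")
# Identical content gives the same sourceId, regardless of path.
print(a["sourceId"] == b["sourceId"])
```

Because the identifier in this sketch depends only on the bytes, ingesting the same content from two locations would yield one sourceId, which is the property deduplication relies on.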

Text Extraction

SwarmVault extracts text based on content type:

  • HTML — Processed with Mozilla Readability, converted to Markdown via Turndown
  • PDF — Text extraction with pdf-parse
  • Markdown — Used as-is
  • Plain text — Used as-is
  • Images — Analyzed via vision provider (if configured)
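
The routing above can be pictured as a lookup keyed on MIME type. The Python sketch below is only a model of that dispatch: the names EXTRACTORS, extract_text, passthrough, and placeholder are hypothetical, and the HTML, PDF, and image branches are stubs standing in for Readability/Turndown, pdf-parse, and the vision provider rather than real integrations.

```python
from typing import Callable, Dict


def passthrough(raw: bytes) -> str:
    """Markdown and plain text are used as-is."""
    return raw.decode("utf-8")


def placeholder(name: str) -> Callable[[bytes], str]:
    """Stand-in for an extractor this sketch does not implement."""
    def extract(raw: bytes) -> str:
        raise NotImplementedError(f"{name} extraction is not part of this sketch")
    return extract


# Hypothetical routing table keyed by MIME type.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "text/markdown": passthrough,
    "text/plain": passthrough,
    "text/html": placeholder("HTML (Readability + Turndown)"),
    "application/pdf": placeholder("PDF (pdf-parse)"),
}


def extract_text(mime_type: str, raw: bytes) -> str:
    """Pick an extractor by content type and run it."""
    if mime_type.startswith("image/"):
        return placeholder("image (vision provider)")(raw)
    handler = EXTRACTORS.get(mime_type)
    if handler is None:
        raise ValueError(f"unsupported content type: {mime_type}")
    return handler(raw)


print(extract_text("text/markdown", b"# Title\n\nBody"))
```

How SwarmVault treats content types outside this list is not stated here; raising an error is simply the choice made in the sketch.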

Immutability

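Raw sources in raw/ are never modified after ingestion. Because the stored content does not change, recompiling from the same sources starts from identical input, and any derived output can be traced back to the exact content it came from.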
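
One way to check this property from the outside is to recompute a raw file's SHA-256 and compare it with the contentHash recorded at ingestion. The Python sketch below assumes that approach; is_unmodified is a hypothetical helper, not a SwarmVault command, and the temporary file stands in for a file under raw/.

```python
import hashlib
import tempfile
from pathlib import Path


def is_unmodified(raw_file: Path, expected_hash: str) -> bool:
    """Return True if the file's SHA-256 still matches the recorded hash."""
    return hashlib.sha256(raw_file.read_bytes()).hexdigest() == expected_hash


with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "source.txt"
    raw_file.write_bytes(b"original content")
    # Stand-in for the contentHash stored in the manifest.
    recorded = hashlib.sha256(b"original content").hexdigest()
    intact_before = is_unmodified(raw_file, recorded)
    raw_file.write_bytes(b"edited content")  # simulate an unwanted edit
    intact_after = is_unmodified(raw_file, recorded)

print(intact_before, intact_after)
```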