Sources & Manifests
Every piece of input to SwarmVault is a source. Each source gets an immutable manifest that tracks its metadata.
Source Manifest
Running swarmvault ingest creates a manifest for the source with the following fields:
- sourceId — Content-based hash for deduplication
- originalPath — Where the source came from (file path or URL)
- mimeType — Detected content type
- contentHash — SHA-256 of the raw content
- ingestedAt — Timestamp of ingestion
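As a rough illustration, the fields above could be modeled as follows. This is a minimal sketch, not SwarmVault's actual implementation: the field types, the `buildManifest` helper, and the choice to derive `sourceId` from a prefix of the content hash are all assumptions made for the example. Only the field names and the use of SHA-256 come from the list above.

```typescript
import { createHash } from "node:crypto";

// Field names follow the manifest list; the types are assumptions.
interface SourceManifest {
  sourceId: string;      // content-based hash used for deduplication
  originalPath: string;  // file path or URL the source came from
  mimeType: string;      // detected content type
  contentHash: string;   // SHA-256 of the raw content, hex-encoded
  ingestedAt: string;    // ISO 8601 timestamp (format is an assumption)
}

// Hypothetical helper, not part of SwarmVault's API.
function buildManifest(
  raw: Uint8Array | string,
  originalPath: string,
  mimeType: string,
  now: Date = new Date(),
): Readonly<SourceManifest> {
  const contentHash = createHash("sha256").update(raw).digest("hex");
  // Freeze the object so the manifest cannot be mutated after creation.
  return Object.freeze({
    // Assumption: sourceId is a short prefix of the content hash.
    sourceId: contentHash.slice(0, 16),
    originalPath,
    mimeType,
    contentHash,
    ingestedAt: now.toISOString(),
  });
}
```

Because the identifier in this sketch depends only on the bytes, ingesting the same content from two different paths yields the same `sourceId`, which is what makes content-based deduplication possible.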
Text Extraction
SwarmVault extracts text based on content type:
- HTML — Processed with Mozilla Readability, converted to Markdown via Turndown
- PDF — Text extraction with pdf-parse
- Markdown — Used as-is
- Plain text — Used as-is
- Images — Analyzed via vision provider (if configured)
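The mapping above amounts to a dispatch on the detected content type. The sketch below shows one way such a dispatch could look. The function name, the strategy labels, the exact MIME strings, and the decision to skip images when no vision provider is configured are assumptions for illustration; the source only states which tool handles which kind of content.

```typescript
// Labels are hypothetical names for the strategies listed above.
type ExtractionStrategy =
  | "readability+turndown" // HTML: Readability, then Markdown via Turndown
  | "pdf-parse"            // PDF text extraction
  | "as-is"                // Markdown and plain text pass through unchanged
  | "vision"               // images, when a vision provider is configured
  | "skip";                // assumption: nothing is extracted otherwise

function pickExtractionStrategy(
  mimeType: string,
  visionConfigured: boolean,
): ExtractionStrategy {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();

  if (base === "text/html") return "readability+turndown";
  if (base === "application/pdf") return "pdf-parse";
  if (base === "text/markdown" || base === "text/plain") return "as-is";
  if (base.startsWith("image/")) return visionConfigured ? "vision" : "skip";
  return "skip";
}
```

Keeping the decision separate from the extraction itself makes the routing easy to test without loading any HTML or PDF libraries.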
Immutability
Raw sources in raw/ are never modified after ingestion. This ensures reproducible compilation and reliable provenance tracking.
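One practical consequence of immutability is that a raw source can be re-checked against the SHA-256 recorded in its manifest at any time. The following is a minimal sketch of such a check; `isRawSourceUnchanged` is a hypothetical helper, not a SwarmVault command or API, and it assumes the hash is stored as a hex string.

```typescript
import { createHash } from "node:crypto";

// Hypothetical integrity check: recompute the SHA-256 of the raw bytes
// and compare it with the contentHash recorded at ingestion time.
function isRawSourceUnchanged(
  raw: Uint8Array | string,
  expectedContentHash: string,
): boolean {
  const actual = createHash("sha256").update(raw).digest("hex");
  // Hex digests from Node are lowercase; normalize the stored value too.
  return actual === expectedContentHash.trim().toLowerCase();
}
```

If the check fails, the file under raw/ no longer matches what was ingested, and anything compiled from it can no longer be traced back to the original input.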