Methodology

Technical transparency: how we transform historical documents into searchable, analyzable data.

Rasin.ai performs this transformation with a multi-stage AI pipeline. This page documents each stage of our technical approach for transparency and reproducibility.

1. Data Collection

We aggregate primary sources from digital archives, libraries, and databases worldwide using automated downloaders. Each source is tracked in PostgreSQL (a schema sketch follows the list) with metadata including:

  • Source name, category, and language
  • Download status and timestamps
  • Item counts and file sizes
  • Original URLs and archive locations
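
A tracking table along these lines would capture the fields above. This is an illustrative sketch, not the production schema; the table name, column names, and connection string are all assumptions.

```python
# Hypothetical per-source tracking table (names are illustrative).
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS sources (
    id            SERIAL PRIMARY KEY,
    name          TEXT NOT NULL,
    category      TEXT NOT NULL,          -- e.g. 'archives', 'periodicals'
    language      TEXT NOT NULL,          -- e.g. 'fr', 'en', 'es', 'ht'
    status        TEXT DEFAULT 'pending', -- download status
    downloaded_at TIMESTAMPTZ,            -- last successful download
    item_count    INTEGER DEFAULT 0,
    total_bytes   BIGINT DEFAULT 0,
    original_url  TEXT,
    archive_path  TEXT
);
"""

conn = psycopg2.connect("dbname=rasin")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```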

Currently, the collection includes 43+ sources with 150,000+ documents organized into 8 categories: archives, databases, periodicals, primary sources, secondary sources, legal documents, maps/GIS data, and the Digital Library of the Caribbean.

All downloader source code is open, so each collection run can be reproduced. See our Sources page for the complete catalog.

2. Optical Character Recognition (OCR)

Historical documents are processed with docTR, a state-of-the-art open-source OCR library whose detection and recognition models are optimized for document understanding. We use GPU acceleration (NVIDIA and AMD GPUs) to process documents efficiently.
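
For reference, a minimal docTR run looks like the following; the file name is illustrative, and the default pretrained detection/recognition models may differ from our production configuration.

```python
# Minimal docTR pipeline: pretrained text detection + recognition.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)          # default det/reco models
pages = DocumentFile.from_pdf("scan_1804.pdf")  # illustrative input file
result = model(pages)

print(result.render())  # plain-text export of the recognized pages
```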

Current status: 228,000+ pages processed across multiple languages (French, English, Spanish, Haitian Kreyòl).

Quality considerations: OCR accuracy varies with document quality, age, and preservation condition. Output for 18th- and 19th-century scans may contain errors caused by faded text, printing irregularities, or physical damage.

3. Text Chunking

Extracted text is segmented into semantically meaningful chunks (a sketch follows the list) using a combination of:

  • Boundary detection: Identifying section breaks, paragraphs, and logical divisions
  • Size optimization: Balancing chunk size for embedding models (typically 200-500 tokens)
  • Context preservation: Maintaining enough context for standalone readability
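
A minimal sketch of the packing step, using blank lines as the boundary signal and a whitespace word count as a stand-in for the real tokenizer:

```python
# Pack paragraphs into chunks capped at a token budget. The whitespace
# split is a crude token estimate; production would use a real tokenizer.
def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))  # close the full chunk
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```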

Current status: 266,000+ chunks extracted and stored in PostgreSQL.

4. Embeddings & Vector Search

Each text chunk is encoded into a 1024-dimensional vector using BGE-M3, a multilingual embedding model that supports French, English, Spanish, and Haitian Kreyòl.
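
Encoding with BGE-M3 through the FlagEmbedding package looks roughly like this; the sample sentences are illustrative.

```python
# Dense BGE-M3 embeddings; each row is a 1024-dimensional vector.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
chunks = [
    "Toussaint Louverture promulgua une constitution en 1801.",
    "The insurrection began in the northern plain in August 1791.",
]
dense = model.encode(chunks)["dense_vecs"]  # shape: (2, 1024)
```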

Embeddings are stored in Qdrant, a vector database optimized for semantic search (an indexing-and-search sketch follows the list). This enables:

  • Finding documents by meaning rather than exact keyword matches
  • Cross-lingual search (query in English, find French documents)
  • Similarity-based retrieval for research questions
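
A sketch of indexing and querying with the Qdrant Python client. The collection name and payload fields are illustrative, and random vectors stand in for real BGE-M3 output so the example is self-contained.

```python
# Create a 1024-dim cosine collection, upsert chunk vectors, and search.
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient("localhost", port=6333)
client.recreate_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vectors = np.random.rand(2, 1024).astype(np.float32)  # stand-in embeddings
texts = ["chunk one ...", "chunk two ..."]
client.upsert(
    collection_name="chunks",
    points=[
        PointStruct(id=i, vector=vectors[i].tolist(), payload={"text": texts[i]})
        for i in range(len(texts))
    ],
)

# A query vector from the same multilingual model retrieves by meaning,
# which is what makes cross-lingual search work.
hits = client.search(collection_name="chunks",
                     query_vector=vectors[0].tolist(), limit=10)
```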

5. Entity Extraction

We use GLiNER, a generalist named entity recognition (NER) model, to extract structured information from text (a sketch follows the list). Seven specialized extractors identify:

  • People: Historical figures, officials, revolutionaries, enslaved individuals
  • Places: Cities, regions, plantations, colonies
  • Events: Battles, treaties, uprisings, political moments
  • Organizations: Military units, government bodies, institutions
  • Dates: Temporal references and timelines
  • Legal concepts: Laws, codes, decrees
  • Economic terms: Trade goods, currencies, financial transactions
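
A minimal GLiNER call with a label set mirroring the seven categories above; the checkpoint shown is a public multilingual release, not necessarily our production model.

```python
# Zero-shot NER: GLiNER matches spans against arbitrary label strings.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
labels = ["person", "place", "event", "organization",
          "date", "legal concept", "economic term"]
text = "En 1801, Toussaint Louverture promulgua une constitution à Saint-Domingue."
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```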

Entities are deduplicated and resolved to canonical forms (e.g., "Toussaint Louverture" and "Toussaint L'Ouverture" are merged).
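
As a simplified illustration of that resolution step, a normalization key like the one below merges those two spellings; production matching is fuzzier than this.

```python
# Collapse surface variants to a shared key: strip diacritics and
# apostrophes, then casefold. Illustrative only.
import unicodedata

def canonical_key(name: str) -> str:
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    return name.lower().replace("'", "").replace("’", "")

assert canonical_key("Toussaint L'Ouverture") == canonical_key("Toussaint Louverture")
```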

6. Knowledge Graph

Extracted entities and their relationships are stored in Neo4j, a graph database that models connections between people, places, events, and documents. A write sketch follows the relationship list below.

Current graph size: 20,000+ entities with relationships including:

  • MENTIONED_IN: Entity appears in a document
  • RELATED_TO: Entities co-occur in the same context
  • PARTICIPATED_IN: People involved in events
  • LOCATED_IN: Events or entities tied to places
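
A sketch of writing one such relationship with the official Neo4j Python driver; the URI, credentials, and property names are illustrative.

```python
# MERGE keeps the write idempotent: re-running it never duplicates
# the person, the document, or the MENTIONED_IN edge.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # assumed credentials

CYPHER = """
MERGE (p:Person {canonical: $person})
MERGE (d:Document {doc_id: $doc_id})
MERGE (p)-[:MENTIONED_IN]->(d)
"""

with driver.session() as session:
    session.run(CYPHER, person="toussaint louverture", doc_id="gazette_1802_14")
driver.close()
```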

The knowledge graph enables exploratory research, network analysis, and relationship discovery.

7. AI-Powered Search & Chat

Our chat interface uses hybrid retrieval (sketched after this list) combining:

  • Vector search: Semantic similarity via embeddings
  • Graph traversal: Entity-based document discovery
  • Reciprocal Rank Fusion (RRF): Merging results from both methods
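
RRF itself is compact: each retriever contributes 1 / (k + rank) for every document it returns, so documents ranked well by both methods float to the top. A self-contained sketch, with k = 60 as in the original RRF paper:

```python
# Fuse two (or more) rankings by summed reciprocal rank.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d3", "d1", "d7"],   # vector-search ranking
             ["d1", "d9", "d3"]])  # graph-traversal ranking
# fused[0] == "d1": it sits near the top of both input rankings
```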

Retrieved documents are passed to a large language model (LLM) that generates responses with inline citations. All answers include direct links to source documents for verification.
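
As an illustration of the citation mechanics, retrieved chunks can be numbered inside the prompt so the model cites them inline; the wording below is a hypothetical sketch, not our production prompt.

```python
# Number each retrieved chunk so the LLM can emit [1]-style citations
# that map back to concrete source documents.
def build_prompt(question: str, chunks: list[dict]) -> str:
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the numbered sources below, citing them "
        "inline as [1], [2], ...\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```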

Limitations & Caveats

This project uses AI to assist research, but it is not a replacement for primary source analysis. Users should be aware of:

  • OCR errors: Automated text extraction may misread damaged or low-quality documents
  • Entity extraction errors: NER models may misidentify or miss entities
  • AI-generated content: Chat responses are generated by an LLM and may contain inaccuracies or hallucinations. Always verify claims against the cited sources.
  • Incomplete coverage: Our collection is growing but does not represent all available sources on Haitian history
  • Language bias: OCR and NER models may perform better on English and French than on Haitian Kreyòl or Spanish

We encourage critical engagement with the platform and welcome feedback on errors or improvements.

Reproducibility

All code, data processing pipelines, and model configurations are version-controlled and documented. Researchers can reproduce our methods or apply them to other historical corpora.

For questions about technical implementation or data access, contact us at contact@studio1804.org.

Try it yourself

Experience the full pipeline in action. Ask questions and see how we retrieve, analyze, and cite historical sources.