Methodology
Technical transparency: how we transform historical documents into searchable, analyzable data.
Rasin.ai uses a multi-stage AI pipeline to turn scanned primary sources into structured, queryable text. This page documents each stage of the pipeline for transparency and reproducibility.
1. Data Collection
We aggregate primary sources from digital archives, libraries, and databases worldwide using automated downloaders. Each source is tracked in PostgreSQL with metadata including (see the sketch after this list):
- Source name, category, and language
- Download status and timestamps
- Item counts and file sizes
- Original URLs and archive locations
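As a rough illustration, the per-source record resembles the following (a minimal sketch; the field names are our paraphrase of the metadata above, not the exact schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SourceRecord:
    """One tracked source in the catalog (illustrative fields, not the live schema)."""
    name: str                          # e.g. "Digital Library of the Caribbean"
    category: str                      # e.g. "archives", "periodicals", "legal documents"
    language: str                      # primary language of the source
    download_status: str               # e.g. "pending", "complete", "failed"
    downloaded_at: Optional[datetime]  # timestamp of the last successful download
    item_count: int                    # number of documents retrieved
    total_bytes: int                   # combined file size on disk
    original_url: str                  # where the source lives online
    archive_path: str                  # where the downloaded files are stored
```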
Currently, the collection includes 43+ sources with 150,000+ documents organized into 8 categories: archives, databases, periodicals, primary sources, secondary sources, legal documents, maps/GIS data, and the Digital Library of the Caribbean.
All source code for downloaders is open and reproducible. See our Sources page for the complete catalog.
2. Optical Character Recognition (OCR)
Historical documents are processed through docTR, an open-source OCR library that combines deep learning models for text detection and text recognition. We use GPU acceleration (NVIDIA and AMD GPUs) to process documents efficiently.
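In outline, this step looks like the following (a minimal sketch of docTR's high-level API; the file path is a placeholder, and our production settings may differ):

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load a pretrained text detection + recognition pipeline.
model = ocr_predictor(pretrained=True)

# Read a scanned document (placeholder path).
doc = DocumentFile.from_pdf("scans/example_gazette_1804.pdf")

# Run OCR and render the result as plain text for downstream chunking.
result = model(doc)
print(result.render()[:500])
```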
Current status: 228,000+ pages processed across multiple languages (French, English, Spanish, Haitian Kreyòl).
Quality considerations: OCR accuracy varies based on document quality, age, and preservation condition. Scanned documents from the 18th and 19th centuries may contain errors due to faded text, printing irregularities, or damage.
3. Text Chunking
Extracted text is segmented into semantically meaningful chunks using a combination of (sketched after the list):
- Boundary detection: Identifying section breaks, paragraphs, and logical divisions
- Size optimization: Balancing chunk size for embedding models (typically 200-500 tokens)
- Context preservation: Maintaining enough context for standalone readability
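A toy version of the chunking logic, using blank-line paragraph boundaries and whitespace word counts as a stand-in for a real tokenizer (our production chunker is more involved):

```python
def chunk_text(text: str, max_tokens: int = 500, min_tokens: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly min..max tokens.

    Whitespace-split word count approximates tokens here; a real pipeline
    would use the embedding model's own tokenizer.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        # and the current chunk is already big enough to stand alone.
        if current and current_len + n > max_tokens and current_len >= min_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```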
Current status: 266,000+ chunks extracted and stored in PostgreSQL.
4. Embeddings & Vector Search
Each text chunk is encoded into a 1024-dimensional vector using BGE-M3, a multilingual embedding model that supports French, English, Spanish, and Haitian Kreyòl.
Embeddings are stored in Qdrant, a vector database optimized for semantic search. This enables (a worked example follows the list):
- Finding documents by meaning rather than exact keyword matches
- Cross-lingual search (query in English, find French documents)
- Similarity-based retrieval for research questions
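A condensed sketch of the encode-and-search flow using the FlagEmbedding and qdrant-client packages (the collection name, example texts, and in-memory client are placeholders):

```python
from FlagEmbedding import BGEM3FlagModel
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

model = BGEM3FlagModel("BAAI/bge-m3")  # multilingual, 1024-dim dense vectors
client = QdrantClient(":memory:")      # in-memory instance for illustration

client.create_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

chunks = ["Toussaint Louverture led the revolutionary army.",
          "L'armée révolutionnaire marcha vers le Cap."]
vectors = model.encode(chunks)["dense_vecs"]

client.upsert(
    collection_name="chunks",
    points=[PointStruct(id=i, vector=v.tolist(), payload={"text": t})
            for i, (v, t) in enumerate(zip(vectors, chunks))],
)

# Cross-lingual query: an English question can retrieve the French chunk too.
query_vec = model.encode(["Who led the revolutionary army?"])["dense_vecs"][0]
hits = client.search(collection_name="chunks", query_vector=query_vec.tolist(), limit=2)
for hit in hits:
    print(hit.score, hit.payload["text"])
```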
5. Entity Extraction
We use GLiNER, a generalist named entity recognition (NER) model, to extract structured information from text. Seven specialized extractors identify (an example pass is sketched after the list):
- People: Historical figures, officials, revolutionaries, enslaved individuals
- Places: Cities, regions, plantations, colonies
- Events: Battles, treaties, uprisings, political moments
- Organizations: Military units, government bodies, institutions
- Dates: Temporal references and timelines
- Legal concepts: Laws, codes, decrees
- Economic terms: Trade goods, currencies, financial transactions
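A single extractor pass might look like this (a sketch; the checkpoint name, label set, and threshold are illustrative choices, not our exact configuration):

```python
from gliner import GLiNER

# A multilingual GLiNER checkpoint (illustrative choice).
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

text = ("In 1801 Toussaint Louverture promulgated a constitution "
        "for Saint-Domingue from Cap-Français.")
labels = ["person", "place", "event", "organization", "date",
          "legal concept", "economic term"]

for ent in model.predict_entities(text, labels, threshold=0.4):
    print(f'{ent["label"]:>14}: {ent["text"]}')
```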
Entities are deduplicated and resolved to canonical forms (e.g., "Toussaint Louverture" and "Toussaint L'Ouverture" are merged).
6. Knowledge Graph
Extracted entities and their relationships are stored in Neo4j, a graph database that models connections between people, places, events, and documents.
Current graph size: 20,000+ entities with relationships including:
- MENTIONED_IN: Entity appears in a document
- RELATED_TO: Entities co-occur in the same context
- PARTICIPATED_IN: People involved in events
- LOCATED_IN: Events or entities tied to places
The knowledge graph enables exploratory research, network analysis, and relationship discovery.
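For example, listing everyone recorded as a participant in a given event is a one-hop traversal (a sketch using the official neo4j Python driver; the connection details, node labels, and property names are assumptions):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Person)-[:PARTICIPATED_IN]->(e:Event {name: $event})
RETURN p.name AS person
"""

with driver.session() as session:
    for record in session.run(query, event="Battle of Vertières"):
        print(record["person"])

driver.close()
```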
7. AI-Powered Search & Chat
Our chat interface uses hybrid retrieval combining:
- Vector search: Semantic similarity via embeddings
- Graph traversal: Entity-based document discovery
- Reciprocal Rank Fusion (RRF): Merging ranked lists from both methods (sketched below)
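RRF scores each document by its summed reciprocal rank across the two result lists, so documents surfaced by both methods rise to the top. A minimal sketch (k = 60 is the conventional smoothing constant):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids via Reciprocal Rank Fusion.

    score(d) = sum over rankers of 1 / (k + rank_d), with ranks starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents found by both vector search and graph traversal rank highest.
vector_hits = ["doc_12", "doc_7", "doc_33"]
graph_hits = ["doc_7", "doc_91", "doc_12"]
print(rrf_merge([vector_hits, graph_hits]))  # doc_7 and doc_12 come first
```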
Retrieved documents are passed to a large language model (LLM) that generates responses with inline citations. All answers include direct links to source documents for verification.
Limitations & Caveats
This project uses AI to assist research, but it is not a replacement for primary source analysis. Users should be aware of:
- OCR errors: Automated text extraction may misread damaged or low-quality documents
- Entity extraction errors: NER models may misidentify or miss entities
- AI-generated content: Chat responses are generated by an LLM and may contain inaccuracies or hallucinations. Always verify claims against the cited sources.
- Incomplete coverage: Our collection is growing but does not represent all available sources on Haitian history
- Language bias: OCR and NER models may perform better on English and French than on Haitian Kreyòl or Spanish
We encourage critical engagement with the platform and welcome feedback on errors or improvements.
Reproducibility
All code, data processing pipelines, and model configurations are version-controlled and documented. Researchers can reproduce our methods or apply them to other historical corpora.
For questions about technical implementation or data access, contact us at contact@studio1804.org.
Try it yourself
Experience the full pipeline in action. Ask questions and see how we retrieve, analyze, and cite historical sources.