Research Methodology & Transparency

How Rasin treats sources, manages bias, and verifies the claims its AI makes — documented for researchers, librarians, and institutions.

Guiding Principles

What researchers should know before using this tool

Source pluralism

Rasin aggregates 150+ collections from institutions in France, Haiti, the US, and the Caribbean. No single archive dominates the corpus. Every source is classified by evidentiary status, authorial perspective, and quality tier.

Perspective transparency

Each source is tagged with who wrote it and from what position of power. When search results skew heavily toward one viewpoint — for example, colonial administrative records — the system warns the user and names the bias.

Citation verification

Every claim the AI makes is traced to a specific passage. A separate natural language inference model checks whether the cited passage actually supports the claim. Unverified citations are flagged, not hidden.

Honest about limitations

OCR fails on some 18th-century manuscripts. Cross-lingual retrieval is weaker for Kreyòl. The AI can hallucinate. These gaps are documented here and surfaced in the interface, not concealed.

Interpretive transparency

Rasin draws on two knowledge layers: an archival corpus of digitized primary sources, and a curated interpretive layer of scholarly synthesis. Neither is neutral. The archival corpus over-represents colonial institutional perspectives. The interpretive layer reflects specific scholarly traditions. Both layers are labeled so you can see which informs any given claim.

Source Authority Framework

How sources are classified and weighted

Every source in the corpus is assigned metadata that describes its evidentiary value, authorial perspective, and digitization quality. This framework, informed by Michel-Rolph Trouillot's Silencing the Past, shapes how the system ranks and presents search results.

Evidentiary status

Sources are classified into seven categories based on their relationship to the events they describe:

  • Primary Witness Account — authored by someone who participated in or directly observed the events (e.g., Boisrond-Tonnerre's memoirs of drafting the Declaration of Independence)
  • Primary Official Record — laws, decrees, constitutions, and government records produced at the time (e.g., Haiti's constitutions from 1801–1889 via HaitiDOI/Duke)
  • Primary Record — structured data from the period: ship manifests, census records, fugitive slave advertisements
  • Secondary Contemporary Analysis — written close to the events by informed observers (e.g., Bellegarde's critique of the U.S. occupation, published in 1929)
  • Secondary Scholarly Analysis — modern academic work analyzing the historical record (e.g., C.L.R. James, Trouillot, Fick)
  • Tertiary Reference — encyclopedias, compilations, and derivative works
  • Curated Interpretation — scholarly synthesis frameworks that integrate and interpret across multiple primary and secondary sources, reflecting specific editorial and analytical choices

Authorial perspective

Sources are tagged with the perspective of their author or institution. This is not a judgment of quality — it is a tool for researchers to understand whose voice is represented:

  • Colonial Administrative — records produced by the French colonial government
  • Colonial Planter — documents from slaveholders and plantation owners
  • Haitian Government — official records of the independent Haitian state
  • Haitian Intellectual — works by Haitian thinkers, writers, and public figures
  • Enslaved (Mediated) — records about enslaved people written by those who held power over them. The subjects had no voice in these documents.
  • Foreign Observer — accounts by diplomats, travelers, and journalists from outside Haiti
  • Scholarly Analysis — modern academic research
  • Oral Tradition — transmitted knowledge captured in later written form
  • Curated Interpretation — analytical frameworks synthesizing multiple scholars' arguments into structured interpretive models. These name their source scholars and reflect specific historiographical traditions.

Example perspective caveats

These caveats are attached to sources in our corpus and shown alongside search results:

"These advertisements were written by enslavers seeking to recapture people who self-liberated. The described persons had no voice in these records."

— Marronnage fugitive slave advertisements

"Records of compensation paid to former enslavers after Haitian independence. Documents the financial claims of colonists, not the perspectives of those they enslaved."

— CNRS indemnity records

"Douglass served as U.S. Minister to Haiti. His observations reflect both solidarity with Black self-governance and American diplomatic interests."

— Frederick Douglass correspondence

"These are curated analytical frameworks synthesizing scholars including Fick, Casimir, and Trouillot. They represent one interpretive tradition within Haitian historiography — centering enslaved agency, resistance, and cultural continuity. Alternative scholarly frameworks exist."

— Curated concept and synthesis notes

Quality tiers

Sources are rated 0–4 based on digitization quality, scholarly edition status, and completeness. Higher-tier sources are prioritized in search results, but a lower-tier source with strong relevance to a query can still surface.
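One way to read this tradeoff is as a blended ranking score in which relevance dominates and the quality tier acts as a modifier. A hedged sketch (the weights below are illustrative assumptions, not the production formula):

```python
# Illustrative tier weights; the actual ranking formula is not reproduced here
TIER_WEIGHT = {0: 0.6, 1: 0.7, 2: 0.8, 3: 0.9, 4: 1.0}

def blended_score(relevance: float, tier: int) -> float:
    """Relevance dominates; the quality tier nudges the final ranking."""
    return relevance * TIER_WEIGHT[tier]
```

Under a blend like this, a tier-1 source with very high relevance can still outrank a tier-4 source that matches the query only loosely.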

Bias & Perspective Awareness

Addressing structural imbalances in the archive

Colonial-era Haitian history was documented primarily by colonizers — French administrators, plantation owners, and foreign observers. The archive itself is structurally biased. Many historical voices — enslaved people, rural populations, women — survive only through records written by those who held power over them.

Rasin addresses this in two ways. First, every source is tagged with its authorial perspective. Second, after retrieval, the system checks the perspective distribution of the results. If more than 70% of results share one perspective, a warning is shown to the user.

Example warning: "Most retrieved passages come from colonial administrative sources. These reflect the perspectives of the colonial government, not necessarily those of the people being governed. Consider searching for additional perspectives."
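A minimal sketch of this skew check, assuming each retrieved result carries a perspective tag (field names are illustrative):

```python
from collections import Counter

SKEW_THRESHOLD = 0.70  # warn when one perspective exceeds 70% of results

def perspective_warning(results: list[dict]) -> str | None:
    """Return a warning if a single authorial perspective dominates the results."""
    if not results:
        return None
    counts = Counter(r["perspective"] for r in results)
    perspective, count = counts.most_common(1)[0]
    if count / len(results) > SKEW_THRESHOLD:
        return (
            f"Most retrieved passages come from {perspective} sources. "
            "Consider searching for additional perspectives."
        )
    return None
```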

The 'Enslaved (Mediated)' perspective label makes structural archival silence explicit rather than hiding it. When you see this label, it means the information comes from records about people who could not speak for themselves in the surviving documents.

From raw document to cited answer in 7 steps

Pipeline: Documents → OCR → Chunks → Embeddings → Graph → Search → Answer
01

Data Collection

150+ collections

Primary sources are aggregated from 150+ digital archives, libraries, and databases using automated downloaders. Each item is tracked in PostgreSQL with metadata: source name, language, download status, item counts, and original URLs.

These archives are fragmented across institutions in France, Haiti, the US, and the Caribbean, with no unified catalog. Many exist only as uncatalogued scans or deteriorating microfilm.

The collection spans 8 categories — archives, databases, periodicals, primary sources, secondary sources, legal documents, maps/GIS data, and the Digital Library of the Caribbean. All downloaders are documented and reproducible.
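Each item is tracked in PostgreSQL, as noted above. A sketch of that kind of tracking table, using psycopg2 (column names are illustrative, not the production schema):

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS collection_items (
    id           SERIAL PRIMARY KEY,
    source_name  TEXT NOT NULL,            -- e.g. 'Digital Library of the Caribbean'
    language     TEXT,                     -- fr, en, es, ht
    original_url TEXT,
    status       TEXT DEFAULT 'pending',   -- pending / downloaded / failed
    item_count   INTEGER DEFAULT 0
);
"""

with psycopg2.connect("dbname=rasin") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```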

02

OCR & Text Extraction

281K+ pages processed

Historical documents are processed through docTR, a deep-learning OCR library built for document understanding. GPU acceleration (NVIDIA and AMD) is used to process at scale across French, English, Spanish, and Haitian Kreyòl.

Standard OCR tools fail on these materials — 18th-century French typefaces with archaic ligatures, Kreyòl with no standardized orthography before the 20th century, and manuscripts damaged by tropical humidity. We tune recognition parameters per collection to maximize accuracy.

OCR accuracy varies by document quality and age. Scanned 18th–19th century materials may contain errors due to faded text, printing irregularities, or damage. All limitations are flagged in source metadata.
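A minimal docTR invocation using the detection and recognition models listed later on this page (the file path is hypothetical; per-collection parameter tuning is omitted):

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Detection (db_resnet50) + recognition (crnn_vgg16_bn)
model = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)

pages = DocumentFile.from_pdf("scans/example_1804.pdf")  # hypothetical path
result = model(pages)

text = result.render()    # plain-text rendering of the recognized pages
data = result.export()    # nested blocks/lines/words with confidence scores
```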

03

Text Chunking

401K+ chunks

Extracted text is segmented into semantically meaningful chunks using boundary detection (section breaks, paragraphs, logical divisions), size optimization for embedding models (200–500 tokens), and context preservation for standalone readability.
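A simplified sketch of this pass, splitting on paragraph boundaries and packing paragraphs into the 200–500 token window (token counts are approximated by whitespace words here; the real pipeline would use the embedding model's tokenizer):

```python
MIN_TOKENS, MAX_TOKENS = 200, 500

def chunk_text(text: str) -> list[str]:
    """Pack paragraphs into chunks of roughly 200-500 tokens."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n\n"):    # paragraph-level boundaries
        n = len(para.split())          # rough token count
        if current and size + n > MAX_TOKENS and size >= MIN_TOKENS:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```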

04

Embeddings & Vector Search

1024-dim · BGE-M3 · BM25

Each chunk is encoded into a 1024-dimensional vector using BGE-M3, a multilingual embedding model supporting French, English, Spanish, and Haitian Kreyòl. Embeddings are stored in Qdrant, enabling semantic search by meaning rather than exact keyword matches.

This also enables cross-lingual retrieval: a query in English like 'slave revolt 1791' matches French documents about 'révolte des esclaves' and Kreyòl texts — a capability rare in digital humanities tools, where researchers typically must search each language separately.

A full-text BM25 index provides keyword search as a complementary retrieval path, catching exact-match queries that semantic search alone may miss.

Cross-lingual: English, French, and Kreyòl queries converge nearby in vector space
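A sketch of the dense retrieval path, assuming a Qdrant collection named rasin_chunks (collection name and payload fields are illustrative):

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

encoder = SentenceTransformer("BAAI/bge-m3")      # 1024-dim multilingual embeddings
client = QdrantClient(url="http://localhost:6333")

query_vector = encoder.encode("slave revolt 1791", normalize_embeddings=True)

hits = client.search(
    collection_name="rasin_chunks",               # hypothetical collection name
    query_vector=query_vector.tolist(),
    limit=10,
)
for hit in hits:
    print(f"{hit.score:.3f}", hit.payload.get("source_name"))
```

The same call can retrieve French and Kreyòl passages for this English query, because all languages share one vector space.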
05

Entity Extraction

39K+ entities

GLiNER, a generalist named entity recognition model, extracts structured information across seven categories: people (historical figures, enslaved individuals), places (cities, plantations, colonies), events (battles, treaties, uprisings), organizations, dates, legal concepts, and economic terms.

Entities are deduplicated and resolved to canonical forms — "Toussaint Louverture" and "Toussaint L'Ouverture" are merged into a single node.
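A sketch of the zero-shot extraction step with GLiNER (label set abbreviated; deduplication happens in a separate pass):

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")

labels = ["person", "place", "event", "organization", "date"]
text = "Toussaint Louverture signed the 1801 constitution at Cap-Français."

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["label"], "->", ent["text"], f"({ent['score']:.2f})")
```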

06

Knowledge Graph

Neo4j · 4 relation types

Entities and their relationships are stored in Neo4j. The graph captures MENTIONED_IN (entity in document), RELATED_TO (co-occurrence), PARTICIPATED_IN (people in events), and LOCATED_IN (entities tied to places).

The knowledge graph enables exploratory research, network analysis, and relationship discovery across centuries of sources.
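A sketch of a relationship query with the official Neo4j Python driver (the node labels and property names are assumptions; only the four relationship types above are documented):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Person)-[:PARTICIPATED_IN]->(e:Event {name: $event})
MATCH (p)-[:MENTIONED_IN]->(d:Document)
RETURN p.name AS person, collect(DISTINCT d.title)[..5] AS documents
"""

with driver.session() as session:
    for record in session.run(CYPHER, event="Bois Caïman ceremony"):
        print(record["person"], record["documents"])
```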

07

AI-Powered Search & Chat

Hybrid RAG · RRF · Reranker · NLI

The chat interface uses hybrid retrieval combining three parallel paths: vector search (semantic similarity), full-text BM25 (keyword matching), and graph traversal (entity-based document discovery). Results are merged using Reciprocal Rank Fusion (RRF), then reranked by a cross-encoder (bge-reranker-v2-m3) for precision.
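Reciprocal Rank Fusion scores each document by the sum of reciprocal ranks across the three lists. A minimal sketch (k = 60 is the conventional constant; the production weighting may differ):

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs: score(d) = sum of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_merge([vector_results, bm25_results, graph_results])
```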

Retrieved passages are passed to Nemotron-3-Nano-30B (a hybrid Mamba-2 + MoE architecture with 3B active parameters per token) running on TRT-LLM. The model generates responses with inline citations.

Unlike most AI chat tools, Rasin traces every claim to a specific passage that researchers can verify. Most retrieval-augmented generation systems leave citation accuracy to the LLM alone; we add a separate NLI verification step (DeBERTa) that checks whether each citation actually supports its claim.

AI Models & Transparency

Every model used, why it was chosen, and what it gets wrong

Text generation

3B active params · MoE · TRT-LLM

Nemotron-3-Nano-30B-A3B

Open-weight, self-hosted, multilingual. Runs entirely on our hardware at ~$50/month — no data sent to external APIs.

Smaller than GPT-4-class models. May struggle with complex synthesis across many sources. Query translation can hallucinate keywords.

Multilingual embeddings

1024-dim · sentence-transformers

BAAI/bge-m3

State-of-the-art multilingual model. Encodes French, English, Spanish, and Kreyòl into the same 1024-dimensional vector space, enabling cross-lingual search.

Haitian Kreyòl is underrepresented in the model's training data. Cross-lingual retrieval (e.g., English query matching French document) is weaker than same-language retrieval.

Cross-encoder reranker

Cross-encoder · CUDA

BAAI/bge-reranker-v2-m3

Reads full query-passage pairs together, producing more accurate relevance scores than embedding similarity alone.

Can override source quality signals. A well-matching low-quality OCR fragment may outrank a curated scholarly edition.

Citation verification

NLI entailment · ~180M params

DeBERTa-v3-base-mnli

Independent NLI model that checks whether each cited passage actually supports the claim made. Catches hallucinated or misattributed citations.

Runs on CPU with limited throughput — checks up to 9 citation pairs per response. Some borderline cases may be missed.

Entity extraction

Zero-shot NER · multilingual

GLiNER (gliner_multi_pii-v1)

Zero-shot multilingual NER that requires no training data for the historical domain. Extracts people, places, events, organizations, dates, legal concepts, and economic terms.

Weaker on Haitian Kreyòl and historical French orthography. May miss entities in OCR-garbled text.

OCR & text extraction

CNN-based · FP16 GPU

docTR (db_resnet50 + crnn_vgg16_bn)

Good accuracy on printed documents. GPU-accelerated across NVIDIA, AMD, and Apple Silicon hardware.

Struggles with 18th-century French typefaces, archaic ligatures, and manuscripts damaged by tropical humidity. Some sources have high error rates.

Citation Verification

How the system checks its own claims

Unlike most AI chat tools that rely on the language model alone for citation accuracy, Rasin adds independent verification after the response is generated:

1. Quote matching: If the AI uses a direct quote, the system checks that the quoted text actually exists in the cited passage. Misquotes are flagged immediately.
2. NLI entailment: A separate DeBERTa model evaluates whether each claim is logically supported by its cited passage. Claims scored below the confidence threshold are flagged.
3. Citation reassignment: When a citation fails verification, the system searches for a better-matching passage. If one is found, the citation is reassigned. If none support the claim, the citation is marked as unverified.
4. Fallback verification: When the NLI model is unavailable, the system falls back to embedding cosine similarity between the claim and the cited passage.

Users see citations marked with confidence levels (high, medium, low) and caveats such as "This citation could not be verified against the source passage."
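A sketch of the entailment check and its embedding fallback (the checkpoint ID and threshold are illustrative; the production model is a DeBERTa-v3-base MNLI variant):

```python
from transformers import pipeline

# Illustrative MNLI checkpoint with entailment/neutral/contradiction labels
nli = pipeline("text-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli")

def verify_citation(claim: str, passage: str, threshold: float = 0.7) -> bool:
    """Does the cited passage entail the claim?"""
    scores = nli({"text": passage, "text_pair": claim}, top_k=None)
    entail = next(s["score"] for s in scores if s["label"].lower() == "entailment")
    return entail >= threshold

def fallback_similarity(claim: str, passage: str, encoder) -> float:
    """Fallback when NLI is unavailable: cosine similarity of BGE-M3 embeddings."""
    a, b = encoder.encode([claim, passage], normalize_embeddings=True)
    return float(a @ b)  # normalized vectors, so dot product = cosine
```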

Retrieval: Query fans out to Vector search, BM25, Graph traversal → RRF → Reranker → LLM → NLI → Answer

Evaluation Results

Measured retrieval performance

We evaluate retrieval quality against a golden set of 51 queries spanning factual lookups, entity searches, cross-lingual retrieval, and synthesis questions at three difficulty levels (easy, medium, hard).

  • Recall@10 (~80%): of the expected sources in our corpus, roughly 80% appear in the top 10 results.
  • Recall@5 (~58%): about 58% of expected sources appear in the top 5 results.
  • MRR (~0.40): Mean Reciprocal Rank; on average, the first relevant result appears around position 2.5.
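A sketch of how these metrics are computed from the golden set (data structures illustrative):

```python
def recall_at_k(expected: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of expected source IDs appearing in the top-k results."""
    return len(expected & set(retrieved[:k])) / len(expected)

def mean_reciprocal_rank(runs: list[tuple[set[str], list[str]]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for expected, retrieved in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in expected:
                total += 1.0 / rank
                break
    return total / len(runs)
```

An MRR of ~0.40 corresponds to a first relevant hit around rank 1/0.40 = 2.5 on average.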

Known failures

Two queries consistently fail due to poor OCR quality on the source documents (Boukman and Dumesle). Cross-lingual queries (English searching French sources) perform 10–15% worse than same-language queries.

These metrics are computed automatically and updated as the corpus and models evolve.

Limitations & caveats

AI assists research — it does not replace primary source analysis.

  • OCR errors: Automated text extraction may misread damaged or low-quality documents, especially 18th–19th century materials.
  • Entity errors: NER models may misidentify or miss entities, particularly in historical Kreyòl and Spanish text.
  • AI hallucinations: Chat responses are generated by an LLM and may contain inaccuracies. Always verify claims against the cited sources.
  • Coverage gaps: The collection is growing but does not represent all available sources on Haitian history.
  • Language bias: OCR and NER models may perform better on French and English than on Haitian Kreyòl or Spanish.
  • Cross-lingual gap: English queries searching French or Kreyòl sources perform 10–15% worse than same-language queries. Query expansion partially compensates.
  • Reranker override: The cross-encoder reranker can override source quality signals. A well-matching low-quality fragment may outrank a curated scholarly edition.
  • Temporal coverage: The corpus focuses on 1697–1947. Post-1947 Haitian history is selectively covered and underrepresented.
  • Archival silence: Many historical voices (enslaved people, rural populations, women) survive only through records written by those who held power over them. The system can only surface what was documented.

Reproducibility

All processing pipelines and model configurations are documented on this page. For collaboration inquiries, contact us.

A detailed architecture document is available upon request for institutional review and grant evaluation.

Infrastructure

From prototype to production

No user queries or documents are sent to third-party AI APIs. All inference runs on infrastructure we control — currently self-hosted, transitioning to cloud GPU through the NVIDIA Inception program.

Prototyping phase (current)

The current prototype runs entirely on a single NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified memory) — LLM inference, embeddings, vector search, knowledge graph, and web serving. No external API calls; every query is processed on our own hardware.

This keeps prototyping costs at roughly $50/month (electricity plus domains), enabling rapid iteration without cloud spending. The DGX Spark validated that the full pipeline fits within a single-node memory budget.

Cloud migration (next phase)

As Rasin moves from prototype to production, the platform will migrate to GPU cloud infrastructure through the NVIDIA Inception program.

Cloud deployment enables horizontal scaling (multiple concurrent users, parallel OCR workers), higher availability, and access to larger GPU memory for upgraded models such as Qwen3.5-122B. The architecture is containerized and designed to be cloud-portable — the same service definitions run on DGX, cloud VMs, or managed Kubernetes.

The core commitment remains: no user queries or documents are sent to third-party AI APIs. All LLM inference, embedding, and reranking will continue to run on infrastructure we control, whether self-hosted or cloud-provisioned.

LLM Inference

Nemotron-3-Nano-30B via TRT-LLM

MoE · 3B active

Embeddings

BGE-M3 · 1024-dim · CUDA

Multilingual

Reranker

bge-reranker-v2-m3 · CUDA

Cross-encoder

OCR Processing

docTR · FP16 GPU

281K+ pages

Technical details

Hardware

  • Chip: Grace Blackwell GB10 (sm_121a)
  • Memory: 128 GB unified LPDDR5x
  • Storage: 4 TB NVMe
  • Power: ~240 W average

Software stack

  • CUDA 12.8 / 13.1
  • TRT-LLM (pytorch backend, NVFP4)
  • sentence-transformers (embedding + reranker)
  • torch.compile for OCR (FP16)
  • Docker Compose orchestration
  • Langfuse · Prometheus · Grafana

Memory budget (128 GB total)

  • TRT-LLM (Nemotron-3-Nano): ~19 GB
  • Embedding (BGE-M3 × 2): ~5 GB
  • Reranker (bge-reranker-v2-m3): ~2.5 GB
  • Qdrant: 8 GB
  • Neo4j: 8 GB
  • PostgreSQL: 4 GB
  • Web + API: 4 GB
  • Observability: ~2 GB
  • Total allocated: ~53 GB
  • KV cache + OS: ~75 GB