🇧🇷 Background and Motivation

I lived in Brazil for five years, from 2018 to 2023. We had our children there, and one of the things my wife and I chose to do for our family was to naturalize and become Brazilian citizens. Our children, having been born on Brazilian soil, were citizens from birth. For my wife and me, the process was more involved.

Brazil is notorious for its bureaucracy, and the federal government publishes a daily journal called the Diário Oficial da União (DOU) to announce new laws, regulations, and other official acts. Naturalizations and test results are announced there as well.

One of the more rigorous parts of becoming a citizen is that you have to pass a Portuguese language test. It’s not easy: it’s a long-form, essay-format exam, and the results are published… you guessed it, in the Diário Oficial da União. That’s it. You don’t get a notification; you just have to check online. The Federal Police literally told us to Google our names in quotes plus “DOU” to see if we’d been published.

The naturalization process takes months. But if you spot your result quickly and act fast, you can reduce that timeline by weeks or months. At the time we were applying, the backlog was only getting worse—so staying on top of updates was critical.

A few years back, I built what is basically a small-scale ETL (Extract, Transform, Load) system to automate this process. It downloads the DOU daily journals in PDF format, extracts the text, and indexes it for full-text search. It then runs queries against that index to find important notifications and, if it finds any matches, sends an email alert.

We used the system to catch our own naturalization approvals as well. After more than a year of waiting, seeing our names in a real-time email alert was a massive relief; we could finally get our passports and other identity documents made.

What It Does

At its core, the system is an ETL pipeline for the DOU: download the daily journal, extract the text from the PDFs, index it for full-text search, run stored queries against the index, and email an alert whenever a query matches.

I chose Tantivy for the full-text search engine because it’s a Rust library that’s essentially a counterpart to Apache Lucene. Like Lucene, it isn’t a search engine itself; it’s a library you can build a search engine with, which is what I did here.

After extracting the PDF text (via an external tool), I indexed each file’s text with Tantivy, using the filename as the key / document ID.
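
To make the indexing step concrete, here’s a minimal sketch of what that looks like with Tantivy. The field names, index path, and writer buffer size are illustrative assumptions, not the project’s actual configuration:

```rust
use tantivy::schema::{Schema, STORED, STRING, TEXT};
use tantivy::{doc, Index};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The filename is the document id: stored and indexed as a raw string.
    // The extracted text is tokenized for full-text search but not stored.
    let mut builder = Schema::builder();
    let filename = builder.add_text_field("filename", STRING | STORED);
    let body = builder.add_text_field("body", TEXT);
    let schema = builder.build();

    // The index lives in its own dedicated directory (see FTS_DIR below).
    let index = Index::create_in_dir("/srv/doutools/fts", schema)?;
    let mut writer = index.writer(50_000_000)?; // 50 MB indexing buffer

    // Index one day's extracted text, keyed by its filename.
    let path = "dou-2023-01-15-secao1.txt";
    let text = std::fs::read_to_string(path)?;
    writer.add_document(doc!(filename => path, body => text))?;
    writer.commit()?;
    Ok(())
}
```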

When querying for matches, I get back a list of candidates, each with a relevance score indicating how strong the match is. If a score clears a certain threshold, I queue up an alert with the indexed filename, and the referenced file can then be searched for the matching text.
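
Here’s a sketch of that query side under the same assumed schema, following the recent (0.2x-era) Tantivy API. The query string and score threshold are made up; Tantivy’s scores are BM25 relevance values, and the right cutoff is something you tune empirically:

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::Value;
use tantivy::{Index, TantivyDocument};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let index = Index::open_in_dir("/srv/doutools/fts")?;
    let schema = index.schema();
    let filename = schema.get_field("filename")?;
    let body = schema.get_field("body")?;

    let reader = index.reader()?;
    let searcher = reader.searcher();

    // A stored search: a full name in quotes, as the Federal Police suggested.
    let parser = QueryParser::for_index(&index, vec![body]);
    let query = parser.parse_query(r#""JOHN DOE" AND naturalização"#)?;

    // Each hit comes back with a BM25 relevance score.
    let threshold = 5.0; // illustrative cutoff
    for (score, addr) in searcher.search(&query, &TopDocs::with_limit(10))? {
        if score < threshold {
            continue;
        }
        // Fetch the stored filename so the alert can reference the file.
        let doc: TantivyDocument = searcher.doc(addr)?;
        if let Some(name) = doc.get_first(filename).and_then(|v| v.as_str()) {
            println!("queue alert for {name} (score {score:.2})");
        }
    }
    Ok(())
}
```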

Tantivy indexes require a dedicated directory to store the index files. Removing data from an index is awkward, but indexing is fast, so rebuilding from scratch is cheap. I configure the index location through an environment variable, FTS_DIR, which points to the full-text search directory. That keeps things flexible: when I need a fresh index, I just change the environment variable or point a symlink at a new directory.
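
In code, opening the index is just a matter of reading that variable (a sketch, with error handling simplified):

```rust
use tantivy::Index;

fn open_index() -> Result<Index, Box<dyn std::error::Error>> {
    // FTS_DIR names the directory holding the Tantivy index files.
    // Re-pointing it (or a symlink) at a fresh directory is how a
    // rebuilt index gets swapped in.
    let dir = std::env::var("FTS_DIR")?;
    Ok(Index::open_in_dir(dir)?)
}
```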

Architecture and Flow

The architecture consists of several Rust binaries that work together in a pipeline, with data persistence handled by PostgreSQL and Tantivy. Here’s how it flows:

  1. Orchestration: A shell script (daily-doutools-flow.sh) runs via cron job and coordinates the entire process
  2. Link Collection: fetch_dou_links retrieves PDF download URLs from the DOU website for a specified date range
  3. PDF Processing: download_pdf_dou downloads PDFs and extract_dou_text converts them to text using a Docker container
  4. Indexing: index_dou builds a Tantivy full-text search index from the extracted text
  5. Querying: query_index runs stored searches against the index and records results in PostgreSQL
  6. Alerting: send_alerts checks for new matches and sends email notifications
  7. Data Extraction: add_natz_persons parses naturalization notices into structured data (see the parsing sketch after this list)
  8. Reporting: daily_report generates statistical summaries
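
To illustrate the data-extraction step, here’s a hypothetical sketch of parsing notice entries with the regex crate. The entry format, the pattern, and the NatzPerson fields are all invented for the example; the real add_natz_persons deals with whatever wording each notice actually uses:

```rust
use regex::Regex;

/// One person pulled from a naturalization notice (fields are illustrative).
#[derive(Debug)]
struct NatzPerson {
    name: String,
    country: String,
}

fn parse_notice(text: &str) -> Vec<NatzPerson> {
    // Hypothetical entry format: "FULANA DE TAL - PARAGUAI, ..."
    let re = Regex::new(r"(?m)^([A-ZÀ-Ü ]+) - ([A-ZÀ-Ü ]+),").unwrap();
    re.captures_iter(text)
        .map(|cap| NatzPerson {
            name: cap[1].trim().to_string(),
            country: cap[2].trim().to_string(),
        })
        .collect()
}
```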

The full-text search is built with Tantivy, which provides fast, memory-efficient indexing and querying. PostgreSQL stores metadata, alert configurations, and structured data about the documents and naturalizations. The alert system uses SMTP to send real-time notifications when new matches are found.
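
The email side can be as small as a single SMTP send per match. Here’s a sketch using the lettre crate; the addresses, relay host, and the SMTP_USER/SMTP_PASS variables are placeholders, and the real send_alerts may be wired differently:

```rust
use lettre::transport::smtp::authentication::Credentials;
use lettre::{Message, SmtpTransport, Transport};

fn send_alert(match_line: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Build the notification email for one new match.
    let email = Message::builder()
        .from("doutools@example.com".parse()?)
        .to("me@example.com".parse()?)
        .subject("DOU match found")
        .body(format!("New DOU match:\n{match_line}"))?;

    // Credentials come from the environment rather than being hardcoded.
    let creds = Credentials::new(
        std::env::var("SMTP_USER")?,
        std::env::var("SMTP_PASS")?,
    );
    let mailer = SmtpTransport::relay("smtp.example.com")?
        .credentials(creds)
        .build();
    mailer.send(&email)?;
    Ok(())
}
```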

Future Enhancements

This currently runs on a $10-per-month VPS. If I were architecting this as a production system, I’d use proper AWS cloud architecture, put the FTS directory on an EBS volume (with regular snapshots), and structure the coordination/flow as an event-based system instead of a single top-down shell script kicked off by a cron job.

The current approach works great for a personal tool, but a production-grade system would benefit from:

  1. Scalable Infrastructure: AWS or similar cloud services with proper resource allocation
  2. Durable Storage: EBS volumes with regular snapshots for the full-text search directory
  3. Event-Driven Architecture: Replace shell script orchestration and email alerts with an event-based system and task queues
  4. Parallelization: Process multiple documents simultaneously for faster throughput
  5. Monitoring and Logging: Add robust observability for system health and performance
  6. Error Recovery: Implement better retry logic and failure handling

For now, though, this simple architecture is more than sufficient for my personal use case—it reliably processes the DOU publications and alerts me to important notices without requiring significant infrastructure investment.