1. Introduction
Building a structured OCR system for newspapers is no simple task. Unlike books or single-column documents, newspaper scans are messy: often noisy, skewed, and low-resolution.
Traditional OCR tools struggle with such complex layouts.
Newspapers also don’t follow a standard layout. They use multiple columns, captions, mixed fonts, and articles that may jump across pages.
Because of this, tools like Tesseract often return jumbled, unstructured text. These tools read line by line—without understanding the context.
But what if you need structured data like titles, authors, dates, or page numbers? Raw text simply isn’t enough.
To solve this, we’ll combine YOLOX for detecting layout blocks with Vision LLM for intelligent text extraction.
This modern OCR pipeline turns scanned pages into clean, structured JSON—each block labeled and ordered properly.
This blog walks you through how to build a structured OCR for newspapers using modern AI tools.
Let’s dive in.
2. Project Overview: Structured OCR for Newspapers
This project helps extract structured content from scanned newspaper pages. The system detects layout blocks—such as titles, captions, and article bodies—and then reads the text using AI.
Here’s how it works:
- A user uploads a newspaper image.
- The system detects blocks like titles, subheadings, text, and captions using YOLOX.
- Each block is sent to an OCR engine:
  - EasyOCR for simpler content
  - Vision LLM for dense or complex regions
- Extracted text is grouped and labeled.
- A clean, structured JSON file is returned.
This JSON can be used for research, digital archiving, or searchable databases. It’s both machine-readable and easy to understand.
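Put together, the flow looks roughly like this. All helper names in this sketch are illustrative placeholders, not the project's actual API:

```python
# High-level sketch of the pipeline described above (helper names are hypothetical).
def process_page(image_path: str) -> dict:
    blocks = detect_blocks(image_path)       # YOLOX layout detection
    for block in blocks:
        block["text"] = extract_text(block)  # EasyOCR or Vision LLM, per block type
    return build_structured_json(blocks)     # group, label, and order the results
```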
Key Components
- YOLOX – For object detection and layout analysis
- EasyOCR / Vision LLM – For flexible text extraction
- Python 3.10 – With `.env` for API key management (see the snippet just below)
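Loading the key with python-dotenv is straightforward; the variable name here is an assumption, so match whatever your vision-LLM provider expects:

```python
# Minimal .env loading sketch using python-dotenv (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root
api_key = os.getenv("VISION_LLM_API_KEY")  # hypothetical variable name
```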
This system can run locally or on a small server. A GPU helps, but it’s not strictly required for testing.
3. Training YOLOX for Structured OCR in Newspapers
Before running the pipeline, you’ll need to train a custom YOLOX model that can detect newspaper block types.
3.1 Create a Virtual Environment
Use Python 3.10.13:
```bash
python3.10 -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
```
3.2 Install Dependencies
First, upgrade pip and install all required packages:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
3.3 Creating a Newspaper-Specific Dataset for OCR
Make sure your dataset is annotated in COCO format with relevant classes like:
- `title`
- `subheading`
- `textblock`
- `caption`
- `author`
- `page_number`
Folder structure should look like this:
```
datasets/
├── train2017/
├── val2017/
└── annotations/
    ├── instances_train2017.json
    └── instances_val2017.json
```
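Inside each `instances_*.json`, the category list declares those six classes. A minimal excerpt (the IDs here are an assumption; COCO conventionally starts numbering at 1):

```json
{
  "categories": [
    {"id": 1, "name": "title"},
    {"id": 2, "name": "subheading"},
    {"id": 3, "name": "textblock"},
    {"id": 4, "name": "caption"},
    {"id": 5, "name": "author"},
    {"id": 6, "name": "page_number"}
  ]
}
```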
3.4 Configure the YOLOX Experiment
Create an experiment file at:
`exps/example/custom/newspaper_yolox.py`
Set training parameters like number of classes, dataset paths, and batch size:
```python
self.num_classes = 6
self.data_dir = "datasets"
self.train_ann = "annotations/instances_train2017.json"
```
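For context, a full experiment file follows the upstream YOLOX custom-exp pattern and subclasses `yolox.exp.Exp`. A minimal sketch might look like this; the depth/width values assume a YOLOX-s sized model, and field conventions vary slightly between YOLOX versions:

```python
import os
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.num_classes = 6   # title, subheading, textblock, caption, author, page_number
        self.depth = 0.33      # assumed YOLOX-s backbone depth
        self.width = 0.50      # assumed YOLOX-s backbone width
        self.data_dir = "datasets"
        # Note: some YOLOX versions expect just the filename here and
        # prepend data_dir/annotations/ themselves; adjust for your version.
        self.train_ann = "annotations/instances_train2017.json"
        self.val_ann = "annotations/instances_val2017.json"
        self.max_epoch = 100   # assumed training budget; tune for your dataset
        # Name the experiment after this file, as the upstream examples do.
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
```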
3.5 Start Training
Run this command to begin training, pointing `-f` at the experiment file from step 3.4:

```bash
python tools/train.py -f exps/example/custom/newspaper_yolox.py -expn newspaper_yolox -d 1 -b 8 --fp16
```

- `-f`: Path to the experiment file
- `-expn`: Name of your experiment
- `-d`: Number of GPUs
- `-b`: Batch size
- `--fp16`: Enables mixed precision (faster on GPU)
3.6 Save the Best Model
Once training is complete, use the best checkpoint found at:
`YOLOX_outputs/newspaper_yolox/best_ckpt.pth`
4. How Structured OCR for Newspapers Works
Let’s break down the full pipeline, from layout detection to structured output.
4.1 Detecting Layout Blocks with YOLOX
First, the image is passed through the trained YOLOX model. It detects different layout components like:
- Titles and subheadings
- Body text blocks
- Captions and authors
- Illustrations and page numbers
For each block, YOLOX returns bounding boxes, labels, and confidence scores. These boxes are then cropped to isolate individual regions.
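As an illustration, the cropping step might look like the sketch below, assuming detections arrive as `(x1, y1, x2, y2, score, class_id)` tuples after post-processing (the exact output format depends on how you wrap the YOLOX demo code):

```python
# Hedged sketch: crop detected layout blocks with OpenCV.
import cv2

CLASS_NAMES = ["title", "subheading", "textblock", "caption", "author", "page_number"]

def crop_blocks(image_path, detections, min_conf=0.5):
    image = cv2.imread(image_path)
    blocks = []
    for x1, y1, x2, y2, score, class_id in detections:
        if score < min_conf:
            continue  # drop low-confidence boxes
        crop = image[int(y1):int(y2), int(x1):int(x2)]
        blocks.append({
            "label": CLASS_NAMES[int(class_id)],
            "box": (int(x1), int(y1), int(x2), int(y2)),
            "score": float(score),
            "image": crop,
        })
    return blocks
```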
4.2 Choosing the Right OCR Engine
Next, each cropped block is passed to an OCR engine. Based on the type and size of the block, we choose:
- EasyOCR: Fast and accurate for clean text
- Vision LLM: More powerful for noisy, wrapped, or stylized blocks
This decision can be made automatically using simple logic in your code.
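One possible version of that logic, as a sketch: short, clean block types go to EasyOCR, everything else to the vision LLM. The routing rule and the `query_vision_llm` helper are assumptions, not the project's fixed API:

```python
# Route each cropped block to an OCR engine based on its label.
import easyocr

reader = easyocr.Reader(["en"])  # load the English model once, reuse for all blocks

def extract_text(block):
    if block["label"] in ("page_number", "author", "subheading"):
        # Short, clean regions: EasyOCR is fast and usually good enough.
        return " ".join(reader.readtext(block["image"], detail=0))
    # Dense or stylized regions: hand off to the vision LLM (hypothetical helper).
    return query_vision_llm(block["image"], block["label"])
```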
4.3 Prompt Engineering for Better OCR Output
To get the most out of the vision language model, use custom prompts for each block type.
For example:
“Extract the full title from this image. Do not include captions or author names.”
These prompts help the LLM focus on what matters. You can customize prompts in `functions.py` for each content type.
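A per-class prompt table is one simple way to organize this. The entries below are illustrative; only the title prompt is taken from the example above:

```python
# Illustrative prompt map keyed by block label.
BLOCK_PROMPTS = {
    "title": "Extract the full title from this image. Do not include captions or author names.",
    "author": "Extract only the author byline from this image.",
    "caption": "Extract the caption text exactly as printed.",
    "textblock": "Transcribe the article body, preserving paragraph breaks.",
}
```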
4.4 Structuring the Output
After text is extracted, we group and label each block. This step includes:
- Sorting blocks top-to-bottom and left-to-right (a minimal sort is sketched after this list)
- Matching captions with illustrations
- Linking authors with nearby titles
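The geometric sort can be quite simple. Here is a minimal sketch that assumes each block carries its bounding box as `(x1, y1, x2, y2)` and that rows can be grouped with a fixed vertical tolerance; multi-column pages would need column detection first:

```python
# Order blocks into reading order: group into rows by y, then sort each row by x.
def sort_blocks(blocks, row_tol=20):
    blocks = sorted(blocks, key=lambda b: b["box"][1])  # rough top-to-bottom pass
    rows, current = [], []
    for block in blocks:
        if current and block["box"][1] - current[-1]["box"][1] > row_tol:
            rows.append(sorted(current, key=lambda b: b["box"][0]))  # left-to-right
            current = []
        current.append(block)
    if current:
        rows.append(sorted(current, key=lambda b: b["box"][0]))
    return [block for row in rows for block in row]
```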
Finally, we create a structured JSON:
{ "title": "New Discovery in AI", "author": "Jane Doe", "text": "Researchers at XYZ University...", "caption": "Illustration of the AI model." }
With YOLOX and Vision LLM, you can finally create a reliable structured OCR for newspapers that delivers clean, labeled output.
5. Challenges in Building Structured OCR for Newspapers
Building this system wasn’t easy. Here are some real challenges we faced—and how we solved them.
5.1 Complex Layouts
Newspapers don’t follow rules. Articles wrap around ads. Titles sit next to unrelated images. To train YOLOX well, we needed many diverse examples.
The key lesson: annotate a wide range of layouts and fonts to get consistent results.
5.2 OCR Struggles with Noisy Scans
Low-quality scans are a real problem. Blurry text and ink smudges confused EasyOCR.
Switching to Vision LLM for key blocks (like titles or captions) improved results significantly—but it added cost and latency.
5.3 Balancing Speed and Accuracy
Vision LLM was accurate, but slow and expensive. So, we added a toggle to choose between EasyOCR (fast) and Vision LLM (accurate) based on the use case.
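In code, the toggle can be as small as a single flag on the extraction function. This is a sketch, not the project's exact interface:

```python
# Speed/accuracy toggle: EasyOCR when speed matters, vision LLM when quality does.
def extract_text_toggled(block, use_vision_llm=False):
    if use_vision_llm:
        return query_vision_llm(block["image"], block["label"])  # hypothetical helper
    return " ".join(reader.readtext(block["image"], detail=0))   # EasyOCR fast path
```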
This way, users could balance performance and quality.
5.4 Annotating the Dataset
Labeling layout blocks manually took time—but it was essential. We used tools like Label Studio to speed up annotation.
In the future, pre-trained layout models could help reduce this workload.
5.5 Matching Related Regions
It wasn’t always easy to connect authors to their articles or captions to illustrations. We used proximity rules to group nearby blocks, but it wasn’t perfect.
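The core of a proximity rule like this can be expressed as nearest-center matching; the distance threshold below is an assumed value:

```python
# Pair each caption with the nearest illustration by bounding-box center distance.
import math

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def match_captions(captions, illustrations, max_dist=300):
    pairs = []
    for cap in captions:
        cx, cy = center(cap["box"])
        best, best_dist = None, max_dist  # ignore anything farther than max_dist
        for illus in illustrations:
            ix, iy = center(illus["box"])
            dist = math.hypot(ix - cx, iy - cy)
            if dist < best_dist:
                best, best_dist = illus, dist
        if best is not None:
            pairs.append((cap, best))
    return pairs
```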
A potential improvement could be using layout graphs or document parsing models.
6. Conclusion
OCR for newspapers is tough—but not impossible. Standard tools alone won’t cut it. You need layout awareness, smart extraction, and structured output.
By training YOLOX on newspaper-specific classes, we detected meaningful regions like titles, captions, and authors. With EasyOCR and Vision LLM, we extracted clean text—even from difficult scans.
The final result? A structured, labeled JSON ready for indexing, research, or digital archives.
Whether you’re digitizing archives or automating editorial tasks, this structured OCR for newspapers pipeline is powerful, scalable, and open source.
Thanks for reading! Try the pipeline, improve it, and share your results. We’d love to see what you build.