1. Introduction
Building a structured OCR system for newspapers is no simple task. Unlike books or single-column documents, newspaper scans are messy: often noisy, skewed, and low-resolution.
Traditional OCR tools struggle with such complex layouts.
Newspapers also don’t follow a standard layout. They use multiple columns, captions, mixed fonts, and articles that may jump across pages.
Because of this, tools like Tesseract often return jumbled, unstructured text. These tools read line by line—without understanding the context.
But what if you need structured data like titles, authors, dates, or page numbers? Raw text simply isn’t enough.
To solve this, we’ll combine YOLOX for detecting layout blocks with Vision LLM for intelligent text extraction.
This modern OCR pipeline turns scanned pages into clean, structured JSON—each block labeled and ordered properly.
This blog walks you through how to build a structured OCR for newspapers using modern AI tools.
Let’s dive in.
2. Project Overview: Structured OCR for Newspapers
This project helps extract structured content from scanned newspaper pages. The system detects layout blocks—such as titles, captions, and article bodies—and then reads the text using AI.
Here’s how it works:
- A user uploads a newspaper image.
- The system detects blocks like titles, subheadings, text, and captions using YOLOX.
- Each block is sent to an OCR engine:
  - EasyOCR for simpler content
  - Vision LLM for dense or complex regions
- Extracted text is grouped and labeled.
- A clean, structured JSON file is returned.
This JSON can be used for research, digital archiving, or searchable databases. It’s both machine-readable and easy to understand.
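Put together, the flow looks roughly like this. All helper names in this sketch are illustrative placeholders, not the project's actual API:

```python
# High-level sketch of the pipeline described above (helper names are hypothetical).
def process_page(image_path: str) -> dict:
    blocks = detect_blocks(image_path)       # YOLOX layout detection
    for block in blocks:
        block["text"] = extract_text(block)  # EasyOCR or Vision LLM, per block type
    return build_structured_json(blocks)     # group, label, and order the results
```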
Key Components
- YOLOX – For object detection and layout analysis
- EasyOCR / Vision LLM – For flexible text extraction
- Python 3.10 – With `.env` for API key management (see the snippet just below)
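Loading the key with python-dotenv is straightforward; the variable name here is an assumption, so match whatever your vision-LLM provider expects:

```python
# Minimal .env loading sketch using python-dotenv (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root
api_key = os.getenv("VISION_LLM_API_KEY")  # hypothetical variable name
```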
This system can run locally or on a small server. A GPU helps, but it’s not strictly required for testing.
3. Training YOLOX for Structured OCR in Newspapers
Before running the pipeline, you’ll need to train a custom YOLOX model that can detect newspaper block types.
3.1 Create a Virtual Environment
Use Python 3.10.13:
```bash
python3.10 -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
```
3.2 Install Dependencies
First, upgrade pip and install all required packages:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
3.3 Creating a Newspaper-Specific Dataset for OCR
Make sure your dataset is annotated in COCO format with relevant classes like:
- `title`
- `subheading`
- `textblock`
- `caption`
- `author`
- `page_number`
Folder structure should look like this:
```
datasets/
├── train2017/
├── val2017/
└── annotations/
    ├── instances_train2017.json
    └── instances_val2017.json
```
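Inside each `instances_*.json`, the category list declares those six classes. A minimal excerpt (the IDs here are an assumption; COCO conventionally starts numbering at 1):

```json
{
  "categories": [
    {"id": 1, "name": "title"},
    {"id": 2, "name": "subheading"},
    {"id": 3, "name": "textblock"},
    {"id": 4, "name": "caption"},
    {"id": 5, "name": "author"},
    {"id": 6, "name": "page_number"}
  ]
}
```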
3.4 Configure the YOLOX Experiment
Create an experiment file at:
`exps/example/custom/newspaper_yolox.py`
Set training parameters like number of classes, dataset paths, and batch size:
```python
self.num_classes = 6
self.data_dir = "datasets"
self.train_ann = "annotations/instances_train2017.json"
```
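For context, a full experiment file follows the upstream YOLOX custom-exp pattern and subclasses `yolox.exp.Exp`. A minimal sketch might look like this; the depth/width values assume a YOLOX-s sized model, and field conventions vary slightly between YOLOX versions:

```python
import os
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.num_classes = 6   # title, subheading, textblock, caption, author, page_number
        self.depth = 0.33      # assumed YOLOX-s backbone depth
        self.width = 0.50      # assumed YOLOX-s backbone width
        self.data_dir = "datasets"
        # Note: some YOLOX versions expect just the filename here and
        # prepend data_dir/annotations/ themselves; adjust for your version.
        self.train_ann = "annotations/instances_train2017.json"
        self.val_ann = "annotations/instances_val2017.json"
        self.max_epoch = 100   # assumed training budget; tune for your dataset
        # Name the experiment after this file, as the upstream examples do.
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
```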
3.5 Start Training
Run this command to begin training, pointing `-f` at the experiment file from step 3.4:

```bash
python tools/train.py -f exps/example/custom/newspaper_yolox.py -expn newspaper_yolox -d 1 -b 8 --fp16
```

- `-f`: Path to the experiment file
- `-expn`: Name of your experiment
- `-d`: Number of GPUs
- `-b`: Batch size
- `--fp16`: Enables mixed precision (faster on GPU)
3.6 Save the Best Model
Once training is complete, use the best checkpoint found at:
`YOLOX_outputs/newspaper_yolox/best_ckpt.pth`
4. How Structured OCR for Newspapers Works
Let’s break down the full pipeline, from layout detection to structured output.
4.1 Detecting Layout Blocks with YOLOX
First, the image is passed through the trained YOLOX model. It detects different layout components like:
- Titles and subheadings
- Body text blocks
- Captions and authors
- Illustrations and page numbers
For each block, YOLOX returns bounding boxes, labels, and confidence scores. These boxes are then cropped to isolate individual regions.
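As an illustration, the cropping step might look like the sketch below, assuming detections arrive as `(x1, y1, x2, y2, score, class_id)` tuples after post-processing (the exact output format depends on how you wrap the YOLOX demo code):

```python
# Hedged sketch: crop detected layout blocks with OpenCV.
import cv2

CLASS_NAMES = ["title", "subheading", "textblock", "caption", "author", "page_number"]

def crop_blocks(image_path, detections, min_conf=0.5):
    image = cv2.imread(image_path)
    blocks = []
    for x1, y1, x2, y2, score, class_id in detections:
        if score < min_conf:
            continue  # drop low-confidence boxes
        crop = image[int(y1):int(y2), int(x1):int(x2)]
        blocks.append({
            "label": CLASS_NAMES[int(class_id)],
            "box": (int(x1), int(y1), int(x2), int(y2)),
            "score": float(score),
            "image": crop,
        })
    return blocks
```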
4.2 Choosing the Right OCR Engine
Next, each cropped block is passed to an OCR engine. Based on the type and size of the block, we choose:
- EasyOCR: Fast and accurate for clean text
- Vision LLM: More powerful for noisy, wrapped, or stylized blocks
This decision can be made automatically using simple logic in your code.
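One possible version of that logic, as a sketch: short, clean block types go to EasyOCR, everything else to the vision LLM. The routing rule and the `query_vision_llm` helper are assumptions, not the project's fixed API:

```python
# Route each cropped block to an OCR engine based on its label.
import easyocr

reader = easyocr.Reader(["en"])  # load the English model once, reuse for all blocks

def extract_text(block):
    if block["label"] in ("page_number", "author", "subheading"):
        # Short, clean regions: EasyOCR is fast and usually good enough.
        return " ".join(reader.readtext(block["image"], detail=0))
    # Dense or stylized regions: hand off to the vision LLM (hypothetical helper).
    return query_vision_llm(block["image"], block["label"])
```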
4.3 Prompt Engineering for Better OCR Output
To get the most out of the vision language model, use custom prompts for each block type.
For example:
“Extract the full title from this image. Do not include captions or author names.”
These prompts help the LLM focus on what matters. You can customize prompts in `functions.py` for each content type.
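A per-class prompt table is one simple way to organize this. The entries below are illustrative; only the title prompt is taken from the example above:

```python
# Illustrative prompt map keyed by block label.
BLOCK_PROMPTS = {
    "title": "Extract the full title from this image. Do not include captions or author names.",
    "author": "Extract only the author byline from this image.",
    "caption": "Extract the caption text exactly as printed.",
    "textblock": "Transcribe the article body, preserving paragraph breaks.",
}
```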
4.4 Structuring the Output
After text is extracted, we group and label each block. This step includes:
- Sorting blocks top-to-bottom and left-to-right (a minimal sort is sketched after this list)
- Matching captions with illustrations
- Linking authors with nearby titles
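The geometric sort can be quite simple. Here is a minimal sketch that assumes each block carries its bounding box as `(x1, y1, x2, y2)` and that rows can be grouped with a fixed vertical tolerance; multi-column pages would need column detection first:

```python
# Order blocks into reading order: group into rows by y, then sort each row by x.
def sort_blocks(blocks, row_tol=20):
    blocks = sorted(blocks, key=lambda b: b["box"][1])  # rough top-to-bottom pass
    rows, current = [], []
    for block in blocks:
        if current and block["box"][1] - current[-1]["box"][1] > row_tol:
            rows.append(sorted(current, key=lambda b: b["box"][0]))  # left-to-right
            current = []
        current.append(block)
    if current:
        rows.append(sorted(current, key=lambda b: b["box"][0]))
    return [block for row in rows for block in row]
```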
Finally, we create a structured JSON:
{ "title": "New Discovery in AI", "author": "Jane Doe", "text": "Researchers at XYZ University...", "caption": "Illustration of the AI model." }
With YOLOX and Vision LLM, you can finally create a reliable structured OCR for newspapers that delivers clean, labeled output.
5. Challenges in Building Structured OCR for Newspapers
Building this system wasn’t easy. Here are some real challenges we faced—and how we solved them.
5.1 Complex Layouts
Newspapers don’t follow rules. Articles wrap around ads. Titles sit next to unrelated images. To train YOLOX well, we needed many diverse examples.
The key lesson: annotate a wide range of layouts and fonts to get consistent results.
5.2 OCR Struggles with Noisy Scans
Low-quality scans are a real problem. Blurry text and ink smudges confused EasyOCR.
Switching to Vision LLM for key blocks (like titles or captions) improved results significantly—but it added cost and latency.
5.3 Balancing Speed and Accuracy
Vision LLM was accurate, but slow and expensive. So, we added a toggle to choose between EasyOCR (fast) and Vision LLM (accurate) based on the use case.
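In code, the toggle can be as small as a single flag on the extraction function. This is a sketch, not the project's exact interface:

```python
# Speed/accuracy toggle: EasyOCR when speed matters, vision LLM when quality does.
def extract_text_toggled(block, use_vision_llm=False):
    if use_vision_llm:
        return query_vision_llm(block["image"], block["label"])  # hypothetical helper
    return " ".join(reader.readtext(block["image"], detail=0))   # EasyOCR fast path
```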
This way, users could balance performance and quality.
5.4 Annotating the Dataset
Labeling layout blocks manually took time—but it was essential. We used tools like Label Studio to speed up annotation.
In the future, pre-trained layout models could help reduce this workload.
5.5 Matching Related Regions
It wasn’t always easy to connect authors to their articles or captions to illustrations. We used proximity rules to group nearby blocks, but it wasn’t perfect.
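The core of a proximity rule like this can be expressed as nearest-center matching; the distance threshold below is an assumed value:

```python
# Pair each caption with the nearest illustration by bounding-box center distance.
import math

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def match_captions(captions, illustrations, max_dist=300):
    pairs = []
    for cap in captions:
        cx, cy = center(cap["box"])
        best, best_dist = None, max_dist  # ignore anything farther than max_dist
        for illus in illustrations:
            ix, iy = center(illus["box"])
            dist = math.hypot(ix - cx, iy - cy)
            if dist < best_dist:
                best, best_dist = illus, dist
        if best is not None:
            pairs.append((cap, best))
    return pairs
```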
A potential improvement could be using layout graphs or document parsing models.
6. Conclusion
OCR for newspapers is tough—but not impossible. Standard tools alone won’t cut it. You need layout awareness, smart extraction, and structured output.
By training YOLOX on newspaper-specific classes, we detected meaningful regions like titles, captions, and authors. With EasyOCR and Vision LLM, we extracted clean text—even from difficult scans.
The final result? A structured, labeled JSON ready for indexing, research, or digital archives.
Whether you’re digitizing archives or automating editorial tasks, this structured OCR for newspapers pipeline is powerful, scalable, and open source.
Thanks for reading! Try the pipeline, improve it, and share your results. We’d love to see what you build.