Chinese Text Extraction from Images Using Python and Tesseract OCR

Introduction

Extracting text from images is a common task in document digitization, automated translation, and data processing. In this guide, we’ll walk through setting up Tesseract OCR to extract Chinese text from images using Python.

Prerequisites

Before we begin, ensure you have the following installed:

1. Install Tesseract OCR

Download and install Tesseract OCR from the official repository:

Windows users: Download the latest Windows installer.

Linux/macOS users: Install via package manager:

sudo apt install tesseract-ocr   # Ubuntu/Debian
brew install tesseract           # macOS (Homebrew)

2. Add Tesseract to System PATH (Windows)

After installation, add Tesseract to your system PATH:

Go to Settings > Environment Variables.
Find and edit the Path variable under System Variables.
Click New, then add:
```
C:\Program Files\Tesseract-OCR\
```
Click OK, then open a new terminal window (or restart your system) so the updated PATH takes effect.
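
If you'd like to confirm from Python that the PATH change worked, here is a minimal sketch. It assumes the default install location used in this guide; `find_tesseract` and `DEFAULT_WINDOWS_PATH` are hypothetical names introduced only for illustration.

```python
import os
import shutil

# Default install location from this guide (an assumption; adjust if you installed elsewhere).
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def find_tesseract(name="tesseract", fallback=DEFAULT_WINDOWS_PATH):
    """Return the path to the Tesseract executable, or None if it cannot be found."""
    on_path = shutil.which(name)  # searches the directories listed in PATH
    if on_path:
        return on_path
    if os.path.isfile(fallback):  # not on PATH, but installed at the fallback location
        return fallback
    return None

print(find_tesseract() or "Tesseract not found on PATH or at the default location")
```

If this prints the fallback path rather than a PATH hit, the install is fine but the PATH entry has not been picked up yet.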

3. Verify Tesseract Installation

Run the following command to check available languages:

tesseract --list-langs

Expected output (if Chinese is installed):

List of available languages (8):
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
eng
jpn
jpn_vert
osd
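
The same check can be scripted. The sketch below runs `tesseract --list-langs` through `subprocess` and looks for `chi_sim`. The helper names `parse_list_langs` and `installed_languages` are made up for this example, and the parsing assumes the output looks like the listing above (a header line followed by one language code per line).

```python
import shutil
import subprocess

def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." header.
    return [ln for ln in lines
            if ln and not ln.lower().startswith("list of available languages")]

def installed_languages():
    """Run `tesseract --list-langs`; return [] if Tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True)
    # Read both streams in case a version prints the list to stderr.
    # Any warnings on stderr would show up as extra entries, which is
    # harmless for a simple membership check like the one below.
    return parse_list_langs(result.stdout + "\n" + result.stderr)

langs = installed_languages()
print("chi_sim available:", "chi_sim" in langs)
```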

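If chi_sim is missing on Ubuntu/Debian, install the tesseract-ocr-chi-sim package; on macOS, the Homebrew tesseract-lang formula adds the extra languages. On Windows, download chi_sim.traineddata manually from Tesseract OCR tessdata and place it in: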

C:\Program Files\Tesseract-OCR\tessdata\
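
To double-check that the file landed in the right place, a small sketch like this can help. It assumes the default Windows tessdata folder shown above; `has_language_data` is a hypothetical helper, not part of pytesseract.

```python
from pathlib import Path

def has_language_data(tessdata_dir, lang="chi_sim"):
    """True if <lang>.traineddata exists in the given tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()

# Default location from this guide (an assumption; change it to match your install).
tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
print("chi_sim data present:", has_language_data(tessdata))
```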

4. Install Required Python Libraries

Ensure you have pytesseract and Pillow installed:

pip install pytesseract pillow
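
A quick way to confirm both libraries are importable, sketched with the standard library only. Note that Pillow is installed as `pillow` but imported as `PIL`; `missing_modules` is a hypothetical helper name.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Pillow's import name is "PIL", not "pillow".
print("Missing:", missing_modules(["pytesseract", "PIL"]) or "none")
```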

The Python Script

The script scans a folder for images (.png, .jpg, .jpeg, .webp), extracts text using Tesseract OCR, and saves the output as a .txt file with the same name as the folder.

import os
import pytesseract
from PIL import Image

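# Set Tesseract-OCR path for Windows (raw string, so single backslashes)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Ensure TESSDATA_PREFIX is set correctly.
# Tesseract 4 and later expect the tessdata folder itself; 3.x expected its parent folder.
os.environ["TESSDATA_PREFIX"] = r"C:\Program Files\Tesseract-OCR\tessdata"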

def extract_text_from_images():
    # Get current script directory
    script_dir = os.path.dirname(os.path.abspath(__file__))
    folder_name = os.path.basename(script_dir)  # Use folder name for output file
    output_file = os.path.join(script_dir, f"{folder_name}.txt")
    
    # Supported image formats
    image_extensions = {".png", ".jpg", ".jpeg", ".webp"}
    
    # Get all valid image files sorted in numeric order
    image_files = sorted(
        [f for f in os.listdir(script_dir) if os.path.splitext(f)[1].lower() in image_extensions],
        key=lambda x: int(os.path.splitext(x)[0])  # Sort by numeric filename
    )
    
    if not image_files:
        print("No image files found.")
        return
    
    extracted_texts = []
    for image_file in image_files:
        image_path = os.path.join(script_dir, image_file)
        try:
            with Image.open(image_path) as img:  # Close the image file when done
                text = pytesseract.image_to_string(img, lang="chi_sim")  # Extract Chinese text
            extracted_texts.append(text)
        except Exception as e:
            print(f"Error processing {image_file}: {e}")
    
    # Write extracted text to output file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write("\n\n".join(extracted_texts))
    
    print(f"Text extraction complete. Output saved as: {output_file}")

if __name__ == "__main__":
    extract_text_from_images()
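
Depending on the Tesseract version and language model, the text returned for Chinese images may contain spaces between adjacent characters. If you see that in your output, one possible clean-up is sketched below. It is not part of the script above: `strip_cjk_spaces` is a hypothetical helper, and it only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so punctuation and rarer characters are left alone.

```python
import re

# Spaces or tabs that have a CJK Unified Ideograph (U+4E00-U+9FFF) on both sides.
# Newlines are not matched, so line breaks are kept.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])")

def strip_cjk_spaces(text):
    """Remove spaces between Chinese characters; leave other spacing untouched."""
    return _CJK_GAP.sub("", text)

print(strip_cjk_spaces("你 好 世 界\nHello world"))
```

You could apply it to `text` just before appending to `extracted_texts`, if the spacing bothers you.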

How the Script Works

Note: The folder path should contain only English (ASCII) characters; non-English characters in the path, such as Chinese, can cause issues with Tesseract OCR.

The script automatically detects the folder where it’s placed.
It scans for image files (.png, .jpg, .jpeg, .webp) named in a numeric sequence (e.g., 001.jpg, 002.png).
It extracts the text from each image as plain text, keeping the line breaks Tesseract detects.
The extracted text is compiled into a single .txt file named after the folder.
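
One caveat about the numeric naming: the script's sort key calls `int()` on every image name, so a single non-numeric file such as cover.png raises a ValueError before any OCR happens. A minimal sketch of one way to skip such files instead, assuming that ignoring them is the behavior you want; `numeric_image_files` is a hypothetical helper, not part of the original script.

```python
import os

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}

def numeric_image_files(names):
    """Keep image names whose stem is all ASCII digits, sorted by numeric value."""
    numbered = []
    for name in names:
        stem, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS and stem.isascii() and stem.isdigit():
            numbered.append((int(stem), name))
    # Sorting the (number, name) pairs orders 2.jpg before 10.png.
    return [name for _, name in sorted(numbered)]

print(numeric_image_files(["10.png", "2.jpg", "cover.png", "001.JPEG", "notes.txt"]))
```

In the script, this could stand in for the `sorted(...)` expression by passing it `os.listdir(script_dir)`.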

Summary

This guide walks through setting up Tesseract OCR, configuring paths, installing required languages, and using a Python script to extract Chinese text from images. This automation can be useful for document digitization, subtitle extraction, or archiving printed materials into searchable text.

Give it a try, and let me know if you have any questions! 🚀

Matthew Choo