Chinese Text Extraction from Images Using Python and Tesseract OCR
Introduction
Extracting text from images is a common task in document digitization, automated translations, and data processing. In this guide, we’ll walk through setting up Tesseract OCR to extract Chinese text from images using Python.
Prerequisites
Before we begin, ensure you have the following installed:
1. Install Tesseract OCR
Download and install Tesseract OCR from the official repository:
- Windows users: Download the latest Windows installer.
- Linux/macOS users: Install via package manager:
sudo apt install tesseract-ocr # Ubuntu/Debian brew install tesseract # macOS (Homebrew)
2. Add Tesseract to System PATH (Windows)
After installation, add Tesseract to your system PATH:
- Go to Settings > Environment Variables.
- Find and edit the
Path
variable under System Variables. - Click New, then add:
C:\Program Files\Tesseract-OCR\
- Click OK and restart your system.
3. Verify Tesseract Installation
Run the following command to check available languages:
tesseract --list-langs
Expected output (if Chinese is installed):
List of available languages (8):
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
eng
jpn
jpn_vert
osd
If chi_sim
is missing, download it manually from Tesseract OCR tessdata and place it in:
C:\Program Files\Tesseract-OCR\tessdata\
4. Install Required Python Libraries
Ensure you have pytesseract and Pillow installed:
pip install pytesseract pillow
The Python Script
The script scans a folder for images (.png
, .jpg
, .jpeg
, .webp
), extracts text using Tesseract OCR, and saves the output as a .txt
file with the same name as the folder.
import os
import pytesseract
from PIL import Image
# Set Tesseract-OCR path for Windows
pytesseract.pytesseract.tesseract_cmd = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
# Ensure TESSDATA_PREFIX is set correctly
os.environ["TESSDATA_PREFIX"] = r"C:\\Program Files\\Tesseract-OCR"
def extract_text_from_images():
# Get current script directory
script_dir = os.path.dirname(os.path.abspath(__file__))
folder_name = os.path.basename(script_dir) # Use folder name for output file
output_file = os.path.join(script_dir, f"{folder_name}.txt")
# Supported image formats
image_extensions = {".png", ".jpg", ".jpeg", ".webp"}
# Get all valid image files sorted in numeric order
image_files = sorted(
[f for f in os.listdir(script_dir) if os.path.splitext(f)[1].lower() in image_extensions],
key=lambda x: int(os.path.splitext(x)[0]) # Sort by numeric filename
)
if not image_files:
print("No image files found.")
return
extracted_texts = []
for image_file in image_files:
image_path = os.path.join(script_dir, image_file)
try:
img = Image.open(image_path)
text = pytesseract.image_to_string(img, lang="chi_sim") # Extract Chinese text
extracted_texts.append(text)
except Exception as e:
print(f"Error processing {image_file}: {e}")
# Write extracted text to output file
with open(output_file, "w", encoding="utf-8") as f:
f.write("\n\n".join(extracted_texts))
print(f"Text extraction complete. Output saved as: {output_file}")
if __name__ == "__main__":
extract_text_from_images()
How the Script Works
Note: The folder path must only contain English letters (A-Z) to avoid issues with Tesseract OCR.
- The script automatically detects the folder where it’s placed.
- It scans for image files (
.png
,.jpg
,.jpeg
,.webp
) named in a numeric sequence (e.g.,001.jpg
,002.png
). - It extracts text from each image while preserving formatting.
- The extracted text is compiled into a single
.txt
file named after the folder.
Summary
This guide walks through setting up Tesseract OCR, configuring paths, installing required languages, and using a Python script to extract Chinese text from images. This automation can be useful for document digitization, subtitle extraction, or archiving printed materials into searchable text.
Give it a try, and let me know if you have any questions! 🚀