Removing Random Characters in Story Text Files
When processing large text files, especially those converted from different formats, unwanted random characters can appear. These can include symbols, numbers, and inconsistent letter sequences.
Problem Analysis
The Python script below is designed to clean text files by removing unwanted random character sequences. It scans all .txt
files within a specified folder, applies a regex-based cleaning function, and saves the modified text back to the file.
Example of Unwanted Text
The script will remove text patterns like:
h2 ?# _% j2 k. C( Y, v
These random sequences do not contribute to meaningful content and can disrupt readability.
Python Script for Cleaning Text
import os
import re
def clean_text(text):
# Define a regex pattern to remove sequences matching 'SPACE-XX' where XX is not both A-Z
pattern = r'\s+[A-Za-z!@#$%^&*()_+\-=\[\]{};:\'",.<>?/\\|`~0-9][^A-Za-z](?:\s+[A-Za-z!@#$%^&*()_+\-=\[\]{};:\'",.<>?/\\|`~0-9][^A-Za-z])+'
# Remove matching sequences from the text
cleaned_text = re.sub(pattern, '', text)
return cleaned_text
def process_files(folder):
for root, _, files in os.walk(folder):
for file in files:
if file.endswith(".txt"):
file_path = os.path.join(root, file)
try:
with open(file_path, 'r', encoding='utf-8') as f:
text = f.read()
cleaned_text = clean_text(text)
with open(file_path, 'w', encoding='utf-8') as f:
f.write(cleaned_text)
print(f"Processed: {file_path}")
except Exception as e:
print(f"Error processing {file_path}: {e}")
if __name__ == "__main__":
folder_path = os.path.dirname(os.path.abspath(__file__))
process_files(folder_path)
print("Processing complete!")
How It Works
- The
clean_text()
function uses a regex pattern to find and remove unwanted sequences. - The
process_files()
function scans all.txt
files in the folder and applies the cleaning function. - The script processes each file and overwrites it with cleaned text.
- Simply place the script in the respective folder and run it—no need to edit anything, as it will automatically detect the folder path.
This script ensures your story text files remain free from clutter and enhances readability.