How to Batch Convert Credit Card Statement PDF to Excel
Note: due to pdfplumber
imperfection in detecting blank cells. You do need to cut and align the date and value columns with the text labels.
Prerequisites
Before proceeding, ensure you have the following installed on your system:
- Python (latest version)
pdfplumber
Python library- PowerShell or Command Prompt
Step 1: Install Python
- Open PowerShell or Command Prompt.
- Download and install Python:
- Visit Python’s official website and download the latest version.
- Run the installer and ensure you select “Add Python to PATH” before clicking Install Now.
- Verify installation by running:
python --version
Step 2: Install pdfplumber
- Open PowerShell or Command Prompt.
- Install
pdfplumber
using pip:pip install pdfplumber pandas openpyxl
Step 3: Prepare PDF Files
- Place the PDF files you want to convert in the
Downloads
folder.- Example path:
C:\Users\YourUsername\Downloads\A
- Example path:
Step 4: Run the Script to Convert PDF to Excel
-
Open Notepad and paste the following Python script: (Remember update folder path & password if any) ```import pdfplumber import pandas as pd import os
# Folder containing PDFs & file password if any pdf_folder = r”C:\Users\YourUsername\Downloads\A” password = “xxxx”
# Get all PDF files in the folder pdf_files = [f for f in os.listdir(pdf_folder) if f.lower().endswith(“.pdf”)]
for pdf_file in pdf_files: pdf_path = os.path.join(pdf_folder, pdf_file) output_file = os.path.join(pdf_folder, pdf_file.replace(“.pdf”, “.xlsx”))
data = [] max_columns = 0 # Track max columns for consistency try: with pdfplumber.open(pdf_path, password=password) as pdf: print(f"✅ Opened PDF: {pdf_path} ({len(pdf.pages)} pages detected)") for page_number, page in enumerate(pdf.pages, start=1): print(f"🔹 Processing Page {page_number}") tables = page.extract_tables() if not tables: print(f"⚠️ No structured table found on Page {page_number}, trying text extraction...") page_text = page.extract_text() if page_text: tables = [page_text.split("\n")] # Treat raw text lines as table rows if tables: print(f"✅ {len(tables)} table(s) found on Page {page_number}") for table in tables: for row in table: if row is None: empty_row = [""] * max_columns if max_columns > 0 else [""] data.append(empty_row) continue while len(row) < max_columns: row.append("") split_rows = [] row_max_lines = 1 for cell in row: if cell: lines = cell.strip().split("\n") split_rows.append(lines) row_max_lines = max(row_max_lines, len(lines)) else: split_rows.append([""]) for i in range(row_max_lines): new_row = [col[i] if i < len(col) else "" for col in split_rows] data.append(new_row) max_columns = max(max_columns, len(row)) except Exception as e: print(f"❌ Failed to process {pdf_file}: {e}") continue for i in range(len(data)): while len(data[i]) < max_columns: data[i].append("") df = pd.DataFrame(data) df.columns = [f"Column {i+1}" for i in range(max_columns)] df.to_excel(output_file, index=False) print(f"✅ Conversion done! Excel file saved at: {output_file}") ```
- Save the file as
convert_pdf_to_excel.py
in theDownloads\A
folder. - Open PowerShell or Command Prompt and navigate to
Downloads\A
:cd %USERPROFILE%\Downloads\A
- Run the script:
python convert_pdf_to_excel.py
Step 5: Check the Output
- The converted Excel files will be saved in the
Downloads
folder. - Open the
.xlsx
file using Excel to verify the extracted data.
Troubleshooting
- If Python is not recognized, restart your computer or reinstall Python ensuring “Add Python to PATH” is selected.
- If
pdfplumber
is missing, try reinstalling it with:pip install --upgrade pdfplumber pandas openpyxl
- If the script doesn’t work for your PDFs, some documents may have embedded images instead of selectable text.
By following these steps, you can efficiently extract tables from PDFs into Excel using PowerShell or Command Prompt with Python.
Step 6: Combine all excels into a single excel file
- Open Notepad and paste the following Python script:
import pandas as pd import os input_folder = os.path.expanduser("~/Downloads/A") output_file = os.path.join(input_folder, "combined.xlsx") all_data = [] for file in os.listdir(input_folder): if file.endswith(".xlsx") and file != "combined.xlsx": file_path = os.path.join(input_folder, file) df = pd.read_excel(file_path, sheet_name=0) all_data.append(df) combined_df = pd.concat(all_data, ignore_index=True) combined_df.to_excel(output_file, index=False) print(f"All Excel files combined into {output_file} successfully!")
- Save the file as
combine_excels.py
in theDownloads
folder. - Open PowerShell or Command Prompt and navigate to
Downloads
:cd %USERPROFILE%\Downloads\A
- Run the script:
python combine_excels.py
Give it a try and simplify your expense tracking! 🚀