Python is the ideal language for automating Excel metadata removal at scale. Whether you need to clean a single file before sending it to a client or process thousands of spreadsheets as part of a CI/CD pipeline, Python’s rich ecosystem of libraries makes it straightforward to strip author names, timestamps, comments, hidden sheets, and other sensitive properties from XLSX files programmatically.
Manually removing metadata through Excel’s Document Inspector works for occasional one-off tasks, but it falls apart quickly when you need to clean files regularly, process batches, or integrate metadata removal into automated workflows. Python solves all of these problems by giving you full programmatic control over every aspect of an XLSX file’s metadata.
With Python, you can build scripts that run as pre-commit hooks in version control systems, integrate into CI/CD pipelines, execute as scheduled cron jobs, or operate as part of file-processing microservices. The same script that cleans one file can clean ten thousand files with minimal modification.
Before writing any metadata removal scripts, you need to install the right libraries. The Python ecosystem offers several packages for working with Excel files, each with different strengths. Here is the recommended setup for a comprehensive metadata removal toolkit.
# Create a virtual environment (recommended)
python -m venv metadata-cleaner
source metadata-cleaner/bin/activate   # Linux/macOS
# metadata-cleaner\Scripts\activate    # Windows

# Install core libraries
pip install openpyxl   # Read/write XLSX files with full metadata access
pip install lxml       # Fast XML parsing for low-level manipulation
# Python's built-in zipfile module handles rebuilding ZIP archives cleanly

# Optional but useful
pip install click      # Build CLI tools with argument parsing
pip install rich       # Pretty terminal output for progress bars and tables
openpyxl is the primary library for this work. It provides direct access to document properties, comments, styles, and workbook structure through a clean Python API.
The openpyxl library is the workhorse of Excel metadata manipulation in Python. It reads and writes the Office Open XML format natively, giving you access to document properties, cell comments, defined names, hidden sheets, and more. For operations that openpyxl doesn’t cover directly — such as stripping custom XML parts or cleaning ZIP-level metadata — you can combine it with Python’s built-in zipfile module and lxml for XML manipulation.
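You can see where this metadata lives without any third-party libraries: an XLSX file is just a ZIP archive of XML parts. A minimal sketch using only the standard library (the helper name and the prefix list are our own choices, not an official API):

```python
import zipfile

def list_metadata_parts(path):
    """Return archive members that commonly carry metadata."""
    watch_prefixes = ("docProps/",            # core, app, and custom properties
                      "xl/threadedComments",  # threaded comment parts
                      "xl/persons")           # author identities for threads
    with zipfile.ZipFile(path) as z:
        return [name for name in z.namelist()
                if name.startswith(watch_prefixes)]
```

Running this against a freshly saved workbook typically shows `docProps/core.xml` and `docProps/app.xml`, the two files the next sections target.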
The most common metadata that needs removal is stored in the core document properties: author name, last modified by, title, subject, keywords, and description. These fields are stored in docProps/core.xml inside the XLSX archive and are the first thing anyone checking your file’s metadata will see.
from openpyxl import load_workbook
from datetime import datetime
def remove_core_properties(filepath, output_path=None):
"""Remove all core document properties from an XLSX file."""
wb = load_workbook(filepath)
props = wb.properties
# Clear author and contributor fields
props.creator = ""
props.lastModifiedBy = ""
# Clear descriptive metadata
props.title = ""
props.subject = ""
props.description = ""
props.keywords = ""
props.category = ""
# Neutralize timestamps (set to epoch or a fixed date)
neutral_date = datetime(2000, 1, 1)
props.created = neutral_date
props.modified = neutral_date
# Clear additional fields
props.identifier = ""
props.language = ""
props.revision = ""
props.version = ""
props.contentStatus = ""
# Save to output path or overwrite
save_path = output_path or filepath
wb.save(save_path)
print(f"Core properties removed: {save_path}")
    return save_path

This function clears all standard Dublin Core metadata fields that Excel stores in docProps/core.xml. Setting timestamps to a fixed date rather than None avoids potential issues with applications that expect valid datetime values.
Core properties are only half the story. Excel also stores extended properties in docProps/app.xml, which includes the application name (e.g., “Microsoft Excel”), application version, company name, and manager field. These can reveal your software version, organization, and internal hierarchy.
def remove_extended_properties(filepath, output_path=None):
"""Remove extended/application properties using lxml."""
import zipfile
from lxml import etree
from io import BytesIO
import shutil
output = output_path or filepath
temp_path = filepath + ".tmp"
with zipfile.ZipFile(filepath, 'r') as zin:
with zipfile.ZipFile(temp_path, 'w') as zout:
for item in zin.infolist():
data = zin.read(item.filename)
if item.filename == "docProps/app.xml":
tree = etree.parse(BytesIO(data))
root = tree.getroot()
ns = root.nsmap.get(None, '')
nsmap = {'ep': ns} if ns else {}
# Clear sensitive fields
for tag in ['Company', 'Manager', 'Application',
'AppVersion']:
prefix = f'{{{ns}}}' if ns else ''
elem = root.find(f'{prefix}{tag}')
if elem is not None:
elem.text = ""
data = etree.tostring(tree, xml_declaration=True,
encoding='UTF-8',
standalone=True)
zout.writestr(item, data)
shutil.move(temp_path, output)
print(f"Extended properties removed: {output}")Comments are one of the most dangerous forms of hidden metadata in Excel files. They often contain internal review notes, personal opinions about data, names of reviewers, and context that was never meant for external audiences. Modern Excel has two types: legacy comments (now called “Notes”) and threaded comments (the newer discussion-style format). Your Python script needs to handle both.
def remove_all_comments(filepath, output_path=None):
"""Remove all comments and notes from every worksheet."""
wb = load_workbook(filepath)
total_removed = 0
for sheet_name in wb.sheetnames:
ws = wb[sheet_name]
comments_found = []
# Iterate through all cells to find comments
for row in ws.iter_rows():
for cell in row:
if cell.comment is not None:
comments_found.append(cell.coordinate)
cell.comment = None
if comments_found:
print(f" Sheet '{sheet_name}': removed {len(comments_found)} "
f"comments from {', '.join(comments_found[:5])}"
f"{'...' if len(comments_found) > 5 else ''}")
total_removed += len(comments_found)
save_path = output_path or filepath
wb.save(save_path)
print(f"Total comments removed: {total_removed}")
    return total_removed

This approach catches legacy comments/notes. For threaded comments (introduced in Office 365), you may also need to strip the xl/threadedComments/ XML parts at the ZIP level, as openpyxl may not fully handle them.
def remove_threaded_comments(filepath, output_path=None):
"""Remove threaded comments by stripping XML parts."""
import zipfile
import shutil
output = output_path or filepath
temp_path = filepath + ".tmp"
removed_parts = []
with zipfile.ZipFile(filepath, 'r') as zin:
with zipfile.ZipFile(temp_path, 'w') as zout:
for item in zin.infolist():
# Skip threaded comment files and their rels
if ('threadedComments' in item.filename or
'persons/person' in item.filename):
removed_parts.append(item.filename)
continue
data = zin.read(item.filename)
# Clean references from relationship files
if item.filename.endswith('.rels'):
content = data.decode('utf-8')
if 'threadedComment' in content:
from lxml import etree
from io import BytesIO
tree = etree.parse(BytesIO(data))
root = tree.getroot()
                        # Iterate over a copy: we remove elements as we go
                        for rel in list(root):
                            target = rel.get('Target', '')
                            if 'threadedComment' in target:
                                root.remove(rel)
data = etree.tostring(
tree, xml_declaration=True,
encoding='UTF-8', standalone=True)
zout.writestr(item, data)
shutil.move(temp_path, output)
if removed_parts:
print(f"Removed threaded comment parts: {removed_parts}")
    return removed_parts

Hidden and very hidden sheets are a major source of data leakage. They may contain lookup tables, calculation intermediaries, configuration data, or entire datasets that were meant to stay internal. Defined names (named ranges) can also reference cells containing sensitive data or reveal the structure of internal formulas. Your automation script should either remove hidden sheets entirely or make them visible for review.
from openpyxl.worksheet.worksheet import Worksheet
def handle_hidden_sheets(filepath, output_path=None, action="remove"):
"""Detect and handle hidden/very-hidden sheets.
Args:
action: "remove" to delete hidden sheets,
"reveal" to make them visible,
"report" to just list them
"""
wb = load_workbook(filepath)
hidden_sheets = []
for sheet_name in wb.sheetnames:
ws = wb[sheet_name]
state = ws.sheet_state # 'visible', 'hidden', or 'veryHidden'
if state != 'visible':
hidden_sheets.append({
'name': sheet_name,
'state': state,
'rows': ws.max_row,
'cols': ws.max_column
})
if not hidden_sheets:
print("No hidden sheets found.")
return hidden_sheets
print(f"Found {len(hidden_sheets)} hidden sheet(s):")
for sheet in hidden_sheets:
print(f" - '{sheet['name']}' ({sheet['state']}, "
f"{sheet['rows']}x{sheet['cols']})")
if action == "remove":
for sheet in hidden_sheets:
del wb[sheet['name']]
print(f"Removed {len(hidden_sheets)} hidden sheet(s).")
elif action == "reveal":
for sheet in hidden_sheets:
wb[sheet['name']].sheet_state = 'visible'
print(f"Made {len(hidden_sheets)} sheet(s) visible.")
if action != "report":
save_path = output_path or filepath
wb.save(save_path)
    return hidden_sheets

def remove_defined_names(filepath, output_path=None,
keep_print_areas=False):
"""Remove defined names that could leak information."""
wb = load_workbook(filepath)
names_removed = []
# Build list of names to remove
names_to_remove = []
    for name in list(wb.defined_names):  # dict-like in openpyxl >= 3.1
        # Optionally keep print area and print title definitions
        if keep_print_areas and name in ('_xlnm.Print_Area',
                                         '_xlnm.Print_Titles'):
            continue
        names_to_remove.append(name)
for name in names_to_remove:
del wb.defined_names[name]
names_removed.append(name)
save_path = output_path or filepath
wb.save(save_path)
print(f"Removed {len(names_removed)} defined name(s): "
f"{', '.join(names_removed[:10])}")
    return names_removed

Beyond the standard core and extended properties, Excel files can contain custom document properties stored in docProps/custom.xml. Organizations often use these to store document classification levels, project codes, internal tracking IDs, workflow states, or DLP (Data Loss Prevention) tags. These custom properties can be extremely revealing about your internal processes and must be stripped before external sharing.
def remove_custom_properties(filepath, output_path=None):
"""Remove custom document properties (docProps/custom.xml)."""
import zipfile
import shutil
from lxml import etree
from io import BytesIO
output = output_path or filepath
temp_path = filepath + ".tmp"
properties_removed = []
with zipfile.ZipFile(filepath, 'r') as zin:
with zipfile.ZipFile(temp_path, 'w') as zout:
for item in zin.infolist():
data = zin.read(item.filename)
if item.filename == "docProps/custom.xml":
# Parse and log what we are removing
tree = etree.parse(BytesIO(data))
root = tree.getroot()
for prop in root:
name = prop.get('name', 'unknown')
properties_removed.append(name)
# Write an empty custom properties file
ns = 'http://schemas.openxmlformats.org/officeDocument/2006/custom-properties'
nsvt = 'http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes'
new_root = etree.Element(
f'{{{ns}}}Properties',
nsmap={None: ns, 'vt': nsvt})
data = etree.tostring(
new_root, xml_declaration=True,
encoding='UTF-8', standalone=True)
zout.writestr(item, data)
shutil.move(temp_path, output)
if properties_removed:
print(f"Removed {len(properties_removed)} custom properties: "
f"{', '.join(properties_removed)}")
    return properties_removed

Custom properties often contain the most organizationally sensitive metadata — document classification labels like “CONFIDENTIAL”, project tracking codes, and DRM markers that reveal internal systems.
Now that we have individual functions for each type of metadata, we can combine them into a comprehensive cleaning script. This unified approach ensures that no metadata category is overlooked and provides a consistent interface for all your automation needs.
import os
import shutil
import json
from datetime import datetime
from pathlib import Path
from openpyxl import load_workbook
class ExcelMetadataCleaner:
"""Comprehensive XLSX metadata removal tool."""
def __init__(self, filepath):
self.filepath = Path(filepath)
self.log = []
def _log(self, action, details):
entry = {
'timestamp': datetime.now().isoformat(),
'action': action,
'details': details
}
self.log.append(entry)
print(f" [{action}] {details}")
def clean_core_properties(self, wb):
"""Strip all core document properties."""
props = wb.properties
fields_cleared = []
if props.creator:
fields_cleared.append(f"creator='{props.creator}'")
props.creator = ""
if props.lastModifiedBy:
fields_cleared.append(
f"lastModifiedBy='{props.lastModifiedBy}'")
props.lastModifiedBy = ""
for field in ['title', 'subject', 'description',
'keywords', 'category', 'identifier',
'language', 'revision', 'version',
'contentStatus']:
val = getattr(props, field, None)
if val:
fields_cleared.append(f"{field}='{val}'")
setattr(props, field, "")
neutral = datetime(2000, 1, 1)
props.created = neutral
props.modified = neutral
self._log('core_properties',
f"Cleared {len(fields_cleared)} fields: "
f"{', '.join(fields_cleared[:5])}")
def clean_comments(self, wb):
"""Remove all cell comments across all sheets."""
total = 0
for ws in wb.worksheets:
count = 0
for row in ws.iter_rows():
for cell in row:
if cell.comment is not None:
cell.comment = None
count += 1
if count:
self._log('comments',
f"Sheet '{ws.title}': {count} comments")
total += count
if total == 0:
self._log('comments', 'No comments found')
def clean_hidden_sheets(self, wb, action="remove"):
"""Remove or reveal hidden sheets."""
hidden = [(name, wb[name].sheet_state)
for name in wb.sheetnames
if wb[name].sheet_state != 'visible']
if not hidden:
self._log('hidden_sheets', 'No hidden sheets found')
return
for name, state in hidden:
if action == "remove":
del wb[name]
self._log('hidden_sheets',
f"Removed '{name}' ({state})")
else:
wb[name].sheet_state = 'visible'
self._log('hidden_sheets',
f"Revealed '{name}' ({state})")
def clean_defined_names(self, wb):
"""Remove all defined names except print areas."""
removed = []
names_to_remove = []
        for name in list(wb.defined_names):  # dict-like in openpyxl >= 3.1
            if name not in ('_xlnm.Print_Area',
                            '_xlnm.Print_Titles'):
                names_to_remove.append(name)
for name in names_to_remove:
del wb.defined_names[name]
removed.append(name)
self._log('defined_names',
f"Removed {len(removed)} name(s)")
def clean(self, output_path=None, remove_hidden=True):
"""Run the full metadata cleaning pipeline."""
output = Path(output_path) if output_path else self.filepath
print(f"\nCleaning: {self.filepath.name}")
print("-" * 50)
# Phase 1: openpyxl-level cleaning
wb = load_workbook(str(self.filepath))
self.clean_core_properties(wb)
self.clean_comments(wb)
if remove_hidden:
self.clean_hidden_sheets(wb)
self.clean_defined_names(wb)
# Save intermediate result
temp_path = str(output) + ".intermediate.xlsx"
wb.save(temp_path)
wb.close()
# Phase 2: ZIP-level cleaning
self._clean_zip_level(temp_path, str(output))
os.remove(temp_path)
print(f"\nOutput saved: {output}")
return self.log
def _clean_zip_level(self, input_path, output_path):
"""Clean metadata at the ZIP/XML level."""
import zipfile
from lxml import etree
from io import BytesIO
temp = output_path + ".tmp"
with zipfile.ZipFile(input_path, 'r') as zin:
with zipfile.ZipFile(temp, 'w',
zipfile.ZIP_DEFLATED) as zout:
for item in zin.infolist():
data = zin.read(item.filename)
# Skip threaded comments
if ('threadedComments' in item.filename or
'persons/person' in item.filename):
self._log('zip_clean',
f"Removed: {item.filename}")
continue
# Clean app.xml
if item.filename == "docProps/app.xml":
data = self._clean_app_xml(data)
# Clean custom.xml
if item.filename == "docProps/custom.xml":
data = self._clean_custom_xml(data)
# Neutralize ZIP entry timestamps
item.date_time = (2000, 1, 1, 0, 0, 0)
zout.writestr(item, data)
shutil.move(temp, output_path)
self._log('zip_clean', 'ZIP-level cleaning complete')
def _clean_app_xml(self, data):
from lxml import etree
from io import BytesIO
tree = etree.parse(BytesIO(data))
root = tree.getroot()
ns = root.nsmap.get(None, '')
for tag in ['Company', 'Manager', 'Application',
'AppVersion']:
elem = root.find(f'{{{ns}}}{tag}')
if elem is not None:
self._log('app_properties',
f"Cleared {tag}='{elem.text}'")
elem.text = ""
return etree.tostring(tree, xml_declaration=True,
encoding='UTF-8', standalone=True)
def _clean_custom_xml(self, data):
from lxml import etree
from io import BytesIO
tree = etree.parse(BytesIO(data))
root = tree.getroot()
props = [prop.get('name', '?') for prop in root]
if props:
self._log('custom_properties',
f"Cleared: {', '.join(props)}")
ns = 'http://schemas.openxmlformats.org/officeDocument/2006/custom-properties'
nsvt = 'http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes'
new_root = etree.Element(f'{{{ns}}}Properties',
nsmap={None: ns, 'vt': nsvt})
return etree.tostring(new_root, xml_declaration=True,
                              encoding='UTF-8', standalone=True)

One of the biggest advantages of Python automation is the ability to process entire directories of files. Whether you are cleaning an export from a shared drive before sending it to an external auditor, or sanitizing all spreadsheets in a release package, batch processing saves hours of manual work and ensures consistency.
from pathlib import Path
import json
def batch_clean(input_dir, output_dir=None, recursive=True):
"""Clean all XLSX files in a directory.
Args:
input_dir: Directory containing XLSX files
output_dir: Where to save cleaned files (None = overwrite)
recursive: Whether to walk subdirectories
"""
input_path = Path(input_dir)
pattern = "**/*.xlsx" if recursive else "*.xlsx"
files = list(input_path.glob(pattern))
# Filter out temp files and already-cleaned files
files = [f for f in files if not f.name.startswith('~')
and '.cleaned.' not in f.name]
print(f"Found {len(files)} XLSX file(s) to process")
results = {'success': [], 'failed': [], 'skipped': []}
for i, filepath in enumerate(files, 1):
print(f"\n[{i}/{len(files)}] Processing: {filepath.name}")
# Determine output path
if output_dir:
out_dir = Path(output_dir) / filepath.parent.relative_to(
input_path)
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / filepath.name
else:
out_path = filepath
try:
cleaner = ExcelMetadataCleaner(str(filepath))
log = cleaner.clean(output_path=str(out_path))
results['success'].append({
'file': str(filepath),
'output': str(out_path),
'actions': len(log)
})
except Exception as e:
print(f" ERROR: {e}")
results['failed'].append({
'file': str(filepath),
'error': str(e)
})
# Print summary
print("\n" + "=" * 60)
print(f"Batch complete: {len(results['success'])} succeeded, "
f"{len(results['failed'])} failed")
# Save report
report_path = Path(output_dir or input_dir) / "cleaning_report.json"
with open(report_path, 'w') as f:
json.dump(results, f, indent=2)
print(f"Report saved: {report_path}")
    return results

For large batches (hundreds or thousands of files), you can speed up processing significantly with Python’s concurrent.futures module:
from concurrent.futures import ProcessPoolExecutor, as_completed
def clean_single_file(args):
filepath, output_path = args
cleaner = ExcelMetadataCleaner(filepath)
return cleaner.clean(output_path=output_path)
def batch_clean_parallel(input_dir, output_dir, max_workers=4):
"""Clean files in parallel using multiple CPU cores."""
files = list(Path(input_dir).glob("**/*.xlsx"))
tasks = [(str(f), str(Path(output_dir) / f.name)) for f in files]
with ProcessPoolExecutor(max_workers=max_workers) as executor:
futures = {executor.submit(clean_single_file, t): t
for t in tasks}
for future in as_completed(futures):
filepath = futures[future][0]
try:
future.result()
print(f"Done: {filepath}")
except Exception as e:
print(f"Failed: {filepath} - {e}")Wrapping your metadata removal logic in a proper CLI tool makes it accessible to non-developers and easy to invoke from shell scripts, Makefiles, and CI/CD pipelines. The click library provides a clean, Pythonic way to build command-line interfaces with argument parsing, help text, and validation built in.
#!/usr/bin/env python3
"""excel-clean: Strip metadata from XLSX files."""
import click
from pathlib import Path
from openpyxl import load_workbook

# ExcelMetadataCleaner is the class defined earlier in this guide;
# import it from wherever you saved it, e.g.:
# from excel_cleaner import ExcelMetadataCleaner
@click.group()
@click.version_option(version="1.0.0")
def cli():
"""Remove metadata from Excel XLSX files."""
pass
@cli.command()
@click.argument('files', nargs=-1, type=click.Path(exists=True))
@click.option('--output-dir', '-o', type=click.Path(),
help='Output directory (default: overwrite in place)')
@click.option('--keep-hidden/--remove-hidden', default=False,
help='Keep or remove hidden sheets')
@click.option('--dry-run', is_flag=True,
help='Report what would be removed without modifying')
def clean(files, output_dir, keep_hidden, dry_run):
"""Clean metadata from one or more XLSX files."""
if not files:
click.echo("No files specified. Use --help for usage.")
return
for filepath in files:
path = Path(filepath)
if path.is_dir():
xlsx_files = list(path.glob("**/*.xlsx"))
click.echo(f"Found {len(xlsx_files)} files in {path}")
else:
xlsx_files = [path]
for xlsx in xlsx_files:
if dry_run:
click.echo(f"[DRY RUN] Would clean: {xlsx}")
continue
out = Path(output_dir) / xlsx.name if output_dir else None
cleaner = ExcelMetadataCleaner(str(xlsx))
cleaner.clean(output_path=str(out) if out else None,
remove_hidden=not keep_hidden)
@cli.command()
@click.argument('file', type=click.Path(exists=True))
def inspect(file):
"""Show metadata present in an XLSX file without modifying."""
wb = load_workbook(file)
props = wb.properties
click.echo(f"\nMetadata for: {file}")
click.echo("=" * 50)
click.echo(f"Creator: {props.creator or '(empty)'}")
click.echo(f"Last Modified By:{props.lastModifiedBy or '(empty)'}")
click.echo(f"Created: {props.created}")
click.echo(f"Modified: {props.modified}")
click.echo(f"Title: {props.title or '(empty)'}")
click.echo(f"Company: (check docProps/app.xml)")
click.echo(f"Sheets: {len(wb.sheetnames)}")
hidden = [n for n in wb.sheetnames
if wb[n].sheet_state != 'visible']
click.echo(f"Hidden sheets: {len(hidden)}")
comment_count = sum(
1 for ws in wb.worksheets
for row in ws.iter_rows()
for cell in row if cell.comment)
click.echo(f"Comments: {comment_count}")
    name_count = len(wb.defined_names)  # dict-like in openpyxl >= 3.1
click.echo(f"Defined names: {name_count}")
if __name__ == '__main__':
    cli()

Usage examples:
# Clean a single file in place
python excel_clean.py clean report.xlsx

# Clean multiple files to a separate directory
python excel_clean.py clean *.xlsx -o ./cleaned/

# Inspect without modifying
python excel_clean.py inspect report.xlsx

# Dry run on a whole directory
python excel_clean.py clean ./reports/ --dry-run
A standalone cleaning script is useful, but the real power comes from integrating metadata removal into your existing workflows. Here are four proven integration patterns used by organizations that handle sensitive Excel files regularly.
Prevent metadata from ever reaching your repository by cleaning XLSX files automatically before each commit.
#!/bin/bash
# Pre-commit hook: clean XLSX metadata before committing
XLSX_FILES=$(git diff --cached --name-only --diff-filter=ACM \
| grep -i '\.xlsx$')
if [ -z "$XLSX_FILES" ]; then
exit 0
fi
echo "Cleaning metadata from staged XLSX files..."
for file in $XLSX_FILES; do
python excel_clean.py clean "$file"
git add "$file" # Re-stage the cleaned version
done
echo "XLSX metadata cleaned successfully."
exit 0

Clean uploaded Excel files on the fly in a web application before storing them.
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import FileResponse
import tempfile
import os
app = FastAPI()
@app.post("/api/clean-xlsx")
async def clean_xlsx(file: UploadFile = File(...)):
"""Upload an XLSX file and receive a cleaned version."""
if not file.filename.endswith('.xlsx'):
return {"error": "Only XLSX files are supported"}
# Save upload to temp file
with tempfile.NamedTemporaryFile(suffix='.xlsx',
delete=False) as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
# Clean metadata
output_path = tmp_path + ".cleaned.xlsx"
cleaner = ExcelMetadataCleaner(tmp_path)
cleaner.clean(output_path=output_path)
# Clean up input temp file
os.unlink(tmp_path)
return FileResponse(
output_path,
filename=f"cleaned_{file.filename}",
media_type="application/vnd.openxmlformats-officedocument"
".spreadsheetml.sheet"
    )

Automatically clean all XLSX files in a shared directory on a regular schedule.
# Run every day at 2 AM: clean all XLSX in the outbox folder
# Add to crontab with: crontab -e
0 2 * * * /opt/metadata-cleaner/venv/bin/python \
    /opt/metadata-cleaner/excel_clean.py clean \
    /shared/outbox/ -o /shared/outbox/cleaned/ \
    >> /var/log/metadata-cleaner.log 2>&1
Add metadata cleaning as a step in your CI/CD pipeline to ensure no XLSX files ship with sensitive metadata.
name: Clean Excel Metadata
on:
pull_request:
paths:
- '**/*.xlsx'
jobs:
clean-metadata:
runs-on: ubuntu-latest
steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # HEAD~1 must exist for the diff step below
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install dependencies
run: pip install openpyxl lxml click
- name: Find and clean XLSX files
run: |
CHANGED=$(git diff --name-only HEAD~1 | grep -i '\.xlsx$' || true)
if [ -n "$CHANGED" ]; then
python excel_clean.py clean $CHANGED
echo "Cleaned: $CHANGED"
fi
- name: Commit cleaned files
run: |
git config user.name "metadata-bot"
git config user.email "bot@example.com"
git add -A
git diff --cached --quiet || \
git commit -m "chore: strip XLSX metadata"Trust but verify. After cleaning, you should always confirm that metadata has actually been removed. A verification step is especially important when building automated pipelines, because a bug in your cleaning logic could silently pass through sensitive data for months before anyone notices.
def verify_clean(filepath):
"""Verify that an XLSX file has been properly cleaned.
Returns a dict of findings. An empty 'issues' list means
the file is clean.
"""
wb = load_workbook(filepath)
props = wb.properties
issues = []
# Check core properties
if props.creator:
issues.append(f"Creator still set: '{props.creator}'")
if props.lastModifiedBy:
issues.append(
f"LastModifiedBy still set: '{props.lastModifiedBy}'")
if props.title:
issues.append(f"Title still set: '{props.title}'")
# Check for comments
for ws in wb.worksheets:
for row in ws.iter_rows():
for cell in row:
if cell.comment:
issues.append(
f"Comment in {ws.title}!{cell.coordinate}")
# Check for hidden sheets
for name in wb.sheetnames:
if wb[name].sheet_state != 'visible':
issues.append(f"Hidden sheet: '{name}'")
# Check ZIP-level metadata
import zipfile
with zipfile.ZipFile(filepath, 'r') as z:
for item in z.infolist():
if 'threadedComments' in item.filename:
issues.append(
f"Threaded comments remain: {item.filename}")
            if 'custom.xml' in item.filename:
                # ZipFile.read() always returns bytes, so compare bytes
                data = z.read(item.filename)
                if b'<property' in data.lower():
                    issues.append("Custom properties still present")
if issues:
print(f"FAIL - {len(issues)} issue(s) found:")
for issue in issues:
print(f" - {issue}")
else:
print("PASS - File is clean")
return {'filepath': filepath, 'clean': len(issues) == 0,
'issues': issues}Building reliable metadata removal scripts requires awareness of several edge cases and gotchas that can silently undermine your cleaning efforts. Here are the most common pitfalls encountered in production environments.
When you delete a cell’s value, the original string may remain in the shared string table (xl/sharedStrings.xml). openpyxl handles this correctly when you save, but if you are manipulating XML directly, you must rebuild the shared string table to avoid leaving orphaned data that can be recovered.
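A quick way to audit for this is to search the shared string table directly at the ZIP level. A small stdlib-only sketch (the function name and return convention are ours):

```python
import zipfile

def shared_strings_residue(path, needle):
    """Check whether a string still lurks in xl/sharedStrings.xml.

    `needle` is text you expect to be gone, e.g. a name that was
    deleted from a cell. Returns True if it is still recoverable.
    """
    with zipfile.ZipFile(path) as z:
        if "xl/sharedStrings.xml" not in z.namelist():
            return False  # workbook has no shared string table
        return needle.encode("utf-8") in z.read("xl/sharedStrings.xml")
```

Running this check after cleaning is cheap insurance against orphaned strings surviving a direct-XML edit.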
Pivot tables store a complete copy of their source data in a cache (xl/pivotCache/). Even if you delete the source sheet, the cache retains all the original data. Always check for and clean pivot table caches separately, or remove them entirely if the pivot table is not needed.
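Removing the cache parts follows the same ZIP-rewrite pattern used for threaded comments above. A hedged sketch (the helper name is ours; note that relationship entries pointing at the dropped parts should also be cleaned, as shown in the threaded-comments function):

```python
import zipfile

def strip_pivot_caches(src, dst):
    """Copy an XLSX archive, dropping pivot cache parts (xl/pivotCache/).

    Returns the list of removed part names. This will break any pivot
    tables in the workbook, so use it only when they are not needed.
    """
    removed = []
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if "pivotCache" in item.filename:
                removed.append(item.filename)
                continue  # skip the cached source data entirely
            zout.writestr(item, zin.read(item.filename))
    return removed
```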
openpyxl cannot open encrypted or password-protected XLSX files. Your script should detect these and either skip them with a warning or use msoffcrypto-tool to decrypt first (if you have the password). Silently failing on encrypted files means they pass through with all metadata intact.
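Detection is straightforward because encrypted Office files are stored as OLE2 compound documents rather than ZIP archives. A heuristic sketch using only the standard library (the helper name is ours):

```python
import zipfile

# First 8 bytes of an OLE2 compound file (used by encrypted Office files)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

def is_encrypted_xlsx(path):
    """Heuristic check: encrypted workbooks are OLE containers, not ZIPs."""
    with open(path, "rb") as f:
        header = f.read(8)
    return header == OLE_MAGIC and not zipfile.is_zipfile(path)
```

Files that test True here should be routed to a decryption step (or flagged loudly) instead of being passed through the cleaner untouched.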
Embedded images (JPEG, PNG) carry their own EXIF metadata including camera model, GPS coordinates, and timestamps. Excel stores these in xl/media/. Your cleaning script should strip EXIF data from embedded images using a library like Pillow or remove embedded objects entirely.
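A first step is simply finding the embedded images so you know EXIF stripping is needed. This stdlib-only sketch (helper name and extension list are ours) lists candidates; actually removing EXIF then means re-encoding each image, for example with Pillow, and writing the cleaned bytes back into the archive:

```python
import zipfile

# Raster formats that can carry EXIF or similar embedded metadata
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".tiff")

def list_embedded_media(path):
    """List embedded images (xl/media/) that may carry EXIF metadata."""
    with zipfile.ZipFile(path) as z:
        return [name for name in z.namelist()
                if name.startswith("xl/media/")
                and name.lower().endswith(IMAGE_EXTS)]
```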
If you are processing .xlsm files (macro-enabled workbooks), the VBA project contains its own metadata: module names, developer comments, project properties, and sometimes developer names. openpyxl preserves but does not provide API access to VBA content. Use oletools for VBA-level cleaning.
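At minimum, your pipeline should detect when a file carries a VBA project so it can be routed to that extra cleaning step. The VBA project is stored as a single binary part, so detection is a one-liner at the ZIP level (the helper name is ours):

```python
import zipfile

def has_vba_project(path):
    """Detect the VBA project part of a macro-enabled workbook."""
    with zipfile.ZipFile(path) as z:
        return "xl/vbaProject.bin" in z.namelist()
```

Files that test True can then be handed to an oletools-based step for VBA-level metadata review.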