Python is the ideal language for automating Excel metadata removal at scale. Whether you need to clean a single file before sending it to a client or process thousands of spreadsheets as part of a CI/CD pipeline, Python’s rich ecosystem of libraries makes it straightforward to strip author names, timestamps, comments, hidden sheets, and other sensitive properties from XLSX files programmatically.
Manually removing metadata through Excel’s Document Inspector works for occasional one-off tasks, but it falls apart quickly when you need to clean files regularly, process batches, or integrate metadata removal into automated workflows. Python solves all of these problems by giving you full programmatic control over every aspect of an XLSX file’s metadata.
With Python, you can build scripts that run as pre-commit hooks in version control systems, integrate into CI/CD pipelines, execute as scheduled cron jobs, or operate as part of file-processing microservices. The same script that cleans one file can clean ten thousand files with minimal modification.
Before writing any metadata removal scripts, you need to install the right libraries. The Python ecosystem offers several packages for working with Excel files, each with different strengths. Here is the recommended setup for a comprehensive metadata removal toolkit.
# Create a virtual environment (recommended)
python -m venv metadata-cleaner
source metadata-cleaner/bin/activate   # Linux/macOS
# metadata-cleaner\Scripts\activate    # Windows

# Install core libraries
pip install openpyxl   # Read/write XLSX files with full metadata access
pip install lxml       # Fast XML parsing for low-level manipulation
# Python's built-in zipfile module handles rebuilding ZIP archives cleanly

# Optional but useful
pip install click      # Build CLI tools with argument parsing
pip install rich       # Pretty terminal output for progress bars and tables
openpyxl is the primary library for this work. It provides direct access to document properties, comments, styles, and workbook structure through a clean Python API.
The openpyxl library is the workhorse of Excel metadata manipulation in Python. It reads and writes the Office Open XML format natively, giving you access to document properties, cell comments, defined names, hidden sheets, and more. For operations that openpyxl doesn’t cover directly — such as stripping custom XML parts or cleaning ZIP-level metadata — you can combine it with Python’s built-in zipfile module and lxml for XML manipulation.
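You can see where this metadata lives without any third-party libraries: an XLSX file is just a ZIP archive of XML parts. A minimal sketch using only the standard library (the helper name and the prefix list are our own choices, not an official API):

```python
import zipfile

def list_metadata_parts(path):
    """Return archive members that commonly carry metadata."""
    watch_prefixes = ("docProps/",            # core, app, and custom properties
                      "xl/threadedComments",  # threaded comment parts
                      "xl/persons")           # author identities for threads
    with zipfile.ZipFile(path) as z:
        return [name for name in z.namelist()
                if name.startswith(watch_prefixes)]
```

Running this against a freshly saved workbook typically shows `docProps/core.xml` and `docProps/app.xml`, the two files the next sections target.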
The most common metadata that needs removal is stored in the core document properties: author name, last modified by, title, subject, keywords, and description. These fields are stored in docProps/core.xml inside the XLSX archive and are the first thing anyone checking your file’s metadata will see.
from openpyxl import load_workbook
from datetime import datetime
def remove_core_properties(filepath, output_path=None):
"""Remove all core document properties from an XLSX file."""
wb = load_workbook(filepath)
props = wb.properties
# Clear author and contributor fields
props.creator = ""
props.lastModifiedBy = ""
# Clear descriptive metadata
props.title = ""
props.subject = ""
props.description = ""
props.keywords = ""
props.category = ""
# Neutralize timestamps (set to epoch or a fixed date)
neutral_date = datetime(2000, 1, 1)
props.created = neutral_date
props.modified = neutral_date
# Clear additional fields
props.identifier = ""
props.language = ""
props.revision = ""
props.version = ""
props.contentStatus = ""
# Save to output path or overwrite
save_path = output_path or filepath
wb.save(save_path)
print(f"Core properties removed: {save_path}")
    return save_path

This function clears all standard Dublin Core metadata fields that Excel stores in docProps/core.xml. Setting timestamps to a fixed date rather than None avoids potential issues with applications that expect valid datetime values.
Core properties are only half the story. Excel also stores extended properties in docProps/app.xml, which includes the application name (e.g., “Microsoft Excel”), application version, company name, and manager field. These can reveal your software version, organization, and internal hierarchy.
def remove_extended_properties(filepath, output_path=None):
"""Remove extended/application properties using lxml."""
import zipfile
from lxml import etree
from io import BytesIO
import shutil
output = output_path or filepath
temp_path = filepath + ".tmp"
with zipfile.ZipFile(filepath, 'r') as zin:
with zipfile.ZipFile(temp_path, 'w') as zout:
for item in zin.infolist():
data = zin.read(item.filename)
if item.filename == "docProps/app.xml":
tree = etree.parse(BytesIO(data))
root = tree.getroot()
ns = root.nsmap.get(None, '')
nsmap = {'ep': ns} if ns else {}
# Clear sensitive fields
for tag in ['Company', 'Manager', 'Application',
'AppVersion']:
prefix = f'{{{ns}}}' if ns else ''
elem = root.find(f'{prefix}{tag}')
if elem is not None:
elem.text = ""
data = etree.tostring(tree, xml_declaration=True,
encoding='UTF-8',
standalone=True)
zout.writestr(item, data)
shutil.move(temp_path, output)
print(f"Extended properties removed: {output}")Comments are one of the most dangerous forms of hidden metadata in Excel files. They often contain internal review notes, personal opinions about data, names of reviewers, and context that was never meant for external audiences. Modern Excel has two types: legacy comments (now called “Notes”) and threaded comments (the newer discussion-style format). Your Python script needs to handle both.
def remove_all_comments(filepath, output_path=None):
"""Remove all comments and notes from every worksheet."""
wb = load_workbook(filepath)
total_removed = 0
for sheet_name in wb.sheetnames:
ws = wb[sheet_name]
comments_found = []
# Iterate through all cells to find comments
for row in ws.iter_rows():
for cell in row:
if cell.comment is not None:
comments_found.append(cell.coordinate)
cell.comment = None
if comments_found:
print(f" Sheet '{sheet_name}': removed {len(comments_found)} "
f"comments from {', '.join(comments_found[:5])}"
f"{'...' if len(comments_found) > 5 else ''}")
total_removed += len(comments_found)
save_path = output_path or filepath
wb.save(save_path)
print(f"Total comments removed: {total_removed}")
    return total_removed

This approach catches legacy comments/notes. For threaded comments (introduced in Office 365), you may also need to strip the xl/threadedComments/ XML parts at the ZIP level, as openpyxl may not fully handle them.
def remove_threaded_comments(filepath, output_path=None):
"""Remove threaded comments by stripping XML parts."""
import zipfile
import shutil
output = output_path or filepath
temp_path = filepath + ".tmp"
removed_parts = []
with zipfile.ZipFile(filepath, 'r') as zin:
with zipfile.ZipFile(temp_path, 'w') as zout:
for item in zin.infolist():
# Skip threaded comment files and their rels
if ('threadedComments' in item.filename or
'persons/person' in item.filename):
removed_parts.append(item.filename)
continue
data = zin.read(item.filename)
# Clean references from relationship files
if item.filename.endswith('.rels'):
content = data.decode('utf-8')
if 'threadedComment' in content:
from lxml import etree
from io import BytesIO
tree = etree.parse(BytesIO(data))
root = tree.getroot()
                        # Iterate over a copy: we remove elements as we go
                        for rel in list(root):
                            target = rel.get('Target', '')
                            if 'threadedComment' in target:
                                root.remove(rel)
data = etree.tostring(
tree, xml_declaration=True,
encoding='UTF-8', standalone=True)
zout.writestr(item, data)
shutil.move(temp_path, output)
if removed_parts:
print(f"Removed threaded comment parts: {removed_parts}")
    return removed_parts

Hidden and very hidden sheets are a major source of data leakage. They may contain lookup tables, calculation intermediaries, configuration data, or entire datasets that were meant to stay internal. Defined names (named ranges) can also reference cells containing sensitive data or reveal the structure of internal formulas. Your automation script should either remove hidden sheets entirely or make them visible for review.
from openpyxl.worksheet.worksheet import Worksheet
def handle_hidden_sheets(filepath, output_path=None, action="remove"):
"""Detect and handle hidden/very-hidden sheets.
Args:
action: "remove" to delete hidden sheets,
"reveal" to make them visible,
"report" to just list them
"""
wb = load_workbook(filepath)
hidden_sheets = []
for sheet_name in wb.sheetnames:
ws = wb[sheet_name]
state = ws.sheet_state # 'visible', 'hidden', or 'veryHidden'
if state != 'visible':
hidden_sheets.append({
'name': sheet_name,
'state': state,
'rows': ws.max_row,
'cols': ws.max_column
})
if not hidden_sheets:
print("No hidden sheets found.")
return hidden_sheets
print(f"Found {len(hidden_sheets)} hidden sheet(s):")
for sheet in hidden_sheets:
print(f" - '{sheet['name']}' ({sheet['state']}, "
f"{sheet['rows']}x{sheet['cols']})")
if action == "remove":
for sheet in hidden_sheets:
del wb[sheet['name']]
print(f"Removed {len(hidden_sheets)} hidden sheet(s).")
elif action == "reveal":
for sheet in hidden_sheets:
wb[sheet['name']].sheet_state = 'visible'
print(f"Made {len(hidden_sheets)} sheet(s) visible.")
if action != "report":
save_path = output_path or filepath
wb.save(save_path)
    return hidden_sheets

def remove_defined_names(filepath, output_path=None,
keep_print_areas=False):
"""Remove defined names that could leak information."""
wb = load_workbook(filepath)
names_removed = []
# Build list of names to remove
names_to_remove = []
    for name in list(wb.defined_names):  # dict-like in openpyxl >= 3.1
        # Optionally keep print area and print title definitions
        if keep_print_areas and name in ('_xlnm.Print_Area',
                                         '_xlnm.Print_Titles'):
            continue
        names_to_remove.append(name)
for name in names_to_remove:
del wb.defined_names[name]
names_removed.append(name)
save_path = output_path or filepath
wb.save(save_path)
print(f"Removed {len(names_removed)} defined name(s): "
f"{', '.join(names_removed[:10])}")
    return names_removed

Beyond the standard core and extended properties, Excel files can contain custom document properties stored in docProps/custom.xml. Organizations often use these to store document classification levels, project codes, internal tracking IDs, workflow states, or DLP (Data Loss Prevention) tags. These custom properties can be extremely revealing about your internal processes and must be stripped before external sharing.
def remove_custom_properties(filepath, output_path=None):
"""Remove custom document properties (docProps/custom.xml)."""
import zipfile
import shutil
from lxml import etree
from io import BytesIO
output = output_path or filepath
temp_path = filepath + ".tmp"
properties_removed = []
with zipfile.ZipFile(filepath, 'r') as zin:
with zipfile.ZipFile(temp_path, 'w') as zout:
for item in zin.infolist():
data = zin.read(item.filename)
if item.filename == "docProps/custom.xml":
# Parse and log what we are removing
tree = etree.parse(BytesIO(data))
root = tree.getroot()
for prop in root:
name = prop.get('name', 'unknown')
properties_removed.append(name)
# Write an empty custom properties file
ns = 'http://schemas.openxmlformats.org/officeDocument/2006/custom-properties'
nsvt = 'http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes'
new_root = etree.Element(
f'{{{ns}}}Properties',
nsmap={None: ns, 'vt': nsvt})
data = etree.tostring(
new_root, xml_declaration=True,
encoding='UTF-8', standalone=True)
zout.writestr(item, data)
shutil.move(temp_path, output)
if properties_removed:
print(f"Removed {len(properties_removed)} custom properties: "
f"{', '.join(properties_removed)}")
    return properties_removed

Custom properties often contain the most organizationally sensitive metadata — document classification labels like “CONFIDENTIAL”, project tracking codes, and DRM markers that reveal internal systems.
Now that we have individual functions for each type of metadata, we can combine them into a comprehensive cleaning script. This unified approach ensures that no metadata category is overlooked and provides a consistent interface for all your automation needs.
import os
import shutil
import json
from datetime import datetime
from pathlib import Path
from openpyxl import load_workbook
class ExcelMetadataCleaner:
"""Comprehensive XLSX metadata removal tool."""
def __init__(self, filepath):
self.filepath = Path(filepath)
self.log = []
def _log(self, action, details):
entry = {
'timestamp': datetime.now().isoformat(),
'action': action,
'details': details
}
self.log.append(entry)
print(f" [{action}] {details}")
def clean_core_properties(self, wb):
"""Strip all core document properties."""
props = wb.properties
fields_cleared = []
if props.creator:
fields_cleared.append(f"creator='{props.creator}'")
props.creator = ""
if props.lastModifiedBy:
fields_cleared.append(
f"lastModifiedBy='{props.lastModifiedBy}'")
props.lastModifiedBy = ""
for field in ['title', 'subject', 'description',
'keywords', 'category', 'identifier',
'language', 'revision', 'version',
'contentStatus']:
val = getattr(props, field, None)
if val:
fields_cleared.append(f"{field}='{val}'")
setattr(props, field, "")
neutral = datetime(2000, 1, 1)
props.created = neutral
props.modified = neutral
self._log('core_properties',
f"Cleared {len(fields_cleared)} fields: "
f"{', '.join(fields_cleared[:5])}")
def clean_comments(self, wb):
"""Remove all cell comments across all sheets."""
total = 0
for ws in wb.worksheets:
count = 0
for row in ws.iter_rows():
for cell in row:
if cell.comment is not None:
cell.comment = None
count += 1
if count:
self._log('comments',
f"Sheet '{ws.title}': {count} comments")
total += count
if total == 0:
self._log('comments', 'No comments found')
def clean_hidden_sheets(self, wb, action="remove"):
"""Remove or reveal hidden sheets."""
hidden = [(name, wb[name].sheet_state)
for name in wb.sheetnames
if wb[name].sheet_state != 'visible']
if not hidden:
self._log('hidden_sheets', 'No hidden sheets found')
return
for name, state in hidden:
if action == "remove":
del wb[name]
self._log('hidden_sheets',
f"Removed '{name}' ({state})")
else:
wb[name].sheet_state = 'visible'
self._log('hidden_sheets',
f"Revealed '{name}' ({state})")
def clean_defined_names(self, wb):
"""Remove all defined names except print areas."""
removed = []
names_to_remove = []
        for name in list(wb.defined_names):  # dict-like in openpyxl >= 3.1
            if name not in ('_xlnm.Print_Area',
                            '_xlnm.Print_Titles'):
                names_to_remove.append(name)
for name in names_to_remove:
del wb.defined_names[name]
removed.append(name)
self._log('defined_names',
f"Removed {len(removed)} name(s)")
def clean(self, output_path=None, remove_hidden=True):
"""Run the full metadata cleaning pipeline."""
output = Path(output_path) if output_path else self.filepath
print(f"\nCleaning: {self.filepath.name}")
print("-" * 50)
# Phase 1: openpyxl-level cleaning
wb = load_workbook(str(self.filepath))
self.clean_core_properties(wb)
self.clean_comments(wb)
if remove_hidden:
self.clean_hidden_sheets(wb)
self.clean_defined_names(wb)
# Save intermediate result
temp_path = str(output) + ".intermediate.xlsx"
wb.save(temp_path)
wb.close()
# Phase 2: ZIP-level cleaning
self._clean_zip_level(temp_path, str(output))
os.remove(temp_path)
print(f"\nOutput saved: {output}")
return self.log
def _clean_zip_level(self, input_path, output_path):
"""Clean metadata at the ZIP/XML level."""
import zipfile
from lxml import etree
from io import BytesIO
temp = output_path + ".tmp"
with zipfile.ZipFile(input_path, 'r') as zin:
with zipfile.ZipFile(temp, 'w',
zipfile.ZIP_DEFLATED) as zout:
for item in zin.infolist():
data = zin.read(item.filename)
# Skip threaded comments
if ('threadedComments' in item.filename or
'persons/person' in item.filename):
self._log('zip_clean',
f"Removed: {item.filename}")
continue
# Clean app.xml
if item.filename == "docProps/app.xml":
data = self._clean_app_xml(data)
# Clean custom.xml
if item.filename == "docProps/custom.xml":
data = self._clean_custom_xml(data)
# Neutralize ZIP entry timestamps
item.date_time = (2000, 1, 1, 0, 0, 0)
zout.writestr(item, data)
shutil.move(temp, output_path)
self._log('zip_clean', 'ZIP-level cleaning complete')
def _clean_app_xml(self, data):
from lxml import etree
from io import BytesIO
tree = etree.parse(BytesIO(data))
root = tree.getroot()
ns = root.nsmap.get(None, '')
for tag in ['Company', 'Manager', 'Application',
'AppVersion']:
elem = root.find(f'{{{ns}}}{tag}')
if elem is not None:
self._log('app_properties',
f"Cleared {tag}='{elem.text}'")
elem.text = ""
return etree.tostring(tree, xml_declaration=True,
encoding='UTF-8', standalone=True)
def _clean_custom_xml(self, data):
from lxml import etree
from io import BytesIO
tree = etree.parse(BytesIO(data))
root = tree.getroot()
props = [prop.get('name', '?') for prop in root]
if props:
self._log('custom_properties',
f"Cleared: {', '.join(props)}")
ns = 'http://schemas.openxmlformats.org/officeDocument/2006/custom-properties'
nsvt = 'http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes'
new_root = etree.Element(f'{{{ns}}}Properties',
nsmap={None: ns, 'vt': nsvt})
return etree.tostring(new_root, xml_declaration=True,
                              encoding='UTF-8', standalone=True)

One of the biggest advantages of Python automation is the ability to process entire directories of files. Whether you are cleaning an export from a shared drive before sending it to an external auditor, or sanitizing all spreadsheets in a release package, batch processing saves hours of manual work and ensures consistency.
from pathlib import Path
import json
def batch_clean(input_dir, output_dir=None, recursive=True):
"""Clean all XLSX files in a directory.
Args:
input_dir: Directory containing XLSX files
output_dir: Where to save cleaned files (None = overwrite)
recursive: Whether to walk subdirectories
"""
input_path = Path(input_dir)
pattern = "**/*.xlsx" if recursive else "*.xlsx"
files = list(input_path.glob(pattern))
# Filter out temp files and already-cleaned files
files = [f for f in files if not f.name.startswith('~')
and '.cleaned.' not in f.name]
print(f"Found {len(files)} XLSX file(s) to process")
results = {'success': [], 'failed': [], 'skipped': []}
for i, filepath in enumerate(files, 1):
print(f"\n[{i}/{len(files)}] Processing: {filepath.name}")
# Determine output path
if output_dir:
out_dir = Path(output_dir) / filepath.parent.relative_to(
input_path)
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / filepath.name
else:
out_path = filepath
try:
cleaner = ExcelMetadataCleaner(str(filepath))
log = cleaner.clean(output_path=str(out_path))
results['success'].append({
'file': str(filepath),
'output': str(out_path),
'actions': len(log)
})
except Exception as e:
print(f" ERROR: {e}")
results['failed'].append({
'file': str(filepath),
'error': str(e)
})
# Print summary
print("\n" + "=" * 60)
print(f"Batch complete: {len(results['success'])} succeeded, "
f"{len(results['failed'])} failed")
# Save report
report_path = Path(output_dir or input_dir) / "cleaning_report.json"
with open(report_path, 'w') as f:
json.dump(results, f, indent=2)
print(f"Report saved: {report_path}")
    return results

For large batches (hundreds or thousands of files), you can speed up processing significantly with Python’s concurrent.futures module:
from concurrent.futures import ProcessPoolExecutor, as_completed
def clean_single_file(args):
filepath, output_path = args
cleaner = ExcelMetadataCleaner(filepath)
return cleaner.clean(output_path=output_path)
def batch_clean_parallel(input_dir, output_dir, max_workers=4):
"""Clean files in parallel using multiple CPU cores."""
files = list(Path(input_dir).glob("**/*.xlsx"))
tasks = [(str(f), str(Path(output_dir) / f.name)) for f in files]
with ProcessPoolExecutor(max_workers=max_workers) as executor:
futures = {executor.submit(clean_single_file, t): t
for t in tasks}
for future in as_completed(futures):
filepath = futures[future][0]
try:
future.result()
print(f"Done: {filepath}")
except Exception as e:
print(f"Failed: {filepath} - {e}")Wrapping your metadata removal logic in a proper CLI tool makes it accessible to non-developers and easy to invoke from shell scripts, Makefiles, and CI/CD pipelines. The click library provides a clean, Pythonic way to build command-line interfaces with argument parsing, help text, and validation built in.
#!/usr/bin/env python3
"""excel-clean: Strip metadata from XLSX files."""
import click
from pathlib import Path
from openpyxl import load_workbook

# ExcelMetadataCleaner is the class defined earlier in this guide;
# import it from wherever you saved it, e.g.:
# from excel_cleaner import ExcelMetadataCleaner
@click.group()
@click.version_option(version="1.0.0")
def cli():
"""Remove metadata from Excel XLSX files."""
pass
@cli.command()
@click.argument('files', nargs=-1, type=click.Path(exists=True))
@click.option('--output-dir', '-o', type=click.Path(),
help='Output directory (default: overwrite in place)')
@click.option('--keep-hidden/--remove-hidden', default=False,
help='Keep or remove hidden sheets')
@click.option('--dry-run', is_flag=True,
help='Report what would be removed without modifying')
def clean(files, output_dir, keep_hidden, dry_run):
"""Clean metadata from one or more XLSX files."""
if not files:
click.echo("No files specified. Use --help for usage.")
return
for filepath in files:
path = Path(filepath)
if path.is_dir():
xlsx_files = list(path.glob("**/*.xlsx"))
click.echo(f"Found {len(xlsx_files)} files in {path}")
else:
xlsx_files = [path]
for xlsx in xlsx_files:
if dry_run:
click.echo(f"[DRY RUN] Would clean: {xlsx}")
continue
out = Path(output_dir) / xlsx.name if output_dir else None
cleaner = ExcelMetadataCleaner(str(xlsx))
cleaner.clean(output_path=str(out) if out else None,
remove_hidden=not keep_hidden)
@cli.command()
@click.argument('file', type=click.Path(exists=True))
def inspect(file):
"""Show metadata present in an XLSX file without modifying."""
wb = load_workbook(file)
props = wb.properties
click.echo(f"\nMetadata for: {file}")
click.echo("=" * 50)
click.echo(f"Creator: {props.creator or '(empty)'}")
click.echo(f"Last Modified By:{props.lastModifiedBy or '(empty)'}")
click.echo(f"Created: {props.created}")
click.echo(f"Modified: {props.modified}")
click.echo(f"Title: {props.title or '(empty)'}")
click.echo(f"Company: (check docProps/app.xml)")
click.echo(f"Sheets: {len(wb.sheetnames)}")
hidden = [n for n in wb.sheetnames
if wb[n].sheet_state != 'visible']
click.echo(f"Hidden sheets: {len(hidden)}")
comment_count = sum(
1 for ws in wb.worksheets
for row in ws.iter_rows()
for cell in row if cell.comment)
click.echo(f"Comments: {comment_count}")
    name_count = len(wb.defined_names)  # dict-like in openpyxl >= 3.1
click.echo(f"Defined names: {name_count}")
if __name__ == '__main__':
    cli()

Usage examples:
# Clean a single file in place
python excel_clean.py clean report.xlsx

# Clean multiple files to a separate directory
python excel_clean.py clean *.xlsx -o ./cleaned/

# Inspect without modifying
python excel_clean.py inspect report.xlsx

# Dry run on a whole directory
python excel_clean.py clean ./reports/ --dry-run
A standalone cleaning script is useful, but the real power comes from integrating metadata removal into your existing workflows. Here are four proven integration patterns used by organizations that handle sensitive Excel files regularly.
Prevent metadata from ever reaching your repository by cleaning XLSX files automatically before each commit.
#!/bin/bash
# Pre-commit hook: clean XLSX metadata before committing
XLSX_FILES=$(git diff --cached --name-only --diff-filter=ACM \
| grep -i '\.xlsx$')
if [ -z "$XLSX_FILES" ]; then
exit 0
fi
echo "Cleaning metadata from staged XLSX files..."
for file in $XLSX_FILES; do
python excel_clean.py clean "$file"
git add "$file" # Re-stage the cleaned version
done
echo "XLSX metadata cleaned successfully."
exit 0

Clean uploaded Excel files on the fly in a web application before storing them.
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import FileResponse
import tempfile
import os
app = FastAPI()
@app.post("/api/clean-xlsx")
async def clean_xlsx(file: UploadFile = File(...)):
"""Upload an XLSX file and receive a cleaned version."""
if not file.filename.endswith('.xlsx'):
return {"error": "Only XLSX files are supported"}
# Save upload to temp file
with tempfile.NamedTemporaryFile(suffix='.xlsx',
delete=False) as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
# Clean metadata
output_path = tmp_path + ".cleaned.xlsx"
cleaner = ExcelMetadataCleaner(tmp_path)
cleaner.clean(output_path=output_path)
# Clean up input temp file
os.unlink(tmp_path)
return FileResponse(
output_path,
filename=f"cleaned_{file.filename}",
media_type="application/vnd.openxmlformats-officedocument"
".spreadsheetml.sheet"
    )

Automatically clean all XLSX files in a shared directory on a regular schedule.
# Run every day at 2 AM: clean all XLSX in the outbox folder
# Add to crontab with: crontab -e
0 2 * * * /opt/metadata-cleaner/venv/bin/python \
    /opt/metadata-cleaner/excel_clean.py clean \
    /shared/outbox/ -o /shared/outbox/cleaned/ \
    >> /var/log/metadata-cleaner.log 2>&1
Add metadata cleaning as a step in your CI/CD pipeline to ensure no XLSX files ship with sensitive metadata.
name: Clean Excel Metadata
on:
pull_request:
paths:
- '**/*.xlsx'
jobs:
clean-metadata:
runs-on: ubuntu-latest
steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # HEAD~1 must exist for the diff step below
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install dependencies
run: pip install openpyxl lxml click
- name: Find and clean XLSX files
run: |
CHANGED=$(git diff --name-only HEAD~1 | grep -i '\.xlsx$' || true)
if [ -n "$CHANGED" ]; then
python excel_clean.py clean $CHANGED
echo "Cleaned: $CHANGED"
fi
- name: Commit cleaned files
run: |
git config user.name "metadata-bot"
git config user.email "bot@example.com"
git add -A
git diff --cached --quiet || \
git commit -m "chore: strip XLSX metadata"Trust but verify. After cleaning, you should always confirm that metadata has actually been removed. A verification step is especially important when building automated pipelines, because a bug in your cleaning logic could silently pass through sensitive data for months before anyone notices.
def verify_clean(filepath):
"""Verify that an XLSX file has been properly cleaned.
Returns a dict of findings. An empty 'issues' list means
the file is clean.
"""
wb = load_workbook(filepath)
props = wb.properties
issues = []
# Check core properties
if props.creator:
issues.append(f"Creator still set: '{props.creator}'")
if props.lastModifiedBy:
issues.append(
f"LastModifiedBy still set: '{props.lastModifiedBy}'")
if props.title:
issues.append(f"Title still set: '{props.title}'")
# Check for comments
for ws in wb.worksheets:
for row in ws.iter_rows():
for cell in row:
if cell.comment:
issues.append(
f"Comment in {ws.title}!{cell.coordinate}")
# Check for hidden sheets
for name in wb.sheetnames:
if wb[name].sheet_state != 'visible':
issues.append(f"Hidden sheet: '{name}'")
# Check ZIP-level metadata
import zipfile
with zipfile.ZipFile(filepath, 'r') as z:
for item in z.infolist():
if 'threadedComments' in item.filename:
issues.append(
f"Threaded comments remain: {item.filename}")
            if 'custom.xml' in item.filename:
                # ZipFile.read() always returns bytes, so compare bytes
                data = z.read(item.filename)
                if b'<property' in data.lower():
                    issues.append("Custom properties still present")
if issues:
print(f"FAIL - {len(issues)} issue(s) found:")
for issue in issues:
print(f" - {issue}")
else:
print("PASS - File is clean")
return {'filepath': filepath, 'clean': len(issues) == 0,
'issues': issues}Building reliable metadata removal scripts requires awareness of several edge cases and gotchas that can silently undermine your cleaning efforts. Here are the most common pitfalls encountered in production environments.
When you delete a cell’s value, the original string may remain in the shared string table (xl/sharedStrings.xml). openpyxl handles this correctly when you save, but if you are manipulating XML directly, you must rebuild the shared string table to avoid leaving orphaned data that can be recovered.
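A quick way to audit for this is to search the shared string table directly at the ZIP level. A small stdlib-only sketch (the function name and return convention are ours):

```python
import zipfile

def shared_strings_residue(path, needle):
    """Check whether a string still lurks in xl/sharedStrings.xml.

    `needle` is text you expect to be gone, e.g. a name that was
    deleted from a cell. Returns True if it is still recoverable.
    """
    with zipfile.ZipFile(path) as z:
        if "xl/sharedStrings.xml" not in z.namelist():
            return False  # workbook has no shared string table
        return needle.encode("utf-8") in z.read("xl/sharedStrings.xml")
```

Running this check after cleaning is cheap insurance against orphaned strings surviving a direct-XML edit.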
Pivot tables store a complete copy of their source data in a cache (xl/pivotCache/). Even if you delete the source sheet, the cache retains all the original data. Always check for and clean pivot table caches separately, or remove them entirely if the pivot table is not needed.
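Removing the cache parts follows the same ZIP-rewrite pattern used for threaded comments above. A hedged sketch (the helper name is ours; note that relationship entries pointing at the dropped parts should also be cleaned, as shown in the threaded-comments function):

```python
import zipfile

def strip_pivot_caches(src, dst):
    """Copy an XLSX archive, dropping pivot cache parts (xl/pivotCache/).

    Returns the list of removed part names. This will break any pivot
    tables in the workbook, so use it only when they are not needed.
    """
    removed = []
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if "pivotCache" in item.filename:
                removed.append(item.filename)
                continue  # skip the cached source data entirely
            zout.writestr(item, zin.read(item.filename))
    return removed
```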
openpyxl cannot open encrypted or password-protected XLSX files. Your script should detect these and either skip them with a warning or use msoffcrypto-tool to decrypt first (if you have the password). Silently failing on encrypted files means they pass through with all metadata intact.
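Detection is straightforward because encrypted Office files are stored as OLE2 compound documents rather than ZIP archives. A heuristic sketch using only the standard library (the helper name is ours):

```python
import zipfile

# First 8 bytes of an OLE2 compound file (used by encrypted Office files)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

def is_encrypted_xlsx(path):
    """Heuristic check: encrypted workbooks are OLE containers, not ZIPs."""
    with open(path, "rb") as f:
        header = f.read(8)
    return header == OLE_MAGIC and not zipfile.is_zipfile(path)
```

Files that test True here should be routed to a decryption step (or flagged loudly) instead of being passed through the cleaner untouched.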
Embedded images (JPEG, PNG) carry their own EXIF metadata including camera model, GPS coordinates, and timestamps. Excel stores these in xl/media/. Your cleaning script should strip EXIF data from embedded images using a library like Pillow or remove embedded objects entirely.
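A first step is simply finding the embedded images so you know EXIF stripping is needed. This stdlib-only sketch (helper name and extension list are ours) lists candidates; actually removing EXIF then means re-encoding each image, for example with Pillow, and writing the cleaned bytes back into the archive:

```python
import zipfile

# Raster formats that can carry EXIF or similar embedded metadata
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".tiff")

def list_embedded_media(path):
    """List embedded images (xl/media/) that may carry EXIF metadata."""
    with zipfile.ZipFile(path) as z:
        return [name for name in z.namelist()
                if name.startswith("xl/media/")
                and name.lower().endswith(IMAGE_EXTS)]
```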
If you are processing .xlsm files (macro-enabled workbooks), the VBA project contains its own metadata: module names, developer comments, project properties, and sometimes developer names. openpyxl preserves but does not provide API access to VBA content. Use oletools for VBA-level cleaning.
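At minimum, your pipeline should detect when a file carries a VBA project so it can be routed to that extra cleaning step. The VBA project is stored as a single binary part, so detection is a one-liner at the ZIP level (the helper name is ours):

```python
import zipfile

def has_vba_project(path):
    """Detect the VBA project part of a macro-enabled workbook."""
    with zipfile.ZipFile(path) as z:
        return "xl/vbaProject.bin" in z.namelist()
```

Files that test True can then be handed to an oletools-based step for VBA-level metadata review.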