When you need to audit metadata across hundreds or thousands of Excel files—for compliance sweeps, forensic investigations, or data governance programs—manual inspection is impossible. You need automated pipelines that can walk directories, extract metadata from every XLSX file, store results in a queryable format, and surface findings through reports and dashboards.
A single Excel file's metadata tells a story. Thousands of Excel files' metadata tells you the state of your organization. Bulk processing reveals patterns that individual inspection never could: which departments are leaking author names, how many files contain hidden sheets, whether legacy files still carry former employee credentials, and which documents were created by unexpected applications.
Whether you're preparing for a GDPR audit, investigating a data breach, migrating file shares to the cloud, or enforcing metadata hygiene policies, the challenge is the same: you need to process files at scale without sacrificing accuracy or overwhelming your infrastructure.
These use cases fall into two broad buckets: **Compliance & Governance** (audit preparation, breach forensics, policy enforcement) and **Migration & Cleanup** (cloud migrations, file-share consolidation, metadata hygiene).
A robust bulk processing pipeline has five stages: discovery, extraction, storage, analysis, and reporting. Each stage can be scaled independently, and failures at any stage shouldn't lose work from previous stages.
```text
Discovery → Extraction → Storage → Analysis → Reporting
Walk directories → Parse ZIP/XML → Database/CSV → Query patterns → Dashboards/alerts
```
Recursively walk file systems, network shares, or cloud storage to build an inventory of all XLSX files. Filter by extension, size, and modification date. Track which files have been processed to support incremental runs.
Open each XLSX file as a ZIP archive and parse the XML metadata files (core.xml, app.xml, workbook.xml). Extract author, timestamps, revision count, application version, hidden sheets, and custom properties. Handle corrupt files gracefully.
Persist extracted metadata to a queryable store—SQLite for local work, PostgreSQL for team access, or CSV/Parquet for data analysis tools. Include the file path, hash, and extraction timestamp alongside the metadata fields.
Run queries against stored metadata to find patterns: files with personal author names, documents created by unexpected applications, hidden sheets, stale files with high revision counts, or timestamp anomalies suggesting tampering.
Generate summary reports, export filtered results, trigger alerts for policy violations, or feed data into dashboards. Provide both executive summaries and drill-down detail for technical teams.
The discovery stage needs to be fast and resumable. Walking a file server with millions of files can take hours, so you want to separate the file inventory from the metadata extraction. This lets you restart extraction without re-scanning the entire file system.
```python
import os
import sqlite3
from datetime import datetime

def discover_xlsx_files(root: str, db_path: str) -> int:
    """Walk directory tree and register XLSX files in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path TEXT PRIMARY KEY,
            size INTEGER,
            mtime TEXT,
            discovered_at TEXT,
            processed INTEGER DEFAULT 0
        )
    """)
    count = 0
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            if not fname.lower().endswith(".xlsx"):
                continue
            full = os.path.join(dirpath, fname)
            stat = os.stat(full)
            conn.execute(
                "INSERT OR IGNORE INTO files "
                "(path, size, mtime, discovered_at) "
                "VALUES (?, ?, ?, ?)",
                (full, stat.st_size,
                 datetime.fromtimestamp(stat.st_mtime).isoformat(),
                 datetime.now().isoformat()),
            )
            count += 1
    conn.commit()
    conn.close()
    return count
```
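If the plain `os.walk` loop above proves slow on very large trees, a generator built directly on `os.scandir` can help: each directory entry carries cached stat information on most platforms, so you avoid a separate system call per file. A minimal sketch under that assumption (the `walk_xlsx` name is ours, not part of the pipeline code above):

```python
import os

def walk_xlsx(root: str):
    """Yield (path, size, mtime) for each .xlsx under root, using
    os.scandir so each entry's stat info is cached where the OS allows."""
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif (entry.name.lower().endswith(".xlsx")
                      and not entry.name.startswith("~$")):  # skip Excel lock files
                    st = entry.stat()
                    yield entry.path, st.st_size, st.st_mtime
```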
Two practical tips for the discovery walk:

- Prefer `os.scandir()` over `os.listdir()`: it returns directory entries with cached stat results, avoiding a separate system call per file.
- Skip `~$*.xlsx` files (Excel lock files) and files in `$RECYCLE.BIN` or `.Trash` directories.

The extraction engine is the core of your pipeline. It opens each XLSX file as a ZIP archive, reads the relevant XML entries, parses the metadata fields, and returns a structured result. The key design decisions are: what to extract, how to handle errors, and how to parallelize the work.
```python
import zipfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

@dataclass
class FileMetadata:
    path: str
    creator: str = ""
    last_modified_by: str = ""
    created: str = ""
    modified: str = ""
    revision: str = ""
    application: str = ""
    company: str = ""
    sheet_count: int = 0
    hidden_sheets: list[str] = field(default_factory=list)
    error: Optional[str] = None

def extract_metadata(filepath: str) -> FileMetadata:
    """Extract metadata from a single XLSX file."""
    result = FileMetadata(path=filepath)
    try:
        with zipfile.ZipFile(filepath, "r") as zf:
            # Core properties
            if "docProps/core.xml" in zf.namelist():
                tree = ET.fromstring(zf.read("docProps/core.xml"))
                result.creator = _text(tree, "dc:creator")
                result.last_modified_by = _text(tree, "cp:lastModifiedBy")
                result.created = _text(tree, "dcterms:created")
                result.modified = _text(tree, "dcterms:modified")
                result.revision = _text(tree, "cp:revision")
            # App properties
            if "docProps/app.xml" in zf.namelist():
                tree = ET.fromstring(zf.read("docProps/app.xml"))
                result.application = _text(tree, "ep:Application")
                result.company = _text(tree, "ep:Company")
            # Sheet visibility
            if "xl/workbook.xml" in zf.namelist():
                wb = ET.fromstring(zf.read("xl/workbook.xml"))
                ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
                sheets = wb.findall(f"{{{ns}}}sheets/{{{ns}}}sheet")
                result.sheet_count = len(sheets)
                result.hidden_sheets = [
                    s.get("name", "")
                    for s in sheets
                    if s.get("state") in ("hidden", "veryHidden")
                ]
    except zipfile.BadZipFile:
        result.error = "Not a valid ZIP/XLSX file"
    except Exception as e:
        result.error = str(e)
    return result

def _text(tree: ET.Element, tag: str) -> str:
    el = tree.find(tag, NS)
    return (el.text or "") if el is not None else ""
```
When processing thousands of files, you will encounter corrupt files, password-protected archives, files with unexpected encodings, and files that are actually XLS (old binary format) renamed to .xlsx. Your extractor must:
- Enforce per-file timeouts using `signal.alarm()` or threading timeouts, so one pathological file cannot hang the run.
- Stream rather than slurp: open `zipfile.ZipFile` in read mode to stream entries instead of reading whole archives into memory.

Metadata extraction is I/O-bound (reading files from disk or network) with bursts of CPU work (parsing XML). This makes it ideal for parallel processing. Python's `concurrent.futures` module provides a clean API for both thread-based and process-based parallelism.
```python
import sqlite3
from concurrent.futures import ProcessPoolExecutor, as_completed

from extractor import extract_metadata, FileMetadata

def process_all(db_path: str, workers: int = 8) -> dict:
    """Process all unprocessed files in the database."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT path FROM files WHERE processed = 0"
    ).fetchall()
    paths = [r[0] for r in rows]
    conn.close()

    stats = {"total": len(paths), "success": 0, "errors": 0}
    results: list[FileMetadata] = []

    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(extract_metadata, p): p
            for p in paths
        }
        for future in as_completed(futures):
            meta = future.result()
            results.append(meta)
            if meta.error:
                stats["errors"] += 1
            else:
                stats["success"] += 1

    # Store results back in database
    store_results(db_path, results)
    return stats
```
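Per-file timeouts can be layered onto this pattern without `signal.alarm()` (which is Unix-only and main-thread-only) by passing a timeout to `future.result()`. Note the caveat in the comments: this caps how long you wait, not how long the worker runs. A minimal sketch, with a hypothetical `slow_extract` standing in for a file that hangs:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_extract(path: str) -> str:
    """Stand-in for extract_metadata on a pathological file."""
    time.sleep(0.5)
    return path

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_extract, "stuck.xlsx")
    try:
        meta = future.result(timeout=0.05)  # give up after 50 ms
    except TimeoutError:
        # The worker keeps running to completion in the background;
        # the timeout only bounds how long *we* wait. Record the path
        # with a "timeout" error and move on.
        meta = None
```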
How many workers? It depends on where the files live:

- **Local disk (SSD):** 8–16 workers. SSD throughput is the bottleneck. Expect 500–2,000 files/minute depending on file size.
- **Network shares (SMB/NFS):** 4–8 workers. Network latency dominates. Copying files locally first can yield a 10x speed improvement.
- **Cloud storage:** 16–32 workers with async I/O. Use boto3 for S3, azure-storage-blob for Azure. Download and extract in parallel.
Raw extraction results need to be stored in a format that supports both ad-hoc queries and structured reporting. SQLite works well for standalone tools processing up to a few hundred thousand files. For larger datasets or team access, PostgreSQL or a data warehouse provides better concurrency and query performance.
```python
import sqlite3
from datetime import datetime

def store_results(db_path: str, results: list) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS metadata (
            path TEXT PRIMARY KEY,
            creator TEXT,
            last_modified_by TEXT,
            created TEXT,
            modified TEXT,
            revision TEXT,
            application TEXT,
            company TEXT,
            sheet_count INTEGER,
            hidden_sheets TEXT,
            error TEXT,
            extracted_at TEXT
        )
    """)
    for m in results:
        conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES "
            "(?,?,?,?,?,?,?,?,?,?,?,?)",
            (m.path, m.creator, m.last_modified_by,
             m.created, m.modified, m.revision,
             m.application, m.company, m.sheet_count,
             ",".join(m.hidden_sheets),
             m.error, datetime.now().isoformat()),
        )
        # Mark file as processed in discovery table
        conn.execute(
            "UPDATE files SET processed = 1 WHERE path = ?",
            (m.path,),
        )
    conn.commit()
    conn.close()
```
Once your metadata is in a database, you can run powerful queries to surface compliance issues, security risks, and organizational patterns. Here are queries that deliver immediate value.
**Files exposing personal author names:**

```sql
SELECT path, creator, last_modified_by
FROM metadata
WHERE creator != ''
  AND creator NOT IN ('Admin', 'System', 'Template')
ORDER BY creator;
```

**Files containing hidden sheets:**

```sql
SELECT path, hidden_sheets, sheet_count
FROM metadata
WHERE hidden_sheets != ''
ORDER BY sheet_count DESC;
```

**Most prolific authors:**

```sql
SELECT creator, COUNT(*) AS file_count
FROM metadata
WHERE creator != '' AND error IS NULL
GROUP BY creator
ORDER BY file_count DESC
LIMIT 50;
```

**Documents created by non-Excel applications:**

```sql
SELECT application, COUNT(*) AS file_count
FROM metadata
WHERE application NOT LIKE '%Excel%'
  AND application != ''
GROUP BY application
ORDER BY file_count DESC;
```
Wrapping the pipeline stages into a CLI tool makes it easy to run scans on-demand, schedule them with cron, or integrate them into CI/CD workflows. A well-designed CLI separates the discovery, extraction, and reporting steps so they can be run independently or chained together.
```python
import argparse

from discover import discover_xlsx_files
from parallel_processor import process_all
from reports import generate_report

def main():
    parser = argparse.ArgumentParser(
        description="Bulk Excel metadata processor"
    )
    sub = parser.add_subparsers(dest="command")

    # Discover command
    disc = sub.add_parser("discover",
                          help="Scan directories for XLSX files")
    disc.add_argument("root",
                      help="Root directory to scan")
    disc.add_argument("--db", default="metadata.db",
                      help="SQLite database path")

    # Extract command
    ext = sub.add_parser("extract",
                         help="Extract metadata from discovered files")
    ext.add_argument("--db", default="metadata.db")
    ext.add_argument("--workers", type=int, default=8)

    # Report command
    rep = sub.add_parser("report",
                         help="Generate metadata analysis report")
    rep.add_argument("--db", default="metadata.db")
    rep.add_argument("--format",
                     choices=["text", "csv", "json"],
                     default="text")

    args = parser.parse_args()
    if args.command == "discover":
        count = discover_xlsx_files(args.root, args.db)
        print(f"Discovered {count} XLSX files")
    elif args.command == "extract":
        stats = process_all(args.db, args.workers)
        print(f"Processed: {stats['total']} files")
        print(f"Success: {stats['success']}")
        print(f"Errors: {stats['errors']}")
    elif args.command == "report":
        generate_report(args.db, args.format)

if __name__ == "__main__":
    main()
```
```bash
# Step 1: Discover files on a network share
$ python cli.py discover /mnt/shared-drive --db audit.db
Discovered 14,832 XLSX files

# Step 2: Extract metadata with 12 workers
$ python cli.py extract --db audit.db --workers 12
Processed: 14,832 files | Success: 14,607 | Errors: 225

# Step 3: Generate CSV report for the compliance team
$ python cli.py report --db audit.db --format csv > report.csv
```
The value of bulk processing is in the analysis. Different stakeholders need different views: executives want summary statistics, compliance teams want policy violation lists, and IT teams want error breakdowns. Design your reporting to serve all three audiences.
```python
import csv
import json
import sqlite3
import sys

def generate_report(db_path: str, fmt: str) -> None:
    conn = sqlite3.connect(db_path)
    summary = {
        "total_files": _scalar(conn,
            "SELECT COUNT(*) FROM metadata"),
        "files_with_errors": _scalar(conn,
            "SELECT COUNT(*) FROM metadata "
            "WHERE error IS NOT NULL"),
        "files_with_author": _scalar(conn,
            "SELECT COUNT(*) FROM metadata "
            "WHERE creator != ''"),
        "files_with_hidden_sheets": _scalar(conn,
            "SELECT COUNT(*) FROM metadata "
            "WHERE hidden_sheets != ''"),
        "unique_authors": _scalar(conn,
            "SELECT COUNT(DISTINCT creator) "
            "FROM metadata WHERE creator != ''"),
        "unique_applications": _scalar(conn,
            "SELECT COUNT(DISTINCT application) "
            "FROM metadata WHERE application != ''"),
    }
    if fmt == "json":
        print(json.dumps(summary, indent=2))
    elif fmt == "csv":
        w = csv.writer(sys.stdout)
        w.writerow(summary.keys())
        w.writerow(summary.values())
    else:
        for k, v in summary.items():
            print(f"{k:<30} {v:>10}")

def _scalar(conn, query: str) -> int:
    return conn.execute(query).fetchone()[0]
```
The summary serves two audiences at once: compliance metrics (author exposure, hidden sheets, unique authors) for governance teams, and operational metrics (total files, error counts, unique applications) for IT.
Real-world file systems are messy. Your pipeline will encounter edge cases that never appear when testing with a handful of clean files. Here are the most common issues and how to handle them.
Password-protected XLSX files encrypt the ZIP entries but often leave the file structure readable. You can detect protection without cracking the password by checking for EncryptedPackage in the OLE container or a [Content_Types].xml entry that references encryption.
Strategy: Log these files as “encrypted” in the error field and flag them for manual review rather than attempting decryption.
Users sometimes rename .xls files to .xlsx without actually converting them. These files fail ZIP parsing because the old binary format is not a ZIP archive. Your extractor's BadZipFile handler catches these automatically.
Strategy: Check the file's magic bytes. If the file starts with `D0 CF 11 E0` (OLE2 header), it's an old-format XLS file. Use a library like olefile or xlrd to extract metadata from these.
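The magic-byte check takes only a few lines. Note that the same OLE2 header also shows up on encrypted XLSX files, which Office wraps in an OLE2 container, so one sniff covers both edge cases (the `sniff_format` name is ours):

```python
ZIP_MAGIC = b"PK\x03\x04"                            # OOXML files are ZIPs
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"     # OLE2/CFB container

def sniff_format(path: str) -> str:
    """Classify a file by its leading bytes rather than its extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "xlsx"      # genuine ZIP-based OOXML file
    if head.startswith(OLE2_MAGIC):
        return "ole2"      # old binary .xls, or an encrypted .xlsx
    return "unknown"
```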
Shared drives often contain many copies of the same file. Processing duplicates wastes time and inflates your statistics. Use content hashing to deduplicate: hash the file contents (or the first 8 KB for speed) and skip files with a known hash.
Strategy: Add a content_hash column to your discovery table. After extraction, group duplicates and report them separately—duplicate counts are useful metrics for storage optimization.
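A minimal version of that hashing step, assuming the first 8 KB is enough to distinguish files in practice (hashing full contents is the safe fallback when collisions matter):

```python
import hashlib

def quick_hash(path: str, head_bytes: int = 8192) -> str:
    """Hash the first head_bytes of a file for fast deduplication."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(head_bytes))
    return h.hexdigest()

def dedupe(paths: list[str]) -> dict[str, list[str]]:
    """Group paths by content hash; any group with >1 entry is a duplicate set."""
    groups: dict[str, list[str]] = {}
    for p in paths:
        groups.setdefault(quick_hash(p), []).append(p)
    return groups
```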
Files over 100 MB can slow down extraction significantly if your code reads the entire archive. Since metadata XML files are typically under 10 KB, you only need to read specific ZIP entries. Python's `zipfile` module supports this natively—just open the specific entry without extracting the full archive.
Strategy: Set a file size threshold (e.g., 200 MB). For files above the threshold, use streaming extraction that reads only the ZIP central directory and the target XML entries.
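Reading a single entry looks like this: `ZipFile` parses only the central directory on open, and `zf.open` then seeks directly to the one entry, so a 2 GB workbook costs little more than a 20 KB one. The `limit` parameter is our addition, a guard against entries that decompress to something enormous:

```python
import zipfile

def read_core_xml(path: str, limit: int = 64 * 1024) -> bytes:
    """Read only docProps/core.xml from a (possibly huge) XLSX file.

    Only the central directory plus the one target entry are read;
    the worksheet data is never touched. Returns b"" if the entry
    is absent. limit caps how many decompressed bytes we accept.
    """
    with zipfile.ZipFile(path) as zf:
        if "docProps/core.xml" not in zf.namelist():
            return b""
        with zf.open("docProps/core.xml") as entry:
            return entry.read(limit)
```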
One-time scans are useful, but ongoing monitoring catches new files and policy violations as they appear. By scheduling your pipeline to run incrementally, you maintain a continuously updated inventory of metadata across your organization.
The discovery layer tracks which files have been processed. On subsequent runs, it only registers new files (the INSERT OR IGNORE skips known paths) and the extraction stage only processes rows where processed = 0.
For files that change after processing, reset the processed flag for changed files (compare mtime).

```bash
# Nightly discovery + extraction
0 2 * * * python cli.py discover /data --db meta.db
0 3 * * * python cli.py extract --db meta.db

# Weekly compliance report
0 8 * * 1 python cli.py report --db meta.db \
    --format csv > /reports/weekly.csv
```
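One way to reset the flag during discovery is a SQLite upsert (requires SQLite 3.24+) that clears `processed` whenever the stored mtime no longer matches the current one. A sketch against the `files` table from the discovery stage (the `register` helper is ours):

```python
import sqlite3

UPSERT = """
INSERT INTO files (path, size, mtime, discovered_at)
VALUES (?, ?, ?, ?)
ON CONFLICT(path) DO UPDATE SET
    size = excluded.size,
    mtime = excluded.mtime,
    processed = 0                     -- force re-extraction
WHERE files.mtime != excluded.mtime   -- no-op when the file is unchanged
"""

def register(conn: sqlite3.Connection, path: str, size: int,
             mtime: str, now: str) -> None:
    """Insert a newly seen file, or flag a changed one for re-processing."""
    conn.execute(UPSERT, (path, size, mtime, now))
```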
Separate the discovery and extraction steps so a slow network scan doesn't delay the extraction of already-discovered files.
Processing speed depends on storage type, file size distribution, and worker count. Here are realistic benchmarks from processing real-world file shares.
| Storage | Workers | Files/min | Time per 10K files |
|---|---|---|---|
| Local SSD | 8 | 1,500–2,000 | 5–7 min |
| Local HDD | 4 | 400–800 | 13–25 min |
| SMB/NFS share | 8 | 200–500 | 20–50 min |
| S3 (same region) | 32 | 800–1,200 | 8–13 min |
| Azure Blob | 32 | 600–1,000 | 10–17 min |
- Use `ProcessPoolExecutor` for CPU-bound local parsing and `ThreadPoolExecutor` for I/O-bound cloud downloads. Mix both in a two-stage pipeline.
- Profile with `cProfile` to identify actual bottlenecks. XML parsing is rarely the slow part—disk I/O usually is.

Building a file inventory first lets you track progress, resume interrupted runs, and re-process files without re-scanning the entire file system.
At scale, you will encounter corrupt files, unexpected formats, and permission errors. Every extraction must be wrapped in error handling that logs failures without stopping the pipeline.
Metadata extraction is I/O-bound. Use process pools for local files and thread pools for cloud downloads. Match your worker count to your storage throughput, not your CPU core count.
SQLite handles hundreds of thousands of records locally. Use PostgreSQL for team access. The ability to run SQL queries against your metadata inventory is where the real analytical value emerges.
Before building a bulk processing pipeline, use MetaData Analyzer to see what metadata lives inside your Excel files. Upload a sample file and explore the author names, timestamps, hidden sheets, and application fingerprints that your pipeline will need to extract at scale.