Every Excel file that enters your system carries hidden intelligence—author names, company identifiers, revision history, timestamps, and sometimes sensitive content in hidden sheets. Building metadata extraction into your applications lets you inspect, validate, and act on this data automatically, whether you're building a compliance tool, a document management system, or a security scanning pipeline.
Most organizations treat Excel metadata as an afterthought—something to strip out before sharing, if anyone remembers. But metadata is actually a rich data source that can power document classification, compliance checks, audit logging, and security scanning. The key is extracting it programmatically so your application can make decisions based on what it finds.
Whether you're building a SaaS document portal, an internal compliance dashboard, or a file upload endpoint that needs to screen incoming spreadsheets, integrating metadata analysis gives your application the ability to understand documents at a deeper level than just their visible cell data.
Two integration directions come up repeatedly: inbound file processing (screening spreadsheets as they enter your system) and outbound file screening (checking files for metadata leakage before they leave your organization).
Before writing extraction code, you need to understand where metadata lives inside an XLSX file. An XLSX file is a ZIP archive containing XML files organized by the Open Packaging Conventions (OPC) standard. Your API needs to read specific files within this archive to extract different categories of metadata.
| Archive Path | Contains | Key Fields |
|---|---|---|
| docProps/core.xml | Identity & timestamps | creator, lastModifiedBy, created, modified, revision |
| docProps/app.xml | Application fingerprint | Application, AppVersion, Company, Manager, TotalTime |
| docProps/custom.xml | Custom properties | DLP labels, SharePoint IDs, classification tags |
| xl/workbook.xml | Workbook structure | Sheet names, defined names, hidden sheet states |
| xl/sharedStrings.xml | Text content pool | All unique strings, including orphaned entries |
| xl/comments*.xml | Cell comments | Comment text with author attribution |
Don't extract everything upfront. Design your API to parse only the XML files requested by the caller. Extracting core properties from core.xml is fast (a few kilobytes), but parsing sharedStrings.xml in a large workbook can mean processing megabytes of XML. Lazy parsing keeps your API responsive and reduces memory usage.
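The layout and the lazy-read tip above can be demonstrated with the standard library alone: build a tiny stand-in archive, list its entries, and decompress just one. (The archive contents here are placeholders, not a real workbook.)

```python
import io
import zipfile

# Build a tiny stand-in archive with the OPC paths we care about.
# (A real .xlsx contains more parts; these placeholders are just
# enough to demonstrate the layout.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docProps/core.xml", "<coreProperties/>")
    zf.writestr("docProps/app.xml", "<Properties/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")

with zipfile.ZipFile(buf) as zf:
    # Listing entries touches only the ZIP central directory
    names = zf.namelist()
    # Lazy read: decompress just the one entry we need
    core_xml = zf.read("docProps/core.xml")

print(names)
print(core_xml)
```

Reading docProps/core.xml this way never decompresses the worksheet data, which is why core-property extraction stays fast even on large files.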
Python is the most natural choice for server-side metadata extraction thanks to mature libraries like openpyxl for high-level access and the standard library's zipfile and xml.etree modules for low-level archive manipulation. Here's how to build a reusable extraction module that you can integrate into any Python web framework.
# Core metadata extraction module
# pip install openpyxl lxml
import zipfile
import io
from typing import Any
from lxml import etree
from openpyxl import load_workbook
from dataclasses import dataclass, asdict
@dataclass
class CoreProperties:
creator: str = ""
last_modified_by: str = ""
created: str = ""
modified: str = ""
title: str = ""
subject: str = ""
description: str = ""
keywords: str = ""
revision: str = ""
@dataclass
class AppProperties:
application: str = ""
app_version: str = ""
company: str = ""
manager: str = ""
total_time: str = ""
@dataclass
class SheetInfo:
name: str = ""
state: str = "visible"
has_comments: bool = False
class ExcelMetadataExtractor:
"""Extract metadata from XLSX files."""
def __init__(self, file_bytes: bytes):
self._data = io.BytesIO(file_bytes)
self._validate()
def _validate(self):
if not zipfile.is_zipfile(self._data):
raise ValueError("Not a valid XLSX file")
self._data.seek(0)
def get_core_properties(self) -> CoreProperties:
with zipfile.ZipFile(self._data) as zf:
try:
xml = zf.read("docProps/core.xml")
except KeyError:
return CoreProperties()
        # Disable entity resolution to block XXE from crafted files
        parser = etree.XMLParser(resolve_entities=False)
        root = etree.parse(io.BytesIO(xml), parser).getroot()
ns = {
"dc": "http://purl.org/dc/elements/1.1/",
"cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
"dcterms": "http://purl.org/dc/terms/",
}
def text(xpath: str) -> str:
el = root.find(xpath, ns)
return el.text if el is not None and el.text else ""
return CoreProperties(
creator=text("dc:creator"),
last_modified_by=text("cp:lastModifiedBy"),
created=text("dcterms:created"),
modified=text("dcterms:modified"),
title=text("dc:title"),
subject=text("dc:subject"),
description=text("dc:description"),
keywords=text("cp:keywords"),
revision=text("cp:revision"),
)
def get_app_properties(self) -> AppProperties:
with zipfile.ZipFile(self._data) as zf:
try:
xml = zf.read("docProps/app.xml")
except KeyError:
return AppProperties()
        # Disable entity resolution to block XXE from crafted files
        parser = etree.XMLParser(resolve_entities=False)
        root = etree.parse(io.BytesIO(xml), parser).getroot()
        ns = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

        def text(tag: str) -> str:
            el = root.find(f"{{{ns}}}{tag}")
            return el.text if el is not None and el.text else ""
return AppProperties(
application=text("Application"),
app_version=text("AppVersion"),
company=text("Company"),
manager=text("Manager"),
total_time=text("TotalTime"),
)
    def get_sheets(self) -> list[SheetInfo]:
        self._data.seek(0)  # rewind after earlier reads
        wb = load_workbook(self._data, read_only=True)
        sheets = []
        for name in wb.sheetnames:
            ws = wb[name]
            state = ws.sheet_state or "visible"
            # Comment detection requires parsing each sheet's
            # relationships; has_comments stays False in this module.
            sheets.append(SheetInfo(name=name, state=state))
        wb.close()
        return sheets
def to_dict(self) -> dict[str, Any]:
return {
"core": asdict(self.get_core_properties()),
"app": asdict(self.get_app_properties()),
"sheets": [asdict(s) for s in self.get_sheets()],
}
This module gives you a clean, typed interface for extracting metadata. The ExcelMetadataExtractor class accepts raw bytes (perfect for file uploads), validates that the input is actually a ZIP archive, and provides separate methods for each metadata category. The to_dict() method produces a JSON-serializable output suitable for API responses.
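If you can't take the lxml and openpyxl dependencies everywhere, the core.xml parse needs only the standard library. A minimal stdlib-only sketch (the stand-in archive below is hand-built for demonstration):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dcterms": "http://purl.org/dc/terms/",
}

def core_properties_stdlib(file_bytes: bytes) -> dict:
    """Read docProps/core.xml using only the standard library."""
    with zipfile.ZipFile(io.BytesIO(file_bytes)) as zf:
        try:
            xml = zf.read("docProps/core.xml")
        except KeyError:
            return {}
    root = ET.fromstring(xml)

    def text(path: str) -> str:
        el = root.find(path, NS)
        return el.text if el is not None and el.text else ""

    return {
        "creator": text("dc:creator"),
        "last_modified_by": text("cp:lastModifiedBy"),
        "created": text("dcterms:created"),
        "modified": text("dcterms:modified"),
    }

# Demo with a hand-built stand-in archive:
core = (
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dc:creator>Sarah Chen</dc:creator>"
    "<cp:lastModifiedBy>Michael Torres</cp:lastModifiedBy>"
    "</cp:coreProperties>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docProps/core.xml", core)
props = core_properties_stdlib(buf.getvalue())
print(props)
```

Note that xml.etree refuses to expand external entities by default, but it also lacks lxml's performance on large documents; pick based on your deployment constraints.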
Accepting bytes instead of file paths makes the class work with upload streams, cloud storage blobs, and in-memory buffers without touching the filesystem. Passing read_only=True to openpyxl prevents the library from loading cell data into memory, which matters for large files.

With the extraction module in place, wrapping it in a REST API takes minimal code. Here's a FastAPI implementation that accepts file uploads and returns structured metadata as JSON. This pattern works for any Python web framework—Flask, Django, or Starlette—with minor adjustments to the request handling.
# pip install fastapi uvicorn python-multipart
from fastapi import FastAPI, UploadFile, HTTPException
from metadata_extractor import ExcelMetadataExtractor
app = FastAPI(title="Excel Metadata API")
MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
@app.post("/api/v1/metadata")
async def extract_metadata(file: UploadFile):
    # Validate file type (a first filter only; real validation happens
    # when the ZIP structure is parsed)
    if not file.filename or not file.filename.lower().endswith((".xlsx", ".xlsm")):
        raise HTTPException(400, "Only .xlsx and .xlsm files are supported")
# Read with size limit
contents = await file.read()
if len(contents) > MAX_FILE_SIZE:
raise HTTPException(413, "File exceeds 50 MB limit")
try:
extractor = ExcelMetadataExtractor(contents)
return {
"filename": file.filename,
"size_bytes": len(contents),
"metadata": extractor.to_dict(),
}
except ValueError as e:
raise HTTPException(422, str(e))
A successful request returns a JSON payload like this:

{
"filename": "Q4-forecast.xlsx",
"size_bytes": 284672,
"metadata": {
"core": {
"creator": "Sarah Chen",
"last_modified_by": "Michael Torres",
"created": "2025-11-15T09:23:00Z",
"modified": "2026-01-20T16:45:00Z",
"title": "Q4 Revenue Forecast - DRAFT",
"revision": "47"
},
"app": {
"application": "Microsoft Excel",
"app_version": "16.0300",
"company": "Acme Financial Services",
"total_time": "2340"
},
"sheets": [
{ "name": "Summary", "state": "visible", "has_comments": false },
{ "name": "Revenue Detail", "state": "visible", "has_comments": true },
{ "name": "Internal Notes", "state": "hidden", "has_comments": false },
{ "name": "Assumptions", "state": "veryHidden", "has_comments": false }
]
}
}
The .xlsx extension check is only a first filter; always validate the ZIP structure and XML content server-side. Parse XML with lxml's XMLParser(resolve_entities=False) to prevent XXE (XML External Entity) attacks from crafted XLSX files.

For browser-based applications or Node.js backends, you can extract XLSX metadata using libraries like JSZip and xlsx (SheetJS). Client-side extraction is particularly valuable for privacy-sensitive applications where files should never leave the user's browser.
// npm install jszip
import JSZip from 'jszip';
interface CoreProperties {
creator: string;
lastModifiedBy: string;
created: string;
modified: string;
title: string;
revision: string;
}
interface AppProperties {
application: string;
appVersion: string;
company: string;
manager: string;
totalTime: string;
}
// NOTE: DOMParser is a browser API; in Node.js, use a package such as
// @xmldom/xmldom to get an equivalent parser.
function parseXmlText(xml: string, tag: string): string {
  const parser = new DOMParser();
  const doc = parser.parseFromString(xml, 'text/xml');
  const el = doc.getElementsByTagNameNS('*', tag)[0];
  return el?.textContent ?? '';
}
export async function extractMetadata(file: File | ArrayBuffer) {
const zip = await JSZip.loadAsync(file);
// Extract core properties
const coreXml = await zip.file('docProps/core.xml')
?.async('string') ?? '';
const core: CoreProperties = {
creator: parseXmlText(coreXml, 'creator'),
lastModifiedBy: parseXmlText(coreXml, 'lastModifiedBy'),
created: parseXmlText(coreXml, 'created'),
modified: parseXmlText(coreXml, 'modified'),
title: parseXmlText(coreXml, 'title'),
revision: parseXmlText(coreXml, 'revision'),
};
// Extract app properties
const appXml = await zip.file('docProps/app.xml')
?.async('string') ?? '';
const app: AppProperties = {
application: parseXmlText(appXml, 'Application'),
appVersion: parseXmlText(appXml, 'AppVersion'),
company: parseXmlText(appXml, 'Company'),
manager: parseXmlText(appXml, 'Manager'),
totalTime: parseXmlText(appXml, 'TotalTime'),
};
// Detect hidden sheets from workbook.xml
const wbXml = await zip.file('xl/workbook.xml')
?.async('string') ?? '';
const parser = new DOMParser();
const wbDoc = parser.parseFromString(wbXml, 'text/xml');
const sheetEls = wbDoc.getElementsByTagNameNS('*', 'sheet');
const sheets = Array.from(sheetEls).map(el => ({
name: el.getAttribute('name') ?? '',
state: el.getAttribute('state') ?? 'visible',
}));
return { core, app, sheets };
}
Running metadata extraction in the browser means the file never leaves the user's device. This is a powerful architectural choice for applications that handle sensitive documents:
import { useState, ChangeEvent } from 'react';
import { extractMetadata } from './metadataExtractor';
function FileInspector() {
const [metadata, setMetadata] = useState<any>(null);
const handleUpload = async (e: ChangeEvent<HTMLInputElement>) => {
const file = e.target.files?.[0];
if (!file) return;
const result = await extractMetadata(file);
setMetadata(result);
};
return (
<div>
<input type="file" accept=".xlsx,.xlsm"
onChange={handleUpload} />
{metadata && (
<pre>{JSON.stringify(metadata, null, 2)}</pre>
)}
</div>
);
}
Beyond the standard core and app properties, many enterprise environments embed custom metadata in XLSX files. SharePoint adds document IDs, Microsoft Purview (formerly MIP) adds sensitivity labels, and organizations often define their own classification tags. This data lives in docProps/custom.xml and requires separate parsing.
import zipfile, io
from lxml import etree
CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"  # namespace of the vt:* value children
def extract_custom_properties(file_bytes: bytes) -> dict:
"""Extract custom XML properties from an XLSX file."""
props = {}
with zipfile.ZipFile(io.BytesIO(file_bytes)) as zf:
try:
xml = zf.read("docProps/custom.xml")
except KeyError:
return props
    parser = etree.XMLParser(resolve_entities=False)  # guard against XXE
    root = etree.parse(io.BytesIO(xml), parser).getroot()
for prop in root.findall(f"{{{CUSTOM_NS}}}property"):
name = prop.get("name", "")
# Value can be in different VT type elements
for child in prop:
if child.text:
props[name] = {
"value": child.text,
"type": child.tag.split("}")[-1],
}
break
return props
A file carrying Microsoft Purview sensitivity labels and a SharePoint document ID might yield:

{
"MSIP_Label_a1b2c3_Enabled": {
"value": "true",
"type": "lpwstr"
},
"MSIP_Label_a1b2c3_Name": {
"value": "Confidential",
"type": "lpwstr"
},
"_dlc_DocId": {
"value": "PROJ-2847-5591",
"type": "lpwstr"
},
"ContentType": {
"value": "Financial Report",
"type": "lpwstr"
}
}
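The flat MSIP_Label_* keys shown above can be regrouped by label ID for easier policy checks. A sketch, assuming the dict shape returned by extract_custom_properties (the helper name group_msip_labels is ours, not a Microsoft API):

```python
from collections import defaultdict

def group_msip_labels(props: dict) -> dict:
    """Regroup flat MSIP_Label_<id>_<field> keys by label id."""
    labels: dict = defaultdict(dict)
    for key, entry in props.items():
        if not key.startswith("MSIP_Label_"):
            continue
        rest = key[len("MSIP_Label_"):]        # "<id>_<field>"
        label_id, _, field = rest.partition("_")
        labels[label_id][field] = entry["value"]
    return dict(labels)

props = {
    "MSIP_Label_a1b2c3_Enabled": {"value": "true", "type": "lpwstr"},
    "MSIP_Label_a1b2c3_Name": {"value": "Confidential", "type": "lpwstr"},
    "_dlc_DocId": {"value": "PROJ-2847-5591", "type": "lpwstr"},
}
grouped = group_msip_labels(props)
print(grouped)
# → {'a1b2c3': {'Enabled': 'true', 'Name': 'Confidential'}}
```

Label IDs in real files are GUIDs, which contain no underscores, so splitting on the first underscore after the prefix is safe.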
The real power of metadata API integration emerges when you combine extraction with automated policy enforcement. A compliance scanner can check every uploaded file against a set of rules and block, flag, or clean files that violate your organization's metadata policy.
from dataclasses import dataclass, field
from metadata_extractor import ExcelMetadataExtractor
@dataclass
class PolicyViolation:
rule: str
severity: str # "error", "warning", "info"
message: str
field: str = ""
value: str = ""
@dataclass
class ScanResult:
filename: str
passed: bool
violations: list[PolicyViolation] = field(default_factory=list)
class ComplianceScanner:
"""Check Excel files against metadata policies."""
def scan(self, filename: str, file_bytes: bytes) -> ScanResult:
violations = []
ext = ExcelMetadataExtractor(file_bytes)
core = ext.get_core_properties()
app = ext.get_app_properties()
sheets = ext.get_sheets()
# Rule: No personal names in author fields
if core.creator:
violations.append(PolicyViolation(
rule="no-personal-author",
severity="error",
message="Author name present in metadata",
field="core.creator",
value=core.creator,
))
# Rule: No company name leakage
if app.company:
violations.append(PolicyViolation(
rule="no-company-name",
severity="warning",
message="Company name found in app properties",
field="app.company",
value=app.company,
))
# Rule: No hidden or very hidden sheets
hidden = [s for s in sheets
if s.state in ("hidden", "veryHidden")]
for sheet in hidden:
violations.append(PolicyViolation(
rule="no-hidden-sheets",
severity="error",
message=f"Hidden sheet detected: {sheet.name}",
field="sheets",
value=sheet.name,
))
        # Rule: High revision count indicates draft
        # (guard int(): revision may be missing or non-numeric)
        if core.revision.isdigit() and int(core.revision) > 20:
violations.append(PolicyViolation(
rule="high-revision-count",
severity="info",
message=f"High revision count ({core.revision}) may indicate draft status",
field="core.revision",
value=core.revision,
))
has_errors = any(v.severity == "error"
for v in violations)
return ScanResult(
filename=filename,
passed=not has_errors,
violations=violations,
)
- Error: block file distribution. Triggers: personal names in author fields, hidden sheets, sensitivity labels present.
- Warning: flag for review. Triggers: company name present, high editing time, custom properties detected.
- Info: logged for audit. Triggers: high revision count, application version, creation-date patterns.
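The block/flag/log decisions can be sketched as a small dispatcher. A minimal stand-in (the Violation dataclass here mirrors only the severity field of the scanner's PolicyViolation):

```python
from dataclasses import dataclass

@dataclass
class Violation:
    rule: str
    severity: str  # "error", "warning", "info"

def decide_action(violations: list[Violation]) -> str:
    """Map the highest severity present to a handling action."""
    severities = {v.severity for v in violations}
    if "error" in severities:
        return "block"
    if "warning" in severities:
        return "review"
    if "info" in severities:
        return "log"
    return "pass"

print(decide_action([Violation("no-hidden-sheets", "error")]))    # block
print(decide_action([Violation("high-revision-count", "info")]))  # log
```

Keeping this mapping in one function makes the escalation policy easy to audit and adjust without touching the individual rules.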
How you integrate metadata analysis depends on your application architecture. Here are proven patterns for the most common setups, each with trade-offs worth considering.
Upload middleware. Intercept file uploads at the middleware layer before they reach your business logic. This works well for web applications with file upload endpoints.
Client → Upload Endpoint → Metadata Middleware → Validation → Storage
Async queue. Accept the upload immediately, then process metadata asynchronously via a message queue (Redis, RabbitMQ, SQS). Results are stored and available for later retrieval or webhook notification.
Client → Upload → Storage → Queue → Worker → Metadata DB
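The queue pattern can be sketched in-process with the standard library. A toy stand-in where queue.Queue plays the broker and a dict plays the metadata database (a real deployment would use Redis, RabbitMQ, or SQS and the real extractor):

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
results_db: dict[str, dict] = {}   # stand-in for the metadata DB

def fake_extract(file_bytes: bytes) -> dict:
    # Stand-in for ExcelMetadataExtractor(file_bytes).to_dict()
    return {"size_bytes": len(file_bytes)}

def worker() -> None:
    while True:
        item = jobs.get()
        if item is None:           # poison pill shuts the worker down
            break
        file_id, file_bytes = item
        results_db[file_id] = fake_extract(file_bytes)
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Upload endpoint: enqueue and return immediately.
jobs.put(("upload-1", b"fake xlsx bytes"))

jobs.join()                        # wait for the worker (demo only)
jobs.put(None)
t.join()
print(results_db)
```

The upload handler returns as soon as the job is enqueued; clients poll for results or receive a webhook once the worker writes them.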
Metadata microservice. Deploy metadata extraction as a standalone microservice that other services call via HTTP or gRPC. This isolates the extraction logic and its dependencies from your main application.
Any Service → Metadata Service (POST /analyze) → JSON Response
Client-side extraction. Perform all extraction in the browser using JavaScript. The file never leaves the client. Metadata results can be sent to your backend as structured JSON without the file itself.
File Input → Browser JS → Metadata JSON → API (metadata only)
Metadata extraction code needs thorough testing because XLSX files vary enormously. Files created by different applications (Excel, Google Sheets, LibreOffice, Numbers) produce different XML structures, and some fields may be missing entirely. Build a test suite that covers these variations.
import pytest
from pathlib import Path
from metadata_extractor import ExcelMetadataExtractor
FIXTURES = Path("tests/fixtures")
def test_standard_excel_file():
"""File created by Microsoft Excel."""
data = (FIXTURES / "excel-standard.xlsx").read_bytes()
ext = ExcelMetadataExtractor(data)
core = ext.get_core_properties()
assert core.creator != ""
assert core.created != ""
def test_google_sheets_export():
"""Google Sheets exports omit some properties."""
data = (FIXTURES / "google-sheets.xlsx").read_bytes()
ext = ExcelMetadataExtractor(data)
app = ext.get_app_properties()
# Google Sheets uses "Google Sheets" as application
assert "Google" in app.application
def test_missing_core_xml():
"""Handles XLSX files without core.xml gracefully."""
data = (FIXTURES / "minimal.xlsx").read_bytes()
ext = ExcelMetadataExtractor(data)
core = ext.get_core_properties()
assert core.creator == "" # Empty, not an exception
def test_hidden_sheets_detected():
"""Detects hidden and veryHidden sheets."""
data = (FIXTURES / "hidden-sheets.xlsx").read_bytes()
ext = ExcelMetadataExtractor(data)
sheets = ext.get_sheets()
states = {s.name: s.state for s in sheets}
assert states["Secret"] == "veryHidden"
def test_invalid_file_raises():
"""Non-ZIP files raise ValueError."""
with pytest.raises(ValueError):
ExcelMetadataExtractor(b"not a zip file")
def test_large_file_performance():
"""Core property extraction under 100ms for 50MB files."""
import time
data = (FIXTURES / "large-file.xlsx").read_bytes()
start = time.perf_counter()
ext = ExcelMetadataExtractor(data)
ext.get_core_properties()
elapsed = time.perf_counter() - start
assert elapsed < 0.1, f"Took {elapsed:.2f}s, expected <0.1s"
Build a collection of test fixture files covering the scenarios above: files saved by Microsoft Excel, Google Sheets, LibreOffice, and Numbers; a minimal file with no core.xml; workbooks containing hidden and veryHidden sheets; a non-ZIP file for failure handling; and a large file for performance checks.
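Some fixtures can be generated on the fly instead of being committed as binaries. A sketch that builds the minimal no-core.xml fixture with the standard library (a stripped-down archive like this exercises only the metadata path, not full workbook loading):

```python
import io
import zipfile

def make_minimal_fixture() -> bytes:
    """Build an XLSX-shaped archive with no docProps/core.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "[Content_Types].xml",
            '<Types xmlns="http://schemas.openxmlformats.org/'
            'package/2006/content-types"/>',
        )
        zf.writestr("xl/workbook.xml", "<workbook/>")
    return buf.getvalue()

data = make_minimal_fixture()
assert zipfile.is_zipfile(io.BytesIO(data))
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    assert "docProps/core.xml" not in zf.namelist()
```

Generated fixtures keep the repository free of opaque binaries and make it obvious exactly which parts each test depends on.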
Metadata extraction performance depends primarily on how much of the ZIP archive you need to read. Core properties (core.xml and app.xml) are typically under 2 KB each and can be extracted in under 10 milliseconds. Sheet detection requires reading workbook.xml (usually 5–20 KB). The expensive operations are reading shared strings or comments from large files.
| Operation | 1 MB File | 10 MB File | 50 MB File |
|---|---|---|---|
| Core properties only | < 5 ms | < 8 ms | < 15 ms |
| Core + App properties | < 8 ms | < 12 ms | < 20 ms |
| All properties + sheet list | < 15 ms | < 30 ms | < 60 ms |
| Full analysis with shared strings | < 50 ms | < 400 ms | < 2 s |
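The cost gap in the table tracks how much of the archive you decompress. For large entries like sharedStrings.xml, the standard library can stream rather than materialize; a sketch using zf.open() on a stand-in archive:

```python
import io
import zipfile

# Build a stand-in archive with one large-ish entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("xl/sharedStrings.xml", "<sst>" + "x" * 100_000 + "</sst>")

# zf.read() would materialize the whole entry at once;
# zf.open() yields a file-like object you can read in chunks.
total = 0
with zipfile.ZipFile(buf) as zf:
    with zf.open("xl/sharedStrings.xml") as fh:
        while chunk := fh.read(8192):   # 8 KB at a time
            total += len(chunk)

print(total)  # → 100011 (5 + 100000 + 6 bytes)
```

Streaming caps peak memory at the chunk size regardless of entry size, which is what keeps full-analysis latency bounded on 50 MB uploads.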
Because the ZIP central directory indexes every entry, you can read core.xml without decompressing the entire archive. Python's zipfile module can read individual entries without loading the full file into memory; use zf.open() instead of zf.read() to stream large entries.

Every XLSX file is a ZIP archive containing XML files at known paths. Your extraction code reads specific entries (core.xml, app.xml, workbook.xml) without needing to understand the spreadsheet data itself.
Building your API to accept raw bytes instead of file paths makes it work seamlessly with upload streams, cloud storage, and in-memory buffers without requiring temporary files on disk.
File upload endpoints are attack surfaces. Validate ZIP structure, limit file sizes, sanitize XML parsing to prevent XXE attacks, and set processing timeouts to guard against zip bombs.
Upload middleware for simple apps, async queues for high volume, microservices for polyglot stacks, client-side for maximum privacy. Match the pattern to your architecture and throughput needs.
Before building your own metadata integration, use MetaData Analyzer to explore the metadata hidden in your Excel files. See exactly what author names, timestamps, hidden sheets, and custom properties are embedded—then build your extraction logic with confidence.