An XLSX file is not a single document—it is a ZIP archive containing dozens of interconnected XML files, each carrying metadata that reveals the document's history, authorship, application environment, and internal structure. Understanding this architecture is essential for forensic analysis, metadata auditing, and building tools that interact with spreadsheet data at the deepest level.
When Microsoft introduced the Office Open XML (OOXML) format with Office 2007, it replaced the legacy binary .xls format with a structured, XML-based approach. Every .xlsx file you create is actually a ZIP archive containing a carefully organized collection of XML files, relationship definitions, and optional binary components. This design follows the Open Packaging Conventions (OPC) standard defined in ECMA-376 and ISO/IEC 29500.
This matters for metadata analysis because metadata is not stored in a single location. It is distributed across multiple XML files, each serving a distinct purpose. Author names, timestamps, application versions, editing statistics, theme data, shared strings, calculation chains, and structural relationships are all encoded in separate files within the archive. To fully understand a spreadsheet's metadata footprint, you need to understand where each piece of information lives and how these files relate to each other.
You can verify that any XLSX file is a ZIP archive by renaming it from .xlsx to .zip and opening it with any archive utility. Alternatively, use a command-line tool:
# List the contents of an XLSX file
unzip -l report.xlsx
# Extract all XML files for inspection
unzip report.xlsx -d report_extracted/
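The same check can be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function names `is_xlsx_container` and `list_parts` are illustrative and not part of any library.

```python
import zipfile


def is_xlsx_container(path):
    """Return True if the file is a ZIP archive with an OPC content-types manifest."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        return "[Content_Types].xml" in zf.namelist()


def list_parts(path):
    """Return (entry name, uncompressed size in bytes) for every archive entry."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Calling `list_parts("report.xlsx")` yields the same entry names as `unzip -l`, in archive order, which makes it easy to feed into later checks.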
When you extract a typical XLSX file, you will find a directory tree that looks something like this. The exact contents vary depending on the spreadsheet's complexity, but the core structure is always present:
# Typical XLSX directory structure
report.xlsx/
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── core.xml
│   ├── app.xml
│   └── custom.xml (optional)
└── xl/
    ├── workbook.xml
    ├── sharedStrings.xml
    ├── styles.xml
    ├── theme/
    │   └── theme1.xml
    ├── worksheets/
    │   ├── sheet1.xml
    │   ├── sheet2.xml
    │   └── ...
    ├── calcChain.xml
    ├── _rels/
    │   └── workbook.xml.rels
    ├── printerSettings/ (optional)
    ├── drawings/ (optional)
    ├── charts/ (optional)
    └── media/ (optional)
Each of these files and directories serves a specific purpose. Let's examine the most important ones in detail, focusing on where metadata is stored and what information each file reveals.
[Content_Types].xml and _rels/.rels define the archive structure and media types.
docProps/core.xml and docProps/app.xml hold authorship, timestamps, and application metadata.
The xl/ directory contains all spreadsheet content: worksheets, styles, strings, and formulas.
The [Content_Types].xml file sits at the root of the archive and acts as a manifest that declares the MIME type of every part in the package. The OPC specification requires this file, and Excel uses it to know how to interpret each file within the archive.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="...schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml"
ContentType="application/xml"/>
<Override PartName="/xl/workbook.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
<Override PartName="/xl/worksheets/sheet1.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
<Override PartName="/docProps/core.xml"
ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
<!-- ... more overrides ... -->
</Types>
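A manifest like the one above can also be read programmatically. The sketch below lists every `Override` entry and flags content types or part names that hint at macros or embedded objects. The marker list is an assumption chosen for illustration, not an exhaustive rule set, and the helper names are hypothetical. Elements are matched by local name so the code does not depend on the exact namespace URI.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed heuristic markers; extend to suit your own policy.
MARKERS = ("macroenabled", "vbaproject", "oleobject", "activex")


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def content_type_overrides(path):
    """Map each overridden part name to its declared content type."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    return {
        el.get("PartName"): el.get("ContentType")
        for el in root
        if local(el.tag) == "Override"
    }


def flagged_parts(overrides):
    """Return the overrides whose part name or content type contains a marker."""
    return {
        part: ctype
        for part, ctype in overrides.items()
        if any(m in f"{part} {ctype}".lower() for m in MARKERS)
    }
```

Note that embedded binaries are often declared through `Default` extension entries rather than `Override` entries, so an empty result does not prove that nothing is embedded.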
A content type containing vnd.ms-excel.sheet.macroEnabled indicates that the workbook contains macros, even if it is saved as .xlsx
Content types or part names containing oleObject or activeX reveal embedded content
Relationship files (.rels) define how parts in the package connect to each other. They are the wiring diagram that tells Excel which XML file is the workbook, which files are worksheets, where the document properties live, and how every component relates to every other component.
There are two levels of relationships. The package-level relationships in _rels/.rels connect the root of the archive to top-level components. Part-level relationships (like xl/_rels/workbook.xml.rels) connect specific parts to their dependencies.
<Relationships xmlns="...schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="...officeDocument/2006/relationships/officeDocument"
Target="xl/workbook.xml"/>
<Relationship Id="rId2"
Type="...package/2006/relationships/metadata/core-properties"
Target="docProps/core.xml"/>
<Relationship Id="rId3"
Type="...officeDocument/2006/relationships/extended-properties"
Target="docProps/app.xml"/>
</Relationships>
The package-level relationships point to the three primary entry points: the workbook itself, the core document properties, and the extended application properties. The Id attributes (rId1, rId2, etc.) are internal references used to link parts together.
<Relationships xmlns="...">
<Relationship Id="rId1"
Type="...relationships/worksheet"
Target="worksheets/sheet1.xml"/>
<Relationship Id="rId2"
Type="...relationships/worksheet"
Target="worksheets/sheet2.xml"/>
<Relationship Id="rId3"
Type="...relationships/theme"
Target="theme/theme1.xml"/>
<Relationship Id="rId4"
Type="...relationships/styles"
Target="styles.xml"/>
<Relationship Id="rId5"
Type="...relationships/sharedStrings"
Target="sharedStrings.xml"/>
</Relationships>
These relationships map the workbook to its worksheets, shared strings table, styles, and theme. The relationship IDs here correspond to the r:id attributes in workbook.xml that reference each sheet.
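To make the link between r:id values and relationship targets concrete, here is a minimal sketch that resolves each sheet name to the archive path of its worksheet part. It assumes the workbook lives at xl/workbook.xml (strictly, that location should be read from _rels/.rels), and `sheet_targets` is an illustrative name.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET


def local(name):
    """Strip the '{namespace}' prefix from an element or attribute name."""
    return name.rsplit("}", 1)[-1]


def sheet_targets(path):
    """Map each sheet name to the archive path its r:id relationship points at."""
    with zipfile.ZipFile(path) as zf:
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels
        if local(rel.tag) == "Relationship"
    }
    result = {}
    for el in workbook.iter():
        if local(el.tag) != "sheet":
            continue
        # r:id is namespaced, so look it up by local name ("sheetId" does not match).
        rid = next((v for k, v in el.attrib.items() if local(k) == "id"), None)
        target = targets.get(rid)
        if target is None:
            result[el.get("name")] = None  # dangling reference
        else:
            # Targets are relative to xl/ unless they start with "/".
            full = posixpath.normpath(posixpath.join("xl", target))
            result[el.get("name")] = full.lstrip("/")
    return result
```

A sheet that maps to `None` has an r:id with no matching relationship, which is itself worth investigating.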
Relationship targets that point outside the package can expose internal network paths such as \\server\share\template.xlsx
The core properties file is the most commonly discussed source of metadata in Excel files, and for good reason. It contains the fields that directly identify who created the document, when it was created, when it was last modified, and who last saved it. This file follows the Dublin Core metadata standard and the OPC core properties schema.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
xmlns:cp="...schemas.openxmlformats.org/package/2006/metadata/core-properties"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:creator>Jane Smith</dc:creator>
<cp:lastModifiedBy>Bob Johnson</cp:lastModifiedBy>
<dcterms:created xsi:type="dcterms:W3CDTF">
2025-08-15T09:23:17Z
</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">
2026-01-10T14:45:32Z
</dcterms:modified>
<cp:revision>12</cp:revision>
<dc:title>Q4 Financial Report</dc:title>
<dc:subject>Quarterly Financials</dc:subject>
<dc:description>Internal revenue analysis</dc:description>
<cp:keywords>finance, Q4, revenue, confidential</cp:keywords>
<cp:category>Financial Reports</cp:category>
</cp:coreProperties>
dc:creator
The name of the person who originally created the file. Taken from the Office account name or Windows user profile at creation time.
cp:lastModifiedBy
The name of the last person to save the file. Updated on every save operation.
dcterms:created
ISO 8601 timestamp of when the file was first created. Uses W3CDTF format with UTC timezone.
dcterms:modified
ISO 8601 timestamp of the last save. Updated every time the file is saved, whether content changes or not.
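The four fields above can be pulled out with a few lines of Python. This is a sketch under two assumptions: the part is stored at docProps/core.xml, and timestamps use the W3CDTF forms that `datetime.fromisoformat` accepts once a trailing `Z` is rewritten. The function names are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def core_properties(path):
    """Return core.xml fields keyed by local element name, e.g. 'creator'."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {local(el.tag): (el.text or "").strip() for el in root}


def parse_w3cdtf(text):
    """Parse a W3CDTF timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def editing_span(props):
    """Return the time between creation and last save as a timedelta."""
    return parse_w3cdtf(props["modified"]) - parse_w3cdtf(props["created"])
```

A negative span, or a modified time equal to the created time on a file with a high revision count, is the kind of inconsistency worth a closer look.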
Core properties are the most visible metadata leak. Many organizations unknowingly expose:
The names in dc:creator and cp:lastModifiedBy, revealing who worked on a document
While core.xml captures who and when, app.xml captures what and how. This file records detailed information about the application that created and last modified the document, along with structural statistics about the workbook itself. This information is particularly valuable for forensic analysis because it reveals the software environment behind the document.
<Properties xmlns="...schemas.openxmlformats.org/officeDocument/2006/extended-properties">
<Application>Microsoft Excel</Application>
<AppVersion>16.0300</AppVersion>
<DocSecurity>0</DocSecurity>
<ScaleCrop>false</ScaleCrop>
<Company>Acme Corporation</Company>
<Manager>Sarah Director</Manager>
<LinksUpToDate>false</LinksUpToDate>
<SharedDoc>false</SharedDoc>
<HeadingPairs>
<vt:vector size="2" baseType="variant">
<vt:variant><vt:lpstr>Worksheets</vt:lpstr></vt:variant>
<vt:variant><vt:i4>5</vt:i4></vt:variant>
</vt:vector>
</HeadingPairs>
<TitlesOfParts>
<vt:vector size="5" baseType="lpstr">
<vt:lpstr>Summary</vt:lpstr>
<vt:lpstr>Revenue Detail</vt:lpstr>
<vt:lpstr>Cost Analysis</vt:lpstr>
<vt:lpstr>Projections</vt:lpstr>
<vt:lpstr>Raw Data</vt:lpstr>
</vt:vector>
</TitlesOfParts>
</Properties>
The AppVersion field encodes the Excel version in a fixed XX.YYYY format. This lets forensic investigators determine which major release of Excel saved the file, although Excel 2016, 2019, and 365 all report the same value:
AppVersion    Excel Version
12.0000       Excel 2007
14.0300       Excel 2010
15.0300       Excel 2013
16.0300       Excel 2016 / 2019 / 365
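The version table translates directly into a lookup. The sketch below keys on the major number only, which is an assumption made for illustration: the table lists full strings such as 14.0300, and the minor part can differ between builds. `APP_VERSIONS`, `describe_app_version`, and `app_fingerprint` are hypothetical names.

```python
import zipfile
import xml.etree.ElementTree as ET

# Major version numbers taken from the table above.
APP_VERSIONS = {
    12: "Excel 2007",
    14: "Excel 2010",
    15: "Excel 2013",
    16: "Excel 2016 / 2019 / 365",
}


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def describe_app_version(app_version):
    """Translate an AppVersion string such as '16.0300' into a release name."""
    try:
        major = int(app_version.split(".", 1)[0])
    except ValueError:
        return "unknown"
    return APP_VERSIONS.get(major, "unknown")


def app_fingerprint(path):
    """Return (Application, AppVersion, release name) from docProps/app.xml."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/app.xml"))
    fields = {local(el.tag): (el.text or "").strip() for el in root}
    version = fields.get("AppVersion", "")
    return fields.get("Application", ""), version, describe_app_version(version)
```

Because third-party libraries can write any value they like into these fields, treat the result as a claim made by the file rather than proof of the producing application.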
The optional custom.xml file stores user-defined or application-defined custom properties. These are arbitrary key-value pairs that can be added programmatically or through document management systems. Many organizations overlook this file when cleaning metadata because it is not visible through Excel's standard File > Properties dialog.
<Properties xmlns="...schemas.openxmlformats.org/officeDocument/2006/custom-properties">
<property fmtid="..." pid="2" name="Classification">
<vt:lpwstr>Confidential - Internal Only</vt:lpwstr>
</property>
<property fmtid="..." pid="3" name="Department">
<vt:lpwstr>Finance - Treasury</vt:lpwstr>
</property>
<property fmtid="..." pid="4" name="DMS_DocID">
<vt:lpwstr>DOC-2025-FIN-00847</vt:lpwstr>
</property>
<property fmtid="..." pid="5" name="ReviewStatus">
<vt:lpwstr>Approved by Legal</vt:lpwstr>
</property>
</Properties>
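Because this file is easy to miss, it helps to dump it explicitly. The following sketch returns every custom property as a name/value pair, or an empty dictionary when the optional part is absent. It assumes the part is stored at docProps/custom.xml and that each property has a single typed child element, as in the example above; `custom_properties` is an illustrative name.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def custom_properties(path, part="docProps/custom.xml"):
    """Return {property name: text of its typed value element}."""
    with zipfile.ZipFile(path) as zf:
        if part not in zf.namelist():
            return {}  # the part is optional
        root = ET.fromstring(zf.read(part))
    props = {}
    for prop in root:
        if local(prop.tag) != "property":
            continue
        value = next(iter(prop), None)  # first child, e.g. vt:lpwstr
        props[prop.get("name")] = (value.text or "") if value is not None else ""
    return props
```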
Document Management Systems
Data Loss Prevention (DLP)
The workbook file is the central hub of the spreadsheet data. It defines which sheets exist, their order, their visibility state, and structural properties like defined names, calculation settings, and workbook protection. From a metadata perspective, this file reveals the document's organizational structure.
<workbook xmlns="...schemas.openxmlformats.org/spreadsheetml/2006/main">
<fileVersion appName="xl"
lastEdited="7" lowestEdited="7"
rupBuild="27425"/>
<workbookPr defaultThemeVersion="166925"/>
<sheets>
<sheet name="Summary" sheetId="1" r:id="rId1"/>
<sheet name="Revenue Detail" sheetId="2" r:id="rId2"/>
<sheet name="Hidden Analysis" sheetId="3" r:id="rId3"
state="hidden"/>
<sheet name="Config" sheetId="4" r:id="rId4"
state="veryHidden"/>
</sheets>
<definedNames>
<definedName name="_xlnm.Print_Area"
localSheetId="0">Summary!$A$1:$G$50</definedName>
<definedName name="CostBasis">
'Hidden Analysis'!$B$2:$B$100</definedName>
</definedNames>
<calcPr calcId="191029"/>
</workbook>
Excel supports three sheet visibility states. The state attribute in the <sheet> element controls this:
visible (default)
The sheet appears in the tab bar. No state attribute is present.
hidden
The sheet is hidden from the tab bar but can be unhidden through the right-click menu. Marked with state="hidden".
veryHidden
The sheet is invisible and cannot be unhidden through the normal UI. Only accessible via VBA or by editing the XML directly. Marked with state="veryHidden".
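The three visibility states can be read straight from workbook.xml without opening Excel. This is a minimal sketch, assuming the workbook part is at xl/workbook.xml; `sheet_visibility` is an illustrative name.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def sheet_visibility(path):
    """Map each sheet name to 'visible', 'hidden', or 'veryHidden'."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return {
        el.get("name"): el.get("state", "visible")  # a missing attribute means visible
        for el in root.iter()
        if local(el.tag) == "sheet"
    }
```

Filtering the result for anything other than `visible` gives a quick list of sheets that a recipient would not see in the tab bar.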
The <definedNames> section is a rich source of metadata intelligence. Named ranges often expose:
Names such as CostBasis, MarginCalc, or DiscountTiers reveal the purpose and structure of hidden data
The _xlnm.Print_Area name reveals which data was intended for printing and distribution
Excel optimizes storage by maintaining a single shared strings table. Instead of storing each text value directly in the worksheet cells, it stores the text once in sharedStrings.xml and references it by index from the worksheet. This design has significant implications for metadata analysis and forensic investigation.
<sst xmlns="...schemas.openxmlformats.org/spreadsheetml/2006/main"
count="1847" uniqueCount="523">
<si><t>Product Name</t></si> <!-- index 0 -->
<si><t>Unit Price</t></si> <!-- index 1 -->
<si><t>CONFIDENTIAL - DO NOT SHARE</t></si> <!-- index 2 -->
<si><t>Internal Review Draft</t></si> <!-- index 3 -->
<si><t>john.smith@acme.com</t></si> <!-- index 4 -->
<!-- ... 518 more strings ... -->
</sst>
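The count attribute gives the total number of string references across all cells, while uniqueCount gives the number of distinct strings. A high ratio of count to uniqueCount indicates heavily repeated values such as category labels, while a ratio close to 1 indicates mostly unique text.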
When cells are deleted, their text values may remain in the shared strings table as orphaned entries. Where they survive, they are among the most powerful forensic artifacts in XLSX files.
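Orphaned entries can be found by comparing the string table against the indexes that worksheets actually reference. The sketch below assumes the conventional part locations xl/sharedStrings.xml and xl/worksheets/*.xml rather than resolving them through relationships, and `orphaned_shared_strings` is an illustrative name.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def orphaned_shared_strings(path):
    """Return (index, text) for shared strings that no worksheet cell references."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        if "xl/sharedStrings.xml" not in names:
            return []
        sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
        # Join all <t> runs so rich-text entries are included
        # (phonetic runs, if present, are joined in as well).
        strings = [
            "".join(t.text or "" for t in si.iter() if local(t.tag) == "t")
            for si in sst
            if local(si.tag) == "si"
        ]
        used = set()
        for name in names:
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            sheet = ET.fromstring(zf.read(name))
            for cell in sheet.iter():
                if local(cell.tag) == "c" and cell.get("t") == "s":
                    v = next((ch for ch in cell if local(ch.tag) == "v"), None)
                    if v is not None and v.text:
                        used.add(int(v.text))
    return [(i, s) for i, s in enumerate(strings) if i not in used]
```

An empty result is common for files saved by Excel itself, since the string table is usually rebuilt on save. Hits are more likely in files that were edited by other tools.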
Each worksheet is stored as a separate XML file. These files contain the actual cell data, formulas, formatting references, data validation rules, conditional formatting, comments, hyperlinks, and sheet-level protection settings. The worksheet XML is where content meets metadata.
<worksheet xmlns="...spreadsheetml/2006/main">
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="D15" sqref="D15"/>
</sheetView>
</sheetViews>
<sheetData>
<row r="1" spans="1:5">
<c r="A1" t="s"><v>0</v></c> <!-- shared string index 0 -->
<c r="B1" t="s"><v>1</v></c> <!-- shared string index 1 -->
<c r="C1"><v>99.95</v></c> <!-- numeric value -->
<c r="D1" s="4"> <!-- style index 4 -->
<f>C1*1.2</f> <!-- formula -->
<v>119.94</v> <!-- cached result -->
</c>
</row>
</sheetData>
<hyperlinks>
<hyperlink ref="A10" r:id="rId1"/>
</hyperlinks>
</worksheet>
Active Cell Selection
The activeCell attribute on the <selection> element inside <sheetView> records where the cursor was when the file was last saved. This reveals which cell the last editor was working on, potentially exposing the focus of their analysis.
Formulas and Calculations
The <f> element stores the actual formula text. Even if the visible value is a simple number, the formula reveals the calculation logic, cell references, and potentially references to external workbooks.
Hyperlinks
Hyperlinks stored in the worksheet can point to internal network resources, SharePoint sites, or external URLs that reveal the organization's infrastructure.
Data Validation
Validation rules define dropdown lists, allowed ranges, and input constraints. These can expose business rules, valid value sets, and the intended use of data fields.
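Two of these artifacts, the active cell and the formula text, are easy to extract from a single worksheet part. This is a minimal sketch; the default part name xl/worksheets/sheet1.xml and the function name `worksheet_artifacts` are assumptions for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to names."""
    return tag.rsplit("}", 1)[-1]


def worksheet_artifacts(path, part="xl/worksheets/sheet1.xml"):
    """Return the saved active cells and a {cell reference: formula text} map."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read(part))
    active_cells, formulas = [], {}
    for el in root.iter():
        name = local(el.tag)
        if name == "selection" and el.get("activeCell"):
            active_cells.append(el.get("activeCell"))
        elif name == "c":
            f = next((ch for ch in el if local(ch.tag) == "f"), None)
            # Cells that reuse a shared formula carry an empty <f> and are skipped.
            if f is not None and f.text:
                formulas[el.get("r")] = f.text
    return {"active_cells": active_cells, "formulas": formulas}
```

Formulas whose text contains a bracketed workbook reference such as `[1]` or `[Budget.xlsx]` point at external files and deserve a second look.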
The styles file contains all cell formatting definitions used in the workbook: number formats, fonts, fills, borders, and cell style combinations. While this may seem purely visual, styles carry meaningful metadata about the document's origin, purpose, and editing history.
Custom number formats in <numFmts> reveal how data is intended to be interpreted and displayed. These format codes are highly informative:
<!-- Currency format reveals locale and precision -->
<numFmt numFmtId="164" formatCode="&quot;$&quot;#,##0.00"/>
<!-- Date format reveals regional conventions -->
<numFmt numFmtId="165" formatCode="dd/mm/yyyy"/>
<!-- Accounting format with specific currency -->
<numFmt numFmtId="166" formatCode="_-&quot;£&quot;* #,##0.00"/>
The fonts defined in styles.xml are surprisingly useful for determining which version and edition of Excel created the file. Default fonts changed across versions:
Default Font         Excel Version
Arial 10pt           Excel 2003 and earlier (legacy)
Calibri 11pt         Excel 2007 through 2013
Calibri 11pt         Excel 2016 / 2019 / 365 (default)
Aptos Narrow 11pt    Excel 365 (2024+ with new default)
The theme file defines the color palette, font scheme, and visual effects used throughout the workbook. While themes primarily control appearance, they also carry metadata that reveals the document's origin and creation context.
Application Origin
Corporate Identity
The calculation chain file records the order in which Excel recalculates formulas. This file exists to optimize recalculation performance, but it also serves as a metadata artifact that reveals the spreadsheet's formula structure even when formulas have been converted to values.
<calcChain xmlns="...spreadsheetml/2006/main">
<c r="D1" i="1"/>
<c r="D2" i="1"/>
<c r="E1" i="1"/>
<c r="F5" i="2"/>
<c r="G10" i="3" l="1"/>
</calcChain>
The i attribute identifies the sheet that each formula cell belongs to, revealing formula locations even on hidden sheets
The l attribute flags the start of a new dependency level, which hints at how deeply nested the calculation model is
If calcChain.xml is absent from a file that contains formulas, it may indicate the file was generated by a non-Excel application or was manually reconstructed
One of the most overlooked metadata sources in XLSX files is the printer settings directory. When a user sets a print area, adjusts page setup, or opens a print preview, Excel stores binary printer configuration data that can reveal detailed information about the hardware environment.
The binary data in the .bin files contains the exact printer driver name (e.g., "HP LaserJet Pro MFP M428fdw")
Network printer paths such as \\PRINTSERVER\3rd-Floor-HP, revealing internal server names and physical locations
Beyond the XML content, the ZIP container itself carries metadata in its file headers. Every file entry in a ZIP archive includes timestamps, compression parameters, and file attributes that exist independently of the XML metadata. This creates a second, often overlooked layer of forensic evidence.
# View ZIP entry details with timestamps
$ unzip -v report.xlsx
 Length  Method    Size  Cmpr    Date    Time   CRC-32    Name
 ------  ------    ----  ----    ----    ----   ------    ----
   1580  Defl:N     447   72%  01-10-26  14:45  a3b2c1d0  [Content_Types].xml
    590  Defl:N     243   59%  01-10-26  14:45  e4f5a6b7  _rels/.rels
    838  Defl:N     380   55%  01-10-26  14:45  c8d9e0f1  docProps/core.xml
   4290  Defl:N    1285   70%  01-10-26  14:45  2a3b4c5d  xl/workbook.xml
  89432  Defl:N   18726   79%  01-10-26  14:45  6e7f8091  xl/worksheets/sheet1.xml
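Where the writing application records real timestamps, ZIP entry timestamps should agree with the dcterms:modified date in core.xml, allowing for the fact that ZIP stores local time with two-second resolution and no time zone while core.xml uses UTC. All entries typically share the same timestamp because Excel writes the entire archive in one operation. Inconsistent timestamps suggest the archive was manually modified after creation.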
Different applications use different compression levels. Excel typically uses Deflate with a specific compression ratio. Files generated by libraries like openpyxl, Apache POI, or EPPlus often show different compression characteristics, helping identify the true source application.
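Both checks can be automated with the standard `zipfile` module. The sketch below collects each entry's timestamp and compression method and measures how far apart the oldest and newest entries are. The function names are illustrative, and what counts as a suspicious spread is left to the analyst.

```python
import zipfile
from datetime import datetime


def entry_details(path):
    """Return (name, timestamp, compression method) for every ZIP entry."""
    with zipfile.ZipFile(path) as zf:
        # date_time is local time with no zone; malformed archives can hold
        # invalid values, in which case datetime() raises ValueError.
        return [
            (info.filename, datetime(*info.date_time), info.compress_type)
            for info in zf.infolist()
        ]


def timestamp_spread_seconds(path):
    """Return the gap in seconds between the oldest and newest entry."""
    stamps = [stamp for _, stamp, _ in entry_details(path)]
    if not stamps:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds()
```

A spread of zero is what a single-pass writer produces. A compression method other than `zipfile.ZIP_DEFLATED` on XML parts is another hint that the archive was repackaged.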
Understanding the XML structure is only valuable when you apply it to real-world analysis. Here is a systematic approach to examining an XLSX file's metadata using unzip and xmllint, command-line tools available on most Unix-like systems.
# Step 1: Extract the archive
mkdir analysis && cd analysis
unzip ../report.xlsx -d extracted/
# Step 2: Examine identity and timestamps
xmllint --format extracted/docProps/core.xml
# Step 3: Check application fingerprint and company
xmllint --format extracted/docProps/app.xml
# Step 4: Look for custom properties (the file is optional)
[ -f extracted/docProps/custom.xml ] && xmllint --format extracted/docProps/custom.xml
# Step 5: Inspect workbook structure and hidden sheets
xmllint --format extracted/xl/workbook.xml
# Step 6: Search shared strings for sensitive data
xmllint --format extracted/xl/sharedStrings.xml | \
grep -iE "confidential|internal|draft|password|secret"
# Step 7: Check for orphaned relationship targets
xmllint --format extracted/xl/_rels/workbook.xml.rels
# Step 8: Compare ZIP timestamps with core.xml dates
unzip -v ../report.xlsx | head -20
Before Sharing Externally
Is dc:creator appropriate to share?
Does Company in app.xml expose your organization?
Does sharedStrings.xml contain sensitive text?
For Forensic Investigation
Not all XLSX files are created by Microsoft Excel. Many are generated by programming libraries, web applications, and reporting tools. These files often have distinctive structural signatures that differ from genuine Excel output. Knowing these differences is critical for forensic analysis and authenticity verification.
openpyxl (Python)
Sets Application to "Microsoft Excel" but uses distinctive XML formatting with different whitespace patterns. Often missing calcChain.xml and printer settings. Theme file may be minimal or use the default openpyxl theme.
Apache POI (Java)
May set Application to "Apache POI" or leave it blank. Uses XSSF format with specific namespace prefixes. Relationship IDs often follow a different numbering pattern than Excel.
EPPlus (.NET)
Identifies itself with Application set to "Microsoft Excel" and a specific AppVersion. Missing calcChain.xml is common. Custom XML namespace prefixes differ from native Excel output.
Google Sheets (Export)
Exported files often have minimal app.xml properties, missing Company and Manager fields, and a simplified theme. The XML structure is valid but noticeably different from native Excel output in element ordering and attribute presence.
Every XLSX file is a ZIP archive containing interconnected XML files. Understanding this structure is the foundation for any meaningful metadata analysis, forensic investigation, or privacy audit.
Metadata is not confined to core.xml. Author names, timestamps, application fingerprints, network paths, and sensitive text are distributed across dozens of files within the archive.
Shared strings, calculation chains, and relationship entries often survive data deletion. These orphaned artifacts are invaluable for forensic recovery and detecting what was changed before a file was shared.
A thorough metadata cleanup must address every XML file in the archive, not just the document properties. Shared strings, printer settings, custom properties, and relationship files all carry potentially sensitive information.