
Deep Dive into Excel XML Structure and Metadata Storage

An XLSX file is not a single document—it is a ZIP archive containing dozens of interconnected XML files, each carrying metadata that reveals the document's history, authorship, application environment, and internal structure. Understanding this architecture is essential for forensic analysis, metadata auditing, and building tools that interact with spreadsheet data at the deepest level.

By Technical Team | February 26, 2026 | 24 min read

The XLSX File: A ZIP Archive in Disguise

When Microsoft introduced the Office Open XML (OOXML) format with Office 2007, it replaced the legacy binary .xls format with a structured, XML-based approach. Every .xlsx file you create is actually a ZIP archive containing a carefully organized collection of XML files, relationship definitions, and optional binary components. This design follows the Open Packaging Conventions (OPC) standard defined in ECMA-376 and ISO/IEC 29500.

This matters for metadata analysis because metadata is not stored in a single location. It is distributed across multiple XML files, each serving a distinct purpose. Author names, timestamps, application versions, editing statistics, theme data, shared strings, calculation chains, and structural relationships are all encoded in separate files within the archive. To fully understand a spreadsheet's metadata footprint, you need to understand where each piece of information lives and how these files relate to each other.

Verifying the ZIP Structure Yourself

You can verify that any XLSX file is a ZIP archive by renaming it from .xlsx to .zip and opening it with any archive utility. Alternatively, use a command-line tool:

# List the contents of an XLSX file

unzip -l report.xlsx

# Extract all XML files for inspection

unzip report.xlsx -d report_extracted/
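The same check can be scripted. The following is a minimal Python sketch using only the standard library; the function name list_parts is illustrative, not part of any existing tool.

```python
import zipfile


def list_parts(xlsx):
    """Return (part name, uncompressed size) for each entry, in archive order."""
    if not zipfile.is_zipfile(xlsx):
        raise ValueError("not a ZIP container, so not a valid XLSX package")
    with zipfile.ZipFile(xlsx) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Calling list_parts("report.xlsx") returns roughly the same listing as unzip -l, as Python tuples that later checks can reuse.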

Anatomy of an XLSX Archive

When you extract a typical XLSX file, you will find a directory tree that looks something like this. The exact contents vary depending on the spreadsheet's complexity, but the core structure is always present:

# Typical XLSX directory structure

report.xlsx/

  ├── [Content_Types].xml

  ├── _rels/

  │   └── .rels

  ├── docProps/

  │   ├── core.xml

  │   ├── app.xml

  │   └── custom.xml   (optional)

  └── xl/

      ├── workbook.xml

      ├── sharedStrings.xml

      ├── styles.xml

      ├── theme/

      │   └── theme1.xml

      ├── worksheets/

      │   ├── sheet1.xml

      │   ├── sheet2.xml

      │   └── ...

      ├── calcChain.xml

      ├── _rels/

      │   └── workbook.xml.rels

      ├── printerSettings/  (optional)

      ├── drawings/        (optional)

      ├── charts/          (optional)

      └── media/           (optional)

Each of these files and directories serves a specific purpose. Let's examine the most important ones in detail, focusing on where metadata is stored and what information each file reveals.

Package Level

[Content_Types].xml and _rels/.rels define the archive structure and media types.

Document Properties

docProps/core.xml and docProps/app.xml hold authorship, timestamps, and application metadata.

Workbook Data

The xl/ directory contains all spreadsheet content: worksheets, styles, strings, and formulas.

[Content_Types].xml: The Package Manifest

The [Content_Types].xml file sits at the root of the archive and acts as a manifest that declares the MIME type of every part in the package. The OPC specification requires this file, and Excel uses it to know how to interpret each file within the archive.

[Content_Types].xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<Types xmlns="...schemas.openxmlformats.org/package/2006/content-types">

  <Default Extension="rels"

    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>

  <Default Extension="xml"

    ContentType="application/xml"/>

  <Override PartName="/xl/workbook.xml"

    ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>

  <Override PartName="/xl/worksheets/sheet1.xml"

    ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>

  <Override PartName="/docProps/core.xml"

    ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>

  <!-- ... more overrides ... -->

</Types>

Forensic Value of Content Types

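  • Macro detection: A workbook part declared with the content type application/vnd.ms-excel.sheet.macroEnabled.main+xml, or an override for a vbaProject.bin part, marks the file as macro-enabled even if it carries an .xlsx extension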
  • Embedded objects: Overrides for oleObject or activeX parts reveal embedded content
  • Application fingerprinting: The specific content types and namespace URIs vary between Excel versions and third-party libraries
  • Missing or extra parts: A content type entry without a matching file (or vice versa) can indicate tampering or corruption
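These checks can be approximated in a few lines of Python. This is a sketch under stated assumptions: audit_content_types and the SUSPICIOUS marker list are hypothetical names, part names are compared literally (no case folding or percent-decoding), and the {*} namespace wildcard requires Python 3.8 or later.

```python
import zipfile
import xml.etree.ElementTree as ET

# Substrings of content types worth a closer look (illustrative, not exhaustive).
SUSPICIOUS = ("macroEnabled", "vbaProject", "oleObject", "activeX")


def audit_content_types(xlsx):
    """Flag notable content types and overrides that point at missing parts."""
    with zipfile.ZipFile(xlsx) as zf:
        names = set(zf.namelist())
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    # Map each overridden part (without the leading slash) to its content type.
    overrides = {
        o.get("PartName", "").lstrip("/"): o.get("ContentType", "")
        for o in root.findall("{*}Override")
    }
    flagged = {
        part: ctype
        for part, ctype in overrides.items()
        if any(marker in ctype for marker in SUSPICIOUS)
    }
    missing = sorted(set(overrides) - names)
    return {"flagged": flagged, "missing_parts": missing}
```

A non-empty missing_parts list is only a prompt for manual inspection. It does not prove tampering on its own.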

Relationship Files: The Wiring Diagram

Relationship files (.rels) define how parts in the package connect to each other. They are the wiring diagram that tells Excel which XML file is the workbook, which files are worksheets, where the document properties live, and how every component relates to every other component.

There are two levels of relationships. The package-level relationships in _rels/.rels connect the root of the archive to top-level components. Part-level relationships (like xl/_rels/workbook.xml.rels) connect specific parts to their dependencies.

Package-Level Relationships (_rels/.rels)

<Relationships xmlns="...schemas.openxmlformats.org/package/2006/relationships">

  <Relationship Id="rId1"

    Type="...officeDocument/2006/relationships/officeDocument"

    Target="xl/workbook.xml"/>

  <Relationship Id="rId2"

    Type="...package/2006/relationships/metadata/core-properties"

    Target="docProps/core.xml"/>

  <Relationship Id="rId3"

    Type="...officeDocument/2006/relationships/extended-properties"

    Target="docProps/app.xml"/>

</Relationships>

The package-level relationships point to the three primary entry points: the workbook itself, the core document properties, and the extended application properties. The Id attributes (rId1, rId2, etc.) are internal references used to link parts together.

Workbook-Level Relationships (xl/_rels/workbook.xml.rels)

<Relationships xmlns="...">

  <Relationship Id="rId1"

    Type="...relationships/worksheet"

    Target="worksheets/sheet1.xml"/>

  <Relationship Id="rId2"

    Type="...relationships/worksheet"

    Target="worksheets/sheet2.xml"/>

  <Relationship Id="rId3"

    Type="...relationships/theme"

    Target="theme/theme1.xml"/>

  <Relationship Id="rId4"

    Type="...relationships/styles"

    Target="styles.xml"/>

  <Relationship Id="rId5"

    Type="...relationships/sharedStrings"

    Target="sharedStrings.xml"/>

</Relationships>

These relationships map the workbook to its worksheets, shared strings table, styles, and theme. The relationship IDs here correspond to the r:id attributes in workbook.xml that reference each sheet.

Metadata Leaks in Relationships

  • External link targets: Relationships can point to external URLs or file paths, exposing internal network paths like \\server\share\template.xlsx
  • Deleted sheet remnants: Relationship entries may reference sheets or objects that were deleted from the workbook but whose relationship entries remain
  • Printer settings paths: Relationships to printer settings parts can reveal the printer models and network printer paths used during editing
  • OLE object references: Embedded objects carry their own relationship chains that can reveal source applications and file paths
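External targets are marked with TargetMode="External" in the relationship entry, so they are easy to enumerate. The sketch below walks every .rels part in the package; the function name external_targets is illustrative, and the {*} wildcard assumes Python 3.8 or later.

```python
import zipfile
import xml.etree.ElementTree as ET


def external_targets(xlsx):
    """Return (rels part, relationship kind, target) for every external relationship."""
    found = []
    with zipfile.ZipFile(xlsx) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.findall("{*}Relationship"):
                if rel.get("TargetMode") == "External":
                    # Keep only the last path segment of the Type URI, e.g. "hyperlink".
                    kind = rel.get("Type", "").rsplit("/", 1)[-1]
                    found.append((name, kind, rel.get("Target")))
    return found
```

Each returned target is a URL or file path exactly as stored, which is where UNC paths and internal host names tend to surface.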

docProps/core.xml: Identity and Timestamps

The core properties file is the most commonly discussed source of metadata in Excel files, and for good reason. It contains the fields that directly identify who created the document, when it was created, when it was last modified, and who last saved it. This file follows the Dublin Core metadata standard and the OPC core properties schema.

docProps/core.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<cp:coreProperties

  xmlns:cp="...schemas.openxmlformats.org/package/2006/metadata/core-properties"

  xmlns:dc="http://purl.org/dc/elements/1.1/"

  xmlns:dcterms="http://purl.org/dc/terms/"

  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <dc:creator>Jane Smith</dc:creator>

  <cp:lastModifiedBy>Bob Johnson</cp:lastModifiedBy>

  <dcterms:created xsi:type="dcterms:W3CDTF">

    2025-08-15T09:23:17Z

  </dcterms:created>

  <dcterms:modified xsi:type="dcterms:W3CDTF">

    2026-01-10T14:45:32Z

  </dcterms:modified>

  <cp:revision>12</cp:revision>

  <dc:title>Q4 Financial Report</dc:title>

  <dc:subject>Quarterly Financials</dc:subject>

  <dc:description>Internal revenue analysis</dc:description>

  <cp:keywords>finance, Q4, revenue, confidential</cp:keywords>

  <cp:category>Financial Reports</cp:category>

</cp:coreProperties>

Identity Fields

dc:creator

The name of the person who originally created the file. Taken from the Office account name or Windows user profile at creation time.

cp:lastModifiedBy

The name of the last person to save the file. Updated on every save operation.

Temporal Fields

dcterms:created

ISO 8601 timestamp of when the file was first created. Uses W3CDTF format with UTC timezone.

dcterms:modified

ISO 8601 timestamp of the last save. Updated every time the file is saved, whether content changes or not.

Privacy Risks in Core Properties

Core properties are the most visible metadata leak. Many organizations unknowingly expose:

  • Employee full names in dc:creator and cp:lastModifiedBy, revealing who worked on a document
  • Document titles and descriptions that describe internal projects or contain classification labels like "confidential"
  • Keywords and categories that expose internal taxonomy systems and organizational structure
  • Creation dates that reveal when work began, potentially exposing planning timelines to competitors
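Reading these fields programmatically takes very little code. The sketch below flattens core.xml into a dictionary keyed by local element name (creator, lastModifiedBy, created, and so on); read_core_properties is a hypothetical helper name, and values are returned as plain strings without timestamp parsing.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_core_properties(xlsx):
    """Return the core properties as {local element name: text}."""
    with zipfile.ZipFile(xlsx) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    props = {}
    for child in root:
        # Tags look like "{namespace}creator"; keep only the local part.
        name = child.tag.rsplit("}", 1)[-1]
        # Strip whitespace because some writers pretty-print the values.
        props[name] = (child.text or "").strip()
    return props
```

A pre-sharing check could then compare props.get("creator") and props.get("lastModifiedBy") against whatever names the organization considers acceptable to publish.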

docProps/app.xml: Application Fingerprints

While core.xml captures who and when, app.xml captures what and how. This file records detailed information about the application that created and last modified the document, along with structural statistics about the workbook itself. This information is particularly valuable for forensic analysis because it reveals the software environment behind the document.

docProps/app.xml

<Properties xmlns="...schemas.openxmlformats.org/officeDocument/2006/extended-properties">

  <Application>Microsoft Excel</Application>

  <AppVersion>16.0300</AppVersion>

  <DocSecurity>0</DocSecurity>

  <ScaleCrop>false</ScaleCrop>

  <Company>Acme Corporation</Company>

  <Manager>Sarah Director</Manager>

  <LinksUpToDate>false</LinksUpToDate>

  <SharedDoc>false</SharedDoc>

  <HeadingPairs>

    <vt:vector size="2" baseType="variant">

      <vt:variant><vt:lpstr>Worksheets</vt:lpstr></vt:variant>

      <vt:variant><vt:i4>5</vt:i4></vt:variant>

    </vt:vector>

  </HeadingPairs>

  <TitlesOfParts>

    <vt:vector size="5" baseType="lpstr">

      <vt:lpstr>Summary</vt:lpstr>

      <vt:lpstr>Revenue Detail</vt:lpstr>

      <vt:lpstr>Cost Analysis</vt:lpstr>

      <vt:lpstr>Projections</vt:lpstr>

      <vt:lpstr>Raw Data</vt:lpstr>

    </vt:vector>

  </TitlesOfParts>

</Properties>

Application Version Fingerprints

The AppVersion field encodes the major Excel version. This lets forensic investigators narrow down which generation of Excel saved the file, although every release since Excel 2016 reports the same value:

AppVersion    Excel Version
12.0000       Excel 2007
14.0300       Excel 2010
15.0300       Excel 2013
16.0300       Excel 2016 / 2019 / 365

Sensitive Fields in app.xml

  • Company: Reveals the organization name configured during Office installation—often exposing the company behind an "anonymous" document
  • Manager: Exposes organizational hierarchy by naming the document creator's manager
  • TitlesOfParts: Lists worksheet names, including hidden sheets. If a tool edits the workbook without regenerating app.xml, the list can also retain the names of sheets that have since been deleted
  • HeadingPairs: Reveals whether the file contains named ranges, charts, or other object types beyond basic worksheets
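A short Python sketch can pull these fields out of app.xml. The name read_app_properties is illustrative, the version mapping simply restates the AppVersion table above (so it cannot distinguish Excel 2016, 2019, and 365), and the {*} wildcard assumes Python 3.8 or later.

```python
import zipfile
import xml.etree.ElementTree as ET

# Major version -> release, restating the AppVersion table in this article.
EXCEL_BY_MAJOR = {
    12: "Excel 2007",
    14: "Excel 2010",
    15: "Excel 2013",
    16: "Excel 2016 / 2019 / 365",
}


def read_app_properties(xlsx):
    """Return the fingerprint-relevant fields of docProps/app.xml."""
    with zipfile.ZipFile(xlsx) as zf:
        root = ET.fromstring(zf.read("docProps/app.xml"))

    def text(tag):
        node = root.find("{*}" + tag)
        return node.text if node is not None else None

    version = text("AppVersion")
    major = int(version.split(".")[0]) if version else None
    return {
        "application": text("Application"),
        "app_version": version,
        "excel_release": EXCEL_BY_MAJOR.get(major, "unknown"),
        "company": text("Company"),
        "manager": text("Manager"),
        # Every vt:lpstr below TitlesOfParts is a sheet (or other part) title.
        "titles_of_parts": [n.text for n in root.findall("{*}TitlesOfParts//{*}lpstr")],
    }
```

Because third-party libraries can write any Application and AppVersion they like, the result is a claim made by the file, not proof of the software that produced it.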

docProps/custom.xml: Custom and Hidden Properties

The optional custom.xml file stores user-defined or application-defined custom properties. These are arbitrary key-value pairs that can be added programmatically or through document management systems. Many organizations overlook this file when cleaning metadata because its contents do not appear in Excel's File > Info summary; they are shown only on the Custom tab of the Advanced Properties dialog.

docProps/custom.xml

<Properties xmlns="...schemas.openxmlformats.org/officeDocument/2006/custom-properties">

  <property fmtid="..." pid="2" name="Classification">

    <vt:lpwstr>Confidential - Internal Only</vt:lpwstr>

  </property>

  <property fmtid="..." pid="3" name="Department">

    <vt:lpwstr>Finance - Treasury</vt:lpwstr>

  </property>

  <property fmtid="..." pid="4" name="DMS_DocID">

    <vt:lpwstr>DOC-2025-FIN-00847</vt:lpwstr>

  </property>

  <property fmtid="..." pid="5" name="ReviewStatus">

    <vt:lpwstr>Approved by Legal</vt:lpwstr>

  </property>

</Properties>

Common Sources of Custom Properties

Document Management Systems

  • SharePoint document IDs and library metadata
  • iManage/NetDocuments classification tags
  • OpenText document tracking numbers
  • Workflow status and approval chains

Data Loss Prevention (DLP)

  • Sensitivity labels (e.g., Microsoft Purview)
  • Classification stamps from security tools
  • Content inspection fingerprints
  • Policy compliance flags
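Because custom.xml is optional and easy to miss, it helps to dump it explicitly during an audit. The following sketch returns the properties as a dictionary; read_custom_properties is a hypothetical name, and the value is taken from the first child of each property element regardless of its variant type.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_custom_properties(xlsx):
    """Return {property name: value text}, or {} when custom.xml is absent."""
    with zipfile.ZipFile(xlsx) as zf:
        if "docProps/custom.xml" not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read("docProps/custom.xml"))
    props = {}
    for prop in root.findall("{*}property"):
        # The single child (vt:lpwstr, vt:i4, vt:bool, ...) carries the value.
        props[prop.get("name")] = prop[0].text if len(prop) else None
    return props
```

An empty result means the part is absent or has no entries. Anything else deserves a manual look before the file leaves the organization.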

xl/workbook.xml: The Workbook Blueprint

The workbook file is the central hub of the spreadsheet data. It defines which sheets exist, their order, their visibility state, and structural properties like defined names, calculation settings, and workbook protection. From a metadata perspective, this file reveals the document's organizational structure.

xl/workbook.xml (key sections)

<workbook xmlns="...schemas.openxmlformats.org/spreadsheetml/2006/main">

  <fileVersion appName="xl"

    lastEdited="7" lowestEdited="7"

    rupBuild="27425"/>

  <workbookPr defaultThemeVersion="166925"/>

  <sheets>

    <sheet name="Summary" sheetId="1" r:id="rId1"/>

    <sheet name="Revenue Detail" sheetId="2" r:id="rId2"/>

    <sheet name="Hidden Analysis" sheetId="3" r:id="rId3"

      state="hidden"/>

    <sheet name="Config" sheetId="4" r:id="rId4"

      state="veryHidden"/>

  </sheets>

  <definedNames>

    <definedName name="_xlnm.Print_Area"

      localSheetId="0">Summary!$A$1:$G$50</definedName>

    <definedName name="CostBasis">

      'Hidden Analysis'!$B$2:$B$100</definedName>

  </definedNames>

  <calcPr calcId="191029"/>

</workbook>

Hidden and Very Hidden Sheets

Excel supports three sheet visibility states. The state attribute in the <sheet> element controls this:

visible (default)

The sheet appears in the tab bar. No state attribute is present.

hidden

The sheet is hidden from the tab bar but can be unhidden through the right-click menu. Marked with state="hidden".

veryHidden

The sheet is invisible and cannot be unhidden through the normal UI. Only accessible via VBA or by editing the XML directly. Marked with state="veryHidden".

What Defined Names Reveal

The <definedNames> section is a rich source of metadata intelligence. Named ranges often expose:

  • Data structure hints: Names like CostBasis, MarginCalc, or DiscountTiers reveal the purpose and structure of hidden data
  • Cross-references to hidden sheets: Named ranges that reference hidden or very hidden sheets expose data the creator intended to conceal
  • Print areas: The _xlnm.Print_Area name reveals which data was intended for printing and distribution
  • External references: Named ranges pointing to other workbooks expose file paths and network locations
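Sheet visibility and defined names can be cross-checked automatically. The sketch below lists each sheet's state and reports defined names whose formula text mentions a hidden or very hidden sheet. The function name hidden_sheet_report is illustrative, and the match is a plain substring test, so a sheet whose name is contained in another sheet's name may produce false positives.

```python
import zipfile
import xml.etree.ElementTree as ET


def hidden_sheet_report(xlsx):
    """Return ({sheet name: state}, [(defined name, hidden sheet, formula), ...])."""
    with zipfile.ZipFile(xlsx) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    # A missing state attribute means the sheet is visible.
    states = {
        s.get("name"): s.get("state", "visible")
        for s in root.findall("{*}sheets/{*}sheet")
    }
    hidden = sorted(name for name, state in states.items() if state != "visible")
    exposed = []
    for dn in root.findall("{*}definedNames/{*}definedName"):
        formula = (dn.text or "").strip()
        for sheet in hidden:
            if sheet in formula:
                exposed.append((dn.get("name"), sheet, formula))
    return states, exposed
```

For the workbook.xml shown earlier, the second return value would contain the CostBasis name together with the 'Hidden Analysis' range it points to.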

xl/sharedStrings.xml: The Text Repository

Excel optimizes storage by maintaining a single shared strings table. Instead of storing each text value directly in the worksheet cells, it stores the text once in sharedStrings.xml and references it by index from the worksheet. This design has significant implications for metadata analysis and forensic investigation.

xl/sharedStrings.xml

<sst xmlns="...schemas.openxmlformats.org/spreadsheetml/2006/main"

  count="1847" uniqueCount="523">

  <si><t>Product Name</t></si>       <!-- index 0 -->

  <si><t>Unit Price</t></si>          <!-- index 1 -->

  <si><t>CONFIDENTIAL - DO NOT SHARE</t></si> <!-- index 2 -->

  <si><t>Internal Review Draft</t></si>   <!-- index 3 -->

  <si><t>john.smith@acme.com</t></si>    <!-- index 4 -->

  <!-- ... 518 more strings ... -->

</sst>

Count vs UniqueCount

The count attribute indicates total string references across all cells, while uniqueCount shows unique strings. The ratio reveals data patterns:

  • High count/uniqueCount ratio = many repeated values (category columns, status fields)
  • Ratio close to 1:1 = mostly unique text (names, descriptions, IDs)
  • Unusually high uniqueCount with low cell count may indicate orphaned strings from deleted data

Orphaned Strings

When cells are deleted, their text values may remain in the shared strings table as orphaned entries. This is one of the most powerful forensic artifacts in XLSX files:

  • Deleted employee names, email addresses, or client information
  • Draft text that was removed before the final version
  • Error messages or notes from the development process
  • Data from deleted columns or rows that was never cleaned from the SST
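Orphaned entries can be found by comparing the shared strings table against the indices that worksheet cells actually reference. The sketch below makes some simplifying assumptions: worksheets live under xl/worksheets/, only cells with t="s" use the table, and all text runs inside an entry (including any phonetic runs) are concatenated. The name orphaned_strings is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def orphaned_strings(xlsx):
    """Return [(index, text)] for shared strings that no worksheet cell references."""
    with zipfile.ZipFile(xlsx) as zf:
        sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
        # One <si> per table entry; join its <t> runs to get the full text.
        strings = [
            "".join(t.text or "" for t in si.findall(".//{*}t"))
            for si in sst.findall("{*}si")
        ]
        used = set()
        for name in zf.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            sheet = ET.fromstring(zf.read(name))
            for cell in sheet.findall(".//{*}c"):
                value = cell.find("{*}v")
                if cell.get("t") == "s" and value is not None and value.text:
                    used.add(int(value.text))
    return [(i, s) for i, s in enumerate(strings) if i not in used]
```

Any entry this returns is text that is stored in the file but displayed in no cell, which is exactly the material worth reviewing before sharing.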

xl/worksheets/sheetN.xml: Cell-Level Data and Metadata

Each worksheet is stored as a separate XML file. These files contain the actual cell data, formulas, formatting references, data validation rules, conditional formatting, comments, hyperlinks, and sheet-level protection settings. The worksheet XML is where content meets metadata.

xl/worksheets/sheet1.xml (simplified)

<worksheet xmlns="...spreadsheetml/2006/main">

  <sheetViews>

    <sheetView tabSelected="1" workbookViewId="0">

      <selection activeCell="D15" sqref="D15"/>

    </sheetView>

  </sheetViews>

  <sheetData>

    <row r="1" spans="1:5">

      <c r="A1" t="s"><v>0</v></c>   <!-- shared string index 0 -->

      <c r="B1" t="s"><v>1</v></c>   <!-- shared string index 1 -->

      <c r="C1"><v>99.95</v></c>        <!-- numeric value -->

      <c r="D1" s="4">                    <!-- style index 4 -->

        <f>C1*1.2</f>               <!-- formula -->

        <v>119.94</v>              <!-- cached result -->

      </c>

    </row>

  </sheetData>

  <hyperlinks>

    <hyperlink ref="A10" r:id="rId1"/>

  </hyperlinks>

</worksheet>

Metadata Hidden in Worksheet XML

Active Cell Selection

The activeCell attribute of the <selection> element inside <sheetView> records where the cursor was when the file was last saved. This reveals which cell the last editor was working on, potentially exposing the focus of their analysis.

Formulas and Calculations

The <f> element stores the actual formula text. Even if the visible value is a simple number, the formula reveals the calculation logic, cell references, and potentially references to external workbooks.

Hyperlinks

Hyperlinks stored in the worksheet can point to internal network resources, SharePoint sites, or external URLs that reveal the organization's infrastructure.

Data Validation

Validation rules define dropdown lists, allowed ranges, and input constraints. These can expose business rules, valid value sets, and the intended use of data fields.
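Two of these artifacts, the active cell and the formula text, can be extracted from a worksheet part with a few lines of Python. This is a sketch only: worksheet_clues is a hypothetical name, the default part path is an assumption, and cells that inherit a shared formula may report None as their formula text.

```python
import zipfile
import xml.etree.ElementTree as ET


def worksheet_clues(xlsx, part="xl/worksheets/sheet1.xml"):
    """Return the saved cursor position(s) and a {cell: formula text} map."""
    with zipfile.ZipFile(xlsx) as zf:
        root = ET.fromstring(zf.read(part))
    active = [
        sel.get("activeCell")
        for sel in root.findall(".//{*}sheetView/{*}selection")
    ]
    formulas = {}
    for cell in root.findall(".//{*}sheetData/{*}row/{*}c"):
        f = cell.find("{*}f")
        if f is not None:
            formulas[cell.get("r")] = f.text
    return {"active_cells": active, "formulas": formulas}
```

Applied to the simplified sheet1.xml above, this would report D15 as the active cell and C1*1.2 as the formula in D1.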

xl/styles.xml: Formatting and Number Formats

The styles file contains all cell formatting definitions used in the workbook: number formats, fonts, fills, borders, and cell style combinations. While this may seem purely visual, styles carry meaningful metadata about the document's origin, purpose, and editing history.

Number Formats Reveal Intent

Custom number formats in <numFmts> reveal how data is intended to be interpreted and displayed. These format codes are highly informative:

<!-- Currency format reveals locale and precision -->

<numFmt numFmtId="164" formatCode="&quot;$&quot;#,##0.00"/>

<!-- Date format reveals regional conventions -->

<numFmt numFmtId="165" formatCode="dd/mm/yyyy"/>

<!-- Accounting format with specific currency -->

<numFmt numFmtId="166" formatCode="_-&quot;£&quot;* #,##0.00"/>

  • Currency symbols reveal the geographic origin ($ vs £ vs € vs ¥)
  • Date format order (dd/mm vs mm/dd) indicates regional locale settings
  • Decimal precision hints at the financial context (2 decimals for currency, 4+ for interest rates)
  • Custom format strings with text like "kg", "units", or "bps" reveal the domain of the data
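Custom number formats are simple to list. The sketch below returns each custom format code from styles.xml along with any currency symbols it contains. The name number_format_hints is illustrative, and the symbol test is deliberately naive: a "$" inside a locale tag such as [$-409] would also be reported.

```python
import zipfile
import xml.etree.ElementTree as ET

CURRENCY_SYMBOLS = "$£€¥"


def number_format_hints(xlsx):
    """Return [(numFmtId, format code, currency symbols found)] for custom formats."""
    with zipfile.ZipFile(xlsx) as zf:
        root = ET.fromstring(zf.read("xl/styles.xml"))
    hints = []
    for fmt in root.findall("{*}numFmts/{*}numFmt"):
        code = fmt.get("formatCode", "")
        symbols = [s for s in CURRENCY_SYMBOLS if s in code]
        hints.append((int(fmt.get("numFmtId")), code, symbols))
    return hints
```

The output is a starting point for interpretation, not a verdict. A dd/mm/yyyy code or a pound sign suggests a locale but does not establish one.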

Font Metadata as Application Fingerprints

The fonts defined in styles.xml offer a coarse hint about which generation of Excel created the file, because the default font changed across versions:

Default Font         Excel Version
Arial 10pt           Excel 2003 and earlier (legacy)
Calibri 11pt         Excel 2007 through 2013
Calibri 11pt         Excel 2016 / 2019 / 365 (default)
Aptos Narrow 11pt    Excel 365 (2024+ with new default)

xl/theme/theme1.xml: Visual Identity and Origin Clues

The theme file defines the color palette, font scheme, and visual effects used throughout the workbook. While themes primarily control appearance, they also carry metadata that reveals the document's origin and creation context.

What Themes Reveal

Application Origin

  • Theme name and ID identify the Office version that created the file
  • Custom themes may carry corporate branding names (e.g., "Acme Corporate 2025")
  • Files created by third-party libraries often have minimal or non-standard themes

Corporate Identity

  • Custom color palettes match corporate brand guidelines
  • Font pairs reveal whether the organization uses custom typography
  • Theme versioning can indicate when the corporate template was last updated

xl/calcChain.xml: The Calculation Dependency Map

The calculation chain file records the order in which Excel recalculates formulas. This file exists to optimize recalculation performance, but it also serves as a metadata artifact that reveals the spreadsheet's formula structure even when formulas have been converted to values.

xl/calcChain.xml

<calcChain xmlns="...spreadsheetml/2006/main">

  <c r="D1" i="1"/>

  <c r="D2" i="1"/>

  <c r="E1" i="1"/>

  <c r="F5" i="2"/>

  <c r="G10" i="3" l="1"/>

</calcChain>

Forensic Value of calcChain.xml

  • Ghost formulas: If a user pastes values over formulas but the calcChain still references those cells, it proves formulas existed there previously
  • Sheet references: The i attribute carries the sheet ID (the sheetId value in workbook.xml), revealing formula locations even on hidden sheets
  • Dependency chains: The l attribute is a flag that marks the start of a new dependency level, so counting these flags gives a rough sense of how deeply layered the calculation model is
  • Missing file: If calcChain.xml is absent from a file that contains formulas, it may indicate the file was generated by a non-Excel application or was manually reconstructed
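The ghost-formula check can be sketched as a comparison between calcChain.xml and the cells that still carry an <f> element. Several things here are assumptions rather than documented behavior: the function name ghost_formula_cells, the caller supplying the worksheet part and its sheet ID, and the handling of entries that omit the i attribute (treated as belonging to the same sheet as the previous entry).

```python
import zipfile
import xml.etree.ElementTree as ET


def ghost_formula_cells(xlsx, sheet_part, sheet_id):
    """Return cells listed in calcChain for sheet_id that no longer hold a formula."""
    with zipfile.ZipFile(xlsx) as zf:
        if "xl/calcChain.xml" not in zf.namelist():
            return []
        chain = ET.fromstring(zf.read("xl/calcChain.xml"))
        sheet = ET.fromstring(zf.read(sheet_part))
    with_formula = {
        c.get("r") for c in sheet.findall(".//{*}c") if c.find("{*}f") is not None
    }
    ghosts = []
    current = None
    for entry in chain.findall("{*}c"):
        # Assumption: a missing "i" means "same sheet as the previous entry".
        current = entry.get("i", current)
        if current == str(sheet_id) and entry.get("r") not in with_formula:
            ghosts.append(entry.get("r"))
    return ghosts
```

Excel normally rebuilds the chain when it saves, so a non-empty result is more likely in files that were edited by other tools. Treat it as a lead to verify, not as proof.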

xl/printerSettings/: Hardware Fingerprints

One of the most overlooked metadata sources in XLSX files is the printer settings directory. When a user sets a print area, adjusts page setup, or opens print preview, Excel stores binary printer configuration data that can reveal detailed information about the hardware environment.

What Printer Settings Expose

  • Printer model and manufacturer: The DEVMODE structure stored in .bin files contains the exact printer driver name (e.g., "HP LaserJet Pro MFP M428fdw")
  • Network printer paths: Network printers expose UNC paths like \\PRINTSERVER\3rd-Floor-HP, revealing internal server names and physical locations
  • Paper size and orientation: Regional paper sizes (Letter vs A4) indicate geographic origin
  • Print resolution: DPI settings and color modes can indicate whether the document was set up for draft review or final production

ZIP Archive-Level Metadata

Beyond the XML content, the ZIP container itself carries metadata in its file headers. Every file entry in a ZIP archive includes timestamps, compression parameters, and file attributes that exist independently of the XML metadata. This creates a second, often overlooked layer of forensic evidence.

# View ZIP entry details with timestamps

$ unzip -v report.xlsx

Length  Method  Size  Cmpr  Date    Time   CRC-32  Name

------  ------  ----  ----  ----    ----   ------  ----

  1580  Defl:N   447  72%  01-10-26 14:45  a3b2c1d0  [Content_Types].xml

   590  Defl:N   243  59%  01-10-26 14:45  e4f5a6b7  _rels/.rels

   838  Defl:N   380  55%  01-10-26 14:45  c8d9e0f1  docProps/core.xml

  4290  Defl:N  1285  70%  01-10-26 14:45  2a3b4c5d  xl/workbook.xml

 89432  Defl:N 18726  79%  01-10-26 14:45  6e7f8091  xl/worksheets/sheet1.xml

Timestamp Comparison

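ZIP entry timestamps can be compared with the dcterms:modified date in core.xml, with two caveats. ZIP headers store local time with no time zone, whereas core.xml uses UTC, so an offset of a few hours is normal. Some writers, including many builds of desktop Excel, also store a fixed placeholder date (1980-01-01) instead of the real save time. Entries written in one operation normally share the same timestamp, so mixed timestamps within a single archive suggest that it was modified after creation.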

Compression Signatures

Different applications use different Deflate settings, so the same XML compresses to slightly different sizes depending on the writer. Files generated by libraries like openpyxl, Apache POI, or EPPlus often show compression characteristics that differ from Excel's, which can help identify the true source application.
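Python's zipfile module exposes the same header fields that unzip -v prints, which makes both checks scriptable. In this sketch the function names zip_entry_metadata and distinct_timestamps are illustrative, and the ratio is simply one minus compressed size over uncompressed size.

```python
import zipfile

METHODS = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflate"}


def zip_entry_metadata(xlsx):
    """Return per-entry timestamp, compression method, and compression ratio."""
    rows = []
    with zipfile.ZipFile(xlsx) as zf:
        for info in zf.infolist():
            ratio = 1 - info.compress_size / info.file_size if info.file_size else 0.0
            rows.append({
                "name": info.filename,
                "modified": info.date_time,  # (year, month, day, hour, minute, second)
                "method": METHODS.get(info.compress_type, str(info.compress_type)),
                "ratio": ratio,
            })
    return rows


def distinct_timestamps(xlsx):
    """Return the sorted set of entry timestamps; more than one deserves a closer look."""
    return sorted({row["modified"] for row in zip_entry_metadata(xlsx)})
```

A single value from distinct_timestamps is consistent with the archive having been written in one pass. Several values point to entries that were added or replaced separately.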

Practical Analysis: Putting It All Together

Understanding the XML structure is only valuable when you apply it to real-world analysis. Here is a systematic approach to examining an XLSX file's metadata using common command-line tools: unzip, xmllint (part of libxml2), and grep.

Step-by-Step Metadata Extraction

# Step 1: Extract the archive

mkdir analysis && cd analysis

unzip ../report.xlsx -d extracted/

# Step 2: Examine identity and timestamps

cat extracted/docProps/core.xml | xmllint --format -

# Step 3: Check application fingerprint and company

cat extracted/docProps/app.xml | xmllint --format -

# Step 4: Look for custom properties

[ -f extracted/docProps/custom.xml ] && xmllint --format extracted/docProps/custom.xml

# Step 5: Inspect workbook structure and hidden sheets

cat extracted/xl/workbook.xml | xmllint --format -

# Step 6: Search shared strings for sensitive data

cat extracted/xl/sharedStrings.xml | xmllint --format - | \

  grep -i "confidential\|internal\|draft\|password\|secret"

# Step 7: Check for orphaned relationship targets

cat extracted/xl/_rels/workbook.xml.rels | xmllint --format -

# Step 8: Compare ZIP timestamps with core.xml dates

unzip -v ../report.xlsx | head -20

Quick Metadata Audit Checklist

Before Sharing Externally

  • Is dc:creator appropriate to share?
  • Does Company in app.xml expose your organization?
  • Are there hidden or veryHidden sheets?
  • Does sharedStrings.xml contain sensitive text?
  • Are there custom properties with classification labels?

For Forensic Investigation

  • Do core.xml timestamps match ZIP header timestamps?
  • Does AppVersion match the claimed Excel version?
  • Are there orphaned strings in the shared strings table?
  • Do relationship files reference deleted parts?
  • Does the calcChain reference cells that no longer contain formulas?

Identifying Non-Excel Origin: Third-Party Library Signatures

Not all XLSX files are created by Microsoft Excel. Many are generated by programming libraries, web applications, and reporting tools. These files often have distinctive structural signatures that differ from genuine Excel output. Knowing these differences is critical for forensic analysis and authenticity verification.

Common Library Fingerprints

openpyxl (Python)

Sets Application to "Microsoft Excel" but uses distinctive XML formatting with different whitespace patterns. Often missing calcChain.xml and printer settings. Theme file may be minimal or use the default openpyxl theme.

Apache POI (Java)

May set Application to "Apache POI" or leave it blank. Uses XSSF format with specific namespace prefixes. Relationship IDs often follow a different numbering pattern than Excel.

EPPlus (.NET)

Reports Application as "Microsoft Excel" with a specific AppVersion rather than naming itself, so app.xml alone does not give it away. Missing calcChain.xml is common. Custom XML namespace prefixes differ from native Excel output.

Google Sheets (Export)

Exported files often have minimal app.xml properties, missing Company and Manager fields, and a simplified theme. The XML structure is valid but noticeably different from native Excel output in element ordering and attribute presence.

Key Takeaways

XLSX Is a ZIP of XMLs

Every XLSX file is a ZIP archive containing interconnected XML files. Understanding this structure is the foundation for any meaningful metadata analysis, forensic investigation, or privacy audit.

Metadata Lives Everywhere

Metadata is not confined to core.xml. Author names, timestamps, application fingerprints, network paths, and sensitive text are distributed across dozens of files within the archive.

Orphaned Data Persists

Shared strings, calculation chains, and relationship entries often survive data deletion. These orphaned artifacts are invaluable for forensic recovery and detecting what was changed before a file was shared.

Clean Before Sharing

A thorough metadata cleanup must address every XML file in the archive, not just the document properties. Shared strings, printer settings, custom properties, and relationship files all carry potentially sensitive information.
