Back to Blog
Technical

Understanding the XML Structure Inside XLSX Files

Every XLSX file is a ZIP archive containing a structured collection of XML documents. Understanding this internal structure is essential for developers who need to programmatically read, modify, or generate spreadsheets, and for analysts who want to understand exactly what data an Excel file contains beyond what appears on screen.

Technical Team
March 31, 2026
19 min read

XLSX Is Just a ZIP Archive

The single most important thing to understand about the XLSX format is that it is not a monolithic binary blob. It is a standard ZIP archive with a .xlsx file extension. You can rename any XLSX file to .zip, open it with any archive tool, and browse its contents directly. This design comes from the Office Open XML (OOXML) standard, which Microsoft introduced with Office 2007 to replace the older binary .xls format.

The format follows the Open Packaging Conventions (OPC) specification, which defines how multiple XML parts are organized, named, and linked together inside the ZIP container. Each part has a content type, a location within the archive, and relationships that describe how parts reference each other.

Try It Yourself: Unzipping an XLSX File

You can explore any XLSX file’s internal structure right now using command-line tools:

# Make a copy and rename to .zip
cp spreadsheet.xlsx spreadsheet.zip

# List contents
unzip -l spreadsheet.zip

# Extract all files
unzip spreadsheet.zip -d spreadsheet_extracted/

# Or on macOS/Linux, directly unzip the .xlsx
unzip spreadsheet.xlsx -d spreadsheet_extracted/

On Windows, you can also use 7-Zip or WinRAR to open the file directly without renaming.

When you unzip an XLSX file, you will see a directory structure similar to the following. The exact contents vary depending on what features the workbook uses, but the core structure is consistent across all XLSX files.

The Directory Layout

spreadsheet.xlsx/
├── [Content_Types].xml          # Maps file extensions to content types
├── _rels/
│   └── .rels                    # Top-level relationships
├── docProps/
│   ├── core.xml                 # Author, title, dates (Dublin Core)
│   └── app.xml                  # Application info, sheet names
└── xl/
    ├── workbook.xml             # Workbook structure and sheet list
    ├── sharedStrings.xml        # Deduplicated string table
    ├── styles.xml               # Cell formatting and number formats
    ├── theme/
    │   └── theme1.xml           # Color and font theme
    ├── worksheets/
    │   ├── sheet1.xml           # First worksheet data
    │   ├── sheet2.xml           # Second worksheet data
    │   └── ...
    ├── _rels/
    │   └── workbook.xml.rels    # Links sheet IDs to file paths
    └── (optional parts)
        ├── calcChain.xml        # Formula calculation order
        ├── comments1.xml        # Cell comments
        ├── drawings/            # Charts, images, shapes
        ├── tables/              # Structured table definitions
        ├── pivotTables/         # Pivot table definitions
        ├── pivotCache/          # Pivot table cached data
        └── printerSettings/     # Printer configuration

Every part in this structure serves a specific purpose. Let’s walk through each major component and understand what it contains, why it matters, and what you can learn from reading it directly.

Content Types and Relationships: The Glue

Before Excel can read any XML part, it needs to know what type of content each file contains and how parts reference each other. Two mechanisms handle this: the content types file and the relationship files.

[Content_Types].xml

This file sits at the root of the archive and acts as a registry. It maps file extensions and specific part names to MIME-like content types so the application knows how to process each file.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml"
    ContentType="application/xml"/>
  <Override PartName="/xl/workbook.xml"
    ContentType="application/vnd.openxmlformats-officedocument
      .spreadsheetml.sheet.main+xml"/>
  <Override PartName="/xl/worksheets/sheet1.xml"
    ContentType="application/vnd.openxmlformats-officedocument
      .spreadsheetml.worksheet+xml"/>
  <Override PartName="/xl/sharedStrings.xml"
    ContentType="application/vnd.openxmlformats-officedocument
      .spreadsheetml.sharedStrings+xml"/>
</Types>

Relationship Files (.rels)

Relationship files define how parts connect to each other. They use a simple target-and-type model. The top-level _rels/.rels file links the root of the package to the workbook, document properties, and other top-level parts. Each major part can have its own .rels file — for example, xl/_rels/workbook.xml.rels maps the logical sheet IDs used in workbook.xml to the actual worksheet file paths.

<!-- _rels/.rels -->
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
    Type="http://schemas.openxmlformats.org/.../officeDocument"
    Target="xl/workbook.xml"/>
  <Relationship Id="rId2"
    Type="http://schemas.openxmlformats.org/.../core-properties"
    Target="docProps/core.xml"/>
  <Relationship Id="rId3"
    Type="http://schemas.openxmlformats.org/.../extended-properties"
    Target="docProps/app.xml"/>
</Relationships>

Why Relationships Matter

The relationship system means you cannot simply rename or move a file within the ZIP and expect it to work. If you move sheet1.xml to a different folder, you must also update the relationship file that points to it. When programmatically modifying XLSX files, forgetting to update relationships is one of the most common causes of file corruption.

Document Properties: core.xml and app.xml

The docProps/ folder contains metadata about the document itself rather than its data content. Two files are standard: core.xml for Dublin Core metadata and app.xml for application-specific properties.

core.xml — Who and When

This file stores authorship and timing information using the Dublin Core metadata standard. It is the primary source of personally identifiable information in most Excel files.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
  xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>Jane Smith</dc:creator>
  <cp:lastModifiedBy>Bob Johnson</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">2026-01-15T09:30:00Z</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2026-03-28T14:22:00Z</dcterms:modified>
  <dc:title>Q1 Budget Report</dc:title>
  <dc:description>Internal budget analysis</dc:description>
  <cp:revision>47</cp:revision>
</cp:coreProperties>

app.xml — Application Fingerprint

This file records which application created the file, the company name from the Office installation, sheet names, and document statistics. It can reveal organizational information and the software environment used.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/.../extended-properties">
  <Application>Microsoft Excel</Application>
  <AppVersion>16.0300</AppVersion>
  <Company>Acme Corporation</Company>
  <TitlesOfParts>
    <vt:vector size="3" baseType="lpstr">
      <vt:lpstr>Revenue</vt:lpstr>
      <vt:lpstr>Expenses</vt:lpstr>
      <vt:lpstr>Projections</vt:lpstr>
    </vt:vector>
  </TitlesOfParts>
  <DocSecurity>0</DocSecurity>
</Properties>

Privacy Implications

  • core.xml reveals the full name of the person who created the file and the last person who modified it, along with exact timestamps.
  • app.xml reveals the company name configured in the Office installation, all worksheet names (including hidden sheets), and the exact Excel version.
  • The revision count in core.xml reveals how many times the file has been saved, indicating how much editing has occurred.

workbook.xml: The Central Hub

The workbook file is the central orchestration point for the entire spreadsheet. It defines which sheets exist, their order, their visibility state, named ranges, and workbook-level settings. It does not contain any cell data — that lives in the individual worksheet files.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main"
  xmlns:r="http://schemas.openxmlformats.org/.../relationships">
  <sheets>
    <sheet name="Revenue" sheetId="1" r:id="rId1"/>
    <sheet name="Expenses" sheetId="2" r:id="rId2"/>
    <sheet name="Internal Notes" sheetId="3" state="hidden" r:id="rId3"/>
    <sheet name="Raw Data" sheetId="4" state="veryHidden" r:id="rId4"/>
  </sheets>
  <definedNames>
    <definedName name="TaxRate">Revenue!$B$2</definedName>
    <definedName name="_xlnm.Print_Area" localSheetId="0">Revenue!$A$1:$G$50</definedName>
  </definedNames>
</workbook>

Hidden and Very Hidden Sheets

Notice the state attribute on sheets. There are three visibility levels:

  • visible (default, no state attribute) — normal sheet tabs visible to the user.
  • hidden — sheet is not shown in tabs but can be unhidden through the Excel UI (right-click on sheet tabs).
  • veryHidden — sheet can only be unhidden through the VBA editor or by editing the XML directly. Regular users will never discover it through the normal interface.

Very hidden sheets are a common place to store lookup tables, configuration data, or sensitive information that the workbook author did not intend to share. Inspecting workbook.xml is the most reliable way to find them.

Worksheet XML: Where the Data Lives

Each worksheet is stored as a separate XML file in the xl/worksheets/ folder. This is where cell values, formulas, and cell-level formatting references are stored. Understanding the worksheet XML structure is essential for anyone working with XLSX files programmatically.

Basic Cell Structure

Cell data is organized into rows, and each row contains cells. Each cell has a reference (like “A1”), a type indicator, a style index, and a value.

<worksheet xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main">
  <sheetData>
    <row r="1" spans="1:4">
      <!-- String cell referencing shared string index 0 -->
      <c r="A1" t="s" s="1">
        <v>0</v>
      </c>
      <!-- Numeric cell -->
      <c r="B1" s="2">
        <v>42500.75</v>
      </c>
      <!-- Formula cell -->
      <c r="C1" s="3">
        <f>B1*1.1</f>
        <v>46750.825</v>
      </c>
      <!-- Boolean cell -->
      <c r="D1" t="b">
        <v>1</v>
      </c>
    </row>
  </sheetData>
</worksheet>

Cell Type Codes

Type (t attribute)MeaningValue interpretation
sShared stringIndex into sharedStrings.xml
n or omittedNumberLiteral numeric value
bBoolean0 = FALSE, 1 = TRUE
eErrorError string like #REF!
strInline stringLiteral string (not shared)
inlineStrRich inline stringRich text stored in <is> element

Dates Are Numbers

One of the most confusing aspects of XLSX for newcomers is that dates are not stored as dates. They are stored as plain numbers — specifically, the count of days since January 1, 1900 (or January 1, 1904 on some Mac-originated files). The value 44927 represents January 1, 2023. Whether a cell displays as a date or a number depends entirely on the style applied to it through the s attribute, which indexes into styles.xml.

This means that if you read the raw XML, you cannot tell whether 44927 is a date or just the number 44,927 without also checking the style definition. This is a frequent source of bugs in custom XLSX parsers.

Formulas in XML

Formula cells contain both a <f> element with the formula text and a <v> element with the last calculated value. This dual storage means you can read formula results without recalculating, but the cached values may be stale if the file was saved without recalculation.

<!-- Simple formula -->
<c r="C1">
  <f>SUM(A1:B1)</f>
  <v>150</v>
</c>

<!-- Shared formula (applied across a range) -->
<c r="C2">
  <f t="shared" ref="C2:C100" si="0">SUM(A2:B2)</f>
  <v>200</v>
</c>
<c r="C3">
  <f t="shared" si="0"/>
  <v>175</v>
</c>

<!-- Array formula (Ctrl+Shift+Enter) -->
<c r="E1">
  <f t="array" ref="E1:E10">TRANSPOSE(A1:J1)</f>
  <v>42</v>
</c>

Shared Formulas Save Space

When the same formula pattern is applied to a range of cells (like dragging a formula down a column), Excel stores the full formula only once and marks subsequent cells as shared with t="shared" and a shared index (si). This significantly reduces file size for spreadsheets with repetitive formulas. If you are building an XLSX writer, implementing shared formulas correctly is one of the key optimizations.

sharedStrings.xml: The String Deduplication Table

Instead of storing the same text value in every cell that uses it, XLSX maintains a central string table in xl/sharedStrings.xml. Each unique string is stored once, and cells reference it by index. This is a significant space optimization — a column with 10,000 rows all containing “Approved” stores the string once and references index 0 ten thousand times.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main"
  count="15000" uniqueCount="342">
  <si><t>Product Name</t></si>        <!-- index 0 -->
  <si><t>Revenue</t></si>             <!-- index 1 -->
  <si><t>Quarter</t></si>             <!-- index 2 -->
  <si>                                 <!-- index 3: rich text -->
    <r>
      <t>Total: </t>
    </r>
    <r>
      <rPr><b/><color rgb="FFFF0000"/></rPr>
      <t>$1,250,000</t>
    </r>
  </si>
</sst>

The count attribute shows the total number of string references across all cells, while uniqueCount shows how many distinct strings exist. Rich text (text with mixed formatting within a single cell) uses <r> elements with run properties (<rPr>) to apply different styles to different parts of the string.

Forensic Insight: Orphaned Strings

The shared string table sometimes contains strings that are no longer referenced by any cell. These “orphaned” strings can be remnants of deleted data. If a user deletes a column containing sensitive information, the string values may persist in sharedStrings.xml even though no cell points to them. This is one reason why simply deleting visible data does not guarantee that the information has been removed from the file.

styles.xml: Formatting and Number Formats

The styles file defines all cell formatting used in the workbook. Rather than embedding style information in each cell, XLSX uses a layered indexing system. Each cell has a style index (s attribute) that points into a cross-reference table (cellXfs), which in turn references font, fill, border, and number format definitions.

<styleSheet xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main">
  <numFmts count="2">
    <numFmt numFmtId="164" formatCode="&quot;$&quot;#,##0.00"/>
    <numFmt numFmtId="165" formatCode="yyyy-mm-dd"/>
  </numFmts>
  <fonts count="3">
    <font><sz val="11"/><name val="Calibri"/></font>
    <font><b/><sz val="14"/><name val="Calibri"/></font>
    <font><sz val="11"/><color rgb="FFFF0000"/><name val="Calibri"/></font>
  </fonts>
  <fills count="2">
    <fill><patternFill patternType="none"/></fill>
    <fill><patternFill patternType="solid">
      <fgColor rgb="FFFFFF00"/>
    </patternFill></fill>
  </fills>
  <cellXfs count="3">
    <!-- Style index 0: default -->
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
    <!-- Style index 1: bold header -->
    <xf numFmtId="0" fontId="1" fillId="0" borderId="0"/>
    <!-- Style index 2: currency format -->
    <xf numFmtId="164" fontId="0" fillId="0" borderId="0"
      applyNumberFormat="1"/>
  </cellXfs>
</styleSheet>

When a cell has s="2", it uses the third entry in cellXfs, which applies number format 164 (the custom currency format "$"#,##0.00). This indirection system means the styles file can be complex, but it keeps individual worksheet files small and avoids repeating formatting information.

Built-in Number Format IDs

Number format IDs below 164 are built-in and not explicitly listed in the XML. Some important ones:

  • 0 — General
  • 10 (integer)
  • 20.00 (two decimals)
  • 90% (percentage)
  • 14mm-dd-yy (date)
  • 20h:mm (time)
  • 22m/d/yy h:mm (date+time)
  • 164+ — Custom formats defined in <numFmts>

Other Important Parts

calcChain.xml — Formula Calculation Order

This file records the order in which Excel should recalculate formulas. It lists cell references and their sheet IDs in dependency order. While not essential for reading data, it affects how recalculation engines process the workbook. If you delete or modify formulas programmatically, you should also update or remove the calculation chain to avoid recalculation errors.

theme/theme1.xml — Visual Theme

The theme file defines the color palette, font scheme, and formatting effects used throughout the workbook. When a cell color is defined as “theme color 4 with 40% tint,” the actual RGB value is computed from this theme definition. Changing the theme changes the appearance of every element that references theme colors — which is how Excel’s built-in color schemes work.

comments and threadedComments

Cell comments are stored in separate XML files (comments1.xml, etc.) linked to worksheets through relationships. Newer versions of Excel also use threadedComments for the modern threaded comment system, which includes author identifiers, timestamps, and reply chains. These files are a significant source of hidden metadata that reveals internal discussions.

tables/ — Structured Table Definitions

When you create a structured table in Excel (Insert → Table), the definition is stored in a separate XML file in the xl/tables/ folder. Table definitions include the table name, column names, the cell range the table covers, and any auto-filter settings. Tables can reveal the intended structure and purpose of data even when the visible formatting has been stripped.

Practical Applications

Understanding XLSX internals has direct practical value for several common tasks:

Metadata Auditing

Read core.xml and app.xml directly to identify author names, company information, and timestamps without relying on Excel’s Document Inspector, which can miss certain metadata.

Finding Hidden Content

Check workbook.xml for hidden and veryHidden sheets, scan sharedStrings.xml for orphaned strings, and look for comments files to find data that is invisible in the normal spreadsheet view.

Programmatic Generation

Build XLSX files from scratch by assembling the required XML parts and packaging them into a ZIP archive. This approach gives you full control over the output and is how libraries like openpyxl and SheetJS work internally.

Targeted Sanitization

Remove specific metadata files (like core.xml or comment files) from the ZIP without affecting the spreadsheet data, or rewrite properties to contain only approved values before sharing externally.

Quick Python Example: Reading XML Parts

import zipfile
from lxml import etree

# Open the XLSX as a ZIP archive
with zipfile.ZipFile('report.xlsx', 'r') as zf:
    # List all files inside
    for name in zf.namelist():
        print(name)

    # Read and parse core properties
    with zf.open('docProps/core.xml') as f:
        tree = etree.parse(f)
        root = tree.getroot()
        ns = {
            'dc': 'http://purl.org/dc/elements/1.1/',
            'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties'
        }
        creator = root.find('.//dc:creator', ns)
        print(f"Author: {creator.text if creator is not None else 'Unknown'}")

    # Read shared strings
    with zf.open('xl/sharedStrings.xml') as f:
        tree = etree.parse(f)
        ns = {'s': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
        strings = [
            si.find('.//s:t', ns).text
            for si in tree.findall('.//s:si', ns)
            if si.find('.//s:t', ns) is not None
        ]
        print(f"Found {len(strings)} unique strings")
        for s in strings[:10]:
            print(f"  - {s}")

Common Pitfalls When Working with XLSX XML

1. Namespace Handling

Every XLSX XML file uses namespaces extensively. If your XML parser does not handle namespaces correctly, you will fail to find any elements. Always use namespace-aware parsing and register the correct namespace prefixes for the OOXML schemas.

2. Relationship Consistency

Adding, removing, or renaming parts requires updating both [Content_Types].xml and the relevant .rels files. Missing this step results in files that open with errors or fail to open entirely.

3. Date/Number Ambiguity

Never assume a numeric cell value is “just a number.” Check its style index against styles.xml to determine if a date number format is applied. Many XLSX parsing bugs stem from treating date serial numbers as regular integers.

4. Shared String Indices

String cells store an index, not the string itself. If you modify sharedStrings.xml (adding or removing entries), all index references in all worksheets must be updated. Failing to do this corrupts every string reference.

5. ZIP Compression Settings

When repackaging modified XML files back into a ZIP, use the ZIP_DEFLATED compression method. Some tools default to ZIP_STORED (no compression), which produces valid but unnecessarily large files. Also ensure that directory entries in the ZIP do not use backslashes — use forward slashes only.

Explore Your XLSX Files with MetaData Analyzer

Understanding the XML structure is powerful, but you do not need to manually parse ZIP archives every time. MetaData Analyzer reads the internal structure of your XLSX files and presents all metadata, hidden content, author information, and structural details in a clear, actionable format — without sending your files to any server.

Key Takeaways

  • XLSX files are ZIP archives — you can explore their contents with any standard archive tool or programmatically with ZIP libraries.
  • Relationships bind parts together.rels files define how XML documents reference each other, and breaking these links corrupts the file.
  • Document properties expose personal datacore.xml and app.xml contain author names, company names, and timestamps that travel with every copy of the file.
  • Strings are deduplicated — the shared string table stores each unique text value once and cells reference them by index, which also means deleted data can persist as orphaned entries.
  • Dates are numbers with formatting — you must check styles.xml to distinguish dates from regular numbers, which is a frequent source of parsing bugs.
  • Hidden sheets exist in three visibility states — the veryHidden state is invisible through the normal Excel UI and can only be detected by reading workbook.xml directly.