Every XLSX file is a ZIP archive containing a structured collection of XML documents. Understanding this internal structure is essential for developers who need to programmatically read, modify, or generate spreadsheets, and for analysts who want to understand exactly what data an Excel file contains beyond what appears on screen.
The single most important thing to understand about the XLSX format is that it is not a monolithic binary blob. It is a standard ZIP archive with a .xlsx file extension. You can rename any XLSX file to .zip, open it with any archive tool, and browse its contents directly. This design comes from the Office Open XML (OOXML) standard, which Microsoft introduced with Office 2007 to replace the older binary .xls format.
The format follows the Open Packaging Conventions (OPC) specification, which defines how multiple XML parts are organized, named, and linked together inside the ZIP container. Each part has a content type, a location within the archive, and relationships that describe how parts reference each other.
You can explore any XLSX file’s internal structure right now using command-line tools:
# Make a copy and rename to .zip cp spreadsheet.xlsx spreadsheet.zip # List contents unzip -l spreadsheet.zip # Extract all files unzip spreadsheet.zip -d spreadsheet_extracted/ # Or on macOS/Linux, directly unzip the .xlsx unzip spreadsheet.xlsx -d spreadsheet_extracted/
On Windows, you can also use 7-Zip or WinRAR to open the file directly without renaming.
When you unzip an XLSX file, you will see a directory structure similar to the following. The exact contents vary depending on what features the workbook uses, but the core structure is consistent across all XLSX files.
spreadsheet.xlsx/
├── [Content_Types].xml # Maps file extensions to content types
├── _rels/
│ └── .rels # Top-level relationships
├── docProps/
│ ├── core.xml # Author, title, dates (Dublin Core)
│ └── app.xml # Application info, sheet names
└── xl/
├── workbook.xml # Workbook structure and sheet list
├── sharedStrings.xml # Deduplicated string table
├── styles.xml # Cell formatting and number formats
├── theme/
│ └── theme1.xml # Color and font theme
├── worksheets/
│ ├── sheet1.xml # First worksheet data
│ ├── sheet2.xml # Second worksheet data
│ └── ...
├── _rels/
│ └── workbook.xml.rels # Links sheet IDs to file paths
└── (optional parts)
├── calcChain.xml # Formula calculation order
├── comments1.xml # Cell comments
├── drawings/ # Charts, images, shapes
├── tables/ # Structured table definitions
├── pivotTables/ # Pivot table definitions
├── pivotCache/ # Pivot table cached data
└── printerSettings/ # Printer configurationEvery part in this structure serves a specific purpose. Let’s walk through each major component and understand what it contains, why it matters, and what you can learn from reading it directly.
Before Excel can read any XML part, it needs to know what type of content each file contains and how parts reference each other. Two mechanisms handle this: the content types file and the relationship files.
This file sits at the root of the archive and acts as a registry. It maps file extensions and specific part names to MIME-like content types so the application knows how to process each file.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml"
ContentType="application/xml"/>
<Override PartName="/xl/workbook.xml"
ContentType="application/vnd.openxmlformats-officedocument
.spreadsheetml.sheet.main+xml"/>
<Override PartName="/xl/worksheets/sheet1.xml"
ContentType="application/vnd.openxmlformats-officedocument
.spreadsheetml.worksheet+xml"/>
<Override PartName="/xl/sharedStrings.xml"
ContentType="application/vnd.openxmlformats-officedocument
.spreadsheetml.sharedStrings+xml"/>
</Types>Relationship files define how parts connect to each other. They use a simple target-and-type model. The top-level _rels/.rels file links the root of the package to the workbook, document properties, and other top-level parts. Each major part can have its own .rels file — for example, xl/_rels/workbook.xml.rels maps the logical sheet IDs used in workbook.xml to the actual worksheet file paths.
<!-- _rels/.rels -->
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/.../officeDocument"
Target="xl/workbook.xml"/>
<Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/.../core-properties"
Target="docProps/core.xml"/>
<Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/.../extended-properties"
Target="docProps/app.xml"/>
</Relationships>The relationship system means you cannot simply rename or move a file within the ZIP and expect it to work. If you move sheet1.xml to a different folder, you must also update the relationship file that points to it. When programmatically modifying XLSX files, forgetting to update relationships is one of the most common causes of file corruption.
The docProps/ folder contains metadata about the document itself rather than its data content. Two files are standard: core.xml for Dublin Core metadata and app.xml for application-specific properties.
This file stores authorship and timing information using the Dublin Core metadata standard. It is the primary source of personally identifiable information in most Excel files.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <dc:creator>Jane Smith</dc:creator> <cp:lastModifiedBy>Bob Johnson</cp:lastModifiedBy> <dcterms:created xsi:type="dcterms:W3CDTF">2026-01-15T09:30:00Z</dcterms:created> <dcterms:modified xsi:type="dcterms:W3CDTF">2026-03-28T14:22:00Z</dcterms:modified> <dc:title>Q1 Budget Report</dc:title> <dc:description>Internal budget analysis</dc:description> <cp:revision>47</cp:revision> </cp:coreProperties>
This file records which application created the file, the company name from the Office installation, sheet names, and document statistics. It can reveal organizational information and the software environment used.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/.../extended-properties">
<Application>Microsoft Excel</Application>
<AppVersion>16.0300</AppVersion>
<Company>Acme Corporation</Company>
<TitlesOfParts>
<vt:vector size="3" baseType="lpstr">
<vt:lpstr>Revenue</vt:lpstr>
<vt:lpstr>Expenses</vt:lpstr>
<vt:lpstr>Projections</vt:lpstr>
</vt:vector>
</TitlesOfParts>
<DocSecurity>0</DocSecurity>
</Properties>revision count in core.xml reveals how many times the file has been saved, indicating how much editing has occurred.The workbook file is the central orchestration point for the entire spreadsheet. It defines which sheets exist, their order, their visibility state, named ranges, and workbook-level settings. It does not contain any cell data — that lives in the individual worksheet files.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/.../relationships">
<sheets>
<sheet name="Revenue" sheetId="1" r:id="rId1"/>
<sheet name="Expenses" sheetId="2" r:id="rId2"/>
<sheet name="Internal Notes" sheetId="3" state="hidden" r:id="rId3"/>
<sheet name="Raw Data" sheetId="4" state="veryHidden" r:id="rId4"/>
</sheets>
<definedNames>
<definedName name="TaxRate">Revenue!$B$2</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="0">Revenue!$A$1:$G$50</definedName>
</definedNames>
</workbook>Notice the state attribute on sheets. There are three visibility levels:
Very hidden sheets are a common place to store lookup tables, configuration data, or sensitive information that the workbook author did not intend to share. Inspecting workbook.xml is the most reliable way to find them.
Each worksheet is stored as a separate XML file in the xl/worksheets/ folder. This is where cell values, formulas, and cell-level formatting references are stored. Understanding the worksheet XML structure is essential for anyone working with XLSX files programmatically.
Cell data is organized into rows, and each row contains cells. Each cell has a reference (like “A1”), a type indicator, a style index, and a value.
<worksheet xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main">
<sheetData>
<row r="1" spans="1:4">
<!-- String cell referencing shared string index 0 -->
<c r="A1" t="s" s="1">
<v>0</v>
</c>
<!-- Numeric cell -->
<c r="B1" s="2">
<v>42500.75</v>
</c>
<!-- Formula cell -->
<c r="C1" s="3">
<f>B1*1.1</f>
<v>46750.825</v>
</c>
<!-- Boolean cell -->
<c r="D1" t="b">
<v>1</v>
</c>
</row>
</sheetData>
</worksheet>Type (t attribute) | Meaning | Value interpretation |
|---|---|---|
s | Shared string | Index into sharedStrings.xml |
n or omitted | Number | Literal numeric value |
b | Boolean | 0 = FALSE, 1 = TRUE |
e | Error | Error string like #REF! |
str | Inline string | Literal string (not shared) |
inlineStr | Rich inline string | Rich text stored in <is> element |
One of the most confusing aspects of XLSX for newcomers is that dates are not stored as dates. They are stored as plain numbers — specifically, the count of days since January 1, 1900 (or January 1, 1904 on some Mac-originated files). The value 44927 represents January 1, 2023. Whether a cell displays as a date or a number depends entirely on the style applied to it through the s attribute, which indexes into styles.xml.
This means that if you read the raw XML, you cannot tell whether 44927 is a date or just the number 44,927 without also checking the style definition. This is a frequent source of bugs in custom XLSX parsers.
Formula cells contain both a <f> element with the formula text and a <v> element with the last calculated value. This dual storage means you can read formula results without recalculating, but the cached values may be stale if the file was saved without recalculation.
<!-- Simple formula --> <c r="C1"> <f>SUM(A1:B1)</f> <v>150</v> </c> <!-- Shared formula (applied across a range) --> <c r="C2"> <f t="shared" ref="C2:C100" si="0">SUM(A2:B2)</f> <v>200</v> </c> <c r="C3"> <f t="shared" si="0"/> <v>175</v> </c> <!-- Array formula (Ctrl+Shift+Enter) --> <c r="E1"> <f t="array" ref="E1:E10">TRANSPOSE(A1:J1)</f> <v>42</v> </c>
When the same formula pattern is applied to a range of cells (like dragging a formula down a column), Excel stores the full formula only once and marks subsequent cells as shared with t="shared" and a shared index (si). This significantly reduces file size for spreadsheets with repetitive formulas. If you are building an XLSX writer, implementing shared formulas correctly is one of the key optimizations.
Instead of storing the same text value in every cell that uses it, XLSX maintains a central string table in xl/sharedStrings.xml. Each unique string is stored once, and cells reference it by index. This is a significant space optimization — a column with 10,000 rows all containing “Approved” stores the string once and references index 0 ten thousand times.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main"
count="15000" uniqueCount="342">
<si><t>Product Name</t></si> <!-- index 0 -->
<si><t>Revenue</t></si> <!-- index 1 -->
<si><t>Quarter</t></si> <!-- index 2 -->
<si> <!-- index 3: rich text -->
<r>
<t>Total: </t>
</r>
<r>
<rPr><b/><color rgb="FFFF0000"/></rPr>
<t>$1,250,000</t>
</r>
</si>
</sst>The count attribute shows the total number of string references across all cells, while uniqueCount shows how many distinct strings exist. Rich text (text with mixed formatting within a single cell) uses <r> elements with run properties (<rPr>) to apply different styles to different parts of the string.
The shared string table sometimes contains strings that are no longer referenced by any cell. These “orphaned” strings can be remnants of deleted data. If a user deletes a column containing sensitive information, the string values may persist in sharedStrings.xml even though no cell points to them. This is one reason why simply deleting visible data does not guarantee that the information has been removed from the file.
The styles file defines all cell formatting used in the workbook. Rather than embedding style information in each cell, XLSX uses a layered indexing system. Each cell has a style index (s attribute) that points into a cross-reference table (cellXfs), which in turn references font, fill, border, and number format definitions.
<styleSheet xmlns="http://schemas.openxmlformats.org/.../spreadsheetml/2006/main">
<numFmts count="2">
<numFmt numFmtId="164" formatCode=""$"#,##0.00"/>
<numFmt numFmtId="165" formatCode="yyyy-mm-dd"/>
</numFmts>
<fonts count="3">
<font><sz val="11"/><name val="Calibri"/></font>
<font><b/><sz val="14"/><name val="Calibri"/></font>
<font><sz val="11"/><color rgb="FFFF0000"/><name val="Calibri"/></font>
</fonts>
<fills count="2">
<fill><patternFill patternType="none"/></fill>
<fill><patternFill patternType="solid">
<fgColor rgb="FFFFFF00"/>
</patternFill></fill>
</fills>
<cellXfs count="3">
<!-- Style index 0: default -->
<xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
<!-- Style index 1: bold header -->
<xf numFmtId="0" fontId="1" fillId="0" borderId="0"/>
<!-- Style index 2: currency format -->
<xf numFmtId="164" fontId="0" fillId="0" borderId="0"
applyNumberFormat="1"/>
</cellXfs>
</styleSheet>When a cell has s="2", it uses the third entry in cellXfs, which applies number format 164 (the custom currency format "$"#,##0.00). This indirection system means the styles file can be complex, but it keeps individual worksheet files small and avoids repeating formatting information.
Number format IDs below 164 are built-in and not explicitly listed in the XML. Some important ones:
0 — General1 — 0 (integer)2 — 0.00 (two decimals)9 — 0% (percentage)14 — mm-dd-yy (date)20 — h:mm (time)22 — m/d/yy h:mm (date+time)164+ — Custom formats defined in <numFmts>This file records the order in which Excel should recalculate formulas. It lists cell references and their sheet IDs in dependency order. While not essential for reading data, it affects how recalculation engines process the workbook. If you delete or modify formulas programmatically, you should also update or remove the calculation chain to avoid recalculation errors.
The theme file defines the color palette, font scheme, and formatting effects used throughout the workbook. When a cell color is defined as “theme color 4 with 40% tint,” the actual RGB value is computed from this theme definition. Changing the theme changes the appearance of every element that references theme colors — which is how Excel’s built-in color schemes work.
Cell comments are stored in separate XML files (comments1.xml, etc.) linked to worksheets through relationships. Newer versions of Excel also use threadedComments for the modern threaded comment system, which includes author identifiers, timestamps, and reply chains. These files are a significant source of hidden metadata that reveals internal discussions.
When you create a structured table in Excel (Insert → Table), the definition is stored in a separate XML file in the xl/tables/ folder. Table definitions include the table name, column names, the cell range the table covers, and any auto-filter settings. Tables can reveal the intended structure and purpose of data even when the visible formatting has been stripped.
Understanding XLSX internals has direct practical value for several common tasks:
Read core.xml and app.xml directly to identify author names, company information, and timestamps without relying on Excel’s Document Inspector, which can miss certain metadata.
Check workbook.xml for hidden and veryHidden sheets, scan sharedStrings.xml for orphaned strings, and look for comments files to find data that is invisible in the normal spreadsheet view.
Build XLSX files from scratch by assembling the required XML parts and packaging them into a ZIP archive. This approach gives you full control over the output and is how libraries like openpyxl and SheetJS work internally.
Remove specific metadata files (like core.xml or comment files) from the ZIP without affecting the spreadsheet data, or rewrite properties to contain only approved values before sharing externally.
import zipfile
from lxml import etree
# Open the XLSX as a ZIP archive
with zipfile.ZipFile('report.xlsx', 'r') as zf:
# List all files inside
for name in zf.namelist():
print(name)
# Read and parse core properties
with zf.open('docProps/core.xml') as f:
tree = etree.parse(f)
root = tree.getroot()
ns = {
'dc': 'http://purl.org/dc/elements/1.1/',
'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties'
}
creator = root.find('.//dc:creator', ns)
print(f"Author: {creator.text if creator is not None else 'Unknown'}")
# Read shared strings
with zf.open('xl/sharedStrings.xml') as f:
tree = etree.parse(f)
ns = {'s': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
strings = [
si.find('.//s:t', ns).text
for si in tree.findall('.//s:si', ns)
if si.find('.//s:t', ns) is not None
]
print(f"Found {len(strings)} unique strings")
for s in strings[:10]:
print(f" - {s}")Every XLSX XML file uses namespaces extensively. If your XML parser does not handle namespaces correctly, you will fail to find any elements. Always use namespace-aware parsing and register the correct namespace prefixes for the OOXML schemas.
Adding, removing, or renaming parts requires updating both [Content_Types].xml and the relevant .rels files. Missing this step results in files that open with errors or fail to open entirely.
Never assume a numeric cell value is “just a number.” Check its style index against styles.xml to determine if a date number format is applied. Many XLSX parsing bugs stem from treating date serial numbers as regular integers.
String cells store an index, not the string itself. If you modify sharedStrings.xml (adding or removing entries), all index references in all worksheets must be updated. Failing to do this corrupts every string reference.
When repackaging modified XML files back into a ZIP, use the ZIP_DEFLATED compression method. Some tools default to ZIP_STORED (no compression), which produces valid but unnecessarily large files. Also ensure that directory entries in the ZIP do not use backslashes — use forward slashes only.
Understanding the XML structure is powerful, but you do not need to manually parse ZIP archives every time. MetaData Analyzer reads the internal structure of your XLSX files and presents all metadata, hidden content, author information, and structural details in a clear, actionable format — without sending your files to any server.
.rels files define how XML documents reference each other, and breaking these links corrupts the file.core.xml and app.xml contain author names, company names, and timestamps that travel with every copy of the file.styles.xml to distinguish dates from regular numbers, which is a frequent source of parsing bugs.veryHidden state is invisible through the normal Excel UI and can only be detected by reading workbook.xml directly.Comprehensive guide to how Excel stores metadata across all XML parts.
Build automated pipelines to strip metadata from XLSX files before sharing.
Step-by-step guide to finding hidden sheets, comments, and metadata.