Detecting Fraudulent Excel Documents Through Metadata Analysis

Fraudulent Excel documents are used in insurance claims, financial reporting, contract disputes, regulatory filings, and legal proceedings every day. Whether a spreadsheet has been fabricated from scratch, backdated to appear older than it is, or selectively altered to change key figures, the metadata layer almost always preserves evidence of the deception. This guide teaches you how to find it.

By Forensics Team | February 20, 2026 | 23 min read

Why Metadata Exposes Document Fraud

Every Excel file carries an invisible layer of information that its creator rarely thinks about. Document properties, timestamps, author records, editing history, application version strings, and internal XML structures all record the true story of how a file was created and modified—regardless of what the visible cell data claims.

Fraudsters focus on the content: the numbers, the dates displayed in cells, the formulas that produce the desired totals. They almost never clean every metadata artifact. The result is a gap between what the document claims to be and what its metadata reveals it actually is. That gap is where forensic investigators operate.

Common Types of Excel Document Fraud

  • Backdating: Creating a document today but claiming it was created weeks, months, or years ago
  • Fabrication: Manufacturing an entire spreadsheet to support a false claim, invoice, or report
  • Selective alteration: Changing specific values in an otherwise legitimate file to misrepresent figures
  • Author spoofing: Making a document appear to have been created by a different person or organization
  • Version substitution: Replacing an earlier version of a file with a modified one while preserving the original filename
  • Template fraud: Using a legitimate template to create a fabricated document that inherits real metadata
  • Inflation of editing history: Making a recently created file appear to have a long editing history

Timestamp Analysis: The First Line of Detection

Timestamps are the most frequently exploited—and most frequently revealing—metadata in fraudulent documents. Excel files carry multiple independent timestamps that are difficult to manipulate consistently. When these timestamps contradict each other or contradict the claimed history of the document, fraud is likely.

The Four Timestamp Layers

Every Excel file has at least four independent sources of timestamp data. A legitimate document shows consistency across all four. A fraudulent document almost always has discrepancies.

1. File System Timestamps

  • Created: When the file first appeared on disk
  • Modified: Last write to the file
  • Accessed: Last read of the file
  • Easily manipulated with system tools
  • Reset when files are copied or downloaded

2. OPC Core Properties

  • dcterms:created — Document creation time
  • dcterms:modified — Last modification time
  • Stored inside docProps/core.xml
  • Editable by modifying the XML directly
  • Often overlooked by fraudsters using GUI tools

3. Extended Properties

  • TotalTime — Total editing time in minutes
  • Application — Excel version used
  • AppVersion — Specific build number
  • Stored in docProps/app.xml
  • Rarely manipulated because most people do not know they exist

4. Internal XML Artifacts

  • Printer settings with driver version dates
  • Theme and style XML creation timestamps
  • Calculation chain rebuild markers
  • Shared string table structure patterns
  • Almost never manipulated, because most fraudsters do not know they exist

# Extract timestamps from an XLSX file
mkdir xlsx_extract && unzip suspect.xlsx -d xlsx_extract/

# View core properties (created/modified dates, author)
cat xlsx_extract/docProps/core.xml

# View extended properties (editing time, app version)
cat xlsx_extract/docProps/app.xml

# Compare with file system timestamps
stat suspect.xlsx

Timestamp Red Flags

These timestamp anomalies are strong indicators of document fraud. Any single red flag warrants deeper investigation; multiple red flags together constitute compelling evidence.

Creation Date After Modification Date

If dcterms:created is later than dcterms:modified, at least one of the two timestamps has been edited by hand. A typical cause is a modification date that was backdated without a matching change to the creation date. Excel does not normally produce this ordering on its own.
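As a rough illustration of this check, the sketch below reads both timestamps from an extracted core.xml and reports their order. The function names core_field and core_date_order are hypothetical. The comparison assumes both values use the same UTC form (for example 2024-03-01T17:30:00Z), in which case plain string ordering matches chronological ordering.

```shell
# Print the text of one simple element from core.xml.
# $1 = element name (e.g. dcterms:created), $2 = path to core.xml
core_field() (
  tr -d '\n' < "$2" | grep -o "<$1[^>]*>[^<]*" | sed 's/^<[^>]*>//' | head -n 1
)

# Report how dcterms:created and dcterms:modified are ordered.
# Assumption: both timestamps share one UTC format, so string order = time order.
core_date_order() (
  created=$(core_field dcterms:created "$1")
  modified=$(core_field dcterms:modified "$1")
  if [ -z "$created" ] || [ -z "$modified" ]; then
    echo "missing"
  elif [ "$created" != "$modified" ] &&
       [ "$(printf '%s\n%s\n' "$created" "$modified" | LC_ALL=C sort | head -n 1)" = "$modified" ]; then
    echo "created_after_modified"
  else
    echo "consistent"
  fi
)
```

Calling core_date_order xlsx_extract/docProps/core.xml prints missing, created_after_modified, or consistent. Only the second result is a red flag.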

TotalTime Inconsistent with Date Span

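A document claiming to have been created six months ago but showing a TotalTime of 5 minutes deserves scrutiny: six months of legitimate use would normally accumulate hours of editing time. Treat the field with care, though. Excel often omits TotalTime or leaves it at 0, so a missing or zero value may simply mean the time was not recorded.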

Application Version Did Not Exist at Claimed Date

If the document claims to have been created and last saved in 2012 but the AppVersion field reads 16.0300 (Excel 2016 or later), the file cannot have been saved in its current form at the claimed date. The application version is written automatically on each save and is rarely manipulated.

Timezone and Working Hours Anomalies

A document purportedly created by a London office but with creation timestamps at 3:00 AM GMT, or a document from a 9-to-5 organization with editing at 2:00 AM on a weekend, suggests the document was not created in the normal course of business.
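A minimal sketch of the working-hours test follows. The function name outside_hours and its argument order are hypothetical. It assumes the timestamp is stored in UTC in the form written to core.xml and that the office's UTC offset is a whole number of hours. It checks only the hour of day and does not detect weekends.

```shell
# Succeeds (exit status 0) when a UTC timestamp falls outside local business hours.
# $1 = timestamp such as 2024-03-02T03:00:00Z
# $2 = local offset from UTC in whole hours (0 for GMT, -5 for US Eastern standard time)
# $3 = opening hour, $4 = closing hour (24-hour clock)
outside_hours() (
  hour=$(printf '%s' "$1" | cut -c 12-13)   # the two hour digits after the "T"
  hour=${hour#0}                            # drop a leading zero so 08 is not read as octal
  local_hour=$(( (hour + $2 + 24) % 24 ))   # shift into local time, wrapping past midnight
  [ "$local_hour" -lt "$3" ] || [ "$local_hour" -ge "$4" ]
)
```

For example, outside_hours "2024-03-02T03:00:00Z" 0 9 17 succeeds because 03:00 GMT is before a 9:00 opening, which matches the London case described above.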

File System Date Precedes Document Date

If the file system creation date is significantly earlier than the document's internal dcterms:created timestamp, the internal timestamp may have been modified. While file copying can cause the reverse, this specific pattern is suspicious.

Author and Origin Analysis

Fraudulent documents often need to appear to come from a specific person or organization. Metadata analysis can verify or disprove claimed authorship by examining multiple independent author indicators that are difficult to fake consistently.

Author Verification Points

Excel records authorship in multiple locations. A legitimate document has consistent authorship data. A fabricated document frequently has mismatches that reveal the true creator.

Metadata Field       | Location            | What It Reveals
dc:creator           | docProps/core.xml   | Original author's name or username
cp:lastModifiedBy    | docProps/core.xml   | Last person who edited the file
Manager              | docProps/app.xml    | Manager field (if populated)
Company              | docProps/app.xml    | Organization tied to the license
Printer paths        | xl/printerSettings/ | Network printer names revealing office location
Comments/annotations | xl/comments*.xml    | Comment author names from earlier edits
Custom properties    | docProps/custom.xml | Organization-specific metadata tags

# Extract all author-related metadata
unzip -o suspect.xlsx -d suspect_contents/

# Check core properties for author fields
grep -i "creator\|lastModifiedBy" suspect_contents/docProps/core.xml

# Check company and manager fields
grep -i "company\|manager" suspect_contents/docProps/app.xml

# Check comment authors
grep -i "author" suspect_contents/xl/comments*.xml 2>/dev/null

# Check printer settings for network paths
# (-e l tells GNU strings to read 16-bit little-endian text, which these binaries usually contain)
strings -e l suspect_contents/xl/printerSettings/*.bin 2>/dev/null

Author Discrepancy Patterns

These patterns frequently appear in documents where authorship has been misrepresented. Each pattern tells a different story about how the fraud was constructed.

Creator and LastModifiedBy Are the Same

In a document that has purportedly been reviewed and approved by multiple people over months, finding that dc:creator and cp:lastModifiedBy are the same person—especially the person submitting the document—suggests it was created in a single session by one person.

Company Field Mismatch

A document claiming to originate from "Acme Corp" but with a Company field showing "Smith Consulting" was created on a machine licensed to a different organization. The Company field is set by the Office installation and not typically modified by users.

Printer Settings from Wrong Location

Binary printer settings embedded in the file can contain network printer names like \\OFFICE-NYC-3F\HP-LaserJet. If the document was purportedly created at a different location, the printer metadata contradicts the claimed origin.

Comment Authors from the Wrong Team

If a file purportedly from the finance department contains comments authored by people known to work in a completely different department or external organization, the file's provenance is suspect. Comment author names are set by Office and not easily faked without XML editing.
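To compare these author indicators side by side, one option is to gather every name into a single de-duplicated list. The sketch below is only an illustration: list_author_names is a hypothetical helper. It assumes the unprefixed author elements used in xl/comments*.xml parts and reads only core.xml and those comment parts, not printer settings or custom properties.

```shell
# List every distinct author-like name in an extracted workbook.
# $1 = extraction directory (for example suspect_contents)
list_author_names() (
  dir=$1
  {
    # dc:creator and cp:lastModifiedBy from the core properties
    cat "$dir/docProps/core.xml" 2>/dev/null | tr -d '\n' |
      grep -oE '<(dc:creator|cp:lastModifiedBy)>[^<]*'
    # <author> entries from comment parts, when any exist
    cat "$dir"/xl/comments*.xml 2>/dev/null | tr -d '\n' |
      grep -oE '<author>[^<]*'
  } | sed 's/^<[^>]*>//' | sort -u
)
```

A single name in the output of list_author_names suspect_contents for a document that supposedly passed through several reviewers matches the first pattern above. An unexpected name points to the comment-author pattern.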

Structural Analysis of the XLSX Archive

Beyond timestamps and author fields, the internal structure of an XLSX file contains subtle artifacts that are extremely difficult to fake. These structural indicators can reveal whether a file was genuinely created incrementally over time or manufactured in a single session to appear authentic.

Shared String Table Analysis

Excel stores all text cell values in a shared string table (xl/sharedStrings.xml). The order of strings in this table reflects the order in which text was entered into the workbook. This ordering is a powerful forensic artifact.

Legitimate File Pattern

In a file built incrementally over time, the shared string table reflects the natural order of data entry. Headers appear early, data added later appears later in the table. Strings from different editing sessions are interspersed naturally.

Fabricated File Pattern

A file created in one session by pasting data has a string table that mirrors the cell layout exactly—left to right, top to bottom. There is no interspersion. The string order is too perfect, suggesting a single-pass creation.

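# Extract and examine the first 30 shared strings
# (-E instead of -P so the command also works without GNU grep)
grep -oE '<t[^>]*>[^<]+</t>' xlsx_extract/xl/sharedStrings.xml | head -30

# Count total unique strings
# (grep -c counts lines, and the XML is usually written on very few lines, so count matches instead)
grep -o '<si>' xlsx_extract/xl/sharedStrings.xml | wc -l

# Look for the uniqueCount vs count attributes on the root element
# count = total references, uniqueCount = unique strings
# A count/uniqueCount ratio close to 1 means almost no string is reused
grep -o '<sst[^>]*>' xlsx_extract/xl/sharedStrings.xml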
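The count and uniqueCount attributes can be combined into a single summary line. In this sketch, sst_summary is a hypothetical helper that prints the two declared attributes followed by the number of si items actually present. Treating a mismatch between the declared uniqueCount and the real item count as a sign of editing outside Excel is an assumption to be weighed with other evidence, not an established rule.

```shell
# Print "count uniqueCount items" for a sharedStrings.xml file:
# the two attributes declared on <sst>, then the <si> items actually present.
sst_summary() (
  root=$(tr -d '\n' < "$1" | grep -o '<sst[^>]*>' | head -n 1)
  count=$(printf '%s' "$root" | grep -o ' count="[0-9]*"' | tr -cd '0-9')
  unique=$(printf '%s' "$root" | grep -o ' uniqueCount="[0-9]*"' | tr -cd '0-9')
  items=$(tr -d '\n' < "$1" | grep -o '<si>' | wc -l | tr -d ' ')
  echo "$count $unique $items"
)
```

The reuse ratio can then be derived with, for example, sst_summary xlsx_extract/xl/sharedStrings.xml | awk '{ printf "%.2f\n", $1 / $2 }'.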

Style and Formatting Artifacts

The styles file (xl/styles.xml) contains every cell format, number format, font, fill, and border used in the workbook. The structure of this file reveals how the document was built.

Format Proliferation

A legitimate document that evolved over time accumulates styles incrementally. It typically has unused styles from earlier iterations, duplicate-but-slightly-different formats from different editors, and a large cellXfs (cell format) count relative to the actual variety of visible formatting. A fabricated document has minimal, clean styles with no orphaned formats.
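One rough way to quantify this is to compare the number of cell formats declared in styles.xml with the number of distinct style indices that cells actually reference. The helper name style_usage is hypothetical. The sketch assumes cellXfs carries its count attribute first, as Excel writes it. It ignores row and column styles, and cells without an s attribute (which use style 0) are not counted.

```shell
# Compare declared cell formats with the style indices cells actually use.
# $1 = extraction directory (for example xlsx_extract)
style_usage() (
  dir=$1
  # count attribute of <cellXfs> in the styles part
  declared=$(cat "$dir/xl/styles.xml" 2>/dev/null | tr -d '\n' |
    grep -o '<cellXfs count="[0-9]*"' | tr -cd '0-9')
  # distinct s="N" values on <c> elements across all worksheets
  used=$(cat "$dir"/xl/worksheets/sheet*.xml 2>/dev/null | tr -d '\n' |
    grep -oE '<c [^>]*>' | grep -oE ' s="[0-9]+"' | sort -u | wc -l | tr -d ' ')
  echo "declared=$declared used=$used"
)
```

A declared value far above used suggests accumulated, orphaned formats. Two nearly equal small numbers fit the clean pattern described above.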

Number Format Clues

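Custom number formats in numFmts can reveal locale information. Format codes are stored in a locale-neutral form (period for decimals, comma for thousands), so the separators themselves say little. Locale tags embedded in the codes are more telling: a format containing [$-407] (German) or a currency token such as [$€-407] in a document claiming to be from a US office suggests it was created on a machine with European locale settings.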

Font Availability

The fonts referenced in styles.xml reflect the software available to the creator. Default fonts vary more by Office version than by operating system: Calibri has been the Excel default since 2007, including on recent Mac releases, and current Microsoft 365 builds default to Aptos. A workbook whose default font was introduced after its claimed creation date is inconsistent. Similarly, fonts that ship only with certain Office versions or platforms can contradict the claimed origin.

Calculation Chain Analysis

The calculation chain (xl/calcChain.xml) records the order in which Excel recalculates formulas. This ordering is determined by cell dependencies and is rebuilt when the workbook structure changes. It provides a subtle but reliable indicator of document authenticity.

What to Look For

  • Missing calcChain.xml: If a document with formulas has no calculation chain file, it may have been assembled by editing the XML directly or generated by a spreadsheet library rather than saved from Excel
  • Chain inconsistency: Formulas referencing cells that do not exist in the worksheet indicate structural manipulation
  • Simple chain for complex formulas: A workbook with sophisticated financial formulas but an unusually simple or short calculation chain may have had formulas pasted as values and then selectively converted back

# Check if calculation chain exists
ls -la xlsx_extract/xl/calcChain.xml 2>/dev/null

# Count formula references in calc chain
# (count matches, not lines: the XML is usually written on a single line)
grep -o '<c ' xlsx_extract/xl/calcChain.xml 2>/dev/null | wc -l

# Compare with actual formulas in sheets
grep -o '<f>\|<f ' xlsx_extract/xl/worksheets/sheet1.xml | wc -l
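The same comparison can be run across every worksheet at once. In the sketch below, count_matches and calc_chain_gap are hypothetical helper names. The counts are approximate because shared and array formulas are not always represented one-to-one in the chain, so a small gap is not meaningful by itself.

```shell
# Count regex matches (not lines) across one or more files.
# $1 = extended regular expression, remaining arguments = files
count_matches() (
  pattern=$1
  shift
  cat "$@" 2>/dev/null | grep -oE "$pattern" | wc -l | tr -d ' '
)

# Compare formulas found in all worksheets with entries in calcChain.xml.
# $1 = extraction directory (for example xlsx_extract)
calc_chain_gap() (
  formulas=$(count_matches '<f[ >/]' "$1"/xl/worksheets/sheet*.xml)
  chained=$(count_matches '<c ' "$1/xl/calcChain.xml")
  echo "formulas=$formulas chained=$chained"
)
```

A result such as formulas=40 chained=0 corresponds to the missing-chain case above. Two roughly equal numbers are what a file saved by Excel would normally show.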

Application Version Forensics

The Excel version that created a document is recorded in the file and is one of the most reliable fraud indicators. Fraudsters can change the dates in a document, but they rarely think to change the application version—and even when they do, the internal XML schemas and feature usage betray the true version.

Version Timeline Verification

Cross-referencing the AppVersion value with the claimed dates can quickly expose backdating. AppVersion records the application that last saved the file, so if that Excel version did not exist when the document was supposedly last saved, the claimed history cannot be accurate.

AppVersion Value | Excel Version | Release Date
12.0000          | Excel 2007    | January 2007
14.0300          | Excel 2010    | June 2010
15.0300          | Excel 2013    | January 2013
16.0300          | Excel 2016    | September 2015
16.0300          | Excel 2019    | September 2018
16.0###          | Microsoft 365 | Varies by update channel

Important: Excel 2016, 2019, 2021, and Microsoft 365 all use major version 16.0. The digits after 16.0 are often identical across these releases (the table shows 16.0300 for both 2016 and 2019), so AppVersion alone frequently cannot tell them apart. Where the minor digits do differ they can help, but feature usage and XML namespaces are usually the more reliable way to separate 16.x releases.
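The table lookup can be sketched as a small script. The helper names app_version, earliest_release_year, and claim_predates_release are hypothetical, and the year mapping simply restates the release dates in the table above. The check works at year granularity only, and for major version 16 it can establish no more than "2015 or later".

```shell
# Print the AppVersion value from an extracted app.xml.
app_version() (
  tr -d '\n' < "$1" | grep -o '<AppVersion>[^<]*' | sed 's/^<[^>]*>//' | head -n 1
)

# Map the major version (digits before the dot) to its earliest release year.
earliest_release_year() (
  case "${1%%.*}" in
    12) echo 2007 ;;
    14) echo 2010 ;;
    15) echo 2013 ;;
    16) echo 2015 ;;
    *)  echo unknown ;;
  esac
)

# Succeeds when the claimed year is earlier than the version's release year.
# $1 = path to app.xml, $2 = claimed year of last save (YYYY)
claim_predates_release() (
  year=$(earliest_release_year "$(app_version "$1")")
  [ "$year" != "unknown" ] && [ "$2" -lt "$year" ]
)
```

For instance, claim_predates_release xlsx_extract/docProps/app.xml 2012 succeeds for a file whose AppVersion is 15.0300, flagging the claim for closer review.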

# Extract application version
grep -i "AppVersion\|Application" xlsx_extract/docProps/app.xml

# Example output showing Excel version:
# <Application>Microsoft Excel</Application>
# <AppVersion>16.0300</AppVersion>

# Check XML namespace declarations for additional clues
# Newer Excel versions declare additional extension namespaces
head -5 xlsx_extract/xl/workbook.xml

Editing History and Revision Analysis

The editing history of a document tells a story about how it was used. A legitimate business document that has been actively used for months shows evidence of repeated editing. A fabricated document that was created in one session lacks this depth of history—and the absence itself is evidence.

Editing Depth Indicators

Signs of Genuine Editing History

  • TotalTime consistent with the file's age
  • Multiple distinct lastModifiedBy users across saved versions or backups
  • Orphaned styles from earlier formatting that was changed
  • Named ranges referencing deleted worksheets
  • Hidden or very hidden sheets with draft data
  • Print area definitions reflecting iterative adjustments

Signs of Fabrication

  • TotalTime of zero or near-zero minutes
  • dc:creator and cp:lastModifiedBy identical
  • Clean styles with no orphaned formats
  • No named ranges, no hidden sheets
  • Minimal internal structure for claimed complexity
  • All data appears to have been entered in one pass
  • No printer settings (file was never printed)

# Check total editing time
grep -i "TotalTime" xlsx_extract/docProps/app.xml

# Check for hidden sheets
grep -i 'state="hidden\|state="veryHidden' \
  xlsx_extract/xl/workbook.xml

# Check for named ranges
grep -i '<definedName' xlsx_extract/xl/workbook.xml

# Count worksheets (count matches, not lines)
grep -o '<sheet ' xlsx_extract/xl/workbook.xml | wc -l

# Check for printer settings
ls xlsx_extract/xl/printerSettings/ 2>/dev/null

The "Too Clean" Document Problem

Paradoxically, one of the strongest indicators of a fabricated document is that it is too clean. Real business documents are messy. They accumulate artifacts from multiple users, multiple editing sessions, format changes, and iterative development. A document that lacks all of these artifacts despite claiming a long history is suspicious.

The Cleanliness Checklist

  • No orphaned shared strings: Real documents accumulate strings from deleted cells; a perfectly clean string table suggests single-pass creation
  • No unused styles: Real documents have style artifacts from earlier formatting; clean styles suggest fabrication
  • No defined names pointing to deleted ranges: Documents that evolve usually have stale references
  • Perfectly sequential cell references: Real data entry is rarely perfectly sequential across all rows and columns
  • No conditional formatting remnants: Business documents frequently have conditional formatting that has been partially removed or modified
  • Zero revision number: A file claiming months of use should have a revision count greater than 1

Detecting Backdated Documents

Backdating—creating a document now but claiming it existed at an earlier date—is one of the most common forms of document fraud. It appears in insurance claims, regulatory filings, financial audits, and contract disputes. Metadata analysis provides multiple independent methods to detect it.

Backdating Detection Methods

1. Application Version vs. Claimed Date

The most definitive test. If the AppVersion value identifies an Excel version that was released after the claimed creation date, the document cannot have been created when it claims. This artifact is rarely manipulated because most fraudsters are unaware it exists.

# A document claiming creation in March 2012
# but showing AppVersion 15.0300 (Excel 2013)
# could not have been created at the claimed date
grep "AppVersion" xlsx_extract/docProps/app.xml

2. XML Schema Namespace Analysis

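The core SpreadsheetML namespace has stayed the same across versions, but newer Excel releases declare additional extension namespaces (with prefixes such as x14, x15, or xr) on the root elements of parts like workbook.xml. A document that declares namespace URIs introduced in Excel 2019 cannot have been saved in 2016. Namespace declarations are deeply embedded in the XML structure and are almost never manipulated by fraudsters.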

# Check namespace declarations
# (quote the second path: unquoted square brackets are treated as a glob pattern)
head -3 xlsx_extract/xl/workbook.xml
head -3 'xlsx_extract/[Content_Types].xml'

# Look for newer content types or relationships
cat 'xlsx_extract/[Content_Types].xml'

3. Feature Usage Analysis

Functions such as XLOOKUP, FILTER, and UNIQUE, new chart types (funnel, map, treemap), and Power Query connections were each introduced in specific Excel versions. A document claiming to be from 2015 that uses XLOOKUP (introduced in 2020) cannot have reached its current form in 2015: either it was backdated or it was edited later than claimed.
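Functions added after the original file format was defined are written into worksheet XML with an _xlfn. prefix (for example _xlfn.XLOOKUP), which makes them easy to enumerate. The sketch below lists them. The helper name list_future_functions is hypothetical, and matching each function against its release date is left to the investigator.

```shell
# List the distinct _xlfn.-prefixed function names used in any worksheet.
# $1 = extraction directory (for example xlsx_extract)
list_future_functions() (
  cat "$1"/xl/worksheets/sheet*.xml 2>/dev/null | tr -d '\n' |
    grep -oE '_xlfn\.(_xlws\.)?[A-Z0-9.]+' |          # e.g. _xlfn.XLOOKUP, _xlfn._xlws.FILTER
    sed -e 's/^_xlfn\.//' -e 's/^_xlws\.//' | sort -u  # keep only the bare function name
)
```

An empty result does not prove the file is old. A name such as XLOOKUP in a workbook dated 2015 is the kind of contradiction described above.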

4. Theme and Template Dating

Excel themes change between versions. The default color palette, font selections, and theme names in xl/theme/theme1.xml can reveal which version of Excel created the document, independent of the claimed date.

# Check the theme file for version-specific defaults
# (-o prints only the theme/scheme names and Latin typefaces, not whole lines of XML)
grep -oE 'name="[^"]*"|<a:latin typeface="[^"]*"' \
  xlsx_extract/xl/theme/theme1.xml | head -10

5. ZIP Archive Metadata

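XLSX files are ZIP archives, and each file entry in the ZIP has its own modification timestamp, stored independently of the document properties. Desktop Excel typically writes a fixed placeholder date (1980-01-01) on every entry, so real dates in this field usually mean the archive was rebuilt by another tool, for example a ZIP utility used after hand-editing the XML, or a library that generated the file. Most fraudsters who edit core.xml forget that repackaging the archive leaves these entry timestamps behind.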

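# List ZIP entries with their timestamps
unzip -l suspect.xlsx

# Compare ZIP entry dates with claimed creation date
# Entries carrying real dates should not pre-date the claimed creation date
# (a uniform 1980-01-01 placeholder is common in files saved by desktop Excel)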

Detecting Selective Alterations

Sometimes the fraud is not in creating a new document but in modifying an existing one—changing a few key values while leaving the rest intact to maintain an appearance of authenticity. Detecting selective alterations requires comparing what the metadata says about the editing pattern with the document's content.

Alteration Indicators in Cell Data

Formula vs. Hardcoded Value Inconsistency

In a legitimate financial spreadsheet, totals are calculated by formulas. If specific total cells contain hardcoded values while surrounding cells use formulas, the totals may have been manually overwritten. Check whether sum cells actually contain =SUM() formulas or just static numbers.

Formatting Inconsistencies in Modified Cells

When individual cells are edited, they sometimes acquire different formatting from surrounding cells—a different number format, precision, or style index. These formatting anomalies can pinpoint exactly which cells were modified after the original document was created.

Shared String Table Position

If a cell value was changed, the new text is appended to the end of the shared string table while the old value may still exist as an orphan. Finding the replacement value at the end of the string table—far from where similar data appears—suggests a later modification.

# Check for cells with hardcoded values where formulas are expected
# In sheet XML, <v> without <f> means hardcoded
# Look for patterns where most cells in a column have formulas
# but specific cells have only values

# Extract all cell entries from a worksheet (-P requires GNU grep)
grep -oP '<c r="[^"]+"[^>]*>.*?</c>' \
  xlsx_extract/xl/worksheets/sheet1.xml | head -20

# Find cells with values but no formulas in a formula column
# Compare the style index (s attribute) of modified vs neighboring cells
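The column check described in those comments can be approximated with standard tools. In this sketch, classify_column is a hypothetical helper that labels each cell in one column as formula or value. It assumes the r attribute comes first on each c element, as Excel writes it, and it skips self-closing cells.

```shell
# Label each cell in one column as "formula" or "value".
# $1 = worksheet XML file, $2 = column letters (for example D)
classify_column() (
  # Put every tag on its own line (minus the leading "<"), then track cell state in awk.
  tr -d '\n' < "$1" | tr '<' '\n' | awk -v col="$2" '
    /^c r="/ {                       # opening <c> tag: remember the cell reference
      ref = $0
      sub(/^c r="/, "", ref)
      sub(/".*/, "", ref)
      letters = ref
      gsub(/[0-9]/, "", letters)     # D12 -> D
      has_f = 0
      in_cell = ($0 !~ /\/>$/)       # self-closing cells have no content
      next
    }
    in_cell && /^f[ >]/ { has_f = 1 }  # an <f> child means the cell holds a formula
    in_cell && /^\/c>/ {               # closing </c>: report the cell
      if (letters == col) print ref, (has_f ? "formula" : "value")
      in_cell = 0
    }'
)
```

Running classify_column xlsx_extract/xl/worksheets/sheet1.xml D | grep ' value$' then lists only the cells in column D that hold static numbers, which can be compared with their neighbors' style indices.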

Cross-Referencing with External Records

The strongest fraud detection combines internal metadata analysis with external corroboration. When a document's metadata contradicts external records, the case becomes significantly stronger.

External Corroboration Sources

  • Email attachment logs showing when file was sent
  • Cloud storage version history (OneDrive, SharePoint)
  • Backup system snapshots with file hashes
  • Print server logs confirming when documents were printed
  • DLP (Data Loss Prevention) logs tracking file movements
  • File server audit logs showing access patterns

Hash Comparison

If a copy of the original document exists in backups, email attachments, or cloud storage, comparing file hashes immediately reveals whether the current version has been modified. Even a single changed byte produces a completely different hash, making this the most definitive test for alteration.
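A hash comparison takes only a few lines. The sketch below uses sha256sum where it is available and falls back to shasum -a 256 otherwise. The helper names sha256_of and same_content are hypothetical.

```shell
# Print the SHA-256 digest of a file as a hex string.
sha256_of() (
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1   # fallback on systems without sha256sum
  fi
)

# Succeeds when two files have identical content.
# $1 = file under examination, $2 = reference copy (backup, attachment, ...)
same_content() (
  [ "$(sha256_of "$1")" = "$(sha256_of "$2")" ]
)
```

For example, same_content suspect.xlsx backup_copy.xlsx || echo "files differ" reports a mismatch, where backup_copy.xlsx stands for whatever reference copy is available. Recording both digests in the case notes lets others verify the comparison later.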

Fraud Detection Investigation Checklist

Step-by-Step Fraud Detection Process

1. Preserve and hash the original file

Create a forensic copy immediately. Hash both the original and your working copy (MD5 and SHA-256). Document the chain of custody from the moment you receive the file.

2. Extract and examine document properties

Unzip the XLSX and review docProps/core.xml and docProps/app.xml. Record the creator, last modified by, creation date, modification date, total editing time, and application version.

3. Verify timestamp consistency

Compare the four timestamp layers (file system, OPC core properties, extended properties, and internal XML artifacts) along with the ZIP entry timestamps. Flag any discrepancies between creation date, modification date, and total editing time.

4. Validate application version against claimed date

Check that the AppVersion value corresponds to an Excel version that existed at the claimed creation date. Examine XML namespaces and feature usage for additional version indicators.

5. Analyze author and origin metadata

Verify that the creator, company, printer paths, and comment authors are consistent with the claimed origin. Cross-reference with known organizational data.

6. Examine structural artifacts

Review the shared string table ordering, style complexity, calculation chain, named ranges, and hidden sheets. Assess whether the structural complexity matches the claimed document history.

7. Check for selective alterations

Look for hardcoded values where formulas are expected, formatting inconsistencies in individual cells, and string table positions that suggest late modifications.

8. Cross-reference with external records

Compare the file against backups, email attachments, cloud storage versions, print logs, and any other external records that can corroborate or contradict the document's claimed history.

9. Document findings with evidence preservation

Compile all findings into a structured report. Include exact metadata values, screenshots, hash values, and the specific inconsistencies found. Ensure all evidence is preserved in a manner suitable for legal proceedings.

Real-World Fraud Scenarios and Detection

Understanding how these techniques work in practice helps investigators know what to look for. These scenarios illustrate common fraud patterns and the metadata artifacts that expose them.

Scenario 1: Backdated Insurance Claim

A claimant submits an inventory spreadsheet supposedly created before a loss event, listing valuable items for an insurance claim. The spreadsheet's creation date in core.xml shows a date three months before the loss.

Detection Evidence

  • AppVersion shows Excel version released two months after the claimed creation date
  • TotalTime is 8 minutes for a 200-row inventory that would take hours to compile
  • ZIP entry timestamps all show the same date: two weeks after the loss event
  • The shared string table order matches perfect top-to-bottom entry, suggesting single-session creation
  • No printer settings exist despite the claim that a printed copy was kept on file

Conclusion: Five independent metadata indicators contradict the claimed creation date. The document was fabricated after the loss event and backdated.

Scenario 2: Altered Financial Report

A quarterly financial report submitted to auditors shows favorable revenue figures. A whistleblower alleges the numbers were changed after the reporting period closed.

Detection Evidence

  • Revenue total cells contain hardcoded values; all other financial cells use formulas
  • The hardcoded cells have a different style index (s attribute) from adjacent formula cells
  • cp:lastModifiedBy shows the CFO's username; earlier backup copies show a different preparer
  • Modification date is five days after the quarterly close, matching the whistleblower's timeline
  • The backup copy from the file server (pre-modification) hashes differently and shows lower revenue figures

Conclusion: Selective cell modifications with formatting inconsistencies, confirmed by comparison with the pre-modification backup, establish that revenue figures were manually overwritten after the reporting period.

Scenario 3: Forged Vendor Quote

An employee submits a vendor quote spreadsheet to justify a procurement decision, claiming it was received from the vendor. Investigation reveals the employee has a financial relationship with the vendor.

Detection Evidence

  • dc:creator is the employee's username, not the vendor's
  • Company field shows the employee's organization, not the vendor's
  • Printer settings contain the employee's office network printer path
  • The document's theme matches the employee's organization's Excel template, not the vendor's
  • Email headers show the "vendor quote" attachment was created minutes before the email was sent

Conclusion: Multiple origin indicators confirm the document was created internally by the employee, not received from the vendor. The quote was fabricated.

Legal Considerations and Evidence Handling

Metadata-based fraud detection often produces evidence intended for legal proceedings. The way evidence is collected, preserved, and documented determines whether it will be admissible and persuasive.

Evidence Handling Best Practices

Chain of Custody

  • Document how and when you received the file
  • Record every person who has handled the file
  • Hash the original before any analysis begins
  • Work only on forensic copies, never the original
  • Log every tool used and every command executed

Report Documentation

  • Include exact metadata values, not summaries
  • Show the raw XML alongside your interpretation
  • Explain the significance of each finding clearly
  • Note limitations and alternative explanations
  • Use screenshots and hex dumps for critical evidence

Reproducibility

  • Document your methodology so others can reproduce it
  • Use well-known, accepted forensic tools
  • Provide the file hashes for verification
  • Include step-by-step instructions for each finding
  • Ensure a qualified peer can reach the same conclusions

Important Caveats

  • Metadata can be manipulated; no single artifact is definitive
  • Innocent explanations may exist (file copying, system clock issues)
  • Multiple independent indicators are stronger than any single one
  • Always consider the totality of evidence, not isolated findings
  • Consult with legal counsel before presenting findings

Conclusion

Fraudulent Excel documents carry the seeds of their own detection. The very metadata systems that make Excel files functional—timestamps, author records, version tracking, style management, and internal XML structures—also create an audit trail that is extraordinarily difficult to fake completely.

The key principle for investigators is convergence. No single metadata artifact is conclusive on its own. File system timestamps can be manipulated. Document properties can be edited. Even application versions could theoretically be changed by someone with XML editing skills. But when multiple independent indicators all point to the same conclusion—that a document was created at a different time, by a different person, or on a different machine than claimed—the convergence of evidence becomes compelling.

Fraudsters think about the visible content: the numbers in cells, the dates in headers, the names on the document. They rarely think about the invisible layer of metadata that records the true history of every file they create. That asymmetry—between what the fraudster controls and what they overlook—is what makes metadata analysis one of the most powerful tools in document fraud detection.
