Detecting Embedded Content in OOXML Documents

On Advanced Practices, we are always looking for new ways to find
malicious activity and track adversaries over time. Today we’re
sharing a technique we use to detect and cluster Microsoft Office
documents, specifically those in the Office Open XML (OOXML) file
format. Additionally, we’re releasing a tool so analysts and defenders
can automatically generate YARA rules using this technique.

OOXML File Format

Beginning with Microsoft Office 2007, the default file format for
Excel, PowerPoint, and Word documents switched from an Object Linking
and Embedding (OLE) based format to OOXML. For now, the only important
point is that OOXML documents are just a bunch of folders and files
packaged into a ZIP archive. Let’s look at the Word document this blog
post is being written in, for example (Figure 1):


➜ file example.docx
example.docx: Microsoft Word 2007+

➜ unzip -v example.docx
Archive:  example.docx
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    1445  Defl:S      358  75% 01-01-1980 00:00 576f9132  [Content_Types].xml
     590  Defl:S      239  60% 01-01-1980 00:00 b71a911e  _rels/.rels
    1559  Defl:S      407  74% 01-01-1980 00:00 33ce17ac  word/_rels/document.xml.rels
   10861  Defl:S     2480  77% 01-01-1980 00:00 f0af2147  word/document.xml
    8393  Defl:S     1746  79% 01-01-1980 00:00 9867f4b6  word/theme/theme1.xml
    4725  Defl:S     1416  70% 01-01-1980 00:00 718205c5  word/settings.xml
     655  Defl:S      295  55% 01-01-1980 00:00 bf8dd4bd  word/webSettings.xml
     755  Defl:S      367  51% 01-01-1980 00:00 5bf1cf49  docProps/core.xml
     991  Defl:S      476  52% 01-01-1980 00:00 bad67489  docProps/app.xml
   30308  Defl:S     3104  90% 01-01-1980 00:00 ce0f21cd  word/styles.xml
    7781  Defl:S      952  88% 01-01-1980 00:00 9f45bf02  word/numbering.xml
    2230  Defl:S      559  75% 01-01-1980 00:00 63baaf8c  word/fontTable.xml
--------          -------  ---                            -------
   70293            12399  82%                            12 files

Figure 1: unzip -v output for example.docx

Now, even though we used the unzip command, we didn’t actually unzip
the archive. The -v option only lists metadata stored in the ZIP
headers: unzip reads it from the central directory, which repeats
fields found in each entry’s local file header. These headers contain
a wealth of information on the compressed files. Of particular
interest is the CRC-32 value.

A cyclic redundancy check (CRC) is an algorithm designed to detect
errors or unintended changes to data. The idea is a system can
calculate a CRC value before and after a transfer or transformation of
data as a simple way to ensure its integrity. For ZIP archives, the
CRC-32 values confirm the decompressed files are the same as they were
prior to compression. Which is great and all, but they can serve other
use cases too.
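
To make the idea concrete, here is a minimal Python sketch that reads the same per-entry metadata without decompressing anything, using only the standard library. The demo archive and its two entries are made up for illustration; they are not the example.docx from Figure 1.

```python
import io
import zipfile
import zlib


def list_entries(zip_bytes: bytes):
    """Return (name, crc32_hex, uncompressed_size) for each archive entry.

    zipfile reads this metadata from the ZIP headers; no entry is
    decompressed here.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(i.filename, f"{i.CRC:08x}", i.file_size) for i in zf.infolist()]


def build_demo_archive() -> bytes:
    """Build a tiny, hypothetical archive in memory (not a valid document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


entries = list_entries(build_demo_archive())
for name, crc, size in entries:
    print(f"{crc}  {size:>6}  {name}")

# The stored value is the CRC-32 of the *uncompressed* content,
# so it can be recomputed independently with zlib.
recomputed = f"{zlib.crc32(b'<Types/>'):08x}"
print(recomputed == entries[0][1])
```

Because the value depends only on the uncompressed bytes, the same embedded file should carry the same CRC-32 no matter which document it is packaged in.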

Detection

Forget about error detection. A ZIP CRC-32 value is essentially a
small hash of the uncompressed file, and what better way to identify a
file than by its hash? While the chance of a collision for CRC-32 is
significantly higher than other algorithms such as SHA-256 or even
MD5, it can be paired with additional metadata like the file name (or
extension) and size to reduce false positives.

Here’s a hex dump of the first local file header from the previous
example (Figure 2):



Figure 2: Hex dump of the first local file header for example.docx

Using the CRC-32, uncompressed file size, and file name fields, a YARA
rule for this entry can be written as follows:

rule content_types {
    meta:
        author = "Aaron Stephens <aaron.stephens@mandiant.com>"
        description = "Example OOXML rule."

    strings:
        $crc = { 32 91 6f 57 }
        $name = "[Content_Types].xml"
        $size = { a5 05 00 00 }

    condition:
        $size at @crc[1] + 8 and $name at @crc[1] + 16
}

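NOTE: The numeric fields are stored in little-endian byte order, so
the CRC-32 value 576f9132 appears as 32 91 6f 57 and the size 1445
(0x05a5) as a5 05 00 00.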
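
The offsets in the rule condition follow from the layout of the local file header. The sketch below is an illustration rather than a full ZIP parser: it packs a header using the values for [Content_Types].xml from Figure 1 and shows where the three fields land relative to the CRC-32. The helper names, and the version, flag, and timestamp values, are assumptions made for the example.

```python
import struct

# Fixed 30-byte portion of a ZIP local file header (all little-endian):
# signature, version, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def build_local_header(name: bytes, crc: int, csize: int, usize: int) -> bytes:
    """Build an illustrative header (no extra field, no file data).

    Version 20, flags 0, method 8 (deflate), and the 1980-01-01 DOS
    date 0x0021 are assumed values for this sketch.
    """
    fixed = LOCAL_HEADER.pack(b"PK\x03\x04", 20, 0, 8, 0, 0x0021,
                              crc, csize, usize, len(name), 0)
    return fixed + name


def parse_local_header(data: bytes, offset: int = 0) -> dict:
    """Decode the CRC-32, uncompressed size, and file name at `offset`."""
    fields = LOCAL_HEADER.unpack_from(data, offset)
    if fields[0] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    crc, usize, name_len = fields[6], fields[8], fields[9]
    start = offset + LOCAL_HEADER.size
    # Decoding as UTF-8 is enough for the ASCII names used here.
    return {"crc": f"{crc:08x}", "size": usize,
            "name": data[start:start + name_len].decode("utf-8")}


header = build_local_header(b"[Content_Types].xml", 0x576F9132, 358, 1445)
parsed = parse_local_header(header)

# Mirror the rule: find the CRC bytes, then check the size at +8
# and the file name at +16 relative to that offset.
crc_off = header.find(bytes.fromhex("32916f57"))
size_ok = header[crc_off + 8:crc_off + 12] == bytes.fromhex("a5050000")
name_ok = header[crc_off + 16:].startswith(b"[Content_Types].xml")
print(parsed, crc_off, size_ok, name_ok)
```

The CRC-32 sits 14 bytes into the header, the uncompressed size 8 bytes after it, and the file name 16 bytes after it, which is where the + 8 and + 16 in the condition come from.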

Examples

Advanced Practices uses this technique to find, over time, similar
documents that contain the same embedded file. Here are a couple of
real-world examples:

Document: 397ba1d0601558dfe34cd5aafaedd18e

File: 0dc39af4899f6aa0a8d29426aba59314 (word/media/image1.png)

Groups: UNC1130, UNC1837, UNC1965

rule png_397ba1d0601558dfe34cd5aafaedd18e
{
    meta:
        author = "Aaron Stephens <aaron.stephens@mandiant.com>"
        description = "PNG in OOXML document."

    strings:
        $crc = {f8158b40}
        $ext = ".png"
        $ufs = {b42c0000}

    condition:
        $ufs at @crc[1] + 8 and $ext at @crc[1] + uint16(@crc[1] + 12) + 16 - 4
}

This rule detects OOXML documents that contain the specific PNG image
shown in Figure 3.



Figure 3: PNG embedded in phishing documents

The image in Figure 3 is found in several documents dropping LATEOP,
and has been attributed to groups such as UNC1130, a North Korean
state-sponsored threat actor.
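
The condition in the PNG rule matches on the file extension rather than the full name: uint16(@crc[1] + 12) reads the file name length, the name starts 16 bytes after the CRC-32, so the last four bytes of the name sit at that offset plus the length minus 4. Here is a rough Python emulation of that logic over raw bytes. The function name, the demo archive, and the payload are made up for illustration, and this is not how YARA itself evaluates rules.

```python
import io
import struct
import zipfile
import zlib


def rule_matches(data: bytes, crc: int, size: int, ext: bytes) -> bool:
    """Emulate: $ufs at @crc[1] + 8 and $ext at @crc[1] + uint16(...) + 16 - 4."""
    crc_at = data.find(struct.pack("<I", crc))  # @crc[1]: first CRC match
    if crc_at < 0 or crc_at + 16 > len(data):
        return False
    # Uncompressed size lives 8 bytes after the CRC-32.
    ufs_ok = data[crc_at + 8:crc_at + 12] == struct.pack("<I", size)
    # File name length is a 2-byte field 12 bytes after the CRC-32.
    (name_len,) = struct.unpack_from("<H", data, crc_at + 12)
    # len(ext) generalizes the literal 4 used for ".png" in the rule.
    ext_at = crc_at + 16 + name_len - len(ext)
    return ufs_ok and data[ext_at:ext_at + len(ext)] == ext


# Hypothetical one-entry archive; the payload is not a real image.
payload = b"\x89PNG\r\n\x1a\n" + b"hypothetical image bytes"
entry = zipfile.ZipInfo("word/media/image1.png", date_time=(1980, 1, 1, 0, 0, 0))
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(entry, payload)
archive = buf.getvalue()

hit = rule_matches(archive, zlib.crc32(payload), len(payload), b".png")
miss = rule_matches(archive, zlib.crc32(payload), len(payload), b".jpg")
print(hit, miss)
```

Matching on the extension keeps the rule valid even if the embedded file is renamed or moved to a different folder inside the package, as long as its type stays the same.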

Document: 252227b8701d45deb0cc6b0edad98836

File: 3bdfaf98d820a1d8536625b9efd3bb14 ([Content_Types].xml)

Groups: FIN7

rule xml_252227b8701d45deb0cc6b0edad98836
{
    meta:
        author = "Aaron Stephens <aaron.stephens@mandiant.com>"
        description = "[Content_Types].xml in OOXML document."

    strings:
        $crc = {8cf0d220}
        $name = "[Content_Types].xml"
        $ufs = {9b060000}

    condition:
        $ufs at @crc[1] + 8 and $name at @crc[1] + 16
}

This rule detects a specific [Content_Types].xml file, which is shown
(formatted) in Figure 4.



Figure 4: Formatted [Content_Types].xml file

This file maps different parts of the OOXML package to their content
type. Given a unique enough combination of parts and types, the [Content_Types].xml file can be a great way to
find similar OOXML documents. This particular example is found in
multiple FIN7 GRIFFON samples.

Tooling

Last but not least, it’s time to introduce apooxml, a Python tool that can be used to
quickly and easily generate YARA rules just like these. Here’s how it works:

➜ python3 apooxml.py -h
usage: apooxml.py [-h] [-a AUTHOR] [-n NAME] [-o OUT] sample

Generate YARA rules for OOXML documents.

positional arguments:
  sample                OOXML document to generate YARA rule from.

optional arguments:
  -h, --help            show this help message and exit
  -a AUTHOR, --author AUTHOR
                        YARA rule author.
  -n NAME, --name NAME  YARA rule name.
  -o OUT, --out OUT     YARA rule file name.


➜ python3 apooxml.py -o 'example.yara' 397ba1d0601558dfe34cd5aafaedd18e
 1. [Content_Types].xml             1980-01-01 00:00:00  14506c9d  1613
 2. _rels/.rels                     1980-01-01 00:00:00  b71a911e  590
 3. word/_rels/document.xml.rels    1980-01-01 00:00:00  ab5e83b7  1207
 4. word/document.xml               1980-01-01 00:00:00  44c9bf93  2692
 5. word/_rels/vbaProject.bin.rels  1980-01-01 00:00:00  ef601408  277
 6. word/vbaProject.bin             1980-01-01 00:00:00  ab54dacf  10752
 7. word/media/image1.png           1980-01-01 00:00:00  408b15f8  11444
 8. word/theme/theme1.xml           1980-01-01 00:00:00  4276c88b  7088
 9. word/settings.xml               1980-01-01 00:00:00  17044d98  2750
10. word/vbaData.xml                1980-01-01 00:00:00  9209afe1  1292
11. word/fontTable.xml              1980-01-01 00:00:00  37e3715b  960
12. word/stylesWithEffects.xml      1980-01-01 00:00:00  c883d0b1  16755
13. docProps/app.xml                1980-01-01 00:00:00  3cc6382c  982
14. word/webSettings.xml            1980-01-01 00:00:00  4e16a017  428
15. docProps/core.xml               1980-01-01 00:00:00  8cef183c  643
16. word/styles.xml                 1980-01-01 00:00:00  1f9b9145  16002

Enter a number corresponding to the desired entry: 7

Wrote YARA rule to example.yara.

➜ cat example.yara
rule ooxml_png_crc_397ba1d0601558dfe34cd5aafaedd18e {
    meta:
        author = "apooxml"
        description = "Generated by apooxml."
        reference_md5 = "397ba1d0601558dfe34cd5aafaedd18e"

    strings:
        $crc = {f8158b40}
        $ext = ".png"
        $ufs = {b42c0000}

    condition:
        $ufs at @crc[1] + 8 and $ext at @crc[1] + uint16(@crc[1] + 12) + 16 - 4
}
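
For readers curious how a rule like this can be produced, the sketch below builds an equivalent rule from an entry's CRC-32, uncompressed size, and extension. It is a simplified illustration and not apooxml's actual code: the template, the function names, and the author and description values are assumptions.

```python
import io
import os
import struct
import zipfile

# Doubled braces are literal braces in str.format templates.
RULE_TEMPLATE = """rule {name} {{
    meta:
        author = "{author}"
        description = "Generated by an illustrative sketch."

    strings:
        $crc = {{{crc}}}
        $ext = "{ext}"
        $ufs = {{{ufs}}}

    condition:
        $ufs at @crc[1] + 8 and $ext at @crc[1] + uint16(@crc[1] + 12) + 16 - {ext_len}
}}
"""


def format_rule(name: str, crc: int, size: int, ext: str,
                author: str = "example") -> str:
    """Render a rule, encoding the numeric fields as little-endian hex."""
    return RULE_TEMPLATE.format(
        name=name, author=author, ext=ext, ext_len=len(ext),
        crc=struct.pack("<I", crc).hex(),
        ufs=struct.pack("<I", size).hex(),
    )


def rule_for_entry(zip_bytes: bytes, entry_name: str, rule_name: str) -> str:
    """Look up one entry's metadata in an archive and render a rule for it."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(entry_name)
    ext = os.path.splitext(info.filename)[1]
    return format_rule(rule_name, info.CRC, info.file_size, ext)


# Entry 7 from the listing above: CRC-32 408b15f8, size 11444.
rule_text = format_rule("ooxml_png_crc_example", 0x408B15F8, 11444, ".png")
print(rule_text)
```

Feeding in the values for entry 7 yields the same $crc and $ufs byte patterns as the generated rule shown above.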

For more details, check out the repository on GitHub.

 
