Have you ever wondered how people writing reports about malware can say where the malware was likely developed?
Sometimes you get totally lucky and log files created by the malware will help answer the question. Given the following line from a log:
11/16/2009 6:41:48 PM –> Hook instalate lsass.exe
We can use Google Translate’s “language detect” feature to help us determine the language used:
Of course, it’s not often we get THAT lucky!
A more interesting method is the examination of certain structures known as the Resource Directory within the executable file itself. For the purpose of this post, I will not be describing the Resource Directory structure. It’s a complicated beast, making it a topic I will save for later posts that actually warrant and/or require a low-level understanding of it. Suffice it to say, the Resource Directory is where embedded resources like bitmaps (used in GUI graphics), file icons, etc. are stored. The structure is frequently compared to the layout of files on a file system, although I think it’s insulting to file systems to say such a thing. For those more graphically inclined, I took the following image from http://www.devsource.com/images/stories/PEFigure2.jpg.
For the sake of example, here are some images showing you just a few of the resources embedded inside of notepad.exe (using CFF Explorer from http://www.ntcore.com/exsuite.php):
Now it’s important to note that an executable may have only a few or even zero resources – especially in the case of malware. Consider the following example showing a recent piece of malware with only a single resource called “BINARY.”
Moving on, let’s look at another piece of malware… Below, we see this piece of malware has five resource directories.
We could pick any of the five for this analysis, but I’ll pick RCData – mostly because it’s typically an interesting directory to examine when reverse engineering malware. (This is because RCData defines a raw data resource for an application. Raw data resources permit the inclusion of any binary data directly in the executable file.) Under RCData, we see three separate entries:
The first one to catch my eye is the one called IE_PLUGIN. I’ll show a screenshot of it below, but am saving the subject of executables embedded within executables for a MUCH more technical post in the near future (when it’s not 1:30 am and I actually feel like writing more!).
Going back to the entry structure itself, the IE_PLUGIN entry will have at least one Directory Entry underneath it to describe the size(s) and offset(s) to the data contained within that resource. I have expanded it as shown next:
And that’s where things get interesting, at least as it relates to answering the question at the start of this post. Notice the ID: 1055. That’s our money shot for helping to determine what country this binary was compiled in. Or, more specifically, the default locale of the computer used to compile this binary (the ID is a Windows language identifier, not a codepage). Those IDs have very legitimate uses. For example, you can have the same dialog in English, French and German localized forms, and the system will choose the dialog to load based on the thread’s locale. However, when resources are added to the binary without explicitly setting them to different locale IDs, those resources will be assigned the default locale ID of the compiler’s computer.
So in the example above, what does 1055 mean?
It means this piece of malware was likely developed in (or at least compiled in) Turkey, or more precisely, on a computer whose default locale was set to Turkish (Turkey).
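For those who prefer code to lookup tables, here is a minimal Python sketch of how an ID like 1055 can be taken apart. A Windows language ID is a 16-bit value with the primary language in the low 10 bits and the sublanguage in the high 6 bits. The helper names (`split_langid`, `describe`) and the tiny `KNOWN_LOCALES` dictionary are my own illustrations, not part of any tool shown in this post, and the dictionary only covers the handful of locales mentioned here.

```python
# Hypothetical helpers for illustration; only the ID values are standard.
KNOWN_LOCALES = {
    1029: "Czech (Czech Republic)",
    1031: "German (Germany)",
    1033: "English (United States)",
    1055: "Turkish (Turkey)",
    2057: "English (United Kingdom)",
    3081: "English (Australia)",
}


def split_langid(langid):
    """Split a 16-bit language ID into (primary language, sublanguage)."""
    primary = langid & 0x3FF  # low 10 bits
    sub = langid >> 10        # high 6 bits
    return primary, sub


def describe(langid):
    """Return a readable summary of a resource language ID."""
    primary, sub = split_langid(langid)
    name = KNOWN_LOCALES.get(langid, "unknown")
    return (f"{langid} (0x{langid:04X}): "
            f"primary=0x{primary:02X}, sub=0x{sub:02X}, {name}")


print(describe(1055))
# 1055 (0x041F): primary=0x1F, sub=0x01, Turkish (Turkey)
```

The primary language value 0x1F is Turkish, and sublanguage 1 is the default variant for it, which is why the pair maps to Turkey.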
How do we know that one resource wasn’t added with a custom ID? Because we see the same ID when looking at almost all the other resources in the file (anything with an ID of zero just means “use the default locale”):
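The "almost all the other resources agree" check can be sketched as a simple tally. This is only an illustration of the reasoning: the function name `dominant_language` is hypothetical, and the sample list is made up rather than taken from the malware above. It assumes you have already collected the language ID of each resource entry with whatever PE tool you use.

```python
from collections import Counter


def dominant_language(lang_ids):
    """Return (most common non-zero ID, its share of non-zero IDs), or None.

    An ID of 0 is skipped because it carries no locale information.
    """
    counts = Counter(i for i in lang_ids if i != 0)
    if not counts:
        return None
    langid, hits = counts.most_common(1)[0]
    return langid, hits / sum(counts.values())


# Made-up sample: mostly 1055, a couple of neutral entries, one stray 1033.
sample = [1055, 1055, 0, 1055, 1033, 0, 1055]
print(dominant_language(sample))
```

A high share for a single ID suggests the compiler's default locale; a mixed result suggests the resources were localized on purpose (or borrowed from elsewhere) and deserves a closer look.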
In this case, we are also lucky enough to have other strings in the binary (once unpacked) to help solidify the assertion this binary is from Turkey. One such string is “Aktif Pencere,” which Google’s Translation detection engine shows as:
However, as you can see, this technique is very useful even when no strings are present – in logs or the binary itself.
So is this how default locale identification normally works for non-malware executable files?
Not exactly. The above techniques are generally used with malware (if the malware even has exposed resources), but not generally with normal/legitimate binaries. Consider the following legitimate binary. What is the source locale for the following example?
As you see in the green box, we have some cursor resources with the ID for the United States. (I’m including a lookup table at the bottom of this post.) In the orange box, there are additional cursor resources with the ID for Germany. In the red box is RCData, like we examined before, but all of these resources have the ID specifying the default language of the computer executing the application.
As it turns out, the normal value to examine is the ID for the Version Information Table resource (in the blue box). In the case above, it’s the Czech Republic. The Version Information Table contains the “metadata” you normally see depicted in locations like this:
In the above screenshot, Windows is identifying the source/target locale as English, and specifically, United States English (as opposed to UK English, Australian English, etc.). That information is not stored within the Version Information Table, but rather is determined by the ID of the Version Information Table.
However, in malware, the Version Information table is almost always stripped or mangled, as is the case with our original example from earlier:
Because of that, the earlier techniques are more applicable to malware.
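Putting the two cases together, the decision could be sketched roughly like this in Python: trust the Version Information Table's ID when one is present, and otherwise fall back to the most common non-zero ID among the remaining resources. The function name `guess_locale` and the sample lists are hypothetical; the only standard values used are the resource type numbers (1 for cursors, 3 for icons, 10 for RCData, 14 for icon groups, 16 for version information). Treat it as a heuristic, not a definitive attribution method.

```python
from collections import Counter

RT_VERSION = 16  # standard resource type ID for version information


def guess_locale(resources):
    """Guess a source locale from a list of (type_id, lang_id) pairs.

    Returns (lang_id, reason), where reason names the rule that fired.
    """
    version_ids = [lang for rtype, lang in resources
                   if rtype == RT_VERSION and lang != 0]
    if version_ids:
        return version_ids[0], "version-info"
    others = Counter(lang for _, lang in resources if lang != 0)
    if others:
        return others.most_common(1)[0][0], "majority"
    return None, "unknown"


# Made-up inputs that mirror the two situations described in this post.
legit = [(1, 1033), (1, 1031), (10, 0), (16, 1029)]
stripped = [(10, 1055), (3, 1055), (14, 1055), (10, 0)]

print(guess_locale(legit))
print(guess_locale(stripped))
```

The first sample resolves through the version information entry, while the second has none and falls back to the majority vote.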
Below, I’m including a table to help you translate Resource Entry IDs to locales (sorted by decimal ID number).