Java
is currently one of the hottest technologies around, and with the growing
use of the Net, it’s poised for explosive growth. Like almost
everything else connected to the Net, Java has its dark sides too. Here, we
attempt to shed some light on one such dark facet of Java–the ease of
decompiling Java executables.
Decompilation is the reverse
of compilation–a decompiler is a tool that converts the executable machine
code back to the source code that produced it. The technique of
decompilation is not a novel one. In fact, the first decompiler appeared in
the early 1960s, about a decade after its compiler counterpart. However,
writing a decompiler has always been an uphill task, and the results were
never totally satisfactory. This is because of several technical problems:
for example, the difficulty of mapping a sequence of machine instructions
back to the programming construct (such as an if-then statement or a loop)
that generated it. Unfortunately, that is no longer true of Java.
Today, there are decompilers
available for Java that allow anybody, even with limited knowledge of the
language, to reverse engineer most Java class files–that is, get back
their source code. Although no language is decompilation-proof, in the case
of Java, the very strengths that have made it a huge success are its
weaknesses against a decompiler.
Before going into how this
happens, let’s take a look at the structure of Java class files.
Java class files
Java class
files are binaries, the result of compiling Java source code using a
compiler tool such as Sun’s javac. They contain:
- Symbolic information: the names of attributes and methods in the Java
  source program
- Byte codes: the result of compiling methods in the Java source program
- Optionally, debugging information
Class files don’t contain
source code or comments. Still, they contain all the information that’s
needed by a Java interpreter to execute the application. The symbolic
information contained in the class file describes precisely the class
structure, the inheritance tree, the methods, and the attributes.
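To make the binary layout a little more concrete, here is a minimal sketch that decodes the fixed header found at the start of every class file: a magic number, the format version, and the size of the constant pool, which is the table where the symbolic names are stored. The class name ClassFileHeader and its fields are hypothetical names chosen for this illustration, and the sketch reads only the first ten bytes rather than the whole file.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration: decodes the fixed-size header that
// starts every class file, just before the constant pool entries.
final class ClassFileHeader {
    static final int MAGIC = 0xCAFEBABE; // first four bytes of every class file
    static final int HEADER_LENGTH = 10; // magic + minor + major + pool count

    final int minorVersion;
    final int majorVersion;
    final int constantPoolCount;

    private ClassFileHeader(int minorVersion, int majorVersion, int constantPoolCount) {
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
        this.constantPoolCount = constantPoolCount;
    }

    // Parses the first ten bytes. Class files store numbers in big-endian
    // order, which is also ByteBuffer's default byte order.
    static ClassFileHeader parse(byte[] classBytes) {
        if (classBytes.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer buffer = ByteBuffer.wrap(classBytes);
        if (buffer.getInt() != MAGIC) {
            throw new IllegalArgumentException("Not a Java class file");
        }
        int minor = buffer.getShort() & 0xFFFF;     // unsigned 16-bit values
        int major = buffer.getShort() & 0xFFFF;
        int poolCount = buffer.getShort() & 0xFFFF;
        return new ClassFileHeader(minor, major, poolCount);
    }

    public static void main(String[] args) throws IOException {
        // Read the header of a class that ships with the Java runtime itself.
        try (InputStream in = String.class.getResourceAsStream("String.class")) {
            if (in == null) {
                System.out.println("Class bytes are not available on this runtime");
                return;
            }
            byte[] header = new byte[HEADER_LENGTH];
            new DataInputStream(in).readFully(header);
            ClassFileHeader parsed = parse(header);
            System.out.println("major version: " + parsed.majorVersion
                    + ", constant pool count: " + parsed.constantPoolCount);
        }
    }
}
```

The values printed depend on the Java runtime in use, so no particular output is assumed here.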
Let’s get to how Java’s
strengths as a language make it vulnerable to decompilation attacks.
Platform independence: compile once, run everywhere
A single class file in Java
can run on a variety of platforms. Java achieves this by being both a
compiled and an interpreted language. A Java source file is first
compiled into an intermediate byte code representation. These byte codes, as
we explained above, are contained in the class file. They are then
interpreted by the Java interpreter, also called the Java Virtual Machine (JVM).
The byte codes are actually a
platform-independent machine language–that is, instructions for a virtual
processor. The JVM shields the application from the real hardware by
emulating this virtual processor–hence the name virtual machine. The
platform-dependent aspects are isolated in the JVM and different JVM
implementations are needed for different hardware platforms. This
intermediate representation is the trick that makes Java programs
cross-platform.
The following method makes
this concept of byte codes clearer. The method "main" here is
taken from a "Hello world" program written in Java.
public static void main(String[] args)
{
    System.out.println("Hello World");
}
This method is compiled into
four byte codes:
- getstatic java.lang.System.out (a field of type java.io.PrintStream)
- ldc "Hello World"
- invokevirtual void java.io.PrintStream.println(java.lang.String)
- return
As can be seen in the above
example, the byte codes are similar to assembly language but without
registers. This is because the virtual processor emulated by
the JVM is a stack machine. That is, it has no explicit registers to store
data. The operands for all its instructions come from an in-memory stack.
The virtual processor only keeps track of the location of the next
instruction to be executed (traditionally called the Program Counter)
and a pointer to the top of the stack (called the Stack Pointer).
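The idea of a processor that needs only a program counter and a stack can be sketched in a few lines of Java. The toy interpreter below is an illustration only: its four opcodes (PUSH, ADD, MUL, RETURN) and the class name TinyStackMachine are invented for this example and are not real JVM instructions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A toy stack machine: no registers, only a program counter and a stack.
// The opcodes are hypothetical and far simpler than real JVM byte codes.
final class TinyStackMachine {
    static final int PUSH = 0;   // followed by one operand: the value to push
    static final int ADD = 1;    // pops two values, pushes their sum
    static final int MUL = 2;    // pops two values, pushes their product
    static final int RETURN = 3; // pops the result and stops

    static int run(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // program counter: index of the next instruction
        while (true) {
            int opcode = code[pc++];
            switch (opcode) {
                case PUSH:
                    stack.push(code[pc++]);
                    break;
                case ADD: {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left + right);
                    break;
                }
                case MUL: {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left * right);
                    break;
                }
                case RETURN:
                    return stack.pop();
                default:
                    throw new IllegalStateException("Unknown opcode " + opcode);
            }
        }
    }

    public static void main(String[] args) {
        // Encodes (2 + 3) * 4 as a sequence of stack operations.
        int[] program = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, RETURN};
        System.out.println(run(program)); // prints 20
    }
}
```

Because every instruction takes its operands from the stack and leaves its result there, a tool reading the instruction stream can rebuild the original expression by simulating the stack. This is a large part of what makes such code easy to decompile.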
All this leads to an
instruction set that’s much simpler than that of a real processor. This
simplicity of instructions makes compilation easier and also leads to faster
execution. But sadly enough, it also makes decompilation easier and makes
the "Decompile once, run everywhere" dream of the pirate come
true.
Object-oriented nature
That Java is
an object-oriented language is a well-known fact. Lesser known, however, is
the fact that this is also true of the JVM. The JVM actually emulates an
object-oriented processor. In other words, the byte codes support
object-oriented operations. This is illustrated by the instructions
"getstatic" and "invokevirtual" in the
above example. Using these object-supporting instructions, the sequence of
four byte codes above makes a call to the "println" method of a
"java.io.PrintStream" object. To support this object-oriented
model, the JVM is also responsible for:
- Dynamic loading of Java classes
- Inheritance and polymorphism: that is, when calling a method, it walks
  the inheritance tree to call overridden methods correctly
- Memory management and garbage collection
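The second responsibility in this list can be seen with a small example. In the sketch below, the classes Employee and Manager and the method title are hypothetical names used only for illustration. The call inside titleOf is an ordinary instance method call of the kind that compiles to "invokevirtual", and the method that actually runs is chosen from the runtime class of the object, not from the declared type of the variable.

```java
// Illustration of virtual method dispatch; all names here are hypothetical.
final class DispatchDemo {
    // The declared type is Employee, but the overriding method is chosen
    // at run time from the actual class of the object passed in.
    static String titleOf(Employee employee) {
        return employee.title();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(new Employee())); // prints employee
        System.out.println(titleOf(new Manager()));  // prints manager
    }
}

class Employee {
    String title() {
        return "employee";
    }
}

class Manager extends Employee {
    @Override
    String title() {
        return "manager";
    }
}
```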
The object-oriented nature of
the virtual machine has major implications for the Java class file format. In
particular, to support dynamic loading of classes and method invocation, the
JVM needs symbolic information. There is no other place to keep this
symbolic information except in the class file itself. Thus, this information
is also available to a decompiler and helps it to restore the same names for
classes, methods and attributes in the decompiled file as were used in the
original Java source file.
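One way to see that these names survive compilation is to ask the running JVM for them through the standard reflection API. The sketch below looks up a class and a method purely from strings. The helper class SymbolicNames and its lookup method are hypothetical names for this illustration, while Class.forName and Class.getMethod are part of the standard library.

```java
import java.lang.reflect.Method;

// Looks up classes and methods by name at run time, which is only possible
// because the names are stored in the class files themselves.
final class SymbolicNames {
    // Hypothetical helper: resolves a public method from plain strings.
    static Method lookup(String className, String methodName, Class<?>... parameterTypes) {
        try {
            Class<?> type = Class.forName(className); // dynamic class loading
            return type.getMethod(methodName, parameterTypes);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot resolve " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Method println = lookup("java.io.PrintStream", "println", String.class);
        // The original class and method names are available as strings.
        System.out.println(println.getDeclaringClass().getName() + "." + println.getName());
        // The same method can be called through the looked-up handle.
        println.invoke(System.out, "Hello World");
    }
}
```

A decompiler reads the same names directly from the class file rather than through reflection, but the source of the information is the same.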
Protecting your Java applications
Does that
mean that there is no way of preventing class files from being decompiled?
A technique called code
obfuscation provides some hope to developers. An obfuscator, when applied to
class files, makes it harder for decompilers to extract useful information
from them. It works by changing most of the symbolic information present in
a class file. For example, it might replace human-friendly class names, like
Employee, with less friendly names like 112. As a result, it becomes very
difficult to make any sense of the code, even when the class file has been
successfully decompiled.
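The renaming step can be sketched as a simple mapping from meaningful names to meaningless ones. This is only an illustration of the idea under stated assumptions: the class NameObfuscator and its naming scheme (a "c" prefix followed by a counter) are invented here, and a real obfuscator would also have to rewrite every reference to the renamed item inside the class files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the renaming idea behind an obfuscator: each human-friendly
// name is consistently replaced by a short, meaningless one.
final class NameObfuscator {
    private final Map<String, String> renamed = new LinkedHashMap<>();

    // Returns the replacement for a name, creating one on first use so that
    // every occurrence of the same original name maps to the same result.
    String obfuscate(String originalName) {
        String replacement = renamed.get(originalName);
        if (replacement == null) {
            replacement = "c" + renamed.size(); // hypothetical naming scheme
            renamed.put(originalName, replacement);
        }
        return replacement;
    }

    public static void main(String[] args) {
        NameObfuscator obfuscator = new NameObfuscator();
        System.out.println(obfuscator.obfuscate("Employee")); // prints c0
        System.out.println(obfuscator.obfuscate("Payroll"));  // prints c1
        System.out.println(obfuscator.obfuscate("Employee")); // prints c0
    }
}
```

The mapping has to be consistent, as it is here, because the JVM still resolves classes and methods by name at run time. Only the meaning of the names is lost, not the links between them.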
Some obfuscators go even
further. They introduce byte code combinations that are notoriously
difficult to decompile. Typically, they do so by modifying byte codes,
adding useless and harmless instructions. An example of this is adding some
dead code–code that is never executed, for example, the code put after the
"return" statement–to the methods. So, even though the program runs
correctly, it might not decompile properly.
Unfortunately, many of the
present-day Java decompilers can detect this, and successfully decompile a
class file that’s obfuscated in such a manner.
There are lots of obfuscators
available and some are even free. One of the free ones is HashJava, which
can be downloaded from www.sbktech.org.
Another one, Jobe, is free for non-commercial use and is available at www.cs.ucsd.edu/users/ej.