Miguel Ángel Ballesteros bio photo

Miguel Ángel Ballesteros

Maker, using software to bring great ideas to life. Manager, empowering and developing people to achieve meaningful goals. Father, devoted to family. Lifelong learner, with a passion for generative AI.

Email LinkedIn Github
RSS Feed

Java Technology (II) – Inside the JVM (EN)

Available in Español .

Overview

This was my second publication in a professional magazine: RPP (Revista Profesional de Programadores).

This is the second article in a 3-part series. See the 1st and 3rd parts.


It is a general rule that it is not possible to perfectly understand the functioning of any machine without going into small details. These are the ones that often provide the most information about the possibilities of the machine we are analyzing. To them, the small details of the JVM, we will dedicate this article.

In the second article of the series, we will continue to delve into the functioning of the Java Virtual Machine. We will analyze the JVM instruction set and the “.class” format. I have found it especially difficult to find a middle ground between absolute detail and a shallow general description. I have finally opted for a fairly detailed presentation, but one that highlights what (in my opinion, of course) is most important.

We will start first with the binary storage format of Java applications.

.class Format

When I began to study the internal workings of the JVM, I found that the technical specifications (see [1]) explained in great detail the binary storage format in which the bytecodes of each class were saved. At first, I thought that the specifications were aimed at both people or teams interested in implementing the machine itself, and those who wished to develop compilers and similar tools. I thought that perhaps as an appendix or something similar it would be fine, but I did not see its explanation justified in the main body of the specifications: not even before the instruction set. I soon realized my mistake. The .class format hides many of the secrets of the internal workings of the JVM. We will see this throughout the article.

Every Java application starts from a file with the source code. After passing through the compiler, it is broken down into several binary files, one for each class or interface defined in the source file. These files, with a .class extension, contain all the information that defines the class and are the files that the interpreter loads to execute the applications.

A Java binary file consists of a stream of bytes in which numerical values that exceed 8-bits are arranged from most to least significant (that is, 256 is encoded with the pair of bytes {1,0}). This format is supported by classes such as java.io.DataInputStream and java.io.DataOutputStream. Many of the quantities contained in the file are quantities of 1, 2, or 4 bytes, which we will represent by U1, U2, and U4 (in fact, a .class file contains only quantities of these types and character strings…, nothing more). Table A shows how these quantities can be read in Java.


Table A: Accessing .class files. The java.io.DataInputStream and java.io.DataOutputStream classes and the following methods are used:

Type Size Method Used
U1 1 byte readUnsignedByte
U2 2 bytes readUnsignedShort
U4 4 bytes readInt

The basic structure of the .class format is shown in Figure A.

.class format structure

The obsession of these gentlemen with coffee is, to say the least, curious. Every Java binary file begins with a U4 (4 bytes) of constant value that in hexadecimal is represented by 0xCAFEBABE. It is the “magic” number that identifies the file as a Java binary file.

This number is followed by two short numbers representing the minor and major version numbers.

Next comes the constant pool. This pool is nothing more than a table of values where character strings, field, method, and class identifiers, constant values, etc. are stored. Due to its importance, we will talk about it later.

The access flags come after the constant pool. A class can have the following attributes: ACC_PUBLIC (visible to everyone), ACC_FINAL (cannot be derived, overloaded, etc.), ACC_INTERFACE (the class is an interface), and ACC_ABSTRACT.

The current class and the superclass are two indices (U2) to the constant pool. At the positions pointed to by these indices, we find the items or references to said classes. The concept of an item will be fully understood when we talk about the constant pool. For now, we can think of them as character strings with the full name of the class.

The array of implemented interfaces is an array of U2 with indices to the constant pool, where we find the items or references to said interfaces.

The array of field descriptors is a matrix of structures that store information about the fields contained in each class. Access information is stored, an index (to the constant pool) to a string with the field name, another index that points to the field’s signature (see the SIGNATURES box), and attributes. The only attribute specific to a class field is the “ConstantValue” attribute, which indicates that the field is a static numerical constant and has that numerical value associated with it.

SIGNATURES These are strings that represent a method, a field, or an array in an abbreviated form. For example, an integer matrix int[][] would have the signature “[[I”. A Thread[] vector would be signed with the string “[Ljava.lang.Thread”;

A descriptor from the array of method descriptors contains: the access flags to the method, an index (in the pool) to the method name, another index to the method signature, and some attributes. In this case, the attributes are more varied. Two types of attributes are recognized: “Code” and “Exceptions”. The “Code” attribute has an associated block of bytes with the method’s bytecodes. The “Exceptions” attribute has associated information about the exceptions that can result from the execution of the method.

Finally, there is the array of attribute descriptors. In this case, the attributes refer to the class and, in Sun’s initial specifications, it can only be a single attribute with the name “SourceFile”, which has the name of the Java source file associated with it.

All attributes have the same structure. The ATTRIBUTES box shows the different types of attributes, most of them already mentioned, explained in some detail.


ATTRIBUTES

All attributes have an identical structure. They begin with a U2, an index to the constant pool, which points to a string with the attribute name (for example, the string “Code”). This index is followed by a U4 that contains the length of the data. The content of this data is interpreted according to the type of attribute it is. Some of these types are described below:

  • “SourceFile”: The data is simply a U2, an index that points to a string with the name of the source file.
  • “ConstantValue”: A U2 index that points to a CONSTANT_Long, CONSTANT_Float, CONSTANT_Double, or CONSTANT_Integer (described in the constant pool).
  • “Code”: Two U2s with the size of the stack and the local variable space required by the method. They are followed by a U4 with the length of the method’s bytecodes and a block of U1s with the bytecodes. The code is followed by a U2 with the number of exceptions and an array of descriptors for these exceptions. An exception descriptor consists of two U2s with the start and end code address of the exception handler’s activity. It is followed by a U2 with the address in the code (the bytecodes) where the interrupt handler routine begins. Finally, the descriptor ends with an index (U2) to the constant pool where a reference to the exception is found (that is, a class derived from Throwable). Everything refers to the beginning of the code. After the exception descriptors, there is a space for additional attributes, such as “LineNumberTable” and “LocalVariableTable” for debugging information.
  • “Exceptions”: A U2 with the number of exceptions that the method can throw. An array of U2s, indices to the pool, which point to the different exceptions.
  • “LineNumberTable”: Used by debuggers to determine which part of the binary code corresponds to a given position in the source code.
  • “LocalVariableTable”: Used by debuggers to determine the value of a given local variable during the dynamic execution of the method.

The Constant Pool

The constant pool (see Table B) has a special importance in the JVM. From the .class file, most of the data structures are used to fill symbol tables and so on. The constant pool, however, is a block of data that is associated with each running class and is copied as is into memory. Its importance lies in the fact that, on many occasions, the JVM’s instructions have operands that are indices to this pool. Therefore, the operation of the JVM is linked to the constant pool itself.

In the constant pool, we can basically find three things: character strings (not String objects, just strings), references, and constant numerical values. The references can be to classes, methods, fields, interfaces, and Strings. In many cases, these references are nothing more than indices to character strings in the constant pool itself. The usefulness of the references is better understood with an example.

Imagine that in a Java program you want to read the value of one of the fields of a class. The instruction to use is getfield. This instruction has a single operand that is an index to the constant pool. When executing the instruction, a reference to the object from which we want to extract the value of the field is read from the stack. At the indicated position of the constant pool, we find a tag of the CONSTANT_Field type (see Table B). The reference or item found is consistent (we expected to find this tag and not another).

After the tag, we find two indices to the pool: the first points to a CONSTANT_Class and the second to a CONSTANT_NameAndType). From the CONSTANT_Class item, we extract the string with the class name and check that it is consistent with the object whose reference we have extracted from the stack. Finally, from the CONSTANT_NameAndType item, we get the field’s signature, which we will then search for in the class’s field table. Once the position is located, we get an offset in the instance’s data structure. From there, we will get the value of the field. This example anticipates the next article in this series (in which we will talk about execution in the JVM), but it is included because it is a good example of how the JVM works in relation to the constant pool (the example has deliberately omitted some details that accelerate subsequent accesses to the field).


Table B: Constants that we can find in the pool.

All entries in the constant pool begin with a tag (U1) that indicates the type of constant stored in that entry.

Type Description -
CONSTANT_Class After the U1 of the tag (which indicates that it is a class), comes an index (U2) to a Utf8 string with the full name of the class. This constant is a reference to a class. -
CONSTANT_Utf8, CONSTANT_Unicode It is a character string. After the tag that identifies it as a Utf8 string, come the data according to the UTF-8 format. Similar in the case of Unicode. -
CONSTANT_Fieldref, CONSTANT_Methodref, CONSTANT_InterfaceMethodref After the tag comes an index (U2) that points to a CONSTANT_Class that references the class to which the method, field, or interface belongs. A second index points to a CONSTANT_NameAndType, which stores the name and signature of the method, field, or interface. -
CONSTANT_String Represents String type objects. After the tag goes an index to a Utf8 string with which the String object will be initialized. -
CONSTANT_Integer, CONSTANT_Float Represents 4-byte constants, so after the tag there is a U4 with the corresponding value. -
CONSTANT_Long, CONSTANT_Double Represents 8-byte constants, so after the tag there are two U4s with the corresponding value. -
CONSTANT_NameAndType Represent a field or a method, but without indicating the class to which they belong. After the tag, an index (U2) points to a Utf8 with the name and another index to another Utf8 with the signature. -

JVM Instruction Set

In the JVM 1.0 specifications, we find a total of 164 instructions (although there are more bytecodes due to accelerators). These instructions cover very different areas, as we can see in Table C.

Of the entire set of instructions, a high percentage are accelerators. Accelerators are instructions that usually require two or more bytecodes and are reduced to a single one. This is possible because there are “leftover” bytecodes.

An example of an instruction and its accelerators is iload, from the area of loading local variables onto the stack (see the second area in Table C). The iload instruction has the associated bytecode 21 and is followed by another byte vindex.

This instruction loads the integer located at the vindex position of the method’s local variable space onto the stack. In many cases, a method will not have more than 3 or 4 local variables, and they are also used quite a bit. Therefore, the iload instruction has accelerator instructions iload_0, iload_1, iload_2, and iload_3, with associated bytecodes 26, 27, 28, and 29 respectively. As is easy to imagine, these accelerator instructions load the first 4 local variables onto the stack. As they only occupy 1 bytecode, the resulting code is more compact and execution is faster.

The example we have just exposed leads us to analyze the efficiency of the bytecodes. They were designed to be efficiently stored and executed. But how do they try to achieve these objectives? Let’s see some keys.

Bytecode Efficiency

It was clear from the beginning that the best way to encode Java was to use an 8-bit code. Practically all microprocessors use the byte as the minimum processing unit, so it was a good choice as it avoided subsequent processing of the bit stream. In addition, although bytecodes with fewer bits would have been desirable (to reduce the size of the code), 7 bits (128 possibilities) are not enough to encode all the JVM’s instructions.

With 256 code words and about 150 instructions, bandwidth is wasted (let’s remember that Java gained strength on the Internet, and the transmission of bytecodes had to be as efficient as possible). We can take advantage of the free code words to introduce accelerators: the most used instructions, or sets of them, can be encoded with unique bytecodes. If the JVM is interpreted (that is, it does not use a JIT compiler), we have the additional benefit that the interpretation is faster.

The wide instruction is oriented in the same line (see the fourth area of Table C). The tactic is the same as we find when looking for an optimal encoding of a data block: the most frequent should be encoded to occupy little space. On the contrary, what happens on few occasions can be encoded with long sequences.

The efficiency of the bytecodes is also sought by using highly complex CISC instructions. We already saw with an example the number of operations that the JVM performed when it executed the getfield instruction. Instructions like this are not typical of usual microprocessors.

The JVM simulates a processor specifically designed to execute object-based programs (does anyone know a micro with an instruction called invokevirtual?).

The creation of data structures as complex as arrays of objects is also reflected in JVM instructions.

A Note on Security

There are also no instructions for memory management. All memory access is transparent to the user, which provides a high level of security. When a method is running, a cracker (who has introduced bytecodes consistent with the JVM, and who will therefore pass the verification process) can only access the stack, the local variable space, arrays, and fields of other objects. The method’s code is enclosed in a space specifically dedicated to it, from which it is not possible to exit with control flow instructions. In fact, all jump instructions take the beginning of the method’s code as a reference.


Table C: JVM INSTRUCTION SET.

The instructions are separated into the following areas (the number in parentheses is the number of instructions):

  • Moving constants to the stack (11)
  • Loading local variables onto the stack (10)
  • Unloading stack onto local variables (11)
  • Index extension for loading, unloading, and incrementing (1)
  • Array handling (20)
  • Stack handling instructions (10)
  • Arithmetic instructions (24)
  • Logical instructions (12)
  • Type conversion (15)
  • Flow control (27)
  • Method return (7)
  • Jump tables (2)
  • Manipulation of object fields (4)
  • Method invocation (4)
  • Exception handling (1)
  • Instance creation and type checking (3)
  • Object monitoring (2)

The Next Article

With the next article, we will finish the series on Java Technology. We will talk about the JVM in execution (with practical examples) and about the data structures necessary to implement it. We will then be in a position to talk about security in Java, and we will. To end the series, we will explore the four possible relationships linked to the execution of methods according to the client/server model. The objective will be to show the reader how only three of them are commonly used and how Java (more than Java, its JVM) gives us the key to develop the unexplored fourth relationship (the most interesting and complex of all).


Bibliography:

The references listed below can be found in .PDF format at JavaSoft: http://www.javasoft.com/.

[1] The Java Virtual Machine Specification. August, 1995. Here you will find the Java binary format and the instruction set perfectly detailed. Do not expect to find, however, comments, justifications for what is stated, or great clarifying explanations; after all, they are technical specifications.