java.util.jar.Pack200.Packer.UNKNOWN_ATTRIBUTE
.This document specifies an archive format called "Pack200". It is optimized for applications written in the Java programming language. Such applications are usually delivered as collections of classes, sometimes with associated resource files.
This format allows any number (from one to hundreds of thousands) of Java classes to be encoded by a compressor, transmitted compactly in a single block of bytes, and decoded by a decompressor into equivalent Java class files. Because it can also represent class resources and other "side files", it can serve as an alternative to the JAR archive for some deployment tasks, notably downloading Java applications.
The Pack200 format can decrease the size of a Java application by a factor of seven to nine, compared with an equivalent JAR containing uncompressed ("stored") class files. By contrast, using the zip DEFLATE algorithm integral to JAR and ZIP archives gains a factor of two. The undocumented "crunch" mechanism, used to deploy JDK downloads in the past, gains a corresponding factor of five to six. Note that all these figures assume and incorporate the effects of a post-pass with DEFLATE or a similar off-the-shelf compression algorithm. (See DEFLATE Compressed Data Format Specification for more information.)
The main motivation for this format is to decrease disk and bandwidth requirements for Java application packaging, transmission, and delivery. An earlier version has been used to package the downloads for Java SE 1.4.1 and 1.4.2 (code names "Hopper" and "Mantis").
This format is not intended for fast loading into a virtual machine, and does not attempt to improve start-up speed or memory footprint in running Java applications. The heavy engineering requirements on a directly loadable file format would make optimal compression impossible. The decompressor of an aggressively compressed file will have a complex job to do, and must be allowed memory and CPU time to do this job. We assume that Pack200 archives will be unpacked on the same classes of machine that run Java SE and Java EE applications.
This format is batch-oriented, optimized for packaging and transmission of Java classes. It does not support random access of individually stored classes. In order to emphasize the sequential nature of this archive format, we will use the verb transmit, rather than store, to refer to the formatting of data produced by a compressor and accepted by a decompressor.
This format does not attempt to duplicate work performed by DEFLATE or other byte-oriented compression algorithms. Typically, tools using Pack200 will further compress archives by storing them in ZIP files or using some other compression technique. The presence of such a post-pass compressor is an assumption made by the design of Pack200.
This specification is complex, and may seen to some readers needlessly complex. The design decisions reflected here have been motivated by extensive testing and experimentation with actual Java class files found in real products. An attempt to remove complexity from this specification is likely also to remove measurably significant compression efficiency.
A Pack200 archive, like a JAR file, carries images of class files and other ("resource") files arranged in a hierarchical directory structure.
No special compression or transformation is performed on any file other than class files, other than (possibly) reordering them by file type. The post-pass compressor is assumed to provide adequate compression on those spans of the archive which carry images of non-class files.
Classes are represented in a form which suppresses their individual constant pools, in favor of a large constant pool that serves the entire archive. Thus, when a class file is extracted from a Pack200 archive, a new constant pool will be created for it, and all constant pool references adjusted. This does not change the semantics of the class, but it will generally modify the bitwise image of the file, and perhaps even its size, as unused constant pool entries are deleted.
Note that every class file must be fully parsed by the compressor, so that all constant pool indexes may be found and (later) renumbered. This requirement applies to any class, field, method, or code attribute which refers to the constant. The Pack200 format supports a moderate range of attributes. A modest variety of new attribute layouts may be declared to the compressor, and the Pack200 format provides a place to transmit such layouts for use by decompressors. See the section Attribute Layout Definitions for more information.
The archive file consists of a short archive header, followed by a number of independent file sections called bands. (There are about 100 of them; this number can vary.) Each band transmits an array of small integers, called the elements of the band. After the archive header, the archive consists only of bands.
Logically, a band is an implicitly sized array of 32-bit unsigned integers. However, it is rare for a band to physically require more than one or two bytes per element, since element encodings are chosen to transmit the actual band values more compactly.
A band has no fixed header, not even an indication of its size. The count of band elements is deduced by the decompressor from the contents of previous bands, or (ultimately) from the archive header.
The elements of any given band have a common meaning and role. For example, the names of all transmitted classes are in a single band, while all the class field counts are in a different band. Each band is transmitted as a contiguous segment of bytes within the archive. This contiguity is the main reason that a Pack200 archive, while compact to begin with, is very compressible by utilities like zip.
In some bands, each element helps describe one object. For example, each class is associated with a count of fields, which is given in the archive as a corresponding value in the class_field_count
band. In other bands, one object may be described by a run of zero or more elements in a band. For example, the interfaces implemented by a class are given in the archive as a corresponding run of zero or more values in the class_interface
band; each such value is an index referring to a single class.
A very few bands (less than 10) contain non-homogeneous bytes, such as the images of resource files. These few are called byte bands. Many of the bands contain references into the constant pool. Some contain access modifier flags and related bits. One band, called the char band, contains string characters in a specially chosen encoding called "CHAR3" (somewhat akin to UTF8, but not identical). The rest of the bands transmit integers with a variety of other interpretations.
One of the integer bands (cp_Utf8_big_chars) carries the characters of a single CONSTANT_Utf8 string selected for special treatment. Uniquely, this band is repeated zero or more times, depending on how many strings are selected for this special treatment. (These specially-transmitted "big strings" are explained in the section Utf8 Constants.)
Most bands have an easily understood function, such as transmitting the number of methods in a class, or the name of a field, or the operand of a "getfield" bytecode instruction.
Except for the special case of byte bands, a band is never considered as a sized span of bytes, but rather as a counted series of elements which are encoded integers. Since integer encodings are typically variable-sized (when regarded as byte sequences), there is no firm rule for deriving a band's byte size from its element count. Indeed, finding the end of a band requires that it be parsed byte-by-byte.
Each band uses one or more coding tactics to encode its elements as sequences of bytes. (The post-pass compressor is expected to turn these bytes into bit sequences, but its actions are not specified by this document.) The Pack200 compressor is free to choose and vary encoding tactics used by a band. The encoding tactics used in one band are independent of those used in other bands. The space of supported encoding tactics is an important part of the Pack200 specification. The present section introduces these encodings, which are defined in detail in the section Specification of Band Coding.
As one might expect, byte bands (which encode bytewise data such as the resource file images) encode their integers as unsigned 8-bit bytes. This encoding is named BYTE1 in this specification. (Compare the type "u1" in the class file definition.)
Other bands have values with a much larger dynamic range, including (in a few cases) negative numbers, and/or values all the way up to the 32-bit unsigned maximum. Most of these encodings are variable-length, in the expectation that the typical band element will be relatively small in magnitude, even though some elements may be large and require more bytes to represent. Some bands that are expected to exhibit strong correlations in their element sequences are encoded as successive differences (delta encoding) rather than absolute numeric values.
Each band is associated with a primary encoding which the compressor and decompressor agree to use when transmitting elements of that band. For any band after the end of the segment header, except for the byte bands, the compressor can optionally specify a secondary encoding to use instead of the primary encoding. In essence, the band's encoding defaults to the primary, unless there is an explicitly declared secondary. This allows the Pack200 format to adapt more closely to the actual statistics of band elements.
For example, most bands, such as those holding counts and sizes, have a primary encoding called UNSIGNED5. This is a general-purpose unsigned encoding which represents values in the range [0..191] as a single byte, and scales up to a maximum size of five bytes for numbers larger than about fifty million. However, if a band contains only numbers in the range [0,255], the BYTE1 encoding is more compact, and the compressor is allowed to instruct the decompressor to use this instead.
When the compressor specifies a secondary encoding, it must emit an optional band coding specifier, which is described in the section Meta-Coding. A customized encoding method often saves many bytes, if it matches more precisely the actual dynamic range of the band's element values.
The Pack200 format is based on a schema of integer encoding systems, parameterized by four numbers (B,H,S,D). From this infinite set, about 10 selected encodings serve as primary encodings for bands, and about 100 more serve as optional encodings. The system of (B,H,S,D) encodings is explained at length in the section Specification of Band Coding. For the present purposes, a brief introduction will suffice.
The parameter B (1<=B<=5) is the maximum length in bytes of the encoding of a single integer. Less significant bits are always encoded in earlier bytes.
The parameter H (1<=H<=256) is the radix of the encoding. it also determines the conditions under which the encoding's byte sequences terminate. The co-parameter L (0<=L<=255) is defined as (256-H). Encoded byte sequences are allowed to contain only one byte with a value less than L, and that byte must be the last in the sequence. Thus, larger H values make for longer average encoding lengths. If H is 256 and L is zero, then the encoding method is fixed-length, because all encodings must be exactly B bytes long.
The parameter S (0<=S<=2) determines whether and how the encoding represents signed numbers. (More precisely, since band elements are conventionally regarded as unsigned 32-bit integers, S determines the coding of band elements larger than the largest 31-bit unsigned number, 2147483647. But we shall continue to refer to such numbers as negative, where the distinction is irrelevant.) S denotes the number of least-significant bits which serve as a sign bit. If S is zero, the numbers are unsigned. If S is one, the LSB of the unsigned number is exclusive-ored into the right-shifted remainder of the unsigned number to produce a corresponding signed number. If S is more than one, the signed number produced is negative only if all S least-significant bits are set, and otherwise these low bits contribute to the positive magnitude of the integer. This representation is efficient for bands containing mostly-positive numbers, such as bytecode branch offsets. The interpretation of sign bits is more precisely described in the section Encoding of Signed Integers.
The parameter D (0<=D<=1) determines whether the band transmits its data via successive differences (i.e., delta encoding). In writing these encodings, the parameters S and D may be omitted if both are zero, and D may be omitted if zero.
Thus, the encoding (1,256,0,0), also written as (1,256), represents numbers as unsigned bytes, while (4,256) represents numbers as unsigned little-endian 32-bit integers. The encoding (1,256,1) maps single bytes to numbers in the range [-128,127]. The encoding (1,256,1,1) expresses any sequence of numbers whose successive differences are in the range [-128,127] (and whose first number is in that range).
The very common encoding UNSIGNED5 can represent the full unsigned range of 32-bit integers. A byte sequence in this encoding consists either of four bytes less than 192 followed by an arbitrary byte, or else from zero to three bytes less than 192 followed by a byte greater than or equal to 192. The unsigned value is formed by scaling each succeeding byte by successive powers of the radix 64, and adding all the scaled byte values together.
value | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 |
---|---|---|---|---|---|
1 | 1 | ||||
191 | 191 | ||||
192 | 192 | 0 | |||
193 | 193 | 0 | |||
255 | 255 | 0 | |||
256 | 192 | 1 | |||
512 | 192 | 5 | |||
1024 | 192 | 13 | |||
2048 | 192 | 29 | |||
12479 | 255 | 191 | |||
12480 | 192 | 192 | 0 | ||
798911 | 255 | 255 | 191 | ||
798912 | 192 | 192 | 192 | 0 | |
51130559 | 255 | 255 | 255 | 191 | |
51130560 | 192 | 192 | 192 | 192 | 0 |
0xFFFFFFFF | 255 | 252 | 252 | 252 | 252 |
Here is the set of primary encodings used by at least one band:
Type | Name | Purpose |
---|---|---|
(1,256) | BYTE1 | bytes |
(3,128) | CHAR3 | Java characters |
(5,4) | BCI5 | bytecode positions |
(5,4,2) | BRANCH5 | bytecode branch offsets |
(5,64) | UNSIGNED5 | general unsigned ints |
(5,64,1) | SIGNED5 | general signed ints |
(5,64,0,1) | UDELTA5 | monotonic sequences |
(5,64,1,1) | DELTA5 | autocorrelated sequences |
(5,64,2,1) | MDELTA5 | mostly monotonic sequences |
The file as a whole consists of about 200 bands in four major groups. After a header, the constant pool contents come first, followed by most of the class schema, followed by a consolidation of all attribute information (for classes, fields, methods, codes, and the archive as a whole), and ending with bytecodes for all methods.
Rather than describe the bands in a long and dull sequence, we use a simple, non-recursive grammar to present them in an organized fashion. This grammar roughly parallels the grammar of a class file:
pack200_segment:
segment_header
*band_headers :BYTE1
cp_bands
attr_definition_bands
ic_bands
class_bands
bc_bands
file_bands
The terminals of this grammar are scalars and bands. To make the terminals easier to recognize, scalar names begin with pound sign "#" and band names begin with a star "*". A scalar is an encoded integer (or a small fixed number of them) which appears within the header of the archive file. When a scalar is mentioned in the grammar, it is followed by a colon and the encoding and bracketed count of the scalar value(s). Whenever a band is mentioned, it is followed by a colon and its primary encoding, followed by a comment in square brackets indicating its length. If the band encodes references into some other data structure, such as a constant pool, that structure is mentioned as a comment in parentheses. If the references are allowed to be null, that fact is mentioned also.
Here is an example:
example_non_terminal:
other_non_terminal
#single_scalar_integer :UNSIGNED5[1]
#four_byte_scalar :BYTE1[4]
*band_of_integers :UNSIGNED5 [#integer_count]
*band_of_method_references :UNSIGNED5 [SUM(*ref_counts)] (cp_Method)
*band_of_strings_and_nulls :UNSIGNED5 [...] (null or cp_String)
Many of the band lengths are simply references to previous scalars (such as [#class_count]
) or sums of previous bands which transmit counts (e.g., [SUM(*class_method_count)]
). These bracketed lengths must be regarded as comments in the grammar, since the text of the specification defines band lengths verbally. Band counts which cannot be briefly summarized as sometimes specified in a partial expression which contains an ellipsis.
Occasionally we use additional grammatical notations. We use the ordinary regular expression operators (X | Y)
means a match for either X or Y, (X)*
means any number of matches for X, (X)+
means one or more matches for X, and (X)?
means either a match for X or nothing. In one case described in the section Utf8 Constants, where a band may have several instances, the expression (band1)\ **\
length(band2) means that there are as many instances of band1 as there are elements in band2. Likewise, in three cases, the expression (band1)\ **\
#scalar indicates that there are zero or one instances of band1 according to the value (zero or one, respectively) of the boolean scalar value.
The entire Pack200 archive format may be repeated one or more times in a transmission to the decompressor. In this case, each complete set of transmitted bands bands is viewed as a segment of the whole transmission. The decompressor is required to accept multiple segments, and to process each segment independently in order. The output files resulting from each segment are accumulated in order. If the decompressor is producing a JAR file, multiple segments must be accumulated into one output JAR file, whose elements are made up of the respective elements transmitted in each segment, in order.
pack200_archive:
(pack200_segment)+
Note that each segment begins anew with the Pack200 magic number and version numbers.
(With one caveat, this means that Pack200 archives may be concatenated with the natural meaning of combining the archives they represent. As will be seen below, if the #archive_size
field is not present in each segment header, it must be inserted before another segment is appended, so that the decompressor can find the end of each segment. The segment header format is such that this adjustment can be made easily.)
When combined with a streaming transport layer, segmentation provides a way for a compressor to decrease latency or to decrease the decompressor's memory requirements, at a cost in compressor efficiency.
Segmentation can also help solve a scaling problem with very large archives (of many tens of megabytes), in which the width of indexes into the combined global constant pools can grow beyond the benefits of global constant sharing. By starting a new segment, the compressor resets the decompressor's constant pools, clearing out useless constants, at the cost of restating constants that are still in use. (The trade-offs are similar in flavor to the design of copying garbage collectors.)
The header consists of a magic number and version information, in a format exactly parallel to the Java class file format. However, besides the magic number, there are no other big-endian integers in the Pack200 format. All integer encodings in bands present arithmetically less-significant bits first.
Only the archive magic numbers (the first four bytes of the segment header) are fixed in size. All other archive structures are variable in size and must therefore be parsed sequentially. Generally speaking, decompressors must perform two passes over each segment of the archive, one to size and parse the bands, and one to sequentially extract information from the bands into each JAR element in the output.
segment_header:
archive_magic archive_header
archive_magic:
#archive_magic_word :BYTE1[4]
archive_header:
#archive_minver :UNSIGNED5[1]
#archive_majver :UNSIGNED5[1]
#archive_options :UNSIGNED5[1]
(archive_file_counts) ** (#have_file_headers)
(archive_special_counts) ** (#have_special_formats)
cp_counts
class_counts
archive_file_counts:
#archive_size_hi :UNSIGNED5[1]
#archive_size_lo :UNSIGNED5[1]
#archive_next_count :UNSIGNED5[1]
#archive_modtime :UNSIGNED5[1]
#file_count :UNSIGNED5[1]
The archive_magic_word
consists of the four bytes 0xCA, 0xFE, 0xD0, 0x0D. The #archive_minver
must be the number 1. The #archive_majver
must be the number 170. Both of the latter two values may be incremented in the future to reflect small revisions in this file format. Note that in previous versions of this standard, the minor and major version numbers were 7 and 150, or 1 and 160.
The header also contains initial counts of constant pool entries and other "top-level" entities. All of these counts are given in UNSIGNED5 format. Some of these counts are conditionally present, controlled by bits in the #archive_options
word. The rule for a missing header value (and for missing values in general unless otherwise specified) is that the decompressor must behave as if it had received an explicit zero value.
The #archive_options
word is interpreted bitwise. Certain bits are given symbolic names as follows, where the LSB is numbered as bit zero:
Bit | Name |
Meaning if set in #archive_options
|
---|---|---|
0 |
have_special_formats
|
archive_special_counts contains counts
|
1 |
have_cp_numbers
|
cp_number_counts contains counts
|
2 |
have_all_code_flags
|
code_flags_lo contains an element for every code
|
3 |
have_cp_extra_counts
|
cp_extra_counts contains counts
|
4 |
have_file_headers
|
archive_file_counts contains counts
|
5 |
deflate_hint
|
request compressed JAR file for all elements |
6 |
have_file_modtime
|
file_modtime contains modtimes
|
7 |
have_file_options
|
file_options contains option bits
|
8 |
have_file_size_hi
|
file_size_hi contains high-order size words
|
9 |
have_class_flags_hi
|
class_flags_hi contains extra attribute flags
|
10 |
have_field_flags_hi
|
field_flags_hi contains extra attribute flags
|
11 |
have_method_flags_hi
|
method_flags_hi contains extra attribute flags
|
12 |
have_code_flags_hi
|
code_flags_hi contains extra attribute flags
|
13 | (unused, must be zero) | |
... | (unused, must be zero) | |
31 | (unused, must be zero) |
If have_special_formats
(the LSB) is set, the band archive_special_counts
will have two elements, as specified in the grammar. Otherwise this band will have zero elements. Similarly, if have_cp_numbers
is set, the band cp_number_counts
will have four elements, as specified in the grammar. Otherwise this band will have zero elements. If have_all_code_flags
is set, the band code_flags
must contain one element for every Code
attribute. Otherwise this band may have fewer elements. (This option is useful when Code
attributes are rich in sub-attributes, perhaps because the JAR contains large amounts of debugging information.)
If have_cp_extra_counts
is set, the band cp_extra_counts
will contain four elements, as specified in the grammar. Otherwise this band will have zero elements. (The flag may only be set when #archive_majver
is 170 or higher. These extra counts correspond to additional constant pool types introduced in recent revisions of the classfile format.)
If have_file_headers
is set, the band archive_file_counts
will have five elements, as specified in the grammar. Otherwise this band will have zero elements. If deflate_hint
is set, the decompressor is requested (but not required) to reduce the size of its output. For example, if it produces a JAR file, it may deflate the JAR elements. If the option have_file_modtime
(or have_file_options
, have_file_size_hi
, have_class_flags_hi
, have_field_flags_hi
, have_method_flags_hi
, have_code_flags_hi
, respectively) is set, the corresponding band file_modtime
(or file_options
, file_size_hi
, class_flags_hi
, field_flags_hi
, method_flags_hi
, code_flags_hi
, respectively) may be non-empty, as specified below. Otherwise that band will have zero elements. Other bits in the archive options must be zero and are reserved for future use.
Immediately after the #archive_options
word is a pair of 32-bit numbers which together form a 64-bit length that helps decompressors easily buffer incoming data up to the end of the archive segment without reading beyond the segment. The value #archive_size
is the unsigned 64-bit value composed of the unsigned 32-bit words #archive_size_lo
and #archive_size_hi
, where the former value is the low-order 32-bit word and the latter value is the high-order 32-bit word. The value #archive_size
is either zero or declares the number of bytes in the archive segment, starting immediately after #archive_size_lo
and before #archive_next_count
and ending with the last band, the *file_bits
band. (That is, a non-zero size includes the size of #archive_next_count
, *file_bits
, and everything in between.) This value is redundant, but if non-zero must be correctly supplied by any compressor. If the archive segment is transmitted as part of a stream, and if other data (such as additional segments) follow the archive, the #archive_size
value must not be zero. Although it may be completely ignored by the decompressor, this value may help some decompressors to buffer their inputs more efficiently.
Immediately after #archive_size
is another value #archive_next_count
which estimates the number of archive segments immediately following the current one. This number need not be correct, and may always be zero. It is intended to provide decompressors with a hint as to the amount of remaining decompression work to do. (Such hints are routinely displayed in progress bars.)
If the #archive_modtime
value is non-zero, the decompressor is requested (but not required) to adjust the system-specific modification time for its output. For example, if the decompressor produces a JAR file, it may set the modification date of each JAR element, or of the JAR file itself, to that date, in the absence of any more specific directive, such a non-zero file_modtime
value. The #archive_modtime
value is interpreted as the number of seconds since the epoch used by System.currentTimeMillis
, which is 1/1/1970, 00:00:00 GMT. However, the special value zero is reserved to indicate the absence of any modification time for the archive as a whole. A decompressor may supply an arbitrary value in place of a missing archive modification time. If this is done, and if file_modtime
values are present, those values are interpreted relative to modification time supplied for #archive_modtime
by the compressor.
The #file_count
value gives the number of files which are described in detail by the archive. Note that a class file which is simple enough (as described below) does not need to be described as a file, because it is enough that the class itself is transmitted. Therefore, #file_count
may be less than the number of transmitted classes. (This latter number is called #class_count
.) On the other hand, transmitted files do not need to contain classes, so that #file_count
can also be greater than the number of classes.
The archive file contains up to sixteen sets of constants called "constant pools". (These structures are analogous, but not identical, to a similarly named structure in class files.) These constant pools are described in detail in the next section.
The cardinality of each constant pool is given in UNSIGNED5 format in elements of the cp_counts
structure of the archive header:
cp_counts:
#cp_Utf8_count :UNSIGNED5[1]
(cp_number_counts) ** (#have_cp_numbers)
#cp_String_count :UNSIGNED5[1]
#cp_Class_count :UNSIGNED5[1]
#cp_Signature_count :UNSIGNED5[1]
#cp_Descr_count :UNSIGNED5[1]
#cp_Field_count :UNSIGNED5[1]
#cp_Method_count :UNSIGNED5[1]
#cp_Imethod_count :UNSIGNED5[1]
(cp_extra_counts) ** (#have_cp_extra_counts)
cp_number_counts:
#cp_Int_count :UNSIGNED5[1]
#cp_Float_count :UNSIGNED5[1]
#cp_Long_count :UNSIGNED5[1]
#cp_Double_count :UNSIGNED5[1]
cp_extra_counts:
#cp_MethodHandle_count :UNSIGNED5[1]
#cp_MethodType_count :UNSIGNED5[1]
#cp_BootstrapMethod_count :UNSIGNED5[1]
#cp_InvokeDynamic_count :UNSIGNED5[1]
archive_special_counts:
#band_headers_size :UNSIGNED5[1]
#attr_definition_count :UNSIGNED5[1]
class_counts:
#ic_count :UNSIGNED5[1]
#default_class_minver :UNSIGNED5[1]
#default_class_majver :UNSIGNED5[1]
#class_count :UNSIGNED5[1]
The order in which these counts occur parallels the order in which the constant pools themselves are transmitted in the archive. This order is called the definition order of the constant pools.
The four numeric constant pools are sized by cp_number_counts
. They specify numbers used occasionally by ldc
bytecodes. Because a minority of classes actually use such bytecodes, the sizes are optional, under the control of #have_cp_numbers
.
Likewise, the last four constant pools are sized by cp_extra_counts
. They specify linkage information for the invokedynamic
instruction, and certain types of constants (method handles and method types). As with the numeric entries, these sizes are optional, under the control of #have_cp_extra_counts
.
Every class file has a header that includes magic number and a minor and major version numbers. The magic number (0xCAFEBABE), being a fixed constant, is not transmitted in the Pack200 archive. However, the version numbers, which can vary somewhat, must be recorded.
In the expectation that particular version numbers will be prevalent, the archive header transmits default major and minor version numbers. Both of these numbers are given in UNSIGNED5 format.
Individual classes in the archive (as described in the section Attribute Bands) may optionally specify their own version numbers in a pseudo-attribute which overrides the #default_class_minver
and #default_class_majver
given in the archive header.
The count #band_headers_size
gives the size in bytes of band_headers
. The format of these bytes, which help define band coding specifiers for secondary codings, will be discussed much later, in the section Meta-Coding.
The archive header also specifies the number of attribute types (#attr_definition_count
), class definitions (#class_count
), nested class declarations (#ic_count
). These numbers are used to size various other bands, as noted in the definition of those bands.
The arithmetic sum of all numbers in cp_counts
must be less than the value 536870912
(2^29
). A compressor is forbidden to transmit an archive for which the sum reaches or exceeds this limit. This constraint is intended to allow decompressors to use, internally, a unified numbering of constants, even if certain constants (such as demangled inner class names or implicit SourceFile
names) must be added on the fly during decompression to the constant pool.
Note on implementation: The archive_header
directly contains 3 numbers, while cp_counts
contains at least 8 numbers and class_counts
contains exactly 4. Therefore, the minimum size of the archive_header
is 15 bytes, and this is always preceded by 4 bytes of archive_magic
. A decompressor which wished to avoid any spurious read-ahead could read and buffer an initial block of 19 bytes, which would be certain to contain the #archive_size
fields. (This assumes that the version numbers are small, as they are.) It could then immediately parse enough information to determine how many additional bytes of buffer storage would be required to scan the rest of the archive, at least up to the resource file images. By the time the resource file images in *file_bits
must be parsed, the decompressor will have already read the *file_size
bands, and will be able to remove any uncertainty originally present in #archive_size
. (If the #archive_size
fields are zero, the decompressor may be forced to read blindly up to the end of all data on the input channel in order to parse the archive.)
The format defines sixteen independent constant pools, eleven of which correspond directly to the eleven basic types of constants found in Java class files. In the Pack200 archive format, they are organized in a fixed logical ordering called their definition order. This order determines their presentation in the archive's band structure, and also the construction of certain constant numberings.
The constant pools consolidate the information in the constant pools of all input class files. Each constant pool contains a single type of constant value. Each of the eleven constant pool types found in Java 6 class files is placed in its own constant pool. A twelfth pool holds signatures, which in class files are UTF8 strings that represent method and field types, but in Pack200 archives are a separately compressed data type.
Three more constant pools hold constants defined as part of the Java 7 classfile format (major version 51 or later). One more constant pool holds constants to be assembled into the BootstrapMethods
attribute, which may be viewed as an appendix to the classfile constant pool.
The following table gives the sixteen predefined constant pools in their definition order. It also defines their correspondence with the constant types found in the Java class file format.
Name | class file element | class file tag | Purpose |
---|---|---|---|
cp_Utf8 | CONSTANT_Utf8_info | CONSTANT_Utf8 | basic string data |
cp_Int | CONSTANT_Integer_info | CONSTANT_Integer | int constant |
cp_Float | CONSTANT_Float_info | CONSTANT_Float | float constant |
cp_Long | CONSTANT_Long_info | CONSTANT_Long | long constant |
cp_Double | CONSTANT_Double_info | CONSTANT_Double | double constant |
cp_String | CONSTANT_String_info | CONSTANT_String | String constant |
cp_Class | CONSTANT_Class_info | CONSTANT_Class | class reference |
cp_Signature | (see below) | (none) | method, field, or variable type |
cp_Descr | CONSTANT_NameAndType_info | CONSTANT_NameAndType | pair of (name, type) |
cp_Field | CONSTANT_Fieldref_info | CONSTANT_Fieldref | field reference |
cp_Method | CONSTANT_Methodref_info | CONSTANT_Methodref | method call |
cp_Imethod | CONSTANT_InterfaceMethodref_info | CONSTANT_InterfaceMethodref | interface call |
cp_MethodHandle | CONSTANT_MethodHandle_info | CONSTANT_MethodHandle | method handle constant |
cp_MethodType | CONSTANT_MethodType_info | CONSTANT_MethodType | method type constant |
cp_BootstrapMethod | BootstrapMethods_attribute.bootstrap_methods[i] | (none; side table to constant pool) | bootstrap method specifier |
cp_InvokeDynamic | CONSTANT_InvokeDynamic_info | CONSTANT_InvokeDynamic | invokedynamic call |
(Note that the definition order for these constant pools is similar to the numeric ordering of the corresponding class file tag. For example, CONSTANT_Utf8 is the first tag, and cp_Utf8 is the first constant pool in the definition order. However, there are differences in detail. In particular, note that cp_String comes before cp_Class, even though the corresponding class file tags are in the reverse order.)
Each constant pool is logically a set of symbolic or numeric values, all of one type. In the Pack200 archive, it is structured as a sequence of such values, each with a unique index within the sequence. Elsewhere in the archive, whenever a constant must be mentioned, its index within the appropriate constant pool (or within a subset of a pool, or within a group of pools) is given.
Unlike the class file format, constant pool indexing in the Pack200 archive format starts at zero. In the rare places where null references are expected, the possibility is documented, and the null references themselves are encoded by a zero index, while all other index values are incremented by one. Where null references are not expected, they may still be encoded by the 32-bit index value -1.
Also unlike class files, there are no "gaps" or unused indexes associated with long or double values. Occasionally, we will refer formally to an element of a constant pool by using array notation. For example, the first three integers in the cp_Int
constant pool are cp_Int[0]
, cp_Int[1]
, and cp_Int[2]
, and the last integer is cp_Int[cp_Int_count-1]
. Note that both bands and constant pools are treated in this specification as arrays with zero origin.
It is illegal for a compressor to transmit the same constant twice. That is, all constant pool entries must be unique.
In a class file, with a few exceptions, constant pool references are strongly typed, in that every reference either is null or refers to a constant pool entry of a fixed tag associated with the context of the reference. The exceptions are ConstantValue
attributes and operand fields of the ldc
, ldc_w
, and ldc2_w
instructions.
However, because constant pools in a Pack200 archive are separately indexed, all constant references must be strongly typed, so that there is only one constant pool (or subset of a pool) into which any given index applies. This is accomplished either by deducing the constant type from context (in the case of ConstantValue
attributes) or adding extra information to the context of the reference. (Within the bytecode stream, ldc
is split out by constant type into distinct instructions sldc
, ildc
, etc.)
The layout of constant pool bands in the archive follows the definition order of the constant pools themselves. Each constant pool is represented in its turn by a sequence of one or more bands. The sequence of constant pool bands is therefore structured as follows:
cp_bands:
cp_Utf8
*cp_Int :UDELTA5 [#cp_Int_count]
*cp_Float :UDELTA5 [#cp_Float_count]
cp_Long
cp_Double
*cp_String :UDELTA5 [#cp_String_count] (cp_Utf8)
*cp_Class :UDELTA5 [#cp_Class_count] (cp_Utf8)
cp_Signature
cp_Descr
cp_Field
cp_Method
cp_Imethod
cp_MethodHandle
*cp_MethodType :UDELTA5 [#cp_MethodType_count] (cp_Signature)
cp_BootstrapMethod
cp_InvokeDynamic
As can be seen here, constant pools whose entries can be derived from a single 32-bit integer or a single reference are represented as single bands. The other constant pools are represented as groups of bands. Each constant pool gives its name to a corresponding grammar element. The production of cp_bands
declares the order of each band or group of bands that transmits the constants in the eponymous constant pool.
Note the use of UDELTA5 as primary encodings for several bands. Although the Pack200 file format does not require any particular ordering of values in constant pools, it is generally advantageous to order them so that the encoded values in bands using UDELTA5 are monotonically increasing. (Negative deltas can be encoded, although expensively, by UDELTA5. Also, a signed secondary encoding may be chosen by the compressor instead of UDELTA5.) The choice of primary encodings in this specification reflects an expectation that high-quality compressors, when presented with a choice of output orderings, will make choices that result in good use of primary band encodings.
In a few cases, several of the sixteen constant pools are combined into a group, so that indexes can be formed to refer to elements of the group as a whole. Such a group is completely defined by its constituent constant pools. An index into a constant pool group is always formed consistently with the overall order of the constituent constant pools, as well as with the internal order of each pool. For example, if a group cp_ABC
is formed of three pools cp_A
, cp_B
, and cp_C
of sizes COUNT(cp_A)
, COUNT(cp_B)
, and COUNT(cp_C)
, respectively, then a group index selects from one of the three pools depending on whether it is in the range 0 .. COUNT(cp_A)-1
, COUNT(cp_A) .. COUNT(cp_A)+COUNT(cp_B)-1
, or COUNT(cp_A)+COUNT(cp_B) .. COUNT(cp_ABC)-1
, respectively. The size of the pool group is of course the sum of the constituent pools.
In some other cases, subsets of constant pools are indexed in an abbreviated fashion. An index into a constant pool subset is always formed consistently with the order of the pool itself. For example, the index 0
always refers to the first element of the subset, the index 1
refers to the second, and so on.
Here are definitions of the pool groups used in this specification.
Name | Constituents | Purpose |
---|---|---|
cp_All | (all sixteen pools) | generic references |
cp_LoadableValue | cp_Int, cp_Float, cp_Long, cp_Double, cp_String, cp_Class, cp_MethodHandle, cp_MethodType |
qldc operand, bootstrap method argument
|
cp_AnyMember | cp_Field, cp_Method, cp_Imethod |
get or invoke operand, method handle component
|
The group cp_All
comprises the sequence of all constants, in their natural order of occurrence within the Pack200 archive. Thus, an index into cp_All
is in effect an untyped reference to any constant.
The encoding of the cp_Utf8
constant pool will be described shortly. First we will describe the coding of the numeric, string, and class constant pools.
The values in the cp_Int
constant pool are directly represented by the values encoded in its band. The values in the cp_Float
constant pool are obtained as if by applying java.lang.Float.intBitsToFloat
to the 32-bit values encoded in its band.
The 64-bit values of the cp_Long
band are obtained by adjoining, as high and low words, corresponding 32-bit values from the two bands cp_Long_hi
and cp_Long_lo
. (They contain the high and low words, respectively.) Likewise, the values of the cp_Double
band are obtained by first adjoining high and low words from two other bands, and then retyping the resulting 64-bit integers as if by applying java.lang.Double.longBitsToDouble
. The high and low words are in the bands cp_Double_hi
and cp_Double_lo
, respectively.
cp_Long:
*cp_Long_hi :UDELTA5 [#cp_Long_count]
*cp_Long_lo :DELTA5 [#cp_Long_count]
cp_Double:
*cp_Double_hi :UDELTA5 [#cp_Double_count]
*cp_Double_lo :DELTA5 [#cp_Double_count]
When floating point values are eventually written to class files, their bits must be accurately stored, as if they were processed by java.lang.Float.floatToRawIntBits
or java.lang.Double.doubleToRawLongBits
. In this way "NaN" values are transmitted faithfully, without normalization.
(Note: Although it would be possible to extend the band-coding techniques of Pack200 to encode 64-bit values directly, this file format always transmits 64-bit values as pairs of 32-bit values, transmitted in pairs of bands. This simplifies implementations, allowing them to process band data with 32-bit data paths.)
Each string in a cp_String
constant pool is represented in its band by a reference to the string's spelling, as a cp_Utf8
constant.
Likewise, each class in a cp_Class
constant pool is represented in its band by a reference to the class's spelling, as a cp_Utf8
constant. (As in the class file and the VM, the spelling uses the slash character to delimit package prefix components.)
Each value in the cp_Utf8
constant pool is an array of zero or more 16-bit Java characters. (The name "Utf8" is an anachronism, referring only to the basic character string type in Java class files. No part of the Pack200 file format uses any UTF encoding, but the term "Utf8" is retained to emphasize the connection with class files.) As in the class file format, all character string data is transmitted in this one kind of constant pool entry. When the context is clear enough, we loosely refer to these arrays as "strings", even though they are distinct from the Java constants in the cp_String constant pool.
If cp_Utf8
is not empty, its first value is always the empty string. Thus, the empty string is always given an index of zero. No band values are transmitted to represent this empty string. (Values in the bands cp_Utf8_prefix
and cp_Utf8_suffix
pertain to strings after the empty string in cp_Utf8
.) Because constants cannot be duplicated, all other elements of cp_Utf8
contain at least one character.
Each string in the cp_Utf8
constant pool is represented in the archive file in two parts, a prefix and a suffix. The suffix is represented both by a count and a sequence of character codes. The prefix is represented only by a count; the characters of the prefix are duplicated from the previous string. This technique allows the compressor (if it chooses) to represent strings by means of their successive differences. The first suffix value and the first two prefix values are suppressed in the transmitted archive. The values, if transmitted, would always be zero, since the first cp_Utf8
string is always empty, and the second string can never share a non-empty prefix with the first string.
Each suffix is transmitted in exactly one of two ways, as a small suffix or a big suffix. Small suffixes are transmitted contiguously in one large band. Big suffixes are rare; they are intended as a way to compactly transmit strings with unusual statistics, such as strings which are really tables of numeric data. Each big suffix is transmitted in its own band. This allows the compressor the option to choose an efficiently customized secondary coding for the characters in each large, unusual string.
cp_Utf8:
*cp_Utf8_prefix :DELTA5 [MAX(0,#cp_Utf8_count-2)]
*cp_Utf8_suffix :UNSIGNED5 [MAX(0,#cp_Utf8_count-1)]
*cp_Utf8_chars :CHAR3 [SUM( *cp_Utf8_suffix )]
*cp_Utf8_big_suffix :DELTA5 [COUNT(0, *cp_Utf8_suffix )]
(*cp_Utf8_big_chars :DELTA5)
** length(cp_Utf8_big_suffix) [SUM( *cp_Utf8_big_suffix )]
(Note: It is helpful, but not required by this specification, to order the strings lexicographically and find maximal common prefixes between adjacent strings. It is permissible to ignore the prefix sharing feature and declare all prefixes to be zero.)
The band cp_Utf8_suffix
contains one less than #cp_Utf8_count
elements. It specifies, in order, the small suffix length of each constant pool string (except the implicit first string, which is empty). The band cp_Utf8_prefix
contains two less than #cp_Utf8_count
elements. It specifies, in order, the prefix length of each constant pool string (except the first two strings, which always have zero prefix lengths). The prefix length encodes the length of a prefix substring shared by both cp_Utf8[i-1]
and cp_Utf8[i]
. If there are fewer than three strings in cp_Utf8
, then cp_Utf8_prefix
is empty. The prefix length is typically a large proportion (sometimes half) of the total length.
Each value in the band cp_Utf8_chars
is a 16-bit number expressing a Java character. This band contains the characters of all small suffixes, in order. For each successive string, cp_Utf8_chars
contains an additional run of values encoding the characters of its small suffix, if any. Therefore, the total length of this band is the sum of all values in the cp_Utf8_suffix
band.
Whenever a small suffix length for a constant pool entry is zero, the string has no small suffix, but a big suffix instead. The length of each big suffix is given by an element of the cp_Utf8_big_suffix
band. (Therefore, the length of this band is precisely the count of zero values in the cp_Utf8_suffix
band.) Each big suffix is transmitted as a separate band of 16-bit character values, one band element per character. There is one such band per big suffix. These bands immediately follow the cp_Utf8_big_suffix
band, and are collectively called the cp_Utf8_big_chars
bands. Although normally data of the same type are collected into a single band, these strings are placed in separate bands so that they may be independently encoded. These strings typically encode arrays of binary data, rather than true Java characters.
A big suffix can have a length of zero, which means that the constant pool entry is a non-empty prefix of the previous string. In such a case, the corresponding suffix band will occupy no space in the archive file, since bands of zero length occupy no bytes at all.
Please see the Appendix section Representation of cp_Utf8
Constant Pool for pseudo-code explaining this concept.
Each cp_Signature
constant pool entry represents a type string, exactly as field and method types are declared in the class file format. Although the class file format uses simple Utf8
constant pool entries for such strings, the Pack200 file format allocates a separate constant pool for them, in order to use more compact representation than for general strings.
Signature constants are special in that they can contain embedded references to cp_Class
constants. The signature constant is equivalent to a Utf8 constant, once the embedded class references are expanded into their spellings. Signature constants are flexible enough to represent arbitrary strings, but are intended for strings that often contain class names following the capital letter ell 'L'
. These include field, method, and local variables types, and generic Signature
attributes.
Each signature string is decomposed into a form and a sequence of zero or more class references. The class references are obtained by locating in the original signature string all sequences that match the pattern "L
classname", removing the classname part, and treating it as a reference into the cp_Class
constant pool. The form is defined as the residue after the class names have been removed from the original string.
The form is not allowed to contain any occurrences of the letter ell ('L
'), other than those which mark removed class names.
Thus, the form will contain the character 'L
' wherever the original type string refers to a class. By counting the number of these characters in a form, the decompressor may deduce the number of classes to which the original type string refers. This number is called the class length of the form.
Note that cp_Class
constants are not required to refer to existing classes, nor are they even required to have valid class name spellings. Therefore, the compressor has considerable latitude in choosing which class names (if any) it will extract from signature strings. In an extreme case, the compressor can make each form identical to its corresponding signature string, and simply satisfy its class length by emitting references to a fictitious class with an empty name. (This class length would have to include any occurrences of 'L'
in the class names themselves.)
Note also that these encoding rules do not require a class name to be followed by any particular character, although it will generally be a semicolon, or perhaps a left-angle bracket, in the case of an instance of a generic type in a Signature
attribute.
The rules also allow the compressor to decide, arbitrarily, how many characters (if any) after each letter ell 'L'
will be transmitted as part of a class name. Therefore, a given signature string may be representable by several forms, any of which the decompressor must be prepared to process.
Here are some examples:
Type Signature String | Form |
Class Len. |
Class... |
---|---|---|---|
F
|
F
|
0 | |
[Z
|
[Z
|
0 | |
[[[LLL;
|
[[[L;
|
1 |
LL
|
[[[LLL;
|
[[[LLL;
|
3 |
(empty), (empty), (empty)
|
([Ljava/lang/String;)V
|
([L;)V
|
1 |
java/lang/String
|
Ljava/util/List<Lpkg/Item;>;
|
L<L;>;
|
2 |
java/util/List, pkg/Item
|
(Ljava/lang/String;II)Lpkg/Item;
|
(L;II)L;
|
2 |
java/lang/String, pkg/Item
|
Ljava/util/List<Ljava/lang/Byte;>;
|
L<L;>;
|
2 |
java/util/List, java/lang/Byte
|
<ELEM:>(Ljava/util/List<TELEM;>;)TELEM;
|
<EL:>(L<TEL;>;)TEL;
|
1 |
EM, java/util/List, EM, EM
|
ALLOWABLE
|
ALLOWABLE
|
3 |
(empty), (empty), (empty)
|
ALLOWABLE
|
AL
|
1 |
LOWABLE
|
ALLOWABLE
|
ALABL
|
2 |
LOW, E
|
The forms of all strings in the cp_Signature
constant pool are given in order in the cp_Signature_form
band, as one cp_Utf8
reference per signature string. For each form, the class-length sequence of classes required to reconstitute the signature string is transmitted as a run of cp_Class
references in the cp_Signature_classes
band. As a consequence of these definitions, the length of the cp_Signature_classes
band is the total number of letter ell 'L'
occurrences in the sequence of spellings of forms mentioned in cp_Signature_form
.
cp_Signature:
*cp_Signature_form :DELTA5 [#cp_Signature_count](cp_Utf8)
*cp_Signature_classes :UDELTA5 [COUNT('L',...)] (cp_Class)
Please see the Appendix section for Representation of cp_Signature
Constant Pool pseudo-code explaining this concept.
The next four constant pools contain ordered pairs of various kinds of values (types, names, strings, or classes). Each cp_Descr
constant is an ordered pair of a name (represented as a cp_Utf8
reference) and a type (represented as a cp_Signature
reference). These references are transmitted in corresponding elements of the cp_Descr_name
and cp_Descr_type
bands.
Likewise, each cp_Field
, cp_Method
, or cp_Imethod
constant is an ordered pair of a class (represented as a cp_Class
reference) and a name-and-type descriptor (represented as a cp_Descr
reference). Again, these references are transmitted in corresponding elements of the associated bands.
cp_Descr:
*cp_Descr_name :DELTA5 [#cp_Descr_count] (cp_Utf8)
*cp_Descr_type :UDELTA5 [#cp_Descr_count] (cp_Signature)
cp_Field:
*cp_Field_class :DELTA5 [#cp_Field_count] (cp_Class)
*cp_Field_desc :UDELTA5 [#cp_Field_count] (cp_Descr)
cp_Method:
*cp_Method_class :DELTA5 [#cp_Method_count] (cp_Class)
*cp_Method_desc :UDELTA5 [#cp_Method_count] (cp_Descr)
cp_Imethod:
*cp_Imethod_class :DELTA5 [#cp_Imethod_count] (cp_Class)
*cp_Imethod_desc :UDELTA5 [#cp_Imethod_count] (cp_Descr)
For additional constant pools define symbolic references for invokedynamic
instructions and related entities. (These constant pools are only non-empty if their corresponding counts are non-zero, which in turn can only be true if cp_extra_counts
is present.) Each cp_MethodHandle
is an ordered pair of a reference kind (a small integer) and a reference to an element of cp_Field
, cp_Method
, or cp_Imethod
. (This reference is transmitted as an index into cp_AnyMember
, which is defined above as a pool group containing those three pools.) Each cp_MethodType
is represented as a cp_Signature
reference. Each cp_InvokeDynamic
is an ordered pair of a bootstrap method specifier (represented as a cp_BootstrapMethod
) and a type (represented as a cp_Signature
reference). Each cp_BootstrapMethod
specifier is an ordered pair of a bootstrap method reference (represented as a cp_MethodHandle
) and a counted series of zero or more constant pool references to constants which can be loaded onto the JVM stack. (These references are transmitted as indexes into the pool group cp_LoadableValue
.) All these values and references are transmitted in associated bands as described below.
cp_MethodHandle:
*cp_MethodHandle_refkind :DELTA5 [#cp_MethodHandle_count]
*cp_MethodHandle_member :UDELTA5 [#cp_MethodHandle_count] (cp_AnyMember)
cp_BootstrapMethod:
*cp_BootstrapMethod_ref :DELTA5 [#cp_BootstrapMethod_count] (cp_MethodHandle)
*cp_BootstrapMethod_arg_count :UDELTA5 [#cp_BootstrapMethod_count]
*cp_BootstrapMethod_arg :DELTA5 [SUM(*BootstrapMethod_arg_count)] (cp_LoadableValue)
cp_InvokeDynamic:
*cp_InvokeDynamic_spec :DELTA5 [#cp_InvokeDynamic_count] (cp_BootstrapMethod)
*cp_InvokeDynamic_descr :UDELTA5 [#cp_InvokeDynamic_count] (cp_Descr)
Since a JAR archive consists of a sequence of files, each with contents and a few other attributes, the Pack200 archive can also represent a sequence of files with their attributes. Of course, the contents of class files are specially transmitted, but whether a file (i.e., a JAR archive element) contains a class or another resource, the Pack200 archive is able to transmit the following attributes:
A class file whose contents are compressed by Pack200 is marked as a "stub" with a special option bit. A class stub file must be declared to have a length of zero. It may be declared to have the empty string for its name. If there are not enough class stub files transmitted to match with the number of classes transmitted, the decompressor must act as if the compressor had transmitted an additional series of trivial stubs, with no name or contents, and modification time and deflation hint copied from the archive header.
Each resource file is transmitted in the Pack200 archive as a simple bytewise image, under a relative pathname using slash '/' as a directory separator. (This is the same pathname convention as is used in ZIP archives.) Each file (both resource files and class files) can optionally be associated with a deflation hint and a modification date.
file_bands:
*file_name :UNSIGNED5 [#file_count] (cp_Utf8)
*file_size_hi :UNSIGNED5 [#file_count*(#have_file_size_hi)]
*file_size_lo :UNSIGNED5 [#file_count]
*file_modtime :DELTA5 [#file_count*(#have_file_modtime)]
*file_options :UNSIGNED5 [#file_count*(#have_file_options)]
*file_bits :BYTE1 [SUM(*file_size)]
There are no additional file attributes, such as comments, extra attributes, or CRC values. An application which requires such attributes on some files may encode those files into an appropriate archive file format (such as ZIP), and transmit those archives as resource files in a Pack200 archive.
Each file name is transmitted as an element of the file_name
band, and each length (in bytes) is transmitted in the corresponding elements of the file_size_lo
band and (if present) the file_size_hi
band. All three bands are of length #file_count
, except that file_size_hi
has zero elements if the have_file_size_hi
bit in the #archive_options
is clear.
The length in bytes of each file is the corresponding unsigned 32-bit value taken from the file_size_lo
band, if the file_size_hi
band is empty. Otherwise, the file length is the unsigned 64-bit value composed of the corresonding unsigned 32-bit elements of the file_size_lo
and file_size_hi
bands, where the former value is the low-order 32-bit word and the latter value is the high-order 32-bit word.
The bytes of all non-class (i.e., resource) files follow immediately in the file_bits
band. Each file is given as a corresponding run of byte values.
The optional band file_modtime
, if not empty, supplies a modification time for each resource file (and potentially for class files, if class file stubs are present). Each integer value gives a difference (in seconds) from the #archive_modtime
value. (Therefore, if the #archive_modtime
value is zero, the individual file times must be interpreted as absolute values.) The order of transmission of file_modtime
is consistent with file_name
. This band is empty if the have_file_modtime
bit in the #archive_options
is clear.
The optional band file_options
, if not empty, supplies flag bits for each resource file (and potentially for class files, if class file stubs are present). This band is empty if the have_file_options
bit in the #archive_options
is clear. Each file_options
word is interpreted bitwise. Certain bits are given symbolic names as follows, where the LSB is numbered as bit zero:
Bit | Name | Purpose |
---|---|---|
0 | deflate_hint | request compressed JAR file element |
1 | is_class_stub | this file contains class bytecodes |
If deflate_hint
(the LSB) is set, the decompressor is requested (but not required) to reduce the size of its output. For example, if it produces a JAR file, it may deflate the JAR elements. Since the deflate_hint
bit in the #archive_options
word has the same effect, the two bits are in effect combined with a logical OR for each file. If is_class_stub
(the second LSB) is set, this resource file description is actually a stub, and the file contents are defined as the bytecodes of one of the classes defined in the Pack200 archive. Other bits in the file options must be zero and are reserved for future use.
If a transmitted file is marked as a class stub, its contents must be transmitted as if the file were empty. (That is, file_size_hi
and file_size_lo
must both be zero, and there may not be any corresponding bytes in file_bits
.) The decompressor is required to supply contents for the resource file by reconstituting the contents of a class file for a class transmitted in class_this
, etc. Like any other resource file, the class stub is allowed to have a name, a modification time and a deflate_hint
.
The class stubs and classes correspond in order. Define the derived parameter #class_stub_count
as the number of files marked as class stubs. (It is also the count of is_class_stub
bits set in file_options
.) Then #class_stub_count
must be no larger than #class_count
. The first class stub defines a file which receives bytecodes for the first class, and so on. Although a class stub has a name (transmitted in file_name
), this name may be the empty string. In this case, the decompressor is required to use the standard name for the class file, which is created from the class's bytecode name (using slash '/' for a package separator) by appending the string ".class". (Thus, a classfile can have an arbitrary non-standard name, but the compression works best if its name is derived from its class in the usual way.)
If #class_count
is larger than #class_stub_count
, then the decompressor must behave as if a sufficient number of trivial extra class stubs were transmitted after the last explicit file. These trivial stubs have an empty name string and a zero deflate_hint
or file_modtime
.
Thus, in the simple case where there are no class stubs at all (perhaps because have_file_options
is zero), the decompressor must produce its files in their transmission order, followed by the transmission order of the classes. Each class file must have its standard name, and a modification time and deflation hint inherited from #archive_modtime
and the deflate_hint
bit in #archive_options
. But at the cost of extra transmission size, the compressor can direct the decompressor to present the resources and class files in any fixed order, with arbitrary names, modification times, and deflation hints for each output file. The decompressor must honor this ordering, if its output is in a form (such as a JAR archive) where order is significant.
Compressors are free, unless otherwise directed, to choose any ordering of files. It is often advantageous to place files with similar statistics next to each other, so that the post-pass compressor (if any) may process their contents together (in the same window, in the case of the DEFLATE algorithm). It is likely that a compressor which is able to reorder its input files for efficient transmission will have a command option which forces it to retain the order in which input files were presented, because this order is significant to some (but not all) deployment applications.
Compressors are free to transmit class files as if they were resource files. This provides a way to transmit class files which must be preserved bit-for-bit, or which compressors cannot transmit compressed with sufficient accuracy. Decompressors are required to accept class files transmitted "bitwise" as resource files.
Note: The bands controlling files and their attributes are transmitted last in the archive, in order to ease decompressor implementation slightly, since the last thing a decompressor does is to assemble output files. In particular, resource files can be of arbitrary size, and placing their bits at the end of the archive allows the decompressor to avoid allocating temporary storage for them.
The Pack200 file format directly supports up to 16 modifier bits in all output flags fields, as defined by the class file format. It also directly supports certain predefined attributes, such as Code
, InnerClasses
, and ConstantValue
It is also possible for the compressor to define additional attributes, passing enough information to the decompressor to properly extract their data from bands and reconstitute them in class files. The presence or absence of all attributes is controlled by the setting of certain bits in flag bands.
Five sets of flag bands carry modifier bits and/or attribute control bits. The ic_flags
band carries modifier bits for nested classes. Also, modifier and attribute control bits are carried by class_flags_lo
, field_flags_lo
, method_flags_lo
, and code_flags_lo
, and the four corresponding optional high word bands, class_flags_hi
, field_flags_hi
, method_flags_hi
, and code_flags_hi
. (There is no high word for ic_flags
.) Each value in these bands is interpreted as an unsigned 32-bit binary number. Each bit in these bands independently specifies the presence of a Java access modifier (such as ACC_PRIVATE
), or of a class, field, method, or code attribute (such as Deprecated
or SourceFile
), or of some other necessary control information (such as whether a class file has a non-default version number).
Every class (resp. field, method, Code
attribute) has a corresponding flags value of up to 63 bits transmitted in class_flags_lo
(resp., field_flags_lo
, method_flags_lo
, code_flags_lo
) and optionally in class_flags_hi
(resp., field_flags_hi
, method_flags_hi
, code_flags_hi
). These flags values, assembled into 64-bit numbers, are named class_flags
(resp., field_flags
, method_flags
, code_flags
). Except for Code
attributes, each of the low sixteen flag bits may be used to transmit access flags. The high sixteen bits (or forty-seven, if the optional high word is transmitted) are used to indicate the presence of attributes, either predefined or compressor-defined. The sixty-fourth bit position of a flags value (if transmitted) is reserved and must be zero.
The flags value for a Code
attribute is optional and taken to be zero if missing. A Code
attribute has a corresponding flags value transmitted if and only if either the have_all_code_flags
bit in the #archive_options
is set, or else the element of code_headers
corresponding to the Code
attribute has the special value zero. (See the discussion of code headers in the section Class Schema.)
For classes, nested classes, fields, and methods, the assignment of flag bit positions to modifiers is the same as that in the class file format. (For example, the LSB always represents ACC_PUBLIC
, in both Pack200 archive and class file formats.) An overflow attribute is an attribute whose presence is not indicated directly via a flag bit, and but is instead indicated by a occurrence of its index in a separate band. For classes, fields, methods, and codes, bit 16 (as set in the mask 0x0001000) indicates the presence of overflow attributes. For nested classes, flag bit 16 indicates the presence of explicit outer class and name fields, as explained in the section Nested Classes.
Bit | Meaning |
---|---|
0 | ACC_PUBLIC |
1 | ACC_PRIVATE |
2 | ACC_PROTECTED |
3 | ACC_STATIC |
4 | ACC_FINAL |
5 | ACC_SYNCHRONIZED (ACC_SUPER) |
6 | ACC_VOLATILE (ACC_BRIDGE*) |
7 | ACC_TRANSIENT (ACC_VARARGS*) |
8 | ACC_NATIVE |
9 | ACC_INTERFACE |
10 | ACC_ABSTRACT |
11 | ACC_STRICT |
12 | ACC_SYNTHETIC |
13 | ACC_ANNOTATION |
14 | ACC_ENUM |
16 | overflow (more band data elsewhere) |
The assignment of flag bit positions to attributes is done at the option of the compressor, and is independently specified by the compressor for the four kinds of flag words that have to do with attribute-carrying objects (in classes, fields, methods, and codes). When a flag bit position is assigned to an attribute, if that bit is visible in the class file, it must be forced clear before the decompressor writes it to the class file. Therefore, if the compressor expects the decompressor to reproduce a particular non-zero flag bit as output, the compressor must refrain from assigning that bit position to attributes.
In particular, the low-order 16 bits of class, field, and method flags are visible in the class file, and can thus carry modifier bits. None of the bits of a code flags word is visible in the class file; these flag bits are used only for code sub-attributes.
By contrast, the low-order 16 bits of nested class records are used solely for modifiers, since nested class records do not contain attributes. In the rest of this section on attributes, we will disregard the flag words associated with nested class records.
Any of the 63 bit positions in a flags value can be assigned by the compressor to indicate to the decompressor the presence of some particular attribute. The compressor may "take over" a modifier bit by assigning it an attribute definition. (This is done by emitting a definition for the attribute, which mentions that bit position.)
If the compressor "takes over" a bit (modifier or not) that is being used by default for another purpose, the bit loses its previous meaning. However, the compressor may not emit an explicit definition for the same bit twice (in the same context).
Each kind of attribute is defined by four pieces of information: the entity to which it applies (class, field, method, or code), the bit position, if any, to which it is assigned, the name of the attribute (as it appears in the class file), and the layout of the attribute (which allows the decompressor to properly format occurrences of the attribute in the class file). It is an error for the compressor to specify the same name and layout twice in the same context. It is not an error to repeat a name with a different layout, or a layout with a different name.
The first two items are bit-encoded into a single-byte "header", and transmitted in the attr_definition_headers
band. The last two items are transmitted as cp_Utf8
references in corresponding elements of the attr_definition_name
and attr_definition_layout
. Thus there are three bands by which the compressor declares attribute types to the decompressor, and each band is of length #attr_definition_count
.
attr_definition_bands:
*attr_definition_headers :BYTE1 [#attr_definition_count]
*attr_definition_name :UNSIGNED5 [#attr_definition_count] (cp_Utf8)
*attr_definition_layout :UNSIGNED5 [#attr_definition_count] (cp_Utf8)
The least significant two bits of an attribute definition header byte are treated as an unsigned field, and give the attribute's context type, which is type of entity to which the attribute applies:
(h & 0x03)
|
Context Type |
---|---|
0 | attribute applies to classes |
1 | attribute applies to fields |
2 | attribute applies to methods |
3 | attribute applies to Code attributes |
The most significant six bits of an attribute definition header byte are treated as an unsigned field, and give the optional bit position to which the attribute is assigned in the flags word.
(h >> 2)
|
Flag Bit Assignment |
---|---|
0 | overflow attribute, not assigned to any bit |
1 | attribute assigned to bit 0 (LSB of lo word) |
2 | attribute assigned to bit 1 |
3 | attribute assigned to bit 2 |
... | |
32 | attribute assigned to bit 31 (MSB of lo word) |
33 | attribute assigned to bit 32 (LSB of hi word) |
... | |
63 | attribute assigned to bit 62 |
(no value) | bit 63 must be zero (MSB of hi word) |
Every class attribute, whether predefined in this specification or explicitly defined by the compressor, has a unique number called its attribute index. If the attribute is assigned a flag bit, then its attribute index is identical to the flag bit's position (a number in [0..62]). (All predefined attributes are assigned a flag bit by default.)
If a class attribute is not assigned a flag bit, it is an overflow attribute, and its index is assigned sequentially in the order in which class attributes are defined (i.e., transmitted). The first index to be assigned sequentially in this way is 32 if #have_class_flags_hi
is clear, and 63 if it is set. These indexes are used within the archive to declare attribute occurrences for individual classes.
Likewise, field, method, and code attributes are assigned their own attribute indexes, independently of class attributes and of each other. Therefore, an attribute (i.e., a name and layout pair) is uniquely specified within the archive by its context type and attribute index. The field and method attributes may be assigned to flag bits, or else they are overflow attributes with indexes of 32 or larger. Like other attributes, code attributes may be assigned explicit numbers or implicitly assigned indexes starting with 32. (As with class attributes, if the high flag word bands are selected by the appropriate bit of #archive_options
, then field, method, or code overflow attributes have indexes of 63 or larger.)
Some attributes are predefined, and do not require the compressor to emit definitions for them. They have implicitly defined layouts and flag bit assignments (i.e., indexes).
Attribute indexes less than 63 are usable in all cases, but they may conflict with those assigned to predefined attributes, including predefined attributes defined (in the range 16 to 62) in future expansions of the Pack200 format. In the present version, all predefined attributes have indexes in the range [17..31], so that modifier flags are not taken over by default, and high flag words do not need to be used routinely.
Here are the names and index assignments of the predefined attributes. (The predefined layouts are given below.)
Index | Context Type | Name |
---|---|---|
16 | C,F,M | (overflow attributes) |
17 | Class | SourceFile |
18 | Class | EnclosingMethod |
19 | C,F,M | Signature |
20 | C,F,M | Deprecated |
21 | C,F,M | RuntimeVisibleAnnotations |
22 | C,F,M | RuntimeInvisibleAnnotations |
23 | Class | InnerClasses |
24 | Class | "class-file version" |
17 | Field | ConstantValue |
17 | Method | Code |
18 | Method | Exceptions |
23 | Method | RuntimeVisibleParameterAnnotations |
24 | Method | RuntimeInvisibleParameterAnnotations |
25 | Method | AnnotationDefault |
26 | Method | MethodParameters |
27 | C,F,M,Code | RuntimeVisibleTypeAnnotations |
28 | C,F,M,Code | RuntimeInvisibleTypeAnnotations |
0 | Code | StackMapTable |
1 | Code | LineNumberTable |
2 | Code | LocalVariableTable |
3 | Code | LocalVariableTypeTable |
16 | Code | (overflow attributes) |
Bit 16 is predefined as an indicator of the presence of overflow attributes for classes, fields, methods, and codes. If an entity has overflow attributes, it will possess a corresponding count in an "attr_count" band, and each overflow attribute will be part of a run of values in an "attr_indexes" band, which specifies the layouts of the overflow attributes. This processing of overflow attributes is described more fully in the section Attribute Bands.
These predefined attribute indexes determine not only which bit positions are used to select those attributes, but also the fixed band order in which the attribute data are transmitted, as described in the section Attribute Layout Definitions.
Certain bits are predefined to support the five types of metadata attributes on classes, fields, and methods, RuntimeVisibleAnnotations
, RuntimeInvisibleAnnotations
, RuntimeVisibleParameterAnnotations
, RuntimeInvisibleParameterAnnotations
, and AnnotationDefault
. (The last three types apply only to methods.) The meaning and format of these attributes is defined by JSR 175. Similarly type metadata attributes RuntimeVisibleTypeAnnotations
and RuntimeInvisibleTypeAnnotations
on classes, fields, methods and code, are predefined. The meaning and format of these attributes are defined and described by JSR 308.
It is permissible for a compressor to refrain from setting bits in any flag words except for bit 16, and transmit all attribute layout indexes explicitly in class_attr_indexes
or a similar band. Decompressors are required to process either kind of occurrence of an attribute index. (It is also permissible, though useless, for an attribute count to be an explicitly transmitted as zero.) Compressors are encouraged to make clever choices and clear bit 16 and set other assigned flag bits when possible.
Note that none of the predefined bits interfere with any present or future flag bits in the 16-bit flag values stored in the class file format. However, compressors are free to reuse one of the low-order 16 bits of the flags word, by binding them to other attributes, if no file actually sets them.
The InnerClasses
attribute is treated specially, as documented in the section Nested Classes, and it is an error for the compressor to emit an attribute definition for it in the class context. The Code
attribute is also treated specially. It is an error to emit an attribute definition for it in the method context.
This specification does not define how a compressor is informed of the existence or format of attributes which are not predefined. It simply assumes that compressors are informed of such attributes, and it requires that compressors properly transmit this information to decompressors. As a special case, it is reasonable for any compressor to transmit an attribute containing no bytes (i.e., of zero length) as if its layout were known to be the empty string.
Attribute layouts govern the compressor as it parses attribute bodies from class files into collections of scalar values. Layouts also govern the decompressor as it decodes transmitted bands of those scalar values, and stores them, correctly formatted, into class files. For example, when a class file stores an unsigned integer in two bytes (in big-endian order) in an attribute, the decompressor uses a layout declaration to this effect, so that it can take a band element (which is a 32-bit integer that in this case happens to be less than 65536) and store it correctly into a class file. The transmission of such an integer is very different from the transmission of a constant pool reference, even though they appear as undifferentiated bytes in the class file format.
In what follows, we say that some given element of an attribute layout governs a scalar value stored in a class file attribute, if the decompressor must use that element to reconstitute, into a class file, the representation of that scalar value. We also say that the given layout element governs the band in which the scalar value is transmitted.
An attribute layout is defined by a string in a "little language". The string must be parsed by the decompressor into a sequence of layout elements, each of which governs the transmission and storage of attribute values. In particular, the layout declares the locations of all constant pool references, allowing their values to be transmitted with appropriate representations, either as constant pool indexes or else some other sort of number.
The simplest usable attribute layout would be a sequence of constant pool reference declarations, intermixed with one-byte declarations to govern everything else, and compressors are free to use such layouts to describe attributes. However, more specific attribute layouts lead to better compression.
Attributes typically contain constant pool references and small integers. They often contain integers which control the replication of subsequent patterns. Constant pool references are strongly typed where possible, and the encoding provision for null references must be declared also. Small integers which encode flag bits or bytecode indexes are also declared as such, so that special encoding techniques may be used on them.
Layout declarations are UTF8 strings formed according to the following grammar. (This grammar is independent of the grammar which describes band structure, or any other grammar appearing in other parts of this specification.)
attribute_layout:
( layout_element )* | ( callable )+
layout_element:
( integral | replication | union | call | reference )
callable:
'[' body ']'
body:
( layout_element )+
integral:
( unsigned_int | signed_int | bc_index | bc_offset | flag )
unsigned_int:
uint_type
signed_int:
'S' uint_type
any_int:
( unsigned_int | signed_int )
bc_index:
( 'P' uint_type | 'PO' uint_type )
bc_offset:
'O' any_int
flag:
'F' uint_type
uint_type:
( 'B' | 'H' | 'I' | 'V' )
replication:
'N' uint_type '[' body ']'
union:
'T' any_int (union_case)* '(' ')' '[' (body)? ']'
union_case:
'(' union_case_tag (',' union_case_tag)* ')' '[' (body)? ']'
union_case_tag:
( numeral | numeral '-' numeral )
call:
'(' numeral ')'
reference:
reference_type ( 'N' )? uint_type
reference_type:
( constant_ref | schema_ref | utf8_ref | untyped_ref )
constant_ref:
( 'KI' | 'KJ' | 'KF' | 'KD' | 'KS' | 'KQ' | 'KM' | 'KT' | 'KL' )
schema_ref:
( 'RC' | 'RS' | 'RD' | 'RF' | 'RM' | 'RI' | 'RY' | 'RB' | 'RN' )
utf8_ref:
'RU'
untyped_ref:
'RQ'
numeral:
'(' ('-')? (digit)+ ')'
digit:
( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' )
Each occurrence of an attribute in a class file is associated (by the compressor) with a corresponding layout definition which describes accurately the meaning of that attribute's bytes. The compressor must assign each format element its own band for transmitting successive values of that format element. In the case of the predefined attributes, the bands assigned to their layout elements are defined by name in this specification.
Each value governed by an attribute layout element is converted by the compressor into a 32-bit value and transmitted as an element of a band uniquely created for and governed by that layout element. For integral
layout elements, the conversion simply represents the number stored in the class file under the given type, while reference
elements convert a local constant pool reference (local to the class file, that is) into a global, typed reference within the archive.
Multiple occurrences of the same kind of layout element are regarded as distinct layout elements. Also, bands are never shared by layouts. Therefore, every new layout definition transmitted by the compressor implicitly defines its own set of bands. Reassigning an previously defined attribute layout to a new index creates a new set of bands; it does not reuse the previously defined sets of bands.
If the compressor defines new attributes, it must also create the bands they govern. It must transmit these bands, immediately after the place reserved in the band grammar for predefined attributes (at the end of class_attr_bands
, field_attr_bands
, method_attr_bands
, or code_attr_bands
). The ordering of these bands must correspond to the definition order of the attribute layout elements that govern them. In this way, the decompressor will be able to find those attribute values.
The definition order of two layout elements in the same attribute layout string corresponds to their order of occurrence within that string. The definition order of two layout elements not in the same attribute layout string corresponds to the index order of their layout definitions in the attr_definition_layout
band. (That is, bands for layouts with lower indexes precede bands for layouts of the same context type but with higher indexes. This is true even if the lower-indexed layout happens to be defined later in the attr_definition_layout
band.) Note that the bands governed by predefined attributes appear to follow such an ordering also. However, the bands of predefined class attributes precede all the other class attribute bands, and likewise for fields, methods, and code attributes.
Integer values stored in attributes are sized as 1, 2, or 4 bytes, depending on the use of the format character 'B', 'H', or 'I', respectively. A integer type is normally unsigned but prefixed with 'S' becomes signed. This signing governs the lengthening of 1-byte and 2-byte types to 32-bit values (either sign extension or zero filling). It also determines the primary encoding used to transmit the 32-bit values. Because all integers stored in class files are "big-endian", all integral format elements refer to integers coded with higher-order bits in earlier bytes.
Integral
layouts (i.e., those elements under the integral
nonterminal) govern signed or unsigned integer values. Integral layouts whose declarations include the 'P' or 'O' characters use special primary encodings (BCI5 or BRANCH5) as described below. Any other integral layouts whose declarations include the 'S' character govern bands with a primary encoding of SIGNED5. Unsigned integer and flag layout elements govern bands with a primary encoding of UNSIGNED5, except that the integer layout 'B' (unsigned byte) governs a band with a primary encoding of BYTE1.
(Note that there is no special encoding for signed bytes. If an attribute contains a signed byte field, it may be just as well to treat that field as if it were unsigned, and give it a plain 'B' layout element. Besides completeness, the 'SB' layout is intended to allow small-magnitude one-byte integers to be transmitted using a byte alphabet in which sign bits are rare. This is desirable when the post-pass compressor uses Huffman coding to represent bytes; such codings perform best when a consistent byte alphabet is presented to the compressor.)
A stored bytecode index is declared with the layout prefix 'P'. This layout element may only be used on methods or codes. Any number may be stored in a bytecode index. However, before the stored numbers are transmitted as band elements, they are renumbered in the expectation that byte positions not on instruction boundaries will be rare.
The BCI renumbering compactly indexes instruction boundaries, and also provides a coding (somewhat less compact) for all other 32-bit integers, such as the addresses of bytes inside instructions or outside the bounds of the bytecodes. In particular, the first byte of the first instruction is numbered zero, the first byte of the second instruction is numbered one, and so on through all instructions. The byte position one past the last instruction is numbered with the next number (the number of instructions). The second byte of the first multibyte instruction is numbered with the next number, which is one more than the number of all instructions. All remaining unnumbered bytes are assigned subsequent numbers, without any further disturbances of ordering. For large enough positive numbers, and for negative numbers, the renumbering is the identity function.
For purposes of locating instruction boundaries for BCI renumbering, the _wide bytecode (0xc4) is taken to be part of the following instruction, whose format it helps determine. If the compressor transmits byte_escape (254) or ref_escape (253) pseudo-instructions, the decompressor must accept the bytecode sequences produced by each of these instructions as integral instructions, when computing BCI renumberings. Thus, there will be an instruction boundary for each byte in bc_codes (except _wide bytecodes), plus an extra boundary for each "aload_0_xxx" variation (such as "aload_0_getstatic_this"), because these pseudo-opcodes expand to pairs of bytecode instructions.
Please see the Appendix section Representation of Byte Offsets for pseudo-code explaining this concept.
If the layout element is prefixed by 'P' and not 'PO', then the renumbered bytecode index is transmitted. This renumbering is called the bytecode index renumbering. The primary encoding for bands containing bytecode indexes is BCI5.
Here is an example of how this works. Imagine a method with 20 bytes of bytecode data and five instructions, at the following positions: { 0, 4, 6, 10, 17 }
. BCI renumbering translates these particular numbers compactly to [0..4]
. More completely, BCI renumbering translates [0..20]
to the numbers { 0, 6, 7, 8, 1, 9, 2, 10, 11, 12, 3, 13, 14, 15, 16, 17, 18, 4, 19, 20, 5 }
, and anything outside the range [0..20]
stays the same after BCI renumbering. If a attribute of this method has a 'P' layout element, and has the number 6 stored in the class file, the number 2 will be transmitted, since 6 is the offset of the third instruction.
If the layout prefix is 'PO', the previous layout element must also have been a bytecode index of layout type 'P' or 'PO'. (There must not be intervening structure, such as brackets '[' or ']'.) In this case, the value transmitted in the band governed by the 'PO' element is the difference between the renumbered bytecode index governed by the current 'PO' element, and the renumbered bytecode index governed by the previous 'P' or 'PO' element. (In the case of a previous 'P' element, this is in fact the value transmitted at the corresponding position in the previous band.)
The 'PO' layout elements govern the same kind of attribute data as 'P' elements, but they use an encoding which expects that adjacent bytecode indexes are correlated. This renumbering, including a difference with the previous element, is called the bytecode offset renumbering The primary encoding for bands containing bytecode offsets is BRANCH5. This encoding is used even if the layout element contains the 'S' or the 'B' character.
A bytecode offset is declared with the layout prefix 'O'. This layout element must immediately follow a previous bytecode index element (with prefix 'P' and not 'PO'). Any value governed by this layout element is regarded as an offset, which when applied to the corresponding previously stored bytecode index, produces another bytecode index. That is, both the previous value and the sum of the previous value and the current value are expected to refer to instruction boundaries. As with the 'PO' layout, the value transmitted is the difference of the two renumbered bytecode indexes. Bands governed by 'PO' layouts contain bytecode offsets, and use the BRANCH5 primary encoding. Unlike the 'PO' layout, the stored value governed by an 'O' layout is the difference between two bytecode indexes.
Thus, bands governed by both 'O' and 'PO' layouts contain bytecode offsets, and use BRANCH5 to encode those offsets. But attribute values governed by both 'P' and 'PO' layouts store absolute bytecode indexes; only 'O' layouts govern stored bytecode offsets. The stored offsets are treated as unsigned fields in the class file unless the 'S' character is present. By contrast, stored indexes are always treated as unsigned fields.
If the preceding example attribute for a 20-byte method also has a 'PO' layout element following its 'P' element, and the number 20 is stored in the class file, the number to be transmitted is found by renumbering 20 to 5, and then subtracting the previous transmitted number (5-2). The number 3 will therefore be transmitted. If instead the stored number is 7 (a BCI inside the instruction at position 6), the transmitted number will be (10-2) or 8. The same transmitted values (3, 8), if governed by an 'O' layout element instead of 'PO', would correspond to stored values of 14 (20-6) and 1 (7-6), instead of 20 and 7.
A flag element (with prefix 'F') encodes integer values which are expected to encode short array of bits, rather than arithmetic values. The bits are not required to be interpreted as access modifiers. The primary encoding for bands transmitting flag layout elements is UNSIGNED5, except that 'FB' layouts use a primary encoding of BYTE1.
Integral values transmitted under the 'V' format character instead of 'B', 'H', or 'I' occupy no bytes at all in the class file attribute. These layout elements allow the decompressor to use counts or tags to control transmission without having them appear directly in class file attributes. For example, consider an attribute which is an array of bytes that is not self-sizing but expands to fill the byte-count mentioned in the attribute header in the class file. Such an attribute might have a layout specification of 'NV[B]'. The compressor would be responsible to decide which values to transmit for a 'V' layout element. (Such decisions would have to make use of detailed information about attribute formats that is not part of this specification.) In all cases, the decompressor must respect these values (when used as counts or tags) but must not store them in the class file.
A replication is introduced by a prefix 'N', which is followed by an integer element called the replication count, and by then a series of elements, called the replication body, enclosed in square brackets. The attribute data governed by this layout consists of a replication count followed by an array of data counted by the replication count. Each array element is governed by the layouts in the replication body.
A replication count layout element governs a band transmitting the replication counts, which has a primary encoding of UNSIGNED5 (or BYTE1, if the layout starts with 'NB'). The bands governed by the corresponding replication body may be sized by summing the values transmitted in the band containing the counts.
A union is introduced by a prefix 'T', which is followed by an integer element called the union tag, and by then a series of labeled, bracketed groups of elements, called the union cases. (The numerals can have any number of digits, but their arithmetic values are truncated to 32-bit integers before being compared with band values, which also are 32 bits in size.) Each label but the last consists of one or more parenthesized, possibly signed decimal numerals, called union tags. The last label (which is the default case) must be an empty pair of parentheses. (No union may contain two occurrences of the same case tag.) The attribute data governed by this layout consists of an integral tag value followed by data whose format is determined by the tag. The data following the tag is governed by the (unique) union case whose label matches the tag's value, or else the default case. The bands governed by each union case may be sized by counting the number of values in the band governed by the tag's layout. As with plain integral bands, the primary encoding of union tag band is SIGNED5, BYTE1, or UNSIGNED5, depending on whether the layout contains an 'S' character, is 'TB', or otherwise.
In a union tag, two numerals separated by a hyphen specify an inclusive range. The second number must be greater than the first. This is an abbreviation for the list of all numerals from the first to the second, inclusive. The layout is treated exactly as if the abbreviation were replaced by the complete list.
Constant pool references in an attribute may be typed strongly, and transmitted as indexes into one of the archive's constant pools. The layout types beginning with 'KI', 'KJ', 'KF', 'KD', and 'KS' must govern stored indexes in the local constant pool to constants of type integer, long, float, double, and string. The layout types beginnig 'KM' and 'KT' govern indexes to constants of type MethodHandle and MethodType. Likewise, the layout types beginning with 'RC', 'RS', 'RD', 'RF', 'RM', 'RI', and 'RU' must govern stored indexes in the local constant pool to symbols of type class, signature, descriptor (pair of name and type), field reference, method reference, interface method reference, and UTF8 string. The layout type 'RY' governs indexes to invokedynamic descriptors, while the type 'RB' governs indexes to bootstrap method specifiers (which are elements of the BootstrapMethods
attribute, not of the constant pool). All these references are transmitted as 32-bit indexes into the corresponding global constant pools, in bands with primary encoding of UNSIGNED5. (This is true even of layouts which contain 'B'.) See table below.
Note that signature references (layout 'RS') are identical to Utf8 references (layout 'RU') within a class file, but lead to different coding tactics in the archive file, since the cp_Utf8
and cp_Signature
constant pools have independent indexes and are transmitted differently.
The layout elements beginning 'KQ' may only occur in attributes on fields. They refer to constants in a constant pool whose identity is determined by the field's signature. (This layout element is probably useful only for ConstantValue
attributes, which are predefined.) The following table gives field signatures legal for use with 'KQ' layouts, and the corresponding constant pools in which the corresponding stored references must be found.
Field Signature | 'KQ' Constant Pool |
---|---|
B | cp_Int |
S | cp_Int |
C | cp_Int |
Z | cp_Int |
I | cp_Int |
J | cp_Long |
F | cp_Float |
D | cp_Double |
Ljava/lang/String; | cp_String |
Ljava/lang/Class; | cp_Class |
Three additional layout types transmit indexes into constant pool groups, instead of individual constant pools. They may be used when the compressor elects to use a layout which has references that are not strongly typed. The combined type 'KL' may be used for any index that refers to a valid operand of the 'ldc' instruction, and the transmitted indexes will be renumbered relative to the cp_LoadableValue
pool group. Similarly, the combined layout type 'RN' may be used for any index that refers to a valid operand of any of the 'get' or 'invoke' instructions (except 'invokedynamic'), and these indexes will renumbered relative to the cp_AnyMember
pool group.
Finally, if a constant pool reference cannot be typed at all, the compressor must use an untyped_ref
('RQ'), which transmits an index into the comprehensive constant pool group cp_All
. For example, all the elements of the cp_Utf8
pool are numbered the same in the 'RQ' and 'RU' layouts. But the first element (element zero) of the cp_Int
pool is numbered, for untyped references, as cp_Utf8_count
, and the first element of the cp_Float
pool is numbered, for untyped references, as cp_Utf8_count+cp_Int_count
.
If a compressor is faced with an attribute which contains a constant pool index of an unexpected type, it may either refuse to transmit the attribute, or elect to transmit the attribute under a relaxed layout definition using 'RQ' or 'RQN' elements instead of more strongly-typed elements. (If the unexpected type is a loadable constant or member reference, the compressor may elect to use 'RN' or 'KL' elements instead.) Decompressors are required to honor all legal layout definitions transmitted by compressors, even if they might produce illegally formatted class files. (In an extreme case, a compressor may elect to transmit a class file as if it were a resource file, obtaining no class-specific compression, but preserving unusually formatted attributes with bit-for-bit accuracy.)
If a reference
layout type includes the character 'N', all band values encoding constant pool entries are incremented by one, and the null value is encoded as zero. Otherwise, the null value is encoded as negative one (-1).
The effect on primary band encodings of the rules given above may be summarized as a list of prioritized rules for layout elements, where the first applicable rule determines the encoding:
Here is a table summarizing bands and encodings for various layout elements. (Not all possible combinations of types and integer sizes are shown.)
Layout Element |
Stored Value |
Transmitted Value |
Primary Encoding |
---|---|---|---|
B | u1 | x | BYTE1 |
FB | u1 | x | BYTE1 |
SB | u1 | (byte)x | SIGNED5 |
H | u2 | x | UNSIGNED5 |
FH | u2 | x | UNSIGNED5 |
SH | u2 | (short)x | SIGNED5 |
I | u4 | x | UNSIGNED5 |
FI | u4 | x | UNSIGNED5 |
SI | u4 | x | SIGNED5 |
PH | u2 | renumber_bci(x) | BCI5 |
POH | u2 | renumber_bci(x) - renumber_bci(x0) | BRANCH5 |
OH | u2 | renumber_bci(x0+x) - renumber_bci(x0) | BRANCH5 |
NB[...] | u1 | x (also serves as size count) | BYTE1 |
NH[...] | u2 | x (also serves as size count) | UNSIGNED5 |
NI[...] | u4 | x (also serves as size count) | UNSIGNED5 |
TB... | u1 | x | BYTE1 |
TSB... | u1 | (byte)x | SIGNED5 |
TH... | u2 | x | UNSIGNED5 |
TSH... | u2 | (short)x | SIGNED5 |
KIB | u1 | indexOf(lcp[x], cp_Int) | UNSIGNED5 |
KIH | u2 | indexOf(lcp[x], cp_Int) | UNSIGNED5 |
KII | u4 | indexOf(lcp[x], cp_Int) | UNSIGNED5 |
KINH | u2 | 1+indexOf(lcp[x], cp_Int) | UNSIGNED5 |
KJH | u2 | indexOf(lcp[x], cp_Long) | UNSIGNED5 |
KFH | u2 | indexOf(lcp[x], cp_Float) | UNSIGNED5 |
KDH | u2 | indexOf(lcp[x], cp_Double) | UNSIGNED5 |
KSH | u2 | indexOf(lcp[x], cp_String) | UNSIGNED5 |
KQH | u2 | indexOf(lcp[x], cp_FieldSpecific) | UNSIGNED5 |
KMH | u2 | indexOf(lcp[x], cp_MethodHandle) | UNSIGNED5 |
KTH | u2 | indexOf(lcp[x], cp_MethodType) | UNSIGNED5 |
KLH | u2 | indexOf(lcp[x], cp_LoadableValue) | UNSIGNED5 |
RCH | u2 | indexOf(lcp[x], cp_Class) | UNSIGNED5 |
RSH | u2 | indexOf(lcp[x], cp_Signature) | UNSIGNED5 |
RDH | u2 | indexOf(lcp[x], cp_Descr) | UNSIGNED5 |
RFH | u2 | indexOf(lcp[x], cp_Field) | UNSIGNED5 |
RMH | u2 | indexOf(lcp[x], cp_Method) | UNSIGNED5 |
RIH | u2 | indexOf(lcp[x], cp_Imethod) | UNSIGNED5 |
RUH | u2 | indexOf(lcp[x], cp_Utf8) | UNSIGNED5 |
RQH | u2 | indexOf(lcp[x], cp_All) | UNSIGNED5 |
RQNH | u2 | 1+indexOf(lcp[x], cp_All) | UNSIGNED5 |
RQNI | u4 | 1+indexOf(lcp[x], cp_All) | UNSIGNED5 |
RYH | u2 | indexOf(lcp[x], cp_InvokeDynamic) | UNSIGNED5 |
RBH | u2 | indexOf(class.BootstrapMethods[x], cp_BootstrapMethod) | UNSIGNED5 |
RNH | u2 | indexOf(lcp[x], cp_AnyMember) | UNSIGNED5 |
Here, the variable x names a value stored in the attribute, governed by the layout element. The variable x0 names the stored value governed by the immediately previous layout element, which must have begun with 'P'. The expression renumber_bci(x) denotes the renumbering of a bytecode index x to shorten references to instruction boundaries, as described above. The expression lcp[x] denotes a local constant pool reference, with index x, or a distinct null value if x is zero.
The expression class.BootstrapMethods[x] denotes an element of the BootstrapMethods
attribute of the current class. This attribute contains additional complex constants required by CONSTANT_InvokeDynamic
constant pool entries.
The expression indexOf(lcp[x], cp) denotes the index (zero-based) in a Pack200 global constant pool cp of the constant lcp[x], which is assumed to be of a type appropriate to cp. This expression has the value -1 if lcp[x] has the distinct null value.
Note that a null reference can always be transmitted in a band containing references, by transmitting the value zero (if the band is specified to accept nulls) or by transmitting the value -1 (otherwise). The UNSIGNED5 encoding can represent -1 in five bytes.
The name cp_FieldSpecific refers to a global constant pool selected by the enclosing field's signature, as described above.
The top-level structure of a layout specification may be a series of bodies which are independently callable layouts. In such a case, the attribute data governed by the whole layout specification is governed by the first body in the sequence of callables. The other callables can govern data, but they only do this if they are actually called. The effect is to define a mutually recursive set of layout specifications, and give control to the first. Note that callables may not be nested inside any other syntax. This feature is useful for supporting recursively structured class file attributes, such as metadata.
A call layout element is a parenthesized signed decimal numeral N. It refers to the Nth callable in the top-level structure of the layout specification, relative to the callable in which the call appears. (It is illegal for calls to appear outside of callables.) We will refer to this callable as the call's callee.
(For example, the layout element '(2)' calls the second callable layout after the one in which the call appears. There must be a matching callee for every callable. A call spelled '(1)' must not appear in the last callable of a layout, and likewise a call spelled '(-1)' must not occur in the first callable. Note that the self-call spelled '(0)' is always legal.)
A call layout element indirectly governs the attribute data more directly governed by the callable that the layout element calls. As far as class file format is concerned, the effect is the same as if the text within the callable's body were substituted instead of the call.
However, this substitution semantics does not describe the effect of a call layout element on band structure. A call layout element does not directly govern any bands, but rather specifies that the data it indirectly governs is to be transmitted in the bands governed more directly by the callee.
If a call's callee begins textually later than the call itself, the call is a forward call. The effect of a forward call is simply to make the bands of the callee sharable by the call and any other forward calls to the same callee. For example, the following layout specification has two bands, the latter of which transmits data indirectly governed by either of the union cases. (The default union case governs no bands, so that a tag byte other than ASCII 'A' or 'B' has no following bytes. Whitespace has been added to the layout for ease of reading.)
[TB
(65) [(1)]
(66) [(1)]
( ) []
]
[H]
A call's callee can also begin textually earlier than the call itself. Such a call, which must have a spelling of the form '(0)' or '(-N)', is referred to as a backward call. Its callee is referred to as a backward callable. (Callables which are the target of no calls or only of forward calls are not backward callables; all others are.) Backward calls and callables introduce the possibility of recursion and looping. Here is an example of an N-ary tree whose leaves are Utf8 strings preceded by a zero-byte, and whose internal nodes are counted arrays of tree nodes preceded by a one-byte: In this example, the callee encloses the call, for a direct recursion. The layout governs three bands, under the elements 'TB', 'NH', and 'RUH'.
[TB
(1) [NH[ (0) ]]
(0) [RUH]
]
The sizing of bands governed by such mutually-recursive layouts is assisted by explicit counts of backward calls, transmitted before any of the layout's attribute bands, in class_attr_calls
and three similar bands, which are described later.
The layout definitions of the predefined attributes are as follows:
Context Type | Name | Layout Definition |
---|---|---|
Class | "class-file version" | (empty) *(see note) |
Class | InnerClasses | (empty) *(see note) |
Class | EnclosingMethod | RCHRDNH |
Class | SourceFile | RUNH *(see note) |
Class | Signature | RSH |
Class | (metadata) | (see Metadata Layouts) |
Class | Deprecated | (empty) |
Class | (type metadata) | (see Metadata Layouts) |
Field | ConstantValue | KQH |
Field | Signature | RSH |
Field | (metadata) | (see Metadata Layouts) |
Field | Deprecated | (empty) |
Field | (type metadata) | (see Metadata Layouts) |
Method | Code | (empty) *(see note) |
Method | Exceptions | NH[RCH] |
Method | Signature | RSH |
Method | (metadata) | (see Metadata Layouts) |
Method | Deprecated | (empty) |
Method | MethodParameters | NB[RUNHFH] |
Method | (type metadata) | (see Metadata Layouts) |
Code | StackMapTable | (see Stack Map Layouts) |
Code | LineNumberTable | NH[PHH] |
Code | LocalVariableTable | NH[PHOHRUHRSHH] |
Code | LocalVariableTypeTable | NH[PHOHRUHRSHH] |
Code | (type metadata) | (see Metadata Layouts) |
The star '*' in the layout definitions of "class-file version", InnerClasses
, SourceFile
, and Code
reflects the fact that these attributes are given special processing. For a class, the "class-file version" pseudo-attribute is transmitted as if using the format VV
, but the decompressor does not store the result in a class file attribute, but rather in the class file header. (See the discussion in the section Attribute Bands.) For a class, the InnerClasses
attribute is partially transmitted as if using the format NV[RCVTV[(0)[]()[RCNVRUNV]]]
, but the decompressor processes the received values further before (possibly) emitting an InnerClasses
attribute. (See the discussion in the section Local InnerClasses Attributes.) When a SourceFile
attribute is transmitted using the predefined layout, a special rule allows it to default to the obvious standard string. Finally, for a method, the Code
attribute is transmitted under the code_bands
.
There is a predefined attribute layout for the StackMapTable
attribute of a Code
attribute. With some whitespace and abbreviation added for readability, it is as follows:
[NH[(1)]]
[TB
(64-127) [(2)]
(247) [(1)(2)]
(248-251) [(1)]
(252) [(1)(2)]
(253) [(1)(2)(2)]
(254) [(1)(2)(2)(2)]
(255) [(1)NH[(2)]NH[(2)]]
() []
]
[H]
[TB
(7) [RCH]
(8) [PH]
() []
]
The following observations may be deduced by comparing this layout specification with the class file format specification which defines the StackMapTable
attribute. The second callable describes a stack_map_frame
structure from the class file format specification. Within the union in the second callable, the cases stand for the following stack_map_frame
union members, respectively: same_locals_1_stack_item_frame
, same_locals_1_stack_item_extended
, chop_frame
(and also same_frame_extended
), append_frame
(for three layout union cases), full_frame
, and (in the default layout union case) same_frame
. The third and fourth callables describe an offset_delta
value and a verification_type_info
structure from the class file format specification.
The predefined attribute layouts for non-parameter metadata annotations are the same in all three contexts. With some whitespace added for readability, the layout is as follows:
[NH[(1)]]
[RSH NH[RUH(1)]]
[TB
(66,67,73,83,90) [KIH]
(68) [KDH]
(70) [KFH]
(74) [KJH]
(99) [RSH]
(101) [RSH RUH]
(115) [RUH]
(91) [NH[(0)]]
(64) [RSH NH[RUH(0)]]
() []
]
Both visible and invisible annotations use this layout. The second callable describes an annotation
structure from the metadata specification. The third callable describes a element_value
structure from the metadata specification. This last callable, all by itself, is used for the AnnotationDefault
attribute on methods. Method parameter annotations (both visible and invisible) are also predefined using this layout, with a prepended callable to count the number of parameters:
[NB[(1)]]
[NH[(1)]]
[RSH NH[RUH(1)]]
[TB...]
Additionally RuntimeVisibleTypeAnnotation and RuntimeInvisibleTypeAnnotation are predefined, using similar strategies as before. Both visible and invisible type annotations use a common layout. (They were defined by JSR 308).
[NH[(1)(2)(3)]]
[TB
(0-1) [B]
(16) [FH]
(17-18) [BB]
(19-21) []
(22) [B]
(23) [H]
(64-65) [NH[PHOHH]]
(66) [H]
(67-70) [PH]
(71-75) [PHB]
() []
]
[NB[BB]]
[RSH NH[RUH(1)]]
[TB...]
The first callable defines the type_annotation
as a whole. The second callable refers to target_type
and target_info
union structure, and the third callable refers to target_path
structure. The remaining callables are identical with layouts specified above for metadata. Specifically the fourth and the fifth callables refers to the annotation
and element_value
structures specified for metadata. (These were defined in the JSR 175 specification). Like the previous metadata layouts, these type metadata layouts are predefined in the compressor, and are applicable to Class, Field, Method, and also Code structures.
It is possible for the compressor to emit any number of definitions for the same attribute name. These definitions are assigned different attribute indexes (and perhaps flag bits), and are treated as distinct attributes, despite having the same name. If an attribute is too complex to define in the layout language, each occurrence of that attribute can be given a separate definition by the compressor. The minimum requirement is that each layout accurately declare the size of the corresponding attribute occurrence, and locate all constant pool references within that occurrence.
When a null reference is transmitted in the band governed by the predefined SourceFile
attribute, the decompressor is required to replace it by a reference to a Utf8 string whose spelling consists of the associated class name string, with the following modifications made, in order:
This rule matches the historical behavior of many Java compilers, and allows the compressor to avoid allocating Utf8 strings for "obvious" derived SourceFile
names. (If a compressor needs to transmit an irregular SourceFile
attribute with a true null reference, it must use a non-standard layout.) Here is a table of examples of class names and the corresponding derived SourceFile
names.
Class |
SourceFile
|
---|---|
foo | foo.java |
foo/bar | bar.java |
foo/bar$baz | bar.java |
foo/bar#baz#1 | bar.java |
foo.bar.baz#1 | baz.java |
A nested class record is a four-tuple specifying two classes, a name, and some flags, as documented for the InnerClasses
attribute of class files. It is primarily represented in the Pack200 archive by four bands, whose contents are applied equally to all class files in the archive.
ic_bands:
*ic_this_class :UDELTA5 [#ic_count] (cp_Class)
*ic_flags :UNSIGNED5 [#ic_count]
*ic_outer_class :DELTA5 [COUNT(1<<16,...)] (null or cp_Class)
*ic_name :DELTA5 [LENGTH(*ic_outer_class)] (null or cp_Utf8)
These four-tuples are shared globally, like constant pool entries. In this specification, they are conventionally written <C,F,C2,N>
, even though they are stored in a different order in the class file format. As a set, these globally defined four-tuples are called ic_All
. There is usually no explicit linkage from individual classes in the archive to nested class records. Instead, when extracting a class file from a Pack200 archive, an subset of nested class records may be selected which is sufficient to describe all nested classes actually mentioned in the extracted class file's constant pool. For any class file X to be extracted, this subset will be called ic_Relevant(X)
, the relevant subset for X of ic_All
. The algorithm for selecting the relevant subset is described in the section Ordering of Constant Pools. Optionally, the compressor can specify, for any given class, an adjustment to its relevant subset, by transmitting a local InnerClasses
attribute. This also is described in the section Local InnerClasses Attributes.
The ic_this_class
and ic_flags
bands are both of length #ic_count
, and the corresponding elements of these bands specify, for each tuple, the nested class identity (represented as a cp_Class
reference) and the flags bitmask.
The nested class flag bit at position 16 (as set in the mask 0x00010000) is inside the archive file to indicate if there are corresponding entries for the tuple in the ic_outer_class
and ic_name
bands. Thus, the length of both of these bands is the sum of all flag bits at position 16. Typically, only a few percent of nested classes need to set this bit and specify outer and name fields explicitly.
If a tuple has an entry in the ic_outer_class
and ic_name
bands, these specify its outer class and simple name. (They are represented respectively as a possibly null cp_Class
reference and a possibly null cp_Utf8
reference.) Otherwise, the tuple's outer class and name are said to be predicted. In this case, they must be correctly predictable from the name of the nested class itself, by parsing its spelling.
A nested class has a bytecode name which names the class within class files. The spelling of this name, sometimes called a "mangled name", has extra punctuation signs and perhaps digits as well as the name of a containing class. If the bytecode name can be parsed into an outer class and a class name, and this class and name are identical with the true outer class and class name of the nested class, then we say the outer class and class name are predictable.
The extraction of the predictable outer class and class name must follow the following grammar for class bytecode names, as applied to the spelling of the nested class name. (This grammar is independent of the grammar governing band structure, or any other grammar appearing in other parts of this specification.) The terminal DOLLAR refers to any character (such as '$' or '#') whose code is 0x2D or lower. The terminals SLASH refers to the slash or dot characters '/' or '.', which have the codes 0x2E and 0x2F. The terminal DIGIT refers to an ASCII decimal digit, one of the ten character codes from 0x30 to 0x39 inclusive. The terminal LETTER refers to any other character. That is, it refers to any character whose code is 0x3A or higher.
bcn:
(bcnCase1 | bcnCase2 | bcnCase3 | bcnCase4)
bcnCase1:
packageQual (namePart)? DOLLAR number
bcnCase2:
packageQual (namePart)? DOLLAR number DOLLAR predictableICName
bcnCase3:
predictableOuter DOLLAR predictableICName
bcnCase4:
packageQual (namePart)?
predictableOuter:
packageQual namePart
predictableICName:
LETTER (LETTER | DIGIT)*
namePart:
(LETTER | DIGIT | DOLLAR)+
number:
(DIGIT)+
packageQual:
(namePart SLASH)*
This grammar ambiguously divides an arbitrary class name into several parts, which may include an optional predictedOuter prefix, an optional predictedICName suffix, and a optional numeric suffix. Any ambiguity must be resolved by preferring the alternative cases for bcn
nonterminal in the order given. E.g., if bcnCase1
matches, it is used, even though either or both other cases may match also.
The predictable nested class name is the string corresponding to the predictableICName
nonterminal, if it was parsed. Otherwise the predictable nested class name is taken to be null.
If a predictableOuter
outer nonterminal is parsed, the corresponding string is the predictable outer class name. Otherwise the predictable outer class reference is taken to be null. Note that if a number
nonterminal is parsed, no predictableOuter
can be parsed. Here are some examples of prediction, nonprediction, and misprediction:
nested class: |
member named Entry of java/util/Map
|
mangled name: |
java/util/Map$Entry
|
outer, name: |
java/util/Map , Entry
|
predictable? |
yes (since Map.Entry is a member of Map )
|
nested class: | anonymous |
mangled name: |
java/util/AbstractList$1
|
outer, name: | (none), (none) |
predictable? | yes (since the ic_name is null) |
nested class: |
non-member, named Local
|
mangled name: |
java/util/AbstractList$2$Local
|
outer, name: |
(none), Local
|
predictable? |
yes (since the ic_name is Local )
|
nested class: |
non-member, named Local
|
mangled name: |
java/util/AbstractList#2#Local
|
outer, name: |
(none), Local
|
predictable? |
yes (since the ic_name is Local )
|
nested class: |
member named $2$Local of Foo
|
mangled name: |
Foo$$2$Local
|
outer, name: |
(none), Local
|
predictable? | no (outer name missing, ic_name mispredicted) |
nested class: |
class named Red$Herring
|
mangled name: |
Red$Herring
|
outer, name: |
Red , Herring
|
predictable? | no (the predicted ic_name Herring) |
nested class: |
member named Q of X$1
|
mangled name: |
X$1$Q
|
outer, name: |
(none), Q
|
predictable? |
no (since Q is a member of X$1 )
|
nested class: |
member named Z of X$Y
|
mangled name: |
X$Y$Z
|
outer, name: |
X$Y , Z
|
predictable? | yes (since X$Y.Z is a member of X$Y) |
nested class: |
member named Y$Z of X
|
mangled name: |
X$Y$Z
|
outer, name: |
X$Y , Z
|
predictable? | no (outer name and ic_name are mispredicted) |
When a decompressor is processing a nested class name marked "predictable", it must parse the nested class's bytecode name into an enclosing class, an optional number, and an optional name. (The compressor might choose to always specify outer classes and names explicitly, in which case the decompressor would not be required to parse the nested class names at all. However, decompressors must always be prepared to perform this parsing.)
If the compressor did not transmit an entry in cp_Utf8
or cp_Class
for a predicted name or outer class, the decompressor must create such a constant internally. It will be referred to as a cp_Utf8
or cp_Class
entry, even though it is not in the constant transmission sequence cp_All
. However, if the compressor did transmit a constant with the same spelling as the predicted name or outer class, the decompressor must use that constant rather than creating a new one. The creation of such internal constants within the decompressor is detectable in an output file because they are inserted in a special order in the output file's constant pool. (See discussion of the output ordering rules in the section Ordering of Constant Pools.)
Please see the Appendix section Representation of Byte Offsets for pseudo-code explaining this concept.
When the decompressor receives the ic_this_class
, ic_flags
, ic_name
, and ic_outer_class
bands, and performs any required name parsing, it creates the set ic_All
of four-tuples. This set is the primary source of InnerClasses
attributes synthesized for individual class files by the decompressor.
The purpose of the Pack200 archive is to transmit a set of classes in a form which a decompressor can later use to reconstruct class files. As a rule, each separate kind of data stored in class files is transmitted in a band dedicated to all values of that kind. Here is the band structure for all class-specific information except bytecodes:
class_bands:
*class_this :DELTA5 [#class_count] (cp_Class)
*class_super :DELTA5 [#class_count] (cp_Class)
*class_interface_count :DELTA5 [#class_count]
*class_interface :DELTA5 [SUM(*class_interface_count)] (cp_Class)
*class_field_count :DELTA5 [#class_count]
*class_method_count :DELTA5 [#class_count]
*field_descr :DELTA5 [SUM(*class_field_count)] (cp_Descr)
field_attr_bands
*method_descr :MDELTA5 [SUM(*class_method_count)] (cp_Descr)
method_attr_bands
class_attr_bands
code_bands
The class_this
, class_super
, class_flags_lo
, class_flags_hi
(if present), class_interface_count
, class_field_count
, and class_method_count
bands are all of length #class_count
, and the corresponding elements of these bands transmit in turn each class's name and super-class, and the number of implemented interfaces, declared fields, and declared methods. (The access modifier bits in each class's flags word are mixed with attribute indicators, and are transmitted in class_flags_lo
, as defined above.)
The class_interface
band contains runs of class interface declarations, one run for each element of class_interface_count
, and applying to the corresponding class.
Each element of class_this
, class_super
, and class_interface
is a reference (in fact, a non-null reference) to the constant pool cp_Class
.
In the unique case of java/lang/Object
, the stored super-class must be a null reference. Rather than disturbing the transmission of all other elements of the class_super
band, we use the convention that a compressor, when it encounters a null super-class reference, must transmit in its place a duplicate of the current class's reference. (Since it is illegal for a class to inherit from itself, this renumbering of null references is unambiguous.)
The field_descr
bands contain one run of elements for each element of class_field_count
, and apply to successive fields in the corresponding class. The method_descr
bands contain one run of elements for each element of class_method_count
, and apply to successive methods in the corresponding class.
(Unlike the class file format, field and method descriptors are stored as single references to a "name-and-type" constant pool, rather than as pairs of name references and type references.)
The method_descr
band has a primary encoding which performs best if the method references are mostly sorted. Compressors may elect to take advantage of this fact by sorting the methods in each class. (Note: It is generally acceptable for compressors to reorder methods for best compression, but fields must not be reordered, since their order is apparent through reflection and significant to some facilities, such as serialization.)
Attribute bands are placed in the positions indicated by the grammar. They are described below. The attribute bands include flags bands, which carry both access modifiers and attribute indicators for the corresponding classes, fields, and methods.
The Pack200 archive ends with bands devoted to method code blocks and the bytecodes themselves. If a method has a code attribute, the compressor must mention it in the flags bit (the 6th LSB) to which it is assigned, or as an overflow attribute (with an index of 6). The fields of a Code attribute are transmitted in the bands organized under the code_bands
nonterminal.
The specification of each Code attribute begins with three key parameters, which declare the number of stack and local slots, and the number of handlers. #Stack
is the number of stack slots used by the code. #NALocal
is the number of non-argument locals used by the code. (The actual number of locals declared in the class file will be #NALocal
plus the number of locals required by method parameters, as determined by the method's signature.) #Handler
is the number of exception handlers.
The code_headers
band contains a series of bytes, each of which tersely encodes the first three key parameters of a code attribute: Each of the 255 non-zero byte values encodes a unique triple of #NALocal
, #Stack
, and #Handler
. There is (of course) one code header for each Code attribute transmitted.
The special code header byte zero (0x00) indicates that the code attribute's three parameters are to be found, instead, in the code_max_stack
, code_max_na_locals
, and code_handler_count
bands. If the code header byte is non-zero, these three bands do not transmit entries for the corresponding Code attribute.
The code_flags_lo
band transmits an entry for a corresponding Code
attribute if and only if one or both of the following is true: the Code attribute's code header byte is zero, or the have_all_code_flags
bit is set in the #archive_options
word. If a Code
attribute does not have a transmitted code_flags_lo
value, its flags word is taken to be zero, and it has no sub-attributes. The code_flags_hi
band transmits a corresponding entry if and only if the code_flags_lo
value was transmitted as just described, and also #have_code_flags_hi
is set.
Non-zero code header bytes denote code attribute parameters according to the following scheme:
Header Range | #Stack | #NALocal | #Handler |
---|---|---|---|
1 <= x <= 144 | (x-1) % 12 | (x-1) / 12 | 0 |
145 <= x <= 208 | (x-145) % 8 | (x-145) / 8 | 1 |
209 <= x <= 255 | (x-209) % 7 | (x-209) / 7 | 2 |
x = 0 |
code_max_stack
|
code_max_na_locals
|
code_handler_count
|
This scheme allows a single byte to describe a Code attribute in about 95% of all cases. (Note: If debugging attributes are not stripped, the compressor should probably set the have_all_code_flags
bit, so that the presence of these attributes will not force the use of the special zero code header byte.)
Regardless of whether the exception handler count is encoded in the one-byte code header, or whether it is an explicit element of code_handler_count
, for each code attribute there is a run of values in code_handler_start_P
, code_handler_end_PO
, code_handler_catch_PO
, and code_handler_class_RCN
, one for each handler, as if those bands were governed by an attribute layout 'NV[PHPOHPOHRCNH]'
(there is no band governed by the 'NV'
layout element itself; it is the handler count).
That is, each triplet of Handler.start
, Handler.end
, and Handler.catch
values are transmitted as renumbered (by renumber_bci). Also, Handler.end
is transmitted as a difference of its renumbering with that of Handler.start
in the same triplet, and Handler.catch
is transmitted as a difference of its renumbering with that of Handler.end
in the same triplet. Finally, code_handler_class_RCN
is transmitted as an incremented reference into cp_Class
, or zero if the handler class reference is null.
code_bands:
*code_headers :BYTE1 [COUNT(Code,...)]
*code_max_stack :UNSIGNED5 [COUNT(0,*code_headers)]
*code_max_na_locals :UNSIGNED5 [COUNT(0,*code_headers)]
*code_handler_count :UNSIGNED5 [COUNT(0,*code_headers)]
*code_handler_start_P :BCI5 [SUM(*code_handler_count)]
*code_handler_end_PO :BRANCH5 [SUM(*code_handler_count)]
*code_handler_catch_PO :BRANCH5 [SUM(*code_handler_count)]
*code_handler_class_RCN :UNSIGNED5 [SUM(*code_handler_count)] (null or cp_Class)
code_attr_bands
Here, collected in one place, are the bands which transmit attributes. If the compressor sets bits in the flags word of a class, field, or method, and those bits are assigned (by the compressor) to attributes, then the compressor must also retrieve the stored values governed by each selected attribute layout and transmit them in the bands governed by that layout.
If the compressor marks a class (resp. field, method, or code) as having overflow attributes, it must transmit a corresponding element in the class_attr_count
band (resp. field_attr_count
, method_attr_count
, or code_attr_count
band). This count, in turn, specifies the size of a run in class_attr_indexes
(resp. field_attr_indexes
, method_attr_indexes
, or code_attr_indexes
) of indexes of attribute layouts governing attributes of the class (resp. field, method, or code). The compressor must transmit a attribute layout definition index for each overflow attribute.
For each attribute layout selected by a bit in an element of class_flags
or by an element of class_attr_indexes
, the compressor must retrieve data governed by the layout elements from the class attribute and transmit elements encoding this data in the bands governed by the layout. Similar conditions apply for field, method, and code attributes.
For each attribute layout that transmits data and contains backward calls, the compressor must transmit the number of times each backward callable will be the subject of a backward call as the decompressor works its way through attribute layouts. These call counts are transmitted in the class_attr_calls
, field_attr_calls
, method_attr_calls
, and code_attr_calls
bands, according to the context type of the layouts they apply to. The call counts are transmitted, one per backward callable, in the definition order of the callables. (See the section Attribute Layout Definitions.) Call counts are only transmitted for backward callables which occur within layouts that are used at least once. Call counts do not count entries to any callable because of a forward call, nor do they count the initial call to a callable which initiates attribute processing. A call count might be zero if a backward callable occurs in a layout that is used, but which does not happen to reach the backward callable. These call counts are needed to break a circularity inherent in the sizing of bands for mutually-recursive layouts. They provide the minimum information necessary for the decompressor to locate all the bands in the archive, prior to distributing the band values to the various output classes. For this reason, the call counts are supplied one per layout, summed over all attribute occurrences of that each layout.
The decompressor is responsible for processing any explicit attribute layout definitions transmitted by the compressor. It must prepare to receive the additional bands governed by those layouts. When it reads layout definition indexes, it must prepare to read in the correct number of values transmitted in each of those additional bands.
The grammar of bands defined in this specification includes bands governed by all predefined attribute layouts. For clarity, some band names mention a part of the layout element that created them. Note that the Deprecated attributes have no bands, because their layouts are empty.
Attributes named "Synthetic", although part of the standard class file format in earlier versions of Java, are not directly supported in this specification, because that attribute has been replaced by a new flag bit (ACC_SYNTHETIC, 0x1000) in more recent versions of Java. However, compressors are encouraged to process Synthetic attributes, and any other zero-length attributes not predefined here, by assigning them unused flag bits, and emitting explicit zero-length layout definitions for decompressors to follow.
If the compressor elects to define new layouts, the bands governed by those layouts are added immediately after the bands for the predefined layouts. (The order and structure of the predefined attribute bands reflects the predefined layout definitions in a regular way, as if the compressor had in fact defined them explicitly.)
class_attr_bands:
*class_flags_hi :UNSIGNED5 [#class_count*#have_class_flags_hi]
*class_flags_lo :UNSIGNED5 [#class_count]
*class_attr_count :UNSIGNED5 [COUNT(1<<16,...)]
*class_attr_indexes :UNSIGNED5 [SUM(*class_attr_count)]
*class_attr_calls :UNSIGNED5 [...]
*class_SourceFile_RUN :UNSIGNED5 [COUNT(SourceFile,...)] (null or cp_Utf8)
*class_EnclosingMethod_RC :UNSIGNED5 [COUNT(EnclosingMethod,...)] (cp_Class)
*class_EnclosingMethod_RDN :UNSIGNED5 [COUNT(EnclosingMethod,...)] (null or cp_Descr)
*class_Signature_RS :UNSIGNED5 [COUNT(Signature,..)] (cp_Signature)
class_metadata_bands
ic_local_bands
*class_file_version_minor_H :UNSIGNED5 [COUNT(version,...)]
*class_file_version_major_H :UNSIGNED5 [COUNT(version,...)]
{class_attr_element_bands...}
field_attr_bands:
*field_flags_hi :UNSIGNED5 [SUM(*class_field_count)*#have_field_flags_hi]
*field_flags_lo :UNSIGNED5 [SUM(*class_field_count)]
*field_attr_count :UNSIGNED5 [COUNT(1<<16,...)]
*field_attr_indexes :UNSIGNED5 [SUM(*field_attr_count)]
*field_attr_calls :UNSIGNED5 [...]
*field_ConstantValue_KQ :UNSIGNED5 [COUNT(ConstantValue,...)] (cp_Int, etc.; see note)
*field_Signature_RS :UNSIGNED5 [COUNT(Signature,...)] (cp_Signature)
field_metadata_bands
method_attr_bands:
*method_flags_hi :UNSIGNED5 [SUM(*class_method_count)*#have_method_flags_hi]
*method_flags_lo :UNSIGNED5 [SUM(*class_method_count)]
*method_attr_count :UNSIGNED5 [COUNT(1<<16,...)]
*method_attr_indexes :UNSIGNED5 [SUM(*method_attr_count)]
*method_attr_calls :UNSIGNED5 [...]
*method_Exceptions_N :UNSIGNED5 [COUNT(Exceptions,...)]
*method_Exceptions_RC :UNSIGNED5 [SUM(*method_Exceptions_N)] (cp_Class)
*method_Signature_RS :UNSIGNED5 [COUNT(Signature,...)] (cp_Signature)
method_metadata_bands
*method_MethodParameters_NB :BYTE1 [...]
*method_MethodParameters_name_RUN :UNSIGNED5 [...] (null or cp_Utf8)
*method_MethodParameters_flag_FH :UNSIGNED5 [...] (flag)
{method_attr_element_bands...}
code_attr_bands:
*code_flags_hi :UNSIGNED5 [...*#have_code_flags_hi]
*code_flags_lo :UNSIGNED5 [...]
*code_attr_count :UNSIGNED5 [COUNT(1<<16,...)]
*code_attr_indexes :UNSIGNED5 [SUM(*code_attr_count)]
*code_attr_calls :UNSIGNED5 [...]
*code_StackMapTable_N :UNSIGNED5 [COUNT(StackMapTable,...)]
*code_StackMapTable_frame_T :BYTE1 [SUM(*code_StackMapTable_N)]
*code_StackMapTable_local_N :UNSIGNED5 [COUNT(255,*code_StackMapTable_frame_T)]
*code_StackMapTable_stack_N :UNSIGNED5 [COUNT(255,*code_StackMapTable_frame_T)]
*code_StackMapTable_offset :UNSIGNED5 [...]
*code_StackMapTable_T :BYTE1 [...]
*code_StackMapTable_RC :UNSIGNED5 [COUNT(7,*code_StackMapTable_T)]
*code_StackMapTable_P :BCI5 [COUNT(8,*code_StackMapTable_T)]
*code_LineNumberTable_N :UNSIGNED5 [...]
*code_LineNumberTable_bci_P :BCI5 [...]
*code_LineNumberTable_line :UNSIGNED5 [...]
*code_LocalVariableTable_N :UNSIGNED5 [...]
*code_LocalVariableTable_bci_P :BCI5 [...]
*code_LocalVariableTable_span_O :BRANCH5 [...]
*code_LocalVariableTable_name_RU :UNSIGNED5 [...] (cp_Utf8)
*code_LocalVariableTable_type_RS :UNSIGNED5 [...] (cp_Signature)
*code_LocalVariableTable_slot :UNSIGNED5 [...]
*code_LocalVariableTypeTable_N :UNSIGNED5 [...]
*code_LocalVariableTypeTable_bci_P :BCI5 [...]
*code_LocalVariableTypeTable_span_O :BRANCH5 [...]
*code_LocalVariableTypeTable_name_RU :UNSIGNED5 [...] (cp_Utf8)
*code_LocalVariableTypeTable_type_RS :UNSIGNED5 [...] (cp_Signature)
*code_LocalVariableTypeTable_slot :UNSIGNED5 [...]
{code_attr_element_bands...}
A class possesses a "class-file version" pseudo-attribute if its class file's minor or major version differs from the #default_class_minver
or #default_class_majver
, respectively. This pseudo-attribute is a pair of 16-bit integers giving the major and minor version numbers of the class's file. These integers are stored in the header of the class's file, rather than in an attribute record. As is the convention with classfiles, the minor version number comes first. Thus minor version numbers are transmitted in the band class_file_version_minor_H
, and major version numbers in class_file_version_major_H
. Decompressors are not required to process archives with greater minor or major version numbers than those of the specification they were engineered to process. Decompressors are required to process archives with the same major and smaller or equal minor version numbers.
A transmitted class may possess a local InnerClasses
attribute, which is taken to be an adjustment to that class file's eventually written InnerClasses
attribute. Specifically, the local InnerClasses
attribute specifes a set of four-tuples which must be combined with a corresponding relevant subset drawn from ic_All
. The algorithm for doing this is described in the section Ordering of Constant Pools, but it amounts to starting with the subset of four-tuples from ic_All
which pertain to CONSTANT_Class entries of the output constant pool (barring CONSTANT_Class entries required only by the InnerClasses
attribute itself), and then forming the set symmetric difference between that relevant set and the local InnerClasses
attribute.
The four-tuples of all local InnerClasses
attributes are transmitted in five bands:
ic_local_bands:
*class_InnerClasses_N :UNSIGNED5 [COUNT(InnerClasses,...)]
*class_InnerClasses_RC :UNSIGNED5 [SUM(*class_InnerClasses_N)] (cp_Class)
*class_InnerClasses_F :UNSIGNED5 [SUM(*class_InnerClasses_N)]
*class_InnerClasses_outer_RCN :UNSIGNED5 [COUNT(!=0,*class_InnerClasses_F)] (null or cp_Class)
*class_InnerClasses_name_RUN :UNSIGNED5 [COUNT(!=0,*class_InnerClasses_F)] (null or cp_Utf8)
Each four-tuple <C,F,C2,N>
transmitted in these attribute bands is called a local IC tuple. (Note that the class file format stores the elements of these four-tuples in a different order, (C,C2,N,F)
.) For a given class file X, the sequence of local IC tuples is called ic_Local(X)
.
For each local IC tuple, the compressor transmits, at minimum, a C
value and an F
value. If a transmitted flags value F
is not zero, there are also corresponding transmitted constant pool references C2
and N
. As a whole, the transmitted local tuple may or may not be equivalent to a global tuple from ic_All
.
As an abbreviation, the compressor may transmit only a C
and a flags value of zero, if the local IC tuple to be transmitted is equivalent to a member of ic_All
, and no other member of ic_All
defines the same class C
. In this case the decompressor must behave exactly as if all four tuple components had been explicitly transmitted.
(If all four components of a local tuple are transmitted, and the flags value to be transmitted is in fact zero, the value 0x00010000
must be transmitted instead of zero for the flags.)
Frequently, the compressor will not need to transmit any local IC tuples at all, since the set of four-tuples to be stored in the class's InnerClasses
attribute will be exactly ic_Relevant(X)
, with no adjustment required. If extraneous four-tuples (beyond those required by a minimized constant pool) are found in the input class file, then some local IC tuples will be needed (with zero flags), to provide local "roots" for the extra InnerClasses
entries. This can occur if a compiler requires an InnerClasses
entry for a class mentioned only in a signature. Compressors are required to predict which classes require such extra local roots, and transmit only the unexpected four-tuples.
There are nine groups of bands governed by the metadata layouts. They are summarized here.
class_metadata_bands:
class_RVA_bands
class_RIA_bands
field_metadata_bands:
field_RVA_bands
field_RIA_bands
method_metadata_bands:
method_RVA_bands
method_RIA_bands
method_RVPA_bands
method_RIPA_bands
method_AD_bands
class_RVA_bands:
*class_RVA_anno_N :UNSIGNED5 [...]
*class_RVA_type_RS :UNSIGNED5 [...] (cp_Signature)
*class_RVA_pair_N :UNSIGNED5 [...]
*class_RVA_name_RU :UNSIGNED5 [...] (cp_Utf8)
*class_RVA_T :BYTE1 [...]
*class_RVA_caseI_KI :UNSIGNED5 [...] (cp_Int)
*class_RVA_caseD_KD :UNSIGNED5 [...] (cp_Double)
*class_RVA_caseF_KF :UNSIGNED5 [...] (cp_Float)
*class_RVA_caseJ_KJ :UNSIGNED5 [...] (cp_Long)
*class_RVA_casec_RS :UNSIGNED5 [...] (cp_Signature)
*class_RVA_caseet_RS :UNSIGNED5 [...] (cp_Signature)
*class_RVA_caseec_RU :UNSIGNED5 [...] (cp_Utf8)
*class_RVA_cases_RU :UNSIGNED5 [...] (cp_Utf8)
*class_RVA_casearray_N :UNSIGNED5 [...]
*class_RVA_nesttype_RS :UNSIGNED5 [...] (cp_Signature)
*class_RVA_nestpair_N :UNSIGNED5 [...]
*class_RVA_nestname_RU :UNSIGNED5 [...] (cp_Utf8)
class_RIA_bands:
*class_RIA_anno_N :UNSIGNED5 [...]
(analogous to class_RVA_bands)
field_RVA_bands:
*field_RVA_anno_N :UNSIGNED5 [...]
(analogous to class_RVA_bands)
field_RIA_bands:
*field_RIA_anno_N :UNSIGNED5 [...]
(analogous to field_RVA_bands)
method_RVA_bands:
*method_RVA_anno_N :UNSIGNED5 [...]
(analogous to class_RVA_bands)
method_RIA_bands:
*method_RIA_anno_N :UNSIGNED5 [...]
(analogous to method_RIA_bands)
method_RVPA_bands:
*method_RVPA_param_NB :BYTE1 [...]
*method_RVPA_anno_N :UNSIGNED5 [...]
(analogous to method_RVA_bands)
method_RIPA_bands:
*method_RIPA_param_NB :BYTE1 [...]
*method_RIPA_anno_N :UNSIGNED5 [...]
(analogous to method_RVPA_bands)
method_AD_bands
*method_AD_T :BYTE1 [...]
*method_AD_caseI_KI :UNSIGNED5 [...] (cp_Int)
*method_AD_caseD_KD :UNSIGNED5 [...] (cp_Double)
*method_AD_caseF_KF :UNSIGNED5 [...] (cp_Float)
*method_AD_caseJ_KJ :UNSIGNED5 [...] (cp_Long)
*method_AD_casec_RS :UNSIGNED5 [...] (cp_Signature)
*method_AD_caseet_RS :UNSIGNED5 [...] (cp_Signature)
*method_AD_caseec_RU :UNSIGNED5 [...] (cp_Utf8)
*method_AD_cases_RU :UNSIGNED5 [...] (cp_Utf8)
*method_AD_casearray_N :UNSIGNED5 [...]
*method_AD_nesttype_RS :UNSIGNED5 [...] (cp_Signature)
*method_AD_nestpair_N :UNSIGNED5 [...]
*method_AD_nestname_RU :UNSIGNED5 [...] (cp_Utf8)
As an aid to understanding the metadata layout, here is a brief description of each of the method metadata bands, as used by classfiles which comply with JSR 175. The class and field metadata bands function in a similar way.
Note that JSR 175 does not allow embedded references to CONSTANT_Class entries, even where a reference to a class is required. Instead, JSR 175 requires that references to classes be encoded as field signatures. (Their names are bracketed by 'L' and ';'.) The Pack200 archive transmits all such values as references into cp_Signature
, not cp_Class
.
For each use of a parameter annotation attribute, the method_RVPA_param_NB
band or method_RIPA_param_NB
band (for invisible annotations) transmits an unsigned byte indicating the parameter count. For each use of a non-parameter annotation attribute, the method_RVA_anno_N
band or method_RIA_anno_N
band (for invisible annotations) transmits an annotation count. Likewise, for each annotated parameter, the method_RVPA_anno_N
band or method_RIPA_anno_N
band (for invisible annotations) transmits an annotation count.
For each visible non-parameter annotation, the method_RVA_type_RS
band transmits an annotation's type, and the method_RVA_pair_N
band transmits the number of that annotation's member-value pairs. Analogous values for invisible annotations and parameter annotations are transmitted in the bands method_RIA_type_RS
, method_RIA_pair_N
, method_RVPA_type_RS
, method_RVPA_pair_N
, method_RIPA_type_RS
, and method_RIPA_pair_N
. For each member-value pair transmitted as a direct part of a visible non-parameter annotation, the method_RVA_name_RU
band transmits the member name. Analogous values for invisible annotations and parameter annotations are transmitted in the bands method_RIA_name_RU
, method_RVPA_name_RU
, and method_RIPA_name_RU
.
For each value transmitted with a visible non-parameter annotation, whether directly, or indirectly via a nested value or annotation, the method_RVA_T
band transmits a byte which selects the format of the annotation value, and analogous values are transmitted in the bands method_RIA_T
, method_RVPA_T
, method_RIPA_T
, and (for annotation defaults) method_AD_T
, Direct uses of these value tag bands are counted by summing the values of the corresponding preceding pair-count band (method_RVA_pair_N
, etc.). This count also includes the sum of the corresponding following nested pair-count band (method_RVA_nestpair_N
, etc.) and nested array length bands (method_RVA_casearray_N
, etc.).
Since those latter sums come from backward calls in the attribute layout, they cannot be directly computed by the decompressor, but their sum is reported by the compressor as an element of method_attr_calls
. (Note that the last callable in each metadata layout is the target of two backward calls from inside itself.) That band contains these backward call counts for method_RVA_T
, method_RIA_T
method_RVPA_T
, method_RIPA_T
method_AD_T
, in that order. If there are no occurrences of a method metadata attribute, then the corresponding backward call is omitted. (Thus, there are up to five backward call counts transmitted for method metadata.) If there are compressor-defined recursive layouts used in the archive, their backward call counts follow the counts for metadata in method_attr_calls
.
For each value tag of 'B', 'C', 'I', 'S', or 'Z', there is a corresponding element transmitted in method_RVA_caseI_KI
which supplies the value as a cp_Int
reference. Analogous integer references within invisible and parameter annotations and annotation defaults are transmitted in the bands method_RIA_caseI_KI
, method_RVPA_caseI_KI
, method_RIPA_caseI_KI
, and method_AD_caseI_KI
. For each value tag of 'e', the bands method_RVA_caseet_RS
and method_RVA_caseec_RU
transmit the signature of the class and name of the member of an enumeration constant, as references into cp_Signature
and cp_Utf8
. For each value tag of '[', the band method_RVA_casearray_N
transmits the length of a nested value array. For each value tag of '@', the bands method_RVA_nesttype_RS
and method_RVA_nestpair_N
transmit the class signature of a nested annotation and the number of its pairs. For each pair in such a nested annotation, the band method_RVA_nestname_RU
transmits the name of the pair.
The bands in the following table transmit appropriately-typed constant pool references for each occurrence of specific tag characters.
Tag(s) | Band | Reference |
---|---|---|
'B','C','I','S','Z'
|
method_RVA_caseI_KI
|
cp_Int |
'D'
|
method_RVA_caseD_KD
|
cp_Double |
'F'
|
method_RVA_caseF_KF
|
cp_Float |
'J'
|
method_RVA_caseJ_KJ
|
cp_Long |
'c'
|
method_RVA_casec_RS
|
cp_Signature |
'e'
|
method_RVA_caseet_RS method_RVA_caseec_RU
|
cp_Signature cp_Utf8
|
's'
|
method_RVA_cases_RU
|
cp_Utf8 |
Analogous groups of bands transmit values within the other four method annotation types, and within the visible and invisible annotations of classes and fields.
The last part of the Pack200 archive is a series of bands transmitting the bytecodes themselves. The bytecodes of each method code are parsed into instructions and operands transmitted in their own bands. Each method code corresponds to a run of bytecodes which ends with the distinguished code value 0xFF (255), which serves as an end marker. (Note that it is unusual for band data to be delimited by an end marker: Usually it is sized by a count in a previous band.)
Here are the bands which transmit bytecode instructions:
bc_bands:
*bc_codes :BYTE1 [...]
*bc_case_count :UNSIGNED5 [COUNT(switch,*bc_codes)]
*bc_case_value :DELTA5 [...]
*bc_byte :BYTE1 [...]
*bc_short :DELTA5 [...]
*bc_local :UNSIGNED5 [...]
*bc_label :BRANCH5 [...]
*bc_intref :DELTA5 [...] (cp_Int)
*bc_floatref :DELTA5 [...] (cp_Float)
*bc_longref :DELTA5 [...] (cp_Long)
*bc_doubleref :DELTA5 [...] (cp_Double)
*bc_stringref :DELTA5 [...] (cp_String)
*bc_loadablevalueref :DELTA5 [...] (cp_LoadableValue)
*bc_classref :UNSIGNED5 [...] (current class or cp_Class)
*bc_fieldref :DELTA5 [...] (cp_Field)
*bc_methodref :UNSIGNED5 [...] (cp_Method)
*bc_imethodref :DELTA5 [...] (cp_Imethod)
*bc_indyref :DELTA5 [...] (cp_InvokeDynamic)
*bc_thisfield :UNSIGNED5 [...] (cp_Field, only for current class)
*bc_superfield :UNSIGNED5 [...] (cp_Field, only for current super)
*bc_thismethod :UNSIGNED5 [...] (cp_Method, only for current class)
*bc_supermethod :UNSIGNED5 [...] (cp_Method, only for current super)
*bc_initref :UNSIGNED5 [...] (cp_Field, only for most recent new)
*bc_escref :UNSIGNED5 [COUNT(ref_escape,*bc_codes)] (cp_All)
*bc_escrefsize :UNSIGNED5 [...]
*bc_escsize :UNSIGNED5 [...]
*bc_escbyte :BYTE1 [...]
In order to transmit bytecode instructions more efficiently, some instructions may be rewritten into a transmission form. The ldc, ldc_w, and ldc2_w bytecodes must be rewritten by the compressor into strongly-typed operations; see below. Some getstatic, putstatic, getfield, putfield, invokevirtual, invokespecial, and invokestatic bytecodes may (at the compressor's option) be rewritten into more specialized and compact forms which, because they are context sensitive, select their operands from a smaller set of possibilities.
In all cases, the first byte of each instruction (perhaps rewritten) is determined by a byte transmitted in bc_codes
. All operand bytes are decoded into operands and transmitted in separate bands according to the type of operand encoded. The padding bytes of switch instructions and the last two bytes of invokeinterface instructions are discarded, because they can be reconstructed by the decompressor.
A "wide" prefix bytecode is transmitted in bc_codes
. The instruction following the prefix is decoded in its wide format, but (except in the case of iinc) it is transmitted in the same way, and using the same bands, as the normal (non-wide) format. The wide format of the iinc instruction uses bc_short
instead of bc_byte
to transmit its second operand.
Every branch instruction transmits its target (or targets, in the case of the switches) in the bc_label
band, encoded as the difference between the renumbered BCI of the branch target and the renumbered BCI of the branch itself. (The renumbering, as described in the section Attribute Layout Definition, numbers the first instruction as zero, the second as one, and so forth.) Differences between renumbered BCIs are very compact; the primary encoding of BRANCH5 also takes advantage of the fact that most such differences are positive, because most branches are forward branches.
The bc_classref
band carries incremented indexes into the cp_Class
constant pool. The incrementing (by unity) reserves the code zero, which always refers to the current class. (This provides a a compact form for the most common class reference, which is a self-reference.)
The bc_byte
and bc_short
bands carry fixed-sized operands. These operands are not transmitted as general 32-bit integers because their instructions, as originally defined, already serve as specially compressed transmission formats. Integer values of these bands are treated as unsigned, even when they are semantically signed as operands. The bc_byte
band is used for the last operands of multianewarray and non-wide iinc instructions, and for the operands of bipush and newarray instructions. The bc_short
band is used for the last operands of wide iinc instructions, and for the operands of sipush instructions.
The bands bc_intref
, bc_floatref
, bc_longref
, bc_doubleref
, bc_stringref
, bc_fieldref
, bc_methodref
, and bc_imethodref
transmit ordinary indexes into the constant pools cp_Int
, cp_Float
, cp_Long
, cp_Double
, cp_String
, cp_Field
, cp_Method
, and cp_Imethod
, respectively.
The band bc_indyref
transmits indexes into cp_InvokeDynamic
, for invokedynamic
instructions.
The band bc_loadablevalueref
transmits indexes into the constant pool group cp_LoadableValue
, such that zero refers to the first CONSTANT_Integer constant, and so on. It is transmitted immediately after bc_stringref
. (Typically, this band is used for cp_MethodHandle
constants.)
(The invokedynamic
instructions may only appear when #archive_majver
is 170 or higher. They and their related constant pool types were introduced in recent revisions of the classfile format.)
For every lookupswitch or tableswitch instruction, the number of cases is transmitted in bc_case_count
, and the default label is transmitted in bc_label
. Each case value and case target of a lookupswitch is transmitted in bc_case_value
and bc_label
, respectively. The initial case value of a tableswitch is transmitted in bc_case_value
, and each of its case targets is transmitted in bc_label
.
If the first byte of an instruction has a code in the range [202..255], or if an instruction's operands cannot be parsed according to the requirements of this specification, it is a non-standard instruction. Every byte of a non-standard instruction must be transmitted in a special envelope in order to warn the decompressor to accept it literally. The envelope consists of a series of "byte_escape" and "ref_escape" opcodes in the bc_code
band, corresponding sizes in bc_escsize
and bc_escrefsize
, bytes in bc_escbyte
, and constant pool references in bc_escref
. Every constant pool reference in a non-standard instruction must be escaped and transmitted in the bc_escref
band. Each ref_escape wraps one such constant pool reference, of size up to four bytes. The transmitted reference is an index into cp_All
, such that zero refers to the first CONSTANT_Utf8 constant, and so on.
Escape Opcode |
Size Operand |
Operand Size |
Size Band |
Data Band |
Opcode Value |
---|---|---|---|---|---|
byte_escape | N<=255 | N bytes |
bc_escsize
|
bc_escbyte
|
254 |
ref_escape | N<=2 | N bytes |
bc_escrefsize
|
bc_escref
|
253 |
There is no need for special transmission of any other operand type within non-standard instructions, because the Pack200 format exactly preserves all instruction bytes except for constant pool references. This specification does not address the means by which compressors adapt to the presence and format of non-standard instructions. It simply requires decompressors to decode them correctly.
Here is a table of the specially rewritten transmission forms for instructions:
Original Instruction |
Operand |
Transmission Instruction |
Rewrite Required? |
Opcode |
---|---|---|---|---|
ldc | cp_String[i] | sldc | yes | 18 (=ldc) |
ldc | cp_Class[i] | cldc | yes | 233 |
ldc | cp_Int[i] | ildc | yes | 234 |
ldc | cp_Float[i] | fldc | yes | 235 |
ldc_w | cp_String[i] | sldc_w | yes | 19 (=ldc_w) |
ldc_w | cp_Class[i] | cldc_w | yes | 236 |
ldc_w | cp_Int[i] | ildc_w | yes | 237 |
ldc_w | cp_Float[i] | fldc_w | yes | 238 |
ldc2_w | cp_Long[i] | lldc2_w | yes | 20 (=ldc2_w) |
ldc2_w | cp_Double[i] | dldc2_w | yes | 239 |
ldc | cp_LoadableValue[i] | qldc | yes | 240 |
ldc_w | cp_LoadableValue[i] | qldc_w | yes | 241 |
getstatic | (this class member) | getstatic_this | no | 202 |
putstatic | (this class member) | putstatic_this | no | 203 |
getfield | (this class member) | getfield_this | no | 204 |
putfield | (this class member) | putfield_this | no | 205 |
invokevirtual | (this class member) | invokevirtual_this | no | 206 |
invokespecial | (this class member) | invokespecial_this | no | 207 |
invokestatic | (this class member) | invokestatic_this | no | 208 |
aload_0; getstatic | (this class member) | aload_0_getstatic_this | no | 209 |
aload_0; putstatic | (this class member) | aload_0_putstatic_this | no | 210 |
aload_0; getfield | (this class member) | aload_0_getfield_this | no | 211 |
aload_0; putfield | (this class member) | aload_0_putfield_this | no | 212 |
aload_0; invokevirtual | (this class member) | aload_0_invokevirtual_this | no | 213 |
aload_0; invokespecial | (this class member) | aload_0_invokespecial_this | no | 214 |
aload_0; invokestatic | (this class member) | aload_0_invokestatic_this | no | 215 |
getstatic | (super class member) | getstatic_super | no | 216 |
putstatic | (super class member) | putstatic_super | no | 217 |
getfield | (super class member) | getfield_super | no | 218 |
putfield | (super class member) | putfield_super | no | 219 |
invokevirtual | (super class member) | invokevirtual_super | no | 220 |
invokespecial | (super class member) | invokespecial_super | no | 221 |
invokestatic | (super class member) | invokestatic_super | no | 222 |
aload_0; getstatic | (super class member) | aload_0_getstatic_super | no | 223 |
aload_0; putstatic | (super class member) | aload_0_putstatic_super | no | 224 |
aload_0; getfield | (super class member) | aload_0_getfield_super | no | 225 |
aload_0; putfield | (super class member) | aload_0_putfield_super | no | 226 |
aload_0; invokevirtual | (super class member) | aload_0_invokevirtual_super | no | 227 |
aload_0; invokespecial | (super class member) | aload_0_invokespecial_super | no | 228 |
aload_0; invokestatic | (super class member) | aload_0_invokestatic_super | no | 229 |
invokespecial | (this class <init>) | invokespecial_this_init | no | 230 |
invokespecial | (super class <init>) | invokespecial_super_init | no | 231 |
invokespecial | (new class <init>) | invokespecial_new_init | no | 232 |
invokespecial | cp_Imethod[i] | invokespecial_int | yes | 242 |
invokestatic | cp_Imethod[i] | invokestatic_int | yes | 243 |
The qldc
and qldc_w
instructions use indexes into the constant pool group cp_LoadableValue
to select arbitrary loadable constants. (These instructions may only appear when #archive_majver
is 170 or higher. Early versions of the classfile format do not require them.) Decompressors are required to accept any constant reference as an operand to one of these instructions, but compressors are encouraged to transmit only constants which cannot be encoded by other variants of ldc
, specifically elements of cp_MethodHandle
and cp_MethodType
.
(Note: A previous version of this specification referred to the string-bearing variants of ldc
, sldc
and sldc_w
, as aldc
and aldc_w
. Despite the new names, the code points and their interpretation have not changed. The old names are deprecated.)
Every bytecode instruction is contained by a class, called the current class. The superclass (if any) of the current class is the current super class. The operand of the textually most recent "new" instruction (in the same method) is called the current new class.
If an instruction refers to a field or method in the current class, it may (at the compressor's option) be rewritten for transmission as a corresponding opcode spelled with "_this". Likewise, an instruction referring to a field or method in the current super class may be rewritten as a corresponding opcode spelled with "_super". In either case, if the immediately preceding instruction is aload_0 (opcode 42), the transmission of that instruction may be suppressed by the compressor, and the corresponding opcode spelled with "aload_0_" selected instead; otherwise the "aload_0_" variant may not be selected.
If an invokespecial instruction refers to a method named <init> in the current class, the current super class, or the current new class, the compressor may choose to rewrite it as an invokespecial_this_init, invokespecial_super_init, or invokespecial_new_init, respectively.
If an invokespecial or an invokestatic instruction has an operand of type CONSTANT_InterfaceMethodref, then the compressor must transmit those methods using the pseudo-instructions invokespecial_int and invokestatic_int respectively. These instructions may only appear when #archive_majver
is 171 or higher.
Field (resp. method) operands of rewritten instructions spelled with "_this" (but not "_init") are transmitted in the special band bc_thisfield
(resp. bc_thismethod
). The numbering of these operands is defined by taking the sequence of symbols in cp_Field
(resp. cp_Method
) and selecting only members of the current class. The resulting subset, without changing its order, is renumbered starting with zero. This provides a compact mapping of small integers to members of the current class.
Likewise, operands of rewritten instructions spelled with "_super" (but not "_init") are transmitted in the bc_superfield
or bc_supermethod
band, and renumbered as a subset of cp_Field
or cp_Method
, selecting only members of the current super class.
Finally, operands of the rewritten instructions spelled with "_init" are transmitted in the band bc_initref
, and renumbered as a subset of cp_Method
, selected according to the appropriate class (current, current super, or current new), and also selected to have the name <init>. (The index transmitted is usually very small; in effect it selects only the signature of the invoked method, and most classes possess only a few constructors.)
Here is a table summarizing the transmission of instructions of more than one byte.
Instruction | Operand |
Transmitted Value |
Band |
---|---|---|---|
bipush | (byte) x | x & 0xFF |
bc_byte
|
sipush | (short) x | x & 0xFFFF |
bc_short
|
ildc | cp_Int[i] | i |
bc_intref
|
fldc | cp_Float[i] | i |
bc_floatref
|
sldc | cp_String[i] | i |
bc_stringref
|
qldc | cp_LoadableValue[i] | i |
bc_loadablevalueref
|
cldc | current class | 0 |
bc_classref
|
cldc | cp_Class[i] | i+1 |
bc_classref
|
ildc_w | cp_Int[i] | i |
bc_intref
|
fldc_w | cp_Float[i] | i |
bc_floatref
|
sldc_w | cp_String[i] | i |
bc_stringref
|
qldc_w | cp_LoadableValue[i] | i |
bc_loadablevalueref
|
cldc_w | current class | 0 |
bc_classref
|
cldc_w | cp_Class[i] | i+1 |
bc_classref
|
lldc2_w | cp_Long[i] | i |
bc_long
|
dldc2_w | cp_Double[i] | i |
bc_double
|
*load | locals[i] | i |
bc_local
|
*store | locals[i] | i |
bc_local
|
ret | locals[i] | i |
bc_local
|
iinc | locals[i] | i |
bc_local
|
iinc (not wide) | (byte) x | x & 0xFF |
bc_byte
|
iinc (wide) | (short) x | x & 0xFFFF |
bc_short
|
if** | pc | (delta/renumbered) |
bc_label
|
if_** | pc | (delta/renumbered) |
bc_label
|
goto | pc | (delta/renumbered) |
bc_label
|
jsr | pc | (delta/renumbered) |
bc_label
|
goto_w | pc | (delta/renumbered) |
bc_label
|
jsr_w | pc | (delta/renumbered) |
bc_label
|
tableswitch | case count | count |
bc_case_count
|
tableswitch | default pc | (delta/renumbered) |
bc_label
|
tableswitch | first case value | value |
bc_case_value
|
tableswitch | each case pc | (delta/renumbered) |
bc_label
|
lookupswitch | case count | count |
bc_case_count
|
lookupswitch | default pc | (delta/renumbered) |
bc_label
|
lookupswitch | each case value | value |
bc_case_value
|
lookupswitch | each case pc | (delta/renumbered) |
bc_label
|
new | current class | 0 |
bc_classref
|
new | cp_Class[i] | 1+i |
bc_classref
|
newarray | type code | value |
bc_byte
|
anewarray | current class | 0 |
bc_classref
|
anewarray | cp_Class[i] | 1+i |
bc_classref
|
checkcast | current class | 0 |
bc_classref
|
checkcast | cp_Class[i] | 1+i |
bc_classref
|
instanceof | current class | 0 |
bc_classref
|
instanceof | cp_Class[i] | 1+i |
bc_classref
|
multianewarray | cp_Class[i] | 1+i |
bc_classref
|
multianewarray | rank | rank & 0xFF |
bc_byte
|
getstatic | cp_Field[i] | i |
bc_fieldref
|
putstatic | cp_Field[i] | i |
bc_fieldref
|
getfield | cp_Field[i] | i |
bc_fieldref
|
putfield | cp_Field[i] | i |
bc_fieldref
|
invokevirtual | cp_Method[i] | i |
bc_methodref
|
invokespecial | cp_Method[i] | i |
bc_methodref
|
invokestatic | cp_Method[i] | i |
bc_methodref
|
invokeinterface | cp_Imethod[i] | i |
bc_imethodref
|
invokespecial_int | cp_Imethod[i] | i |
bc_imethodref
|
invokestatic_int | cp_Imethod[i] | i |
bc_imethodref
|
invokedynamic | cp_InvokeDynamic[i] | i |
bc_indyref
|
**_this | this_fields[i] | i |
bc_thisfield
|
**_this | this_methods[i] | i |
bc_thismethod
|
**_super | super_fields[i] | i |
bc_superfield
|
**_super | super_methods[i] | i |
bc_supermethod
|
invokespecial_this_init | this_constructors[i] | i |
bc_initref
|
invokespecial_super_init | super_constructors[i] | i |
bc_initref
|
invokespecial_new_init | new_constructors[i] | i |
bc_initref
|
A Pack200 band consists of a sequence of small integers, each of which is representable in 32 bits. Depending on the intended use of these numbers, they may be interpreted to possess a twos-complement sign.
Each integer is coded for transmission as a byte sequence, using between one and five bytes, according to a previously determined encoding. The encoding may by a primary encoding for the current band as specified as part of this Pack200 archive format specification, or the encoding may depend on information transmitted before the occurrence of the encoded integer.
(Note: In this account of encodings, bytes are indivisible octets which express unsigned values in [0,255].)
There is a small but flexible set of possible encodings allowed by the archive. They are all based on a two-parameter scheme of codings for small unsigned integers, which is further parameterized by options for sign representation and delta encoding, and by special modes for slowly varying encodings, and for the compact renaming of frequent values.
The encoding of a small whole number depends on two independent parameters B and H, and a derived parameter L:
Name | Range | Meaning |
---|---|---|
B | [1..5] | maximum byte length |
H | [1..256] | number of high byte values |
L | [0..255] | number of low byte values, defined as (256-H) |
Given any two values B and H, there is a coding (B,H) which establishes a one-to-one correspondence between an initial sequence of non-negative integers, and an "encoding sets" of short sequences of bytes.
Moreover, no byte sequence in the encoding set is a proper prefix of another byte sequence in the same set. prefix. This means, informally, that encodings are self-sizing, or "parseable".
Also, the encoding set is as full as possible, so that every long-enough sequence of bytes begins with a unique prefix of bytes drawn from the encoding set. In particular, every sequence of B bytes has a (unique) prefix in the encoding set for the (B,H) coding.
Relative to a given (B,H) coding, we say that a byte is "high" if its value is in the range [256-H..255]
. We also say that a byte is "low" if L is positive and the byte's value is in the range [0..L-1]
Given a (B,H) coding, a sequence of bytes is in its encoding set if and only if all of the following are true:
As a consequence of this definition, the encoding set be regarded as satisfying this regular expression, in terms of high and low bytes:
encoding_set = (high* low) | (high)^B
(Note: In this specification the up-caret sign '^' denotes exponentiation, either of regular expressions as above, or of numbers.)
If n is less than B then the number of (B,H) encodings of length exactly n is (H^(n-1) * L)
. The number of (B,H) encodings of length exactly B is (H^(B-1) * (L+H))
. The sum of these values for all n in [1..B]
is the total size of the encoding set for a (B,H) coding. It is called Card(B,H) and is therefore defined as:
Card(B,H) = (L * (1-H^B)/(1-H)) + H^B
if H>1, or else
Card(B,1) = B*255+1
If H is 256, there are no low bytes, and the encoding set consists of all possible sequences of exactly B bytes, and Card(B,H) is 256^B
.
Otherwise, the encoding set consists of sequences of various lengths, from 1 to B inclusive. In particular, there are L one-byte sequences. We will use these elsewhere to encode the "favored" integers of the highest frequency. Note that the encoding rules specified elsewhere in this document are designed to produce smaller integers more frequently than larger ones.
Note that L can be viewed as a "sharpness" parameter, expressing the sharpness of the distribution about zero of the expected values to be encoded. The larger L is, the more the encoding favors distributions of values which are close to zero, and rarely far from it. Larger H values favor "flatter" distributions, up to H=256, L=0, which favorably encodes completely random data within the range of the coding.
The range of a (B,H) coding is the integers in [0..Card(B,H)-1]
. (Note that this range applies only to the unsigned codings introduced so far in this section.)
Given the sequence of N byte values b[0] .. b[N-1], the sorting-based definition given above implies that the decoded integer value of this byte sequence is the little-endian base-H scaled sum of the bytes, and is called Decode(B,H; b):
Decode(B,H; b[*]) = Sum[0<=i<N]( b[i] * H^i )
As a consequence of this definition, each byte sequence in the encoding set corresponds uniquely to an arithmetic integer value in that range. This correspondence may be observed by ordering the byte sequences first by length, and second according to their antilexicographic (little-endian) order. Each element in resulting sequence decodes into its own zero-based index within the sequence.
Note that the number zero is always encoded by a sequence of B zero bytes (if H is 256) or else by single zero byte.
The largest encoded arithmetic value is always encoded by a sequence of B bytes of value 255.
Note that the (B,H) encodings (1,256), (2,256), and (4,256) are identical with the conventional little-endian representations of unsigned bytes, unsigned 16-bit integers, and unsigned 32-bit integers.
Note that some may represent arithmetic values greater than 2^31-1
or even 2^32-1
. This is a feature of the (B,H) coding scheme. (But sometimes intermediate 64-bit values are required when multiple sign bits are being used, as described below. The rules given below for signed values require the compressor to emit the simplest encoding that will suffice to transmit the required 32-bit integer, even if more complex sequences would produce the same 32-bit value, after truncation.)
Note that implementations are not always free to truncate intermediate values when computing (B,H) coding values. However, the final decoded values (after restoration of signs) will always fit in 32 bits, as described in the next section.
In order to encode signed 32-bit numbers we add another parameter to the (B,H) coding scheme.
Name | Range | Meaning |
---|---|---|
B | [1..5] | maximum byte length |
H | [1..256] | number of high byte values |
L | [0..255] | number of low byte values, defined as (256-H) |
S | [0..2] | number of sign bits |
The whole number U obtained from a (B,H) coding is converted to a signed 32-bit value X in a manner which depends on the parameter S. This process is called sign conversion.
Informally, sign conversion is performed by 32-bit truncation if S=0. Otherwise, the low-order bits of U are collectively treated as a sign bit, so that X will be negative only if all sign bits are set. The two conversions from U to positive X and U to negative X are independently dense and monotonic in opposite directions, as defined below.
Unlike the (B,H) coding, a (B,H,S) coding represents only numbers in a specifically defined range, which is never larger than [-2^31..2^31-1]
.
In what follows, for each (B,H,S) coding, we define Range(B,H,S), and the method used to represent 32-bit signed values with that coding.
A (B,H) encoding byte sequence is not a legal input to a corresponding (B,H,S) coding if it represents a whole number U which does not convert into the range of the (B,H,S) coding. Also, if there are two byte sequences encoding values U1 and U2 which both map into the same point X in the signed range, only the byte sequence for the smaller of U1 and U2 is legal.
Thus, a (B,H,S) coding may have a smaller cardinality than the corresponding (B,H) coding, due to illegal encoding byte sequences.
The range of a (B,H,S) coding contains all 32-bit signed integers [-2^31..2^31-1]
if Card(B,H) is 2^32
or more. (Negative values are represented by 31-bit unsigned overflow.) If the coding encodes fewer unsigned values, the top of its range is clipped to 2^31-1
, and the bottom of its range is clipped to -2^31
.
More precisely, the range of a (B,H,S) coding is called Range(B,H,S) and is defined (partially, for S=0) as:
Range(B,H,S) = [-2^31..2^31-1]
if S=0, Card(B,H) >= 2^32, or
Range(B,H,S) = [0..2^31-1]
if S=0, 2^31 < Card(B,H) < 2^32, or else
Range(B,H,S) = [0..Card(B,H)-1]
if S=0, Card(B,H) <= 2^31
A compressor is not allowed to emit a byte sequence which, when interpreted according to the current coding, decodes to a value outside of that coding's range.
(Note that decompressors can produce arbitrary answers for out-of-range codings, since compressors are not allowed to emit them. This allows decompressors to use 32-bit arithmetic for most calculations, ignoring the possibility of undesired wraparond.)
If S>0, the decoding of a byte sequence is performed first by decoding it as a (B,H) coding, and then by converting the decoded whole number into a signed 32-bit number. This sign conversion depends only on the arithmetic value U obtained from the (B,H) coding, and on the value of S. The arithmetic value must be preserved beyond 32 bits of precision, if the sign conversion will produce a value within the range of the coding. If the cardinality of the (B,H) coding is 2^32
or more, the (B,H,S) coding produces all possible signed 32-bit negative values, by 32-bit truncation of the intermediate unsigned whole number. The intermediate unsigned arithmetic can be carried out (with some care) using 32-bit unsigned numbers, or it can be carried out on 64-bit signed or unsigned numbers.
More precisely, the sign conversion operation SignConvert(S;U) is defined as:
SignConvert(S; U) = Cast32(U)
if S==0; or
SignConvert(S; U) = U - Floor(U / 2^S)
if S>0, (U % 2^S) < 2^S-1, Card(B,H) < 2^32
SignConvert(S; U) = Cast32(U - Floor(U / 2^S))
if S>0, (U % 2^S) < 2^S-1, Card(B,H) >= 2^32
SignConvert(S; U) = -Floor(U / 2^S)-1
if S>0, (U % 2^S) == 2^S-1
The conversion from a large unsigned number of a negative 32-bit number via wraparound is simply the familiar down-cast to a signed 32-bit int. It may be defined mathematically as:
Cast32(U) = ((U + 2^31) mod 2^32) - 2^31
Note that sign conversion always maps zero to zero. The reader may also verify that, if S>0, the range of SignConvert includes [-1..1], and that it is dense and monotonic in any wholly positive or wholly negative sub-range.
As a consequence of the definition of SignConvert(S;U)
, the S low-order bits of U function as a sign bit. In effect, U is partitioned into a sign field U0 (consisting of the S low-order bits) and a significand U1 (consisting of all other bits of U).
In this alternative view of the definition of SignConvert
, if all S bits of U0 are set, then the decoded signed value is the ones complement of U1. (Since this includes the indefinite number of zero high-order bits of U, the resulting value is arithmetically negative.)
Otherwise, if some of the S bits of U0 are clear, then the significand U1 is scaled by the number of possible values of U0 (i.e., (2^2)-1 if S is 2) and U0 is added back in.
This provides a one-to-one correspondence between some portion of the arithmetic interval [0..Card(B,H)-1]
and Range(B,H,S)
.
The integers "favored" by this encoding are those of small magnitude, but they can have either sign. If S is one, the encoding has a balanced number of positive and negative favored values. If S is two (or more), the encoding is skewed toward positive numbers, but also favors a few negative numbers. We use such codings elsewhere to encode values, such as branch displacements or deltas of mostly-sorted data, which are usually positive but sometimes negative.
Note that if S=1, this definition of sign conversion is equivalent to an unsigned right shift followed by an exclusive-or of the sign bit with all remaining bits:
SignConvert(1; U) = (U >>> 1) ^ -(U & 1)
For S>0, Range(B,H,S) is simply defined as the intersection of the 32-bit signed range [-2^31..2^31-1]
with the set SignConvert(S; [0..Card(B,H)-1])
, obtained by element-wise application of sign conversion to every arithmetically representable value of (B,H). Therefore, we can complete the definition of Range as follows:
Range(B,H,S) = [-2^31..2^31-1]
if Card(B,H) >= 2^32, or
Range(B,H,S) = [0..2^31-1]
if S=0, 2^31 < Card(B,H) < 2^32, or else
Range(B,H,S) = [0..Card(B,H)-1]
if S=0, Card(B,H) <= 2^31
Range(B,H,S) = [max( -2^31, min SignConvert(S; Range(B,H,0)) )
.. min( 2^31-1, max SignConvert(S; Range(B,H,0)) )]
(Note: When computing the bounds of a signed encoding (B,H,S), it is useful to start with the largest unsigned value M of (B,H,0) and perform sign-conversion on M, M-1, M-2, etc., noting the first positive and the first negative value to be encountered. These will be the inclusive bounds of the signed encoding's range.)
The cardinality of a (B,H) coding was defined above:
Card(B,H) = (L * (1-H^B)/(1-H)) + H^B
The cardinality of a (B,H,S) coding is determined by that of its range:
Card(B,H,S) = Card Range(B,H,S) <= Card(B,H)
As defined, the range of any coding (B,H,S) is dense about zero and can therefore be expressed as a closed interval:
Range(B,H,S) = [Min(B,H,S)..Max(B,H,S)]
-2^31 <= Min(B,H,S) <= 0
0 < Max(B,H,S) <= 2^31-1
Certain additional attributes can be predicated of codings. A coding can either be signed or not, and it can be a full-range coding, or a sub-range coding (or neither).
A (B,H,S) coding is signed if it can encode at least one negative value. This is true if and only if S is non-zero or S is zero and Card(B,H) is 2^32 or more.
A (B,H,S) coding is full-range if Range(B,H,S) is [-2^31..2^31-1]
.
A sub-range coding delivers less than 2^31 distinct values. More precisely, a coding (B,H,S) is sub-range if Card(B,H,S) <= 2^31-1
. (Some codings are neither full-range nor sub-range codings.)
As a consequence of these definitions, any coding for which B<=3 is a sub-range coding. Longer codings can also be sub-ranges if they are "sharp" enough. Some examples are (4,192,0), (5,32,0), and (5,32,1).
Also, the unsigned coding (4,256,0) is full-range, as is the signed version (4,256,1), but not the doubly-signed version (4,256,2).
The coding (4,255,0) is neither a sub-range nor a full-range coding, since its cardinality is 2^31. It represents only non-negative 32-bit integers, but its cardinality cannot be so represented.
Every single integer in the archive format is encoded according to some coding in the scheme of (B,H,S) codings. In addition, special attention is paid to sequences of integers (in the "bands" described elsewhere) which show statistical regularity. In particular, if a sequence of integers is found to exhibit a pattern of small, regular first-order differences, those differences may be encoded in place of the values themselves. This is described by a fourth coding parameter added to the (B,H,S) scheme.
Name | Range | Meaning |
---|---|---|
B | [1..5] | maximum byte length |
H | [1..256] | number of high byte values |
L | [0..255] | number of low byte values, defined as (256-H) |
S | [0..2] | number of sign bits |
D | [0..1] | delta encoding order |
If D is zero, no differencing is performed, and the (B,H,S,0) coding is in all ways identical to the corresponding (B,H,S) coding. If D is one, the values are encoded in terms of their successive differences.
(Note: Sequences which are mostly monotonically ascending will often be favorably encoded using unsigned delta codings of the form (5,H,0,1).)
Suppose a sequence of values X[i] is given, for some range of i in [0..N-1]. This sequence will be representable by a (B,H,S,1) coding only if the (B,H,S) coding is a sub-range or a full-range coding. (If it is neither, there is no legal (B,H,S,1) coding.)
The sequence X[i] will be representable, only if there is a series of "delta values" D[i] whose partial sums Sum[j<=i](D[j])
can represent the values X[i].
Each (B,H,S,1) coding has a range, in which all X[i] values must lie. This range is defined as Range(B,H,S,1) = Range(B,H,0)
. The range of a delta coding (B,H,S,1) contains negative numbers only if it is based on a full-range coding (B,H,S). Otherwise, the delta coding can only represent non-negative numbers. In practice, this is not a severe restriction, since in practice very few band elements are negative.
The partial sums Sum[j<=i](D[j]) are allowed to leave the range, but the final X[i] values are always brought back into range by adding or subtracting a multiple of Card Range(B,H,0).
More precisely:
X[i] = Sum[j<=i](D[j]) mod Card(B,H,0)
if (B,H,S) is sub-range
X[i] = (int32) Sum[j<=i](D[j])
if (B,H,S) is full-range
(Here, the cast to int32
denotes truncation to 32 bits.)
Note that implementations can simply ignore 32-bit wraparound when computing partial sums, if the coding is full-range. Otherwise, more care must be taken to bring partial sums back into range by subtracting or adding multiples of the range cardinality. Also, 32-bit wraparound may be a hazard for ranges of size 2^30 or more. It is possible to implement this using only 32-bit signed arithmetic.
The coding methods described so far are designed to favor values which, on average, tend to be near zero or near the preceding value. Some data sets have marked statistical patterns in which values have only weak arithmetic relations (to zero or to each other), but which show strong repetitive patterns, especially in their first-order statistics.
The compressor may elect to use a population-based coding transformation in such cases. This transformation takes a sequence S of N arithmetic values and converts it into three value sequences:
Each of these sequences (F, T, and U) is in turn treated as a sub-band, and transmitted with an independent encoding. Each non-zero value in T indexes a value from F, while each zero value in T selects (in order) a value from U:
S[i] = F[T[i]-1]
if T[i] != 0
U = { S[i] such that T[i] == 0 }
The table F contains any number of integer values. There is no restriction on their number or identity, except that no value in F may be repeated, except the last, and F must not contain unused values. The last value in F must be a repetition of an earlier occurrence of a "sentinel value" of F. This sentinel marks the end of the table F, allowing the decompressor to prepare to read T and U.
The "central value" X of F is defined as that element of F which is arithmetically closest to zero. If F contains both a positive X and a negative -X of minimal absolute value, then -X is defined as the central value. (I.e., a negative sign is the tie-breaker. The central value is chosen so as to have a favorable encoding.)
(Note that implementations may compare values for centrality by using the 32-bit signed int expression (X>>31)^(X<<1)
as an unsigned comparison key.)
A valid sentinel value is either the central value of F, or the last value of F. Thus, while parsing F, the decompressor must look for a repetition of either the central value (so far) or of the immediately previous value.
Let K be the number of values in F, ignoring the repetition of the sentinel value. In the next band, values in [1..K]
will refer (as 1-origin indexes) to corresponding elements of F.
The sequence T follows the table F, and encodes the input of the population transformation in a direct, element-by-element manner.
Each element of T is a token for either a favored or an unfavored value, encoded in the arithmetic range [0..K]
. Each non-zero value corresponds in order to an element of F, and produces that favored value. Each zero value in T is a placeholder for an "unfavored" value, which is still unknown.
As mentioned above, each element of F must be referred to by at least one element of T. That is, the following two sets must be the same:
{ F[T[i]-1] such that T[i] != 0 }
{ F[j] }
The statistics of the sequence T are quite favorable for compact encoding in a suitable (B,H) scheme, assuming that the coder made good choices for the contents of F. Generally speaking, the most commonly occurring input values should be given early positions in F.
A greedy algorithm could sort the input values by number of occurrences and place the most common values at the front of F. This would be likely to produce an improved encoding for the input. Such techniques are neither mandated nor discouraged by this specification.
The third subsequence consists of values which were not representable by indexes into the table F. Let Z be the number of zeroes encountered in T. There is a ordered, one-to-one correspondence between the Z zeroes in T and all the Z values of U.
Taken together, the elements of F and U selected by the elements of T must be identical in value and order to the input of the population transformation.
The decompressor can then read F into memory, read T into memory, and then make a second pass over T, translating token values, either referring to F or reading from U as necessary.
Sometimes, in a very long sequence of values, the statistics will change slowly, making an initially suitable coding tactic become unsuitable.
An adaptive coding method handles this situation by allowing value sequences to be partitioned (effectively into sub-bands), with an independent coding used locally in each part.
An adaptive coding method is specified by a count K, a coding method A which will be used to encode K values, and another coding method B which will handle the rest of the values.
Since bands are sized by context, when an adaptive coding method is applied to the bytes of an encoded band, the method will be provided in advance with the count N of values to decode. Thus, after the method A has decoded K values, the method B will be used to decode the rest of the (N-K) values in the band.
Every band in the Pack200 archive format whose primary encoding is not BYTE1 is optionally preceded by a series of bytes called a band coding specifier which specify a secondary coding or codings used in that band, instead of the primary coding. The decompressor must parse this coding specifier and save it away before it reads any of the band elements. In a few extra bytes of information, the coding specifier determines exactly how the subsequent bytes are to be decoded into band values.
Any compressor may elect not to provide a band coding specifier, in which case the band's primary encoding governs the transmission of elements. If the encoding of the first band element under the primary encoding accidentally appears to be a band coding specifier, the compressor is obligated to transmit an explicit band coding specifier that reaffirms the band's primary encoding. This is rare, because (as noted below) band coding specifiers present themselves as negative numbers in the primary encoding, and negative numbers are rare in most bands.
The symbolic structure of a band coding specifier is determined by the following grammar. (This grammar is independent of the grammar governing band structure, or any other grammar appearing in other parts of this specification.)
BandCodingSpecifier:
(Default | BHSDCode | RunCode | PopCode)
BHSDCode:
CanonicalBHSDCode | ArbitraryBHSDCode
CanonicalBHSDCode:
( '(1,256,0)' | '(1,256,1)' | ... )
ArbitraryBHSDCode:
'arb' ( B H S D )
B, H, S, D:
Integer
RunCode:
'run' K ACode BCode
K:
Integer
ACode:
(Default | BHSDCode | PopCode)
BCode:
(Default | BHSDCode | RunCode | PopCode)
PopCode:
'pop' ( FCode TCode UCode )
FCode:
(Default | BHSDCode | RunCode)
TCode:
(Default | BHSDCode | RunCode)
UCode:
(Default | BHSDCode | RunCode)
Integer:
( '0' | '1' | ... )
Default:
'default'
Note that adaptive coding methods ("RunCode" nonterminals) can be chained via the "BCode" nonterminal, but cannot be nested directly via the "ACode" nonterminal. (As the grammar shows, they may be nested indirectly via the "PopCode" nonterminal.)
Also, population coding methods can contain adaptive coding methods, and vice versa. However, although the grammar does not show this fact, population coding methods must not be nested, even indirectly.
Not all possible integer values of K and L are equally well expressible. The intended values of K are commensurate with compressor windows, and the intended values of L are sharpness parameters for BHS encodings.
At any point, band contents are decoded under the control of three pieces of information:
We denote the application of S in the context of N and D as "S(N,D)". This application decodes up to N elements of transmitted band data.
Just before a band is received, the decompressor knows an expected band length N and a default coding method D, specified statically as the band's primary encoding. The decompressor then reads zero or more bytes which constitute a band encoding specifier, and decodes N elements of band data based on the band encoding specifier.
When the coding specifier is 'default', N values are decoded using the default coding D, which is the band's primary encoding. (This rule is necessary if the band's first element appears to introduce a non-empty coding specifier. Such a default coding specifier is the only kind that a compressor is obligated to emit; the rest are all optional.)
When the coding specifier is another BHSDCode, N values are decoded using that coding. The ambient default D is ignored.
When the coding specifier is a 'run' of K, ACode, BCode, two steps are taken. First, the specifier ACode is given control, as ACode(K,D). That is, N is temporarily set to K. Second, the specifier BCode is given control as BCode(N-K,D). (Note: If N is infinity, N-K will also be infinity.) In both steps, D remains unchanged, so that occurrences of 'default' for ACode or BCode cause D to be used.
It is illegal for a 'run' coding specifier to specify a K of zero, or for it to be used to decode a run of K or fewer values.
When the coding specifier is a 'pop' of FCode, TCode, UCode, three steps are taken. First, FCode is applied as FCode(infinity, D). The decoding of the F values stops when the first duplicate value is encountered. (As specified in the section above on population coding, that duplicate value acts as a sentinel, and must be either the most "central" value read, or else the value immediately previous to the sentinel value.) Let K be the number of unique decoded values.
During the decoding of the F values by FCode, every 'run' specifier in FCode must exhaust its count 'K' without decoding the sentinel value. That is, the sentinel must be decoded by the last simple BHSD coding in FCode.
In the second step, a different coding TCode is used, since the T values are expected to be a dense encoding. TCode is applied as TCode(N,D), and the tokens are read which determine the band values to be delivered. (Note that a 'pop' coding method can only be applied if the N value supplied is finite. This will be true, since the only bands which are read with an infinite N are byte bands such as bc_codes
.) Let Z be the number of zeroes decoded using TCode.
The third step is to use UCode to decode Z values, using the original default encoding D. UCode is applied as U(Z,D). The resulting band is produced from the T sequence, as documented above, by replacing zeroes in the T sequence by successive U values, and replacing other elements of the T sequence by those indexed in the F sequence.
It is illegal for a 'pop' coding specifier to be used to decode a run of no values, or of an infinite number of values. If Z (the number of unfavored values) is zero, it is illegal for UCode to be anything other than 'default'.
The compressor uses a tense, byte-oriented encoding to transmit the band encoding specifier. Although the "little language" above allows a wide range of encodings, in practice many of them are similar in performance and applicability. It is not necessary or desirable to allow the compressor complete freedom to specify any conceivable band encoding specifier. Instead, the meta-encoding described in this section allows the compressor to select from a limited but useful set of secondary encodings. It is up to the compressor to make a choice that improves compression; the heuristic or algorithm that controls such a selection is beyond the scope of this specification.
The common band coding specifier of 'default' is encoded in a way that often requires no extra bytes. If the band has no elements, no parsing at all is done. Otherwise, if the band's default encoding is of variable length, the decompressor is obligated to decode one value X from the initial bytes of the band using the band's default coding D. If possible, the value X is converted into an unsigned byte value XB, which is taken to be the first byte of a band coding specifier, and X will be discarded. Otherwise, the band coding specifier is taken to be 'default', and so D is applied to the entire band, including the bytes which produced the initial value X.
D must be of the form (B,H,S) or (B,H,S,D), where B>1 and H<256. The value is D is irrelevant to the decoding of the first value, and the decoding of X is done using (B,H,S,0). If S is not zero, and if the value X is in the range [-256..-1], the coding specifier byte XB is defined as XB = (-1-X)
, and X is discarded. Otherwise, if S is zero, and if the value of X is in the range [(256-H)..(511-H)], the coding specifier byte XB is defined as XB = (X-(256-H))
, and X is discarded. Otherwise, there is not a coding specifier byte XB, and X is the first band value, as described above.
The value XB, if produced, is taken to be the initial byte of a coding specifier preceding the actual band data. This specifier is parsed, starting with XB, and continuing (if necessary) with bytes bytes taken from band_headers
. The parsed specifier is then used to control decoding, starting at the next byte, of all band elements.
If the band really does start with an element X1 in a range ([-256..-1]
or [L,L+255]
) that requires the production of a band specififer byte XB, but the compressor wishes to use the default encoding, the compressor is required to specify a band header, lest X1 mislead the decompressor. If the compressor wishes to keep the default coding, it must in such cases specify an explicit coding specifier of 'default', using an XB value of zero (as specified below). This zero XB is transmitted, depending on the default encoding, as an X value of -1 (usually the byte 1) or of L (usually the bytes 192,0).
(An immediate consequence of this design is that byte bands can never have non-default encodings, but all other bands can, since all the other bands have default encodigns which are variable length and have a large dynamic range. The "escape" values chosen for X are rare in practice, so they are not usually confused with real band data. Adding an extra zero value to reaffirm the default will always cost one extra byte before the actual band data, and no extra bytes in the band_headers
band. In general, the encoding of explicit X values will always require two bytes if S is zero, and at most two bytes if S is not zero.)
The band band_headers
transmits the non-initial band header bytes (if any) of each band that has a band header. The order of transmission is consistent with the overall order of all bands in the archive, as given in the present specification's band grammar. The length of band_headers
is independently specified (as #band_headers_size
) in the archive header. (It should be clear that the decompressor needs to find the end of band_headers
before it can read any further bands.)
Thus, the encoding of a coding specifier consists of a sequence of bytes: an initial byte XB, and zero or more non-initial bytes taken from band_headers
. The encodings of the little language specified in the previous section are as follows. The encoding 'default' is a zero byte. Of all the possible BHSDCode encodings, a sequence of 115 canonical encodings (defined below) is selected to cover the expected range of uses. A BHSDCode is represented as a single byte whose value is the index (1-based) of the selected canonical coding.
Enc{default} = (0)
Enc{CanonicalBHSDCode} = value in [1..115]
The special value 116 introduces an arbitrary, possibly non-canonical BHSD code, described in the following byte.
Enc{ arb ( B H S D ) } =
116
& (D:[0..1] + 2*S[0..2] + 8*(B:[1..5]-1))
& (H[1..256]-1):
The B value must be in the range [1..5]. The S value must be in the range [0..2]. The H value must be in the range [1..256]. The D value must be in the range [0..1]. (These ranges are inclusive.) In addition, if B is 1, H must be 256, and if H is 256, B must not be 5.
The 'run' construct encodings begin with a byte in [117..140], and may be followed by an optional byte KB and then by one or two encoding specifiers ACode and BCode. The offset from 117 represents bitwise the following data:
The last two bits are jointly encoded in the value ABDef. They are never both true.
Enc{ run ( K ACode BCode ) } =
(117 + (KX:[0..3]) + 4*(KBFlag:[0..1]) + 8*(ABDef:[0..2]))
& KB: one of [0..255] if KBFlag=1
& Enc{ ACode } if ADef=0 (ABDef != 1)
& Enc{ BCode } if BDef=0 (ABDef != 2)
After the leading byte is parsed, a byte KB is expected if the KBFlag is set, otherwise KB is implicitly given the value 3.
The K value for the 'run' construct can take a range of values, and is determined as follows:
K = (KB+1) * 16^KX
The ACode coding is understood to be 'default' if the ADef bit is set. Otherwise, the representation of ACode is parsed next. Finally, the BCode coding is understood to be 'default' if the BDef bit is set. Otherwise, the representation of BCode is parsed last. Both ADef and BDef may be clear, but both may not be set.
The 'pop' construct encodings begin with a byte in [141..188], and may be followed by one, two, or three encoding specifiers, FCode, UCode, and TCode. The offset from 141 bitwise encodes the following data:
The last two values are jointly encoded in the value TDefL.
Enc{ pop ( FCode TCode UCode ) }
= (141 + (FDef:[0..1]) + 2*UDef:[0..1] + 4*(TDefL:[0..11]))
& Enc{ FCode } if FDef=0
& Enc{ TCode } if TDef=0 (TDefL==0)
& Enc{ UCode } if UDef=0
If TDef is zero, an explicit encoding specifier determines TCode, and the L parameter is not present. If TDef is one, the L parameter is present and is derived from TDefL, according to the following table:
TDefL | L |
---|---|
1 | 4 |
2 | 8 |
3 | 16 |
4 | 32 |
5 | 64 |
6 | 128 |
7 | 192 |
8 | 224 |
9 | 240 |
10 | 248 |
11 | 252 |
If the L parameter is present, the TCode is derived from the K and L values. If K < 256, then TCode is BYTE1 (1,255,0), and L is ignored. Otherwise, TCode is the (B,H,S) coding where S=0, H=(256-L), and B is the smallest value such that [0..K] is contained in Range(B,H,S). (It is illegal to specify an L so large that Range(5,256-L,0) does not contain K.)
The FCode coding is understood to be 'default' if the FDef bit is set. Otherwise, the representation of FCode is parsed immediately after the initial byte. The TCode coding is understood to be derived from K and L if the TDef bit is set. Otherwise, the representation of TCode is parsed next. Finally, the UCode coding is understood to be 'default' if the UDef bit is set. Otherwise, the representation of UCode is parsed last.
Here is the list of all the canonical BHSD codings. These codings are designed to match a diverse variety of band value ranges and statistics.
index | BHSD Coding |
---|---|
1 | (1,256,0) |
2 | (1,256,1) |
3 | (1,256,0,1) |
4 | (1,256,1,1) |
5 | (2,256,0) |
6 | (2,256,1) |
7 | (2,256,0,1) |
8 | (2,256,1,1) |
9 | (3,256,0) |
10 | (3,256,1) |
11 | (3,256,0,1) |
12 | (3,256,1,1) |
13 | (4,256,0) |
14 | (4,256,1) |
15 | (4,256,0,1) |
16 | (4,256,1,1) |
17 | (5, 4,0) |
18 | (5, 4,1) |
19 | (5, 4,2) |
20 | (5, 16,0) |
21 | (5, 16,1) |
22 | (5, 16,2) |
23 | (5, 32,0) |
24 | (5, 32,1) |
25 | (5, 32,2) |
26 | (5, 64,0) |
27 | (5, 64,1) |
28 | (5, 64,2) |
29 | (5,128,0) |
30 | (5,128,1) |
31 | (5,128,2) |
32 | (5, 4,0,1) |
33 | (5, 4,1,1) |
34 | (5, 4,2,1) |
35 | (5, 16,0,1) |
36 | (5, 16,1,1) |
37 | (5, 16,2,1) |
38 | (5, 32,0,1) |
39 | (5, 32,1,1) |
40 | (5, 32,2,1) |
41 | (5, 64,0,1) |
42 | (5, 64,1,1) |
43 | (5, 64,2,1) |
44 | (5,128,0,1) |
45 | (5,128,1,1) |
46 | (5,128,2,1) |
47 | (2,192,0) |
48 | (2,224,0) |
49 | (2,240,0) |
50 | (2,248,0) |
51 | (2,252,0) |
52 | (2, 8,0,1) |
53 | (2, 8,1,1) |
54 | (2, 16,0,1) |
55 | (2, 16,1,1) |
56 | (2, 32,0,1) |
57 | (2, 32,1,1) |
58 | (2, 64,0,1) |
59 | (2, 64,1,1) |
60 | (2,128,0,1) |
61 | (2,128,1,1) |
62 | (2,192,0,1) |
63 | (2,192,1,1) |
64 | (2,224,0,1) |
65 | (2,224,1,1) |
66 | (2,240,0,1) |
67 | (2,240,1,1) |
68 | (2,248,0,1) |
69 | (2,248,1,1) |
70 | (3,192,0) |
71 | (3,224,0) |
72 | (3,240,0) |
73 | (3,248,0) |
74 | (3,252,0) |
75 | (3, 8,0,1) |
76 | (3, 8,1,1) |
77 | (3, 16,0,1) |
78 | (3, 16,1,1) |
79 | (3, 32,0,1) |
80 | (3, 32,1,1) |
81 | (3, 64,0,1) |
82 | (3, 64,1,1) |
83 | (3,128,0,1) |
84 | (3,128,1,1) |
85 | (3,192,0,1) |
86 | (3,192,1,1) |
87 | (3,224,0,1) |
88 | (3,224,1,1) |
89 | (3,240,0,1) |
90 | (3,240,1,1) |
91 | (3,248,0,1) |
92 | (3,248,1,1) |
93 | (4,192,0) |
94 | (4,224,0) |
95 | (4,240,0) |
96 | (4,248,0) |
97 | (4,252,0) |
98 | (4, 8,0,1) |
99 | (4, 8,1,1) |
100 | (4, 16,0,1) |
101 | (4, 16,1,1) |
102 | (4, 32,0,1) |
103 | (4, 32,1,1) |
104 | (4, 64,0,1) |
105 | (4, 64,1,1) |
106 | (4,128,0,1) |
107 | (4,128,1,1) |
108 | (4,192,0,1) |
109 | (4,192,1,1) |
110 | (4,224,0,1) |
111 | (4,224,1,1) |
112 | (4,240,0,1) |
113 | (4,240,1,1) |
114 | (4,248,0,1) |
115 | (4,248,1,1) |
From what has been said so far, it may appear that a decompressor has considerable free choice in its arrangement of class file contents. In the class file format, there are numerous degrees of freedom which have no effect on the meaning of the file. For example, constant pool entries, attribute lists, and class methods are presented in no significant order, and could be shuffled at will by the decompressor without changing the meaning of the Java application.
However, for any given Pack200 archive, every decompressor is required to produce a particular byte-wise image for each class file transmitted. This requirement is placed on decompressors in order to make it possible for compressors to transmit information, such as message digests, which relates to the eventual byte-wise contents of transmitted class files. This section describes the restrictions placed on every decompressor that makes the byte-wise contents of its output files a well-defined function of its input.
In general, the order of elements in a decompressed class file must be consistent with the transmission order in the Pack200 archive. For example, the order of a class's fields declared in the class file must correspond to the order in which that class's field descriptors were transmitted in the field_descr
band. (This also happens to correspond to the order in the field_flags
bands.) This table gives all of the required correspondences of class file order with archive transmission order:
element in class file |
band which determines order |
---|---|
implemented interfaces | class_interface |
declared fields | field_descr |
declared methods | method_descr |
code handler list | code_handler_start_P |
class attribute list | class_flags, class_attr_indexes |
field attribute list | field_flags, field_attr_indexes |
method attribute list | method_flags, method_attr_indexes |
code attribute list | code_attr_indexes |
constant pool entries | cp_Utf8, etc. (see below) |
Together, these required correspondences of order determine the exact contents of the decompressed class file. The ordering of interfaces, fields, methods, and exception handlers is directly determined by the ordering of the bands that transmit them.
Attribute lists for classes, fields, and methods must begin with all attributes transmitted because of set flag bits. The order must be index order (from LSB to MSB in the flag word). After any possible flag-caused attributes, the list must then contain any overflow attributes. Any attribute list with overflow attributes will present them in the same order as their layouts were transmitted in the corresponding series of values from class_attr_indexes
, field_attr_indexes
, method_attr_indexes
, or code_attr_indexes
. Note that an attribute layout whose index is less than 32 can be specified zero or one times via a flag bit. Independently, it can also be specified zero or more times via occurrences in a band transmitting overflow attribute layouts.
If an InnerClasses
or BootstrapMethods
attribute must be added to a class, under the rules given in the next section, those attributes must come last in the class's attribute list. If more than one such attribute is added, the relative ordering is BootstrapMethods
, then InnerClasses
.
The constant pool cp(X)
of a decompressed class file X is defined as if by a post-processing of X and cp_All
after the decompressor has already produced most of X. Specifically, suppose that the contents of X are determined, except that:
cp(X)
is not yet defined,InnerClasses
attribute in X (even if was an ic_Local(X)
transmitted), andThen cp(X)
is defined as if by the following steps, which populate it from the global constant pool cp_All
, and also potentially produce an InnerClasses
attribute for X.
cp(X)
is initially empty.cp(X)
.cp_Signature
must be treated as making no additional references, despite its transmitted structure.cp(X)
is added to cp(X)
, if it is not already there. (For example, a constant from cp_Method
refers to constants from cp_Class
and cp_Descr
.)CONSTANT_InvokeDynamic
, its cp_BootstrapMethod
constant, a CONSTANT_MethodHandle
, a CONSTANT_Methodref
, a CONSTANT_NameAndType
, and a CONSTANT_Utf8
.)At this point the constant pool cp(X)
is well enough defined to allow the decompressor to compute the set of relevant nested classes, called ic_Relevant(X)
.
ic_Relevant(X)
is initially empty.ic_All
which mentions the current class (defined by X) as an outer class is added to ic_Relevant(X)
.ic_this_class
and cp(X)
, the corresponding four-tuple from ic_All
is selected, and added to ic_Relevant(X)
, if not already there.ic_Relevant(X)
no longer changes.)ic_Relevant(X)
is ordered into a sequence, so that its order is consistent with ic_All
. (I.e., it is a subsequence of ic_All
.)With ic_Relevant(X)
and an optionally transmitted ic_Local(X)
in hand, the decompressor must next decide whether to store an InnerClasses
attribute ic_Stored(X)
for X, using these steps.
ic_Local(X)
is present and empty, no InnerClasses
attribute will be stored for X, and none of the following steps will be taken.ic_Local(X)
is not present and ic_Relevant(X)
is empty, no InnerClasses
attribute will be stored for X.ic_Stored(X)
is set to be the elements in ic_Local(X)
followed by the elements of ic_Relevant(X)
(both in transmission order).ic_Local(X)
and ic_Relevant(X)
are removed from ic_Stored(X)
, thus forming a symmetric difference.ic_Stored(X)
, even if empty, is added to the decompressor's output as an InnerClasses
attribute.cp_Class
and cp_Utf8
entries directly or indirectly required by ic_Stored(X)
, as well as the fixed cp_Utf8
entry spelled "InnerClasses"
, are added to cp(X)
if not already there.With the last contributions from nested class records, the decompressor finishes the constant pool using these steps:
cp_Signature
entry in cp(X)
is "flattened" into a cp_Utf8
entry with the same spelling.cp_Utf8
entry E in cp(X)
was not transmitted in cp_All
but had an equivalently spelled signature S transmitted, then E is replaced by S in cp(X)
.cp(X)
which are also in cp_All
are sorted into the same order as their occurrences in cp_All
.cp_Utf8
constants in cp(X)
not also in cp_All
are moved to the end of cp(X)
, and are sorted in the order defined by String.compareTo
.cp_Class
constants in cp(X)
not also in cp_All
are moved to the end of cp(X)
, in the order defined by String.compareTo
on their class names.cp(X)
.)cp(X)
which are referred to via a one-byte index (i.e., a ldc bytecode) are moved in mass to the beginning of cp(X)
, without changing their relative order.cp(X)
are transformed into their equivalent class file constant types, with signatures being "flattened" into equivalently spelled CONSTANT_Utf8 constants.cp_BootstrapMethod
constants in cp(X)
are removed from cp(X)
and placed (without reordering) in the BootstrapMethods
attribute for X. (If there are no such elements, no such attribute is created.) If the attribute is created, cp_Utf8
entry spelled "BootstrapMethods"
is added to cp(X)
if not already there.cp(X)
, is stored as the constant pool of X.After this process, the stored constant pool of X has referential integrity, and all constant references can be coded as indexes into cp(X)
, which has a definite order derived from cp_All
. In particular, if a constant in cp(X)
needs to refer to another constant, that constant must already have been inserted into cp(X)
, during the closure steps.
Note that the ordering of cp(X)
is consistent with that of cp_All
, except that some signature references are redirected to equivalent pre-existing elements of cp_Utf8
, and all operands of ldc bytecodes are forced to the front of the constant pool.
When assigning local indexes to elements of cp(X)
, the decompressor must of course respect the reservation of empty constant pool slots for index zero, and for CONSTANT_Long and CONSTANT_Double constants, as required by the class file format. Together with these rules, the construction and ordering of cp(X)
completely determines the assignment of concrete indexes to constant references within X.
Here is a list of all bands in the order they must occur in the archive. This list is not part of the Pack200 specification, but it is presented to help clarify the meaning of the grammar in the specification proper which defines the names and ordering of the bands. Four occurrences of ellipsis ... refer to insertion points where the compressor may insert variable numbers of extra bands to help transmit unusual strings or non-standard attributes. The other ellipses refer to repetitions of metadata bands which it would be too tedious to include.
Band |
Default coding |
Length |
Constant pool referred to |
---|---|---|---|
archive_magic
|
BYTE1 |
[4]
|
|
archive_header
|
UNSIGNED5 |
[26]
|
|
band_headers
|
BYTE1 |
[...]
|
|
cp_Utf8_prefix
|
DELTA5 |
[MAX(0,#cp_Utf8_count-2)]
|
|
cp_Utf8_suffix
|
UNSIGNED5 |
[MAX(0,#cp_Utf8_count-1)]
|
|
cp_Utf8_chars
|
CHAR3 |
[SUM(*cp_Utf8_suffix)]
|
|
cp_Utf8_big_suffix
|
DELTA5 |
[COUNT(0,*cp_Utf8_suffix)]
|
|
{cp_Utf8_big_chars...}
|
DELTA5 |
[*cp_Utf8_big_suffix[i]]
|
|
cp_Int
|
UDELTA5 |
[#cp_Int_count]
|
|
cp_Float
|
UDELTA5 |
[#cp_Float_count]
|
|
cp_Long_hi
|
UDELTA5 |
[#cp_Long_count]
|
|
cp_Long_lo
|
DELTA5 |
[#cp_Long_count]
|
|
cp_Double_hi
|
UDELTA5 |
[#cp_Double_count]
|
|
cp_Double_lo
|
DELTA5 |
[#cp_Double_count]
|
|
cp_String
|
UDELTA5 |
[#cp_String_count]
|
cp_Utf8
|
cp_Class
|
UDELTA5 |
[#cp_Class_count]
|
cp_Utf8
|
cp_Signature_form
|
DELTA5 |
[#cp_Signature_count]
|
cp_Utf8
|
cp_Signature_classes
|
UDELTA5 |
[COUNT('L',...)]
|
cp_Class
|
cp_Descr_name
|
DELTA5 |
[#cp_Descr_count]
|
cp_Utf8
|
cp_Descr_type
|
UDELTA5 |
[#cp_Descr_count]
|
cp_Signature
|
cp_Field_class
|
DELTA5 |
[#cp_Field_count]
|
cp_Class
|
cp_Field_desc
|
UDELTA5 |
[#cp_Field_count]
|
cp_Descr
|
cp_Method_class
|
DELTA5 |
[#cp_Method_count]
|
cp_Class
|
cp_Method_desc
|
UDELTA5 |
[#cp_Method_count]
|
cp_Descr
|
cp_Imethod_class
|
DELTA5 |
[#cp_Imethod_count]
|
cp_Class
|
cp_Imethod_desc
|
UDELTA5 |
[#cp_Imethod_count]
|
cp_Descr
|
cp_MethodHandle_refkind
|
DELTA5 |
[#cp_MethodHandle_count]
|
|
cp_MethodHandle_member
|
UDELTA5 |
[#cp_MethodHandle_count]
|
cp_All
|
cp_MethodType
|
UDELTA5 |
[#cp_MethodType_count]
|
cp_Signature
|
cp_BootstrapMethod_ref
|
DELTA5 |
[#cp_BootstrapMethod_count]
|
cp_MethodHandle
|
cp_BootstrapMethod_arg_count
|
UDELTA5 |
[#cp_BootstrapMethod_count]arg_tt>
|
|
cp_BootstrapMethod_arg
|
DELTA5 |
[SUM(*class_interface_count)]
|
cp_All
|
cp_InvokeDynamic_spec
|
DELTA5 |
[#cp_InvokeDynamic_count]
|
cp_BootstrapMethod
|
cp_InvokeDynamic_descr
|
UDELTA5 |
[#cp_InvokeDynamic_count]
|
cp_Descr
|
attr_definition_headers
|
BYTE1 |
[#attr_definition_count]
|
|
attr_definition_name
|
UNSIGNED5 |
[#attr_definition_count]
|
cp_Utf8
|
attr_definition_layout
|
UNSIGNED5 |
[#attr_definition_count]
|
cp_Utf8
|
ic_this_class
|
UDELTA5 |
[#ic_count]
|
cp_Class
|
ic_flags
|
UNSIGNED5 |
[#ic_count]
|
|
ic_outer_class
|
DELTA5 |
[COUNT(1<<16,...)]
|
cp_Class
|
ic_name
|
DELTA5 |
[COUNT(1<<16,...)]
|
cp_Utf8
|
class_this
|
DELTA5 |
[#class_count]
|
cp_Class
|
class_super
|
DELTA5 |
[#class_count]
|
cp_Class
|
class_interface_count
|
DELTA5 |
[#class_count]
|
|
class_interface
|
DELTA5 |
[SUM(*class_interface_count)]
|
cp_Class
|
class_field_count
|
DELTA5 |
[#class_count]
|
|
class_method_count
|
DELTA5 |
[#class_count]
|
|
field_descr
|
DELTA5 |
[SUM(*class_field_count)]
|
cp_Descr
|
field_flags_hi
|
UNSIGNED5 |
[SUM(class_field_count)#have_field_flags_hi]
|
|
field_flags_lo
|
UNSIGNED5 |
[SUM(*class_field_count)]
|
|
field_attr_count
|
UNSIGNED5 |
[COUNT(1<<16,...)]
|
|
field_attr_indexes
|
UNSIGNED5 |
[SUM(*field_attr_count)]
|
|
field_attr_calls
|
UNSIGNED5 |
[...]
|
|
field_ConstantValue_KQ
|
UNSIGNED5 |
[COUNT(ConstantValue,...)]
|
cp_Int, cp_Float, etc.
|
field_Signature_RS
|
UNSIGNED5 |
[COUNT(Signature,...)]
|
cp_Signature
|
field_RVA_anno_N
|
UNSIGNED5 |
[...]
|
|
field_RVA_type_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
field_RVA_pair_N
|
UNSIGNED5 |
[...]
|
|
field_RVA_name_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
field_RVA_T
|
BYTE1 |
[...]
|
|
field_RVA_caseI_KI
|
UNSIGNED5 |
[...]
|
cp_Int
|
field_RVA_caseD_KD
|
UNSIGNED5 |
[...]
|
cp_Double
|
field_RVA_caseF_KF
|
UNSIGNED5 |
[...]
|
cp_Float
|
field_RVA_caseJ_KJ
|
UNSIGNED5 |
[...]
|
cp_Long
|
field_RVA_casec_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
field_RVA_caseet_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
field_RVA_caseec_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
field_RVA_cases_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
field_RVA_casearray_N
|
UNSIGNED5 |
[...]
|
|
field_RVA_nesttype_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
field_RVA_nestpair_N
|
UNSIGNED5 |
[...]
|
|
field_RVA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
field_RIA_anno_N
|
UNSIGNED5 |
[...]
|
|
{field_RIA_...}
|
|||
field_RIA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
{field_attr_element_bands...}
|
(various) |
[...]
|
(various)
|
method_descr
|
MDELTA5 |
[SUM(*class_method_count)]
|
cp_Descr
|
method_flags_hi
|
UNSIGNED5 |
[SUM(class_method_count)#have_method_flags_hi]
|
|
method_flags_lo
|
UNSIGNED5 |
[SUM(*class_method_count)]
|
|
method_attr_count
|
UNSIGNED5 |
[COUNT(1<<16,...)]
|
|
method_attr_indexes
|
UNSIGNED5 |
[SUM(*method_attr_count)]
|
|
method_attr_calls
|
UNSIGNED5 |
[...]
|
|
method_Exceptions_N
|
UNSIGNED5 |
[COUNT(Exceptions,...)]
|
|
method_Exceptions_RC
|
UNSIGNED5 |
[SUM(*method_Exceptions_N)]
|
cp_Class
|
method_Signature_RS
|
UNSIGNED5 |
[COUNT(Signature,...)]
|
cp_Signature
|
method_RVA_anno_N
|
UNSIGNED5 |
[...]
|
|
method_RVA_type_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_RVA_pair_N
|
UNSIGNED5 |
[...]
|
|
method_RVA_name_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_RVA_T
|
BYTE1 |
[...]
|
|
method_RVA_caseI_KI
|
UNSIGNED5 |
[...]
|
cp_Int
|
method_RVA_caseD_KD
|
UNSIGNED5 |
[...]
|
cp_Double
|
method_RVA_caseF_KF
|
UNSIGNED5 |
[...]
|
cp_Float
|
method_RVA_caseJ_KJ
|
UNSIGNED5 |
[...]
|
cp_Long
|
method_RVA_casec_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_RVA_caseet_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_RVA_caseec_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_RVA_cases_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_RVA_casearray_N
|
UNSIGNED5 |
[...]
|
|
method_RVA_nesttype_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_RVA_nestpair_N
|
UNSIGNED5 |
[...]
|
|
method_RVA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_RIA_anno_N
|
UNSIGNED5 |
[...]
|
|
{method_RIA_...}
|
|||
method_RIA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_RVPA_param_NB
|
BYTE1 |
[...]
|
|
method_RVPA_anno_N
|
UNSIGNED5 |
[...]
|
|
{method_RVPA_...}
|
|||
method_RVPA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_RIPA_param_NB
|
BYTE1 |
[...]
|
|
method_RIPA_anno_N
|
UNSIGNED5 |
[...]
|
|
{method_RIPA_...}
|
|||
method_RIPA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_AD_T
|
BYTE1 |
[...]
|
|
method_AD_caseI_KI
|
UNSIGNED5 |
[...]
|
cp_Int
|
method_AD_caseD_KD
|
UNSIGNED5 |
[...]
|
cp_Double
|
method_AD_caseF_KF
|
UNSIGNED5 |
[...]
|
cp_Float
|
method_AD_caseJ_KJ
|
UNSIGNED5 |
[...]
|
cp_Long
|
method_AD_casec_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_AD_caseet_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_AD_caseec_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_AD_cases_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_AD_casearray_N
|
UNSIGNED5 |
[...]
|
|
method_AD_nesttype_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
method_AD_nestpair_N
|
UNSIGNED5 |
[...]
|
|
method_AD_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
method_MethodParameters_NB
|
BYTE1 |
[...]
|
|
method_MethodParameters_name_RUN
|
UNSIGNED5 |
[...]
|
null|cp_Utf8
|
method_MethodParameters_flag_FH
|
UNSIGNED5 |
[...]
|
|
{method_attr_element_bands...}
|
(various) |
[...]
|
(various)
|
class_flags_hi
|
UNSIGNED5 |
[#class_count*#have_class_flags_hi]
|
|
class_flags_lo
|
UNSIGNED5 |
[#class_count]
|
|
class_attr_count
|
UNSIGNED5 |
[COUNT(1<<16,...)]
|
|
class_attr_indexes
|
UNSIGNED5 |
[SUM(*class_attr_count)]
|
|
class_attr_calls
|
UNSIGNED5 |
[...]
|
|
class_SourceFile_RUN
|
UNSIGNED5 |
[COUNT(SourceFile,...)]
|
null|cp_Utf8
|
class_EnclosingMethod_RC
|
UNSIGNED5 |
[COUNT(EnclosingMethod,...)]
|
cp_Class
|
class_EnclosingMethod_RDN
|
UNSIGNED5 |
[COUNT(EnclosingMethod,...)]
|
null|cp_Descr
|
class_Signature_RS
|
UNSIGNED5 |
[COUNT(Signature,...)]
|
cp_Signature
|
class_RVA_anno_N
|
UNSIGNED5 |
[...]
|
|
class_RVA_type_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
class_RVA_pair_N
|
UNSIGNED5 |
[...]
|
|
class_RVA_name_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
class_RVA_T
|
BYTE1 |
[...]
|
|
class_RVA_caseI_KI
|
UNSIGNED5 |
[...]
|
cp_Int
|
class_RVA_caseD_KD
|
UNSIGNED5 |
[...]
|
cp_Double
|
class_RVA_caseF_KF
|
UNSIGNED5 |
[...]
|
cp_Float
|
class_RVA_caseJ_KJ
|
UNSIGNED5 |
[...]
|
cp_Long
|
class_RVA_casec_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
class_RVA_caseet_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
class_RVA_caseec_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
class_RVA_cases_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
class_RVA_casearray_N
|
UNSIGNED5 |
[...]
|
|
class_RVA_nesttype_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
class_RVA_nestpair_N
|
UNSIGNED5 |
[...]
|
|
class_RVA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
class_RIA_anno_N
|
UNSIGNED5 |
[...]
|
|
{class_RIA_...}
|
|||
class_RIA_nestname_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
class_InnerClasses_N
|
UNSIGNED5 |
[COUNT(InnerClasses,...)]
|
|
class_InnerClasses_RC
|
UNSIGNED5 |
[SUM(*class_InnerClasses_N)]
|
cp_Class
|
class_InnerClasses_F
|
UNSIGNED5 |
[SUM(*class_InnerClasses_N)]
|
|
class_InnerClasses_outer_RCN
|
UNSIGNED5 |
[COUNT(!=0,*class_InnerClasses_F)]
|
null|cp_Class
|
class_InnerClasses_name_RUN
|
UNSIGNED5 |
[COUNT(!=0,*class_InnerClasses_F)]
|
null|cp_Utf8
|
class_file_version_minor_H
|
UNSIGNED5 |
[COUNT(version,...)]
|
|
class_file_version_major_H
|
UNSIGNED5 |
[COUNT(version,...)]
|
|
{class_attr_element_bands...}
|
(various) |
[...]
|
(various)
|
code_headers
|
BYTE1 |
[COUNT(Code,...)]
|
|
code_max_stack
|
UNSIGNED5 |
[COUNT(0,*code_headers)]
|
|
code_max_na_locals
|
UNSIGNED5 |
[COUNT(0,*code_headers)]
|
|
code_handler_count
|
UNSIGNED5 |
[COUNT(0,*code_headers)]
|
|
code_handler_start_P
|
BCI5 |
[SUM(*code_header_count)]
|
|
code_handler_end_PO
|
BRANCH5 |
[SUM(*code_header_count)]
|
|
code_handler_catch_PO
|
BRANCH5 |
[SUM(*code_header_count)]
|
|
code_handler_class_RCN
|
UNSIGNED5 |
[SUM(*code_header_count)]
|
null|cp_Class
|
code_flags_hi
|
UNSIGNED5 |
[...*#have_code_flags_hi]
|
|
code_flags_lo
|
UNSIGNED5 |
[...]
|
|
code_attr_count
|
UNSIGNED5 |
[COUNT(1<<16,...)]
|
|
code_attr_indexes
|
UNSIGNED5 |
[SUM(*code_attr_count)]
|
|
code_attr_calls
|
UNSIGNED5 |
[...]
|
|
code_StackMapTable_N
|
UNSIGNED5 |
[COUNT(StackMapTable,...)]
|
|
code_StackMapTable_frame_T
|
BYTE1 |
[SUM(*code_StackMapTable_N)]
|
|
code_StackMapTable_local_N
|
UNSIGNED5 |
[COUNT(255,*code_StackMapTable_frame_T)]
|
|
code_StackMapTable_stack_N
|
UNSIGNED5 |
[COUNT(255,*code_StackMapTable_frame_T)]
|
|
code_StackMapTable_offset
|
UNSIGNED5 |
[...]
|
|
code_StackMapTable_T
|
BYTE1 |
[...]
|
|
code_StackMapTable_RC
|
UNSIGNED5 |
[COUNT(7,*code_StackMapTable_T)]
|
|
code_StackMapTable_P
|
BCI5 |
[COUNT(8,*code_StackMapTable_T)]
|
|
code_LineNumberTable_N
|
UNSIGNED5 |
[...]
|
|
code_LineNumberTable_bci_P
|
BCI5 |
[...]
|
|
code_LineNumberTable_line
|
UNSIGNED5 |
[...]
|
|
code_LocalVariableTable_N
|
UNSIGNED5 |
[...]
|
|
code_LocalVariableTable_bci_P
|
BCI5 |
[...]
|
|
code_LocalVariableTable_span_O
|
BRANCH5 |
[...]
|
|
code_LocalVariableTable_name_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
code_LocalVariableTable_type_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
code_LocalVariableTable_slot
|
UNSIGNED5 |
[...]
|
|
code_LocalVariableTypeTable_N
|
UNSIGNED5 |
[...]
|
|
code_LocalVariableTypeTable_bci_P
|
BCI5 |
[...]
|
|
code_LocalVariableTypeTable_span_O
|
BRANCH5 |
[...]
|
|
code_LocalVariableTypeTable_name_RU
|
UNSIGNED5 |
[...]
|
cp_Utf8
|
code_LocalVariableTypeTable_type_RS
|
UNSIGNED5 |
[...]
|
cp_Signature
|
code_LocalVariableTypeTable_slot
|
UNSIGNED5 |
[...]
|
|
{code_attr_element_bands...}
|
(various) |
[...]
|
(various)
|
bc_codes
|
BYTE1 |
[...]
|
|
bc_case_count
|
UNSIGNED5 |
[COUNT(switch,*bc_codes)]
|
|
bc_case_value
|
DELTA5 |
[...]
|
|
bc_byte
|
BYTE1 |
[...]
|
|
bc_short
|
DELTA5 |
[...]
|
|
bc_local
|
UNSIGNED5 |
[...]
|
|
bc_label
|
BRANCH5 |
[...]
|
|
bc_intref
|
DELTA5 |
[...]
|
cp_Int
|
bc_floatref
|
DELTA5 |
[...]
|
cp_Float
|
bc_longref
|
DELTA5 |
[...]
|
cp_Long
|
bc_doubleref
|
DELTA5 |
[...]
|
cp_Double
|
bc_stringref
|
DELTA5 |
[...]
|
cp_String
|
bc_loadablevalueref
|
DELTA5 |
[COUNT({qldc,qldc_w},*bc_codes)]
|
cp_All
|
bc_classref
|
UNSIGNED5 |
[...]
|
cp_Class
|
bc_fieldref
|
DELTA5 |
[...]
|
cp_Field
|
bc_methodref
|
UNSIGNED5 |
[...]
|
cp_Method
|
bc_imethodref
|
DELTA5 |
[...]
|
cp_Imethod
|
bc_thisfield
|
UNSIGNED5 |
[...]
|
cp_Field subsequence
|
bc_superfield
|
UNSIGNED5 |
[...]
|
cp_Field subsequence
|
bc_thismethod
|
UNSIGNED5 |
[...]
|
cp_Method subsequence
|
bc_supermethod
|
UNSIGNED5 |
[...]
|
cp_Method subsequence
|
bc_initref
|
UNSIGNED5 |
[...]
|
cp_Method subsequence
|
bc_escref
|
UNSIGNED5 |
[COUNT(ref_escape,*bc_codes)]
|
cp_All
|
bc_escrefsize
|
UNSIGNED5 |
[...]
|
|
bc_escsize
|
UNSIGNED5 |
[...]
|
|
bc_escbyte
|
BYTE1 |
[...]
|
|
file_name
|
UNSIGNED5 |
[#file_count]
|
cp_Utf8
|
file_size_hi
|
UNSIGNED5 |
[#file_count*(#have_file_size_hi)]
|
|
file_size_lo
|
UNSIGNED5 |
[#file_count]
|
|
file_modtime
|
DELTA5 |
[#file_count*(#have_file_modtime)]
|
|
file_options
|
UNSIGNED5 |
[#file_count*(#have_file_options)]
|
|
file_bits
|
BYTE1 |
[SUM(*file_size)]
|
cp_Utf8
Constant PoolThe following pseudo-code asserts the relations, as described in the section Utf8 Constants, between the band lengths and contents, and cp_Utf8
constant pool elements. (This code merely comments on the specification given above; it does not itself add new information to the specification.)
assert(cp_Utf8[0].equals(""));
int cursor = 0;
int big_cursor = 0;
for (int i = 1; i < cp_Utf8_count; i++) {
String thisString = cp_Utf8[i];
int prefix = (i == 1)? 0: cp_Utf8_prefix[i-2];
int suffix = thisString.length() - prefix;
String prevString = cp_Utf8[i-1];
String prevPrefix = prevString.substring(0, prefix);
String thisPrefix = thisString.substring(0, prefix);
assert(prevPrefix.equals(thisPrefix));
int small_suffix = cp_Utf8_suffix[i-1];
char[] suffix_chars;
int offset;
if (small_suffix != 0) {
assert(suffix == small_suffix);
suffix_chars = cp_Utf8_chars;
offset = cursor;
cursor += suffix;
} else {
assert(suffix == cp_Utf8_big_suffix[big_cursor]);
suffix_chars = cp_Utf8_big_chars[big_cursor];
offset = 0;
assert(suffix == suffix_chars.length);
big_cursor += 1;
}
String thisSuffix = thisString.substring(prefix);
String theseChars = new String(suffix_chars, offset, suffix);
assert(thisSuffix.equals(theseChars));
}
assert(cp_Utf8_prefix.length == Math.max(0, cp_Utf8_count-2));
assert(cp_Utf8_suffix.length == Math.max(0, cp_Utf8_count-1));
assert(cp_Utf8_chars.length == cursor);
assert(cp_Utf8_big_suffix.length == big_cursor);
assert(cp_Utf8_big_chars.length == big_cursor);
cp_Signature
Constant PoolThe following pseudo-code asserts the relations, as described in the section Type Signatures, between the band lengths and contents, and cp_Signature
constant pool elements. (This code merely comments on the specification given above; it does not itself add new information to the specification.)
int cursor = 0;
for (int i = 0; i < cp_Signature_count; i++) {
String sign = cp_Signature[i];
String form = cp_Signature_form[i];
int form_ptr = 0;
int sign_ptr = 0;
for (; form_ptr < form.length(); form_ptr++) {
assert(form.charAt(form_ptr) == sign.charAt(sign_ptr));
sign_ptr += 1;
if (form.charAt(form_ptr) == 'L') {
String cls = cp_Class[cursor];
assert(sign.startsWith(cls, sign_ptr));
cursor += 1;
sign_ptr += cls.length();
}
}
assert(sign_ptr == sign.length());
}
assert(cp_Signature_form.length == cp_Signature_count);
assert(cp_Signature_classes.length == cursor);
The following pseudo-code asserts the relations, as described in the section Attribute Layout Definitions, between a byte offset and its encoding in a band governed by a bc_index
layout element. (This code merely comments on the specification given above; it does not itself add new information to the specification.)
// ins_pos is a display of all instruction boundaries
int[] ins_pos;
...
Arrays.sort(ins_pos);
assert(ins_pos[0] == 0);
assert(ins_pos[ins_pos.length-1] == bytecodes.length);
int regulars = ins_pos.length; // # instruction boundaries
...
int renumber_bci(int bci) {
int i = Arrays.binarySearch(ins_pos, bci);
return (i >= 0) ? i : (i == -1) ? bci : ins_pos.length + bci - (-i-1);
}
...
for (int bci = -100; bci <= bytecodes.length+100; bci++) {
int bci_numbering_for_band = renumber_bci(bci);
int i = Arrays.binarySearch(ins_pos, bci);
if (i >= 0) {
assert(ins_pos[i] == bci);
assert(bci_numbering_for_band < regulars);
assert(bci_numbering_for_band == i);
} else if (0 < bci && bci < bytecodes.length) {
int nexti = (-i-1); // index of next instruction
assert(ins_pos[nexti-1] < bci && bci < ins_pos[nexti]);
int prevRegulars = nexti;
int prevAll = bci;
int prevIrregulars = prevAll - prevRegulars;
assert(bci_numbering_for_band >= regulars);
assert(bci_numbering_for_band == regulars + prevIrregulars);
} else {
// other (random) numbers are unchanged by renumbering
assert(bci_numbering_for_band >= bytecodes.length ||
bci_numbering_for_band < 0);
assert(bci_numbering_for_band == bci);
}
}
The following pseudo-code asserts the relations, as described in the section Nested Classes, between the nested class's mangled name and the predictable outer and simple names. Note that searches for the literal characters '/' and '$' must be replaced in real implementations by searches for the character ranges associated with the SLASH and DOLLAR nonterminals. (This code merely comments on the specification given above; it does not itself add new information to the specification.)
String bcn, predictableOuter, predictableICName;
...
String name = bcn.substring(bcn.lastIndexOf('/')+1);
int dollar2 = name.lastIndexOf('$');
assert(predictableICName == null ||
(predictableICName.charAt(0) > '9' &&
predictableICName.indexOf('$') < 0));
if (predictableICName == null) {
// bcnCase1 or bcnCase4
assert(predictableOuter == null);
assert(dollar2 == -1 ||
dollar2+1 == name.length() ||
isDigit(name.charAt(dollar2+1)));
} else if (predictableOuter == null) {
// bcnCase2
assert(name.endsWith("$"+predictableICName));
int dollar1 = name.substring(0, dollar2).lastIndexOf('$');
assert(dollar1 >= 0 && dollar1+1 < dollar2);
assert(isDigitString(name.substring(dollar1+1, dollar2)));
} else {
// bcnCase3
assert(bcn.equals(predictableOuter+"$"+predictableICName));
}
This collection of frequently asked questions concerning the Pack200 archive format is meant to serve as a design rationale.
Why not use an existing generic compression mechanism, such as .jar
, .zip
, or .tar.gz
files? First, it is better to know the structure of a thing compressed. Second, a zip or jar archive is compressed element-wise not globally. This means that any symbol shared by several classes must be mentioned independently once in each class file. Third, the structure of an individual class file is really an interleaving of many different kind of data, each with their own statistics, which limits the effectiveness of a single-stream compressor such as gzip. The specific benefit of reordering the data into bands is about a factor of two, independently of the quality of the off-the-shelf compressor used as a post-pass. There are also significant marginal benefits from the various type-specific recoding techniques. The net result is to improve a compression factor from 2-4X to 7-9X.
Why is this specification so complex? The techniques described here are the result of much experimentation over several years. Each feature of this specification is thought to contribute measurably to the overall compression ratio achieved by this technology. The authors would be happy to be shown, by benchmarks, that some given feature can be omitted without significant loss of compression performance.
(A compression performance multiplier of 1.002 or more for some particular technique is significant, because such multipliers accumulate readily into performance that end-users notice. A multiplier of less than 1.0005 is insignificant.)
The Pack200 transmission format is closely tied to the current class file format. Won't it go out of date as soon as the class file format evolves further? Further developments are likely to take the form of new attributes or perhaps new bytecodes. The layout language defined by Pack200 supports a very wide range of attribute formats, including arbitrary mixes of bytes and constant pool references. Likewise, the bytecode representation includes escape operators which can represent arbitrary mixes of data and constant pool references. Although such constructs will not compress as optimally as the features for which this specification is directly designed, it appears that reasonable future extensions will continue to be transmittable, without disturbing the compression of current features, and without requiring decompressors to be updated.
Since Pack200 is a lossy compression algorithm, won't it break signed JAR files? Signed JAR files contain secure hashes over the bytewise contents of individual class files. Any change to the bits of a class file will change its hash code, making the innocuous changes of Pack200 indistinguishable from attacks on the application code.
Pack200, like many other compression algorithms, allows the compressor many degrees of freedom in choosing the contents of the compressed archive, but no degrees of freedom in choosing the contents of the decompressed JAR file. Given a compressed archive, all compliant Pack200 decompressors must produce the same class file bytes, for each transmitted class file. This stability of output persists for class files even if a resource file (such as the manifest) changes its contents. Thus, for any given class file, compression may be lossy, but the decompression of each class file must be exact, in a useful way defined by the Pack200 specification.
This means that a compressor with the right stability properties can be used to produce a signed, packed JAR using these steps:
(Note: If this specification allows two compliant decompressors to produce different JAR archive elements for the same compressed archive input, it is a bug in the specification. The specification is thought to be free of such bugs, but if you find one please report it.)
What is the relationship between file names in Pack200 archives and file names in JAR (or ZIP) archives or disk files? Pack200 specifies that file names are transmitted as Utf8 strings. Since these strings are intended to function as names of JAR elements in functioning Java applications, the strings must also conform to Java usage, which also represents pathnames as Utf8 strings in JAR files and 16-bit Unicode in memory. Also, in JAR files, pathname components are separated by the forward slash character ('/'), and not by any system-specific character. This allows class loaders to easily convert the internal representation of the class names (e.g., "java/lang/Object") to pathnames (e.g., "java/lang/Object.class").
The Pack200 specification does not dictate the interpretation of file name strings with respect to any other tool or operating system. However, the natural mapping would be as coherent as possible with Java's usages.
What is the relationship between file dates in JAR (or ZIP) archives or disk files? Pack200 specifies that file dates are transmitted as 32-bit counts of seconds relative to Java's time base (i.e., System.currentTimeMillis
divided by one thousand). This provides absolute times relative to UTC, at a one-second granularity. If an operating system provides more accurate file times, they must be adjusted to a nearby whole second before transmission in a Pack200 archive.
JAR and ZIP store dates in local format (i.e., "YYYY/MM/DD\ HH:MM:SS"
) with no timezone specification. This means that conversion to and from Java's UTC-based times requires a guess at the timezone under which the compressor was operating. (This problem is not unique to Pack200. It is a problem with all uses of ZIP and JAR archives.) For many purposes, the standard guess is that the decompressor and compressor were operating in the same timezone and during the same yearly daylight savings time regime. However, in order to provide more stability in the times transmitted in Pack200 archives, it is expected that most compressors and decompressors will agree to use UTC as the timezone when interpreting ZIP-style local times.
Doesn't the JEFF file format also oprovide a class-specific compression algorithm? Yes, the JEFF Working Group has defined a standard file format (ISO/IEC 20970) which allows class files to be compressed about 50% and also loaded directly into memory for interpretive execution. As compression performance, it is comparable with the DEFLATE algorithm used within JAR archives. Pack200 provides much greater compression. However, Pack200 archives are not designed for direct execution by any virtual machine. The greater compression level of Pack200 is won at a cost of complexity in the unpacker, a complexity incompatible with any requirement of direct loading or execution.
Why doesn't Pack200 use Huffman encoding? (Same question for LZW, or BWT, or move-to-front, or any other standard compression technique.) As a matter of simplicity and division of labor, Pack200 intentionally focuses on finding and removing large-scale redundancies specific to class files. Pack200 produces a byte-oriented output, hopefully with clear patterns and a simple alphabet statistics. It relies on a post-pass compressor to encode these bytes in some more efficient string-sharing, bit-sliced representation.
The following considerations support this design:
How does this specification cope with archive format changes? The major and minor version numbers of a Pack200 archive advertise which version of this standard has been generated by a packer. Because of the flexibility of attribute layouts, packers can sometimes represent new classfile formats in old archive formats, but newer archive formats may be required for reasons of functionality or performance.
Unpacker implementations are strongly encouraged to support each standard version of the archive format, since there is not always a strong alignment between packer and unpacker versions at either end of a deployment channel. Packer implementations are encouraged to preserve the ability to emit older archive formats, to maintain maximum compatibility with unpackers.
The reference implementation chooses to maintain backward compatibility by producing a 1.5 pack format if the input JAR archive contains no 1.6 (or newer) classfiles. In general, it will produce the oldest possible archive version compatible with all the input classfiles. An empty archive will default to the oldest archive version, which is 1.5.
Copyright © 2003, 2017, Oracle and/or its affiliates. All rights reserved.