``Nothing will ever be attempted, if all possible objections must first be overcome.'' Samuel Johnson
Note: the current version of this document is available online at: http://www-sor.inria.fr/~piumarta/squeak/j3/j3.html.
We are pulled in many directions, including: [...]

Alea jacta erat. (The die was cast.)
A high performance VM

This last is one for which we have a lot of enthusiasm, as well as a fair
amount of experience, but we simply don't have enough cycles to keep it on
our list. Also, on the positive side, it is a project that could be done
quite independently, and then merged with mainline Squeak when it was done.
[...]
Beyond this, we have the sense that you might actually be interested in
working on such a project.
The first version was called ``Jitter'', which supposedly stood for ``just-in-time transformation of executable representation''. The idea was to generate threaded code, ... code inside image ... translator written in Slang ... run simulated or in generated code with platform-dependent macros...
The second was called ``Jitter-2'', which soon became abbreviated to ``J2''.
The first two Jitters were not a complete waste of time. Apart from the experience gained in the internals of the Squeak Interpreter and ObjectMemory, they contributed several important ideas both to the design of J3 and to the optimisations used in the regular bytecode Interpreter. To cite just two examples: the prefetching of threaded opcodes in Jitter-1 was adopted in the Interpreter by moving the send of ``fetchNextBytecode'' out of the interpreter loop and into each of the bytecode implementations (placing it as early as possible in the method), and the idea of using a combination of the inline cache and method specialisation for indexable classes for sends of #at: and #at:put: in Jitter-2 (to eliminate overheads associated with decoding the format of the receiver on each send of #at: and #at:put:) led to the inclusion of the ``at cache'' in the Interpreter.
The major breakthroughs in understanding which led to J3 were therefore:
All of this, I hope, is still thoroughly in keeping with the Squeak philosophy of portability, openness, and accessibility to everyone. The code generator contains maybe a hundred lines of platform-specific code, and the runtime support routines (the ``glue'' that connects Smalltalk methods and primitives together, and the ``trampolines'' that connect C with Smalltalk code and kick-start the initial Process) represent no more than another hundred lines. Porting J3 to a new processor (given a suitable runtime assembler -- but that's another story) should be the work of a few days, not weeks or months.
into a single native instruction, or into a pair of instructions on
load-store machines. (This is desirable because modern processors are
essentially memory-bound, and halving the number of memory references will
often double the execution speed of a given piece of code.)
In Smalltalk, decoding just the push-pop pairs involves all the
possible combinations of 4 sources and 3 destinations, and up to three
different encodings of a given operation with a given source or destination.
Combinatorial explosion would soon force the translator to spend (waste)
most of its time frantically trying to match patterns.
to appear rewritten with inverted forward conditional branches
in the bytecode. This would add yet another bunch of expensive
pattern matching to the generator. What's more, jump chains are not
uncommon. The following (thoroughly amusing or highly depressing,
depending on your disposition) example is from Object>>doesNotUnderstand:
(The original is on the left, and the spaghetti-free version on the
right. Pedantry requires me to explain that the only path to the
bytecode at 40 is from the jump at 35, and that the jump at 41 is only
reached from the bytecode preceding it.)
Pop, Dup | simple stack operators |
LdSelf, LdThisContext | load receiver and active context |
LdConst | covers nil, true, false, SmallIntegers |
LdInst, StInst | load/store an instance (receiver) variable |
LdTemp, StTemp | load/store a temporary variable |
LdLit | load from literal frame |
LdVal, StVal | load/store value field of Association |
Jmp, JmpF, JmpT | (un)conditional jumps |
Super, Send, Special | super, self, or special send |
LocalRet, RemoteRet | local and non-local (to home sender) return |
Lambda, Block | create and enter Closure |
Note that the Special operator could be removed if we made the selector explicit in message sends, for example using LdLit to supply the selector to Super and Send, and using a LdSpecial operator to provide a selector from the SpecialSelectors Array, transforming any subsequent Send into the equivalent of the existing Special operator. This would be a better solution because it is more general, allowing special selectors to be used in Super sends (for example) or as literal values in programs, without consuming literal frame slots.
Note also that many of the Smalltalk bytecodes are converted into two or more SAM operations, for example:
popLiteralVariable: N

is translated into four SAM operations:

LdLit: N
StVal
Pop
Pop

To see why this apparent discarding of information encoded in the bytecodes results in better (effectively optimal) native code, it is necessary to understand how the optimiser works. Reducing the number of operations in this manner also makes the optimiser both simpler and faster. (The optimiser is explained later.)
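To make the bytecode-to-SAM lowering concrete, here is a minimal sketch of a translator for this one bytecode. The function name and the tuple encoding of SAM operations are illustrative assumptions, not the actual J3 translator.

```python
# Hypothetical sketch: lowering the popLiteralVariable: N bytecode into
# the four SAM operations described above.  Names are illustrative only.

def lower_pop_literal_variable(n):
    """Return the SAM sequence for popLiteralVariable: n."""
    return [
        ("LdLit", n),   # push the Association from literal frame slot n
        ("StVal",),     # store the value beneath it into the Association
        ("Pop",),       # discard the Association
        ("Pop",),       # discard the stored value
    ]
```

A real translator would simply concatenate such per-bytecode sequences, leaving all further analysis to the optimiser.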
32 of the Smalltalk bytecodes are ``pre-encoded'' sends of ``special''
selectors. (The original purpose of these special sends was two-fold:
first to allow a microcoded virtual machine (and later a software interpreter)
to ``inline'' the evaluation of all sends of arithmetic messages having
SmallInteger
or Float receivers and arguments, without
ever needing to send a real message; and second to reduce literal frame
pressure due to frequently-used selectors.) The Special
send operation subsumes the 16 special send bytecodes which are therefore
all converted into Special operations, with two exceptions. Class
and Equivalent can never ``fail'' their inlined response
and be converted into a full send; they are therefore retained as separate
operations in SAM to simplify (slightly) the analysis that has to be done
by the optimiser for the Special operation. The 16 arithmetic
operations are also retained for similar reasons. (Although they do
not reduce the amount of analysis performed by the optimiser, they spread
it out more evenly and help make the final artifact easier to understand
and maintain.)
Add, Sub, Mul, Div, Divide, Mod | arithmetic operators |
Less[Equal], Greater[Equal], [Not]Equal | relational operators |
Equivalent | identity |
Bit{Shift, And, Or} | bitwise operators |
Class | fetch class of object |
Jump rewriting is also important because the optimiser performs a limited form of dataflow analysis to track the types of the objects that will appear in the runtime stack. Since, in general, the stack contents can only be tracked reliably during the execution of linear sequences of instructions, the type information must be recombined at each basic block entry point based on information propagated along each of the possible paths into the block. Reducing the number of jumps in the code therefore also reduces the number of points at which the optimiser is forced to discard potentially valuable type information.
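The recombination of type information at basic-block entries can be sketched as a per-slot merge: a stack slot keeps its inferred type only if every incoming path agrees on it. The flat ``known class or unknown'' lattice and all names below are assumptions for illustration, not J3's actual analysis.

```python
# Sketch: merging per-slot type information propagated along each of the
# possible paths into a basic block.  A slot's type survives the merge
# only when all incoming paths agree; otherwise it degrades to UNKNOWN.

UNKNOWN = object()   # top of a deliberately flat type lattice

def merge_types(paths):
    """paths: one list of per-slot types per incoming edge."""
    merged = []
    for slot_types in zip(*paths):
        first = slot_types[0]
        merged.append(first if all(t == first for t in slot_types) else UNKNOWN)
    return merged
```

This is why fewer jumps means better code: every merge point is a place where hard-won information such as ``the top of stack is a SmallInteger'' may have to be thrown away.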
Having converted the bytecode into SAM, and then performed the initial jump rewriting, the compiler iterates over the SAM code and calls a ``generator'' function in the optimiser for each SAM opcode. The optimiser implements a separate generator function for each of the SAM operations defined above.
Optimisation is based on a technique called delayed code generation (DCG) [Piu92], which has the interesting property that only one iteration over the intermediate form (or walk through a parse tree) is necessary to generate very high quality final code. DCG works by manipulating an explicit representation of the machine operands representing quantities with which the program reasons. For example, in Smalltalk these quantities include literal constants, the receiver, temporary variables, and locations on the runtime stack.
Code generation is entirely operand-driven. The optimiser's main job is to combine operands (that it has accumulated from previous operations) according to the operation implied by each of its generator functions. The optimiser tries hard to combine operands without generating native code, and emits instructions only when such implicit combination becomes impossible. In order to do this, some kind of state must be retained to propagate operands between successive generator functions. In the case of Smalltalk, the most suitable structure in which to keep this state is a simulation stack, which models the contents of the runtime stack that will exist when the method is later executed. An example should make all of this much clearer. Consider the Smalltalk statement
| a |
a := 3 + 4.

which corresponds to the following sequence of SAM operations:

LdConst #3
LdConst #4
Add
StTemp 0
Pop

Instead of generating code to assign the SmallInteger constants to a physical location (pushing them on the runtime stack, for example), the generator function for LdConst simply pushes a descriptor onto the simulation stack to represent the (virtual) presence of the corresponding constant on the runtime stack. The generator function for Add performs a case analysis of the types of the top two items on the simulation stack. Seeing that these are both SmallInteger constants, the generator combines them immediately and replaces them with a descriptor representing the SmallInteger 7. No code has been generated. When the generator function for StTemp is called, operand combination can no longer avoid generating a machine instruction. The StTemp generator function therefore calls the NAM output routine for moving a literal into a memory location:
move_i_t(#7, 0)

(This reads: ``move immediate [SmallInteger 7] to temp [0]''.) The NAM output routine, which is specific to the host processor, converts the request into physical machine instructions. For example:

movl $15, 0(%(tempBaseReg))

for the Pentium, or via an intermediate temporary register for the PowerPC:

li r9, 15
stw r9, 0(r(tempBaseReg))

The output routine returns a descriptor representing the ``cheapest'' location in which the ``output'' operand of the operation can be found. For the current example this will be either a constant or a register descriptor. The generator function for Pop sees that the top of the simulation stack is not physically on the runtime stack, and can discard it by simply ``popping'' the simulation stack without generating any instructions to pop the runtime stack. The above one or two instructions are therefore all that the code generator emits for the Smalltalk statement ``a := 3 + 4''. The following table shows the same sequence of SAM opcodes along with the simulation stack contents (on a PowerPC) after execution of the generator routine, and the calls made to the NAM output routines:
SAM | simulation stack | NAM output |
... | ||
LdConst #3 | ... [#3] | |
LdConst #4 | ... [#3] [#4] | |
Add | ... [#7] | |
StTemp 0 | ... [r9] | move_i_t([#7], 0) => [r9] |
Pop | ... |
In the case of the Pentium, the descriptor returned by move_i_t represents the constant #7; in the case of the PowerPC it represents the register r9. Since the generator function replaces the descriptor on top of the simulation stack with the output operand descriptor returned by the output routine, instructions generated to move input operands to registers on load-store architectures are not repeated if the same value is reused in a given statement. Consequently, on the PowerPC, the statement
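The whole mechanism can be sketched in a few dozen lines. The descriptor classes, the register name, and the textual ``instructions'' below are simplified assumptions for illustration; J3's real generator functions and NAM routines are of course more elaborate.

```python
# Sketch of delayed code generation with a simulation stack, replaying
# the a := 3 + 4 example above.  All names here are hypothetical.

class Const:                               # descriptor: a tagged constant
    def __init__(self, value): self.value = value

class Reg:                                 # descriptor: a machine register
    def __init__(self, name): self.name = name

class Generator:
    def __init__(self):
        self.sim = []                      # simulation stack of descriptors
        self.code = []                     # emitted "native" instructions

    def ld_const(self, n):
        self.sim.append(Const(n))          # no code emitted: just a descriptor

    def add(self):
        b, a = self.sim.pop(), self.sim.pop()
        if isinstance(a, Const) and isinstance(b, Const):
            self.sim.append(Const(a.value + b.value))  # fold: still no code
        else:
            raise NotImplementedError("real combiner emits an add here")

    def st_temp(self, index):
        src = self.sim.pop()
        if isinstance(src, Const):
            # PowerPC-style: load the tagged immediate into a scratch
            # register, then store it (tag = value * 2 + 1).
            self.code.append("li r9, %d" % (src.value * 2 + 1))
            src = Reg("r9")
        self.code.append("stw %s, %d(tempBaseReg)" % (src.name, index * 4))
        self.sim.append(src)               # cheapest location of the result

    def pop(self):
        self.sim.pop()                     # value never hit the runtime stack

g = Generator()
for op in [("ld_const", 3), ("ld_const", 4), ("add",), ("st_temp", 0), ("pop",)]:
    getattr(g, op[0])(*op[1:])
# g.code is now ["li r9, 15", "stw r9, 0(tempBaseReg)"]
```

Note how `st_temp` pushes the register descriptor back onto the simulation stack: this is exactly the mechanism that lets `a := b := 42` reuse r9 for the second store without reloading the constant.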
| a b |
a := b := 42.

is compiled into the following native code:

li r9, 85                  ; input operand = #42
stw r9, 4(r(tempBaseReg))  ; output operand = r9
stw r9, 0(r(tempBaseReg))  ; input operand = r9
Speed goes up because a significant amount of the CPU time in the bytecode Interpreter goes into activating and returning from methods. The benchFibs benchmark, running fully optimised native code, is approximately ten times slower when executing in heap Contexts compared to executing directly on the hardware stack. On the Pentium, the shortest possible path through activateNewMethod executes 83 instructions. The same activation on the hardware stack requires [VERIFY THIS] 7 instructions. A similar bottleneck exists with returns, which execute 44 Pentium instructions in the best case for heap-allocated Contexts and only [VERIFY THIS] 4 when executing on the stack.
The price to be paid is additional complexity elsewhere in the system.
Heap-allocated Contexts are tolerant
of many evils, including outliving their activation (e.g. the home contexts
of blocks) and having arbitrary changes made to their internal fields and
stack contents (the Debugger, Exceptions,
brace stacks, ContextPart release code
run from Process termination, and so
on) -- of which by far the most evil has to be a stores into the
sender field. Stack frames, on the other hand, are inherently
LIFO. A significant amount of effort has to be spent to reconcile
this intrinsic LIFOness with any non-linear behaviour induced by BlockContexts
or explicit program modifications to the internals of Context
objects.
... | ||
oop | stack[32] | // the Smalltalk stack, just like in a Context |
... | ||
oop | receiver | // the receiver of the executing method |
NativeMethod | *method | // the method executing in the frame |
oop | *stackp | // top of stack (a pointer into the stack portion) |
Context | *pseudo | // PseudoContext for Frame, or 0 |
insn | *opc | // the SqABI obscured PC for the frame |
insn | *pc | // the SqABI saved PC for the frame |
void | *cpc | // the CABI saved PC for the frame |
Frame | *sender | // a pointer to the sender's Frame |
The answer is PseudoContexts. These are ``proxies'' which exhibit perfectly Context-like behaviour at the Smalltalk level, but which in reality are active objects that ``reflect'' onto the underlying execution state held in stack Frames. Reading any field of a PseudoContext causes the appropriate value to be calculated according to the current state of the Frame on which it reflects. This is straightforward, since all of the state that would be present in a real Context can be reconstructed trivially from the state contained in a Frame. Writing a field of a PseudoContext causes a corresponding modification to the state of the associated Frame. Such modifications range from delicate (e.g. writing a new value into the stack, or changing the pc, stackp, receiver, method or home field) to downright violent (writing the sender field of a PseudoContext, for example, can result in the wholesale reconstruction of the stack ``under the feet'' of the running Process). Such modifications are seldom trivial: some kind of special ``corrective'' action is almost always required to ensure that the program continues to exhibit correct behaviour. This is due to the type analysis performed by the optimiser. Simply modifying the receiver or a location in the stack, for example, can invalidate certain optimisation assumptions that were in effect during translation of the Frame's associated CompiledMethod, causing the program to function incorrectly. (A trivial example would be storing a non-SmallInteger receiver into a Frame executing a method in class SmallInteger. The optimiser has at the very least eliminated redundant tag checks for inlined arithmetic operations on the receiver, which would suddenly become relevant again.)
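The ``reflecting'' behaviour of a PseudoContext can be sketched as a proxy whose fields are computed on demand from the live frame. The Frame shape and all names below are simplified assumptions; the real object works at the level of raw frame memory, not Python attributes.

```python
# Sketch: a PseudoContext-style proxy that reconstructs Context fields
# from the current state of the Frame it reflects on (names hypothetical).

class Frame:
    def __init__(self, method, receiver, stack, sender):
        self.method = method
        self.receiver = receiver
        self.stack = stack            # the live runtime stack portion
        self.sender = sender

class PseudoContext:
    def __init__(self, frame):
        self._frame = frame           # in effect, a raw pointer to the frame

    @property
    def receiver(self):
        return self._frame.receiver   # recomputed on every read

    @property
    def stackp(self):
        return len(self._frame.stack) # always reflects the frame's current depth
```

Reads are cheap precisely because nothing is copied: each access recomputes the answer from the frame, so the proxy can never go stale while the frame is alive.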
A PseudoContext has the following
form:
for method frame... | for block frame... | after frame exit... |
nil | nil | sender [Pseudo]Context or nil |
nil | nil | pc value at frame exit |
0 | 0 | stackp value at frame exit |
method | nargs | unchanged |
unused | startpc | unchanged |
receiver | home | unchanged |
raw pointer to frame | raw pointer to frame | ...up to 32 stack items |
...31 unused slots | ...31 unused slots |
The most obvious thing to notice about PseudoContexts is that they have the same format and almost the same layout as normal Context objects. This is for three reasons. First it helps to simplify the process of stabilisation, since only the first three fields need to be reinitialised and the final stack contents copied into the indexable part of the PseudoContext. Second it allows the fourth, fifth and sixth fields to be read directly by Smalltalk without any further computation (since they have the same values as they would in a normal MethodContext or BlockContext object). Third it allows the runtime to distinguish between PseudoContexts for method frames and those for block frames simply by inspecting the fourth field: if its value is tagged then the corresponding Frame is executing a block.
A raw pointer to the associated frame can be stored directly in PseudoContexts since they have the same ``special'' header type as do MethodContexts and BlockContexts. During garbage collection, the collector inspects the stackp field (which is always 0 for a PseudoContext) and ignores all indexable fields above this index; in other words, the garbage collector never tries to trace or remap the raw pointer to the frame.
PseudoContexts are created on demand, in only four situations:
Some indirect modifications to Frames (via writes into the associated PseudoContext) have immediate consequences. The glue routines respond to modifications of the receiver or stackp, and stores into the stack portion of the frame, by immediately redirecting the method of the frame to a non-optimised version (compiling it if necessary, based on the original CompiledMethod), updating the pc value accordingly to resume execution at the same point in the method. Other modifications to Frames have effects that are only felt when the frame exits. These situations are handled lazily (to avoid repeated work when several such modifications are made to the same frame) using a ``context fault'' mechanism. One such situation is the creation of the PseudoContext itself, which must be stabilised (converted into the equivalent Method- or BlockContext) when the frame exits.
Most frames (around 85% in a typical Smalltalk application) do not have an associated PseudoContext. It is therefore highly desirable to avoid checking if any associated PseudoContext needs to be stabilised each time a frame exits. Smalltalk also requires that returns check to see if the returning Context's sender is nil, and if so the returning Context is itself sent the #cannotReturn: message with the return value as the argument. J3 avoids these tests by making them implicit in the ``normal'' exit from the frame.
Frames always exit in the most efficient manner possible: they reload the sender's frame and stack pointers and transfer control to the pc value saved in the sender's frame. This is typically no more than three or four machine instructions. Frames for which ``special'' action is required upon exit are ``marked'' by changing the saved pc in their sender's frame to point to a ``context fault'' handler. The original saved pc value is copied into the ``obscured pc'' slot of the sender's frame. When the callee frame exits, control is transferred to the context fault handler which takes the appropriate remedial action and then restarts execution in the sender frame by jumping to the location stored in the obscured pc. Three situations are currently ``repaired'' lazily using this mechanism:
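The marking trick can be sketched as follows: the saved return pc in the sender's frame is swapped for the address of the fault handler, and the real pc is parked in the obscured-pc slot. All names and the use of a Python callable in place of a machine address are illustrative assumptions.

```python
# Sketch of the lazy "context fault" mechanism: the normal exit path
# jumps blindly through the sender's saved pc, and marking a frame
# simply redirects that pc to a handler (names hypothetical).

faults_handled = []                   # records which frames faulted

class Frame:
    def __init__(self, pc):
        self.pc = pc                  # saved return pc for this frame
        self.opc = None               # obscured pc, used only when marked

def context_fault(frame):
    faults_handled.append(frame)      # take the remedial action here...
    return frame.opc                  # ...then resume at the obscured pc

def mark_for_fault(sender):
    sender.opc = sender.pc            # park the real return address
    sender.pc = context_fault         # returns now trap into the handler

def frame_exit(sender):
    """Normal exit path: transfer control to the sender's saved pc.
    The callee never tests anything; a marked frame faults implicitly."""
    target = sender.pc
    return target(sender) if callable(target) else target
```

The crucial property is that the unmarked exit path pays nothing: no test, no branch, just the usual indirect jump through the saved pc.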
benchmark | Squeak 2.6 | VisualWorks 5i | J3 alpha 1 |
towers of hanoi | 7251 | 681 | |
allocation | 3776 | 1193 | |
array write | 6769 | 2275 | |
dictionary write | 2764 | 337 | |
floating math | 23674 | 7065 | |
integer math | 4567 | 1042 | |
ordered collection iterate | 26318 | 4831 | |
ordered collection write | 4787 | 895 | |
string compare | 7831 | 1752 | |
That's the evidence. I leave you to make up your own mind.
I am also deeply indebted to Neilze Dorta, who lent me her PowerBook G3 whenever it wasn't needed, thus turning many hours of painful waiting for egcs to compile the PowerPC version of J3 on my rather slow PowerMac 6400 into many minutes of significantly less painful waiting.
[...] | ... |
[DS84] | L. Peter Deutsch and Alan M. Schiffman, Efficient Implementation of the Smalltalk-80 System. Proceedings of the 1984 Conference on Principles of Programming Languages (POPL'84), pp. 297--302. |
[Mir99] | Eliot Miranda, Context Handling in VisualWorks 5i. OOPSLA'99 Workshop on Simplicity, Performance and Portability in Virtual Machine Design. November 1999, Denver (CO). |
[Piu92] | Ian Piumarta, Delayed Code Generation in a Smalltalk-80 Compiler. Ph.D. Thesis, University of Manchester (UK), October 1992. |
[Piu99a] | Ian Piumarta, ccg: a tool for writing dynamic code generators. OOPSLA'99 Workshop on Simplicity, Performance and Portability in Virtual Machine Design. November 1999, Denver (CO). |
[PR98] | Ian Piumarta and Fabio Riccardi, Optimizing direct threaded code by selective inlining. Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'98). June 17-19, Montréal (Canada). |