``Nothing will ever be attempted, if all possible objections must first be overcome.'' Samuel Johnson
Note: the current version of this document is available online at: http://www-sor.inria.fr/~piumarta/squeak/j3/j3.html.
We are pulled in many directions, including: [...]

Alea jacta erat. (The die was cast.)
A high performance VM

This last is one for which we have a lot of enthusiasm, as well as a fair
amount of experience, but we simply don't have enough cycles to keep it on
our list. Also, on the positive side, it is a project that could be done
quite independently, and then merged with mainline Squeak when it was done.
[...]
Beyond this, we have the sense that you might actually be interested in
working on such a project.
The first version was called ``Jitter'', which supposedly stood for ``just-in-time transformation of executable representation''. The idea was to generate threaded code, ... code inside image ... translator written in Slang ... run simulated or in generated code with platform-dependent macros...
The second was called ``Jitter-2'', which soon became abbreviated to ``J2''.
The first two Jitters were not a complete waste of time. Apart from the experience gained in the internals of the Squeak Interpreter and ObjectMemory, they contributed several important ideas both to the design of J3 and to the optimisations used in the regular bytecode Interpreter. To cite just two examples: the prefetching of threaded opcodes in Jitter-1 was adopted in the Interpreter by moving the send of ``fetchNextBytecode'' out of the interpreter loop and into each of the bytecode implementations (placing it as early as possible in the method), and the idea of using a combination of the inline cache and method specialisation for indexable classes for sends of #at: and #at:put: in Jitter-2 (to eliminate overheads associated with decoding the format of the receiver on each send of #at: and #at:put:) led to the inclusion of the ``at cache'' in the Interpreter.
The major breakthroughs in understanding which led to J3 were therefore:
All of this, I hope, is still thoroughly in keeping with the Squeak philosophy of portability, openness, and accessibility to everyone. The code generator contains maybe a hundred lines of platform-specific code, and the runtime support routines (the ``glue'' that connects Smalltalk methods and primitives together, and the ``trampolines'' that connect C with Smalltalk code and kick-start the initial Process) represent no more than another hundred lines. Porting J3 to a new processor (given a suitable runtime assembler -- but that's another story) should be the work of a few days, not weeks or months.
into a single native instruction, or into a pair of instructions on
load-store machines. (This is desirable because modern processors are
essentially memory-bound, and halving the number of memory references will
often double the execution speed of a given piece of code.)
In Smalltalk, decoding just the push-pop pairs involves all the
possible combinations of 4 sources and 3 destinations, and up to three
different encodings of a given operation with a given source or destination.
Combinatorial explosion would soon force the translator to spend (waste)
most of its time frantically trying to match patterns.
to appear rewritten with inverted forward conditional branches
in the bytecode. This would add yet another bunch of expensive
pattern matching to the generator. What's more, jump chains are not
uncommon. The following (thoroughly amusing or highly depressing,
depending on your disposition) example is from Object>>doesNotUnderstand:
(The original is on the left, and the spaghetti-free version on the
right. Pedantry requires me to explain that the only path to the
bytecode at 40 is from the jump at 35, and that the jump at 41 is only
reached from the bytecode preceding it.)
Pop, Dup | simple stack operators |
LdSelf, LdThisContext | load receiver and active context |
LdConst | covers nil, true, false, SmallIntegers |
LdInst, StInst | load/store an instance (receiver) variable |
LdTemp, StTemp | load/store a temporary variable |
LdLit | load from literal frame |
LdVal, StVal | load/store value field of Association |
Jmp, JmpF, JmpT | (un)conditional jumps |
Super, Send, Special | super, self, or special send |
LocalRet, RemoteRet | local and non-local (to home sender) return |
Lambda, Block | create and enter Closure |
Note that the Special operator could be removed if we made the selector explicit in message sends, for example using LdLit to supply the selector to Super and Send, and using a LdSpecial operator to provide a selector from the SpecialSelectors Array, transforming any subsequent Send into the equivalent of the existing Special operator. This would be a better solution because it is more general, allowing special selectors to be used in Super sends (for example) or as literal values in programs, without consuming literal frame slots.
Note also that many of the Smalltalk bytecodes are converted into two or more SAM operations, for example:
popLiteralVariable: N

is translated into four SAM operations:

LdLit: N
StVal
Pop
Pop

To see why this apparent discarding of information encoded in the bytecodes results in better (effectively optimal) native code, it is necessary to understand how the optimiser works. Reducing the number of operations in this manner also makes the optimiser both simpler and faster. (The optimiser is explained later.)
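To make the bytecode-to-SAM lowering concrete, here is a minimal sketch of a translator for this one bytecode. The function name and the tuple encoding of SAM operations are illustrative assumptions, not the actual J3 translator.

```python
# Hypothetical sketch: lowering the popLiteralVariable: N bytecode into
# the four SAM operations described above.  Names are illustrative only.

def lower_pop_literal_variable(n):
    """Return the SAM sequence for popLiteralVariable: n."""
    return [
        ("LdLit", n),   # push the Association from literal frame slot n
        ("StVal",),     # store the value beneath it into the Association
        ("Pop",),       # discard the Association
        ("Pop",),       # discard the stored value
    ]
```

A real translator would simply concatenate such per-bytecode sequences, leaving all further analysis to the optimiser.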
32 of the Smalltalk bytecodes are ``pre-encoded'' sends of ``special''
selectors. (The original purpose of these special sends was two-fold:
first to allow a microcoded virtual machine (and later a software interpreter)
to ``inline'' the evaluation of all sends of arithmetic messages having
SmallInteger
or Float receivers and arguments, without
ever needing to send a real message; and second to reduce literal frame
pressure due to frequently-used selectors.) The Special
send operation subsumes the 16 special send bytecodes which are therefore
all converted into Special operations, with two exceptions. Class
and Equivalent can never ``fail'' their inlined response
and be converted into a full send; they are therefore retained as separate
operations in SAM to simplify (slightly) the analysis that has to be done
by the optimiser for the Special operation. The 16 arithmetic
operations are also retained for similar reasons. (Although they do
not reduce the amount of analysis performed by the optimiser, they spread
it out more evenly and help make the final artifact easier to understand
and maintain.)
Add, Sub, Mul, Div, Divide, Mod | arithmetic operators |
Less[Equal], Greater[Equal], [Not]Equal | relational operators |
Equivalent | identity |
Bit{Shift, And, Or} | bitwise operators |
Class | fetch class of object |
Jump rewriting is also important because the optimiser performs a limited form of dataflow analysis to track the types of the objects that will appear in the runtime stack. Since, in general, the stack contents can only be tracked reliably during the execution of linear sequences of instructions, the type information must be recombined at each basic block entry point based on information propagated along each of the possible paths into the block. Reducing the number of jumps in the code therefore also reduces the number of points at which the optimiser is forced to discard potentially valuable type information.
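The recombination of type information at basic-block entries can be sketched as a per-slot merge: a stack slot keeps its inferred type only if every incoming path agrees on it. The flat ``known class or unknown'' lattice and all names below are assumptions for illustration, not J3's actual analysis.

```python
# Sketch: merging per-slot type information propagated along each of the
# possible paths into a basic block.  A slot's type survives the merge
# only when all incoming paths agree; otherwise it degrades to UNKNOWN.

UNKNOWN = object()   # top of a deliberately flat type lattice

def merge_types(paths):
    """paths: one list of per-slot types per incoming edge."""
    merged = []
    for slot_types in zip(*paths):
        first = slot_types[0]
        merged.append(first if all(t == first for t in slot_types) else UNKNOWN)
    return merged
```

This is why fewer jumps means better code: every merge point is a place where hard-won information such as ``the top of stack is a SmallInteger'' may have to be thrown away.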
Having converted the bytecode into SAM, and then performed the initial jump rewriting, the compiler iterates over the SAM code and calls a ``generator'' function in the optimiser for each SAM opcode. The optimiser implements a separate generator function for each of the SAM operations defined above.
Optimisation is based on a technique called delayed code generation (DCG) [Piu92], which has the interesting property that only one iteration over the intermediate form (or walk through a parse tree) is necessary to generate very high quality final code. DCG works by manipulating an explicit representation of the machine operands representing quantities with which the program reasons. For example, in Smalltalk these quantities include literal constants, the receiver, temporary variables, and locations on the runtime stack.
Code generation is entirely operand-driven. The optimiser's main job is to combine operands (that it has accumulated from previous operations) according to the operation implied by each of its generator functions. The optimiser tries hard to combine operands without generating native code, and emits instructions only when such implicit combination becomes impossible. In order to do this, some kind of state must be retained to propagate operands between successive generator functions. In the case of Smalltalk, the most suitable structure in which to keep this state is a simulation stack, which models the contents of the runtime stack that will exist when the method is later executed. An example should make all of this much clearer. Consider the Smalltalk statement
| a |
a := 3 + 4.

which corresponds to the following sequence of SAM operations:

LdConst #3
LdConst #4
Add
StTemp 0
Pop

Instead of generating code to assign the SmallInteger constants to a physical location (pushing them on the runtime stack, for example), the generator function for LdConst simply pushes a descriptor onto the simulation stack to represent the (virtual) presence of the corresponding constant on the runtime stack. The generator function for Add performs a case analysis of the types of the top two items on the simulation stack. Seeing that these are both SmallInteger constants, the generator combines them immediately and replaces them with a descriptor representing the SmallInteger 7. No code has been generated. When the generator function for StTemp is called, operand combination can no longer avoid generating a machine instruction. The StTemp generator function therefore calls the NAM output routine for moving a literal into a memory location:
move_i_t(#7, 0)

(This reads: ``move immediate [SmallInteger 7] to temp [0]''.) The NAM output routine, which is specific to the host processor, converts the request into physical machine instructions. For example:

movl $15, 0(%(tempBaseReg))

for the Pentium, or via an intermediate temporary register for the PowerPC:

li r9, 15
stw r9, 0(r(tempBaseReg))

The output routine returns a descriptor representing the ``cheapest'' location in which the ``output'' operand of the operation can be found. For the current example this will be either a constant or a register descriptor. The generator function for Pop sees that the top of the simulation stack is not physically on the runtime stack, and can discard it by simply ``popping'' the simulation stack without generating any instructions to pop the runtime stack. The above one or two instructions are therefore all that the code generator emits for the Smalltalk statement ``a := 3 + 4''. The following table shows the same sequence of SAM opcodes along with the simulation stack contents (on a PowerPC) after execution of the generator routine, and the calls made to the NAM output routines:
SAM | simulation stack | NAM output |
... | ||
LdConst #3 | ... [#3] | |
LdConst #4 | ... [#3] [#4] | |
Add | ... [#7] | |
StTemp 0 | ... [r9] | move_i_t([#7], 0) => [r9] |
Pop | ... |
In the case of the Pentium, the descriptor returned by move_i_t represents the constant #7; in the case of the PowerPC it represents the register r9. Since the generator function replaces the descriptor on top of the simulation stack with the output operand descriptor returned by the output routine, instructions generated to move input operands to registers on load-store architectures are not repeated if the same value is reused in a given statement. Consequently, on the PowerPC, the statement
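The whole mechanism can be sketched in a few dozen lines. The descriptor classes, the register name, and the textual ``instructions'' below are simplified assumptions for illustration; J3's real generator functions and NAM routines are of course more elaborate.

```python
# Sketch of delayed code generation with a simulation stack, replaying
# the a := 3 + 4 example above.  All names here are hypothetical.

class Const:                               # descriptor: a tagged constant
    def __init__(self, value): self.value = value

class Reg:                                 # descriptor: a machine register
    def __init__(self, name): self.name = name

class Generator:
    def __init__(self):
        self.sim = []                      # simulation stack of descriptors
        self.code = []                     # emitted "native" instructions

    def ld_const(self, n):
        self.sim.append(Const(n))          # no code emitted: just a descriptor

    def add(self):
        b, a = self.sim.pop(), self.sim.pop()
        if isinstance(a, Const) and isinstance(b, Const):
            self.sim.append(Const(a.value + b.value))  # fold: still no code
        else:
            raise NotImplementedError("real combiner emits an add here")

    def st_temp(self, index):
        src = self.sim.pop()
        if isinstance(src, Const):
            # PowerPC-style: load the tagged immediate into a scratch
            # register, then store it (tag = value * 2 + 1).
            self.code.append("li r9, %d" % (src.value * 2 + 1))
            src = Reg("r9")
        self.code.append("stw %s, %d(tempBaseReg)" % (src.name, index * 4))
        self.sim.append(src)               # cheapest location of the result

    def pop(self):
        self.sim.pop()                     # value never hit the runtime stack

g = Generator()
for op in [("ld_const", 3), ("ld_const", 4), ("add",), ("st_temp", 0), ("pop",)]:
    getattr(g, op[0])(*op[1:])
# g.code is now ["li r9, 15", "stw r9, 0(tempBaseReg)"]
```

Note how `st_temp` pushes the register descriptor back onto the simulation stack: this is exactly the mechanism that lets `a := b := 42` reuse r9 for the second store without reloading the constant.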
| a b |
a := b := 42.

is compiled into the following native code:

li r9, 85                  ; input operand = #42
stw r9, 4(r(tempBaseReg))  ; output operand = r9
stw r9, 0(r(tempBaseReg))  ; input operand = r9
Speed goes up because a significant amount of the CPU time in the bytecode Interpreter goes into activating and returning from methods. The benchFibs benchmark, running fully optimised native code, is approximately ten times slower when executing in heap Contexts compared to executing directly on the hardware stack. On the Pentium, the shortest possible path through activateNewMethod executes 83 instructions. The same activation on the hardware stack requires [VERIFY THIS] 7 instructions. A similar bottleneck exists with returns, which execute 44 Pentium instructions in the best case for heap-allocated Contexts and only [VERIFY THIS] 4 when executing on the stack.
The price to be paid is additional complexity elsewhere in the system.
Heap-allocated Contexts are tolerant
of many evils, including outliving their activation (e.g. the home contexts
of blocks) and having arbitrary changes made to their internal fields and
stack contents (the Debugger, Exceptions,
brace stacks, ContextPart release code
run from Process termination, and so
on) -- of which by far the most evil has to be a stores into the
sender field. Stack frames, on the other hand, are inherently
LIFO. A significant amount of effort has to be spent to reconcile
this intrinsic LIFOness with any non-linear behaviour induced by BlockContexts
or explicit program modifications to the internals of Context
objects.
... | ||
oop | stack[32] | // the Smalltalk stack, just like in a Context |
... | ||
oop | receiver | // the receiver of the executing method |
NativeMethod | *method | // the method executing in the frame |
oop | *stackp | // top of stack (a pointer into the stack portion) |
Context | *pseudo | // PseudoContext for Frame, or 0 |
insn | *opc | // the SqABI obscured PC for the frame |
insn | *pc | // the SqABI saved PC for the frame |
void | *cpc | // the CABI saved PC for the frame |
Frame | *sender | // a pointer to the sender's Frame |
The answer is PseudoContexts. These are ``proxies'' which exhibit perfectly Context-like behaviour at the Smalltalk level, but which in reality are active objects that ``reflect'' onto the underlying execution state held in stack Frames. Reading any field of a PseudoContext causes the appropriate value to be calculated according to the current state of the Frame on which it reflects. This is straightforward, since all of the state that would be present in a real Context can be reconstructed trivially from the state contained in a Frame. Writing a field of a PseudoContext causes a corresponding modification to the state of the associated Frame. Such modifications range from delicate (e.g. writing a new value into the stack, or changing the pc, stackp, receiver, method or home field) to downright violent (writing the sender field of a PseudoContext, for example, can result in the wholesale reconstruction of the stack ``under the feet'' of the running Process). Such modifications are seldom trivial: some kind of special ``corrective'' action is almost always required to ensure that the program continues to exhibit correct behaviour. This is due to the type analysis performed by the optimiser. Simply modifying the receiver or a location in the stack, for example, can invalidate certain optimisation assumptions that were in effect during translation of the Frame's associated CompiledMethod, causing the program to function incorrectly. (A trivial example would be storing a non-SmallInteger receiver into a Frame executing a method in class SmallInteger. The optimiser has at the very least eliminated redundant tag checks for inlined arithmetic operations on the receiver, which would suddenly become relevant again.)
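The ``reflecting'' behaviour of a PseudoContext can be sketched as a proxy whose fields are computed on demand from the live frame. The Frame shape and all names below are simplified assumptions; the real object works at the level of raw frame memory, not Python attributes.

```python
# Sketch: a PseudoContext-style proxy that reconstructs Context fields
# from the current state of the Frame it reflects on (names hypothetical).

class Frame:
    def __init__(self, method, receiver, stack, sender):
        self.method = method
        self.receiver = receiver
        self.stack = stack            # the live runtime stack portion
        self.sender = sender

class PseudoContext:
    def __init__(self, frame):
        self._frame = frame           # in effect, a raw pointer to the frame

    @property
    def receiver(self):
        return self._frame.receiver   # recomputed on every read

    @property
    def stackp(self):
        return len(self._frame.stack) # always reflects the frame's current depth
```

Reads are cheap precisely because nothing is copied: each access recomputes the answer from the frame, so the proxy can never go stale while the frame is alive.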
A PseudoContext has the following
form:
for method frame... | for block frame... | after frame exit... |
nil | nil | sender [Pseudo]Context or nil |
nil | nil | pc value at frame exit |
0 | 0 | stackp value at frame exit |
method | nargs | unchanged |
unused | startpc | unchanged |
receiver | home | unchanged |
raw pointer to frame | raw pointer to frame | ...up to 32 stack items |
...31 unused slots | ...31 unused slots |
The most obvious thing to notice about PseudoContexts is that they have the same format and almost the same layout as normal Context objects. This is for three reasons. First it helps to simplify the process of stabilisation, since only the first three fields need to be reinitialised and the final stack contents copied into the indexable part of the PseudoContext. Second it allows the fourth, fifth and sixth fields to be read directly by Smalltalk without any further computation (since they have the same values as they would in a normal MethodContext or BlockContext object). Third it allows the runtime to distinguish between PseudoContexts for method frames and those for block frames simply by inspecting the fourth field: if its value is tagged then the corresponding Frame is executing a block.
A raw pointer to the associated frame can be stored directly in PseudoContexts since they have the same ``special'' header type as do MethodContexts and BlockContexts. During garbage collection, the collector inspects the stackp field (which is always 0 for a PseudoContext) and ignores all indexable fields above this index; in other words, the garbage collector never tries to trace or remap the raw pointer to the frame.
PseudoContexts are created on demand, in only four situations:
Some indirect modifications to Frames (via writes into the associated PseudoContext) have immediate consequences. The glue routines respond to modifications of the receiver or stackp, and stores into the stack portion of the frame, by immediately redirecting the method of the frame to a non-optimised version (compiling it if necessary, based on the original CompiledMethod), updating the pc value accordingly to resume execution at the same point in the method. Other modifications to Frames have effects that are only felt when the frame exits. These situations are handled lazily (to avoid repeated work when several such modifications are made to the same frame) using a ``context fault'' mechanism. One such situation is the creation of the PseudoContext itself, which must be stabilised (converted into the equivalent Method- or BlockContext) when the frame exits.
Most frames (around 85% in a typical Smalltalk application) do not have an associated PseudoContext. It is therefore highly desirable to avoid checking if any associated PseudoContext needs to be stabilised each time a frame exits. Smalltalk also requires that returns check to see if the returning Context's sender is nil, and if so the returning Context is itself sent the #cannotReturn: message with the return value as the argument. J3 avoids these tests by making them implicit in the ``normal'' exit from the frame.
Frames always exit in the most efficient manner possible: they reload the sender's frame and stack pointers and transfer control to the pc value saved in the sender's frame. This is typically no more than three or four machine instructions. Frames for which ``special'' action is required upon exit are ``marked'' by changing the saved pc in their sender's frame to point to a ``context fault'' handler. The original saved pc value is copied into the ``obscured pc'' slot of the sender's frame. When the callee frame exits, control is transferred to the context fault handler which takes the appropriate remedial action and then restarts execution in the sender frame by jumping to the location stored in the obscured pc. Three situations are currently ``repaired'' lazily using this mechanism:
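The marking trick can be sketched as follows: the saved return pc in the sender's frame is swapped for the address of the fault handler, and the real pc is parked in the obscured-pc slot. All names and the use of a Python callable in place of a machine address are illustrative assumptions.

```python
# Sketch of the lazy "context fault" mechanism: the normal exit path
# jumps blindly through the sender's saved pc, and marking a frame
# simply redirects that pc to a handler (names hypothetical).

faults_handled = []                   # records which frames faulted

class Frame:
    def __init__(self, pc):
        self.pc = pc                  # saved return pc for this frame
        self.opc = None               # obscured pc, used only when marked

def context_fault(frame):
    faults_handled.append(frame)      # take the remedial action here...
    return frame.opc                  # ...then resume at the obscured pc

def mark_for_fault(sender):
    sender.opc = sender.pc            # park the real return address
    sender.pc = context_fault         # returns now trap into the handler

def frame_exit(sender):
    """Normal exit path: transfer control to the sender's saved pc.
    The callee never tests anything; a marked frame faults implicitly."""
    target = sender.pc
    return target(sender) if callable(target) else target
```

The crucial property is that the unmarked exit path pays nothing: no test, no branch, just the usual indirect jump through the saved pc.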
benchmark | Squeak 2.6 | VisualWorks 5i | J3 alpha 1 |
towers of hanoi | 7251 | 681 | |
allocation | 3776 | 1193 | |
array write | 6769 | 2275 | |
dictionary write | 2764 | 337 | |
floating math | 23674 | 7065 | |
integer math | 4567 | 1042 | |
ordered collection iterate | 26318 | 4831 | |
ordered collection write | 4787 | 895 | |
string compare | 7831 | 1752 | |
That's the evidence. I leave you to make up your own mind.
I am also deeply indebted to Neilze Dorta, who lent me her PowerBook G3 whenever it wasn't needed, thus turning many hours of painful waiting for egcs to compile the PowerPC version of J3 on my rather slow PowerMac 6400 into many minutes of significantly less painful waiting.
[...] | ... |
[DS84] | L. Peter Deutsch and Alan M. Schiffman, Efficient Implementation of the Smalltalk-80 System. Proceedings of the 1984 Conference on Principles of Programming Languages (POPL'84), pp. 297--302. |
[Mir99] | Eliot Miranda, Context Handling in VisualWorks 5i. OOPSLA'99 Workshop on Simplicity, Performance and Portability in Virtual Machine Design. November 1999, Denver (CO). |
[Piu92] | Ian Piumarta, Delayed Code Generation in a Smalltalk-80 Compiler. Ph.D. Thesis, University of Manchester (UK), October 1992. |
[Piu99a] | Ian Piumarta, ccg: a tool for writing dynamic code generators. OOPSLA'99 Workshop on Simplicity, Performance and Portability in Virtual Machine Design. November 1999, Denver (CO). |
[PR98] | Ian Piumarta and Fabio Riccardi, Optimizing direct threaded code by selective inlining. Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'98). June 17-19, Montréal (Canada). |