An Expert At Napping
Today, I'm learning to get rest the hard way. No, not
representational state transfer, and not reStructuredText - actually
stopping and recovering from the last year. From looking for work
while planning a wedding, to working for eight full-time hours and
commuting for six, to getting run-down and never really recovering,
this has been one crazy year. I've had no desire at all to put down
my keyboard and stop thinking about code, but I feel that I no longer
have any choice. I am so very tired.
My plan is to cut the time I spend coding on interesting problems into
nice discrete blocks, to plan ahead the things I will work on in that
time, and to take generous rest breaks. I'll spend some time writing
about the things I've been playing with, even incomplete thoughts.
Entangled
I've moved my research project into its own garden, where I can tweak
the IR and compiler to fit the needs of the analysis and optimiser.
At the moment it can compile several versions of Python bytecode into
two different SSA forms - the first with implicit stack and locals,
and the second with the more familiar arguments. It also has a
web-based visualisation that can animate the SSA graph as it is
constructed. I have a little work to do on the visualisation before I
can get back to the next graph transformation - I want to display and
highlight the use-def relationships by drawing nice paths in SVG.
Then I get to convert the explicit SSA form into one with explicit
special-method invocation.
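To give a feel for the two SSA forms mentioned above, here is a made-up rendering (not Entangled's actual syntax) of z = x + y:

# CPython bytecode for z = x + y:
#     LOAD_FAST x
#     LOAD_FAST y
#     BINARY_ADD
#     STORE_FAST z
#
# First SSA form - the stack and locals are still implicit:
#     v1 = LOAD_FAST 'x'
#     v2 = LOAD_FAST 'y'
#     v3 = BINARY_ADD        # consumes the two values pushed above it
#     STORE_FAST 'z'         # pops v3 back into the local
#
# Second SSA form - the more familiar explicit arguments:
#     v1 = load_local('x')
#     v2 = load_local('y')
#     v3 = add(v1, v2)
#     store_local('z', v3)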
Starting over with my own compiler has made me realise some of the
design decisions I wish had been made in pypy. Entangled has static
descriptions of the languages it operates on, much like Chez Scheme
has. This gives a great place to add extra information to particular
operations, such as their stack effect, whether they never return, and
how to decode their operand. It also means we can be certain of the
language that reaches each pass - we know there are no operations we
don't expect.
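As a rough sketch of what I mean (Python here, and the names are invented for illustration rather than being Entangled's actual API), such a static description might look like:

from dataclasses import dataclass
from typing import Callable, Optional

# One record per operation in the language, carrying the extra
# information that individual passes might want to consult.
@dataclass(frozen=True)
class OpSpec:
    name: str
    stack_effect: int                    # net change in stack depth
    never_returns: bool = False          # e.g. raise / abort
    decode_operand: Optional[Callable[[bytes, int], int]] = None

STACK_LANG = {
    "LOAD_LOCAL": OpSpec("LOAD_LOCAL", +1,
                         decode_operand=lambda code, pc: code[pc + 1]),
    "BINARY_ADD": OpSpec("BINARY_ADD", -1),
    "RAISE":      OpSpec("RAISE", -1, never_returns=True),
}

def check_input(ops, language=STACK_LANG):
    # Because each pass declares the language it accepts, we can assert
    # up front that no unexpected operation ever reaches it.
    for op in ops:
        assert op in language, "unexpected operation: " + op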
Another concept I wish we had in pypy and in firm is that the compiler
keeps no global state. Individual passes can maintain their own state:
for example, if we did call-family lowering in this compiler, that pass
would maintain a concordance between functions
and their 'call family', which is their set of invoked signatures as
well as other functions that may be aliased at the same call site.
However, once that pass was done, any information not present in the
graph would be thrown away.
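As a sketch of the shape this takes (the pass and its helpers are invented for illustration, not the real thing), the call-family pass would look roughly like:

class CallFamilyLowering:
    # Hypothetical pass: all of its bookkeeping lives on the instance.

    def __init__(self):
        # function name -> set of invoked signatures / aliased callees
        self.call_families = {}

    def run(self, functions):
        for fn in functions:
            family = self.call_families.setdefault(fn.name, set())
            family.update(fn.invoked_signatures())   # invented helper
        for fn in functions:
            self.lower_calls(fn, self.call_families[fn.name])
        # Once run() returns, the concordance is garbage: anything worth
        # keeping must already have been written back into the graph.

    def lower_calls(self, fn, family):
        ...  # rewrite each call site using the per-function family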
I've also discovered how to get 80 percent of the benefit with only 20
percent of the algebra: we can track, with static analysis, which
objects we know have no other alias present on the stack. Or more
weakly, which objects have no alias that has been written through.
The truly interesting bit of this compiler will be how we determine
this quality and how we use it.
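Roughly, and just to illustrate the distinction in plain Python:

# Strong property: 'xs' is the only reference anywhere on the stack, so
# a write through it cannot be observed through any other name.
xs = [1, 2, 3]
xs.append(4)

# Weaker property: 'ys' aliases 'xs', but nothing is ever written
# through 'ys', so code reading through 'xs' still behaves as though
# the object were unaliased.
ys = xs
total = sum(ys)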
Bytecode VMs
I love to play with bytecode VMs. Of particular interest to me are
VMs for bootstrapping. These involve writing minimal assembler or C,
and thus providing a runtime on which you can write code to get things
done. This is something of a mashup of two things: the bootstrapping
king is FORTH, which can be defined quite easily with a little
assembler code. On the other hand, a truly nice bytecode VM is
implemented by femtolisp, which uses entirely printable opcodes.
My bytecode VMs have no practical use - they are just fun to
implement. First I like to define the types and operations I would
like to have, then I define the representation of values and what
state needs to be tracked, and then write a function to perform each
operation. I take care that each operation is defined in terms of
primitives that translate easily into assembly.
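The skeleton usually ends up looking something like this toy interpreter (just a sketch, not either of the VMs described below):

# Each operation is a small function over the machine state, written in
# terms of primitives (loads, stores, pushes, adds) that map directly
# onto a few assembly instructions.
class State:
    def __init__(self, code):
        self.code, self.pc, self.stack = code, 0, []

def op_push(state, operand):
    state.stack.append(operand)

def op_add(state, operand):
    b = state.stack.pop()
    a = state.stack.pop()
    state.stack.append(a + b)

OPS = {0x01: op_push, 0x02: op_add}

def run(state):
    # Fixed two-byte encoding (opcode, operand) keeps decoding trivial.
    while state.pc < len(state.code):
        opcode, operand = state.code[state.pc], state.code[state.pc + 1]
        state.pc += 2
        OPS[opcode](state, operand)
    return state.stack[-1] if state.stack else None

# run(State(bytes([0x01, 2, 0x01, 3, 0x02, 0]))) evaluates 2 + 3 -> 5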
I've recently designed two of these languages - a fancy one with
datatypes and fancy matching and all the stuff I'd want in a bytecode
language - and a very simple one, with just array-lists, bytestrings,
fixed-size integers, closures, and a handful of constants (booleans
and a none value). This last one has the handy benefit that integer
literals appear almost as their hexadecimal representation in source.
I've written a handful of functions in it, such as equals? and foldr,
and can confirm it's not the sort of language you would want to use on
a regular basis, but it is quite usable.
The most dramatic thing I have taken away from the latter
implementation: make all your functions unary, and have them take a
list. If instead you have your CALL_* operations take an extra nargs
parameter, that requires a lot more work in the code for the
operation, and then you still need to do extra work to actually get
the arguments desired to the top of the stack. Another alternative is
just giving the function the current stack to work with, but that
truly complicates return, as well as one other thing I ran into.
That is, the second big takeaway - you really want a locals vector.
Before I started creating one in my functions as a general practice, I
found myself needing to set the stack up a certain way so that I could
tail-call into different conditions. Most of the time, that was not a
constant-time process. So make it easy - I tend to push a vector with
default values for locals at function entry.
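In the same toy terms as the interpreter above, those two takeaways look roughly like this (the opcode names are invented, and the callee is shown as a host callable just for brevity):

# CALL_N style: the operand says how many arguments to pop, so the
# operation shuffles a variable number of stack slots, and the caller
# must have arranged exactly that many on top.
def op_call_n(state, nargs):
    args = [state.stack.pop() for _ in range(nargs)][::-1]
    fn = state.stack.pop()
    state.stack.append(fn(args))

# Unary style: every function takes exactly one list, so the call is a
# fixed pop-pop-push with no per-call bookkeeping at all.
def op_call(state, _operand):
    args = state.stack.pop()      # a single list value
    fn = state.stack.pop()
    state.stack.append(fn(args))

# Function prologue: push a vector of default-valued locals, so later
# blocks (and tail calls) can address locals by index in constant time
# instead of rearranging whatever happens to be on the stack.
def op_enter(state, nlocals):
    state.stack.append([None] * nlocals)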
The third takeaway is that if people - rather than compilers or
assemblers - are going to write the bytecode, don't use a relative
offset or even an absolute position in your jump opcode: instead, have
it take a
command string as argument. This made function construction fairly
readable, as they start with allocating an empty vector, then
appending strings for each basic block, appending the rest of the
function's 'environment', and then combining the environment with the
function's start block.
I made a couple of other concessions for readability. Whitespace is a
no-op; that is, spaces can be used to logically group different
operations. Comments can be embedded, as the ; operation skips
commands until the next newline. String literals can be embedded too,
with the " operation performing a similar function.
Here's a code example - foldr:
;; ZL creates a list of length zero. This will be our environment.
ZL
;; Here, the backtick ` aborts with the message on the top of the
;; stack, and the letter 'a' at the end appends this block to the
;; list.
"\"number is not iterable\"`"a
;; take ([) item 9 (X9) from the environment (e) and branch (q)
;; to it. Presumably I wanted to keep the list functions near
;; each other.
"eX9[q"a
"eXb[q"a
"\"function is not iterable\"`"a
"\"none is not iterable\"`"a
"\"true is not iterable\"`"a
"\"false is not iterable\"`"a
"\"<unknown> is not iterable\"`"a
;; bX1 = first argument, ZmZ[ is the index, ZmX1[ is the
;; accumulated result
"eXA Z bX1[l ZmZ[ =p +q"a
"Zm bZ[ ZL bX1[ ZmZ[ [a ZmX1[a . X1] Zm ZmZ[ ] eX9[q"a
"ZmZ[z"a
;; strings
"eXE Z bX1[_ ZmZ[ =p +q"a
"Zm bZ[ ZL bX1[ ZmZ[ wa ZmX1[a . X1] Zm ZmZ[ ] eXD[q"a
"ZmZ[z"a
;; start block
;; push local vector [index, result], branch on type of arg1
"ZL Za bX2[a eX1mhq"
\ ;; Construct function
It's clearly not pretty or easy to use - but there is no undefined
behaviour or weird linking behaviour, and you could see how - stranded
on a desert island with nothing but an m68k and a way to write the
operations - you could bootstrap your way to a real language that you
could work in. You could easily write a compiler to generate code for
it, too.
Instruction Set Architectures
I'm a real chip geek, and it's an interesting time to be one: at the
high end, there's real interest around RISC-V. On the lower end,
devices with ARM processors in them are everywhere. You can even find
some MIPS-based routers if you are prepared to do some groundwork. I
got into software right after 2006, during the rise of amd64 in the
consumer space. The thing to know about that time is that, just like
in the early 90s, you had to mortgage your house to get something
interesting that still had the power to do what you want to do. The
only real players at the time were the SPARC family and the Itanium 2,
and they were both priced well beyond anything I could afford.
Anyway, we now have the ability to design, test, and run different
levels of emulation for any kind of hardware we want to build right
from our desks - so it's well worth playing with. You can build stack
architectures like the RTX2010 or Burroughs BLS 6000, or you can build
wide vector architectures like AMD's GCN, which powered the Radeon HD
7xxx series of graphics cards.
Me - I'm interested in three things: flexible security, wide
execution, and great performance tools available to sophisticated
compilers and safe runtimes. Oh, and the less patent-hobbled the
better. I'm still bitter at Intel for pricing the Itanium 2 beyond
developers who wanted to develop for it, and for killing the i960 in
the consumer market.
It's great fun to play with the concepts behind ISA design. For
example: have you ever wondered why the different operations on a CPU
take the time that they do? Multiplication typically takes three
times as long as addition, which might sound puzzling until you find
out that they are probably using a Dadda multiplier, which is truly
a beautiful trick. Why, on a modern x86-64 CPU, does integer division
take 29 mostly-pipelined cycles, but floating-point division take 9
unpipelined cycles? Because there is only one hardware low-latency
division unit, and it's in the floating-point path. Fun exercise:
figure out how they implemented the different division units.
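For the multiplier half of that, a quick back-of-the-envelope in Python shows why the Dadda scheme is so fast: the column heights shrink geometrically, so only a logarithmic number of carry-save stages are needed, each costing a single full-adder delay.

def dadda_stages(n):
    # Dadda's maximum column heights: d1 = 2, d(j+1) = floor(1.5 * d(j)),
    # i.e. 2, 3, 4, 6, 9, 13, 19, 28, 42, 63, ...
    heights = [2]
    while heights[-1] < n:
        heights.append(heights[-1] * 3 // 2)
    # Each reduction stage brings the tallest column down to the next
    # smaller height in the sequence, until only two rows remain for a
    # final carry-propagate add.
    return sum(1 for h in heights if h < n)

# dadda_stages(64) == 10: ten full-adder delays of reduction plus one
# fast final add is how a full multiply can fit into the roughly
# three-cycle latency mentioned above.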
For myself, writing a compiler that is good at figuring out how to
remove false data dependencies, it's fun to look at wide architectures
and figure out how to make them work with real-world programming
languages. These are languages which are semantically about graph
walking and control flow, and require all sorts of dynamic allocation
patterns - but if we can figure out how to execute that sort of code
on vector architectures, we have a great path to real performance
improvements.
As a compiler architect, there are a few things I think are missing
from the GPU that would make it a great tool for parallelising typical
inner loops. These include a way to represent a (possibly conditional)
append to the result list within each loop iteration, ways to easily
handle function pointers, and a way to partition the different streams
by a value and handle those values one at a time.
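The conditional append is the one I keep coming back to; the usual way to express it on wide hardware today is a mask plus a prefix sum to assign output slots, something like this NumPy sketch:

import numpy as np

def compact_append(values, keep):
    # 'keep' is the per-lane predicate. An exclusive prefix sum over it
    # gives every surviving lane its slot in the packed output, so each
    # lane can scatter independently - no serial append required.
    keep = keep.astype(np.int64)
    slots = np.cumsum(keep) - keep            # exclusive prefix sum
    out = np.empty(int(keep.sum()), dtype=values.dtype)
    out[slots[keep == 1]] = values[keep == 1]
    return out

# compact_append(np.arange(8), np.arange(8) % 3 == 0) -> array([0, 3, 6])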
I can't say yet that I've figured it out - until I can see the sort of
changes I can make to the way we lay out objects in the compiler - but
I'm having a lot of fun exploring the possibilities.
Actually, a good place to start playing would be the old Moxie Logic
blog posts about GCC support. I have often wondered whether GCC could
be augmented to graph the cost/benefit of different numbers of
registers and to see how they get used.