The Perfect Instruction Set

The x86 architecture is ageing, but rather than looking for re-invention, it only saw incremental extensions (especially for operating system instructions and SIMD) over the last decade or so. Before getting to the i7 core, we saw a long series of evolutions—not revolutions. It all started with the 8086 (and its somewhat weaker sibling, the 8088), which was first conceived as an evolutionary extension to the 8085, which was itself binary compatible with the 8080. The Intel 8080’s lineage brings us to the 8008, a 8 bits of data, 14 bits of address micro-processor. Fortunately, the 8008 isn’t a double 4004. The successors of the 8086 include (but the list surely isn’t exhaustive) the 80186, the 80286, the 80386, first in the series to include decent memory protection for multitasking, then the long series of 486, various models of Pentium, Core 2 and i7.

So, just like you can trace the human lineage to apes, then to monkeys, and eventually to rodent-like mammals, the x86 has a long history and is still far from being perfect, and its basic weakness, in my opinion, is that it still use the 1974 8080 accumulator-based instruction set architecture. And despite a number of very smart architectural improvements (out of order execution, branch prediction, SIMD), it still suffers from the same problems the 8085 did: the instruction set is not orthogonal, it contains many obsolete complex instructions that aren’t used all that much anymore (such as the BCD helpers), and that everything has to be backwards compatible, meaning that every new generation still is more of the same, only (mostly) faster.

But what would be the perfect instruction set? In [1], the typical instruction set is composed of seven facets (to which I add an eighth):

  • Type of ISA. CISC or RISC, or hybrid. The CISC (complex instruction set computer) approach gives the ISA instructions that perform more complex operations in a single instruction while the RISC (reduced instruction set computer) approach takes the opposite view in making the instruction set sport very few instructions and instructions that do very little, very specialized operations. A typical RISC architecture will have only a very limited of complex instructions (that deal with the OS aspects) and concentrate in instructions that do just a single thing, for example, adding two values already in registers, but would not have an instruction that computes an address, loads a value from memory, then add it to a register. The second type of instruction is of the CISC type where, in essence, the number of instructions is minimized, but to the cost of having instructions with literally hundreds of variants.
  • Memory Management. How the CPU accesses memory dealing with things such as alignment, virtual memory, and caches.
  • Addressing Modes. How does the CPU generates the addresses to access data. There are many addressing modes, ranging from the very simple to the very complex.
  • Types and Sizes of Operands. The size of the registers, the size of the basic data types such as bytes, words, etc. It also considers the types of data the processor can manipulate; signed, unsigned integers, floating points, packed registers, etc.
  • Instructions. The very instructions the processor recognizes. These can be divided in many categories such as arithmetic, flow control, memory management, address generation, virtual memory, OS-helpers, etc.
  • Flow Control. What type of flow control is permitted? Surely, flow control instructions must include unconditional jumps, conditional jumps, and function calls; certainly indirect jumps; maybe it can even help with high-level languages constructs such as C’s switch statement.
  • Instruction Encoding. Are the instructions of variable or fixed-length? If instructions are allowed to vary in (byte) lengths, it certainly gives more flexibility for extending the instruction set and accommodating complex instructions that have many variants. Having fixed-width instructions, as it is typical of RISC processors, will limit the instruction set somewhat; especially respective of immediate values.
  • Extensions to the Instruction Set. (my addition to the list) Can the architecture accommodate application-specific extensions such as SIMD or cryptographic acceleration (such as Via’s Padlock, for example)?

All of these criteria must be balanced in order to minimize die size, power consumption, circuit complexity, while maximizing instruction throughput, that is, processor speed (which isn’t necessarily measured in GHz alone). And you note that this list doesn’t say much about superpipeline, superscalar, or out-of-order processing. That’s pretty much left to the implementation of the processor itself. To make a parallel, the ISA is a bit like the Java language specification, and the CPU like the Java virtual machine: many different implementations with different trade-offs will lead to a Java program being flawlessly executed, but with varying performances.

With a very wide difference between memory and processor bandwidths, the CISC approach is probably preferable: might as well have a single read from memory to get an instruction to perform as much as possible. This helps the CPUs to be designed for a memory hierarchy with important performance differences between stages. On the other hand, a CISC CPU is considerably more complex than a RISC CPU, as it may have to devote a lot of circuity to decompose the complex instructions into micro-instructions (the “atomic” instructions that are used to build a complex instruction) and execute the micro-instruction out-of-order. Note that both CISC and RISC CPUs can be superscalar and out-of-order; it is just that superscalar and out-of-order techniques help CISC CPUs quite a lot. So, I’d go for CISC.

The alignment, that is, how the processor reads and writes from and to memory, is also a crucial aspect of the instruction set. If the CPU is allowed only aligned reads/writes, the program must deal explicitly with how data is fetched from memory. That is, on a processor that does not allow unaligned reads, reading a 32 bits (or 4 bytes) value at an address 4n+1 on a 32-bits aligned bus would cause a “bus error” and the read or write would fail. On a processor that deals naturally with unaligned reads or writes, the worst that can happen is that the CPU generates several reads (or writes) to complete the memory access. Taking the 4n+1 address example again, the CPU would just write the first 3 bytes with a first access, and the last one with a second access (although it may have to perform read, mask, write series of operations). This is more costly in cycles, but it makes programming a whole lot simpler—aligned processors are a pain.

Addressing modes also play a major rôle in program simplicity. If you can translate a high-level language construct such as t[a+3*x] in very few instructions, it may result in better performance than if you have to emulate this relatively simple addressing mode using a comparatively large number of instructions. It also makes the mapping from high-level languages to assembly easier for the compiler, and you can cargo-cult the thing away as the processor is likely to have very complex and high-performance hardware dedicated to address calculation. On the other hand, if the processor doesn’t know much about address generation, you’re left to hope the compiler is really smart and knows how to generate the most efficient code for the processor.

The types and sizes of operands can be divided into two, or maybe three, basic categories: integers, floats, and vectors. It is clear (to me) that the two first categories are mandatory. We must have both unsigned and signed integers. Floats, probably IEEE 754 floats, are also quite nice to have. Vector data is merely very long contiguous blocks of data, maybe 128 or 256-bits (16 bytes or 32 bytes) long, and can be transported from and to memory using single instructions. Their interpretation should also vary from integers to float, but also may include other interpretations.

Two things that are missing from the x86 ISA are gather reads and scatter writes for vector data: the addressing modes are limited to a single bank of contiguous data. Gather reads allows the CPU to fetch scattered memory locations into a single vector. For example, reading from location x one byte every three bytes, for 8 reads, would allow the data in discontinuous locations to be read and packed into a single 64-bits wide register. Inversely, scatter writes takes a packed vectors and writes its elements in discontinuous memory locations: for example, starting at y, every 7 bytes, for 16 bytes. This cuts both addressing modes and types/sizes concerns.

The instruction set should be orthogonal. The major beef I have against x86 ISA is that it is accumulator-based and that many instructions are over-specialized in that they force particular registers to be used. For example, you can’t choose any registers to perform a multiply: one operand must be in ax (or eax for 32 bits, or rax for 64 bits) and the result is spread in ax:dx (or eax:edx, rax:rdx). If you had something else in those registers, well, tough luck, you must move the values out before or they’re lost. Other instructions (such as in) also necessitates specific registers to be used. This leads to an effect called register starving, which may mean that there are too few registers, or that there are too few registers available for the next instruction. An orthogonal instruction set that can use any register to perform any instruction greatly reduces this register starving effect. Ideally, I would push the idea further as to not distinguish floating point registers from integer registers, whereas they are segregated in the current x86 architecture. I am undecided whether or not I would also have normal registers hold vector values; this comes to a trade-off between performance and ALU width.

Flow control instructions are also necessary, but I don’t really see what more we need if we already have unconditional, conditional, and indirect jumps; and direct and indirect function calls (and returns). Interrupt-like call mechanisms are necessary, but they are rather simple.

The fixed-width encoding of instructions in typical RISC limits the flexibility of the instruction set, especially relative to its immediate values: if an instruction is 32-bits long, you can’t load a 32-bits immediate value with it: with the R3000 processors you would have to perform two loads (one for the lower part and one for the upper part) to load a single 32-bits immediate value. Fixed-width encoding also poses the problem of backwards compatibility when releasing new versions of the processors. What if previous versions had only 8 registers and that now you have 32? Do you have old instructions execute correctly and add a new set of instructions to accommodate the new registers in the unused op-codes map? Or do you just have a different instruction set altogether? I would guess it would be easier to retain compatibility with a variable-length instruction encoding.

Extending the ISA is quite essential for many high-performance application. In digital signal and multimedia processing, in simulations, graphics, etc., it was understood a long time ago that you just couldn’t get things done with a general purpose processor alone. That’s why GPUs are very different from general purpose processors: they are tailored and highly optimized for the problem they are meant to solve: high-speed graphics rendering. It is not always clear whether a task should be delegated to a specialized co-processor (such as a GPU) or if it should rather be performed by the processor itself using specialized instructions, but I think that the processor should be able to have different instruction extensions to perform more domain-specific tasks. In the x86 we have SIMD instructions both for integer and float values, but the SIMD instructions are not very specialized and are therefore sometimes rather hard to use efficiently for a given application. Some processors, such as the VIA C3, offer very specialized instructions to help with cryptographic applications which are needed in a wide variety of applications ranging from secure banking to the more mundane problem of having your customers pay for digital TV. In any case, the ISA should provide means of extensions and means to identify which extensions are present in this particular processor (something like the CPUID instruction).

To make it short, my perfect instruction set architecture:

  • would be CISC
  • would deals with unaligned memory accesses (in addition to protected, virtual memory)
  • would sport complex addressing modes including gather/scatter
  • would have many data types
  • would be orthogonal with many registers
  • would have variable-length encoding instruction encoding
  • would have an extensible ISA allowing new specialized instructions to be added as needed

*
* *

All the ideas I discussed here were implemented in many processors, but none managed to dislodge the x86 from its dominant position in the server and desktop markets. Why is that so? Shouldn’t better architectures come in and simply supersede the older, ageing, and baroque architecture that is the x86? Yes, in theory, it would, if it wasn’t for the immense inertia of countless existing proprietary applications. In the open-source world, this is not much of a problem because most (but not all) of the software is already written with cross-platform portability in mind, and for most of it, it (almost) suffice to port the compiler to a new architecture to have the complete code-base work on this new architecture. In the closed-source world of proprietary software, software is too often written for a specific plat-form combo such as, say, windows 2000 and IIS, and continues on living for years on the fumes alone, or, more exactly, on the continued backwards compatibility of processors and operating systems. You can run DOS on a i7 (beats me why you’d do such a hideous thing) and you can run Windows 2000 on a eight core Xeon—it’ll probably run somewhat crippled as not dealing with all the hardware advances, but it’ll run just fine.

When I switched to Linux as my primary plat-form about 6 years ago, I didn’t realize right away how much of an advantage cross-platform portability is. Now, if I fancy a MIPS-based workstation I can just have a MIPS-friendly Linux distribution installed on it, apt-get (or whatever is the package manager) all my applications, rsync my data, and go on with my life. If I’d still be with Windows, I would have to stay in the Wintel world. What this particular aspect of open-source gives us is the ability to just ditch a CPU series if a better one comes along.


[1] John L. Hennessy, David A. Patterson — Computer Architecture: A Quantitative Approach — 4th ed., Morgan Kaufmann, , 704 pp. ISBN 0-12-370490-1

16 Responses to The Perfect Instruction Set

  1. “[CISC] helps the CPUs to be designed for a memory hierarchy with important performance differences between stages.”

    Does it? There’s not too much difference between “add eax, [esi]; add ebx, eax” and “load ecx, 0(esi); add eax, ecx; add ebx, eax” – if you don’t have a stall after the first add in the first sequence, you might as well not have it in the second.

    “Aligned processors are a pain”

    Are they? If you can read 1, 2, 4, 8 bytes from any naturally aligned location, then you won’t have many problems – thanks to C alignment rules. In case you want to save some cycles, perf. penalty for unaligned loads is bad anyway. Of course, if you can have unaligned loads with no penalty, that’s great – but if there is a penalty, disabling them is not too much of a loss.

    While complex addressing modes can be useful, the argument about hoping for the compiler quality is, uh, a non-argument. With todays complexity of processors, you can only hope – be it x86 or PowerPC.

    I’d like to murder anyone who introduces a general-purpose CPU with floating-point support that is not IEEE754 32/64 bit. Yes, that includes the internal 80-bit x87 – thank god there is SSE now. Some seemingly minor deviations may hurt as well – I’ve had to work around SPUs having only a round-to-zero rounding mode for floats (definitely not minor, loses energy).

    While variable-length instructions indeed offer better extensibility perspectives, I’d like the encoding to be more like UTF-8 than x86 – given a random byte in the sequence, it’s really easy to decode it if it’s UTF-8, but really hard to disassemble it if it’s an x86 instruction stream.

  2. Steven Pigeon says:

    “Aligned processors are a pain”

    Are they? If you can read 1, 2, 4, 8 bytes from any naturally aligned location, then you won’t have many problems – thanks to C alignment rules.

    True, most of the time you won’t notice the difference and the compiler takes take of hiding all of this for you. The cost of this is wasted storage, which you may or may not care for. In embedded devices, you do. Even in high-end servers you might eventually do. But when I wrote that I was thinking of variable length codes and stuff like that that requires you to read bitstrings with random alignment. On a processor that doesn’t really mind where you read a chunk of memory from, the code is a lot simpler.

    I’d like the encoding to be more like UTF-8 than x86

    That’s also what I had in mind, but not exactly for the same reasons. When you look at the original x86 opcode map, it’s already quite full. They left out 0x0f, 0x63–0x67, 0xd7, and 0xf1 free. Everything else is filled (as I can see in my 1986 iAPX 86/88 manual). It’s a bit like the Dewey classification system where all the caterogies are already assigned and end up with having to classify new stuff in stupid categories (Dewey classifies computer science in 005.something, between misc. and etc.). An approach à la UTF-8 is more like the Congress classification, extensible and with gaps left for future stuff.

  3. […] This post was mentioned on Twitter by Mark Papadakis and reddit_prog_hot. reddit_prog_hot said: The Perfect Instruction Set http://bit.ly/dW2XQU http://bit.ly/ifcVDW [6 comments] […]

  4. witek says:

    Orthogonality of instruction yes, but only in regards to universal registers. Loading/storing into memory should be done using separate instructions. This simplifies many things and is more RISC approach.

    Many datatypes in the same register is a pain for writing good register allocator in compiler. Only recently wee start to have good allocators with subregisters.

    As of alligment, most of unaligneed problems comes from SSE, which needed alligned operands. On new CPUs even unaligned operations works, they are just slower. And this is more connected with cache. Maybe you want unaligned cache? This can be interesting, and bring some improvments, but unfortunetly will be more complex, because of overlap beetwen multiple cache-lines will be possible.

    I would like to have:
    – Conditional execution of instruction, just like in ARM!
    – flat, wide and big and universal register file. (64x64bit, and possibility to merge 2 or 4 consecutive registers when using Vector Instructions. with sub-read operations and few registers reserved for operating system).

    • Steven Pigeon says:

      As of alligment, most of unaligneed problems comes from SSE, which needed alligned operands.

      Depends. Some instructions will require alignment, others not. There are a variety of load/store instructions that do not care about alignment. Some instructions taking a memory operand will also work properly without alignment. What you (maybe) lose is performance as unaligned reads may be more expensive than perfectly aligned reads.

      I would like to have:
      – Conditional execution of instruction, just like in ARM!
      – flat, wide and big and universal register file. (64x64bit, and possibility to merge 2 or 4 consecutive registers when using Vector Instructions. with sub-read operations and few registers reserved for operating system).

      I too think that would yield a rather elegant solution.

  5. LennyP says:

    Interesting article. First I’m curious what the “perfect” instruction set would be looking up from the bottom instead of down from the top; as in software designer vs chip designer.

    As to my own idea of “perfect” it would entail a strong dynamic element so I can reconfigure, on the fly, the registers, caches, i/o, etc based upon the context of the problem I’m trying to solve. There is no one “perfect” instruction set for all tasks, so why not allow some instructions to be constructed or added/removed at compile time or by the designing engineer during runtime?

    • Steven Pigeon says:

      Probably that from a hardware perspective instructions have to be much simpler than they are in a CISC CPU.

      Dynamic instructions could be had with user-definable micro-code, I guess, but that still would not give a lot of flexibility because you would have to deal with whatever hardware there is in the CPU. Maybe a FPGA-like extension would provide the means of configuring the CPU at the gate level, but I do not know how it compares to hard silicon, speed-wise.

  6. Mark Miller says:

    Two instruction set approaches that were under-appreciated should be revisited:

    The National Semi NS32032 and tagged architectures. I would love to see 64 bit reincarnations.

    • Steven Pigeon says:

      I has under the impression that the NS32032 was more or less strongly inspired by the 680×0, by its instruction structure. (Although I didn’t work much with the 680×0, I have some assembly-language experience with the 68040).

  7. Stu says:

    Hm, have you looked at the ARM instruction set?

    • Steven Pigeon says:

      Cursorily. I haven’t had the chance of developping for ARM yet, I can’t really comment the instruction set. It seems very motorolèsque and the SIMD instructions seems a lot simpler than Intel’s.

  8. TIm Locke says:

    Also, should be pure Big-Endian.

    • Steven Pigeon says:

      I also think so. What are your reasons for that choice?

      • Tim Locke says:

        Just cuz. Are we sure we want to take the chance of getting into a religious debate on here? :) Are there any valid reasons for either preference? Here are some of mine:

        1. Big-endian is easier for humans to read as it is what we are used to from using decimal, at least in English. German and Dutch seem to be little-endian. Interestingly, Arabic numerals seem big-endian in left-to-right languages but are little-endian in right-to-left languages such as Arabic. Hmm.

        2. The best processor architectures are big-endian (6800, 6809, 68000, early PowerPC, SPARC).

        3. Apparently the IP protocol is big-endian, although I would think it is swapped in hardware when needed these days.

        4. Apparently the JVM is big-endian. I’m not sure how much if there is any performance hit on little-endian processors.

        5. HMS, YMD, and file paths are big-endian.

        As I mentioned, it is also important that the endianness be pure. The 68000 came close but apparently its bytes were little-endian. I believe this would be little noticed unless shifting or rotating bits. Most discussions of 68000 endianness do not even mention it.

        Regarding non-pure endianness, the DEC PDP-11 was little-endian except for 32-bit integers, its registers and floating point. The DEC VAX was little-endian except for floating point and BCD.

        Let me know if any of what I’ve said is in error.

  9. Steven Pigeon says:

    For 3., it’s done in software, using ntoh and hton (network-to-host and host-to-network) function family and usually (CISC) processors give you a machine instruction to do just that: x86 has bswap. Maybe there’s proviso for this on the NIC hardware itself, but we’d have to look at a NIC driver to see if it’s the case.

    For 4., you’re not going to tell me that *performance* is an issue with Java, aren’t you? :p But more to the point: I thought Java was trying hard to isolate that kind of detail from the coder?

    I somewhat expected more hackerish reasons, such as it may be possible to compare two integers in memory without reading neither completely (which no processor ever bothers doing, I’m sure).

    • Tim Locke says:

      Yes, Java tries to isolate the coder from endianness but it must take a fraction of a second to deal with it while loading code on little-endian machines.

      If there were any relevant benefits to one or the other at the hardware level then the issue would not have arisen.

      Programmer preference is the largest benefit. Because of x86, most low-level programmers are familiar with little-endian but I still think big-endian is easier to learn because it will seem natural to new programmers. Those little-endian students have to learn yet one more thing.

      Anyone know which endianness Windows for ARM will use?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: