The x86 architecture is ageing, but rather than being re-invented, it has only seen incremental extensions (especially operating system instructions and SIMD) over the last decade or so. Before getting to the i7 core, we saw a long series of evolutions, not revolutions. It all started with the 8086 (and its somewhat weaker sibling, the 8088), which was first conceived as an evolutionary extension to the 8085, which was itself binary compatible with the 8080. The Intel 8080’s lineage brings us to the 8008, a microprocessor with 8 bits of data and 14 bits of address. Fortunately, the 8008 isn’t a double 4004. The successors of the 8086 include (but the list surely isn’t exhaustive) the 80186, the 80286, the 80386 (the first in the series to include decent memory protection for multitasking), then the long series of 486, various models of Pentium, Core 2 and i7.
So, just like you can trace the human lineage to apes, then to monkeys, and eventually to rodent-like mammals, the x86 has a long history and is still far from being perfect. Its basic weakness, in my opinion, is that it still uses the accumulator-based instruction set architecture of the 1974 8080. Despite a number of very smart architectural improvements (out-of-order execution, branch prediction, SIMD), it still suffers from the same problems the 8085 did: the instruction set is not orthogonal, it contains many obsolete complex instructions that aren’t used all that much anymore (such as the BCD helpers), and everything has to be backwards compatible, meaning that every new generation is still more of the same, only (mostly) faster.
But what would be the perfect instruction set? In , the typical instruction set is composed of seven facets (to which I add an eighth):
- Type of ISA. CISC, RISC, or hybrid. The CISC (complex instruction set computer) approach gives the ISA instructions that perform more complex operations in a single instruction, while the RISC (reduced instruction set computer) approach takes the opposite view: the instruction set sports very few instructions, each of which does one small, very specialized operation. A typical RISC architecture will have only a very limited number of complex instructions (those that deal with the OS aspects) and concentrates on instructions that do just a single thing, for example, adding two values already in registers. It would not have an instruction that computes an address, loads a value from memory, then adds it to a register. That second kind of instruction is of the CISC type where, in essence, the number of instructions is minimized, but at the cost of having instructions with literally hundreds of variants.
- Memory Management. How the CPU accesses memory, dealing with things such as alignment, virtual memory, and caches.
- Addressing Modes. How the CPU generates the addresses to access data. There are many addressing modes, ranging from the very simple to the very complex.
- Types and Sizes of Operands. The size of the registers, the size of the basic data types such as bytes, words, etc. It also considers the types of data the processor can manipulate: signed and unsigned integers, floating-point values, packed registers, etc.
- Instructions. The very instructions the processor recognizes. These can be divided into many categories such as arithmetic, flow control, memory management, address generation, virtual memory, OS-helpers, etc.
- Flow Control. What type of flow control is permitted? Surely, flow control instructions must include unconditional jumps, conditional jumps, and function calls; certainly indirect jumps; maybe it can even help with high-level language constructs such as C’s switch statement.
- Instruction Encoding. Are the instructions of variable or fixed length? If instructions are allowed to vary in (byte) length, it certainly gives more flexibility for extending the instruction set and accommodating complex instructions that have many variants. Having fixed-width instructions, as is typical of RISC processors, will limit the instruction set somewhat, especially with respect to immediate values.
- Extensions to the Instruction Set. (my addition to the list) Can the architecture accommodate application-specific extensions such as SIMD or cryptographic acceleration (such as VIA’s PadLock, for example)?
All of these criteria must be balanced in order to minimize die size, power consumption, and circuit complexity, while maximizing instruction throughput, that is, processor speed (which isn’t necessarily measured in GHz alone). Note that this list doesn’t say much about superpipelining, superscalar execution, or out-of-order processing. That’s pretty much left to the implementation of the processor itself. To draw a parallel, the ISA is a bit like the Java language specification, and the CPU like the Java virtual machine: many different implementations with different trade-offs will all execute a Java program flawlessly, but with varying performance.
With a very wide difference between memory and processor bandwidths, the CISC approach is probably preferable: a single read from memory might as well fetch an instruction that performs as much as possible. This helps CPUs designed for a memory hierarchy with important performance differences between stages. On the other hand, a CISC CPU is considerably more complex than a RISC CPU, as it may have to devote a lot of circuitry to decomposing the complex instructions into micro-instructions (the “atomic” instructions that are used to build a complex instruction) and executing those micro-instructions out of order. Note that both CISC and RISC CPUs can be superscalar and out-of-order; it is just that superscalar and out-of-order techniques help CISC CPUs quite a lot. So, I’d go for CISC.
Alignment, that is, how the processor reads from and writes to memory, is also a crucial aspect of the instruction set. If the CPU allows only aligned reads and writes, the program must deal explicitly with how data is fetched from memory. That is, on a processor with a 32-bit aligned bus that does not allow unaligned accesses, reading a 32-bit (4-byte) value at an address that is not a multiple of 4 would cause a “bus error” and the read or write would fail. On a processor that deals naturally with unaligned reads or writes, the worst that can happen is that the CPU generates several reads (or writes) to complete the memory access. Taking the same example with a value that starts one byte past a 4-byte boundary, the CPU would just write the first 3 bytes with a first access, and the last one with a second access (although it may have to perform a read, mask, write series of operations). This is more costly in cycles, but it makes programming a whole lot simpler: processors that insist on alignment are a pain.
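To make the "several accesses" idea concrete, here is a minimal sketch in C of how an unaligned 32-bit read can be assembled from two aligned word reads. It shows the read side rather than the write side, and it assumes a little-endian layout and a byte array standing in for memory; the function names are mine, not taken from any real processor or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read aligned 32-bit word number `index` from a byte buffer standing in for
   memory. The word is assembled little-endian so the result does not depend
   on the host machine. */
uint32_t load_aligned_word(const uint8_t *mem, size_t index)
{
    const uint8_t *p = mem + 4 * index;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Emulate an unaligned little-endian 32-bit load at byte address `addr`
   using only aligned word reads. `mem_size` must be a multiple of 4. */
uint32_t load_unaligned_u32(const uint8_t *mem, size_t mem_size, size_t addr)
{
    assert(mem_size % 4 == 0 && addr + 4 <= mem_size);

    size_t word = addr / 4;
    unsigned offset = (unsigned)(addr % 4);

    uint32_t low = load_aligned_word(mem, word);
    if (offset == 0)
        return low;                       /* aligned: one access is enough */

    /* Unaligned: a second access fetches the remaining bytes. */
    uint32_t high = load_aligned_word(mem, word + 1);

    /* Keep the top (4 - offset) bytes of the first word and the bottom
       `offset` bytes of the second one. */
    return (low >> (8 * offset)) | (high << (32 - 8 * offset));
}
```

With an address one byte past a word boundary, the first access contributes 3 bytes and the second contributes 1, which is the split described above.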
Addressing modes also play a major rôle in program simplicity. If you can translate a high-level language construct such as t[a+3*x] into very few instructions, it may result in better performance than if you have to emulate this relatively simple addressing mode using a comparatively large number of instructions. It also makes the mapping from high-level languages to assembly easier for the compiler, and you can cargo-cult the thing away as the processor is likely to have very complex and high-performance hardware dedicated to address calculation. On the other hand, if the processor doesn’t know much about address generation, you’re left to hope the compiler is really smart and knows how to generate the most efficient code for the processor.
The types and sizes of operands can be divided into two, or maybe three, basic categories: integers, floats, and vectors. It is clear (to me) that the first two categories are mandatory. We must have both unsigned and signed integers. Floats, probably IEEE 754 floats, are also quite nice to have. Vector data is merely very long contiguous blocks of data, maybe 128 or 256 bits (16 or 32 bytes) long, that can be transported from and to memory using single instructions. Their interpretation should also vary from integers to floats, but may also include other interpretations.
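As an illustration of what "emulating" the addressing mode means, the sketch below spells out in C the individual steps needed to reach t[a+3*x] when each statement stands for one simple operation. The function name and the 32-bit element type are assumptions made for the example; a processor with rich addressing modes would fold most of these steps into the load itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fetch t[a + 3*x] the long way, one simple operation per statement, as a
   CPU without scaled or indexed addressing modes would have to. */
int32_t load_indexed(const int32_t *t, size_t len, size_t a, size_t x)
{
    size_t index = x << 1;                         /* 2*x                   */
    index += x;                                    /* 3*x                   */
    index += a;                                    /* a + 3*x               */
    assert(index < len);                           /* bounds check, sketch only */

    size_t byte_offset = index * sizeof(int32_t);  /* scale by element size */
    const unsigned char *address = (const unsigned char *)t + byte_offset;

    int32_t value;
    memcpy(&value, address, sizeof value);         /* the actual load       */
    return value;
}
```

That is four arithmetic steps before the load even starts, which is the overhead dedicated address-generation hardware is meant to hide.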
Two things that are missing from the x86 ISA are gather reads and scatter writes for vector data: the addressing modes are limited to a single bank of contiguous data. Gather reads allow the CPU to fetch scattered memory locations into a single vector. For example, reading from location x one byte every three bytes, for 8 reads, would allow the data in discontinuous locations to be read and packed into a single 64-bit wide register. Conversely, scatter writes take a packed vector and write its elements to discontinuous memory locations: for example, starting at y, every 7 bytes, for 16 bytes. This cuts across both the addressing-mode and the type/size concerns.
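In software, the two operations look roughly like the following C sketch. The function names are made up for the illustration, the packing order (first element in the low byte) is my own choice, and I read "for 16 bytes" as 16 one-byte elements; real hardware would do the whole thing in one instruction instead of a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: read `count` bytes (at most 8) spaced `stride` bytes apart,
   starting at `src`, and pack them into a 64-bit value with the first
   element in the low byte. */
uint64_t gather_bytes(const uint8_t *src, size_t stride, size_t count)
{
    assert(count <= 8);
    uint64_t packed = 0;
    for (size_t i = 0; i < count; i++)
        packed |= (uint64_t)src[i * stride] << (8 * i);
    return packed;
}

/* Scatter: write the `count` contiguous bytes of `vec` to `dst`, one every
   `stride` bytes. Bytes in between are left untouched. */
void scatter_bytes(uint8_t *dst, size_t stride, const uint8_t *vec, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i * stride] = vec[i];
}
```

The examples from the text then become gather_bytes(x, 3, 8) and scatter_bytes(y, 7, vec, 16).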
The instruction set should be orthogonal. The major beef I have against the x86 ISA is that it is accumulator-based and that many instructions are over-specialized in that they force particular registers to be used. For example, you can’t choose just any registers to perform a widening multiply with mul: one operand must be in ax (or eax for 32 bits, or rax for 64 bits) and the result is spread across dx:ax (or edx:eax, rdx:rax). If you had something else in those registers, well, tough luck, you must move the values out beforehand or they’re lost. Other instructions (such as in) also require specific registers. This leads to an effect called register starving, which may mean that there are too few registers, or that there are too few registers available for the next instruction. An orthogonal instruction set that can use any register to perform any instruction greatly reduces this register starving effect. Ideally, I would push the idea further and not distinguish floating-point registers from integer registers, whereas they are segregated in the current x86 architecture. I am undecided whether or not I would also have normal registers hold vector values; this comes down to a trade-off between performance and ALU width.
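To see why a multiply wants two destination registers in the first place, here is a small C sketch of a 32-bit widening multiply. The struct and function names are invented for the example. On x86 the 32-bit mul leaves the high half in edx and the low half in eax; an orthogonal ISA would let the programmer name any two registers for them.

```c
#include <assert.h>
#include <stdint.h>

/* The two halves of a 64-bit product, as they would land in two 32-bit
   registers. */
struct wide_product {
    uint32_t high;
    uint32_t low;
};

/* Multiply two 32-bit values without losing the upper bits of the result. */
struct wide_product mul_wide_u32(uint32_t a, uint32_t b)
{
    uint64_t full = (uint64_t)a * b;
    struct wide_product p = { (uint32_t)(full >> 32), (uint32_t)full };
    return p;
}
```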
Flow control instructions are also necessary, but I don’t really see what more we need if we already have unconditional, conditional, and indirect jumps; and direct and indirect function calls (and returns). Interrupt-like call mechanisms are necessary, but they are rather simple.
The fixed-width encoding of instructions in typical RISC limits the flexibility of the instruction set, especially relative to its immediate values: if an instruction is 32 bits long, you can’t load a 32-bit immediate value with it. With the R3000 processors you would have to perform two loads (one for the lower part and one for the upper part) to load a single 32-bit immediate value. Fixed-width encoding also poses the problem of backwards compatibility when releasing new versions of the processors. What if previous versions had only 8 registers and now you have 32? Do you have old instructions execute correctly and add a new set of instructions to accommodate the new registers in the unused part of the opcode map? Or do you just have a different instruction set altogether? I would guess it would be easier to retain compatibility with a variable-length instruction encoding.
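The two-step immediate load can be sketched in C as follows, assuming a 16-bit immediate field inside a 32-bit instruction. The function names are mine; on MIPS the pair is conventionally written as lui followed by ori.

```c
#include <assert.h>
#include <stdint.h>

/* Width of the immediate field assumed for this sketch. */
enum { IMM_BITS = 16 };

/* Split a 32-bit constant into the two halves that fit in an instruction. */
uint16_t imm_upper(uint32_t value) { return (uint16_t)(value >> IMM_BITS); }
uint16_t imm_lower(uint32_t value) { return (uint16_t)(value & 0xFFFFu); }

/* First instruction: put the upper half in the top of the register and
   clear the low bits. */
uint32_t load_upper(uint16_t imm) { return (uint32_t)imm << IMM_BITS; }

/* Second instruction: OR the zero-extended lower half into the register. */
uint32_t or_lower(uint32_t reg, uint16_t imm) { return reg | imm; }

/* Two instructions for what a variable-length encoding does in one. */
uint32_t load_imm32(uint32_t value)
{
    uint32_t reg = load_upper(imm_upper(value));
    return or_lower(reg, imm_lower(value));
}
```

A variable-length encoding simply appends the four immediate bytes to the opcode, at the price of a more complicated decoder.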
Extending the ISA is quite essential for many high-performance applications. In digital signal and multimedia processing, in simulations, graphics, etc., it was understood a long time ago that you just couldn’t get things done with a general purpose processor alone. That’s why GPUs are very different from general purpose processors: they are tailored and highly optimized for the problem they are meant to solve, high-speed graphics rendering. It is not always clear whether a task should be delegated to a specialized co-processor (such as a GPU) or whether it should rather be performed by the processor itself using specialized instructions, but I think that the processor should be able to have different instruction extensions to perform more domain-specific tasks. In the x86 we have SIMD instructions for both integer and float values, but the SIMD instructions are not very specialized and are therefore sometimes rather hard to use efficiently for a given application. Some processors, such as the VIA C3, offer very specialized instructions to help with cryptography, which is needed in a wide variety of applications ranging from secure banking to the more mundane problem of having your customers pay for digital TV. In any case, the ISA should provide means of extension and means to identify which extensions are present in a particular processor (something like the CPUID instruction).
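From the software side, identifying extensions usually boils down to testing bits in a feature word and picking a code path accordingly. The C sketch below is purely hypothetical: the bit positions and names are invented for the illustration and do not correspond to the real CPUID layout or to any actual processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension bits. These positions are invented for this sketch
   and do not match any real CPU. */
enum isa_extension {
    EXT_SIMD_INT   = 1u << 0,
    EXT_SIMD_FLOAT = 1u << 1,
    EXT_CRYPTO     = 1u << 2
};

/* Which implementation of a cipher the program should use. */
enum cipher_path { CIPHER_SOFTWARE, CIPHER_HARDWARE };

/* True when every bit in `required` is set in `feature_word`. */
bool has_extensions(uint32_t feature_word, uint32_t required)
{
    return (feature_word & required) == required;
}

/* Use the specialized instructions when present, else fall back to plain
   general-purpose code. */
enum cipher_path pick_cipher_path(uint32_t feature_word)
{
    return has_extensions(feature_word, EXT_CRYPTO) ? CIPHER_HARDWARE
                                                    : CIPHER_SOFTWARE;
}
```

The point of the fallback is that the same binary keeps running on processors that lack the extension, only slower.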
To make it short, my perfect instruction set architecture:
- would be CISC
- would deal with unaligned memory accesses (in addition to protected, virtual memory)
- would sport complex addressing modes including gather/scatter
- would have many data types
- would be orthogonal with many registers
- would have variable-length instruction encoding
- would have an extensible ISA allowing new specialized instructions to be added as needed
All the ideas I discussed here have been implemented in many processors, but none managed to dislodge the x86 from its dominant position in the server and desktop markets. Why is that so? Shouldn’t better architectures come in and simply supersede the older, ageing, and baroque architecture that is the x86? Yes, in theory, they would, if it weren’t for the immense inertia of countless existing proprietary applications. In the open-source world, this is not much of a problem because most (but not all) of the software is already written with cross-platform portability in mind, and for most of it, it (almost) suffices to port the compiler to a new architecture to have the complete code base work on it. In the closed-source world of proprietary software, software is too often written for a specific platform combo such as, say, Windows 2000 and IIS, and continues living for years on the fumes alone, or, more exactly, on the continued backwards compatibility of processors and operating systems. You can run DOS on an i7 (beats me why you’d do such a hideous thing) and you can run Windows 2000 on an eight-core Xeon: it’ll probably run somewhat crippled, since it can’t exploit all the hardware advances, but it’ll run.
When I switched to Linux as my primary platform about 6 years ago, I didn’t realize right away how much of an advantage cross-platform portability is. Now, if I fancy a MIPS-based workstation I can just have a MIPS-friendly Linux distribution installed on it, apt-get (or whatever the package manager is) all my applications, rsync my data, and go on with my life. If I were still on Windows, I would have to stay in the Wintel world. What this particular aspect of open source gives us is the ability to just ditch a CPU series if a better one comes along.