Apparently opening links and reading isn't your strong suit, so I'll save you the effort and post part of the Jim Keller interview where he explains things in idiot-proof terms. Unlike me, he actually designed these processors and knows what he's talking about.
CPU Instruction Sets: Arm vs x86 vs RISC-V
IC: You’ve spoken about CPU instruction sets in the past, and one of the biggest requests I got for this interview was about your opinion on CPU instruction sets. Specifically, questions came in about how we should deal with fundamental limits on them, how we pivot to better ones, and what your skin in the game is in terms of Arm versus x86 versus RISC-V. I think at one point you said most compute happens on a couple of dozen op-codes. Am I remembering that correctly?
JK: Arguing about instruction sets is a very sad story. It's not even a couple of dozen op-codes - 80% of core execution is only six instructions - you know, load, store, add, subtract, compare and branch. With those you have pretty much covered it. If you're writing in Perl or something, maybe call and return are more important than compare and branch. But instruction sets only matter a little bit - you can lose 10%, or 20%, of performance because you're missing instructions.
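To make the six-instruction claim concrete, here is a minimal sketch in C: a plain array sum compiles almost entirely to load, add, compare and branch. The RISC-V mnemonics in the comments are one plausible lowering, not guaranteed compiler output - it depends on target and flags.

    #include <stdio.h>

    /* Summing an array: the inner loop is just Keller's core six. */
    long sum(const long *a, long n) {
        long s = 0;                    /* constant load / add */
        for (long i = 0; i < n; i++)   /* compare + branch (blt), add (addi) */
            s += a[i];                 /* load (ld) + add (add) */
        return s;
    }

    int main(void) {
        long a[4] = {1, 2, 3, 4};
        printf("%ld\n", sum(a, 4));    /* prints 10 */
        return 0;
    }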
For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. You basically predict where all the instructions are in tables, and once you have good predictors, you can predict that stuff well enough. So fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, predicting or figuring out where all the instructions are isn't dominating the die. So it doesn't matter that much.
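The "predict where the instructions are in tables" idea can be sketched in a few lines of C. Everything below - the table size, the hashing, the 16-byte fetch block - is an invented toy, not how any real front end is organized, but it shows the shape: remember the boundaries you decoded last time and guess the same next time.

    #include <stdint.h>

    #define ENTRIES 1024

    /* One 16-bit mask per 16-byte fetch block: bit i set means a
       predicted instruction start at byte i of the block. */
    static uint16_t boundary_table[ENTRIES];

    static uint16_t predict_boundaries(uint64_t fetch_addr) {
        return boundary_table[(fetch_addr >> 4) % ENTRIES];
    }

    /* After the real decoders find the true starts, train the table
       so the next visit to this block predicts correctly. */
    static void train_boundaries(uint64_t fetch_addr, uint16_t actual_mask) {
        boundary_table[(fetch_addr >> 4) % ENTRIES] = actual_mask;
    }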
When RISC first came out, x86 was half microcode. So if you look at the die, half the chip is a ROM, or maybe a third or something. And the RISC guys could say that there is no ROM on a RISC chip, so we get more performance. But now the ROM is so small, you can't find it. Actually, the adder is so small, you can hardly find it. What limits computer performance today is predictability, and the two big ones are instruction/branch predictability and data locality.
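The data-locality half of that claim is easy to feel in code. A hedged demo, assuming a typical cached machine: both loops below do identical arithmetic, but if you time them, the column-order walk strides 8 KB between loads, defeats the caches and prefetchers, and usually runs several times slower.

    #include <stdio.h>

    #define N 1024
    static double m[N][N];

    int main(void) {
        double s = 0;
        /* Row-major walk: consecutive loads share cache lines. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        /* Column-major walk: same math, but each load jumps N*8 bytes. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        printf("%f\n", s);
        return 0;
    }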
Now the new predictors are really good at that. They're big - two predictors are way bigger than the adder. That's where you get into the CPU versus GPU (or AI engine) debate. The GPU guys will say ‘look there's no branch predictor because we do everything in parallel’. So the chip has way more adders and subtractors, and that's true if that's the problem you have. But they're crap at running C programs.
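For scale, the seed of those big predictors is the classic 2-bit saturating counter - a sketch below, with an arbitrary table size. Real predictors layer global history, tags, and multiple competing tables on top of this, which is how they end up way bigger than the adders.

    #include <stdint.h>

    #define PRED_ENTRIES 4096

    /* counters[i] in 0..3: 0-1 predict not-taken, 2-3 predict taken. */
    static uint8_t counters[PRED_ENTRIES];

    static int predict_taken(uint64_t pc) {
        return counters[pc % PRED_ENTRIES] >= 2;
    }

    /* Nudge the counter toward the branch's actual outcome. */
    static void train(uint64_t pc, int taken) {
        uint8_t *c = &counters[pc % PRED_ENTRIES];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }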
GPUs were built to run shader programs on pixels, so if you're given 8 million pixels, and the big GPUs now have 6000 threads, you can cover all the pixels with each one of them running 1000 programs per frame. But it's sort of like an army of ants carrying around grains of sand, whereas big AI computers, they have really big matrix multipliers. They like a much smaller number of threads that do a lot more math because the problem is inherently big. Whereas the shader problem was that the problems were inherently small because there are so many pixels.
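A rough sanity check of those numbers, as a worked calculation:

    6,000 threads x ~1,000 shader invocations per frame ≈ 6,000,000 pixels per frame

which is the same ballpark as the 8 million pixels on screen - each invocation is tiny, hence the army-of-ants picture.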
There are genuinely three different kinds of computers: CPUs, GPUs, and AI. NVIDIA is kind of doing the ‘inbetweener’ thing where they're using a GPU to run AI, and they're trying to enhance it. Some of that is obviously working pretty well, and some of it is obviously fairly complicated. What's interesting, and this happens a lot, is that general-purpose CPUs when they saw the vector performance of GPUs, added vector units. Sometimes that was great, because you only had a little bit of vector computing to do, but if you had a lot, a GPU might be a better solution.
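Here is the kind of "little bit of vector computing" a CPU vector unit soaks up - a hedged sketch assuming an x86 target with SSE (the function name and the n-divisible-by-4 simplification are mine). When the whole workload looks like this, a GPU's thousands of lanes win; when it's a few loops inside branchy C code, the in-core vector unit is the right size.

    #include <immintrin.h>

    /* Four float adds per instruction instead of one. */
    void add_arrays(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {   /* assumes n % 4 == 0 */
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }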
IC: So going back to the ISA question - many people were asking what you think about Arm versus x86. Which one has the legs, which one has the performance? Do you care much, if at all?
JK: I care a little. Here's what happened - when x86 first came out, it was super simple and clean, right? At the time, there were multiple 8-bit architectures: the 8080, the 6800, the 6502. I programmed probably all of them way back in the day. Then x86, oddly enough, was the open version. They licensed it to seven different companies, which gave people opportunity - Intel surprisingly licensed it. Then they went to 16 bits and 32 bits, and then they added virtual memory, virtualization, security, then 64 bits and more features. So what happens to an architecture as you add stuff is that you keep the old stuff so it's compatible.
So when Arm first came out, it was a clean 32-bit computer. Compared to x86, it just looked way simpler and easier to build. Then they added a 16-bit mode and the IT (if then) instruction, which is awful. Then they added a weird floating-point vector extension set with overlays in a register file, and then 64-bit, which partly cleaned it up. There was some special stuff for security and booting, and so it has only got more complicated.
Now RISC-V shows up and it's the shiny new cousin, right? Because there's no legacy. It's actually an open instruction set architecture, and people build it in universities, where they don't have the time or interest to add too much junk, like some architectures have. So relatively speaking, just because of its pedigree and age, it's early in the life cycle of complexity. It's a pretty good instruction set - they did a fine job. So if I wanted to build a computer really fast today, and I wanted it to go fast, RISC-V is the easiest one to choose. It's the simplest one, it has all the right features, it has the right top eight instructions that you actually need to optimize for, and it doesn't have too much junk.
IC: So modern instruction sets have too much bloat, especially the old ones. Legacy baggage and such?
JK: Instructions that have been iterated on, and added to, have too much bloat. That's what always happens. As you keep adding things, the engineers struggle. You can have a really good design with 10 features, and then you add some features to it. The features all make it better, but they also make it more complicated. As you go along, every new feature gets harder to add, because the interaction between that feature and everything else gets terrible.
The marketing guys, and the old customers, will say ‘don't delete anything’, but in the meantime they are all playing with the new fresh thing that only does 70% of what the old one does, but it does it way better because it doesn't have all these problems. I've talked about diminishing return curves, and there's a bunch of reasons for diminishing returns, but one of them is the complexity of the interactions of things. They slow you down to the point where something simpler that did less would actually be faster. That has happened many times, and it's some result of complexity theory and you know, human nefariousness I think.
IC: So do you ever see a situation where x86 gets broken down and something just gets reinvented? Or will it just remain sort of legacy, with new things like RISC-V popping up to fill the void when needed?
JK: x86-64 was a fairly clean slate, but obviously it had to carry all the old baggage for this and that. They deprecated a lot of the old 16-bit modes. There's a whole bunch of gunk that disappeared, and sometimes if you're careful, you can say ‘I need to support this legacy, but it doesn't have to be performant, and I can isolate it from the rest’. You either emulate it or support it.
We used to build computers such that you had a front end, a fetch, a dispatch, an execute, a load store, an L2 cache. If you looked at the boundaries between them, you'd see 100 wires doing random things that were dependent on exactly what cycle or what phase of the clock it was. Now these interfaces tend to look less like instruction boundaries – if I send an instruction from here to there, now I have a protocol. So the computer inside doesn't look like a big mess of stuff connected together, it looks like eight computers hooked together that do different things. There’s a fetch computer and a dispatch computer, an execution computer, and a floating-point computer. If you do that properly, you can change the floating-point without touching anything else.
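What "eight computers hooked together" buys you can be shown with a toy interface in C - every name here is invented for illustration. Dispatch talks to the floating-point unit only through one fixed call, so a new FP design slots in behind the same contract without touching anything else; the "five wires" temptation in the next answer is exactly a hole punched through this struct.

    #include <stdint.h>

    typedef struct { uint64_t opcode, src1, src2; } FpRequest;
    typedef struct { uint64_t result; int flags; }  FpReply;

    /* The whole contract between dispatch and FP is this one call. */
    typedef struct {
        FpReply (*execute)(void *unit, FpRequest req);
        void    *unit;   /* opaque: dispatch never looks inside */
    } FpPort;

    /* Swapping in a new FP design means supplying a new 'execute'
       behind the same port; the dispatch side is untouched. */
    static FpReply dispatch_fp(FpPort port, FpRequest req) {
        return port.execute(port.unit, req);
    }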
That's less of an instruction set thing – it’s more ‘what was your design principle when you built it’, and then how did you do it. The thing is, when you get to a problem, you could say ‘if I could just have these five wires between these two boxes, I could get rid of this problem’. But every time you do that, every time you violate the abstraction layer, you've created a problem for future Jim. I've done that so many times, and if you solve it properly, it would still be clean, but at some point if you hack it a little bit, then that kills you over time.