As cheesy as the title sounds, I promise it cannot beat the cheesiness of the technique I’ll be telling you about in this post. The morning I saw Mark Ermolov’s tweet about the undocumented instruction reading from/writing to the CRBUS, I had a bit of free time in my hands and I knew I had to find out the opcode so I started theory-crafting right away. After a few hours of staring at numbers, I ended up coming up with a method of discovering practically every instruction in the processor using a side(?)-channel. It’s an interesting method involving even more interesting components of the processor so I figured I might as well write about it, so here it goes.
You can find the full data-set and the implementation source at haruspex.can.ac / Github.
Preface 1/2: If storks are busy delivering babies, where do micro-instructions come from?
Modern processors are built with a crazy amount of microarchitectural complexity these days and as one would expect the good old instruction decoders no longer decode for the execution unit directly. They end up decoding them into micro-instructions according to the micro-code of the processor to dispatch to the execution ports of the processor. There are two units that perform this translation in a modern Intel processor:
- Micro-instruction Translation Engine (MITE). The unit responsible for the translation of simple legacy instructions that translate to four or less micro-instructions.
- Microcode Sequencer (MS). The unit responsible for the translation of much more complex instructions that drive the CISC craziness of the Intel architecture we all hold dearly.
Another unit that dispatches these micro-instructions is the Decoded Stream Buffer (DSB) more frequently known as the iCache but it is not really relevant to the experiment we’re going to do. Now, why am I telling you all of this? Mainly because we can profile these extremely low-level units thanks to the performance counters Intel gracefully provides us; mainly these two bad boys:


One of the advantages of utilizing these events compared to a Sandsifter-like approach is that even if the microcode of the instruction throws a #UD (say if the password does not match or if the conditions are not met) it cannot fool us as it’d still have to be decoded.
Preface 2/2: Executing in a parallel universe
Now the problem is getting those instructions there. There are a million of ways you can shoot yourself in the foot by executing random instructions. What happens if we hit an instruction that changes our interruptibility state, causes a system reset, messes with the internal caches or the very performance counters we’re using, the stack we’re executing on, the code of the profiler, the code segment we’re executing in…
The easy solution is to simply, not execute, at least in our visible universe, which brings us to our second cool microarchitectural detail: speculative and out-of-order execution. The processor evaluating the branch condition at the branch has apparently grown out of fashion so what really happens when you do a branch is that, first, the branch prediction engine attempts to guess the branch you are going to take to reduce the cost of the reversal of the instruction pipelines, given an equal possibility both branches execute at the same time where possible and no speculation fences are present and then one of them gets their net change reverted… which is pretty much what we wanted isn’t it? Now although probing the branch to reset the branch prediction state is possible and I’ve experimented with it before, it’s a bit bothersome so an easier way is to simply do a CALL. Consider this snippet:

We will also need to trash the Decoded Stream Buffer I’ve mentioned previously so that we do not get any micro-instructions fed from the cache but there’s a very simple solution for that. The instruction cache also has to handle self-rewriting code so execute memory is extremely sensitive to any memory writes, given this information adding the simple snippet below before every measurement takes care of this issue.

0x0: Precise data collection
Finally, we’ve hit the exciting part of the experiment, experimenting. We want precise results which implies non-interruptibility. You might be tricked into thinking being in kernel-mode and doing a
Repeating the experiment multiple times and picking the
0x1: Reducing the results
First of all, we’ll get two baseline measurements, the
Simply removing all measurements matching the

Every nop translates to a single micro-instruction that will be handled by the MITE, which means if the prefix is redundant MITS should be always off by just one and MS should stay the same, additionally, we can filter out some redundant checks by declaring

Suffix-based purging is also more or less the same logic which gets rid of 72869 instructions, leaving us with 1699 entries, which is good enough to start the analysis!

0x2: Deduction of behavior
Let’s demonstrate the amazing amount of information we’ve gathered just from these two counters and some bytes existing at some fixed place without even being executed. If MS is below the nop baseline, this would indicate that the control flow is interrupted meaning that it must be a branch or an exception and if the MITS is the same as the fault-baseline, this likely indicates a serializing instruction which dispatched a number of micro-instructions (given that it passed our initial filter of MS or MITS not remaining same) but then halted the speculative flow (since none of the NOP opcodes were decoded).


Considering how simplistic the conditions are, not bad right?
0x3: Deduction of speculative behavior
You might have noticed the “Speculation fence” indication in the previous data dump. I’ve gotten a bit greedy and wanted to also know if the instructions speculatively execute or if they halt the queue for the sake of side-channeling the results so I went ahead and collected another piece of information, which comes from a rather unexpected performance counter:

You might be wondering what this has to do with anything. This particular counter is very useful given the fact that we don’t do any divisions in our previous code. Why? Because this means adding a division right after the instruction and seeing if the cycles is non-zero will let us know if the speculative execution was halted or not. We achieve this by first squeezing a

AVX to SSE switch, memory stall, register dependencies, atomicity, this one has it all! Marking the entries that have non-zero counters as “speculative friendly” essentially lets us know what instructions we can speculatively execute and leak information from, and as you can see in the next example it seems to work pretty nicely.


0x
: Some of the interesting results
Here is the full list of undocumented easter-egg instructions my i7 6850k comes with:

Contrary to popular belief

Despite lacking the CPL check of

