The Design of Scalar AES Instruction Set Extensions for RISC-V

microcolonel · on July 30, 2020

I'm interested in if there's a simple design for a ChaCha/Poly1305 accelerating ISA extension for RISC-V (outside the general crypto extension, not sure even as a member of that where it is going). I feel like it is a lot simpler, if for no other reason than there being effectively a single mode for the ChaCha primitive rather than three or more. The whole operation is composed of bitwise rotations and adders that can be arranged in a static network, and (as far as I know) is more or less not a source of timing sidechannel information one way or another.

Maybe I could try banging my head against the XCrypto repository this weekend.

brandmeyer · on July 30, 2020

ARX ciphers suffer somewhat on vanilla RISC-V due to the lack of native rotations. You have to synthesize rotations from shifting and logical ops.

Prime field algorithms suffer somewhat on vanilla RISC-V due to the lack of good bignum support. For example, there is no add-with-carry, nor a convenient way to get the carry-out at a low level. You have to use the set-if-less-than instruction after an addition to separately compute the carry bit.

a1369209993 · on July 31, 2020

> For example, there is no add-with-carry, nor a convenient way to get the carry-out at a low level.

IIUC, you're supposed to break the bignum up into (say) 56-bit chunks, and use the upper bits of the register as the carry. I'm sceptical that that works as well as it should for practical bignum applications, though. (I haven't had occasion to try it out.)

microcolonel · on July 31, 2020

I think this becomes less terrible the wider your bignum is.

dependenttypes · on July 31, 2020

microcolonel was interested for an ISA extension rather than software implementation.

brandmeyer · on July 31, 2020

ISA extensions cover quite a lot of ground. One design point is to implement a very narrowly scoped accelerator that only addresses the highest latency (or lowest work-per-instruction) parts of a software implementation.

Both Chacha and Poly1305 were specifically designed for fast execution in software implementations, so IMO this approach is well-suited to those algorithms. You could easily argue that the ARMv7-M instruction "unsigned multiply accumulate accumulate long" exists solely as a bignum accelerator, for example. Perhaps fusing addition with xor is similarly worthwhile for ChaCha.

cordite · on July 31, 2020

Secure constant time primitive computation is important. Outside of symmetric encryption with AES, other primitive operations could be beneficial.

Avoiding side channels and valuing small code size can result in some really funky work arounds. For example, using a whole ECDH key exchange in a JavaCard co-processor in order to do EC point calculations. As seen in [1] slide 30.

[1]: https://www.blackhat.com/docs/us-17/thursday/us-17-Mavroudis...

BooneJS · on July 30, 2020

What’s the strategy for operating systems, compilers, and/or language runtimes to figure out what a RISC-V chip supports? It used to be one thing checking which SIMD units an x86 had, but the nature of an open source processor lends itself to a near infinite listing of /proc/cpuinfo flags.

microcolonel · on July 30, 2020

> It used to be one thing checking which SIMD units an x86 had, but the nature of an open source processor lends itself to a near infinite listing of /proc/cpuinfo flags.

I mean, x86 sets a very high bar for the sheer number of feature flags. Most ISA extensions of interest to compilers are standardized with the RISC-V Foundation †.

The mechanism is similar to other chips, and the situation so far seems to be about the same, except a couple orders of magnitude difference in the current number of flags.

† Seems it's now RISC-V International

fuoqi · on July 30, 2020

So is there a CPUID-like instruction in the base RISC-V instruction set? After a cursory search I couldn't find anything like it. If there is indeed no such instruction, it will be a real shame. Why would you not include it into a such extensible ISA?

loeg · on July 30, 2020

> So is there a CPUID-like instruction in the base RISC-V instruction set?

Yes.

> After a cursory search I couldn't find anything like it.

This was the first google result for "riscv cpuid feature instruction" for me (RISC-V Instruction Set Manual v1.7):

> The mcpuid register is an XLEN-bit read-only register containing information regarding the capabilities of the CPU implementation. This register must be readable in any implementation

fuoqi · on July 30, 2020

Thank you! I was looking at the Volume 1 only.

But IIUC it will not be accessible at user-level, so capability detection will be significantly less convenient for cross-platform library authors compared to x86.

Narishma · on July 30, 2020

Why does it need to be accessible from user-space? Applications can just ask the OS what extensions are available before using them.

jlokier · on July 31, 2020

Ew. So generic RISC-V libraries, that do things like calculations, crypto or even memcpy(), need to have OS-specific code in them?

They won't work on OSes the library author didn't know about, or hasn't supplied code for, or which didn't exist when the library was created.

There are a lot of OSes.

Most likely it will be a call to libc or equivalent, and therefore be tied to libc. For generic libraries it may be the only reason for a dependency on libc.

There are even more libc variations than OSes. In Linux terms, it is almost distro-specific.

This is effectively part of the RISC-V ABI, and it means there is no OS-independent ABI.

microcolonel · on July 31, 2020

> There are a lot of OSes.

Maybe, but right now there are only a couple OSes that run on RISC-V, where you would care to detect these things at runtime. As of now, I think FreeBSD has an AT_HWCAP, which exposes RISC-V standard extension information. (though yes, it will be some platform code, one #ifdef and a couple lines to call either getauxval or elf_aux_info).

> ...or even memcpy()

P.S. memcpy goes in libc, the same place that implements these functions, so I think probably choosing the right memcpy is not an issue.

fuoqi · on July 30, 2020

It does not need to be, but it makes things significantly easier for library authors. Asking OS is a more involved process than calling an instruction (for starters you should learn which OS your code is running on) and you may not have OS in the first place (e.g. for bare-metal programming).

roca · on July 31, 2020

If you're on bare metal, then you have access to supervisor mode.

I like the approach of hiding cpuid registers from user-space. It means, for example, the operating system can easily control what hardware features user-space will use. That's useful for debugging, for process migration, for record and replay (I maintain rr), and to "discourage" user-space from using buggy or detrimental features.

Simplifying the hardware features exposed to user-space makes evolution easier. Your OS kernel should be (needs to be) up to date, while you might have old user-space binaries that you want to keep running, so you can live with weaker compatibility guarantees for supervisor-mode-only features.

microcolonel · on July 30, 2020

Oh right, I tried looking for it and it's not in the current specification AFAIK. What there is, though, is the misa CSR, which accomplishes something similar described in 3.1.1 of the Privileged spec. The misa register is a bitfield of the supported standard extensions. Not sure but I think it's not readable in U-mode, or it's undefined.

microcolonel · on July 30, 2020

Your system's device tree describes the CPU to the kernel. The riscv,isa string contains the list of extensions in the standard format (a string of single-character standard extensions, and Z-delimited extensions).

To clarify I did not mean that this information was accessed through specialized CPU instructions directly, but that it is available through cpuinfo on Linux and such like usual.

Here's an example device tree for the HiFive Unleashed board, which describes the embedded controller as rv64imac, and the application processor cores as rv64imafdc (same as the embedded controller, but with floats and doubles): https://github.com/riscv/riscv-device-tree-doc/blob/master/e...

If your system is application-specific enough, you won't even bother with that. You'll know what extensions your SoC supports when you make your purchasing decision, and you'll know when you compile your software.

formerly_proven · on July 30, 2020

How does the OS <discover/gain access to/build/enumerate> the device tree?

Edit: Apparently the device tree is handled by the firmware (UEFI?), so that a RISC-V OS asks the firmware to provide the device tree and I suppose if it ever gets to desktops, it'd be firmware's job to figure out what CPU is plugged into the board. The current state seems to only cater to integrated systems, though?

microcolonel · on July 30, 2020

The firmware or the bootloader leaves it in memory somewhere.

Re: the added question on "integrated systems": your VM hypervisor will map a device tree for VM guests as well, this is how QEMU and other hypervisors expose virtio devices to guests as well.

formerly_proven · on July 30, 2020

Yeah, but that sort of punts the question of "how does the [firmware] code figure out what CPU is being used", unless that's backed into the firmware at build time.

cmrx64 · on July 30, 2020

yeah, it usually is, that's how those device trees work. it's generally bundled with the "bootloader" (chain loader component of the firmware). you can even bake them into your kernel if you want, i believe.

RISC-V is a processor ISA, not a platform spec. you may be interested in the SBI: https://github.com/riscv/riscv-sbi-doc/blob/v0.2.0/riscv-sbi...

the details of how all this stuff works at a human-collaboration level has changed a bit since 1993. there's a higher level of collaboration expected. vs intel and ms bossing people around.

edit: you may also be interested in https://doc.coreboot.org/arch/riscv/index.html and https://lkml.org/lkml/2020/2/25/1209 and https://lists.gnu.org/archive/html/grub-devel/2019-02/msg000...

not exactly "desktop" yet, but hopefully you can see how that might look, given the existing linux infrastructure :)

monocasa · on July 30, 2020

Yep, it's baked into the firmware somehow. Sometimes it'll actually be linked in, sometimes the firmware knows where in flash or whatever to grab it from before loading the kernel, sometimes the firmware knows how to interrogate hardware and generate a new one.

Sometimes it's a combo of all three.

fluffything · on July 31, 2020

> What’s the strategy for operating systems, compilers, and/or language runtimes to figure out what a RISC-V chip supports?

Same as for x86, Power, ARM, MIPS, systemz, sparc, ... ISAs: there is a flag register that the OS can query to test whether the hardware supports something.

I only know one ISA that works in a completely different way (WebAssembly), by requiring the user to attempt to perform an operation, and if that faults, then the operation is not supported. The reason WASM can do this is because it is a virtual ISA. This approach does not work well for real hardware because it makes instruction decoders larger and slower, since it prevents them from assuming that they will only be fed instructions they do know.

orra · on July 31, 2020

You sound knowledgeable, but this I don't understand:

> This approach does not work well for real hardware because it makes instruction decoders larger and slower, since it prevents them from assuming that they will only be fed instructions they do know.

But what if you fail to do the runtime checks? And for example just feed an SSE4 binary/instructions to a processor that only supports SSE3? Then won't it segfault?

fluffything · on July 31, 2020

Hardware undefined behavior. The hardware can do anything.

That includes trapping on an illegal instruction or something, but there is no guarantee that some CPU that was designed and built before SSE4 existed will do anything meaningful when fed with machine code that it was never intended to see.

You are basically hopping for the instruction decoder to be "really good", and not "really fast". Eg. suppose that there is only one instruction in SSE3 which has the 7th bit set. A fast instruction decoder tests that bit, and if its true, it knows exactly what instruction that is. Years later, SSE4 is designed and implemented, and it adds another instruction that also has that bit set. You are required to check whether the CPU supports SSE4, and the SSE3-only CPU will tell you that it does not. If you then go ahead and feed it a SSE4 instruction, the CPU can do really anything, including executing some other completely different instruction.

orra · on Aug 3, 2020

Interesting, but how is that not a security flaw? Sounds like user space applications can cause CPU level undefined behaviour!?

fluffything · on Aug 17, 2020

When running a user space program, instead of executing the instruction in the assembly, the CPU will at best raise an exception, and at worse attempt to execute some other instruction. If the process doesn't have privileges to execute that other instruction, then the CPU will just raise an exception (unless if it has a big hardware bug).

I don't see how this could be a security flaw.

jepler · on July 30, 2020

> The misa CSR is a WARL read-write register reporting the ISA supported by the hart [hart is a hardware thread] - from the Privileged Architecture manual

However, I didn't see how a nonstandard extension such as this would be represented; this is concerned with representing up to 26 or so extensions, the ones with single letters assigned to them.

userbinator · on July 31, 2020

Making RISC-V not so RISCy... it turns out that complex instructions are actually useful in practice, and ISAs which don't have them will eventually develop them in order to remain competitive.

simias · on July 31, 2020

I'm not sure if it's very unRISC-y. CISC is all about these sort-of-generic-but-slightly-high-level methods such as SCASB (find a byte in a string), LOOPD (loop while equal) and all the in-memory operations, prefix instructions etc... All that with a lot of redundancy in the ISA. A RISC CPU will "emulate" this instruction with a couple of more primitive instructions, favoring speed and simplicity.

Of course the line is blurry, but to me things like SIMD, floating point or AES are more like "coprocessor" extensions, it's not something that can decently be emulated using a handful of CISC opcodes. Making a quasi-religious point that CISC shouldn't support high-level instructions will doom these CPUs to irrelevance.

A CISC CPU is no less CISC if you put in on a board next to a hardware video decoder module for instance, so why should an on-die AES extension be a game changer?

The original PlayStation's CPU had a coprocessor dealing with 3D transformations (the part of the pipeline that would be handled by a vertex shader these days) so it actually supported opcodes such as "MVMVA: Multiply vector by matrix and add vector". I don't think anybody would consider that unRISC-y.

fluffything · on July 31, 2020

In fact, the RISC-V V (Vector) extension supports vector "shapes", so you can take a f32x16 vector and make it a f32x4x4 4x4 matrix, and then multiplying it with another f32x4x4 vector will do a 4x4 * 4x4 matrix multiplication, in hardware. IIUC this shapes generalize to tensors, so you can use this hardware units to, e.g., implement DNNs, similar to nvidia's tensor cores.

segfaultbuserr · on July 31, 2020

My understanding is that the primary motivation of pushing RISC-V is always about creating (or at least attempting to create) an open ISA and hopefully bringing greater transparency to CPU architectures, arguing whether it's RISC, is not an irrelevant issue to many. In a parallel universe, I would be happy to use a "CISC-V" chip if it exists.

zozbot234 · on July 31, 2020

RISC itself has nothing to do with the number of instructions. It's about the load-store principle, separating instructions that access memory wrt. those that compute on registers.

Someone · on July 31, 2020

It’s hard to define RISC, but separating load and stores from computations is a side effect of the desire to have uniform, low instruction execution times in order to simplify pipelining, not a goal of RISC.

AES would be non-RISCy because it won’t execute in a single clock cycle, either stopping the pipeline or necessitating complex logic.

Having said that, I don’t think RISC-V implementations are 100% RISC in that respect (is any CPU?)

anonymousDan · on July 30, 2020

Slightly off topic, but can anyone tell me if the RISCV ISA makes static disassembly easier than x86-64?

brandmeyer · on July 30, 2020

Some of the operand fields in the 16-bit instructions are positioned in weird places in the instructions, but otherwise the process is pretty mechanical.

TomVDB · on July 31, 2020

RISC-V instructions are either 32 or 16 bits, and the size of the instruction is determined by 2 bits in a fixed place (the LSBs of the instruction word IIRC.) Much easier than x86-64.