# A Lightweight ISA Extension for AES and SM4

Markku-Juhani O. Saarinen
mjos@pqshield.com
PQShield Ltd., Oxford, United Kingdom

August 23, 2020

First International Workshop on Secure RISC-V Architecture Design Exploration (SECRISC-V'20).

# **Talk Outline**

- Introduction: AES, SM4, and Crypto TG
- SAES32 and SSM4 Instruction Set Extension
- Implementation and Analysis
- Conclusions

# **AES: Advanced Encryption Standard**

[Figure: first page of FIPS PUB 197, "Announcing the Advanced Encryption Standard (AES)", November 26, 2001.]

- → Specified in FIPS PUB 197 and in international standards.
- → 128-bit block size, 128 / 192 / 256-bit secret key.
- → Single 8 × 8-bit S-Box, Substitution-Permutation Network.
- → Rijndael by Joan Daemen and Vincent Rijmen (1998). Clear, open, well-understood design methodology.
- → **Very common.** Hardware support saves energy in comms (TLS, IPSec, WiFi), storage (XTS), etc.
- → ARMv8.0-CE (SIMD) and Intel AES-NI (SIMD) ISAs.
- → Embedded systems often have memory-mapped AES engines. No real standard for those; vendor-specific drivers.

# **SM4: The Chinese Standard Block Cipher**

- → Specified in GM/T 0002-2012, GB/T 32907-2016, internationally ISO/IEC 18033-3:2010/DAmd 2.
- → 128-bit block size, 128-bit secret key (one key size).
- → Single 8 × 8-bit S-Box, Generalized Feistel Structure.
- → Credited to Lu Shuwang (吕述望) et al., early 2000s. Design methodology and criteria are difficult to find.
- → Important to RISC-V International (due to export etc.).
- → ARMv8.2-SM (SIMD) ISA has SM4 support; Intel has none.
- → SM4 has regulatory preference over AES in PR China.

# **Crypto TG: The RISC-V Crypto Spec**

Crypto Spec v0.6.2, August 13, 2020

- → The RISC-V Cryptographic Extensions Task Group (Crypto TG) has been operating since 2017.
- → In late 2019 the scope was extended from Vector (RVV, SIMD-style) AES to "Scalar" RV32 and RV64.
- → I proposed the present work (as ENC1S) in Feb 2020. It was evaluated and adopted as the preferred option for RV32 some months later (as SAES32 & SSM4).
- → Evaluation (as AES "v3"): B. Marshall, G. R. Newell, D. Page, M.-J. O. Saarinen, and C. Wolf: "The design of scalar AES Instruction Set Extensions for RISC-V." https://eprint.iacr.org/2020/930
- → The crypto spec is going to freeze soon; you can find it at: https://github.com/riscv/riscv-crypto

# **AES Steps**

- → AES has $\{10, 12, 14\}$ rounds for $\{128, 192, 256\}$-bit keys, respectively.
- → Rounds are made of: AddRoundKey, SubBytes, ShiftRows, MixColumns.
- → Contrary to Feistel ciphers like SM4, inverting a Substitution-Permutation Network (SPN) like AES requires inverting each step (inverse SB, SR, MC).

# **AES Steps: T-table**

- → ShiftRows just shuffles bytes and SubBytes operates on individual bytes.
- → SubBytes and MixColumns can be combined into 8 → 32 bit "T-table" lookups.
- → MixColumns is a 4 × 4 byte matrix multiplication defined in GF(2<sup>8</sup>); it's **linear!**

#### The original 1998 Rijndael Reference code targeted 32-bit systems of the day:

```
s0 = Te0[(t0 >> 24)       ] ^ Te1[(t1 >> 16) & 0xff] ^
     Te2[(t2 >>  8) & 0xff] ^ Te3[(t3      ) & 0xff] ^ rk[0];
s1 = Te0[(t1 >> 24)       ] ^ Te1[(t2 >> 16) & 0xff] ^
     Te2[(t3 >>  8) & 0xff] ^ Te3[(t0      ) & 0xff] ^ rk[1];
s2 = Te0[(t2 >> 24)       ] ^ Te1[(t3 >> 16) & 0xff] ^
     Te2[(t0 >>  8) & 0xff] ^ Te3[(t1      ) & 0xff] ^ rk[2];
s3 = Te0[(t3 >> 24)       ] ^ Te1[(t0 >> 16) & 0xff] ^
     Te2[(t1 >>  8) & 0xff] ^ Te3[(t2      ) & 0xff] ^ rk[3];
```

- → For a decade, all AES implementations looked ≈ like this.
- → 4 input bytes × 256 S-Box entries × 32 bits = 4 kB.
- → Another 4 kB for decryption, possibly 1 kB for last rounds.
- → Serious cache timing attacks emerged after the mid-2000s (Bernstein, Osvik, et al. [2,12]). They can be exploited remotely.
- → Need to replace table lookups with straight-line logic.
- → On RV32 targets such bit-sliced implementations are 2.5× slower than table-based ones (Stoffelen [15]).

ARM32: Google (Android, Chrome) tries to negotiate ChaCha20 for TLS instead of AES on systems that do not have AES instructions. Secure AES is just too slow.

## Old school: Hand-optimized RV32I T-Table AES

https://github.com/Ko-/riscvcrypto/blob/master/aes128tables/aes\_asm.S

```
andi \T0, \X0, 0xff
andi \T1, \X1, 0xff
andi \T2, \X2, 0xff
andi \T3, \X3, 0xff
slli \T0, \T0, 4
slli \T1, \T1, 4
slli \T2, \T2, 4
slli \T3, \T3, 4
add  \T4, \T0, \LUT1
lw   \T0, (\T4)
add  \T4, \T1, \LUT1
lw   \T1, (\T4)
add  \T4, \T2, \LUT1
lw   \T2, (\T4)
add  \T4, \T3, \LUT1
lw   \T3, (\T4)
xor  \Y0, \Y0, \T0
xor  \Y1, \Y1, \T1
xor  \Y2, \Y2, \T2
xor  \Y3, \Y3, \T3
```

```
srli \X0, \X0, 4
srli \X1, \X1, 4
srli \X2, \X2, 4
srli \X3, \X3, 4
and  \T0, \X1, \C
and  \T1, \X2, \C
and  \T2, \X3, \C
and  \T3, \X0, \C
add  \T4, \T0, \LUT3
lw   \T0, (\T4)
add  \T4, \T1, \LUT3
lw   \T1, (\T4)
add  \T4, \T2, \LUT3
lw   \T2, (\T4)
add  \T4, \T3, \LUT3
lw   \T3, (\T4)
xor  \Y0, \Y0, \T0
xor  \Y1, \Y1, \T1
xor  \Y2, \Y2, \T2
xor  \Y3, \Y3, \T3
```

```
srli \X0, \X0, 8
srli \X1, \X1, 8
srli \X2, \X2, 8
srli \X3, \X3, 8
and  \T0, \X2, \C
and  \T1, \X3, \C
and  \T2, \X0, \C
and  \T3, \X1, \C
add  \T4, \T0, \LUT0
lw   \T0, (\T4)
add  \T4, \T1, \LUT0
lw   \T1, (\T4)
add  \T4, \T2, \LUT0
lw   \T2, (\T4)
add  \T4, \T3, \LUT0
lw   \T3, (\T4)
xor  \Y0, \Y0, \T0
xor  \Y1, \Y1, \T1
xor  \Y2, \Y2, \T2
xor  \Y3, \Y3, \T3
```

```
srli \X0, \X0, 8
srli \X1, \X1, 8
srli \X2, \X2, 8
srli \X3, \X3, 8
and  \T0, \X3, \C
and  \T1, \X0, \C
and  \T2, \X1, \C
and  \T3, \X2, \C
add  \T4, \T0, \LUT2
lw   \T0, (\T4)
add  \T4, \T1, \LUT2
lw   \T1, (\T4)
add  \T4, \T2, \LUT2
lw   \T2, (\T4)
add  \T4, \T3, \LUT2
lw   \T3, (\T4)
xor  \Y0, \Y0, \T0
xor  \Y1, \Y1, \T1
xor  \Y2, \Y2, \T2
xor  \Y3, \Y3, \T3
```

$4 \times 20 = 80$ instructions (+ key fetch) per round.
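The T-table round described on the preceding slides can be modelled in a few lines of Python. This is an illustrative sketch only, not the code from the linked repository: it derives the S-box and the four Te tables from their definitions, assumes the big-endian word convention of the Rijndael reference code, and the helper names (`gf_mul`, `build_sbox`, `ttable_round`) are made up for this example.

```python
def gf_mul(a, b):
    """Multiply a*b in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox():
    """AES S-box: inversion in GF(2^8) followed by the affine output map."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):            # x^254 = x^-1; 0 maps to 0
            inv = gf_mul(inv, x)
        s = inv
        for k in range(1, 5):           # inv ^ rotl8(inv, 1..4)
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def ror32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


SBOX = build_sbox()

# Te0[x] packs one MixColumns matrix column times S(x): (2s, s, s, 3s).
TE0 = []
for s in SBOX:
    s2 = gf_mul(s, 2)
    TE0.append((s2 << 24) | (s << 16) | (s << 8) | (s2 ^ s))

# Te1..Te3 are byte rotations of Te0: 4 tables x 256 entries x 4 bytes = 4 kB.
TE = [TE0] + [[ror32(w, 8 * i) for w in TE0] for i in (1, 2, 3)]


def ttable_round(t, rk):
    """One main AES round on four big-endian column words t[0..3].

    The rotating column index implements ShiftRows, the table lookups
    implement SubBytes and MixColumns, and the XOR with rk[i] is AddRoundKey.
    """
    return [TE[0][t[i] >> 24]
            ^ TE[1][(t[(i + 1) % 4] >> 16) & 0xFF]
            ^ TE[2][(t[(i + 2) % 4] >> 8) & 0xFF]
            ^ TE[3][t[(i + 3) % 4] & 0xFF]
            ^ rk[i]
            for i in range(4)]


if __name__ == "__main__":
    # Round 1 of the FIPS 197 Appendix B example (state and round key 1).
    state = [0x193DE3BE, 0xA0F4E22B, 0x9AC68D2A, 0xE9F84808]
    rk1 = [0xA0FAFE17, 0x88542CB1, 0x23A33939, 0x2A6C7605]
    print(" ".join(f"{w:08x}" for w in ttable_round(state, rk1)))
```

Each of the 16 lookups in `ttable_round` indexes memory with a secret-dependent byte. That data-dependent access pattern is what cache-timing attacks observe, and it is what a dedicated instruction avoids.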
# Approach: Roll T-Table Operations into a Single Instruction

#### SAES32 & SSM4: Scalar RV32 AES, SM4

```
saes32.encsm rd, rs1, rs2, bs
saes32.encs  rd, rs1, rs2, bs
saes32.decsm rd, rs1, rs2, bs
saes32.decs  rd, rs1, rs2, bs
ssm4.ed      rd, rs1, rs2, bs
ssm4.ks      rd, rs1, rs2, bs
```

- → encs and decs lack MixColumns. Used for the final round and the key schedule.
- → R-type, immediate bs $\in \{0, 1, 2, 3\}$: $4 \times 6 = 24$ code points total.
- → The same logic also supports SM4.

## **SAES32 AES**

https://github.com/mjosaarinen/lwaes\_isa/blob/master/asm/saes32\_enc.S

```
saes32.encsm t4, t4, t0, 0
saes32.encsm t4, t4, t1, 1
saes32.encsm t4, t4, t2, 2
saes32.encsm t4, t4, t3, 3
saes32.encsm t5, t5, t1, 0
saes32.encsm t5, t5, t2, 1
saes32.encsm t5, t5, t3, 2
saes32.encsm t5, t5, t0, 3
saes32.encsm t6, t6, t2, 0
saes32.encsm t6, t6, t3, 1
saes32.encsm t6, t6, t0, 2
saes32.encsm t6, t6, t1, 3
saes32.encsm a7, a7, t3, 0
saes32.encsm a7, a7, t0, 1
saes32.encsm a7, a7, t1, 2
saes32.encsm a7, a7, t2, 3
```

$4 \times 4 = 16$ instructions (+ key fetch) per round.

- → From 80 instructions to 16× saes32.encsm for main rounds, and 16× saes32.encs for the final round.
- → Same for decryption, with saes32.decs[m].
- → No table lookups, which often require multiple cycles. Timing-attack secure.
- → Key schedule uses the same instructions.
- → It would be possible to reduce the instruction count to 12 by having four parallel S-boxes, but that has > 3× the implementation size and uses more energy.
- ⇒ ≈ 5× faster than table-based (insecure) code, > 10× faster than constant-time code.

## SSM4.ED and SSM4.KS

https://github.com/mjosaarinen/lwaes\_isa/blob/master/asm/sm4\_encdec.S

- → SM4 is a Generalized Feistel cipher; unlike AES, encryption and decryption are the same operation (with the 32 expanded key words reversed). There is a separate key schedule instruction.
- → A 4× S-Box would probably give a bigger speed-up for SM4 than for AES.
- → The linear transformation in SM4 is based on rotations.
  In RISC-V, rotation instructions are part of the RV32B Bitmanip extension; they are not needed here.
- → Depending on the availability of rotations, the speedup is similar to or much better than that of AES, without much additional area and with greatly reduced energy.
- → Importantly, the instruction makes SM4 constant-time too.

## Is this new? Prior Art

No mainstream "pure scalar" 32-bit ISA currently has AES or SM4 instructions.

Similar custom instructions for "T-table" style AES have been discussed in:

[NIK04] K. Nadehara, M. Ikekawa, and I. Kuroda. "Extended instructions for the AES cryptography and their efficient implementation." 2004 IEEE Workshop on Signal Processing Systems (SIPS), pp. 152–157, 2004. DOI:10.1109/SIPS.2004.1363041.

[BBFR06] G. Bertoni, L. Breveglieri, R. Farina, and F. Regazzoni. "Speeding up AES by extending a 32-bit processor instruction set." IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP'06), pp. 275–282, 2006. DOI:10.1109/ASAP.2006.62.

However, these proposals did not roll the AddRoundKey operation into the same instruction, and apparently need 20 instructions per round rather than 16 (plus key schedule).

# **Custom-0 Encoding (Temporary)**

| R-Type: | [31:30] | [29:25] | [24:20] | [19:15] | [14:12] | [11:7] | [6:0]   |
|---------|---------|---------|---------|---------|---------|--------|---------|
|         | 00      | fn      | rs2     | rs1     | 000     | rd     | 0001011 |

| Instruction  | fn[4:2] | Description or Use                       |
|--------------|---------|------------------------------------------|
| saes32.encsm | 3'b000  | AES Encrypt round.                       |
| saes32.encs  | 3'b001  | AES Final / Key sched.                   |
| saes32.decsm | 3'b010  | AES Decrypt round.                       |
| saes32.decs  | 3'b011  | AES Decrypt final.                       |
| ssm4.ed      | 3'b100  | SM4 Encrypt and Decrypt.                 |
| ssm4.ks      | 3'b101  | SM4 Key Schedule.                        |
| Unused       | 3'b11x  | ($4 \times 6 = 24$ code points used.)    |

# **Hardware Implementation**

https://github.com/mjosaarinen/lwaes\_isa/blob/master/hdl/saes32.v

The original "reference implementation" is purely combinational logic, written in Verilog.

```
module saes32(
    output [31:0] rd,   // not a reg
    input  [31:0] rs1,  // rs1 wire
    input  [31:0] rs2,  // rs2 wire
    input  [4:0]  fn    // 5-bit func
);
```

Obtained 100 MHz timing signoff on Artix-7 (old!) when inserted into the 1-cycle decoding pipeline of the Pluto RV32 core.

# S-Boxes: Boyar-Peralta for SM4 too

https://github.com/mjosaarinen/lwaes\_isa/blob/master/hdl/sboxes.v

#### AES and SM4 S-Boxes are not "random":

- → Both are "Nyberg S-Boxes" [11] built from inversion $x^{-1}$ in GF(2<sup>8</sup>) and linear (XOR) input/output layers.
- → AES, AES<sup>-1</sup>, and SM4 S-Boxes are "affine equivalent".
- → We extended the Boyar-Peralta [4] low-depth AES S-Box to the SM4 case by creating linear outer layers for it.
- → The non-linear middle layer could be muxed and shared; probably not worth it due to the added latency and small size gain.

# S-Boxes: Algebraic gate counts

https://github.com/mjosaarinen/lwaes\_isa/blob/master/hdl/sboxes.v

#### Low-depth S-Boxes that implement AES, AES<sup>-1</sup>, SM4.

| Component         | In → Out | XOR | XNOR | AND | Total |
|-------------------|----------|-----|------|-----|-------|
| Shared middle     | 21 → 18  | 30  | -    | 34  | 64    |
| AES top           | 8 → 21   | 26  | -    | -   | 26    |
| AES bottom        | 18 → 8   | 34  | 4    | -   | 38    |
| AES<sup>-1</sup> top    | 8 → 21   | 16  | 10   | -   | 26    |
| AES<sup>-1</sup> bottom | 18 → 8   | 37  | -    | -   | 37    |
| SM4 top           | 8 → 21   | 18  | 9    | -   | 27    |
| SM4 bottom        | 18 → 8   | 33  | 5    | -   | 38    |

- → Each S-Box (middle + top + bottom) totals ≈ 128 gates, using only XOR/XNOR and AND.
- → Usually better than synthesis from a table.

# **FPGA Area Example**

## RV32 SoC area with and without SAES32 (AES, AES<sup>-1</sup>, SM4).
| Resource   | Base  | **SAES32** (Δ) | EXTAES (Δ)     |
|------------|-------|----------------|----------------|
| Logic LUTs | 7,767 | 8,202 (+435)   | 9,795 (+2,028) |
| Slice regs | 3,319 | 3,342 (+23)    | 4,361 (+1,042) |
| SLICEL     | 1,571 | 1,864 (+293)   | 2,068 (+497)   |
| SLICEM     | 734   | 737 (+3)       | 851 (+117)     |

- → EXTAES is a CPU-external memory-mapped AES-only module (for comparison).
- → "Pluto" core on an Artix-7 FPGA. Area grows by ≈ 5% for this simple core.

## **CMOS Area Estimates**

#### Yosys Simple CMOS Flow area estimates for SAES32 & SSM4

| Target           | GE (NAND2) | Transistors | LTP (longest topological path) |
|------------------|------------|-------------|--------------------------------|
| AES Encrypt only | 642        | 2,568       | 25                             |
| SM4 Full         | 767        | 3,066       | 25                             |
| AES Full         | 1,240      | 4,960       | 28                             |
| AES + SM4 Full   | 1,679      | 6,714       | 28                             |

- → It's very small. AES is a "lightweight cipher" for embedded RISC-V MCUs!
- → Not all applications need both SM4 and AES, or the AES inverse (e.g. CTR, SIV).

## **Conclusions**

- → Crypto TG is proposing three kinds of AES extensions: RV32, RV64, and RVV.
- → **SAES32**: The scalar RV32 AES ISE is *very* lightweight, with only one S-Box. Designed primarily for timing-attack security, latency, and energy savings.
- → Straightforward design rolling five T-table operations into one instruction: byte select, S-Box, MixColumns, rotate, final XOR (MC / AddRoundKey).
- → **SM4**: The same "architecture" optionally also supports the Chinese Standard.
- → Enables interoperable middleware for security functions in tiny RISC-V MCUs.

Thank You!
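The five operations listed in the Conclusions (byte select, S-Box, MixColumns, rotate, final XOR) can be sketched as a behavioural model in Python. This is a hedged illustration, not the Verilog reference: it assumes little-endian state words (byte 0 in the low 8 bits) and the byte ordering used in the public riscv-crypto drafts, and the function names (`saes32_encsm`, `saes32_encs`, `enc_round`) are hypothetical.

```python
def gf_mul(a, b):
    """Multiply a*b in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox():
    """AES S-box: inversion in GF(2^8) followed by the affine output map."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):            # x^254 = x^-1; 0 maps to 0
            inv = gf_mul(inv, x)
        s = inv
        for k in range(1, 5):           # inv ^ rotl8(inv, 1..4)
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def rol32(x, n):
    """Rotate a 32-bit word left by n bits (n = 0 leaves x unchanged)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def saes32_encsm(rs1, rs2, bs):
    """Assumed semantics: select byte bs of rs2, apply the S-Box, expand it
    to one MixColumns column, rotate into position, and XOR into rs1."""
    s = SBOX[(rs2 >> (8 * bs)) & 0xFF]          # byte select + S-Box
    s2 = gf_mul(s, 2)
    mixed = ((s2 ^ s) << 24) | (s << 16) | (s << 8) | s2   # (2s, s, s, 3s)
    return rs1 ^ rol32(mixed, 8 * bs)           # rotate + final XOR


def saes32_encs(rs1, rs2, bs):
    """Same as saes32_encsm but without MixColumns (final round, key sched)."""
    s = SBOX[(rs2 >> (8 * bs)) & 0xFF]
    return rs1 ^ rol32(s, 8 * bs)


def enc_round(t, rk):
    """One main round: 16 saes32_encsm steps, with the accumulators
    preloaded with the round key, in the order of the assembly listing."""
    out = list(rk)
    for col in range(4):
        for bs in range(4):
            out[col] = saes32_encsm(out[col], t[(col + bs) % 4], bs)
    return out


if __name__ == "__main__":
    # Round 1 of the FIPS 197 Appendix B example, as little-endian words.
    state = [0xBEE33D19, 0x2BE2F4A0, 0x2A8DC69A, 0x0848F8E9]
    rk1 = [0x17FEFAA0, 0xB12C5488, 0x3939A323, 0x05766C2A]
    print(" ".join(f"{w:08x}" for w in enc_round(state, rk1)))
```

Because the accumulator starts out holding the round-key word, AddRoundKey costs no extra instruction, which is how the round fits in 16 instructions plus key fetch.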