AES-128 Block Cipher

Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES) published in 1977. A draft criteria to evaluate potential algorithms was published, and members of the public were invited to provide feedback. The finalized criteria was published in September 1997 which outlined a minimum acceptable requirement for each submission.

4 years later in November 2001, Rijndael by Belgian Cryptographers Vincent Rijmen and Joan Daemen which we now refer to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code which executes the quickest has traditionally taken priority over how much ROM it uses. Developers will use lookup tables to accelerate each step of the encryption process, thus compact implementations are rarely if ever sought after.

Our challenge here is to implement AES in the least amount of C and more specifically x86 assembly code. It will obviously result in a slow implementation, and will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary.

AES Parameters

There are three different set of parameters available, with the main difference related to key length. Our implementation will be AES-128 which fits perfectly onto a 32-bit architecture

.

Key Length
(Nk words)
Block Size
(Nb words)
Number of Rounds
(Nr)
AES-128 4 4 10
AES-192 6 4 12
AES-256 8 4 14

Structure of AES

Two IF statements are introduced in order to perform the encryption in one loop. What isn’t included in the illustration below is ExpandRoundKey and AddRoundConstantwhich generate round keys.

The first layout here is what we normally see used when describing AES. The second introduces 2 conditional statements which makes the code more compact.

Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. At the very least, they will show you alternative ways to write some code in assembly.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;

// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
    B i,y,c;
    if(x){
      for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
      x=y;F(4)x^=y=(y<<1)|(y>>7);
    }
    return x^99;
}
void E(B *s){
    W i,w,x[8],c=1,*k=(W*)&x[4];
    
    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];
    
    for(;;){
      // 1st part of ExpandRoundKey, AddRoundKey and update state
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
      
      // 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;      
      
      // if round 11, stop else update c
      if(c==108)break;c=M(c);      
      
      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
      
      // if not round 10, MixColumns
      if(c!=108)F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

Register Description Used by
eax Accumulator lods, stos, scas, xlat, mul, div
ebx Base xlat
ecx Count loop, rep (conditional suffixes E/Z and NE/NZ)
edx Data cdq, mul, div
esi Source Index lods, movs, cmps
edi Destination Index stos, movs, scas, cmps
ebp Base Pointer enter, leave
esp Stack Pointer pushad, popad, push, pop, call, enter, leave

Those of you familiar with the x86 architecture will know certain instructions have dependencies or affect the state of other registers after execution. For example, LODSB will load a byte from memory pointer in SI to AL before incrementing SI by 1. STOSB will store a byte in AL to memory pointer in DI before incrementing DI by 1. MOVSB will move a byte from memory pointer in SI to memory pointer in DI, before adding 1 to both SI and DI. If the same instruction is preceded REP (for repeat) then this also affects the CX register, decreasing by 1.

Initialization

The s parameter points to a 32-byte buffer containing a 16-byte plain text and 16-byte master key which is copied to the local buffer x.

A copy of the data is required, because both will be modified during the encryption process. ESI will point to swhile EDI will point to x

EAX will hold Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops which require an index starting position of zero. There’s a reason to prefer EAX than other registers. Byte comparisons are only 2 bytes for AL, while 3 for others.

// 2 vs 3 bytes
  /* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
  /* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
  /* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
  /* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */

In addition to this, one operation requires saving EAX in another register, which only requires 1 byte with XCHG. Other registers would require 2 bytes

// 1 vs 2 bytes
  /* 0001 */ "\x92"                 /* xchg edx, eax        */
  /* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways requiring only 7 bytes. The alternative for setting EAX here would be : XOR EAX, EAX; INC EAX

// 7 bytes
  /* 0001 */ "\x6a\x01"             /* push 0x1             */
  /* 0003 */ "\x58"                 /* pop eax              */
  /* 0004 */ "\x6a\x04"             /* push 0x4             */
  /* 0006 */ "\x59"                 /* pop ecx              */
  /* 0007 */ "\x99"                 /* cdq                  */

Another way …

// 7 bytes
  /* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
  /* 0003 */ "\xf7\xe1"             /* mul ecx              */
  /* 0005 */ "\x40"                 /* inc eax              */
  /* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */

And another..

// 7 bytes
  /* 0000 */ "\x6a\x01"             /* push 0x1             */
  /* 0002 */ "\x58"                 /* pop eax              */
  /* 0003 */ "\x99"                 /* cdq                  */
  /* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */

ESI will point to s which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS.

Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32-bytes of stack space, and PUSHAD which saves 8 general purpose registers on the stack is exactly 32-bytes of memory, executed in 1 byte opcode.

To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are x86 opcodes generated by assembler.

// 5 bytes
  /* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
  /* 0004 */ "\xc9"             /* leave           */
  
// 6 bytes
  /* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
  /* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */
  
// 2 bytes
  /* 0000 */ "\x60"             /* pushad          */
  /* 0001 */ "\x61"             /* popad           */

Obviously the 2-byte example is better here, but once you require more than 96-bytes, usually ADD/SUB in combination with a register is the better option.

; *****************************
; void E(void *s);
; *****************************
_E:
    pushad
    xor    ecx, ecx           ; ecx = 0
    mul    ecx                ; eax = 0, edx = 0
    inc    eax                ; c = 1
    mov    cl, 4
    pushad                    ; alloca(32)
; F(8)x[i]=((W*)s)[i];
    mov    esi, [esp+64+4]    ; esi = s
    mov    edi, esp
    pushad
    add    ecx, ecx           ; copy state + master key to stack
    rep    movsd
    popad

Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

  1. EBP has no 8-bit registers, so we can’t use it for any 8-bit operations.
  2. Indirect memory access requires 1 byte more for index zero.
  3. The only instructions that use EBP are ENTER and LEAVE.
// 2 vs 3 bytes for indirect access  
  /* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
  /* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

; *****************************
; Multiplication over GF(2**8)
; *****************************
    call   $+21               ; save address      
    push   ecx                ; save ecx
    mov    cl, 4              ; 4 bytes
    add    al, al             ; al <<= 1
    jnc    $+4                ;
    xor    al, 27             ;
    ror    eax, 8             ; rotate for next byte
    loop   $-9                ; 
    pop    ecx                ; restore ecx
    ret
    pop    ebp

SubByte

In the SubBytes step, each byte a_{i,j} in the state matrix is replaced with S(a_{i,j}) using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over GF(2^8), and we can implement SubByte purely using code.

; *****************************
; B SubByte(B x)
; *****************************
sub_byte:  
    pushad 
    test   al, al            ; if(x){
    jz     sb_l6
    xchg   eax, edx
    mov    cl, -1            ; i=255 
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
    mov    al, 1             ; y=1
sb_l1:
    test   ah, ah            ; !c
    jnz    sb_l2    
    cmp    al, dl            ; y!=x
    setz   ah
    jz     sb_l0
sb_l2:
    mov    dh, al            ; y^=M(y)
    call   ebp               ;
    xor    al, dh
    loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
    mov    dl, al            ; dl=y
    mov    cl, 4             ; i=4  
sb_l5:
    rol    dl, 1             ; y=R(y,1)
    xor    al, dl            ; x^=y
    loop   sb_l5             ; i--
sb_l6:
    xor    al, 99            ; return x^99
    mov    [esp+28], al
    popad
    ret

AddRoundKey

The state matrix is combined with a subkey using the bitwise XOR operation. This step known as Key Whitening was inspired by the mathematician Ron Rivest, who in 1984 applied a similar technique to the Data Encryption Standard (DES) and called it DESX.

; *****************************
; AddRoundKey
; *****************************
; F(4)s[i]=x[i]^k[i];
    pushad
    xchg   esi, edi           ; swap x and s
xor_key:
    lodsd                     ; eax = x[i]
    xor    eax, [edi+16]      ; eax ^= k[i]
    stosd                     ; s[i] = eax
    loop   xor_key
    popad

AddRoundConstant

There are various cryptographic attacks possible against AES without this small, but important step. It protects against the Slide Attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants to generate round keys, all the round keys will be the same.

; *****************************
; AddRoundConstant
; *****************************
; *k^=c; c=M(c);
    xor    [esi+16], al
    call   ebp

ExpandRoundKey

The operation to expand the master key into subkeys for each round of encryption isn’t normally in-lined. To boost performance, these round keys are precomputed before the encryption process since you would only waste CPU cycles repeating the same computation which is unnecessary.

Compacting the AES code into a single call requires in-lining the key expansion operation. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
    pushad
    add    esi,16
    mov    eax, [esi+3*4]    ; w=k[3]
    ror    eax, 8            ; w=R(w,8)
exp_l1:
    call   S                 ; w=S(w)
    ror    eax, 8            ; w=R(w,8);
    loop   exp_l1
    mov    cl, 4
exp_l2:
    xor    [esi], eax        ; k[i]^=w
    lodsd                    ; w=k[i]
    loop   exp_l2
    popad

Combining the steps

An earlier version of the code used separate AddRoundKeyAddRoundConstant, and ExpandRoundKey, but since these steps all relate to using and updating the round key, the 3 steps are combined in order to reduce the number of loops, thus shaving off a few bytes.

; *****************************
; AddRoundKey, AddRoundConstant, ExpandRoundKey
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
    pushad
    xchg   eax, edx
    xchg   esi, edi
    mov    eax, [esi+16+12]  ; w=R(k[3],8);
    ror    eax, 8
xor_key:
    mov    ebx, [esi+16]     ; t=k[i];
    xor    [esi], ebx        ; x[i]^=t;
    movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
    call   sub_byte          ; al=S(al);
    ror    eax, 8            ; w=R(w,8);
    loop   xor_key
; w=R(w,8)^c;
    xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
    mov    cl, 4
exp_key:
    xor    [esi], eax        ; k[i]^=w;
    lodsd                    ; w=k[i];
    loop   exp_key
    popad

Shifting Rows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively.

Because it doesn’t matter about the order of SubBytes and ShiftRows, they’re combined in one loop.

; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
    pushad
    mov    cl, 16
shift_rows:
    lodsb                    ; al = S(s[i])
    call   sub_byte
    push   edx
    mov    ebx, edx          ; ebx = i%4
    and    ebx, 3            ;
    shr    edx, 2            ; (i/4 - ebx) % 4
    sub    edx, ebx          ; 
    and    edx, 3            ; 
    lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
    mov    [edi+ebx], al     ; x[ebx] = al
    pop    edx
    inc    edx
    loop   shift_rows
    popad

Mixing Columns

The MixColumns transformation along with ShiftRows are the main source of diffusion. Each column is treated as a four-term polynomial b(x)=b_{3}x^{3}+b_{2}x^{2}+b_{1}x+b_{0}, where the coefficients are elements over {GF} (2^{8}), and is then multiplied modulo x^{4}+1 with a fixed polynomial a(x)=3x^{3}+x^{2}+x+2

; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    pushad
mix_cols:
    mov    eax, [edi]        ; w0 = x[i];
    mov    ebx, eax          ; w1 = w0;
    ror    eax, 8            ; w0 = R(w0,8);
    mov    edx, eax          ; w2 = w0;
    xor    eax, ebx          ; w0^= w1;
    call   ebp               ; w0 = M(w0);
    xor    eax, edx          ; w0^= w2;
    ror    ebx, 16           ; w1 = R(w1,16);
    xor    eax, ebx          ; w0^= w1;
    ror    ebx, 8            ; w1 = R(w1,8);
    xor    eax, ebx          ; w0^= w1;
    stosd                    ; x[i] = w0;
    loop   mix_cols
    popad
    jmp    enc_main

Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

 

 

 

 

 

 

As you can see, blocks of the same data using the same key result in the exact same ciphertexts, which is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption which uses Counter (CTR) mode to provide confidentiality.

The concept of CTR mode which turns a block cipher into a stream cipher was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography.

CTR mode works by encrypting a nonce and counter, then using the ciphertext to encrypt our plain text using a simple XOR operation. Since AES encrypts 16-byte blocks, a counter can be 8-bytes, and a nonce 8-bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

// encrypt using Counter (CTR) mode
void encrypt(W len, B *ctr, B *in, B *key){
    W i,r;
    B t[32];

    // copy master key to local buffer
    F(16)t[i+16]=key[i];

    while(len){
      // copy counter+nonce to local buffer
      F(16)t[i]=ctr[i];
      
      // encrypt t
      E(t);
      
      // XOR plaintext with ciphertext
      r=len>16?16:len;
      F(r)in[i]^=t[i];
      
      // update length + position
      len-=r;in+=r;
      
      // update counter
      for(i=15;i>=0;i--)
        if(++ctr[i])break;
    }
}

In assembly

; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
    pushad
    lea    esi,[esp+32+4]
    lodsd
    xchg   eax, ecx          ; ecx = len
    lodsd
    xchg   eax, ebp          ; ebp = ctr
    lodsd
    xchg   eax, edx          ; edx = in
    lodsd
    xchg   esi, eax          ; esi = key
    pushad                   ; alloca(32)
; copy master key to local buffer
; F(16)t[i+16]=key[i];
    lea    edi, [esp+16]     ; edi = &t[16]
    movsd
    movsd
    movsd
    movsd
aes_l0:
    xor    eax, eax
    jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
    mov    edi, esp          ; edi = t
    mov    esi, ebp          ; esi = ctr
    push   edi
    movsd
    movsd
    movsd
    movsd
; encrypt t    
    call   _E                ; E(t)
    pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
    mov    bl, [edi+eax]     ; 
    xor    [edx], bl         ; *in++^=t[i];
    inc    edx               ; 
    inc    eax               ; i++
    cmp    al, 16            ;
    loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
    xchg   eax, ecx          ; 
    mov    cl, 16
aes_l2:
    inc    byte[ebp+ecx-1]   ;
    loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
    xchg   eax, ecx
    jmp    aes_l0
aes_l3:
    popad
    popad
    ret

Summary

The final assembly code for ECB mode is 205 bytes, and 272 for CTR mode.

Check sources here.