Introduction
Assembly language is a low-level programming language used on niche platforms such as IoT devices, device drivers, and embedded systems. Usually, it’s the sort of language that Computer Science students cover in their coursework and rarely use in their future jobs. According to the TIOBE Programming Community Index, assembly language has enjoyed a steady rise in the rankings of the most popular programming languages recently.
In the early days, when an application was written in assembly language, it had to fit in a small amount of memory and run as efficiently as possible on slow processors. Now that memory is plentiful and processor speed has dramatically increased, we mainly rely on high-level languages with ready-made structures and libraries for development. If necessary, assembly language can be used to optimize critical sections for speed or to directly access non-portable hardware. Today assembly language still plays an important role in embedded system design, where performance efficiency is still considered an important requirement.
In this article, we’ll talk about some basic criteria and coding skills specific to assembly language programming, with emphasis on execution speed and memory consumption. I’ll analyze some examples related to the concepts of register, memory, and stack, operators and constants, loops and procedures, system calls, etc. For simplicity, all samples are in 32-bit, but most ideas can easily be applied to 64-bit.
All the materials presented here came from my teaching [1] over the years. Thus, to read this article, a general understanding of Intel x86-64 assembly language is necessary, and familiarity with Visual Studio 2010 or above is assumed. Preferably, you have read Kip Irvine’s textbook [2] and the MASM Programmer’s Guide [3]. If you are taking an Assembly Language Programming class, this could be supplemental reading for your studies.
About instruction
The first two rules are general. If you can use less, don’t use more.
1. Using fewer instructions
Suppose that we have a 32-bit DWORD variable:
.data
var1 DWORD 123
The example is to add var1 to EAX. This is correct with MOV and ADD:
mov ebx, var1
add eax, ebx
But as ADD can accept one memory operand, you can just
add eax, var1
2. Using an instruction with fewer bytes
Suppose that we have an array:
.data
array DWORD 1,2,3
If you want to rearrange the values to be 3,1,2, you could
mov eax,array
xchg eax,[array+4]
xchg eax,[array+8]
xchg array,eax
But notice that the last instruction should be MOV instead of XCHG. Although both can assign 3 in EAX to the first array element, the other way around in the exchange XCHG is logically unnecessary.
Be aware of code size: MOV takes 5 bytes of machine code but XCHG takes 6, which is another reason to choose MOV here:
00000011 87 05 00000000 R xchg array,eax
00000017 A3 00000000 R mov array,eax
To check machine code, you can generate a listing file when assembling or open the Disassembly window at runtime in Visual Studio. You can also look it up in the Intel instruction manual.
About register and memory
In this section, we’ll use a popular example, the nth Fibonacci number, to illustrate multiple solutions in assembly language. The C function would be like:
unsigned int Fibonacci(unsigned int n)
{
    unsigned int previous = 1, current = 1, next = 0;
    for (unsigned int i = 3; i <= n; ++i)
    {
        next = current + previous;
        previous = current;
        current = next;
    }
    return current;  /* also correct for n = 1 and n = 2, where the loop never runs */
}
3. Implementing with memory variables
At first, let’s copy the same idea from above with two variables previous and current created here
.data
previous DWORD ?
current DWORD ?
We can use EAX to store the result without the next variable. Since MOV cannot move from memory to memory, a register like EDX must be involved for the assignment previous = current. The following is the procedure FibonacciByMemory. It receives n from ECX and returns EAX as the nth Fibonacci number calculated:
FibonacciByMemory PROC
mov eax,1
mov previous,0
mov current,0
L1:
add eax,previous
mov edx, current
mov previous, edx
mov current, eax
loop L1
ret
FibonacciByMemory ENDP
4. If you can use registers, don’t use memory
A basic rule in assembly language programming is that if you can use a register, don’t use a variable. A register operation is much faster than a memory operation. The general purpose registers available in 32-bit are EAX, EBX, ECX, EDX, ESI, and EDI. Don’t touch ESP and EBP, which are for system use.
Now let EBX replace the previous variable and EDX replace current. The following is FibonacciByRegMOV, with only three instructions needed in the loop:
FibonacciByRegMOV PROC
mov eax,1
xor ebx,ebx
xor edx,edx
L1:
add eax,ebx
mov ebx,edx
mov edx,eax
loop L1
ret
FibonacciByRegMOV ENDP
A further simplified version is to make use of XCHG, which steps up the sequence without need of EDX. The following shows the FibonacciByRegXCHG machine code in its listing, where the loop body has only two instructions, three bytes of machine code in total:
000000DF FibonacciByRegXCHG PROC
000000DF 33 C0 xor eax,eax
000000E1 BB 00000001 mov ebx,1
000000E6 L1:
000000E6 93 xchg eax,ebx
000000E7 03 C3 add eax,ebx
000000E9 E2 FB loop L1
000000EB C3 ret
000000EC FibonacciByRegXCHG ENDP
In concurrent programming
The x86-64 instruction set provides many atomic instructions. A single instruction cannot be interrupted partway through, so on a uniprocessor the currently running process cannot be context switched in the middle of it. On a multiprocessor, a read-modify-write instruction with a memory operand also needs the LOCK prefix to be atomic, while XCHG with a memory operand is locked implicitly. In some ways, this helps avoid race conditions in multitasking. These instructions can be directly used by compiler and operating system writers.
5. Using atomic instructions
The XCHG instruction used above, the so-called atomic swap, is more powerful than its counterpart in some high-level languages, doing the job with just one statement:
xchg eax, var1
A classical way to swap a register with a memory var1 could be
mov ebx, eax
mov eax, var1
mov var1, ebx
Moreover, if you use the Intel486 instruction set with the .486 directive or above, simply using the atomic XADD is more concise in the Fibonacci procedure. XADD exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand. Thus we have
000000EC FibonacciByRegXADD PROC
000000EC 33 C0 xor eax,eax
000000EE BB 00000001 mov ebx,1
000000F3 L1:
000000F3 0F C1 D8 xadd eax,ebx
000000F6 E2 FB loop L1
000000F8 C3 ret
000000F9 FibonacciByRegXADD ENDP
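To check that the XCHG/ADD loop and the XADD loop step through the same sequence, the following C model may help. It is only a sketch: the function names fib_xchg_add and fib_xadd are hypothetical, and the C variables eax and ebx merely stand in for the registers, with n playing the role of ECX.

```c
#include <assert.h>
#include <stdint.h>

/* Models FibonacciByRegXCHG; n stands in for ECX and must be at least 1. */
uint32_t fib_xchg_add(uint32_t n)
{
    assert(n >= 1);               /* LOOP with ECX = 0 would run 2^32 times */
    uint32_t eax = 0, ebx = 1;
    while (n-- > 0) {
        uint32_t tmp = eax;       /* xchg eax,ebx */
        eax = ebx;
        ebx = tmp;
        eax += ebx;               /* add eax,ebx */
    }
    return eax;
}

/* Models FibonacciByRegXADD: the old destination goes to the source,
   and the sum goes to the destination. */
uint32_t fib_xadd(uint32_t n)
{
    assert(n >= 1);
    uint32_t eax = 0, ebx = 1;
    while (n-- > 0) {
        uint32_t sum = eax + ebx; /* xadd eax,ebx */
        ebx = eax;
        eax = sum;
    }
    return eax;
}
```

Both functions return 55 for n = 10, matching the C Fibonacci function shown earlier.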
Two atomic move extensions are MOVZX and MOVSX. Also worth mentioning are the bit test instructions, BT, BTC, BTR, and BTS. For the following example
.data
Semaphore WORD 10001000b
.code
btc Semaphore, 6
Imagine the instruction set without BTC. One non-atomic implementation for the same logic would be
mov ax, Semaphore
shr ax, 7
xor Semaphore,01000000b
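What BTC does to its operand and to the carry flag can be modeled in C as below. This is a sketch under assumptions: btc16 is a hypothetical helper name, the return value stands in for CF, and the C code is not atomic, which is exactly what the single instruction offers over it.

```c
#include <assert.h>
#include <stdint.h>

/* Models BTC on a 16-bit operand: returns the old bit (what CF would hold)
   and complements that bit in place. */
int btc16(uint16_t *word, unsigned bit)
{
    assert(bit < 16);
    int carry = (*word >> bit) & 1u;   /* CF = selected bit */
    *word ^= (uint16_t)(1u << bit);    /* toggle the selected bit */
    return carry;
}
```

Starting from 10001000b, toggling bit 6 gives 11001000b with a carry of 0; toggling it again restores the original value with a carry of 1.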
Little-endian
An x86 processor stores and retrieves data from memory using little-endian order (low to high). The least significant byte is stored at the first memory address allocated for the data. The remaining bytes are stored in the next consecutive memory positions.
6. Memory representations
Consider the following data definitions:
.data
dw1 DWORD 12345678h
dw2 DWORD 'AB', '123', 123h
by3 BYTE 'ABCDE', 0FFh, 'A', 0Dh, 0Ah, 0
w1 WORD 123h, 'AB', 'A'
For simplicity, the hexadecimal constants are used as initializer. The memory representation is as follows:

As for multiple-byte DWORD and WORD data, they are represented in little-endian order. Based on this, the second DWORD, initialized with 'AB', should be 00004142h, and the next, '123', is 00313233h, with the characters in their original order. You can’t initialize a DWORD as 'ABCDE', which contains five bytes, while you really can initialize by3 with it in a byte memory, since there is no little-endian for byte data. Similarly, see w1 for a WORD memory.
7. A code error hidden by little-endian
From the last section of using XADD, we try to fill in a byte array with the first 7 Fibonacci numbers, as 01, 01, 02, 03, 05, 08, 0D. The following is such a simple implementation but with a bug. The bug does not show up as an error immediately because it has been hidden by little-endian.
FibCount = 7
.data
FibArray BYTE FibCount DUP(0ffh)
BYTE 'ABCDEF'
.code
mov edi, OFFSET FibArray
mov eax,1
xor ebx,ebx
mov ecx, FibCount
L1:
mov [edi], eax
xadd eax, ebx
inc edi
loop L1
To debug, I purposely make a memory 'ABCDEF' at the end of the byte array FibArray with seven 0ffh initialized. The initial memory looks like this:

Let’s set a breakpoint in the loop. When the first number 01 is filled, it is followed by three zeros as this:

But OK, the second number 01 comes to fill the second byte, overwriting the three zeros left by the first. So on and so forth, until the seventh, 0D, which just fits the last byte here:

All is fine with an expected result in FibArray because of little-endian. Only when you define some memory immediately after this FibArray will your first three bytes be overwritten by zeros, as here 'ABCDEF' becomes 'DEF' preceded by three zero bytes. How to make an easy fix?
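The overwrite can be reproduced outside the debugger with a small C model. This is a sketch, not the article’s own code: fill_fib and store_dword_le are hypothetical names, the buffer stands in for FibArray plus the 'ABCDEF' bytes, and storing only the low byte (like mov [edi], al) is just one possible answer to the question above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIB_COUNT 7

/* Stores v at p as four bytes in little-endian order, like "mov [edi], eax". */
static void store_dword_le(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Fills mem[0..6] with Fibonacci numbers; mem must hold FIB_COUNT + 6 bytes.
   byte_store = 0 reproduces the buggy DWORD store, 1 stores one byte only. */
void fill_fib(uint8_t *mem, int byte_store)
{
    assert(mem != NULL);
    uint32_t eax = 1, ebx = 0;
    memset(mem, 0xFF, FIB_COUNT);           /* FibArray BYTE FibCount DUP(0ffh) */
    memcpy(mem + FIB_COUNT, "ABCDEF", 6);   /* the bytes defined right after it */
    for (int i = 0; i < FIB_COUNT; ++i) {
        if (byte_store)
            mem[i] = (uint8_t)eax;          /* like "mov [edi], al" */
        else
            store_dword_le(mem + i, eax);   /* like "mov [edi], eax" */
        uint32_t sum = eax + ebx;           /* xadd eax, ebx */
        ebx = eax;
        eax = sum;
    }
}
```

With byte_store = 0 the seven numbers are correct but the three bytes after them become zero; with byte_store = 1 the trailing 'ABCDEF' survives intact.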
About runtime stack
The runtime stack is a memory array directly managed by the CPU, with the stack pointer register ESP holding a 32-bit offset on the stack. ESP is modified by instructions such as CALL, RET, PUSH, and POP. When you use PUSH and POP or the like, you explicitly change the stack contents. You should be very cautious not to affect other implicit uses, like CALL and RET, because you, the programmer, and the system share the same runtime stack.
8. Assignment with PUSH and POP is not efficient
In assembly code, you definitely can make use of the stack to do the assignment previous = current, as in FibonacciByMemory. The following is FibonacciByStack, where the only difference is using PUSH and POP instead of two MOV instructions with EDX.
FibonacciByStack PROC
mov eax,1
mov previous,0
mov current,0
L1:
add eax,previous
push current
pop previous
mov current, eax
loop L1
ret
FibonacciByStack ENDP
As you can imagine, the runtime stack built on memory is much slower than registers. If you create a test benchmark to compare the above procedures in a long loop, you’ll find that FibonacciByStack is the most inefficient. My suggestion is that if you can use a register or memory, don’t use PUSH and POP.
9. Using INC to avoid PUSHFD and POPFD
When you use the instruction ADC or SBB to add or subtract an integer with the previous carry, you reasonably want to preserve the previous carry flag (CF) with PUSHFD and POPFD, since an address update with ADD will overwrite the CF. The following Extended_Add example, borrowed from the textbook [2], calculates the sum of two extended long integers BYTE by BYTE:
Extended_Add PROC
clc
L1:
mov al,[esi]
adc al,[edi]
pushfd
mov [ebx],al
add esi, 1
add edi, 1
add ebx, 1
popfd
loop L1
mov dword ptr [ebx],0
adc dword ptr [ebx],0
ret
Extended_Add ENDP
As we know, the INC instruction makes an increment by 1 without affecting the CF. Obviously we can replace the above ADD with INC to avoid PUSHFD and POPFD. Thus the loop is simplified like this:
L1:
mov al,[esi]
adc al,[edi]
mov [ebx],al
inc esi
inc edi
inc ebx
loop L1
Now you might ask what to do when calculating the sum of two long integers DWORD by DWORD, where each iteration must update the addresses by 4 bytes, as TYPE DWORD. We can still make use of INC to have such an implementation:
clc
xor ebx, ebx
L1:
mov eax, [esi +ebx*TYPE DWORD]
adc eax, [edi +ebx*TYPE DWORD]
mov [edx +ebx*TYPE DWORD], eax
inc ebx
loop L1
Applying a scaling factor here would be more general and preferred. Similarly, wherever necessary, you can also use the DEC instruction, which makes a decrement by 1 without affecting the carry flag.
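The reason the carry must survive from one iteration to the next can be seen in a C model of the byte-by-byte addition. This is a sketch with assumed names: extended_add is hypothetical, the carry variable plays the role of CF, and the final carry is stored as a single byte rather than the DWORD that the assembly version writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds two little-endian integers of len bytes each; sum needs len + 1 bytes.
   The carry variable plays the role of CF between iterations. */
void extended_add(const uint8_t *a, const uint8_t *b, uint8_t *sum, size_t len)
{
    assert(a && b && sum);
    unsigned carry = 0;                               /* clc */
    for (size_t i = 0; i < len; ++i) {
        unsigned t = (unsigned)a[i] + b[i] + carry;   /* adc al,[edi] */
        sum[i] = (uint8_t)t;                          /* mov [ebx],al */
        carry = t >> 8;                               /* carry out of bit 7 */
    }
    sum[len] = (uint8_t)carry;                        /* the last carry, if any */
}
```

If the index update inside the loop destroyed carry, as ADD destroys CF, every byte after the first overflow would come out one too small.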
10. Another good reason to avoid PUSH and POP
Since you and the system share the same stack, you should be very careful not to disturb the system’s use of it. If you forget to keep PUSH and POP in pairs, an error could happen, especially with a conditional jump, when the procedure returns.
The following Search2DAry searches a 2-dimensional array for a value passed in EAX. If it is found, simply jump to the FOUND label, returning one in EAX as true; otherwise set EAX to zero as false.
Search2DAry PROC
mov ecx,NUM_ROW
ROW:
push ecx
mov ecx,NUM_COL
COL:
cmp al, [esi+ecx-1]
je FOUND
loop COL
add esi, NUM_COL
pop ecx
loop ROW
mov eax, 0
jmp QUIT
FOUND:
mov eax, 1
QUIT:
ret
Search2DAry ENDP
Let’s call it in main by preparing the argument ESI pointing to the array address and the search value EAX to be 31h or 30h respectively for the not-found or found test case:
.data
ary2D BYTE 10h, 20h, 30h, 40h, 50h
BYTE 60h, 70h, 80h, 90h, 0A0h
NUM_COL = 5
NUM_ROW = 2
.code
main PROC
mov esi, OFFSET ary2D
mov eax, 31h
call Search2DAry
exit
main ENDP
Unfortunately, it only works in the not-found case for 31h. A crash occurs for a successful search like 30h, because of the stack leftover from the pushed outer loop counter. Sadly enough, that leftover, being popped by RET, becomes the return address to the caller.
Therefore, it’s better to use a register or variable to save the outer loop counter here. Although the logic error is still there, a crash would not happen, since the code no longer interferes with the system’s use of the stack. As a good exercise, you can try to fix it.
Assembling time vs. runtime
I would like to talk more about this assembly language feature. Preferably, if you can do something at assembling time, don’t do it at runtime. Organizing logic in assembling means doing a job at static (compilation) time, not consuming runtime. Unlike in high-level languages, all operators in assembly language, such as +, -, *, and /, are processed in assembling, while only instructions, like ADD, SUB, MUL, and DIV, work at runtime.
11. Implementing with plus (+) instead of ADD
Let’s redo the Fibonacci calculation to implement eax = ebx + edx in assembling with the plus operator by help of the LEA instruction. The following is FibonacciByRegLEA with only one line changed from FibonacciByRegMOV.
FibonacciByRegLEA PROC
xor eax,eax
xor ebx,ebx
mov edx,1
L1:
lea eax, DWORD PTR [ebx+edx]
mov edx,ebx
mov ebx,eax
loop L1
ret
FibonacciByRegLEA ENDP
This statement is encoded in three bytes of machine code, with no explicit ADD instruction at runtime:
000000CE 8D 04 1A lea eax, DWORD PTR [ebx+edx]
This example doesn’t make much performance difference compared to FibonacciByRegMOV, but it is enough as an implementation demo.
12. If you can use an operator, don’t use an instruction
For an array defined as:
.data
Ary1 DWORD 20 DUP(?)
If you want to traverse it from the second element to the middle one, you might think of something like this, as in other languages:
mov esi, OFFSET Ary1
add esi, TYPE DWORD
mov ecx, LENGTHOF Ary1
sub ecx, 1
div ecx, 2    ; conceptual only: the real DIV takes a single operand
L1:
Loop L1
Remember that ADD, SUB, and DIV are dynamic behavior at runtime. If you know the values in advance, there is no need to calculate them at runtime; instead, apply operators in assembling:
mov esi, OFFSET Ary1 + TYPE DWORD
mov ecx, (LENGTHOF Ary1 -1)/2
L1:
Loop L1
This saves three instructions in the code segment at runtime. Next, let’s save memory in the data segment.
13. If you can use a symbolic constant, don’t use a variable
Like operators, all directives are processed at assembling time. A variable consumes memory and has to be accessed at runtime. As for the last Ary1, you may want to remember its size in bytes and the number of elements like this:
.data
Ary1 DWORD 20 DUP(?)
arySizeInByte DWORD ($ - Ary1)
aryLength DWORD LENGTHOF Ary1
It is correct but not preferred because of using two variables. Why not simply make them symbolic constants to save the memory of two DWORD?
.data
Ary1 DWORD 20 DUP(?)
arySizeInByte = ($ - Ary1)
aryLength EQU LENGTHOF Ary1
Using either equal sign or EQU directive is fine. The constant is just a replacement during code preprocessing.
14. Generating the memory block in macro
For an amount of data to initialize, if you already know the logic of how to create it, you can use a macro to generate memory blocks in assembling, instead of at runtime. The following macro creates all Fibonacci numbers below 0FFFFFFFFh in a DWORD array named FibArray:
.data
val1 = 1
val2 = 1
val3 = val1 + val2
FibArray LABEL DWORD
DWORD val1
DWORD val2
WHILE val3 LT 0FFFFFFFFh
DWORD val3
val1 = val2
val2 = val3
val3 = val1 + val2
ENDM
As a macro goes to the assembler to be processed statically, this saves considerable initializations at runtime, as opposed to FibonacciByXXX mentioned before.
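C has no direct counterpart to the WHILE macro, so the following runtime sketch only models which values such a table would contain. The name build_fib_table is hypothetical, and 64-bit arithmetic is an assumption made here so that the comparison against 0FFFFFFFFh cannot wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills out[] with the Fibonacci numbers below 0xFFFFFFFF and returns how
   many were written (never more than cap). */
size_t build_fib_table(uint32_t *out, size_t cap)
{
    assert(out != NULL);
    uint64_t val1 = 1, val2 = 1;
    size_t n = 0;
    while (val1 < 0xFFFFFFFFu && n < cap) {
        out[n++] = (uint32_t)val1;       /* DWORD val1 */
        uint64_t val3 = val1 + val2;     /* val3 = val1 + val2 */
        val1 = val2;
        val2 = val3;
    }
    return n;
}
```

Given a buffer of 64 entries, the function writes 47 values, the last being 2971215073; the next Fibonacci number no longer fits in 32 bits.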
For more about macros in MASM, see my article Something You May Not Know About the Macro in MASM [4]. I also reverse engineered the switch statement in the VC++ compiler implementation. Interestingly, under some conditions the switch statement chooses the binary search but without exposing the prerequisite of a sort implementation at runtime. It’s reasonable to think of the preprocessor as doing the sorting with all known case values in compilation. The static sorting behavior (as opposed to dynamic behavior at runtime) could be implemented with a macro procedure, directives, and operators. For details, please see Something You May Not Know About the Switch Statement in C/C++ [5].
About loop design
Almost every language provides an unconditional jump like GOTO, but most of us rarely use it, based on software engineering principles. Instead, we use others like break and continue. In assembly language, however, we rely more on jumps, either conditional or unconditional, to make the control workflow more flexible. In the following sections, I list some ill-coded patterns.
15. Encapsulating all loop logic in the loop body
To construct a loop, try to keep all your loop contents in the loop body. Don’t jump out to do something and then jump back into the loop. The example here is to traverse a one-dimensional integer array. If an odd number is found, increment it; otherwise do nothing.
Two unclear solutions with the correct result could look like this:
mov ecx, LENGTHOF array
xor esi, esi
L1:
test array[esi], 1
jnz ODD
PASS:
add esi, TYPE DWORD
loop L1
jmp DONE
ODD:
inc array[esi]
jmp PASS
DONE:
Or:
mov ecx, LENGTHOF array
xor esi, esi
jmp L1
ODD:
inc array[esi]
jmp PASS
L1:
test array[esi], 1
jnz ODD
PASS:
add esi, TYPE DWORD
loop L1
However, they both do the incrementing outside and then jump back. They make a check in the loop, but the first does the incrementing after the loop and the second does it before the loop. For simple logic, you may not think like this, but for a complicated problem, assembly language could lead you astray to produce such a spaghetti pattern. The following is a good one, which encapsulates all logic in the loop body and is concise, readable, maintainable, and efficient.
mov ecx, LENGTHOF array
xor esi, esi
L1:
test array[esi], 1
jz PASS
inc array[esi]
PASS:
add esi, TYPE DWORD
loop L1
16. Loop entrance and exit
Usually a loop with one entrance and one exit is preferred. But if necessary, two or more conditional exits are fine, as shown in Search2DAry with found and not-found results.
The following is a bad pattern of two entrances, where one gets into START via initialization and another directly goes to MIDDLE. Such code is pretty hard to understand. You need to reorganize or refactor the loop logic.
je MIDDLE
START:
MIDDLE:
loop START
The following is a bad pattern of two loop ends, where some logic gets out at the first loop end while the other exits at the second. Such code is quite confusing. Try to rework it with a jump to a label so that only one loop end is maintained.
START2:
je NEXT
loop START2
jmp DONE
NEXT:
loop START2
DONE:
17. Don’t change ECX in the loop body
The register ECX acts as a loop counter and its value is implicitly decremented when using the LOOP instruction. You can read ECX and make use of its value in iteration. As seen in Search2DAry in the previous section, we compare the indirect operand [esi+ecx-1] with AL. But never try to change the loop counter within the loop body, which makes code hard to understand and hard to debug. A good practice is to think of the loop counter ECX as read-only.
mov ecx, 10
L1:
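mov eax, ecx                      ; fine: only reads ECX
mov ebx, [esi +ecx *TYPE DWORD]   ; fine: only reads ECX
mov ecx, edx                      ; bad: overwrites the loop counter
inc ecx                           ; bad: changes the loop counter
loop L1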
18. When jump backward…
Besides the LOOP instruction, assembly language programming can heavily rely on conditional or unconditional jumps to create a loop when the count is not determined before the loop. Theoretically, for a backward jump, the workflow might be considered as a loop. Assume that jx and jy are desired jump or LOOP instructions. The following backward jy L2 nested in the jx L1 is probably thought of as an inner loop.
L1:
L2:
jy L2
jx L1
To have selection logic of if-then-else, it’s reasonable to use a forward jump like this as branching in the jx L1 iteration:
L1:
jy TrueLogic
jmp DONE
TrueLogic:
DONE:
jx L1
About procedure
As with functions in C/C++, let’s talk about some basics of procedures in assembly language.
19. Making a clear calling interface
When designing a procedure, we hope to make it as reusable as possible. Make it perform only one task, without others like I/O. The procedure’s caller should take the responsibility for input and output. The caller should communicate with the procedure only by arguments and parameters. The procedure should only use parameters in its logic without referring to outside definitions, without any:
- Global variable and array
- Global symbolic constant
Implementing with such a definition makes your procedure un-reusable.
Recalling the previous five FibonacciByXXX procedures, we use register ECX as both argument and parameter, with the return value in EAX, to make a clear calling interface:
FibonacciByXXX
Now the caller can do like
call FibonacciByXXX
As a second example, let’s take a look again at calling Search2DAry in the previous section. The register arguments ESI and EAX are prepared so that the implementation of Search2DAry doesn’t directly refer to the global array, ary2D.
... ...
NUM_COL = 5
NUM_ROW = 2
.code
main PROC
mov esi, OFFSET ary2D
mov eax, 31h
call Search2DAry
exit
main ENDP
Search2DAry PROC
mov ecx,NUM_ROW ... ...
mov ecx,NUM_COL ... ...
Unfortunately, the weakness is that its implementation still uses two global constants, NUM_ROW and NUM_COL, which prevents it from being called elsewhere. To improve, supplying two more register arguments would be an obvious way, or see the next section.
20. INVOKE vs. CALL
Besides the CALL instruction from Intel, MASM provides the 32-bit INVOKE directive to make a procedure call easier. For the CALL instruction, you can only use registers as the argument/parameter pair in the calling interface, as shown above. The problem is that the number of registers is limited. All registers are global and you probably have to save registers before calling and restore them after calling. The INVOKE directive gives the form of a procedure with a parameter-list, as you have experienced in high-level languages.
When considering Search2DAry with a parameter-list, without referring to the global constants NUM_ROW and NUM_COL, we can have its prototype like this
Search2DAry PROTO, pAry2D: PTR BYTE, val: BYTE, nRow: WORD, nCol: WORD
Again, as an exercise, you can try to implement this for a fix. Now you just do
INVOKE Search2DAry, ADDR ary2D, 31h, NUM_ROW, NUM_COL
Likewise, to construct a parameter-list procedure, you still need to follow the rule of not referring to global variables and constants. Also pay attention to the following:
- The entire calling interface should only go through the parameter list without referring any register values set outside the procedure.
21. Call-by-Value vs. Call-by-Reference
Also be aware that a parameter-list should not be too long. If it is, use an object parameter instead. Suppose that you fully understand the function concept, call-by-value and call-by-reference, in high-level languages. By learning the stack frame in assembly language, you understand more about the low-level function calling mechanism. Usually for an object argument, we prefer passing a reference, an object address, rather than the whole object copied on the stack memory.
To demonstrate this, let’s create a procedure to write month, day, and year from an object of the Win32 SYSTEMTIME structure.
The following is the version of call-by-value, where we use the dot operator to retrieve individual WORD field members from the DateTime object and extend their 16-bit values to 32-bit EAX:
WriteDateByVal PROC, DateTime:SYSTEMTIME
movzx eax, DateTime.wMonth
movzx eax, DateTime.wDay
movzx eax, DateTime.wYear
ret
WriteDateByVal ENDP
The version of call-by-reference is not so straightforward, with an object address received. Unlike the arrow ->, the pointer operator in C/C++, we have to save the pointer (address) value in a 32-bit register like ESI. By using ESI as an indirect operand, we must cast its memory back to the SYSTEMTIME type. Then we can get the object members with the dot:
WriteDateByRef PROC, datetimePtr: PTR SYSTEMTIME
mov esi, datetimePtr
movzx eax, (SYSTEMTIME PTR [esi]).wMonth
movzx eax, (SYSTEMTIME PTR [esi]).wDay
movzx eax, (SYSTEMTIME PTR [esi]).wYear
ret
WriteDateByRef ENDP
You can watch the stack frame of the argument passed for the two versions at runtime. For WriteDateByVal, eight WORD members are copied on the stack and consume sixteen bytes, while for WriteDateByRef, only four bytes are needed as a 32-bit address. It will make a big difference for a big structure object, though.
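The same size difference can be observed from C. The sketch below uses SystemTimeModel, a hypothetical stand-in that mirrors the eight 16-bit fields of SYSTEMTIME, so that it does not depend on the Windows headers; the function names are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Win32 SYSTEMTIME: eight 16-bit fields. */
typedef struct {
    uint16_t wYear, wMonth, wDayOfWeek, wDay;
    uint16_t wHour, wMinute, wSecond, wMilliseconds;
} SystemTimeModel;

/* Call-by-value: the caller copies the whole structure for the callee. */
unsigned month_by_val(SystemTimeModel t) { return t.wMonth; }

/* Call-by-reference: only an address is passed. */
unsigned month_by_ref(const SystemTimeModel *t)
{
    assert(t != NULL);
    return t->wMonth;
}

/* Bytes of argument data in each case. */
size_t arg_bytes_by_val(void) { return sizeof(SystemTimeModel); }
size_t arg_bytes_by_ref(void) { return sizeof(SystemTimeModel *); }
```

The by-value argument is 16 bytes, while the by-reference argument is one pointer: four bytes in a 32-bit build, eight in a 64-bit build.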
22. Avoid multiple RET
To construct a procedure, it’s ideal to keep all your logic within the procedure body. A procedure with one entrance and one exit is preferred. In assembly language programming, a procedure name is directly represented by a memory address, as is any label. Thus directly jumping to a label or a procedure without using CALL or INVOKE would be possible. Since such an abnormal entry would be quite rare, I am not going to discuss it here.
Although multiple returns are sometimes used in other language examples, I don’t encourage such a pattern in assembly code. Multiple RET instructions could make your logic hard to understand and debug. The first listing below is such an example in branching. In the second, we instead have a label QUIT at the end and jump there, making a single exit, where you can probably do common chores to avoid repeated code.
MultiRetEx PROC
jx NEXTx
ret
NEXTx:
jy NEXTy
ret
NEXTy:
ret
MultiRetEx ENDP
Instead:
SingleRetEx PROC
jx NEXTx
jmp QUIT
NEXTx:
jy NEXTy
jmp QUIT
NEXTy:
QUIT:
ret
SingleRetEx ENDP
Object data members
Similar to the above SYSTEMTIME structure, we can also create our own type or a nested one:
Rectangle STRUCT
UpperLeft COORD <>
LowerRight COORD <>
Rectangle ENDS
.data
rect Rectangle { {10,20}, {30,50} }
The Rectangle type contains two COORD members, UpperLeft and LowerRight. The Win32 COORD contains two WORD (SHORT), X and Y. Obviously, we can access the object rect’s data members with the dot operator from either a direct or an indirect operand like this
mov rect.UpperLeft.X, 11
mov esi,OFFSET rect
mov (Rectangle PTR [esi]).UpperLeft.Y, 22
mov esi,OFFSET rect.LowerRight
mov (COORD PTR [esi]).X, 33
mov esi,OFFSET rect.LowerRight.Y
mov WORD PTR [esi], 55
By using the PTR operator, we access different data member values with different type casts. Recall that any operator is processed in assembling at static time. What if we want to retrieve a data member’s address (not value) at runtime?
23. Indirect operand and LEA
For an indirect operand pointing to an object, you can’t use the OFFSET operator to get the member’s address, because OFFSET can only take the address of a variable defined in the data segment.
There could be a scenario where we have to pass an object reference argument to a procedure like WriteDateByRef in the previous section, but want to retrieve its member’s address (not value). Let’s still use the above rect object as an example. The following second use of OFFSET is not valid in assembling:
mov esi,OFFSET rect
mov edi, OFFSET (Rectangle PTR [esi]).LowerRight
Let’s ask for help from the LEA instruction that you have seen in FibonacciByRegLEA in the previous section. The LEA instruction calculates and loads the effective address of a memory operand. It is similar to the OFFSET operator, except that only LEA can obtain an address calculated at runtime:
mov esi,OFFSET rect
lea edi, (Rectangle PTR [esi]).LowerRight
mov ebx, OFFSET rect.LowerRight
lea edi, (Rectangle PTR [esi]).UpperLeft.Y
mov ebx, OFFSET rect.UpperLeft.Y
mov esi,OFFSET rect.UpperLeft
lea edi, (COORD PTR [esi]).Y
I purposely have EBX here to get an address statically, and you can verify the same address in EDI, which is loaded dynamically from the indirect operand ESI at runtime.
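A rough C parallel, offered as a sketch rather than an exact equivalent: offsetof is resolved at compile time, much as OFFSET is resolved in assembling, while taking a member’s address through a pointer is computed at runtime, as LEA does. The type names Coord and Rect and the function lower_right_y are hypothetical stand-ins for COORD, Rectangle, and the LEA step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int16_t X, Y; } Coord;                /* mirrors Win32 COORD */
typedef struct { Coord UpperLeft, LowerRight; } Rect;  /* mirrors Rectangle */

/* Known at compile time, like OFFSET applied to a named variable's member. */
enum { LOWER_RIGHT_OFFSET = offsetof(Rect, LowerRight) };

/* Computed at run time from a pointer, like LEA on an indirect operand. */
int16_t *lower_right_y(Rect *r)
{
    assert(r != NULL);
    return &r->LowerRight.Y;
}
```

For a Rect initialized as { {10,20}, {30,50} }, LOWER_RIGHT_OFFSET is 4 and lower_right_y points six bytes into the object, at the value 50.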
About system I/O
From Computer Memory Basics, we know that I/O operations from the operating system are quite slow. Input and output are usually measured in milliseconds, compared with register and memory in nanoseconds or microseconds. To be more efficient, trying to reduce system API calls is a nice consideration. Here I mean Win32 API calls. For details about the Win32 functions mentioned in the following, please refer to MSDN.
24. Reducing system I/O API calls
An example is to output 20 lines of 50 random characters with random colors as below:

We definitely can generate one character to output at a time, by using SetConsoleTextAttribute and WriteConsole. Simply set its color by
INVOKE SetConsoleTextAttribute, consoleOutHandle, wAttributes
Then write that character by
INVOKE WriteConsole,
consoleOutHandle, OFFSET buffer, 1, OFFSET bytesWritten, 0
After writing 50 characters, make a new line. So we can create a nested iteration, the outer loop for 20 rows and the inner loop for 50 columns. As 50 by 20, we call these two console output functions 1000 times.
However, another pair of API functions can be more efficient, by writing 50 characters in a row and setting their colors all at once. They are WriteConsoleOutputAttribute and WriteConsoleOutputCharacter. To make use of them, let’s create two procedures:
ChooseColor PROC
ChooseCharacter PROC
We call them in a loop to prepare a WORD array bufColor and a BYTE array bufChar for all 50 characters selected. Now we can write the 50 random characters per line with two calls here:
INVOKE WriteConsoleOutputAttribute,
outHandle,
ADDR bufColor,
MAXCOL,
xyPos,
ADDR cellsWritten
INVOKE WriteConsoleOutputCharacter,
outHandle,
ADDR bufChar,
MAXCOL,
xyPos,
ADDR cellsWritten
Besides bufColor and bufChar, we define MAXCOL = 50 and the COORD type xyPos so that xyPos.y is incremented each row in a single loop of 20 rows. In total, we only call these two APIs 20 times.
About PTR operator
MASM provides the operator PTR that is similar to the pointer * used in C/C++. The following is the PTR specification:
- type PTR expression
Forces the expression to be treated as having the specified type.
- [[ distance ]] PTR type
Specifies a pointer to type.
This means that two usages are available, such as BYTE PTR or PTR BYTE. Let’s discuss how to use them.
25. Defining a pointer, cast and dereference
The following C/C++ code demonstrates which type of endianness is used in your system, little endian or big endian. As an integer type takes four bytes, it makes a pointer type cast from the array name fourBytes, a char address, to an unsigned int address. Then it displays the integer result by dereferencing the unsigned int pointer.
#include <stdio.h>

int main()
{
    unsigned char fourBytes[] = { 0x12, 0x34, 0x56, 0x78 };
    unsigned int *ptr = (unsigned int *)fourBytes;
    printf("1. Directly Cast: n is %Xh\n", *ptr);
    return 0;
}
As expected on an x86 Intel based system, this verifies the little endian by showing 78563412 in hexadecimal. We can do the same thing in assembly language with DWORD PTR, which is just similar to an address cast to the 4-byte DWORD, the unsigned int type.
.data
fourBytes BYTE 12h,34h,56h,78h
.code
mov eax, DWORD PTR fourBytes
There is no explicit dereference here, since DWORD PTR combines four bytes into a DWORD memory and lets MOV retrieve it as a direct operand to EAX. This could be considered equivalent to the (unsigned int *) cast.
Now let’s do it another way by using PTR DWORD. Again, with the same logic as above, this time we define a DWORD pointer type first with TYPEDEF:
DWORD_POINTER TYPEDEF PTR DWORD
This could be considered equivalent to defining the pointer type as unsigned int *. Then in the following data segment, the address variable dwPtr takes over the fourBytes memory. Finally in code, EBX holds this address as an indirect operand and makes an explicit dereference here to get its DWORD value to EAX.
.data
fourBytes BYTE 12h,34h,56h,78h
dwPtr DWORD_POINTER fourBytes
.code
mov ebx, dwPtr
mov eax, [ebx]
To summarize, PTR DWORD indicates a DWORD address type to define (declare) a variable, like a pointer type, while DWORD PTR indicates the memory pointed to by a DWORD address, like a type cast.
26. Using PTR in a procedure
To define a procedure with a parameter list, you might want to use PTR in both ways. The following is such an example to increment each element in a DWORD array:
IncrementArray PROC, pAry:PTR DWORD, count:DWORD
mov edi,pAry
mov ecx,count
L1:
inc DWORD PTR [edi]
add edi, TYPE DWORD
loop L1
ret
IncrementArray ENDP
As the first parameter pAry is a DWORD address, PTR DWORD is used as a parameter type. In the procedure, when incrementing a value pointed to by the indirect operand EDI, you must tell the system what the type (size) of that memory is by using DWORD PTR.
Another example is the earlier mentioned WriteDateByRef, where SYSTEMTIME is a Windows defined structure type.
WriteDateByRef PROC, datetimePtr: PTR SYSTEMTIME
mov esi, datetimePtr
movzx eax, (SYSTEMTIME PTR [esi]).wMonth
... ...
ret
WriteDateByRef ENDP
Likewise, we use PTR SYSTEMTIME as the parameter type to define datetimePtr. When ESI receives an address from datetimePtr, it has no knowledge about the memory type, just like a void pointer in C/C++. We have to cast it as a SYSTEMTIME memory, so as to retrieve its data members.
Signed and Unsigned
In assembly language programming, you can define an integer variable as either signed, as SBYTE, SWORD, and SDWORD, or unsigned, as BYTE, WORD, and DWORD. The data ranges, for example of 8-bit, are
- BYTE: 0 to 255 (00h to FFh), totally 256 numbers
- SBYTE: half negatives, -128 to -1 (80h to FFh), half positives, 0 to 127 (00h to 7Fh)
From the hardware point of view, arithmetic instructions such as ADD and SUB operate exactly the same on signed and unsigned integers, because the CPU cannot distinguish between signed and unsigned. For example, when you define
.data
bVal BYTE 255
sbVal SBYTE -1
both of them have the same 8-bit binary FFh saved in memory or moved to a register. You, as a programmer, are solely responsible for using the correct data type with an instruction and should be able to explain the result from the flags affected:
- The carry flag CF for unsigned integers
- The overflow flag OF for signed integers
The following are several common tricks and pitfalls.
27. Comparison with conditional jumps
Let’s check the following code to see which label it jumps:
mov eax, -1
cmp eax, 1
ja L1
jmp L2
As we know, CMP follows the same logic as SUB while being non-destructive to the destination operand. Using JA means considering an unsigned comparison, where the destination EAX is FFFFFFFFh, i.e. 4294967295, while the source is 1. Certainly FFFFFFFFh is bigger than 1, so that makes it jump to L1. Thus, any unsigned comparisons such as JA, JB, JAE, JNA, etc. can be remembered as A (Above) or B (Below). An unsigned comparison is determined by CF and the zero flag ZF, as shown in the following examples:
| CMP if | Destination | Source | ZF(ZR) | CF(CY) |
|---|---|---|---|---|
| Destination<Source | 1 | 2 | 0 | 1 |
| Destination>Source | 2 | 1 | 0 | 0 |
| Destination=Source | 1 | 1 | 1 | 0 |
Now let’s take a look at signed comparison with the following code to see where it jumps:
mov eax, -1
cmp eax, 1
jg L1
jmp L2
The only difference is JG here instead of JA. Using JG means considering a signed comparison, where the destination EAX is FFFFFFFFh, i.e. -1, while the source is 1. Certainly -1 is smaller than 1, so JG is not taken and execution falls through to JMP, which goes to L2. Likewise, any signed comparisons such as JG, JL, JGE, JNG, etc. can be thought of as G (Greater) or L (Less). A signed comparison is determined by OF and the sign flag SF, as shown in the following examples:
| CMP if | Destination | Source | SF(PL) | OF(OV) |
|---|---|---|---|---|
| Destination<Source: (SF != OF) | -2 | 127 | 0 | 1 |
| | -2 | 1 | 1 | 0 |
| Destination>Source: (SF == OF) | 127 | 1 | 0 | 0 |
| | 127 | -1 | 1 | 1 |
| Destination=Source | 1 | 1 | ZF=1 | |
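To experiment with the two tables without a debugger, here is a minimal C sketch of an 8-bit CMP. It is a model I made up for illustration, not how the CPU is built: the names Flags, cmp8, takes_ja, takes_jb, takes_jg, takes_jl, and check_table_rows are all hypothetical. The jump conditions follow the rules above: A/B look at CF and ZF, G/L look at SF, OF, and ZF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, cf, sf, of; } Flags;

/* Model of an 8-bit "cmp dst, src": compute dst - src and keep only the flags. */
Flags cmp8(uint8_t dst, uint8_t src)
{
    uint8_t diff = (uint8_t)(dst - src);
    Flags f;
    f.zf = (diff == 0);
    f.cf = (dst < src);                       /* unsigned borrow */
    f.sf = (diff & 0x80) != 0;                /* sign bit of the result */
    /* Signed overflow on subtraction: the operands have different signs
       and the sign of the result differs from the sign of dst. */
    f.of = (((dst ^ src) & (dst ^ diff)) & 0x80) != 0;
    return f;
}

bool takes_ja(Flags f) { return !f.cf && !f.zf; }       /* unsigned above   */
bool takes_jb(Flags f) { return f.cf; }                 /* unsigned below   */
bool takes_jg(Flags f) { return !f.zf && f.sf == f.of; }/* signed greater   */
bool takes_jl(Flags f) { return f.sf != f.of; }         /* signed less      */

/* Replays the rows of the two tables (0xFE is -2 and 0xFF is -1 as signed bytes). */
void check_table_rows(void)
{
    assert(takes_jb(cmp8(1, 2)));
    assert(takes_ja(cmp8(2, 1)));
    assert(cmp8(1, 1).zf && !cmp8(1, 1).cf);

    assert(takes_jl(cmp8(0xFE, 127)));   /* SF=0, OF=1 */
    assert(takes_jl(cmp8(0xFE, 1)));     /* SF=1, OF=0 */
    assert(takes_jg(cmp8(127, 1)));      /* SF=0, OF=0 */
    assert(takes_jg(cmp8(127, 0xFF)));   /* SF=1, OF=1 */
}
```

With the byte FFh as destination and 1 as source, takes_ja is true while takes_jg is false, which is the 8-bit version of the two code fragments in this section.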
28. When CBW, CWD, or CDQ mistakenly meets DIV…
As we know, the DIV instruction performs unsigned 8-bit, 16-bit, or 32-bit integer division with the dividend in AX, DX:AX, or EDX:EAX respectively. For unsigned division, you have to clear the upper half by zeroing AH, DX, or EDX before using DIV. For signed division with IDIV, the sign-extension instructions CBW, CWD, and CDQ are provided to extend the upper half before using IDIV.
For a positive integer whose highest bit (sign bit) is zero, there is no difference between manually clearing the upper part of the dividend and mistakenly using a sign extension, as shown in the following example:
mov eax,1002h
cdq
mov ebx,10h
div ebx
This is fine because 1002h is a small positive number and CDQ makes EDX zero, the same as directly clearing EDX. So if your value is positive and its highest bit is zero, using CDQ and
XOR EDX, EDX
are exactly the same.
However, this doesn't mean that you can always use CDQ/CWD/CBW with DIV when performing a positive division. Take an 8-bit example, 129/2, expecting quotient 64 and remainder 1. But if you write this
mov al, 129
cbw
mov bl,2
div bl
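Try the above in the debugger to see how an integer division overflow happens: CBW sign-extends AL (81h) into AX as FF81h, which DIV treats as the unsigned dividend 65409, and the quotient 32704 cannot fit in AL. If you really want the correct unsigned DIV, you must write: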
mov al, 129
xor ah, ah
mov bl,2
div bl
On the other hand, if you really want to use CBW, it means that you are performing a signed division. Then you must use IDIV:
mov al, 129
cbw
mov bl,2
idiv bl
As seen here, 81h as a signed byte is decimal -127, so the signed IDIV divides -127 by 2 and gives the correct signed result: quotient -63 and remainder -1.
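The three variants above can be replayed with a small C model of the 8-bit forms of DIV and IDIV. This is a sketch under my own assumptions rather than a description of the processor: cbw, zero_extend, div8, idiv8, and to_signed16 are hypothetical helper names, and a return value of false stands for the divide error the CPU would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CBW: copy the sign bit of AL into every bit of AH. */
uint16_t cbw(uint8_t al)
{
    return (al & 0x80) ? (uint16_t)(0xFF00u | al) : (uint16_t)al;
}

/* XOR AH,AH: clear AH, leaving AL as an unsigned value in AX. */
uint16_t zero_extend(uint8_t al) { return al; }

/* Interpret a 16-bit pattern as a signed value without relying on casts. */
static int to_signed16(uint16_t x) { return x >= 0x8000u ? (int)x - 65536 : (int)x; }

/* 8-bit DIV: unsigned AX / divisor, quotient to AL, remainder to AH. */
bool div8(uint16_t ax, uint8_t divisor, uint8_t *quot, uint8_t *rem)
{
    assert(quot != NULL && rem != NULL);
    if (divisor == 0) return false;
    unsigned q = (unsigned)ax / divisor;
    if (q > 0xFF) return false;            /* quotient does not fit in AL */
    *quot = (uint8_t)q;
    *rem  = (uint8_t)((unsigned)ax % divisor);
    return true;
}

/* 8-bit IDIV: signed AX / signed divisor, truncating toward zero. */
bool idiv8(uint16_t ax, uint8_t divisor, int8_t *quot, int8_t *rem)
{
    assert(quot != NULL && rem != NULL);
    int dividend = to_signed16(ax);
    int d = (divisor & 0x80) ? (int)divisor - 256 : (int)divisor;
    if (d == 0) return false;
    int q = dividend / d;
    if (q < -128 || q > 127) return false; /* quotient does not fit in AL */
    *quot = (int8_t)q;
    *rem  = (int8_t)(dividend % d);
    return true;
}
```

In this model, div8(cbw(129), 2, ...) fails, div8(zero_extend(129), 2, ...) yields 64 and 1, and idiv8(cbw(129), 2, ...) yields -63 and -1.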
29. Why 255-1 and 255+(-1) affect CF differently?
To talk about the carry flag CF, let's take the following two arithmetic calculations:
mov al, 255
sub al, 1
mov bl, 255
add bl, -1
From a human being's point of view, they do exactly the same operation, 255 minus 1 with the result 254 (FEh). Likewise, from the hardware point of view, for either calculation the CPU does the same operation by representing -1 as the two's complement FFh and then adding it to 255. Now 255 is FFh and the binary format of -1 is also FFh. This is how it is calculated:
  1111 1111
+ 1111 1111
-------------
  1111 1110
Remember? The CPU operates exactly the same on signed and unsigned here because it cannot distinguish them. A programmer should be able to explain the behavior by the flags affected. Since we are talking about CF, it means we consider the two calculations as unsigned. The key information is that -1 is FFh and therefore 255 in decimal. So the logical interpretation of CF is
- For sub al, 1, it means 255 minus 1, resulting in 254 without the need for a borrow, so CF = 0
- For add bl, -1, it means 255 plus 255, resulting in 510; with a carry of 100h (256) out, 254 is the remainder left in the byte, so CF = 1
From the hardware implementation, CF depends on which instruction is used, ADD or SUB. Here MSB (Most Significant Bit) is the highest bit.
- For the ADD instruction, add bl, -1, directly use the carry out of the MSB, so CF = 1
- For the SUB instruction, sub al, 1, you must INVERT the carry out of the MSB, so CF = 0
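The hardware rule above can be checked with a few lines of C. This is a simplified model of my own, assuming that SUB is carried out as the first operand plus the one's complement of the second plus one; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ADD: CF is simply the carry out of the MSB of a + b. */
bool cf_after_add(uint8_t a, uint8_t b)
{
    return ((unsigned)a + (unsigned)b) > 0xFF;
}

/* SUB: the adder computes a + (~b) + 1, and CF is the INVERTED carry out,
   so that CF = 1 means a borrow was needed. */
bool cf_after_sub(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)(uint8_t)~b + 1u;
    bool carry_out = sum > 0xFF;
    return !carry_out;
}

/* Replays the two calculations from the text. */
void check_cf_examples(void)
{
    /* sub al, 1 with al = 255: result FEh, no borrow */
    assert((uint8_t)(255 - 1) == 0xFE);
    assert(cf_after_sub(255, 1) == false);

    /* add bl, -1 with bl = 255: -1 is FFh, result FEh, carry out */
    assert((uint8_t)(255 + 0xFF) == 0xFE);
    assert(cf_after_add(255, 0xFF) == true);
}
```

Both calculations leave FEh in the register; only the flag differs, because the same carry out of the MSB is used directly by ADD and inverted by SUB.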
30. How to determine OF?
Now let's look at the overflow flag OF, still with the above two arithmetic calculations:
mov al, 255
sub al, 1
mov bl, 255
add bl, -1
Neither of them overflows, so OF = 0. We have two ways to determine OF: the logic rule and the hardware implementation.
Logic viewpoint: The overflow flag is only set, OF = 1, when
- Two positive operands are added and their sum is negative
- Two negative operands are added and their sum is positive
For signed, 255 is -1 (FFh). The flag OF doesn't care about ADD or SUB. Our two examples just do -1 plus -1 with the result -2. Thus, two negatives are added with the sum still negative, so OF = 0.
Hardware implementation: For non-zero operands,
- OF = (carry out of the MSB) XOR (carry into the MSB)
Look at our calculation again:
  1111 1111
+ 1111 1111
-------------
  1111 1110
The carry out of the MSB is 1 and the carry into the MSB is also 1. Then OF = (1 XOR 1) = 0
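Both ways of determining OF can be written down and compared in C. The sketch below covers 8-bit addition only, and its function names are made up for illustration; of_by_carries follows the hardware rule and of_by_signs follows the logic rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hardware rule: OF = (carry out of the MSB) XOR (carry into the MSB). */
bool of_by_carries(uint8_t a, uint8_t b)
{
    /* Adding the low seven bits shows whether a carry enters bit 7. */
    bool carry_in  = (((a & 0x7F) + (b & 0x7F)) & 0x80) != 0;
    bool carry_out = ((unsigned)a + (unsigned)b) > 0xFF;
    return carry_in != carry_out;
}

/* Logic rule: overflow only when both operands have the same sign
   and the sum has the opposite sign. */
bool of_by_signs(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);
    bool sign_a   = (a & 0x80) != 0;
    bool sign_b   = (b & 0x80) != 0;
    bool sign_sum = (sum & 0x80) != 0;
    return sign_a == sign_b && sign_sum != sign_a;
}

/* The example from the text plus one case of each overflow kind. */
void check_of_examples(void)
{
    assert(!of_by_carries(0xFF, 0xFF));  /* -1 + -1 = -2, no overflow        */
    assert(of_by_carries(0x7F, 0x01));   /* 127 + 1: positives, negative sum */
    assert(of_by_carries(0x80, 0x80));   /* -128 + -128: negatives, sum 0    */
}
```

Looping over all 65536 operand pairs and comparing the two functions is an easy way to convince yourself that the two viewpoints agree for addition.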
To practice more, the following table enumerates different test cases for your understanding:

Ambiguous «LOCAL» directive
As mentioned previously, the PTR operator has two usages, DWORD PTR and PTR DWORD. But MASM provides another confusing directive, LOCAL, whose meaning depends on the context in which exactly the same reserved word is used. The following is the specification from MSDN:
LOCAL localname [[, localname]]…
LOCAL label [[ [count ] ]] [[:type]] [[, label [[ [count] ]] [[type]]]]…
- In the first directive, within a macro, LOCAL defines labels that are unique to each instance of the macro.
- In the second directive, within a procedure definition (PROC), LOCAL creates stack-based variables that exist for the duration of the procedure. The label may be a simple variable or an array containing count elements.
This specification is not clear enough to understand. In this section, I'll expose the essential difference between the two and show two examples using the LOCAL directive, one in a procedure and the other in a macro. For your familiarity, both examples calculate the nth Fibonacci number, as the earlier Fibonacci example did. The main points delivered here are:
- The variables declared by LOCAL in a macro are NOT local to the macro. They are system-generated global variables on the data segment, created to resolve redefinition.
- The variables created by LOCAL in a procedure are really local variables allocated on the stack frame, with a lifetime only for the duration of the procedure.
For the basic concepts and implementations of the data segment and the stack frame, please take a look at a textbook or the MASM manual; they are worth several chapters and are not covered here.
31. When LOCAL used in a procedure
The following is a procedure with a parameter n to calculate the nth Fibonacci number, returned in EAX. I let the loop counter ECX take over the parameter n. Please compare it with the earlier Fibonacci procedure. The logic is the same; the only difference is the use of the local variables pre and cur here, instead of the global variables used in that earlier version.
FibonacciByLocalVariable PROC USES ecx edx, n:DWORD
LOCAL pre, cur :DWORD
mov ecx,n
mov eax,1
mov pre,0
mov cur,0
L1:
add eax, pre
mov edx, cur
mov pre, edx
mov cur, eax
loop L1
ret
FibonacciByLocalVariable ENDP
The following is the code shown in the VS Disassembly window at runtime. As you can see, each line of assembly source is translated into machine code, with the parameter n and the two local variables created on the stack frame and referenced by EBP:
231:
232: FibonacciByLocalVariable PROC USES ecx edx, n:DWORD
011713F4 55 push ebp
011713F5 8B EC mov ebp,esp
011713F7 83 C4 F8 add esp,0FFFFFFF8h
011713FA 51 push ecx
011713FB 52 push edx
233:
234:
235:
236: LOCAL pre, cur :DWORD
237:
238: mov ecx,n
011713FC 8B 4D 08 mov ecx,dword ptr [ebp+8]
239: mov eax,1
011713FF B8 01 00 00 00 mov eax,1
240: mov pre,0
01171404 C7 45 FC 00 00 00 00 mov dword ptr [ebp-4],0
241: mov cur,0
0117140B C7 45 F8 00 00 00 00 mov dword ptr [ebp-8],0
242: L1:
243: add eax,pre
01171412 03 45 FC add eax,dword ptr [ebp-4]
244: mov EDX, cur
01171415 8B 55 F8 mov edx,dword ptr [ebp-8]
245: mov pre, EDX
01171418 89 55 FC mov dword ptr [ebp-4],edx
246: mov cur, eax
0117141B 89 45 F8 mov dword ptr [ebp-8],eax
247: loop L1
0117141E E2 F2 loop 01171412
248:
249: ret
01171420 5A pop edx
01171421 59 pop ecx
01171422 C9 leave
01171423 C2 04 00 ret 4
250: FibonacciByLocalVariable ENDP
When FibonacciByLocalVariable is running, the stack frame can be seen as below:

Obviously, the parameter n is at EBP+8. This
add esp, 0FFFFFFF8h
just means
sub esp, 08h
moving the stack pointer ESP down eight bytes to create the two DWORD variables pre and cur. Finally the LEAVE instruction implicitly does
mov esp, ebp
pop ebp
which moves ESP back to EBP, releasing the local variables pre and cur. And this releases n, at EBP+8, as required by the STDCALL calling convention:
ret 4
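If the equivalence of add esp, 0FFFFFFF8h and sub esp, 08h looks odd, the following C sketch works it out with 32-bit wraparound arithmetic. The EBP value and all names here (add32, sub32, FrameAddresses, frame_layout, check_frame) are made up for illustration; only the offsets +8, -4, and -8 come from the disassembly above.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit register arithmetic wraps around modulo 2^32. */
uint32_t add32(uint32_t reg, uint32_t imm) { return (uint32_t)(reg + imm); }
uint32_t sub32(uint32_t reg, uint32_t imm) { return (uint32_t)(reg - imm); }

typedef struct { uint32_t n, pre, cur; } FrameAddresses;

/* Where the disassembly finds the parameter and the two locals. */
FrameAddresses frame_layout(uint32_t ebp)
{
    FrameAddresses f;
    f.n   = add32(ebp, 8);   /* [ebp+8] */
    f.pre = sub32(ebp, 4);   /* [ebp-4] */
    f.cur = sub32(ebp, 8);   /* [ebp-8] */
    return f;
}

void check_frame(void)
{
    uint32_t ebp = 0x0018FF40u;                /* hypothetical frame pointer */
    uint32_t esp = add32(ebp, 0xFFFFFFF8u);    /* add esp, 0FFFFFFF8h        */

    assert(esp == sub32(ebp, 8));              /* same as sub esp, 08h       */
    assert(frame_layout(ebp).cur == esp);      /* cur sits at the new ESP    */
    assert(frame_layout(ebp).n - frame_layout(ebp).pre == 12);
}
```

Since 0FFFFFFF8h is the two's complement of 8, adding it modulo 2^32 is the same as subtracting 8, which is why the assembler is free to emit either form.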
32. When LOCAL used in a macro
For a macro implementation, I copy almost the same code from FibonacciByLocalVariable. Since there is no USES for a macro, I manually use PUSH/POP for ECX and EDX. Also, without a stack frame, I have to create the global variables mPre and mCur on the data segment. The mFibonacciByMacro can be like this:
mFibonacciByMacro MACRO n
LOCAL mPre, mCur, mL
.data
mPre DWORD ?
mCur DWORD ?
.code
push ecx
push edx
mov ecx,n
mov eax,1
mov mPre,0
mov mCur,0
mL:
add eax, mPre
mov edx, mCur
mov mPre, edx
mov mCur, eax
loop mL
pop edx
pop ecx
ENDM
If you just want to call mFibonacciByMacro once, for example
mFibonacciByMacro 12
you don't need LOCAL here. Let's simply comment it out:
; LOCAL mPre, mCur, mL
mFibonacciByMacro accepts the argument 12 and replaces n with 12. This works fine, with the following listing generated by MASM:
mFibonacciByMacro 12
0000018C 1 .data
0000018C 00000000 1 mPre DWORD ?
00000190 00000000 1 mCur DWORD ?
00000000 1 .code
00000000 51 1 push ecx
00000001 52 1 push edx
00000002 B9 0000000C 1 mov ecx,12
00000007 B8 00000001 1 mov eax,1
0000000C C7 05 0000018C R 1 mov mPre,0
00000000
00000016 C7 05 00000190 R 1 mov mCur,0
00000000
00000020 1 mL:
00000020 03 05 0000018C R 1 add eax,mPre
00000026 8B 15 00000190 R 1 mov edx, mCur
0000002C 89 15 0000018C R 1 mov mPre, edx
00000032 A3 00000190 R 1 mov mCur, eax
00000037 E2 E7 1 loop mL
00000039 5A 1 pop edx
0000003A 59 1 pop ecx
Nothing changed from the original code except the substitution of 12 for n. The variables mPre and mCur are visible explicitly. Now let's call it twice, like
mFibonacciByMacro 12
mFibonacciByMacro 13
This is still fine for the first mFibonacciByMacro 12, but the second, mFibonacciByMacro 13, causes three redefinitions in preprocessing. Not only the data labels, i.e., the variables mPre and mCur, but also the code label mL is complained about. This is because in assembly code each label is actually a memory address, and the second definition of any of mPre, mCur, or mL should take another memory location rather than redefine an already created one:
mFibonacciByMacro 12
0000018C 1 .data
0000018C 00000000 1 mPre DWORD ?
00000190 00000000 1 mCur DWORD ?
00000000 1 .code
00000000 51 1 push ecx
00000001 52 1 push edx
00000002 B9 0000000C 1 mov ecx,12
00000007 B8 00000001 1 mov eax,1
0000000C C7 05 0000018C R 1 mov mPre,0
00000000
00000016 C7 05 00000190 R 1 mov mCur,0
00000000
00000020 1 mL:
00000020 03 05 0000018C R 1 add eax,mPre
00000026 8B 15 00000190 R 1 mov edx, mCur
0000002C 89 15 0000018C R 1 mov mPre, edx
00000032 A3 00000190 R 1 mov mCur, eax
00000037 E2 E7 1 loop mL
00000039 5A 1 pop edx
0000003A 59 1 pop ecx
mFibonacciByMacro 13
00000194 1 .data
1 mPre DWORD ?
FibTest.32.asm(83) : error A2005:symbol redefinition : mPre
mFibonacciByMacro(6): Macro Called From
FibTest.32.asm(83): Main Line Code
1 mCur DWORD ?
FibTest.32.asm(83) : error A2005:symbol redefinition : mCur
mFibonacciByMacro(7): Macro Called From
FibTest.32.asm(83): Main Line Code
0000003B 1 .code
0000003B 51 1 push ecx
0000003C 52 1 push edx
0000003D B9 0000000D 1 mov ecx,13
00000042 B8 00000001 1 mov eax,1
00000047 C7 05 0000018C R 1 mov mPre,0
00000000
00000051 C7 05 00000190 R 1 mov mCur,0
00000000
1 mL:
FibTest.32.asm(83) : error A2005:symbol redefinition : mL
mFibonacciByMacro(17): Macro Called From
FibTest.32.asm(83): Main Line Code
0000005B 03 05 0000018C R 1 add eax,mPre
00000061 8B 15 00000190 R 1 mov edx, mCur
00000067 89 15 0000018C R 1 mov mPre, edx
0000006D A3 00000190 R 1 mov mCur, eax
00000072 E2 AC 1 loop mL
00000074 5A 1 pop edx
00000075 59 1 pop ecx
To the rescue, let's turn this back on:
LOCAL mPre, mCur, mL
Again, running mFibonacciByMacro twice with 12 and 13 is fine this time, and we have:
mFibonacciByMacro 12
0000018C 1 .data
0000018C 00000000 1 ??0000 DWORD ?
00000190 00000000 1 ??0001 DWORD ?
00000000 1 .code
00000000 51 1 push ecx
00000001 52 1 push edx
00000002 B9 0000000C 1 mov ecx,12
00000007 B8 00000001 1 mov eax,1
0000000C C7 05 0000018C R 1 mov ??0000,0
00000000
00000016 C7 05 00000190 R 1 mov ??0001,0
00000000
00000020 1 ??0002:
00000020 03 05 0000018C R 1 add eax,??0000
00000026 8B 15 00000190 R 1 mov edx, ??0001
0000002C 89 15 0000018C R 1 mov ??0000, edx
00000032 A3 00000190 R 1 mov ??0001, eax
00000037 E2 E7 1 loop ??0002
00000039 5A 1 pop edx
0000003A 59 1 pop ecx
mFibonacciByMacro 13
00000194 1 .data
00000194 00000000 1 ??0003 DWORD ?
00000198 00000000 1 ??0004 DWORD ?
0000003B 1 .code
0000003B 51 1 push ecx
0000003C 52 1 push edx
0000003D B9 0000000D 1 mov ecx,13
00000042 B8 00000001 1 mov eax,1
00000047 C7 05 00000194 R 1 mov ??0003,0
00000000
00000051 C7 05 00000198 R 1 mov ??0004,0
00000000
0000005B 1 ??0005:
0000005B 03 05 00000194 R 1 add eax,??0003
00000061 8B 15 00000198 R 1 mov edx, ??0004
00000067 89 15 00000194 R 1 mov ??0003, edx
0000006D A3 00000198 R 1 mov ??0004, eax
00000072 E2 E7 1 loop ??0005
00000074 5A 1 pop edx
00000075 59 1 pop ecx
Now the label names mPre, mCur, and mL are not visible. Instead, for the first mFibonacciByMacro 12, the preprocessor generates three system labels ??0000, ??0001, and ??0002 for mPre, mCur, and mL. And for the second mFibonacciByMacro 13, we can find another three system-generated labels ??0003, ??0004, and ??0005 for mPre, mCur, and mL. In this way, MASM resolves the redefinition issue across multiple macro expansions. You must declare your labels with the LOCAL directive in a macro.
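The renaming seen in the listing can be imitated with a tiny C model: one counter shared by all expansions, and every LOCAL name in an expansion taking the next value. This only mimics the names visible in the listing above; it is not MASM's actual implementation, and next_label, fresh_label, Expansion, expand_macro, and shares_label are names assumed for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One counter shared by every macro expansion in the module. */
static unsigned next_label = 0;

/* Writes the next generated name, such as "??0000", into buf. */
void fresh_label(char *buf, size_t size)
{
    assert(buf != NULL && size >= 7);
    snprintf(buf, size, "??%04X", next_label++);
}

/* The three names declared by LOCAL mPre, mCur, mL in one expansion. */
typedef struct {
    char mPre[8];
    char mCur[8];
    char mL[8];
} Expansion;

Expansion expand_macro(void)
{
    Expansion e;
    fresh_label(e.mPre, sizeof e.mPre);
    fresh_label(e.mCur, sizeof e.mCur);
    fresh_label(e.mL,   sizeof e.mL);
    return e;
}

/* Returns nonzero if two expansions ended up with a common name. */
int shares_label(const Expansion *a, const Expansion *b)
{
    return strcmp(a->mPre, b->mPre) == 0
        || strcmp(a->mCur, b->mCur) == 0
        || strcmp(a->mL,   b->mL)   == 0;
}
```

Calling expand_macro twice from a fresh start gives ??0000 to ??0002 and then ??0003 to ??0005, so shares_label reports no clash, which is the point of declaring the labels with LOCAL.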
However, the name LOCAL sounds misleading, because the system-generated ??0000, ??0001, etc. are not limited to the macro's context. They are really global in scope. To verify, I purposely initialize mPre and mCur as 2 and 3:
LOCAL mPre, mCur, mL
.data
mPre DWORD 2
mCur DWORD 3
Then simply try to retrieve the values from ??0000 and ??0001, even before calling the two mFibonacciByMacro in code:
mov esi, ??0000
mov edi, ??0001
mFibonacciByMacro 12
mFibonacciByMacro 13
Probably to your surprise, when you set a breakpoint, you can enter &??0000 into the VS debug Address box just as for a normal variable. As we can see here, the memory at ??0000 is displayed with the DWORD values 2, 3, and so on. Such a ??0000 is allocated on the data segment together with other properly named variables, as shown by the string ASCII beside it:

To summarize, the LOCAL directive declared in a macro is there to prevent data/code labels from being globally redefined.
Further, as an interesting test question, think about the following repeated run of mFibonacciByMacro, which works fine without the need for a LOCAL directive in mFibonacciByMacro. Why?
mov ecx, 2
L1:
mFibonacciByMacro 12
loop L1
Summary
I have covered a number of miscellaneous features of assembly language programming. Most of them come from our class teaching and assignment discussions [1]. The basic practices are presented here with short code snippets for better understanding, without irrelevant details. The main purpose is to show ideas and methods that are specific to assembly language and stronger there than in other languages.
As you may have noticed, I haven't given complete test code, which would require a programming environment with input and output. For an easy try, you can go to [2] to download the Irvine32 library and set up your MASM programming environment with Visual Studio, although you have to learn a lot in advance to prepare yourself first. For example, the statement
mentioned here in
is not an element in assembly language, but is defined as
there.
Assembly language is notable for its one-to-one correspondence between an instruction and its machine code, as shown in several listings here. Via assembly code, you can get closer to the heart of the machine, such as registers and memory. Assembly language programming often plays an important role in both academic study and industry development. I hope this article can serve as a useful reference for students and professionals alike.