Assembly Tutorial – Writing to a File

We’re going to cover one more quick file-based example. To do this we will write a utility that reads from stdin and writes to a file. We will also try a couple of different file modes.

First let’s look at the code:

.section .data
filename:
.string "output.txt"

.section .bss
.lcomm buffer_data, 500

.section .text

.globl _start
_start:

movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

movq $2, %rax
movq $filename, %rdi
movq $0x41, %rsi
movq $0666, %rdx
syscall

movq %rax, %rdi
movq $1, %rax
movq $buffer_data, %rsi
movq $500, %rdx
syscall

movq $60, %rax
movq $0, %rdi
syscall

This should all be pretty familiar by now. First we define a string named “filename” with the value “output.txt” and we define a buffer named “buffer_data” of size 500 bytes.

The first few lines of instructions:

movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

read from stdin into our buffer. As usual, we set rax to 0 to indicate we are reading, rdi to 0 as that is the file descriptor of stdin, rsi to the address of the memory buffer we wish to read into and rdx to the maximum number of bytes we want to read. We then have another system call that opens, and if necessary creates, the file we wish to write to:

movq $2, %rax
movq $filename, %rdi
movq $0x41, %rsi
movq $0666, %rdx
syscall

We set rax to 2 as this is the Linux system call number for opening a file, and the address of the filename goes in rdi. We set rdx to 0666 to indicate that, if the file gets created, every user will have read and write (but not execute) permission on it; the kernel then removes any bits that are set in the umask. We set rsi to the flags we would like to use when opening this file. There are quite a few different flags you can use when opening files. They include

  • Create 0x40
  • Append 0x400
  • Truncate 0x200
  • Read Only 0x0
  • Write Only 0x1
  • Read and Write 0x2

You can combine these flags with bitwise or, |. Note, however, that read only, write only and read and write are mutually exclusive, so you pick exactly one of them. Also, notice that we prefixed the numeric value here with 0x to indicate it is a hexadecimal value. We didn’t bother using this prefix previously as we were using the read only flag ‘0’, which is the same in hex and decimal.

Let’s look at some examples. In our above program, we wanted to create a new file, and we only write to it, so we use the create and write only flags, 0x40 and 0x1 respectively. When we take the bitwise or of these flags we get

0x40 | 0x1 = 0x41

so we set rsi to 0x41.
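If the raw hex values are hard to remember, one option is to give the flags names with the assembler's .equ directive and let the assembler do the bitwise or for us. The following is only a sketch: the names O_WRONLY, O_CREAT, O_TRUNC and O_APPEND are borrowed from C and are not predefined by the assembler, and the message text is made up for this example.

```asm
# Flag values from the list above. The names are our own choice
# (borrowed from C); the assembler does not know them otherwise.
.equ O_WRONLY, 0x1
.equ O_CREAT,  0x40
.equ O_TRUNC,  0x200
.equ O_APPEND, 0x400

.section .data
filename:
.string "output.txt"
msg:
.string "named flags\n"

.section .text
.globl _start
_start:

# open(filename, O_WRONLY | O_CREAT, 0666)
# the assembler evaluates the expression to 0x41
movq $2, %rax
movq $filename, %rdi
movq $(O_WRONLY | O_CREAT), %rsi
movq $0666, %rdx
syscall

# write(fd, msg, 12)
movq %rax, %rdi
movq $1, %rax
movq $msg, %rsi
movq $12, %rdx
syscall

# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```

Adding O_TRUNC or O_APPEND to the expression in the open call should give the other combinations of flags without any hex arithmetic by hand.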

When we assemble, link and run this file, the program waits for us to enter text, which then gets written to a file named “output.txt”. Note that the write call always writes all 500 bytes of the buffer, so the text we typed is followed by zero bytes. If a file with that name already exists, it is not emptied first: its first 500 bytes are overwritten and anything after that is left in place.

Suppose that, when a file named “output.txt” already exists, we would rather replace its contents completely than overwrite only the beginning. We do this by setting rsi to 0x241 when we perform the open file system call. 0x241 is the bitwise or of the create flag 0x40, the write only flag 0x1 and the truncate flag 0x200.

Now, suppose that instead we wanted to append to the end of an existing file rather than overwriting it. To do this, when opening the file, we set the rsi register to 0x441, that is, the create flag 0x40, the write only flag 0x1 and the append flag 0x400. Now we can repeatedly run our binary and append text to the end of the file.
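One thing to be aware of in the program at the top of this post is that it always writes all 500 bytes of the buffer, even when less was typed. The read system call returns the number of bytes it actually read in rax, so a possible refinement is to keep that count and pass it to write. The sketch below assumes r12 as a place to hold the count, since rax is overwritten by the result of the open call, and it does not check for errors.

```asm
.section .data
filename:
.string "output.txt"

.section .bss
.lcomm buffer_data, 500

.section .text
.globl _start
_start:

# read(0, buffer_data, 500); rax = number of bytes read
movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

# keep the count, the next system call overwrites rax
movq %rax, %r12

# open(filename, write only | create | truncate, 0666)
movq $2, %rax
movq $filename, %rdi
movq $0x241, %rsi
movq $0666, %rdx
syscall

# write(fd, buffer_data, count)
movq %rax, %rdi
movq $1, %rax
movq $buffer_data, %rsi
movq %r12, %rdx
syscall

# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```

With this change the output file should contain only the text that was entered, without the trailing zero bytes.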

Assembly Tutorial – Debugging Assembly Code

Coding in assembly can be quite tricky. The syntax is unintuitive: when we want to open a file, we don’t call a simple one line function with a name like “open”; instead we have to set multiple registers and use a system call. It is also extremely unforgiving. We can easily set the wrong register, or set a register to the wrong value, without noticing, and then our code will just fall over with no helpful error message.

How do we diagnose these errors? Well thankfully we can debug assembly with the standard GNU debugger gdb. The process of debugging assembly in gdb is very similar to debugging C or C++.

Suppose we have an assembly program that is supposed to write to stdout but unfortunately does not:

 .section .data
msg:
.string "Hello world\n"

 .section .text

 .globl _start
_start:

 movq $1, %rax
 movq $10, %rdi
 movq $msg, %rsi
 movq $12, %rdx
 syscall


 movq $60, %rax
 movq $0, %rdi
 syscall

We have carefully combed through this code, but have not found the error yet, so we decide to debug it. To debug, first we must assemble the code with debug symbols. To do this we assemble with the extra command line option --gstabs+:

as --gstabs+ write.s -o write.o

We then link the file as we normally do:

ld write.o -o write

Now instead of running the binary file that has been created we pass it as an argument to gdb:

gdb write

This will load the write binary into gdb. After gdb prints some general information, you should have a prompt that looks like

Reading symbols from write...
(gdb)

To set breakpoints in gdb we use the command:

b <filename>:<linenumber>

In our case we will set a breakpoint on line 10, the first instruction after the _start label:

b write.s:10

We then tell gdb to start the execution of our program with the run command, ‘r’. If this runs successfully our command line will look like:

Breakpoint 1, _start () at write.s:10
10	 movq $1, %rax
(gdb) 

While debugging we can step to the next line of code executed with the ‘s’ command. Let’s say we step all the way to line 14 like so:

(gdb) s
11	 movq $10, %rdi
(gdb) s
12	 movq $msg, %rsi
(gdb) s
13	 movq $12, %rdx
(gdb) s
14	 syscall
(gdb) 

Let’s have a look at what’s in the registers. To do this we use the ‘info registers’ command:

(gdb) info registers
rax            0x1                 1
rbx            0x0                 0
rcx            0x0                 0
rdx            0xc                 12
rsi            0x402000            4202496
rdi            0xa                 10
rbp            0x0                 0x0
rsp            0x7fffffffd870      0x7fffffffd870
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x401014            0x401014 <_start+20>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) 

Each of the individual registers is listed along with the value it contains, first in hex and then in a more human readable form. If we just want to see the value in a single register we use ‘info registers <register name>’, for example,

info registers rdi

gives,

rdi            0xa                 10
(gdb) 

Now, at this point in the code we are attempting to write to stdout, so the rdi register should be set to 1. However, we can see it is set to ten. If we look back at the code we have stepped through, we see that on line 11 we set the rdi register to 10. Whoops!

There are a few other useful commands we have left out.

  • info breakpoints – prints a numbered list of breakpoints
  • delete <breakpoint number> – deletes a breakpoint
  • c – continues execution to the next breakpoint
  • r – can be called at any point to restart execution from the beginning

We’ll learn more about debugging in a later post, but this should be all you need to get started!

Assembly Tutorial – Reading From a File

In previous posts we saw how to read from stdin and write to stdout. Now, we know that stdin and stdout are just special files, so we should be able to read from and write to normal files without much difficulty.

To show this, we’re going to write a simple program that writes the contents of a text file to the terminal (a bit like the cat utility). The following code opens a file called “inputfile.txt”, reads the first 500 bytes and writes them to stdout.


.section .data
name:
.string "inputfile.txt"
 
.section .bss
.lcomm buffer_data, 500

.section .text

.globl _start
_start:
 
movq $2, %rax
movq $name, %rdi
movq $0, %rsi
movq $0666, %rdx
syscall

movq %rax, %rdi
movq $0, %rax
movq $buffer_data, %rsi
movq $500, %rdx
syscall

movq $1, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
movq $1, %rax
syscall

movq $60, %rax
movq $0, %rdi
syscall

Just like with our echo utility, we use a buffer defined in the .bss section to temporarily store the data we read. We also define the name of the file we will read in the data section.

The first thing we have to do is open the file we are interested in. We do this with the following code:

movq $2, %rax
movq $name, %rdi
movq $0, %rsi
movq $0666, %rdx
syscall

First we set rax to 2; this is the system call number for opening files. We set rdi to the address of the memory location containing the name of the file. When opening files we need to set the rsi register to indicate whether we are opening to read or write. In this case we are reading, so we set rsi to 0. We also set rdx to the permissions we would like this file to have. These are just the normal Linux file permissions; 0666 indicates that all users have read and write permission. The kernel only uses this value when the call creates a new file, so it has no effect here. Finally we invoke a system call and the Linux kernel should open this file. If the kernel manages to open the file successfully, the file descriptor will be in rax once control returns to our code.

Once the file is opened, we have to read it into our buffer. We do this with the following piece of code:

movq %rax, %rdi
movq $0, %rax
movq $buffer_data, %rsi
movq $500, %rdx
syscall

We are going to read from this file just like we read from stdin. The difference is that instead of putting the file descriptor for stdin (0) in rdi, now we put the file descriptor for the file we opened in there. Assuming everything went according to plan, the file descriptor we want will be in rax, so the first thing we do is move the value from rax to rdi. The rest of the code is the same: we read up to 500 bytes from the file with the file descriptor in rdi into our buffer.

Our final two pieces of code just output the contents of the buffer to stdout and exit with a success code.
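Because read returns the number of bytes it read in rax, and 0 once the end of the file is reached, the program could be extended to print files longer than 500 bytes. The following is a rough sketch rather than the code from this post. It uses a compare and conditional jump, which we have not covered yet; the labels read_loop and done, and the use of r12 to hold the file descriptor, are choices made for this example. Error handling is left out, apart from treating a negative result as the end.

```asm
.section .data
name:
.string "inputfile.txt"

.section .bss
.lcomm buffer_data, 500

.section .text
.globl _start
_start:

# open(name, read only); the permissions in rdx are
# ignored because no file is created
movq $2, %rax
movq $name, %rdi
movq $0, %rsi
movq $0, %rdx
syscall
movq %rax, %r12          # keep the file descriptor

read_loop:
# read(fd, buffer_data, 500)
movq $0, %rax
movq %r12, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

# rax <= 0 means end of file or an error, so stop
cmpq $0, %rax
jle done

# write(1, buffer_data, number of bytes just read)
movq %rax, %rdx
movq $1, %rax
movq $1, %rdi
movq $buffer_data, %rsi
syscall
jmp read_loop

done:
# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```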

As usual, if this code is in a file named display.s, we assemble and link it with:

as display.s -o display.o
ld display.o -o display 

This creates a binary file that will display the contents of a file named “inputfile.txt”. If the file does not exist, the open system call will return with the error code -2 in rax. Right now we don’t check for this, but once we know a little more about control flow we will.
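As a preview, a check along these lines could look like the sketch below. It relies on the fact that a failed system call leaves a negative error number in rax, and it uses two instructions we have not covered yet, testq and the conditional jump js. The label open_failed and the choice to exit with the positive error number are assumptions made for this example.

```asm
.section .data
name:
.string "inputfile.txt"

.section .text
.globl _start
_start:

# open(name, read only)
movq $2, %rax
movq $name, %rdi
movq $0, %rsi
movq $0, %rdx
syscall

# a negative result means the call failed
testq %rax, %rax
js open_failed

# success: exit(0)
movq $60, %rax
movq $0, %rdi
syscall

open_failed:
# turn e.g. -2 into 2 and use it as the exit code
negq %rax
movq %rax, %rdi
movq $60, %rax
syscall
```

If “inputfile.txt” does not exist, running this binary and then echo $? should print 2.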

Assembly Tutorial – A Simple Echo Program

We are going to write a simple program that reads from the terminal and writes the data it has read back out to the terminal. As we know from the previous post, reading and writing to the terminal is just a matter of reading and writing to two special files, stdin and stdout. These have file descriptors 0 and 1 respectively. We also need to set aside some space in memory to store the characters we read in from the terminal.

Let’s look at the code!

.section .data 
.section .bss
.lcomm buffer_data, 500

.section .text

.globl _start
_start:

movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

movq $1, %rax
movq $1, %rdi
syscall

movq $60, %rax
movq $0, %rdi
syscall

We have some new syntax here. The second line:

.section .bss

indicates to the assembler that this is the section in which we will define buffers. A buffer is a chunk of contiguous memory that we use to perform input and output operations. The next line:

.lcomm buffer_data, 500

declares a buffer named buffer_data that is 500 bytes long. Now we read from the terminal with the usual file I/O system call. We set rax to 0 to indicate we are reading from this file and we set rdi to 0 as that is the file descriptor of stdin. We set rsi to the address of our buffer with

movq $buffer_data, %rsi

and we set rdx to 500 to tell the kernel we would like to read 500 bytes. Then we invoke the system call to transfer control to the kernel.

The next three lines set up the system call to write the contents of the buffer to the terminal. The registers rsi and rdx will not have been altered by the kernel, so they still contain the address of the buffer and the number of bytes, 500. We set rax to 1 to indicate that we want to write and rdi to 1 to tell the kernel that it is stdout that we want to write to.

The final three lines are the usual exit with 0. Now if you write this code in a file named echo_input.s and run

as echo_input.s -o echo_input.o
ld echo_input.o -o echo_input

you will have a new binary file named echo_input. If you execute this binary, it will wait for you to enter some input and hit the return key. When you do this, it will print what you wrote out again and exit with exit code 0.

Now, in our code, we specified that our buffer is 500 bytes long, and that we should read and write 500 bytes. The read and write operations will handle data smaller than 500 bytes perfectly well. However, if you type a line longer than this, only the first 500 characters will be read by our echo utility. Once it exits, the shell will try, and probably fail, to execute whatever comes after that.
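The program above always asks write for 500 bytes, whatever was typed, so the unused rest of the buffer (zero bytes) is sent to the terminal as well. As a variation that is not part of the original program, we could use the value that read leaves in rax, the number of bytes actually read, as the length for write. A minimal sketch, assuming the read succeeds:

```asm
.section .bss
.lcomm buffer_data, 500

.section .text
.globl _start
_start:

# read(0, buffer_data, 500)
movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

# rax now holds the number of bytes read; copy it into rdx
# before rax is reused for the system call number
movq %rax, %rdx
movq $1, %rax
movq $1, %rdi
syscall

# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```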

Assembly Tutorial – Everything is a File

At some point you may have heard:

In linux, everything is a file.

-common observation of unknown provenance

What does this actually mean though? Well in this post, we are going to find out.

In our previous post when we wanted to write to the terminal we used the following lines of code:

movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $12, %rdx
syscall

Now, in Linux, when we write to the terminal, we are actually writing to a special file called stdout. Rather than saving this to disk, the kernel writes the contents of the file to the terminal. The same is true for reading from the terminal, reading and writing to sockets and many other I/O operations. We can perform all of these operations in the same way, because the kernel allows us to treat them all as if we are reading from or writing to a file.

So let’s describe how we read and write to files more generally. File I/O requires a system call and we need to set four registers to give the kernel the information it needs to perform the I/O for us.

We set the rax register to 0 if we want to read from a file and 1 if we want to write to a file. We have to tell the kernel what file we would like it to read from or write to. To do this we set the register rdi to the file’s file descriptor. File descriptors are the numeric identifiers associated with the open files that the kernel knows about. For example, stdout’s file descriptor is 1. We set rsi to be the address of the data we would like to write to the file, or the address of the memory we would like to read into. Finally, we set rdx to the size in bytes that we would like to read or write.
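As a small illustration of the file descriptor idea, the sketch below writes the same bytes to two different files just by changing rdi. It assumes file descriptor 2, which is stderr on Linux and is not otherwise discussed in this post; the message text is made up for the example.

```asm
.section .data
msg:
.string "same call, different file\n"

.section .text
.globl _start
_start:

# write(1, msg, 26): file descriptor 1 is stdout
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $26, %rdx
syscall

# write(2, msg, 26): file descriptor 2 is stderr
movq $1, %rax
movq $2, %rdi
movq $msg, %rsi
movq $26, %rdx
syscall

# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```

Run normally, both lines should appear on the terminal. If stdout is redirected to a file in the shell with >, one copy should end up in that file and the other should still show on the terminal.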

Assembly Tutorial – Hello World in Assembly

The first code you ever wrote was probably something like:

public class MyFirstClass {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

Assembly is a little bit more complicated than Java, so it is necessary to cover some basic concepts before you are ready to write to the terminal. The good news is that we are ready to write to the terminal!

Writing to the terminal works via a system call, just like when we exited with exit code 0. Let’s have a look at the code.

.section .data
msg:
.string "Hello world\n"
 
.section .text
.globl _start
_start:
  
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $12, %rdx
syscall

movq $60, %rax
movq $0, %rdi
syscall

We start off with our data section. Our data section is no longer empty. Now, we define a string with the value “Hello world\n”. We give this string the label msg.

In the text section we define our entry point as before, and we exit with code 0 as before. However we also have the five lines that output our message to the terminal:

movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $12, %rdx
syscall

Before transferring control to the Linux kernel we have to move four values into registers. First we have to put 1 in the rax register and 1 in the rdi register. Don’t worry, the significance of these two values will be explained later! We put the address of the data we would like to write in the rsi register. In this case, we can reference the address of the data we would like to write with the label msg. Finally, in the rdx register we put the number of bytes we would like to write. This is ASCII, so each character is one byte, and our string is 12 bytes long, including the newline character.

Put this code in a file named helloworld.s and execute
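Counting bytes by hand gets error prone for longer strings. One alternative, shown here as a sketch rather than what this post uses, is to let the assembler compute the length. The symbol name msg_len is our own. Note the switch from .string to .ascii: .string appends a terminating zero byte, which would otherwise be counted too.

```asm
.section .data
msg:
.ascii "Hello world\n"
# "." is the current address, so this is the number of bytes in msg
.equ msg_len, . - msg

.section .text
.globl _start
_start:

# write(1, msg, msg_len)
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $msg_len, %rdx
syscall

# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```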

as helloworld.s -o helloworld.o
ld helloworld.o -o helloworld

This will create a binary in the same directory named helloworld. When you run this binary you should see “Hello world” printed to the terminal!

Assembly Tutorial – How Does our Simple Program Work?

In the last post we wrote a simple assembly program; all it did was exit with status code zero. Our code was

.section .data 
.section .text 
.globl _start 
_start: 
movq $60, %rax
movq $0, %rdi 
syscall

How does this work? The first line can be ignored for now; it just denotes where we would store any user defined data, if we had any. The second line

.section .text

denotes where our actual code begins. The third and fourth lines

.globl _start
_start:

define a special label called _start. A label is a convenient human readable name for a particular memory address. Labels allow us, the programmers, to reference memory addresses without having to use actual numeric addresses. Labels are obviously less error prone, but they also mean that when we change the memory layout of our code we don’t have to recalculate a lot of addresses. The CPU does not understand labels; when the code is assembled and linked, all usages of labels are replaced with the actual addresses they refer to.

As we said, _start is a special label that defines the entry point of our program, like the main function in a C program. The line

.globl _start

just makes this label visible to the linker. If we had left this line out, our program would still assemble, link and run successfully, as the linker would warn that it cannot find _start and fall back to a default entry point. In general we don’t want to rely on this, as in more complex programs the default entry point might not be the entry point we want.

To understand this program there are two things we need to understand, the first of which is the system call. The kernel is the core part of the operating system. It handles all I/O, looks after memory at a low level, writes and reads files and plenty of other things. When we want to perform any of these tasks we transfer control to the kernel; this is called a system call. We perform a system call with the instruction

syscall

This will immediately transfer control over to the kernel. However, we also need to tell the kernel what we would like it to do for us. To do this we move certain special values into specific registers. (Remember, registers are small, very fast pieces of memory inside the CPU.) When the kernel takes over, it reads these registers to find out what we are asking of it.

In the above program we use two registers, rax and rdi. These are 64 bit general purpose registers. With the command

movq $60, %rax

we move the 64 bit value 60 into the register rax. With the command

movq $0, %rdi

we move the 64 bit value 0 into the register rdi. When we perform a system call the value in the rax register tells the kernel what operation we would like it to perform. In this case, the value 60 tells it that we would like to exit. When exiting, the value in the rdi register will be the exit code; in this case, we are exiting with code 0.
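To see that rdi really carries the exit code, you can change the value and look at it from the shell. A minimal sketch, where 42 is an arbitrary value chosen for illustration:

```asm
.section .text
.globl _start
_start:
movq $60, %rax     # 60 is the system call number for exit
movq $42, %rdi     # the exit code we want to see
syscall
```

After assembling, linking and running this, echo $? should print 42.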

Assembly Tutorial – Writing a Simple Assembly Program

There are two different syntaxes for x86 assembly language, AT&T and Intel. The difference is really only aesthetic; both assemble to the same machine code. For no particular reason I have been using AT&T syntax. All my code has been built and run on 64 bit Ubuntu Linux, with the GNU assembler, gas, and the GNU linker, ld. My code is available on github.

My first assembly program was a little something like this:

.section .data
.section .text
.globl _start
_start:
movq $60, %rax
movq $0, %rdi
syscall

Put this in a file named return.s and run

as return.s -o return.o

This calls the GNU assembler, which assembles the text you have written into object code in a file named return.o.

Then run

ld return.o -o return

This calls the GNU linker and links the object code, creating an executable file named return. Now if, in the same directory, you run

./return

your code (should) run successfully.

All this code does is ask the kernel to end our program with exit code 0, so when it runs successfully, you shouldn’t really see anything. You can inspect the exit code of the last program you ran with

echo $?

which, in this case should be 0.
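For comparison, here is how the same program might look in Intel syntax. This is only a sketch for the curious. It relies on the GNU assembler's .intel_syntax noprefix directive, and I will keep using AT&T syntax everywhere else.

```asm
.intel_syntax noprefix
.section .data
.section .text
.globl _start
_start:
mov rax, 60     # destination comes first, no $ or % prefixes
mov rdi, 0
syscall
```

It can be assembled and linked with the same as and ld commands as above and should behave identically.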