Assembly Tutorial – A Simple Echo Program

We are going to write a simple program that reads from the terminal and writes the data it has read back out to the terminal. As we know from the previous post, reading from and writing to the terminal is just a matter of reading from and writing to two special files, stdin and stdout. These have file descriptors 0 and 1 respectively. We also need to set aside some space in memory to store the characters we read in from the terminal.

Let’s look at the code!

.section .data
.section .bss
.lcomm buffer_data, 500

.section .text

.globl _start
_start:

movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

movq $1, %rax
movq $1, %rdi
syscall

movq $60, %rax
movq $0, %rdi
syscall

We have some new syntax here. The second line:

.section .bss

indicates to the assembler that this is the section in which we will define buffers. A buffer is a chunk of contiguous memory that we use to perform input and output operations. The next line:

.lcomm buffer_data, 500

declares a buffer named buffer_data that is 500 bytes long. Now we read from the terminal with the usual file I/O system call. We set rax to 0 to indicate that we are reading from a file, and we set rdi to 0 as that is the file descriptor of stdin. We set rsi to the address of our buffer with

movq $buffer_data, %rsi

and we set rdx to 500 to tell the kernel we would like to read up to 500 bytes. Then we invoke the system call to transfer control to the kernel.

The next three lines set up the system call to write the contents of the buffer to the terminal. The registers rsi and rdx will not have been altered by the kernel, so they still contain the address of the buffer and the number of bytes, 500. We set rax to 1 to indicate that we want to write and rdi to 1 to tell the kernel that it is stdout that we want to write to.

The final three lines are the usual exit with 0. Now if you write this code in a file named echo_input.s and run

as echo_input.s -o echo_input.o
ld echo_input.o -o echo_input

you will have a new binary file named echo_input. If you execute this binary, it will wait for you to enter some input and hit the return key. When you do this, it will print what you wrote out again and exit with exit code 0.

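Now, in our code, we specified that our buffer is 500 bytes long, and that we should read and write 500 bytes. The read handles input shorter than 500 bytes perfectly well: it returns as soon as you hit return. The write, however, always sends all 500 bytes. The part of the buffer that was never filled is zero, because memory in the .bss section starts out zeroed, and a terminal normally displays nothing for zero bytes. If you enter a line longer than 500 bytes, only the first 500 bytes are passed to our echo utility. The rest stays queued in the terminal, and once our program exits the shell will try, and probably fail, to execute it.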
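One way to avoid writing the unused part of the buffer is to use the value that read leaves in rax, which is the number of bytes it actually read. The following is only a sketch of that idea, not part of the original program. The file name echo_exact.s and the label done are made up for illustration, and it uses two instructions we have not covered yet: cmpq (compare) and jle (jump if less than or equal).

```asm
# echo_exact.s: hypothetical variant of echo_input.s that writes back
# only the bytes that read actually returned.
.section .bss
.lcomm buffer_data, 500

.section .text
.globl _start
_start:

# read(stdin, buffer_data, 500); the kernel returns the byte count in rax
movq $0, %rax
movq $0, %rdi
movq $buffer_data, %rsi
movq $500, %rdx
syscall

# Zero means end of input and a negative value means an error,
# so in both cases there is nothing to echo
cmpq $0, %rax
jle done

# write(stdout, buffer_data, bytes_read); rsi still holds the buffer address
movq %rax, %rdx
movq $1, %rax
movq $1, %rdi
syscall

done:
# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```

It assembles and links the same way as echo_input.s. The only real change is that rdx is loaded from rax before the write, instead of keeping the fixed 500.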

Assembly Tutorial – Everything is a File

At some point you may have heard:

In linux, everything is a file.

-common observation of unknown provenance

What does this actually mean though? Well in this post, we are going to find out.

In our previous post when we wanted to write to the terminal we used the following lines of code:

movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $12, %rdx
syscall

Now, in Linux, when we write to the terminal, we are actually writing to a special file called stdout. Rather than saving this to disk, the kernel writes the contents of the file to the terminal. The same is true for reading from the terminal, reading and writing to sockets and many other I/O operations. We can perform all of these operations in the same way, because the kernel allows us to treat them all as if we are reading from or writing to a file.

So let’s describe how we read and write to files more generally. File I/O requires a system call and we need to set four registers to give the kernel the information it needs to perform the I/O for us.

We set the rax register to 0 if we want to read from a file and 1 if we want to write to a file. We have to tell the kernel what file we would like it to read from or write to. To do this we set the rdi register to the file's file descriptor. File descriptors are the unique numeric identifiers associated with the files that the kernel knows about. For example, stdout's file descriptor is 1. We set rsi to be the address of the data we would like to write to the file, or the address of the memory we would like to read into. Finally, we set rdx to the size in bytes that we would like to read or write.
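To see the four registers at work on a different file, here is a small sketch that writes to stderr, which has file descriptor 2. The label err_msg, the message text and the exit code 1 are all made up for this example.

```asm
.section .data
err_msg:
.ascii "Something went wrong\n"

.section .text
.globl _start
_start:

# write(stderr, err_msg, 21)
movq $1, %rax           # rax = 1 means write
movq $2, %rdi           # rdi = 2 is the file descriptor of stderr
movq $err_msg, %rsi     # rsi = address of the bytes to write
movq $21, %rdx          # rdx = length of the message, newline included
syscall

# exit(1)
movq $60, %rax
movq $1, %rdi
syscall
```

Compared with writing to stdout, only the value in rdi changes. If you run the resulting binary with 2>/dev/null appended to the command, the shell redirects stderr and the message no longer appears.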

Assembly Tutorial – Hello World in Assembly

The first code you ever wrote was probably something like:

public class MyFirstClass {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

Assembly is a little bit more complicated than Java, so it is necessary to cover some basic concepts before you are ready to write to the terminal. The good news is that we are ready to write to the terminal!

Writing to the terminal works via a system call, just like when we exited with exit code 0. Let's have a look at the code.

.section .data
msg:
.string "Hello world\n"
 
.section .text
.globl _start
_start:
  
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $12, %rdx
syscall

movq $60, %rax
movq $0, %rdi
syscall

We start off with our data section. Our data section is no longer empty. Now, we define a string with the value “Hello world\n”. We give this string the label msg.

In the text section we define our entry point as before, and we exit with code 0 as before. However, we also have the five lines that output our message to the terminal:

movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $12, %rdx
syscall

Before transferring control to the Linux kernel we have to move four values into registers. First we have to put 1 in the rax register and 1 in the rdi register. Don't worry, the significance of these two values will be explained later! We put the address of the data we would like to write in the rsi register. In this case, we can reference the address of the data we would like to write with the label msg. Finally, in the rdx register we put the number of bytes we would like to write. This is ASCII, so each character is one byte, and our string is 12 bytes long, including the newline character.

Put this code in a file named helloworld.s and execute

as helloworld.s -o helloworld.o
ld helloworld.o -o helloworld

This will create a binary in the same directory named helloworld. When you run this binary you should see “Hello world” printed to the terminal!
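Counting the 12 bytes by hand is easy to get wrong when the message changes. As a sketch of an alternative, we can let the assembler work out the length for us. The symbol name msg_len is made up for this example. Note that it uses .ascii rather than .string, because .string appends a terminating zero byte that would otherwise be counted too.

```asm
.section .data
msg:
.ascii "Hello world\n"
# The dot means "the current address", so this is the number of
# bytes emitted since the msg label
msg_len = . - msg

.section .text
.globl _start
_start:

# write(stdout, msg, msg_len)
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $msg_len, %rdx
syscall

# exit(0)
movq $60, %rax
movq $0, %rdi
syscall
```

The assembler replaces $msg_len with the constant 12, so the program behaves like the hand-counted version.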

Assembly Tutorial – How Does our Simple Program Work?

In the last post we wrote a simple assembly program. All it did was exit with status code zero. Our code was

.section .data 
.section .text 
.globl _start 
_start: 
movq $60, %rax
movq $0, %rdi 
syscall

How does this work? The first line can be ignored for now. It just denotes where we would store any user defined data, if we had any. The second line

.section .text

denotes where our actual code begins. The third and fourth lines

.globl _start
_start:

define a special label called _start. A label is a convenient human readable name for a particular memory address. Labels allow us, the programmers, to reference memory addresses without having to use actual numeric addresses. Labels are obviously less error prone, but they also mean that when we change the memory layout of our code we don't have to recalculate a lot of addresses. The CPU does not understand labels. When the assembler and linker run, they replace all usages of labels with the actual addresses they refer to.

As we said, _start is a special label that defines the entry point of our program, like the main function in a C program. The line

.globl _start

just makes this label visible outside of the object file, so that the linker can find it. If we had left this line out, our program would still assemble, link and run successfully, as the linker would print a warning and fall back to a default entry point. In general we don't want to do this, as in more complex programs the default entry point might not be the entry point we want.

To understand this program there are two things we need to understand, the first of which is the system call. The kernel is the core part of the operating system. It handles all I/O, looks after memory at a low level, writes and reads files and plenty of other things. When we want to perform any of these tasks we transfer control to the kernel. This is called a system call. We perform a system call with the command

syscall

This will immediately transfer control over to the kernel. However, we also need to tell the kernel what we would like it to do for us. To do this we move certain special values into specific registers. (Remember, registers are small, very fast pieces of memory inside the CPU.) When the kernel takes over, it reads these registers to find out what we are asking of it.

In the above program we use two registers, rax and rdi. These are 64 bit general purpose registers. With the command

movq $60, %rax

we move the 64 bit value 60 into the register rax. With the command

movq $0, %rdi

we move the 64 bit value 0 into the register rdi. When we perform a system call, the value in the rax register tells the kernel what operation we would like it to perform. In this case, the value 60 tells it that we would like to exit. When exiting, the value in the rdi register will be the exit code. In this case, we are exiting with code 0.

Assembly Tutorial – Writing a Simple Assembly Program

There are two different syntaxes for assembly language, AT&T and Intel. The difference is really only aesthetic; both assemble to the same machine code. For no particular reason I have been using AT&T syntax. All my code has been built and run on 64 bit Ubuntu Linux, with the GNU assembler, gas, and the GNU linker, ld. My code is available on GitHub.

My first assembly program was a little something like this:

.section .data
.section .text
.globl _start
_start:
movq $60, %rax
movq $0, %rdi
syscall

Put this in a file named return.s and run

as return.s -o return.o

This calls the GNU assembler, which assembles the text you have written into object code in a file named return.o.

Then run

ld return.o -o return

This calls the GNU linker and links the object code, creating an executable file named return. Now if, in the same directory, you run

./return

your code should run successfully.

All this code does is ask the kernel to end the program with exit code 0, so when it runs successfully, you shouldn't really see anything. You can inspect the exit code of the last program you ran with

echo $?

which, in this case, should be 0.
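An exit code of 0 looks the same as nothing happening at all, so a quick way to convince yourself that the program really controls its exit code is to change the value. This variant is only an illustration. The value 42 is an arbitrary choice, and any file name will do.

```asm
.section .data
.section .text
.globl _start
_start:
movq $60, %rax      # ask the kernel for the exit operation
movq $42, %rdi      # arbitrary exit code chosen for this example
syscall
```

Assemble, link and run it exactly as before, and echo $? should now print 42 instead of 0.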