-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhdmcw-programs.qmd
185 lines (149 loc) · 10.7 KB
/
hdmcw-programs.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
# Episode 5: Programs {.unnumbered}
Before getting into the technical details of how the computer loads and executes program, I want to talk about programs more
generally. It's surprisingly hard to define what a computer _program_ really is. Usually you'll see something
about "a set of instructions performed by a computer to perform a task", but that leaves a lot of room for interpretation[^1]. If you try to get more precise,
you'll end up talking about Turing machines or (God forbid) lambda calculus. The only constraint I'm going to put on programs
for now is that i'll usually be talking about programs where the source code is translated directly into machine instructions
that the CPU can execute. This is
in contrast to programming languages that use another program to interpret instructions at runtime or that use a _virtual machine_
to translate a machine-agnostic code into machine-specific instructions. That's not because one type
of program is necessarily better than another, but my focus is on what the computer is doing.
There's sort of a meta-question here though: is the _program_ the stuff that we write into a file, or is it the stuff that the computer
executes? I'm going to meta-answer with the meta-response "Yes". For now, i'm going to focus more on how we make the thing that
the computer executes, and save the part about how the computer executes the thing for later.
In the first two episodes we wrote two assembly language programs, and I skirted around a bunch of details
including things like "what is assembly language"? If your nerd energy is strong enough to have made it
this far, you probably don't need an answer to that, but I'll just say that it's the lowest level at
which you can program except for entering numbers directly into the computer's memory. Each assembly language instruction maps to a machine-level _operation code_ or _opcode_[^2] and usually some operands that tell the _opcode_ what to operate on. As we saw in the last episode, each line of assembly maps to numbers that instruct the CPU what to do. Assembly language also has _labels_ that designate specific places in the
program, and various other directives that we'll discuss in a while.
What I'll start to answer now is: How does the assembly language get turned into numbers like what we saw in the computer's memory in Episode 2?
The simple answer is "the assembler". An assembler is yet another program that takes the assembly
language source code and turns it into machine code. Let's imagine that the program in the last episode
is written in a file called __adder.s__ (I don't have to imagine it because it is). The assembler program on
my Mac is called __as__ [^3], so i can assemble the source program with the command:
```
as -o adder.o adder.s
```
This produces the file __adder.o__ , which is called an _object file_, and on the Mac it's in a format called Mach-O, which i _think_ is pronounced "mock oh" and not "macho". The equivalent on Unix systems is _Executable and Linkable Format_ (ELF). Let's look at the contents of this file. We'll use __xxd__
since it's a binary file.
```console
foo@bar:~$ file adder.o
adder.o: Mach-O 64-bit object arm64
```
```console
foo@bar:~$ xxd adder.o
00000000: cffa edfe 0c00 0001 0000 0000 0100 0000 ................
00000010: 0400 0000 6801 0000 0000 0000 0000 0000 ....h...........
00000020: 1900 0000 e800 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 3800 0000 0000 0000 8801 0000 0000 0000 8...............
00000050: 3800 0000 0000 0000 0700 0000 0700 0000 8...............
00000060: 0200 0000 0000 0000 5f5f 7465 7874 0000 ........__text..
00000070: 0000 0000 0000 0000 5f5f 5445 5854 0000 ........__TEXT..
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000090: 2c00 0000 0000 0000 8801 0000 0400 0000 ,...............
000000a0: c001 0000 0600 0000 0004 0080 0000 0000 ................
000000b0: 0000 0000 0000 0000 5f5f 6461 7461 0000 ........__data..
000000c0: 0000 0000 0000 0000 5f5f 4441 5441 0000 ........__DATA..
000000d0: 0000 0000 0000 0000 2c00 0000 0000 0000 ........,.......
000000e0: 0c00 0000 0000 0000 b401 0000 0000 0000 ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000100: 0000 0000 0000 0000 3200 0000 1800 0000 ........2.......
00000110: 0100 0000 0000 0e00 0000 0000 0000 0000 ................
00000120: 0200 0000 1800 0000 f001 0000 0600 0000 ................
00000130: 5002 0000 2800 0000 0b00 0000 5000 0000 P...(.......P...
00000140: 0000 0000 0500 0000 0500 0000 0100 0000 ................
00000150: 0600 0000 0000 0000 0000 0000 0000 0000 ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000180: 0000 0000 0000 0000 0100 0090 2100 0091 ............!...
00000190: 2200 40b9 0300 0090 6300 0091 6400 40b9 "[email protected].@.
000001a0: 8200 028b 0500 0090 a500 0091 a200 00f9 ................
000001b0: 4000 20d4 1100 0000 1900 0000 0000 0000 @. .............
000001c0: 2000 0000 0300 004c 1c00 0000 0300 003d ......L.......=
000001d0: 1000 0000 0200 004c 0c00 0000 0200 003d .......L.......=
000001e0: 0400 0000 0100 004c 0000 0000 0100 003d .......L.......=
000001f0: 1d00 0000 0e01 0000 0000 0000 0000 0000 ................
00000200: 1800 0000 0e02 0000 2c00 0000 0000 0000 ........,.......
00000210: 0800 0000 0e02 0000 3000 0000 0000 0000 ........0.......
00000220: 0d00 0000 0e02 0000 3400 0000 0000 0000 ........4.......
00000230: 1200 0000 0e02 0000 2c00 0000 0000 0000 ........,.......
00000240: 0100 0000 0f01 0000 0000 0000 0000 0000 ................
00000250: 005f 7374 6172 7400 6d79 6132 006d 7973 ._start.mya2.mys
00000260: 3100 6c74 6d70 3100 6d79 6131 006c 746d 1.ltmp1.mya1.ltm
00000270: 7030 0000 0000 0000 p0......
```
Alrighty then. So if you squint you will see in there the hex values for our opcodes
and data, plus some extra goodies like __\_\_text__ and __\_\_data__, and also some text
that matches our variable names (and, importantly, our __\_start__ label). For the
most part this doesn't tell us much, except that our assembler has produced a binary
file that represents the code we wrote.
Fortunately, there's a utility on OSX that lets
you examine Mach-O files in various ways. For one, we can disassemble it:
```console
foo@bar:~$ objdump -d adder.o
adder.o: file format mach-o arm64
Disassembly of section __TEXT,__text:
0000000000000000 <ltmp0>:
0: 90000001 adrp x1, 0x0 <ltmp0>
4: 91000021 add x1, x1, #0
8: b9400022 ldr w2, [x1]
c: 90000003 adrp x3, 0x0 <ltmp0+0xc>
10: 91000063 add x3, x3, #0
14: b9400064 ldr w4, [x3]
18: 8b020082 add x2, x4, x2
1c: 90000005 adrp x5, 0x0 <ltmp0+0x1c>
20: 910000a5 add x5, x5, #0
24: f90000a2 str x2, [x5]
28: d4200040 brk #0x2
```
Nice. That's our program, and it even has hex for the instructions! (Note: the part of the object file with the program code is referred to as the "text section") So, were done, right! We can just run this and... wait no, sorry, this is not an _executable_ file. To make an actual runnable program we have to
use a second program, the _linker_, which naturally is called __ld__. The command for the linker is also fairly short in this case:
```console
foo@bar:~$ ld -o adder adder.o -e _start -arch arm64
```
First, it's using __-o adder__ to say that we want the executable program to be called __adder__. Then it takes the name of our object file as input. Now, normally the __ld__ command sort of gets mental.
It often takes numerous arguments for libraries, directories to look in, optimization levels, etc.
The option __-e \_start__ tells the linker where the executable starts, and here we need to go back to the source code. Recall that in both of our programs so far we've had this bit at the beginning:
```
.global _start // Provide program starting address to linker
.align 4
_start:
```
Using __objdump__ we can look at the _symbol table_ in our binary program file, and we see:
```console
foo@bar:~$ objdump --syms adder
adder: file format mach-o arm64
SYMBOL TABLE:
0000000100004000 l O __DATA,__data mya1
0000000100004004 l O __DATA,__data mya2
0000000100004008 l O __DATA,__data mys1
0000000100000000 g F __TEXT,__text __mh_execute_header
0000000100003f70 g F __TEXT,__text _start
```
Right there is the "_start" symbol, and hey, it has the address that we saw in the last episode corresponding to the first instruction.
That's an amazing coincidence.[^4]. Why does the linker need to know where the start of our program is? Doesn't it just start at, like, 0?
The linker's job is to combine various files into an executable program. In this case, we've only got one object file, so the other files that
are getting linked are system-level libraries. However, in most complex programs there will be many object files and so the linker
needs to know the entry point for the combined files. That's why the "_start" symbol is a _global_ symbol. If you were writing, say, C code
the global start symbol would be in the runtime library.
If we look at the final runnable program (__adder__), we'll see that it's way larger than the object file (__adder.o__) even
though the latter holds all of the code we wrote.
```console
foo@bar:~$ ls -l
total 96
-rw-r--r-- 1 mikemull staff 184 Jul 19 16:05 Makefile
-rwxr-xr-x 1 mikemull staff 33432 Aug 5 10:30 adder
-rw-r--r-- 1 mikemull staff 632 Aug 5 10:30 adder.o
-rw-r--r-- 1 mikemull staff 553 Aug 5 10:30 adder.s
```
Most of that is because the linker adds more of what are called "load commands" to the resulting binary, which are essentially instructions that
tell the operating system how to load the program.
So, we've more or less covered the "How does assembly get turned into numbers" part, skipping over (for now) details about how the actual files are created and read/written.
Next we need to start thinking about how the stuff in that file gets executed by the computer. This is where the operating system gets really hard to separate from the hardware. The OS has the responsibility for loading the numbers that comprise our program, but it's not a simple matter of taking the bits from a disk or SSD and putting them into physical memory.
Before we talk about how the instructions are executed, we're going to have to return to our discussion of memory and finally talk about _virtual memory_.
That will be the subject of the [next episode](hdmcw-virtual.html).
[^1]: Pun intended.
[^2]: Everyone says _opcode_. It just sounds cooler.
[^3]: __as__ just calls __clang__, which is part of the [LLVM](https://llvm.org/) toolset (as is __objdump__).
[^4]: No, it's not.