# Episode 9: Threads {.unnumbered}
Not gonna lie, threads are tricky. That's partly because there's an interplay between hardware, the operating system, and programming language constructs,
and partly because _time_ comes on the scene and you have to worry about the order in which things happen.
To illustrate some of these issues, I'm gonna jump right to a code example. It uses the POSIX threads (pthreads) API:
```C
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

void *copy1(void *);
void *copy2(void *);
void *copy_ints(int *, int);

const int NUMINTS = 10;

int main()
{
    pthread_t thread1, thread2;
    int iret1, iret2;
    srand(time(NULL));
    int *myints = (int *) malloc(NUMINTS * sizeof(int));
    iret1 = pthread_create(&thread1, NULL, copy1, (void *) myints);
    iret2 = pthread_create(&thread2, NULL, copy2, (void *) myints);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    for (int i = 0; i < 10; i++) {
        printf("[%d]", myints[i]);
    }
    printf("\n");
    exit(0);
}

void *copy_ints(int *x, int val)
{
    for (int i = 0; i < NUMINTS; i++) {
        int r = rand() % 100000;
        for (int j = 0; j < r; j++)   // Delay loop
            ;
        x[i] = val;
    }
    return NULL;
}

void *copy1(void *intarray) {
    return copy_ints((int *) intarray, 1);
}

void *copy2(void *intarray) {
    return copy_ints((int *) intarray, 2);
}
```
This starts two new threads from the main thread and passes them the array __myints__. The functions invoked in the two threads
copy a constant value into the array (1 for the first thread, 2 for the second), but there's a random delay before each
array element is set. If I run this 10 times, with a slight delay between runs so the random number generator gets reseeded with a new time value, I get output like this:
```
[2][2][2][2][1][2][1][2][1][2]
[2][2][2][2][1][2][2][2][1][2]
[2][2][2][2][2][1][1][1][1][1]
[1][2][2][1][1][1][1][1][1][1]
[2][1][1][2][2][2][2][2][2][2]
[2][1][1][2][2][2][2][2][2][2]
[2][1][1][2][2][2][2][2][2][2]
[2][2][2][2][1][1][2][2][2][2]
[2][2][2][2][2][2][2][2][2][2]
[1][2][2][1][1][1][1][1][1][1]
```
This tells us two things: the two threads are filling the same memory, and, because of the random delay, which thread writes each
value last is sorta random. Let's build the program with debug info and then run it under __lldb__:
```console
clang threads.c --debug
lldb ./a.out
(lldb) breakpoint set -l 35
Breakpoint 2: where = a.out`copy_ints + 20 at threads.c:35:14, address = 0x0000000100003ea0
(lldb) run
```
If we look at the program in the Activity Monitor, we'll see this:

As expected, there is one process for __a.out__ and three threads: one for the main program and two for the threads we created.
Now I'm going to change the NUMINTS constant to 1,000,000 and run it again so that it runs for a while. I'm going to use the
__Instruments__ application that comes with __Xcode__ to look at the CPU usage. What I see is this:

Since the program is CPU bound, it's not too surprising that it seems to be running exclusively on the performance cores.
What might be surprising is that the two threads we started seem to be running on all four performance cores.
What I think is happening here is the effect of scheduling. I'll talk about that more in the next episode, but for now keep in mind that there are many programs running on my computer, even if I didn't necessarily start them. There's the operating system kernel itself, plus various other system services that handle networking, system monitoring, and the graphical user interface. All of those things have threads of their own that need a turn on the CPUs, so the operating system will give our threads some period of time (I'm guessing 10ms, since that's the interval in the graphs), then look in a queue to see what else might need to run. It might be that our threads are the only things ready to run on the performance cores, so they end up getting run immediately, but _not necessarily on the same core_.
A thread is mostly an operating system construct: from the hardware's vantage point, it's just running instructions and fetching/storing from memory.
However, there are aspects of the hardware that need to be taken into account when managing threads. For one, each thread has _state_, which takes
up memory. Second, when the scheduler switches from one thread to another (aka _context switching_), the values of the CPU registers need to be saved
and the values for the next thread to run need to be loaded into the registers. Finally, creating threads takes some time. Very often in real
applications a number of threads are pre-allocated in what's called a _thread pool_ and then handed work as it arrives.
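Just to show the shape of the thread pool idea, here's a minimal sketch of a fixed-size pool. Everything in it (the __worker__ function, the NWORKERS and NTASKS numbers, the little fixed queue of task ids) is made up for illustration, and it leans on pthread condition variables (_pthread_cond_wait()_ and friends), which I won't cover further, along with the mutex calls that show up later in this episode. A real pool would queue up function pointers and arguments rather than plain integers.
```C
#include <stdio.h>
#include <pthread.h>

#define NWORKERS 4
#define NTASKS   16

// A tiny fixed-size queue of "task ids", protected by a mutex and a
// condition variable so workers can sleep while the queue is empty.
static int queue[NTASKS];
static int head = 0, tail = 0, done = 0;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

void *worker(void *arg)
{
    long id = (long) arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (head == tail && !done)        // nothing to do yet, so sleep
            pthread_cond_wait(&qcond, &qlock);
        if (head == tail && done) {          // no more work is coming
            pthread_mutex_unlock(&qlock);
            return NULL;
        }
        int task = queue[head++];            // take the next task
        pthread_mutex_unlock(&qlock);
        printf("worker %ld handling task %d\n", id, task);
    }
}

int main()
{
    pthread_t workers[NWORKERS];

    // Create the workers once, up front.
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *) i);

    // Hand out tasks as they "arrive".
    for (int t = 0; t < NTASKS; t++) {
        pthread_mutex_lock(&qlock);
        queue[tail++] = t;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    // Tell the workers no more work is coming, then wait for them.
    pthread_mutex_lock(&qlock);
    done = 1;
    pthread_cond_broadcast(&qcond);
    pthread_mutex_unlock(&qlock);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```
The point is just that the workers are created once, up front, and then reused: the cost of _pthread_create()_ is paid NWORKERS times instead of once per task.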
Most importantly, you have to remember that threads are potentially running simultaneously _and_ that (within a given process)
they are using the same memory. We saw the
effects of this in our silly program above, but that program has the benefit of not having any precisely specified outcome (i.e., as long as
the program doesn't crash, nobody cares what the integer array contains at the end). Consider another silly program:
```C
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

void *incr(void *);
void *decr(void *);
void *modify_ints(int *, int);

const int NUMINTS = 10;
const int REPEATS = 200;

int main()
{
    pthread_t thread1, thread2;
    int iret1, iret2;
    int *myints = (int *) malloc(NUMINTS * sizeof(int));
    for (int i = 0; i < NUMINTS; i++)
        myints[i] = 0;
    iret1 = pthread_create(&thread1, NULL, incr, (void *) myints);
    iret2 = pthread_create(&thread2, NULL, decr, (void *) myints);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    for (int i = 0; i < 10; i++) {
        printf("[%d]", myints[i]);
    }
    printf("\n");
    exit(0);
}

void *modify_ints(int *x, int val)
{
    for (int itimes = 0; itimes < REPEATS; itimes++) {
        for (int i = 0; i < NUMINTS; i++) {
            x[i] += val;
        }
    }
    return NULL;
}

void *incr(void *intarray) {
    return modify_ints((int *) intarray, 1);
}

void *decr(void *intarray) {
    return modify_ints((int *) intarray, -1);
}
```
There's one thread incrementing and another thread decrementing the count of each array element. Since both
happen the same number of times, you'd figure the outcome would be an array of zeros. But if we run this several times
we'll see a weird thing:
```
[0][0][0][0][0][0][0][0][0][0]
[0][0][0][0][0][0][0][0][0][0]
[0][0][0][0][0][0][0][0][0][0]
[-6][-6][-6][-7][10][10][10][2][2][9]
[8][11][11][11][-15][-15][-15][-15][-11][-2]
[26][15][-29][-29][36][-30][-30][-30][-30][-30]
[0][0][0][0][0][0][0][0][0][0]
[4][-7][-7][-8][16][16][16][16][-15][-22]
```
What the 🤬? That's pretty surprising. Computers! Amirite?
The line in the C code that looks like __x[i] += val;__ would be rendered in assembly (with --debug on) as:
```
0x100003f04 <+68>: ldr w10, [sp, #0x14]
0x100003f08 <+72>: ldr x8, [sp, #0x18]
0x100003f0c <+76>: ldrsw x9, [sp, #0xc]
0x100003f10 <+80>: add x9, x8, x9, lsl #2
0x100003f14 <+84>: ldr w8, [x9]
0x100003f18 <+88>: add w8, w8, w10
0x100003f1c <+92>: str w8, [x9]
```
As things stand, both threads can be executing this section of code at the same time. Notice that the operations here are (1) grab a value
from memory and put it into a register, (2) compute the new value in a register with an __add__, and (3) store the value from the register
back to memory. It's _possible_ for thread 2 to grab the same value that thread 1 grabbed,
_before_ thread 1 stores its incremented value. Thread 2 then decrements that stale value and stores it after thread 1 stores its result. So, instead of an
increment followed by a decrement, the increment is forgotten. For example, if x[i] is 5, thread 1 loads 5, thread 2 also loads 5, thread 1 stores 6, and then thread 2 stores 4: the increment is lost. This situation, where the answer can depend on precise
timing, is called a _race condition_.
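To see the same effect stripped down to a single variable, here's a minimal sketch of my own (it's not one of the episode's programs; the names __bump__ and __ROUNDS__ are just for illustration). Both threads do the same unprotected load/add/store on one shared counter, and if you build it the same way as above (no optimizer), the final value usually won't be zero:
```C
#include <stdio.h>
#include <pthread.h>

// One shared counter; both threads do the load/add/store sequence
// shown in the assembly above, with nothing to keep them apart.
static int counter = 0;
enum { ROUNDS = 1000000 };

void *bump(void *arg)
{
    int delta = *(int *) arg;
    for (int i = 0; i < ROUNDS; i++)
        counter += delta;                 // not atomic: load, add, store
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    int up = 1, down = -1;
    pthread_create(&t1, NULL, bump, &up);
    pthread_create(&t2, NULL, bump, &down);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);    // rarely 0 when the race hits
    return 0;
}
```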
The fix for this is usually to put guardrails around the bit of code that causes the problem, almost always something that's modifying
shared memory. The guarded bit of code is called a _critical section_. My fix here is to use another part of the __pthreads__ API: the _mutex_ API.
Here's the modified code:
```C
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

void *incr(void *);
void *decr(void *);
void *modify_ints(int *, int);

const int NUMINTS = 10;
const int REPEATS = 200;
pthread_mutex_t count_mutex;

int main()
{
    pthread_t thread1, thread2;
    int iret1, iret2;
    pthread_mutex_init(&count_mutex, NULL);
    int *myints = (int *) malloc(NUMINTS * sizeof(int));
    for (int i = 0; i < NUMINTS; i++)
        myints[i] = 0;
    iret1 = pthread_create(&thread1, NULL, incr, (void *) myints);
    iret2 = pthread_create(&thread2, NULL, decr, (void *) myints);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    for (int i = 0; i < 10; i++) {
        printf("[%d]", myints[i]);
    }
    printf("\n");
    exit(0);
}

void *modify_ints(int *x, int val)
{
    for (int itimes = 0; itimes < REPEATS; itimes++) {
        for (int i = 0; i < NUMINTS; i++) {
            pthread_mutex_lock(&count_mutex);
            x[i] += val;
            pthread_mutex_unlock(&count_mutex);
        }
    }
    return NULL;
}

void *incr(void *intarray) {
    return modify_ints((int *) intarray, 1);
}

void *decr(void *intarray) {
    return modify_ints((int *) intarray, -1);
}
```
The main difference is the calls to _pthread_mutex_lock()_ and _pthread_mutex_unlock()_ around the bit where we modify the shared array.
The first thread to call _pthread_mutex_lock()_ gets exclusive access to that part of the code. If the second thread arrives
at the lock call while the first holds the mutex, it will wait until the first thread does its thing and calls _pthread_mutex_unlock()_. This avoids the race condition
at the expense of a slight delay.
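For what it's worth, the same fix applied to the single-counter sketch from earlier looks like this (just the changed part; __PTHREAD_MUTEX_INITIALIZER__ is the static way to set up a mutex, equivalent to the _pthread_mutex_init()_ call in the program above):
```C
// The bump() function from the earlier counter sketch, with the
// read-modify-write now protected by a mutex.
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

void *bump(void *arg)
{
    int delta = *(int *) arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&counter_mutex);
        counter += delta;                 // only one thread at a time gets here
        pthread_mutex_unlock(&counter_mutex);
    }
    return NULL;
}
```
With the lock in place, every increment and decrement is applied in full, so the counter comes out to 0, at the cost of a lot of lock/unlock overhead.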
The term _mutex_ is short for _mutual exclusion_, which makes sense. The idea goes back to the 1960s, first introduced by the
computer scientist Edsger Dijkstra. The concept of using a mutex to ensure that operations are done in the right order
is called _thread synchronization_.
As mentioned before, when the second thread calls _pthread_mutex_lock()_, it will see that the mutex is locked and will wait until
it is unlocked. How does it know when to start running again? This, again, relies on the scheduler, the topic of Episode 8.