This re-design is based on the information contained in this issue which describes the current implementation and the underlying functionality of pthreads on GNU, MUSL and BIONIC implementations of libc.
Goals
Our implementation must find the following key components of the pthread implementation for each variant:
The offset within the pthread structure of the list head which is used to store the threads in a list.
The offset within the pthread structure of the start routine and start parameter (Note that in MUSL these are actually stored in the thread stack rather than the pthread itself).
The anchor of the list of threads:
GNU has two such lists
MUSL has no anchor (and so this may not be necessary)
BIONIC has just one list
The lock used to synchronise access to the thread list.
An address in the pthreads library which is called in the context of new threads before the user's thread start routine (Note that MUSL has two such entry points, one for normal threads and one for c11).
An address in the pthreads library which is called in the context of threads as they are disposed.
We shall consider the design for each variant in turn.
GNU
1. Finding the list head offset
In order to find the list head offset, we can take advantage of the fact that when you create a new thread it is inserted at the head of either the _dl_stack_user list (if you provide a stack) or the _dl_stack_used list (if you leave it for libc to provide one). Therefore if we create two threads, then the list heads of each of the threads should reference each other. We can ensure that these threads hang around by having their start routine block on an event (which we can later set in order for them to complete and be subsequently reaped).
We can then iterate through the thread structure of the second thread (remember it is inserted at the head of the list, before the first thread), reading pointers from aligned addresses (since we don't know how large the thread structure is, we should select an arbitrary maximum size at which we stop searching). We should use gum_memory_read, however, since we might otherwise read beyond the end of the thread structure and potentially cause a crash. If we find the address of the first pthread (plus the same offset we are reading from), then we have found the flink pointer of the list. Since we know where the flink pointer is and that the blink immediately follows it, we can then verify that the blink of the first thread points back at the second.
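A minimal sketch of that scan, assuming gum_memory_read from the Gum C API (and that Gum has been initialised), with MAX_PTHREAD_SIZE as an arbitrary search bound; thread_first and thread_second are placeholders for the two helper threads, the second being the one created last and therefore at the head of the list. On GNU libc a pthread_t is the address of the thread descriptor, which the casts below rely on.

```c
#include <gum/gum.h>
#include <pthread.h>
#include <string.h>

#define MAX_PTHREAD_SIZE 4096  /* arbitrary upper bound on struct pthread */

static gssize
find_list_head_offset (pthread_t thread_first, pthread_t thread_second)
{
  gsize offset;

  for (offset = 0; offset != MAX_PTHREAD_SIZE; offset += sizeof (gsize))
  {
    guint8 * data;
    gsize n_read, value;

    /* gum_memory_read copes with unmapped pages, so overshooting the end of
     * the descriptor cannot crash us. */
    data = gum_memory_read (GSIZE_TO_POINTER ((gsize) thread_second + offset),
        sizeof (gsize), &n_read);
    if (data == NULL || n_read != sizeof (gsize))
    {
      g_free (data);
      break;
    }

    memcpy (&value, data, sizeof (value));
    g_free (data);

    /* The flink should point at the first thread's list head, i.e. the same
     * offset within its descriptor; the blink, one pointer further on, can
     * then be checked in the opposite direction. */
    if (value == (gsize) thread_first + offset)
      return (gssize) offset;
  }

  return -1;
}
```

If the scan comes back empty, or the subsequent blink check fails, another thread has probably raced in between our two helpers, and we fall back to the retry described next.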
However, since we don't hold a lock (to prevent other threads from spawning new threads) there is a potential race condition (in that another thread could appear in-between our two threads in the list). We can reduce the likelihood of this in two ways:
Create our threads in the _dl_stack_user list (since it is less likely to be used by an application).
If we fail to find our neighbouring thread, then dispose of both threads and repeat the process.
2. Finding the offset of the start routine and parameter.
This step should be fairly trivial. As in the step above, we can create a new thread; in doing so, we control the value of the start routine and the start parameter, and we can simply search for these within the structure in the same way as we did for the list head.
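Continuing the sketch above, the search for the start routine and parameter is the same scan with a fixed needle; the helper below is kept generic so it can be reused later. my_start_routine and my_parameter stand for whatever routine and argument we passed to pthread_create for the helper thread.

```c
static gssize
find_value_offset (gsize base, gsize max_size, gsize needle)
{
  gsize offset;

  for (offset = 0; offset != max_size; offset += sizeof (gsize))
  {
    guint8 * data;
    gsize n_read, value;

    data = gum_memory_read (GSIZE_TO_POINTER (base + offset), sizeof (gsize),
        &n_read);
    if (data == NULL || n_read != sizeof (gsize))
    {
      g_free (data);
      break;
    }

    memcpy (&value, data, sizeof (value));
    g_free (data);

    if (value == needle)
      return (gssize) offset;
  }

  return -1;
}

/* e.g.
 *   start_routine_offset = find_value_offset ((gsize) worker, MAX_PTHREAD_SIZE,
 *       (gsize) my_start_routine);
 *   start_parameter_offset = find_value_offset ((gsize) worker, MAX_PTHREAD_SIZE,
 *       (gsize) my_parameter);
 */
```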
3. Finding the anchor of the list
We need to find the anchor of the list in order that we can iterate through all of the threads (note that there are two lists, so we cannot simply iterate from pthread_self ()). Also, as we will discuss next, the anchors are an important landmark for locating the locks for the list. We must be aware, as in the last stage, that we don't hold a lock on the list and as such it could change as we are walking it. We must therefore take great care to validate the nodes in the list, which also mitigates the risk of acting on a thread which has been removed while we walk it.
We can validate that a node is in one of the lists by dereferencing its flink and blink to locate its neighbours and confirming that their linked-list pointers in turn reference back to it. When a node is removed from the list, its flink and blink are updated to point back at themselves, so we must take care to also check for this as a false positive. Note also that after a pthread is disposed, the heap could reallocate the memory to the application for another purpose; as before, we should therefore use gum_memory_read to avoid any potential segfaults during this process.
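A hedged sketch of that validation, assuming flink_offset was discovered as above and that the blink immediately follows the flink (as in glibc's list_t, where they are the next and prev members). read_two_pointers is a small helper reused in later sketches.

```c
static gboolean
read_two_pointers (gsize address, gsize result[2])
{
  guint8 * data;
  gsize n_read;

  data = gum_memory_read (GSIZE_TO_POINTER (address), 2 * sizeof (gsize),
      &n_read);
  if (data == NULL || n_read != 2 * sizeof (gsize))
  {
    g_free (data);
    return FALSE;
  }

  memcpy (result, data, 2 * sizeof (gsize));
  g_free (data);

  return TRUE;
}

static gboolean
node_is_linked (gsize node, gsize flink_offset)
{
  gsize list_head = node + flink_offset;
  gsize links[2], next_links[2], prev_links[2];

  if (!read_two_pointers (list_head, links))
    return FALSE;

  /* A node that has been removed from the list points back at itself. */
  if (links[0] == list_head || links[1] == list_head)
    return FALSE;

  if (!read_two_pointers (links[0], next_links) ||
      !read_two_pointers (links[1], prev_links))
    return FALSE;

  /* Both neighbours must reference this node's list head in return. */
  return next_links[1] == list_head && prev_links[0] == list_head;
}
```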
We can further confirm that a node is a valid thread by calling pthread_getname_np. This can fail for one of two reasons:
The thread which has been passed has been deleted (and is therefore no longer valid).
The thread which has been passed is actually not a thread, but the list anchor.
pthread_getname_np actually just reads the thread ID from the pthread structure here and then queries /proc/self/task/%u/comm. So there is a remote chance of incurring a false positive here (though this could be detected by checking for duplicates in our list of threads). We could also validate that the start routine pointer in the thread points to a valid executable segment. These issues seem unlikely to occur, so they will not be addressed by the implementation for now.
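A minimal example of that extra check, assuming candidate holds the address of a possible thread descriptor (pthread_getname_np needs _GNU_SOURCE):

```c
#define _GNU_SOURCE
#include <pthread.h>

static gboolean
candidate_names_a_live_thread (gsize candidate)
{
  char name[16];

  /* On GNU libc this reads the TID out of the descriptor and queries
   * /proc/self/task/<tid>/comm, so a zero return means the kernel knows a
   * task by that TID. */
  return pthread_getname_np ((pthread_t) candidate, name, sizeof (name)) == 0;
}
```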
We can confirm we have found the complete list of nodes when we return to the starting point of our list (after checking the references of each node as we traverse them). We can then cycle through them to identify which of the nodes is the anchor rather than a valid thread. We must then repeat the process for our second thread list.
4. Finding the lock used to synchronise access to the thread list
We recall from our analysis that the storage of global variables in GNU libc changed in version 2.33. Fortunately, we can easily detect the version of libc we are dealing with using the API gnu_get_libc_version. In older versions of libc, the lock (stack_cache_lock) is stored in memory immediately preceding __stack_user (the list anchor of threads which have their own user-allocated stack). In later versions, the stack_cache_lock is at a small but predictable offset from the _dl_stack_user list head.
Note that in older versions of libc, the field ordering is subject to change by the compiler (the fields don't appear in source-code order even after taking into account the different sections into which they should be placed). Experimentation has shown, however, that the ordering is consistent for our crosstool-ng toolchains for armbe8, arm64be and arm64beilp32. Further work may be needed for these older libc versions should this layout not prove to be as reliable. It should be noted that the only valid values for this lock are 0, 1, or 2, which may help provide some simplistic validation. Further validation could be carried out by taking the lock and then (from another thread) attempting to spawn a new thread. If this succeeds, then it is clear that we have not correctly identified the lock.
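A small sketch of the version check; the exact post-2.33 offset of stack_cache_lock from _dl_stack_user is deliberately left out here, since it would come from the analysis referenced above.

```c
#include <glib.h>
#include <gnu/libc-version.h>
#include <stdio.h>

static gboolean
glibc_is_pre_2_33 (void)
{
  int major = 0, minor = 0;

  sscanf (gnu_get_libc_version (), "%d.%d", &major, &minor);

  return major < 2 || (major == 2 && minor < 33);
}

/* Whichever address we settle on can then be sanity-checked: the lock should
 * only ever hold 0 (unlocked), 1 (locked) or 2 (locked with waiters). */
```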
5. Finding an address in the pthread library which is called during thread construction.
In order to find an address in the pthreads library which is called for all threads prior to their start routine (and in the context of that thread), we can simply create a new thread with a start routine which makes use of __builtin_return_address to find its immediate caller. This should yield the required result.
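A minimal sketch of that probe. The captured value is strictly the return address inside libpthread's start path (start_thread on GNU), so, as noted in the discussion at the end of this issue, it may need rewinding to the call instruction itself.

```c
#include <pthread.h>

static void * pthread_start_impl = NULL;

static void *
probe_thread_start (void * user_data)
{
  /* Our immediate caller is the libpthread routine that invokes every
   * thread's start routine. */
  pthread_start_impl = __builtin_return_address (0);

  return NULL;
}

/* Usage:
 *   pthread_t probe;
 *   pthread_create (&probe, NULL, probe_thread_start, NULL);
 *   pthread_join (probe, NULL);
 *   ...pthread_start_impl now lies within libpthread's thread start path...
 */
```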
6. Finding an address in the pthread library which is called during thread destruction.
As in the existing design, we can again make use of the function __call_tls_dtors. Since this function can be readily found by symbol lookup, there is no need to change this aspect of the design.
MUSL
1. Finding the list head offset
This can be performed in the same way as for GNU.
2. Finding the offset of the start routine and parameter.
The start routine and start parameter of a thread in MUSL are stored in its stack. We can locate the stack by creating a pthread with its own stack and then simply searching the pthread for the pointer to it. This can be performed in the same way as we discussed already. Once we have found the stack, we can read the start routine and start parameter in the same way as the existing implementation.
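A hedged sketch for MUSL, reusing find_value_offset() from the GNU section. Since we supply the stack ourselves via pthread_attr_setstack, its address is a value we control and can search the descriptor for; the start routine can then be located within the stack region rather than the descriptor (the parameter is found the same way).

```c
static gboolean
locate_musl_start_fields (pthread_t worker, gsize stack_base, gsize stack_size,
                          gsize known_routine, gssize * stack_field_offset,
                          gssize * routine_offset_in_stack)
{
  /* Which descriptor field records our stack? MUSL may store the top of the
   * stack rather than its base, so check for both. */
  *stack_field_offset = find_value_offset ((gsize) worker, MAX_PTHREAD_SIZE,
      stack_base);
  if (*stack_field_offset < 0)
  {
    *stack_field_offset = find_value_offset ((gsize) worker, MAX_PTHREAD_SIZE,
        stack_base + stack_size);
  }
  if (*stack_field_offset < 0)
    return FALSE;

  /* The start routine lives somewhere on that stack, so scan the stack
   * region for the value we passed to pthread_create. */
  *routine_offset_in_stack = find_value_offset (stack_base, stack_size,
      known_routine);

  return *routine_offset_in_stack >= 0;
}
```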
3. Finding the anchor of the list
MUSL doesn't actually have an anchor for the list of threads, so we can just use pthread_self when iterating through the threads for gum_process_enumerate_threads. However, we do need to be able to identify which of the threads is the main thread (i.e. the one whose values of gettid() and getpid() match). We can call getpid() ourselves to get the PID; we just need a means to find the tid for a given pthread. We can find this by creating a thread and having it pass back the value of gettid() to the caller (before blocking). Then we can search the pthread for the matching tid as we have discussed before. Given that we may encounter false positives, we can create multiple threads to confirm our results.
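A minimal sketch of the TID-reporting helper; the raw syscall is used because older libcs don't declare gettid(). The reported value is then fed to the same value scan as before, and the main thread is whichever descriptor stores a TID equal to getpid().

```c
#include <pthread.h>
#include <semaphore.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct
{
  pid_t tid;
  sem_t ready;  /* posted once tid has been written */
  sem_t done;   /* posted by the prober once it has finished searching */
} TidProbe;

static void *
report_tid_and_block (void * user_data)
{
  TidProbe * probe = user_data;

  probe->tid = (pid_t) syscall (SYS_gettid);
  sem_post (&probe->ready);

  /* Stay alive (and in the thread list) until the probe is finished. */
  sem_wait (&probe->done);

  return NULL;
}
```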
4. Finding the lock used to synchronise access to the thread list
We can identify the lock used to synchronise list access by using an interceptor to hook the __clone function (which is exported from the library). The lock is passed as one of the arguments to this function. We simply need to create a new thread to trigger it.
5. Finding an address in the pthread library which is called during thread construction.
We can do this in the same way as for GNU. However, threads with c11 set in their pthread_attr_t use a different code path, so this step needs to be repeated, once creating a thread with the parameter and once without.
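A sketch of exercising the second code path: MUSL's thrd_create is the user-facing way to create a thread with the c11 attribute set, so repeating the __builtin_return_address probe through it should capture the C11 entry point.

```c
#include <threads.h>

static void * c11_start_impl = NULL;

static int
probe_c11_start (void * user_data)
{
  /* Our immediate caller is MUSL's C11 thread start path. */
  c11_start_impl = __builtin_return_address (0);

  return 0;
}

/* Usage:
 *   thrd_t probe;
 *   thrd_create (&probe, probe_c11_start, NULL);
 *   thrd_join (probe, NULL);
 */
```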
6. Finding an address in the pthread library which is called during thread destruction.
We can make use of the __pthread_exit function just like the existing implementation. Since it is exported, it can be readily found and there is no need to change this part of the implementation.
BIONIC
1. Finding the list head offset
This can be performed in the same way as for GNU.
2. Finding the offset of the start routine and parameter.
This can be performed in the same way as for GNU.
3. Finding the anchor of the list
The BIONIC thread list is doubly-linked but null-terminated rather than cyclic. BIONIC exports this list as the symbol _ZL13g_thread_list, but we could simply do without an anchor and iterate from pthread_self like MUSL for gum_process_enumerate_threads (noting that, unlike GNU and MUSL, the list is null-terminated, so we would need to walk it in both directions to the null terminators).
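A hedged sketch of that bidirectional walk, reusing read_two_pointers() from the GNU validation sketch and assuming the flink/blink offsets were discovered as for GNU, with the links referencing the list head embedded in each descriptor (if BIONIC's links point at the descriptor base instead, the offset in the conversion below is simply zero). visit() is a hypothetical per-thread callback.

```c
#include <pthread.h>

static void
walk_direction (gsize start_head, gsize flink_offset, gboolean forwards,
                void (* visit) (gsize thread))
{
  gsize head = start_head;

  while (head != 0)
  {
    gsize links[2];

    visit (head - flink_offset);

    if (!read_two_pointers (head, links))
      break;

    head = forwards ? links[0] : links[1];
  }
}

static void
enumerate_bionic_threads (gsize flink_offset, void (* visit) (gsize thread))
{
  gsize self_head = (gsize) pthread_self () + flink_offset;
  gsize links[2];

  /* Forwards from ourselves to the null terminator... */
  walk_direction (self_head, flink_offset, TRUE, visit);

  /* ...then backwards from our predecessor, so we are not visited twice. */
  if (read_two_pointers (self_head, links) && links[1] != 0)
    walk_direction (links[1], flink_offset, FALSE, visit);
}
```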
4. Finding the lock used to synchronise access to the thread list
Again, BIONIC makes life easy for us here: it exports the lock as the symbol _ZL18g_thread_list_lock.
5. Finding an address in the pthread library which is called during thread construction.
Here we can use the same technique as GNU (and MUSL). Or we can copy the existing implementation and make use of the external symbol _ZL15__pthread_startPv.
6. Finding an address in the pthread library which is called during thread destruction.
Again BIONIC simply exports the function pthread_exit.
The thread which has been passed has been deleted (and is therefore no longer valid).
This is technically a use-after-free: the pthread_t cannot be used once it has either been joined or detached (other than from the thread itself). And the GNU implementation dereferences it (for the non-self case) without checking whether it's in the list of threads.
5. Finding an address in the pthread library which is called during thread construction.
Here we also need to rewind by a few instructions.
Yeah. We need to assume that any pthread in the list may be invalid and contain arbitrary data.
Yeah, good point. For fixed-size instruction sets, we need only step back one instruction. ARM with Thumb-2 is variable-length, so it may need some logic to check for a 2-byte or 4-byte instruction. Intel is variable-length, so we'd need to work out the length of the different possible indirect call instructions and check for them.