
Re-design of gum-process-linux for better support on atypical architectures. #1011

Open
WorksButNotTested opened this issue Mar 27, 2025 · 2 comments

Comments

@WorksButNotTested
Contributor

WorksButNotTested commented Mar 27, 2025

This re-design is based on the information contained in this issue, which describes the current implementation and the underlying functionality of pthreads in the GNU, MUSL and BIONIC implementations of libc.

Goals

Our implementation must find the following key components of the pthread implementation for each variant:

  1. The offset within the pthread structure of the list head which is used to store the threads in a list.
  2. The offset within the pthread structure of the start routine and start parameter (note that in MUSL these are actually stored in the thread stack rather than the pthread itself).
  3. The anchor of the list of threads:
  • GNU has two such lists
  • MUSL has no anchor (and so this may not be necessary)
  • BIONIC has just one list
  4. The lock used to synchronise access to the thread list.
  5. An address in the pthreads library which is called in the context of new threads before the user's thread start routine (note that MUSL has two such entry points, one for normal threads and one for C11 threads).
  6. An address in the pthreads library which is called in the context of threads as they are disposed.

We shall consider the design for each variant in turn.

GNU

1. Finding the list head offset

In order to find the list head offset, we can take advantage of the fact that when you create a new thread it is inserted at the head of either the _dl_stack_user list (if you provide a stack) or the _dl_stack_used list (if you leave it to libc to provide one). Therefore, if we create two threads, the list heads of each of the threads should reference each other. We can ensure that these threads hang around by having their start routine block on an event (which we can later set in order for them to complete and be subsequently reaped).
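
A minimal sketch of that scaffolding, assuming a semaphore as the blocking primitive; the ProbeThread type and helper names are illustrative rather than anything in the existing code base:

```c
#include <pthread.h>
#include <semaphore.h>

/* A probe thread blocks on a semaphore so that its descriptor stays live
 * while we scan it; posting the semaphore lets it finish so it can be
 * joined and reaped. */
typedef struct
{
  pthread_t handle;
  sem_t exit_permitted;
} ProbeThread;

static void *
probe_thread_main (void * arg)
{
  ProbeThread * self = arg;

  sem_wait (&self->exit_permitted);

  return NULL;
}

static int
probe_thread_start (ProbeThread * self, const pthread_attr_t * attr)
{
  sem_init (&self->exit_permitted, 0, 0);

  /* Passing an attr with a user-allocated stack puts the thread on the
   * _dl_stack_user list; passing NULL leaves it on _dl_stack_used. */
  return pthread_create (&self->handle, attr, probe_thread_main, self);
}

static void
probe_thread_dispose (ProbeThread * self)
{
  sem_post (&self->exit_permitted);
  pthread_join (self->handle, NULL);
  sem_destroy (&self->exit_permitted);
}
```

Creating two such probes in succession gives us two descriptors whose list heads should reference each other, and the known start routine (probe_thread_main) and parameter (self) also provide the search targets for step 2.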

We can then iterate through the thread structure of the second thread (remember it is inserted at the head of the list, before the first thread), reading pointers from aligned addresses. Since we don't know how large the thread structure is, we should select an arbitrary maximum size at which we stop searching. We should use gum_memory_read, however, since we might otherwise read beyond the end of the thread structure and potentially cause a crash. If we find the address of the first pthread (plus the same offset we are reading from), then we have found the flink pointer of the list. Since we know where the flink pointer is and that the blink immediately follows it, we can then verify that the blink of the first thread points back at the second.

                                     +----------+           +----------+
                                     | THREAD 2 |           | THREAD 1 |
                                     +----------+           +----------+
              +----------+           //         //          //         //
              | ANCHOR   |           |          |           |          |
              +----------+           +----------+           +----------+
              | Flink    | ========> | Flink    | ========> | Flink    | ==== // ====>
              +----------+           +----------+           +----------+
<==== // ==== | Blink    | <======== | Blink    | <======== | Blink    |
              +----------+           +----------+           +----------+
                                     |          |           |          |
                                     //         //          //         //
                                     |          |           |          |
                                     +----------+           +----------+
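
A minimal sketch of the scan described above, assuming pthread_t can be treated as the address of the thread descriptor and using an arbitrary MAX_PTHREAD_SIZE cap; both are illustrative assumptions rather than part of the final design:

```c
#include <gum/gum.h>
#include <pthread.h>

#define MAX_PTHREAD_SIZE 8192  /* arbitrary cap on how far we scan */

/* Scan aligned slots of the newest thread's descriptor for a pointer to the
 * same offset within the previously created thread: that slot is the flink. */
static gssize
find_list_head_offset (pthread_t newest, pthread_t previous)
{
  gsize offset;

  for (offset = 0; offset < MAX_PTHREAD_SIZE; offset += sizeof (gpointer))
  {
    gpointer expected_flink, * slot;
    gsize n_read = 0;
    gboolean found;

    expected_flink = GSIZE_TO_POINTER ((gsize) previous + offset);

    /* gum_memory_read fails gracefully if we run off the end of the
     * descriptor's mapping, rather than crashing. */
    slot = (gpointer *) gum_memory_read (
        GSIZE_TO_POINTER ((gsize) newest + offset), sizeof (gpointer),
        &n_read);
    if (slot == NULL)
      break;

    found = n_read == sizeof (gpointer) && *slot == expected_flink;

    g_free (slot);

    if (found)
      return (gssize) offset;
  }

  return -1;
}
```

The blink check then amounts to reading the pointer at offset + sizeof (gpointer) in the first thread and confirming that it points back at the second thread's list head.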

However, since we don't hold a lock (to prevent other threads from spawning new threads) there is a potential race condition (in that another thread could appear in-between our two threads in the list). We can reduce the likelihood of this in two ways:

  1. Create our threads in the _dl_stack_user list (since it is less likely to be used by an application).
  2. If we fail to find our neighbouring thread, dispose of both threads and repeat the process.

2. Finding the offset of the start routine and parameter.

This step should be fairly trivial. As in the step above, we can create a new thread; in doing so, we control the value of the start routine and start parameter, and we can simply search for these within the structure in the same way as we did for the list head.

3. Finding the anchor of the list

We need to find the anchor of the list so that we can iterate through all of the threads (note that there are two lists, so we cannot simply iterate from pthread_self ()). Also, as we will discuss next, the anchors are an important landmark for locating the lock for the list. We must be aware, as in the last stage, that we don't hold a lock on the list and as such it could change as we are walking it. We must therefore take great care to validate the nodes in the list, and we can again mitigate the risk of a thread being removed mid-walk by restarting the traversal if validation fails.

We can validate that a node is in one of the lists by dereferencing its flink and blink to locate its neighbours and confirming that their linked list pointers in turn reference back to it. When a node is removed from the list, its flink and blink are updated to point back at themselves, so we must take care to also check for this as a false positive. After a pthread is disposed, the heap could reallocate the memory to the application for another purpose, so, as before, we should use gum_memory_read to avoid any potential segfaults during this process.

We can further confirm that a node is a valid thread by calling pthread_getname_np. This can fail for one of two reasons:

  1. The thread which has been passed has been deleted (and is therefore no longer valid).
  2. The thread which has been passed is actually not a thread, but the list anchor.

pthread_getname_np actually just reads the thread ID from the pthread structure and then queries /proc/self/task/%u/comm. So there is a remote chance of incurring a false positive here (though this could be detected by checking for duplicates in our list of threads). We could also validate that the start routine pointer in the thread points to a valid executable segment. These issues seem unlikely to occur, so they will not be addressed by the implementation for now.

We can confirm we have found the complete list of nodes when we arrive back at the starting point of our list (after checking the references of each node as we traverse them). We can then cycle through them to identify which of the nodes is the anchor rather than a valid thread. We must then repeat the process for our second thread list.
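
A sketch of the per-node validation, assuming the flink/blink offset (list_offset) found in step 1; the helper name is illustrative and the neighbour back-reference check is elided for brevity:

```c
#define _GNU_SOURCE

#include <gum/gum.h>
#include <pthread.h>

/* Returns TRUE if the candidate node still looks like a live thread: it has
 * not been unlinked (flink pointing back at itself) and pthread_getname_np
 * succeeds for it, which also filters out the list anchors. */
static gboolean
node_looks_like_live_thread (gpointer node, gsize list_offset)
{
  gpointer * links;
  gsize n_read = 0;
  gboolean unlinked;
  char name[16];

  /* Read the flink and blink via gum_memory_read in case the node's memory
   * has already been freed and unmapped. */
  links = (gpointer *) gum_memory_read (
      GSIZE_TO_POINTER ((gsize) node + list_offset), 2 * sizeof (gpointer),
      &n_read);
  if (links == NULL || n_read != 2 * sizeof (gpointer))
    return FALSE;

  unlinked = links[0] == GSIZE_TO_POINTER ((gsize) node + list_offset);

  g_free (links);

  if (unlinked)
    return FALSE;

  return pthread_getname_np ((pthread_t) node, name, sizeof (name)) == 0;
}
```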

4. Finding the lock used to synchronise access to thread list

We recall from our analysis that the storage of global variables in GNU libc changed in version 2.33. Fortunately, we can easily detect the version of libc we are dealing with using the API gnu_get_libc_version. In older versions of libc, the lock (stack_cache_lock) is stored immediately preceding __stack_user (the list anchor of threads which have their own user-allocated stack) in memory. In the case of later versions, the stack_cache_lock is a small but predictable offset from the _dl_stack_user list head.

Note that in older versions of libc, the field ordering is subject to change by the compiler (they don't appear in source code order even after taking into account the different sections into which the fields should be placed). Experimentation has shown, however, that the ordering is consistent for our crosstool-ng toolchains for armbe8, arm64be and arm64beilp32. Further work may be needed for these older libc versions should this layout not prove to be as reliable. It should be noted that valid values for this lock are either 0, 1, or 2 and therefore this may help provide some simplistic validation. Further validation could be carried out by taking the lock and then (from another thread) attempting to spawn a new thread. If this succeeds, then it is clear that we have not correctly identified the lock.
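
For example, the version check and the value-based sanity check might look like the following sketch (how the candidate lock address is derived from the anchor is deliberately left out):

```c
#include <gnu/libc-version.h>
#include <stdio.h>

/* glibc < 2.33 keeps stack_cache_lock immediately before __stack_user;
 * later versions keep it at a fixed distance from _dl_stack_user. */
static int
libc_is_pre_2_33 (void)
{
  unsigned int major = 0, minor = 0;

  sscanf (gnu_get_libc_version (), "%u.%u", &major, &minor);

  return major < 2 || (major == 2 && minor < 33);
}

/* A glibc low-level lock only ever holds 0 (unlocked), 1 (locked, no
 * waiters) or 2 (locked, waiters present), giving us a cheap plausibility
 * check on a candidate address. */
static int
candidate_lock_value_is_plausible (const int * candidate)
{
  return *candidate >= 0 && *candidate <= 2;
}
```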

5. Finding an address in the pthread library which is called during thread construction.

In order to find an address in the pthreads library which is called for all threads prior to their start routine (and in the context of that thread), we can simply create a new thread with a start routine which makes use of __builtin_return_address to find its immediate caller. This should yield the required result.
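
A minimal sketch of this probe; the names are illustrative, and note (as discussed below) that the captured address points just after the call, so it still needs to be rewound to the call instruction itself:

```c
#include <glib.h>
#include <pthread.h>
#include <semaphore.h>

static gpointer caller_inside_pthreads = NULL;
static sem_t caller_captured;

/* The probe's start routine records its immediate caller, which lies inside
 * the pthread start-up code. */
static void *
capture_caller (void * arg)
{
  caller_inside_pthreads = __builtin_return_address (0);

  sem_post (&caller_captured);

  return NULL;
}

static gpointer
find_thread_construction_address (void)
{
  pthread_t thread;

  sem_init (&caller_captured, 0, 0);

  pthread_create (&thread, NULL, capture_caller, NULL);
  sem_wait (&caller_captured);
  pthread_join (thread, NULL);

  sem_destroy (&caller_captured);

  return caller_inside_pthreads;
}
```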

6. Finding an address in the pthread library which is called during thread destruction.

As in the existing design, we can again make use of the function __call_tls_dtors. Since this function can be readily found by symbol lookup, there is no need to change this aspect of the design.

MUSL

1. Finding the list head offset

This can be performed in the same way as for GNU.

2. Finding the offset of the start routine and parameter.

The start routine and start parameter of a thread in MUSL are stored in its stack. We can locate the stack by creating a pthread with its own stack and then simply searching the pthread structure for the pointer to it. This can be performed in the same way as we have already discussed. When we have found the stack, we can read the start routine and start parameter in the same way as the existing implementation.

3. Finding the anchor of the list

MUSL doesn't actually have an anchor for the list of threads, so we can just use pthread_self when iterating through the threads for gum_process_enumerate_threads. However, we do need to be able to identify which of the threads is the main thread (i.e. its values of gettid() and getpid() match). We can call getpid() ourselves to get the PID; we just need a means to find the tid for a given pthread. We can find this by creating a thread and having it pass back the value of gettid to the caller (before blocking). Then we can search the pthread for the matching tid as we have discussed before. Given that we may encounter false positives, we can create multiple threads to confirm our results.
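
A sketch of the TID probe; the names are illustrative, and syscall() is used since older MUSL versions have no gettid() wrapper:

```c
#include <pthread.h>
#include <semaphore.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The probe publishes its kernel TID and then blocks so its descriptor stays
 * live while we scan it for that value. */
typedef struct
{
  pid_t tid;
  sem_t tid_published;
  sem_t exit_permitted;
} TidProbe;

static void *
tid_probe_main (void * arg)
{
  TidProbe * self = arg;

  self->tid = (pid_t) syscall (SYS_gettid);
  sem_post (&self->tid_published);

  sem_wait (&self->exit_permitted);

  return NULL;
}

static pid_t
tid_probe_start (TidProbe * self, pthread_t * handle)
{
  sem_init (&self->tid_published, 0, 0);
  sem_init (&self->exit_permitted, 0, 0);

  pthread_create (handle, NULL, tid_probe_main, self);
  sem_wait (&self->tid_published);

  /* The caller now scans *handle's descriptor for self->tid, exactly as in
   * the list head search, before posting exit_permitted and joining. */
  return self->tid;
}
```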

4. Finding the lock used to synchronise access to thread list

We can identify the lock used to synchronise list access by using an interceptor to hook the __clone function (which is exported from the library). The lock is passed as one of the arguments to this function. We simply need to create a new thread to trigger it.

5. Finding an address in the pthread library which is called during thread construction.

We can do this in the same way as for GNU. However, threads with c11 set in their pthread_attr_t use a different code path, so this step needs to be repeated, once creating a thread with the parameter and once without.
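
A sketch of the C11 half of this, assuming that thrd_create from <threads.h> is the user-visible way to reach MUSL's C11 code path; the names are illustrative:

```c
#include <glib.h>
#include <semaphore.h>
#include <threads.h>

static gpointer c11_caller_inside_pthreads = NULL;

/* As before, the start routine records its immediate caller, which here lies
 * inside MUSL's C11 thread entry point. */
static int
capture_c11_caller (void * arg)
{
  sem_t * done = arg;

  c11_caller_inside_pthreads = __builtin_return_address (0);

  sem_post (done);

  return 0;
}

static gpointer
find_c11_thread_construction_address (void)
{
  thrd_t thread;
  sem_t done;

  sem_init (&done, 0, 0);

  thrd_create (&thread, capture_c11_caller, &done);
  sem_wait (&done);
  thrd_join (thread, NULL);

  sem_destroy (&done);

  return c11_caller_inside_pthreads;
}
```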

6. Finding an address in the pthread library which is called during thread destruction.

We can make use of the __pthread_exit function just like the existing implementation. Since it is exported, it can be readily found and there is no need to change this part of the implementation.

BIONIC

1. Finding the list head offset

This can be performed in the same way as for GNU.

2. Finding the offset of the start routine and parameter.

This can be performed in the same way as for GNU.

3. Finding the anchor of the list

Unlike GNU and MUSL, BIONIC's thread list is null terminated rather than cyclic. BIONIC exports this list as the symbol _ZL13g_thread_list. But we could simply do without it and iterate from pthread_self like MUSL for gum_process_enumerate_threads (noting that, since the list is doubly-linked but null terminated, we would need to walk it in both directions to the null terminators).
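
A sketch of that two-direction walk, assuming pthread_t is the descriptor address and that the links point at the start of the neighbouring descriptor (as they do in BIONIC); the helper and callback names are illustrative:

```c
#include <gum/gum.h>
#include <pthread.h>

/* Safely reads either the flink (which == 0) or the blink (which == 1) of a
 * node, returning NULL on failure; list_offset is the offset found in step 1. */
static gpointer
read_link (gconstpointer node, gsize list_offset, guint which)
{
  gpointer result = NULL;
  gpointer * slot;
  gsize n_read = 0;

  slot = (gpointer *) gum_memory_read (
      GSIZE_TO_POINTER (
          (gsize) node + list_offset + (which * sizeof (gpointer))),
      sizeof (gpointer), &n_read);
  if (slot != NULL && n_read == sizeof (gpointer))
    result = *slot;

  g_free (slot);

  return result;
}

/* Enumerate BIONIC threads without an anchor: starting from pthread_self,
 * walk the doubly-linked list to the NULL terminator in each direction. */
static void
enumerate_bionic_threads (gsize list_offset, void (* visit) (gpointer thread))
{
  gpointer self = GSIZE_TO_POINTER ((gsize) pthread_self ());
  gpointer node;

  for (node = self; node != NULL; node = read_link (node, list_offset, 0))
    visit (node);

  for (node = read_link (self, list_offset, 1); node != NULL;
      node = read_link (node, list_offset, 1))
    visit (node);
}
```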

4. Finding the lock used to synchronise access to thread list

Again BIONIC makes life easy for us here: it exports the lock as the symbol _ZL18g_thread_list_lock.

5. Finding an address in the pthread library which is called during thread construction.

Here we can use the same technique as GNU (and MUSL). Or we can copy the existing implementation and make use of the external symbol _ZL15__pthread_startPv.

6. Finding an address in the pthread library which is called during thread destruction.

Again BIONIC simply exports the function pthread_exit.

@oleavr
Member

oleavr commented Mar 28, 2025

This sounds like a great plan! A couple of notes:

GNU

3. Finding the anchor of the list

The thread which has been passed has been deleted (and is therefore no longer valid).

This is technically a use-after-free: the pthread_t can't be used once it's either been joined or detached (other than from the thread itself). And the GNU implementation dereferences it (for the non-self case) without checking if it's in the list of threads.

5. Finding an address in the pthread library which is called during thread construction.

Here we also need to rewind by a few instructions.

@WorksButNotTested
Contributor Author

  1. Yeah. We need to assume that any pthread in the list may be invalid and contain arbitrary data.

  2. Yeah good point. For fixed size instruction sets, we need only step back one instruction (see the sketch below). ARM with thumb2 is variable length so may need some logic to check for a 2-byte or 4-byte instruction. Intel is variable length, so we'd need to work out the length of the different possible indirect call instructions and check for them.
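
For the fixed-width case, a sketch of that rewind (assuming a 4-byte call instruction, as on AArch64 and A32 ARM):

```c
#include <glib.h>

#define FIXED_CALL_INSN_SIZE 4  /* AArch64 BL/BLR and A32 BL/BLX are 4 bytes */

/* On fixed-width instruction sets the captured return address points one
 * instruction past the call, so the call site is simply one instruction
 * earlier. */
static gpointer
call_site_from_return_address (gpointer return_address)
{
  return (guint8 *) return_address - FIXED_CALL_INSN_SIZE;
}
```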
