extract_fa: Add parallel processing of partitions #5373
Conversation
Signed-off-by: Jan Bylicki <[email protected]>
count_func2 = 0;
count_func3 = 0;
if (config.verbose)
	log(" checking %s\n", log_signal(it.first));
Even read-only access to RTLIL data structures currently isn't thread-safe. There are a lot more places where worker-thread RTLIL access happens, but I picked this as the most obvious one. Until we are able to change that, only the main thread can access RTLIL. To make sure we can keep it that way, we also require that this is made obvious by not handing RTLIL references to code running on worker threads in the first place. See #5266 (comment) for a recent discussion of the requirements for adding multi-threaded code to Yosys, and the corresponding PR for an example of what is currently possible.
I also think that if we are adding multi-threading, we should prefer work queues that dynamically balance the workload over the static split this PR currently uses. The PR I linked to introduces some primitives for this.
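For comparison, here is a minimal sketch of the work-queue idea in plain C++. This is not the set of primitives from the linked PR; `PartitionJob` and `process_partition` are placeholders, and the jobs are assumed to be plain data prepared on the main thread so that workers never touch RTLIL.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical plain-data job, prepared on the main thread so that worker
// threads never see RTLIL objects at all.
struct PartitionJob {
	int input = 0;
	int result = 0;
};

// Stand-in for the real per-partition analysis.
static void process_partition(PartitionJob &job)
{
	job.result = job.input * 2;
}

// Dynamic balancing via a shared atomic index: each worker claims the next
// unclaimed job, so one slow job cannot stall a whole statically assigned chunk.
static void run_jobs(std::vector<PartitionJob> &jobs, int num_threads)
{
	std::atomic<size_t> next{0};
	std::vector<std::thread> workers;
	for (int t = 0; t < num_threads; t++)
		workers.emplace_back([&]() {
			for (size_t i = next.fetch_add(1); i < jobs.size(); i = next.fetch_add(1))
				process_partition(jobs[i]);
		});
	for (auto &w : workers)
		w.join();
}
```

Because each worker only pulls the next job once it finishes the previous one, a few long-running partitions no longer determine the runtime of an entire statically assigned chunk.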
Since I left the above comment, there was a discussion on providing thread-safe read-only RTLIL access. While the thread starts with a proposal for a thread-safe alternative API, the conclusion was that we will try to make all const RTLIL methods thread-safe eventually. I'm not sure how long it will take to get us there, but compared to what I've been asking for above, that should significantly lower the barrier for adding parallel processing to passes.
pool<tuple<tuple<SigBit, SigBit, SigBit>,int, SigBit>> tl_func_3;
};

std::mutex consteval_mtx;
The declaration of this mutex is far removed from the data it actually protects. If raw mutexes are used at all, they should be declared right next to the data they guard.
When using a mutex like this, there is also nothing preventing, or even hinting at, a problem when new `ce` accesses are introduced that are not protected by the appropriate lock guard. This makes it far too easy to introduce bugs that can be very hard to debug. For that reason I'm inclined to require the use of higher-level primitives within passes.
We could, for example, add our own `Mutex<T>` that combines a `std::mutex` and a `T` value and only provides access via a `lock` method that hands out our own `MutexGuard<T>`, which combines a `std::lock_guard` and a `T &value`, ensuring that you can only access the shared value while you hold the lock. (Unless you explicitly store a reference elsewhere, but that is always a hazard and not specific to multi-threading, which makes it somewhat easier to spot.) This is more or less the same API that Rust provides, but nothing stops us from implementing the same approach in C++. The first example I found is folly's `Synchronized`, which also goes into a bit more detail motivating the use of this API over what `std::mutex` provides.
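For illustration, a rough sketch of what such a wrapper could look like, using the `Mutex`/`MutexGuard` names from above. This is not existing Yosys code, just a minimal C++17 outline.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// The guard couples lock ownership with access to the protected value:
// the value is only reachable while the std::lock_guard is alive.
template <typename T>
class MutexGuard {
	std::lock_guard<std::mutex> guard;
	T &value;
public:
	MutexGuard(std::mutex &m, T &v) : guard(m), value(v) {}
	T &operator*() { return value; }
	T *operator->() { return &value; }
};

// The mutex owns the data it protects; the only way to reach the data is
// through lock(), which returns a guard (relies on C++17 copy elision).
template <typename T>
class Mutex {
	std::mutex mutex;
	T value;
public:
	template <typename... Args>
	explicit Mutex(Args&&... args) : value(std::forward<Args>(args)...) {}
	MutexGuard<T> lock() { return MutexGuard<T>(mutex, value); }
};

// Usage: the shared value can only be reached while holding the lock.
Mutex<std::vector<int>> shared_results;

void record(int n)
{
	auto results = shared_results.lock(); // held for the scope of `results`
	results->push_back(n);
}
```

With a wrapper like this, any new access to the shared data has to go through `lock()`, so a missing lock becomes a compile-time error instead of an intermittent data race.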
This PR implements parallelized partition finding in the `extract_fa` step. We have tested the speedup on proprietary designs that we can't share, and found substantial improvements. Partition finding only needs to read from the global state, so it requires little synchronization. We also tested parallelizing the later loops, but found no improvement there due to frequent writes.
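As a rough illustration of this structure only (not the actual patch; `Candidate`, `Partition`, and `find_partitions_in_range` are placeholder names), the candidate range is split statically across threads, each thread reads the shared input and appends to its own result vector, and the main thread merges the results after joining:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder types standing in for the pass's real data.
struct Candidate { int id; };
struct Partition { int root; };

// Placeholder per-candidate analysis; it only reads `candidates`.
static void find_partitions_in_range(const std::vector<Candidate> &candidates,
		size_t begin, size_t end, std::vector<Partition> &out)
{
	for (size_t i = begin; i < end; i++)
		out.push_back(Partition{candidates[i].id});
}

// Static split: contiguous slices per thread, thread-local result vectors,
// merge on the main thread once all workers have joined (num_threads >= 1).
static std::vector<Partition> find_partitions_parallel(
		const std::vector<Candidate> &candidates, int num_threads)
{
	std::vector<std::vector<Partition>> per_thread(num_threads);
	std::vector<std::thread> workers;
	size_t chunk = (candidates.size() + num_threads - 1) / num_threads;

	for (int t = 0; t < num_threads; t++) {
		size_t begin = std::min(candidates.size(), t * chunk);
		size_t end = std::min(candidates.size(), begin + chunk);
		workers.emplace_back([&candidates, &per_thread, t, begin, end]() {
			find_partitions_in_range(candidates, begin, end, per_thread[t]);
		});
	}
	for (auto &w : workers)
		w.join();

	std::vector<Partition> merged;
	for (auto &part : per_thread)
		merged.insert(merged.end(), part.begin(), part.end());
	return merged;
}
```

Keeping the writes in per-thread vectors is what keeps the synchronization minimal; the only coordination is the final join and merge.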
The run times of the `extract_fa` step compared to main (runs were performed on 8 cores):