extract_fa: Add parallel processing of partitions #5373
Conversation
Signed-off-by: Jan Bylicki <[email protected]>
count_func2 = 0;
count_func3 = 0;
if (config.verbose)
	log(" checking %s\n", log_signal(it.first));
Even read-only access to RTLIL data structures currently isn't thread-safe. There are a lot more places where worker-thread RTLIL access happens, but I picked this as the most obvious one. Until we are able to change that, only the main thread can access RTLIL. To make sure we can keep it that way, we also require that this is made obvious by not handing RTLIL references to code running on worker threads in the first place. See #5266 (comment) for a recent discussion of the requirements for adding multi-threaded code to Yosys, and the corresponding PR for an example of what is currently possible.
I also think that if we are adding multi-threading, we should prefer work queues that dynamically balance the workload over the static split this PR currently uses. The PR I linked to introduces some primitives for this.
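For comparison, here is a minimal sketch of the work-queue idea in plain C++. This is not the set of primitives from the linked PR; `PartitionJob` and `process_partition` are placeholders, and the jobs are assumed to be plain data prepared on the main thread so that workers never touch RTLIL.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical plain-data job, prepared on the main thread so that worker
// threads never see RTLIL objects at all.
struct PartitionJob {
	int input = 0;
	int result = 0;
};

// Stand-in for the real per-partition analysis.
static void process_partition(PartitionJob &job)
{
	job.result = job.input * 2;
}

// Dynamic balancing via a shared atomic index: each worker claims the next
// unclaimed job, so one slow job cannot stall a whole statically assigned chunk.
static void run_jobs(std::vector<PartitionJob> &jobs, int num_threads)
{
	std::atomic<size_t> next{0};
	std::vector<std::thread> workers;
	for (int t = 0; t < num_threads; t++)
		workers.emplace_back([&]() {
			for (size_t i = next.fetch_add(1); i < jobs.size(); i = next.fetch_add(1))
				process_partition(jobs[i]);
		});
	for (auto &w : workers)
		w.join();
}
```

Because each worker only pulls the next job once it finishes the previous one, a few long-running partitions no longer determine the runtime of an entire statically assigned chunk.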
Since I left the above comment, there was a discussion on providing thread-safe read-only RTLIL access. While the thread starts with a proposal for a thread-safe alternative API, the conclusion was that we will try to make all const RTLIL methods thread-safe eventually. I'm not sure how long it will take to get us there, but compared to what I've been asking for above, that should significantly lower the barrier for adding parallel processing to passes.
pool<tuple<tuple<SigBit, SigBit, SigBit>,int, SigBit>> tl_func_3;
};

std::mutex consteval_mtx;
The declaration of this mutex is far removed from the data it actually protects. If raw mutexes are used at all, they should be declared right next to the data they guard.
When using a mutex like this, there is also nothing preventing, or even hinting at, a problem when new `ce` accesses are introduced that are not protected by the appropriate lock guard. This makes it far too easy to introduce bugs that can be very hard to debug. For that reason I'm inclined to require the use of higher-level primitives within passes.
We could, for example, add our own `Mutex<T>` that combines a `std::mutex` and a `T` value and only provides access via a `lock` method that hands out our own `MutexGuard<T>`, which combines a `std::lock_guard` and a `T &value`, ensuring that you can only access the shared value while you hold the lock. (Unless you explicitly store a reference elsewhere, but that is always a hazard and not specific to multi-threading, which makes it somewhat easier to spot.) This is more or less the same API that Rust provides, but nothing stops us from implementing the same approach in C++. The first example I found is folly's `Synchronized`, which also goes into a bit more detail motivating the use of this API over what `std::mutex` provides.
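For illustration, a rough sketch of what such a wrapper could look like, using the `Mutex`/`MutexGuard` names from above. This is not existing Yosys code, just a minimal C++17 outline.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// The guard couples lock ownership with access to the protected value:
// the value is only reachable while the std::lock_guard is alive.
template <typename T>
class MutexGuard {
	std::lock_guard<std::mutex> guard;
	T &value;
public:
	MutexGuard(std::mutex &m, T &v) : guard(m), value(v) {}
	T &operator*() { return value; }
	T *operator->() { return &value; }
};

// The mutex owns the data it protects; the only way to reach the data is
// through lock(), which returns a guard (relies on C++17 copy elision).
template <typename T>
class Mutex {
	std::mutex mutex;
	T value;
public:
	template <typename... Args>
	explicit Mutex(Args&&... args) : value(std::forward<Args>(args)...) {}
	MutexGuard<T> lock() { return MutexGuard<T>(mutex, value); }
};

// Usage: the shared value can only be reached while holding the lock.
Mutex<std::vector<int>> shared_results;

void record(int n)
{
	auto results = shared_results.lock(); // held for the scope of `results`
	results->push_back(n);
}
```

With a wrapper like this, any new access to the shared data has to go through `lock()`, so a missing lock becomes a compile-time error instead of an intermittent data race.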
This PR implements parallelized partition finding in the `extract_fa` step. We have tested the speedup on proprietary designs that we can't share, and found substantial improvements. Partition finding only needs to read from the global state, so it requires little synchronization. We also tested parallelizing the later loops, but found no improvement there due to frequent writes.
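As a rough illustration of this structure only (not the actual patch; `Candidate`, `Partition`, and `find_partitions_in_range` are placeholder names), the candidate range is split statically across threads, each thread reads the shared input and appends to its own result vector, and the main thread merges the results after joining:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder types standing in for the pass's real data.
struct Candidate { int id; };
struct Partition { int root; };

// Placeholder per-candidate analysis; it only reads `candidates`.
static void find_partitions_in_range(const std::vector<Candidate> &candidates,
		size_t begin, size_t end, std::vector<Partition> &out)
{
	for (size_t i = begin; i < end; i++)
		out.push_back(Partition{candidates[i].id});
}

// Static split: contiguous slices per thread, thread-local result vectors,
// merge on the main thread once all workers have joined (num_threads >= 1).
static std::vector<Partition> find_partitions_parallel(
		const std::vector<Candidate> &candidates, int num_threads)
{
	std::vector<std::vector<Partition>> per_thread(num_threads);
	std::vector<std::thread> workers;
	size_t chunk = (candidates.size() + num_threads - 1) / num_threads;

	for (int t = 0; t < num_threads; t++) {
		size_t begin = std::min(candidates.size(), t * chunk);
		size_t end = std::min(candidates.size(), begin + chunk);
		workers.emplace_back([&candidates, &per_thread, t, begin, end]() {
			find_partitions_in_range(candidates, begin, end, per_thread[t]);
		});
	}
	for (auto &w : workers)
		w.join();

	std::vector<Partition> merged;
	for (auto &part : per_thread)
		merged.insert(merged.end(), part.begin(), part.end());
	return merged;
}
```

Keeping the writes in per-thread vectors is what keeps the synchronization minimal; the only coordination is the final join and merge.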
The run times of the `extract_fa` step compared to main (runs were performed on 8 cores):