rocPRIM 3.4.0 for ROCm 6.4.0

rocm-ci released this 11 Apr 13:35
d8771ec

Added

  • Added extended tests to rtest.py. These are additional tests that do not fit the criteria of smoke or regression tests, and they take much longer to run than either.
    • Use python rtest.py [--emulation|-e|--test|-t]=extended to run these tests.
  • Added regression tests to rtest.py. Regression tests are a subset of tests that caused hardware problems for past emulation environments.
    • Can be run with python rtest.py [--emulation|-e|--test|-t]=regression
  • Added the parallel find_first_of device function with autotuned configurations. This function is similar to std::find_first_of: it searches for the first occurrence of any of the provided elements.
  • Added the --emulation option to rtest.py.
    • Unit tests can be run with [--emulation|-e|--test|-t]=<test_name>
  • Added tuned configurations for segmented radix sort for gfx942 to improve performance on this architecture.
  • Added a parallel device-level function, rocprim::adjacent_find, similar to the C++ Standard Library std::adjacent_find algorithm.
  • Added configuration autotuning to device adjacent find (rocprim::adjacent_find) for improved performance on selected architectures.
  • Added rocprim::numeric_limits, an extension of std::numeric_limits that includes support for 128-bit integers.
  • Added rocprim::int128_t and rocprim::uint128_t which are the __int128_t and __uint128_t types.
  • Added the parallel search and find_end device functions, similar to std::search and std::find_end. These functions search for the first and last occurrence of a sequence, respectively.
  • Added a parallel device-level function, rocprim::search_n, similar to the C++ Standard Library std::search_n algorithm.
  • Added new constructors and a base function to rocprim::reverse_iterator, and added the constexpr specifier to all of its functions, to improve parity with the C++17 std::reverse_iterator.
  • Added hipGraph support to device run-length-encode for non-trivial runs (rocprim::run_length_encode_non_trivial_runs).
  • Added configuration autotuning to device run-length-encode for non-trivial runs (rocprim::run_length_encode_non_trivial_runs) for improved performance on selected architectures.
  • Added configuration autotuning to device run-length-encode for trivial runs (rocprim::run_length_encode) for improved performance on selected architectures.
  • Added a new type traits interface to enable users to provide additional type trait information to rocPRIM, facilitating better compatibility with custom types.
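
The search-style functions listed above are each described as similar to a C++ Standard Library algorithm. As a point of reference, the sketch below runs those standard algorithms on the host over one small input. It does not call rocPRIM and makes no claim about the device function signatures; the names search_results and reference_positions are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Host-side illustration only: these are the std:: algorithms that the notes
// name as counterparts, not rocPRIM device calls.
struct search_results {
    std::ptrdiff_t first_of;     // std::find_first_of
    std::ptrdiff_t adjacent;     // std::adjacent_find
    std::ptrdiff_t first_match;  // std::search
    std::ptrdiff_t last_match;   // std::find_end
    std::ptrdiff_t run_of_n;     // std::search_n
};

inline search_results reference_positions() {
    // index:                    0  1  2  3  4  5  6  7  8  9
    const std::vector<int> input{7, 1, 2, 5, 5, 1, 2, 9, 9, 9};
    const std::vector<int> keys{2, 9};     // any of these counts as a hit
    const std::vector<int> pattern{1, 2};  // must match as a whole sequence
    assert(!pattern.empty());

    // Convert an iterator into a zero-based index into `input`.
    const auto index = [&input](std::vector<int>::const_iterator it) {
        return std::distance(input.begin(), it);
    };

    return {
        // First element equal to any key.
        index(std::find_first_of(input.begin(), input.end(), keys.begin(), keys.end())),
        // First element that equals its successor.
        index(std::adjacent_find(input.begin(), input.end())),
        // First occurrence of the whole pattern.
        index(std::search(input.begin(), input.end(), pattern.begin(), pattern.end())),
        // Last occurrence of the whole pattern.
        index(std::find_end(input.begin(), input.end(), pattern.begin(), pattern.end())),
        // First run of three consecutive 9s.
        index(std::search_n(input.begin(), input.end(), 3, 9)),
    };
}
```

Calling reference_positions() yields the indices 2, 3, 1, 5, and 7 in member order. Comparing a device result against host values like these is one possible way to sanity-check a port.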

Changed

  • Changed the subset of tests that are run as smoke tests so that the smoke test completes faster and never exceeds 2 GB of VRAM usage. Use python rtest.py [--emulation|-e|--test|-t]=smoke to run these tests.

  • The rtest.py options have changed. rtest.py must now be run with either --test|-t or --emulation|-e, but not both.

  • Changed the internal algorithm of block radix sort to use rank match to improve performance of various radix sort related algorithms.

  • Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts.

  • Removed HIP-CPU support. HIP-CPU support was experimental and broken.

  • Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.

  • You can use CMake HIP language support with CMake 3.18 and later. To use HIP language support, run cmake with -DUSE_HIPCXX=ON instead of setting the CXX variable to the path to a HIP-aware compiler.

Resolved issues

  • Fixed an issue where rmake.py would generate incorrect CMake commands in Linux environments.
  • Fixed an issue where rocprim::partial_sort_copy would yield a compile error if the input iterator is const.
  • Fixed incorrect type traits for 128-bit signed and unsigned integers.
  • Fixed a compilation issue when rocprim::radix_key_codec<...> is specialized with a 128-bit integer.
  • Fixed the warp-level reduction rocprim::warp_reduce.reduce DPP implementation to avoid undefined intermediate values during the reduction.
  • Fixed an issue that caused a segmentation fault when hipStreamLegacy was passed to some API functions.

Upcoming changes

  • Using the initialisation constructor of rocprim::reverse_iterator will emit a deprecation warning. The constructor will be marked as explicit in the next major release.
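
To show what an explicit constructor changes for calling code, the sketch below uses std::reverse_iterator, whose constructor from an underlying iterator is already explicit. It assumes that rocprim::reverse_iterator will accept the same spelling after the change, which these notes do not confirm. The helper last_element is hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Illustration with std::reverse_iterator, not rocprim::reverse_iterator.
// Assumption: the rocPRIM type will behave the same way once its
// constructor is marked explicit.
inline int last_element(const std::vector<int>& values) {
    assert(!values.empty());  // dereferencing below needs at least one element
    using iter = std::vector<int>::const_iterator;

    // Direct initialization names the target type, so it compiles with an
    // explicit constructor.
    std::reverse_iterator<iter> rit{values.end()};

    // Copy initialization depends on an implicit conversion and is rejected
    // when the constructor is explicit:
    // std::reverse_iterator<iter> bad = values.end();

    // A reverse iterator dereferences the element just before its base.
    return *rit;
}
```

Code that already constructs the iterator by naming its type, as in the direct-initialization line, should not need changes under this assumption. Only call sites that rely on an implicit conversion from the underlying iterator would be affected.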