The release notes for the ROCm platform.
The ROCm v5.5 release consists of the following HIP enhancements:
In this release, the stack size limit is increased from 16k to 131056 bytes (or 128K - 16). Applications that need to change the stack size can use the hipDeviceSetLimit API.
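As a quick check of the arithmetic, 128K - 16 is 128 * 1024 - 16 = 131056 bytes. The sketch below validates a requested stack size against that ceiling before an application would pass it to hipDeviceSetLimit. The names `kMaxStackBytes` and `fits_stack_limit` are illustrative assumptions, not part of HIP, and the HIP call appears only in a comment so the snippet builds without the HIP runtime.

```cpp
#include <cassert>
#include <cstddef>

// Stack size ceiling stated in these notes: 128K minus 16 bytes.
constexpr std::size_t kMaxStackBytes = 128 * 1024 - 16;
static_assert(kMaxStackBytes == 131056, "matches the value quoted in the notes");

// Hypothetical helper: true if a requested stack size is within the limit.
constexpr bool fits_stack_limit(std::size_t requested_bytes) {
    return requested_bytes <= kMaxStackBytes;
}

// In a HIP program, a request that passes this check could then be applied
// with a call along these lines (requires the HIP runtime headers):
//   hipDeviceSetLimit(hipLimitStackSize, requested_bytes);
```

A request of 131056 bytes passes the check, while 131057 bytes does not.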
The following hipcc changes are implemented in this release:
- hipcc will not implicitly link to libpthread and librt, as they are no longer a link-time dependency for HIP programs. Applications that depend on these libraries must explicitly link to them.
- The -use-staticlib and -use-sharedlib options are deprecated.
- Separation of hipcc binaries (Perl scripts) from HIP to the hipcc project. Users will access a separate hipcc package for installing hipcc binaries in future ROCm releases.
- In a future ROCm release, the following samples will be removed from the hip-tests project.
  - hipBusBandwidth at https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/samples/1_Utils/shipBusBandwidth
  - hipCommander at https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/samples/1_Utils/hipCommander

Note that the samples will continue to be available in previous release branches.
Note
This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.
The new memory management HIP API is as follows:
- Sets information on the specified pointer [BETA].
  hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr);
The new module management HIP APIs are as follows:
- Launches kernel f with launch parameters and shared memory on stream with arguments passed to kernelParams, where thread blocks can cooperate and synchronize as they execute.
  hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams);
- Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute.
  hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags);
The new HIP Graph Management APIs are as follows:
- Creates a memory allocation node and adds it to a graph [BETA].
  hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams);
- Returns parameters for memory allocation node [BETA].
  hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams);
- Creates a memory free node and adds it to a graph [BETA].
  hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr);
- Returns parameters for memory free node [BETA].
  hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr);
- Writes a DOT file describing graph structure [BETA].
  hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags);
- Copies attributes from source node to destination node [BETA].
  hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst);
- Enables or disables the specified node in the given graphExec [BETA].
  hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled);
- Queries whether a node in the given graphExec is enabled [BETA].
  hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled);
This release consists of the following OpenMP enhancements:
- Additional support for OMPT functions get_device_time and get_record_type.
- Add support for min/max fast fp atomics on AMD GPUs.
- Fix the use of the abs function in C device regions.
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once these are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will then be updated to point to the respective compiled binaries as the default option.
ROCm packages have adopted the Linux Foundation Filesystem Hierarchy Standard (FHS) in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to the new filesystem hierarchy, ROCm ensures backward compatibility with the filesystem hierarchy of version 5.1 and older. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.
The following is the new filesystem hierarchy:
/opt/rocm-<ver>
    | --bin
        | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples
Note
ROCm will not support backward compatibility with the v5.1 (old) filesystem hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
ROCm has moved header files and libraries to their new locations as indicated in the above structure, and has included symbolic links and wrapper header files in the old locations for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip")
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
- #pragma message announcing deprecation -- ROCm v5.2 release
- #pragma message changed to #warning -- Future release
- #warning changed to #error -- Future release
- Backward compatibility wrappers removed -- Future release
Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder.
For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) contain a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
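The relative link targets in the listings above resolve into the new layout. The following sketch resolves a relative link target against the directory that holds the link, using only the paths from the two examples. The function name `resolve_link` is hypothetical, and the helper assumes an absolute base directory and '/'-separated paths; it does not touch the filesystem.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: resolve a relative symlink target against the
// absolute directory that contains the link, collapsing "." and "..".
std::string resolve_link(const std::string& link_dir, const std::string& target) {
    assert(!link_dir.empty() && link_dir.front() == '/');
    std::vector<std::string> parts;
    std::istringstream in(link_dir + "/" + target);
    std::string segment;
    while (std::getline(in, segment, '/')) {
        if (segment.empty() || segment == ".") continue;  // skip "//" and "."
        if (segment == "..") {
            if (!parts.empty()) parts.pop_back();          // step up one level
        } else {
            parts.push_back(segment);
        }
    }
    std::string resolved;
    for (const auto& p : parts) resolved += "/" + p;
    return resolved.empty() ? "/" : resolved;
}
```

With the listings above, the link in /opt/rocm/hip/lib resolves to /opt/rocm/lib/libamdhip64.so, and the link in /opt/rocm/hip/lib/cmake/hip resolves to /opt/rocm/lib/cmake/hip/hip-config.cmake.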
Support for Code Object v3 is deprecated and will be removed in a future release.
The following APIs and macros have been marked as deprecated. They are expected to be removed in a future ROCm release, coinciding with the release of Comgr v3.0.
amd_comgr_action_info_set_options()
amd_comgr_action_info_get_options()
AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES
AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN
For replacements, see the AMD_COMGR_ACTION_INFO_GET/SET_OPTION_LIST APIs, and the AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC macros.
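When migrating from the single-string options call to the option-list call, the options have to be passed as separate strings. This difference in call shape is an assumption based on the Comgr header, not something these notes state, so check the header for the exact signatures. The sketch below shows one way to split an existing option string; `split_options` and `as_c_array` are hypothetical helpers that do no quoting or escaping.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a whitespace-separated option string into
// individual options. Quoting and escaping are not handled.
std::vector<std::string> split_options(const std::string& options) {
    std::vector<std::string> list;
    std::istringstream in(options);
    std::string opt;
    while (in >> opt) list.push_back(opt);
    return list;
}

// Hypothetical helper: build the array-of-C-strings view that a list-style
// API would take. The pointers stay valid only while `list` is unchanged.
std::vector<const char*> as_c_array(const std::vector<std::string>& list) {
    std::vector<const char*> ptrs;
    ptrs.reserve(list.size());
    for (const auto& s : list) ptrs.push_back(s.c_str());
    return ptrs;
}
```

The resulting pointer array and its size could then be handed to the option-list setter in place of the original string.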
The following environment variables are removed in this ROCm release:
GPU_MAX_COMMAND_QUEUES
GPU_MAX_WORKGROUP_SIZE_2D_X
GPU_MAX_WORKGROUP_SIZE_2D_Y
GPU_MAX_WORKGROUP_SIZE_3D_X
GPU_MAX_WORKGROUP_SIZE_3D_Y
GPU_MAX_WORKGROUP_SIZE_3D_Z
GPU_BLIT_ENGINE_TYPE
GPU_USE_SYNC_OBJECTS
AMD_OCL_SC_LIB
AMD_OCL_ENABLE_MESSAGE_BOX
GPU_FORCE_64BIT_PTR
GPU_FORCE_OCL20_32BIT
GPU_RAW_TIMESTAMP
GPU_SELECT_COMPUTE_RINGS_ID
GPU_USE_SINGLE_SCRATCH
GPU_ENABLE_LARGE_ALLOCATION
HSA_LOCAL_MEMORY_ENABLE
HSA_ENABLE_COARSE_GRAIN_SVM
GPU_IFH_MODE
OCL_SYSMEM_REQUIREMENT
OCL_CODE_CACHE_ENABLE
OCL_CODE_CACHE_RESET
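Launch scripts or test harnesses that still set any of these variables may want to flag them. The sketch below copies the names from the list above and reports which ones are set in the current process environment. The function names are hypothetical, and how the runtime treats a removed variable that is still set is not covered by these notes.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Names copied from the list of environment variables removed in this release.
const std::vector<std::string>& removed_variables() {
    static const std::vector<std::string> names = {
        "GPU_MAX_COMMAND_QUEUES", "GPU_MAX_WORKGROUP_SIZE_2D_X",
        "GPU_MAX_WORKGROUP_SIZE_2D_Y", "GPU_MAX_WORKGROUP_SIZE_3D_X",
        "GPU_MAX_WORKGROUP_SIZE_3D_Y", "GPU_MAX_WORKGROUP_SIZE_3D_Z",
        "GPU_BLIT_ENGINE_TYPE", "GPU_USE_SYNC_OBJECTS", "AMD_OCL_SC_LIB",
        "AMD_OCL_ENABLE_MESSAGE_BOX", "GPU_FORCE_64BIT_PTR",
        "GPU_FORCE_OCL20_32BIT", "GPU_RAW_TIMESTAMP",
        "GPU_SELECT_COMPUTE_RINGS_ID", "GPU_USE_SINGLE_SCRATCH",
        "GPU_ENABLE_LARGE_ALLOCATION", "HSA_LOCAL_MEMORY_ENABLE",
        "HSA_ENABLE_COARSE_GRAIN_SVM", "GPU_IFH_MODE",
        "OCL_SYSMEM_REQUIREMENT", "OCL_CODE_CACHE_ENABLE",
        "OCL_CODE_CACHE_RESET"};
    return names;
}

// Hypothetical helper: return the removed variables that are still set in
// the current process environment, so the caller can print a warning.
std::vector<std::string> removed_variables_still_set() {
    std::vector<std::string> found;
    for (const auto& name : removed_variables()) {
        if (std::getenv(name.c_str()) != nullptr) found.push_back(name);
    }
    return found;
}
```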
The following are the known issues in this release.
When user applications repeatedly call ncclCommAbort to destroy communicators and then create new communicators, subsequent communicators may fail to initialize.
This issue is under investigation and will be resolved in a future release.
Multiple HIP directed tests fail.
Library | Version |
---|---|
hipBLAS | 0.53.0 ⇒ 0.54.0 |
hipCUB | 2.13.0 ⇒ 2.13.1 |
hipFFT | 1.0.10 ⇒ 1.0.11 |
hipSOLVER | 1.6.0 ⇒ 1.7.0 |
hipSPARSE | 2.3.3 ⇒ 2.3.5 |
rccl | 2.13.4 ⇒ 2.15.5 |
rocALUTION | 2.1.3 ⇒ 2.1.8 |
rocBLAS | 2.46.0 ⇒ 2.47.0 |
rocFFT | 1.0.21 ⇒ 1.0.22 |
rocPRIM | 2.12.0 ⇒ 2.13.0 |
rocRAND | 2.10.16 ⇒ 2.10.17 |
rocSOLVER | 3.20.0 ⇒ 3.21.0 |
rocSPARSE | 2.4.0 ⇒ 2.5.1 |
rocThrust | 2.17.0 |
rocWMMA | 0.9 ⇒ 1.0 |
Tensile | 4.35.0 ⇒ 4.36.0 |
hipBLAS 0.54.0 for ROCm 5.5.0
- added option to opt-in to use __half for hipblasHalf type in the API for c++ users who define HIPBLAS_USE_HIP_HALF
- added scripts to plot performance for multiple functions
- data driven hipblas-bench and hipblas-test execution via external yaml format data files
- client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml
- fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes
- fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0
- fixed getrs to return successfully when given nullptrs with corresponding size = 0
- fixed getrs to give info = -1 when transpose is not an expected type
- fixed gels to return successfully when given nullptrs with corresponding size = 0
- fixed gels to give info = -1 when transpose is not in ('N', 'T') for real cases or not in ('N', 'C') for complex cases
- changed reference code for Windows to OpenBLAS
- hipblas client executables all now begin with hipblas- prefix
hipCUB 2.13.1 for ROCm 5.5.0
- Benchmarks for BlockShuffle, BlockLoad, and BlockStore.
- CUB backend references CUB and Thrust version 1.17.2.
- Improved benchmark coverage of BlockScan by adding ExclusiveScan, benchmark coverage of BlockRadixSort by adding SortBlockedToStriped, and benchmark coverage of WarpScan by adding Broadcast.
- Windows HIP SDK support
- BlockRadixRankMatch is currently broken under the rocPRIM backend.
- BlockRadixRankMatch with a warp size that does not exactly divide the block size is broken under the CUB backend.
hipFFT 1.0.11 for ROCm 5.5.0
- Fixed old version rocm include/lib folders not removed on upgrade.
hipSOLVER 1.7.0 for ROCm 5.5.0
- Added functions
  - gesvdj
    - hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize
    - hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj
  - gesvdjBatched
    - hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize
    - hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched
hipSPARSE 2.3.5 for ROCm 5.5.0
- Fixed an issue, where the rocm folder was not removed on upgrade of meta packages
- Fixed a compilation issue with cusparse backend
- Added more detailed messages on unit test failures due to missing input data
- Improved documentation
- Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis)
RCCL 2.15.5 for ROCm 5.5.0
- Compatibility with NCCL 2.15.5
- Unit test executable renamed to rccl-UnitTests
- HW-topology aware binary tree implementation
- Experimental support for MSCCL
- New unit tests for hipGraph support
- NPKit integration
- rocm-smi ID conversion
- Support for HIP_VISIBLE_DEVICES for unit tests
- Support for p2p transfers to non (HIP) visible devices
- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
rocALUTION 2.1.8 for ROCm 5.5.0
- Added build support for Navi32
- Fixed a typo in MPI backend
- Fixed a bug with the backend when HIP support is disabled
- Fixed a bug in SAAMG hierarchy building on HIP backend
- Improved SAAMG hierarchy build performance on HIP backend
- LocalVector::GetIndexValues(ValueType*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector*) instead
- LocalVector::SetIndexValues(const ValueType*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead
- LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*) instead
- LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix*) instead
- LocalMatrix::RugeStueben() is deprecated
- LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, int) instead
- LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*) instead
rocBLAS 2.47.0 for ROCm 5.5.0
- added functionality rocblas_geam_ex for matrix-matrix minimum operations
- added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions
- added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API
- added support for vector initialization in the rocBLAS test framework with negative increments
- added windows build documentation for forthcoming support using ROCm HIP SDK
- added scripts to plot performance for multiple functions
- improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU.
- improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU.
- improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs.
- fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench
- fixed deprecated API compatibility with Visual Studio compiler
- fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory
- install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help)
- rocblas client executables all now begin with rocblas- prefix
- install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default
rocFFT 1.0.22 for ROCm 5.5.0
- Improved performance of 1D lengths < 2048 that use Bluestein's algorithm.
- Reduced time for generating code during plan creation.
- Optimized 3D R2C/C2R lengths 32, 84, 128.
- Optimized batched small 1D R2C/C2R cases.
- Added gfx1101 to default AMDGPU_TARGETS.
- Moved client programs to C++17.
- Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled.
- Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache.
- Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged.
- Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present.
rocPRIM 2.13.0 for ROCm 5.5.0
- New block level radix_rank primitive.
- New block level radix_rank_match primitive.
- Improved the performance of block_radix_sort and device_radix_sort.
- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
- Fixed benchmark build on Windows
rocRAND 2.10.17 for ROCm 5.5.0
- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
- New benchmark for the device API using Google Benchmark, benchmark_rocrand_device_api, replacing benchmark_rocrand_kernel. benchmark_rocrand_kernel is deprecated and will be removed in a future version. Likewise, benchmark_curand_host_api is added to replace benchmark_curand_generate and benchmark_curand_device_api is added to replace benchmark_curand_kernel.
- Experimental HIP-CPU feature
- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
- Python 2.7 is no longer officially supported.
- Windows HIP SDK support
rocSOLVER 3.21.0 for ROCm 5.5.0
- SVD for general matrices using Jacobi algorithm:
  - GESVDJ (with batched and strided_batched versions)
- LU factorization without pivoting for block tridiagonal matrices:
  - GEBLTTRF_NPVT (with batched and strided_batched versions)
- Linear system solver without pivoting for block tridiagonal matrices:
  - GEBLTTRS_NPVT (with batched and strided_batched versions)
- Product of triangular matrices:
  - LAUUM
- Added experimental hipGraph support for rocSOLVER functions
- Improved the performance of SYEVJ/HEEVJ.
- STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach.
- SYEVJ/HEEVJ should now be invariant under matrix scaling.
- SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed.
- Fixed GETF2_NPVT and GETRF_NPVT input data initialization in tests and benchmarks.
- Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages.
rocSPARSE 2.5.1 for ROCm 5.5.0
- Added bsrgemm and spgemm for BSR format
- Added bsrgeam
- Added build support for Navi32
- Added experimental hipGraph support for some rocSPARSE routines
- Added csritsv, spitsv csr iterative triangular solve
- Added mixed precisions for SpMV
- Added batched SpMM for transpose A in COO format with atomic algorithm
- Optimization to csr2bsr
- Optimization to csr2csr_compress
- Optimization to csr2coo
- Optimization to gebsr2csr
- Optimization to csr2gebsr
- Fixes to documentation
- Fixes a bug in COO SpMV gridsize
- Fixes a bug in SpMM gridsize when using very large matrices
- In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split.
rocWMMA 1.0 for ROCm 5.5.0
- Added support for wave32 on gfx11+
- Added infrastructure changes to support hipRTC
- Added performance tracking system
- Modified the assignment of hardware information
- Modified the data access for unsigned datatypes
- Added library config to support multiple architectures
Tensile 4.36.0 for ROCm 5.5.0
- Add functions for user-driven tuning
- Add GFX11 support: HostLibraryTests yamls, rearrange FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimization, separate vscnt and vmcnt, dual mac
- Add binary search for Grid-Based algorithm
- Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2)
- Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4)
- Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1)
- Add GSU SingleBuffer algorithm for HSS/BSS
- Add gfx900:xnack-, gfx1032, gfx1034, gfx1035
- Enable gfx1031 support
- Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1
- Improve InitAccVgprOpt
- Use global_atomic for GSU instead of flat and global_store for debug code
- Replace flat_load/store with global_load/store
- Use global_load/store for BufferLoad/Store=0 and enable scheduling
- LocalSplitU support for HGEMM+HPA when MFMA disabled
- Update Code Object Version
- Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss
- Update asm cap cache arguments
- Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead
- Change checks, error messages, assembly syntax, and coverage for DirectToLds
- Remove unused cmake file
- Clean up the LLVM dependency code
- Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2
- Update sgemm/hgemm test cases for DirectToLds and ThreadSeparateGlobalRead
- Add build-id to header of compiled source kernels
- Fix solution index collisions
- Fix h beta vectorwidth4 correctness issue for WMMA
- Fix an error with BufferStore=0
- Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2)
- Fix MoveMIoutToArch bug
- Fix flat load correctness issue on I8 and flat store correctness issue
- Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes
- Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop
- Fix issues with DirectToVgpr + ScheduleIterAlg<3
- Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2
- Fix mismatch issue with PrefetchGlobalRead=2
- Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size
- Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1
- Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case
- Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical
- Fix for failing CI tests due to CpuThreads=0
- Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2
- Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only)
- Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only)