-
Notifications
You must be signed in to change notification settings - Fork 38
Integrate Broadcom Devices in amd-smi framework #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: amd-staging
Are you sure you want to change the base?
Conversation
Signed-off-by: Joseph Narlo <[email protected]>
1. As part of this we have added Broadcom's NIC and SWITCH devices in the amd-smi framework. 2. The affected command is 'amd-smi list' 3. Filter the NIC and SWITCH devices based on the Broadcom's vendor ids 4. Since these changes are depending on sysfs entries which is created by broadcom drivers, if no brcm driver present in the system then necessary elements will not created as part of the list command.
…MD-ROCm-Internal/amdsmi into SWDEV-504389/Synch_Comment_In_Linux_BM
Comments added
NIC/Switch LIST
List Nic/Switch
As part of this feature, 1. Added monitor attributes for BRCM devices - NIC and SWITCH 2. The affected amd-smi command is 'amd-smi monitor' 3. The list of monitor attributes are read from the corresponding BRCM device's sysfs path
BRCM Monitor
BRCM Monitor
…imestamp from firmware. (ROCm#61) * SWDEV-511296 - update violation_status->violation_timestamp to read values from firmware. Signed-off-by: Greg Scaffidi <[email protected]> * SWDEV-511296 - update violation_status->violation_timestamp to read values from firmware. Signed-off-by: Greg Scaffidi <[email protected]> * SWDEV-511296 - update violation_status->violation_timestamp to read values from firmware. Signed-off-by: Greg Scaffidi <[email protected]> --------- Signed-off-by: Greg Scaffidi <[email protected]>
Fix ordering of RHEL 8 build process Signed-off-by: Williams, Justin <[email protected]>
9872083
to
09379f8
Compare
ohhh that's cool, I'll ask some teammates to see if they want to take on broadcom things. continue internally |
Affected Commands: 1. amd-smi topology -nic Display nic and gpu connectivity 2. amd-smi topology -show_numa Display nic,gpu's numa and cpu affinity
1. Remove duplicate declaration 2. Resolve Alignment issue
234704f
to
bd82e88
Compare
@@ -994,6 +996,25 @@ typedef struct { | |||
* | |||
* @cond @tag{gpu_bm_linux} @tag{host} @endcond | |||
*/ | |||
typedef struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment to explain the difference of those temperature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the comment statements to the methods.
include/amd_smi/amdsmi.h
Outdated
uint32_t nic_temp_input; | ||
uint32_t nic_temp_max; | ||
uint32_t nic_temp_shutdown; | ||
} amdsmi_nic_temperature_metric_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is Broadcom specific metrics, maybe we should use the amdsmi_brcm_nic_temperature_metric_t?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed as amdsmi_brcm_nic_temperature_metric_t
} amdsmi_nic_temperature_metric_t; | ||
|
||
typedef struct { | ||
char current_link_speed[AMDSMI_MAX_STRING_LENGTH]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will assume this is the integer or float, and not string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per BRCM NIC driver implementation the complete string added in the sysfs entry and the same populated in amd-smi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No code changes made.
@@ -2951,7 +2972,17 @@ amdsmi_status_t amdsmi_get_gpu_bdf_id(amdsmi_processor_handle processor_handle, | |||
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail | |||
*/ | |||
amdsmi_status_t amdsmi_get_gpu_topo_numa_affinity(amdsmi_processor_handle processor_handle, int32_t *numa_node); | |||
|
|||
amdsmi_status_t amdsmi_get_gpu_topo_cpu_affinity(amdsmi_processor_handle processor_handle, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is user facing API, we may want to reduce the amount of API. In this case, we have cpu/gpu/nic, maybe just one API with a enum parameter to specify the type we want to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the idea is do not want to conflict with gpu implementation.
In addition to that to identify the devices in python interface using the corresponding wrapper methods.
No code changes made.
|
||
amdsmi_status_t amdsmi_get_gpu_topo_cpu_affinity(amdsmi_processor_handle processor_handle, | ||
unsigned int *cpu_aff_length, char *cpu_aff_data); | ||
amdsmi_status_t amdsmi_get_nic_topo_numa_affinity(amdsmi_processor_handle processor_handle, int32_t *numa_node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add comment for API defined in this file, so that doxygen can generate proper document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment and deliver the changes.
char buffer[128]; | ||
|
||
// Open a pipe to execute the command | ||
std::shared_ptr<FILE> pipe(popen(command.c_str(), "r"), pclose); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should try to avoid call external command if possible as this is a library. maybe you can refer to some code here:
https://github.com/pciutils/pciutils/blob/master/lspci.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like lspci command, we are planning to add ethtool, sg_ses commands also to populate the nic and switch devices details. In this case, we are believing this is better approach to get the necessary information.
No code changes made.
namespace amd { | ||
namespace smi { | ||
|
||
class AMDSmiNoDrmNIC { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain what is no drm nic? I think drm is related to GPU only? In amdsmi, some optional function needs to refer the libdrm, but we do not want to add the hard dependency to it, so we dynamically load it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the dependency "xf86drm.h" and delivered the changes. Other than GPU's base class referred as no_drm. It can be NIC and Switch.
@@ -1500,6 +1500,8 @@ rsmi_status_t rsmi_driver_status(rsmi_driver_state_t* state); | |||
* @retval ::RSMI_STATUS_SUCCESS is returned upon successful call. | |||
*/ | |||
rsmi_status_t rsmi_num_monitor_devices(uint32_t *num_devices); | |||
rsmi_status_t rsmi_num_nic_monitor_devices(uint32_t *num_devices); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comments for public facing header so that the doxygen can generate API document properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comments and deliver the changes.
@@ -562,38 +642,74 @@ amdsmi_get_gpu_device_bdf(amdsmi_processor_handle processor_handle, amdsmi_bdf_t | |||
return AMDSMI_STATUS_SUCCESS; | |||
} | |||
|
|||
amdsmi_status_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why we need to remove this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This difference due to merge conflict and it was resolved.
@@ -324,6 +325,13 @@ amdsmi_status_t AMDSmiDrm::get_bdf_by_index(uint32_t gpu_index, amdsmi_bdf_t *bd | |||
return AMDSMI_STATUS_SUCCESS; | |||
} | |||
|
|||
amdsmi_status_t AMDSmiDrm::amdgpu_query_cpu_affinity(std::string devicePath, std::string &cpu_affinity) { | |||
std::string cpuAffFile = "cpulistaffinity"; | |||
cpu_affinity = smi_brcm_get_value_string(devicePath, cpuAffFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it will never fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is handled in amd_smi_utils.cc. If file not present returning as "N/A".
Hence, expected return value is always success.
uint32_t socket_count = 0; | ||
|
||
// Initialize amdsmi for AMD CPUs | ||
ret = amdsmi_init(AMDSMI_INIT_AMD_GPUS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you only init the amdsmi to discover GPU device, then you should not be able to find out the NIC device in the following amdsmi_get_processor_handles_by_type()
if you want to have an example of code to support both NIC and GPU device, you may need to pass both as parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of AMDSMI_INIT_AMD_GPUS only BRCM NIC and Switch flags are initialized.
* Construct the BDF for the discovered NIC devices | ||
* Push NIC SMI Device object to nic_devices vector | ||
*/ | ||
ret = DiscoverBRCMnicDevices(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a system does not have the NIC device, we should not fail the whole init() as the user may only care about the GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled graceful exit in DiscoverBRCMnicDevices and DiscoverBRCMswitchDevices and make sure always return 0.
uint64_t bdfid; | ||
for (auto &device : nic_devices_) { | ||
if (ConstructBDFID(device->path(), &bdfid) != 0) { | ||
std::cerr << "Failed to construct BDFID." << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly, any NIC device init fail should not fail the whole function as people may just want to work on GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled dead code.
@@ -653,7 +780,63 @@ void RocmSMI::AddToDeviceList(std::string dev_name, uint64_t bdfid) { | |||
LOG_DEBUG(ss); | |||
} | |||
|
|||
void RocmSMI::AddToNICDeviceList(std::string dev_name, uint64_t bdfid, uint32_t card_indx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of code is similar to AddToSWITCHDeviceList(), consider to remove duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both are two different elements do not want to make function overload.
Moreover, there is a chance to have a server with any one of the BRCM device - either BRCM NIC or BRCM Switch.
@@ -691,6 +874,84 @@ static bool isAMDGPU(std::string dev_path) { | |||
return isAmdGpu; | |||
} | |||
|
|||
static bool isBRCMnic(std::string dev_path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code is similar to isBRCMswitch(), consider to remove duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OEM, device id and device's sysfs paths are different and handle multiple if else will not be looks good.
No code changes made.
@@ -1096,6 +1357,142 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) { | |||
return 0; | |||
} | |||
|
|||
uint32_t RocmSMI::DiscoverBRCMnicDevices(void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar code to DiscoverBRCMswitchDevices(), remove duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both are two different elements do not want to make function overload.
Moreover, there is a chance to have a server with any one of the BRCM device - either BRCM NIC or BRCM Switch.
amdsmi_status_t amdsmi_get_gpu_topo_cpu_affinity(amdsmi_processor_handle processor_handle, | ||
unsigned int *cpu_aff_length, char *cpu_aff_data); | ||
amdsmi_status_t amdsmi_get_nic_topo_numa_affinity(amdsmi_processor_handle processor_handle, int32_t *numa_node); | ||
amdsmi_status_t amdsmi_get_nic_topo_cpu_affinity(amdsmi_processor_handle processor_handle, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For any new API defined in the amdsmi.h, we need a unit test, you can refer to the files at tests/amd_smi_test/ folders.
|
||
static const char *kDeviceNamePrefix = "card"; | ||
static const char *kNICPrefix = "e"; //changed to 'e' since SLES syspath is 'eth' and RHEL/Ubuntu is 'en' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wl for wireless LAN (e.g., wlp2s0). But it may not be applied to our case, so we can still use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Moreover, we are filtering by specific vendor / driver supported nic devices only.
* Push SWITCH SMI Device object to switch_devices vector | ||
*/ | ||
|
||
ret = DiscoverBRCMswitchDevices(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The below switch code is very similar to handle BRCM NIC devices. Consider removing duplications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both are two different elements do not want to make function overload.
Moreover, there is a chance to have a server with any one of the BRCM device - either BRCM NIC or BRCM Switch.
rocm_smi/src/rocm_smi_main.cc
Outdated
err_msg += kPathSwitchRoot; | ||
err_msg += "."; | ||
perror(err_msg.c_str()); | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of using hardcode number, we may prefer to use the macro to make the code more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed same as DiscoverAmdgpuDevices implementation.
auto path_dir = opendir(path.c_str()); | ||
if (path_dir == nullptr) { | ||
//move to the next index | ||
if (numofretry == 3) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we come up to use the hardcode number 3 instead of other number for retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To handle the negative test scenario added this case - the test server had only one invalid sysfs entry and the execution failed.
Note: In the good scenario the retry count is not required.
For safer side added the retry count as 3. In general we have user editable configuration file for feeding these values, but not sure that can be applicable for amd-smi.
@@ -247,6 +251,122 @@ amdsmi_status_t AMDSmiSystem::populate_amd_gpu_devices() { | |||
return AMDSMI_STATUS_SUCCESS; | |||
} | |||
|
|||
amdsmi_status_t AMDSmiSystem::populate_brcm_nic_devices() { | |||
uint32_t device_count = 0; | |||
amdsmi_status_t amd_smi_status = no_drm_nic.init(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want to check the return status?
return AMDSMI_STATUS_SUCCESS; | ||
} | ||
|
||
amdsmi_status_t AMDSmiSystem::populate_brcm_switch_devices() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider combining with populate_brcm_nic_devices() to remove duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both are two different elements do not want to make function overload.
Moreover, there is a chance to have a server with any one of the BRCM device - either BRCM NIC or BRCM Switch.
} | ||
else { | ||
std::string line; | ||
getline(file, line); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle the errors as this function may throw exceptions.
else { | ||
std::string line; | ||
getline(file, line); | ||
return static_cast<uint32_t>(stoi(line)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the line is not number, what will happen?
return "N/A"; | ||
} | ||
else { | ||
getline(file, temp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle error
@@ -63,6 +63,11 @@ add_executable(${SMI_NODRM_EXAMPLE_EXE} "amd_smi_nodrm_example.cc") | |||
target_link_libraries(${SMI_NODRM_EXAMPLE_EXE} amd_smi) | |||
add_dependencies(${SMI_NODRM_EXAMPLE_EXE} amd_smi) | |||
|
|||
set(SMI_LIST_EXAMPLE_EXE "amd_smi_list") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you just want to show to use the nic device, maybe using the name amd_smi_nic.cc instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the amd_smi_list example code to print all GPU, NIC and Switch devices BDF and UUID details.
Remove duplicate methods from merges
Incorporated review comments.
Incorporated Review Comments Phase 2.1
As part of this feature,
Updated (1/27)
i) NIC_TEMP_CURRENT
ii) NIC_TEMP_CRIT_ALARM
iii) NIC_TEMP_EMERGENCY_ALARM
iv) NIC_TEMP_SHUTDOWN_ALARM
v) NIC_TEMP_MAX_ALARM
i) CURRENT_LINK_SPEED
ii) MAX_LINK_SPEED
iii) CURRENT_LINK_WIDTH
iv) MAX_LINK_WIDTH