@kalyanalle

on multi-card systems for each device (GPU)

Related-To: VLCLJ-2646

Note: the debug prints are left in for testing purposes and will be removed in the final code.

Brief Function Logic Explanations
Data Collection Functions

collectMemoryData()
Purpose: Collects memory telemetry from a device
Enumerates memory modules using zesDeviceEnumMemoryModules()
For each module: gets bandwidth counters (read/write) and memory state (free/used)
Stores in deviceData.memoryBandwidth and deviceData.memoryStates
Returns gracefully if no memory modules found
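
The enumerate-then-query flow above can be sketched without the Level Zero headers. This is a minimal illustration, not the code in this PR: `enumMemoryModulesStub`, `collectMemoryDataSketch`, and the struct layouts are hypothetical stand-ins. The stub only mimics the count/handles two-call convention of `zesDeviceEnumMemoryModules()` (first call with a null buffer to get the count, second call to fill the handles), and the stored counter values are placeholders.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins so the sketch compiles without Level Zero headers.
struct MemoryBandwidth { uint64_t readCounter; uint64_t writeCounter; };
struct MemoryState { uint64_t freeBytes; uint64_t usedBytes; };
struct DeviceData {
    std::vector<MemoryBandwidth> memoryBandwidth;
    std::vector<MemoryState> memoryStates;
};

// Stand-in for an enumeration call with two-call semantics.
// modulesOnDevice simulates how many memory modules the device exposes.
bool enumMemoryModulesStub(uint32_t modulesOnDevice, uint32_t *pCount, uint32_t *handles) {
    if (handles == nullptr) {          // first call: report the count only
        *pCount = modulesOnDevice;
        return true;
    }
    uint32_t n = (*pCount < modulesOnDevice) ? *pCount : modulesOnDevice;
    for (uint32_t i = 0; i < n; ++i) {
        handles[i] = i;                // second call: fill the handle buffer
    }
    *pCount = n;
    return true;
}

bool collectMemoryDataSketch(uint32_t modulesOnDevice, DeviceData &deviceData) {
    uint32_t count = 0;
    if (!enumMemoryModulesStub(modulesOnDevice, &count, nullptr)) {
        return false;
    }
    if (count == 0) {
        return true;                   // no memory modules: return gracefully
    }
    std::vector<uint32_t> handles(count);
    if (!enumMemoryModulesStub(modulesOnDevice, &count, handles.data())) {
        return false;
    }
    for (uint32_t i = 0; i < count; ++i) {
        uint64_t h = handles[i];
        // Placeholder values; the real test would query bandwidth and state
        // for each module handle here.
        deviceData.memoryBandwidth.push_back({1000 * (h + 1), 500 * (h + 1)});
        deviceData.memoryStates.push_back({4096 * (h + 1), 1024 * (h + 1)});
    }
    return true;
}
```

The power and temperature collectors described below follow the same shape, with a different enumeration call and a different per-handle query.
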
collectPowerData()
Purpose: Collects power consumption telemetry from a device
Enumerates power domains using zesDeviceEnumPowerDomains()
For each domain: gets energy counters using zesPowerGetEnergyCounter()
Stores in deviceData.powerEnergy
Returns gracefully if no power domains found
collectTemperatureData()
Purpose: Collects temperature readings from a device
Enumerates temperature sensors using zesDeviceEnumTemperatureSensors()
For each sensor: gets temperature value using zesTemperatureGetState()
Stores in deviceData.temperatures
Returns gracefully if no temperature sensors found
collectPciData()
Purpose: CRITICAL - Collects PCI info and creates unique device ID
Gets PCI properties using zesDevicePciGetProperties()
Creates BDF string: "bus:device:function" (e.g., "3:0:0")
Gets PCI traffic stats using zesDevicePciGetStats()
Returns false if retrieving the PCI properties fails (test-critical failure)
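
As an illustration of the BDF identifier described above, here is a minimal sketch. `PciAddress` and `makeBdf` are hypothetical names; the struct is assumed to mirror the `domain`, `bus`, `device`, and `function` fields of the address reported in the PCI properties.

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the PCI address fields (assumption, not the real type).
struct PciAddress {
    uint32_t domain;
    uint32_t bus;
    uint32_t device;
    uint32_t function;
};

// Builds the "bus:device:function" identifier in decimal, e.g. "3:0:0".
std::string makeBdf(const PciAddress &address) {
    return std::to_string(address.bus) + ":" +
           std::to_string(address.device) + ":" +
           std::to_string(address.function);
}
```

Note that the identifier as described does not include the PCI domain, so on a system with more than one domain it may be worth prefixing the domain as well.
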

Validation Functions
validateUniquePciBdf()
Purpose: CORE PMT VALIDATION - Ensures no duplicate PCI addresses
Uses std::set to detect duplicate BDF identifiers
Returns false if duplicate found → PMT mapping error detected
The most critical validation: proves each device has a unique address
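
The duplicate check can be sketched as follows. This assumes the BDF strings have already been collected into a vector; `hasUniqueBdfs` is a hypothetical name, not the function in this PR.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns false as soon as a BDF identifier is seen a second time.
bool hasUniqueBdfs(const std::vector<std::string> &bdfs) {
    std::set<std::string> seen;
    for (const auto &bdf : bdfs) {
        // insert().second is false when the element was already present.
        if (!seen.insert(bdf).second) {
            return false;
        }
    }
    return true;
}
```
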
validateMemoryDataIsolation()
Purpose: Ensures memory counters differ between all device pairs
Double loop: Compares every device pair (i vs j where j > i)
Checks memory bandwidth: read/write counters must differ
Checks memory state: free memory should differ between devices
EXPECT_FALSE on identical data → detects PMT cross-contamination
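
The pairwise comparison (i vs j where j > i) can be sketched like this. `BandwidthSample` and `findIdenticalPairs` are hypothetical names, and the sketch compares one sample per device for brevity; in the test itself each identical pair would correspond to a failing `EXPECT_FALSE`.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-device sample: one read and one write counter.
struct BandwidthSample {
    uint64_t readCounter;
    uint64_t writeCounter;
};

// Returns every index pair (i, j) with j > i whose counters are identical.
std::vector<std::pair<std::size_t, std::size_t>>
findIdenticalPairs(const std::vector<BandwidthSample> &perDevice) {
    std::vector<std::pair<std::size_t, std::size_t>> identical;
    for (std::size_t i = 0; i < perDevice.size(); ++i) {
        for (std::size_t j = i + 1; j < perDevice.size(); ++j) {
            bool sameRead = perDevice[i].readCounter == perDevice[j].readCounter;
            bool sameWrite = perDevice[i].writeCounter == perDevice[j].writeCounter;
            if (sameRead && sameWrite) {
                identical.emplace_back(i, j);
            }
        }
    }
    return identical;
}
```

Starting the inner loop at `i + 1` visits each unordered pair exactly once and never compares a device with itself. The power and PCI isolation checks below use the same loop shape on their own counters.
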
validatePowerDataIsolation()
Purpose: Ensures power readings differ between all device pairs
Double loop: Compares every device pair
Checks energy counters: power consumption values must differ
EXPECT_FALSE on identical energy → detects shared power data
validateTemperatureDataIsolation()
Purpose: Validates temperature readings are realistic per device
Double loop: Validates each device's temperature range
Range check: 0°C < temperature < 150°C per device
No uniqueness requirement (idle GPUs may have similar temps)
Ensures PMT thermal interface is accessible
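
A minimal sketch of the range check, assuming the readings are already collected as degrees Celsius in a vector of `double`. `temperaturesInRange` is a hypothetical name; the bounds are the ones stated above.

```cpp
#include <vector>

// True only if every reading satisfies low < t < high (exclusive bounds).
// The negated form also rejects NaN readings.
bool temperaturesInRange(const std::vector<double> &temperatures,
                         double low = 0.0, double high = 150.0) {
    for (double t : temperatures) {
        if (!(t > low && t < high)) {
            return false;
        }
    }
    return true;
}
```

Two devices reporting the same value both pass, which matches the decision not to require uniqueness for temperatures.
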
validatePciDataIsolation()
Purpose: CRITICAL - Validates PCI bus uniqueness and traffic isolation
EXPECT_NE on PCI bus numbers → different devices must be on different buses
Compares PCI traffic stats: RX/TX/packet counters must differ
EXPECT_FALSE on identical stats → detects PMT interface sharing
Core PMT mapping validation - validates the commit's fix

Signed-off-by: Alle, Kalyan <[email protected]>