VITIS-13434 Provide more error details for AIE async errors via xrt::error #8736

Merged
merged 92 commits into from
Feb 4, 2025
49abcee
add hipMemsetD32Async().
zhangchiming May 15, 2024
b3091df
fix typo.
zhangchiming May 15, 2024
0f98a8f
Replace hard coded name with __func__.
zhangchiming May 16, 2024
47fe9cb
fix staled pointer error.
zhangchiming May 16, 2024
ba1e8db
fix double allocation issue.
zhangchiming May 16, 2024
ae76085
fix incorrect share_ptr use.
zhangchiming May 17, 2024
98e114c
add hipMemsetD32Async().
zhangchiming May 15, 2024
1f2880c
fix typo.
zhangchiming May 15, 2024
a31bebb
Replace hard coded name with __func__.
zhangchiming May 16, 2024
fdc6516
fix staled pointer error.
zhangchiming May 16, 2024
b8909e5
fix double allocation issue.
zhangchiming May 16, 2024
6808339
1) add lock in memory_database::get_hip_mem_from_addr()
zhangchiming May 3, 2024
5e277dd
fix incorrect share_ptr use.
zhangchiming May 17, 2024
40a508d
Fix rebase error.
zhangchiming May 20, 2024
fc287ea
move to_hex() into core/common/utils.h
zhangchiming May 22, 2024
fd8fda3
remove extra member in class copy_buffer.
zhangchiming May 23, 2024
147df5c
fix a typo.
zhangchiming May 23, 2024
0ad2df6
fix a typo.
zhangchiming Jun 6, 2024
3b4df2d
fix some typo and remove template use from copy_buffer class implemen…
zhangchiming Jun 6, 2024
e5972cb
Merge branch 'master' of github.com:zhangchiming/XRT
zhangchiming Jun 6, 2024
dc2f04b
Merge branch 'Xilinx:master' into master
zhangchiming Jun 10, 2024
52c8f28
change type of err_msg argument of helper function throw_if() to cons…
Jun 10, 2024
e815e7a
Merge branch 'Xilinx:master' into master
zhangchiming Jun 10, 2024
776ebde
Merge branch 'Xilinx:master' into master
zhangchiming Jun 11, 2024
b8efc97
Merge branch 'Xilinx:master' into master
zhangchiming Jun 16, 2024
d529aea
fix incorrect usage of shared_ptr in copy_buffer constructor.
zhangchiming Jun 18, 2024
4b43440
change the place where host_vec is moved.
zhangchiming Jun 18, 2024
213e42a
add back missing std::move in copy_from_host_buffer_commad constructor.
zhangchiming Jun 18, 2024
dcec0d0
Merge branch 'Xilinx:master' into master
zhangchiming Jun 22, 2024
aa739f0
Merge branch 'Xilinx:master' into master
zhangchiming Jun 30, 2024
a91ee48
Merge branch 'Xilinx:master' into master
zhangchiming Jul 2, 2024
c887b87
Merge branch 'Xilinx:master' into master
zhangchiming Jul 3, 2024
1e275a5
Add initial implementation of hip stream ordered memory allocator.
zhangchiming Jul 3, 2024
6ccf053
Fix the size alignment in memory pool allocator.
zhangchiming Jul 3, 2024
dbfded0
fix rebase error.
zhangchiming Jul 3, 2024
e8c463a
fix rebase error.
zhangchiming Jul 3, 2024
0b0f3de
fix rebase error.
zhangchiming Jul 3, 2024
08336d2
Merge branch 'Xilinx:master' into master
zhangchiming Jul 3, 2024
622ae0e
Add comment for the choice of shared_ptr vs unique_ptr in in enqueing…
zhangchiming Jul 3, 2024
179ea49
Add comment for using shared_ptr for storage of pointer to memory_poo…
zhangchiming Jul 3, 2024
ab941d4
use unique_ptr for device_cache.
Jul 5, 2024
177a7f1
Merge branch 'Xilinx:master' into master
zhangchiming Jul 9, 2024
89eca2b
Fix issues raised in code review.
Jul 9, 2024
95eac6c
fix the error in memory::write().
Jul 9, 2024
7e9103b
remove curly braces.
Jul 9, 2024
4a2d401
Merge branch 'Xilinx:master' into master
zhangchiming Jul 9, 2024
94742f5
Fix issues in hipMallocAsync() and hipFreeAsync().
Jul 10, 2024
98ad99d
fix error found in unit testing.
Jul 10, 2024
f613080
use sub class of hip::memory for async allocation from hip memory pool.
Jul 10, 2024
06b3a5b
Merge branch 'Xilinx:master' into master
zhangchiming Jul 10, 2024
421670f
fix the error in sub mem lookup from memory_database.
Jul 10, 2024
975e1e3
Merge branch 'master' of github.com:zhangchiming/XRT
Jul 10, 2024
e288da7
Merge branch 'Xilinx:master' into master
zhangchiming Jul 29, 2024
21b6834
remove sub_mem address map.
Jul 31, 2024
20547d2
code clean up.
Aug 2, 2024
5897820
Merge branch 'Xilinx:master' into master
zhangchiming Aug 2, 2024
2920570
Merge branch 'master' of github.com:zhangchiming/XRT
Aug 2, 2024
3ee2a94
Merge branch 'Xilinx:master' into master
zhangchiming Aug 5, 2024
51cf99b
Merge branch 'Xilinx:master' into master
zhangchiming Aug 7, 2024
8e00bea
add code for hipDeviceGetDefaultMemPool(), hipDeviceGetMemPool() and …
Aug 7, 2024
949944b
Merge branch 'Xilinx:master' into master
zhangchiming Aug 8, 2024
d836e5e
fix nullptr error.
Aug 8, 2024
3ee0e40
Merge branch 'master' of github.com:zhangchiming/XRT
Aug 8, 2024
d78756a
Merge branch 'Xilinx:master' into master
zhangchiming Aug 9, 2024
4a1c8fb
Fix compile error on Linux.
zhangchiming Aug 9, 2024
3af0ef0
Merge branch 'Xilinx:master' into master
zhangchiming Aug 14, 2024
f33cd91
Merge branch 'Xilinx:master' into master
zhangchiming Aug 15, 2024
bd3bfe1
Merge branch 'Xilinx:master' into master
zhangchiming Aug 19, 2024
9351087
Merge branch 'Xilinx:master' into master
zhangchiming Aug 21, 2024
f4c270a
Merge branch 'Xilinx:master' into master
zhangchiming Aug 26, 2024
6035ddd
Merge branch 'Xilinx:master' into master
zhangchiming Aug 26, 2024
865f551
Merge branch 'Xilinx:master' into master
zhangchiming Aug 26, 2024
90f7a7b
Fix compile warning caused by using "int" type.
zhangchiming Aug 26, 2024
832ea3a
Merge branch 'master' of github.com:zhangchiming/XRT
zhangchiming Aug 26, 2024
65b6c67
Merge branch 'Xilinx:master' into master
zhangchiming Aug 29, 2024
65626f8
Merge branch 'Xilinx:master' into master
zhangchiming Sep 3, 2024
010daa8
Merge branch 'Xilinx:master' into master
zhangchiming Sep 19, 2024
f671397
Merge branch 'Xilinx:master' into master
zhangchiming Sep 20, 2024
2aafbe4
fix compile error in release builds.
zhangchiming Sep 20, 2024
a4afd8d
Fix compile error.
zhangchiming Sep 20, 2024
a26b03a
Merge branch 'Xilinx:master' into master
zhangchiming Sep 20, 2024
eebedac
Merge branch 'Xilinx:master' into master
zhangchiming Sep 30, 2024
e74cdeb
Merge branch 'Xilinx:master' into master
zhangchiming Oct 1, 2024
5de4a28
Merge branch 'Xilinx:master' into master
zhangchiming Jan 15, 2025
20a11ed
Merge branch 'Xilinx:master' into master
zhangchiming Jan 16, 2025
cca178c
Merge branch 'Xilinx:master' into master
zhangchiming Jan 30, 2025
d8988b8
Merge branch 'Xilinx:master' into master
zhangchiming Jan 31, 2025
049d7b4
Initial checkin.
Jan 31, 2025
993d6b3
Add missing code.
Feb 3, 2025
a341df2
remove catch all from error_impl constructor.
Feb 3, 2025
e22c448
Correct the wrong comment.
Feb 3, 2025
977beeb
Merge branch 'Xilinx:master' into ex_error
zhangchiming Feb 4, 2025
20 changes: 16 additions & 4 deletions src/runtime_src/core/common/api/xrt_error.cpp
@@ -205,6 +205,7 @@ class error_impl
{
xrtErrorCode m_errcode = 0;
xrtErrorTime m_timestamp = 0;
std::string m_ex_error_str;
public:
error_impl(const xrt_core::device* device, xrtErrorClass ecl)
{
@@ -213,13 +214,23 @@ class error_impl
auto buf = xrt_core::device_query<xrt_core::query::xocl_errors>(device);
if (buf.empty())
return;
auto ect = xrt_core::query::xocl_errors::to_value(buf, ecl);
std::tie(m_errcode, m_timestamp) = ect;
return;
if (device->get_ex_error_support() == true) {
auto ect = xrt_core::query::xocl_errors::to_ex_value(buf, ecl);
m_errcode = std::get<0>(ect);
m_timestamp = std::get<1>(ect);
uint64_t ex_error_code = std::get<2>(ect);
m_ex_error_str = xrt_core::device_query<xrt_core::query::xocl_ex_error_code2string>(device, ex_error_code);
}
else {
auto ect = xrt_core::query::xocl_errors::to_value(buf, ecl);
std::tie(m_errcode, m_timestamp) = ect;
m_ex_error_str = "";
}
} catch (const xrt_core::query::no_such_key&) {
// Ignoring for now. Check below for edge if not available
// query table of zocl doesn't have xocl_errors key
}

//Below code will be removed after zocl changes for new format
auto errors = xrt_core::device_query<xrt_core::query::error>(device);
for (auto& line : errors) {
@@ -258,7 +269,8 @@ class error_impl
("%s\n"
"Timestamp: %s")
% error_code_to_string(m_errcode)
% error_time_to_string(m_timestamp);
% error_time_to_string(m_timestamp)
% m_ex_error_str;

return fmt.str();
}
19 changes: 19 additions & 0 deletions src/runtime_src/core/common/device.cpp
@@ -69,6 +69,25 @@ is_nodma() const
return *m_nodma;
}

bool
device::
get_ex_error_support() const
{
std::lock_guard lk(m_mutex);
if (m_ex_error_support != std::nullopt)
return *m_ex_error_support;

try {
auto ex_error_support = xrt_core::device_query<xrt_core::query::xocl_errors_ex>(this);
m_ex_error_support = xrt_core::query::xocl_errors_ex::to_bool(ex_error_support);
}
catch (const std::exception&) {
m_ex_error_support = false;
}

return *m_ex_error_support;
}
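A minimal sketch of the lazy, cached capability probe used by `get_ex_error_support()` above. The class name `capability_cache` and the `std::function` probe are hypothetical stand-ins for `xrt_core::device` and the `xocl_errors_ex` query; only the caching pattern is modeled, under the assumption that a failed query should be remembered as "not supported".

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for xrt_core::device; only the caching logic is modeled.
class capability_cache
{
  mutable std::mutex m_mutex;
  mutable std::optional<bool> m_supported;  // empty until the first probe
  std::function<bool()> m_probe;            // stand-in for the driver query

public:
  explicit capability_cache(std::function<bool()> probe)
    : m_probe(std::move(probe))
  {}

  bool
  get() const
  {
    std::lock_guard<std::mutex> lk(m_mutex);
    if (m_supported.has_value())
      return *m_supported;

    try {
      m_supported = m_probe();
    }
    catch (const std::exception&) {
      // A failing probe is cached as "not supported", so it is not retried
      m_supported = false;
    }
    return *m_supported;
  }
};

// Usage: the probe runs once, later calls are served from the cache
inline void
capability_cache_example()
{
  int calls = 0;
  capability_cache cache([&calls] { ++calls; return true; });
  assert(cache.get());
  assert(cache.get());
  assert(calls == 1);
}
```

One consequence of caching the failure case is that a device whose query fails transiently would be reported as unsupported for the lifetime of the object; whether that matters depends on how the driver exposes the key.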

uuid
device::
get_xclbin_uuid() const
12 changes: 12 additions & 0 deletions src/runtime_src/core/common/device.h
@@ -24,6 +24,7 @@
#include <string>
#include <map>
#include <memory>
#include <optional>
#include <boost/property_tree/ptree.hpp>
#include <boost/optional/optional.hpp>

@@ -195,6 +196,16 @@ class device : public ishim
bool
is_nodma() const;

/**
* get_ex_error_support() - Check if this device supports extended error codes
*
* Return: true if the device supports extended error codes
*
*/
XRT_CORE_COMMON_EXPORT
bool
get_ex_error_support() const;

private:
// Private look up function for concrete query::request
virtual const query::request&
@@ -476,6 +487,7 @@ class device : public ishim
private:
id_type m_device_id;
mutable boost::optional<bool> m_nodma = boost::none;
mutable std::optional<bool> m_ex_error_support = std::nullopt;

using name2idx_type = std::map<std::string, cuidx_type>;
std::map<slot_id, name2idx_type> m_cu2idx; // slot -> cu name mapping to cuidx
29 changes: 29 additions & 0 deletions src/runtime_src/core/common/query_requests.cpp
@@ -132,6 +132,35 @@ to_value(const std::vector<char>& buf, xrtErrorClass ecl)
return {0, 0};
}

std::tuple<uint64_t, uint64_t, uint64_t>
xrt_core::query::xocl_errors::
to_ex_value(const std::vector<char>& buf, xrtErrorClass ecl)
{
if (buf.empty())
return { 0, 0, 0 };

auto errors_buf = reinterpret_cast<const xcl_errors*>(buf.data());
if (errors_buf->num_err <= 0)
return { 0, 0, 0 };

if (errors_buf->num_err > XCL_ERROR_CAPACITY)
throw xrt_core::system_error(EINVAL, "Invalid data in sysfs");

uint64_t error_code = 0;
uint64_t time_stamp = 0;
uint64_t ex_error_code = 0;
for (int i = errors_buf->num_err - 1; i >= 0; i--) {
if (XRT_ERROR_CLASS(errors_buf->errors[i].err_code) == ecl) {
error_code = errors_buf->errors[i].err_code;
time_stamp = errors_buf->errors[i].ts;
ex_error_code = errors_buf->errors[i].ex_error_code;
break;
}
}

return { error_code, time_stamp, ex_error_code };
}
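The lookup in `to_ex_value()` above can be sketched in isolation as a reverse scan that returns the last record of a given class. Everything here is hypothetical: `error_record` is a simplified stand-in for `xclErrorLast`, and the 4-bit class field at bit 40 is an assumption made for this sketch rather than the definition of `XRT_ERROR_CLASS`. The sketch also assumes that later entries in the buffer are the more recent ones, which is what scanning from the end suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical, simplified stand-in for xclErrorLast
struct error_record
{
  uint64_t err_code;
  uint64_t ts;
  uint64_t ex_error_code;
};

// Assumed for this sketch only: a 4-bit class field at bit 40.
// The real extraction is done by the XRT_ERROR_CLASS macro.
constexpr unsigned class_shift = 40;
constexpr uint64_t class_mask = 0xF;

constexpr uint64_t
make_code(uint64_t ecl, uint64_t num)
{
  return ((ecl & class_mask) << class_shift) | (num & 0xFFFF);
}

constexpr uint64_t
error_class(uint64_t code)
{
  return (code >> class_shift) & class_mask;
}

// Returns {err_code, ts, ex_error_code} of the last record of class ecl,
// or {0, 0, 0} when no record matches.
inline std::tuple<uint64_t, uint64_t, uint64_t>
latest_of_class(const std::vector<error_record>& errors, uint64_t ecl)
{
  // Counts down from size() - 1 to 0 using an unsigned index
  for (std::size_t i = errors.size(); i-- > 0; ) {
    if (error_class(errors[i].err_code) == ecl)
      return std::make_tuple(errors[i].err_code, errors[i].ts, errors[i].ex_error_code);
  }
  return std::make_tuple(uint64_t{0}, uint64_t{0}, uint64_t{0});
}
```

Returning an all-zero tuple for "no match" mirrors the function in the diff; callers therefore cannot distinguish "no error of this class" from a record whose fields are all zero.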

std::vector<xclErrorLast>
xrt_core::query::xocl_errors::
to_errors(const std::vector<char>& buf)
43 changes: 42 additions & 1 deletion src/runtime_src/core/common/query_requests.h
@@ -330,7 +330,10 @@ enum class key_type
kernel_max_bandwidth_mbps,
sub_device_path,
read_trace_data,
noop
noop,

xocl_errors_ex,
xocl_ex_error_code2string
};

struct pcie_vendor : request
@@ -1673,6 +1676,39 @@ struct error : request
}
};

// Retrieve support for extended asynchronous xocl errors from xocl driver
struct xocl_errors_ex : request
{
using result_type = uint32_t;
static const key_type key = key_type::xocl_errors_ex;

virtual std::any
get(const device*) const override = 0;

static bool
to_bool(const result_type& value)
{
return (value == std::numeric_limits<uint32_t>::max())
? false : value;
}
};
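The sentinel handling in `to_bool()` above relies on an implicit `uint32_t` to `bool` conversion in the second branch of the conditional. A minimal sketch that spells out both steps follows; the free function `ex_errors_supported` is a hypothetical name, and reading the all-ones value as "attribute not available" is an interpretation of the code, not something the diff states.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical free function mirroring xocl_errors_ex::to_bool
constexpr bool
ex_errors_supported(uint32_t value)
{
  // All-ones is treated as "no answer from the driver", hence unsupported
  if (value == std::numeric_limits<uint32_t>::max())
    return false;

  // Any other non-zero value counts as supported
  return value != 0;
}

static_assert(!ex_errors_supported(0), "zero means unsupported");
static_assert(ex_errors_supported(1), "non-zero means supported");
```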

// Retrieve extended asynchronous xocl errors string corresponding to the error code from xocl driver
struct xocl_ex_error_code2string : request
{
using result_type = std::string; // get value type
static const key_type key = key_type::xocl_ex_error_code2string;

virtual std::any
get(const device*) const override = 0;

static std::string
to_string(const std::string& errstr)
{
return std::string(errstr);
}
};

// Retrieve asynchronous xocl errors from xocl driver
struct xocl_errors : request
{
@@ -1687,6 +1723,11 @@ struct xocl_errors : request
static std::pair<uint64_t, uint64_t>
to_value(const std::vector<char>& buf, xrtErrorClass ecl);

// Parse buffer, get error code, timestamp and extended error code
XRT_CORE_COMMON_EXPORT
static std::tuple<uint64_t, uint64_t, uint64_t>
to_ex_value(const std::vector<char>& buf, xrtErrorClass ecl);

// Parse sysfs raw data and get list of errors
XRT_CORE_COMMON_EXPORT
static std::vector<xclErrorLast>
7 changes: 4 additions & 3 deletions src/runtime_src/core/include/xclerr_int.h
@@ -51,9 +51,10 @@
* xrtErrorCode (64 bits) = ErrorNum + Driver + Severity + Module + Class
*/
typedef struct xclErrorLast {
xrtErrorCode err_code; /* 64 bits; XRT error code */
xrtErrorTime ts; /* 64 bits; timestamp */
unsigned pid; /* 32 bits; pid associated with error, if available */
xrtErrorCode err_code; /* 64 bits; XRT error code */
xrtErrorTime ts; /* 64 bits; timestamp */
unsigned pid; /* 32 bits; pid associated with error, if available */
xrtExErrorCode ex_error_code; /* 64 bits; XRT extended error code */
} xclErrorLast;

typedef struct xcl_errors {
40 changes: 40 additions & 0 deletions src/runtime_src/core/include/xrt/detail/xrt_error_code.h
@@ -152,4 +152,44 @@ enum xrtErrorClass {
XRT_ERROR_CLASS_LAST_ENTRY = XRT_ERROR_CLASS_UNKNOWN
};

typedef uint64_t xrtExErrorCode;

/**
* xrtExErrorCode layout
*
* This layout is internal to XRT (akin to a POSIX error code).
*
* The error code is populated by the driver and consumed by the XRT
* implementation, where it is translated into an actual error / info /
* warning that is propagated to the end user.
*
*   63 - 48  47 - 32  31 - 16  15 - 0
*   ---------------------------------
*   |     |  |     |  |     |  |----| ExErrorID
*   |     |  |     |  |-----|-------- AIE_LOC_COL
*   |     |  |-----|----------------- AIE_LOC_ROW
*   |-----|-------------------------- RESERVED
*
*/

#define XRT_EX_ERROR_ID_MASK 0xFFFFUL
#define XRT_EX_ERROR_ID_SHIFT 0
#define XRT_EX_ERROR_LOC_COL_MASK 0xFFFFUL
#define XRT_EX_ERROR_LOC_COL_SHIFT 16
#define XRT_EX_ERROR_LOC_ROW_MASK 0xFFFFUL
#define XRT_EX_ERROR_LOC_ROW_SHIFT 32
#define XRT_EX_ERROR_RESERVED_MASK 0xFFFFUL
#define XRT_EX_ERROR_RESERVED_SHIFT 48

#define XRT_EX_ERROR_CODE_BUILD(ID, COL, ROW, RESERVED) \
((static_cast<uint64_t>((ID) & XRT_EX_ERROR_ID_MASK) << XRT_EX_ERROR_ID_SHIFT) | \
(static_cast<uint64_t>((COL) & XRT_EX_ERROR_LOC_COL_MASK) << XRT_EX_ERROR_LOC_COL_SHIFT) | \
(static_cast<uint64_t>((ROW) & XRT_EX_ERROR_LOC_ROW_MASK) << XRT_EX_ERROR_LOC_ROW_SHIFT) | \
(static_cast<uint64_t>((RESERVED) & XRT_EX_ERROR_RESERVED_MASK) << XRT_EX_ERROR_RESERVED_SHIFT))

#define XRT_EX_ERROR_ID(code) (((code) >> XRT_EX_ERROR_ID_SHIFT) & XRT_EX_ERROR_ID_MASK)
#define XRT_EX_ERROR_LOC_COL(code) (((code) >> XRT_EX_ERROR_LOC_COL_SHIFT) & XRT_EX_ERROR_LOC_COL_MASK)
#define XRT_EX_ERROR_LOC_ROW(code) (((code) >> XRT_EX_ERROR_LOC_ROW_SHIFT) & XRT_EX_ERROR_LOC_ROW_MASK)


#endif
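The `xrtExErrorCode` layout above packs four 16-bit fields into one 64-bit value. The following sketch shows the same packing and unpacking with `constexpr` functions instead of macros; the function and constant names are hypothetical, and only the shifts and masks are taken from the header.

```cpp
#include <cstdint>

// Hypothetical constexpr equivalents of the XRT_EX_ERROR_* macros
constexpr uint64_t field_mask = 0xFFFF;
constexpr unsigned id_shift = 0;
constexpr unsigned col_shift = 16;
constexpr unsigned row_shift = 32;
constexpr unsigned reserved_shift = 48;

constexpr uint64_t
build_ex_error_code(uint64_t id, uint64_t col, uint64_t row, uint64_t reserved = 0)
{
  // Each field is masked to 16 bits before it is shifted into place
  return ((id & field_mask) << id_shift)
       | ((col & field_mask) << col_shift)
       | ((row & field_mask) << row_shift)
       | ((reserved & field_mask) << reserved_shift);
}

constexpr uint64_t
ex_error_id(uint64_t code)
{
  return (code >> id_shift) & field_mask;
}

constexpr uint64_t
ex_error_col(uint64_t code)
{
  return (code >> col_shift) & field_mask;
}

constexpr uint64_t
ex_error_row(uint64_t code)
{
  return (code >> row_shift) & field_mask;
}

// Round trip: error id 0x12 reported at column 3, row 4
static_assert(build_ex_error_code(0x12, 3, 4) == 0x0000000400030012ULL, "packed layout");
static_assert(ex_error_row(build_ex_error_code(0x12, 3, 4)) == 4, "row field");
```

Typed functions make the 64-bit width explicit at every step, whereas the macros depend on the `static_cast` inside `XRT_EX_ERROR_CODE_BUILD` to widen the operands before shifting.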