FIX: Encoding Decoding #265
base: main
Conversation
size_t copySize = std::min(wstr.size(), info.columnSize);
#if defined(_WIN32)
// Windows: direct copy
wmemcpy(&wcharArray[i * (info.columnSize + 1)], wstr.c_str(), copySize);
Check warning — Code scanning / devskim:
These functions are historically error-prone and have been associated with a significant number of vulnerabilities. Most of these functions have safer alternatives, such as replacing 'strcpy' with 'strlcpy' or 'strcpy_s'.
memcpy(&wcharArray[i * (info.columnSize + 1)], sqlwchars.data(),
       sqlwcharsCopySize * sizeof(SQLWCHAR));
Check notice — Code scanning / devskim:
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternatives perform additional validation of the source and destination buffers.
}

size_t copySize = std::min(str.size(), info.columnSize);
memcpy(&charArray[i * (info.columnSize + 1)], str.c_str(), copySize);
Check notice — Code scanning / devskim:
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternatives perform additional validation of the source and destination buffers.
Pull Request Overview
This PR enhances character encoding and decoding support in the mssql-python library to address issues with non-UTF-8 character sets, particularly East Asian encodings like GBK. The main focus is on making encoding and decoding settings dynamically configurable and properly handling character conversion during query execution and result fetching.
- Added dynamic encoding/decoding configuration retrieval from connection objects
- Enhanced parameter binding to use connection-specific encoding settings for SQL_C_CHAR and SQL_C_WCHAR types
- Updated result fetching to apply proper character decoding based on connection settings
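
To make the intent of the overview concrete, here is a minimal, hypothetical usage sketch of what connection-level encoding configuration enables for a GBK database. The `setencoding`/`setdecoding` method names follow the pyodbc convention and are assumptions, not confirmed by this PR; the PR itself only shows that the cursor reads encoding/decoding settings from the connection.

```python
# Hypothetical usage sketch: the setencoding/setdecoding names are assumed
# (pyodbc-style) and not confirmed by this PR.
from mssql_python import connect

conn_str = "Server=localhost;Database=testdb;Trusted_Connection=yes;"
conn = connect(conn_str)

# Assumed configuration hooks: tell the driver how to encode parameters and
# how to decode CHAR/VARCHAR results for a database using a GBK collation.
conn.setencoding("gbk")   # hypothetical: encoding used when binding SQL_C_CHAR parameters
conn.setdecoding("gbk")   # hypothetical: decoding applied when fetching character results

cursor = conn.cursor()
cursor.execute("SELECT name FROM customers WHERE name = ?", "张伟")
row = cursor.fetchone()   # bytes from the server are decoded using the GBK setting above
conn.close()
```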
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
File | Description
---|---
tests/test_003_connection.py | Added comprehensive test cases for various encoding scenarios including GBK, UTF-8, East Asian characters, and diagnostic tests
mssql_python/pybind/ddbc_bindings.cpp | Enhanced parameter binding and result fetching with encoding-aware string conversion functions and extensive debug logging
mssql_python/cursor.py | Added helper methods to retrieve encoding/decoding settings from connection and updated execute/fetch methods to use dynamic settings
std::cout << "========== EncodeString DEBUG ==========" << std::endl; | ||
std::cout << "Input text: '" << text << "'" << std::endl; | ||
std::cout << "Requested encoding: " << encoding << std::endl; | ||
std::cout << "toWideChar flag: " << (toWideChar ? "true" : "false") << std::endl; | ||
|
||
try { | ||
py::bytes result; | ||
|
||
if (toWideChar) { | ||
std::cout << "Processing for SQL_C_WCHAR (wide character)" << std::endl; | ||
|
||
// For East Asian encodings that need special handling | ||
if (encoding == "gbk" || encoding == "gb2312" || encoding == "gb18030" || | ||
encoding == "cp936" || encoding == "big5" || encoding == "cp950" || | ||
encoding == "shift_jis" || encoding == "cp932" || encoding == "euc_kr" || | ||
encoding == "cp949" || encoding == "euc_jp") { | ||
|
||
std::cout << "Using East Asian encoding: " << encoding << std::endl; | ||
|
||
// First decode the string using the specified encoding to get Unicode | ||
py::object unicode_str = codecs.attr("decode")( | ||
py::bytes(text.data(), text.size()), | ||
py::str(encoding), | ||
py::str("strict") | ||
); | ||
|
||
std::cout << "Successfully decoded with " << encoding << std::endl; | ||
|
||
// Now encode as UTF-16LE for SQL Server | ||
result = codecs.attr("encode")(unicode_str, py::str("utf-16le"), py::str("strict")); | ||
std::cout << "Re-encoded to UTF-16LE for SQL Server" << std::endl; | ||
} | ||
else { | ||
// For all other encodings with wide chars, use UTF-16LE | ||
std::cout << "Using UTF-16LE for wide character data" << std::endl; | ||
result = codecs.attr("encode")(py::str(text), py::str("utf-16le"), py::str("strict")); | ||
} | ||
} | ||
else { | ||
// For SQL_C_CHAR, use the specified encoding directly | ||
std::cout << "Processing for SQL_C_CHAR (narrow character)" << std::endl; | ||
std::cout << "Using specified encoding: " << encoding << std::endl; | ||
result = codecs.attr("encode")(py::str(text), py::str(encoding), py::str("strict")); | ||
} | ||
|
||
// Log the result size | ||
size_t result_size = PyBytes_Size(result.ptr()); | ||
std::cout << "Encoded result size: " << result_size << " bytes" << std::endl; | ||
|
||
// Debug first few bytes of the result | ||
const char* data = PyBytes_AsString(result.ptr()); | ||
std::cout << "First bytes (hex): "; | ||
for (size_t i = 0; i < std::min(result_size, size_t(16)); ++i) { | ||
std::cout << std::hex << std::setw(2) << std::setfill('0') | ||
<< (static_cast<int>(data[i]) & 0xFF) << " "; | ||
} | ||
std::cout << std::dec << std::endl; | ||
|
||
std::cout << "EncodeString completed successfully" << std::endl; | ||
std::cout << "=======================================" << std::endl; |
The EncodeString function contains extensive debug output to std::cout which should be removed from production code. This debug output will pollute the console and impact performance. Consider using the existing LOG macro for debug information or removing these debug statements entirely.
std::cout << "========== EncodeString DEBUG ==========" << std::endl; | |
std::cout << "Input text: '" << text << "'" << std::endl; | |
std::cout << "Requested encoding: " << encoding << std::endl; | |
std::cout << "toWideChar flag: " << (toWideChar ? "true" : "false") << std::endl; | |
try { | |
py::bytes result; | |
if (toWideChar) { | |
std::cout << "Processing for SQL_C_WCHAR (wide character)" << std::endl; | |
// For East Asian encodings that need special handling | |
if (encoding == "gbk" || encoding == "gb2312" || encoding == "gb18030" || | |
encoding == "cp936" || encoding == "big5" || encoding == "cp950" || | |
encoding == "shift_jis" || encoding == "cp932" || encoding == "euc_kr" || | |
encoding == "cp949" || encoding == "euc_jp") { | |
std::cout << "Using East Asian encoding: " << encoding << std::endl; | |
// First decode the string using the specified encoding to get Unicode | |
py::object unicode_str = codecs.attr("decode")( | |
py::bytes(text.data(), text.size()), | |
py::str(encoding), | |
py::str("strict") | |
); | |
std::cout << "Successfully decoded with " << encoding << std::endl; | |
// Now encode as UTF-16LE for SQL Server | |
result = codecs.attr("encode")(unicode_str, py::str("utf-16le"), py::str("strict")); | |
std::cout << "Re-encoded to UTF-16LE for SQL Server" << std::endl; | |
} | |
else { | |
// For all other encodings with wide chars, use UTF-16LE | |
std::cout << "Using UTF-16LE for wide character data" << std::endl; | |
result = codecs.attr("encode")(py::str(text), py::str("utf-16le"), py::str("strict")); | |
} | |
} | |
else { | |
// For SQL_C_CHAR, use the specified encoding directly | |
std::cout << "Processing for SQL_C_CHAR (narrow character)" << std::endl; | |
std::cout << "Using specified encoding: " << encoding << std::endl; | |
result = codecs.attr("encode")(py::str(text), py::str(encoding), py::str("strict")); | |
} | |
// Log the result size | |
size_t result_size = PyBytes_Size(result.ptr()); | |
std::cout << "Encoded result size: " << result_size << " bytes" << std::endl; | |
// Debug first few bytes of the result | |
const char* data = PyBytes_AsString(result.ptr()); | |
std::cout << "First bytes (hex): "; | |
for (size_t i = 0; i < std::min(result_size, size_t(16)); ++i) { | |
std::cout << std::hex << std::setw(2) << std::setfill('0') | |
<< (static_cast<int>(data[i]) & 0xFF) << " "; | |
} | |
std::cout << std::dec << std::endl; | |
std::cout << "EncodeString completed successfully" << std::endl; | |
std::cout << "=======================================" << std::endl; | |
try { | |
py::bytes result; | |
if (toWideChar) { | |
// For East Asian encodings that need special handling | |
if (encoding == "gbk" || encoding == "gb2312" || encoding == "gb18030" || | |
encoding == "cp936" || encoding == "big5" || encoding == "cp950" || | |
encoding == "shift_jis" || encoding == "cp932" || encoding == "euc_kr" || | |
encoding == "cp949" || encoding == "euc_jp") { | |
// First decode the string using the specified encoding to get Unicode | |
py::object unicode_str = codecs.attr("decode")( | |
py::bytes(text.data(), text.size()), | |
py::str(encoding), | |
py::str("strict") | |
); | |
// Now encode as UTF-16LE for SQL Server | |
result = codecs.attr("encode")(unicode_str, py::str("utf-16le"), py::str("strict")); | |
} | |
else { | |
// For all other encodings with wide chars, use UTF-16LE | |
result = codecs.attr("encode")(py::str(text), py::str("utf-16le"), py::str("strict")); | |
} | |
} | |
else { | |
// For SQL_C_CHAR, use the specified encoding directly | |
result = codecs.attr("encode")(py::str(text), py::str(encoding), py::str("strict")); | |
} | |
size_t result_size = PyBytes_Size(result.ptr()); | |
const char* data = PyBytes_AsString(result.ptr()); |
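
For readers less familiar with the pybind11 calls, the encoding path in the suggestion above is equivalent to the following pure-Python round trip. This is a sketch of the logic only; the function name `encode_for_sql_server` is illustrative and does not appear in the PR.

```python
import codecs

def encode_for_sql_server(raw: bytes, encoding: str, to_wide_char: bool) -> bytes:
    """Plain-Python mirror of the encoding logic shown in the C++ suggestion above."""
    east_asian = {"gbk", "gb2312", "gb18030", "cp936", "big5", "cp950",
                  "shift_jis", "cp932", "euc_kr", "cp949", "euc_jp"}
    if to_wide_char:
        if encoding in east_asian:
            # Decode the incoming bytes with the source encoding, then re-encode
            # as UTF-16LE, which is what SQL Server expects for wide characters.
            text = codecs.decode(raw, encoding, "strict")
            return codecs.encode(text, "utf-16le", "strict")
        # Other encodings with wide chars: treat the input as UTF-8 text and emit UTF-16LE.
        return codecs.encode(raw.decode(), "utf-16le", "strict")
    # SQL_C_CHAR: encode with the requested narrow encoding directly.
    return codecs.encode(raw.decode(), encoding, "strict")

# Example: GBK bytes for "中文" become UTF-16LE bytes for an NVARCHAR parameter.
gbk_bytes = "中文".encode("gbk")
print(encode_for_sql_server(gbk_bytes, "gbk", to_wide_char=True))
```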
std::cout << " Type: SQL_C_CHAR" << std::endl; | ||
std::cout << " Python type: "; | ||
if (py::isinstance<py::str>(param)) std::cout << "str"; | ||
else if (py::isinstance<py::bytes>(param)) std::cout << "bytes"; | ||
else if (py::isinstance<py::bytearray>(param)) std::cout << "bytearray"; | ||
std::cout << std::endl; | ||
|
||
if (paramInfo.isDAE) { | ||
LOG("Parameter[{}] is marked for DAE streaming", paramIndex); | ||
std::cout << " Is DAE streaming" << std::endl; |
Debug output statements throughout the BindParameters function should be removed from production code. These std::cout statements will create noise in production environments and should be replaced with LOG statements or removed entirely.
std::cout << " Type: SQL_C_CHAR" << std::endl; | |
std::cout << " Python type: "; | |
if (py::isinstance<py::str>(param)) std::cout << "str"; | |
else if (py::isinstance<py::bytes>(param)) std::cout << "bytes"; | |
else if (py::isinstance<py::bytearray>(param)) std::cout << "bytearray"; | |
std::cout << std::endl; | |
if (paramInfo.isDAE) { | |
LOG("Parameter[{}] is marked for DAE streaming", paramIndex); | |
std::cout << " Is DAE streaming" << std::endl; | |
LOG(" Type: SQL_C_CHAR"); | |
if (py::isinstance<py::str>(param)) LOG(" Python type: str"); | |
else if (py::isinstance<py::bytes>(param)) LOG(" Python type: bytes"); | |
else if (py::isinstance<py::bytearray>(param)) LOG(" Python type: bytearray"); | |
if (paramInfo.isDAE) { | |
LOG("Parameter[{}] is marked for DAE streaming", paramIndex); | |
LOG(" Is DAE streaming"); |
std::cout << "Binding parameters..." << std::endl; | ||
// Debug: Print the Python params list and its types | ||
std::cout << "DEBUG: Python params list:" << std::endl; | ||
for (size_t i = 0; i < params.size(); ++i) { | ||
const py::object& param = params[i]; | ||
std::cout << " Param[" << i << "]: type=" << std::string(py::str(param.get_type()).cast<std::string>()); | ||
try { | ||
std::cout << ", repr=" << std::string(py::repr(param).cast<std::string>()); | ||
} catch (...) { | ||
std::cout << ", repr=<error>"; | ||
} | ||
std::cout << std::endl; | ||
} |
Extensive debug output in SQLExecute_wrap should be removed from production code. This debug information will be printed for every query execution, significantly impacting performance and creating console noise.
std::cout << "Binding parameters..." << std::endl; | |
// Debug: Print the Python params list and its types | |
std::cout << "DEBUG: Python params list:" << std::endl; | |
for (size_t i = 0; i < params.size(); ++i) { | |
const py::object& param = params[i]; | |
std::cout << " Param[" << i << "]: type=" << std::string(py::str(param.get_type()).cast<std::string>()); | |
try { | |
std::cout << ", repr=" << std::string(py::repr(param).cast<std::string>()); | |
} catch (...) { | |
std::cout << ", repr=<error>"; | |
} | |
std::cout << std::endl; | |
} |
// Use EncodeString to properly handle the encoding to UTF-16LE
py::bytes encoded = EncodeString(pyStr.cast<std::string>(), encoding, true);
// Convert to wstring
wstr = py::str(encoded).cast<std::wstring>();
Converting encoded bytes to py::str and then to wstring is incorrect. The `encoded` variable contains UTF-16LE bytes that should be decoded properly using codecs, not cast through py::str, which will interpret the bytes as UTF-8.
Suggested change:

wstr = encoded.attr("decode")("utf-16-le").cast<std::wstring>();
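
A small pure-Python illustration of the point above: interpreting UTF-16LE bytes as UTF-8 produces mojibake, while decoding with the matching codec recovers the original text.

```python
import codecs

# "中文" encoded as UTF-16LE, which is what EncodeString produces for SQL Server.
utf16_bytes = "中文".encode("utf-16-le")          # b'-N\x87e'

# Interpreting those bytes as UTF-8 (what the py::str cast effectively does)
# yields replacement characters and stray ASCII rather than the original text.
wrong = utf16_bytes.decode("utf-8", errors="replace")
print(wrong)                                       # '-N\ufffde' — not the original string

# Decoding with the matching codec recovers the text.
right = codecs.decode(utf16_bytes, "utf-16-le")
print(right)                                       # '中文'
```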
cursor2.close()

- @pytest.mark.skip("Skipping Unicode data tests till we have support for Unicode")
+ # @pytest.mark.skip("Skipping Unicode data tests till we have support for Unicode")
Commented-out test skip decorator should be removed entirely rather than left as a comment, since the test is now enabled.
Suggested change: delete the commented-out decorator line.
Work Item / Issue Reference
Summary

This pull request improves the handling of character encoding and decoding in the `mssql_python/cursor.py` module. The main changes ensure that encoding and decoding settings are dynamically retrieved from the connection, allowing for more robust and flexible support for different character sets when executing queries and fetching results.

Encoding and decoding settings improvements:
Added `_get_encoding_settings` and `_get_decoding_settings` helper methods to retrieve encoding and decoding configurations from the connection, with sensible fallbacks if unavailable.

Query execution enhancements:
Updated the `execute` and `executemany` methods to use dynamic encoding and character type settings when calling the underlying DDBC bindings, ensuring queries are sent with the correct encoding.

Result fetching improvements:
Modified the `fetchone`, `fetchmany`, and `fetchall` methods to use dynamic decoding settings for character and wide character data, improving reliability and compatibility when reading results from the database.
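
The fallback behavior described in the summary can be sketched as follows. This is a minimal, hypothetical illustration: the accessor names on the connection object (`getencoding`, `getdecoding`) and the default values are assumptions, not the PR's actual implementation.

```python
# Hypothetical sketch of the "retrieve settings from the connection, with
# sensible fallbacks" pattern described above. Names and defaults are assumed.

class Cursor:
    def __init__(self, connection):
        self.connection = connection

    def _get_encoding_settings(self):
        """Return the encoding used when binding character parameters."""
        try:
            return self.connection.getencoding()          # assumed accessor on the connection
        except AttributeError:
            return {"encoding": "utf-8", "ctype": "SQL_C_CHAR"}   # assumed fallback default

    def _get_decoding_settings(self):
        """Return the decoding configuration used when fetching character results."""
        try:
            return self.connection.getdecoding()          # assumed accessor on the connection
        except AttributeError:
            return {"encoding": "utf-8"}                  # assumed fallback default
```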