FIX: Encoding Decoding #265
base: main
Conversation
size_t copySize = std::min(wstr.size(), info.columnSize);
#if defined(_WIN32)
// Windows: direct copy
wmemcpy(&wcharArray[i * (info.columnSize + 1)], wstr.c_str(), copySize);
Check warning — Code scanning / devskim:
These functions are historically error-prone and have been associated with a significant number of vulnerabilities. Most of these functions have safer alternatives, such as replacing 'strcpy' with 'strlcpy' or 'strcpy_s'.
memcpy(&wcharArray[i * (info.columnSize + 1)], sqlwchars.data(),
       sqlwcharsCopySize * sizeof(SQLWCHAR));
Check notice — Code scanning / devskim:
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternatives perform additional validation of the source and destination buffers.
}

size_t copySize = std::min(str.size(), info.columnSize);
memcpy(&charArray[i * (info.columnSize + 1)], str.c_str(), copySize);
Check notice — Code scanning / devskim:
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternatives perform additional validation of the source and destination buffers.
Pull Request Overview
This PR enhances character encoding and decoding support in the mssql-python library to address issues with non-UTF-8 character sets, particularly East Asian encodings like GBK. The main focus is on making encoding and decoding settings dynamically configurable and properly handling character conversion during query execution and result fetching.
- Added dynamic encoding/decoding configuration retrieval from connection objects
- Enhanced parameter binding to use connection-specific encoding settings for SQL_C_CHAR and SQL_C_WCHAR types
- Updated result fetching to apply proper character decoding based on connection settings
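
To make the intent of the overview concrete, here is a minimal, hypothetical usage sketch of what connection-level encoding configuration enables for a GBK database. The `setencoding`/`setdecoding` method names follow the pyodbc convention and are assumptions, not confirmed by this PR; the PR itself only shows that the cursor reads encoding/decoding settings from the connection.

```python
# Hypothetical usage sketch: the setencoding/setdecoding names are assumed
# (pyodbc-style) and not confirmed by this PR.
from mssql_python import connect

conn_str = "Server=localhost;Database=testdb;Trusted_Connection=yes;"
conn = connect(conn_str)

# Assumed configuration hooks: tell the driver how to encode parameters and
# how to decode CHAR/VARCHAR results for a database using a GBK collation.
conn.setencoding("gbk")   # hypothetical: encoding used when binding SQL_C_CHAR parameters
conn.setdecoding("gbk")   # hypothetical: decoding applied when fetching character results

cursor = conn.cursor()
cursor.execute("SELECT name FROM customers WHERE name = ?", "张伟")
row = cursor.fetchone()   # bytes from the server are decoded using the GBK setting above
conn.close()
```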
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
File | Description
---|---
tests/test_003_connection.py | Added comprehensive test cases for various encoding scenarios including GBK, UTF-8, East Asian characters, and diagnostic tests
mssql_python/pybind/ddbc_bindings.cpp | Enhanced parameter binding and result fetching with encoding-aware string conversion functions and extensive debug logging
mssql_python/cursor.py | Added helper methods to retrieve encoding/decoding settings from connection and updated execute/fetch methods to use dynamic settings
std::cout << "========== EncodeString DEBUG ==========" << std::endl; | ||
std::cout << "Input text: '" << text << "'" << std::endl; | ||
std::cout << "Requested encoding: " << encoding << std::endl; | ||
std::cout << "toWideChar flag: " << (toWideChar ? "true" : "false") << std::endl; | ||
|
||
try { | ||
py::bytes result; | ||
|
||
if (toWideChar) { | ||
std::cout << "Processing for SQL_C_WCHAR (wide character)" << std::endl; | ||
|
||
// For East Asian encodings that need special handling | ||
if (encoding == "gbk" || encoding == "gb2312" || encoding == "gb18030" || | ||
encoding == "cp936" || encoding == "big5" || encoding == "cp950" || | ||
encoding == "shift_jis" || encoding == "cp932" || encoding == "euc_kr" || | ||
encoding == "cp949" || encoding == "euc_jp") { | ||
|
||
std::cout << "Using East Asian encoding: " << encoding << std::endl; | ||
|
||
// First decode the string using the specified encoding to get Unicode | ||
py::object unicode_str = codecs.attr("decode")( | ||
py::bytes(text.data(), text.size()), | ||
py::str(encoding), | ||
py::str("strict") | ||
); | ||
|
||
std::cout << "Successfully decoded with " << encoding << std::endl; | ||
|
||
// Now encode as UTF-16LE for SQL Server | ||
result = codecs.attr("encode")(unicode_str, py::str("utf-16le"), py::str("strict")); | ||
std::cout << "Re-encoded to UTF-16LE for SQL Server" << std::endl; | ||
} | ||
else { | ||
// For all other encodings with wide chars, use UTF-16LE | ||
std::cout << "Using UTF-16LE for wide character data" << std::endl; | ||
result = codecs.attr("encode")(py::str(text), py::str("utf-16le"), py::str("strict")); | ||
} | ||
} | ||
else { | ||
// For SQL_C_CHAR, use the specified encoding directly | ||
std::cout << "Processing for SQL_C_CHAR (narrow character)" << std::endl; | ||
std::cout << "Using specified encoding: " << encoding << std::endl; | ||
result = codecs.attr("encode")(py::str(text), py::str(encoding), py::str("strict")); | ||
} | ||
|
||
// Log the result size | ||
size_t result_size = PyBytes_Size(result.ptr()); | ||
std::cout << "Encoded result size: " << result_size << " bytes" << std::endl; | ||
|
||
// Debug first few bytes of the result | ||
const char* data = PyBytes_AsString(result.ptr()); | ||
std::cout << "First bytes (hex): "; | ||
for (size_t i = 0; i < std::min(result_size, size_t(16)); ++i) { | ||
std::cout << std::hex << std::setw(2) << std::setfill('0') | ||
<< (static_cast<int>(data[i]) & 0xFF) << " "; | ||
} | ||
std::cout << std::dec << std::endl; | ||
|
||
std::cout << "EncodeString completed successfully" << std::endl; | ||
std::cout << "=======================================" << std::endl; |
The EncodeString function contains extensive debug output to std::cout which should be removed from production code. This debug output will pollute the console and impact performance. Consider using the existing LOG macro for debug information or removing these debug statements entirely.
std::cout << "========== EncodeString DEBUG ==========" << std::endl; | |
std::cout << "Input text: '" << text << "'" << std::endl; | |
std::cout << "Requested encoding: " << encoding << std::endl; | |
std::cout << "toWideChar flag: " << (toWideChar ? "true" : "false") << std::endl; | |
try { | |
py::bytes result; | |
if (toWideChar) { | |
std::cout << "Processing for SQL_C_WCHAR (wide character)" << std::endl; | |
// For East Asian encodings that need special handling | |
if (encoding == "gbk" || encoding == "gb2312" || encoding == "gb18030" || | |
encoding == "cp936" || encoding == "big5" || encoding == "cp950" || | |
encoding == "shift_jis" || encoding == "cp932" || encoding == "euc_kr" || | |
encoding == "cp949" || encoding == "euc_jp") { | |
std::cout << "Using East Asian encoding: " << encoding << std::endl; | |
// First decode the string using the specified encoding to get Unicode | |
py::object unicode_str = codecs.attr("decode")( | |
py::bytes(text.data(), text.size()), | |
py::str(encoding), | |
py::str("strict") | |
); | |
std::cout << "Successfully decoded with " << encoding << std::endl; | |
// Now encode as UTF-16LE for SQL Server | |
result = codecs.attr("encode")(unicode_str, py::str("utf-16le"), py::str("strict")); | |
std::cout << "Re-encoded to UTF-16LE for SQL Server" << std::endl; | |
} | |
else { | |
// For all other encodings with wide chars, use UTF-16LE | |
std::cout << "Using UTF-16LE for wide character data" << std::endl; | |
result = codecs.attr("encode")(py::str(text), py::str("utf-16le"), py::str("strict")); | |
} | |
} | |
else { | |
// For SQL_C_CHAR, use the specified encoding directly | |
std::cout << "Processing for SQL_C_CHAR (narrow character)" << std::endl; | |
std::cout << "Using specified encoding: " << encoding << std::endl; | |
result = codecs.attr("encode")(py::str(text), py::str(encoding), py::str("strict")); | |
} | |
// Log the result size | |
size_t result_size = PyBytes_Size(result.ptr()); | |
std::cout << "Encoded result size: " << result_size << " bytes" << std::endl; | |
// Debug first few bytes of the result | |
const char* data = PyBytes_AsString(result.ptr()); | |
std::cout << "First bytes (hex): "; | |
for (size_t i = 0; i < std::min(result_size, size_t(16)); ++i) { | |
std::cout << std::hex << std::setw(2) << std::setfill('0') | |
<< (static_cast<int>(data[i]) & 0xFF) << " "; | |
} | |
std::cout << std::dec << std::endl; | |
std::cout << "EncodeString completed successfully" << std::endl; | |
std::cout << "=======================================" << std::endl; | |
try { | |
py::bytes result; | |
if (toWideChar) { | |
// For East Asian encodings that need special handling | |
if (encoding == "gbk" || encoding == "gb2312" || encoding == "gb18030" || | |
encoding == "cp936" || encoding == "big5" || encoding == "cp950" || | |
encoding == "shift_jis" || encoding == "cp932" || encoding == "euc_kr" || | |
encoding == "cp949" || encoding == "euc_jp") { | |
// First decode the string using the specified encoding to get Unicode | |
py::object unicode_str = codecs.attr("decode")( | |
py::bytes(text.data(), text.size()), | |
py::str(encoding), | |
py::str("strict") | |
); | |
// Now encode as UTF-16LE for SQL Server | |
result = codecs.attr("encode")(unicode_str, py::str("utf-16le"), py::str("strict")); | |
} | |
else { | |
// For all other encodings with wide chars, use UTF-16LE | |
result = codecs.attr("encode")(py::str(text), py::str("utf-16le"), py::str("strict")); | |
} | |
} | |
else { | |
// For SQL_C_CHAR, use the specified encoding directly | |
result = codecs.attr("encode")(py::str(text), py::str(encoding), py::str("strict")); | |
} | |
size_t result_size = PyBytes_Size(result.ptr()); | |
const char* data = PyBytes_AsString(result.ptr()); |
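
For readers less familiar with the pybind11 calls, the encoding path in the suggestion above is equivalent to the following pure-Python round trip. This is a sketch of the logic only; the function name `encode_for_sql_server` is illustrative and does not appear in the PR.

```python
import codecs

def encode_for_sql_server(raw: bytes, encoding: str, to_wide_char: bool) -> bytes:
    """Plain-Python mirror of the encoding logic shown in the C++ suggestion above."""
    east_asian = {"gbk", "gb2312", "gb18030", "cp936", "big5", "cp950",
                  "shift_jis", "cp932", "euc_kr", "cp949", "euc_jp"}
    if to_wide_char:
        if encoding in east_asian:
            # Decode the incoming bytes with the source encoding, then re-encode
            # as UTF-16LE, which is what SQL Server expects for wide characters.
            text = codecs.decode(raw, encoding, "strict")
            return codecs.encode(text, "utf-16le", "strict")
        # Other encodings with wide chars: treat the input as UTF-8 text and emit UTF-16LE.
        return codecs.encode(raw.decode(), "utf-16le", "strict")
    # SQL_C_CHAR: encode with the requested narrow encoding directly.
    return codecs.encode(raw.decode(), encoding, "strict")

# Example: GBK bytes for "中文" become UTF-16LE bytes for an NVARCHAR parameter.
gbk_bytes = "中文".encode("gbk")
print(encode_for_sql_server(gbk_bytes, "gbk", to_wide_char=True))
```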
std::cout << " Type: SQL_C_CHAR" << std::endl; | ||
std::cout << " Python type: "; | ||
if (py::isinstance<py::str>(param)) std::cout << "str"; | ||
else if (py::isinstance<py::bytes>(param)) std::cout << "bytes"; | ||
else if (py::isinstance<py::bytearray>(param)) std::cout << "bytearray"; | ||
std::cout << std::endl; | ||
|
||
if (paramInfo.isDAE) { | ||
LOG("Parameter[{}] is marked for DAE streaming", paramIndex); | ||
std::cout << " Is DAE streaming" << std::endl; |
Debug output statements throughout the BindParameters function should be removed from production code. These std::cout statements will create noise in production environments and should be replaced with LOG statements or removed entirely.
std::cout << " Type: SQL_C_CHAR" << std::endl; | |
std::cout << " Python type: "; | |
if (py::isinstance<py::str>(param)) std::cout << "str"; | |
else if (py::isinstance<py::bytes>(param)) std::cout << "bytes"; | |
else if (py::isinstance<py::bytearray>(param)) std::cout << "bytearray"; | |
std::cout << std::endl; | |
if (paramInfo.isDAE) { | |
LOG("Parameter[{}] is marked for DAE streaming", paramIndex); | |
std::cout << " Is DAE streaming" << std::endl; | |
LOG(" Type: SQL_C_CHAR"); | |
if (py::isinstance<py::str>(param)) LOG(" Python type: str"); | |
else if (py::isinstance<py::bytes>(param)) LOG(" Python type: bytes"); | |
else if (py::isinstance<py::bytearray>(param)) LOG(" Python type: bytearray"); | |
if (paramInfo.isDAE) { | |
LOG("Parameter[{}] is marked for DAE streaming", paramIndex); | |
LOG(" Is DAE streaming"); |
std::cout << "Binding parameters..." << std::endl; | ||
// Debug: Print the Python params list and its types | ||
std::cout << "DEBUG: Python params list:" << std::endl; | ||
for (size_t i = 0; i < params.size(); ++i) { | ||
const py::object& param = params[i]; | ||
std::cout << " Param[" << i << "]: type=" << std::string(py::str(param.get_type()).cast<std::string>()); | ||
try { | ||
std::cout << ", repr=" << std::string(py::repr(param).cast<std::string>()); | ||
} catch (...) { | ||
std::cout << ", repr=<error>"; | ||
} | ||
std::cout << std::endl; | ||
} |
Extensive debug output in SQLExecute_wrap should be removed from production code. This debug information will be printed for every query execution, significantly impacting performance and creating console noise.
std::cout << "Binding parameters..." << std::endl; | |
// Debug: Print the Python params list and its types | |
std::cout << "DEBUG: Python params list:" << std::endl; | |
for (size_t i = 0; i < params.size(); ++i) { | |
const py::object& param = params[i]; | |
std::cout << " Param[" << i << "]: type=" << std::string(py::str(param.get_type()).cast<std::string>()); | |
try { | |
std::cout << ", repr=" << std::string(py::repr(param).cast<std::string>()); | |
} catch (...) { | |
std::cout << ", repr=<error>"; | |
} | |
std::cout << std::endl; | |
} |
// Use EncodeString to properly handle the encoding to UTF-16LE
py::bytes encoded = EncodeString(pyStr.cast<std::string>(), encoding, true);
// Convert to wstring
wstr = py::str(encoded).cast<std::wstring>();
Converting encoded bytes to py::str and then to wstring is incorrect. The `encoded` variable contains UTF-16LE bytes that should be decoded properly using codecs, not cast through py::str, which will interpret the bytes as UTF-8.
Suggested change:

wstr = encoded.attr("decode")("utf-16-le").cast<std::wstring>();
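
A small pure-Python illustration of the point above: interpreting UTF-16LE bytes as UTF-8 produces mojibake, while decoding with the matching codec recovers the original text.

```python
import codecs

# "中文" encoded as UTF-16LE, which is what EncodeString produces for SQL Server.
utf16_bytes = "中文".encode("utf-16-le")          # b'-N\x87e'

# Interpreting those bytes as UTF-8 (what the py::str cast effectively does)
# yields replacement characters and stray ASCII rather than the original text.
wrong = utf16_bytes.decode("utf-8", errors="replace")
print(wrong)                                       # '-N\ufffde' — not the original string

# Decoding with the matching codec recovers the text.
right = codecs.decode(utf16_bytes, "utf-16-le")
print(right)                                       # '中文'
```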
cursor2.close()

- @pytest.mark.skip("Skipping Unicode data tests till we have support for Unicode")
+ # @pytest.mark.skip("Skipping Unicode data tests till we have support for Unicode")
Commented-out test skip decorator should be removed entirely rather than left as a comment, since the test is now enabled.
Suggested change: delete the commented-out decorator line.
Work Item / Issue Reference
Summary

This pull request improves the handling of character encoding and decoding in the `mssql_python/cursor.py` module. The main changes ensure that encoding and decoding settings are dynamically retrieved from the connection, allowing for more robust and flexible support for different character sets when executing queries and fetching results.

Encoding and decoding settings improvements:
Added `_get_encoding_settings` and `_get_decoding_settings` helper methods to retrieve encoding and decoding configurations from the connection, with sensible fallbacks if unavailable.

Query execution enhancements:
Updated the `execute` and `executemany` methods to use dynamic encoding and character type settings when calling the underlying DDBC bindings, ensuring queries are sent with the correct encoding.

Result fetching improvements:
Modified the `fetchone`, `fetchmany`, and `fetchall` methods to use dynamic decoding settings for character and wide character data, improving reliability and compatibility when reading results from the database.
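
The fallback behavior described in the summary can be sketched as follows. This is a minimal, hypothetical illustration: the accessor names on the connection object (`getencoding`, `getdecoding`) and the default values are assumptions, not the PR's actual implementation.

```python
# Hypothetical sketch of the "retrieve settings from the connection, with
# sensible fallbacks" pattern described above. Names and defaults are assumed.

class Cursor:
    def __init__(self, connection):
        self.connection = connection

    def _get_encoding_settings(self):
        """Return the encoding used when binding character parameters."""
        try:
            return self.connection.getencoding()          # assumed accessor on the connection
        except AttributeError:
            return {"encoding": "utf-8", "ctype": "SQL_C_CHAR"}   # assumed fallback default

    def _get_decoding_settings(self):
        """Return the decoding configuration used when fetching character results."""
        try:
            return self.connection.getdecoding()          # assumed accessor on the connection
        except AttributeError:
            return {"encoding": "utf-8"}                  # assumed fallback default
```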