Commit acccdac

BUG: to_stata erroring when encoded text and normal text have mismatched length (#61629)
* Initial testcase provided in Issue
* Replaced check for encoded with unencoded check to prevent edge cases where two values are different
* replaced type check with isinstance()
* Updated patch notes
* pre-commit checks
1 parent 2a7a294 commit acccdac

3 files changed: +19 -4 lines changed

doc/source/whatsnew/v3.0.0.rst

Lines changed: 1 addition & 0 deletions
@@ -777,6 +777,7 @@ I/O
 - Bug in :meth:`DataFrame.to_excel` when writing empty :class:`DataFrame` with :class:`MultiIndex` on both axes (:issue:`57696`)
 - Bug in :meth:`DataFrame.to_excel` where the :class:`MultiIndex` index with a period level was not a date (:issue:`60099`)
 - Bug in :meth:`DataFrame.to_stata` when exporting a column containing both long strings (Stata strL) and :class:`pd.NA` values (:issue:`23633`)
+- Bug in :meth:`DataFrame.to_stata` when input encoded length and normal length are mismatched (:issue:`61583`)
 - Bug in :meth:`DataFrame.to_stata` when writing :class:`DataFrame` and ``byteorder=`big```. (:issue:`58969`)
 - Bug in :meth:`DataFrame.to_stata` when writing more than 32,000 value labels. (:issue:`60107`)
 - Bug in :meth:`DataFrame.to_string` that raised ``StopIteration`` with nested DataFrames. (:issue:`16098`)
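
The "mismatched length" in the new whatsnew entry can be illustrated with the same string the added test uses. This is a minimal sketch, not part of the commit: the limit of 2045 is an assumption about the writer's `_max_string_length`, which the diff does not show, and the variable names are hypothetical.

```python
# Character count versus encoded byte count for the string used in the test.
text = "§" * 1500

char_len = len(text)                      # length as seen before encoding
utf8_len = len(text.encode("utf-8"))      # "§" (U+00A7) takes 2 bytes in UTF-8
latin1_len = len(text.encode("latin-1"))  # "§" takes 1 byte in Latin-1

# Assumed fixed-width string limit; not taken from the diff.
ASSUMED_MAX_STRING_LENGTH = 2045

print(char_len, utf8_len, latin1_len)
print("fits by character count:", char_len <= ASSUMED_MAX_STRING_LENGTH)
print("fits by UTF-8 byte count:", utf8_len <= ASSUMED_MAX_STRING_LENGTH)
```

Under that assumed limit, the character count and the UTF-8 byte count fall on opposite sides of the threshold, which is the kind of disagreement the entry describes.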

pandas/io/stata.py

Lines changed: 8 additions & 4 deletions
@@ -2739,7 +2739,7 @@ def _encode_strings(self) -> None:
                 encoded = self.data[col].str.encode(self._encoding)
                 # If larger than _max_string_length do nothing
                 if (
-                    max_len_string_array(ensure_object(encoded._values))
+                    max_len_string_array(ensure_object(self.data[col]._values))
                     <= self._max_string_length
                 ):
                     self.data[col] = encoded
@@ -3263,11 +3263,15 @@ def generate_blob(self, gso_table: dict[str, tuple[int, int]]) -> bytes:
             bio.write(gso_type)

             # llll
-            utf8_string = bytes(strl, "utf-8")
-            bio.write(struct.pack(len_type, len(utf8_string) + 1))
+            if isinstance(strl, str):
+                strl_convert = bytes(strl, "utf-8")
+            else:
+                strl_convert = strl
+
+            bio.write(struct.pack(len_type, len(strl_convert) + 1))

             # xxx...xxx
-            bio.write(utf8_string)
+            bio.write(strl_convert)
             bio.write(null)

         return bio.getvalue()
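
The patched branch in `generate_blob` can be read as "normalize to bytes, then write a length prefix, the payload, and a terminating null". The following is a rough standalone sketch of that pattern, not the pandas implementation: `write_strl_payload` is a hypothetical helper, and the little-endian `"<I"` length format and the single `b"\x00"` terminator are assumptions standing in for `len_type` and `null`, whose definitions are outside the hunk.

```python
import struct
from io import BytesIO


def write_strl_payload(bio: BytesIO, strl, len_type: str = "<I") -> None:
    """Hypothetical helper: write length, payload and null for one value."""
    # Encode only when given text; already-encoded bytes pass through as-is.
    if isinstance(strl, str):
        payload = bytes(strl, "utf-8")
    else:
        payload = strl

    # The stored length counts the payload bytes plus the trailing null.
    bio.write(struct.pack(len_type, len(payload) + 1))
    bio.write(payload)
    bio.write(b"\x00")


from_text = BytesIO()
write_strl_payload(from_text, "§")

from_bytes = BytesIO()
write_strl_payload(from_bytes, "§".encode("utf-8"))

# Both inputs should serialize to the same byte sequence.
print(from_text.getvalue() == from_bytes.getvalue())
```

Computing the length from the bytes actually written, rather than from the original text, keeps the prefix consistent with the payload regardless of which form the caller passes in.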

pandas/tests/io/test_stata.py

Lines changed: 10 additions & 0 deletions
@@ -2601,3 +2601,13 @@ def test_strl_missings(temp_file, version):
         ]
     )
     df.to_stata(temp_file, version=version)
+
+
+@pytest.mark.parametrize("version", [117, 118, 119, None])
+def test_ascii_error(temp_file, version):
+    # GH #61583
+    # Check that 2 byte long unicode characters doesn't cause export error
+    df = DataFrame({"doubleByteCol": ["§" * 1500]})
+    df.to_stata(temp_file, write_index=0, version=version)
+    df_input = read_stata(temp_file)
+    tm.assert_frame_equal(df, df_input)
