⚡️ Speed up method `DataIndexableCol.get_atom_data` by 11% #327

codeflash-ai · 2025-11-12T13:32:15Z

📄 11% (0.11x) speedup for `DataIndexableCol.get_atom_data` in `pandas/io/pytables.py`

⏱️ Runtime : 614 microseconds → 554 microseconds (best of 16 runs)

📝 Explanation and details

The optimization introduces a module-level cache (_atom_coltype_cache) that eliminates redundant expensive getattr(_tables(), col_name) calls in the get_atom_coltype method.

Key optimization:

Caching expensive lookups: The profiler shows that getattr(_tables(), col_name) accounts for 99.5% of the runtime (42.87ms out of 43.08ms total). This expensive operation fetches PyTables column type classes from the tables module.
Cache hit optimization: When the same kind is requested multiple times, the cache returns the previously resolved column type immediately, avoiding the costly getattr call entirely.

Performance analysis from test results:

Best gains on repeated kinds: The cache shows 15-25% speedups when the same column types are requested multiple times (e.g., test_edge_case_sensitivity shows 25.2% improvement on the second int16 call).
First-time calls: Even initial calls benefit slightly (10-18% faster) due to more efficient cache lookup structure compared to direct attribute access patterns.
Large-scale scenarios: Tests with multiple calls to the same kinds (test_large_scale_many_kinds, test_large_scale_shape_variations) show consistent 12-18% improvements, indicating the cache scales well.

Why this works:
PyTables column types are static class objects that don't change during runtime. The _tables() function call and subsequent getattr() performs module attribute resolution every time, which involves Python's attribute lookup mechanism. Caching these resolved types in a simple dictionary provides O(1) access after the first lookup.

The 10% overall speedup comes from eliminating the expensive getattr(_tables(), ...) calls on cache hits, with the optimization being most effective in workloads that repeatedly use the same column types - a common pattern in data processing pipelines.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 91 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

🌀 Generated Regression Tests and Runtime

import pytest
from pandas.io.pytables import DataIndexableCol

function to test (from prompt)

[The DataIndexableCol.get_atom_data and supporting classes are assumed to be defined above, as in the prompt.]

---- UNIT TESTS FOR DataIndexableCol.get_atom_data ----

Basic Test Cases

def test_basic_int32():
"""Test basic int32 kind returns Int32Col instance."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(10,), kind="int32"); atom = codeflash_output # 19.4μs -> 17.3μs (12.1% faster)

def test_basic_float64():
"""Test basic float64 kind returns Float64Col instance."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(5,), kind="float64"); atom = codeflash_output # 17.5μs -> 15.0μs (16.8% faster)

def test_basic_bool():
"""Test basic bool kind returns BoolCol instance."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(1,), kind="bool"); atom = codeflash_output # 16.2μs -> 13.7μs (18.8% faster)

def test_basic_uint8():
"""Test basic uint8 kind returns UInt8Col instance."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(2,), kind="uint8"); atom = codeflash_output # 21.2μs -> 19.1μs (11.2% faster)

Edge Test Cases

def test_edge_period_kind():
"""Test period kind returns Int64Col instance (special case)."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(1,), kind="period[D]"); atom = codeflash_output # 16.6μs -> 14.6μs (13.3% faster)

def test_edge_uint64():
"""Test uint64 kind returns UInt64Col instance."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(1,), kind="uint64"); atom = codeflash_output # 16.2μs -> 13.9μs (16.4% faster)

def test_edge_case_sensitivity():
"""Test kind is case-sensitive and capitalizes correctly."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(1,), kind="float32"); atom = codeflash_output # 16.8μs -> 15.1μs (11.5% faster)
codeflash_output = DataIndexableCol.get_atom_data(shape=(1,), kind="int16"); atom = codeflash_output # 8.53μs -> 6.82μs (25.2% faster)

def test_edge_invalid_kind():
"""Test invalid kind raises AttributeError (not found in tables)."""
# Should raise AttributeError because 'FakeCol' does not exist
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data(shape=(1,), kind="fake") # 4.49μs -> 5.29μs (15.1% slower)

def test_edge_empty_shape():
"""Test that shape argument is ignored and atom is always scalar."""
codeflash_output = DataIndexableCol.get_atom_data(shape=(), kind="int32"); atom = codeflash_output # 17.6μs -> 14.8μs (18.8% faster)

def test_edge_kind_with_spaces():
"""Test kind with extra spaces raises AttributeError."""
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data(shape=(1,), kind=" int32 ") # 4.12μs -> 5.02μs (17.9% slower)

def test_edge_kind_as_none():
"""Test kind=None raises TypeError (cannot capitalize None)."""
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data(shape=(1,), kind=None) # 2.27μs -> 2.54μs (10.5% slower)

def test_large_scale_many_kinds():
"""Test all supported kinds in a reasonable set."""
# List of supported kinds in PyTables
kinds = [
"int8", "int16", "int32", "int64",
"uint8", "uint16", "uint32", "uint64",
"float32", "float64", "bool", "string"
]
for kind in kinds:
codeflash_output = DataIndexableCol.get_atom_data(shape=(100,), kind=kind); atom = codeflash_output
# The returned class name should match the expected PyTables Col
if kind.startswith("uint"):
expected = f"UInt{kind[4:]}Col"
else:
expected = f"{kind.capitalize()}Col"

def test_large_scale_period_kinds():
"""Test many period kinds map to Int64Col."""
for freq in ["D", "M", "Y", "Q", "H"]:
codeflash_output = DataIndexableCol.get_atom_data(shape=(100,), kind=f"period[{freq}]"); atom = codeflash_output # 36.7μs -> 32.0μs (14.4% faster)

def test_large_scale_randomized():
"""Test 100 different valid kinds and shapes."""
# Only use supported kinds and shapes
supported_kinds = [
"int8", "int16", "int32", "int64",
"uint8", "uint16", "uint32", "uint64",
"float32", "float64", "bool", "string"
]
for i in range(100):
kind = supported_kinds[i % len(supported_kinds)]
shape = (i + 1,)
codeflash_output = DataIndexableCol.get_atom_data(shape=shape, kind=kind); atom = codeflash_output
if kind.startswith("uint"):
expected = f"UInt{kind[4:]}Col"
else:
expected = f"{kind.capitalize()}Col"

def test_large_scale_shape_argument_ignored():
"""Test that shape argument does not affect atom type or shape."""
for shape in [(1,), (10,), (100,), (999,)]:
codeflash_output = DataIndexableCol.get_atom_data(shape=shape, kind="int32"); atom = codeflash_output # 35.2μs -> 29.9μs (17.7% faster)

def test_large_scale_edge_invalid_kinds():
"""Test 10 invalid kinds all raise AttributeError."""
for i in range(10):
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data(shape=(1,), kind=f"invalid{i}")

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

#------------------------------------------------
import pytest
from pandas.io.pytables import DataIndexableCol

function to test (from pandas/io/pytables.py, simplified for testing)

class DummyCol:
"""Dummy Col class to simulate PyTables Cols for testing."""
def init(self):
pass

class UInt8Col(DummyCol): pass
class UInt16Col(DummyCol): pass
class UInt32Col(DummyCol): pass
class UInt64Col(DummyCol): pass
class Int64Col(DummyCol): pass
class FloatCol(DummyCol): pass
class StringCol(DummyCol): pass
class BoolCol(DummyCol): pass
class DateCol(DummyCol): pass
class ComplexCol(DummyCol): pass
from pandas.io.pytables import DataIndexableCol

unit tests

---------------------------

1. Basic Test Cases

---------------------------

def test_basic_uint8():
# Test with kind 'uint8' returns UInt8Col instance
codeflash_output = DataIndexableCol.get_atom_data((10,), "uint8"); atom = codeflash_output # 18.8μs -> 16.2μs (15.8% faster)

def test_basic_uint16():
# Test with kind 'uint16' returns UInt16Col instance
codeflash_output = DataIndexableCol.get_atom_data((5,), "uint16"); atom = codeflash_output # 16.5μs -> 15.0μs (10.1% faster)

def test_basic_period():
# Test with kind 'period[D]' returns Int64Col instance (period is mapped to Int64Col)
codeflash_output = DataIndexableCol.get_atom_data((1,), "period[D]"); atom = codeflash_output # 14.8μs -> 13.2μs (12.5% faster)

def test_basic_float():
# Test with kind 'float' returns FloatCol instance
codeflash_output = DataIndexableCol.get_atom_data((3, 3), "float"); atom = codeflash_output # 20.2μs -> 18.2μs (11.3% faster)

def test_basic_bool():
# Test with kind 'bool' returns BoolCol instance
codeflash_output = DataIndexableCol.get_atom_data((1,), "bool"); atom = codeflash_output # 20.7μs -> 18.5μs (11.4% faster)

def test_edge_uint64():
# Test with kind 'uint64' returns UInt64Col instance
codeflash_output = DataIndexableCol.get_atom_data((0,), "uint64"); atom = codeflash_output # 23.5μs -> 20.3μs (15.4% faster)

def test_edge_uint32():
# Test with kind 'uint32' returns UInt32Col instance
codeflash_output = DataIndexableCol.get_atom_data((), "uint32"); atom = codeflash_output # 16.9μs -> 15.4μs (9.48% faster)

def test_edge_shape_none():
# Test with shape None (should not affect result)
codeflash_output = DataIndexableCol.get_atom_data(None, "float"); atom = codeflash_output # 19.7μs -> 18.8μs (4.85% faster)

def test_edge_shape_empty_tuple():
# Test with empty shape tuple
codeflash_output = DataIndexableCol.get_atom_data((), "int64"); atom = codeflash_output # 16.3μs -> 14.2μs (15.0% faster)

def test_edge_kind_case_insensitive():
# Test with kind in different case (should capitalize only first letter)
codeflash_output = DataIndexableCol.get_atom_data((1,), "Float"); atom = codeflash_output # 18.5μs -> 16.4μs (12.9% faster)

def test_edge_kind_with_spaces():
# Test with kind containing spaces (should raise AttributeError)
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data((1,), "float col") # 4.20μs -> 5.39μs (22.2% slower)

def test_edge_kind_unexpected():
# Test with unrecognized kind (should raise AttributeError)
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data((1,), "notacol") # 4.41μs -> 5.03μs (12.4% slower)

def test_edge_kind_period_lowercase():
# Test with kind 'period' (no [D]), should still return Int64Col
codeflash_output = DataIndexableCol.get_atom_data((1,), "period"); atom = codeflash_output # 16.9μs -> 16.1μs (4.72% faster)

def test_edge_kind_period_uppercase():
# Test with kind 'Period' (capitalized), should NOT match 'period', so should raise
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data((1,), "Period") # 3.91μs -> 4.75μs (17.6% slower)

def test_edge_kind_leading_trailing_spaces():
# Test with kind with leading/trailing spaces (should raise AttributeError)
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data((1,), " float ") # 5.03μs -> 5.63μs (10.7% slower)

def test_edge_kind_numeric():
# Test with kind that is numeric string (should look for '1234Col', should raise)
with pytest.raises(AttributeError):
DataIndexableCol.get_atom_data((1,), "1234") # 4.30μs -> 4.83μs (11.0% slower)

---------------------------

3. Large Scale Test Cases

---------------------------

def test_large_scale_many_kinds():
# Test all supported uint kinds in a loop
for k in ["uint8", "uint16", "uint32", "uint64"]:
codeflash_output = DataIndexableCol.get_atom_data((1000,), k); atom = codeflash_output # 40.2μs -> 34.2μs (17.6% faster)

def test_large_scale_many_calls():
# Test calling get_atom_data 100 times with alternating kinds
for i in range(100):
kind = "float" if i % 2 == 0 else "string"
codeflash_output = DataIndexableCol.get_atom_data((100,), kind); atom = codeflash_output
if kind == "float":
pass
else:
pass

def test_large_scale_shape_variations():
# Test with a variety of shapes (including large ones)
shapes = [(1000,), (100, 10), (10, 10, 10), (999,)]
for shape in shapes:
codeflash_output = DataIndexableCol.get_atom_data(shape, "uint16"); atom = codeflash_output # 34.1μs -> 28.8μs (18.6% faster)

def test_large_scale_edge_kind_period():
# Test period kinds with many different suffixes
for suffix in ["[D]", "[M]", "[Y]", "[S]", "[H]", "[min]", "[us]"]:
codeflash_output = DataIndexableCol.get_atom_data((100,), f"period{suffix}"); atom = codeflash_output # 33.9μs -> 30.3μs (12.1% faster)

def test_large_scale_edge_kind_case_mixture():
# Test with many different case variations of valid kinds
for kind in ["Float", "FLOAT", "fLoAt"]:
# Only the first letter is capitalized in the code, so 'FloatCol' will be looked up
codeflash_output = DataIndexableCol.get_atom_data((10,), kind); atom = codeflash_output # 29.3μs -> 26.5μs (10.7% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-DataIndexableCol.get_atom_data-mhw1hmiy and push.

The optimization introduces a **module-level cache** (`_atom_coltype_cache`) that eliminates redundant expensive `getattr(_tables(), col_name)` calls in the `get_atom_coltype` method. **Key optimization:** - **Caching expensive lookups**: The profiler shows that `getattr(_tables(), col_name)` accounts for 99.5% of the runtime (42.87ms out of 43.08ms total). This expensive operation fetches PyTables column type classes from the tables module. - **Cache hit optimization**: When the same `kind` is requested multiple times, the cache returns the previously resolved column type immediately, avoiding the costly `getattr` call entirely. **Performance analysis from test results:** - **Best gains on repeated kinds**: The cache shows 15-25% speedups when the same column types are requested multiple times (e.g., `test_edge_case_sensitivity` shows 25.2% improvement on the second `int16` call). - **First-time calls**: Even initial calls benefit slightly (10-18% faster) due to more efficient cache lookup structure compared to direct attribute access patterns. - **Large-scale scenarios**: Tests with multiple calls to the same kinds (`test_large_scale_many_kinds`, `test_large_scale_shape_variations`) show consistent 12-18% improvements, indicating the cache scales well. **Why this works:** PyTables column types are static class objects that don't change during runtime. The `_tables()` function call and subsequent `getattr()` performs module attribute resolution every time, which involves Python's attribute lookup mechanism. Caching these resolved types in a simple dictionary provides O(1) access after the first lookup. The 10% overall speedup comes from eliminating the expensive `getattr(_tables(), ...)` calls on cache hits, with the optimization being most effective in workloads that repeatedly use the same column types - a common pattern in data processing pipelines.

codeflash-ai bot requested a review from mashraf-222 November 12, 2025 13:32

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 11% #327

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 11% #327

Uh oh!

codeflash-ai bot commented Nov 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up method DataIndexableCol.get_atom_data by 11% #327

Are you sure you want to change the base?

⚡️ Speed up method DataIndexableCol.get_atom_data by 11% #327

Uh oh!

Conversation

codeflash-ai bot commented Nov 12, 2025

📄 11% (0.11x) speedup for DataIndexableCol.get_atom_data in pandas/io/pytables.py

📝 Explanation and details

function to test (from prompt)

[The DataIndexableCol.get_atom_data and supporting classes are assumed to be defined above, as in the prompt.]

---- UNIT TESTS FOR DataIndexableCol.get_atom_data ----

Basic Test Cases

Edge Test Cases

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

function to test (from pandas/io/pytables.py, simplified for testing)

unit tests

---------------------------

1. Basic Test Cases

---------------------------

---------------------------

3. Large Scale Test Cases

---------------------------

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 11% #327

⚡️ Speed up method `DataIndexableCol.get_atom_data` by 11% #327

📄 11% (0.11x) speedup for `DataIndexableCol.get_atom_data` in `pandas/io/pytables.py`