@andrewmonostate

@andrewmonostate andrewmonostate commented Aug 18, 2025

Problem

The Kaitai Struct compiler generates invalid Python code when field names use Python reserved keywords. For example, a field named class generates self.class = ..., which causes a SyntaxError because class is a reserved keyword in Python.

Real-world Impact

This issue affects real format specifications, notably the OpenPGP message format, which contains a field named class. The generated Python parsers cannot be imported due to syntax errors.

Solution

This PR adds proper escaping of Python reserved keywords by appending an underscore, following PEP 8 conventions:

  • class → class_
  • def → def_
  • if → if_
  • etc.
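
The escaping rule can be illustrated in Python itself. This is only a sketch of the idea, not the PR's Scala code: `escape_python_keyword` is a hypothetical name, and using the standard `keyword` module as the source of reserved words is an assumption (the PR keeps its own PYTHON_RESERVED_WORDS set).

```python
import keyword


def escape_python_keyword(name: str) -> str:
    """Append an underscore if `name` is a Python reserved keyword (PEP 8 style)."""
    # keyword.iskeyword() covers keywords such as `class`, `def` and `if`;
    # on Python 3.7+ it also reports `async` and `await`.
    return name + "_" if keyword.iskeyword(name) else name


for field in ["class", "def", "if", "lambda", "length"]:
    print(field, "->", escape_python_keyword(field))
```

Names that are not reserved, such as length, pass through unchanged.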

Changes Made

  1. Added reserved keyword detection (PythonCompiler.scala):

    • Added PYTHON_RESERVED_WORDS set containing all Python reserved keywords
    • Added escapePythonKeyword() function to check and escape keywords
    • Modified idToStr() to apply escaping for NamedIdentifier and InstanceIdentifier
  2. Added comprehensive tests (PythonCompilerSpec.scala):

    • Tests for common reserved keywords (class, def, if, lambda, etc.)
    • Tests for async/await keywords
    • Tests that non-reserved words are not escaped
    • Tests for instance identifiers with reserved keywords
    • All 12 tests pass ✅

Test Results

[info] PythonCompilerSpec:
[info] Run completed in 233 milliseconds.
[info] Total number of tests run: 12
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

Example

Before (causes SyntaxError):

self.class = self._io.read_u1()

After (valid Python):

self.class_ = self._io.read_u1()
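
One way to check the two lines above without running a parser is to ask Python to compile them. The helper name `compiles` is hypothetical and exists only for this illustration; the statements are compiled, never executed, so `self` does not need to exist.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


before = "self.class = self._io.read_u1()"
after = "self.class_ = self._io.read_u1()"

print(compiles(before))  # False
print(compiles(after))   # True
```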

Compatibility

  • Works with all Python versions (reserved keywords have been stable)
  • Follows Python's PEP 8 naming conventions
  • Non-breaking change for existing valid field names

Problem: The Kaitai Struct compiler generates Python code that uses reserved
keywords as identifiers (e.g., 'self.class = ...'), causing SyntaxError
when the generated code is imported.

Solution: Added proper escaping of Python reserved keywords by appending
an underscore to any identifier that matches a Python reserved word.

Changes:
- Added PYTHON_RESERVED_WORDS set with all Python reserved keywords
- Added escapePythonKeyword() function to check and escape keywords
- Modified idToStr() to apply escaping for NamedIdentifier and InstanceIdentifier
- Added comprehensive test suite in PythonCompilerSpec.scala

Test results: All 12 tests pass, verifying correct escaping behavior

This fixes generated code that uses fields named 'class', 'def', 'if', etc.,
which are common in binary format specifications like OpenPGP.
@GreyCat
Member

GreyCat commented Aug 23, 2025

@andrewmonostate, thanks for your contribution!

Unfortunately, this is a rather complicated problem, and I believe we need to start with some kind of test that actually proves the problem exists. A good place to add it would be our general test suite. I went ahead and created a proposal there in kaitai-io/kaitai_struct_tests#136; please take a look.

For reference, current master without your patch results in:

======================================================================
ERROR: test_reserved_python_keywords (unittest.loader._FailedTest.test_reserved_python_keywords)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_reserved_python_keywords
Traceback (most recent call last):
  File "/usr/lib64/python3.13/unittest/loader.py", line 396, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib64/python3.13/unittest/loader.py", line 339, in _get_module_from_name
    __import__(name)
    ~~~~~~~~~~^^^^^^
  File "tests/spec/python/test_reserved_python_keywords.py", line 10
    self.assertEqual(r.and, 1)
                       ^^^
SyntaxError: invalid syntax

@Mingun
Contributor

Mingun commented Aug 23, 2025

Quick question: does your fix prevent name clashes? For example, if you have two attributes class and class_ in your KSY, how will you deal with that?

@generalmimon
Member

@Mingun:

> Quick question: does your fix prevent name clashes? For example, if you have two attributes class and class_ in your KSY, how will you deal with that?

When you think about what having foo_bar and foo_bar_ in the same scope would mean for camel-case languages where the underscores are removed and thus both would become fooBar, this situation should clearly be forbidden on the .ksy level (otherwise KSC would accept a .ksy spec that could only be translated into snake-case languages, harming the language agnosticity of Kaitai Struct), so there is actually no point in considering it here.
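
The collision described here is easy to reproduce. The converter below is a simplified stand-in written for this illustration, not the compiler's actual conversion routine, and it assumes a non-empty name that starts with a letter.

```python
def to_lower_camel(name: str) -> str:
    """Convert a snake_case identifier to lowerCamelCase, dropping underscores."""
    # Empty parts come from trailing or doubled underscores and are discarded.
    parts = [p for p in name.split("_") if p]
    return parts[0] + "".join(p.capitalize() for p in parts[1:])


print(to_lower_camel("foo_bar"))   # fooBar
print(to_lower_camel("foo_bar_"))  # fooBar
```

Both spellings map to the same identifier, so a spec containing both could only be translated for snake-case targets.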

@Mingun
Contributor

Mingun commented Aug 23, 2025

I would disagree with that. Why should a concrete translator implementation dictate rules for a language-agnostic language? Only the translator is responsible for ensuring that it does not create invalid or clashing identifiers. Anyway, it is a good idea to think about what happens if your name mangling algorithm generates names without any signs of mangling. It should be mentioned at least, so that the question of mangling does not arise. If the described situation is impossible, then you can just say so. It's not hard, right?

@GreyCat
Member

GreyCat commented Aug 24, 2025

I concur with @generalmimon: given we have a pretty long history of interpreting underscores and translating them into camel case, foo_bar_ vs foo_bar is a conflict we have already.

Two ways out would be:

  • Thinking of some other projection mechanism for foo_bar_ (e.g. turning it into fooBar_)
  • Banning foo_bar_ altogether

I'm leaning towards stricter naming rules going forward, as ultimately thinking of more and more underscore-related exceptions will become insanely difficult. Even if we allow foo_bar_, what about:

  • foo_bar__ => fooBar__?
  • foo__bar => foo_Bar?
  • foo__bar__ => foo_Bar__?

Don't forget numbers too:

  • foo2bar
  • foo2_bar
  • foo_2bar
  • foo_2_bar
  • foo2__bar
  • foo__2bar
  • foo_2_bar

Thinking of non-self-contradicting transformation rules here will be rather hard if not impossible.
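
For illustration, a stricter rule of the "ban it altogether" kind could look like the sketch below. The pattern is purely hypothetical and is not an existing or proposed Kaitai Struct rule: it allows lowercase words separated by single underscores and rejects leading, trailing and doubled underscores.

```python
import re

# Hypothetical stricter identifier rule, made up for this example.
STRICT_ID = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_strict_id(name: str) -> bool:
    """Return True if `name` satisfies the hypothetical strict rule."""
    return STRICT_ID.fullmatch(name) is not None


for name in ["foo_bar", "foo_bar_", "foo_bar__", "foo__bar", "foo2_bar", "foo_2bar"]:
    print(name, is_strict_id(name))
```

Such a rule would reject the underscore variants listed above, but it says nothing about the digit cases: foo2_bar and foo_2bar both pass, and how they should map to camel case remains an open question.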

@Mingun
Contributor

Mingun commented Aug 24, 2025

Just leave the decision to the concrete language generators. A reasonable default would be to append an increasing number to the name to prevent conflicts. If we have class, class_ and class1 fields, then the non-conflicting names would be:

| Original name | Translated name | Comment |
|---|---|---|
| class | class2 | Add a number first, but class1 is already occupied, so increase the number |
| class_ | class3 | Add a number first, but class1 and class2 are already occupied, so increase the number twice |
| class1 | class1 | This is not a reserved word, so keep it as is |

The algorithm should first assign names to all fields whose names are allowed by the target language (class1) and then generate names for the fields whose names are reserved.

Step 1:
Transformed names: []

  1. class -> do the transformation: class -> reserved, postpone
  2. class_ -> do the transformation: class -> reserved, postpone (but after processing class)
  3. class1 -> do the transformation: class1 -> not reserved, not conflicted, assign

Step 2:
Transformed names: [class1 => class1]

  1. class -> do the transformation: class (in fact, you can use the already remembered transformation from step 1)
    1. reserved, append number: class1
    2. conflicted with existing name, increase number: class2
    3. this name is free and not reserved, done. Transformed names: [class1 => class1, class => class2]
  2. class_ -> do the transformation: class
    1. reserved, append number: class1
    2. conflicted with existing name, increase number: class2
    3. conflicted with existing name, increase number: class3
    4. this name is free and not reserved, done. Transformed names: [class1 => class1, class => class2, class_ => class3]
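
The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not the Rust implementation: `transform` and `assign_names` are hypothetical names, and `transform` merely stands in for the target language's case conversion by dropping trailing underscores, which is what turns class_ into class in the walkthrough.

```python
import keyword


def transform(name: str) -> str:
    # Stand-in for the target language's case conversion (an assumption):
    # like the walkthrough, it turns `class_` into `class`.
    return name.rstrip("_")


def assign_names(fields):
    """Map each original field name to a unique, non-reserved name."""
    mapping = {}    # original name -> generated name
    used = set()    # generated names that are already taken
    postponed = []  # fields that need a numeric suffix

    # Step 1: keep every name that is neither reserved nor already taken.
    for name in fields:
        candidate = transform(name)
        if keyword.iskeyword(candidate) or candidate in used:
            postponed.append((name, candidate))
        else:
            mapping[name] = candidate
            used.add(candidate)

    # Step 2: give each postponed field the first free numbered name.
    for name, base in postponed:
        n = 1
        while f"{base}{n}" in used or keyword.iskeyword(f"{base}{n}"):
            n += 1
        mapping[name] = f"{base}{n}"
        used.add(mapping[name])

    return mapping


print(assign_names(["class", "class_", "class1"]))
# {'class1': 'class1', 'class': 'class2', 'class_': 'class3'}
```

The result matches the final "Transformed names" list of step 2.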

This is the Rust code that I wrote to implement this algorithm for my ksc-rs project (it has not been published yet):

/// Defines a user-defined type
#[derive(Clone, Debug, Default, PartialEq)]
pub struct UserType {
  /// The list of fields that this type consists of. The fields in the data stream
  /// are in the same order as they are declared here.
  pub fields: IndexMap<SeqName, Attribute>,
  /// List of dynamic and calculated fields of this type. The position of these fields
  /// is not fixed in the type, and they may not even be physically represented in the
  /// data stream at all.
  pub instances: IndexMap<FieldName, Instance>,
  /// List of user-defined types, defined inside this type.
  pub types: IndexMap<TypeName, UserType>,
  /// List of enumerations defined inside this type.
  pub enums: IndexMap<EnumName, Enum>,
  // pub params: IndexMap<ParamName, Param>, //TODO: Parameters
}
impl UserType {
  /// Returns a map with field names that are valid according to the target
  /// language rules. Generation and validity checks are performed by the
  /// `generator` function, which receives 2 parameters:
  /// - the field name for which a name is generated
  /// - the generation attempt. Attempts start from zero and increase
  ///   over time. The generator MUST return a different name on each attempt,
  ///   otherwise the algorithm will loop forever.
  ///
  /// # Parameters
  /// - `generator`: a name generator function. This function should return
  ///   `Some(name)` if the name generated in this attempt is valid and `None` if not.
  ///
  /// # Example
  ///
  /// ```
  /// # use ksc::model::UserType;
  /// use ksc::model::AttributeName::*;
  /// use heck::ToLowerCamelCase;
  ///
  /// # let ty: UserType = UserType::default();/*
  /// let ty: UserType = ...;
  /// # */
  ///
  /// // List of reserved words
  /// const KEYWORDS: [&'static str; 1] = [
  ///   "enum",
  /// ];
  ///
  /// let names = ty.attribute_names(|name, attempt| {
  ///   // Generate a name by converting the KSY field name
  ///   // to mixedCase (Java field style names) on the first attempt
  ///   // and adding a numerical suffix on later attempts
  ///   let generated = match (attempt, name) {
  ///     // it is better to generate non-intersecting names for
  ///     // unnamed fields, for example, in this case, starting
  ///     // with an underscore (because such names are not possible
  ///     // for named fields after the mixedCase conversion), but
  ///     // that is not strictly necessary
  ///     (0, Unnamed(i)) => format!("unnamed{}", i),
  ///     (a, Unnamed(i)) => format!("unnamed{}{}", i, a),
  ///
  ///     (0, Seq(n)) => n.to_lower_camel_case(),
  ///     (a, Seq(n)) => format!("{}{}", n.to_lower_camel_case(), a),
  ///
  ///     (0, NonSeq(n)) => n.to_lower_camel_case(),
  ///     (a, NonSeq(n)) => format!("{}{}", n.to_lower_camel_case(), a),
  ///   };
  ///
  ///   // Check if a generated name conflicts with a keyword and
  ///   // return `None` if that is true
  ///   match KEYWORDS.binary_search(&generated.as_ref()) {
  ///     Ok(_) => None,
  ///     Err(_) => Some(generated),
  ///   }
  /// });
  /// ```
  pub fn attribute_names<G, R>(
    &self,
    generator: G,
  ) -> IndexMap<AttributeName, R>
    where G: Fn(AttributeName, usize) -> Option<R>,
          R: Clone + Eq + Hash + AsRef<str>,
  {
    let mut mapping = IndexMap::new();
    let mut used_names = IndexMap::new();
    let mut unprocessed_named = Vec::new();
    let mut unprocessed_unnamed = Vec::new();

    enum State<'a> {
      /// Mapping was added
      Inserted,
      /// Old mapping was replaced by new one, the old mapped value returned
      Replaced(AttributeName<'a>),
      /// Mapping was not done because there is an invalid name or unnamed field
      Postponed,
    }

    let mut generate = |attempt, name, original| {
      if let Some(generated) = generator(name, attempt) {
        match used_names.entry(generated.clone()) {
          // If name not used yet, register mapping
          Entry::Vacant(e) => {
            mapping.insert(name, generated);
            e.insert(name);
            State::Inserted
          },
          // If mapping already used, but generated name does not match
          // original one, and new candidate has the same name as generated one,
          // replace mapping
          Entry::Occupied(mut e) => {
            if original == generated.as_ref() && mapping.get(&name) != Some(&generated) {
              mapping.insert(name, generated);
              let old = e.insert(name);
              mapping.swap_remove(&old);
              State::Replaced(old)
            } else {
              State::Postponed
            }
          },
        }
      } else {
        State::Postponed
      }
    };

    // First, use all names that do not violate the rules. Unnamed fields do
    // not violate the rules, but their generated names could conflict with the
    // explicitly defined ones, so we postpone their name generation
    for (name, _) in &self.fields {
      match name {
        OptionalName::Unnamed(i) => unprocessed_unnamed.push(*i),
        OptionalName::Named(n) => {
          match generate(0, AttributeName::Seq(n), n.deref()) {
            State::Inserted => {},
            State::Replaced(old) => unprocessed_named.push(old),
            State::Postponed => unprocessed_named.push(AttributeName::Seq(n)),
          }
        }
      }
    }

    for (name, _) in &self.instances {
      match generate(0, AttributeName::NonSeq(name), name.deref()) {
        State::Inserted => {},
        State::Replaced(old) => unprocessed_named.push(old),
        State::Postponed => unprocessed_named.push(AttributeName::NonSeq(name)),
      }
    }

    for name in unprocessed_named {
      match name {
        AttributeName::Seq(n) | AttributeName::NonSeq(n) => {
          for attempt in 1.. {
            if let State::Inserted = generate(attempt, name, n.deref()) {
              break;
            }
          }
        },
        _ => unreachable!(),
      }
    }

    // Generate names for unnamed fields in the last stage
    for i in unprocessed_unnamed {
      for attempt in 0.. {
        if let State::Inserted = generate(attempt, AttributeName::Unnamed(i), "") {
          break;
        }
      }
    }

    mapping
  }
}

@generalmimon
Member

generalmimon commented Aug 24, 2025

@Mingun I'm not convinced that numeric suffixes are a good idea, because you inevitably make the name translation of one attribute name dependent on what the other attributes are named. This implies that if the .ksy spec is changed in a way that some of the other attributes are removed/renamed, it may influence this attribute, even though it was not touched. That may cause the user's application code to break.

BTW, this is already the case with _unnamed* fields which are created when you omit id. For example, if you add an attribute at the beginning of seq, all subsequent _unnamed* fields are renumbered. It's not a problem, because omitting id should only be done for fields with an unknown purpose, which should not be used in user applications. But it illustrates the problem well.

Another problem is that with numeric suffixes, you cannot tell from the .ksy specification how conflicting names will be numbered. For example, if you have foo_bar_ and foo__bar, will it be fooBar and fooBar1 respectively, or the other way around? You can issue warnings about this, which mitigates this problem somewhat, but I don't think you should be forced to study warnings just to learn what today's mapping from .ksy names to generated code names is.

I also don't think we need to be linguistic purists - we don't have to support everything. I don't think a .ksy spec with two names in the same scope that differ only in underscores follows good practices (and also if we reject it, I'd argue it's unlikely to affect many people), so I definitely agree with @GreyCat on stricter naming rules. If there's a need for similar names, why couldn't the author of the .ksy specification add a numerical suffix themselves (e.g. id: foo_bar, id: foo_bar1)? This would avoid the problems I have described.

@Mingun
Contributor

Mingun commented Aug 24, 2025

> @Mingun I'm not convinced that numeric suffixes are a good idea,

Using numerical suffixes is not necessary, but it is the simplest scalable solution with clear semantics and predictable rules.

> because you inevitably make the name translation of one attribute name dependent on what the other attributes are named.

This is the only possible solution if you want to prevent name conflicts. Preventing a conflict implies that you need to know about the other names in the same scope.

> This implies that if the .ksy spec is changed in a way that some of the other attributes are removed/renamed, it may influence this attribute, even though it was not touched.

Of course, but this will only happen with attributes that already require some non-trivial name translation (mostly attributes with reserved names). The number of such attributes, however, is quite small, because there are usually few reserved words, and it is unlikely that other attributes whose names match the reserved attribute's name after its trivial transformation will appear next to them. Anyway, if you remove or rename something, you are already making incompatible changes, and it is expected that dependents should be updated too. Use semver and everything will be fine.

> Another problem is that with numeric suffixes, you cannot tell from the .ksy specification how conflicting names will be numbered. For example, if you have foo_bar_ and foo__bar, will it be fooBar and fooBar1 respectively, or the other way around? You can issue warnings about this, which mitigates this problem somewhat, but I don't think you should be forced to study warnings just to learn what today's mapping from .ksy names to generated code names is.

You really do not have any other option. The only advice is to avoid giving attributes in one scope names that are too similar (differing only in the number of underscores, for example).

> I also don't think we need to be linguistic purists - we don't have to support everything. I don't think a .ksy spec with two names in the same scope that differ only in underscores follows good practices (and also if we reject it, I'd argue it's unlikely to affect many people), so I definitely agree with @GreyCat on stricter naming rules. If there's a need for similar names, why couldn't the author of the .ksy specification add a numerical suffix themselves (e.g. id: foo_bar, id: foo_bar1)? This would avoid the problems I have described.

You lose sight of the fact that the main problem is that the converted names may match existing ones, even if everything is fine in your KSY. And you need to solve this problem, because it is impossible to prohibit all words that are reserved or otherwise unacceptable in all languages. An example is Java. You can name a type class in KSY. Working with this in Java will be unpleasant, but possible. But you cannot name an attribute class because, according to the conversion rules for Java, it will turn into getClass(), which is the name of a special method that exists in all objects. You naturally cannot foresee all similar cases in all languages.

What distinguishes mediocre software from good software is how the corner cases are handled. Don't be mediocre. As you yourself said, we don't have to support everything. But we can make things more end-user friendly. The user should not have to wonder why he cannot use the word as as an attribute name when generating Java code just because it is a reserved word in Python, a language he was never going to target. In Java, it will always turn into either As, AS, getAs or setAs, and I don't see any problems with that.

@andrewmonostate
Author

I want to address the whole “underscore vs numeric suffix” debate in detail, because frankly this seems unreal to me: it is very over-engineered, and it distracts from the actual bug we need to solve, which is real people's code not compiling.

  1. The bug is real and already proven.
    Generated Python code with reserved keywords (class, def, etc.) simply doesn’t import — it throws a SyntaxError. I already provided a compiler-level unit test in my PR, and now there’s a duplicated integration test in the main suite. Great, the bug is confirmed. The conversation should have stopped there: fix it with the most idiomatic, Python-appropriate solution.

  2. Underscore is the Pythonic solution.
    This isn’t arbitrary. PEP8, the Python stdlib, and virtually every OSS project uses a trailing underscore to escape reserved keywords. Example straight from the docs: class_. This is clear, readable, predictable, and already familiar to every Python developer. Numeric suffixes (class2, class3) are not idiomatic, harder to read, and make code feel generated and clunky.

  3. The “but what if foo_bar_ and foo__bar collide” argument is absurd.
    Yes, in pathological cases you can construct names that are weird after mangling. But the same is true in every language. At some point, if you’re authoring a .ksy with foo_bar_ and foo__bar in the same scope, you’re already writing bad specs. It’s not Kaitai’s job to enable every self-contradictory naming scheme under the sun. This is no different from C disallowing two functions that only differ by case on Windows, or Java banning identifiers that collide with getters. You set reasonable rules, not try to support nonsense.

  4. “Numeric suffixes scale better” is a non-argument.
    The proposed numbering system is worse because:
    • It makes name resolution dependent on what else is in the scope (brittle).
    • It produces unpredictable mappings (class might become class2 in one version, class3 in another if something else changes).
    • It’s not transparent from the .ksy spec — you can’t tell what the final code will look like without generating it.
    • It breaks the principle of least surprise for Python developers.

This “solution” actually introduces instability into downstream code. You edit your .ksy slightly and suddenly different attributes get renumbered. That’s not robustness, that’s chaos at your fingertips!

  5. “We must preserve language-agnosticism” is absurd.
    Kaitai already applies language-specific rules (camelCase, getX in Java, etc.). That’s the entire point of translators. Pretending that the Python backend can’t adopt Pythonic conventions because “language agnosticity” is just moving the goalposts. By that logic, we shouldn’t be adding any reserved-word handling at all, because it’s inherently language-specific.

  6. Practical reality beats theoretical puritans.
    Right now, users cannot generate working Python code. That’s a real, blocking bug. The underscore solution fixes it immediately in a way that is familiar and natural. Numeric suffixing and endless debates about edge-case underscores solve nothing except giving contributors the feeling of pain and stalling an obvious fix.

Conclusion:
• Append _ to reserved keywords (PEP8 convention, stable, predictable).
• Why overthink pathological naming collisions? They are rare, and stricter rules on the .ksy side are fine if needed.
• Stop chasing numeric suffix fantasies - they make code unpredictable and harder to maintain.
• Merge the tested, working fix, and then discuss incremental refinements if truly necessary.

Right now the project is burning contributor goodwill over pointless nitpicking instead of just shipping a fix that unblocks users. That’s the real risk here, not whether someone somewhere might define both foo_bar_ and foo__bar.

And by the way, I see this exact discussion has been going in circles since 2016 in issue #90. That’s 9 years of arguing over edge-cases while leaving the real bug unfixed. This is exactly why contributors get frustrated. The practical solution here is simple and Pythonic: underscore escape.

@andrewmonostate
Author

andrewmonostate commented Aug 28, 2025

If you guys need more details or proof of the error, see my original issue #1249. I provided a concrete Python reproduction showing how class generates invalid code (SyntaxError). Instead of treating that as a separate, actionable Python bug, it was closed and merged into #90 - a 9-year-old thread that’s been stuck in theoretical name-mangling debates since 2016 (2 world cups, 2 Olympic Games and a pandemic have passed and that issue is still alive!).

That issue is exactly why nothing has moved: endless speculation about “perfect universal rules” while real users are left with broken generated code. It makes no sense to bury fresh, clear bug reports into decade-old design debates. The practical fix is simple (underscore escape) and already tested and cleanly implemented.

@Mingun
Contributor

Mingun commented Aug 29, 2025

> I want to address the whole “underscore vs numeric suffix” debate

I think you got the substance of the debate wrong. There is no "underscore vs numeric suffix" debate. There is only a discussion of how to implement this correctly, without introducing hidden bugs. For example, if you look at my suggestion carefully, you will see that the concrete language translator itself decides exactly how it will convert the original names into final ones. If you want, on the first attempt you can just add an underscore. If this name is free, it will be used. But if not, you'll have another go. Now you can add a number, add a number and underscore, underscore and number, or whatever you want.

> By that logic, we shouldn’t be adding any reserved-word handling at all, because it’s inherently language-specific.

You're absolutely right! Reserved-word handling is the task of each language translator, not of the Kaitai core. The core may only provide helpers that make this possible with minimal effort, one of which I provided in the code snippet above.

> ... endless debates about ... solve nothing except giving contributors the feeling of pain and stalling an obvious fix.

Unfortunately, in this repository, every issue and PR follows this rule, as if they only exist for this 😞.

> Right now the project is burning contributor goodwill over pointless nitpicking instead of just shipping a fix that unblocks users.

Totally true.

> The practical fix is simple (underscore escape) and already tested and cleanly implemented.

I only want to emphasize that the solution I suggest is as simple as yours, but additionally takes all corner cases into account. It complements, not replaces, yours. Since all language translators live in the same repository as the core compiler, you don't even need to split the PR into several PRs across multiple repositories. You just need to implement a simple, bulletproof solution and forget about this problem forever!

@andrewmonostate
Author

@Mingun Thanks for clarifying, but I think this is still missing the point. I get that I'm new here, and I don't want to be disrespectful or disrupt the project's normal processes.

That said, please note:

1.	There was an underscore vs numeric debate.

Multiple comments (including yours) explicitly proposed numeric suffixes or hybrid schemes as “bulletproof.” That’s what I was responding to. My point is simple: in Python, underscore is the idiomatic, PEP8-endorsed, standard solution (class → class_). Introducing numbering or “underscore+number+whatever” schemes is neither idiomatic nor predictable. It’s actually less safe because it creates hidden dependencies on what else exists in scope and makes downstream code unstable when .ksy specs evolve.

2.	The “bulletproof” claim is not feasible.

No naming scheme can be bulletproof across all languages, as you guys reached consensus yourselves, because each has its own reserved words, casing, getter/setter conventions, etc. We already accept language-specific translators that apply language-specific rules. Pretending we can solve everything with one meta-algorithm in the core is chasing a ghost. That’s why Python should follow Pythonic rules, Java should follow Java rules, etc.

3.	“Try underscore, then fall back to number” makes things worse.

That produces names that are unpredictable and brittle: class could become class_ in one version and class1 in another if something else collides later. That is exactly the kind of hidden bug you claim to want to avoid — downstream code silently breaks because the mapping shifted. A stable, deterministic escape (class → class_) is far safer than a cascading fallback scheme.

4.	Scope creep vs fixing the bug.

The actual bug here is simple: generated Python code with reserved keywords doesn’t import. That is immediately fixed by underscore escape. Your “unified helper” idea might be interesting for core someday, but holding up a working Python fix while chasing “bulletproof for all translators forever” is how we end up with issue #90 sitting open for 9 years with no resolution.

5.	Complement, not replace.

If your helper really does complement the fix, then the obvious path is: merge the Python fix now (tested, idiomatic, solves the bug), then extend with core-level helpers incrementally. Not stall the urgent bugfix for another abstract round of “what if” scenarios.

At the end of the day, this is not about inventing the perfect universal mangling algorithm. It’s about shipping a fix that makes Kaitai’s Python output usable. That should be the priority.

TLDR:
No naming scheme will ever be bulletproof across all languages - discussing otherwise won't change this. Pythonic fix, aligned with python ideology and standards, that makes the code run, is proposed and contains multiple tests to show it works.

@Mingun
Contributor

Mingun commented Aug 29, 2025

PEP8 has a concrete purpose: to allow identifiers that look like keywords. It does not solve the name clashing problem, which the translator should prevent when applying rename rules, because if the KSY model is checked and correct, the translator must generate correct code.

You are trying to solve only the forbidden identifiers problem, but at the same time you introduce another bug by creating a name clash problem. In my opinion, if you do something, you should do it right, i.e. solve that problem too, especially when it was highlighted and a solution was suggested.

> makes downstream code unstable when .ksy specs evolve.

Any change makes downstream code unstable. You may rename class to clazz, and then downstream code will stop compiling or working correctly.

> No naming scheme can be bulletproof across all languages

That is not required. Read what I wrote several times: each language should use its own naming scheme. This is what I am trying to convey and what @generalmimon disagrees with. They may be identical for different languages, but that is not required.

> That’s why Python should follow Pythonic rules, Java should follow Java rules, etc.

Yes. Yes! Use the idiomatic "to pythonic case + append underscore" on the first try, but use something else when that does not work. The code snippet I provided only requires you to define the rules for what to do on each try.

> That produces names that are unpredictable and brittle: class could become class_ in one version and class1 in another if something else collides later.

They are predictable. Numbers are simply the most predictable parts of names. class -> class_ may only change to class -> class1 if you add a field class_ to the KSY, because explicit names always take precedence. If the name class_ already exists explicitly, don't you think it's strange to rename some other field to it?
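
The "underscore first, then something else" behaviour described here can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Rust helper: `python_candidates` and `assign_names` are hypothetical names, and the fallback order (name, name_, name1, name2, ...) is just one possible choice a Python translator could make.

```python
import keyword
from itertools import count


def python_candidates(name):
    """Yield candidates in order: the name itself, `name_`, then name1, name2, ..."""
    yield name
    yield name + "_"
    for n in count(1):
        yield f"{name}{n}"


def assign_names(fields):
    """Map each field to the first candidate that is free and not reserved."""
    mapping, used = {}, set()
    # Explicit, valid names are claimed first, so they always take precedence.
    valid = [f for f in fields if not keyword.iskeyword(f)]
    reserved = [f for f in fields if keyword.iskeyword(f)]
    for name in valid + reserved:
        for candidate in python_candidates(name):
            if not keyword.iskeyword(candidate) and candidate not in used:
                mapping[name] = candidate
                used.add(candidate)
                break
    return mapping


print(assign_names(["class", "length"]))
# {'length': 'length', 'class': 'class_'}
print(assign_names(["class", "class_", "length"]))
# {'class_': 'class_', 'length': 'length', 'class': 'class1'}
```

In this sketch the mapping for class changes only when a field explicitly named class_ appears in the same scope.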

> 4. Scope creep vs fixing the bug.
> 5. Complement, not replace.

Come on, this project is in a half-dead state. If you need the fix, maintain your own fork. There, if you allow yourself, you can implement any ad-hoc solutions that work for your case but may break other cases that you are not interested in. But I feel that for a shared project this would be a totally wrong approach.

> Not stall the urgent bugfix for another abstract round of “what if” scenarios.

I would agree with that if the correct fix were not known, but that is not the case. And as you may have noticed yourself, any "urgent" fixes should go into your own fork (of course, if by "urgent" you mean something other than "in this decade, maybe").

@andrewmonostate
Author

@Mingun I think you’re overstating the “clash problem” and missing the practical priority here.

1.	PEP8 and reserved identifiers

PEP8 isn’t just about “looking like” keywords — it’s the de facto convention for escaping them. class → class_ is what every Python developer expects. That solves the real, immediate bug: generated code doesn’t even import. That’s priority #1.

2.	“But you introduce a clash bug”

This is a hypothetical overreaction. If someone actually defines both class and class_ in the same .ksy, they’ve already written a bad spec. Kaitai already rejects or transforms other self-contradictory cases (like underscores collapsing in camelCase languages). It’s not the translator’s job to endlessly sanitize pathological naming collisions; it’s the spec author’s job to not create them. Right?

3.	“Numbers are the most predictable part of names”

This is where we disagree. A user reading class_ knows immediately what happened: it was a reserved word, escaped Pythonically. A user reading class2 has no clue unless they reverse engineer the translator’s fallback logic. Worse: when .ksy evolves, numbers do shift around, which silently breaks downstream code. That’s not predictability, that’s instability disguised as determinism.

4.	“Just fork it if you need it”

With respect, that’s not how healthy OSS works. The point of this project is to provide working codegen for supported languages. Telling contributors to fork and patch themselves instead of merging a minimal, idiomatic bugfix is how projects rot. This is the definition of paralysis by analysis - endless “what if” debates while users are stuck with broken code.

5.	Correct vs practical

You keep saying “the correct fix is known.” The reality is: no universal “correct” exists. Each language translator needs to follow its own idioms. For Python, the correct fix is underscore escaping. The “try underscore, then numbers, then something else” scheme is not more correct — it’s just more complex, less Pythonic, and introduces new fragility.

And just to be clear: we already had to patch around this bug in downstream repos — see trailofbits/polyfile#3431, trailofbits/polyfile#3432, and zbirenbaum/polyfile-weave#1. That’s duct tape to keep real projects working, not theory. The bug is blocking users today. Saying “just fork it” is not a serious answer - the whole point of Kaitai is to provide working, idiomatic generators. If the only way to get fixes in is by faking it downstream, that’s exactly how an OSS project would stop being useful.

--//--
TLDR
No human writing a spec - or machine for that matter - would ever deliberately create both class and class_ side by side. It’s a contrived edge case with no practical value, so what’s the point of blocking a real bugfix over it?
