- 
                Notifications
    You must be signed in to change notification settings 
- Fork 174
Fix Python reserved keyword escaping in code generation #320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix Python reserved keyword escaping in code generation #320
Conversation
Problem: The Kaitai Struct compiler generates Python code that uses reserved keywords as identifiers (e.g., 'self.class = ...'), causing SyntaxError when the generated code is imported. Solution: Added proper escaping of Python reserved keywords by appending an underscore to any identifier that matches a Python reserved word. Changes: - Added PYTHON_RESERVED_WORDS set with all Python reserved keywords - Added escapePythonKeyword() function to check and escape keywords - Modified idToStr() to apply escaping for NamedIdentifier and InstanceIdentifier - Added comprehensive test suite in PythonCompilerSpec.scala Test results: All 12 tests pass, verifying correct escaping behavior This fixes generated code that uses fields named 'class', 'def', 'if', etc., which are common in binary format specifications like OpenPGP.
| @andrewmonostate, thanks for your contribution! Unfortunately, this is a rather complicated problem, and I believe what we need to start with is some kind of test which actually proves that a problem exists and the good place to add that would be our general test suite. I went ahead and created a proposal there in kaitai-io/kaitai_struct_tests#136 — please take a look. For reference, current master without your patch results in:  | 
| Quick question: does your fix prevents name clashing? So, for example, if you have two attributes  | 
| 
 When you think about what having  | 
| I would disagree with that. Why should a concrete translator implementation dictate rules for a language-agnostic language? Only the translator is responsible for ensuring that it does not create invalid and clashed identifiers. Anyway, thinking about what happens if your name mangling algorithm generates names without any signs of mangling is a good idea. It should be mentioned at least so that the question of mangling does not arise. If the described situation is impossible, then you can just say so. It's not hard, right? | 
| I concur with @generalmimon — given we have pretty long history of interpreting underscores and translating this into camel case,  Two ways out would be: 
 I'm leaning towards stricter naming rules going forward, as ultimately thinking of more and more underscore-related exceptions will become insanely difficult. Even if we'll allow  
 Don't forget numbers too: 
 Thinking of non-self-contradicting transformation rules here will be rather hard if not impossible. | 
| Just let decision to concrete language generators. The reasonable default would be to add increasing number to the name to prevent conflict. If we have  
 Algorithm should first assign names to all fields that is allowed by target language ( Step 1: 
 Step 2: 
 This is the Rust code that I write to implement this algorithm for my ksc-rs project (it was not yet published): /// Defines a user-defined type
#[derive(Clone, Debug, Default, PartialEq)]
pub struct UserType {
  /// The list of fields that this type consists of. The fields in the data stream
  /// are in the same order as they are declared here.
  pub fields: IndexMap<SeqName, Attribute>,
  /// List of dynamic and calculated fields of this type. The position of these fields
  /// is not fixed in the type, and they may not even be physically represented in the
  /// data stream at all.
  pub instances: IndexMap<FieldName, Instance>,
  /// List of used-defined types, defined inside this type.
  pub types: IndexMap<TypeName, UserType>,
  /// List of enumerations defined inside this type.
  pub enums: IndexMap<EnumName, Enum>,
  // pub params: IndexMap<ParamName, Param>, //TODO: Parameters
}
impl UserType {
  /// Returns a map with field names that is valid according to the target
  /// language rules. Generation and validity checks are performed by the
  /// `generator` function that receives 2 parameters:
  /// - a field name for which name is generated
  /// - a generation attempt. Attempts are started from zero and increasing
  ///   over the time. The generator MUST return different name on each attempt
  ///   otherwise the algorithm will run in infinity cycle.
  ///
  /// # Parameters
  /// - `generator`: a name generator function. This function should return
  ///   `Some(name)` if generated name in this attempt is valid and `None` if not.
  ///
  /// # Example
  ///
  /// ```
  /// # use ksc::model::UserType;
  /// use ksc::model::AttributeName::*;
  /// use heck::ToLowerCamelCase;
  ///
  /// # let ty: UserType = UserType::default();/*
  /// let ty: UserType = ...;
  /// # */
  ///
  /// // List of reserved words
  /// const KEYWORDS: [&'static str; 1] = [
  ///   "enum",
  /// ];
  ///
  /// let names = ty.attribute_names(|name, attempt| {
  ///   // Generate a name converted by converting KSY field name
  ///   // to mixedCase (Java field style names) on first attempt
  ///   // and adding a numerical suffix on other attempts
  ///   let generated = match (attempt, name) {
  ///     // it is better to generate non-intersecting names for
  ///     // unnamed fields, for example, in this case, starting
  ///     // with an underscore (because such names are not possible
  ///     // for named fields after to mixedCase conversion), but
  ///     // that is not strictly necessary
  ///     (0, Unnamed(i)) => format!("unnamed{}", i),
  ///     (a, Unnamed(i)) => format!("unnamed{}{}", i, a),
  ///
  ///     (0, Seq(n)) => n.to_lower_camel_case(),
  ///     (a, Seq(n)) => format!("{}{}", n.to_lower_camel_case(), a),
  ///
  ///     (0, NonSeq(n)) => n.to_lower_camel_case(),
  ///     (a, NonSeq(n)) => format!("{}{}", n.to_lower_camel_case(), a),
  ///   };
  ///
  ///   // Check if a generated name conflicts with a keyword and
  ///   // return `None` if that is true
  ///   match KEYWORDS.binary_search(&generated.as_ref()) {
  ///     Ok(_) => None,
  ///     Err(_) => Some(generated),
  ///   }
  /// });
  /// ```
  pub fn attribute_names<G, R>(
    &self,
    generator: G,
  ) -> IndexMap<AttributeName, R>
    where G: Fn(AttributeName, usize) -> Option<R>,
          R: Clone + Eq + Hash + AsRef<str>,
  {
    let mut mapping = IndexMap::new();
    let mut used_names = IndexMap::new();
    let mut unprocessed_named = Vec::new();
    let mut unprocessed_unnamed = Vec::new();
    enum State<'a> {
      /// Mapping was added
      Inserted,
      /// Old mapping was replaced by new one, the old mapped value returned
      Replaced(AttributeName<'a>),
      /// Mapping was not done because there is a invalid name or unnamed field
      Postponed,
    }
    let mut generate = |attempt, name, original| {
      if let Some(generated) = generator(name, attempt) {
        match used_names.entry(generated.clone()) {
          // If name not used yet, register mapping
          Entry::Vacant(e) => {
            mapping.insert(name, generated);
            e.insert(name);
            State::Inserted
          },
          // If mapping already used, but generated name does not match
          // original one, and new candidate has the same name as generated one,
          // replace mapping
          Entry::Occupied(mut e) => {
            if original == generated.as_ref() && mapping.get(&name) != Some(&generated) {
              mapping.insert(name, generated);
              let old = e.insert(name);
              mapping.swap_remove(&old);
              State::Replaced(old)
            } else {
              State::Postponed
            }
          },
        }
      } else {
        State::Postponed
      }
    };
    // First, use all names that does not violates rules. Unnamed fields does
    // not violate rules, but their generated names could conflict with the
    // explicitly defined ones, so we postpone their name generation
    for (name, _) in &self.fields {
      match name {
        OptionalName::Unnamed(i) => unprocessed_unnamed.push(*i),
        OptionalName::Named(n) => {
          match generate(0, AttributeName::Seq(n), n.deref()) {
            State::Inserted => {},
            State::Replaced(old) => unprocessed_named.push(old),
            State::Postponed => unprocessed_named.push(AttributeName::Seq(n)),
          }
        }
      }
    }
    for (name, _) in &self.instances {
      match generate(0, AttributeName::NonSeq(name), name.deref()) {
        State::Inserted => {},
        State::Replaced(old) => unprocessed_named.push(old),
        State::Postponed => unprocessed_named.push(AttributeName::NonSeq(name)),
      }
    }
    for name in unprocessed_named {
      match name {
        AttributeName::Seq(n) | AttributeName::NonSeq(n) => {
          for attempt in 1.. {
            if let State::Inserted = generate(attempt, name, n.deref()) {
              break;
            }
          }
        },
        _ => unreachable!(),
      }
    }
    // Generate names for unnamed fields in the last stage
    for i in unprocessed_unnamed {
      for attempt in 0.. {
        if let State::Inserted = generate(attempt, AttributeName::Unnamed(i), "") {
          break;
        }
      }
    }
    mapping
  }
} | 
| @Mingun I'm not convinced that numeric suffixes are a good idea, because you inevitably make the name translation of one attribute name dependent on what the other attributes are named. This implies that if the .ksy spec is changed in a way that some of the other attributes are removed/renamed, it may influence this attribute, even though it was not touched. That may cause the user's application code to break. BTW, this is already the case with  Another problem is that with numeric suffixes, you cannot tell from the .ksy specification how conflicting names will be numbered. For example, if you have  I also don't think we need to be linguistic purists - we don't have to support everything. I don't think a .ksy spec with two names in the same scope that differ only in underscores follows good practices (and also if we reject it, I'd argue it's unlikely to affect many people), so I definitely agree with @GreyCat on stricter naming rules. If there's a need for similar names, why couldn't the author of the .ksy specification add a numerical suffix themselves (e.g.  | 
| 
 Using numerical suffixes is not necessary, but it is the simplest scalable solution with clear semantics and predictable rules. 
 This is the only possible solution if you want to prevent name conflicts. Conflict implies you need to know about other names in the same scope to ensure you prevent it 
 Of course, but this only will happened with attributes, which already requires some non-trivial name translation (mostly attributes with reserved names). The number of such attributes, however, is quite small, because there are usually few reserved words and it is unlikely that other attributes will also appear next to them, whose names match the name of the reserved attribute after its trivial transformation. Anyway, if you remove / rename something, you already make incompatible changes and it is expected, that dependers should be also updated. Use semver and everything will be fine. 
 You really does not have any other option. The only advice is to try not to name names too similar in one scope (differing only in the number of underscores, for example). 
 You lose sight of the fact that the main problem is that the converted names may match the existing ones, even if everything is fine in your KSY. And you need to solve this problem, because it is impossible to prohibit all words reserved in all languages or otherwise unacceptable. An example is Java. You can name the type in KSY as  Mediocre software distinguishes from good software how the corner cases processed. Don't be mediocre. As you yourself said, we don't have to support everything. But we can make things more end-user friendly. The user does not need to think why he cannot use the  | 
| I want to address the whole “underscore vs numeric suffix” debate in detail, because frankly this is seems like unreal to me, very over-engineered and it distracts from the actual bug we need to solve, which is actual people's code not compiling. 
 This “solution” actually introduces instability into downstream code. You edit your .ksy slightly and suddenly different attributes get renumbered. That’s not robustness, that’s chaos at your fingertips! 
 Conclusion: Right now the project is burning contributor goodwill over pointless nitpicking instead of just shipping a fix that unblocks users. That’s the real risk here, not whether someone somewhere might define both foo_bar_ and foo__bar. And by the way, I see uthis exact discussion has been going in circles since 2016 in issue #90. That’s 9 years of arguing over edge-cases while leaving the real bug unfixed. This is exactly why contributors get frustrated. The practical solution here is simple and Pythonic: underscore escape. | 
| If you guys need more details or proof of the error, see my original issue #1249. I provided a concrete Python reproduction showing how class generates invalid code (SyntaxError). Instead of treating that as a separate, actionable Python bug, it was closed and merged into #90 - a 9-year-old thread that’s been stuck in theoretical name-mangling debates since 2016 (2 world cups, 2 Olympic Games and a pandemic has passed and that issue is still alive!). That issue is exactly why nothing has moved: endless speculation about “perfect universal rules” while real users are left with broken generated code. It makes no sense to bury fresh, clear bug reports into decade-old design debates. The practical fix is simple (underscore escape) and already tested and cleanly implemented. | 
| 
 I think you got the substance of the debate wrong. There is no "undercore vs numeric suffix" debates. There is only a discussion of how to implement correctly, without introducing hidden bugs. For example, if you look at my suggestion carefully, you will see that the concrete language translator itself decides exactly how it will convert the original names into final ones. If you want, on the first attempt you can just add an underscore. If this name is free, it will be used. But if not, you'll have another go. Now you can add a number, add a number and underscore, underscore and number, or whatever you want. 
 You're absolutely right! Reserved-word handling is the task of each language translator, and not the Kaitai core. Core may only provide usable helpers to do that with minimal efforts, one of which I provide in the code snippet above. 
 Unfortunately, in this repository, every issue and PR follows this rule, as if they only exist for this 😞. 
 Totally true. 
 I only want to emphasis, that solution that I suggest as simple as yours, but additionally takes into account all corner cases. It complements, not replaces, yours. Since all language translators live in the same repository along with the core compiler, you don't even need to split PR into PRs into multiple repositories. You just need to implement a simple, bulletproof solution and forget about this problem forever! | 
| @Mingun Thanks for clarifying, but I think this is still missing the point. And I get it I'm new here, don't want to be disrespectful or mess the project's normal process esses. That said, please note: Multiple comments (including yours) explicitly proposed numeric suffixes or hybrid schemes as “bulletproof.” That’s what I was responding to. My point is simple: in Python, underscore is the idiomatic, PEP8-endorsed, standard solution (class → class_). Introducing numbering or “underscore+number+whatever” schemes is neither idiomatic nor predictable. It’s actually less safe because it creates hidden dependencies on what else exists in scope and makes downstream code unstable when .ksy specs evolve. No naming scheme can be bulletproof across all languages, as you guys reached consensus yourselves, because each has its own reserved words, casing, getter/setter conventions, etc. We already accept language-specific translators that apply language-specific rules. Pretending we can solve everything with one meta-algorithm in the core is chasing a ghost. That’s why Python should follow Pythonic rules, Java should follow Java rules, etc. That produces names that are unpredictable and brittle: class could become class_ in one version and class1 in another if something else collides later. That is exactly the kind of hidden bug you claim to want to avoid — downstream code silently breaks because the mapping shifted. A stable, deterministic escape (class → class_) is far safer than a cascading fallback scheme. The actual bug here is simple: generated Python code with reserved keywords doesn’t import. That is immediately fixed by underscore escape. Your “unified helper” idea might be interesting for core someday, but holding up a working Python fix while chasing “bulletproof for all translators forever” is how we end up with issue #90 sitting open for 9 years with no resolution. If your helper really does complement the fix, then the obvious path is: merge the Python fix now (tested, idiomatic, solves the bug), then extend with core-level helpers incrementally. Not stall the urgent bugfix for another abstract round of “what if” scenarios. At the end of the day, this is not about inventing the perfect universal mangling algorithm. It’s about shipping a fix that makes Kaitai’s Python output usable. That should be the priority. TLDR: | 
| PEP8 has concrete purpose: allow identifiers, that looks like keywords. It does not solve the name clashing problem, which translator should prevent when rename rules -- because if KSY model is checked and corrent, the translator must generate the correct code. You trying to solve only the forbidden identifiers problem, but at the same time introduce another bug by creating a name clash problem. In my opinion, if you do something, you should do it right, i.e. solve that problem, especially when it was highlighted and the solution was suggested. 
 Any changes makes downstream code unstable. You may rename  
 That is not required. Read what I wrote several times: each language should use its own naming scheme. This is what I am trying to convey and what @generalmimon disagrees with. They may be identical for different languages, but that is not required. 
 Yes. Yes! Use the idiomatic "to pythonic case + append underscore" on the first try, but use something else when this will not work. The code snippet, provided by me, only requires from you only to define rules what to do on each try. 
 They predictable. Just the numbers are the most predictable parts of names.  
 Come on, this project is in a half-dead state. If you need the fix, maintain you own fork. There, if you allow yourself, you can implement any ad-hoc solutions, which works for you case but may break other cases that you are not interesting. But I feel that for shared project this will be totally wrong approach. 
 I would agree with that if the correct fix was not known, but it is not. And I as you may noticed by yourself, any "urgent" fixes should come to you own fork (of course, if under "urgent" you mean something other than "in this decade, maybe"). | 
| @Mingun I think you’re overstating the “clash problem” and missing the practical priority here. PEP8 isn’t just about “looking like” keywords — it’s the de facto convention for escaping them. class → class_ is what every Python developer expects. That solves the real, immediate bug: generated code doesn’t even import. That’s priority #1. This is a hypothetical overreaction. If someone actually defines both class and class_ in the same .ksy, they’ve already written a bad spec. Kaitai already rejects or transforms other self-contradictory cases (like underscores collapsing in camelCase languages). It’s not the translator’s job to endlessly sanitize pathological naming collisions — it’s the spec author’s job to not create them. Right ? This is where we disagree. A user reading class_ knows immediately what happened: it was a reserved word, escaped Pythonically. A user reading class2 has no clue unless they reverse engineer the translator’s fallback logic. Worse: when .ksy evolves, numbers do shift around, which silently breaks downstream code. That’s not predictability, that’s instability disguised as determinism With respect, that’s not how healthy OSS works. The point of this project is to provide working codegen for supported languages. Telling contributors to fork and patch themselves instead of merging a minimal, idiomatic bugfix is how projects rot. This is the definition of paralysis by analysis - endless “what if” debates while users are stuck with broken code. You keep saying “the correct fix is known.” The reality is: no universal “correct” exists. Each language translator needs to follow its own idioms. For Python, the correct fix is underscore escaping. The “try underscore, then numbers, then something else” scheme is not more correct — it’s just more complex, less Pythonic, and introduces new fragility. And just to be clear: we already had to patch around this bug in downstream repos — see trailofbits/polyfile#3431, trailofbits/polyfile#3432, and zbirenbaum/polyfile-weave#1. That’s duct tape to keep real projects working, not theory. The bug is blocking users today. Saying “just fork it” is not a serious answer - the whole point of Kaitai is to provide working, idiomatic generators. If the only way to get fixes in is by faking it downstream, that’s exactly how an OSS project would stop being useful. --//-- | 
Problem
The Kaitai Struct compiler generates invalid Python code when field names use Python reserved keywords. For example, a field named
classgeneratesself.class = ..., which causes aSyntaxErrorbecauseclassis a reserved keyword in Python.Real-world Impact
This issue affects real format specifications, notably the OpenPGP message format which contains a field named
class. The generated Python parsers cannot be imported due to syntax errors.Solution
This PR adds proper escaping of Python reserved keywords by appending an underscore, following PEP 8 conventions:
class→class_def→def_if→if_Changes Made
Added reserved keyword detection (
PythonCompiler.scala):PYTHON_RESERVED_WORDSset containing all Python reserved keywordsescapePythonKeyword()function to check and escape keywordsidToStr()to apply escaping forNamedIdentifierandInstanceIdentifierAdded comprehensive tests (
PythonCompilerSpec.scala):class,def,if,lambda, etc.)Test Results
Example
Before (causes SyntaxError):
After (valid Python):
Compatibility