Skip to content

Change Request: Program range cover whole program #648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 4 tasks
overlookmotel opened this issue Apr 7, 2025 · 18 comments
Open
2 of 4 tasks

Change Request: Program range cover whole program #648

overlookmotel opened this issue Apr 7, 2025 · 18 comments

Comments

@overlookmotel
Copy link

overlookmotel commented Apr 7, 2025

Which packages would you like to change?

  • espree
  • eslint-scope
  • eslint-visitor-keys

What problem do you want to solve?

The source range for Program is surprising when source code starts or ends with comments.

e.g.:

// foo

x

// bar

Espree (like Esprima) produces an AST where range of Program node excludes whitespace and comments from start and end of file. In example above, the range of Program node is [ 8, 9 ] i.e. the range of the sole statement.

AST explorer - Espree

This differs from Acorn, which produces a range of [0, 17] i.e. the whole of the source code.

AST explorer - Acorn

What is particularly surprising is that with option comment: true you get comments which are outside the range of the program that contains them:

{
  type: 'Program',
  start: 8,
  end: 9,
  body: [
    {
      type: 'ExpressionStatement',
      start: 8,
      end: 9,
      expression: { type: 'Identifier', start: 8, end: 9, name: 'x' }
    }
  ],
  sourceType: 'script',
  comments: [
    { type: 'Line', value: ' foo', start: 0, end: 6, range: [ 0, 6 ] },
    { type: 'Line', value: ' bar', start: 11, end: 17, range: [ 11, 17 ] }
  ]
}

As a further surprise, if the file contains only comments, then the behavior changes:

// foo

This produces a range of [ 0, 8 ]. This does not align with Esprima, which produces range of [ 8, 8 ].

What do you think is the correct solution?

Espree's docs state that:

The primary goal is to produce the exact same AST structure and tokens as Esprima, and that takes precedence over anything else.

Given that Esprima has not received updates for 7 years, and is essentially defunct, I'm not sure whether the maintainers of Espree consider alignment with Esprima still to be the primary goal. I'm also not sure whether that aim, if it still holds, extends to source ranges, or just the AST shape.

What's the correct behavior?

This is arguable. But, in my opinion, it's unintuitive that comments which are part of a program exist in ranges outside of the Program.

The example where the program consists of only comments I think is illustrative. What's the logical range for Program to have in this case, if comments are excluded? [0, 0] or [8, 8]?

The fact that there isn't a clear answer I feel reveals there's a problem with the mental model.

[0, 8] (as Espree currently does) makes most sense to me. But this is inconsistent with Esprima, and also inconsistent with Espree's behavior when there are statements in the program.

Others may feel differently.

Why does this matter?

In addition to the difference from Acorn, @typescript-eslint/parser is also different again - for the first example, it gives a range of [8, 18]! Though the maintainers seem to think Acorn's behavior would be preferable.

AST explorer - TS-ESLint

Personally, I think it's unfortunate that the 3 most depended-on ESTree-compatible parsers Espree, Acorn, and TS-ESLint differ in this respect.

What to do?

One way or another, in my opinion Espree's current behavior is incorrect.

  • In the case which program is only comments, output differs from Esprima, violating Espree's stated goal.
  • Or, if alignment with Esprima is not an aim as far as ranges are concerned, it would make more sense to have a less surprising behavior.

Changing behavior to match Acorn (which personally I'd advocate for) would be a breaking change. Though, given that the 3 most popular ESTree-compatible parsers differ from each in this regard and I'm not aware of anyone complaining about it before, probably not a big one!

Participation

  • I am willing to submit a pull request for this change.

Additional comments

No response

@fisker
Copy link
Contributor

fisker commented Apr 9, 2025

Duplicate of #488

@nzakas
Copy link
Member

nzakas commented Apr 9, 2025

Thanks for the feedback. As I stated in #488 (comment), Espree is the canonical ESLint parser. Any parser wanting to work with ESLint should mimic what Espree does. As such, we need to be careful when making changes.

The decision of where to start the Program node has long since passed. It will remain as it was with Esprima.

I think producing comments that fall outside of the range of Program is probably a bug, as ESTree is meant to have each node wholly contain its children. @eslint/eslint-tsc what do you think?

For reference, here's the code that adjusts the Program location:

/*
* Adjust opening and closing position of program to match Esprima.
* Acorn always starts programs at range 0 whereas Esprima starts at the
* first AST node's start (the only real difference is when there's leading
* whitespace or leading comments). Acorn also counts trailing whitespace
* as part of the program whereas Esprima only counts up to the last token.
*/
if (program.body.length) {
const [firstNode] = program.body;
if (program.range) {
program.range[0] = firstNode.range[0];
}
if (program.loc) {
program.loc.start = firstNode.loc.start;
}
program.start = firstNode.start;
}
if (extra.lastToken) {
if (program.range) {
program.range[1] = extra.lastToken.range[1];
}
if (program.loc) {
program.loc.end = extra.lastToken.loc.end;
}
program.end = extra.lastToken.end;
}

I think the issue is that comments are not represented as tokens, so the end location currently only considers non-comment syntax.

@nzakas nzakas moved this from Needs Triage to Feedback Needed in Triage Apr 9, 2025
@overlookmotel
Copy link
Author

Thanks very much for coming back. Sorry that I didn't see this had been raised before in #488.

Thanks for confirming that the primary goal for Espree remains aligning with Esprima. In that context, of course the answer is clear. The current behavior cannot change.

Only 2 questions remain:

Incompatibility with Esprima

If the program contains only comments e.g.:

// foo

Espree produces a range of [ 0, 8 ]. This does not align with Esprima, which produces range of [ 8, 8 ].

I assume that divergence from Esprima is unintended? If so, should that be considered a bug (fixing it not a breaking change)? Or something to tackle in next major release?

Comments outside of Program

ESTree is meant to have each node wholly contain its children

Personally I don't think that needs to be considered to apply to comments, as comments are not "nodes" in the sense of the ESTree spec, which doesn't say anything about comments (same as tokens are not "nodes").

Often legal comments or "directive" / "pragma" comments appear at top of file, and it'd likely be problematic for users if those comments were omitted.

Espree's NPM docs specifically state that hashbangs are supported. Hashbangs are treated as comments, and by definition must be at top of file. So excluding all comments that lie outside the range of Program would break that support.

@nzakas
Copy link
Member

nzakas commented Apr 10, 2025

I assume that divergence from Esprima is unintended? If so, should that be considered a bug (fixing it not a breaking change)? Or something to tackle in next major release?

I think this was unintended as I can't imagine we tested this case when we switched to Acorn. I think this is an indicator that we just need to come to a decision on one question: are comments considered part of Program or not? If yes, then the current behavior for a file with just a comment is correct and we need to adjust what happens when there are comments at the end of a program to make it align. If no, then we should fix the behavior of the one-comment case and leave the other case alone.

@overlookmotel
Copy link
Author

are comments considered part of Program or not?

In my view they are. But you've clearly stated that the goal is to align with Esprima, so I thought that answered the question - Esprima does not consider comments to be part of the Program.

@nzakas
Copy link
Member

nzakas commented Apr 11, 2025

@overlookmotel As I noted, there is an inconsistency in the implementation. Trailing comments are not treated as part of the program but leading comments are. I'm seeking to resolve that inconsistency.

@mdjermanovic
Copy link
Member

@nzakas I don't see an inconsistency between how leading comments are treated vs how trailing comments are treated.

I believe Program's range currently works as follows:

  1. If there are no tokens, i.e., the source code is only whitespace and/or comments, then the Program's range is the whole source code.
  2. If there's at least one token, then the Program's range starts from the first token's start and ends at the last token's end. Whitespace and comments before the first token and after the last token are not included in the Program's range.

Is that correct? If not, can you provide an example where there's an inconsistency between how leading and trailing comments are treated?

@nzakas
Copy link
Member

nzakas commented Apr 15, 2025

Sorry, I don't think I explained the inconsistency I'm seeing very well. Let me try again.

For instance, here's an example with just a comment:
https://astexplorer.net/#/gist/7896dbd46cd027a8414f45d6c5c5e763/055c0d167a560b52c8cc48a614aacfb50ef07906

In this case, the Program range is 0 to 181.

If I add some code underneath it, the range changes to 178 to 182:
https://astexplorer.net/#/gist/4acf3a930a735d191dd9891280a0a7b9/d3d4bc7af2a20c476676b9f4fc502714eebf6e63

That's the inconsistency I'm concerned with. It seems like the start of the program should be the start of the program. (Similarly, the end of the program should be the end of the program.) But we are counting leading and trailing comments differently in these situations.

@overlookmotel
Copy link
Author

Thanks for coming back. I believe the only change required to resolve this inconsistency is as follows:

  • If the file contains no tokens (no statements), set Program range start and end both to the length of the source text.

i.e. in your first example, Program's range would be [181, 181].

This would match Esprima's behavior.

There are other alternative solutions which could also resolve the inconsistency, but they would not align with Esprima, which I've understood from your previous comments is of paramount importance.

I hope I've made sense this time!

@mdjermanovic
Copy link
Member

For instance, here's an example with just a comment: https://astexplorer.net/#/gist/7896dbd46cd027a8414f45d6c5c5e763/055c0d167a560b52c8cc48a614aacfb50ef07906

In this case, the Program range is 0 to 181.

If I add some code underneath it, the range changes to 178 to 182: https://astexplorer.net/#/gist/4acf3a930a735d191dd9891280a0a7b9/d3d4bc7af2a20c476676b9f4fc502714eebf6e63

That's the inconsistency I'm concerned with. It seems like the start of the program should be the start of the program. (Similarly, the end of the program should be the end of the program.) But we are counting leading and trailing comments differently in these situations.

I still don't see an inconsistency between leading and trailing comments. If there are no tokens, the range spans from the start to the end of the source code, whether the source code content is whitespace-only or contains one or multiple comments. If there is at least one token, the range spans from the start of the first token to the end of the last token, and neither leading nor trailing comments are included.

For example, in the following case the Program range is [0, 56]. The whole source code has 56 code units, and all of the leading comment, trailing comment and whitespace are included in the range:

/**
 * Leading comment
 */

/**
 * Trailing comment
 */

https://astexplorer.net/#/gist/675eeef0df636f1a0297bd48acb29f74/f8e6977b4274852d115199907142b967b1a6e081

While in the following case the Program range is [28, 31]. It's the range of foo, and neither leading nor trailing comments are included:

/**
 * Leading comment
 */

foo

/**
 * Trailing comment
 */

https://astexplorer.net/#/gist/a0cc72f7d660b457ef68bbea44e6eae4/4f83a3bba3a633e3d03a56b5c2b8c4583e9bcf6e

@nzakas
Copy link
Member

nzakas commented Apr 18, 2025

@mdjermanovic okay, you've lost me a bit. The example you show is exactly the inconsistency I'm talking about. It doesn't make sense that comments and whitespace are considered part of the program in one case and not in the other. I don't see why having one token should change how we're calculating the start and end ranges of the program.

@overlookmotel
Copy link
Author

overlookmotel commented Apr 22, 2025

I think the root of this confusion goes back to my original point that the mental model doesn't entirely hang together. Consequently, different people can come at this question from different directions and reach different answers.

However, it's a stated goal of Espree to align with Esprima, so we need to find a solution which both:

  1. Is internally consistent, and
  2. Matches Esprima.

The rule that Esprima follows is that Program span is the start of first token to end of last token. Esprima's interpretation of this rule when the program has no tokens is that Program's range has to be zero length. They have chosen that zero-length range to be at end of source text ([56, 56]) rather than at start ([0, 0]). That's an arbitrary choice, but it's consistent.

So, in my view, Esprima's schema is internally consistent. Therefore the only option which satisfies both criteria is to align with Esprima's behavior and make the range for an empty program [<length of source text>, <length of source text>].

Other interpretations are also valid, but they fail on the 2nd criteria.


Side note: If we were starting from ground zero and didn't have to care about breaking changes or alignment with Esprima, I'd advocate ripping it all up and making Program's span always be [0, <length of source text>]. That'd also be internally consistent, and to my mind more intuitive. But that's not an option, unfortunately.

@nzakas
Copy link
Member

nzakas commented Apr 22, 2025

@overlookmotel Thanks for the summary. At this point, I'm not sure we need to align this with Esprima, that's why I've opened this up for larger discussion. Espree is used far more than Esprima, so we have more license around the edges. We can't change things like the AST format or tokens format as that would break too much compatibility. Because we are already not following Esprima for Program range means that any change we make is technically breaking so I'd like to make one that is consistent and, ideally, what we would want if starting from zero.

@overlookmotel
Copy link
Author

overlookmotel commented Apr 23, 2025

Ah, sorry I'd misunderstood. Given that this thread has become lengthy, let me summarize my personal view on what I think would be the ideal if we were starting from scratch.

The problems:

  • It's weird/surprising that comments which are part of a program have ranges which are outside the range of the Program itself.
  • It's ambiguous what is the "right" range for a Program containing no tokens, if we follow the current rule that Program starts on first token, and ends on last token.
  • It's unfortunate for interoperability that Espree and Acorn disagree on this.

I propose that in all cases:

  • Program range start is 0.
  • Program range end is length of the source text.
  • loc follows the same pattern. i.e.:
    • loc.start is always { line: 1, column: 0 }
    • loc.end.line is the number of lines in the source text.
    • loc.end.column is the number of characters in last line of source text.

This would be different from Esprima, but would align with Acorn (and also Babel).

For what it's worth, TS-ESLint parser maintainers also seem to think this would be preferable.

I will now shut up and let others give their views!

@nzakas
Copy link
Member

nzakas commented Apr 23, 2025

I propose that in all cases:

* `Program` range start is 0.

* `Program` range end is length of the source text.

* `loc` follows the same pattern. i.e.:
  
  * `loc.start` is always `{ line: 1, column: 0 }`
  * `loc.end.line` is the number of lines in the source text.
  * `loc.end.column` is the number of characters in last line of source text.

Yes, that's what I think as well.

@fisker

This comment has been minimized.

@nzakas
Copy link
Member

nzakas commented Apr 24, 2025

@fisker please open a separate issue for that.

@fisker

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Feedback Needed
Development

No branches or pull requests

4 participants