Add Readtables back #535

Closed

elibarzilay opened this issue Apr 16, 2016 · 48 comments

@elibarzilay

Again, apologies for the vague issue, but:

  1. Do readtables exist? If so where are they documented? (Better: some simple examples?)
  2. If not, then make this a feature request to have them...
@disnet
Member

disnet commented Apr 16, 2016

Ah, thought I had already made an issue for this.

Indeed, they are missing and we need to add them back. We had them pre-1.0 (thanks @jlongster!) but then I rewrote the reader.

disnet changed the title from Readtables? to Readtables on Apr 16, 2016
disnet changed the title from Readtables to Add Readtables back on Apr 16, 2016
@ericguedespinto

Could we use readtables to make Sweet.js whitespace-aware, like CoffeeScript or Ruby, for instance?

@vendethiel

That'd be really, really, really hard.

@elibarzilay
Author

Readtables are intended more for cases where you want to add a new construct to an existing language with minimal changes. An indentation-based syntax is a drastic change to the language, so you really need to implement a parser for that.

@bopjesvla
Contributor

bopjesvla commented May 10, 2016

I implemented indent blocks in 0.7. The table used a simple regular expression to track the flow of indentation. The syntax was:

while (true):
    while (false):
        if (1 === 2):
            return;
        y = 3
        j = 5
    x = 5
    for (i of [3]):
        i.j="k"
    i = 3

Unfortunately, outdenting two levels at once was blocked by an issue related to nested braces.

@gabejohnson
Member

@disnet are you looking to have the same API as the old readtables?

@disnet
Member

disnet commented Jul 28, 2016

Not necessarily.

@gabejohnson
Member

I'm looking to extend the utility of #. I'm thinking of looking at Racket's API for inspiration. What requirements are you targeting for readtables?

@gabejohnson
Member

gabejohnson commented Aug 1, 2016

So far I've come up with a primitive API that's similar to that of syntax:

lexeme ~> = function(reader) {
  reader.key(); // ~>
  reader.read(); // eats 'skipMe'. uses default reader behavior
  // creates a reset point that is consumed when .reset is called.
  // other possible names: shift, savePos, checkpoint
  reader.mark();
  const delim = reader.next().value; // [
  let brackets;
  if (delim.isLeftBracket()) {
    reader.reset();
    brackets = reader.read('Delimiter');
  } else {
    // throw
  }
  return `thread { ${brackets.inner()} }`; // .inner returns another reader instance
}

foo ~> skipMe [bar, baz(1)] // foo thread { bar, baz(1) }

Implicitly, the current readtable is being extended. I do have questions about using modules at read time. Are syntactic extensions and other libraries available at read time?

import m from './m' for reader

@bopjesvla
Contributor

Things I liked about the old API are mostly whitespace-related: the ability to skip whitespace without gobbling up the next token, and access to the full source code, the current index, and the current line number.

A read-phase equivalent of syntax isn't going to cut it for stuff like JSX and indent blocks, where whitespace is significant.

@gabejohnson
Member

I disagree. The above doesn't display the full API I have in mind. I was thinking of moving many of the 0.7 readtable utility functions to Token objects returned from read and next. There's also no reason you couldn't have the option of calling reader.read('Whitespace') or something similar.

I'm also not sure why you'd need access to the entire source string. Your coffeetable implementation doesn't seem to use anything in the source before the triggering character anyway.

I know I haven't thought things all the way through yet, and I'd like to hear more useful criticism.

@bopjesvla
Contributor

mhmm, I distinctly remember having a use case for access to the source code, but I'll have to get back to you on that. Also, I didn't mean to say that a syntax-like API design and whitespace awareness are mutually exclusive, just that they have to be consolidated.

@gabejohnson
Member

gabejohnson commented Aug 1, 2016

My mistake. It looks like you used it to look behind one character for ellipsis. As with syntax extensions, I think we'll need limited lookbehind.

I agree about consolidation.

@disnet
Member

disnet commented Aug 4, 2016

syntax is a binding form, so I don't think your proposed syntax-like API quite works. Reader extensions are a low-level tool for changing the lexical structure of your program, so making them look like a binding confuses things, I think.

What were the extensions to # you wanted to do?

@gabejohnson
Member

gabejohnson commented Aug 4, 2016

I was thinking lexeme or some other identifier would be implemented as a more primitive reader macro and desugar to something like:

const readtable = [];
readtable.push({
  token: "~>",
  read: function(reader) {
    // ...
  }
});
export default readtable;

But I'm guessing that doing multiple expansion passes so that readers could use syntax macros could be a lot of overhead.

As for #, I was thinking of implementing callable, Clojure-like keywords and immutable data structures:

const hm = #{ #foo 'foo', #bar 'bar' };
#foo(hm); // 'foo'

as well as some sort of lens literal:

const hm = #{ #foo 'foo', #bar #{ #baz 'baz' } };
const l = #bar/baz; // maybe #/bar/baz, #bar/baz/ or #/bar/baz/
l(hm); // 'baz'

These could potentially work on regular JS objects as well:

const obj = {foo: 'foo', bar: {baz: 'baz'}};
#bar/baz(obj); // 'baz'

@disnet
Member

disnet commented Aug 5, 2016

Oh yeah, #{ and friends would be 💯.

Have you started hacking on readtables? If not, I might start; I have a few other things I want to clean up in the reader anyway.

@gabejohnson
Member

gabejohnson commented Aug 5, 2016

I started working backward from the API (i.e. the reader that's passed to the transform function), but I haven't started integrating it yet. I wanted to wait on answers to a few details:

  • Can readtables make use of syntactic extensions?
  • How about previous readtable entries?
  • Are they going to take part in the module system?
  • How are they loaded?

I guess I could start by just wedging them in there via CLI options and not worry about the other questions for now.

@gabejohnson
Member

gabejohnson commented Aug 5, 2016

Is it feasible to implement #`...` -> syntaxQuote`...` and syntaxQuote as reader and syntactic extensions, respectively?

@gabejohnson
Member

What were your thoughts for cleaning up the reader?

@disnet
Member

disnet commented Aug 6, 2016

> Can readtables make use of syntactic extensions?
> How about previous readtable entries?

Let's say no for now; we can think about this later.

> Are they going to take part in the module system?
> How are they loaded?

My original intention was for them to be part of the #lang pragma. One of the exports that a #lang points at would be the readtable (the other being an as-yet-to-be-determined module-manipulating thing). For the first pass, though, you can definitely just do the CLI thing.
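
To make that concrete, here is a hedged sketch of what a #lang target module might export (purely illustrative; the entry shape borrows from the desugaring example above, and none of these names are settled):

// my-lang.js -- a hypothetical module that a #lang pragma points at
const readtable = [
  {
    token: '~>',
    read(reader) {
      // consume characters and produce tokens for the extension here
    }
  }
];

// one export is the readtable; the other (the module-manipulating
// thing) is still to be determined
export default readtable;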

> Is it feasible to implement #`...` -> syntaxQuote`...` and syntaxQuote as reader and syntactic extensions, respectively?

Yes, definitely; syntax templates should be implementable via a readtable.

> What were your thoughts for cleaning up the reader?

I just pushed a branch with some doodling I was doing the other day. Feel free to take or ignore it as you see fit.

The main things I wanted to do were to pull in the tokenizer directly, so we can own changes to it rather than relying on the shift parser dependency. I was also thinking of changing the Reader class to use composition rather than extending the Tokenizer, but I'm not sure which is better in this situation. We could also address #551.

Also, I would love it if the / disambiguation logic could be in the readtable.

@gabejohnson
Member

@disnet If you're considering removing the parser dependency anyway, I've been thinking about a more radical idea. It'd be a bunch of monkey work, but what if we implement the reader to delegate completely to a readtable?

A table entry could have a predicate (character sequence, regex, or function) to match the next token, paired with a tokenizer function. The function could return either a string or a token. If it's a string, the reader splices it back into the source and continues from the index before the function was called. Otherwise, the source is consumed and the token is added to the stack.
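
For illustration, a sketch of what such a table entry might look like (the predicate/tokenizer property names and the token shape are hypothetical; the string-splicing behavior is as described above):

const pipeEntry = {
  // predicate: a character sequence, regex, or function that
  // matches the next token
  predicate: /^\|>/,
  tokenizer(reader) {
    // Returning a string: the reader splices 'PIPE_MACRO' back into
    // the source and continues from the index before this call.
    return 'PIPE_MACRO';
    // Returning a token instead would consume the source and push
    // the token onto the stack, e.g.:
    // return { type: 'Identifier', value: 'PIPE_MACRO' };
  }
};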

Another possibility is to have either a tokenizer or a transform function in the table entry.

Now, I haven't thought out all of the implications, but the idea keeps coming back to me.

@gabejohnson
Member

Most of the work would be in pulling apart the current tokenizer and populating the base readtable with it.

@disnet
Member

disnet commented Aug 7, 2016

> but what if we implement the reader to delegate completely to a readtable?

Yes, absolutely!

@gabejohnson
Member

gabejohnson commented Aug 18, 2016

@disnet we previously discussed limiting the horizon of macro contexts in #537 (comment).

What are your thoughts on the same question for reader macros? My understanding is that pre-1.0 readtables allowed access to the entire source of a file.

Do we still want everything to be visible?

@disnet
Member

disnet commented Aug 18, 2016

@gabejohnson yeah, they definitely should have access to everything. A macro needs to play nice with other macros (and binding forms), so it must be limited, but a reader is low-level and gets to do whatever it wants.

@gabejohnson
Member

gabejohnson commented Aug 23, 2016

@disnet I'm currently favoring using a zipper as the reader abstraction, copying some of the clojure.zip API: https://clojuredocs.org/clojure.zip

This would give the user a great deal of flexibility, allowing them to transform the matched character sequence in a number of ways:

In this example, the sequence is replaced with a token. The actual Reader instance will advance rdr after tokenizer returns:

{
  matcher: '|>',
  tokenizer: (rdr) => rdr.replace({
    type: TokenType.IDENTIFIER,
    value: 'PIPE_MACRO'
  })
}

Here, one substring is replaced with another, and rdr is moved back one position so that 'PIPE_MACRO' can be matched by the Reader:

{
  matcher: '|>',
  tokenizer: (rdr) => {
    rdr = rdr.replace('PIPE_MACRO');
    return rdr.left();
  }
}

The following two examples illustrate using a function/method to remove boilerplate:

{
  matcher: '|>',
  tokenizer: (rdr) => rdr.fromString('PIPE_MACRO')
}

Although it's more convenient, I'm hesitant to include fromString as a method on the zipper, as it should be fairly easy to implement as a utility function:

{
  matcher: '|>',
  tokenizer: (rdr) => ReaderUtils.fromString(rdr, 'PIPE_MACRO')
}

I'd particularly like feedback from you @elibarzilay and @jlongster as, I believe, you are both quite familiar with reader macros.

@gabejohnson
Member

Another possibility is to have the Reader infer that if a string is returned it should be a simple replacement:

{
  matcher: '|>',
  tokenizer: (rdr) => 'PIPE_MACRO'
}

or even

{
  matcher: '|>',
  tokenizer: 'PIPE_MACRO'
}

@disnet
Member

disnet commented Aug 24, 2016

@gabejohnson 👍 on your plans so far.

I think Racket has a way of specifying modes for each mapping: terminating vs. non-terminating macros, deferring a char to another mapping, etc.

Does your current design have something like modes? It's probably not critical for the first version, but I just want to make sure the API will have an obvious extension point.

@gabejohnson
Member

gabejohnson commented Aug 24, 2016

@disnet I suppose modes in my scheme (no pun intended) are somewhat implicit. For instance, the last example above shows a string as the value of the table entry tokenizer property. This is akin to Racket's make-readtable with a char passed as the mode parameter.

Terminating macros would be implicit in that the author of the entry can do whatever they want once they match, say, {. They could treat everything that follows as a line comment, match on } and create a child in the tree, or do a char/string substitution.

I honestly hadn't given any thought to non-terminating macros. Though you could write a matcher for identifiers that keeps accumulating chars until it hits one that's invalid, then defers to the readtable. "foo#bar" would match on foo; then, if # is in the current readtable, it would return a reader with the result of that match. I'm not sure what the API would look like there. Maybe next from the clojure.zip API?
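
For illustration, a rough, self-contained sketch of that accumulate-then-defer idea (every name here is a hypothetical stand-in, not a settled API):

// Accumulate identifier chars until an invalid one appears, then
// defer to the current readtable's entry for that char (if any).
const isIdentifierChar = (c) => /[A-Za-z0-9_$]/.test(c);

function matchIdentifier(source, start, readtable) {
  let i = start;
  while (i < source.length && isIdentifierChar(source[i])) {
    i++;
  }
  // For "foo#bar" we stop at '#'; if '#' has an entry in the current
  // readtable, that entry's tokenizer would take over from here.
  const next = i < source.length ? source[i] : null;
  const defer = next !== null ? readtable[next] || null : null;
  return { identifier: source.slice(start, i), end: i, defer };
}

// matchIdentifier('foo#bar', 0, { '#': hashEntry }) ->
//   { identifier: 'foo', end: 3, defer: hashEntry }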

We could certainly add some convenience functions/methods in the future. In fact, maybe I should start with matcher only accepting functions and make the string (regex?) option a convenience later.

@disnet
Member

disnet commented Aug 26, 2016

So I've been thinking about your proposed API a bit more and have become more convinced that the API should be as low-level as possible (this of course applies to everything we put into core Sweet, not just readtables). It doesn't have to be nice as long as it is possible to build nice things on top of it.

In particular, I don't think you want table entries to be strings at all. A char code should be sufficient, since "|>" is equivalent to having the entry be the char code | with the method attempting to consume > or deferring to the parent readtable's | entry. A library can of course compile strings, regexes, and other kinds of patterns down to this core API, but we don't need/want to do it ourselves.

As with everything in Sweet, we want to let a thousand APIs bloom.

I'm still not entirely following how the zipper API works for readtables. Could you explain how zippers work in this context? It seems to me the core things you need to be able to do inside the tokenizer method are:

  • peek at the raw string after the current position without consuming it
  • consume some number of characters
  • read the next token using the current readtable
  • lookup line and column info at the current position
  • create a new token
    • need to have constructors for all token types that enforce required arguments like line info
  • (optional) ability to see previously read tokens?

Maybe the zipper API you have in mind does all that, but I'm not following yet :)

@gabejohnson
Member

Restricting matcher to char codes rules out implementing identifier matching in a base readtable. But my bigger concern is that we would incur a serious performance hit by implementing the entire base reader in the readtable. Only one way to find out.

As an aside, I was thinking of a scheme to use tries to match keywords and then cache identifiers in the trie as well. Just a thought.

Now let me address the zipper API:

A Focus instance represents the current position in the token tree/source string. If the focus is on a character, it contains the character/char code and location info (line, column, index). The tokenizer function is responsible for transcribing the location info to a new token.

right moves the focus one character to the right if that character hasn't already been consumed; otherwise it moves the focus to the next token. right does NOT consume the character.

I haven't decided what happens at EOS.

left moves the focus one character to the left unless there are no unconsumed characters to the left. Otherwise it moves the focus to the next token to the left.

I haven't decided what happens if there are no more tokens to the left. Clojure would return nil. That's obviously not cool in our case.

next moves the focus one token right/down in a depth-first traversal. It consults the readtable to tokenize the following characters. If there is already a token to the right, next acts like right, unless the current item is a delimiter, in which case the focus drops into the delimiter. next DOES consume characters.

prev moves the focus left/up in a depth-first traversal. It acts like left unless there are no more items to the left, in which case it moves up if possible.

up moves the focus out of the current delimiter.

down moves the focus into the current delimiter (if it IS a delimiter; otherwise, throw?).

None of these traversal methods return nodes/characters. They all return a new, immutable focus.

That's the traversal API. I do have a slight concern that users' intuitions might be violated, since next is not read-only. Maybe there should be an explicit tokenize.

@gabejohnson
Member

gabejohnson commented Aug 26, 2016

Now for the node access and tree modification API:

node/getNode could be a method or getter. It returns the node at the current focus.

remove removes the node at the current focus and returns a new focus moved as though prev were called.

insertLeft/insertRight insert nodes to the left and right, respectively.

Optionally, other zipper methods could be implemented, but I think everything else could be written as utilities (though I'm probably overlooking something).

node could be a getter/setter. Then we wouldn't need a replace util.
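
For illustration, here is a sketch combining the traversal and node-access methods described above (the entry shape comes from the earlier |> examples; the exact semantics, including whether right from the new focus returns to the starting position, are still open):

{
  matcher: '|>',
  tokenizer: (rdr) => {
    // look left without consuming anything; `node` is the
    // already-created token at that focus
    const prev = rdr.left();
    if (prev.node.type !== TokenType.IDENTIFIER) {
      // hypothetical error handling
      throw new SyntaxError('|> expects an identifier on its left');
    }
    // move back to the original focus and swap '|>' for a token
    return prev.right().replace({
      type: TokenType.IDENTIFIER,
      value: 'PIPE_MACRO'
    });
  }
}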

@gabejohnson
Member

Also, I forgot about path. It should be/return an object that allows you to pass it to a traverse method/function and arrive at the position you were at when you got it.

root would be a nice-to-have, but you can always get the path of the root node and pass it to traverse. Actually, I'll have to play around with that one.

@gabejohnson
Member

As for the token constructors, I've just been hand-rolling them, but I see what you mean. Maybe the focus is a property on the reader and you can also call rdr.createIdentifier(string, location, ...).

I'll grant you that with only source, index, line, and column you could do most of the stuff the zipper does (except for inspecting already-created tokens). The reason I want to make the zipper native is so that the scanning could be implemented almost entirely in a base readtable and traversal would be a snap. Otherwise, looking behind is going to be O(n) and cumbersome. If zippers were implemented as a library, then reading would be O(n^2).

I'd like to easily be able to create a delimiter, jump back/forward in the tree, insert it and return to the "current" position.

@gabejohnson
Member

If you used zippers, you'd be able to do rdr.createIdentifier(startFocus, endFocus).

@disnet
Member

disnet commented Aug 26, 2016

> Restricting matcher to char codes rules out implementing identifier matching in a base readtable.

I don't see how. You would just have entries for each char code in IdentifierStart that each dispatch to the same method:

let tokenizeIdentifier = ...

let readtable = {
  'a': tokenizeIdentifier,
  'b': tokenizeIdentifier,
  ...
}

Should be faster than an arbitrary-length string match, right? There's a lot of repetition for each entry, but we have macros for that :) Or maybe we could just have a way to declare a range of char codes?

Your proposed zipper API is appealing, but can you build it on top of a simpler set of primitives? Strawman proposal cribbed from Racket:

function makeReadtable(base: Readtable, entries: [Entry]): Readtable

type Entry = {
  char: number or Range;
  mode: ReadMode;
  action(source: CharStream, info: LineInfo): Token?;
}

type CharStream = {
  // return `num` chars without advancing
  peek(num: number?): string;
  // return `num` chars and advance
  readChar(num: number?): string;
  // read `num` tokens, consulting the currently installed readtable
  read(num: number?): Token?;
}
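
As a usage sketch under this strawman, here is an identifier entry covering a char-code range (the Range constructor, the mode value, and makeIdentifierToken are all assumptions; peek and readChar are from the CharStream type above):

const identifierEntry = {
  // hypothetical Range covering 'a'-'z', per `char: number or Range`
  char: new Range('a'.charCodeAt(0), 'z'.charCodeAt(0)),
  mode: 'terminating', // assumed mode name, cribbed from Racket
  action(source, info) {
    let chars = '';
    // peek one char without advancing; readChar consumes it
    while (/[a-z]/.test(source.peek(1))) {
      chars += source.readChar(1);
    }
    // hypothetical token constructor; location comes from `info`
    return makeIdentifierToken(chars, info);
  }
};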

To be clear, I'm not saying we shouldn't do the zipper API; I just want to build the zipper API on top of a set of primitives.

Side note: if I understand correctly, the zipper API allows you to insert tokens at arbitrary locations in the already-read stream. That concerns me, since it combines two concerns: tokenization (what tokens are) and token stream rewriting (where tokens go). Readtables should only be about the former, as that allows them to be (relatively) composable (i.e. mutually unaware readtable authors can both extend the base reader and things mostly work out).

@disnet
Member

disnet commented Aug 26, 2016

> The reason I want to make the zipper native is so that the scanning could be implemented almost entirely in a base readtable and traversal would be a snap. Otherwise, looking behind is going to be O(n) and cumbersome. If zippers were implemented as a library, then reading would be O(n^2).

I don't understand this

@gabejohnson
Member

gabejohnson commented Aug 27, 2016

> Restricting matcher to char codes rules out implementing identifier matching in a base readtable.

This was based on an (arguably bad) assumption that matching and tokenizing would be separate concerns. Though I suppose nothing is gained by this if a null result from action means no match.

The complexity remarks were based on a complecting of tokenization and rewriting. I suppose it's actually O(nm): if there are n characters in the source, then constructing a zipper is O(n) in the worst case, and there are m readtable lookups. If a zipper is created upon each lookup, this is O(nm). Obviously, separating them changes things.

Now I would argue that creating the token tree is already complecting tokenization and rewriting. But it is a simple transform and makes sense to do at the same time.

@gabejohnson
Member

Are you suggesting that there could be a separate token stream rewrite phase? Or simply that it shouldn't be allowed?

@gabejohnson
Member

gabejohnson commented Aug 27, 2016

Though I suppose the rewrite phase is during enforestation. There are still things that I would like to do that appear to be difficult w/o moving tokens around arbitrarily. Maybe I'm overlooking something.

Suppose I wanted to write a language that looked like the following:

#lang whatever

export {
  foo: x => ...,
  bar: (x, y) => ...,
}

local {
  baz: x => ...,
  qux: (x, y, z) => ...,
}

import {
  quux from './quux',
  corge from './corge',
}

This would translate to:

import quux from './quux';
import corge from './corge';

const baz = x => ...;
const qux = (x, y, z) => ...;

export const foo = x => ...;
export const bar = (x, y) => ...;

I could maybe define whatever to insert a whatever macro at the top of the file and then do all of my manipulation after that, but then I'd have to write my own matching logic:

import { export, local, import } from './whatever' for syntax;

whatever

export {
  foo: x => ...,
  bar: (x, y) => ...,
}

local {
  baz: x => ...,
  qux: (x, y, z) => ...,
}

import {
  quux from './quux',
  corge from './corge',
}

@disnet
Member

disnet commented Aug 27, 2016

This is exactly what the lang pragma is all about :)

Following Racket, the lang pragma will have an API that allows you to install a readtable, set up new syntax and implicit form bindings, and in general rewrite the syntax stream.

Your proposed zipper API would probably work really nicely here, but I'm pretty sure we want to keep it separate from readtables.

@gabejohnson
Member

Sounds good! I'll start with the API you outlined above. I think I have a pretty clear idea of how to go about implementing most of it. The only mode I'm not sure of is non-terminating-macro. But I'm sure it'll come out in the wash.

@gabejohnson
Member

@disnet for Entry.char, should we be using code points instead of char codes?

@disnet
Member

disnet commented Aug 29, 2016

Yes indeed! Good catch.

@gabejohnson
Member

gabejohnson commented Sep 1, 2016

@disnet two questions:

  1. Should action have a prefix argument (for resolving /, etc.)?

It could be either a list or a stream of Tokens. Definitely read-only.

  2. Should undefined/null returned from action throw an error?

At first I was thinking that returning null could cause the reader to fall back to a previous entry for that char. Now I'm not so sure. Since we have the base readtable available to us when we make a new one, we could always grab the previous entry and call its action. I'm leaning toward being explicit; this is just a sanity check.

@disnet
Member

disnet commented Sep 2, 2016

> 1. Should action have a prefix argument (for resolving /, etc.)?

Yes, I think it should.

> 2. Should undefined/null returned from action throw an error?

Actions need a way to consume from the CharStream but not produce a token (e.g. a comment). null would work for this case, right?
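
For example, a line-comment action might look like this under the strawman CharStream above (the ; trigger char and mode value are just illustrative, and peek returning an empty string at end-of-source is an assumption):

{
  char: ';'.charCodeAt(0), // illustrative trigger character
  mode: 'terminating',     // assumed mode name
  action(source, info) {
    // consume up to the end of the line...
    while (source.peek(1) !== '' && source.peek(1) !== '\n') {
      source.readChar(1);
    }
    // ...and emit nothing: null means "consumed, but no token produced"
    return null;
  }
}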

@gabejohnson
Member

> Actions need a way to consume from the CharStream but not produce a token (e.g. a comment). null would work for this case, right?

You're right. In the fallback scenario I was going to have a special noop token, but if we're being explicit about passing down the action chain, then null for noop makes more sense.

@gabejohnson
Member

#601 and #626 should close this, no?

disnet closed this as completed on Jan 21, 2017