Adding a --info flag to output meta information of the model #808

Closed
rok-cesnovar opened this issue Feb 2, 2021 · 10 comments · Fixed by #965

Comments

@rok-cesnovar
Member

rok-cesnovar commented Feb 2, 2021

This is something I have seen users work around with various techniques; there are also packages that do this with regexes and the like.

I understand some of this can be accessed via the generated C++ models, but that requires instantiating the model with the data and working closely with C++, which seems like overkill.

Examples of use:

This should not be too difficult to do properly in stanc3. As you can see in the posteriordb issue, @mandel has already made a branch with almost everything we need.

Meta data we could output and use:

  • list of included files
  • list of input data
  • list of transformed parameters and parameters
  • list of generated quantities
  • dimensionality of all data & parameters
  • (optional) does it use reduce_sum/map_rect

There may be other ideas. The output format should probably be pretty-printed JSON, as it is both human-readable and easy to work with from other scripts and languages.

@mandel would you be interested in making a PR of your branch?

@jtimonen

jtimonen commented Feb 2, 2021

An additional wish would be a list of user-defined functions and their arguments, if that is not too difficult to do. My ultimate goal would be to create a tool that can take an rstantools-based package like rstanarm, which is written with many includes, conditions, etc., and, given the input data, turn it into minimal Stan code that is much more readable. This would require removing blocks and loops that will never be reached, user-defined functions that will not be used, parameters that have dimension zero, etc. I tried to do that once, but it was too much work to parse everything myself; it could be possible if stanc3 had a feature like this.

@mitzimorris
Member

mitzimorris commented Feb 2, 2021

We need to capture the type for data and parameters; generated quantities variables can be int as well as real.

list of input data

does this mean the input data variable declarations?

list of included files

by this you mean all includes in the program? why?

@rok-cesnovar
Member Author

does this mean the input data variable declarations?

Yes, all data variables and their types, excluding transformed data variables (I wrote "input data" to differentiate from transformed data).

why?

So we can easily and programmatically check a model's dependencies. This helps in cases like taking a model out of a huge database of models, and it also lets interfaces check whether any of the included files has changed since the last compile.
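
A minimal sketch of the recompile check described above, in Python. It assumes the metadata has already been parsed into a dict and that the list of includes lives under a key named `included_files`; that key name and the helper `stale_includes` are hypothetical and not something this thread specifies.

```python
import os


def stale_includes(info, compiled_at, key="included_files"):
    """Return the included files modified after the last compile.

    info        -- parsed model metadata (a dict); the `key` entry is assumed
                   to hold a list of file paths (hypothetical field name)
    compiled_at -- timestamp of the last compile, in seconds since the epoch
    """
    return [
        path
        for path in info.get(key, [])
        if os.path.getmtime(path) > compiled_at
    ]
```

An interface could recompile whenever this list is non-empty, for example by passing the modification time of the compiled executable as `compiled_at`.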

@seantalts
Member

seantalts commented Feb 2, 2021

+1, I think it's a great idea to expose more metadata about the model. My only input would be that we already expose some of it here: https://github.com/stan-dev/stanc3/blob/master/src/stan_math_backend/Cpp_Json.ml#L73 so we could try to keep that in mind and integrate that with whatever we come up with. It gives the model class a method you can call to get a JSON dump of parameter names and dimensions, I believe. We should definitely expose that on the command line interface :)

I understand some of this can be accessed via the generated C++ models but that requires instantiating the model with the data and also requires tightly working with C++ which seems a bit of overkill.

I think you might have to instantiate the model for the dimensionality of the parameters to be printed out, at least if you want actual numbers in there (which is what RStan et al. require from the Cpp_Json stuff above). We should be able to print out whatever the definitions were (e.g., array[N] real x) without instantiating, though; maybe that's all that is needed.

@rok-cesnovar
Member Author

Yes, for actual sizes you need the data of course, except where the size is defined with a literal. I meant the number of dimensions: print whether it's a 1D array, a 2D array, etc. For matrix, vector, and scalar types the dimensionality info is redundant.

Thanks for the cpp_json pointer.

@mandel
Contributor

mandel commented Feb 2, 2021

I will be happy to help with that. The format was originally inspired by PosteriorDB.

The --info option currently generates a JSON object with the fields inputs, parameters, transformed parameters, and generated quantities. Each field contains a dictionary in which each entry corresponds to a variable in, respectively, the data, parameters, transformed parameters, and generated quantities blocks. Each variable is associated with an object that has two fields:

  • type: the base type of the variable ("int" or "real").
  • dimensions: the number of dimensions (0 for a scalar, 1 for a vector or row vector, etc.).

For example, on https://github.com/stan-dev/posteriordb/blob/master/posterior_database/models/stan/hmm_drive_0.stan the generated JSON is:

{ "inputs": { "K": { "type": "int", "dimensions": 0},
              "N": { "type": "int", "dimensions": 0},
              "u": { "type": "real", "dimensions": 1},
              "v": { "type": "real", "dimensions": 1},
              "alpha": { "type": "real", "dimensions": 2} },
  "parameters": { "theta1": { "type": "real", "dimensions": 1},
                  "theta2": { "type": "real", "dimensions": 1},
                  "phi": { "type": "real", "dimensions": 1},
                  "lambda": { "type": "real", "dimensions": 1} },
  "transformed parameters": { "theta": { "type": "real", "dimensions": 2} },
  "generated quantities": { "z_star": { "type": "int", "dimensions": 1},
                            "log_p_z_star": { "type": "real", "dimensions": 0} } }
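
To illustrate why this layout is convenient for scripts, here is a hedged Python sketch that loads metadata of this shape and flattens it into per-block lists. The embedded JSON is an abbreviated, hand-written copy of the example above, not fresh compiler output, and the helper name `variables_by_block` is made up for illustration.

```python
import json

# Abbreviated, hand-written sample in the shape described above.
INFO_JSON = """
{ "inputs": { "K": {"type": "int", "dimensions": 0},
              "u": {"type": "real", "dimensions": 1} },
  "parameters": { "phi": {"type": "real", "dimensions": 1} },
  "transformed parameters": { "theta": {"type": "real", "dimensions": 2} },
  "generated quantities": { "z_star": {"type": "int", "dimensions": 1} } }
"""


def variables_by_block(info):
    """Map each block name to a list of (name, type, dimensions) tuples."""
    return {
        block: [
            (name, spec["type"], spec["dimensions"])
            for name, spec in variables.items()
        ]
        for block, variables in info.items()
    }


info = json.loads(INFO_JSON)
summary = variables_by_block(info)

# Print one line per variable, grouped by block.
for block, variables in summary.items():
    for name, base_type, ndim in variables:
        print(f"{block}: {name} ({base_type}, {ndim} dimensions)")
```

Because the block names are ordinary JSON keys, the same loop works unchanged if further blocks or fields are added later.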

@seantalts
Member

#810 is merged and fixes many of these things. I think what's left:

  • list of included files
  • (optional) does it use reduce_sum/map_rect

Is that it?

@rok-cesnovar
Member Author

I think that is what came up so far indeed.

@seantalts
Member

seantalts commented Feb 8, 2021

A general way to address "does it use reduce_sum/map_rect" might be "produce a list of all named functions used" which I think would be fun and might even help us check test model coverage or help with dependencies later in some future Stan package system...
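
As a rough sketch of how the narrower question falls out of the general list, the Python snippet below checks a list of function names for the two parallel functions mentioned in this thread. How the list is obtained is left open: the helper `parallel_functions_used` and its input are hypothetical and do not reflect any confirmed output field.

```python
# The two functions named in this thread.
PARALLEL_FUNCTIONS = {"reduce_sum", "map_rect"}


def parallel_functions_used(function_names):
    """Return the sorted parallel functions that appear in function_names.

    function_names -- any iterable of function names used by a model, for
                      example taken from a (hypothetical) metadata field
    """
    return sorted(PARALLEL_FUNCTIONS.intersection(function_names))
```

An empty result means the model uses neither function; the same list of names could also feed coverage or dependency checks.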

@rok-cesnovar
Member Author

Nice. A list of named functions is even better.
