
Is there a user-facing way to discover how many models a pre-analyzed file has? #246

Open
axch opened this issue Oct 8, 2015 · 6 comments


@axch
Contributor

axch commented Oct 8, 2015

Use case: knowing roughly what kind of result robustness to expect
Use case: knowing what model ranges to specify in queries

Starting with select * from bayesdb_generator_model works but seems a little internal. Or do we intend to document (parts of) Bayeslite's schema to enable this sort of introspection?
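Concretely, the sort of thing I have in mind (a sketch against bayeslite's Python API; the bayesdb_generator_model columns are an assumption and may differ across versions):

    import bayeslite

    # Open the pre-analyzed .bdb file.
    bdb = bayeslite.bayesdb_open('satellites.bdb')

    # Peek at the internal table to count models per generator.
    # NOTE: bayesdb_generator_model is internal schema, not a documented API.
    cursor = bdb.sql_execute('''
        SELECT generator_id, COUNT(*) AS n_models
            FROM bayesdb_generator_model
            GROUP BY generator_id
    ''')
    for generator_id, n_models in cursor:
        print('generator %d has %d models' % (generator_id, n_models))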

@riastradh-probcomp
Contributor

I did at some point intend to document some parts of bayeslite's schema to allow for some introspection.

We could invent a pragma for the purpose too:

PRAGMA bayesdb_generator_precision(satellites_cc)

('Precision' is the first word that came to mind which might be taken to mean an estimate of the expected error, or might be taken to be an estimate of the population variance, &c.)

@tibbetts
Contributor

tibbetts commented Oct 8, 2015

In a separate conversation with Vikash, he wanted me, instead of saying "BayesDB says the inferred value is 12 with 70% confidence", to say something like "BayesDB, on a population with 20 observations, 32 models run for 1200 iterations, inferred a value of 12 with 70% confidence." I take it this ticket would give me the metadata to get the phrasing right.

Should charts automatically be tagged with this metadata?

@riastradh-probcomp
Contributor

Should charts automatically be tagged with this metadata?

Yes!

@axch
Contributor Author

axch commented Oct 8, 2015

One annoyance: nothing enforces that all models are run for the same number of iterations, even though this is a convention. In the interest of maximum disclosure, we would need to invent a scheme that summarizes the amount of analysis done even when it is heterogeneous, and also in the presence of streaming in more data.
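For instance, a summary query along these lines (again a sketch; it assumes bayesdb_generator_model carries a per-model iterations column, which may not hold across versions):

    import bayeslite

    bdb = bayeslite.bayesdb_open('satellites.bdb')

    # Summarize heterogeneous analysis per generator:
    # model count plus min/max/total iterations.
    # ASSUMPTION: bayesdb_generator_model has an `iterations` column.
    cursor = bdb.sql_execute('''
        SELECT generator_id,
               COUNT(*) AS n_models,
               MIN(iterations) AS min_iterations,
               MAX(iterations) AS max_iterations,
               SUM(iterations) AS total_iterations
            FROM bayesdb_generator_model
            GROUP BY generator_id
    ''')
    for row in cursor:
        print(row)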

@riastradh-probcomp
Contributor

We could take the average. (We could also multiply them!)

@gregory-marton
Contributor

recipes.analysis_status() shows this info in the notebook (by returning a df with counts of iterations and number of models that have that count of iterations). There is a lower-level function per_model_analysis_status() that returns a df with each model number and its iteration count, for which analysis_status is a .value_counts().
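Rough usage, for reference (a sketch: the quickstart entry point for building the recipes object is my assumption about bdbcontrib, and its exact name and signature may differ):

    # ASSUMPTION: quickstart is the bdbcontrib entry point that builds
    # the recipes object; the exact name/signature may differ.
    from bdbcontrib import quickstart

    pop = quickstart(name='satellites', bdb_path='satellites.bdb')

    # Histogram: number of models at each iteration count.
    print(pop.analysis_status())

    # One row per model number with its iteration count;
    # analysis_status() is just the value_counts() of this.
    print(pop.per_model_analysis_status())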

I'm not sure to what extent this counts as "user facing", because it lives in bdbcontrib rather than bayeslite, is not part of the language, and is just a Python function.

This also doesn't really address the questions of expected precision or robustness, because of course different numbers of models and iterations will be good enough for different datasets, queries, and requirements of the answer. But that's a little bit of an open problem, isn't it?
