Uniformity feature: Support mutual information for cells of observed rows #250
Comments
Being a little unclear on the math, I'm not sure what the use case looks like. Probability of dependence estimates the probability that mutual information between columns is nonzero? Similarity measures the mutual information between observed rows? What does this effectively mean for individual observed cells?
The architecture is that we have an infinite exchangeable set of tuples of random variables {(A_r, B_r, C_r)}_r, and we approximate the posterior distribution given certain assignments A_0 = a_0, C_1 = c_1, &c. Currently we can only approximate mutual information for two random variables A_i, B_i in the same row i for which no values have been assigned. This issue is to allow approximating it for two random variables from a row that has been observed.

More specifically, the architecture for Crosscat is that there are additional categorical variables {(L_r, M_r, N_r)}_r which we cannot observe, and whose number we cannot observe either. Each Crosscat state is a sample from the distribution on the number of latent variables (views) and their assignments (categories). Each model estimator evaluates a Monte Carlo integral (1) over samples of Crosscat states of some function of a single Crosscat state.

In this case, approximating mutual information of variables of an entirely unobserved row from a single Crosscat state means evaluating a Monte Carlo integral (2) over samples of category assignments of some mutual information estimator, which is itself a Monte Carlo integral (3) over samples of the posterior predictive distribution on the variables given the category assignments.

What @axch proposes to do is to implement approximation of mutual information of variables of an observed row from a single Crosscat state. In that implementation, the Monte Carlo integral (2) over samples of category assignments is replaced by a single evaluation of (3), given the fixed category assignments of that observed row in that Crosscat state.
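The three nested integrals described above can be sketched in a few lines of Python. This is only a toy illustration, not Crosscat or bayeslite code: the state layout (`weights`, `rho`, `row_category`), the function names, and the choice of a bivariate standard normal per category are all assumptions made for the example. It follows the description above in treating (2) as an average of per-category estimates.

```python
import math
import random


def mi_given_category(rho, n_samples, rng):
    """Integral (3): Monte Carlo MI estimate for one fixed category.

    Toy assumption: given the category, the two columns are standard
    normal with correlation rho (|rho| < 1), so the exact value is
    -0.5 * log(1 - rho**2).
    """
    s2 = 1.0 - rho * rho
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(s2) * rng.gauss(0.0, 1.0)
        # log p(x, y) - log p(x) - log p(y) under the toy model
        total += (-0.5 * math.log(s2)
                  - (x * x - 2.0 * rho * x * y + y * y) / (2.0 * s2)
                  + 0.5 * (x * x + y * y))
    return total / n_samples


def mi_unobserved_row(state, n_cat_samples, n_samples, rng):
    """Integral (2): average (3) over sampled category assignments."""
    cats = rng.choices(range(len(state["weights"])),
                       weights=state["weights"], k=n_cat_samples)
    return sum(mi_given_category(state["rho"][c], n_samples, rng)
               for c in cats) / n_cat_samples


def mi_observed_row(state, row, n_samples, rng):
    """Proposed path: the row's category is fixed in this state,
    so (2) collapses to a single evaluation of (3)."""
    c = state["row_category"][row]
    return mi_given_category(state["rho"][c], n_samples, rng)


def average_over_states(states, per_state_estimate):
    """Integral (1): average a per-state estimate over sampled states."""
    return sum(per_state_estimate(s) for s in states) / len(states)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical sampled states, each with two categories.
    states = [
        {"weights": [0.5, 0.5], "rho": [0.0, 0.8], "row_category": {0: 1}},
        {"weights": [0.3, 0.7], "rho": [0.2, 0.6], "row_category": {0: 0}},
    ]
    print(average_over_states(
        states, lambda s: mi_unobserved_row(s, 50, 200, rng)))
    print(average_over_states(
        states, lambda s: mi_observed_row(s, 0, 200, rng)))
```

In this sketch the observed-row path calls `mi_given_category` once per state, while the unobserved-row path calls it once per sampled category assignment, which is where the expected speedup comes from.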
This should be much faster than for unobserved rows, because the cluster assignment of an observed row in each model is assumed known. The current MI code path that I am aware of only computes the MI of columns for unobserved (new) rows.