Ability to continue model fitting #6
base: develop
Conversation
These are good ideas! I still think HCA or Mallet is probably the right tool for serious fitting, but I'm happy to implement these changes. For simplicity's sake and to keep to sklearn's philosophy, I'd like to have the interface be as simple as possible. (One imagined audience is researchers in the humanities and social sciences with small corpora.)

I like deleting WS and DS because they seem likely to confuse people if they are undocumented, and they are also easy to recover. ZS can be recovered, with some effort, from

What about adding a

Let me add some code comments to the random_state business. Essentially, I'm worried about whether people will be able to see the random seed they originally passed. For example, if the seed is set to
I thought about adding options to

With regard to the temporary values, I see your point. I'm not sure how better to handle that. One option is to accept a keyword to

Regarding

That is, just eliminate the temporary
Re: random_state, the issue is that you want to record the seed passed somewhere so someone can look it up (in case they forgot it) and then rerun the model to get exactly the same results. I realize it's a little confusing. There's probably a clearer way.
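To make that idea concrete, here is a minimal sketch (not the package's actual code) of recording a concrete seed whenever none is passed, so a user can read it back later and reproduce the run; the class and attribute names are illustrative only.

```python
import numpy as np

class SeedRecordingModel:
    """Toy illustration only: always store a concrete seed so it can be recovered."""

    def __init__(self, random_state=None):
        if random_state is None:
            # Draw a concrete seed so the user can look it up afterwards.
            random_state = np.random.randint(0, 2**31 - 1)
        self.random_state = random_state            # visible to the user after fitting
        self.rng = np.random.RandomState(self.random_state)

# model = SeedRecordingModel()
# model.random_state  # note this value; passing it back reproduces the run exactly
```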
You make some good points. Can I think about this more in a week or so when I have more time? In principle, I'd really like to stick to the sklearn interface as closely as possible. Perhaps they have a method for resuming fits? I trust you're comfortable using your fork for the immediate future.

For example, I'd really like it if one could just drop in LDA as a replacement for NMF: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html I think it would also be great if the documentation were as terse and as welcoming.
Take your time. I'm content using my fork in the meantime. Understood regarding the sklearn interface. But one difference I see (maybe I'm missing something) is that the NMF interface lets the user specify stopping criteria (in addition to

Regarding
Maybe one could do something innovative here. If one ran two threads in parallel, one could test for convergence, I think.
Wait... I thought I was the one who was supposed to be breaking the sklearn paradigm. 😉 Just kidding. There are likely a few issues with running multiple threads, one of which is doubling the processing requirements. There must be some commonly used stopping criteria for learning topic models. Perhaps a tolerance on the change in
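A minimal sketch of one such stopping rule, a relative tolerance on the change in the complete log likelihood checked every few sweeps; the callables, the tolerance, and the check interval are illustrative assumptions, not anything the package provides.

```python
def fit_until_converged(sample_once, loglikelihood, max_iter=2000,
                        tol=1e-4, check_every=10):
    """Run Gibbs sweeps until the complete log likelihood stops changing much.

    `sample_once` performs one full Gibbs sweep; `loglikelihood` returns the
    complete log likelihood of the current state. Both are hypothetical hooks.
    """
    prev_ll = None
    for it in range(max_iter):
        sample_once()
        if it % check_every == 0:
            ll = loglikelihood()
            if prev_ll is not None and abs(ll - prev_ll) < tol * abs(prev_ll):
                return it, ll                 # converged to within tolerance
            prev_ll = ll
    return max_iter, loglikelihood()          # hit the iteration cap
```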
Processing requirements will never be a problem. Everyone has more than one core these days; one can release the GIL in Cython. WS and DS can be shared by both threads; only ZS will differ, so only modestly more memory. Here's the idea:

If this allows users to simply set
Test for convergence by calculating the Rhat of the complete loglikelihood of the two chains. Perhaps one can do this every 100 iterations.
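For reference, a minimal sketch of the Gelman-Rubin Rhat computed from the log-likelihood traces of the two chains (one value recorded per chain every 100 iterations, say); the common Rhat < 1.1 cutoff is a convention, and none of this exists in the package.

```python
import numpy as np

def rhat(chains):
    """Gelman-Rubin potential scale reduction factor.

    `chains` has shape (n_chains, n_samples), e.g. the complete log
    likelihood of each chain recorded every 100 iterations.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: between-chain variance
    var_hat = (n - 1) / n * within + between / n     # pooled variance estimate
    return np.sqrt(var_hat / within)

# e.g. stop sampling once rhat([ll_chain_1, ll_chain_2]) < 1.1
```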
I've had another thought. What if one passed data from the existing model (e.g., the phi and the theta) as initial values to a fresh fit? It would be continuation in all but name. How does that sound?
I think that would work. Do you mean calling

where

In the runs that I did, there seemed to be a significant amount of time spent building the initial internal model data arrays, so it would be nice if one could set a flag to not delete them (so they could be easily reused in the second model), but I understand your concern about keeping those parameters around unnecessarily.

Regarding model convergence: unless there is no other reasonable option, I would avoid trying to run two models concurrently to test for convergence. It seems like there could be situations where they appear to converge but actually have not. More importantly, I think it would be better to use the multiple cores for solving the "main" model (i.e., utilize multithreading when solving the model). Otherwise, even with multithreading, the effective number of cores would always be cut in half to run the second model.
I mean the second. One would add new parameters like initial_theta and initial_phi (or initial_ndz and initial_nzw) to __init__. Multiple cores is a great idea. Someday I'll get around to this. It looks like one could do something with the atomic operations that are in recent versions of GCC.
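A hedged sketch of that warm start, using the doc_topic_ and topic_word_ attributes a fitted model exposes; the initial_theta and initial_phi keyword arguments are the hypothetical parameters proposed above and do not exist in the released package.

```python
import lda

X = lda.datasets.load_reuters()  # a document-term matrix

model1 = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model1.fit(X)

# Hand theta and phi from the first run to a fresh model as starting values,
# so the second fit continues roughly where the first one stopped.
# initial_theta / initial_phi are the proposed (hypothetical) parameters.
model2 = lda.LDA(n_topics=20, n_iter=500, random_state=1,
                 initial_theta=model1.doc_topic_,   # theta: document-topic proportions
                 initial_phi=model1.topic_word_)    # phi: topic-word distributions
model2.fit(X)
```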
For sufficiently large vocabularies/corpora, fitting the model can take quite a while. If the model hasn't converged after the chosen number of iterations, one must call fit (or fit_transform) again with a larger number of iterations, essentially repeating all the previous iterations before continuing to improve the model fit. This change adds a continue_fitting method to the LDA class that allows one to continue where previous fitting left off. To accomplish this I did the following:

- Split the _fit method into _fit and _run_fitting_iterations (which is called by _fit).
- Stopped deleting the WS, DS, and ZS members after fitting (I added a delete_temp_vars method to explicitly do that, if desired).
- Added a continue_fitting method (which also calls _run_fitting_iterations).

Lastly, I set the random_state member in LDA.__init__. Take a look at that specifically and let me know if that change would adversely affect anything. I'm not sure what the intent of the following two lines was:

When the random_state argument is None, it appears that the random seed will be initialized twice: once when rng is set (in the line above) and again when self.random_state is used (since it is still equal to None above). Note that my changes still have the seed set twice (though I could easily change that); the difference is that the random_state member is explicitly initialized in the constructor now.
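To summarize the proposed workflow, here is a hedged usage sketch; continue_fitting and delete_temp_vars follow the description above, but their exact signatures (e.g., whether continue_fitting takes an n_iter argument) are assumptions about this branch, not released behavior.

```python
import lda

X = lda.datasets.load_reuters()

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)                      # initial run; WS, DS, ZS are now retained

# Not converged yet: keep sampling from the current state instead of
# repeating the first 500 iterations from scratch.
# continue_fitting / delete_temp_vars are this branch's proposed methods
# (signatures assumed).
model.continue_fitting(n_iter=500)

# When finished, free the sampler's internal arrays explicitly.
model.delete_temp_vars()
```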