Experiment with some clustering algorithms on top of the similarity metric #8

Open
marco-c opened this issue Mar 16, 2017 · 6 comments

@marco-c
Owner

marco-c commented Mar 16, 2017

No description provided.

@mansimarkaur
Contributor

mansimarkaur commented Mar 23, 2017

I want to work on this. Could you please provide a description explaining this issue a little. Thanks!

@marco-c
Owner Author

marco-c commented Mar 25, 2017

We have a way to evaluate the distance between two crash traces (WMD, though in the future we might experiment with other distance metrics, #9). We can use this distance to cluster the stack traces into groups.

We should test with some clustering algorithms (http://scikit-learn.org/stable/modules/clustering.html) and see how they perform.
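As a rough illustration of the idea above, here is a minimal sketch that builds a pairwise distance matrix once and then clusters on it. Everything in it is an assumption made for illustration: the Jaccard distance over frame names is only a cheap stand-in for WMD, the frame names are made up, and the threshold-based single-linkage grouping is a toy algorithm, not one of the scikit-learn ones. The matrix `D` is the kind of input that scikit-learn's `DBSCAN` accepts when constructed with `metric="precomputed"`.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Stand-in for WMD (assumption): 1 - |A & B| / |A | B| over frame sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def distance_matrix(traces, dist):
    """Compute the symmetric pairwise distance matrix once, up front."""
    n = len(traces)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = dist(traces[i], traces[j])
    return D


def threshold_clusters(D, eps):
    """Toy single-linkage clustering: link traces whose distance is <= eps."""
    n = len(D)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if D[i][j] <= eps:
            parent[find(i)] = find(j)

    # Renumber cluster roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical stack traces, represented as lists of frame names.
traces = [
    ["js::RunScript", "js::Interpret", "nsDocShell::LoadURI"],
    ["js::RunScript", "js::Interpret", "nsDocShell::Reload"],
    ["gfx::DrawTarget::Fill", "gfx::Compositor::Render"],
    ["gfx::DrawTarget::Fill", "gfx::Compositor::Render", "gfx::Flush"],
]

D = distance_matrix(traces, jaccard_distance)
labels = threshold_clusters(D, eps=0.6)
print(labels)
```

Computing the matrix separately from the clustering step means the slow distance is evaluated once per pair, and different algorithms can then be tried on the same matrix.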

If the implemented algorithm turns out to be too slow (it's possible, as WMD is really slow), we can try two things:

  • see if alternative distance metrics (Experiment with alternative distance metrics to WMD #9) are fast enough;
  • create a clustering algorithm customized for our task. E.g. instead of clustering on the complete list of stack traces, we could implement a two-level clustering, where the first level is generated by the algorithm currently used on Socorro [signature] and the second level is used to split the stack traces from a given signature into multiple groups. Or the first level could be generated by a simpler distance metric and the second level by WMD.
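The two-level idea in the second bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: each crash is taken to be a `(signature, frames)` pair, the signatures and frame names are invented, the Jaccard distance again stands in for WMD, and the second level uses a simple greedy "leader" clustering rather than any particular well-known algorithm.

```python
from collections import defaultdict


def frame_distance(a, b):
    """Placeholder for the expensive metric such as WMD (assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def split_group(traces, dist, eps):
    """Second level: greedy leader clustering within one signature.

    Each trace joins the first cluster whose leader is within eps,
    otherwise it becomes the leader of a new cluster.
    """
    leaders, labels = [], []
    for trace in traces:
        for k, leader in enumerate(leaders):
            if dist(trace, leader) <= eps:
                labels.append(k)
                break
        else:
            leaders.append(trace)
            labels.append(len(leaders) - 1)
    return labels


def two_level_clusters(crashes, dist, eps):
    """First level: group by signature. Second level: split each group."""
    by_signature = defaultdict(list)
    for idx, (signature, _frames) in enumerate(crashes):
        by_signature[signature].append(idx)

    result = [None] * len(crashes)
    for signature, indices in by_signature.items():
        sub_labels = split_group([crashes[i][1] for i in indices], dist, eps)
        for i, label in zip(indices, sub_labels):
            result[i] = (signature, label)
    return result


# Hypothetical crashes: (signature, list of frames).
crashes = [
    ("OOM | small", ["malloc", "js::NewObject", "js::Interpret"]),
    ("OOM | small", ["malloc", "js::NewObject", "js::RunScript"]),
    ("OOM | small", ["malloc", "gfx::AllocSurface", "gfx::Render"]),
    ("shutdownhang", ["nsThread::Shutdown", "PR_Wait"]),
]

clusters = two_level_clusters(crashes, frame_distance, eps=0.6)
for (signature, frames), cluster in zip(crashes, clusters):
    print(cluster, frames)
```

The expensive distance is only evaluated between traces that share a signature, so the number of comparisons grows with the size of each group instead of the size of the whole data set.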

But let's not worry about this slowness problem for now. Let's try with the WMD distance and a well-known algorithm first.

@aditya-iitd
Contributor

With WMD, we can only use those clustering algorithms in which metric used is distance between points or others (eg-: Euclidean distance ) can also be used?

@marco-c
Owner Author

marco-c commented Mar 30, 2017

With WMD, we can only use those clustering algorithms in which metric used is distance between points or others (eg-: Euclidean distance ) can also be used?

Can you rephrase this question? Euclidean distance is a distance between points too.

@mansimarkaur
Contributor

How exactly should we compare the various algorithms?
One metric would be speed, which can be covered by speed benchmarks.
Another one would be accuracy, how do we go about testing this?

@marco-c
Owner Author

marco-c commented Apr 8, 2017

Another one would be accuracy, how do we go about testing this?

A possible approach is #39.
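For the speed and accuracy comparison discussed above, one possible harness is sketched below. It is an assumption, not necessarily what #39 proposes: it presumes a small set of traces with hand-assigned reference groups exists, times each candidate with `time.perf_counter`, and scores agreement with the Rand index (the fraction of trace pairs on which two labelings agree). The two candidate functions and the single-letter frame names are toy placeholders.

```python
import time
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs that both labelings treat the same way
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def benchmark(cluster_fn, traces, reference_labels):
    """Run one clustering function and report wall time and agreement."""
    start = time.perf_counter()
    predicted = cluster_fn(traces)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "rand_index": rand_index(predicted, reference_labels),
    }


# Two toy candidates (placeholders for real clustering algorithms).
def by_top_frame(traces):
    """Group traces that share the same first frame."""
    seen = {}
    return [seen.setdefault(trace[0], len(seen)) for trace in traces]


def all_in_one(traces):
    """Degenerate baseline: put every trace in the same cluster."""
    return [0] * len(traces)


# Hypothetical traces with hand-assigned reference groups.
traces = [["a", "b"], ["a", "c"], ["d", "e"], ["d", "f"]]
reference = [0, 0, 1, 1]

results = {
    fn.__name__: benchmark(fn, traces, reference)
    for fn in (by_top_frame, all_in_one)
}
for name, result in results.items():
    print(name, result)
```

The Rand index ignores how clusters are numbered, so two algorithms that produce the same grouping under different label values still score as identical.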


4 participants