Add support for custom document preprocessing #98

Open
GoogleCodeExporter opened this issue Dec 1, 2015 · 0 comments

What steps will reproduce the problem?
1. Have a corpus that contains mixed-case text or punctuation
2. Run any of the algorithms on it

What is the expected output? What do you see instead?

The output should have text lower-cased where needed and punctuation handled 
according to user-specified rules.

Ideally, we could support some type of filter that takes in a Document and 
transforms it according to whatever rules the user specifies.  Should this be 
incorporated with the token filter and IteratorFactory, or should it be a 
separate step that exists entirely in GenericMain?
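
A rough sketch of what such a filter might look like, assuming a document can be treated as a plain string. All names here (`PreprocessSketch`, `DocumentPreprocessor`, `LOWER_CASE`, `STRIP_PUNCTUATION`, `NORMALIZE_WHITESPACE`, `apply`) are hypothetical and not part of the existing code; how this would hook into IteratorFactory or GenericMain is left open.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch only: none of these names exist in the project.
final class PreprocessSketch {

    /** A preprocessing step maps raw document text to transformed text. */
    interface DocumentPreprocessor extends UnaryOperator<String> {}

    /** Lower-cases the whole document, independent of the default locale. */
    static final DocumentPreprocessor LOWER_CASE =
        text -> text.toLowerCase(Locale.ROOT);

    /** Replaces each run of ASCII punctuation with a single space. */
    static final DocumentPreprocessor STRIP_PUNCTUATION =
        text -> text.replaceAll("\\p{Punct}+", " ");

    /** Collapses runs of whitespace and trims both ends. */
    static final DocumentPreprocessor NORMALIZE_WHITESPACE =
        text -> text.replaceAll("\\s+", " ").trim();

    /** Applies the user-chosen steps in the order given. */
    static String apply(String text, List<DocumentPreprocessor> steps) {
        String result = text;
        for (DocumentPreprocessor step : steps) {
            result = step.apply(result);
        }
        return result;
    }
}
```

Because each rule is a separate step, a user could pick only the ones they want (for example, lower-casing without touching punctuation) or supply their own lambda as an extra step.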

Original issue reported on code.google.com by [email protected] on 17 Jul 2011 at 12:16
