Add support for custom document preprocessing #98

GoogleCodeExporter · 2015-12-01T06:14:48Z

What steps will reproduce the problem?
1. Have a corpus containing mixed-case text or punctuation
2. Run any of the algorithms

What is the expected output? What do you see instead?

The output should have text lower-cased where needed and punctuation handled 
according to user-specified rules.

Ideally, we could support some type of filter that takes in a Document and 
transforms it according to whatever rules the user specifies.  Could this be 
integrated with the token filter and IteratorFactory?  Or should it be a 
separate step that lives entirely in GenericMain?
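
To make the idea concrete, here is a rough sketch of what such a filter might look like. Everything in it is hypothetical: `DocumentPreprocessor`, `lowerCase()`, and `stripPunctuation()` are illustrative names, not existing classes, and the sketch works on plain strings rather than on the actual Document type so that it stays self-contained. The punctuation rule shown is only one possible user-specified rule.

```java
import java.util.Locale;

/** Hypothetical sketch only; none of these names exist in the project. */
public final class PreprocessingSketch {

    /** Hypothetical filter: takes a document's text and returns the transformed text. */
    @FunctionalInterface
    interface DocumentPreprocessor {
        String process(String text);

        /** Chains two preprocessors: this one runs first, then {@code next}. */
        default DocumentPreprocessor andThen(DocumentPreprocessor next) {
            return text -> next.process(process(text));
        }
    }

    /** Lower-cases with a fixed locale so results do not depend on the JVM default. */
    static DocumentPreprocessor lowerCase() {
        return text -> text.toLowerCase(Locale.ROOT);
    }

    /** Example rule: replace each run of ASCII punctuation with a space, then collapse whitespace. */
    static DocumentPreprocessor stripPunctuation() {
        return text -> text.replaceAll("\\p{Punct}+", " ")
                           .replaceAll("\\s+", " ")
                           .trim();
    }

    public static void main(String[] args) {
        // Users would compose whichever rules they want, in the order they want.
        DocumentPreprocessor pipeline = lowerCase().andThen(stripPunctuation());
        // prints: hello world it s a test
        System.out.println(pipeline.process("Hello, World! It's a TEST."));
    }
}
```

Because each rule is a small composable object, the same chain could in principle be applied either before tokenization (alongside the token filter) or as a standalone step in the main driver; the sketch does not assume either placement.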

Original issue reported on code.google.com by [email protected] on 17 Jul 2011 at 12:16


GoogleCodeExporter added auto-migrated Type-Defect Priority-Low labels Dec 1, 2015
