Using text2ngram with huge corpus files #24

@maidis

I created a 2.7 GB corpus file for Turkish, but text2ngram does not seem able to handle a file this large. Could the program be optimized to work with large files?

On my system [1], the second iteration never finishes:

for i in 1 2 3; do text2ngram -n $i -l -f sqlite -o database_aa.db mytext.filtered; done

By the way, thanks for the open source alternative to XT9 and for the good documentation on how to use it :) I have already started testing it with a small corpus [2].

[1] 5950HQ + 16 GB RAM
[2] https://pbs.twimg.com/media/DY_ftChXUAAQP3t.jpg:large
