This repository was archived by the owner on May 4, 2021. It is now read-only.
Collecting data for machine translation training from CommonCrawl is a two-phase process illustrated in the following diagram:

## Phase 1: Language annotation, building a meta-data database and monolingual data extraction

The first phase detects the language of each web page in the crawl and collects other meta-data about it. This data is stored in a database that can be queried through a RESTful web API.
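
As a rough illustration of how a client might query such a meta-data API, here is a minimal Python sketch. The base URL, the `domain` and `crawl` parameter names, and the JSON response layout (`pages`, `url`, `language`) are all assumptions made for this example; they are not taken from the actual service.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: the real host, path and parameters may differ.
BASE_URL = "http://localhost:8080/query_domain"


def build_query_url(domain, crawl, base_url=BASE_URL):
    """Return the URL for a meta-data lookup of one domain in one crawl."""
    # 'domain' and 'crawl' are assumed parameter names.
    return base_url + "?" + urlencode({"domain": domain, "crawl": crawl})


def pages_by_language(response_text):
    """Group page URLs by detected language.

    Assumes a JSON body of the form
    {"pages": [{"url": ..., "language": ...}, ...]}.
    """
    payload = json.loads(response_text)
    grouped = {}
    for page in payload.get("pages", []):
        grouped.setdefault(page["language"], []).append(page["url"])
    return grouped
```

The response body could be fetched with any HTTP client, for example `urllib.request.urlopen(build_query_url(...))`, and then passed to `pages_by_language` as text.
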
This phase can also generate monolingual data for language model training. Data for some CommonCrawl crawls and some languages can be found at: