Phil Garner edited this page Jul 30, 2013 · 2 revisions

Structure

  • media is the original CDs/DVDs.

    • Rationale: When he buys the data, the system manager doesn't know how to organise it. He just copies the raw media onto hard disk.
  • dbase is a flattened structure of links pointing into the media.

    • Rationale: The individual DVDs are difficult to access - you have to know which file is on which disc. The flat structure is much easier to deal with; it mirrors the distributor's original layout.
  • audio is just the audio data in wav/riff format.

    • Rationale: The original distributions use all sorts of formats; some are easy to read, some aren't. Everyone usually ends up copying this data before they use it, which is a big waste of space. Everything reads wav, so we just need one copy.

dbase and audio are generated automatically.

Case

Normally, the filenames in the database will match those on the media, and this is the right answer. Sometimes, however, the media is an ISO-9660 CD-ROM without Rock Ridge extensions. In this case there is no telling whether a given machine will mount it with uppercase or lowercase filenames. Typically, the file lists and the like that come with the database will assume one or the other; this is not obvious to the system managers.

If a database turns out to have been written in the wrong case, it is best to convert the media. There is an ISS tool to do this:

% $ISSROOT/bin/convert-case.sh   
convert-case.sh [-u -l] <source-directory>
-u  Convert to upper case
-l  Convert to lower case

The right answer is just to convert the whole media directory.
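
To make the idea concrete, here is a minimal sketch of a lower-case conversion as a dry run. It is not the ISS script; the function name lowercase_plan is made up for illustration, and it only prints the renames it would perform.

```shell
# Hypothetical helper, not convert-case.sh: print the renames needed to
# make every name below a directory lower case, without performing them.
lowercase_plan() {
    # -depth lists a directory's contents before the directory itself,
    # so children are renamed while their parent path is still valid.
    find "$1" -depth | while IFS= read -r path; do
        [ "$path" = "$1" ] && continue      # leave the top directory alone
        dir=$(dirname "$path")
        base=$(basename "$path")
        lower=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
        if [ "$base" != "$lower" ]; then
            printf 'mv %s %s\n' "$path" "$dir/$lower"
        fi
    done
}
```

Running lowercase_plan media lists one mv command per name that would change, which can be inspected before anything is touched.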

Flat structure

To flatten the database structure, run the ISS script:

% $ISSROOT/bin/flatten.sh -h
flatten.sh [-h -r -t <target-dir>] <source-dir>
-h  Print this help
-t  [.] Where to write the files.
-r  Really do it, rather than just check it looks right.

where a typical invocation is

% $ISSROOT/bin/flatten.sh -t dbase $(pwd)/media

The $(pwd) is there to force a full path for the resulting links.
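
The reason a full path matters is that a symbolic link with a relative target is resolved relative to the directory containing the link, not the directory the command was run from. The following throwaway demonstration shows the difference; the directory and file names are invented for the example.

```shell
# Demonstration in a temporary directory; all names here are made up.
demo=$(mktemp -d)
(
    cd "$demo"
    mkdir -p media/disc1 dbase
    echo data > media/disc1/file.txt

    # Relative target: resolved relative to dbase/, so this link dangles.
    ln -s media/disc1/file.txt dbase/relative.txt

    # Absolute target: resolves regardless of where the link lives.
    ln -s "$(pwd)/media/disc1/file.txt" dbase/absolute.txt

    cat dbase/absolute.txt                  # prints "data"
    [ -e dbase/relative.txt ] || echo "relative link is dangling"
)
```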

Audio conversion

Often, the audio data in the database is in some common but inconvenient format, e.g., SPHERE shorten. To save lots of duplicate conversions, we convert to RIFF (wav) by default. There is a command:

% $ISSROOT/bin/convert-audio.sh -h
convert-audio.sh [-r -t <target-directory>] <source-directory>
-h  Print this help.
-t  [.] Where to write the files.
-e  [sph] Source file extension.
-c  [sph] Source encoding.
-p  [.]   Pattern to match in source file / path.
-o  [see output] Options to pass to sox to read raw files.
-r  Really do it, rather than just check it looks right.

and a typical invocation is

% $ISSROOT/bin/convert-audio.sh -t audio dbase
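
As a rough illustration of the kind of mapping involved, the sketch below prints one sox command per source file, keeping the directory layout and changing the extension to wav. It is not the ISS script: the function name audio_plan is hypothetical, and it assumes sox can read the source files directly (shorten-compressed SPHERE may need an extra decoding step first).

```shell
# Hypothetical dry run, not convert-audio.sh: map each source file to a
# .wav path under the target directory and print the sox command.
audio_plan() {
    src=$1; dst=$2; ext=${3:-sph}
    # -L follows symbolic links, since dbase is made of links into media.
    find -L "$src" -name "*.$ext" | sort | while IFS= read -r in; do
        rel=${in#"$src"/}               # path relative to the source dir
        out=$dst/${rel%."$ext"}.wav     # same layout, .wav extension
        printf 'sox %s %s\n' "$in" "$out"
    done
}
```

For example, audio_plan dbase audio prints the commands without running them, in the same spirit as leaving out -r above. The source directory should be given without a trailing slash.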

Labels

We need labels in the form of a word MLF like this:

#!MLF!#
"*/01fc020t.lab"
A
BIG
MISTAKE
IN
BUYING
SAY
A
NEW
COMPUTER
SYSTEM
COULD
BRING
DOWN
THE
WHOLE
COMPANY
.

Note that there is no need for the word boundary tokens <s> and </s>. These are added during training when necessary.
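
If the transcriptions are available as one utterance per line, a word MLF of this shape can be produced with a few lines of awk. This is only a sketch: the function name trans_to_mlf and the input layout (utterance ID followed by the words) are assumptions, and the words are assumed to be already normalised.

```shell
# Hypothetical converter: each input line is "<id> <word> <word> ...".
trans_to_mlf() {
    awk 'BEGIN { print "#!MLF!#" }
         NF > 0 {
             printf "\"*/%s.lab\"\n", $1          # label file pattern
             for (i = 2; i <= NF; i++) print $i   # one word per line
             print "."                            # end of this label
         }' "$@"
}
```

It reads the files named as arguments, or standard input if none are given, e.g. trans_to_mlf transcripts.txt > words.mlf (both file names are placeholders).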

Dictionaries have to be prepared before running the main training, but there are scripts to help. There are two dictionaries:

  1. $flatDict, which looks like this:
<s>  [] sil
</s> [] sil
!EXCLAMATION-POINT      EH K S K L AH M EY SH AH N P OY N T

That is, no silence after words, but including the sentence start and end tokens.

  2. $mainDict, which is like this:
<s>  [] sil
</s> [] sil
!EXCLAMATION-POINT      EH K S K L AH M EY SH AH N P OY N T sp
!EXCLAMATION-POINT      EH K S K L AH M EY SH AH N P OY N T sil

That is, now each word is repeated once for each silence model. It's important that the one with sp is first because that one will be used when creating the initial model when sp is introduced. Later in the process, both pronunciations will be used by HVite when aligning the data.
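
The relationship between the two dictionaries can be sketched as follows. This is not one of the ISS helper scripts; flat_to_main is a made-up name, and the sketch assumes the flat dictionary has exactly the layout shown above.

```shell
# Hypothetical helper: derive the main dictionary from the flat one.
flat_to_main() {
    awk '$1 == "<s>" || $1 == "</s>" { print; next }   # keep as they are
         NF > 0 {
             print $0 " sp"     # sp variant first, as required
             print $0 " sil"    # then the sil variant
         }' "$@"
}
```

Emitting the sp variant before the sil variant keeps the ordering described above.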

Labels

Normally, HLEd is run with -l '*', which allows label files to be independent of the directory. However, when IDs are used for labels with the HTK extended file format, the wildcard paths are not needed. Whether IDs are used is controlled by the number of columns in the list file: IDs are used if there is more than one column.
