
Commit 2b2bf63 (1 parent c2063ad)
ready for release, except doi
12 files changed, +150 additions, -144 deletions

book/chapters/implementation.tex

Lines changed: 34 additions & 34 deletions
@@ -25,6 +25,7 @@ \section{Python package: segments}
 
 The Python package \texttt{segments} is available both as a command line interface (CLI) and as an application programming interface (API).
 
+
 \subsection*{Installation}
 
 To install the Python package \texttt{segments} \citep{ForkelMoran2018} from the Python Package Index (PyPI) run:
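The \texttt{pip} listing itself falls outside this hunk; as a reminder for readers following along, a one-line sketch of the standard PyPI installation (editorial, not part of the committed diff) would be:

\begin{lstlisting}[language=bash, basicstyle=\myfont]
$ pip install segments
\end{lstlisting}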
@@ -35,56 +36,57 @@ \subsection*{Installation}
 
 \noindent on the command line. This will give you access to both the CLI and programmatic functionality in Python scripts, when you import the \texttt{segments} library.
 
-You can also install the \texttt{segments} package from the GitHub repository:\footnote{\url{https://github.com/cldf/segments}}
+You can also install the \texttt{segments} package from the GitHub repository,\footnote{\url{https://github.com/cldf/segments}} in particular if you would like to contribute to the code base:\footnote{\url{https://github.com/cldf/segments/blob/master/CONTRIBUTING.md}}
 
 \begin{lstlisting}[language=bash, basicstyle=\myfont]
 $ git clone https://github.com/cldf/segments
 $ cd segments
 $ python setup.py develop
 \end{lstlisting}
 
+
 \subsection*{Application programming interface}
-The \texttt{segments} API can be accessed by importing the package into Python or by writing a Python script. Here is an example of how to import the libraries, create a tokenizer object, tokenize a string, and create an orthography profile.
+The \texttt{segments} API can be accessed by importing the package into Python. Here is an example of how to import the library, create a tokenizer object, tokenize a string, and create an orthography profile. Begin by importing the \texttt{Tokenizer} from the \texttt{segments} library.
 
 \begin{lstlisting}[basicstyle=\myfont]
 >>> from segments.tokenizer import Tokenizer
 \end{lstlisting}
 
-\noindent The \texttt{characters} function will segment a string at Unicode code points.
+\noindent Next, instantiate a tokenizer object, which takes optional arguments for an orthography profile and an orthography profile rules file.
 
-% \lstset{extendedchars=false, escapeinside=**}
-\begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
+\begin{lstlisting}[basicstyle=\myfont]
 >>> t = Tokenizer()
->>> result = t.characters('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
->>> print(result)
->>> '(*@c ̂ h a ́ ɾ a ̃ ̌ c t ʼ ɛ ↗ ʐ ː | # k ͡ p@*)'
 \end{lstlisting}
 
-\noindent The \texttt{grapheme\_clusters} function will segment text at the Unicode Extended Grapheme Cluster boundaries.\footnote{\url{http://www.unicode.org/reports/tr18/tr18-19.html\#Default_Grapheme_Clusters}}
+\noindent The default tokenization strategy is to segment some input text at the Unicode Extended Grapheme Cluster boundaries,\footnote{\url{http://www.unicode.org/reports/tr18/tr18-19.html\#Default_Grapheme_Clusters}} and to return, by default, a space-delimited string of graphemes. White space between input string sequences is by default separated by a hash symbol <\#>, which is a linguistic convention used to denote word boundaries. The default grapheme tokenization is useful when you encounter a text that you want to tokenize to identify potential orthographic or transcription elements.
 
 \begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
->>> result = t.grapheme_clusters('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
+>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
 >>> print(result)
->>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p@*)'
+>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | \# k͡ p@*)'
 \end{lstlisting}
 
-\noindent The \texttt{grapheme\_clusters} function is the default segmentation algorithm for the \texttt{segments.Tokenizer}. It is useful when you encounter a text that you want to tokenize to identify orthographic or transcription elements.
-
 \begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
->>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
+>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', segment_separator='(*@-@*)')
 >>> print(result)
->>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p@*)'
+>>> '(*@ĉ-h-á-ɾ-ã̌-c-t-ʼ-ɛ-↗-ʐ-ː-| \# k͡ -p@*)'
 \end{lstlisting}
 
-\noindent The \texttt{ipa} parameter forces grapheme segmentation for IPA strings.
+\begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont, showstringspaces=false]
+>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', separator=' // ')
+>>> print(result)
+>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | // k͡ p@*)'
+\end{lstlisting}
+
+\noindent The optional \texttt{ipa} parameter forces grapheme segmentation for IPA strings.\footnote{\url{https://en.wikipedia.org/wiki/International\_Phonetic\_Alphabet}} Note here that Unicode Spacing Modifier Letters,\footnote{\url{https://en.wikipedia.org/wiki/Spacing\_Modifier\_Letters}} such as <ː> and <\dia{0361}{\large\fontspec{CharisSIL}◌}>, will be segmented together with base characters (although you might need orthography profiles and rules to correct these in your input source; see Section \ref{pitfall-different-notions-of-diacritics} for details).
 
 \begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
 >>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', ipa=True)
 >>> print(result)
->>> '(*@ĉ h á ɾ ã̌ c ɛ ↗ ʐː | # k͡p@*)'
+>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐː | \# k͡p@*)'
 \end{lstlisting}
 
-\noindent We can also load an orthography profile and tokenize an input string with it. In the data directory, we've placed an example orthography profile. Let's have a look at it using \texttt{more} on the command line.
+\noindent You can also load an orthography profile and tokenize input strings with it. In the data directory,\footnote{\url{https://github.com/unicode-cookbook/recipes/tree/master/Basics/data}} we've placed an example orthography profile. Let's have a look at it using \texttt{more} on the command line.
 
 
 \begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont, showstringspaces=false]
@@ -103,7 +105,7 @@ \subsection*{Application programming interface}
 \end{lstlisting}
 
 
-\noindent An orthography profile is a delimited UTF-8 text file (here we use tab as a delimiter for reading ease). The first column must be labelled \texttt{Grapheme}. Each row in the \texttt{Grapheme} column specifies graphemes that may be found in the orthography of the input text. In this example, we provide additional columns IPA and XSAMPA, which are mappings from our graphemes to their IPA and XSAMPA transliterations. The final column \texttt{COMMENT} is for comments; if you want to use a tab ``quote that string''!
+\noindent An orthography profile is a delimited UTF-8 text file (here we use tab as a delimiter for reading ease). The first column must be labeled \texttt{Grapheme}, as discussed in Section \ref{formal-specification-of-orthography-profiles}. Each row in the \texttt{Grapheme} column specifies graphemes that may be found in the orthography of the input text. In this example, we provide additional columns \texttt{IPA} and \texttt{XSAMPA}, which are mappings from our graphemes to their IPA and X-SAMPA transliterations. The final column \texttt{COMMENT} is for comments; if you want to use a tab, ``quote that string''!
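The \texttt{more} listing itself falls between the hunks of this diff. Purely as orientation, a hypothetical tab-delimited fragment consistent with the tokenization and transliteration outputs shown below might look like this (columns aligned here for readability; the COMMENT entries are invented, and the real file lives in the recipes' data directory):

\begin{lstlisting}[basicstyle=\myfont, showstringspaces=false]
Grapheme  IPA  XSAMPA  COMMENT
aa        aː   a:      long vowel
ch        tʃ   tS
on        õ    o~      nasalized vowel
ih        í    i_H     high tone
\end{lstlisting}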
 
 Let's load the orthography profile with our tokenizer.
 
@@ -120,10 +122,10 @@ \subsection*{Application programming interface}
 >>> '(*@aa b ch on n - ih@*)'
 \end{lstlisting}
 
-\noindent This example shows how we can tokenize input text into our orthographic specification. We can also segment graphemes and transliterate them into other formats, which is useful when you have sources with different orthographies, but you want to be able to compare them using a single representation like IPA.
+\noindent This example shows how we can tokenize input text into our orthographic specification. We can also segment graphemes and transliterate them into other forms, which is useful when you have sources with different orthographies, but you want to be able to compare them using a single representation like IPA or X-SAMPA.
 
 \begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont]
->>> t.transform('(*@aabchonn-ih@*)', 'IPA')
+>>> t('(*@aabchonn-ih@*)', column='IPA')
 >>> '(*@aː b tʃ õ n í@*)'
 \end{lstlisting}
 
@@ -133,41 +135,39 @@ \subsection*{Application programming interface}
 % >>> '(*@a: b tS o~ n i_H@*)'
 % \end{lstlisting}
 
-\begin{lstlisting}[basicstyle=\myfont, showstringspaces=false]
->>> t.transform('aabchonn-ih', 'XSAMPA')
+\begin{lstlisting}[basicstyle=\myfont, showstringspaces=false, escapeinside={(*@}{@*)}]
+>>> t('aabchonn(*@-@*)ih', column='XSAMPA')
 >>> 'a: b tS o~ n i_H'
 \end{lstlisting}
 
 
-\noindent It is also useful to know which characters in your input string are not in your orthography profile. Use the function \texttt{find\_missing\_characters}.
+\noindent It is also useful to know which characters in your input string are not in your orthography profile. By default, missing characters are displayed with the Unicode \textsc{replacement character} at \uni{FFFD}, which appears below as a white question mark within a black diamond.
 
 \begin{lstlisting}[extendedchars=false, escapeinside={(*@}{@*)}, basicstyle=\myfont, language=bash]
->>> t.find_missing_characters('(*@aa b ch on n - ih x y z@*)')
+>>> t('(*@aa b ch on n - ih x y z@*)')
 >>> '(*@aa b ch on n - ih � � �@*)'
 \end{lstlisting}
 
-
-\noindent We set the default as the Unicode \texttt{replacement character} \uni{fffd}.\footnote{\url{http://www.fileformat.info/info/unicode/char/fffd/index.htm}} But you can simply change this by specifying the replacement character when you load the orthography profile with the tokenizer.
-
+\noindent You can change the default by specifying a different replacement character when you load the orthography profile with the tokenizer.
 
 \begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}, showstringspaces=false]
 >>> t = Tokenizer('data/orthography(*@-@*)profile.tsv',
 errors_replace=lambda c: '?')
->>> t.find_missing_characters("aa b ch on n (*@-@*) ih x y z")
+>>> t('aa b ch on n (*@-@*) ih x y z')
 >>> 'aa b ch on n (*@-@*) ih ? ? ?'
 \end{lstlisting}
 
 \begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}, showstringspaces=false]
 >>> t = Tokenizer('data/orthography(*@-@*)profile.tsv',
 errors_replace=lambda c: '<{0}>'.format(c))
->>> t.find_missing_characters("aa b ch on n (*@-@*) ih x y z")
+>>> t('aa b ch on n (*@-@*) ih x y z')
 >>> 'aa b ch on n (*@-@*) ih <x> <y> <z>'
 \end{lstlisting}
 
-\noindent Perhaps you want to create an initial orthography profile that also contains those graphemes x, y, z? Note that the space character and its frequency are also captured in this initial profile.
+\noindent Perhaps you want to create an initial orthography profile that also contains those graphemes <x>, <y>, and <z>? Note that the space character and its frequency are also captured in this initial profile.
 
 \begin{lstlisting}[basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}, showstringspaces=false]
->>> profile = Profile.from_text("aa b ch on n (*@-@*) ih x y z")
+>>> profile = Profile.from_text('aa b ch on n (*@-@*) ih x y z')
 >>> print(profile)
 \end{lstlisting}
 
@@ -279,11 +279,11 @@ \subsection*{Command line interface}
 '(*@ʃ ɛː ç t e l ç e n@*)'
 \end{lstlisting}
 
-\noindent And we can transliterate to XSAMPA.
+\noindent And we can transliterate to X-SAMPA.
 
 \begin{lstlisting}[language=bash, basicstyle=\myfont, extendedchars=false, escapeinside={(*@}{@*)}]
 $ cat sources/german.txt | segments (*@--@*)mapping=XSAMPA
-(*@--@*)profile=data/german(*@-orthography(*@-@*)profile.tsv tokenize
+(*@--@*)profile=data/german(*@-@*)orthography(*@-@*)profile.tsv tokenize
 
 'S E: C t e l C e n'
 \end{lstlisting}
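For readers skimming this diff, the API calls documented in the hunks above can be chained end to end. The following is a minimal editorial sketch, not part of the committed LaTeX; it only uses calls shown above and assumes the example profile is available locally as data/orthography-profile.tsv.

\begin{lstlisting}[basicstyle=\myfont]
# Minimal end-to-end sketch; outputs in the comments are those shown in the chapter.
from segments.tokenizer import Tokenizer

t = Tokenizer()                                   # default: grapheme clusters, '#' marks word boundaries
print(t('ĉháɾã̌ctʼɛ↗ʐː| k͡p'))                     # 'ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p'
print(t('ĉháɾã̌ctʼɛ↗ʐː| k͡p', ipa=True))           # 'ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐː | # k͡p'

# Profile-driven tokenization and transliteration.
t = Tokenizer('data/orthography-profile.tsv',
              errors_replace=lambda c: '<{0}>'.format(c))
print(t('aabchonn-ih'))                           # 'aa b ch on n - ih'
print(t('aabchonn-ih', column='IPA'))             # 'aː b tʃ õ n í'
print(t('aabchonn-ih', column='XSAMPA'))          # 'a: b tS o~ n i_H'
\end{lstlisting}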

book/chapters/pitfalls.tex

Lines changed: 9 additions & 9 deletions
@@ -188,7 +188,7 @@ \section{Pitfall: Missing glyphs}
 Font}. This font does not show a real
 glyph, but instead shows the hexadecimal code inside a box
 for each character, so a user can at least see the Unicode code point of the
-character to be displayed.\footnote{\url{http://scripts.sil.org/UnicodeBMPFallbackFont}}
+character intended for display.\footnote{\url{http://scripts.sil.org/UnicodeBMPFallbackFont}}
 
 % ==========================
 \section{Pitfall: Faulty rendering}
@@ -200,7 +200,7 @@ \section{Pitfall: Faulty rendering}
 reasons for unexpected visual display, namely automatic font substitution and
 faulty rendering. Like missing glyphs, any such problems are independent from
 the Unicode Standard. The Unicode Standard only includes very general
-information about characters and leaves the specific visual display to others to
+information about characters and leaves the specific visual display for others to
 decide on. Any faulty display is thus not to be blamed on the Unicode
 Consortium, but on a complex interplay of different mechanisms happening in a
 computer to turn Unicode code points into visual symbols. We will only sketch a
@@ -212,26 +212,26 @@ \section{Pitfall: Faulty rendering}
 have a glyph within this font, then the software application will automatically
 search for another font to display the glyph. The result will be that this
 specific glyph will look slightly different from the others. This mechanism
-works differently depending on the software application, only limited
-user influence is usually expected and little feedback is given, which might be rather
+works differently depending on the software application; only limited
+user influence is usually expected and little feedback is given. This may be rather
 frustrating to font-aware users.
 
 % \footnote{For example, Apple Pages does not give any feedback that a font is being replaced, and the user does not seem to have any influence on the choice of replacement (except by manually marking all occurrences). In contrast, Microsoft Word does indicate the font replacement by showing the name in the font menu of the font replacement. However, Word simply changes the font completely, so any text written after the replacement is written in a different font as before. Both behaviors leave much to be desired.}
 
 Another problem with visual display is related to so-called \textsc{font
 rendering}. Font rendering refers to the process of the actual positioning of
 Unicode characters on a page of written text. This positioning is actually a
-highly complex challenge, and many things can go wrong in the process. Well-known
+highly complex challenge and many things can go wrong in the process. Well-known
 rendering difficulties, like proportional glyph size or ligatures, are reasonably
-well understood by developers. Nevertheless , the positioning of multiple diacritics relative to
+well understood by developers. Nevertheless, the positioning of multiple diacritics relative to
 a base character is still a widespread problem. Especially problematic is when
 more than one diacritic is supposed to be placed above (or
 below) another. Even within the Latin script vertical placement
 often leads to unexpected effects in many modern software applications.
 The rendering problems arising in Arabic and in many scripts of Southeast
 Asia (like Devanagari or Burmese) are even more complex.
 
-To understand why any problems arise it is important to realize that there are
+To understand why these problems arise it is important to realize that there are
 basically three different approaches to font rendering. The most widespread is
 Adobe's and Microsoft's \textsc{OpenType} system. This approach makes it
 relatively easy for font developers, but the font itself does not include all
@@ -446,11 +446,11 @@ \section{Pitfall: Canonical equivalence}
 In other words, there are equivalent sequences of Unicode characters that should
 be normalized, i.e.~transformed into a unique Unicode-sanctioned representation
 of a character sequence called a \textsc{normalization form}. Unicode provides a
-Unicode Normalization Algorithm, which essentially puts combining marks
+Unicode Normalization Algorithm, which puts combining marks
 into a specific logical order and it defines decomposition and composition
 transformation rules to convert each string into one of four normalization
 forms. We will discuss here the two most relevant normalization forms: NFC and
-NFD.\@
+NFD.
 
 The first of the three characters above is considered the \textsc{Normalization
 Form C (NFC)}, where \textsc{C} stands for composition. When the process of NFC
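Since the hunk above tightens the prose on the Unicode Normalization Algorithm, a small standard-library illustration may help connect NFC and NFD to code. This sketch is editorial and not part of the committed chapter text.

\begin{lstlisting}[basicstyle=\myfont]
# Canonical equivalence: same abstract character, different code point sequences.
import unicodedata

nfc = '\u00F1'                                    # 'ñ' precomposed (NFC)
nfd = unicodedata.normalize('NFD', nfc)           # 'n' + U+0303 COMBINING TILDE

print(len(nfc), len(nfd))                         # 1 2
print(nfc == nfd)                                 # False: raw strings differ
print(unicodedata.normalize('NFC', nfd) == nfc)   # True after normalization
\end{lstlisting}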

book/chapters/preface.tex

Lines changed: 3 additions & 3 deletions
@@ -1,11 +1,11 @@
 \chapter{Preface}
 \label{preface}
 
-This text is meant as a practical guide for linguists and programmers, who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.
+This text is meant as a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.
 
-The intersection of the Unicode Standard and the International Phonetic Alphabet is often met with frustration by users. Nevertheless, the two standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.
+The intersection of the Unicode Standard and the International Phonetic Alphabet is often met with frustration by users. Nevertheless, the two standards have provided language researchers with the computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.
 
-We use quantitative methods in our research to compare languages to uncover and clarify their phylogenetic relationships. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R software packages to work with languages using profiles that adequately describe their orthographic conventions. Using orthography profiles and these tools allows users to tokenize and transliterate text from diverse sources, so that they can be meaningfully compared and analyzed.
+In our research, we use quantitative methods to compare languages to uncover and clarify their phylogenetic relationships. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we have created a suite of open-source Python and R software packages to work with languages using profiles that adequately describe their orthographic conventions. Using these tools in combination with orthography profiles allows users to tokenize and transliterate text from diverse sources, so that they can be meaningfully compared and analyzed.
 
 We welcome comments and corrections regarding this book, our source code, and the supplemental case studies that we provide online.\footnote{\url{https://github.com/unicode-cookbook/}} Please use the issue tracker, email us directly, or make suggestions on PaperHive.\footnote{\url{https://paperhive.org/}}
 