The Python package \texttt{segments} is available both as a command line interface (CLI) and as an application programming interface (API).
\subsection*{Installation}
To install the Python package \texttt{segments} \citep{ForkelMoran2018} from the Python Package Index (PyPI) run:
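\begin{lstlisting}[basicstyle=\myfont]
pip install segments
\end{lstlisting}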
\noindent on the command line. This will give you access to both the CLI and the programmatic functionality in Python scripts when you import the \texttt{segments} library.
You can also install the \texttt{segments} package from the GitHub repository,\footnote{\url{https://github.com/cldf/segments}} in particular if you would like to contribute to the code base:\footnote{\url{https://github.com/cldf/segments/blob/master/CONTRIBUTING.md}}
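\begin{lstlisting}[basicstyle=\myfont]
# one way to do this is to clone the repository and install it in editable mode
git clone https://github.com/cldf/segments.git
cd segments
pip install -e .
\end{lstlisting}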
The \texttt{segments} API can be accessed by importing the package into Python. Here is an example of how to import the library, create a tokenizer object, tokenize a string, and create an orthography profile. Begin by importing the \texttt{Tokenizer} from the \texttt{segments} library.
\begin{lstlisting}[basicstyle=\myfont]
>>> from segments.tokenizer import Tokenizer
\end{lstlisting}
\noindent Next, instantiate a tokenizer object, which takes optional arguments for an orthography profile and an orthography profile rules file.
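\begin{lstlisting}[basicstyle=\myfont]
>>> t = Tokenizer()
\end{lstlisting}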
\noindent The default tokenization strategy is to segment the input text at Unicode Extended Grapheme Cluster boundaries,\footnote{\url{http://www.unicode.org/reports/tr18/tr18-19.html\#Default_Grapheme_Clusters}} and to return a space-delimited string of graphemes. White space between sequences in the input string is marked by default with a hash symbol <\#>, a linguistic convention used to denote word boundaries. The default grapheme tokenization is useful when you encounter a text that you want to tokenize to identify potential orthographic or transcription elements.
\begin{lstlisting}[basicstyle=\myfont]
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)')
67
65
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | \# k͡ p@*)'
\end{lstlisting}

\noindent The hash symbol marking word boundaries can be changed by passing the optional \texttt{separator} parameter.

\begin{lstlisting}[basicstyle=\myfont]
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', separator=' // ')
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | // k͡ p@*)'
\end{lstlisting}
\noindent The optional \texttt{ipa} parameter forces grapheme segmentation for IPA strings.\footnote{\url{https://en.wikipedia.org/wiki/International\_Phonetic\_Alphabet}} Note here that Unicode Spacing Modifier Letters,\footnote{\url{https://en.wikipedia.org/wiki/Spacing\_Modifier\_Letters}} such as <ː> and <\dia{0361}{\large\fontspec{CharisSIL}◌}>, will be segmented together with base characters (although you might need orthography profiles and rules to correct these in your input source; see Section \ref{pitfall-different-notions-of-diacritics} for details).
\begin{lstlisting}[basicstyle=\myfont]
>>> result = t('(*@ĉháɾã̌ctʼɛ↗ʐː| k͡p@*)', ipa=True)
>>> print(result)
>>> '(*@ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐː | \# k͡p@*)'
\end{lstlisting}
\noindent You can also load an orthography profile and tokenize input strings with it. In the data directory,\footnote{\url{https://github.com/unicode-cookbook/recipes/tree/master/Basics/data}} we've placed an example orthography profile. Let's have a look at it using \texttt{more} on the command line.
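\begin{lstlisting}[basicstyle=\myfont]
more data/orthography(*@-@*)profile.tsv
Grapheme  IPA  XSAMPA  COMMENT
...
\end{lstlisting}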
\noindent An orthography profile is a delimited UTF-8 text file (here we use tab as a delimiter for ease of reading). The first column must be labeled \texttt{Grapheme}, as discussed in Section \ref{formal-specification-of-orthography-profiles}. Each row in the \texttt{Grapheme} column specifies a grapheme that may be found in the orthography of the input text. In this example, we provide additional columns \texttt{IPA} and \texttt{XSAMPA}, which map our graphemes to their IPA and X-SAMPA transliterations. The final column \texttt{COMMENT} is for comments; if you want to use a tab in a comment, ``quote that string''!
Let's load the orthography profile with our tokenizer.
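A minimal sketch of loading the profile and tokenizing an unsegmented string with it, assuming the profile defines the graphemes <aa>, <b>, <ch>, <on>, <n>, <-> and <ih> used in the examples below:

\begin{lstlisting}[basicstyle=\myfont]
>>> t = Tokenizer('data/orthography(*@-@*)profile.tsv')
>>> t('aabchonn(*@-@*)ih')
>>> 'aa b ch on n (*@-@*) ih'
\end{lstlisting}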
\noindent This example shows how we can tokenize input text into our orthographic specification. We can also segment graphemes and transliterate them into other forms, which is useful when you have sources with different orthographies, but you want to be able to compare them using a single representation like IPA or X-SAMPA.
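The transliteration target is chosen with the tokenizer's optional \texttt{column} parameter, which selects the profile column used for the output; the resulting strings are the mappings given in that column of the profile (a sketch):

\begin{lstlisting}[basicstyle=\myfont]
>>> t('aabchonn(*@-@*)ih', column='IPA')
>>> t('aabchonn(*@-@*)ih', column='XSAMPA')
\end{lstlisting}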
\noindent It is also useful to know which characters in your input string are not in your orthography profile. By default, missing characters are displayed with the Unicode \textsc{replacement character} at \uni{FFFD}, which appears below as a white question mark within a black diamond.
\begin{lstlisting}[basicstyle=\myfont]
>>> t('(*@aa b ch on n - ih x y z@*)')
>>> '(*@aa b ch on n - ih � � �@*)'
\end{lstlisting}
\noindent You can change the default by specifying a different replacement character when you load the orthography profile with the tokenizer.
\begin{lstlisting}[basicstyle=\myfont]
>>> t = Tokenizer('data/orthography(*@-@*)profile.tsv',
errors_replace=lambda c: '<{0}>'.format(c))
>>> t('aa b ch on n (*@-@*) ih x y z')
>>> 'aa b ch on n (*@-@*) ih <x> <y> <z>'
\end{lstlisting}
\noindent Perhaps you want to create an initial orthography profile that also contains those graphemes <x>, <y>, and <z>? Note that the space character and its frequency are also captured in this initial profile.
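A sketch of how such an initial profile can be created, assuming the \texttt{Profile} class and its \texttt{from\_text} method, which collects every grapheme in a string together with its frequency:

\begin{lstlisting}[basicstyle=\myfont]
>>> from segments import Profile
>>> profile = Profile.from_text('aa b ch on n (*@-@*) ih x y z')
>>> print(profile)
\end{lstlisting}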
have a glyph within this font, then the software application will automatically
search for another font to display the glyph. The result will be that this
specific glyph will look slightly different from the others. This mechanism
works differently depending on the software application; only limited
user influence is usually expected and little feedback is given. This may be rather
frustrating to font-aware users.
% \footnote{For example, Apple Pages does not give any feedback that a font is being replaced, and the user does not seem to have any influence on the choice of replacement (except by manually marking all occurrences). In contrast, Microsoft Word does indicate the font replacement by showing the name in the font menu of the font replacement. However, Word simply changes the font completely, so any text written after the replacement is written in a different font as before. Both behaviors leave much to be desired.}
Another problem with visual display is related to so-called \textsc{font
rendering}. Font rendering refers to the process of the actual positioning of
Unicode characters on a page of written text. This positioning is actually a
highly complex challenge and many things can go wrong in the process. Well-known
rendering difficulties, like proportional glyph size or ligatures, are reasonably
well understood by developers. Nevertheless, the positioning of multiple diacritics relative to
a base character is still a widespread problem. Especially problematic is when
more than one diacritic is supposed to be placed above (or
below) another. Even within the Latin script vertical placement
often leads to unexpected effects in many modern software applications.
The rendering problems arising in Arabic and in many scripts of Southeast
Asia (like Devanagari or Burmese) are even more complex.
To understand why these problems arise, it is important to realize that there are
basically three different approaches to font rendering. The most widespread is
Adobe's and Microsoft's \textsc{OpenType} system. This approach makes it
relatively easy for font developers, but the font itself does not include all
\chapter{Preface}
\label{preface}
This text is meant as a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.
The intersection of the Unicode Standard and the International Phonetic Alphabet is often met with frustration by users. Nevertheless, the two standards have provided language researchers with the computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.
In our research, we use quantitative methods to compare languages to uncover and clarify their phylogenetic relationships. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we have created a suite of open-source Python and R software packages to work with languages using profiles that adequately describe their orthographic conventions. Using these tools in combination with orthography profiles allows users to tokenize and transliterate text from diverse sources, so that they can be meaningfully compared and analyzed.
We welcome comments and corrections regarding this book, our source code, and the supplemental case studies that we provide online.\footnote{\url{https://github.com/unicode-cookbook/}} Please use the issue tracker, email us directly, or make suggestions on PaperHive.\footnote{\url{https://paperhive.org/}}