Commit 3511ff9

cysouw authored and bambooforest committed
Post release updates (#44)
* update gitignore
* minor edits in introduction
* small caps in introduction
* minor edits to chapter 2 unicode approach
* corrections chapter 3 unicode pitfalls
* corrections chapter 4 IPA
* layout of tables in chapter 5
* typo in Chapter 5
* update bibliography
* more latex gitignore
* addition to Chapter 7 orthography profiles
* corrections to chapter 8 implementation
* Updated tex and pdf files
* requested changes
* fix alignment on p.107
* change bash explanation
* edits in last paragraph; copy up the cookbook pdf
1 parent bbb0bbe commit 3511ff9

18 files changed: +141 -91 lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -3,10 +3,12 @@
33
unicode.texnicle
44

55
*.adx
6+
*.bcf
67
*.toc
78
*.aux
89
*.glo
910
*.idx
11+
*.ldx
1012
*.log
1113
*.toc
1214
*.ist
@@ -25,6 +27,8 @@ unicode.texnicle
2527
*.maf
2628
*.mtc
2729
*.mtc1
30+
*.mw
31+
*.sdx
2832
*.out
2933
*.synctex.gz
3034
*.fdb_latexmk

book/chapters/implementation.tex

Lines changed: 29 additions & 15 deletions
@@ -10,11 +10,12 @@ \section{Overview}
1010

1111
\section{How to install Python and R}
1212
\label{installing-python-and-r}
13+
1314
When you encounter problems installing software, or bugs in programming code, search engines are your friend! Installation problems and incomprehensible error messages have typically been encountered and solved by other users. Try simply copying and pasting the output of an error message into a search engine; the solution is often already somewhere online. We are fans of Stack Exchange\footnote{\url{https://stackexchange.com/}} -- a network of question-and-answer websites -- which is extremely helpful in solving issues regarding software installation, bugs in code, etc.
1415

15-
Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Unix operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks).
16+
Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Linux operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks).
1617

17-
Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Unix) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Unix.
18+
Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Linux) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Linux.
1819

1920
Once you have R or Python (or both) installed on your computer, you are ready to use the orthography profiles software libraries presented in the next two sections. As noted above, we make this material available online on GitHub,\footnote{\url{https://github.com/}} a web-based version control system for source code management. GitHub repositories can be cloned or downloaded,\footnote{\url{https://help.github.com/articles/cloning-a-repository/}} so that you can work through the examples on your local machine. Use your favorite search engine to figure out how to install Git on your computer and learn more about using Git.\footnote{\url{https://git-scm.com/}} In our GitHub repository, we make the material presented below (and more use cases described briefly in Section \ref{use-cases}) available as Jupyter Notebooks. Jupyter Notebooks provide an interface where you can run and develop source code in the browser. These notebooks are easily viewed in our GitHub repository of use cases.\footnote{\url{https://github.com/unicode-cookbook/recipes}}
2021

@@ -432,17 +433,24 @@ \subsection*{Installation}
432433
by using Rscript to get the paths to the executables within the terminal.
433434

434435
<<eval=FALSE, tidy=FALSE, engine='bash'>>=
435-
# get the paths to the R executables in bash
436-
pathT=`Rscript -e 'cat(file.path(find.package("qlcData"),
437-
"exec", "tokenize"))'`
438-
pathW=`Rscript -e 'cat(file.path(find.package("qlcData"),
439-
"exec", "writeprofile"))'`
440-
441-
# make softlinks to the R executables in /usr/local/bin
442-
# you will have to enter your user's password!
436+
pathT=`Rscript -e 'cat(system.file("exec/tokenize", package="qlcData"))'`
437+
pathW=`Rscript -e 'cat(system.file("exec/writeprofile", package="qlcData"))'`
438+
@
439+
440+
Then you can make softlinks to the R executables in \texttt{/usr/local/bin} by using the following command in the terminal:
441+
442+
<<eval=FALSE, tidy=FALSE, engine='bash'>>=
443443
sudo ln -is $pathT $pathW /usr/local/bin
444444
@
445445

446+
You can also do this within R by using the following commands, again possibly replacing \texttt{/usr/local/bin} with a suitable location on your system:
447+
448+
<<eval=FALSE>>=
449+
# link executables from within R
450+
file.symlink(system.file("exec/tokenize", package="qlcData"), "/usr/local/bin")
451+
file.symlink(system.file("exec/writeprofile", package="qlcData"), "/usr/local/bin")
452+
@
453+
446454
After inserting this softlink it should be possible to access the
447455
\texttt{tokenize} function from the shell. Try \texttt{tokenize --help} to test
448456
the functionality.
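
Whether the softlinks are actually visible on the search path can also be checked programmatically. The following is a minimal sketch in Python, assuming only the standard library; the helper name \texttt{locate\_commands} is ours and not part of \texttt{qlcData}.

```python
import shutil


def locate_commands(names):
    """Map each command name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The two executables linked above; adjust the names if you linked others.
    for name, path in locate_commands(["tokenize", "writeprofile"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a command is reported as not found, the directory holding the softlink is probably not on your \texttt{PATH}.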
@@ -458,10 +466,10 @@ \subsection*{Installation}
458466
% online at \url{TODO}. The webapps are also included inside the \texttt{qlcData}
459467
% package and can be started with the following helper function:
460468

461-
To make the functionality even more accessible, we have prepared webapps with
462-
the \texttt{Shiny} framework for the R functions. The webapps are
463-
included inside the \texttt{qlcData} package and can be started with the
464-
helper function (in R): \texttt{launch\_shiny('tokenize')}.
469+
% To make the functionality even more accessible, we have prepared webapps with
470+
% the \texttt{Shiny} framework for the R functions. The webapps are
471+
% included inside the \texttt{qlcData} package and can be started with the
472+
% helper function (in R): \texttt{launch\_shiny('tokenize')}.
465473

466474
% <<eval=FALSE>>=
467475
% launch_shiny('tokenize')
@@ -1204,5 +1212,11 @@ \section{Recipes online}
12041212
12051213
\noindent The ASJP use case shows how to download the full set of ASJP wordlists, to combine them into a single large CSV file, and to tokenize the ASJP orthography. The Dutch use case takes as input the 10K corpus for Dutch (``nld'') from the Leipzig Corpora Collection,\footnote{\url{http://wortschatz.uni-leipzig.de/en/download/}} which is then cleaned and tokenized with an orthography profile that captures the intricacies of Dutch orthography.
12061214
1207-
% In closing, using GitHub to share code and data provides a platform for sharing scientific results and it also promotes a means for scientific replicability of results. Moreover, we find that in cases where the scientists are building the to tools for analysis, open repositories and data help to ensure that what you see is what you get.
1215+
\section{Closing words}
1216+
1217+
In closing, we hope that these rather elaborate musings on writing systems, Unicode, and the IPA will help readers appreciate the progress that has been made over the last decades, while acknowledging the many pitfalls that are still lurking below the surface. But mainly we hope that our proposals point towards a way forward in sharing scientific data, interpretations, and analyses in a more transparent manner.
1218+
1219+
GitHub (or any other similar service) provides a platform for sharing scientific research and promotes the replicability of scientific results. Not only do we use GitHub for the software packages that accompany this book, but we actually used it to write this book. GitHub allowed us to openly and collaboratively work on the book and then to participate interactively in an open review process by using an issue tracker and version control.\footnote{\url{https://userblogs.fu-berlin.de/langsci-press/2018/07/11/what-it-means-to-be-open-and-community-based-the-unicode-cookbook-as-a-showcase/}} We will continue to use our repositories to make corrections and updates to the book and to the associated orthography profile software packages, when necessary.\footnote{\url{https://github.com/unicode-cookbook/}}
1220+
1221+
Moreover, we find that in situations where scientists are building tools for analysis, open repositories and open data help to ensure that what you see is what you get.
12081222

book/chapters/introduction.tex

Lines changed: 10 additions & 10 deletions
@@ -239,14 +239,14 @@ \subsubsection*{Binary encoding}
239239
To also allow for different uppercase and lowercase letters and for a large
240240
variety of control characters to be used in the newly developing technology of
241241
computers, the American Standards Association decided to propose a new 7-bit
242-
encoding in 1963 (with $2^7 = 128$ different possible characters), known as the
242+
encoding in 1963 (with 2\textsuperscript{7} = 128 different possible characters),
243+
known as the
243244
\textsc{American Standard Code for Information Interchange} (ASCII), geared
244245
towards the encoding of English orthography. With the ascent of other
245246
orthographies in computer usage, the wish to encode further variations of Latin
246247
letters (including German <ß> and various letters with diacritics, e.g.\ <è>) led the
247248
Digital Equipment Corporation to introduce an 8-bit \textsc{Multinational
248-
Character Set} (MCS, with $2^8 = 256$ different possible characters), first used
249-
with the introduction of the VT{\large 220} Terminal in 1983.
249+
Character Set} (MCS, with 2\textsuperscript{8} = 256 different possible characters), first used with the introduction of the VT{\large 220} Terminal in 1983.
250250

251251
Because 256 characters were clearly not enough for the unique representation of
252252
many different characters
@@ -269,17 +269,17 @@ \subsubsection*{Binary encoding}
269269
In the 1980s various people started to develop true
270270
international code sets. In the United States, a group of computer scientists
271271
formed the \textsc{unicode consortium}, proposing a 16-bit encoding in 1991
272-
(with $2^{16} = 65,536$ different possible characters). At the same time in
272+
(with 2\textsuperscript{16} = 65,536 different possible characters). At the same time in
273273
Europe, the \textsc{international organization for standardization} (ISO) was
274274
working on ISO~10646 to replace the ISO/IEC~8859 standard. Their first draft of
275275
the \textsc{universal character set} (UCS) in 1990 was 31-bit (with
276-
theoretically $2^{31} = 2,147,483,648$ possible characters, but because of some
276+
theoretically 2\textsuperscript{31} = 2,147,483,648 possible characters, but because of some
277277
technical restrictions only 679,477,248 were allowed). Since 1991, the Unicode
278278
Consortium and the ISO have jointly developed the \textsc{unicode standard}, or
279279
ISO/IEC~10646, leading to the current system including the original 16-bit
280280
Unicode proposal as the \textsc{basic multilingual plane}, and 16 additional
281-
planes of 16-bit for further extensions (with in total $(1+16) \cdot 2^{16} =
282-
1,114,112$ possible characters). The most recent version of the Unicode Standard
281+
planes of 16-bit for further extensions (with in total (1+16) $\times$ 2\textsuperscript{16} =
282+
1,114,112 possible characters). The most recent version of the Unicode Standard
283283
(currently at version number 11.0.0) was published in June 2018 and it defines
284284
137,374 different characters \citep{Unicode2018}.
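
The code-space sizes quoted in this passage follow directly from the bit widths. The sketch below recomputes them in Python, assuming Python 3; the helper name \texttt{code\_space} is ours and purely illustrative.

```python
import sys


def code_space(bits, planes=1):
    """Number of distinct values in `planes` planes of 2**bits values each."""
    return planes * 2 ** bits


sizes = {
    "ASCII (7-bit)": code_space(7),
    "MCS (8-bit)": code_space(8),
    "Unicode 1991 proposal (16-bit)": code_space(16),
    "UCS 1990 draft (31-bit)": code_space(31),
    "Unicode today (1 + 16 planes of 16-bit)": code_space(16, planes=1 + 16),
}

for name, size in sizes.items():
    print(f"{name}: {size:,}")

# Python 3 exposes the highest code point it supports; adding one for
# code point 0 gives the size of the full 17-plane code space.
print(f"sys.maxunicode + 1: {sys.maxunicode + 1:,}")
```

Note that these numbers count code points, not assigned characters: the 137,374 characters defined in version 11.0.0 occupy only a fraction of the available space.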
285285

@@ -382,10 +382,10 @@ \subsubsection*{Script systems}
382382

383383
Breaking it down further, a script consists of \textsc{graphemes}, which are writing
384384
system-specific minimally distinctive symbols (see below). Graphemes may consist of one or more
385-
\textsc{characters}. The term \textsc{character} is overladen. In the linguistic terminology of writing
386-
systems, a \textsc{character} is a general term for any self-contained element
385+
\textsc{characters}. The term `character' is overladen. In the linguistic terminology of writing
386+
systems, a character is a general term for any self-contained element
387387
in a writing system. A second interpretation is used as a conventional term for a unit in the Chinese writing
388-
system \citep{Daniels1996}. In technical terminology, a \textsc{character}
388+
system \citep{Daniels1996}. In technical terminology, a character
389389
refers to the electronic encoding of a component in a writing system that has semantic
390390
value (see Section \ref{character-encoding-system}). Thus in this work we must navigate
391391
between the general linguistic and technical terms for \textsc{character}

book/chapters/ipa_background.tex

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ \chapter{The International Phonetic Alphabet}
1414
diacritics (Section~\ref{EncodingIPA}). Occurring a little over a hundred years after
1515
the inception of the IPA, its encoding was a major challenge
1616
(Section~\ref{need-for-multilingual-environment}); many
17-
linguists have encountered the pitfalls when the two are used together
17+
linguists have encountered pitfalls when the two are used together
1818
(Chapter~\ref{ipa-meets-unicode}).
1919

2020
% ==========================
@@ -48,7 +48,7 @@ \section{Brief history}
4848
\url{https://en.wikipedia.org/wiki/History\_of\_the\_International\_Phonetic_Alphabet}.}
4949

5050
Over the years there have been several revisions, but mostly minor ones. Articulation
51-
labels -- what are often called \textit{features} even though the IPA
51+
labels -- what are often called \textit{features}, even though the IPA
5252
deliberately avoids this term -- have changed, e.g.\ terms like \textit{lips}, \textit{throat}
5353
or \textit{rolled} are no longer used. Phonetic symbol values have changed, e.g.\
5454
voiceless is no longer marked by <h>. Symbols have been dropped, e.g.\ the
@@ -292,7 +292,7 @@ \section{IPA encodings}
292292
devised a system of base characters with secondary diacritic marks
293293
(e.g.\ in the previous example <kp>, the base character, is modified with <W>).
294294
This encoding approach is
295-
also used in SAMPA and X-SAMPA (Section~\ref{sampa-xsampa}) and in the
295+
also used in SAMPA and X-SAMPA (see below) and in the
296296
ASJP.\footnote{See the ASJP use case in the online supplementary
297297
materials to this book: \url{https://github.com/unicode-cookbook/recipes}.}
298298
But before UPSID, SAMPA and ASJP, IPA was encoded with numbers.

book/chapters/ipa_meets_unicode.tex

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ \section{The twain shall meet}
2525
sometimes look like the Unicode Consortium is making incomprehensible decisions,
2626
but it is important to realize that the consortium has tried and is continuing
2727
to try to be as consistent as possible across a wide range of use cases, and it
28-
does place linguistic traditions above other orthographic choices. Furthermore,
28+
does not place linguistic traditions above other orthographic choices. Furthermore,
2929
when we look at the history of how the IPA met Unicode, we see that many of the
3030
decisions for IPA symbols in the Unicode Standard come directly from the
3131
International Phonetic Association itself. Therefore, many pitfalls that we will

book/chapters/orthography_profiles.tex

Lines changed: 4 additions & 3 deletions
@@ -193,13 +193,14 @@ \subsection*{File Format}
193193
% normalized following NFC (or NFD if specified in the metadata)
194194
that includes information pertinent to the orthography.\footnote{See
195195
Section~\ref{pitfall-file-formats} in which we suggest using NFC,
196-
no-BOM and LF line breaks because of the pitfalls they avoid. A keen reviewer notes, however, that specifying
197-
a convention for line endings and BOM is overly strict because most
196+
no-BOM and LF line breaks because of the pitfalls they avoid. Specifying
197+
a convention for line endings and BOM is often overly strict because most
198198
computing environments (now) transparently handle both alternatives.
199199
For example, using Python a file can be decoded using the encoding
200200
``utf-8-sig'', which strips away the BOM (if present) and reads
201201
an input file in text mode, so that both line ending variants ``LF'' and
202-
``CRLF'' will be stripped.}
202+
``CRLF'' will be stripped. However, note that most shells (e.g.\ bash) will not
203+
behave properly with CRLF line endings.}
203204

204205
\item \textsc{A profile is a delimited text file with an obligatory header
205206
line}. A minimal profile must have a single column with the header \texttt{Grapheme}.
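
The footnote above mentions decoding with ``utf-8-sig'' in Python. The following is a minimal sketch of that behaviour, assuming Python 3 and only the standard library; the function names \texttt{read\_profile\_lines} and \texttt{demo} are ours, and the three-line sample profile is made up for illustration.

```python
import os
import tempfile


def read_profile_lines(path):
    """Read a text file, dropping a leading BOM and normalizing line endings."""
    # "utf-8-sig" skips a UTF-8 BOM if one is present; newline=None (the
    # default in text mode) translates CRLF and CR to "\n" while reading.
    with open(path, encoding="utf-8-sig", newline=None) as handle:
        return handle.read().splitlines()


def demo(with_bom, line_ending):
    """Write a tiny sample file with the given conventions and read it back."""
    raw = line_ending.join(["Grapheme", "a", "b", ""]).encode("utf-8")
    if with_bom:
        raw = b"\xef\xbb\xbf" + raw
    with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv") as handle:
        handle.write(raw)
        path = handle.name
    try:
        return read_profile_lines(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo(with_bom=True, line_ending="\r\n"))
    print(demo(with_bom=False, line_ending="\n"))
```

Both calls should yield the same list of lines, which is why a strict convention for BOM and line endings matters less when files are read this way than when they are passed to a shell.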
