unicode-cookbook
diff --git a/‎.gitignore‎
Lines changed: 4 additions & 0 deletions b/‎.gitignore‎
diff --git a/‎book/chapters/implementation.tex‎
Lines changed: 29 additions & 15 deletions b/‎book/chapters/implementation.tex‎
diff --git a/‎book/chapters/introduction.tex‎
Lines changed: 10 additions & 10 deletions b/‎book/chapters/introduction.tex‎
diff --git a/‎book/chapters/ipa_background.tex‎
Lines changed: 3 additions & 3 deletions b/‎book/chapters/ipa_background.tex‎
diff --git a/‎book/chapters/ipa_meets_unicode.tex‎
Lines changed: 1 addition & 1 deletion b/‎book/chapters/ipa_meets_unicode.tex‎
diff --git a/‎book/chapters/orthography_profiles.tex‎
Lines changed: 4 additions & 3 deletions b/‎book/chapters/orthography_profiles.tex‎
@@ -3,10 +3,12 @@
 unicode.texnicle
 
 *.adx
+*.bcf
 *.toc
 *.aux
 *.glo
 *.idx
+*.ldx
 *.log
 *.toc
 *.ist
@@ -25,6 +27,8 @@ unicode.texnicle
 *.maf
 *.mtc
 *.mtc1
+*.mw
+*.sdx
 *.out
 *.synctex.gz
 *.fdb_latexmk
 
@@ -10,11 +10,12 @@ \section{Overview}
 
 \section{How to install Python and R}
 \label{installing-python-and-r}
+
 When you encounter problems installing software, or bugs in programming code, search engines are your friend! Installation problems and incomprehensible error messages have typically been encountered and solved by other users. Try simply copying and pasting the output of an error message into a search engine; the solution is often already somewhere online. We are fans of Stack Exchange\footnote{\url{https://stackexchange.com/}} -- a network of question-and-answer websites -- which are extremely helpful in solving issues regarding software installation, bugs in code, etc.
 
-Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Unix operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks). 
+Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Linux operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks). 
 
-Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Unix) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Unix. 
+Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Linux) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Linux. 
 
 Once you have R or Python (or both) installed on your computer, you are ready to use the orthography profiles software libraries presented in the next two sections. As noted above, we make this material available online on GitHub,\footnote{\url{https://github.com/}} a web-based version control system for source code management. GitHub repositories can be cloned or downloaded,\footnote{\url{https://help.github.com/articles/cloning-a-repository/}} so that you can work through the examples on your local machine. Use your favorite search engine to figure out how to install Git on your computer and learn more about using Git.\footnote{\url{https://git-scm.com/}} In our GitHub repository, we make the material presented below (and more use cases described briefly in Section \ref{use-cases}) available as Jupyter Notebooks. Jupyter Notebooks provide an interface where you can run and develop source code using the browser as an interface. These notebooks are easily viewed in our GitHub repository of use cases.\footnote{\url{https://github.com/unicode-cookbook/recipes}}
 
@@ -432,17 +433,24 @@ \subsection*{Installation}
 by using Rscript to get the paths to the executables within the terminal.
 
 <<eval=FALSE, tidy=FALSE, engine='bash'>>=
-# get the paths to the R executables in bash
-pathT=`Rscript -e 'cat(file.path(find.package("qlcData"), 
-  "exec", "tokenize"))'`
-pathW=`Rscript -e 'cat(file.path(find.package("qlcData"), 
-  "exec", "writeprofile"))'`
-
-# make softlinks to the R executables in /usr/local/bin
-# you will have to enter your user's password!
+pathT=`Rscript -e 'cat(system.file("exec/tokenize", package="qlcData"))'`
+pathW=`Rscript -e 'cat(system.file("exec/writeprofile", package="qlcData"))'`
+@
+
+Then you can make softlinks to the R executables in \texttt{/usr/local/bin} by using the following command in the terminal:
+
+<<eval=FALSE, tidy=FALSE, engine='bash'>>=
 sudo ln -is $pathT $pathW /usr/local/bin
 @
 
+You can also do this within R by using the following commands, again possibly replacing \texttt{/usr/local/bin} with a suitable location on your system:
+
+<<eval=FALSE>>=
+# link executables from within R
+file.symlink(system.file("exec/tokenize", package="qlcData"), "/usr/local/bin")
+file.symlink(system.file("exec/writeprofile", package="qlcData"), "/usr/local/bin")
+@
+
 After inserting this softlink it should be possible to access the
 \texttt{tokenize} function from the shell. Try \texttt{tokenize --help} to test
 the functionality.
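
A quick way to confirm that the softlinks are reachable is to ask which commands resolve on the search path. The following is a minimal Python sketch, not part of \texttt{qlcData}; the helper name \texttt{missing\_commands} is hypothetical, and it only checks that an executable with the given name can be found, not that it runs correctly.

```python
import shutil


def missing_commands(names, path=None):
    """Return the command names that cannot be found as executables.

    `path` is an optional search path string; when it is None,
    shutil.which falls back to the PATH environment variable.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


# After linking, both names should resolve, so the list should be empty.
print(missing_commands(["tokenize", "writeprofile"]))
```

If a name is still listed as missing, the directory that holds the softlinks is probably not on the search path of the current shell.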
@@ -458,10 +466,10 @@ \subsection*{Installation}
 % online at \url{TODO}. The webapps are also included inside the \texttt{qlcData} 
 % package and can be started with the following helper function:
 
-To make the functionality even more accessible, we have prepared webapps with 
-the \texttt{Shiny} framework for the R functions. The webapps are 
-included inside the \texttt{qlcData} package and can be started with the 
-helper function (in R): \texttt{launch\_shiny('tokenize')}.
+% To make the functionality even more accessible, we have prepared webapps with 
+% the \texttt{Shiny} framework for the R functions. The webapps are 
+% included inside the \texttt{qlcData} package and can be started with the 
+% helper function (in R): \texttt{launch\_shiny('tokenize')}.
 
 % <<eval=FALSE>>=
 % launch_shiny('tokenize')
@@ -1204,5 +1212,11 @@ \section{Recipes online}
 
 \noindent The ASJP use case shows how to download the full set of ASJP wordlists, to combine them into a single large CSV file, and to tokenize the ASJP orthography. The Dutch use case takes as input the 10K corpus for Dutch (``nld'') from the Leipzig Corpora Collection,\footnote{\url{http://wortschatz.uni-leipzig.de/en/download/}} which is then cleaned and tokenized with an orthography profile that captures the intricacies of Dutch orthography.
 
-% In closing, using GitHub to share code and data provides a platform for sharing scientific results and it also promotes a means for scientific replicability of results. Moreover, we find that in cases where the scientists are building the to tools for analysis, open repositories and data help to ensure that what you see is what you get.
+\section{Closing words}
+
+In closing, we hope that these rather elaborate musings on writing systems, Unicode, and the IPA will help readers appreciate the progress that has been made over the last decades, while acknowledging the many pitfalls that are still lurking below the surface. But mainly we hope that our proposals point towards a way forward in sharing scientific data, interpretations, and analyses in a more transparent manner.
+
+GitHub (or any similar service) provides a platform for sharing scientific research and also promotes the replicability of results. Not only do we use GitHub for the software packages that accompany this book, but we actually used it to write this book. GitHub allowed us to openly and collaboratively work on the book and then to participate interactively in an open review process by using an issue tracker and version control.\footnote{\url{https://userblogs.fu-berlin.de/langsci-press/2018/07/11/what-it-means-to-be-open-and-community-based-the-unicode-cookbook-as-a-showcase/}} We will continue to use our repositories to make corrections and updates to the book and to the associated orthography profile software packages, when necessary.\footnote{\url{https://github.com/unicode-cookbook/}}
+
+Moreover, we find that in situations where scientists are building tools for analysis, open repositories and open data help to ensure that what you see is what you get.
 
@@ -239,14 +239,14 @@ \subsubsection*{Binary encoding}
 To also allow for different uppercase and lowercase letters and for a large
 variety of control characters to be used in the newly developing technology of
 computers, the American Standards Association decided to propose a new 7-bit
-encoding in 1963 (with $2^7 = 128$ different possible characters), known as the
+encoding in 1963 (with 2\textsuperscript{7} = 128 different possible characters), 
+known as the
 \textsc{American Standard Code for Information Interchange} (ASCII), geared
 towards the encoding of English orthography. With the ascent of other
 orthographies in computer usage, the wish to encode further variations of Latin
 letters (including German <ß> and various letters with diacritics, e.g.\ <è>) led the
 Digital Equipment Corporation to introduce an 8-bit \textsc{Multinational
-Character Set} (MCS, with $2^8 = 256$ different possible characters), first used
-with the introduction of the VT{\large 220} Terminal in 1983. 
+Character Set} (MCS, with 2\textsuperscript{8} = 256 different possible characters), first used with the introduction of the VT{\large 220} Terminal in 1983. 
 
 Because 256 characters were clearly not enough for the unique representation of 
 many different characters
@@ -269,17 +269,17 @@ \subsubsection*{Binary encoding}
 In the 1980s various people started to develop true
 international code sets. In the United States, a group of computer scientists
 formed the \textsc{unicode consortium}, proposing a 16-bit encoding in 1991
-(with $2^{16} = 65,536$ different possible characters). At the same time in
+(with 2\textsuperscript{16} = 65,536 different possible characters). At the same time in
 Europe, the \textsc{international organization for standardization} (ISO) was
 working on ISO~10646 to replace the ISO/IEC~8859 standard. Their first draft of
 the \textsc{universal character set} (UCS) in 1990 was 31-bit (with
-theoretically $2^{31} = 2,147,483,648$ possible characters, but because of some
+theoretically 2\textsuperscript{31} = 2,147,483,648 possible characters, but because of some
 technical restrictions only 679,477,248 were allowed). Since 1991, the Unicode
 Consortium and the ISO have jointly developed the \textsc{unicode standard}, or
 ISO/IEC~10646, leading to the current system including the original 16-bit
 Unicode proposal as the \textsc{basic multilingual plane}, and 16 additional
-planes of 16-bit for further extensions (with in total $(1+16) \cdot 2^{16} =
-1,114,112$ possible characters). The most recent version of the Unicode Standard
+planes of 16-bit for further extensions (with in total (1+16) $\times$ 2\textsuperscript{16} =
+1,114,112 possible characters). The most recent version of the Unicode Standard
 (currently at version number 11.0.0) was published in June 2018 and it defines
 137,374 different characters \citep{Unicode2018}.
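
The code space sizes quoted in this passage can be checked with a few lines of arithmetic. The following Python sketch is only an illustration; the variable names are ours, and the final comparison relies on \texttt{sys.maxunicode}, which in Python 3.3 and later is the highest code point (U+10FFFF).

```python
import sys

# Sizes of the code spaces mentioned in the text.
ascii_size = 2 ** 7        # 7-bit ASCII
mcs_size = 2 ** 8          # 8-bit Multinational Character Set
bmp_size = 2 ** 16         # original 16-bit Unicode proposal (one plane)
ucs_draft_size = 2 ** 31   # first 31-bit UCS draft (theoretical maximum)

# Basic Multilingual Plane plus 16 additional planes of 16 bits each.
planes = 1 + 16
codespace_size = planes * bmp_size

print(ascii_size, mcs_size, bmp_size, ucs_draft_size, codespace_size)

# Code points run from 0 to sys.maxunicode, so the count is one more.
print(codespace_size == sys.maxunicode + 1)
```

The last line ties the figure of 1,114,112 to the code point range U+0000 to U+10FFFF used by the current standard.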
 
@@ -382,10 +382,10 @@ \subsubsection*{Script systems}
 
 Breaking it down further, a script consists of \textsc{graphemes}, which are writing 
 system-specific minimally distinctive symbols (see below). Graphemes may consist of one or more 
-\textsc{characters}. The term \textsc{character} is overladen. In the linguistic terminology of writing
-systems, a \textsc{character} is a general term for any self-contained element
+\textsc{characters}. The term `character' is overloaded. In the linguistic terminology of writing
+systems, a character is a general term for any self-contained element
 in a writing system. A second interpretation is used as a conventional term for a unit in the Chinese writing
-system \citep{Daniels1996}. In technical terminology, a \textsc{character} 
+system \citep{Daniels1996}. In technical terminology, a character
 refers to the electronic encoding of a component in a writing system that has semantic 
 value (see Section \ref{character-encoding-system}). Thus in this work we must navigate 
 between the general linguistic and technical terms for \textsc{character} 
 
@@ -14,7 +14,7 @@ \chapter{The International Phonetic Alphabet}
 diacritics (Section~\ref{EncodingIPA}). Occurring a little over a hundred years after 
 the inception of the IPA, its encoding was a major challenge 
 (Section~\ref{need-for-multilingual-environment}); many 
-linguists have encountered the pitfalls when the two are used together 
+linguists have encountered pitfalls when the two are used together 
 (Chapter~\ref{ipa-meets-unicode}).
 
 % ==========================
@@ -48,7 +48,7 @@ \section{Brief history}
 \url{https://en.wikipedia.org/wiki/History\_of\_the\_International\_Phonetic_Alphabet}.}
 
 Over the years there have been several revisions, but mostly minor ones. Articulation 
-labels -- what are often called \textit{features} even though the IPA
+labels -- what are often called \textit{features}, even though the IPA
 deliberately avoids this term -- have changed, e.g.\ terms like \textit{lips}, \textit{throat}
 or \textit{rolled} are no longer used. Phonetic symbol values have changed, e.g.\
 voiceless is no longer marked by <h>. Symbols have been dropped, e.g.\ the
@@ -292,7 +292,7 @@ \section{IPA encodings}
 devised a system of base characters with secondary diacritic marks 
 (e.g.\ in the previous example <kp>, the base character, is modified with <W>). 
 This encoding approach is 
-also used in SAMPA and X-SAMPA (Section~\ref{sampa-xsampa}) and in the 
+also used in SAMPA and X-SAMPA (see below) and in the 
 ASJP.\footnote{See the ASJP use case in the online supplementary 
 materials to this book: \url{https://github.com/unicode-cookbook/recipes}.} 
 But before UPSID, SAMPA and ASJP, IPA was encoded with numbers.
 
@@ -25,7 +25,7 @@ \section{The twain shall meet}
 sometimes look like the Unicode Consortium is making incomprehensible decisions,
 but it is important to realize that the consortium has tried and is continuing
 to try to be as consistent as possible across a wide range of use cases, and it
-does place linguistic traditions above other orthographic choices. Furthermore,
+does not place linguistic traditions above other orthographic choices. Furthermore,
 when we look at the history of how the IPA met Unicode, we see that many of the
 decisions for IPA symbols in the Unicode Standard come directly from the
 International Phonetic Association itself. Therefore, many pitfalls that we will
 
@@ -193,13 +193,14 @@ \subsection*{File Format}
 	   % normalized following NFC (or NFD if specified in the metadata)
        that includes information pertinent to the orthography.\footnote{See 
 	   Section~\ref{pitfall-file-formats} in which we suggest to use NFC, 
-	   no-BOM and LF line breaks because of the pitfalls they avoid. A keen reviewer notes, however, that specifying 
-	   a convention for line endings and BOM is overly strict because most 
+	   no-BOM and LF line breaks because of the pitfalls they avoid. Specifying 
+	   a convention for line endings and BOM is often overly strict because most 
 	   computing environments (now) transparently handle both alternatives. 
 	   For example, using Python a file can be decoded using the encoding 
 	   ``utf-8-sig'', which strips away the BOM (if present) and reads 
 	   an input file in text mode, so that both line feed variants ``LF'' and
-	   ``CRLF'' will be stripped.}
+	   ``CRLF'' will be translated to a single newline character. However, note that most shells (e.g.\ bash) will not 
+	   behave properly with CRLF line endings.}
 
 	\item \textsc{A profile is a delimited text file with an obligatory header
        line}. A minimal profile must have a single column with the header \texttt{Grapheme}.