* update gitignore
* minor edits in introduction
* small caps in introduction
* minor edits to chapter 2 unicode approach
* corrections chapter 3 unicode pitfalls
* corrections chapter 4 IPA
* layout of tables in chapter 5
* typo in Chapter 5
* update bibliography
* more latex gitignore
* addition to Chapter 7 orthography profiles
* corrections to chapter 8 implementation
* Updated tex and pdf files
* requested changes
* fix alignment on p.107
* change bash explanation
* edits in last paragraph; copy up the cookbook pdf
book/chapters/implementation.tex (29 additions, 15 deletions)
@@ -10,11 +10,12 @@ \section{Overview}
 \section{How to install Python and R}
 \label{installing-python-and-r}
+
 When one encounters problems installing software, or bugs in programming code, search engines are your friend! Installation problems and incomprehensible error messages have typically been encountered and solved by other users. Try simply copying and pasting the output of an error message into a search engine; the solution is often already somewhere online. We are fans of Stack Exchange\footnote{\url{https://stackexchange.com/}} -- a network of question-and-answer websites -- which are extremely helpful in solving issues regarding software installation, bugs in code, etc.

-Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Unix operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks).
+Searching the web for ``install r and python'' returns numerous tutorials on how to set up your machine for scientific data analysis. Note that there is no single correct setup for a particular computer or operating system. Both Python and R are available for Windows, Mac, and Linux operating systems from the Python and R project websites. Another option is to use a so-called package manager, i.e.\ a software program that allows the user to manage software packages and their dependencies. On Mac, we use Homebrew,\footnote{\url{https://brew.sh/}} a simple-to-install (via the Terminal App) free and open source package management system. Follow the instructions on the Homebrew website and then use Homebrew to install R and Python (as well as other software packages such as Git and Jupyter Notebooks).

-Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Unix) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Unix.
+Alternatively for R, RStudio\footnote{\url{https://www.rstudio.com/}} provides a free and open source integrated development environment (IDE). This application can be downloaded and installed (for Mac, Windows and Linux) and it includes its own R installation and R libraries package manager. For developing in Python, we recommend the free community version of PyCharm,\footnote{\url{https://www.jetbrains.com/pycharm/}} an IDE which is available for Mac, Windows, and Linux.

 Once you have R or Python (or both) installed on your computer, you are ready to use the orthography profiles software libraries presented in the next two sections. As noted above, we make this material available online on GitHub,\footnote{\url{https://github.com/}} a web-based version control system for source code management. GitHub repositories can be cloned or downloaded,\footnote{\url{https://help.github.com/articles/cloning-a-repository/}} so that you can work through the examples on your local machine. Use your favorite search engine to figure out how to install Git on your computer and learn more about using Git.\footnote{\url{https://git-scm.com/}} In our GitHub repository, we make the material presented below (and more use cases described briefly in Section \ref{use-cases}) available as Jupyter Notebooks. Jupyter Notebooks provide an interface where you can run and develop source code using the browser as an interface. These notebooks are easily viewed in our GitHub repository of use cases.\footnote{\url{https://github.com/unicode-cookbook/recipes}}
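After following the installation advice in the hunk above, it can be useful to check which of the tools are actually reachable from the command line. The following is a minimal sketch, assuming the executables carry their usual names (\texttt{python3}, \texttt{Rscript}, \texttt{git}, \texttt{jupyter}); these names may differ on your system, and the helper \texttt{find\_tools} is a hypothetical illustration, not part of any package discussed in the chapter.

```python
import shutil

# Assumed executable names; adjust them to match your own installation.
TOOLS = ["python3", "Rscript", "git", "jupyter"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name}: {path if path else 'not found'}")
```

A tool reported as \texttt{not found} is either not installed or installed in a location that is not on the search path.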
@@ -432,17 +433,24 @@ \subsection*{Installation}
 by using Rscript to get the paths to the executables within the terminal.
 Then you can make softlinks to the R executables in \texttt{/usr/local/bin} by using the following command in the terminal:
+
+<<eval=FALSE, tidy=FALSE, engine='bash'>>=
 sudo ln -is $pathT $pathW /usr/local/bin
 @

+You can also do this within R by using the following commands, again possibly replacing \texttt{/usr/local/bin} with a suitable location on your system:
 After inserting this softlink it should be possible to access the
 \texttt{tokenize} function from the shell. Try \texttt{tokenize --help} to test
 the functionality.
@@ -458,10 +466,10 @@ \subsection*{Installation}
 % online at \url{TODO}. The webapps are also included inside the \texttt{qlcData}
 % package and can be started with the following helper function:

-To make the functionality even more accessible, we have prepared webapps with
-the \texttt{Shiny} framework for the R functions. The webapps are
-included inside the \texttt{qlcData} package and can be started with the
-helper function (in R): \texttt{launch\_shiny('tokenize')}.
+%To make the functionality even more accessible, we have prepared webapps with
+%the \texttt{Shiny} framework for the R functions. The webapps are
+%included inside the \texttt{qlcData} package and can be started with the
+%helper function (in R): \texttt{launch\_shiny('tokenize')}.

 % <<eval=FALSE>>=
 % launch_shiny('tokenize')
@@ -1204,5 +1212,11 @@ \section{Recipes online}
 \noindent The ASJP use case shows how to download the full set of ASJP wordlists, to combine them into a single large CSV file, and to tokenize the ASJP orthography. The Dutch use case takes as input the 10K corpus for Dutch (``nld'') from the Leipzig Corpora Collection,\footnote{\url{http://wortschatz.uni-leipzig.de/en/download/}} which is then cleaned and tokenized with an orthography profile that captures the intricacies of Dutch orthography.

-% In closing, using GitHub to share code and data provides a platform for sharing scientific results and it also promotes a means for scientific replicability of results. Moreover, we find that in cases where the scientists are building the to tools for analysis, open repositories and data help to ensure that what you see is what you get.
+\section{Closing words}
+
+In closing, we hope that these rather elaborate musings on writing systems, Unicode, and the IPA, will help readers appreciate the progress that has been made over the last decades, while acknowledging the many pitfalls that are still lurking below the surface. But mainly we hope that our proposals point towards a way forward in sharing scientific data, interpretations, and analyses, in a more transparent manner.
+
+GitHub (or any other similar service) provides a platform for sharing scientific research and it also promotes a means for scientific replicability of results. Not only do we use GitHub for the software packages that accompany this book, but we actually used it to write this book. GitHub allowed us to openly and collaboratively work on the book and then to participate interactively in an open review process by using an issue tracker and version control.\footnote{\url{https://userblogs.fu-berlin.de/langsci-press/2018/07/11/what-it-means-to-be-open-and-community-based-the-unicode-cookbook-as-a-showcase/}} We will continue to use our repositories to make corrections and updates to the book and to the associated orthography profile software packages, when necessary.\footnote{\url{https://github.com/unicode-cookbook/}}
+
+Moreover, we find that in situations where scientists are building tools for analysis, open repositories and open data help to ensure that what you see is what you get.
 To also allow for different uppercase and lowercase letters and for a large
 variety of control characters to be used in the newly developing technology of
 computers, the American Standards Association decided to propose a new 7-bit
-encoding in 1963 (with $2^7 = 128$ different possible characters), known as the
+encoding in 1963 (with 2\textsuperscript{7} = 128 different possible characters),
+known as the
 \textsc{American Standard Code for Information Interchange} (ASCII), geared
 towards the encoding of English orthography. With the ascent of other
 orthographies in computer usage, the wish to encode further variations of Latin
 letters (including German <ß> and various letters with diacritics, e.g.\ <è>) led the
 Digital Equipment Corporation to introduce an 8-bit \textsc{Multinational
-Character Set} (MCS, with $2^8 = 256$ different possible characters), first used
-with the introduction of the VT{\large 220} Terminal in 1983.
+Character Set} (MCS, with 2\textsuperscript{8} = 256 different possible characters), first used with the introduction of the VT{\large 220} Terminal in 1983.

 Because 256 characters were clearly not enough for the unique representation of
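The 7-bit and 8-bit figures in the passage above can be checked directly, together with the observation that ASCII cannot represent letters such as <ß> or <è>. The following is a minimal sketch with two assumptions: Python's \texttt{latin-1} codec is used as a stand-in for an 8-bit Latin character set (it is not MCS itself), and the helper \texttt{encodable} is hypothetical, introduced only for illustration.

```python
def encodable(text, codec):
    """Return True if `text` can be encoded with the named Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A 7-bit encoding distinguishes 2**7 values, an 8-bit encoding 2**8 values.
print(2 ** 7, 2 ** 8)

# ASCII is geared towards English orthography and lacks German <ß>.
print(encodable("Strasse", "ascii"))
print(encodable("Straße", "ascii"))

# latin-1 is an assumed stand-in for an 8-bit Latin set, not MCS itself.
print(encodable("Straße", "latin-1"))
print(encodable("è", "latin-1"))
```

Characters outside the chosen 8-bit set, for example many IPA symbols, still fail to encode, which illustrates why 256 code points are not sufficient.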