ftyers · jackashore · Oct 29, 2018
diff --git a/2018-komp-ling/practicals/Practical 1/segmentation-response.md b/2018-komp-ling/practicals/Practical 1/segmentation-response.md
@@ -0,0 +1,53 @@
+# Report on segmentation task
+
+For this task I used Pragmatic segmenter and NLTK's Punkt. Both performed well on Russian text, but each made some errors, which are described below.
+
+## Pragmatic segmenter
+
+**Accuracy**: 0.97
+
+Pragmatic segmenter mishandled the abbreviation 'н. э.' and wrongly split one sentence into two:
+
+> Традиционно считается, что этническая основа Литвы сформирована носителями археологочической культуры восточнолитовских курганов, сформировавшейся в V веке н. э. \
+> на территории современных Восточной Литвы и Северо-Западной Белоруссии.
+
+However, in the following example the segmenter did not make this mistake (probably because a bracket follows the abbreviation):
+
+> В конце неолита (III-II тыс. до н. э.) на территорию современной Литвы проникли индоевропейские племена.
+
+A similar problem occurred with the abbreviation 'см.':
+
+> В 1944 году нацисты были изгнаны Красной Армией с территории Литовской ССР (см.\
+> Белорусская операция (1944)).
+
+In all other cases the segmenter performed very well.
+
+## NLTK's Punkt
+
+**Accuracy**: 0.95
+
+This segmenter performed slightly worse than the previous one. It could not cope with the abbreviation 'н. э.' either, and it made a wider range of mistakes.
+
+It missed the boundary between two sentences:
+
+> Территория современной Литвы была заселена людьми с конца X—IX тысячелетия до н. э. Жители занимались охотой и рыболовством, использовали лук и стрелы с кремнёвыми наконечниками, скребки для обработки кожи, удочки и сети. (not split)
+
+It inserted spurious boundaries:
+
+> В конце неолита (III-II тыс.\
+> до н.\
+> э.)
+
+Another spurious boundary, after the abbreviation 'тыс.':
+
+> Более 3 тыс.\
+> озёр
+
+The following example, however, shows that in some cases 'н. э.' was recognized as an abbreviation:
+
+> Около VII века н. э. литовский язык отделился от латышского.
+
+Punkt also split after the abbreviation 'см.' (exactly the same error as with Pragmatic segmenter):
+
+> В 1944 году нацисты были изгнаны Красной Армией с территории Литовской ССР (см.\
+> Белорусская операция (1944)).
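Review note: the 'н. э.' errors above come from one ambiguity. A period after an abbreviation may or may not also end the sentence. The sketch below is not how Pragmatic segmenter or Punkt work internally; it is a hypothetical rule-based splitter (the `ABBREVIATIONS` set and the `split_sentences` name are assumptions made for illustration) showing why a plain abbreviation list cannot get both cases right.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"н.", "э.", "см.", "тыс."}


def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter
    follow, unless the token ending at that mark is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[А-ЯЁA-Z])", text):
        # Whitespace-delimited token that ends at this punctuation mark.
        last_token = text[start:match.start() + 1].split()[-1]
        if last_token.lstrip("(") in ABBREVIATIONS:
            continue  # treat the period as part of the abbreviation
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


print(split_sentences("В 1944 году (см. Белорусская операция). Это конец."))
print(split_sentences("до н. э. Жители занимались охотой."))
```

The first call keeps '(см. Белорусская операция)' together, but the second returns a single sentence: listing 'э.' as an abbreviation suppresses the real boundary before 'Жители', which is the same missed split reported for Punkt.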
diff --git a/2018-komp-ling/practicals/Practical 1/tokenisation-response.md b/2018-komp-ling/practicals/Practical 1/tokenisation-response.md
@@ -0,0 +1,6 @@
+## Report on tokenisation task
+
+I wrote a Python script that implements the MaxMatch algorithm (maxmatch.py). *Running the file prompts for a dictionary path and then a sentence to tokenise.*
+
+I extracted a Japanese dictionary and training sentences (also with Python). I evaluated the tokeniser using word error rate (WER), which came to 14%.
+
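Review note: the report does not say how WER was computed. A minimal sketch, assuming the usual definition (token-level edit distance between the reference and the system tokenisation, divided by the number of reference tokens); the function name `wer` and the toy token lists are illustrative, not taken from the submission.

```python
def wer(reference, hypothesis):
    """Word error rate between two token lists (assumed definition:
    Levenshtein distance over tokens / number of reference tokens)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Toy example: the system merged the last two reference tokens.
print(wer(["a", "b", "c"], ["a", "bc"]))
```

Stating the definition next to the 14% figure would make the result reproducible.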
diff --git a/2018-komp-ling/practicals/Practical 1/tokenisation/maxmatch.py b/2018-komp-ling/practicals/Practical 1/tokenisation/maxmatch.py
@@ -0,0 +1,40 @@
+# maxmatch: greedy longest-match-first tokenisation
+def tokenize(sentence, dictionary):
+    """
+    maxmatch algorithm
+    :param sentence: string
+    :param dictionary: collection of known words
+    :return: list of tokens
+    """
+    tokens = []
+    while sentence:
+        # try the longest prefix first, shrinking until a word matches
+        for counter in range(len(sentence), 0, -1):
+            first_word = sentence[:counter]
+            if first_word in dictionary:
+                break
+        else:
+            # no dictionary word found: fall back to a single character
+            first_word = sentence[0]
+        tokens.append(first_word)
+        sentence = sentence[len(first_word):]
+    return tokens
+
+
+def main():
+    print('Insert path to dictionary')
+    n_dict = input()
+    # a set gives constant-time membership tests
+    dictionary = set()
+    with open(n_dict, encoding='utf-8') as f:
+        for line in f:
+            if line.strip():
+                dictionary.add(line.strip('\n'))
+
+    print('Insert sentence: ')
+    sentence = str(input())
+
+    print(tokenize(sentence, dictionary))
+
+if __name__ == '__main__':
+    main()
diff --git a/2018-komp-ling/practicals/Practical 1/tokenisation/tokenisation-report.md b/2018-komp-ling/practicals/Practical 1/tokenisation/tokenisation-report.md
@@ -0,0 +1,6 @@
+## Report on tokenisation task
+
+I wrote a Python script that implements the MaxMatch algorithm (maxmatch.py). *Running the file prompts for a dictionary path and then a sentence to tokenise.*
+
+I extracted a Japanese dictionary and training sentences (also with Python). I evaluated the tokeniser using word error rate (WER), which came to 14%.
+
diff --git a/2018-komp-ling/practicals/Practical 2/freq.py b/2018-komp-ling/practicals/Practical 2/freq.py
@@ -0,0 +1,28 @@
+import sys
+
+vocab = {} # dict to store frequency list
+f = open(sys.stdin.fileno(), 'r', encoding='utf-8')  # open() needs a path or file descriptor, not the sys.stdin object
+# for each of the lines of input
+for line in f.readlines():
+	# if there is no tab character, skip the line
+	if '\t' not in line:
+		continue
+	# make a list of the cells in the row
+	row = line.split('\t')
+	# the form is the value of the second cell
+	form = row[1]
+	# if we haven't seen it yet, set the frequency count to 0
+	if form not in vocab:
+		vocab[form] = 0
+	vocab[form] = vocab[form] + 1
+
+freq = []
+
+for w in vocab:
+	freq.append((vocab[w], w))
+freq.sort(reverse=True)
+
+fd = open('freq.txt', 'w+', encoding='utf-8')
+for w in freq:
+	fd.write('%d\t%s\n' % (w[0], w[1]))
+fd.close()
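Review note: the counting loop in freq.py can be expressed with `collections.Counter`. A minimal sketch under the same assumptions as the script (tab-separated rows with the surface form in the second column, lines without a tab skipped); the function name `frequency_list` and the sample rows are illustrative.

```python
from collections import Counter


def frequency_list(lines):
    """Return (count, form) pairs sorted from most to least frequent."""
    counts = Counter()
    for line in lines:
        # Skip lines that are not tab-separated rows.
        if '\t' not in line:
            continue
        row = line.rstrip('\n').split('\t')
        counts[row[1]] += 1
    # Same ordering as freq.py: descending by count, then by form.
    return sorted(((n, form) for form, n in counts.items()), reverse=True)


sample = ["# comment line\n", "1\tи\tи\n", "2\tв\tв\n", "3\tи\tи\n"]
for count, form in frequency_list(sample):
    print('%d\t%s' % (count, form))
```

Taking an iterable of lines rather than reading a file directly also makes the function easy to test without touching standard input.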
diff --git a/2018-komp-ling/practicals/Practical 2/freq.txt b/2018-komp-ling/practicals/Practical 2/freq.txt
@@ -0,0 +1,118 @@
+21	,
+8	.
+7	и
+6	в
+4	не
+4	его
+4	был
+3	с
+3	он
+3	Левы
+2	тех
+2	она
+2	нить
+2	к
+2	из
+2	жизни
+2	еще
+2	во
+2	было
+2	Лева
+2	В
+2	"
+1	этом
+1	это
+1	эта
+1	чьих-то
+1	чреве
+1	чем
+1	фамилии
+1	узлов
+1	уже
+1	у
+1	то
+1	так
+1	существенна
+1	струилась
+1	стремительности
+1	сторону
+1	старому
+1	случились
+1	случалось
+1	слишком
+1	славному
+1	скорее
+1	скользила
+1	сказать
+1	сибирских
+1	себя
+1	своей
+1	самых
+1	самого
+1	русскому
+1	рук
+1	руд
+1	роковом
+1	роду
+1	родителями
+1	родителям
+1	ровном
+1	протекала
+1	провисала
+1	приходилось
+1	принадлежность
+1	предка
+1	правда
+1	потрясений
+1	потомком
+1	пор
+1	помнил
+1	перемещения
+1	пальцев
+1	отношение
+1	особых
+1	основном
+1	определять
+1	однофамильцем
+1	обрывов
+1	ним
+1	несильном
+1	неприятные
+1	необходимости
+1	немного
+1	находилась
+1	натяжении
+1	младенчестве
+1	мерно
+1	меж
+1	лишь
+1	кой-какие
+1	когда
+1	как
+1	их
+1	или
+1	излишней
+1	зачат
+1	замечательного
+1	давние
+1	годы
+1	году
+1	говоря
+1	глубину
+1	вспоминать
+1	временами
+1	возникало
+1	вернее
+1	божественных
+1	без
+1	Собственно
+1	Он
+1	Одоевцевых
+1	Одоевцева
+1	Образно
+1	Если
+1	Без
+1	А
+1	-
+1	)
+1	(
diff --git a/2018-komp-ling/practicals/Practical 2/rank.py b/2018-komp-ling/practicals/Practical 2/rank.py
@@ -0,0 +1,16 @@
+import sys
+
+freq = []
+
+with open(sys.argv[1], 'r', encoding='utf-8') as fd:
+    for line in fd:
+        line = line.strip('\n')
+        (f, w) = line.split('\t')
+        freq.append((int(f), w))
+
+ranks = []
+for counter in range(0, len(freq)):
+    ranks.append((counter + 1, freq[counter][0], freq[counter][1]))
+
+for w in ranks:
+    print('%d\t%s\t%s' % (w[0], w[1], w[2]))
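Review note: rank.py numbers the rows sequentially, so words with the same frequency get different ranks (for example, the three words with frequency 4 receive ranks 5, 6 and 7). Whether ties should share a rank depends on what the practical asks for; the sketch below shows one possible alternative, dense ranking, and is not the submitted method. The name `dense_ranks` is illustrative.

```python
def dense_ranks(freq):
    """freq: list of (count, word) pairs sorted by count, descending.
    Words with equal counts share a rank; ranks increase by one."""
    ranks = []
    rank = 0
    previous = None
    for count, word in freq:
        if count != previous:
            rank += 1
            previous = count
        ranks.append((rank, count, word))
    return ranks


sample = [(21, ','), (8, '.'), (4, 'не'), (4, 'его'), (3, 'с')]
for rank, count, word in dense_ranks(sample):
    print('%d\t%d\t%s' % (rank, count, word))
```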
diff --git a/2018-komp-ling/practicals/Practical 2/rank.txt b/2018-komp-ling/practicals/Practical 2/rank.txt
@@ -0,0 +1,118 @@
+1	21	,
+2	8	.
+3	7	и
+4	6	в
+5	4	не
+6	4	его
+7	4	был
+8	3	с
+9	3	он
+10	3	Левы
+11	2	тех
+12	2	она
+13	2	нить
+14	2	к
+15	2	из
+16	2	жизни
+17	2	еще
+18	2	во
+19	2	было
+20	2	Лева
+21	2	В
+22	2	"
+23	1	этом
+24	1	это
+25	1	эта
+26	1	чьих-то
+27	1	чреве
+28	1	чем
+29	1	фамилии
+30	1	узлов
+31	1	уже
+32	1	у
+33	1	то
+34	1	так
+35	1	существенна
+36	1	струилась
+37	1	стремительности
+38	1	сторону
+39	1	старому
+40	1	случились
+41	1	случалось
+42	1	слишком
+43	1	славному
+44	1	скорее
+45	1	скользила
+46	1	сказать
+47	1	сибирских
+48	1	себя
+49	1	своей
+50	1	самых
+51	1	самого
+52	1	русскому
+53	1	рук
+54	1	руд
+55	1	роковом
+56	1	роду
+57	1	родителями
+58	1	родителям
+59	1	ровном
+60	1	протекала
+61	1	провисала
+62	1	приходилось
+63	1	принадлежность
+64	1	предка
+65	1	правда
+66	1	потрясений
+67	1	потомком
+68	1	пор
+69	1	помнил
+70	1	перемещения
+71	1	пальцев
+72	1	отношение
+73	1	особых
+74	1	основном
+75	1	определять
+76	1	однофамильцем
+77	1	обрывов
+78	1	ним
+79	1	несильном
+80	1	неприятные
+81	1	необходимости
+82	1	немного
+83	1	находилась
+84	1	натяжении
+85	1	младенчестве
+86	1	мерно
+87	1	меж
+88	1	лишь
+89	1	кой-какие
+90	1	когда
+91	1	как
+92	1	их
+93	1	или
+94	1	излишней
+95	1	зачат
+96	1	замечательного
+97	1	давние
+98	1	годы
+99	1	году
+100	1	говоря
+101	1	глубину
+102	1	вспоминать
+103	1	временами
+104	1	возникало
+105	1	вернее
+106	1	божественных
+107	1	без
+108	1	Собственно
+109	1	Он
+110	1	Одоевцевых
+111	1	Одоевцева
+112	1	Образно
+113	1	Если
+114	1	Без
+115	1	А
+116	1	-
+117	1	)
+118	1	(