hi here's my homework #21

Open
wants to merge 58 commits into
base: master
58 commits
baeb218
quiz
wadimiusz Oct 29, 2018
ad11259
Update quiz-01-response.md
wadimiusz Nov 2, 2018
b6a070f
Update quiz-01-response.md
wadimiusz Nov 2, 2018
b13934d
Create segmentation.md
wadimiusz Nov 2, 2018
f5e56d3
maxmatch added
wadimiusz Nov 2, 2018
f61e6d3
Merge branch 'master' of https://github.com/wadimiusz/ftyers.github.io
wadimiusz Nov 2, 2018
e02e0af
Create readme.md
wadimiusz Nov 2, 2018
f9a4f9a
Create requirements.txt
wadimiusz Nov 2, 2018
bff950b
typo
wadimiusz Nov 2, 2018
79be3ae
my fingers type words
wadimiusz Nov 3, 2018
55f98de
FINGERS
wadimiusz Nov 3, 2018
b6fe6d6
1"
wadimiusz Nov 3, 2018
1b6b240
Update maxmatch.py
wadimiusz Nov 4, 2018
484000c
hi there
wadimiusz Nov 21, 2018
eed526f
Merge branch 'master' of https://github.com/wadimiusz/ftyers.github.io
wadimiusz Nov 21, 2018
65830c8
banana
wadimiusz Nov 21, 2018
69ddf6d
betonomeshalka
wadimiusz Nov 21, 2018
ee798a4
why do those even have names
wadimiusz Nov 21, 2018
643aad4
report
wadimiusz Nov 21, 2018
f72d5f9
meshaet beton
wadimiusz Nov 21, 2018
d7ba82a
This is a commit message, you know?
wadimiusz Nov 21, 2018
11ad88d
Behold!
wadimiusz Nov 21, 2018
5f53659
meow
wadimiusz Nov 21, 2018
b0d0e4b
1
wadimiusz Nov 25, 2018
dc70f37
meow
wadimiusz Nov 25, 2018
379e335
commit message i guess
wadimiusz Nov 25, 2018
334bcb5
hi
wadimiusz Nov 26, 2018
76037ae
Чё?
wadimiusz Nov 26, 2018
abc00de
quiz
wadimiusz Nov 26, 2018
721fbcc
quiz
wadimiusz Nov 26, 2018
15caaf8
quiz
wadimiusz Nov 26, 2018
956ec1d
quiz
wadimiusz Nov 26, 2018
65cf73d
quiz
wadimiusz Nov 26, 2018
ecc22a9
commit
wadimiusz Dec 9, 2018
fabbe78
formatting
wadimiusz Dec 9, 2018
33b4f92
report modified
wadimiusz Dec 11, 2018
dca6b6c
report modified
wadimiusz Dec 11, 2018
0bad72c
report modified
wadimiusz Dec 11, 2018
5d4e025
Update report.md
wadimiusz Dec 11, 2018
f6ac1a7
report modified
wadimiusz Dec 11, 2018
4cba2b9
a cat on a mat
wadimiusz Dec 11, 2018
0151694
qiuz
wadimiusz Dec 12, 2018
d4f4d6e
quiz
wadimiusz Dec 12, 2018
688edf5
quiz
wadimiusz Dec 12, 2018
7682fd1
quiz
wadimiusz Dec 12, 2018
365c619
report
wadimiusz Dec 13, 2018
9bcbbfe
rules and report
wadimiusz Dec 13, 2018
3d3194e
tagger
wadimiusz Dec 13, 2018
f70d38e
Update report.md
wadimiusz Dec 13, 2018
cb099d4
empty commit
wadimiusz Mar 24, 2019
0ef4507
another empty commit
wadimiusz Mar 24, 2019
645bcca
пюпю
wadimiusz Mar 30, 2019
382528e
пюпю
wadimiusz Mar 30, 2019
f411008
Rename segmentation.md to segmentation-response.md
wadimiusz Mar 30, 2019
031c52d
Rename disambiguation-responce.md to disambiguation-response.md
wadimiusz Apr 2, 2019
81d2d41
хреннер
wadimiusz Apr 2, 2019
9f4fbb3
Merge branch 'master' of https://github.com/wadimiusz/ftyers.github.io
wadimiusz Apr 2, 2019
34200e8
key
wadimiusz Apr 3, 2019
88 changes: 88 additions & 0 deletions 2018-komp-ling/key.txt
@@ -0,0 +1,88 @@
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQGNBFySoZABDADOrgc5MWLHtd34Zw4bomxtzVxhChHI1P03ooNDioZiNVMDWMc/
me2FnDLmSDaXpqdgyx7KTvj6a7xeJtBjo5a4X6lrxL8WXp891DNrBe9KLMFq19Py
8Pem6eZWXhX9Brk5s1EVAEeOQJVpotU99PajdLIzxWf5V8Un1iP61O4ctqk5D5pb
m9wNccqox4dJE6l1OBrec8LYszL4J1nw6fWMhoF2WbMkZwIMzip1NfmIdPjbV33t
o96Ne4+bzwEYWjn7eBrD0bxpf1fIy1//iaHlhvaCQrvtDcjhsf1Tjut4YXUFYNa+
P8RQ1WQsyrThPJMRj2V+8t98ndkpi/wzXzueM4w6JESFJ/AI8Pf3HAmOXpGypvKH
Kw8U1VcusZm+w6DxDabeZ51J6+toT9YFDaFTn+ayPfPdYtnneb49yGWSGkwBL4nV
FXMtWRK2C+ARppX+swTF0IFCYxCOvGCvK+5j0/dVxBb/uCc2U7vQQDa9L8PJKqAA
myM6ZEbfbrJvX8UAEQEAAbQ1VmFzc2UgVG9uZXRpY2UgPGhmcWFtOSs0em0zeWxl
MjI4ZWRvQHNoYXJrbGFzZXJzLmNvbT6JAdQEEwEKAD4WIQSc+x0c72DkGxLxCaxY
puM/66kr2gUCXJKhkAIbAwUJA8JnAAULCQgHAgYVCgkICwIEFgIDAQIeAQIXgAAK
CRBYpuM/66kr2ipeDADNpXsisbP/Mt487lIpFhINbKCwSJzJTzvA9EM3ByrfNHhi
kOgUZLeHpSjZ9j8sVynKcLiNNX6BteUQZZvOxeSVw1GDnYs5AQkV9FYy69aRG4An
iJUEsWbXLcl39C/bBaa3x09SckyXSeBWwyPE0RP5K3u0mOIDbcmtf9ihJXEE+RB3
cdXlgklJUwlm+vaQC7f8kEwy7C+BYc7hkVMFr6dCwkLNUO76x3upIFYpS4qIgW/G
McsGPedJGscOfUKa/tGd0qIhWw/M5mvf3a1GbytbZ/LT/LDgW7cygO4Gk0EqEJK5
de75LC2bzVKuGWyUF0V+IFZ36q92IBXE/om/4eJlHb7GtxD+u/E89LkfMyrcjKAO
gmPi7Ax1YBURXk6C6Gc1Sa9AtvMsZT4Hyz/VYS6m/QK4oVyQXOfQ2e9KuaVQfPK9
wKs/mago8cZ7xiHx1PqWEJ/in5IoLwoawDvefGt3lo5pMYMifkWfUq8umYBwLMfk
jOpkXNw20CEI72AhjSe5AY0EXJKhkAEMAMeuNMLRja+f5viVNK9FolhQGr7W5DA2
OgW1gVuiKkUwH1s1PG58Jg31dphL1s5e6xnV3WlgpP+EvBUAGgRkf/fGSvaajcW9
bsJexDHujbmUdhyEAQuO/673GF86e6kJybSCULQpyr54tj+zD0gyrAmz3iIcjDhg
oKWEWwVG/rEOcr+pKTf2ovBW11JY0FvE6+aCq9fERaarC4+QU93Gccq4V/z99F3y
GtF9ia7Sa8Sc+smWoMXhwCEADB1ylF86fVkIWM19pUCDKjpWn+yT7zk2Csl13MK4
cZr59G8L5/p4xV3Gh2ZIbdUI6zl5b5J+fEWC8DVY0nwgDfshzjpoYKjZL7oL8n1F
rvKAuVSKdnbl+gazt0diXpAL+kSwNyl856QbC+mXuhJz1OfnFUMETc0+CIHg1elc
6cjbUa/R/NeJ3XnfTUMXNPdRW8r1RdjiDoqqdW0LU5sAOYRlxLOuHBd+i7YiBe+6
Q3GblzDoDWexbnPNCEywWqHKRiqhZUM1nwARAQABiQG8BBgBCgAmFiEEnPsdHO9g
5BsS8QmsWKbjP+upK9oFAlySoZACGwwFCQPCZwAACgkQWKbjP+upK9owtwv/XmBQ
GwIw9gQzGTikIbj1GFOu/5+X0mq7p8MOf2aTTrzT/sG7xwcvu4wHv6AbNyWtQ/zB
pUO9U5HreoOq0u8umGNHPbOzjDKysBAKcWxRb4BgK/lTQMpoRVisDEdXSJl3hRdU
/Iay1E7Ytc/aRNbe7W04LnWij3n0rDR/R7US+yZ3WfmRJWf1wveomIRuzHuDp1zm
P3X2O/SnelNNmh8jzVV37G5pWWUiROuWiLdFcg/yp/82LMJSUpP0XogzhJPp1/2t
Og7N4yhEV8KnYrOt7WXxbqjJq9YkOroa3MzVFuv9RhAtFaWXgpiu6GOzFhYNY9kF
Gc7AOg1Rm7TF8V21wM669IIPAMfTcdnVyNjrq8jwzaGTWuLe8FfupB49XuZayb9R
PnVJCpP5vK19xuBBqe7Qwl/MzlK8aCdtCK58CJSDmLkAmiSneHlzrNWDWGkQjfF9
VnMnFjoYD9/LLWMXKz9qVEqwEH3HCESePyB400O5ySw+GdKx2hjjIa29FSvHmQIN
BFylBdkBEACv2dNl6apJw49cY1/Dh5W7UCIbtirixxxoJP8EQuVydFqXmducFd8E
u1AqL/Lqs6/JxmNNP9muGT/IWulEXpFhTsJ4ZhSc6LPBP6rT2vAZvNwlkovqJ5u5
mHO9TOqlHjyPrJ2zRc/KYOPrXhvVOxBddedq5864D+0TDcWV64iSZ/I1QT4UhedS
SObuZkFNSJ9KA6hFhmD9AmSgmQYkzH5fpdsatIYvU6PDzETKLMdADvDH5ae5TzJV
Vu4gMiZF6/psE9SEgSZLpwqfZaEWSxQ45kGRd/tjpapBwdIDnTAckLz9u5YYbSW5
6evz1TcIWaAdIJdI4Nh+84yOo1/iTKNRPea+tLLbJv6EQ1uACoXZ1KVDXZtvEjNB
FQ/xDT9E+ijzTI/Utfyu3jZGzSvSPqrig0KaPf+ZqTpReHQt+dtlQLE3v9XwFHN4
F2XHjz6LXG07q4ODM03iXz2HdLpu2QTALHFJzb8vEZ4qNcWaZzO1DQcvgyxaivww
ynkAY4rE9wANIDrtbW0U68r7X+NgSYUCegIUyRe5q6ZnGHzM2Zsfq73VKO+Dswx7
aS/pPfTMDxXLPH/vPH/RiJm7ZBEsdd6yACN0NvYMTDxEEfGMSavOuwrmgUa/b93n
qazKxR7CogwME95nRV6rcCmP1u6cED1Xvu4N+7e0A4mTf8TVCcJ2KQARAQABtCFW
YWRpbSBGb21pbiA8d2FkaW1pdXN6QGdtYWlsLmNvbT6JAk4EEwEKADgWIQRhBtgY
5+gYGUiJvFd5CQJkpiHwjAUCXKUF2QIbAwULCQgHAgYVCgkICwIEFgIDAQIeAQIX
gAAKCRB5CQJkpiHwjBn4D/9wY0pz+4yp6c8fmxXHwNUz5iDq0gH3lK27TPOcXiXr
cr6H/Vf3inDcSZ1n3NPcBZiXDBkp5G7DmuJf+RWA5qtRXZobXMoIduYeL+JgVWzZ
+vTUSipN4RYTQmOaRhMT2UVlX7ABWP2RzJBUcgaoDOTsT3wcuMBDoErmZzVxe+QV
ZbUPZNSQTcUknzab3yPNUnMW4491WPodaCx0EgJvXYQB7qRYDakRPwbIotfPKIh1
S1zvSEtuTRORV39BwxvcpCEcC+hX1zIYsekVeZ+Om74nUVIPhJDyQ9YYX75lW4so
9H27YgMYqHKuxRem81MaLUcmT49bDSJ2h84r6tRY1bJ+a7Af1Rmf9faGDJMSPaVn
j8ATcZFnPc9NMRlCcgp0mRFTPRoPLsx5XL5/+tmzUY+LQUa6VJQIWczhXh329sTt
nfzgBpIztvu5X+fQfKGnUgPZcR+XvhIPKbPjoP+KoAwuc/mYit3Iz6tQ9hGfm/X4
SoK6kBaZ86QQ2D7Gss4fg1HBp5zCy7cY7p2s013ruldhpv2+oOqlpyQq91zJ4AF+
Wzbcs2hM10hCpqoDCQIzIh9G4mg7x0TCcUZKz597Y0SCwg9nj73YBbw704tnz3KR
/I8ht8JGSlTI6soK9Fko3ftM6FIxRpETyhl+MD41XLUG2w3/gCG3XpZ685WOHKKm
DrkCDQRcpQXZARAAvqCbNYEoSQbIeyz//2CiuZcZezAJtxEWDeQAWG4Qx7Q+uq+F
NdBaJ4Dt4ub7XZ80XGc6wTeLF9LVh6Ek5/essNvyAYQpO6p+XebTcS2VEVwRU8ba
Yf5dqIqYW8RYPbnxXPsPa2LtONReB+uzeB/IEXBULUpk01+8nlCsOvfzV2SB3xme
7SiuaixJnn8TRGgSkV1cnkHkgwPVF9hMPcBhth4gh5SaDtL/WpiW9mQ6gbXByJcO
Z5/qTWipL9tZKEHV+QNmHPrDJII4bo3AYBQ31ZWl2qI6/br73elVPzJI/Ii+XeRs
pGvbpTObdXZGHmRmPmUe7hSrVGEkWeOFFkrnfPVIjhVsKvBFq/1E8CynRBm8TlrP
Iay5kcG3IfgYCOgopS3POeoO+ULDLNmeADcN7VJvpY8TuoykUqZy0BylZq4fCKEx
Cn4+mxpktL7B6USyIqohzqfgcpkKFvifFaQlfY3+2fPdIFvfBjMise0pUvJLFOWr
qu4C7u/JXvezdIXU6p1Xxjuu8QsApQ+m7asY1lVH9AtFspHZEkvQ3llFlbRUzeRb
BHkNarYwUKg+IsIZ/dL0G5gJQXHiKF/puW188IT4S9rDF8+tMnAKdNL6R8ksF4Ca
OMjHPJ9qFzrS/ItzPTlgFqntMjuzh41SDIZkw2eteJj32BR71x/vDkP1w6sAEQEA
AYkCNgQYAQoAIBYhBGEG2Bjn6BgZSIm8V3kJAmSmIfCMBQJcpQXZAhsMAAoJEHkJ
AmSmIfCMBjwP/AvWGtp/abGP1f3FrtZY6d+U1I88RYfng0Ry05L6lTGjTVkmw1la
QvKsv8PUsJTYej24Yv3XCFdxja44NeAVj1SNM8xXSF8w9HTr6FADG/6FxqCcMU/1
w3xHeXUkiBns/TRs7mh7UUlyYnALAXA1MZnqdnkr/ORigVAex0iuqhzsyNoZEK1Y
JnCajmCvIGgu7RIy52g73IEWJf5Q+SgM9kBYIpNdL9/r2wSa0+nsMHe670N+085h
LRtGO+chMknWdnsNT8/Z3rh4sOOd1mhcPECCQaig0TgTxtgRz/g9lGt5r90G2oKG
5jk0ldzcvPtYLu3lN/lTqUnWfuRjvLcE+3vILVG8ZOMumwOiajmUZAPFyz5dzYuY
mnFtaFPWu8cNljlsWr60GktWprF6iSNAYqUkZ1gL1kTCg2bxxXHLmoxR75U1ftXU
PLNMoF0xV8t6SuXmIMCakZg3JqPKKbABcryOiYoYE3LIZZbQK7DaoLbEgOyM2UXT
rrAekgW+az/EHWYHgUVqF86UKF/s8BFP7mZCbmS6rdUwL3484K4DaziDQCiPUO01
yppj4awGjMfwgRIqUl6zUM7WkOOeoGMkbmooM5YS1v0Suvb87KKRdpYDPQVgrFiO
AZcSk/AKKGDuakdwmEdVGVRgDG4SovFaa4rxOnokmZsYm+YgszqDjCd6
=RwsH
-----END PGP PUBLIC KEY BLOCK-----
96 changes: 96 additions & 0 deletions 2018-komp-ling/practicals/disambiguation-response.md
@@ -0,0 +1,96 @@
# Tagger comparison

UDPipe trained on the Finnish corpus performed as follows:

```
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 100.00 | 100.00 | 100.00 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 100.00 | 100.00 | 100.00 |
UPOS | 94.64 | 94.64 | 94.64 | 94.64
XPOS | 95.81 | 95.81 | 95.81 | 95.81
Feats | 90.77 | 90.77 | 90.77 | 90.77
AllTags | 89.75 | 89.75 | 89.75 | 89.75
Lemmas | 84.52 | 84.52 | 84.52 | 84.52
UAS | 100.00 | 100.00 | 100.00 | 100.00
LAS | 100.00 | 100.00 | 100.00 | 100.00
```

I wrote a small program that trains an NLTK tagger: a bigram tagger that falls back to a unigram tagger, which in turn falls back to assigning the most frequent tag to every word. The program is named train_nltk_tagger.py and is located in the disambiguation folder. It performed as follows:

```
We have trained a bigram tagger that falls back to simpler taggers in case of emergency.
It has performed with the accuracy of 0.9360702420503085
```
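
The backoff idea described above can be illustrated with a minimal pure-Python sketch. This is not the code of train_nltk_tagger.py and not NLTK's implementation; the function names and the toy training data are assumptions made for illustration only. The bigram context is taken to be the previous tag plus the current word.

```python
from collections import Counter, defaultdict

def train_backoff_tagger(tagged_sents):
    """Collect counts for a bigram -> unigram -> default backoff chain."""
    tag_counts = Counter()
    unigram = defaultdict(Counter)   # word -> tag counts
    bigram = defaultdict(Counter)    # (previous tag, word) -> tag counts
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            tag_counts[tag] += 1
            unigram[word][tag] += 1
            bigram[(prev, word)][tag] += 1
            prev = tag
    default = tag_counts.most_common(1)[0][0]   # most frequent tag overall
    return bigram, unigram, default

def tag_sentence(words, model):
    """Tag left to right, backing off when a context was never seen."""
    bigram, unigram, default = model
    prev, out = "<s>", []
    for word in words:
        if (prev, word) in bigram:
            tag = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            tag = unigram[word].most_common(1)[0][0]
        else:
            tag = default
        out.append((word, tag))
        prev = tag
    return out

def accuracy(model, gold_sents):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, pred), (_, gold) in zip(tag_sentence(words, model), sent):
            correct += pred == gold
            total += 1
    return correct / total

# Toy training data (hypothetical, not from the Finnish corpus).
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]
model = train_backoff_tagger(train)
print(tag_sentence(["the", "cat", "barks", "loudly"], model))
```

In this toy run the unknown word "loudly" ends up with the default tag, while "cats" at the start of a sentence would be tagged by the unigram level because that bigram context never occurred in training.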

As I understand from the docstrings of the evaluation script, AligndAcc is simply the accuracy computed over aligned words, so it is comparable with the accuracy reported by NLTK.
So the results of 94.64% (UDPipe, UPOS) and 93.61% (NLTK) are quite close.
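
Why the four columns coincide can be sketched as follows, assuming the evaluation script computes precision, recall, F1 and aligned accuracy from a count of correct items and the gold, system and aligned totals. The function name and the counts below are hypothetical and are chosen only to reproduce the percentages quoted above.

```python
def scores(correct, gold_total, system_total, aligned_total):
    """Return (precision, recall, F1, aligned accuracy) as fractions."""
    precision = correct / system_total
    recall = correct / gold_total
    f1 = 2 * correct / (system_total + gold_total)
    aligned_acc = correct / aligned_total
    return precision, recall, f1, aligned_acc

# With perfect tokenisation every gold word is aligned to a system word,
# so all three totals are equal and the four scores coincide.
p, r, f1, acc = scores(correct=9464, gold_total=10000,
                       system_total=10000, aligned_total=10000)

# Gap between the two taggers, in percentage points.
nltk_accuracy = 0.9360702420503085
gap = round(94.64 - round(nltk_accuracy * 100, 2), 2)
print(p, r, f1, acc, gap)
```

Under these assumptions the difference between the two taggers is about one percentage point.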

# Constraint grammar

I have added the following rules:

```
REMOVE DET IF (1 VERB) (1 (VerbForm=Part)) ;
REMOVE (Case=Acc) if (0 (PROPN)) (1 (PROPN)) (1 (Case=Gen)) ;
```
The first rule aims to prevent the word все ("all") before participles from being interpreted as a determiner. The second rule says, "Hey, if we have two consecutive proper nouns, they are probably parts of the same name and must have the same case, so if you're sure that the second is genitive, the first can't be accusative."

Here is how it works on our data:

```
$ echo "Однако стиль работы Семена Еремеевича заключался в том, чтобы принимать всех желающих и лично вникать в дело." | python3 ud-scripts/conllu-analyser.py ru-analyser.tsv | vislcg3 -t -g rus.cg3 | grep REMOVE
; "Семен" PROPN Animacy=Anim Case=Acc Gender=Masc Number=Sing REMOVE:11
; "тот" DET Case=Loc Gender=Neut Number=Sing REMOVE:9
; "тот" DET Case=Loc Gender=Masc Number=Sing REMOVE:9
; "весь" DET Case=Loc Number=Plur REMOVE:10
; "весь" DET Case=Acc Number=Plur REMOVE:10
; "весь" DET Case=Gen Number=Plur REMOVE:10
```
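
What a REMOVE rule does to a cohort can be sketched in Python. This is a simplified model, not how vislcg3 is implemented: readings are plain sets of tags, each context test is satisfied if any reading at the given offset carries the tag, and a rule is skipped when it would delete the last remaining reading. The readings in the example are hypothetical.

```python
def remove_rule(cohorts, target, contexts):
    """Apply 'REMOVE target IF contexts' to every cohort.

    cohorts:  list of cohorts, each a list of readings (sets of tags)
    target:   tag identifying the readings to remove
    contexts: list of (offset, tag) tests relative to the current cohort
    """
    result = []
    for i, cohort in enumerate(cohorts):
        def holds(offset, tag, i=i):
            # A test matches if any reading at position i + offset has the tag.
            j = i + offset
            return 0 <= j < len(cohorts) and any(tag in r for r in cohorts[j])

        kept = cohort
        if all(holds(offset, tag) for offset, tag in contexts):
            kept = [r for r in cohort if target not in r]
            if not kept:          # never remove the last reading
                kept = cohort
        result.append(kept)
    return result

cohorts = [
    [{"DET", "Case=Acc"}, {"PRON", "Case=Acc"}],   # всех (hypothetical readings)
    [{"VERB", "VerbForm=Part"}],                   # желающих (hypothetical reading)
]
# Mirrors: REMOVE DET IF (1 VERB) (1 (VerbForm=Part)) ;
pruned = remove_rule(cohorts, "DET", [(1, "VERB"), (1, "VerbForm=Part")])
print(pruned[0])
```

Only the non-DET reading of the first cohort survives; the second cohort is left untouched because there is no word to its right.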


# Improve perceptron tagger

Using UD_Portugese.git didn't work (it doesn't seem to be a valid address), so I used UD_Portugese-BSD.git instead.

The result with the standard features was as follows:

```
Metrics | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 100.00 | 100.00 | 100.00 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 100.00 | 100.00 | 100.00 |
UPOS | 96.35 | 96.35 | 96.35 | 96.35
XPOS | 100.00 | 100.00 | 100.00 | 100.00
Feats | 100.00 | 100.00 | 100.00 | 100.00
AllTags | 96.35 | 96.35 | 96.35 | 96.35
Lemmas | 100.00 | 100.00 | 100.00 | 100.00
UAS | 100.00 | 100.00 | 100.00 | 100.00
LAS | 100.00 | 100.00 | 100.00 | 100.00
```

I tried adding several features, such as the prefix of the word at position i+2, properties of the string (numeric, letter case, ...) and word length. Here is the full list of the added features:
```
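add('i prefix', word[:3])
add("i-1 pref1", context[i-1][-3:])  # labelled pref1, but the slice is the 3-character suffix
add("i-1 prefix", context[i-1][:3])
add("i+1 pref1", context[i+1][-3:])  # labelled pref1, but the slice is the 3-character suffix
add("i+1 prefix", context[i+1][:3])
add("i-2 pref1", context[i-2][-3:])  # labelled pref1, but the slice is the 3-character suffix
add("i+1 prefix", context[i+2][:3])  # labelled i+1, but reads the word at i+2
add("i+2 suffix", context[i+2][-3:])
add("i+2 pref1", context[i+2][0])
add("i+2 prefix", context[i+2][:3])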

add('upper register', str(int(word.isupper())))
add('lower register', str(int(word.islower())))
add('capitalized', str(int(word[0].isupper())))
add('numeric', str(int(word.isnumeric())))
add('alpha', str(int(word.isalpha())))

add("length", str(len(word)))
```

These features did not seem to have any effect on performance. It seems that nearly everything useful for disambiguation is already captured by the standard features.
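
How such `add(...)` calls become feature keys can be sketched as follows. The sketch assumes that the helper joins the feature name and its value into a single string key and that the sentence is padded with two start and two end markers, which is how averaged-perceptron taggers are commonly written. The `extract` function, the padding tokens and the example sentence are assumptions for illustration, not the tagger's actual code.

```python
from collections import defaultdict

# Assumed padding tokens; they keep positions i-2 .. i+2 inside the list.
START, END = ["-START-", "-START2-"], ["-END-", "-END2-"]

def extract(words, index):
    """Return feature keys for words[index] using a few of the listed features."""
    context = START + words + END
    i = index + len(START)
    word = context[i]
    features = defaultdict(int)

    def add(name, *args):
        # Assumed helper: the key is the feature name joined with its value.
        features[" ".join((name,) + tuple(args))] += 1

    add("i prefix", word[:3])
    add("i+2 prefix", context[i + 2][:3])
    add("capitalized", str(int(word[0].isupper())))
    add("numeric", str(int(word.isnumeric())))
    add("length", str(len(word)))
    return dict(features)

feats = extract(["Lisboa", "é", "uma", "cidade"], 0)
print(sorted(feats))
```

Because the key is built from the name and the value, two calls that use the same name write into the same key space, so distinct features need distinct names to be kept apart by the model.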
11 changes: 11 additions & 0 deletions 2018-komp-ling/practicals/disambiguation/.idea/disambiguation.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions 2018-komp-ling/practicals/disambiguation/.idea/encodings.xml

4 changes: 4 additions & 0 deletions 2018-komp-ling/practicals/disambiguation/.idea/misc.xml

8 changes: 8 additions & 0 deletions 2018-komp-ling/practicals/disambiguation/.idea/modules.xml

12 changes: 12 additions & 0 deletions 2018-komp-ling/practicals/disambiguation/.idea/vcs.xml

162 changes: 162 additions & 0 deletions 2018-komp-ling/practicals/disambiguation/.idea/workspace.xml
