-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy patherror-codes.html
More file actions
1113 lines (974 loc) · 43.5 KB
/
error-codes.html
File metadata and controls
1113 lines (974 loc) · 43.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Error Codes</title>
<!-- 2015-12-06 So 22:36 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="generator" content="Org-mode" />
<meta name="author" content="Maik Schünemann" />
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
.title { text-align: center; }
.todo { font-family: monospace; color: red; }
.done { font-family: monospace; color: green; }
.priority { font-family: monospace; color: orange; }
.tag { background-color: #eee; font-family: monospace;
padding: 2px; font-size: 80%; font-weight: normal; }
.timestamp { color: #bebebe; }
.timestamp-kwd { color: #5f9ea0; }
.right { margin-left: auto; margin-right: 0px; text-align: right; }
.left { margin-left: 0px; margin-right: auto; text-align: left; }
.center { margin-left: auto; margin-right: auto; text-align: center; }
.underline { text-decoration: underline; }
#postamble p, #preamble p { font-size: 90%; margin: .2em; }
p.verse { margin-left: 3%; }
pre {
border: 1px solid #ccc;
box-shadow: 3px 3px 3px #eee;
padding: 8pt;
font-family: monospace;
overflow: auto;
margin: 1.2em;
}
pre.src {
position: relative;
overflow: visible;
padding-top: 1.2em;
}
pre.src:before {
display: none;
position: absolute;
background-color: white;
top: -10px;
right: 10px;
padding: 3px;
border: 1px solid black;
}
pre.src:hover:before { display: inline;}
pre.src-sh:before { content: 'sh'; }
pre.src-bash:before { content: 'sh'; }
pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
pre.src-R:before { content: 'R'; }
pre.src-perl:before { content: 'Perl'; }
pre.src-java:before { content: 'Java'; }
pre.src-sql:before { content: 'SQL'; }
table { border-collapse:collapse; }
caption.t-above { caption-side: top; }
caption.t-bottom { caption-side: bottom; }
td, th { vertical-align:top; }
th.right { text-align: center; }
th.left { text-align: center; }
th.center { text-align: center; }
td.right { text-align: right; }
td.left { text-align: left; }
td.center { text-align: center; }
dt { font-weight: bold; }
.footpara:nth-child(2) { display: inline; }
.footpara { display: block; }
.footdef { margin-bottom: 1em; }
.figure { padding: 1em; }
.figure p { text-align: center; }
.inlinetask {
padding: 10px;
border: 2px solid gray;
margin: 10px;
background: #ffffcc;
}
#org-div-home-and-up
{ text-align: right; font-size: 70%; white-space: nowrap; }
textarea { overflow-x: auto; }
.linenr { font-size: smaller }
.code-highlighted { background-color: #ffff00; }
.org-info-js_info-navigation { border-style: none; }
#org-info-js_console-label
{ font-size: 10px; font-weight: bold; white-space: nowrap; }
.org-info-js_search-highlight
{ background-color: #ffff00; color: #000000; font-weight: bold; }
/*]]>*/-->
</style>
<script type="text/javascript">
/*
@licstart The following is the entire license notice for the
JavaScript code in this tag.
Copyright (C) 2012-2013 Free Software Foundation, Inc.
The JavaScript code in this tag is free software: you can
redistribute it and/or modify it under the terms of the GNU
General Public License (GNU GPL) as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version. The code is distributed WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU GPL for more details.
As additional permission under GNU GPL version 3 section 7, you
may distribute non-source (e.g., minimized or compacted) forms of
that code without the copy of the GNU GPL normally required by
section 4, provided you include this license notice and a URL
through which recipients can access the Corresponding Source.
@licend The above is the entire license notice
for the JavaScript code in this tag.
*/
<!--/*--><![CDATA[/*><!--*/
function CodeHighlightOn(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.cacheClassElem = elem.className;
elem.cacheClassTarget = target.className;
target.className = "code-highlighted";
elem.className = "code-highlighted";
}
}
function CodeHighlightOff(elem, id)
{
var target = document.getElementById(id);
if(elem.cacheClassElem)
elem.className = elem.cacheClassElem;
if(elem.cacheClassTarget)
target.className = elem.cacheClassTarget;
}
/*]]>*///-->
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
displayAlign: "center",
displayIndent: "0em",
"HTML-CSS": { scale: 100,
linebreaks: { automatic: "false" },
webFont: "TeX"
},
SVG: {scale: 100,
linebreaks: { automatic: "false" },
font: "TeX"},
NativeMML: {scale: 100},
TeX: { equationNumbers: {autoNumber: "AMS"},
MultLineWidth: "85%",
TagSide: "right",
TagIndent: ".8em"
}
});
</script>
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
</head>
<body>
<div id="content">
<h1 class="title">Error Codes</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#sec-1">1. Introduction</a></li>
<li><a href="#sec-2">2. Setting up</a></li>
<li><a href="#sec-3">3. Getting the edits between the texts</a>
<ul>
<li><a href="#sec-3-1">3.1. Calculating the edit distance</a></li>
<li><a href="#sec-3-2">3.2. Backtracking the edits from the levenshtein matrix</a></li>
<li><a href="#sec-3-3">3.3. Calculating the edits</a></li>
</ul>
</li>
<li><a href="#sec-4">4. Getting the error codes from the edit distance</a>
<ul>
<li><a href="#sec-4-1">4.1. Extracting basic error-codes</a></li>
<li><a href="#sec-4-2">4.2. Extracting many-to-one and one-to-many errors</a>
<ul>
<li><a href="#sec-4-2-1">4.2.1. Recognizing these errors</a></li>
</ul>
</li>
<li><a href="#sec-4-3">4.3. Setting up the extraction list</a></li>
</ul>
</li>
<li><a href="#sec-5">5. Testing the error-codes</a></li>
<li><a href="#sec-6">6. Helpers for deployment to ocr-visualizer</a></li>
<li><a href="#sec-7">7. Helpers for generating statistics</a></li>
<li><a href="#sec-8">8. Todos</a>
<ul>
<li><a href="#sec-8-1">8.1. <span class="todo nilTODO">TODO</span> Maybe we need context aware filtering to get rid of dirt errors at the beginning</a></li>
</ul>
</li>
</ul>
</div>
</div>
<hr />
<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1"><span class="section-number-2">1</span> Introduction</h2>
<div class="outline-text-2" id="text-1">
<p>
This program solves the problem of extracting semantic ocr-errors
between a two texts, one that is the ground-truth and the other is
the performed result from the ocr.
</p>
<p>
The ocr-error codes are given by the following matrix:
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
<col class="right" />
</colgroup>
<tbody>
<tr>
<td class="right"> </td>
<td class="right">1</td>
<td class="right">2</td>
<td class="right">3</td>
<td class="right">4</td>
<td class="right">5</td>
<td class="right">6</td>
<td class="right">7</td>
<td class="right">8</td>
<td class="right">9</td>
</tr>
<tr>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">7</td>
<td class="right">2</td>
<td class="right">5</td>
<td class="right">6</td>
</tr>
<tr>
<td class="right">2</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">7</td>
<td class="right">2</td>
<td class="right">5</td>
<td class="right"> </td>
</tr>
<tr>
<td class="right">3</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">7</td>
<td class="right">2</td>
<td class="right">5</td>
<td class="right"> </td>
</tr>
<tr>
<td class="right">4</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">7</td>
<td class="right">2</td>
<td class="right">5</td>
<td class="right">6</td>
</tr>
<tr>
<td class="right">5</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">1</td>
<td class="right">7</td>
<td class="right">2</td>
<td class="right">5</td>
<td class="right"> </td>
</tr>
<tr>
<td class="right">6</td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
</tr>
<tr>
<td class="right">7</td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
</tr>
<tr>
<td class="right">8</td>
<td class="right">4</td>
<td class="right">4</td>
<td class="right">4</td>
<td class="right">4</td>
<td class="right">4</td>
<td class="right">4</td>
<td class="right">4</td>
<td class="right"> </td>
<td class="right"> </td>
</tr>
<tr>
<td class="right">9</td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
<td class="right"> </td>
</tr>
</tbody>
</table>
<p>
The codes are the following:
</p>
<ul class="org-ul">
<li>1 letter</li>
<li>2 Punctuation</li>
<li>3 digit</li>
<li>4 special character</li>
<li>5 whitespace</li>
<li>6 part of character</li>
<li>7 multiple characters</li>
<li>8 missing character</li>
<li>9 Case error</li>
</ul>
<p>
The values in the matrix are the weights for the particular
ocr-error. The values shown here are examples, they can be changed
for particular situations.
</p>
<p>
The idea is to separate the semantic informations of the ocr-errors
from just the sum of error points that determines the quality of the
ocr.
</p>
<p>
The task of the program is to extract the correct error-codes and
their locations from theground-truth and ocr-results texts.
</p>
</div>
</div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2"><span class="section-number-2">2</span> Setting up</h2>
<div class="outline-text-2" id="text-2">
<div class="org-src-container">
<pre class="src src-clojure">(ns error-codes.core
(:require [clojure.java.io :refer [file]])
(:use [clojure.core.matrix]))
(set! *warn-on-reflection* true)
</pre>
</div>
<pre class="example">
true
</pre>
<div class="org-src-container">
<pre class="src src-clojure">(ns error-codes.test-core
(:require [clojure.test :refer :all]
[error-codes.core :refer :all]
[clojure.core.matrix :refer :all]))
</pre>
</div>
</div>
</div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3"><span class="section-number-2">3</span> Getting the edits between the texts</h2>
<div class="outline-text-2" id="text-3">
</div><div id="outline-container-sec-3-1" class="outline-3">
<h3 id="sec-3-1"><span class="section-number-3">3.1</span> Calculating the edit distance</h3>
<div class="outline-text-3" id="text-3-1">
<p>
Before we can see what errors were performed by the ocr we need a
correspondence between the characters in the ground truths text and
the characters in the ocr-results text.
If the ocr would miss a character all further characters would also
be misaligned.
Therefore we need informations about the <i>changes</i> between
ground-truths and ocr-results when they are maximally aligned.
</p>
<p>
To get this, we can use the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">levenshtein distance</a>, which is defined
as the minimum number of single-character edits (i.e. insertions,
deletions or substitutions) required to change one word into the
other.
</p>
<p>
The levenshtein distance is computed by the standard dynamic
programming algorithm:
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn lev [str1 str2]
(let [mat (new-matrix :ndarray (inc (count str1)) (inc (count str2)))]
(mset! mat 0 0 0)
(dotimes [i (count str1)]
(mset! mat (inc i) 0 (inc i)))
(dotimes [j (count str2)]
(mset! mat 0 (inc j) (inc j)))
(dotimes [dj (count str2)]
(dotimes [di (count str1)]
(let [j (inc dj) i (inc di)]
(mset! mat i j
(cond
(= (.charAt ^String str1 di) (.charAt ^String str2 dj))
(mget mat di dj)
:else
(min (inc (mget mat di j)) (inc (mget mat i dj))
(inc (mget mat di dj))))))))
mat))
</pre>
</div>
<p>
This fills a 2-dimensional array which are filled with the
levenshtein distances for substrings of the texts.The full
levenshtein-distance is in the field [(count gt) (count or)] of the
created matrix.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn lev-distance [a b]
(let [m (lev a b)]
(mget m (count a) (count b))))
(deftest test-lev
(is (= 1 (lev-distance "" "a")))
(is (= 1 (lev-distance "a" "")))
(is (= 3 (lev-distance "kitten" "sitting"))))
(test-lev)
</pre>
</div>
</div>
</div>
<div id="outline-container-sec-3-2" class="outline-3">
<h3 id="sec-3-2"><span class="section-number-3">3.2</span> Backtracking the edits from the levenshtein matrix</h3>
<div class="outline-text-3" id="text-3-2">
<p>
We are not interested in the edit distance, but in the <b>edits</b> that
were performed to calculate it!
One possibility would be to store not the current distance in the
cells of the matrix but the edits so far in addition to the
distance. However, this approach has proven too slow (manipulating
maps instead of doubles while creating the levenshtein matrix).
It is however possible, to <i>trace</i> back the patch on wich the full
levenshtein distance was created from the created matrix. This is
described <a href="http://de.wikipedia.org/wiki/Levenshtein-Distanz">here</a>.
The order of branches in the backtrace function is important!
moving the check for substitutions to the front favours the way
with substitutions about other ways with insertions and deletions
which have the same error count
</p>
<p>
The edits between the files are represented as a map with the keys
</p>
<p>
:insertions, :deletions, and :substitutions which have a sequence
of pairs [a b] as values where a and b specify the position where
the edit was performed in the ground-truth and the orc-results
texts
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn backtrace [d i j acc]
(cond
(and (> i 0) (= (inc (mget d (dec i) j)) (mget d i j)))
(recur d (dec i) j (assoc acc :deletions (cons [(dec i) j] (:deletions acc))))
(and (> j 0) (= (inc (mget d i (dec j))) (mget d i j)))
(recur d i (dec j) (assoc acc :insertions (cons [i (dec j)] (:insertions acc))))
(and (> i 0) (> j 0) (= (inc (mget d (dec i) (dec j))) (mget d i j)))
(recur d (dec i) (dec j) (assoc acc :substitutions (cons [(dec i) (dec j)] (:substitutions acc))))
(and (> i 0) (> j 0) (= (mget d (dec i) (dec j)) (mget d i j)))
(recur d (dec i) (dec j) acc)
:else acc))
</pre>
</div>
</div>
</div>
<div id="outline-container-sec-3-3" class="outline-3">
<h3 id="sec-3-3"><span class="section-number-3">3.3</span> Calculating the edits</h3>
<div class="outline-text-3" id="text-3-3">
<p>
With the lev and backtrace function, we can define the edits
function which returns the edits map described above with the
minimal edits between the two texts
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn edits [a b]
(let [d (lev a b)]
(backtrace d (count a) (count b) {:insertions '() :deletions '()
:substitutions '()
:distance (mget d (count a) (count b))})))
</pre>
</div>
<p>
A few examples:
</p>
<div class="org-src-container">
<pre class="src src-clojure">(deftest test-edits
(is (= (edits "a" "b")
'{:insertions (), :deletions (), :substitutions ([0 0]), :distance 1}))
;;swapping two characters is not multiple substitutions but insertion and deletions
;;which is more in line with what humans see there.
(is (= (edits "ab" "ba")
'{:insertions ([0 0]), :deletions ([1 2]), :substitutions (), :distance 2}))
(is (= (edits "vr" "io")
'{:insertions (), :deletions (), :substitutions ([0 0] [1 1]), :distance 2}))
;;many to one errors are substitutions followed by insertions
(is (= (edits "m" "rn")
'{:insertions ([1 1]), :deletions (), :substitutions ([0 0]), :distance 2}))
;;one to many errors are substitutions followed by deletions
(is (= (edits "rn" "m")
'{:insertions (), :deletions ([1 1]), :substitutions ([0 0]), :distance 2}))
(is (= (edits "Kitten" "sitting")
'{:insertions ([6 6]), :deletions (), :substitutions ([0 0] [4 4]), :distance 3}))
(is (= (edits "Kitten" "sittieng")
'{:insertions ([4 4] [6 7]), :deletions (), :substitutions ([0 0]), :distance 3}))
(is (= (edits "Kitten" "iiittiing")
'{:insertions ([2 2] [5 6] [6 8]), :deletions (), :substitutions ([0 0] [4 5]), :distance 5}))
)
(test-edits)
</pre>
</div>
</div>
</div>
</div>
<div id="outline-container-sec-4" class="outline-2">
<h2 id="sec-4"><span class="section-number-2">4</span> Getting the error codes from the edit distance</h2>
<div class="outline-text-2" id="text-4">
<p>
With the edits in place, the problem of the proper text alignment is
solved. what is left to do is mapping the edits to the right error
codes. Some codes (like [1 1]) which is a simple substitution (see
table at the top) are trivial to extract from the edit distance.
But what about the one-to-many errors (codes [x 7]) or the
many-to-one errors (codes [7 x] or [x 6] the table is not
deterministic here)
</p>
<p>
Like test-edits showed above, one can recognise
many-to-one/one-to-many errors by substitutions which are followed
by insertions or deletions
</p>
<p>
Therefore, they have to be extracted before the substitutions,
deletions or insertions are extracted.
</p>
<p>
A good architecture for the extracting operation is therefore to
apply multiple extract <i>passes</i> to the edits to generate the
error-codes. The extract phases can be defined as functions which
get the current edits map and the two texts as arguments and returns
[new-edits extracted-error-codes].
The extract phases can simply be stored in a vector which is
traversed in the extraction phase
</p>
<div class="org-src-container">
<pre class="src src-clojure">(declare extraction-list)
(defn edits-to-error-codes
([edits t1 t2] (edits-to-error-codes edits t1 t2 extraction-list))
([edits t1 t2 extraction-list]
(let [{:keys [substitutions deletions insertions]} edits]
(-> (reduce (fn [[codes edits] f]
(let [[ncodes nedits] (f edits t1 t2)]
[(concat codes ncodes) nedits])) [[] edits] extraction-list)
first))))
(defn error-codes
[t1 t2]
(as-> (edits t1 t2) x
(edits-to-error-codes x t1 t2)
;;sort the codes so that the error-codes are ordered by position in text
(sort-by (comp second second) x)))
</pre>
</div>
<p>
For the extraction phases we also need a small function which maps a
character to its number in the matrix.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn to-code-number [^Character c]
(cond
(Character/isLetter c) 1
(#{\. \, \? \!} c) 2
(Character/isDigit c) 3
(= c \space) 5
:else 4))
</pre>
</div>
</div>
<div id="outline-container-sec-4-1" class="outline-3">
<h3 id="sec-4-1"><span class="section-number-3">4.1</span> Extracting basic error-codes</h3>
<div class="outline-text-3" id="text-4-1">
<p>
Extraction of the basic error codes for substitution, insertion and
deletions are easy as they are just the insertions, substitutions
and deletions in the edits map along with the right error
code. This extraction functions will be at the bottom of the
extraction-list
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn extract-substitution-errors [edits t1 t2]
(let [codes (for [[p1 p2] (:substitutions edits)]
(let [c1 (nth t1 p1) c2 (nth t2 p2)
[f1 f2] (map to-code-number [c1 c2])]
[[f1 f2] [p1 p2]]))]
[codes (assoc edits :substitutions [])]))
(defn extract-insertion-errors [edits t1 t2]
(let [codes (for [[p1 p2] (:insertions edits)]
(let [ c2 (nth t2 p2)
f2 (to-code-number c2)]
[[8 f2] [p1 p2]]))]
[codes (assoc edits :insertions [])]))
(defn extract-deletion-errors [edits t1 t2]
(let [codes (for [[p1 p2] (:deletions edits)]
(let [c1 (nth t1 p1)
f1 (to-code-number c1)]
[[f1 8] [p1 p2]]))]
[codes (assoc edits :deletions [])]))
</pre>
</div>
</div>
</div>
<div id="outline-container-sec-4-2" class="outline-3">
<h3 id="sec-4-2"><span class="section-number-3">4.2</span> Extracting many-to-one and one-to-many errors</h3>
<div class="outline-text-3" id="text-4-2">
<p>
Now comes the difficult path. Recognizing the semantic errors which
are not just basic edits.
</p>
</div>
<div id="outline-container-sec-4-2-1" class="outline-4">
<h4 id="sec-4-2-1"><span class="section-number-4">4.2.1</span> Recognizing these errors</h4>
<div class="outline-text-4" id="text-4-2-1">
<div class="org-src-container">
<pre class="src src-clojure">(edits "mann" "rnann")
</pre>
</div>
<div class="org-src-container">
<pre class="src src-clojure">(edits "rnann" "mann")
</pre>
</div>
<p>
We see that many-to-one/one-to-many errors can be recognized by
substitutions followed by insertions/deletions.
However we don't want to consume all following
insertions/deletions when there are to many. Consider this:
</p>
<div class="org-src-container">
<pre class="src src-clojure">(edits "Oma" "Ornrewölkra")
</pre>
</div>
<p>
If we would just extract all following insertions here this
wouldn't resemble well an <i>ocr-error</i>. Instead we only want to
consume here the first two insertions for the letters r and n.
The same is true for many-to-one errors.
</p>
<p>
To generalise this observation, we do a <i>bar-analysis</i>:
for one-to-many errors, OCR-engines do mistakes at the level of
splitting the image into characters. So the width of the right
character (example \m) and the recognised characters (example "rn"
or "iii") will be roughly the same. We can therefore count the
horizontal <i>bars</i> of the recognised characters and stop when their
bar-count matches the bar-count of the original character.
</p>
<p>
We map each character to its number of bars
(we are dealing with text written in <a href="http://en.wikipedia.org/wiki/Fraktur">Fraktur</a>):
</p>
<div class="org-src-container">
<pre class="src src-clojure">(def bar-map
{\A 2 \a 2 \B 2 \b 2 \C 2 \c 1 \D 2 \d 2 \E 2 \e 1 \F 2 \f 1 \G 3 \g 2
\H 2 \h 2 \I 2 \J 2 \i 1 \j 1 \K 2 \k 1 \L 2 \l 1 \M 4 \m 3 \N 3 \n 2
\O 2 \o 2 \P 3 \p 2 \Q 2 \q 2 \R 3 \r 1 \S 2 \s 2 \T 2 \t 1 \U 2 \u 2
\V 3 \v 2 \W 4 \w 3 \X 2 \x 1 \Y 3 \y 2 \Z 2 \z 1 \ü 2})
(defn lookup-bar [char]
(get bar-map char 1))
</pre>
</div>
<p>
It can also be that there are multiple one-to-many errors
following each other. In the edits, this is shown as
</p>
<div class="org-src-container">
<pre class="src src-clojure">(edits "Mammut" "Marniiiut")
</pre>
</div>
<div class="org-src-container">
<pre class="src src-clojure">(edits "Marniiiut" "Mammut")
</pre>
</div>
<p>
So we also have to check for multiple substitutions followed by
inertions/deletions and then map each character in the
substitutions to the following characters according to the
bar-analysis.
</p>
<p>
The handling of many-to-one errors is basically the same, the
differences are how to determine which insertions/deletion follows
the given and how to label the extracted codes
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn- extract-following-substitutions [substitutions]
(when (seq substitutions)
(reduce (fn [l [a b]]
(if (= (map inc (last (last l))) [a b])
;;add to current group
(update-in l [(dec (count l))] #(conj % [a b]))
;;make a new group
(conj l [[a b]])))
[[(first substitutions)]] (rest substitutions))))
(defn- following [[a b] insdel type]
(loop [[ia ib] [(inc a) (inc b)] acc []]
(if (some #{[ia ib]} insdel)
(recur (case type
:one-to-many [ia (inc ib)]
:many-to-one [(inc ia) ib]) (conj acc [ia ib]))
acc)))
(defn- add-following [following-substitutions insdel type]
(for [fs following-substitutions
:let [if (following (last fs) insdel type)]
:when (seq if)]
[fs (concat fs if)]))
(defn- take-bars [t1 t2 [a b] insdel type]
(let [res (let [bar-count (lookup-bar (case type
:one-to-many (nth t1 a)
:many-to-one (nth t2 b)))
_ (prn "bar-count " bar-count (nth t1 a))]
(loop [[[ia ib] & is] insdel acc 0 ret []]
(if ia
(let [bc (lookup-bar (case type
:one-to-many (nth t2 ib)
:many-to-one (nth t1 ia)))
_ (prn "bc " bc (nth t2 ib)
"acc " acc " ret " ret
(>= (+ acc bc) bar-count)
"ia ib " ia ib)]
(if (>= (+ acc bc) bar-count)
(conj ret [ia ib])
(recur is (+ acc bc) (conj ret [ia ib]))))
ret)))]
(prn "take-bars " res)
res))
;;;todo take care of how many are allowed to be extracted
;;;we know that count-free is > 0 at the beginning
;;;be careful at the start of the error
(defn- extract [t1 t2 fl type]
(for [[fs insdel] fl]
(let [_ (prn "fs " fs "insdel " insdel "count-free" (- (count insdel) (count fs)))
count-free (- (count insdel) (count fs))]
(loop [[[a b] & ss :as aktfs] fs extr [] insdel insdel
count-free (- (count insdel) (count fs))]
(if (and a (seq insdel))
(let [ext (take-bars t1 t2 [a b] (take (inc count-free) insdel) type)
_ (prn "ext-after-take-bars " [a b] ext "insdel-for-ext " insdel "count-free " count-free)]
(recur ;dont drop here
ss
(conj extr [[a b] ext])
(drop (count ext) insdel)
(- count-free (- (count ext) 1))))
extr)))))
(defn- delete-from-edits [edits to-delete]
(into {} (for [[k v] (dissoc edits :distance)] [k (remove (set to-delete) v)])))
(defn to-single-error [t1 t2 a b]
(cond
(= (count t1) a) (let [c2 (nth t2 b)
f2 (to-code-number c2)]
[[8 f2] [a b]])
(= (count t2) b) (let [c1 (nth t1 a)
f1 (to-code-number c1)]
[[f1 8] [a b]]);;deletion
:else [[(to-code-number (nth t1 a))
(to-code-number (nth t2 b))]
[a b]]))
(defn extract-count-changing-errors [type edits t1 t2]
(let [{:keys [substitutions insertions deletions]} edits
fs (extract-following-substitutions substitutions)
_ (prn "fs " fs)
fi (add-following fs (case type :one-to-many insertions :many-to-one deletions) type)
_ (prn "fi " fi)
ext (apply concat (extract t1 t2 fi type))
_ (prn "ext " ext)
nedits (delete-from-edits edits (partition 2 (flatten ext)))
res
[(for [[[a b] e] ext]
(case (count e)
1 (let [[_ b] (first e)]
(to-single-error t1 t2 a b))
(case type
:one-to-many [[(to-code-number (nth t1 a)) 7]
[a (second (first e))] [(inc a) (second (last e))]]
:many-to-one [[7 (to-code-number (nth t2 b))]
[(first (first e)) b] [(first (last e)) (inc b)]]))) nedits]
_ (prn "extr " ext "res" (first res))]
res))
</pre>
</div>
</div>
</div>
</div>
<div id="outline-container-sec-4-3" class="outline-3">
<h3 id="sec-4-3"><span class="section-number-3">4.3</span> Setting up the extraction list</h3>
<div class="outline-text-3" id="text-4-3">
<p>
Now the extraction list can be filled. First the one-to-many and
many-to-one errors have to be processed. After that the basic
substitution, insertion and deletion errors can be extracted
</p>
<div class="org-src-container">
<pre class="src src-clojure">(def extraction-list
[(partial extract-count-changing-errors :one-to-many)
(partial extract-count-changing-errors :many-to-one)
extract-substitution-errors
extract-insertion-errors
extract-deletion-errors])
</pre>
</div>
</div>
</div>
</div>
<div id="outline-container-sec-5" class="outline-2">
<h2 id="sec-5"><span class="section-number-2">5</span> Testing the error-codes</h2>
<div class="outline-text-2" id="text-5">
<div class="org-src-container">
<pre class="src src-clojure">(deftest test-error-codes
(is (= (error-codes "a" "b")
'([[1 1] [0 0]])))
(is (= (error-codes "kitten" "sitting")
'([[1 1] [0 0]] [[1 1] [4 4]] [[8 1] [6 6]])))
(is (= (error-codes "m" "rn")
'([[1 7] [0 0] [1 1]])))
(is (= (error-codes "Mammut" "rnarniiiut")
'([[1 7] [0 0] [1 1]] [[1 7] [2 3] [3 4]] [[1 7] [3 4] [4 7]])))
(is (= (error-codes "rnarniiiut" "Mammut")
'([[7 1] [0 0] [1 1]] [[7 1] [3 2] [4 3]] [[7 1] [4 3] [7 4]]))))
(test-error-codes)
</pre>
</div>
<div class="org-src-container">
<pre class="src src-clojure">(deftest big-error-codes-test
(is (= (error-codes "78\n\nschneider, vom bloßen Geldverdienst absehend, mit schöner Beharrlichkeit dem Ziele\nnachstrebten, ein wirkliches Kunstwerk zu schaffen. Daß ihnen dies bis zu einem\ngewissen Grade gelungen ist, leidet keinen Zweifel; jedenfalls ist die Federzeichnung,\nnach welcher der Holzschneider gearbeitet hat, in allen Stücken getreu und correct,\nwird möchten sagen zu correct und getreu, wiedergegeben.\n\tDas Werk stellt eine Felsschlucht dar, in welcher eine Löwin ihren vor ihr\nhingestreckten, von einem Wurfspieß durchbohrten Gemahl betrauert, während oben\ndurch eine Oeffnung in der Steinwand Beduinenjäger sichtbar werden, die auch ihr\nLeben zu bedrohen scheinen. Die Gruppirung dieser Figuren ist gut, die Bewegung\nder Löwin ist - in der Conception - ebenfalls angemessen. Die Jäger hätten\nfüglich wegbleiben können, da sie, wofern sie den auch der Löwin drohenden Tod\nandeuten sollen, die eigentliche Wirkung des Bildes der trauernden Löwin stören;\nsollen sie aber sagen, daß der Löwe durch Jäger umgekommen ist, so sind sie über-\nflüssig, da die Ursache des Todes schon hinreichend durch den im Leibe des Thieres\nsteckenden abgebrochenen Spieß angegeben ist.\n\tDas Bild würde ferner an Wirkung gewonnen haben durch eine feinere Beob-\nachtung des Stofflichen. Das Fell der Thiere, Sandboden mit Halfehgras, Felswand,\nPalmen und Aloe sind in der technischen Behandlung jedenfalls zu gleichmäßig. So\nhätte beispielsweise die Schattenseite der Felswand, rechts wo die Jäger herablugen,\nruhiger und in zurückweichenden Tönen behandelt werden sollen. Der Körper des lie-\ngenden Löwen hätte sich mehr rund von der Fläche abheben müssen, wie auch die ganze\nMuskulatur der Thiere noch präciser und energischer sein könnte. Endlich aber will\ndas Blut vor dem Maule des todten Löwen uns nicht recht wie Blut erscheinen.\nTrotz dieser Ausstellungen an den Einzelnheiten verdient das Blatt als Ganzes -\nnamentlich als tüchtiger gesunder Holzschnitt - den besten Leistungen der Gegen-\nwart auf diesem Gebiet beigerechnet zu werden, und in dieser Eigenschaft empfehlen\nwir es allen Freunden der Kunst angelegentlich.\n\nLiteratur.\n\n\tDie Böhmischen Exulanten in Sachsen von Chr. A. Pescheck Leipzig,\nS. Hirzel 1857. - Es ist von mehrfachem Interesse zu ermitteln, wie die einzel-\nnen Bölterstämme des gegenwärtigen Deutschlands durch die Uebergänge der In-\ndividuen aus einem Stamme in den andern allmälig zu einer deutschen Nation\ngemischt worden sind. Das Ineinanderfließen der Stämme durch Ein- und Aus-\nwanderung war während fast zwei Jahrtausenden niemals ganz unterbrochen, hat\naber in verschiedenen Zeiträumen besondere Ausdehnung erreicht. Von der politischen\nGeschichte wird das massenhafte Einströmen der Deutschen in das Slavenland zwischen\nElbe und Weichsel noch am ausführlichsten behandelt. Aber nicht weniger eigenthümlich\nwaren die Verhältnisse in Böhmen. Seit dem frühsten Mittelalter fand dorthin ein fried-\nliches Einziehen deutscher Bildung und deutscher Individuen statt. Doch die deutsche\nColonisation des Landes wurde mehr als einmal durch eine kräftige Gegenströmung\n"
" \n\n \n\n \n\n78\n\nschneidet, vom bloßen Geldberdienst absehend, mit schöner Beharrlichkeit dem Ziele\nnachstrebten, ein wirtliches Kunstiverk zu schaffen. Daß ihnen dies bis zu einem\ngewissen Grade gelungen ist, leidet keinen Zweifel; jedenfalls ist die Federzeiehnung\nnach welcher der Holzschneidcr gearbeitet hat, in allen Stücken getreu und eorrect,\nwird möchten sagen zu eorrect und getreu, niiedkkgegelbkk\n\nDas Werk stellt eine Felsschlucht dar, in welcher eine Löwin ihren vor ihr\nlniigestrecktenj von einein Wurfspieß durchbohrter! Gemahl betrauert, während oben\ndurch eine Oeffnung in der Steinwand Beduiiienjägssr sichtbar werden, die auch ihr\nLeben zu bedrohen scheinen. Die Gruppiruiig dieser Figuren ist gut, die Bewegung\nder Löwin ist —-- in der Conceptivii —-— ebenfalls angemessen. Die Jäger hätten\nfüglich wegbleiben können, da sie, wofern sie den auch der Löwin drohenden Tod\nandeuten sollen, die eigentliche Wirkung des Bildes der trauernden Löwin stören;\nsollen sie aber sagen, daß der Löwe durch Jäger umgekommen ist, so smd sie Liber-\nflüssig, da die Ursache des Todes schon hinreichend durch den im Leibe des Thieres\nsteckenden abgebrochenen Spieß angegeben ist.\n\nDas Bild wiirde ferner an Wirkung gewonnen haben durch eine feinere Beob-\nachtung des Stofflichen. Das Fell der Thiere, Sandboden mit Halfehgras Felswand,\nPalmen undAloe sind in der technischen Behandlung jedenfalls zu gleichniäßig So\nhätte beispielsweise die Schattenseite der Felswand, rechts wo die Jäger herablugen,\nruhiger und in zuriietkveichendeii Tönen behandelt werden sollen. Der Körper des lie-\ngenden Löwen hätte sich mehr rund von der Fläche abhebeii müssen, wie auch die ganze\nMuskulatur derThiere noch präciser und energischer sein könnte. Endlich aber will\ndas Blut vor den! Maule des todten Löwen uns nicht recht wie Blut erscheinen.\nTrotz dieser Ansstellungen an den Einzelnhciten verdient das Blatt als Ganzes —-\nnamentlich als tüchtiger gesunder Holzsehnitt — den besten Leistungen der Gegen-\n\n \n\nwart auf diescni Gebiet beigereehnet zu werden, und in dieser Eigenschaft empfehlen.\n\nwir es allen Freunden der Kunst angelegentlichJ «\n\nLiteratur.\n\nDie Böhmischen Exulanten in Sachsen von Chr. A. Pescheclc Leipzig- «\n\nS. Hirzel 1857. —- Es ist von mehrfacheni Interesse zu ermitteln, wie die einzel-\nnen Bölterstänune des gegenwärtigen Deutschlands durch die Uebergäiige der IN-\ndividuen aus einem Stamme in den andern allmälig zu einer deutschen Nation\ngemischt worden sind Das Ineinanderfließen der Stämme durch Em- und Aus-\nwanderung war während fast zwei Jahrtausenden niemals ganz unterbrochen, hat\naber in verschiedenen Zeitriiumen besondere Ausdehnung erreicht. Von der politischen\nGeschichte wird das massenhafteEinströiiieii der Deutschen in das Slavenland zwischen\nElbe Und Weichsel noch am ausführlichsteii behandelt. Aber nicht weniger eigenthiinilich\nwaren die Verhältnisse in Böhmen. Seit dem friihsten Neittelalter fand dorthin ein fried-\nliches Einziehen deutscher Bildung und deutscher Individuen statt. Doch die dcutsche\nColonisation des Landes wurde niehr als einmal durch eine triistige lihegeiiströmung\n\n \n\n")
'([[8 5] [0 0]] [[8 4] [0 1]] [[8 4] [0 2]] [[8 5] [0 3]] [[8 4] [0 4]] [[8 4] [0 5]] [[8 5] [0 6]] [[8 5] [0 7]] [[8 4] [0 8]] [[8 4] [0 9]] [[1 1] [12 22]] [[1 1] [30 40]] [[1 1] [108 118]] [[1 7] [121 131] [122 132]] [[1 1] [246 257]] [[2 8] [252 263]] [[1 1] [282 292]] [[1 1] [329 339]] [[1 1] [360 370]] [[1 1] [380 390]] [[8 1] [382 392]] [[1 1] [384 395]] [[1 1] [385 396]] [[8 1] [390 401]] [[1 1] [391 403]] [[1 1] [392 404]] [[2 4] [393 405]] [[4 8] [395 407]] [[1 7] [471 482] [472 483]] [[1 1] [473 485]] [[2 1] [485 497]] [[1 7] [495 507] [496 508]] [[1 7] [518 531] [519 532]] [[1 7] [593 607] [594 608]] [[1 1] [599 614]] [[8 1] [600 615]] [[1 7] [672 688] [673 689]] [[8 4] [726 743]] [[8 4] [727 745]] [[1 1] [743 762]] [[1 7] [744 763] [745 764]] [[8 4] [746 766]] [[8 4] [747 768]] [[7 1] [1015 1037] [1016 1038]] [[1 1] [1023 1044]] [[8 1] [1024 1045]] [[4 4] [1158 1180]] [[1 1] [1169 1191]] [[8 1] [1170 1192]] [[2 8] [1302 1325]] [[5 8] [1324 1346]] [[1 7] [1385 1406] [1386 1407]] [[2 8] [1390 1412]] [[1 1] [1498 1519]] [[1 1] [1499 1520]] [[8 1] [1500 1521]] [[8 1] [1500 1522]] [[1 1] [1501 1524]] [[1 7] [1510 1533] [1511 1534]] [[1 7] [1618 1642] [1619 1643]] [[5 8] [1661 1686]] [[1 7] [1745 1769] [1746 1770]] [[1 1] [1821 1846]] [[1 1] [1849 1874]] [[8 4] [1885 1910]] [[1 1] [1926 1952]] [[4 4] [1933 1959]] [[8 4] [1968 1994]] [[8 5] [1968 1995]] [[8 4] [1968 1996]] [[8 4] [1968 1997]] [[1 1] [1981 2011]] [[1 7] [1982 2012] [1983 2013]] [[1 1] [1998 2029]] [[8 2] [2050 2081]] [[8 4] [2051 2083]] [[2 1] [2097 2130]] [[8 5] [2098 2131]] [[8 4] [2098 2132]] [[4 8] [2112 2147]] [[1 1] [2168 2202]] [[8 1] [2169 2203]] [[2 4] [2177 2212]] [[8 5] [2178 2213]] [[8 4] [2178 2214]] [[8 4] [2179 2216]] [[8 4] [2195 2233]] [[1 7] [2217 2256] [2218 2257]] [[1 7] [2272 2312] [2273 2313]] [[1 1] [2273 2313]] [[1 7] [2324 2365] [2325 2366]] [[1 1] [2333 2375]] [[2 8] [2431 2473]] [[7 1] [2473 2514] [2474 2515]] [[1 1] [2590 2630]] [[8 1] [2591 2631]] [[5 8] [2678 2719]] [[1 7] [2686 2726] [2687 2728]] [[1 7] [2688 2730] [2689 2731]] [[1 1] [2736 2779]] [[1 7] [2771 2814] [2772 2815]] [[1 1] [2810 2854]] [[1 7] [2811 2855] [2812 2856]] [[8 1] [2812 2857]] [[1 1] [2862 2908]] [[8 1] [2863 2909]] [[1 7] [2869 2916] [2870 2917]] [[1 1] [2982 3030]] [[1 7] [3020 3068] [3021 3069]] [[1 1] [3047 3096]] [[1 1] [3049 3098]] [[1 1] [3050 3099]] [[8 1] [3051 3100]] [[1 7] [3056 3106] [3057 3108]] [[1 7] [3060 3112] [3061 3113]] [[8 4] [3070 3123]] [[8 5] [3070 3124]] [[8 5] [3070 3125]] [[8 4] [3070 3126]] [[8 4] [3070 3127]]))))
(big-error-codes-test)
</pre>
</div>
</div>
</div>
<div id="outline-container-sec-6" class="outline-2">
<h2 id="sec-6"><span class="section-number-2">6</span> Helpers for deployment to ocr-visualizer</h2>
<div class="outline-text-2" id="text-6">
<p>
doesn't belong to the code - ignore
</p>
<div class="org-src-container">
<pre class="src src-clojure"> (use 'clojure.java.io)