This program extracts semantic OCR errors between two texts: the ground truth and the result produced by the OCR engine.
The OCR error codes are given by the following matrix:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 1 | 1 | 1 | 1 | 1 | 1 | 7 | 2 | 5 | 6 |
| 2 | 1 | 1 | 1 | 1 | 1 | 7 | 2 | 5 | |
| 3 | 1 | 1 | 1 | 1 | 1 | 7 | 2 | 5 | |
| 4 | 1 | 1 | 1 | 1 | 1 | 7 | 2 | 5 | 6 |
| 5 | 1 | 1 | 1 | 1 | 1 | 7 | 2 | 5 | |
| 6 | | | | | | | | | |
| 7 | | | | | | | | | |
| 8 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | | |
| 9 | | | | | | | | | |
The codes are the following:
- 1 letter
- 2 punctuation
- 3 digit
- 4 special character
- 5 whitespace
- 6 part of character
- 7 multiple characters
- 8 missing character
- 9 case error
The values in the matrix are the weights for the particular OCR error. The values shown here are examples; they can be changed for particular situations.
The idea is to separate the semantic information about the OCR errors from the plain sum of error points that determines the quality of the OCR.
The task of the program is to extract the correct error codes and their locations from the ground-truth and OCR-result texts.
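To illustrate how weights and extracted error codes could be kept apart and combined only at the end, here is a minimal sketch. The names `example-weights` and `weighted-score` are hypothetical and not part of the program, and the weight values are purely illustrative. It assumes error codes of the shape `[[gt-code ocr-code] [gt-pos ocr-pos] ...]`, as produced further below.

```clojure
;; Hypothetical weights keyed by [ground-truth-code ocr-code].
;; The values are illustrative only.
(def example-weights
  {[1 1] 1
   [1 7] 2
   [8 1] 4})

(defn weighted-score
  "Sums the weights of all error codes. Codes without an entry in
  `weights` count as `default-weight`."
  [weights default-weight error-codes]
  (reduce + (map (fn [[code & _positions]]
                   (get weights code default-weight))
                 error-codes)))

;; Three errors: a substitution, a one-to-many error and an insertion.
(weighted-score example-weights 1
                [[[1 1] [0 0]]
                 [[1 7] [2 3] [3 4]]
                 [[8 1] [6 6]]])
;; => 7
```

Keeping the weights in a separate map means the same extracted error codes can be scored again with different weights without re-running the alignment.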
(ns error-codes.core
  (:require [clojure.java.io :refer [file]])
  (:use [clojure.core.matrix]))
(set! *warn-on-reflection* true)
(ns error-codes.test-core
  (:require [clojure.test :refer :all]
            [error-codes.core :refer :all]
            [clojure.core.matrix :refer :all]))
To get this, we can use the Levenshtein distance, which is defined as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
The Levenshtein distance is computed by the standard dynamic programming algorithm:
(defn lev [str1 str2]
  (let [mat (new-matrix :ndarray (inc (count str1)) (inc (count str2)))]
    (mset! mat 0 0 0)
    ;; first column: delete the first i characters of str1
    (dotimes [i (count str1)]
      (mset! mat (inc i) 0 (inc i)))
    ;; first row: insert the first j characters of str2
    (dotimes [j (count str2)]
      (mset! mat 0 (inc j) (inc j)))
    (dotimes [dj (count str2)]
      (dotimes [di (count str1)]
        (let [j (inc dj) i (inc di)]
          (mset! mat i j
                 (cond
                   ;; equal characters: no additional cost
                   (= (.charAt ^String str1 di) (.charAt ^String str2 dj))
                   (mget mat di dj)
                   ;; otherwise: cheapest of deletion, insertion, substitution
                   :else
                   (min (inc (mget mat di j)) (inc (mget mat i dj))
                        (inc (mget mat di dj))))))))
    mat))
This fills a two-dimensional array with the Levenshtein distances between all prefixes of the two texts. The full Levenshtein distance is in the cell [(count gt) (count ocr)] of the resulting matrix, where gt is the ground truth and ocr the OCR result.
(defn lev-distance [a b]
  (let [m (lev a b)]
    (mget m (count a) (count b))))
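As a cross-check for the matrix-based version, the same distance can be computed without any matrix library by keeping only the previous row. This is a sketch of the usual row-by-row variant, not the implementation used in this program; the name `lev-distance-rows` is hypothetical.

```clojure
(defn lev-distance-rows
  "Levenshtein distance using only the previous row of the DP table.
  `prev` holds the distances between the first i characters of `a`
  and every prefix of `b`."
  [^String a ^String b]
  (let [n (count b)]
    (peek
     (reduce (fn [prev i]
               (reduce (fn [row j]
                         (conj row
                               (if (= (.charAt a i) (.charAt b j))
                                 ;; equal characters: take the diagonal value
                                 (nth prev j)
                                 ;; deletion, insertion, substitution
                                 (inc (min (nth prev (inc j))
                                           (peek row)
                                           (nth prev j))))))
                       [(inc i)]
                       (range n)))
             (vec (range (inc n)))
             (range (count a))))))

(lev-distance-rows "kitten" "sitting")
;; => 3
```

This variant only yields the distance. The full matrix is still needed for the backtrace described below, which is why the program keeps it.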
(deftest test-lev
  (is (= 1 (lev-distance "" "a")))
  (is (= 1 (lev-distance "a" "")))
  (is (= 3 (lev-distance "kitten" "sitting"))))
(test-lev)
We are not interested in the edit distance itself, but in the edits that were performed to reach it. One possibility would be to store in each cell of the matrix not only the current distance but also the edits so far. However, this approach has proven too slow (it manipulates maps instead of doubles while building the Levenshtein matrix). It is, however, possible to trace back, from the finished matrix, the path along which the full Levenshtein distance was built. This is described here. The order of the branches in the backtrace function is important: moving the check for substitutions to the front favours paths with substitutions over other paths with insertions and deletions that have the same error count.
The edits between the texts are represented as a map with the keys :insertions, :deletions and :substitutions (plus :distance), each of the first three holding a sequence of pairs [a b], where a and b are the positions of the edit in the ground-truth and the OCR-result text.
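To make the representation concrete, the following sketch renders an edits map as readable strings. The map is the one this document's tests expect for "Kitten" versus "sitting"; the helper `describe-edits` is hypothetical and only serves as illustration.

```clojure
;; Edits map for "Kitten" vs "sitting", as given in this document's tests.
(def example-edits
  {:insertions    '([6 6])
   :deletions     '()
   :substitutions '([0 0] [4 4])
   :distance      3})

(defn describe-edits
  "Hypothetical helper: turns every edit into a human-readable string.
  `gt` is the ground truth, `ocr` the OCR result."
  [gt ocr {:keys [insertions deletions substitutions]}]
  (concat
   (for [[a b] substitutions]
     (str "substitute " (nth gt a) " -> " (nth ocr b) " at " [a b]))
   (for [[a b] insertions]
     (str "insert " (nth ocr b) " at " [a b]))
   (for [[a b] deletions]
     (str "delete " (nth gt a) " at " [a b]))))

(describe-edits "Kitten" "sitting" example-edits)
;; => ("substitute K -> s at [0 0]"
;;     "substitute e -> i at [4 4]"
;;     "insert g at [6 6]")
```

Note that for an insertion the first position may equal the length of the ground truth, so only the OCR position is looked up; for a deletion it is the other way round.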
(defn backtrace [d i j acc]
  (cond
    ;; deletion: a ground-truth character has no counterpart in the OCR text
    (and (> i 0) (= (inc (mget d (dec i) j)) (mget d i j)))
    (recur d (dec i) j (assoc acc :deletions (cons [(dec i) j] (:deletions acc))))
    ;; insertion: an OCR character has no counterpart in the ground truth
    (and (> j 0) (= (inc (mget d i (dec j))) (mget d i j)))
    (recur d i (dec j) (assoc acc :insertions (cons [i (dec j)] (:insertions acc))))
    ;; substitution
    (and (> i 0) (> j 0) (= (inc (mget d (dec i) (dec j))) (mget d i j)))
    (recur d (dec i) (dec j) (assoc acc :substitutions (cons [(dec i) (dec j)] (:substitutions acc))))
    ;; no edit: move diagonally
    (and (> i 0) (> j 0) (= (mget d (dec i) (dec j)) (mget d i j)))
    (recur d (dec i) (dec j) acc)
    :else acc))
With the lev and backtrace functions, we can define the edits function, which returns the edits map described above with the minimal edits between the two texts:
(defn edits [a b]
  (let [d (lev a b)]
    (backtrace d (count a) (count b) {:insertions '() :deletions '()
                                      :substitutions '()
                                      :distance (mget d (count a) (count b))})))
A few examples:
(deftest test-edits
  (is (= (edits "a" "b")
         '{:insertions (), :deletions (), :substitutions ([0 0]), :distance 1}))
  ;; swapping two characters is not two substitutions but an insertion and a deletion,
  ;; which is more in line with what humans see there.
  (is (= (edits "ab" "ba")
         '{:insertions ([0 0]), :deletions ([1 2]), :substitutions (), :distance 2}))
  (is (= (edits "vr" "io")
         '{:insertions (), :deletions (), :substitutions ([0 0] [1 1]), :distance 2}))
  ;; one-to-many errors are substitutions followed by insertions
  (is (= (edits "m" "rn")
         '{:insertions ([1 1]), :deletions (), :substitutions ([0 0]), :distance 2}))
  ;; many-to-one errors are substitutions followed by deletions
  (is (= (edits "rn" "m")
         '{:insertions (), :deletions ([1 1]), :substitutions ([0 0]), :distance 2}))
  (is (= (edits "Kitten" "sitting")
         '{:insertions ([6 6]), :deletions (), :substitutions ([0 0] [4 4]), :distance 3}))
  (is (= (edits "Kitten" "sittieng")
         '{:insertions ([4 4] [6 7]), :deletions (), :substitutions ([0 0]), :distance 3}))
  (is (= (edits "Kitten" "iiittiing")
         '{:insertions ([2 2] [5 6] [6 8]), :deletions (), :substitutions ([0 0] [4 5]), :distance 5})))
(test-edits)
With the edits in place, the problem of proper text alignment is solved. What is left to do is mapping the edits to the right error codes. Some codes, like [1 1], a simple substitution (see the table at the top), are trivial to extract from the edits. But what about the one-to-many errors (codes [x 7]) or the many-to-one errors (codes [7 x] or [x 6]; the table is not deterministic here)?
As test-edits showed above, one can recognise many-to-one and one-to-many errors by substitutions that are followed by insertions or deletions.
Therefore, they have to be extracted before the plain substitutions, deletions and insertions are extracted.
A good architecture for the extraction is therefore to apply multiple extraction passes to the edits to generate the error codes. Each pass is a function that takes the current edits map and the two texts as arguments and returns [extracted-error-codes new-edits]. The passes are simply stored in a vector, which is traversed during extraction:
(declare extraction-list)
(defn edits-to-error-codes
  ([edits t1 t2] (edits-to-error-codes edits t1 t2 extraction-list))
  ([edits t1 t2 extraction-list]
   ;; run every pass in order; each pass returns [extracted-codes remaining-edits]
   (first
    (reduce (fn [[codes edits] f]
              (let [[ncodes nedits] (f edits t1 t2)]
                [(concat codes ncodes) nedits]))
            [[] edits]
            extraction-list))))
(defn error-codes
  [t1 t2]
  (as-> (edits t1 t2) x
    (edits-to-error-codes x t1 t2)
    ;; sort the codes so that the error codes are ordered by position in the text
    (sort-by (comp second second) x)))
For the extraction passes we also need a small function which maps a character to its number in the matrix.
(defn to-code-number [^Character c]
  (cond
    (Character/isLetter c) 1
    (#{\. \, \? \!} c) 2
    (Character/isDigit c) 3
    (= c \space) 5
    :else 4))
Extracting the basic error codes for substitutions, insertions and deletions is easy, as they are just the entries of the edits map along with the right error code. These extraction functions will be at the end of the extraction-list.
(defn extract-substitution-errors [edits t1 t2]
  (let [codes (for [[p1 p2] (:substitutions edits)]
                (let [c1 (nth t1 p1) c2 (nth t2 p2)
                      [f1 f2] (map to-code-number [c1 c2])]
                  [[f1 f2] [p1 p2]]))]
    [codes (assoc edits :substitutions [])]))
(defn extract-insertion-errors [edits t1 t2]
  (let [codes (for [[p1 p2] (:insertions edits)]
                (let [c2 (nth t2 p2)
                      f2 (to-code-number c2)]
                  [[8 f2] [p1 p2]]))]
    [codes (assoc edits :insertions [])]))
(defn extract-deletion-errors [edits t1 t2]
  (let [codes (for [[p1 p2] (:deletions edits)]
                (let [c1 (nth t1 p1)
                      f1 (to-code-number c1)]
                  [[f1 8] [p1 p2]]))]
    [codes (assoc edits :deletions [])]))
Now comes the difficult part: recognising the semantic errors which are not just basic edits.
(edits "mann" "rnann")
(edits "rnann" "mann")
We see that many-to-one and one-to-many errors can be recognised by substitutions followed by insertions or deletions. However, we don’t want to consume all following insertions or deletions when there are too many. Consider this:
(edits "Oma" "Ornrewölkra")
If we just extracted all following insertions here, the result would not resemble an OCR error well. Instead we only want to consume the first two insertions, for the letters r and n. The same is true for many-to-one errors.
To generalise this observation, we do a bar analysis: for one-to-many errors, OCR engines make mistakes at the level of splitting the image into characters. So the width of the correct character (for example \m) and of the recognised characters (for example “rn” or “iii”) will be roughly the same. We can therefore count the bars (the vertical strokes, counted along the line) of the recognised characters and stop when their bar count matches the bar count of the original character.
We map each character to its number of bars (we are dealing with text written in Fraktur):
(def bar-map
  {\A 2 \a 2 \B 2 \b 2 \C 2 \c 1 \D 2 \d 2 \E 2 \e 1 \F 2 \f 1 \G 3 \g 2
   \H 2 \h 2 \I 2 \J 2 \i 1 \j 1 \K 2 \k 1 \L 2 \l 1 \M 4 \m 3 \N 3 \n 2
   \O 2 \o 2 \P 3 \p 2 \Q 2 \q 2 \R 3 \r 1 \S 2 \s 2 \T 2 \t 1 \U 2 \u 2
   \V 3 \v 2 \W 4 \w 3 \X 2 \x 1 \Y 3 \y 2 \Z 2 \z 1 \ü 2})
;; characters without an entry count as one bar
(defn lookup-bar [c]
  (get bar-map c 1))
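The stopping rule of the bar analysis can be sketched in isolation as follows. This is a simplified illustration on plain characters, not the take-bars function of the program, which works on edit positions; `toy-bars` and `take-until-bars` are hypothetical names, and the bar counts are copied from the map above.

```clojure
;; A small excerpt of the bar counts, enough for the examples.
(def toy-bars {\m 3 \r 1 \n 2 \i 1})

(defn take-until-bars
  "Takes characters from `chars` until their accumulated bar count
  reaches `target`. Characters missing from `bars` count as one bar."
  [bars target chars]
  (loop [[c & more] chars
         total 0
         taken []]
    (if (or (nil? c) (>= total target))
      taken
      (recur more (+ total (get bars c 1)) (conj taken c)))))

;; An "m" (3 bars) read as "rn" followed by unrelated characters:
(take-until-bars toy-bars (toy-bars \m) "rnrewölkra")
;; => [\r \n]

;; An "m" read as "iii":
(take-until-bars toy-bars (toy-bars \m) "iiiut")
;; => [\i \i \i]
```

In the "Oma" example only r and n are consumed, because together they already account for the three bars of m.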
Several one-to-many errors can also follow each other. In the edits, this shows up as follows:
(edits "Mammut" "Marniiiut")
(edits "Marniiiut" "Mammut")
So we also have to check for multiple substitutions followed by insertions or deletions, and then map each character in the substitutions to the following characters according to the bar analysis.
The handling of many-to-one errors is basically the same; the differences are how to determine which insertions or deletions follow the given substitution and how to label the extracted codes.
;; groups substitutions at directly consecutive positions
(defn- extract-following-substitutions [substitutions]
  (when (seq substitutions)
    (reduce (fn [l [a b]]
              (if (= (map inc (last (last l))) [a b])
                ;; add to current group
                (update-in l [(dec (count l))] #(conj % [a b]))
                ;; make a new group
                (conj l [[a b]])))
            [[(first substitutions)]] (rest substitutions))))
;; collects the insertions/deletions that directly follow position [a b]
(defn- following [[a b] insdel type]
  (loop [[ia ib] [(inc a) (inc b)] acc []]
    (if (some #{[ia ib]} insdel)
      (recur (case type
               :one-to-many [ia (inc ib)]
               :many-to-one [(inc ia) ib]) (conj acc [ia ib]))
      acc)))
(defn- add-following [following-substitutions insdel type]
  (for [fs following-substitutions
        :let [fol (following (last fs) insdel type)]
        :when (seq fol)]
    [fs (concat fs fol)]))
;; takes edits from insdel until their bar count reaches the bar count
;; of the character at [a b]
(defn- take-bars [t1 t2 [a b] insdel type]
  (let [bar-count (lookup-bar (case type
                                :one-to-many (nth t1 a)
                                :many-to-one (nth t2 b)))]
    (loop [[[ia ib] & is] insdel acc 0 ret []]
      (if ia
        (let [bc (lookup-bar (case type
                               :one-to-many (nth t2 ib)
                               :many-to-one (nth t1 ia)))]
          (if (>= (+ acc bc) bar-count)
            (conj ret [ia ib])
            (recur is (+ acc bc) (conj ret [ia ib]))))
        ret))))
;;;todo take care of how many are allowed to be extracted
;;;we know that count-free is > 0 at the beginning
;;;be careful at the start of the error
(defn- extract [t1 t2 fl type]
  (for [[fs insdel] fl]
    (loop [[[a b] & ss] fs
           extr []
           insdel insdel
           count-free (- (count insdel) (count fs))]
      (if (and a (seq insdel))
        (let [ext (take-bars t1 t2 [a b] (take (inc count-free) insdel) type)]
          (recur ;dont drop here
           ss
           (conj extr [[a b] ext])
           (drop (count ext) insdel)
           (- count-free (- (count ext) 1))))
        extr))))
(defn- delete-from-edits [edits to-delete]
  (into {} (for [[k v] (dissoc edits :distance)] [k (remove (set to-delete) v)])))
(defn to-single-error [t1 t2 a b]
  (cond
    (= (count t1) a) (let [c2 (nth t2 b)
                           f2 (to-code-number c2)]
                       [[8 f2] [a b]]) ;;insertion
    (= (count t2) b) (let [c1 (nth t1 a)
                           f1 (to-code-number c1)]
                       [[f1 8] [a b]]) ;;deletion
    :else [[(to-code-number (nth t1 a))
            (to-code-number (nth t2 b))]
           [a b]]))
(defn extract-count-changing-errors [type edits t1 t2]
  (let [{:keys [substitutions insertions deletions]} edits
        fs (extract-following-substitutions substitutions)
        fi (add-following fs (case type :one-to-many insertions :many-to-one deletions) type)
        ext (apply concat (extract t1 t2 fi type))
        ;; the consumed edits must not be extracted again by later passes
        nedits (delete-from-edits edits (partition 2 (flatten ext)))]
    [(for [[[a b] e] ext]
       (case (count e)
         1 (let [[_ b] (first e)]
             (to-single-error t1 t2 a b))
         (case type
           :one-to-many [[(to-code-number (nth t1 a)) 7]
                         [a (second (first e))] [(inc a) (second (last e))]]
           :many-to-one [[7 (to-code-number (nth t2 b))]
                         [(first (first e)) b] [(first (last e)) (inc b)]])))
     nedits]))
Now the extraction list can be filled. First the one-to-many and many-to-one errors have to be processed. After that the basic substitution, insertion and deletion errors can be extracted.
(def extraction-list
  [(partial extract-count-changing-errors :one-to-many)
   (partial extract-count-changing-errors :many-to-one)
   extract-substitution-errors
   extract-insertion-errors
   extract-deletion-errors])
(deftest test-error-codes
  (is (= (error-codes "a" "b")
         '([[1 1] [0 0]])))
  (is (= (error-codes "kitten" "sitting")
         '([[1 1] [0 0]] [[1 1] [4 4]] [[8 1] [6 6]])))
  (is (= (error-codes "m" "rn")
         '([[1 7] [0 0] [1 1]])))
  (is (= (error-codes "Mammut" "rnarniiiut")
         '([[1 7] [0 0] [1 1]] [[1 7] [2 3] [3 4]] [[1 7] [3 4] [4 7]])))
  (is (= (error-codes "rnarniiiut" "Mammut")
         '([[7 1] [0 0] [1 1]] [[7 1] [3 2] [4 3]] [[7 1] [4 3] [7 4]]))))
(test-error-codes)
(deftest big-error-codes-test
(is (= (error-codes "78\n\nschneider, vom bloßen Geldverdienst absehend, mit schöner Beharrlichkeit dem Ziele\nnachstrebten, ein wirkliches Kunstwerk zu schaffen. Daß ihnen dies bis zu einem\ngewissen Grade gelungen ist, leidet keinen Zweifel; jedenfalls ist die Federzeichnung,\nnach welcher der Holzschneider gearbeitet hat, in allen Stücken getreu und correct,\nwird möchten sagen zu correct und getreu, wiedergegeben.\n\tDas Werk stellt eine Felsschlucht dar, in welcher eine Löwin ihren vor ihr\nhingestreckten, von einem Wurfspieß durchbohrten Gemahl betrauert, während oben\ndurch eine Oeffnung in der Steinwand Beduinenjäger sichtbar werden, die auch ihr\nLeben zu bedrohen scheinen. Die Gruppirung dieser Figuren ist gut, die Bewegung\nder Löwin ist - in der Conception - ebenfalls angemessen. Die Jäger hätten\nfüglich wegbleiben können, da sie, wofern sie den auch der Löwin drohenden Tod\nandeuten sollen, die eigentliche Wirkung des Bildes der trauernden Löwin stören;\nsollen sie aber sagen, daß der Löwe durch Jäger umgekommen ist, so sind sie über-\nflüssig, da die Ursache des Todes schon hinreichend durch den im Leibe des Thieres\nsteckenden abgebrochenen Spieß angegeben ist.\n\tDas Bild würde ferner an Wirkung gewonnen haben durch eine feinere Beob-\nachtung des Stofflichen. Das Fell der Thiere, Sandboden mit Halfehgras, Felswand,\nPalmen und Aloe sind in der technischen Behandlung jedenfalls zu gleichmäßig. So\nhätte beispielsweise die Schattenseite der Felswand, rechts wo die Jäger herablugen,\nruhiger und in zurückweichenden Tönen behandelt werden sollen. Der Körper des lie-\ngenden Löwen hätte sich mehr rund von der Fläche abheben müssen, wie auch die ganze\nMuskulatur der Thiere noch präciser und energischer sein könnte. 
Endlich aber will\ndas Blut vor dem Maule des todten Löwen uns nicht recht wie Blut erscheinen.\nTrotz dieser Ausstellungen an den Einzelnheiten verdient das Blatt als Ganzes -\nnamentlich als tüchtiger gesunder Holzschnitt - den besten Leistungen der Gegen-\nwart auf diesem Gebiet beigerechnet zu werden, und in dieser Eigenschaft empfehlen\nwir es allen Freunden der Kunst angelegentlich.\n\nLiteratur.\n\n\tDie Böhmischen Exulanten in Sachsen von Chr. A. Pescheck Leipzig,\nS. Hirzel 1857. - Es ist von mehrfachem Interesse zu ermitteln, wie die einzel-\nnen Bölterstämme des gegenwärtigen Deutschlands durch die Uebergänge der In-\ndividuen aus einem Stamme in den andern allmälig zu einer deutschen Nation\ngemischt worden sind. Das Ineinanderfließen der Stämme durch Ein- und Aus-\nwanderung war während fast zwei Jahrtausenden niemals ganz unterbrochen, hat\naber in verschiedenen Zeiträumen besondere Ausdehnung erreicht. Von der politischen\nGeschichte wird das massenhafte Einströmen der Deutschen in das Slavenland zwischen\nElbe und Weichsel noch am ausführlichsten behandelt. Aber nicht weniger eigenthümlich\nwaren die Verhältnisse in Böhmen. Seit dem frühsten Mittelalter fand dorthin ein fried-\nliches Einziehen deutscher Bildung und deutscher Individuen statt. Doch die deutsche\nColonisation des Landes wurde mehr als einmal durch eine kräftige Gegenströmung\n"
" \n\n \n\n \n\n78\n\nschneidet, vom bloßen Geldberdienst absehend, mit schöner Beharrlichkeit dem Ziele\nnachstrebten, ein wirtliches Kunstiverk zu schaffen. Daß ihnen dies bis zu einem\ngewissen Grade gelungen ist, leidet keinen Zweifel; jedenfalls ist die Federzeiehnung\nnach welcher der Holzschneidcr gearbeitet hat, in allen Stücken getreu und eorrect,\nwird möchten sagen zu eorrect und getreu, niiedkkgegelbkk\n\nDas Werk stellt eine Felsschlucht dar, in welcher eine Löwin ihren vor ihr\nlniigestrecktenj von einein Wurfspieß durchbohrter! Gemahl betrauert, während oben\ndurch eine Oeffnung in der Steinwand Beduiiienjägssr sichtbar werden, die auch ihr\nLeben zu bedrohen scheinen. Die Gruppiruiig dieser Figuren ist gut, die Bewegung\nder Löwin ist —-- in der Conceptivii —-— ebenfalls angemessen. Die Jäger hätten\nfüglich wegbleiben können, da sie, wofern sie den auch der Löwin drohenden Tod\nandeuten sollen, die eigentliche Wirkung des Bildes der trauernden Löwin stören;\nsollen sie aber sagen, daß der Löwe durch Jäger umgekommen ist, so smd sie Liber-\nflüssig, da die Ursache des Todes schon hinreichend durch den im Leibe des Thieres\nsteckenden abgebrochenen Spieß angegeben ist.\n\nDas Bild wiirde ferner an Wirkung gewonnen haben durch eine feinere Beob-\nachtung des Stofflichen. Das Fell der Thiere, Sandboden mit Halfehgras Felswand,\nPalmen undAloe sind in der technischen Behandlung jedenfalls zu gleichniäßig So\nhätte beispielsweise die Schattenseite der Felswand, rechts wo die Jäger herablugen,\nruhiger und in zuriietkveichendeii Tönen behandelt werden sollen. Der Körper des lie-\ngenden Löwen hätte sich mehr rund von der Fläche abhebeii müssen, wie auch die ganze\nMuskulatur derThiere noch präciser und energischer sein könnte. Endlich aber will\ndas Blut vor den! 
Maule des todten Löwen uns nicht recht wie Blut erscheinen.\nTrotz dieser Ansstellungen an den Einzelnhciten verdient das Blatt als Ganzes —-\nnamentlich als tüchtiger gesunder Holzsehnitt — den besten Leistungen der Gegen-\n\n \n\nwart auf diescni Gebiet beigereehnet zu werden, und in dieser Eigenschaft empfehlen.\n\nwir es allen Freunden der Kunst angelegentlichJ «\n\nLiteratur.\n\nDie Böhmischen Exulanten in Sachsen von Chr. A. Pescheclc Leipzig- «\n\nS. Hirzel 1857. —- Es ist von mehrfacheni Interesse zu ermitteln, wie die einzel-\nnen Bölterstänune des gegenwärtigen Deutschlands durch die Uebergäiige der IN-\ndividuen aus einem Stamme in den andern allmälig zu einer deutschen Nation\ngemischt worden sind Das Ineinanderfließen der Stämme durch Em- und Aus-\nwanderung war während fast zwei Jahrtausenden niemals ganz unterbrochen, hat\naber in verschiedenen Zeitriiumen besondere Ausdehnung erreicht. Von der politischen\nGeschichte wird das massenhafteEinströiiieii der Deutschen in das Slavenland zwischen\nElbe Und Weichsel noch am ausführlichsteii behandelt. Aber nicht weniger eigenthiinilich\nwaren die Verhältnisse in Böhmen. Seit dem friihsten Neittelalter fand dorthin ein fried-\nliches Einziehen deutscher Bildung und deutscher Individuen statt. Doch die dcutsche\nColonisation des Landes wurde niehr als einmal durch eine triistige lihegeiiströmung\n\n \n\n")
'([[8 5] [0 0]] [[8 4] [0 1]] [[8 4] [0 2]] [[8 5] [0 3]] [[8 4] [0 4]] [[8 4] [0 5]] [[8 5] [0 6]] [[8 5] [0 7]] [[8 4] [0 8]] [[8 4] [0 9]] [[1 1] [12 22]] [[1 1] [30 40]] [[1 1] [108 118]] [[1 7] [121 131] [122 132]] [[1 1] [246 257]] [[2 8] [252 263]] [[1 1] [282 292]] [[1 1] [329 339]] [[1 1] [360 370]] [[1 1] [380 390]] [[8 1] [382 392]] [[1 1] [384 395]] [[1 1] [385 396]] [[8 1] [390 401]] [[1 1] [391 403]] [[1 1] [392 404]] [[2 4] [393 405]] [[4 8] [395 407]] [[1 7] [471 482] [472 483]] [[1 1] [473 485]] [[2 1] [485 497]] [[1 7] [495 507] [496 508]] [[1 7] [518 531] [519 532]] [[1 7] [593 607] [594 608]] [[1 1] [599 614]] [[8 1] [600 615]] [[1 7] [672 688] [673 689]] [[8 4] [726 743]] [[8 4] [727 745]] [[1 1] [743 762]] [[1 7] [744 763] [745 764]] [[8 4] [746 766]] [[8 4] [747 768]] [[7 1] [1015 1037] [1016 1038]] [[1 1] [1023 1044]] [[8 1] [1024 1045]] [[4 4] [1158 1180]] [[1 1] [1169 1191]] [[8 1] [1170 1192]] [[2 8] [1302 1325]] [[5 8] [1324 1346]] [[1 7] [1385 1406] [1386 1407]] [[2 8] [1390 1412]] [[1 1] [1498 1519]] [[1 1] [1499 1520]] [[8 1] [1500 1521]] [[8 1] [1500 1522]] [[1 1] [1501 1524]] [[1 7] [1510 1533] [1511 1534]] [[1 7] [1618 1642] [1619 1643]] [[5 8] [1661 1686]] [[1 7] [1745 1769] [1746 1770]] [[1 1] [1821 1846]] [[1 1] [1849 1874]] [[8 4] [1885 1910]] [[1 1] [1926 1952]] [[4 4] [1933 1959]] [[8 4] [1968 1994]] [[8 5] [1968 1995]] [[8 4] [1968 1996]] [[8 4] [1968 1997]] [[1 1] [1981 2011]] [[1 7] [1982 2012] [1983 2013]] [[1 1] [1998 2029]] [[8 2] [2050 2081]] [[8 4] [2051 2083]] [[2 1] [2097 2130]] [[8 5] [2098 2131]] [[8 4] [2098 2132]] [[4 8] [2112 2147]] [[1 1] [2168 2202]] [[8 1] [2169 2203]] [[2 4] [2177 2212]] [[8 5] [2178 2213]] [[8 4] [2178 2214]] [[8 4] [2179 2216]] [[8 4] [2195 2233]] [[1 7] [2217 2256] [2218 2257]] [[1 7] [2272 2312] [2273 2313]] [[1 1] [2273 2313]] [[1 7] [2324 2365] [2325 2366]] [[1 1] [2333 2375]] [[2 8] [2431 2473]] [[7 1] [2473 2514] [2474 2515]] [[1 1] [2590 2630]] [[8 1] [2591 2631]] [[5 8] [2678 
2719]] [[1 7] [2686 2726] [2687 2728]] [[1 7] [2688 2730] [2689 2731]] [[1 1] [2736 2779]] [[1 7] [2771 2814] [2772 2815]] [[1 1] [2810 2854]] [[1 7] [2811 2855] [2812 2856]] [[8 1] [2812 2857]] [[1 1] [2862 2908]] [[8 1] [2863 2909]] [[1 7] [2869 2916] [2870 2917]] [[1 1] [2982 3030]] [[1 7] [3020 3068] [3021 3069]] [[1 1] [3047 3096]] [[1 1] [3049 3098]] [[1 1] [3050 3099]] [[8 1] [3051 3100]] [[1 7] [3056 3106] [3057 3108]] [[1 7] [3060 3112] [3061 3113]] [[8 4] [3070 3123]] [[8 5] [3070 3124]] [[8 5] [3070 3125]] [[8 4] [3070 3126]] [[8 4] [3070 3127]]))))
(big-error-codes-test)
The following doesn’t belong to the code and can be ignored.
(use 'clojure.java.io)
(defn get-files-sorted [dir]
  (->> (file-seq (file dir))
       rest
       (sort-by #(.getName %))))
(defn deploy-to-ocr-visualizer []
  (let [gts (get-files-sorted "/home/kima/programming/ocr-visualizer/resources/public/ground-truth/")
        ocr-res (get-files-sorted "/home/kima/programming/ocr-visualizer/resources/public/ocr-results/")]
    (doall (pmap (fn [gt ocr]
                   (let [filename (str "/home/kima/programming/ocr-visualizer/resources/public/edits/" (.getName gt))]
                     (prn "error-counts for " filename)
                     (spit filename (pr-str (error-codes (slurp gt) (slurp ocr))))))
                 gts ocr-res))))
(defn deploy-error-codes [base-directory]
  (let [gts (get-files-sorted (file base-directory "ground-truth/"))
        ocr-res (get-files-sorted (file base-directory "ocr-results/"))]
    (doall (pmap (fn [gt ocr]
                   (let [filename (file base-directory
                                        "edits/" (.getName gt))]
                     (prn "error-counts for " filename)
                     (spit filename (pr-str (error-codes (slurp gt) (slurp ocr))))))
                 gts ocr-res))))
(defn word-count [text]
  (as-> text x
    ;; split on runs of whitespace; "\\s*" would also split between all characters
    (.split ^String x "\\s+")
    (remove empty? x)
    (count x)))
(defn error-code-to-matrix-entries [augmented-error-code]
  (let [[[[a b] & res :as error-code] [f t]] augmented-error-code]
    (cond
      (some #{8} [a b]) [[a b]]
      (= b 7) [[a b]]
      (= a 7) (map (fn [fc] [(to-code-number fc) 6]) f)
      ;; case error: compare the affected text, not the numeric codes
      (and f t (not= f t) (.equalsIgnoreCase ^String f ^String t))
      [[1 9]]
      :else [[a b]])))
;; build-error is defined further below
(declare build-error)
(defn augment-error-code [a b error-code]
  [error-code (build-error a b error-code)])
(defn no-insertions-or-deletions [matrix-entry]
  (not (some #{8} matrix-entry)))
(defn generate-statistics
  ([base-directory] (generate-statistics base-directory identity))
  ([base-directory flavour]
   (let [ground-truth (map slurp (get-files-sorted (file base-directory "ground-truth/")))
         ocr-res (map slurp (get-files-sorted (file base-directory "ocr-results/")))
         edits (map (comp read-string slurp) (get-files-sorted (file base-directory "edits/")))
         errors (->> (mapcat #(map (partial augment-error-code %1 %2) %3) ground-truth ocr-res edits)
                     (mapcat error-code-to-matrix-entries)
                     (filter flavour))
         charc (apply + (map count ocr-res))]
     {:error-rate (* 100 (/ (count errors) charc))
      :charc charc :error-number (count errors)
      :by-category (frequencies errors)})))
;; returns the affected text as [ground-truth-part ocr-part]
(defn build-error [a b error-code]
  (cond
    ;; insertion
    (= 8 (get-in error-code [0 0]))
    ["" (str (nth b (get-in error-code [1 1])))]
    ;; deletion
    (= 8 (get-in error-code [0 1]))
    [(str (nth a (get-in error-code [1 0]))) ""]
    ;; plain substitution
    (= 2 (count error-code))
    (let [[code [l r]] error-code]
      [(str (nth a l)) (str (nth b r))])
    ;; one-to-many
    (= 7 (get-in error-code [0 1]))
    (let [[code [ls rs] [le re]] error-code]
      [(subs a ls le) (subs b rs (inc re))])
    ;; many-to-one
    (= 7 (get-in error-code [0 0]))
    (let [[code [ls rs] [le re]] error-code]
      [(subs a ls (inc le)) (subs b rs re)])))
- incorporated as postprocessing step in bote/optimize