Add option to generate hOCR output from tesseract #80
Comments
This is a really good idea. +1 |
@jsfenfen Yep, been talking with @lukerosiak about this. Definitely want to get it into the lib. |
It is indeed as simple as adding the hocr flag to the tesseract call; no config file appears to be required. But then you have to turn the hOCR (HTML) into text, since I don't think you can get tesseract to
'''
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(fin)
grafs = soup.findAll('p',{'class':'ocr_par'})
for graf in grafs:
On Tue, Jun 11, 2013 at 12:20 PM, Ted Han [email protected] wrote:
|
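For reference, the hOCR-to-text step described above could look roughly like the following. This is only a sketch that uses Python's standard `html.parser` instead of BeautifulSoup; the names `HocrTextExtractor` and `hocr_to_text` are made up for illustration, and it assumes tesseract-style markup in which words are `span` elements with class `ocrx_word`, nested inside `ocr_line` and `ocr_par` elements.

```python
from html.parser import HTMLParser


class HocrTextExtractor(HTMLParser):
    """Collect hOCR words grouped as paragraphs -> lines -> words."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # list of paragraphs, each a list of lines
        self._span_depth = 0   # > 0 while inside an ocrx_word span
        self._buffer = []      # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._span_depth:
            # Track nested spans so the word's closing tag is matched correctly.
            if tag == "span":
                self._span_depth += 1
        elif "ocr_par" in classes:
            self.paragraphs.append([])
        elif "ocr_line" in classes:
            if not self.paragraphs:
                self.paragraphs.append([])
            self.paragraphs[-1].append([])
        elif "ocrx_word" in classes and tag == "span":
            self._span_depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._span_depth and tag == "span":
            self._span_depth -= 1
            if self._span_depth == 0:
                word = "".join(self._buffer).strip()
                if word:
                    self._current_line().append(word)

    def handle_data(self, data):
        if self._span_depth:
            self._buffer.append(data)

    def _current_line(self):
        # Tolerate words that appear outside any ocr_par / ocr_line element.
        if not self.paragraphs:
            self.paragraphs.append([])
        if not self.paragraphs[-1]:
            self.paragraphs[-1].append([])
        return self.paragraphs[-1][-1]


def hocr_to_text(hocr):
    """Return plain text: words joined by spaces, lines by newlines,
    paragraphs by blank lines. No text cleaning is attempted."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    blocks = []
    for paragraph in parser.paragraphs:
        lines = [" ".join(words) for words in paragraph if words]
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Calling `hocr_to_text(open("page.html").read())` would then give the text for one page, where `page.html` stands in for whatever file tesseract wrote.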
I'm doing something similar in Ruby, as I want to have both outputs. If you'd like, I can add something like this to the patch. The only downside is that it doesn't do any text cleaning (currently), though I'm sure that could be added.

    require 'nokogiri'

    # Write the plain text of "<page>.html" (hOCR) to "<page>.txt", and record
    # each word's character offsets in that text as start/stop attributes.
    def emit_text(page)
      doc = Nokogiri::HTML(File.read("#{page}.html"))
      File.open("#{page}.txt", "w") do |out|
        pos = 0 # number of characters written to the text file so far
        doc.css('.ocr_par').each do |par|
          par.css('.ocr_line').each do |line|
            line.css('.ocrx_word').each do |word|
              out.write("#{word.text} ")
              start = pos
              stop = start + word.text.length
              word['start'] = start
              word['stop'] = stop
              pos += word.text.length + 1 # +1 for the trailing space
            end
            out.write("\n") # end of line
            pos += 1
          end
          out.write("\n") # blank line between paragraphs
          pos += 1
        end
      end
      # Save the annotated hOCR back over the original file.
      File.write("#{page}.html", doc.to_html)
    end
|
Closing in favor of #92 |
It'd be nice to have the ability to generate hOCR output when running OCR via tesseract. I have a patch and will send the pull request.
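To make the request concrete, here is a rough sketch of what such a call could look like from a script. It relies on the tesseract command line accepting the built-in `hocr` config name after the output base; the helper names `hocr_command` and `run_hocr` are hypothetical and are not part of this library. Depending on the tesseract version, the result may be written as `<output_base>.html` or `<output_base>.hocr`.

```python
import subprocess


def hocr_command(image_path, output_base, lang=None):
    """Build the argument list for a tesseract run that emits hOCR."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]   # optional language, e.g. "eng"
    cmd.append("hocr")        # built-in config name that switches output to hOCR
    return cmd


def run_hocr(image_path, output_base, lang=None):
    """Run tesseract; raises CalledProcessError if it exits non-zero."""
    subprocess.run(hocr_command(image_path, output_base, lang), check=True)
```

For example, `run_hocr("page_1.tif", "page_1")` would ask tesseract to OCR `page_1.tif` and write the hOCR next to it (both file names are placeholders).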