Add option to generate hOCR output from tesseract #80

Closed
jhosteny opened this issue May 16, 2013 · 5 comments

Comments

@jhosteny

It'd be nice to have the ability to generate hocr output when running ocr via tesseract. I have a patch and will send the pull request.

@jsfenfen

This is a really good idea. +1

@knowtheory
Member

@jsfenfen Yep, been talking with @lukerosiak about this. Definitely want to get it into the lib.

@lukerosiak

It is indeed as simple as adding the hocr flag to the tesseract call; no config file appears to be required. But then you have to turn the hocr (HTML) into text, since I don't think you can get tesseract to produce both. To turn hocr into text, something like this (Python):

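from bs4 import BeautifulSoup  # originally: from BeautifulSoup import BeautifulSoup (BeautifulSoup 3)

# fin is assumed to be an open file handle (or a string) holding the hOCR output
soup = BeautifulSoup(fin, 'html.parser')

grafs = soup.find_all('p', {'class': 'ocr_par'})
for graf in grafs:
    print('')  # blank line between paragraphs
    lines = graf.find_all('span', {'class': 'ocr_line'})
    for line in lines:
        # get_text() joins all text nodes inside the line element
        print(line.get_text())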
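
For the first half of this (the tesseract call itself), here is a minimal sketch of what "adding the hocr flag" could look like from Python. The helper names `build_tesseract_command` and `run_tesseract` are made up for illustration and are not part of any library; the sketch assumes a `tesseract` binary on the PATH that accepts `hocr` as a trailing argument. The extension of the file it writes (`.html` or `.hocr`) depends on the Tesseract version.

```python
import subprocess


def build_tesseract_command(image_path, output_base, hocr=True):
    """Build the argument list for a tesseract call (hypothetical helper).

    With hocr=True the trailing "hocr" argument asks tesseract for hOCR
    output instead of plain text.
    """
    cmd = ["tesseract", image_path, output_base]
    if hocr:
        cmd.append("hocr")
    return cmd


def run_tesseract(image_path, output_base, hocr=True):
    """Run tesseract; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_tesseract_command(image_path, output_base, hocr))
```

Calling `run_tesseract("page.tif", "page")` would then leave the hOCR next to the image, ready for the conversion step above.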

@jhosteny
Author

I'm doing something similar in ruby, as I want to have both outputs. If you'd like, I can add something like this to the patch. The only downside is that it doesn't do any text cleaning (currently), though I'm sure that could be added.

require 'nokogiri'

# Writes "#{page}.txt" from the hOCR in "#{page}.html" and annotates each
# word element with its start/stop character offsets in the emitted text.
def emit_text(page)
  doc = Nokogiri::HTML(File.read("#{page}.html"))
  File.open("#{page}.txt", "w") do |out|
    pos = 0
    doc.css('.ocr_par').each do |par|
      par.css('.ocr_line').each do |line|
        line.css('.ocrx_word').each do |word|
          out.write("#{word.text} ")
          word['start'] = pos
          word['stop'] = pos + word.text.length
          pos += word.text.length + 1 # word plus trailing space
        end
        out.write("\n")
        pos += 1
      end
      out.write("\n")
      pos += 1
    end
  end
  # File.write closes the file, so the annotated hOCR is flushed to disk
  File.write("#{page}.html", doc.to_html)
end
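
To make the offset bookkeeping easier to check, here is a small Python sketch of the same scheme without any HTML parsing. The function `layout_with_offsets` and its nested-list input are hypothetical, chosen only for illustration. It follows the rules used above: each word is followed by a space, each line by a newline, and each paragraph by one more newline.

```python
def layout_with_offsets(paragraphs):
    """Lay out words as text and record (word, start, stop) for each one.

    paragraphs: list of paragraphs, each a list of lines, each a list of
    word strings (an assumed stand-in for the parsed hOCR structure).
    """
    chunks = []
    offsets = []
    pos = 0
    for par in paragraphs:
        for line in par:
            for word in line:
                chunks.append(word + " ")
                offsets.append((word, pos, pos + len(word)))
                pos += len(word) + 1  # word plus trailing space
            chunks.append("\n")
            pos += 1
        chunks.append("\n")
        pos += 1
    return "".join(chunks), offsets


text, offsets = layout_with_offsets([[["Hello", "world"], ["second", "line"]], [["next"]]])
# Every recorded span slices back to the word it was recorded for.
for word, start, stop in offsets:
    assert text[start:stop] == word
```

Any text cleaning added later would need to run before the offsets are computed, or the spans would no longer line up with the emitted text.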

@jhosteny
Author

Closing in favor of #92.

4 participants