After reading "Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr" and going through the snippet below (from this gist) for Linux, I think I found a method to OCR a multi-page PDF and get a PDF in the output that could also work in OS X.

Most of the dependencies are available in homebrew (brew install tesseract and brew install imagemagick), except one, hocr2pdf. I haven't been able to find a port of it for OS X. Is there one available? If not, how can one OCR a multi-page PDF and get the results back again in a multi-page PDF in OS X, using free, open source tools?

    #!/bin/bash
    # This is a script to transform a PDF containing a scanned book into a searchable PDF.
    # Based on previous script and many good tips by Konrad Voelkel:
    # Depends on convert (ImageMagick), pdftk and hocr2pdf (ExactImage).
    # $ sudo apt-get install imagemagick pdftk exactimage
    # You also need at least one OCR software which can be either tesseract or cuneiform.
    # To install languages into tesseract do, e.g.:
    # $ sudo apt-get install tesseract-ocr-por

    echo "usage: pdfocr.sh document.pdf ocr-sfw split lang author title"
    # where ocr-sfw is either tesseract or cuneiform
    # split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
    # lang is a language as in "tesseract -list-langs" or "cuneiform -l".
    # and author, title are used for the PDF metadata.
    # Example: pdfocr.sh SomeFile.pdf tesseract 1 por "Some Author" "Some Title"
    # [...]
    convert -normalize -density 300 -depth 8 $f "$f.png"
    convert -normalize -density 300 -depth 8 -crop 50%x100% +repage $f "$f.png"
    # [...]
    hocr2pdf -i $f -r 300 -s -o "$f.pdf"
    # [...]
    echo "InfoValue: PDF OCR scan script" > in.info
    pdftk merged+data.pdf update_info_utf8 in.info output "$in_filename-ocr.pdf"

Tesseract 3.03+ has built-in support for PDF output. You will also need ghostscript installed, but there is no need for hocr2pdf. You can use:

    brew install tesseract --HEAD

to get the latest version of tesseract, which requires leptonica to be installed.
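For illustration, here is a rough sketch of how a multi-page PDF might be OCRed with only ghostscript and tesseract's built-in PDF output, without hocr2pdf. It is not the gist's script: the function names `ocr_output_name` and `ocr_pdf`, the 300 dpi grayscale rasterization, and the default language `eng` are assumptions made for this example, and it presumes `gs` and a tesseract build of 3.03 or later are on the PATH.

```shell
#!/bin/bash
# Sketch only: OCR a multi-page PDF via tesseract's own PDF renderer.
# Assumes gs (ghostscript) and tesseract >= 3.03 are installed.

# Hypothetical helper: derive the output name, e.g. scan.pdf -> scan-ocr.pdf
ocr_output_name() {
  local in="$1"
  printf '%s-ocr.pdf\n' "${in%.pdf}"
}

# Hypothetical helper: ocr_pdf input.pdf [lang]
ocr_pdf() {
  local in="$1" lang="${2:-eng}"
  local out tmp page
  out="$(ocr_output_name "$in")"
  tmp="$(mktemp -d "${TMPDIR:-/tmp}/pdfocr.XXXXXX")" || return 1

  # 1. Rasterize each page to a 300 dpi grayscale PNG (page-0001.png, ...).
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -r300 \
     -sOutputFile="$tmp/page-%04d.png" "$in" || return 1

  # 2. OCR each page; the "pdf" config file makes tesseract write
  #    <outputbase>.pdf containing the image plus a text layer.
  for page in "$tmp"/page-*.png; do
    tesseract "$page" "${page%.png}" -l "$lang" pdf || return 1
  done

  # 3. Concatenate the per-page PDFs (zero-padded names keep page order).
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
     -sOutputFile="$out" "$tmp"/page-*.pdf || return 1

  rm -rf "$tmp"
}
```

With these definitions loaded, something like `ocr_pdf SomeFile.pdf por` should leave the result in `SomeFile-ocr.pdf`, provided the `por` language data is installed. The metadata step from the original script (`pdftk ... update_info_utf8`) is left out here and could be applied to the output afterwards.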