| Tags: nuance omnipage, ocr, pdf file, script, ubuntu |
|
#1
| How can I OCR any PDF file on ubuntu?
|
|
#2
| Re: How can I OCR any PDF file on ubuntu?
To do this, follow the steps below. First, install all the necessary packages. Code:
sudo apt-get install tesseract-ocr tesseract-ocr-eng xpdf-reader xpdf imagemagick xpdf-utils
Code:
pdftotext
Below is a shell script that will attempt to OCR your file. Place the script somewhere on your $PATH so that you can run it from the directory containing the PDF file. Code:
#!/bin/sh
# Work in a scratch directory so intermediate files do not clutter the current one
mkdir tmp
cp "$@" tmp
cd tmp || exit 1
# Render pages 1 to 10 of the PDF as 600 dpi PPM images named ocrbook-*.ppm
pdftoppm -f 1 -l 10 -r 600 * ocrbook
# Convert each page image to TIFF for tesseract
for i in *.ppm; do convert "$i" "$(basename "$i" .ppm).tif"; done
# OCR each page; -l eng matches the tesseract-ocr-eng package installed above
# (change it if you install another language pack)
for i in *.tif; do tesseract "$i" "$(basename "$i" .tif)" -l eng; done
# Join the per-page text into one file, with a marker after each page
for i in *.txt; do cat "$i" >> pdf-ocr-output.txt; echo "[pagebreak]" >> pdf-ocr-output.txt; done
mv pdf-ocr-output.txt ..
rm *
cd ..
rmdir tmp
To use the script, copy it into your favorite text editor and save it. Then make it executable and run it. Code:
cd /path/to/saved/file/
chmod +x filename
./filename "PDF file you wish to ocr"
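Since the script above depends on several external programs, it can help to check that they are installed before starting. The following is only a minimal sketch: the function name missing_tools is made up for illustration and is not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print the name of every
# command in the argument list that cannot be found on $PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Possible use at the top of the OCR script:
#   missing=$(missing_tools pdftoppm convert tesseract)
#   [ -z "$missing" ] || { echo "Please install: $missing" >&2; exit 1; }
```

If the helper prints nothing, all the listed programs were found.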
|
#3
| Re: How can I OCR any PDF file on ubuntu?
Optical Character Recognition, as the name suggests, tries to determine characters from the shapes in an image. Detecting the actual font and reconstructing the text layout, so that the output resembles the original image, seems to be considerably harder. Can you tell me of any tool that has this functionality?
|
|
#4
| Re: How can I OCR any PDF file on ubuntu?
I am aware of some programs that can help you with this requirement.
|
|
#5
| Re: How can I OCR any PDF file on ubuntu?
Here is the updated version of the script. Code:
#!/bin/bash
TESS_LANG=eng
rflag=
# Parse the optional -r <degrees> flag, which rotates each page image before OCR
while getopts 'r:' OPT
do
  if [ "$OPT" = "r" ]
  then
    rflag="-rotate $OPTARG"
  fi
done
shift $((OPTIND - 1))
CURRENT_DIR=$(pwd)
SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR="${SCRIPT_NAME}-tmp"
mkdir "${TMP_DIR}"
for thisfile in "$@"
do
  NAME=$(basename "${thisfile}" .pdf)
  cp "$thisfile" "${TMP_DIR}"
  cd "${TMP_DIR}" || exit 1
  # From here on, work with the copy inside the temporary directory
  PDF=$(basename "${thisfile}")
  echo "Examining: ${thisfile}"
  pgs=$(pdfinfo "${PDF}" | grep '^Pages' | awk '{print $2}')
  echo "Found ${pgs} pages; converting..."
  # it's only fair, since we're suppressing it later...
  echo "Tesseract Open Source OCR Engine"
  for x in $(seq 1 "${pgs}")
  do
    echo -en " Page ${x}..."
    pdftoppm -f "$x" -l "$x" -r 600 "${PDF}" ocrbook
    BASE=ocrbook-${x}
    # pdftoppm may zero-pad the page number in the file name, so match it with a glob
    convert ocrbook-*.ppm ${rflag} "${BASE}.tif"
    tesseract "${BASE}.tif" "${BASE}" -l "${TESS_LANG}" > /dev/null 2>&1
    cat "${BASE}.txt" >> "${NAME}.txt"
    echo "[pagebreak]" >> "${NAME}.txt"
    rm ocrbook*
    echo "done"
  done
  echo "Conversion complete"
  mv "${NAME}.txt" "${CURRENT_DIR}"
  rm *
  cd "${CURRENT_DIR}"
done
rmdir "${TMP_DIR}"
For example, assuming the script is saved as ocrpdf.sh, running the following rotates each page by 90 degrees before OCR. Code:
ocrpdf.sh -r 90 file1.pdf file\ name\ 2.pdf
It creates two new files, file1.txt and "file name 2.txt", in the current directory.
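The -r option handling in the script above could also be pulled into a small function that rejects unknown options. This is just a sketch under assumptions: the function name parse_rotation and the variable nshift are invented for illustration, and the usage message is a guess at what would be helpful.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): parse an optional
# "-r DEGREES" flag. Sets rflag to the ImageMagick rotate option (or empty)
# and nshift to the number of arguments the caller should shift away.
parse_rotation() {
  OPTIND=1
  rflag=""
  # The leading ':' puts getopts in silent mode, so errors are reported below
  while getopts ':r:' opt; do
    case "$opt" in
      r) rflag="-rotate $OPTARG" ;;
      *) echo "usage: ocrpdf.sh [-r degrees] file.pdf ..." >&2
         return 1 ;;
    esac
  done
  nshift=$((OPTIND - 1))
}

# Possible use at the top of the OCR script:
#   parse_rotation "$@" || exit 1
#   shift "$nshift"
```

After the shift, "$@" holds only the PDF file names, so names containing spaces are still passed through intact.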