
Thread: How can I OCR any PDF file on ubuntu?

  1. #1
    Join Date
    Jan 2012
    Posts
    44

    How can I OCR any PDF file on ubuntu?

    Can you tell me how I can OCR a PDF file? My computer is currently running Ubuntu. Please provide step-by-step instructions so that I can get this working. Any help will be appreciated. Thanks a lot in advance.

  2. #2
    Join Date
    Jun 2011
    Posts
    487

    Re: How can I OCR any PDF file on ubuntu?

    You need to perform the steps below to get this working.
    The very first thing is to install all the necessary packages.
    Code:
    sudo apt-get install tesseract-ocr tesseract-ocr-eng xpdf-reader xpdf imagemagick xpdf-utils
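    First check whether you really need OCR. If the PDF already contains a text layer, pdftotext can extract it directly (yourfile.pdf is a placeholder for your own file):
    Code:
    pdftotext yourfile.pdf
    The pdftotext utility is provided by xpdf-utils (on newer Ubuntu releases, by poppler-utils). It writes the text to yourfile.txt; if that file comes out empty, the PDF holds only scanned images and needs OCR.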
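
    The check above can also be scripted. The following is only a rough sketch: the function name has_text_layer and the threshold of 20 characters are my own assumptions, not part of any package. It reads whatever pdftotext extracted and decides whether there is enough real text to skip OCR.

```shell
#!/bin/sh
# has_text_layer: read extracted text on stdin and succeed (exit 0)
# if it contains at least MIN_CHARS non-whitespace characters.
# MIN_CHARS is an arbitrary threshold chosen for illustration.
MIN_CHARS=20

has_text_layer() {
    # tr deletes all whitespace (including the form feeds pdftotext
    # puts between pages); wc -c counts what is left.
    # The $(( )) wrapper strips any padding wc may add to its output.
    count=$(( $(tr -d '[:space:]' | wc -c) ))
    [ "$count" -ge "$MIN_CHARS" ]
}

# Typical use (needs pdftotext; "scan.pdf" is a placeholder name):
#   if pdftotext scan.pdf - | has_text_layer; then
#       echo "PDF already has text, OCR is probably not needed"
#   else
#       echo "No usable text layer, run OCR"
#   fi
```

    Passing - as the output name makes pdftotext write to standard output, so no temporary file is needed.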

    Below is a shell script that will attempt to OCR your file. Place it somewhere on your $PATH so that you can run it from the directory containing the PDF file.
    Code:
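    #!/bin/sh
    # OCR the first 10 pages of one PDF into pdf-ocr-output.txt
    mkdir tmp || exit 1
    cp "$1" tmp/input.pdf || exit 1
    cd tmp || exit 1
    # render pages 1-10 at 600 dpi as ocrbook-*.ppm
    pdftoppm -f 1 -l 10 -r 600 input.pdf ocrbook
    for i in *.ppm; do convert "$i" "`basename "$i" .ppm`.tif"; done
    # -l eng matches the tesseract-ocr-eng package installed above
    for i in *.tif; do tesseract "$i" "`basename "$i" .tif`" -l eng; done
    # append every page to one output file, with a marker after each page
    for i in ocrbook*.txt; do cat "$i" >> pdf-ocr-output.txt; echo "[pagebreak]" >> pdf-ocr-output.txt; done
    mv pdf-ocr-output.txt ..
    rm *
    cd ..
    rmdir tmp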
    Note that the script converts every single page of the PDF file into a TIFF image of roughly 100 MB, so hard drive usage grows with each page of the PDF while the program is running.
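
    To see where the 100 MB figure comes from, here is a small back-of-the-envelope calculation. It assumes a US Letter page (8.5 x 11 inches) stored as uncompressed 24-bit colour; the function name page_bytes is just something I made up for this sketch.

```shell
#!/bin/sh
# page_bytes WIDTH_TENTHS HEIGHT_TENTHS DPI
# Page size is given in tenths of an inch so integer arithmetic suffices.
# Prints the uncompressed size in bytes of a 24-bit RGB rendering.
page_bytes() {
    width_px=$(( $1 * $3 / 10 ))
    height_px=$(( $2 * $3 / 10 ))
    echo $(( width_px * height_px * 3 ))   # 3 bytes per pixel
}

# US Letter (8.5 x 11 in) at 600 dpi: 5100 x 6600 pixels
page_bytes 85 110 600    # prints 100980000 (about 100 MB)
# The same page at 300 dpi needs a quarter of that
page_bytes 85 110 300    # prints 25245000
```

    So if disk space is tight, lowering the -r value passed to pdftoppm is the obvious knob to turn, at some cost in recognition quality on small print.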


    To use the script, follow the steps below.
    Copy the script into your favourite text editor and save it. Then make it executable:
    Code:
    cd /path/to/saved/file/
    chmod +x filename
    Now you can run the script. If you saved it in a directory that is on your $PATH, type its file name at the console, followed by the name of the file you want to OCR. Otherwise you will need to cd to /path/to/saved/file first and run it like this:
    ./filename "PDF file you wish to ocr"
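
    If you are not sure whether a directory is on your $PATH, a quick test like the one below can tell you. This is only a sketch: on_path is a helper name I invented, and $HOME/bin is just an example location for personal scripts.

```shell
#!/bin/sh
# on_path DIR [PATH_LIST]
# Succeeds if DIR is one of the colon-separated entries of PATH_LIST
# (defaults to the current $PATH).
on_path() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# "$HOME/bin" is only an example location for personal scripts.
if on_path "$HOME/bin"; then
    echo "$HOME/bin is on PATH: the script can be run by name from anywhere"
else
    echo "$HOME/bin is not on PATH: run the script as ./filename from its directory"
fi
```

    Wrapping both the list and the directory in colons means only whole entries match, so /opt does not accidentally match /opt/x.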

  3. #3
    Join Date
    Jan 2012
    Posts
    44

    Re: How can I OCR any PDF file on ubuntu?

    Optical Character Recognition, as the name says, tries to determine characters from the shapes in an image. Detecting the actual font and reconstructing the text layout, so that the result looks like the original image, seems considerably harder. Can you tell me of any tool that offers the functionality below?

    • performs OCR
    • performs font recognition
    • reconstructs the layout of the original document
    • generates a document that closely resembles the original scan

  4. #4
    Join Date
    Jun 2011
    Posts
    635

    Re: How can I OCR any PDF file on ubuntu?

    I am aware of some programs that should do what you need.
    1. Adobe Acrobat Professional
    2. ABBYY FineReader
    3. Iris
    4. Erus
    5. Nuance OmniPage

  5. #5
    Join Date
    Aug 2011
    Posts
    460

    Re: How can I OCR any PDF file on ubuntu?

    Here is an updated version of the script.
    Code:
    #!/bin/bash
    
    TESS_LANG=eng
    rflag=
    # first figure out what args we have (-r DEGREES rotates each page)
    while getopts 'r:' OPT
    do
        if [ "$OPT" = "r" ]
        then
            rflag="-rotate $OPTARG";
        fi
    done
    shift $(($OPTIND - 1))
    
    CURRENT_DIR=`pwd`
    SCRIPT_NAME=`basename "$0" .sh`
    TMP_DIR=${SCRIPT_NAME}-tmp
    mkdir ${TMP_DIR}
    
    for thisfile in "$@"
    do
        NAME=`basename "${thisfile}" .pdf`
        # copy under a local name so that paths with directories still work after cd
        cp "$thisfile" "${TMP_DIR}/${NAME}.pdf"
        cd "${TMP_DIR}"
    
        echo "Examining: ${thisfile}";
        pgs=`pdfinfo "${NAME}.pdf" | grep Pages | awk '{print $2}'`
        echo "Found ${pgs} pages; converting...";
        # it's only fair, since we're suppressing it later...
        echo "Tesseract Open Source OCR Engine";
        for x in `seq 1 ${pgs}`
        do
            echo -en "  Page ${x}...";
            pdftoppm -f $x -l $x -r 600 "${NAME}.pdf" ocrbook;
            # pdftoppm may zero-pad the page number (e.g. ocrbook-01.ppm), so match
            # with a glob; only one page image exists at a time (removed below)
            BASE=`basename ocrbook-*.ppm .ppm`;
            convert ${BASE}.ppm ${rflag} ${BASE}.tif;
            tesseract ${BASE}.tif ${BASE} -l ${TESS_LANG} > /dev/null 2>&1;
            cat ${BASE}.txt >> "${NAME}.txt";
            echo "[pagebreak]" >> "${NAME}.txt";
            rm ocrbook*;
            echo "done";
        done;
    
        echo "Conversion complete";
    
        mv "${NAME}.txt" "${CURRENT_DIR}"
        rm *
        cd "${CURRENT_DIR}"
    done
    
    rmdir ${TMP_DIR}
    For example, if you run
    ocrpdf.sh -r 90 file1.pdf file\ name\ 2.pdf
    it will create two new files, file1.txt and "file name 2.txt", in the current directory. The -r 90 option rotates every page by 90 degrees before OCR.
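
    One thing to watch in the script: it finds the page count by grepping the pdfinfo output for the word Pages, which can also match other lines, for instance a document title containing that word. A slightly stricter variant might look like the sketch below. The function name page_count is my own, and the sample text only imitates pdfinfo output; it is not from a real file.

```shell
#!/bin/sh
# page_count: read pdfinfo output on stdin and print the page count.
# Anchoring on "^Pages:" avoids matching other lines that merely
# contain the word "Pages" (for example in a title).
page_count() {
    awk '/^Pages:/ { print $2; exit }'
}

# Sample text imitating pdfinfo output (not from a real file):
printf 'Title:          Pages of History\nPages:          12\nEncrypted:      no\n' | page_count
# prints 12
```

    In the script this would be used as pdfinfo "file.pdf" | page_count in place of the grep and awk pair.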
