python - Looking for recommendation on how to convert PDF into structured format -

- May 15, 2013

I'm doing analysis on properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format; instead it provides a 700+ page PDF of the properties going to auction.

I'm wondering if the community has any thoughts on how I can approach parsing said PDF into a structured format for insertion into a DB or to create a spreadsheet of the properties.

Here's an image of what each page represents: property guide

And here's a page that lists properties: sample list of properties

I'm comfortable with Python and Ruby, so I don't have issues scripting a solution, but because the "columns" and the data in said columns aren't necessarily tied together, it seems like a dubious proposition.

Any ideas are appreciated.

Convert it to text with Xpdf, using the pdftotext command.

I converted the file with the following:

pdftotext.exe -layout -f 23 -l 510 auctionbook2013.pdf auctionbook2013.txt

This conversion leaves the text exactly in its original layout (due to the -layout option). The options -f and -l indicate the first and last page numbers of the range of pages to extract.
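If you would rather drive the conversion from Python, here is a minimal sketch that builds the same command and hands it to subprocess. The function names build_pdftotext_args and convert_pages are my own, and it assumes the pdftotext executable is on your PATH (pass exe="pdftotext.exe" or a full path otherwise).

```python
import subprocess


def build_pdftotext_args(pdf_path, txt_path, first_page, last_page, exe="pdftotext"):
    """Return the argument list for a layout-preserving pdftotext run."""
    return [
        exe,
        "-layout",               # keep the original physical layout
        "-f", str(first_page),   # first page to extract
        "-l", str(last_page),    # last page to extract
        pdf_path,
        txt_path,
    ]


def convert_pages(pdf_path, txt_path, first_page, last_page, exe="pdftotext"):
    """Run pdftotext and raise CalledProcessError if it exits with an error."""
    args = build_pdftotext_args(pdf_path, txt_path, first_page, last_page, exe)
    subprocess.run(args, check=True)
```

Calling convert_pages("auctionbook2013.pdf", "auctionbook2013.txt", 23, 510) should be equivalent to the command line above.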

From there, parsing should be simple: a number in column 8 indicates the first line of a record, and a blank line ends the record. Follow the guide for the exact positioning of elements within a record.
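A rough sketch of that record-splitting rule in Python follows. It rests on assumptions you should check against your own output: I'm reading "column 8" as the 8th character of the line (1-based), and I drop form-feed characters because pdftotext normally puts one at each page break. The name split_records and the sample lines are made up for illustration; pulling individual fields out of each record by position is left to the guide.

```python
def split_records(lines, marker_col=8):
    """Group layout-preserved text lines into records.

    Assumed rule: a digit in 1-based column `marker_col` starts a record,
    and a blank line ends it. Lines outside any record are skipped.
    """
    idx = marker_col - 1
    records = []
    current = None
    for raw in lines:
        # Drop the newline and any form feed left at a page break.
        line = raw.rstrip("\n").replace("\f", "")
        if not line.strip():
            # A blank line closes the open record, if there is one.
            if current is not None:
                records.append(current)
                current = None
            continue
        if len(line) > idx and line[idx].isdigit():
            # A new record begins; close the previous one if it never ended.
            if current is not None:
                records.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        records.append(current)
    return records


# Made-up sample in the assumed layout, not real auction data.
sample = [
    "AUCTION HEADER\n",
    "\n",
    "       1  123 MAIN ST\n",
    "          OWNER A\n",
    "\n",
    "       2  45 OAK AVE\n",
]
records = split_records(sample)
```

For the real file you would pass the open file object instead of the sample list, for example split_records(open("auctionbook2013.txt")).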
