python - Looking for recommendation on how to convert PDF into structured format


I'm doing analysis on properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format; instead it provides a 700+ page PDF of the properties going to auction.

I'm wondering if the community has any thoughts on how I can approach parsing said PDF into a structured format, either for insertion into a DB or to create a spreadsheet of the properties.

Here's an image of what each page represents: property guide

And here's a page that lists properties: sample list of properties

I'm comfortable with Python and Ruby, so I don't have issues scripting a solution, but because the "columns" and the data in said columns aren't necessarily tied together, it seems like a dubious proposition.

Any ideas are appreciated.

Convert it to text with Xpdf, using the pdftotext command.

I converted the file with the following:

pdftotext.exe -layout -f 23 -l 510 auctionbook2013.pdf auctionbook2013.txt

This conversion leaves the text in its original layout (due to the -layout option). The options -f and -l indicate the first and last page numbers of the range of pages to extract.
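If you'd rather drive the conversion from a script, something like the following should work. It is only a sketch: it assumes pdftotext is on your PATH, and the function names `build_pdftotext_cmd` and `convert_pages` are made up for illustration. The flags are the same ones used in the command above.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, first_page, last_page, exe="pdftotext"):
    """Build the argument list for a layout-preserving pdftotext run.

    `exe` is an assumption: on Windows it may need to be "pdftotext.exe"
    or a full path to the Xpdf binary.
    """
    return [
        exe,
        "-layout",             # keep the original physical layout
        "-f", str(first_page),  # first page to extract
        "-l", str(last_page),   # last page to extract
        pdf_path,
        txt_path,
    ]


def convert_pages(pdf_path, txt_path, first_page, last_page, exe="pdftotext"):
    """Run pdftotext and raise CalledProcessError if it exits non-zero."""
    cmd = build_pdftotext_cmd(pdf_path, txt_path, first_page, last_page, exe)
    subprocess.run(cmd, check=True)


# Example call (not executed here):
# convert_pages("auctionbook2013.pdf", "auctionbook2013.txt", 23, 510)
```

Passing the arguments as a list avoids shell quoting problems if the file name contains spaces.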

From there, parsing should be simple: a number in column 8 indicates the first line of a record, and a blank line ends the record. Follow the guide for the exact positioning of elements within a record.
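A rough sketch of that record splitting in Python is below. It rests on a few assumptions you should check against the real text file: "column 8" is taken to be 1-based (string index 7), a record only starts when no record is currently open, and lines outside any record (page headers and the like) are skipped. The function name `split_records` is just illustrative.

```python
def split_records(lines, number_col=8):
    """Group lines of pdftotext -layout output into records.

    A record starts at a line with a digit in column `number_col`
    (1-based, an assumption) and ends at the next blank line.
    Returns a list of records, each a list of lines.
    """
    idx = number_col - 1  # convert the 1-based column to a string index
    records = []
    current = None  # lines of the record being collected, or None

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the open record, if there is one.
            if current is not None:
                records.append(current)
                current = None
            continue
        if current is None:
            # Outside a record: only a digit in the marker column starts one.
            if len(line) > idx and line[idx].isdigit():
                current = [line]
            # Anything else (headers, footers) is ignored.
        else:
            current.append(line)

    if current is not None:
        # The file ended without a trailing blank line.
        records.append(current)
    return records


# Example call (not executed here):
# with open("auctionbook2013.txt", encoding="utf-8", errors="replace") as fh:
#     records = split_records(fh)
```

Each record keeps its lines unmodified, so the fields can then be sliced out by character position according to the guide page, and the results written to a spreadsheet with the csv module.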

