python - Looking for recommendations on how to convert a PDF into a structured format
I'm doing an analysis of properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format and instead provides a 700+ page PDF of the properties going to auction.
I'm wondering if the community has any thoughts on how I can approach parsing said PDF into a structured format for insertion into a DB, or for creating a spreadsheet of the properties.
Here's an image of what each page represents:
And here's a page that lists properties:
I'm comfortable with Python and Ruby, so I don't have issues scripting a solution, but because the "columns" and the data in said columns aren't necessarily tied together, it seems like a dubious proposition.
Any ideas are appreciated.
Convert it to text with xpdf, using the command pdftotext.
I converted your file with the following:
pdftotext.exe -layout -f 23 -l 510 auctionbook2013.pdf auctionbook2013.txt
This conversion leaves the text exactly in its original layout (due to the -layout option). The options -f and -l indicate the first and last page numbers of the range of pages to extract.
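If you would rather drive the conversion from Python than from the command line, something like the following should work. It is only a sketch: it assumes pdftotext is on your PATH, and the helper names build_pdftotext_cmd and convert are made up for illustration. The flags are the same -layout, -f and -l used above.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, first_page, last_page, exe="pdftotext"):
    """Return the argument list for a layout-preserving page-range extraction."""
    return [
        exe,
        "-layout",             # keep the original physical layout of the text
        "-f", str(first_page),  # first page to extract
        "-l", str(last_page),   # last page to extract
        pdf_path,
        txt_path,
    ]


def convert(pdf_path, txt_path, first_page, last_page):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    cmd = build_pdftotext_cmd(pdf_path, txt_path, first_page, last_page)
    subprocess.run(cmd, check=True)
```

Calling convert("auctionbook2013.pdf", "auctionbook2013.txt", 23, 510) would then reproduce the command shown above.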
From there, parsing should be simple: a number in column 8 indicates the first line of a record, and a blank line ends the record. Follow the guide for the exact positioning of the elements within a record.
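Here is a rough Python sketch of that record-splitting rule. A few assumptions on my part: "column 8" is taken as 1-based, so the code checks index 7; form-feed characters (which pdftotext puts at page breaks) are dropped so they don't shift the columns; and the field positions in FIELDS are placeholders I invented, so replace them with the real offsets you read off the converted text.

```python
def split_records(lines, start_col=7):
    """Group lines into records.

    A record starts on a line with a digit at index start_col (assumed to be
    "column 8", 1-based) and ends at the next blank line. Lines outside any
    record, such as page headers, are skipped.
    """
    records = []
    current = None
    for raw in lines:
        # Drop the newline and any form feed marking a page break.
        line = raw.rstrip("\n").replace("\f", "")
        if not line.strip():
            # A blank line closes the record in progress, if there is one.
            if current is not None:
                records.append(current)
                current = None
            continue
        if current is None:
            if line[start_col:start_col + 1].isdigit():
                current = [line]
        else:
            current.append(line)
    if current is not None:
        records.append(current)
    return records


# Hypothetical (start, end) offsets; the real ones depend on the auction book.
FIELDS = {"item": (7, 11), "address": (12, 40)}


def slice_fields(line, fields=FIELDS):
    """Cut fixed-width fields out of one line and strip the padding."""
    return {name: line[start:end].strip() for name, (start, end) in fields.items()}
```

You would feed it the converted file, e.g. split_records(open("auctionbook2013.txt")), then apply slice_fields to whichever line of each record holds the values you need and write the resulting dicts out with csv.DictWriter.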