Cleaning a CSV file using Linux commands, while ignoring commas between quotes
I've got a rather large CSV file that I need to import into a PostgreSQL database, but it contains many errors that prevent importing it using the SQL COPY command. I've managed to fix the issue of quotation marks being in places they shouldn't be, using this awk command:
awk -f\" 'nf % 2 == 1 { print $0 }' ./db_downloaded.csv > ./db_sanitized.csv
This drops any line that has an odd number of quotation marks on it, and it works quite well. The other common problem I run into is an incorrect number of columns. My current solution to that problem is to use the following awk command to drop any line that has the wrong number of commas:
awk -f"," 'nf == 40 { print $0 }' ./db_sanitized.csv > ./db_sanitized2.csv
However, this also drops rows that have commas between quotes, and there are quite a few of those. Now, it's not critical that every single row gets imported, but I'd like it if I had a way to count only the commas that don't appear between quotation marks on each row. The only way I can think of doing this is with some sort of boolean variable that flips every time there's a quotation mark in the stream, turning comma-counting off and on, but I'm not sure where to start if I were to go that way.
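For reference, a minimal awk sketch of that flag idea (assuming the 40-column/39-comma layout above, and using only POSIX awk features) might look like this:

awk '{
  inq = 0; commas = 0
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "\"") inq = !inq           # flip the in-quotes flag on each quote mark
    else if (!inq && c == ",") commas++ # count commas only outside quotes
  }
  if (commas == 39) print
}' ./db_sanitized.csv > ./db_sanitized2.csv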
I'm only interested in using Linux commands like grep, awk, tr, etc. that I can run on the remote server, rather than writing my own C++ program, though I may have to go that route if there's no other way.
Edit: I ended up using a Ruby script:
lines = File.new("/home/db/product_list.csv", "r")
sanitized = File.new("/home/db/product_list_sanitized.csv", "w")
lines.each do |l|
  if l.count("\"") % 2 == 0      # keep only lines with balanced quotes
    if l.count(",") > 39         # too many raw commas: recount, ignoring quoted ones
      u = true                   # true while outside a quoted section
      commas = 0
      l.each_char do |c|
        if c == "\""
          u = (not u)            # flip on every quotation mark
        elsif u && c == ","
          commas += 1            # count only unquoted commas
        end
      end
      sanitized.write(l) if commas == 39
    else
      sanitized.write(l)
    end
  end
end
sanitized.close
Stash the original line, replace each quoted set with a more well-behaved token, and check against that; print the original if it matches:
awk -F, '{line=$0;gsub(/"[^"]*"/,"x")}NF%2{print line}' test.in
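To see the substitution at work on a hypothetical line: gsub rewrites $0 (so NF is recomputed from the quote-free text) while line still holds the original:

echo 'a,"b,c",d' | awk -F, '{line=$0;gsub(/"[^"]*"/,"x")}NF%2{print line}'
# $0 becomes a,x,d (NF=3, odd), so the original a,"b,c",d is printed

For the fixed 40-column requirement in the question, the test would presumably become NF == 40 on the rewritten line instead of NF%2.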