Cleaning a CSV file using linux commands, while ignoring commas between quotes -


i've got rather large csv file need import postgresql database, virtually contains many errors import using sql copy command. i've managed fix issue quotation marks being in places shouldn't, using awk command:

awk -f\" 'nf % 2 == 1 { print $0 }' ./db_downloaded.csv > ./db_sanitized.csv

this drops line has odd number of quotation marks on it, , works quite well. other common problem run incorrect number of columns. current solution problem use following awk command drop line has wrong number of commas:

awk -f"," 'nf == 40 { print $0 }' ./db_sanitized.csv > ./db_sanitized2.csv

however, makes drop rows have commas between quotes, quite few rows. now, it's not critical every single row imported, i'd if had way count commas didn't appear between quotation marks on row. way can think have sort of boolean variable flips every time there's quotation mark in stream, turn off checking commas, i'm not sure start if going that.

i'm interested in using linux commands grep, awk, tr, etc. can run on remote server, rather writing own c++ program, may have go route if there's no other way.

edit: ended using ruby script:

lines = file.new("/home/db/product_list.csv", "r") sanitized = file.new("/home/db/product_list_sanitized.csv", "w")  lines.each |l|   if l.count("\"") % 2 == 0     if l.count(",") > 39       u = true       commas = 0       l.each_char |c|         if (c == "\"")           u = (not u)         elsif (u && (c == ","))           commas += 1         end       end       sanitized.write(l) if commas == 39     else       sanitized.write(l)     end   end end  sanitized.close 

stash original line, replace sets of quotes more well-behaved token, , check against that; print original if matches:

awk -f, '{line=$0;gsub(/"[^"]*"/,"x")}nf%2{print line}' test.in 

Comments

Popular posts from this blog

Detect support for Shoutcast ICY MP3 without navigator.userAgent in Firefox? -

web - SVG not rendering properly in Firefox -

java - JavaFX 2 slider labelFormatter not being used -