regex - How to transform structured textfiles into PHP multidimensional array
I have 100 files, each containing some number of news articles. The articles are structured into sections introduced by the following abbreviations:
hd wc pd sn sc pg la cy lp td co in ns re ipc pub
where [lp] and [td] can contain any number of paragraphs.
A typical message looks like:
hd corporate news: alcoa earnings soar; outlook stays upbeat
by james r. hagerty , matthew day
wc 421 words
pd 12 july 2011
sn wall street journal
sc j
pg b7
la english
cy (copyright (c) 2011, dow jones & company, inc.)
lp alcoa inc.'s profit more doubled in second quarter, giant aluminum producer managed meet analysts' lowered forecasts. alcoa serves bellwether u.s. corporate earnings because first major company report , draws demand wide range of industries.
td results marked test of how corporate optimism holding in face of bleak economic news. license article dow jones reprint service[http://www.djreprints.com/link/link.html?factiva=wjco20110712000115]
co almam : alcoa inc
in i2245 : aluminum | i22 : primary metals | i224 : non-ferrous metals | imet : metals/mining
ns c15 : performance | c151 : earnings | c1521 : analyst comment/recommendation | ccat : corporate/industrial news | c152 : earnings projections | ncat : content types | nfact : factiva filters | nfce : fc&e exclusion filter | nfcpin : fc&e industry news filter
re usa : united states | use : northeast u.s. | uspa : pennsylvania | namz : north america
ipc djcs | ewr | bsc | nnd | cns | lmj | tpt
pub dow jones & company, inc.
an document j000000020110712e77c00035
After each article, there are 4 newlines before a new article starts. I need to put these articles into an array, as follows:
$articles = array(
    [0] => array(
        [hd] => corporate news: alcoa earnings soar; outlook...
        [by] => james r. hagerty...
        ...
        [an] => document j000000020110712e77c00035
    )
)
[edit]
@Casimir et Hippolyte, I now have:
$path = "c:/path/to/textfiles/";

if ($handle = opendir($path)) {
    while (false !== ($file = readdir($handle))) {
        if ('.' === $file) continue;
        if ('..' === $file) continue;

        $text = file_get_contents($path . $file);
        $subjects = explode("\r\n\r\n\r\n\r\n", $text);

        $pattern = <<<'lod'
~
# definition
(?(DEFINE)(?<fieldname>(?<=^|\n)(?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)))

# pattern
\G(?<key>\g<fieldname>)\s++(?<value>[^\n]++(?>\n{1,2}+(?!\g<fieldname>) [^\n]++ )*+)(?>\n{1,3}|$)
~x
lod;

        $result = array();
        foreach ($subjects as $i => $subject) {
            if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
                foreach ($matches as $match) {
                    $result[$i][$match['key']] = $match['value'];
                }
            }
        }
    }
    closedir($handle);
    echo '<pre>';
    print_r($result);
}
However, no matches are being found, nor are any errors produced. Can anyone tell me what goes wrong here?
One way is to use explode to separate the blocks, and a regex to extract the fields:
$pattern = <<<'lod'
~
# definition
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
        (?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)
    )
)

# pattern
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\n]++
    (?> \n{1,2}+ (?!\g<fieldname>) [^\n]++ )*+
)
(?>\n{1,3}|$)
~x
lod;

$subjects = explode("\n\n\n\n", $text);
$result = array();

foreach ($subjects as $i => $subject) {
    if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $result[$i][$match['key']] = $match['value'];
        }
    }
}

echo '<pre>';
print_r($result);
Pattern details:
The pattern is divided into 2 parts:
- the definitions: where you can write subpatterns to use later
- the pattern itself
In the definition part, I write a subpattern named fieldname, in which I put all the field names and a condition at the beginning. The condition checks that the field name is preceded by the start of the string (^) or by a newline (\n), to avoid capturing the same letters inside a paragraph, for example.
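To see what the line-start condition buys you, here is a minimal sketch. The sample string and the two cut-down patterns (only lp and in) are made up for illustration and are not part of the pattern above:

```php
<?php
// Hypothetical sample: "in" occurs mid-paragraph and again as a field name on its own line.
$text = "lp profit doubled in second quarter\nin i2245 : aluminum";

// Without the condition, any "lp" or "in" followed by whitespace matches,
// including the "in" inside the paragraph.
preg_match_all('~(?>lp|in)\s~', $text, $loose, PREG_OFFSET_CAPTURE);

// With the lookbehind, a name only matches at the start of the string or right after a newline.
preg_match_all('~(?<=^|\n)(?>lp|in)\s~', $text, $strict, PREG_OFFSET_CAPTURE);

// Keep only the byte offsets of the matches.
$looseOffsets  = array_column($loose[0], 1);
$strictOffsets = array_column($strict[0], 1);

echo count($looseOffsets), " matches without the condition\n";
echo count($strictOffsets), " matches with the condition\n";
```

The second pattern drops the mid-paragraph match and keeps the two names that really start a line.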
Description of the pattern part:
\G                         # forces the match to be contiguous to the
                           # precedent match or the start of the string (no gap)
(?<key> \g<fieldname> )    # a capturing group named "key" for the fieldname
\s++                       # 1 or more white characters
(?<value>                  # open a capturing group named "value" for the
                           # field content
    [^\n]++                # all characters except newlines 1 or more times
    (?>                    # open an atomic group
        \n{1,2}+           # 1 or 2 newlines to allow paragraphs (lp & td)
        (?!\g<fieldname>)  # but not followed by a fieldname (only a check)
        [^\n]++            # all characters except newlines 1 or more times
    )*+                    # close the atomic group and repeat 0 or more times
)                          # close the capture group "value"
(?>\n{1,3}|$)              # between 1 and 3 newlines max. or the end of the
                           # string (necessary if I want contiguous matches)
The x modifier at the end of $pattern enables verbose mode in the regex (you can put comments inside after a #, and you can format the code as you want with spaces).
Notice: the pattern doesn't care about the order of the fields or whether they are present or not. For more readability, I use the nowdoc syntax (<<<'abc'), but be careful to use it correctly.
If your text file uses Windows-format newlines (i.e. \r\n), you must change the pattern to:
$pattern = <<<'lod'
~
# definition
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
        (?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)
    )
)

# pattern
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\r\n]++
    (?> (?>\r?\n){1,2}+ (?!\g<fieldname>) [^\r\n]++ )*+
)
(?>(?>\r?\n){1,3}|$)
~x
lod;

$subjects = explode("\r\n\r\n\r\n\r\n", $text);
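If you would rather not maintain two patterns, a different approach (not the one shown above) is to normalize the newlines first and keep the \n-only pattern. This is only a sketch: the two-article $text is an invented sample, and the variable $normalized is a name chosen for illustration.

```php
<?php
// Invented sample with Windows newlines: two articles separated by 4 newlines.
$text = "hd first headline\r\nwc 421 words\r\n\r\n\r\n\r\nhd second headline\r\nwc 99 words";

// Turn \r\n (and any lone \r) into \n so the \n-only pattern and separator apply.
$normalized = preg_replace('~\r\n?~', "\n", $text);

// Same \n-only pattern as in the answer.
$pattern = <<<'lod'
~
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
        (?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)
    )
)
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\n]++
    (?> \n{1,2}+ (?!\g<fieldname>) [^\n]++ )*+
)
(?>\n{1,3}|$)
~x
lod;

$result = array();
foreach (explode("\n\n\n\n", $normalized) as $i => $subject) {
    if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $result[$i][$match['key']] = $match['value'];
        }
    }
}

print_r($result);
```

The trade-off is one extra pass over each file in exchange for a single pattern that handles both newline styles.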