regex - How to transform structured text files into a PHP multidimensional array


I have 100 files, each containing some number of news articles. The articles are structured into sections marked by the following abbreviations:

hd by wc pd sn sc pg la cy lp td co in ns re ipc pub an

where [lp] and [td] can contain any number of paragraphs.

A typical article looks like this:

    hd corporate news: alcoa earnings soar; outlook stays upbeat
    by james r. hagerty , matthew day
    wc 421 words
    pd 12 july 2011
    sn wall street journal
    sc j
    pg b7
    la english
    cy (copyright (c) 2011, dow jones & company, inc.)

    lp
    alcoa inc.'s profit more doubled in second quarter, giant aluminum producer managed meet analysts' lowered forecasts.

    alcoa serves bellwether u.s. corporate earnings because first major company report , draws demand wide range of industries.

    td
    results marked test of how corporate optimism holding in face of bleak economic news.

    license article dow jones reprint service[http://www.djreprints.com/link/link.html?factiva=wjco20110712000115]

    co almam : alcoa inc
    in i2245 : aluminum | i22 : primary metals | i224 : non-ferrous metals | imet : metals/mining
    ns c15 : performance | c151 : earnings | c1521 : analyst comment/recommendation | ccat : corporate/industrial news | c152 : earnings projections | ncat : content types | nfact : factiva filters | nfce : fc&e exclusion filter | nfcpin : fc&e industry news filter
    re usa : united states | use : northeast u.s. | uspa : pennsylvania | namz : north america
    ipc djcs | ewr | bsc | nnd | cns | lmj | tpt
    pub dow jones & company, inc.
    an document j000000020110712e77c00035

After each article, there are 4 newlines before the next article starts. I need to put these articles into an array, as follows:

    $articles = array(
      [0] = array (
        [hd] => corporate news: alcoa earnings soar; outlook...
        [by] => james r. hagerty...
        ...
        [an] => document j000000020110712e77c00035
      )
    )

[Edit] @Casimir et Hippolyte, this is what I have:

    $path = "c:/path/to/textfiles/";

    if ($handle = opendir($path)) {
      while (false !== ($file = readdir($handle))) {
        if ('.' === $file) continue;
        if ('..' === $file) continue;

        $text = file_get_contents($path . $file);
        $subjects = explode("\r\n\r\n\r\n\r\n", $text);

        $pattern = <<<'lod'
            ~
            # definition
            (?(DEFINE)(?<fieldname>(?<=^|\n)(?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)))
            # pattern
            \G(?<key>\g<fieldname>)\s++(?<value>[^\n]++(?>\n{1,2}+(?!\g<fieldname>) [^\n]++ )*+)(?>\n{1,3}|$)
            ~x
    lod;

        $result = array();
        foreach ($subjects as $i => $subject) {
          if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
              $result[$i][$match['key']] = $match['value'];
            }
          }
        }
      }
      closedir($handle);
      echo '<pre>';
      print_r($result);
    }

However, no matches are found and no errors are produced. Can anyone tell me what goes wrong here?

One way is to use explode to separate the blocks and a regex to extract the fields:

    $pattern = <<<'lod'
    ~
    # definition
    (?(DEFINE)
        (?<fieldname> (?<=^|\n)
                      (?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)
        )
    )

    # pattern
    \G(?<key>\g<fieldname>) \s++
    (?<value>
        [^\n]++
        (?> \n{1,2}+ (?!\g<fieldname>) [^\n]++ )*+
    )
    (?>\n{1,3}|$)
    ~x
    lod;

    $subjects = explode("\n\n\n\n", $text);
    $result = array();

    foreach ($subjects as $i => $subject) {
        if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                $result[$i][$match['key']] = $match['value'];
            }
        }
    }
    echo '<pre>';
    print_r($result);

Pattern details:

The pattern is divided into 2 parts:

  • the definitions: where you can write subpatterns to use later
  • the pattern itself

In the definition part I write a subpattern named fieldname, where I put all the field names and a condition at the beginning. This condition checks that the fieldname is preceded by the start of the string (^) or a newline (\n), to avoid capturing the same letters inside a paragraph, for example.

Description of the pattern part:
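To see what the lookbehind buys, here is a minimal sketch that compares a bare alternation with the anchored one. The sample string and the variable names $loose and $strict are made up for illustration, and only two field names are used to keep it short.

```php
<?php
// Invented sample: "td" starts the second line and also appears mid-sentence.
$text = "lp first paragraph\ntd the td field starts here";

// Bare alternation: matches "lp" or "td" anywhere, even inside a line.
preg_match_all('~(?>lp|td)~', $text, $loose, PREG_OFFSET_CAPTURE);

// Anchored alternation: only at the start of the string or right after a newline.
preg_match_all('~(?<=^|\n)(?>lp|td)~', $text, $strict, PREG_OFFSET_CAPTURE);

echo count($loose[0]), "\n";  // 3
echo count($strict[0]), "\n"; // 2
```

The mid-line "td" is dropped by the second pattern, which is the behaviour the fieldname subpattern relies on.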

    \G                        # forces the match to be contiguous to the
                              # precedent match or the start of the string (no gap)
    (?<key> \g<fieldname> )   # a capturing group named "key" for the fieldname
    \s++                      # 1 or more white characters
    (?<value>                 # open a capturing group named "value" for the
                              # field content
        [^\n]++               # all characters except newlines 1 or more times
        (?>                   # open an atomic group
            \n{1,2}+          # 1 or 2 newlines to allow paragraphs (lp & td)
            (?!\g<fieldname>) # but not followed by a fieldname (only a check)
            [^\n]++           # all characters except newlines 1 or more times
        )*+                   # close the atomic group and repeat 0 or more times
    )                         # close the capture group "value"
    (?>\n{1,3}|$)             # between 1 and 3 newlines max. or the end of the
                              # string (necessary if I want contiguous matches)
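The effect of \G can be checked in isolation. The sketch below uses a cut-down, hypothetical pattern (two field names, single-line values) and an invented three-line subject. It is not the full pattern, only an illustration of how \G stops the scan at the first line that is not a field.

```php
<?php
// Invented subject: the middle line does not start with a field name.
$subject = "hd title\nnot a field line\npd 12 july 2011";

// Simplified pattern, once with the \G anchor and once without it.
$anchored   = '~\G(?<key>hd|pd)\s++(?<value>[^\n]++)(?>\n|$)~';
$unanchored = '~(?<key>hd|pd)\s++(?<value>[^\n]++)(?>\n|$)~';

preg_match_all($anchored, $subject, $contiguous);
preg_match_all($unanchored, $subject, $scattered);

print_r($contiguous['key']); // only "hd": the scan stops at the gap
print_r($scattered['key']);  // "hd" and "pd": the gap is skipped over
```

With \G, a malformed block yields fewer fields instead of silently skipping text, which makes broken input easier to notice.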

The x modifier at the end of $pattern enables verbose mode in the regex (you can put comments after #, and you can format the code as you want with spaces).
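As a small illustration of verbose mode, the two patterns below are meant to be equivalent: one written compactly, one spread over several lines with comments. The patterns and the names $compact and $verbose are invented for this sketch and only handle the wc line of the sample article.

```php
<?php
// Compact form, no x modifier.
$compact = '~(?<key>wc)\s+(?<count>\d+)\s+words~';

// Same pattern in verbose mode: whitespace is ignored, # starts a comment.
$verbose = <<<'lod'
~
(?<key> wc )      # field name
\s+
(?<count> \d+ )   # number of words
\s+ words
~x
lod;

preg_match($compact, 'wc 421 words', $a);
preg_match($verbose, 'wc 421 words', $b);

echo $a['count'], ' ', $b['count'], "\n"; // 421 421
```

Note that a literal space has to be written as \s, [ ] or "\ " once the x modifier is on, since plain spaces no longer match anything.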

Notice: the pattern doesn't care about the order of the fields, or whether they are present or not. For more readability, I use the nowdoc syntax (<<<'abc'), but be careful to use it correctly.

If your text file has Windows format newlines (i.e. \r\n), you must change the pattern to:
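The claim about order and missing fields can be tried on a small block. The subject below is invented (fields out of order, most of them absent, one two-paragraph lp), and its lines are chosen so that no continuation line happens to begin with a field abbreviation. The pattern is the one from the answer, only laid out more compactly.

```php
<?php
$pattern = <<<'lod'
~
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
                  (?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)
    )
)
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\n]++
    (?> \n{1,2}+ (?!\g<fieldname>) [^\n]++ )*+
)
(?>\n{1,3}|$)
~x
lod;

// Invented block: pd comes before hd, and most fields are missing.
$subject = "pd 12 july 2011\nhd alcoa earnings soar\n"
         . "lp first paragraph\n\nsecond paragraph\n"
         . "an document j000000020110712e77c00035";

$article = array();
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $article[$match['key']] = $match['value'];
    }
}
print_r($article); // keys appear in the order found: pd, hd, lp, an
```

The lp value keeps both paragraphs, separated by the blank line, because the second paragraph does not start with a field name.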

    $pattern = <<<'lod'
    ~
    # definition
    (?(DEFINE)
        (?<fieldname> (?<=^|\n)
                      (?>hd|by|wc|pd|sn|sc|pg|la|cy|lp|td|co|in|ns|re|ipc|pub|an)
        )
    )

    # pattern
    \G(?<key>\g<fieldname>) \s++
    (?<value>
        [^\r\n]++
        (?> (?>\r?\n){1,2}+ (?!\g<fieldname>) [^\r\n]++ )*+
    )
    (?>(?>\r?\n){1,3}|$)
    ~x
    lod;

    $subjects = explode("\r\n\r\n\r\n\r\n", $text);
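A different route, not the one shown above, is to normalize the line endings once and keep the \n-only pattern and delimiter. This is only a sketch under the assumption that converting \r\n to \n in memory is acceptable for your data. The two-article input is invented, and the second article is purely hypothetical.

```php
<?php
// Invented input with Windows line endings and 4 newlines between articles.
$text = "hd first article\r\npd 12 july 2011\r\n\r\n\r\n\r\n"
      . "hd second article\r\npd 13 july 2011";

// Convert every \r\n to \n so the \n-only pattern and delimiter still apply.
$normalized = str_replace("\r\n", "\n", $text);
$subjects = explode("\n\n\n\n", $normalized);

echo count($subjects), "\n"; // 2
print_r($subjects);
```

This keeps a single pattern for both kinds of files, at the cost of one extra pass over each file's content.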
