html - Using readHTMLTable with multiple tbody
Suppose I have an HTML table with multiple <tbody> elements, which we know is legal HTML, and I attempt to read it with readHTMLTable as follows:
require(XML)
table.text <- '<table>
<thead>
<tr><th>col1</th><th>col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'
readHTMLTable(table.text)
The output contains only the first <tbody> element:
$`NULL`
  col1 col2
1   1a   2a
and ignores the rest. Is this the expected behavior? (I can't find any mention of it in the documentation.) And what are the most flexible and robust ways to access the entire table?
I'm currently using
table.text <- gsub('</tbody>[[:space:]]*<tbody>', '', table.text)
readHTMLTable(table.text)
which prevents me from using readHTMLTable directly on a URL containing a table like this, and doesn't feel very robust.
If you look at the source for readHTMLTable
getMethod(readHTMLTable, "XMLInternalElementNode")
it contains the line
if (length(tbody)) node = tbody[[1]]
so it is purposefully designed to select the content of only the first tbody. Also, ?readHTMLTable
describes the function as providing
somewhat robust methods for extracting data from HTML tables in an HTML document
It is designed to be a utility function. It is great when it works, but you may need to hack around it.
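One possible way to hack around it is to parse the document yourself and collect the rows from every tbody with XPath, instead of relying on readHTMLTable. The sketch below is only an illustration, not anything the XML package provides: read_all_tbodies is a hypothetical helper name, and it assumes that each table has its header cells in thead, that all data rows sit inside tbody elements, that every row has the same number of td cells, and that there are no rowspan or colspan attributes.

```r
library(XML)

table.text <- '<table>
<thead>
<tr><th>col1</th><th>col2</th></tr>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'

# Hypothetical helper (not part of the XML package): read one <table> node,
# collecting rows from every <tbody> rather than only the first one.
read_all_tbodies <- function(table_node) {
  # The leading "." keeps each XPath query relative to this table node.
  header <- xpathSApply(table_node, "./thead/tr/th", xmlValue)
  rows   <- getNodeSet(table_node, "./tbody/tr")
  # One character vector of cell values per row.
  cells  <- lapply(rows, function(tr) xpathSApply(tr, "./td", xmlValue))
  out    <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
  # Assumption: only use the header when it matches the number of columns.
  if (length(header) == ncol(out)) names(out) <- header
  out
}

doc    <- htmlParse(table.text, asText = TRUE)
tables <- lapply(getNodeSet(doc, "//table"), read_all_tbodies)
tables[[1]]
```

Because the helper works on a parsed table node rather than on the raw text, the same approach should also apply to a document that htmlParse reads from a file or URL, without the gsub step. Tables that do not follow the assumptions above would need extra handling.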