Web scraping - Pretend to be Firefox instead of PhantomJS
When I try to scrape this site with PhantomJS, by default PhantomJS sends the following header to the server:
{"name":"User-Agent", "value":"Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.1 Safari/534.34"}
and I get a status 405 "Not Allowed" response.
I read in the PhantomJS API reference that, in order to imitate a request coming from another browser, I should change the User-Agent value. On Wikipedia I found the value I should use to pretend to be Firefox under Ubuntu:
'name': 'User-Agent', 'value': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20120815 Firefox/16.0'
In which part of the PhantomJS script should I put these properties? Where should I insert them: inside page.open, inside page.evaluate, or at the top of the script?
Actually, you set it on page.settings, before calling page.open.
Here is an example that uses it against the page you linked:
var page = require('webpage').create();

// Set the user agent before calling page.open
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';

page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function() {
    // Give the page a second to render its table
    window.setTimeout(function() {
        var output = page.evaluate(function() {
            return document.getElementById('tournamentTable')
                .getElementsByClassName('deactivate')[0]
                .getElementsByTagName('a')[0]
                .textContent;
        });
        console.log(output);
        phantom.exit();
    }, 1000);
});
This example scrapes the match name in the first row of the table (which, at this precise moment, is "San Francisco Giants - Boston Red Sox").
About your comment: yes, you can use jQuery under PhantomJS! Check this example:
var page = require('webpage').create();

// Set the user agent before calling page.open
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';

page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function() {
    window.setTimeout(function() {
        // Inject jQuery into the page, then query with it inside page.evaluate
        page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js", function() {
            var output = page.evaluate(function () {
                return jQuery('#tournamentTable .deactivate:first a:first').text();
            });
            console.log(output);
            phantom.exit();
        });
    }, 1000);
});
By the way, for waiting, instead of the window.setTimeout I used in these examples, I recommend using waitfor.js.
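To illustrate the idea behind waiting on a condition rather than a fixed delay, here is a minimal sketch of a polling helper. It is not the actual waitfor.js file. The name waitFor, its parameters (testFn, onReady, onTimeout, timeoutMs, intervalMs), and the default values are assumptions made for this example.

```javascript
// Hypothetical helper (not the real waitfor.js): poll testFn every intervalMs
// until it returns a truthy value or timeoutMs has elapsed.
function waitFor(testFn, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    timeoutMs = timeoutMs || 3000;   // assumed default: give up after 3 seconds
    intervalMs = intervalMs || 250;  // assumed default: check four times per second

    var timer = setInterval(function () {
        if (testFn()) {
            // Condition met: stop polling and continue
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            // Condition never became true in time
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}
```

Inside the page.open callback, testFn could be a function that calls page.evaluate and returns whether the tournamentTable element exists yet, onReady would do the scraping, and onTimeout would log an error and call phantom.exit(1). This way the script continues as soon as the table appears instead of always waiting a full second.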