web scraping - Pretend to be Firefox instead of PhantomJS


When I try to scrape this site with PhantomJS, by default PhantomJS sends the following headers to the server:

{"name":"User-Agent", "value":"Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.1 Safari/534.34"}

and I get a status 405 "Not Allowed" response.

I read in the PhantomJS API reference that, in order to imitate a request coming from another browser, I should change the User-Agent value. On Wikipedia I found the value I should use to pretend to be Firefox under Ubuntu:

'name': 'User-Agent', 'value': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20120815 Firefox/16.0'

In which part of the PhantomJS script should I put these properties? Where should I insert them: inside page.open, inside page.evaluate, or at the top of the script?

Actually, on page.settings, before calling page.open.

Here is an example run against the page you linked:

var page = require('webpage').create();
// Set the User-Agent before opening the page, so it applies to the request.
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function() {
    // Give the page a second to finish building the table.
    window.setTimeout(function() {
        var output = page.evaluate(function() {
            return document.getElementById('tournamentTable')
                .getElementsByClassName('deactivate')[0]
                .getElementsByTagName('a')[0]
                .textContent;
        });
        console.log(output);
        phantom.exit(); // end the script once the result is printed
    }, 1000);
});

This example scrapes the match name in the first row of the table (which, at this precise moment, is "San Francisco Giants - Boston Red Sox").


About your comment: yes, you can use jQuery under PhantomJS! Check this example:

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function() {
    window.setTimeout(function() {
        // Inject jQuery into the page, then query it from inside page.evaluate.
        page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js", function() {
            var output = page.evaluate(function () {
                return jQuery('#tournamentTable .deactivate:first a:first').text();
            });
            console.log(output);
            phantom.exit(); // end the script once the result is printed
        });
    }, 1000);
});

By the way, for waiting, instead of the fixed window.setTimeout used in these examples, I recommend using waitfor.js, which polls until a condition is true rather than guessing a delay.
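To show the idea, here is a minimal sketch of a polling helper in that spirit. It is not the actual waitfor.js file: the function name and all of its parameters (testFn, onReady, onTimeout, timeoutMs, intervalMs) are my own assumptions, and the default values are arbitrary.

```javascript
// Polls testFn until it returns a truthy value or the timeout elapses.
// Every name and default here is illustrative, not taken from waitfor.js.
function waitFor(testFn, onReady, onTimeout, timeoutMs, intervalMs) {
    var maxWait = timeoutMs || 3000;  // give up after this many milliseconds
    var step = intervalMs || 250;     // how often to re-check the condition
    var start = Date.now();

    var timer = setInterval(function () {
        if (testFn()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= maxWait) {
            clearInterval(timer);
            onTimeout();
        }
    }, step);
}

// Possible use inside the page.open callback, in place of window.setTimeout
// (commented out because `page` and `phantom` only exist under PhantomJS):
//
// waitFor(
//     function () {
//         return page.evaluate(function () {
//             return document.getElementsByClassName('deactivate').length > 0;
//         });
//     },
//     function () { /* scrape here, then */ phantom.exit(); },
//     function () { console.log('timed out'); phantom.exit(1); }
// );
```

The condition runs through page.evaluate, so the scrape starts as soon as the table rows exist instead of after a fixed one-second guess.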

