Irhine

Headless Browser (Selenium+PhantomJS)

13 Aug 2013

Setting Up a Headless Browser (Selenium+PhantomJS)

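This post sets up a headless browser to fetch pages for Scrapy. The setup suits a distributed architecture: once it is deployed on one machine, other machines can use it remotely. The integration between PhantomJS and Selenium is provided by Ghost Driver, a project that is now included in the latest official PhantomJS release: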

THAT'S IT!! Because of latest stable GhostDriver being embedded in PhantomJS, you shouldn't need anything else to get started.

Installation

Selenium RC
$ cd ~/tmp
$ wget http://selenium.googlecode.com/files/selenium-server-standalone-2.34.0.jar

Selenium Python Client

See the Selenium Python documentation.

$ pip install selenium

PhantomJS

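The latest PhantomJS release is hosted on Google Code, which is blocked by the Great Firewall, so it has to be downloaded through a proxy (see the ProxyChains setup).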

$ proxychains wget 'https://phantomjs.googlecode.com/files/phantomjs-1.9.1-linux-x86_64.tar.bz2'
$ tar jxvf phantomjs-1.9.1-linux-x86_64.tar.bz2

Java

Selenium needs a Java runtime to run.

$ sudo apt-get install openjdk-7-jre

Deployment

While testing, Selenium and PhantomJS can be run in the foreground so that you can watch their logs.

Selenium RC

Selenium starts on its default port, 4444, which is a pretty lame port number...

$ cd /path/to/selenium_jar/
$ nohup java -jar selenium-server-standalone-2.34.0.jar -role hub > /dev/null &

PhantomJS
$ cd /path/to/phantomjs
$ nohup ./bin/phantomjs --webdriver=8080 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 > /dev/null &
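
Before pointing a client at these services, it can help to confirm that both processes are actually listening. The following is only a minimal sketch and not part of the original setup: it uses the Python standard library alone, assumes the hub on 127.0.0.1:4444 and PhantomJS on 127.0.0.1:8080 as in the commands above, and the helper name `is_port_open` is made up for illustration.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Ports taken from the deployment commands above; change the host
    # when checking from another machine.
    services = [("selenium hub", 4444), ("phantomjs", 8080)]
    for name, port in services:
        state = "up" if is_port_open("127.0.0.1", port) else "DOWN"
        print("%-12s 127.0.0.1:%d %s" % (name, port, state))
```

An open port only shows that something is listening there; it does not prove that PhantomJS registered with the hub, so the logs are still worth a look.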

Python Example

The test URL is the "掌上绝杀" flash-sale page on the mobile version of Dangdang; its countdown timer is generated by JavaScript.

$ cat python_example.py
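  #!/usr/bin/env python
  # coding: utf-8

  import sys

  from selenium import webdriver


  def main(url):
      # Capabilities requested from PhantomJS (GhostDriver).
      caps = {
          'takeScreenshot': False,
          'javascriptEnabled': True,
      }
      # PhantomJS was started with --webdriver=8080.
      phantom_link = 'http://127.0.0.1:8080/wd/hub'

      driver = webdriver.Remote(
          command_executor=phantom_link,
          desired_capabilities=caps
      )

      try:
          driver.get(url)

          print(driver.title)
          # The countdown inside this element is rendered by JavaScript.
          print(driver.find_element_by_xpath('//ul[@id="clock_0"]').text)
      finally:
          # Always end the session so it is released on the PhantomJS side.
          driver.quit()


  if __name__ == "__main__":
      url = sys.argv[1]
      main(url)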

$ python python_example.py 'http://m.dangdang.com/touch/topics.php?page_id=23387'
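
Because the countdown is generated by JavaScript, the element may not exist yet at the moment `driver.get` returns. One way to cope is to poll until the lookup succeeds. The sketch below is an illustration under assumptions, not the code used above: `wait_for` is a hypothetical helper written with the standard library only, and Selenium's own `WebDriverWait` (in `selenium.webdriver.support.ui`) serves the same purpose.

```python
import time


def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it returns a truthy value.

    Exceptions raised by fetch() (for example an element-not-found
    error) are treated as "not ready yet". Raises RuntimeError if
    nothing truthy is returned before the timeout expires.
    """
    deadline = time.time() + timeout
    while True:
        try:
            value = fetch()
            if value:
                return value
        except Exception:
            pass  # not ready yet; retry after a short sleep
        if time.time() >= deadline:
            raise RuntimeError("timed out after %.1f seconds" % timeout)
        time.sleep(interval)


# Hypothetical usage with the driver from python_example.py:
#   text = wait_for(
#       lambda: driver.find_element_by_xpath('//ul[@id="clock_0"]').text)
```

Swallowing every exception keeps the sketch short; in real code it is probably better to catch only the lookup error you expect, so that genuine failures such as a dead session surface immediately.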