Notes so I don't forget

Machine learning, programming, and so on.

scrapy command note

scrapy basic command


Scrapy has some useful subcommands, such as "startproject", which I introduced in a previous entry:
Quickly building a crawler with a Python framework. "Python Framework Scrapy" - ケンキュウハック
This note summarizes the Scrapy subcommands.


startproject
Create a Scrapy project.

$ scrapy startproject newproject

You can then edit the Python files under the newproject directory.


genspider
Create a new spider from a template, or list the available templates.

$ scrapy genspider -t basic newspider01 example.com
Created spider 'newspider01' using template 'basic' in module:
  scrapy_sample.spiders.newspider01

This creates a spider named "newspider01" that crawls "http://www.example.com/".
The following command lists the available templates.

$ scrapy genspider -l
  basic
  crawl
  csvfeed
  xmlfeed


crawl
Start crawling a spider.

$ scrapy crawl newspider01


list
Show all spiders in the project.

$ scrapy list
newspider01
newspider02
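
The crawl and list commands combine naturally in a script. The following is a minimal sketch, not part of Scrapy itself: the helper names parse_spider_list, list_spiders, and crawl_all are hypothetical, and it assumes that the scrapy executable is on your PATH and that project_dir points at the project directory. It relies only on scrapy list printing one spider name per line, as in the output above.

```python
import shutil
import subprocess


def parse_spider_list(output):
    """Turn the text printed by `scrapy list` into a list of spider names."""
    # One spider name per line; blank lines are skipped.
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_spiders(project_dir="."):
    """Run `scrapy list` inside project_dir (hypothetical helper)."""
    if shutil.which("scrapy") is None:
        raise RuntimeError("scrapy executable not found on PATH")
    result = subprocess.run(
        ["scrapy", "list"],
        cwd=project_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_spider_list(result.stdout)


def crawl_all(project_dir="."):
    """Run `scrapy crawl <name>` for every spider, one after another."""
    for name in list_spiders(project_dir):
        subprocess.run(["scrapy", "crawl", name], cwd=project_dir, check=True)


if __name__ == "__main__":
    # Parse the sample output shown above; this step needs no Scrapy install.
    print(parse_spider_list("newspider01\nnewspider02\n"))
```

Calling crawl_all() from the project directory would run each spider in turn and stop at the first one that exits with an error.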


view
Open a web page in a browser.

$ scrapy view http://www.example.com/

This opens the page at the given URL in your browser, as Scrapy fetched it.


shell
Inspect the objects Scrapy builds for a URL in an interactive Python console.

$ scrapy shell http://www.example.com/some/page.html
...
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><title>Example Domain</title'>
[s]   item       {}
[s]   request    <GET http://www.example.com/some/page.html>
[s]   response   <200 http://www.iana.org/domains/example>
[s]   settings   <CrawlerSettings module=<module 'scrapy_sample.settings' from '/Users/shinya/scrapy_sample/scrapy_sample/settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x10a0ef190>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>>print hxs
<HtmlXPathSelector xpath=None data=u'<html><head><title>Example Domain</title'>

This lets you inspect, in a Python console, the objects the spider obtained from the URL.
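
The hxs object in the session above is an XPath selector over the fetched page. As a rough illustration of that kind of query, here is a standalone sketch that does not use Scrapy at all: it swaps in xml.etree.ElementTree from the Python standard library, which supports only a limited XPath subset and requires well-formed markup. SAMPLE_PAGE is a made-up, simplified stand-in for the Example Domain page, not the real response, and extract_title_and_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the fetched page (an assumption).
SAMPLE_PAGE = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <h1>Example Domain</h1>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
  </body>
</html>
"""


def extract_title_and_links(markup):
    """Return the page title and every link target found in the markup."""
    root = ET.fromstring(markup.strip())
    # ".//title" matches a <title> element anywhere below the root.
    title = root.findtext(".//title")
    # Collect the href attribute of every <a> element.
    links = [a.get("href") for a in root.findall(".//a")]
    return title, links


if __name__ == "__main__":
    print(extract_title_and_links(SAMPLE_PAGE))
```

In the Scrapy shell you would run such queries against hxs directly instead of parsing the markup yourself, and real-world HTML is usually not well-formed enough for ElementTree.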

You can see more information here.