Scraping data with Scrapy

I've spent the past two weeks scraping data with Scrapy. Here is a summary of the main stumbling blocks I ran into:

Using XPath

1. xpath().extract() returns a list. To take just the first match, use extract_first(); otherwise index into the list, e.g. extract()[0].
2. xpath('//...') searches from the document root, while xpath('.//...') searches from the current node.
3. When working out XPath expressions in the browser, disable JavaScript first, so you are not looking at markup that JavaScript has already modified.

Crawling across multiple levels

1. In parse, first extract the link to the next level, detail_url.
2. yield Request(url=detail_url, callback=self.parse_detail) to descend into the next level, where parse_detail is the handler for that level.
3. If an Item also needs to be passed down a level, use Request(url=detail_url, meta={"item": item}, callback=self.parse_detail), then retrieve it with response.meta["item"].

Getting around anti-scraping measures

1. Set COOKIES_ENABLED to False.
2. Turn on all the AUTOTHROTTLE settings.
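In settings.py, those two points amount to something like the following (the throttle values are illustrative defaults, tune them per site):

```python
# settings.py
COOKIES_ENABLED = False  # don't send or track cookies

AUTOTHROTTLE_ENABLED = True  # adjust delay based on server load
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False  # set True to log each throttle decision
```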

Scraping news from Huoxing24

from scrapy.selector import Selector
from scrapy.http import Request

import json

from BlocknewSpider.items import TopicItem
from BlocknewSpider.spiders.blocknew_spider import BaseBlocknewSpider


class ComHuoxing24Spider(BaseBlocknewSpider):
    name = "huoxing24"

    allowed_domains = ["huoxing24.com"]

    start_urls = [
        "http://www.huoxing24.com/info/news/shownews?currentPage=0&pageSize=25",
    ]

    def parse(self, response):
        json_body = json.loads(response.body)

        for news in json_body["obj"]["inforList"]:
            detail_url = "http://www.huoxing24.com/newsdetail?id=" + news["id"]

            item = TopicItem()
            item['author_title'] = news['source'].strip()
            item['author_avatar'] = news['iconUrl']
            item['title'] = news['title'].strip()
            item['url'] = detail_url

            cover_pic = json.loads(news['coverPic'])
            item['cover'] = cover_pic['pc']

            # pass the partially filled item down to the detail page via meta
            yield Request(url=detail_url, meta={"item": item}, callback=self.parse_detail)

        # queue the remaining list pages (page 0 is already in start_urls)
        for page in range(1, json_body["obj"]["pageCount"] - 1):
            next_url = "http://www.huoxing24.com/info/news/shownews?currentPage=" + str(page) + "&pageSize=25"
            yield Request(next_url)

    def parse_detail(self, response):
        hxs = Selector(response)

        item = response.meta['item']
        item['publish'] = hxs.xpath('//div[@class="issue-box"]/p[2]/span/text()').extract_first().strip()

        # grab up to three images from the article body
        images = hxs.xpath('//div[@class="detail-text-cont simditor-body"]').xpath('.//img/@src').extract()
        item['media_1'] = images[0] if len(images) > 0 else ""
        item['media_2'] = images[1] if len(images) > 1 else ""
        item['media_3'] = images[2] if len(images) > 2 else ""
        yield item