Scrapy: Scraping Steam's Top-Selling Games

2022-05-01 Python

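It has been a while since I last wrote a crawler, and I realized I had never blogged about the Scrapy framework, so this post builds a Scrapy spider as a summary and review. The code for this post is on GitHub: https://github.com/elbadaernU404/steamgames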

1. Introduction to Scrapy

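Search for Scrapy and you will quickly find its official website. As the homepage puts it, Scrapy is a fast and powerful crawling framework. It is asynchronous (built on Twisted's event loop rather than on multiple threads) and, without extra middleware, is best suited to fast, large-scale extraction from static pages. It takes care of concurrency and scheduling, which are the usual pain points of hand-written crawlers, so a working spider needs only a small amount of code. That makes it one of the most efficient mainstream crawling tools.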

[Figure: Scrapy]

2. How the Scrapy Framework Works

On the Scrapy website, the official documentation describes the current architecture as follows:

[Figure: Scrapy architecture, from the official documentation]

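In summary:
The Scrapy framework consists of the engine (Scrapy Engine), the scheduler (Scheduler), the downloader (Downloader), downloader middlewares (Downloader Middlewares), spiders (Spiders), spider middlewares (Spider Middlewares), and item pipelines (Item Pipeline), with Item fields describing the scraped data. Older architecture diagrams also show scheduler middlewares (Scheduler Middlewares), which the current documentation no longer lists.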

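1. The engine gets the initial requests from the spider.
2. The engine schedules those requests in the scheduler (Scheduler) and asks it for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request through the downloader middlewares to the downloader (Downloader). When the download finishes, the downloader generates a Response and sends it back to the engine.
5. The engine passes the Response to the spider. The spider processes it and returns the scraped items and any new requests to the engine, which sends the items to the item pipelines and the new requests to the scheduler for the next round.
6. These steps repeat until the scheduler has no more requests, at which point the crawl ends.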
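
To make the loop above concrete, here is a toy simulation in plain Python. It is only a sketch of the data flow and not Scrapy's real implementation: FAKE_SITE, download, parse, and run_engine are all names made up for this illustration, and a plain string stands in for a Request.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (data on the page, links found on it).
FAKE_SITE = {
    "page1": ("item A", ["page2", "page3"]),
    "page2": ("item B", ["page3"]),
    "page3": ("item C", []),
}


def download(url):
    """Stand-in for the downloader: turn a request into a response."""
    return {"url": url, "body": FAKE_SITE[url]}


def parse(response):
    """Stand-in for the spider: yield one item, then any follow-up requests."""
    data, links = response["body"]
    yield {"item": data, "source": response["url"]}
    for link in links:
        yield link  # a plain string plays the role of a new request


def run_engine(start_urls):
    """Stand-in for the engine: loop until the scheduler has nothing left."""
    scheduler = deque(start_urls)  # steps 1-2: initial requests are queued
    seen = set(start_urls)         # simple duplicate filter
    pipeline_output = []           # stand-in for the item pipeline
    while scheduler:               # step 6: stop when the queue is empty
        url = scheduler.popleft()  # step 3: next request from the scheduler
        response = download(url)   # step 4: downloader returns a response
        for result in parse(response):  # step 5: spider handles the response
            if isinstance(result, dict):
                pipeline_output.append(result)  # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)        # new requests are queued
    return pipeline_output


items = run_engine(["page1"])
print([entry["item"] for entry in items])
```

The real engine does all of this asynchronously and routes everything through the middlewares, but the order of hand-offs is the same.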

3. Building the Scrapy Spider

3.1 Target site: the Steam store <s.team>

Full address: https://store.steampowered.com/
[Figure: Steam store homepage]

3.2 Target data

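The spider extracts information about the games on the Steam store's global top sellers list, including:
· Game name
· Link
· Release date
· Positive review rate
· Discount
· Price (original and discounted; I was connecting through a proxy, so prices are shown in New Taiwan dollars)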

[Figure: Steam top sellers list]
[Figure: Steam game link]
Inspect the page elements to find the XPath of each field.

[Figure: Getting the XPath]
The Chrome extension XPath Helper can be used to verify that each XPath actually selects the target data.

[Figure: XPath Helper]

3.3 Creating the Scrapy project

Start PyCharm, use its terminal to change to the directory where your PyCharm projects are stored, and create a new project named steamgames by entering:

scrapy startproject steamgames

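[Figure: Project creation]
Once the project is created, you get the project root folder steamgames, which contains the Python files that the framework has already generated: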

[Figure: Project directory]

3.4 Creating the spider

After changing into the project directory, enter in the terminal: scrapy genspider test store.steampowered.com

This creates the spider, named test.

[Figure: Spider creation]

3.5 Writing the code

3.5.1 Writing items.py

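First, add the item fields for the data to be scraped to items.py in the project directory. They correspond to the information listed above:
· Game name
· Link
· Release date
· Positive review rate
· Discount
· Price (original and discounted)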

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class SteamgamesItem(scrapy.Item):
    # One field per piece of data scraped for each game.
    name = scrapy.Field()      # game name
    link = scrapy.Field()      # store page URL
    time = scrapy.Field()      # release date
    evaluate = scrapy.Field()  # review summary (positive review rate)
    discount = scrapy.Field()  # discount percentage
    price = scrapy.Field()     # price text (original and discounted)

3.5.2 Editing settings.py

Because this project is for learning purposes only, first turn off robots.txt compliance.

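The robots protocol, formally the Robots Exclusion Protocol (also called the crawler protocol or robots protocol), tells search engines and other crawlers which parts of a site they may fetch. It is the first file a well-behaved crawler requests from a site: a file named robots.txt in the site's root directory. It is advisory rather than a security mechanism: it expresses which paths the site owner does not want crawled, but it does not by itself stop anyone from accessing them.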

ROBOTSTXT_OBEY = False
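
For a sense of what obeying robots.txt involves, the sketch below uses Python's standard urllib.robotparser on a made-up robots.txt. The rules, the crawler name my-crawler, and the example.com URLs are all assumptions for illustration; this is not Steam's actual file, nor is it how Scrapy implements the check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a crawler with this (made-up) name may fetch each URL.
allowed = parser.can_fetch("my-crawler", "https://example.com/search/")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

With ROBOTSTXT_OBEY = True, Scrapy performs a comparable check before each request and skips the ones that are disallowed.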

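Next, enable the item pipeline, which will be used later to store the data. The number 300 is the pipeline's order value: pipelines run in order from the lowest value to the highest, and values are conventionally kept in the 0-1000 range.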


ITEM_PIPELINES = {
   'steamgames.pipelines.SteamgamesPipeline': 300,
}
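
As a small illustration of how those numbers translate into an execution order, the sketch below sorts a settings dictionary by value. Only SteamgamesPipeline exists in this project; the other two entries are hypothetical names added for the example, and the sorting is a plain-Python illustration rather than Scrapy's own code.

```python
# Hypothetical pipeline names; only SteamgamesPipeline exists in this project.
ITEM_PIPELINES = {
    "steamgames.pipelines.SteamgamesPipeline": 300,
    "steamgames.pipelines.HypotheticalCleanupPipeline": 100,
    "steamgames.pipelines.HypotheticalExportPipeline": 800,
}

# Sort by the integer value: lower numbers run earlier.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path.rsplit(".", 1)[-1], ITEM_PIPELINES[path])
```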

3.5.3 scrapy shell

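You can enter the Scrapy shell with scrapy shell "url" to test the project, for example to check that an XPath returns values as expected (that is, that anti-scraping measures are not interfering).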

Here, enter in the terminal: scrapy shell "https://store.steampowered.com/search/?filter=globaltopsellers&os=win"

The shell starts successfully:

[Figure: scrapy shell]
To test the XPath expressions, enter:

response.xpath("//*[@id='search_resultsRows']/a/div[2]/div[4]/div[1]/span/text()").extract()
response.xpath("//*[@id='search_resultsRows']/a/div[2]/div[1]/span/text()").extract()

The results come back, so the test succeeds:

[Figure: scrapy shell test]
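
If you want to experiment with these path expressions away from the live site, the sketch below runs similar ones against a small hand-written document. Two things are assumptions here: the markup is a heavily simplified imitation of the nesting the expressions expect, not the real Steam page, and the standard library's xml.etree.ElementTree is swapped in for Scrapy's selectors. ElementTree supports only a subset of XPath (no text() step, for instance), so the text is read from the .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the nesting assumed by the
# XPath expressions; the real page has far more structure.
SAMPLE = """
<html><body>
<div id="search_resultsRows">
  <a href="https://example.com/app/1">
    <div>image</div>
    <div>
      <div><span>Game A</span></div>
      <div>1 May, 2022</div>
      <div><span>review</span></div>
      <div><div><span>-50%</span></div><div>NT$ 500</div></div>
    </div>
  </a>
  <a href="https://example.com/app/2">
    <div>image</div>
    <div>
      <div><span>Game B</span></div>
      <div>2 May, 2022</div>
      <div><span>review</span></div>
      <div><div></div><div>NT$ 300</div></div>
    </div>
  </a>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
rows = root.findall(".//*[@id='search_resultsRows']/a")

# Same relative paths as in the shell session, minus the text() step.
names = [s.text for row in rows for s in row.findall("./div[2]/div[1]/span")]
discounts = [s.text for row in rows for s in row.findall("./div[2]/div[4]/div[1]/span")]
print(names)
print(discounts)
```

Note that the second row has no discount span, so the discount list is shorter than the name list. The same thing happens on real pages whenever a game is not on sale.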

3.5.4 Writing the spider test.py

Open the generated test.py; the following code is already in place:


name = 'test'
allowed_domains = ['store.steampowered.com']
start_urls = ['https://store.steampowered.com/search/?filter=globaltopsellers&os=win']

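These are, respectively, the spider's name, the domains it is allowed to crawl (it stays within the site), and the URLs where crawling starts.

A project can contain several spiders, and a spider can crawl several paths of the site, but requests to domains outside the allowed range are filtered out.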
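
The domain restriction can be pictured with the rough sketch below. The helper is_allowed is a name invented for this example, and the matching rule (exact host or any subdomain) is a simplified approximation of what Scrapy's offsite filtering does, not its actual code. The second URL is just an arbitrary off-site address.

```python
from urllib.parse import urlparse

allowed_domains = ["store.steampowered.com"]


def is_allowed(url, domains):
    """Return True if the URL's host is one of the domains or a subdomain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


print(is_allowed("https://store.steampowered.com/search/?filter=globaltopsellers", allowed_domains))
print(is_allowed("https://steamcommunity.com/app/730", allowed_domains))
```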

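In the TestSpider class, keep the method name parse. The class inherits from scrapy.Spider, and parse is the default callback that Scrapy calls with each downloaded response, so we simply override it to implement the scraping logic.

A return item at the end of the loop body would end the function after the first item, so the remaining items would never reach the pipeline. For that reason parse uses the yield keyword, which turns it into a generator that hands items over one at a time.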
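
The difference between the two keywords is easy to see in isolation. In this minimal sketch, the function names and the placeholder game names are made up for the illustration:

```python
def parse_with_return(rows):
    for row in rows:
        return {"name": row}  # exits on the first row; the rest are never seen


def parse_with_yield(rows):
    for row in rows:
        yield {"name": row}   # hands over one item, then resumes the loop


rows = ["Game A", "Game B", "Game C"]  # placeholder names
print(parse_with_return(rows))
print(list(parse_with_yield(rows)))
```

The first call produces a single dictionary, while the generator produces one per row, which is what the engine needs in order to forward every item to the pipeline.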

def parse(self, response):
    node_list = response.xpath("//*[@id='search_resultsRows']/a")
    print(node_list)
    for node in node_list:
        item = SteamgamesItem()
        name = node.xpath("./div[2]/div[1]/span/text()").extract()
        link = node.xpath("./@href").extract()
        time = node.xpath("./div[2]/div[2]/text()").extract()
        evaluate = node.xpath("./div[2]/div[3]/span/@data-tooltip-html").extract()
        discount = node.xpath("./div[2]/div[4]/div[1]/span/text()").extract()
        price = node.xpath('normalize-space(./div[2]/div[4]/div[2])').extract()

        item['name'] = name[0]
        item['link'] = link[0]
        item['time'] = time[0]
        item['evaluate'] = evaluate[0]
        item['discount'] = discount[0]
        item['price'] = price[0]

        yield item

At this point you can run scrapy crawl test -o steamgames.json to write the results straight to a JSON file.

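If the run reports an error, though:

· KeyError: Spider not found
The name passed to scrapy crawl probably does not match the spider name in test.py, or the spider file is not inside the spiders folder. Check both.

· IndexError: list index out of range
Not every game lists a discount or a discounted price, so indexing an empty list raises this exception. Add exception handling that falls back to "No discount for now" or "See link for details" when it happens.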

The modified function looks like this:

def parse(self, response):
    node_list = response.xpath("//*[@id='search_resultsRows']/a")
    print(node_list)
    for node in node_list:
        item = SteamgamesItem()
        name = node.xpath("./div[2]/div[1]/span/text()").extract()
        link = node.xpath("./@href").extract()
        time = node.xpath("./div[2]/div[2]/text()").extract()
        evaluate = node.xpath("./div[2]/div[3]/span/@data-tooltip-html").extract()
        discount = node.xpath("./div[2]/div[4]/div[1]/span/text()").extract()
        price = node.xpath('normalize-space(./div[2]/div[4]/div[2])').extract()

        item['name'] = name[0]
        item['link'] = link[0]
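        try:
            item['time'] = time[0]
        except IndexError:
            # Some entries have no release date listed.
            item['time'] = 'See link for details'
        item['evaluate'] = evaluate[0]
        try:
            item['discount'] = discount[0]
        except IndexError:
            # Games that are not on sale have no discount element.
            item['discount'] = 'No discount for now'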
        item['price'] = price[0].strip()

        yield item
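
The fallback logic above can also be written as a small helper instead of repeated try blocks. This is just one possible sketch: first_or_default is a name I made up, and the input lists are placeholders for what extract() might return. Scrapy's selector lists also provide .get(default=...), which serves the same purpose without a helper.

```python
def first_or_default(values, default):
    """Return the first element of a list, or a fallback if the list is empty."""
    return values[0] if values else default


# Placeholder extraction results: one game with a discount, one without.
print(first_or_default(["-50%"], "No discount for now"))
print(first_or_default([], "No discount for now"))
```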

3.5.5 Writing the pipeline pipelines.py

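With the spider itself finished, the next step is the pipeline that stores the data. It uses open() to create a file named steamgames.csv in which each item is written as a JSON object, with a line break after every item.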

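import json


class SteamgamesPipeline:
    def __init__(self):
        # Opened in binary mode, so each line is encoded to UTF-8 before writing.
        self.f = open('steamgames.csv', 'wb+')

    def process_item(self, item, spider):
        # Serialize the item as JSON and end the line with a comma and newline.
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content.encode(encoding='utf-8'))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()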

Looking closely at the output, some fields are shifted out of position and some contain long runs of whitespace:

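[Figure: Malformed output 1]
[Figure: Malformed output 2]
[Figure: Malformed output 3]
The cause: the fields of each record are separated by commas, but the dates, prices, and review counts on the Steam store themselves contain commas (in dates and as thousands separators), so those values get split in the wrong places. In addition, the discounted-price element contains line breaks.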
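
A different way around the comma problem, which this post does not use, is to let Python's csv module handle the quoting. The sketch below is an alternative under stated assumptions: the rows are placeholder data, and an in-memory io.StringIO buffer stands in for a real file opened with open('steamgames.csv', 'w', newline='').

```python
import csv
import io

# Placeholder rows; the values deliberately contain commas and a newline.
rows = [
    {"name": "Game A", "time": "1 May, 2022", "price": "NT$ 1,299\nNT$ 649"},
    {"name": "Game B", "time": "2 May, 2022", "price": "NT$ 300"},
]

buffer = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(buffer, fieldnames=["name", "time", "price"])
writer.writeheader()
writer.writerows(rows)  # fields with commas or newlines are quoted for us

# Reading the text back recovers every field intact.
parsed = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(parsed[0]["time"], "|", repr(parsed[0]["price"]))
```

Because the writer quotes any field that contains a comma or a line break, the values survive a round trip unchanged, at the cost of the output no longer being one JSON object per line.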

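To store the data correctly and keep the output tidy, the test.py spider needs a few more refinements.

Extra whitespace can be removed with xpath('normalize-space(./…)') or with the string method strip().

Commas in the data can be replaced with a space or another character by combining a list comprehension with the string method replace(): [i.replace(",", "") for i in node.xpath('…').extract()]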
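
Here is what those two clean-up steps do to a couple of sample strings. The raw values are placeholders I wrote to resemble scraped text, and " ".join(text.split()) is used as a plain-Python approximation of what normalize-space() does inside XPath.

```python
raw_price = "\r\n   NT$ 1,299   \r\n   NT$ 649  "  # placeholder scraped text
raw_date = "1 May, 2022"                            # placeholder date

# Collapse every run of whitespace (including line breaks) to one space
# and trim both ends, much like XPath's normalize-space().
collapsed = " ".join(raw_price.split())

# Remove or replace commas so they cannot be mistaken for field separators.
no_commas = collapsed.replace(",", "")
date_clean = raw_date.replace(",", " ")

print(repr(collapsed))
print(repr(no_commas))
print(repr(date_clean))
```

Replacing the comma in the date with a space leaves two consecutive spaces behind, which is harmless here but worth knowing if the value is parsed later.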

The complete test.py file is as follows:

import scrapy
from steamgames.items import SteamgamesItem


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['store.steampowered.com']
    start_urls = ['https://store.steampowered.com/search/?filter=globaltopsellers&os=win']

    def parse(self, response):
        node_list = response.xpath("//*[@id='search_resultsRows']/a")
        print(node_list)
        for node in node_list:
            item = SteamgamesItem()
            name = node.xpath("./div[2]/div[1]/span/text()").extract()
            link = node.xpath("./@href").extract()
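            # Replace commas so they cannot be mistaken for field separators.
            time = [i.replace(",", " ") for i in node.xpath("./div[2]/div[2]/text()").extract()]
            # The tooltip HTML uses <br> between lines; turn it and any commas into spaces.
            evaluate = [i.replace("<br>", " ").replace(",", " ") for i in node.xpath("./div[2]/div[3]/span/@data-tooltip-html").extract()]
            discount = node.xpath("./div[2]/div[4]/div[1]/span/text()").extract()
            # normalize-space() collapses the line breaks inside the price element.
            price = [i.replace(",", "") for i in node.xpath('normalize-space(./div[2]/div[4]/div[2])').extract()]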

            item['name'] = name[0]
            item['link'] = link[0]
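            try:
                item['time'] = time[0]
            except IndexError:
                # Some entries have no release date listed.
                item['time'] = 'See link for details'
            item['evaluate'] = evaluate[0]
            try:
                item['discount'] = discount[0]
            except IndexError:
                # Games that are not on sale have no discount element.
                item['discount'] = 'No discount for now'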
            item['price'] = price[0].strip()


            yield item

4. Results

With the spider finished, running scrapy crawl test produces the output file in .csv format:

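[Figure: Running the spider]
[Figure: Steam top-selling games]
That completes the Scrapy spider. With this first batch of game data collected, the next step is further cleaning and analysis.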

4.1 Data cleaning with pandas

Export the results again, this time as a JSON file named steamgames.json, and create pd_steamgames.py in the steamgames directory for the data processing.

import pandas as pd

df = pd.read_json('steamgames.json',encoding='utf-8')
print(df.to_string())

The output is a pandas DataFrame:

[Figure: DataFrame]
Set one cleaning goal here: keep only the top-selling games that are currently discounted.

import pandas as pd

df = pd.read_json('steamgames.json',encoding='utf-8')
df = df[df['discount']!='No discount for now']
print(df.to_string())

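[Figure: Discounted top-selling games on Steam]
Next, analyze the data to look for a relationship between discount and positive review rate, that is, whether discounts affect reviews. All columns except the review and discount columns are dropped here.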

df = df[df['discount']!='No discount for now'].drop(['name','link','time','price'],axis=1)

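[Figure: Reviews and discounts]
Use a regular expression to strip the review counts and text from the review data, and print the data without the original index: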


for i in df.columns:
    # Keep only the first number (with its % sign, if any) in each cell.
    df[i] = df[i].str.extract(r'(\d+%?)', expand=False)
print(df.to_string(index=False))

[Figure: Extracted data]
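
To see what the pattern keeps and discards, here is the same regular expression applied with the standard re module. The sample strings are placeholders I wrote to resemble the cleaned evaluate and discount values; the real tooltip text may differ.

```python
import re

# Placeholder strings shaped like the cleaned 'evaluate' and 'discount' values.
samples = {
    "evaluate": "Very Positive 85% of the 12 345 user reviews are positive.",
    "discount": "-50%",
}

pattern = re.compile(r"(\d+%?)")
extracted = {}
for column, text in samples.items():
    match = pattern.search(text)  # first match only, like str.extract
    extracted[column] = match.group(1) if match else None
print(extracted)
```

Only the first run of digits is kept, so this approach relies on the percentage appearing before the review count in the text.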

4.2 Data analysis and visualization

With the data cleaned, import the matplotlib library to visualize it. The complete pd_steamgames.py file is as follows:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('steamgames.json',encoding='utf-8')
df = df[df['discount']!='No discount for now'].drop(['name','link','time','price'],axis=1)
for i in df.columns:
    # Keep only the first run of digits in each cell.
    df[i] = df[i].str.extract(r'(\d+)', expand=False)

df=df.astype(float)
df.plot()

my_y_ticks = [20,40,60,80,100]
plt.yticks(my_y_ticks)
plt.xlabel('Number of games')
plt.ylabel('favorable rate(%)')
plt.title('evaluate & discount')
plt.show()

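[Figure: Relationship between reviews and discounts]
From this, the plot does not show a strong linear dependence between the two. Some games are heavily discounted and still have a consistently high positive review rate, while others offer only a small discount yet stay at a medium-to-high level. This is partly because the data covers only global top sellers, and the sample is fairly small. For lack of time, these are issues to address later, by collecting more data and doing more analysis.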
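
One way to put a number on "linear dependence" instead of reading it off a plot is the Pearson correlation coefficient. The sketch below computes it by hand. The function pearson is written for this illustration, and the three lists are made-up numbers, not the scraped results. With the real DataFrame, pandas' DataFrame.corr() gives the same statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only, not the scraped data.
discount = [10.0, 20.0, 30.0, 40.0, 50.0]
rate_linear = [20.0, 40.0, 60.0, 80.0, 100.0]    # rises exactly with discount
rate_unrelated = [90.0, 80.0, 90.0, 80.0, 90.0]  # no linear trend

print(round(pearson(discount, rate_linear), 3))
print(round(pearson(discount, rate_unrelated), 3))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates none, so running this on the cleaned columns would be a quick check of the conclusion above.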