Using Selenium in Scrapy

Posted on 2019-11-08, updated on 2025-10-27. Categories: rd, python, scrapy

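When crawling certain sites with Scrapy, you will often find that part of the page is loaded dynamically. A plain Scrapy request to the URL returns only the initial HTML, so the dynamically loaded data is missing from the response. A browser that requests the same URL, however, runs the page's scripts and ends up with that data. To get the dynamically loaded data inside Scrapy, a straightforward approach is to create a browser object with Selenium, send the request through that browser, and read the rendered page from it.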

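Example

Requirements analysis

- Goal: crawl the news data in the Domestic section of NetEase News.

- Analysis: after clicking the Domestic link and landing on the corresponding page, you will see that the news shown there is loaded dynamically. A direct request to the URL from a program does not return that news data. We therefore use Selenium to instantiate a browser object, request the URL in that browser, and read the dynamically loaded news data from it.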

How it works

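When the engine hands the request for the Domestic section URL to the downloader, the downloader fetches the page, wraps the downloaded page data in a response object, and passes it to the engine, which forwards the response to the Spiders. The page data in the response that the Spiders receive does not contain the dynamically loaded news. To get that news, intercept the response in a downloader middleware on its way from the downloader to the engine, and replace the page data it carries with page data that includes the dynamically loaded news. The modified response is then handed to the Spiders for parsing.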
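The interception idea can be sketched without Scrapy or a browser. The toy model below is only an illustration: `FakeResponse`, `static_downloader`, `fake_browser_render`, and `render_middleware` are hypothetical names, not Scrapy or Selenium APIs, and the HTML strings are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeResponse:
    """Hypothetical stand-in for a response object: just a URL and a body."""
    url: str
    body: str


def static_downloader(url):
    # Stands in for the downloader: returns the initial HTML, in which
    # the news list is still empty because no JavaScript has run.
    return FakeResponse(url=url, body='<div class="ndi_main"></div>')


def fake_browser_render(url):
    # Stands in for a real browser: returns the HTML after the page's
    # scripts have filled in the news list (made-up content).
    return '<div class="ndi_main"><a>headline</a></div>'


def render_middleware(response, render=fake_browser_render):
    # Intercept the downloader's response and swap its body for the rendered HTML.
    return replace(response, body=render(response.url))


original = static_downloader('https://news.163.com/domestic/')
patched = render_middleware(original)
print('headline' in original.body, 'headline' in patched.body)
```

The spider only ever sees `patched`, which is why its XPath queries can find the news items.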

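Steps

- Override the spider's constructor and instantiate a browser object with Selenium there (the browser only needs to be created once).
- Override the spider's closed(self, reason) method and close the browser object inside it. Scrapy calls this method when the spider finishes.
- Override the downloader middleware's process_response method so that it intercepts the response object and replaces the page data stored in it.
- Enable the downloader middleware in the settings file.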

Code

Spider: wangyi.py

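# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['news.163.com']
    start_urls = ['https://news.163.com/domestic/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own initialization first
        super().__init__(*args, **kwargs)
        # Instantiate the browser object once for the whole crawl.
        # executable_path is the Selenium 3 style argument; recent
        # Selenium 4 releases take a Service object instead.
        self.bro = webdriver.Chrome(executable_path=r'E:\site\python\爬虫\chromedriver.exe')

    def parse(self, response):
        # Extract the headline text of every news item in the list
        news = response.xpath('//*[@class="ndi_main"]//*[@class="news_title"]//a/text()').extract()
        for item in news:
            print(item)

    # Called when the spider finishes; the argument is the close reason
    def closed(self, reason):
        print('Closing the browser')
        self.bro.quit()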
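Scrapy's `allowed_domains` expects domain names rather than full URLs, so a URL placed there will not match request hosts as intended. As a small sketch, the host can be derived from a start URL with the standard library; `domain_of` is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urlparse


def domain_of(url):
    # Return only the host part of a URL, which is the form allowed_domains expects.
    return urlparse(url).hostname


start_urls = ['https://news.163.com/domestic/']
allowed_domains = sorted({domain_of(u) for u in start_urls})
print(allowed_domains)
```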

Middleware: middlewares.py

from scrapy import signals
from scrapy.http import HtmlResponse
from time import sleep


class ScrapymiddleDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

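    def process_response(self, request, response, spider):
        # Intercept the response on its way from the downloader to the spider.
        # request: the request that produced this response
        # response: the intercepted response
        # spider: the running spider instance (it owns the browser object)
        print('request.url', request.url)
        # Browser object created in the spider
        bro = spider.bro
        # Open the URL in the browser
        bro.get(url=request.url)
        # Crude fixed wait to give the dynamic content time to load
        sleep(3)
        # Page source after rendering, which now includes the
        # dynamically loaded news data
        source_code = bro.page_source
        print('spider url:', bro.current_url)
        # Replace the original response with one built from the rendered page source
        return HtmlResponse(url=bro.current_url, body=source_code, encoding='utf8', request=request)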

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
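A fixed `sleep` either wastes time or is too short when the page is slow. One alternative is to poll the page source until an expected marker shows up. The helper below is a hand-rolled sketch under assumptions: `wait_for_marker` and `FakeDriver` are hypothetical names, the only thing it relies on from the browser object is a `page_source` attribute, and the marker `ndi_main` is taken from the XPath in the spider. Selenium also ships its own explicit-wait utility, `WebDriverWait`, which is usually the better choice in real code.

```python
import time


def wait_for_marker(driver, marker, timeout=10.0, interval=0.5):
    """Poll driver.page_source until `marker` appears or `timeout` seconds pass.

    Returns the page source containing the marker; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f'{marker!r} not found within {timeout} seconds')
        time.sleep(interval)


class FakeDriver:
    """Hypothetical stand-in for a browser: the content appears on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return '<html></html>'
        return '<html><div class="ndi_main">news</div></html>'


driver = FakeDriver()
html = wait_for_marker(driver, 'ndi_main', timeout=2.0, interval=0.01)
print(driver.reads, 'ndi_main' in html)
```

In the middleware, a call such as `wait_for_marker(spider.bro, 'ndi_main')` could stand in for the fixed sleep followed by reading `bro.page_source`.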

Settings: settings.py

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'scrapymiddle.middlewares.ScrapymiddleDownloaderMiddleware': 543,
}

# Log level
LOG_LEVEL = "ERROR"
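The number 543 is the middleware's priority. Scrapy calls `process_request` in increasing priority order and `process_response` in decreasing order. The sketch below only models that ordering in plain Python; `call_order` is a hypothetical helper, and the two `example.middlewares` entries are placeholder names that do not exist in this project.

```python
def call_order(middlewares):
    """Return (request_order, response_order) for a {path: priority} mapping."""
    # Entries set to None are disabled in Scrapy settings, so skip them here too.
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    request_order = sorted(enabled, key=enabled.get)
    response_order = list(reversed(request_order))
    return request_order, response_order


DOWNLOADER_MIDDLEWARES = {
    'scrapymiddle.middlewares.ScrapymiddleDownloaderMiddleware': 543,
    'example.middlewares.EarlyMiddleware': 100,  # placeholder
    'example.middlewares.LateMiddleware': 900,   # placeholder
}

request_order, response_order = call_order(DOWNLOADER_MIDDLEWARES)
print(request_order[0])
print(response_order[0])
```

Because responses travel back through the chain in reverse, a middleware with a higher number sees the downloader's response earlier than one with a lower number.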