Published 2019-10-01 | Updated 2019-10-01 | Python | 8-minute read (about 1,161 characters)

Scraping the comments on a news article from The Paper (澎湃新闻) with Python + requests

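As the title says, the goal is to scrape every comment under a given article, for example this one: (link). First, list the data to collect:

1. News title
2. Publication date
3. Commenter's nickname
4. Comment content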

Analyzing the page requests to find the data

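As the screenshot shows, the response to the page's first request already contains items 1 and 2 (the title and the publication date).
[Screenshot: the first request in the browser's developer tools]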
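Next, scroll down the article in the browser so that more comments load, and watch the developer console at the same time, with `load` typed into the filter box. As the screenshot shows, the browser keeps sending requests to the server, and the responses to these requests contain items 3 and 4 (the commenter's nickname and the comment content). The request URL carries a pile of query parameters. After trimming them by trial and error, it reduces to https://www.thepaper.cn/load_moreFloorComment.jsp?contid=4489661&startId=24750775. Only two parameters remain: the first is the ID of the article and the second is the ID of the comment page. With this URL, a comment URL can be built for each startId, and all the comments can eventually be fetched.
[Screenshot: the load_moreFloorComment requests in the browser's developer tools]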
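
The two-parameter URL above can be assembled programmatically. The following is a minimal sketch using only the standard library. The helper name `build_comment_url` is made up for illustration; the endpoint and the `contid` / `startId` parameter names come from the trimmed URL above, and an empty `startId` is assumed to request the first page, as the script later in this post does.

```python
from urllib.parse import urlencode

# Endpoint taken from the trimmed request URL in the text
COMMENT_ENDPOINT = 'https://www.thepaper.cn/load_moreFloorComment.jsp'


def build_comment_url(contid, start_id=''):
    """Return the comment-page URL for an article ID and a page startId.

    build_comment_url is a hypothetical helper, not part of any library.
    An empty start_id is assumed to request the first page of comments.
    """
    query = urlencode({'contid': contid, 'startId': start_id})
    return '{}?{}'.format(COMMENT_ENDPOINT, query)


print(build_comment_url(4489661, 24750775))
```

Using `urlencode` rather than string concatenation keeps the `&` separator between the two parameters from being dropped by accident.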

How to find the different startid values?

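Staying with the same screenshot, open the corresponding request in a new tab and look at the HTML source: the div of the first comment carries startid='24745735'.
[Screenshot: HTML source of a comment response]
Remember this value, then go back and look at the URL of the second comment request: the startid at the end of that URL is exactly the value found in the response to the first comment request. In other words, each response supplies the startid for the next request.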
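
The chaining described above can be sketched without touching the network. In the sketch below, `FAKE_PAGES` and `fetch` are stand-ins invented for illustration (they are not the site's real responses). Only the `startId="..."` attribute pattern and the convention that `0` marks the last page are taken from this post.

```python
import re

# Hypothetical responses keyed by the startId that was requested.
# Each fragment only mimics the startId attribute described in the text.
FAKE_PAGES = {
    '': '<div class="comment_que"><div startId="24745735">page 1</div></div>',
    24745735: '<div class="comment_que"><div startId="24741000">page 2</div></div>',
    24741000: '<div class="comment_que"><div startId="0">page 3</div></div>',
}


def fetch(start_id):
    """Stand-in for an HTTP request; returns canned HTML for a startId."""
    return FAKE_PAGES[start_id]


def next_start_id(html):
    """Read the first startId attribute; return 0 when none is present."""
    match = re.search(r'startId="(\d+)"', html)
    return int(match.group(1)) if match else 0


def walk_pages():
    """Follow the startId chain and return the IDs requested, in order."""
    requested = []
    start_id = ''  # an empty startId is assumed to mean the first page
    while start_id != 0:
        requested.append(start_id)
        start_id = next_start_id(fetch(start_id))
    return requested


print(walk_pages())
```

Returning `0` when no `startId` is found makes the loop stop instead of raising an `IndexError` on an unexpected response.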

At this point, all of the data can in principle be located. What remains is writing the code.

Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ************************************************************************
# *
# * @file:penpai.py
# * @author:kanhui
# * date:2019-09-22 11:40:30
# * @version 3.7.3
# *
# ************************************************************************

import requests
import re
from lxml import etree
import json


class PengPaiSpider():
    '''
    Given the URL of an article on The Paper (澎湃新闻), scrape the comments under it.
    Example: https://www.thepaper.cn/newsDetail_forward_1292455
    '''

    def __init__(self):
        print('input url:')

        # URL of the news article
        self.url = input()

        # Used to detect the last page: the first comment of each comment
        # response carries a startId attribute, and 0 means there is no next page
        self.next_id = ''

        # Holds the scraped data
        self.item = {}

        # Request headers
        self.header = {
            'User-Agent':
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
        }
        # Try to extract contid from the URL; fall back to 0 when it is missing
        match = re.search(r'forward_(\d+)', self.url)
        self.contid = match.group(1) if match else 0
        self.comment_url = 'https://www.thepaper.cn/load_moreFloorComment.jsp?contid={}&startId={}'
        if self.contid == 0:
            print('Failed to extract contid, please check the URL!')

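    def parse(self, url):
        '''Send a GET request and return the response body as text.'''
        # Send the User-Agent header defined in __init__; the timeout keeps
        # a stalled connection from hanging the script
        res = requests.get(url, headers=self.header, timeout=10)
        # Debug output
        print(url)
        return res.text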

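    def start(self):
        # Stop early when no contid could be extracted from the URL
        if self.contid == 0:
            return

        # Fetch the article information and the number of comments
        self.get_new_data()

        while self.next_id != 0:
            res = self.parse(self.comment_url.format(self.contid, self.next_id))
            # Extract the comments from the response
            self.handle(res)
            # Save the comments from this page
            self.save_comment()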

    def handle(self, res):
        root = etree.HTML(res)
        comments = root.xpath(
            "//div[@class='comment_que']//div[@class='ansright_cont']/a/text()"
        )
        usernames = root.xpath('.//h3/a/text()')
        self.item = [{
            'Nickname': username,
            'content': comment
        } for username, comment in zip(usernames, comments)]

        # Update the startId used for the next request
        # new_next_id = int(root.xpath('//div[@class="comment_que"]')[0].xpath('./div')[1].xpath('@startid')[0])
        found = re.findall(r'startId="(\d+)"', res)
        # Treat a response without a startId as the last page
        new_next_id = int(found[0]) if len(found) != 0 else 0
        # The URL given by the teacher seems slightly off: 14 has to be subtracted
        # to reach the next page, otherwise the loop keeps cycling over the first two pages
        if self.next_id == new_next_id:
            self.next_id = new_next_id - 14
        else:
            self.next_id = new_next_id

    def get_new_data(self):
        '''Fetch the article's own information.'''

        res = self.parse(self.url)
        root = etree.HTML(res)
        self.item['url'] = self.url
        tmp = root.xpath("//h1[@class='news_title']/text()")
        self.item['title'] = tmp[0] if len(tmp) != 0 else ''

        tmp = root.xpath("//h2[@id='comm_span']/span/text()")
        # Default to '0' so that a missing counter does not raise AttributeError
        self.comment_count = '0'
        if len(tmp) != 0:
            found = re.findall(r'（(.*)）', tmp[0])
            if len(found) != 0:
                self.comment_count = found[0]
        if self.comment_count.isdigit():
            self.item['comment_count'] = self.comment_count
        else:  # Handle abbreviated counts such as 3.2k
            self.comment_count = float(self.comment_count[:-1]) * 1000
            self.item['comment_count'] = self.comment_count

        tmp = root.xpath('//div[@class="news_about"]/p/text()')
        # Default to an empty string when the date cannot be found
        post_time = ''
        if len(tmp) == 3:
            # Extract the date with a regular expression
            dates = re.findall(r'(\d{4}-\d{2}-\d{2})', tmp[1])
            post_time = dates[0] if len(dates) != 0 else ''
        self.item['post_time'] = post_time
        self.save_comment()

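    def save_comment(self):
        # Append to <contid>.json; UTF-8 is set explicitly because
        # ensure_ascii=False writes the Chinese text unescaped
        with open(str(self.contid) + '.json', 'a', encoding='utf-8') as f:
            f.write(json.dumps(self.item, ensure_ascii=False, indent=4))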


if __name__ == '__main__':
    my_spider = PengPaiSpider()
    my_spider.start()
    print('Done!')
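
The comment counter on the page may be abbreviated (the code above handles values such as `3.2k`). As a small, testable sketch of that conversion, the hypothetical helper `parse_comment_count` below turns both forms into an integer. Treating any other text as 0 is an assumption made here, not behaviour of the script above.

```python
def parse_comment_count(text):
    """Convert a counter such as '128' or '3.2k' into an int.

    parse_comment_count is a hypothetical helper. Text that matches neither
    form is treated as 0, which is an assumption for this sketch.
    """
    text = text.strip()
    if text.isdigit():
        return int(text)
    if text[-1:].lower() == 'k':
        try:
            # round() guards against floating-point artefacts before truncation
            return int(round(float(text[:-1]) * 1000))
        except ValueError:
            return 0
    return 0


for raw in ('128', '3.2k', 'n/a'):
    print(raw, '->', parse_comment_count(raw))
```

Keeping the conversion in one function means the stored count always has the same type, whichever form the page shows.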

Post URL: https://www.huihuidehui.com/posts/ccc2cbf8.html
Author: lalaking
Tag: Python web scraping
