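Reverse-Engineering the Comment API of a Certain Cloud Music Site for a Crawler

Environment-patching framework: v_jstool

Libraries used: requests (the code below also relies on py_mini_racer and pandas)

Video reference (its environment patching differs somewhat): videos uploaded by 掐住橙喵喵的头 on Bilibili

Written: 2025.1.7

P.S.: if anything is unclear, feel free to reach out and discuss.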

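Finding the comment response

Paging through the comments does not navigate to a new page, so the data is loaded via XHR. Search the Network panel in DevTools for the text of a comment; this locates the request https://music.163.com/weapi/comment/resource/comments/get?csrf_token=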

Identifying what needs to be reversed

Because this request is an XHR, add an XHR breakpoint in the Sources panel.

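Walking the call stack to find where the POST request is built

When the XHR breakpoint hits, look at the paused frame (underlined in red in the original screenshot): e7d has already been built at this point. Follow e7d up the call stack one frame at a time, inspecting the variables in each frame.

After this analysis you arrive at the stack frame where the parameters are constructed. Next, look at how bVk2x is built.

Hover over the asrsea function and jump to its implementation via the link shown in red (this only works while debugging, with the page paused).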

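Implementation of the encryption function

After debugging several times (refreshing and comparing the calls), it turns out that JSON.stringify(i7b) is the only argument of this function that carries the data being encrypted; the other arguments are the same on every refresh.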
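The refresh-and-compare step can be sketched as a small helper that takes the argument lists logged on each refresh and reports which positions change. This is only an illustration: `varying_positions` is a hypothetical name, the logged payloads are made-up samples, and the long modulus is replaced by a placeholder.

```python
from typing import List, Sequence


def varying_positions(calls: Sequence[Sequence[str]]) -> List[int]:
    """Return the argument positions whose value differs between logged calls."""
    if not calls:
        return []
    width = len(calls[0])
    if any(len(call) != width for call in calls):
        raise ValueError("all logged calls must have the same number of arguments")
    # A position varies if more than one distinct value was seen there.
    return [i for i in range(width) if len({call[i] for call in calls}) > 1]


# Hypothetical log of two asrsea calls captured on different refreshes.
# The payloads are invented and the modulus is shortened to a placeholder.
logged = [
    ('{"pageNo":1,"cursor":"1736200000000"}', "010001", "<modulus>", "0CoJUm6Qyw8W8jud"),
    ('{"pageNo":1,"cursor":"1736200005000"}', "010001", "<modulus>", "0CoJUm6Qyw8W8jud"),
]
print(varying_positions(logged))  # only position 0 (the JSON payload) changes
```

Any position that never shows up in the result can be hard-coded when calling the function from Python.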

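Add logpoints before and after bVk2x to print i7b, JSON.stringify(i7b), and bVk2x to the console. Let the request go through, then compare bVk2x with the form data of the comment request (the csrf_token packet) to confirm the request format. The i7b format I worked out after several attempts is the dictionary built in create_params in the Python code.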
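As a sketch of how that plaintext payload can be assembled and serialized in plain Python before encryption, the helper below mirrors the fields used in create_params. The name `build_payload` and the `page_size` and `now_ms` parameters are introduced here for illustration only; the field values reflect the format guessed in this post, not a documented API.

```python
import json
import time
from typing import Optional, Union


def build_payload(song_id: Union[int, str], page_no: Union[int, str],
                  page_size: int = 20, now_ms: Optional[int] = None) -> dict:
    """Build the plaintext dict that is JSON-encoded and then encrypted."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # current time in milliseconds
    thread = "R_SO_4_" + str(song_id)
    return {
        "rid": thread,
        "threadId": thread,
        "pageNo": page_no,
        "pageSize": page_size,
        "cursor": str(now_ms),
        # Page 1 starts at offset 0, page 2 at page_size, and so on.
        "offset": (int(page_no) - 1) * page_size,
        "orderType": "1",
        "csrf_token": "",
    }


payload = build_payload(2643172761, 3, now_ms=1736200000000)
# Compact separators mimic the output of JavaScript's JSON.stringify.
plaintext = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
print(payload["offset"])
print(plaintext)
```

Passing a fixed `now_ms` makes the output reproducible, which helps when comparing against the values printed by the logpoints.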

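Next, set a breakpoint on the line where asrsea is defined, turn off pretty-print, and copy that line together with all of the JS code before it.

Close browser tabs until only one remains. After configuring v_jstool, enable the master hook switch, refresh the page, and generate a temporary environment. Put the temporary environment and the JS code together in one file (the Python code reads it as test.js). The JS is very large, so avoid execjs, which is very slow on it; you can use MiniRacer (py_mini_racer) as I do.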

Next, write the Python crawler. The Python code is as follows:

import os
import random
import time
from typing import Union

import pandas as pd
import requests
from py_mini_racer import MiniRacer

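# Directory containing this script (abspath avoids an empty string for relative paths)
current_file_path = os.path.dirname(os.path.abspath(__file__))


# Build a random User-Agent string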
def get_ua():
    import random
    first_num = random.randint(55, 76)
    third_num = random.randint(0, 3800)
    fourth_num = random.randint(0, 140)
    os_type = ['(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)', '(X11; Linux x86_64)', '(Macintosh; Intel Mac OS X 10_14_5)']
    chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)
    ua = ' '.join(['Mozilla/5.0', random.choice(os_type), 'AppleWebKit/537.36', '(KHTML, like Gecko)', chrome_version, 'Safari/537.36'])
    return ua

def create_params(songid:Union[int,str],pageNo:Union[int,str])->dict:
    current_timestamp_ms = int(round(time.time() * 1000))
    # test.js holds the temporary environment plus the copied site JS
    with open(os.path.join(current_file_path, 'test.js'), 'rb') as f:
        js = f.read().decode('utf-8')
    i7b = {
        "rid": "R_SO_4_"+str(songid),
        "threadId": "R_SO_4_"+str(songid),
        "pageNo": pageNo,
        "pageSize": 20,
        "cursor": str(current_timestamp_ms),
        "offset": (int(pageNo)-1)*20,
        "orderType": "1",
        "csrf_token": ""
    }
    ctx = MiniRacer()
    ctx.eval(js)
    b = ctx.call("JSON.stringify",i7b)
    a = ctx.call("asrsea",b,"010001","00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7","0CoJUm6Qyw8W8jud")
    datas = {
        'params':a['encText'],
        'encSecKey':a['encSecKey']
    }
    return datas

# Send the POST request
def send_post(url:str,datas:dict):
    headers = {
        "User-Agent":  get_ua(),
        'accept': '*/*',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        'cache-control': 'no-cache',
        'content-type': 'application/x-www-form-urlencoded',
        'cookie':'fill in your own',# probably optional; I generated these headers directly from a curl command
        'dnt': '1',
        'origin': 'https://music.163.com',
        'pragma': 'no-cache',
        'priority': 'u=1, i',
        'referer': 'https://music.163.com/song?id=2643172761',
        'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
    }
    # A timeout keeps the crawler from hanging forever on a stalled connection
    response = requests.post(url=url,headers=headers,data=datas,timeout=10)
    return response

# Write the cleaned data to a CSV file
def data_to_csv(response)->None:
    csv_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'wumusiccomment.csv')
    if response.status_code == 200:
        getdata = response.json()
        comments = getdata['data']['comments']
        comments_data = [{'nickname': comment['user']['nickname'], 'content': comment['content']} for comment in comments if 'nickname' in comment['user'] and 'content' in comment]
        # Convert the list to a DataFrame
        df = pd.DataFrame(comments_data)
        # Append each page; write the header row only when the file is first created
        df.to_csv(csv_path,index=False,encoding='utf_8_sig',mode='a',header=not os.path.exists(csv_path))
    else:
        print("Status code: {}".format(response.status_code))

if __name__ == '__main__':
    print(os.path.join(current_file_path, "test.js"))
    songid = 2643172761
    page = 10 # fetches page*20 comments
    url = "https://music.163.com/weapi/comment/resource/comments/get?csrf_token="
    for i in  range(1,page+1):
        datas = create_params(songid=songid,pageNo=i)
        response = send_post(url=url, datas=datas)
        data_to_csv(response=response)
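
The cleaning step in data_to_csv can be tried offline, without sending any request, against a hand-made response body. The sketch below uses only the standard library; `extract_rows`, `rows_to_csv`, and the sample data are hypothetical and assume the response has the same data → comments → user/content shape that data_to_csv reads.

```python
import csv
import io
from typing import Dict, List


def extract_rows(body: dict) -> List[Dict[str, str]]:
    """Pull nickname/content pairs out of a decoded response body."""
    comments = (body.get("data") or {}).get("comments") or []
    rows = []
    for comment in comments:
        user = comment.get("user") or {}
        # Skip entries that lack either field, as data_to_csv does.
        if "nickname" in user and "content" in comment:
            rows.append({"nickname": user["nickname"], "content": comment["content"]})
    return rows


def rows_to_csv(rows: List[Dict[str, str]], write_header: bool = True) -> str:
    """Serialize rows to CSV text; write_header=False suits appended pages."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["nickname", "content"], lineterminator="\n")
    if write_header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical response body with the shape data_to_csv expects.
sample = {
    "data": {
        "comments": [
            {"user": {"nickname": "alice"}, "content": "nice song"},
            {"user": {}, "content": "no nickname, skipped"},
        ]
    }
}
rows = extract_rows(sample)
print(rows_to_csv(rows))
```

Keeping extraction separate from file writing makes it easy to check the parsing logic before running the crawler against the live endpoint.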