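Reverse-engineering the comment crawler for a certain cloud music site
Environment-patching framework: v_jstool
Libraries used: requests (the script also imports py_mini_racer and pandas)
Video reference (its environment patching is a bit different): videos uploaded by 掐住橙喵喵的头 on Bilibili
Written: 2025.1.7
PS: if anything is unclear, feel free to get in touch to learn and exchange ideas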
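Finding the comment response
Paging through the comments does not navigate to a new page, so check the XHR requests in the Network panel. Searching the DevTools Network panel directly for the text of a comment locates the request https://music.163.com/weapi/comment/resource/comments/get?csrf_token=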
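Working out what needs to be reversed
Because this request is sent as XHR, add an XHR breakpoint in the Sources panel.
Inspect the call stack to find where the POST request is constructed
The screenshot shows the page after the XHR breakpoint is hit. Look at the part underlined in red: e7d has already been built at this point, so follow e7d up the call stack step by step, checking the variables at each frame.
The next screenshot shows the stack frame and the place where the parameters are constructed, as found by this analysis. Next we look at how bVk2x is constructed.
Hover over the asrsea function and jump to the function implementation marked in red (this only works while debugging, with the page paused).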
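Implementation of the encryption function
After debugging several times (refreshing and comparing), I found that of this function's arguments only JSON.stringify(i7b) is the data being encrypted.
Set logpoints before and after bV2k and print the values of i7b, JSON.stringify(i7b) and bV2k to the console. After letting the request go through, compare bV2k with the form data of the by_token request to confirm the request format. The i7b format I worked out after several attempts is the dictionary built in create_params in the script.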
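To make the comparison with the form data less error-prone, the request body copied from DevTools can be split into its fields first. This is a minimal sketch, assuming the body is an ordinary URL-encoded string carrying the `params` and `encSecKey` fields that the script sends; `parse_form_body` and the sample values are hypothetical and not taken from the site.

```python
from urllib.parse import parse_qs, urlencode


def parse_form_body(body: str) -> dict:
    """Split a URL-encoded POST body into a {field: value} dict."""
    parsed = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list per field; each field appears once here,
    # so keep only the first value
    return {key: values[0] for key, values in parsed.items()}


# Made-up sample values, only to show the shape of the form data
sample_body = urlencode({"params": "abc+/=", "encSecKey": "0f" * 4})
fields = parse_form_body(sample_body)
print(sorted(fields))
```

The decoded `params` value can then be compared character by character with the value printed by the logpoint.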
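Next, set a breakpoint on the line containing asrsea, turn off pretty-printing, and copy that line together with all the JS code before it.
Close tabs until only one is left. After configuring v_jstool, enable the master hook switch, refresh, and generate a temporary environment. Put the temporary environment and the JS code together in one file (the JS is large, so do not use execjs, which is very slow; you can use MiniRacer as I do).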
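Putting the two pieces together can be scripted. The sketch below assumes the environment code has to be evaluated before the site code so that the globals it defines already exist; `build_bundle`, `env.js` and `core.js` are placeholder names, and only `test.js` matches the file name the crawler script reads.

```python
import tempfile
from pathlib import Path


def build_bundle(env_path, js_path, out_path) -> int:
    """Write the environment code followed by the site JS into one file."""
    env_code = Path(env_path).read_text(encoding="utf-8")
    site_code = Path(js_path).read_text(encoding="utf-8")
    # Environment first; the ";" guards against a missing trailing semicolon
    bundle = env_code + "\n;\n" + site_code
    Path(out_path).write_text(bundle, encoding="utf-8")
    return len(bundle)


# Usage with throwaway files (contents are stand-ins, not the real code)
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp, "env.js")
    site_file = Path(tmp, "core.js")
    out_file = Path(tmp, "test.js")
    env_file.write_text("var window = globalThis;", encoding="utf-8")
    site_file.write_text("window.asrsea = function () {};", encoding="utf-8")
    build_bundle(env_file, site_file, out_file)
    bundle_text = out_file.read_text(encoding="utf-8")
```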
Next, write the Python crawler. The Python code is as follows:
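import os
import time
from typing import Union

import pandas as pd
import requests
from py_mini_racer import MiniRacer

# Directory containing this script; test.js is expected to sit next to it
current_file_path = os.path.dirname(os.path.abspath(__file__))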
# Build a random User-Agent string
def get_ua():
    import random
    first_num = random.randint(55, 76)
    third_num = random.randint(0, 3800)
    fourth_num = random.randint(0, 140)
    os_type = ['(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)', '(X11; Linux x86_64)', '(Macintosh; Intel Mac OS X 10_14_5)']
    chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)
    ua = ' '.join(['Mozilla/5.0', random.choice(os_type), 'AppleWebKit/537.36', '(KHTML, like Gecko)', chrome_version, 'Safari/537.36'])
    return ua
# Build the encrypted form fields (params, encSecKey) for one page of comments
def create_params(songid:Union[int,str],pageNo:Union[int,str])->dict:
    current_timestamp_ms = int(round(time.time() * 1000))
    # test.js holds the temporary environment plus the copied site JS
    with open(os.path.join(current_file_path, 'test.js'), 'rb') as f:
        js = f.read().decode('utf-8')
    i7b = {
        "rid": "R_SO_4_"+str(songid),
        "threadId": "R_SO_4_"+str(songid),
        "pageNo": pageNo,
        "pageSize": 20,
        "cursor": str(current_timestamp_ms),
        "offset": (int(pageNo)-1)*20,
        "orderType": "1",
        "csrf_token": ""
    }
    ctx = MiniRacer()
    ctx.eval(js)
    # Serialize with the JS engine so the string matches what the page produces
    b = ctx.call("JSON.stringify",i7b)
    # The three constants are the fixed arguments observed while debugging
    a = ctx.call("asrsea",b,"010001","00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7","0CoJUm6Qyw8W8jud")
    datas = {
        'params':a['encText'],
        'encSecKey':a['encSecKey']
    }
    return datas
# Send the POST request
def send_post(url:str,datas:dict):
    headers = {
        "User-Agent": get_ua(),
        'accept': '*/*',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        'cache-control': 'no-cache',
        'content-type': 'application/x-www-form-urlencoded',
        'cookie':'fill in your own',# probably optional; I generated these headers directly from a copied cURL command
        'dnt': '1',
        'origin': 'https://music.163.com',
        'pragma': 'no-cache',
        'priority': 'u=1, i',
        'referer': 'https://music.163.com/song?id=2643172761',
        'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
    }
    response = requests.post(url=url,headers=headers,data=datas)
    return response
# Write the cleaned data to a CSV file
def data_to_csv(response)->None:
    current_file_path = os.path.dirname(os.path.abspath(__file__))
    if response.status_code == 200:
        getdata = response.json()
        comments = getdata['data']['comments']
        comments_data = [{'nickname': comment['user']['nickname'], 'content': comment['content']} for comment in comments if 'nickname' in comment['user'] and 'content' in comment]
        # Convert the list to a DataFrame
        df = pd.DataFrame(comments_data)
        csv_path = os.path.join(current_file_path, "wumusiccomment.csv")
        # Append page by page; write the header row only when the file is first created
        df.to_csv(csv_path,index=False,encoding='utf_8_sig',mode='a',header=not os.path.exists(csv_path))
    else:
        print("Status code: {}".format(response.status_code))
if __name__ == '__main__':
    print(os.path.join(current_file_path, "test.js"))
    songid = 2643172761
    page = 10 # fetches page*20 comments
    url = "https://music.163.com/weapi/comment/resource/comments/get?csrf_token="
    for i in range(1,page+1):
        datas = create_params(songid=songid,pageNo=i)
        response = send_post(url=url, datas=datas)
        data_to_csv(response=response)
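The cleaning step inside data_to_csv can also be pulled out into a pure function, which makes it easy to check without sending any request. This is only a sketch: `extract_comments` is a hypothetical helper, the sample payload is made up, and the field names (`data`, `comments`, `user`, `nickname`, `content`) are simply the ones the script above reads.

```python
def extract_comments(payload: dict) -> list:
    """Return [{'nickname': ..., 'content': ...}] from a decoded response."""
    # Tolerate a missing or null "data" / "comments" entry instead of raising
    comments = (payload.get("data") or {}).get("comments") or []
    rows = []
    for comment in comments:
        user = comment.get("user") or {}
        # Same filter as the script: keep only complete records
        if "nickname" in user and "content" in comment:
            rows.append({"nickname": user["nickname"], "content": comment["content"]})
    return rows


# Made-up sample shaped like the fields the script reads
sample = {"data": {"comments": [
    {"user": {"nickname": "alice"}, "content": "nice song"},
    {"user": {}, "content": "no nickname, skipped"},
]}}
rows = extract_comments(sample)
print(rows)
```

Unlike the list comprehension in data_to_csv, this version returns an empty list rather than raising KeyError when the response carries no `data` field, which can be useful if a request is rejected.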