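Scraping Audio from Ximalaya (喜马拉雅)
1. Introduction
Ximalaya hosts plenty of valuable audio, but also a great deal of clutter that just eats up time. So the first priority is to take only what I need, and the app itself is what gets dropped. I downloaded the audio through the app, only to find that the files were encrypted. So let's scrape it instead.
2. Analysis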
2.1. 1
2.2. 2
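Next, go to the album's home page. The album has 20 pages holding 596 tracks in total, but the page source only contains the tracks of the first page. Right-click, choose Inspect, open the Network tab, and click through to the second and third pages: a new request shows up each time.
Previewing its response shows the audio IDs we need, along with the title of each track.
Look at the URL of this request:
https://www.ximalaya.com/revision/album/v1/getTracksList?albumId=31109428&pageNum=2&sort=0
It carries the parameter "albumId", which is the album ID, and "pageNum", which selects one of the 20 pages.
With that, the audio IDs have been located. Next comes the program.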
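The two query parameters can also be filled in programmatically instead of being pasted into a string. This is a minimal sketch using only the standard library. The helper name `tracks_list_url` is made up for illustration, and `sort=0` is simply copied from the observed URL without knowing what other values would do.

```python
from urllib.parse import urlencode

BASE = 'https://www.ximalaya.com/revision/album/v1/getTracksList'

def tracks_list_url(album_id, page_num, sort=0):
    """Build the track-list URL for one page of an album (hypothetical helper)."""
    # Dict order is preserved, so the query matches the observed URL
    query = urlencode({'albumId': album_id, 'pageNum': page_num, 'sort': sort})
    return f'{BASE}?{query}'

# Pages are numbered from 1, so 20 pages means 1..20
urls = [tracks_list_url(31109428, n) for n in range(1, 21)]
print(urls[1])  # the URL for pageNum=2
```

Building the URL from a dict keeps the album ID in one place, which makes it easier to point the same code at another album.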
3. Program
3.1. Getting the IDs
import requests

# Send a browser-like User-Agent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'
}

# The 20 pages that list the track IDs
id_pages = [f'https://www.ximalaya.com/revision/album/v1/getTracksList?albumId=31109428&pageNum={i}&sort=0'
            for i in range(1, 21)]

# Lists holding the IDs and titles, kept in the same order
audio_id = []
titles = []

for url in id_pages:
    resp = requests.get(url, headers=headers)
    tracks = resp.json()['data']['tracks']
    for track in tracks:
        audio_id.append(track['trackId'])
        titles.append(track['title'])
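The nested lookups above assume that every response has a `data` object containing a `tracks` list. The sketch below does the same extraction as a pure function, so it can be tried on a hand-written payload without touching the network. The function name `extract_tracks` and the sample values are invented for illustration; only the keys `data`, `tracks`, `trackId` and `title` come from the code above.

```python
def extract_tracks(payload):
    """Return (track_id, title) pairs from one parsed getTracksList response."""
    data = payload.get('data') or {}      # tolerate a missing or null 'data'
    tracks = data.get('tracks') or []     # tolerate a missing or null 'tracks'
    return [(t['trackId'], t['title']) for t in tracks]

# Hand-written sample, not real API output
sample = {'data': {'tracks': [
    {'trackId': 1, 'title': 'Lesson 1'},
    {'trackId': 2, 'title': 'Lesson 2'},
]}}

pairs = extract_tracks(sample)
print(pairs)                            # [(1, 'Lesson 1'), (2, 'Lesson 2')]
print(extract_tracks({'data': None}))   # []
```

Returning pairs rather than two separate lists keeps each ID attached to its title.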
3.2. Getting the Audio Links
# One audio-info URL per track ID (596 in total)
audio_pages = [f'https://www.ximalaya.com/revision/play/v1/audio?id={track_id}&ptype=1'
               for track_id in audio_id]

audio_urls = []
for url in audio_pages:
    resp = requests.get(url, headers=headers)
    # The direct link to the audio file is stored under data.src
    audio_url = resp.json()['data']['src']
    audio_urls.append(audio_url)
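The download step relies on the titles and the audio links staying aligned by position, so a response without a usable `src` would either raise an error or shift every later pair. Whether this endpoint ever returns such a response is not established here; the sketch below only shows one way to guard against it. The names `audio_src` and `pair_up` and the sample payloads are hypothetical.

```python
def audio_src(payload):
    """Return the 'src' field of one parsed audio-info response, or None."""
    data = payload.get('data') or {}
    return data.get('src') or None

def pair_up(titles, payloads):
    """Keep (title, src) pairs only where a usable src was found."""
    pairs = []
    for title, payload in zip(titles, payloads):
        src = audio_src(payload)
        if src:
            pairs.append((title, src))
    return pairs

# Hand-written payloads; the URL is a placeholder, not a real audio link
payloads = [
    {'data': {'src': 'https://example.com/a.m4a'}},
    {'data': {'src': None}},
]
print(pair_up(['first', 'second'], payloads))
```

Skipped tracks simply drop out of the result, so the remaining titles still match their links.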
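3.3. Downloading the Audio
# Iterate over titles and links together instead of hard-coding the count of 596
for title, url in zip(titles, audio_urls):
    # The target folder must already exist
    with open(f'D:/videos/Verbal Advantage/{title}.m4a', 'wb') as fo:
        fo.write(requests.get(url, headers=headers).content)
    print(f'"{title}" is ok!')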
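The title goes straight into the file path, so a title containing a character such as `/`, `:` or `?` would make `open` fail on Windows. A minimal sketch of cleaning the title first; the helper name `safe_filename`, the `_` replacement and the `untitled` fallback are choices made for this example, not part of the original program.

```python
import re

# Printable characters that Windows does not allow in file names
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')

def safe_filename(title, replacement='_'):
    """Replace characters that are illegal in Windows file names."""
    cleaned = _ILLEGAL.sub(replacement, title).strip()
    # Fall back to a fixed name if nothing is left
    return cleaned or 'untitled'

print(safe_filename('Level 1: Words 1/10?'))  # Level 1_ Words 1_10_
```

With this in place, the path would be built from `safe_filename(title)` instead of `title`.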
4. Full Code
import requests

# Send a browser-like User-Agent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'
}

# Step 1: collect the track IDs and titles from the 20 list pages
id_pages = [f'https://www.ximalaya.com/revision/album/v1/getTracksList?albumId=31109428&pageNum={i}&sort=0'
            for i in range(1, 21)]

audio_id = []
titles = []

for url in id_pages:
    resp = requests.get(url, headers=headers)
    tracks = resp.json()['data']['tracks']
    for track in tracks:
        audio_id.append(track['trackId'])
        titles.append(track['title'])

# Step 2: look up the direct audio link for every track ID
audio_pages = [f'https://www.ximalaya.com/revision/play/v1/audio?id={track_id}&ptype=1'
               for track_id in audio_id]

audio_urls = []
for url in audio_pages:
    resp = requests.get(url, headers=headers)
    audio_url = resp.json()['data']['src']
    audio_urls.append(audio_url)

# Step 3: download each file (the target folder must already exist)
for title, url in zip(titles, audio_urls):
    with open(f'D:/videos/Verbal Advantage/{title}.m4a', 'wb') as fo:
        fo.write(requests.get(url, headers=headers).content)
    print(f'"{title}" is ok!')