python spider learning
This article records what I learned about Python web scraping; it doubles as a set of Python notes.
I learned roughly enough to scrape images, videos, and songs: a simple introduction, picked up mostly out of curiosity. Still worthwhile, and the payoff comes much faster than with binary work, haha. Later I should still write up the hook and anti-debugging articles; organizing them takes too much time, but it has to be done...
Usage of the requests library
A GET request in one line:
r = requests.get('https://api.github.com/events')
A POST request in one line:
r = requests.post('https://httpbin.org/post', data = {'key':'value'})
Getting the text of the server's response:
import requests
r = requests.get('https://api.github.com/events')
r.text
Pretending to be a browser:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)
Getting the response status code:
r = requests.get('https://httpbin.org/get')
r.status_code
The re library (regular expressions) is used to pull the information we need out of long strings.
re.S makes . match newlines as well, which matters when the target spans multiple lines.
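For example, without re.S a dot will not cross a line break:

import re

s = 'a\nb'
print(re.findall('a.b', s))        # [] because '.' does not match the newline
print(re.findall('a.b', s, re.S))  # ['a\nb']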
Greedy versus non-greedy matching, that is .* versus .*?:
import re
s="s_#aa&"
res = re.findall('.*',s,re.S)
print(res)
res = re.findall('.*?',s,re.S)
print(res)
# ['s_#aa&', '']
# ['', 's', '', '_', '', '#', '', 'a', '', 'a', '', '&', '']
As the output shows, .* greedily grabs the whole string in one go, while .*? matches as little as possible, producing one character (plus empty matches) at a time.
.*? is the one used most often to match an arbitrary stretch of text, and it is honestly great: give it a head and a tail and it will pull out the string you want in between.
Here is an example (the same pattern will be used again later); it should make things clear right away.
import re

s = '''
<li>
    <div class="list_num ">21.</div>
    <div class="pic"><a href="http://product.dangdang.com/28541936.html" target="_blank"><img src="http://img3m6.ddimg.cn/38/25/28541936-1_l_9.jpg" alt="男孩的学习力" title="男孩的学习力"/></a></div>
    <div class="name"><a href="http://product.dangdang.com/28541936.html" target="_blank" title="男孩的学习力">男孩的学习力</a></div>
'''

res = re.findall('<li>.*?>(\d+).*?', s, re.S)
print(res)
res = re.findall('<img src="(.*?)"', s, re.S)
print(res)
res = re.findall('title="(.*?)".*?class="name"', s, re.S)
print(res)
The first spider: scraping book data from Dangdang.
It breaks down into three parts:
Use requests to GET the page and receive the server's response.
Process the response with regular expressions to keep only what we want.
Organize the useful data and write it to a file.
The GET request:
def request_juger(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None
Searching with a regular expression:
import re

s = '''
<li>
    <div class="list_num ">22.</div>
    <div class="pic"><a href="http://product.dangdang.com/28541936.html" target="_blank"><img src="http://img3m6.ddimg.cn/38/25/28541936-1_l_9.jpg" alt="男孩的学习力" title="男孩的学习力"/></a></div>
    <div class="name"><a href="http://product.dangdang.com/28541936.html" target="_blank" title="男孩的学习力">男孩的学习力</a></div>
    <div class="star"><span class="level"><span style="width: 97.2%;"></span></span><a href="http://product.dangdang.com/28541936.html?point=comment_point" target="_blank">123849条评论</a><span class="tuijian">100%推荐</span></div>
    <div class="publisher_info">[日]<a href="http://search.dangdang.com/?key=富永雄辅" title="[日]富永雄辅 著,吴一红 译,酷威文化 出品" target="_blank">富永雄辅</a> 著,<a href="http://search.dangdang.com/?key=吴一红" title="[日]富永雄辅 著,吴一红 译,酷威文化 出品" target="_blank">吴一红</a> 译,<a href="http://search.dangdang.com/?key=酷威文化" title="[日]富永雄辅 著,吴一红 译,酷威文化 出品" target="_blank">酷威文化</a> 出品</div>
    <div class="publisher_info"><span>2020-06-01</span> <a href="http://search.dangdang.com/?key=四川文艺出版社" target="_blank">四川文艺出版社</a></div>
    <div class="biaosheng">五星评分:<span>91796次</span></div>
    <div class="price">
        <p><span class="price_n">¥17.90</span> <span class="price_r">¥39.80</span>(<span class="price_s">4.5折</span>) </p>
        <p class="price_e">电子书:<span class="price_n">¥7.99</span></p>
        <div class="buy_button">
            <a ddname="加入购物车" name="" href="javascript:AddToShoppingCart('28541936');" class="listbtn_buy">加入购物车</a>
            <a name="" href="http://product.dangdang.com/1901212680.html" class="listbtn_buydz" target="_blank">购买电子书</a>
            <a ddname="加入收藏" id="addto_favorlist_28541936" name="" href="javascript:showMsgBox('addto_favorlist_28541936',encodeURIComponent('28541936&platform=3'), 'http://myhome.dangdang.com/addFavoritepop');" class="listbtn_collect">收藏</a>
        </div>
'''

cmp = re.compile('<li>.*?>(\d+).*?</div>.*?<img src="(.*?)".*?title="(.*?)".*?class="name".*?class="tuijian">(.*?)</span>.*?target="_blank">(.*?)</a>.*?<div class="biaosheng">(.*?)<span>.*?"price_n">(.*?)</span>', re.S)
res = re.findall(cmp, s)
print(res)
Writing to a file:
def write_to_file(t):
with open('book.txt', 'a', encoding='UTF-8') as f:
f.write(json.dumps(t, ensure_ascii=False) + '\n')
Note the json.dumps() call here.
json.dumps converts a Python data structure into a JSON string:
import json
data = {
'name' : 'myname',
'age' : 100,
}
json_str = json.dumps(data)
ensure_ascii=True is the default and escapes everything to ASCII; change it to False and Chinese characters are written out directly.
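For example, continuing from the snippet above:

print(json.dumps({'city': '北京'}))                      # {"city": "\u5317\u4eac"}
print(json.dumps({'city': '北京'}, ensure_ascii=False))  # {"city": "北京"}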
The complete code. The yield part confused me at first (why did iterating seem to give back everything rather than one group?): find_useful is a generator, so it really does hand back one dict per iteration, and the for loop in main simply consumes them all, one at a time. A small demo follows the code.
import re
import requests
import json

def request_juger(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def find_useful(html):
    cmp = re.compile(
        '<li>.*?>(\d+).*?</div>.*?<img src="(.*?)".*?title="(.*?)".*?class="name".*?class="tuijian">(.*?)</span>.*?target="_blank">(.*?)</a>.*?<div class="biaosheng">(.*?)<span>.*?"price_n">(.*?)</span>',
        re.S)
    texts = re.findall(cmp, html)
    for text in texts:
        yield {
            'range': text[0],
            'image': text[1],
            'title': text[2],
            'recommend': text[3],
            'author': text[4],
            'times': text[5],
            'price': text[6]
        }

def write_to_file(t):
    with open('book.txt', 'a', encoding='UTF-8') as f:
        f.write(json.dumps(t, ensure_ascii=False) + '\n')

def main(i):
    url = "http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-" + str(i)
    texts = request_juger(url)
    text = find_useful(texts)
    for i in text:
        write_to_file(i)

if __name__ == "__main__":
    for i in range(1, 26):
        main(i)
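A minimal demo of that generator behavior: calling a function that contains yield returns a generator object without running its body, and every iteration resumes it up to the next yield.

def gen():
    for i in range(3):
        yield i            # pause here and hand one value back

g = gen()
print(g)          # <generator object gen at ...>, nothing has run yet
print(next(g))    # 0
for x in g:       # the for loop keeps calling next() until the generator is exhausted
    print(x)      # 1, then 2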
BeautifulSoup is a Python library that can do some of the same jobs as re: grab a particular string, the title, hyperlinks, and so on.
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>学习python的正确姿势</title></head>
<body>
<p class="title"><b>小帅b的故事</b></p>

<p class="story">有一天,小帅b想给大家讲两个笑话
<a href="http://example.com/1" class="sister" id="link1">一个笑话长</a>,
<a href="http://example.com/2" class="sister" id="link2">一个笑话短</a> ,
他问大家,想听长的还是短的?</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.string)
print(soup.p.string)
print(soup.title.parent.name)
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id="link2"))
print(soup.get_text())
The second spider: scrape the details of Douban's top 250 movies and write them to Excel. What differs from the previous spider is the use of the BeautifulSoup library.
Target: url=https://movie.douban.com/top250?start=0&filter=; the only thing that changes between pages is start = 25 * page number.
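So all ten page URLs can be built up front:

# start = 25 * page number; pages 0..9 cover the top 250
urls = ['https://movie.douban.com/top250?start={}&filter='.format(i * 25) for i in range(10)]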
The basic HTML structure:
<ol class="grid_view">
  <li>
    <div class="item">
      <div class="pic">
        <em class="">1</em>
        <a href="https://movie.douban.com/subject/1292052/">
          <img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
        </a>
      </div>
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">肖申克的救赎</span>
            <span class="title"> / The Shawshank Redemption</span>
            <span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
          </a>
          <span class="playable">[可播放]</span>
        </div>
        <div class="bd">
          <p class="">
            导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
            1994 / 美国 / 犯罪 剧情
          </p>
          <div class="star">
            <span class="rating5-t"></span>
            <span class="rating_num" property="v:average">9.7</span>
            <span property="v:best" content="10.0"></span>
            <span>2477393人评价</span>
          </div>
          <p class="quote">
            <span class="inq">希望让人自由。</span>
          </p>
        </div>
      </div>
    </div>
  </li>
Everything sits inside class="grid_view". We need the rank, the image URL, the name, the author, the rating, and the one-line quote, and then write them into an xls file.
First, a quick look at xlwt, the library that writes data into Excel:
import xlwt

workbook = xlwt.Workbook(encoding='utf-8')
worksheet = workbook.add_sheet('My Worksheet')
worksheet.write(1, 0, label='this is test')
The complete Python code:
import requests
import json
from bs4 import BeautifulSoup
import xlwt

workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
worksheet = workbook.add_sheet('豆瓣电影Top250', cell_overwrite_ok=True)
worksheet.write(0, 0, '名称')
worksheet.write(0, 1, '图片')
worksheet.write(0, 2, '排名')
worksheet.write(0, 3, '评分')
worksheet.write(0, 4, '作者')
worksheet.write(0, 5, '简介')
n = 1

def request_juger(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
    }
    try:
        response = requests.get(url=url, headers=header)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def find_useful_xlwt(html):
    soup = BeautifulSoup(html, 'lxml')
    list = soup.find(class_='grid_view').find_all('li')
    for item in list:
        item_index = item.find(class_='').string
        item_name = item.find(class_='title').string
        item_picture = item.find('a').find('img').get('src')
        item_author = item.find('p').text.replace("\n", '').replace(" ", '')[0:20]
        item_score = item.find(class_='rating_num').string
        item_intr = ''   # some movies have no one-line quote
        if item.find(class_="inq") is not None:
            item_intr = item.find(class_="inq").string
        global n
        worksheet.write(n, 0, item_name)
        worksheet.write(n, 1, item_picture)
        worksheet.write(n, 2, item_index)
        worksheet.write(n, 3, item_score)
        worksheet.write(n, 4, item_author)
        worksheet.write(n, 5, item_intr)
        n = n + 1

def main(i):
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    html = request_juger(url)
    find_useful_xlwt(html)

if __name__ == "__main__":
    for i in range(0, 10):
        main(i)
    workbook.save(u'豆瓣最受欢迎的250部电影.xls')
Multithreading. An example from the Runoob tutorial: create a thread class that inherits from threading.Thread.
import threading
import time

exitFlag = 0

class myThread(threading.Thread):
    def __init__(self, threadID, name, counter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter

    def run(self):
        print("开始线程:" + self.name)
        print_time(self.name, self.counter, 10)
        print("退出线程:" + self.name)

def print_time(threadName, delay, counter):
    while counter:
        if exitFlag:
            threadName.exit()
        time.sleep(delay)
        print("%s: %s" % (threadName, time.ctime(time.time())))
        counter -= 1

thread1 = myThread(1, "Thread-1", 1)
thread2 = myThread(2, "Thread-2", 2)

thread1.start()
thread2.start()
thread1.join()
thread2.join()
print("退出主线程")
A thread lock guarantees that shared data is touched by only one thread at a time.
def run(self):
    print("开启线程: " + self.name)
    threadLock.acquire()      # take the lock before touching shared data
    print_time(self.name, self.counter, 3)
    threadLock.release()      # release it so the next thread can proceed

threadLock = threading.Lock()
The third spider: scraping images from a certain website, heh heh.
A small preliminary step to check whether an image can be downloaded at all:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/95.0.4638.69 Safari/537.36',
    "Referer": "https://s2.baozimh.com"
}

filename = '111.jpg'
with open(filename, 'wb') as f:
    img = requests.get('https://s2.baozimh.com/scomic/douluodalu-fengxuandongman/0/9-htxl/2.jpg', headers=headers).content
    f.write(img)
Below is an example from a project I found; I wrote my own version along the same lines, and it works nicely.
import requests
import json
from bs4 import BeautifulSoup
import xlwt
import os

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/95.0.4638.69 Safari/537.36',
    "Referer": "https://www.mzitu.com/all/"
}

def request_juger(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                      'Chrome/95.0.4638.69 Safari/537.36',
        "Referer": "https://www.mzitu.com/all/"
    }
    try:
        response = requests.get(url=url, headers=header)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def get_page_urls():
    base_url = 'https://www.mzitu.com/page/'
    for i in range(4, 5):
        url = base_url + str(i)
        html = request_juger(url)
        soup = BeautifulSoup(html, 'lxml')
        list = soup.find(class_='postlist').find_all('li')
        meizi_url = []
        for i in list:
            meizi_url.append(i.find('span').find('a').get('href'))
        return meizi_url

def download_Pic(title, image_list):
    os.mkdir(title)
    j = 1
    for item in image_list:
        filename = '%s/%s.jpg' % (title, str(j))
        print('downloading....%s : NO.%s' % (title, str(j)))
        with open(filename, 'wb') as f:
            img = requests.get(url=item, headers=header).content
            f.write(img)
        j += 1

def download_images(url):
    pages = []
    html = request_juger(url)
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('h2').string
    page = soup.find(class_='pagenavi').find_all('a')[-2].find('span').string
    image_list = []
    for i in range(1, int(page)):
        html = request_juger(url + '/%s' % i)
        soup = BeautifulSoup(html, 'lxml')
        img_url = soup.find('img').get('src')
        image_list.append(img_url)
    print(image_list)
    download_Pic(title, image_list)

def main():
    urls = get_page_urls()
    for url in urls:
        download_images(url)

if __name__ == "__main__":
    main()
Following the same recipe, I went and scraped some manhua: Douluo Dalu.
import requests
import json
from bs4 import BeautifulSoup
import xlwt
import os

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/95.0.4638.69 Safari/537.36',
    "Referer": "https://s2.baozimh.com"
}

def request_juger(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
        "Referer": "https://s2.baozimh.com"
    }
    try:
        response = requests.get(url=url, headers=header)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def get_page_urls():
    base_url = 'https://cn.webmota.com/comic/chapter/douluodalu-fengxuandongman/0_'
    manhua_url = []
    for i in range(0, 10):
        url = base_url + str(i) + '.html'
        manhua_url.append(url)
    return manhua_url

def download_Pic(title, image_list):
    os.mkdir(title)
    j = 1
    for item in image_list:
        filename = '%s/%s.jpg' % (title, str(j))
        print('downloading....%s : NO.%s' % (title, str(j)))
        with open(filename, 'wb') as f:
            img = requests.get(url=item, headers=header).content
            f.write(img)
        j += 1

def download_images(url):
    html = request_juger(url)
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('head').find('title').string
    list = []
    page = soup.find(class_='comic-text__amp').find('em').string.replace('\n', '').replace(' ', '')[-1]
    for i in range(int(page)):
        list.append(soup.find_all('img')[i].get('src'))
    download_Pic(title, list)

def main():
    urls = get_page_urls()
    print(urls)
    for url in urls:
        download_images(url)

if __name__ == "__main__":
    main()
To crawl with a pool of workers instead, just import the corresponding concurrency module and arrange it like this:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
    for url in urls:
        executor.submit(download_images, url)
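Strictly speaking, ProcessPoolExecutor runs the jobs in separate processes; the thread-based pool has the same interface, so a genuinely multithreaded sketch (reusing the same urls and download_images from above) would just swap the class:

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for url in urls:
        executor.submit(download_images, url)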
selenium. I had wanted to jump straight into scraping video, but it turned out the links contained no mp4 at all, so I came back to selenium to grab the rendered page source.
First install it with pip:
pip install selenium
Next, set up the Chrome driver: download chromedriver from the official site, add it to the PATH environment variable, and after a reboot it is ready to use.
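If you would rather not touch PATH, newer Selenium releases (4.x) also accept an explicit driver path; a minimal sketch, assuming chromedriver.exe was unpacked to the hypothetical path below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service(r'C:\tools\chromedriver.exe')  # hypothetical location of the downloaded driver
driver = webdriver.Chrome(service=service)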
A quick test:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://new.iskcd.com/20211106/HhseFdeO/index.m3u8")
text = driver.page_source
print(text)
The fourth spider: scraping video. This one took a lot of fumbling on my own, and scraping video turns out to be very different from scraping images. Many sites use m3u8 to break an mp4 into many small .ts segments. The approach: first get index.m3u8, request the link it contains, extract all the .ts URLs from the response, download them, and finally stitch them back into an mp4 with ffmpeg.
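For reference, a generic way to pull the segment entries out of a playlist (a sketch not tied to this particular site, assuming the playlist text has already been fetched into m3u8_text) is simply to keep every non-comment line:

# in an m3u8 playlist, lines starting with '#' are tags; the remaining lines are segment URIs
ts_urls = [line.strip() for line in m3u8_text.splitlines()
           if line.strip() and not line.startswith('#')]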
The code: single-threaded and fairly slow.
import requests
from bs4 import BeautifulSoup
import re
import subprocess

num = 0
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/95.0.4638.69 Safari/537.36',
}

def request_juger(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
    }
    try:
        response = requests.get(url=url, headers=header)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def get_page_urls():
    base_url = 'https://www.great-elec.com/video/924-0-'
    manhua_url = []
    for i in range(7, 11):
        url = base_url + str(i) + '.html'
        print(url)
        manhua_url.append(url)
    return manhua_url

def download_mp4(title, ts_urls):
    global num
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
    }
    print(num)
    for ts_url in ts_urls:
        name = ts_url.split('/')[-1]
        res = requests.get(url=ts_url, headers=header)
        with open("D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{}.txt".format(num), "a+") as f:
            f.write("file '{}'\n".format(name))
        with open('D:\\learning record\\学术报告\\python_spider\\爬虫\\mp4\\{}'.format(name), 'wb') as f:
            f.write(res.content)
    cmd = 'ffmpeg -f concat -i "D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{0}.txt" -c copy "D:/learning record/学术报告/python_spider/爬虫/mp4/vidoe/output{1}.mp4"'.format(num, num)
    print(cmd)
    subprocess.Popen(cmd, shell=True)
    num += 1

def download(url):
    title = 'aa'
    html = request_juger(url)
    soup = BeautifulSoup(html, 'lxml')
    a = soup.find(class_='box').find('p').find('script')
    cmp1 = re.compile('<script>.*?now="(.*?)";.*?', re.S)
    texts = re.findall(cmp1, html)
    m3m8_url = texts[0]
    print(m3m8_url[:40])
    m3u8_html = requests.get(url=m3m8_url, headers=header).text
    print(m3u8_html[-22:-1])
    new_m3m8_url = m3m8_url[:40] + m3u8_html[-22:-1]
    print(new_m3m8_url)
    new_m3m8_html = requests.get(url=new_m3m8_url, headers=header).text
    ts_urls = re.findall(re.compile(',\n(.*?.ts)\n#'), new_m3m8_html)
    download_mp4(title, ts_urls)

def main():
    urls = get_page_urls()
    for url in urls:
        download(url)

if __name__ == "__main__":
    main()
When I later did the multithreaded version I found a remaining problem: splicing new_m3m8_url out of fixed character offsets can fail to produce a valid URL. One could split on / first and then rejoin, since the slashes are fixed, but I was too lazy to do it.
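A sketch of a more robust join, using urllib.parse.urljoin instead of fixed slicing, assuming the nested playlist reference is the last non-comment line of the first .m3u8 (m3m8_url and m3u8_html as in download() above):

from urllib.parse import urljoin

relative_path = [line for line in m3u8_html.splitlines()
                 if line.strip() and not line.startswith('#')][-1]
new_m3m8_url = urljoin(m3m8_url, relative_path)  # resolves both relative and absolute references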
The multithreaded version; in practice it is not much faster.
import threading
import requests
from bs4 import BeautifulSoup
import re
import subprocess

exitFlag = 0

class myThread(threading.Thread):
    def __init__(self, star, name, end):
        threading.Thread.__init__(self)
        self.star = star
        self.name = name
        self.end = end

    def run(self):
        print("开始线程:" + self.name)
        func(self.star, self.end)
        print("退出线程:" + self.name)

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/95.0.4638.69 Safari/537.36',
}

def request_juger(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
    }
    try:
        response = requests.get(url=url, headers=header)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def get_page_urls(star, end):
    base_url = 'https://www.great-elec.com/video/924-0-'
    manhua_url = []
    for i in range(star, end):
        url = base_url + str(i) + '.html'
        print(url)
        manhua_url.append(url)
    return manhua_url

def download_mp4(title, ts_urls, num):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
    }
    for ts_url in ts_urls:
        name = ts_url.split('/')[-1]
        res = requests.get(url=ts_url, headers=header)
        with open("D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{}.txt".format(num), "a+") as f:
            f.write("file '{}'\n".format(name))
        with open('D:\\learning record\\学术报告\\python_spider\\爬虫\\mp4\\{}'.format(name), 'wb') as f:
            f.write(res.content)
    cmd = 'ffmpeg -f concat -i "D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{0}.txt" -c copy "D:/learning record/学术报告/python_spider/爬虫/mp4/vidoe/output{1}.mp4"'.format(num, num)
    print(cmd)
    subprocess.Popen(cmd, shell=True)
    num += 1

def download(url, num):
    title = 'aa'
    html = request_juger(url)
    soup = BeautifulSoup(html, 'lxml')
    a = soup.find(class_='box').find('p').find('script')
    cmp1 = re.compile('<script>.*?now="(.*?)";.*?', re.S)
    texts = re.findall(cmp1, html)
    m3m8_url = texts[0]
    print(m3m8_url[:43])
    m3u8_html = requests.get(url=m3m8_url, headers=header).text
    print(m3u8_html[-22:-1])
    new_m3m8_url = m3m8_url[:43] + m3u8_html[-22:-1]
    print(new_m3m8_url)
    new_m3m8_html = requests.get(url=new_m3m8_url, headers=header).text
    ts_urls = re.findall(re.compile(',\n(.*?.ts)\n#'), new_m3m8_html)
    print(ts_urls)
    download_mp4(title, ts_urls, num)

def func(star, end):
    urls = get_page_urls(star, end)
    for url in urls:
        download(url, star)

if __name__ == "__main__":
    thread1 = myThread(13, "Thread-1", 14)
    thread2 = myThread(16, "Thread-2", 18)
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
    print("退出主线程")
The fifth spider: downloading songs. Again it mostly comes down to finding the right links, but along the way I did pick up some json: a parsed response is basically nested dicts and lists, which is very convenient.
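A tiny illustration of that (the response shape here is made up for the example, not Kugou's actual schema):

import json

text = '{"data": {"song": [{"hash": "abc", "AlbumID": "1"}]}}'
res = json.loads(text)                   # str -> nested dicts and lists
print(res['data']['song'][0]['hash'])    # abc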
It lets you type in any name and download every song shown in the results, though only some of them actually have a playable preview URL.
import requests
import re
import os
import json

def download_song(title, song_list, song_name):
    os.mkdir(title)
    j = 0
    for item in song_list:
        filename = '%s/%s.mp3' % (title, song_name[j])
        print('downloading....%s : NO.%s' % (title, song_name[j]))
        with open(filename, 'wb') as f:
            mp3 = requests.get(item).content
            f.write(mp3)
        j += 1

def request_juger(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/95.0.4638.69 Safari/537.36',
        "Referer": "https://www.kugou.com/"
    }
    try:
        response = requests.get(url=url, headers=header)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(e)
    return None

def main():
    name = input()
    base_url = 'https://searchrecommend.kugou.com/get/complex?callback=jQuery112403385798993366811_1636390150231&word=%s&_=1636390150232' % (name)
    text = request_juger(base_url)
    useful = re.match(".*?({.*}).*", text, re.S)
    res = json.loads(useful.group(1))
    list = res['data']['song']
    song_list = []
    song_name = []
    for i in list:
        AlbumID = i['AlbumID']
        hash = i['hash']
        song_url = 'https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19108914882384086649_1636392409637&hash=%s&dfid=4FIh0T1FDOGg2mC8cp3BaW48&appid=1014&mid=96e0f9a5a8a4d183f0034aa8ab27c2c9&platid=4&album_id=%s&_=1636392409638' % (hash, AlbumID)
        song_text = request_juger(song_url)
        song_useful = re.match(".*?({.*}).*", song_text, re.S)
        song_res = json.loads(song_useful.group(1))
        if (song_res['data']['play_url'] == ''):
            continue
        song_list.append(song_res['data']['play_url'])
        song_name.append(song_res['data']['song_name'])
    print(song_list)
    download_song(name, song_list, song_name)

if __name__ == "__main__":
    main()