python spider learning 本文章记录python爬虫的学习内容,相当于学习python的知识点。
差不多就学到能爬图片,能爬视频,爬歌吧,简单入个门,也就是感兴趣然后来学了点,还是不错,比二进制收益来的快多了,哈哈哈。后面还是补一下hook,和反调试的文章吧,整理太花时间了,但还是必须整理。。。
request库用法 一行代码 Get 请求
r = requests.get('https://api.github.com/events')一行代码 Post 请求
r = requests.post('https://httpbin.org/post', data = {'key':'value'})获取服务器响应文本内容
import requests
r = requests.get('https://api.github.com/events')
r.text假装自己是浏览器
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)获取服务器响应文本内容
import requests
r = requests.get('https://api.github.com/events')
r.text获取响应码
r = requests.get('https://httpbin.org/get')
r.status_codere库 正则表达式 用来提取很多字符串中,我们需要的信息。
re.S,用来针对换行这种。
贪婪匹配与非贪婪匹配,就是.和. ?
import re
s="s_#aa&"
res = re.findall('.*',s,re.S)
print(res)
res = re.findall('.*?',s,re.S)
print(res)
# ['s_#aa&', '']
# ['', 's', '', '_', '', '#', '', 'a', '', 'a', '', '&', '']*?是一个数一个数都要匹配,.*?则是匹配
.*?这个用的比较多匹配任意字符串,我只能说6逼,这个几乎你给个头给给尾,直接就能帮你匹配到想要的字符串。
下面来个例子,也是后面有个例子会用到,直接就明白了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 import  res=''' <li>      <div class="list_num ">21.</div>     <div class="pic"><a href="http://product.dangdang.com/28541936.html" target="_blank"><img src="http://img3m6.ddimg.cn/38/25/28541936-1_l_9.jpg" alt="男孩的学习力"  title="男孩的学习力"/></a></div>     <div class="name"><a href="http://product.dangdang.com/28541936.html" target="_blank" title="男孩的学习力">男孩的学习力</a></div> ''' res = re.findall('<li>.*?>(\d+).*?' ,s,re.S) print(res) res = re.findall('<img src="(.*?)"' ,s,re.S) print(res) res = re.findall('title="(.*?)".*?class="name"' ,s,re.S) print(res) 
第一个爬虫 爬取当当网的书籍数据。
主要分3块
request用get请求网站,接收到服务器的返回数据 
正则表达式处理数据,获得我们想要的。 
将有效数据整理,然后存入文件。 
 
get请求
1 2 3 4 5 6 7 8 9 def  request_juger (url ):    try :         response = requests.get(url)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  
正则表达式搜索
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 import  res=''' <li>      <div class="list_num ">22.</div>        <div class="pic"><a href="http://product.dangdang.com/28541936.html" target="_blank"><img src="http://img3m6.ddimg.cn/38/25/28541936-1_l_9.jpg" alt="男孩的学习力"  title="男孩的学习力"/></a></div>         <div class="name"><a href="http://product.dangdang.com/28541936.html" target="_blank" title="男孩的学习力">男孩的学习力</a></div>         <div class="star"><span class="level"><span style="width: 97.2%;"></span></span><a href="http://product.dangdang.com/28541936.html?point=comment_point" target="_blank">123849条评论</a><span class="tuijian">100%推荐</span></div>         <div class="publisher_info">[日]<a href="http://search.dangdang.com/?key=富永雄辅" title="[日]富永雄辅 著,吴一红 译,酷威文化 出品" target="_blank">富永雄辅</a> 著,<a href="http://search.dangdang.com/?key=吴一红" title="[日]富永雄辅 著,吴一红 译,酷威文化 出品" target="_blank">吴一红</a> 译,<a href="http://search.dangdang.com/?key=酷威文化" title="[日]富永雄辅 著,吴一红 译,酷威文化 出品" target="_blank">酷威文化</a> 出品</div>         <div class="publisher_info"><span>2020-06-01</span> <a href="http://search.dangdang.com/?key=四川文艺出版社" target="_blank">四川文艺出版社</a></div>                 <div class="biaosheng">五星评分:<span>91796次</span></div>                                 <div class="price">                 <p><span class="price_n">¥17.90</span>                         <span class="price_r">¥39.80</span>(<span class="price_s">4.5折</span>)                     </p>                     <p class="price_e">电子书:<span class="price_n">¥7.99</span></p>                 <div class="buy_button">                           <a ddname="加入购物车" name="" href="javascript:AddToShoppingCart('28541936');" class="listbtn_buy">加入购物车</a>                                                  <a name="" href="http://product.dangdang.com/1901212680.html" class="listbtn_buydz" target="_blank">购买电子书</a>                         <a ddname="加入收藏" id="addto_favorlist_28541936" name="" href="javascript:showMsgBox('addto_favorlist_28541936',encodeURIComponent('28541936&platform=3'), 'http://myhome.dangdang.com/addFavoritepop');" class="listbtn_collect">收藏</a>               </div> ''' cmp=re.compile ('<li>.*?>(\d+).*?</div>.*?<img src="(.*?)".*?title="(.*?)".*?class="name".*?class="tuijian">(.*?)</span>.*?target="_blank">(.*?)</a>.*?<div class="biaosheng">(.*?)<span>.*?"price_n">(.*?)</span>' ,re.S) res = re.findall(cmp,s) print(res) 
存入文件
def write_to_file(t):
    with open('book.txt', 'a', encoding='UTF-8') as f:
        f.write(json.dumps(t, ensure_ascii=False) + '\n')可以看到json.dumps()
json.dumps将一个Python数据结构转换为JSON
import json
data = {
    'name' : 'myname',
    'age' : 100,
}
json_str = json.dumps(data)ensure_ascii=True:默认输出ASCLL码,如果把这个该成False,就可以输出中文。
完整代码,还是没太搞懂yield那个,为什么返回了全部的,而不是一组。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 import  reimport  requestsimport  jsondef  request_juger (url ):    try :         response = requests.get(url)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  find_useful (html ):    cmp = re.compile (         '<li>.*?>(\d+).*?</div>.*?<img src="(.*?)".*?title="(.*?)".*?class="name".*?class="tuijian">(.*?)</span>.*?target="_blank">(.*?)</a>.*?<div class="biaosheng">(.*?)<span>.*?"price_n">(.*?)</span>' ,         re.S)     texts = re.findall(cmp, html) 	     for  text in  texts:         yield  {             'range' : text[0 ],             'image' : text[1 ],             'title' : text[2 ],             'recommend' : text[3 ],             'author' : text[4 ],             'times' : text[5 ],             'price' : text[6 ]         } def  write_to_file (t ):    with  open ('book.txt' , 'a' , encoding='UTF-8' ) as  f:         f.write(json.dumps(t, ensure_ascii=False ) + '\n' ) def  main (i ):    url="http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-" +str (i)     texts=request_juger(url)     text=find_useful(texts)     for  i in  text:         write_to_file(i) if  __name__ == "__main__" :    for  i in  range (1 , 26 ):         main(i) 
BeautifulSoup python的一个库,可以用来达到re库的一些效果,得到某一字符串,标题,超链接等等。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 from  bs4 import  BeautifulSouphtml_doc = """  <html><head><title>学习python的正确姿势</title></head> <body> <p class="title"><b>小帅b的故事</b></p> <p class="story">有一天,小帅b想给大家讲两个笑话 <a href="http://example.com/1" class="sister" id="link1">一个笑话长</a>, <a href="http://example.com/2" class="sister" id="link2">一个笑话短</a> , 他问大家,想听长的还是短的?</p> <p class="story">...</p> """ soup = BeautifulSoup(html_doc,'lxml' ) print(soup.title.string) print(soup.p.string) print(soup.title.parent.name) print(soup.a) print(soup.find_all('a' )) print(soup.find(id ="link2" )) print(soup.get_text()) 
第二个爬虫 爬取豆瓣电影前250的电影信息,写入excel。不同于上个爬虫的就是使用的BeautifulSoup库,
爬取目标: url=https://movie.douban.com/top250?start=0&filter=,改变的只有strat=25*页数。 
html基本信息
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 <ol  class ="grid_view" >  <li >              <div  class ="item" >                  <div  class ="pic" >                      <em  class ="" > 1</em >                      <a  href ="https://movie.douban.com/subject/1292052/" >                          <img  width ="100"  alt ="肖申克的救赎"  src ="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp"  class ="" >                      </a >                  </div >                  <div  class ="info" >                      <div  class ="hd" >                          <a  href ="https://movie.douban.com/subject/1292052/"  class ="" >                              <span  class ="title" > 肖申克的救赎</span >                                      <span  class ="title" >   /  The Shawshank Redemption</span >                                  <span  class ="other" >   /  月黑高飞(港)  /  刺激1995(台)</span >                          </a >                              <span  class ="playable" > [可播放]</span >                      </div >                      <div  class ="bd" >                          <p  class ="" >                              导演: 弗兰克·德拉邦特 Frank Darabont      主演: 蒂姆·罗宾斯 Tim Robbins /...<br >                              1994  /  美国  /  犯罪 剧情                         </p >                                                   <div  class ="star" >                                  <span  class ="rating5-t" > </span >                                  <span  class ="rating_num"  property ="v:average" > 9.7</span >                                  <span  property ="v:best"  content ="10.0" > </span >                                  <span > 2477393人评价</span >                          </div >                              <p  class ="quote" >                                  <span  class ="inq" > 希望让人自由。</span >                              </p >                      </div >                  </div >              </div >          </li >  
可以发现都在class=”grid_view”里面,我们需要得到序号,图片url,名称,作者评分,和短评。然后写入exlsx
先介绍个xlwt库,这个库就是可以将数据写入excel的库。
1 2 3 4 5 6 7 8 9 import  xlwtworkbook = xlwt.Workbook(encoding = 'utf-8' ) worksheet = workbook.add_sheet('My Worksheet' ) worksheet.write(1 ,0 , label = 'this is test' ) 
完整python代码
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 import  requestsimport  jsonfrom  bs4 import  BeautifulSoupimport  xlwtworkbook = xlwt.Workbook(encoding = 'utf-8' ,style_compression=0 ) worksheet = workbook.add_sheet('豆瓣电影Top250' , cell_overwrite_ok=True ) worksheet.write(0 , 0 , '名称' ) worksheet.write(0 , 1 , '图片' ) worksheet.write(0 , 2 , '排名' ) worksheet.write(0 , 3 , '评分' ) worksheet.write(0 , 4 , '作者' ) worksheet.write(0 , 5 , '简介' ) n=1  def  request_juger (url ):    header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'              'Chrome/95.0.4638.69 Safari/537.36' ,     }     try :         response = requests.get(url=url,headers=header)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  find_useful_xlwt (html ):    soup = BeautifulSoup(html, 'lxml' )     list  = soup.find(class_='grid_view' ).find_all('li' )          for  item in  list :         item_index=item.find(class_='' ).string         item_name=item.find(class_='title' ).string         item_picture=item.find('a' ).find('img' ).get('src' )         item_author=item.find('p' ).text.replace("\n" ,'' ).replace(" " ,'' )[0 :20 ]         item_score=item.find(class_='rating_num' ).string         if  item.find(class_="inq" ) is  not  None :             item_intr=item.find(class_="inq" ).string                  global  n         worksheet.write(n, 0 , item_name)         worksheet.write(n, 1 , item_picture)         worksheet.write(n, 2 , item_index)         worksheet.write(n, 3 , item_score)         worksheet.write(n, 4 , item_author)         worksheet.write(n, 5 , item_intr)         n=n+1  def  main (i ):    url='https://movie.douban.com/top250?start='  + str (i * 25 ) + '&filter='      html=request_juger(url)     find_useful_xlwt(html) if  __name__ == "__main__" :    for  i in  range (0 , 10 ):         main(i) workbook.save(u'豆瓣最受欢迎的250部电影.xls' ) 
多线程 一个菜鸟上的例子,创建了一个线程类,继承threading.Thread
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 import  threadingimport  timeexitFlag = 0  class  myThread  (threading.Thread ):    def  __init__ (self, threadID, name, counter ):         threading.Thread.__init__(self)         self.threadID = threadID         self.name = name         self.counter = counter          def  run (self ):         print  ("开始线程:"  + self.name)         print_time(self.name, self.counter, 10 )         print  ("退出线程:"  + self.name) def  print_time (threadName, delay, counter ):    while  counter:         if  exitFlag:             threadName.exit()         time.sleep(delay)         print  ("%s: %s"  % (threadName, time.ctime(time.time())))         counter -= 1  thread1 = myThread(1 , "Thread-1" , 1 ) thread2 = myThread(2 , "Thread-2" , 2 ) thread1.start() thread2.start() thread1.join() thread2.join() print  ("退出主线程" )
线程锁,用来保证公共数据在某一时间只会被一个线程使用。
1 2 3 4 5 6 7 8 9 10     def  run (self ):         print  ("开启线程: "  + self.name)                  threadLock.acquire()         print_time(self.name, self.counter, 3 )                  threadLock.release() threadLock = threading.Lock() 
第三个爬虫 爬取某一网站上的图片,嘿嘿嘿。
一个用来判断是否可以下载图片的小步骤。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 import  requestsheaders = {'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,     "Referer" :"https://s2.baozimh.com"  } filename='111.jpg'  with  open (filename, 'wb' ) as  f:    img = requests.get('https://s2.baozimh.com/scomic/douluodalu-fengxuandongman/0/9-htxl/2.jpg' , headers=headers).content     f.write(img) 
下面的一个项目中的例子,自己也写了写,真不错。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 import  requestsimport  jsonfrom  bs4 import  BeautifulSoupimport  xlwtimport  osheader={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '              'Chrome/95.0.4638.69 Safari/537.36' ,         "Referer" : "https://www.mzitu.com/all/"      } def  request_juger (url ):    header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'              'Chrome/95.0.4638.69 Safari/537.36' ,         "Referer" : "https://www.mzitu.com/all/"      }     try :         response = requests.get(url=url,headers=header)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  get_page_urls ():    base_url = 'https://www.mzitu.com/page/'      for  i in  range (4 , 5 ):         url = base_url + str (i)         html=request_juger(url)         soup = BeautifulSoup(html, 'lxml' )         list  = soup.find(class_='postlist' ).find_all('li' )         meizi_url=[]         for  i in  list :             meizi_url.append(i.find('span' ).find('a' ).get('href' ))     return  meizi_url def  download_Pic (title, image_list ):         os.mkdir(title)     j = 1           for  item in  image_list:         filename = '%s/%s.jpg'  % (title, str (j))         print('downloading....%s : NO.%s'  % (title, str (j)))         with  open (filename, 'wb' ) as  f:             img = requests.get(url=item,headers=header).content             f.write(img)         j += 1  def  download_images (url ):    pages=[]     html = request_juger(url)     soup = BeautifulSoup(html, 'lxml' )     title=soup.find('h2' ).string     page=soup.find(class_='pagenavi' ).find_all('a' )[-2 ].find('span' ).string     image_list=[]     for  i in  range (1 ,int (page)):         html = request_juger(url + '/%s'  % i)         soup = BeautifulSoup(html, 'lxml' )         img_url = soup.find('img' ).get('src' )         image_list.append(img_url)     print(image_list)     download_Pic(title, image_list) def  main ():    urls=get_page_urls()     for  url in  urls:         download_images(url) if  __name__ == "__main__" :    main() 
照猫画虎,去爬了点漫画,斗罗大陆。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 import  requestsimport  jsonfrom  bs4 import  BeautifulSoupimport  xlwtimport  osheader={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,     "Referer" :"https://s2.baozimh.com"      } def  request_juger (url ):    header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,         "Referer" :"https://s2.baozimh.com"      }     try :         response = requests.get(url=url,headers=header)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  get_page_urls ():    base_url = 'https://cn.webmota.com/comic/chapter/douluodalu-fengxuandongman/0_'      manhua_url = []     for  i in  range (0 , 10 ):         url = base_url + str (i) + '.html'          manhua_url.append(url)     return  manhua_url def  download_Pic (title, image_list ):         os.mkdir(title)     j = 1           for  item in  image_list:         filename = '%s/%s.jpg'  % (title, str (j))         print('downloading....%s : NO.%s'  % (title, str (j)))         with  open (filename, 'wb' ) as  f:             img = requests.get(url=item,headers=header).content             f.write(img)         j += 1  def  download_images (url ):    html = request_juger(url)     soup = BeautifulSoup(html, 'lxml' )     title=soup.find('head' ).find('title' ).string     list =[]     page=soup.find(class_='comic-text__amp' ).find('em' ).string.replace('\n' ,'' ).replace(' ' ,'' )[-1 ]     for  i in  range (int (page)):         list .append(soup.find_all('img' )[i].get('src' ))     download_Pic(title,list ) def  main ():    urls=get_page_urls()     print(urls)     for  url in  urls:         download_images(url) if  __name__ == "__main__" :    main() 
若要多线程爬,直接导入相应线程库,弄成下面这种
1 2 3 4 with  concurrent.futures.ProcessPoolExecutor(max_workers=5 ) as  exector:    for  url in  urls:         exector.submit(download_images, url) 
selenium 本来想直接上手爬视频的结果发现链接里面根本没有mp4,还是得回来看selenium获取源代码。
首先pip安装
pip install selenium接着配谷歌driver环境,去官网下载后,然后配环境变量就行了,然后关机重启一次就能用了。
测试
1 2 3 4 5 driver = webdriver.Chrome() driver.get("https://new.iskcd.com/20211106/HhseFdeO/index.m3u8" ) text=driver.page_source print(text) 
第四个爬虫 开始爬视频了,也是自己摸索了很久,视频和图片的爬取还是有非常大的不同的,很多网站视频都采用了m3m8的方式来把mp4文件变成一个个的ts文件,爬取思路就是先得到index.m3m8,然后访问里面的链接,得到返回值,然后提取所有ts文件的url,下载下来,然后用ffmpeg来组装成mp4。
代码,单线程,比较慢。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 import  requestsfrom  bs4 import  BeautifulSoupimport  reimport  subprocessnum=0  header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,     } def  request_juger (url ):    header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,     }     try :         response = requests.get(url=url,headers=header)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  get_page_urls ():    base_url = 'https://www.great-elec.com/video/924-0-'      manhua_url = []     for  i in  range (7 ,11 ):         url = base_url + str (i) + '.html'          print(url)         manhua_url.append(url)     return  manhua_url def  download_mp4 (title, ts_urls ):    global  num     header = {         'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                        'Chrome/95.0.4638.69 Safari/537.36' ,     }          print(num)     for  ts_url in  ts_urls:         name=ts_url.split('/' )[-1 ]         res = requests.get(url=ts_url, headers=header)         with  open ("D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{}.txt" .format (num), "a+" ) as  f:             f.write("file '{}'\n" .format (name))         with  open ('D:\\learning record\\学术报告\\python_spider\\爬虫\\mp4\\{}' .format (name), 'wb' ) as  f:             f.write(res.content)     cmd='ffmpeg -f concat -i "D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{0}.txt". -c copy "D:/learning record/学术报告/python_spider/爬虫/mp4/vidoe/output{1}.mp4"' .format (num,num)     print(cmd)     subprocess.Popen(cmd,shell=True )     num+=1  def  download (url ):    title='aa'      html = request_juger(url)     soup = BeautifulSoup(html, 'lxml' )     a=soup.find(class_='box' ).find('p' ).find('script' )     cmp1 = re.compile ('<script>.*?now="(.*?)";.*?'  ,re.S)     texts = re.findall(cmp1,html)     m3m8_url=texts[0 ]     print(m3m8_url[:40 ])     m3u8_html = requests.get(url=m3m8_url, headers=header).text     print(m3u8_html[-22 :-1 ])     new_m3m8_url=m3m8_url[:40 ]+m3u8_html[-22 :-1 ]     print(new_m3m8_url)     new_m3m8_html = requests.get(url=new_m3m8_url, headers=header).text     ts_urls = re.findall(re.compile (',\n(.*?.ts)\n#' ), new_m3m8_html)     download_mp4(title,ts_urls) def  main ():    urls=get_page_urls()     for  url in  urls:         download(url) if  __name__ == "__main__" :    main() 
后面弄多线程的时候发现实际上还存在问题,就是拼接new_m3m8_url,可能会出现未能拼接成一个有效的url,当然可以采用split先分割,然后在拼接,因为/是固定的嘛,但是太懒了就没弄了。
多线程爬,实际上也不快
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 import  threadingimport  requestsfrom  bs4 import  BeautifulSoupimport  reimport  subprocessexitFlag = 0  class  myThread (threading.Thread ):    def  __init__ (self, star, name, end ):         threading.Thread.__init__(self)         self.star = star         self.name = name         self.end = end          def  run (self ):         print  ("开始线程:"  + self.name)         func(self.star,self.end)         print  ("退出线程:"  + self.name) header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,     } def  request_juger (url ):    header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,     }     try :         response = requests.get(url=url,headers=header)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  get_page_urls (star,end ):    base_url = 'https://www.great-elec.com/video/924-0-'      manhua_url = []     for  i in  range (star,end):         url = base_url + str (i) + '.html'          print(url)         manhua_url.append(url)     return  manhua_url def  download_mp4 (title, ts_urls,num ):    header = {         'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                        'Chrome/95.0.4638.69 Safari/537.36' ,     }          for  ts_url in  ts_urls:         name=ts_url.split('/' )[-1 ]         res = requests.get(url=ts_url, headers=header)         with  open ("D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{}.txt" .format (num), "a+" ) as  f:             f.write("file '{}'\n" .format (name))         with  open ('D:\\learning record\\学术报告\\python_spider\\爬虫\\mp4\\{}' .format (name), 'wb' ) as  f:             f.write(res.content)     cmd='ffmpeg -f concat -i "D:/learning record/学术报告/python_spider/爬虫/mp4/file_list{0}.txt". -c copy "D:/learning record/学术报告/python_spider/爬虫/mp4/vidoe/output{1}.mp4"' .format (num,num)     print(cmd)     subprocess.Popen(cmd,shell=True )     num+=1  def  download (url,num ):    title='aa'      html = request_juger(url)     soup = BeautifulSoup(html, 'lxml' )     a=soup.find(class_='box' ).find('p' ).find('script' )     cmp1 = re.compile ('<script>.*?now="(.*?)";.*?'  ,re.S)     texts = re.findall(cmp1,html)     m3m8_url=texts[0 ]     print(m3m8_url[:43 ])     m3u8_html = requests.get(url=m3m8_url, headers=header).text     print(m3u8_html[-22 :-1 ])     new_m3m8_url=m3m8_url[:43 ]+m3u8_html[-22 :-1 ]     print(new_m3m8_url)     new_m3m8_html = requests.get(url=new_m3m8_url, headers=header).text     ts_urls = re.findall(re.compile (',\n(.*?.ts)\n#' ), new_m3m8_html)     print(ts_urls)     download_mp4(title,ts_urls,num) def  func (star ,end ):    urls=get_page_urls(star,end)     for  url in  urls:         download(url,star) if  __name__ == "__main__" :         thread1 = myThread(13 , "Thread-1" , 14 )     thread2 = myThread(16 , "Thread-2" , 18 )          thread1.start()     thread2.start()     thread1.join()     thread2.join()     print  ("退出主线程" ) 
第五个爬虫 爬取歌曲,实际上也是找链接,倒是学了json的知识点,就相当于多维字典数组,挺方便。
实现了输入任意名称,就能爬取显示的所有歌曲,但是试听的只能有一部分。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 import  requestsimport  reimport  osimport  jsondef  download_song (title, song_list,song_name ):         os.mkdir(title)     j = 0      for  item in  song_list:         filename = '%s/%s.mp3'  % (title, song_name[j])         print('downloading....%s : NO.%s'  % (title, song_name[j]))         with  open (filename, 'wb' ) as  f:             mp3 = requests.get(item).content             f.write(mp3)         j += 1  def  request_juger (url ):    header={         'User-Agent' :'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '                  'Chrome/95.0.4638.69 Safari/537.36' ,         "Referer" : "https://www.kugou.com/"      }     try :         response = requests.get(url=url,headers=header)         if  response.status_code == 200 :             return  response.text     except  requests.RequestException as  e:         print(e)         return  None  def  main ():    name=input ()     base_url='https://searchrecommend.kugou.com/get/complex?callback=jQuery112403385798993366811_1636390150231&word=%s&_=1636390150232' %(name)     text=request_juger(base_url)     useful=re.match(".*?({.*}).*" , text, re.S)     res = json.loads(useful.group(1 ))     list  = res['data' ]['song' ]     song_list=[]     song_name=[]     for  i in  list :         AlbumID=i['AlbumID' ]         hash =i['hash' ]         song_url='https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19108914882384086649_1636392409637&hash=%s&dfid=4FIh0T1FDOGg2mC8cp3BaW48&appid=1014&mid=96e0f9a5a8a4d183f0034aa8ab27c2c9&platid=4&album_id=%s&_=1636392409638' %(hash ,AlbumID)                  song_text = request_juger(song_url)         song_useful = re.match(".*?({.*}).*" , song_text, re.S)         song_res = json.loads(song_useful.group(1 ))         if (song_res['data' ]['play_url' ]=='' ):             continue          song_list.append(song_res['data' ]['play_url' ])         song_name.append(song_res['data' ]['song_name' ])     print(song_list)     download_song(name,song_list,song_name) if  __name__ == "__main__" :    main()