Posted 2020-07-17 | Updated 2024-01-11 | Python / Crawler | 2 minutes read (About 265 words)

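Downloading High School Textbooks with Python

How to fetch high school textbooks with a Python crawler

Target page: http://www.100.com/article/309299.html (no longer available)

Inspecting the page shows that each target image has the following structure: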

<p style="text-align:center">
  <img id="99770" src="http://edu_img.bs2.100.com/b31c5369425c1c5203b2437.jpg" alt="bb09f673355c1c4ff92c54c.jpg" />
</p>
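To make the structure concrete, here is a minimal sketch that pulls the `src` out of exactly that snippet. It deliberately uses only the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages; the class name `CenteredImageParser` is a hypothetical helper, not part of the original script.

```python
from html.parser import HTMLParser

class CenteredImageParser(HTMLParser):
    """Collect the src of every <img> that sits inside a centered <p>."""

    def __init__(self):
        super().__init__()
        self.in_centered_p = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            # Ignore spaces so "text-align: center" also counts
            style = (attrs.get("style") or "").replace(" ", "")
            self.in_centered_p = style.startswith("text-align:center")
        elif tag == "img" and self.in_centered_p and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_centered_p = False

snippet = '''<p style="text-align:center">
  <img id="99770" src="http://edu_img.bs2.100.com/b31c5369425c1c5203b2437.jpg" alt="bb09f673355c1c4ff92c54c.jpg" />
</p>'''

parser = CenteredImageParser()
parser.feed(snippet)
print(parser.sources)
```

The crawler only needs these `src` values; everything else in the tag (`id`, `alt`) can be ignored.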

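from bs4 import BeautifulSoup
from urllib.request import urlopen
import os
import requests

## Function for downloading
def page_download(image_name, r):
    # Stream the response body to ./img/<image_name>; return True on success
    try:
        with open('./img/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        return True
    except OSError:
        print("missing")
        return False

## Starting request
url = "http://www.100.com/article/309299.html?&display=w&fd=wap"

html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

# Each textbook page image sits inside a centered <p> tag
Book_list = soup.find_all('p', {"style": "text-align:center"})

# Make sure the output folder exists before saving anything
os.makedirs('./img', exist_ok=True)

Num = 0
for link in Book_list:
    try:
        Num += 1
        img_url = link.find('img')['src']
        r = requests.get(img_url, stream=True)
        r.raise_for_status()  # do not save an error page as a .jpg
        image_name = str(Num) + '.jpg'
        if page_download(image_name, r):
            print('Saved %s' % image_name)
    except (TypeError, KeyError, requests.RequestException):
        # No <img> in this <p>, no src attribute, or the download failed
        print("Missing")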
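The download step above writes whatever the server returns. As a hedged sketch, the hypothetical helper `save_image` below checks the response before saving. It assumes the image server answers with status 200 and an `image/...` Content-Type, which the original page no longer lets us confirm. It accepts any object with `status_code`, `headers` and `iter_content`, such as a `requests` response.

```python
import os

def save_image(response, path, chunk_size=8192):
    """Write a streamed response to `path`; return False if it is not a usable image."""
    if response.status_code != 200:
        return False
    # Assumption: the server labels images with an image/* Content-Type
    content_type = response.headers.get("Content-Type", "")
    if not content_type.startswith("image/"):
        return False
    # Create the target folder (for example ./img) if it is missing
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    return True

# Possible use with requests (not executed here):
# r = requests.get(img_url, stream=True, timeout=10)
# if save_image(r, './img/1.jpg'):
#     print('Saved 1.jpg')
```

Returning a boolean lets the caller print "Saved" only when a file was actually written.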

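Merge into a single PDF

Install img2pdf (the -i option points pip at the Tsinghua PyPI mirror):

sudo pip3.7 install -i https://pypi.tuna.tsinghua.edu.cn/simple img2pdf

Then, from inside the img folder, merge the pages in numeric order:

img2pdf $(ls | sort -n) -o ../Biology1.pdf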
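The same merge can be sketched in Python. The numeric sort matters because a plain alphabetical sort would place `10.jpg` before `2.jpg`. Both `numeric_sorted` and `images_to_pdf` are hypothetical helper names. The second one relies on `img2pdf.convert`, which accepts a list of image paths and returns the PDF as bytes.

```python
import os
import re

def numeric_sorted(names):
    """Order names like '2.jpg' before '10.jpg' by their leading number."""
    def key(name):
        match = re.match(r"\d+", name)
        # Names without a leading number go to the end
        return int(match.group()) if match else float("inf")
    return sorted(names, key=key)

def images_to_pdf(folder, output):
    """Merge the numbered JPEGs in `folder` into one PDF at `output`."""
    import img2pdf  # third-party package, imported lazily
    names = numeric_sorted(n for n in os.listdir(folder) if n.endswith(".jpg"))
    paths = [os.path.join(folder, n) for n in names]
    with open(output, "wb") as f:
        f.write(img2pdf.convert(paths))

print(numeric_sorted(["10.jpg", "2.jpg", "1.jpg"]))

# Possible use (not executed here):
# images_to_pdf("./img", "Biology1.pdf")
```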


Done~

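Note: some pages differ slightly. For example, on the Biology Compulsory 2 page the style attribute has an extra trailing semicolon, so the match has to be changed to: Book_list = soup.find_all('p',{"style":"text-align:center;"})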
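Rather than editing the match for each page, one option is a small predicate that accepts both variants. This is a sketch, not the original author's approach, and `is_centered` is a hypothetical name. BeautifulSoup accepts a function as an attribute filter and calls it with the attribute value, or `None` when the attribute is absent.

```python
def is_centered(style):
    """Return True for 'text-align:center' with or without a trailing semicolon."""
    if style is None:  # the tag has no style attribute at all
        return False
    normalized = style.replace(" ", "").rstrip(";").lower()
    return normalized == "text-align:center"

print(is_centered("text-align:center"))
print(is_centered("text-align:center;"))
print(is_centered("text-align:left"))

# With BeautifulSoup, pass the predicate as the attribute filter:
# Book_list = soup.find_all('p', style=is_centered)
```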

enjoy~

https://karobben.github.io/2020/07/17/Blog/Python_cw_book/

Author: Karobben
