Python Edition: Scraping

Scraping

Scraping is a convenient way to collect or download data from the web. This article uses the Beautiful Soup library.

Installing the modules

Install Requests:
pip install requests
Install Beautiful Soup 4:
pip install beautifulsoup4
Install the HTML5-compliant parser (html5lib):
pip install html5lib

Beautiful Soup usage example

# Specify the HTML as a string
html_str = "<html><body>...</body></html>"

# Import the library
from bs4 import BeautifulSoup

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(html_str, 'html5lib')

# Print the parsed tree, indented
print(soup.prettify())

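Examples of Beautiful Soup properties and methods

Note: the examples assume that e = soup.find('head') has been run.

Name	Description	Example	Result
e.parent	Parent element	e.parent.name	html
e.children	Iterator over the child nodes	list(e.children)[1].name	meta
e.(element name)	The first descendant element with that name	e.title.name	title
e.previous_sibling	The sibling node immediately before	e.title.previous_sibling	'\n'
e.next_sibling	The sibling node immediately after	e.title.next_sibling	'\n'
e.previous_siblings	All sibling nodes before the element	list(e.title.previous_siblings)[1].name	meta
e.next_siblings	All sibling nodes after the element	list(e.title.next_siblings)[1].name	link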
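
The navigation properties in the table above can be tried on a small document. The following is a minimal sketch, not the page the table was produced from: the sample HTML is an assumption made for illustration, and the sketch uses Python's built-in 'html.parser' instead of html5lib so that it runs with only beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page (an assumption, not taken from the article)
html_str = """<html>
<head>
<meta charset="utf-8">
<title>Sample page</title>
<link rel="stylesheet" href="style.css">
</head>
<body><p>Hello</p></body>
</html>"""

# 'html.parser' is swapped in here; the article itself uses 'html5lib'
soup = BeautifulSoup(html_str, 'html.parser')
e = soup.find('head')

print(e.parent.name)                            # parent element of <head>
print(list(e.children)[1].name)                 # second child node of <head>
print(e.title.name)                             # <title> reached by element name
print(repr(e.title.previous_sibling))           # node just before <title>
print(repr(e.title.next_sibling))               # node just after <title>
print(list(e.title.previous_siblings)[1].name)  # second node before <title>
print(list(e.title.next_siblings)[1].name)      # second node after <title>
```

The line breaks between tags are text nodes, so they count as children and siblings. That is why the examples use index [1]: index [0] is the '\n' text node next to the tag.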

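Examples of accessing attribute values and text in Beautiful Soup

Note: the examples assume that e = soup.find('head') has been run.

Name	Description	Example
e.attrs	Dictionary of the element's attributes	e.link.attrs['href']
e['(attribute name)']	Value of one attribute of the element	e.link['href']
e.string	Text of the element (None if the element has more than one child)	e.title.string
e.strings	Iterator over the text of all child and descendant nodes	list(e.strings)[1:3]
e.text	Text of the element including all descendants, concatenated	e.title.text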
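
The attribute and text accessors listed above can be checked the same way. This is a sketch under stated assumptions: the sample HTML is hypothetical, and the built-in 'html.parser' is used in place of html5lib so that only beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page (an assumption, not taken from the article)
html_str = """<html>
<head>
<meta charset="utf-8">
<title>Sample page</title>
<link rel="stylesheet" href="style.css">
</head>
<body><p>Hello</p></body>
</html>"""

soup = BeautifulSoup(html_str, 'html.parser')
e = soup.find('head')

print(e.link.attrs['href'])      # attribute read through the attrs dictionary
print(e.link['href'])            # the same attribute read by subscripting the tag
print(e.link.get('media'))       # get() returns None when the attribute is missing
print(e.title.string)            # text of <title>
print(list(e.strings)[1:3])      # second and third text nodes under <head>
print(list(e.stripped_strings))  # text nodes with whitespace-only entries dropped
print(e.title.text)              # concatenated text of <title>
```

Subscripting a tag with a missing attribute raises KeyError, so get() is the safer choice when an attribute may be absent. stripped_strings is often more practical than strings because it skips the '\n' nodes between tags.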

Parsing a page and downloading the images it displays

import os
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

target_url = 'URL of the site to download from'
save_dir = './image-download'


def download_images():  # Main routine
    html = requests.get(target_url).text
    urls = get_image_urls(html)
    go_download(urls)


def get_image_urls(html):  # Collect the image URLs from the HTML
    soup = BeautifulSoup(html, 'html5lib')
    res = []
    for img in soup.find_all('img'):
        src = img.get('src')
        if not src:  # Skip <img> tags that have no src attribute
            continue
        # Resolve relative paths against the page URL
        url = urllib.parse.urljoin(target_url, src)
        print('img.src=', url)
        res.append(url)
    return res


def go_download(urls):  # Download each URL in the list in turn
    os.makedirs(save_dir, exist_ok=True)
    for url in urls:
        fname = os.path.basename(url)
        save_file = save_dir + '/' + fname
        r = requests.get(url)
        with open(save_file, 'wb') as fp:
            fp.write(r.content)
            print("save:", save_file)
        time.sleep(1)  # Pause one second between requests to go easy on the server


if __name__ == '__main__':
    download_images()
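
The script names each saved file with os.path.basename(url), which keeps any query string (for example "dog.jpg?v=2") in the file name. The sketch below shows one possible way to resolve the src value and derive a cleaner file name using only the standard library. The helper names resolve_image_url and filename_from_url, the default name 'image', and the example.com URLs are all hypothetical and not part of the original script.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def resolve_image_url(page_url, src):
    """Turn a possibly relative src value into an absolute URL."""
    return urljoin(page_url, src)


def filename_from_url(url, default='image'):
    """Take the last path segment of the URL, ignoring ?query and #fragment."""
    name = posixpath.basename(urlsplit(url).path)
    # A path ending in '/' has no last segment, so fall back to the default name
    return name or default


page = 'https://example.com/gallery/index.html'  # hypothetical page URL
samples = ['img/cat.png', '/static/dog.jpg?v=2', 'https://cdn.example.com/a/b.gif']
for src in samples:
    url = resolve_image_url(page, src)
    print(url, '->', filename_from_url(url))
```

Two different URLs can still end in the same file name, so a real downloader might also add a counter or a hash to avoid overwriting files.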