Scraping a Movie Theater's Screening List with Python (2)

2020.06.17 Programming, Python

I have been experimenting with scraping in Python to see whether I can build a page that lists the movies currently showing in real time.

Previous article

Scraping a Dynamic Site with Python (1)

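Last time I got as far as loading a dynamic site with Selenium and WebDriver. This time I cover three things: waiting until the required element is available, finding elements in the page, and setting an option that hides the browser.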

Waiting: WebDriverWait
Clicking an element to select a date: find_element_by_id
Hiding the browser launched by WebDriver

01 Waiting: WebDriverWait

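On a dynamic site, the browser may request data via Ajax or a similar mechanism and then wait for the server's response. In that case the script has to wait until the required data has actually been rendered.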

This uses the WebDriverWait class and the expected_conditions module that Selenium provides.

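I will try this on the screening list of Midland Square Cinema, a movie theater in Nagoya. The site does not prohibit crawlers, so I think it is fine as long as it is not abused.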

MIDLAND SQUARE CINEMA Internet Ticket Purchase

This is a ticket purchase page, so it shows the screening list for the selected date. The HTML is:

<h2 class="square_title1">ミッドランドスクエアシネマ<span>スクリーン1～7</span></h2>
    <section id="Day_schedule1"></section>
        <!--end section#Day_schedule -->

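and when a date is selected, the data is loaded into the <section id="Day_schedule1"></section> element. In the initial state the page loads the current day's data, so I check whether Day_title, the element that displays the date, has been loaded.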

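import lxml  # parser backend used by BeautifulSoup below
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import chromedriver_binary  # adds the bundled chromedriver to PATH

# Launch Chrome through WebDriver
driver = webdriver.Chrome()

# Ticket purchase page that lists the screenings
url = 'https://ticket.online-midland-sq-cinema.jp/schedule/ticket/0201/index.html'

# Load the page, then wait up to 10 seconds for the Day_title element
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'Day_title')))

# Parse the rendered HTML and pick out the movie title headings
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
found = soup.select('.MovieTitle1 h2')

# Write the matched elements to a file
with open('output.html', 'w', encoding='utf-8') as g:
  print(found, file=g)

# Close the browser
driver.quit()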

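By default, WebDriverWait calls the ExpectedCondition every 500 milliseconds. In this example it waits up to 10 seconds for the element Day_title to appear. On success it moves on immediately; if 10 seconds pass without success, until() raises a TimeoutException.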
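
The polling behaviour described above can be pictured with a small stand-in written in plain Python. This is only a simplified sketch, not Selenium's actual implementation: wait_until, WaitTimeout and the fake day_title_loaded condition are hypothetical names made up for illustration, and the real WebDriverWait raises selenium.common.exceptions.TimeoutException when it times out.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value on success; raises WaitTimeout once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)


# Fake "page" that only becomes ready on the third check.
calls = {"count": 0}


def day_title_loaded():
    calls["count"] += 1
    return "Day_title" if calls["count"] >= 3 else False


result = wait_until(day_title_loaded, timeout=2.0, poll_frequency=0.1)
print(result, calls["count"])  # Day_title 3
```

The point of the sketch is that the wait ends as soon as the condition succeeds, and that a timeout is reported as an exception rather than as an empty return value, so a script that wants to continue after a timeout needs a try/except around the wait.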

Documentation on how to specify wait conditions, and on simply waiting for a fixed amount of time, is available at the site below.

5. Waits - Selenium Python Bindings 2 documentation

02 Clicking an element to select a date: find_element_by_id

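On this ticket purchase page, clicking a date switches the list to that day's screenings. The dates, covering about a week starting from today, are rendered dynamically.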

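The dates are inserted into the initially empty element <div id="dayBtnBox"></div> as a row of elements whose ids embed the date, such as id="s0100_0201_20200617". The s0100_0201_ part may identify a server or something similar; either way, it is unlikely to change often. So the script waits for the date elements to appear, clicks the target date, and then fetches the data.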

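import lxml  # parser backend used by BeautifulSoup below
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import chromedriver_binary  # adds the bundled chromedriver to PATH
import datetime

# Launch Chrome through WebDriver
driver = webdriver.Chrome()

# Target date: two days from today
today = datetime.date.today()
date = today + datetime.timedelta(days=2)
t_date = 's0100_0201_' + date.strftime('%Y%m%d')  # id of the date button
d_date = str(date.month) + '/' + str(date.day)    # text shown in Day_title

url = 'https://ticket.online-midland-sq-cinema.jp/schedule/ticket/0201/index.html'

# Reusable wait object with a 10-second timeout
wait = WebDriverWait(driver, 10)

driver.get(url)

# Wait for the date button to be present, then look it up and click it
# (find_element_by_id(t_date) in Selenium 3; removed in Selenium 4.3)
wait.until(EC.presence_of_element_located((By.ID, t_date)))
driver.find_element(By.ID, t_date).click()

# Wait until Day_title shows the clicked date
wait.until(EC.text_to_be_present_in_element((By.ID, 'Day_title'), d_date))

# Parse the rendered HTML and pick out the movie title headings
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
found = soup.select('.MovieTitle1 h2')

# Write the matched elements to a file
with open('output.html', 'w', encoding='utf-8') as g:
  print(found, file=g)

# Close the browser
driver.quit()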

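The example selects the date two days from today. presence_of_element_located waits until the target date element is present in the DOM, and then the element is looked up by id and clicked. The lookup helper driver.find_element_by_id was removed in Selenium 4.3; driver.find_element(By.ID, ...) is the equivalent call and works in Selenium 3 as well.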

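After the click, the script has to wait until the corresponding data has been loaded, so I look for something on the page that changes. Here, the element with id=Day_title receives the clicked date in a format like 6/17, so the script waits for that text.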
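
As a worked example of the two strings involved, the following sketch builds the button id and the Day_title text for a fixed date. The function names date_button_id and day_title_label are hypothetical helpers introduced here for illustration; the prefix is simply the one observed on the page and could change.

```python
import datetime

ID_PREFIX = "s0100_0201_"  # prefix observed on the page; an assumption that it stays stable


def date_button_id(date):
    """Id of the clickable date element, e.g. s0100_0201_20200617."""
    return ID_PREFIX + date.strftime("%Y%m%d")


def day_title_label(date):
    """Text expected in Day_title: month/day without zero padding."""
    return f"{date.month}/{date.day}"


sample = datetime.date(2020, 6, 17)
print(date_button_id(sample))   # s0100_0201_20200617
print(day_title_label(sample))  # 6/17
```

Note that the id uses a zero-padded eight-digit date while the label does not pad, so January 5 becomes 1/5 rather than 01/05.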

It is not pretty code, but as a test it will do: I was able to retrieve the titles of the movies showing two days from now.

When until() succeeds, it returns the condition's result, which for presence_of_element_located is the located element (object), so

c_date = wait.until(EC.presence_of_element_located((By.ID, t_date)))
c_date.click()

this also works, without looking the element up again via driver.find_element_by_id.

03 Hiding the browser launched by WebDriver

As it stands, a browser window opens every time the script runs, so I hide it.

Using the Options class from selenium.webdriver.chrome.options and passing --headless hides the browser.

import lxml  # parser backend used by BeautifulSoup below
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import chromedriver_binary  # adds the bundled chromedriver to PATH
import datetime

# Run Chrome in headless mode so no window is shown
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Target date: two days from today
today = datetime.date.today()
date = today + datetime.timedelta(days=2)
t_date = 's0100_0201_' + date.strftime('%Y%m%d')  # id of the date button
d_date = str(date.month) + '/' + str(date.day)    # text shown in Day_title

url = 'https://ticket.online-midland-sq-cinema.jp/schedule/ticket/0201/index.html'

# Reusable wait object with a 10-second timeout
wait = WebDriverWait(driver, 10)

driver.get(url)

# until() returns the located element, which can be clicked directly
c_date = wait.until(EC.presence_of_element_located((By.ID, t_date)))
c_date.click()

# Wait until Day_title shows the clicked date
wait.until(EC.text_to_be_present_in_element((By.ID, 'Day_title'), d_date))

# Parse the rendered HTML and pick out the movie title headings
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
found = soup.select('.MovieTitle1 h2')

# Write the matched elements to a file
with open('output.html', 'w', encoding='utf-8') as g:
  print(found, file=g)

# Close the browser
driver.quit()

After that, it looks like things will work out once I extract the data I need, format it, and stand up a server.
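
As a rough idea of what the formatting step might look like, here is a minimal sketch that turns a list of title strings into an HTML list. render_title_list and the sample titles are hypothetical placeholders, not data from the site; in the script above the strings would presumably come from something like [h2.get_text(strip=True) for h2 in found].

```python
import html


def render_title_list(titles):
    """Return an HTML <ul> fragment listing the given movie titles."""
    # Escape each title so characters such as & and < stay valid HTML.
    items = "\n".join(f"  <li>{html.escape(t)}</li>" for t in titles)
    return f"<ul>\n{items}\n</ul>"


# Placeholder strings standing in for scraped titles.
sample_titles = ["Title A", "Title B & C"]
print(render_title_list(sample_titles))
```

Writing plain strings like this, rather than the printed list of tag objects, would give the eventual page something it can embed directly.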

