Scraping Niconico's Blomaga

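I scraped a Blomaga (the blog/newsletter service on Niconico Douga), so here are my notes. I used Python with Beautiful Soup and Selenium. I did not run the browser headless, so that I could watch it and confirm it was actually working.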

It prints various warnings, but it works, so that is good enough for now.

Launching the browser, logging in, and getting the HTML

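from selenium import webdriver

# Note: find_element_by_xpath is the Selenium 3 API; Selenium 4 deprecates it
# in favor of find_element(By.XPATH, ...).
url = "https://ch.nicovideo.jp/mentalist/blomaga/arxxxxxx"
driver = webdriver.Chrome("path/to/chromedriver.exe")
driver.get(url)

# Log in
mail = "mail"
password = "password"
# Click the login link in the common header
driver.find_element_by_xpath(
    '//*[@id="CommonHeader"]/div/div/div/div[2]/a').click()
# Fill in the credentials and submit the form
driver.find_element_by_xpath('//*[@id="input__mailtel"]').send_keys(mail)
driver.find_element_by_xpath(
    '//*[@id="input__password"]').send_keys(password)
driver.find_element_by_xpath('//*[@id="login__submit"]').click()

# HTML of the page the browser is currently showing
res = driver.page_source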

Parsing and stripping out the parts we don't need

import bs4

soup = bs4.BeautifulSoup(res, "html.parser")

# Remove every <script> tag (the walrus operator needs Python 3.8+)
while (x := soup.find("script")) is not None:
    x.decompose()
# Remove the list of other articles
soup.find("div", class_="other_article_list_area is_official").decompose()
# Remove every <noscript> tag
while (x := soup.find("noscript")) is not None:
    x.decompose()
# Remove the site header, the "back to top" parts, and one <aside> block
soup.find("div", class_="c-siteCommonHeaderUnit").decompose()
soup.find("div", id="js_to_top_start").decompose()
soup.find("div", class_="btn_totop").decompose()
soup.find("aside", id="pc_ch_blomaga_eachtop_728x90_south").decompose()
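One fragile point in the cleanup above: `soup.find(...)` returns `None` when nothing matches, so `.decompose()` then raises `AttributeError` on any page that lacks one of those elements. A minimal sketch of a more forgiving variant follows. The helper name `remove_all` and the `UNWANTED` list are made up for illustration, and the CSS selectors are my translation of the `find()` calls (note that `div.other_article_list_area.is_official` matches any `div` carrying both classes, which is slightly broader than the exact class string used above).

```python
from typing import Iterable


def remove_all(soup, selectors: Iterable[str]) -> int:
    """Decompose every element matching any of the given CSS selectors.

    `soup` is assumed to be a bs4.BeautifulSoup object; anything with a
    select() method whose results have decompose() will work.
    Selectors that match nothing are simply skipped.
    Returns the number of elements removed.
    """
    removed = 0
    for selector in selectors:
        for tag in soup.select(selector):
            tag.decompose()
            removed += 1
    return removed


# Hypothetical selector list mirroring the find() calls in this post.
UNWANTED = [
    "script",
    "noscript",
    "div.other_article_list_area.is_official",
    "div.c-siteCommonHeaderUnit",
    "div#js_to_top_start",
    "div.btn_totop",
    "aside#pc_ch_blomaga_eachtop_728x90_south",
]
```

With the `soup` object built earlier, this would be called as `remove_all(soup, UNWANTED)`.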

Saving

file = make_filename(driver)
# The html/ directory must already exist; open() will not create it.
with open(f"html/{file}.html", "w", encoding="utf-8") as f:
    f.write(str(soup))

Here, make_filename is defined roughly like this:

def rm_colon(dt: str) -> str:
    # Make a timestamp string filename-safe: spaces become "_", colons are dropped.
    return dt.replace(" ", "_").replace(":", "")


def url2ar(url: str) -> str:
    # The last path segment of the URL (e.g. "arxxxxxx").
    return url.split("/")[-1]


def make_filename(driver):
    # Article ID, taken from the href of a link in the title area
    ar = url2ar(driver.find_element_by_xpath(
        '//*[@id="article_blog_title_area"]/div[1]/div[2]/div/div[1]/div/ul/li[2]/a'
    ).get_attribute("href"))
    # Text of a <span> in the title area (presumably the post date and time)
    t = rm_colon(driver.find_element_by_xpath(
        '//*[@id="article_blog_title_area"]/div[1]/div[2]/div/div[1]/span'
    ).text)
    title = driver.find_element_by_xpath('//*[@id="article_blog_title"]').text
    # Replace characters that are not allowed in Windows file names
    title = (
        title
        .replace("/", "_")
        .replace(":", "_")
        .replace("?", "_")
        .replace('"', "_")
        .replace("*", "_")
        .replace("<", "_")
        .replace(">", "_")
        .replace("|", "_")
        .replace("\\", "_")
    )
    return f"{ar}_{t}_{title}"
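The chain of `.replace()` calls covers the characters Windows does not allow in file names (`\ / : * ? " < > |`). As a rough sketch, the same substitution can be written with a single regular expression. The name `sanitize_title` and its `replacement` parameter are hypothetical, not part of the original script.

```python
import re

# Characters that Windows does not allow in file names: \ / : * ? " < > |
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def sanitize_title(title: str, replacement: str = "_") -> str:
    """Replace every forbidden character in `title` with `replacement`."""
    return _FORBIDDEN.sub(replacement, title)
```

Inside make_filename, the whole `title = (...)` expression could then become `title = sanitize_title(title)`.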

I also added code to walk back from the latest article to older ones:

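while True:
    # ... parse and save the current article here (the steps above) ...
    try:
        # Click the second "article_around" element to move to the older article
        driver.find_elements_by_class_name("article_around")[1].click()
    except IndexError:
        # No second element: there is no older article, so stop
        break
    except Exception as e:
        print(e)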
There are probably nicer ways to write a lot of this.

When you actually run it, don't forget to add time.sleep between page loads.
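To show where the sleep might go, here is a minimal sketch of the crawl loop with the delay built in. Everything here is an assumption for illustration: the function name `crawl_backwards`, the two callbacks, the 3-second default delay, and the `max_pages` safety cap are not from the original script.

```python
import time
from typing import Callable


def crawl_backwards(
    save_current: Callable[[], None],
    go_to_older: Callable[[], None],
    delay_sec: float = 3.0,
    max_pages: int = 1000,
) -> int:
    """Save the current page, move to the older one, and repeat.

    `go_to_older` is expected to raise IndexError when there is no older
    article, which ends the loop. Returns the number of pages saved.
    """
    saved = 0
    for _ in range(max_pages):
        save_current()
        saved += 1
        try:
            go_to_older()
        except IndexError:
            break
        # Pause before touching the next page
        time.sleep(delay_sec)
    return saved
```

With the Selenium code from this post, `go_to_older` would be something like `lambda: driver.find_elements_by_class_name("article_around")[1].click()`, and `save_current` a function that runs the parse and save steps.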