Social: scraping YouTube video titles and playlists with Python
Introduction
A recurring problem I've run into when reporting on YouTube channel data in Data Studio is that the Data Studio connector leaves out many of the useful fields that are in YouTube Analytics. One of the most notable missing fields is the playlist name.
For a small channel with only a few playlists, this isn't an insurmountable problem. For a channel with scores of playlists and hundreds of videos, it is more of a challenge.
This is the Python code I wrote to accomplish two things:
- Pull a list of all YouTube video titles and URLs
- Cross-reference that full video list against a few pre-identified playlists and assign each video to its playlist
The code
from selenium import webdriver
import pandas as pd
import time
from selenium.webdriver.common.keys import Keys
BROWSER = webdriver.Chrome(executable_path='{Input your path}')
print('Going to YouTube...')
def get_list(x):
    """For each video in the playlist, pull the video URL"""
    BROWSER.get(x)
    videos = BROWSER.find_elements_by_css_selector('#video-title')
    video_list = []
    for vid in videos:
        # Drop everything after the first "&" so only the watch URL remains
        video_list.append(vid.get_attribute("href").split("&")[0])
    return video_list
def put_list(href):
    """Assign the video a specific category"""
    if href in a_list:
        return "Category: A"
    elif href in b_list:
        return "Category: B"
    else:
        return "Other"
a_list = get_list('{urlA}')
b_list = get_list('{urlB}')
BROWSER.get("{link to your channel's video list}")
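The `split("&")[0]` step above is what lets a URL from a playlist page match the same video on the channel page, since it drops parameters such as `list` and `index`. A minimal sketch of the same matching keyed on the video ID instead, using only the standard library. The function names `video_id` and `assign_category`, the playlist dictionary, and the IDs in it are all made up for illustration; they are not part of the script above.

```python
from urllib.parse import urlparse, parse_qs


def video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def assign_category(url, playlists, default="Other"):
    """Return the name of the first playlist whose ID set contains this video."""
    vid = video_id(url)
    if vid is not None:
        for name, ids in playlists.items():
            if vid in ids:
                return name
    return default


# Hypothetical playlists, each mapped to a set of made-up video IDs
playlists = {
    "Category: A": {"abc123", "def456"},
    "Category: B": {"ghi789"},
}

url = "https://www.youtube.com/watch?v=def456&list=PL1&index=2"
print(assign_category(url, playlists))
```

Holding the playlists in one dictionary means adding a third playlist is a new entry rather than another `elif` branch, and set membership stays fast on channels with hundreds of videos.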
One of the main challenges with scraping YouTube for videos is that the video page only loads more videos as you scroll, so it takes a key press (or many) to get all the way to the bottom. This scroll-down code presses the End key a hundred times, pausing a second after each press. In case so many videos have been added that even this is not enough, the script finishes with a double check that the last video collected is in fact the oldest on the list.
def scroll_down(x):
    """Press End x times to reach the bottom of the list"""
    html = BROWSER.find_element_by_tag_name('html')
    for _ in range(x):
        html.send_keys(Keys.END)
        time.sleep(1)  # give the next batch of videos time to load
scroll_down(100)
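A fixed scroll count is either too many for a small channel or too few for a large one. One common alternative, sketched here as an assumption rather than as part of the original script, is to keep scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; `execute_script` is the standard Selenium WebDriver method, and reading `document.documentElement.scrollHeight` is an assumption about how the page reports its height.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=200):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. Stops at max_rounds as a
    safety limit even if the height is still changing.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)  # give the page time to load more videos
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # nothing new loaded, assume this is the bottom
        last_height = new_height
    return max_rounds
```

It would be called as `scroll_until_stable(BROWSER)` in place of `scroll_down(100)`. The one-second pause is a guess: on a slow connection the height can sit still for a moment while more videos are loading, so checking the oldest video's title at the end is still worth keeping.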
all_videos = BROWSER.find_elements_by_css_selector('#video-title')
all_list = []
print('Checking playlists...')
# Iterate through all_videos and cross-check against the playlists pulled above
for vid in all_videos:
    url = vid.get_attribute("href").split("&")[0]
    new_dict = {}
    new_dict['Title'] = vid.get_attribute("title")
    new_dict['URL'] = url
    new_dict['Category'] = put_list(url)
    all_list.append(new_dict)
all_df = pd.DataFrame(all_list)
# Double check: the last row should be the oldest video on the channel
if all_df['Title'].iloc[-1] == "{Text title of the oldest video on the channel's list}":
    print("All videos accounted for.")
else:
    print("Scrolling error!")
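One fragile spot in the loop above: Selenium's `get_attribute("href")` returns `None` when an element has no `href`, and calling `.split` on `None` raises an `AttributeError`. Whether `#video-title` ever matches such an element depends on YouTube's current markup, so treat that as an assumption. A minimal sketch of a more defensive version of the loop follows; the helper name `rows_from_elements` is hypothetical, and it only relies on each element having a `get_attribute` method.

```python
def rows_from_elements(elements, categorize):
    """Build one dict per element, skipping elements that have no href."""
    rows = []
    for el in elements:
        href = el.get_attribute("href")
        if not href:
            continue  # not a link, so there is no video URL to record
        url = href.split("&")[0]  # same URL trimming as the original loop
        rows.append({
            "Title": el.get_attribute("title"),
            "URL": url,
            "Category": categorize(url),
        })
    return rows
```

With the script above, this would be used as `all_list = rows_from_elements(all_videos, put_list)` before building the DataFrame.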