How Are Scrapy and Selenium Used in Scraping and Analyzing News Articles?

This blog will teach you how to scrape news articles using Scrapy and Selenium, and how to analyze them, so you can stay updated with the latest technology products and startups.

Scraping

Selenium got its start as a web testing tool. Anyone who has never done web testing before will find it entertaining to play with: you sit there sipping coffee with both hands while your browser is possessed (or rather, programmatically commanded) to do all sorts of things.

Here are the commands to get started:

scrapy startproject [project name]
cd [project name]
scrapy genspider [spider name] [domain]

Take care to place the web driver at the top level of the project folder, the same level as the “scrapy.cfg” file.

CNN

Without JavaScript, the search results would not even appear on CNN, and we would be presented with a blank page:

cnn

This, on the other hand, demonstrates the pleasure (and problems) of JavaScript:

cnn2

So we'll need to replicate the process of submitting a search request. Simply putting the “search?q=” string in the URL would work, but the code below shows a fuller method that drives Selenium from the home page. After that, we'll look at pagination:

cnn3
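The shortcut mentioned above, putting the search term straight into the URL, can be sketched as follows. This is a minimal sketch: the base path `https://www.cnn.com/search` and the helper name `build_search_url` are assumptions for illustration, and the only parameter used is the `q` from the “search?q=” string.

```python
from urllib.parse import urlencode

# Assumed base path of the search page; adjust it if the site differs.
SEARCH_BASE = "https://www.cnn.com/search"


def build_search_url(term, base=SEARCH_BASE):
    """Return a search URL with the term safely URL-encoded (hypothetical helper)."""
    return f"{base}?{urlencode({'q': term})}"


print(build_search_url("immigration"))
print(build_search_url("border wall"))  # spaces are encoded as '+'
```

The resulting URL could be passed to `driver.get(...)` in place of typing into the search bar, though the page still needs JavaScript to render the results.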

On a side note, the “Date” button only lets you sort by date or relevance; it does not let you search within a particular date range. The code for scraping CNN is below, with explanations in the comments.

import time

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


class CnnSpider(scrapy.Spider):
    name = 'cnn'
    allowed_domains = ['www.cnn.com']
    start_urls = ['https://www.cnn.com']

    # initiating selenium
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # set up the driver
        chrome_options = Options()
        # chrome_options.add_argument("--headless") # uncomment if you don't want to appreciate the sight of a possessed browser
        # Selenium 3 style; Selenium 4 expects service=Service('./chromedriver') instead of executable_path
        driver = webdriver.Chrome(executable_path='./chromedriver', options=chrome_options)
        driver.get("https://www.cnn.com")

        # begin search
        search_input = driver.find_element(By.ID, "footer-search-bar") # find the search bar
        search_input.send_keys("immigration") # type in the search term
        search_btn = driver.find_element(By.XPATH, "(//button[contains(@class, 'Flex-sc-1')])[2]") # find the search button
        search_btn.click()

        # record the first page
        self.html = [driver.page_source]

        # start turning pages
        i = 0
        while i < 100: # 100 is just right to get us back to July
            i += 1
            time.sleep(5) # just in case the next button hasn't finished loading
            next_btn = driver.find_element(By.XPATH, "(//div[contains(@class, 'pagination-arrow')])[2]") # find the next button
            next_btn.click()
            self.html.append(driver.page_source) # not the best way but will do

        driver.quit() # all result pages are stored, so the browser can be closed

    # using scrapy's native parse to first scrape links on result pages
    def parse(self, response):
        for page in self.html:
            resp = Selector(text=page)
            results = resp.xpath("//div[@class='cnn-search__result cnn-search__result--article']/div/h3/a") # result iterator
            for result in results:
                title = result.xpath(".//text()").get()
                if ("Video" in title) or ("coronavirus news" in title) or ("http" in title):
                    continue # ignore videos and search-independent news or ads
                link = result.xpath(".//@href").get()[13:] # cut off the domain; had better just use request in fact
                yield response.follow(url=link, callback=self.parse_article, meta={"title": title})

    # pass on the links to open and process actual news articles
    def parse_article(self, response):
        title = response.request.meta['title']

        # several variations of author's locator
        authors = response.xpath("//span[@class='metadata__byline__author']//text()").getall()
        if len(authors) == 0:
            authors = response.xpath("//p[@data-type='byline-area']//text()").getall()
            if len(authors) == 0:
                authors = response.xpath("//div[@class='Article__subtitle']//text()").getall()

        # two variations of article body's locator
        content = ' '.join(response.xpath("//section[@id='body-text']/div[@class='l-container']//text()").getall())
        if not content.strip(): # ' '.join() never returns None, so test for an empty string
            content = ' '.join(response.xpath("//div[@class='Article__content']//text()").getall())

        yield {
            "title": title,
            "byline": ' '.join(authors), # could be multiple authors
            "time": response.xpath("//p[@class='update-time']/text()").get(),
            "content": content
        }
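The spider cuts the domain off each link by slicing a fixed 13 characters, which only works while every link starts with exactly `//www.cnn.com`. As a hedged alternative, here is a small sketch that lets the standard library split the URL instead. The helper name `to_relative` and the sample link are made up for illustration.

```python
from urllib.parse import urlsplit


def to_relative(href):
    """Drop scheme and domain from a link, keeping the path and any query string."""
    parts = urlsplit(href)
    return parts.path + (f"?{parts.query}" if parts.query else "")


# Hypothetical protocol-relative link of the kind the [13:] slice assumes
href = "//www.cnn.com/2021/03/24/politics/example-story/index.html"

print(to_relative(href))
print(to_relative(href) == href[13:])  # same result as the fixed-width slice
```

Since Scrapy's `response.follow` resolves relative links against the page URL, it may be possible to pass the link on without any trimming at all; we have not verified that against the live site.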

FOX News

Scraping Fox News is similar, except that we're dealing with a Show More button instead of standard pagination:

fox

Only the significant deviations from the CNN spider are discussed this time.

import time

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


class FoxnewsSpider(scrapy.Spider):
    name = 'foxnews'
    allowed_domains = ['www.foxnews.com']
    start_urls = ['https://www.foxnews.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        chrome_options = Options()
        # chrome_options.add_argument("--headless")
        # Selenium 3 style; Selenium 4 expects service=Service('./chromedriver') instead of executable_path
        driver = webdriver.Chrome(executable_path='./chromedriver', options=chrome_options)
        driver.get("https://www.foxnews.com/category/us/immigration")

        wait = WebDriverWait(driver, 10)

        # first, click 'Show More' many times
        i = 0
        while i < 80:
            try:
                time.sleep(1)
                element = wait.until(EC.visibility_of_element_located(
                    (By.XPATH, "(//div[@class='button load-more js-load-more'])[1]/a")))
                element.click()
                i += 1
            except TimeoutException:
                break # the button no longer shows up, so stop clicking

        # then, copy down all that's now shown on the page
        self.html = driver.page_source
        driver.quit() # the page source is stored, so the browser can be closed

    def parse(self, response):
        resp = Selector(text=self.html)
        results = resp.xpath("//article[@class='article']//h4[@class='title']/a")
        for result in results:
            title = result.xpath(".//text()").get()
            eyebrow = result.xpath(".//span[@class='eyebrow']/a/text()").get() # scraped only for filtering
            link = result.xpath(".//@href").get()
            if eyebrow == 'VIDEO':
                continue # filter out videos
            yield response.follow(url=link, callback=self.parse_article, meta={"title": title})

    def parse_article(self, response):
        title = response.request.meta['title']
        authors = response.xpath("(//div[@class='author-byline']//span/a)[1]/text()").getall()
        if len(authors) == 0:
            authors = [i for i in response.xpath("//div[@class='author-byline opinion']//span/a/text()").getall() if 'Fox News' not in i]
        content = ' '.join(response.xpath("//div[@class='article-body']//text()").getall())
        yield {
            "title": title,
            "byline": ' '.join(authors),
            "time": response.xpath("//div[@class='article-date']/time/text()").get(),
            "content": content
        }

To execute these spiders, simply type the following into Terminal:

scrapy crawl [spider name] [-o fileName.csv/.json/.xml]
# Saving the output to a file is optional.
# CSV, JSON, and XML are the usual choices; Scrapy also supports other feed formats such as JSON Lines (.jl).

Analyzing

To speed up crawling, Scrapy sends multiple requests at the same time. As a result, responses are not processed in order, and the data we collected arrives in an arbitrary sequence.

For this section, we'll need the following packages:

# for standard data wrangling
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# for pattern matching during cleaning
import re

# for frequency counts
from collections import Counter

# for bigrams, conditional frequency distribution and beyond
import nltk

# for word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

# for (one way of) keyword extraction
from sklearn import feature_extraction
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

Here is a sample of the CNN and Fox News data:

cnn-sample-data
fox-sample-data

There seem to be a few typical cleaning procedures to consider, which will ultimately depend on our goals. If we only want to look at the content, for example, we can disregard the chaos in other columns entirely.

1. Discard articles in unusual formats, such as slideshows (which result in NAs).

df = df.dropna(subset=['column to consider']).reset_index(drop=True)
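To make the effect of that line concrete, here is a toy example. The three rows are invented for illustration, with the slideshow standing in for an article whose body failed to scrape.

```python
import pandas as pd

# Made-up rows; the slideshow has no scraped body text
df = pd.DataFrame({
    "title": ["Story A", "Slideshow B", "Story C"],
    "content": ["full text", None, "more text"],
})

# Drop rows with a missing 'content' value and renumber the index from 0
cleaned = df.dropna(subset=["content"]).reset_index(drop=True)
print(cleaned)
```

Only the columns listed in `subset` are checked, so a missing byline would not cause a row to be dropped here.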

2. Format Dates

# for CNN
df['date'] = df['time'].apply(lambda x: x.split('ET,')[1][4:].strip())
df.date = pd.to_datetime(df.date, format='%B %d, %Y')

# for Fox News (relative times are resolved against March 24, 2021)
def normalize_fox_time(t):
    t = t.strip()
    if 'hour' in t:
        return 'March 24, 2021'
    elif 'day' in t:
        day_offset = int(t.split()[0])
        return 'March {}, 2021'.format(24 - day_offset)
    elif ('March' in t) or ('February' in t) or ('January' in t):
        return t + ', 2021'
    else:
        return t + ', 2020'

# iterrows() yields copies of the rows, so write the result back to the column instead
df['time'] = df['time'].apply(normalize_fox_time)
df = df.rename(columns={'time': 'date'})
df.date = df.date.apply(lambda x: x.strip())
df.date = pd.to_datetime(df.date, format='%B %d, %Y')
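The string handling above is easier to follow on a concrete value. The sketch below uses only the standard library and is not the code we ran: the sample strings (such as "Updated 10:01 AM ET, Wed March 24, 2021" and "2 days ago") are assumptions about what the scraped `time` field looks like, and the function names are made up. For Fox News it uses `timedelta` instead of subtracting from the day number, which also copes with offsets that cross a month boundary.

```python
from datetime import date, datetime, timedelta

# The day that relative times are resolved against (March 24, 2021 in the code above)
REFERENCE_DAY = date(2021, 3, 24)


def parse_cnn_time(raw):
    """Parse an assumed CNN timestamp like 'Updated 10:01 AM ET, Wed March 24, 2021'."""
    tail = raw.split("ET,")[1].strip()       # 'Wed March 24, 2021'
    month_day_year = tail.split(" ", 1)[1]   # drop the weekday
    return datetime.strptime(month_day_year, "%B %d, %Y").date()


def parse_fox_time(raw, today=REFERENCE_DAY):
    """Resolve assumed Fox News values: 'N hours ago', 'N days ago', or 'Month day'."""
    raw = raw.strip()
    if "hour" in raw:
        return today
    if "day" in raw:
        return today - timedelta(days=int(raw.split()[0]))
    # No year is given: try the current year, and fall back one year for future dates
    parsed = datetime.strptime(f"{raw}, {today.year}", "%B %d, %Y").date()
    if parsed > today:
        parsed = parsed.replace(year=today.year - 1)
    return parsed


print(parse_cnn_time("Updated 10:01 AM ET, Wed March 24, 2021"))
print(parse_fox_time("2 days ago"))
print(parse_fox_time("December 5"))
```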

We also added a month-year column for later aggregate reports. It also helps remove unneeded articles published in July (scraped earlier because the page counts were rough).

df['month_year'] = pd.to_datetime(df['date']).dt.to_period('M')
df_cleaned = df[df['month_year'] != pd.Period('2020-07', 'M')].copy()

After this filtering, we have 644 CNN stories and 738 Fox News articles. Both media organizations appear to be increasing the number of immigration-related pieces published each month, with Fox showing a noticeable surge in interest in March.

cnn-graph
fox-graph

3. Clean Articles

Because the scraping stage indiscriminately included extraneous material, such as ad banners, media sources, and markup like “width” or “video closed,” we could do a far finer job of cleaning the body of each article. On the other hand, most of this residue would scarcely compromise our text analysis.

df['content'] = df['content'].apply(lambda x: x.lower())

# for CNN only: strip the embedded script residue
df['content'] = df['content'].apply(lambda x: re.sub(r'use\sstrict.*?env=prod"}', '', x))
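Here is what that regular expression does to a single string. The sample text is invented to mimic script residue inside a scraped article body, and the `re.DOTALL` flag is our addition so that the match can also span line breaks.

```python
import re

# Lazy match from 'use strict' up to the first 'env=prod"}'
SCRIPT_RESIDUE = re.compile(r'use\sstrict.*?env=prod"}', flags=re.DOTALL)


def strip_script_residue(text):
    """Remove the embedded script chunk from a lowercased article body."""
    return SCRIPT_RESIDUE.sub("", text)


# Made-up sample mimicking script text scraped along with the article
sample = 'Officials said. use strict; var cfg = {"api":"host?env=prod"} The border remains'
cleaned = strip_script_residue(sample.lower())
print(cleaned)
```

The `.*?` is lazy, so only the text up to the first closing marker is removed rather than everything up to the last one.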

Word Cloud

We start with the headlines to get a sense of the differences between the two publications.

stopwords = nltk.corpus.stopwords.words('english')
stopwords += ['the', 'says', 'say', 'a'] # add custom stopwords
stopwords_tokenized = nltk.word_tokenize(' '.join(stopwords))

def process(text):
    tokens = []
    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            token = word.lower().replace("'", "") # treat words like 'she and she as one
            if ('covid-19' in token) or ('coronavirus' in token): # publications use different terms for covid
                tokens.append('covid') # normalize all the mentions since this is a crucial topic as of now
            else:
                tokens.append(token)
    tokens_filtered = [t for t in tokens
                       if re.search('[a-zA-Z]', t) and t not in stopwords_tokenized]
    return tokens_filtered

def gen_wc(bag, name=''):
    tokens = process(bag)
    plt.figure(figsize=(20,10), dpi=800)
    wc = WordCloud(background_color="white", width=1000, height=500) # other options like max_font_size=, max_words=
    wordcloud = wc.generate_from_text(' '.join(tokens))
    plt.imshow(wordcloud, interpolation="nearest", aspect="equal")
    plt.axis("off")
    plt.title('Words in Headlines-{}'.format(name))
    plt.savefig('headline_wc_{}.png'.format(name), dpi=800) # figsize belongs to plt.figure, not savefig
    plt.show()

# generate word cloud for each month
for period in df['month_year'].unique():
    df_subset = df[df['month_year']==period].copy()
    bag = df_subset['title'].str.cat(sep = ' ')
    gen_wc(bag, name=period)

Here are the headline keywords for CNN, month by month.

cnn-keyword

All of the words are lowercased, so “us” means “the US” and “ice” means “ICE” (Immigration and Customs Enforcement), and so on.

FOX News:

fox-keyword

Bigrams:

Another thing we will look at is bigrams.

out = []
for title in list(df['title']):
    out.append(nltk.word_tokenize(title))

bi = []
for title_words in out:
    bi += nltk.bigrams(title_words)

Counter(bi).most_common()
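To show what the bigram step produces without needing NLTK, here is a dependency-free sketch. The three headlines are invented, and plain whitespace splitting stands in for `nltk.word_tokenize`, so punctuation is handled differently from the real run.

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))


# Made-up headlines for illustration
titles = [
    "Biden administration faces border surge",
    "Biden administration reopens shelter",
    "Border crisis grows",
]

counts = Counter()
for title in titles:
    counts.update(bigrams(title.split()))

print(counts.most_common(3))
```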

There are a few unusual bigrams among the anticipated popular ones, such as "Biden administration" and "Trump administration."

Bigrams

With the bigram list, we can build a conditional frequency distribution and look up which words follow a given keyword. For instance,

cfd = nltk.ConditionalFreqDist(bi)

cfd['Covid']
# CNN: FreqDist({'relief': 8, ',': 6, 'law': 1})

cfd['coronavirus']
# Fox News: FreqDist({'pandemic': 4, 'death': 2, 'vaccine': 1, 'relief': 1, 'records': 1, 'travel': 1, 'is': 1, 'rules': 1, 'canceled': 1, ',': 1, ...})

cfd['border']
# CNN: FreqDist({'wall': 7, 'crisis': 3, 'is': 3, '.': 2, ',': 2, 'alone': 2, 'surge': 1, 'closed': 1, 'problem': 1, 'encounters': 1, ...})
# Fox News: FreqDist({'crisis': 50, 'wall': 19, ',': 14, 'surge': 13, ':': 8, 'as': 7, 'policy': 7, 'crossings': 6, "'crisis": 5, 'situation': 5, ...})
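Conceptually, a conditional frequency distribution is just one counter per first word. The sketch below rebuilds that idea with the standard library. The word pairs are invented, and this is an illustration of the concept rather than NLTK's implementation.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Map each first word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in pairs:
        cfd[first][second] += 1
    return cfd


# Made-up bigrams for illustration
pairs = [
    ("border", "crisis"),
    ("border", "wall"),
    ("border", "crisis"),
    ("covid", "relief"),
]

cfd = conditional_freq(pairs)
print(cfd["border"].most_common())
print(cfd["covid"].most_common())
```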

Changing Over Time

It would be interesting to see how word frequency changed over the course of eight months, so we created a new dataset with monthly word counts:

bag = df['title'].str.cat(sep = ' ')
tokens = process(bag)
word_df = pd.DataFrame.from_dict(dict(Counter(tokens)), orient='index', columns=['overall'])

# create a custom merge
def merge(original_df, frames):
    out = original_df
    for frame in frames:
        out = out.merge(frame, how='left', left_index=True, right_index=True)
    return out

frames = []
for period in df['month_year'].unique()[::-1]: # in reverse (chronological) order
    df_subset = df[df['month_year']==period].copy()
    bag = df_subset['title'].str.cat(sep = ' ')
    tokens = process(bag)
    frames.append(pd.DataFrame.from_dict(dict(Counter(tokens)), orient='index', columns=[str(period)]))

end_df = merge(word_df, frames)
end_df = end_df.fillna(0)
changing-over-time

Such a wide dataset is useful for comparing month to month, but a long format is more convenient for visualizing the change in Tableau, hence the transformation:

df_long_temp = end_df.drop(columns='overall').reset_index()
df_long = pd.melt(df_long_temp, id_vars=['index'], var_name='year', value_name='frequency')
changing-over-time2
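For readers who have not used `pd.melt`, here is the same wide-to-long reshaping on a tiny table. The words and counts are invented, and we name the variable column `month` here purely for readability.

```python
import pandas as pd

# Made-up wide table: one row per word, one column per month
wide = pd.DataFrame(
    {"overall": [5, 3], "2021-02": [2, 1], "2021-03": [3, 2]},
    index=["border", "biden"],
)

# One row per (word, month) pair; the unnamed index becomes the 'index' column
long = pd.melt(
    wide.drop(columns="overall").reset_index(),
    id_vars=["index"],
    var_name="month",
    value_name="frequency",
)
print(long)
```

Each word now appears once per month, which is the shape Tableau needs to animate a value over time.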

Here's a link to a tutorial on how to animate the Tableau visualization.

cnn-report

Beginning in the election month, we observe references to “Biden” rise quickly, while “Trump” falls off the list entirely in March, and attention to migrant children rises along with “border.”

fox-report

Since the election, "Biden" has taken the lead, but the attention didn't build up until the start of 2021 when "crisis" and "surge" began to dominate the media.

Keywords in Articles

To see which words in the articles carry more meaning, we used TF-IDF, which weighs both the importance of a term in a document (in this case, a specific news story) and its relevance across the whole corpus, so that all-too-common words are weighted less. We also added stop words to the mix.
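The weighting idea can be sketched in a few lines of plain Python. This uses the textbook formula (term frequency times the log of inverse document frequency) on invented token lists. scikit-learn's `TfidfVectorizer` applies a smoothed variant and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tokenized articles
docs = [
    ["border", "crisis", "border"],
    ["border", "wall"],
    ["border", "relief", "bill"],
]

for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Because “border” appears in every document, its weight drops to zero here, while rarer words such as “wall” stand out.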

There are various ways to achieve this, but in this case, we tried to pool the top ten terms (ordered by their TF-IDF weights) across articles to analyze the differences in each publication's total vocabulary.

def stemming(token):
    stemmer = SnowballStemmer("english")
    if token in stopwords_tokenized:
        return token
    else:
        return stemmer.stem(token)

# a slightly revised process function
def preprocess(text):
    tokens = []
    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            token = word.lower().replace("'", "")
            if ('covid-19' in token) or ('coronavirus' in token):
                tokens.append('covid')
            else:
                tokens.append(token)
    tokens_filtered = [t for t in tokens if re.search('[a-zA-Z]', t)]
    stems = [stemming(t) for t in tokens_filtered]
    return stems

articles = df.content.tolist()

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, max_features=200000,
                                   stop_words=stopwords_tokenized, strip_accents='unicode',
                                   use_idf=True, tokenizer=preprocess, ngram_range=(1,2))
tfidf_matrix = tfidf_vectorizer.fit_transform(articles)

# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out()

# pool top 10 keywords in each news article
keywords = []
for row in range(tfidf_matrix.shape[0]):
    for i in tfidf_matrix[row].toarray()[0].argsort()[::-1][:10]:
        keywords.append(terms[i])

We can see the keywords the two outlets share this way:

set(fn_content_words).intersection(set(cnn_content_words))

# word endings altered due to stemming
{'administr',    # administration
 'biden',
 'bill',
 'children',
 'democrat',
 'facil',        # facilities
 'ice',
 'mayorka',
 'mexico',
 'migrant',
 'polic',        # police
 'polici',       # policy / policies
 'presid',       # president
 'republican',
 'senat',        # senate
 'trump',
 'unaccompani',  # unaccompanied
 'wolf'}         # Chad Wolf

We can use the set difference to see which words were adopted by one outlet but not the other.

set(fn_content_words).difference(set(cnn_content_words))
set(cnn_content_words).difference(set(fn_content_words))
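On toy data, the three set operations look like this. The word lists are invented stand-ins for `fn_content_words` and `cnn_content_words`.

```python
# Made-up keyword pools for the two outlets
fn_words = {"border", "wall", "caravan", "biden"}
cnn_words = {"border", "protest", "worker", "biden"}

shared = fn_words.intersection(cnn_words)   # used by both
only_fn = fn_words.difference(cnn_words)    # used by Fox News only
only_cnn = cnn_words.difference(fn_words)   # used by CNN only

print(sorted(shared))
print(sorted(only_fn))
print(sorted(only_cnn))
```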

As the above analysis shows, Fox News' keywords include arrest, caravan, legally questionable, wall, Latino, and various states such as Arizona and Texas, whereas CNN's keywords include American, Black, China, White, Latino, women, campaign, protest, and worker, which did not rank as prominent keywords for Fox News.

As a further step, we could try sentiment analysis, topic detection, or more specific content analysis, such as comparing the organizations' use of nouns, modals, and quotations, or their lexical diversity.
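As one small illustration of these follow-up ideas, lexical diversity is often approximated by the type-token ratio: the number of distinct words divided by the total number of words. The sketch below is a rough starting point we did not run on the scraped data. The function name and the simple regex tokenizer are our own choices, and the ratio is sensitive to text length, so articles of very different sizes are not directly comparable.

```python
import re


def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


print(type_token_ratio("the border the wall"))
print(type_token_ratio("migrant children arrive at border facilities"))
```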

For any queries, contact 3i Data Scraping!