Curating Content with Python’s Newspaper

December 22, 2017

There is a new library for Python called Newspaper that makes scraping news articles from websites easy. It supports multiple languages, has multithreaded downloads, and even does some natural language processing (extracting keywords based on relative frequency). When programming tasks are made this easy, it is sometimes referred to as magic, which is not necessarily a good thing, since the finer details of the program are obscured from the user. Newspaper is certainly magical, but I think it is a great library that achieves something undeniably useful without taking any serious missteps. I ended up using its Article object as the foundation for a small web app recently.

What I made basically takes the Article object and extends it by downloading article images, cleaning/transforming/organizing the data, and storing it in folders by domain. It also keeps track of each file's relative path, something you need to consider for static content. I will share some of the code I used to achieve this below (not all of it is good example code). This is not an entire program, just snippets, and it may require some modification before you can use it.
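For context, here is roughly what the plain Newspaper workflow looks like before any of my extensions (a minimal sketch; the URL is just a placeholder):

from newspaper import Article

article = Article('https://example.com/some-news-story')
article.download()   # fetch the HTML
article.parse()      # title, text, authors, top_image, publish_date, ...
article.nlp()        # keywords and a summary

print(article.title)
print(article.keywords)
print(article.top_image)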

The code below shows how to:

  • Measure an article’s positive and negative sentiment and add it to the object.
  • Match and extract valid domain names with regular expressions (site1.domain.com and site2.domain.com should be organized together under ‘domain.com’, not separately as their fully qualified names).
  • Traverse a nested dictionary structure concisely and modify it as needed.

import newspaper
import requests
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
from urllib.parse import urlsplit, urlparse
from datetime import datetime
import logging

# shared analyzer used by ArticleSummary.model() below
# (requires the vader_lexicon corpus: nltk.download('vader_lexicon'))
sid = SentimentIntensityAnalyzer()

def prune_dict2(d, mask):
    """
    d: the dictionary (or nested structure) to prune
    mask: a list/tuple of length 2:
        mask[0]: keys to remove wherever they appear
        mask[1]: a tuple of allowed value types, e.g. (str, int, float)
    Traverses the dict hierarchy recursively and drops entries whose
    keys are in mask[0] or whose values are not instances of mask[1].
    """
    if isinstance(d, set):
        d = list(d)
    if not isinstance(d, (dict, list)):
        return d
    if isinstance(d, list):
        return [prune_dict2(v, mask) for v in sorted(d)]
    try:
        return {k: prune_dict2(v, mask) for k, v in d.items() if k not in mask[0] and isinstance(v, mask[1])}
    except TypeError:
        # mask[1] was not a usable type tuple; fall back to filtering by key only
        return {k: prune_dict2(v, mask) for k, v in d.items() if k not in mask[0]}
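A quick illustration of what prune_dict2() does (the data here is made up purely for the example):

raw = {
    'title': 'Some headline',
    'html': '<html>...</html>',
    'score': 3,
    'nested': {'html': '<div>...</div>', 'author': 'Jane Doe'},
}
mask = [['html'], (str, int, dict)]

# drops every 'html' key and anything that is not a str, int, or dict
print(prune_dict2(raw, mask))
# {'title': 'Some headline', 'score': 3, 'nested': {'author': 'Jane Doe'}}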


def extract_tags(html):
    """
    Takes the HTML of a page and attempts to extract the article tags.
    Every site lays its tags out differently, so each lambda below targets
    a different markup pattern; add or adjust selectors for your own sites.
    """
    soup = BeautifulSoup(html, "lxml")

    f = lambda content: [x.text.strip() for x in content.select('.td-tags > li')][1:]
    f2 = lambda content: [x.text.strip() for x in content.select('.single-tags > a')]
    f3 = lambda content: [x.text.strip() for x in content.select('.tags > ul > li')]

    result = list()
    try:
        result = result + f(soup)
        result = result + f2(soup)
        result = result + f3(soup)
    except AttributeError as e:
        print('Error extracting tags: %s' % str(e))

    return result


class ArticleSummary:
    """
    A higher-level object that builds on top of newspaper.Article objects
    http://newspaper.readthedocs.io/en/latest/
    Takes a newspaper.Article object and a "mask" as inputs, where mask is a list of length 2:
        mask[0]: keys you want to remove from the dictionary data.
        mask[1]: a tuple of types that are allowed.
        mask can be used optionally with prune_dict2()
    What it does:
        - calls download(), parse(), and nlp() all in one go
        - adds sentiment polarity with NLTK's SentimentIntensityAnalyzer (sid)
        - minor functions to transform and standardize datetime and domain data
    """
    def __str__(self):
        return "ArticleSummary object: " + self.url

    # add the pos/neg/neu/compound sentiment scores as attributes
    def model(self):
        for k, v in sid.polarity_scores(self.text).items():
            setattr(self, k, v)

    # format datetime objects so they all have the same format
    def set_time(self):
        if isinstance(self.publish_date, datetime):
            self.publish_date = self.publish_date.strftime('%Y-%m-%d %H:%M:%S')

    def add_names(self):
        domain = re.compile(r'^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}$')

        # Article objects don't always parse the domain consistently,
        # so this is basically patch work
        if re.match(domain, urlsplit(self.url).netloc):
            setattr(self, "domain", urlsplit(self.url).netloc)
        else:
            setattr(self, "domain", urlparse(self.url).hostname)

        setattr(self, "hostname", urlparse(self.url).hostname)

        # filepath() builds the relative image path; it is part of the
        # image-download code left out of this post
        if hasattr(self, "top_image"):
            setattr(self, "imagepath", filepath(self.hostname, self.top_image))

    def __init__(self, article, mask):
        try:
            article.download()
            article.parse()
            article.nlp()
        except newspaper.ArticleException as e:
            print("error downloading %s , error: %s" % (article.url, str(e)))

        # copy everything the Article object parsed onto this object
        for k, v in article.__dict__.items():
            setattr(self, k, v)

        try:
            self.tags = extract_tags(self.html)
        except AttributeError as e:
            print("Error with getting tags for %s" % self.url)

        self.model()
        self.set_time()
        self.add_names()

# example usage:
urls = []   # list of site URLs to build papers from
mask = [['discardkeys', 'withthesenames'], (str, int, bool)]
for url in urls:
    # don't cache/memoize when testing or else it will stop finding new articles
    paper = newspaper.build(url, language='en', memoize_articles=False)
    for article in paper.articles:
        article_summary_obj = ArticleSummary(article, mask)

The ArticleSummary class is the meat and potatoes. Invoke it with a newspaper.Article object (the example usage at the bottom shows how to get there) and it will initialize itself with that article object's values. Once you have all the data for an article, you can begin making the fine adjustments to structure and formatting that are specific to that website or project. Taking it further, you could create a "base" class containing the attributes that are always necessary, and then create separate site-specific classes as needed that initialize from the base class (and define additional attributes), as sketched below.
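A minimal sketch of that base/derived split might look like this (the class names and attributes here are just illustrative):

class BaseSummary:
    """Attributes every project needs, regardless of the site."""
    def __init__(self, article):
        article.download()
        article.parse()
        for k, v in article.__dict__.items():
            setattr(self, k, v)


class RecipeSiteSummary(BaseSummary):
    """Site-specific extras layered on top of the base attributes."""
    def __init__(self, article):
        super().__init__(article)
        self.tags = extract_tags(self.html)   # reuses the helper defined earlier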

Sometimes Newspaper misses things because of how a particular website is built. The extract_tags() function tries parsing a few tag structures with BeautifulSoup and combines whatever it finds into a single list of results. You should modify this function if you intend to use any of this code, because the article tags may be in a different spot on the site you are scraping (here is a very basic BeautifulSoup script; the documentation is great as well).
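For example, if the site you are scraping wraps its tags in some other structure, you would just add another selector for it (the class name and HTML below are made up):

from bs4 import BeautifulSoup

# hypothetical layout: <div class="post-tags"><a>python</a> <a>scraping</a></div>
html = '<div class="post-tags"><a>python</a> <a>scraping</a></div>'
soup = BeautifulSoup(html, "lxml")
f4 = lambda content: [x.text.strip() for x in content.select('.post-tags > a')]
print(f4(soup))   # ['python', 'scraping']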

What if you want more than just keywords from the article? Enter NLTK's SentimentIntensityAnalyzer. The instance method model() gathers the positive and negative valence of the article text and adds it to the object, accessible as self.pos and self.neg. It does this by counting certain words, among other things; you can see some examples here.
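To give a sense of what the analyzer returns (a minimal sketch; the sentence is made up):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
scores = sid.polarity_scores("The article was surprisingly good, but the ending fell flat.")
# a dict with 'neg', 'neu', 'pos', and 'compound' keys,
# which ArticleSummary.model() copies onto the object as attributes
print(scores)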

The coupling of functions within and outside the class could probably be improved, but this is what worked for me the first time around. Some things here are certainly not best practice, and I feel I should at least point that out (print statements for debugging, for shame!).

What I left out of this post (for now):

  • A function to download article image(s), embed their relative file path into the ArticleSummary dict, and save it to that location.
  • Logging to keep track of any errors or failed downloads (replaced here with print statements, since logging requires a few more lines of configuration and initialization).
  • Serializing the data as .json and managing updating the file.
  • Generating an HTML snippet by formatting the data into some basic HTML code, possibly with Bootstrap and JavaScript to make a preview window. This may or may not be better than using a traditional ORM and Jinja templates.
  • A lazy way of jamming all the article metadata into one dictionary entry called “meta” rather than having it hidden in lists, deep within the dict structure. I’m open to suggestions when dealing with lists if you plan on serializing it as json. See below.

        # prune the object's attributes, then group anything that starts
        # with 'meta' (meta_keywords, meta_description, ...) under one key
        self.data = {'meta': {}}
        for k, v in prune_dict2(self.__dict__, mask).items():
            if v:
                setattr(self, k, v)
                if k.startswith('meta'):
                    self.data['meta'][k] = v
                else:
                    self.data[k] = v

Here is where prune_dict2() actually gets used. I cut it out of the main class because it is more of a hack for a task-specific problem. In any case, prune_dict2() is a really useful function that I found when I was trying to work with more deeply nested JSON structures and needed a recursive solution. I have modified it quite a bit to suit my needs.

That’s all for now, comment if you have any questions or suggestions!
