Scraping Reddit, part 2

The last post dealt with using pushshift and handling requests to access posts and comments from Reddit. This post deals with using praw, the Python Reddit API wrapper, to access posts and comments from Reddit and then using some NLP tools for some basic sentiment analysis.

Setting up an application to use praw with OAuth takes some work, but it is straightforward enough for anyone who’s just using this as a script.
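For reference, the code in this post calls praw.Reddit("bot1"), which reads its credentials from a [bot1] section of a praw.ini file. Here is a rough sketch of what that section can look like for read-only access. Every value is a placeholder, and the missing_keys helper is a hypothetical convenience of mine (not part of praw) that only checks that the three keys are present.

```python
import configparser

# Example praw.ini contents for a "script" type Reddit app.
# All values are placeholders, not real credentials.
PRAW_INI = """
[bot1]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
user_agent=script:rnba-sentiment:v0.1 (by u/YOUR_USERNAME)
"""

def missing_keys(ini_text, site="bot1"):
    """Return the expected keys that are absent from the given site section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    required = ("client_id", "client_secret", "user_agent")
    return [key for key in required if key not in parser[site]]

print(missing_keys(PRAW_INI))
```

An empty list means the section has everything the sketch expects.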

After setting up the praw application, we can build up a small pipeline:

  1. Use praw to download posts and comments from r/nba
  2. Format them into a dataframe
  3. Use huggingface and spacy for sentiment analysis
from dataclasses import dataclass
import itertools as it
from functools import reduce, partial
import datetime as dt

import pandas as pd
pd.set_option('display.max_colwidth', 150)
import praw
from praw.models import MoreComments
import matplotlib.pyplot as plt
import hfapi
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load("en_core_web_sm")
spacy_text_blob = SpacyTextBlob()
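# spaCy 2.x style call; spaCy 3.x expects the registered component name instead,
# e.g. nlp.add_pipe("spacytextblob")
nlp.add_pipe(spacy_text_blob)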

client = hfapi.Client()
reddit = praw.Reddit("bot1") # Pulls from praw.ini file
rnba = reddit.subreddit('nba')

Compiling praw objects into a dataframe

@dataclass
class RedditSubmission:
    title: str 
    body: str 
    permalink: str 
    author: str 
    score: float
    timestamp: dt.datetime
    
    def to_dict(self):
        return {
            'title': self.title,
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }
    
    @classmethod
    def from_praw_submission(
        cls,
        praw_submission: praw.models.Submission
    ):      
        return cls(
            praw_submission.title,
            praw_submission.selftext,
            praw_submission.permalink,
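            # author is a Redditor object (None for deleted accounts), so store its name
            str(praw_submission.author) if praw_submission.author else None,
            praw_submission.score,
            # created_utc is a UTC epoch, so build a timezone-aware UTC datetime
            dt.datetime.fromtimestamp(praw_submission.created_utc, tz=dt.timezone.utc)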
        )
        
@dataclass
class RedditComment:
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_comment(
        cls,
        praw_comment: praw.models.Comment
    ):
        return cls(
            praw_comment.body,
            praw_comment.permalink,
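            # author is a Redditor object (None for deleted accounts), so store its name
            str(praw_comment.author) if praw_comment.author else None,
            praw_comment.score,
            # created_utc is a UTC epoch, so build a timezone-aware UTC datetime
            dt.datetime.fromtimestamp(praw_comment.created_utc, tz=dt.timezone.utc)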
        )

        
def process_submission_from_praw(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        yield RedditSubmission.from_praw_submission(praw_submission)
        
def process_comment_from_praw_submission(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        for praw_comment in praw_submission.comments:
            if isinstance(praw_comment, MoreComments):
                continue
            else:
                yield RedditComment.from_praw_comment(praw_comment)

# Fetch the hot listing once and reuse it, so submissions and comments
# come from the same set of posts
praw_submissions = list(rnba.hot(limit=100))

submissions = process_submission_from_praw(praw_submissions)
comments = process_comment_from_praw_submission(praw_submissions)
submission_df = pd.DataFrame(a.to_dict() for a in submissions)
comment_df = pd.DataFrame(a.to_dict() for a in comments)
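As a side note, the hand-written to_dict methods could probably be replaced by dataclasses.asdict from the standard library. The sketch below uses a cut-down, hypothetical Comment class and made-up records rather than the real praw objects. It also shows one way to turn an epoch value like created_utc into a timezone-aware UTC datetime.

```python
from dataclasses import dataclass, asdict
import datetime as dt

@dataclass
class Comment:  # hypothetical stand-in for RedditComment
    body: str
    author: str
    score: int
    timestamp: dt.datetime

def utc_from_epoch(epoch_seconds):
    """Convert a Unix timestamp (like praw's created_utc) to an aware UTC datetime."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc)

# Made-up records, for illustration only
comments = [
    Comment("Best scorer on the Bulls since MJ", "user_a", 120, utc_from_epoch(1617235200)),
    Comment("What a time", "user_b", 1, utc_from_epoch(1617238800)),
]

# asdict builds the same kind of dict that to_dict returns, field by field
rows = [asdict(comment) for comment in comments]
print(rows[0])
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as the generator of to_dict() results.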

Using huggingface for sentiment analysis

Specifically, this uses the Hugging Face inference API through the hfapi client.

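def classification_single_body(client, sentence):
    """Return (negative_score, positive_score) for one sentence, or (None, None) on error."""
    classification = client.text_classification(sentence)
    # Error responses carry an 'error' key instead of scores
    if 'error' in classification:
        return None, None
    # Relies on the response listing the negative label first, then the positive one
    neg_sentiment, pos_sentiment = classification[0]

    return neg_sentiment['score'], pos_sentiment['score']

def classification_multiple_body(client, bunch_of_sentences, colnames=None):
    """Score each sentence and collect the results into a two-column dataframe."""
    if colnames is None:
        colnames = ['negative_score', 'positive_score']
    df = pd.DataFrame(
        map(lambda x: classification_single_body(client, x), bunch_of_sentences),
        columns=colnames
    )

    return df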

client = hfapi.Client()
classification_multiple_bodies_partial = partial(classification_multiple_body, client)
submission_df = pd.concat([
    submission_df, classification_multiple_bodies_partial(submission_df['title'].to_list())
], axis=1)
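One fragile spot in the code above is that classification_single_body unpacks the two scores by position. A slightly more defensive sketch looks the scores up by label instead. This assumes each entry in the response also carries a 'label' key with values 'NEGATIVE' and 'POSITIVE' (typical for SST-2 style models, but an assumption here), and the sample response is made up.

```python
def scores_by_label(classification):
    """Map each label to its score for a single-input classification response."""
    # Error responses are assumed to be a dict with an 'error' key
    if isinstance(classification, dict) and "error" in classification:
        return {}
    return {entry["label"]: entry["score"] for entry in classification[0]}

# Made-up response in the assumed shape, deliberately listed positive-first
sample = [[
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.0002},
]]

scores = scores_by_label(sample)
negative, positive = scores.get("NEGATIVE"), scores.get("POSITIVE")
print(negative, positive)
```

With this approach the order of the entries in the response no longer matters.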

Looking at the scored submissions, here’s a title with an appropriately positive score: “Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player’s OVERALL win shares for the current season.”

Here’s a title that is scored as incredibly negative but is in reality pretty positive: “Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season”. Being even close to the 50-40-90 club is incredible.

submission_df.sort_values("negative_score")[['title', 'score', 'negative_score', 'positive_score']]
|    | title | score | negative_score | positive_score |
|----|-------|-------|----------------|----------------|
| 19 | [Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h... | 241 | 0.000185 | 0.999816 |
| 8 | Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.” | 1610 | 0.000185 | 0.999815 |
| 12 | [Thinking Basketball] The 10 Best NBA peaks since 1977 | 1346 | 0.000283 | 0.999717 |
| 25 | [Highlight] Russell banks in the 3 to tie it at 124 | 92 | 0.000615 | 0.999385 |
| 23 | Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player's OVERALL win shares for the current season. | 406 | 0.000845 | 0.999155 |
| ... | ... | ... | ... | ... |
| 15 | Charles Barkley: "I've been poor, I've been rich, I've been fat, I've been in the Hall of Fame, and one thing I can tell you is that the Clippers ... | 23341 | 0.999229 | 0.000771 |
| 38 | Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season | 443 | 0.999282 | 0.000718 |
| 75 | [Stein] The Bucks' too-long-to-list-it-all injury report tonight against Charlotte includes no Giannis Antetokounmpo (left knee soreness) or Jrue ... | 43 | 0.999286 | 0.000714 |
| 40 | Bucks missing all five starters against Hornets | 79 | 0.999449 | 0.000551 |
| 93 | China’s Forced-Labor Backlash Threatens to Put N.B.A. in Unwanted Spotlight | 174 | 0.999517 | 0.000483 |

100 rows × 4 columns

I think we were querying the API too quickly, so the responses started timing out, but you get the idea.
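One way to soften the timeout problem would be to retry failed calls with an increasing delay. This is only a sketch: classify_with_retry and its parameters are names I made up, it assumes a failed call comes back as a dict with an 'error' key (which is what the check in classification_single_body suggests), and the flaky function stands in for the real API.

```python
import time

def classify_with_retry(classify, sentence, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call classify(sentence); on an error response, wait and retry with exponential backoff."""
    for attempt in range(retries + 1):
        result = classify(sentence)
        if not (isinstance(result, dict) and "error" in result):
            return result
        if attempt < retries:
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None  # every attempt returned an error

# Fake classifier that fails twice before succeeding, for demonstration only
calls = {"n": 0}
def flaky(sentence):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"error": "rate limited"}
    return [[{"label": "NEGATIVE", "score": 0.1}, {"label": "POSITIVE", "score": 0.9}]]

delays = []  # record the requested delays instead of actually sleeping
result = classify_with_retry(flaky, "some title", sleep=delays.append)
print(delays, result)
```

In the real pipeline the call would look something like classify_with_retry(client.text_classification, title), with the default time.sleep doing the waiting.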

comment_df = pd.concat([
    comment_df, classification_multiple_bodies_partial(comment_df['body'].to_list())
], axis=1)

Using spacy for sentiment analysis

submission_df['title_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['title']))]
submission_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['body']))]
comment_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(comment_df['body']))]

Here’s a simple title to score: Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”

submission_df[['title', 'score', 'title_sentiment']].sort_values("title_sentiment")
|    | title | score | title_sentiment |
|----|-------|-------|-----------------|
| 99 | The Mavs will play 3 back-to-backs over a 7 game span to start April. Over April and May, 62% of their games will be part of a b2b | 15 | -0.400000 |
| 83 | [Post Game Thread] The Los Angeles Clippers (35-18) defeat the Phoenix Suns (36-15), 113 - 103 | 727 | -0.400000 |
| 43 | [Post Game Thread] The Boston Celtics (27-26) defeat the Minnesota Timberwolves (13-40) in OT, 145 - 136 | 49 | -0.400000 |
| 91 | [Post Game Thread] The Dallas Mavericks (29-22) defeat the Milwaukee Bucks (32-19), 116 - 101 | 754 | -0.400000 |
| 37 | The Denver Nuggets came onto the floor for their game against the Spurs with "X Gon' Give it to Ya" playing in the background | 88 | -0.400000 |
| ... | ... | ... | ... |
| 19 | [Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h... | 241 | 0.505556 |
| 18 | Steve Kerr on leaving the Warriors: “I have a great job right now. I love coaching the Warriors, so I'm not going anywhere.” | 465 | 0.528571 |
| 84 | [Highlight] Cody Zeller perfectly blocks Sam Merrill's layup off the backboard | 15 | 1.000000 |
| 8 | Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.” | 1610 | 1.000000 |
| 12 | [Thinking Basketball] The 10 Best NBA peaks since 1977 | 1346 | 1.000000 |

100 rows × 3 columns

I want to point out one comment, “Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺”, which gets a negative sentiment, probably because of the words “off” and “worst”. The sentence itself is more positive, because it’s about a player performing very well.

comment_df[['body', 'score', 'body_sentiment']].sort_values("body_sentiment")
|    | body | score | body_sentiment |
|----|------|-------|----------------|
| 2480 | he has some of the worst luck with injuries. | 591 | -1.0 |
| 118 | I tea bagged your fucking drum set!!! | 3 | -1.0 |
| 2081 | RIP to the insane plus/minus of the Spurs bench | 71 | -1.0 |
| 1379 | Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺 | 1 | -1.0 |
| 1287 | fucking disgusting | 1 | -1.0 |
| ... | ... | ... | ... |
| 2270 | Perfect.... boost his confidence, while we continue to tank | 5 | 1.0 |
| 273 | It’s almost like he’s one of the best point guards of all time! | 2 | 1.0 |
| 31 | Best scorer on the Bulls since MJ | 120 | 1.0 |
| 1632 | Remember when DSJ was like the mavs best player? What a time | 1 | 1.0 |
| 436 | I will zag and point out another thing here. KD doesn't want to outright say Steph is the greatest shooter ever. He needs to add Klay to this stat... | -1 | 1.0 |

3200 rows × 3 columns
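To get a feel for why single words can swing these scores so hard, here is a toy word-lexicon scorer. As far as I understand, the polarity that spacytextblob exposes comes from TextBlob's lexicon-based analyzer, but the sketch below is not that implementation. The tiny lexicon and its values are invented for illustration.

```python
import re

# Hypothetical mini-lexicon; real lexicons are far larger and the values differ
TOY_LEXICON = {"worst": -1.0, "best": 1.0, "disgusting": -1.0, "perfect": 1.0}

def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of known words; unknown words are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[word] for word in words if word in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(toy_polarity("Goes off in OT against the worst team in the league"))  # -1.0
print(toy_polarity("Best scorer on the Bulls since MJ"))  # 1.0
```

A scorer like this has no idea who the “worst team” remark is praising, so one strongly negative word decides the whole sentence.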

Closing remarks

Thanks to praw, it was really easy to pull and gather raw data. On top of that, the wealth of available NLP software makes it really easy to apply these models to whatever context you want.

To really take this further, an important middle step would be data cleaning (correcting typos, slang, and abbreviations), and maybe filters or named entity resolution to look for specific players. Maybe you want to find some way to add weights to highly up-voted submissions and comments, or maybe you want some way to combine the sentiments from both submissions and comments. Lastly, the big caveat in NLP for Reddit is using a language model sophisticated enough to capture the sarcasm, nuance, and toxicity that is the Reddit community (and specifically r/nba).
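As one possible starting point for the up-vote weighting idea, here is a sketch of a score-weighted average sentiment. The function name, the choice to give zero weight to comments with a non-positive score, and the sample numbers are all assumptions of mine.

```python
def weighted_sentiment(sentiments, scores):
    """Average sentiments weighted by up-vote score; non-positive scores get zero weight."""
    weights = [max(score, 0) for score in scores]
    total = sum(weights)
    if total == 0:
        return None  # nothing to weight by
    return sum(s * w for s, w in zip(sentiments, weights)) / total

# Made-up values: two positive comments and one negative comment
sentiments = [1.0, -1.0, 1.0]
scores = [120, 59, -1]
print(weighted_sentiment(sentiments, scores))
```

With dataframes like the ones above, the inputs could come from comment_df['body_sentiment'] and comment_df['score'].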