Scraping Reddit, part 2

Published: April 09, 2021

The last post dealt with using pushshift and handling requests to access posts and comments from Reddit. This post uses praw, the Python Reddit API Wrapper, to access posts and comments from Reddit, and then applies a couple of NLP tools for some basic sentiment analysis.

Setting up an application to use praw with OAuth takes a little work, but it is straightforward enough for anyone who’s just using it as a script.

After setting up the praw application, we can build up a small pipeline:

1. Use praw to download posts and comments from r/nba
2. Format them into a dataframe
3. Use huggingface and spacy for sentiment analysis

from dataclasses import dataclass
import itertools as it
from functools import reduce, partial
import datetime as dt

import pandas as pd
pd.set_option('display.max_colwidth', 150)
import praw
from praw.models import MoreComments
import matplotlib.pyplot as plt
import hfapi
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load("en_core_web_sm")
spacy_text_blob = SpacyTextBlob()
nlp.add_pipe(spacy_text_blob)

client = hfapi.Client()

reddit = praw.Reddit("bot1") # Pulls from praw.ini file
rnba = reddit.subreddit('nba')

Compiling praw objects into a dataframe

@dataclass
class RedditSubmission:
    title: str 
    body: str 
    permalink: str 
    author: str 
    score: float
    timestamp: dt.datetime
    
    def to_dict(self):
        return {
            'title': self.title,
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }
    
    @classmethod
    def from_praw_submission(
        cls,
        praw_submission: praw.models.Submission
    ):      
        return cls(
            praw_submission.title,
            praw_submission.selftext,
            praw_submission.permalink,
            praw_submission.author,
            praw_submission.score,
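            # created_utc is a Unix timestamp in UTC, so build a timezone-aware UTC datetime
            dt.datetime.fromtimestamp(praw_submission.created_utc, tz=dt.timezone.utc)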
        )
        
@dataclass
class RedditComment:                                                                  
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_comment(
        cls,
        praw_comment: praw.models.Comment
    ):
        return cls(
            praw_comment.body,
            praw_comment.permalink,
            praw_comment.author,
            praw_comment.score,
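            # created_utc is a Unix timestamp in UTC, so build a timezone-aware UTC datetime
            dt.datetime.fromtimestamp(praw_comment.created_utc, tz=dt.timezone.utc)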
        )

        
def process_submission_from_praw(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        yield RedditSubmission.from_praw_submission(praw_submission)
        
def process_comment_from_praw_submission(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        for praw_comment in praw_submission.comments:
            if isinstance(praw_comment, MoreComments):
                continue
            else:
                yield RedditComment.from_praw_comment(praw_comment)

praw_submission_generator1 = rnba.hot(limit=100)
praw_submission_generator2 = rnba.hot(limit=100)

submissions = process_submission_from_praw(praw_submission_generator1)
comments = process_comment_from_praw_submission(praw_submission_generator2)

submission_df = pd.DataFrame(a.to_dict() for a in submissions)
comment_df = pd.DataFrame(a.to_dict() for a in comments)

Using huggingface for sentiment analysis

Specifically, this uses the huggingface inference API through the hfapi client.

def classification_single_body(client, sentence):
    classification = client.text_classification(sentence)
    if 'error' in classification:
        return None, None
    neg_sentiment, pos_sentiment = classification[0]

    return neg_sentiment['score'], pos_sentiment['score']

def classification_multiple_body(client, bunch_of_sentences, colnames=None):
    if colnames is None:
        colnames = ['negative_score', 'positive_score']
    df = pd.DataFrame(
        map(lambda x: classification_single_body(client, x), bunch_of_sentences),
        columns=colnames
    )

    
    return df

client = hfapi.Client()
classification_multiple_bodies_partial = partial(classification_multiple_body, client)

submission_df = pd.concat([
    submission_df, classification_multiple_bodies_partial(submission_df['title'].to_list())
], axis=1)

After scoring the submissions, here’s a title with an appropriately positive score: “Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player’s OVERALL win shares for the current season.”

Here’s a title that is scored as incredibly negative but in reality is pretty positive: “Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season”. Being even close to the 50-40-90 club is incredible.

submission_df.sort_values("negative_score")[['title', 'score', 'negative_score', 'positive_score']]

	title	score	negative_score	positive_score
19	[Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h...	241	0.000185	0.999816
8	Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”	1610	0.000185	0.999815
12	[Thinking Basketball] The 10 Best NBA peaks since 1977	1346	0.000283	0.999717
25	[Highlight] Russell banks in the 3 to tie it at 124	92	0.000615	0.999385
23	Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player's OVERALL win shares for the current season.	406	0.000845	0.999155
...	...	...	...	...
15	Charles Barkley: "I've been poor, I've been rich, I've been fat, I've been in the Hall of Fame, and one thing I can tell you is that the Clippers ...	23341	0.999229	0.000771
38	Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season	443	0.999282	0.000718
75	[Stein] The Bucks' too-long-to-list-it-all injury report tonight against Charlotte includes no Giannis Antetokounmpo (left knee soreness) or Jrue ...	43	0.999286	0.000714
40	Bucks missing all five starters against Hornets	79	0.999449	0.000551
93	China’s Forced-Labor Backlash Threatens to Put N.B.A. in Unwanted Spotlight	174	0.999517	0.000483

100 rows × 4 columns

I think we were querying the API too quickly, so these responses started timing out, but you get the idea.

comment_df = pd.concat([
    comment_df, classification_multiple_bodies_partial(comment_df['body'].to_list())
], axis=1)
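Since the API calls above can time out when they are sent too quickly, one option is to retry each call with a growing pause in between. This is only a minimal sketch: `call_with_backoff`, `max_attempts`, `base_delay`, and `sleep` are hypothetical names of my own, not part of hfapi, and the failure check simply reuses the `'error' in classification` test from `classification_single_body`.

```python
import time


def call_with_backoff(func, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), pausing longer after each failed attempt.

    An attempt counts as failed when func raises, or when it returns a
    dict containing an 'error' key. Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            result = func(*args)
        except Exception:
            result = None  # treat any exception as a failed attempt
        else:
            if not (isinstance(result, dict) and 'error' in result):
                return result
        # Pause 1x, 2x, 4x, ... base_delay, but not after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Something like `call_with_backoff(client.text_classification, sentence)` could then stand in for the direct call, as long as the caller also handles a `None` result before unpacking the scores.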

Using spacy for sentiment analysis

submission_df['title_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['title']))]
submission_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['body']))]
comment_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(comment_df['body']))]

Here’s a title that is simple to score: “Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.””

submission_df[['title', 'score', 'title_sentiment']].sort_values("title_sentiment")

	title	score	title_sentiment
99	The Mavs will play 3 back-to-backs over a 7 game span to start April. Over April and May, 62% of their games will be part of a b2b	15	-0.400000
83	[Post Game Thread] The Los Angeles Clippers (35-18) defeat the Phoenix Suns (36-15), 113 - 103	727	-0.400000
43	[Post Game Thread] The Boston Celtics (27-26) defeat the Minnesota Timberwolves (13-40) in OT, 145 - 136	49	-0.400000
91	[Post Game Thread] The Dallas Mavericks (29-22) defeat the Milwaukee Bucks (32-19), 116 - 101	754	-0.400000
37	The Denver Nuggets came onto the floor for their game against the Spurs with "X Gon' Give it to Ya" playing in the background	88	-0.400000
...	...	...	...
19	[Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h...	241	0.505556
18	Steve Kerr on leaving the Warriors: “I have a great job right now. I love coaching the Warriors, so I'm not going anywhere.”	465	0.528571
84	[Highlight] Cody Zeller perfectly blocks Sam Merrill's layup off the backboard	15	1.000000
8	Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”	1610	1.000000
12	[Thinking Basketball] The 10 Best NBA peaks since 1977	1346	1.000000

100 rows × 3 columns

I want to point out one comment, “Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺”, which gets a negative sentiment, probably because of the words “off” and “worst”. The sentence itself is more positive, because it’s about a player performing very well.

comment_df[['body', 'score', 'body_sentiment']].sort_values("body_sentiment")

	body	score	body_sentiment
2480	he has some of the worst luck with injuries.	591	-1.0
118	I tea bagged your fucking drum set!!!	3	-1.0
2081	RIP to the insane plus/minus of the Spurs bench	71	-1.0
1379	Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺	1	-1.0
1287	fucking disgusting	1	-1.0
...	...	...	...
2270	Perfect.... boost his confidence, while we continue to tank	5	1.0
273	It’s almost like he’s one of the best point guards of all time!	2	1.0
31	Best scorer on the Bulls since MJ	120	1.0
1632	Remember when DSJ was like the mavs best player? What a time	1	1.0
436	I will zag and point out another thing here. KD doesn't want to outright say Steph is the greatest shooter ever. He needs to add Klay to this stat...	-1	1.0

3200 rows × 3 columns
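The -1.0 and 1.0 scores above are what I would expect from a lexicon-based scorer, where one strongly charged word like “worst” or “best” can decide the whole sentence. Here is a toy sketch of that idea. It assumes a scorer that just averages per-word polarities, which is only a rough approximation of what the real model does, and the lexicon and its values are made up for illustration.

```python
import re

# Toy lexicon: these words and values are invented for illustration only,
# they are not taken from TextBlob or any other real sentiment lexicon.
TOY_LEXICON = {"worst": -1.0, "best": 1.0, "disgusting": -1.0, "perfect": 1.0}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of the words in text that appear in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    # Text with no known words is treated as neutral
    return sum(hits) / len(hits) if hits else 0.0
```

With this toy scorer, any sentence whose only known word is “worst” comes out at -1.0 regardless of context, which is the same failure mode as the “Goes off” comment.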

Closing remarks

Thanks to praw, it was really easy to pull and gather raw data. On top of that, the plethora of NLP software now available makes it really easy to apply these models to whatever context you want.

To really take this further, an important middle step would be data cleaning (handling typos, slang, and abbreviations), and maybe filters or named entity resolution to look for specific players. Maybe you want to find some way to add weights to highly up-voted submissions and comments, or maybe you want some way to combine the sentiments from both submissions and comments. Lastly, the big caveat in NLP for reddit is that you need a language model sophisticated enough to capture the sarcasm, nuance, and toxicity that is the reddit community (and specifically r/nba).
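As one possible way to weight by up-votes, here is a minimal sketch of an up-vote-weighted average sentiment. The function name `weighted_sentiment` and the choice to clip negative scores to zero are my own assumptions, not something done earlier in this post.

```python
def weighted_sentiment(sentiments, scores):
    """Up-vote-weighted mean of sentiment values.

    Weights are the Reddit scores clipped at zero, so down-voted items
    contribute nothing instead of flipping the sign of the average.
    Returns None when no item has a positive score.
    """
    weights = [max(s, 0) for s in scores]
    total = sum(weights)
    if total == 0:
        return None
    return sum(p * w for p, w in zip(sentiments, weights)) / total
```

This could be called as `weighted_sentiment(comment_df['body_sentiment'], comment_df['score'])` to get a single number for a set of comments. Clipping is just one choice; something like a log of the score might be less dominated by a handful of very popular comments.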

Alex H. Yang
