Scraping Reddit, part 2
The last post covered using pushshift and plain HTTP requests to access posts and comments from Reddit. This post uses praw, the Python Reddit API Wrapper, to access posts and comments from Reddit, and then applies some NLP tools for basic sentiment analysis.
Setting up an application to use praw with OAuth takes a little work, but it is straightforward enough for anyone who's just using it as a script.
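As a rough sketch of what that setup produces: praw reads credentials from a praw.ini file with one section per site name, and `praw.Reddit("bot1")` looks up the `[bot1]` section. The snippet below uses only the standard library to check that such a section has the keys a read-only, script-type app generally needs. The sample values and the helper name `missing_praw_keys` are placeholders of mine, not part of praw.

```python
import configparser

# Keys a read-only, script-type Reddit app typically needs in praw.ini.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")

# Placeholder praw.ini contents; replace the values with your app's credentials.
SAMPLE_INI = """
[bot1]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
user_agent=script:nba-sentiment:v0.1 (by u/your_username)
"""


def missing_praw_keys(ini_text, site="bot1"):
    """Return the required keys that are absent or empty in the [site] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(site):
        return list(REQUIRED_KEYS)
    return [
        key for key in REQUIRED_KEYS
        if not parser.get(site, key, fallback="").strip()
    ]


print(missing_praw_keys(SAMPLE_INI))
```

An empty list means the section is complete enough to try `praw.Reddit("bot1")`.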
After setting up the praw application, we can build up a small pipeline:
- Use praw to download posts and comments from r/nba
- Format them into a dataframe
- Use huggingface and spacy for sentiment analysis
from dataclasses import dataclass
import itertools as it
from functools import reduce, partial
import datetime as dt
import pandas as pd
pd.set_option('display.max_colwidth', 150)
import praw
from praw.models import MoreComments
import matplotlib.pyplot as plt
import hfapi
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load("en_core_web_sm")
spacy_text_blob = SpacyTextBlob()
nlp.add_pipe(spacy_text_blob)
client = hfapi.Client()
reddit = praw.Reddit("bot1") # Pulls from praw.ini file
rnba = reddit.subreddit('nba')
@dataclass
class RedditSubmission:
    title: str
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'title': self.title,
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_submission(
        cls,
        praw_submission: praw.models.Submission
    ):
        # author is a praw Redditor object, or None if the account was deleted
        author = praw_submission.author
        return cls(
            praw_submission.title,
            praw_submission.selftext,
            praw_submission.permalink,
            str(author) if author is not None else "[deleted]",
            praw_submission.score,
            # created_utc is a Unix timestamp in UTC, so make the datetime tz-aware
            dt.datetime.fromtimestamp(praw_submission.created_utc, tz=dt.timezone.utc)
        )
@dataclass
class RedditComment:
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_comment(
        cls,
        praw_comment: praw.models.Comment
    ):
        # author is a praw Redditor object, or None if the account was deleted
        author = praw_comment.author
        return cls(
            praw_comment.body,
            praw_comment.permalink,
            str(author) if author is not None else "[deleted]",
            praw_comment.score,
            # created_utc is a Unix timestamp in UTC, so make the datetime tz-aware
            dt.datetime.fromtimestamp(praw_comment.created_utc, tz=dt.timezone.utc)
        )
def process_submission_from_praw(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        yield RedditSubmission.from_praw_submission(praw_submission)

def process_comment_from_praw_submission(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        for praw_comment in praw_submission.comments:
            if isinstance(praw_comment, MoreComments):
                continue
            else:
                yield RedditComment.from_praw_comment(praw_comment)
praw_submission_generator1 = rnba.hot(limit=100)
praw_submission_generator2 = rnba.hot(limit=100)
submissions = process_submission_from_praw(praw_submission_generator1)
comments = process_comment_from_praw_submission(praw_submission_generator2)
submission_df = pd.DataFrame(a.to_dict() for a in submissions)
comment_df = pd.DataFrame(a.to_dict() for a in comments)
Using huggingface for sentiment analysis
Specifically, this uses the huggingface inference API through the hfapi client.
def classification_single_body(client, sentence):
    classification = client.text_classification(sentence)
    # Failed requests come back with an 'error' key instead of scores
    if 'error' in classification:
        return None, None
    # Assumes the first result lists the negative label, then the positive one
    neg_sentiment, pos_sentiment = classification[0]
    return neg_sentiment['score'], pos_sentiment['score']

def classification_multiple_body(client, bunch_of_sentences, colnames=None):
    if colnames is None:
        colnames = ['negative_score', 'positive_score']
    df = pd.DataFrame(
        map(lambda x: classification_single_body(client, x), bunch_of_sentences),
        columns=colnames
    )
    return df
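One fragile spot above is that `classification[0]` is unpacked by position. A hedged alternative is to look the scores up by label instead. This sketch assumes the response looks like `[[{'label': 'NEGATIVE', 'score': ...}, {'label': 'POSITIVE', 'score': ...}]]`, which is the shape the code above implies. The label names and the helper `scores_by_label` are assumptions for illustration, so check them against what your model actually returns.

```python
def scores_by_label(classification, labels=("NEGATIVE", "POSITIVE")):
    """Return one score per requested label, or None where it is missing."""
    # Error responses are assumed to be dicts with an 'error' key.
    if isinstance(classification, dict) or not classification:
        return tuple(None for _ in labels)
    lookup = {item["label"].upper(): item["score"] for item in classification[0]}
    return tuple(lookup.get(label) for label in labels)


# Made-up response, listed in the opposite order to what positional unpacking expects.
example = [[{"label": "POSITIVE", "score": 0.9}, {"label": "NEGATIVE", "score": 0.1}]]
print(scores_by_label(example))  # (0.1, 0.9)
```

Because the lookup is keyed on the label, the result does not depend on the order in which the API lists the two classes.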
client = hfapi.Client()
classification_multiple_bodies_partial = partial(classification_multiple_body, client)
submission_df = pd.concat([
submission_df, classification_multiple_bodies_partial(submission_df['title'].to_list())
], axis=1)
After scoring the submissions, here's a title with an appropriately positive score: "Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player's OVERALL win shares for the current season."
Here's a title that is scored as incredibly negative but is in reality pretty positive: "Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season". Being even close to the 50-40-90 club is incredible.
submission_df.sort_values("negative_score")[['title', 'score', 'negative_score', 'positive_score']]
I think we were querying the API too quickly, so these responses started timing out, but you get the idea.
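If the timeouts really do come from querying too fast, one common mitigation is to retry with an increasing delay. The sketch below is generic: `with_retries`, its parameters, and the choice to treat an `'error'` key as a failed attempt are my assumptions, not part of hfapi. It also does not catch exceptions, so a client that raises on timeout would need a try/except around the call.

```python
import time


def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it returns a non-error result or attempts run out.

    A result counts as an error if it is a dict containing an 'error' key.
    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    result = None
    for attempt in range(max_attempts):
        result = call()
        if not (isinstance(result, dict) and "error" in result):
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return result  # still an error after the last attempt
```

In the pipeline above this would wrap the API call, for example `with_retries(lambda: client.text_classification(sentence))`.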
comment_df = pd.concat([
comment_df, classification_multiple_bodies_partial(comment_df['body'].to_list())
], axis=1)
submission_df['title_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['title']))]
submission_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['body']))]
comment_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(comment_df['body']))]
Here's a simple title to score: "Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”"
submission_df[['title', 'score', 'title_sentiment']].sort_values("title_sentiment")
I want to point out one comment: "Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺". It gets a negative sentiment, probably because of the words "off" and "worst", but the sentence itself is more positive because it's about a player performing very well.
comment_df[['body', 'score', 'body_sentiment']].sort_values("body_sentiment")
Closing remarks
Thanks to praw, it was really easy to pull and gather raw data. On top of that, the abundance of NLP libraries has made it really easy to apply these models to whatever context you want.
To really take this further, an important middle step would be data cleaning (correcting typos, slang, and abbreviations), and maybe filters or named entity resolution to look for specific players. Maybe you want to find some way to add weights to highly up-voted submissions/comments, or maybe you want some way to combine the sentiments from both submissions and comments. Lastly, the big caveat in NLP for reddit is that you need a language model sophisticated enough to capture the sarcasm, nuance, and toxicity of the reddit community (and specifically of r/nba).
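As one possible take on the weighting idea, here is a minimal sketch of a score-weighted average sentiment. The function name, the choice to clip negative scores to zero, and the sample numbers are all made up for illustration. The inputs correspond to the sentiment and `score` columns built above.

```python
def weighted_sentiment(sentiments, scores):
    """Average sentiment, weighting each item by its (non-negative) upvote score."""
    weights = [max(score, 0) for score in scores]  # downvoted items get zero weight
    total = sum(weights)
    if total == 0:
        return None  # nothing to weight by
    return sum(s * w for s, w in zip(sentiments, weights)) / total


# Made-up polarities and scores: the highly upvoted negative comment dominates.
print(weighted_sentiment([0.5, -0.5, 0.0], [10, 30, -5]))  # -0.25
```

With the dataframes from this post, something like `weighted_sentiment(comment_df['body_sentiment'], comment_df['score'])` should give a single number per batch of comments, though whether raw upvotes are the right weight is an open question.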