Scraping Reddit, part 2
The last post dealt with using pushshift and plain requests to access posts and comments from Reddit. This post uses praw, the Python Reddit API Wrapper, to access posts and comments from Reddit, and then applies some NLP tools for basic sentiment analysis.
Setting up an application to use praw with OAuth takes some work, but it is straightforward enough for anyone who’s just running this as a script.
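For reference, the `praw.Reddit("bot1")` call used later reads a site section named `bot1` from a `praw.ini` file. Below is a minimal sketch of what that section might look like for a script-type app, rendered with Python's `configparser` so the layout is explicit. Every value is a placeholder, and the exact keys you need depend on how your app is registered.

```python
import configparser
import io

# Placeholder credentials: substitute the values from your own Reddit app page.
config = configparser.ConfigParser()
config["bot1"] = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "user_agent": "script:nba-sentiment:v0.1 (by u/YOUR_USERNAME)",
}

# Render the section as it would appear inside praw.ini.
buffer = io.StringIO()
config.write(buffer)
praw_ini_text = buffer.getvalue()
print(praw_ini_text)
```

Keeping credentials in `praw.ini` rather than in the script means the notebook can be shared without leaking secrets.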
After setting up the praw application, we can build up a small pipeline:
- Use praw to download posts and comments from r/nba
- Format them into a dataframe
- Use huggingface and spacy for sentiment analysis
from dataclasses import dataclass
import itertools as it
from functools import reduce, partial
import datetime as dt
import pandas as pd
pd.set_option('display.max_colwidth', 150)
import praw
from praw.models import MoreComments
import matplotlib.pyplot as plt
import hfapi
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load("en_core_web_sm")
spacy_text_blob = SpacyTextBlob()
nlp.add_pipe(spacy_text_blob)
client = hfapi.Client()
reddit = praw.Reddit("bot1") # Pulls from praw.ini file
rnba = reddit.subreddit('nba')
Compiling praw objects into a dataframe
@dataclass
class RedditSubmission:
    title: str
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'title': self.title,
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_submission(
        cls,
        praw_submission: praw.models.Submission
    ):
        return cls(
            praw_submission.title,
            praw_submission.selftext,
            praw_submission.permalink,
            praw_submission.author,
            praw_submission.score,
            dt.datetime.fromtimestamp(praw_submission.created_utc)
        )
@dataclass
class RedditComment:
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_comment(
        cls,
        praw_comment: praw.models.Comment
    ):
        return cls(
            praw_comment.body,
            praw_comment.permalink,
            praw_comment.author,
            praw_comment.score,
            dt.datetime.fromtimestamp(praw_comment.created_utc)
        )
def process_submission_from_praw(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        yield RedditSubmission.from_praw_submission(praw_submission)

def process_comment_from_praw_submission(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        for praw_comment in praw_submission.comments:
            if isinstance(praw_comment, MoreComments):
                continue
            else:
                yield RedditComment.from_praw_comment(praw_comment)
praw_submission_generator1 = rnba.hot(limit=100)
praw_submission_generator2 = rnba.hot(limit=100)
submissions = process_submission_from_praw(praw_submission_generator1)
comments = process_comment_from_praw_submission(praw_submission_generator2)
submission_df = pd.DataFrame(a.to_dict() for a in submissions)
comment_df = pd.DataFrame(a.to_dict() for a in comments)
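Note that the loop above visits only top-level comments and skips `MoreComments` placeholders. If replies are wanted too, PRAW's comment forest offers `replace_more()` and `list()`. Here is a hedged sketch; the helper name `iter_all_comments` is made up for this post and is not part of the pipeline above.

```python
def iter_all_comments(praw_submission):
    """Yield every loaded comment of a submission, replies included.

    Hypothetical helper, not part of the original pipeline.
    """
    # limit=0 removes all MoreComments placeholders without fetching them;
    # pass limit=None instead to fetch everything (slow on large threads).
    praw_submission.comments.replace_more(limit=0)
    # list() flattens the comment tree into a plain list.
    for praw_comment in praw_submission.comments.list():
        yield praw_comment
```

Each yielded comment could then be passed to `RedditComment.from_praw_comment` exactly as before, at the cost of a much larger dataframe.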
Using huggingface for sentiment analysis
Specifically, using the huggingface API through the hfapi client
def classification_single_body(client, sentence):
    classification = client.text_classification(sentence)
    if 'error' in classification:
        return None, None
    neg_sentiment, pos_sentiment = classification[0]
    return neg_sentiment['score'], pos_sentiment['score']

def classification_multiple_body(client, bunch_of_sentences, colnames=None):
    if colnames is None:
        colnames = ['negative_score', 'positive_score']
    df = pd.DataFrame(
        map(lambda x: classification_single_body(client, x), bunch_of_sentences),
        columns=colnames
    )
    return df
client = hfapi.Client()
classification_multiple_bodies_partial = partial(classification_multiple_body, client)
submission_df = pd.concat([
submission_df, classification_multiple_bodies_partial(submission_df['title'].to_list())
], axis=1)
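The positional unpacking in `classification_single_body` relies on the negative label always arriving first. A slightly more defensive sketch keys on the labels instead. It assumes each entry is a dict with `label` and `score` keys and that the labels are `NEGATIVE` and `POSITIVE`; both are assumptions about the response, so check them against what the client actually returns.

```python
def scores_by_label(label_scores):
    """Return (negative_score, positive_score) from a list of label dicts.

    Assumed input shape: [{"label": "NEGATIVE", "score": 0.1}, ...].
    Missing labels come back as None.
    """
    lookup = {entry["label"].upper(): entry["score"] for entry in label_scores}
    return lookup.get("NEGATIVE"), lookup.get("POSITIVE")


# Made-up response, deliberately listed positive-first.
example = [
    {"label": "POSITIVE", "score": 0.25},
    {"label": "NEGATIVE", "score": 0.75},
]
print(scores_by_label(example))  # (0.75, 0.25)
```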
Scoring the submissions, here’s a title with an appropriately positive score: “Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player’s OVERALL win shares for the current season.”
Here’s a title that is scored as overwhelmingly negative but in reality is pretty positive: “Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season” (being even close to the 50-40-90 club is incredible).
submission_df.sort_values("negative_score")[['title', 'score', 'negative_score', 'positive_score']]
 | title | score | negative_score | positive_score |
---|---|---|---|---|
19 | [Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h... | 241 | 0.000185 | 0.999816 |
8 | Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.” | 1610 | 0.000185 | 0.999815 |
12 | [Thinking Basketball] The 10 Best NBA peaks since 1977 | 1346 | 0.000283 | 0.999717 |
25 | [Highlight] Russell banks in the 3 to tie it at 124 | 92 | 0.000615 | 0.999385 |
23 | Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player's OVERALL win shares for the current season. | 406 | 0.000845 | 0.999155 |
... | ... | ... | ... | ... |
15 | Charles Barkley: "I've been poor, I've been rich, I've been fat, I've been in the Hall of Fame, and one thing I can tell you is that the Clippers ... | 23341 | 0.999229 | 0.000771 |
38 | Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season | 443 | 0.999282 | 0.000718 |
75 | [Stein] The Bucks' too-long-to-list-it-all injury report tonight against Charlotte includes no Giannis Antetokounmpo (left knee soreness) or Jrue ... | 43 | 0.999286 | 0.000714 |
40 | Bucks missing all five starters against Hornets | 79 | 0.999449 | 0.000551 |
93 | China’s Forced-Labor Backlash Threatens to Put N.B.A. in Unwanted Spotlight | 174 | 0.999517 | 0.000483 |
100 rows × 4 columns
I think we were querying the API too quickly, so some of the responses started timing out, but you get the idea.
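If the failures really are rate-limit related, one option is to retry with a growing pause between attempts. This is a generic sketch, not tied to any particular client: `classify` stands for any callable that returns `(None, None)` on failure, as `classification_single_body` does, and the retry count and delays are arbitrary choices.

```python
import time


def classify_with_retry(classify, sentence, max_attempts=4, base_delay=1.0,
                        sleep=time.sleep):
    """Call classify(sentence), backing off exponentially after failures.

    A failure is assumed to be a (None, None) result. The `sleep` parameter
    exists so the pause can be swapped out in tests.
    """
    for attempt in range(max_attempts):
        result = classify(sentence)
        if result != (None, None):
            return result
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None, None
```

In this pipeline, `classify` could be `partial(classification_single_body, client)`, so the rest of the code stays unchanged.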
comment_df = pd.concat([
comment_df, classification_multiple_bodies_partial(comment_df['body'].to_list())
], axis=1)
Using spacy for sentiment analysis
submission_df['title_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['title']))]
submission_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['body']))]
comment_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(comment_df['body']))]
Here’s a simple title to score: “Kevin Durant: ‘Stephen Curry and Klay Thompson are the best shooters I’ve played with.’”
submission_df[['title', 'score', 'title_sentiment']].sort_values("title_sentiment")
 | title | score | title_sentiment |
---|---|---|---|
99 | The Mavs will play 3 back-to-backs over a 7 game span to start April. Over April and May, 62% of their games will be part of a b2b | 15 | -0.400000 |
83 | [Post Game Thread] The Los Angeles Clippers (35-18) defeat the Phoenix Suns (36-15), 113 - 103 | 727 | -0.400000 |
43 | [Post Game Thread] The Boston Celtics (27-26) defeat the Minnesota Timberwolves (13-40) in OT, 145 - 136 | 49 | -0.400000 |
91 | [Post Game Thread] The Dallas Mavericks (29-22) defeat the Milwaukee Bucks (32-19), 116 - 101 | 754 | -0.400000 |
37 | The Denver Nuggets came onto the floor for their game against the Spurs with "X Gon' Give it to Ya" playing in the background | 88 | -0.400000 |
... | ... | ... | ... |
19 | [Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h... | 241 | 0.505556 |
18 | Steve Kerr on leaving the Warriors: “I have a great job right now. I love coaching the Warriors, so I'm not going anywhere.” | 465 | 0.528571 |
84 | [Highlight] Cody Zeller perfectly blocks Sam Merrill's layup off the backboard | 15 | 1.000000 |
8 | Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.” | 1610 | 1.000000 |
12 | [Thinking Basketball] The 10 Best NBA peaks since 1977 | 1346 | 1.000000 |
100 rows × 3 columns
I want to point out one comment, “Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺”, which gets a negative sentiment, probably because of the words “off” and “worst”. The sentence itself is more positive, because it’s about a player performing very well.
comment_df[['body', 'score', 'body_sentiment']].sort_values("body_sentiment")
 | body | score | body_sentiment |
---|---|---|---|
2480 | he has some of the worst luck with injuries. | 591 | -1.0 |
118 | I tea bagged your fucking drum set!!! | 3 | -1.0 |
2081 | RIP to the insane plus/minus of the Spurs bench | 71 | -1.0 |
1379 | Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺 | 1 | -1.0 |
1287 | fucking disgusting | 1 | -1.0 |
... | ... | ... | ... |
2270 | Perfect.... boost his confidence, while we continue to tank | 5 | 1.0 |
273 | It’s almost like he’s one of the best point guards of all time! | 2 | 1.0 |
31 | Best scorer on the Bulls since MJ | 120 | 1.0 |
1632 | Remember when DSJ was like the mavs best player? What a time | 1 | 1.0 |
436 | I will zag and point out another thing here. KD doesn't want to outright say Steph is the greatest shooter ever. He needs to add Klay to this stat... | -1 | 1.0 |
3200 rows × 3 columns
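Scores of exactly -1.0 and 1.0 are typical of lexicon-based scoring, where individual words are looked up and their polarities aggregated with little regard for context (spacytextblob wraps TextBlob, whose default analyzer works from a word lexicon). The toy sketch below is not TextBlob's actual algorithm or lexicon; the three-word lexicon and its values are invented purely to show how one strongly polar word can decide the score for a whole sentence.

```python
# Invented mini-lexicon for illustration only.
TOY_LEXICON = {"worst": -1.0, "best": 1.0, "disgusting": -1.0}


def toy_polarity(text):
    """Average the polarity of the words found in TOY_LEXICON (0.0 if none)."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    hits = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


print(toy_polarity("Goes off in OT against the worst team in the league"))  # -1.0
print(toy_polarity("Best scorer on the Bulls since MJ"))  # 1.0
```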
Closing remarks
Thanks to praw, it was really easy to pull and gather raw data. On top of that, the abundance of off-the-shelf NLP tooling makes it easy to apply these models to whatever context you want.
To really take this further, an important middle step would be data cleaning (correcting typos, slang, and abbreviations), and maybe filters or named entity resolution to look for specific players. You might also want some way to give more weight to highly up-voted submissions and comments, or some way to combine the sentiments from both submissions and comments. Lastly, the big caveat in NLP for reddit is finding a language model sophisticated enough to capture the sarcasm, nuance, and toxicity of the reddit community (and of r/nba specifically).
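As one possible reading of the up-vote weighting idea, a score-weighted mean lets highly up-voted comments count for more. The sketch uses plain Python and invented numbers; clamping negative scores to zero is a design choice made here, not something taken from the analysis above.

```python
def weighted_sentiment(sentiments, scores):
    """Mean sentiment weighted by up-vote score (negative scores count as 0)."""
    weights = [max(score, 0) for score in scores]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(sentiments, weights)) / total


# Invented example: one popular negative comment outweighs two minor positive ones.
print(weighted_sentiment([-1.0, 1.0, 1.0], [8, 1, 1]))  # -0.6
```

With the dataframes from this post, the inputs could be `comment_df['body_sentiment']` and `comment_df['score']`.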