Scraping Reddit, part 1
In light of recent internet trends around retail investors, I’m sure many of us have questions about the kinds of content that get posted on Reddit, and whether there are home-grown, analytical ways of addressing those questions. I’ll be showing two ways of parsing Reddit submissions and comments; this one focuses on the pushshift API endpoints via the requests library, some custom classes for processing the responses, and asyncio to handle concurrent requests to pushshift.
This code ran quickly on my chromebook (dual-core, dual-thread, 1.90 GHz, 4 GB of memory), but querying lots of data from pushshift makes some of the final cells take ~10 minutes.
Note: at the time of putting this together, parts of pushshift appear to be down for repair/upgrade, but at least the GitHub repo is still online.
Raw notebook here, but I didn’t bother adding an environment – most of these packages are in the Python standard library or easily available on conda or pip.
import pandas as pd
import requests
import datetime as dt
import asyncio
import io
At its core, we are submitting queries to a URL and getting responses back. Technically speaking, this means we are submitting GET requests to pushshift endpoints.
The endpoint generally takes the form of something like “https://api.pushshift.io/reddit/search/submission”, with the “payload” (the params
kwarg to our request) being some set of search parameters, like a keyword, subreddit, or timestamp info; the pushshift API parameters are documented here. With this endpoint, we’re searching Reddit submissions (not comments).
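For example, a minimal keyword search might look like the sketch below; it assumes pushshift’s q parameter (full-text search term) and size parameter (cap on the number of results), and the specific values are just illustrative.
keyword_response = requests.get("https://api.pushshift.io/reddit/search/submission",
                                params={'q': 'gamestop',  # search term
                                        'size': 5})       # max number of results
keyword_response.status_code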
One of the simpler payloads could be searching a subreddit within a particular time window. This requires before and after timestamps, which can easily be handled with Python’s datetime library.
today = dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0).timestamp()
today_minus_seven = (dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0) -
dt.timedelta(days=7)).timestamp()
today_minus_eight = (dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0) -
dt.timedelta(days=8)).timestamp()
This is the actual GET request; observe the URL as the main argument and the various search parameters in the params kwarg.
reddit_response = requests.get("https://api.pushshift.io/reddit/search/submission",
params={'subreddit': 'stocks',
'before': int(today_minus_seven),
'after': int(today_minus_eight)})
reddit_response.status_code
200
There are a variety of ways to parse request responses, but here’s one way to pull the title and text out of the response to a Reddit submission GET request.
reddit_response.json()['data'][0]['title'], reddit_response.json()['data'][0]['selftext']
('Would it be wise to increase the geographical diversity of my portfolio?',
'Hello everyone, \n\nMy portfolio of 16 companies consists of 13 US stocks because they all seem to have some of the highest potential returns but in the midst of the pandemic I feel I should reallocate some resources towards European and UK stocks. Is anyone watching any interesting non-US stocks at the moment?')
As a little bit of dressing on top, we can grab a list of stock tickers. There are a lot of sources to pull tickers from (yfinance is a popular one), but we can also pull a list of tickers from the SEC.
ticker_response = requests.get("https://www.sec.gov/include/ticker.txt")
tickers = pd.read_csv(
io.StringIO(ticker_response.text),
delimiter='\t',
header=None,
usecols=[0],
)[0].to_list()
tickers[:5]
['aapl', 'msft', 'amzn', 'goog', 'tcehy']
import string
from typing import List, Union, Dict, Optional, Any
from collections import Counter
from requests import Response
from dataclasses import dataclass
We have all the raw information contained within the request response object, but for data processing purposes, we can define a class and some functions to simplify the work.
Key characteristics:
- A corresponding Python object property for each relevant property of a typical Reddit submission. Unfortunately, the score property from pushshift isn’t the most reliable because it’s only a snapshot from when the data were indexed.
- A summarize() method that uses collections.Counter to tally up how frequently a stock ticker appears.
- A to_dict() method for serialization and conversion for pandas.
- A from_response() classmethod to quickly instantiate a List[RedditSubmission] from a single response.
@dataclass
class RedditSubmission:
title: str
body: str
permalink: str
author: str
score: float
timestamp: dt.datetime
def summarize(self,
tickers: List[str],
weighted: bool = True
) -> Dict[str, Union[float, int]]:
""" Process RedditSubmission for tickers
Use a Counter to count the number of times a ticker occurs.
Include some corrections for punctuation
"""
if self.title is not None:
title_no_punctuation = self.title.translate(
str.maketrans('', '', string.punctuation)
)
tickers_title = Counter(
filter(lambda x: x in tickers, title_no_punctuation.split())
)
else:
tickers_title = Counter()
if self.body is not None:
body_no_punctuation = self.body.translate(
str.maketrans('', '', string.punctuation)
)
tickers_body = Counter(
filter(lambda x: x in tickers, body_no_punctuation.split())
)
else:
tickers_body = Counter()
total_tickers = tickers_title + tickers_body
return total_tickers
def to_dict(self):
return {
'title': self.title,
'body': self.body,
'permalink': self.permalink,
'author': self.author,
'score': self.score,
'timestamp': self.timestamp
}
@classmethod
def from_response(
cls,
resp_object: Response
) -> Optional[List[Any]]:
""" Create a list of RedditSubmission objects from response"""
if resp_object.status_code == 200:
processed_response = [
cls(
msg.get("title", None),
msg.get("body", None),
msg.get("permalink", None),
msg.get("author", None),
msg.get("score", None),
(
dt.datetime.fromtimestamp(msg['created_utc'])
if msg['created_utc'] is not None else None
)
) for msg in resp_object.json()['data']
]
return processed_response
else:
return None
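As a quick sanity check on the class, we can feed it the single response from earlier (reddit_response and tickers both come from the cells above); the exact output will depend on whatever was posted in that window.
single_submissions = RedditSubmission.from_response(reddit_response)
single_submissions[0].summarize(tickers)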
In reality, there’s a decently long wait after we make each GET request; building and processing the request itself is quick, so most of the time is spent idle waiting on the pushshift server. This makes it a good opportunity to use Python’s asyncio library.
asyncio allows for concurrency in a different manner than multiprocessing or multithreading: you can have many tasks in flight, but only one controls the CPU at a time, and each task gives up control whenever it’s not actively doing any work (like while waiting for a response from the pushshift server).
The overall syntax is very similar to writing any other Python function, except the function is declared async and awaits at the points where it yields control.
async def submission_request_coroutine(**kwargs):
await asyncio.sleep(5)
reddit_response = requests.get("https://api.pushshift.io/reddit/search/submission",
params=kwargs)
return reddit_response
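One caveat: requests.get is a blocking call, so while a request is actually in flight the coroutine still holds up the event loop; only the asyncio.sleep calls (which double as a polite delay between hits to pushshift) overlap. A minimal alternative sketch, assuming Python 3.9+ for asyncio.to_thread, pushes the blocking call onto a worker thread so the loop stays free:
async def submission_request_coroutine_threaded(**kwargs):
    await asyncio.sleep(5)  # polite delay between requests
    # run the blocking requests.get in a worker thread instead of on the event loop
    return await asyncio.to_thread(
        requests.get,
        "https://api.pushshift.io/reddit/search/submission",
        params=kwargs,
    )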
Define a range of timestamps, initialize an async coroutine for each adjacent pair of timestamps, then use asyncio to submit each request and gather the responses back together.
snapshots = pd.date_range(
start=dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(days=7),
end=dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(days=1),
freq='10min'
)
tasks = [
submission_request_coroutine(subreddit='stocks',
after=int(snapshot.timestamp()),
before=int(snapshots[i+1].timestamp()),
size=10
)
for i, snapshot in enumerate(snapshots[:-1])
]
all_submission_responses = await asyncio.gather(
*tasks
)
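The bare await above works because Jupyter/IPython already runs an event loop; in a plain Python script there is no running loop, so a rough equivalent would be to wrap the gather in a coroutine and hand it to asyncio.run:
async def gather_all(tasks):
    return await asyncio.gather(*tasks)

# in a script (not a notebook), you would run it like this:
# all_submission_responses = asyncio.run(gather_all(tasks))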
The result is a List[Response], which we can convert to a List[List[RedditSubmission]] and then flatten into a List[RedditSubmission] with itertools
import itertools as it
reddit_submissions = [*it.chain.from_iterable(
RedditSubmission.from_response(resp) for resp in all_submission_responses
if resp.status_code == 200
)]
We can get a ticker counter for each RedditSubmission, but we’d like to quickly aggregate them all into a single summary counter over all the Reddit submissions in our time window. This can be easily achieved with functools.reduce
from functools import reduce
from collections import Counter
def aggregate_dictionaries(d1, d2):
""" Given two dictionaries, aggregate key-value pairs """
if len(d1) == 0:
return dict(Counter(**d2).most_common())
my_counter = Counter(**d1)
my_counter.update(d2)
return dict(my_counter.most_common())
submissions_breakdown = reduce(
aggregate_dictionaries,
(submission.summarize(tickers) for submission in reddit_submissions)
)
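Since summarize() already returns Counter objects, an equivalent route (shown here just as a sketch of an alternative, using only what’s already imported) is to sum the counters directly:
summed_counts = sum(
    (submission.summarize(tickers) for submission in reddit_submissions),
    Counter()  # starting value so sum() combines Counters instead of ints
)
submissions_breakdown_alt = dict(summed_counts.most_common())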
It seems the list of tickers from the SEC was pretty generous ($A appears to be a ticker), but we can subselect for some of the recently trending tickers.
submissions_breakdown['gme'], submissions_breakdown['amc']
(18, 11)
submissions_breakdown
{'a': 322,
'on': 234,
'for': 181,
'it': 105,
'or': 76,
'be': 76,
'next': 71,
'are': 62,
'new': 56,
'good': 54,
'now': 53,
'can': 52,
'all': 49,
'at': 45,
'out': 40,
'amp': 34,
'an': 33,
'by': 31,
'go': 30,
'has': 26,
'am': 24,
'any': 22,
'when': 21,
'best': 20,
'vs': 20,
'one': 19,
'so': 18,
'gme': 18,
'big': 17,
'free': 15,
'play': 13,
'apps': 13,
'amc': 11,
'cash': 10,
'see': 10,
'find': 9,
'run': 8,
'rise': 7,
'else': 7,
'ever': 7,
'work': 6,
'real': 6,
'open': 6,
'wall': 5,
'fund': 5,
'post': 5,
'love': 5,
'well': 5,
'very': 5,
'ago': 5,
'info': 5,
'plan': 5,
'pay': 5,
'bit': 5,
'ride': 4,
'life': 4,
'huge': 4,
'low': 4,
'nok': 4,
'grow': 4,
'cap': 4,
'link': 3,
'safe': 3,
'plus': 3,
'fast': 3,
'stay': 3,
'tech': 3,
'fun': 3,
'he': 3,
'step': 3,
'turn': 3,
'live': 3,
'site': 3,
'ways': 3,
'hear': 2,
'teva': 2,
'bb': 2,
'co': 2,
'boom': 2,
'nice': 2,
'mass': 2,
'peak': 2,
'max': 2,
'wash': 2,
'pump': 2,
'tell': 2,
'fly': 2,
'pros': 2,
'rock': 1,
'both': 1,
'gt': 1,
'loan': 1,
'nga': 1,
'invu': 1,
'most': 1,
'ofc': 1,
'nio': 1,
'spot': 1,
'min': 1,
'onto': 1,
'evfm': 1,
'blue': 1,
'nat': 1,
'pure': 1,
'sign': 1,
'man': 1,
'st': 1,
'de': 1,
'w': 1,
'trtc': 1,
'form': 1,
'hi': 1,
'joe': 1,
'true': 1,
'home': 1,
'vrs': 1,
'med': 1,
'sqz': 1,
'five': 1,
'ship': 1,
'trxc': 1,
'wish': 1,
're': 1,
'car': 1,
'nakd': 1,
'rkt': 1,
'flex': 1,
'pm': 1,
'ppl': 1,
'earn': 1,
'flow': 1,
'lscc': 1,
'peg': 1,
'two': 1,
'gain': 1,
'wow': 1,
'pro': 1,
'team': 1,
'fix': 1,
'fnko': 1,
'et': 1,
'al': 1,
'muh': 1,
'save': 1,
'gold': 1,
'beat': 1,
'vive': 1,
'u': 1,
'rh': 1,
'x': 1,
'vxrt': 1,
'mind': 1,
'ehth': 1,
'job': 1,
'road': 1,
'box': 1}
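Most of the heavy hitters above are ordinary English words that also happen to be valid tickers. As a rough, hand-rolled way to cut that noise (the stopword set below is just an illustrative choice of mine, not anything from the SEC list), we can drop those keys before reading the counts:
noise_words = {'a', 'on', 'for', 'it', 'or', 'be', 'are', 'new', 'good', 'now',
               'can', 'all', 'at', 'out', 'an', 'by', 'go', 'has', 'am', 'any'}
filtered_breakdown = {ticker: count
                      for ticker, count in submissions_breakdown.items()
                      if ticker not in noise_words}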
Lastly, even if we’re not interested in counting tickers, we can still boil all of the data down into a single dataframe
df = pd.DataFrame(a.to_dict() for a in reddit_submissions)
df
 | title | body | permalink | author | score | timestamp |
---|---|---|---|---|---|---|
0 | KSTR ETF "The nasdaq of china" | None | /r/stocks/comments/l664ce/kstr_etf_the_nasdaq_... | GioDesa | 1 | 2021-01-27 09:56:46 |
1 | Opinions/Projections on AMC? | None | /r/stocks/comments/l665a0/opinionsprojections_... | Double_jn_it | 1 | 2021-01-27 09:58:03 |
2 | GE, SPCE, & PLUG | None | /r/stocks/comments/l6668r/ge_spce_plug/ | _MeatLoafLover | 1 | 2021-01-27 09:59:21 |
3 | Reddit is under DDOS attack. Certain gaming re... | None | /r/stocks/comments/l66692/reddit_is_under_ddos... | theBacillus | 1 | 2021-01-27 09:59:22 |
4 | #GainStock | None | /r/stocks/comments/l66777/gainstock/ | lxPHENOMENONxl | 1 | 2021-01-27 10:00:19 |
... | ... | ... | ... | ... | ... | ... |
2338 | AN OPEN LETTER TO GAMESTOP CEO | None | /r/stocks/comments/l98k85/an_open_letter_to_ga... | Artuhan | 1 | 2021-01-31 03:55:51 |
2339 | AN OPEN LETTER TO GAMESTOP CEO | None | /r/stocks/comments/l98lai/an_open_letter_to_ga... | Artuhan | 1 | 2021-01-31 03:58:05 |
2340 | Thoughts on YOLO (AdvisorShares Pure Cannabis ... | None | /r/stocks/comments/l98nly/thoughts_on_yolo_adv... | ConfidentProgrammer1 | 1 | 2021-01-31 04:02:29 |
2341 | Daily advice | None | /r/stocks/comments/l98pic/daily_advice/ | Bukprotingas | 1 | 2021-01-31 04:06:24 |
2342 | AMC- Next stop? | None | /r/stocks/comments/l98pif/amc_next_stop/ | Hj-Fish | 1 | 2021-01-31 04:06:24 |
2343 rows × 6 columns
Next up
While we just built our own Reddit scraper from some fundamental Python libraries, there are more sophisticated APIs out there that do a better job of querying Reddit, like praw, and then we could try some other things like sentiment analysis.