Scraping Reddit, part 1
In light of recent internet trends around retail investors, I’m sure many of us have questions about the kinds of content that get posted on Reddit, and whether there are home-grown, analytical ways of addressing those questions. I’ll be showing two ways of parsing Reddit submissions and comments; this one focuses on the pushshift API endpoints via the requests library, some custom classes for processing the responses, and asyncio to handle concurrent requests to pushshift.
This code ran quickly on my chromebook (dual-core, dual-thread, 1.90 GHz, 4 GB of memory), but querying lots of data from pushshift makes some of the final cells take ~10 minutes.
Note: at the time of putting this together, parts of pushshift appear to be down for repair/upgrade, but at least the GitHub repo is still online.
Raw notebook here, but I didn’t bother adding an environment – most of these packages are in the Python standard library or easily available on conda or pip.
import pandas as pd
import requests
import datetime as dt
import asyncio
import io
At its core, we are submitting queries to a URL and getting responses back. Technically speaking, this means we are submitting GET requests to pushshift endpoints.
The endpoint generally takes the form of something like “https://api.pushshift.io/reddit/search/submission”, with the “payload” (the params
kwarg to our request) being some set of search parameters, like a keyword, subreddit, or timestamp info; the pushshift API parameters are documented here. With this endpoint, we’re searching Reddit submissions (not comments).
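For example, a minimal keyword search might look like the sketch below; it assumes pushshift’s q parameter (full-text search term) and size parameter (cap on the number of results), and the specific values are just illustrative.
keyword_response = requests.get("https://api.pushshift.io/reddit/search/submission",
                                params={'q': 'gamestop',  # search term
                                        'size': 5})       # max number of results
keyword_response.status_code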
One of the simpler payloads could be searching a subreddit within a particular time window. This requires before and after timestamps, which can easily be handled with Python’s datetime library.
today = dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0).timestamp()
today_minus_seven = (dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0) -
dt.timedelta(days=7)).timestamp()
today_minus_eight = (dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0) -
dt.timedelta(days=8)).timestamp()
This is the actual GET request; observe the URL as the main argument and the various search parameters in the params kwarg.
reddit_response = requests.get("https://api.pushshift.io/reddit/search/submission",
params={'subreddit': 'stocks',
'before': int(today_minus_seven),
'after': int(today_minus_eight)})
reddit_response.status_code
200
There are a variety of ways to parse request responses, but here’s one way to pull the title and text out of the response to a Reddit submission GET request.
reddit_response.json()['data'][0]['title'], reddit_response.json()['data'][0]['selftext']
('Would it be wise to increase the geographical diversity of my portfolio?',
'Hello everyone, \n\nMy portfolio of 16 companies consists of 13 US stocks because they all seem to have some of the highest potential returns but in the midst of the pandemic I feel I should reallocate some resources towards European and UK stocks. Is anyone watching any interesting non-US stocks at the moment?')
As a little bit of dressing on top, we can grab a list of stock tickers. There are a lot of sources to pull tickers from (yfinance is a popular one), but we can also pull a list of tickers from the SEC.
ticker_response = requests.get("https://www.sec.gov/include/ticker.txt")
tickers = pd.read_csv(
io.StringIO(ticker_response.text),
delimiter='\t',
header=None,
usecols=[0],
)[0].to_list()
tickers[:5]
['aapl', 'msft', 'amzn', 'goog', 'tcehy']
import string
from typing import List, Union, Dict, Optional, Any
from collections import Counter
from requests import Response
from dataclasses import dataclass
We have all the raw information contained within the request response object, but for data processing purposes, we can define a class and some functions to simplify the work.
Key characteristics:
- A corresponding Python object property for each relevant property of a typical Reddit submission. Unfortunately, the score property from pushshift isn’t the most reliable because it’s only a snapshot from when the data were indexed.
- A summarize() method that uses collections.Counter to tally up how frequently a stock ticker appears.
- A to_dict() method for serialization and conversion for pandas.
- A from_response() classmethod to quickly instantiate a List[RedditSubmission] from a single response.
@dataclass
class RedditSubmission:
title: str
body: str
permalink: str
author: str
score: float
timestamp: dt.datetime
def summarize(self,
tickers: List[str],
weighted: bool = True
) -> Dict[str, Union[float, int]]:
""" Process RedditSubmission for tickers
Use a Counter to count the number of times a ticker occurs.
Include some corrections for punctuation
"""
if self.title is not None:
title_no_punctuation = self.title.translate(
str.maketrans('', '', string.punctuation)
)
tickers_title = Counter(
filter(lambda x: x in tickers, title_no_punctuation.split())
)
else:
tickers_title = Counter()
if self.body is not None:
body_no_punctuation = self.body.translate(
str.maketrans('', '', string.punctuation)
)
tickers_body = Counter(
filter(lambda x: x in tickers, body_no_punctuation.split())
)
else:
tickers_body = Counter()
total_tickers = tickers_title + tickers_body
return total_tickers
def to_dict(self):
return {
'title': self.title,
'body': self.body,
'permalink': self.permalink,
'author': self.author,
'score': self.score,
'timestamp': self.timestamp
}
@classmethod
def from_response(
cls,
resp_object: Response
) -> Optional[List[Any]]:
""" Create a list of RedditSubmission objects from response"""
if resp_object.status_code == 200:
processed_response = [
cls(
msg.get("title", None),
msg.get("body", None),
msg.get("permalink", None),
msg.get("author", None),
msg.get("score", None),
(
dt.datetime.fromtimestamp(msg['created_utc'])
if msg['created_utc'] is not None else None
)
) for msg in resp_object.json()['data']
]
return processed_response
else:
return None
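As a quick sanity check on the class, we can feed it the single response from earlier (reddit_response and tickers both come from the cells above); the exact output will depend on whatever was posted in that window.
single_submissions = RedditSubmission.from_response(reddit_response)
single_submissions[0].summarize(tickers)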
In reality, there’s a decently long wait after we make each GET request; building and processing the request itself is quick, so most of the time is spent idle waiting on the pushshift server. This makes it a good opportunity to use Python’s asyncio library.
asyncio allows for concurrency in a different manner than multiprocessing or multithreading: you can have many tasks in flight, but only one controls the CPU at a time, and each task gives up control whenever it’s not actively doing any work (like while waiting for a response from the pushshift server).
The overall syntax is very similar to writing any other Python function, except the function is declared async and awaits at the points where it yields control.
async def submission_request_coroutine(**kwargs):
await asyncio.sleep(5)
reddit_response = requests.get("https://api.pushshift.io/reddit/search/submission",
params=kwargs)
return reddit_response
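One caveat: requests.get is a blocking call, so while a request is actually in flight the coroutine still holds up the event loop; only the asyncio.sleep calls (which double as a polite delay between hits to pushshift) overlap. A minimal alternative sketch, assuming Python 3.9+ for asyncio.to_thread, pushes the blocking call onto a worker thread so the loop stays free:
async def submission_request_coroutine_threaded(**kwargs):
    await asyncio.sleep(5)  # polite delay between requests
    # run the blocking requests.get in a worker thread instead of on the event loop
    return await asyncio.to_thread(
        requests.get,
        "https://api.pushshift.io/reddit/search/submission",
        params=kwargs,
    )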
Define a range of timestamps, initialize an async coroutine for each adjacent pair of timestamps, then use asyncio to submit each request and gather the responses back together.
snapshots = pd.date_range(
start=dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(days=7),
end=dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(days=1),
freq='10min'
)
tasks = [
submission_request_coroutine(subreddit='stocks',
after=int(snapshot.timestamp()),
before=int(snapshots[i+1].timestamp()),
size=10
)
for i, snapshot in enumerate(snapshots[:-1])
]
all_submission_responses = await asyncio.gather(
*tasks
)
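The bare await above works because Jupyter/IPython already runs an event loop; in a plain Python script there is no running loop, so a rough equivalent would be to wrap the gather in a coroutine and hand it to asyncio.run:
async def gather_all(tasks):
    return await asyncio.gather(*tasks)

# in a script (not a notebook), you would run it like this:
# all_submission_responses = asyncio.run(gather_all(tasks))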
The result is a List[Response], which we can convert to a List[List[RedditSubmission]] and then flatten into a List[RedditSubmission] with itertools
import itertools as it
reddit_submissions = [*it.chain.from_iterable(
RedditSubmission.from_response(resp) for resp in all_submission_responses
if resp.status_code == 200
)]
We can get a ticker counter for each RedditSubmission, but we’d like to quickly aggregate them all into a single summary counter over all the Reddit submissions in our time window. This can be easily achieved with functools.reduce
from functools import reduce
from collections import Counter
def aggregate_dictionaries(d1, d2):
""" Given two dictionaries, aggregate key-value pairs """
if len(d1) == 0:
return dict(Counter(**d2).most_common())
my_counter = Counter(**d1)
my_counter.update(d2)
return dict(my_counter.most_common())
submissions_breakdown = reduce(
aggregate_dictionaries,
(submission.summarize(tickers) for submission in reddit_submissions)
)
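Since summarize() already returns Counter objects, an equivalent route (shown here just as a sketch of an alternative, using only what’s already imported) is to sum the counters directly:
summed_counts = sum(
    (submission.summarize(tickers) for submission in reddit_submissions),
    Counter()  # starting value so sum() combines Counters instead of ints
)
submissions_breakdown_alt = dict(summed_counts.most_common())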
It seems the list of tickers from the SEC was pretty generous ($A appears to be a ticker), but we can subselect for some of the recently trending tickers.
submissions_breakdown['gme'], submissions_breakdown['amc']
(18, 11)
submissions_breakdown
{'a': 322,
'on': 234,
'for': 181,
'it': 105,
'or': 76,
'be': 76,
'next': 71,
'are': 62,
'new': 56,
'good': 54,
'now': 53,
'can': 52,
'all': 49,
'at': 45,
'out': 40,
'amp': 34,
'an': 33,
'by': 31,
'go': 30,
'has': 26,
'am': 24,
'any': 22,
'when': 21,
'best': 20,
'vs': 20,
'one': 19,
'so': 18,
'gme': 18,
'big': 17,
'free': 15,
'play': 13,
'apps': 13,
'amc': 11,
'cash': 10,
'see': 10,
'find': 9,
'run': 8,
'rise': 7,
'else': 7,
'ever': 7,
'work': 6,
'real': 6,
'open': 6,
'wall': 5,
'fund': 5,
'post': 5,
'love': 5,
'well': 5,
'very': 5,
'ago': 5,
'info': 5,
'plan': 5,
'pay': 5,
'bit': 5,
'ride': 4,
'life': 4,
'huge': 4,
'low': 4,
'nok': 4,
'grow': 4,
'cap': 4,
'link': 3,
'safe': 3,
'plus': 3,
'fast': 3,
'stay': 3,
'tech': 3,
'fun': 3,
'he': 3,
'step': 3,
'turn': 3,
'live': 3,
'site': 3,
'ways': 3,
'hear': 2,
'teva': 2,
'bb': 2,
'co': 2,
'boom': 2,
'nice': 2,
'mass': 2,
'peak': 2,
'max': 2,
'wash': 2,
'pump': 2,
'tell': 2,
'fly': 2,
'pros': 2,
'rock': 1,
'both': 1,
'gt': 1,
'loan': 1,
'nga': 1,
'invu': 1,
'most': 1,
'ofc': 1,
'nio': 1,
'spot': 1,
'min': 1,
'onto': 1,
'evfm': 1,
'blue': 1,
'nat': 1,
'pure': 1,
'sign': 1,
'man': 1,
'st': 1,
'de': 1,
'w': 1,
'trtc': 1,
'form': 1,
'hi': 1,
'joe': 1,
'true': 1,
'home': 1,
'vrs': 1,
'med': 1,
'sqz': 1,
'five': 1,
'ship': 1,
'trxc': 1,
'wish': 1,
're': 1,
'car': 1,
'nakd': 1,
'rkt': 1,
'flex': 1,
'pm': 1,
'ppl': 1,
'earn': 1,
'flow': 1,
'lscc': 1,
'peg': 1,
'two': 1,
'gain': 1,
'wow': 1,
'pro': 1,
'team': 1,
'fix': 1,
'fnko': 1,
'et': 1,
'al': 1,
'muh': 1,
'save': 1,
'gold': 1,
'beat': 1,
'vive': 1,
'u': 1,
'rh': 1,
'x': 1,
'vxrt': 1,
'mind': 1,
'ehth': 1,
'job': 1,
'road': 1,
'box': 1}
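Most of the heavy hitters above are ordinary English words that also happen to be valid tickers. As a rough, hand-rolled way to cut that noise (the stopword set below is just an illustrative choice of mine, not anything from the SEC list), we can drop those keys before reading the counts:
noise_words = {'a', 'on', 'for', 'it', 'or', 'be', 'are', 'new', 'good', 'now',
               'can', 'all', 'at', 'out', 'an', 'by', 'go', 'has', 'am', 'any'}
filtered_breakdown = {ticker: count
                      for ticker, count in submissions_breakdown.items()
                      if ticker not in noise_words}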
Lastly, even if we’re not interested in counting tickers, we can still boil all of the data down into a single dataframe
df = pd.DataFrame(a.to_dict() for a in reddit_submissions)
df
 | title | body | permalink | author | score | timestamp |
---|---|---|---|---|---|---|
0 | KSTR ETF "The nasdaq of china" | None | /r/stocks/comments/l664ce/kstr_etf_the_nasdaq_... | GioDesa | 1 | 2021-01-27 09:56:46 |
1 | Opinions/Projections on AMC? | None | /r/stocks/comments/l665a0/opinionsprojections_... | Double_jn_it | 1 | 2021-01-27 09:58:03 |
2 | GE, SPCE, & PLUG | None | /r/stocks/comments/l6668r/ge_spce_plug/ | _MeatLoafLover | 1 | 2021-01-27 09:59:21 |
3 | Reddit is under DDOS attack. Certain gaming re... | None | /r/stocks/comments/l66692/reddit_is_under_ddos... | theBacillus | 1 | 2021-01-27 09:59:22 |
4 | #GainStock | None | /r/stocks/comments/l66777/gainstock/ | lxPHENOMENONxl | 1 | 2021-01-27 10:00:19 |
... | ... | ... | ... | ... | ... | ... |
2338 | AN OPEN LETTER TO GAMESTOP CEO | None | /r/stocks/comments/l98k85/an_open_letter_to_ga... | Artuhan | 1 | 2021-01-31 03:55:51 |
2339 | AN OPEN LETTER TO GAMESTOP CEO | None | /r/stocks/comments/l98lai/an_open_letter_to_ga... | Artuhan | 1 | 2021-01-31 03:58:05 |
2340 | Thoughts on YOLO (AdvisorShares Pure Cannabis ... | None | /r/stocks/comments/l98nly/thoughts_on_yolo_adv... | ConfidentProgrammer1 | 1 | 2021-01-31 04:02:29 |
2341 | Daily advice | None | /r/stocks/comments/l98pic/daily_advice/ | Bukprotingas | 1 | 2021-01-31 04:06:24 |
2342 | AMC- Next stop? | None | /r/stocks/comments/l98pif/amc_next_stop/ | Hj-Fish | 1 | 2021-01-31 04:06:24 |
2343 rows × 6 columns
Next up
While we just built our own Reddit scraper from some fundamental Python libraries, there are more sophisticated APIs out there that do a better job of querying Reddit, like praw, and then we could try some other things like sentiment analysis.