Here's a somewhat practical example: getting a sense of your local housing market using some mostly-standard Python libraries. I'm particularly interested in how overpriced the homes in my area are (relative to their tax-assessed values), but you could imagine adapting this code or running periodic data extractions to summarize trends in your own area.

Brief technical summary

To anyone who might try to use this code:

The rest of the code involves parallelizing and then summarizing/cleaning all of the redfin requests -- it might be slightly over-engineered and overly specific to what I'm trying to do.

import requests
import json
import itertools as it
import io

import pandas as pd
from dask import delayed
from dask.distributed import Client, as_completed
from redfin import Redfin

Define some 'constants' that will help with making requests and with filtering down to the columns we want

request_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding':'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Chrome/92.0.4515.130'
}
relevant_columns = [
    'ADDRESS', 'CITY', 'STATE OR PROVINCE', 'ZIP OR POSTAL CODE', 'PRICE', 'tax_assessed_value', 'overpriced'
]

Because there are a lot of houses and each request can take a while, I'm going to use dask to parallelize many requests.

Initially, I used concurrent.futures, but I wanted a nice visualizer UI to see the progress of my tasks, which dask's dashboard provides. The dask parallelization also seemed slightly faster, though I didn't benchmark it carefully.
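
For reference, here's a minimal sketch of roughly what the concurrent.futures version looked like (illustrative only, not the exact code I ran; it reuses the query_redfin_dask function defined further below):

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed as cf_as_completed

def query_all_with_threads(addresses, session, redfinclient, headers):
    """ Thread-pool equivalent of the dask version below -- no dashboard, though """
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(query_redfin_dask, session, redfinclient, address, headers=headers)
            for address in addresses
        ]
        # Collect results as they finish; progress is invisible unless you log it yourself
        return [future.result() for future in cf_as_completed(futures)]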

daskclient = Client()
redfinclient = Redfin()
daskclient

Client-77c8f4fb-fe15-11eb-a8f9-00163e49e059
Connection method: Cluster object; cluster type: LocalCluster
Dashboard: http://127.0.0.1:8787/status

Define the functions I'll be using

def stitch_full_address(row):
    """ Given a house record from redfin, generate a full address for future querying """
    return f"{row['ADDRESS']}, {row['CITY']} {row['STATE OR PROVINCE']}"
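
# Quick usage example (made-up row, just to show the output format):
#   stitch_full_address({'ADDRESS': '123 Main St', 'CITY': 'Springfield',
#                        'STATE OR PROVINCE': 'VA'})
#   -> '123 Main St, Springfield VA'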


def process_redfin_response(response, redfinclient):
    """ Given a response from redfin API, return the tax-assessed value
    
    Notes
    -----
    This can get messy because this response is deeply-nested JSON, and 
    there are many chances for us to fail at pulling tax values.
    In all the places where things can go wrong, I do a very sloppy check and 
    then return -1 if something broke
    """
    if response.status_code != 200:
        return -1
    else:
        resp_dict = json.loads(response.text[4:])
        if (
            (resp_dict.get('errorMessage', None) == 'Success')
            and ('exactMatch' in resp_dict['payload'])
        ):
            # Pull property metadata
            url = resp_dict['payload']['exactMatch']['url']
            
            data = redfinclient.initial_info(url)['payload']
            if data['responseCode'] != 200:
                return -1
            property_id = data['propertyId']
            info = redfinclient.below_the_fold(property_id)
            # Pull latest tax-assessed value
            if len(info['payload']['publicRecordsInfo']['allTaxInfo']) > 0:
                tax_assessment = (
                    pd.DataFrame(info['payload']['publicRecordsInfo']['allTaxInfo'])
                    .sort_values("rollYear", ascending=False)
                ).iloc[0]
                return tax_assessment.get('taxableLandValue', 0) + tax_assessment.get('taxableImprovementValue', 0)
            else:
                return -1
        else:
            return -1
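
# For orientation, the autocomplete response parsed above looks roughly like
# this (a hand-abbreviated sketch from my own runs, not an official schema):
#   {"errorMessage": "Success",
#    "payload": {"exactMatch": {"url": "/VA/Some-City/123-Main-St/home/0000000"},
#                ...}}
# and below_the_fold(...) nests the tax history under
#   payload -> publicRecordsInfo -> allTaxInfo
# as a list of records like
#   {"rollYear": 2021, "taxableLandValue": 190000, "taxableImprovementValue": 350000}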
        

def query_redfin_dask(session, redfinclient, address, headers=None, **kwargs):
    """ For a given address, query redfin and identify tax-assessed value
    
    This is the function we submit to the dask client
    """
    response = session.get(
        'https://redfin.com/stingray/do/location-autocomplete',
        params={
            'location': address,
            'v': 2,
            **kwargs
        },
        headers=headers
    )
    return {address: process_redfin_response(response, redfinclient)}


def compile_results(results):
    """ Aggregate the results from all the redfin requests into a single series
    
    Take a list of dictionaries (from the dask future objects), flatten them into one 
    dictionary, then turn into a pandas series
    """
    compiled = pd.Series(dict(it.chain.from_iterable(a.items() for a in results)), name='tax_assessed_value')
    
    return compiled
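
As a tiny sanity check of what compile_results does (toy inputs, not real listings):

compile_results([{'1 Example Rd, Faketown VA': 100000}, {'2 Example Rd, Faketown VA': -1}])
# 1 Example Rd, Faketown VA    100000
# 2 Example Rd, Faketown VA        -1
# Name: tax_assessed_value, dtype: int64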

Download a list of all home listings from redfin based on search parameters, and do some light cleaning to generate full addresses. There might be good documentation on making this query somewhere, but there are definitely a few parameters I don't understand and only kept because they showed up when I searched on redfin.com and reverse-engineered the URL.

with requests.Session() as session:
    params = {
        'al': 1,
        'hoa': 150,
        'market': 'dc',
        'max_listing_approx_size': 3000,
        'min_listing_approx_size': 1700,
        'max_num_beds': 4,
        'max_price': 800_000,
        'num_baths': 2,
        'num_beds': 2,
        'num_homes': 450, 
        'page_number': 1,
        'region_id': 2965,
        'region_type': 5,
        'sf': '1,2,3,5,6,7',
        'status': 9,
        'uipt': '1,2,3,4,5,6,7,8',
        'v': 8
    }
    download = session.get(
        'https://www.redfin.com/stingray/api/gis-csv', params=params,
        headers=request_headers
    )
df = (
    pd.read_csv(io.StringIO(download.content.decode("utf-8")))
    .assign(full_address=lambda x: x.apply(stitch_full_address, axis=1))
)

For each home listing, download more detailed information on the taxable land value and taxable improvement value. There's certainly a lot of other information you could pull from these queries, but I'll leave that as an exercise for the reader.

This is technically broken into a few dask steps: scattering the requests.Session and redfin.Redfin objects to the workers, then making a bunch of queries (as dask.delayed objects) to redfin, one per address.

with requests.Session() as session:
    session_scattered = daskclient.scatter(session)
    redfinclient_scattered = daskclient.scatter(redfinclient)
    futures = [
        delayed(query_redfin_dask)(session_scattered, redfinclient_scattered, address, headers=request_headers)
        for address in df['full_address'].to_list()
    ]
    futures = daskclient.compute(futures) 
    completed_results = [result for _future, result in as_completed(futures, raise_errors=False, with_results=True)]
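
Since raise_errors=False hands back the exception object as the "result" of any failed future, a slightly more defensive version of that last line could separate successes from failures (a sketch using the same variables as above):

completed_results = []
for future, result in as_completed(futures, raise_errors=False, with_results=True):
    if future.status == 'error':
        # result is the exception here, not an {address: value} dict
        print(f'request failed: {result!r}')
    else:
        completed_results.append(result)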

Join the tax-assessed values back to the original dataframe and compute how overpriced each property is

processed_df = (
    df.merge(compile_results(completed_results), left_on='full_address', right_index=True)
    .assign(overpriced=lambda x: x['PRICE'] - x['tax_assessed_value'])
)

Examine the results (some identifying information is masked)

In this particular example, some homes have really low tax-assessed values relative to the asking price. I think this is because these homes are so new (built this year) that there's no taxableImprovementValue yet, only taxableLandValue. As a result, these newly-built properties can appear extremely overpriced.
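
One way to flag those suspected new builds before summarizing (the 40% threshold here is my own rough guess, not something derived from the data):

suspected_new_builds = processed_df.query(
    'tax_assessed_value > 0 and tax_assessed_value < 0.4 * PRICE'
)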

(
    processed_df
    .query('tax_assessed_value > 0')
    [relevant_columns]
    .sort_values('overpriced')
    .assign(**{
        'ADDRESS': 'SORRY',
        'CITY': 'I AM',
        'STATE OR PROVINCE': "MASKING",
        'ZIP OR POSTAL CODE': "THESE COLUMNS"
    })
)
ADDRESS CITY STATE OR PROVINCE ZIP OR POSTAL CODE PRICE tax_assessed_value overpriced
60 SORRY I AM MASKING THESE COLUMNS 450000 493760.0 -43760.0
32 SORRY I AM MASKING THESE COLUMNS 484950 483150.0 1800.0
325 SORRY I AM MASKING THESE COLUMNS 535000 532130.0 2870.0
78 SORRY I AM MASKING THESE COLUMNS 650000 642130.0 7870.0
135 SORRY I AM MASKING THESE COLUMNS 570000 560410.0 9590.0
... ... ... ... ... ... ... ...
376 SORRY I AM MASKING THESE COLUMNS 681950 190000.0 491950.0
375 SORRY I AM MASKING THESE COLUMNS 684950 190000.0 494950.0
337 SORRY I AM MASKING THESE COLUMNS 688950 190000.0 498950.0
359 SORRY I AM MASKING THESE COLUMNS 740950 190000.0 550950.0
360 SORRY I AM MASKING THESE COLUMNS 790950 190000.0 600950.0

341 rows × 7 columns

Basic statistics to summarize over our particular housing sample

  • Raw price values are skewed because of the filters I imposed when pulling the entire list of homes-for-sale from redfin
  • Look only at the homes for which I could successfully identify the tax-assessed value
  • Look only at the homes that don't seem extremely overpriced. Homes that appear extremely overpriced might just be missing taxableImprovementValue because they're new builds.
  • In the end, this is roughly the set of homes I'm interested in, so I can still continue with finding tax assessed values and evaluating how overpriced they are
(
    processed_df
    .query('tax_assessed_value > 0 & overpriced < 200000')
    [['PRICE', 'tax_assessed_value', 'overpriced']]
    .describe()
)
PRICE tax_assessed_value overpriced
count 325.000000 325.000000 325.000000
mean 632634.787692 538761.950769 93872.836923
std 98730.374007 84934.788980 41077.668809
min 409000.000000 329660.000000 -43760.000000
25% 550000.000000 475990.000000 66000.000000
50% 639500.000000 534280.000000 89140.000000
75% 700000.000000 603570.000000 121590.000000
max 800000.000000 766600.000000 195260.000000

Roadbumps

I initially looked into zillow and pyzillow, but some of those libraries seemed poorly documented or out-of-date, so I ran into difficulty using them. Redfin conveniently dumps summaries of some metro areas, but these lag 1-2 months behind, which is not super helpful to anyone in a buying mood. Basically, the API-makers have made it difficult for any coding hobbyist who wants to do some useful data science.

Final thoughts

No fancy statistics or modeling here, just pulling together some pydata tools to scrape redfin and summarize some data that might be useful to me in the future.