Surveying my local housing market via Redfin
Here's a somewhat practical example: getting a sense of your local housing market using some mostly-standard Python libraries. I'm particularly interested in how overpriced the homes in my area are (relative to the tax-assessed value), but you could imagine adapting this code or doing periodic data extractions to summarize trends in your own area.
Brief technical summary
To anyone who might try to use this code:
- Start from the individual HTTP requests, then adapt those to whatever interests you
- To download a list of home listings off redfin, this example URL returns a CSV file with information for each listing.
I'd actually recommend reverse-engineering this URL by going on redfin, searching for a neighborhood, applying any filters, and then copying the address of the "Download all" link. I think they mention this download is limited to 350 homes, but you can see that the num_homes param seems changeable (a minimal sketch of this request follows the list)
- This redfin package provides a convenient wrapper for some common requests to redfin. The package stores a lot of the relevant HTTP endpoints, so browse through the repo
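For reference, here's a minimal sketch of that "Download all" request. The region_id and region_type values are just the ones from my own reverse-engineered URL (yours will differ depending on the area and filters you pick), and the full set of parameters I actually ended up using shows up further down.
import io
import requests
import pandas as pd

# Minimal sketch of the reverse-engineered "Download all" CSV request
resp = requests.get(
    'https://www.redfin.com/stingray/api/gis-csv',
    params={'region_id': 2965, 'region_type': 5, 'num_homes': 450, 'v': 8},
    headers={'user-agent': 'Chrome/92.0.4515.130'},
)
listings = pd.read_csv(io.StringIO(resp.content.decode('utf-8')))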
The rest of the code involves parallelization and summarizing/cleaning all the redfin requests -- it might be slightly over-engineered and overly-specific to what I'm trying to do
import requests
import json
import itertools as it
import io
import pandas as pd
from dask import delayed
from dask.distributed import Client, as_completed, wait
from redfin import Redfin
Define some 'constants' that will help with making requests or just filtering columns we want
request_headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding':'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Chrome/92.0.4515.130'
}
relevant_columns = [
'ADDRESS', 'CITY', 'STATE OR PROVINCE', 'ZIP OR POSTAL CODE', "PRICE", "tax_assessed_value", 'overpriced'
]
Because there are a lot of houses and each request can take a while, I'm going to use dask to parallelize many requests. Initially, I used concurrent.futures, but I preferred a nice visualizer UI to see the progress of my tasks. Also, the dask parallelization seemed slightly faster?
daskclient = Client()
redfinclient = Redfin()
daskclient
Define the functions I'll be using
def stitch_full_address(row):
""" Given a house record from redfin, generate a full address for future querying """
return row['ADDRESS'] + ', ' + row['CITY'] + " " + row['STATE OR PROVINCE']
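# A quick illustration with a made-up listing row (hypothetical address, just to show the format):
#   stitch_full_address({'ADDRESS': '123 Main St', 'CITY': 'Washington', 'STATE OR PROVINCE': 'DC'})
#   returns '123 Main St, Washington DC'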
def process_redfin_response(response, redfinclient):
""" Given a response from redfin API, return the tax-assessed value
Notes
-----
This can get messy because this response is deeply-nested JSON, and
there are many chances for us to fail at pulling tax values.
In all the places where things can go wrong, I do a very sloppy check and
then return -1 if something broke
"""
if response.status_code != 200:
return -1
else:
resp_dict = json.loads(response.text[4:])
if (
(resp_dict.get('errorMessage', None) == 'Success') &
('exactMatch' in resp_dict['payload'])
):
# Pull property metadata
url = resp_dict['payload']['exactMatch']['url']
data = redfinclient.initial_info(url)['payload']
if data['responseCode'] != 200:
return -1
property_id = data['propertyId']
listing_id = data['listingId']
info = redfinclient.below_the_fold(property_id)
# Pull latest tax-assessed value
if len(info['payload']['publicRecordsInfo']['allTaxInfo']) > 0:
tax_assessment = (
pd.DataFrame(info['payload']['publicRecordsInfo']['allTaxInfo'])
.sort_values("rollYear", ascending=False)
).iloc[0]
return tax_assessment.get('taxableLandValue', 0) + tax_assessment.get('taxableImprovementValue', 0)
else:
return -1
else:
return -1
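# For reference, the rough shape of the responses this function walks through
# (only the keys actually used above; illustrative sketches, not the full redfin payloads):
#   location-autocomplete response (after stripping the leading 4 characters):
#     {'errorMessage': 'Success', 'payload': {'exactMatch': {'url': ...}}}
#   redfinclient.initial_info(url)['payload']:
#     {'responseCode': 200, 'propertyId': ..., 'listingId': ...}
#   redfinclient.below_the_fold(property_id)['payload']:
#     {'publicRecordsInfo': {'allTaxInfo': [{'rollYear': ..., 'taxableLandValue': ..., 'taxableImprovementValue': ...}]}}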
def query_redfin_dask(session, redfinclient, address, headers=None, **kwargs):
""" For a given address, query redfin and identify tax-assessed value
This is the function we submit to the dask client
"""
response = session.get(
'https://redfin.com/stingray/do/location-autocomplete',
params={
'location': address,
'v': 2,
**kwargs
},
headers=headers
)
return {address: process_redfin_response(response, redfinclient)}
def compile_results(results):
""" Aggregate the results from all the redfin requests into a single series
Take a list of dictionaries (from the dask future objects), flatten them into one
dictionary, then turn into a pandas series
"""
compiled = pd.Series(dict(it.chain.from_iterable(a.items() for a in results)), name='tax_assessed_value')
return compiled
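As a quick sanity check of compile_results: it just flattens a list of single-entry dictionaries into one named series. The addresses and values below are made up, purely to show the shape.
compile_results([
    {'1 Example St, Washington DC': 500000},
    {'2 Example Ave, Washington DC': -1},
])
# returns a pandas Series indexed by address, named 'tax_assessed_value'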
Download a list of all home listings from redfin based on search parameters, then do some small cleaning to generate full addresses. There might be good documentation on making this query, but there are definitely a few parameters I don't understand and only kept here because they showed up when I searched on redfin.com and reverse-engineered the URL
with requests.Session() as session:
params = {
'al': 1,
'hoa': 150,
'market': 'dc',
'max_listing_approx_size': 3000,
'min_listing_approx_size': 1700,
'max_num_beds': 4,
'max_price': 800_000,
'num_baths': 2,
'num_beds': 2,
'num_homes': 450,
'page_number': 1,
'region_id': 2965,
'region_type': 5,
'sf': '1,2,3,5,6,7',
'status': 9,
'uipt': '1,2,3,4,5,6,7,8',
'v': 8
}
download = session.get(
'https://www.redfin.com/stingray/api/gis-csv?', params=params,
headers=request_headers
)
df = (
pd.read_csv(io.StringIO(download.content.decode("utf-8")))
.assign(full_address=lambda x: x.apply(stitch_full_address, axis=1))
)
For each home listing, download more-detailed information pertaining to the taxable land value and taxable improvement value. There's certainly a lot of other information you could pull from these queries, but I'll leave that as an exercise to the reader
This is technically broken into a few different dask steps: scattering the requests.Session and Redfin client objects, then making a bunch of queries (with dask.delayed objects) to redfin for each particular address
with requests.Session() as session:
session_scattered = daskclient.scatter(session)
redfinclient_scattered = daskclient.scatter(redfinclient)
futures = [
delayed(query_redfin_dask)(session_scattered, redfinclient_scattered, address, headers=request_headers)
for address in df['full_address'].to_list()
]
futures = daskclient.compute(futures)
completed_results = [result for a, result in as_completed(futures, raise_errors=False, with_results=True)]
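For comparison, the concurrent.futures approach I started with looked roughly like the sketch below -- same per-address query function, just submitted to a thread pool instead of dask. Treat it as a rough sketch rather than the exact code I ran; the 8 workers is an arbitrary choice.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed as cf_as_completed  # aliased to avoid clobbering dask's as_completed

with requests.Session() as session:
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Submit one query per address, then collect results as they finish
        thread_futures = [
            pool.submit(query_redfin_dask, session, redfinclient, address, headers=request_headers)
            for address in df['full_address'].to_list()
        ]
        thread_results = [f.result() for f in cf_as_completed(thread_futures)]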
Join the tax-assessed values back to the original dataframe and compute how overpriced each property is
processed_df = (
df.merge(compile_results(completed_results), left_on='full_address', right_index=True)
.assign(overpriced=lambda x: x['PRICE'] - x['tax_assessed_value'])
)
Examine the results (some information is masked)
In this particular example, some homes have really low tax-assessed values relative to the actual price. I think this is because these homes are so new (built this year) that there's no taxableImprovementValue yet, only taxableLandValue. As a result, these newly-built properties can appear extremely overpriced
(
processed_df
.query('tax_assessed_value > 0')
[relevant_columns]
.sort_values('overpriced')
.assign(**{
'ADDRESS': 'SORRY',
'CITY': 'I AM',
'STATE OR PROVINCE': "MASKING",
'ZIP OR POSTAL CODE': "THESE COLUMNS"
})
)
Basic statistics to summarize over our particular housing sample
- Raw price values are skewed because of the filters I imposed when pulling the entire list of homes-for-sale from redfin
- Look only at the homes for which I could successfully identify the tax-assessed value
- Look only at the homes that don't seem extremely overpriced. Homes that are extremely overpriced might just be due to the absence of taxableImprovementValue from new builds.
- In the end, this is roughly the set of homes I'm interested in, so I can still continue with finding tax assessed values and evaluating how overpriced they are
(
processed_df
.query('tax_assessed_value > 0 & overpriced < 200000')
[['PRICE', 'tax_assessed_value', 'overpriced']]
.describe()
)
Roadbumps
I initially looked into zillow and pyzillow, but some of those libraries seemed poorly documented or out-of-date, so I ran into some difficulty using those.
Redfin conveniently dumps summaries of some metro areas, but these lag 1-2 months behind, which is not super helpful to anyone in a buying mood.
Basically, for any coding hobbyist wanting to do some useful data science, the API-makers have made it difficult
Final thoughts
No fancy statistics or modeling, just pulling together some pydata tools to scrape redfin and summarize some data that might be useful to me in the future