Here's a somewhat practical example: getting a sense of your local housing market using some mostly-standard Python libraries. I'm particularly interested in how overpriced the homes in my area are (relative to their tax-assessed values), but you could imagine adapting this code or running periodic data extractions to summarize trends in your own area.

Brief technical summary

To anyone who might try to use this code:

The rest of the code involves parallelizing and then summarizing/cleaning all of the redfin requests -- it might be slightly over-engineered and overly specific to what I'm trying to do.

import requests
import json
import itertools as it
import io

import pandas as pd
from dask import delayed
from dask.distributed import Client, as_completed
from redfin import Redfin

Define some 'constants' that will help with making requests and with filtering down to the columns we want

request_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding':'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Chrome/92.0.4515.130'
}
relevant_columns = [
    'ADDRESS', 'CITY', 'STATE OR PROVINCE', 'ZIP OR POSTAL CODE', 'PRICE', 'tax_assessed_value', 'overpriced'
]

Because there are a lot of houses and each request can take a while, I'm going to use dask to parallelize many requests.

Initially, I used concurrent.futures, but I wanted a nice visualizer UI to see the progress of my tasks, which dask's dashboard provides. The dask parallelization also seemed slightly faster, though I didn't benchmark it carefully.
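
For reference, here's a minimal sketch of roughly what the concurrent.futures version looked like (illustrative only, not the exact code I ran; it reuses the query_redfin_dask function defined further below):

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed as cf_as_completed

def query_all_with_threads(addresses, session, redfinclient, headers):
    """ Thread-pool equivalent of the dask version below -- no dashboard, though """
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(query_redfin_dask, session, redfinclient, address, headers=headers)
            for address in addresses
        ]
        # Collect results as they finish; progress is invisible unless you log it yourself
        return [future.result() for future in cf_as_completed(futures)]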

daskclient = Client()
redfinclient = Redfin()
daskclient

Client-77c8f4fb-fe15-11eb-a8f9-00163e49e059
Connection method: Cluster object; cluster type: LocalCluster
Dashboard: http://127.0.0.1:8787/status

Define the functions I'll be using

def stitch_full_address(row):
    """ Given a house record from redfin, generate a full address for future querying """
    return f"{row['ADDRESS']}, {row['CITY']} {row['STATE OR PROVINCE']}"
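
# Quick usage example (made-up row, just to show the output format):
#   stitch_full_address({'ADDRESS': '123 Main St', 'CITY': 'Springfield',
#                        'STATE OR PROVINCE': 'VA'})
#   -> '123 Main St, Springfield VA'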


def process_redfin_response(response, redfinclient):
    """ Given a response from redfin API, return the tax-assessed value
    
    Notes
    -----
    This can get messy because this response is deeply-nested JSON, and 
    there are many chances for us to fail at pulling tax values.
    In all the places where things can go wrong, I do a very sloppy check and 
    then return -1 if something broke
    """
    if response.status_code != 200:
        return -1
    else:
        resp_dict = json.loads(response.text[4:])
        if (
            (resp_dict.get('errorMessage', None) == 'Success')
            and ('exactMatch' in resp_dict['payload'])
        ):
            # Pull property metadata
            url = resp_dict['payload']['exactMatch']['url']
            
            data = redfinclient.initial_info(url)['payload']
            if data['responseCode'] != 200:
                return -1
            property_id = data['propertyId']
            info = redfinclient.below_the_fold(property_id)
            # Pull latest tax-assessed value
            if len(info['payload']['publicRecordsInfo']['allTaxInfo']) > 0:
                tax_assessment = (
                    pd.DataFrame(info['payload']['publicRecordsInfo']['allTaxInfo'])
                    .sort_values("rollYear", ascending=False)
                ).iloc[0]
                return tax_assessment.get('taxableLandValue', 0) + tax_assessment.get('taxableImprovementValue', 0)
            else:
                return -1
        else:
            return -1
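
# For orientation, the autocomplete response parsed above looks roughly like
# this (a hand-abbreviated sketch from my own runs, not an official schema):
#   {"errorMessage": "Success",
#    "payload": {"exactMatch": {"url": "/VA/Some-City/123-Main-St/home/0000000"},
#                ...}}
# and below_the_fold(...) nests the tax history under
#   payload -> publicRecordsInfo -> allTaxInfo
# as a list of records like
#   {"rollYear": 2021, "taxableLandValue": 190000, "taxableImprovementValue": 350000}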
        

def query_redfin_dask(session, redfinclient, address, headers=None, **kwargs):
    """ For a given address, query redfin and identify tax-assessed value
    
    This is the function we submit to the dask client
    """
    response = session.get(
        'https://redfin.com/stingray/do/location-autocomplete',
        params={
            'location': address,
            'v': 2,
            **kwargs
        },
        headers=headers
    )
    return {address: process_redfin_response(response, redfinclient)}


def compile_results(results):
    """ Aggregate the results from all the redfin requests into a single series
    
    Take a list of dictionaries (from the dask future objects), flatten them into one 
    dictionary, then turn into a pandas series
    """
    compiled = pd.Series(dict(it.chain.from_iterable(a.items() for a in results)), name='tax_assessed_value')
    
    return compiled
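
As a tiny sanity check of what compile_results does (toy inputs, not real listings):

compile_results([{'1 Example Rd, Faketown VA': 100000}, {'2 Example Rd, Faketown VA': -1}])
# 1 Example Rd, Faketown VA    100000
# 2 Example Rd, Faketown VA        -1
# Name: tax_assessed_value, dtype: int64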

Download a list of all home listings from redfin based on search parameters, and do some light cleaning to generate full addresses. There might be good documentation on making this query somewhere, but there are definitely a few parameters I don't understand and only kept because they showed up when I searched on redfin.com and reverse-engineered the URL.

with requests.Session() as session:
    params = {
        'al': 1,
        'hoa': 150,
        'market': 'dc',
        'max_listing_approx_size': 3000,
        'min_listing_approx_size': 1700,
        'max_num_beds': 4,
        'max_price': 800_000,
        'num_baths': 2,
        'num_beds': 2,
        'num_homes': 450, 
        'page_number': 1,
        'region_id': 2965,
        'region_type': 5,
        'sf': '1,2,3,5,6,7',
        'status': 9,
        'uipt': '1,2,3,4,5,6,7,8',
        'v': 8
    }
    download = session.get(
        'https://www.redfin.com/stingray/api/gis-csv', params=params,
        headers=request_headers
    )
df = (
    pd.read_csv(io.StringIO(download.content.decode("utf-8")))
    .assign(full_address=lambda x: x.apply(stitch_full_address, axis=1))
)

For each home listing, download more detailed information on the taxable land value and taxable improvement value. There's certainly a lot of other information you could pull from these queries, but I'll leave that as an exercise for the reader.

This is technically broken into a few dask steps: scattering the requests.Session and redfin.Redfin objects to the workers, then making a bunch of queries (as dask.delayed objects) to redfin, one per address.

with requests.Session() as session:
    session_scattered = daskclient.scatter(session)
    redfinclient_scattered = daskclient.scatter(redfinclient)
    futures = [
        delayed(query_redfin_dask)(session_scattered, redfinclient_scattered, address, headers=request_headers)
        for address in df['full_address'].to_list()
    ]
    futures = daskclient.compute(futures) 
    completed_results = [result for _future, result in as_completed(futures, raise_errors=False, with_results=True)]
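
Since raise_errors=False hands back the exception object as the "result" of any failed future, a slightly more defensive version of that last line could separate successes from failures (a sketch using the same variables as above):

completed_results = []
for future, result in as_completed(futures, raise_errors=False, with_results=True):
    if future.status == 'error':
        # result is the exception here, not an {address: value} dict
        print(f'request failed: {result!r}')
    else:
        completed_results.append(result)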

Join the tax-assessed values back to the original dataframe and compute how overpriced each property is

processed_df = (
    df.merge(compile_results(completed_results), left_on='full_address', right_index=True)
    .assign(overpriced=lambda x: x['PRICE'] - x['tax_assessed_value'])
)

Examine the results (some identifying information is masked)

In this particular example, some homes have really low tax-assessed values relative to the asking price. I think this is because these homes are so new (built this year) that there's no taxableImprovementValue yet, only taxableLandValue. As a result, these newly-built properties can appear extremely overpriced.
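
One way to flag those suspected new builds before summarizing (the 40% threshold here is my own rough guess, not something derived from the data):

suspected_new_builds = processed_df.query(
    'tax_assessed_value > 0 and tax_assessed_value < 0.4 * PRICE'
)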

(
    processed_df
    .query('tax_assessed_value > 0')
    [relevant_columns]
    .sort_values('overpriced')
    .assign(**{
        'ADDRESS': 'SORRY',
        'CITY': 'I AM',
        'STATE OR PROVINCE': "MASKING",
        'ZIP OR POSTAL CODE': "THESE COLUMNS"
    })
)
ADDRESS CITY STATE OR PROVINCE ZIP OR POSTAL CODE PRICE tax_assessed_value overpriced
60 SORRY I AM MASKING THESE COLUMNS 450000 493760.0 -43760.0
32 SORRY I AM MASKING THESE COLUMNS 484950 483150.0 1800.0
325 SORRY I AM MASKING THESE COLUMNS 535000 532130.0 2870.0
78 SORRY I AM MASKING THESE COLUMNS 650000 642130.0 7870.0
135 SORRY I AM MASKING THESE COLUMNS 570000 560410.0 9590.0
... ... ... ... ... ... ... ...
376 SORRY I AM MASKING THESE COLUMNS 681950 190000.0 491950.0
375 SORRY I AM MASKING THESE COLUMNS 684950 190000.0 494950.0
337 SORRY I AM MASKING THESE COLUMNS 688950 190000.0 498950.0
359 SORRY I AM MASKING THESE COLUMNS 740950 190000.0 550950.0
360 SORRY I AM MASKING THESE COLUMNS 790950 190000.0 600950.0

341 rows × 7 columns

Basic statistics to summarize over our particular housing sample

  • Raw price values are skewed because of the filters I imposed when pulling the entire list of homes-for-sale from redfin
  • Look only at the homes for which I could successfully identify the tax-assessed value
  • Look only at the homes that don't seem extremely overpriced. Homes that appear extremely overpriced might just be missing taxableImprovementValue because they're new builds.
  • In the end, this is roughly the set of homes I'm interested in, so I can still continue with finding tax assessed values and evaluating how overpriced they are
(
    processed_df
    .query('tax_assessed_value > 0 & overpriced < 200000')
    [['PRICE', 'tax_assessed_value', 'overpriced']]
    .describe()
)
PRICE tax_assessed_value overpriced
count 325.000000 325.000000 325.000000
mean 632634.787692 538761.950769 93872.836923
std 98730.374007 84934.788980 41077.668809
min 409000.000000 329660.000000 -43760.000000
25% 550000.000000 475990.000000 66000.000000
50% 639500.000000 534280.000000 89140.000000
75% 700000.000000 603570.000000 121590.000000
max 800000.000000 766600.000000 195260.000000

Roadbumps

I initially looked into zillow and pyzillow, but some of those libraries seemed poorly documented or out-of-date, so I ran into difficulty using them. Redfin conveniently dumps summaries of some metro areas, but these lag 1-2 months behind, which is not super helpful to anyone in a buying mood. Basically, the API-makers have made it difficult for any coding hobbyist who wants to do some useful data science.

Final thoughts

No fancy statistics or modeling here, just pulling together some pydata tools to scrape redfin and summarize some data that might be useful to me in the future.