Over the last decade, a number of open source tools have emerged, revolutionizing the fields of data collection, analysis and inference. Among those, the Python language appears to be getting the most attention.
In this piece, which opens my new series about Python and Real Estate Data, I will illustrate how one can gather a substantial amount of data at virtually no cost.
To be sure, the initial steps of data collection and data cleansing are means rather than ends, in my view. Yet they are necessary evils that require both ingenuity and rigor. They can make or break your analysis.
Computer science is no more about computers than astronomy is about telescopes – Edsger Dijkstra
Python as a scraping platform
If your ultimate goal is to gather fresh information at the house or building level, there are very few freely downloadable data sets out there. Very likely, you will resort to scraping web data.
Web scraping or web harvesting is the automated extraction of web data – in particular web pages – using dedicated programs and scripts. For this task, Python offers many options.
Because this programming language was largely born out of the scripting world, developers can rely on a number of lower-level built-in modules designed to send HTTP requests and receive the responses (HTTP being the main protocol for exchanging information on the web) and then parse or generate the HTML content of a page.
For instance, in Python 3.x, one can use the urllib module to download the HTML content of a page that is located at a specific internet address…
>>> import urllib.request
>>> with urllib.request.urlopen('https://en.wikipedia.org/wiki/Web_scraping') as f:
...     page = f.read().decode('utf-8')  # raw bytes decoded into a string
...
… and in a second step parse the HTML content to isolate and store the text structures of interest (for instance using the html.parser module).
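To make that second step concrete, here is a minimal sketch based on the standard library's html.parser module. The HeadingParser class and the sample HTML string are my own illustrations, not taken from a real page, and the choice of collecting h2 headings is an arbitrary assumption.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text of every <h2> heading found in a page."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty heading whenever an <h2> tag opens.
        if tag == "h2":
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        # Text is only kept while we are inside an <h2> element.
        if self._in_heading:
            self.headings[-1] += data


# Hypothetical page content, standing in for a downloaded document.
sample_html = (
    "<html><body><h1>Web scraping</h1>"
    "<h2>History</h2><p>Some text</p>"
    "<h2>Techniques</h2></body></html>"
)

parser = HeadingParser()
parser.feed(sample_html)
print(parser.headings)
```

In practice, the string returned by the urllib download would be passed to feed() in place of sample_html.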
While, in theory, we could be satisfied with those rather rustic tools, practitioners quickly transition to more advanced solutions that can handle all of these tasks, and more, in one fell swoop. Indeed, one piece that can turn out to be especially difficult to implement with basic tools is the flow of events that typically occurs during a browsing session, such as mouse clicks and scrolling…
Iridium? What about Selenium?
Yes, what about Selenium, this chemical element four cells away in the periodic table? Well, Selenium is probably the tool of choice for web scraping in Python these days.
With relative ease, you will find that Selenium allows you to scrape a large number of data points from online real estate listing portals. Moreover, the framework should enable you to extract a broad array of property features in addition to sale prices and locations. For instance, one can capture the surface area, the number of bedrooms and bathrooms, and the type of property (condo, co-op, townhouse…).
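As a rough sketch of what such a scraper might look like with the Selenium 4 Python bindings, consider the code below. It is not the script I used: the CSS selectors (div.listing-card, .price, .address, .beds) and both function names are hypothetical placeholders, since every portal has its own markup, and it assumes Chrome is installed locally.

```python
import re


def parse_price(text):
    """Turn a whole-dollar label such as '$1,250,000' into an int, or None."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def scrape_listings(url):
    """Load a listings page in a real browser and collect a few features.

    The CSS selectors below are hypothetical: adapt them to the target site.
    """
    # Imported here so parse_price stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Scroll to the bottom, as a visitor would, to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        listings = []
        for card in driver.find_elements(By.CSS_SELECTOR, "div.listing-card"):
            listings.append({
                "price": parse_price(card.find_element(By.CSS_SELECTOR, ".price").text),
                "address": card.find_element(By.CSS_SELECTOR, ".address").text,
                "bedrooms": card.find_element(By.CSS_SELECTOR, ".beds").text,
            })
        return listings
    finally:
        driver.quit()  # always release the browser, even after an error
```

Calling scrape_listings() with the address of a results page would return one dictionary per listing card. Before running anything like this, check that the portal's terms of use allow automated access.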
In my case, after some partial filtering of the data, I was able to gather information about 6,500 sale listings across Manhattan, Brooklyn and Queens.
While it is not wise to draw any conclusions from the data at such an early stage, it is hard to resist projecting the information on a map so that we can get a sense of the geographic dispersion of the houses. Thankfully, this can be done in a few lines of code using a Census Bureau shapefile and the geopandas module (see the chart below).
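For readers curious about the mapping step, here is a minimal sketch of how it could be done. It assumes each listing is a dictionary with hypothetical "lat" and "lon" keys holding WGS84 coordinates, that shapefile_path points to a boundary shapefile downloaded beforehand, and that geopandas and matplotlib are installed. The function names are mine.

```python
def keep_geocoded(listings):
    """Drop listings that lack a latitude or a longitude."""
    return [
        r for r in listings
        if r.get("lat") is not None and r.get("lon") is not None
    ]


def plot_listings(listings, shapefile_path):
    """Overlay listing locations on the boundaries stored in a shapefile."""
    # Imported here so keep_geocoded works even without geopandas installed.
    import geopandas as gpd

    boundaries = gpd.read_file(shapefile_path)
    located = keep_geocoded(listings)
    points = gpd.GeoDataFrame(
        located,
        geometry=gpd.points_from_xy(
            [r["lon"] for r in located],  # x is the longitude
            [r["lat"] for r in located],  # y is the latitude
        ),
        crs="EPSG:4326",  # assumption: coordinates are plain WGS84 lat/lon
    )
    # Reproject the points so they line up with the shapefile's own CRS.
    points = points.to_crs(boundaries.crs)
    ax = boundaries.plot(color="white", edgecolor="grey")
    points.plot(ax=ax, markersize=2, color="red")
    return ax
```

The returned matplotlib axes can then be titled, resized or saved to an image file as needed.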
We are now ready to turn to the meat of the analysis in part two of this series…