In my previous post, I showed how it was possible to “scrape” a cohort of real estate prices from the internet, together with the latitude, the longitude and a few other attributes on the properties. In this post, I will turn to one type of analysis that can be carried out with this data: the estimation of apartment prices at any location in-between.
From 3D graticule to 2D curves
While street addresses have been in use for centuries, the advent of civilian GPS technology in the late 1990s has created an explosion in the volume of data referencing latitudes and longitudes. Those strings of numbers are now ubiquitous and appear to many as the single and unique way to describe the location of a point on the surface of the planet.
This two dimension coordinate system is not the only one however and it embeds a number of complexities when we want to put the data to work.
For instance, the measurement of distances between two points is problematic: one degree of longitude difference at the equator is about 111 km or 70 miles while this this values drops gradually to zero as we move north or south towards the poles.
In fact, any two dimension coordinate system will not be able to preserve every Euclidian distances in every direction for every two points on the surface of the earth.
Over the years, many methods were devised to balance those inaccuracies. Some of those methods were chosen for scientific or technical reasons, others with political motives.
One of the most famous cases is the use of the Mercator projection for nautical charts. In this mapping method, the meridians are drawn parallel so that the navigator can easily transcribe the heading on the compass to an angle with the longitude lines on paper (such transformations that preserve angles are called conformal projections).
Clearly, at very short distances, the earth shape can be roughly approximated as a two-dimensional surface and the analysis of the data is quite robust to the choice of projection system, yet some systems perform significantly better than others when distances increase.
For geostatistics, I favor the azimuthal equidistant projection system centered on the area of interest for the analysis. This transformation ensures that the distances to the center of the map are preserved and that the distortions do not favor any particular directions as we drift away from the point of focus. The hope is that the assumed “isotropic behavior” in the real world, the fact that what happens here happens the same way over there, are preserved on the map.
To illustrate how distances are well-behaved at short distances and break down at larger scales, I show below the graticule of the standard latitude/longitude grid in the azimuthal equidistant projection using Python’s Geopandas.
The red dot plots the location of the city (73.96717 West, 40.77726 North). The red circle locates the equator. The great arc in red on the right is the prime meridian through Greenwich, UK. The circle on the edge of the figure reflects only a single point in the real world, the point at the antipodes.
Clearly, the projection breaks down the notion of closeness when we move far away from focus, yet it performs nicely at short distances.
To illustrate this point further, we revert to our previous post where the simple platte projection was used. We show this map again for the island of Manhattan on the left below. At 41ºN, it grossly overestimates distances on the x-axis, along the parallel lines.
On the right below, we show the “corrected” projection that uses the azimuthal equidistant projection centered on Manhattan. This time, the x-axis is not distorted and it is in line with common maps and common photographs from satellite imaging.
Focusing on Upper-West Side
Having set a proper projection system for the latitude and longitude, I now turn to the rest of the data. For the sake of illustration, I focus on the densely represented data in the Upper West Side (red rectangle in the figure on the right).
Filtering those point from the base data cohort can be easily achieved using the contain method of the Polygon class in Python’s Shapely library (a Geopandas dependency).
The data filtering is not over yet however. A cursory survey of the data shows that it also contains a number of outliers. For instance some surface areas are grossly inflated because some large terraces are included in the reported property size. Conversely, certain price-per-sqft values are high because they do not encompass the value derived from terrace or private garden areas.
Such data cleansing step is a necessary evil and it should not be overlooked. Certain automated methods can be used to detect those records and exclude them from the sample. Other methods can, in some cases, reduce their impact by effectively reducing their weight during the analysis (robust estimators).
Additional exploration on the cleansed data quickly reveals that two of the easy-to-access features have a significant effect on property prices on a prices-per-sqft basis: the type of the unit (condo or coop) and the total size of the unit. I trace below regression lines for each unit type based on a log-log transformation:
Not surprisingly, Coop properties come out significantly cheaper than Condos. In addition, there is a premium to be paid for larger apartment on a per-sqft basis.
Using TensorFlow for Gaussian Process Estimation
Having reasonably filtered out outliers and estimated the effects of units types and sizes, it is now time to revert to the somewhat trite adage “location, location, location”…
Fortunately, the statistical treatment of spatial processes has been the subject of research for years and we do not need to start from scratch there either. In particular miners have been trying to estimate the density of ore in large expanse of rocks based on a reduce set of core drill samples for more than a century. As this discipline matured to become what is now called “geostatistics”, new areas of application such as ecology, quantitative geography and regional economics came along and greatly helped enrich the field.
In this framework, the standard approach to estimate values across a map given the value at only a limited number of points is “Kriging” or Gaussian Process Regression. The main assumption behind this is that the variations in the quantity that we want to measure in the real world are just a single outcome from a random draw out of a probability distribution.
Depending on the situation, various types of regularities are assumed across the search neighborhood (typically stationarity assumptions) and they typically lead to the fit of a gaussian process based on a covariance function between the values at two separate points on the map. The covariance functions is sometimes analyzed separately through the use of what is called a variogram in such a way that the analyst can supplement the statistical fit with domain knowledge on the regularity of the process at short distance (nugget effect) and at very long distance (sill).
In this case, for the sake of simplicity in this presentation, I chose to use a more direct approach where the covariance/kernel of the Gaussian Process is directly estimated through maximum likelihood on the data. This type of estimation can be carried out easily using the GPFlow library, a library that leverages on Google’s TensorFlow to organize and parallelize the computation in the dataflow graph into an efficient set of CPU or GPU instructions.
In the graph below, I regressed the residual from the earlier, location-free regression (the regression of price per sqft onto unit type and unit size). The “hot” palette was used so that yellow areas reflect high values, red areas reflect average values and black reflect low values. To further help read the map, I superimposed the streets from the TigerLine dataset.
While the aim for this post is not to create a best in class price estimator, but rather show how Python can provide a cheap and attractive alternative to vendor software, the realism of the result is striking.
In particular, we can clearly see the higher price-per-square foot values derived from Central Park on the eastern border of the map, especially below 96th street where “Manhattan Valley”/”Bloomingdale District” starts. This behavior also seems to be extended to the entire area around the American Museum of Natural History (80th street and Central Park West).
In comparison, the Riverside Park on the western border, above 72nd street, do not seem to generate as much value on a per-square-foot basis.
As this post draws to a close, feel free to send me your comments.
Happy coding and data analysis!