Predicting the location of housing crimes - the power of machine learning and property licensing

What if we could predict the time and location of environmental health crimes?

This would be a regulator’s dream, not unlike Spielberg’s 2002 fantasy film Minority Report, in which a specialised police department apprehends criminals based on foreknowledge provided by psychics. Is that fantasy starting to become a reality?

Our co-founder Russell Moffatt and data analytics expert Dr Mark Gardener describe how we are helping councils to predict the location of housing crimes in combination with large-scale property licensing.

“All models are wrong, but some are useful.” (George E. P. Box)

Russell writes:

In 2012, during the set-up of Newham's ground-breaking borough-wide licensing scheme, we were challenged with identifying 10,000 unlicensed and unsafe private rented properties hidden amongst 105,000 residential properties spread across 14 square miles.

By that point in the Newham property licensing project we had established during a pilot that unlicensed properties were much more likely to have poor housing standards (Category 1 hazards under the Housing Health and Safety Rating System, HHSRS). Failure to license a rented property in Newham was also a criminal offence. So, find the unlicensed properties and we would find both housing crimes and properties in substandard condition. Simple.


Problem

All we needed was sound intelligence on residential tenure for every property in the borough. The big problem was that councils do not hold accurate data on tenure.

The traditional solution to this type of problem was to dispatch a small army of council officers to walk the streets, spotting properties that "looked" like they might be rented through signs of disrepair, overflowing bins and even the "dirty curtain test". The problem with this technique is that it's costly and unreliable, particularly across a whole borough. This raised the question: could we identify tenure predictors in council data? After all, councils hold lots of data on all properties, often linked to a Unique Property Reference Number (UPRN).


Potential solution

The experiment started by matching a small number of property-level 'factors' to create a simple property warehouse. Over time more and more datasets were added, including council tax, housing benefit, the electoral register and complaint records, to name a few. After a few iterations we were managing spreadsheets with more than 8 million cells of data.
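
To give a flavour of what that data matching looks like in practice, here is a minimal R sketch. The file names and column names (property_gazetteer.csv, council_tax.csv, uprn and so on) are hypothetical, but the principle is the same: each council dataset is joined to a master property list via its UPRN.

    # Hypothetical illustration: build a simple property "warehouse" by joining
    # council datasets on the Unique Property Reference Number (UPRN).
    properties <- read.csv("property_gazetteer.csv")   # one row per property
    ctax       <- read.csv("council_tax.csv")          # council tax records
    hbenefit   <- read.csv("housing_benefit.csv")      # housing benefit claims

    # Left-join each dataset onto the master property list by UPRN
    warehouse <- merge(properties, ctax,     by = "uprn", all.x = TRUE)
    warehouse <- merge(warehouse,  hbenefit, by = "uprn", all.x = TRUE)

    str(warehouse)   # quick check of the combined data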


While wading through the spreadsheet, tenure trends started to appear in the data, which enabled some crude tenure predictions. It soon became clear that it would not be practical to mentally compute 100,000+ property tenure predictions. This was the light-bulb moment: would it be possible to develop computer models to make fast and reliable predictions instead?

At this point the project needed some specialist academic support.


Mark writes:

I was brought in by Russell to help develop computer-aided intelligence for modelling property tenure. A particular requirement was to identify properties in need of regulatory intervention. I had heard about some of the work Newham Council was doing to tackle slum housing and was keen to support the project.

The task was to develop a system that allowed prediction of tenure, given the data held by the council. There are several stages in producing computer models of this sort:

  1. Identify the outcome you want to achieve.
  2. Obtain suitable data to help the modelling process.
  3. Check and "tidy up" the input data.
  4. Develop a modelling method.
  5. Run the model and identify possible "predictor" variables.
  6. Check the model for "accuracy".
  7. Refine and revise the model using new and updated data.

Note that the process is not entirely linear, particularly in the later stages. New data is being acquired all the time and this generally means your model can benefit. A model is not a fixed entity and can be revised and altered to reflect changes in the data.


Obtaining data

Councils hold a huge amount of data about properties. The large quantity of data is a double-edged sword: on one hand, lots of data helps the "predictive capacity" of a model; on the other hand, there are more data to check!

An important component of predictive modelling is in preparing the data. Computers are literal entities and database entries such as "Yes", "yes", "Y" and "y" are regarded as four separate items. Often a considerable period of time is required to "tidy", validate and prepare the data.
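
A few lines of R are often enough to standardise that sort of inconsistency before modelling. The snippet below is only illustrative, using a made-up vector of raw entries:

    # Standardise inconsistent "yes/no" style entries so that "Yes", "yes",
    # "Y" and "y" are all treated as the same value.
    raw <- c("Yes", "yes", "Y", "y", "No", "n", " NO ")

    cleaned <- trimws(tolower(raw))                  # strip spaces, lower case
    cleaned <- ifelse(cleaned %in% c("yes", "y"), "yes",
               ifelse(cleaned %in% c("no", "n"),  "no", NA))

    table(cleaned, useNA = "ifany")                  # check the recoding worked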


Developing a model

We used R, the statistical programming language, for model development. It is powerful, flexible and eminently suitable for the task. It is also free and open source.


There are various "kinds" of regression modelling and so it is important to identify what you are trying to achieve. We realised that the important question was "is a property privately rented?" This meant that we would use a method called Logistic Regression, which is a form of generalised linear model using binomial data. In other words, our potential property tenure can be regarded as privately rented or not.
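
In R, fitting a model of this kind comes down to a binomial GLM. The sketch below is illustrative only: the data frame and predictor names (warehouse, council_tax_band, hb_claim, electoral_roll_count) are hypothetical stand-ins for the kinds of council data described earlier, and the response is a 1/0 flag recording whether a property is known to be privately rented.

    # Illustrative logistic regression: is a property privately rented (1) or not (0)?
    # "warehouse" and the predictor columns are hypothetical stand-ins.
    fit <- glm(privately_rented ~ council_tax_band + hb_claim + electoral_roll_count,
               data   = warehouse,
               family = binomial)

    summary(fit)   # coefficients and deviance for each predictor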

The starting point for a predictive model of this kind is data where the outcome is known. In other words, we already know which properties are privately rented (for example, licensed HMOs). The task is to develop a model that contains the most "useful" variables and allows prediction for data where the outcome is unknown.
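
Once the model has been fitted on the properties whose tenure is known, the same model can score the rest. Again this is only a sketch; unknown_properties is a hypothetical data frame of properties with no recorded tenure.

    # Score properties whose tenure is unknown (hypothetical data frame)
    # type = "response" returns a probability between 0 and 1 for each property
    unknown_properties$p_rented <- predict(fit, newdata = unknown_properties,
                                           type = "response")

    # Rank the properties so officers can visit the most likely candidates first
    shortlist <- unknown_properties[order(-unknown_properties$p_rented), ]
    head(shortlist)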

Creating a regression model is a subtle mixture of maths and experience. It is possible to use AIC values to "pick" the best available predictive variables from the pool of data. The approach maximises the predictive capacity of the model. New terms (predictive variables) can be added to the model to improve its overall "performance". However, the computer doesn't know anything about the private rental sector, it deals only with maths and statistical likelihood. Some variables may be potentially "very good" as predictors in theory but in practice hard to obtain or unreliable in some other way. This is where the skill and experience of the practitioner comes into play. Russell was able to exercise control over which variables were used in the final model, thus producing a model that was most useful.
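
Base R's step() function gives a flavour of this kind of AIC-driven selection. It is a sketch of the general approach rather than the exact procedure we used, and the candidate predictors here are again hypothetical.

    # Start from a model containing all candidate predictors (hypothetical columns)
    full_model <- glm(privately_rented ~ council_tax_band + hb_claim +
                        electoral_roll_count + complaint_count + benefit_flag,
                      data = warehouse, family = binomial)

    # Stepwise selection in both directions, comparing candidate models by AIC
    selected <- step(full_model, direction = "both", trace = 0)

    summary(selected)   # the retained predictors can still be reviewed and vetoed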


Revising and refining a model

The basic model "results" show "how likely" each property is to be a particular tenure. Of course all models are wrong, so the next step is to see how "reliable" the model is under various circumstances. A basic generalised linear model (GLM) provides the deviance values needed to calculate a D² statistic, a measure of the proportion of the deviance explained by the model. This is a summary guide and is especially useful in helping to determine the "final cut", which is how many of the explanatory variables you keep in the final model.
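
R does not print D² directly, but it is easily calculated from the null and residual deviance of the fitted model (here using the hypothetical selected model from the sketch above):

    # D-squared: proportion of deviance explained by the model
    d2 <- 1 - (selected$deviance / selected$null.deviance)
    round(d2, 3)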

There are diminishing returns of predictive capacity for each additional variable added. In general, the aim is to balance peak predictive power with the number of variables used. Models with fewer variables tend to perform better in their predictions when used on new (previously unseen) data. Thus the revising and refining process is important in that it helps give “best value” and produces more reliable results.
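
A simple way to see this effect is to hold back a portion of the known-tenure data and compare how a larger and a smaller model predict it. The sketch below again assumes the hypothetical warehouse data, with privately_rented coded 0/1.

    # Hold back 20% of the known-tenure properties as a test set
    set.seed(42)
    test_rows <- sample(nrow(warehouse), size = floor(0.2 * nrow(warehouse)))
    train <- warehouse[-test_rows, ]
    test  <- warehouse[ test_rows, ]

    big   <- glm(privately_rented ~ council_tax_band + hb_claim +
                   electoral_roll_count + complaint_count + benefit_flag,
                 data = train, family = binomial)
    small <- glm(privately_rented ~ council_tax_band + hb_claim,
                 data = train, family = binomial)

    # Compare simple classification accuracy on the unseen test data
    acc <- function(m) mean((predict(m, test, type = "response") > 0.5) ==
                              test$privately_rented)
    c(big = acc(big), small = acc(small))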

Russell concludes:

The tenure intelligence (Ti) approach described above was a game changer. It solved an important strategic problem and helped unlock the potential of property licensing.

It enabled my team to be much more productive by minimising wasted time surveying and visiting the wrong properties. In just a couple of years Newham's private housing enforcement outputs went from 25 prosecutions per year to more than 280, in excess of a 10-fold increase. In 2016 this represented 67% of all private housing enforcement undertaken in London. Ti was the catalyst that enabled this important leap forward.

Ti is now being used by a range of councils across the country to help improve understanding of the rented sector at a policy level and upgrade the quality of regulatory interventions.

It's important to note this approach requires property licensing to be successful. Licensing provides a 'behavioural wedge' to separate good landlords from bad. It also offers a much-improved legal framework for regulators to operate in.


Mark concludes:

A predictive model is a tool, and in this case a tool that gives "intelligence" to frontline practitioners and policy managers. The model allows traditional follow-up methods to be applied more efficiently and effectively. It is a computer version of the "dirty curtain" test, but one that doesn't rely on a small army of council officers.


About the Authors

Russell Moffatt B.Sc. MPH CEnvH is co-founder of Metastreet. He is a Chartered Environmental Health Practitioner with more than 20 years of experience working in some of the most challenging London boroughs. He qualified with a B.Sc. (Hons) in Environmental Health at Greenwich University in 2002 and achieved his Masters in Public Health at King's College London in 2006.

Mark Gardener PhD is founder of DataAnalytics.org.uk. He was originally an ecologist but now works in more general areas too, providing training courses and data/project workshops as well as writing textbooks on R. He gained his doctorate in pollination ecology in 2001 and has undertaken research work and teaching in the UK and around the world.

Author: Russell Moffatt, Chartered EHP and Co-founder of Metastreet

Published: 1st April 2020
