Manually Filtering the Data Set

Written on Tuesday, 21st January 2014.

A few weeks ago Shaun embarked on the giant task of manually filtering our new data set. I just thought I'd write up a little about what we did.

The New Dataset

Back in December we ran our first large data import from CrunchBase. Up until that point we had been relying on manual additions drawn from our own experience, along with member submissions that we approved by hand. The CrunchBase data was a great way of adding further companies, including historic ones, to the listings.

In the end the import added just under 10,000 companies to the database, which greatly increased the size of our data set. The drawback was a drop in data quality: it brought in companies that we would not originally have approved for the site.

Data trustworthiness is an important part of what we are working towards, so the next stage in the process was to remove the irrelevant listings from the data set.

Next Step: Manual Filtering

After the import, our next step was to manually vet the newly added companies (just under 10,000 of them), looking at what each one did and watching for signs that it might no longer be operating.
It took a while, but we worked through the entire list, more than halving the original set and flagging other entries that need further investigation.
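
For the curious, the bookkeeping behind a pass like this can be sketched in a few lines of Python. This is an illustration only, not a description of our actual tooling: the names Review, triage and summarise, and the two fields a reviewer fills in, are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Decision(Enum):
    KEEP = "keep"
    REMOVE = "remove"
    INVESTIGATE = "investigate"


@dataclass
class Review:
    """One reviewer's notes on an imported company (hypothetical fields)."""
    company: str
    relevant: bool                  # does what the company does fit the site?
    appears_active: Optional[bool]  # None means the reviewer could not tell


def triage(review: Review) -> Decision:
    """Turn a reviewer's notes into one of three outcomes."""
    if not review.relevant:
        return Decision.REMOVE
    if review.appears_active is None:
        # Relevant, but unclear whether it still operates: flag for later.
        return Decision.INVESTIGATE
    return Decision.KEEP if review.appears_active else Decision.REMOVE


def summarise(reviews: Iterable[Review]) -> dict:
    """Count how many reviewed companies fall into each outcome."""
    counts = {decision: 0 for decision in Decision}
    for review in reviews:
        counts[triage(review)] += 1
    return counts
```

Keeping "could not tell" as its own outcome, rather than forcing a keep or remove decision, is what lets uncertain entries be set aside for a second look.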

Moving Forward

As time goes on we hope to use a mix of manual and automatic filtering to keep improving the integrity, 'liveness' and completeness of our data set. It will no doubt be an ongoing process as more data sources go live. We'll keep you posted on these updates as we go!
About the author:
CTO, Tech Britain
I'm Ali - one of the guys who works on the TechBritain site.