A few weeks ago Shaun embarked on the giant task of manually filtering our new dataset. I thought I'd write up a little about what we did.
The New Dataset
Back in December we conducted our first huge data import from CrunchBase. Up until that point we had been relying on our own experience-driven manual additions, as well as manually approved member submissions. The CrunchBase data was a great way of adding more companies, including historic ones, to the listings.
In the end the import added just under 10,000 companies to the database. That greatly increased the size of our dataset, but it also lowered the quality of the data, because it brought in companies that we would not originally have approved for the site.
Data trustworthiness is an important part of what we are working towards, so the next stage in the process was to remove the irrelevant listings from the dataset.