The Bad Data Handbook does a good job of cataloguing the many causes of bad data and how to deal with them to get useful results. Data sets need to be complete, coherent, correct, and traceable, and the right data analysis mechanisms and tools need to be used. An important example of the requirement for traceability and accountability is the Sarbanes-Oxley Act in the United States, which requires certification of various financial statements, with criminal consequences for lapses.
The book explains how data available to a data scientist or analyst today may contain a lot of information yet not be readily suitable for the stated objectives. Sometimes the collected dataset holds generic data, such as census information, which must be processed according to the analysis objectives; at other times it holds specific data, such as user feedback on a particular model of a popular car.
The book outlines how data obtained from various sources needs to be analysed and processed to meet specific requirements, keeping the data analysis objectives in focus throughout. Important considerations include choosing a dataset that satisfies the output requirements and choosing and designing correct primary keys. These are essential for analysis across entire data sets without losing comprehension.
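The value of a well-designed primary key can be sketched in a few lines. This is a minimal illustration with invented customer and order records, not an example from the book: a stable key lets records from different datasets be combined unambiguously.

```python
# Hypothetical datasets keyed on an invented customer ID.
customers = {
    "C001": {"name": "Alice", "region": "West"},
    "C002": {"name": "Bob", "region": "East"},
}
orders = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "C002", "amount": 75.5},
    {"customer_id": "C001", "amount": 30.0},
]

# With a consistent primary key, aggregating across datasets is unambiguous.
totals = {}
for order in orders:
    cid = order["customer_id"]
    totals[cid] = totals.get(cid, 0.0) + order["amount"]

report = {cid: (customers[cid]["name"], total) for cid, total in totals.items()}
print(report)  # {'C001': ('Alice', 150.0), 'C002': ('Bob', 75.5)}
```

Without such a key (for example, joining on customer names that may be duplicated or misspelled), the same aggregation silently produces bad data.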
One of the database-related topics covered in this book is the Entity-Relationship (E-R) diagram. The authors discuss how the E-R diagram should be consulted so that the inter-relationships between database structures are known to the data analysis team and can be readily used for data pivoting and other advanced analysis constructs.
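As a rough sketch of why those inter-relationships matter, consider two invented tables linked by a foreign key; knowing the relationship is what makes a cross-table pivot possible. The schema and data here are assumptions for illustration, built on Python's standard sqlite3 module.

```python
import sqlite3

# Hypothetical schema: employees reference departments via dept_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    dept_id INTEGER REFERENCES dept(id))""")
conn.executemany("INSERT INTO dept VALUES (?, ?)", [(1, "Sales"), (2, "R&D")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Ana", 1), (2, "Raj", 1), (3, "Mei", 2)])

# The documented relationship lets an analyst pivot headcount by department.
rows = conn.execute("""SELECT dept.name, COUNT(*) FROM employee
                       JOIN dept ON employee.dept_id = dept.id
                       GROUP BY dept.name ORDER BY dept.name""").fetchall()
print(rows)  # [('R&D', 1), ('Sales', 2)]
```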
The book also discusses choosing the correct data format, one in which the data can be viewed and manipulated for the analysis objectives. Raw data may require regular-expression filtering to produce the target dataset; otherwise it, too, falls into the category of bad data. Sometimes converting data into a plot or graph gives a useful pictorial output that is difficult to grasp from the numbers alone.
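Regular-expression filtering of this kind might look as follows. The log format is invented for this sketch: lines matching the expected pattern become structured records, and anything else is treated as bad data and skipped.

```python
import re

# Hypothetical raw input mixing well-formed and corrupted lines.
raw_lines = [
    "2023-04-01 temp=21.5C",
    "corrupted entry ###",
    "2023-04-02 temp=19.0C",
]
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2}) temp=([\d.]+)C$")

records = []
for line in raw_lines:
    m = pattern.match(line)
    if m:  # non-matching lines are bad data for this target dataset
        records.append((m.group(1), float(m.group(2))))

print(records)  # [('2023-04-01', 21.5), ('2023-04-02', 19.0)]
```

In practice the rejected lines would be logged rather than silently dropped, so the filtering itself stays traceable.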
Sometimes data needs to be converted from a format fit for human consumption to one fit for machine processing, and vice versa; the book discusses the unique challenges these requirements present. Text with an unknown or misrepresented character encoding can be bad data for a text-processing application. Errors due to incompatible encodings can be resolved by moving to a uniform Unicode character set, though normalizing text to standardized Unicode forms brings its own migration challenges, which the book also discusses.
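A small sketch of the Unicode side of this problem, using Python's standard unicodedata module: the same visible string can arrive as different code-point sequences, and only after normalization do the copies compare equal.

```python
import unicodedata

# "café" as a precomposed code point vs. a combining-accent sequence.
precomposed = "caf\u00e9"   # 'é' as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Naive comparison treats them as different, corrupting joins and counts.
assert precomposed != decomposed

# Normalizing both to NFC makes the strings compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
assert nfc_a == nfc_b

# Bytes of misrepresented encoding should be decoded with an explicit
# error policy so the problem surfaces instead of leaking mojibake.
raw = "café".encode("latin-1")
text = raw.decode("utf-8", errors="replace")
print(text)
```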
Another source of bad data explained here is application-specific escaping that leaks into plain text and degrades its quality: URL encoding, HTML encoding, and SQL escaping can all leak into text generated by a web application and create further bad data. Most websites have a robots.txt file that tells web crawlers which directories they may crawl, how often they may visit them, and so on. The authors share precautions for writing web crawlers, with or without the help of robots.txt, that help avoid capturing incorrect data.
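Both points can be sketched with Python's standard library; the sample strings and the robots.txt rules below are invented. The first half removes leaked HTML entities and URL percent-encoding from scraped text; the second uses urllib.robotparser to check whether a crawler is allowed to fetch a page at all.

```python
import html
import urllib.parse
from urllib.robotparser import RobotFileParser

# Leaked web-application escaping in scraped text (invented samples).
scraped = "Tom &amp; Jerry"                  # HTML entity leaked in
encoded = "bad%20data%20handbook"            # URL encoding leaked in

clean_html = html.unescape(scraped)
clean_url = urllib.parse.unquote(encoded)
print(clean_html)  # Tom & Jerry
print(clean_url)   # bad data handbook

# Respecting robots.txt before crawling (rules invented for the sketch).
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```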
Also covered are the pros and cons of the operations performed on survey data. By itself such data can contain inaccuracies due to misrepresentation of facts by the survey audience, causing gaps in the collected information. Imputing values for such data improves statistics such as means and variances; however, it is problematic for other purposes such as longitudinal analysis. Imputation bias and reporting error are two of the largest sources of bias in survey data; four other sources are topcoding/bottomcoding, seam bias, proxy reporting, and sample selection.
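The double-edged nature of imputation can be seen in a toy example. The responses below are invented, with None marking non-response; naive mean imputation restores the full sample size and leaves the estimated mean unchanged, but it also shrinks the sample variance, one way a downstream analysis can be biased.

```python
from statistics import mean, variance

# Hypothetical survey responses; None marks a non-response.
responses = [52.0, None, 47.0, None, 55.0, 50.0]

observed = [r for r in responses if r is not None]
fill = mean(observed)  # simple mean imputation

imputed = [fill if r is None else r for r in responses]

# The mean survives imputation unchanged...
print(mean(imputed) == mean(observed))           # True
# ...but the variance is understated relative to the observed values.
print(variance(imputed) < variance(observed))    # True
```

More careful schemes (regression or multiple imputation) exist precisely to mitigate this kind of distortion.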