Data is produced by the megabyte, even the gigabyte, every day. What do you do with all that information? How do you turn it into profitable knowledge? Philipp K. Janert’s Data Analysis with Open Source Tools is written to help the reader tackle this thorny problem by orienting him or her in the field of general data analysis in the business environment. The book discusses ways of investigating data in order to recover the structures responsible for it, capture those structures in models, and share the models’ implications with the organization through business plans, metric dashboards and other methods.
While the book is a great conceptual outline of the data analysis process, it is not for beginners: it lacks the worked examples and the more granular discussion of a textbook. It is written primarily for programmers, readers who can take a concept and implement it in a programming language of their own choosing, and so it suits those who already have a certain proficiency in analyzing problems and thinking analytically. In the case of kernel density estimates and splines, for example, there is only a discussion of the formula but no example implementation, as the reader is assumed to be able to write one on his or her own.
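To give a sense of the kind of step the reader is expected to fill in, here is a minimal sketch of a kernel density estimate in Python. It assumes the standard Gaussian-kernel form, f(x) = 1/(n h) * sum of K((x - x_i)/h), which may differ in detail from the book's presentation; the function name `gaussian_kde` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x.

    Assumes a Gaussian kernel; `bandwidth` (h) controls the smoothing.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    n = len(samples)
    # Normalization so that the estimate integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    estimate = gaussian_kde([1.0, 1.5, 2.0, 4.0], bandwidth=0.5)
    for x in (0.0, 1.5, 3.0, 4.0):
        print(f"f({x}) = {estimate(x):.4f}")
```

The bandwidth is the one real design decision here: a small value follows every sample closely, while a large value smooths the structure away.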
The book has four sections: graphing data, modeling data, mining data and using data. It discusses a number of open source data tools such as R, Sage and Python in the “Workshop” sections following each chapter. The workshops are meant to explore the purpose of various open source tools and libraries, and Janert discusses the architecture of the libraries to give the reader an idea of whether a tool is worth “spending time on.”
There is a certain level of math here that, depending on your math training, may induce anything from boredom, intimidation and frustration to excitement. A calculus background is helpful but not necessary, as is some knowledge of statistics. Janert does a good job of explaining the mathematical concepts he presents, so it is possible to keep up even if one does not fully understand a given notion. The book has its limits: it is not meant to be a book on analysis of scientific data, formal statistical analysis, network analysis, text mining, or Big Data.
The best parts of the book are the section-end “intermezzos,” where Janert describes data analysis in depth. In the first intermezzo, he takes carbon dioxide measurements from above Mauna Loa in Hawaii and, layer by layer, recovers the different components of the time series to arrive at a model of the series. In the second intermezzo, Janert engages in mythbusting, discussing averaging averages, the problems with standard deviation, and the proper application of least squares. And in the third intermezzo, Janert offers some suggestions for managing projects involving heavy computation or combinatorially complex issues. Also informative are Janert’s discussions of classical statistics and its limitations.
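The "averaging averages" myth is easy to see with a few lines of Python. This is a sketch of one common form of the pitfall, not necessarily the example Janert uses: when groups differ in size, the mean of the group means is not the mean of all the observations. The function names and the numbers are hypothetical, made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def mean_of_means(groups):
    """Average the per-group averages, ignoring group sizes."""
    return mean([mean(g) for g in groups])


def pooled_mean(groups):
    """Average over every observation, so larger groups weigh more."""
    return mean([v for g in groups for v in g])


if __name__ == "__main__":
    # Illustrative data: one small group and one large group.
    small = [10.0, 20.0]   # mean 15
    large = [30.0] * 8     # mean 30
    groups = [small, large]

    print("mean of means:", mean_of_means(groups))
    print("pooled mean:  ", pooled_mean(groups))
```

The two results agree only when every group has the same number of observations; otherwise the per-group means have to be weighted by group size.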
The emphasis on thinking informs the book throughout, and the end result is that the reader is challenged to think about problems of data analysis and the application of different analytical methods and tools. This is definitely not a book that simply teaches how to apply formulas; Janert encourages the reader to learn to think about data analysis problems on a conceptual level.