Data is being produced by the megabyte and the gigabyte, daily. What do you do with all that information? How do you turn it into profitable knowledge? Philipp K. Janert's Data Analysis with Open Source Tools is written to help a reader tackle this thorny problem by orienting him or her in the field of general data analysis in the business environment. The book discusses ways of investigating data in order to recover the structures responsible for it, capture those structures in models, and share the models' implications with the organization through business plans, metric dashboards and other methods.
While the book is a great conceptual outline of the process of data analysis, it is not for beginners: it lacks the worked examples and more granular discussion of a textbook. The book is written primarily for programmers, readers who can take a concept and implement it in a programming language of their own choosing, and so it is for those who already have a certain level of proficiency in analyzing problems and thinking analytically. In the case of kernel density estimates and splines, for example, there is only a discussion of the formula, but no example implementation, as the reader is assumed to be able to write one on his or her own.
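To give a sense of what turning the kernel density estimate formula into code involves, here is a minimal sketch in Python. It assumes a Gaussian kernel and a hand-picked bandwidth; the function name `gaussian_kde_estimate`, the sample values and the grid are illustrative choices and are not taken from the book.

```python
import math


def gaussian_kde_estimate(samples, x, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the point x.

    Each sample contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Normalization so that the estimate integrates to 1 over the real line.
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


# Made-up data, purely for illustration.
samples = [1.0, 1.5, 2.0, 4.0, 4.2]
step = 0.1
grid = [i * step for i in range(-30, 91)]  # points from -3.0 to 9.0
density = [gaussian_kde_estimate(samples, x, 0.5) for x in grid]

# Rough check: a Riemann sum over the grid should be close to 1.
area = sum(density) * step
print(area)
```

The bandwidth controls how smooth the resulting curve is, and choosing it well is the harder part of the exercise; the value 0.5 here is an arbitrary assumption.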
The book has four sections: graphing data, modeling data, mining data and using data. It discusses a number of open source data tools such as R, Sage and Python in the “Workshop” sections following each chapter. The workshops are meant to explore the purpose of various open source tools and libraries, and Janert discusses the architecture of tool libraries to give the reader an idea of whether a tool is worth “spending time on.”
There is a certain level of math here that, depending on your level of math training, may induce anything from boredom, intimidation and frustration to excitement. A calculus background and some knowledge of statistics are helpful but not necessary. Janert does a good job of discussing the mathematical concepts that he presents, so it is possible to keep up even if one does not fully understand every notion. The book has its limits: it is not meant to be a book on analysis of scientific data, formal statistical analysis, network analysis, text mining, or Big Data.