May 4, 2014

Digital traces of human activity

Interesting article responding to detractors of big data hype. I particularly like the definition of big data as "digital traces of human activity" that appears in the comments. That does sum up a large part of what people are discussing when they mention big data, but there are clearly many other sides to this coin.

February 20, 2014

Big data hype

Good article in The Register today that nicely summarises some of the hyperbole surrounding big data.

January 30, 2014

Big data: beyond the hype

Interesting article in The Register today summarising a recent Big Data conference. It says what I have been thinking for a while: that Big Data is really a rebranding of activities that have been going on in industry and academia (e.g. high energy physics, HEP) for years now. What has changed is that there are new software products (e.g. Hadoop) that make it easier for new people to enter the game and take more advantage of the data they have already been analysing (or wanting to analyse). As an example, in HEP we have been analysing highly structured datasets for years, using large CPU farms and tape/disk stores, but we just called this a batch system.

Much of the Big Data hype is actually about data analytics (or data "science"), particularly about bringing together disparate sources of information (often unstructured) and looking for patterns and correlations in them. In many cases, this is being facilitated by more and more organisations (governments, companies, universities...) releasing information that they have collected. Of course, there is an aspect of the data itself being "big" on some scale, so there are storage and data management issues to be addressed. For example, the experiments at the LHC have tens of PB of structured data that have been collected, processed, distributed around the globe and reprocessed multiple times over the past 3 years, but I suspect that this is not a typical use case for most organisations.

Something that is important for Big Data (particularly in scientific disciplines) is data archiving, software curation and documentation, so that results obtained from one data analysis using a particular set of software can be reliably reproduced in the future: not just next year, but in tens of years or more, perhaps when platforms and compilers have become obsolete. This could also be an issue for government/finance, where a record of all transactions needs to be retained. I would argue that this is certainly an unsolved problem.
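
To make the reproducibility point a little more concrete, here is a minimal sketch of one small piece of it: storing a result together with a checksum of its input and a description of the software environment that produced it. It assumes Python 3.8 or later and uses only the standard library. The function names (`environment_record`, `provenance_record`) and the toy input are hypothetical, invented for illustration, and this does nothing about the harder problem of still being able to run the software decades from now.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Describe the software environment an analysis ran in."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def provenance_record(input_bytes, result, packages):
    """Bundle a result with a checksum of its input and the environment."""
    return {
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "result": result,
        "environment": environment_record(packages),
    }


if __name__ == "__main__":
    # Toy "dataset": three numbers whose mean stands in for the analysis result.
    data = b"1.0,2.0,3.0\n"
    values = [float(x) for x in data.decode().strip().split(",")]
    mean = sum(values) / len(values)
    record = provenance_record(data, mean, ["numpy", "pandas"])
    print(json.dumps(record, indent=2, sort_keys=True))
```

Archiving a record like this next to each result at least tells a future reader which inputs and software versions to go looking for, even if rebuilding that environment remains the hard part.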

January 28, 2014

January 27, 2014

Useful data analytics tools


  • SciPy http://scipy.org/
    • which comprises...
    • Matplotlib http://matplotlib.org/
    • NumPy http://www.numpy.org/
    • pandas http://pandas.pydata.org/index.html
    • SymPy http://sympy.org/en/index.html
  • Anaconda https://store.continuum.io/cshop/anaconda/
    • A very handy distribution of Python libraries
  • PyROOT http://root.cern.ch/drupal/content/pyroot
    • Access to a wide range of HEP data analysis tools