May 4, 2014

Digital traces of human activity

An interesting article responding to detractors of the big data hype. I particularly like the definition of big data as "digital traces of human activity" that appears in the comments. That does sum up a large part of what people mean when they mention big data, but there are clearly many other sides to this coin.

February 20, 2014

Big data hype

Good article in The Register today that nicely summarises some of the hyperbole surrounding big data.

January 30, 2014

Big data: beyond the hype

Interesting article in The Register today summarising a recent Big Data conference. It says what I have been thinking for a while: that Big Data is really a rebranding of activities that have been occurring in industry and academia (e.g. high energy physics (HEP)) for years now. What has changed is that there are new software products (e.g. Hadoop) that make it easier for new people to enter the game and take more advantage of the data they have already been analysing (or wanting to analyse). As an example, in HEP we have been analysing highly structured datasets for years, using large CPU farms and tape/disk stores, but we just called this a batch system.

Much of the Big Data hype is actually about data analytics (or data "science"), particularly about bringing together disparate sources of information (often unstructured) and looking for patterns and correlations in them. In many cases, this is being facilitated by more and more organisations (governments, companies, universities...) releasing information that they have collected. Of course, there is an aspect of the data itself being "big" on some scale, so there are storage and data management issues to be addressed. For example, the experiments at the LHC have tens of PB of structured data that have been collected, processed, distributed around the globe and reprocessed multiple times over the past 3 years, but I suspect that this is not a typical use case for most organisations.

Something that is important for Big Data (particularly in scientific disciplines) is the issue of data archiving, software curation and documentation, so that results obtained from one data analysis using a particular set of software can be reliably reproduced in the future: not just next year, but in tens of years or more, perhaps when platforms and compilers have become obsolete. This could also be an issue for government and finance, where a record of all transactions needs to be retained. I would argue that this is certainly an unsolved problem.
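As a small illustration of the reproducibility point, here is a minimal sketch of recording which software produced a result. It assumes a Python-based analysis; the function names (`environment_record`, `sha256_of`), the package list and the example input bytes are all hypothetical. It only captures provenance and does nothing about platforms or compilers becoming obsolete, which is the hard part.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Collect interpreter, platform and package versions (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly rather than failing.
            versions[name] = None
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }


def sha256_of(data):
    """Checksum of the input bytes, so the exact dataset can be identified later."""
    return hashlib.sha256(data).hexdigest()


# Illustrative package list and input; a real analysis would hash its data files.
record = environment_record(["numpy", "scipy", "matplotlib"])
record["input_sha256"] = sha256_of(b"example input data")

print(json.dumps(record, indent=2, sort_keys=True))
```

Storing a record like this next to each result at least makes it possible to say which versions were used, even if rebuilding that environment years later is another matter.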

January 28, 2014

January 27, 2014

Useful data analytics tools


  • SciPy http://scipy.org/
    • which comprises...
    • Matplotlib http://matplotlib.org/
    • NumPy http://www.numpy.org/
    • pandas http://pandas.pydata.org/index.html
    • SymPy http://sympy.org/en/index.html
  • Anaconda https://store.continuum.io/cshop/anaconda/
    • A very handy distribution of Python libraries
  • PyROOT http://root.cern.ch/drupal/content/pyroot
    • Access to a wide range of HEP data analysis tools

Visualising quantum oscillations

The image below could (and probably should) be placed into any new book on particle physics. It shows the decay time distribution of B_s (B-sub-s) mesons decaying to two other particles (a D_s meson and a pion). This has two main components: (1) the exponential decay, which you can clearly see, and (2) a sinusoidal oscillation which modulates the exponential. This sinusoid is a clear sign of the quantum oscillations in the B_s system as matter oscillates to antimatter (B_s <-> B_s-bar). It was the product of a huge amount of work from the LHCb collaboration at CERN (of which I am a member); more details can be found here. Particle physics is definitely one example of a sector that has Big Data problems and, over the years, has come up with many Big Data solutions. More on this in a later post.
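The shape described above can be sketched numerically. This is a simplified, idealised model and not the LHCb analysis: it ignores detector resolution, acceptance, mistagging and the B_s width difference, and the lifetime and oscillation frequency are approximate, illustrative values. The names `TAU_PS`, `DELTA_M_PS` and `decay_rates` are my own.

```python
import numpy as np

# Approximate, illustrative values (assumptions, not fitted results).
TAU_PS = 1.5       # B_s lifetime in picoseconds
DELTA_M_PS = 17.8  # oscillation frequency in inverse picoseconds


def decay_rates(t, tau=TAU_PS, delta_m=DELTA_M_PS):
    """Return idealised (unmixed, mixed) decay rates at decay time t in ps."""
    t = np.asarray(t, dtype=float)
    envelope = np.exp(-t / tau) / tau  # the exponential decay
    # The cosine term is the oscillation that modulates the exponential.
    unmixed = 0.5 * envelope * (1.0 + np.cos(delta_m * t))
    mixed = 0.5 * envelope * (1.0 - np.cos(delta_m * t))
    return unmixed, mixed


t = np.linspace(0.0, 4.0, 2001)
unmixed, mixed = decay_rates(t)

# Summing the two recovers the pure exponential; their normalised
# difference is the oscillation itself, cos(delta_m * t).
total = unmixed + mixed
asymmetry = (unmixed - mixed) / total
```

Plotting `unmixed` and `mixed` against `t` with Matplotlib gives two curves oscillating out of phase under the same exponential envelope.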

Book review: The Big Data Revolution

Yesterday, I read The Big Data Revolution by J. and J. Kolb. It was a bit of an impulse buy: a couple of quid for the Kindle version and a couple of spare hours on a Sunday to read it. Unfortunately, I was disappointed. The book (if you can really call it that) is poorly written and edited, with many repetitions of the same points, and is peppered with adverts throughout. I guess you really do get what you pay for. If you are completely new to "Big Data", are at the board level of some company looking to make a move in this direction and have a couple of hours to spare at an airport, then you might take something away from this book. Otherwise, if you actually want to implement an algorithm or set up a system to gain insight into your data, then look elsewhere.

Git repository for analytics code

All code will be uploaded to this Git repository. For the moment, I'm focussing on Python, due to all of the nice features available in packages like Matplotlib, NumPy, SciPy, SymPy, etc.