Some thoughts on data analytics, software tools and physics. https://github.com/gcowan/analytics
January 30, 2014
Big data: beyond the hype
Interesting article in The Register today summarising a recent Big Data conference. It says what I have been thinking for a while: that Big Data is really a rebranding of activities that have been going on in industry and academia (e.g. high energy physics (HEP)) for years now. What has changed is that there are new software products (e.g. Hadoop) that make it easier for new people to enter the game and start taking more advantage of the data they have already been analysing (or wanting to analyse). As an example, in HEP we have been analysing highly structured datasets for years, using large CPU farms and tape/disk stores, but we just called this a batch system.
Much of the Big Data hype is actually about data analytics (or data "science"), particularly about bringing together disparate sources of information (often unstructured) and looking for patterns and correlations in them. In many cases, this is being facilitated by more and more organisations (governments, companies, universities...) releasing information that they have collected. Of course, there is an aspect of the data itself being "big" on some scale, so there are storage and data management issues to be addressed. For example, the experiments at the LHC have tens of PB of structured data that have been collected, processed, distributed around the globe and reprocessed multiple times over the past 3 years, but I suspect that this is not a typical use case for most organisations.
Something that is important for Big Data (particularly in scientific disciplines) is the issue of data archiving, software curation and documentation, so that results obtained from one data analysis using a particular set of software can be reliably reproduced in the future. That means not just next year, but in tens of years or more, perhaps when platforms and compilers have become obsolete. This could also be an issue for government/finance, where a record of all transactions needs to be retained. I would argue that this is still an unsolved problem.
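One small piece of this can be sketched in code: recording, alongside each result, exactly which input files and which software environment produced it. The following is only a minimal illustration, not a solution to the archiving problem. The function names (`sha256_of`, `build_manifest`, `write_manifest`) and the `manifest.json` file name are hypothetical, chosen for this example, and a real analysis would also need to capture compiler versions, library versions and the code itself.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files):
    """Describe the inputs and the environment of one analysis run."""
    return {
        "python": sys.version,             # interpreter used for the run
        "platform": platform.platform(),   # OS and architecture string
        "inputs": {str(p): sha256_of(p) for p in data_files},
    }


def write_manifest(data_files, out_path="manifest.json"):
    """Write the manifest as JSON next to the results and return it."""
    manifest = build_manifest(data_files)
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Storing such a manifest with the output at least makes it possible to check, years later, whether the same inputs are being used, even if it does nothing to keep the old platform and compilers alive.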