Finding correlations in complex datasets

It is now almost three years since I moved to Boston to start working at Fathom Information Design and the Sabeti Lab at Harvard. As I noted back then, one of the goals of this work was to create new tools for exploring complex datasets (mainly epidemiological and health data) that could potentially contain up to thousands of different variables. After a process that went from researching visual metaphors suitable for exploring these kinds of datasets interactively, to learning statistical techniques that can quantify general correlations (not necessarily linear or between numerical quantities), to going over several iterations of internal prototypes, we finally released the 1.0 version of a tool called “Mirador” (the Spanish word for lookout), which attempts to bridge the space between raw data and statistical modeling. Please jump to Mirador’s homepage to access the software and its user manual, and continue reading below for more details about the development and design process.
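Quantifying general, not-necessarily-linear correlations usually calls for an information-theoretic measure rather than a linear one. As a rough illustration (this is a generic histogram-based estimator, not necessarily the statistic Mirador itself computes), mutual information between two numeric variables can be estimated by binning them jointly:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate mutual information (in bits) between two numeric
    variables by discretizing them into a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
noise = rng.normal(size=5000)
# strong dependence -> high MI; independence -> MI near zero
mi_dep = mutual_information(x, x + 0.1 * noise)
mi_ind = mutual_information(x, noise)
```

Unlike Pearson correlation, this score also picks up non-monotonic relationships, at the cost of needing a binning choice and a significance test.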

The first step in building a narrative out of data is arguably finding correlations between different magnitudes or variables in the data. For instance, the placement of roads is highly correlated with the anthropogenic and geographical features of a territory. A new, unexpected, intuition-defying, or polemic correlation will probably result in an appealing narrative. Furthermore, a visual representation of the correlation that succeeds in its aesthetic language or conceptual clarity is also part of an appealing “data-driven” narrative. Within scientific domains, these narratives are typically expressed in the form of a model that researchers can use to make predictions. Although fields like machine learning and Bayesian statistics have grown enormously in the past decades and offer techniques that allow the computer to infer predictive models from data, these techniques require careful calibration and overall supervision from the expert users who run the learning and inference algorithms. A key consideration is which variables to include in the inference process, since too few variables might result in a highly biased model, while too many would lead to overfitting and large variance on new data (the so-called bias-variance dilemma).

Leaving aside model building, an exploratory overview of the correlations in a dataset is also important in situations where one needs to quickly survey association patterns in order to understand ongoing processes, for example, the spread of an infectious disease or the relationship between individual behaviors and health indicators. The early identification of (statistically significant) associations can inform decision making and eventually help to save lives and improve public policy.

With this background in mind, three years ago we embarked on the task of creating a tool that could assist data exploration and model building by providing a visual interface for finding and inspecting correlations in general datasets, with a focus on public health and epidemiological data. David Reshef’s thesis work on his tool VisuaLyzer was our starting point. Once we were handed the initial VisuaLyzer prototype, we carried out a number of development and design iterations at Fathom, which redefined the overall workspace in VisuaLyzer but kept its main visual metaphor for data representation intact. Within this metaphor, the data is presented in “stand-alone” views such as scatter plots, histograms, and maps, where several “encodings” can be defined at once. An encoding is a mapping between the values of a variable in the dataset and a visual parameter: for example, the X and Y coordinates, size, color, and opacity of the circles representing data instances. This approach of defining multiple encodings in a single “large” data view is similar to what the Gapminder World visualization does. Below is a screenshot of the scatter view in the final version of VisuaLyzer that resulted from the redesign:


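The encoding idea can be made concrete with a small sketch. The `encode` function below is hypothetical (it is not VisuaLyzer’s actual API); it simply normalizes each mapped column into the [0, 1] range, so a renderer can then turn each channel into pixel coordinates, radii, or color values:

```python
def normalize(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

def encode(rows, encodings):
    """Map each data row to a dict of visual channel values.
    `encodings` maps a channel name ('x', 'size', ...) to a column name."""
    channels = {ch: normalize([row[col] for row in rows])
                for ch, col in encodings.items()}
    return [{ch: vals[i] for ch, vals in channels.items()}
            for i in range(len(rows))]

# hypothetical health-survey rows, for illustration only
rows = [{"age": 30, "bmi": 22.0}, {"age": 50, "bmi": 31.5}, {"age": 40, "bmi": 27.0}]
marks = encode(rows, {"x": "age", "y": "bmi", "size": "bmi"})
# each mark now carries normalized x, y, and size values in [0, 1]
```

Defining several channels at once on a single view is what lets one “large” plot carry three or four variables simultaneously.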
VisuaLyzer also had a “network” view that generated a force-directed network based on the pairwise correlations between all the variables in the dataset. However, this view involves an entirely different set of questions in terms of visual representation, numerical estimation, and statistical meaning that were ultimately outside the initial scope of Mirador, so I won’t discuss it here; it was not included in the 1.0 release.

As we started to work with larger datasets, particularly the National Health and Nutrition Examination Survey (NHANES), we encountered some limitations in VisuaLyzer’s interface, as it was not designed to handle situations where the number of variables could scale up to the thousands. In addition, health and epidemiological datasets are often of mixed nature, meaning that both numerical variables (age, height, etc.) and categorical variables (gender, education level, etc.) need to be represented side by side. In this regard, the views we originally implemented in VisuaLyzer were not designed with categorical variables in mind. Finally, we needed an underlying statistical module that would not only compute correlation scores between arbitrary pairs of variables, but could also determine the statistical significance of the observed correlations.
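One generic way to attach significance to an observed correlation, whatever the score, is a permutation test: shuffle one variable to destroy any real association, and see how often the shuffled score matches or exceeds the observed one. The sketch below is illustrative, not Mirador’s actual significance procedure:

```python
import numpy as np

def permutation_pvalue(x, y, score, n_perm=1000, seed=0):
    """p-value for the null hypothesis that x and y are independent:
    shuffle y to build the null distribution of the score."""
    rng = np.random.default_rng(seed)
    observed = abs(score(x, y))
    null = [abs(score(x, rng.permutation(y))) for _ in range(n_perm)]
    # add-one correction so the p-value is never exactly zero
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)          # genuinely associated with x
pearson = lambda a, b: np.corrcoef(a, b)[0, 1]
p = permutation_pvalue(x, y, pearson)       # should come out very small
```

The same recipe works for any association score, including ones defined for categorical or mixed-type pairs, which is what makes it attractive for this kind of dataset.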

After some experiments, a correlation matrix (also called a corrplot or corrgram) seemed like a good visual metaphor to represent correlations in large datasets, because of its generality and scalability, while at the same time being relatively familiar to users of statistical software. Some early Processing sketches using NHANES data looked like this:

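The data behind such a matrix is simply the pairwise association score between every two columns. A minimal sketch, with plain Pearson correlation standing in for whatever score a given tool uses:

```python
import numpy as np

def correlation_matrix(columns, score):
    """Symmetric matrix of pairwise scores over named columns.
    `score` is any symmetric association measure."""
    names = list(columns)
    n = len(names)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):   # compute each pair once, mirror it
            m[i, j] = m[j, i] = score(columns[names[i]], columns[names[j]])
    return names, m

rng = np.random.default_rng(2)
a = rng.normal(size=100)
cols = {"a": a,
        "b": a + 0.1 * rng.normal(size=100),  # strongly tied to "a"
        "c": rng.normal(size=100)}            # unrelated
abs_pearson = lambda u, v: abs(np.corrcoef(u, v)[0, 1])
names, m = correlation_matrix(cols, abs_pearson)
```

With thousands of columns the quadratic number of cells is exactly what makes zooming and sorting necessary in the interface.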

Still, the large number of variables in this dataset made it hard to read detailed information, and called for some sort of zooming functionality:


We discussed these ideas with Yonatan Grad (a research fellow at the Harvard School of Public Health who gave us invaluable feedback during the development of Mirador), and we concluded that a very useful feature would be to sort the columns of the correlation matrix based on their correlation ranking with respect to a variable of interest. For example, a user studying the factors that influence the prevalence of obesity might want to get the list of variables that have the highest correlation with the obesity variable, arranged from highest to lowest. Secondly, most of these analyses need to be restricted to specific subpopulations in the sample data, for instance a certain age range or ethnic group. These additional factors or “covariates” should control the column rankings, and the visualization as a whole, by restricting it to data ranges of interest. By making all these operations (correlation calculation, column sorting, range setting, plot updating) interactive, the hypothesis-generation task becomes a more dynamic process where the intuition of the expert user is guided by the immediate visual feedback that results from constraining and re-arranging the data through Mirador’s interface.
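The sorting-plus-covariates workflow can be sketched as follows; the function and the synthetic columns are made up for illustration. Rows are first restricted to the covariate ranges, and only then are the remaining variables ranked by their association with the target:

```python
import numpy as np

def rank_by_association(data, target, covariate_ranges, score):
    """Rank variables by association with `target` after restricting
    rows to the given covariate ranges (e.g. an age bracket)."""
    mask = np.ones(len(data[target]), dtype=bool)
    for var, (lo, hi) in covariate_ranges.items():
        mask &= (data[var] >= lo) & (data[var] <= hi)
    scores = {var: score(data[var][mask], data[target][mask])
              for var in data if var != target}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rng = np.random.default_rng(3)
n = 1000
age = rng.uniform(20, 80, size=n)
activity = rng.normal(size=n)
bmi = 0.05 * age - 0.5 * activity + rng.normal(size=n)  # synthetic data
data = {"age": age, "activity": activity, "bmi": bmi}
abs_pearson = lambda u, v: abs(np.corrcoef(u, v)[0, 1])
# restricting to the 40-60 age bracket weakens age's own correlation,
# so "activity" should rise to the top of the ranking
ranking = rank_by_association(data, "bmi", {"age": (40.0, 60.0)}, abs_pearson)
```

Recomputing this ranking on every range change is what the interactive selectors in Mirador’s interface correspond to.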

All these concepts came together around a year ago in the first working prototype of Mirador, which at the time was called “Zye” (the Haitian Creole word for eye). The basic structure of the program is an interactive correlation matrix, where a plot (scatter, histogram, etc.) between variables X and Y is presented at the intersection of the X column and the Y row. Covariate variables Z can optionally be displayed at the bottom of the interface. Each column, row, and covariate has a selector that allows the user to set the range of values and update the plots and correlation calculations accordingly. The following is a screenshot of that first version:


Although most of the basic functionality was already in place, we needed to start working on the visual design. Terrence Fradet, a very talented designer from Fathom, took on this task, and very quickly we started iterating on a series of interface compositions, of which the following image is an example:


After several rounds of interface design, we reached a version that was stable and refined enough for internal use at the Sabeti Lab:


At this point, we realized that, in order to make a public release, we needed some internal reorganization of the code and a final round of visual design. Three significant issues came up during the testing phase: first, threading problems that hurt the responsiveness of the interface when many plots and correlations were computed in the background; second, the UI framework bogged down when loading large datasets created many (up to thousands of) widgets, since each row and column required its own widget and range selector; and third, “feature creep,” as additional functions such as a map view and variable building were added to try out different interface ideas or to obtain specific results.

Therefore, we spent the last couple of months taking care of these issues and preparing the 1.0 release that we made available a few days ago. The interface now looks as follows:


And although we removed non-essential functionality, we still managed to include the sorting feature:


Mirador is freely available as a downloadable application for OS X and Windows at this page, where you can also find the user manual. The announcement post on the Fathom blog also contains a short video showing the basic operation of Mirador.


Posted June 18, 2014 by ac in Science, Software, Statistics, Visualization



