Explore the data – find the story in multiple variables
Posted on August 23, 2016
We’re looking at data insights and communications. Put simply, that means asking a question, exploring data to find answers, and then explaining the results through visualizations with audience-specific context. Treating insights and communications separately, we started with gaining insights by exploring the data.
Recap five steps of data exploration
Our five steps were outlined in the data exploration post. We subsequently took a deep dive into the first step – ask a question, the second step – gather the data, the third step – select your tools, and the fourth step – format the data.
- Ask a question
- Gather the data
- Select your tools
- Format the data
- Explore the data
In this post we are finally at the step in this phase where we use the tools with the structured data to find out what the story is.
Explore the data
The exploration step is large and important enough that Nathan Yau (Data Points: Visualization That Means Something, 2013, p. 136) uses four steps to describe it:
- What data do you have?
  - Proportion
  - Time series
- What do you want to know about the data? [Remember that question we asked before selecting the data?]
  - Look for relationships
    - Correlation
    - Distribution
    - Comparison (single variable)
  - Check for differences
    - Comparison (multiple variables)
    - Multidimensional scaling
    - Outliers
- What visualization methods should be used?
- What’s the story? What do you see? Does it make sense?
Having looked at your data and considered the choices above in steps one and two, let’s look at steps three and four in a little more detail. The first part of this exploration step covered visualizations of a single variable. Most data is more complicated than that. We’re interested in multiple variables and the relationships or interactions between them.
What to do about multiple variables?
Look for relationships and/or differences between the variables. These techniques apply to proportions and time series with multiple variables.
Scatter plot: Plot the two variables of interest on the x-axis and y-axis and look for correlation. To see a third variable, encode it as the color or the size of the data points.
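A scatter plot lets you eyeball correlation; the Pearson coefficient puts a number on the same impression. Here is a minimal sketch in Python using only the standard library. The data (hours studied, exam score, and a group label standing in for the third variable) is made up for illustration, and the helper name `pearson` is my own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)

# Made-up illustration data: the two variables you would put on the axes.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 74]
# A third variable, which a scatter plot would show as color or point size.
group = ["a", "a", "b", "b", "a", "b"]

r_all = pearson(hours, score)
print(f"r (all points) = {r_all:.2f}")

# Repeat the calculation within each value of the third variable.
r_by_group = {}
for g in sorted(set(group)):
    xs = [h for h, k in zip(hours, group) if k == g]
    ys = [s for s, k in zip(score, group) if k == g]
    r_by_group[g] = pearson(xs, ys)
    print(f"r (group {g}) = {r_by_group[g]:.2f}")
```

A value near +1 or -1 means the points fall close to a straight line, and a value near 0 means no linear pattern. The number is a companion to the plot, not a replacement for it, since a curved relationship can score close to 0.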
Heat map: Are you trying to see more than three variables in a single view? Create a heat map, which uses color to show correlations across multiple variables. Try different sort orders to bring the correlations out. As these visualizations become more complex, it can be difficult to discover the story in them.
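One common way to build such a heat map is from a correlation matrix: one cell per pair of variables, colored by how strongly the pair is correlated. The sketch below computes that matrix with the standard library, sorts the variables by their correlation with one variable of interest, and prints text symbols where a real heat map would use color. The data and the names `correlation_matrix` and `shade` are assumptions for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name: {other_name: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def shade(r):
    """Text stand-in for a heat map cell: sign of r plus a strength symbol."""
    strength = " .:*#"[min(4, int(abs(r) * 5))]
    return ("-" if r < 0 else "+") + strength

# Made-up illustration data, one list per variable.
data = {
    "rainfall": [5, 1, 4, 2, 3],
    "coats": [50, 41, 30, 22, 9],
    "temperature": [10, 15, 20, 25, 30],
    "ice_cream": [20, 34, 41, 55, 62],
}

matrix = correlation_matrix(data)

# Sort variables by correlation with temperature, strongest positive first.
order = sorted(data, key=lambda name: matrix["temperature"][name], reverse=True)
for row in order:
    cells = " ".join(shade(matrix[row][col]) for col in order)
    print(f"{row:>12} {cells}")
```

Sorting puts similar variables next to each other, so blocks of strong cells stand out. Changing the sort key to another variable gives a different view of the same matrix.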
Parallel coordinates plot: Where the heat map uses color, this plot uses vertical position as the visual cue. Each variable gets its own vertical axis, and each record is drawn as a line running across the axes. Lines parallel? The variables are positively correlated. Lines cross? The variables are negatively correlated. Lines all mixed up? The variables are showing weak correlation.
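The "do the lines cross?" reading can be checked with a simple count. Between two adjacent axes, two lines cross exactly when the two records swap order from one axis to the other. The sketch below counts those crossings and turns them into Kendall's tau, a rank correlation that I am swapping in here because it matches the crossing idea directly. It assumes both axes increase in the same direction; the data and the function name `crossing_summary` are made up for illustration.

```python
from itertools import combinations

def crossing_summary(left, right):
    """Count crossing line pairs between two adjacent parallel axes.

    Record i is drawn as a line from left[i] to right[i]. Two lines cross
    when the records swap order between the axes.
    """
    if len(left) != len(right) or len(left) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    crossing = same_order = 0
    for i, j in combinations(range(len(left)), 2):
        product = (left[i] - left[j]) * (right[i] - right[j])
        if product < 0:
            crossing += 1      # order flipped: the two lines cross
        elif product > 0:
            same_order += 1    # order kept: the two lines do not cross
    total = len(left) * (len(left) - 1) // 2
    # Kendall's tau-a: with no ties, +1 if nothing crosses, -1 if every pair does.
    tau = (same_order - crossing) / total
    return crossing, same_order, tau

# Made-up illustration data.
temperature = [10, 15, 20, 25, 30]
ice_cream = [20, 34, 41, 55, 62]   # rises with temperature
coats = [50, 41, 30, 22, 9]        # falls as temperature rises
rainfall = [5, 1, 4, 2, 3]         # no clear pattern

for name, values in (("ice_cream", ice_cream), ("coats", coats), ("rainfall", rainfall)):
    crossing, same_order, tau = crossing_summary(temperature, values)
    print(f"temperature -> {name}: {crossing} crossings, tau = {tau:+.1f}")
```

No crossings lines up with "positively correlated", every pair crossing lines up with "negatively correlated", and a mix of both gives a tau near zero. Because this compares neighboring axes only, the order in which you place the axes changes what you can see.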
What is the data range? How is the data distributed?
Distributions also give information about proportion data and time series. At the simplest level you’ll look at the minimum and maximum data points, along with the mean, median, and mode. These numbers alone aren’t enough, because they can be misleading: two data sets can share the same mean and median and still look nothing alike. You also need to see the spread, or variance, to understand the context around the data. A box plot shows the range, median, and quartiles visually. The best way to visualize the full distribution is a histogram (which counts the data in discrete bins) or a density plot (for continuous data).
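To see why the summary numbers can mislead, here is a small sketch with two made-up data sets that have the same mean and median but very different spread. It computes the numbers a box plot draws and the bin counts a histogram draws, using Python's `statistics` module. The helper names `summarize` and `histogram` are mine, and quartile conventions differ between tools, so other software may report slightly different quartiles.

```python
import statistics

def summarize(values):
    """The numbers behind a box plot, plus mean and standard deviation."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def histogram(values, bin_width):
    """Count values per bin; each bin covers [start, start + bin_width)."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Made-up illustration data: same mean and median, different spread.
steady = [48, 49, 50, 50, 50, 51, 52]
spread = [20, 35, 50, 50, 50, 65, 80]

steady_summary = summarize(steady)
spread_summary = summarize(spread)

bin_width = 20
for name, values, summary in (
    ("steady", steady, steady_summary),
    ("spread", spread, spread_summary),
):
    print(name, summary)
    for start, count in histogram(values, bin_width).items():
        print(f"  [{start}, {start + bin_width}) {'#' * count}")
```

The mean and median match, but the standard deviation, the quartiles, and the histogram bars all show that the second data set is far more spread out.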
Is that finally it?
There is nothing simple or fast about finding the story by exploring data. Keep track of the visualizations that told you what you wanted to know. Also keep track of the visualizations that surprised you. These will come in handy when we begin to create the communication visuals. Don’t be tempted to simply pass along the visuals created in the exploration phase. If you explored a lot (and I know you did), these exploration visuals are much too complicated.
One last note: when observing correlations, don’t be tempted to attribute causation to data that appears to be correlated (or negatively correlated). Dig deep into the story, but don’t assume a cause from data analysis alone.
photo credit: Pixabay CC0 Public Domain