Explore data – find the story from a single variable
Posted on August 17, 2016
We’re looking at data insights and communications. Basically, that means asking a question, exploring data to find answers, and then explaining the results with visualizations that carry audience-specific context. Separating insights from communications, we started with gaining insights from exploring the data.
Recap five steps of data exploration
Our five steps were outlined in the data exploration post. We subsequently took a deep dive into the first step – ask a question, the second step – gather the data, the third step – select your tools, and the fourth step – format the data.
- Ask a question
- Gather the data
- Select your tools
- Format the data
- Explore the data
In this post we are finally at the step in this phase where we use the tools with the structured data to find out what the story is.
Explore the data
The exploration process in exploratory analysis is large and important enough that Nathan Yau (Data Points: Visualization That Means Something, 2013, p. 136) uses four steps to describe it:
- What data do you have?
  - Time series
- What do you want to know about the data? [Remember that question we asked before selecting the data?]
  - Look for relationships
  - Comparison (single variable)
  - Check for differences
  - Comparison (multiple variables)
  - Multidimensional scaling
- What visualization methods should be used?
- What’s the story? What do you see? Does it make sense?
Having looked at your data and considered the choices above in steps one and two, let’s investigate steps three and four in a little more detail.
Exploratory visualization methods, what’s the story?
Let’s start by setting expectations. When you’re exploring, multiple charts and visuals will help you figure out the story and answer the question. Plus, it should be no surprise at all that when you’re looking at multiple visuals, many may not tell you anything and will be abandoned. Many more will cause you to ask more clarifying questions and lead to the need to create even more visualizations to explore the new questions. That’s perfectly acceptable. Expect it. Take your time. Look, search, think about what you’re seeing. Put it down and come back to it with fresh eyes. Show it to a colleague or several. What do they see? Is it different from what you’re seeing? All this analysis helps you find the story. Creating exploratory visuals that you don’t use again isn’t a waste of time. To the contrary, those abandoned visuals help you define and refine the story.
The next sections will show how the visualization methods (step 3) provide insights based on data type (step 1) and what you want to know (step 2). In using the methods, you determine the story (step 4).
Proportion – single variable
These datasets can be examined by the full population, categories, and/or subcategories.
Categories: A bar chart is one of the most straightforward visualization methods. It shows a single dimension and uses length as the visual cue. Do you need an additional dimension? Try a symbol chart (e.g. circles or boxes), but be aware that size differences between symbols are harder to judge than lengths and can therefore distort the story.
Parts of a whole: The parts add up to 100%. Use a single stacked bar, where the width has no meaning. The much-maligned pie chart is also used to show parts of a whole. The pie is denigrated because areas and angles are difficult to judge, and if there are many tiny slices you may not be able to see them at all.
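As a rough sketch of the parts-of-a-whole idea, the Python below turns raw counts into percentages and draws a single stacked bar as text. The survey counts and the helper names (shares, stacked_bar) are invented for illustration; any charting tool can do the same job.

```python
def shares(counts):
    """Convert raw counts per category into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


def stacked_bar(percentages, width=50):
    """Render one stacked bar as text, one letter per category.

    Segment lengths are rounded, so with awkward percentages the bar
    may come out a character longer or shorter than `width`.
    """
    bar = ""
    for name, pct in percentages.items():
        bar += name[0].upper() * round(width * pct / 100)
    return bar


# Hypothetical survey counts, invented for this example.
counts = {"agree": 30, "neutral": 10, "disagree": 10}
pct = shares(counts)

for name, value in pct.items():
    print(f"{name:<10}{value:5.1f}%")
print(stacked_bar(pct))
```

Only the length of each segment carries meaning here, which is the same cue a single stacked bar relies on.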
Subcategories: A treemap will show hierarchical structure using size, shape, and color. A mosaic plot adds another dimension to the stacked bar that allows you to compare across multiple categories in one view. It shows proportion within categories and category combinations, and it gets complex fast.
With proportion data, note the minimum and maximum quickly with simple data sorting. Check out the spread or variance by looking at the distribution. Look for structure, patterns across categories and subcategories. Check for relationships and/or differences.
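Here is a minimal sketch of that first pass, using only the Python standard library. The regional counts are made up for the example, and the population standard deviation is just one of several reasonable ways to summarize spread.

```python
import statistics

# Hypothetical category counts, invented for this example.
counts = {"north": 42, "south": 17, "east": 63, "west": 28}

# Sorting puts the maximum first and the minimum last.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
largest, smallest = ranked[0], ranked[-1]

# Two simple measures of spread: the range and the standard deviation.
values = list(counts.values())
value_range = max(values) - min(values)
spread = statistics.pstdev(values)  # population standard deviation

# A quick text bar chart of the sorted categories.
for name, value in ranked:
    print(f"{name:<6}{value:>4} {'#' * (value // 3)}")
print("max:", largest, "min:", smallest, "range:", value_range)
print(f"standard deviation: {spread:.1f}")
```

Sorting before plotting also makes the bar chart easier to read, because the ranking is visible at a glance.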
Time series – single variable
These datasets are examined looking at patterns over time.
Discrete data: That straightforward bar chart is your friend again. It’s most useful when the length of the bar is meaningful. If the point at the top of the bar is more important, use a point chart (sometimes referred to as a dot chart). Don’t connect the dots with a line unless the connecting line has meaning. Point bar charts show the length of the bar and emphasize the point at the top (or the end, if the chart is horizontal), and the bar can be as thin as a simple line from the axis to the point.
Continuous data: Use a line chart. Sometimes this is simply connecting the dots. Consider carefully the slope of the line, because it implies how the value behaved between observations. If the values change in steps, use a step chart. If the values change smoothly, connect adjacent dots with straight line segments.
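The difference between a step chart and a connect-the-dots line is really a difference in what you assume happened between observations. The sketch below shows both readings of the same points; the price series and the function names are hypothetical.

```python
from bisect import bisect_right


def step_value(times, values, t):
    """Step chart reading: hold the latest observation until the next one."""
    i = bisect_right(times, t) - 1
    if i < 0:
        raise ValueError("t is before the first observation")
    return values[i]


def linear_value(times, values, t):
    """Line chart reading: move along the straight segment between neighbours."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the observed range")
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    t0, t1 = times[i], times[i + 1]
    v0, v1 = values[i], values[i + 1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical series: a price observed on days 0, 10, and 20.
days = [0, 10, 20]
price = [1.00, 1.50, 1.25]

print(step_value(days, price, 5))    # 1.0
print(linear_value(days, price, 5))  # 1.25
```

On day 5 the two readings disagree, so pick the chart whose assumption matches how the data actually behaves.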
Does the time series show cycles? Create a radial plot emphasizing the apparent interval. Maybe a calendar heat map using color can show the patterns. Look for changes over time. Explore all the chart types to determine the period of interest. Do you see patterns based on days or weeks, time of day or day of week? Again, you’re looking for structure, patterns across time intervals. Check these visualizations for relationships and/or differences.
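One quick way to check for a weekly cycle before drawing a radial plot or calendar heat map is to group the values by day of week and compare the averages. This is only a sketch: the visit counts are invented, and mean_by_weekday is a hypothetical helper, not a library function.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def mean_by_weekday(series):
    """Average the values of (date, value) pairs for each day of the week."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[day.weekday()].append(value)  # Monday is 0, Sunday is 6
    return {WEEKDAYS[i]: mean(buckets[i]) for i in sorted(buckets)}


# Hypothetical four weeks of daily visits: 100 on weekdays, 40 on weekends.
start = date(2016, 8, 1)  # a Monday
series = []
for offset in range(28):
    day = start + timedelta(days=offset)
    series.append((day, 40 if day.weekday() >= 5 else 100))

# Text stand-in for a chart: one bar per day of the week.
for name, avg in mean_by_weekday(series).items():
    print(f"{name} {avg:6.1f} {'#' * int(avg // 10)}")
```

If the bars differ clearly by day, the weekly interval is worth emphasizing in the real visualization. The same grouping works for hour of day or month of year.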
Is that it?
Of course not. Data is complicated and messy. What about multiple variables? How do we look at multiple variables together? Stay tuned for the next post.
photo credit: Pixabay CC0 Public Domain