Exploring data – select tools
Posted on August 11, 2016
We’re looking at data insights and communications. Basically, that means asking a question, exploring data to find answers, and then explaining the results with visualizations framed in audience-specific context. Separating insights from communications, we started with gaining insights by exploring the data.
Recap: five steps of data exploration
Our five steps were outlined in the data exploration post. We subsequently took a deep dive into the first step (ask a question) and the second step (gather the data).
- Ask a question
- Gather the data
- Select your tools
- Format the data
- Explore the data
Select your tools
Recently I saw a post where Justin Megahan considers the difference between data science and statistics, and asks whether data science is really a thing. I asked myself, and the online community that read my comments on the post: is there really a difference between data science and statistics? Is data science simply more up-to-date terminology? Or do the new tools make data science a new discipline, instead of simply updated jargon for statistics?
What is Statistics?
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. According to the American Statistical Association, it is the science of learning from data. What? Isn’t that what we’re writing about? And yet I haven’t used the term, statistics. Why bring it up now, when this post is supposed to be about tools? In my opinion, it’s the exploratory tools that bring us back to basics, and the basics are applied statistics.
To be sure, high school statistics (and even college statistics) was quite concerned with hypothesis testing, and that made most people’s heads explode. Another main precept of old-school statistics was working with data samples: old-school statistical tools couldn’t handle huge, full-population datasets. That’s no longer an issue (as long as computing power and speed aren’t), so ensuring a tool operates on a representative sample of the data often isn’t even a consideration.
Caveat: I’m trained as a chemist and chemical engineer, so I still ask how the data was collected and what experimental design led to creating it. I’d probably argue that no dataset is a full, complete population. In other words, the dataset you’re working with is more than likely a sample, not the exhaustive population, and it should be treated and thought of as a sample regardless of how big it is. Nevertheless…
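The sample-versus-population distinction above is easy to see in code. Here’s a small sketch using Python’s standard-library statistics module (the measurement values are made up for illustration):

```python
# Illustrating sample vs. population statistics with Python's stdlib.
import statistics

measurements = [19.8, 20.1, 20.4, 19.9, 20.2]  # hypothetical readings

# pstdev treats the data as the entire population; stdev treats it as a
# sample and applies Bessel's correction (dividing by n - 1 instead of n).
pop_sd = statistics.pstdev(measurements)
samp_sd = statistics.stdev(measurements)

print(f"population sd: {pop_sd:.4f}")
print(f"sample sd:     {samp_sd:.4f}")
```

The sample estimate is always slightly larger, and the gap shrinks as the dataset grows — which is why treating a big dataset as a sample is a safe default.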
Back to Tool Selection
The purpose of exploring the data is to look for the story that answers your question. What tool(s) will you use? The simplest tool is pencil and paper. Really. Hold on, before you click away, remember we are exploring the data. Maybe the first thing to do is actually look at a sample of the data to make sure it makes sense. Before you invest too much time (and money) in tools, you need to be sure the data is valid. Assuring no garbage in at every step eliminates one potential source of garbage out. That said, here is a list (not comprehensive, of course) of tools to consider. There’s no need to be fancy at this point. You’re exploring, i.e., looking for patterns. The easier the tool, the better, so you can try a number of analyses and be confident you have found the story.
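The “look at a sample first” step can even be scripted. Here’s a minimal sketch in Python, using only the standard library; the file contents and column names are hypothetical stand-ins for whatever you’re working with:

```python
# Peek at a few rows and count obviously bad values before committing
# to any heavier tool. The data below is a hypothetical stand-in;
# in practice you'd use open("yourdata.csv") instead of io.StringIO.
import csv
import io

raw = io.StringIO(
    "city,population\n"
    "Springfield,167000\n"
    "Shelbyville,\n"          # missing value
    "Capital City,n/a\n"      # non-numeric junk
)

rows = list(csv.DictReader(raw))

# Flag rows whose population field isn't a plain number.
bad = [r for r in rows if not r["population"].strip().isdigit()]
print(f"{len(rows)} rows read, {len(bad)} with suspect population values")
for r in bad:
    print("suspect:", r)
```

A few minutes of this kind of checking is the coding equivalent of the pencil-and-paper pass: it catches garbage in before it becomes garbage out.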
List of popular analysis tools
I have stayed away from referring to the tools as visualization tools. Indeed, they are visualization tools and always have been — even the pencil and paper. The idea is to chart or graph your data using the tools and look at it, or visualize it, to determine the story. It’s not yet time to make sure you have attractive, influential charts and graphs.
- Microsoft Excel – ubiquitous, and quite robust for exploration; part of the popular Office software suite
- Google Sheets – cloud based, similar in look/feel to Excel
- Google Charts – cloud based, requires coding skills
- Watson Analytics – IBM analytics tool, the IBM Many Eyes public cloud-based tool closed in 2015
- Tableau – available for the private desktop or publicly online
- Python – requires coding skills, handles large datasets
- PHP – requires coding skills, use with a MySQL database for hefty datasets
- R – classic statistics tool, open source (S+ and SAS are paid alternatives to R)
At this point in the process you’re simply selecting a tool that can handle the dataset you’re working with. Selecting the tool means you now use its documentation to ensure the dataset is in a format that can be imported into the tool. As much as possible, you want to avoid manually entering data, which introduces an additional opportunity for errors.
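As one concrete sketch of importing instead of retyping: if your tool of choice is Python, the standard csv module can read a file and convert text fields to numbers in one pass. The file contents and column names below are hypothetical:

```python
# Read a CSV and convert text fields to proper types at import time.
# The data is a hypothetical stand-in; in practice you'd use
# open("yourdata.csv") instead of io.StringIO.
import csv
import io

raw = io.StringIO("year,sales\n2014,1200\n2015,1350\n2016,1500\n")

records = []
for row in csv.DictReader(raw):
    # Convert each field explicitly; a failed conversion surfaces a
    # data problem here, at import, not deep inside the analysis.
    records.append({"year": int(row["year"]), "sales": int(row["sales"])})

print(records)
```

Importing programmatically means the numbers in your analysis are exactly the numbers in the source file, with no retyping in between.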
I’ll reiterate: this list isn’t comprehensive, and some of these tools can be used later for data communication as well. Preview: as long as you can simplify the chart or graph and annotate it to help share the story, you’ll see the tool again. On our Resources page, we reference Creative Bloq, who lists their 38 best tools for data visualization. You can bet they didn’t include pencil and paper.
Tool selected, what’s next?
Next we’ll get the data ready to use with the tool(s) selected. Retrieve as much documentation about the data as you can find; it helps you manipulate the data as little as possible.
photo credit: Pixabay CC0 Public Domain