What does a pile of garbage casting a shadow have to do with “Data Science”?

There’s a lot of heady language around these days. NoSQL, “Big Data”, algorithms, animals like Pythons and Pandas, nonsense-seeming words like Hadoop … and a lot of other terms and concepts that are impenetrable to the average executive who is a programmer neither by training nor experience.

Data science is evolving rapidly, and it will continue to insinuate itself into business processes. As an executive, you need to see data science as a toolkit, and yourself as the general contractor. You may not need expertise in using the tools, but you do need to understand the tools in the kit, how they’re used, and what problems you can point your expert at to get your strategic (or tactical) goals accomplished.

What is Data Science, anyway?

Periodically, it seems, the business world (and tech especially) comes up with a neologism that both describes something that isn’t really new and is vague in its meaning. The current craze for “data science” fits the description.

When I was in school (and really, up until about 5 years ago), a high proportion of people seemed flummoxed by and uninterested in quantitative analytics. Now everyone talks like an expert. So unless we’ve experienced a dramatic change in both education and human predisposition, there seems to be a gap. I want to help bridge that gap.


Back to the pile of garbage. It’s pretty self-explanatory as a piece of art; when the light hits the disorderly heap in just the right way, it casts a shadow that’s easily recognizable. If we move the light source, the shadow will no longer be recognizable.

This is data science. At its core, data science is simply shining a light on piles of “junk” from lots of different perspectives until a pattern can be ascertained. Replace all those tin cans and Snickers wrappers with numbers in a database, and there you have the magical, mysterious data science. It is about looking for the right place to stand with a spotlight, so that recognizable patterns emerge. Sure, there are a lot of places we can go from there, but those are the basics.

Just a little math before going on…

Look at the GIF below before mousing over it; it just looks like a random set of points:

[Animated GIF: Data Spin]

Now mouse over (or tap on, if you’re reading this on a mobile device) the GIF. The perspective changes. What looked like an unordered mess snaps into view, and now we can draw a straight line right through the points. Mathematics knows a lot about straight lines, which means we can now use the line we’ve discovered to make predictions, comparisons, “what-ifs”, etc.

I’ve made it easy here by keeping the “random” points confined to a set that has a straight line going through them. It was just about spinning the axes in just the right way to see the line.
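If you (or your expert) want to see the same trick in code, here is a minimal Python sketch. It is not how the GIF above was made; the points, the fixed seed, and the helper name `pearson` are all invented for illustration. The assumption is a cloud of 3-D points where z secretly depends only on x.

```python
import random

def pearson(xs, ys):
    """Correlation of two columns: near 0 looks like a cloud, +/-1 is a perfect straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical 3-D points: x and y are random, while z depends only on x.
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [rng.uniform(-1, 1) for _ in range(200)]
zs = [2 * x + 1 for x in xs]

view_from_above = pearson(xs, ys)  # looking at x against y: a shapeless cloud
view_from_side = pearson(xs, zs)   # same points, looking at x against z: a line

print(f"x vs y: {view_from_above:+.2f}")
print(f"x vs z: {view_from_side:+.2f}")
```

Same points, two viewpoints: the first number stays close to zero, while the second comes out as +1.00, which is the code's way of saying "there's your straight line".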

Sometimes – more often than not, in fact – the points won’t fit a straight line. That’s OK, though, because mathematics knows a lot about other shapes, too. And as long as we can recognize and describe a pattern, we can use math to make those predictions, comparisons, and “what-ifs”. (By the way, murmurations form beautiful patterns that we can’t really describe with equations. So do lava lamps.)
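To make "other shapes" a little more concrete, here is a small Python sketch under made-up assumptions: the measurements are hypothetical and follow a curve, y = 3x² + 2, and `fit_line` is just an illustrative helper doing an ordinary least-squares fit. It is one simple way to describe a curve, not the only one.

```python
def fit_line(us, vs):
    """Ordinary least-squares straight line: v = slope * u + intercept."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return slope, mean_v - slope * mean_u

# Hypothetical measurements that follow a curve, y = 3 * x**2 + 2, not a line.
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * x ** 2 + 2 for x in xs]

# Forcing a straight line through the raw points gives a poor prediction at x = 10...
slope, intercept = fit_line(xs, ys)
line_guess = slope * 10 + intercept

# ...but once we describe the shape as "a line in x squared", the prediction is right.
slope_sq, intercept_sq = fit_line([x ** 2 for x in xs], ys)
curve_guess = slope_sq * 10 ** 2 + intercept_sq

true_value = 3 * 10 ** 2 + 2
print(line_guess, curve_guess, true_value)
```

The straight-line guess lands at about 142, while the curve-aware guess lands at about 302, which is the true value. Recognizing the shape is what makes the prediction useful.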

Sometimes the data can’t be plotted with three axes alone. We mortals like to think in two or three axes, because we can relate to them; we live in three axes. But that’s OK, too, because mathematics knows a lot about 4 dimensions, and 5 dimensions, and on and on.[1] We don’t have to be able to physically inhabit a space (3-D) to be able to describe it mathematically. And again, if we can describe it….
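Here is what "more than three variables" can look like in practice, as a rough Python sketch. The five customer rows are invented for illustration; the point is only that each row is a single point with four coordinates, and that we can summarize along any of them without ever drawing a graph.

```python
from collections import defaultdict

# Hypothetical customer records: each row is one point in 4 dimensions
# (age, gender, car make, car model). All values are invented.
customers = [
    (34, "F", "Toyota", "Camry"),
    (51, "M", "Ford", "F-150"),
    (29, "F", "Toyota", "Corolla"),
    (45, "M", "Toyota", "Camry"),
    (62, "F", "Ford", "Escape"),
]

dimensions = len(customers[0])  # 4 variables per row

# We cannot plot 4 axes, but we can still summarize along any one of them.
ages_by_make = defaultdict(list)
for age, gender, make, model in customers:
    ages_by_make[make].append(age)

average_age_by_make = {make: sum(ages) / len(ages) for make, ages in ages_by_make.items()}
print(dimensions, average_age_by_make)
```

With these made-up rows, the average age works out to 36.0 for Toyota owners and 56.5 for Ford owners. Nobody had to picture hyperspace to get there.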

OK, OK, but what about the tools?!

Lukas Biewald, writing The Data Science Ecosystem for Crowdflower, gives an excellent flyover of the specific tools that data scientists currently use. He breaks them into three categories: Data Sources, Data Wrangling, and Data Applications. The business generalist, though, might not need the level of granularity Lukas provides. He covers the specific companies and the tools they provide. Sticking with our “general contractor” analogy, this information is more useful for a subcontractor.

[Chart: Data Science]

Rather than getting into actual hands-on information, I want to focus on the business uses of data-driven analysis. It’s always helpful to start with an old-school consulting 2×2 matrix; one that zeroes in on how the tools are used in a business setting. (Really, we’re focusing here on what I’ll call “methods” rather than tools – there are loads of tools within each method.)

The dimensions of this framework, when considered from the business use perspective, are:

  1. Temporal Focus. In other words, are you using data to
    • analyze what happened
    • understand what is happening now
    • look into the future
  2. Source of the Data. For our purposes, the “sources” could be either
    • direct observation, e.g., last quarter’s financials. The data are known, and they’re discrete – which essentially means there is a single value for, say, 2014 First Quarter sales.
    • modeled observation, e.g., projected customers in North America. The data here are not known, but they’re still discrete.
    • modeled decision, e.g., “what happens if I run a $3 million marketing campaign? what if I make it $1.5 million instead?” Here the data can be in a range of values, rather than discrete.
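The "modeled decision" case is the one that most benefits from a tiny example. The Python sketch below compares the two campaign budgets mentioned above using a toy response curve. Everything about that curve is an assumption made up for illustration: the function name `projected_net`, the square-root shape (diminishing returns), and the `lift_per_sqrt_dollar` figure. A real model would be built from your own data.

```python
def projected_net(spend, lift_per_sqrt_dollar=2500.0):
    """Toy model (an assumption): revenue lift grows with the square root of spend."""
    revenue_lift = lift_per_sqrt_dollar * spend ** 0.5
    return revenue_lift - spend  # net result after paying for the campaign

# The two "what-ifs" from the text.
scenarios = {"$3.0M campaign": 3_000_000, "$1.5M campaign": 1_500_000}

results = {name: projected_net(spend) for name, spend in scenarios.items()}
best = max(results, key=results.get)

for name, net in results.items():
    print(f"{name}: projected net ${net:,.0f}")
print("Better option under these assumptions:", best)
```

Under these invented assumptions the smaller campaign comes out ahead, but the answer is only as good as the curve. The useful part for a generalist is seeing that the input is a range you can dial up and down, not a single known number.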

When considered this way, we can break pretty much all of the methods into the three main categories indicated by shaded areas in the chart. These categories differ from Lukas Biewald’s; they are focused more on the business questions being asked than on the functional operations the tools perform with data. Again, general contractor questions, rather than master craftsman questions.

Within the categories there are methods for Discovery, Management, and Decision-making, and each has its own value to the business executive:

Discovery

When we look back in time using discrete data we already have, we can uncover insights to apply to our current and future decision-making. Nutrisystem CEO Dawn Zier, writing in The Philadelphia Business Journal, provides a great example of this in Are you an analytical innovator?[2] Dawn starts her article with the story of UPS’ discovery that left-hand turns cost them money. This discovery was made using discrete, historical data collected from the GPS tracking systems on their delivery trucks. Brilliant!

In this category, the first important questions for business people are “what data do we have?” and “what data can we get easily?”[3] Very often businesses collect information they don’t even know they’re collecting. In the UPS example, the company was likely using GPS already anyway, presumably to help route (and possibly keep tabs on) the drivers. The data were already being collected.

“Real-Time” Management

This category is about real-time information applied to real-time decision-making; driving a car is a great analogy. This is why we see “dashboards” all over the tools & methods in this space. Driving a car is a moment-by-moment collection and synthesis of quickly changing information, and making constant, small adjustments to inputs to stay on the road and avoid collisions. Does this sound like part of your job? (Or maybe all of your job?)

SABRE is a great example. Launched as a reservation system in 1960 by American Airlines, SABRE was able to collect real-time information about air travel demand – since it was doing the booking! Eventually they realized that demand and price are intimately intertwined, and with real-time demand information they could make real-time pricing decisions at a very granular level. Were any long-term strategic decisions made based on these data? I don’t know, but I would imagine that the real strategic value of SABRE is the tactical decisions it empowers.

Decision-making


Forward-looking decisions are very different from real-time decisions. Clearly, looking forward in time requires data that we don’t have – but we might be able to make very good assumptions about them. In our car-driving example, we make constant, minute adjustments to the force we exert on the gas pedal to maintain a reasonable speed. Those real-time decisions require a feedback loop so quick that it literally becomes a reflex.

But assume you’re on I-70 driving from Brooklyn to Los Angeles to start that great new job you just landed. You have an old friend and snowboard buddy who lives in Denver, and you talked about stopping off on the way for a quick day of shredding. The thing is, your buddy won’t know if she’s available until the day before. You’ll call her right before you hit St. Louis; if she’s available, you’ll take I-70 and hit the slopes. If not, you’ll veer south on I-44 and stop at Albuquerque to visit your mom. Mom’s available no matter what, as moms are, and the route’s just a tad shorter, too.

This is a “what-if”, and it can be modeled based on data we don’t have yet (will my friend be available?). You make the decision ahead of time, based on assumed values. You model what you’ll do in each scenario, along with the impact. As soon as you have the decision criteria, you’re ready to make the decision.
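The road-trip decision can be written down as a very small Python sketch. The names `plans` and `decide` are made up for illustration; the point is that the plan for each scenario is worked out in advance, and the decision itself is just a lookup once the one missing fact arrives.

```python
# Plans worked out ahead of time, one per scenario, keyed by whether
# the friend in Denver turns out to be available.
plans = {
    True: {"route": "I-70", "stop": "Denver", "activity": "snowboarding"},
    False: {"route": "I-44", "stop": "Albuquerque", "activity": "visiting Mom"},
}

def decide(friend_available):
    """Return the pre-made plan once the decision criterion is known."""
    return plans[friend_available]

# The phone call before St. Louis supplies the missing piece of data.
for answer in (True, False):
    plan = decide(answer)
    print(f"Friend available: {answer} -> take {plan['route']} to {plan['stop']} ({plan['activity']})")
```

A business "what-if" model has the same structure, only with more scenarios and with dollar impacts attached to each plan.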

Conclusion

“Data science” is not new, nor should it be confusing or intimidating – even if you don’t know how to write code. Data science is a largely amorphous term that encompasses a wide variety of tools, and covers a wide variety of analytical methods to answer questions. My goal here is to demystify the discipline enough so that you, the generalist, know whom to point at what and can focus on asking the right questions in the right context. Which, after all, is what a good generalist does well.

In future posts in the “data science” category, I’ll go into some details, maybe a case examination here & there, and throw some opinions around. Maybe I’ll even call BS on some people once in a while! But I do plan to stay focused on how business executives can think about data without having to learn how to program in R.

Finally, please let me know what you think. I’ll adjust my content to fit the audience as much as I can. 🙂

I am currently looking for opportunities to help companies as a consultant or interim COO. Interested? Please get in touch.

References

1. The data scientists might try to impress or confuse you (or both) by talking about hyperspace, or even worse, “n-dimensional hyperspace”. All that means is we have more than three variables. For example, if we have a database with age, gender, car make, and car model, that’s 4 dimensions. Simple to understand, impossible to graph.
2. Hat tip: Rick Erwin, President and General Manager, Audience Solutions at Acxiom Corporation, who alerted me to this article.
3. In fact, these questions apply across the board. Let’s talk about that some other time.