Data analysis as a story, and the problem of induction
What makes humans unique is the need to tell stories. Not just the ability: the need. This need comes from our perception of time – we never exist simply in the now, we are always between the past and the future. What are you doing right now? Well, you are for sure reading this article (otherwise you wouldn’t be here by now). But this statement is already in the past, because you never reach “the now”. “Reading an article” is a whole that consists of a bunch of consecutive moments in time. To say that you are reading an article, you have to have this concept in your mind, and it always takes place between what already happened and what is about to happen soon. In other words, this statement is a model of reality, a model that puts reality in order and gives meaning to our actions.
How this relates to data analysis
The same thing applies to data analysis. An analysis is not reality; it is how we see reality through data. We don’t analyze all the data, not even all the data that is available. An analyst decides what to show as an image of the thing he or she has to analyze. When creating an analysis of “Why sales in Q3 were lower than in Q2”, we create a story of what happened and what we can do to prevent it from happening again.
This connection between the past and the future is crucial to understanding what can go wrong. We perceive the future through what happened in the past, since historical data is the only source we have for what might happen. However, this connection can be really tricky.
As Bertrand Russell wrote when illustrating the problem of induction:
Domestic animals expect food when they see the person who usually feeds them. We know that all these rather crude expectations of uniformity are liable to be misleading. The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.
If the chicken were asked on day 74 what would happen next, it would probably say that it would eat something delicious and do whatever chickens do all day. This is what has happened for its entire life so far, so why would the next day be different? For the chicken, decapitation is as unexpected as a market crash or a natural disaster that leaves a business owner without a penny. We can never be sure about the future, and we should be careful not to trust predictions based on historical data too much.
Data in organization
The history of using data in business begins with the invention of writing. As soon as humans were able to write, they used writing in the exchange of goods: how much one has and what the price is; who owes whom, how much, and when it is due. But just like other stories, the records written on clay tablets, papyrus, sheepskin and finally paper could easily be falsified, on purpose or just by accident. Mostly the latter: when you copy a document by hand and do the calculations in your head, making a mistake is almost inevitable.
A big step in data analysis
Spreadsheets had an enormous impact on doing business and analyzing data. If you analyze numbers with pen and paper, there is a certain limit to what you can do. Summing 1000 sales from a year by hand is a really tedious job and (as we said before) it’s very easy to make a mistake. But if you have a tool that automatically totals not 1000 but 10 000 or 100 000 entries the moment you type them in, your appetite for more data to analyze can increase. And that was certainly the case: with Lotus 1-2-3, and then Microsoft Excel, running on personal computers, companies started to measure and analyze almost every aspect of production and delivery in order to optimize it, and the 1980s saw a great renaissance of Taylorism. With such a tool and so much data on hand, it’s really easy to feel like you know everything about the process you measure: how much time it takes to produce part X, how much time to take it from point A to point B, how much you pay the workers and what price you get when selling it at place Z.
The role of stories told by data
This first part of digital transformation is probably the most important one when looking back at the history of analyzing business data. The tools used back then are still in use today (a great deal of financial and personal data still lives in Excel files), but the most important change was to the idea of what role data plays in the life of a company. And today, with more data coming from devices used in production, and better tools for data tracking and analysis, we can easily be fooled by the great story they tell: the story about the past. It’s the same story the chicken would tell the day before its decapitation: in the past 74 days there was no record of such an event, which makes it seem really improbable. So, is there a way to tell if the decapitation is coming?
Prediction models for the unmeasurable
The fact that the chicken hasn’t observed the decapitation of a chicken before doesn’t make decapitation non-existent. And that’s the first lesson of working with prediction models: they are only as good as the data we use to build them. Even if we have large amounts of data, of different types and from different sources, and we manage to calculate the probability of a certain event within a certain time, this is never the whole picture. A single missing parameter can significantly change the outcome of a model, especially when we ask a slightly different question.
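The chicken's reasoning can be sketched in a few lines of Python. This is a minimal illustration, not a real prediction model: the function names, the outcome labels and the 74-day history are assumptions made up for the example, and add-one (Laplace) smoothing is brought in here only as a contrast to the plain frequency estimate.

```python
def frequency_estimate(history, event):
    """Estimate the probability of `event` as its share of past observations."""
    return history.count(event) / len(history)


def add_one_estimate(history, event, n_outcomes):
    """Add-one (Laplace) smoothing: pretend every known outcome was seen once more.

    `n_outcomes` is the number of outcomes we believe are possible, which is
    an assumption the data itself cannot supply.
    """
    return (history.count(event) + 1) / (len(history) + n_outcomes)


# Hypothetical history: the chicken was fed on each of its 74 days.
history = ["fed"] * 74

p_naive = frequency_estimate(history, "slaughtered")
p_smoothed = add_one_estimate(history, "slaughtered", n_outcomes=2)

print(f"frequency estimate: {p_naive:.4f}")
print(f"add-one estimate:   {p_smoothed:.4f}")
```

The plain frequency estimate assigns the unseen event a probability of exactly zero. Smoothing gives it a small non-zero value, but only because we told the model in advance that a second outcome exists, which is precisely the knowledge the chicken lacks.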
The risk analysis
But even if we’ve asked the right questions, gathered all the data and properly calculated the probability, this doesn’t resolve all the problems. Frank Knight, an American economist, made a distinction between risk and uncertainty. Risk is calculated for a class of objects. When we are thinking about opening a restaurant in a small town, we can build a prediction to answer the question “Will it work?”. First, we can build a model based on the history of restaurants in that town, which will tell us that 80% of all restaurants that ever existed there went out of business within the first two years. That gives us 8:2 odds that we will go bankrupt. However, we want to run a pizza place, and most of the bankrupt restaurants did not serve pizza, while almost all of the surviving 20% did. That gives us, let’s say, 7:3 odds that making pizza is the key to our business success. But the only place we can rent for this business is in the part of town where almost all the restaurants that went bankrupt in the last 5 years were located, and so on.
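To make the effect of switching classes concrete, here is a small Python sketch. The records are invented for illustration and are not meant to reproduce the figures above (only the overall 80% failure rate is kept); the district names and the `failure_rate` helper are likewise assumptions.

```python
# Hypothetical records, invented for illustration only.
# Each tuple is (serves_pizza, district, failed_within_two_years).
RESTAURANTS = (
    [(True, "east", True)] * 1
    + [(True, "west", False)] * 3
    + [(False, "east", True)] * 6
    + [(False, "east", False)] * 1
    + [(False, "west", True)] * 9
)


def failure_rate(records, in_class=lambda record: True):
    """Return (failure rate, sample size) for the records in a reference class."""
    selected = [record for record in records if in_class(record)]
    if not selected:
        return None, 0
    failed = sum(1 for record in selected if record[2])
    return failed / len(selected), len(selected)


# The same question, "will it fail?", asked of four different classes.
reference_classes = {
    "all restaurants": lambda r: True,
    "pizza places": lambda r: r[0],
    "east district": lambda r: r[1] == "east",
    "pizza places in the east district": lambda r: r[0] and r[1] == "east",
}

for name, in_class in reference_classes.items():
    rate, size = failure_rate(RESTAURANTS, in_class)
    print(f"{name}: {rate:.1%} failed (n={size})")
```

With this made-up data the failure rate moves from 80% to 25% to 87.5% depending on the class, and the class closest to our own plan contains a single restaurant. The narrower the class, the less data is left to support the number.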
Data set is the key
So, each of these models answers a question about a certain class: the set of restaurants, the set of pizza restaurants, and the set of restaurants in a certain part of town. Although it would be rational to review our business plan based on these models, they don’t describe our future restaurant. There is no way of calculating the probability of a unique event. This is the field of uncertainty. Uncertainty can’t be measured or calculated, because it has almost nothing to do with what happened in the past and what can be observed in the data. You can’t calculate how profitable the production of personal computers will be when there is no such thing as a personal computer yet.
Is data analysis always the answer?
According to economist Israel Kirzner, uncertainty is the key to entrepreneurship. If every business decision ever made had been based on calculated risk, and on the data, there wouldn’t be anything new in the history of mankind. An entrepreneur is someone who discovers a disequilibrium in the market (needs that are not yet fulfilled) and, by correcting it, brings the market toward equilibrium. And this is not (just) about asking people what they need, but about recognizing it with the help of intuition and creativity. No dashboard can help with that.
Beyond the data
So is it time to conclude that data doesn’t really matter? Obviously not. We are living in a world of data, whether we like it or not. Machine learning algorithms are predicting our next moves any time we make a new entry in our digital footprint (or even when we don’t). Although a “superintelligent AI” that takes over the world is a science-fiction concept, the fact that Google and Facebook can right now use data to make us do as they want is very real. But even on a macroeconomic scale that’s just a domain of probability, and probability is never 100%.
If we are trying to answer the question of how to manage information inside an organization, the answer is quite different. We should still create analyses to find out how our company is doing, but we should be aware that this is only a small part of reality. As we said before, no matter how much information we have, it’s always just a story we can tell, and the story is never about the future. The organization creates the future, mostly its own.
Significant variables in the analysis
A saying commonly attributed to Peter Drucker goes: “If you can’t measure it, you can’t improve it”. Since today we can measure almost everything that has an impact on business, it should not be a problem to constantly make it better, right? It’s not that simple. It depends on what you are trying to measure, how you are measuring it and how accurately you can interpret the result.
So, if we’re trying to properly handle the data inside an organization, we should never stick to the Tayloristic approach to data and improvement. There is always a possibility that our models don’t include important pieces of information, or that the input data is biased in some way. An analysis can show patterns that help identify such problems, but it is not guaranteed to. There is a non-zero probability that the dashboard that looks good is in fact too good to be true.
Prediction models for the organization
It is true that procedures and automation can make the life of an organization more predictable. However, any procedure or automated process is based on past events and, more importantly, on the part of past events that is visible in the data. If the data is incomplete, automation will make the whole organization fragile and shock-sensitive. The job of an analyst is not so much to work with the existing data as to create more data (e.g. during discovery workshops) and to know the limits of what can and what can’t be expressed in quantitative terms.
Measuring the unmeasurable – Summary
Last but not least, we need to be aware of the unknown or, to put it in Nassim Taleb’s terms, of the existence of Black Swans: low-probability events with very high impact. The world can take an unexpected turn, and if it does, all of our predictions will become useless. Mind the second number in the odds: even at 99:1 you should have a scenario for when the 1 happens, especially if it means the end of your business.
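A short piece of arithmetic shows why the "1" deserves attention. The sketch below assumes, purely for illustration, that the 99:1 odds apply to each period (say, each year) separately and that periods are independent; real rare events need not behave this way, and the function name is hypothetical.

```python
def chance_of_at_least_one(p_event, periods):
    """Probability that an event with per-period probability `p_event`
    happens at least once over `periods` independent periods."""
    return 1 - (1 - p_event) ** periods


# 99:1 odds against the event correspond to a 1% chance per period.
for periods in (1, 10, 50, 100):
    chance = chance_of_at_least_one(0.01, periods)
    print(f"{periods:>3} periods: {chance:.1%}")
```

Under these assumptions a 1% event becomes more likely than not once enough periods pass: over 100 periods the chance of meeting it at least once is roughly 63%. And if that single occurrence ends the business, the 99 good outcomes do not offset it.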