Designing Distributed Systems

Lies and Statistics

distributeddataengines.substack.com

Every CEO needs to read this

Vipul Vaibhaw
Oct 9, 2021

The charismatic Steve Jobs was famous for the charts and visualisations in his demos. The following picture is from his iPhone 4 presentation at Apple WWDC 2010.

In the chart above, the iPhone accounts for a substantial chunk of usage. The question, however, is whether the other competitors are really that small.

Here is a chart of the same data:

I am sure you will agree that Android, RIM and even Others look far bigger in the chart above. The game, my friend, is not in the data itself but in the way you present it.



Charts - The tool for Deception

Data visualisation is an art, and it can be used to get a point across through deception. Take a look at the following chart presented by Steve Jobs.

The chart above represents iPhone sales. Because the iPhone's portion of the chart is tilted towards the audience, it appears far larger than it actually is.

The chart below is a classic. Take a look before I explain more.

I hope you were able to spot that Apple’s 19.5% slice sits closer to the audience and therefore looks bigger than the 21.2% slice.

The following bar chart shows the election results for Jharkhand (an Indian state):

[Image: Jharkhand Assembly election results. Source: The Hindu BusinessLine]

If the same result were shown using actual seat counts rather than vote-share percentages, it would look as follows (I have left out Others because I didn’t have the actual numbers):

The winning party no longer looks like such a dominant leader! Same story, different impact.



Average - No one can pin you down!

I simply love the way our news channels, ad agencies, YouTube videos and SaaS websites use the word average.

In case you are wondering, there are three common types of average: mean, median and mode. The trick is that you never tell people which type you are talking about.
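
To see how much room that leaves, here is a minimal sketch using Python's standard `statistics` module. The response times are made-up illustrative values, not measurements from any real system.

```python
import statistics

# Hypothetical response times in milliseconds (illustrative values only).
response_times_ms = [1, 1, 1, 2, 2, 3, 4, 50]

mean_ms = statistics.mean(response_times_ms)      # sum divided by count
median_ms = statistics.median(response_times_ms)  # middle of the sorted values
mode_ms = statistics.mode(response_times_ms)      # most frequent value

print(f"mean={mean_ms} ms, median={median_ms} ms, mode={mode_ms} ms")
```

For this sample the mean is 8 ms, the median is 2 ms and the mode is 1 ms, so the same data can honestly be described with any of three different "averages".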

Let us say that my distributed data engine (you see what I did there? 😉) has an average response time of 1 ms.

To a data scientist, this statement is meaningless, for the following reasons:

  1. I don’t know the sample size.

  2. I don’t know the type of average used.

Here is one important thing that should not be inferred from that statement: that no request ever takes something like 1 second to respond! Suppose I take a sample of 20,000 requests where 19,999 of them take around 1 ms but one takes 5 seconds. The mean is then (19,999 × 1 ms + 5,000 ms) / 20,000 ≈ 1.25 ms, which still gets reported as roughly 1 ms. For skewed data such as response times, the mean on its own tells you very little; the median usually makes more sense.

A median of 1 ms means that half of the requests took 1 ms or less to respond and the other half took 1 ms or more.
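
The arithmetic behind this example can be checked with a short sketch. It assumes exactly 19,999 requests at 1 ms and a single request at 5 seconds, which is one reading of "most requests are around 1 ms"; the variable names are my own.

```python
import statistics

# Assumed sample: 19,999 requests at 1 ms plus one request at 5 seconds.
latencies_ms = [1.0] * 19_999 + [5_000.0]

mean_ms = statistics.fmean(latencies_ms)     # pulled up slightly by the outlier
median_ms = statistics.median(latencies_ms)  # unaffected by the outlier
worst_ms = max(latencies_ms)                 # the request nobody mentions

print(f"mean   = {mean_ms:.3f} ms")
print(f"median = {median_ms:.3f} ms")
print(f"max    = {worst_ms:.0f} ms")
```

Under these assumptions the mean comes out at about 1.25 ms, which rounds to 1 ms, and the median is exactly 1 ms, yet one request still took 5,000 ms. Neither average reveals that on its own, which is why reporting the sample size and the worst case alongside the average matters.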


Final Thoughts

I wanted to point out a few examples from the world of almighty statistics that show how it can be used in exactly the opposite way to how it is meant to be used. Think carefully before relying too heavily on statistics, because they can be used to distort the facts.

Reach out to me if you need help with data visualisation.


Weekend Reads

- The Gray Lady Winked


Please feel free to reach out to me if you need help with data science or data engineering at your organisation. I would be more than happy to help.

If you like my articles and want me to write more, feel free to buy me a coffee! 😁

I hope you enjoyed this read. If so, please share this article with your friends and let’s build a solid community. I will be back next week with another well-thought-out, well-researched article delivered straight to your inbox. See you!

You can connect with me on Twitter or LinkedIn.
