Designing Distributed Systems

Share this post

The need for OLAP Databases

distributeddataengines.substack.com

The need for OLAP Databases

different business requirements needs different access patterns

Vipul Vaibhaw
Sep 25, 2021
1
Share this post

The need for OLAP Databases

distributeddataengines.substack.com

Happy weekend dear reader! I have always loved writing and I must say the process of generating a piece of content for you to read is truly satisfying, the growth of this newsletter has been fascinating and humbling to me. I get to write about a very niche topic and I promise to keep this newsletter very specific. Please send your thoughts to me either by simply replying to this email or by putting a comment to this article, I am looking forward for your message.

Introduction

In early days of computing, any operation in business usually resulted in money transaction. That is why the term “transaction” got stuck with database operations.

A transaction needn’t necessarily have ACID properties. Till the time the databases allow low-latency reads and writes, one should be fine.

A simplified sample ETL

With time databases have become smarter, continent-wide distributed and application specific. Today databases are used for many kinds of data - blog post, contacts, games etc.

An applications typically looks up a small number of records to get, update, or delete them using an index. Since these apps are interactive, the access patterns became known as Online Transaction Processing (OLTP).



Rise of Data Analytics

We can see that the interest of the world started to jump up in data analytics.

Google Trends for Data Analytics term

Databases also started being increasingly used for Data Analytics, it has a very different access pattern.

Most of the time the analytics query needs to do huge amount of scan over number of records and calculate aggregates(like sum) rather than return the raw data to the user.

In case of the ETL pipeline shown above, the analytics query in case of doctor-patient scheduling app can be like -

  • How many appointments has a particular doctor has done?

  • Which is the most busiest time slot for booking appointment?

  • What is the demographic of patients using our software?

Queries like these are often written by Data Analysts or Business Analysts, or Data Scientists to generate reports which can help Management to take intelligent/data-informed decisions.


We have clearly established the fact that a database needs two separate access patterns. Building a single database to achieve that is going to be very hard! Hence, we have to differentiate analytics from transaction processing and it has been called as Online Analytics Processing (OLAP).

A few characteristics which distinguishes OLAP with OLTP wrt access patterns -

  • Main Read Pattern - Aggregate over large number of records vs small number of key based records per query.

  • Main Write Pattern - Bulk imports or Event Streams in OLAPs vs Random access in OLTPs.

  • Users - Analysts for OLAP vs Applications for OLTPs.

  • Data Size - Petabytes/Terabytes in OLAP vs Terabytes/Gigabytes in OLTP.

A database which has the above mentioned characteristics is called as a Data Warehouse.


Share


Data Warehousing

OLTP systems are expected to be highly available and process requests with low latency. For this reason, it makes total sense to move a a different database which can process heavier queries. We can’t harm the performance of OLTP databases. Hence, Data Warehouse comes to rescue.

A Data Warehouse contains read-only copy of data in various OLTP databases present in the architecture. Data from OLTPs are brought into Data Warehouses either via a periodic dump or via a continuous stream of updates. It is transformed either in the process or dumped into data warehouse and transformed later (ETL vs ELT debate).

There are several Data Warehouse which are available as a Service -

  1. AWS Redshift

  2. Azure Synapse

  3. Google BigQuery, etc.

Data Warehouses now exists in almost all the organisations, the earlier this setup is done the faster your business can start to take data driven decisions.

A big advantage for using separate data warehouse is that it is tuned to operate efficiently for analytics access pattern.

Usually data warehouses are kept relational because various analyst are trained to use SQL, even MBAs(read non-techie) can fire SQL queries.

The results from these queries are often loaded into good visualisation softwares like -

  1. Tableau

  2. Apache Superset etc.


Schema for Analytics : Snowflake or Star?

On the OLTP side of things there are a wide variety of data models which are available at our disposal(please let me know if you want to me to write about that). For OLAP systems we have much less choice, usually people go with -

  1. Star Schema

  2. Snowflake Schema

At the center of any schema, we have Fact Tables. Simply put, imagine your fact table as a representation of all of the data via Primary Keys and Composite Keys. These Composite Keys are explained in Dimension Table.

Star Schema has one fact table and a number of associated dimension tables. It is known as star schema as its structure resembles a star. The Star Schema data model is the simplest type of Data Warehouse schema. It is also known as Star Join Schema and is optimised for querying large data sets.

Source - Wikipedia

Snowflake Schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. ”Snowflaking" is a method of normalizing the dimension tables in a star schema. When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. The principle behind snowflaking is normalization of the dimension tables by removing low cardinality attributes and forming separate tables.

Source - Wikipedia

Snowflake schemas will use less space to store dimension tables but are more complex. Star schemas will only join the fact table with the dimension tables, leading to simpler, faster SQL queries.

A general rule to thumb is that while creating “Data Marts”(a potential topic for next post) start schema is more suitable.

There are some more schemas available like Galaxy Schema etc.


Parting thoughts

Data Warehouses and OLAP systems have changed the way analytics used to work and they have allowed businesses to do take smarter, data driven decisions.

Reach out to me if you are interested in building data systems at your organisation, I would be happy to help.

I hope that you enjoyed this read, if so then please share this article. I will be back next week with another well thought/researched article delivered straight in your inbox. See you!

You can connect with me on Twitter or Linkedin.

Share this post

The need for OLAP Databases

distributeddataengines.substack.com
Comments
TopNew

No posts

Ready for more?

© 2023 Vipul Vaibhaw
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing