It is important to understand several terms before implementing a Data Warehouse in a SQL Server/Azure DB. There are three common names for database architectures: Star Schema, Snowflake Schema, and Columnar Tables.
A Star Schema is the most basic data model for relational databases. They consist of one (or more) Fact Tables which have a relationship to many Dimension Tables. A Fact Table can be considered the parent table which has Foreign Key relationships to the Dimension Tables’ primary key. Consider purchasing an item from a supermarket and how the information for the transaction would be stored. There could be a fact table which records the transaction. It would hold the storeID, itemID, date, price, and quantity for the transaction. The storeID would be related and constrained to the Store Dimension Table, the itemID to the Item Dimension Table, and the date and price would be values within the fact table itself.
The key take away about Fact/Dimension tables is the difference between the two is that fact tables hold the data we want to analyze and the dimension tables hold the information necessary to allow us to query it.
The Snowflake Schema is similar to the Star Schema. However, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized with each dimension represented by a single table. That is, each Fact Table’s Dimension Table can have its own child tables to further normalize the information. Continuing our supermarket example, the item Dimension Table could have a supplierID which relates only to the item table and is not used by the Fact Table.
Some Data Warehouse solutions use a Wide Table architecture. Instead of having Dimension Tables, there are Fact Tables with all of the Dimension Table(s) information on a single row. This allows aggregation of the Fact Tables to happen quickly. However, updating data is hindered because each row needs to be updated. Also, data can be suspect because there is no relationship constraining the values. Row-Oriented and Column-Oriented are the two architectures of wide tables as well as just having each fact on each row without any dimension tables.
A Row-Oriented architecture has each transaction from the supermarket example appended to each other. A Column-Oriented architecture would have the transactions columns grouped together. This is easier to show than explain:
No dimension tables
1 | Giant Bird | 234 East Street | Cleveland | Ohio | 44111 | milk | parishable |
2 | Giant Bird | 234 East Street | Cleveland | Ohio | 44111 | soap | health |
Row-Oriented
1 | Giant Bird | 234 East Street | Cleveland | Ohio | 44111 | milk | parishable | 2 | Giant Bird | 234 East Street | Cleveland | Ohio | 44111 | soap | health |
Column-Oriented
1 | 2 | Giant Bird | Giant Bird | 234 East Street | 234 East Street | Cleveland | Cleveland | Ohio | Ohio | 44111 | 44111 | milk | soap | parishable | health |
One of the main benefits of a columnar database is that data can be highly compressed. The compression permits columnar operations — like MIN, MAX, SUM, COUNT and AVG— to be performed very rapidly. Another benefit is that because a column-based DBMSs is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.
As the use of in-memory analytics increases, however, the relative benefits of row-oriented vs. column oriented databases may become less important. In-memory analytics is not concerned with efficiently reading and writing data to a hard disk. Instead, it allows data to be queried in random access memory (RAM).
Starting with SQL Server 2012, Microsoft introduced columnstore indexes. A columnstore index stores each column in a separate set of disk pages as compared to a heap or B-tree which store multiple rows per page. Leveraging columnstore indexes on “wide” tables allows for quicker queries and compression of the table. In the wild, I had a 56 GB table compressed to 1.6 GB using columnstore indexes. Microsoft uses its own proprietary VertiPaq compression engine for these tables.
This concludes this section of Data Warehousing terms, check back as I may append to this entry