Slowly Changing Dimension (SCD) in Power BI, Part 2, Implementing SCD 1

Slowly Changing Dimension (SCD) in Power BI, Part 2, Implementing SCD 1

I explained what SCD means in a Business Intelligence solution in my previous post. We also discussed that while we do not expect to handle SCD2 in a Power BI implementation, we can handle scenarios similar to SCD1. In this post, I explain how to do so.

Scenario

We have a retail company selling products. The company releases the list of products in Excel format, including list price and dealer price, every year. The product list is released on the first day of July when the financial year starts. We have to implement a Power BI solution that keeps the latest product data to analyse the sales transactions. The following image shows the Product list for 2013:

Products List 2013 in Excel
Products List 2013

So each year, we receive a similar Excel file to the above image. The files are stored on a SharePoint Online site.

Scenario Explained

As the previous post explains, an SCD1 always keeps the current data by updating the old data with the new data. So an ETL process reads the data from the source, identifies the existing data in the destination table, inserts the new rows to the destination, updates the existing rows, and deletes the removed rows.

Here is why our scenario is similar to SCD1, with one exception:

  • We do not actually update the data in the Excel files and do not create an ETL process to read the data from the Excel files, identify the changes and apply the changes to an intermediary Excel file
  • We must read the data from the source Excel files, keep the latest data while filtering out the old ones and load the data into the data model.

As you see, while we are taking a very different implementation approach, the results are very similar with an exception: we do not delete any rows.

Implementation

Here is what we should do to achieve the goal:

  • We get the data in Power Query Editor using the SharePoint Folder connector
  • We combite the files
  • We use the ProductNumber column to identify the duplicated products
  • We use the Reporting Date column to identify the latest dates
  • We only keep the latest rows

Getting Data from SharePoint Online Folder

As we get the data from multiple files stored on SharePoint Online, we have to use the SharePoint Folder connector. Follow these steps:

  1. Login to SharePoint Online and navigate to the site holding the Product list Excel files and copy the site URL from the browser
Getting SharePoint Online Site URL
Getting SharePoint Online Site URL
  1. From the Get Data in the Power BI Desktop, select the SharePoint Folder connector
  2. Click Connect
Connecting to SharePoint Online Folder from Power BI
Connecting to SharePoint Online Folder from Power BI
  1. Paste the Site URL copied on step 1
  2. Click OK
Connecting to SharePoint Online Folder from Power BI using the SharePoint Folder connector
Connecting to SharePoint Online Folder from Power BI using the SharePoint Folder connector
  1. Click Transform Data
Transforming data in Power Query Editor
Transforming data in Power Query Editor

  1. Click the filter dropdown on the Folder Path column
  2. Find the Products List folder hosting the Excel files and select it
  3. Click OK
Filtering a folder in SharePoint Online Site in Power Query Editor in Power BI
Filtering a folder in SharePoint Online Site in Power Query Editor
  1. Rename the query to Product
Renaming a query in Power Query Editor in Power BI
Renaming Query1 to Product

So far, we are connected to the SharePoint Online Folder in Power Query Editor. The next step is to combine the Excel files.

Combining Files

We have multiple options to combine binary files in a table from the Power Query Editor. In this post, we use the most straightforward method:

  1. Click the Combine Files button from the Content column
Combining Excel Files in a Table in Power Query Editor in Power BI
Combining Excel Files in a Table in Power Query Editor
  1. Select the ProductList table
  2. Click OK
Selecting the object to be extracted from each file on the Combine Files window in Power Query Editor in Power BI
Selecting the object to be extracted from each file on the Combine Files window in Power Query Editor

The above process creates a couple of queries grouped in separate folders, as shown in the following image:

The results of combining Excel files
The results of combining Excel files

So far, we have successfully combined the Excel files. The next step is to keep the latest data only.

Keeping the Latest Data

In the next few steps, we look closer at the data, and we implement a mechanism to identify the latest data, keep them and load them into the data model.

  1. Looking at the results of the combined data shows a Source.Name column that we do now require to keep, so we remove it by selecting it and clicking the Remove Columns button from the Home tab
Removing Columns in Power Query Editor in Power BI
Removing Columns in Power Query Editor

So far, we have connected to the SharePoint Online Folder and combined the contained Excel files. Let’s look at the data and see what we’ve got. I sorted the data by ProductNumber to better understand the data changes. The following image shows the results:

Repeated products in different lists

As the above image shows, there are multiple products appearing in multiple lists. That is exactly what we expected to see. The goal is to keep the latest product data only based on the Reporting Date. So we should get the ProductNumber and the maximum of the Reporting Date. To achieve this, we use the Group By functionality in Power Query Editor. Using the Group By from the UI in the Power Query Editor uses the Table.Group() function in Power Query. As the Group By process does not need the data to be sorted we remove the Sorted Rows step. With that, let’s get the job done.

  1. Select the ProductNumber column
  2. Click the Group By column from the Transform tab
  3. Enter Reporting Date for the New column name
  4. Select Max from the Operation dropdown
  5. Select the Reporting Date from the Column dropdown
  6. Click OK
Table.Group() Group By functionality in Power Query for Power BI
Group By functionality in Power Query

The following image shows the results:

The results of the Group By operation in Power Query for Power BI
The results of the Group By operation

We now have all product numbers with their latest reporting dates. The only remaining piece of the puzzle is to join the results of the Grouped Rows step with the data of its previous step. For that we use the Merge Queries functionality which runs the Table.NestedJoin() function in Power Query.

  1. Select the Grouped Rows step from the Applied Steps list from the Query Settings pane
  2. Click the Merge Queries button from the Home tab
  3. Select the Product (Current) table from the dropdown. Note that we are selecting the current query (Product)
  4. On the top table, press the Ctrl button on your keyboard and select the ProductNumber and the Reporting Date columns sequentially
  5. Do the same for the bottom table. Note that the sequence of selecting the columns is important
  6. Ensure that the Join Kind is Left Outer (all from first, matching from second)
  7. Click OK
Left outer join (Merging) in Power Query for Power BI
Left outer join (Merging) in Power Query

As mentioned earlier, the merge operation uses the Table.NestedJoin() function, which accepts two tables (highlighted in yellow in the expression below), a list of their key columns to use on the join (highlighted in red in the expression below), a name for the new column of type table and the join kind. In the above operation, as the Grouped Rows is the last transformation step, we joined the results of the Grouped Rows transformation step by itself. Here is the code generated by Power Query Editor after going through the step 21 to 27:

Table.NestedJoin(#"Grouped Rows", {"ProductNumber", "Reporting Date"}, #"Grouped Rows", {"ProductNumber", "Reporting Date"}, "Grouped Rows", JoinKind.LeftOuter)

But that is not what we want, we do not need to join the results of the Grouped Rows transformation step by itself. We need to join the results of the Grouped Rows transformation step by the results of the Removed Columns step. Therefore, we have to modify the above expression as follows:

Table.NestedJoin(#"Grouped Rows", {"ProductNumber", "Reporting Date"}, #"Removed Columns", {"ProductNumber", "Reporting Date"}, "Grouped Rows", JoinKind.LeftOuter)

The following image shows the modification made in the expression and the results:

The results of joining the results of the Grouped Rows transformation step by the results of the Removed Columns step in Power Query for Power BI
The results of joining the results of the Grouped Rows transformation step by the results of the Removed Columns step
  1. Click the Expand button on the Grouped Rows column
  2. Deselect the ProductNumber and Reporting Date columns to keep the other columns selected
  3. Untick the Use original column name as prefix option
  4. Click OK
Expanding a structured column in Power Query for Power BI
Expanding a structured column in Power Query

All done! The following image shows the final results:

The final results of implementing SCD1
The final results of implementing SCD1

We can now apply the changes to load the data into the data model. With this technique, when a new Excel file (a new Product List) lands in SharePoint Online, Power BI goes through the above transformation steps to ensure we always have the latest Product data loaded into the data model, which is very similar to the behaviour of an SCD1.

Have you used this method before? Do you have a better technique to handle a similar scenario? You can share your thoughts in the comments section below.

Related Posts