The philosophy behind the Lean Data Processing Framework that I’ve developed is similar to that of the Lean Manufacturing methodology, where the aim is to minimize waste (無駄 muda). Three of the seven mudas defined in the original Toyota Production System (the precursor to Lean Manufacturing) are:
- Motion (people or equipment moving or walking more than is required to perform the processing)
- Waiting (waiting for the next production step, interruptions of production during shift change)
- Over Processing (resulting from poor tool or product design creating activity)
In the world of Data, waste is typically not as tangible as in the world of Manufacturing, but at least the three mudas listed above can certainly be found when we deal with Data Processing. The main materials that are typically wasted when processing data are hardware resources (CPU, RAM, Storage) and, perhaps most importantly, time; specifically, the time it takes for an operation to be performed, or for a data processing step to be completed, leading to the muda of Waiting.
In many deployments, all data preparation operations are done overnight and, as long as the results are ready by morning, when users start accessing their Dashboards and reports, the time these operations take is not always relevant. However, there are also cases where the data processing time is of extreme relevance for the operation of the business. For instance, when the business operation relies on intraday reporting, it becomes necessary to have dashboards and reports update every hour, or couple hours. This is when we start looking at reducing waste in data operations, especially when we are also dealing with large datasets.
Another example where processing time is important is when dealing with very large datasets, in which the extract and transform steps may take hours to complete. When the entire data pipeline takes 4, 6 or even 8 hrs to complete, if a process fails overnight, it means that reports will not be updated by morning and users are left waiting.
QVD Segmentation: The Lean Approach
One of the tools I’ve developed as part of the Lean Data Processing Framework is an optimized implementation of the QVD Segmentation technique. This technique is typically used in Qlik Deployments that deal with large datasets, and allows us to split a large table into multiple QVD files. For example, storing one QVD for each month of data, rather than storing the entire dataset in a single file.
While this technique has always been around, and other implementations exist that perform the same operation, the modified version that I’m sharing today focuses on reducing the time it takes for the process to complete by avoiding over-processing, and reducing the motion of data required in the process, in effect tackling the three mudas mentioned earlier. More concretely, this implementation has the following main features:
- Phased partitioning. This allows the segmentation loop to be done in phases to make it more efficient. For example, if the final segmentation is to be by Month, there can be a pre-partition by Year, which we would then loop through in order to get the sub-partitions corresponding to each of the months within that one single year, instead of looping through the entire table for each individual month. This script allows the creation of multiple phased partitions recursively.
- Auto-pruning. This helps reduce the table size continuously on each iteration of the loop(s) after a portion of the data is stored, so that subsequent iterations deal with less and less data. Because of this, each subsequent iteration is performed faster than the previous one. When the order in which the partitions are created is defined based on data volume for each segment, the process becomes even more efficient because large portions of data are processed first and, in turn, pruned from the remaining dataset to be processed.
- Allows incremental loads. Whether it’s an Incremental Extract, or an Incremental Transform process, this script allows us to append new data with old data in pre-existing QVDs by simply setting a parameter in the function call and specifying a primary key field.
- Hierarchical Folder Structure. The script also allows us to automatically create a hierarchical folder structure based on time periods. This is useful when we have many different segments, and takes advantage of a feature recently added to QlikView and Qlik Sense that automatically creates the folder if it doesn’t exist when saving a QVD file. Older versions of QlikView or Qlik Sense may not allow this functionality to work, and it’s also not supported on Qlik Sense on Kubernetes or in Qlik Core.
Here is a brief video where we can compare the traditional approach to the Lean approach.
In the traditional approach, the source table is iterated over for each of the existing time periods/segments. This means that the entire table is read over and over. In one process I recently worked on that was using this approach on a large, wide table, the entire process was taking around 9hrs to complete (just the QVD segmentation process).
In the Lean approach, we can define phased partitioning, so that the amount of data we deal with on each iteration is considerably smaller. It also uses auto-pruning, which means that, after each iteration, the processed time segment is removed from the source table so that subsequent iterations operate on a smaller set of data, speeding up the entire process considerably. For comparison, in my previous example, creating segmented QVDs from the same table using this new approach took just 2hrs, which represents almost a 5x improvement over the traditional approach. In both cases in this comparison, the load was performed in full mode, not incrementally.
Feel free to download the script with the button below and test on your own processes. It is simply an external script file (qvs), that you need to include in your Qlik Scripts.
Within the script itself, you’ll find more details about how to use it and what each parameter means. Try setting different combinations of parameters to see how they affect the overall performance of the process. Here are a few examples of how to use the function.
CALL StoreSegmentedQVDs('Directory\Path', 'MyTable', 'DateField', 'Month');
CALL StoreSegmentedQVDs('Directory\Path', 'MyTable', 'DateField', 'Month', '', '', '', '', 'Yes');
CALL StoreSegmentedQVDs('Directory\Path', 'MyTable', 'DateField', 'Day', '', '', ', '', 'Year,Month', 'Yes', 'YYYY\MM');