Performance tuning is the improvement of system performance. The motivation for such activity is called a performance problem, which can be real or anticipated. In many cases the system performance decreases for example with increasing load volumes or the number of executed applications. A system’s ability to accommodate to changing conditions or circumstances is called scalability. If it is necessary to modify a system in order to handle the new conditions like higher loads or to achieve other goals like decreasing process times, it is called performance tuning.
Further we will deal especially with ETL performance tuning. The main performance requirement for an ETL process is to load a certain amount of data in a defined amount of time from a source to a destination. During the ETL process several changes can be applied to the data. So the main indicators to identify performance problems in ETL are:
- Data Volume: The data volume describes the amount of data which is loaded during the ETL process from the source to the destination. Usually this amount is measured in MB (Megabyte) or GB (Gigabyte). It is also possible to measure the data volume by the number of rows loaded. As rows can differ according to their number and type of columns, this approach is not recommended as in most cases it is not precise enough.
- Processing Time: Processing time means the time a certain ETL package or the complete ETL process, including many ETL packages, takes to complete.
- Throughput: The data volume loaded in a certain processing time is called throughput. It is typically measured in MB/sec or GB/sec. Throughput tells you how fast you are able to load a certain data volume on a certain environment.
- So we are talking about a performance issue when the ETL data volume is not processed in the expected or defined processing time. In this case the process described in the following should be utilized.
Important: The data volume, the expected processing time and other restrictions have to be defined before the tuning process. Tuning need certain goals to be achieved otherwise it ends up tuning for eternity.
The following described process and explanations are related to the tuning process presented by Thomas Kejser („Designing highly scalable data loading“). These techniques are Best Practices and I like to thank Thomas for providing this material.
Systematic tuning follows these steps:
- Assess the problem and establish numeric values that categorize acceptable behavior.
- Measure the performance of the system before modification.
- Identify the part of the system that is critical for improving the performance. (Bottlenecks)
- Modify that part of the system to remove the bottleneck.
- Measure the performance of the system after modification.
A commonly accepted approach to analyze and handle performance issues is shown in the figure below:
In the first step it is necessary to agree on hypothesis in order to get a common baseline. Usually the motivation for this step is driven by already known performance issues or also by the need and desire to improve system or process performance in order to achieve defined goals.
A performance problem may be identified by slow or unresponsive systems or by long loading times. This usually occurs because high system loading, causing some part of the system to reach a limit in its ability to respond. This limit within the system is referred to as a bottleneck. The bottleneck is the part of a system which is at capacity. Other parts of the system will be idle waiting for it to perform its task. In the process of finding and removing bottlenecks, it is important to prove their existence, before acting to remove them. There is a strong temptation to guess. Guesses are often wrong, and investing only in guesses can be highly time-consuming without any effects on the dedicated goal.
In this case the hypothesis should always refer to the known performance issue, like:
“The hard drive I/O causes the bottleneck and is responsible for the high processing time.”
Even if an application, seen for itself, performs well, it can happen due to overall system performance or resource government that it is necessary to deal with performance tuning. A typical scenario can be that the application is running on a server which also operates other application. In this case it is imaginable that restrictions due to resource limitations are set up for our application. So a certain hypothesis can be:
““Due to the ETL design and the high data volume, the application allocates too much memory; the overall memory allocation of the application should not exceed 10 GB.”
After you have agreed on the aimed goals and hypothesis, it is important to measure the system / application before making some modifications. So that you get an overview of how the system acts and performs initially.
This step is necessary in order to have comparable values for later results. Without these values we are not able to determine in what kind of effect a certain modification resulted. The measured values should also relate to the defined goals and hypothesis. Which indicators and tools should be chosen for which problem, will be explained later in detail. The results have to be tracked in order to be available for later comparison. This can be done for example in a structured excel sheet.
In most cases the main measurements include:
- For a better understanding of the measured values it is also recommended to have a landscape overview. The landscape should at least contain the following information:
- Servers in charge
- Process and applications per Server
- Data sources
- Data destinations
In most cases the required information can be received from the customer’s IT department.
When you have saved and analyzed the initial performance, you can start with the modifications. Per process iteration always make just small changes at a time, so that measured values can be assigned to a certain change. Optimally per iteration only one parameter will be changed.
Modifications should also just be made in relation to the agreed targets of optimization. They can concern the ETL design, SQL queries, database design and structure, system architecture, etc. How to focus on which targets will be explained later.
Measure, Save result, Generate Hypothesis
The last three steps will be explained continuous as they can be seen coherent. After the change has been applied, it is necessary to measure again the same indicators as before, so that a comparable result can be achieved. The result has to be evaluated with regard to the hypothesis and the initially measured values. So the hypothesis can be verified or falsified. When the hypothesis has been verified, it is still necessary to compare the results with defined goals. Were the changes also suffice to achieve the defined goals, the process can be seen as completed. If not, it is necessary to define also other hypothesis in order to achieve our aims, also if the hypothesis has been falsified. In this case it can be also necessary to roll back the previously made changes, if they had a bad impact on performance.
Now the process starts from beginning.
The number of iterations depends on the defined goals and the appropriate hypothesis. Also experience is an important factor here.
Things you should NOT do:
- Start tuning before checking that best practices are followed. For this reason please have a look at the following articles: SQL Integration Services Best Practices, The Data Loading Performance Guide
- Conclude that SQL Server® is broken
- Forget about root cause and just change things
- Buy lots of bigger hardware and expect things to run faster
- Please: “Stop Guessing!”