A Correlation Encounter
Addressing a common phenomenon for processing industries
by Robert L. Mason and John C. Young
During a recent visit to the control room for a processing unit, a new process engineer asks the question: "Why doesn’t the correlation between the two process variables, x1 and x2, match the correlation as suggested by the theory?"
He explains that the correlation between these two variables, based on data taken from daily runs, seldom, if ever, agrees with the high correlation predicted by theoretical considerations.
The workers in the control room respond to his question by noting that the correlation between the two variables for data taken from a special performance run—in which the processing unit was taken through all levels of operation from the lowest to the highest level—agreed well with the suggested theoretical value.
Further complicating matters was the fact that the special performance run was made under the direction of the previous process engineer.
To understand this result, the new engineer decides to examine a scatterplot of the two variables derived using the performance run data, as shown in Figure 1. As suggested by the theory, the near-perfect linear relationship that exists between the two variables is indicative of a strong correlation between the variables. The value of the correlation coefficient for the data in Figure 1 is very large, with a computed value of 0.99.
Figure 2 is a scatterplot of the same two variables for data taken from a typical daily run. Note that the data in the plot correspond to the group of points located in the box in the upper right-hand corner of Figure 1. Observe the lack of a strong linear relationship between the two process variables. The computed correlation for these data is 0.66. This moderate value is considerably less than the one determined from the data of the performance run. Examination of the data from other daily runs yielded similar results.
From questioning the control room staff, the process engineer soon discovers that the implemented statistical control procedure for the process was based on monitoring the residual errors of a regression model between the two variables.1
The residual error is a measure of how well the regression model predicts. A small error value indicates good prediction; that is, the value of x1 is where it should be relative to the value of x2. In contrast, a large error value indicates the prediction is poor, which implies something has disrupted the linear relationship between x1 and x2.
The logic behind such a control procedure is that, under the assumption nothing has changed in the process, you should be able to use the existing linear relationship between the two variables to predict one from the other.
Given the high correlation between x1 and x2 observed in Figure 1, you might choose to predict the value of x1 from x2 using the regression equation given by x1 = b0 + b1x2, in which the regression coefficients (b0, b1) are estimated from the performance data.
If the value of x1 for a given observation is where it should be relative to the value of x2, there should be little difference (for example, residual error) between the actual value of x1 and the predicted value of x1.
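This prediction-and-residual idea can be sketched in a few lines of Python. The numbers below are simulated stand-ins for the performance-run data, not the article's actual measurements, and the coefficients are illustrative:

```python
import numpy as np

# Illustrative values standing in for performance-run data
# (not the article's actual measurements).
rng = np.random.default_rng(1)
x2 = np.linspace(0.0, 10.0, 50)
x1 = 2.0 + 1.5 * x2 + rng.normal(0.0, 0.2, size=x2.size)

# Estimate the regression coefficients (b0, b1) by least squares.
b1, b0 = np.polyfit(x2, x1, deg=1)

# Residual error: actual x1 minus the value of x1 predicted from x2.
residuals = x1 - (b0 + b1 * x2)

# When the linear relationship holds, the residuals are small.
print(round(float(np.abs(residuals).mean()), 2))
```

When the relationship between the two variables is intact, the mean absolute residual stays near the noise level of the process; a breakdown in the relationship inflates it.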
A control chart for the standardized residual errors from the regression model (labeled model one) based on the relationship between the two variables in Figure 1 is constructed. This is shown in Figure 3. The upper control limit (UCL) and lower control limit (LCL) in the chart are placed at +3 and –3, respectively. All points associated with the performance run data are well within these control limits and indicate the previous relationship between x1 and x2 is being maintained.
Using the data from the daily runs presented in Figure 2, a second regression model (labeled model two) is obtained for predicting x1 from the current value of x2. This is shown in Figure 4. A plot of the standardized residual errors for this model, using the same UCL and LCL as above, is constructed. In this chart, all the observed values in the data are well below the LCL value of –3.
If this model is correct, no observation associated with the daily run data is within statistical control. The new process engineer questions how model two can be wrong, because this is how the unit has operated every day since he took his new position.
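The control-chart logic behind Figures 3 and 4 amounts to standardizing each residual and comparing it to the +3 and –3 limits. A minimal sketch, again with simulated data standing in for the performance run:

```python
import numpy as np

# Simulated stand-in for the performance-run data.
rng = np.random.default_rng(2)
x2 = np.linspace(0.0, 10.0, 100)
x1 = 2.0 + 1.5 * x2 + rng.normal(0.0, 0.2, size=x2.size)

# Fit model one on the (simulated) performance-run data.
b1, b0 = np.polyfit(x2, x1, deg=1)
resid = x1 - (b0 + b1 * x2)
sigma = resid.std(ddof=2)  # residual standard deviation

def standardized_residual(x1_new, x2_new):
    """Standardized residual of a new observation under model one."""
    return (x1_new - (b0 + b1 * x2_new)) / sigma

# An observation that follows the fitted relationship stays
# inside the +/-3 control limits.
z_ok = standardized_residual(2.0 + 1.5 * 5.0, 5.0)
print(abs(z_ok) < 3)   # True

# An observation that breaks the relationship signals.
z_bad = standardized_residual(2.0 + 1.5 * 5.0 + 5.0, 5.0)
print(abs(z_bad) > 3)  # True
```

The chart therefore signals not when x1 or x2 individually drifts, but when the pair of values departs from the fitted linear relationship.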
A number of important concepts must be understood before discussing why model two is wrong.
1. Consider the definition of correlation. Correlation between two variables is a measure of the strength of the linear association between the variables. For positive correlation, both variables increase or decrease together. For negative correlation, when one variable increases, the other variable decreases.
2. You must understand how to model the linear relationship between two variables. You can do this theoretically or empirically.
A theoretical approach involves using the design mathematics of the processing unit to model the relationship between the variables. Empirical modeling uses the data obtained from the processing unit to compute a regression equation and the corresponding correlation coefficient.
Sometimes, the former modeling method is referred to as a first-principle approach and the latter as an empirical approach.
3. Discrepancies can occur between the models developed using these two approaches. This is because first-principle relationships seldom take into consideration individual unit differences, whereas empirical models are usually good only for the processing unit on which the data are obtained.
Because so many intricate differences between individual units can affect performance, the empirical model is, as a general rule, better than the first-principle model: it contains information pertinent to that particular unit.
Then why do you see failure in the situation described earlier? The answer lies in the fact that the data obtained in a typical daily run do not provide enough information about the relationship between the two variables. Instead, they provide only a snapshot of what is taking place.
To understand this, consider the scatterplot of the two process variables, y1 and y2, in Figure 5. Observe the strong linear relationship between the two variables. The computed correlation has a value of 0.8120.
Now suppose the process operates only for values of the major control variable, y1, between the values of –0.5 and 0.0. The daily run data will be observed as the points within the boxed region given in Figure 5. For easier interpretation, these data are presented in Figure 6. The computed correlation for the daily run for this subset of the data is 0.2657. This implies there is a weak linear relationship between the two variables as opposed to the strong linear relationship noted in Figure 5 when the total data set was used.
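This restriction-of-range effect is easy to reproduce. The simulation below uses bivariate data with a built-in correlation near 0.8, standing in for (y1, y2); the figures' exact data are not available, so the numbers are illustrative:

```python
import numpy as np

# Simulated bivariate data with a strong linear relationship,
# standing in for (y1, y2).
rng = np.random.default_rng(3)
y1 = rng.normal(0.0, 1.0, size=2000)
y2 = 0.8 * y1 + rng.normal(0.0, 0.6, size=y1.size)

# Correlation over the full operating range.
r_full = np.corrcoef(y1, y2)[0, 1]

# Correlation when y1 is restricted to a narrow band,
# as in a typical daily run.
mask = (y1 > -0.5) & (y1 < 0.0)
r_restricted = np.corrcoef(y1[mask], y2[mask])[0, 1]

print(r_full > r_restricted)  # True: restricting the range shrinks r
```

Restricting y1 to a narrow band removes most of its variation, so the linear signal is swamped by the noise in y2 and the computed correlation collapses, even though the underlying relationship is unchanged.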
Such a phenomenon will occur in many processing units when only a part of the operational region is used in estimating process-variable relationships. In creating a historical data set for a processing unit, the collected data must be taken from all types of run conditions the process may encounter during in-control run conditions. Not only is this necessary to obtain valid estimates of the parameters, such as correlations, but it is also necessary to reduce the number of false signals.
For example, if a particular run condition is not included in the historical data but occurs during process monitoring, the statistical control procedure will interpret it as something different from the baseline and produce a signal.
What the new process engineer observed is a common phenomenon in many processing industries. Most processing units are run using advanced process controls and distributed control systems. These are engineering systems that provide tighter process control and reduce variation by keeping the variables within a limited operating region. This leads to improved precision and lower production cost.
In the current problem, the region of daily run data is denoted as the small box in the upper right-hand corner of Figure 1. This region also corresponds to the maximum allowable values of x1 and x2, and provides a strong indication the unit is operating at its maximum output (which is the case).
As noted in Figure 2, restricting the range of the variables to stay within this boxed region presents a poor snapshot of how the variables vary together, and thus a poor estimate of the true correlation is obtained.
This problem can be corrected by ensuring estimates of needed parameters are obtained using a historical data set that includes all possible in-control run conditions for the processing unit.
Reference
1. Robert L. Mason and John C. Young, Multivariate Statistical Process Control With Industrial Applications, ASA-SIAM, 2002.
Robert L. Mason is an institute analyst at Southwest Research Institute in San Antonio. He received a doctorate in statistics from Southern Methodist University and is a fellow of ASQ and the American Statistical Association.
John C. Young is a retired professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.