## 2020

STATISTICS ROUNDTABLE

# A Correlation Encounter

## Addressing a common phenomenon for processing industries

by Robert L. Mason and John C. Young

During a recent visit to
the control room for a processing unit, a new process engineer asks the
question: "Why doesn’t the correlation between the two process variables, *x*_{1}
and *x*_{2},
match the correlation as suggested by the theory?"

He explains the correlation between these two variables, based on data taken from daily runs, seldom, if ever, agrees with the expected high correlation as predicted by theoretical considerations.

The workers in the control room respond to his question by noting the correlation between the two variables for data taken from a special performance run—in which the processing unit was taken through all levels of operation from the lowest to the highest level—had good agreement with the suggested theoretical value.

Further complicating matters is the fact that the special performance run was made under the direction of the previous process engineer.

### Studying scatterplots

To understand this result, the new engineer decides to examine a scatterplot of the two variables derived using the performance run data, as shown in Figure 1. As suggested by the theory, the near-perfect linear relationship that exists between the two variables is indicative of a strong correlation between the variables. The value of the correlation coefficient for the data in Figure 1 is very large, with a computed value of 0.99.

Figure 2 is a scatterplot of the same two variables for data taken from a typical daily run. Note that the data in the plot correspond to the group of points located in the box in the upper right-hand corner of Figure 1. Observe the lack of a strong linear relationship between the two process variables. The computed correlation for these data is 0.66. This moderate value is considerably less than the one determined from the data of the performance run. Examination of the data from other daily runs yielded similar results.

From questioning the control room staff, the process engineer soon
discovers that the implemented statistical control procedure for the process
was based on monitoring the residual errors of a regression model between the
two variables.^{1}

The residual error is a measure of how well the regression model predicts.
A small error value indicates good prediction; that is, the value of *x*_{1}
is where it should be relative to the value of *x*_{2}. In contrast,
a large error value indicates that the prediction is poor, which implies
something has fouled the linear relationship between *x*_{1}
and *x*_{2}.

The logic behind such a control procedure is that, under the assumption nothing has changed in the process, you should be able to use the existing linear relationship between the two variables to predict one from the other.

Given the high correlation between *x*_{1} and *x*_{2}
observed in Figure 1, you might choose to predict the value of *x*_{1}
from *x*_{2}
using the regression equation given by *x*_{1} = b_{0} + b_{1}*x*_{2},
in which the regression coefficients (b_{0}, b_{1}) are
estimated from the performance data.

If the value of *x*_{1} for a given observation is where it should be relative to the value of *x*_{2},
there should be little difference (that is, a small
residual error) between the actual value of *x*_{1}
and the predicted value of *x*_{1}.
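The regression step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated (the article's performance-run data are not available), and the names `fit_line`, `residuals`, `x1_perf` and `x2_perf` are hypothetical.

```python
import numpy as np

def fit_line(x2, x1):
    """Least-squares estimates (b0, b1) for the model x1 = b0 + b1 * x2."""
    x2 = np.asarray(x2, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    b1, b0 = np.polyfit(x2, x1, deg=1)
    return b0, b1

def residuals(x2, x1, b0, b1):
    """Residual error: actual x1 minus the value of x1 predicted from x2."""
    x2 = np.asarray(x2, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    return x1 - (b0 + b1 * x2)

# Simulated stand-in for a performance run that sweeps x2 from a low
# to a high operating level (assumed values, not the article's data).
rng = np.random.default_rng(0)
x2_perf = np.linspace(10.0, 100.0, 50)
x1_perf = 5.0 + 2.0 * x2_perf + rng.normal(0.0, 3.0, size=50)

b0, b1 = fit_line(x2_perf, x1_perf)
res = residuals(x2_perf, x1_perf, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, largest |residual| = {np.abs(res).max():.3f}")
```

Because the simulated run covers the full operating range, the slope estimate is well determined and the residuals are small relative to the spread of *x*_{1}.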

A control chart for the standardized residual errors from the regression
model (labeled model one) based on the relationship between the two variables
in Figure 1 is constructed. This is shown in Figure 3. The upper control limit
(UCL) and lower control limit (LCL) in the chart are placed at +3 and –3,
respectively. All points associated with the performance run data are well
within these control limits and indicate the previous relationship between *x*_{1}
and *x*_{2}
is being maintained.
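A minimal sketch of such a residual control chart check follows. It assumes simulated baseline data and standardizes each residual by the residual standard error of the fitted model, which is a simplification (leverage is ignored). The function names `fit_baseline`, `standardized_residuals` and `signals` are hypothetical.

```python
import numpy as np

def fit_baseline(x2, x1):
    """Fit x1 = b0 + b1 * x2; return (b0, b1, s), where s is the
    residual standard error with n - 2 degrees of freedom."""
    x2 = np.asarray(x2, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    b1, b0 = np.polyfit(x2, x1, deg=1)
    res = x1 - (b0 + b1 * x2)
    s = np.sqrt(np.sum(res ** 2) / (len(x1) - 2))
    return b0, b1, s

def standardized_residuals(x2, x1, b0, b1, s):
    """Residuals from the baseline model, divided by s."""
    x2 = np.asarray(x2, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    return (x1 - (b0 + b1 * x2)) / s

def signals(z, lcl=-3.0, ucl=3.0):
    """True where a standardized residual falls outside [LCL, UCL]."""
    z = np.asarray(z, dtype=float)
    return (z < lcl) | (z > ucl)

# Simulated baseline data (assumed values, not the article's data).
rng = np.random.default_rng(1)
x2_base = np.linspace(10.0, 100.0, 60)
x1_base = 5.0 + 2.0 * x2_base + rng.normal(0.0, 3.0, size=60)
b0, b1, s = fit_baseline(x2_base, x1_base)

# Two new observations at x2 = 50: one exactly on the fitted line,
# one with x1 displaced 10 residual standard errors below it.
x2_new = np.array([50.0, 50.0])
x1_new = np.array([b0 + b1 * 50.0, b0 + b1 * 50.0 - 10.0 * s])
z_new = standardized_residuals(x2_new, x1_new, b0, b1, s)
flags = signals(z_new)
print(z_new, flags)
```

The first new observation sits where it should be relative to *x*_{2} and does not signal; the second falls below the LCL of –3 and does.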

Using the data from the daily runs presented in Figure 2, a second regression
model (labeled model two) is obtained for predicting *x*_{1}
from the current value of *x*_{2}.
This is shown in Figure 4. A plot of the standardized residual errors for this
model, using the same UCL and LCL as above, is constructed. In this chart, all
the observed values in the data are well below the LCL value of –3.

If this model is correct, no observation associated with the daily run data is within statistical control. The new process engineer questions why model two is in error because this is how the unit has operated every day since he has been in his new position.

### The explanation

A number of important concepts must be understood before discussing why model two is wrong.

**1. Consider the definition of correlation.** Correlation between two variables is a measure of the strength of the
linear association between the variables. For positive correlation, both
variables increase or decrease together. For negative correlation, when one
variable increases, the other variable decreases.

**2. You must understand how to model the
linear relationship between two variables.** You can do this theoretically or empirically.

A theoretical approach involves using the design mathematics of the processing unit to model the relationship between the variables. Empirical modeling uses the data obtained from the processing unit to compute a regression equation and the corresponding correlation coefficient.

Sometimes, the former modeling method is referred to as a first-principle approach and the latter as an empirical approach.

**3. Discrepancies can occur between the models developed using these
two approaches.** This is because first-principle
relationships seldom take into consideration individual unit differences,
whereas empirical models are usually good only for the processing unit on which
the data are obtained.

Because there are so many intricate differences between the individual units that can affect performance, the empirical model, as a general rule, is better than the first-principle model because it contains the information pertinent only to that unit.

Then why do you see failure in the situation described earlier? The answer lies in the fact that the data obtained in a typical daily run do not provide enough information about the relationship between the two variables. Instead, they provide only a snapshot of what is taking place.

To understand this, consider the scatterplot of the two process
variables, *y*_{1}
and *y*_{2},
in Figure 5. Observe the strong linear relationship between the two variables.
The computed correlation has a value of 0.8120.

Now suppose the process operates only for
values of the major control variable, *y*_{1},
between the values of –0.5 and 0.0. The daily run data will be observed
as the points within the boxed region given in Figure 5. For easier interpretation,
these data are presented in Figure 6. The computed correlation for the daily
run for this subset of the data is 0.2657. This implies there is a weak linear
relationship between the two variables as opposed to the strong linear
relationship noted in Figure 5 when the total data set was used.
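The effect of restricting the operating range can be reproduced with simulated data. The sketch below is not the data behind Figures 5 and 6: it assumes two standard normal variables generated with a correlation of 0.8, then keeps only the observations with *y*_{1} between –0.5 and 0.0. The variable names are hypothetical.

```python
import numpy as np

# Simulated stand-in for the full operating range (assumed model:
# standard normal y1 and y2 with a population correlation of 0.8).
rng = np.random.default_rng(42)
n = 5000
rho = 0.8
y1 = rng.standard_normal(n)
y2 = rho * y1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)

# Correlation over the whole data set.
r_full = np.corrcoef(y1, y2)[0, 1]

# "Daily run": only observations with y1 in the boxed region [-0.5, 0.0].
in_box = (y1 >= -0.5) & (y1 <= 0.0)
r_box = np.corrcoef(y1[in_box], y2[in_box])[0, 1]

print(f"full range:    n = {n}, r = {r_full:.3f}")
print(f"boxed region:  n = {in_box.sum()}, r = {r_box:.3f}")
```

The underlying relationship between the two variables is the same in both computations. Only the spread of *y*_{1} changes, and with it the share of the variation in *y*_{2} that the linear relationship can explain, so the correlation computed inside the box is much weaker.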

Such a phenomenon will occur in many processing units when only a part of the operational region is used in estimating process-variable relationships. In creating a historical data set for a processing unit, the collected data must be taken from all types of run conditions the process may encounter while operating in control. Not only is this necessary to obtain valid estimates of the parameters, such as correlations, but it is also necessary to reduce the number of false signals.

For example, if a particular run condition is not included in the historical data but occurs during process monitoring, the statistical control procedure will interpret it as something different from the baseline and produce a signal.

### True correlation

What the new process engineer observed is a common phenomenon in many processing industries. Most processing units are run using advanced process controls and distributed control systems. These are engineering systems that provide tighter process control and reduce variation by keeping the variables within a limited operating region. This leads to improved precision and lower production cost.

In the current problem, the region of daily run data is denoted as the
small box in the upper right-hand corner of Figure 1. This region also
corresponds to the maximum allowable values of *x*_{1} and *x*_{2},
and provides a strong indication the unit is operating at its maximum output
(which is the case).

As noted in Figure 2, restricting the range of the variables to stay within this boxed region presents a poor snapshot of how the variables vary together, and thus a poor estimate of the true correlation is obtained.

This problem can be corrected by ensuring estimates of needed parameters are obtained using a historical data set that includes all possible in-control run conditions for the processing unit.

### Reference

1. Robert L. Mason and John C. Young, *Multivariate Statistical Process Control With Industrial Applications*, ASA-SIAM, 2002.

**Robert L. Mason** is an institute analyst at Southwest Research Institute in San
Antonio. He received a doctorate in statistics from Southern Methodist
University and is a fellow of ASQ and the American Statistical Association.

**John C. Young** is a retired professor of statistics at McNeese State University in
Lake Charles, LA. He received a doctorate in statistics from Southern Methodist
University.