## 2020

STATISTICS ROUNDTABLE

# A Remedy Using Residuals

## Develop a univariate technique to control two process variables

by Robert L. Mason and John C. Young

It is common in industrial processes for input variables to be closely associated with output variables. You may frequently encounter two process variables tied together.

For example, consider temperature and pressure. As the temperature increases, so does the pressure. Consider fuel use and steam production. An increase in steam production requires an increase in fuel use to produce more steam.

To statistically control two closely associated process variables, two distinct approaches can be used. One approach is to control the two variables (together) through multivariate statistical process control;^{1} however, not everyone is familiar with multivariate techniques. As an alternative, you can develop a univariate control procedure based on monitoring one of the variables after removing the effect of the other. This technique involves monitoring the residuals and has many applications in statistical process control.

### Residual error

Consider a processing industry that monitors the amount of fuel used to produce steam. A natural gas-fired boiler converts water into steam for use in the process. The water is brought into the boiler and heated by natural gas under pressure to produce high-temperature and high-pressure steam that is distributed as an energy source throughout the plant. The unit of measure for the natural gas is in standard cubic feet of gas per hour (scfh), and the unit of measure for steam produced is pounds per hour (lbs/hr).

To increase steam production, more natural gas must be used in firing the boiler. Likewise, to decrease the amount of steam being produced, less natural gas is used to fire the boiler. This demand factor produces a swing in fuel use that invalidates the use of most control procedures because it increases the variation and produces extended runs (that is, long sequences of consecutive observations above or below the fuel-use mean). This is evident in the Shewhart control chart of fuel use for a natural gas boiler in Figure 1. The 109 observations were obtained when the boiler performance was deemed excellent.

Plotting moving averages will lessen the size of these fuel-use swings and will considerably reduce the estimated standard deviation. Consider the moving-average control chart in Figure 2 for the boiler data. The moving averages are computed from the original fuel-use data using a time span of two. Thus, each point in the chart represents the average of two consecutive points. Observe that there are numerous moving averages in the chart outside the upper control limit (UCL) and lower control limit (LCL), yet these values were obtained during a time period in which boiler performance was excellent.
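The span-two moving average described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the readings are hypothetical, the function names are invented for this sketch, and the limits assume sigma is estimated by the sample standard deviation of the individual readings (the article does not say how its limits were computed).

```python
from math import sqrt
from statistics import mean, stdev


def moving_average(values, span=2):
    """Average each run of `span` consecutive observations."""
    return [mean(values[i:i + span]) for i in range(len(values) - span + 1)]


def moving_average_limits(values, span=2, k=3.0):
    """Return (LCL, center, UCL) for a moving-average chart.

    Assumption: sigma is the sample standard deviation of the individual
    observations, so a span-`span` average has sigma / sqrt(span).
    """
    center = mean(values)
    half_width = k * stdev(values) / sqrt(span)
    return center - half_width, center, center + half_width


# Hypothetical fuel-use readings in scfh (not the article's data).
fuel = [610_000, 640_000, 700_000, 760_000, 790_000, 730_000, 660_000, 620_000]
averages = moving_average(fuel, span=2)
lcl, center, ucl = moving_average_limits(fuel, span=2)
# Positions (0-based) of moving averages outside the limits.
signals = [i for i, m in enumerate(averages) if m < lcl or m > ucl]
```

Dividing by the square root of the span is what narrows the limits relative to a chart of individual readings. That step assumes independent observations, which is exactly what demand-driven fuel swings violate.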

An investigation of the points outside the control limits in Figure 2 shows the moving averages that signaled were either very large values (indicating a time period with too much fuel use) or very small values (indicating a time period with too little fuel use). In other words, the control procedure specifies a moving average as being atypical if it is not in the middle of the pack. The largeness or smallness of fuel use alone does not, however, constitute an upset condition in the system. It takes a large amount of fuel to produce a large amount of steam.

To get a clearer picture of the process, you must examine the relationship between fuel use and steam production. This is depicted in the scatterplot in Figure 3 for the sample of data used to construct the control chart given in Figure 1. The computed correlation between fuel use and steam production in this dataset is 0.989, indicating a very strong linear relationship exists between them.

The relationship between the two variables also can be studied by fitting a regression line to fuel use based on steam production. The straight-line regression equation is given by:

Predicted fuel = *a* + *b* (steam),

in which *a* is the estimated intercept and *b* is the estimated slope of the line. A residual, which is the error in the regression fit, is defined as the difference between the observed fuel and the predicted fuel use for a given steam value. The two estimated regression coefficients are computed from the data using the method of least squares to minimize the sum of the squared residuals. Using the data in Figure 3, the regression equation is calculated to be:

Fuel = –18,028.7 + 1.21994 (steam).

It is represented by the line plotted in Figure 3.
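For readers who want to see the least-squares calculation spelled out, the following sketch computes the intercept, slope and correlation for a straight-line fit. The data pairs are hypothetical stand-ins (the article's 109 observations are not reproduced here), and `fit_line` is a name chosen for this illustration.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept a, slope b and correlation r for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # estimated slope
    a = y_bar - b * x_bar      # estimated intercept
    r = sxy / sqrt(sxx * syy)  # sample correlation
    return a, b, r


# Hypothetical (steam in lbs/hr, fuel in scfh) pairs, not the article's data.
steam = [500_000, 550_000, 600_000, 650_000, 700_000]
fuel = [592_000, 655_000, 712_000, 776_000, 835_000]
a, b, r = fit_line(steam, fuel)
# Residual = observed fuel - predicted fuel for each observation.
residuals = [f - (a + b * s) for s, f in zip(steam, fuel)]
```

With an intercept in the model, the least-squares residuals sum to zero (up to rounding), which is a quick sanity check on any fit.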

The above regression equation has numerous uses, including predicting the value of fuel for a given value of steam. For example, consider the circled point in the upper-right corner of Figure 3. The coordinates of this point are: fuel use = 850,740 scfh and steam production = 732,298 lbs/hr. Using the regression equation, fuel use can be predicted by substituting the steam production value of 732,298 and solving the equation to obtain:

Fuel = –18,029 + 1.21994 (732,298) = 875,331.

For this observation, the corresponding residual (observed fuel – predicted fuel) is given by:

Residual = 850,740 – 875,331 = –24,591.

This appears to be a large number. When you consider the variation, however, you will see a different picture.
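The prediction and residual for the circled point can be checked with a short script. The coefficients and the observation are the values reported above; the variable and function names are only for illustration.

```python
# Regression coefficients and the circled observation, as reported in the article.
INTERCEPT = -18_028.7
SLOPE = 1.21994


def predict_fuel(steam):
    """Predicted fuel use (scfh) for a given steam production (lbs/hr)."""
    return INTERCEPT + SLOPE * steam


observed_fuel = 850_740
steam = 732_298

predicted_fuel = predict_fuel(steam)
residual = observed_fuel - predicted_fuel  # observed - predicted

print(round(predicted_fuel), round(residual))  # 875331 -24591
```

Rounding the intercept to –18,029, as in the worked equation, gives the same values to the nearest whole unit.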

### Control based on residuals

The size of a residual can be judged most easily by examining its corresponding studentized residual, which is obtained by dividing the residual error by its estimated standard deviation. For example, the value of the estimated standard deviation for the above residual is 20,954, producing a studentized residual of –24,591 / 20,954 = –1.174.
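The article does not state which formula was used for the estimated standard deviation of a residual. One common choice for a straight-line fit, assumed in the sketch below, is the internally studentized residual, in which each residual is divided by s·sqrt(1 − h), where s is the residual standard error and h is the leverage of that observation. The data are hypothetical.

```python
from math import sqrt


def studentized_residuals(x, y):
    """Internally studentized residuals for the least-squares line y = a + b*x.

    Assumption: the standard deviation of residual i is s * sqrt(1 - h_i),
    with s^2 = SSE / (n - 2) and leverage h_i = 1/n + (x_i - x_bar)^2 / Sxx.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverages)]


# Hypothetical (steam, fuel) pairs, not the article's data.
steam = [500_000, 550_000, 600_000, 650_000, 700_000]
fuel = [592_000, 655_000, 712_000, 776_000, 835_000]
student = studentized_residuals(steam, fuel)
```

Because the division puts every residual on a common scale, values from different operating levels can be compared directly against the same limits.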

In general, a studentized residual can be treated as an observation from an approximate standard normal distribution, that is, a normal variable with a mean of zero and a variance of 1. Because standard normal variables fall between ±3 approximately 99.73% of the time, this fact can be used to establish a control chart based on the residual values, with control limits set at ±3.
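The 99.73% figure and the ±3 rule can be verified with the standard library. In this sketch, `normal_coverage` and `out_of_control` are illustrative names, and the residual values passed to them are made up.

```python
from math import erf, sqrt


def normal_coverage(k):
    """P(-k <= Z <= k) for a standard normal variable Z."""
    return erf(k / sqrt(2))


def out_of_control(studentized, limit=3.0):
    """1-based observation numbers whose studentized residual lies beyond +/- limit."""
    return [i for i, r in enumerate(studentized, start=1) if abs(r) > limit]


print(f"{normal_coverage(3):.4%}")  # 99.7300%

# Made-up studentized residuals: only the second lies outside +/-3.
flagged = out_of_control([0.2, -3.4, 1.1, 3.0])
```

A point exactly on a limit is not flagged here; whether to treat such a point as a signal is a convention to settle before charting.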

Using the previous fuel-use data, a control procedure was developed for the corresponding studentized residuals obtained from the plotted regression equation given in Figure 3. The results are in Figure 4. Only one point (No. 18) is designated as being out of control (below the LCL). Contrast this result with the many signals obtained in the moving-average chart in Figure 2, which did not adjust fuel for the effects of the steam.

For situations similar to the earlier example, a regression equation with multiple variables can be constructed from a historical data set (HDS) obtained under good operational conditions to predict an important process variable (criterion variable). Given a good fit to the data and no changes in the process, the regression equation should accurately predict the value of the criterion variable from the other process variables.

The residual is a measure of the goodness of the prediction. Small errors in the residuals from this multiple regression equation would imply the process is conforming to the relationships established in the HDS. Large errors would indicate something is out of control.

For most control procedures, you assume the observations are independent. Many times in the process industry, this condition is not satisfied because the data are time dependent or autocorrelated.^{2} This means that an observation taken at time *t* is related to the previous observation made at time (*t* – 1).

When autocorrelation is present, it is difficult to construct a control procedure on a variable due to the time effect. It is easy to construct a control procedure based on the residuals, however, in which the effect of the time variable has been removed. In this case, the underlying regression model takes a slightly different form and is referred to as an autoregressive model of order one.^{3} The resulting prediction equation is:

(Observation at time *t*) = *c* + *d* (observation at time (*t* – 1)),

in which *c* and *d* are the estimated coefficients. The time-adjusted residuals from this fitted equation could then be plotted in a control chart.
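A minimal sketch of the order-one autoregressive fit follows. It assumes *c* and *d* are estimated by ordinary least squares, regressing each observation on the one before it; the article does not specify the estimation method, and the function names and series are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimates (c, d) for x_t = c + d * x_(t-1)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    prev_bar = sum(prev) / n
    curr_bar = sum(curr) / n
    d = (sum((p - prev_bar) * (q - curr_bar) for p, q in zip(prev, curr))
         / sum((p - prev_bar) ** 2 for p in prev))
    c = curr_bar - d * prev_bar
    return c, d


def time_adjusted_residuals(series, c, d):
    """Observed value minus the AR(1) prediction, for t = 2, ..., n."""
    return [q - (c + d * p) for p, q in zip(series[:-1], series[1:])]


# Hypothetical series that follows x_t = 1 + 0.5 * x_(t-1) exactly.
series = [10.0, 6.0, 4.0, 3.0, 2.5, 2.25]
c, d = fit_ar1(series)
adjusted = time_adjusted_residuals(series, c, d)
```

The first observation has no predecessor, so the adjusted series is one point shorter than the original.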

For example, if we apply run rules (rules that signal when too many successive observations fall above or below the mean of zero) to the control chart in Figure 4, there would be a number of signals. This indicates the corresponding studentized residuals in the plot are autocorrelated. When you fit an autoregressive model to these studentized residuals and compute the time-adjusted studentized residuals from it, however, you obtain the time-adjusted residual plot in Figure 5.

This plot does not contain all of the runs observed in Figure 4 and is largely free of the effects of the time dependency. In addition, the same observation that produced a signal in Figure 4 continues to produce a signal in Figure 5.
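A run rule of the kind mentioned above can be sketched as follows. The article does not say which run length was used; the default of eight consecutive points on one side of the center line is an assumption made for this illustration, as is the function name.

```python
def run_signals(values, run_length=8, center=0.0):
    """1-based positions at which a run of at least `run_length` consecutive
    points on the same side of `center` is in progress.

    Assumption: run_length=8 is a placeholder; use the rule your plant follows.
    """
    signals = []
    count = 0
    last_side = 0
    for i, v in enumerate(values, start=1):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            count += 1
        else:
            count = 1 if side != 0 else 0   # a point on the line breaks the run
        last_side = side
        if count >= run_length:
            signals.append(i)
    return signals


# Made-up residuals: three points above zero, then four below.
example = run_signals([1, 1, 1, -1, -1, -1, -1], run_length=4)
```

Applied to the studentized residuals before and after the time adjustment, a function like this gives a simple count of how many run signals the adjustment removes.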

### References

1. Robert L. Mason and John C. Young, *Multivariate Statistical Process Control With Industrial Applications*, ASA-SIAM, 2002.
2. Bovas Abraham and Johannes Ledolter, *Introduction to Regression Modeling*, Thomson Brooks/Cole, 2006.
3. Ibid.

**Robert L. Mason** is an institute analyst at Southwest Research Institute in San Antonio. He has a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of ASQ and the American Statistical Association.

**John C. Young** is a retired professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.
