When the independent variables are correlated with one another in a multiple regression analysis this condition is called?

In multiple regression, a dependent variable (HR) is predicted from several independent variables simultaneously (minutes of stair climbing, patient age, number of days per week the patient exercises).

From: Statistics in Medicine (Second Edition), 2006

Multiple Regression

Gary Smith, in Essential Statistics, Regression, and Econometrics (Second Edition), 2015

The Effect of Air Pollution on Life Expectancy

Many scientists believe that air pollution is hazardous to human health, but they cannot run laboratory experiments on humans to confirm this belief. Instead, they use observational data—noting, for example, that life expectancies are lower and the incidence of respiratory disease is higher for people living in cities with lots of air pollution. However, these grim statistics might be explained by other factors, for example, that city people tend to be older and live more stressful lives. Multiple regression can be used in place of laboratory experiments to control for other factors. One study used biweekly data from 117 metropolitan areas to estimate the following equation [3]:

ŷ = 19.61 + 0.041x1 + 0.71x2 + 0.001x3 + 0.41x4 + 6.87x5
            [2.5]      [3.2]    [1.7]      [5.8]    [18.9]

where:

 Y = annual mortality rate, per 10,000 population (average = 91.4).

X1 = average suspended particulate reading (average = 118.1).

X2 = minimum sulfate reading (average = 4.7).

X3 = population density per square mile (average = 756.2).

X4 = percentage of population that is nonwhite (average = 12.5).

X5 = percentage of population that is 65 or older (average = 8.4).

[ ] = t value.

The first two explanatory variables are measures of air pollution. The last three explanatory variables are intended to accomplish the same objectives as a laboratory experiment in which these factors are held constant to isolate the effects of air pollution on mortality.

With thousands of observations, the cutoff for statistical significance at the 5 percent level is a t value of approximately 1.96. The two pollution measures have substantial and statistically significant effects on the mortality rate. For a hypothetical city with the average values of the five explanatory variables, a 10 percent increase in the average suspended particulate reading, from 118.1 to 129.9, increases the mortality rate by 0.041(11.8) = 0.48, representing an additional 48 deaths annually in a city of 1,000,000. A 10 percent increase in the minimum sulfate reading, from 4.7 to 5.17, increases the mortality rate by 0.71(0.47) = 0.34 (representing an additional 34 deaths annually in a city of 1,000,000).
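As a quick check of the arithmetic, the short Python sketch below reproduces the two 10 percent-increase calculations directly from the estimated coefficients and the reported city averages; the variable labels are made up for readability and are not part of the original study.

```python
# Reproduce the back-of-the-envelope effect calculations from the text,
# using the estimated slopes and the reported average readings.

coef = {"suspended_particulates": 0.041, "min_sulfate": 0.71}      # slopes from the fitted equation
averages = {"suspended_particulates": 118.1, "min_sulfate": 4.7}   # average readings across cities

for var in coef:
    change = 0.10 * averages[var]          # a 10 percent increase from the average
    delta_rate = coef[var] * change        # change in annual deaths per 10,000 population
    deaths_per_million = delta_rate * 100  # rescale from per 10,000 to per 1,000,000
    print(f"{var}: +{change:.2f} units -> +{delta_rate:.2f} per 10,000 "
          f"(about {deaths_per_million:.0f} extra deaths per million)")
```

Running this gives roughly 48 and 34 additional deaths per million, matching the figures quoted in the text.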

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128034590000108

Multiple Regression

Andrew F. Siegel, in Practical Business Statistics (Seventh Edition), 2016

Comparing the Correlation Coefficients

You might not really be interested in the regression coefficients from a multiple regression, which represent the effects of each variable with all others fixed. If you simply want to see how strongly each X variable affects Y, allowing the other X variables to “do what comes naturally” (ie, deliberately not holding them fixed), you may compare the absolute values of the correlation coefficients for Y with each X in turn.

The correlation clearly measures the strength of the relationship (as was covered in Chapter 11), but why use the absolute value? Remember, a correlation near 1 or − 1 indicates a strong relationship, and a correlation near 0 suggests no relationship. The absolute value of the correlation gives the strength of the relationship without indicating its direction.

Multiple regression adjusts or controls for the other variables, whereas the correlation coefficient does not. If it is important that you adjust for the effects of other variables, then multiple regression is your answer. If you do not need to adjust, the correlation approach may meet your needs.

Here are the correlation coefficients of Y with each of the X variables for the magazine ads example. For example, the correlation of page costs with median income is − 0.148.

Correlation With Page Costs
Audience    Percent Male    Median Income
0.850       −0.126          −0.148

In terms of the relationship to page costs, without adjustments for the other X variables, audience has by far the highest absolute value of correlation, 0.850. Next in absolute value of correlation is median income, with |−0.148| = 0.148. Percent male has the smallest absolute value, |−0.126| = 0.126. It looks as if only audience is important in determining page costs. In fact, neither of the other two variables (by itself, without holding the others constant) explains a significant amount of page costs. Significance of the effect of income emerges when we also adjust for audience size in the multiple regression.

The multiple regression gives a different picture because it controls for other variables. After you adjust for audience, the multiple regression coefficient for median income indicates a significant effect of income on page costs. Here is how to interpret this: The adjustment for audience controls for the fact that higher incomes go with smaller audiences (which counteracts the pure income effect). The audience effect is removed (“adjusted for”) in the multiple regression, leaving only the pure income effect, which can be detected because it is no longer masked by the competing audience effect.

Although the correlation coefficients indicate the individual relationships with Y, the standardized regression coefficients from a multiple regression can provide you with important additional information because they reflect the adjustments made due to the other variables in the regression.
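The contrast between the two summaries can be reproduced on any data set. The sketch below uses simulated data (not the book's magazine sample; the names and effect sizes are invented) to compute the unadjusted correlations of Y with each X and then the multiple regression coefficients in which each X is adjusted for the other.

```python
# Unadjusted correlations versus multiple regression coefficients on simulated
# data in which higher income goes with smaller audiences, as in the discussion above.
import numpy as np

rng = np.random.default_rng(0)
n = 55
audience = rng.normal(20, 8, n)                          # hypothetical audience size
income = 60 - 0.8 * audience + rng.normal(0, 4, n)       # negatively related to audience
page_costs = 3.0 * audience + 0.9 * income + rng.normal(0, 5, n)

X = np.column_stack([audience, income])
y = page_costs

# 1) Simple (unadjusted) correlations of Y with each X
for name, x in zip(["audience", "income"], X.T):
    r = np.corrcoef(x, y)[0, 1]
    print(f"corr(page_costs, {name}) = {r:+.3f}   |r| = {abs(r):.3f}")

# 2) Multiple regression coefficients (each X adjusted for the other)
X1 = np.column_stack([np.ones(n), X])                    # add intercept column
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("intercept, b_audience, b_income =", np.round(b, 3))
```

With these made-up numbers the simple correlation of page costs with income comes out negative, while the adjusted regression coefficient for income is positive, mirroring the masking pattern described above.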

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128042502000122

Multiple Regression

Andrew F. Siegel, Michael R. Wagner, in Practical Business Statistics (Eighth Edition), 2022

Comparing the Correlation Coefficients

You might not really be interested in the regression coefficients from a multiple regression, which represent the effects of each variable with all others fixed. If you simply want to see how strongly each X variable affects Y, allowing the other X variables to “do what comes naturally” (ie, deliberately not holding them fixed), you may compare the absolute values of the correlation coefficients for Y with each X in turn.

The correlation clearly measures the strength of the relationship (as was covered in Chapter 11), but why use the absolute value? Remember, a correlation near 1 or −1 indicates a strong relationship, and a correlation near zero suggests no relationship. The absolute value of the correlation gives the strength of the relationship without indicating its direction.

Multiple regression adjusts or controls for the other variables, whereas the correlation coefficient does not. If it is important that you adjust for the effects of other variables, then multiple regression is your answer. If you do not need to adjust, the correlation approach may meet your needs.

Here are the correlation coefficients of Y with each of the X variables for the magazine ads example. For example, the correlation of page costs with median income is −0.402.

Correlation With Page Costs
Audience    Percent Male    Median Income
0.743       −0.369          −0.402

In terms of the relationship to page costs, without adjustments for the other X variables, audience has by far the highest absolute value of correlation, 0.743. Next in absolute value of correlation is median income, with |−0.402| = 0.402. Percent male has the smallest absolute value, |−0.369| = 0.369. Note that all three correlations are significant (using Table D.5 in Appendix D after squaring the correlation coefficients).

The multiple regression gives a different picture because it controls for other variables. After you adjust for audience, the multiple regression coefficient for median income is no longer significant. Here is how to interpret this: The adjustment for audience controls for the fact that higher incomes go with smaller audiences (which counteracts the pure income effect). The audience effect is removed (“adjusted for”) in the multiple regression, leaving only the pure income effect, which can no longer be detected, indicating that the pure income effect is in fact driven primarily by the underlying audience size.

Although the correlation coefficients indicate the individual relationships with Y, the standardized regression coefficients from a multiple regression can provide you with important additional information because they reflect the adjustments made as a result of the other variables in the regression.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128200254000129

Methods You Might Meet, But Not Every Day

R.H. Riffenburgh, in Statistics in Medicine (Third Edition), 2012

Canonical Correlation

Multiple regression, met in Chapters 22 and 23, is a form of multivariate analysis. In this case, one dependent variable is predicted by several independent variables. A coefficient of determination R² is calculated and may be considered as a multiple correlation coefficient, that is, the correlation between the dependent variable and the set of independent variables. If this design is generalized to multiple dependent variables, a correlation relationship between the two sets is of interest.

Canonical correlation is a term for an analysis of correlation among items in two lists (vectors of variables). For example, does a list of lab test results correlate with a list of clinical observations on a patient? The goal is to find two linear combinations, one for each list of variables, that maximize the correlation between them. The coefficients (multipliers of the variables) act as weights on the variables providing information on the interrelationships.

Suppose an investigator is interested in differentiating forms of meningitis. Blood test results (C-reactive protein, blood counts, et al) along with MRI and lumbar puncture (LP) findings are taken from samples of meningitis patients having different forms. Canonical correlations are generalizations of simple correlations between individual variables to correlations between groups. In this case, canonical correlations are found between blood test results as a group and MRI/LP results as a group for each form of meningitis and may then be compared with one another.
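As a rough illustration of the idea (not the chapter's meningitis analysis), the sketch below simulates two sets of variables that share a common signal and extracts the first pair of canonical variates; scikit-learn's CCA is used here purely for convenience, and all names and values are invented.

```python
# Find linear combinations of two variable sets ("lab results" and
# "clinical findings") that are maximally correlated.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n = 200
latent = rng.normal(size=n)                        # shared signal linking the two sets
labs = np.column_stack([latent + rng.normal(0, 1, n) for _ in range(3)])      # "blood tests"
clinical = np.column_stack([latent + rng.normal(0, 1, n) for _ in range(2)])  # "MRI/LP findings"

cca = CCA(n_components=1)
cca.fit(labs, clinical)
u, v = cca.transform(labs, clinical)               # first canonical variate of each set
first_canonical_corr = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
print(f"first canonical correlation: {first_canonical_corr:.3f}")
```

The fitted weights (the multipliers of the individual lab and clinical variables) play the role of the coefficients described above.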

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123848642000287

Structural Equation Modeling

Kentaro Hayashi, ... Ke-Hai Yuan, in Essential Statistical Methods for Medical Statistics, 2011

3.5 Nonlinear SEM

In multiple regression, the dependent variable can be a nonlinear function of the independent variables by the use of polynomial and/or interaction terms. This is straightforward. In contrast, in SEM it has been a difficult task to connect a dependent latent variable with independent latent variables in a nonlinear fashion. Efforts to construct and estimate a nonlinear SEM have been made for the last 20 years. Early works include Kenny and Judd (1984), Bentler (1983), Mooijaart (1985), and Mooijaart and Bentler (1986). The Kenny–Judd model, a particularly simple nonlinear model that includes an interaction term, has been intensively studied. More recent works include Bollen (1996), Bollen and Paxton (1998), Jöreskog and Yang (1996), Klein and Moosbrugger (2000), Lee et al. (2004), Lee and Zhu (2000, 2002), Marsh et al. (2004), Wall and Amemiya (2000, 2001, 2003), and Yang Jonsson (1998). The Bollen–Paxton and Klein–Moosbrugger approaches seem to be especially attractive. The Wall–Amemiya approach seems to be the most theoretically defensible under a wide range of conditions, since it yields consistent estimates under distributional violations. The Bayesian approaches of Lee and his colleagues are the most promising for small samples. However, to the best of the authors' knowledge, no general SEM software incorporates the Wall–Amemiya or Lee approaches.
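For the observed-variable case mentioned in the first sentence, the sketch below shows how polynomial and interaction terms are simply added as extra columns of the design matrix; the data and coefficient values are invented for illustration, and the latent-variable SEM case is not addressed here.

```python
# Nonlinear multiple regression via added interaction and squared terms.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.8 * x2 + 0.6 * x1 * x2 + 0.3 * x2**2 + rng.normal(0, 0.5, n)

# Design matrix with intercept, linear terms, an interaction, and a squared term
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x2**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", np.round(b, 2))   # roughly recovers the values used above
```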

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780444537379500104

Regression and Correlation Methods

ROBERT H. RIFFENBURGH, in Statistics in Medicine (Second Edition), 2006

EXAMPLE POSED: PREDICTING LENGTH OF HOSPITAL STAY

A psychologist would like to be able to predict length of hospital stay (measured in days) for inpatients at time of admission. Information available before examining the patient includes IQ (mean, 100; standard deviation, 10), age (in years), and sex (0,1). Data are available for n = 50 inpatients. How can length of hospital stay be predicted by three variables simultaneously, with each contributing as it can?

METHOD

Concept

Multiple regression is the term applied to the prediction of a dependent variable by several (rather than one) independent variables. For example, in an investigation of cardiovascular syncope, no one sign would be sufficient to predict its occurrence. The investigator would begin with, at least, SBP, diastolic BP, and HR, and then would extend the list of predictors to include perhaps the presence or absence of murmurs, clicks, vascular bruits, and so on. The concept of multiple regression is similar to that of simpler regression, in that the dependent variable is related to the independent variables by a best fit. However, the geometry is extended from a line in x,y dimensions to a plane in x1, x2, y dimensions, or to what is called a hyperplane in x1, x2, x3, …, y dimensions, which is just a plane extended to more than three dimensions. A hyperplane can be treated similarly to a plane mathematically, although it cannot be visualized. A multiple regression model with two independent variables is y = β0 + β1x1 + β2x2. Models of this sort are considered in Section 23.5 and visualized in Fig. 23.11. More generally, we are not even restricted to a plane, that is, to first-degree terms. The model can contain second-degree or other terms of curvature that lead to a curved surface or, in several dimensions, a curved hypersurface. An example of such a model might be y = β0 + β1x1 + β2x2 + β3x2². The foregoing conceptualizations may be more confusing than enlightening to some readers. If so, it is sufficient to remember just that several independent variables combine to predict one dependent variable.

Choosing the Model

We can consider each predictor x one at a time and enter its relationship to the dependent variable y as if it were alone. If y is related to x1 in a straight line, we add β1x1 to the model. If y is related to x1 in a second-degree curve, we add β1x1 + β2x1² to the model. Then we proceed with x2, and so on. (Components combining variables in the same term are possible but form nonlinear models, which are unusual and outside the realm of this book.)
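A minimal sketch of the posed prediction problem, assuming simulated stand-in data rather than the chapter's 50 inpatients: length of stay is regressed on IQ, age, and sex simultaneously by ordinary least squares.

```python
# Fit length of stay = b0 + b1*IQ + b2*age + b3*sex by least squares.
import numpy as np

rng = np.random.default_rng(3)
n = 50
iq = rng.normal(100, 10, n)
age = rng.uniform(18, 80, n)
sex = rng.integers(0, 2, n)                        # coded 0/1 as in the example
stay = 30 - 0.1 * iq + 0.2 * age + 2.0 * sex + rng.normal(0, 3, n)   # simulated outcome

X = np.column_stack([np.ones(n), iq, age, sex])    # intercept plus the three predictors
b, *_ = np.linalg.lstsq(X, stay, rcond=None)
b0, b1, b2, b3 = b
print(f"predicted stay = {b0:.1f} + {b1:.2f}*IQ + {b2:.2f}*age + {b3:.2f}*sex")
```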

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780120887705500642

Linear Regression

Ronald N. Forthofer, ... Mike Hernandez, in Biostatistics (Second Edition), 2007

13.4.4 Multicollinearity Problems

In a multiple regression situation, it is not uncommon to have independent variables that are interrelated to a certain extent especially when survey data are used. Multicollinearity occurs when an explanatory variable is strongly related to a linear combination of the other independent variables. Multicollinearity does not violate the assumptions of the model, but it does increase the variance of the regression coefficients. This increase means that the parameter estimates are less reliable. Severe multicollinearity also makes determining the importance of a given explanatory variable difficult because the effects of explanatory variables are confounded.

Recognizing multicollinearity among a set of explanatory variables is not necessarily easy. Obviously, we can simply examine the scatterplot matrix or the correlations between these variables, but we may miss more subtle forms of multicollinearity. An alternative and more useful approach is to examine what are known as the variance inflation factors (VIF) of the explanatory variables. The VIF for the jth independent variable is given by

VIFj = 1 / (1 − Rj²)

where Rj² is the R² from the regression of the jth explanatory variable on the remaining explanatory variables. The VIF of an explanatory variable indicates the strength of the linear relationship between the variable and the remaining explanatory variables. A rough rule of thumb is that VIFs greater than 10 give some cause for concern.
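The definition above can be computed directly by regressing each explanatory variable on the rest. The sketch below does this with simulated weight, height, age, and BMI values, echoing the demonstration that follows; the data are invented, so the exact numbers will differ from Table 13.12.

```python
# Variance inflation factors computed from the definition VIF_j = 1/(1 - R_j^2).
import numpy as np

def vif(X):
    """VIF for each column of X (X contains the explanatory variables, no intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
n = 50
height = rng.normal(170, 10, n)
weight = 0.9 * height - 90 + rng.normal(0, 8, n)
bmi = weight / (height / 100) ** 2                 # derived from weight and height
age = rng.uniform(20, 70, n)

print(np.round(vif(np.column_stack([weight, age, height])), 1))        # modest VIFs
print(np.round(vif(np.column_stack([weight, age, height, bmi])), 1))   # adding BMI inflates the VIFs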

Now let us review the multiple regression results shown in Tables 13.10 and 13.11. The VIFs shown in these tables are all less than 10, indicating that multicollinearity does not pose a serious problem for those models. As a demonstration of severe multicollinearity, we added to the model shown in Table 13.10 another independent variable that is closely associated with weight and height. Table 13.12 shows the multiple regression analysis of SBP on weight, age, height, and the body mass index (BMI), defined as weight in kilograms divided by the square of height in meters. The VIFs for weight, height, and BMI are all greater than 10 in Table 13.12. More important, the variances of the regression coefficients for weight and height increased, and these variables are no longer statistically significant. The effect of weight on SBP shown in the earlier model cannot be demonstrated if we add BMI. A solution to severe multicollinearity is to delete one of the correlated variables. If we drop the BMI variable, we would eliminate the extreme multicollinearity.

Table 13.12. Multiple regression analysis III: Systolic blood pressure versus weight, age, height, body mass index.

Predictor         Coef      SE Coef    T        p        VIF
Constant          105.2     154.4      0.68     0.499
Weight            0.3052    0.4413     0.69     0.493    97.6
Age               0.4364    0.1333     3.27     0.002    1.2
Height            −0.354    2.246      −0.16    0.875    22.3
BMI               −1.040    3.016      −0.34    0.732    60.9

S = 13.90   R-Sq = 37.8%   R-Sq(adj) = 32.3%

Analysis of Variance:
Source            DF    SS          MS         F       p
Regression        4     5,289.4     1,322.4    6.85    <0.001
Residual Error    45    8,691.4     193.1
Total             49    13,980.9

Source    DF    Seq SS
Weight    1     3,021.9
Age       1     2,182.6
Height    1     61.9
BMI       1     23.0

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123694928500182

Linear regression models

Kandethody M. Ramachandran, Chris P. Tsokos, in Mathematical Statistics with Applications in R (Third Edition), 2021

7.6.1 ANOVA for multiple regression

As in Section 7.3, we can obtain an ANOVA table for multilinear regression (with k independent or explanatory variables) to test the hypothesis

H0: β1 = β2 = ⋯ = βk = 0

versus

Ha: at least one of the parameters βj ≠ 0, j = 1, …, k.

The calculations for multiple regression are almost identical to those for simple linear regression, except that the test statistic (MSR)/(MSE) has an F(k, n – k – 1) distribution. Note that the F-test does not indicate which of the parameters βj ≠ 0, except to say that at least one of them is not zero. The ANOVA table for multiple regression is given by Table 7.6.

Table 7.6. ANOVA Table for Multiple Regression.

Source of variation    Degrees of freedom    Sum of squares    Mean sum of squares    F-ratio
Regression (model)     k                     SSR               MSR = SSR/d.f.         MSR/MSE
Error (residuals)      n − k − 1             SSE               MSE = SSE/d.f.
Total                  n − 1                 SST

Example 7.6.3

For the data in Example 7.6.2, obtain an ANOVA table and test the hypothesis

H0: β1 = β2 = 0 vs. Ha: at least one of the βi ≠ 0, i = 1, 2.

Use α = 0.05.

Solution

We test H0: β1 = β2 = 0 vs. Ha: At least one of the βi ≠ 0, i = 1, 2. Here n = 5, k = 2. Using Minitab, we obtain the ANOVA table (Table 7.7). Based on the p value, we cannot reject the null hypothesis at α = 0.05.

Table 7.7. ANOVA Table for Home Price Data.

Source of variation    Degrees of freedom    Sum of squares    Mean sum of squares    F-ratio    p value
Regression (model)     2                     956.5             478.2                  2.50       0.286
Error (residuals)      2                     382.7             191.4
Total                  4                     1339.2
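The F ratio and p value in Table 7.7 can be checked directly from the sums of squares; a brief sketch (the numbers are copied from the table above):

```python
# Verify F = MSR/MSE with (k, n - k - 1) = (2, 2) degrees of freedom for Table 7.7.
from scipy import stats

n, k = 5, 2
ssr, sse = 956.5, 382.7
msr, mse = ssr / k, sse / (n - k - 1)
f_ratio = msr / mse
p_value = stats.f.sf(f_ratio, k, n - k - 1)        # upper-tail probability of the F distribution
print(f"F = {f_ratio:.2f}, p = {p_value:.3f}")     # about F = 2.50, p = 0.286
```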

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128178157000075

Multiple and Curvilinear Regression

R.H. Riffenburgh, in Statistics in Medicine (Third Edition), 2012

Visualizing the Models

The concept of multiple regression is similar to that of simpler regression, in that the dependent variable is related to the independent variables by a best fit. However, the geometry is extended from a line in x,y dimensions to a plane in x1, x2, y dimensions or to what is called a hyperplane in x1, x2, x3, …, y dimensions, which is just a plane extended to more than three dimensions. A hyperplane can be treated similarly to a plane mathematically, although it cannot be visualized. A multiple regression model with two independent variables is y = β0 + β1x1 + β2x2. Models of this sort were considered in Section 19.5 and visualized in Figure 19.14. More generally, we are not even restricted to a plane, that is, to first-degree terms. The model can contain second-degree or other terms of curvature, which lead to a curved surface or, in several dimensions, a curved hypersurface. An example of such a model might be y = β0 + β1x1 + β2x2 + β3x2². The foregoing conceptualizations may be more confusing than enlightening to some readers. If so, it is sufficient to remember just that several independent variables combine to predict one dependent variable.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123848642000226

Statistical analysis of multivariate data

Milan Meloun, Jiří Militký, in Statistical Data Analysis, 2011

4.8.5 Statistical Tests and Confidence Intervals

Logistic regression has two types of inferential tests: tests of models and tests of individual predictors. Statistical inference in logistic regression is based on certain properties of maximum-likelihood estimators and on likelihood ratio tests. These are large-sample or asymptotic results. Inferences about individual regression coefficients, groups of regression coefficients, goodness-of-fit, mean responses, and predictions of group membership of new observations are all of interest. These inference procedures can be treated by considering hypothesis tests and/or confidence intervals. The inference procedures in logistic regression rely on large sample sizes for accuracy. Two procedures are available for testing the significance of one or more independent variables in a logistic regression: likelihood ratio tests and Wald tests. Simulation studies usually show that the likelihood ratio test performs better than the Wald test. However, the Wald test is still used to test the significance of individual regression coefficients because of its ease of calculation.

There are numerous models in logistic regression: an intercept-only model that includes no predictors, an incomplete model that includes the intercept plus some predictors, a full model that includes the intercept plus all predictors (including, possibly, interactions and variables raised to a power), and a perfect (hypothetical) model that would provide an exact fit of expected frequencies to observed frequencies if only the right set of predictors were measured. As a consequence, there are several comparisons possible: between the intercept-only model and the full model, between the intercept-only model and an incomplete model, between an incomplete model and the full model, between two incomplete models, between a chosen model and the perfect model, etc. Not only are there numerous possible comparisons among models but also there are several tests to evaluate goodness-of-fit. No single test is universally preferred, so the computer programmes use different tests for the models.

Likelihood ratio and deviance: The overall measure of how well the model fits, similar to the residual or error sums of squares value for multiple regression, is given by the likelihood ratio test statistic LR. (It is actually −2 times the log of the likelihood value and is referred to as −2LL or −2 log likelihood.) A well-fitting model will have a small value for −2LL. The minimum value for −2LL is 0. (A perfect fit has a likelihood of 1, and −2LL is then 0.) The likelihood value can be compared between equations as well, with the difference representing the change in predictive fit from one equation to another. Statistical programmes have automatic tests for the significance of these differences. The likelihood ratio LR is −2 times the difference between the log likelihoods of two models, one of which is a subset of the other. The distribution of the LR statistic is closely approximated by the chi-square distribution for large sample sizes. The degrees of freedom (DF) of the approximating chi-square distribution is equal to the difference in the number of regression coefficients in the two models. The test is named as a ratio rather than a difference since the difference between two log likelihoods is equal to the log of the ratio of the two likelihoods. That is, if Lfull is the log likelihood of the full model and Lsubset is the log likelihood of a subset of the full model, the likelihood ratio is defined as LR = −2[Lsubset − Lfull], which is −2 times the log of the ratio of the two likelihoods. The −2 adjusts LR so the chi-square distribution can be used to approximate its distribution. The likelihood ratio test is the test of choice in logistic regression. Various simulation studies have shown that it is more accurate than the Wald test in situations with small to moderate sample sizes. In large samples, it performs about the same. Unfortunately, the likelihood ratio test requires more calculations than the Wald test, since it requires that two maximum-likelihood models be fit.
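A small sketch of the test on simulated data, assuming statsmodels as one convenient way to obtain the two log likelihoods; the variable names and effect sizes are invented.

```python
# Likelihood ratio test: -2 times the difference in log likelihoods of a full
# and a reduced logistic model, compared with a chi-square distribution.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.3 + 1.2 * x1 + 0.0 * x2)))   # x2 has no real effect here
y = rng.binomial(1, p)

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

lr = -2 * (reduced.llf - full.llf)                     # LR statistic for dropping x2
df = full.df_model - reduced.df_model                  # difference in number of coefficients
print(f"LR = {lr:.2f}, p = {stats.chi2.sf(lr, df):.3f}")
```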

Deviance: When the full model in the likelihood ratio test statistic is the saturated model, LR is referred to as the deviance. A saturated model is one which includes all possible terms (including interactions) so that the predicted values from the model equal the original data. The formula for the deviance is D = −2[LReduced − LSaturated]. The deviance may be calculated directly using the formula for the deviance residuals. This expression may be used to calculate the log likelihood of the saturated model without actually fitting a saturated model. The formula is LSaturated = LReduced + D/2. The deviance in logistic regression is analogous to the residual sum of squares in multiple regression. In fact, when the deviance is calculated in multiple regression, it is equal to the sum of the squared residuals. Deviance residuals, to be discussed later, may be squared and summed as an alternative way to calculate the deviance, D. The change in deviance, ΔD, due to excluding (or including) one or more variables is used in logistic regression just as the partial F test is used in multiple regression. The formula for ΔD for testing the significance of the regression coefficient(s) associated with the independent variable x1 is

ΔD(x1) = D(without x1) − D(with x1)
       = −2[L(without x1) − LSaturated] + 2[L(with x1) − LSaturated]
       = −2[L(without x1) − L(with x1)]

This formula looks identical to the likelihood ratio statistic. Because of the similarity between the change in deviance test and the likelihood ratio test, their names are often used interchangeably.

Testing for significance of the coefficients: Logistic regression can also test the hypothesis that a coefficient is different from zero (zero means that the odds ratio does not change and the probability is not affected), as is done in multiple regression. In multiple regression, the t value is used to assess the significance of each coefficient. Logistic regression uses a different statistic, the Wald statistic. It provides the statistical significance for each estimated coefficient so that hypothesis testing can occur just as it does in multiple regression. The Wald test will be familiar to those who use multiple regression. In multiple regression, the common t-test for testing the significance of a particular regression coefficient is a Wald test. In logistic regression, the Wald test is calculated in the same manner. The formula for the Wald statistic is zj = bj/sbj, where sbj is an estimate of the standard deviation of bj provided by the square root of the corresponding diagonal element of the covariance matrix V(β̂). With large sample sizes, the distribution of zj is closely approximated by the normal distribution. With small and moderate sample sizes, the normal approximation is described as ‘adequate.’

Confidence intervals: Confidence intervals (CI) for the regression coefficients are based on the Wald statistics. The formula for the limits of a 100(1 − α)% two-sided confidence interval is bj ± |zα/2| sbj. When the 95% confidence interval includes zero, we would not reject, at the 5% significance level, the null hypothesis H0 : βj = 0 that this model coefficient is zero. The regression coefficient βj is also the logarithm of the odds ratio. Because we know how to find a confidence interval for βj, it is easy to find a CI for the odds ratio. The point estimate of the odds ratio is OR = exp(bj), and the 100(1 − α) percent CI for the odds ratio is exp(bj − zα/2 × sbj) ≤ OR ≤ exp(bj + zα/2 × sbj). The CI for the odds ratio is generally not symmetric around the point estimate. Furthermore, the point estimate ÔR = exp(bj) actually estimates the median of the sampling distribution of ÔR.
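The Wald statistic and the odds-ratio interval can be computed directly from a coefficient estimate and its standard error; the numbers below are hypothetical and only illustrate the formulas above.

```python
# Wald z statistic, its two-sided p value, and the 95% CI for the coefficient
# and for the odds ratio, from a coefficient b_j and standard error s_bj.
import numpy as np
from scipy import stats

b_j, s_bj = 0.85, 0.32                      # hypothetical coefficient and standard error
z_j = b_j / s_bj                            # Wald statistic
p_value = 2 * stats.norm.sf(abs(z_j))       # two-sided p value from the normal approximation

z_crit = stats.norm.ppf(0.975)              # |z_(alpha/2)| for a 95% interval
ci_coef = (b_j - z_crit * s_bj, b_j + z_crit * s_bj)
ci_or = tuple(np.exp(ci_coef))              # exponentiate the limits to get the OR interval

print(f"z = {z_j:.2f}, p = {p_value:.4f}")
print(f"95% CI for beta_j: ({ci_coef[0]:.2f}, {ci_coef[1]:.2f})")
print(f"OR = {np.exp(b_j):.2f}, 95% CI: ({ci_or[0]:.2f}, {ci_or[1]:.2f})")
```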

R2: In multiple regression, R²M represents the proportion of variation in the dependent variable accounted for by the independent variables. (The subscript “M” emphasizes here that this statistic is for multiple regression.) It is the ratio of the regression sum of squares to the total sum of squares. When the residuals from the multiple regression can be assumed to be normally distributed, R²M can be calculated as R²M = (L0 − Lp)/L0, where L0 is the log likelihood of the intercept-only model and Lp is the log likelihood of the model that includes the independent variables. Here Lp varies from L0 to 0 and R²M varies between zero and one. This quantity has been proposed for use in logistic regression. Unfortunately, when R²L (the R² for logistic regression) is calculated using the above formula, it does not necessarily range between zero and one. This is because the maximum value of Lp is not always 0 as it is in multiple regression. Instead, the maximum value of Lp is the log likelihood of the saturated model, Ls. To allow R²L to vary from zero to one, it is calculated as R²L = (L0 − Lp)/(L0 − Ls). The introduction of Ls into this formula causes a degree of ambiguity with R²L that does not exist with R²M. This ambiguity is due to the fact that the value of Ls depends on the configuration of independent variables. The following example will point out the problem. Consider a logistic regression problem consisting of a binary dependent variable and a pool of four independent variables. The data for this example are given in the following table.

y   x1   x2   x3    x4
0 1 1 2.3 5.9
0 1 1 3.6 4.8
1 1 1 4.1 5.6
0 1 2 5.3 4.1
0 1 2 2.8 3.1
1 1 2 1.9 3.7
1 1 2 2.5 5.4
1 2 1 2.3 2.6
1 2 1 3.9 4.6
0 2 1 5.6 4.9
0 2 2 4.2 5.9
0 2 2 3.8 5.7
0 2 2 3.1 4.5
1 2 2 3.2 5.5
1 2 2 4.5 5.2

If only x1 and x2 are included in the model, the dataset may be collapsed because of the number of repeats. In this case, the value of Ls will be less than zero. However, if x3 or x4 is used, there are no repeats and the value of Ls will be zero. Hence, the denominator of R²L depends on which of the independent variables is used. This is not the case for R²M. This ambiguity comes into play especially during subset selection. It means that as we enter and remove independent variables, the target value Ls can change. Hosmer and Lemeshow [144] recommend against the use of R²L as a goodness-of-fit measure. However, we have included it in our output because it does provide a comparative measure of the proportion of the log likelihood that is accounted for by the model. We should just remember that an R²L value of 1.0 indicates that the logistic regression model achieves the same log likelihood as the saturated model. However, this does not mean that it fits the data perfectly. Instead, it means that it fits the data as well as could be hoped for.
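The two R² formulas above can be compared with a few hypothetical log-likelihood values; the sketch below only illustrates the arithmetic, not a fitted model, and the numbers are invented.

```python
# Compare R2_M-style and R2_L-style calculations from log likelihoods:
# L0 (intercept only), Lp (fitted model), and Ls (saturated model).

L0, Lp, Ls = -10.0, -4.0, -1.5             # hypothetical log likelihoods

r2_naive = (L0 - Lp) / L0                  # treats 0 as the best attainable log likelihood
r2_L = (L0 - Lp) / (L0 - Ls)               # rescaled so the saturated model gives 1.0

print(f"R2 with 0 as the target:    {r2_naive:.2f}")
print(f"R2_L with Ls as the target: {r2_L:.2f}")
```

Changing Ls (as happens when the configuration of independent variables changes) changes the denominator, which is exactly the ambiguity described above.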

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780857091093500042

What is it called when independent variables are correlated?

Key Takeaways. Multicollinearity is a statistical concept where several independent variables in a model are correlated. Two variables are considered to be perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables will result in less reliable statistical inferences.

When the independent variables in a multiple regression model are correlated is?

Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

What is correlation in multiple regression?

The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.

When two or more of the independent variables in a multiple regression are correlated with each other the condition is called?

Many difficulties tend to arise when there are more than five independent variables in a multiple regression equation. One of the most frequent is the problem that two or more of the independent variables are highly correlated to one another. This is called multicollinearity.