What statistical technique is used to explain the variance in the outcome variable based on the differences in the predictor variable?

Regression Analysis

Claudia Angelini, in Encyclopedia of Bioinformatics and Computational Biology, 2019

Introduction

Regression analysis is a well-known statistical learning technique used to infer the relationship between a dependent variable Y and p independent variables X = [X1, …, Xp]. The dependent variable Y is also known as the response variable or outcome, and the variables Xk (k = 1, …, p) as predictors, explanatory variables, or covariates. More precisely, regression analysis aims to estimate the mathematical relation f() that explains Y in terms of X, Y = f(X), using the observations (xi, Yi), i = 1, …, n, collected on n observed statistical units. If Y describes a univariate random variable, the regression is said to be univariate; otherwise it is referred to as multivariate regression. If Y depends on only one variable x (i.e., p = 1), the regression is said to be simple; otherwise (i.e., p > 1), the regression is said to be multiple, see (Abramovich and Ritov, 2013; Casella and Berger, 2001; Faraway, 2004; Fahrmeir et al., 2013; Jobson, 1991; Rao, 2002; Sen and Srivastava, 1990).

For the sake of brevity, in this chapter we limit our attention to univariate regression (simple and multiple), so that Y = (Y1, Y2, …, Yn)T represents the vector of observed outcomes, and X = (x1, …, xn)T represents the design matrix of observed covariates, whose i-th row is xiT with xi = (xi1, …, xip)T (or xi = (1, xi1, …, xip)T when an intercept is included). In this setting X = [X1, …, Xp] is a p-dimensional variable (p ≥ 1).

Regression analysis techniques can be organized into two main categories: non-parametric and parametric. The first category contains techniques that do not assume a particular form for f(), while the second includes techniques based on the assumption that the relationship f() is known up to a fixed number of parameters β that need to be estimated from the observed data. When the relation between the explanatory variables, X, and the parameters, β, is linear, the model is known as the Linear Regression Model.

The Linear Regression Model is one of the oldest and most studied topics in statistics and is the type of regression most used in applications. For example, regression analysis can be used to investigate how a certain phenotype (e.g., blood pressure) depends on a series of clinical parameters (e.g., cholesterol level, age, diet, and others), or how gene expression depends on a set of transcription factors that can up- or down-regulate the transcriptional level, and so on. Although linear models are simple and easy to handle mathematically, they often provide an adequate and interpretable estimate of the relationship between X and Y. Technically speaking, the linear regression model assumes the response Y to be a continuous variable defined on the real scale, and each observation is modeled as

Yi = β0 + β1xi1 + … + βpxip + εi = xiTβ + εi,   i = 1, …, n

where β = (β0, β1, …, βp)T is a vector of unknown parameters called regression coefficients and ε represents the error or noise term that accounts for the randomness of the measured data, i.e., the residual variability not explained by X. The regression coefficients β can be estimated by fitting the observed data using the least squares approach. Under the Gauss-Markov conditions (i.e., the εi are assumed to be independent and identically distributed random variables with zero mean and finite variance σ2), the ordinary least squares estimate β^ is guaranteed to be the best linear unbiased estimator (BLUE). Moreover, under the further assumption that ε ~ N(0, σ2In), β^ allows statistical inference to be carried out on the model, as described later. The Gauss-Markov conditions together with the normal distribution of the error term are known as the “white noise” conditions. In this context, the linear regression model is also known as regression of the “mean”, since it models the conditional expectation of Y given X, Y^ = E(Y|X) = Xβ^, where E(Y|X) denotes the conditional expected value of Y for fixed values of the regressors X.
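To make the estimation step concrete, the following minimal Python sketch (not part of the original chapter) fits β^ by ordinary least squares with NumPy on simulated data; the sample size, covariate names and coefficient values are invented purely for illustration.

```python
import numpy as np

# Illustrative data: n = 50 observations of p = 2 covariates (all names and
# values are hypothetical)
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)                 # e.g., a standardized clinical parameter
x2 = rng.normal(size=n)                 # e.g., another standardized covariate
eps = rng.normal(scale=0.5, size=n)     # white-noise error term
y = 1.0 + 2.0 * x1 - 0.7 * x2 + eps     # true beta = (1.0, 2.0, -0.7)

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: beta_hat minimizes ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values, i.e., the estimated conditional mean E(Y | X) = X beta_hat
y_hat = X @ beta_hat
print("beta_hat:", beta_hat)
```

Under the additional normality assumption on the errors, the same β^ is also the maximum likelihood estimate.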

The linear regression model not only allows one to estimate the regression coefficients β as β^ (and hence to quantify the strength of the relationship between Y and each of the p explanatory variables when the remaining p−1 are fixed), but also to select those variables that have no relationship with Y (when the remaining ones are fixed), and to identify which subsets of explanatory variables have to be considered in order to explain the response Y sufficiently well. These tasks can be carried out by testing the significance of each individual regression coefficient when the others are fixed, by removing the coefficients that are not significant and re-fitting the linear model, and/or by using model selection approaches. Moreover, the linear regression model can also be used for prediction. For this purpose, given the estimated values β^, it is possible to predict the response Y0^ = x0Tβ^ corresponding to any novel value x0 and to estimate the uncertainty of such a prediction. The uncertainty depends on the type of prediction one wants to make. In fact, it is possible to compute two types of confidence intervals: one for the expectation of a predicted value at a given point x0, and one for a future generic observation at a given point x0.
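The two interval types can be computed from the standard closed-form expressions. The sketch below (an illustration on simulated data, not taken from the chapter) contrasts the confidence interval for the mean response at a hypothetical point x0 with the wider prediction interval for a single future observation at x0.

```python
import numpy as np
from scipy import stats

# Same kind of illustrative setup as the previous sketch: simulate data and fit OLS
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.7])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Interval estimates at a hypothetical new point x0 = (1, x01, x02)
x0 = np.array([1.0, 0.5, -1.0])
y0_hat = x0 @ beta_hat                           # point prediction x0^T beta_hat

dof = n - X.shape[1]                             # residual degrees of freedom
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / dof                 # unbiased estimate of sigma^2
h0 = x0 @ np.linalg.inv(X.T @ X) @ x0            # x0^T (X^T X)^{-1} x0
t = stats.t.ppf(0.975, df=dof)                   # 95% two-sided t quantile

# Confidence interval for the mean response E(Y | x0)
ci_mean = y0_hat + np.array([-1.0, 1.0]) * t * np.sqrt(sigma2_hat * h0)
# Prediction interval for one future observation at x0 (adds the error variance)
pi_obs = y0_hat + np.array([-1.0, 1.0]) * t * np.sqrt(sigma2_hat * (1.0 + h0))
print("CI for the mean:", ci_mean)
print("Prediction interval:", pi_obs)
```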

As the number p of explanatory variables increases, the least squares approach suffers from a series of problems, such as lack of prediction accuracy and difficulty of interpretation. To address these problems, it is desirable to have a model with only a small number of “important” variables, which is able to provide a good explanation of the outcome and good generalization at the price of sacrificing some details. Model selection consists in identifying which subsets of explanatory variables have to be “selected” to explain the response Y sufficiently well, making a compromise referred to as the bias-variance trade-off. This is equivalent to choosing between competing linear regression models (i.e., with different combinations of variables). On the one hand, including too few variables leads to so-called "underfitting" of the data, characterized by poor prediction performance with high bias and low variance. On the other hand, selecting too many variables gives rise to so-called "overfitting" of the data, characterized by poor prediction performance with low bias and high variance. Stepwise linear regression (Miller et al., 2002; Sen and Srivastava, 1990), a specific example of subset regression analysis, is an attempt to address this problem. Although model selection can be used in the classical regression context, it is one of the most effective tools in high-dimensional data analysis.
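As a toy illustration of choosing between competing linear models, the sketch below scores every subset of five simulated covariates by AIC and keeps the best one; the simulated data, the use of AIC as the criterion, and the exhaustive search (rather than the stepwise search cited above) are assumptions made only for this example.

```python
import itertools
import numpy as np
import statsmodels.api as sm

# Hypothetical data: only the first two of five candidate covariates matter
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Exhaustive subset search scored by AIC, one way of trading bias against variance:
# too few variables underfit, too many overfit, and AIC penalizes model size
best = None
for k in range(p + 1):
    for subset in itertools.combinations(range(p), k):
        cols = list(subset)
        Xs = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
        aic = sm.OLS(y, Xs).fit().aic
        if best is None or aic < best[0]:
            best = (aic, cols)

print("selected covariates:", best[1], "AIC:", round(best[0], 2))
```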

Classical regression deals with the case n ≥ p, where n denotes the number of independent observations (i.e., the sample size) and p the number of variables. Nowadays, in many applications, especially in biomedical science, high-throughput assays are capable of measuring from thousands to hundreds of thousands of variables on a single statistical unit. Therefore, one often has to deal with the case p >> n. In such a case, ordinary least squares cannot be applied, and other types of approaches (for example, those including the use of a penalization function), such as Ridge regression, Lasso or Elastic net regression (Hastie et al., 2009; James et al., 2013; Tibshirani, 1996, 2011), have to be used for estimating the regression coefficients. In particular, the Lasso is very effective since it also performs variable selection, and it has opened the new framework of high-dimensional regression (Bühlmann and van de Geer, 2011; Hastie et al., 2015). Model selection and high-dimensional data analysis are strongly connected, and they might also benefit from dimension reduction techniques such as principal component analysis, or from feature selection.
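A minimal sketch of penalized estimation in a p >> n setting is given below, using scikit-learn's cross-validated Lasso; the simulated design, sparsity pattern and noise level are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical high-dimensional setting: p = 500 covariates, n = 60 samples,
# with only 3 truly non-zero coefficients (a gene-expression-like design)
rng = np.random.default_rng(2)
n, p = 60, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=0.5, size=n)

# Lasso with the penalty weight chosen by cross-validation; the L1 penalty
# shrinks most coefficients exactly to zero, so it also performs variable selection
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("indices of non-zero coefficients:", selected)
```

Ridge and Elastic Net fits can be obtained in the same way through the analogous Ridge/RidgeCV and ElasticNet/ElasticNetCV classes.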

In the classical framework, Y is treated as a random variable, while the X are considered fixed; hence, depending on the distribution of Y, different types of regression models can be defined. With X fixed, the assumptions on the distribution of Y are elicited through the distribution of the error term ε = (ε1, …, εn)T. As mentioned above, classical linear regression requires the error term to satisfy the Gauss-Markov conditions and to be normally distributed. However, when the error term is not normally distributed, linear regression might not be appropriate. Generalized linear models (GLM) constitute a generalization of classical linear regression that allows the response variable Y to have an error distribution other than normal (McCullagh and Nelder, 1989). GLM generalize linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. In this way, GLM represent a wide framework that includes linear regression, logistic regression, Poisson regression, multinomial regression, etc. In this framework, the regression coefficients can be estimated using the maximum likelihood approach, often solved by iteratively reweighted least squares algorithms.
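The sketch below illustrates the GLM formulation with statsmodels on simulated data, fitting a Poisson model with its canonical log link and a logistic model with a Binomial family; all data and coefficient values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical count outcome modeled with a Poisson GLM and its canonical log link
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)
mu = np.exp(0.3 + 0.8 * x)              # mean tied to the linear predictor via the link
y_counts = rng.poisson(mu)

# Coefficients are estimated by maximum likelihood (IRLS) inside GLM.fit()
poisson_fit = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit()
print(poisson_fit.params)               # estimates of (beta_0, beta_1)

# The same interface covers logistic regression with a Binomial family
prob = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))
y_binary = rng.binomial(1, prob)
logit_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()
print(logit_fit.params)
```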

In the following, we briefly summarize the key concepts and definitions related to linear regression, moving from the simple linear model to the multiple linear model. In particular, we discuss the Gauss-Markov conditions and the properties of the least squares estimate. We discuss the concept of model selection and also provide suggestions on how to handle outliers and deviations from standard assumptions. Then, we discuss modern forms of regression, such as Ridge regression, Lasso and Elastic Net, which are based on penalization terms and are particularly useful when the dimension of the variable space, p, increases. We conclude by extending the linear regression concepts to Generalized Linear Models (GLM).

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128096338203609

Research and Methods

Heining Cham, in Comprehensive Clinical Psychology (Second Edition), 2022

3.02.7.2 Regression

Regression analysis is another popular adjustment method. In regression analysis, all the measured baseline confounders are included as predictors of the outcome. This adjustment can produce the average causal effect and the average causal effect on the treated. There are three major disadvantages to this method (Schafer and Kang, 2008). The first is that correct model specification is required to produce unbiased effect estimates; as dimensionality increases, it becomes more difficult to determine the correct form of the relationships, even using regression diagnostic procedures. The second is that regression is a single-step procedure in which the equating of the confounders and the estimation of the causal effect occur simultaneously. There is therefore a danger of “fishing” for the causal effect by repeatedly modifying the regression model until the estimate achieves the desired magnitude. The third is that we can often mistakenly over-generalize the results above and beyond the regression plane, which is too complex and multidimensional to be fully understood. For these reasons, this article does not go into the details of this method.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128186978002144

Analysis and Interpretation of Multivariate Data

D.J. Bartholomew, in International Encyclopedia of Education (Third Edition), 2010

Regression Analysis

Regression analysis is the oldest and probably most widely used multivariate technique in the social sciences. Unlike the preceding methods, regression is an example of dependence analysis, in which the variables are not treated symmetrically. In regression analysis, the object is to obtain a prediction of one variable, given the values of the others. To accommodate this change of viewpoint, a different terminology and notation are used. The variable being predicted is usually denoted by y and the predictor variables by x, with subscripts added to distinguish one from another. In linear multiple regression, we look for the linear combination of the predictors (often called regressor variables) that best predicts y. For example, in educational research, we might be interested in the extent to which school performance could be predicted by home circumstances, age, or performance on a previous occasion. In practice, regression models are estimated by least squares using appropriate software. Important practical matters concern the selection of the best regressor variables, testing the significance of their coefficients, and setting confidence limits on the predictions.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947013038

Echinocandins

In Meyler's Side Effects of Drugs (Sixteenth Edition), 2016

Enzyme inducers

Regression analysis of pharmacokinetic data from patients has suggested that co-administration of caspofungin with inducers of drug metabolism and mixed inducer/inhibitors, namely carbamazepine, dexamethasone, efavirenz, nelfinavir, nevirapine, phenytoin, and rifampicin, can cause clinically important reductions in caspofungin concentrations. However, no data are currently available from formal interaction studies, and it is not known which clearance mechanisms of caspofungin are inducible. The manufacturer currently recommends considering an increase in the daily dose of caspofungin to 70 mg in patients who are taking these drugs concurrently and who are not responding [6].

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780444537171006727

Activins and Inhibins

Yogeshwar Makanji, ... David M. Robertson, in Vitamins & Hormones, 2011

E Future directions

Regression analysis showed that inhibin B was also an independent regulator of serum LH across the follicular phase of the menstrual cycle. While this relationship has been supported by in vitro studies, little evidence has been observed in vivo; this is one of the first such reports. These studies have also indicated a possible role for AMH in FSH regulation: the regression analyses show that AMH is independently and inversely correlated with FSH. There is currently little evidence supporting a pituitary role for AMH; is its role related to its stimulatory action on inhibin synthesis by the ovary, potentiating inhibin's action on the pituitary? Finally, the apparently higher biological activity of inhibin B may not translate to other biological systems where inhibin A predominates. For example, in the luteal phase of the menstrual cycle, where inhibin A is in excess of inhibin B, inhibin A may play a greater role in regulating FSH. In conclusion, the regulation of FSH by ovarian inhibins is a multi-step process in which inhibin B appears to be the major inhibin involved. However, it is apparent that structurally different forms of inhibin related to posttranslational changes, and other factors such as ovarian steroids and AMH, also contribute, although these aspects are less well defined.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123859617000147

Linear Hypothesis: Regression (Graphics)

R.D. Cook, in International Encyclopedia of the Social & Behavioral Sciences, 2001

Regression analysis is the study of how a response variable depends on one or more predictors. In regression graphics we pursue low-dimensional sufficient summary plots. These plots, which do not require a model for their construction, contain all the information on the response that is available from the predictors. They can be used to visualize dependence, to discover unexpected relationships, to guide the choice of a first model, and to check plausible models. This article covers the foundations for sufficient summary plots and how they can be estimated and used in practice. Their relationship to standard model-based graphics such as residual plots is covered as well.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0080430767004551

Ancestry Estimation

Elizabeth A. DiGangi, Joseph T. Hefner, in Research Methods in Human Skeletal Biology, 2013

Ordinal Regression

Ordinal regression analysis (ORA) measures the association of an ordinal response variable (a categorical variable with ordering—i.e., small, medium, large) with a set of predictor variables (variables used to predict the value of another variable). In traditional linear regression, the sum of squared differences between a continuous dependent variable and the weighted combination of the independent variables is minimized to obtain the regression coefficients. This is not the case when the dependent variable is ordinal. Ordinal regression calculates coefficients based on the assumption that the response variable is a categorical response with some underlying continuous distribution. In most cases, there is a valid theoretical basis for assuming this underlying distribution. However, even when this assumption is not met, the model can still theoretically produce valid results.

Rather than predicting the actual cumulative probabilities, an ORA predicts a function of those values using a process known as a link function. Simplistically, the link function links the model specified in the design matrix to the real parameters of the dataset. After initial model development, the predicted probability of each response category can be used to assign an unknown individual to a group. An ORA can be expressed as

(5.1) link(γij) = θj − [β1χi1 + β2χi2 + … + βpχip]

where link( ) is the link function for the current analysis, γij is the cumulative probability of the jth category for the ith case, θj is the threshold for the jth category, p is the number of regression coefficients, χi1…χip are the values of the predictors for the ith case, and β1… βp are the regression coefficients. One of the benefits of ORA, and a similarity of ORA to analysis of variance (ANOVA), is the ability to assess the significance of individual response variables and to test for any interaction between all response variables. For example, ORAs allow one to determine if sex, ancestry, or the interaction of sex and ancestry significantly affect the expression of inferior nasal aperture morphology.
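Outside of SPSS, a cumulative-link model of this form can be fitted, for example, with statsmodels' OrderedModel (assuming a reasonably recent statsmodels release); the sketch below uses simulated trait scores and invented effect sizes, not the data analyzed in this chapter.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated ordinal trait scores (1-3) with hypothetical sex/ancestry effects;
# these are NOT the data analyzed in the chapter
rng = np.random.default_rng(4)
n = 300
sex = rng.integers(0, 2, size=n)
ancestry = rng.integers(0, 2, size=n)
latent = 1.5 * ancestry - 0.2 * sex + rng.logistic(size=n)
trait = pd.Series(pd.cut(latent, bins=[-np.inf, 0.0, 1.5, np.inf], labels=[1, 2, 3]))

exog = pd.DataFrame({"ancestry": ancestry, "sex": sex,
                     "ancestry_x_sex": ancestry * sex})

# Cumulative-logit model: link(gamma_ij) = theta_j - x_i' beta, fitted by ML
fit = OrderedModel(trait, exog, distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())                     # coefficients with Wald-type test statistics

# Predicted category probabilities can be used to assign an unknown case to a group
print(fit.predict(exog.iloc[[0]]))       # probability of each trait score for case 0
```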

Ordinal regression analysis can be carried out using the PLUM function in SPSS®. The purpose of the ORA in ancestry research is twofold. First, as mentioned above, the ORA can be used to determine the significance of sex and ancestry, and the interaction of the two, on the expression of each morphoscopic trait. Significance is assessed at the α = 0.05 level using the Wald statistic, a measure similar to the F-value in a traditional ANOVA. Each of these parameter estimates is then assessed for significance. As an example, the ORA parameter estimates for interorbital breadth are presented in Table 5.4. Once all significant traits are determined, we can apply the ORA with all significant traits set as the predictor variables to assess ancestry for the entire sample. As Table 5.5 shows, the ORA works well, separating a sample of American Blacks and Whites (data collected by JTH) in a two-way analysis correctly nearly 90% of the time. Table 5.5 also presents the classification matrix for the two-group analysis.

TABLE 5.4. Parameter Estimates and Significance Levels for Interorbital Breadth

Ind. Variable    Estimate    Std. Error    Wald      df    Sig.
Ancestry           2.492       0.340       53.723     1    0.000
Sex               −1.113       1.250        0.792     1    0.373
Ancestry∗Sex       0.929       1.299        0.512     1    0.420

TABLE 5.5. Classification Matrix for the ORA Two-Group Analysis

          Black    White    Total    % Correct
Black       203       15      218        93.12
White        22      124      146        84.93
Total       225      139      364        89.03

χ2 = 190.709; p < 0.001

Multiway ORAs are not as successful. In a three-way analysis the ORA correctly classified approximately 70% of the sample of American Whites, American Blacks, and Amerindians (Table 5.6). As more groups are added to the model the classification rate is drastically reduced. This may be because ORAs are somewhat sensitive to sample size. Yet the method is promising and merits further scrutiny and research.

TABLE 5.6. Classification Matrix for the ORA Three-Group Analysis

              Amerindian    Black    White    Total    % Correct
Amerindian           206       46       10      262        78.63
Black                 59      130       29      218        59.63
White                 10       33      103      146        70.55
Total                275      209      142      626        69.60

χ2 = 287.765; p < 0.001

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123851895000054

Research methods, statistics and evidence-based practice

Andrew M. McIntosh, ... Stephen M. Lawrie, in Companion to Psychiatric Studies (Eighth Edition), 2010

Regression analysis

Regression analysis is the study of relationships between two or more variables and is usually conducted for the following reasons:

when we want to know whether any relationship between two or more variables actually exists;

when we are interested in understanding the nature of the relationship between two or more variables; and

when we want to predict a variable given the value of others.

In its simplest form regression analysis is very similar to correlation; in fact the underlying mathematical models are virtually identical. Regression analysis can, however, be used where there are many explanatory variables and where various data types are used together. The general regression model is:

Y = a + bX1 + cX2 + … + error

where a is a constant, X1, X2, etc. are the predictor variables, and the error term is the difference between the observed and predicted values of Y. A practical example of the above equation using the performance IQ data might take the following form:

IQ = 149 − 0.57 × duration of psychosis

The error term is omitted here and is assumed to have a mean of 0. The distances between each data point and the line of best fit summarising their relationship are called the residuals. These are the differences between the observed and predicted values and are a measure of the unexplained variation. The model can be extended to more complicated examples, e.g. predicting brain volume from the variables diagnosis, height and IQ. The equation might take the following form:

Total brain volume = 10 × diagnosis + 0.03 × height + IQ/20

Diagnosis is a categorical variable, so it makes no sense to allocate an arbitrary number to each diagnostic category, as there is no order in the categories. Instead, we have to include a number of ‘dummy variables’, each indicating the presence or absence of a diagnosis. The example above would be a suitable model when only one diagnosis is considered, as the variable diagnosis then only takes the values 1 or 0.

If we are interested in a potential interaction between two variables (e.g. we might think that IQ is related to brain volume in healthy controls but not in people with schizophrenia), we can examine this by including the diagnosis × IQ interaction as another explanatory variable in the regression equation. If we had further information about IQ we might want to include this in our regression analysis. The printout from the statistical software might look like that in Table 9.12. The table is labelled ANOVA and it shows the mean squares about the regression model (similar to the between-groups variance), the residual mean squares (unexplained variance), their ratio F and its significance. What it does not tell us is whether there is an interaction between duration of psychosis and IQ, or whether the addition of IQ to our model is better than having duration of psychosis as the only predictor variable. To test the first hypothesis, that there is an interaction between IQ and duration of psychosis, we would need to expand our model to include an interaction term.
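As an illustration of how such a model, including the dummy-coded diagnosis variable and the diagnosis × IQ interaction, might be specified in practice, the following sketch uses statsmodels' formula interface on simulated data; the variable names and effect sizes are invented and do not reproduce the chapter's example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the chapter's example; names and effect sizes are invented
rng = np.random.default_rng(5)
n = 120
df = pd.DataFrame({
    "diagnosis": rng.choice(["control", "schizophrenia"], size=n),
    "height": rng.normal(170.0, 10.0, size=n),
    "iq": rng.normal(100.0, 15.0, size=n),
})
df["brain_volume"] = (1200.0 + 0.5 * df["height"] + 0.8 * df["iq"]
                      - 30.0 * (df["diagnosis"] == "schizophrenia")
                      + rng.normal(scale=20.0, size=n))

# C(diagnosis) creates the 0/1 dummy variable automatically, and
# C(diagnosis):iq adds the diagnosis-by-IQ interaction term to the model
model = smf.ols("brain_volume ~ C(diagnosis) + height + iq + C(diagnosis):iq",
                data=df).fit()
print(model.summary())   # includes the ANOVA-style F test, R-squared and coefficients
```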

If we wanted to see which model is best (in terms of how much variance, or R2, is explained overall), we need to add or remove predictor terms to see which model fits the data best. Most statistical packages have a variety of methods for doing this. The most common methods are called forward entry, backward entry and stepwise. Forward entry is a method of regression analysis whereby the predictor variable most significantly associated with the dependent variable is included in the model first; other predictor variables are then entered into the model if they are also significantly associated with the dependent variable. Backward entry regression enters all of the terms into the regression equation first and removes successive terms if they do not predict the dependent variable. Stepwise regression is a combination of the forward and backward entry methods.
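A bare-bones forward-entry loop is sketched below on simulated data to show the idea; the entry criterion (p < 0.05 for the newly added term) and the data are assumptions made for illustration, and packaged implementations differ in their exact rules.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: only covariates 0 and 3 are truly related to the outcome
rng = np.random.default_rng(6)
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

# Forward entry: at each step, add the candidate whose term has the smallest
# p-value in the expanded model, stopping when no candidate reaches p < 0.05
selected, remaining = [], list(range(p))
while remaining:
    pvals = {}
    for j in remaining:
        Xj = sm.add_constant(X[:, selected + [j]])
        res = sm.OLS(y, Xj).fit()
        pvals[j] = float(np.asarray(res.pvalues)[-1])   # p-value of the new term
    best_j = min(pvals, key=pvals.get)
    if pvals[best_j] >= 0.05:
        break
    selected.append(best_j)
    remaining.remove(best_j)

print("variables entered, in order:", selected)
```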

The table in the regression analysis was titled ANOVA because regression and ANOVA use virtually identical underlying models. For instance, one could conduct a regression analysis in which IQ was the dependent variable and duration of psychosis was the predictor; had we done that, we would have arrived at the same answer as an ANOVA.

There are, however, limitations to multiple regression. For example, as we enter more terms into our regression analysis, it becomes more and more difficult to interpret the results. In such cases clear descriptive statistics become invaluable. Further, in the above example we have only dealt with a situation in which the dependent variable is measured on at least an interval, ratio or continuous scale. When our dependent variable is a binary outcome (e.g. dead or alive), we need to use a closely related technique called logistic regression. Other more complex models are available but are beyond the scope of this chapter (see Altman 1991 for more details).

Linear regression

Is a parametric statistical test

Tests the null hypothesis that there is no relationship between a predictor variable and a dependent variable

Uses the test statistic F to test for the significance of the regression model used

Can incorporate interaction terms

The value of F, the degrees of freedom and R2 should be stated

Can often be helpfully combined with the use of graphs or other descriptive statistics

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780702031373000097

Linear Hypothesis: Regression (Basics)

S. Weisberg, in International Encyclopedia of the Social & Behavioral Sciences, 2001

Regression analysis is the study of how a response variable depends on one or more predictors. Usually, dependence is assumed to be through the mean, and we think of the mean or regression function as describing how the mean of the response depends on the predictors. Although some regression problems can be usefully summarized using nonparametric regression methods that make a minimum of assumptions, adding a few plausible assumptions allows the use of parametric models for regression that give elegant and simple results. The important class of linear regression models is described in this article. These models are frequently used in practice and can often lead to very useful results. Topics covered include both simple and multiple regression, the basic forms of the models, interpretation of estimates and parameters, testing ideas, including comparison of groups, and diagnostic methods.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B008043076700454X

Support Vector Machine: Principles, Parameters, and Applications

Raoof Gholami, Nikoo Fakhari, in Handbook of Neural Computation, 2017

27.4.1 Classic Regression Analysis

Regression analysis is one of the most widely used statistical tools for assessing the relationship between a dependent variable (Y) and the independent variables (x1, x2, …, xn) included in a system. In this analysis, the aim is to find the decision function that best explains the variation of the target parameter in terms of the input variables; this function should have the minimum possible prediction error. To minimize the empirical risk (error), the parameter ε is defined to measure the discrepancy between the real and estimated values. The sum of the εi can then be minimized with the help of, for instance, the classical least squares method to find the best function [14]. However, classic statistical approaches often fall short of providing a very accurate estimate because they include every data point in the analysis, even those that are already very well explained by the model [34]. This can easily reduce the flexibility of the model when a few outliers are present in the input space.
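The sensitivity of a least squares fit to a single outlier can be seen in the following small simulation (an illustrative aside, not from the chapter): one grossly mis-measured point noticeably pulls the fitted slope, because every point contributes to the squared-error sum.

```python
import numpy as np

# Simulated toy example: the true relationship is y = 1 + 2x plus small noise
rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

# Least squares line on the clean data
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single observation and refit: every point enters the squared-error
# sum, so one gross outlier is enough to shift the fitted line
y_outlier = y.copy()
y_outlier[-1] += 40.0
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print("slope without outlier:", round(slope_clean, 2))
print("slope with one outlier:", round(slope_out, 2))
```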

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128113189000272

Which statistical technique predicts or explains the value of a dependent variable using the values of independent variables?

A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables. A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the explanatory variables.

What set of statistical methods is used to describe the relationship between independent variables and a dependent variable?

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them.

What statistical technique is used to make predictions of future outcomes based on present data?

In most cases, the investigators utilize regression analysis to develop their prediction models. Regression analysis is a statistical technique for determining the relationship between a single dependent (criterion) variable and one or more independent (predictor) variables.

What is explained variation in regression analysis?

The explained variation is the sum of the squared differences between each predicted y-value and the mean of y.