Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the right due to Alaska, California, Texas and a few others.

After a log transformation, notice the histogram is more or less symmetric. We've moved the big states closer together and spaced out the smaller states.
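Though the figures are not reproduced here, the two histograms can be recreated with R's built-in state.area vector (an assumption on my part; the article may have used a different data source):

```r
# Recreate the two histograms using R's built-in state.area vector
# (assumed here to match the state-area data described in the text)
hist(state.area)       # right-skewed: a few large states form a long tail
hist(log(state.area))  # roughly symmetric after the log transformation
```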

Why do this? One reason is to make data more "normal", or symmetric. If we're performing a statistical analysis that assumes normality, a log transformation might help us meet this assumption. Another reason is to help meet the assumption of constant variance in the context of linear modeling. Still another is to help make a non-linear relationship more linear. But while it's easy to implement a log transformation, it can complicate interpretation. Let's say we fit a linear model with a log-transformed dependent variable. How do we interpret the coefficients? What if we have log-transformed dependent and independent variables? That's the topic of this article.

First we'll provide a recipe for interpretation for those who just want some quick help. Then we'll dig a little deeper into what we're saying about our model when we log-transform our data.


Rules for interpretation

OK, you ran a regression/fit a linear model and some of your variables are log-transformed.

  1. Only the dependent/response variable is log-transformed. Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.
  2. Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02.
  3. Both dependent/response variable and independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract one, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) * 100 = 3.7 percent.
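The arithmetic in the three rules can be checked directly in R, using the example coefficient of 0.198 from the rules above:

```r
b <- 0.198           # the example coefficient used in the rules above
(exp(b) - 1) * 100   # rule 1: ~21.9% change in y per one-unit change in x
b / 100              # rule 2: ~0.002 unit change in y per 1% change in x
b * log(1.10)        # rule 2: ~0.019 unit change in y per 10% change in x
(1.20^b - 1) * 100   # rule 3: ~3.7% change in y per 20% change in x
```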

What Log Transformations Really Mean for your Models

It's nice to know how to correctly interpret coefficients for log-transformed data, but it's important to know what exactly your model is implying when it includes log-transformed data. To get a better understanding, let's use R to simulate some data that will require log transformations for a correct analysis. We'll keep it simple with one independent variable and normally distributed errors. First we'll look at a log-transformed dependent variable.

x <- seq(0.1, 5, length.out = 100)
set.seed(1)
e <- rnorm(100, mean = 0, sd = 0.2)

The first line generates a sequence of 100 values from 0.1 to 5 and assigns it to x. The next line sets the random number generator seed to 1. If you do the same, you'll get the same randomly generated data that we got when you run the next line. The code rnorm(100, mean = 0, sd = 0.2) generates 100 values from a Normal distribution with a mean of 0 and standard deviation of 0.2. This will be our "error". This is one of the assumptions of simple linear regression: our data can be modeled with a straight line but will be off by some random amount that we assume comes from a Normal distribution with mean 0 and some standard deviation. We assign our error to e.

Now we're ready to create our log-transformed dependent variable. We pick an intercept (1.2) and a slope (0.2), which we multiply by x, then add our random error, e. Finally we exponentiate.

y <- exp(1.2 + 0.2 * x + e)

To see why we exponentiate, notice the following:

$$\text{log}(y) = \beta_0 + \beta_1x$$
$$\text{exp}(\text{log}(y)) = \text{exp}(\beta_0 + \beta_1x)$$
$$y = \text{exp}(\beta_0 + \beta_1x)$$

So a log-transformed dependent variable implies our simple linear model has been exponentiated. Recall from the product rule of exponents that we can re-write the last line above as

$$y = \text{exp}(\beta_0) \text{exp}(\beta_1x)$$

This further implies that our independent variable has a multiplicative relationship with our dependent variable instead of the usual additive relationship. Hence the need to express the effect of a one-unit change in x on y as a percent.
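One way to see the multiplicative relationship: under this model, adding 1 to x always multiplies y by the constant factor exp(β1), wherever x starts. A quick sketch with illustrative parameter values:

```r
# With y = exp(b0) * exp(b1 * x), a one-unit increase in x multiplies y
# by exp(b1) regardless of the starting value of x
b0 <- 1.2; b1 <- 0.2          # illustrative values
f <- function(x) exp(b0) * exp(b1 * x)
f(3) / f(2)                   # exp(0.2), about 1.22, i.e. a 22% increase
f(8) / f(7)                   # the same factor again
```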

If we fit the correct model to the data, notice we do a pretty good job of recovering the true parameter values that we used to generate the data.

lm1 <- lm(log(y) ~ x)
summary(lm1)

Call:
lm(formula = log(y) ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4680 -0.1212  0.0031  0.1170  0.4595 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.22643    0.03693   33.20   <2e-16 ***
x            0.19818    0.01264   15.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1805 on 98 degrees of freedom
Multiple R-squared:  0.7151,	Adjusted R-squared:  0.7122 
F-statistic:   246 on 1 and 98 DF,  p-value: < 2.2e-16

The estimated intercept of 1.226 is close to the true value of 1.2. The estimated slope of 0.198 is very close to the true value of 0.2. Finally the estimated residual standard error of 0.1805 is not too far from the true value of 0.2.

Recall that to interpret the slope value we need to exponentiate it.

exp(coef(lm1)["x"])
       x 
1.219179

This says that for every one-unit increase in x, y is multiplied by about 1.22. Or in other words, for every one-unit increase in x, y increases by about 22%. To get 22%, subtract 1 and multiply by 100.

(exp(coef(lm1)["x"]) - 1) * 100
       x 
21.91786

What if we fit just y instead of log(y)? How might we figure out that we should consider a log transformation? Simply looking at the coefficients isn't going to tell you much.

lm2 <- lm(y ~ x)
summary(lm2)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3868 -0.6886 -0.1060  0.5298  3.3383 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00947    0.23643   12.73   <2e-16 ***
x            1.16277    0.08089   14.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.156 on 98 degrees of freedom
Multiple R-squared:  0.6783,	Adjusted R-squared:  0.675 
F-statistic: 206.6 on 1 and 98 DF,  p-value: < 2.2e-16

Sure, since we generated the data, we can see the coefficients are way off and the residual standard error is much too high. But in real life you won't know this! This is why we do regression diagnostics. A key assumption to check is constant variance of the errors. We can do this with a Scale-Location plot. Here's the plot for the model we just ran without log transforming y.

plot(lm2, which = 3) # 3 = Scale-Location plot

Notice the standardized residuals are trending upward. This is a sign that the constant variance assumption has been violated. Compare this plot to the same plot for the correct model.
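The code for that comparison plot isn't shown in the text; presumably it is the same diagnostic call applied to the correct model fit earlier in the session:

```r
plot(lm1, which = 3)  # Scale-Location plot for the correct model, lm1
```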

The trend line is even and the residuals are uniformly scattered.

Does this mean that you should always log-transform your dependent variable if you suspect the constant-variance assumption has been violated? Not necessarily. The non-constant variance may be due to other misspecifications in your model. Also think about what modeling a log-transformed dependent variable means. It says it has a multiplicative relationship with the predictors. Does that seem right? Use your judgment and subject expertise.

Now let's consider data with a log-transformed independent predictor variable. This is easier to generate. We simply log-transform x.

y <- 1.2 + 0.2 * log(x) + e

Once again we first fit the correct model and notice it does a great job of recovering the true values we used to generate the data:

lm3 <- lm(y ~ log(x))
summary(lm3)

Call:
lm(formula = y ~ log(x))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46492 -0.12063  0.00112  0.11661  0.45864 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.22192    0.02308  52.938  < 2e-16 ***
log(x)       0.19979    0.02119   9.427 2.12e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1806 on 98 degrees of freedom
Multiple R-squared:  0.4756,	Adjusted R-squared:  0.4702 
F-statistic: 88.87 on 1 and 98 DF,  p-value: 2.121e-15

To interpret the slope coefficient we divide it by 100.

coef(lm3)["log(x)"]/100
     log(x) 
0.001997892

This tells us that a 1% increase in x increases the dependent variable by about 0.002. Why does it tell us this? Let's do some math. Below we calculate the change in y when changing x from 1 to 1.01 (ie, a 1% increase).

$$(\beta_0 + \beta_1\text{log}(1.01)) - (\beta_0 + \beta_1\text{log}(1))$$
$$\beta_1\text{log}(1.01) - \beta_1\text{log}(1)$$
$$\beta_1(\text{log}(1.01) - \text{log}(1))$$
$$\beta_1\text{log}\frac{1.01}{1} = \beta_1\text{log}(1.01)$$

The result is multiplying the slope coefficient by log(1.01), which is approximately equal to 0.01, or \(\frac{1}{100}\). Hence the interpretation that a 1% increase in x increases the dependent variable by the coefficient/100.
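We can verify the approximation driving this rule in R:

```r
log(1.01)          # 0.00995, approximately 1/100
0.198 * log(1.01)  # ~0.00197, essentially the coefficient divided by 100
```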

Once again let's fit the wrong model by failing to specify a log transformation for x in the model syntax.
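The fit itself isn't shown in the text; presumably it was along these lines, with the name lm4 taken from the crPlot call below:

```r
lm4 <- lm(y ~ x)  # wrong model: x enters without the log transformation
summary(lm4)
```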

Viewing a summary of the model will reveal that the estimates of the coefficients are well off from the true values. But in practice we never know the true values. Once again diagnostics are in order to assess model adequacy. A useful diagnostic in this case is a partial-residual plot which can reveal departures from linearity. Recall that linear models assume that predictors are additive and have a linear relationship with the response variable. The car package provides the crPlot function for quickly creating partial-residual plots. Just give it the model object and specify which variable you want to create the partial-residual plot for.

library(car)
crPlot(lm4, variable = "x")

The straight line represents the specified relationship between x and y. The curved line is a smooth trend line that summarizes the observed relationship between x and y. We can tell the observed relationship is non-linear. Compare this plot to the partial-residual plot for the correct model.

crPlot(lm3, variable = "log(x)")

The smooth and fitted lines are right on top of one another, revealing no serious departures from linearity.

This does not mean that if you see departures from linearity you should immediately assume a log transformation is the one and only fix! The non-linear relationship may be complex and not so easily explained with a simple transformation. But a log transformation may be suitable in such cases and is certainly something to consider.

Finally let's consider data where both the dependent and independent variables are log transformed.

y <- exp(1.2 + 0.2 * log(x) + e)

Look closely at the code above. The relationship between x and y is now both multiplicative and non-linear!

As usual we can fit the correct model and notice that it does a fantastic job of recovering the true values we used to generate the data:

lm5 <- lm(log(y) ~ log(x))
summary(lm5)

Call:
lm(formula = log(y) ~ log(x))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46492 -0.12063  0.00112  0.11661  0.45864 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.22192    0.02308  52.938  < 2e-16 ***
log(x)       0.19979    0.02119   9.427 2.12e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1806 on 98 degrees of freedom
Multiple R-squared:  0.4756,	Adjusted R-squared:  0.4702 
F-statistic: 88.87 on 1 and 98 DF,  p-value: 2.121e-15

Interpret the x coefficient as the percent increase in y for every 1% increase in x. In this case that's about a 0.2% increase in y for every 1% increase in x.
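Rule 3 from the recipe can be checked against this fit: raising 1.01 to the power of the coefficient gives the multiplicative change in y for a 1% increase in x.

```r
b <- 0.19979        # the log(x) coefficient estimated above
(1.01^b - 1) * 100  # ~0.199: a 1% increase in x raises y by about 0.2%
```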

Fitting the wrong model once again produces coefficient and residual standard error estimates that are wildly off target.

lm6 <- lm(y ~ x)
summary(lm6)

The Scale-Location and Partial-Residual plots provide evidence that something is amiss with our model. The Scale-Location plot shows a curving trend line and the Partial-Residual plot shows linear and smooth lines that fail to match.
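As before, the code for the Scale-Location plot isn't shown; presumably it comes from the same diagnostic call used earlier, applied to this wrong model:

```r
plot(lm6, which = 3)  # Scale-Location plot for the wrong model, lm6
```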

crPlot(lm6, variable = "x")          

How would we know in real life that the correct model requires log-transformed independent and dependent variables? We wouldn't. We might have a hunch based on diagnostic plots and modeling experience. Or we might have some subject matter expertise on the process we're modeling and have good reason to think the relationship is multiplicative and non-linear.

Hopefully you now have a better handle on not only how to interpret log-transformed variables in a linear model but also what log-transformed variables mean for your model.

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay Ford
Statistical Enquiry Consultant
University of Virginia Library
August 17, 2018


Source: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/
