Generalized Additive Models (GAMs) incorporate non-linear forms of the predictors and are useful when the relationship between the response variable and the predictors is not linear. A GAM does not force each predictor into a fixed polynomial form, as polynomial regression does; instead it fits a smooth curve. The data we use here describe the biocapacity of different countries.
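As a minimal sketch of that difference (on simulated data, not the biocapacity dataset used below), a penalized smooth adapts to the local shape of the data while a global quadratic cannot:

```r
library(mgcv)

# Simulated non-linear relationship
set.seed(123)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x, y)

fit_poly <- lm(y ~ poly(x, 2), data = dat)   # global quadratic fit
fit_gam  <- gam(y ~ s(x), data = dat)        # penalized smooth fit

# The smooth tracks the sine shape far better than the quadratic
c(poly_r2 = summary(fit_poly)$r.squared,
  gam_r2  = summary(fit_gam)$r.sq)
```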
library(psych)

eco <- read.csv("C:/07 - R Website/dataset/ML/biocap.csv")

pairs.panels(eco,
             method = "pearson",    # correlation method
             hist.col = "#00AFBB",
             density = TRUE,        # show density plots
             ellipses = FALSE)      # show correlation ellipses
From the scatterplot above, we can see a fairly curved relationship between the Human Development Index (HDI) and Gross Domestic Product (GDP). Now we can try to build a Generalized Additive Model with BiocapacityT as the response variable.
library(mgcv)

# GAM model
mod_lm <- gam(BiocapacityT ~ Population + HDI + Grazing.Footprint +
                Carbon.Footprint + Cropland + Forest.Land +
                Urban.Land + GDP,
              data = eco)
summary(mod_lm)
Family: gaussian 
Link function: identity 

Formula:
BiocapacityT ~ Population + HDI + Grazing.Footprint + Carbon.Footprint + 
    Cropland + Forest.Land + Urban.Land + GDP

Parametric coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        5.356e-01  5.065e-01   1.058    0.292    
Population        -3.230e-04  5.959e-04  -0.542    0.589    
HDI               -8.647e-01  8.646e-01  -1.000    0.319    
Grazing.Footprint  2.206e+00  2.535e-01   8.703 4.82e-15 ***
Carbon.Footprint   1.611e-02  9.163e-02   0.176    0.861    
Cropland           1.764e+00  1.496e-01  11.797  < 2e-16 ***
Forest.Land        1.098e+00  1.105e-02  99.364  < 2e-16 ***
Urban.Land        -2.958e+00  1.977e+00  -1.496    0.137    
GDP                6.233e-06  8.969e-06   0.695    0.488    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.985   Deviance explained = 98.6%
GCV = 1.3504   Scale est. = 1.2754   n = 162
From the summary results above, we can see that GDP is not statistically significant. One aspect to watch with GAMs is concurvity, the generalization of collinearity to the GAM setting. It refers to the situation where a smooth term can be approximated by some combination of the other smooth terms, which can lead to unstable estimates. Just as with collinearity, concurvity causes problems of interpretation, so we have to drop one of the variables involved in strong concurvity.
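As a sketch of how concurvity shows up in practice (using simulated data, since the biocap.csv file is not included here), mgcv's concurvity() flags smooth terms that can be approximated by other smooths:

```r
library(mgcv)

# Simulated data: x2 is nearly a deterministic function of x1,
# which induces strong concurvity between s(x1) and s(x2)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- x1^2 + rnorm(n, sd = 0.05)
y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.2)
dat <- data.frame(y, x1, x2)

m <- gam(y ~ s(x1) + s(x2), data = dat)

# Values near 1 in the "worst" row signal strong concurvity
round(concurvity(m, full = TRUE), 2)
```

In a case like this, dropping either s(x1) or s(x2) removes the problem.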
# Instead of separate splines, specify a tensor product smooth
mod_gam3 <- gam(BiocapacityT ~ te(Grazing.Footprint, Cropland, Forest.Land),
                data = eco)
concurvity(mod_gam3)
                 para te(Grazing.Footprint,Cropland,Forest.Land)
worst    2.123046e-16                               8.161468e-15
observed 2.123046e-16                               3.681739e-33
estimate 2.123046e-16                               8.832898e-34
vis.gam(mod_gam3, type = 'response', plot.type = 'persp',
        phi = 30, theta = 30, n.grid = 500, border = NA)
The model mod_gam3 uses a tensor product smooth for the predictors. Looking at the concurvity, in the worst-case scenario Grazing.Footprint and Forest.Land have a strong relationship, but the observed estimate shows that the correlation is not very strong. In the 3D visualization there is a portion of the surface that comes down (in red), so we can conclude, to some extent, that the predictors do not contribute to the biocapacity there, and we have to look at our data in more detail.
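One way to look at the data in more detail is mgcv's built-in diagnostics. A sketch on simulated data (since biocap.csv is not included here; variable names mirror the real dataset, and k is reduced to keep the tensor smooth small):

```r
library(mgcv)

# Simulated stand-in for the biocapacity data
set.seed(42)
n   <- 150
dat <- data.frame(Grazing.Footprint = runif(n),
                  Cropland          = runif(n),
                  Forest.Land       = runif(n))
dat$BiocapacityT <- with(dat, Grazing.Footprint + 2 * Cropland + Forest.Land) +
  rnorm(n, sd = 0.1)

m <- gam(BiocapacityT ~ te(Grazing.Footprint, Cropland, Forest.Land, k = 3),
         data = dat)

# gam.check() plots residual diagnostics and prints convergence
# information plus basis-dimension (k) checks for each smooth
gam.check(m)
```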
Now, we try to use the R library caret to tune the model. By default it uses smoothing splines to model the relationship between the response variable and the independent variables. We use leave-one-out cross-validation (LOOCV) on the training set, and GCV as the smoothing-parameter selection criterion.
# Fit a GAM through caret with LOOCV
library(caret)

b <- train(BiocapacityT ~ ., data = eco,
           method = "gam",
           # use leave-one-out cross-validation
           trControl = trainControl(method = "LOOCV"),
           tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))
summary(b$finalModel)
Family: gaussian 
Link function: identity 

Formula:
.outcome ~ s(Urban.Land) + s(HDI) + s(Grazing.Footprint) + s(Cropland) + 
    s(Forest.Land) + s(Carbon.Footprint) + s(Population) + s(GDP)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.61185    0.06785   53.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                       edf Ref.df        F p-value    
s(Urban.Land)        1.000  1.000    0.000   1.000    
s(HDI)               1.598  1.993    1.164   0.320    
s(Grazing.Footprint) 8.306  8.827   16.729  <2e-16 ***
s(Cropland)          3.270  3.934   59.088  <2e-16 ***
s(Forest.Land)       4.457  4.997 3271.685  <2e-16 ***
s(Carbon.Footprint)  1.000  1.000    0.036   0.850    
s(Population)        1.479  1.768    0.421   0.524    
s(GDP)               2.104  2.583    1.044   0.336    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.991   Deviance explained = 99.2%
GCV = 0.87676   Scale est. = 0.74572   n = 162
From the results above we found the most significant predictors, and we reached an adjusted R-squared of 99%, as before. We can now use the caret library again to deal with concurvity and collinearity.
Regarding the GCV smoothing criterion: both REML and GCV try to do the same thing. It has been shown that GCV selects optimal smoothing parameters (in the sense of low prediction error) as the sample size tends to infinity. At smaller (finite) sample sizes, however, the GCV criterion can develop multiple minima, making optimisation difficult, and it therefore tends to give more variable estimates of the smoothing parameter. Because GCV is prone to undersmoothing at finite sample sizes, we can end up fitting models that are more wiggly than we want, so it is often considered best to switch to REML by default to avoid potential overfitting and highly variable smoothing-parameter estimates.
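In mgcv, switching criteria only requires the method argument of gam(). A sketch on simulated data (biocap.csv is not included here):

```r
library(mgcv)

# Simulated noisy non-linear signal
set.seed(7)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x, y)

m_gcv  <- gam(y ~ s(x), data = dat, method = "GCV.Cp")  # GCV/Mallows' Cp
m_reml <- gam(y ~ s(x), data = dat, method = "REML")    # restricted ML

# Compare effective degrees of freedom; on noisy data REML tends
# to select a smoother (lower-edf) fit than GCV
c(GCV = sum(m_gcv$edf), REML = sum(m_reml$edf))
```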