When developing risk models with hundreds of candidate variables, we often find that risk characteristics or macroeconomic indicators are highly correlated, a problem known as multicollinearity. In such cases, we might have to drop variables with high variance inflation factors (VIFs) or employ "variable shrinkage" methods, e.g. lasso or ridge regression, to suppress collinear variables.
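As a quick illustration of how a VIF flags collinearity, here is a minimal self-contained sketch on synthetic data (not the credit data used below). The VIF of each predictor is computed by hand as 1 / (1 - R²) from a regression of that predictor on the others, rather than via a package function:

```r
# Synthetic data: x2 is nearly a copy of x1, x3 is independent
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
X <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on all the others, then 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))  # x1 and x2 show very large VIFs; x3 stays near 1
```

In practice the same numbers come from `car::vif()` on a fitted model; predictors with VIFs above a chosen cutoff (often 5 or 10) are candidates for removal.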
Feature-extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns about model interpretability and implementation. The example below shows that there shouldn't be any hurdle in model implementation, e.g. scoring, given that coefficients can be extracted from a GPLS model in the same way as from a GLM. In addition, compared with a GLM using 8 variables, a GPLS with only 5 components provides comparable performance on the hold-out testing data.
R Code
library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))
Output
### ROC OF GLM PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m1, newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response"))

Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m2, newdata = train)$predicted[, 1])

Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363
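Because the GPLS coefficient vector has the same shape as a GLM's, scoring an implemented model reduces to the usual logistic formula plogis(b0 + sum(b * x)). The sketch below applies the GPLS estimates printed above to a hypothetical applicant record (the input values are made up purely for illustration):

```r
# GPLS coefficients copied from the comparison table above
b <- c(`(Intercept)` = -0.1954618828, AGE = -0.0147883358,
       ACADMOS = 0.0003671781, ADEPCNT = 0.1352491711,
       MAJORDRG = 0.0813835741, MINORDRG = 0.2547176301,
       OWNRENT = -0.1032119571, INCOME = -0.0004531543,
       LOGSPEND = -0.1525963363)

# Hypothetical applicant record (values invented for illustration)
x <- c(AGE = 30, ACADMOS = 60, ADEPCNT = 1, MAJORDRG = 0,
       MINORDRG = 1, OWNRENT = 1, INCOME = 2500, LOGSPEND = 6)

# Linear predictor and predicted probability of default
eta  <- b["(Intercept)"] + sum(b[names(x)] * x)
prob <- unname(plogis(eta))  # 1 / (1 + exp(-eta))
print(prob)
```

This is exactly the same arithmetic a production scoring engine would run for a GLM, which is the point of the article: once the coefficients are extracted, GPLS imposes no extra implementation burden.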