When developing risk models with hundreds of candidate variables, we often find that risk characteristics or macroeconomic indicators are highly correlated, a problem known as multicollinearity. In such cases, we might have to drop variables with high variance inflation factors (VIFs) or employ "variable shrinkage" methods, e.g. lasso or ridge regression, to suppress collinear variables.
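As a quick illustration of how a VIF flags collinearity, here is a minimal self-contained sketch on synthetic data (not the credit data used below). The VIF of each predictor is computed by hand as 1 / (1 - R²) from a regression of that predictor on the others, rather than via a package function:

```r
# Synthetic data: x2 is nearly a copy of x1, x3 is independent
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
X <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on all the others, then 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))  # x1 and x2 show very large VIFs; x3 stays near 1
```

In practice the same numbers come from `car::vif()` on a fitted model; predictors with VIFs above a chosen cutoff (often 5 or 10) are candidates for removal.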
Feature-extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns about model interpretability and implementation. The example below shows that there shouldn't be any hurdle in model implementation, e.g. scoring, given that coefficients can be extracted from a GPLS model in the same way as from a GLM. In addition, compared with a GLM using 8 variables, a GPLS with only 5 components provides comparable performance on the hold-out testing data.
R Code
library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))
Output
### ROC OF GLM PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m1, newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response"))

Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m2, newdata = train)$predicted[, 1])

Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363
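Because the GPLS coefficient vector has the same shape as a GLM's, scoring an implemented model reduces to the usual logistic formula plogis(b0 + sum(b * x)). The sketch below applies the GPLS estimates printed above to a hypothetical applicant record (the input values are made up purely for illustration):

```r
# GPLS coefficients copied from the comparison table above
b <- c(`(Intercept)` = -0.1954618828, AGE = -0.0147883358,
       ACADMOS = 0.0003671781, ADEPCNT = 0.1352491711,
       MAJORDRG = 0.0813835741, MINORDRG = 0.2547176301,
       OWNRENT = -0.1032119571, INCOME = -0.0004531543,
       LOGSPEND = -0.1525963363)

# Hypothetical applicant record (values invented for illustration)
x <- c(AGE = 30, ACADMOS = 60, ADEPCNT = 1, MAJORDRG = 0,
       MINORDRG = 1, OWNRENT = 1, INCOME = 2500, LOGSPEND = 6)

# Linear predictor and predicted probability of default
eta  <- b["(Intercept)"] + sum(b[names(x)] * x)
prob <- unname(plogis(eta))  # 1 / (1 + exp(-eta))
print(prob)
```

This is exactly the same arithmetic a production scoring engine would run for a GLM, which is the point of the article: once the coefficients are extracted, GPLS imposes no extra implementation burden.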