
tfestimators Package: Embeddings for Categorical Variables


(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

In my last posts (here and here) I explored how to use embeddings to represent categorical variables, and how to combine such embeddings with other variables to build a more complex model. Both posts focused on the Keras (R) functionality. I concluded that representing categorical variables with embeddings feels artificial in Keras; in particular, concatenating multiple input layers is quite cumbersome with the current Keras interface.

This week, I watched the official release video by J.J. Allaire and learnt about the tfestimators package. It turns out that the awesome RStudio team built a very handy interface to access TensorFlow and train models with multiple parameters and embeddings.

In this post, I will show how to use the package to quickly fit a model in which categorical variables are represented as embeddings.

As in the previous posts, I work with the NYC East River bicycle count data from Kaggle. It contains daily bicycle counts for four major bridges in NYC. In order to have a longer dataset, I use the bicycle counts for all bridges as the dependent variable.

# https://www.kaggle.com/new-york-city/nyc-east-river-bicycle-crossings
library(lubridate)  # for wday()

df <- read.csv("data/nyc-east-river-bicycle-counts.csv")

# reshape to long format: one row per bridge and day
dflong <- data.table::melt(df[c("Date", "Brooklyn.Bridge", "Manhattan.Bridge",
                                "Williamsburg.Bridge", "Queensboro.Bridge")],
                           id.vars = "Date")
dflong$date    <- as.Date(dflong$Date)
dflong$weekday <- wday(dflong$date, label = TRUE)

# add the weather information back in
dflong <- merge(dflong, df[, c("Date", "Precipitation", "Low.Temp..Â.F.")], by = "Date")

dflong$ScaledUsers <- scale(dflong$value)
dflong$lowTemp     <- scale(dflong[, "Low.Temp..Â.F."])
dflong$rain        <- ifelse(dflong$Precipitation != 0, 0, 1)

dflong$Bridge <- factor(dflong$variable)
levels(dflong$Bridge)  <- 1:length(levels(dflong$Bridge))
levels(dflong$weekday) <- 1:length(levels(dflong$weekday))

The goal of our toy model is to predict the number of bicycles per day on a given bridge, depending on the weekday, the bridge (“Brooklyn.Bridge”, “Manhattan.Bridge”, “Williamsburg.Bridge”, “Queensboro.Bridge”), whether it rains, and the temperature. So overall we have two categorical variables, one binary variable and one continuous variable.
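Before defining the model it can help to sanity-check the prepared columns. The following snippet is just an illustrative check and not part of the original post:

# illustrative check: column types before handing the data to tfestimators
str(dflong[, c("ScaledUsers", "lowTemp", "rain", "weekday", "Bridge")])
# ScaledUsers and lowTemp are scaled numeric columns, rain is a 0/1 indicator,
# and weekday and Bridge are still factors at this point -- they are converted
# to integers in the next step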

library(tfestimators)

## convert the factors to integers -- tfestimators is strict with input types
dflong$Bridge  <- as.integer(dflong$Bridge)
dflong$weekday <- as.integer(dflong$weekday)

embedding_dimension_bridges  <- 2
embedding_dimension_weekdays <- 3

cols <- feature_columns(
  column_numeric("lowTemp", "rain"),
  column_embedding(column_categorical_with_vocabulary_list("weekday", vocabulary_list = c(1:7)),
                   embedding_dimension_weekdays),
  column_embedding(column_categorical_with_vocabulary_list("Bridge", vocabulary_list = c(1:4)),
                   embedding_dimension_bridges)
)

The first step is to define the input variables and their types. Let’s start with the simple numeric variables lowTemp and rain, which are declared with column_numeric("lowTemp", "rain"). The two categorical variables that we want to embed need a bit more work: a) they need a list of all possible values (the vocabulary_list parameter), and b) we need to define the embedding dimension for each categorical variable.
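As a point of comparison (this is not from the original post), the same categorical columns could presumably be one-hot encoded instead of embedded by wrapping them in column_indicator(); a sketch:

## hypothetical alternative: one-hot (indicator) columns instead of embeddings
cols_onehot <- feature_columns(
  column_numeric("lowTemp", "rain"),
  column_indicator(column_categorical_with_vocabulary_list("weekday", vocabulary_list = c(1:7))),
  column_indicator(column_categorical_with_vocabulary_list("Bridge", vocabulary_list = c(1:4)))
)

With only seven weekdays and four bridges the indicator version would be perfectly feasible; embeddings mainly pay off when a categorical variable has many levels.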

Next, we write a short function that defines the input and output of the model, as well as the batch size and the number of epochs.

library(tfestimators)

bridge_input_fn <- function(data, num_epochs = 1) {
  tfestimators::input_fn(data,
                         features = c("lowTemp", "rain", "weekday", "Bridge"),
                         response = "ScaledUsers",
                         batch_size = 2,
                         num_epochs = num_epochs)
}

############ train and test dataset ############
indices <- sample(1:nrow(dflong), size = 0.80 * nrow(dflong))
train <- dflong[indices, ]
test  <- dflong[-indices, ]

############ define the model ############
model <- dnn_regressor(feature_columns = cols,
                       hidden_units = c(32, 10),
                       dropout = 0.15)

# train the model
history <- model %>%
  train(bridge_input_fn(train[, c("ScaledUsers", "lowTemp", "rain", "weekday", "Bridge")],
                        num_epochs = 1))

## evaluate on the test set
model %>% evaluate(bridge_input_fn(test))

In order to evaluate the model, we split the data into a training and a test set. We define the model as a deep neural network (DNN) regressor with two hidden layers (one with 32 nodes, the other with 10). Compared to the Keras version, in which one needs to concatenate the different input layers, this interface is straightforward. Finally, we check the model’s accuracy on the test set and print the learning history.

require(ggplot2)

df <- data.frame(losses = history$losses$mean_losses,
                 steps  = history$step)

ggplot(df, aes(steps, losses)) +
  geom_point() +
  geom_smooth() +
  theme_bw() +
  ylab("Loss") + xlab("Training Steps") +
  ggtitle("Testing: TF Estimators")

[Plot: training loss over the training steps, with a smoothed trend line]
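Besides the aggregate evaluation metrics, one might also want to inspect individual predictions. The following sketch is not part of the original post and assumes that predict() returns one predicted value per test row in a list column named predictions; the exact unnesting may need adjusting:

## sketch: line up test-set predictions with the observed (scaled) counts
preds <- model %>% predict(bridge_input_fn(test))
comparison <- data.frame(predicted = unlist(preds$predictions),  # assumed column name
                         observed  = as.numeric(test$ScaledUsers))
head(comparison)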

To conclude, the package is a great step forward for applying deep neural nets to everyday problems and for quickly using embeddings for categorical variables. Big kudos to the RStudio team for their efforts. If you have time left and want a quick update on deep learning for the R community, check out J.J. Allaire’s video.


To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.


