
Reproduce economic indicators from ‘The Economist’


[This article was first published on Macroeconomic Observatory - R, and kindly contributed to R-bloggers.]

Economic data (% change on year ago)

Country | GDP latest | GDP quarter* | GDP 2019† | GDP 2020† | Industrial production latest | Consumer prices latest | Consumer prices 2019† | Unemployment rate (%) latest
United States | +2.1 Q3 | +2.1 | +2.4 | +2.1 | -1.1 Oct | +1.8 Oct | +1.8 | +3.6 Oct
China | +0.1 Q3 | +0.2 | +6.1 | +5.8 | +5.4 Nov | +3.8 Oct | +2.3 | +3.6 Q3
Japan | +1.4 Q3 | +0.2 | +0.9 | +0.5 | -6.0 Oct | +0.2 Oct | +1.0 | +2.4 Oct
Britain | +1.0 Q3 | +1.2 | +1.2 | +1.4 | -1.4 Sep | +1.5 Oct | +1.8 | +3.8 Aug
Canada | +1.7 Q3 | +1.3 | +1.5 | +1.8 | -1.8 Sep | +1.9 Oct | +2.0 | +5.5 Oct
Euro area | +1.2 Q3 | +0.9 | +1.2 | +1.4 | -1.9 Sep | +0.7 Oct | +1.2 | +7.5 Oct
Austria | +1.5 Q3 | +0.5 | +1.6 | +1.7 | +0.3 Aug | +1.1 Oct | +1.5 | +4.6 Oct
Belgium | +1.6 Q3 | +1.7 | +1.2 | +1.3 | +6.0 Sep | +0.4 Nov | +1.5 | +5.6 Oct
France | +1.4 Q3 | +1.1 | +1.2 | +1.3 | +0.1 Sep | +0.8 Oct | +1.2 | +8.5 Oct
Germany | +0.5 Q3 | +0.3 | +0.5 | +1.2 | -5.1 Sep | +1.1 Oct | +1.5 | +3.1 Oct
Greece | +1.9 Q2 | +3.4 | +2.0 | +2.2 | +0.8 Sep | -0.7 Oct | +0.6 | +16.7 Aug
Italy | +0.3 Q3 | +0.2 | +0.0 | +0.5 | -2.1 Sep | +0.2 Oct | +0.7 | +9.7 Oct
Netherlands | +1.8 Q3 | +1.8 | +1.8 | +1.6 | +0.3 Sep | +2.7 Oct | +2.5 | +3.5 Oct
Spain | +2.0 Q3 | +1.7 | +2.2 | +1.8 | +0.8 Sep | +0.1 Oct | +0.7 | +14.2 Oct
Czech Republic | +2.5 Q3 | +1.5 | +2.5 | +2.6 | +0.1 Aug | +2.7 Oct | +2.6 | +2.2 Oct
Denmark | +2.2 Q3 | +1.3 | +1.7 | +1.9 | +4.2 Sep | +0.6 Oct | +1.3 | +5.3 Oct
Norway | +0.6 Q3 | +0.1 | +1.9 | +2.4 | -8.0 Sep | +1.8 Oct | +2.3 | +3.9 Sep
Poland | +4.0 Q3 | +5.3 | +4.0 | +3.1 | +3.5 Oct | +2.5 Oct | +2.4 | +3.2 Oct
Russia | +0.8 Q2 | +0.6 | +1.1 | +1.9 | +2.6 Sep | +3.8 Oct | +4.7 | +4.6 Q3
Sweden | +1.7 Q3 | +1.1 | +0.9 | +1.5 | +1.6 Sep | +1.6 Oct | +1.7 | +6.6 Oct
Switzerland | +1.0 Q3 | +1.6 | +0.8 | +1.3 | +5.4 Q4 | -0.3 Oct | +0.6 | +2.4 May
Turkey | +0.5 Q3 | +1.7 | +0.2 | +3.0 | +2.8 Sep | +8.6 Oct | +15.7 | +14.2 Aug
Australia | +1.7 Q3 | +1.8 | +1.7 | +2.3 | +2.7 Q3 | +1.7 Q3 | +1.6 | +5.3 Oct
Hong Kong | +0.5 Q2 | -1.7 | +0.3 | +1.5 | +0.4 Q2 | +3.3 Sep | +3.0 | +2.8 Q1
India | +4.7 Q3 | +4.3 | +6.1 | +7.0 | +2.6 Dec | +7.6 Oct | +3.4 | +2.6 Year
Indonesia | +5.1 Q3 | +5.0 | +5.0 | +5.1 | -3.7 Apr | +3.0 Nov | +3.2 | +4.6 Q3
Malaysia | +4.7 Q4 | +14.7 | +4.5 | +4.4 | +3.1 Mar | +1.5 Aug | +1.0 | +3.3 Q1
Pakistan | +5.5 Year | NA | +3.3 | +2.4 | -7.0 Aug | +8.2 Feb | +7.3 | +4.4 Q2
Philippines | +6.1 Q4 | +6.4 | +5.7 | +6.2 | -8.2 Jul | +3.3 Mar | +2.5 | +2.2 Q4
Singapore | +3.9 Q2 | +7.8 | +0.5 | +1.0 | +0.1 Sep | +0.4 Oct | +0.7 | +3.0 Q1
South Korea | +2.0 Q3 | +1.6 | +2.0 | +2.2 | -2.6 Oct | -0.4 Sep | +0.5 | +3.5 Oct
Taiwan | +2.9 Q3 | NA | +2.0 | +1.9 | NA | +0.4 Q3 | +0.8 | +3.7 Q2
Thailand | +2.3 Q2 | +2.4 | +2.9 | +3.0 | -1.2 Q1 | +0.1 Oct | +0.9 | +0.7 Q4
Argentina | -0.0 Q2 | -0.0 | -3.1 | -1.3 | +4.4 Q3§ | +50.5 Oct | +54.4 | +9.8 Q1
Brazil | +1.2 Q3 | +2.5 | +0.9 | +2.0 | +0.6 Sep | +2.5 Oct | +3.8 | +8.0 Nov
Chile | +2.8 Q3 | +3.0 | +2.5 | +3.0 | -3.7 Oct | +2.7 Oct | +2.2 | +6.9 Aug
Colombia | +3.3 Q3 | +2.3 | +3.4 | +3.6 | -1.1 Dec | +3.9 Oct | +3.6 | +10.7 Sep
Mexico | -0.2 Q3 | +0.1 | +0.4 | +1.3 | -2.9 Jun | +3.0 Oct | +3.8 | +3.6 Oct
Peru | +2.1 Q1§ | -16.9 | +2.6 | +3.6 | +20.3 Apr | +2.2 Mar | +2.2 | +6.2 Q2
Egypt | +5.3 Year | NA | +5.5 | +5.9 | +6.2 Mar | +15.7 Nov | +13.9 | +11.8 Q4§
Israel | +3.3 Q3 | +4.1 | +3.1 | +3.1 | +4.4 Sep | +0.4 Oct | +1.0 | +3.7 Sep
Saudi Arabia | +0.5 Q2 | -10.4 | +0.2 | +2.2 | +1.6 Q3§ | -0.3 Oct | -1.1 | +6.0 Year
South Africa | +1.0 Q2 | +3.1 | +0.7 | +1.1 | +1.3 Aug | +3.7 Oct | +4.4 | +28.8 Q3
Source: DBnomics (Eurostat, ILO, IMF, OECD and national sources). Click on the figures in the `latest` columns to see the full time series.
* % change on previous quarter, annual rate   † IMF estimation/forecast   ‡ 2018   § 2017

The aim of this blog post is to reproduce part of the economic indicators table from ‘The Economist’ using only free tools. We take the data directly from DBnomics, whose API can be accessed from R with the rdbnomics package. All the following code is written in R, thanks to the R Core Team (2016) and the RStudio Team (2016). To update the table, just download the code here and re-run it.
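As a minimal warm-up before the full script, the sketch below fetches a single series from DBnomics with rdbnomics. The identifier is the OECD quarterly GDP series for China that is also used later in the post; glimpse() is only used here to inspect the result.

library(rdbnomics)
library(dplyr)

# fetch one series by its full DBnomics identifier (provider/dataset/series)
china_gdp <- rdb(ids = "OECD/MEI/CHN.NAEXCP01.STSA.Q")

# each row is one observation, with period, value and metadata columns
glimpse(china_gdp)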

if (!"pacman"%in%installed.packages()[,"Package"])install.packages("pacman",repos='http://cran.r-project.org')pacman::p_load(tidyverse,rdbnomics,magrittr,zoo,lubridate,knitr,kableExtra,formattable)opts_chunk$set(fig.align="center",message=FALSE,warning=FALSE)currentyear<-year(Sys.Date())lastyear<-currentyear-1beforelastyear<-currentyear-2CountryList<-c("United States","China","Japan","Britain","Canada","Euro area","Austria","Belgium","France","Germany","Greece","Italy","Netherlands","Spain","Czech Republic","Denmark","Norway","Poland","Russia","Sweden","Switzerland","Turkey","Australia","Hong Kong","India","Indonesia","Malaysia","Pakistan","Philippines","Singapore","South Korea","Taiwan","Thailand","Argentina","Brazil","Chile","Colombia","Mexico","Peru","Egypt","Israel","Saudi Arabia","South Africa")

Download

gdp<-rdb("OECD","MEI",ids=".NAEXKP01.GPSA+GYSA.Q")hongkong_philippines_thailand_gdp<-rdb("IMF","IFS",mask="Q.HK+PH+TH.NGDP_R_PC_CP_A_SA_PT+NGDP_R_PC_PP_SA_PT")%>%rename(Country=`Reference Area`)%>%mutate(Country=case_when(Country=="Hong Kong, China"~"Hong Kong",TRUE~Country),MEASURE=case_when(INDICATOR=="NGDP_R_PC_CP_A_SA_PT"~"GYSA",INDICATOR=="NGDP_R_PC_PP_SA_PT"~"GPSA"))malaysia_peru_saudi_singapore_gdp<-rdb("IMF","IFS",mask="Q.MY+PE+SA+SG.NGDP_R_PC_CP_A_PT+NGDP_R_PC_PP_PT")%>%rename(Country=`Reference Area`)%>%mutate(MEASURE=case_when(INDICATOR=="NGDP_R_PC_CP_A_PT"~"GYSA",INDICATOR=="NGDP_R_PC_PP_PT"~"GPSA"))taiwan_gdp<-rdb("BI/TABEL9_1/17.Q")%>%mutate(Country="Taiwan",MEASURE="GYSA")egypt_pakistan_gdp<-rdb("IMF","WEO",mask="EGY+PAK.NGDP_RPCH")%>%rename(Country=`WEO Country`)%>%mutate(MEASURE="GYSA")%>%filter(year(period)<currentyear)china_gdp_level<-rdb(ids="OECD/MEI/CHN.NAEXCP01.STSA.Q")gdp_qoq_china<-china_gdp_level%>%arrange(period)%>%mutate(value=value/lag(value)-1,MEASURE="GPSA")gdp_yoy_china<-china_gdp_level%>%arrange(period)%>%mutate(quarter=quarter(period))%>%group_by(quarter)%>%mutate(value=value/lag(value)-1,MEASURE="GYSA")argentina_gdp_level<-rdb(ids="Eurostat/naidq_10_gdp/Q.SCA.KP_I10.B1GQ.AR")%>%rename(Country=`Geopolitical entity (reporting)`)gdp_qoq_argentina<-argentina_gdp_level%>%arrange(period)%>%mutate(value=value/lag(value)-1,MEASURE="GPSA")gdp_yoy_argentina<-argentina_gdp_level%>%arrange(period)%>%mutate(quarter=quarter(period))%>%group_by(quarter)%>%mutate(value=value/lag(value)-1,MEASURE="GYSA")gdp<-bind_rows(gdp,hongkong_philippines_thailand_gdp,malaysia_peru_saudi_singapore_gdp,taiwan_gdp,egypt_pakistan_gdp,gdp_yoy_china,gdp_qoq_china,gdp_yoy_argentina,gdp_qoq_argentina)indprod<-rdb("OECD","MEI",ids=".PRINTO01.GYSA.M")australia_swiss_indprod<-rdb("OECD","MEI","AUS+CHE.PRINTO01.GYSA.Q")china_egypt_mexico_malaysia_indprod<-rdb("IMF","IFS",mask="M.CN+EG+MX+MY.AIP_PC_CP_A_PT")%>%rename(Country=`Reference Area`)indonesia_pakistan_peru_philippines_singapore_southafrica_indprod<-rdb("IMF","IFS",mask="M.ID+PK+PE+PH+SG+ZA.AIPMA_PC_CP_A_PT")%>%rename(Country=`Reference Area`)argentina_hongkong_saudiarabia_thailand_indprod<-rdb("IMF","IFS",mask="Q.AR+HK+SA+TH.AIPMA_PC_CP_A_PT")%>%rename(Country=`Reference Area`)%>%mutate(Country=case_when(Country=="Hong Kong, China"~"Hong Kong",TRUE~Country))indprod<-bind_rows(indprod,australia_swiss_indprod,china_egypt_mexico_malaysia_indprod,indonesia_pakistan_peru_philippines_singapore_southafrica_indprod,argentina_hongkong_saudiarabia_thailand_indprod)cpi<-rdb("OECD","MEI",ids=".CPALTT01.GY.M")australia_cpi<-rdb("OECD","MEI",ids="AUS.CPALTT01.GY.Q")taiwan_cpi<-rdb("BI/TABEL9_2/17.Q")%>%mutate(Country="Taiwan")other_cpi<-rdb("IMF","IFS",mask="M.EG+HK+MY+PE+PH+PK+SG+TH.PCPI_PC_CP_A_PT")%>%rename(Country=`Reference Area`)%>%mutate(Country=case_when(Country=="Hong Kong, China"~"Hong Kong",TRUE~Country))cpi<-bind_rows(cpi,australia_cpi,taiwan_cpi,other_cpi)unemp<-rdb("OECD","MEI",ids=".LRHUTTTT.STSA.M")swiss_unemp<-rdb("OECD","MEI",mask="CHE.LMUNRRTT.STSA.M")brazil_unemp<-rdb("OECD","MEI",mask="BRA.LRUNTTTT.STSA.M")southafrica_russia_unemp<-rdb("OECD","MEI",mask="ZAF+RUS.LRUNTTTT.STSA.Q")china_unemp<-rdb(ids="BUBA/BBXL3/Q.CN.N.UNEH.TOTAL0.NAT.URAR.RAT.I00")%>%mutate(Country="China")saudiarabia_unemp<-rdb(ids="ILO/UNE_DEAP_SEX_AGE_RT/SAU.BA_627.AGE_AGGREGATE_TOTAL.SEX_T.A")%>%rename(Country=`Reference area`)%>%filter(year(period)<currentyear)india_unemp<-rdb(ids="ILO/UNE_2EAP_NOC_RT/IND.XA_1976.A")%>%rename(Country=`Reference 
area`)%>%filter(year(period)<currentyear)indonesia_pakistan_unemp<-rdb("ILO","UNE_DEAP_SEX_AGE_EDU_RT",mask="IDN+PAK..AGE_AGGREGATE_TOTAL.EDU_AGGREGATE_TOTAL.SEX_T.Q")%>%rename(Country=`Reference area`)other_unemp<-rdb("ILO","UNE_DEA1_SEX_AGE_RT",mask="ARG+EGY+HKG+MYS+PER+PHL+SGP+THA+TWN..AGE_YTHADULT_YGE15.SEX_T.Q")%>%rename(Country=`Reference area`)%>%mutate(Country=case_when(Country=="Hong Kong, China"~"Hong Kong",Country=="Taiwan, China"~"Taiwan",TRUE~Country))unemp<-bind_rows(unemp,brazil_unemp,southafrica_russia_unemp,swiss_unemp,china_unemp,saudiarabia_unemp,india_unemp,indonesia_pakistan_unemp,other_unemp)forecast_gdp_cpi_ea<-rdb("IMF","WEOAGG",mask="163.NGDP_RPCH+PCPIPCH")%>%rename(`WEO Country`=`WEO Countries group`)forecast_gdp_cpi<-rdb("IMF","WEO",mask=".NGDP_RPCH+PCPIPCH")%>%bind_rows(forecast_gdp_cpi_ea)%>%transmute(Country=`WEO Country`,var=`WEO Subject`,value,period)%>%mutate(Country=str_trim(Country),var=str_trim(var))%>%mutate(Country=case_when(Country=="United Kingdom"~"Britain",Country=="Hong Kong SAR"~"Hong Kong",Country=="Korea"~"South Korea",Country=="Taiwan Province of China"~"Taiwan",TRUE~Country),var=case_when(var=="Gross domestic product, constant prices - Percent change"~"GDP",var=="Inflation, average consumer prices - Percent change"~"CPI",TRUE~var))forecast_gdp_cpi<-left_join(data.frame(Country=CountryList),forecast_gdp_cpi,by="Country")

Transform

gdp_yoy_latest_period <- gdp %>%
  filter(MEASURE == "GYSA") %>%
  filter(!is.na(value)) %>%
  group_by(Country) %>%
  summarise(period = max(period))

gdp_yoy_latest <- gdp %>%
  filter(MEASURE == "GYSA") %>%
  inner_join(gdp_yoy_latest_period) %>%
  mutate(var = "GDP", measure = "latest")

gdp_qoq_latest_period <- gdp %>%
  filter(MEASURE == "GPSA") %>%
  filter(!is.na(value)) %>%
  group_by(Country) %>%
  summarise(period = max(period))

gdp_qoq_latest <- gdp %>%
  filter(MEASURE == "GPSA") %>%
  inner_join(gdp_qoq_latest_period) %>%
  mutate(value = ((1 + value / 100)^4 - 1) * 100, var = "GDP", measure = "quarter")

gdp_2019_2020 <- forecast_gdp_cpi %>%
  filter(var == "GDP" & (period == "2019-01-01" | period == "2020-01-01")) %>%
  mutate(measure = as.character(year(period)))

indprod_latest_period <- indprod %>%
  filter(!is.na(value)) %>%
  group_by(Country) %>%
  summarise(period = max(period))

indprod_latest <- indprod %>%
  inner_join(indprod_latest_period) %>%
  mutate(var = "indprod", measure = "latest")

cpi_latest_period <- cpi %>%
  filter(!is.na(value)) %>%
  group_by(Country) %>%
  summarise(period = max(period))

cpi_latest <- cpi %>%
  inner_join(cpi_latest_period) %>%
  mutate(var = "CPI", measure = "latest")

cpi_2019 <- forecast_gdp_cpi %>%
  filter(var == "CPI" & period == "2019-01-01") %>%
  mutate(measure = "2019")

unemp_latest_period <- unemp %>%
  filter(!is.na(value)) %>%
  group_by(Country) %>%
  summarise(period = max(period))

unemp_latest <- unemp %>%
  inner_join(unemp_latest_period) %>%
  mutate(var = "unemp", measure = "latest")

Merge

df_all <- bind_rows(gdp_yoy_latest, gdp_qoq_latest, gdp_2019_2020, indprod_latest,
                    cpi_latest, cpi_2019, unemp_latest) %>%
  mutate(value = ifelse(value >= 0,
                        paste0("+", sprintf("%.1f", round(value, 1))),
                        sprintf("%.1f", round(value, 1)))) %>%
  unite(measure, c(var, measure))

df_latest <- df_all %>%
  filter(measure %in% c("GDP_latest", "indprod_latest", "CPI_latest", "unemp_latest")) %>%
  mutate(value = case_when(
    `@frequency` == "quarterly" ~ paste(value, " Q", quarter(period), sep = ""),
    `@frequency` == "monthly" ~ paste(value, " ", month(period, label = TRUE, abbr = TRUE, locale = "en_US.utf8"), sep = ""),
    `@frequency` == "annual" ~ paste(value, " Year", sep = ""),
    TRUE ~ value)) %>%
  mutate(value = text_spec(
    ifelse(year(period) == lastyear,
           paste0(value, footnote_marker_symbol(3)),
           ifelse(year(period) == beforelastyear,
                  paste0(value, footnote_marker_symbol(4)),
                  value)),
    link = paste("https://db.nomics.world", provider_code, dataset_code, series_code, sep = "/"),
    color = "#333333", escape = F, extra_css = "text-decoration:none"))

df_final <- df_all %>%
  filter(measure %in% c("GDP_quarter", "GDP_2019", "GDP_2020", "CPI_2019")) %>%
  bind_rows(df_latest) %>%
  mutate(Country = case_when(Country == "United Kingdom" ~ "Britain",
                             Country == "Euro area (19 countries)" ~ "Euro area",
                             Country == "China (People's Republic of)" ~ "China",
                             Country == "Korea" ~ "South Korea",
                             TRUE ~ Country)) %>%
  select(Country, value, measure) %>%
  spread(measure, value) %>%
  select(Country, GDP_latest, GDP_quarter, GDP_2019, GDP_2020,
         indprod_latest, CPI_latest, CPI_2019, unemp_latest)

df_final <- left_join(data.frame(Country = CountryList), df_final, by = "Country")

Display

names(df_final)[1] <- ""
names(df_final)[2] <- "latest"
names(df_final)[3] <- paste0("quarter", footnote_marker_symbol(1))
names(df_final)[4] <- paste0("2019", footnote_marker_symbol(2))
names(df_final)[5] <- paste0("2020", footnote_marker_symbol(2))
names(df_final)[6] <- "latest"
names(df_final)[7] <- "latest"
names(df_final)[8] <- paste0("2019", footnote_marker_symbol(2))
names(df_final)[9] <- "latest"

df_final %>%
  kable(row.names = F, escape = F, align = c("l", rep("c", 8)),
        caption = "Economic data (% change on year ago)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                fixed_thead = T, font_size = 13) %>%
  add_header_above(c(" " = 1, "Gross domestic product" = 4, "Industrial production  " = 1,
                     "Consumer prices" = 2, "Unemployment rate, %" = 1)) %>%
  column_spec(1, bold = T) %>%
  row_spec(seq(1, nrow(df_final), by = 2), background = "#D5E4EB") %>%
  row_spec(c(5, 14, 22, 33, 39), extra_css = "border-bottom: 1.2px solid") %>%
  footnote(general = "DBnomics (Eurostat, ILO, IMF, OECD and national sources). Click on the figures in the `latest` columns to see the full time series.",
           general_title = "Source: ", footnote_as_chunk = T,
           symbol = c("% change on previous quarter, annual rate ", "IMF estimation/forecast",
                      paste0(lastyear), paste0(beforelastyear)))

Bibliography

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016. URL: https://www.R-project.org.

RStudio Team. RStudio: Integrated Development Environment for R. RStudio, Inc., Boston, MA, 2016. URL: http://www.rstudio.com/.




TSstudio 0.1.5 on CRAN


[This article was first published on Rami Krispin, and kindly contributed to R-bloggers.]

A new version (0.1.5) of the TSstudio package was pushed to CRAN last month. The release includes new functions as well as bug fixes and an update of the package license (from GPL-3 to MIT).

New features

  • train_model – a flexible framework for training, testing, evaluating, and forecasting models. This function can run multiple models with backtesting or a single training/testing partition. It will replace the ts_backtesting function, which will be deprecated in the next release.

  • plot_model – animates the performance of the train_model output on the backtesting partitions
  • plot_error – plots the error distribution of the train_model output
  • ts_cor – ACF and PACF plots with seasonal lags; this function will replace the ts_acf and ts_pacf functions, which will be deprecated in the next release
  • arima_diag – a diagnostic plot for identifying the AR, MA, and differencing components of an ARIMA model (a short usage sketch follows this list)
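As a rough illustration of the last two functions, here is how they might be called on the USgas series that ships with TSstudio. This is only a sketch with default arguments; check ?ts_cor and ?arima_diag for the full signatures.

library(TSstudio)

# ACF and PACF plots, with seasonal lags highlighted
ts_cor(USgas)

# diagnostic plots to help identify the AR, MA and differencing components
arima_diag(USgas)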

Bug fixes

  • ts_seasonal – fixed the alignment of the box plot colors
  • ts_plot – fixed the dash and marker mode settings for multiple time series


Key R Operators


[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers.]

Operators you should make more use of in R

Only recently have I discovered the true power of some of the operators in R. Here are some tips on a few underused operators:

The %in% operator


This funny looking operator is very handy. It is shorthand for testing whether values appear in an object. For instance

x = c(2, 6, 4, 4, 6, 8, 10, 14, 2)

To grab all the values where x is 2, 4 or 14 we could do

x[x == 2 | x == 4 | x == 14]

## [1]  2  4  4 14  2

or we could use %in%

x[x %in% c(2, 4, 14)]

## [1]  2  4  4 14  2

This is something I use all the time for filtering data. Imagine you’ve got a tibble of data relating to the world (step up spData)

library("dplyr")
library("sf")
library("sp")
data(world, package = "spData")

# drop the geometry column because we don't need it
world = world %>% 
  st_drop_geometry()

Your colleague sends you a list of 50 countries (I'm going to randomly sample the names from the data) and says that they want the average life expectancy for each continent group within these 50 countries.

colleague_countries = world %>% 
  sample_n(50) %>% 
  pull(name_long)
head(colleague_countries)

## [1] "Yemen"         "New Zealand"   "Kyrgyzstan"    "New Caledonia"
## [5] "Morocco"       "Ecuador"

We could then ask R to return every row where the column name_long matches any value in colleague_countries using the %in% operator

world %>% 
  filter(name_long %in% colleague_countries) %>% 
  group_by(continent) %>% 
  summarise(av_life_exp = mean(lifeExp, na.rm = TRUE))

## # A tibble: 6 x 2
##   continent     av_life_exp
##   <chr>               <dbl>
## 1 Africa               63.6
## 2 Asia                 72.3
## 3 Europe               79.0
## 4 North America        74.6
## 5 Oceania              80.3
## 6 South America        74.3

Did you know?


You can make your own %% infix operators! For instance

`%add%` = function(a, b) a + b

2 %add% 3

## [1] 5
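Any two-argument function can be wrapped up this way. For example, a hypothetical string-pasting operator (the name %+% is made up for illustration):

`%+%` = function(a, b) paste0(a, b)

"jumping" %+% "_" %+% "rivers"

## [1] "jumping_rivers"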

The && and || operators


If you look on the help page for the logical operators & and |, you'll find && and ||. What do they do, and how do they actually differ from their single counterparts? Let's look at an example. Take a vector x

x = c(2, 4, 6, 8)

To test for the values in x that are greater than 3 and less than 7 we would write

x > 3 & x < 7

## [1] FALSE  TRUE  TRUE FALSE

Then to return these values we would subset using square brackets

x[x > 3 & x < 7]

## [1] 4 6

What happens if we repeat these steps with &&?

x > 3 && x < 7

## [1] FALSE

x[x > 3 && x < 7]

## numeric(0)

What is happening here is that the double && only evaluates the first element of the vector, so subsetting with it returns nothing useful. The double operators also short-circuit: evaluation proceeds only until the result is determined, which has another nice consequence. For example, take the object a

a = 5

In the following test

a == 4 & a == 5 & a == 8

## [1] FALSE

All 3 tests are evaluated, even though we know after the first test, a == 4, that the overall result is FALSE. Whereas using the double &&

a == 4 && a == 5 && a == 8

## [1] FALSE

Here we only evaluate the first test as that is all we need to determine the result. This is more efficient as it won’t evaluate any test it doesn’t need to. To demonstrate this, we’ll use two toy functions

a = function(){
  print("Hi I am f")
  return(FALSE)
}
b = function(){
  print("Hi I am g")
  return(TRUE)
}

a() & b()

## [1] "Hi I am f"
## [1] "Hi I am g"

## [1] FALSE

When using the single &, R has to evaluate both functions even though the output of the left hand side is FALSE

a() && b()

## [1] "Hi I am f"

## [1] FALSE

But using &&, R only evaluates the first function, because that is enough to determine the result! It's the same rule for ||

b() | a()

## [1] "Hi I am g"
## [1] "Hi I am f"

## [1] TRUE

b() || a()

## [1] "Hi I am g"

## [1] TRUE

The xor() function


This last one isn't so much an operator as a function. The xor() function is an exclusive version of |. Take two vectors, x and y

x = c(1,1,2)
y = c(1,2,2)

To get all the elements where either x is 1 or y is 2 we would write

x == 1 | y == 2

## [1] TRUE TRUE TRUE

However, this will also return the elements where x == 1 AND y == 2. If we only want the elements where exactly one of the statements is TRUE, we use xor()

xor(x == 1 , y == 2)

## [1]  TRUE FALSE  TRUE
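For the record, xor(a, b) is equivalent to (a | b) & !(a & b), so the same result can be built from the single operators, just less readably:

(x == 1 | y == 2) & !(x == 1 & y == 2)

## [1]  TRUE FALSE  TRUE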

That’s all for this time. Thanks for reading!

The post Key R Operators appeared first on Jumping Rivers.



HTTP testing in R: overview of tools and new features


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

Testing is a crucial component to any software package. Testing makes sure that your code does what you expect it to do; and importantly, makes it safer to make changes moving forward because a good test suite will tell you if a change has broken existing functionality. Our recent community call on testing is a nice place to get started with testing.

One thing that makes testing even harder is HTTP requests. They add complexity for many reasons:

  • Remote resources can change in a variety of ways, all leading to broken tests:
    • Response structure can change, potentially breaking downstream processing
    • Query parameter names can change, routes can be deprecated
    • Remote resources can be temporarily down
  • If you're testing a function in your package that does an HTTP request, you're testing the code, but also the remote resource. This may be what you want, but perhaps it's not.
  • Some remote resources require authentication. Good software development uses continuous integration; testing in different computational environments means you have to think about whether you can run a test that requires authentication.

    • In R, we are in the unusual situation relative to other programming languages (e.g., Ruby, Python, Node) where the distribution channel for our packages (CRAN) also checks the packages; yet another context in which you have to think about whether you can run a test or not
  • Test suite runs are slower when HTTP requests are made (and variable depending on server response times and internet speed)

To make testing easier for R packages that do HTTP requests we’ve been working on two packages: webmockr and vcr (introduced on this blog in February 2018 and May 2018, respectively). There’s been a lot of changes since those two blog posts, making testing R packages with HTTP requests easier than ever.

Mocking with webmockr

webmockr is a general purpose library for mocking HTTP requests, built to work with any HTTP client in R (exception: curl, for good reason). Use cases include:

  • mocking requests in package test suites (i.e., it can be hard to re-create some scenarios, e.g., intermittent server errors, timeouts, etc.)
  • you want to run code when you don’t have access to the internet – webmockr can help with that
  • webmockr can allow you to run tests before the API you’ll use even exists

Let’s run through a simple example. First, load and enable webmockr:

library(webmockr)
webmockr::enable()

Then create a stub with stub_request(). Here, we stub a request based on HTTP method and URL only.

stub_request("get", "https://httpbin.org/get")

Now make a “request”. If the request matches the stub, a real HTTP request will NOT be made; instead webmockr returns the same kind of object you'd get with a real HTTP request, filled with whatever you told webmockr to return (in this case nothing).

library("crul")
x <- HttpClient$new(url = "https://httpbin.org")
x$get('get')
#>  
#>   url: https://httpbin.org/get
#>   request_headers: 
#>     User-Agent: libcurl/7.54.0 r-curl/4.3 crul/0.9.0
#>     Accept-Encoding: gzip, deflate
#>     Accept: application/json, text/xml, application/xml, */*
#>   response_headers: 
#>   status: 200
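A stub can also specify what to return. The snippet below is a small sketch using webmockr's to_return(); the status, body and header values are made up for illustration, and magrittr is loaded just for the pipe.

library(magrittr)

# return a canned response whenever this stub matches
stub_request("get", "https://httpbin.org/get") %>%
  to_return(status = 200, body = "{\"foo\":\"bar\"}",
            headers = list("Content-Type" = "application/json"))

# the next matching request returns the canned response instead of hitting the network
x$get('get')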

What’s new in webmockr?

  • If you were using webmockr with httr, you’re in luck. The latest version of webmockr (v0.5) fixed an issue where we weren’t mocking httr simple authentication.
  • A big new feature in v0.5 is support for writing to disk. Both crul and httr allow users to write to disk as part of the HTTP request. HTTP clients in some other languages I’m familiar with (e.g., Ruby) don’t do this; you’d have to write to disk yourself. Anyway, the point is that supporting writing to disk (here and in vcr) was not super easy; it’s a complicated thing to support. There will probably be edge cases that will break; open an issue if you run into problems mocking HTTP requests that write to disk.

Check out the release notes for all webmockr changes

Record and replay requests with vcr

vcr leverages webmockr to handle matching requests, but instead of just mocking requests, vcr records real requests and plays them back (hence the name).

Let’s run through a simple example. First, load vcr:

library(vcr)
library(crul)

The main interface in vcr is vcr::use_cassette(). Pass a code block to use_cassette as the second argument and vcr will allow a real HTTP request the first time the block is run, and record the request and response to a file. Any subsequent running of the same code will use the cached data.

For example, here we make a request to https://httpbin.org

use_cassette(name = "helloworld", {
  req1 = crul::HttpClient$new(url = "https://httpbin.org")$get("get")
})

Which makes a real HTTP request and records the request and response to a file helloworld.yml. The second call to the same block

use_cassette(name = "helloworld", {
  req2 = crul::HttpClient$new(url = "https://httpbin.org")$get("get")
})

uses the helloworld.yml file – a real HTTP request is not done on subsequent calls.

And we can test that the returned response is the same in both code blocks above

testthat::compare(req1, req2)
#> Equal

What’s new in vcr?

  • Just as webmockr now handles mocking writing to disk, vcr can now handle requests that write to disk. See ?mocking-disk-writing to get started.
  • You can now easily turn off vcr completely, either through a function call or using an environment variable (ideal, for example, if you work on the command line or in continuous integration scenarios). See ?lightswitch to get started.
  • Ignoring certain requests now works, though only for crul for now. Ignoring means that the real HTTP request you want to ignore will be allowed, while all other requests will be handled by vcr. Right now you can ignore certain hosts (e.g., google.com) and all localhost requests; in the future we hope to support ignoring any specific request via user-defined functions. See the vcr configuration vignette for discussion and examples, and the short sketch after this list.
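Here is a minimal sketch of those last two features. The parameter and function names are as I recall them from the vcr configuration and lightswitch documentation, so double-check ?vcr_configure and ?lightswitch against your installed version.

library(vcr)

# let requests to google.com and to localhost through for real; vcr handles everything else
vcr_configure(
  dir = "tests/fixtures",
  ignore_hosts = "google.com",
  ignore_localhost = TRUE
)

# temporarily switch vcr off for a block of code, e.g. while debugging locally
turned_off({
  crul::HttpClient$new(url = "https://httpbin.org")$get("get")
})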

Open an issue if you run into problems with any of the above features.

Check out the release notes for all vcr changes

HTTP Testing Book

We’ve also been working on a book: HTTP mocking and testing in R. The book is intended as a detailed guide for HTTP mocking and testing in R code and packages. The book aims to introduce the reader to leveraging webmockr and vcr in their test suites, going into detail with both packages.

We’ve got some new content in the book and we have more planned. We’d love any feedback on the book; let us know by opening an issue.



IPO Portfolios and a Benchmark


[This article was first published on R Views, and kindly contributed to R-bloggers.]

In two previous posts, we explored IPOs and IPO returns by sector and year since 2004 and then examined the returns of portfolios constructed by investing in IPOs each year. In today’s post, we will add a benchmark so that we can compare our IPO portfolios to something besides themselves. Next time, we will delve into return attribution to visualize how individual equities have contributed to portfolios over time.

I won’t review the code from the previous posts, but briefly we imported prices for every ticker that IPO’d between 2004 and 2014, found the monthly returns of those tickers, then constructed portfolios on a year-by-year basis, so that we had a portfolio formed each year consisting of equal weights in every IPO for that year.

# object holding time series of monthly closing prices, monthly returns, tickers, IPO year and sector
ipo_riingo_prices_pins

# function to calculate returns of portfolios constructed by investing equally in each IPO in each year
ipo_by_year_portfolios <- function(year, show_growth = F){
  ipo_riingo_prices_pins %>%
    select(ticker, date, monthly_returns, ipo.year) %>%
    filter(ipo.year == year) %>%
    tq_portfolio(assets_col   = ticker,
                 returns_col  = monthly_returns,
                 col_rename   = paste(year, "_port_returns", sep = ""),
                 wealth.index = show_growth,
                 rebalance_on = "months")
}

# vector of years
years_numeric <- seq(2004, 2014, by = 1)

# pass the years and the individual returns object to the function
returns_each_year_ipo_portfolios <-
  map(years_numeric, ipo_by_year_portfolios) %>%
  reduce(left_join)

And here is the resulting object of portfolio returns:

returns_each_year_ipo_portfolios %>%   tail()
# A tibble: 6 x 12
  date                `2004_port_retu… `2005_port_retu… `2006_port_retu…
  <dttm>                         <dbl>            <dbl>            <dbl>
1 2019-05-31 00:00:00         -0.101           -0.0575           0.261
2 2019-06-28 00:00:00          0.382            0.0635           0.0533
3 2019-07-31 00:00:00          0.00659          0.0185           0.0346
4 2019-08-30 00:00:00         -0.0229          -0.0317          -0.00789
5 2019-09-30 00:00:00          0.0256           0.00545         -0.00539
6 2019-10-31 00:00:00          0.0347           0.0180           0.0233
# … with 8 more variables: `2007_port_returns` <dbl>,
#   `2008_port_returns` <dbl>, `2009_port_returns` <dbl>,
#   `2010_port_returns` <dbl>, `2011_port_returns` <dbl>,
#   `2012_port_returns` <dbl>, `2013_port_returns` <dbl>,
#   `2014_port_returns` <dbl>

All that was done last time, and it gave us the returns of our 11 IPO portfolios from formation to today.

Now, let’s calculate the returns of a benchmark portfolio so we can compare those IPO portfolios to something besides themselves. We will use SPY as the benchmark and start by importing monthly prices since 2004. I’ll also go ahead and calculate monthly returns in the same piped flow.

spy_benchmark <-
  "SPY" %>%
  riingo_prices(start_date = "2004-01-01", end_date = "2019-10-31", resample_frequency = "monthly") %>%
  select(ticker, date, close) %>%
  mutate(spy_monthly_returns = close/lag(close) - 1) %>%
  na.omit()

spy_benchmark %>%
  head()
# A tibble: 6 x 4
  ticker date                close spy_monthly_returns
  <chr>  <dttm>              <dbl>               <dbl>
1 SPY    2004-02-27 00:00:00  115.              0.0136
2 SPY    2004-03-31 00:00:00  113.             -0.0167
3 SPY    2004-04-30 00:00:00  111.             -0.0189
4 SPY    2004-05-31 00:00:00  113.              0.0171
5 SPY    2004-06-30 00:00:00  115.              0.0148
6 SPY    2004-07-30 00:00:00  111.             -0.0322

From here, it’s straightforward to compare these benchmark returns to those of the 2004 IPO portfolio. First, we line up the two columns of returns.

returns_each_year_ipo_portfolios %>%
  select(date, `2004_port_returns`) %>%
  add_column(benchmark = spy_benchmark$spy_monthly_returns) %>%
  tail()
# A tibble: 6 x 3
  date                `2004_port_returns` benchmark
  <dttm>                            <dbl>     <dbl>
1 2019-05-31 00:00:00            -0.101     -0.0638
2 2019-06-28 00:00:00             0.382      0.0644
3 2019-07-31 00:00:00             0.00659    0.0151
4 2019-08-30 00:00:00            -0.0229    -0.0167
5 2019-09-30 00:00:00             0.0256     0.0148
6 2019-10-31 00:00:00             0.0347     0.0221

Then we pivot_longer() and apply the SharpeRatio() function, same as we did last time.

returns_each_year_ipo_portfolios %>%
  select(date, `2004_port_returns`) %>%
  add_column(benchmark = spy_benchmark$spy_monthly_returns) %>%
  pivot_longer(-date, names_to = "portfolio", values_to = "monthly_return") %>%
  group_by(portfolio) %>%
  arrange(portfolio, date) %>%
  filter(!is.na(monthly_return)) %>%
  tq_performance(Ra = monthly_return,
                 performance_fun = SharpeRatio,
                 Rf = 0,
                 FUN = "StdDev")
# A tibble: 2 x 2
# Groups:   portfolio [2]
  portfolio         `StdDevSharpe(Rf=0%,p=95%)`
  <chr>                                   <dbl>
1 2004_port_returns                       0.234
2 benchmark                               0.153

Here’s the result piped straight to ggplot().

returns_each_year_ipo_portfolios %>%
  select(date, `2004_port_returns`) %>%
  add_column(benchmark = spy_benchmark$spy_monthly_returns) %>%
  pivot_longer(-date, names_to = "portfolio", values_to = "monthly_return") %>%
  group_by(portfolio) %>%
  arrange(portfolio, date) %>%
  filter(!is.na(monthly_return)) %>%
  tq_performance(Ra = monthly_return,
                 performance_fun = SharpeRatio,
                 Rf = 0,
                 FUN = "StdDev") %>%
  `colnames<-`(c("portfolio", "port_sharpe")) %>%
  ggplot(aes(x = portfolio, y = port_sharpe, fill = portfolio)) +
  geom_col(width = .2) +
  labs(y = "sharpe ratio", title = "Benchmark v. IPO Portfolio")

Our IPO portfolio has a higher Sharpe Ratio, but remember that we built it without regard to survivorship bias: we didn't invest in any companies that haven't survived to 2019.

That’s a nice comparison of one portfolio to the benchmark, but we want to run this same analysis on all of our portfolios.

First, let’s calculate the Sharpes for all of our IPO portfolios, same as we did last time.

years_numeric <- seq(2004, 2014, by = 1)

port_sharpes <-
  returns_each_year_ipo_portfolios %>%
  pivot_longer(-date, names_to = "portfolio_by_year", values_to = "monthly_return") %>%
  group_by(portfolio_by_year) %>%
  arrange(portfolio_by_year, date) %>%
  filter(!is.na(monthly_return)) %>%
  tq_performance(Ra = monthly_return,
                 performance_fun = SharpeRatio,
                 Rf = 0,
                 FUN = "StdDev") %>%
  `colnames<-`(c("portfolio_by_year", "port_sharpe")) %>%
  add_column(year = years_numeric)

port_sharpes
# A tibble: 11 x 3
# Groups:   portfolio_by_year [11]
   portfolio_by_year port_sharpe  year
   <chr>                   <dbl> <dbl>
 1 2004_port_returns       0.234  2004
 2 2005_port_returns       0.192  2005
 3 2006_port_returns       0.249  2006
 4 2007_port_returns       0.190  2007
 5 2008_port_returns       0.142  2008
 6 2009_port_returns       0.220  2009
 7 2010_port_returns       0.279  2010
 8 2011_port_returns       0.152  2011
 9 2012_port_returns       0.309  2012
10 2013_port_returns       0.182  2013
11 2014_port_returns       0.218  2014

And now, let’s calculate the Sharpe Ratio for the benchmark for each year. That means we will build or organize 11 different return streams for SPY, each starting in a year from 2004 to 2014, and then calculate the Sharpes for each of those 11 return streams.

Here’s how we do it for just 2004.

start_year <- "2004"
start_date <- ymd(parse_date(start_year, format = "%Y"))

spy_benchmark %>%
  filter(date >= start_date) %>%
  tq_performance(Ra = spy_monthly_returns,
                 performance_fun = SharpeRatio,
                 Rf = 0,
                 FUN = "StdDev")
# A tibble: 1 x 1
  `StdDevSharpe(Rf=0%,p=95%)`
                        <dbl>
1                       0.153

This looks like a good candidate for a function that accepts one argument, start_year, so that we can pass it a vector of years.

spy_sharpe_function <- function(start_year){

  start_date <- ymd(parse_date(start_year, format = "%Y"))

  spy_benchmark %>%
    filter(date >= start_date) %>%
    tq_performance(Ra = spy_monthly_returns,
                   performance_fun = SharpeRatio,
                   Rf = 0,
                   FUN = "StdDev") %>%
    `colnames<-`("spy_sharpe") %>%
    mutate(year = as.numeric(start_year))
}

Let’s pass in one year and peek at the result.

spy_sharpe_function("2005")
# A tibble: 1 x 2
  spy_sharpe  year
       <dbl> <dbl>
1      0.150  2005

Now, let’s map across different years.

years_character <- as.character(years_numeric)

spy_sharpes <-
  map_dfr(years_character, spy_sharpe_function)

spy_sharpes
# A tibble: 11 x 2
   spy_sharpe  year
        <dbl> <dbl>
 1      0.153  2004
 2      0.150  2005
 3      0.152  2006
 4      0.139  2007
 5      0.141  2008
 6      0.258  2009
 7      0.253  2010
 8      0.263  2011
 9      0.310  2012
10      0.300  2013
11      0.229  2014

That worked! Let’s join our benchmark results with the IPO portfolio results for ease of comparison and pipe straight to ggplot().

port_sharpes %>%
  left_join(spy_sharpes, by = "year") %>%
  pivot_longer(c(-year, -portfolio_by_year), names_to = "port_type", values_to = "sharpe") %>%
  ggplot(aes(x = year, y = sharpe, fill = port_type)) +
  geom_col(position = position_dodge2(padding = .2)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

It looks like our IPO portfolios outperformed in the years 2004-2007. That might be due to our survivorship bias since we’re only investing in companies that we know, with hindsight, have survived to 2019.

Let's also remember that Sharpe Ratios aren't everything. Our IPO portfolios might be so volatile that we wouldn't have the gumption to stick with them through the hard times. To get a better sense of what we'd have faced, let's visualize the drawdowns for the 2004 IPO portfolio versus the benchmark.

returns_each_year_ipo_portfolios %>%
  select(date, `2004_port_returns`) %>%
  left_join(spy_benchmark %>% select(date, spy_monthly_returns), by = "date") %>%
  pivot_longer(-date, names_to = "fund", values_to = "drawdown") %>%
  mutate(drawdown = case_when(drawdown > 0 ~ 0,
                              TRUE ~ drawdown),
         drawdown = drawdown * 100) %>%
  plot_ly(type = 'scatter', x = ~date, y = ~drawdown, color = ~fund,
          mode = 'lines', fill = 'tonexty') %>%
  layout(yaxis = list(ticksuffix = "%"))

{"x":{"visdat":{"1c2d816c1b2":["function () ","plotlyVisDat"]},"cur_data":"1c2d816c1b2","attrs":{"1c2d816c1b2":{"x":{},"y":{},"mode":"lines","fill":"tonexty","color":{},"alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"scatter"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"yaxis":{"domain":[0,1],"automargin":true,"ticksuffix":"%","title":"drawdown"},"xaxis":{"domain":[0,1],"automargin":true,"title":"date"},"hovermode":"closest","showlegend":true},"source":"A","config":{"showSendToCloud":false},"data":[{"fillcolor":"rgba(102,194,165,0.5)","x":["2004-02-27","2004-03-31","2004-04-30","2004-05-31","2004-06-30","2004-07-30","2004-08-31","2004-09-30","2004-10-29","2004-11-30","2004-12-31","2005-01-31","2005-02-28","2005-03-31","2005-04-29","2005-05-31","2005-06-30","2005-07-29","2005-08-31","2005-09-30","2005-10-31","2005-11-30","2005-12-30","2006-01-31","2006-02-28","2006-03-31","2006-04-28","2006-05-31","2006-06-30","2006-07-31","2006-08-31","2006-09-29","2006-10-31","2006-11-30","2006-12-29","2007-01-31","2007-02-28","2007-03-30","2007-04-30","2007-05-31","2007-06-29","2007-07-31","2007-08-31","2007-09-28","2007-10-31","2007-11-30","2007-12-31","2008-01-31","2008-02-29","2008-03-31","2008-04-30","2008-05-30","2008-06-30","2008-07-31","2008-08-29","2008-09-30","2008-10-31","2008-11-28","2008-12-31","2009-01-30","2009-02-27","2009-03-31","2009-04-30","2009-05-29","2009-06-30","2009-07-31","2009-08-31","2009-09-30","2009-10-30","2009-11-30","2009-12-31","2010-01-29","2010-02-26","2010-03-31","2010-04-30","2010-05-31","2010-06-30","2010-07-30","2010-08-31","2010-09-30","2010-10-29","2010-11-30","2010-12-31","2011-01-31","2011-02-28","2011-03-31","2011-04-29","2011-05-31","2011-06-30","2011-07-29","2011-08-31","2011-09-30","2011-10-31","2011-11-30","2011-12-30","2012-01-31","2012-02-29","2012-03-30","2012-04-30","2012-05-31","2012-06-29","2012-07-31","2012-08-31","2012-09-28","2012-10-31","2012-11-30","2012-12-31","2013-01-31","2013-02-28","2013-03-29","2013-04-30","2013-05-31","2013-06-28","2013-07-31","2013-08-30","2013-09-30","2013-10-31","2013-11-29","2013-12-31","2014-01-31","2014-02-28","2014-03-31","2014-04-30","2014-05-30","2014-06-30","2014-07-31","2014-08-29","2014-09-30","2014-10-31","2014-11-28","2014-12-31","2015-01-30","2015-02-27","2015-03-31","2015-04-30","2015-05-29","2015-06-30","2015-07-31","2015-08-31","2015-09-30","2015-10-30","2015-11-30","2015-12-31","2016-01-29","2016-02-29","2016-03-31","2016-04-29","2016-05-31","2016-06-30","2016-07-29","2016-08-31","2016-09-30","2016-10-31","2016-11-30","2016-12-30","2017-01-31","2017-02-28","2017-03-31","2017-04-28","2017-05-31","2017-06-30","2017-07-31","2017-08-31","2017-09-29","2017-10-31","2017-11-30","2017-12-29","2018-01-31","2018-02-28","2018-03-30","2018-04-30","2018-05-31","2018-06-29","2018-07-31","2018-08-31","2018-09-28","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-29","2019-04-30","2019-05-31","2019-06-28","2019-07-31","2019-08-30","2019-09-30","2019-10-31"],"y":[0,-0.295815295815294,-0.141119593717165,0,-1.93752410637577,-4.72239982815398,0,0,0,0,0,-3.70639357490493,0,-3.00530627701531,-3.81806362298608,0,0,0,-1.26513049452097,0,-1.89137272158877,0,-0.946261996001996,0,0,0,0,-6.34596548654075,0,-5.91035612348375,0,0,0,0,0,0,-0.213265234430571,-2.62806652852446,0,-0.0149890386356488,-0.179904137311604,-4.89392192121064,0,0,-1.30469848582844,-5.48966214003839,-2.85051933350475,-4.52078240891041,-3.52702208486766,-5.95172682580952,0,0,-14.6995547145298,0,0,-6.33453655977434,-25.9957017
144956,-22.4903446574257,0,-3.19761356345966,-13.5707334967038,0,0,0,0,0,0,0,-10.1494851182686,0,0,-2.69266473413294,0,0,0,-7.62824610612567,-5.13607276621238,0,-7.66044655752229,0,0,0,0,0,0,-1.7186077348644,0,-1.04402587300515,-4.68194003999343,-3.74610547774487,-10.5062737621681,-11.924927241082,0,0,-0.661402195159844,0,0,0,-2.73179449181347,-9.44002944351068,0,0,0,0,-5.21803432832407,-0.214676371621814,0,0,0,0,0,0,0,0,-3.59346172807691,0,0,0,0,0,0,-2.50275749731994,-3.37252195368758,0,0,-4.69405141116664,0,-6.10762482098695,0,0,0,-0.352254454582457,0,0,-2.92648248789646,0,0,-2.59109039955647,-3.74955696015122,0,0,0,-3.35379981060538,-7.31716451366049,-0.946199311782181,0,-0.324898561218778,0,0,0,0,-1.21733838357995,-5.33075702335386,0,0,0,0,0,-0.348674419893835,0,0,0,0,0,0,0,0,0,-5.2002051058019,0,-0.367518949618306,0,-1.73130237875992,-0.662274107634253,0,-1.80162853672178,-13.1202468083387,0,-12.4413126039325,0,0,0,0,-10.0519235664213,0,0,-2.29265699032184,0,0],"mode":"lines","fill":"tonexty","type":"scatter","name":"2004_port_returns","marker":{"color":"rgba(102,194,165,1)","line":{"color":"rgba(102,194,165,1)"}},"textfont":{"color":"rgba(102,194,165,1)"},"error_y":{"color":"rgba(102,194,165,1)"},"error_x":{"color":"rgba(102,194,165,1)"},"line":{"color":"rgba(102,194,165,1)"},"xaxis":"x","yaxis":"y","frame":null},{"fillcolor":"rgba(141,160,203,0.5)","x":["2004-02-27","2004-03-31","2004-04-30","2004-05-31","2004-06-30","2004-07-30","2004-08-31","2004-09-30","2004-10-29","2004-11-30","2004-12-31","2005-01-31","2005-02-28","2005-03-31","2005-04-29","2005-05-31","2005-06-30","2005-07-29","2005-08-31","2005-09-30","2005-10-31","2005-11-30","2005-12-30","2006-01-31","2006-02-28","2006-03-31","2006-04-28","2006-05-31","2006-06-30","2006-07-31","2006-08-31","2006-09-29","2006-10-31","2006-11-30","2006-12-29","2007-01-31","2007-02-28","2007-03-30","2007-04-30","2007-05-31","2007-06-29","2007-07-31","2007-08-31","2007-09-28","2007-10-31","2007-11-30","2007-12-31","2008-01-31","2008-02-29","2008-03-31","2008-04-30","2008-05-30","2008-06-30","2008-07-31","2008-08-29","2008-09-30","2008-10-31","2008-11-28","2008-12-31","2009-01-30","2009-02-27","2009-03-31","2009-04-30","2009-05-29","2009-06-30","2009-07-31","2009-08-31","2009-09-30","2009-10-30","2009-11-30","2009-12-31","2010-01-29","2010-02-26","2010-03-31","2010-04-30","2010-05-31","2010-06-30","2010-07-30","2010-08-31","2010-09-30","2010-10-29","2010-11-30","2010-12-31","2011-01-31","2011-02-28","2011-03-31","2011-04-29","2011-05-31","2011-06-30","2011-07-29","2011-08-31","2011-09-30","2011-10-31","2011-11-30","2011-12-30","2012-01-31","2012-02-29","2012-03-30","2012-04-30","2012-05-31","2012-06-29","2012-07-31","2012-08-31","2012-09-28","2012-10-31","2012-11-30","2012-12-31","2013-01-31","2013-02-28","2013-03-29","2013-04-30","2013-05-31","2013-06-28","2013-07-31","2013-08-30","2013-09-30","2013-10-31","2013-11-29","2013-12-31","2014-01-31","2014-02-28","2014-03-31","2014-04-30","2014-05-30","2014-06-30","2014-07-31","2014-08-29","2014-09-30","2014-10-31","2014-11-28","2014-12-31","2015-01-30","2015-02-27","2015-03-31","2015-04-30","2015-05-29","2015-06-30","2015-07-31","2015-08-31","2015-09-30","2015-10-30","2015-11-30","2015-12-31","2016-01-29","2016-02-29","2016-03-31","2016-04-29","2016-05-31","2016-06-30","2016-07-29","2016-08-31","2016-09-30","2016-10-31","2016-11-30","2016-12-30","2017-01-31","2017-02-28","2017-03-31","2017-04-28","2017-05-31","2017-06-30","2017-07-31","2017-08-31","2017-09-29","2017-10-31","2017-11-30","2017-12-29","
2018-01-31","2018-02-28","2018-03-30","2018-04-30","2018-05-31","2018-06-29","2018-07-31","2018-08-31","2018-09-28","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-29","2019-04-30","2019-05-31","2019-06-28","2019-07-31","2019-08-30","2019-09-30","2019-10-31"],"y":[0,-1.66927490871153,-1.8921308576481,0,0,-3.22186326726621,0,0,0,0,0,-2.24207826590552,0,-2.21337975627953,-1.87351644625297,0,-0.251088048208903,0,-0.937449490867948,0,-2.36508452535762,0,-0.717646120724014,0,0,0,0,-3.01209401384346,-0.219590620343502,0,0,0,0,0,0,0,-1.96173913043478,0,0,0,-1.88494651708844,-3.13102439672938,0,0,0,-3.87326220497899,-1.64805596663526,-6.04609807810683,-2.58426148358449,-1.38245404274399,0,0,-8.81368008550053,-0.898577902797315,0,-9.70549382236382,-16.6695352839931,-6.96065269028193,0,-8.21143617021276,-10.7448991911143,0,0,0,-0.626823732843396,0,0,0,-1.92253054266502,0,0,-3.63424264178033,0,0,0,-7.94545913643633,-5.62311419950626,0,-4.49805024031921,0,0,0,0,0,0,-0.420578295155838,0,-1.12145422561021,-2.17197924388436,-2.0004546487838,-5.49756437021574,-7.42104401898216,0,-0.406374501992035,0,0,0,0,-0.667566223989768,-6.00557660684922,0,0,0,0,-1.81982357435577,0,0,0,0,0,0,0,-1.85377791373509,0,-2.99922944698002,0,0,0,0,-3.52482538307434,0,0,0,0,0,-1.34375638667484,0,-1.83847341936126,0,0,-0.8011583011583,-2.96292692419967,0,-2.00797493591569,0,0,-2.48650184711566,0,-6.13447374673319,-3.01204819277108,0,0,-2.30964588624275,-4.97827046647373,-0.0830060582033498,0,0,0,-0.173942051086551,0,0,-0.496825834943415,-1.73370319001387,0,0,0,0,-0.308707235590133,0,0,0,0,0,0,0,0,0,0,-3.63604114934374,-3.12902632063317,0,0,0,0,0,0,-6.91042927903138,0,-9.33430074369671,0,0,0,0,-6.37711720291136,0,0,-1.67343576639882,0,0],"mode":"lines","fill":"tonexty","type":"scatter","name":"spy_monthly_returns","marker":{"color":"rgba(141,160,203,1)","line":{"color":"rgba(141,160,203,1)"}},"textfont":{"color":"rgba(141,160,203,1)"},"error_y":{"color":"rgba(141,160,203,1)"},"error_x":{"color":"rgba(141,160,203,1)"},"line":{"color":"rgba(141,160,203,1)"},"xaxis":"x","yaxis":"y","frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}

Click on the legend to isolate the chart of either the IPO portfolio or the benchmark and notice the much rougher history of the IPO portfolio. Our IPO portfolio had a large drawdown of around 26% in 2008 – would we have stuck with it?

I’ve been making my way through more of the courses over at Business Science U and one habit I’ve picked up is to wrap visualizations into functions. Let’s do that for the drawdown chart so that all we have to do is supply a year. This would be helpful in a Shiny application where we wanted to explore different years interactively.

drawdown_vis_fun <- function(start_year){

  start_date <- ymd(parse_date(start_year, format = "%Y"))

  spy_benchmark <- spy_benchmark %>%
    filter(date >= start_date)

  ipo_port <- paste(start_year, "_port_returns", sep = "")

  returns_each_year_ipo_portfolios %>%
    select(date, `ipo_port`) %>%
    left_join(spy_benchmark %>% select(date, spy_monthly_returns), by = "date") %>%
    pivot_longer(-date, names_to = "fund", values_to = "drawdown") %>%
    mutate(drawdown = case_when(drawdown > 0 ~ 0,
                                TRUE ~ drawdown),
           drawdown = drawdown * 100) %>%
    plot_ly(type = 'scatter', x = ~date, y = ~drawdown, color = ~fund,
            mode = 'lines', fill = 'tonexty') %>%
    layout(yaxis = list(ticksuffix = "%"))
}

Now, let’s pass the function a single year and examine the drawdown history.

drawdown_vis_fun("2007")

{"x":{"visdat":{"1c2d36aa5053":["function () ","plotlyVisDat"]},"cur_data":"1c2d36aa5053","attrs":{"1c2d36aa5053":{"x":{},"y":{},"mode":"lines","fill":"tonexty","color":{},"alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"scatter"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"yaxis":{"domain":[0,1],"automargin":true,"ticksuffix":"%","title":"drawdown"},"xaxis":{"domain":[0,1],"automargin":true,"title":"date"},"hovermode":"closest","showlegend":true},"source":"A","config":{"showSendToCloud":false},"data":[{"fillcolor":"rgba(102,194,165,0.5)","x":["2007-01-31","2007-02-28","2007-03-30","2007-04-30","2007-05-31","2007-06-29","2007-07-31","2007-08-31","2007-09-28","2007-10-31","2007-11-30","2007-12-31","2008-01-31","2008-02-29","2008-03-31","2008-04-30","2008-05-30","2008-06-30","2008-07-31","2008-08-29","2008-09-30","2008-10-31","2008-11-28","2008-12-31","2009-01-30","2009-02-27","2009-03-31","2009-04-30","2009-05-29","2009-06-30","2009-07-31","2009-08-31","2009-09-30","2009-10-30","2009-11-30","2009-12-31","2010-01-29","2010-02-26","2010-03-31","2010-04-30","2010-05-31","2010-06-30","2010-07-30","2010-08-31","2010-09-30","2010-10-29","2010-11-30","2010-12-31","2011-01-31","2011-02-28","2011-03-31","2011-04-29","2011-05-31","2011-06-30","2011-07-29","2011-08-31","2011-09-30","2011-10-31","2011-11-30","2011-12-30","2012-01-31","2012-02-29","2012-03-30","2012-04-30","2012-05-31","2012-06-29","2012-07-31","2012-08-31","2012-09-28","2012-10-31","2012-11-30","2012-12-31","2013-01-31","2013-02-28","2013-03-29","2013-04-30","2013-05-31","2013-06-28","2013-07-31","2013-08-30","2013-09-30","2013-10-31","2013-11-29","2013-12-31","2014-01-31","2014-02-28","2014-03-31","2014-04-30","2014-05-30","2014-06-30","2014-07-31","2014-08-29","2014-09-30","2014-10-31","2014-11-28","2014-12-31","2015-01-30","2015-02-27","2015-03-31","2015-04-30","2015-05-29","2015-06-30","2015-07-31","2015-08-31","2015-09-30","2015-10-30","2015-11-30","2015-12-31","2016-01-29","2016-02-29","2016-03-31","2016-04-29","2016-05-31","2016-06-30","2016-07-29","2016-08-31","2016-09-30","2016-10-31","2016-11-30","2016-12-30","2017-01-31","2017-02-28","2017-03-31","2017-04-28","2017-05-31","2017-06-30","2017-07-31","2017-08-31","2017-09-29","2017-10-31","2017-11-30","2017-12-29","2018-01-31","2018-02-28","2018-03-30","2018-04-30","2018-05-31","2018-06-29","2018-07-31","2018-08-31","2018-09-28","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-29","2019-04-30","2019-05-31","2019-06-28","2019-07-31","2019-08-30","2019-09-30","2019-10-31"],"y":[0,-0.373478121108783,-1.74466100562406,0,0,0,-0.646655280753494,-4.16698102659766,0,0,-7.84116144552191,0,-12.5261858003238,-0.761375544927934,-8.41546406112341,0,0,-7.72383002955002,-5.22477498462259,-1.16889306696306,-15.5643025440785,-27.2114725750787,-18.8429438575648,0,0,-6.82909690190625,0,0,0,0,0,0,0,-4.70128314867936,0,0,-1.23202686533598,0,0,0,-8.74464671654541,-2.67372549436515,0,-5.26967152950111,0,0,0,0,0,0,0,0,-4.97749445494336,-1.71666420595028,-3.05341639125853,-11.1533864834295,-15.454152423244,0,-1.54733863191958,-0.00294685957703544,0,0,0,-1.13750337137658,-11.3753635996192,0,-4.88436685786374,0,0,-2.53295151173353,-4.75769652330029,-1.71540133562261,0,-2.82787871140896,0,-0.659882872341055,0,-1.68496029635025,0,-1.66941637433361,0,0,0,0,-2.50439112167042,0,0,-3.94153379770423,-2.01805933213378,0,-1.38074311052729,0,-7.93139264231868,0,-1.09739890208397,-3.10374227192922,-5.55668433840695,0,0,0,-1.29480670415534,0,-4.52811783200635,-8.51422697587041,
-9.68873837040534,0,0,0,-12.1017998327581,-3.62317011175987,0,0,-0.45417973231382,-3.29728291359559,0,0,0,-7.5867765025703,0,0,0,0,0,0,0,0,0,-0.965209777102616,0,-1.64044438257266,0,0,0,-3.15737391942515,0,0,0,0,-0.798267109614914,0,-0.974214173223709,-9.89447442748346,-2.69109718592567,-13.6085358819367,0,0,0,0,0,0,0,-6.98395462729431,0,0],"mode":"lines","fill":"tonexty","type":"scatter","name":"2007_port_returns","marker":{"color":"rgba(102,194,165,1)","line":{"color":"rgba(102,194,165,1)"}},"textfont":{"color":"rgba(102,194,165,1)"},"error_y":{"color":"rgba(102,194,165,1)"},"error_x":{"color":"rgba(102,194,165,1)"},"line":{"color":"rgba(102,194,165,1)"},"xaxis":"x","yaxis":"y","frame":null},{"fillcolor":"rgba(141,160,203,0.5)","x":["2007-01-31","2007-02-28","2007-03-30","2007-04-30","2007-05-31","2007-06-29","2007-07-31","2007-08-31","2007-09-28","2007-10-31","2007-11-30","2007-12-31","2008-01-31","2008-02-29","2008-03-31","2008-04-30","2008-05-30","2008-06-30","2008-07-31","2008-08-29","2008-09-30","2008-10-31","2008-11-28","2008-12-31","2009-01-30","2009-02-27","2009-03-31","2009-04-30","2009-05-29","2009-06-30","2009-07-31","2009-08-31","2009-09-30","2009-10-30","2009-11-30","2009-12-31","2010-01-29","2010-02-26","2010-03-31","2010-04-30","2010-05-31","2010-06-30","2010-07-30","2010-08-31","2010-09-30","2010-10-29","2010-11-30","2010-12-31","2011-01-31","2011-02-28","2011-03-31","2011-04-29","2011-05-31","2011-06-30","2011-07-29","2011-08-31","2011-09-30","2011-10-31","2011-11-30","2011-12-30","2012-01-31","2012-02-29","2012-03-30","2012-04-30","2012-05-31","2012-06-29","2012-07-31","2012-08-31","2012-09-28","2012-10-31","2012-11-30","2012-12-31","2013-01-31","2013-02-28","2013-03-29","2013-04-30","2013-05-31","2013-06-28","2013-07-31","2013-08-30","2013-09-30","2013-10-31","2013-11-29","2013-12-31","2014-01-31","2014-02-28","2014-03-31","2014-04-30","2014-05-30","2014-06-30","2014-07-31","2014-08-29","2014-09-30","2014-10-31","2014-11-28","2014-12-31","2015-01-30","2015-02-27","2015-03-31","2015-04-30","2015-05-29","2015-06-30","2015-07-31","2015-08-31","2015-09-30","2015-10-30","2015-11-30","2015-12-31","2016-01-29","2016-02-29","2016-03-31","2016-04-29","2016-05-31","2016-06-30","2016-07-29","2016-08-31","2016-09-30","2016-10-31","2016-11-30","2016-12-30","2017-01-31","2017-02-28","2017-03-31","2017-04-28","2017-05-31","2017-06-30","2017-07-31","2017-08-31","2017-09-29","2017-10-31","2017-11-30","2017-12-29","2018-01-31","2018-02-28","2018-03-30","2018-04-30","2018-05-31","2018-06-29","2018-07-31","2018-08-31","2018-09-28","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-29","2019-04-30","2019-05-31","2019-06-28","2019-07-31","2019-08-30","2019-09-30","2019-10-31"],"y":[0,-1.96173913043478,0,0,0,-1.88494651708844,-3.13102439672938,0,0,0,-3.87326220497899,-1.64805596663526,-6.04609807810683,-2.58426148358449,-1.38245404274399,0,0,-8.81368008550053,-0.898577902797315,0,-9.70549382236382,-16.6695352839931,-6.96065269028193,0,-8.21143617021276,-10.7448991911143,0,0,0,-0.626823732843396,0,0,0,-1.92253054266502,0,0,-3.63424264178033,0,0,0,-7.94545913643633,-5.62311419950626,0,-4.49805024031921,0,0,0,0,0,0,-0.420578295155838,0,-1.12145422561021,-2.17197924388436,-2.0004546487838,-5.49756437021574,-7.42104401898216,0,-0.406374501992035,0,0,0,0,-0.667566223989768,-6.00557660684922,0,0,0,0,-1.81982357435577,0,0,0,0,0,0,0,-1.85377791373509,0,-2.99922944698002,0,0,0,0,-3.52482538307434,0,0,0,0,0,-1.34375638667484,0,-1.83847341936126,0,0,-0.8011583011583,-2.962926924199
67,0,-2.00797493591569,0,0,-2.48650184711566,0,-6.13447374673319,-3.01204819277108,0,0,-2.30964588624275,-4.97827046647373,-0.0830060582033498,0,0,0,-0.173942051086551,0,0,-0.496825834943415,-1.73370319001387,0,0,0,0,-0.308707235590133,0,0,0,0,0,0,0,0,0,0,-3.63604114934374,-3.12902632063317,0,0,0,0,0,0,-6.91042927903138,0,-9.33430074369671,0,0,0,0,-6.37711720291136,0,0,-1.67343576639882,0,0],"mode":"lines","fill":"tonexty","type":"scatter","name":"spy_monthly_returns","marker":{"color":"rgba(141,160,203,1)","line":{"color":"rgba(141,160,203,1)"}},"textfont":{"color":"rgba(141,160,203,1)"},"error_y":{"color":"rgba(141,160,203,1)"},"error_x":{"color":"rgba(141,160,203,1)"},"line":{"color":"rgba(141,160,203,1)"},"xaxis":"x","yaxis":"y","frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}

That’s all for today’s addendum.

I’m also going to be posting weekly code snippets on LinkedIn; connect with me there if you’re keen for some R finance stuff.

Thanks for reading and see you next time when we’ll tackle asset contribution to portfolio return!

Originally published at https://rviews.rstudio.com/2019/12/11/ipo-portfolios-and-a-benchmark/.


To leave a comment for the author, please follow the link and comment on their blog: R Views.


Shiny splash screen using modules and shinyjs


[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A while ago I was researching how to create a splash screen for a Shiny application. My gut feeling was that a package would readily be available for this. I was surprised to see that not much information turns up in a 10-minute Google search. The top StackOverflow question that comes up for ‘r shiny splash screen’ is this one, which recommends a modal dialog. I also discovered the waiter package, which is very cool.

In this article, I present a solution using Shiny modules and the shinyjs package. I am not sure the solution is scalable for a very complex application, but it should work for simple applications. The GIF file used in the demo application was downloaded from giphy.

The key ideas are very simple:

  • create a module for the UI and server which forms your main application
  • create another module to display an image as a splash screen
  • display the image for a few seconds and then hide it
  • the main application UI should be hidden by default and should be shown a few milliseconds after the splash screen has been hidden

The last two tasks are accomplished using functions from the shinyjs package.

For the demo application, I took the default code created by RStudio when you create a Shiny project. This is a simple visualization of the faithful dataset, where the number of bins in the histogram is controlled by a slider input. The code for this, with some modifications, is defined in the module mainapp.

library(shiny)
library(shinyjs)

mainapp_ui <- function(id) {
    ns <- NS(id)
    fluidPage(
        
        # Application title
        hidden(div(id = ns("app_title"),
                   titlePanel("Old Faithful Geyser Data"))),
        # Application UI elements
        hidden(
            fluidRow(id = ns("app_slider_plot"),
                column(
                    4,
                    sliderInput(ns("bins"),
                                "Number of bins:",
                                min = 1,
                                max = 50,
                                value = 30)
                ),
                column(
                    8,
                    plotOutput(ns("distPlot"))
                )
            )
        )
    )
}

mainapp_server <- function(input, output, session) {
    
    delay(ms = 3500, show("app_title"))
    delay(ms = 3800, show("app_slider_plot"))
    
    output$distPlot <- renderPlot({
        # generate bins based on input$bins from ui.R
        x    <- faithful[, 2]
        bins <- seq(min(x), max(x), length.out = input$bins + 1)
        
        # draw the histogram with the specified number of bins
        hist(x, breaks = bins, col = 'darkgray', border = 'white')
    })
}

I have converted the sidebarLayout into a single fluidRow divided into two columns. Note that the row has been provided an id of app_slider_plot and is hidden by default. The same is true of the titlePanel; as an id cannot be defined for a titlePanel, it has been wrapped in an HTML div. In the server function for this module, we use the show function from the shinyjs package to display the title and the application UI elements, but only after a delay of around three and a half seconds. As we see below, this is because the initial splash screen is shown for three seconds.

splash_ui <- function(id) {
    ns <- NS(id)
    div(id = ns("splash_screen"), img(src = "giphy.gif"),
        style = "text-align:center; padding-top:250px;")
}

splash_server <- function(input, output, session) {
    hide("splash_screen", anim = TRUE, animType = "fade", time = 3)
}

The module for the splash screen is really simple. It just loads a GIF image which is animated to provide the appearance that the app is being loaded. Creative people will also include some kind of logo and branding as part of this image. The code in the server function ensures that the image is hidden after three seconds. This action is also animated to make it slightly cooler.

The UI and server for the full Shiny application are really simple.

# Define UI for application that draws a histogram
ui <- fluidPage(
    useShinyjs(),
    fluidRow(splash_ui("splash_module")),
    fluidRow(mainapp_ui("mainapp_module"))
)

# Define server logic required to draw a histogram
server <- function(input, output) {
    ss <- callModule(splash_server, "splash_module")
    ma <- callModule(mainapp_server, "mainapp_module")
}

The useShinyjs function is required to enable shinyjs. Other than that, it simply calls the modules for the main application and the splash screen.
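For completeness, the app would then be launched with the usual shinyApp() call; this line is not shown in the snippets above, but it is the standard entry point for a single-file Shiny app.

# Standard Shiny entry point (assumed here, not shown in the original snippets)
shinyApp(ui = ui, server = server)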

The complete code is available on GitHub.


To leave a comment for the author, please follow the link and comment on their blog: Anindya Mozumdar.


NIMBLE short course, June 3-4, 2020 at UC Berkeley


[This article was first published on R – NIMBLE, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

We’ll be holding a two-day training workshop on NIMBLE, June 3-4, 2020 in Berkeley, California. NIMBLE is a system for building and sharing analysis methods for statistical models, especially for hierarchical models and computationally-intensive methods (such as MCMC and SMC).

The tutorial will cover:

  • the basic concepts and workflows for using NIMBLE and converting BUGS or JAGS models to work in NIMBLE.
  • overview of different MCMC sampling strategies and how to use them in NIMBLE.
  • writing new distributions and functions for more flexible modeling and more efficient computation.
  • tips and tricks for improving computational efficiency.
  • using advanced model components, including Bayesian non-parametric distributions (based on Dirichlet process priors), conditional auto-regressive (CAR) models for spatially correlated random fields, and reversible jump samplers for variable selection.
  • an introduction to programming new algorithms in NIMBLE.
  • calling R and compiled C++ code from compiled NIMBLE models or functions.

If participant interests vary sufficiently, the second half-day will be split into two tracks. One of these will likely focus on ecological models. The other will be chosen based on attendee interest from topics such as (a) advanced NIMBLE programming including writing new MCMC samplers, (b) advanced spatial or Bayesian non-parametric modeling, or (c) non-MCMC algorithms in NIMBLE such as sequential Monte Carlo. Prior to the workshop, we will survey attendee interests and adjust content accordingly.

If you are interested in attending, please pre-register to hold a spot at https://forms.gle/6AtNgfdUdvhni32Q6.  The form also asks if you are interested in a relatively cheap dormitory-style housing option.  No payment is necessary to pre-register. Fees to finalize registration will be $230 (regular) or $115 (student).  We hope to be able to offer student travel awards; more information will follow.

The workshop will assume attendees have a basic understanding of hierarchical/Bayesian models and MCMC, the BUGS (or JAGS) model language, and some familiarity with R.
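To give a flavour of the basic workflow listed above, here is a minimal, hedged sketch of fitting a toy model with NIMBLE; the model, data, and settings are illustrative only and are not course material.

library(nimble)

# A toy BUGS-style model: normal data with unknown mean and sd
code <- nimbleCode({
  for (i in 1:N) {
    y[i] ~ dnorm(mu, sd = sigma)
  }
  mu ~ dnorm(0, sd = 10)
  sigma ~ dunif(0, 10)
})

set.seed(1)
y <- rnorm(50, mean = 2, sd = 1)

# One-call interface: build, compile and run an MCMC
samples <- nimbleMCMC(code = code,
                      constants = list(N = 50),
                      data = list(y = y),
                      inits = list(mu = 0, sigma = 1),
                      monitors = c("mu", "sigma"),
                      niter = 5000, nburnin = 1000)
summary(samples)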


To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE.


Gold-Mining Week 15 (2019)


[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Week 15 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 15 (2019) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.



sampling the mean


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A challenge found on the board of the coffee room at CEREMADE, Université Paris Dauphine:

When sampling with replacement three numbers in {0,1,…,N}, what is the probability that their average is (at least) one of the three?

With a (code-golfed!) brute force solution of

mean(!(apply(a <- matrix(sample(0:n, 3e6, rep = T), 3), 2, mean) - apply(a, 2, median)))

producing a graph pretty close to 3N/(2(N+1)²) (which coincides with a back-of-the-envelope computation):
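For readers who prefer an un-golfed version, here is a sketch (the helper name p_hat is mine) that estimates the probability for a few values of N and sets it against the 3N/(2(N+1)²) approximation quoted above:

# The average of three numbers equals one of them exactly when it equals the median,
# so estimate P(mean == median) by simulation.
p_hat <- function(N, m = 1e4) {
  a <- matrix(sample(0:N, 3 * m, replace = TRUE), nrow = 3)
  mean(apply(a, 2, mean) == apply(a, 2, median))
}

Ns <- c(5, 10, 20, 50, 100)
cbind(N = Ns,
      simulated = sapply(Ns, p_hat),
      approximation = 3 * Ns / (2 * (Ns + 1)^2))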


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


Internal functions in R packages


[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

An R package can be viewed as a set of functions, of which only a part are exposed to the user. In this blog post we shall concentrate on the functions that are not exposed to the user, so-called internal functions: what are they, how does one handle them in one’s own package, and how can one explore them?

Internal functions 101

What is an internal function?

It’s a function that lives in your package, but that isn’t surfaced to the user. You could also call it an unexported function or a helper function, as opposed to exported, user-facing functions.

For instance, in the usethis package there’s a base_and_recommended() function that is not exported.

# doesn't work
library("usethis")
base_and_recommended()
## Error in base_and_recommended(): could not find function "base_and_recommended"
usethis::base_and_recommended()
## Error: 'base_and_recommended' is not an exported object from 'namespace:usethis'
# works
usethis:::base_and_recommended()
##  [1] "base"       "boot"       "class"      "cluster"    "codetools" 
##  [6] "compiler"   "datasets"   "foreign"    "graphics"   "grDevices" 
## [11] "grid"       "KernSmooth" "lattice"    "MASS"       "Matrix"    
## [16] "methods"    "mgcv"       "nlme"       "nnet"       "parallel"  
## [21] "rpart"      "spatial"    "splines"    "stats"      "stats4"    
## [26] "survival"   "tcltk"      "tools"      "utils"

As a user, you shouldn’t use unexported functions of another package in your own code.

Why not export all functions?

There are at least these two reasons:

  • In a package you want to provide your user an API that is useful and stable. You can vouch for a few functions that serve the package's main goals, are documented enough, and that you’d only change with great care if need be. If your package users rely on an internal function that you decide to ditch when re-factoring code, they won’t be happy, so only export what you want to maintain.

  • If all packages exposed all their internal functions, the user environment would be flooded and the namespace conflicts would be out of control.

Why write internal functions?

Why write internal functions instead of having everything in one block of code inside each exported function?

When writing R code in general there are several reasons to write functions and it is the same within R packages: you can re-use a bit of code in several places (e.g. an epoch converter used for the output of several endpoints from a web API), and you can give it a self-explaining name (e.g. convert_epoch()). Any function defined in your package is usable by other functions of your package (unless it is defined inside a function of your package, in which case only that parent function can use it).
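As a purely illustrative sketch of such a helper (the function name and behaviour are assumptions for this post, not taken from any particular package), assuming the API returns Unix epochs in seconds:

# Hypothetical internal helper: turn a Unix epoch (in seconds) into a POSIXct date-time
convert_epoch <- function(epoch) {
  as.POSIXct(epoch, origin = "1970-01-01", tz = "UTC")
}

convert_epoch(1576368000)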

Having internal functions also means you can test these bits of code on their own. That said, if you test internals too much, re-factoring your code will mean breaking tests.

To find blocks of code that could be replaced with a function used several times, you could use the dupree package whose planned enhancements include highlighting or printing the similar blocks.

When not to write internal functions?

There is a balance to be found between writing your own helpers for everything and only depending on external code. You can watch this excellent talk on the topic.

Where to put internal functions?

You could save internal functions used in one function only in the R file defining that function, and internal functions used in several other functions in a single utils.R file or specialized utils-dates.R, utils-encoding.R files. Choose a system that helps you and your collaborators find the internal functions easily; R will never have trouble finding them as long as they’re somewhere in the R/ directory. 😉

Another possible approach to helper functions when used in several packages is to pack them up in a package such as Yihui Xie’s xfun. So then they’re no longer internal functions. 😵

How to document internal functions?

You should at least add a few comments in their code as usual. Best practice recommended in the tidyverse style guide and the rOpenSci dev guide is to document them with roxygen2 tags like other functions, but to use #' @noRd to prevent manual pages from being created.

#' Compare x to 1
#' @param x an integer
#' @noRd
is_one <- function(x) {
  x == 1
}

The keyword @keywords internal would mean a manual page is created but not present in the function index. A confusing aspect is that you could use it for an exported, not internal function you don’t want to be too visible, e.g. a function returning the default app for OAuth in a package wrapping a web API.

#' A function rather aimed at developers
#' @description A function that does blabla, blabla.
#' @keywords internal
#' @export
does_thing <- function(){
  message("I am an exported function")
}

Explore internal functions

You might need to have a look at the guts of a package when wanting to contribute to it, or at the guts of several packages to get some inspiration for your code.

Explore internal functions within a package

Say you’ve started working on a new-to-you package (or resumed work on a long forgotten package of yours 😉). How to know how it all hangs together? You can use the same methods as for debugging code, exploring code is like debugging it and vice versa!

One first way to understand what a given helper does is looking at its code; from within RStudio there are some useful tools for navigating functions. You can then search for occurrences of its name across R scripts. These first two tasks are static code analysis (well, unless your brain really executes R code by reading it!). Furthermore, a non-static way to explore a function is to use browser() inside it or inside functions calling it.
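As a small, hypothetical illustration of that non-static approach, you could temporarily add a browser() call to the helper you are exploring and then trigger it:

# Hypothetical internal helper with a temporary browser() call added;
# execution pauses here whenever the function is hit, so you can inspect
# the arguments and step through the remaining lines interactively.
check_input <- function(x) {
  browser()
  stopifnot(is.numeric(x))
  x * 2
}

check_input(21)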

Another useful tool is the in-development pkgapi package. Let’s look at the cranlogs source code.

map <- pkgapi::map_package("/home/maelle/Documents/R-hub/cranlogs")

We can see all defined functions, exported or not.

str(map$defs)
## 'data.frame': 8 obs. of  7 variables:
##  $ name    : chr  "check_date" "cran_downloads" "cran_top_downloads" "cranlogs_badge" ...
##  $ file    : chr  "R/utils.R" "R/cranlogs.R" "R/cranlogs.R" "R/badge.R" ...
##  $ line1   : int  1 61 184 16 137 105 117 126
##  $ col1    : int  1 1 1 1 1 1 1 1
##  $ line2   : int  6 103 208 33 153 115 124 135
##  $ col2    : int  1 1 1 1 1 1 1 1
##  $ exported: logi  FALSE TRUE TRUE TRUE FALSE FALSE ...

We can see all calls inside the package code, to functions from the package and other packages.

str(map$calls)
## 'data.frame': 84 obs. of  9 variables:
##  $ file : chr  "R/badge.R" "R/badge.R" "R/badge.R" "R/badge.R" ...
##  $ from : chr  "cranlogs_badge" "cranlogs_badge" "cranlogs_badge" "cranlogs_badge" ...
##  $ to   : chr  "base::c" "base::match.arg" "base::paste0" "base::paste0" ...
##  $ type : chr  "call" "call" "call" "call" ...
##  $ line1: int  17 21 23 25 30 7 8 62 65 66 ...
##  $ line2: int  17 21 23 25 30 7 8 62 65 66 ...
##  $ col1 : int  38 14 14 16 3 14 14 35 8 17 ...
##  $ col2 : int  38 22 19 21 8 19 19 35 14 25 ...
##  $ str  : chr  "c" "match.arg" "paste0" "paste0" ...

We can filter that data.frame to only keep calls between functions defined in the package.

library("magrittr")
internal_calls <- map$calls[map$calls$to %in% glue::glue("{map$name}::{map$defs$name}"), ]
internal_calls %>%
  dplyr::arrange(to)
##           file           from                      to type line1 line2 col1
## 1 R/cranlogs.R cran_downloads    cranlogs::check_date call    69    69    7
## 2 R/cranlogs.R cran_downloads    cranlogs::check_date call    73    73    7
## 3 R/cranlogs.R        to_df_1 cranlogs::fill_in_dates call   123   123    3
## 4 R/cranlogs.R cran_downloads         cranlogs::to_df call   101   101    3
## 5 R/cranlogs.R          to_df       cranlogs::to_df_1 call   109   109    5
## 6 R/cranlogs.R          to_df       cranlogs::to_df_r call   107   107    5
##   col2           str
## 1   16    check_date
## 2   16    check_date
## 3   15 fill_in_dates
## 4    7         to_df
## 5   11       to_df_1
## 6   11       to_df_r

That table can help understand how a package works. One could combine that with a network visualization.

library("visNetwork")

internal_calls <- internal_calls %>%
  dplyr::mutate(to = gsub("cranlogs\\:\\:", "", to))

nodes <- tibble::tibble(id = map$defs$name,
                        title = map$defs$file,
                        label = map$defs$name,
                        shape = dplyr::if_else(map$defs$exported,
                                               "triangle",
                                               "square"))
edges <- internal_calls[, c("from", "to")]

visNetwork(nodes, edges, height = "500px") %>%
  visLayout(randomSeed = 42) %>%
  visNodes(size = 10)

In this interactive visualization one sees three exported functions (triangles), with only one that calls internal functions. Such a network visualization might not be that useful for bigger packages, and in our workflow is limited to pkgapi’s capabilities (e.g. not memoised functions)… but it’s at least quite pretty.

Explore internal functions across packages

Looking at helpers in other packages can help you write your own, e.g. looking at a package elegantly wrapping a web API could help you wrap another one elegantly too.

Bob Rudis wrote a very interesting blog post about his exploration of R packages “utility belts” i.e. the utils.R files. We also recommend our own blog post about reading the R source.

Conclusion

In this post we explained what internal functions are, and gave a few tips as to how to explore them within a package and across packages. We hope the post can help clear up a few doubts. Feel free to comment about further ideas or questions you may have.


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.


Looking back at 2019 and plans for 2020


[This article was first published on R on msperlin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’m just about to leave for my vacation and, as usual, I’ll write about the highlights of 2019 and my plans for the year to come. First, let’s talk about my work in 2019.

Highlights of 2019

The year of 2019 was not particularly fruitful in journal publications. I only had two: Accessing Financial Reports and Corporate Events with GetDFPData, published in RBfin, and A consumer credit risk structural model based on affordability: balance at risk, published in JCR. Both are papers I wrote back in 2017 and 2018, not new articles. Most of the papers I worked on this year will be published in 2020 or 2021.

This year, I’m mostly proud of the publication of my book about investing in the fixed income markets, Poupando e Investindo em Renda Fixa: Uma Abordagem Baseada em Dados. This was a special project, very different from the usual writing style of scientific research, and lots of fun to write. As a side effect, I learned a lot about the fixed income market in Brasil and it forced me to think objectively about a problem that is inherently subjective, personal finance and investing. Hopefully, in the future, I’ll write another book about the stock market and real estate investing, a topic that I’m also very interested in.

My main project this year is the work on the second edition of my R book Analyzing Financial and Economic Data with R. It took a major part of my working weeks and weekends, but it’s coming together. Soon I’ll be publishing it on Amazon. You can find more details about it in this previous post.

In the programming side, I wrote two new CRAN packages in 2019: simfinR and GetQuandlData. Both are featured in the new edition of my R book (soon to be published).

My blog posts in 2019

Let’s have a look at my blog posts so far.

my.blog.folder <- '~/Dropbox/11-My Website/www.msperlin.com-blog/content/post/'
post.files <- list.files(path = my.blog.folder, pattern = '.Rmd')
post.files
##  [1] "2017-01-01-First-post.Rmd"
##  [2] "2017-01-02-GetHFData.Rmd"
##  [3] "2017-01-15-CalculatingBetas.Rmd"
##  [4] "2017-01-30-Exams-with-dynamic-content.Rmd"
##  [5] "2017-02-13-R-and-Tennis-Players.Rmd"
##  [6] "2017-02-16-Writing-a-book.Rmd"
##  [7] "2017-03-05-Prophet-and_stock-market.Rmd"
##  [8] "2017-05-04-pafdR-is-out.Rmd"
##  [9] "2017-05-09-Studying-Pkg-Names.Rmd"
## [10] "2017-05-15-R-Finance.Rmd"
## [11] "2017-08-12-Switching-to-Linux.Rmd"
## [12] "2017-09-14-Brazilian-Yield-Curve.Rmd"
## [13] "2017-12-06-Package-GetDFPData.Rmd"
## [14] "2017-12-13-Serving-shiny-apps-internet.Rmd"
## [15] "2017-12-30-Looking-Back-2017.Rmd"
## [16] "2018-01-22-Update-BatchGetSymbols.Rmd"
## [17] "2018-03-16-Writing_Papers_About_Pkgs.Rmd"
## [18] "2018-04-22-predatory-scientometrics.Rmd"
## [19] "2018-05-12-Investing-Long-Run.Rmd"
## [20] "2018-06-12-padfR-ed2.Rmd"
## [21] "2018-06-29-BenchmarkingSSD.Rmd"
## [22] "2018-10-10-BatchGetSymbols-NewVersion.Rmd"
## [23] "2018-10-11-Update-GetLattesData.Rmd"
## [24] "2018-10-13-NewPackage-PkgsFromFiles.Rmd"
## [25] "2018-10-19-R-and-loops.Rmd"
## [26] "2018-10-20-Linux-and-R.Rmd"
## [27] "2018-11-03-NewBlog.Rmd"
## [28] "2018-11-03-RstudioTricks.Rmd"
## [29] "2019-01-08-Looking-Back-2018.Rmd"
## [30] "2019-01-12-GetDFPData-ver14.Rmd"
## [31] "2019-03-09-pafdR-promotion.Rmd"
## [32] "2019-03-10-pafdR-promotion_2.Rmd"
## [33] "2019-03-23-Bettina-Case.Rmd"
## [34] "2019-04-13-Parallel-BatchGetsymbols.Rmd"
## [35] "2019-04-15-GetBCBData.Rmd"
## [36] "2019-05-01-MeanVariance.Rmd"
## [37] "2019-05-17-R-in-Brazil.Rmd"
## [38] "2019-05-20-Lindy-Effect.Rmd"
## [39] "2019-07-01-ftp-shutdown.Rmd"
## [40] "2019-08-08-ftp-NOT-shutdown.Rmd"
## [41] "2019-10-01-new-package-GetQuandlData.Rmd"
## [42] "2019-10-12-support-GetDFPData-shiny.Rmd"
## [43] "2019-10-16-new-package-GetEdgarData.Rmd"
## [44] "2019-11-01-new-package-simfinR.Rmd"
## [45] "2019-11-25-feedback-TOC-afedR.Rmd"
## [46] "2019-12-02-dynamic-exercises-afedR.Rmd"
## [47] "2019-12-15-Looking-Back-2019.Rmd"

The blog started in January of 2017 and, over time, I wrote 47 posts, around 15.7 per year. Let’s get more information from the .Rmd files. I’ll write a function, read_blog_files, and use it for all post files.

read_blog_files <- function(f.in) {
  require(tidyverse)
  my.front.matter <- rmarkdown::yaml_front_matter(f.in)
  df.out <- data_frame(post.title = my.front.matter$title,
                       post.date = lubridate::ymd(my.front.matter$date),
                       post.month = as.Date(format(post.date, '%Y-%m-01')),
                       tags = paste0(my.front.matter$tags, collapse = ';'),
                       categories = paste0(my.front.matter$categories, collapse = ';'),
                       content = paste0(read_lines(f.in), collapse = ' '))
  return(df.out)
}

df.posts <- dplyr::bind_rows(purrr::map(post.files, read_blog_files))
## Warning: `data_frame()` is deprecated, use `tibble()`.## This warning is displayed once per session.
glimpse(df.posts)
## Observations: 47
## Variables: 6
## $ post.title  "My first post!", "Using R to download high frequency trad…
## $ post.date   2017-01-01, 2017-01-02, 2017-01-15, 2017-01-30, 2017-02-1…
## $ post.month  2017-01-01, 2017-01-01, 2017-01-01, 2017-01-01, 2017-02-0…
## $ tags        "about me", "R;GetHFData;B3;market microstructure;high fre…
## $ categories  "about me", "R;GetHFData;B3;market microstructure;high fre…
## $ content     "--- title: \"My first post!\" subtitle: \"A little bit ab…

First, let’s look at the frequency of posts over time.

df.posts <- df.posts %>%
  filter(post.date >= as.Date('2019-01-01'), 
         post.date <= as.Date('2020-01-01'))

print(ggplot(df.posts, aes(x = post.month)) + geom_histogram(stat = 'count') + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        labs(y = 'Number of posts', x = ''))
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Checking 2019’s plans

At the end of 2018, my plans for 2019 were:

New edition of “Analyzing Financial and Economic Data with R”
in progress!
Work on my new book: “Investing For the Long Term” (provisory title)
Done! The first idea was to write a book about investing in general. I soon realized I would not be able to complete such a task within one year. So, I decided to first write about the fixed income market and later, perhaps, write about stock markets and real estate.
Solidify my research agenda in Board Composition
In progress. I’ve got four papers in development, two already submitted and two in the pipeline.

Plans for 2020

Publish afedR (analyzing financial and economic data with R)
My expectation is to publish the book in the first months of the year. I believe it is quite doable, unless something unexpected happens.
Finish board papers
Also doable. The working papers are in good shape and should be submitted soon.
Start “personal finance project”
I’m not yet sure how to call it, but I’ve got a couple of ideas for creating a project all about helping people with their finances.

To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.


A Collection of Benchmarks in R


[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

When you write code in R, you might face the question: „Is there a faster way to do this?“. Over the years I worked at STATWORX, I have done a lot of little benchmarks to find an answer to this kind of question. Often, I just did a quick check to see if there is any time difference between two methods, used the faster one and moved on. Of course, I forgot about my tests over time and may have wondered twice about the same problem. To break this vicious circle, I created an overview of all the benchmarks I have done so far with the possibility to add more in the future. This overview can be found on my Github.
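As a minimal sketch of such a quick check, here is the kind of comparison meant above, using the range(x) versus max(x) - min(x) question that also comes up later in this post; the sample data is arbitrary:

library(microbenchmark)

# arbitrary test vector
x <- rnorm(1e6)

# two ways to compute the spread of x
microbenchmark(
  "diff(range(x))"  = diff(range(x)),
  "max(x) - min(x)" = max(x) - min(x),
  times = 100L
)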

Creating an overview of all results

The overview needed to cater to multiple purposes:

  • function as a quick lookup table (which is the fastest way to do a specific task)
  • show the alternatives that were tested
  • give an idea of what was tested

Since the tested functions are often not that complicated (e.g. range(x) vs max(x) - min(x)), the benchmarks I did so far mostly had two varying parameters (e.g., the size or the number of different values). After some feedback from two of my colleagues, I settled for this table:

DATE                | TEST                                              | COMMENT                | BEST       | TIME_FACTOR | BEST_RUNS | DETAILS | DURATION
2019-11-29 08:53:33 | Access a column in a data frame, table or tibble. | varying size of data   | $ tbl      | 66.9%       | 4/4       | link    | 00:00:06
2019-11-29 08:53:36 | assign with <- or =                               | varying size of vector | equal sign | 27.4%       | 6/6       | link    | 00:00:01

Since this is a work in progress, there is a good chance the format will change again in the future. But for now, this is shown in the table:

  • The DATE of the last time the benchmark was run.
  • A short description TEST of the benchmark.
  • In the COMMENTS I tried to give a hint of what the setups looked like.
  • The BEST option out of all tested alternatives compared by their mean time.
  • The TIME_FACTOR presents the mean time that can be saved with the best option compared with the mean of the alternatives over all grid setups. Note: The time factor can be negative if the best option is not the best in the cases where it takes more time. For these cases, have a look at the details and dependencies of the grid parameters.
  • BEST_RUNS is the number of cases where the BEST solution was actually the best one, out of all the different varying setups that were used (e.g. sample size).
  • DURATION is the time the whole benchmark with all setups took.

Making the benchmark setup multi-usable

As I said before, I planned to make the overview extendable for new benchmarks in the future. Therefore, I created some helper functions and templates to make it easier to include new benchmarks. The main parts for this were:

  • a template folder and script for new benchmarks
  • a function that saves the result in my desired output
  • a function that creates the overview by reading in all existing results
  • a script that runs all benchmarks.

For adding a new benchmark, I have to copy the template folder and include the new setup I want to test. The save_benchmark() function will create the same output as for the previous benchmarks and the update_bench_overview() function will add it to the overview.

The main issue is the visualization of different grid parameters and their results. The good thing is that if I get an idea on how to improve this visualization, I could add it to save_benchmark() and rerun all benchmarks with the run_all_bench.R script. At the moment, a plot for each grid parameter is created, which indicates how the change influenced the timing. Also, the summaries for each run are shown, so one can see what exactly is going on.

How to set up a new benchmark

The template for further benchmarks has different sections that can be easily adjusted. Since this is a work in progress, it might change in the future. So if you have any good ideas or think I missed something, let me know and raise an issue on my Github.

It all starts with settings

There are three libraries I need for my functions to run. If the next benchmark needs other packages, I can add them here.

# these are needed
library(microbenchmark)
library(helfRlein)
library(data.table)

source("functions/save_benchmark.R")
# add more here

The next step is to describe the benchmark. Where are the results saved? What is the benchmark all about? What parameters are changing? All this information is later used to create the plots and tables to make it more understandable.

# test setup --------------------------------------------------------------

# folder for results
folder <- "benchmarks/00_template_folder/"

# test description
description <- "A short description of what is tested."

# number of repetitions
reps <- 100L

comments <- "what parameters changed"

start_time <- Sys.time()

The more parameters, the merrier

How valid are the benchmark results? The more different settings it was tested in, the better the generalization. Is there maybe even a dependency, which is the best alternative? That can all be set up in this section, where you can define the different grid settings. I’d advise you to use variable names that can easily be understood, e.g., number_of_rows, unique_values, or sample_size. These names are also used in the plots at the end – so choose wisely!

# grid setup --------------------------------------------------------------
# if there are different values to test
grid <- as.data.table(expand.grid(
  param_1 = 10^c(2:3),
  param_2 = c(5, 10, 20)))

result_list <- as.list(rep(NA, dim(grid)[1]))
best_list <- as.list(rep(NA, dim(grid)[1]))

The benchmark core

Looping over all grid settings, creating the starting values for each run, and adding all alternatives functions – this is the main part of the function: the benchmark itself.

for (i in c(1:nrow(grid))) {
  # i <- 1
  i_param_1 <- grid[i, param_1]
  i_param_2 <- grid[i, param_2]
  
  # use grid parameters to define tested setup
  x <- rnorm(n = i_param_1, mean = i_param_2)
  
  tmp <- microbenchmark(
    "Alternative 1" = mean(x),
    "Alternative 2" = sum(x) / length(x),
    times = reps,
    control = list(warmup = 10L),
    unit = "ms")
  
  #tmp <- data.table(summary(tmp), i = grid[i, ])
  result_list[[i]] <- tmp
  
  # select best by mean
  tmp_sum <- summary(tmp)
  best_list[[i]] <- as.character(tmp_sum$expr[tmp_sum$mean == min(tmp_sum$mean)])
}

All that is not saved will be lost

During all the previous steps, the intermediate results are stored in lists, which are the input values for the save_benchmark() function. As mentioned before, it creates tables for each benchmark run and plots with an overview of the effects of each grid parameter. Lastly, it updates the main README file with the newest results.

## saving all data
save_benchmark(result_list = result_list,
               best_list = best_list,
               folder = folder,
               start_time = start_time,
               description = description,
               grid = grid,
               reps = reps,
               comments = comments)

What do the results look like

After running the benchmark, a new README file is automatically created. This file contains an overview of the tested alternatives (as you named them), the used grid parameters, plots with the impact of these grid parameters, and the tabled summary of every single result.

benchmark-filter-selection

For example, here, you can see that the number of unique values has a positive effect (faster) on the time it takes to filter, but the number of rows has a negative impact (slower).

If you are interested in not only the overview but the actual data, have a look at result_list.rds. This list contains all results of microbenchmark() for each grid combination.

The last two created files are last_result.rds and log_result.txt. The first is used to create the current overall README.md, and the second is just a logfile with all previous results.

Ideas for further benchmarks

Do you have any thoughts on what we should benchmark next? Or did we maybe forget an alternative? Then raise an issue on my GitHub. If you can think of a method to better visualize the results, feel free to contact me. I welcome any feedback!

About the author

Jakob Gepp

Numbers were always my passion and as a data scientist and a statistician at STATWORX I can fulfill my nerdy needs. Also I am responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

Sign Up Now!


The post A Collection of Benchmarks in R first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers | STATWORX.


front-line house democrats: a quick guide


[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

So, with an impeachment vote nigh and the 2020 general elections slowly approaching, there is lots of talk about the 31 House Democrats that represent congressional districts won by the sitting president in 2016. Here, we present a quick/simple/R-based investigation into the composition of this group of Democrats from several different perspectives. We also consider the 43 congressional districts that flipped Democratic in the 2018 midterms.

Pundits constantly cite these districts and count them in different ways; and I am constantly trying to recreate these counts. As open source federal election data can be a bit of a mess, this is an attempt to organize & collate some data sources.

House & presidential election returns

The R data package uspoliticalextras includes federal election returns from a couple of sources, namely the Daily Kos and the MIT Election Data and Science Lab. I use the package as a bit of a cache for US political data; it is available via GitHub. The focus here will be on the last Presidential election (2016) & the last two House races (2016 & 2018).

library(tidyverse)
# devtools::install_github("jaytimm/uspoliticalextras")

pres <- uspoliticalextras::uspol_dkos_returns_pres_cd

house <- uspoliticalextras::uspol_medsl_returns_house_cd %>%
  filter(year > 2015) %>%
  mutate(party = ifelse(party == 'Independent', 'Republican Party', party)) # Amash

Trump margins in 2016

Number of congressional districts carried by DJT & HRC.

pres %>%
  filter(year == 2016) %>%
  group_by(candidate) %>%
  count() %>%
  knitr::kable()
candidate | n
Clinton   | 205
Trump     | 230

Trump margins by congressional district in 2016.

library(sf)

to_map <- uspoliticalextras::uspol_dkos_equalarea_sf$hex_cds %>%
  select(GEOID) %>%
  left_join(pres %>% 
              filter(year == 2016) %>%
              mutate(trump_margin = republican - democrat),
            by = 'GEOID')

to_map %>%
  ggplot() + 
  geom_sf(aes(fill = trump_margin),
          color = 'white') +
  geom_sf(data = uspoliticalextras::uspol_dkos_equalarea_sf$hex_states, 
          fill = NA, 
          show.legend = F,
          color = "#5a5c5b", 
          lwd = .5) +
  ggsflabel::geom_sf_text(data = uspoliticalextras::uspol_dkos_equalarea_sf$hex_cds,
                          aes(label = district_code), 
                          size = 2,
                          color = 'black') +
  scale_fill_distiller(palette = "RdBu", direction = -1) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = 'none') +
  labs(title = "2016 Trump Margins by Congressional District",
       caption = 'Source: Daily Kos')

Flipped House Districts from 2016 to 2018

Congressional districts that elected a Republican representative in 2016 and a Democratic representative in 2018.

house_flips <- house %>%
  select(GEOID, congress, party) %>%
  spread(congress, party) %>%
  left_join(house %>% 
              filter(year == 2018) %>%
              mutate(house_rep_margin = round(republican - democrat, 1)) %>%
              select(GEOID, state, district_code, candidate, house_rep_margin) %>% 
              rename(house_rep = candidate))

Some corrections/amendments for Pennsylvania districts per 2019 redistricting.

pa16_rs <- c('4201', '4205', '4206',
             '4207', '4209', '4210',
             '4211', '4212', '4213', '4215',
             '4216', '4217')

pa16_ds <- c('4202', '4203', '4204',
             '4208', '4214',
             '4218')

pa18_rs <- c('4203', '4204', '4205', '4218')
pa18_ds <- c('4213', '4214')
house_flips1 <- house_flips %>%
  mutate(`115` = ifelse(GEOID %in% pa16_rs, 'Republican Party', `115`),
         `115` = ifelse(GEOID %in% pa16_ds, 'Democratic Party', `115`)) %>%
  mutate(house_flip = paste0(`115`, ' | ', `116`)) %>%
  left_join(to_map %>% 
              sf::st_drop_geometry() %>% 
              select(GEOID, candidate, trump_margin)) %>%
  rename(Pres16 = candidate) %>%
  mutate(`116` = ifelse(GEOID %in% pa18_rs, 'Republican Party', `116`),
         `116` = ifelse(GEOID %in% pa18_ds, 'Democratic Party', `116`)) %>%  
  mutate(Pres16_House18 = paste0(Pres16, ' | ', `116`))

Democrats netted a total of 40 seats in the midterm elections in 2018. The numbers & districts presented below align with those presented on Ballotpedia.

house_flips1 %>%
  group_by(house_flip) %>%
  count() %>%
  knitr::kable()
house_flip                             n
Democratic Party | Democratic Party  192
Democratic Party | Republican Party    3
Republican Party | Democratic Party   43
Republican Party | Republican Party  197

House Representatives from flipped districts:

house_flips1 %>%
  filter(house_flip %in% 
           c('Democratic Party | Republican Party',
             'Republican Party | Democratic Party')) %>%
  mutate(house_flip = ifelse(grepl('^D', house_flip), 'D -> R', 'R -> D')) %>%
  select(house_flip, state, district_code, house_rep) %>%
  DT::datatable(rownames = FALSE)

[Interactive table: the 46 flipped districts (house_flip, state, district_code, house_rep); 43 flipped R -> D and 3 flipped D -> R.]

The 31 House Democrats in Trump-supportive districts

The table below summarizes how districts voted in the 2016 presidential election and House elections in 2018. Again, 31 House Democrats represent congressional districts that Trump won in 2016’s presidential election. In contrast, only three Republicans represent districts carried by HRC. Note: Numbers & districts align with those presented in this article from The Hill.

house_flips1 %>%
  group_by(Pres16_House18) %>%
  count() %>%
  #mutate(n = formattable::color_tile('white', 'steelblue')(n)) %>%
  knitr::kable()
Pres16_House18                 n
Clinton | Democratic Party   202
Clinton | Republican Party     3
Trump | Democratic Party      31
Trump | Republican Party     199

The 31 Democratic lawmakers representing Trump won districts include:

house_flips1 %>%
  filter(Pres16_House18 == 'Trump | Democratic Party') %>%
  select(state, district_code, house_rep) %>%
  DT::datatable(rownames = FALSE)

[Interactive table: the 31 Democratic representatives of Trump-won districts (state, district_code, house_rep).]

A quick geographical perspective

to_map2 <- uspoliticalextras::uspol_dkos_equalarea_sf$hex_cds %>%
  select(GEOID) %>%
  left_join(house_flips1, by = 'GEOID')

to_map2 %>%
  ggplot() + 
  geom_sf(aes(fill = Pres16_House18),
          alpha = .75,
          color = 'white') +
  geom_sf(data = uspoliticalextras::uspol_dkos_equalarea_sf$hex_states, 
          fill = NA, 
          show.legend = F,
          color = "black", 
          lwd = .5) +
  ggsflabel::geom_sf_text(data = uspoliticalextras::uspol_dkos_equalarea_sf$hex_cds,
                          aes(label = district_code), 
                          size = 2,
                          color = 'black') +
  ggthemes::scale_fill_stata() +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = 'bottom') +
  labs(title = "2016 Presidential Support & 2018 House Representative",
       subtitle = 'By Congressional District')

The 13 House Democrats in solid Trump districts

Of the 31, 13 Democrats represent districts Trump carried by more than 6 points.

house_flips1 %>%
  filter(Pres16_House18 == 'Trump | Democratic Party' & trump_margin > 6) %>%
  select(state, district_code, house_rep, trump_margin, house_rep_margin) %>%
  arrange(desc(trump_margin)) %>%
  knitr::kable()
state          | district_code | house_rep              | trump_margin | house_rep_margin
Minnesota      | 7             | PETERSON, Collin Clark | 30.8         | -4.3
New York       | 22            | BRINDISI, Anthony      | 15.5         | -1.8
Oklahoma       | 5             | HORN, Kendra           | 13.4         | -1.4
South Carolina | 1             | CUNNINGHAM, Joe        | 13.1         | -1.4
Maine          | 2             | GOLDEN, Jared          | 10.3         | -1.2
New Mexico     | 2             | TORRES SMALL, Xochitl  | 10.2         | -1.9
Pennsylvania   | 17            | LAMB, Conor            | 10.1         | -12.5
New York       | 11            | ROSE, Max              | 9.8          | -6.5
New York       | 19            | DELGADO, Antonio       | 6.8          | -5.2
Michigan       | 8             | SLOTKIN, Elissa        | 6.7          | -3.8
Utah           | 4             | McADAMS, Ben           | 6.7          | -0.3
Virginia       | 7             | SPANBERGER, Abigail    | 6.5          | -1.9
New Jersey     | 3             | KIM, Andy              | 6.2          | -1.3

Voting patterns in presidential elections

Counts of how districts voted in the last three presidential elections are presented below.

house_flips2 <- house_flips1 %>%
  left_join(pres %>% 
              group_by(GEOID) %>%
              summarize(pres_lineage = paste0(candidate, collapse = ' | ')) %>%
              ungroup())

house_flips2 %>%
  group_by(pres_lineage) %>%
  count() %>%
  knitr::kable()
pres_lineage                    n
McCain | Obama | Trump          1
McCain | Romney | Clinton       8
McCain | Romney | Trump       184
Obama | Obama | Clinton       190
Obama | Obama | Trump          20
Obama | Romney | Clinton        7
Obama | Romney | Trump         25

Voting patterns for the 31 Trump-House Dem districts

12 out of the 31 Dem-Trump districts have voted for Republican presidential candidates in the last three elections, i.e., McCain-Romney-Trump districts.

house_flips2 %>%
  filter(Pres16_House18 == 'Trump | Democratic Party') %>%
  group_by(pres_lineage) %>%
  count() %>%
  knitr::kable()
pres_lineage              n
McCain | Obama | Trump    1
McCain | Romney | Trump  12
Obama | Obama | Trump    14
Obama | Romney | Trump    4

Representatives for these twelve districts are presented below.

house_flips2 %>%
  filter(Pres16_House18 == 'Trump | Democratic Party' & 
           pres_lineage == 'McCain | Romney | Trump') %>%
  select(state, district_code, house_rep) %>%
  knitr::kable()
state          | district_code | house_rep
Arizona        | 1             | O’HALLERAN, Thomas C.
Georgia        | 6             | McBATH, Lucy
Minnesota      | 7             | PETERSON, Collin Clark
New Jersey     | 5             | GOTTHEIMER, Josh S.
New Jersey     | 11            | SHERRILL, Mikie
New Mexico     | 2             | TORRES SMALL, Xochitl
New York       | 22            | BRINDISI, Anthony
Oklahoma       | 5             | HORN, Kendra
South Carolina | 1             | CUNNINGHAM, Joe
Utah           | 4             | McADAMS, Ben
Virginia       | 2             | LURIA, Elaine
Virginia       | 7             | SPANBERGER, Abigail

The 5 House Democrats that should probably vote against impeachment

The table below lists the five House Reps representing districts that have supported Republican presidential candidates in the last three elections, with 2016 Trump margins greater than ten points.

house_flips2 %>%
  filter(Pres16_House18 == 'Trump | Democratic Party' & 
           trump_margin > 10 &
           pres_lineage == 'McCain | Romney | Trump') %>%
  select(state, district_code, house_rep, pres_lineage, trump_margin) %>%
  arrange(desc(trump_margin)) %>%
  knitr::kable()
state           district_code  house_rep               pres_lineage             trump_margin
Minnesota       7              PETERSON, Collin Clark  McCain | Romney | Trump          30.8
New York        22             BRINDISI, Anthony       McCain | Romney | Trump          15.5
Oklahoma        5              HORN, Kendra            McCain | Romney | Trump          13.4
South Carolina  1              CUNNINGHAM, Joe         McCain | Romney | Trump          13.1
New Mexico      2              TORRES SMALL, Xochitl   McCain | Romney | Trump          10.2

Front-line Freshmen House Democrats

Via the Rvoteview package, we identify 88 freshman House Representatives, 56 of which are Democrats.

house_vv_freshies <- Rvoteview::member_search(chamber = 'House', congress = 116) %>%
  filter(congresses == 'c(116, 116)' & party_name != 'Independent') %>%
  select(bioname, party_name) %>%
  rename(house_rep = bioname) 

house_vv_freshies %>%
  group_by(party_name) %>%
  count() %>%
  knitr::kable()
party_name       | n
Democratic Party | 56
Republican Party | 32

Of the 31, then, 22 are freshmen. So, a pretty vulnerable group.

house_flips3 <- house_flips2 %>%
  left_join(house_vv_freshies %>% 
              mutate(is_fresh = 'Freshman') %>% 
              select(house_rep, is_fresh)) %>%
  mutate(is_fresh = ifelse(is.na(is_fresh), 'Not', is_fresh))

house_flips3 %>%
  group_by(is_fresh, Pres16_House18) %>%
  count() %>%
  #filter(is_fresh == 'Y') %>%
  spread(is_fresh, n) %>%
  janitor::adorn_totals(where = c('row', 'col')) %>%
  knitr::kable()
Pres16_House18              Freshman  Not  Total
Clinton | Democratic Party        35  167    202
Clinton | Republican Party        NA    3      3
Trump | Democratic Party          22    9     31
Trump | Republican Party          29  170    199
Total                             86  349    435

Data set

The data set compiled in this post/guide is available here.
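To work with it locally, something along these lines should do (a minimal sketch; the file name house_flips3.csv is hypothetical, use whatever name the downloaded file has):

library(readr)
house_flips3 <- read_csv("house_flips3.csv")  # hypothetical file name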


To leave a comment for the author, please follow the link and comment on their blog: Jason Timm.


An accidental side effect of text mining


[This article was first published on Posts | SERDAR KORUR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Reprinted from https://towardsdatascience.com/an-accidental-side-effect-of-text-mining-4b43f8ee1273. As I read on my Kindle, I highlight the passages that I like so that I can re-read them later. These annotations are stored on my Kindle and backed up at Amazon. Over time they accumulated and became a kind of data.

I came up with the idea of analyzing all the text I had highlighted on my Kindle, to figure out what kind of content I was most likely to highlight.

The plan was to use text mining and sentiment analysis, generate insights, and compare them to my actual opinions of those books, so I could get a first-hand test of how useful text mining is. With that knowledge, I can be more confident when I apply the method to a business problem.

Now, to be objective, I will create an independent character called “the bat”. He is a hacker, not the bookworm type, but his unfortunate fate gave him a challenge.

His next mission is to hack my data and analyze it to gather insights for selling me more books. The room was a mess when he entered. As he plugged the USB drive into his laptop, he barely heard the radio, which was still on.

The reporter:

From the moment we switch on our cell phones in the early morning to deal with the flood of information, in the form of a WhatsApp or Facebook message or a tweet, until we fall asleep at night writing or reading a product review, we leave breadcrumbs of our personal tastes on the internet.

Many businesses use this unstructured data to drive their sales through better marketing, targeted product recommendations, or customer segmentation…

He gritted his teeth when he saw the data: 21,000 lines of text from 28 books. His first encounter was the book “Mindset” by Carol Dweck, where she introduces the concept of the growth mindset.

The long lines of text made him tired; he did not notice how the time passed. He decided to learn text mining. This might sound like quite an investment, but now he is a man of growth.

He continued learning the R packages needed for text mining; he did not like the package name tidytext, but he was slowly losing his prejudices. It was a long night, and he fell asleep at his table as the sun slowly rose. It was lighting up my back garden, where I could glance from time to time at the trees painted by the overnight snow ⛄. With no idea how things were going in the other part of town, I continued to read and highlight on my Kindle as I sipped from a glass of red wine 🍷.

This is what the exported Kindle highlights look like.

The hacker’s notes

He noted down each step of his text mining plan carefully. Let me help you go through them.

Reading and parsing the text file

# Use readLines function to parse the text file
highlights <- readLines("posts_data/Kindle_highlights_Serdar.Rmd", encoding = "UTF-8")

# Create a dataframe where each row is a line from the text
df <- data.frame(highlights)

# Packages
library(tidyverse)   # includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(tidytext)
library(wordcloud2)

In every data science project there is some sort of data preparation. Stop words are generally the most common words in a language and are usually filtered out before processing the text data.

Let’s look at the stop_words dataset from the tidytext package. Since it is a long list of words (>1K), I will print every 50th word as an example.

data(stop_words)

# print every 50th word
stop_words_small <- stop_words[seq(1, nrow(stop_words), 50), ]
stop_words_small %>% print(n = 50)
## # A tibble: 23 x 2
##    word       lexicon 
##    <chr>      <chr>   
##  1 a          SMART   
##  2 at         SMART   
##  3 contain    SMART   
##  4 few        SMART   
##  5 hers       SMART   
##  6 last       SMART   
##  7 nine       SMART   
##  8 presumably SMART   
##  9 some       SMART   
## 10 they'd     SMART   
## 11 very       SMART   
## 12 without    SMART   
## 13 what       snowball
## 14 they'll    snowball
## 15 during     snowball
## 16 again      onix    
## 17 but        onix    
## 18 finds      onix    
## 19 if         onix    
## 20 much       onix    
## 21 parted     onix    
## 22 since      onix    
## 23 under      onix
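For reference (my own check, not from the original post), the full stop word list has 1,149 entries across the three lexicons, which is consistent with the 23 rows above when sampling every 50th entry:

nrow(stop_words)
# [1] 1149
stop_words %>% count(lexicon)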

Sometimes even a small character can have a big influence on your results. Looking carefully, I see that stop_words uses straight single quotes (') whereas the text file uses curly apostrophes (’).

e.g. they’ll in stop_words

And how the word they’ll appears in the text:

Yellow highlight | Page: 200 Memories are continually revised, along with the meaning we derive from them, so that in the future they’ll be of even more use.

This mismatch will prevent some of the stop_words, such as they'll, don't, and can't, from being filtered out. To prevent this, we have to replace the apostrophes.

He quickly spotted that.

The str_replace_all() function from the stringr package will do that.

df$highlights <- str_replace_all(df$highlights, "’", "'")
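As a quick sanity check (my own addition, not from the original post), we can confirm that no curly apostrophes are left and that the contractions now match the stop word list:

# should return 0 after the replacement
sum(str_detect(df$highlights, "’"))

# the straight-apostrophe form is the one listed in stop_words
"they'll" %in% stop_words$word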

Now the text is ready for the frequency analysis. Words in a text mining project are called tokens. We can split the text into single words with the unnest_tokens() function from the tidytext package, filter out the stop_words, and count.

df <- df %>%
  unnest_tokens(word, highlights) %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("highlights", "highlight", "page",
                      "location", "yellow", "pink", "orange", "blue"))

He also added some extra words here that frequently appear in the Kindle highlights output.

The dplyr package functions are very useful for grouping and counting the words in the lists that are created.

top_kindle_highlights <- df %>%
  group_by(word) %>%
  count() %>%
  arrange(desc(n))

He noted down his first insight: the 10 most frequent words from my Kindle highlights.

top_kindle_highlights
## # A tibble: 12,433 x 2
## # Groups:   word [12,433]
##    word         n
##    <chr>    <int>
##  1 people     592
##  2 story      340
##  3 life       318
##  4 time       309
##  5 mind       213
##  6 change     212
##  7 feel       211
##  8 world      171
##  9 person     170
## 10 habits     157
## # ... with 12,423 more rows

Wordclouds are a good alternative to long lists of words for visualizing text data. The wordcloud2 package allows you to use any image as a mask for the shape of the cloud.

# 'bat' holds the path to the silhouette image used as the mask (figPath);
# color_vector() and data are defined elsewhere in the original post
wordcloud2(top_kindle_highlights, figPath = bat, size = 1,
           backgroundColor = "white", color = color_vector(data$freq))
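If you just want a quick cloud without an image mask, a minimal self-contained call (my own sketch, not from the original post) is enough, since wordcloud2() reads the first two columns as word and frequency:

wordcloud2(head(top_kindle_highlights, 200), size = 1, backgroundColor = "white")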

Some ideas started to take shape in his mind. He thought that whoever made those highlights is someone interested in storytelling, writing and good communication, good habits, and people; someone who wants to change his life in a positive way. He was becoming more and more interested in the books.

He wanted to dig deeper.

Bigram Analysis

Single words are a good starting point for seeing what the books are about, but they are not very informative without context. Frequency analysis can also be performed on pairs of words (bigrams) to measure how often they occur in the text. This allows us to capture finer details.

To do this, he combined the unnested single tokens isolated above back into a continuous text and then performed the bigram analysis. You can use the str_c() function from the stringr package to concatenate the single words.

# Recreate the df
df <- data.frame(highlights)
df$highlights <- str_replace_all(df$highlights, "’", "'")

df <- df %>%
  unnest_tokens(word, highlights) %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("highlights", "highlight", "page",
                      "location", "yellow", "pink", "orange", "blue",
                      "export", "hidden", "truncated", "kindle", "note", "limits"))

df_com <- str_c(df$word, " ")
df_com <- data.frame(df_com)

Let’s split the text into bigrams and count the most common two-word pairs (with n = 3 and n_min = 2, unnest_tokens() actually returns both bigrams and trigrams).

df_bigram <- df_com %>%
  unnest_tokens(bigram, df_com, token = "ngrams", n = 3, n_min = 2)

top_bigrams <- df_bigram %>%
  group_by(bigram) %>%
  count() %>%
  arrange(desc(n)) %>%
  print(n = 20)
## # A tibble: 107,317 x 2
## # Groups:   bigram [107,317]
##    bigram                    n
##    <chr>                 <int>
##  1 body language            30
##  2 behavior change          23
##  3 crucial conversations    19
##  4 fixed mindset            19
##  5 growth mindset           19
##  6 subconscious mind        19
##  7 told story               18
##  8 type person              17
##  9 object desire            16
## 10 system 1                 16
## 11 pay attention            15
## 12 bad habits               13
## 13 law behavior             13
## 14 law behavior change      13
## 15 social media             13
## 16 inciting incident        12
## 17 people feel              12
## 18 subject matter           12
## 19 human nature             11
## 20 objective criteria       11
## # ... with 1.073e+05 more rows
# And visualize them on a plot
top <- top_bigrams[1:25, ]
top %>%
  ungroup() %>%
  mutate(bigram = fct_reorder(bigram, n)) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col() +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "none", text = element_text(size = 18))

For example, in the top 10 most frequent words table above, the 6th word was change, but we did not know what the change was about. Here we see that one of the most common bigrams is behavior change, which makes more sense. It gets even better if we look at each book individually.

We can also do what we did for the whole document for the highlights of each individual book.

But how can we capture them individually?

Let’s first look at the text once more. Before each book, the line "Your Kindle Notes For:" appears.

Let’s find out the line numbers for the beginning and the end of each book and use those indexes for fishing out each book.

We will reuse the data frame df we created above. The str_which() function returns the index numbers of the lines that contain a given pattern. In the last step, capturing the text between two consecutive indexes will give us the book that sits between them.

# Since I modified df above, I will recreate it again
df <- data.frame(highlights)
df$highlights <- str_replace_all(df$highlights, "’", "'")

# Getting the index number for each book
indexes <- str_which(df$highlights, pattern = fixed("Your Kindle Notes For"))
book_names <- df$highlights[indexes + 1]
indexes <- c(indexes, nrow(df))

# Create an empty list
books <- list()

# Now the trick: capture each of the 28 books separately in a list
# (the -1 applies to the whole sequence, so each slice starts one line before the marker)
for (i in 1:(length(indexes) - 1)) {
  books[[i]] <- data.frame(df$highlights[(indexes[i]:indexes[i + 1] - 1)])
  colnames(books[[i]]) <- "word_column"
  books[[i]]$word_column <- as.character(books[[i]]$word_column)
}

Let’s check whether it worked; for example, you can look at the 5th book on our list.

head(books[[5]])
##                                           word_column
## 1                                                    
## 2                              Your Kindle Notes For:
## 3 Bird by Bird: Some Instructions on Writing and Life
## 4                                         Anne Lamott
## 5             Last accessed on Saturday July 27, 2019
## 6                         75 Highlight(s) | 4 Note(s)
head(books[[15]])
##                                                  word_column
## 1                                                           
## 2                                     Your Kindle Notes For:
## 3 Getting to Yes: Negotiating an agreement without giving in
## 4                               Roger Fisher and William Ury
## 5                 Last accessed on Saturday November 3, 2018
## 6                               266 Highlight(s) | 3 Note(s)

Now we have the individual books captured. I will repeat the procedure we used on the whole text above for each of the 28 books, using a for loop.

top <- list()

for (i in 1:28) {
  books[[i]] <- books[[i]] %>%
    unnest_tokens(word, word_column) %>%
    anti_join(stop_words) %>%
    filter(!word %in% c("highlights", "highlight", "page",
                        "location", "yellow", "pink", "orange", "blue",
                        "export", "hidden", "truncated", "kindle", "note", "limits"))

  # Find out the top words in each book and capture them in a list (top)
  top[[i]] <- books[[i]] %>%
    group_by(word) %>%
    count() %>%
    arrange(desc(n))
}

for (i in 1:28) {
  print(book_names[[i]])
  print(top[[i]])
}
## [1] "Thinking, Fast and Slow"## # A tibble: 1,619 x 2## # Groups:   word [1,619]##    word          n##         ##  1 people       33##  2 system       26##  3 1            18##  4 mind         18##  5 effect       17##  6 bad          15##  7 cognitive    15##  8 ease         15##  9 theory       13## 10 decision     12## # ... with 1,609 more rows## [1] "Influence: The Psychology of Persuasion (Collins Business Essentials)"## # A tibble: 278 x 2## # Groups:   word [278]##    word            n##           ##  1 142             5##  2 146             5##  3 131             3##  4 147             3##  5 154             3##  6 179             3##  7 association     3##  8 food            3##  9 information     3## 10 people          3## # ... with 268 more rows## [1] "On Writing Well, 30th Anniversary Edition: An Informal Guide to Writing Nonfiction"## # A tibble: 770 x 2## # Groups:   word [770]##    word         n##        ##  1 writing     26##  2 write       18##  3 sentence    15##  4 writer      15##  5 reader      13##  6 people      10##  7 words        9##  8 person       8##  9 writers      8## 10 day          7## # ... with 760 more rows## [1] "Wired for Story: The Writer's Guide to Using Brain Science to Hook Readers from the Very First Sentence"## # A tibble: 1,657 x 2## # Groups:   word [1,657]##    word              n##             ##  1 story           104##  2 goal             41##  3 protagonist      40##  4 life             27##  5 protagonist's    23##  6 internal         21##  7 brain            20##  8 reader           20##  9 external         19## 10 world            19## # ... with 1,647 more rows## [1] "Bird by Bird: Some Instructions on Writing and Life"## # A tibble: 522 x 2## # Groups:   word [522]##    word          n##         ##  1 writing      17##  2 mind          7##  3 bird          6##  4 voices        5##  5 attention     4##  6 day           4##  7 hope          4##  8 life          4##  9 makes         4## 10 muscles       4## # ... with 512 more rows## [1] "Atomic Habits: An Easy and Proven Way to Build Good Habits and Break Bad Ones"## # A tibble: 2,736 x 2## # Groups:   word [2,736]##    word         n##        ##  1 habits     140##  2 habit      110##  3 behavior    94##  4 change      73##  5 people      50##  6 time        47##  7 identity    38##  8 day         36##  9 brain       32## 10 person      32## # ... with 2,726 more rows## [1] "Storynomics: Story-Driven Marketing in the Post-Advertising World"## # A tibble: 3,042 x 2## # Groups:   word [3,042]##    word          n##         ##  1 story       149##  2 mind         50##  3 stories      48##  4 core         47##  5 marketing    46##  6 brand        45##  7 life         42##  8 change       41##  9 audience     35## 10 due          33## # ... with 3,032 more rows## [1] "Crucial Conversations Tools for Talking When Stakes Are High, Second Edition"## # A tibble: 1,828 x 2## # Groups:   word [1,828]##    word              n##             ##  1 people           84##  2 dialogue         40##  3 stories          40##  4 due              34##  5 feel             33##  6 crucial          31##  7 conversations    30##  8 meaning          30##  9 story            30## 10 conversation     28## # ... 
with 1,818 more rows## [1] "Pre-Suasion: A Revolutionary Way to Influence and Persuade"## # A tibble: 524 x 2## # Groups:   word [524]##    word             n##            ##  1 attention        6##  2 influence        5##  3 mental           5##  4 trust            5##  5 visitors         5##  6 comfort          4##  7 emotional        4##  8 experience       4##  9 message          4## 10 associations     3## # ... with 514 more rows## [1] "Made to Stick: Why some ideas take hold and others come unstuck"## # A tibble: 1,752 x 2## # Groups:   word [1,752]##    word          n##         ##  1 people       64##  2 knowledge    27##  3 story        25##  4 ideas        24##  5 concrete     18##  6 surprise     17##  7 care         16##  8 time         15##  9 attention    14## 10 core         14## # ... with 1,742 more rows## [1] "The Charisma Myth: Master the Art of Personal Magnetism"## # A tibble: 1,802 x 2## # Groups:   word [1,802]##    word            n##           ##  1 feel           43##  2 body           38##  3 people         35##  4 language       33##  5 charisma       27##  6 warmth         27##  7 charismatic    24##  8 power          22##  9 person         19## 10 confidence     18## # ... with 1,792 more rows## [1] "The Power of Moments: Why Certain Experiences Have Extraordinary Impact"## # A tibble: 1,299 x 2## # Groups:   word [1,299]##    word              n##             ##  1 moments          29##  2 moment           21##  3 people           17##  4 time             15##  5 insight          13##  6 milestones       13##  7 purpose          11##  8 relationships    11##  9 create            9## 10 goal              9## # ... with 1,289 more rows## [1] "Principles: Life and Work"## # A tibble: 1,131 x 2## # Groups:   word [1,131]##    word           n##          ##  1 people        54##  2 thinking      16##  3 decision      12##  4 level         12##  5 life          12##  6 pain          12##  7 habits        11##  8 understand    11##  9 change        10## 10 knowing       10## # ... with 1,121 more rows## [1] "Deep Work: Rules for Focused Success in a Distracted World"## # A tibble: 711 x 2## # Groups:   word [711]##    word          n##         ##  1 attention    12##  2 deep         11##  3 ability       9##  4 book          9##  5 life          9##  6 time          9##  7 mind          7##  8 world         7##  9 focus         6## 10 called        5## # ... with 701 more rows## [1] "Getting to Yes: Negotiating an agreement without giving in"## # A tibble: 1,489 x 2## # Groups:   word [1,489]##    word            n##           ##  1 agreement      33##  2 negotiation    33##  3 options        23##  4 people         19##  5 objective      17##  6 positions      17##  7 ideas          16##  8 position       15##  9 shared         15## 10 solution       15## # ... with 1,479 more rows## [1] "Who: The A Method for Hiring"## # A tibble: 920 x 2## # Groups:   word [920]##    word           n##          ##  1 people        38##  2 job           22##  3 players       16##  4 person        15##  5 candidate     14##  6 candidates    13##  7 company       13##  8 hire          11##  9 hiring        11## 10 interview     11## # ... 
with 910 more rows## [1] "Mindset: Changing The Way You think To Fulfil Your Potential"## # A tibble: 910 x 2## # Groups:   word [910]##    word         n##        ##  1 mindset     43##  2 people      33##  3 growth      27##  4 fixed       23##  5 blame       18##  6 learning    16##  7 learn       15##  8 effort      11##  9 failure     11## 10 makes       10## # ... with 900 more rows## [1] "The 4-Hour Work Week: Escape the 9-5, Live Anywhere and Join the New Rich"## # A tibble: 736 x 2## # Groups:   word [736]##    word          n##         ##  1 time         11##  2 life         10##  3 mail          6##  4 product       6##  5 week          6##  6 world         6##  7 baby          5##  8 celebrity     5##  9 create        5## 10 days          5## # ... with 726 more rows## [1] "Tools of Titans: The Tactics, Routines, and Habits of Billionaires, Icons, and World-Class Performers"## # A tibble: 1,956 x 2## # Groups:   word [1,956]##    word       n##      ##  1 people    41##  2 life      25##  3 time      24##  4 write     24##  5 world     22##  6 10        17##  7 ideas     14##  8 book      13##  9 times     12## 10 read      11## # ... with 1,946 more rows## [1] "The Elements of Eloquence: How to Turn the Perfect English Phrase"## # A tibble: 116 x 2## # Groups:   word [116]##    word           n##          ##  1 change         5##  2 english        3##  3 pattern        3##  4 poets          3##  5 19             2##  6 44             2##  7 attitude       2##  8 colour         2##  9 contradict     2## 10 fall           2## # ... with 106 more rows## [1] "The One Thing: The Surprisingly Simple Truth Behind Extraordinary Results: Achieve your goals with one of the world's bestselling success books (Basic Skills)"## # A tibble: 587 x 2## # Groups:   word [587]##    word              n##             ##  1 time             29##  2 success          15##  3 results          11##  4 block             9##  5 day               9##  6 extraordinary     9##  7 life              8##  8 matters           8##  9 successful        8## 10 discipline        7## # ... with 577 more rows## [1] "How to Win Friends and Influence People"## # A tibble: 140 x 2## # Groups:   word [140]##    word          n##         ##  1 people        7##  2 ability       3##  3 fears         3##  4 116           2##  5 book          2##  6 human         2##  7 knowledge     2##  8 meeting       2##  9 person's      2## 10 sell          2## # ... with 130 more rows## [1] "The Untethered Soul: The Journey Beyond Yourself"## # A tibble: 770 x 2## # Groups:   word [770]##    word           n##          ##  1 life          73##  2 feel          34##  3 events        26##  4 mind          25##  5 world         20##  6 fear          19##  7 inside        19##  8 energy        17##  9 experience    17## 10 heart         17## # ... with 760 more rows## [1] "Man's Search For Meaning: The classic tribute to hope from the Holocaust"## # A tibble: 894 x 2## # Groups:   word [894]##    word            n##           ##  1 life           29##  2 suffering      24##  3 meaning        20##  4 human          19##  5 intention      11##  6 75              9##  7 logotherapy     9##  8 patient         9##  9 world           9## 10 called          8## # ... 
with 884 more rows## [1] "The Power of your Subconscious Mind and Other Works"## # A tibble: 600 x 2## # Groups:   word [600]##    word             n##            ##  1 mind            34##  2 subconscious    28##  3 wealth          13##  4 idea            11##  5 mental          10##  6 love             9##  7 life             8##  8 peace            8##  9 happiness        7## 10 desire           6## # ... with 590 more rows## [1] "Ego is the Enemy: The Fight to Master Our Greatest Opponent"## # A tibble: 831 x 2## # Groups:   word [831]##    word         n##        ##  1 ego         19##  2 people      12##  3 purpose     11##  4 111          8##  5 change       7##  6 147          6##  7 function     6##  8 life         6##  9 passion      6## 10 path         6## # ... with 821 more rows## [1] "Outliers: The Story of Success"## # A tibble: 105 x 2## # Groups:   word [105]##    word             n##            ##  1 ability          3##  2 knowing          3##  3 sense            3##  4 communicate      2##  5 distance         2##  6 family           2##  7 intelligence     2##  8 power            2##  9 practical        2## 10 sternberg        2## # ... with 95 more rows## [1] "The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career"## # A tibble: 570 x 2## # Groups:   word [570]##    word              n##             ##  1 people           14##  2 product           8##  3 opportunities     7##  4 person            7##  5 start             7##  6 assets            6##  7 job               6##  8 time              6##  9 138               5## 10 create            5## # ... with 560 more rows

Now, looking at the frequent words from each book, we can get more insight into what each one is about.

Here are the bigrams for the same books.

df <- data.frame(highlights)
df$highlights <- str_replace_all(df$highlights, "’", "'")

# Getting the index number for each book
indexes <- str_which(df$highlights, pattern = fixed("Your Kindle Notes For"))
book_names <- df$highlights[indexes + 1]
indexes <- c(indexes, nrow(df))

# Capturing each book individually
books <- list()
for (i in 1:(length(indexes) - 1)) {
  books[[i]] <- data.frame(df$highlights[(indexes[i]:indexes[i + 1] - 1)])
  colnames(books[[i]]) <- "word_column"
  books[[i]]$word_column <- as.character(books[[i]]$word_column)
}

# Next step in the plan was splitting the text into single words by the unnest_tokens function
for (i in 1:28) {
  books[[i]] <- books[[i]] %>%
    unnest_tokens(word, word_column) %>%
    anti_join(stop_words) %>%
    filter(!word %in% c("highlights", "highlight", "page",
                        "location", "yellow", "pink", "orange", "blue",
                        "export", "hidden", "truncated", "kindle", "note", "limits"))
}

# After this preparation step I can combine the single words back into a continuous text
for (i in 1:28) {
  books[[i]] <- str_c(books[[i]]$word, " ")
  books[[i]] <- data.frame(books[[i]])
}

df_bigram <- list()
for (i in 1:28) {
  df_bigram[[i]] <- books[[i]] %>%
    unnest_tokens(bigram, books..i.., token = "ngrams", n = 3, n_min = 2)
}

for (i in 1:28) {
  print(book_names[i])
  df_bigram[[i]] %>%
    group_by(bigram) %>%
    count() %>%
    arrange(desc(n)) %>%
    print(n = 10)
}
## [1] "Thinking, Fast and Slow"## # A tibble: 5,768 x 2## # Groups:   bigram [5,768]##    bigram                    n##                     ##  1 system 1                 16##  2 cognitive ease            9##  3 system 2                  8##  4 halo effect               4##  5 loss aversion             4##  6 possibility effect        4##  7 affective forecasting     3##  8 availability bias         3##  9 cognitive strain          3## 10 decision weights          3## # ... with 5,758 more rows## [1] "Influence: The Psychology of Persuasion (Collins Business Essentials)"## # A tibble: 673 x 2## # Groups:   bigram [673]##    bigram                    n##                     ##  1 association principle     2##  2 click whirr               2##  3 click whirr response      2##  4 luncheon technique        2##  5 reciprocity rule          2##  6 whirr response            2##  7 0 13                      1##  8 0 13 rule                 1##  9 13 rule                   1## 10 13 rule reciprocation     1## # ... with 663 more rows## [1] "On Writing Well, 30th Anniversary Edition: An Informal Guide to Writing Nonfiction"## # A tibble: 2,172 x 2## # Groups:   bigram [2,172]##    bigram                       n##                        ##  1 500th appendix               2##  2 choice unity                 2##  3 confronted solved            2##  4 despair finding              2##  5 despair finding solution     2##  6 english language             2##  7 federal buildings            2##  8 finally solve                2##  9 finally solve surgeon        2## 10 finding solution             2## # ... with 2,162 more rows## [1] "Wired for Story: The Writer's Guide to Using Brain Science to Hook Readers from the Very First Sentence"## # A tibble: 6,602 x 2## # Groups:   bigram [6,602]##    bigram                    n##                     ##  1 external goal             8##  2 internal goal             6##  3 cognitive unconscious     5##  4 internal issue            5##  5 real life                 5##  6 story question            5##  7 antonio damasio           4##  8 effect trajectory         4##  9 steven pinker             4## 10 1 story                   3## # ... with 6,592 more rows## [1] "Bird by Bird: Some Instructions on Writing and Life"## # A tibble: 1,304 x 2## # Groups:   bigram [1,304]##    bigram                   n##                    ##  1 bird bird                3##  2 muscles cramp            3##  3 cramp wounds             2##  4 life view                2##  5 likable narrator         2##  6 muscles cramp wounds     2##  7 pay attention            2##  8 1,015 read               1##  9 1,015 read reading       1## 10 1,048 digress            1## # ... with 1,294 more rows## [1] "Atomic Habits: An Easy and Proven Way to Build Good Habits and Break Bad Ones"## # A tibble: 12,309 x 2## # Groups:   bigram [12,309]##    bigram                  n##                   ##  1 behavior change        23##  2 type person            17##  3 law behavior           13##  4 law behavior change    13##  5 bad habits             11##  6 social media            9##  7 habits attractive       6##  8 3rd law                 5##  9 bad habit               5## 10 break chain             5## # ... 
with 1.23e+04 more rows## [1] "Storynomics: Story-Driven Marketing in the Post-Advertising World"## # A tibble: 12,819 x 2## # Groups:   bigram [12,819]##    bigram                 n##                  ##  1 object desire         16##  2 told story            16##  3 inciting incident     12##  4 positive negative     10##  5 purpose told          10##  6 subject matter        10##  7 core character         9##  8 purpose told story     8##  9 real beauty            7## 10 change team's          6## # ... with 1.281e+04 more rows## [1] "Crucial Conversations Tools for Talking When Stakes Are High, Second Edition"## # A tibble: 8,751 x 2## # Groups:   bigram [8,751]##    bigram                    n##                     ##  1 crucial conversations    19##  2 due 27                   10##  3 mutual purpose           10##  4 shared pool               8##  5 silence violence          8##  6 crucial conversation      7##  7 path action               7##  8 due 26                    6##  9 due 43                    6## 10 fool's choice             6## # ... with 8,741 more rows## [1] "Pre-Suasion: A Revolutionary Way to Influence and Persuade"## # A tibble: 1,261 x 2## # Groups:   bigram [1,261]##    bigram                      n##                       ##  1 attention goal              2##  2 concept audience            2##  3 levels importance           2##  4 mandel johnson              2##  5 mental activity             2##  6 social proof                2##  7 thousand dollars            2##  8 twenty thousand             2##  9 twenty thousand dollars     2## 10 writing session             2## # ... with 1,251 more rows## [1] "Made to Stick: Why some ideas take hold and others come unstuck"## # A tibble: 6,372 x 2## # Groups:   bigram [6,372]##    bigram                      n##                       ##  1 curse knowledge             7##  2 guessing machines           6##  3 people care                 6##  4 goodyear tires              5##  5 knowledge gaps              5##  6 people's attention          5##  7 popcorn popper              5##  8 security goodyear           5##  9 security goodyear tires     5## 10 sinatra test                5## # ... with 6,362 more rows## [1] "The Charisma Myth: Master the Art of Personal Magnetism"## # A tibble: 6,343 x 2## # Groups:   bigram [6,343]##    bigram                        n##                         ##  1 body language                30##  2 power warmth                  6##  3 feel bad                      4##  4 imagination reality           4##  5 people feel                   4##  6 responsibility transfer       4##  7 charismatic body              3##  8 charismatic body language     3##  9 confidence ability            3## 10 distinguish imagination       3## # ... with 6,333 more rows## [1] "The Power of Moments: Why Certain Experiences Have Extraordinary Impact"## # A tibble: 3,967 x 2## # Groups:   bigram [3,967]##    bigram                         n##                          ##  1 defining moments               5##  2 backward integrated            3##  3 backward integrated design     3##  4 breaking script                3##  5 connecting meaning             3##  6 integrated design              3##  7 moments pride                  3##  8 understanding validation       3##  9 bad stronger                   2## 10 bose headphones                2## # ... 
with 3,957 more rows## [1] "Principles: Life and Work"## # A tibble: 3,960 x 2## # Groups:   bigram [3,960]##    bigram                   n##                    ##  1 common sense             3##  2 left brained             3##  3 responsible parties      3##  4 134 people               2##  5 274 remember             2##  6 407 values               2##  7 407 values abilities     2##  8 achieve goals            2##  9 bad outcomes             2## 10 blind spots              2## # ... with 3,950 more rows## [1] "Deep Work: Rules for Focused Success in a Distracted World"## # A tibble: 1,981 x 2## # Groups:   bigram [1,981]##    bigram                  n##                   ##  1 deliberate practice     4##  2 13 master               2##  3 14 deep                 2##  4 29 ability              2##  5 77 gallagher            2##  6 ability concentrate     2##  7 anders ericsson         2##  8 book shining            2##  9 choose focus            2## 10 fixed schedule          2## # ... with 1,971 more rows## [1] "Getting to Yes: Negotiating an agreement without giving in"## # A tibble: 5,363 x 2## # Groups:   bigram [5,363]##    bigram                     n##                      ##  1 objective criteria        11##  2 principled negotiation     8##  3 bottom line                6##  4 inventing options          6##  5 mutual gain                6##  6 reach agreement            6##  7 reaching agreement         6##  8 options mutual             5##  9 options mutual gain        5## 10 brainstorming session      4## # ... with 5,353 more rows## [1] "Who: The A Method for Hiring"## # A tibble: 3,196 x 2## # Groups:   bigram [3,196]##    bigram                     n##                      ##  1 talented people            6##  2 outcomes competencies      4##  3 96 performance             3##  4 96 performance compare     3##  5 fit company                3##  6 performance compare        3##  7 2 million                  2##  8 95 interrupt               2##  9 career goals               2## 10 company 31                 2## # ... with 3,186 more rows## [1] "Mindset: Changing The Way You think To Fulfil Your Potential"## # A tibble: 3,182 x 2## # Groups:   bigram [3,182]##    bigram                   n##                    ##  1 fixed mindset           19##  2 growth mindset          19##  3 people fixed             4##  4 people fixed mindset     4##  5 183 son                  3##  6 assign blame             3##  7 social interactions      3##  8 142 create               2##  9 157 fixed                2## 10 157 fixed mindset        2## # ... with 3,172 more rows## [1] "The 4-Hour Work Week: Escape the 9-5, Live Anywhere and Join the New Rich"## # A tibble: 1,927 x 2## # Groups:   bigram [1,927]##    bigram                     n##                      ##  1 http e.ggtimer.com         3##  2 basic assumptions          2##  3 car seat                   2##  4 limit tasks                2##  5 offer customer             2##  6 options offer              2##  7 options offer customer     2##  8 parkinson's law            2##  9 shorten time               2## 10 suggest days               2## # ... 
with 1,917 more rows## [1] "Tools of Titans: The Tactics, Routines, and Habits of Billionaires, Icons, and World-Class Performers"## # A tibble: 6,321 x 2## # Groups:   bigram [6,321]##    bigram                  n##                   ##  1 10 ideas                4##  2 bad ideas               4##  3 keeping track           4##  4 track times             4##  5 world war               4##  6 516 write               3##  7 extreme ownership       3##  8 heart head              3##  9 keeping track times     3## 10 narrative narrative     3## # ... with 6,311 more rows## [1] "The Elements of Eloquence: How to Turn the Perfect English Phrase"## # A tibble: 272 x 2## # Groups:   bigram [272]##    bigram                    n##                     ##  1 change attitude           2##  2 change pattern            2##  3 change pattern change     2##  4 fall love                 2##  5 pattern change            2##  6 0 19                      1##  7 0 19 bred                 1##  8 11 2018                   1##  9 11 2018 8                 1## 10 19 bred                   1## # ... with 262 more rows## [1] "The One Thing: The Surprisingly Simple Truth Behind Extraordinary Results: Achieve your goals with one of the world's bestselling success books (Basic Skills)"## # A tibble: 1,629 x 2## # Groups:   bigram [1,629]##    bigram                    n##                     ##  1 extraordinary results     7##  2 time block                7##  3 selected discipline       3##  4 3 time                    2##  5 3 time block              2##  6 achieve extraordinary     2##  7 block day                 2##  8 default settings          2##  9 discipline build          2## 10 easier unnecessary        2## # ... with 1,619 more rows## [1] "How to Win Friends and Influence People"## # A tibble: 318 x 2## # Groups:   bigram [318]##    bigram               n##                ##  1 time meeting         2##  2 0 72                 1##  3 0 72 lies            1##  4 110 people           1##  5 110 people smile     1##  6 112 time             1##  7 112 time meeting     1##  8 116 116              1##  9 116 116 bad          1## 10 116 bad              1## # ... with 308 more rows## [1] "The Untethered Soul: The Journey Beyond Yourself"## # A tibble: 3,195 x 2## # Groups:   bigram [3,195]##    bigram                   n##                    ##  1 preconceived notions     8##  2 life avoiding            7##  3 devote life              6##  4 empty space              5##  5 experience life          5##  6 model reality            5##  7 rest life                5##  8 spend life               5##  9 spend life avoiding      5## 10 153 events               4## # ... with 3,185 more rows## [1] "Man's Search For Meaning: The classic tribute to hope from the Holocaust"## # A tibble: 2,917 x 2## # Groups:   bigram [2,917]##    bigram                    n##                     ##  1 paradoxical intention     6##  2 hyper intention           4##  3 anticipatory anxiety      3##  4 existential vacuum        3##  5 fall asleep               3##  6 human existence           3##  7 intention fall            3##  8 intention fall asleep     3##  9 meaning life              3## 10 potential meaning         3## # ... 
with 2,907 more rows## [1] "The Power of your Subconscious Mind and Other Works"## # A tibble: 1,750 x 2## # Groups:   bigram [1,750]##    bigram                         n##                          ##  1 subconscious mind             17##  2 dominant idea                  4##  3 idea subconscious              3##  4 peace mind                     3##  5 power subconscious             3##  6 accept idea                    2##  7 accepted subconscious          2##  8 accepted subconscious mind     2##  9 annoy irritate                 2## 10 annoy irritate permit          2## # ... with 1,740 more rows## [1] "Ego is the Enemy: The Fight to Master Our Greatest Opponent"## # A tibble: 2,290 x 2## # Groups:   bigram [2,290]##    bigram                 n##                  ##  1 112 start              2##  2 147 deceived           2##  3 33 purpose             2##  4 beat people            2##  5 ego enemy              2##  6 function function      2##  7 people beat            2##  8 people beat people     2##  9 people beneath         2## 10 purpose realism        2## # ... with 2,280 more rows## [1] "Outliers: The Story of Success"## # A tibble: 232 x 2## # Groups:   bigram [232]##    bigram                           n##                            ##  1 knowing knowing                  2##  2 power distance                   2##  3 practical intelligence           2##  4 0 884                            1##  5 0 884 write                      1##  6 1,051 robert                     1##  7 1,051 robert sternberg           1##  8 1,052 practical                  1##  9 1,052 practical intelligence     1## 10 1,063 annette                    1## # ... with 222 more rows## [1] "The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career"## # A tibble: 1,611 x 2## # Groups:   bigram [1,611]##    bigram                 n##                  ##  1 product management     3##  2 faster cheaper         2##  3 skills experiences     2##  4 soft assets            2##  5 weak ties              2##  6 0 15                   1##  7 0 15 paranoid          1##  8 101 business           1##  9 101 business crazy     1## 10 101 inspired           1## # ... with 1,601 more rows

If you want to see another example of this capturing process you can have a look at my recent post here.

Looking at each book individually, he became more and more obsessed with the books on my Kindle. He decided to order a couple of them.

Sentiment analysis is used to evaluate the emotional charge of a text. The most common uses are social media monitoring, customer experience management, and Voice of the Customer analysis, to understand how people feel.

The bing lexicon categorizes words into positive and negative categories, in a binary fashion. The nrc lexicon uses categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
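To see what the two lexicons actually contain before joining them to the highlights, a quick look is enough (a minimal sketch; get_sentiments("nrc") may ask you to download the lexicon via the textdata package the first time it is called):

get_sentiments("bing") %>% count(sentiment)
get_sentiments("nrc") %>% count(sentiment, sort = TRUE)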

Using the bing lexicon

This gives us the top words contributing to each sentiment category. Some examples to note are success and effective for positive, and bad, hard, and limits for negative.

Here is how R produced the above plot:

df <- data.frame(highlights)
df$highlights <- str_replace_all(df$highlights, "’", "'")

df <- df %>%
  unnest_tokens(word, highlights) %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("highlights", "highlight", "page",
                      "location", "yellow", "pink", "orange", "blue",
                      "export", "hidden", "truncated", "kindle", "note", "limits"))

bing_word_counts <- df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 1,854 x 3
##    word      sentiment     n
##    <chr>     <chr>     <int>
##  1 bad       negative     93
##  2 success   positive     91
##  3 hard      negative     85
##  4 love      positive     69
##  5 difficult negative     63
##  6 negative  negative     63
##  7 easy      positive     62
##  8 fear      negative     62
##  9 wrong     negative     62
## 10 positive  positive     58
## # ... with 1,844 more rows
# Top contributors to positive and negative sentiment
bing <- bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ggplot(aes(reorder(word, n), n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

bing

Using the nrc lexicon

We see that I am more likely to highlight a passage if it is charged with positive rather than negative sentiment and, among the individual emotions, with trust, anticipation, and joy rather than fear and sadness.

df <- data.frame(highlights)
df$highlights <- str_replace_all(df$highlights, "’", "'")

df <- df %>%
  unnest_tokens(word, highlights) %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("highlights", "highlight", "page",
                      "location", "yellow", "pink", "orange", "blue",
                      "export", "hidden", "truncated", "kindle", "note", "limits"))
## Joining, by = "word"
sentiment <- df %>%
  left_join(get_sentiments("nrc")) %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## Joining, by = "word"
sentiment
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 positive      8326
##  2 trust         4165
##  3 negative      3860
##  4 anticipation  3366
##  5 joy           2642
##  6 fear          2446
##  7 sadness       1844
##  8 anger         1799
##  9 surprise      1339
## 10 disgust       1093

Normalized sentiments

One important thing to add: each emotion category has a different number of words in the lexicon, and categories with fewer words are less likely to appear in a given text. Thus, I would like to normalize the counts by the number of words per category in the lexicon and see how the results differ from those above.

# I will add the number of words in each category of the NRC lexicon
lexicon <- c(2317, 3338, 1234, 842, 1483, 691, 1250, 1195, 1060, 535)
polarity <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

sentiment <- data.frame(sentiment, lexicon)

norm_sentiment <- sentiment %>%
  mutate(normalized = n / lexicon) %>%
  arrange(desc(normalized))

sentiment <- data.frame(norm_sentiment, polarity)
sentiment
##       sentiment    n lexicon normalized polarity
## 1  anticipation 3366     842   3.997625        1
## 2      positive 8326    2317   3.593440        1
## 3          fear 2446     691   3.539797        1
## 4      negative 3860    1234   3.128039        1
## 5       disgust 1093     535   2.042991        1
## 6           joy 2642    1483   1.781524        0
## 7         anger 1799    1195   1.505439        0
## 8       sadness 1844    1250   1.475200        0
## 9      surprise 1339    1060   1.263208        0
## 10        trust 4165    3338   1.247753        0
# General findings
sentiment %>%
  group_by(polarity) %>%
  summarize(n2 = sum(lexicon))
## # A tibble: 2 x 2
##   polarity    n2
##      <dbl> <dbl>
## 1        0  8326
## 2        1  5619

Now anticipation is the emotion found most often in the text I highlighted once we normalize (3366/842 ≈ 4.0, ahead of positive at 8326/2317 ≈ 3.6). This does not seem like a coincidence to me: most of the books in the analysis are about productivity and self-development, and productivity tips and tools usually contain words associated with anticipation.

In a similar way, I can look at the sentiment of the individual books.

df <- data.frame(highlights)

# Kindle uses apostrophes (’), but stop_words uses single quotes (')
# To be able to use all stop_words I should replace the apostrophes with quotes
df$highlights <- str_replace_all(df$highlights, "’", "'")

# Getting the index number for each book
indexes <- str_which(df$highlights, pattern = fixed("Your Kindle Notes For"))
book_names <- df$highlights[indexes + 1]
indexes <- c(indexes, nrow(df))

# Capturing each book individually
books <- list()
for (i in 1:(length(indexes) - 1)) {
  books[[i]] <- data.frame(df$highlights[(indexes[i]:indexes[i + 1] - 1)])
  colnames(books[[i]]) <- "word_column"
  books[[i]]$word_column <- as.character(books[[i]]$word_column)
}

# Next step in the plan was splitting the text into single words by the unnest_tokens function
for (i in 1:28) {
  books[[i]] <- books[[i]] %>%
    unnest_tokens(word, word_column) %>%
    anti_join(stop_words) %>%
    filter(!word %in% c("highlights", "highlight", "page",
                        "location", "yellow", "pink", "orange", "blue"))
}

sentiment <- list()
for (i in 1:28) {
  sentiment[[i]] <- books[[i]] %>%
    left_join(get_sentiments("nrc")) %>%
    filter(!is.na(sentiment)) %>%
    count(sentiment, sort = TRUE)

  print(book_names[i])
  print(sentiment[[i]])
}
## [1] "Thinking, Fast and Slow"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       450##  2 trust          256##  3 negative       254##  4 anticipation   163##  5 fear           153##  6 sadness        116##  7 joy            107##  8 anger          104##  9 disgust         81## 10 surprise        75## [1] "Influence: The Psychology of Persuasion (Collins Business Essentials)"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive        53##  2 trust           37##  3 joy             15##  4 negative        15##  5 fear            12##  6 anticipation    11##  7 sadness          8##  8 anger            7##  9 surprise         3## 10 disgust          2## [1] "On Writing Well, 30th Anniversary Edition: An Informal Guide to Writing Nonfiction"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       172##  2 negative        98##  3 trust           81##  4 anticipation    63##  5 anger           48##  6 fear            47##  7 disgust         42##  8 sadness         42##  9 joy             37## 10 surprise        26## [1] "Wired for Story: The Writer's Guide to Using Brain Science to Hook Readers from the Very First Sentence"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       413##  2 negative       197##  3 trust          178##  4 anticipation   168##  5 fear           152##  6 joy            116##  7 sadness        108##  8 anger           96##  9 surprise        84## 10 disgust         41## [1] "Bird by Bird: Some Instructions on Writing and Life"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive        77##  2 negative        55##  3 anticipation    40##  4 trust           38##  5 joy             37##  6 fear            30##  7 sadness         27##  8 disgust         17##  9 surprise        16## 10 anger           15## [1] "Atomic Habits: An Easy and Proven Way to Build Good Habits and Break Bad Ones"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       835##  2 trust          455##  3 anticipation   439##  4 negative       356##  5 joy            296##  6 fear           254##  7 sadness        180##  8 anger          147##  9 surprise       139## 10 disgust        117## [1] "Storynomics: Story-Driven Marketing in the Post-Advertising World"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       860##  2 trust          405##  3 negative       364##  4 anticipation   335##  5 joy            250##  6 fear           221##  7 sadness        171##  8 anger          167##  9 surprise       166## 10 disgust         76## [1] "Crucial Conversations Tools for Talking When Stakes Are High, Second Edition"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       758##  2 negative       496##  3 trust          412##  4 fear           282##  5 anticipation   258##  6 anger          243##  7 joy            216##  8 sadness        196##  9 disgust        142## 10 surprise       108## [1] "Pre-Suasion: A Revolutionary Way to Influence and Persuade"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive        84##  2 trust           51##  3 negative        31##  4 anticipation    27##  5 fear            24##  6 joy             22##  7 anger           14##  8 sadness         12##  9 surprise         9## 10 disgust          3## [1] "Made to Stick: Why some ideas take hold and others come unstuck"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       499##  2 trust          
236##  3 anticipation   198##  4 negative       167##  5 joy            156##  6 fear           123##  7 surprise       107##  8 sadness         74##  9 anger           65## 10 disgust         60## [1] "The Charisma Myth: Master the Art of Personal Magnetism"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       483##  2 negative       254##  3 trust          228##  4 joy            166##  5 anticipation   162##  6 fear           157##  7 sadness        143##  8 anger          120##  9 surprise        65## 10 disgust         58## [1] "The Power of Moments: Why Certain Experiences Have Extraordinary Impact"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       294##  2 trust          132##  3 anticipation   123##  4 negative       106##  5 joy             96##  6 fear            72##  7 anger           52##  8 surprise        50##  9 sadness         45## 10 disgust         19## [1] "Principles: Life and Work"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       313##  2 trust          178##  3 negative       129##  4 anticipation   120##  5 joy            103##  6 sadness         80##  7 fear            78##  8 anger           53##  9 surprise        50## 10 disgust         35## [1] "Deep Work: Rules for Focused Success in a Distracted World"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       176##  2 trust           69##  3 anticipation    54##  4 negative        36##  5 joy             32##  6 fear            19##  7 sadness         14##  8 surprise        14##  9 anger           12## 10 disgust          7## [1] "Getting to Yes: Negotiating an agreement without giving in"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       444##  2 trust          234##  3 negative       180##  4 anticipation   135##  5 anger          103##  6 fear           100##  7 joy             83##  8 sadness         68##  9 surprise        48## 10 disgust         38## [1] "Who: The A Method for Hiring"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       259##  2 trust          125##  3 anticipation    95##  4 joy             73##  5 negative        68##  6 fear            30##  7 surprise        29##  8 anger           25##  9 sadness         22## 10 disgust         16## [1] "Mindset: Changing The Way You think To Fulfil Your Potential"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       317##  2 trust          160##  3 negative       134##  4 joy            117##  5 anticipation   100##  6 fear            78##  7 anger           70##  8 sadness         65##  9 disgust         57## 10 surprise        44## [1] "The 4-Hour Work Week: Escape the 9-5, Live Anywhere and Join the New Rich"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       131##  2 anticipation    70##  3 negative        64##  4 trust           57##  5 joy             56##  6 fear            34##  7 surprise        27##  8 anger           24##  9 sadness         20## 10 disgust         14## [1] "Tools of Titans: The Tactics, Routines, and Habits of Billionaires, Icons, and World-Class Performers"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       406##  2 negative       251##  3 trust          199##  4 anticipation   188##  5 fear           134##  6 joy            126##  7 anger          111##  8 sadness        108##  9 surprise        78## 10 disgust         74## [1] "The Elements of Eloquence: How to Turn the Perfect 
English Phrase"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive        18##  2 negative        13##  3 fear             9##  4 trust            9##  5 joy              6##  6 sadness          6##  7 anger            5##  8 anticipation     3##  9 disgust          2## 10 surprise         2## [1] "The One Thing: The Surprisingly Simple Truth Behind Extraordinary Results: Achieve your goals with one of the world's bestselling success books (Basic Skills)"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       139##  2 anticipation    97##  3 trust           60##  4 joy             56##  5 negative        32##  6 fear            20##  7 anger           14##  8 surprise        14##  9 disgust          9## 10 sadness          9## [1] "How to Win Friends and Influence People"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive        33##  2 trust           14##  3 negative        11##  4 anticipation    10##  5 joy              7##  6 anger            6##  7 fear             5##  8 surprise         5##  9 disgust          4## 10 sadness          4## [1] "The Untethered Soul: The Journey Beyond Yourself"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       353##  2 negative       251##  3 fear           183##  4 anticipation   172##  5 trust          158##  6 joy            156##  7 sadness        137##  8 anger          125##  9 surprise        65## 10 disgust         56## [1] "Man's Search For Meaning: The classic tribute to hope from the Holocaust"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       199##  2 negative       172##  3 fear           108##  4 sadness        100##  5 trust           97##  6 anticipation    93##  7 joy             77##  8 anger           62##  9 disgust         56## 10 surprise        33## [1] "The Power of your Subconscious Mind and Other Works"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       183##  2 joy            110##  3 trust          110##  4 anticipation    74##  5 negative        59##  6 anger           43##  7 fear            38##  8 sadness         29##  9 surprise        26## 10 disgust         22## [1] "Ego is the Enemy: The Fight to Master Our Greatest Opponent"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       206##  2 trust          109##  3 negative        97##  4 anticipation    85##  5 joy             79##  6 fear            59##  7 anger           54##  8 sadness         42##  9 disgust         37## 10 surprise        31## [1] "Outliers: The Story of Success"## # A tibble: 7 x 2##   sentiment        n##           ## 1 positive        24## 2 trust           11## 3 joy              5## 4 anticipation     4## 5 fear             3## 6 surprise         3## 7 sadness          1## [1] "The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career"## # A tibble: 10 x 2##    sentiment        n##            ##  1 positive       145##  2 anticipation    79##  3 trust           64##  4 joy             42##  5 negative        40##  6 surprise        22##  7 fear            21##  8 sadness         17##  9 anger           14## 10 disgust          8
for (i in 1:28) {
  sentiment[[i]] %>%
    filter(sentiment %in% c('positive', 'negative')) %>%
    mutate(n2 = n / sum(n)) %>%
    print()
}
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    450 0.639
## 2 negative    254 0.361
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive     53 0.779
## 2 negative     15 0.221
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    172 0.637
## 2 negative     98 0.363
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    413 0.677
## 2 negative    197 0.323
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive     77 0.583
## 2 negative     55 0.417
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    835 0.701
## 2 negative    356 0.299
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    860 0.703
## 2 negative    364 0.297
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    758 0.604
## 2 negative    496 0.396
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive     84 0.730
## 2 negative     31 0.270
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    499 0.749
## 2 negative    167 0.251
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    483 0.655
## 2 negative    254 0.345
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    294 0.735
## 2 negative    106 0.265
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    313 0.708
## 2 negative    129 0.292
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    176 0.830
## 2 negative     36 0.170
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    444 0.712
## 2 negative    180 0.288
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    259 0.792
## 2 negative     68 0.208
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    317 0.703
## 2 negative    134 0.297
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    131 0.672
## 2 negative     64 0.328
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    406 0.618
## 2 negative    251 0.382
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive     18 0.581
## 2 negative     13 0.419
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    139 0.813
## 2 negative     32 0.187
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive     33  0.75
## 2 negative     11  0.25
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    353 0.584
## 2 negative    251 0.416
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    199 0.536
## 2 negative    172 0.464
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    183 0.756
## 2 negative     59 0.244
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    206 0.680
## 2 negative     97 0.320
## # A tibble: 1 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive     24     1
## # A tibble: 2 x 3
##   sentiment     n    n2
##   <chr>     <int> <dbl>
## 1 positive    145 0.784
## 2 negative     40 0.216
books <- str_trunc(book_names, width = 22)

all <- list()
for (i in 1:28) {
  all[[i]] <- sentiment[[i]] %>%
    filter(sentiment %in% c('positive', 'negative')) %>%
    mutate(n2 = n / sum(n)) %>%
    print()
}

Positivity map of the books.

all_bound <- do.call("rbind", all) %>% filter(sentiment == "positive")

library(ggrepel)

all_bound %>%
  ggplot(aes(x = book_names, y = n2)) +
  geom_point() +
  geom_label_repel(aes(label = books, color = ifelse(n2 < 0.55, "red", "blue")), size = 3) +
  theme_classic() +
  theme(legend.position = "none",
        text = element_text(size = 18),
        axis.text.x = element_blank()) +
  xlab("Books") +
  ylab("Positivity score")

The lowest positivity score was found in the book “Man’s Search for Meaning”. This is also kind of expected, since the book is based on Viktor Frankl’s sufferings during the Second World War.

I am getting more and more convinced that text mining gives good insights.

The book “Outliers”, which appeared at the top of the positivity plot, was a real outlier here. 😮

No panic.

Let’s look at the word count in our Outlier.

book_names[[27]]
## [1] "Outliers: The Story of Success"
top[[27]]
## # A tibble: 105 x 2
## # Groups:   word [105]
##    word             n
##    <chr>        <int>
##  1 ability          3
##  2 knowing          3
##  3 sense            3
##  4 communicate      2
##  5 distance         2
##  6 family           2
##  7 intelligence     2
##  8 power            2
##  9 practical        2
## 10 sternberg        2
## # ... with 95 more rows

The word count from the book “Outliers” shown above is 107. This is really low, so in the next iteration I would remove it from the analysis, since it will not be very informative. It is hard to know everything from the beginning, and we will go back and do some additional cleaning.

Summary

It is not feasible to read millions of pages to check whether text mining is reliable. But here I had data whose content I knew, and I applied text mining approaches and sentiment analysis to it. Both the unigrams and the bigrams pointed to similar ideas about what the books were about, and the sentiments made sense given the genres of the books on my Kindle.

Let’s come back to our hacker.

He was affected by an unanticipated side effect of the text analyses. As he continued the project, the insights from the frequent n-grams made him more and more interested in the content. He started reading again, and the more he read, the more differently the world looked.

He was transformed into a better version of himself.

The world was brighter. ☀

The radio disrupted the silence.

“brrring…..brrring…..brrring…..”

I woke up.

Thank you for reading. I hope you’ve learned something or got inspiration from this. Please feel free to leave comments, suggestions, and questions. (You can always contact me by email at serdar.korur@gmail.com)

If you would like to use the data, I will make it available on my GitHub.

I would be encouraged to write more if you liked the story and followed or clapped for me on Medium!

Until next time!

Serdar


To leave a comment for the author, please follow the link and comment on their blog: Posts | SERDAR KORUR.


Confidence and prediction intervals explained… (with a Shiny app!)


[This article was first published on R on Adi Sarid's personal blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This semester I started teaching introduction to statistics and data analysis with R, at Tel-Aviv university.

I put a lot of effort into bringing practical challenges, examples from real life, and many demonstrations of statistical theory with R. This post is an example of how I’ve been using R code (and specifically Shiny apps) to demonstrate statistical theory and concepts and provide intuition.

What’s the difference between confidence and prediction intervals?

Last week I taught multiple linear regression, and I noticed that students have a hard time comprehending the difference between confidence intervals and prediction intervals: the former is an interval for the underlying model, and the latter is an interval for a novel observation.

As the sample size increases, our uncertainty about the model’s parameters decreases, but the uncertainty in the value of a new observation \(y_0\) is associated with the variance of \(Y\) (the random variable from which \(y_0\) is drawn). Hence, it has a lower bound based on that variance.
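
To make that lower bound explicit, in the simple linear regression case the two intervals at a point \(x_0\) take the standard textbook form (stated here for reference, using the usual OLS notation rather than the app’s):

\[ \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}} \quad \text{(confidence interval for the mean response)} \]

\[ \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}} \quad \text{(prediction interval for a new observation)} \]

The extra 1 under the square root is the variance of the new observation itself, which is why the prediction interval never collapses to zero width as the sample grows.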

In R, you can get a prediction or a confidence interval by using either

predict(object, newdata, interval = "prediction")

Or

predict(object, newdata, interval = "confidence")

For a prediction or for a confidence interval, respectively.
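
As a minimal, self-contained illustration (the simulated data below is made up for this example):

set.seed(1)
sim <- data.frame(x = runif(50, 0, 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(50)

fit <- lm(y ~ x, data = sim)
new_x <- data.frame(x = c(2, 5, 8))

predict(fit, newdata = new_x, interval = "confidence")  # narrower: interval for the mean response
predict(fit, newdata = new_x, interval = "prediction")  # wider: interval for a single new observation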

To help me illustrate the differences between the two, I decided to build a small Shiny web app. It shows the differences between confidence intervals, prediction intervals, the regression fit, and the actual (original) model.

The app is available here, and the source code is available on github.

With this app you can choose three types of models to demonstrate: simple linear regression, and regression with a twist (a \(\log\) transformation on the \(y\) or a \(\sin\) transformation on the \(x\)):

  • Linear model \(y = a + bx + \epsilon\)

  • Log-linear model \(\log(y)=a+bx+\epsilon\)

  • Sine \(y = a + b\sin(x) + \epsilon\)

All the models are based on simple linear regression (lm function), for the latter two models with either a log or sin transformation.

The app allows you to play around with various values such as the \(x\) range, the model’s parameters (\(a\) and \(b\)), the error’s standard deviation (\(\epsilon\)), and show or hide any of the following elements, on the chart:

  • The original function (i.e., the original model)

  • The sampled points

  • The confidence interval

  • The prediction interval

  • The model’s fit

Feel free to share the app or the app’s code. As mentioned above, the source code for the app is available here: https://github.com/adisarid/prediction_confidence_intervals_demo.

Here’s an example for what the app’s generating code and output looks like, for a model of the type \(\log(y) = 1 + \frac{x}{2} + \epsilon\):

library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

sample_size <- 90
x_range <- c(0, 1.5)
a <- 1
b <- 1.5
sigma <- 0.3

actual_function <- tibble(x = seq(x_range[1], x_range[2], by = 0.01)) %>%
  mutate(actual_y = exp(a + b*x))

random_sample <- tibble(epsilon_err = rnorm(n = sample_size,
                                            mean = 0,
                                            sd = sigma),
                        x = runif(n = sample_size,
                                  min = x_range[1],
                                  max = x_range[2])) %>%
  mutate(sampled_y = exp(a + b*x + epsilon_err))

linear_model <- lm(formula = log(sampled_y) ~ x, data = random_sample)

prediction_i <- predict(object = linear_model,
                        newdata = actual_function,
                        interval = "prediction") %>%
  as_tibble() %>%
  rename_at(vars(lwr, upr), ~paste0(., "_pi")) %>%
  mutate_all(exp)

confidence_i <- predict(object = linear_model,
                        newdata = actual_function,
                        interval = "confidence") %>%
  as_tibble() %>%
  rename_at(vars(lwr, upr), ~paste0(., "_ci")) %>%
  select(-fit) %>%
  mutate_all(exp)

intervals <- actual_function %>%
  bind_cols(prediction_i,
            confidence_i)

ggplot() +
  geom_line(data = actual_function, aes(x, actual_y, color = "Original Model"), size = 1) +
  geom_point(data = random_sample, aes(x, sampled_y), alpha = 0.5) +
  geom_line(data = intervals,
            aes(x, fit, color = "Regression Fit"), size = 1) +
  geom_line(data = intervals,
            aes(x, lwr_pi, color = "Prediction Interval"),
            linetype = 2, size = 1) +
  geom_line(data = intervals,
            aes(x, upr_pi, color = "Prediction Interval"),
            linetype = 2, size = 1) +
  geom_line(data = intervals,
            aes(x, lwr_ci, color = "Confidence Interval"),
            linetype = 2, size = 1) +
  geom_line(data = intervals,
            aes(x, upr_ci, color = "Confidence Interval"),
            linetype = 2, size = 1) +
  theme_bw() +
  xlab("x") +
  ylab("y") +
  ggtitle("Log-linear: Model, Fit, Confidence and Prediction Intervals")

Conclusions

Shiny apps are a great way to illustrate theoretical concepts, to provide intuition, and to let students experiment with parameters and see the outcomes. In this post I demonstrated how a Shiny app can be used to explain the concepts of a regression fit, confidence, and prediction intervals.

If you have used Shiny for interesting educational demonstrations I’d love to hear about it! Feel free to share in the comments or message me on twitter @SaridResearch.


To leave a comment for the author, please follow the link and comment on their blog: R on Adi Sarid's personal blog.



R 3.6.2 is out, and a preview of R 4.0.0


[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R 3.6.2, the latest update to the R language, is now available for download on Windows, Mac and Linux.

As a minor release, R 3.6.2 makes only small improvements to R, including some new options for dot charts and better handling of missing values when using running medians as a smoother on charts. It also includes several bug fixes and performance improvements.

But big changes are coming to R with version 4.0.0, which is expected to be released not long after R's official 20th birthday on February 29, 2020. (The CelebRation 2020 conference will mark the occasion in Copenhagen.) The R Core team has announced previews of some of the changes, which include:

An enhanced reference counting system. When you delete an object in R, it will usually release the associated memory back to the operating system. Likewise, if you copy an object with y <- x, R won't allocate new memory for y unless x is later modified. In current versions of R, however, that system breaks down if there are more than 2 references to any block of memory. Starting with R 4.0.0, all references will be counted, so R should reclaim as much memory as possible, reducing R's overall memory footprint. This will have no impact on how you write R code, but this change makes R run faster, especially on systems with limited memory and with slow storage systems.
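
For intuition, here is a small base-R illustration of the copy-on-modify behaviour that the reference count tracks (not code from the announcement):

x <- runif(1e6)
tracemem(x)    # report whenever this block of memory is copied

y <- x         # no copy yet: x and y share the same memory
y[1] <- 0      # tracemem() reports a copy only now, when y is modified
untracemem(x)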

Normalization of matrix and array types. Conceptually, a matrix is just a 2-dimensional array. But current versions of R handle matrix and 2-D array objects differently in some cases. In R 4.0.0, matrix objects will formally inherit from the array class, eliminating such inconsistencies.
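
Concretely, the documented change is visible in the class vector of a matrix:

m <- matrix(1:4, nrow = 2)

class(m)
# R 3.6.x:  "matrix"
# R 4.0.0:  "matrix" "array"

inherits(m, "array")
# FALSE in R 3.6.x, TRUE from R 4.0.0 onwards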

A refreshed color palette for charts. The base graphics palette for current versions of R (shown as R3 below) features saturated colors that vary considerably in brightness (for example, yellow doesn't display as prominently as red). In R 4.0.0, the palette R4 below will be used, with colors of consistent luminance that are easier to distinguish, especially for viewers with color deficiencies. Additional palettes will make it easy to make base graphics charts that match the color scheme of ggplot2 and other graphics systems.

R4 palette
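
If you want to compare the palettes yourself once R 4.0.0 is released, the announced palette() interface should allow switching by name; a sketch based on the announcement (treat the palette names as assumptions until 4.0.0 ships):

# R >= 4.0.0 only; palette names taken from the release announcement
palette("R4")                                  # new default palette
barplot(rep(1, 8), col = 1:8, main = "R4")

palette("R3")                                  # old saturated palette, for comparison
barplot(rep(1, 8), col = 1:8, main = "R3")

palette("default")                             # back to the version's default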

Many other smaller changes are in the works too. See the NEWS file for the upcoming R release for details.

R developer page: NEWS file for upcoming R release


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Meta Machine Learning aggregator packages in R, The 2nd generation

$
0
0

[This article was first published on Dror Berel's R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

TL;DR

mlr was refactored into mlr3. caret was refactored into tidymodels. What are the main differences in terms of software design, and in tweaking them for your own needs? R6 vs S3: which one is less fragile?

Motivation

My previous post from mid 2018 described my learning experience with R packages for ‘meta’ machine learning aggregation: mlr, caret and SuperLearner. These packages unify a machine learning framework over multiple independent individual multivariate models/packages, and provide a ‘meta’ machine learning framework around them for common tasks such as resampling, tuning, benchmarking, ensembling and others.

Since then, a couple of major developments have evolved and matured in the R ecosystem, replacing deprecated packages that, ahem, ahem, it is not comfortable to admit, have started to suffer from … scope creep.

The current refactoring packages I will discuss here are:

  • mlr -> mlr3 (page, manual)
  • caret -> tidymodels (github repo)

This post will attempt to describe the main differences between the next generations of these two packages, with an emphasis on the software design and object architecture/structure that each package utilizes, and how it affects the user/practitioner/developer experience, not only for basic use, but also when taking it to the next level of customizing and modifying it for a more specific use.

Scope creep… are you telling me that yesterday’s state-of-the-art package is actually crap?

Image credit: https://leankit.com/wp-content/uploads/2013/11/Screen-Shot-2013-11-25-at-4.25.52-PM.png

When I was first exposed to mlr, I thought to myself: WOW, what a huge effort was invested into this package, not even mentioning the extensive documentation; it will probably last forever. In retrospect, I found it was definitely worth my time to learn it, in spite of the non-trivial ‘dialect’ it had, which was different from the good old friend ‘caret’ package I had already been accustomed to.

Then, one shiny day, I was surprised to discover that the developers of mlr were actually no longer satisfied with their current version of mlr, and were already working on developing a new package, mlr3, which mostly consists of R6, an advanced system for object-oriented classes that was way beyond my comfort zone.

My initial concerns were: Is this the end of my affair with mlr? How am I going to cope with not only the challenge of the complexity of machine learning by itself, but also a complex R6 class structure on top of it, which I know nothing about?

But after processing the terrifying news for a while, a new realization emerged… if R6 is worth the learning curve for the developers of mlr3 in such a drastic way that they are going to throw away their entire former effort on mlr and never look back, it must have some functionality that gives it some advantages… perhaps it is time for me as well to benefit from those advantages.

Don’t get me wrong! I always wanted to learn and use R6, in spite of already being a huge fan of the S4 system for object-oriented classes. However, maybe it was too convenient to seamlessly benefit from it running in the background of many sophisticated packages (tidyverse among many others), which kept me away from diving into the unknown world of R6.

While mlr3 was still under development, I thought that at least I could buy some time until it officially matured into a stable version, so I wouldn’t have to suffer through the painful growing time; and if mlr were no longer maintained, I could always go back to caret.

But mlr was not alone in being deserted by its developers… caret was experiencing similar scope creep issues. However, in caret’s case, its developer(s) joined hands with the experts at RStudio to complete the missing link in the tidyverse framework, and released a collection of packages under the umbrella term ‘tidymodels’, including:

  • rsample and recipes for data pre-processing (transformation, hold-out sampling, etc.)
  • parsnip for model creation
  • yardstick for measuring model performance

Tidymodels was designed to utilize the tidyverse approach, a mature, popular and intuitive one, along with plenty of useful documentation and vignettes. As an avid user of the tidyverse, I couldn’t wish for more.

About cohesive machines…

My previous post discussed the main challenges in designing such ‘meta’ tools; among them, perhaps the most challenging is the cohesiveness of the entire system, that is: when you work on a specific feature, how to tackle it without accidentally harming the other delicate components that expect a very specific behavior/structure from the tinkered component.

Image credit: https://oip.manual.canon/

Did you ever try to fix or troubleshoot one of those big copy machines that does everything from double-sided printing to binding, and maybe even a cup of milkshake with two spoons of sugar? So many delicate parts in such a cohesive system? Go figure. Yet, when an accidental paper jam occurs, what are you going to do next? How can you figure out where exactly the malfunction is? Are you expected to take each piece out until you diagnose the source of the problem?

Alas, no worries! Rest assured, you don’t even need to call a technician! The copy machine is so smart that it will automatically instruct you exactly where to look for the possible issue, so you don’t need to take it all apart piece by piece. Detailed instructions carefully tell you which specific door/clip to push to open another hidden drawer where the possible paper jam is. Voila. Fixed. All that is left to do is close the door. The machine automatically runs a quick self-inspection, and is ready for the next task.

What a magical and elegant machine. No need for us to worry about messy ink or destructive laser beams shooting all over. Instead, all the magic is done inside the copy machine. What happens in the machine stays in the machine. All the user cares about is getting an organized, clean, bound printout.

So why object oriented? What happens in the object stays in the object!

To me, an object-oriented system is a constrained, restricted class that wraps other things (attributes, functions/methods) inside of it, so things are kept ONLY where they belong, and not all over. Yes, there is a lot of pain in dealing with such constraints, but from a software design perspective, they were deliberately designed to be annoyingly “unbreakable” for the sole purpose of saving us from our own clutter!

S3 and S4 are common systems for object-oriented classes, where S4 is the more restricted one, supporting multiple dispatch. In my earlier series of posts I described how Bioconductor developers utilize the S4 system for their own complex needs of analyzing large data sets with advanced statistical methods. While caret and mlr do not support the unflattened structure of Bioconductor’s popular S4 SummarizedExperiment class for genomic assay data, there are Bioconductor packages that attempt to do similar things to caret and mlr, i.e. unify multivariate models around the SummarizedExperiment object. Another approach, taken by the Bioc2mlr package, goes the other way: flattening the S4 objects into the standard wide data frame format that mlr and caret expect, but at the cost of losing/breaking the tight knots that were holding the multiple slots tied to each other.

Why R6? The self-assignment, object-frugality approach:

In my own experience, I have learned the hard way that the fewer objects one manages in a piece of code, the less clutter and cognitive burden there is to handle; or to put it simply, there are fewer places to look for bugs when things are not going as expected.

In line with this object-frugality approach, I am a big fan of working with the pipe operator %>%, which I find contributes to cleaner, object-minimized code. Nevertheless, the %<>% ‘self-assignment’ operator does even more than that, by saving us the redundancy of using the object twice: once as the origin to begin with, and a second time as the destination to be assigned the new revised content.
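
A tiny magrittr illustration of the difference:

library(magrittr)

x <- c(1, 4, 9)

# regular pipe: x appears twice, as the source and as the assignment target
x <- x %>% sqrt()

# self-assignment pipe: x appears only once
x %<>% round(1)
x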

The R6 system supports an elegant way of handling object state changes, with reference semantics. What this means is that you don’t even need an assignment: all you need is to call the method (function) on the object, and it will automatically update the designated field within the object with the new content the function creates. Voila. It might take time to get used to, but once you do, traditional code with assignments all of a sudden seems redundant, repetitive, cluttered, and less elegant.
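
For readers who have not met R6 yet, a minimal sketch (the Counter class is invented for illustration, not taken from mlr3):

library(R6)

Counter <- R6Class("Counter",
  public = list(
    count = 0,
    add = function(x = 1) {
      # the method updates the object's own field; no assignment at the call site
      self$count <- self$count + x
      invisible(self)
    }
  )
)

cnt <- Counter$new()
cnt$add(5)    # modifies cnt in place (reference semantics)
cnt$count     # 5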

Elegance is of course a matter of personal taste, but to me, it translates into stronger, less “vulnerable” code. It might still break, but because it was very strict to begin with, you will know exactly where to begin exploring for the cause of the issue within the mechanics of the object, and once found, how to fix it locally in a safe way, without accidentally damaging other related parts of your code.

OK, by now you already figured out I am a big fan of object oriented classes, so I may be heavily biased in my following sections :), so back to ML…

The good old gasoline engine vs the fancy electric one …

Do you drive an electric car? I don’t (yet), but what I know about it is that it is more reliable, and when it breaks, a traditional mechanic may not have the tools to open it and fix it. They might get a very detailed log message with the malfunction code, but they may not even have the tools to crack open the electric engine cover. Instead, you might need to send it over to a special shop that specializes in this new advanced technology, and they will have the tools to quickly diagnose it, find the problem, and fix it.

Well, my friend, this might be also the case with the comparison of these two refactored ML approaches.

Similarities:

On the one hand, both mlr3 and tidymodels share the same algorithms for multivariate models, resampling, etc, and will probably achieve very similar model performances (accuracy). There might also be differences in scalability, which are beyond the scope of this post.

Dissimilarities:

To me, there are 3 main crucial differences; the first two are directly derived from the architecture of the R6 class:

1. Having an isolated, compact, separate ‘space’ for the ML analysis (mlr3) vs traditional R functions that do something and return a result in one structure or another (tidymodels).

I simply find it more elegant (i.e. stable, consistent, less fragile) when specialized tasks are done in their designated space, which carefully monitors and cares only about what goes in and what comes out. Nothing more.

mlr3’s R6 may not be trivial for many users, but for the ones who appreciate such a strict class, it is heaven. tidymodels will keep many others happy with the traditional R S3 system, which also benefits from the flexibility of simply running a function, without worrying too much about edge scenarios that accidentally break the function only when you tweak it all the way to its boundaries. Many users may never even be near those boundaries, so why worry about it.

2. Connecting the wires by myself when I need to customize it for a new specific need, in the safest way.

Image credit: https://www.syfy.com/

Good packages that last longer are designed with a very wide scope, wide enough for the developers to demonstrate how one can customize the main functions of the package. In the context of ML, that means customizing and adding new multivariate models (learners), new performance metrics, new transformations, etc. Both mlr3 and tidymodels do this very nicely, in terms of documentation (tutorials showing you how to add new methods from scratch) and in terms of strict constraints (yet only up to some level).

However, mlr3’s R6 constraints may be stricter than the ones enforced by tidymodels, and this matters more as you knit them together with the other parts of the ML pipeline. In other words, in such a powerful, multi-function pipeline that integrates resampling + tuning + benchmarking + …, even a small bit of grit that somehow gets stuck in the wheels (e.g. NAs in a prediction vector that were not supposed to be there) may cause side effects that break the entire machine. How likely is this to happen? Maybe it will never happen at all, but the best practice to minimize the odds of it happening is a more restricted system.

With my luck (well, I do enjoy stretching such methods to their limits), it does occasionally happen, and when it does, it is not fun at all to spend precious time on long expeditions to find the cause of the error. You might not care about it at all, but some people do.

3. Pre-processing: customize new composable (monad) transformations

As mentioned in my previous post, the tricky part of pre-processing is to guarantee that whichever transformation(s) were applied on the training data set are also applied on the test data set. This is not a trivial task when one needs to pass on the parameters that hold the transformation details. Both packages also support monad-style composition, meaning they allow composing different transformations in any order, while guaranteeing compatibility of the ‘pipes’ with each other.
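
A small recipes sketch of that train/test guarantee (the toy data and column names are invented for illustration):

library(dplyr)
library(recipes)

train <- data.frame(age = c(23, 35, 51, 44), income = c(30, 58, 72, 65))
test  <- data.frame(age = c(29, 61),         income = c(41, 80))

rec <- recipe(~ ., data = train) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

prepped <- prep(rec, training = train)  # estimates means and sds on the training set only

bake(prepped, new_data = test)          # applies the training-set parameters to the test set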

The less trivial task is to write your own customized pre-processing transformation. This is not trivial at all, in spite of extensive documentation at both recipes and mlr3pipelines. Overall, both approaches have similar capabilities, yet, each developed its own terminology and jargon, which might take some time to get used to.

4. Survival and unsupervised models

Currently supported by mlr3, but it seems the tidymodels developers plan to add them as well.

5. Can you please just show me the code?

https://mlr3pipelines.mlr-org.com/articles/comparison_mlr3pipelines_mlr_sklearn.html has a nice code comparison between mlr3 and tidymodels (and others).

Vision:

Similar to the vision in my previous post, what really matters is for the user to be comfortable with the documentation and to have access to reliable support channels. I also envision that ‘converter’ functions will be developed from one package to the other, at least for the transformations. A rich collection of transformation libraries is a shared interest of both approaches, regardless of the above differences.

For some reason, not fully clear to me, both approaches do not yet provide designated packages/functions for ‘Interpretable Model-Agnostic’ tools such as LIME, DALEX and iml (among others). To me these post-ML tools seem to be complementary to the common ML framework, though some of the packages mentioned might already support them.

Bottom line:

1. Respect the good old deprecated packages. Don’t jump to the next level before you mastered them!

I would first encourage new ML practitioners to play with the deprecated versions of caret and mlr, just so you become familiar with the basic terms, scope, and functionality of common machine learning components.

Only then, when you feel comfortable with terms like classification, regression, feature engineering, model parameter tuning, cross-validation, ensemble, benchmarking and others, gradually move on and take it to the next level.

2. If you don’t have to make it complex, just keep it simple.

Many practitioners probably won’t even care about the differences I mentioned above. In that case, tidymodels would probably be the preferred option. It is simple, straightforward, well documented, and well supported.

3. Didn’t you always want to learn R6, but never had enough motivation to do so?

Don’t miss this opportunity, or stay behind.

4. When things break…

For others, who enjoy tinkering and stretching the limits of a tool’s scope, such concerns about package object structure and about the actual mechanics happening at the back end may be more crucial.

At the end of the day, even today’s state-of-the-art tool is tomorrow’s deprecated, deserted package; but living in the moment, at present, one should bet on the most stable solution, the one that is least likely to break.

Why would it break? If it wouldn’t break, there would be no need for this post to be written at all.

Disclaimers:

The author is not affiliated with any of the groups who developed the packages listed above, besides the package Bioc2mlr.

What about SuperLearner, refactored into tlverse? By the way, it was also done with R6. I don’t know; maybe in a future blog. I would mention, however, that the academic interest of the developer group behind it is mostly motivated by and leaning towards Targeted Maximum Likelihood / Minimum Loss-Based Estimation (TMLE), which is a statistical approach for parameter estimation, whereas the other two packages are perhaps more ‘engineering’ oriented. There is great recent documentation about it here.

Fun fact: mlr3 never mentions p-values. I am not even sure if that’s on purpose.

Did I forget a key player in this (semi-biased) review? If so, let me know. If I don’t have a good reason to exclude it, I don’t mind adding it.

Check my blog and github page for more fun stuff.

first published at my Medium blog


To leave a comment for the author, please follow the link and comment on their blog: Dror Berel's R Blog.


How H2O propels data scientists ahead of itself: enhancing Driverless AI with advanced options, recipes and visualizations


[This article was first published on novyden, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

H2O engineers continually innovate and implement the latest techniques by following and adopting the latest research, working on cutting-edge use cases, and participating in and winning machine learning competitions like Kaggle. But thanks to the explosion of AI research and applications, even the most advanced automated machine learning platforms like H2O.ai Driverless AI cannot come with all the bells and whistles to satisfy every data scientist out there. Which means there is that feature or algorithm that a customer may want and not yet find in the H2O docs.

With that in mind we designed several mechanisms to help users lead the way with Driverless AI instead of waiting or looking elsewhere. These mechanisms allow data scientists to extend functionality with little (or possibly more involved) effort by seamlessly integrating into the Driverless AI workflow and model pipeline:

  • experiment configuration profile
  • transformer recipes (custom feature engineering)
  • model recipes (custom algorithms)
  • scorer recipes (custom loss functions)
  • data recipes (data load, prep and augmentation; starting with 1.8.1)
  • Client APIs for both Python and R

Let us explain what each means and finish with a more elaborate example of using the R Client to enhance model analysis with visualizations.

Experiment Configuration

All possible configuration options inside Driverless AI can be found inside config.toml file (see here). Every experiment can override any of these options (as applicable) in its Expert Settings using Add to config.toml via toml String entry while applying them for this experiment only. 

For example, while Driverless AI completely automates tuning and selection of its built-in algorithms (GLM, LightGBM, XGBoost, TensorFlow, RuleFit, FTRL), it cannot possibly foresee all possible use cases or control for all available parameters. So with every experiment the following configuration settings let the user customize any parameters for each algorithm:

  • LightGBM parameters: params_lightgbm and params_tune_lightgbm
  • XGBoost GBM: params_xgboost and params_tune_xgboost
  • XGBoost Dart: params_dart and params_tune_dart
  • Tensorflow: params_tensorflow and params_tune_tensorflow
  • GLM: params_gblinear and params_tune_gblinear
  • RuleFit: params_rulefit and params_tune_rulefit
  • FTRL: params_ftrl and params_tune_ftrl

For example, use params_tensorflow to override Tensorflow models in your experiment:

params_tensorflow = "{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30, 'layers': (100, 100), 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3, 'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}"

 or override LightGBM parameters:    

params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64, 'random_state': 1234}"

 or use params_tune_xxxx to provide a grid for Driverless AI to limit or extend its hyper parameter search for optimal model over certain values like this:

params_tune_xgboost = "{'max_leaves': [8, 16, 32, 64]}"

To add multiple parameters via Expert Settings just use double double quotes (“”) around whole configuration string and new line (\n) to separate parameters:

""params_tensorflow = "{'lr': 0.01, 'epochs': 30, 'activation': 'selu'}" \n params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64}" \n params_tune_xgboost = "{'max_leaves': [8, 16, 32, 64]}"""

To confirm that the new settings took effect, look inside the experiment’s log file (to do this while the experiment is running see here, or for a completed experiment here) and find Driverless AI Config Settings. Overridden settings should appear with an asterisk and their assigned values:

params_tensorflow                     *: {'lr': 0.01, 'epochs': 30, 'activation': 'selu'}
params_lightgbm                       *: {'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64}
params_tune_xgboost                   *: {'max_leaves': [8, 16, 32, 64]}

 

Transformer Recipes

Starting with version 1.7.0 (July 2019) Driverless AI supports the Bring Your Own Recipe (BYOR) framework to seamlessly integrate user-written extensions into its workflow. Feature engineering and selection make up a significant part of the automated machine learning (AutoML) workflow, utilizing a Genetic Algorithm (GA) and a rich set of built-in feature transformers and interactions to maximize model performance. A high-level and rough view of the Driverless AI GA and BYOR workflow, illustrating how its pieces fall together, is displayed below:

Figure 1. Driverless AI GA and BYOR workflow

Still, the variety of data and ever more complex use cases often demand more specialized transformations and interactions performed on features. Using custom transformer recipes (a.k.a. BYOR transformers), the core functionality can be extended to include any transformations and interactions written in Python according to the BYOR specification. Implemented in Python and able to use any Python packages, a transformer recipe becomes part of the core GA workflow, competing with the built-in feature transformations and interactions.

Such fair competition of transformers inside Driverless AI is good both for Driverless AI models and for customers, who can share and borrow ideas from each other; it is a true realization of the democratising AI that H2O.ai stands for. To start with custom transformers, use one of the many recipes found in the public H2O recipe repo, in the transformer section: https://github.com/h2oai/driverlessai-recipes/tree/master/transformers. For more help on how to create your own transformer see How to Write a Transformer Recipe.

Model Recipes

XGBoost and LightGBM consistently deliver top models and carry most of the transactional and time series use cases in Driverless AI. Another workhorse algorithm, delivering top models for NLP and multi-class use cases, is TensorFlow. Still more algorithms – Random Forest, GLM, and FTRL – compete for the best model in Driverless AI. But this competition is not a closed tournament, because BYOR lets any algorithm available in Python compete for the best model. Using BYOR model recipes a user can incorporate their own classification and regression algorithms into the Driverless AI workflow and let it tune and select the one with the best score for the final model or ensemble. Based on the accuracy setting, Driverless AI either picks the best algorithm or continues the workflow with a meta learner to assemble the final ensemble out of the top finishers. Any program written in Python just needs to implement the BYOR model interface to start competing as part of the Driverless AI workflow. For examples and a wide variety of existing recipes refer to the h2oai/driverlessai-recipes/models repository.

Scorer Recipes

Often data scientists swear by their favorite scorer, so Driverless AI includes a sufficiently large set of built-in scorers for both classification and regression. But we don’t pretend to have all the answers and, again, the BYOR framework allows extending the Driverless AI workflow with any scoring (loss) function, be it from the latest research papers or driven by specific business requirements. A rather representative and useful collection of scorers can be found in the h2oai/driverlessai-recipes/scorers repository, and instructions on how to use custom scorers in Driverless AI are here. Remember that Driverless AI uses custom scorers inside the GA workflow to select the best features and model parameters, and not inside its algorithms, where it would be more dangerous and likely not desirable.

Data Recipes

Starting with version 1.8.1 (December 2019) a new BYOR feature – data recipes – was added to Driverless AI. The concept is simple: bring your Python code into Driverless AI to create new data and manipulate existing data, in order to enhance the data and elevate the models. Data recipes utilize data APIs, datatable, pandas, numpy and other third-party libraries in Python, and belong to one of two types:

  • producing data recipes create new dataset(s) by prototyping connectors, bringing data in and processing it. They are similar to data connectors in the way they import and munge data from external sources (see here);
  • modifying data recipes create new dataset(s) by transforming a copy of an existing one (see here). A variety of data preprocessing (data prep) use cases fall into this category, including data munging, data quality, data labeling, unsupervised algorithms such as clustering or latent topic analysis, anomaly detection, etc.

One important difference between data recipes and the previous types of recipes (transformer, model, and scorer) is their relationship to model scoring pipelines. While the latter integrate into the MOJO or Python scoring pipelines and get deployed with the models, the former manipulate data before the modeling workflow takes place and do not mix with the scoring pipeline. For various examples of data recipes refer to the https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.1/data repository.

Python Client

All Driverless AI features and actions found in the web user interface are also available via the Python Client API. See the docs for instructions on how to install the Python package here and for examples here. For Driverless AI users who are proficient in Python, scripting repeatable and reusable tasks with the Python Client is the next logical step in adopting and productionizing Driverless AI models. Examples of such tasks are re-fitting on the latest data and deploying the final ensemble, executing business-driven workflows that combine data prep and Driverless AI modeling, computing business reports and KPIs using models, implementing the Reject Inference method for credit approval, and other use cases.

R Client

The Driverless AI R Client parallels the functionality of the Python Client, emphasising consistency with R language conventions, and appeals to data scientists practicing R. Moreover, R’s unparalleled visualization libraries extend model analysis beyond the already powerful tools and features found in the Driverless AI web interface. Let’s conclude with an example of using the ggplot2 package, based on the grammar of graphics by Leland Wilkinson (Chief Scientist at H2O.ai), to create a Response Distribution Chart (RDC) to analyze binary classification models trained in Driverless AI. The RDC lets us analyze the distribution of responses (probabilities) generated by the model to assess its quality based on how well it distinguishes the two classes (see 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com, section 6).

To plot the distributions we show the full workflow with the R client: how to connect, import and split data, run an experiment to create a model, and finally score data inside Driverless AI.

Before starting with the R client script, install the Driverless AI R client package by downloading it from the Driverless AI instance itself:

Figure 2. Downloading Driverless AI Client R package

After the download completes, RStudio lets you find and install the package from its menu Tools -> Install Packages…

With the dai package installed, every script begins by connecting to the running Driverless AI instance it will be using (change its address, user id, and password):

For our examples we will use the infamous titanic dataset that I saved and slightly enhanced on my local machine (you can download the dataset here). The following command uploads the data file from the local machine into Driverless AI:

While the H2O pipeline automates the machine learning workflow, including creating and using validation splits, it is best practice to provide a separate test set so that Driverless AI can produce an out-of-sample score estimate for its final model. Splitting data on an appropriate target, fold, or time column is one of many built-in functions called using:

Now we can start the automated machine learning workflow to predict survival chances for Titanic passengers, which results in a complete and fully featured classification model:

If you log in to Driverless AI you can observe the just-created model via the browser UI:

Having the Driverless AI classifier, there are many ways to obtain predictions. One way is to download the file with the computed test predictions to the client and then read it into R:

Because we want to use features from the model in visualizations, there is a way to score a dataset and attach hand-picked features to the results (scoring all Titanic data in this case):

At this point the full power of R graphics is available to produce additional visualizations of the model, with the predictions stored inside an R data frame. As promised, we show how to implement the method of Response Distribution Analysis:

The method is based on the Response Distribution Chart (RDC), which is simply a histogram of the output of the model. The simple observation that the RDC of an ideal model should have one peak at 0 and one peak at 1 (with heights given by the class proportion). Source: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com

First, we plot the RDC on all data:
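
(The original post embeds the generating R code and chart here; as a hedged sketch, an RDC is just a histogram of the model’s predicted probabilities. The data frame preds and the column name p_survived below are assumptions made for illustration, standing in for the scored data frame produced above.)

library(ggplot2)

# preds: one row per passenger, with the model's predicted probability (p_survived);
# both the data frame and the column name are assumptions for illustration
ggplot(preds, aes(x = p_survived)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Response Distribution Chart (all data)",
       x = "Predicted probability of survival",
       y = "Count") +
  theme_minimal()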

A few more examples of the RDC follow – first with separate distributions for survived and not-survived passengers:

The next plot compares RDCs for male and female passengers:

Finally, RDC’s by port of embarkation:

  
H2O engineers hardly ever stop improving and enhancing the product with new features, so RDCs will likely become part of the model diagnostics tools in Driverless AI soon. But this example still serves its purpose of illustrating how to produce practically any type of model analysis with the help of the R Client.

To leave a comment for the author, please follow the link and comment on their blog: novyden.


Creating an RSS Feed to Add Your Jekyll / Github Pages Blog to R-Bloggers


[This article was first published on Method Matters Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post, we will go through the steps you need to follow if you would like to add a Jekyll / Github Pages blog to R-Bloggers. I recently went through this process and had to search through a lot of information in order to figure out how to do it. The process was particularly tricky because not all of the posts on my blog are about R (and therefore not all are relevant for R-Bloggers). Below, we will go through the process that will allow you to submit only blog posts about R to R-Bloggers. Hopefully this blog post will make the process simpler for others!

In what follows, we will assume that you have a working Jekyll site that you build locally and which you host publically at Github or somewhere else, and that you understand the basic layout of the folders in the Jekyll site directory.

Step 1: Add Jekyll Feed to Your Site

R-Bloggers needs an RSS feed in order to show your blog posts on their site. You can add a Jekyll module, called Jekyll Feed in order to generate this RSS feed.

Simply follow these three steps to install Jekyll Feed (the first two are copied directly from the Jekyll Feed link above):

  1. Add this line to your site’s Gemfile:
   gem 'jekyll-feed'
  2. And then add this line to your site’s _config.yml:
   plugins:
     - jekyll-feed
  3. Open the Ruby prompt on your computer, navigate to your site’s root directory, and run:
   bundle install

This will make sure you have the jekyll-feed plugin installed (if it isn’t already).

Step 2: Create An XML File to Generate the RSS Feed

As explained in this excellent blog post by Yongfu Liao, Jekyll Feed doesn’t allow you to generate an RSS feed for a particular tag or category of posts. However, there is a workaround which you can read more about here and here.

Yongfu shares his modification of the original workaround on Github. This code is really nice because it provides exactly the format of the RSS feed required by R-Bloggers.

All you need to do is to take the code below, paste it into a text editor, and save it as an .xml file. For example, my file is called “feed.r.bloggers.xml.”

Make sure you change the “Author” section in the code to your email address (the name part is optional). Then specify the tag you will use to mark a given post as suitable for R-Bloggers. In the example below, I specify: [“R”], meaning all posts with this (case-sensitive) tag will be included in the RSS feed.

All credit for this code goes to Yongfu Liao – thank you for sharing this solution on your blog!

Step 3: Place the .xml File In Your Site’s Root Directory

Place the .xml file in the root directory of your Jekyll site. Here is a screenshot example for my blog:

xml location

Step 4: Rebuild Your Site and Find the RSS Feed

Once you put the .xml file in the root directory of your site, your blog should re-build itself automatically, generating the RSS feed we specified in the .xml file.

You can find the rss feed file in your site’s _site directory. The file we’ve generated here is called “feed.r.bloggers.xml.”

rss location

Notice that there is also a file called “feed.xml.” This is the rss feed for all the posts on the blog, generated by default from the jekyll-feed plugin we installed earlier.

You can find the R-Bloggers feed in the local version of your site at this location: http://localhost:4000/feed.r.bloggers.xml

Step 5: Push to Github and Test Your RSS Feed

Once you’ve got the rss feed working correctly on the local version of your site, it’s time to push it to Github (or wherever you host your site).

When you’ve done this, you’ll find the R-Bloggers rss feed at your website address + ‘/feed.r.bloggers.xml’. For example, the R-Bloggers rss feed for this blog can be found at this address: https://methodmatters.github.io/feed.r.bloggers.xml.

Step 6: Validate Your RSS Feed and Submit to R-Bloggers

Before submitting to R-Bloggers, you need to check that your RSS feed is valid. Tal Galili (who runs R-Bloggers) recommends testing your feed via this validator: https://validator.w3.org/feed/

Simply paste the link to the feed on your site into the box and click on “Check.”

xml location

If your feed passes the inspection, you’re ready to submit your blog to R-Bloggers, which you can do here! Make sure to follow all of the guidelines Tal lays out in order to be accepted!

Summary and Conclusion

In this post, we saw how to configure a Jekyll / Github Pages website to create an RSS feed to allow us to submit blog posts about R to R-Bloggers. The process is not very straightforward, but I personally feel that the hassle is worth it in order to take advantage of all the really nice features of Jekyll (great flexibility, beautiful templates) combined with free hosting on Github Pages. I made the switch to Jekyll / Github Pages this past summer from Blogger, and am super happy about the way the blog looks, and the freedom I have to blog about R, Python, or anything else.

Coming Up Next

In the next post, we will return to analyzing data: specifically, detailed records of my phone usage. We will use data munging and visualization to see how and when my phone usage patterns differ throughout the day.

Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters Blog.


A large repository of networkdata


[This article was first published on schochastics, and kindly contributed to R-bloggers].

There are many network repositories out there that offer a large variety of amazing free data (see the awesome network analysis list on GitHub for an overview). The problem is that network data can come in many formats: either in plain text as an edgelist or adjacency matrix, or in one of the many dedicated network file formats (paj, dl, gexf, graphml, net, gml, …). The package igraph has an import function for these formats (read_graph()), but I have found it to be unreliable at times.
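
For example, importing a GraphML file with read_graph() typically looks something like the sketch below (“my_network.graphml” is a hypothetical file name used purely for illustration):

library(igraph)
# "my_network.graphml" is a placeholder; format can also be "gml", "pajek", "edgelist", ...
g <- read_graph("my_network.graphml", format = "graphml")
summary(g)  # prints the IGRAPH header for the imported network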

The networkdata package collates datasets from many different sources and makes the networks readily available in R. The data is very diverse, ranging from traditional social networks to animal, covert, and movie networks. In total, the package includes 979 datasets containing 2135 networks. As such, I hope this package will be a good resource for teaching, workshops, and research whenever example data is needed. You can only get so far with the Karate network.

library(igraph)
library(networkdata)

Install

Due to the nature of the package (only data, no functions), it will not go to CRAN at any point. However, if you are looking for stable builds, the package is available via drat. With drat, you can install and upgrade non-CRAN packages directly from R using the standard install.packages() and update.packages() functions.

# install.packages("drat")
drat::addRepo("schochastics")
install.packages("networkdata")

To save one line of code in the future, you can add drat::addRepo("schochastics") to your .Rprofile.

The development version is available via GitHub.

remotes::install_github("schochastics/networkdata")

The package requires ~22 MB of disk space, given that it includes a lot of data.

Overview

So far, the package includes datasets from a number of different repositories.

All networks are in igraph format. If you are used to working with the network format (as used by sna and ergm), you can use the intergraph package to easily switch between igraph and network objects.
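
As a quick illustration, a round trip between the two formats might look like the following sketch (it uses a toy ring graph; any igraph object, including the ones shipped in networkdata, works the same way):

library(igraph)
library(intergraph)
g   <- make_ring(5)   # toy example; any igraph object will do
net <- asNetwork(g)   # igraph -> network (usable with sna / ergm)
g2  <- asIgraph(net)  # ... and back to igraph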

A list of all datasets can be obtained with

data(package = "networkdata")

Alternatively, use the function show_networks() to get a list of datasets with desired properties.

head(show_networks(type = "directed"), n = 10)
##         variable_name      network_name is_collection no_of_networks
## 38             ants_1            ants_1         FALSE              1
## 39             ants_2            ants_2         FALSE              1
## 42                atp               atp          TRUE             52
## 45             bkfrac            bkfrac         FALSE              1
## 47             bkoffc            bkoffc         FALSE              1
## 49             bktecc            bktecc         FALSE              1
## 50               bott              bott         FALSE              1
## 55           cent_lit          cent_lit         FALSE              1
## 106 dnc_temporalGraph dnc_temporalGraph         FALSE              1
## 109     eies_messages     eies_messages         FALSE              1
##         nodes     edges is_directed is_weighted is_bipartite has_vattr
## 38    16.0000   200.000        TRUE       FALSE        FALSE     FALSE
## 39    13.0000   361.000        TRUE       FALSE        FALSE     FALSE
## 42   499.3462  3164.404        TRUE        TRUE        FALSE      TRUE
## 45    58.0000  3306.000        TRUE        TRUE        FALSE     FALSE
## 47    40.0000  1558.000        TRUE        TRUE        FALSE     FALSE
## 49    34.0000  1122.000        TRUE        TRUE        FALSE     FALSE
## 50    11.0000   256.000        TRUE        TRUE        FALSE      TRUE
## 55   129.0000   613.000        TRUE       FALSE        FALSE     FALSE
## 106 1891.0000 37421.000        TRUE       FALSE        FALSE     FALSE
## 109   32.0000   460.000        TRUE        TRUE        FALSE      TRUE
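
Loading one of these datasets is then a one-liner. Below is a minimal sketch using the atp collection flagged above; it assumes, as is typical for data packages, that collections are lazy-loaded lists of igraph objects.

library(networkdata)
data("atp")               # the collection flagged above (52 networks)
length(atp)               # should report 52
igraph::gorder(atp[[1]])  # number of nodes in the first network
?atp                      # the help file documents the original source to cite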

If you use any of the included datasets, please make sure to cite the appropriate original source, which can be found in the help file for each network.


To leave a comment for the author, please follow the link and comment on their blog: schochastics.

