
Python and R – Part 1: Exploring Data with Datatable


[This article was first published on business-science.io, and kindly contributed to R-bloggers.]

Article Update

Interested in more Python and R tutorials?

👉 Register for our blog to get new articles as we release them.


Introduction

Python’s datatable was launched by H2O two years ago and is still in the alpha stage, with cautions that it may be unstable and that features may be missing or incomplete. We found that it feels very similar to the R version, with a few syntax differences and also some important pieces still to be added (as we will discuss). We could only find a handful of posts showing how to use datatable, and many of the examples were probably not written by regular users of R data.table, focusing instead on its efficiency and ability to scale relative to pandas. We use R data.table every day and love the speed and concise syntax, so this walk-through analysis of the EPA’s Big MT cars data set will focus on the syntax of the most frequent real-world data exploration operations. As for plotnine, it feels mostly seamless coming from ggplot2, apart from a few problems formatting plots in R Markdown.

EPA’s Big MT Dataset

To make it a little more interesting, we will use the Tidy Tuesday Big MT Cars data set, which covers 42,230 new US car models over 36 years. The data dictionary, with 83 variables describing each annual new car model, is found here. Everyone loves cars and remembering historical models, and we have naturally been curious about this data set. After closer analysis, however, we discovered that there are some unfortunate missing pieces.

When we have modeled mtcars, weight (wt) and horsepower (hp), and their interaction, have been the most informative variables for predicting mpg. It would have been interesting to look at the evolution of the mtcars coefficients over time, but unfortunately these variables are not available. In addition, it is hard to get a sense of fleet mileage without the annual unit volume of each new car model. Because of this, it is impossible to know how sales of more fuel-efficient electric vehicles have evolved relative to more fuel-hungry models.

It is difficult to understand why these variables are not included when that information must be available to the EPA, which clearly states on page 6 of the Fuel Economy Guide 2020 that an extra 100 lbs decreases fuel economy by 1%. While the data set is still of interest for practicing data cleaning, it doesn’t look likely that we will be able to replicate mtcars over time unless we can find more variables.

Loading Data with fread

# R Libraries
library("reticulate")
library("skimr")

knitr::opts_chunk$set(
  fig.width = 15,
  fig.height = 8,
  out.width = '100%'
)
# Install Python packages
lapply(c("datatable", "pandas"), function(package) {
  conda_install("r-reticulate", package, pip = TRUE)
})
# Python libraries
from datatable import *
import numpy as np
import re
import pprint

We tried to download both the original zipped data directly from the EPA website (see link below) and the .csv from the Tidy Tuesday website, but were unsuccessful in both cases using the Python and R versions of fread. We were able to download the Tidy Tuesday .csv link with fread in data.table but not in datatable, and the error message didn’t give us enough information to figure out why. The documentation for data.table’s fread is among the most extensive of any function we know, while datatable’s version is still thin so far. In the end, we manually downloaded and unzipped the file from the EPA’s website, and loaded it from our local drive.

# Data dictionary, EPA vehicles zip and Tidy Tuesday vehicles csv links
# Data dictionary: https://www.fueleconomy.gov/feg/ws/index.shtml#fuelType1
# EPA zip data set: https://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
# Tidy Tuesday csv data set: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv

# Load vehicles
big_mt = fread("~/Desktop/David/Projects/general_working/mt_cars/vehicles.csv")

# Dimensions
big_mt.shape
## (42230, 83)

The list of all 83 variables is shown below, and we can see that there are several pertaining to fuel efficiency, emissions, fuel type, range, and volume, as well as some of the same attributes that we all know from mtcars (ie: cylinders, displacement, make, model and transmission). As mentioned, gross horsepower and weight are missing, but carburetors, acceleration and engine shape are also absent. We have all classes of vehicles sold, so we get vehicle class information (VClass) not available in mtcars, which covers only cars. As we will discuss further down, changes to the weight cutoffs of some categories over time make VClass of questionable use.

# Set up pprint params and print
pp = pprint.PrettyPrinter(width=80, compact = True)
pp.pprint(big_mt.names)
## ('barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08', 'city08U',
##  'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2', 'co2A',
##  'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U', 'combA08',
##  'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders', 'displ', 'drive',
##  'engId', 'eng_dscr', 'feScore', 'fuelCost08', 'fuelCostA08', 'fuelType',
##  'fuelType1', 'ghgScore', 'ghgScoreA', 'highway08', 'highway08U', 'highwayA08',
##  'highwayA08U', 'highwayCD', 'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2',
##  'lv4', 'make', 'model', 'mpgData', 'phevBlended', 'pv2', 'pv4', 'range',
##  'rangeCity', 'rangeCityA', 'rangeHwy', 'rangeHwyA', 'trany', 'UCity', 'UCityA',
##  'UHighway', 'UHighwayA', 'VClass', 'year', 'youSaveSpend', 'guzzler',
##  'trans_dscr', 'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA',
##  'evMotor', 'mfrCode', 'c240Dscr', 'charge240b', 'c240bDscr', 'createdOn',
##  'modifiedOn', 'startStop', 'phevCity', 'phevHwy', 'phevComb')

Set-up Thoughts from R Perspective

There were a couple of things about the set-up for datatable that weren’t apparent coming over from data.table as an R user. The first was to use from datatable import * at the outset to avoid having to reference the package short name every time within the frame. From a Python perspective, this is considered bad practice, but we are only going to do it for this one package because it makes us feel more at home. The second was to use export_names() in order to skip having to use the f operator or quotation marks to reference variables. To do this, we created a dictionary mapping the names list from above to the f-expressions returned by export_names(), and then used update on the local environment to assign each dictionary value to its key as a variable. From then on, we can refer to those variables without quotation marks or the f operator (although any new variables created would still need f or quotation marks). We weren’t sure why this is not the default behavior, but it is easily worked around for our purposes. These two possibly not “Pythonic” steps brought the feel of datatable a lot closer to the usual R data.table (ie: without the package and expression short codes).
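
Below is a minimal sketch of that set-up pattern, assuming big_mt has already been loaded with fread(); export_names() returns one f-expression per column, in the same order as names.

# Sketch of the set-up described above (assumes big_mt is already loaded)
from datatable import *   # brings fread, f, by, count, etc. into the namespace

# Map each column name to its f-expression
name_exprs = dict(zip(big_mt.names, big_mt.export_names()))

# At the top level of a script, locals() is the module namespace,
# so update() binds each column name as a plain variable
locals().update(name_exprs)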

Basic Filter and Select Operations

A few lines of some key variables are shown in the code below, and it is clear that they need significant cleaning to be of use. One difference from R data.table can be seen with filtering. Using our year_filter in i (the first slot), the 1,204 2020 models are shown below. Unlike in R data.table, we define the filter as an expression referring to year outside of the frame, and then call it within i of the frame. The columns can be selected within () or [] in j (the second slot) as shown below, and new columns can be created within {}.

# Key variables for year 2020
year_filter = (year == 2020)
print(big_mt[year_filter, (year, make, model, trany, evMotor, VClass)])
##      | year  make     model                        trany                             evMotor             VClass
## ---- + ----  -------  ---------------------------  --------------------------------  ------------------  ----------------------------------
##    0 | 2020  Toyota   Corolla                      Automatic (AV-S10)                                    Compact Cars
##    1 | 2020  Toyota   Corolla Hybrid               Automatic (variable gear ratios)  202V Ni-MH          Compact Cars
##    2 | 2020  Toyota   Corolla                      Manual 6-spd                                          Compact Cars
##    3 | 2020  Toyota   Corolla XSE                  Automatic (AV-S10)                                    Compact Cars
##    4 | 2020  Toyota   Corolla                      Automatic (variable gear ratios)                      Compact Cars
##    5 | 2020  Toyota   Corolla                      Manual 6-spd                                          Compact Cars
##    6 | 2020  Toyota   Corolla XLE                  Automatic (variable gear ratios)                      Compact Cars
##    7 | 2020  Kia      Soul                         Automatic (variable gear ratios)                      Small Station Wagons
##    8 | 2020  Kia      Soul Eco dynamics            Automatic (variable gear ratios)                      Small Station Wagons
##    9 | 2020  Kia      Soul                         Manual 6-spd                                          Small Station Wagons
##   10 | 2020  Kia      Soul                         Automatic (AM-S7)                                     Small Station Wagons
##   11 | 2020  Kia      Sportage FWD                 Automatic (S6)                                        Small Sport Utility Vehicle 2WD
##   12 | 2020  Kia      Sportage FWD                 Automatic (S6)                                        Small Sport Utility Vehicle 2WD
##   13 | 2020  Kia      Telluride FWD                Automatic (S8)                                        Small Sport Utility Vehicle 2WD
##   14 | 2020  Kia      Sportage AWD                 Automatic (S6)                                        Small Sport Utility Vehicle 4WD
##    … |    …  …        …                            …                                 …                   …
## 1199 | 2020  Porsche  718 Cayman GT4               Manual 6-spd                                          Two Seaters
## 1200 | 2020  Bentley  Mulsanne                     Automatic (S8)                                        Midsize Cars
## 1201 | 2020  Porsche  Cayenne e-Hybrid             Automatic (S8)                    99 kW DC Brushless  Standard Sport Utility Vehicle 4WD
## 1202 | 2020  Porsche  Cayenne e-Hybrid Coupe       Automatic (S8)                    99 kW DC Brushless  Standard Sport Utility Vehicle 4WD
## 1203 | 2020  Porsche  Taycan 4S Perf Battery Plus  Automatic (A2)                    120 kW ACPM         Large Cars
##
## [1204 rows x 6 columns]

We usually like to make a quick check of whether there are any duplicated rows across our whole frame, but there isn’t a duplicated() function yet in datatable. According to How to find unique values for a field in Pydatatable Data Frame, the unique() function also doesn’t apply to groups yet. To work around this, the identifying variables would have to be grouped, counted and filtered for counts equal to 1, but we weren’t sure yet exactly which variables to group on. We decided to pipe over to pandas to verify with a simple line of code that there were no duplicates, but hope this function will be added in the future.
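
A rough sketch of that pandas detour might look like the following (assuming big_mt from above):

# Check for fully duplicated rows by piping over to pandas,
# since datatable had no duplicated() at the time of writing
n_dupes = big_mt.to_pandas().duplicated().sum()
print('%d fully duplicated rows found.' % n_dupes)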

Aggregate New Variable and Sort

We can see below that eng_dscr is unfortunately blank 38% of the time, with high cardinality across the remaining levels. A small percentage are marked “GUZZLER” and “FLEX FUELS”. In a few cases, potentially helpful information about the engine, like V-6 or V-8, is included, but with very low frequency and not consistently enough to be worth trying to extract. Another potentially informative variable, trans_dscr, is similarly blank more than 60% of the time. It seems unlikely that we could clean these up enough to make them useful in an analysis, so we will probably have to drop them.

print(big_mt[:, {'percent' : int32(count() * 100/big_mt.nrows) }, by(eng_dscr)]\            [:,:, sort(-f.percent)])
##     | eng_dscr                      percent
## --- + ----------------------------  -------
##   0 |                                    38
##   1 | (FFS)                              20
##   2 | SIDI                               14
##   3 | (FFS) CA model                      2
##   4 | (FFS)      (MPFI)                   1
##   5 | (FFS,TRBO)                          1
##   6 | FFV                                 1
##   7 | (121)      (FFS)                    0
##   8 | (122)      (FFS)                    0
##   9 | (16 VALVE) (FFS)      (MPFI)        0
##  10 | (16-VALVE) (FFS)                    0
##  11 | (16-VALVE) (FFS)      (MPFI)        0
##  12 | (16-VALVE) (FFS,TRBO)               0
##  13 | (164S)     (FFS)      (MPFI)        0
##  14 | (16VALVES) (FFS)                    0
##   … | …                                   …
## 556 | VTEC       (FFS)                    0
## 557 | VTEC-E                              0
## 558 | VTEC-E     (FFS)                    0
## 559 | Z/28                                0
## 560 | new body style                      0
##
## [561 rows x 2 columns]

Separate and Assign New Variables

As shown above, trany contains both the transmission type and the gear speed, so we extracted the column from big_mt with to_list(), drilled down one level, and used regex to pull the transmission and gear information out into trans and gear. Notice that we needed to convert the lists back into columns with dt.Frame before assigning them as new variables in big_mt.

In the third line of code, we felt like we were using R data.table. The {} is used, together with by(trans, gear), to create the new percent variable in-line, without affecting the other variables in big_mt. We tried to round the decimals in percent, but couldn’t figure out how; our understanding is that there is no round() method yet for datatable, so we multiplied by 100 and converted to integer. We again called export_names() so that we could keep using non-standard evaluation with the two new variables.

big_mt['trans'] = Frame([re.sub('[\s\(].*$', '', s) for s in big_mt[:, 'trany'].to_list()[0]])
big_mt['gear'] = Frame([re.sub('A\w+\s|M\w+\s', '', s) for s in big_mt[:, 'trany'].to_list()[0]])
gear, trans = big_mt[:, ('gear', 'trans')].export_names()

# Summarize percent of instances by transmission and speed
print(big_mt[:, { 'percent' : int32(count() * 100 /big_mt.nrows) }, by(trans, gear)]\
            [0:13, :, sort(-f.percent)])
##    | trans      gear                    percent
## -- + ---------  ----------------------  -------
##  0 | Automatic  4-spd                        26
##  1 | Manual     5-spd                        19
##  2 | Automatic  (S6)                          7
##  3 | Automatic  3-spd                         7
##  4 | Manual     6-spd                         6
##  5 | Automatic  5-spd                         5
##  6 | Automatic  (S8)                          4
##  7 | Automatic  6-spd                         3
##  8 | Manual     4-spd                         3
##  9 | Automatic  (variable gear ratios)        2
## 10 | Automatic  (AM-S7)                       1
## 11 | Automatic  (S5)                          1
## 12 | Automatic  7-spd                         1
##
## [13 rows x 3 columns]

Set Key and Join

We wanted to create a Boolean variable to denote whether a vehicle had an electric motor or not. We again used {} to create the variable in the frame, but we don’t think it is possible to update by reference, so we still had to assign it back as is_ev. In the table below, we show the number of electric vehicle models rising from 3 in 1998 to 149 this year.

# Create 'is_ev' within the frame
big_mt['is_ev'] = big_mt[:, { 'is_ev' : evMotor != '' }]
is_ev = big_mt[:, 'is_ev'].export_names()

ann_models = big_mt[:, {'all_models' : count()}, by(year)]
ev_models = big_mt[:, {'ev_models' : count() }, by('year', 'is_ev')]\
                  [(f.is_ev == 1), ('year', 'ev_models')]
ev_models.key = "year"

print(ann_models[:, :, join(ev_models)]\
                [:, { 'all_models' : f.all_models,
                      'ev_models' : f.ev_models,
                      'percent' : int32(f.ev_models * 100 / f.all_models) },
                      by(year)]\
                [(year > 1996), :])
##    | year  all_models  ev_models  percent
## -- + ----  ----------  ---------  -------
##  0 | 1997         762         NA       NA
##  1 | 1998         812          3        0
##  2 | 1999         852          7        0
##  3 | 2000         840          4        0
##  4 | 2001         911          5        0
##  5 | 2002         975          2        0
##  6 | 2003        1044          1        0
##  7 | 2004        1122         NA       NA
##  8 | 2005        1166         NA       NA
##  9 | 2006        1104         NA       NA
## 10 | 2007        1126         NA       NA
## 11 | 2008        1187         23        1
## 12 | 2009        1184         27        2
## 13 | 2010        1109         34        3
## 14 | 2011        1130         49        4
## 15 | 2012        1152         55        4
## 16 | 2013        1184         68        5
## 17 | 2014        1225         77        6
## 18 | 2015        1283         76        5
## 19 | 2016        1262         95        7
## 20 | 2017        1293         92        7
## 21 | 2018        1344        103        7
## 22 | 2019        1335        133        9
## 23 | 2020        1204        149       12
## 24 | 2021          73          6        8
##
## [25 rows x 4 columns]

Using Regular Expressions in Row Operations

Next, we hoped to extract wheel-drive (2WD, AWD, 4WD, etc.) and engine type (ie: V4, V6, etc.) from model. The re_match() function is helpful for filtering rows in i. As shown below, we found almost 17k matches for wheel drive, but only 718 for engine size. Given that we have over 42k rows, we will extract the wheels and give up on the engine data. Even so, wheels may still not have enough data to be a helpful variable.

# Regex match with re_match()
print('%d of rows with wheels info.' % (big_mt[model.re_match('.*(.WD).*'), model].nrows))
## 16921 of rows with wheels info.
print('%d of rows with engine info.' % (big_mt[model.re_match('.*(V|v)(\s|\-)?\d+.*'), model].nrows))
## 718 of rows with engine info.

We used regex to extract whether the model was 2WD, 4WD, etc. as wheels from model, but most of the time it duplicated information we already had in drive. It is possible that our weakness in Python is at play, but this would have been a lot simpler in R, because we wouldn’t have had to iterate over every row in order to extract part of a string with regex. We found that there were some cases where 2WD and 4WD were recorded as 2wd and 4wd. The replace() function was an efficient solution to this problem, replacing matches of ‘wd’ with ‘WD’ over the entire frame.

# Extract 'wheels' and 'engine' from 'model'
reg = re.compile(r'(.*)(.WD|4x4)(.*)', re.IGNORECASE)
big_mt[:, 'wheels'] = Frame([reg.match(s).group(2) if reg.search(s) else '' for s in big_mt[:, model].to_list()[0]])
wheels = big_mt[:, 'wheels'].export_names()

# Fix problem notations
big_mt.replace("\dwd", "\dWD")

# Summarize total count for all years
cols = ['make', 'model', 'cylinders', 'wheels', 'drive']
print(big_mt[(f.wheels != ''), cols]\
            [:, count(), by(f.wheels, cylinders, drive)]\
            [0:14, :, sort(-f.count)])
##    | wheels  cylinders  drive                       count
## -- + ------  ---------  --------------------------  -----
##  0 | 2WD             8  Rear-Wheel Drive             2616
##  1 | 2WD             6  Rear-Wheel Drive             2255
##  2 | 4WD             6  4-Wheel or All-Wheel Drive   1637
##  3 | 4WD             8  4-Wheel or All-Wheel Drive   1481
##  4 | 2WD             4  Rear-Wheel Drive             1063
##  5 | 4WD             4  4-Wheel or All-Wheel Drive    984
##  6 | AWD             6  All-Wheel Drive               771
##  7 | FWD             4  Front-Wheel Drive             638
##  8 | AWD             4  All-Wheel Drive               629
##  9 | 2WD             4  Front-Wheel Drive             508
## 10 | FWD             6  Front-Wheel Drive             497
## 11 | 2WD             6  Front-Wheel Drive             416
## 12 | AWD             4  4-Wheel or All-Wheel Drive    368
## 13 | 4WD             8  4-Wheel Drive                 361
##
## [14 rows x 4 columns]

Reshaping

There was no such thing as a 4-wheel drive SUV back in the 80’s, and we remember the big 8-cylinder Oldsmobiles and Cadillacs, so we were curious how these models evolved over time. datatable doesn’t yet have dcast() or melt(), so we had to pipe out with to_pandas() and then use pivot_table(). It’s likely that many of the models where wheel drive was unspecified were 2WD, which is still the majority of models. We would have liked to show these as whole numbers, and there is a workaround in datatable to convert to integer, but once we pivoted in pandas, it reverted to float. We can see the first AWD models starting in the late 80s, and the number of 8-cylinder cars falling by half. There are a lot fewer annual new car models now than in the 80s, but we were surprised by how many fewer 4-cylinder models there are.

# Summarize by year again, having to move to pandas to pivot
print(big_mt[:, count(), by(f.wheels, year)].to_pandas().pivot_table(index='wheels', columns='year', values='count'))
## year      1984    1985   1986   1987   1988  ...   2017   2018   2019   2020  2021
## wheels                                       ...
##         1184.0  1057.0  698.0  732.0  677.0  ...  798.0  821.0  797.0  706.0  46.0
## 2WD      472.0   430.0  338.0  310.0  262.0  ...   89.0   97.0  110.0   94.0   4.0
## 4WD      304.0   208.0  174.0  201.0  187.0  ...  107.0  119.0  131.0  131.0   5.0
## 4x4        NaN     NaN    NaN    2.0    2.0  ...    1.0    1.0    NaN    NaN   NaN
## AWD        NaN     NaN    NaN    2.0    2.0  ...  186.0  197.0  195.0  180.0  10.0
## FWD        1.0     4.0    NaN    NaN    NaN  ...  104.0   96.0   88.0   78.0   5.0
## RWD        3.0     2.0    NaN    NaN    NaN  ...    8.0   13.0   14.0   15.0   3.0
##
## [7 rows x 38 columns]
print(big_mt[:, count(), by(cylinders, year)].to_pandas().pivot_table(index='cylinders', columns='year', values='count'))
## year         1984   1985   1986   1987   1988  ...   2017   2018   2019   2020  2021
## cylinders                                      ...
## 2.0           6.0    5.0    1.0    3.0    3.0  ...    1.0    2.0    2.0    2.0   NaN
## 3.0           NaN    6.0    9.0   11.0   13.0  ...   26.0   22.0   22.0   19.0   7.0
## 4.0        1020.0  853.0  592.0  625.0  526.0  ...  563.0  590.0  585.0  523.0  44.0
## 5.0          39.0   20.0   18.0   26.0   17.0  ...    1.0    2.0    2.0    2.0   NaN
## 6.0         457.0  462.0  323.0  296.0  325.0  ...  416.0  449.0  440.0  374.0  17.0
## 8.0         439.0  351.0  263.0  282.0  241.0  ...  211.0  219.0  224.0  222.0   4.0
## 10.0          NaN    NaN    NaN    NaN    NaN  ...    7.0    8.0    4.0    6.0   NaN
## 12.0          3.0    2.0    3.0    4.0    5.0  ...   38.0   27.0   20.0   21.0   1.0
## 16.0          NaN    NaN    NaN    NaN    NaN  ...    NaN    1.0    1.0    1.0   NaN
##
## [9 rows x 38 columns]

Combining Levels of Variables with High Cardinality

With 35 distinct levels often referring to similar vehicles, VClass also needed to be cleaned up. Even in R data.table, we have been keenly awaiting the implementation of fcase, a data.table version of the dplyr case_when() function for nested control-flow statements. We made a separate 16-line function to lump factor levels (not shown, though a hypothetical sketch follows below). In the first line below, we created the vclasses list to drill down on the VClass tuple elements as strings. In the second line, we iterated over the resulting strings to strip the wheel-drive suffix with a list comprehension. We printed out the result of our much smaller list of lumped factors, but there are still problems with it. The EPA changed the cutoff for a “Small Pickup Truck” from 4,500 to 6,000 lbs in 2008, and also used a higher cutoff for “small” SUVs starting in 2011. This will make it pretty hard to use VClass as a consistent variable for modeling, at least for pickups and SUVs. As noted earlier, if we had a weight field, we could easily have worked around this.
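
For illustration only, here is a hypothetical sketch of such a lumping helper; the actual 16-line collapse_vclass() is not shown in the post, so the groupings below are guesses based on the summarized output further down.

# Hypothetical sketch of a factor-lumping helper; the real collapse_vclass()
# used in the post is not shown, so these groupings are illustrative only
def collapse_vclass(vclass):
    if vclass in ('Compact Cars', 'Subcompact Cars', 'Minicompact Cars', 'Two Seaters'):
        return 'Small Car'
    elif vclass in ('Midsize Cars', 'Midsize Station Wagons'):
        return 'Midsize Car'
    elif vclass in ('Large Cars',):
        return 'Large Car'
    elif vclass in ('Small Pickup Trucks', 'Small Sport Utility Vehicle'):
        return 'Small Pickup and SUV'
    elif vclass.startswith('Vans'):
        return 'Vans'
    else:
        return vclass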

# Clean up vehicle type from VClass
vclasses = [tup[0] for tup in big_mt[:, 'VClass'].to_tuples()]
big_mt['VClass'] = Frame([re.sub('\s\dWD$|\/\dwd$|\s\-\s\dWD$', '', x) if re.search(r'WD$|wd$', x) is not None else x for x in vclasses])
big_mt['VClass'] = Frame([collapse_vclass(line[0]) for line in big_mt[:, 'VClass'].to_tuples()])

# Show final VClass types and counts
print(big_mt[:, count(), VClass][:, :, sort(-f.count)])
##    | VClass                   count
## -- + -----------------------  -----
##  0 | Small Car                16419
##  1 | Midsize Car               5580
##  2 | Standard Pickup Trucks    4793
##  3 | Sport Utility Vehicle     4786
##  4 | Large Car                 2938
##  5 | Small Pickup and SUV      2937
##  6 | Special Purpose Vehicle   2457
##  7 | Vans                      1900
##  8 | Minivan                    420
##
## [9 rows x 2 columns]

Selecting Multiple Columns with Regex

In the chunk below, we show how to select columns from the big_mt names tuple by creating the measures selector, using regex matches for the key identifier columns and for the integer mileage columns matching ‘08’. This seemed complicated, and we couldn’t do it in-line within the frame as we would have with data.table’s .SDcols = patterns(). We also wanted to move the identifier columns (year, make and model) to the left side of the table, but couldn’t find an equivalent of the setcolorder function. There is documentation about multi-column selection, but we couldn’t figure out an efficient way to make it work. We show the frame with the year_filter which we set up earlier.

# Regex search for variable selection
measures = [name for name in big_mt.names if re.search(r'make|model|year|08$', name)]

# Print remaining cols with measures filter
print(big_mt[year_filter, measures])
##      | barrels08  barrelsA08  city08  cityA08  comb08  combA08  fuelCost08  fuelCostA08  highway08  highwayA08  make     model                        year
## ---- + ---------  ----------  ------  -------  ------  -------  ----------  -----------  ---------  ----------  -------  ---------------------------  ----
##    0 |   9.69441       0          31        0      34        0         800            0         40           0  Toyota   Corolla                      2020
##    1 |   6.33865       0          53        0      52        0         500            0         52           0  Toyota   Corolla Hybrid               2020
##    2 |  10.3003        0          29        0      32        0         850            0         36           0  Toyota   Corolla                      2020
##    3 |   9.69441       0          31        0      34        0         800            0         38           0  Toyota   Corolla XSE                  2020
##    4 |   9.98818       0          30        0      33        0         800            0         38           0  Toyota   Corolla                      2020
##    5 |   9.98818       0          29        0      33        0         800            0         39           0  Toyota   Corolla                      2020
##    6 |  10.3003        0          29        0      32        0         850            0         37           0  Toyota   Corolla XLE                  2020
##    7 |  10.987         0          27        0      30        0         900            0         33           0  Kia      Soul                         2020
##    8 |  10.6326        0          29        0      31        0         900            0         35           0  Kia      Soul Eco dynamics            2020
##    9 |  12.2078        0          25        0      27        0        1000            0         31           0  Kia      Soul                         2020
##   10 |  11.3659        0          27        0      29        0         950            0         32           0  Kia      Soul                         2020
##   11 |  12.6773        0          23        0      26        0        1050            0         30           0  Kia      Sportage FWD                 2020
##   12 |  14.3309        0          20        0      23        0        1200            0         28           0  Kia      Sportage FWD                 2020
##   13 |  14.3309        0          20        0      23        0        1200            0         26           0  Kia      Telluride FWD                2020
##   14 |  14.3309        0          22        0      23        0        1200            0         26           0  Kia      Sportage AWD                 2020
##    … |         …           …       …        …       …        …           …            …          …           …  …        …                               …
## 1199 |  17.3479        0          16        0      19        0        2000            0         23           0  Porsche  718 Cayman GT4               2020
## 1200 |  27.4675        0          10        0      12        0        3150            0         16           0  Bentley  Mulsanne                     2020
## 1201 |  10.5064        0.426      20       45      21       41        1800         1400         22          37  Porsche  Cayenne e-Hybrid             2020
## 1202 |  10.5064        0.426      20       45      21       41        1800         1400         22          37  Porsche  Cayenne e-Hybrid Coupe       2020
## 1203 |   0.294         0          68        0      69        0         950            0         71           0  Porsche  Taycan 4S Perf Battery Plus  2020
##
## [1204 rows x 13 columns]

Selecting Columns and Exploring Summary Data

We looked for a Python version of skimr, but it doesn’t seem like there is a similar library (as is often the case). We tried out pandas profiling, but that had a lot of dependencies and seemed like overkill for our purposes, so we decided to use skim_tee on the table in a separate R chunk (below). It was necessary to convert to pandas in the Python chunk first, because we couldn’t figure out how to translate a datatable into a data.frame via reticulate in the R chunk.

When we did convert, we discovered there were some problems mapping NA’s, which we will show below. We suspect it isn’t possible yet to pass a datatable to data.table, and this might be the first functionality we would vote to add. There is a sizable community of data.table users who are used to the syntax and who, like us, might be looking to port over to Python (rather than learn pandas directly). As reticulate develops, opening this door seems to make a lot of sense. Below, we again relied on the variables exported with export_names() for non-standard evaluation within the frame, and then filtered for the 21 columns we wanted to keep.

# List of cols to keep
cols = ['make', 'model', 'year', 'city08', 'highway08', 'comb08', 'VClass',
        'drive', 'fuelType1', 'hlv', 'hpv', 'cylinders', 'displ', 'trans',
        'gear', 'wheels', 'is_ev', 'evMotor', 'guzzler', 'tCharger', 'sCharger']

# Select cols and create pandas version
big_mt_pandas = big_mt[:, cols].to_pandas()
# Skimr
skim_tee(py$big_mt_pandas)
## ── Data Summary ────────────────────────
##                            Values
## Name                       data
## Number of rows             42230
## Number of columns          21
## _______________________
## Column type frequency:
##   character                12
##   logical                  1
##   numeric                  8
## ________________________
## Group variables            None
##
## ── Variable type: character ────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate   min   max empty n_unique whitespace
##  1 make                  0             1     3    34     0      137          0
##  2 model                 0             1     1    47     0     4217          0
##  3 VClass                0             1     4    23     0        9          0
##  4 drive                 0             1     0    26  1189        8          0
##  5 fuelType1             0             1     6    17     0        6          0
##  6 trans                 0             1     0     9    11        3          0
##  7 gear                  0             1     0    22    11       34          0
##  8 wheels                0             1     0     3 25265        7          0
##  9 evMotor               0             1     0    51 41221      171          0
## 10 guzzler               0             1     0     1 39747        4          0
## 11 tCharger              0             1     0     1 34788        2          0
## 12 sCharger              0             1     0     1 41352        2          0
##
## ── Variable type: logical ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean count
## 1 is_ev                 0             1 0.0239 FAL: 41221, TRU: 1009
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate    mean    sd    p0    p25   p50    p75
## 1 year                  0         1     2002.   11.4   1984 1991    2003 2012
## 2 city08                0         1       18.5   8.36     6   15      17   21
## 3 highway08             0         1       24.6   8.03     9   20      24   28
## 4 comb08                0         1       20.8   8.06     7   17      20   23
## 5 hlv                   0         1        1.99  5.92     0    0       0    0
## 6 hpv                   0         1       10.2  27.9      0    0       0    0
## 7 cylinders           240         0.994    5.71  1.76     2    4       6    6
## 8 displ               238         0.994    3.29  1.36     0    2.2     3    4.3
##     p100 hist
## 1 2021   ▇▅▆▆▇
## 2  150   ▇▁▁▁▁
## 3  132   ▇▁▁▁▁
## 4  141   ▇▁▁▁▁
## 5   49   ▇▁▁▁▁
## 6  195   ▇▁▁▁▁
## 7   16   ▇▇▅▁▁
## 8    8.4 ▁▇▅▂▁

In the result above, we see a lot of challenges if we had hoped to have appropriate data to build a model to predict mpg over time. Many variables, such as evMotor, tCharger, sCharger and guzzler, are only available in a small number of rows. When we set out on this series, we hoped we would be able to experiment with modeling gas mileage for every year just like mtcars, but that seems unlikely based on the available variables.

Conclusion

It took us a couple of months to get up and running with R data.table, and even with daily usage, we are still learning its nuances a year later. The syntax can be a little confusing at first, and it is less well documented than dplyr or pandas, but we think the up-front investment in learning it has been worth it. We learned a lot about data.table from a few blog posts such as Advanced tips and tricks with data.table and A data.table and dplyr tour. The goal of this post is to help similarly fill the gap for datatable.

Python datatable is promising, and we are grateful for it as familiar territory while we learn Python. We can’t tell how much of our difficulty has been because the package is not as mature as data.table, and how much is just our inexperience with Python. The main friction points were the need to manually set up variables for non-standard evaluation, having to revert to pandas to accomplish certain tasks (ie: reshaping), and the challenges of extracting and filtering data from nested columns. It was also not always easy to navigate the documentation, which is still thin in places. Finally, the ability to seamlessly translate between a datatable and a data.table would be much appreciated.

Author: David Lucy, Founder of Redwall Analytics. David spent 25 years working in institutional global equity research at several top investment banking firms.



Little useless-useful R function – Full moon finder


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

Another one from the series [1, 2, 3, 4] of useless functions. This one is particularly useful when the weather is overcast and you are feeling moody, slightly off-balance, and behaving erratically as a result.

The full moon function – or should we call it the fool moon function, due to its simplistic and approximate nature – calculates the difference between the date (only the date; no time, no long/lat coordinates) and the Julian constant. Should you be using a different calendar, don’t run the function; just look out the window.

The function is based on a generalized equation for Julian day numbers and months. Another approach would be to calculate the RMSE between the predicted values and the realized lunar behavior (the lunatic start time). In this case – reverse engineering – you would use the approximate date/time of the first new moon after a given date if the synodic period were constant. The number obtained is then checked empirically by recursively solving for the next “possible date/time” of lunar behavior and calculating the prediction error. Minimizing the RMSE between the full moon dates/times predicted by the formula and the actual full moon dates/times over the next 10 years yields something like this.

And the resulting formula is pretty useless as well (using the Unix epoch time format and the sine function), but it is not wacky enough yet (derived with help from WolframAlpha):

nn <- 1604007055 # today - now

# idea of calculation / rough approximation
((                1 * (nn/2551442844 - 0.228535)
  + 0.0059199782 * sin(nn/5023359217 + 3.1705094)
  - 0.0038844129 * sin(nn/4374537912 + 2.0017265)
  + 0.0176727762 * sin(nn/3789212683 - 1.5388144)
  - 0.0004144842 * sin(nn/1385559605 - 1.2363316)) %% 1)

Final version of R function is:

IsItFullMoon <- function() {
  # When the day within the lunar cycle is close to 14, the moon is getting full,
  # based on the Hijri (Kuwait) calendar
  da <- Sys.Date()
  julianConstant <- 2451549.5
  cycle <- 29.53
  y <- as.integer(format(da, format = "%Y"))
  m <- as.integer(format(da, format = "%m"))
  d <- as.integer(format(da, format = "%d"))

  # If the month is January or February, subtract 1 from the year and add 12 to the month
  if (m == 1 | m == 2) {
    y <- y - 1
    m <- m + 12
  }

  a <- y / 100
  b <- a / 4
  c <- 2 - a + b
  e <- 365.25 * (y + 4716)
  f <- 30.6001 * (m + 1)
  jd <- c + d + e + f - 1524.5
  new_moon <- jd - julianConstant
  semi <- new_moon / cycle
  dec <- semi %% 1

  if (dec * cycle >= 14.50 & dec * cycle <= 15.50) {
    print("Full Moon. Yay!")
  } else {
    print("You should wait. Go back to sleep")
  }
}

IsItFullMoon()

As of writing this blog post, the function returns:

[1] "You should wait. Go back to sleep"

and I think I will do just that – go to sleep. As always, the wacky version of the R code is available on GitHub.

Happy R-ing!!!


DN Unlimited 2020: Europe’s largest data science gathering | Nov 18 – 20 online


[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]

  • The DN Unlimited Conference will take place online for the first time this year
  • More than 100 speakers from the fields of AI, machine learning, data science, and technology for social impact, including from The New York Times, IBM, Bayer, and Alibaba Cloud
  • Fully remote networking opportunities via a virtual hub

Europe’s largest data science community launches new digital platform for this year’s conference

The Data Natives Conference, Europe’s biggest data science gathering, will take place virtually on November 18-20, 2020, inviting data scientists, entrepreneurs, corporates, academia, and business innovation leaders to connect. The conference’s mission is to connect data experts, inspire them, and let people become part of the equation again. With its digital networking platform, DN Unlimited expects to reach a new record of 5,000+ participants. Visitors can expect keynotes and panels from industry experts and a unique opportunity to start new collaborations during networking and matchmaking sessions.

In 2019, the sold-out Data Natives conference gathered over 3,000 data and technology professionals and decision-makers from over 30 countries, including 29 sponsors, 45 community and media partners, and 176 speakers. The narrative of the DN Unlimited Conference 2020 focuses on assisting the digital transformation of businesses, governments, and communities by offering a fresh perspective on data technologies – from empowering organizations to revamp their business models to shedding light on social inequalities and challenges like climate change and healthcare accessibility.

Data science, new business models and the future of our society

In spring 2020, the Data Natives community of 80,000 data scientists mobilised to tackle the challenges brought by the pandemic – from the shortage of medical equipment to remote care – in a series of Hackcorona and EUvsVirus hackathons. Through the collaboration of governments such as the Greek Ministry for Digital Governance, institutions such as the Charité, and experts from all over Europe, over 80 data-driven solutions have been developed. The DN Unlimited conference will continue to facilitate similar cooperation.

The current crisis demonstrates that only through collaboration can businesses thrive.

While social isolation may be limiting traditional networking opportunities, we are more equipped than ever before to make connections online. “…The ability to connect to people and information instantly is so common now. It’s just the beginning of an era of even more profound transformation. We’re living in a time of monumental change. And as the cloud becomes ambiguous, it’s literally rewriting entire industries” – says Gretchen O’Hara, Microsoft VP; DN Unlimited & Humanaize Open Forum speaker.

The crisis has called for a digital realignment from both companies and institutions. Elena Poughia, the Founder of Data Natives, perceives the transformation as follows:

“It’s not about deploying new spaces via data or technology – it’s about amplifying human strengths. That’s why we need to continue to connect with each other to pivot and co-create the solutions to the challenges we’re facing. These connections will help us move forward.” 

The DN Unlimited Conference will bring together data & technology leaders from across the globe – Christopher Wiggins (Chief Data Scientist, The New York Times), Lubomila Jordanova (CEO & Founder, Plan A), Angeli Moeller (Bayer AG, Head Global Data Assets), Jessica Graves (Founder & Chief Data Officer, Sefleuria) and many more will take on the virtual stages to talk about the growing urge for global data literacy, resources for improving social inequality and building a data culture for agile business development. 





beakr – A small web framework for R


[This article was first published on R, and kindly contributed to R-bloggers.]

What is beakr?

beakr is an unopinionated and minimalist web framework for developing and deploying web services with R. It is designed to make it as simple as possible for data scientists and engineerings to quickly write web applications, services, and APIs without worrying about lower-level details or high-level side-effects. In other words, beakr is made to be explicit,robust, and scalable– and the batteries are not included.

beakr is built on the httpuv package, which itself is built on top of the libuv and http-parser C libraries. So while beakr will be stable for most use cases, it is not necessarily recommended for building extensive web applications and is not designed to be an especially performant web framework. If you’re looking for a full-featured web framework, see Shiny, django, etc. beakr is inspired by the minimalist and massively expandable frameworks offered by Express.js and Flask.

beakr takes a programmatic approach to creating web services, and the framework focuses on just the core HTTP protocol functionality (e.g. POST, GET, PUT, etc.). To illustrate, we can create a “Hello World” application with beakr in only a few lines of code.

library(beakr)

beakr <- newBeakr()

beakr %>%
  httpGET("/hello", function(req, res, err) {
    "Hello, world!"
  }) %>%
  listen()

Running this script will serve the application (by default on host = 127.0.0.1 and port = 25118). When a client requests the path /hello, a GET request is made to the server. The beakr instance is listening and handles it with the defined server response – in this case, an HTML page with the plain text “Hello, world!”.

So what is beakr good for?

beakr is simple and flexible and, as such, is best suited to tasks where that simplicity is an advantage. It is a great tool to quickly and easily stand up web services without the limitations of heavier-weight contenders, so you can worry less about learning a new framework and more about what the application should do. A few examples could be:

  • Building RESTful APIs
  • Building web-applications and services

Examples

REST API

It is increasingly important for data scientists to share operational data across computers, languages, and pipelines. beakr makes it easy to create micro-services like RESTful (REpresentational State Transfer) APIs and other infrastructure that are flexible and easy to deploy, scale, and maintain.

For example, we can easily build a RESTful API with two endpoints (/iris/Sepal.Length and /iris/Sepal.Width) that respond with JSON data retrieved from Fisher’s Iris dataset.

library(beakr)

data("iris")

beakr <- newBeakr()

beakr %>%
  httpGET("/iris/Sepal.Width", function(req, res, err) {
    res$json(iris$Sepal.Width)
  }) %>%
  httpGET("/iris/Sepal.Length", function(req, res, err) {
    res$json(iris$Sepal.Length)
  }) %>%
  listen()

Now a client (or a web browser) can access the JSON data at 127.0.0.1:25118/iris/Sepal.Length and 127.0.0.1:25118/iris/Sepal.Width.
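
Because the service speaks plain HTTP and JSON, it can be consumed from any language. As a rough sketch (assuming the app above is running locally on the default host and port), a Python client using the requests library might fetch one of the endpoints like this:

# Sketch of a Python client for the beakr service above
# (assumes it is running locally on the default host and port)
import requests

resp = requests.get("http://127.0.0.1:25118/iris/Sepal.Length")
sepal_lengths = resp.json()
print(len(sepal_lengths))   # expect 150 values if the full column is returned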

Deploy a model

beakr can also be used to deploy models and algorithms as micro-services. For example, we can deploy a simple KNN model that returns the flower species (from the Iris dataset) to a client as plain text. Using the POST method, we can create a beakr instance that handles receiving a JSON object containing the sepal and petal lengths and widths (sl, sw, pl, pw, respectively).

First we can define and train a simplistic K-nearest neighbors flower model using the caret package.

library(caret)

# Load the Iris data set
data('iris')

# Train using KNN
knn_model <- train(
  Species ~ .,
  data = iris,
  method = 'knn',
  trControl = trainControl(method = 'cv', number = 10),
  metric = 'Accuracy'
)

We can create and expose normal R functions using beakr‘s built-in decorate() function, which easily prepares functions to accept parameters and respond as you’d expect. With this in mind, we can write a simple function that accepts our petal and sepal parameters and returns the predicted species using the model defined above.

# Function to predict the species using the trained model
predict_species <- function(sl, sw, pl, pw) {
  test <- data.frame(
    Sepal.Length = as.numeric(sl),
    Sepal.Width = as.numeric(sw),
    Petal.Length = as.numeric(pl),
    Petal.Width = as.numeric(pw),
    Species = NA
  )
  predict(knn_model, test)
}

library(beakr)

# Use beakr to expose the model in the "/predict-species" url path.
#   See help("decorate") for more info about decorating functions.
newBeakr() %>%
  httpPOST("/predict-species", decorate(predict_species)) %>%
  handleErrors() %>%
  listen(host = "127.0.0.1", port = 25118)

By sending a POST request to 127.0.0.1:25118/predict-species with a JSON object containing labeled numeric data we can see it responds with an answer.

$ curl -X POST http://127.0.0.1:25118/predict-species \
    -H 'content-type: application/json' \
    -d '{ "sl": 5.3, "sw": 4, "pl": 1.6, "pw": 0.2 }'
> setosa

beakr really shines in these sorts of tasks. At Mazama Science, we use beakr to create a large variety of web services that perform tasks like:

  • creation of raw and processed data products
  • creation of data graphics for inclusion in other interfaces
  • automated report generation

Any task that you might create a script for is a candidate for conversion into a web service. beakr has become a proven tool that we at Mazama Science are excited to share!

You can install beakr from CRAN with: install.packages('beakr').


Roger Bivand – Applied Spatial Data Analysis with R – retrospect and prospect


[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers.]

A month ago we finished the Why R? 2020 conference. We had the pleasure of hosting Roger Bivand, a professor at the Norwegian School of Economics and a member of the R Foundation. This post contains a biography of the speaker and an abstract of his talk: Applied Spatial Data Analysis with R – retrospect and prospect.

When we began over 20 years ago, spatial data was usually found in proprietary software, usually geographical information systems, and most positional data was very hard to acquire. Statistics for spatial data existed, but largely without convenient access to positional data. Using S-Plus with ArcView (or GRASS) was popular, but costly; for teaching and field research, R and open source geospatial applications and software libraries provided a feasible alternative. Starting from writing classes for spatial data in R in 2003, we first used the classes in our own analysis packages; our book first appeared in 2008. At that time, a handful of packages used these classes, but now the R spatial cluster of CRAN packages using spatial classes is almost 900 strong. This places a burden of responsibility on us, to juggle the needs of these packages against the advances in crucial geospatial libraries and changes in industry standards for representing positional data. The early insights into why statistics for spatial (and spatio-temporal) data are challenging remain equally valid today; more data are available, but spatial patterning and scale remain interesting problems.

Roger Bivand, an active R user and contributor since 1997, is a professor at the Norwegian School of Economics. Roger has contributed to and led the development of several of the core R packages for spatial analysis, including rgdal, sp, sf and maptools. His contributions helped advance the status of R as the tool for spatial statistics. His involvement in the open software community is exemplified by his participation in the work of the R Foundation and his service as editor of the R Journal from 2015 to 2018. Roger’s passion for spatial analysis has resulted not only in numerous scientific publications, but also in the authorship of the Applied Spatial Data Analysis with R book (https://asdar-book.org/) and the OpenGeoHub Life Achievement award.


2020 Table Contest Deadline Extended


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

The original deadline for the 2020 Table Contest was scheduled for October 31, 2020.

We know you’ve been busy and that’s okay. Because we’ve had a number of requests for extensions — including some interest in summarizing election data — the deadline has been extended by two weeks to November 14, 2020.

If you have already submitted an entry, you are free to update it up to the closing date. Find all table contest submissions on RStudio Community.


Rmd-based Reports with R Code Appendices


[This article was first published on TeachR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The PDF file accompanying this post was created by the attached Rmd file. You can see how they work together. If you like this concept, use the Rmd file as a template for your next report!

Intro

R markdown files allow you to show code and outputs in the order they were run. However, in a class I’m taking currently, our professor doesn’t want to see our R code until the end of the report, in an appendix. So, she has said that our reports should not be compiled from R markdown files. But, there is a way to create PDF reports from R markdown files where the code echoing is suppressed and instead shown in an appendix! The Rmd file above is an example of that.

I’ll show a bunch of example code chunks so you can see some different options. The inline images below are not part of the rendered output – they are screenshots from the Rmd file.

Setup Chunk

Please notice above the setup chunk. There are a couple of things I want to point out:

  • The chunk options are very different from what you are used to
  • Every package required anywhere in the report is loaded right up front
  • The setup chunk is not included in the appendix! It is reserved solely for code that is required to facilitate document generation

So, why have I put library statements there? You’ll see that the library statements are wrapped in suppressPackageStartupMessages and that I’ve passed a few extra parameters that you may not have seen before. This means that packages will not produce any pesky output in your report when they are loaded. However, because we don’t want to include the setup chunk in the appendix, you will want to “re-load” every package within code chunks that will end up in the appendix.

A Note About Default Chunk Options

You can ignore this section on the first read. Just follow the conventions outlined below for the different examples.

Default option    | Why?
------------------|----------------------------------------------------------------
eval = TRUE       | All R code is executed by default
echo = FALSE      | Do not show R code at the time it is run
message = FALSE   | Do not show any messages
warning = FALSE   | Do not show any warnings
error = FALSE     | Do not show any errors
purl = FALSE      | By default, code chunks will not appear in the appendix. You will have to explicitly mark the ones you want to include
results = 'hide'  | You are probably used to code chunks outputting something to include in your report. If you want this, you'll have to explicitly override this option!
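
Putting those defaults together, here is a minimal sketch of what such a setup chunk could look like (the chunk label and package list are placeholders, not the exact contents of the attached Rmd file):

```{r setup, include=FALSE}
# Set the report-wide chunk defaults described in the table above
knitr::opts_chunk$set(
  eval = TRUE,       # run all R code
  echo = FALSE,      # but do not show it where it runs
  message = FALSE,   # hide messages
  warning = FALSE,   # hide warnings
  error = FALSE,     # hide errors
  purl = FALSE,      # keep chunks out of the appendix unless purl=TRUE is set
  results = 'hide'   # hide printed results unless overridden
)

# Load every package used anywhere in the report, quietly
suppressPackageStartupMessages({
  library(tidyverse, warn.conflicts = FALSE, quietly = TRUE)
  library(alrtools, warn.conflicts = FALSE, quietly = TRUE)
})
```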

Examples of Different Configurations

Example 1: Data Prep Chunk

You’ll use this kind of code chunk when you are prepping data for use in other chunks, but there won’t be any output to the report. You want the code in the appendix so the reader can reproduce your work, but there isn’t any output yet.

Chunk options:

  • Default options apply
  • purl=TRUE means “include in appendix” (a minimal chunk header is sketched below)
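
A minimal sketch of such a chunk, reusing the data-read step from the appendix at the end of this post (the chunk name is hypothetical):

```{r data-prep, purl=TRUE}
# Re-load the packages this chunk needs so the appendix stands on its own
library(tidyverse)

# Prep the data; with results='hide' as the default, nothing shows up in the report
boston <- readr::read_csv('crime-training-data_modified.csv')
```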

Example 2: Content Chunk

The option results='markup' is what you are used to working with in Rmd files. There are other values you can set results to, but you probably won’t use them very often. (Except for asis, and you will see an example of that below when we bootstrap in the appendix.)
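
As a sketch, a content chunk might only need to override that one option (the chunk name is hypothetical; the model code is taken from the appendix below):

```{r model-diagnostics, purl=TRUE, results='markup'}
mod1 <- lm(medv ~ age + rm, data = boston)
par(mfrow = c(2, 2))
plot(mod1)
```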

Example 3: kable Output

Let’s say you want to put some table output in your report. But, you want the reader, when they run your code, to be able to get readable output. (Nicely formatted stuff will have a lot of extra tags around it and isn’t always the easiest to read.)

Table: Crime (1 = Yes, 0 = No) versus Average Room Counts

        4    5    6    7    8    9
  0     0   41   48   73   12    0
  1     4   33  136   42   11    3
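
One plausible way to arrange this (a sketch, not necessarily the exact chunks in the attached Rmd): build and print the table in a purl=TRUE chunk so the appendix code gives readable console output, and render the pretty version in a chunk that keeps the default purl=FALSE:

```{r crime-vs-rooms, purl=TRUE}
# Readable when the reader runs the code themselves
tbl <- table(boston$target, round(boston$rm, 0))
print(tbl)
```

```{r crime-vs-rooms-kable, results='asis'}
# Nicely formatted for the report; excluded from the appendix by default
knitr::kable(tbl, caption = "Crime (1 = Yes, 0 = No) versus Average Room Counts")
```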

Example 4: Experiments

You’re going to try lots of stuff when you are writing your report. But, why should you have to delete the code just because it ended up not being needed?

Remember purl=FALSE and results='hide' are set by default.

Example 5: Code for the Reader

The following chunk won’t do anything for your report or analysis, but will show up in the appendix. This might be used for something that you experimented with and talked about, but doesn’t have any content for your report. The reader might want to see what you tried if you’ve mentioned it in your write-up.

Appendix 1: R Code for Analysis

And, here is the appendix. I haven’t figured out how to get the file name of the Rmd file knitr is compiling, so that is hardcoded. (It’s the name of this Rmd file!)
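
For reference, one common pattern for this kind of appendix chunk is to extract the code with knitr::purl(), which respects the purl chunk option, and display it without re-running it; the file name is hardcoded, as mentioned above (a sketch, not the exact chunk from the attached Rmd):

```{r appendix, echo=TRUE, eval=FALSE, code=readLines(knitr::purl('my-report.Rmd', documentation = 0))}
```

Because echo=TRUE and eval=FALSE, the extracted code is shown but not executed again, and only chunks marked with purl=TRUE make it into the appendix. Depending on your knitr version you may also need options(knitr.duplicate.label = 'allow') in the setup chunk for purl() to run during knitting.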

# ============================
# Example 1: data prep chunk
# ============================

# Re-list the packages your code uses
# You don't need to list knitr unless that is required for reproducing your work
library(alrtools)
library(tidyverse)

# Notice that I've put a big banner comment at the beginning of this
# Since I am including it in the appendix, I want the reader to be
# able to know what section of the report the code applies to

# If you are using functions the reader may not have seen before
# it's not a bad idea to preface them with the package they come from.
# readr was loaded as part of the tidyverse
# So the "namespacing" is not required, only helpful
boston <- readr::read_csv('crime-training-data_modified.csv')

# ============================
# Example 2: data prep chunk
# ============================
mod1 <- lm(medv ~ age + rm, data = boston)
par(mfrow = c(2, 2))
plot(mod1)

# ============================
# Example 3: `kable` output
# ============================

# This shows a table of response variable versus rounded room counts
# But, it's not pretty
tbl <- table(boston$target, round(boston$rm, 0))
print(tbl)

# ============================
# Example 5: code for the reader
# ============================
library(tree)
tree1 <- tree::tree(medv ~ ., data = boston)
par(mfrow = c(1, 1))
plot(tree1, type = 'uniform')
text(tree1, pretty = 5, col = 'blue', cex = 0.8)

BASIC XAI with DALEX — Part 2: Permutation-based variable importance


[This article was first published on R in ResponsibleML on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

BASIC XAI

BASIC XAI with DALEX — Part 2: Permutation-based variable importance

Introduction to model exploration with code examples for R and Python.

By Anna Kozak

Welcome to the “BASIC XAI with DALEX” series.

In this post, we present permutation-based variable importance, a model-agnostic method that can be used with any type of model.

You can find the first part of this series here: BASIC XAI with DALEX — Part 1: Introduction.

So, shall we start?

First — Why do we need the importance of variables in the model?

When building a model, we often ask ourselves: which variables are the most important? What should we pay attention to? When modeling the price of a property, we would like to know what has the biggest impact on the price: is it the area or maybe the year of construction? When modeling credit risk, we consider what influences whether or not customers get a loan.

Among the methods of variables importance evaluation, we can distinguish a few methods for specific groups of models. For example:

  • linear models

We can easily identify which attributes are important in additive models such as linear and logistic regression. They are just model coefficients.

  • tree-based models

We can use a method based on the Gini impurity calculated for each tree, which is then averaged across trees. We can then compare the relative importance of the averaged Gini impurity scores.

  • random forest

We can use a method based on out-of-bag data.

  • other models

Even though some models offer model-specific measures of variable importance, we cannot compare importances between different model structures. To overcome this problem we can use a model-agnostic method, that is, one that works independently of the structure of the model. An example of such a measure is permutation-based variable importance.

Second — Idea of permutation-based variable-importance

The idea is very simple: to assess how important variable V is, we compare the initial model with a model in which the effect of variable V has been removed. How do we remove the effect of variable V? In the LOCO (leave-one-covariate-out) approach we retrain the model without variable V. In permutation-based variable importance we take a different approach: the effect of a variable is removed through a random reshuffling of its values. Following the picture below, we take the original data (left part of the picture), then we permute (a mixer, which mixes the values), and we get “new” data on which we calculate the prediction.

Idea of permutation-based variable-importance

If a variable is important in a model, then after its permutation the model prediction should be less precise. The permutation importance of a variable i is the difference between the loss calculated on the data with variable i permuted and the loss calculated on the original data:

VI(f, i) = L_{perm(i)} - L_{org}     (permutation-based variable importance for model f and variable i)

where L_{org} is the value of the loss function for the original data, while L_{perm(i)} is the value of the loss function after permutation of the i-th variable. Note that instead of a loss function we can use any function used for assessing performance, like AUC (which is not a loss function but is a popular performance measure).

Third — let’s get a model in R and Python

Let’s write some code. We are still working on the DALEX apartments data. To calculate permutation variable importance we use the model_parts() function. We could estimate variable importance from a single permutation, but it is recommended to repeat the permutation several times and average the results; by default, the model_parts() function uses 10 permutations. Additionally, we should take the loss function into consideration: the default is 1-AUC for classification and RMSE for regression.

(The R and Python code for this step is embedded as a gist in the original post: https://medium.com/media/e27b5338f9a68ee06ecd087b8cce6fef/href)
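
For readers who cannot see the embedded gist, here is a rough R sketch of the same idea, using the apartments data shipped with DALEX and a random forest from {ranger} (the model choice and labels are placeholders, not necessarily what was used in the gist):

library("DALEX")
library("ranger")

# Fit a model on the apartments data that ships with DALEX
model_rf <- ranger(m2.price ~ ., data = apartments)

# Wrap the model in an explainer
explainer_rf <- explain(model_rf,
                        data = apartments_test[, 2:6],
                        y = apartments_test$m2.price,
                        label = "ranger")

# Permutation-based variable importance:
# 10 permutations by default, RMSE loss for a regression task
vip <- model_parts(explainer_rf)
vip
plot(vip)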

In the data frame of the model_parts object we have the variables and the average change in loss after the permutations.

Below is a plot that summarizes permutation-based variable importance. It is worth noticing that the bars start at the RMSE value for the model on the original data (x-axis). The length of each bar corresponds to the RMSE loss after permutation. The boxplots show how the results for individual random permutations differ.

As we can see, the most important variable is district_Srodmiescie: the price of a property strongly depends on the city district. This should not come as a surprise, because Srodmiescie is a district in the center of Warsaw.

In the next part, we will talk about the Partial Dependence Profile (PDP) method.

Many thanks to Przemyslaw Biecek and Hubert Baniecki for their support on this blog.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.




Sequence Mining My Browsing History with arulesSequences


[This article was first published on R | JLaw's R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Typically when thinking of pattern mining, people tend to think of Market Basket Analysis, with the conventional example showing that people typically buy both Beer and Diapers in the same trip. When order doesn’t matter this is called Association Rules Mining and is implemented by the arules package in R. In this example, the person is buying both diapers and beer; it doesn’t really matter whether the diapers led to the beer purchase or the beer led to the diaper purchase. However, there are instances where the order of events is important to what we’d consider a pattern. For example, “cause and effect” relationships imply order. Putting your hand on a hot stove leads to burning your hand. The reverse direction, burning your hand leading you to put your hand on a hot stove, makes less sense. When the notion of order is applied to association rules mining it becomes “Sequence Mining”. And to do this, we’ll use the arulesSequences package to run the cSPADE algorithm.

Unfortunately, I don’t have access to grocery store data or much other data that would be an interesting use-case for sequence mining. But what I do have is access to my own browsing history. So for this post, I’ll be looking for common sequential patterns in my own web browsing habits.

Getting the Data

I wasn’t able to figure out how to extract my browsing history directly from Chrome in a way that could easily be read into R. However, there are 3rd party programs that can extract browsing histories. In this case, I used a program called BrowsingHistoryView by Nir Sofer. The interface is very straightforward and allowed for extracting my browsing history to a CSV file.

From this program I was able to extract 85 days worth of browsing history from 2020-06-13 through 2020-09-09.

Loading Libraries and Reading in Data

The libraries used in this analysis are the usual gang of tidyverse, lubridate, and ggtext, which are often used in this blog. Some new ones specific to this analysis are:

  • arulesSequences – which will run the sequence mining algorithm
  • tidygraph and ggraph – which will allow for plotting my browsing history as a directed graph
library(tidyverse)       # Data Manipulation and Plotting
library(lubridate)       # Date Manipulation
library(arulesSequences) # Running the Sequence mining algorithm
library(ggtext)          # Adding some flair to plots
library(tidygraph)       # Creating a Graph Structure
library(ggraph)          # Plotting the Network Graph Structure

A .csv file was created from the Browsing History View software and read into R through readr.

browsing_history <- read_csv('browsing_history_v2.csv')

The read-in data looks as follows:

URL | Title | Visited On | Visit Count | Typed Count | Referrer | Visit ID | Profile | URL Length | Transition Type | Transition Qualifiers
https://watch.wwe.com/original/undertaker-the-last-ride-134576 | wwe network - undertaker: the last ride | 6/13/2020 2:59:23 PM | 2 | 1 | NA | 331141 | Default | 62 | Typed | Chain Start,Chain End
https://watch.wwe.com/original/undertaker-the-last-ride-134576 | wwe network - undertaker: the last ride | 6/13/2020 2:59:28 PM | 2 | 1 | NA | 331142 | Default | 62 | Link | Chain Start,Chain End
https://www.google.com/search?q=vtt+to+srt&oq=vtt+to+srt&aqs=chrome.0.69i59j0l7.1395j0j4&sourceid=chrome&ie=utf-8 | vtt to srt - google search | 6/13/2020 4:33:34 PM | 2 | 0 | NA | 331157 | Default | 113 | Generated | Chain Start,Chain End
https://www.google.com/search?q=vtt+to+srt&oq=vtt+to+srt&aqs=chrome.0.69i59j0l7.1395j0j4&sourceid=chrome&ie=utf-8 | vtt to srt - google search | 6/13/2020 4:33:37 PM | 2 | 0 | NA | 331158 | Default | 113 | Link | Chain Start,Chain End
https://twitter.com/ | home / twitter | 6/13/2020 5:19:55 PM | 98 | 90 | NA | 331167 | Default | 20 | Typed | Chain Start,Chain End
https://twitter.com/home | home / twitter | 6/13/2020 5:20:03 PM | 414 | 0 | NA | 331168 | Default | 24 | Link | Chain Start,Chain End

Looking at the data there are a number of cleaning steps that will need to be done to make the sequence mining more useful.

  1. The variable names are not machine friendly and contain spaces,
  2. Some of the URLs are redirects or generated and therefore not URLs I specifically went to. I’ll want to exclude those.
  3. Visited On is a character rather than a date
  4. If we’re looking for common patterns, I should probably limit the URLs to just the domains, as it’s very unlikely that I would read the same news articles multiple times.

The following code block carries out the cleaning steps outlined above:

browsing_history_cleaned <- browsing_history %>% 
  #Make the names more R friendly
  janitor::clean_names() %>%
  #Subset to URLs I either typed or 
  #Linked to (excluding redirects/form submissions)
  filter(transition_type %in% c('Link', 'Typed'),
         str_detect(transition_qualifiers, 'Chain Start')
         ) %>% 
  #Keep Only the Base URL and remove the prefix
  mutate(base_url = str_remove(url, '^https?:\\/\\/') %>% 
           str_remove('^www\\.') %>% 
           str_extract(., '^.+?\\/'),
         #Parse the Date Format
         dttm = mdy_hms(visited_on),
         ds = as.Date(dttm)
  ) %>% 
  select(base_url, dttm, title, ds)

The above block:

  1. Uses janitor::clean_names() to convert the column names into an R-friendly format (Visited On -> visited_on)
  2. Keeps only the ‘Typed’ and ‘Link’ transition types to keep only URLs I either typed to or clicked to
  3. Keep only ‘Chain Start’ qualifiers to remove URLs that came from redirects
  4. Create a base_url field by removing the “http[s]://” and “www.” strings if they exist.
  5. Converts visited_on into both a timestamp and a datestamp
  6. Only keeps the four columns we’re interested in.

After these changes, the data looks like:

base_url | dttm | title | ds
watch.wwe.com/ | 2020-06-13 14:59:23 | wwe network - undertaker: the last ride | 2020-06-13
watch.wwe.com/ | 2020-06-13 14:59:28 | wwe network - undertaker: the last ride | 2020-06-13
google.com/ | 2020-06-13 16:33:37 | vtt to srt - google search | 2020-06-13
twitter.com/ | 2020-06-13 17:19:55 | home / twitter | 2020-06-13
twitter.com/ | 2020-06-13 17:20:03 | home / twitter | 2020-06-13

Sessionizing the Data

Even though I have a date field for my browsing history, the cSPADE algorithm is going to want to be able to differentiate between when one session begins and another session ends. While a reasonable choice might be to break things apart by day, it’s likely that on weekends I have multiple browsing sessions which can sometimes stretch past midnight. So a more reasonable choice might be to say a new session begins if there is a gap of at least 1 hour since the last page I browsed to.

Another aspect of the data that I’d like to deal with is to eliminate when I go to multiple pages within the same domain. Having an eventual rule that “twitter.com/ -> twitter.com” isn’t that interesting. So I will also remove any consecutive rows that have the same domain.

collapsed_history <- browsing_history_cleaned %>% 
  #Order by Time
  arrange(dttm) %>% 
  # Create a new marker every time a Page Browsing is more than 1 hour since
  # the last one
  # Also, create a segment_id to identify each session
  mutate(time_diff = dttm - lag(dttm),
         #Count Segments as more than an hour btw events
         new_segment = if_else(is.na(time_diff) | time_diff >= 60*60, 1, 0),
         segment_id = cumsum(new_segment)
  ) %>% 
  group_by(segment_id) %>% 
  arrange(dttm) %>% 
  #Remove Instances where the same baseurl appears consecutively
  filter(base_url != lag(base_url) | is.na(lag(base_url))) %>% 
  #Create Within Segment ID
  mutate(item_id = row_number()) %>% 
  select(segment_id, ds, dttm, item_id, base_url) %>% 
  ungroup() %>% 
  #Convert Everything to Factor
  mutate(across(.cols = c("segment_id", "base_url"), .f = as.factor))

In order to create segment_ids to represent each session, I use dplyr::lag() to calculate the difference in seconds between each event. Then, when an event occurs more than 1 hour after the prior event, I mark it with a 1 in the new_segment column. Then, using cumsum(), I can fill down the segment_ids to all the other events in that session.

Similarly I use the lag function to remove consecutively occurring identical base_url.

Finally, a quirk of the arulesSequences package is that the “items” or the URLs in this case must be factors.

The data for the 154 browsing sessions now looks like:

collapsed_history %>% head(5) %>% knitr::kable()
segment_id | ds | dttm | item_id | base_url
1 | 2020-06-13 | 2020-06-13 14:59:23 | 1 | watch.wwe.com/
2 | 2020-06-13 | 2020-06-13 16:33:37 | 1 | google.com/
2 | 2020-06-13 | 2020-06-13 17:19:55 | 2 | twitter.com/
2 | 2020-06-13 | 2020-06-13 17:20:09 | 3 | gmail.com/
2 | 2020-06-13 | 2020-06-13 17:24:14 | 4 | twitter.com/

Constructing the Transactions Data Set for arulesSequences

I haven’t found a ton of resources online about using the arulesSequences package. This blog post from Revolution Analytics has been one of the best that I’ve found. However, their process involves exporting to .csv and then reading the file back in to create the transactions data set. Personally, I’d like to avoid stepping outside of R as much as possible.

However, the blog post does provide a good amount of detail about how to properly get the data in the proper format. Using the as function, I can convert the previous data frame into a “transactions” format and set the following fields for use in cSPADE:

  • items: The elements that make up a sequence
  • sequenceID: The identifier for each sequence
  • eventID: The identifier for an item within a sequence
sessions <- as(collapsed_history %>% transmute(items = base_url), "transactions")

transactionInfo(sessions)$sequenceID <- collapsed_history$segment_id
transactionInfo(sessions)$eventID = collapsed_history$item_id

If I wanted to use better controls around time gaps, I would need to provide better information about time. But since this is pretty basic, I don’t use that field as the differentiation between sessions is enough.

The Transaction data class can be viewed with the inspect() function:

inspect(head(sessions))
##     items                  transactionID sequenceID eventID
## [1] {items=watch.wwe.com/} 1             1          1      
## [2] {items=google.com/}    2             2          1      
## [3] {items=twitter.com/}   3             2          2      
## [4] {items=gmail.com/}     4             2          3      
## [5] {items=twitter.com/}   5             2          4      
## [6] {items=gothamist.com/} 6             2          5

Having the “items=” prefix for every item is a little annoying, so let’s remove that by altering the itemLabels for the transactions set:

itemLabels(sessions) <- str_replace_all(itemLabels(sessions), "items=", "")

inspect(head(sessions))
##     items            transactionID sequenceID eventID
## [1] {watch.wwe.com/} 1             1          1      
## [2] {google.com/}    2             2          1      
## [3] {twitter.com/}   3             2          2      
## [4] {gmail.com/}     4             2          3      
## [5] {twitter.com/}   5             2          4      
## [6] {gothamist.com/} 6             2          5

Much better.

Running the cSPADE algorithm

The sequence mining algorithm can be run by using the cspade() function in the arulesSequences package. Before running the algorithm, I’ll need to explain the concept of support. Support can be best thought of as the proportion of sessions that contain a certain URL. Why that’s important is that the cSPADE algorithm works recursively to find the frequent patterns starting with 1-item sets, then moving to 2-items, etc. In order to limit how much time the algorithm will run for, you can set a minimum support threshold. Why this helps is that by definition the support of a 2-item set will be less than or equal to the support of either 1-item set. For example, if A occurs 40% of the time, A and B cannot occur more frequently.

So if A alone does not meet the support threshold, then we don’t need to care about any 2 or more item subsets that contain A.

For this purpose I’ll set a minimum support of 25%. The cspade function will return all of the frequent itemsets that occur in my browsing data.

itemsets <- cspade(sessions, 
                   parameter = list(support = 0.25), 
                   control = list(verbose = FALSE))

The summary() function will provide a lot of useful information, but we’ll just look at the first few rows with inspect():

inspect(head(itemsets))
##    items                   support 
##  1 <{buzzfeed.com/}>     0.4090909 
##  2 <{en.wikipedia.org/}> 0.3311688 
##  3 <{facebook.com/}>     0.3311688 
##  4 <{github.com/}>       0.3051948 
##  5 <{google.com/}>       0.8051948 
##  6 <{gothamist.com/}>    0.4090909 

Here we see the results of a series of 1-item sets where the support is the number of sessions containing at least 1 visit to that URL. Apparently I use google A LOT as it appears in 80% of my sessions.

We can also convert the itemsets data back to a data frame using the as() function and go back to using the usual dplyr or ggplot functions. For example, I can visualize the 10 Most Frequent Sequences I visit based on the support metric:

#Convert Back to DS
itemsets_df <- as(itemsets, "data.frame") %>% as_tibble()

#Top 10 Frequent Item Sets
itemsets_df %>%
  slice_max(support, n = 10) %>% 
  ggplot(aes(x = fct_reorder(sequence, support), 
             y = support, 
             fill = sequence)) + 
    geom_col() + 
    geom_label(aes(label = support %>% scales::percent()), hjust = 0.5) + 
    labs(x = "Site", y = "Support", title = "Most Frequently Visited Item Sets",
         caption = "**Support** is the percent of segments the contain the item set") + 
    scale_fill_discrete(guide = F) +
    scale_y_continuous(labels = scales::percent,
                       expand = expansion(mult = c(0, .1))) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      plot.caption = element_markdown(hjust = 0),
      plot.caption.position = 'plot',
      plot.title.position = 'plot'
    )

Now we see some of the 2-item sets. Not only do I use Google in 80% of sessions; in 66% of sessions I visit Google twice!

Turning Frequent Sequences into Rules

While knowing what URLs occur frequently is interesting, it would be more interesting if I could generate rules around what websites lead to visits to other websites.

The ruleInduction() function will turn the item sets into “if A then B” style rules. To control the size of the output, I will introduce the concept of confidence. The Confidence of an “If A then B” rule is the % of the times the rule is true when A occurs. So if “if A then B” has a 50% confidence then when A occurs we have a 50% chance of seeing B vs. seeing anything other than B.

For this post, I’ll use a minimum confidence of 60%.

rules <- ruleInduction(itemsets, 
                       confidence = 0.6, 
                       control = list(verbose = FALSE))

inspect(head(rules, 3))
##    lhs                     rhs                    support confidence     lift 
##  1 <{gothamist.com/}>   => <{westsiderag.com/}> 0.2727273  0.6666667 1.901235 
##  2 <{gothamist.com/}>   => <{twitter.com/}>     0.2662338  0.6507937 1.113580 
##  3 <{t.co/}>            => <{twitter.com/}>     0.3246753  0.7812500 1.336806 

The returned data structure has 5 fields:

  • lhs: Left-hand side - The “A” in our “if A then B” rule
  • rhs: Right-hand side - The “B” in our “if A then B” rule
  • support: The % of sessions where “A then B” occurs
  • confidence: How often the rule is true (If A occurs the % of Time that B occurs)
  • lift: The strength of the association. Defined as the ratio of support “A then B” divided by the Support of A times the Support of B. In other words, how much more likely are we to see “A and B together” vs. what we would expect if A and B were completely independent of each other.

The first row shows two NYC-specific blogs, one for NYC overall and one for the Upper West Side. The support shows that 27% of my sessions include these two blogs. The confidence shows that if I visit Gothamist there’s a 67% chance I’ll visit WestSideRag afterwards. Finally, the lift shows that the likelihood of this rule is 90% higher than you’d expect if there was no relation between my visits to these sites.
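
As a quick sanity check on how these measures fit together, using the numbers reported above (the stand-alone support of westsiderag.com is implied rather than shown in the output):

# confidence = support(rule) / support(lhs)
0.2727273 / 0.4090909
## ~0.667, matching the confidence column

# lift = confidence / support(rhs), so the implied support of westsiderag.com is
0.6666667 / 1.901235
## ~0.351, i.e. it shows up in roughly 35% of sessions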

Redundant Rules

In order to create the most effective and simplest rules, we’ll want to remove redundant rules. In this context a rule is redundant when a subset of the left-hand side has a higher confidence than the rule with more items on the left-hand side. In simpler terms, we want the simplest rule that doesn’t sacrifice information. For example, {A, B, C} -> {D} is redundant of {A, B} -> {D} if the confidence of the 2nd rule is greater than or equal to that of the 1st.

A real example from this data is:

lhs | rhs | support | confidence | lift
<{t.co/}> | => <{twitter.com/}> | 0.3246753 | 0.7812500 | 1.336806
<{twitter.com/}, {t.co/}> | => <{twitter.com/}> | 0.3181818 | 0.7777778 | 1.330864

The addition of “twitter.com” to the left-hand side does not make for a more confident rule, so it is redundant.

Removing redundant rules can be done easily with the is.redundant() function:

rules_cleaned <- rules[!is.redundant(rules)]

The rules class can also be converted back to a data.frame with the as() function. Then we can use tidyr::separate() to break apart the rule column into the lhs and rhs columns.

rules_df <- as(rules_cleaned, "data.frame") %>% 
  as_tibble() %>% 
  separate(col = rule, into = c('lhs', 'rhs'), sep = " => ", remove = F)

Now we can look at the highest confidence rules:

rules_df %>% 
  arrange(-confidence) %>% 
  select(lhs, rhs, support, confidence, lift) %>% 
  head() %>% 
  knitr::kable()
lhs | rhs | support | confidence | lift
<{google.com/},{google.com/},{google.com/},{google.com/}> | <{google.com/}> | 0.3701299 | 0.9193548 | 1.141779
<{github.com/}> | <{google.com/}> | 0.2792208 | 0.9148936 | 1.136239
<{buzzfeed.com/},{google.com/}> | <{google.com/}> | 0.2597403 | 0.8510638 | 1.056966
<{t.co/},{google.com/}> | <{google.com/}> | 0.2727273 | 0.8400000 | 1.043226
<{lifehacker.com/}> | <{reddit.com/}> | 0.2532468 | 0.8297872 | 1.726854
<{google.com/}> | <{google.com/}> | 0.6623377 | 0.8225806 | 1.021592

And this is pretty boring. I wind up on Google a lot, so it appears in a lot of the rules. So let’s make this more interesting by removing Google from the results and by also looking at both confidence and lift.

rules_df %>% 
  #Remove All Rules that Involve Google
  filter(!str_detect(rule, '\\{google.com\\/\\}')) %>% 
  #Keep only Rule, Confidence, and Lift - 1
  transmute(rule, confidence, lift = lift - 1) %>% 
  #Pivot Lift and confidence into a single column
  pivot_longer(cols = c('confidence','lift'),
               names_to = "metric", 
               values_to = "value") %>% 
  group_by(metric) %>% 
  #Keep only the Top 10 Rules for Each Metric
  top_n(10, value) %>% 
  ungroup() %>% 
  # Reorder so that order is independent for each metric
  ggplot(aes(x = tidytext::reorder_within(rule, value, metric),
             y = value,
             fill = rule)) + 
    geom_col() + 
    geom_label(aes(label = value %>% scales::percent()),
               hjust = 0) +
    scale_fill_discrete(guide = F) + 
    tidytext::scale_x_reordered() + 
    scale_y_continuous(label = scales::percent, 
                       limits = c(0, 1),
                       expand = expansion(mult = c(0, .1))) + 
    labs(x = "Rule", 
         y = "",
         title = "Top Rules by Confidence and Lift",
         caption = "**Confidence** is the probability RHS occurs given LHS occurs
         **Lift** is the increased liklihood of seeing LHS & RHS together vs. independent") +
    facet_wrap(~metric, ncol = 1, scales = "free_y") +
    coord_flip() +
    theme_minimal() +
    theme(
      plot.caption = element_markdown(hjust = 0),
      plot.caption.position = 'plot',
      strip.text = element_textbox(
        size = 12,
        color = "white", fill = "#5D729D", box.color = "#4A618C",
        halign = 0.5, linetype = 1, r = unit(5, "pt"), width = unit(1, "npc"),
        padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
      )
    )

Some of the high lift rules that occur are:

  • I visit WestSideRag after Gothamist
  • I visit Reddit after LifeHacker
  • I visit Buzzfeed after Twitter.

By the way, all of this is true. My usual weekday pattern tends to be Twitter -> Gothamist -> WestSideRag -> ILoveTheUpperWest -> Buzzfeed -> LifeHacker -> Reddit.

So it does appear that the Sequence Mining rules do in fact represent my browsing habits! But certain sites like the 2nd Upper West Side blog did not make the top rules.

Visualizing these relationships as a graph

Ultimately, my browsing habits can be restructured as a directed graph where each URL leads to another URL. Then rather than relying on statistical measures like Support, Confidence, and Lift, I can visualize my browsing as a network. However, to turn my data into an edge list I need to re-structure the URLs from a sequential list into a series of “Source/Destination” edges.

To do this, I’ll group by each browsing session, setting the URL as the “source” and using dplyr::lead() to grab the URL from the next row to form the “destination”. Then, since the destination will be null for the last URL in each session, I’ll remove these endpoints from the data. Finally, to create edge weightings I’ll count the number of instances for each source/destination pair.

collapsed_history_graph_dt <- collapsed_history %>% 
  group_by(segment_id) %>% 
  transmute(item_id, source = base_url) %>% 
  mutate(destination = lead(source)) %>% 
  ungroup() %>%
  filter(!is.na(destination)) %>% 
  select(source, destination, segment_id) %>% 
  count(source, destination, name = 'instances') 

In order to create the graph, I’ll be using the tidygraph and ggraph packages to convert the data frame into the appropriate format and visualize the network in a ggplot style.

To make the resulting network more readable, I’ll filter my edge list to only those with at least 15 occurrences and then use tidygraph::as_tbl_graph to convert to a graph-friendly data type.

g <- collapsed_history_graph_dt %>% 
  filter(instances > 14) %>% 
  as_tbl_graph()

Creating Graph Clusters

To make the visualization a little more interesting I thought it would be fun to cluster the network. The igraph::cluster_optimal function will calculate the optimal community structure of the graph. This membership label then gets applied as a node attribute to the graph object g created in the prior code block.

clp <- igraph::cluster_optimal(g)

g <- g %>% 
  activate("nodes") %>% 
  mutate(community = clp$membership)

Plotting the Network WIth ggraph

Ggraph follows a similar syntax to ggplot, where the data object is passed in and then there are geoms to reflect the nodes/edges of the plot. The layout option specifies how the nodes and edges will be laid out. Here I’m using the results of the Fruchterman-Reingold algorithm for a force-directed layout. As used in this code block the relevant geoms are:

  • geom_node_voronoi - Used to plot the clustering as the background of the graph
  • geom_edge_parallel - Since this is a directional graph, it will draw separate parallel arrows for each direction. The shading will be based on the log number of instances.
  • geom_node_point - Plots a circle for each node
  • geom_node_text - Plots the names of the nodes and reduces overlap
set.seed(20201029)

ggraph(g, layout = 'fr') + 
  geom_node_voronoi(aes(fill = as.factor(community)), alpha = .4) + 
  geom_edge_parallel(aes(edge_alpha = log(instances)),
                     #color = "#5851DB",
                     edge_width = 1,
                     arrow = arrow(length = unit(4, 'mm')),
                     start_cap = circle(3, 'mm'),
                     end_cap = circle(3, 'mm')) +
  geom_node_point(fill = 'orange', size = 5, pch = 21) + 
  geom_node_text(aes(label = name), repel = T) + 
  labs(title = "My Browsing History",
       caption = "Minimum 15 Instances") + 
  scale_fill_viridis_d(guide = F) + 
  scale_edge_alpha_continuous(guide = F) + 
  theme_graph()

This graph shows 5 clusters:

  1. Twitter -> Gothamist -> WestSideRag -> ILoveTheUpperWestSide
    • The websites I typically visit after work on weekdays
  2. Datacamp / Google Docs
    • When I did some Datacamp courses, I take notes in Google Docs so constantly switching back and forth makes sense.
  3. Facebook.com / l.facebook.com
    • This is just using Facebook. But interesting that Facebook has no frequent connection outside of the Facebook ecosystem.
  4. BuzzFeed/LifeHacker
    • This is the last piece of my usual post-work routine. But perhaps it occurs later, after the Twitter/NYC Blog cluster
  5. The Google Centered Cluster
    • Google is the center of my browsing universe, but some fun connections here are 127.0.0.1:4321, which is the local instance when I’m developing this blog. This co-occurs with lots of trips to Google, Github, and Stack Overflow while I try to figure out / debug aspects of my blog development pipeline.

Conclusion

In my searches there weren’t a ton of resources that showed how to use the arulesSequences package, and most required dumping and re-reading a .csv file. Hopefully, this post showed that it isn’t necessary to do that. Additionally, it shows an example of how sequence mining can be used to identify interesting patterns when order is important. There is a lot of functionality in the arulesSequences package not touched upon in this post, but this should serve as a good starting point.

As for visualization, I’ve covered how to plot rules in the usual tabular structure with ggplot2 as well as a network using ggraph. I really like the way the network visualization worked out and in a future post I may go more in-depth to learn about how to best use tidygraph and ggraph.


Tidyverse Tips


[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I have found the following commands quite useful during the EDA part of any Data Science project. We will work with the tidyverse package, where we will actually only need dplyr and ggplot2, and with the iris dataset.

select_if | rename_if

The select_if function belongs to dplyr and is very useful when we want to choose some columns based on some conditions. We can also add a function that is applied to the column names.

Example: Let’s say that I want to choose only the numeric variables and to add the prefix “numeric_” to their column names.

library(tidyverse)

iris %>%
  select_if(is.numeric, list(~ paste0("numeric_", .))) %>%
  head()


Notice that we can also use the rename_if in the same way. An important note is that the rename_if(), rename_at(), and rename_all() have been superseded by rename_with(). The matching select statements have been superseded by the combination of a select() + rename_with().

These functions were superseded because mutate_if() and friends were superseded by across(). select_if() and rename_if() already use tidy selection so they can’t be replaced by across() and instead we need a new function.


everything

In many Data Science projects, we want one particular column (usually the dependent variable y) to appear first or last in the dataset. We can achieve this using the everything() from dplyr package.

Example: Let’s say that I want the column Species to appear first in my dataset.

mydataset <- iris %>%
  select(Species, everything())

mydataset %>%
  head()

Example: Let’s say that I want the column Species to appear last in my dataset.

This is a little bit tricky. Have a look below at how we can do it. We will work with mydataset, where the Species column appears first, and we will move it to the last column.

mydataset %>%
  select(-Species, everything()) %>%
  head()

relocate

The relocate() is a new addition in dplyr 1.0.0. You can specify exactly where to put the columns with .before or .after

Example: Let’s say that I want the Petal.Width column to appear next to Sepal.Width

iris %>%
  relocate(Petal.Width, .after = Sepal.Width) %>%
  head()

Notice that we can also make a column appear after the last column.

Example: Let’s say that I want the Petal.Width to be the last column

iris %>%
  relocate(Petal.Width, .after = last_col()) %>%
  head()

You can find more info in the tidyverse documentation


pull

When we work with data frames and we select a single column, sometimes we want the output to be a vector. We can achieve this with pull(), which is part of dplyr.

Example: Let’s say that I want to run a t.test on the Sepal.Length for setosa versus virginica. Note that the t.test function expects numeric vectors.

setosa_sepal_length <- iris %>%
  filter(Species == 'setosa') %>%
  select(Sepal.Length) %>%
  pull()

virginica_sepal_length <- iris %>%
  filter(Species == 'virginica') %>%
  select(Sepal.Length) %>%
  pull()

t.test(setosa_sepal_length, virginica_sepal_length)

reorder

When you work with ggplot2, it is sometimes frustrating to have to reorder the factors based on some condition. Let’s say that we want to show the boxplot of the Sepal.Width by Species.

iris %>%
  ggplot(aes(x = Species, y = Sepal.Width)) +
  geom_boxplot()

Example: Let’s assume that we want to reorder the boxplot based on the Species’ median.

We can do that easily with the reorder() from the stats package.

iris %>%
  ggplot(aes(x = reorder(Species, Sepal.Width, FUN = median), y = Sepal.Width)) +
  geom_boxplot() +
  xlab("Species")


My R Table Competition 2020 Submission: xG Timeline Table for Soccer/Football with {gt}!


[This article was first published on R by R(yo), and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

In this blog post I’ll be talking about my submission to the 2020 RStudio Table contest: A xG and shots timeline table for soccer/football matches powered by data from understat.com!


I started out in soccer analytics mainly to practice my data cleaning and visualization skills and over time it evolved into an entire portfolio of plots and graphs that you can see in the soccer_ggplots Github Repo. At the beginning of last season (summer of 2019) I figured out how to grab match-level data from understat.com via Ewen Henderson’s {understatr} package and started creating a “match day summary graphic”. These were intended to visualize the data understat.com provided in my own aesthetically-pleasing way for the general public (mainly on soccer analytics Twitter). These aren’t intended for coaches/clubs but are more media/fan oriented pieces to help convey the flow of the match in a different way than the usual box scores (number of shots, tackles, yellow cards, total passes, etc.).

Version 1 (last year) looked like this:

In the past few months, after the Corona-extended-season ended, I intended to make tweaks to the graph above but instead I just started creating other viz with the data available while also incorporating some other stuff that I couldn’t fit into the original match day viz. 

One of the new viz I created was a chronological table of shots during the course of the match which provides the viewer with a lot more detail about any individual shot. Besides the xG value, things like the “situation” (open play? free kick? etc.), “shot type” (right, left, header), etc. were added for more context. In general I wanted the next series of viz to add a bit more context to the overall xG flow graph and the shot maps.

So here are the new xG match summary viz in their entirety (minus the timeline table which was the first image you saw on this blog post):

This blog post will only go over the {gt} time lines.

Design Choices

I mostly make very simple designs in my soccer visualizations and I try not to clutter each individual viz with too much info, but rather try to peel layers of data apart and split them up into different component viz (for more examples see here). In reality, this table viz is only one of a series of data viz intended to capture details about a match using understat.com’s data that I post on Twitter, but since this tutorial is for a table viz competition, I’ll keep the following discussion to the table viz only.

The color-coding of the rows is intended to give you a clear idea of who was dominating shots at any given time period at a glance, without even having to look closely at the details. Of course, there may be a ton of shots in a small period of time or vice-versa which can skew your perception if you don’t look at the “minute” column closely. Still, I think it’s cool that you can look at the table from far away and identify time “chunks” of shot dominance. The previous few sentences only touch on the info about the “quantity” of shots taken. The “quality” of the shots can be gleaned from looking at the shot maps, as well as the cumulative sum of a team’s xG as you travel down the table chronologically.

Most of the plots you see inserted into {gt} or other R tables in general are spark lines so I thought about including them in mine too. However, I realized that the cumulative xG column covers that and the line graph wasn’t very visible on the table. The chronological nature of the table also means you get a feel for that perspective of the game anyways (e.g. chunks of “red” rows means that team “red” was taking more shots in that time period). In any case the line plot is already covered by another viz in my series so I decided to focus on something that’s a lot easier to see at first glance, the main protagonist of the table, the actual shots themselves. The idea was to plot all of the shots of that table row’s respective team on a soccer field plot and then highlight the specific shot of that row along with its xG value.

For the title headline text I experimented with a few different sizes depending on the text content (team name, xG value, score line): the actual score is the biggest in size, then the team names, and then the xG totals for either team in the smallest font at the far ends. I’m still not sure how the text alignment (especially vertical) works so I didn’t want to try anything too drastic. I’m still not very fond of the fact that the parentheses around the xG values kind of dip below the other letters, especially the larger text. I am also still vacillating over adding a bit more color to either “xG” or the xG values themselves in their respective team colors. I did all of this because I wanted to guide people’s focus to the actual score line first and then down and across to the xG values. I have Expected Points (xPoints) in the other match graphics, so for the stand-alone version of the tables it would make sense to include them somehow in a later version as well.

I made the column header color slightly gray to differentiate that text from the title text, but I’m still not 100% sure it provides a good balance with the rest of the text in the table. For the text inside the table cells I made sure to use thick bold text, as in the version that I used to release with my other match viz the text was a lot thinner and harder to read. I tried to place emphasis on rows with goals by switching the text to all-caps and adding the emoji alongside it. I think there could be ways of adding more emphasis by coloring the border of that row or something else. Unfortunately, for the PNG version in my regular match summary viz that I post on Twitter, the emoji is a star instead of a soccer ball as the soccer ball emoji doesn’t convert well from HTML to PNG. I’ll be using the soccer ball emoji here as I’m aiming for an HTML output and don’t really care about its ability to be embedded in tweets.

Some other concerns I have are as follows. The large size of the table is a concern, as it is not very mobile-friendly or small-screen friendly, even on computers. The size could still be tweaked but I’m not really sure about making it any smaller or the shot map will be even less visible than it already is. Another concern was my own inexperience with HTML/CSS when trying to get the team logos to slot in on either side of the title text; this was only solved after hours upon hours of failed attempts and honestly took up most of my time creating this viz.

There is a need to be careful about what colors to use as the main team colors (yellow for example is strictly off-limits). A dark theme design could potentially be used to alleviate some of the concerns with using yellow or other brighter colors that don’t work well with white bold text. Another idea is to use an if-else statement to switch the text to black if the color fill for that row is a certain set of bright colors too.

The team logos are all taken from Wikipedia so the sizes may differ slightly. I may try specifying a specific aspect ratio in the HTML/CSS code but the logos can come in different shapes as well. I also wanted to place the logos slightly inward closer to the text and not exactly in the corners as they are now but setting margins around the logo images caused other problems.

As this is a static visualization there aren’t any scrolling elements embedded in the table itself nor are there freeze panels so that the column labels follow you as you go down the table. I did try to alleviate this problem by ordering the columns in a way so that its immediately obvious what the column is showing. An interactive version is the solution for a lot of these problems and it may be worth looking into in the future.

Packages

The packages we’ll be using are some of the usual tidyverse stalwarts such as {dplyr}, {tidyr}, {ggplot2}, {forcats}, etc. For soccer-specific packages we have the {understatr} package for downloading the data and the {ggsoccer} package for creating soccer fields in {ggplot2}.

We also use the {extrafont} package to make use of some cool fonts for the table/plots, they’re loaded in via the loadfonts() call. I’ll be using the “Roboto Condensed” and “Roboto Slab” fonts that you can download from Google Fonts. Please read the instructions on extrafont’s Github repo and/or CRAN docs for how to install fonts on your computer and use them in R. If you don’t want to bother with all that then just delete the code where I specify fonts throughout the blog post.

Also attached is my sessionInfo().

# Packages
pacman::p_load(dplyr, tidyr, stringr, janitor, purrr,
               tibble, lubridate, glue, rlang, 
               rvest, polite,
               ggplot2, 
               gt, forcats, ggtext, extrafont, 
               understatr, ggsoccer)

## Load fonts
loadfonts(quiet = TRUE)

sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggsoccer_0.1.4        understatr_1.0.0.9000 extrafont_0.17       
##  [4] ggtext_0.1.0          forcats_0.5.0         gt_0.2.2             
##  [7] ggplot2_3.3.0         polite_0.1.1          rvest_0.3.5          
## [10] xml2_1.3.2            rlang_0.4.7           glue_1.4.1           
## [13] lubridate_1.7.4       tibble_3.0.1          purrr_0.3.4          
## [16] janitor_1.2.1         stringr_1.4.0         tidyr_1.1.0          
## [19] dplyr_1.0.1           goodshirt_0.2.2      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6      pillar_1.4.4.9000 compiler_3.5.3    tools_3.5.3      
##  [5] digest_0.6.25     robotstxt_0.6.2   jsonlite_1.7.0    gtable_0.3.0     
##  [9] evaluate_0.14     memoise_1.1.0     lifecycle_0.2.0   pkgconfig_2.0.3  
## [13] yaml_2.2.1        xfun_0.12.2       Rttf2pt1_1.3.8    withr_2.2.0      
## [17] httr_1.4.1        knitr_1.27        fs_1.4.1          generics_0.0.2   
## [21] vctrs_0.3.2       gridtext_0.1.0    grid_3.5.3        rprojroot_1.3-2  
## [25] tidyselect_1.1.0  here_0.1          R6_2.4.1          qdapRegex_0.7.2  
## [29] rmarkdown_2.1     pacman_0.5.1      extrafontdb_1.0   ratelimitr_0.4.1 
## [33] magrittr_1.5      scales_1.1.0      usethis_1.6.1     backports_1.1.8  
## [37] ellipsis_0.3.1    htmltools_0.5.0   assertthat_0.2.1  colorspace_1.4-1 
## [41] stringi_1.4.6     munsell_0.5.0     crayon_1.3.4

Data Source

understat.com is a data source used by fans, media, bloggers, and analysts alike as it is a free and open resource for expected goals data across six European soccer leagues, the English Premier League, La Liga (Spain), Bundesliga (Germany), Serie A (Italy), Ligue 1 (France), and RFPL (Russia). The data goes back to the 2014-2015 season and has a lot of different xG and xG-adjacent data to look at on tables, graphs, and pitch maps.

Definitions for the variables are provided by hovering over the column names on the website. Others are opaque but follow terminology similar to that set by other data companies and websites. For the average Joe, understat.com, along with the more recent FBref (with advanced metrics provided by StatsBomb), are the best free and publicly available websites for obtaining this kind of data that I know of.

Nothing is really stopping you from using StatsBomb, Opta, InStat, WyScout, etc. data to create these tables; you just need a few extra steps to prepare and structure the data sets similarly to how I've done it in the next few sections.
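As a rough illustration (my own sketch, not code from the original workflow), the rest of the pipeline only really needs a per-shot data frame shaped like the one {understatr} returns, so the main chore with another provider is mapping their export onto columns along these lines:

## Hypothetical example rows using understat-style labels; column names mirror the
## {understatr} output used throughout the rest of the post
other_source_shots <- tibble::tibble(
  id              = 1:2,                       # chronological shot order
  minute          = c(2, 3),
  result          = c("BlockedShot", "Goal"),  # raw labels, relabeled in the cleaning step
  X               = c(0.875, 0.885),           # shot coordinates scaled 0-1
  Y               = c(0.347, 0.500),
  xG              = c(0.07, 0.76),
  player          = c("Mohamed Salah", "Mohamed Salah"),
  player_id       = c(1250, 1250),
  h_a             = c("h", "h"),               # home/away flag
  situation       = c("OpenPlay", "Penalty"),
  shotType        = c("LeftFoot", "LeftFoot"),
  season          = 2020,
  match_id        = 14090,
  h_team          = "Liverpool",
  a_team          = "Leeds",
  date            = "2020-09-12 16:30:00",
  player_assisted = c("Sadio Mané", NA),
  lastAction      = c("Pass", "Standard")
)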

According to the website, the xG values that understat calculates come from a neural network model trained on a data set of over 100,000 shots with over 10 parameters. The biggest concern with using the data is that I'm not in control of the calculated outputs and am merely using their data as given. It would be a point of interest to create my own xG model based on the variables provided on understat (as well as combining data from other sources) by following some of the "how-to" blog posts around the #rstats world.

Code Breakdown (Part 1): Initial Data Gathering & Cleaning!

We can finally get to the real meat of this blog post which is the code to create the visualization. We’ll be going through things almost line-by-line and then at the end we’ll create a function out of the code so it can be applied to other matches/teams as well!

Let’s get started!

Defining Variables

The variables defined below will be used throughout the code and some are appended to the data frames we'll be creating and cleaning. They'll also be used as arguments for a function we'll create that takes in all the code described throughout this blog post.

Make sure that the “home_team” and “away_team” variables are the exact names for the teams on understat.com but without the _ symbols (e.g. “West_Ham” should be “West Ham”, “Borussia_M.Gladbach” should be “Borussia M.Gladbach”). The underscores will be converted throughout the code so don’t worry about them. You have to be careful as some teams on understat.com are labeled with their full name (“West_Bromwich_Albion”, “Parma_Calcio_1913”) while some other ones aren’t (“Tottenham”, “Brighton”, “Leeds”).

The team colors are your choice, but the table will be using white text, so be sure not to choose yellow or white. For teams like Borussia Dortmund I usually just choose black. There are a number of great websites for finding official team colors.

Finally you can set the match date, league, season, match day, and source text to show in the table in whatever format you wish. You can actually scrape the “match_date” via {understatr} as well but I didn’t want to bother reformatting with {lubridate} so I just set it manually here.

home_team = "Liverpool"away_team = "Leeds"home_color = "#d00027" away_color = "#1D428A"home_team_logo <- "https://i.imgur.com/RlXYW46.png"away_team_logo <- "https://i.imgur.com/r6Y9lT8.png"match_date <- "Sep. 12, 2020"league_year <- "Premier League 2020-2021"matchday <- 1source_text <- "**Table**: Ryo Nakagawara (**Twitter**: @R_by_Ryo) | **Data**: understat"

{understatr} Package

You can get the data via the {understatr} package from Ewen Henderson. There are a number of blog posts that you may want to look into alongside the official documentation.

To get match shot data for a match you need to find the match ID on understat. There are a number of helper functions to help you find this ID but you can just go to understat and find it yourself as the ID is part of the URL for the match page. Another way is to scrape the team’s page on understat and then you’ll have the match IDs for that team (including for matches they haven’t played yet). https://understat.com/match/14090 is the URL for Liverpool vs. Leeds United and the set of digits at the end, 14090, is the match ID. You use this ID as the input to the get_match_shots() function to get the raw match shots data.

match_id <- 14090raw_match_df <- understatr::get_match_shots(match_id = match_id)glimpse(raw_match_df)## Rows: 28## Columns: 20## $ id               376602, 376603, 376604, 376606, 376607, 376608, 376...## $ minute           2, 3, 6, 19, 24, 28, 32, 39, 39, 48, 52, 52, 59, 61...## $ result           "BlockedShot", "Goal", "BlockedShot", "Goal", "Save...## $ X                0.875, 0.885, 0.860, 0.946, 0.734, 0.936, 0.870, 0....## $ Y                0.347, 0.500, 0.322, 0.542, 0.374, 0.359, 0.392, 0....## $ xG               0.06855621, 0.76116884, 0.06080124, 0.41931298, 0.0...## $ player           "Mohamed Salah", "Mohamed Salah", "Mohamed Salah", ...## $ h_a              "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "...## $ player_id        1250, 1250, 1250, 833, 605, 5247, 1250, 838, 838, 7...## $ situation        "OpenPlay", "Penalty", "OpenPlay", "FromCorner", "O...## $ season           2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202...## $ shotType         "LeftFoot", "LeftFoot", "LeftFoot", "Head", "RightF...## $ match_id         14090, 14090, 14090, 14090, 14090, 14090, 14090, 14...## $ h_team           "Liverpool", "Liverpool", "Liverpool", "Liverpool",...## $ a_team           "Leeds", "Leeds", "Leeds", "Leeds", "Leeds", "Leeds...## $ h_goals          4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...## $ a_goals          3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...## $ date             "2020-09-12 16:30:00", "2020-09-12 16:30:00", "2020...## $ player_assisted  "Sadio Mané", NA, "Sadio Mané", "Andrew Robertson",...## $ lastAction       "Pass", "Standard", "Pass", "Cross", "None", "None"...

When I was first making these plots (around a year ago) the understatr::get_match_shots() function didn’t exist so I had my own way of scraping this data. I’ve kept using my own methods in my personal scripts but for the purposes of this blog post I’ll use understatr::get_match_shots() for your convenience.

The raw data we get is already in good shape thanks to {understatr}, but we still need to do some more cleaning and set some variables the way we want for our table.

shots_df <- raw_match_df %>%   ## 1. Take out 2 columns we don't really need.  select(-h_goals, -a_goals) %>%   ## 2. Make sure the selected columns are set to numeric type.  mutate(across(c(minute, xG, X, Y,                  player_id, match_id, season), as.numeric)) %>%   ## 3. If xG is `NA` then set it to 0.  ## 4. Relabel the categories in "result", "situation", "lastAction", and "shotType" columns so they're more human-friendly and presentable.  mutate(xG = if_else(is.na(xG), 0, xG),         result = case_when(           result == "SavedShot" ~ "Saved Shot",           result == "BlockedShot" ~ "Blocked Shot",           result == "MissedShots" ~ "Missed Shot",           result == "ShotOnPost" ~ "On Post",           result == "OwnGoal" ~ "Own Goal",           TRUE ~ result),         situation = case_when(           situation == "OpenPlay" ~ "Open Play",            situation == "FromCorner" ~ "From Corner",           situation == "DirectFreekick" ~ "From Free Kick",           situation == "SetPiece" ~ "Set Piece",           TRUE ~ situation),         lastAction = case_when(           lastAction == "BallRecovery" ~ "Ball Recovery",           lastAction == "BallTouch" ~ "Ball Touch",           lastAction == "LayOff" ~ "Lay Off",           lastAction == "TakeOn" ~ "Take On",           lastAction == "Standard" ~ NA_character_,           lastAction == "HeadPass" ~ "Headed Pass",           lastAction == "BlockedPass" ~ "Blocked Pass",           lastAction == "OffsidePass" ~ "Offside Pass",           lastAction == "CornerAwarded" ~ "Corner Awarded",           lastAction == "Throughball" ~ "Through ball",           lastAction == "SubstitutionOn" ~ "Subbed On",           TRUE ~ lastAction),         shotType = case_when(           shotType == "LeftFoot" ~ "Left Foot",           shotType == "RightFoot" ~ "Right Foot",           shotType == "OtherBodyPart" ~ "Other",           TRUE ~ shotType)) %>%   ## 5. Consolidate team name into a single column "team_name" based on the "h_a" column.  mutate(team_name = case_when(    h_a == "h" ~ h_team,    h_a == "a" ~ a_team)) %>%   ## 6. Add team colors to the row depending on the team.  mutate(team_color = if_else(team_name == h_team, home_color, away_color)) %>%   ## 7. Own Goal is set to the team that conceded it so swap it to the team that actually scored from it.  mutate(team_name = case_when(    result == "Own Goal" & team_name == home_team ~ away_team,    result == "Own Goal" & team_name == away_team ~ home_team,    TRUE ~ team_name)) %>%   ## 8. Set "team_name" as a factor variable.  mutate(team_name = forcats::as_factor(team_name)) %>%   ## 9. Arrange the rows by `id` so that shots are in chronological order.  arrange(id) %>%   ## 10. Separate "player" into two, then re-combine.  separate(player, into = c("firstname", "player"),            sep = "\\s", extra = "merge") %>%   ## players like Fabinho are listed without a last name "Tavares"  ## so just add their name in again if NA  mutate(player = if_else(is.na(player), firstname, player),         ## 11. Set a new and cleaner ID for shots so that it starts at 1 and goes to `n`.         
id = row_number())glimpse(shots_df)## Rows: 28## Columns: 21## $ id               1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...## $ minute           2, 3, 6, 11, 19, 24, 28, 29, 32, 39, 39, 48, 49, 52...## $ result           "Blocked Shot", "Goal", "Blocked Shot", "Goal", "Go...## $ X                0.875, 0.885, 0.860, 0.842, 0.946, 0.734, 0.936, 0....## $ Y                0.347, 0.500, 0.322, 0.607, 0.542, 0.374, 0.359, 0....## $ xG               0.06855621, 0.76116884, 0.06080124, 0.05824600, 0.4...## $ firstname        "Mohamed", "Mohamed", "Mohamed", "Jack", "Virgil", ...## $ player           "Salah", "Salah", "Salah", "Harrison", "van Dijk", ...## $ h_a              "h", "h", "h", "a", "h", "h", "h", "a", "h", "h", "...## $ player_id        1250, 1250, 1250, 8720, 833, 605, 5247, 822, 1250, ...## $ situation        "Open Play", "Penalty", "Open Play", "Open Play", "...## $ season           2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202...## $ shotType         "Left Foot", "Left Foot", "Left Foot", "Right Foot"...## $ match_id         14090, 14090, 14090, 14090, 14090, 14090, 14090, 14...## $ h_team           "Liverpool", "Liverpool", "Liverpool", "Liverpool",...## $ a_team           "Leeds", "Leeds", "Leeds", "Leeds", "Leeds", "Leeds...## $ date             "2020-09-12 16:30:00", "2020-09-12 16:30:00", "2020...## $ player_assisted  "Sadio Mané", NA, "Sadio Mané", "Kalvin Phillips", ...## $ lastAction       "Pass", NA, "Pass", "Take On", "Cross", "None", "No...## $ team_name        Liverpool, Liverpool, Liverpool, Leeds, Liverpool, ...## $ team_color       "#d00027", "#d00027", "#d00027", "#1D428A", "#d0002...

You have to adjust xG values within a single sequence of possession, so I regularly use the code in Joe Gallagher's blog post to adjust them whenever there's a sequence with a rebound.

The xG totals (Liverpool 3.15 xG vs. Leeds United 0.27 xG) you see on understat's match pages use the ADJUSTED values, BUT the xG values you see on their shot maps (when you hover over them with your mouse) are the RAW values, which are the same as the ones downloaded via {understatr}. So do be careful when you're taking shot data from the website. 99% of the time this discrepancy has been solved by applying the below code from Joe's blog post to rebound shots and other shots in that possession sequence. Understat could be doing something else under the hood, but from my experience this usually matches the values up. Do note that not all shots that were adjusted by understat for the box score have the "rebound" shot result label, which makes it much more difficult to tease out which ones need the adjustment. This is why it's important you watch the match or have an easy way of checking every shot/shot sequence.

shots_df %>% filter(team_name == "Liverpool") -> lfc_xglfc_adj <- lfc_xg %>%   filter(minute == 39) %>%  mutate(xg_total = (1 - prod(1 - xG))) %>%  mutate(xg_adj = xg_total * (xG / sum(xG))) %>%   select(minute, xG, xg_adj, player, situation, lastAction)glimpse(lfc_adj)## Rows: 2## Columns: 6## $ minute      39, 39## $ xG          0.10687961, 0.08808844## $ xg_adj      0.10171848, 0.08383471## $ player      "Mané", "Mané"## $ situation   "Open Play", "Open Play"## $ lastAction  "Pass", "Rebound"

Then we just overwrite the raw xG values with the adjusted values into shots_df.

shots_df <- shots_df %>% 
  mutate(xG = case_when(
    minute == 39 & id == 10 ~ 0.10171848,
    minute == 39 & id == 11 ~ 0.08383471,
    TRUE ~ xG))

Rolling sum of xG

This data frame holds the rolling sum of xG values throughout the match, calculated with the cumsum() function along the order of shots. This bit of code is actually used for creating the line-plot timeline viz you saw in the introduction, but for the table viz we only need the shot ID and the rolling sum values, so that's what we'll select() from the data frame.

## 1. Get minute of last shotlast_min <- shots_df$minute %>% unique() %>% last()## 2. If last shot happened before 90th minute then change to 90if (last_min < 90) {last_min <- 90}## 3. Create index of every minute in the matchminute <- c(0:last_min)## 4. Set team names in a listteam_name <- c(shots_df$h_team %>% unique(),               shots_df$a_team %>% unique())rollsum_df <- shots_df %>%   ## 5. Expand shots_df to include rows for every minute  full_join(crossing(minute, team_name)) %>%   arrange(minute) %>%   group_by(team_name) %>%   ## 6. Change NAs to 0  ## Apply rolling cumulative sum on xG  mutate(xG = if_else(is.na(xG), 0, xG),         rollsum = lag(cumsum(xG))) %>%   ungroup() %>%   ## 7. Change Player Labels (Not used for the table viz so can be ignored)  mutate(rollsum_goal = rollsum + xG) %>%   ## for Minute == 0  mutate(rollsum = if_else(is.na(rollsum), 0, rollsum),         rollsum_goal = if_else(is.na(rollsum_goal), 0, rollsum_goal)) %>%   ## FOR THIS BLOGPOST // {gt} TABLE WE DON'T NEED MOST OF THESE COLUMNS  ## We'll only use the shot order ID and the rolling sum of xG that we just calculated.  filter(xG != 0.00) %>%   select(id, rollsum_goal)## Joining, by = c("minute", "team_name")glimpse(rollsum_df)## Rows: 28## Columns: 2## $ id            1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...## $ rollsum_goal  0.06855621, 0.82972505, 0.89052629, 0.05824600, 1.3098...

Team Stats (Box score data frame)

This is the data you can see on the "Stats" tab of any match page on understat.com. In my own workflow this code is its own function along with a few additions that aren't pertinent to creating the tables. We'll be scraping this directly using the {polite} and {rvest} packages, then slowly massage the text data into a nice data frame.

match_url <- stringr::str_glue("https://understat.com/match/{match_id}")match_page <- polite::bow(match_url)team_stats_raw <- polite::scrape(match_page) %>%   html_nodes("div.scheme-block:nth-child(4)") %>%   html_text() %>%   str_remove_all(., "CHANCES") %>%   str_remove_all(., "([0-9]{2,}%)") %>%   str_replace_all(., "SHOTS ON TARGET", "ON-TARGET") %>%   str_squish()## make sure that you set "home_team" and "away_team" in the beginning, exactly as they appear on understat.comif (str_detect(home_team, " ") == TRUE |     str_detect(away_team, " ") == TRUE) {    home_team_sp <- str_replace_all(home_team, " ", "-")  away_team_sp <- str_replace_all(away_team, " ", "-")    team_stats_raw <- team_stats_raw %>%     str_replace_all(., home_team, home_team_sp) %>%     str_replace_all(., away_team, away_team_sp)}home_team_sp <- str_replace_all(home_team, " ", "-")away_team_sp <- str_replace_all(away_team, " ", "-")team_stats <- team_stats_raw %>%   read.table(text = ., header = FALSE, sep = " ",             col.names = c("var_name", "home", "away")) %>%   t() %>%   tibble::as_tibble(.name_repair = "minimal") %>%   janitor::row_to_names(row_number = 1) %>%   mutate_at(vars(-TEAMS), ~ as.numeric(.)) %>%   mutate(TEAMS = case_when(    str_detect(TEAMS, home_team_sp) ~ home_team,    str_detect(TEAMS, away_team_sp) ~ away_team,    TRUE ~ TEAMS  ))## split team stats into "home" and "away"home_stats <- team_stats[1,]away_stats <- team_stats[2,]## add colors based on defined variables for team's respective colorhome_stats$home_team_color <- home_coloraway_stats$away_team_color <- away_colorglimpse(team_stats)## Rows: 2## Columns: 8## $ TEAMS        "Liverpool", "Leeds"## $ GOALS        4, 3## $ xG           3.15, 0.27## $ SHOTS        22, 6## $ `ON-TARGET`  6, 3## $ DEEP         19, 2## $ PPDA         9.94, 9.33## $ xPTS         2.93, 0.04

You can check out more on how to scrape websites responsibly with the {rvest} and {polite} packages in my previous blog post here.

Data contents

Now that we've got the basic data we need, let's go over what each data frame holds. We'll still be doing a lot more data manipulation after this section, but it's a good place to pause as we have the essential columns we need. These are also the data frames we'll pass as arguments into the function we'll create for the table. In my usual workflow the Part 1 code above is split into a few different functions, while the code in Part 2 and Part 3 lives in a single separate function that creates the {gt} tables. You can certainly shove the Part 1 code into the overall function as well, but I use the data frames from Part 1 in other visualizations so I keep them split up.

  1. shots_df

This data frame is the main protagonist for our table and holds pretty much all the essential info we need. You can add/subtract any columns as you see fit from understat or from your own data source here.

glimpse(shots_df)## Rows: 28## Columns: 21## $ id               1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...## $ minute           2, 3, 6, 11, 19, 24, 28, 29, 32, 39, 39, 48, 49, 52...## $ result           "Blocked Shot", "Goal", "Blocked Shot", "Goal", "Go...## $ X                0.875, 0.885, 0.860, 0.842, 0.946, 0.734, 0.936, 0....## $ Y                0.347, 0.500, 0.322, 0.607, 0.542, 0.374, 0.359, 0....## $ xG               0.06855621, 0.76116884, 0.06080124, 0.05824600, 0.4...## $ firstname        "Mohamed", "Mohamed", "Mohamed", "Jack", "Virgil", ...## $ player           "Salah", "Salah", "Salah", "Harrison", "van Dijk", ...## $ h_a              "h", "h", "h", "a", "h", "h", "h", "a", "h", "h", "...## $ player_id        1250, 1250, 1250, 8720, 833, 605, 5247, 822, 1250, ...## $ situation        "Open Play", "Penalty", "Open Play", "Open Play", "...## $ season           2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202...## $ shotType         "Left Foot", "Left Foot", "Left Foot", "Right Foot"...## $ match_id         14090, 14090, 14090, 14090, 14090, 14090, 14090, 14...## $ h_team           "Liverpool", "Liverpool", "Liverpool", "Liverpool",...## $ a_team           "Leeds", "Leeds", "Leeds", "Leeds", "Leeds", "Leeds...## $ date             "2020-09-12 16:30:00", "2020-09-12 16:30:00", "2020...## $ player_assisted  "Sadio Mané", NA, "Sadio Mané", "Kalvin Phillips", ...## $ lastAction       "Pass", NA, "Pass", "Take On", "Cross", "None", "No...## $ team_name        Liverpool, Liverpool, Liverpool, Leeds, Liverpool, ...## $ team_color       "#d00027", "#d00027", "#d00027", "#1D428A", "#d0002...
  • minute: Minute of the game that the shot was taken. Note that this doesn’t contain seconds so you need to be careful when re-arranging rows. When you get the data via {understatr} the data should come in proper order so the first thing you should do is to create an ID variable so that you always have shots in the same minute in the proper order.

  • result: The result of the shot. “Goal”, “Blocked Shot”, “Saved Shot”, “Own Goal”, etc.

  • X & Y: Coordinates of the shot location.

  • xG: Expected Goals value of the shot taken per understat.com’s model.

  • firstname & player: Player name. firstname is a column I created myself for the shot timeline/shot maps in a previous step of my viz process. When you're grabbing the data from understat yourself it should already be in one united column. We'll go over the code to combine these together for the table.

  • h_a: Home or away team. These are used for matching the team colors to their respective row and is in general useful when you have turned your code into a function and want to set certain conditions depending on whether the team is “home” or “away”.

  • player_id & match_id: understat IDs for players and matches. Not used here but good to keep if you're going to be doing other stuff later on.

  • situation: The situation in which the shot took place. “Open Play”, “Set Piece”, “Penalty”, etc.

  • season: Season. Not used here but good to keep if you’re doing other stuff with this data.

  • shotType: Type of shot. “Left Foot”, “Right Foot”, & “Header”.

  • date: Date of match. We could use {lubridate} or something to make it look nice but for this example I just manually created a variable “match_date”.

  • player_assisted: Player that assisted the shot. Do note that “assisted” could mean very trivial things like a small touch or a deflection.

  • lastAction: Action that led to the shot. "Pass", "Dribble", "Take On", "Cross", "Rebound", etc. There are some mystery labels such as "Standard" but we'll be removing them from the data in the code section.

  • team_name: Team name used for labels.

  2. rollsum_df

Holds the rolling sum of xG values over every shot taken in the match.

glimpse(rollsum_df)## Rows: 28## Columns: 2## $ id            1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...## $ rollsum_goal  0.06855621, 0.82972505, 0.89052629, 0.05824600, 1.3098...
  3. home_stats & away_stats

The box score you see in the "Stats" tab of the match page on understat.com. "home_stats" and "away_stats" are just "team_stats" divided into their respective teams. The data below are ONLY used for labels and titles in the plot, so if you're not too fussed about scraping the data in the previous section you could just manually insert the values in the text (we'll be using paste()/glue() to insert this into the plot text) or add the info as arguments to your plotting function and insert the data that way as well.

glimpse(team_stats)## Rows: 2## Columns: 8## $ TEAMS        "Liverpool", "Leeds"## $ GOALS        4, 3## $ xG           3.15, 0.27## $ SHOTS        22, 6## $ `ON-TARGET`  6, 3## $ DEEP         19, 2## $ PPDA         9.94, 9.33## $ xPTS         2.93, 0.04
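For example (an illustrative snippet of my own, not part of the original scripts), a quick label string built from the scraped box score could look like this:

glue::glue("{home_stats$TEAMS} {home_stats$GOALS} ({home_stats$xG} xG) - ({away_stats$xG} xG) {away_stats$GOALS} {away_stats$TEAMS}")
## Liverpool 4 (3.15 xG) - (0.27 xG) 3 Leeds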

Code Breakdown (Part 2): Creating the Shot Maps & Cleaning the Data!

Creating the soccer field plots

For the actual soccer field I use the {ggsoccer} package. Along with the pitch dimension specifications for data sources like Opta, WyScout, and StatsBomb, the package also allows you to specify your own dimensions to fit your own data. If you want a more in-depth code-through of the {ggsoccer} package, take a look at the package README or have a look at one of my first blog posts where I create my own World Cup 2018 data and visualize it with {ggsoccer}.

After playing around with the dimensions on understat, below are the measurements I came up with, defined as a variable called pitch_custom.

pitch_custom <- list(
  length = 587,
  width = 373,
  penalty_box_length = 101,
  penalty_box_width = 211,
  six_yard_box_length = 31,
  six_yard_box_width = 111,
  penalty_spot_distance = 66,
  goal_width = 45,
  origin_x = 0,
  origin_y = 0)

Since we are orienting the shot maps vertically, we need to convert the coordinates from the horizontal view to vertical. Because we are only using one side of the pitch, we don't need different adjustments for the "home" and "away" teams. The new coordinates are stored in the lower-case "x" and "y" variables, while the original capitalized "X" and "Y" columns are kept in just as a reference and for checking positions. We then capitalize the "Goal" and "Own Goal" labels for the table (you could also use stringr::str_to_upper() here). Also note that you can combine most if not all of these mutate() calls into one mutate() call, but I like to keep them separated so I can comment out blocks of code more easily.

## create coordsmatch_df <- shots_df %>%   ## switch coordinates for vertical view  mutate(    x = case_when(      h_a == "a" ~ X * 587,      h_a == "h" ~ X * 587,      TRUE ~ 0),    y = case_when(      h_a == "a" ~ Y * 373,      h_a == "h" ~ Y * 373,      TRUE ~ 0)) %>%  ## edit result values  mutate(result = case_when(    result == "Goal" ~ "GOAL",    result == "Own Goal" ~ "OWN GOAL",    TRUE ~ result)) %>%   mutate(result = forcats::as_factor(result),         result = forcats::fct_relevel(result, "GOAL", "Saved Shot",                                       "On Post", "Blocked Shot",                                        "Missed Shots", "OWN GOAL"))## Warning: Problem with `mutate()` input `result`.## x Unknown levels in `f`: On Post, Missed Shots, OWN GOAL## i Input `result` is `forcats::fct_relevel(...)`.## Warning: Unknown levels in `f`: On Post, Missed Shots, OWN GOAL

This is what the data looks like now. There's a row for every shot in chronological order along with the coordinates for the shots.

glimpse(match_df)## Rows: 28## Columns: 23## $ id               1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...## $ minute           2, 3, 6, 11, 19, 24, 28, 29, 32, 39, 39, 48, 49, 52...## $ result           Blocked Shot, GOAL, Blocked Shot, GOAL, GOAL, Saved...## $ X                0.875, 0.885, 0.860, 0.842, 0.946, 0.734, 0.936, 0....## $ Y                0.347, 0.500, 0.322, 0.607, 0.542, 0.374, 0.359, 0....## $ xG               0.06855621, 0.76116884, 0.06080124, 0.05824600, 0.4...## $ firstname        "Mohamed", "Mohamed", "Mohamed", "Jack", "Virgil", ...## $ player           "Salah", "Salah", "Salah", "Harrison", "van Dijk", ...## $ h_a              "h", "h", "h", "a", "h", "h", "h", "a", "h", "h", "...## $ player_id        1250, 1250, 1250, 8720, 833, 605, 5247, 822, 1250, ...## $ situation        "Open Play", "Penalty", "Open Play", "Open Play", "...## $ season           2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202...## $ shotType         "Left Foot", "Left Foot", "Left Foot", "Right Foot"...## $ match_id         14090, 14090, 14090, 14090, 14090, 14090, 14090, 14...## $ h_team           "Liverpool", "Liverpool", "Liverpool", "Liverpool",...## $ a_team           "Leeds", "Leeds", "Leeds", "Leeds", "Leeds", "Leeds...## $ date             "2020-09-12 16:30:00", "2020-09-12 16:30:00", "2020...## $ player_assisted  "Sadio Mané", NA, "Sadio Mané", "Kalvin Phillips", ...## $ lastAction       "Pass", NA, "Pass", "Take On", "Cross", "None", "No...## $ team_name        Liverpool, Liverpool, Liverpool, Leeds, Liverpool, ...## $ team_color       "#d00027", "#d00027", "#d00027", "#1D428A", "#d0002...## $ x                513.625, 519.495, 504.820, 494.254, 555.302, 430.85...## $ y                129.431, 186.500, 120.106, 226.411, 202.166, 139.50...

Using this data frame we can create shot map plots that will serve as the base for a more detailed plot in a later section.

Function to define base shot map

This function takes the data and team_name inputs that come from the match_df data frame we just created, and the resulting plot shows all of a team's shots for every row of the data frame. After using filter() to only get the shots for the specific team, annotate_pitch() is used to create the pitch object using the dimensions we specified in pitch_custom earlier. theme_pitch() is a helper function from {ggsoccer} that removes all the background and axis details, while coord_flip() is used to flip the field so that the goal is at the top of the plot. All the shots are gray and transparent (via the 'alpha' argument) as later on we will highlight the actual shot for a specific row in our table in a different plotting function.

create_shotmap_basic <- function(df = data, team_name = team_name) {    shotxG_map_raw <-     ggplot(df %>% filter(team_name == team_name),            aes(x = x, y = y)) +    annotate_pitch(dimensions = pitch_custom) +    ## all shots in grey and transparent    geom_point(aes(x = x, y = y), color = "grey20",                size = 3, alpha = 0.3) +    #scale_x_continuous(expand = c(0.01, 0)) +    theme_pitch(aspect_ratio = 373/587) +    coord_flip(xlim = c(280, 590),                ylim = c(10, 365)) +    theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "pt"),          text = element_markdown(family = "Roboto Condensed"),          legend.position = "none")        return(shotxG_map_raw)}

We nest() the data by "team_name" so that when we plot our basic shot maps, each row will plot all the shots for its team. By "nesting" the data this way we can apply the plotting function to all the data for a specific team; this way we won't be plotting a basic shot map using shot data from both teams.

match_df %>% tibble::as_tibble() %>% group_by(team_name) %>% nest()## # A tibble: 2 x 2## # Groups:   team_name [2]##   team_name data              ##                    ## 1 Liverpool ## 2 Leeds     

The “data” column contains all the data that we saw previously for each team. If we unnest() the “data” column for the row with “Liverpool” in the “team_name” column we can see that it contains all the data seen previously but only for Liverpool. The same will apply if we “unpack” the “data” column for the row with Leeds United.

match_df %>%   tibble::as_tibble() %>%   group_by(team_name) %>%   nest() %>%   head(1) %>%   unnest(cols = c(data))## # A tibble: 22 x 23## # Groups:   team_name [1]##    team_name    id minute result     X     Y     xG firstname player h_a  ##                         ##  1 Liverpool     1      2 Block~ 0.875 0.347 0.0686 Mohamed   Salah  h    ##  2 Liverpool     2      3 GOAL   0.885 0.5   0.761  Mohamed   Salah  h    ##  3 Liverpool     3      6 Block~ 0.86  0.322 0.0608 Mohamed   Salah  h    ##  4 Liverpool     5     19 GOAL   0.946 0.542 0.419  Virgil    van D~ h    ##  5 Liverpool     6     24 Saved~ 0.734 0.374 0.0191 Jordan    Hende~ h    ##  6 Liverpool     7     28 Block~ 0.936 0.359 0.0941 Naby      Keita  h    ##  7 Liverpool     9     32 GOAL   0.87  0.392 0.0624 Mohamed   Salah  h    ##  8 Liverpool    10     39 Block~ 0.849 0.557 0.102  Sadio     Mané   h    ##  9 Liverpool    11     39 Block~ 0.862 0.56  0.0838 Sadio     Mané   h    ## 10 Liverpool    12     48 Saved~ 0.891 0.543 0.121  Georginio Wijna~ h    ## # ... with 12 more rows, and 13 more variables: player_id ,## #   situation , season , shotType , match_id ,## #   h_team , a_team , date , player_assisted ,## #   lastAction , team_color , x , y 

With this data structure in hand, we apply the basic shot map plot function to every row of data inside the “data” column for both teams. We can do this by passing the “data” and “team_name” columns as the arguments to the shot map function which then iterates over every row in each team’s “data” column.

## add blank shot map to all rowsmatch_df_basic_plot <- match_df %>%   tibble::as_tibble() %>%   ## nest by team  group_by(team_name) %>%   nest() %>%   ## apply plots for all shots per team, for each row in their respective "data" column  mutate(plot = map2(data, team_name, create_shotmap_basic)) %>%   ungroup()

When we take a glimpse() at our data frame we now see an additional column, "plot", that houses the plots for each row of shots for both teams. So Liverpool took 22 total shots and therefore has 22 rows inside "data", which results in 22 plots, while for Leeds there are 6 shots, 6 rows, and 6 plots. Both "data" and "plot" are now what are called "list-columns".

glimpse(match_df_basic_plot)## Rows: 2## Columns: 3## $ team_name  Liverpool, Leeds## $ data       [, ]## $ plot       [<1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 2...

So now we have a data frame with a basic shot map for each team for each row of shots. However, all of the detailed info we need for our table is hidden inside the "data" column. We can get that back by unnest()-ing the "data" column so all the other shot information is unpacked.

match_df_basic_plot <- match_df_basic_plot %>%   unnest(cols = "data") %>%   arrange(id)glimpse(match_df_basic_plot)## Rows: 28## Columns: 24## $ team_name        Liverpool, Liverpool, Liverpool, Leeds, Liverpool, ...## $ id               1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...## $ minute           2, 3, 6, 11, 19, 24, 28, 29, 32, 39, 39, 48, 49, 52...## $ result           Blocked Shot, GOAL, Blocked Shot, GOAL, GOAL, Saved...## $ X                0.875, 0.885, 0.860, 0.842, 0.946, 0.734, 0.936, 0....## $ Y                0.347, 0.500, 0.322, 0.607, 0.542, 0.374, 0.359, 0....## $ xG               0.06855621, 0.76116884, 0.06080124, 0.05824600, 0.4...## $ firstname        "Mohamed", "Mohamed", "Mohamed", "Jack", "Virgil", ...## $ player           "Salah", "Salah", "Salah", "Harrison", "van Dijk", ...## $ h_a              "h", "h", "h", "a", "h", "h", "h", "a", "h", "h", "...## $ player_id        1250, 1250, 1250, 8720, 833, 605, 5247, 822, 1250, ...## $ situation        "Open Play", "Penalty", "Open Play", "Open Play", "...## $ season           2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202...## $ shotType         "Left Foot", "Left Foot", "Left Foot", "Right Foot"...## $ match_id         14090, 14090, 14090, 14090, 14090, 14090, 14090, 14...## $ h_team           "Liverpool", "Liverpool", "Liverpool", "Liverpool",...## $ a_team           "Leeds", "Leeds", "Leeds", "Leeds", "Leeds", "Leeds...## $ date             "2020-09-12 16:30:00", "2020-09-12 16:30:00", "2020...## $ player_assisted  "Sadio Mané", NA, "Sadio Mané", "Kalvin Phillips", ...## $ lastAction       "Pass", NA, "Pass", "Take On", "Cross", "None", "No...## $ team_color       "#d00027", "#d00027", "#d00027", "#1D428A", "#d0002...## $ x                513.625, 519.495, 504.820, 494.254, 555.302, 430.85...## $ y                129.431, 186.500, 120.106, 226.411, 202.166, 139.50...## $ plot             [<1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17,...

I like to think of it as “KABOOM”-ing the “data” column so all its contents explode back out into your data frame.

[GIF: the "Kaboom!" scene from Parks and Recreation]

Function to define specific shot map

This second plotting function takes the plots we created with the previous base shot map function and adds the specific shot for each row to the base plot. The "plot" being passed into this function is one of the plots we saved in the "plot" column of the data frame. We add a new geom_point() for the specific shot recorded in a row and make it bigger in size and color-coded according to the team so it stands out from the rest of the shots the team took. At the bottom we add a label for the xG value.

add_xG_shot <- function(x, y, xG, team_color, plot) {  shotxG_map_point <-     plot +    # specific shot point in black and bold    geom_point(x = x, y = y, aes(fill = team_color),               size = 12, stroke = 3, shape = 21) +    scale_fill_identity() +    #scale_x_continuous(expand = c(0.01, 0)) +    ## label for shot point    geom_label(x = 318, y = 186.5,                color = "black", size = 20,               fill = "white", family = "Roboto Slab",               label = glue::glue("{xG %>% round(digits = 2)} xG"))    return(shotxG_map_point)}

Combining plots to data frame

Now we finally work with the data frame that has the rolling xG sums. The IDs should match up with the IDs in match_df_basic_plot because the rollsum_df data frame was built from the same shots data, just expanded with empty rows for every minute. Once you filter() out the rows without an xG value, the rows for both should match up. Be very careful when manipulating both of these data frames in previous steps as it can become very easy for the IDs to get messed up, especially when there are multiple chances by either team in the same minute. Make sure the row order of the shots data frame is exactly as downloaded from understat.com before applying the IDs to the rows.
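A quick defensive check (my own addition, not from the original post) before the join can catch ID mismatches early:

## Both data frames should contain exactly one row per shot, with matching IDs
stopifnot(
  nrow(rollsum_df) == nrow(match_df_basic_plot),
  all(sort(rollsum_df$id) == sort(match_df_basic_plot$id))
)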

Since we have more than two arguments that we are passing to the add_xG_shot() function we need to iterate over every row using the pmap() function instead of map2().

## map plot to df again with a shot point for each row/plotdfdfdf <- match_df_basic_plot %>%   ## shot-per-row, using 'plot' as base pass along the 'x', 'y' coordinates and xG value  ## to "add_xG_shot()` function for each row.   ## have empty 'ggplot' column for gt plot-insertion purposes  mutate(complete_plot = pmap(list(x, y, xG, team_color, plot), add_xG_shot),         ggplot = NA) %>%   select(-plot) %>%   left_join(rollsum_df, by = "id")

We can check the data frame for the plot in the first row, which is the first shot in the game from Mo Salah in the 2nd minute.

dfdfdf$complete_plot[[1]]

Now for a plot for a row where the team is Leeds United. You can see that the plot is different from the Liverpool one as the “grey”-ed out shots are Leeds’ other shots. This is because of what we did earlier in creating the “basic” shot map for each team separately.

dfdfdf$complete_plot[[4]]

Tidying up to prep for table creation

We are now very close to the data frame we need to start making the table.

glimpse(dfdfdf)## Rows: 28## Columns: 26## $ team_name        Liverpool, Liverpool, Liverpool, Leeds, Liverpool, ...## $ id               1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...## $ minute           2, 3, 6, 11, 19, 24, 28, 29, 32, 39, 39, 48, 49, 52...## $ result           Blocked Shot, GOAL, Blocked Shot, GOAL, GOAL, Saved...## $ X                0.875, 0.885, 0.860, 0.842, 0.946, 0.734, 0.936, 0....## $ Y                0.347, 0.500, 0.322, 0.607, 0.542, 0.374, 0.359, 0....## $ xG               0.06855621, 0.76116884, 0.06080124, 0.05824600, 0.4...## $ firstname        "Mohamed", "Mohamed", "Mohamed", "Jack", "Virgil", ...## $ player           "Salah", "Salah", "Salah", "Harrison", "van Dijk", ...## $ h_a              "h", "h", "h", "a", "h", "h", "h", "a", "h", "h", "...## $ player_id        1250, 1250, 1250, 8720, 833, 605, 5247, 822, 1250, ...## $ situation        "Open Play", "Penalty", "Open Play", "Open Play", "...## $ season           2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202...## $ shotType         "Left Foot", "Left Foot", "Left Foot", "Right Foot"...## $ match_id         14090, 14090, 14090, 14090, 14090, 14090, 14090, 14...## $ h_team           "Liverpool", "Liverpool", "Liverpool", "Liverpool",...## $ a_team           "Leeds", "Leeds", "Leeds", "Leeds", "Leeds", "Leeds...## $ date             "2020-09-12 16:30:00", "2020-09-12 16:30:00", "2020...## $ player_assisted  "Sadio Mané", NA, "Sadio Mané", "Kalvin Phillips", ...## $ lastAction       "Pass", NA, "Pass", "Take On", "Cross", "None", "No...## $ team_color       "#d00027", "#d00027", "#d00027", "#1D428A", "#d0002...## $ x                513.625, 519.495, 504.820, 494.254, 555.302, 430.85...## $ y                129.431, 186.500, 120.106, 226.411, 202.166, 139.50...## $ complete_plot    [<1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17,...## $ ggplot           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...## $ rollsum_goal     0.06855621, 0.82972505, 0.89052629, 0.05824600, 1.3...

There are a lot of columns that we won't really need for the table, so we'll do some final cleaning up before we get to the {gt} table code. We select() the columns that hold the info we really need and then make edits to the lastAction column to clean it up a bit. We replace any NAs with a blank so they don't show up in the table as NA text. A new player_name column is created by concatenating the firstname and player (which contains the player's last name) columns into one. Finally, we use the new relocate() function to move the newly created player_name column after the team_name column.

## data creation for actual tablematch_shots_table_df <- dfdfdf %>%   select(minute, team_name, result, xG, firstname, player,          ggplot, complete_plot, rollsum = rollsum_goal,         situation, type = shotType, player_assisted, lastAction) %>%   ## player name labels, clean "lastAction"  mutate(player_name = paste(firstname, player),         lastAction = if_else(lastAction == "None", NA_character_, lastAction),         xG = xG %>% round(digits = 2),         rollsum = rollsum %>% round(digits = 2)) %>%   ## NAs as blanks  mutate(across(where(is.character), ~ replace_na(., ""))) %>%   ## take out extraneous name vars and move to after team name  select(-firstname, -player) %>%   relocate(player_name, .after = team_name) glimpse(match_shots_table_df)## Rows: 28## Columns: 12## $ minute           2, 3, 6, 11, 19, 24, 28, 29, 32, 39, 39, 48, 49, 52...## $ team_name        Liverpool, Liverpool, Liverpool, Leeds, Liverpool, ...## $ player_name      "Mohamed Salah", "Mohamed Salah", "Mohamed Salah", ...## $ result           Blocked Shot, GOAL, Blocked Shot, GOAL, GOAL, Saved...## $ xG               0.07, 0.76, 0.06, 0.06, 0.42, 0.02, 0.09, 0.06, 0.0...## $ ggplot           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...## $ complete_plot    [<1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17,...## $ rollsum          0.07, 0.83, 0.89, 0.06, 1.31, 1.33, 1.42, 0.11, 1.4...## $ situation        "Open Play", "Penalty", "Open Play", "Open Play", "...## $ type             "Left Foot", "Left Foot", "Left Foot", "Right Foot"...## $ player_assisted  "Sadio Mané", "", "Sadio Mané", "Kalvin Phillips", ...## $ lastAction       "Pass", "", "Pass", "Take On", "Cross", "", "", "",...

Code Breakdown (Part 3): The {gt} table!

Soccer ball emoji

To give rows where the shots are goals a bit of 'oomph' in the table, we will add a soccer ball emoji next to the all-caps "GOAL" text. We can add the unicode version of the emoji as an object that we'll then paste() into the text. From my attempts it only works well for the .HTML output; for the .PNG output I use for my regular match summary graphics, I have to use a "star" emoji instead as the soccer ball doesn't render too well in .PNG.

soccer_ball <- ""

Table Headline/Title & Team Logos

To start off, we add in the title and team logo images via HTML. Something as simple as getting the team logo images to sit on either side of the text took a looooonnggg time and many, many iterations because I'm not very good at HTML/CSS, but I got it done eventually…

You can define variables in the text via glue::glue() by declaring them inside {}, so the title can be customized depending on the match data you're passing in. For this example home_stats$xG will evaluate to 3.15 for Liverpool's xG, but if we were running this for a different match with a different home_stats data frame, a different xG value would appear in the table title text automatically.

match_gt_xG_timeline <-   gt(match_shots_table_df) %>%   tab_header(       title = gt::html(glue::glue("

home_team_logo({home_stats$xG} xG){home_stats$TEAMS}{home_stats$GOALS} - {away_stats$GOALS}{away_stats$TEAMS}({away_stats$xG} xG)away_team_logo

{league_year}: Matchday {matchday} ({match_date})")))
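The HTML fragment interpolated inside glue() above is what places the logos and score line around the title. As a rough, simplified sketch (my own markup and sizing, not the author's exact template), it could be assembled along these lines and then passed to tab_header(title = ...):

## Illustrative only: assumed inline styles and image heights
title_html <- gt::html(glue::glue(
  "<div style='display:flex; justify-content:center; align-items:center;'>
     <img src='{home_team_logo}' style='height:40px;'>
     <span style='margin:0 12px;'>({home_stats$xG} xG) {home_stats$TEAMS} {home_stats$GOALS} - {away_stats$GOALS} {away_stats$TEAMS} ({away_stats$xG} xG)</span>
     <img src='{away_team_logo}' style='height:40px;'>
   </div>
   <div style='text-align:center;'>{league_year}: Matchday {matchday} ({match_date})</div>"))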

Table Text Style

Styling the individual cells of our {gt} table can be done via the tab_style() function. Throughout the titles and headers we use the font Roboto Slab which is a thicker version of the Roboto Condensed font that I use for nearly all of my visualizations.

First, we add some styling to the title text that we just created above. We specify the cells we want to style via the locations argument, setting it to the title via the cells_title() function. For the actual styling via the style argument, we can specify font, weight, color, etc. via the cell_text() function.

For styling the column headers we specify the location with the cells_column_labels() function. The special everything() helper function (from the {tidyselect} package) is used as we want to specify the style for all column headers. We want the text in the column headers to stand out so we make them extra big by specifying the size to be xx-large and weight bold.

We also want to add some small vertical borders to the column headers so they are divided up; border styles can be specified via the cell_borders() function.

match_gt_xG_timeline <-   match_gt_xG_timeline %>%   ## title style  tab_style(    style = list(      cell_text(        font = "Roboto Slab",        align = "center",        weight = "bold",        color = "#000000"      )    ),    locations = list(      cells_title(groups = "title")    )  ) %>%   ## column style  tab_style(    style = list(      cell_text(font = "Roboto Slab", align = "center",                 size = "xx-large", weight = "bold"),      cell_borders(sides = c("left", "right"),                    color = "grey20", weight = px(2))    ),    locations = list(      cells_column_labels(everything())    )  ) 

Table Row Color Fill

To style the rows for either team we have to do a bit more in the locations argument of tab_style(). Using filter()-style syntax we specify the rows so that the team_name column matches the value of either home_team (Liverpool) or away_team (Leeds United). For rows where Liverpool took a shot we fill the cells with home_color (red) and for Leeds with away_color (blue). We specified the home_color and away_color variables at the beginning of the tutorial. When all of this code is turned into a function, you would have these variables set as arguments to the table-making function so you can easily specify any color hex code as needed.

match_gt_xG_timeline <-   match_gt_xG_timeline %>%   ## HOME TEAM  tab_style(    style = cell_fill(color = home_color),    locations = cells_body(      rows = team_name == home_team)  ) %>%   ## AWAY TEAM  tab_style(    style = cell_fill(color = away_color),    locations = cells_body(      rows = team_name == away_team)  ) %>%   ## all cell text  tab_style(    style = cell_text(color = "white", align = "center", size = "x-large",                      font = "Roboto Condensed", weight = "bold"),    locations = cells_body(      columns = TRUE)   ) 

Goal/Own Goal Rows

In this section of the {gt} code we go over the text_transform() function for doing two very different operations: appending the soccer ball emoji unicode to the "GOAL" or "OWN GOAL" cells, and adding the shot maps we created earlier into the table.

We can use the text_transform() function to pass an inline function that appends the soccer ball emoji unicode next to the "GOAL" or "OWN GOAL" text via paste(). In specifying the exact location, we point columns at the "result" column and filter for the rows whose "result" value is "GOAL" or "OWN GOAL".

match_gt_xG_timeline <-   match_gt_xG_timeline %>%   ## add Goal result emoji by pasting in the emoji next to the 'result' text  text_transform(    locations = cells_body(      columns = vars(result),      rows = result %in% c("GOAL", "OWN GOAL")),    fn = function(x) paste(x, soccer_ball)  ) 

Appending Shot Map Plots

To get our shot map plots to show up in the table we pass a different inline function to text_transform(). We can use the ggplot_image() helper function, which generates an HTML fragment from a ggplot object. We map() the function over every row in the table to generate the shot maps for every row. The "height" and "aspect_ratio" arguments are passed along through map() to set the dimensions of the plot inside the cells of the table.

match_gt_xG_timeline <-   match_gt_xG_timeline %>%   ## add plots into the empty 'ggplot' column  ## use `ggplot_image()` function and set height and aspect ratio  text_transform(    locations = cells_body(columns = vars(ggplot)),    fn = function(x) {      map(match_shots_table_df$complete_plot, ggplot_image, height = px(150), aspect_ratio = 2)    }  )

Rename Columns

Using the cols_label() function we can easily rename all of the column headers to more table-friendly names, while cols_align() is used to align the text of the columns.

match_gt_xG_timeline <-   match_gt_xG_timeline %>%   ## Rename cols  cols_label(    minute = "Minute", team_name = "Team", player_name = "Player",     result = "Result", xG = "xG", rollsum = "Cumulative xG",     ggplot = "Shot Map", situation = "Situation",     type = "Shot Type", lastAction = "Assist Action", player_assisted = "Assist Player"  ) %>%   cols_align("center")

Other Table Options & Source Notes

The tab_options() function is a very general function that gives you a lot of options for styles, similar to tab_style() except without having to bother with specifying the location and such. For our purposes I'm using it to add a thick gray border at the top and bottom of the column headers. This can be done with tab_style() like we did for the "left" and "right" vertical borders, but I didn't want to bother with the whole tab_style() syntax all over again. You could just add another tab_style() call and set the "sides" in cell_borders() to "top" and "bottom" if you want to do it that way. We make the top and bottom borders of the column labels a bit thicker to separate them from the title and the body of the table by setting the width to px(5) (px stands for pixel).
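If you did want to go the tab_style() route instead, a sketch of that alternative (mirroring the syntax used earlier in this post, so treat it as illustrative rather than the author's code) would look something like:

## Alternative to the tab_options() border settings shown below
match_gt_xG_timeline %>% 
  tab_style(
    style = cell_borders(sides = c("top", "bottom"),
                         color = "grey", weight = px(5)),
    locations = cells_column_labels(everything())
  )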

The “data_row.padding” argument sets the amount of vertical padding to each row of the table while “source_notes.font.size” is fairly self-explanatory. We set the actual text for the source note via the tab_source_note() function. Instead of using HTML to style the text, we can use markdown via the md() function and style the text using regular markdown syntax.

Finally we use the cols_hide() function to hide the columns that we don’t want to show.

match_gt_xG_timeline <-   match_gt_xG_timeline %>%   ## general table options  tab_options(    column_labels.border.top.color = "grey",    column_labels.border.top.width= px(5),    column_labels.border.bottom.color = "grey",    column_labels.border.bottom.width= px(5),    data_row.padding = px(15),    source_notes.font.size = 20  ) %>%   tab_source_note(source_note = md(source_text)) %>%   cols_hide(vars(complete_plot, xG))

Complete function

We can wrap this all up in a neat function that takes a few useful and flexible arguments so that we can apply the code to other matches. We'll use the inputs we created in Part 1 as the arguments to pass to the function; the function itself will contain all the code from Part 2 and Part 3.

home_team = "Liverpool"away_team = "Leeds"home_color = "#d00027" away_color = "#1D428A"match_date <- "Sep. 12, 2020"league_year <- "Premier League 2020-2021"matchday <- 1source_text <- "**Table**: Ryo Nakagawara (**Twitter**: @R_by_Ryo) | **Data**: understat"## shots_df, rollsum_df, home_stats, away_stats are all created in Part 1!create_timeline_table <- function(shots_df, rollsum_df, home_stats, away_stats,                                  home_team, away_team, home_color, away_color,                                   match_date, league_year, matchday,                                  source_text) {    pitch_custom <- list(    length = 587,    width = 373,    penalty_box_length = 101,    penalty_box_width = 211,    six_yard_box_length = 31,    six_yard_box_width = 111,    penalty_spot_distance = 66,    goal_width = 45,    origin_x = 0,    origin_y = 0)    ## create coords  match_df <- shots_df %>%     ## switch coordinates for vertical view    mutate(      x = case_when(        h_a == "a" ~ X * 587,        h_a == "h" ~ X * 587,        TRUE ~ 0),      y = case_when(        h_a == "a" ~ Y * 373,        h_a == "h" ~ Y * 373,        TRUE ~ 0)) %>%    ## edit result values    mutate(result = case_when(      result == "Goal" ~ "GOAL",      result == "Own Goal" ~ "OWN GOAL",      TRUE ~ result)) %>%     mutate(result = forcats::as_factor(result),           result = forcats::fct_relevel(result, "GOAL", "Saved Shot",                                         "On Post", "Blocked Shot",                                          "Missed Shots", "OWN GOAL"))    create_shotmap_basic <- function(df = data, team_name = team_name) {        shotxG_map_raw <-       ggplot(df %>% filter(team_name == team_name),              aes(x = x, y = y)) +      annotate_pitch(dimensions = pitch_custom) +      ## all shots in grey and transparent      geom_point(aes(x = x, y = y), color = "grey20",                  size = 3, alpha = 0.3) +      #scale_x_continuous(expand = c(0.01, 0)) +      theme_pitch(aspect_ratio = 373/587) +      coord_flip(xlim = c(280, 590),                  ylim = c(10, 365)) +      theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "pt"),            text = element_markdown(family = "Roboto Condensed"),            legend.position = "none")            return(shotxG_map_raw)  }    ## add blank shot map to all rows  match_df_basic_plot <- match_df %>%     tibble::as_tibble() %>%     ## nest by team    group_by(team_name) %>%     nest() %>%     ## apply plots for all shots per team, for each row in their respective "data" column    mutate(plot = map2(data, team_name, create_shotmap_basic)) %>%     ungroup() %>%     unnest(cols = "data") %>%     arrange(id)    add_xG_shot <- function(x, y, xG, team_color, plot) {    shotxG_map_point <-       plot +      # specific shot point in black and bold      geom_point(x = x, y = y, aes(fill = team_color),                 size = 12, stroke = 3, shape = 21) +      scale_fill_identity() +      #scale_x_continuous(expand = c(0.01, 0)) +      ## label for shot point      geom_label(x = 318, y = 186.5,                  color = "black", size = 20,                 fill = "white", family = "Roboto Slab",                 label = glue::glue("{xG %>% round(digits = 2)} xG"))        return(shotxG_map_point)  }    ## map plot to df again with a shot point for each row/plot  dfdfdf <- match_df_basic_plot %>%     ## shot-per-row, using 'plot' as base pass along the 'x', 'y' coordinates and xG value    ## to "add_xG_shot()` function for each row.     
## have empty 'ggplot' column for gt plot-insertion purposes    mutate(complete_plot = pmap(list(x, y, xG, team_color, plot), add_xG_shot),           ggplot = NA) %>%     select(-plot) %>%     left_join(rollsum_df, by = "id")    ## data creation for actual table  match_shots_table_df <- dfdfdf %>%     select(minute, team_name, result, xG, firstname, player,            ggplot, complete_plot, rollsum = rollsum_goal,           situation, type = shotType, player_assisted, lastAction) %>%     ## player name labels, clean "lastAction"    mutate(player_name = paste(firstname, player),           lastAction = if_else(lastAction == "None", NA_character_, lastAction),           xG = xG %>% round(digits = 2),           rollsum = rollsum %>% round(digits = 2)    ) %>%     ## NAs as blanks    mutate(across(where(is.character), ~ replace_na(., ""))) %>%     ## take out extraneous name vars and move to after team name    select(-firstname, -player) %>%     relocate(player_name, .after = team_name)     ## TABLE!!!  soccer_ball <- ""    match_gt_xG_timeline <-     gt(match_shots_table_df) %>%       tab_header(       title = gt::html(glue::glue("

home_team_logo({home_stats$xG} xG){home_stats$TEAMS}{home_stats$GOALS} - {away_stats$GOALS}{away_stats$TEAMS}({away_stats$xG} xG)away_team_logo

{league_year}: Matchday {matchday} ({match_date})"))) %>% ## title style tab_style( style = list( cell_text( font = "Roboto Slab", align = "center", weight = "bold", color = "#000000" ) ), locations = list( cells_title(groups = "title") ) ) %>% ## column style tab_style( style = list( cell_text(font = "Roboto Slab", align = "center", size = "xx-large", weight = "bold"), cell_borders(sides = c("left", "right"), color = "grey20", weight = px(2)) ), locations = list( cells_column_labels(everything()) ) ) %>% ## HOME TEAM tab_style( style = cell_fill(color = home_color), locations = cells_body( rows = team_name == home_team) ) %>% ## AWAY TEAM tab_style( style = cell_fill(color = away_color), locations = cells_body( rows = team_name == away_team) ) %>% ## all cell text tab_style( style = cell_text(color = "white", align = "center", size = "x-large", font = "Roboto Condensed", weight = "bold"), locations = cells_body( columns = TRUE) ) %>% ## add Goal result emoji by pasting in the emoji next to the 'result' text text_transform( locations = cells_body( columns = vars(result), rows = result %in% c("GOAL", "OWN GOAL")), fn = function(x) paste(x, soccer_ball) ) %>% ## add plots into the empty 'ggplot' column ## use `ggplot_image()` function and set height and aspect ratio text_transform( locations = cells_body(columns = vars(ggplot)), fn = function(x) { map(match_shots_table_df$complete_plot, ggplot_image, height = px(150), aspect_ratio = 2) } ) %>% ## Rename cols cols_label( minute = "Minute", team_name = "Team", player_name = "Player", result = "Result", xG = "xG", rollsum = "Cumulative xG", ggplot = "Shot Map", situation = "Situation", type = "Shot Type", lastAction = "Assist Action", player_assisted = "Assist Player" ) %>% cols_align("center") %>% ## general table options tab_options( column_labels.border.top.color = "grey", column_labels.border.top.width= px(5), column_labels.border.bottom.color = "grey", column_labels.border.bottom.width= px(5), data_row.padding = px(15), source_notes.font.size = 20 ) %>% tab_source_note(source_note = md(source_text)) %>% cols_hide(vars(complete_plot, xG)) return(match_gt_xG_timeline)}match_gt_xG_timeline <- create_timeline_table(shots_df, rollsum_df, home_stats, away_stats, home_team, away_team, home_color, away_color, home_team_logo, away_team_logo, match_date, league_year, matchday, source_text)match_gt_xG_timeline


Save Online to RPubs

Taking the first three letters of the home and away team names via str_sub() and pasting them into the file name with glue(), we can save our table with the gtsave() function using a uniform naming style. If you want to save tables for an entire team’s matches in a season, I recommend putting the Match Day number or some other identifier first so the files sort nicely in a folder.

hom <- str_sub(home_stats$TEAMS, 1, 3)
awa <- str_sub(away_stats$TEAMS, 1, 3)

## Save the table in the RMD/output folder in your directory or whatever path you prefer:
gtsave(match_gt_xG_timeline, 
       here::here(glue("RMD/output/{hom}{awa}_match_gt_shotmap.html")))

Alternatively, you can upload it directly to your RPubs account on the web using markdown::rpubsUpload(). It will direct you to the RPubs website, where you’ll be required to log in or create an account. It’s free and it’s a nice place to host your R web content; most people use it to host their presentations or analysis notebooks.

markdown::rpubsUpload(title = glue::glue("gt-{hom}{awa}_match_gt_shotmap.html"),
                      htmlFile = here::here(glue("RMD/output/{hom}{awa}_match_gt_shotmap.html")))

Examples (Full Image & Code)

We can run the same function for other matches from understat. You can also modify the inputs/code a bit and use your own data or other data sources. Below are screenshots, along with links to both the full table images and the code used to create the examples for other matches.

El Clasico: Barcelona vs. Real Madrid


Inter Milan vs. Fiorentina


Acknowledgements

  • Ewen Henderson for the {understatr} package.

  • Rich Iannone for the {gt} package.

  • Joe Gallagher for the R code for adjusting xG per possession sequence.

  • David Smale for providing the inspiration and initial #TidyTuesday code example for inserting ggplots into {gt} via the text_transform() function.

  • Twitter friends (#rstats & football analytics Twitter) for providing feedback on my stuff.

  • Countless Stack Overflow and miscellaneous Google searches.

Conclusion & Future Steps

This blog post was a tutorial on how to create xG and shot timelines using data from understat.com. I also described the data source and my design choices, along with a code breakdown. Using the code available you can try to make your own, or even modify it so that you can create these tables using other data sources that you may have access to.

Writing up this blog post also helped me refactor my code, so a good chunk of what you saw above is going to be incorporated into my own scripts in some shape or another in the near future.

If you made it all the way through, thanks for reading and happy coding!


To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).


The post My R Table Competition 2020 Submission: xG Timeline Table for Soccer/Football with {gt}! first appeared on R-bloggers.

Little useless-useful R functions – R Lorem Ipsum


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Lorem Ipsum is dummy text of the printing and typesetting industry that has been extensively popularized over the past 60 years.

Its main purpose is more than obvious: a reader is not distracted by readable content and can focus on the layout, look or design of a page (or specimen).

The idea is to have a Lorem Ipsum text for the R community, randomly generated from all the functions available in the base package. You might be thinking that users could get distracted by reading the content, but randomly selected function names are about as readable as Lorem Ipsum text itself. There is a twist, though: you might actually spot a function you have never heard of when merely glancing over the text or when copying and pasting it into your design.

The code consists of two steps. The first one grabs all functions from the base package:

function_list <- function(){
  lw <- builtins() #(internal = FALSE)
  lw2 <- help(package="base") 
  lw3 <- ls("package:base")
  lwA <- c(lw, lw2, lw3)
  lwA <- unique(lwA)
  lwA <- trimws(gsub("[[:punct:]]", " ", lwA))  #ltrim / rtrim
  return(lwA)
}

And the second part generates the text, with its length controlled by the text_length argument and the approx flag (approximate length or exact truncation).

RLoremIpsum <- function(text_length, approx=TRUE){
  lw <- function_list()
  LorIps <- ''
  while (nchar(LorIps) < text_length) {
    lw <- gsub("^ *|(?<= ) | *$", "", lw, perl = TRUE)
    new_w <- sample(lw, 1, replace=TRUE)
    LorIps <- paste(LorIps, new_w, sep = " ")
    if (approx==FALSE){
      LorIps <- substr(LorIps, 1, text_length)
    }
  }
  last_word <- tail(strsplit(LorIps, split=" ")[[1]], 1)
  if ((nchar(last_word) == 1) == TRUE) {
    LorIps <- substr(LorIps, 1, nchar(LorIps)-1) # replace last char with blank space
  }
  return(LorIps)
}

And simply generating the text:

# generate Lorem Ipsum with 10000 characters
RLoremIpsum(10000, approx=TRUE)
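If the filler has to fit an exact character budget, the approx argument can be switched off so the result is truncated to the requested length; a small illustrative call (the 280-character budget is just an example, not part of the original post):

# exact-length filler, truncated to at most 280 characters
snippet <- RLoremIpsum(280, approx = FALSE)
nchar(snippet)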

There is your random text filler. But this time, forget the Latin, hello R 🙂

As always, code is available on Github.

Happy R-coding 🙂


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


The post Little useless-useful R functions – R Lorem Ipsum first appeared on R-bloggers.

Excess Deaths during the 1st Wave of Covid-19


[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Abstract

Our goal is to provide some summary statistics of deaths across countries during the 1st Wave of Covid-19 and to compare these numbers with the corresponding ones from previous years. This analysis is not scientific and we cannot draw any conclusions about the impact of Covid-19, since we would need to take many other parameters into consideration, which is out of the scope of this analysis.

We still keep our character as a data science blog focusing on R and Python, and for that reason we will work with R and share the code, making the analysis reproducible.

Data Source

We found it difficult to locate structured, reliable and consistent data about deaths across countries. For this analysis we obtained the data from Our World in Data, more particularly the Excess Mortality page using raw death counts, and then we downloaded the excess-mortality-raw-death-counts.csv file.


Some notes about the data:

  • The number of deaths is measured by week. So, if we decide to report the results by month then some months will have more weeks than others, which is why it is better to report the average weekly deaths by month instead of totals.
  • Week dates are defined by international standard ISO 8601, although not all countries follow this standard.
  • The Human Mortality Database has data for England & Wales (and Scotland) but not for the UK as a whole. England & Wales compose ~89% of the UK population. Source: UK ONS Population estimates for the UK, England and Wales, Scotland and Northern Ireland: mid-2019
  • For some countries data exist up to September 2020, while for others up to October 2020. It takes some weeks for the database to be updated.
  • For this analysis, we will include countries of the Northern Hemisphere and those with a high number of coronavirus cases per population, like the US, Italy, Spain, France etc.

Summary Statistics by Country

The data run from 2015 up to 2020 and refer to the same period each year (for example January to September). We will provide the following summary statistics by country:

  • A table of Average Weekly Deaths by Month
  • A chart of Average Weekly Deaths by Month
  • Total Deaths per Year for the same period
  • Percentage difference in Total deaths of 2020 compared to 2015, 2016, 2017, 2018 and 2019

Prepare the data:

library(tidyverse)

# https://ourworldindata.org/excess-mortality-covid?fbclid=IwAR1LZGFQGl_nXmVMMnhnXOkLzhYTHglGPGYBIKyxOQCP3L2HLr9H99XNgE0
df <- read.csv("excess-mortality-raw-death-count.csv")

# Rename the column names
names(df) <- c("Country", "Code", "Date", "Deaths_2020", "Deaths_AVG_2015_2019", 
               "Deaths_2015", "Deaths_2016", "Deaths_2017", "Deaths_2018", "Deaths_2019")

# Convert the Date to Date-Format
df$Date <- as.Date(df$Date)

US

Let’s provide the summary statistics of the US.

# Excess Deaths by Month in US 
# Create also a Date_Month column where we truncate the date to the month
# Create also a Month column
country <- df %>% filter(Code=="USA") %>% 
  mutate(Date_Month = lubridate::floor_date(Date, unit = "month"), 
         Month = lubridate::month(Date, label=TRUE))

country %>% select(-Deaths_AVG_2015_2019) %>% group_by(Month) %>% 
  summarise(across(Deaths_2020:Deaths_2019, ~mean(.x, na.rm=TRUE))) 

Table of Average Weekly Deaths by Month

(table image: average weekly deaths by month, US)

Chart of Average Weekly Deaths by Month

country %>% select(-Deaths_AVG_2015_2019) %>% group_by(Month) %>% 
  summarise(across(Deaths_2020:Deaths_2019, ~mean(.x, na.rm=TRUE))) %>% 
  pivot_longer(-Month, names_to="Year", values_to="Total_Deaths") %>% 
  ggplot(aes(x=Month, y=Total_Deaths, group=Year)) + 
  geom_line(aes(col=Year)) + geom_point(aes(col=Year)) + 
  ggtitle("US: Avg Weekly Deaths by Month") + ylab("Avg Weekly Deaths within Month") + 
  theme_minimal() 
(chart image: US average weekly deaths by month)

As we can see, in 2020 the deaths during January and February were very close to those of the past years, and then from February onwards there is a significant increase.

Total Deaths per Year for the same period

country %>% select(-Deaths_AVG_2015_2019) %>% 
  summarise(across(Deaths_2020:Deaths_2019, ~sum(.x, na.rm=TRUE))) 
(table image: US total deaths per year, same period)

Percentage difference in Total deaths of 2020 compared to previous years

# % Increase in deaths 
country %>% select(-Deaths_AVG_2015_2019) %>% 
  summarise(sum(Deaths_2020)/across(Deaths_2020:Deaths_2019, ~sum(.x, na.rm=TRUE)) - 1) 
(table image: percentage difference in US 2020 deaths vs. previous years)

We can see that compared to 2019, there is an increase of 13.53% in deaths.


Summary Statistics for Other Countries

We provided a detailed example of how you can generate these statistics for a specific country; you can simply change the country code. Below are the charts and the percentage increase in deaths compared to previous years for the other countries.
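For convenience, the per-country steps above could be wrapped in a small helper; this is just a sketch of the idea (the function name country_chart and the assumption that the dataset uses ISO3 codes such as "ITA" are ours, not part of the original post):

# hypothetical helper: reproduce the monthly chart for any country code
country_chart <- function(df, code) {
  df %>% 
    filter(Code == code) %>% 
    mutate(Month = lubridate::month(Date, label = TRUE)) %>% 
    select(-Deaths_AVG_2015_2019) %>% 
    group_by(Month) %>% 
    summarise(across(Deaths_2020:Deaths_2019, ~mean(.x, na.rm = TRUE))) %>% 
    pivot_longer(-Month, names_to = "Year", values_to = "Total_Deaths") %>% 
    ggplot(aes(x = Month, y = Total_Deaths, group = Year)) +
    geom_line(aes(col = Year)) + geom_point(aes(col = Year)) +
    ggtitle(paste0(code, ": Avg Weekly Deaths by Month")) +
    ylab("Avg Weekly Deaths within Month") +
    theme_minimal()
}

country_chart(df, "ITA")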

Italy

(chart and table images for Italy)

Spain

(chart and table images for Spain)

France

(chart and table images for France)

England and Wales

(chart and table images for England and Wales)

Sweden

(chart and table images for Sweden)

Discussion

From the data above we may argue that, for the countries with many Covid-19 cases, there is a significant increase in raw deaths during the March-May period, and that during the summer the raw deaths for most of the countries revert to their “normal” levels. All of the countries above showed an increase in deaths compared to previous years. For example, Spain has 17.76% more deaths compared to 2019 for the same period (Jan-Sep). The corresponding numbers for the rest of the countries are US 13.53%, Italy 12.35%, France 4.64%, England and Wales 15.62% and Sweden 9.81%.

Again, we want to stress that we cannot guarantee the validity of the data and we do not take other parameters into consideration. For example, the pandemic may result in fewer deaths from other causes: the mobility restrictions during the pandemic might lead to fewer deaths from road accidents. Still, from the data above it seems that there is an “anomaly” in deaths during the 1st Wave of Covid-19. If Covid-19 is a hoax, as our friends below claim, then what is this external factor that causes more deaths?

(Images: “How Conspiracy Theories Are Shaping the 2020 Election” | Time; “Coronavirus: Why do people believe conspiracy theories - and can they ever be convinced not to?” | Sky News)

To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.


The post Excess Deaths during the 1st Wave of Covid-19 first appeared on R-bloggers.

Frank Harrell – Controversies in Predictive Modeling and Machine Learning


[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A month ago we finished the Why R? 2020 conference. We had the pleasure to host Frank Harrell, a professor and founding chair of the Department of Biostatistics at Vanderbilt University School of Medicine. This post contains a biography of the speaker and an abstract of his talk: Controversies in Predictive Modeling and Machine Learning.

Frank is a professor and a founding chair of the Department of Biostatistics at Vanderbilt University School of Medicine. Aside from more than 300 scientific publications, Frank has authored Regression Modeling Strategies with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd Edition 2015, Springer-Verlag), which still serves as a primer in modern statistical modeling for generations of statisticians. His specialties are development of accurate prognostic and diagnostic models, model validation, clinical trials, observational clinical research, cardiovascular research, technology evaluation, pharmaceutical safety, Bayesian methods, quantifying predictive accuracy, missing data imputation, and statistical graphics and reporting.

The fact that some people murder doesn’t mean we should copy them. And murdering data, though not as serious, should also be avoided. —Frank E. Harrell (answering a question on categorization of continuous variables in survival modelling) R-help (July 2005), fortunes::fortune(32)


To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.


The post Frank Harrell - Controversies in Predictive Modeling and Machine Learning first appeared on R-bloggers.

Satellite Image Analysis FAQ: Can I Use R to Analyze Satellite Images?


[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Analyzing satellite images isn’t exclusively reserved for governments and large corporations. At Appsilon, we’ve spent countless hours working with and even writing about analyzing these types of images. As the quality, quantity, and accessibility of satellite images improve, more and more satellite image datasets are readily available. As a result, interest in satellite imagery analysis (particularly as it relates to deep learning) is at an all-time high.

To address some of the most common questions our team receives, we decided to write an article in an FAQ format. Some of the answers here might surprise you.

Interested in ways to acquire satellite image datasets? Please read our complete guide.

This FAQ addresses the following questions:

  • How do satellites take pictures?
  • Do satellites collide?
  • Can I go to Space?
  • Can I use R to process satellite images?

How do satellites take pictures?

The answer to this question is not obvious or intuitive. Taking a picture with a satellite is not at all like taking a picture with a standard camera on the Earth’s surface. 

Let’s illustrate with an example: the Sentinel-2 mission, which consists of twin satellites (Sentinel 2A and Sentinel 2B). Each satellite orbits the Earth longitudinally while the Earth spins laterally. As you can see in the image below, each satellite takes a “strip” of photos of the Earth’s surface. After five days of collecting these complementary strips, they can capture the whole Earth.


What does this mean in practice? If you choose any specific point on Earth, you will get a new image of that location every 5 to 7 days. 


Do satellites collide?

More than 5,700 satellites are currently orbiting the Earth. This doesn’t sound like a lot, considering the large size of the Earth and the small size of your average satellite. So, the short answer is no: satellites do not collide. However, they can collide.

Here is a tweet from the European Space Agency after they had to perform a maneuver because two satellites were about to crash. This was a special moment because it was the first time a space agency had to initiate such a maneuver. 

(embedded tweet from ESA Operations)

Can I go to Space?

This is a common question, and we have good news and bad news for people who want to go to Space. The good news is that it is possible to go to the International Space Station: as of June 2019, NASA has opened the space station to private citizens who want to visit.

However, the estimated cost of the ticket is $52M. So, needless to say, the price is prohibitive to most. 

.@Space_Station is open for commercial business! Watch @Astro_Christina talk about the steps we’re taking to make our orbiting laboratory accessible to all Americans. pic.twitter.com/xLp2CpMC2x

— NASA (@NASA) June 7, 2019

Can I use R to process satellite images?

In general, yes. But let’s start with what shouldn’t be done in R. There are two main categories: 

  • Data pre-processing 
  • Resource-intensive operations

Downloading 100 images and processing them on your computer is very resource-intensive and shouldn’t be done locally in R. There are platforms available that will do the pre-processing and send you small cutouts of the shapefile that you want. 

There are Google Earth Engine and Amazon Web Services (AWS), which allow you to query their APIs. They already have public image sets available, and you can upload your own image sets. All of this is available at your fingertips. 

All you need to say is “Google, I want a set of dates for Sentinel images that cover a small square containing Loews Hotel,” and you are set. From there, you choose one or more dates and ask the API to send you already cropped images, reducing the image size by hundreds of kilobytes.

This all happens quite quickly, as you’re using a vast distributed infrastructure to do the calculations. Besides, you can conduct computations there and receive indicators. For example, you can receive the NDVI indicator, which is a simple, mathematical combination of the near-infrared and red channels.
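As a reminder of what such an indicator looks like in code, here is a minimal sketch of the NDVI formula, (NIR - Red) / (NIR + Red), applied to two plain numeric matrices standing in for the near-infrared and red bands; the values are made up for illustration, and a real workflow would use raster objects from a package such as raster or terra instead:

# toy near-infrared and red bands (illustrative values only)
nir <- matrix(c(0.60, 0.55, 0.20, 0.15), nrow = 2)
red <- matrix(c(0.10, 0.12, 0.18, 0.16), nrow = 2)

# NDVI = (NIR - Red) / (NIR + Red), ranging from -1 to 1
ndvi <- (nir - red) / (nir + red)
ndvi  # higher values indicate denser, healthier vegetation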

Where does R shine for satellite image analysis? In 3 places:

  • When building dashboards – for presenting data
  • When building statistical models
  • When training deep learning models – by using R interfaces to Keras and H2O

You can analyze and forecast the indicators that you have built. Operating on small images allows you to leverage many useful R packages to experiment with the data and gain valuable insights. You can also build neural networks that will help you identify objects in the images.
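To make the last point about neural networks concrete, here is a rough sketch of a tiny binary classifier for image chips using the R interface to Keras mentioned above; the input size, layer sizes and the two-class setup are placeholder assumptions of ours, not a recommended architecture or anything from the original article:

library(keras)

# minimal CNN sketch: classify 64x64 RGB image chips into two classes
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# model %>% fit(x_train, y_train, epochs = 10)  # x_train/y_train: your prepared chips and labels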

Here is an example of a satellite image analysis dashboard that you can build with R.

(screenshot: satellite image analysis dashboard built with R Shiny)

By combining publicly available geospatial data for parcel shapefiles, you can draw any parcel on a map and request the available image dates for that parcel. Then you can analyze the image and indicate where crops are destroyed or unhealthy.

Want to see more Shiny Dashboard Examples?

If you need to create powerful, customizable, and interactive satellite imagery dashboards that look great and respond to user input, R Shiny is a clear choice. It requires a bit of coding knowledge even for simple dashboards, but R isn’t a very complicated language. You can quickly get up to speed in a couple of weeks or even a couple of days, depending on your prior knowledge of programming. If you want to make a scalable enterprise Shiny dashboard, then you can always reach out to Appsilon for help. We’re continually pushing the limits of what’s possible with Shiny, and we’d be happy to guide you and your company.

Learn more

Appsilon Hiring

Appsilon is hiring! See our Careers page for new openings.

Article Satellite Image Analysis FAQ: Can I Use R to Analyze Satellite Images? comes from Appsilon Data Science | End­ to­ End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End­ to­ End Data Science Solutions.


The post Satellite Image Analysis FAQ: Can I Use R to Analyze Satellite Images? first appeared on R-bloggers.


Upcoming workshop: My 1st Shiny App


[This article was first published on Mirai Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Learn how to empower your data through a Shiny interface at Mirai’s new workshop My 1ST Shiny App on 17th November.

Storytelling is pivotal for successful data science. Communicating the results of your analytics and having an accessible way to explore the data is crucial to engage decision makers. The best way to achieve this is through an attractive and dynamic User Interface (UI). Unfortunately, a data scientist is rarely also a UI developer.

Are we at a standstill? Not at all! Thanks to shiny you can make beautiful interactive visualizations, allow business users to safely play with your data, quickly build a handy prototype and go all the way to a productive solution without knowing anything more than R.

First Shiny App

Get an introduction to the shiny world of R shiny and create your first ever shiny application with our 3.5h hands-on workshop. We will cover the concept of reactivity, explore various input and output widgets and learn how to customize the layout of a shiny app without having to really know any HTML or Javascript. Registrations are now open at this link.
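If you have never seen a shiny app before, the sketch below shows the basic ingredients the workshop covers: an input widget, an output widget, and a reactive expression connecting them. The histogram example is ours, not the workshop material:

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  # re-runs automatically whenever input$n changes: that is reactivity
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("Sample of", input$n, "normal draws"))
  })
}

shinyApp(ui, server)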

For an overview about the benefits of working with shiny, have a look at this article and for some inspiration on the beautiful things that one can make with shiny visit our gallery.



To leave a comment for the author, please follow the link and comment on their blog: Mirai Solutions.


The post Upcoming workshop: My 1st Shiny App first appeared on R-bloggers.

PredictIt vs Five Thirty Eight: An explanation of the differences between prediction markets and polling models


[This article was first published on sweissblaug, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

If you’re like me, you’ve been looking at FiveThirtyEight and PredictIt election probabilities multiple times a day. Recently I’ve been focusing on some interesting disagreements in electoral outcomes at the state level. For instance, FiveThirtyEight gives Pennsylvania, Michigan, Florida, Wisconsin and Ohio at least a 20% lower probability of a Trump victory than PredictIt. Why is PredictIt giving Trump a much higher chance in those states? 

One theory is that FiveThirtyEight focuses on polls and whom the people of the state will vote for, while PredictIt focuses on which candidate will actually win the state. The difference could come down to who is in control of the state: it is possible that the governor or legislature may get to decide who wins. To explore this hypothesis I gathered the current predictions (morning of 2020-11-01) and compared the differences against which party controls each part of the state government (data found here and here). 

Below is a bar chart of the ranked differences between PredictIt and FiveThirtyEight Trump probabilities, along with which party controls the governorship, the upper house, and the lower house. 
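A chart like this is straightforward to build in R; a hedged sketch, assuming a data frame preds with one row per state and columns state, predictit, fivethirtyeight (Trump win probabilities) and gov_party (none of these names come from the original post):

library(dplyr)
library(ggplot2)

# rank states by the gap between the two Trump win probabilities
preds %>%
  mutate(diff = predictit - fivethirtyeight) %>%
  ggplot(aes(x = reorder(state, diff), y = diff, fill = gov_party)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "PredictIt - FiveThirtyEight (Trump win probability)") +
  theme_minimal()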

One can see that of the top 10 states with the largest discrepancies, 6 have Republican governors, 9 have Republican upper legislatures and 8 have Republican lower houses. Interestingly, the states where PredictIt has Biden ahead are also generally Republican held.  
To further explore the relationship I made a scatterplot of the difference in Trump probabilities against FiveThirtyEight’s probability, shown below. 

On average Trump gets ~5.9% higher probability on PredictIt than FiveThirtyEight. States heavily favored to go for Trump (less than a 15% chance of a Biden win on FiveThirtyEight) are actually less favored to go for Trump on PredictIt. This is reversed in states that are most likely to go for Biden, as PredictIt gives them a higher chance for Trump. This could be because prediction markets are not as sure of the polls and expect errors-in-variables from the polls. 
States heavily favored to be Biden wins (>80% on FiveThirtyEight) but held by Republican upper legislatures (Minnesota, Wisconsin, Michigan, and Pennsylvania) have an average difference in prediction probabilities of 22%, while those that are Democratic held have a ~5% difference. If FiveThirtyEight correctly estimates voting intention and PredictIt correctly estimates who wins the state, then this implies a ~15% chance that legislatures could decide the outcome in these states. 


To leave a comment for the author, please follow the link and comment on their blog: sweissblaug.


The post PredictIt vs Five Thirty Eight: An explanation of the differences between prediction markets and polling models first appeared on R-bloggers.

sampling w/o replacement except when replacing


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Another Riddle(r), considering a box with M myrtle balls and D dandelion balls. Balls are drawn without replacement as long as they are of the same colour as the initial draw; otherwise the last ball is put back and the process restarts, until all balls are drawn. The funny thing is that, unless M=0 or D=0, the probability of drawing a myrtle ball at the end is always ½..! This can be easily checked by simulation (when M=2 and D=8)

r=function()sample(0:1,1,p=c(d,m))
for(t in 1:1e6){
  m=2;d=8
  i=r();m=m-!!i;d=d-!i
  while(!!m*d){
    j=r();i=ifelse(i==j,j,r())
    m=m-!!i;d=d-!i}
  F=F+(m>0)}
F/1e6

Now the proof that the probability is ½ is quite straightforward for M=1 (or D=1). But I cannot find a quick fix for larger values. I thus reasoned by recursion, with the probability of emptying a given colour first being d!m!/(d+m)!, whatever the colour and whatever d>0, m>0. Hence half a chance to finish with myrtle. Any shorter sequence of a given colour reduces the value of either d or m, at which point we are using the recursion assumption that the probability is ½…
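One way to write out that recursion (our notation, not the original post's): let p(m,d) denote the probability that the last ball drawn is myrtle, and let q = m!d!/(m+d)! be the probability that the very first run exhausts a given colour, the same value for either colour. Assuming p = ½ holds for every smaller box that still contains both colours,

$$
p(m,d) \;=\; \underbrace{q\cdot 0}_{\text{myrtle emptied first}} \;+\; \underbrace{q\cdot 1}_{\text{dandelion emptied first}} \;+\; (1-2q)\cdot\tfrac{1}{2} \;=\; \tfrac{1}{2},
\qquad q = \frac{m!\,d!}{(m+d)!},
$$

because a run that stops early puts the interrupting ball back and leaves a strictly smaller box with both colours still present, where the induction hypothesis applies.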


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


The post sampling w/o replacement except when replacing first appeared on R-bloggers.

Why RStudio Supports Python for Data Science


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

As RStudio’s products have increasingly supported Python over the past year, some of our seasoned customers have given us quizzical looks and ask, “Why are you adding Python support? I thought you were an R company!”

Just to set the record straight, RStudio does love R and the R community, and we have no plans to change that. However, if RStudio’s goal is to “enhance the production and consumption of knowledge by everyone, regardless of economic means” (which is what we say in our mission statement), that means we have to be open to all ways of approaching that goal, not just the R-based ones.

This still leaves open the question of why we would embrace a language that some in the data science world think of as a competitor. And while I can’t claim we have a definitive answer, we do have something more than anecdotes to encourage R users to embrace Python as well. We have data.

Survey Data Says “R and Python Are Used for Different Things”

“In God we trust; others must provide data.”

– Attributed to W. Edwards Deming and others, including Anonymous

RStudio has run a broad-based survey of people who use or intend to use R over the past two years. In the 2019 edition of the survey, we asked our more than 2,000 respondents to answer two questions:

“What applications do you use R for most?”

and

“What applications do you use Python for most?”

Respondents were allowed to check as many answers as they wished in both cases. They also were allowed to enter their own application categories as an open-ended response. It is important to note that while this data is indicative of user attitudes, it is by no means conclusive.

Below are the summary plots for the results of these survey questions.

Figure 1: R is used most commonly for visualization, statistical analysis, and data transformation.
Figure 2: R users employ Python most commonly for data transformation and machine learning.

Taking these charts at face value (again, read the next section before you do that), we can draw some interesting conclusions:

  • R users use Python! We had just over 2,000 survey respondents who said they use R while nearly 1,100 survey respondents said they used Python. Because our survey is focused on R users, this means that roughly half of our respondents are using Python as well as R.

  • Visualization and statistical analysis are R’s most common uses. Nearly 9 out of 10 R users apply it in these ways. Data transformation is also a close third place.

  • Data transformation and machine learning are Python’s most common applications. A majority of Python users do data transformation and machine learning with the language. No other applications are as common; only a third of Python users use it for statistical analysis or modeling.

Think of These Results As Directional Instead of Hard Numbers

While these analyses are interesting and the sample sizes reasonable, readers should understand that these results aren’t really representative of all data scientists. As the creator and primary analyst for this survey, I can give you several reasons why you shouldn’t put too much stock in these numbers beyond their overall direction:

  • We only surveyed people interested in R. The introduction to the survey specifically says that it is open to “anyone who is interested in R, regardless of whether they have learned the language.” If a Python-only user looked at the survey, it’s unlikely they would have completed it, which means they aren’t represented in the results.

  • We didn’t do a random sample. We solicited responses by asking RStudio employees to invite their Twitter and RStudio Community followers to fill it out. It’s highly unlikely that our friends and followers are representative of the larger data science or statistics community, and it undoubtedly leaves out broad swaths of the population of programmers and data scientists.

  • None of the data has been weighted to be representative of any broad population. We have not weighted the anonymous demographic information collected in the survey to represent any larger population. That means the survey may have significant gender, ethnic, industry, and educational biases that we haven’t corrected for.

The best way to think of this survey is that it represents the views of a few thousand of RStudio’s friends and customers. While this doesn’t give us any conclusions about the general population of data scientists or programmers, we can use it to think about what we can do to make those people more productive.

RStudio Should (and Does) Support Both R and Python

Despite the fact that we can’t use this survey for general conclusions, we can use this data to think about how RStudio should support our customers and data science community in their work:

  • We should reject the myth that users must choose between R or Python. We had always hypothesized that R users use more than one language to do data science. The data we collected from this survey supports that hypothesis. Becoming an R-only company would only make data science jobs more difficult.

  • We should embrace Python because fully half of our community uses it in addition to R. With more than 50% of R users applying Python to various applications, not supporting Python would force those users to use other tools to get their jobs done.

  • Embracing Python as well as R means that our products should support it too. Forcing data scientists to swap back and forth between different programming environments is inefficient and lowers productivity. By supporting Python in all our products, both free and commercial, we can help our customers get results faster and more seamlessly. That in turn, will help RStudio achieve our broader mission: “to enhance the production and consumption of knowledge by everyone, regardless of economic means”.

While RStudio already offers Python support in its products, we’ll be adding to that support in new versions that will be released in the coming months. Those announcements will appear both here on blog.rstudio.com and on the main web site, so check regularly for when those are released.


Survey Details

RStudio fielded its 2019 R community survey beginning on December 13, 2019. We closed the survey on January 10, 2020 after it had accumulated 2,176 responses. Its details are as follows:

  • The survey was fielded in both English and Spanish. Of the 2,176 responses, 1,838 were in English and 338 were in Spanish. All Spanish results were translated into English for analysis.
  • The survey consists of 52 questions, but it includes branching so not all respondents answer all questions. It also includes questions to detect survey-completing robots.
  • Respondents were solicited from posts on community.rstudio.com and Twitter followers of RStudio employees.
  • Survey results are not representative of any broader population
  • Complete data and incomplete processing scripts can be found in the survey’s Github repository
  • The data and scripts are open source and available to anyone interested.
  • RStudio expects to field this year’s survey in December, 2020.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


The post Why RStudio Supports Python for Data Science first appeared on R-bloggers.

torch for tabular data


[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Machine learning on image-like data can be many things: fun (dogs vs. cats), societally useful (medical imaging), or societally harmful (surveillance). In comparison, tabular data – the bread and butter of data science – may seem more mundane.

What’s more, if you’re particularly interested in deep learning (DL), and looking for the extra benefits to be gained from big data, big architectures, and big compute, you’re much more likely to build an impressive showcase on the former instead of the latter.

So for tabular data, why not just go with random forests, or gradient boosting, or other classical methods? I can think of at least a few reasons to learn about DL for tabular data:

  • Even if all your features are interval-scale or ordinal, thus requiring “just” some form of (not necessarily linear) regression, applying DL may result in performance benefits due to sophisticated optimization algorithms, activation functions, layer depth, and more (plus interactions of all of these).

  • If, in addition, there are categorical features, DL models may profit from embedding those in continuous space, discovering similarities and relationships that go unnoticed in one-hot encoded representations.

  • What if most features are numeric or categorical, but there’s also text in column F and an image in column G? With DL, different modalities can be worked on by different modules that feed their outputs into a common module, to take over from there.

Agenda

In this introductory post, we keep the architecture straightforward. We don’t experiment with fancy optimizers or nonlinearities. Nor do we add in text or image processing. However, we do make use of embeddings, and pretty prominently at that. Thus from the above bullet list, we’ll shed a light on the second, while leaving the other two for future posts.

In a nutshell, what we’ll see is

  • How to create a custom dataset, tailored to the specific data you have.

  • How to handle a mix of numeric and categorical data.

  • How to extract continuous-space representations from the embedding modules.

Dataset

The dataset, Mushrooms, was chosen for its abundance of categorical columns. It is an unusual dataset to use in DL: It was designed for machine learning models to infer logical rules, as in: IF a AND NOT b OR c […], then it’s an x.

Mushrooms are classified into two groups: edible and non-edible. The dataset description lists five possible rules with their resulting accuracies. While the least we want to go into here is the hotly debated topic of whether DL is suited to, or how it could be made more suited to rule learning, we’ll allow ourselves some curiosity and check out what happens if we successively remove all columns used to construct those five rules.

Oh, and before you start copy-pasting: Here is the example in a Google Colaboratory notebook.

library(torch)
library(purrr)
library(readr)
library(dplyr)
library(ggplot2)

download.file(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
  destfile = "agaricus-lepiota.data")

mushroom_data <- read_csv(
  "agaricus-lepiota.data",
  col_names = c(
    "poisonous", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-type",
    "ring-number", "spore-print-color", "population", "habitat"
  ),
  col_types = rep("c", 23) %>% paste(collapse = "")
) %>%
  # can as well remove because there's just 1 unique value
  select(-`veil-type`)

In torch, dataset() creates an R6 class. As with most R6 classes, there will usually be a need for an initialize() method. Below, we use initialize() to preprocess the data and store it in convenient pieces. More on that in a minute. Prior to that, please note the two other methods a dataset has to implement:

  • .getitem(i) . This is the whole purpose of a dataset: Retrieve and return the observation located at some index it is asked for. Which index? That’s to be decided by the caller, a dataloader. During training, usually we want to permute the order in which observations are used, while not caring about order in case of validation or test data.

  • .length(). This method, again for use of a dataloader, indicates how many observations there are.

In our example, both methods are straightforward to implement. .getitem(i) directly uses its argument to index into the data, and .length() returns the number of observations:

mushroom_dataset <- dataset(
  name = "mushroom_dataset",

  initialize = function(indices) {
    data <- self$prepare_mushroom_data(mushroom_data[indices, ])
    self$xcat <- data[[1]][[1]]
    self$xnum <- data[[1]][[2]]
    self$y <- data[[2]]
  },

  .getitem = function(i) {
    xcat <- self$xcat[i, ]
    xnum <- self$xnum[i, ]
    y <- self$y[i, ]
    list(x = list(xcat, xnum), y = y)
  },

  .length = function() {
    dim(self$y)[1]
  },

  prepare_mushroom_data = function(input) {

    input <- input %>%
      mutate(across(.fns = as.factor)) 

    target_col <- input$poisonous %>% 
      as.integer() %>%
      `-`(1) %>%
      as.matrix()

    categorical_cols <- input %>% 
      select(-poisonous) %>%
      select(where(function(x) nlevels(x) != 2)) %>%
      mutate(across(.fns = as.integer)) %>%
      as.matrix()

    numerical_cols <- input %>%
      select(-poisonous) %>%
      select(where(function(x) nlevels(x) == 2)) %>%
      mutate(across(.fns = as.integer)) %>%
      as.matrix()

    list(list(torch_tensor(categorical_cols), torch_tensor(numerical_cols)),
         torch_tensor(target_col))
  }
)

As for data storage, there is a field for the target, self$y, but instead of the expected self$x we see separate fields for numerical features (self$xnum) and categorical ones (self$xcat). This is just for convenience: The latter will be passed into embedding modules, which require its inputs to be of type torch_long(), as opposed to most other modules that, by default, work with torch_float().

Accordingly, then, all prepare_mushroom_data() does is break apart the data into those three parts.

Indispensable aside: In this dataset, really all features happen to be categorical – it’s just that for some, there are but two types. Technically, we could just have treated them the same as the non-binary features. But since normally in DL, we just leave binary features the way they are, we use this as an occasion to show how to handle a mix of various data types.

Our custom dataset defined, we create instances for training and validation; each gets its companion dataloader:

train_indices <- sample(1:nrow(mushroom_data), size = floor(0.8 * nrow(mushroom_data)))
valid_indices <- setdiff(1:nrow(mushroom_data), train_indices)

train_ds <- mushroom_dataset(train_indices)
train_dl <- train_ds %>% dataloader(batch_size = 256, shuffle = TRUE)

valid_ds <- mushroom_dataset(valid_indices)
valid_dl <- valid_ds %>% dataloader(batch_size = 256, shuffle = FALSE)

Model

In torch, how much you modularize your models is up to you. Often, high degrees of modularization enhance readability and help with troubleshooting.

Here we factor out the embedding functionality. An embedding_module, to be passed the categorical features only, will call torch’s nn_embedding() on each of them:

embedding_module <- nn_module(
  
  initialize = function(cardinalities) {
    self$embeddings = nn_module_list(lapply(
      cardinalities,
      function(x) nn_embedding(num_embeddings = x, embedding_dim = ceiling(x/2))))
  },
  
  forward = function(x) {
    embedded <- vector(mode = "list", length = length(self$embeddings))
    for (i in 1:length(self$embeddings)) {
      embedded[[i]] <- self$embeddings[[i]](x[ , i])
    }
    torch_cat(embedded, dim = 2)
  }
)

The main model, when called, starts by embedding the categorical features, then appends the numerical input and continues processing:

net <- nn_module(
  "mushroom_net",
  
  initialize = function(cardinalities,
                        num_numerical,
                        fc1_dim,
                        fc2_dim) {
    self$embedder <- embedding_module(cardinalities)
    self$fc1 <- nn_linear(
      sum(map(cardinalities, function(x) ceiling(x/2)) %>% unlist()) + num_numerical,
      fc1_dim)
    self$fc2 <- nn_linear(fc1_dim, fc2_dim)
    self$output <- nn_linear(fc2_dim, 1)
  },
  
  forward = function(xcat, xnum) {
    embedded <- self$embedder(xcat)
    all <- torch_cat(list(embedded, xnum$to(dtype = torch_float())), dim = 2)
    all %>% self$fc1() %>%
      nnf_relu() %>%
      self$fc2() %>%
      self$output() %>%
      nnf_sigmoid()
  }
)

Now instantiate this model, passing in, on the one hand, output sizes for the linear layers, and on the other, feature cardinalities. The latter will be used by the embedding modules to determine their output sizes, following a simple rule “embed into a space of size half the number of input values”:

cardinalities <- map(
  mushroom_data[ , 2:ncol(mushroom_data)], compose(nlevels, as.factor)) %>%
  keep(function(x) x > 2) %>%
  unlist() %>%
  unname()

num_numerical <- ncol(mushroom_data) - length(cardinalities) - 1

fc1_dim <- 16
fc2_dim <- 16

model <- net(
  cardinalities,
  num_numerical,
  fc1_dim,
  fc2_dim
)

device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"

model <- model$to(device = device)

Training

The training loop now is “business as usual”:

optimizer <- optim_adam(model$parameters, lr = 0.1)

for (epoch in 1:20) {

  model$train()
  train_losses <- c()  

  for (b in enumerate(train_dl)) {
    optimizer$zero_grad()
    output <- model(b$x[[1]]$to(device = device), b$x[[2]]$to(device = device))
    loss <- nnf_binary_cross_entropy(output, b$y$to(dtype = torch_float(), device = device))
    loss$backward()
    optimizer$step()
    train_losses <- c(train_losses, loss$item())
  }

  model$eval()
  valid_losses <- c()

  for (b in enumerate(valid_dl)) {
    output <- model(b$x[[1]]$to(device = device), b$x[[2]]$to(device = device))
    loss <- nnf_binary_cross_entropy(output, b$y$to(dtype = torch_float(), device = device))
    valid_losses <- c(valid_losses, loss$item())
  }

  cat(sprintf("Loss at epoch %d: training: %3f, validation: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}

Loss at epoch 1: training: 0.274634, validation: 0.111689
Loss at epoch 2: training: 0.057177, validation: 0.036074
Loss at epoch 3: training: 0.025018, validation: 0.016698
Loss at epoch 4: training: 0.010819, validation: 0.010996
Loss at epoch 5: training: 0.005467, validation: 0.002849
Loss at epoch 6: training: 0.002026, validation: 0.000959
Loss at epoch 7: training: 0.000458, validation: 0.000282
Loss at epoch 8: training: 0.000231, validation: 0.000190
Loss at epoch 9: training: 0.000172, validation: 0.000144
Loss at epoch 10: training: 0.000120, validation: 0.000110
Loss at epoch 11: training: 0.000098, validation: 0.000090
Loss at epoch 12: training: 0.000079, validation: 0.000074
Loss at epoch 13: training: 0.000066, validation: 0.000064
Loss at epoch 14: training: 0.000058, validation: 0.000055
Loss at epoch 15: training: 0.000052, validation: 0.000048
Loss at epoch 16: training: 0.000043, validation: 0.000042
Loss at epoch 17: training: 0.000038, validation: 0.000038
Loss at epoch 18: training: 0.000034, validation: 0.000034
Loss at epoch 19: training: 0.000032, validation: 0.000031
Loss at epoch 20: training: 0.000028, validation: 0.000027

While loss on the validation set is still decreasing, we’ll soon see that the network has learned enough to obtain an accuracy of 100%.

Evaluation

To check classification accuracy, we re-use the validation set, seeing how we haven’t employed it for tuning anyway.

model$eval()

test_dl <- valid_ds %>% dataloader(batch_size = valid_ds$.length(), shuffle = FALSE)
iter <- test_dl$.iter()
b <- iter$.next()

output <- model(b$x[[1]]$to(device = device), b$x[[2]]$to(device = device))
preds <- output$to(device = "cpu") %>% as.array()
preds <- ifelse(preds > 0.5, 1, 0)

comp_df <- data.frame(preds = preds, y = b[[2]] %>% as_array())
num_correct <- sum(comp_df$preds == comp_df$y)
num_total <- nrow(comp_df)
accuracy <- num_correct/num_total
accuracy

1

Phew. No embarrassing failure for the DL approach on a task where straightforward rules are sufficient. Plus, we’ve really been parsimonious as to network size.

Before concluding with an inspection of the learned embeddings, let’s have some fun obscuring things.

Making the task harder

The following rules (with accompanying accuracies) are reported in the dataset description.

Disjunctive rules for poisonous mushrooms, from most general to most specific:

P_1) odor=NOT(almond.OR.anise.OR.none)
     120 poisonous cases missed, 98.52% accuracy

P_2) spore-print-color=green
     48 cases missed, 99.41% accuracy

P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.
     (stalk-color-above-ring=NOT.brown)
     8 cases missed, 99.90% accuracy

P_4) habitat=leaves.AND.cap-color=white
     100% accuracy

Rule P_4) may also be
P_4') population=clustered.AND.cap_color=white

These rules involve 6 attributes (out of 22). 

Evidently, there’s no distinction being made between training and test sets; but we’ll stay with our 80:20 split anyway. We’ll successively remove all mentioned attributes, starting with the three that enabled 100% accuracy, and continuing our way up. Here are the results I obtained seeding the random number generator like so:

torch_manual_seed(777)
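For each row of the table below, the corresponding columns were dropped before re-building the datasets and re-training; a minimal sketch of that step for the first experiment (re-running everything from dataset creation onwards is left implicit):

# drop the three attributes behind the 100%-accuracy rules, then re-run the pipeline
mushroom_data <- mushroom_data %>%
  select(-`cap-color`, -population, -habitat)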
without | accuracy
cap-color, population, habitat | 0.9938
cap-color, population, habitat, stalk-surface-below-ring, stalk-color-above-ring | 1
cap-color, population, habitat, stalk-surface-below-ring, stalk-color-above-ring, spore-print-color | 0.9994
cap-color, population, habitat, stalk-surface-below-ring, stalk-color-above-ring, spore-print-color, odor | 0.9526

Still 95% correct … While experiments like this are fun, it looks like they can also tell us something serious: Imagine the case of so-called “debiasing” by removing features like race, gender, or income. How many proxy variables may still be left that allow for inferring the masked attributes?

A look at the hidden representations

Looking at the weight matrix of an embedding module, what we see are the learned representations of a feature’s values. The first categorical column was cap-shape; let’s extract its corresponding embeddings:

embedding_weights <- vector(mode = "list")

for (i in 1:length(model$embedder$embeddings)) {
  embedding_weights[[i]] <- model$embedder$embeddings[[i]]$parameters$weight$to(device = "cpu")
}

cap_shape_repr <- embedding_weights[[1]]
cap_shape_repr

torch_tensor
-0.0025 -0.1271  1.8077
-0.2367 -2.6165 -0.3363
-0.5264 -0.9455 -0.6702
 0.3057 -1.8139  0.3762
-0.8583 -0.7752  1.0954
 0.2740 -0.7513  0.4879
[ CPUFloatType{6,3} ]

The number of columns is three, since that’s what we chose when creating the embedding layer. The number of rows is six, matching the number of available categories. We may look up per-feature categories in the dataset description (agaricus-lepiota.names):

cap_shapes <- c("bell", "conical", "convex", "flat", "knobbed", "sunken")

For visualization, it’s convenient to do principal components analysis (but there are other options, like t-SNE). Here are the six cap shapes in two-dimensional space:

library(ggrepel)  # provides geom_label_repel(), used below

pca <- prcomp(cap_shape_repr, center = TRUE, scale. = TRUE, rank = 2)$x[, c("PC1", "PC2")]

pca %>%
  as.data.frame() %>%
  mutate(class = cap_shapes) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_label_repel(aes(label = class)) + 
  coord_cartesian(xlim = c(-2, 2), ylim = c(-2, 2)) +
  theme(aspect.ratio = 1) +
  theme_classic()

Naturally, how interesting you find the results depends on how much you care about the hidden representation of a variable. Analyses like these may quickly turn into an activity where extreme caution is to be applied, as any biases in the data will immediately translate into biased representations. Moreover, reduction to two-dimensional space may or may not be adequate.

This concludes our introduction to torch for tabular data. While the conceptual focus was on categorical features, and how to make use of them in combination with numerical ones, we’ve taken care to also provide background on something that will come up time and again: defining a dataset tailored to the task at hand.

Thanks for reading!


To leave a comment for the author, please follow the link and comment on their blog: RStudio AI Blog.


The post torch for tabular data first appeared on R-bloggers.
