
Outlier Days with R and Python


[This article was first published on R Views, and kindly contributed to R-bloggers.]

Welcome to another installment of Reproducible Finance. Today’s post will be topical as we look at the historical behavior of the stock market after days of extreme returns, and it will also explore one of my favorite coding themes of 2020: the power of RMarkdown as an R/Python collaboration tool.

This post originated when Rishi Singh, the founder of Tiingo and one of the nicest people I have encountered in this crazy world, sent over a note about recent market volatility along with some Python code for analyzing that volatility. We thought it would be a nice project to post that Python code along with the equivalent R code for reproducing the same results. For me, it’s a great opportunity to use RMarkdown’s R and Python interoperability superpowers, fueled by the reticulate package. If you are an R coder and someone sends you Python code as part of a project, RMarkdown + reticulate makes it quite smooth to incorporate that Python code into your work. It was also interesting to learn how a very experienced Python coder might tackle a problem, and then to think about how to tackle that same problem with R. Unsurprisingly, I couldn’t resist adding a few elements of data visualization.

Before we get started, if you’re unfamiliar with using R and Python chunks throughout an RMarkdown file, have a quick look at the reticulate documentation here.
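If you are brand new to this workflow, here is a minimal sketch (mine, not from the reticulate docs) of how the two kinds of chunks sit side by side in one .Rmd file; reticulate exposes R objects to Python as r.<name> and Python objects to R as py$<name>.

```{r}
# R chunk: create an object in the R session
x <- c(1, 2, 3)
```

```{python}
# Python chunk: read that R object via reticulate's `r` interface
y = [i * 2 for i in r.x]
```

```{r}
# R chunk: read the Python object back via py$
py$y
```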

Let’s get to it. Since we’ll be working with R and Python, we start with our usual R setup code chunk to load R packages, but we’ll also load the reticulate package and source a Python script. Here’s what that looks like.

library(tidyverse)
library(tidyquant)
library(riingo)
library(timetk)
library(plotly)
library(roll)
library(slider)
library(reticulate)

riingo_set_token("your tiingo token here")

# Python file that holds my tiingo token
reticulate::source_python("credentials.py")

knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)

Note that I set my Tiingo token twice: first using riingo_set_token() so I can use the riingo package in R chunks, and then by sourcing the credentials.py file, where I have put tiingoToken = 'my token'. Now I can use the tiingoToken variable in my Python chunks. This is necessary because we will use both R and Python to pull in data from Tiingo.
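For completeness, credentials.py contains nothing but that one assignment (substitute your own token):

# credentials.py
tiingoToken = 'my token'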

Next we will use a Python chunk to load the necessary Python libraries. If you haven’t installed these yet, you can open the RStudio terminal and run pip install. Since we’ll be interspersing R and Python code chunks throughout, I will add a # Python chunk comment to each Python chunk and, um, a # R chunk comment to each R chunk.
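For example, a one-liner along these lines in the RStudio terminal should cover the three imports below (assuming pip points at the same Python that reticulate is using):

pip install pandas numpy tiingo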

# Python chunk
import pandas as pd
import numpy as np
import tiingo

Let’s get to the substance. The goal today is to look back at the last 43 years of S&P 500 price history and analyze how the market has performed following a day that sees an extreme return. We will also take care with how we define an extreme return, using rolling volatility to normalize percentage moves.

We will use the mutual fund VFINX as a tradeable proxy for the S&P 500 because it has a much longer history than other funds like SPY or VOO.

Let’s start by passing a URL string from tiingo to the pandas function read_csv, along with our tiingoToken.

# Python chunk
pricesDF = pd.read_csv("https://api.tiingo.com/tiingo/daily/vfinx/prices?startDate=1976-1-1&format=csv&token=" + tiingoToken)

We just created a Python object called pricesDF. We can look at that object in an R chunk by calling py$pricesDF.

# R chunk
py$pricesDF %>% 
  head()

Next, let’s set the date column as the index, in datetime format.

# Python chunk
pricesDF = pricesDF.set_index(['date'])
pricesDF.index = pd.DatetimeIndex(pricesDF.index)

Heading back to R for viewing, we see that the date column is no longer a column: it is now the index of the data frame, and in pandas the index is more like a label than a column. In fact, here’s what happens when we call the row names of this data frame.

# R chunk
py$pricesDF %>% 
  head() %>% 
  rownames()

We now have our prices, indexed by date. Let’s convert adjusted closing prices to log returns and save the results in a new column called returns. Note the use of the shift(1) method here; it is analogous to the lag(..., 1) function in dplyr.
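If the pandas/dplyr analogy is new to you, here is a tiny illustrative sketch on a toy vector (not part of the analysis) showing how shift() lines up with lag() and lead():

# R chunk (illustrative)
x <- c(1, 2, 3)
dplyr::lag(x, 1)   # NA  1  2   same as pandas shift(1)
dplyr::lead(x, 1)  #  2  3 NA   same as pandas shift(-1)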

# Python chunk
pricesDF['returns'] = np.log(pricesDF['adjClose']/pricesDF['adjClose'].shift(1))

Next, we want to calculate the 3-month (63 trading days) rolling standard deviation of these daily log returns, and then divide each daily return by the previous day’s rolling 3-month volatility in order to prevent look-ahead error. We can think of this as normalizing today’s return by the previous 3 months’ rolling volatility, and we will label it stdDevMove.
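In other words, if r_t is today’s log return and sigma_(t-1) is the 63-trading-day rolling standard deviation through yesterday, then stdDevMove_t = r_t / sigma_(t-1).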

# Python chunk
pricesDF['rollingVol'] = pricesDF['returns'].rolling(63).std()
pricesDF['stdDevMove'] = pricesDF['returns'] / pricesDF['rollingVol'].shift(1)

Eventually we want to calculate how the market has performed on the day following a large negative move. To prepare for that, let’s create a column of next day returns using shift(-1).

# Python chunk
pricesDF['nextDayReturns'] = pricesDF.returns.shift(-1)

Now, we can filter on the stdDevMove column and the returns column to isolate days where the return was more than 3 standard deviations below normal (stdDevMove less than -3) and the return itself was less than -3%. We use mean() to find the mean next day return following such large events.

# Python chunk
nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < -.03), ['nextDayReturns']].mean()

Finally, let’s loop through and see how the mean next day return changes as we filter on different extreme negative returns, which we’ll call drop tolerances. We will label the drop tolerance as i, set it at -.03, and then run a while loop that decrements i by .0025 at each pass. In this way we can look at the mean next day return following different levels of negative returns.

# Python chunk
i = -.03
while i >= -.0525:
    nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < i), ['nextDayReturns']]
    print(str(round(i, 5)) + ': ' + str(round(nextDayPerformanceSeries['nextDayReturns'].mean(), 6)))
    i -= .0025

It appears that as the size of the drop gets larger and more negative, the mean bounce back tends to get larger.

Let’s reproduce these results in R.

First, we import prices using the riingo_prices() function from the riingo package.

# R chunk
sp_500_prices <- 
  "VFINX" %>% 
  riingo_prices(start_date = "1976-01-01", end_date = today())

We can use mutate() to add columns of daily returns, rolling volatility, standard deviation moves and next day returns.

# R chunk
sp_500_returns <- 
  sp_500_prices %>%
  select(date, adjClose) %>% 
  mutate(daily_returns_log = log(adjClose/lag(adjClose)),
         rolling_vol = roll_sd(as.matrix(daily_returns_log), 63),
         sd_move = daily_returns_log/lag(rolling_vol),
         next_day_returns = lead(daily_returns_log))

Now let’s filter() on an sd_move less than -3 and a daily_returns_log less than our drop tolerance of -.03.

# R chunk
sp_500_returns %>% 
  na.omit() %>% 
  filter(sd_move < -3 & daily_returns_log < -.03) %>% 
  select(date, daily_returns_log, sd_move, next_day_returns) %>% 
  summarise(mean_return = mean(next_day_returns)) %>% 
  add_column(drop_tolerance = scales::percent(.03), .before = 1)
# A tibble: 1 x 2
  drop_tolerance mean_return
  <chr>                <dbl>
1 3%                 0.00625

We used a while loop to iterate across different drop tolerances in Python; let’s see how to implement that using map_dfr() from the purrr package.

First, we will define a sequence of drop tolerances using the seq() function.

# R chunk
drop_tolerance <- seq(.03, .05, .0025)

drop_tolerance
[1] 0.0300 0.0325 0.0350 0.0375 0.0400 0.0425 0.0450 0.0475 0.0500

Next, we will create a function called outlier_mov_fun that takes a data frame of returns, filters on a drop tolerance and gives us the mean return following large negative moves.

# R chunk
outlier_mov_fun <- function(drop_tolerance, returns) {
  returns %>% 
    na.omit() %>% 
    filter(sd_move < -3 & daily_returns_log < -drop_tolerance) %>% 
    select(date, daily_returns_log, sd_move, next_day_returns) %>% 
    summarise(mean_return = mean(next_day_returns) %>% round(6)) %>% 
    add_column(drop_tolerance = scales::percent(drop_tolerance), .before = 1) %>% 
    add_column(drop_tolerance_raw = drop_tolerance, .before = 1)
}

Notice how that function takes two arguments: a drop tolerance and a data frame of returns.
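Before we iterate, we can sanity-check the function with a single call, which should reproduce the one-off 3% result from above:

# R chunk
outlier_mov_fun(.03, sp_500_returns)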

Next, we pass our sequence of drop tolerances, stored in a variable called drop_tolerance, to map_dfr(), along with our function and our sp_500_returns object. map_dfr() will iterate through our sequence of drop tolerances and apply our function to each one.

# R chunk
map_dfr(drop_tolerance, outlier_mov_fun, sp_500_returns) %>% 
  select(-drop_tolerance_raw)
# A tibble: 9 x 2
  drop_tolerance mean_return
  <chr>                <dbl>
1 3%                 0.00625
2 3%                 0.00700
3 3%                 0.00967
4 4%                 0.0109 
5 4%                 0.0122 
6 4%                 0.0132 
7 4%                 0.0149 
8 5%                 0.0149 
9 5%                 0.0162 

Have a quick glance up at the results of our Python while loop and we should see that the results are consistent.

Alright, let’s have some fun and get to visualizing these results with ggplot and plotly.

# R chunk
(sp_500_returns %>% 
  map_dfr(drop_tolerance, outlier_mov_fun, .) %>% 
  ggplot(aes(x = drop_tolerance_raw, y = mean_return, 
             text = str_glue("drop tolerance: {drop_tolerance}
                              mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + 
  theme_minimal()) %>% 
  ggplotly(tooltip = "text")

[Interactive plotly chart: "Mean Return after Large Daily Drop", with daily drop (3% to 5%) on the x-axis and mean next day return (roughly 0.6% to 1.6%) on the y-axis.]

Here’s what happens when we expand the upper bound to a drop tolerance of -2% and make our intervals smaller, moving from .25% increments to .125% increments.

# R chunk
drop_tolerance_2 <- seq(.02, .05, .00125)

(sp_500_returns %>% 
  map_dfr(drop_tolerance_2, outlier_mov_fun, .) %>% 
  ggplot(aes(x = drop_tolerance_raw, y = mean_return, 
             text = str_glue("drop tolerance: {drop_tolerance}
                              mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + 
  theme_minimal()) %>% 
  ggplotly(tooltip = "text")

[Interactive plotly chart: the same scatter, now with daily drop from 2% to 5% on the x-axis and mean next day return from roughly 0.4% to 1.6% on the y-axis.]

Check out what happens when we expand the lower bound, to a -6% drop tolerance.

# R chunk
drop_tolerance_3 <- seq(.02, .06, .00125)

(sp_500_returns %>% 
  map_dfr(drop_tolerance_3, outlier_mov_fun, .) %>% 
  ggplot(aes(x = drop_tolerance_raw, y = mean_return, 
             text = str_glue("drop tolerance: {drop_tolerance}
                              mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + 
  theme_minimal()) %>% 
  ggplotly(tooltip = "text")

[Interactive plotly chart: the same scatter, now with daily drop from 2% to 6% on the x-axis; the mean next day return jumps from about 1.8% to about 3.7% once the drop tolerance passes 5.25%.]

I did not expect that gap upward when the daily drop passes 5.25%.

A quick addendum, one that I would not have included had I gotten my act together and finished this post 4 days ago: I’m curious how this last week compares with other weeks in terms of volatility. I had in mind to visualize weekly return dispersion, and that seemed a mighty tall task until the brand new slider package came to the rescue! slider has a function called slide_period() that, among other things, allows us to break up a time series according to different periodicities.
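To get a feel for slide_period() before pointing it at our returns, here is a minimal sketch on toy data (two weeks of made-up dates, not from the post); it calls a function once per weekly block:

# R chunk (illustrative)
library(slider)
dates <- as.Date("2020-02-03") + 0:13  # fourteen days, starting on a Monday
# one result per calendar week: each element holds that week's first and last date
slide_period(dates, dates, "week", ~ range(.x), .origin = as.Date("2020-02-03"))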

To break up our returns by week, we call slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week"), where first_monday_december is a date that falls on a Monday. We could use our eyeballs to check a calendar and find a date that’s a Monday, or we could use some good ol’ code. Let’s assume we want to find the first Monday in December of 2016.

We first filter our data with filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))). Then we create a column of weekday names with wday(date, label = TRUE, abbr = FALSE) and filter to our first value of “Monday”.

# R chunk
first_monday_december <- 
  sp_500_returns %>%
  mutate(date = ymd(date)) %>% 
  filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))) %>% 
  mutate(day_week = wday(date, label = TRUE, abbr = FALSE)) %>% 
  filter(day_week == "Monday") %>% 
  slice(1) %>% 
  pull(date)

Now we run our slide_period_dfr() code and it will start on the first Monday in December of 2016, and break our returns into weeks. Since we set .names_to = "week", the function will create a new column called week and give a unique number to each of our weeks.

# R chunk
sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>% 
  head(10)
# A tibble: 10 x 3
    week date                daily_returns_log
   <int> <dttm>                          <dbl>
 1     1 2016-12-05 00:00:00           0.00589
 2     1 2016-12-06 00:00:00           0.00342
 3     1 2016-12-07 00:00:00           0.0133 
 4     1 2016-12-08 00:00:00           0.00226
 5     1 2016-12-09 00:00:00           0.00589
 6     2 2016-12-12 00:00:00          -0.00105
 7     2 2016-12-13 00:00:00           0.00667
 8     2 2016-12-14 00:00:00          -0.00810
 9     2 2016-12-15 00:00:00           0.00392
10     2 2016-12-16 00:00:00          -0.00172

From here, we can group_by() that week column and treat each week as a discrete time period. Let’s use ggplotly to plot each week on the x-axis and the daily returns of each week on the y-axis, so that the vertical dispersion shows us the dispersion of returns within each week. Hover on a point to see the exact date of the return.

# R chunk
(sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>%
  group_by(week) %>%
  mutate(start_week = ymd(min(date))) %>%
  ggplot(aes(x = start_week, y = daily_returns_log, text = str_glue("date: {date}"))) +
  geom_point(color = "cornflowerblue", alpha = .5) +
  scale_y_continuous(labels = scales::percent,
                     breaks = scales::pretty_breaks(n = 8)) +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  labs(y = "", x = "", title = "Weekly Daily Returns") +
  theme_minimal()) %>% 
  ggplotly(tooltip = "text")

[Interactive plotly chart: "Weekly Daily Returns", with the start date of each week (December 2016 onward) on the x-axis and each week’s daily log returns stacked vertically on the y-axis.]
2019-10-30","date: 2019-10-31","date: 2019-11-01","date: 2019-11-04","date: 2019-11-05","date: 2019-11-06","date: 2019-11-07","date: 2019-11-08","date: 2019-11-11","date: 2019-11-12","date: 2019-11-13","date: 2019-11-14","date: 2019-11-15","date: 2019-11-18","date: 2019-11-19","date: 2019-11-20","date: 2019-11-21","date: 2019-11-22","date: 2019-11-25","date: 2019-11-26","date: 2019-11-27","date: 2019-11-29","date: 2019-12-02","date: 2019-12-03","date: 2019-12-04","date: 2019-12-05","date: 2019-12-06","date: 2019-12-09","date: 2019-12-10","date: 2019-12-11","date: 2019-12-12","date: 2019-12-13","date: 2019-12-16","date: 2019-12-17","date: 2019-12-18","date: 2019-12-19","date: 2019-12-20","date: 2019-12-23","date: 2019-12-24","date: 2019-12-26","date: 2019-12-27","date: 2019-12-30","date: 2019-12-31","date: 2020-01-02","date: 2020-01-03","date: 2020-01-06","date: 2020-01-07","date: 2020-01-08","date: 2020-01-09","date: 2020-01-10","date: 2020-01-13","date: 2020-01-14","date: 2020-01-15","date: 2020-01-16","date: 2020-01-17","date: 2020-01-21","date: 2020-01-22","date: 2020-01-23","date: 2020-01-24","date: 2020-01-27","date: 2020-01-28","date: 2020-01-29","date: 2020-01-30","date: 2020-01-31","date: 2020-02-03","date: 2020-02-04","date: 2020-02-05","date: 2020-02-06","date: 2020-02-07","date: 2020-02-10","date: 2020-02-11","date: 2020-02-12","date: 2020-02-13","date: 2020-02-14","date: 2020-02-18","date: 2020-02-19","date: 2020-02-20","date: 2020-02-21","date: 2020-02-24","date: 2020-02-25","date: 2020-02-26","date: 2020-02-27","date: 2020-02-28","date: 2020-03-02","date: 2020-03-03","date: 2020-03-04","date: 2020-03-05","date: 2020-03-06","date: 2020-03-09","date: 2020-03-10","date: 2020-03-11","date: 2020-03-12","date: 2020-03-13"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(100,149,237,1)","opacity":0.5,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(100,149,237,1)"}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":25.5707762557078,"l":46.027397260274},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Weekly Daily 
Returns","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[17080.5,18389.5],"tickmode":"array","ticktext":["2017-01","2017-07","2018-01","2018-07","2019-01","2019-07","2020-01"],"tickvals":[17167,17348,17532,17713,17897,18078,18262],"categoryorder":"array","categoryarray":["2017-01","2017-07","2018-01","2018-07","2019-01","2019-07","2020-01"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[-0.109160098048812,0.0985409515877059],"tickmode":"array","ticktext":["-10.0%","-8.0%","-6.0%","-4.0%","-2.0%","0.0%","2.0%","4.0%","6.0%","8.0%"],"tickvals":[-0.1,-0.08,-0.06,-0.04,-0.02,0,0.02,0.04,0.06,0.08],"categoryorder":"array","categoryarray":["-10.0%","-8.0%","-6.0%","-4.0%","-2.0%","0.0%","2.0%","4.0%","6.0%","8.0%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"9ca55ed5b3d0":{"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"9ca55ed5b3d0","visdat":{"9ca55ed5b3d0":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}

We can also plot the standard deviation of returns for each week.

# R chunk
(sp_500_returns %>% 
  select(date, daily_returns_log) %>% 
  filter(date >= first_monday_december) %>% 
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>% 
  group_by(week) %>% 
  summarise(first_of_week = first(date),
            sd = sd(daily_returns_log)) %>% 
  ggplot(aes(x = first_of_week, y = sd, text = str_glue("week: {first_of_week}"))) + 
  geom_point(aes(color = sd)) + 
  labs(x = "", title = "Weekly Standard Dev of Returns", y = "") + 
  theme_minimal()) %>% 
  ggplotly(tooltip = "text")

{"x":{"data":[{"x":[1480896000,1481500800,1482105600,1482796800,1483401600,1483920000,1484611200,1485129600,1485734400,1486339200,1486944000,1487635200,1488153600,1488758400,1489363200,1489968000,1490572800,1491177600,1491782400,1492387200,1492992000,1493596800,1494201600,1494806400,1495411200,1496102400,1496620800,1497225600,1497830400,1498435200,1499040000,1499644800,1500249600,1500854400,1501459200,1502064000,1502668800,1503273600,1503878400,1504569600,1505088000,1505692800,1506297600,1506902400,1507507200,1508112000,1508716800,1509321600,1509926400,1510531200,1511136000,1511740800,1512345600,1512950400,1513555200,1514246400,1514851200,1515369600,1516060800,1516579200,1517184000,1517788800,1518393600,1519084800,1519603200,1520208000,1520812800,1521417600,1522022400,1522627200,1523232000,1523836800,1524441600,1525046400,1525651200,1526256000,1526860800,1527552000,1528070400,1528675200,1529280000,1529884800,1530489600,1531094400,1531699200,1532304000,1532908800,1533513600,1534118400,1534723200,1535328000,1536019200,1536537600,1537142400,1537747200,1538352000,1538956800,1539561600,1540166400,1540771200,1541376000,1541980800,1542585600,1543190400,1543795200,1544400000,1545004800,1545609600,1546214400,1546819200,1547424000,1548115200,1548633600,1549238400,1549843200,1550534400,1551052800,1551657600,1552262400,1552867200,1553472000,1554076800,1554681600,1555286400,1555891200,1556496000,1557100800,1557705600,1558310400,1559001600,1559520000,1560124800,1560729600,1561334400,1561939200,1562544000,1563148800,1563753600,1564358400,1564963200,1565568000,1566172800,1566777600,1567468800,1567987200,1568592000,1569196800,1569801600,1570406400,1571011200,1571616000,1572220800,1572825600,1573430400,1574035200,1574640000,1575244800,1575849600,1576454400,1577059200,1577664000,1578268800,1578873600,1579564800,1580083200,1580688000,1581292800,1581984000,1582502400,1583107200,1583712000],"y":[0.00428542758271371,0.00568861809441307,0.00260796479741234,0.00463955794747408,0.00391741481793808,0.00269741741395485,0.00351704941088118,0.00483308645419438,0.00474877415206346,0.00311258887176715,0.00264948870806455,0.00300703845118184,0.00742902894305432,0.00277434373390841,0.00461849866133215,0.00554449147735816,0.00371747382211484,0.00200920331663207,0.00324568935145387,0.00579427829778431,0.00529118971328767,0.00188720452501024,0.00134296587543603,0.0100826311872125,0.00194718811639403,0.00403435768384228,0.00167330728202289,0.00268409961777066,0.0053814863947088,0.00735001147793435,0.00657965023344988,0.00320277284662278,0.00244348282914921,0.00173341233786234,0.00182694746613941,0.00659512823944477,0.00921743782178293,0.00520108257894348,0.0023362211601529,0.00450002589746643,0.00447057468661814,0.00180698839992965,0.00262211305946341,0.0024901358468847,0.00193642611595415,0.00198760906022879,0.00513684015835072,0.00234250650318223,0.00201761429303829,0.00530318274327972,0.00304803399322164,0.00555381615098833,0.00363745757809798,0.00477618457516234,0.00325667459244177,0.00313404542426052,0.00170722276979584,0.00357316346616288,0.00584707324263508,0.00528158495465767,0.00891725092563695,0.0283656437260912,0.00644680930969736,0.010197568714238,0.0116325660191617,0.00705946191651277,0.00340059155871215,0.0118107180475284,0.0192565041451108,0.01807155350197,0.00886326437556684,0.00832553141187192,0.00857554846549429,0.00860643934736129,0.00451708956128494,0.00414401773579414,0.00451827900874718,0.0122838052182871,0.00355125688639771,0.0026822542682101,0.00358428118879697,0.00831022352103518,0.00645287911781114,0.0
066631388121584,0.00304648234095383,0.00620342246518341,0.00482950351692109,0.00414547845924569,0.00676492408220786,0.00300557941172333,0.00481004112934702,0.00077039084883359,0.00222477657346596,0.00522813200972381,0.00265885056167373,0.00473831298598575,0.0187484488002918,0.0131839351075612,0.0183745210290145,0.0104518103764823,0.0112010313303377,0.0114762484504,0.00995575389439934,0.009785725021822,0.0200338089594582,0.00960874980868858,0.00863736436641497,0.0314432154449168,0.0241164518850932,0.00363801407935785,0.00731639356428921,0.00964701766765257,0.00910961323548259,0.00631050002959617,0.00656164495496061,0.0040330202868636,0.00362899794664017,0.00290552944247068,0.00564796998240457,0.0111248305220034,0.00505651707501505,0.00444044960599729,0.00463682405863611,0.00162891702220012,0.00440947277880287,0.00624041059658424,0.00760088530262248,0.0141827236338453,0.00778639109465693,0.00642703026645009,0.00861334204435986,0.00322672475436481,0.0050108048551864,0.00597667780771788,0.00456023815692009,0.00387030293659101,0.0043499043589986,0.00510592326839474,0.00404473230258807,0.0192291783496026,0.0188582443716662,0.0152210778853089,0.00676803340448989,0.00925356973123229,0.00322440616798477,0.00296953869446457,0.0054906827759299,0.0138508559646498,0.0112501965114402,0.00550641382972188,0.00382393175423247,0.00496105222015224,0.00202734114896572,0.00361595998846953,0.00216200838610327,0.00479241843997551,0.00783474394872398,0.00454788204552359,0.00319777606587138,0.0025050092577236,0.00732020684174918,0.00447995127355144,0.00392070249037435,0.00464557648025062,0.0121569914328974,0.00772994685840006,0.00361996158123424,0.00630444840185479,0.0176530560760263,0.0387490341140696,0.0825608101381797],"text":["week: 2016-12-05","week: 2016-12-12","week: 2016-12-19","week: 2016-12-27","week: 2017-01-03","week: 2017-01-09","week: 2017-01-17","week: 2017-01-23","week: 2017-01-30","week: 2017-02-06","week: 2017-02-13","week: 2017-02-21","week: 2017-02-27","week: 2017-03-06","week: 2017-03-13","week: 2017-03-20","week: 2017-03-27","week: 2017-04-03","week: 2017-04-10","week: 2017-04-17","week: 2017-04-24","week: 2017-05-01","week: 2017-05-08","week: 2017-05-15","week: 2017-05-22","week: 2017-05-30","week: 2017-06-05","week: 2017-06-12","week: 2017-06-19","week: 2017-06-26","week: 2017-07-03","week: 2017-07-10","week: 2017-07-17","week: 2017-07-24","week: 2017-07-31","week: 2017-08-07","week: 2017-08-14","week: 2017-08-21","week: 2017-08-28","week: 2017-09-05","week: 2017-09-11","week: 2017-09-18","week: 2017-09-25","week: 2017-10-02","week: 2017-10-09","week: 2017-10-16","week: 2017-10-23","week: 2017-10-30","week: 2017-11-06","week: 2017-11-13","week: 2017-11-20","week: 2017-11-27","week: 2017-12-04","week: 2017-12-11","week: 2017-12-18","week: 2017-12-26","week: 2018-01-02","week: 2018-01-08","week: 2018-01-16","week: 2018-01-22","week: 2018-01-29","week: 2018-02-05","week: 2018-02-12","week: 2018-02-20","week: 2018-02-26","week: 2018-03-05","week: 2018-03-12","week: 2018-03-19","week: 2018-03-26","week: 2018-04-02","week: 2018-04-09","week: 2018-04-16","week: 2018-04-23","week: 2018-04-30","week: 2018-05-07","week: 2018-05-14","week: 2018-05-21","week: 2018-05-29","week: 2018-06-04","week: 2018-06-11","week: 2018-06-18","week: 2018-06-25","week: 2018-07-02","week: 2018-07-09","week: 2018-07-16","week: 2018-07-23","week: 2018-07-30","week: 2018-08-06","week: 2018-08-13","week: 2018-08-20","week: 2018-08-27","week: 2018-09-04","week: 2018-09-10","week: 2018-09-17","week: 2018-09-24","week: 
2018-10-01","week: 2018-10-08","week: 2018-10-15","week: 2018-10-22","week: 2018-10-29","week: 2018-11-05","week: 2018-11-12","week: 2018-11-19","week: 2018-11-26","week: 2018-12-03","week: 2018-12-10","week: 2018-12-17","week: 2018-12-24","week: 2018-12-31","week: 2019-01-07","week: 2019-01-14","week: 2019-01-22","week: 2019-01-28","week: 2019-02-04","week: 2019-02-11","week: 2019-02-19","week: 2019-02-25","week: 2019-03-04","week: 2019-03-11","week: 2019-03-18","week: 2019-03-25","week: 2019-04-01","week: 2019-04-08","week: 2019-04-15","week: 2019-04-22","week: 2019-04-29","week: 2019-05-06","week: 2019-05-13","week: 2019-05-20","week: 2019-05-28","week: 2019-06-03","week: 2019-06-10","week: 2019-06-17","week: 2019-06-24","week: 2019-07-01","week: 2019-07-08","week: 2019-07-15","week: 2019-07-22","week: 2019-07-29","week: 2019-08-05","week: 2019-08-12","week: 2019-08-19","week: 2019-08-26","week: 2019-09-03","week: 2019-09-09","week: 2019-09-16","week: 2019-09-23","week: 2019-09-30","week: 2019-10-07","week: 2019-10-14","week: 2019-10-21","week: 2019-10-28","week: 2019-11-04","week: 2019-11-11","week: 2019-11-18","week: 2019-11-25","week: 2019-12-02","week: 2019-12-09","week: 2019-12-16","week: 2019-12-23","week: 2019-12-30","week: 2020-01-06","week: 2020-01-13","week: 2020-01-21","week: 2020-01-27","week: 2020-02-03","week: 2020-02-10","week: 2020-02-18","week: 2020-02-24","week: 2020-03-02","week: 2020-03-09"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":["rgba(22,48,74,1)","rgba(23,50,77,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(21,48,73,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(22,49,75,1)","rgba(21,46,72,1)","rgba(20,46,71,1)","rgba(21,46,71,1)","rgba(24,53,80,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(20,45,69,1)","rgba(19,44,68,1)","rgba(26,57,85,1)","rgba(20,45,69,1)","rgba(21,48,73,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(22,50,76,1)","rgba(24,53,80,1)","rgba(23,51,78,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(20,44,69,1)","rgba(20,45,69,1)","rgba(23,51,78,1)","rgba(25,55,84,1)","rgba(22,49,76,1)","rgba(20,45,70,1)","rgba(22,48,74,1)","rgba(22,48,74,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(20,45,69,1)","rgba(22,49,75,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(22,50,76,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(21,47,72,1)","rgba(21,46,72,1)","rgba(20,44,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(25,55,83,1)","rgba(40,85,123,1)","rgba(23,51,78,1)","rgba(26,57,85,1)","rgba(27,59,88,1)","rgba(24,52,79,1)","rgba(21,47,72,1)","rgba(27,59,89,1)","rgba(33,70,104,1)","rgba(32,69,101,1)","rgba(25,55,83,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(22,48,74,1)","rgba(21,48,74,1)","rgba(22,48,74,1)","rgba(28,60,90,1)","rgba(21,47,72,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(25,54,82,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,46,71,1)","rgba(23,51,78,1)","rgba(22,49,75,1)","rgba(21,48,74,1)","rgba(23,52,79,1)","rgba(21,46,71,1)","rgba(22,49,75,1)","rgba(19,43,67,1)","rgba(20,45,70,1)","rgba(22,49,76,1)","rgba(20,46,71,1)","rgba(22,49,75,1)","rgba(32,70,103,1)","rgba(28,61,91,1)","rgba(32,69,102,1)","rgba(26,57,86,1)","rgba(27,58,87,1)","rgba(27,59,88,1)","rgba(26,56,85,1)","rgba(26,56,85,1)","rgba(33,72,106,1)","rgba(26,56,84,1)","rgba(25,54,82,1)","r
gba(42,90,130,1)","rgba(37,78,114,1)","rgba(21,47,73,1)","rgba(24,52,80,1)","rgba(26,56,84,1)","rgba(25,55,83,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,48,73,1)","rgba(21,47,73,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(27,58,87,1)","rgba(22,49,75,1)","rgba(22,48,74,1)","rgba(22,49,74,1)","rgba(20,44,69,1)","rgba(22,48,74,1)","rgba(23,51,78,1)","rgba(24,53,80,1)","rgba(29,63,94,1)","rgba(24,53,81,1)","rgba(23,51,78,1)","rgba(25,54,82,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(23,50,77,1)","rgba(22,48,74,1)","rgba(21,47,73,1)","rgba(22,48,74,1)","rgba(22,49,75,1)","rgba(21,48,73,1)","rgba(33,70,104,1)","rgba(33,70,103,1)","rgba(30,64,96,1)","rgba(23,52,79,1)","rgba(25,55,84,1)","rgba(21,47,72,1)","rgba(21,46,71,1)","rgba(22,50,76,1)","rgba(29,62,93,1)","rgba(27,58,88,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(20,45,70,1)","rgba(22,49,75,1)","rgba(24,53,81,1)","rgba(22,48,74,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(24,52,80,1)","rgba(22,48,74,1)","rgba(21,48,73,1)","rgba(22,49,75,1)","rgba(27,60,89,1)","rgba(24,53,81,1)","rgba(21,47,73,1)","rgba(23,51,78,1)","rgba(32,68,101,1)","rgba(48,101,146,1)","rgba(86,177,247,1)"],"opacity":1,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":["rgba(22,48,74,1)","rgba(23,50,77,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(21,48,73,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(22,49,75,1)","rgba(21,46,72,1)","rgba(20,46,71,1)","rgba(21,46,71,1)","rgba(24,53,80,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(20,45,69,1)","rgba(19,44,68,1)","rgba(26,57,85,1)","rgba(20,45,69,1)","rgba(21,48,73,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(22,50,76,1)","rgba(24,53,80,1)","rgba(23,51,78,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(20,44,69,1)","rgba(20,45,69,1)","rgba(23,51,78,1)","rgba(25,55,84,1)","rgba(22,49,76,1)","rgba(20,45,70,1)","rgba(22,48,74,1)","rgba(22,48,74,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(20,45,69,1)","rgba(22,49,75,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(22,50,76,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(21,47,72,1)","rgba(21,46,72,1)","rgba(20,44,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(25,55,83,1)","rgba(40,85,123,1)","rgba(23,51,78,1)","rgba(26,57,85,1)","rgba(27,59,88,1)","rgba(24,52,79,1)","rgba(21,47,72,1)","rgba(27,59,89,1)","rgba(33,70,104,1)","rgba(32,69,101,1)","rgba(25,55,83,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(22,48,74,1)","rgba(21,48,74,1)","rgba(22,48,74,1)","rgba(28,60,90,1)","rgba(21,47,72,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(25,54,82,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,46,71,1)","rgba(23,51,78,1)","rgba(22,49,75,1)","rgba(21,48,74,1)","rgba(23,52,79,1)","rgba(21,46,71,1)","rgba(22,49,75,1)","rgba(19,43,67,1)","rgba(20,45,70,1)","rgba(22,49,76,1)","rgba(20,46,71,1)","rgba(22,49,75,1)","rgba(32,70,103,1)","rgba(28,61,91,1)","rgba(32,69,102,1)","rgba(26,57,86,1)","rgba(27,58,87,1)","rgba(27,59,88,1)","rgba(26,56,85,1)","rgba(26,56,85,1)","rgba(33,72,106,1)","rgba(26,56,84,1)","rgba(25,54,82,1)","rgba(42,90,130,1)","rgba(37,78,114,1)","rgba(21,47,73,1)","rgba(24,52,80,1)","rgba(26,56,84,1)","rgba(25,55,83,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,48,73,1)","rgba(21,47,73,1)","rgb
a(21,46,71,1)","rgba(23,50,76,1)","rgba(27,58,87,1)","rgba(22,49,75,1)","rgba(22,48,74,1)","rgba(22,49,74,1)","rgba(20,44,69,1)","rgba(22,48,74,1)","rgba(23,51,78,1)","rgba(24,53,80,1)","rgba(29,63,94,1)","rgba(24,53,81,1)","rgba(23,51,78,1)","rgba(25,54,82,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(23,50,77,1)","rgba(22,48,74,1)","rgba(21,47,73,1)","rgba(22,48,74,1)","rgba(22,49,75,1)","rgba(21,48,73,1)","rgba(33,70,104,1)","rgba(33,70,103,1)","rgba(30,64,96,1)","rgba(23,52,79,1)","rgba(25,55,84,1)","rgba(21,47,72,1)","rgba(21,46,71,1)","rgba(22,50,76,1)","rgba(29,62,93,1)","rgba(27,58,88,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(20,45,70,1)","rgba(22,49,75,1)","rgba(24,53,81,1)","rgba(22,48,74,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(24,52,80,1)","rgba(22,48,74,1)","rgba(21,48,73,1)","rgba(22,49,75,1)","rgba(27,60,89,1)","rgba(24,53,81,1)","rgba(21,47,73,1)","rgba(23,51,78,1)","rgba(32,68,101,1)","rgba(48,101,146,1)","rgba(86,177,247,1)"]}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[1483228800],"y":[0],"name":"99_b3fb57b6de4dc5ce6cf75b6745e4a4c9","type":"scatter","mode":"markers","opacity":0,"hoverinfo":"skip","showlegend":false,"marker":{"color":[0,1],"colorscale":[[0,"#132B43"],[0.0526315789473684,"#16314B"],[0.105263157894737,"#193754"],[0.157894736842105,"#1D3E5C"],[0.210526315789474,"#204465"],[0.263157894736842,"#234B6E"],[0.315789473684211,"#275277"],[0.368421052631579,"#2A5980"],[0.421052631578947,"#2E608A"],[0.473684210526316,"#316793"],[0.526315789473684,"#356E9D"],[0.578947368421053,"#3875A6"],[0.631578947368421,"#3C7CB0"],[0.68421052631579,"#3F83BA"],[0.736842105263158,"#438BC4"],[0.789473684210526,"#4792CE"],[0.842105263157895,"#4B9AD8"],[0.894736842105263,"#4EA2E2"],[0.947368421052632,"#52A9ED"],[1,"#56B1F7"]],"colorbar":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"thickness":23.04,"title":"sd","titlefont":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"tickmode":"array","ticktext":["0.02","0.04","0.06","0.08"],"tickvals":[0.235108333203902,0.479635750641963,0.724163168080025,0.968690585518086],"tickfont":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895},"ticklen":2,"len":0.5}},"xaxis":"x","yaxis":"y","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":25.5707762557078,"l":34.337899543379},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Weekly Standard Dev of 
Returns","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[1475755200,1588852800],"tickmode":"array","ticktext":["2017","2018","2019","2020"],"tickvals":[1483228800,1514764800,1546300800,1577836800],"categoryorder":"array","categoryarray":["2017","2018","2019","2020"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[-0.00331913011563371,0.086650331102647],"tickmode":"array","ticktext":["0.00","0.02","0.04","0.06","0.08"],"tickvals":[0,0.02,0.04,0.06,0.08],"categoryorder":"array","categoryarray":["0.00","0.02","0.04","0.06","0.08"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"9ca575933a1b":{"colour":{},"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"9ca575933a1b","visdat":{"9ca575933a1b":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}

That’s all for today! Thanks for reading and stay safe out there.





Online R, Python & Git Training!


[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Hey there!

Here at Jumping Rivers, we have the capability to teach you R, Python & Git virtually. For the last three years we have been running online training courses for small groups (and even 1 to 1).

How is it different to an in-person course?

It’s the same, but also different! The course content is the same, but obviously the structure is adapted to online training. For example, rather than a single long session, we would break the day up over a couple of days and allow regular check-in points.

For the courses, we use whereby.com. This provides screen-sharing for both instructor and attendees, so none of the interactivity is lost.

What about IT restrictions?

Don’t worry! If your current IT security/infrastructure is a problem, we have two solutions:

  1. Training can be done using cloud services. We can provide a secure RStudio server or Jupyter notebook environment just for your team. This means attendees simply have to log on to our cloud service to be able to use the appropriate software and packages.

  2. We have a fleet of state-of-the-art Chromebooks, available to post to attendees. Each Chromebook comes with all required software and packages pre-installed. A microphone headset can also be provided if necessary.

What is the classroom size?

We have a maximum online classroom size of 12, including the instructor. Attendees will get the opportunity for a follow-up "virtual coding clinic", split into smaller class sizes, in order to enquire about anything related to the course or how they can apply it to their work.

If you would like to enquire about virtual training, either email info@jumpingrivers.com or contact us via our website.


The post Online R, Python & Git Training! appeared first on Jumping Rivers.



A Little Something From Practical Data Science with R Chapter 1


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Here is a small quote from Practical Data Science with R Chapter 1.

It is often too much to ask for the data scientist to become a domain expert. However, in all cases the data scientist must develop strong domain empathy to help define and solve the right problems.

Interested? Please check it out.



Free Coupon for our R Video Course: Introduction to Data Science


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

For all our remote learners, we are sharing a free coupon code for our R video course Introduction to Data Science. The code is ITDS2020, and can be used at this URL https://www.udemy.com/course/introduction-to-data-science/?couponCode=ITDS2020 . Please check it out and share it!



COVID-19 cumulative observed case fatality rate over time by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Preamble

I was slightly reluctant to add to the deluge of charts about the COVID-19 outbreak, but on the other hand making charts is one of the ways I relax and try to understand what’s going on around me. So first, to get it out of the way, my only advice at this point:

  • wash hands frequently, for 20 seconds at a time, with plenty of soap
  • work at home if you can
  • limit any face to face social activities to very small groups of people and stay 1.5 metres apart
  • if you’re a government, encourage people to do the above, while resourcing and preparing the health system properly and trying to cushion the economy (and particularly the most vulnerable people in it) from the shock.

For an educational tool on why social distancing works, I particularly recommend this beautifully put-together story and simulations from The Washington Post.

Evolving understanding over time of case fatality rate

I have a professional interest in uncertainty and in how we work with partial information and the drip feed of new information. So I was interested to create this chart, showing the case fatality rate (deaths from this disease, divided by all people diagnosed with this disease, in a given period of time) of COVID-19 over time.

What I’m showing here is the cumulative case fatality rate, based on all observations up to a given point in time. Eventually there will be a single number for the case fatality rate of the COVID-19 pandemic of 2020 to 2021, but not yet; the future is uncertain.

An edifying point here for statisticians to note is that if we treated deaths from diagnosed cases as a Bernoulli variable drawn from a population of the ‘true’ case fatality rate and naively estimated sampling error, for most of the chart above it would be negligible. Yet clearly there is a lot of uncertainty about where the rate will end up. There are many sources of uncertainty other than sampling error. Bear this in mind when next you consider an opinion poll.
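To see why naive sampling error understates the uncertainty here, consider a quick sketch with made-up round numbers (these are illustrative only, not the actual counts): even a moderately large number of diagnosed cases makes the naive binomial interval around the case fatality rate very tight.

# Illustrative sketch only: a naive binomial 95% CI for a case fatality rate,
# using hypothetical round numbers (3,000 deaths among 90,000 diagnosed cases)
deaths <- 3000
cases <- 90000
p_hat <- deaths / cases
se <- sqrt(p_hat * (1 - p_hat) / cases)
round(p_hat + c(-1.96, 1.96) * se, 4)  # about 0.0322 to 0.0345, a narrow band

The narrowness of that interval is exactly the point: the real uncertainty comes from shifting populations, changing testing regimes and unresolved cases, none of which the binomial model sees.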

Obviously a key driver of the change over time is the move of the disease into different populations, and particularly demographically older countries. Another key driver is the degree of testing, which provides the denominator of the case fatality rate. More testing means more known cases, driving the rate down. We see the impact of both these drivers when we decompose the case fatality rate by country:

The USA (grey line in the chart above, label somewhat hidden by France’s) provides a stark example of that second driver – testing was very slow to get off the ground, and as the extent of infections in that country is revealed by more testing the apparent case fatality rate is headed downwards. Italy – one of the oldest countries in the world – provides the exemplar of the first driver although this reveals itself in the high position on the latest data (7.5%!) rather than the trend in Italy’s time series. The increase over time in fatality rate there might be a result of the increasing strain on the health system, or simply an early-outbreak phenomenon from a lag in infections leading to deaths.

Here’s the code for those two simple plots. All the downstream data wrangling is done for me by Johns Hopkins and by Rami Krispin (who took the Johns Hopkins data and tidied it into an R package). All the upstream data collection, wrangling and reporting is done by the governments of various countries under great strain, mostly reporting to the WHO.

#--------------- Setup--------------------
devtools::install_github("RamiKrispin/coronavirus")
library(coronavirus)
library(tidyverse)
library(scales)

the_caption <- "Source: WHO and many others via Johns Hopkins University and Rami Krispin's coronavirus R package.\nAnalysis by http://freerangestats.info"

top_countries <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(Country.Region) %>%
  summarise(cases = sum(cases)) %>%
  top_n(8, wt = cases)

#---------------------------global total-------------------
first_non_china_d <- coronavirus %>%
  filter(Country.Region != "China" & type == "death" & cases > 0) %>%
  arrange(date) %>%
  slice(1) %>%
  pull(date)

first_italy_d <- coronavirus %>%
  filter(Country.Region == "Italy" & type == "death" & cases > 0) %>%
  arrange(date) %>%
  slice(1) %>%
  pull(date)

d1 <- coronavirus %>%
  group_by(date, type) %>%
  summarise(cases = sum(cases)) %>%
  arrange(date) %>%
  spread(type, cases) %>%
  ungroup() %>%
  mutate(cfr_today = death / confirmed,
         cfr_cumulative = cumsum(death) / cumsum(confirmed))

d1b <- d1 %>%
  filter(date %in% c(first_italy_d, first_non_china_d))

ac <- "steelblue"

d1c <- d1 %>%
  mutate(cc = cumsum(confirmed)) %>%
  summarise(`10000` = min(date[cc > 10000]),
            `100000` = min(date[cc > 100000])) %>%
  gather(variable, date) %>%
  left_join(d1, by = "date") %>%
  mutate(label = paste0(format(as.numeric(variable), 
                               big.mark = ",", scientific = FALSE), 
                        "\ncases"))

d1 %>%
  ggplot(aes(x = date, y = cfr_cumulative)) +
  geom_line() +
  scale_y_continuous(label = percent_format(accuracy = 0.1)) +
  expand_limits(y = 0) +
  geom_point(data = d1b, colour = ac, shape = 1, size = 2) +
  annotate("text", x = first_italy_d, 
           y = filter(d1, date == first_italy_d)$cfr_cumulative - 0.001, 
           label = "First death in Italy", hjust = 0, size = 3, colour = ac) +
  annotate("text", x = first_non_china_d, 
           y = filter(d1, date == first_non_china_d)$cfr_cumulative + 0.001, 
           label = "First death outside China", hjust = 0, size = 3, colour = ac) +
  geom_text(data = d1c, aes(label = label), size = 3, colour = "grey70", 
            hjust = 0.5, lineheight = 0.9, nudge_y = -0.002) +
  labs(caption = the_caption, x = "", y = "Observed case fatality rate",
       title = "Steadily increasing case fatality rate of COVID-19 in early 2020",
       subtitle = "Increase probably reflects move of the disease into older populations.\nNote that actual case fatality is likely to be much lower due to undiagnosed surviving cases.")

#-----------------Country-specific totals------------------------
d2 <- coronavirus %>%
  group_by(date, Country.Region, type) %>%
  summarise(cases = sum(cases)) %>%
  group_by(date, Country.Region) %>%
  spread(type, cases) %>%
  arrange(date) %>%
  group_by(Country.Region) %>%
  mutate(cfr_cumulative = cumsum(death) / cumsum(confirmed)) %>%
  filter(!is.na(cfr_cumulative)) %>%
  ungroup() %>%
  inner_join(top_countries, by = "Country.Region")

d2 %>%
  ggplot(aes(x = date, y = cfr_cumulative, colour = Country.Region)) +
  geom_line() +
  geom_text(data = filter(d2, date == max(date)), aes(label = Country.Region), 
            hjust = 0, check_overlap = FALSE, size = 3) +
  scale_y_continuous(label = percent_format(accuracy = 1), limits = c(0, 0.2)) +
  scale_colour_brewer(palette = "Set2") +
  expand_limits(x = max(d2$date) + 4) +
  labs(caption = the_caption, x = "", y = "Observed case fatality rate",
       title = "Country-specific case fatality rate of COVID-19 in early 2020",
       subtitle = "Eight countries with most diagnosed cases; Iran's early values truncated.\nA high level of uncertainty reflecting rapidly changing denominators as well as many unresolved cases.") +
  theme(legend.position = "none")

Take care out there.



A look at past bear markets and implications for the future


[This article was first published on Data based investing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The S&P 500 is officially in a bear market, and the crash from the high valuation levels has been fast and painful. There is however light at the end of the tunnel. In this post I’ll demonstrate how the US stock market has developed during past bear markets and how the market has recovered during the ten years after the peak.
The reason for choosing ten years as the horizon is that I believe you should not invest in stocks any money you are going to need in the next ten years. The chance of having positive returns increases substantially with time and is almost ninety percent for a period of ten years. The worst annual return for a ten-year period has been about negative four percent since 1928 (sources).
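To make that kind of claim concrete, here is a minimal sketch of the underlying calculation, assuming a hypothetical numeric vector monthly_ret of monthly total returns (not the actual data used in this post):

# Share of rolling 10-year (120-month) windows with a positive cumulative return
window <- 120
n_windows <- length(monthly_ret) - window + 1
cum_10y <- sapply(seq_len(n_windows), function(i) {
  prod(1 + monthly_ret[i:(i + window - 1)]) - 1  # cumulative return of window i
})
mean(cum_10y > 0)  # proportion of ten-year periods with positive returns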
We’ll use monthly total return data of S&P 500 from Shiller beginning from the year 1871 until the very end of last year. The index has been reconstructed to represent the US stock market for dates the S&P 500 didn’t exist yet. The reason why we go so far back in time is to include as many bear markets as possible. Panics and manias have always existed, and the human nature has not changed enough in the past 150 years to make the past data less valid. There has however been a substantial change in the spread of information, which causes panic to spread faster and may possibly make bear markets shorter and deeper.
First, let’s take a look at the 14 bear markets found in the data in nominal terms, which describes how a portfolio would have developed without taking inflation into account. The horizontal black line indicates the drop needed to reach a bear market at minus 20 percent, and a blue color indicates that the return has been positive in the 10 years following the peak i.e. the ending value is higher than the value at the peak, and a red color indicates the opposite.
Only two of the fourteen bear markets did not recover in ten years from the initial peak. Not surprisingly, those two were the ones that peaked in bubble territory in 1929 and 2000. Notice that the bear markets that peaked in 1919 and 1987 were followed by those exact same bubbles.
Below is the same plot with real returns, so the returns describe the actual change of purchasing power by taking inflation into account. Notice that since bear markets are defined as being down by twenty percent in nominal terms, the returns might not dip below the black line because of deflation.
In real terms, four of the fourteen bear markets had not recovered within ten years of peaking. Judging by history, this still leaves us with an over 70 percent chance of the index being higher ten years on, after inflation. Note that the bear market that peaked in 1968 overlaps heavily with the bear market that peaked in 1972, so they could be considered the same bear market, which would increase our chances even further.
Let’s then plot the bear markets in red on top of the index to get a sense of the lengths of the bear markets, from peak to full recovery.
The average length of a bear market from peak until recovery has been 3.95 years, and the average fall from peak until bottom, i.e. the peak-to-trough time, was 1.45 years. The longest bear market, during the 1930s Great Depression, lasted 15.33 years, and the longest continuous fall in the stock market was 2.75 years.
Lastly, let’s take a look at just the drawdowns. The bear market threshold is again indicated with a black horizontal line. The monthly data is only until the end of the year 2019, so the recent drawdown of early 2020 is missing from the graph. At the time of writing, the index is down 27 percent, with only seven of the historical drawdowns being as severe as this one.
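As an aside, the drawdown calculation itself is simple. Here is a minimal sketch (not the linked analysis code), assuming a hypothetical data frame prices with columns date and index holding a monthly total return index:

# Drawdown from the running peak, plus a bear-market flag at the -20% threshold
library(dplyr)

drawdowns <- prices %>%
  arrange(date) %>%
  mutate(peak = cummax(index),          # highest index level seen so far
         drawdown = index / peak - 1,   # proportion below that running peak
         bear = drawdown <= -0.20)      # bear market: down 20 percent or more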
The average drop in a bear market using monthly data has been 33.9 percent, with a maximum of 81.8 percent during the 1930s. Notice again that these are total returns. The drawdowns have been worse during periods with high valuations, as measured by Shiller CAPE or P/B. The maximum drawdowns seem to have also increased with time, which may be caused by lower valuations at the beginning of the time frame and possibly also because people have been more connected than ever, which makes the spread of panic easier.
To conclude, this bear market has been rough and short so far. However, judging by history, most bear markets recover fully within ten years. Valuations that are still elevated compared to history may, however, keep the index from recovering as much as it has after past bear markets.
Be sure to follow me on Twitter for updates about new blog posts like this!
The R code used in the analysis can be found here.



Stock market crisis: is there a tennis ball effect?


[This article was first published on R – first differences, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Stock markets are plummeting as the Corona virus has turned into a global pandemic. During the short periods of recovery one can hear those stock market reporters on TV talking about a tennis ball effect. The idea is that what goes down has to come up again and vice versa. So should you wait for … Continue reading Stock market crisis: is there a tennis ball effect?



When you want more than a chi-squared test, consider a measure of association for contingency tables


[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In my last post, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the \(\chi^2\) statistic.

I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal. This can arise, for example, when we want to compare Likert-type responses across multiple groups. The measure of effect I focused on – the cumulative proportional odds – is quite useful, but is potentially limited for two reasons. First, the proportional odds assumption may not be reasonable, potentially leading to biased estimates. Second, both factors may be nominal (i.e. not ordinal), in which case the cumulative odds model is inappropriate.

An alternative, non-parametric measure of association that can be broadly applied to any contingency table is Cramér’s V, which is calculated as

\[ V = \sqrt{\frac{\chi^2/N}{\min(r-1, c-1)}} \] where \(\chi^2\) is from Pearson’s chi-squared test, \(N\) is the total number of responses across all groups, \(r\) is the number of rows in the contingency table, and \(c\) is the number of columns. \(V\) ranges from \(0\) to \(1\), with \(0\) indicating no association, and \(1\) indicating the strongest possible association. (In the addendum, I provide a little detail as to why \(V\) cannot exceed \(1\).)
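The formula translates directly into a few lines of R. This small helper is just a convenience wrapper of my own (not from any package); the same calculation is worked through by hand below:

# Cramer's V from a contingency table, using the uncorrected Pearson statistic
cramers_v <- function(tab) {
  X2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt((X2 / sum(tab)) / (min(dim(tab)) - 1)))
}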

Simulating independence

In this first example, the distribution of ratings is independent of the group membership. In the data generating process, the probability distribution for rating has no reference to grp, so we would expect similar distributions of the response across the groups:

library(simstudy)

def <- defData(varname = "grp", 
               formula = "0.3; 0.5; 0.2", dist = "categorical")
def <- defData(def, varname = "rating", 
               formula = "0.2;0.3;0.4;0.1", dist = "categorical")

set.seed(99)
dind <- genData(500, def)

And in fact, the distributions across the 4 rating options do appear pretty similar for each of the 3 groups:

In order to estimate \(V\) from this sample, we use the \(\chi^2\) formula (I explored the chi-squared test with simulations in a two-part post here and here):

\[ \chi^2 = \sum_{i,j} {\frac{(O_{ij} - E_{ij})^2}{E_{ij}}} \]

observed <- dind[, table(grp, rating)]
obs.dim <- dim(observed)

getmargins <- addmargins(observed, margin = seq_along(obs.dim), 
                         FUN = sum, quiet = TRUE)

rowsums <- getmargins[1:obs.dim[1], "sum"]
colsums <- getmargins["sum", 1:obs.dim[2]]

expected <- rowsums %*% t(colsums) / sum(observed)

X2 <- sum( ((observed - expected)^2) / expected)
X2
## [1] 3.45

And to check our calculation, here’s a comparison with the estimate from the chisq.test function:

chisq.test(observed)
## 
##  Pearson's Chi-squared test
## 
## data:  observed
## X-squared = 3.5, df = 6, p-value = 0.8

With \(\chi^2\) in hand, we can estimate \(V\), which we expect to be quite low:

sqrt( (X2/sum(observed)) / (min(obs.dim) - 1) )
## [1] 0.05874

Again, to verify the calculation, here is an alternative estimate using the DescTools package, with a 95% confidence interval:

library(DescTools)
CramerV(observed, conf.level = 0.95)
## Cramer V   lwr.ci   upr.ci 
##  0.05874  0.00000  0.08426

 

Group membership matters

In this second scenario, the distribution of rating is specified directly as a function of group membership. This is an extreme example, designed to elicit a very high value of \(V\):

def <- defData(varname = "grp",
            formula = "0.3; 0.5; 0.2", dist = "categorical")

defc <- defCondition(condition = "grp == 1",
            formula = "0.75; 0.15; 0.05; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 2",
            formula = "0.05; 0.75; 0.15; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 3",
            formula = "0.05; 0.05; 0.15; 0.75", dist = "categorical")

# generate the data
dgrp <- genData(500, def)
dgrp <- addCondition(defc, dgrp, "rating")

It is readily apparent that the structure of the data is highly dependent on the group:

And, as expected, the estimated \(V\) is quite high:

observed <- dgrp[, table(grp, rating)]
CramerV(observed, conf.level = 0.95)
## Cramer V   lwr.ci   upr.ci 
##   0.7400   0.6744   0.7987

 

Interpretation of Cramér’s V using proportional odds

A key question is how we should interpret \(V\). Some folks suggest that \(V \le 0.10\) is very weak, and anything over \(0.25\) could be considered quite strong. I decided to explore this a bit by seeing how various cumulative odds ratios relate to estimated values of \(V\).

To give a sense of what some log odds ratios (LORs) look like, I have plotted distributions generated from cumulative proportional odds models, using LORs ranging from 0 to 2. At 0.5, there is slight separation between the groups, and by the time we reach 1.0, the differences are considerably more apparent:

My goal was to see how estimated values of \(V\) change with the underlying LORs. I generated 100 data sets for each LOR ranging from 0 to 3 (increasing by increments of 0.05) and estimated \(V\) for each data set (of which there were 6100). The plot below shows the mean \(V\) estimate (in yellow) at each LOR, with the individual estimates represented by the grey points. I’ll let you draw your own conclusions, but (in this scenario at least), it does appear that 0.25 (the dotted horizontal line) signifies a pretty strong relationship, as LORs larger than 1.0 generally have estimates of \(V\) that exceed this threshold.
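For reference, here is a minimal sketch of this kind of simulation. It is my own reconstruction, not the post's code, and it uses two groups rather than three for brevity: ratings are drawn from a cumulative proportional odds model with a given log odds ratio, and \(V\) is estimated with DescTools.

library(DescTools)

sim_v <- function(lor, n = 500) {
  grp <- rbinom(n, 1, 0.5)                 # two groups, for brevity
  cuts <- qlogis(cumsum(c(0.2, 0.3, 0.4))) # baseline cumulative logits
  z <- rlogis(n, location = grp * lor)     # latent logistic, shifted by the LOR
  rating <- findInterval(z, cuts) + 1      # ordinal response, 1 to 4
  CramerV(table(grp, rating))
}

lors <- seq(0, 3, by = 0.05)
mean_v <- sapply(lors, function(l) mean(replicate(100, sim_v(l))))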

 

p-values and Cramér’s V

To end, I am just going to circle back to where I started at the beginning of the previous post, thinking about p-values and effect sizes. Here, I’ve generated data sets with a relatively small between-group difference, using a modest LOR of 0.40 that translates to a measure of association \(V\) just over 0.10. I varied the sample size from 200 to 1000. For each data set, I estimated \(V\) and recorded whether the p-value from a chi-squared test would have been deemed “significant” (i.e. p-value \(< 0.05\)). The key point here is that as the sample size increases and we rely solely on the chi-squared test, we are increasingly likely to attach importance to the findings even though the measure of association is quite small. However, if we actually consider a measure of association like Cramér’s \(V\) (or some other measure that you might prefer) in drawing our conclusions, we are less likely to get over-excited about a result when perhaps we shouldn’t.

I should also comment that at smaller sample sizes, we will probably over-estimate the measure of association. Here, it would be important to consider some measure of uncertainty, like a 95% confidence interval, to accompany the point estimate. Otherwise, as in the case of larger sample sizes, we would run the risk of declaring success or finding a difference when it may not be warranted.

 

Addendum: Why is Cramér’s \(V \le 1\)?

Cramér’s \(V = \sqrt{\frac{\chi^2/N}{min(r-1, c-1)}}\), which cannot be lower than 0. \(V=0\) when \(\chi^2 = 0\), which will only happen when the observed cell counts for all cells equal the expected cell counts for all cells. In other words, \(V=0\) only when there is complete independence.
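A quick numeric check of the independence case (a sketch of mine, not from the post): a table built as the outer product of its margins divided by \(N\) has observed counts exactly equal to the expected counts, so \(\chi^2 = 0\) and \(V = 0\).

library(DescTools)

# outer product of margins / N gives a table with observed == expected
observed <- outer(c(50, 100, 50), c(40, 60, 80, 20)) / 200
CramerV(observed)
## [1] 0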

It is also the case that \(V\) cannot exceed \(1\). I will provide some intuition for this using a relatively simple example and some algebra. Consider the following contingency table which represents complete separation of the three groups:

I would argue that this initial \(3 \times 4\) table is equivalent to the following \(3 \times 3\) table that collapses responses \(1\) and \(2\) – no information about the dependence has been lost or distorted. In this case \(n_A = n_{A1} + n_{A2}\).

In order to calculate \(\chi^2\), we need to derive the expected values based on this collapsed contingency table. If \(p_{ij}\) is the probability for the cell in row \(i\) and column \(j\), and \(p_{i\cdot}\) and \(p_{\cdot j}\) are the row \(i\) and column \(j\) totals, respectively, then independence implies that \(p_{ij} = p_{i\cdot}p_{\cdot j}\). In this example, under independence, the expected cell count for cell \(i,j\) is \(\frac{n_i}{N} \frac{n_j}{N} N = \frac{n_in_j}{N}\):

If we consider the contribution of group \(A\) to \(\chi^2\), we start with \(\sum_{group \ A} (O_j - E_j)^2/E_j\) and end up with \(N - n_A\):

\[ \begin{aligned} \chi^2_{\text{rowA}} &= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{\left ( \frac{n_An_B}{N} \right )^2}{\frac{n_An_B}{N}} + \frac{\left ( \frac{n_An_C}{N} \right )^2}{\frac{n_An_C}{N}} \\ \\ &= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &=N \left ( \frac{n_A^2 - \frac{2n_A^3}{N} +\frac{n_A^4}{N^2}} {n_A^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &=N \left ( 1 - \frac{2n_A}{N} +\frac{n_A^2}{N^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &= N - 2n_A +\frac{n_A^2}{N} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &= N - 2n_A + \frac{n_A}{N} \left ( {n_A} + n_B + n_C \right ) \\ \\ &= N - 2n_A + \frac{n_A}{N} N \\ \\ &= N - n_A \end{aligned} \]

If we repeat this on rows 2 and 3 of the table, we will find that \(\chi^2_{\text{rowB}} = N - n_B\), and \(\chi^2_{\text{rowC}} = N - n_C\), so

\[ \begin{aligned} \chi^2 &= \chi^2_\text{rowA} +\chi^2_\text{rowB}+\chi^2_\text{rowC} \\ \\ &=(N - n_A) + (N - n_B) + (N - n_C) \\ \\ &= 3N - (n_A + n_B + n_C) \\ \\ &= 3N - N \\ \\ \chi^2 &= 2N \end{aligned} \]

And

\[ \frac{\chi^2}{2 N} = 1 \]

So, under this scenario of extreme separation between groups,

\[ V = \sqrt{\frac{\chi^2}{\text{min}(r-1, c-1) \times N}} = 1 \]

where \(\text{min}(r – 1, c – 1) = \text{min}(2, 3) = 2\).




How to create decorators in R


[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)



Introduction

One of the coolest features of Python is its nice ability to create decorators. In short, decorators allow us to modify how a function behaves without changing the function’s source code. This can often make code cleaner and easier to modify. For instance, decorators are really useful if you have a collection of functions where each function has similar repeating code. Fortunately, decorators can now also be created in R!

The first example below is in Python – we’ll get to R in a moment. If you already know about decorators in Python, feel free to skip below to the R section.

In the example below, we create two functions. The first, print_start_end, takes another function as input: it prints “Starting function call…”, then calls the input function, and finally prints “Finished function call…”.

The other function, todays_date, simply prints today’s date.

import datetime

# defining a decorator
def print_start_end(f):
    def wrapper(*args, **kwargs):
        print("Starting function call...")
        f(*args, **kwargs)
        print("Finished function call...")
    return wrapper

def todays_date():
    print(datetime.datetime.today())

Now, let’s say we want to call our todays_date function, but would like to wrap it inside the print_start_end function. One way to do that is by having a nested function call:

print_start_end(todays_date)()

However, we can also add a decorator to the todays_date like below. In Python, a decorator is created by adding an “@” followed by a function name (the “decorating” function).

@print_start_end
def todays_date():
    print(datetime.datetime.today())

The result of running todays_date() now is below:

[Screenshot: output of todays_date() wrapped by the decorator]

We get the same result using a decorator as we did using a nested function call above. The nice part of this, however, is that we don’t have to have any nested functions or change the code within todays_date. We can also easily comment out the decorator to turn off this functionality if we wanted to.

# @print_start_end
def todays_date():
    print(datetime.datetime.today())

There’s a whole lot more to decorators. To learn more, check out Powerful Python, or Guide to Learning Python Decorators. The second of these resources focuses solely on decorators, while the first one also covers several other Python features.

Decorators in R

Now, let’s walk through how to create decorators in R. We can do that using an awesome package called tinsel.

To get started, we first need to install tinsel, which requires devtools. If you don’t have the devtools package installed, you’ll need to install that first (install.packages("devtools")). Then, you can install tinsel by running the command below.

devtools::install_github('nteetor/tinsel')

Once you’re setup, you can start decorating functions. Let’s recreate the example above using R. The first function is straightforward to reproduce – we’re just changing Python syntax to R.

library(tinsel)

print_start_end <- function(f) {
  wrapper <- function(...) {
    print("Starting function call...")
    f()
    print("Finished function call...")
  }
  return(wrapper)
}
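Just as in Python, we could wrap a function manually with a nested call before turning to any decorator syntax. A quick sketch (the printed date in the comments is of course just illustrative):

todays_date <- function() {
  print(Sys.Date())
}

print_start_end(todays_date)()
# [1] "Starting function call..."
# [1] "2020-03-17"   (whatever today's date is)
# [1] "Finished function call..."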

Now when it comes to the function we’re decorating – todays_date, we add our decorator using “#.” plus the name of our decorator function – in this case, print_start_end. Essentially, the main difference between creating a decorator here in R vs. Python is that we use “#.” rather than an “@” in the line above a function to signify that we want to decorate that function.

#. print_start_end
todays_date <- function() {
  print(Sys.Date())
}

In order for R to recognize that we want to use decorators, we need to source our R script using tinsel’s source_decoratees function. Here we just need to pass the name of our script.

source_decoratees("test_dec.R")

Now when we call todays_date, the decorated version of the function gets executed.

[Screenshot: output of the decorated todays_date() function]

As mentioned, one key use of decorators is to handle collections of functions that have similar repeating lines of code. For example, let’s apply the print_start_end decorator above to two other functions.

#. print_start_end
yesterday <- function() {
  print(Sys.Date() - 1)
}

#. print_start_end
tomorrow <- function() {
  print(Sys.Date() + 1)
}

[Screenshot: output of the decorated yesterday() and tomorrow() functions]

Timing functions with an R decorator

Another nice feature of decorators is that if we make changes to the decorator, we can easily apply these to any of the functions being decorated. For example, below we can change our print_start_end function to time the execution of a function.

print_start_end <- function(f) {
  wrapper <- function(...) {
    start <- proc.time()
    f()
    print(proc.time() - start)
  }
  return(wrapper)
}

Then, we can call our functions like below with our updated decorator. This allows us to make changes once rather than for each function.

yesterday()
tomorrow()

[Screenshot: timing output after changing the decorator]

That’s all for now! Please click to follow my blog on Twitter and keep up with my latest posts, or check out some additional resources for learning Python and R by clicking here.

For more on tinsel, see here.

The post How to create decorators in R appeared first on Open Source Automation.



Generate names using posterior probabilities


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

If you are building synthetic data and need to generate people names, this article will be a helpful guide. This article is part of a series of articles regarding the R package conjurer. You can find the first part of this series here.

Steps to generate people names


1. Installation


Install the conjurer package using the following code.

 install.packages("conjurer") 

2. Training data Vs default data


The package conjurer provides two options to generate names.

    • The first option is to provide custom training data. 
    • The second option is to use the default training data provided by the package.

If it is people names that you are interested in generating, you are better off using the default training data. However, if you would like to generate names of items or products (for example, pharmaceutical drug names), it is recommended that you build your own training data. The function that generates names is buildNames. Let us understand its inputs. The function takes the form given below.

buildNames(dframe, numOfNames, minLength, maxLength)

In this function, dframe is a data frame. It must be a single-column data frame where each row contains a name. These names must only contain English letters (upper or lower case) from A to Z – no special characters such as “;” and no non-ASCII characters. If you do not pass this argument to the function, the function uses the default prior probabilities to generate the names.

numOfNames is a numeric. This specifies the number of names to be generated. It should be a non-zero natural number. 

minLength is a numeric. This specifies the minimum number of alphabets in the name. It must be a non-zero natural number.

maxLength is a numeric. This specifies the maximum number of alphabets in the name. It must be a non-zero natural number.
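As a hedged sketch of the first option, custom training data might be passed like this (the drug names and the column name below are just illustrative placeholders; buildNames expects any single-column data frame of names):

library(conjurer)

drugs <- data.frame(names = c("atorvastatin", "lisinopril", "metformin",
                              "amlodipine", "omeprazole"))
drugNames <- buildNames(dframe = drugs, numOfNames = 3,
                        minLength = 5, maxLength = 8)
print(drugNames)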

3. Example


Let us run this function with an example to see how it works. Let us use the default matrix of prior probabilities for this example. The output would be a list of names as given below.

library(conjurer)
peopleNames <- buildNames(numOfNames = 3, minLength = 5, maxLength = 7)
print(peopleNames)
[1] "ellie"   "bellann" "netar" 

Please note that since this is a random generator, you may get names other than those displayed in the example above. 

4. Consolidated code


Following is the consolidated code for your convenience.

#install latest version
install.packages("conjurer") 

#invoke library
library(conjurer)

#generate names
peopleNames <- buildNames(numOfNames = 3, minLength = 5, maxLength = 7) 

#inspect the names generated
print(peopleNames) 

5. Concluding remarks


In this article, we have learnt how to use the R package conjurer to generate names. Since the algorithm relies on prior probabilities, the names that are output may not look exactly like real human names, but they will sound phonetically like human names. So, go ahead and give it a try. If you would like to understand the underlying code that generates these names, you can explore the GitHub repository here. If you are interested in what’s coming next in this package, you can find it in the issues section here.



Political Ideology & Front-line House Democrats


[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Briefly

A quick look at the voting behavior of the 30 House Democrats that represent congressional districts carried by Trump in 2016 – relative to other House members. Using Rvoteview. For a more in-depth account of the characteristics of front-line House Democrats in the 116th Congress, see this post.

Front-line House Democrats

I have been going on about the House of Representatives lately, especially Democrats representing Trump districts. These guys are so important for any number of reasons. They define the Democratic majority in the House. They are the few remaining moderate voices in the House, from either party. But they are super-vulnerable. Over two-thirds of them are freshman members. Again, their constituents supported Trump in 2016. And they could be trying to defend their seats in 2020 with a Democratic Socialist atop the ticket.

I have posted a list of the 30 front-liners as a simple csv, cached as a part of the uspoliticalextras data package. It is available at the link below.

library(tidyverse)

url1 <- 'https://raw.githubusercontent.com/jaytimm/uspoliticalextras/master/clean-data-sets/thirty-one-house-democrats.csv'
fl <- read.csv(url(url1))

Ideologies in the 116th

So, using the Rvoteview (!) package, we obtain DW-Nominate scores for all members in the 116th House. This session is still in progress, so these numbers will change depending on when they are accessed.

x116 <- Rvoteview::member_search(chamber = 'House', congress = 116) %>%
  mutate(label = gsub(', .*$', '', bioname),
         party_code = ifelse(bioname %in% fl$house_rep, 'xx', party_code),
         party_name = ifelse(bioname %in% fl$house_rep, 'Frontline Dems', 'Other Dems'))

The plot below summarizes voting behaviors as approximated by DW-Nominate scores in two dimensions. Here, our focus is on the first dimension (ie, the x-axis). The 30 front-liners are marked in orange. In the aggregate, then, they vote more moderately than their non-front-line Democrat peers.

p <- x116 %>%
  ggplot(aes(x = nominate.dim1,
             y = nominate.dim2,
             label = label)) +
  annotate("path",
           x = cos(seq(0, 2*pi, length.out = 300)),
           y = sin(seq(0, 2*pi, length.out = 300)),
           color = 'gray',
           size = .25) +
  geom_point(aes(color = as.factor(party_code)),
             size = 2.5,
             shape = 17) +
  theme_bw() +
  ggthemes::scale_color_stata() +
  theme(legend.position = 'none') +
  labs(title = "DW-Nominate ideology scores for the 116th US House",
       subtitle = '30 front-line House Democrats in orange')

p

Focusing on Democrats

Next, we home in a bit on House Democrats. To add some context to the above plot, we calculate quartiles for DW-Nominate scores among Democrats. These are summarized in table below, ranging from progressive to moderate.

dems <- x116 %>%
  filter(party_code %in% c('xx', '100'))

qq <- data.frame(x = quantile(dems$nominate.dim1, probs = seq(0, 1, 0.25)),
                 stringsAsFactors = FALSE)

qq %>% knitr::kable()

             x
0%     -0.7620
25%    -0.4405
50%    -0.3780
75%    -0.2860
100%   -0.0670

We add these quartiles to the plot below, and label front-line House Democrats. Again, front-liners cluster as a group in terms of roll call voting behavior. The most notable exception to this pattern is Lauren Underwood (IL-14). She won her district by five points in 2018, and Trump won the district by four points in 2016. It would appear, then, that her voting behavior and the political ideology of her constituents do not especially rhyme. In other words, she represents a Trump district and votes like a progressive.

p1 <- p +
  xlim(-1, 0) +
  geom_vline(xintercept = qq$x, linetype = 2, color = 'gray') +
  ggrepel::geom_text_repel(
    data = filter(x116, bioname %in% fl$house_rep),
    nudge_y = -0.005,
    direction = "y",
    hjust = 0,
    size = 2.5)

p1

The table below summarizes counts of Democrats by front-line status & ideology quartile. So, roughly 3/4 of front-liners vote in the most moderate Democratic quartile in the House. And all but Underwood are in the top 50%.

dems1 <- dems %>%
  mutate(qt = ntile(nominate.dim1, 4))

dems1 %>%
  group_by(party_name, qt) %>%
  count() %>%
  group_by(party_name) %>%
  mutate(per = round(n/sum(n)*100, 1)) %>%
  knitr::kable(booktabs = T, format = "html") %>%
  kableExtra::kable_styling() %>%
  kableExtra::row_spec(3, background = "#e4eef4")

party_name        qt    n    per
Frontline Dems     1    1    3.3
Frontline Dems     3    6   20.0
Frontline Dems     4   23   76.7
Other Dems         1   58   28.3
Other Dems         2   59   28.8
Other Dems         3   53   25.9
Other Dems         4   35   17.1


RStudio 1.3 Preview: The Little Things


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This blog post is part of a series on new features in RStudio 1.3, currently available as a preview release.

In every RStudio release, we introduce dozens of small quality-of-life improvements alongside bigger headline features. This blog post concludes our series on the upcoming RStudio 1.3 release with a look at some of these little conveniences.

Global Replace

RStudio has long had a Find in Files feature, which makes it possible to easily locate text in your project. If you’re not familiar with this feature, try it out: press Ctrl+Shift+F (macOS: Cmd+Shift+F), or choose Find in Files… from the Edit menu.

In RStudio 1.3, it’s now possible to replace the text you found:

Screenshot of Global Replace in action

After you’ve done a search, switch to Replace view via the toggle, enter your new text, and click Replace All. It works with regular expressions, too.

Resizable Environment Columns

This really is a little thing, but it drove many of you nuts: the size of the columns in the Environment pane was fixed, so if your variables (or values) were long, it was awkward to try to see the whole thing. Now you can!

Screenshot showing resizable columns in the Environment pane

New File Templates

Do you usually start new files with the same information? For example, do you usually include a header comment on your R scripts with metadata you know you’ll find useful later? You can now have RStudio inject this header for you when you create a new file.

Screenshot showing an R script with content from a default template

Create a template in ~/.config/rstudio/templates/default.R (macOS/Linux) or AppData/Roaming/RStudio/templates/default.R (Windows) to try it out. It works with other file types, too; for example creating a file named default.cpp will set the content for new C++ files.
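For example, a default.R along these lines (the fields are just an illustration of mine, not an RStudio-provided template) would start every new R script with a metadata header:

# ------------------------------------------------------------------
# Title:
# Author:   Jane Analyst
# Created:
# Purpose:
# ------------------------------------------------------------------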

If you’re an RStudio Server administrator, you can set templates for all the users on your server, which can be helpful if your organization has standards around file headers and structure. Read more in Default Document Templates from the admin guide.

Autosave

RStudio automatically keeps its own backup copy of files you’re editing so that you don’t lose changes. We’ve improved this in two ways in the 1.3 release:

Section of Options dialog showing auto-save preferences

  1. When enabled, RStudio will automatically save open files as they are changed. This is useful if you don’t want to have to remember to manually save and just want your changes saved at all times.
  2. You can also disable the auto-backup, or change the interval at which it is performed. This is useful if you are storing your projects on a cloud-synchronized folder, which sometimes struggle to keep up with RStudio’s frequent writes to the backup copy.

Terminal Ergonomics

You can now set the initial working directory of new terminals, so it’s less likely you’ll have to begin each terminal session with the same old cd command.

Section of Options dialog showing terminal starting directory

We’ve also added a bunch of commands designed to reduce the number of times you need to manually paste cumbersome file and directory paths between the IDE and the terminal.

Menu showing new commands for working with directories in the Terminal

Specifically, we’ve added:

  • A command to open a new terminal at the location of the current editor file
  • A command to insert the full path and filename of the current editor file into the terminal
  • A command in the File pane to open a new terminal at the File pane’s current location
  • A command to change the terminal to the current RStudio working directory

Shiny Background Jobs

RStudio can now run Shiny applications as background jobs in the Jobs tab we added in RStudio 1.2.

Screenshot showing a Shiny application running as a background job in the Jobs tab

This has a couple of advantages:

  • You can continue to use the R console while your Shiny application runs!
  • Your Shiny application runs in a fresh R session. This makes it easier to keep your application’s code reproducible, since any implicit dependencies will keep the application from running successfully in the background.

Note, however, that you can’t use RStudio’s debugging interface with a Shiny application running in the background, since it is part of a separate R session.

Wrapup

If you’d like to try out any of these features, we welcome you to download the RStudio Preview and give them a spin!

We hope these little changes make a big difference in your day-to-day work, and we’d love to hear your feedback on the community forum.

Finally, we’re grateful to you, the R community, for the overwhelming number of ideas, support, and bug reports that have helped us build this release. We couldn’t have done it without you. Watch this space for an announcement of the stable release soon!



COVID-19: The Case of Germany


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

It is such a beautiful day outside, lots of sunshine, spring at last… and we are now basically all grounded and sitting here, waiting to get sick.

So, why not a post from the new epicentre of the global COVID-19 pandemic, Central Europe – more exactly, where I live: Germany?! Indeed, if you want to find out what the numbers tell us about how things might develop here, read on!

We will use the same model we already used in this post: Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?. You can find all the details there and in the comments.

library(deSolve)

# https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Germany#Statistics
Infected <- c(16, 18, 21, 26, 53, 66, 117, 150, 188, 240, 349, 534, 684, 847, 1112, 1460, 1884, 2369, 3062, 3795, 4838, 6012)
Day <- 1:(length(Infected))
N <- 83149300 # population of Germany acc. to Destatis

old <- par(mfrow = c(1, 2))
plot(Day, Infected, type = "b")
plot(Day, Infected, log = "y")
abline(lm(log10(Infected) ~ Day))
title("Total infections COVID-19 Germany", outer = TRUE, line = -2)

This clearly shows that we have an exponential development here, unfortunately as expected.

SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta/N * I * S
    dI <- beta/N * I * S - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}

init <- c(S = N - Infected[1], I = Infected[1], R = 0)

RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[ , 3]
  sum((Infected - fit)^2)
}

# optimize with some sensible conditions
Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1))
Opt$message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par
##      beta     gamma 
## 0.6428120 0.3571881

t <- 1:80 # time in days
fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))
col <- 1:3 # colour

matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day", ylab = "Number of subjects", lwd = 2, lty = 1, col = col)
matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day", ylab = "Number of subjects", lwd = 2, lty = 1, col = col, log = "y")
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 1 y value <= 0
## omitted from logarithmic plot

points(Day, Infected)
legend("bottomright", c("Susceptibles", "Infecteds", "Recovereds"), lty = 1, lwd = 2, col = col, inset = 0.05)
title("SIR model Covid-19 Germany", outer = TRUE, line = -2)

par(old)

R0 <- setNames(Opt_par["beta"] / Opt_par["gamma"], "R0")
R0
##       R0 
## 1.799646

fit[fit$I == max(fit$I), "I", drop = FALSE] # height of pandemic
##          I
## 54 9769398

max_infected <- max(fit$I)
max_infected / 5 # severe cases
## [1] 1953880

max_infected * 0.06 # cases with need for intensive care
## [1] 586163.9

# https://www.newscientist.com/article/mg24532733-700-why-is-it-so-hard-to-calculate-how-many-people-will-die-from-covid-19/
max_infected * 0.007 # deaths with supposed 0.7% fatality rate
## [1] 68385.78

So, according to this model, the height of the pandemic will be reached by the end of April, beginning of May. About 10 million people would be infected by then, which translates to about 2 million severe cases, about 600,000 cases in need of intensive care and up to 70,000 deaths.

Those are the numbers our model produces; nobody knows whether they are correct, while everybody hopes they are not. One thing has to be kept in mind, though: the numbers used in the model are from before the shutdown (for details see here: DER SPIEGEL: Germany Moves To Shut Down Most of Public Life). So hopefully those measures will prove effective and the actual numbers will turn out to be much, much lower.

I wish you all the best and stay healthy!



Rcpp 1.0.4: Lots of goodies


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

[Image: Rcpp logo]

The fourth maintenance release 1.0.4 of Rcpp, following up on the 10th anniversary and the 1.0.0 release sixteen months ago, arrived on CRAN this morning. This follows a few days of gestation at CRAN. To help during the wait we provided this release via drat last Friday. And it followed a pre-release via drat a week earlier. But now that the release is official, Windows and macOS binaries will be built by CRAN over the next few days. The corresponding Debian package will be uploaded as a source package shortly, after which binaries can be built.

As with the previous releases Rcpp 1.0.1, Rcpp 1.0.2 and Rcpp 1.0.3, we have the predictable and expected four-month gap between releases, which seems appropriate given both the changes still being made (see below) and the relative stability of Rcpp. It still takes work to release this as we run multiple extensive sets of reverse dependency checks, so maybe one day we will switch to a six-month cycle. For now, four months still seems like a good pace.

Rcpp has become the most popular way of enhancing R with C or C++ code. As of today, 1873 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with 191 in BioConductor. And per the (partial) logs of CRAN downloads, we are running steady at one million downloads per month.
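For readers new to the package, here is a minimal illustration (my example, not from the release notes) of what that enhancement looks like, using the cppFunction helper from Rcpp Attributes:

library(Rcpp)

# compile a small C++ function and make it callable from R
cppFunction("
double sumC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) total += x[i];
  return total;
}
")

sumC(c(1, 2, 3)) # 6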

This release features quite a number of different pull requests by seven different contributors as detailed below. One (personal) highlight is the switch to tinytest.

Changes in Rcpp version 1.0.4 (2020-03-13)

  • Changes in Rcpp API:

    • Safer Rcpp_list*, Rcpp_lang* and Function.operator() (Romain in #1014, #1015).

    • A number of #nocov markers were added (Dirk in #1036, #1042 and #1044).

    • Finalizer calls clear external pointer first (Kirill Müller and Dirk in #1038).

    • Scalar operations with a rhs matrix no longer change the matrix value (Qiang in #1040 fixing (again) #365).

    • Rcpp::exception and Rcpp::stop are now more thread-safe (Joshua Pritikin in #1043).

  • Changes in Rcpp Attributes:

    • The cppFunction helper now deals correctly with multiple depends arguments (TJ McKinley in #1016 fixing #1017).

    • Invisible return objects are now supported via new option (Kun Ren in #1025 fixing #1024).

    • Unavailable packages referred to in LinkingTo are now reported (Dirk in #1027 fixing #1026).

    • The sourceCpp function can now create a debug DLL on Windows (Dirk in #1037 fixing #1035).

  • Changes in Rcpp Documentation:

    • The .github/ directory now has more explicit guidance on contributing, issues, and pull requests (Dirk).

    • The Rcpp Attributes vignette describe the new invisible return object option (Kun Ren in #1025).

    • Vignettes are now included as pre-made pdf files (Dirk in #1029)

    • The Rcpp FAQ has a new entry on the recommended importFrom directive (Dirk in #1031 fixing #1030).

    • The bib file for the vignette was once again updated to current package versions (Dirk).

  • Changes in Rcpp Deployment:

    • Added unit test to check if C++ version remains aligned with the package number (Dirk in #1022 fixing #1021).

    • The unit test system was switched to tinytest (Dirk in #1028, #1032, #1033).

Please note that the change to exceptions and Rcpp::stop() in pr #1043 has been seen to have a minor side effect on macOS (issue #1046), which has already been fixed by Kevin in pr #1047, for which I may prepare a 1.0.4.1 release for the Rcpp drat repo in a day or two.

Thanks to CRANberries, you can also look at a diff to the previous release. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page. Bugs reports are welcome at the GitHub issue tracker as well (where one can also search among open or closed issues); questions are also welcome under rcpp tag at StackOverflow which also allows searching among the (currently) 2356 previous questions.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



paletteer: Hundreds of color palettes in R


[This article was first published on r – paulvanderlaken.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Looking for just the right colors for your data visualization?

I often cover tools to pick color palettes on my website (e.g. here, here, or here) and also host a comprehensive list of color packages in my R programming resources overview.

However, paletteer is by far my favorite package for customizing your colors in R!

The paletteer package offers direct access to 1759 color palettes, from 50 different packages!

After installing and loading the package, paletteer works as easy as just adding one additional line of code to your ggplot:

install.packages("paletteer") library(paletteer)  install.packages("ggplot2") library(ggplot2)  ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +   geom_point() +   scale_color_paletteer_d("nord::aurora")

paletteer offers a combined collection of hundreds of other color palettes offered in the R programming environment, so you are sure you will find a palette that you like! Here’s the list copied below, but this github repo provides more detailed information about the package contents.

Name               | Github                                    | CRAN
awtools            | awhstin/awtools – 0.2.1                   |
basetheme          | KKPMW/basetheme – 0.1.2                   | 0.1.2
calecopal          | an-bui/calecopal – 0.1.0                  |
cartography        | riatelab/cartography – 2.2.1.1            | 2.2.1
colorblindr        | clauswilke/colorblindr – 0.1.0            |
colRoz             | jacintak/colRoz – 0.2.2                   |
dichromat          |                                           | 2.0-0
DresdenColor       | katiesaund/DresdenColor – 0.0.0.9000      |
dutchmasters       | EdwinTh/dutchmasters – 0.1.0              |
fishualize         | nschiett/fishualize – 0.2.999             | 0.1.0
gameofthrones      | aljrico/gameofthrones – 1.0.1             | 1.0.0
ggpomological      | gadenbuie/ggpomological – 0.1.2           |
ggsci              | road2stat/ggsci – 2.9                     | 2.9
ggthemes           | jrnold/ggthemes – 4.2.0                   | 4.2.0
ggthemr            | cttobin/ggthemr – 1.1.0                   |
ghibli             | ewenme/ghibli – 0.3.0.9000                | 0.3.0
grDevices          |                                           | 2.0-14
harrypotter        | aljrico/harrypotter – 2.1.0               | 2.1.0
IslamicArt         | lambdamoses/IslamicArt – 0.1.0            |
jcolors            | jaredhuling/jcolors – 0.0.4               | 0.0.4
LaCroixColoR       | johannesbjork/LaCroixColoR – 0.1.0        |
lisa               | tyluRp/lisa – 0.1.1.9000                  | 0.1.1
MapPalettes        | disarm-platform/MapPalettes – 0.0.2       |
miscpalettes       | EmilHvitfeldt/miscpalettes – 0.0.0.9000   |
nationalparkcolors | katiejolly/nationalparkcolors – 0.1.0     |
NineteenEightyR    | m-clark/NineteenEightyR – 0.1.0           |
nord               | jkaupp/nord – 1.0.0                       | 1.0.0
ochRe              | ropenscilabs/ochRe – 1.0.0                |
oompaBase          |                                           | 3.2.9
palettesForR       | frareb/palettesForR – 0.1.2               | 0.1.2
palettetown        | timcdlucas/palettetown – 0.1.1.90000      | 0.1.1
palr               | AustralianAntarcticDivision/palr – 0.1.0  | 0.1.0
pals               | kwstat/pals – 1.6                         | 1.6
PNWColors          | jakelawlor/PNWColors – 0.1.0              |
Polychrome         |                                           | 1.2.3
rcartocolor        | Nowosad/rcartocolor – 2.0.0               | 2.0.0
RColorBrewer       |                                           | 1.1-2
Redmonder          |                                           | 0.2.0
RSkittleBrewer     | alyssafrazee/RSkittleBrewer – 1.1         |
scico              | thomasp85/scico – 1.1.0                   | 1.1.0
tidyquant          | business-science/tidyquant – 0.5.8        | 0.5.8
trekcolors         | leonawicz/trekcolors – 0.1.2              | 0.1.1
tvthemes           | Ryo-N7/tvthemes – 1.1.0                   | 1.1.0
unikn              | hneth/unikn – 0.2.0.9003                  | 0.2.0
vapeplot           | seasmith/vapeplot – 0.1.0                 |
vapoRwave          | moldach/vapoRwave – 0.0.0.9000            |
viridis            | sjmgarnier/viridis – 0.5.1                | 0.5.1
visibly            | m-clark/visibly – 0.2.6                   |
werpals            | sciencificity/werpals – 0.1.0             |
wesanderson        | karthik/wesanderson – 0.3.6.9000          | 0.3.6
yarrr              | ndphillips/yarrr – 0.1.6                  | 0.1.5

Via the paletteer github page
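You can also pull palette colours directly, without going through a ggplot scale. A small sketch using two of the packages listed above (see the paletteer documentation for the full set of accessors):

library(paletteer)

paletteer_d("wesanderson::Zissou1")     # a discrete palette, returned as hex codes
paletteer_c("viridis::inferno", n = 10) # 10 colours from a continuous palette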

Let me know what you like about the package and do share any beautiful data visualizations you create with it!




R spatial follows GDAL and PROJ development


[This article was first published on r-spatial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

[view raw Rmd]

GDAL and PROJ

GDAL and PROJ (formerly proj.4) are two libraries that form the basis, if not foundations, for most open source geospatial software, including most R packages (sf, sp, rgdal, and all their dependencies). The dependency for package sf is for instance pictured here:

[Figure: dependency graph for the sf package]

Briefly:

  • PROJ provides methods for coordinate representation, conversion (projection) and transformation, and
  • GDAL allows reading and writing of spatial raster and vector data in a standardised form, and provides a high-level interface to PROJ for these data structures, including the representation of coordinate reference systems (CRS)

[Image: gdalbarn]

Motivated by the need for higher-precision handling of coordinate transformations and the wish to support a better description of coordinate reference systems (WKT2), a successful fundraising campaign helped the implementation of a large number of changes in GDAL and PROJ, most notably:

  • PROJ changes from (mostly) a projection library into a full geodetic library, taking care of different representations of the shape of the Earth (datums)
  • PROJ now has the ability to choose between different transformation paths (pipelines), and can report the precision obtained by each
  • rather than distributing datum transformation grids to local users, PROJ (7.0.0 and higher) offers access to an on-line distribution network (CDN) of free transformation grids, thereby allowing for local caching of portions of grids
  • PROJ respects authorities (such as EPSG) for determining whether coordinate pairs refer to longitude-latitude (such as 3857), or latitude-longitude (such as 4326)
  • GDAL offers the ability to handle coordinate pairs authority-compliant (lat-long for 4326), or “traditional” GIS-compliant (long-lat for 4326)
  • use of so-called PROJ4-strings (like +proj=longlat +datum=WGS84) are discouraged, they no longer offer sufficient description of coordinate reference systems; use of +init=epsg:XXXX leads to warnings
  • PROJ offers access to a large number of vertical reference systems and reference systems of authorities different from EPSG

crs objects in sf

Pre-0.9 versions of sf used crs (coordinate reference system) objects represented as lists with two components, epsg (possibly set as NA) and proj4string:

library(sf)
# Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1

st_crs(4326)
# Coordinate Reference System:
#   EPSG: 4326
#   proj4string: "+proj=longlat +datum=WGS84 +no_defs"

Now, with sf >= 0.9, crs objects are lists with two components, input and wkt:

library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1

(x = st_crs(4326))
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

where a $ method allows for retrieving the epsg and proj4string values:

x$epsg
## [1] 4326

x$proj4string
## [1] "+proj=longlat +datum=WGS84 +no_defs"

but this means that packages that hard-code for instance

x[["proj4string"]]## NULL

now fail to get the result wanted; NULL is not a value that would have occurred in legacy code.

Regrettably, assignment to a crs object component still works, as the objects are lists, so not all downstream legacy code will fail:

x$proj4string <- "+proj=longlat +ellps=intl"
x$proj4string
## Warning in `$.crs`(x, proj4string): old-style crs object found: please update
## code
## [1] "+proj=longlat +ellps=intl +no_defs"

Package maintainers and authors of production scripts will need to review their use of crs objects.
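A minimal sketch of the updated idiom (assuming sf >= 0.9): go through the $ accessor methods and the wkt field rather than the old list components:

crs <- st_crs(4326)
crs$proj4string        # still available via the $ accessor method
crs$wkt                # the full WKT2 representation
st_crs(crs$wkt) == crs # the WKT round-trips to an equivalent crs object
## [1] TRUE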

Many external data sources provide a WKT CRS directly and as such do not have an “input” field. In such cases, the input field is filled with the CRS name, which is a user-readable representation

st = stars::read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
st_crs(st)$input
## [1] "UTM Zone 25, Southern Hemisphere"

but this representation can not be used as input to a CRS:

st_crs(st_crs(st)$input)
## Error in st_crs.character(st_crs(st)$input): invalid crs: UTM Zone 25, Southern Hemisphere

However, wkt fields obviously can be used as input:

st_crs(st_crs(st)$wkt) == st_crs(st)
## [1] TRUE

CRS objects in sp

When equipped with a new (>= 1.5.6) rgdal version, sp’s CRS objects carry a comment field with the WKT representation of the CRS:

# install.packages("rgdal", repos="http://R-Forge.R-project.org")library(sp)(x = CRS("+init=epsg:4326")) # or better: CRS(SRS_string='EPSG:4326')## CRS arguments: +proj=longlat +datum=WGS84 +no_defscat(comment(x), "\n")## GEOGCRS["WGS 84",##     DATUM["World Geodetic System 1984",##         ELLIPSOID["WGS 84",6378137,298.257223563,##             LENGTHUNIT["metre",1]],##         ID["EPSG",6326]],##     PRIMEM["Greenwich",0,##         ANGLEUNIT["degree",0.0174532925199433],##         ID["EPSG",8901]],##     CS[ellipsoidal,2],##         AXIS["longitude",east,##             ORDER[1],##             ANGLEUNIT["degree",0.0174532925199433,##                 ID["EPSG",9122]]],##         AXIS["latitude",north,##             ORDER[2],##             ANGLEUNIT["degree",0.0174532925199433,##                 ID["EPSG",9122]]],##     USAGE[##         SCOPE["unknown"],##         AREA["World"],##         BBOX[-90,-180,90,180]]]

and it is this WKT representation that is used to communicate with GDAL and PROJ when using packages rgdal or sf. At present, rgdal generates many warnings about discarded PROJ string keys, intended to alert package maintainers and script authors to the need to review code. It is particularly egregious to assign to the CRS object projargs slot directly, and this is unfortunately seen in much code in packages.

Coercion from CRS objects to crs and back

Because workflows often need to combine packages using sp and sf representations, coercion methods from CRS to crs have been updated to use the WKT information; from sp to sf one can use

(x2 <- st_crs(x))
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ID["EPSG",6326]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433],
##         ID["EPSG",8901]],
##     CS[ellipsoidal,2],
##         AXIS["longitude",east,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]],
##         AXIS["latitude",north,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]]]

The sp CRS constructor has been provided with an additional argument SRS_string=, which accepts WKT, among other representations

(x3 <- CRS(SRS_string = x2$wkt))
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs

but also

(x4 <- as(x2, "CRS"))
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs

uses the WKT information when present.

all.equal(x, x3)
## [1] TRUE

all.equal(x, x4)
## [1] TRUE

Axis order

R-spatial packages have, for the past 25 years, pretty much assumed that two-dimensional data are XY-ordered, or longitude-latitude. Geodesists, on the other hand, typically use \((\phi,\lambda)\), or latitude-longitude, as coordinate pairs; the PROJ logo is now PR\(\phi\)J. If we use geocentric coordinates, there is no logical ordering. Axis direction may also vary; the y-axis index of images typically increases when going south. As pointed out in sf/#1033, there are powers out there that will bring us spatial data with (latitude,longitude) as (X,Y) coordinates. Even stronger, officially, EPSG:4326 has axis order latitude, longitude (see WKT description above).

Package sf by default uses a switch in GDAL that brings everything in the old, longitude-latitude order, but data may come in in another ordering.

This can now be controlled (to some extent), as st_axis_order can be used to query, and set whether axis ordering is “GIS style” (longitude,latitude; non-authority compliant) or “authority compliant” (often: latitude,longitude):

pt = st_sfc(st_point(c(0, 60)), crs = 4326)

st_axis_order() # query default: FALSE means interpret pt as (longitude latitude)
## [1] FALSE

st_transform(pt, 3857)[[1]]
## POINT (0 8399738)

(old_value = st_axis_order(TRUE))
## [1] FALSE

# now interpret pt as (latitude longitude), as EPSG:4326 prescribes:
st_axis_order() # query current value
## [1] TRUE

st_transform(pt, 3857)[[1]]
## POINT (6679169 0)

st_axis_order(old_value) # set back to old value

sf::plot is sensitive to this and will swap axes if needed, but for instance ggplot2::geom_sf is not yet aware of this.

Workflows using sp/rgdal should expect “GIS style” axis order to be preserved:

rgdal::get_enforce_xy()
## [1] TRUE

pt_sp <- as(pt, "Spatial")
coordinates(pt_sp)
##      coords.x1 coords.x2
## [1,]         0        60

coordinates(spTransform(pt_sp, CRS(SRS_string = "EPSG:3857")))
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO"): Discarded
## ellps WGS 84 in CRS definition: +proj=merc +a=6378137 +b=6378137 +lat_ts=0
## +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO"): Discarded
## datum WGS_1984 in CRS definition
##      coords.x1 coords.x2
## [1,]         0   8399738

Further reading



LASSO regression using tidymodels and #TidyTuesday data for The Office


[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a LASSO regression model and choose regularization parameters!

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore the data

Our modeling goal here is to predict the IMDB ratings for episodes of The Office based on the other characteristics of the episodes in the #TidyTuesday dataset. There are two datasets, one with the ratings and one with information like director, writer, and which character spoke which line. The episode numbers and titles are not consistent between them, so we can use regular expressions to do a better job of matching the datasets up for joining.

library(tidyverse)

ratings_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-17/office_ratings.csv")

remove_regex <- "[:punct:]|[:digit:]|parts |part |the |and"

office_ratings <- ratings_raw %>%
  transmute(
    episode_name = str_to_lower(title),
    episode_name = str_remove_all(episode_name, remove_regex),
    episode_name = str_trim(episode_name),
    imdb_rating
  )

office_info <- schrute::theoffice %>%
  mutate(
    season = as.numeric(season),
    episode = as.numeric(episode),
    episode_name = str_to_lower(episode_name),
    episode_name = str_remove_all(episode_name, remove_regex),
    episode_name = str_trim(episode_name)
  ) %>%
  select(season, episode, episode_name, director, writer, character)

office_info
## # A tibble: 55,130 x 6
##    season episode episode_name director   writer                       character
##  1      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Michael  
##  2      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Jim      
##  3      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Michael  
##  4      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Jim      
##  5      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Michael  
##  6      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Michael  
##  7      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Michael  
##  8      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Pam      
##  9      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Michael  
## 10      1       1 pilot        Ken Kwapis Ricky Gervais;Stephen Merch… Pam      
## # … with 55,120 more rows

We are going to use several different kinds of features for modeling. First, let’s find out how many times characters speak per episode.

characters <- office_info %>%  count(episode_name, character) %>%  add_count(character, wt = n, name = "character_count") %>%  filter(character_count > 800) %>%  select(-character_count) %>%  pivot_wider(    names_from = character,    values_from = n,    values_fill = list(n = 0)  )characters
## # A tibble: 185 x 16
##    episode_name  Andy Angela Darryl Dwight   Jim Kelly Kevin Michael Oscar   Pam
##    <chr>        <int>  <int>  <int>  <int> <int> <int> <int>   <int> <int> <int>
##  1 a benihana …    28     37      3     61    44     5    14     108     1    57
##  2 aarm            44     39     30     87    89     0    30       0    28    34
##  3 after hours     20     11     14     60    55     8     4       0    10    15
##  4 alliance         0      7      0     47    49     0     3      68    14    22
##  5 angry y         53      7      5     16    19    13     9       0     7    29
##  6 baby shower     13     13      9     35    27     2     4      79     3    25
##  7 back from v…     3      4      6     22    25     0     5      70     0    33
##  8 banker           1      2      0     17     0     0     2      44     0     5
##  9 basketball       0      3     15     25    21     0     1     104     2    14
## 10 beach games     18      8      0     38    22     9     5     105     5    23
## # … with 175 more rows, and 5 more variables: Phyllis <int>, Ryan <int>,
## #   Toby <int>, Erin <int>, Jan <int>

Next, let’s find which directors and writers are involved in each episode. I’m choosing here to combine this into one category in modeling, for a simpler model, since these are often the same individuals.

creators <- office_info %>%
  distinct(episode_name, director, writer) %>%
  pivot_longer(director:writer, names_to = "role", values_to = "person") %>%
  separate_rows(person, sep = ";") %>%
  add_count(person) %>%
  filter(n > 10) %>%
  distinct(episode_name, person) %>%
  mutate(person_value = 1) %>%
  pivot_wider(
    names_from = person,
    values_from = person_value,
    values_fill = list(person_value = 0)
  )

creators
## # A tibble: 135 x 14
##    episode_name `Ken Kwapis` `Greg Daniels` `B.J. Novak` `Paul Lieberste…
##    <chr>               <dbl>          <dbl>        <dbl>            <dbl>
##  1 pilot                   1              1            0                0
##  2 diversity d…            1              0            1                0
##  3 health care             0              0            0                1
##  4 basketball              0              1            0                0
##  5 hot girl                0              0            0                0
##  6 dundies                 0              1            0                0
##  7 sexual hara…            1              0            1                0
##  8 office olym…            0              0            0                0
##  9 fire                    1              0            1                0
## 10 halloween               0              1            0                0
## # … with 125 more rows, and 9 more variables: `Mindy Kaling` <dbl>,
## #   `Paul Feig` <dbl>, `Gene Stupnitsky` <dbl>, `Lee Eisenberg` <dbl>,
## #   `Jennifer Celotta` <dbl>, `Randall Einhorn` <dbl>, `Brent Forrester` <dbl>,
## #   `Jeffrey Blitz` <dbl>, `Justin Spitzer` <dbl>

Next, let’s find the season and episode number for each episode, and then finally let’s put it all together into one dataset for modeling.

office <- office_info %>%
  distinct(season, episode, episode_name) %>%
  inner_join(characters) %>%
  inner_join(creators) %>%
  inner_join(office_ratings %>%
    select(episode_name, imdb_rating)) %>%
  janitor::clean_names()

office
## # A tibble: 136 x 32
##    season episode episode_name  andy angela darryl dwight   jim kelly kevin
##     <dbl>   <dbl> <chr>        <int>  <int>  <int>  <int> <int> <int> <int>
##  1      1       1 pilot            0      1      0     29    36     0     1
##  2      1       2 diversity d…     0      4      0     17    25     2     8
##  3      1       3 health care      0      5      0     62    42     0     6
##  4      1       5 basketball       0      3     15     25    21     0     1
##  5      1       6 hot girl         0      3      0     28    55     0     5
##  6      2       1 dundies          0      1      1     32    32     7     1
##  7      2       2 sexual hara…     0      2      9     11    16     0     6
##  8      2       3 office olym…     0      6      0     55    55     0     9
##  9      2       4 fire             0     17      0     65    51     4     5
## 10      2       5 halloween        0     13      0     33    30     3     2
## # … with 126 more rows, and 22 more variables: michael <int>, oscar <int>,
## #   pam <int>, phyllis <int>, ryan <int>, toby <int>, erin <int>, jan <int>,
## #   ken_kwapis <dbl>, greg_daniels <dbl>, b_j_novak <dbl>,
## #   paul_lieberstein <dbl>, mindy_kaling <dbl>, paul_feig <dbl>,
## #   gene_stupnitsky <dbl>, lee_eisenberg <dbl>, jennifer_celotta <dbl>,
## #   randall_einhorn <dbl>, brent_forrester <dbl>, jeffrey_blitz <dbl>,
## #   justin_spitzer <dbl>, imdb_rating <dbl>

There are lots of great examples of EDA on Twitter; I especially encourage you to check out the screencast of my coauthor Dave, which is similar in spirit to the modeling I am showing here and includes more EDA. Just for kicks, let’s show one graph.

office %>%
  ggplot(aes(episode, imdb_rating, fill = as.factor(episode))) +
  geom_boxplot(show.legend = FALSE)

Ratings are higher for episodes later in the season. What else is associated with higher ratings? Let’s use LASSO regression to find out! 🚀

Train a model

We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.

library(tidymodels)

office_split <- initial_split(office, strata = season)
office_train <- training(office_split)
office_test <- testing(office_split)

Then, we build a recipe for data preprocessing.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
  • Next, we update the role for episode_name, since this is a variable we might like to keep around for convenience as an identifier for rows but is not a predictor or outcome.
  • Next, we remove any numeric variables that have zero variance.
  • As a last step, we normalize (center and scale) the numeric variables. We need to do this because the LASSO penalty is sensitive to the scale of the predictors.

The object office_rec is a recipe that has not been trained on data yet (for example, the centering and scaling have not been calculated), and office_prep is an object that has been trained on data. The reason I use strings_as_factors = FALSE here is that my ID column episode_name is of type character (as opposed to, say, integers).

office_rec <- recipe(imdb_rating ~ ., data = office_train) %>%
  update_role(episode_name, new_role = "ID") %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

office_prep <- office_rec %>%
  prep(strings_as_factors = FALSE)

Now it’s time to specify and then fit our models. Here I set up one model specification for LASSO regression; I picked a value for penalty (sort of randomly) and I set mixture = 1 for LASSO. I am using a workflow() in this example for convenience; these are objects that can help you manage modeling pipelines more easily, with pieces that fit together like Lego blocks. You can fit() a workflow, much like you can fit a model, and then you can pull out the fit object and tidy() it!

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(office_rec)

lasso_fit <- wf %>%
  add_model(lasso_spec) %>%
  fit(data = office_train)

lasso_fit %>%
  pull_workflow_fit() %>%
  tidy()
## # A tibble: 1,639 x 5
##    term         step estimate lambda dev.ratio
##    <chr>       <dbl>    <dbl>  <dbl>     <dbl>
##  1 (Intercept)     1  8.39     0.183    0
##  2 (Intercept)     2  8.39     0.167    0.0332
##  3 jim             2  0.0113   0.167    0.0332
##  4 michael         2  0.0150   0.167    0.0332
##  5 (Intercept)     3  8.39     0.152    0.0640
##  6 jim             3  0.0247   0.152    0.0640
##  7 michael         3  0.0283   0.152    0.0640
##  8 (Intercept)     4  8.39     0.139    0.0986
##  9 dwight          4  0.00236  0.139    0.0986
## 10 jim             4  0.0361   0.139    0.0986
## # … with 1,629 more rows

If you have used glmnet before, this is the familiar output where we can see (here, for the most regularized examples) what contributes to higher IMDB ratings.
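To read off the coefficients near the penalty we actually specified, one option is to filter the tidied output to the path step whose lambda is closest to 0.1. This is only a quick sketch built on the objects above; glmnet fits the whole path, so we pick the nearest step rather than an exact match.

lasso_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  # keep the path step whose lambda is closest to our penalty of 0.1
  filter(abs(lambda - 0.1) == min(abs(lambda - 0.1))) %>%
  # drop the intercept and any coefficients the LASSO shrank to zero
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))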

Tune LASSO parameters

So we fit one LASSO model, but how do we know the right regularization parameter penalty? We can figure that out using resampling and tuning the model. Let’s build a set of bootstrap resamples, and set penalty = tune() instead of to a single value. We can use the function penalty() to set up an appropriate grid for this kind of regularization model.

set.seed(1234)
office_boot <- bootstraps(office_train, strata = season)

tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50)

Now it’s time to tune the grid, using our workflow object.

doParallel::registerDoParallel()

set.seed(2020)
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)

What results did we get?

lasso_grid %>%
  collect_metrics()
## # A tibble: 100 x 6
##     penalty .metric .estimator   mean     n std_err
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl>
##  1 1.00e-10 rmse    standard   0.638     25  0.0177
##  2 1.00e-10 rsq     standard   0.0911    25  0.0132
##  3 1.60e-10 rmse    standard   0.638     25  0.0177
##  4 1.60e-10 rsq     standard   0.0911    25  0.0132
##  5 2.56e-10 rmse    standard   0.638     25  0.0177
##  6 2.56e-10 rsq     standard   0.0911    25  0.0132
##  7 4.09e-10 rmse    standard   0.638     25  0.0177
##  8 4.09e-10 rsq     standard   0.0911    25  0.0132
##  9 6.55e-10 rmse    standard   0.638     25  0.0177
## 10 6.55e-10 rsq     standard   0.0911    25  0.0132
## # … with 90 more rows

That’s nice, but I would rather see a visualization of performance with the regularization parameter.

lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")

This is a great way to see that regularization helps this modeling a lot. We have a couple of options for choosing our final parameter, such as select_by_pct_loss() or select_by_one_std_err(), but for now let’s stick with just picking the lowest RMSE. After we have that parameter, we can finalize our workflow, i.e. update it with this value.

lowest_rmse <- lasso_grid %>%
  select_best("rmse", maximize = FALSE)

final_lasso <- finalize_workflow(
  wf %>% add_model(tune_spec),
  lowest_rmse
)
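For reference, if we instead wanted the most regularized (simplest) model whose RMSE is within one standard error of the best, a sketch using select_by_one_std_err() might look like the following; the desc(penalty) argument expresses that larger penalties are preferable when performance is comparable.

lasso_grid %>%
  # simplest candidate within one standard error of the numerically best RMSE
  select_by_one_std_err(desc(penalty), metric = "rmse")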

We can then fit this finalized workflow on our training data. While we’re at it, let’s see what the most important variables are using the vip package.

library(vip)

final_lasso %>%
  fit(office_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = lowest_rmse$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL)

And then, finally, let’s return to our test data. The tune package has a function last_fit() which is nice for situations when you have tuned and finalized a model or workflow and want to fit it one last time on your training data and evaluate it on your testing data. You only have to pass this function your finalized model/workflow and your split.

last_fit(
  final_lasso,
  office_split
) %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.381
## 2 rsq     standard       0.234

To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


The ultimate package for correlations (by easystats)


[This article was first published on R on easystats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The correlation package

The easystats project continues to grow with its most recent addition, a package devoted to correlations. Check out its webpage here!

It’s lightweight, easy to use, and allows for the computation of many different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweight, percentage bend or Shepherd’s Pi correlations (types of robust correlation), distance correlation (a type of non-linear correlation) and more, also allowing for combinations between them (for instance, Bayesian partial multilevel correlation).

You can install and load the package as follows:

install.packages("correlation")
library(correlation)

Examples

The main function is correlation(), which builds on top of cor_test() and comes with a number of possible options.
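For a single pair of variables, cor_test() can also be called directly; a minimal sketch, assuming the default Pearson method:

cor_test(iris, "Sepal.Length", "Sepal.Width")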

Correlation details and matrix

cor <- correlation(iris)
cor
## Parameter1   |   Parameter2 |     r |     t |  df |      p |         95% CI |  Method | n_Obs
## ---------------------------------------------------------------------------------------------
## Sepal.Length |  Sepal.Width | -0.12 | -1.44 | 148 | 0.152  | [-0.27,  0.04] | Pearson |   150
## Sepal.Length | Petal.Length |  0.87 | 21.65 | 148 | < .001 | [ 0.83,  0.91] | Pearson |   150
## Sepal.Length |  Petal.Width |  0.82 | 17.30 | 148 | < .001 | [ 0.76,  0.86] | Pearson |   150
## Sepal.Width  | Petal.Length | -0.43 | -5.77 | 148 | < .001 | [-0.55, -0.29] | Pearson |   150
## Sepal.Width  |  Petal.Width | -0.37 | -4.79 | 148 | < .001 | [-0.50, -0.22] | Pearson |   150
## Petal.Length |  Petal.Width |  0.96 | 43.39 | 148 | < .001 | [ 0.95,  0.97] | Pearson |   150

The output is not a square matrix, but a (tidy) dataframe with all correlations tests per row. One can also obtain a matrix using:

summary(cor)
## Parameter    | Petal.Width | Petal.Length | Sepal.Width
## -------------------------------------------------------
## Sepal.Length |     0.82*** |      0.87*** |       -0.12
## Sepal.Width  |    -0.37*** |     -0.43*** |
## Petal.Length |     0.96*** |              |

Note that one can also obtain the full, square and redundant matrix using:

as.table(cor)
## Parameter    | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
## ----------------------------------------------------------------------
## Sepal.Length |      1.00*** |       -0.12 |      0.87*** |     0.82***
## Sepal.Width  |        -0.12 |     1.00*** |     -0.43*** |    -0.37***
## Petal.Length |      0.87*** |    -0.43*** |      1.00*** |     0.96***
## Petal.Width  |      0.82*** |    -0.37*** |      0.96*** |     1.00***

Grouped dataframes

The function also supports stratified correlations, all within the tidyverse workflow!

library(dplyr)

iris %>%
  select(Species, Petal.Width, Sepal.Length, Sepal.Width) %>%
  group_by(Species) %>%
  correlation()
## Group      |   Parameter1 |   Parameter2 |    r |    t | df |      p |        95% CI |  Method | n_Obs
## ------------------------------------------------------------------------------------------------------
## setosa     |  Petal.Width | Sepal.Length | 0.28 | 2.01 | 48 | 0.101  | [ 0.00, 0.52] | Pearson |    50
## setosa     |  Petal.Width |  Sepal.Width | 0.23 | 1.66 | 48 | 0.104  | [-0.05, 0.48] | Pearson |    50
## setosa     | Sepal.Length |  Sepal.Width | 0.74 | 7.68 | 48 | < .001 | [ 0.59, 0.85] | Pearson |    50
## versicolor |  Petal.Width | Sepal.Length | 0.55 | 4.52 | 48 | < .001 | [ 0.32, 0.72] | Pearson |    50
## versicolor |  Petal.Width |  Sepal.Width | 0.66 | 6.15 | 48 | < .001 | [ 0.47, 0.80] | Pearson |    50
## versicolor | Sepal.Length |  Sepal.Width | 0.53 | 4.28 | 48 | < .001 | [ 0.29, 0.70] | Pearson |    50
## virginica  |  Petal.Width | Sepal.Length | 0.28 | 2.03 | 48 | 0.048  | [ 0.00, 0.52] | Pearson |    50
## virginica  |  Petal.Width |  Sepal.Width | 0.54 | 4.42 | 48 | < .001 | [ 0.31, 0.71] | Pearson |    50
## virginica  | Sepal.Length |  Sepal.Width | 0.46 | 3.56 | 48 | 0.002  | [ 0.20, 0.65] | Pearson |    50

Bayesian Correlations

It is very easy to switch to a Bayesian framework.

correlation(iris, bayesian = TRUE)
## Parameter1   |   Parameter2 |   rho |         89% CI |     pd | % in ROPE |    BF |              Prior | n_Obs
## --------------------------------------------------------------------------------------------------------------
## Sepal.Length |  Sepal.Width | -0.12 | [-0.24,  0.01] | 91.60% |    44.45% |  0.51 | Cauchy (0 +- 0.33) |   150
## Sepal.Length | Petal.Length |  0.86 | [ 0.82,  0.89] |   100% |        0% | > 999 | Cauchy (0 +- 0.33) |   150
## Sepal.Length |  Petal.Width |  0.81 | [ 0.76,  0.85] |   100% |        0% | > 999 | Cauchy (0 +- 0.33) |   150
## Sepal.Width  | Petal.Length | -0.41 | [-0.52, -0.31] |   100% |        0% | > 999 | Cauchy (0 +- 0.33) |   150
## Sepal.Width  |  Petal.Width | -0.35 | [-0.47, -0.25] |   100% |     0.22% | > 999 | Cauchy (0 +- 0.33) |   150
## Petal.Length |  Petal.Width |  0.96 | [ 0.95,  0.97] |   100% |        0% | > 999 | Cauchy (0 +- 0.33) |   150
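As the introduction suggests, these options are designed to be combined. A sketch of a Bayesian partial correlation, simply passing both flags together (the multilevel variant mentioned earlier would additionally require a grouping factor in the data):

# Bayesian partial correlations, combining the two options shown above
correlation(iris, partial = TRUE, bayesian = TRUE)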

Tetrachoric, Polychoric, Biserial, Biweight…

The correlation package also supports different types of methods, which can deal with correlations between factors!

correlation(iris, include_factors = TRUE, method = "auto")
## Parameter1         |         Parameter2 |     r |      t |  df |      p |         95% CI |      Method | n_Obs
## --------------------------------------------------------------------------------------------------------------
## Sepal.Length       |        Sepal.Width | -0.12 |  -1.44 | 148 | 0.304  | [-0.27,  0.04] |     Pearson |   150
## Sepal.Length       |       Petal.Length |  0.87 |  21.65 | 148 | < .001 | [ 0.83,  0.91] |     Pearson |   150
## Sepal.Length       |        Petal.Width |  0.82 |  17.30 | 148 | < .001 | [ 0.76,  0.86] |     Pearson |   150
## Sepal.Length       |     Species.setosa | -0.93 | -29.97 | 148 | < .001 | [-0.95, -0.90] |    Biserial |   150
## Sepal.Length       | Species.versicolor |  0.10 |   1.25 | 148 | 0.304  | [-0.06,  0.26] |    Biserial |   150
## Sepal.Length       |  Species.virginica |  0.82 |  17.66 | 148 | < .001 | [ 0.77,  0.87] |    Biserial |   150
## Sepal.Width        |       Petal.Length | -0.43 |  -5.77 | 148 | < .001 | [-0.55, -0.29] |     Pearson |   150
## Sepal.Width        |        Petal.Width | -0.37 |  -4.79 | 148 | < .001 | [-0.50, -0.22] |     Pearson |   150
## Sepal.Width        |     Species.setosa |  0.78 |  15.09 | 148 | < .001 | [ 0.71,  0.84] |    Biserial |   150
## Sepal.Width        | Species.versicolor | -0.60 |  -9.20 | 148 | < .001 | [-0.70, -0.49] |    Biserial |   150
## Sepal.Width        |  Species.virginica | -0.18 |  -2.16 | 148 | 0.130  | [-0.33, -0.02] |    Biserial |   150
## Petal.Length       |        Petal.Width |  0.96 |  43.39 | 148 | < .001 | [ 0.95,  0.97] |     Pearson |   150
## Petal.Length       |     Species.setosa | -1.00 |   -Inf | 148 | < .001 | [-1.00, -1.00] |    Biserial |   150
## Petal.Length       | Species.versicolor |  0.26 |   3.27 | 148 | 0.007  | [ 0.10,  0.40] |    Biserial |   150
## Petal.Length       |  Species.virginica |  0.93 |  31.09 | 148 | < .001 | [ 0.91,  0.95] |    Biserial |   150
## Petal.Width        |     Species.setosa | -1.00 |   -Inf | 148 | < .001 | [-1.00, -1.00] |    Biserial |   150
## Petal.Width        | Species.versicolor |  0.15 |   1.87 | 148 | 0.191  | [-0.01,  0.31] |    Biserial |   150
## Petal.Width        |  Species.virginica |  0.99 | 112.56 | 148 | < .001 | [ 0.99,  1.00] |    Biserial |   150
## Species.setosa     | Species.versicolor | -0.88 | -22.35 | 148 | < .001 | [-0.91, -0.84] | Tetrachoric |   150
## Species.setosa     |  Species.virginica | -0.88 | -22.35 | 148 | < .001 | [-0.91, -0.84] | Tetrachoric |   150
## Species.versicolor |  Species.virginica | -0.88 | -22.35 | 148 | < .001 | [-0.91, -0.84] | Tetrachoric |   150

Partial Correlations

It also supports partial correlations:

iris %>%
  correlation(partial = TRUE) %>%
  summary()
## Parameter    | Petal.Width | Petal.Length | Sepal.Width
## -------------------------------------------------------
## Sepal.Length |    -0.34*** |      0.72*** |     0.63***
## Sepal.Width  |     0.35*** |     -0.62*** |
## Petal.Length |     0.87*** |              |

Gaussian Graphical Models (GGMs)

Such partial correlations can also be represented as Gaussian graphical models, an increasingly popular tool in psychology:

library(see)    # for plotting
library(ggraph) # needs to be loaded

mtcars %>%
  correlation(partial = TRUE) %>%
  plot()

Get Involved

easystats is a new project in active development, looking for contributors and supporters. Thus, do not hesitate to contact us if you want to get involved 🙂

  • Check out our other blog posts here!

Stay tuned

To stay updated about upcoming features and cool R or data science content, you can follow the packages on GitHub (click on one of the easystats packages, then on the Watch button in the top right corner), as well as the easystats team on Twitter and online.


To leave a comment for the author, please follow the link and comment on their blog: R on easystats.


Survey Results: What Degree is Best for Data Science?


[This article was first published on novyden, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The Survey

The survey What Degree is Best for Data Science? ran from February 9 through March 12, 2020, asking participants 4 questions:

    • Answers about self:
      • Q1: What is the highest level of school degree you have completed?
      • Q2: Which of the following best describes the field in which you received your highest degree?
    •  Answers about best education:
      • Q3: What level of school degree you consider optimal for successful career in data science?
      • Q4: Which field of study you consider optimal for successful career in data science?

During that period 289 respondents participated and 285 successfully completed all 4 questions, so 4 participants with partial answers were removed from the analysis below.

Though simple and short (the average time to complete the survey was 55 seconds, after removing 6 outliers who took over 500 seconds), the survey possesses a certain internal structure, overlapping in time and subject. Time groups the questions into 2 pairs: one about the education a participant has already acquired, the other about the participant’s recommendation for the best education. Subject yields an alternative grouping based on what the answers share: questions 1 and 3 concern degree, and questions 2 and 4 concern field of study.

Answers to Each Question

[Answer-distribution charts from the original post are not reproduced in this feed.]

Bird’s-Eye View

[Overview chart from the original post is not reproduced in this feed.]

Sankey Diagrams: How Data Flows

Sankey diagrams help visualize how answers flow through the questions. We start with pairs of related questions and finish with all 4 questions together.
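Since the diagrams themselves are not reproduced here, the following is a rough sketch of how one such diagram could be built in R with networkD3; the answers data frame and its q1/q3 columns are hypothetical stand-ins for the survey responses (the original post does not show its code).

library(dplyr)
library(networkD3)

# Hypothetical survey responses: one row per respondent,
# completed degree (q1) and recommended degree (q3)
answers <- tibble::tibble(
  q1 = sample(c("Bachelor", "Master", "PhD"), 285, replace = TRUE),
  q3 = sample(c("Bachelor", "Master", "PhD"), 285, replace = TRUE)
)

# Count the flows between Q1 and Q3, prefixing labels so that
# left-hand and right-hand nodes stay distinct
links <- answers %>%
  count(q1, q3, name = "value") %>%
  mutate(
    from = paste("Q1:", q1),
    to   = paste("Q3:", q3)
  )

# networkD3 expects zero-indexed node ids
nodes <- data.frame(name = unique(c(links$from, links$to)))
links <- links %>%
  mutate(
    source = match(from, nodes$name) - 1,
    target = match(to, nodes$name) - 1
  )

sankeyNetwork(
  Links = as.data.frame(links), Nodes = nodes,
  Source = "source", Target = "target",
  Value = "value", NodeID = "name"
)

Feeding the real answer pairs from any two questions into the same pattern would reproduce each of the diagrams listed below.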

Completed Degree and Field of Study (Q1, Q2)

Best Degree and Field of Study (Q3, Q4)

Completed Degree vs. Best Degree (Q1, Q3)

Completed Field vs. Best Field (Q2, Q4)

Complete Flow of Answers For All 4 Questions

[The Sankey diagrams themselves are not reproduced in this feed.]

Concluding comments

The survey is still open, so anyone who didn’t participate can still do so, and please let others know about it. If you haven’t noticed yet, there is a certain bias towards statistics in the answers. This might be because a significant share of respondents reached the survey via the R-bloggers distribution, which is popular among R users, who often have a background in statistics. Finally, people with a degree in Math are likely to suggest Math as the best field, and so on for other fields and degrees – this sort of bias is easy to see in the Sankey diagrams above. Removing such bias from the results could be useful, and I attempted this exercise, but found my DIY approach either too naive or too extensive to carry out in a short period of time with the resources I discovered. If you have pointers, or better yet a method for removing such bias from the answers, I’d love to hear from you.


To leave a comment for the author, please follow the link and comment on their blog: novyden.


Community of Bioinformatics Software Developers (CDSB): The story of a diversity and outreach hotspot in Mexico that hopes to empower local R developers


[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

By Leonardo Collado Torres, Ph.D., Research Scientist, Lieber Institute for Brain Development. Brain genomics #rstats coder working w/ @andrewejaffe @LieberInstitute; @lcgunam @jhubiostat @jtleek alumni; @LIBDrstats @CDSBMexico co-founder.

I have been attending R conferences since 2008, and while I’ve seen the R community grow rapidly, I generally don’t encounter many Latin Americans (LatAm) among communities of R developers. Traditionally, a lab’s lead investigator invested in R or Bioconductor would teach their trainees and students these skills, becoming a local R hotspot. However, that scenario is uncommon in Mexico for several reasons. Recognizing some of these challenges, and driven to promote R in our home country and LatAm, in 2017 Alejandro Reyes and I teamed up with Alejandra Medina Rivera and Heladia Salgado to eventually launch the Community of Bioinformatics Software Developers CDSB (in Spanish) in 2018. One of our goals is to facilitate and encourage the transition from R user to R/Bioconductor developer. We have organized yearly one-week-long workshops together with NNB-UNAM and RMB and just announced our 2020 workshop (August 3-7, 2020, Cuernavaca, Mexico).

If you are fighting for change, know that even when things feel hard and progress is slow (or seems nonexistent), you are still creating and holding space for those who will come after you. Even if you can’t see it now, the impact of your work will be visible in the future.

— Jen Heemstra (@jenheemstra) March 5, 2020

Now onto our third workshop, I feel like we’ve had several success stories.

We have greatly benefited from the logistics and organization support from the NNB-UNAM and RMB local teams, allowing us to focus on designing the workshop curriculum and inviting a diverse set of instructors, including Maria Teresa Ortiz, who is an RLadiesCDMX co-founder and has been supporting us from the beginning. However, we face economic challenges, as the budget for the national science foundation (CONACyT) has decreased in recent years. Support from the R Consortium’s small conference fund and other sponsors has been instrumental, as have the diversity and travel scholarships some of our instructors have secured at R conferences. We just recently revamped our sponsor page and answered the question: why should you support us?

However, while we are just getting started, one of our highlights was born from rOpenSci’s icebreaker exercise at CDSB2019. We were able to really build a sense of community and a desire to perform outreach activities in our local communities. In particular, a CDSB2018 and CDSB2019 alumna, Joselyn Chávez, volunteered to join the CDSB board. At CDSB2019 we also created an #rladies channel in our Slack, where at the time we had members from 3 of Mexico’s 4 RLadies chapters (Qro, Xalapa, CDMX) and now have members from 5 of 6 (adding Cuerna and Monterrey), as CDSB2018 and CDSB2019 alumni have co-founded two chapters: Ana Beatriz Villaseñor-Altamirano for Qro and Joselyn Chávez for Cuerna.

I am proud of and excited about what we have achieved with our one-week-long CDSB workshops, but also about how we have used the tools we’ve learnt from other communities to keep interacting and communicating throughout the rest of the year. Time will tell whether our efforts created a ripple that grew into a wave or whether we’ll end up burning out. Sustainability is a challenge, but we are greatly motivated by the impact we’ve had and can only imagine a brighter future.


The post Community of Bioinformatics Software Developers (CDSB): The story of a diversity and outreach hotspot in Mexico that hopes to empower local R developers appeared first on R Consortium.


To leave a comment for the author, please follow the link and comment on their blog: R Consortium.



