
Python and R – Part 2: Visualizing Data with Plotnine


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Article Update

Interested in more Python and R tutorials?

👉Register for our blog to get new articles as we release them.


Introduction

In this post, we pick up where we left off in Python and R – Part 1: Exploring Data with Datatable. In the chunk below, we load our cleaned-up big MT Cars data set so that we can refer to variables directly, without a short code or the f function from datatable. We will, however, load plotnine with the short code p9. We found this cumbersome relative to the R behavior, but given how many different ggplot functions we use when exploring a data set, it is hard to know in advance which functions to load into the namespace. Our experience, and the discussions we have read by others, suggest that matplotlib and seaborn are not very intuitive and probably not better than ggplot (given the mixed reviews we have seen). If we can port over a familiar library and avoid a learning curve, that is a win. As we mentioned in our previous post, plotnine feels very similar to ggplot, with a few exceptions. We will put the library through its paces below.

# R Libraries
library("reticulate")

knitr::opts_chunk$set(
  fig.width = 15,
  fig.height = 8,
  out.width = '100%'
)

# Choose Python 3.7 miniconda
reticulate::use_condaenv(
  condaenv = "r-reticulate",
  required = TRUE
)

# Install Python packages
lapply(c("plotnine"), function(package) {
  conda_install("r-reticulate", package, pip = TRUE)
})

# Python libraries
from datatable import *
import datatable as dt   # short code needed for dt.mean(), dt.max(), dt.count() used later
import numpy as np
import plotnine as p9
import re

# Load cleaned vehicles
big_mt = fread("~/Desktop/David/Projects/general_working/mt_cars/vehicles_cleaned.csv")

# Export names to list to add to dictionary
expr = [exp for exp in big_mt.export_names()]
names = big_mt.names

# Assign all exported name expressions to variable names
names_dict = {names[i]: expr[i] for i in range(len(names))}
locals().update(names_dict)

Consolidate make Into Parent manufacturer

In the previous post, we collapsed VClass from 35 overlapping categories down to 7. Here, we similarly consolidate many of the brands in make under their parent producers. Automotive brands often change hands, and there have been some large mergers over the years, such as Fiat and Chrysler in 2014 and the upcoming combination with Peugeot, which makes this a somewhat crude exercise. We used the standard that the brand is owned by the parent currently, but this may not have been the case over most of the period shown in the charts below. This can also affect a parent’s efficiency compared to its peers. For example, Volkswagen bought a portfolio of luxury European gas guzzlers in recent years, so its position is pulled down from what would otherwise be one of the most efficient brands.

# Control flow statement used to collapse make levels
def collapse_make(make):
  manufacturer = str()
  if make in ['Pontiac', 'Oldsmobile', 'Cadillac', 'Chevrolet', 'Buick', 'General Motors', 'Saturn', 'GMC']:
      manufacturer = 'GM'
  elif make in ['Ford', 'Mercury', 'Lincoln']:
      manufacturer = 'Ford'
  elif make in ['Toyota', 'Lexus', 'Scion']:
      manufacturer = 'Toyota'
  elif make in ['Nissan', 'Infiniti', 'Renault', 'Mitsubishi']:
      manufacturer = 'Nissan'
  elif make in ['Volkswagen', 'Audi', 'Porsche', 'Bentley', 'Bugatti', 'Lamborghini']:
      manufacturer = 'Volkswagen'
  elif make in ['Chrysler', 'Plymouth', 'Dodge', 'Jeep', 'Fiat', 'Alfa Romeo', 'Ram']:
      manufacturer = 'Chrysler'
  elif make in ['Honda', 'Acura']:
      manufacturer = 'Honda'
  elif make in ['BMW', 'Rolls Royce', 'MINI']:
      manufacturer = 'BMW'
  elif make in ['Isuzu', 'Subaru', 'Kia', 'Hyundai', 'Mazda', 'Tata', 'Genesis']:
      manufacturer = 'Other Asian'
  elif make in ['Volvo', 'Saab', 'Peugeot', 'Land Rover', 'Jaguar', 'Ferrari']:
      manufacturer = 'Other Euro'
  else:
      manufacturer = 'Other'
  return manufacturer

# Set up vclass list of categories for iteration
vclass = big_mt[:, VClass].to_list()[0]
big_mt[:, 'vehicle_type'] = Frame(['Cars' if re.findall('Car', item) else 'Trucks' for item in vclass]).to_numpy()

# Consolidate make under parents
big_mt[:, 'manufacturer'] = Frame([collapse_make(line[0]) for line in big_mt[:, 'make'].to_tuples()])

# Assign expressions to new variables
vehicle_type, manufacturer = big_mt[:, ('vehicle_type', 'manufacturer')].export_names()

Imports Started Ahead and Improved Efficiency More

Here, we selected the largest-volume brands in two steps: first creating a numpy vector of makes which sold more than 1,500 separate models over the full period, and then creating an expression to filter for the most popular. Then, we iterated over our vector and classified vehicles as ‘Cars’ or ‘Trucks’ based on regex matches to build a new vehicle_type variable. We would love to know a more streamlined way to accomplish these operations, because they would surely be easier for us using data.table. Excluding EVs, we found the combined mean mpg by year and make for both cars and trucks. It could be that we are missing something, but this also feels more verbose than it would have been in data.table, where we probably could have nested the filtering expressions within the frames; then again, this could be our weakness in Python.

# Filter for brands with most models over full period
most_popular_vector = big_mt[:, count(), by(manufacturer)][(f.count > 1500), 'manufacturer'].to_numpy()
most_popular = np.isin(big_mt[:, manufacturer], most_popular_vector)

# Create data set for charts
data = big_mt[most_popular, :] \
             [(is_ev == 0), :] \
             [:, { 'mean_combined' : mean(comb08),
                   'num_models' : count() },
                 by(year,
                    manufacturer,
                    vehicle_type)]
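For comparison, here is a rough sketch of how we imagine the same filter-and-aggregate step would look in R’s data.table, assuming big_mt were loaded there with the same column names (this is our guess at an equivalent, not code from the original analysis):

library(data.table)

# Count models per manufacturer, keep high-volume brands, drop EVs,
# then aggregate mean combined mpg and model counts by group
data <- big_mt[, n_models := .N, by = manufacturer][
  n_models > 1500 & is_ev == 0,
  .(mean_combined = mean(comb08), num_models = .N),
  by = .(year, manufacturer, vehicle_type)]

Being able to chain the count, filter, and grouped aggregation inside one bracketed expression is the brevity we were missing on the Python side.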

Our plotnine code and graph below look very similar to what ggplot would produce, but we struggled with sizing the plot on the page and avoiding cut-off axis and legend labels. We tried to put the legend on the right, but the labels were partially cut off unless we squeezed the charts too much. When we put it at the bottom with horizontal labels, the x-axis for the ‘Cars’ facet was still partially blocked by the legend title. We couldn’t find much written on how to make the charts bigger or how to change the aspect ratio or figure size parameters, so the size looks a bit smaller than we would like. We remember similar struggles while learning ggplot, but it felt like we could figure them out more quickly there.

It is also important to mention that confidence intervals are not yet implemented for lowess smoothing with geom_smooth() in plotnine. This probably isn’t such a big deal for our purposes in this graph, where there are a large number of models in each year. However, it detracts from the figure below, where the uncertainty about the true mean efficiency of cars with batteries in the early years is high because there were so few models.

# Smoothed line chart of efficiency by manufacturer
(p9.ggplot(data.to_pandas(),
           p9.aes(x = 'year',
                  y = 'mean_combined',
                  group = 'manufacturer',
                  color = 'manufacturer')) +
    p9.geom_smooth() +
    p9.theme_bw() +
    p9.labs(title = 'Imported Brands Start Strong, Make More Progress on Efficiency',
            x = 'Year',
            y = 'MPG',
            caption = 'EPA',
            color = 'Manufacturer') +
    p9.facet_wrap('~vehicle_type',
                  ncol = 2) +
    p9.theme(
        subplots_adjust={'bottom': 0.25},
        figure_size=(8, 6),    # inches
        aspect_ratio=1/0.7,    # height:width
        dpi = 200,
        legend_position='bottom',
        legend_direction='horizontal')
)
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

One thing to note is that it is difficult to tell which line maps to which make just by the colors. The original plan was to pipe this into plotly as we would do in R, but this functionality is not available. While the plotnine functionality is pretty close to ggplot, the lack of support of plotly is a pretty serious shortcoming.

From the chart, we can see that “Other Asian” started out well at the beginning of the period and made remarkable progress, leaving Toyota behind as the leader in cars and trucks. Our family has driven Highlanders over the last 20 years and seen that model go from moderate to large, so it is not surprising to see Toyota trucks go from second most to second least efficient. BMW made the most progress of all producers in cars, and also made gains since introducing trucks in 2000. As a general comment, relative efficiency seems more dispersed and stable for cars than for trucks.

# Stacked line of number of models per manufacturer
(p9.ggplot(data[year < 2020, :].to_pandas(),
           p9.aes(x = 'year',
                  y = 'num_models',
                  fill = 'manufacturer')) +
    p9.geom_area(position = 'stack') +
    p9.theme_bw() +
    p9.labs(title = 'BMW Making a Lot of Car Models, While GM Streamlines',
            x = 'Year',
            y = 'Number of Models',
            caption = 'EPA',
            color = 'Manufacturer') +
    p9.facet_wrap('~vehicle_type',
                  ncol = 2,
                  scales = 'free') +
    p9.theme(
        subplots_adjust={'bottom': 0.25},
        figure_size=(8, 6),    # inches
        aspect_ratio=1/0.7,    # height:width
        dpi = 200,
        legend_position='bottom',
        legend_direction='horizontal')
)

When we look at the number of models by manufacturer, we can see that the number of models declined steadily from 1984 through the late 1990s, but has been rising since. Although the number of truck models appears competitive with cars, note that the graphs have different scales, so there are only about two-thirds as many in most years. In addition to becoming much more fuel efficient, BMW has increased its number of models to an astonishing degree over the period, even while most other European imports have started to tail off (except Mercedes). We would be interested to know the story behind such a big move by a still-niche US player. GM had a very large number of car and truck models at the beginning of the period, but now has a much more streamlined range. It is important to remember that these numbers are not vehicles sold or market share, just models tested for fuel efficiency in a given year.

Electric Vehicles Unsurprisingly Get Drastically Better Mileage

After looking at efficiency by manufacturer in the figure above, we did a double-take when we saw the chart below. While progress for gas-powered vehicles looked respectable above, in the context of cars with batteries, gas-only vehicles are only about half as efficient on average. Though the mean improved, the mileage of the most efficient gas-powered vehicle in any given year steadily lost ground over the period.

Meanwhile, vehicles with batteries are not really comparable, because plug-in vehicles don’t use any gas; the EPA imputes an energy equivalence for those vehicles. The EPA website explains in Electric Vehicles: Learn More About the Label how it calculates the electricity equivalent to travel 100 miles for plug-in vehicles. This seems like a crude comparison, as electricity prices vary around the country. Still, the most efficient battery-powered car (recently a Tesla) improved to an incredible degree.
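For a rough sense of how that equivalence works (a sketch based on the EPA’s standard conversion factor of about 33.7 kWh of electricity per gallon of gasoline; the consumption figure below is a made-up illustration, not a value from this data set):

$$\text{MPGe} \approx \frac{33.7\ \text{kWh per gallon}}{\text{kWh per mile}}, \qquad \text{e.g. } 28\ \text{kWh per 100 miles} \;\Rightarrow\; \frac{33.7}{0.28} \approx 120\ \text{MPGe}.$$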

Around 2000, there were only a handful of battery-powered cars, so the error bars would be wide if they were included, and we are counting all cars with any battery as one category even though it mixes hybrids and plug-ins. In any case, the trend should be interpreted with caution, but there was a period where the average actually declined, and the average really hasn’t improved over 20 years the way the most efficient models have.

# Prepare data for charting by gas and battery-powered
data = big_mt[(vehicle_type == "Cars"), :][:,
               { "maximum": dt.max(comb08),
                 "mean" : dt.mean(comb08),
                 "minimum": dt.min(comb08),
                 "num_models" : dt.count() },
                   by(year, is_ev)]

# Reshape
data = data.to_pandas().melt(
                  id_vars=["year",
                           "is_ev",
                           "num_models"],
                  value_vars=["maximum",
                              "mean",
                              "minimum"],
                  var_name = "Description",
                  value_name = "MPG")

# Facet plot smoothed line for gas and battery-powered
(p9.ggplot(
    data,
    p9.aes('year',
           'MPG',
           group = 'Description',
           color = 'Description')) +
    p9.geom_smooth() +
    p9.facet_wrap('~ is_ev') +
    p9.labs(
      title = 'Gas Powered Cars Make Little Progress, While EV Driven by Most Efficient',
      x = 'Year'
    ) +
    p9.theme_bw() +
    p9.theme(
        subplots_adjust={'right': 0.85},
        figure_size=(10, 8),  # inches
        aspect_ratio=1/1,     # height:width
        legend_position='right',
        legend_direction='vertical'))
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

Efficiency of Most Vehicle Types Started Improving in 2005

We were surprised to see the fuel efficiency of mid-sized cars overtake even small cars as the most efficient around 2012. Small pickups and SUVs also made a lot of progress, as did standard pickup trucks. Sport utility vehicles were left behind by the improvement most categories saw after 2005, while vans steadily lost efficiency over the whole period. As mentioned earlier, we noticed that the same model SUV that we owned got about 20% larger over the period. It seems like most families in our area have at least one SUV, but they didn’t really exist before 2000.

# Prepare data for plotting smoothed line by VClass
data = big_mt[(is_ev == False), :][:,
                {'mean' : dt.mean(comb08),
                 'num_models' : count() },
                    by(year, VClass, is_ev)].to_pandas()

# Plot smoothed line of efficiency by VClass
(p9.ggplot(
    data,
    p9.aes('year',
           'mean',
           group = 'VClass',
           color = 'VClass')) +
    p9.geom_smooth() +
    p9.labs(
        title = "Midsize Cars Pull Ahead in Efficiency",
        y = 'MPG',
        x = 'Year') +
    p9.theme_bw() +
    p9.theme(
        subplots_adjust={'right': 0.75},
        figure_size=(10, 4),   # inches
        aspect_ratio=1/1.5,    # height:width
        legend_position='right',
        legend_direction='vertical'))
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

Efficiency by Fuel Type

We can see that the fuel efficiency of electric vehicles almost doubled over the period, while the average efficiency of vehicles with batteries did not make the same improvement. We generated our is_ev flag based on whether the car had a battery, without specifying whether it was a plug-in or a hybrid, so this discrepancy may have something to do with that. We can also see that the efficiency of diesel vehicles came down sharply during the 2000s. We know that Dieselgate broke in 2015 for vehicles sold from 2009, so it is interesting to see that the decline in listed efficiency started prior to that period. Natural gas vehicles seem to have been eliminated five years ago, which is surprising given the natural gas boom.

# Prepare data for plotting by fuelType1
data = big_mt[:,
              { 'maximum': dt.max(comb08),
                'minimum': dt.min(comb08),
                'num_models' : count(),
                'mpg' : dt.mean(comb08) },
                  by(year, fuelType1)].to_pandas()

# Plot smoothed line of efficiency by fuelType1
(p9.ggplot(data,
           p9.aes('year',
                  'mpg',
                  color='fuelType1')) +
    p9.geom_smooth() +
    p9.theme_bw() +
    p9.labs(
        title = "Efficiency of Electric Vehicles Takes Off",
        y = 'MPG',
        x = 'Year',
        color='Fuel Type') +
    #p9.geom_hline(aes(color="Overall mean")) +
    p9.theme(
        subplots_adjust={'right': 0.75},
        figure_size=(10, 4),   # inches
        aspect_ratio=1/1.5,    # height:width
        legend_position='right',
        legend_direction='vertical'))
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

We don’t know if fuelType1 refers to the recommended or the required fuel, but we didn’t realize that there had been such a sharp increase in models calling for premium over the period. Our understanding was that premium gasoline had more to do with engine performance than with gas efficiency. It is notable that, despite all the talk about alternative fuels, they can still be used in only a small minority of new models.

# Plot stacked line of number of models by fuelType1
(p9.ggplot(data[data['year'] < 2020],
           p9.aes('year',
                  'num_models',
                  fill = 'fuelType1')) +
    p9.geom_area(position = 'stack') +
    p9.theme_bw() +
    p9.labs(
        title = "Number of Cars and Trucks Requiring Premium Overtakes Regular",
        y = 'Number of Models',
        x = 'Year',
        fill = 'Fuel Type') +
    p9.theme(
        subplots_adjust={'right': 0.75},
        figure_size=(10, 4),   # inches
        aspect_ratio=1/1.5,    # height:width
        legend_position='right',
        legend_direction='vertical'))

Comments About Plotnine and Python Chunks in RStudio

In addition to the charts rendering smaller than we would have liked, we would have liked to have figure captions (as we generally do for our R chunks). In addition, our cross-referencing links currently do not work for the Python chunks as they would with R. There is a bug mentioned on the knitr news page which may be fixed when the 1.29 update becomes available.

Conclusion

There is a lot of complexity in this system and more going on than we are likely to comprehend in a short exploration. We know there was a regulatory response to the CAFE standards, which tightened in 2005, and that at least one significant producer may not have had accurate efficiency numbers during the period. The oil price fluctuated widely during the period, but not enough to cause a real change in behavior the way it did during the 1970s. We also don’t know how many vehicles of each brand were sold, so we don’t know how producers might jockey to sell more profitable models within the framework of overall fleet efficiency constraints. There can be a fine line between a light truck and a car, and the taxation differentials on importing cars vs. light trucks are significant. Also, the weight cutoffs for trucks changed in 2008, so most truck categories are not a consistent weight over the whole period. That is all for now, but a future post might involve scraping CAFE standards, where there is also long-term data available, to see if some of the blanks about volumes and weights could be filled in to support more than just exploratory analysis.

Author: David Lucy, Founder of Redwall Analytics. David spent 25 years working in institutional global equity research with several top investment banking firms.



More on Biontech/Pfizer’s Covid-19 vaccine trial: Adjusting for interim testing in the Bayesian analysis


[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Here is one more post about Biontech/Pfizer’s vaccine trial. That is because my previous two posts (here and here) have so far ignored one interesting topic: How do Biontech/Pfizer statistically account for interim analyses? Studying this topic also gives us general insights into how adaptive trial designs with multiple success conditions can be properly evaluated.

According to their study plan, sufficient vaccine efficacy is established if one can infer that the efficacy is at least 30% with a type I error (wrongly declaring sufficient efficacy) of no more than 2.5%. While the final evaluation shall take place once there are 164 confirmed Covid-19 cases among the 43,538 study participants, the plan states that overwhelming efficacy, or, if outcomes are bad, futility, can also be declared at intermediate stages of 32, 62, 92, and 120 cases.

Here is a screenshot of Table 5 on p. 103 in the study plan that specifies the exact success and futility thresholds for the interim and final efficacy analyses:

The first interim analysis was planned after 32 confirmed Covid-19 cases. The table states that overwhelming efficacy shall be announced if no more than 6 of these 32 subjects with Covid-19 were vaccinated. Let us draw the posterior distribution of the parameter $\theta$ that measures the probability that a subject with Covid-19 was vaccinated (see the previous post for details):

library(ggplot2)

# Parameters of prior beta distribution
a0 = 0.700102; b0 = 1

# Covid cases in treatment and control group
mv = 6; mc = 26

# Compute posterior density of theta
theta.seq = seq(0, 1, by = 0.01)
density = dbeta(theta.seq, a0 + mv, b0 + mc)

# Thresholds
VE.min = 0.3
theta.max = (1 - VE.min) / (2 - VE.min) # 0.41176

ggplot(data.frame(theta = theta.seq, density = density), aes(x = theta, y = density)) +
  geom_area(col = "blue", fill = "blue", alpha = 0.5) +
  geom_vline(xintercept = theta.max) +
  ggtitle("Posterior if from 32 Covid cases 6 were vaccinated")
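The theta.max threshold in the chunk above comes from the mapping between the vaccine efficacy $VE$ and the probability $\theta$ that a confirmed case is a vaccinated subject. As a quick refresher (assuming, as in the previous posts, equally sized vaccine and control groups, so person-time at risk is roughly equal in both arms):

$$\theta = \frac{1 - VE}{2 - VE} \qquad\Longleftrightarrow\qquad VE = \frac{1 - 2\theta}{1 - \theta},$$

so the requirement $VE > 30\%$ translates into $\theta < (1 - 0.3)/(2 - 0.3) \approx 0.412$, which is exactly the theta.max computed above.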

The posterior probability that the efficacy would be above 30% in that case can be simply computed:

prob.VE.above.30 = pbeta(theta.max, a0 + mv, b0 + mc)
round(prob.VE.above.30 * 100, 3)
## [1] 99.648

This means that, if the study had been planned to end after 32 cases with no prior interim evaluation, these thresholds would yield a type I error below 0.352% (i.e. 100% – 99.648%). This is a much stricter bound than the 2.5% type I error bound stated above.

Footnote a) below Table 5 confirms such tighter bounds for the interim analyses:

Interim efficacy claim: P(VE > 30% given data) > 0.995

For the final analysis the footnote also implies an error bound below 2.5%:

success at the final analysis: P(VE > 30% given data) > 0.986.

The crucial point is that the 2.5% bound on the type I error shall hold for the complete analysis, which allows sufficient efficacy to be declared on 5 different occasions: at one of the 4 interim analyses or at the final analysis. In a similar fashion to controlling for multiple testing, we have to correct the individual error thresholds of each of the 5 analyses to guarantee an overall 2.5% error bound. The Bayesian framework does not relieve us from such a “multiple testing correction”.
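As a rough illustration only (my own back-of-the-envelope bound, not part of the study plan): if the five analyses were completely independent tests, each run at the 2.5% level, the chance of at least one false success would be

1 - (1 - 0.025)^5
## [1] 0.1189043

In reality the five looks share the same accumulating data and are highly dependent, so the true inflation is smaller than this, but it is still well above 2.5%, as the simulation below shows.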

So how does one come up with error bounds for each separate analysis that guarantee a total error bound of 2.5%? For an overview and links to relevant literature you can e.g. look at the article “Do we need to adjust for interim analyses in a Bayesian adaptive trial design?” by Ryan et al. (2020). The article states that in practice the overall error bound of a Bayesian trial design with interim stopping opportunities is usually computed via simulations.

While I want to reiterate that I am no biostatistician, I would like to make an educated guess how such simulations could have looked like.

The following code simulates a trial run until m.max=164 Covid-19 cases are observed, assuming a true vaccine efficacy of only VE.true = 30%:

library(dplyr)

simulate.trial = function(runid = 1, m.max = 164, VE.true = 0.3,
    VE.min = VE.true, a0 = 0.700102, b0 = 1, m.analyse = 1:m.max) {
  theta.true = (1 - VE.true) / (2 - VE.true)
  theta.max = (1 - VE.min) / (2 - VE.min)
  is.vaccinated = ifelse(runif(m.max) >= theta.true, 0, 1)

  mv = cumsum(is.vaccinated)[m.analyse]
  mc = m.analyse - mv

  prob.above.VE.min = pbeta(theta.max, shape1 = a0 + mv, shape2 = b0 + mc, lower.tail = TRUE)

  # Returning results as matrix is faster than as data frame
  cbind(runid = runid, m = m.analyse, mv = mv, mc = mc, prob.above.VE.min)
}

set.seed(42)
dat = simulate.trial() %>% as_tibble
dat
## # A tibble: 164 x 5
##    runid     m    mv    mc prob.above.VE.min
##                    
##  1     1     1     0     1             0.759
##  2     1     2     0     2             0.869
##  3     1     3     1     2             0.618
##  4     1     4     1     3             0.746
##  5     1     5     1     4             0.834
##  6     1     6     1     5             0.893
##  7     1     7     1     6             0.932
##  8     1     8     2     6             0.828
##  9     1     9     2     7             0.881
## 10     1    10     2     8             0.919
## # ... with 154 more rows

Each row of the data frame corresponds to one potential interim analysis after m Covid-19 cases have been observed. The corresponding number of cases from vaccinated subjects mv and from control group subjects mc are random. The column prob.above.VE.min denotes the posterior probability that the vaccine efficacy is better than VE.min = 30% given the simulated interim data and Biontech/Pfizer’s assumed prior characterized by the arguments a0 and b0. You see how every additional Covid-19 case of a vaccinated subject reduces prob.above.VE.min while every additional Covid-19 case of a control group subject increases it.
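A minimal check of that last claim, reusing the prior parameters and the theta.max threshold defined earlier (the case counts are just small illustrative values):

a0 = 0.700102; b0 = 1
theta.max = (1 - 0.3) / (2 - 0.3)   # 0.41176

pbeta(theta.max, a0 + 2, b0 + 6)    # 2 vaccinated vs 6 control cases
pbeta(theta.max, a0 + 3, b0 + 6)    # one more vaccinated case -> probability falls
pbeta(theta.max, a0 + 2, b0 + 7)    # one more control case -> probability rises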

Let us plot these posterior probabilities of a vaccine effectiveness above 30% for all possible interim analyses and the final analysis of our simulated trial:

ggplot(dat, aes(x=m, y=prob.above.VE.min*100)) +  geom_line() + geom_hline(yintercept = 97.5)

You see how this posterior probability varies substantially. In a few interim analyses the 97.5% threshold is exceeded, but not in all of them. (As you may have guessed from the random seed, this is not a completely arbitrary example: I picked a run where the 97.5% line is actually breached. This does not happen in all runs, but it is representative of how much the posterior probabilities can change between different interim analyses.)

This just illustrates the aforementioned multiple testing problem. Hence, if we allow interim analyses to declare an early success, the critical posterior probability for each separate analysis should be some level above 97.5% to have an overall type I error rate of at most 2.5%.

The simulate.trial function can also simulate the relevant values for just a smaller set of interim analyses, e.g. the five analyses planned by Biontech/Pfizer:

simulate.trial(m.analyse = c(32, 64, 90, 120, 164)) %>% as_tibble
## # A tibble: 5 x 5
##   runid     m    mv    mc prob.above.VE.min
##                   
## 1     1    32    14    18            0.393 
## 2     1    64    25    39            0.639 
## 3     1    90    40    50            0.270 
## 4     1   120    54    66            0.202 
## 5     1   164    79    85            0.0364

To assess how the error thresholds should be adapted, we repeat the simulation above many times. Ideally a large number like a million times, but for speed reasons, let us settle for just 100,000 simulated trials:

set.seed(1)
dat = do.call(rbind, lapply(1:100000, simulate.trial,
                            m.analyse = c(32, 64, 90, 120, 164))) %>%
  as_tibble

We can now compute in which fraction of simulated trials at least one analysis (one of the 4 interim or the final one) yields a posterior probability above 97.5% for a vaccine efficacy above 30%.

agg = dat %>%
  group_by(runid) %>%
  summarize(
    highest.prob = max(prob.above.VE.min),
    final.prob = last(prob.above.VE.min)
  )

# Share of simulation runs in which highest posterior probability
# of VE > 30% across all possible interim analyses is larger than
# 97.5%
mean(agg$highest.prob > 0.975)
## [1] 0.07179

# Corresponding share looking only at the final analysis
mean(agg$final.prob > 0.975)
## [1] 0.02656

While in 2.66% of the simulated trials, the final posterior probability of an efficacy larger than 30% is above 97.5%, we find in 7.18% of the simulated trials at least one interim analysis or the final analysis with that posterior probability above 97.5%.

Assume we want to set the same threshold for the posterior probability of a vaccine efficacy above 30% in all interim analyses and the final analysis, so that the total error rate stays below 2.5%. We can find the corresponding threshold by computing the empirical 97.5% quantile in our simulated data:

quantile(agg$highest.prob, 0.975)
##     97.5% 
## 0.9922456

This means we should accept the efficacy only if in one of the 5 analyses the posterior probability of a vaccine efficacy above 30% is above 99.22%. The more interim analyses we would run, the tougher would be this threshold.

However, there is no reason to set the same threshold for all interim analyses and the final analysis. In their study plan Biontech/Pfizer set a tougher success threshold of 99.5% for the 4 interim analyses, but a lower threshold of 98.6% for the final analysis.

Let us see whether we also come up with the 98.6% threshold for the final analysis if we fix the interim thresholds at 99.5% and want to guarantee a total type I error rate of 2.5%:

agg = dat %>%
  group_by(runid) %>%
  summarize(
    interim.success = max(prob.above.VE.min * (m < 164)) > 0.995,
    final.prob = last(prob.above.VE.min)
  )

# Fraction of trials where we have an interim success
share.interim.success = mean(agg$interim.success)
share.interim.success
## [1] 0.01516

# Remaining trials without interim success
agg.remain = filter(agg, !interim.success)

# Maximal share of success in final analysis in remaining trials
# to guarantee 2.5% error rate
max.share.final.success = (0.025 - share.interim.success) *
  NROW(agg) / NROW(agg.remain)
quantile(agg.remain$final.prob, 1 - max.share.final.success)
## 99.00085% 
## 0.9852906

OK, I have not deeply checked that there is no mistake in the computation above, but we indeed get a required threshold for the final analysis of about 98.53%. If we round it, we must round it up to keep the total error rate bounded by 2.5%, which yields the 98.6% threshold used by Biontech/Pfizer. So possibly, we have indeed replicated the computations that formed their study plan.

Overall, I consider this approach an intriguing mixture between Bayesian statistics and frequentist hypothesis testing. In some sense it feels like the Bayesian posterior distributions are used as frequentist test statistics whose distributions we establish by our simulation.

For the sake of simplicity, I have ignored that the interim analyses also have futility thresholds that can lead to an early stop of the trial. Accounting for this possibly will generate some slight slack in the threshold computed above. But likely the threshold will not change much because I imagine that there are only very few trials that would in some interim analyses be declared futile while (if futility is ignored) in later analyses be declared an overwhelming success.

Given the degrees of freedom in the number and timing of the interim analyses and in the distribution of critical thresholds across the different analyses, optimal trial analysis design seems like a very interesting applied optimization problem. Of course, solving such an optimization problem would require a lot of estimates, like a quantification of gains from earlier conclusions and true subjective beliefs about the efficacy distribution (possibly more optimistic than the conservative priors used for regulatory purposes).

One can also imagine that more flexible adaptive trial designs can be evaluated with similar simulations. E.g. assume that in another vaccine trial one wants to perform the final analysis after, say, m=100 cases and earlier wants to have exactly one interim analysis, which shall take place at a specific date no matter how many cases mi have accrued by that date. One could then possibly make a valid study plan that just specifies the probability threshold for the interim analysis, like 99.5%, while the threshold for the final analysis will be a function of the number of cases mi at the interim analysis and set such that for every possible value of mi a total type I error bound of 2.5% is guaranteed.

As long as one can verify that one has not yet looked at the data, one can possibly also adapt the analysis plan in a running study. E.g. Biontech/Pfizer’s press release states:

After discussion with the FDA, the companies recently elected to drop the 32-case interim analysis and conduct the first interim analysis at a minimum of 62 cases. Upon the conclusion of those discussions, the evaluable case count reached 94 and the DMC [independent Data Monitoring Committee] performed its first analysis on all cases.

While we don’t know the chosen thresholds for the modified study plan, it seems obvious from the thresholds of the original design that the vaccine efficacy is established already at the interim analysis with 94 cases. It is also stated in the press release that the clinical trial continues to the final analysis at 164 confirmed cases. However, this shall be done in order to collect further data and characterize the vaccine candidate’s performance against other study endpoints. Submission for Emergency Use Authorization seems still to require a safety milestone to be achieved, but it doesn’t look like proving efficacy is an issue anymore.

As a personal summary, I had not thought that I end up writing three blog posts when I started with the idea to understand a bit better how this famous vaccine trial is evaluated. But with each post I learned new, interesting things about study design and analysis. Hope you also enjoyed reading the posts a bit.

UPDATE (2020-11-16 14:00): I just read the good news from Moderna’s vaccine trial and skimmed over the study plan that Google located. Interestingly, they seem to use a pure frequentist approach in the statistical analysis and the study plan contains terms like Cox proportional hazard regression and O’Brien-Fleming boundaries (for dealing with interim analyses).


Upcoming workshop: Shiny App to Prod


[This article was first published on Mirai Solutions, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Mirai is thrilled to announce that one of its cherished workshops, Bring your Shiny App to Production, will be given on December 1st!

For more than 10 years, we have been witnessing a crucial shift in customer and user behavior. Consumer expectations have reached new heights: constant innovation, on demand, at a competitive price, and all of that without sacrificing quality. The industry has been responding with the agile movement. As a matter of fact, DevOps has had to reinvent itself to enable non-disruptive innovation to reach the market in accelerated time.

shiny devops

Take your first steps in collaborative development and release, including automation. In this workshop, Mirai’s DevOps enthusiasts and experts will share their enterprise expertise with you. They will show live all the steps needed to bring a Shiny app to production in a safe and controlled manner.

CICD gif
CICD course content

For an overview of the benefits of applying a CI/CD approach, check out our article and take some hints from our CI/CD techguides.


NHS-R 2020 Week Long Conference


[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

The NHS-R conference has concluded, and I am emotional that it has ended. There were some fantastic speakers, and the whole event, from start to finish, was a blast.

Openers

I would like to include everyone here, but the openers that had an impact on me are included below:

Conference opening – the journey so far – Mohammed Amin Mohammed

Mohammed Amin Mohammed on the NHS-R Journey so far

What a great opener from Mohammed and it was really interesting to get a view of the direction of the NHS-R Community.

OpenSAFELY.org: proving the power of open methods for NHS data analysis

Ben Goldacre, whom you may have heard of from his books Bad Science and Bad Pharma, dropped in for a chat about the OpenSAFELY.org platform and the need for open methods for NHS data analysis.

Ben Goldacre on the OpenSAFELY project

Opening up Analytics by NHSX

Sarah Culkin gave a good talk on opening up data analytics and the need for an open-source mindset. Advanced analytics was mentioned as another revolution, and interesting work is afoot in the AI Lab. Watch the video to find out more:

Sarah Culkin on Opening up Analytics

Excellent workshops

Prior to the conference, there was a week of excellent workshops. The ones I found most interesting were:

Regression Modelling

Chris Mainey did an excellent introduction to Regression Modelling. Check it out:

Chris’ excellent regression workshop

Pretty R Markdown and Presentations with xaringan

This was a two part session:

Introduction to dplyr and RMarkdown

Zoe Turner did a brilliant job of onboarding newer developers with dplyr:

Awesome plenary talks

The plenary talks were awesome this year, as were the Lightning talks. I have not watched all of the sessions, so I do apologise if I missed you out, but all the sessions can be found here: https://tinyurl.com/NHSRConference.

Computer Vision – how it can aid clinicians – Gary Hutson

Shameless self promotion – I did a talk on how Computer Vision can aid clinicians:

Gary Hutson – Computer Vision, Deep Learning and more…

Building Predictive Models with HES data

Chris Mainey did another excellent slot looking at GAMs and mixed effects models, and how they apply to HES datasets:

Chris Mainey

Integrating R and QlikSense

Jamie-Leigh Chapman did a stellar job at showing the integration capabilities of R and QlikSense:

Jamie-Leigh Chapman – on integrating R and QlikSense

Decision Modelling in R and Shiny

A very interesting session by Robert Smith:

Causal Inference in Predictive Modelling

Great session from Andi Orlowski and Bruno Petrungaro:

Causal Inference in Predictive Modelling

Using ggplot2 polygon and spatial data to plot brain atlases

Athanasia Mowinckel has created an amazing package for displaying brain atlases:

R Code Quality: Does it Really Matter?

A great session on code quality and the packages that can be used to scare the life out of people with bad coding practices (sounds like all of my mates, and me):

APIs in R with Plumber

This session focussed on passing model parameters through from plumber. Definitely something that will be very useful in the future:

Wowsome Lightning Talks

The Lightning talks this year, like everything else, were fascinating. The links to the three days’ worth of lightning talks are included in this post:

The full playlist

The YouTube link below has all the sessions from the event. The ones covered in my blog were the ones I got a chance to watch, and I found them really useful and interesting. I am sure there are many more in the playlist that I have not had time to watch yet.

These can be accessed hereunder:

Full NHS-R Conference talks
NHS R Community Workshops

Subscribe to this YouTube channel and watch out for the excellent content coming out of this community. The belief is for 100% open source in the NHS and all our code is sharable.


Moderna Pfizer Vaccine Update


[This article was first published on Fells Stats, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

In a previous post we looked at the potential effectiveness of the Pfizer-Biontech vaccine candidate. Today Moderna announced interim results from their study. I have to say that their press release was quite a bit more informative than the Pfizer release.

They report that only 5 of the 95 cases came from the vaccine group (94% efficacy!). This allows us to update our efficacy probability plots to include the Moderna vaccine. Recall that due to Pfizer only reporting that efficacy is “greater than 90%,” we don’t know whether that means that the point estimate is greater than 90%, or that they are 95% certain that efficacy is above 90%. For completeness, we will include both of these in our analysis, with “point” indicating that the point estimate is slightly greater than 90%, and “bound” indicating the result if they are 95% certain that efficacy is above 90%. We’ll use the weakly informative prior from the Pfizer study design, though any weak prior will give similar results.

library(ggplot2)

# reference: https://pfe-pfizercom-d8-prod.s3.amazonaws.com/2020-09/C4591001_Clinical_Protocol.pdf

# prior interval (matches prior interval on page 103)
qbeta(c(.025, .975), .700102, 1)

# posterior pfizer (bound)
cases_treatment <- 3
cases_control <- 94 - cases_treatment
theta_ci <- qbeta(c(.025, .975), cases_treatment + .700102, cases_control + 1)
rate_ratio_ci <- theta_ci / (1 - theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx <- (1:90) / 500
yy <- sapply(xx, function(x) dbeta(x, cases_treatment + .700102, cases_control + 1))
xx <- 100 * (1 - xx / (1 - xx))
ggplot() +
  geom_area(aes(x = xx, y = yy)) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

# posterior pfizer (point)
cases_treatment <- 8
cases_control <- 94 - cases_treatment
theta_ci <- qbeta(c(.025, .975), cases_treatment + .700102, cases_control + 1)
rate_ratio_ci <- theta_ci / (1 - theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx1 <- (1:90) / 500
yy1 <- sapply(xx1, function(x) dbeta(x, cases_treatment + .700102, cases_control + 1))
xx1 <- 100 * (1 - xx1 / (1 - xx1))
ggplot() +
  geom_area(aes(x = xx1, y = yy1)) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

# posterior moderna
cases_treatment <- 5
cases_control <- 95 - cases_treatment
theta_ci <- qbeta(c(.025, .975), cases_treatment + .700102, cases_control + 1)
rate_ratio_ci <- theta_ci / (1 - theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx2 <- (1:90) / 500
yy2 <- sapply(xx2, function(x) dbeta(x, cases_treatment + .700102, cases_control + 1))
xx2 <- 100 * (1 - xx2 / (1 - xx2))
ggplot() +
  geom_area(aes(x = xx2, y = yy2)) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

df <- rbind(
  data.frame(xx = xx, yy = yy, Company = "Pfizer-Biontech (bound)"),
  data.frame(xx = xx1, yy = yy1, Company = "Pfizer-Biontech (point)"),
  data.frame(xx = xx2, yy = yy2, Company = "Moderna"))
ggplot(df) +
  geom_area(aes(x = xx, y = yy, fill = Company), alpha = .25, position = "identity") +
  geom_line(aes(x = xx, y = yy, color = Company), size = 1) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

moderna

The likelihood that Moderna has higher or lower efficacy than Pfizer depends on how the Pfizer press release is interpreted. Regardless, both show fantastic levels of protection.
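One way to quantify that comparison is to sample from the three posteriors defined above and compare the implied effectiveness draws directly. This is just a quick Monte Carlo sketch of my own, treating the two trials as independent:

# Draw from the Beta posteriors defined in the chunk above
set.seed(1)
n <- 1e5
theta_moderna      <- rbeta(n, 5 + .700102, 90 + 1)
theta_pfizer_point <- rbeta(n, 8 + .700102, 86 + 1)
theta_pfizer_bound <- rbeta(n, 3 + .700102, 91 + 1)

# Convert the case-split parameter theta into vaccine effectiveness
ve <- function(theta) 100 * (1 - theta / (1 - theta))

# Posterior probability that Moderna's efficacy exceeds Pfizer's
mean(ve(theta_moderna) > ve(theta_pfizer_point))   # vs. the "point" reading
mean(ve(theta_moderna) > ve(theta_pfizer_bound))   # vs. the "bound" reading

Additionally, Moderna reported some safety data.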

A review of solicited adverse events indicated that the vaccine was generally well tolerated. The majority of adverse events were mild or moderate in severity. Grade 3 (severe) events greater than or equal to 2% in frequency after the first dose included injection site pain (2.7%), and after the second dose included fatigue (9.7%), myalgia (8.9%), arthralgia (5.2%), headache (4.5%), pain (4.1%) and erythema/redness at the injection site (2.0%). These solicited adverse events were generally short-lived. These data are subject to change based on ongoing analysis of further Phase 3 COVE study data and final analysis.

To my eye, these also look fantastic. Based on my personal experience, they appear to be in the same ballpark as a flu shot. None of them are anything I'd mind experiencing yearly or twice yearly. Notably absent from this list is fever, which appears to be relatively common in other candidates and could really put a damper on vaccine uptake.

Another pressing question is whether the vaccines protect one from getting severe disease. Moderna found 11 severe cases among the control vs 0 in the treatment. This number is significantly lower, but should be interpreted with care. Given that the vaccine is effective, the pressing question is whether subjects are less likely to get severe disease given that they become infected. That is to say, in addition to preventing infections, does the vaccine make it milder if you do get infected?

> x <- matrix(c(0, 5, 11, 90 - 11), nrow = 2)
> x
     [,1] [,2]
[1,]    0   11
[2,]    5   79
> fisher.test(x)

	Fisher's Exact Test for Count Data

data:  x
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.00000 8.88491
sample estimates:
odds ratio 
         0 

0 of the five cases in the treatment group were severe, vs 11 of the 90 in the placebo. This is certainly trending in the right direction, but is not even in the neighborhood of significance yet (Fisher's test p-value = 1).


Why RStudio Focuses on Code-Based Data Science


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Python in R shape

Michael Lippis of The Outlook podcast recently interviewed RStudio’s Lou Bajuk to discuss data science with R and Python, and why RStudio encourages its customers to adopt a multi-lingual data science approach. During the interview, Michael and Lou examined three main topics:

  1. RStudio’s mission to support open source data science
  2. How and why RStudio supports R and Python within its products
  3. How business leaders are delivering value from data science investments

I’ve extracted the most interesting parts of the podcast interview below and edited the quotes for clarity and length. You can listen to the entire interview here.


RStudio’s Mission To Benefit Open Source Data Science

Mike:What has been the focus of RStudio since its inception?
Lou:From the beginning, our primary purpose has been to create free and open source software for data science, scientific research, and technical communication. We do this because free and open source software enhances the production and consumption of knowledge and really facilitates collaboration and reproducible research, not only in science, but in education and industry as well. To support this, we spend over half of our engineering resources developing free and open-source software as well as providing extensive support to the open-source data science community.
Mike:How does RStudio help organizations make sense of data regardless of their ability to pay?
Lou:We do this as part of our primary mission around supporting open source data science software. It allows anyone with access to a computer to participate freely in a global economy that really rewards and demands data literacy. So the core of our offerings which enables everyone to do data science is, and will always be, free and open source.
However for those organizations that want to take the data science that they do in R and Python and deploy it at scale, our professional products provide an enterprise-ready modular platform to help them do that. This platform addresses the security, scalability, and other enterprise requirements organizations need to allow their team to deploy their work, collaborate within their team, and communicate with the decision makers that they ultimately support.
Mike:So what is RStudio’s public benefit?
Lou:We announced in January that we’re now registered as a Public Benefit Corporation (PBC) in Delaware. We believe that corporations should strive to fulfill a public beneficial purpose and that they should be run for the benefit of all of our stakeholders. And this is something that’s really a critical part of our founder and CEO JJ Allaire’s philosophy.
Our stated public benefit is to create open source software for scientific and technical computing, which means that the open source mission we’ve been talking about is codified into our corporate charter. And as a PBC, we are committed to considering the needs of not only our shareholders, but all our stakeholders including our community, our customers and employees.
And as part of this, we’re now also a Certified B Corporation®, which means we’ve met the certification requirements set out by the nonprofit B Lab®. That means that we’ve met the highest verified standards of things like social and environmental performance, transparency, and accountability.

Multilingual Data Science

The interview continued with Lou diving into why RStudio has committed to supporting both R and Python within its products.

Mike:Why are R and Python in RStudio, and what challenges are you addressing?
Lou:In talking to our many customers and others in the data science field, we’ve seen that many data science teams today are bilingual, leveraging both R and Python in their work. And while both languages have unique strengths, these teams frequently struggle to use them together.
So for example, a data scientist might find themselves constantly needing to switch contexts between multiple development environments. The leader of a data science team might be wrestling with how to share results from their team consistently, so they can deliver value to the larger organization while promoting collaboration between the R and Python users on their team. The DevOps and IT admins spend time and resources attempting to maintain, manage, and scale separate environments for R and Python in a cost-effective way.
To help data science teams and the organizations they’re in solve these challenges, and in line with our ongoing mission to support the open source data science ecosystem, we’ve focused our professional products on providing a single centralized infrastructure for bilingual teams using R and Python.
Mike:Is it possible for a data scientist to use R and Python in a single project?
Lou:Absolutely. And there are multiple ways that can be done. One of the most popular is an open source package called reticulate that we’ve developed, which is available to anyone using R. It provides a comprehensive set of tools for interoperability between Python and R (a short sketch follows the list below), including things like:
  • Calling Python from R in a variety of ways, whether you’re doing something with R Markdown, importing Python modules, or using Python interactively within an R session.
  • Translating data objects between R and Python
  • Binding to different versions of Python, including virtual and Conda environments.
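For instance, a minimal sketch of calling Python from an R session with reticulate might look like this (it assumes numpy is installed in the selected conda environment):

library(reticulate)

# Bind to a specific conda environment before Python is initialized
use_condaenv("r-reticulate", required = TRUE)

# Import a Python module and call it directly from R
np <- import("numpy")
np$mean(c(1, 2, 3, 4))         # returns 2.5 to R

# Run arbitrary Python code and pull the result back into R
py_run_string("squares = [i ** 2 for i in range(5)]")
py$squares                     # 0 1 4 9 16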
Mike:What about data scientists whose primary language is Python? What does RStudio provide for them?
Lou:First off, we’ve been working on making the RStudio IDE a better environment for Python coding. In addition to the reticulate package we just discussed, we’ve just announced some new features in the upcoming release of our IDE, RStudio 1.4. This includes displaying Python objects in the environment pane, viewing Python data frames, and tools for configuring Python versions and different Conda virtual environments. All this is going to make life easier for someone who wants to code Python within the RStudio IDE.
Secondly, for a team where you might have multiple different data scientists who have different preferences for what IDEs they want to use, our pro platform provides a centralized workbench supporting multiple different development environments. In addition to our own IDE we support Jupyter notebooks and Jupyter Lab as development environments, and we’re working on more options for the near future. This includes Visual Studio Code, which we’re going to be announcing a beta of very shortly.
And finally with our platform, Python-oriented data scientists can create data products and interactive web applications in their framework of choice, such as Plotly, Streamlit, or Bokeh, and then directly share those analyses with their stakeholders.
We believe this ability for Python data scientists to share their results in a single place, alongside the data products created by the R data scientists, is critical to actually impacting decision-making at an organization.
Mike:How can data science leaders promote collaboration across a bilingual team?
Lou:Data science leaders often see their teams struggle to collaborate and share work across disparate, open source tools. Often they waste time translating code from one language to another to put it into production. These activities really distract them from their core work. And as a result, their business stakeholders are less likely to see results or must wait longer for them.
With RStudio products, a bilingual team can work together, building off each other’s work. Best of all, it can publish, schedule, and email regular updates for interactive analyses and custom reports built in both languages. So you, the data science team, and your stakeholders will always know where to look for these valuable insights.
Mike:Has RStudio done anything to help DevOps engineers and IT administrators deal with the difficulty of maintaining separate environments for each data science language?
Lou:Absolutely. DevOps and IT are critical stakeholders in the whole process of doing data science effectively in your organization. So with RStudio products, DevOps and IT can maintain a single infrastructure for provisioning, scaling, and managing environments for both R and Python users. This means that IT only needs to configure, maintain, and secure a single system.
A single system also makes it easy for IT to leverage their existing automation tools and other analytic investments and provide data scientists with transparent access to their servers or Kubernetes or SLURM clusters, directly from the development tools those data scientists prefer. They can easily configure all the critical capabilities around access, monitoring, and environment management. And of course RStudio’s Support, Customer Success, and Solutions Engineering teams are here to help and advise these teams as they scale out their applications.
Mike:How do business stakeholders view bilingual data science teams?
Lou:Ultimately most decision makers really don’t care what language a data science insight was created in. They just want to be able to trust the information and use it to make the right decision. That’s why we’re so focused on making it easy for data scientists to create these data products, regardless of whether they’re R or Python, and then easily share them with their different stakeholders.

Delivering Value from Data Science Investments

Mike and Lou wrapped up by discussing how businesses can improve the value they derive from their data science.

Mike:What would you say to business leaders that are worried about the value of their data science investment?
Lou:One of the big challenges that organizations face with data science is not just how they solve today’s problems, but how they ensure that they continue to deliver value over time. Too many organizations find themselves either struggling to maintain the value of their legacy systems, reinventing the wheel year after year, or being held over a barrel by vendor lock-in.
To address this, we recommend a few approaches:
One is to build your analyses with code, not clicks. Data science teams should use a code-oriented approach because code can be developed and adapted to solve similar problems in the future. This reusable and extensible code then becomes the core intellectual property for an organization. It’ll make it easier over time to solve new problems and to increase the aggregate value of your data science work. This is why code-first data science is really a critical part of RStudio’s philosophy and our roadmap.
The second major approach is to manage your data science environments for reproducibility. Organizations need a way to reproduce reports and dashboards as projects, tools, and dependencies change. You’ll often hear about repeatability and reproducibility when talking about a heavily regulated environment like pharmaceuticals, and it’s certainly particularly critical there. However, it’s critical for every industry; otherwise your team may spend far too much time attempting to recreate old results. Worse, you may get different answers to the same questions at different points in time, which really undermines your team’s credibility.
And third, deploy tools and interactive applications to keep insights up to date, because no one wants to make a decision based on old data. Publishing your insights on web-based interactive tools such as the RStudio Connect platform helps keep your business stakeholders up-to-date and gives them on demand access and scheduled updates. By deploying insights in this way, your data scientists are free to spend their time solving new problems rather than solving the same problem again and again.
Mike:Has RStudio done anything to empower business stakeholders with better decision-making?
Lou:This is really a key focus of ours. Many data science vendors out there focus on creating models and then putting these models into "production", which typically means integrating these models into some system for automated decision-making. For example, a model might determine what marketing offer to present to someone who visits a website.
Though our products certainly support this through the ability to deploy R and Python models as APIs to plug into other systems, our focus is broader. We want to make it easy for a data science team to create tailored reports, dashboards, and interactive web-based applications, using frameworks like Shiny that they can then easily and iteratively share with their decision makers. This iterative and interactive aspect is critical because decision-makers will invariably come back with questions like “What if you run this analysis on a different time period?” or “What if this parameter is different?”.
Interactive applications give these decision-makers tremendous flexibility to answer their own “what if?” questions. When it’s easy for the data scientist to create a new version, tweak the code, and redeploy it, it’s also more convenient for the decision maker. It allows them to get a timely answer that’s really super focused on what they actually need as opposed to a generic report.
We call these reports tailored or curated because of their flexibility. Open source data science means that these teams can provide their stakeholders with exactly the information they need in the best format for presenting that information rather than being constrained by the black box limitations of a BI reporting tool.
Mike:Can you provide the Outlook series audience with an overview of RStudio Team?
Lou:RStudio Team is a bundle of our professional software for data analysis, package management, and data product sharing. The Team product is a way of getting all three products, but each of these products can also be purchased individually to fit into and complement an organization’s existing data science investments.
The first component is RStudio Server Pro, which provides a centralized workbench for analyzing data and then developing and sharing new data products and interactive applications. This is the platform where the data scientists develop their insights.
Secondly, RStudio Connect is a centralized portal for distributing the dashboards, reports, and applications created by the data scientists, whether they're written in R or Python. This includes the ability to schedule and send email reports to your community of users and to provide all the access control, scalability, and reproducibility that a modern enterprise really needs.
Thirdly, RStudio Package Manager supports both the development side (RStudio Server Pro) and the deployment side (RStudio Connect) by managing the wealth of open source data science packages you might need to create and run these analyses. Open source data science hosts a world of people creating these great packages on the cutting edge of statistics and data science but managing these packages over time can be very difficult. RStudio Package Manager makes maintenance and reproducibility much easier.
Mike:All right, Lou. So, can you share a use case with our audience?
Lou:We have a ton of great customer stories at rstudio.com, but one of my favorites is Redfin. Redfin is a technology-powered real estate brokerage that serves more than 90 metropolitan areas across the U.S. and Canada. Now when Redfin was smaller, they used to do a lot of planning using basic data models implemented in spreadsheets and gathering input from emails or files saved in Google drive.
But Redfin wanted to get better, smarter answers. They wanted to make these models more complex and to scale them to handle the increasing scope of the business. And they found that spreadsheets just wouldn't work anymore. They weren't able to apply the more statistical approaches for forecasting that they wanted, and maintaining the formulas in the spreadsheets was error-prone and slow. Plus, the amount of time that it took to consolidate user input into these spreadsheets limited how many iterations of their models they could run. These workbooks would be painfully slow, sometimes taking 10 or more minutes to open up and use. Sometimes they would crash, leaving people unable to use them at all.
Redfin used RStudio products to move their data models from spreadsheets to a much more reproducible and scalable data science environment. They saw our products as a way to replicate the interactivity that users loved in spreadsheets, but host all this on a server that was easy to access and maintain. This approach allowed them to build in all those complex statistical approaches that they wanted while still keeping the end interface simple for the end users.
Mike:All right, Lou. Where can the audience get more information on RStudio’s solutions?
Lou:All the information is available on our website at rstudio.com and there we talk about our products. We also make it easy to either download our products and try them out, or set up a call with our great sales team to help provide some guidance and answer any questions you have. I also encourage your listeners to follow the RStudio blog at blog.rstudio.com, where we write about many of the themes I talked about today as well as share updates on our products and our company.

For More Information

If you’d like to learn more about some of the topics discussed in this interview, we recommend exploring:


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


The post Why RStudio Focuses on Code-Based Data Science first appeared on R-bloggers.

How to Catch a Thief: Unmasking Madoff’s Ponzi Scheme with Benford’s Law


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One of my starting points into quantitative finance was Bernie Madoff's fund. Back then, because Bernie was in desperate need of money to keep his Ponzi scheme running, there existed several so-called feeder funds.

One of them happened to approach me to offer me a once in a lifetime investment opportunity. Or so it seemed. Now, there is this old saying that when something seems too good to be true it probably is. If you want to learn what Benford’s law is and how to apply it to uncover fraud, read on!

Here are Bernie’s monthly returns (you can find them here: madoff_returns.csv):

madoff_returns <- read.csv("Data/madoff_returns.csv")
equity_curve <- cumprod(c(100, (1 + madoff_returns$Return)))
plot(equity_curve, main = "Bernie's equity curve", ylab = "$", type = "l")

An equity curve with annual returns of over 10% as if drawn with a ruler! Wow… and Double Wow! What a hell of a fund manager!

I set off to understand how Bernie accomplished those high and especially extraordinarily stable returns. And found: nothing! I literally rebuilt his purported split-strike strategy and backtested it; of course it didn't work. And therefore I didn't invest with him. A wise decision, as history proved. And yet, I learned so much along the way, especially about trading and options strategies.

A very good and detailed account of the Madoff fraud can be read in the excellent book "No One Would Listen: A True Financial Thriller" by whistleblower Harry Markopolos, who was on Bernie's heels for many years but, as the title says, no one would listen… The reason is some variant of the above wisdom "What seems too good…": people told him that Bernie could not be a fraud because his fund was so big and other people would have realized that!

One of the red flags that those returns were made up could have been raised by applying Benford’s law. It states that the frequency of the leading digits of many real-world data sets follows a very distinct pattern:

theory <- log10(2:9) - log10(1:8)
theory <- round(c(theory, 1-sum(theory)), 3)
data.frame(theory)
##   theory
## 1  0.301
## 2  0.176
## 3  0.125
## 4  0.097
## 5  0.079
## 6  0.067
## 7  0.058
## 8  0.051
## 9  0.046

The discovery of Benford’s law goes back to 1881 when the astronomer Simon Newcomb noticed that in logarithm tables the earlier pages were much more worn than the other pages. It was re-discovered in 1938 by the physicist Frank Benford and subsequently named after him. Thereby it is just another instance of Stigler’s law which states that no scientific discovery is named after its original discoverer (Stigler’s law is by the way another instance of Stigler’s law because the idea goes back at least as far as to Mark Twain).

The following analysis is inspired by the great book "Analytics Stories" by my colleague Professor em. Wayne L. Winston from the Kelley School of Business at Indiana University. Professor Winston gives an insightful explanation of why Benford's law holds for many real-world data sets:

Many quantities (such as population and a company’s sales revenue) grow by a similar percentage (say, 10%) each year. If this is the case, and the first digit is a 1, it may take several years to get to a first digit of 2. If your first digit is 8 or 9, however, growing at 10% will quickly send you back to a first digit of 1. This explains why smaller first digits are more likely than larger first digits.

We are going to simulate a growth process by sampling some random numbers as a starting value and a growth rate and letting it grow a few hundred times, each time extracting the first digit of the resulting number, tallying everything up, and comparing it to the above distribution at the end:

# needs dataframe with actual and theoretic distribution
plot_benford <- function(benford) {
  colours = c("red", "blue")
  bars <- t(benford)
  colnames(bars) <- 1:9
  barplot(bars, main = "Frequency analysis of first digits", xlab = "Digits", ylab = "Frequency", beside = TRUE, col = colours, ylim = c(0, max(benford) * 1.2))
  legend('topright', fill = colours, legend = c("Actual", "Theory"))
}

set.seed(123)
start <- sample(1:9000000, 1)
growth <- 1 + sample(1:50, 1) / 100
n <- 500
sim <- cumprod(c(start, rep(growth, (n-1)))) # vectorize recursive simulation
first_digit <- as.numeric(substr(sim, 1, 1))
actual <- as.vector(table(first_digit) / n)
benford_sim <- data.frame(actual, theory)
benford_sim
##   actual theory
## 1  0.300  0.301
## 2  0.174  0.176
## 3  0.126  0.125
## 4  0.098  0.097
## 5  0.078  0.079
## 6  0.068  0.067
## 7  0.058  0.058
## 8  0.050  0.051
## 9  0.048  0.046
plot_benford(benford_sim)

We can see a nearly perfect fit!

We are now doing the same kind of analysis with Bernie’s made-up returns:

first_digit <- as.numeric(substr(abs(madoff_returns$Return * 10000), 1, 1))
actual <- round(as.vector(table(first_digit) / length(first_digit)), 3)
madoff <- data.frame(actual = actual[2:10], theory)
madoff
##   actual theory
## 1  0.391  0.301
## 2  0.135  0.176
## 3  0.093  0.125
## 4  0.060  0.097
## 5  0.051  0.079
## 6  0.079  0.067
## 7  0.065  0.058
## 8  0.070  0.051
## 9  0.051  0.046
plot_benford(madoff)

Just by inspection, we can see that something doesn’t seem to be quite right. This is of course no proof but another indication that something could be amiss.
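If you want to put a number on that visual impression, a chi-squared goodness-of-fit test is one quick (if rough) way to do it. This check is my own addition, not part of the original analysis; it reconstructs approximate first-digit counts from the relative frequencies computed above:

n_obs <- length(madoff_returns$Return)          # number of monthly returns
observed_counts <- round(madoff$actual * n_obs) # approximate counts for digits 1-9
# rescale.p = TRUE guards against rounding in the tabulated proportions
chisq.test(observed_counts, p = madoff$theory, rescale.p = TRUE)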

Benford’s law has become one of the standard methods used for fraud detection, forensic analytics and forensic accounting (also called forensic accountancy or financial forensics). There are several R packages with which you can finetune the above analysis, yet the principle stays the same. Because this has become common knowledge many more sophisticated fraudsters tailor their numbers according to Benford’s law so that it may become an instance of yet another law: Goodhart’s law:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

Let us hope that this law doesn’t lead to more lawlessness!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


The post How to Catch a Thief: Unmasking Madoff’s Ponzi Scheme with Benford’s Law first appeared on R-bloggers.

Buy your RStudio products from eoda – Get a free application training


[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In addition to its open source tools, RStudio offers professional software solutions for the use of data science in companies. As maintainer of the leading R development environment, package developer, and provider of solutions for the professional use of R, RStudio is one of the pioneers of bringing R into the enterprise environment. As a full-service certified partner, we are your RStudio contact in the German-speaking world.

Everything from a single source: we support you from consulting, through purchasing (including purchasing in euros), to the seamless integration of RStudio products into your company's production systems. More about our products and services as a partner can be found here.

OUR OFFER FOR YOU:

Buy your RStudio products from eoda by 31.12.2020 and secure a free application training worth 549 euros on top!

Contact us now.

 


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – eoda GmbH.


The post Buy your RStudio products from eoda – Get a free application training first appeared on R-bloggers.


fulltext: Behind the Scenes


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

fulltext is a package I maintain for text-mining the scholarly literature (package docs). You can search for articles, fetch article metadata and abstracts, and fetch full text of some articles. Text-mining the scholarly literature is a research tool used across disciplines. Full text of articles (entire article, not just the abstract) is the gold standard in text-mining in most cases.
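To make the rest of this post concrete, here is roughly what that workflow looks like from the user's side. This is a sketch based on my reading of the package documentation; function signatures and the shape of the returned objects may differ between versions:

library(fulltext)

res <- ft_search(query = "photosynthesis", from = "plos", limit = 5)  # search for articles
dois <- res$plos$data$id        # assumed: the PLOS results carry DOIs in an "id" column
articles <- ft_get(dois)        # try to fetch full text for those DOIs
ft_collect(articles)            # pull the retrieved text into the R session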

Over the past years the fulltext package has evolved under the hood in its approach to attempting to get full text articles for its users. The following is a walk through of the various iterations that fulltext has gone through for fetching full text of articles. I think it serves as a good demonstration of the complexity and frustration baked into the publishing industry, as well as the trade-offs of various approaches to solving problems and getting things done.

Crossref has the data

The first part of the journey was years ago. We started off using the Crossref API to get links to full text versions of articles. This worked in many cases, allowing us to directly download full text versions of articles for fulltext users.

However, metadata is populated by publishers that are Crossref members, and the “links” metadata is optional (i.e. links to full text articles). Furthermore, the “links” metadata may be completely out of date. Given the opportunity to not add links, many publishers do not, and many publishers do not update links once deposited. This leads to many missing links and to errors in existing ones.

Some bad links: a new approach

Given the problems with Crossref links to full text, I decided to work on another solution. I decided to make a web API of my own. The API was available at ftdoi.org, and all it did was accept a DOI for an article, then look up the Crossref publisher member ID, then use rules I maintained for figuring out full text links to articles per publisher or journal, and return links to full text.

This solution worked pretty well, and had the added benefit that I could look at the API logs to see which publishers or DOIs users were most interested in – then I could work on making new mappings for those publishers/journals.

However, essentially the only ftdoi.org API users were people using the fulltext package in R. That is, there were very few additional users beyond those using the fulltext package to consider for the API – which begs the question: is the API worth maintaining given the cost (paying for a cloud server) and time (maintaining the code for the API and the server running the API)?

In addition, because of the ftdoi.org API, fulltext users were waiting for an extra http request that could fail or be slow if their internet connection was bad/slow or if I had a problem on the server for the ftdoi.org API. That is, if what the ftdoi.org API was doing could be done locally, an http request could be avoided.

Simplify and barriers

I decided to retire the ftdoi.org API and do the mapping (DOI to Crossref member to full text link) inside of R. The first attempt at this was to implement the mapping in a separate package: ftdoi. I was about to submit to CRAN but then remembered that new package submissions to CRAN can take a very, very long time, with no upper limit. Given that I wanted to wrap up the change away from the web API to the R side rather quickly, I decided to pivot away from a separate package. Instead of a separate package, I simply moved the code I had in ftdoi to fulltext package. Then submitted a new version of fulltext to CRAN – and done.

Submitting a separate package really was the right decision from a software perspective as it was a distinct set of code with a solid use case. However, given the unknown and possibly very long acceptance time on CRAN, folding the code into a package where I only had to submit a new version made more sense. Luckily new versions to CRAN are partly automated, so things go more smoothly and quickly. I definitely regret bloating the codebase of the fulltext package, but from a “getting things done” perspective it just made more sense.

Onward

Moving forward there will be improvements in fetching full text of articles in the fulltext package as we make mappings on a publisher and/or journal basis. Unfortunately these improvements require new versions of fulltext to get to CRAN. When we used the ftdoi.org API users could benefit from new journal/publisher mappings as soon as the API was updated, which is very fast – but the addition of new mappings will take longer now assuming users only install CRAN versions.

Last, and somewhat unrelated to the discussion above, the Crossref “click-through” text and data mining (TDM) service is going away at the end of this year. If you use this service, pay attention to ropensci/fulltext#224.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


The post fulltext: Behind the Scenes first appeared on R-bloggers.

Detect Relationships With Linear Regression (10 Must-Know Tidyverse Functions #4)


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This article is part of R-Tips Weekly, a weekly video tutorial series that shows you, step by step, how to do common R coding tasks.

Group Split and Map are SECRET TOOLS in my data science arsenal. Combining them will help us scale up to 15 linear regression summaries to assess relationship strength & combine in a GT table. Here are the links to get set up 👇

(Click image to play tutorial)

My secret weapon, group_split(), is SERIOUSLY POWERFUL.

In fact, I use group_split() almost every day. I use it to convert data frames to iterable lists:

  • Shiny Apps (making iterable cards)
  • Modeling (Regression by Sub-Groups)
  • Doing complex group-wise calculations – things you can’t do with group_by()

Let’s check group_split() out. With 3 lines of code, we turn an ordinary data frame into an iterable.

Before: a boring old data frame.

After: now we have a list of data frames (i.e., an iterable).
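Here is a tiny illustration of that before/after step, using the built-in mtcars data rather than the dataset from the video:

library(dplyr)

mtcars %>% group_split(cyl)
# returns a list of three data frames, one per value of cyl -- an iterable we can map over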


Modeling with Broom

So what can we do with this “iterable”?

How about detecting relationships with a linear regression model using broom's glance() function!

And with a little extra work (thanks to Thomas Mock @rstudio & the gt R package), we can create this INSANE TABLE! 💥💥💥
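If you want to sketch the pattern yourself before watching the video, something like the following captures the idea. It is my own minimal example on mtcars, not the exact code from the tutorial; piping the result into gt::gt() would give the polished table.

library(dplyr)
library(purrr)
library(broom)

mtcars %>%
    group_split(cyl) %>%                      # one data frame per cylinder group
    map_dfr(function(d) {
        lm(mpg ~ wt, data = d) %>%            # fit a linear regression per group
            glance() %>%                      # one-row summary of fit quality
            mutate(cyl = unique(d$cyl), .before = 1)
    })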

That was ridiculously easy.


But you’re NOT a Wizard yet!

Here’s how to master R programming & save the world Harry Potter Style. 👇


…And the look on your boss’ face after seeing your first Shiny App. 👇


This is career acceleration.

SETUP R-TIPS WEEKLY PROJECT

  1. Sign Up to Get the R-Tips Weekly (You’ll get email notifications of NEW R-Tips as they are released): https://mailchi.mp/business-science/r-tips-newsletter

  2. Set Up the GitHub Repo: https://github.com/business-science/free_r_tips

  3. Check out the setup video (https://youtu.be/F7aYV0RPyD0). Or, Hit Pull in the Git Menu to get the R-Tips Code

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


The post Detect Relationships With Linear Regression (10 Must-Know Tidyverse Functions #4) first appeared on R-bloggers.

Measurement errors and dimensional analysis in R


[This article was first published on R on solarchemist.se, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I wrote this blog post in preparation for a hands-on web presentation I gave on Nov 13, 2020. It’s published mostly as-is in the hopes it might benefit others. Comments or questions on the text or regarding the code are most welcome!

Today I would like to demonstrate how to get automatic error propagation for your data analysis in R using the errors package by Iñaki Ucar et al.[1] I would also like to demonstrate how to combine that with dimensional analysis (with automatic unit conversion) using the units package by Pebesma et al.[2] This is usually known as quantity calculus.

To make it easy for you, dear reader, to follow along, we will take it from the start and assume all you have is a Windows 10 computer. If you work on MacOS1 or Linux, you can skip right ahead to “Working with measurement errors” (but you may have to adapt some of the steps, as they were written with Windows 10 in mind).

Most of the fundamental tools used for reproducible scientific work, such as R, knitr, git, Python, and not to be forgotten, the whole chain of document generation tools such as LaTeX, Markdown, and pandoc, “just work” on any Linux distribution, but can be a bit of a hassle on Windows.

It is not hard to see why that is. All these tools are available for free and with libre software licences. Windows is neither free nor libre, and has a long history of working against this academic model of software sharing. These days, Microsoft, Google and other tech behemoths embrace open source software, simply because it allows them to monetise work without paying for it. Although, to be fair, they are also contributing back (to varying degrees) to this free/libre pool of software. For an academic, open source tooling offers many tangible and intangible benefits, prime among them by making the entire work reproducible, and effectively future-proof.

So, let’s get to it. We will make use of the fact that Microsoft has added something called the Windows subsystem for Linux (WSL) on Windows 10, which allows you to run a (nearly feature-complete) Linux shell. In fact, recently Microsoft released WSL version 2 (WSL2), which attempts to bring the Linux shell natively to the Windows 10 desktop.

In order to work with R, the next sections will show you how to make use of WSL2 to install Ubuntu 20.04, and then we will walk through installing R and RStudio Server using WSL2 Ubuntu. You will then be able to run R (in RStudio) from a browser window on your Windows desktop. Pretty cool, right?

With that in place, we will explore how to use R and the suite of r-quantities packages to seamlessly manipulate vectors that have associated measurement uncertainties and units through the steps of your data analysis, by way of an example using a subset of data from a real experiment in the Edvinsson lab.

Install WSL2 on Windows 10

In the near future, installing Ubuntu on Windows will be as simple as wsl --install (it’s in the development version of Windows 10), but for now, we have to follow the steps below to achieve the same.

Open Powershell as administrator, then enable WSL1:

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

To enable WSL2, we must first enable the Windows virtual machine (VM) feature:

dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Now you have to restart Windows.

Next, download the WSL2 Linux kernel update package for x64 (or ARM64, depending on your machine), and install it.

Then open Powershell as administrator again, and set WSL2 as default for all future VMs:

wsl --set-default-version 2

Now it is time to install our Linux distribution of choice (i.e., Ubuntu) from the Microsoft Store. Open the Microsoft Store, find Ubuntu, and install it. (And also, by the way, the Microsoft Store is awful. Such nasty underhanded nudging by Microsoft to create a wholly unnecessary “account”).

Anyway, you just installed Ubuntu on your Windows machine, congratulations! Now you get to create your Linux username and password: simply start the app Ubuntu from the Start menu.

Parenthesis: if your Windows 10 is itself a virtual machine

This is unlikely to apply to 99% of my readers, so just skip ahead to the next section.

Who does this section apply to? Well, if you, like me, run Windows 10 as a VM on top of your regular OS, you will need to make sure that a) your host machine (layer 0) supports nested virtualisation, and b) that your hypervisor has nested virtualisation enabled.

Details on how to do this for the KVM hypervisor on a Linux host are given in footnote 2.

Parenthesis: managing and rebooting your WSL2 virtual machines

Closing the Ubuntu shell in Windows does not restart the Ubuntu VM, and sometimes it can be useful to effect a reboot. So how do we do that?

It’s not obvious, but the wsl executable has two useful options,

wsl --list --verbose
wsl --terminate Ubuntu-20.04

that allow you to list/view all installed VMs (running or not); you then simply --terminate to shut down, and wsl -d Ubuntu-20.04 to start again. Starting the VM can also be achieved from the app link in the Start menu, assuming the VM has one.

Install R and RStudio server on WSL2 Ubuntu

With a complete Ubuntu shell in place, we can now go about our business and install R. In your Ubuntu shell, add the public key of the CRAN repository, then add the Ubuntu 20.04 CRAN repo of R version 4.x to your package manager’s list of sources:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
echo "deb https://cran.uni-muenster.de/bin/linux/ubuntu focal-cran40/" | sudo tee /etc/apt/sources.list.d/r-project.list

Note that you can change the domain to any one of the CRAN mirrors if you want. At this point, remember to tell apt about the new source by:

sudo apt update

Install R and its dependencies using the aptitude package manager (we will also install libudunits2-dev, which is a C library and utility for handling physical units that the R package units depends on).

sudo apt install r-base r-base-core r-recommended r-base-dev gdebi-core build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev libudunits2-dev

This installs about 400 MB of software in a jiffy. That’s R!

Now download the latest RStudio Server package file for Ubuntu 18+ and install it:

wget https://rstudio.org/download/latest/stable/server/bionic/rstudio-server-latest-amd64.deb
sudo gdebi rstudio-server-latest-amd64.deb

Great! Now all you have to do is launch your RStudio Server:

sudo rstudio-server start

With that, you can open your browser (Chromium/Chrome recommended) at http://localhost:8787 and enjoy your very own instance of RStudio Server. (You sign in with the same username and password as in your Ubuntu Linux VM).

And with that, all the preparatory steps are done.3 We can now start working in our data analysis environment in R.

Working with quantities and measurement errors

The quantities package (which ties together the capabilities of the units and errors packages) has an excellent vignette on quantity calculus in R.

The errors package is available on CRAN, and its git repo is published on Github. The package is also presented by its authors in an R Journal article from 2018, and there is also an informative blog post by the main author for the very first release of the package.

The units package handles physical quantities (values with associated units) in R, and builds on UNIDATA’s udunits-2 C library of over 3000 supported units, making it a very robust implementation while avoiding the sin of re-inventing the wheel. It does lead to a dependency on the system-level package udunits2 however, making it very hard to install and work with units on Windows.

I think that these packages have finally brought R up to par with Python with regards to quantity calculus, which is a crucial aspect for any scientist or engineer handling physical measurement values. Although it should be noted that the uncertainties package in Python is more mature (v1.0 was released in 2010, whereas the errors package in R still has not achieved that milestone), the errors package is very usable; indeed, I used it for practically all the analysis steps of the latest paper I co-authored (I even managed to cite the package). [3]
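Before we get to the real data, here is a minimal taste of what the two packages do. This is my own toy example; it assumes errors and units are already installed, which we take care of properly in the next section.

library(errors)
library(units)

x <- set_errors(c(1.02, 1.05, 0.98), 0.02)  # values with standard errors attached
mean(x)                                      # the error of the mean is propagated automatically

l <- set_units(2.54, cm)
set_units(l, mm)                             # automatic unit conversion to millimetres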

Let’s take the errors and units packages out for a spin! Since this is a clean R installation, first we will need to install these packages and any other packages we would like to have available.

Install necessary R packages

Let’s install all the r-quantities packages, and a subset of the tidyverse packages (feel free to install any other packages you may like).

install.packages(c("errors", "units", "constants", "quantities"))
install.packages(c("dplyr", "magrittr", "ggplot2", "ggforce", "tibble"))
install.packages("remotes")
remotes::install_github("solarchemist/R-common")
remotes::install_github("solarchemist/oceanoptics")

(This operation may take a minute).

Now let’s load all the packages we will need and get to work on some real-life data.

library(errors)
library(constants)
library(units)
library(quantities)
library(common)
library(oceanoptics)
library(dplyr)
library(magrittr)
library(ggplot2)
library(ggforce)

Back to work

For the sake of keeping this post short, I will spare you the legwork of reading the individual spectra files into a dataframe. I have done that and saved the resultant dataframe to an rda-file which has been uploaded to a publicly accessible URL. Let’s import the data to a variable, and have a quick look at the dataframe:

# read the MB abs spectra from the public URL
df.MB <-
   common::LoadRData2Variable(url="https://public.solarchemist.se/data/uvvis-abs-MB-conc-series.rda")
df.MB %>% glimpse()
## Rows: 28,672
## Columns: 13
## $ sampleid            "H02AB", "H02AB", "H02AB", "H02AB", "H02AB", "H02AB", …
## $ range               "001", "001", "001", "001", "001", "001", "001", "001"…
## $ wavelength          188.76, 189.23, 189.70, 190.17, 190.64, 191.12, 191.59…
## $ intensity           0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000…
## $ IntegrationTime     "1000", "1000", "1000", "1000", "1000", "1000", "1000"…
## $ n_Averaged          "10", "10", "10", "10", "10", "10", "10", "10", "10", …
## $ Boxcar              "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ CorrElectricDark    "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ StrobeLampEnabled   "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ CorrDetectorNonLin  "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ CorrStrayLight      "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ n_Pixels            "2048", "2048", "2048", "2048", "2048", "2048", "2048"…
## $ conc                50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50…

As you can see, it is raw data from the spectrometer, and it comes with neither measurement errors nor units. (Unfortunately, it is still all-too-common for lab instruments to export data without such attributes).

It is perfectly acceptable to add estimated standard errors to this raw data (we will add units at the same time using the set_quantities() function of the quantities package).

# specify unit and error for the measured concentration values
df.MB$conc <-
   set_quantities(
      df.MB$conc * 1e-6,
      "mol/L",
      # set relative error 2% (estimate)
      df.MB$conc * 1e-6 * 0.02)
# specify unit and error for wavelength values to instrumental resolution
df.MB$wavelength <-
   set_quantities(df.MB$wavelength, "nm", 1.5)
# specify error for absorbance values to 1% rel error (estimate)
df.MB$intensity <-
   # absorbance is unitless, or more precisely, has a unit of "1"
   set_quantities(df.MB$intensity, "1", df.MB$intensity * 0.01)
# add a column describing the solvent (H2O or H2O+EtOH)
df.MB$solvent <- "H2O"
df.MB$solvent[grep("+EtOH$", df.MB$sampleid)] <- "H2O+EtOH"

For plotting, we will use the package ggplot2, which is like a cousin to the dplyr data manipulation package. Both come with a sort of “coding grammar”, which is very comfortable to use once you get used to it. In my opinion it is much easier to express chains of manipulations using this grammar (you can find plenty of examples in the code below).

We should note that ggplotting two columns that both are units objects will fail with Error in Ops.units(): both operands of the expression should be "units" objects, due to the units package (rightfully) determining that they don’t have the same units (or units that can be converted into each other). The (very elegant) solution to this finicky behaviour is provided by the ggforce package. All we have to do is make sure we load library(ggforce), and then we can get back to using ggplot2 as usual (mostly as usual, you will notice that we now use a scale_*_unit() function instead of the usual scale_*_continuous()).

ggplot(
   df.MB %>%
      filter(wavelength %>% as.numeric() >= 250) %>%
      filter(wavelength %>% as.numeric() <= 850)) +
   facet_wrap(~solvent) +
   geom_errorbar(
      colour = "red",
      aes(x = wavelength,
          ymin = errors_min(intensity),
          ymax = errors_max(intensity))) +
   geom_line(aes(x = wavelength, y = intensity, group = sampleid)) +
   scale_x_unit(
      breaks = seq(300, 800, 100),
      expand = expansion(0.03, 0)) +
   scale_y_unit(expand = expansion(0.03, 0)) +
   labs(x = "Wavelength", y = "Abs") +
   theme(legend.position = "none")

Figure 1: Recorded UV-Vis spectra of MB in water and MB in water with 10 vol% EtOH (with the estimated measurement error of the absorbances shown as vertical error bars in red).

To check whether a unit already exists in the udunits2 library, I like to use the udunits2::ud.is.parseable() function. If you know of a better way, please let me know!
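For example (assuming the udunits2 package is installed; each call simply returns a logical):

udunits2::ud.is.parseable("mol/L")          # TRUE: a valid udunits2 unit string
udunits2::ud.is.parseable("gobbledygook")   # FALSE: not a recognised unit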

Now, as you have surely already noticed, this dataset is simply a series of UV-Vis absorbance spectra at varying concentrations (of methylene blue in water and water with some ethanol, in this case).

Here’s how the data looks at this point (after we set errors and/or units for some columns):

df.MB %>% glimpse()
## Rows: 28,672
## Columns: 14
## $ sampleid            "H02AB", "H02AB", "H02AB", "H02AB", "H02AB", "H02AB", …
## $ range               "001", "001", "001", "001", "001", "001", "001", "001"…
## $ wavelength          (err) /nm 189(2) /nm, 189(2) /nm, 190(2) /nm, 190(2) /nm, 19…
## $ intensity           (err) /1 0(0) /1, 0(0) /1, 0(0) /1, 0(0) /1, 0(0) /1, 0(0) /…
## $ IntegrationTime     "1000", "1000", "1000", "1000", "1000", "1000", "1000"…
## $ n_Averaged          "10", "10", "10", "10", "10", "10", "10", "10", "10", …
## $ Boxcar              "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ CorrElectricDark    "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ StrobeLampEnabled   "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ CorrDetectorNonLin  "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ CorrStrayLight      "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ n_Pixels            "2048", "2048", "2048", "2048", "2048", "2048", "2048"…
## $ conc                (err) /mol*L^-1 5.0(1)e-5 /mol*L^-1, 5.0(1)e-5 /mol*L^-1, 5.…
## $ solvent             "H2O", "H2O", "H2O", "H2O", "H2O", "H2O", "H2O", "H2O"…

Most columns are of type “character”, but note how $wavelength (and two other columns) say something else in that field. We can take a closer look at the structure of the dataframe using the str() function:

df.MB %>% str()
## 'data.frame':    28672 obs. of  14 variables:
##  $ sampleid          : chr  "H02AB" "H02AB" "H02AB" "H02AB" ...
##  $ range             : chr  "001" "001" "001" "001" ...
##  $ wavelength        : Units+Errors: /nm num  189(2) 189(2) 190(2) 190(2) 191(2) ...
##  $ intensity         : Units+Errors: /1 num  0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) ...
##  $ IntegrationTime   : chr  "1000" "1000" "1000" "1000" ...
##  $ n_Averaged        : chr  "10" "10" "10" "10" ...
##  $ Boxcar            : chr  "0" "0" "0" "0" ...
##  $ CorrElectricDark  : chr  "No" "No" "No" "No" ...
##  $ StrobeLampEnabled : chr  "No" "No" "No" "No" ...
##  $ CorrDetectorNonLin: chr  "No" "No" "No" "No" ...
##  $ CorrStrayLight    : chr  "No" "No" "No" "No" ...
##  $ n_Pixels          : chr  "2048" "2048" "2048" "2048" ...
##  $ conc              : Units+Errors: /mol*L^-1 num  5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 5.0(1)e-5 ...
##  $ solvent           : chr  "H2O" "H2O" "H2O" "H2O" ...

Having thus set suitably estimated (or perhaps even properly calculated) errors for the measured quantities wavelength, absorbance, and concentration, we will use this data to calculate the spectral absorption coefficient of MB in the two solvents we used.

As an introductory exercise, let’s define the wavelength of the strongest absorption band along with its unit and a synthetic standard error value:

wl.max.abs <- set_quantities(665, "nm", 5)

We can change the notation of the error from parenthesis to plus-minus when printing a quantity:

print(wl.max.abs, notation="plus-minus", digits=1)
## 665 ± 5 /nm

By default, errors uses parenthesis notation (which is more compact) and a single significant digit for errors.

We can also print more digits for the error part by setting digits. Both of these settings can also be set globally (for our entire R session) by using the options() function, like this:

options(errors.notation="plus-minus", errors.digits=2)

Before we get into calculating the spectral absorption coefficients, recall that the Beer-Lambert law

\[\begin{equation} A = \epsilon lc \;\Longleftrightarrow\; \epsilon = \frac{A}{lc} \end{equation}\]

includes the optical path length of the cuvette, \(l = 1\,\mathrm{cm}\) in our case. Let’s set that as a variable so we can use it when calculating \(\epsilon\):

# I don't know the tolerances of the cuvette dimensions, (quite the oversight, I know)
# but since these cuvettes are mass-produced I am assuming they are pretty tight and
# so estimating the error at 1/10 mm
optical.length <- set_quantities(1, "cm", 0.01)

With a series of UV-Vis absorbance spectra at varying concentrations, we can of course plot the linear relationship between \(A \propto c\) at a particular wavelength (such as the main absorbance peak), but for the sake of this exercise we will take the calculation one step further and calculate \(\epsilon(\lambda, A, c)\), i.e., the absorption coefficient for all wavelengths. It is not pretty code, but it gets the job done. Here goes:

# loop over each spectrum by solvent
solvents <- df.MB %>% pull(solvent) %>% unique()
# beware, unique() silently drops all quantity attributes
wl.steps <- df.MB %>% filter(sampleid == unique(df.MB$sampleid)[1]) %>% pull(wavelength)
# hold the results of the loop in an empty dataframe, pre-made in the right dimensions
# complication! quantities cols must use the same (or convertable) units from the start
abs.coeff <-
   data.frame(
      wavelength = rep(wl.steps, length(solvents)),
      solvent = "",
      # slope of linear fit
      k = rep(set_quantities(1, "L/mol", 1), length(solvents) * length(wl.steps)),
      # intercept of linear fit
      m = rep(set_quantities(1, "1", 1), length(solvents) * length(wl.steps)),
      # calculated abs coeff
      coeff = rep(set_quantities(1, "L*cm^-1*mol^-1", 1), length(solvents) * length(wl.steps)),
      rsq = NA,     # R-squared of linear fit
      adj.rsq = NA) # adj R-squared of linear fit
i <- 0
# Fair warning: this loop is horribly slow, and may take several minutes to complete!
for (s in 1:length(solvents)) {
   message("Solvent: ", solvents[s])
   # varying conc of a particular solvent
   for (w in 1:length(wl.steps)) {
      # to keep track of the current row in abs.coeff, we will use a counter i
      i <- i + 1
      message("[", i, "] ", solvents[s], " :: ", wl.steps[w], " nm")
      # temporary df which we will use to calculate linear fits (abs.coeff) for each wl.step
      intensity.by.wl <-
         df.MB %>%
         filter(solvent == unique(df.MB$solvent)[s]) %>%
         # even though we drop the errors when filtering, it does not affect the returned rows
         filter(as.numeric(wavelength) == unique(as.numeric(df.MB$wavelength))[w]) %>%
         select(sampleid, wavelength, intensity, conc, solvent)
      # now we can calculate abs coeff for this solvent at the current wavelength
      this.fit <-
         # now this is the really tricky one
         # https://github.com/r-quantities/quantities/issues/10
         qlm(data = intensity.by.wl, formula = intensity ~ conc)
      # save R-squared and adjusted R-squared
      abs.coeff$rsq[i] <- summary(this.fit)$r.squared
      abs.coeff$adj.rsq[i] <- summary(this.fit)$adj.r.squared
      # save coefficients of linear fit
      abs.coeff$m[i] <- coef(this.fit)[[1]]
      abs.coeff$k[i] <- coef(this.fit)[[2]]
      abs.coeff$coeff[i] <- coef(this.fit)[[2]] / optical.length
      # solvent
      abs.coeff$solvent[i] <- unique(intensity.by.wl$solvent)
   }
}
save(
   abs.coeff,
   file = "/media/bay/taha/sites/hugo/solarchemist/static/assets/data/uvvis-abscoeff-MB-conc-series.rda")
# read the calculated MB abs coeff from the public URL
abs.coeff <-
   LoadRData2Variable(url="https://public.solarchemist.se/data/uvvis-abscoeff-MB-conc-series.rda")

Well, that was quite a workout!4 But now that we have saved the calculated absorption coefficients to an rda-file on disk, we won’t have to repeat that calculation (for a while at least). But the proof is in eating the pudding, so let’s plot the absorption coefficients we got.

ggplot(
   abs.coeff %>%
      filter(wavelength %>% as.numeric() >= 250) %>%
      filter(wavelength %>% as.numeric() <= 850)) +
   geom_errorbar(
      colour = "red",
      aes(x = wavelength,
          ymin = errors_min(coeff),
          ymax = errors_max(coeff))) +
   geom_line(aes(x = wavelength, y = coeff, group = solvent)) +
   scale_x_unit(
      breaks = seq(300, 800, 100),
      expand = expansion(0.03, 0)) +
   scale_y_unit(expand = expansion(0.03, 0)) +
   labs(x = "Wavelength", y = "Abs. coeff.") +
   theme(legend.position = "none")

Figure 2: Calculated spectral absorption coefficient for MB in water and MB in water with 10 vol% EtOH. With the estimated standard error due to propagated measurement errors of the underlying quantities (red errorbars).

That looks good. Note how the vertical errorbars of the two curves (different solvents) overlap completely at certain spectral regions, notably at the main absorbance peak.

And not only did our entire analysis automatically propagate the measurement errors, we also managed to do the same for the units, performing automatic dimensional analysis at the same time (note how the units in the plot axis labels are automatically derived from the data).

That’s quantity analysis, in R, with real-life data! I hope you learned something new, and that this will inspire you to use quantity analysis for your future work.

But what about Python?

I feel it would be bad form to end this post without even mentioning Python’s support for error propagation and dimensional analysis, which far predates R’s.

Python has a whole array of packages to choose from for working with physical units, but as far as I can tell none of them build on the UNIDATA database.

However, the pint package appears to be the most popular by far (with 1200 stars on Github, it’s actually an order of magnitude more popular than all the related R packages), and it integrates with NumPy, the uncertainties package, and pandas.

Support for errors and error propagation has been around since 2010 courtesy of the uncertainties package by Eric O. Lebigot.

I am not aware of any package for integrating between the uncertainties and pint packages, but perhaps that is not necessary in Python. I’m not sure on that point; my knowledge of the Python ecosystem is a little lacking.

In any case, getting up and running with Python is fairly easy. In fact, Python 3.x comes pre-installed on the Ubuntu shell we just configured. You should be aware that you can just write and execute your Python code directly from RStudio Server (or R in general) using the reticulate package for R.

But if you prefer working with Python in its own IDE, you can actually install Jupyter Notebook and work from a browser, just like for RStudio Server. Just follow these steps:

In the WSL Ubuntu shell, run the following commands to install Jupyter Notebook:

sudo apt install python3-pip python3-dev
sudo -H pip3 install --upgrade pip
sudo -H pip3 install jupyter

then run jupyter notebook and copy-paste the provided URL into your browser.

Note: for serious work, you should look into installing JupyterLab and/or JupyterHub instead of Jupyter Notebook.

This template for a student lab report I co-authored way back when uses both uncertainties and numpy for its calculations, and may be of interest.

Wrapper function redefining lm() to support units/errors

# wrap lm() to drop quantities and get around error when calling summary(lm())
# https://www.r-spatial.org/r/2018/08/31/quantities-final.html
qlm <- function(formula, data, ...) {
   # get units info, then drop quantities
   row <- data[1, ]
   for (var in colnames(data)) if (inherits(data[[var]], "quantities")) {
      data[[var]] <- drop_quantities(data[[var]])
   }
   # fit linear model and add units info for later use
   fit <- lm(formula, data, ...)
   fit$units <- lapply(eval(attr(fit$terms, "variables"), row), units)
   class(fit) <- c("qlm", class(fit))
   fit
}

# get quantities attributes back for the coef method
# (which is the only class we care about in this demonstration)
coef.qlm <- function(object, ...) {
  # compute coefficients' units
  coef.units <- lapply(object$units, as_units)
  for (i in seq_len(length(coef.units) - 1) + 1)
    coef.units[[i]] <- coef.units[[1]] / coef.units[[i]]
  coef.units <- lapply(coef.units, units)
  # use units above and vcov diagonal to set quantities
  coef <- mapply(set_quantities, NextMethod(), coef.units,
                 sqrt(diag(vcov(object))), mode="symbolic", SIMPLIFY=FALSE)
  # use the rest of the vcov to set correlations
  p <- combn(names(coef), 2)
  for (i in seq_len(ncol(p)))
    covar(coef[[p[1, i]]], coef[[p[2, i]]]) <- vcov(object)[p[1, i], p[2, i]]
  coef
}

sessionInfo()

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] ggforce_0.3.2          ggplot2_3.3.2          magrittr_1.5
##  [4] dplyr_1.0.2            oceanoptics_0.0.0.9004 common_0.0.2
##  [7] quantities_0.1.5       units_0.6-7            constants_0.0.2
## [10] errors_0.3.4           knitr_1.29
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6     highr_0.8        compiler_3.6.2   pillar_1.4.6
##  [5] tools_3.6.2      digest_0.6.27    evaluate_0.14    lifecycle_0.2.0
##  [9] tibble_3.0.4     gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.8
## [13] cli_2.1.0        yaml_2.2.1       blogdown_0.20    xfun_0.15
## [17] withr_2.3.0      stringr_1.4.0    generics_0.1.0   vctrs_0.3.4
## [21] grid_3.6.2       tidyselect_1.1.0 glue_1.4.2       R6_2.5.0
## [25] fansi_0.4.1      rmarkdown_2.3    bookdown_0.20    polyclip_1.10-0
## [29] farver_2.0.3     purrr_0.3.4      tweenr_1.0.1     scales_1.1.1
## [33] htmltools_0.5.0  ellipsis_0.3.1   MASS_7.3-51.6    assertthat_0.2.1
## [37] colorspace_1.4-1 labeling_0.4.2   utf8_1.1.4       stringi_1.4.6
## [41] munsell_0.5.0    crayon_1.3.4

References

[1] I. Ucar, E. Pebesma, A. Azcorra, Measurement Errors in R, R J. 10 (2018) 549–557. https://doi.org/10.32614/RJ-2018-075.

[2] E. Pebesma, T. Mailund, J. Hiebert, Measurement Units in R, R J. 8 (2016) 486–494. https://doi.org/10.32614/RJ-2016-061.

[3] T. Ahmed, T. Edvinsson, Optical Quantum Confinement in Ultrasmall ZnO and the Effect of Size on Their Photocatalytic Activity, J. Phys. Chem. C. 124 (2020) 6395–6404. https://doi.org/10.1021/acs.jpcc.9b11229.


  1. If you run MacOS, you probably want to install Homebrew, a Linux-like package manager.↩

  2. Check that the bare-metal Linux host supports “Nested virtualisation” by inspecting either one of /etc/modprobe.d/qemu-system-x86.conf or /sys/module/kvm_intel/parameters/nested. The first file should contain nested=1, and the second should contain Y or 1. If not, set them accordingly. With the Windows VM shut down, edit the KVM XML config file so that the cpu element includes mode='host-passthrough'. Note that these settings must be in place before installing WSL2 Ubuntu on Windows. Also note that WSL2 supports nested virtualisation, but not WSL1.↩

  3. We should mention here that there is a known limitation to WSL2 virtualisation regarding having services (such as RStudio Server in this case) persist across reboots of the VM or even across reboots of the Windows machine itself. In short, WSL2 supports neither systemd nor init, so any services on the Linux VM will not autostart when the VM is rebooted. Even so, the web is full of hacks attempting to circumvent this limitation. On the other hand, this might not be such a big problem for the kind of experimentation that having an otherwise complete Linux shell available on the Windows desktop allows.↩

  4. Note that we had to redefine the lm() function using our own wrapper that takes care of unsetting and resetting the units and errors. See above. The code was copied verbatim from Ucar’s r-spatial article.↩


To leave a comment for the author, please follow the link and comment on their blog: R on solarchemist.se.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Measurement errors and dimensional analysis in R first appeared on R-bloggers.

The Wild World of Data Repositories


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

How to join this free online event with Kara Woo, Daniella Lowenberg, Matt Jones, Carl Boettiger and Karthik Ram.

There is no one-size-fits-all protocol for depositing your research data into a public repository in a way that maximizes its reuse and citation. We’ve assembled a panel that will help you understand the issues and opportunities for developing new tools and documentation.

This 1-hour event, moderated by Kara Woo, includes 5 speakers and 20 minutes for Q & A on:

  • Where and how to deposit data (or data + software!)
  • Challenges in data deposition for reuse
  • How to reuse data
  • Where are the gaps?

Speakers will cover nuances of general and domain-specific repositories (Dryad, the DataONE federation of repositories, Arctic Data Center), what curators look for, what to do when you want to deposit both data and software, designing tools to help researchers provide the right metadata to maximize reuse, making data reuse easier with R tools like piggyback, and their vision for new tools and documentation.

All are welcome.

Speakers

Kara Woo

Portrait of Kara Woo

Kara Woo is a Principal Bioinformatics Engineer at Sage Bionetworks where she leads a team of developers building tools and infrastructure for open science. Kara is a member of the rOpenSci Code of Conduct committee. Kara on GitHub, Twitter, Website

Daniella Lowenberg

Portrait of Daniella Lowenberg

Daniella Lowenberg is the Product Manager at Dryad and Principal Investigator of Make Data Count within the California Digital Library at University of California. Daniella on GitHub, Twitter

Matt Jones

Portrait of Matt Jones

Matt Jones is Director of Informatics R&D at the National Center for Ecological Analysis and Synthesis (NCEAS), Principal Investigator of the NSF Arctic Data Center, and Director of DataONE at University of California Santa Barbara. Matt on GitHub, Twitter, Website

Carl Boettiger

Portrait of Carl Boettiger

Carl Boettiger is Assistant Professor in the Department of Environmental Science, Policy and Management at UC Berkeley, a Co-founder and strategic advisor of rOpenSci. Carl on GitHub, Twitter, Website

Karthik Ram

Portrait of Karthik Ram

Karthik Ram is a Senior Research Scientist with the Berkeley Institute for Data Science, Project Lead and Co-founder of rOpenSci, Editor for rOpenSci Software Peer Review. He has a PhD in Ecology and Evolution. Karthik on GitHub, Twitter, Website

Resources

Join Us!

  • Who: Everyone is welcome. No RSVP needed; simply connect and/or dial in at the time of the event.
  • When: Wednesday, 16 December 2020, 10:00 PST (18:00 UTC)
  • Find your timezone
  • Add to Calendar
  • How:

    Meeting ID: 922 9890 9939

    Passcode: 896415


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post The Wild World of Data Repositories first appeared on R-bloggers.

Little useless-useful R functions – Same function names from different packages or namespaces


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This little useless-useful R function is not that useless after all. Many R programmers have had issues with namespaces in the R environment and lost time debugging problems that turned out to be caused by a variable or function name.

Let’s make a simple example:

c <- 100
a <- c(a, c)
a

So it looks like two variable names, “a” and “c”. Yes and no. “c” is a base function (it combines values into a vector) and can also be used as a variable name, which is the programmer's fault. The interesting part comes when you want to see what is stored in variable a: to your surprise, you get both the value of the variable and the function itself. So, be cautious when using base function names as variable names. Related problems are namespaces and identical function names across different packages, and you want to know about these incidents, right?

So the following function does just that: it goes through the list of functions for all the packages installed in a particular environment and finds duplicates, giving the R programmer insight into where to be cautious and refer to the scope or namespace explicitly.

funkyFun <- function() {
     libnames <- installed.packages()[, 1]
     # exclude packages:
     libnames <- libnames[!(libnames %in% c("rJava", "RPostgreSQL", "XLConnect", "xlsx", "xlsxjars"))]
     df <- data.frame(packname = NULL, funkyName = c(NULL, NULL))
     # for (i in 1:50){
     for (i in 1:length(libnames)) {
        com <- paste0("require(", libnames[i], ")")
        eval(parse(text = com))
        str <- paste0("package:", libnames[i])
        funk <- (ls(str))
        if (length(funk) == 0) {
          funk <- ifelse((length(funk) == 0) == TRUE, "funkyFun", funk)
        }
        da <- cbind(libnames[i], funk)
        df <- rbind(df, da)
     }
     no_freq <- data.frame(table(df$funk))
     all_duplicated_functions <- no_freq[no_freq$Freq > 1, ]
     all_duplicated_functions_per_package <- df[df$funk %in% no_freq$Var1[no_freq$Freq > 1], ]
     return(all_duplicated_functions_per_package)
}

The results of the function are somewhat interesting. In my case, with a little over 300 packages installed (yes, that's not that many), I found at least 20 functions that share the same name, each of them present in at least two (or more) packages.

And this is just an excerpt from a much larger set of duplicate function names among the packages. It also gives you an idea of potentially conflicting packages, as in the example above with caret and base both providing a function named Recall. When loading caret on top of the already existing base function, be cautious when calling such functions.
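If the output flags a conflict you care about, base R already ships helpers for tracking down where a name comes from and for calling a specific version explicitly. A small illustration, using the Recall example above:

# Which attached environments provide an object called "Recall"?
find("Recall")

# List every masked object, grouped by the environment doing the masking
conflicts(detail = TRUE)

# When in doubt, call the version you mean via its namespace
base::Recall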

As always, the code is available at Github.

Happy R-coding and stay healthy!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Little useless-useful R functions – Same function names from different packages or namespaces first appeared on R-bloggers.

Helper code and files for your testthat tests


[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If your package uses testthat for unit testing, as many packages on CRAN, you might have wondered at least once about where to put “things” you wished to use in several tests: a function generating a random string, an example image, etc. In this blog post, we shall offer a round-up of how to use such code and files when testing your package.

Code called in your tests

Remember our post about internal functions in R packages? What about internal functions in unit tests of R packages? And code that needs to be run for tests?

Where to put your code and function depends on where you’ll want to use them.

  • It’s best not to touch tests/testthat.R.
  • R scripts under tests/testthat/ whose name starts with setup are loaded before tests are run but not with devtools::load_all(). This can be important: it means the code in test setup files is not available when you try debugging a test error (or developing a new test) by running devtools::load_all() then the code of your test.1 And yes, you'll be interactively debugging tests more often than you wish. 😉

  • R scripts under tests/testthat/ whose names start with helper are loaded with devtools::load_all()2 so they are available for both tests and interactive debugging. Just like… R scripts under R/, so you might put your testthat helpers in the R directory instead, as recommended in testthat docs. So instead of living in tests/testthat/helper.R they'd live in e.g. R/testthat-helpers.R (the name is not important in the R directory). However, it also means they are installed with the package, which might (slightly!) increase its size, and that they sit with the rest of your package code, which might put you off.

To summarize,

File                             Run before tests   Loaded via load_all()   Installed with the package3   Testable4
tests/testthat/setup*.R          ✔
tests/testthat/helper*.R         ✔                  ✔
R/any-name.R                     ✔                  ✔                       ✔                             ✔
tests/testthat/anything-else.R

tests/testthat/helper*.R are no longer recommended in testthat but they are still supported. 😌

In practice,

  • In tests/testthat/setup.R you might do something like loading a package that helps your unit testing like {vcr}, {httptest} or {presser} if you’re testing an API client.
  • In a helper like tests/testthat/helper.R or R/test-helpers.R you might define variables and functions that you'll use throughout your tests, even custom skippers (a minimal sketch of such a helper file follows this list). To choose between the two locations, refer to the table above and your own needs and preferences. Note that if someone wanted to study testthat “utility belts” à la Bob Rudis, they would probably only identify helper files like tests/testthat/helper.R.
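As a rough sketch, assuming a package that talks to some web API, such a helper file could look like the following; the function names and the environment variable are made up for illustration.

# tests/testthat/helper.R (or R/test-helpers.R)

# a small utility used across several tests
random_string <- function(n = 10) {
  paste(sample(letters, n, replace = TRUE), collapse = "")
}

# a custom skipper, following the skip_if_*() naming convention
skip_if_no_api_token <- function() {
  if (identical(Sys.getenv("MYPKG_API_TOKEN"), "")) {
    testthat::skip("No API token available")
  }
}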

You'll notice testthat no longer recommends having a file with code to be run after tests… So how do you clean up after tests? Well, use withr's various helper functions for deferring clean-up. Basically, the code for cleaning up lives near the code for making a mess. To learn more about this, read the “self-cleaning test fixtures” vignette in testthat, which includes examples.
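As a hedged illustration of that idea (local_create_project() and its contents are invented here, in the spirit of the vignette's local_* pattern), a helper can make the mess and register its clean-up in the very same place:

# e.g. in tests/testthat/helper.R or R/test-helpers.R
local_create_project <- function(dir = tempfile(), env = parent.frame()) {
  dir.create(dir)
  # schedule removal for when the calling test (or function) exits
  withr::defer(unlink(dir, recursive = TRUE), envir = env)
  dir
}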

Files called from your tests

Say your package deals with files in some way or the other. To test it you can use two strategies, depending on your needs.

Create fake folders and text files from your tests

If the functionality under scrutiny depends on files that are fairly simple to generate with code, the best strategy might be to create them before running tests, and to delete them after running them. So you'll need to (re-)read the “self-cleaning test fixtures” vignette in testthat. In the words of Jenny Bryan

I have basically come to the opinion that any file system work done in tests should happen below the temp directory anyway. So, if you need to stand up a directory, then do stuff to or in it, the affected test(s) should create such a directory explicitly, below tempdir, for themselves (and delete it when they’re done).

It might seem easier to have the fake folders live under the testthat/ directory, but this might bite you later, so it's better to make the effort to create self-cleaning test fixtures, especially as, as mentioned earlier, this is a skill you'll need often.
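In practice that could look something like the sketch below, where my_write_report() is a made-up stand-in for whatever file-producing function your package exposes; everything happens below the session temp directory and is cleaned up when the test ends.

test_that("my_write_report() creates the expected file", {
  tmp <- withr::local_tempdir()   # created below tempdir(), deleted after the test
  path <- file.path(tmp, "report.csv")
  my_write_report(path)           # hypothetical function under test
  expect_true(file.exists(path))
})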

Use other files in your tests

Now, there are files that might be harder to re-create from your tests, like images, or even some text files with a ton of information in them. If you look at usethis's testthat/ folder you'll notice a ref folder, for instance, with zip files used in tests. You are free to organize files under testthat/ as you wish; they do not even need to be in a subdirectory, but sorting them into different folders might help you.

All files under testthat/ and its subfolders are available to your tests so you can read from them, source them if they are R scripts, copy them to a temp dir, etc.

Now, to refer to these files in your tests, use testthat::test_path(); this way you will get a filepath that works “both interactively and during tests”. E.g. if you store a file at tests/testthat/examples/image.png, in your tests you'd refer to it as testthat::test_path("examples", "image.png").
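For instance (read_image() and the myimage class are made up here to stand in for your own code):

test_that("read_image() handles a PNG stored with the tests", {
  img_path <- testthat::test_path("examples", "image.png")
  expect_true(file.exists(img_path))
  expect_s3_class(read_image(img_path), "myimage")
})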

Conclusion

In this post we offered a roundup around helper code and example files for your testthat unit tests. As often it was inspired by a help thread, on RStudio community. If you have some wisdom from your own testthat bag of tricks, please share it in the comments below!


  1. If you use something like browser(), debug() etc. somewhere in your code and run the tests, the setup file will have been loaded. ↩

  2. Actually you can choose to have devtools::load_all() not load the testthat helpers by using its helpers argument (TRUE by default). ↩

  3. Installed with the package, and run at package installation time. ↩

  4. Yes, you could test the code supporting your tests. ↩


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Helper code and files for your testthat tests first appeared on R-bloggers.

Exploring vaccine effectiveness through bayesian regression — Part 4


[This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Yet another post about COVID-19 vaccines. Ironically I started with Moderna for the first two posts, then shifted to Pfizer for the 3rd post when they released data first. Yesterday actual Moderna data became available (great news), along with some additional information we can model. Literally as I was writing this post, more Pfizer data came out with even more good news.

Last time

Last time we paid homage to Doing Bayesian Data Analysis by Dr. John K. Kruschke (hereafter, simply DBDA) while exploring overall effectiveness of the Pfizer vaccine.

This time we’ll revisit brms and STAN which we touched upon in part 2. At the time I wrote:

Quite frankly it is more than we need for our simple little analysis. A bit like using a sledge hammer as a fly-swatter. Besides the simple joy of learning about it however, there is one practical reason to use it. As our data grows and we inevitably want to conduct more sophisticated analysis factoring things in like gender, age and race. It is likely to become the tool of choice.

Since the Pfizer data includes some early glimpses at information about factors such as age and race we can put brms to much better use this time. First step let’s load some packages.

library(dplyr)
library(brms)
library(ggplot2)
library(kableExtra)
theme_set(theme_bw())

What we know. Approximately 25,650 in the study (well at least eligible so far). Two age categories (Less than 65, or older than that). We can safely assume they are not evenly distributed across age groups, but are more or less equal in the number who received the true vaccine versus the placebo. 95 subjects contracted COVID, the rest did not. Of those 95 only 5 received the true vaccine yielding an effectiveness of ~ 94%. We don’t know from the press releases what the proportion by age was. But we can have some fun playing with the numbers and preparing for when we see fuller data.

One area of interest will no doubt be whether the vaccine protects older people as well as younger. So we’ll have two factors vaccine/placebo and <65 versus >= 65. For each of those we’ll have data on the number of people who got COVID and those who didn’t.

For our purposes we're making some numbers up! They'll be realistic and possible given the limitations I laid out above (for example, only 5 people, regardless of age, received the true vaccine and got COVID). I'm going to deliberately assign a disproportionate number of the positive true-vaccine cases to the older category, a 4/1 split. Here's a table of our little dataframe agg_moderna_data.

Edit Nov 18th – I'm also aware that Pfizer has released even more data today showing that the vaccine is nearly equally effective for older people. But since I don't have access to the exact numbers (if anyone has them, send them along), we'll make some up.

Table 1: Notional Moderna Data
(the first two columns are the factors of interest, didnot/got_covid are the COVID infection counts, and subjects is the total per row)

condition    age            didnot   got_covid   subjects
placebo      Less than 65     8271          79       8350
placebo      Older            4494          11       4505
vaccinated   Less than 65     8293           1       8294
vaccinated   Older            4497           4       4501
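The press releases don't give this breakdown, so here is a minimal sketch of how the notional agg_moderna_data frame in Table 1 could be constructed by hand; the counts are the made-up numbers from the table, not real trial data.

agg_moderna_data <- data.frame(
  condition = rep(c("placebo", "vaccinated"), each = 2),
  age       = rep(c("Less than 65", "Older"), times = 2),
  didnot    = c(8271, 4494, 8293, 4497),
  got_covid = c(79, 11, 1, 4)
)
# total subjects per row, matching the last column of Table 1
agg_moderna_data$subjects <- agg_moderna_data$didnot + agg_moderna_data$got_covid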

Those familiar with frequentist methods in R will recognize this as an opportunity for glm. For reasons I'll explain later I prefer to use the aggregated data, although glm is perfectly happy to process a long dataframe. So we'll create a matrix of outcomes called outcomes with successes got_covid and failures didnot.

For binomial and quasibinomial families the response can be specified … as a two-column matrix with the columns giving the numbers of successes and failures

Voilà, the results using the native print function as well as broom::tidy. With classic frequentist methodology we can now “reject” the null that the \(\beta\) coefficients are 0. Both main effects, condition and age, are significant, as is the interaction age:condition, telling us that we must look at the combination of age and vaccine status to understand our results.

outcomes <- cbind(as.matrix(agg_moderna_data$got_covid, ncol = 1),
                  as.matrix(agg_moderna_data$didnot, ncol = 1))
moderna_frequentist <-
  glm(formula = outcomes ~ age + condition + age:condition,
      data = agg_moderna_data,
      family = binomial(link = "logit"))
moderna_frequentist
##
## Call:  glm(formula = outcomes ~ age + condition + age:condition, family = binomial(link = "logit"),
##     data = agg_moderna_data)
##
## Coefficients:
##                  (Intercept)                      ageOlder
##                       -4.651                        -1.362
##          conditionvaccinated  ageOlder:conditionvaccinated
##                       -4.372                         3.360
##
## Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
## Null Deviance:       121.2
## Residual Deviance: 5.18e-13  AIC: 23.71
broom::tidy(moderna_frequentist)
## # A tibble: 4 x 5
##   term                         estimate std.error statistic   p.value
## 1 (Intercept)                     -4.65     0.113    -41.1  0
## 2 ageOlder                        -1.36     0.322     -4.22 0.0000240
## 3 conditionvaccinated             -4.37     1.01      -4.34 0.0000140
## 4 ageOlder:conditionvaccinated     3.36     1.16       2.89 0.00389

Notice that the \(\beta\) values are not especially informative on their own because they are on the link = "logit" scale. We can still infer direction: older people, even when vaccinated, get COVID at higher rates. As we investigate bayesian methods we'll convert back to our original scale. But let's pause for a minute and acknowledge the limitations of the frequentist (NHST) method.
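Before that, a quick aside: one common way to make logit-scale coefficients more readable is to exponentiate them into odds ratios, for example:

exp(coef(moderna_frequentist))
# e.g. exp(-4.372) is roughly 0.013, i.e. in this notional data the vaccinated
# under-65s have about 1/80th the odds of infection of the placebo under-65s

With that aside, on to the limitations.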

Things we can’t do (great reading here):

  1. Claim that we are highly confident about our results because the p value is very small? NO. While you could construct confidence intervals around our \(\beta\) values, you can't make statements about probability; you either reject the null at some \(\alpha\) level or you don't.

  2. If the results were non-significant, can you “accept” the null or say there is evidence for it? NO. Again, that's not permitted in the NHST framework.

  3. Append this data to results from the study two months from now when the case count is higher? Another NO; you cannot legitimately keep adding data in a frequentist's world.

A bayesian approach

brms is a great package that very much mirrors the way glm works. It abstracts away many of the stumbling blocks that newcomers find difficult about STAN and bayesian modeling in general.

Let me back up a minute. If our data were in long format, i.e. a dataframe with 25,650 rows, one per subject, the simplest way to run the bayesian analog would be the following command.

brm(data = moderna_long_data,
    family = bernoulli(link = logit),
    y ~ age + condition + age:condition,
    iter = 12500, warmup = 500, chains = 4, cores = 4,
    control = list(adapt_delta = .99, max_treedepth = 12),
    seed = 9,
    file = "moderna_long")

We’re treating COVID-19 outcomes as simple bernoulli events (a coin flip if you will). I’m providing the code so you can try it if you like. But I don’t recommend it. Expect a wait of 5 minutes or more!

Let's go with a plan that works for the impatient (ME!). brm allows us to specify the outcomes aggregated. For each row of agg_moderna_data it's got_covid | trials(subjects), i.e. the number who became infected out of the number of subjects in that condition. Again, our \(\beta\)s are on the logit scale, but we'll correct that in a minute.

Notice file = "moderna_bayes_full", which saves our results to a file on disk named moderna_bayes_full.rds. That's a blessing because we have the results saved and can get them back in a future session. But be careful: if you rerun the command without deleting that file first, it won't rerun the analysis. We increased the default number of iterations and after warmup have 48,000 usable samples.

moderna_bayes_full <-
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age + condition + age:condition,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      file = "moderna_bayes_full")
summary(moderna_bayes_full)
##  Family: binomial
##   Links: mu = logit
## Formula: got_covid | trials(subjects) ~ age + condition + age:condition
##    Data: agg_moderna_data (Number of observations: 4)
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
##
## Population-Level Effects:
##                              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                       -4.66      0.11    -4.88    -4.44 1.00    42732    36581
## ageOlder                        -1.39      0.33    -2.08    -0.79 1.00    21396    21752
## conditionvaccinated             -4.80      1.21    -7.73    -2.98 1.00    11248    10470
## ageOlder:conditionvaccinated     3.72      1.35     1.45     6.80 1.00    12007    11430
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
bayestestR::describe_posterior(moderna_bayes_full,
                               ci = 0.95,
                               test = c("p_direction"),
                               centrality = "MAP")
## # Description of Posterior Distributions
##
## Parameter                    |    MAP |           95% CI |      pd |  Rhat |       ESS
## --------------------------------------------------------------------------------------
## Intercept                    | -4.651 | [-4.884, -4.437] | 100.00% | 1.000 | 42708.798
## ageOlder                     | -1.357 | [-2.050, -0.769] | 100.00% | 1.000 | 21032.449
## conditionvaccinated          | -4.166 | [-7.274, -2.760] | 100.00% | 1.001 |  9684.929
## ageOlder.conditionvaccinated |  3.285 | [ 1.308,  6.549] |  99.96% | 1.001 | 10879.160

Notice how similar our results are to glm. We can also use bayestestR::describe_posterior directly on the brmsfit object.

One of the very nice things about brms is that it has plot methods for almost everything you might want to plot. A few examples below and even more commented out.

mcmc_plot(moderna_bayes_full)

mcmc_plot(moderna_bayes_full, type = "areas")

mcmc_plot(moderna_bayes_full, type = "trace")

mcmc_plot(moderna_bayes_full, type = "hist")

mcmc_plot(moderna_bayes_full, type = "dens_overlay")

# mcmc_plot(moderna_bayes_full, type = "acf")
# mcmc_plot(moderna_bayes_full, type = "neff")
# mcmc_plot(moderna_bayes_full, type = "nuts_treedepth")
# mcmc_plot(moderna_bayes_full, type = "scatter")
# mcmc_plot(moderna_bayes_full, type = "rhat")
# mcmc_plot(moderna_bayes_full, type = "dens")

Time to address the issue of getting our scale back to the original instead of the logit. brms comes to the rescue by offering transformations, more specifically transformations = "inv_logit_scaled". Now we can see \(\beta_0\) and friends on their native scale.

mcmc_plot(moderna_bayes_full, transformations = "inv_logit_scaled")

Honestly, though, \(\beta\) coefficients are sometimes hard to explain to someone not familiar with the regression framework. Let's talk about conditional effects. brms provides a handy function called conditional_effects that will plot them for us. The command conditional_effects(moderna_bayes_full) is enough to get us decent output, but we can also wrap it in plot and do things like change the ggplot theme. We can focus on just the interaction effect with effects = "age:condition". This is important because interpreting main effects when we have evidence of an interaction is dubious. We can even pass additional ggplot information to customize our plot.

# conditional_effects(moderna_bayes_full)
plot(conditional_effects(moderna_bayes_full),
     theme = ggplot2::theme_minimal())

interaction_effect <-
  plot(conditional_effects(moderna_bayes_full,
                           effects = "age:condition",
                           plot = FALSE),
       theme = ggplot2::theme_minimal())

interaction_effect$`age:condition` +
  ggplot2::ggtitle("Infection rate by condition") +
  ggplot2::ylab("Infected / Subjects per condition") +
  ggplot2::xlab("Age") +
  ggplot2::theme_minimal()

If this data were real it would give us pause since it highlights the fact that the vaccine appears to be much less effective for people over the age of 65.

If you don’t like the appearance of the plots from brms or prefer making use of bayestestR as we did in earlier posts it is trivial to extract the draws/chains and plot it.

# library(bayestestR)
moderna_draws <-
  tidybayes::tidy_draws(moderna_bayes_full) %>%
  select(b_Intercept:`b_ageOlder:conditionvaccinated`) %>%
  mutate_all(inv_logit_scaled)
moderna_draws
## # A tibble: 48,000 x 4
##    b_Intercept b_ageOlder b_conditionvaccinated `b_ageOlder:conditionvaccinated`
##  1     0.00827      0.351               0.0160                             0.948
##  2     0.00954      0.164               0.0110                             0.958
##  3     0.00955      0.242               0.0202                             0.962
##  4     0.00876      0.240               0.0176                             0.896
##  5     0.00999      0.157               0.00496                            0.981
##  6     0.0129       0.171               0.00266                            0.994
##  7     0.00919      0.181               0.00442                            0.993
##  8     0.00971      0.244               0.0259                             0.937
##  9     0.00900      0.147               0.00749                            0.982
## 10     0.00893      0.172               0.00975                            0.977
## # … with 47,990 more rows
plot(bayestestR::hdi(moderna_draws, ci = c(.89, .95)))

Priors and Bayes Factors

Bayesian analysis rests on the principle of modeling how the data update our prior beliefs. You'll notice that nowhere above did I specify any prior. That's because brms is kind enough to provide defaults. From the documentation: “Default priors are chosen to be non or very weakly informative so that their influence on the results will be negligible and you usually don't have to worry about them. However, after getting more familiar with Bayesian statistics, I recommend you to start thinking about reasonable informative priors for your model parameters: Nearly always, there is at least some prior information available that can be used to improve your inference.”

To be conservative, but also to pave the way for computing some Bayes Factors, let's consult the STAN documentation and choose some weakly informative priors. We can run get_prior to get information about the defaults. We don't really care about the overall intercept \(\beta_0\), but let's set the others to normal(0, 3), which puts us between weak and very weak. We'll store those changes in bprior and rerun the model with a new name; you can see that changing the priors to something slightly more informative has little impact. The data overwhelm the priors and our posteriors are about the same as they were.

get_prior(data = agg_moderna_data,
          family = binomial(link = logit),
          got_covid | trials(subjects) ~ age + condition + age:condition)
##                 prior     class                         coef group resp dpar nlpar bound       source
##                (flat)         b                                                              default
##                (flat)         b                     ageOlder                            (vectorized)
##                (flat)         b ageOlder:conditionvaccinated                            (vectorized)
##                (flat)         b          conditionvaccinated                            (vectorized)
##  student_t(3, 0, 2.5) Intercept                                                              default

### https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
### Generic weakly informative prior: normal(0, 1);
bprior <- c(prior(normal(0,3), class = b, coef = ageOlder),
            prior(normal(0,3), class = b, coef = conditionvaccinated),
            prior(normal(0,3), class = b, coef = ageOlder:conditionvaccinated))
bprior
##         prior class                         coef group resp dpar nlpar bound source
##  normal(0, 3)     b                     ageOlder                               user
##  normal(0, 3)     b          conditionvaccinated                               user
##  normal(0, 3)     b ageOlder:conditionvaccinated                               user
full_model <-
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age + condition + age:condition,
      prior = bprior,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "full_model")
summary(full_model)
##  Family: binomial
##   Links: mu = logit
## Formula: got_covid | trials(subjects) ~ age + condition + age:condition
##    Data: agg_moderna_data (Number of observations: 4)
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
##
## Population-Level Effects:
##                              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                       -4.67      0.11    -4.90    -4.45 1.00    49307    38807
## ageOlder                        -1.33      0.32    -1.99    -0.74 1.00    29645    26856
## conditionvaccinated             -3.97      0.76    -5.67    -2.69 1.00    15090    15787
## ageOlder:conditionvaccinated     2.76      0.96     0.96     4.75 1.00    16749    18968
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Some bayesians like to work with a Bayes Factor. We can use brms::bayes_factor to inform us about the relative worthiness of competing regression models. You’ll notice I named the first set of results full_model because it included age, condition and the interaction between them age:condition. Let’s build three more models:

  • one without the interaction term called no_interaction,
  • one without the interaction or the vaccine called no_vaccine,
  • one without the interaction or age called no_age,

Then we can compare the full model against each of these nested models. (NB for the purists yes I actually tried some other priors that were less weak and yes the bayes factors are quite robust.)

bprior2 <- c(prior(normal(0,3), class = b, coef = ageOlder),
             prior(normal(0,3), class = b, coef = conditionvaccinated))
bprior2
##         prior class                coef group resp dpar nlpar bound source
##  normal(0, 3)     b            ageOlder                               user
##  normal(0, 3)     b conditionvaccinated                               user
no_interaction <-
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age + condition,
      prior = bprior2,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "no_interaction")
summary(no_interaction)
##  Family: binomial
##   Links: mu = logit
## Formula: got_covid | trials(subjects) ~ age + condition
##    Data: agg_moderna_data (Number of observations: 4)
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
##
## Population-Level Effects:
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept              -4.70      0.12    -4.94    -4.48 1.00    52909    38740
## ageOlder               -1.07      0.28    -1.65    -0.55 1.00    22774    21953
## conditionvaccinated    -2.87      0.45    -3.83    -2.08 1.00    18737    19745
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

bprior3 <- c(prior(normal(0,3), class = b, coef = ageOlder))
bprior3
## b_ageOlder ~ normal(0, 3)
no_vaccine <-
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age,
      prior = bprior3,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "no_vaccine")
summary(no_vaccine)
##  Family: binomial
##   Links: mu = logit
## Formula: got_covid | trials(subjects) ~ age
##    Data: agg_moderna_data (Number of observations: 4)
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -5.34      0.11    -5.56    -5.12 1.00    48533    33484
## ageOlder     -1.07      0.28    -1.65    -0.54 1.00    16100    17671
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

bprior4 <- c(prior(normal(0,3), class = b, coef = conditionvaccinated))
bprior4
## b_conditionvaccinated ~ normal(0, 3)
no_age <-
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ condition,
      prior = bprior4,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "no_age")
summary(no_age)
##  Family: binomial
##   Links: mu = logit
## Formula: got_covid | trials(subjects) ~ condition
##    Data: agg_moderna_data (Number of observations: 4)
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
##
## Population-Level Effects:
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept              -4.96      0.11    -5.17    -4.76 1.00    52371    33818
## conditionvaccinated    -2.88      0.45    -3.85    -2.07 1.00    10390    11347
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

bayes_factor(full_model, no_interaction, silent = TRUE)
## Estimated Bayes factor in favor of full_model over no_interaction: 31.06810
bayes_factor(full_model, no_vaccine, silent = TRUE)
## Estimated Bayes factor in favor of full_model over no_vaccine: 235611405759697158144.00000
bayes_factor(full_model, no_age, silent = TRUE)
## Estimated Bayes factor in favor of full_model over no_age: 17306.61482

We can now quantify how well our models “fit” the data compared to the nested models. The evidence for retaining the model with the interaction term included is about 30:1. The evidence for keeping age is about 17,000:1 and you don’t even want to think about dropping the vaccine condition.

I do not personally believe in any model that includes an interaction but not the main effect that accompanies it.

What have we learned?

  • Very little more about the actual data since I made some things up (full disclosure). But a lot about how to analyze it as it becomes available. As more data becomes available we’ll be able to do even more complex analyses with other factors like gender and race.

  • brms is a great modeling tool!

Done

I’m done with this topic for now. More fantastic news for humanity in the race for a vaccine.

Hope you enjoyed the post. Comments always welcomed. Especially please let me know if you actually use the tools and find them useful.

Keep counting the votes! Every last one of them!

Chuck

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License


To leave a comment for the author, please follow the link and comment on their blog: Posts on R Lover ! a programmer.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Exploring vaccine effectiveness through bayesian regression -- Part 4 first appeared on R-bloggers.


FIFA Shiny App Wins Popular Vote in Appsilon’s Shiny Contest


[This article was first published on r – Appsilon | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Appsilon recently ran an internal Shiny contest to see which team member could make the best shiny.semantic PoC in under 24 hours. We asked the R Community to vote for the People’s Choice Award, and today we have a winner to announce!

Appsilon is hiring globally. We are a remote-first company. Check out 9 currently open positions. Want to work with our partner RStudio? They are currently hiring a Data Science Evangelist.

Before we announce the winner, we would like to take a minute and thank everyone who voted for their favorite shiny.semantic dashboard. It’s you who made it possible to award this prize. And now, let’s take a look at how the voting turned out. 

PoContest results

A breakdown of all the votes for our six Shiny App entries.

See the other winners of our internal Shiny Contest as determined by our partner RStudio: RStudio Announces Winners of Appsilon’s Internal Shiny Contest

People’s Choice Award Winner: FIFA ’19

Huge congratulations to our open-source leader Dominik for building this application! The app won 34.7% of your votes, 9% more than the runner-up. Perhaps there are some football fans in the R Community! RStudio was impressed by this app as well and named it the runner-up for the “Most Technically Impressive” award. Read more about the winners they chose on the RStudio blog.

FIFA 19 Dashboard

The FIFA ’19 application was created to demonstrate powerful shiny.semantic features for generating interactive data visualization. This dashboard used SoFifa data and was inspired by a fantastic Fifa Shiny Dashboard. You can test the application here.

People’s Choice Runner-Up: Semantic Memory

Semantic Memory is a memory game created from scratch by Appsilon engineer Jakub Nowicki using shiny.semantic (with some adjustments). Two players try to find as many pairs of R package hexes (coming from both Appsilon and RStudio) as they can. You might notice the reference to our new shiny.worker package in this screenshot. The app won an impressive 26.5% of the overall vote.

Semantic memory

Semantic Memory is based on various shiny.semantic components and uses features that come with the FomanticUI library, such as the mechanism responsible for revealing and hiding cards. You can test the application here.

Like shiny.semantic? Please give us a star on GitHub.

Third Place: Polluter Alert

Polluter Alert is a dashboard created by Appsilon co-founder Paweł Przytuła that allows the user to report sources of air pollution in the user's area. The app won 20.4% of your votes and was named “Most Technically Impressive” by RStudio.

Polluter Alert

This dashboard’s goal is to build a reliable dataset used for actionable insights – sometimes, the primary pollution source is a single chimney. Sometimes it is a district problem (lack of modern heating infrastructure), etc. You can test the application here.

Fourth Place: Shiny Mosaic

The purpose of this application by Krystian Igras is to enable the user to create a mosaic of a photo they upload. The app won 12.2% of your votes.

Shiny Mosaic

The application, built in the form of a wizard, allows users to configure the target mosaic form easily. You can test the application here.

Fifth Place: Semantic Pixelator

Semantic Pixelator (created by Pedro Silva) is a fun way to explore semantic elements by creating different image compositions using loaders, icons, and other UI elements from semantic/fomantic UI. The app only won 4.1% of your votes in the popularity contest, but RStudio named it “Most Creatively Impressive.”

Semantic Pixelator

You can use the sidebar to refine different parameters such as the generated grid’s size, the base element type, and other color options. You can then use the palette generator to create a color palette based on the result and download both the current palette details and the developed composition. You can test the application here.

Sixth Place: Squaremantic

With this app by Appsilon team member Jakub Chojna, you can quickly generate a nicely formatted square layout of letters based on the text input. The app received 2% of the overall votes, but it’s first place in our hearts! With this app, Jakub impressed us with his design background. 

Squaremantic

The app uses shiny.semantic layouts and input elements, by which you can control visual output like in a simple graphic program. You can test the application here.

Conclusion

That’s a wrap for the 2020 shiny.semantic PoContest! We want to thank you again for taking the time to vote for your favorite dashboard. It means a lot to us and the community engagement in this contest was a huge confidence boost for our team.

Want to see more high-quality Shiny Dashboard examples? Visit Appsilon’s Shiny Demo Gallery.

Do you like shiny.semantic? Show your support by giving us a star on GitHub. You can explore other Appsilon open source packages on our new Shiny Tools landing page. 

Appsilon is hiring! We are primarily seeking senior-level developers with management experience. See our Careers page for all new openings, including openings for a Project Manager and Community Manager.

Article FIFA Shiny App Wins Popular Vote in Appsilon’s Shiny Contest comes from Appsilon | End­ to­ End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon | End­ to­ End Data Science Solutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post FIFA Shiny App Wins Popular Vote in Appsilon’s Shiny Contest first appeared on R-bloggers.

AzureTableStor: R interface to Azure table storage service


[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Hong Ooi

I’m pleased to announce that the AzureTableStor package, providing a simple yet powerful interface to the Azure table storage service, is now on CRAN. This is something that many people have requested since the initial release of the AzureR packages nearly two years ago.

Azure table storage is a service that stores structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design. Because table storage is schemaless, it’s easy to adapt your data as the needs of your application evolve. Access to table storage data is fast and cost-effective for many types of applications, and is typically lower in cost than traditional SQL for similar volumes of data.

You can use table storage to store flexible datasets like user data for web applications, address books, device information, or other types of metadata your service requires. You can store any number of entities in a table, and a storage account may contain any number of tables, up to the capacity limit of the storage account.

AzureTableStor builds on the functionality provided by the AzureStor package. The table storage service is available both as part of general Azure storage and via Azure Cosmos DB; AzureTableStor is able to work with either.

Tables

AzureTableStor provides a table_endpoint function that is the analogue of AzureStor’s blob_endpoint, file_endpoint and adls_endpoint functions. There are methods for retrieving, creating, listing and deleting tables within the endpoint.

library(AzureTableStor)

# storage account endpoint
endp <- table_endpoint(
    "https://mystorageacct.table.core.windows.net",
    key="mykey")
# Cosmos DB w/table API endpoint
endp <- table_endpoint(
    "https://mycosmosdb.table.cosmos.azure.com:443",
    key="mykey")

create_storage_table(endp, "mytable")
list_storage_tables(endp)
tab <- storage_table(endp, "mytable")

Entities

In table storage jargon, an entity is a row in a table. The columns of the table are properties. Note that table storage does not enforce a schema; that is, individual entities in a table can have different properties. An entity is identified by its RowKey and PartitionKey properties, which must be unique for each entity.

AzureTableStor provides the following functions to work with data in a table:

  • insert_table_entity: inserts a row into the table.
  • update_table_entity: updates a row with new data, or inserts a new row if it doesn’t already exist.
  • get_table_entity: retrieves an individual row from the table.
  • delete_table_entity: deletes a row from the table.
  • import_table_entities: inserts multiple rows of data from a data frame into the table.
insert_table_entity(tab, list(
    RowKey="row1",
    PartitionKey="partition1",
    firstname="Bill",
    lastname="Gates"
))

get_table_entity(tab, "row1", "partition1")

# we can import to the same table as above:
# table storage doesn't enforce a schema
import_table_entities(tab, mtcars,
    row_key=row.names(mtcars),
    partition_key=as.character(mtcars$cyl))

list_table_entities(tab)
list_table_entities(tab, filter="firstname eq 'Satya'")
list_table_entities(tab, filter="RowKey eq 'Toyota Corolla'")

Batch transactions

With the exception of import_table_entities, all of the above entity functions work on a single row of data. Table storage provides a batch execution facility, which lets you bundle up single-row operations into a single transaction that will be executed atomically. In the jargon, this is known as an entity group transaction. import_table_entities is an example of an entity group transaction: it bundles up multiple rows of data into batch jobs, which is much more efficient than sending each row individually to the server.

The create_table_operation, create_batch_transaction and do_batch_transaction functions let you perform entity group transactions. Here is an example of a simple batch insert. The actual import_table_entities function is more complex as it can also handle multiple partition keys and more than 100 rows of data.

ir <- subset(iris, Species == "setosa")

# property names must be valid C# variable names
names(ir) <- sub("\\.", "_", names(ir))

# create the PartitionKey and RowKey properties
ir$PartitionKey <- ir$Species
ir$RowKey <- sprintf("%03d", seq_len(nrow(ir)))

# generate the array of insert operations: 1 per row
ops <- lapply(seq_len(nrow(ir)), function(i)
    create_table_operation(endp, "mytable", body=ir[i, ],
                           http_verb="POST"))

# create a batch transaction and send it to the endpoint
bat <- create_batch_transaction(endp, ops)
do_batch_transaction(bat)

If you have any feedback, or to report bugs with the package, please contact me at hongooi73@gmail.com or open an issue on GitHub.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.



Join Us Dec 10 at the COVID-19 Data Forum: Using Mobility Data To Forecast COVID-19 Cases


[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Thurs, December 10th, 9am PDT/12pm EDT/18:00 CEST – Register now!

Hosted by the COVID-19 Data Forum/Stanford Data Science Initiative/R Consortium

Join the R Consortium and learn about the role of mobility data in monitoring and forecasting COVID-19. COVID-19 is the first global pandemic to occur in the age of big data. All around the world, public health officials are testing and releasing data to the public, giving scientists ample case data to analyze and forecast in real time.

Despite having so much data available, most of it has been limited to simplistic aggregate metrics rather than higher-dimensional, patient-level data. To understand how COVID-19 works within the body and how it is transmitted, scientists must understand why the virus harms some people more than others.

Sharing such data raises patient-confidentiality issues, which makes this vital information difficult to obtain.

The COVID-19 Data Forum, a collaboration between Stanford University and the R Consortium, will discuss the promises and challenges of using people’s mobility data to combat the spread of SARS-CoV-2, as well as how the public has behaved in response to the pandemic. This data is vital for understanding how individuals’ movement patterns have shifted since the pandemic began, and for tracking where people are going and when they are getting sick.

The event is free and open to the public.

Speakers include:

  • Chris Volinsky, PhD, Associate Vice President, Big Data Research, AT&T Labs.
  • Caroline Buckee, Associate Professor of Epidemiology and Associate Director of the Center for Communicable Disease Dynamics at the Harvard T.H. Chan School of Public Health.
  • Christophe Fraser, Professor of Pathogen Dynamics at the University of Oxford and Senior Group Leader at the Big Data Institute, Oxford University, UK.
  • Andrew Schroeder, PhD, Vice President of Research & Analytics for Direct Relief.

Registration and more info: https://covid19-data-forum.org

The post Join Us Dec 10 at the COVID-19 Data Forum: Using Mobility Data To Forecast COVID-19 Cases appeared first on R Consortium.


To leave a comment for the author, please follow the link and comment on their blog: R Consortium.



Apple Silicon + Big Sur + RStudio + R Field Report


[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

It’s been a while since I’ve posted anything R-related and, while this one will be brief, it may be of use to some R folks who have taken the leap into Big Sur and/or Apple Silicon. Stay to the end for an early Christmas 🎁!

Big Sur Report

As #rstats + #macOS Twitter-folks know (before my permanent read hiatus began), I’ve been on Big Sur since the very first developer betas were out. I’m even on the latest beta (11.1b1) as I type this post.

Apple picked away at many of the nits they introduced with Big Sur, but it’s still more of a “Vista” release than Catalina was, given all the clicks involved in installing new software. However, despite making Simon’s life a bit more difficult (re: notarization of R binaries), I’m happy to report that R 4.0.3 and the latest RStudio 1.4 daily releases work great on Big Sur. To be on the safe side, I highly recommend putting the R command-line binaries and the RStudio and R-GUI .apps into both “Developer Tools” and “Full Disk Access” in the Security & Privacy preferences pane. While not completely necessary, it may save some debugging (or clicks of “OK”) down the road.

The Xcode command-line tools are finally in a stable state and can be used instead of the massive Xcode.app installation. This was problematic for a few months, but Apple has been pretty consistent keeping it version-stable with Xcode-proper.

Homebrew is pretty much feature complete on Big Sur (for Intel-architecture Macs) and I’ve run everything from the {tidyverse} to {sf}-verse, ODBC/JDBC and more. Zero issues have come up, and with the pending public release (in a few weeks) of 11.1, I can safely say you should be fine moving your R analyses and workflows to Big Sur.

Apple Silicon Report

Along with all the other madness, 2020 has resurrected the “Processor Wars”, with Apple making its own ARM 64-bit chips (the M1 series). The Mac mini flavor is fast, but I suspect some of the “feel” of that speed comes from the faster SSDs and the beefed-up I/O plane to get to those SSDs. These newly released Mac models require Big Sur, so if you’re not prepared to deal with that, you should hold off on a hardware upgrade.

Another big reason to hold off on a hardware upgrade is that the current M1 chips cannot handle more than 16GB of RAM. I do most of the workloads requiring memory heavy-lifting on a 128GB RAM, 8-core Ubuntu 20 server, so what would have been a deal breaker in the past is not much of one today. Still, R folks who have gotten used to 32GB+ of RAM on laptops/desktops will absolutely need to wait for the next-gen chips.

Most current macOS software — including R and RStudio — is going to run in the Rosetta 2 “translation environment”. I’ve not experienced the 20+ seconds of startup time that others have reported, but RStudio 1.4 did take noticeably (but not painfully) longer on the first run than it has on subsequent ones. Given how complex the RStudio app is (chromium engine, Qt, C++, Java, oh my!), I was concerned it might not work at all, but it does indeed work with absolutely no problems. Even the ODBC drivers (Athena, Drill, Postgres) I need to use in daily work all work great with R/RStudio.

This means things can only get even better (i.e. faster) as all these components are built with “Universal” processor support.

Homebrew can work, but it requires the use of the arch command to ensure everything is running in the Rosetta 2 translation environment (a sketch of that incantation follows the list below), and nothing I’ve installed (from source) has required anything from Homebrew. Do not let that last sentence lull you into a false sense of excitement. So far, I’ve tested:

  • core {tidyverse}
  • {DBI}, {odbc}, {RJDBC}, and (hence) {dbplyr}
  • partially extended {ggplot2}-verse
  • {httr}, {rvest}, {xml2}
  • {V8}
  • a bunch of self-contained, C[++]-backed or just base R-backed stats packages

and they all work fine.
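
To make the Homebrew point above concrete, here is the kind of incantation I mean, wrapped in system() since this is an R post. This is my own sketch, not from the original report, and it assumes an Intel-prefix Homebrew installed under /usr/local.

# my own sketch: run a Homebrew command through the Rosetta 2 translation
# environment on an M1 Mac using the macOS arch utility
# (assumes an Intel-prefix Homebrew at /usr/local/bin/brew)
system('arch -x86_64 /usr/local/bin/brew install libgit2')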

I have installed separate (non-Universal) binaries of fd, ripgrep, bat and a few others and almost have a working Rust toolchain up (both Rust and Go are very close to stable on Apple’s M1 architecture).

If there are specific packages/package ecosystems you’d like tested or benchmarked, drop a note in the comments. I’ll likely be adding more field-report updates over the coming weeks as I experiment with additional components.

Now, if you are a macOS R user, you already know — thanks to Tomas and Simon — that we are in wait mode for a working Fortran compiler before we will see a Universal build of R. The good news is that things are working fine under Rosetta 2 (so far).

RStudio Update

If there is any way you can use RStudio Desktop + Server Pro, I would heartily recommend that you do so. The remote R session in-app access capabilities are dreamy and addictive as all get-out.

I also (finally) made a (very stupid simple) PR into the dailies so that RStudio will be counted as a “Developer Tool” for Screen Time accounting.

RSwitch Update

🎁-time! As incentive to try Big Sur and/or Apple Silicon, I started work on version 2 of RSwitch which you can grab from — https://rud.is/rswitch/releases/RSwitch-2.0.0b.app.zip. For those new to RSwitch, it is a modern alternative to the legacy RSwitch that enabled easy switching of R versions on macOS.
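
For context, switching R versions on macOS boils down to re-pointing the framework’s Current symlink at one of the installed versions. The sketch below is my own illustration of what a switcher automates, not RSwitch’s actual code, and the "4.0" directory name is only an assumed example.

# illustration only (not RSwitch's code): re-point the Current symlink
# at another installed R version; "4.0" is an assumed example directory
system(paste(
    "ln -sfn",
    "/Library/Frameworks/R.framework/Versions/4.0",
    "/Library/Frameworks/R.framework/Versions/Current"
))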

The new version requires Big Sur, as it’s based on Apple’s new “SwiftUI” and takes advantage of some components that Apple has not seen fit to make backwards compatible. The current version has fewer features than the 1.x series, but I’m still working out what should and should not be included in the app. Drop notes in the comments with feature requests (source will be up after 🦃 Day).

The biggest change is that the app is no longer just a menu but a popup window:

Each of those tappable download regions presents a cancellable download progress bar (all three can run at the same time), and any downloaded disk images will be auto-mounted. That third tappable download region is for downloading RStudio Pro dailies. You can get notifications for both (or neither) Community and Pro versions:

The R version switchers also provides more info about the installed versions:

(And, yes, the r79439 is what’s in Rversion.h of each of those installations. I kinda want to know why, now.)

The interface is very likely going to change dramatically, but the app is a Universal binary, so will work on M1 and Intel Big Sur Macs.

NOTE: it’s probably a good idea to put this into “Full Disk Access” in the Security & Privacy preferences pane, but you should likely wait until I post the source so you can either build it yourself or validate that it’s only doing what it says on the tin for yourself (the app is benign but you should be wary of any app you put on your Macs these days).

WARNING: For now it’s a Dark Mode only app so if you need it to be non-hideous in Light Mode, wait until next week as I add in support for both modes.

FIN

If you’re running into issues, the macOS R mailing list is probably a better place to post problems with R and Big Sur/Apple Silicon, but feel free to drop a comment if there’s something you want me to test or take a stab at, and/or change/add in RSwitch.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.



Final Moderna + Pfizer Vaccine Efficacy Update


[This article was first published on Fells Stats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Okay, last one I swear. We just got updated results from the Pfizer-Biontech vaccine candidate. With 8 out of 170 cases belonging to the treatment group, the point estimate is remarkably similar to the Moderna preliminary results (~95% effectiveness). I will note that I was right in saying that the Pfizer-Biontech vaccine is likely much more effective than the 90% bound originally reported. Given that these are both very similar mRNA vaccines, the similarity in efficacy shouldn’t be radically surprising. Below is an updated plot of the posterior distribution for vaccine efficacy. I’ve added a “pooled” posterior, which assumes that the true efficacy is the same for both vaccines.
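
Before the full script below, a quick check of that point estimate (my own arithmetic, not from the original post): with 1:1 randomization, theta is the share of cases falling in the vaccine arm, the rate ratio is theta / (1 - theta), and efficacy is one minus that ratio.

# quick point-estimate check, assuming equal-size vaccine and placebo arms
theta_hat <- 8 / 170
100 * (1 - theta_hat / (1 - theta_hat))   # about 95.1% effectiveness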

library(ggplot2)  # needed for the plots below

# reference: https://pfe-pfizercom-d8-prod.s3.amazonaws.com/2020-09/C4591001_Clinical_Protocol.pdf

# prior interval (matches prior interval on page 103)
qbeta(c(.025,.975),.700102,1)

# posterior pfizer
cases_treatment <- 8
cases_control <- 170 - cases_treatment
theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
rate_ratio_ci <- theta_ci / (1-theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx <- (1:90)/500
yy <- sapply(xx, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
xx <- 100 * (1 - xx / (1 - xx))
ggplot() +
  geom_area(aes(x=xx,y=yy)) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

# posterior combined
cases_treatment <- 8 + 5
cases_control <- 170 + 95 - cases_treatment
theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
rate_ratio_ci <- theta_ci / (1-theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx1 <- (1:90)/500
yy1 <- sapply(xx1, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
xx1 <- 100 * (1 - xx1 / (1 - xx1))
ggplot() +
  geom_area(aes(x=xx1,y=yy1)) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

# posterior moderna
cases_treatment <- 5
cases_control <- 95 - cases_treatment
theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
rate_ratio_ci <- theta_ci / (1-theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx2 <- (1:90)/500
yy2 <- sapply(xx2, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
xx2 <- 100 * (1 - xx2 / (1 - xx2))
ggplot() +
  geom_area(aes(x=xx2,y=yy2)) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

# overlay all three posteriors on one plot
df <- rbind(
  data.frame(xx=xx,yy=yy,Company="Pfizer-Biontech"),
  data.frame(xx=xx2,yy=yy2,Company="Moderna"),
  data.frame(xx=xx1,yy=yy1,Company="Pooled")
)
ggplot(df) +
  geom_area(aes(x=xx,y=yy,fill=Company),alpha=.25,position = "identity") +
  geom_line(aes(x=xx,y=yy,color=Company),size=1) +
  theme_bw() +
  xlab("Vaccine Effectiveness") +
  ylab("Posterior Density")

[Figure: posterior densities of vaccine effectiveness for the Pfizer-BioNTech, Moderna, and pooled estimates]

Both provide excellent protection. Really the only new information is that there is no meaningful difference in efficacy between the two, and hence no reason to prefer one over the other.

Some safety data was also reported:

A review of unblinded reactogenicity data from the final analysis which consisted of a randomized subset of at least 8,000 participants 18 years and older in the phase 2/3 study demonstrates that the vaccine was well tolerated, with most solicited adverse events resolving shortly after vaccination. The only Grade 3 (severe) solicited adverse events greater than or equal to 2% in frequency after the first or second dose was fatigue at 3.8% and headache at 2.0% following dose 2.

This is really good, and maybe even a bit better than Moderna's reported profile. I honestly don't know how to square this with the higher adverse event rates in the phase II study, where a significant number of participants had fevers after the second dose.

There were 10 severe cases of which 1 was in the treatment arm. Pooling the data from the Moderna and Pfizer studies, 7.1% of cases were severe in the treated arm versus 8.9% in the control. This difference is nowhere near significance though. So no real evidence yet that mRNA vaccines make illness milder should you be infected.

One thing that may have gone under the radar is this quote:

“We are grateful that the first global trial to reach the final efficacy analysis mark indicates that a high rate of protection against COVID-19 can be achieved very fast after the first 30 µg dose, underscoring the power of BNT162 in providing early protection,” said Ugur Sahin, M.D., CEO and Co-founder of BioNTech.

It appears that protection ramps up after the first shot, which is critical since we are in the midst of a huge surge in cases. We need protection as soon as possible to reduce deaths and the virus's reproduction rate.


To leave a comment for the author, please follow the link and comment on their blog: Fells Stats.


