
Collider Bias, or: Are Hot Babes Dim and Eggheads Ugly?


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]

Correlation and its associated challenges don’t lose their fascination: most people know that correlation doesn’t imply causation; fewer know that the opposite is also true (see: Causation doesn’t imply Correlation either); and some know that a correlation can also be purely random (so-called spurious correlation).

If you want to learn about a paradoxical effect nearly nobody is aware of, where correlation between two uncorrelated random variables is introduced just by sampling, read on!

Let us just get into an example (inspired by When Correlation Is Not Causation, But Something Much More Screwy): for all intents and purposes let us assume that appearance and IQ are normally distributed and are uncorrelated:

set.seed(1147)
hotness <- rnorm(1000, 100, 15)
IQ <- rnorm(1000, 100, 15)
pop <- data.frame(hotness, IQ)
plot(hotness ~ IQ, main = "The general population")

Now, we can ask ourselves: why does somebody become famous? One plausible assumption (besides luck, see also: The Rich didn’t earn their Wealth, they just got Lucky) would be that this person has some combination of attributes. To stick with our example, let us assume some combination of hotness and intelligence and let us sample some “celebrities” on the basis of this combination:

pop$comb <- pop$hotness + pop$IQ # some combination of hotness and IQ
celebs <- pop[pop$comb > 235, ]  # sample celebs on the basis of this combination
plot(celebs$hotness ~ celebs$IQ, xlab = "IQ", ylab = "hotness", main = "Celebrities")
abline(lm(celebs$hotness ~ celebs$IQ), col = "red")

Wow, a clear negative relationship between hotness and IQ! Even a highly significant one (to understand significance, see also: From Coin Tosses to p-Hacking: Make Statistics Significant Again!):

cor.test(celebs$hotness, celebs$IQ) # highly significant
## 
##  Pearson's product-moment correlation
## 
## data:  celebs$hotness and celebs$IQ
## t = -14.161, df = 46, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9440972 -0.8306163
## sample estimates:
##       cor 
## -0.901897

How can this be? Well, the basis (the combination of hotness and IQ) on which we sample from our (uncorrelated) population is what is called a collider (variable) in statistics. Whereas a confounder (variable) influences (at least) two variables (A ← C → B), a collider is the opposite: it is influenced by (at least) two variables (A → C ← B).

In our simple case, it is the sum of our two independent variables. The result is a spurious correlation introduced by a special form of selection bias, namely endogenous selection bias. The same effect also goes under the name Berkson’s paradox, Berkson’s fallacy, selection-distortion effect, conditioning on a collider (variable), collider stratification bias, or just collider bias.
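To see the difference in action, here is a small simulation I added (not part of the original post): a confounder makes two variables look correlated until we adjust for it, while a collider makes two independent variables look correlated once we condition on it.

set.seed(1147)
n <- 10000
C <- rnorm(n)
A <- C + rnorm(n); B <- C + rnorm(n)             # confounder: A <- C -> B
cor(A, B)                                        # clearly positive, induced by C
cor(residuals(lm(A ~ C)), residuals(lm(B ~ C)))  # roughly 0 after adjusting for C

A2 <- rnorm(n); B2 <- rnorm(n)                   # two independent variables
C2 <- A2 + B2                                    # collider: A2 -> C2 <- B2
cor(A2, B2)                                      # roughly 0 in the full population
cor(A2[C2 > 1], B2[C2 > 1])                      # clearly negative once we condition on C2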

To understand this effect intuitively we are going to combine the two plots from above:

plot(hotness ~ IQ, main = "The general population & Celebrities")
points(celebs$hotness ~ celebs$IQ, col = "red")
abline(a = 235, b = -1, col = "blue")

In reality, things are often not so simple. When you google the above search terms you will find all kinds of examples, e.g. the so-called obesity paradox (an apparent preventive effect of obesity on mortality in individuals with cardiovascular disease (CVD)), a supposed health-protective effect of neuroticism or biased deep learning predictions of lung cancer.

As a takeaway: if a statistical result implies a relationship that seems too strange to be true, it possibly is! To check whether collider bias might be present, check whether sampling was conducted on the basis of a variable that is influenced by the variables that seem to be correlated. Otherwise, you might not only falsely conclude that beautiful people are generally stupid and intelligent people ugly…



Tempered MCMC for Multimodal Posteriors


[This article was first published on R on , and kindly contributed to R-bloggers.]

Previous Posts

This is part of a sequence of posts chronicling my journey to manually implement as many MCMC samplers as I can from scratch. Code from previous posts can be found on GitHub. Also, I tweet more than I should: StableMarkets.

The Multimodal posterior

I wanted to write up my own implementation of coupled MCMC chains using a tempered posterior along with an animation of the process. This is a classic sampling strategy used to deal with multi-modal posteriors. Here I have a tri-modal target posterior:

\[ p(\theta \mid D) = \frac{1}{3}N(-20,1) + \frac{1}{3}N(0,1) + \frac{1}{3}N(20,1)\]

The density looks like this:
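If you want to recreate the density plot, a quick sketch (my own illustration, not code from the original post) is:

target <- function(theta) {
  (dnorm(theta, -20, 1) + dnorm(theta, 0, 1) + dnorm(theta, 20, 1)) / 3
}
curve(target(x), from = -30, to = 30, n = 1000,
      xlab = expression(theta), ylab = "posterior density")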

Notice the regions of flat posterior density at about \((-15,-5)\) and \((5,15)\)…these are often referred to as “bottlenecks”.

Problems with standard MH

These bottlenecks cause standard MCMC algorithms like Metropolis-Hastings (MH) to get stuck at one of these modes. Suppose at iteration \(t\) of a standard MH sampler, the current value of the parameter is \(\theta^{(t-1)} = -5\). Suppose we use a Gaussian jumping distribution, so that we propose \(\theta^{(t)}\) from \(\theta^{(t)} \sim N(-5, \sigma)\). Let’s say that \(\sigma = 1\), so the proposal distribution is proportional to the green density below.

It’s clear here that we’re almost never going to propose draws from the other two modes with this jumping distribution. The vast majority of proposals to the left will end up in the bottlenecks and get rejected. We could increase \(\sigma\) so that the proposal distribution is wide enough to jump over these bottlenecks. However, we know that in MH, increasing \(\sigma\) tends to reduce the acceptance probability in general. So maybe that helps us explore the other two modes, but we won’t be accepting frequently – slowing down how efficiently the chain explores the posterior.
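To make this concrete, here is a minimal random-walk MH sketch (again my own illustration, reusing target() from the sketch above, not the author's code); started at -5 with \(\sigma = 1\), the chain essentially never leaves the middle mode:

mh <- function(n_iter, theta0, sigma) {
  theta <- numeric(n_iter)
  theta[1] <- theta0
  for (t in 2:n_iter) {
    prop <- rnorm(1, theta[t - 1], sigma)         # Gaussian jumping distribution
    ratio <- target(prop) / target(theta[t - 1])  # MH ratio (symmetric proposal)
    theta[t] <- if (runif(1) < ratio) prop else theta[t - 1]
  }
  theta
}

set.seed(1)
chain <- mh(5000, theta0 = -5, sigma = 1)
range(chain)  # stays around the middle mode; the outer modes are never visited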

The Tempered Posterior

The idea behind tempering is to have two chains: one that explores the tempered posterior and another that explores the posterior itself. Ideally, the tempered posterior won’t have these bottlenecks, so a chain exploring it won’t have trouble getting from mode to mode. Then, we can propose that the chain exploring the posterior jump to the current state of the tempered chain. This increases the chance of our chain of interest jumping to other modes.

So when we say “tempered” we mean raising the posterior to some power (temperature) \(T\): \(p(\theta \mid D)^T\). Let’s see what \(p(\theta \mid D)^T\) looks like (proportional to gray density):
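As a rough illustration (the temperature \(T = 0.05\) here is my choice, not necessarily the one used for the gray density):

temp <- 0.05
curve(target(x)^temp, from = -30, to = 30, n = 1000,
      xlab = expression(theta), ylab = "tempered density (unnormalized)")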

Notice that the tempered posterior has no bottlenecks, so an MH chain exploring this distribution won’t get stuck the way it does in the original posterior. We now set up two chains: one exploring the tempered posterior and another exploring the posterior – both with standard MH updates. In each iteration, once we’ve updated the two chains, we propose a swap between them that is accepted with some probability. We say that the chains “meet” when these swaps occur. That is, we’ve in a sense “coupled” the chains.
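A bare-bones sketch of this coupled scheme (my own simplification of the description above, reusing target() and the temperature from the earlier sketches; the swap acceptance ratio follows from the joint target \(p(\theta_{cold}) \, p(\theta_{hot})^T\)):

coupled_mh <- function(n_iter, temp = 0.05, sigma = 1) {
  cold <- hot <- numeric(n_iter)
  cold[1] <- hot[1] <- -5
  for (t in 2:n_iter) {
    # standard MH update for the chain targeting the posterior p
    prop <- rnorm(1, cold[t - 1], sigma)
    cold[t] <- if (runif(1) < target(prop) / target(cold[t - 1])) prop else cold[t - 1]
    # standard MH update for the chain targeting the tempered posterior p^temp
    prop <- rnorm(1, hot[t - 1], sigma)
    hot[t] <- if (runif(1) < (target(prop) / target(hot[t - 1]))^temp) prop else hot[t - 1]
    # propose swapping the two current states, accepted with the Metropolis
    # ratio for the joint target p(cold) * p(hot)^temp
    if (runif(1) < (target(hot[t]) / target(cold[t]))^(1 - temp)) {
      tmp <- cold[t]; cold[t] <- hot[t]; hot[t] <- tmp
    }
  }
  list(cold = cold, hot = hot)
}

set.seed(1)
chains <- coupled_mh(5000)
range(chains$cold)  # the cold chain can now reach all three modes via the swaps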

Linking both chains

Above is a gif of this playing out over 200 iterations. The gray chain is the standard MH chain (not including the swaps) that explores the tempered distribution. The blue chain is the chain exploring the posterior. The red dots indicate values of the blue chain that are swaps from the tempered chain. I.e. at these red points, the chains meet. Notice that the blue chain now easily hops between the modes by occasionally jumping to the gray chain.

How this scales to higher dimensions was – and to some extent still is – a topic of much research. The choice of temperatures is crucial. Often, we need to use several chains, not just the two we used here.

Some references: Altekar (2004) is a nice outline and has references to seminal works by Geyer, Gilks and Roberts, etc. I based my sampler on the math they provide in the paper. This post by Darren Wilkinson on Metropolis-coupled MCMC is also a really nice treatment of the topic.


Comparing Machine Learning Algorithms for Predicting Clothing Classes: Part 4


[This article was first published on R Views, and kindly contributed to R-bloggers.]

Florianne Verkroost is a Ph.D. candidate at Nuffield College at the University of Oxford. She has a passion for data science and a background in mathematics and econometrics. She applies her interdisciplinary knowledge to computationally address societal problems of inequality.

This is the fourth and final post in a series devoted to comparing different machine learning methods for predicting clothing categories from images using the Fashion MNIST data by Zalando. In the first post, we prepared the data for analysis and built a Python deep learning neural network model to predict the clothing categories of the Fashion MNIST data. In Part 2, we used principal components analysis (PCA) to compress the clothing image data down from 784 to just 17 pixels. In Part 3 we saw that gradient-boosted trees and random forests achieve relatively high accuracy on dimensionality-reduced data, although not as high as the neural network. In this post, we will fit a support vector machine, compare the findings from all models we have built and discuss the results. The R code for this post can be found on my Github repository.

Support Vector Machine

Support vector machines (SVMs) provide another method for classifying the clothing categories in the Fashion MNIST data. To better understand what SVMs entail, we’ll have to go through some more complex explanations – mainly summarizing James et al. (2013) – so please bear with me! The figure below might help you in understanding the different classifiers I will discuss in the next sections (figures taken from here, here and here).

For an \(n \times p\) data matrix and binary outcome variable \(y_i \in \{-1, 1\}\), a hyperplane is a flat affine subspace of dimension \(p - 1\) that divides the \(p\)-dimensional space into two halves; it is defined by \(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p = 0\). An observation in the test data is assigned an outcome class depending on which side of the perfectly separating hyperplane it lies on, assuming that such a hyperplane exists. A cutoff \(t\) for an observation’s score \(\hat{f}(X) = \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_p\) determines which class it is assigned to. The further an observation lies from the hyperplane (where the score is zero), the more confident the classifier is about the class assignment. If a separating hyperplane exists, an infinite number of them can be constructed. A good option in this case is the maximal margin classifier (MMC), which maximizes the margin around the midline of the widest strip that can be inserted between the two outcome classes.

If a perfectly separating hyperplane does not exist, an “almost separating” hyperplane can be used by means of the support vector classifier (SVC). The SVC extends the MMC in that it does not require the classes to be separable by a linear boundary: it includes slack variables \(\epsilon_i\) that allow some observations to be on the incorrect side of the margin or hyperplane. The extent to which such incorrect placements are allowed is determined by the tuning parameter cost \(C \geq \sum_{i=1}^{n} \epsilon_i\), which thereby controls the bias-variance trade-off. The SVC is preferable over the MMC as it is more confident in class assignments due to the larger margins and ensures greater robustness, since only observations on the margin or violating the margin affect the hyperplane (James et al., 2013).
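As a quick toy illustration of the cost parameter (my own example, not part of the original analysis; note that the cost argument in e1071 and kernlab penalizes margin violations, so it plays roughly the inverse role of the budget \(C\) in the James et al. formulation):

library(e1071)  # used here for illustration only; the post itself uses caret with kernlab
set.seed(1)
toy <- data.frame(
  x1 = c(rnorm(25, 0), rnorm(25, 2)),
  x2 = c(rnorm(25, 0), rnorm(25, 2)),
  y  = factor(rep(c("A", "B"), each = 25))
)
fit_soft <- svm(y ~ ., data = toy, kernel = "linear", cost = 0.01)  # wide margin, many violations
fit_hard <- svm(y ~ ., data = toy, kernel = "linear", cost = 100)   # narrow margin, few violations
c(soft = fit_soft$tot.nSV, hard = fit_hard$tot.nSV)  # typically fewer support vectors as cost grows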

Both MMCs and SVCs assume a linear boundary between the two classes of the outcome variable. Non-linearity can be addressed by enlarging the feature space using functions of the predictors. Support vector machines combine SVCs with non-linear (e.g. radial, polynomial or sigmoid) Kernels \(K(x_i, x_{i'})\) to achieve efficient computations. Kernels are generalizations of inner products that quantify the similarity of two observations (James et al., 2013). Usually, the radial Kernel is selected for non-linear models as it provides a good default in the absence of prior knowledge of invariances regarding translations. The radial Kernel is defined as \(K(x_i, x_{i'}) = \exp(-\sigma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2)\), where \(\sigma\) is a positive constant that makes the fit more non-linear as it increases. Tuning \(C\) and \(\sigma\) is necessary to find the optimal trade-off between reducing the number of training errors (which increasing the cost \(C\) does) and keeping the decision boundary from becoming too irregular. As SVMs only require the computation of \(\binom{n}{2}\) Kernel evaluations for all distinct pairs of observations, they are computationally efficient.
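To make the Kernel definition concrete, here is a small check I added (not part of the original analysis) that kernlab’s rbfdot() – the Kernel behind caret’s svmRadial method used below – matches the formula above:

library(kernlab)
x1 <- c(1, 2, 3)
x2 <- c(2, 1, 4)
sigma <- 0.04
rbf <- rbfdot(sigma = sigma)     # radial basis Kernel function
rbf(x1, x2)                      # Kernel value computed by kernlab
exp(-sigma * sum((x1 - x2)^2))   # manual evaluation of K(x, x') -- same value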

As mentioned before, the parameters that need to be tuned are the cost C and, in the case of a radial Kernel, the non-linearity constant sigma. Let’s start by tuning these parameters using a random search algorithm, again making use of the caret framework. We set the controls to perform five repeats of 5-fold cross-validation and we use the multiClassSummary() function from the MLmetrics library to perform multi-class classification. We specify a radial Kernel, use accuracy as the performance metric1 and let the algorithm perform a random search for the cost parameter C over pca.dims (= 17) random values. Note that the random search algorithm only searches for values of C while keeping a constant value for sigma. Also, contrary to previous calls to trainControl(), we now set classProbs = FALSE because the base package used for estimating SVMs in caret, kernlab, leads to lower accuracies when specifying classProbs = TRUE due to using a secondary regression model (also check this link for the Github issue).

We begin with training the support vector machine using the PCA reduced training and test data sets train.images.pca and test.images.pca constructed in Part 2.

library(MLmetrics)
svm_control = trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           classProbs = FALSE,
                           allowParallel = TRUE,
                           summaryFunction = multiClassSummary,
                           savePredictions = TRUE)
set.seed(1234)
svm_rand_radial = train(label ~ .,
                        data = cbind(train.images.pca, label = train.classes),
                        method = "svmRadial",
                        trControl = svm_control,
                        tuneLength = pca.dims,
                        metric = "Accuracy")
svm_rand_radial$results[, c("sigma", "C", 'Accuracy')]

We can check the model performance on both the training and test sets by means of different metrics using a custom function, model_performance, which can be found in this code on my Github.

mp.svm.rand.radial = model_performance(svm_rand_radial, train.images.pca, test.images.pca,
                                       train.classes, test.classes, "svm_random_radial")

The results show that the model is achieving relatively high accuracies of 88% and 87% on the training and test sets respectively, selecting sigma = 0.040 and C = 32 as the optimal parameters. Let’s have a look at which clothing categories are best and worst predicted by visualizing the confusion matrix. First, let’s compute the predictions for the training data. We need to use the out-of-bag predictions contained in the model object (svm_rand_radial$pred) rather than the manually computed in-sample (non-out-of-bag) predictions for the training data computed using the predict() function. Object svm_rand_radial$pred contains the predictions for all tuning parameter values specified by the user. However, we only need those predictions belonging to the optimal tuning parameter values. Therefore, we subset svm_rand_radial$pred to only contain those predictions and observations in indices rows. Note that we convert svm_rand_radial$pred to a data.table object to find these indices as computations on data.table objects are much faster for large data (e.g. svm_rand_radial$pred has 4.5 million rows).

library(data.table)
pred_dt = as.data.table(svm_rand_radial$pred[, names(svm_rand_radial$bestTune)])
names(pred_dt) = names(svm_rand_radial$bestTune)
index_list = lapply(1:ncol(svm_rand_radial$bestTune), function(x, DT, tune_opt){
  return(which(DT[, Reduce(`&`, lapply(.SD, `==`, tune_opt[, x])), .SDcols = names(tune_opt)[x]]))
}, pred_dt, svm_rand_radial$bestTune)
rows = Reduce(intersect, index_list)
pred_train = svm_rand_radial$pred$pred[rows]
trainY = svm_rand_radial$pred$obs[rows]
conf = table(pred_train, trainY)

Next, we reshape the confusion matrix into a data frame with three columns: one for the true categories (trainY), one for the predicted categories (pred_train), and one for the proportion of correct predictions for the true category (Freq). We plot this as a tile plot with a blue color scale where lighter values indicate larger proportions of matches between a particular combination of true and predicted categories, and darker values indicate a small proportion of matches between them. Note that we use the custom plotting theme my_theme() as defined in the second blog post of this series.

conf = data.frame(conf / rowSums(conf))
ggplot() +
  geom_tile(data = conf, aes(x = trainY, y = pred_train, fill = Freq)) +
  labs(x = "Actual", y = "Predicted", fill = "Proportion") +
  my_theme() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_continuous(breaks = seq(0, 1, 0.25)) +
  coord_fixed()

We observe from this plot that most of the classes are predicted accurately, as the light blue tiles (high proportions of correct predictions) lie on the diagonal of the tile plot. We can also observe that the categories that are most often mixed up include shirts, tops, pullovers and coats, which makes sense because these are all mostly upper-body garments with similar shapes. The model predicts trousers, bags, boots and sneakers well, given that these rows and columns are particularly dark except for the diagonal element. These results are in agreement with those from the random forest and gradient-boosted trees in the previous blog post of this series.

Next, we repeat the above process for fitting a support vector machine but instead of a random search for the optimal parameters, we perform a grid search. As such, we can prespecify values to evaluate the model at, not only for C but also for sigma. We define the grid values in grid_radial.

grid_radial = expand.grid(sigma = c(.01, 0.04, 0.1), C = c(0.01, 10, 32, 70, 150))
set.seed(1234)
svm_grid_radial = train(label ~ .,
                        data = cbind(train.images.pca, label = train.classes),
                        method = "svmRadial",
                        trControl = svm_control,
                        tuneGrid = grid_radial,
                        metric = "Accuracy")
svm_grid_radial$results[, c("sigma", "C", 'Accuracy')]

mp.svm.grid.radial = model_performance(svm_grid_radial, train.images.pca, test.images.pca,
                                       train.classes, test.classes, "svm_grid_radial")

The grid search selects the same optimal parameter values as the random search (C=32 and sigma = 0.040), therefore also resulting in 88% and 87% training and test accuracies. To get an idea on how C and sigma influence the training set accuracy, we plot the cross-validation accuracy as a function of C, with separate lines for each value of sigma.

ggplot() +
  my_theme() +
  geom_line(data = svm_grid_radial$results, aes(x = C, y = Accuracy, color = factor(sigma))) +
  geom_point(data = svm_grid_radial$results, aes(x = C, y = Accuracy, color = factor(sigma))) +
  labs(x = "Cost", y = "Cross-Validation Accuracy", color = "Sigma") +
  ggtitle('Relationship between cross-validation accuracy and values of cost and sigma')

The plot shows that the green line (sigma = 0.04) has the highest cross-validation accuracy for all values of C except the smaller ones such as 0.01 and 10. Although the accuracy at C = 10 and sigma = 0.1 (blue line) comes close, the highest overall accuracy is achieved at C = 32 and sigma = 0.04 (green line).

Wrapping Up

To compare the models we have estimated throughout this series of blog posts, we can look at the resampled accuracies of the models. We can do this in our case because we set the same seed of 1234 before training each model.2 Essentially, resampling is an important tool to validate our models and to assess to what extent they generalize to data they have not been trained on. We used five repeats of five-fold cross-validation, which means that the training data was divided into five random subsets, that throughout five iterations (“folds”) the model was trained on four of these subsets and tested on the remaining subset (changing with every fold), and that this whole process was repeated five times. The goal of these repetitions of k-fold cross-validation is to reduce the bias in the estimator, given that the folds in non-repeated cross-validation are not independent (as data used for training in one fold is used for testing in another). As we performed five repeats of five-fold cross-validation, we obtain 5*5 = 25 accuracies per model. Let’s compare these resampled accuracies visually by means of a boxplot. First, we create a list of all models estimated, including the random forests, gradient-boosted trees and support vector machines. We then compute the resampled accuracies using the resamples() function from the caret package. From the resulting object, resamp, we only keep the column containing the resample unit (e.g. Fold1.Rep1) and the five columns containing the accuracies for each of the five models. We melt this into long format and from the result, plotdf, we remove the ~Accuracy part from the strings in column Model.

library(reshape2)
model_list = list(rf_rand, rf_grid, xgb_tune, svm_rand_radial, svm_grid_radial)
names(model_list) = c(paste0('Random forest ', c("(random ", "(grid "), "search)"),
                      "Gradient-boosted trees",
                      paste0('Support vector machine ', c("(random ", "(grid "), "search)"))
resamp = resamples(model_list)
accuracy_variables = names(resamp$values)[grepl("Accuracy", names(resamp$values))]
plotdf = melt(resamp$values[, c('Resample', accuracy_variables)],
              id = "Resample", value.name = "Accuracy", variable.name = "Model")
plotdf$Model = gsub("~.*","", plotdf$Model)

Next, we create a boxplot with the estimated models on the x-axis and the accuracy on the y-axis.

ggplot() +
  geom_boxplot(data = plotdf, aes(x = Model, y = Accuracy, color = Model)) +
  ggtitle('Resampled accuracy for machine learning models estimated') +
  my_theme() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = NULL, color = NULL) +
  guides(color = FALSE)

We observe from these box plots that the support vector machines perform best, followed by the gradient-boosted trees and the random forests. Let’s also take a look at the other performance metrics from all models we have looked at.

mp.df = rbind(mp.rf.rand, mp.rf.grid, mp.xgb, mp.svm.rand.radial, mp.svm.grid.radial, mp.svm.grid.linear)
mp.df[order(mp.df$accuracy_test, decreasing = TRUE), ]

After taking measures to reduce overfitting, the convolutional neural network from the first blog post of this series achieved training and test set accuracies of 89.4% and 88.8% respectively. The random and grid search for the best value of mtry in the random forests resulted in the selection of mtry=5. The grid search performed better on the training set than the random search on the basis of all metrics except recall (i.e. sensitivity), and better on the test set on all metrics except precision (i.e. positive predictive value). The test set accuracies achieved by the random search and grid search were 84.7% and 84.8% respectively. The gradient-boosted decision trees performed slightly better than the random forests on all metrics and achieved a test set accuracy of 85.5%. Both tree-based models more often misclassified pullovers, shirts and coats, while correctly classifying trousers, boots, bags and sneakers. The random forests and gradient-boosted trees are however outperformed by the support vector machine with radial Kernel specification with tuning parameter values of C=32 and sigma=0.040: this model achieved 86.9% test set accuracy upon a random search for the best parameters. The grid search resulted in slightly worse test set performance, but better training set performance in terms of all metrics except accuracy. Nonetheless, none of the models estimated beats the convolutional neural network from the first blog post of this series, neither in performance nor computational time and feasibility. However, the differences in test set performance are only small: the convolutional neural network achieved 88.8% test set accuracy, compared to 86.9% test set accuracy achieved by the support vector machine with radial Kernel. This shows that we do not always need to resort to deep learning to obtain high accuracies, but that we can also perform image classification to a reasonable standard using basic machine learning models with dimensionality-reduced data.


  1. Just as a side note, accuracy may not be a good model performance metric in some cases. As the Fashion MNIST data has balanced categories (i.e. each category has the same number of observations), accuracy can be a good measure of model performance. However, in the case of unbalanced data, accuracy may be a misleading metric (“accuracy paradox”). Imagine for example that in a binary classification problem of 100 instances, there are 99 observations of class 0 and 1 observation of class 1. If the model predicts class 0 for every observation, it performs with 99% accuracy while never detecting class 1 (see the short numeric check after these notes). As this may be misleading, recall and precision are often used instead. Have a look at this blog post if you are unsure what these performance metrics entail.↩

  2. Note that in order to compare the resampled accuracies of different models, they need to have been trained with the same seed, and they need to have the same training method and control settings as specified in the trainControl() function. In our case, the method used is repeatedcv, and so all models should have been trained with five repeats (repeats = 5) of five-fold cross-validation (number = 5). Note that the gradient-boosted model in the previous post of this series was trained with non-repeated five-fold cross-validation (method = "cv"). In order to compare this model with the random forests and support vector machines, the method in trainControl() should be changed to method = "repeatedcv" and the number of repeats should be five: repeats = 5. This should be the same for all models trained in order to compute resampled accuracies.↩
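The numeric check referenced in the first footnote (an illustration I added, not part of the original post):

truth <- factor(c(rep(0, 99), 1), levels = c(0, 1))  # 99 observations of class 0, 1 of class 1
pred  <- factor(rep(0, 100), levels = c(0, 1))       # a model that always predicts class 0
mean(pred == truth)                                  # accuracy = 0.99, yet class 1 is never detected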


rOpenSci community calls


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

This is a short PSA about an R resource that I recently learnt about (and participated in): rOpenSci community calls. According to the website, these community calls happen quarterly and are a place where the public can learn about “best practices, new projects, Q&As with well known developers, and… rOpenSci developments”.

I heard about the most recent community call (“Maintaining an R package”) via an announcement on R-bloggers. The topic was of personal interest to me and the panelists were experienced/interesting enough that I felt I could learn a lot by participating. For reference, here were the speakers/panelists for this call:

  • Julia Silge, Data scientist & software engineer @ RStudio
  • Elin Waring, Professor of Sociology and Interim Dean of health sciences, human services and nursing @ Lehman College, CUNY
  • Erin Grand, Data scientist @ Uncommon Schools
  • Leonardo Collado-Torres, Research scientist @ Lieber Institute for Brain Development
  • Scott Chamberlain, Co-founder and technical lead @ rOpenSci

Here are some of my quick observations of the event as a participant:

  • The calls are publicly hosted on Zoom, which made it really easy to join. Overall the video and sound quality was good and clear enough that I wasn’t straining to hear the speakers.
  • At the beginning of the call, Stefanie, the community manager hosting this call, suggested that those who were comfortable share their video so that we could put faces to names. That was a small, simple touch that made the call more personal!
  • As the call is happening, attendees can collaboratively update a shared document capturing the key points of the discussion. It is then made publicly available soon after the call is over. (As an example, this is the collaborative document of the call I attended.)
  • Through the collaborative document, not only could participants ask the speakers questions, but other participants could answer and comment on those questions as well!
  • rOpenSci does a really good job of recording different aspects of the call and archiving them for future reference. Each call has its own webpage with all the resources associated with it. For the call I attended, all the resources are here. There is a list of resource links (including one for the collaborative notes), as well as a video recording of the call itself!

I enjoyed listening in on the call and am very much looking forward to the next one! I hope that you will consider joining in as well.

For the full list of rOpenSci community calls, click here.


New package RcppDate 0.0.1 now on CRAN!


[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new small package with a new C++ header library is now on CRAN. It brings the date library by Howard Hinnant to R. This library has been in pretty widespread use for a while now, and adds to C++11/C++14/C++17 what will become (with minor modifications) the ‘date’ library in C++20. I had been aware of it for a while, but had not needed it thanks to the CCTZ library out of Google and our RcppCCTZ package. And like CCTZ, it builds upon std::chrono, adding a whole lot of functionality and usability enhancements. But as some upcoming (and quite exciting!) changes in nanotime required it, I had a reason to set about packaging it as RcppDate. And after a few days of gestation and review it is now available via CRAN.

Two simple example files are included and can be accessed by Rcpp::sourceCpp(). Some brief excerpts follow.

The first example shows three date constructors. Note how the month (and the leading digits) are literals. No quotes for strings anywhere. And no format (just like our anytime package for R).

constexpr auto x1 = 2015_y/March/22;
constexpr auto x2 = March/22/2015;
constexpr auto x3 = 22_d/March/2015;

Note that these are constexpr that resolve at compile-time, and that the resulting year_month_day type is inferred via auto.

A second example constructs the last day of the months similarly:

constexpr auto x1 = 2015_y/February/last;
constexpr auto x2 = February/last/2015;
constexpr auto x3 = last/February/2015;

For more, see the copious date.h documentation.

The (very bland first) NEWS entry (from a since-added NEWS file) for the initial upload follows.

Changes in version 0.0.1 (2020-01-17)

  • Initial CRAN upload of first version

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


baRcodeR now on rOpenSci + online barcode PDF generation


[This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers.]

Some major changes have occurred to baRcodeR since the last post on version 0.1.2 to ease the process of making printable labels like below.

As of baRcodeR 0.1.5:

  1. After extremely helpful reviews, baRcodeR was accepted as part of the rOpenSci project.

  2. Online documentation for the package is now available, generated with pkgdown and hosted by rOpenSci.

  3. An interactive GUI in the form of a shiny RStudio addin is now available as part of the package. Users can input parameters and preview label output before creation. Code snippets are provided for reproducibility of output. See sample screenshots here.

  4. An online Shiny app is now available on shinyapps.io. Users can now generate their labels and PDF sticker sheets without having to install baRcodeR locally. The interface is similar to the RStudio addin, except that user files have to be downloaded and saved. Due to possible size constraints, all uploads and created files by the user are deleted at the end of the session.


Shiny apps with math exercises


[This article was first published on R-bloggers on Mikkel Meyer Andersen, and kindly contributed to R-bloggers.]

It is often very useful to practise mathematics with automatically generated exercises. One approach is multiple choice quizzes (MCQ), but it turns out to be fairly difficult to generate convincing wrong answers. Instead, we want the user to input an answer, and we want to be able to parse that answer and check whether it is correct. There are many fun challenges in this, e.g. verifying that 2 is equal to 1 + 1 (as text strings the two are different, but mathematically they are equal, at least to a convenient approximation in this case).

In this post I will demonstrate how to use the R package Ryacas (a computer algebra system, CAS) and the R package iomath, which is under development, as well as shiny, to make a small, powerful Shiny app. The resulting app is available at https://github.com/r-cas/shinymathexample.

First I will show the app, and then I will show a few central lines of code.

The shinymathexample app

First, the shinymathexample app presents the question:

The answer can then be written and checked:

It even works for mathematical/numerical equality, not just text/string equality:

Finally wrong answers are caught, too:

Exercise generation

The exercise generation code (boiled down) is something like this:

choices_x_coef <- c("a", "2*a", "3*a")
choices_x_pow <- 1:3

generate_f <- function() {
  x_coef <- sample(choices_x_coef, 1)
  x_pow <- sample(choices_x_pow, 1)
  x_part <- paste0(x_coef, "*x^", x_pow)
  eq <- ysym(x_part)
  eq
}

problem_f_eq <- generate_f()
true_ans <- list(
  x = deriv(problem_f_eq, "x")
)

output$problem <- renderUI({
  problem <- paste0("Let $$f(x) = ", tex(problem_f_eq), ".$$",
                    "Calculate the derivative with respect ",
                    "to \\(x\\) and enter the result below.")

  res <- withMathJax(
    helpText(problem)
  )

  return(res)
})

Validation

The validation code (boiled down) is something like this:

reply <- input$answer_x
parsed_input <- iomath::prepare_input(reply)
if (inherits(parsed_input, "error")) {
  stop("Could not prepare the input (remember that I'm simple-minded!).")
}

reply_sym <- tryCatch(Ryacas::ysym(parsed_input),
                      error = function(e) e)
if (inherits(reply_sym, "error")) {
  stop("Could not understand the input (remember that I'm simple-minded!).")
}

compare_grid <- expand.grid(
  x = seq(-10, 10, len = 6),
  a = seq(-10, 10, len = 6))

is_correct <- tryCatch(iomath::compare_reply_answer(reply = reply,
                                                    answer = true_ans$x,
                                                    compare_grid = compare_grid),
                       error = function(e) e)
if (inherits(is_correct, "error")) {
  stop(paste0("Error: ", is_correct$message))
}
is_correct # TRUE/FALSE

Remarks

Take a look at the complete code at https://github.com/r-cas/shinymathexample.

The provided example should illustrate that it is fairly easy to make something relatively sophisticated.

Besides the central role of Ryacas (e.g. for derivatives etc.), iomath has the important function compare_reply_answer() that compares reply to answer over the grid of values defined by compare_grid. Thus, equality of expressions is measured as point-wise equality over a finite number of points (e.g. 100) for different values of the variables, with an allowed tolerance.
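The underlying idea can be sketched in base R alone (an illustration I added; the strings below are made up, and iomath handles the parsing and safety checks that this sketch skips):

reply_str  <- "2*a*x"        # what the student typed (hypothetical)
answer_str <- "a*x + x*a"    # the true derivative, written differently
grid <- expand.grid(x = seq(-10, 10, length.out = 6),
                    a = seq(-10, 10, length.out = 6))
reply_vals  <- eval(parse(text = reply_str),  envir = grid)
answer_vals <- eval(parse(text = answer_str), envir = grid)
isTRUE(all.equal(reply_vals, answer_vals, tolerance = 1e-8))  # TRUE: point-wise equal on the grid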


How to Connect RStudio with Git (Github)


[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

This video explains how to connect RStudio with Git (Github) for a better R programming / software development workflow. This could be anything from updating a package file to managing a simple repo. The video also shows how you can clone a repo, commit a change and push it back to its master on Github.

Youtube Link: https://www.youtube.com/watch?v=lXwH2R4n3RQ



Data Science Courses for Economists and Epidemiologists using RTutor


[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers.]

Were there no coronavirus pandemic, German universities would regularly start the summer semester in around a month (soon after Easter). Now, it seems likely that courses will be offered digitally and students will have to learn from home.

If you have a course that uses R, you may take a look at RTutor. It allows students to solve interactive problem sets at home. They can test their solutions, get automatic hints and then submit their solution for automatic grading.

To get some material and ideas you can take a look at the following three courses using RTutor:

  • A data science project course taught by Alex Rieber for business and economics students at Ulm University. Before students work on their own data science projects, they learn basic skills in R, including tidyverse data wrangling and econometric and machine learning basics, via several RTutor problem sets. Alex published the problem sets and other course material here on Github. On the Github pages you will also find links that allow you to test the problem sets on RStudio Cloud. The course is in German, but Alex has already started to make an English version of the problem sets, which we will publish once finished.

  • Jade Benjamin-Chung from the UC Berkeley School of Public Health has created online tutorials with RTutor for an introductory R course for epidemiologists. Here is the course page. If you click on a tutorial, the corresponding RTutor problem set can be solved directly on shinyapps.io. There is no need to log in.

  • I have published the RTutor problem sets and other material from my empirical industrial organization class in this Github repository. You can work directly on the problem sets here on rstudio.cloud. The course focuses a lot on estimating demand functions, but the R problem sets also cover other material, like data wrangling with dplyr.

In addition, you can find on the RTutor page many interactive replications of interesting economic and interdisciplinary research papers, e.g. about the effects of water pollution on cancer, an environmental assessment of driving electric cars, the effect of soap operas on fertility, a study of how better contracts could reduce traffic jams, the effects of CO2 pricing on firm relocation, an assessment of free trade agreements, and more…

For our courses, Alex and I have not included on Github the Rmd solution files from which the RTutor problem sets were created. (We want to avoid students simply copying those solutions.) If you are a lecturer who is interested in using and modifying these problem sets, just send Alex or me an email and we can send you these files. Alex has furthermore developed a multiple choice test exam based on these problem sets, which you can also receive upon request.

If you are thinking of having your students study at home with RTutor, you may also take a look at two older blog posts. This one describes how you can automatically grade submitted problem sets. That one compares RTutor with learnr.

If you are using RTutor and perhaps want to share some course material please let Alex or me know! We are happy to get insights from your RTutor usage, and if you like, we can put a link in a blog post or on the RTutor website.


Probability and Bayesian modeling [book review]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]

Probability and Bayesian modeling is a textbook by Jim Albert and Jingchen Hu that CRC Press sent me for review in CHANCE. (The book is also freely available in bookdown format.) The level of the textbook is definitely at the most introductory end, as it dedicates its first half to probability concepts (with no measure theory involved), meaning mostly focusing on counting and finite sample space models. The second half moves to Bayesian inference(s) with a strong reliance on JAGS for the processing of more realistic models. And R vignettes for the simplest cases (where I discovered R commands I had ignored, like dplyr::mutate()!).

As a preliminary warning about my biases, I am always reserved about mixing introductions to probability theory and to (Bayesian) statistics in the same book, as I feel they should be separated to avoid confusion, as for instance between histograms and densities, or between (theoretical) expectation and (empirical) mean. I therefore fail to relate to the pace and tone adopted in the book which, in my opinion, seems to dally on overly simple examples [far too often concerned with food or baseball] while skipping over the concepts and background theory. For instance, introducing the concept of subjective probability as early as page 6 is laudable, but I doubt it will engage fresh readers when describing it as a measurement of one’s “belief about the truth of an event”, then stressing that “[to] make any kind of measurement, one needs a tool like a scale or ruler”. Overall, I have no particularly focused criticisms on the probability part except for the discrete vs continuous imbalance. (With the Poisson distribution not covered in the Discrete Distributions chapter. And the “bell curve” making a weird and unrigorous appearance there.) Galton’s board (no mention found of quincunx) could have been better exploited towards the physical definition of a prior, following Stephen Stigler’s analysis, by adding a second level. Or turned into an R coding exercise. In the continuous distributions chapter, I would have had the cdf come before the pdf, rather than the opposite. And I disliked the notion that a Normal distribution was supported by an histogram of (marathon) running times, i.e. values lower bounded by 122 (at the moment). Or later (in Chapter 8) for Roger Federer’s serving times. Incidentally, a fun typo on p.191, at least fun for LaTeX users, as

f_{Y\ mid X}

with an extra space between `\’ and `mid’! (I also noticed several occurrences of the unavoidable “the the” typo in the last chapters.) The simulation from a bivariate Normal distribution is hidden behind a customised R function sim_binom() when it could have been easily described as a two-stage hierarchy. And there is no comment on the fact that a sample from Y-1.5X could be directly derived from the joint sample. (Too unconscious a statistician?!)

When moving to Bayesian inference, a large section is spent on very simple models like estimating a proportion or a mean, covering both discrete and continuous priors. And strongly focusing on conjugate priors despite giving warnings that they do not necessarily reflect prior information or prior belief. With some debatable recommendation for “large” prior variances as weakly informative or (worse) for Exp(1) as a reference prior for sample precision in the linear model (p.415). But also covering Bayesian model checking either via prior predictive (hence Bayes factors) or posterior predictive (with no mention of using the data twice). A very marginalia in introducing a sufficient statistic for the Normal model. In the Normal model checking section, an estimate of the posterior density of the mean is used without (apparent) explanation.

“It is interesting to note the strong negative correlation in these parameters. If one assigned informative independent priors on β⁰ and β¹, these prior beliefs would be counter to the correlation between the two parameters observed in the data.”

For the same reasons of having to cut on mathematical validation and rigour, Chapter 9 on MCMC does not explain why MCMC algorithms converge outside of the finite state space case. The proposal in the algorithmic representation is chosen as a Uniform one, since larger dimensional problems are handled by either Gibbs or JAGS. The recommendations about running MCMC do not include how many iterations one “should” run (or other common queries on Stack eXchange), albeit they do include the sensible advice of running multiple chains and comparing simulated predictive samples with the actual data as a model check. However, the MCMC chapter very quickly and inevitably turns into commented JAGS code, which I presume would require more from the students than just reading the available code. Like the JAGS manual. Chapter 10 is mostly a series of examples of Bayesian hierarchical modeling, with illustrations of the shrinkage effect like the one on the book cover. Chapter 11 covers simple linear regression with some mentions of weakly informative priors, although in a BUGS spirit of using large [enough?!] variances: “If one has little information about the location of a regression parameter, then the choice of the prior guess μ is not that important and one chooses a large value for the prior standard deviation s. So the regression intercept and slope are each assigned a Normal prior with a mean of 0 and standard deviation equal to the large value of 100.” (p.415). Regardless of the scale of y? Standardisation is covered later in the chapter (with the use of the R function scale()) as part of constructing more informative priors, although this sounds more like data-dependent priors to me in the sense that the scale and location are summarily estimated by empirical means from the data. The above quote also strikes me as potentially confusing to the students, as it does not spell out at all how to design a joint distribution on the linear regression coefficients that translates the concentration of these coefficients along y̅=β⁰+β¹x̄. Chapter 12 expands the setting to multiple regression and generalised linear models, mostly consisting of examples. It however suggests using cross-validation for model checking and then advocates DIC (deviance information criterion) “to approximate a model’s out-of-sample predictive performance” (p.463), if only because it is covered in JAGS, with the definition of the criterion relegated to the last page of the book. Chapter 13 concludes with two case studies, the (often used) Federalist Papers analysis and a baseball career hierarchical model, which may sound far-reaching considering the modest prerequisites the book started with.

In conclusion of this rambling [lazy Sunday] review, this is not a textbook I would have the opportunity to use in Paris-Dauphine, but I can easily conceive of its adoption for students with limited maths exposure. As such it offers a decent entry to the use of Bayesian modelling, supported by a specific software (JAGS), and rightly stresses the call for model checking and comparison with pseudo-observations. Provided the course is reinforced with a fair amount of computer labs and projects, the book can indeed properly introduce students to Bayesian thinking. Hopefully leading them to seek more advanced courses on the topic.


Tuning random forest hyperparameters with #TidyTuesday trees data


[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers.]

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model.

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore the data

Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset. This isn’t this week’s dataset, but it’s one I have been wanting to return to. Because it seems almost wrong not to, we’ll be using a random forest model! 🌳

Let’s build a model to predict which trees are maintained by the San Francisco Department of Public Works and which are not. We can use parse_number() to get a rough estimate of the size of the plot from the plot_size column. Instead of trying any imputation, we will just keep observations with no NA values.

library(tidyverse)

sf_trees <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-28/sf_trees.csv")

trees_df <- sf_trees %>%
  mutate(
    legal_status = case_when(
      legal_status == "DPW Maintained" ~ legal_status,
      TRUE ~ "Other"
    ),
    plot_size = parse_number(plot_size)
  ) %>%
  select(-address) %>%
  na.omit() %>%
  mutate_if(is.character, factor)

Let’s do a little exploratory data analysis before we fit models. How are these trees distributed across San Francisco?

trees_df %>%
  ggplot(aes(longitude, latitude, color = legal_status)) +
  geom_point(size = 0.5, alpha = 0.4) +
  labs(color = NULL)

You can see streets! And there are definitely spatial differences by category.

What relationships do we see with the caretaker of each tree?

trees_df %>%
  count(legal_status, caretaker) %>%
  add_count(caretaker, wt = n, name = "caretaker_count") %>%
  filter(caretaker_count > 50) %>%
  group_by(legal_status) %>%
  mutate(percent_legal = n / sum(n)) %>%
  ggplot(aes(percent_legal, caretaker, fill = legal_status)) +
  geom_col(position = "dodge") +
  labs(
    fill = NULL,
    x = "% of trees in each category"
  )

Build model

We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.

library(tidymodels)

set.seed(123)
trees_split <- initial_split(trees_df, strata = legal_status)
trees_train <- training(trees_split)
trees_test <- testing(trees_split)

Next we build a recipe for data preprocessing.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
  • Next, we update the role for tree_id, since this is a variable we might like to keep around for convenience as an identifier for rows but is not a predictor or outcome.
  • Next, we use step_other() to collapse categorical levels for species, caretaker, and the site info. Before this step, there were 300+ species!
  • The date column with when each tree was planted may be useful for fitting this model, but probably not the exact date, given how slowly trees grow. Let’s create a year feature from the date, and then remove the original date variable.
  • There are many more DPW maintained trees than not, so let’s downsample the data for training.

The object tree_rec is a recipe that has not been trained on data yet (for example, which categorical levels should be collapsed has not been calculated) and tree_prep is an object that has been trained on data.

tree_rec <- recipe(legal_status ~ ., data = trees_train) %>%
  update_role(tree_id, new_role = "ID") %>%
  step_other(species, caretaker, threshold = 0.01) %>%
  step_other(site_info, threshold = 0.005) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_date(date, features = c("year")) %>%
  step_rm(date) %>%
  step_downsample(legal_status)

tree_prep <- prep(tree_rec)
juiced <- juice(tree_prep)

Now it’s time to create a model specification for a random forest where we will tune mtry (the number of predictors to sample at each split) and min_n (the number of observations needed to keep splitting nodes). These are hyperparameters that can’t be learned from data when training the model.

tune_spec <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")

Finally, let’s put these together in a workflow(), which is a convenience container object for carrying around bits of models.

tune_wf <- workflow() %>%
  add_recipe(tree_rec) %>%
  add_model(tune_spec)

This workflow is ready to go. 🚀

Train hyperparameters

Now it’s time to tune the hyperparameters for a random forest model. First, let’s create a set of cross-validation resamples to use for tuning.

set.seed(234)
trees_folds <- vfold_cv(trees_train)

We can’t learn the right values when training a single model, but we can train a whole bunch of models and see which ones turn out best. We can use parallel processing to make this go faster, since the different parts of the grid are independent. Let’s use grid = 20 to choose 20 grid points automatically.

doParallel::registerDoParallel()

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20
)

tune_res
## #  10-fold cross-validation
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes
##  1                      Fold01
##  2                      Fold02
##  3                      Fold03
##  4                      Fold04
##  5                      Fold05
##  6                      Fold06
##  7                      Fold07
##  8                      Fold08
##  9                      Fold09
## 10                      Fold10

How did this turn out? Let’s look at AUC.

tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")

This grid did not involve every combination of min_n and mtry, but we can get an idea of what is going on. It looks like higher values of mtry are good (above about 10) and lower values of min_n are good (below about 10). We can get a better handle on the hyperparameters by tuning one more time, this time using grid_regular(). Let’s set ranges of the hyperparameters we want to try, based on the results from our initial tune.

rf_grid <- grid_regular(
  mtry(range = c(10, 30)),
  min_n(range = c(2, 8)),
  levels = 5
)

rf_grid
## # A tibble: 25 x 2
##     mtry min_n
##  1    10     2
##  2    15     2
##  3    20     2
##  4    25     2
##  5    30     2
##  6    10     3
##  7    15     3
##  8    20     3
##  9    25     3
## 10    30     3
## # … with 15 more rows

We can tune one more time, but this time in a more targeted way with this rf_grid.

set.seed(456)
regular_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = rf_grid
)

regular_res
## #  10-fold cross-validation
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes
##  1                      Fold01
##  2                      Fold02
##  3                      Fold03
##  4                      Fold04
##  5                      Fold05
##  6                      Fold06
##  7                      Fold07
##  8                      Fold08
##  9                      Fold09
## 10                      Fold10

What do the results look like now?

regular_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  mutate(min_n = factor(min_n)) %>%
  ggplot(aes(mtry, mean, color = min_n)) +
  geom_line(alpha = 0.5, size = 1.5) +
  geom_point() +
  labs(y = "AUC")

Choosing the best model

It’s much clearer now what the best model is. We can identify it using the function select_best(), and then update our original model specification tune_spec to create our final model specification.

best_auc <- select_best(regular_res, "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)

final_rf
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 20
##   trees = 1000
##   min_n = 2
## 
## Computational engine: ranger

Let’s explore our final model a bit. What can we learn about variable importance, using the vip package?

library(vip)

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(legal_status ~ .,
    data = juice(tree_prep) %>% select(-tree_id)
  ) %>%
  vip(geom = "point")

The private caretaker characteristic is important for classification, as are latitude and longitude. It is interesting that year (i.e. age of the tree) is so important!

Let’s make a final workflow, and then fit one last time, using the convenience function last_fit(). This function fits a final model on the entire training set and evaluates on the testing set. We just need to give this function our original train/test split.

final_wf <- workflow() %>%
  add_recipe(tree_rec) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(trees_split)

final_res %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
## 1 accuracy binary         0.852
## 2 roc_auc  binary         0.950

The metrics for the test set look good and indicate we did not overfit during tuning.

Let’s bind our testing results back to the original test set, and make one more map. Where in San Francisco are there more or less incorrectly predicted trees?

final_res %>%
  collect_predictions() %>%
  mutate(correct = case_when(
    legal_status == .pred_class ~ "Correct",
    TRUE ~ "Incorrect"
  )) %>%
  bind_cols(trees_test) %>%
  ggplot(aes(longitude, latitude, color = correct)) +
  geom_point(size = 0.5, alpha = 0.5) +
  labs(color = NULL) +
  scale_color_manual(values = c("gray80", "darkred"))


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


Beacons of Light…


[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Earlier today, DataIQ unveiled its list of the 100 most influential people in data-driven business, the DataIQ 100, and I’m delighted to report that I was included on that list. It was humbling to be counted among talented individuals such as Harry Powell, Orlando Machado and Tom Smith, all of whom are blazing a trail in using data to make a real difference in the way they drive their businesses. The competition was stiff, with a record-breaking 1,000+ entries for the coveted 100 places, so we must have done something right along the way!

But as well as being a moment of personal pride, it was refreshing to receive good news this week after what feels like a growing sense of doom in the world with the current Covid-19 outbreak. A beacon of light in dark times, if you like. It’s an interesting time for data scientists: the world is witnessing the bravery of medical staff on the frontline dealing with those affected by the illness, while #flattenthecurve trending and tales of retail boom or bust (depending on what sector you’re in) are just two examples of data-driven stories that highlight how our profession is trying to make some sense out of this unfathomable situation.

Initiatives such as the DataIQ 100 help showcase the value and positive impact that data can have on business and situational outcomes.  We at Mango are firm believers in the power of data and (advanced) analytics to drive better decisions, not just in the world of business, but to help the most vulnerable in our society and help combat some of the biggest threats facing our future.

Some time ago, the DataIQ 100 committee asked me, and other members of the 2020 DataIQ 100 list, for our views on the industry’s future, and one of the key themes to emerge from this was skills. The feedback was unanimous: demand will continue to outstrip supply.

At the end of 2019, Mango, alongside Women in Data, conducted its own research into this topic and discovered that over half of data scientists planned on moving roles within the next year.  A lack of support, funding and time available for upskilling were all cited as challenges within the UK data science community – all indications that vital steps need to be taken to assess skills gaps and plan to unite individuals to create effective, skilled teams that can rise to the growing data challenge for businesses.

I hope that the important work data scientists are doing in the background of this current crisis – from work in the pharmaceutical sector to expedite the release of a vaccine, to work in the retail sector to ensure firms can weather this storm or that food supply chains run smoothly – can shine a light on the difference that data can make and encourage others to join the profession in the future.

In the meantime, I am proud to be included among such industry luminaries and hope that, together, we will be able to inspire others to join our crusade.

Here’s my #DataIQ100 profile 

 

The post Beacons of Light… appeared first on Mango Solutions.


To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.


Quick Intro to Reproducible Example in R with reprex


[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This video quickly introduces you to an amazing R package called reprex that helps in generating reproducible examples, which are useful in a lot of places: GitHub issues, Stack Overflow questions and answers, the R-devel mailing list, or simply for sharing your problem with someone, or for teaching!

Link: https://www.youtube.com/watch?v=hnzrDLf9anw
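For readers who prefer text to video, here is a minimal sketch of typical reprex usage (my own illustration, not code taken from the video):

# install.packages("reprex")  # if not yet installed
library(reprex)

# Option 1: copy a few lines of code to the clipboard, then simply run:
# reprex()

# Option 2: pass the code directly as an expression; the rendered,
# ready-to-paste markdown output ends up on the clipboard.
reprex({
  y <- rnorm(5)
  mean(y)
})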


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Programming with R.


The Waffle House Index


[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Waffle House announced it was closing hundreds of stores this week due to SARS-Cov-2 (a.k.a. COVID-19). This move garnered quite a bit of media attention since former FEMA Administrator Craig Fugate used the restaurant chain as both an indicator of the immediate and overall severity of a natural disaster. He’s not the only one (see https://www.ehstoday.com/emergency-management/article/21906815/what-do-waffles-have-to-do-with-risk-management). The original concept was pretty straightforward:

For example, if a Waffle House store is open and offering a full menu, the index is green. If it is open but serving from a limited menu, it’s yellow. When the location has been forced to close, the index is red. Because Waffle House is well prepared for disasters, Kouvelis said, it’s rare for the index to hit red. For example, the Joplin, Mo., Waffle House survived the tornado and remained open.   “They know immediately which stores are going to be affected and they call their employees to know who can show up and who cannot,” he said. “They have temporary warehouses where they can store food and most importantly, they know they can operate without a full menu. This is a great example of a company that has learned from the past and developed an excellent emergency plan.”

SARS-Cov-2 is not a tropical storm, so conditions are a bit different and a tad more complex when it comes to gauging the severity of this particular disaster (mostly caused by inept politicians across the globe), which gave me an idea for how to make the Waffle House Index a proper index, i.e. a “statistical measure of change in a representative group of individual data points.”

In the case of an outbreak, rather than a simple green/yellow/red condition state, using the ratio of closed to open Waffle House locations as a numeric index — [0-1] — seems to make more sense since it may better help indicate:

  • when shelter-in-place became mandatory where a given restaurant is located
  • the severity of SARS-Cov-2-caused symptoms for a given location
  • disruptions in the supply chain for a given location due to SARS-Cov-2

I kinda desperately needed a covidistraction so I set out to see how hard it would be to build such an index metric.

Waffle House lets you find locations via a standard map/search interface. They provide lots of data via that map which can be used to figure out which stores are open and which are closed. There’s a nascent R package which contains all the recipes necessary for the data gathering. However, you don’t need to use it, since it powers wafflehouseindex.us, which collects the data whenever the store closings info changes and provides a snapshot of the latest data daily (direct CSV link).
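As a rough illustration, here is a hedged sketch of how such an index could be computed from the daily snapshot; the column name status and its values "open"/"closed" are assumptions for illustration only, so check the actual header of the CSV first:

library(tidyverse)

waffle <- read_csv("http://wafflehouseindex.us/data/latest.csv")

waffle %>%
  summarise(
    n_locations = n(),
    n_closed    = sum(status == "closed", na.rm = TRUE),  # assumed column name and value
    index       = 100 * n_closed / n_locations            # share of closed locations, in percent
  )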

The historical data will make it to a git repo at some point in the near future.

The current index value is 21.2, which increased quickly after the first value of 18.1 (that event was the catalyst for getting the site up and package done) and the closed locations are on the map at the beginning of the post. I went with three qualitative levels on the gauge mostly to keep things simple.

There will absolutely be more location closings and it will be interesting (and, ultimately, very depressing and likely grave) to see how high the index goes and how long it stays above zero.

FIN

The metric is — for the time being — computed across all stores. As noted earlier, this could be broken down into regional index scores to intuit the aforementioned three indicators on a more local level. The historical data (apart from the first closings announcement) is being saved so it will be possible to go back and compute regional indexes when I’ve got more time.

I shall reiterate that you should grab the data from http://wafflehouseindex.us/data/latest.csv rather than use the R package, since there’s no point in dup’ing the gathering, and the historical data will be up and maintained soon.

Stay safe, folks.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Online R trainings: Learning data science – live and interactive


[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Webinar-Data-Science-mit-R

When concerts and theatre performances are cancelled, fitness studios are closed and offices are moved home because of the coronavirus, you don’t have to give up “live events” completely. Living-room concerts, online yoga and dance classes or home-office opportunities – these offers already exist, and we are also bringing our courses to the internet.

Would you like to take part in a training course that really helps you, despite working from home and limited travel options? Then our online R trainings are exactly the right thing for you!

Especially in the current times it becomes obvious how important it is to seize the opportunities of digitalisation. For this reason, we now offer our popular courses “Introduction to R” and “Machine learning with R” online, to provide you with the knowledge you need to use R productively.

What makes our offer different from other online training courses? 

The presence of our experienced data science trainers. Your individual questions will be answered in direct exchange – this ensures the greatest possible learning success for you with practical relevance despite the virtual training!

Our R trainings are the German-language training program for the data science language R. More than 1,500 participants have already been impressed by the practical experience of our trainers and the structure of our training courses.

Key Facts

Location: In the home office, office or from the balcony: Our webinars offer you the spatial flexibility you need in the current situation.

Price: Per course: € 249,- | In a bundle: € 399,-

Course language: German

Registration possible until one day before the course starts.

Introduction to R: 21.04. – 22.04.2020 | 09:00 to 13:00

The course is intended as an introduction to R and its basic functionalities and facilitates your entry into R with practical tips and exercises. This basic course serves as a starting point for R beginners without in-depth previous knowledge for the further use of R in individual application scenarios.

The goal of the course is to teach you the logic and terminology of the R programming language and to lay the foundation for independent work with R.

More about the course contents.

Machine learning with R: 23.04. – 24.04.2020 | 09:00 to 13:00

Use machine learning and data mining algorithms to develop artificial intelligence applications based on data.

In our course „Machine learning with R“ we give you an insight into algorithms of machine learning and show you how to develop your own models, which challenges you face and how to master them.

By means of practical examples and exercises, we provide you with the skills to independently implement machine learning procedures in R. The preparation of data, the development and training of algorithms and the validation of analysis models: In our course you will learn the central steps of machine learning.

More about the course contents.

Unlock the potential of data science with our online R trainings and become an R expert soon! Register now!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – eoda GmbH.



COVID-19 Data and Prediction for Michigan


[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Every country is facing a global pandemic caused by COVID19 and it’s quite scary for everyone. Unlike any other pandemic we have faced before, COVID19 is providing plenty of quality data in near real time. Making this data available to the general public has helped citizen data scientists share their reports, forecast trends and build real-time dashboards.

Like everyone else, I am curious as to “How long will all this last?”. So, I decided to pull up some data for my state and see if I could build a prediction model.

Getting all the Data Needed

The CDC and your state government websites should be publishing data every day. I got my data from Michigan.gov (click on Detroit). Here is the link to the compiled data on my GitHub.

Visualize Data

Covid19plot

From the above plot we can clearly see that total cases are increasing in an exponential trend, and total deaths seem to follow a similar trend.

Correlation

The correlation between each of the variables is as shown below. We will just use Day and Cases for the model building. The reason is that we want to be able to extrapolate our data to visualize future trends.

             Day     Cases     Daily  Previous    Deaths
Day      1.0000000 0.8699299 0.8990702 0.8715494 0.7617497
Cases    0.8699299 1.0000000 0.9614424 0.9570949 0.9597218
Daily    0.8990702 0.9614424 1.0000000 0.9350738 0.8990124
Previous 0.8715494 0.9570949 0.9350738 1.0000000 0.9004541
Deaths   0.7617497 0.9597218 0.8990124 0.9004541 1.0000000

Build a Model for Total Cases

To build the model, we will first split the data into train and test sets. The split ratio is set at 80%. Next, we build an exponential regression model using the simple lm function. Finally, we can view the summary of the model.

# create samples from the data
samples = sample(1:16, size = 16*0.8)

# build an exponential regression model
model = lm(log(Cases) ~ Day + I(Day^2), data = data[samples,])

# look at the summary of the model
summary(model)

In the below summary we can see that the Day column is highly significant for our prediction and Day^2 is not highly significant. We will still keep it. Our adjusted R-squared is 0.97, indicating the model is significant, and the p-value is less than 0.05.

Note: Don’t bash me about the number of samples. I agree this is not a good amount of samples and I might be overfitting.

Call:
lm(formula = log(Cases) ~ Day + I(Day^2), data = data[samples, ])

Residuals:
     Min       1Q   Median       3Q      Max
-0.58417 -0.13007  0.07647  0.17218  0.56305

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.091554   0.347073  -0.264   0.7979
Day          0.711025   0.104040   6.834 7.61e-05 ***
I(Day^2)    -0.013296   0.006391  -2.080   0.0672 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3772 on 9 degrees of freedom
Multiple R-squared:  0.9806,    Adjusted R-squared:  0.9763
F-statistic:   228 on 2 and 9 DF,  p-value: 1.951e-08

Prediction for New Data

Prediction Time

Now that we have a model, we can do predictions on the test data. In all honesty, I did not intend to make the prediction call this complicated, but here it is. From the prediction, we have calculated the Mean Absolute Error, which indicates that our average error is 114 cases. We are either overestimating or underestimating.

“Seems like Overfitting!!”

results = data.frame(
  actual = data[-samples,]$Cases,
  Prediction = exp(predict(model, data.frame(Day = data$Day[-samples])))
)

# view test results
results
#   actual Prediction
# 1     25   12.67729
# 2     53   40.28360
# 3    110  186.92442
# 4   2294 2646.77897

# calculate mae
Metrics::mae(results$actual, results$Prediction)
# [1] 113.6856
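Since the stated reason for modelling Cases as a function of Day was to be able to extrapolate, here is a hedged sketch (my own addition, not part of the original analysis) of pushing the fitted model one week ahead, reusing the model and data objects built above:

# predict the next 7 days beyond the observed data
future_days <- data.frame(Day = max(data$Day) + 1:7)
future_days$predicted_cases <- exp(predict(model, newdata = future_days))
future_days

Given the overfitting discussed in this post, such extrapolations should be taken with a very large grain of salt.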

Visualize the Predictions

Let’s plot the entire model results, train and test, to see how close we are. The plot seems to show that we are very accurate with our predictions. This might be because of scaling.

Rplot

Now, let’s try a log scale, as shown below. Now we can see that our prediction model was overestimating the total cases. This is also a valuable lesson in how two different charts can present the same results differently.

Rplot02

Conclusion

From the above analysis and model building, we saw how we can predict the number of pandemic cases in Michigan. On further analyzing the model, we found that it was too good to be true, i.e. overfitting. For now, I don’t have a lot of data to work with. I will give this model another try in a week to see how it performs with more data. This would be a good experiment.

Let me know what you think of this, and leave a comment on how you would have done it differently.


To leave a comment for the author, please follow the link and comment on their blog: R – Hi! I am Nagdev.


Watercolors


[This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Moça do corpo dourado Do Sol de Ipanema O seu balançado É mais que um poema (Garota de Ipanema, João Gilberto)

Sometimes I think about the reasons why I spend so much time doing experiments and writing about my discoveries in a blog. Even though the main reason to start this blog was some kind of vanity, today I have a pretty clear idea of why I still keep writing it: to keep my mind tuned. I really enjoy looking for ideas, learning new algorithms, figuring out the way to translate them into code and trying to discover new territories by going a step further. I cannot imagine my life without coding. Many good times in the last years have been spent in front of my laptop, listening to music and drinking a beer. In these strange times, confined at home, coding has become something more important. It keeps me away from the sad news and moves my mind to places where everything is quiet, friendly and perfect. Blogging is my therapy, my mindfulness.

This post is inspired by this post from Softology, an amazing blog I recommend you read. In it, you can find a description of the stepping stone cellular automaton as well as an appealing collection of images generated using this technique. I modified the original algorithm described in the post to create images like these, which remind me of watercolor paintings:

I begin with a 400 x 400 null matrix. After that, I choose a number of random pixels that will act as centers of circles. Around them, I substitute the initial zeros with numbers drawn from a normal distribution whose mean depends on the distance of the pixel to the center. The next step is to apply the stepping stone algorithm. For each pixel, I substitute its value with a weighted average of itself and the value of one of its neighbors, chosen randomly. I always mix the values of the pixels, whereas the original algorithm, as described in Softology’s blog, performs these mixings randomly. Another difference is that I mix values instead of interchanging them, as the original algorithm does. Once I repeat this process a number of times, I pick a nice palette from COLOURLovers and turn the values of the pixels into colors with ggplot:
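For the curious, here is a minimal sketch of the procedure just described; the grid size, number of centers, weights and number of iterations are my own choices for illustration (not the values used in the original code), and I use base image() instead of ggplot for brevity:

set.seed(123)
n <- 200                       # smaller grid than the 400 x 400 used in the post
m <- matrix(0, n, n)

# place a few random centers and fill circles around them with values drawn
# from a normal distribution whose mean decreases with the distance to the center
centers <- cbind(sample(n, 8), sample(n, 8))
for (k in seq_len(nrow(centers))) {
  for (i in 1:n) {
    for (j in 1:n) {
      d <- sqrt((i - centers[k, 1])^2 + (j - centers[k, 2])^2)
      if (d < 40) m[i, j] <- rnorm(1, mean = 40 - d, sd = 5)
    }
  }
}

# stepping-stone-like smoothing: mix every pixel with one randomly chosen neighbour
for (iter in 1:50) {
  di <- sample(c(-1, 0, 1), n * n, replace = TRUE)
  dj <- sample(c(-1, 0, 1), n * n, replace = TRUE)
  ni <- pmin(pmax(as.vector(row(m)) + di, 1), n)
  nj <- pmin(pmax(as.vector(col(m)) + dj, 1), n)
  m  <- 0.5 * m + 0.5 * matrix(m[cbind(ni, nj)], n, n)
}

image(m, col = hcl.colors(100, "Zissou 1"), axes = FALSE)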

The code is here. Let me know if you do something interesting with it. Turning numbers into bright colors: I cannot imagine a better way to spend some hours in these shadowy times.


To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.


Re-Share: vtreat Data Preparation Documentation and Video


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I would like to re-share the vtreat (R version, Python version) data preparation documentation for machine learning tasks.

vtreat is a system for preparing messy real world data for predictive modeling tasks (classification, regression, and so on). In particular it is very good at re-coding high-cardinality string-valued (or categorical) variables for later use.

A nice introductory video lecture on vtreat can be found here, and the latest copy of the lecture slides here. Or, you can check out chapter 8, “Advanced data preparation”, of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019, which covers the use of vtreat.

The vtreat documentation is organized by task (regression, classification, multinomial classification, and unsupervised), language (R or Python) and interface style (design/prepare, or fit/prepare). In particular, the R code now supports variations of the interfaces, allowing users to choose what works best with their coding style: either design/prepare, which is very fluid when combined with wrapr::unpack notation, or fit/prepare (which uses mutable state to organize steps).
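To give a flavour of the classic design/prepare interface in R, here is a minimal sketch of my own (illustrative only, not taken from the linked documentation):

library(vtreat)

set.seed(2020)
d <- data.frame(
  x_cat = sample(c("a", "b", "c", NA), 100, replace = TRUE),  # stands in for a high-cardinality categorical variable
  x_num = c(rnorm(95), rep(NA, 5)),
  y     = rnorm(100),
  stringsAsFactors = FALSE
)

# design a treatment plan for a numeric outcome ...
plan <- designTreatmentsN(d, varlist = c("x_cat", "x_num"), outcomename = "y")

# ... and prepare (re-code) the data for modeling
d_treated <- prepare(plan, d)
head(d_treated)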


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


February 2020: “Top 40” New R Packages


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One hundred sixty-four new packages made it to CRAN in February. Here are my “Top 40” picks in eleven categories: Computational Methods, Data, Genomics, Machine Learning, Mathematics, Medicine, Science, Statistics, Time Series, Utilities, and Visualizations.

Computational Methods

delayed v0.3.0: Implements mechanisms to parallelize dependent tasks in a manner that optimizes the computational resources. Functions produce “delayed computations” which may be parallelized using futures. See the vignette for details.

tergmLite v2.1.7: Provides functions to efficiently simulate dynamic networks estimated with the framework for temporal exponential random graph models implemented in the tergm package.

Data

crsmeta v0.2.0: Provides functions to obtain coordinate system metadata from various data formats including: CRS (Coordinate Reference System), EPSG (European Petroleum Survey Group), PROJ4 and WKT (Well-Known Text 2).

danstat v0.1.0: Implements an interface into the Statistics Denmark Databank API. The vignette provides an Introduction.

osfr v0.2.8: Implements an interface for interacting with OSF which enables users to access open research materials and data, or to create and manage private or public projects. There is a Getting Started Guide and a vignette on Authentication.

Genomics

selectSNPs v1.0.1: Provides a method using unified local functions to select low-density SNPs. See the Vignette for a tutorial.

varitas v0.0.1: Implements a multi-caller variant analysis pipeline for targeted analysis sequencing data. There is an Introduction and a vignette on Errors.

Machine Learning

autokeras v1.0.1: Implements an interface to AutoKeras, an open source software library for automated machine learning. See README for an example.

MTPS v0.1.9: Implements functions to predict simultaneous multiple outcomes based on revised stacking algorithms as described in Xing et al. (2019). See the vignette to get started.

quanteda.textmodels v0.9.1: Implements methods for scaling models and classifiers based on sparse matrix objects representing textual data. It includes implementations of the Laver et al. (2003) wordscores model, Perry & Benoit’s (2017) class affinity scaling model, and the Slapin & Proksch (2008) wordfish model. See the vignette to get started.

SeqDetect v1.0.7: Implements the automaton model found in Krleža, Vrdoljak & Brčić (2019) to detect and process sequences. See the vignette for examples and theory.

studyStrap v1.0.0: Implements multi-study learning algorithms such as Merging, Study-Specific Ensembling (Trained-on-Observed-Studies Ensemble), the Study Strap, and the Covariate-Matched Study Strap, and offers over 20 similarity measures. See Kishida et al. (2019) for background and the vignette for how to use the package.

Mathematics

PlaneGeometry v1.1.0: Provides R6 classes representing triangles, circles, circular arcs, ellipses, elliptical arcs and lines, plot methods, transformations and more. The vignette offers multiple examples.

Medicine

beats v0.1.1: Provides functions to import data from UFI devices and process electrocardiogram (ECG) data. It also includes a Shiny app for finding and exporting heart beats. See README to get started.

NMADiagT v0.1.2: Implements the hierarchical summary receiver operating characteristic model developed by Ma et al. (2018) and the hierarchical model developed by Lian et al. (2019) for performing meta-analysis. It is able to simultaneously compare one to five diagnostic tests within a missing data framework.

SAMBA v0.9.0: Implements several methods, as proposed in Beesley & Mukherjee (2020), for obtaining bias-corrected point estimates along with valid standard errors using electronic health records data with misclassified EHR-derived disease status. See the vignette for details.

Science

baRUlho v1.0.1: Provides functions to facilitate acoustic analysis of (animal) sound transmission experiments including functions for data preparation, analysis and visualization. See Dabelsteen et al. (1993) for background and the vignette for an introduction.

CBSr v1.0.3: Uses monotonically constrained Cubic Bezier Splines to approximate latent utility functions in intertemporal choice and risky choice data. See Lee et al. (2019) for the details.

Statistics

blockCV v2.1.1: Provides functions for creating spatially or environmentally separated folds for cross-validation in spatially structured environments and methods for visualizing the effective range of spatial autocorrelation to separate training and testing datasets as described in Valavi, R. et al. (2019). See the vignette for examples.

BGGM v1.0.0: Implements the methods for fitting Bayesian Gaussian graphical models recently introduced in Williams (2019), Williams & Mulder (2019) and Williams et al. (2019). There are vignettes on Credible Intervals, Plotting Network Structure, Comparing GGMs with the Posterior Predictive Distributions, and Predictability.

metagam v0.1.0: Provides a method to perform the meta-analysis of generalized additive models and generalized additive mixed models, including functionality for removing individual participant data from models computed using the mgcv and gamm4 packages. A typical use case is a situation where data cannot be shared across locations, and an overall meta-analytic fit is sought. For the details see Sorensen et al. (2020), Zanobetti (2000), and Crippa et al. (2018). There is an Introduction and vignettes on Dominance, Heterogeneity Plots, and Multivariate Smooth Terms.

MKpower v0.4: Provides functions for power analysis and sample size calculations for Welch and Hsu t-tests, Wilcoxon rank sum tests and diagnostic tests. See Flahault et al. (2005) and Dobbin & Simon (2007) for background, and the vignette for examples.

mvrsquared v0.0.3: Implements a method to compute the coefficient of determination for outcomes in n-dimensions. See Jones (2019) for the theory and the vignette to get started.

pdynmc v0.8.0: Provides functions to model linear dynamic panel data based on linear and nonlinear moment conditions as proposed by Holtz-Eakin et al.(1988), Ahn & Schmidt (1995), and Arellano & Bover (1995). See the vignette for the underlying theory and a sample session.

Superpower v0.0.3: Provides functions to simulate ANOVA designs of up to three factors, calculate the observed power and average observed effect size for all main effects and interactions. See Lakens & Caldwell (2019) for background, and the vignette for an introduction.

tune v0.0.1: Provides functions and classes for use in conjunction with other tidymodels packages for finding reasonable values of hyper-parameters in models, pre-processing methods, and post-processing steps. Look here for an example.

xrnet v0.1.7: Provides functions to fit hierarchical regularized regression models incorporating potentially informative external data as in Weaver & Lewinger (2019). See README for examples.

Time Series

seer v1.4.1: Implements a framework for selecting time series forecast models based on features calculated from the time series. For details see Talagala et al. (2018).

testcorr v0.1.2: Provides functions for computing test statistics for the significance of autocorrelation in univariate time series, cross-correlation in bivariate time series, Pearson correlations in multivariate series and test statistics for i.i.d. property of univariate series as described in Dalla et al. (2019). See the vignette for the math and examples.

Utilities

bioC.logs v1.1: Fetches download statistics from BioConductor.org. See the vignette.

matricks v0.8.2: Provides functions to help with the creation of complex matrices, along with a plotting function. See the vignette for examples.

rco v1.0.1: Provides functions to automatically apply different strategies to optimize R code. These functions take R code as input, and returns R code as output. There are vignettes on: Contributing an optimizer, Docker files, Common Subexpression Elimination, Constant Folding, Constant Propagation, Dead Code Elimination, Dead Expression Elimination, Dead Store Elimination, and Loop-invariant Code Motion.

slider v0.1.2: Provides type-stable rolling window functions over any R data type and supports both cumulative and expanding windows. See the vignette for examples, and the short usage sketch at the end of this section.

taxadb v0.1.0: Provides fast, consistent access to taxonomic data, and supports common tasks such as resolving taxonomic names to identifiers and looking up higher classification ranks of given species. There is an Introduction and a Schema.

tidyfst v0.8.8: Provides a toolkit of tidy data manipulation verbs with data.table as the backend, combining the merits of syntax elegance from dplyr and computing performance from data.table. There is a vignette written in Chinese, an English-language Introduction and vignettes on join, reshape, nest, fst and dt.

tidytable v0.3.2: Provides an rlang compatible interface to data.table. See README for examples.
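As a quick illustration of the rolling-window idea behind slider mentioned above (a minimal sketch of my own, not taken from the package announcement):

library(slider)

x <- 1:10

# 3-observation trailing mean: the current value plus the two before it
slide_dbl(x, mean, .before = 2)

# cumulative (expanding) mean
slide_dbl(x, mean, .before = Inf)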

Visualization

iNzightTools v1.8.3: Provides wrapper functions for common variable and dataset manipulation workflows primarily used by iNZight, a graphical user interface providing easy exploration and visualization of data for students. Many functions return the tidyverse code used to obtain the result in an effort to bridge the gap between GUI and coding.

IPV v0.1.1: Provides functions to generate item pool visualizations which are used to display the conceptual structure of a set of items. See Dantlgraber et al. (2019) for background and the vignette for examples.

spacey v0.1.1: Provides utilities to download USGS and ESRI geospatial data and produce high quality rayshader maps for locations in the United States. There is an Introduction.

Tendril v2.0.4: Provides functions to compute and display tendril plots. See the vignette for an introduction.

tidyHeatmap v0.99.9: Provides an implementation of the Bioconductor ComplexHeatmap package based on tidy data frames. See the vignette.



To leave a comment for the author, please follow the link and comment on their blog: R Views.


What would a keyboard optimised for Luxembourguish look like?


[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve been using the BÉPO layout for my keyboard since 2010-ish, and it’s been one of the best computing decisions I’ve ever taken. The BÉPO layout is an optimized layout for French, but it works quite well for many European languages, English included (the only issue you might have with the BÉPO layout for English is that the w is a bit far away).

To come up with the BÉPO layout, ideas from a man named August Dvorak were applied to the French language. Today, the keyboard layout optimised for English is named after him: the DVORAK layout. Dvorak’s ideas were quite simple; unlike the QWERTY layout, his layout had to be based on the character frequencies of the English language. The main idea is that the most used characters of the language should be on the home row of the keyboard. The home row is the row where you lay your fingers on the keyboard when you are not typing (see picture below).

The problem with the “standard” layouts, such as QWERTY, is that they’re all absolute garbage, and not optimized at all for typing on a computer. For instance, look at the heatmap below, which shows the most used characters on a QWERTY keyboard when typing a standard English text:

(Heatmap generated on https://www.patrick-wied.at/projects/heatmap-keyboard/.)

As you can see, most of the characters used to type this text are actually outside of the home row, and the majority of them are on the left-hand side of the keyboard. Dvorak’s idea was, first, to put the most used characters on the home row, and second, to try to have an equal split of characters, 50% for each hand.

The same text on the DVORAK layout shows how superior it is:

As you can see, this is much much better. The same idea was applied to develop the BÉPO layout for French. And because character frequency is quite similar across languages, learning a layout such as the BÉPO not only translates to more efficient typing for French, but also for other languages, such as English, as already explained above.

I’m writing this blog post in part because of the confinement situation that many people on Earth are currently facing due to the corona virus. I have a job where I spend my whole day typing, and I am lucky enough to be able to work from home, which means that I get to use my mechanical keyboard for work, which is really great. (I avoid taking my mechanical keyboard with me to work, because I am never very long in the same spot, between meetings and client assignments…). But to have a mechanical keyboard that’s easy to take with me, I decided to buy a second mechanical keyboard, a 40% keyboard from Ergodox (see picture below):

Because I don’t even want to see the QWERTY keycaps, I bought blank keycaps to replace the ones that came with the keyboard. Anyway, this made me think about how crazy it is that in 2020 people still use absolute garbage keyboard layouts (and keyboards, by the way) to type on, when their job is basically only typing all day long. It made me so angry that I even made a video, which you can enjoy here.

The other thing I thought about was the specific case of Luxembourg, a country with 3 official languages (Luxembourguish, French and German), a very large Portuguese minority, and where English has become so important in recent years that the government distributed leaflets in English to the population (along with leaflets in French, Luxembourguish, German and Portuguese, of course) explaining what is and is not allowed during the period of containment. What would a keyboard optimized for such a unique country look like?

Of course, the answer that comes to mind quickly is to use the BÉPO layout; even though people routinely write in at least 3 of the above-mentioned languages, French is still the one that people use most of the time for written communication (at least, that’s my perception). The reason is that, while Luxembourguish is the national language and the language of the native population, French has always been the administrative language, and laws are still written in French only, even though they’re debated in Luxembourguish in the parliament. However, people also routinely write emails in German or English, and more and more people also write in Luxembourguish. This means that a keyboard optimized for Luxembourguish, or rather, for the multilingual nature of the country, should take all these different languages into account. Another thing to keep in mind is that Luxembourguish uses many French words, and as such, writing these words should be easy.

So let’s start with the BÉPO layout as a base. This is what it looks like:

A heatmap of character frequencies of a French, or even English, text would show that the most used characters are on the home row. If you compare DVORAK to BÉPO, you will see that the home row is fairly similar. But what strikes my colleagues when they see a picture of the BÉPO layout is the fact that the characters é, è, ê, à and ç can be accessed directly. They are so used to having these characters accessible only through some kind of modifier key that their first reaction is to think that this is completely stupid. However, what is stupid is not having these letters easily accessible, and instead having, say, z easily accessible (the French “standard” layout is called AZERTY, which is very similar to, and just as stupid as, the QWERTY layout. The letter Z is so easy to type, yet is almost non-existent in French!).

So let’s analyze character frequencies of a Luxembourguish text and see if the BÉPO layout could be a good fit. I used several text snippets from the Bible in Luxembourguish for this, and a few lines of R code:

library(tidyverse)
library(rvest)
root_url <- "https://cathol.lu/article"texts <- seq(4869,4900)urls <- c("https://cathol.lu/article4887",          "https://cathol.lu/article1851",          "https://cathol.lu/article1845",          "https://cathol.lu/article1863",          "https://cathol.lu/article1857",          "https://cathol.lu/article4885",          "https://cathol.lu/article1648",          "https://cathol.lu/article1842",          "https://cathol.lu/article1654",          "https://cathol.lu/article1849",          "https://cathol.lu/article1874",          "https://cathol.lu/article4884",          "https://cathol.lu/article1878",          "https://cathol.lu/article2163",          "https://cathol.lu/article2127",          "https://cathol.lu/article2185",          "https://cathol.lu/article4875")

Now that I’ve get the urls, let’s get the text out of it:

pages <- urls %>%
  map(read_html)

texts <- pages %>%
  map(~html_node(., xpath = '//*[(@id = "art_texte")]')) %>%
  map(html_text)

texts is a list containing the raw text from the website. I used several functions from the {rvest} package to do this. I won’t comment on them, because this is not a tutorial about webscraping (I’ve written several of those already), but a rant about keyboard layout gosh darn it.

Anyway, let’s now take a look at the character frequencies, and put that in a neat data frame:

characters <- texts %>%
  map(~strsplit(., split = "")) %>%
  unlist() %>%
  map(~strsplit(., split = "")) %>%
  unlist() %>%
  tolower() %>%
  str_extract_all(pattern = "[:alpha:]") %>%
  unlist() %>%
  table() %>%
  as.data.frame()

Computing the frequencies is now easy:

characters <- characters %>%
  mutate(frequencies = round(Freq/sum(Freq)*100, digits = 2)) %>%
  arrange(desc(frequencies)) %>%
  janitor::clean_names()

Let’s start with the obvious differences: there is not a single instance of the characters è, ê or ç, which are used in French only. There are, however, instances of ü, ä and ë. These characters should be easily accessible, but their frequencies are so low that they could still be accessible only through a modifier key, and it would not be a huge issue. Since ç does not appear at all, maybe it could be replaced by ä, and ê could be replaced by ë. But we must keep in mind that, since the average Luxembourger has to switch very often between so many languages, the French characters that would be replaced should still be accessible using a modifier such as Alt Gr. As for the rest, the layout as it stands is likely quite ok. Well, actually I know it’s ok, because when I write in Luxembourguish using the BÉPO layout, I find it quite easy to do. But let’s grab a French and a German text, and see how the rankings of the characters compare. Let’s get some French text:

Click to read the French text

french <- "Au commencement, Dieu créa les cieux et la terre.La terre était informe et vide: il y avait des ténèbres à la surface de l'abîme, et l'esprit de Dieu se mouvait au-dessus des eaux.Dieu dit: Que la lumière soit! Et la lumière fut.Dieu vit que la lumière était bonne; et Dieu sépara la lumière d'avec les ténèbres.Dieu appela la lumière jour, et il appela les ténèbres nuit. Ainsi, il y eut un soir, et il y eut un matin: ce fut le premier jour.Dieu dit: Qu'il y ait une étendue entre les eaux, et qu'elle sépare les eaux d'avec les eaux.Et Dieu fit l'étendue, et il sépara les eaux qui sont au-dessous de l'étendue d'avec les eaux qui sont au-dessus de l'étendue. Et cela fut ainsi.Dieu appela l'étendue ciel. Ainsi, il y eut un soir, et il y eut un matin: ce fut le second jour.Dieu dit: Que les eaux qui sont au-dessous du ciel se rassemblent en un seul lieu, et que le sec paraisse. Et cela fut ainsi.Dieu appela le sec terre, et il appela l'amas des eaux mers. Dieu vit que cela était bon.Puis Dieu dit: Que la terre produise de la verdure, de l'herbe portant de la semence, des arbres fruitiers donnant du fruit selon leur espèce et ayant en eux leur semence sur la terre. Et cela fut ainsi.La terre produisit de la verdure, de l'herbe portant de la semence selon son espèce, et des arbres donnant du fruit et ayant en eux leur semence selon leur espèce. Dieu vit que cela était bon.Ainsi, il y eut un soir, et il y eut un matin: ce fut le troisième jour.Dieu dit: Qu'il y ait des luminaires dans l'étendue du ciel, pour séparer le jour d'avec la nuit; que ce soient des signes pour marquer les époques, les jours et les années;et qu'ils servent de luminaires dans l'étendue du ciel, pour éclairer la terre. Et cela fut ainsi.Dieu fit les deux grands luminaires, le plus grand luminaire pour présider au jour, et le plus petit luminaire pour présider à la nuit; il fit aussi les étoiles.Dieu les plaça dans l'étendue du ciel, pour éclairer la terre,pour présider au jour et à la nuit, et pour séparer la lumière d'avec les ténèbres. Dieu vit que cela était bon.Ainsi, il y eut un soir, et il y eut un matin: ce fut le quatrième jour.Dieu dit: Que les eaux produisent en abondance des animaux vivants, et que des oiseaux volent sur la terre vers l'étendue du ciel.Dieu créa les grands poissons et tous les animaux vivants qui se meuvent, et que les eaux produisirent en abondance selon leur espèce; il créa aussi tout oiseau ailé selon son espèce. Dieu vit que cela était bon.Dieu les bénit, en disant: Soyez féconds, multipliez, et remplissez les eaux des mers; et que les oiseaux multiplient sur la terre.Ainsi, il y eut un soir, et il y eut un matin: ce fut le cinquième jour.Dieu dit: Que la terre produise des animaux vivants selon leur espèce, du bétail, des reptiles et des animaux terrestres, selon leur espèce. Et cela fut ainsi.Dieu fit les animaux de la terre selon leur espèce, le bétail selon son espèce, et tous les reptiles de la terre selon leur espèce. 
Dieu vit que cela était bon.Puis Dieu dit: Faisons l'homme à notre image, selon notre ressemblance, et qu'il domine sur les poissons de la mer, sur les oiseaux du ciel, sur le bétail, sur toute la terre, et sur tous les reptiles qui rampent sur la terre.Dieu créa l'homme à son image, il le créa à l'image de Dieu, il créa l'homme et la femme.Dieu les bénit, et Dieu leur dit: Soyez féconds, multipliez, remplissez la terre, et l'assujettissez; et dominez sur les poissons de la mer, sur les oiseaux du ciel, et sur tout animal qui se meut sur la terre.Et Dieu dit: Voici, je vous donne toute herbe portant de la semence et qui est à la surface de toute la terre, et tout arbre ayant en lui du fruit d'arbre et portant de la semence: ce sera votre nourriture.Et à tout animal de la terre, à tout oiseau du ciel, et à tout ce qui se meut sur la terre, ayant en soi un souffle de vie, je donne toute herbe verte pour nourriture. Et cela fut ainsi.Dieu vit tout ce qu'il avait fait et voici, cela était très bon. Ainsi, il y eut un soir, et il y eut un matin: ce fut le sixième jour."
# Character frequencies in the French sample; this assumes the same packages
# as the Luxembourguish analysis above ({purrr}, {stringr}, {dplyr}, {janitor})
characters_fr <- french %>%
  map(~strsplit(., split = "")) %>%
  unlist() %>%
  map(~strsplit(., split = "")) %>%
  unlist() %>%
  tolower() %>%
  str_extract_all(pattern = "[:alpha:]") %>%
  unlist() %>%
  table() %>%
  as.data.frame() %>%
  mutate(frequencies = round(Freq / sum(Freq) * 100, digits = 2)) %>%
  arrange(desc(frequencies)) %>%
  janitor::clean_names()

Let’s now do the same for German:

The German text:

german <- "Am Anfang schuf Gott Himmel und Erde.Und die Erde war wüst und leer, und es war finster auf der Tiefe; und der Geist Gottes schwebte auf dem Wasser.Und Gott sprach: Es werde Licht! und es ward Licht.Und Gott sah, daß das Licht gut war. Da schied Gott das Licht von der Finsternisund nannte das Licht Tag und die Finsternis Nacht. Da ward aus Abend und Morgen der erste Tag.Und Gott sprach: Es werde eine Feste zwischen den Wassern, und die sei ein Unterschied zwischen den Wassern.Da machte Gott die Feste und schied das Wasser unter der Feste von dem Wasser über der Feste. Und es geschah also.Und Gott nannte die Feste Himmel. Da ward aus Abend und Morgen der andere Tag.Und Gott sprach: Es sammle sich das Wasser unter dem Himmel an besondere Örter, daß man das Trockene sehe. Und es geschah also.Und Gott nannte das Trockene Erde, und die Sammlung der Wasser nannte er Meer. Und Gott sah, daß es gut war.Und Gott sprach: Es lasse die Erde aufgehen Gras und Kraut, das sich besame, und fruchtbare Bäume, da ein jeglicher nach seiner Art Frucht trage und habe seinen eigenen Samen bei sich selbst auf Erden. Und es geschah also.Und die Erde ließ aufgehen Gras und Kraut, das sich besamte, ein jegliches nach seiner Art, und Bäume, die da Frucht trugen und ihren eigenen Samen bei sich selbst hatten, ein jeglicher nach seiner Art. Und Gott sah, daß es gut war.Da ward aus Abend und Morgen der dritte Tag.Und Gott sprach: Es werden Lichter an der Feste des Himmels, die da scheiden Tag und Nacht und geben Zeichen, Zeiten, Tage und Jahreund seien Lichter an der Feste des Himmels, daß sie scheinen auf Erden. Und es geschah also.Und Gott machte zwei große Lichter: ein großes Licht, das den Tag regiere, und ein kleines Licht, das die Nacht regiere, dazu auch Sterne.Und Gott setzte sie an die Feste des Himmels, daß sie schienen auf die Erdeund den Tag und die Nacht regierten und schieden Licht und Finsternis. Und Gott sah, daß es gut war.Da ward aus Abend und Morgen der vierte Tag.Und Gott sprach: Es errege sich das Wasser mit webenden und lebendigen Tieren, und Gevögel fliege auf Erden unter der Feste des Himmels.Und Gott schuf große Walfische und allerlei Getier, daß da lebt und webt, davon das Wasser sich erregte, ein jegliches nach seiner Art, und allerlei gefiedertes Gevögel, ein jegliches nach seiner Art. Und Gott sah, daß es gut war.Und Gott segnete sie und sprach: Seid fruchtbar und mehrt euch und erfüllt das Wasser im Meer; und das Gefieder mehre sich auf Erden.Da ward aus Abend und Morgen der fünfte Tag.Und Gott sprach: Die Erde bringe hervor lebendige Tiere, ein jegliches nach seiner Art: Vieh, Gewürm und Tiere auf Erden, ein jegliches nach seiner Art. Und es geschah also.Und Gott machte die Tiere auf Erden, ein jegliches nach seiner Art, und das Vieh nach seiner Art, und allerlei Gewürm auf Erden nach seiner Art. 
Und Gott sah, daß es gut war.Und Gott sprach: Laßt uns Menschen machen, ein Bild, das uns gleich sei, die da herrschen über die Fische im Meer und über die Vögel unter dem Himmel und über das Vieh und über die ganze Erde und über alles Gewürm, das auf Erden kriecht.Und Gott schuf den Menschen ihm zum Bilde, zum Bilde Gottes schuf er ihn; und schuf sie einen Mann und ein Weib.Und Gott segnete sie und sprach zu ihnen: Seid fruchtbar und mehrt euch und füllt die Erde und macht sie euch untertan und herrscht über die Fische im Meer und über die Vögel unter dem Himmel und über alles Getier, das auf Erden kriecht.Und Gott sprach: Seht da, ich habe euch gegeben allerlei Kraut, das sich besamt, auf der ganzen Erde und allerlei fruchtbare Bäume, die sich besamen, zu eurer Speise,und allem Getier auf Erden und allen Vögeln unter dem Himmel und allem Gewürm, das da lebt auf Erden, daß sie allerlei grünes Kraut essen. Und es geschah also.Und Gott sah alles an, was er gemacht hatte; und siehe da, es war sehr gut. Da ward aus Abend und Morgen der sechste Tag."
# Same character frequency pipeline for the German sample
characters_gr <- german %>%
  map(~strsplit(., split = "")) %>%
  unlist() %>%
  map(~strsplit(., split = "")) %>%
  unlist() %>%
  tolower() %>%
  str_extract_all(pattern = "[:alpha:]") %>%
  unlist() %>%
  table() %>%
  as.data.frame() %>%
  mutate(frequencies = round(Freq / sum(Freq) * 100, digits = 2)) %>%
  arrange(desc(frequencies)) %>%
  janitor::clean_names()
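Since this is now the third time the same pipeline runs (once for Luxembourguish earlier in the post, then for French and German), it could be wrapped in a small helper. This is only a sketch under the same package assumptions as above; the second strsplit() pass of the original chain is dropped here, because the text is already split into single characters at that point:

# Helper (a sketch): character frequency table for a single text
char_frequencies <- function(text) {
  text %>%
    strsplit(split = "") %>%
    unlist() %>%
    tolower() %>%
    str_extract_all(pattern = "[:alpha:]") %>%
    unlist() %>%
    table() %>%
    as.data.frame() %>%
    mutate(frequencies = round(Freq / sum(Freq) * 100, digits = 2)) %>%
    arrange(desc(frequencies)) %>%
    janitor::clean_names()  # as in the chains above, the character column ends up named x
}

# characters_fr <- char_frequencies(french)
# characters_gr <- char_frequencies(german)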

Let’s now visualize how the rankings compare across these three languages. For this, I’m using the newggslopegraph() function from the {CGPfunctions} package:

# The tables are already sorted by frequency, so the row order gives the rank
characters$rank <- seq(1, 30)
characters_fr$rank <- seq(1, 29)
characters_gr$rank <- seq(1, 27)

characters_fr <- characters_fr %>%
  select(letters = x, rank) %>%
  mutate(language = "french")

characters_gr <- characters_gr %>%
  select(letters = x, rank) %>%
  mutate(language = "german")

characters <- characters %>%
  select(letters = x, rank) %>%
  mutate(language = "luxembourguish")

characters_df <- bind_rows(characters, characters_fr, characters_gr)

CGPfunctions::newggslopegraph(characters_df,
                              language,
                              rank,
                              letters,
                              Title = "Character frequency ranking for the Luxembourguish official languages",
                              SubTitle = NULL,
                              Caption = NULL,
                              YTextSize = 4)
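If you would rather not add {CGPfunctions} as a dependency just for this plot, a rough equivalent can be drawn with plain {ggplot2}. This is only a sketch of the idea, not a reproduction of the exact slopegraph above:

# A ggplot2 version of the slopegraph (sketch)
library(ggplot2)

characters_df %>%
  mutate(language = factor(language,
                           levels = c("french", "luxembourguish", "german"))) %>%
  ggplot(aes(x = language, y = -rank, group = letters, label = letters)) +
  geom_line(colour = "grey70") +       # one line per letter across languages
  geom_text(size = 3) +                # label each point with its letter
  scale_y_continuous(breaks = NULL) +
  labs(title = "Character frequency ranking for the Luxembourguish official languages",
       x = NULL, y = "Rank (top = most frequent)") +
  theme_minimal()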

The raw data:

characters_df 
##    letters rank       language
## 1        e    1 luxembourguish
## 2        n    2 luxembourguish
## 3        s    3 luxembourguish
## 4        a    4 luxembourguish
## 5        i    5 luxembourguish
## 6        t    6 luxembourguish
## 7        d    7 luxembourguish
## 8        r    8 luxembourguish
## 9        h    9 luxembourguish
## 10       u   10 luxembourguish
## 11       g   11 luxembourguish
## 12       m   12 luxembourguish
## 13       o   13 luxembourguish
## 14       l   14 luxembourguish
## 15       c   15 luxembourguish
## 16       w   16 luxembourguish
## 17       é   17 luxembourguish
## 18       k   18 luxembourguish
## 19       f   19 luxembourguish
## 20       ä   20 luxembourguish
## 21       z   21 luxembourguish
## 22       p   22 luxembourguish
## 23       j   23 luxembourguish
## 24       ë   24 luxembourguish
## 25       b   25 luxembourguish
## 26       v   26 luxembourguish
## 27       ü   27 luxembourguish
## 28       q   28 luxembourguish
## 29       x   29 luxembourguish
## 30       y   30 luxembourguish
## 31       e    1         french
## 32       u    2         french
## 33       i    3         french
## 34       t    4         french
## 35       s    5         french
## 36       a    6         french
## 37       l    7         french
## 38       r    8         french
## 39       n    9         french
## 40       d   10         french
## 41       o   11         french
## 42       c   12         french
## 43       m   13         french
## 44       p   14         french
## 45       é   15         french
## 46       q   16         french
## 47       v   17         french
## 48       f   18         french
## 49       b   19         french
## 50       è   20         french
## 51       x   21         french
## 52       y   22         french
## 53       j   23         french
## 54       à   24         french
## 55       z   25         french
## 56       g   26         french
## 57       h   27         french
## 58       ç   28         french
## 59       î   29         french
## 60       e    1         german
## 61       n    2         german
## 62       d    3         german
## 63       a    4         german
## 64       s    5         german
## 65       r    6         german
## 66       t    7         german
## 67       i    8         german
## 68       u    9         german
## 69       h   10         german
## 70       g   11         german
## 71       c   12         german
## 72       l   13         german
## 73       m   14         german
## 74       f   15         german
## 75       o   16         german
## 76       b   17         german
## 77       w   18         german
## 78       ü   19         german
## 79       ß   20         german
## 80       v   21         german
## 81       z   22         german
## 82       p   23         german
## 83       j   24         german
## 84       k   25         german
## 85       ö   26         german
## 86       ä   27         german

A few things pop out of this plot: the German and Luxembourguish rankings are more similar to each other than the French and Luxembourguish ones, but overall the three languages share practically the same top 10 characters. Using the BÉPO layout as a base should therefore be comfortable enough, although the characters h and g, which are not very common in French, are much more common in Luxembourguish and should thus be placed better. I would still advise against a German ergonomic/optimized layout, however, because, as I said in the beginning, French is probably the most written of the three languages, certainly written more often than German. So even though the character frequencies of Luxembourguish and German are very similar, I would still prefer to start from the French BÉPO layout.
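To put a rough number on that visual impression, one could compute Spearman rank correlations between the rankings. This is only a sketch, using the characters_df built above and ignoring the letters that are missing from a language ({tidyr} is assumed to be available):

# Rank correlations between the languages (sketch)
rank_wide <- characters_df %>%
  tidyr::pivot_wider(names_from = language, values_from = rank)

cor(rank_wide$luxembourguish, rank_wide$german,
    method = "spearman", use = "pairwise.complete.obs")
cor(rank_wide$luxembourguish, rank_wide$french,
    method = "spearman", use = "pairwise.complete.obs")

If the plot is not misleading, the first correlation should come out higher than the second.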

I don’t know if there will ever be an ergonomic/optimized layout for Luxembourguish, but I sure hope that more and more people will start using layouts such as BÉPO, which are really great to use. They take some time to get used to, but after about one week of use, maybe two, you should be as fast as you were on the legacy layout.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and watch my youtube channel. If you want to support my blog and channel, you could buy me an espresso or paypal.me, or buy my ebook on Leanpub.

