
Best Books for Data Engineers

[This article was first published on Data Analysis in R, and kindly contributed to R-bloggers.]

If you are interested in learning more about data science, you can find more articles at finnstats.

Are you seeking the best books on data engineering? If so, your search ends here.

We’ve outlined the top eight books on data engineering in this article, so read on to decide which one is right for you.

A data engineer is the person responsible for overseeing data workflows, pipelines, and ETL processes.

Data engineering, as its name suggests, is a field that deals with the delivery, storage, and processing of data.

Data engineers need to master SQL, R, Python, Spark, AWS, and other specialized technologies.

Best Books for Data Engineers

Books are crucial for learning these skills because they provide a firm understanding of the underlying topics. So, without further ado, let’s identify the best data engineering books.

1. Data Engineering with Python

This book provides a good understanding of data modeling methods and pipelining. It opens with the fundamentals of data engineering.

After that, you will gain knowledge of the frameworks and tools needed to construct data pipelines for handling huge datasets.

To make the most of your data, you will also learn how to transform, clean, and run analytics on it. Towards the end of the book, you will learn how to create data pipelines and work with massive datasets of varying complexity.

Additionally, you’ll build the architectures on which you’ll deploy data pipelines, using real-world examples.

2. Designing Data-Intensive Applications

This book is detailed and practical. It covers everything related to data engineering: storage, data models, structures, access patterns, encoding, replication, partitioning, distributed systems, batch and stream processing, and the future of data systems.

Reading it gives you a thorough grasp of real-world big data architecture. If you work in big data engineering, or are interviewing for such a position, you should read this book.

This book gives a fantastic overview of the core ideas that underlie the much-hyped Big Data tools.

3. Spark: The Definitive Guide: Big Data Processing Made Simple

Apache Spark is a powerful platform for big data applications. This book offers several excellent examples and a comprehensive explanation of the Spark architecture.

The code presented in this book and the accompanying notebooks uses Python, Scala, and Spark SQL. Spark fans will enjoy this book.

4. Data Science For Dummies

This book’s primary focus is business use cases. It teaches big data, data science, and data engineering, and how these three disciplines work together to provide enormous value.

From this book, you can learn the skills you need to launch a new project or career.

After reading this book, you will know the basics of big data and data engineering. It also covers big data frameworks, including Hadoop, MapReduce, Spark, MPP systems, and NoSQL databases.

5. The Data Warehouse Toolkit

This book offers a thorough, up-to-date introduction, including a treatment of more recent subjects such as big data, and it reflects current practice.

New and improved star schema dimensional modeling patterns are also covered in this book.

This book contains two new chapters on ETL approaches. In general, this book is helpful for learning how data warehouses function.

6. Building a Data Warehouse: With Examples in SQL Server

This book shows you how to build a data warehouse: defining the architecture, understanding the methodology, gathering the requirements, creating the databases, and designing the data models.

This book offers hundreds of useful, real-world cases and is focused on SQL Server-based ETL operations. Additionally, you’ll learn how to leverage reports and multidimensional databases to deliver data to consumers.

7. Big Data: Principles and best practices of scalable real-time data systems

This book covers both the theory and the practical application of big data systems. You will also learn about specialized technologies such as NoSQL databases, Hadoop, and Storm.

You will come away with a clear understanding of big data architecture and its fundamental ideas. The book covers the complete conceptual and technical approach to building real-time big data systems with the Lambda Architecture.

8. R for Data Science

R for Data Science starts by building a complete understanding of data science, how it is used, and the science behind it.

Within the first few chapters, the book picks up the pace, using R for a variety of data science tasks and workflows.


Conclusion

In this article, you learned about the top eight best books for data engineers.

Have you bought or read any of these books? If so, please share your experience in the comments.


Did you find this article interesting? We’d be glad if you could forward it to a friend or share it on Twitter or LinkedIn to help it spread.

The post Best Books for Data Engineers appeared first on finnstats.


Join rstudio::conf(2022) Virtually

[This article was first published on RStudio | Open source & professional software for data science teams, and kindly contributed to R-bloggers.]

While we hope to see you in person at rstudio::conf(2022), we want to include as many of you as possible, so we invite you to join us virtually!

  • Live streaming: Keynotes and talks will be livestreamed on the rstudio::conf website, free and open to all. No registration is required.
  • Virtual networking on Discord: Sign up to access the conference Discord server so you can chat with other attendees, participate in fun community events, and keep up with announcements. This is open to both in-person and virtual attendees.
    • A lot of RStudio folks are attending conf virtually. Come hang out!
    • Need help? Ask questions on the #🧐-discord-help-and-how-to channel.
    • Sign up for Discord server access! The sign-up form is under the “Participate virtually” heading.

[Figure: Discord Sign Up Form Location. The “Participate virtually” section of the rstudio::conf webpage, with the sign-up form link highlighted.]

  • Social media: Follow the RStudio and RStudio Glimpse Twitter accounts and use the hashtags #rstudioconf and #rstudioconf2022 to share and engage with others!

Plan your virtual experience

  • Go to the Schedule page and add talks to your calendar.
  • On July 27-28th, head to the conference website to watch the livestreams and ask questions alongside other attendees.
  • Join the Discord server to chat, network, and share with other attendees at conf.
  • Follow the RStudio and RStudio Glimpse Twitter accounts.
  • Use the hashtags #rstudioconf and #rstudioconf2022 to be part of the party online!

Recordings of the keynotes and talks will be available on the RStudio website in a few weeks.

We can’t wait to see you there!


[FREE] An introductory workshop in Shiny, July 27th from 18.00 to 19.00

[This article was first published on Pachá, and kindly contributed to R-bloggers.]

Get FREE tickets at https://www.buymeacoffee.com/pacha/e/80848

This workshop aims to introduce people with basic R knowledge to developing interactive web applications with the Shiny framework.

The course consists of a one-hour session, where we will demonstrate basic UI, reactive UI, CSS personalization and dashboard creation. Questions are super welcome!
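
To give a flavour of what “basic UI” plus reactivity means in practice, here is a minimal sketch of a Shiny app (not part of the workshop material; the dataset and variable names are just illustrative):

# A minimal Shiny app: a slider (UI input) drives a histogram (reactive output)
library(shiny)

ui <- fluidPage(
  titlePanel("Hello Shiny"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output, session) {
  output$hist <- renderPlot({
    # re-runs automatically whenever input$bins changes
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui, server)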

Previous knowledge required: Basic R (examples: reading a CSV file, transforming columns and making graphs using ggplot2).

The course will be delivered online using Zoom on July 27th, from 18.00 to 19.00. Check the timezone: for this workshop, it is New York time (https://www.timeanddate.com/worldclock/usa/new-york).

Finally, here’s a short demo of part of what this workshop covers: https://youtu.be/DW-HPfohfwg.

If you find this workshop useful, you can buy me a coffee at https://www.buymeacoffee.com/pacha/.


Meta-analysis for a single study. Is it possible?

[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers.]

We all know that the word meta-analysis refers to a body of statistical techniques for combining quantitative evidence from several independent studies. However, I have recently discovered that meta-analytic methods can also be used to analyse the results of a single research project. That happened a few months ago, when I was reading a paper by Damesa et al. (2017), where the authors describe some interesting methods of data analysis for multi-environment genotype experiments. These authors gave a few nice examples with related SAS code, rooted in mixed models. As an R enthusiast, I wanted to reproduce their analyses with R, but I could not succeed, until I realised that I could make use of the ‘metafor’ package and its rich set of meta-analytic methods.

In this post, I will share my R coding, for those of you who are interested in meta-analytic methods and multi-environment experiments. Let’s start by having a look at the example that motivated my interest (Example 1 in Damesa et al., 2017, p. 849).

Motivating example

Twenty-two different genotypes of maize were compared in Ethiopia, in relation to their yield level, in four sites (Dhera, Melkassa, Mieso, and Ziway). At all sites, there were 11 incomplete blocks in each of three replicates. The data are available in Damesa et al. (2017) as supplemental material; I have put this data at your disposal in my web repository, to reproduce this example; let’s load the data first.

rm(list = ls())
library(tidyverse)
library(nlme)
library(sommer)
library(emmeans)
fileName <- "https://www.casaonofri.it/_datasets/Damesa2017.csv"
dataset <- read.csv(fileName)
# The first five columns (site, rep, block, plot, genotype) are converted to factors
dataset <- dataset %>% 
  mutate(across(1:5, .fns = factor))
head(dataset)
##   site rep block plot genotype row col yield
## 1    1   1     1    1        6   1   1  9.93
## 2    1   1     1    2       22   1   2  6.51
## 3    1   1     2    3       17   1   3  7.92
## 4    1   1     2    4       14   1   4  9.28
## 5    1   1     3    5       12   1   5  7.56
## 6    1   1     3    6       10   1   6  9.54

This is a typical multi-environment experiment: we have three blocking factors (‘site’, ‘rep’ and ‘block’) and one treatment factor (‘genotype’), as well as the ‘yield’ response variable. Let’s see how this dataset can be analysed.

The ‘golden standard’ analysis

In most situations with multi-environment experiments, we are interested in broad space inference about genotypes, which means that we want to determine the best genotypes across the whole set of environments. Accordingly, the ‘site’ and ‘site x genotype’ effects must be regarded as random, while the ‘genotype’ effect is fixed. Furthermore, we need to consider the ‘design’ effects, that (in this specific case) are the ‘reps within sites’ and the ‘blocks within reps within sites’ random effects. Finally, we have the residual error term (‘plots within blocks within reps within sites’), that is always included by default.

So far, so good, but we have to go slightly more complex: for this type of study, the variances for replicates, blocks, and residual error should be site-specific, which is usually the most realistic assumption. In the end, we need to estimate:

  1. 22 genotype means with standard errors
  2. one variance component for the site effect
  3. one variance component for the ‘genotype x site’ interaction
  4. four variance components (one per site) for the ‘rep’ effect
  5. four variance components (one per site) for the ‘block within rep’ effect
  6. four variance components (one per site) for the residual error

If we work with the lme() function in the nlme package, we have to create a couple of ‘dummy’ variables (‘one’ and ‘GE’), in order to reference the crossed random effects (see Galecki and Burzykowski, 2013).

# One-stage analysis: genotype is fixed; site, genotype-by-site, reps within
# sites and blocks within reps are random, with site-specific residual variances
dataset$one <- 1L  # dummy constant used to host the crossed random effects
dataset$GE <- with(dataset, genotype:site)
model.mix <- lme(yield ~ genotype - 1, 
                 random = list(one = pdIdent(~ site - 1),
                               one = pdIdent(~ GE - 1),
                               rep = pdDiag(~ site - 1),
                               block = pdDiag(~ site - 1)),
                 data = dataset,
                 weights = varIdent(form = ~ 1 | site))

The means for genotypes are:

mg <- emmeans(model.mix, ~ genotype)
mg
##  genotype emmean   SE  df lower.CL upper.CL
##  1          5.15 1.65 210    1.900     8.40
##  2          5.54 1.65 210    2.296     8.79
##  3          5.19 1.65 210    1.939     8.44
##  4          4.59 1.65 210    1.341     7.84
##  5          4.82 1.65 210    1.568     8.07
##  6          4.66 1.65 210    1.411     7.91
##  7          4.64 1.65 210    1.388     7.88
##  8          4.36 1.65 210    1.110     7.61
##  9          5.03 1.65 210    1.785     8.28
##  10         4.84 1.65 210    1.592     8.09
##  11         4.54 1.65 210    1.290     7.79
##  12         4.87 1.65 210    1.622     8.12
##  13         4.84 1.65 210    1.593     8.09
##  14         4.29 1.65 210    1.045     7.54
##  15         4.47 1.65 210    1.224     7.72
##  16         4.37 1.65 210    1.123     7.62
##  17         4.07 1.65 210    0.819     7.32
##  18         4.95 1.65 210    1.697     8.19
##  19         4.71 1.65 210    1.466     7.96
##  20         4.86 1.65 210    1.612     8.11
##  21         4.13 1.65 210    0.885     7.38
##  22         4.63 1.65 210    1.380     7.88
## 
## Degrees-of-freedom method: containment 
## Confidence level used: 0.95

while the variance components are:

VarCorr(model.mix)
##          Variance          StdDev      
## one =    pdIdent(site - 1)             
## site1    1.045428e+01      3.233309e+00
## site2    1.045428e+01      3.233309e+00
## site3    1.045428e+01      3.233309e+00
## site4    1.045428e+01      3.233309e+00
## one =    pdIdent(GE - 1)               
## GE1:1    1.052944e-01      3.244909e-01
## GE1:2    1.052944e-01      3.244909e-01
## GE1:3    1.052944e-01      3.244909e-01
## GE1:4    1.052944e-01      3.244909e-01
## GE2:1    1.052944e-01      3.244909e-01
## GE2:2    1.052944e-01      3.244909e-01
## GE2:3    1.052944e-01      3.244909e-01
## GE2:4    1.052944e-01      3.244909e-01
## GE3:1    1.052944e-01      3.244909e-01
## GE3:2    1.052944e-01      3.244909e-01
## GE3:3    1.052944e-01      3.244909e-01
## GE3:4    1.052944e-01      3.244909e-01
## GE4:1    1.052944e-01      3.244909e-01
## GE4:2    1.052944e-01      3.244909e-01
## GE4:3    1.052944e-01      3.244909e-01
## GE4:4    1.052944e-01      3.244909e-01
## GE5:1    1.052944e-01      3.244909e-01
## GE5:2    1.052944e-01      3.244909e-01
## GE5:3    1.052944e-01      3.244909e-01
## GE5:4    1.052944e-01      3.244909e-01
## GE6:1    1.052944e-01      3.244909e-01
## GE6:2    1.052944e-01      3.244909e-01
## GE6:3    1.052944e-01      3.244909e-01
## GE6:4    1.052944e-01      3.244909e-01
## GE7:1    1.052944e-01      3.244909e-01
## GE7:2    1.052944e-01      3.244909e-01
## GE7:3    1.052944e-01      3.244909e-01
## GE7:4    1.052944e-01      3.244909e-01
## GE8:1    1.052944e-01      3.244909e-01
## GE8:2    1.052944e-01      3.244909e-01
## GE8:3    1.052944e-01      3.244909e-01
## GE8:4    1.052944e-01      3.244909e-01
## GE9:1    1.052944e-01      3.244909e-01
## GE9:2    1.052944e-01      3.244909e-01
## GE9:3    1.052944e-01      3.244909e-01
## GE9:4    1.052944e-01      3.244909e-01
## GE10:1   1.052944e-01      3.244909e-01
## GE10:2   1.052944e-01      3.244909e-01
## GE10:3   1.052944e-01      3.244909e-01
## GE10:4   1.052944e-01      3.244909e-01
## GE11:1   1.052944e-01      3.244909e-01
## GE11:2   1.052944e-01      3.244909e-01
## GE11:3   1.052944e-01      3.244909e-01
## GE11:4   1.052944e-01      3.244909e-01
## GE12:1   1.052944e-01      3.244909e-01
## GE12:2   1.052944e-01      3.244909e-01
## GE12:3   1.052944e-01      3.244909e-01
## GE12:4   1.052944e-01      3.244909e-01
## GE13:1   1.052944e-01      3.244909e-01
## GE13:2   1.052944e-01      3.244909e-01
## GE13:3   1.052944e-01      3.244909e-01
## GE13:4   1.052944e-01      3.244909e-01
## GE14:1   1.052944e-01      3.244909e-01
## GE14:2   1.052944e-01      3.244909e-01
## GE14:3   1.052944e-01      3.244909e-01
## GE14:4   1.052944e-01      3.244909e-01
## GE15:1   1.052944e-01      3.244909e-01
## GE15:2   1.052944e-01      3.244909e-01
## GE15:3   1.052944e-01      3.244909e-01
## GE15:4   1.052944e-01      3.244909e-01
## GE16:1   1.052944e-01      3.244909e-01
## GE16:2   1.052944e-01      3.244909e-01
## GE16:3   1.052944e-01      3.244909e-01
## GE16:4   1.052944e-01      3.244909e-01
## GE17:1   1.052944e-01      3.244909e-01
## GE17:2   1.052944e-01      3.244909e-01
## GE17:3   1.052944e-01      3.244909e-01
## GE17:4   1.052944e-01      3.244909e-01
## GE18:1   1.052944e-01      3.244909e-01
## GE18:2   1.052944e-01      3.244909e-01
## GE18:3   1.052944e-01      3.244909e-01
## GE18:4   1.052944e-01      3.244909e-01
## GE19:1   1.052944e-01      3.244909e-01
## GE19:2   1.052944e-01      3.244909e-01
## GE19:3   1.052944e-01      3.244909e-01
## GE19:4   1.052944e-01      3.244909e-01
## GE20:1   1.052944e-01      3.244909e-01
## GE20:2   1.052944e-01      3.244909e-01
## GE20:3   1.052944e-01      3.244909e-01
## GE20:4   1.052944e-01      3.244909e-01
## GE21:1   1.052944e-01      3.244909e-01
## GE21:2   1.052944e-01      3.244909e-01
## GE21:3   1.052944e-01      3.244909e-01
## GE21:4   1.052944e-01      3.244909e-01
## GE22:1   1.052944e-01      3.244909e-01
## GE22:2   1.052944e-01      3.244909e-01
## GE22:3   1.052944e-01      3.244909e-01
## GE22:4   1.052944e-01      3.244909e-01
## rep =    pdDiag(site - 1)              
## site1    8.817499e-02      2.969427e-01
## site2    1.383338e+00      1.176154e+00
## site3    4.245188e-09      6.515511e-05
## site4    1.442336e-02      1.200973e-01
## block =  pdDiag(site - 1)              
## site1    3.312025e-01      5.755020e-01
## site2    4.746751e-01      6.889667e-01
## site3    5.498857e-09      7.415428e-05
## site4    6.953371e-02      2.636925e-01
## Residual 1.346652e+00      1.160454e+00

We can see that, apart from some differences relating to the optimisation method, the results are equal to those reported in Tables 1 and 2 of Damesa et al. (2017).

Two-stage analyses

The above analysis is fully correct but, in some circumstances, it may be unfeasible. In particular, we may have problems when:

  1. the number of sites is very high, and
  2. different experimental designs have been used in different sites.

In these circumstances, it is advantageous to break the analysis in two stages, as follows:

  1. first stage: we separately analyse the different experiments and obtain the means for all genotypes in all sites;
  2. second stage: we jointly analyse the genotype means from all sites.

This two-stage analysis is far simpler, because the data are only pooled at the second stage, where possible design constraints are no longer important (they are accounted for at the first stage). However, a two-stage analysis does not necessarily lead to the same results as the one-stage analysis, unless all the information obtained at the first stage is carried forward to the second one (fully efficient two-stage analysis).

In more detail, the genotypic variances and correlations observed at the first stage should not be neglected at the second stage. Damesa et al. (2017) demonstrate that the best approach is to take the full variance-covariance matrix of genotype means from the first stage and bring it forward to the second stage. They give the code in SAS but, how do we do it with R?

First of all, we perform the first stage of the analysis, using the by() function to analyse the data separately for each site. In each site, we fit a mixed model where the genotype is fixed, while the replicates and the incomplete blocks within replicates are random. Of course, this coding works because the experimental design is the same at all sites; it would need to be modified otherwise.

# First stage
model.1step <- by(dataset, dataset$site,
                  function(df) lme(yield ~ genotype - 1, 
             random = ~1|rep/block, 
             data = df) )

From there, we use the lapply() function to get the variance components. The results are similar to those obtained in the one-stage analysis (see also Damesa et al., 2017, Table 1).

# Get the variance components
lapply(model.1step, VarCorr)
## $`1`
##             Variance     StdDev   
## rep =       pdLogChol(1)          
## (Intercept) 0.1003720    0.3168153
## block =     pdLogChol(1)          
## (Intercept) 0.2505444    0.5005441
## Residual    1.2361933    1.1118423
## 
## $`2`
##             Variance     StdDev   
## rep =       pdLogChol(1)          
## (Intercept) 1.4012207    1.1837317
## block =     pdLogChol(1)          
## (Intercept) 0.4645211    0.6815579
## Residual    0.2020162    0.4494621
## 
## $`3`
##             Variance     StdDev      
## rep =       pdLogChol(1)             
## (Intercept) 2.457639e-10 1.567686e-05
## block =     pdLogChol(1)             
## (Intercept) 1.824486e-09 4.271400e-05
## Residual    1.054905e+00 1.027085e+00
## 
## $`4`
##             Variance     StdDev   
## rep =       pdLogChol(1)          
## (Intercept) 0.01412879   0.1188646
## block =     pdLogChol(1)          
## (Intercept) 0.07196842   0.2682693
## Residual    0.11262234   0.3355925

Now we can retrieve the genotypic means at all sites:

# Get the means
sitmeans <- lapply(model.1step, 
                function(el) 
                  data.frame(emmeans(el, ~genotype)))
sitmeans <- do.call(rbind, sitmeans)
sitmeans$site <- factor(rep(1:4, each = 22))
head(sitmeans)
##     genotype   emmean        SE df lower.CL  upper.CL site
## 1.1        1 8.253672 0.7208426 12 6.683091  9.824253    1
## 1.2        2 7.731118 0.7208426 12 6.160537  9.301699    1
## 1.3        3 7.249198 0.7208426 12 5.678617  8.819779    1
## 1.4        4 8.565262 0.7208426 12 6.994681 10.135843    1
## 1.5        5 8.560002 0.7208426 12 6.989421 10.130583    1
## 1.6        6 9.510255 0.7208426 12 7.939674 11.080836    1

The variance-covariance matrix for the genotype means is obtained, for each site, by using the vcov() function. Afterwards, we build a block-diagonal matrix using the four variance-covariance matrices as the building blocks.

# Get the vcov matrices
Omega <- lapply(model.1step, vcov)
Omega <- Matrix::bdiag(Omega)
round(Omega[1:8, 1:8], 3)
## 8 x 8 sparse Matrix of class "dgCMatrix"
##                                                     
## [1,] 0.520 0.061 0.037 0.034 0.033 0.035 0.034 0.033
## [2,] 0.061 0.520 0.061 0.037 0.033 0.034 0.033 0.033
## [3,] 0.037 0.061 0.520 0.061 0.033 0.033 0.034 0.033
## [4,] 0.034 0.037 0.061 0.520 0.033 0.033 0.033 0.033
## [5,] 0.033 0.033 0.033 0.033 0.520 0.035 0.034 0.061
## [6,] 0.035 0.034 0.033 0.033 0.035 0.520 0.061 0.034
## [7,] 0.034 0.033 0.034 0.033 0.034 0.061 0.520 0.033
## [8,] 0.033 0.033 0.033 0.033 0.061 0.034 0.033 0.520

Now we can proceed to the second stage, which can be performed by using the rma.mv() function in the metafor package, as shown in the box below. Note that we inject the variance-covariance matrix coming from the first stage into the second one. That’s why this is a meta-analytic technique: we are behaving as if the first-stage results had been obtained from the literature!

# Second stage (fully efficient)
mod.meta <- metafor::rma.mv(emmean, Omega, 
                            mods = ~ genotype - 1,
                            random = ~ 1|site/genotype,
                            data = sitmeans, method = "REML")

From this fit we get the remaining variance components (for the ‘sites’ and for the ‘sites x genotypes’ interaction) and the genotypic means, which correspond to those obtained from the one-stage analysis, apart from small differences relating to the optimisation method (see also Tables 1 and 2 in Damesa et al., 2017). That’s why Damesa and co-authors talk about a fully efficient two-stage analysis.

# Variance components
mod.meta$sigma2
## [1] 10.4538773  0.1271925
head(mod.meta$beta)
##               [,1]
## genotype1 5.134780
## genotype2 5.509773
## genotype3 5.147052
## genotype4 4.593256
## genotype5 4.844761
## genotype6 4.691955

A possible approximation to this fully efficient method is also shown in Damesa et al. (2017): it consists of approximating the variance-covariance matrix of the genotypic means (‘Omega’) by a vector of weights, which can be obtained by taking the diagonal elements of the inverse of the ‘Omega’ matrix. To achieve this, we can use the R code in the box below.

# Approximate weights: the diagonal elements of the inverse of 'Omega'
siij <- diag(solve(Omega))
mod.meta2 <- metafor::rma.mv(emmean, 1/siij,
                             mods = ~ genotype - 1,
                             random = ~ 1|site/genotype,
                             data = sitmeans, method = "REML")
mod.meta2$sigma2
## [1] 10.422928  0.127908
head(mod.meta2$beta)
##               [,1]
## genotype1 5.112614
## genotype2 5.431455
## genotype3 5.151905
## genotype4 4.583911
## genotype5 4.811698
## genotype6 4.704518

With this, we have fully reproduced the results for Example 1 in the paper we used as the reference for this post. I hope this was useful.
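
As a quick, optional check (not part of the original paper's code; a minimal sketch that assumes the genotype ordering is the same in both fits, which it is here because both use the same factor levels), we can plot the one-stage means against the fully efficient two-stage estimates:

# Compare one-stage (lme) and fully efficient two-stage (rma.mv) genotype means
one_stage <- as.data.frame(mg)$emmean     # means from the one-stage mixed model
two_stage <- as.numeric(mod.meta$beta)    # coefficients from the meta-analytic fit
plot(one_stage, two_stage,
     xlab = "One-stage means (lme)", ylab = "Two-stage means (rma.mv)")
abline(0, 1, lty = 2)                     # points should lie close to the 1:1 line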

Happy coding!


Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Send comments to: andrea.onofri@unipg.it


References

  1. Damesa, T.M., Möhring, J., Worku, M., Piepho, H.-P., 2017. One Step at a Time: Stage-Wise Analysis of a Series of Experiments. Agronomy Journal 109, 845. https://doi.org/10.2134/agronj2016.07.0395
  2. Gałecki, A., Burzykowski, T., 2013. Linear mixed-effects models using R: a step-by-step approach. Springer, Berlin.
  3. Lenth R (2022). Emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.7.5-090001, https://github.com/rvlenth/emmeans.
  4. Pinheiro JC, Bates DM (2000). Mixed-Effects Models in S and S-PLUS. Springer, New York. https://doi.org/10.1007/b98882.
  5. Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48. https://doi.org/10.18637/jss.v036.i03

Upgrading rtweet

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

In this post I will provide some examples of what has changed between rtweet 0.7.0 and rtweet 1.0.2. I hope both the changes and this guide will help all users. I highlight the most important and interesting changes in this blog post; for a full list of changes, you can consult the NEWS.

Big breaking changes

More consistent output

This is probably the change that will affect the most users. All functions that return data about tweets[1] now return the same columns.

For example if we search some tweets we’ll get the following columns:

> tweets <- search_tweets("weather")
> colnames(tweets)
 [1] "created_at"                    "id"
 [3] "id_str"                        "full_text"
 [5] "truncated"                     "display_text_range"
 [7] "entities"                      "metadata"
 [9] "source"                        "in_reply_to_status_id"
[11] "in_reply_to_status_id_str"     "in_reply_to_user_id"
[13] "in_reply_to_user_id_str"       "in_reply_to_screen_name"
[15] "geo"                           "coordinates"
[17] "place"                         "contributors"
[19] "is_quote_status"               "retweet_count"
[21] "favorite_count"                "favorited"
[23] "retweeted"                     "lang"
[25] "quoted_status_id"              "quoted_status_id_str"
[27] "quoted_status"                 "possibly_sensitive"
[29] "retweeted_status"              "text"
[31] "favorited_by"                  "scopes"
[33] "display_text_width"            "quoted_status_permalink"
[35] "quote_count"                   "timestamp_ms"
[37] "reply_count"                   "filter_level"
[39] "query"                         "withheld_scope"
[41] "withheld_copyright"            "withheld_in_countries"
[43] "possibly_sensitive_appealable"

rtweet now minimizes the processing of tweets and only returns the same data as provided by the API, while making it easier to handle in R. However, to preserve the nested nature of the data returned, some fields are now nested inside others. For example, the fields "bbox_coords", "geo_coords" and "coords_coords" were previously returned as separate columns, but they are now nested inside "place", "coordinates" or "geo", depending on where they are provided. Some columns previously calculated by rtweet are no longer returned, like "rtweet_favorite_count". At the same time, it provides new columns about each tweet, like the "withheld_*" columns.

If you scanned through the columns, you might have noticed that the "user_id" and "screen_name" columns are no longer returned. This data is still returned by the API, but it is now made available to rtweet users via users_data():

> colnames(users_data(tweets))
 [1] "id"                       "id_str"
 [3] "name"                     "screen_name"
 [5] "location"                 "description"
 [7] "url"                      "protected"
 [9] "followers_count"          "friends_count"
[11] "listed_count"             "created_at"
[13] "favourites_count"         "verified"
[15] "statuses_count"           "profile_image_url_https"
[17] "profile_banner_url"       "default_profile"
[19] "default_profile_image"    "withheld_in_countries"
[21] "derived"                  "withheld_scope"
[23] "entities"

This blog post should help you find the right data columns, but if you don’t find what you are looking for, it might be nested inside a column. Try using dplyr::glimpse() to explore the data and locate nested columns. For example, the entities column (which is present in both tweets and users) has the following useful columns:

> names(tweets$entities[[1]])
[1] "hashtags"      "symbols"       "user_mentions" "urls"
[5] "media"
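
As an illustration (a hedged sketch, not from the original post: it assumes that each element of entities is a list of data frames and that the hashtags element has a text column, as in the v1.1 payload), you could pull the hashtag text of every tweet out of the nested column:

# Extract the hashtag text of each tweet from the nested `entities` column
library(purrr)
tweet_hashtags <- map(tweets$entities, function(e) e$hashtags$text)
head(tweet_hashtags)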

Similarly if you look up a user via search_users() or lookup_users() you’ll get consistent data:

> users <- lookup_users(c("twitter", "rladiesglobal", "_R_Foundation"))
> colnames(users)
 [1] "id"                       "id_str"
 [3] "name"                     "screen_name"
 [5] "location"                 "description"
 [7] "url"                      "protected"
 [9] "followers_count"          "friends_count"
[11] "listed_count"             "created_at"
[13] "favourites_count"         "verified"
[15] "statuses_count"           "profile_image_url_https"
[17] "profile_banner_url"       "default_profile"
[19] "default_profile_image"    "withheld_in_countries"
[21] "derived"                  "withheld_scope"
[23] "entities"

You can use tweets_data() to retrieve information about their latest tweet:

> colnames(tweets_data(users))
 [1] "created_at"                    "id"
 [3] "id_str"                        "text"
 [5] "truncated"                     "entities"
 [7] "source"                        "in_reply_to_status_id"
 [9] "in_reply_to_status_id_str"     "in_reply_to_user_id"
[11] "in_reply_to_user_id_str"       "in_reply_to_screen_name"
[13] "geo"                           "coordinates"
[15] "place"                         "contributors"
[17] "is_quote_status"               "retweet_count"
[19] "favorite_count"                "favorited"
[21] "retweeted"                     "lang"
[23] "retweeted_status"              "possibly_sensitive"
[25] "quoted_status"                 "display_text_width"
[27] "user"                          "full_text"
[29] "favorited_by"                  "scopes"
[31] "display_text_range"            "quoted_status_id"
[33] "quoted_status_id_str"          "quoted_status_permalink"
[35] "quote_count"                   "timestamp_ms"
[37] "reply_count"                   "filter_level"
[39] "metadata"                      "query"
[41] "withheld_scope"                "withheld_copyright"
[43] "withheld_in_countries"         "possibly_sensitive_appealable"

You can merge them via:

users_and_last_tweets <- cbind(users, id_str = tweets_data(users)[, "id_str"])

In the future (see below), helper functions will make managing the output of rtweet easier.

Finally, get_followers() and get_friends() now return the same columns:

> colnames(get_followers("_R_Foundation"))
[1] "from_id" "to_id"
> colnames(get_friends("_R_Foundation"))
[1] "from_id" "to_id"

This will make it easier to build networks of connections (although you might want to convert screen names to ids or vice versa).
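
For instance (a minimal sketch, not from the original post; it assumes the igraph package is installed and uses the from_id/to_id columns shown above), such an edge list can be turned directly into a graph:

# Build a small directed network from the edge list returned by get_friends()
library(igraph)
friends <- get_friends("_R_Foundation")
g <- graph_from_data_frame(friends[, c("from_id", "to_id")], directed = TRUE)
summary(g)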

More consistent interface

All paginated functions that don’t return tweets now use a consistent pagination interface (except the premium endpoints). They all store the “next cursor” in an rtweet_cursor attribute, which will be automatically retrieved when you use the cursor argument. This makes it easier to continue a query you started:

users <- get_followers("_R_Foundation")
users
# use `cursor` to find the next "page" of results
more_users <- get_followers("_R_Foundation", cursor = users)

They support max_id and since_id to find earlier and later tweets respectively:

# Retrieve all the tweets made since the previous request
newer <- search_tweets("weather", since_id = tweets)
# Retrieve tweets made before the previous request
older <- search_tweets("weather", max_id = tweets)

If you want more tweets than the rate limits of the API allow, you can use retryonratelimit = TRUE to wait as long as needed:

long <- search_tweets("weather", n = 1000, retryonratelimit = TRUE)

This will keep your terminal busy until the 1000 tweets are retrieved.

Saving data

An unexpected consequence of returning more data (now matching that returned by the API) is that it is harder to save it in a tabular format. For instance, one tweet might have one media attachment, mention two users, and have three hashtags. There isn’t a simple way to save this in a single row uniformly for all tweets, or it could lead to confusion.

This resulted in deprecating save_as_csv, read_twitter_csv and related functions, because they don’t work with the new data structure and it won’t be possible to load the complete data from a CSV file. They will be removed in later versions.

Many users will benefit from saving to RDS (e.g., saveRDS() or readr::write_rds()), and those wanting to export to tabular format can simplify the data to include only that of interest before saving with generic R functions (e.g., write.csv() or readr::write_csv()).
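
As an illustration (a minimal sketch, not from the post itself; the selected columns and file names are just examples), you could keep the full nested object in an RDS file and export a flat subset to CSV:

# Keep the complete nested object
saveRDS(tweets, "tweets.rds")

# Export only flat, atomic columns of interest to CSV
flat <- tweets[, c("created_at", "id_str", "full_text", "lang", "retweet_count")]
write.csv(flat, "tweets_flat.csv", row.names = FALSE)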

Other breaking changes

  • Accessibility is important, and for this reason, if you tweet via post_tweet() and add an image, GIF or video, you’ll need to provide the media alternative text. Without media_alt_text, it will not allow you to post.

  • tweet_shot() has been deprecated as it no longer works correctly. It might be possible to bring it back, but the code is complex and I do not understand it well enough to maintain it. If you’re interested in seeing this feature return, check out the discussion about this issue and let me know if you have any suggestions.

  • rtweet also used to provide functions for data on emojis, langs and stopwordslangs. These are useful resources for text mining in general, not only in tweets; however, they need to be updated to be helpful and would be better placed in other packages. For instance, emojis is now in the bdpar package. Therefore they are no longer available in rtweet.

  • Functions like suggested_*() have been removed, as they have been broken since 2019.

Easier authentication

An exciting part of this release has been a big rewrite of the authentication protocol. While it is compatible with previous rtweet authentication methods, it also has some important new functions which make it easier to work with rtweet and the Twitter API in different ways.

Different ways to authenticate

If you just want to test the package, use the default authentication, auth_setup_default(), that comes with rtweet. If you use it for one or two days, you won’t notice any problem.

If you want to use the package for more than a couple of days, I recommend you set up your own token via rtweet_user(). It will open a window in your default browser to authenticate via the account you are logged in with. This authentication won’t allow you to do everything, but it will avoid running out of requests and being rate-limited.

If you plan to make heavy use of the package, I recommend registering yourself as a developer and using one of the following two mechanisms, depending on your plans:

  • Collect data and analyze: rtweet_app().
  • Set up a bot: rtweet_bot()

Find more information in the Authentication with rtweet vignette.

Storing credentials

Previously, rtweet saved each token it created, but now non-default tokens are only saved if you ask. You can save them manually via auth_save(token, "my_app"). As a bonus, if you name your token “default” (auth_save(token, "default")), it will be used automatically upon loading the library.

Further, tokens are now saved in the location given by tools::R_user_dir("rtweet", "config"), rather than in your home directory. If you have previously saved tokens, or problems identifying which token is which, use auth_sitrep(). This provides clues about which tokens might be duplicated or misconfigured, but it won’t check whether they work. It will also automatically move your tokens to the new path.

To check which credentials you have stored, use auth_list(), and load them via auth_as("my_app"). All the rtweet functions will use the latest token loaded with auth_as() (unless you manually specify one when calling them). If you are not sure which token you are using, call auth_get(): it returns the token currently in use, or asks you to authenticate.
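
Putting those pieces together, here is a hedged sketch of a typical workflow (the token name "my_app" is just an example):

# Create a user token interactively, save it, and reuse it in a later session
token <- rtweet_user()
auth_save(token, "my_app")   # stored under tools::R_user_dir("rtweet", "config")

# In a new R session
library(rtweet)
auth_list()                  # list the saved tokens
auth_as("my_app")            # load the saved token for this session
auth_get()                   # confirm which token is in use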

Other changes of note

This is a list of other changes that aren’t as big, or aren’t breaking changes, but are still worth a mention:

Iteration and continuation of requests

Using cursors, pagination, or waiting until you can make more queries is now easier. For example, you can continue previous requests via:

users <- get_followers("_R_Foundation")
users
# use `cursor` to find the next "page" of results
more_users <- get_followers("_R_Foundation", cursor = users)

Additions

There is now a function to find a user’s thread. You can start from any tweet, and it will find all the tweets of the thread: tweet_threading("1461776330584956929").

There is a lot of interest in downloading and keeping track of interactions on Twitter. The amount of interest is big enough that Twitter is releasing a new API to provide more information of this nature.

Future

Twitter API v2 is being released, and soon it will replace API v1. Up to now, including this release, rtweet uses API v1, so it will need to adapt to the new endpoints and the new data returned.

The streaming endpoints will come first, in November, so expect more (breaking?) changes around that date, if not earlier.

I would also like to make it easier for users, dependencies, and the package itself to handle the outputs. To this end, I would like to provide some classes to handle the different types of objects it returns.

This will help avoid some of the current shortcomings. Specifically, I would like to provide functions to make it easier to reply to previous tweets, extract nested data, and subset tweets and the accompanying user information.

Conclusions

While I made many breaking changes, I hope they will smooth future development and help both users and maintainers.

Feel free to ask on the rOpenSci community forum if you have questions about the transition or find something amiss. Please let me know! It will help me prioritize which endpoints are most relevant to the community. (And yes, the academic archive endpoint is on the radar.)

It is also possible that I overlooked something and thought the code was working when it isn’t. For example, after several months of changing the way the API output is parsed, several users found it wasn’t handling some elements. Let me know of such or similar cases and I’ll try to fix them.

In case you find a bug, check the open issues and, if it has not already been reported, open an issue on GitHub. Don’t forget to make a reprex and, if possible, provide the id of the tweets you are having trouble with. Unfortunately, it has happened that when I came to look at a bug, I couldn’t reproduce it because I wasn’t able to find the tweet which caused the error.

This release includes contributions from Hadley Wickham, Bob Rudis, Alex Hayes, Simon Heß, Diego Hernán, Michael Chirico, Jonathan Sidi, Jon Harmon, Andrew Fraser, and many others who reported bugs or provided feedback. Many thanks to all of you for using rtweet, for your interest in keeping it working, and for improving it for everyone.

Finally, you can read the whole NEWS online and the examples.

Happy tweeting!


  1. Specifically these: get_favorites(), get_favorites_user(), get_mentions(), get_my_timeline(), get_retweets(), get_timeline(), get_timeline_user(), lists_statuses(), lookup_statuses(), lookup_tweets(), search_30day(), search_fullarchive(), search_tweets(), tweet_shot() and tweet_threading().


Mbaza Shiny App Case Study

[This article was first published on Tag: r - Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers.]

We care about the impact our work has on the world around us. That’s why we want to ensure our work paves the way to a sustainable future. Using 🌍Data for Good is how we achieve this. 

Technology can help resolve sustainability challenges related to climate change and biodiversity conservation. That’s why we commit time and resources to ideas that are genuinely having a positive impact on our planet. 

R Shiny provides a great way to showcase the ability of data for good. One such example is our Mbaza Shiny App, complementing the Mbaza AI.

Mbaza AI

Mbaza AI is an open-source, free-use application for biodiversity conservationists. Appsilon created this tool for change, in collaboration with researchers at the University of Stirling and The National Parks Agency of Gabon (ANPN), as part of our Data for Good initiative.

Mbaza AI automatically, accurately, and rapidly classifies animal species in camera trap images or videos. And it does so using a state-of-the-art artificial intelligence (AI) model.

[Figure: Process of manual identification of camera trap images]

Our model can classify 3000 images per hour and is up to 96% accurate, using an average laptop without an internet connection. And best of all – it’s free to use!

Mbaza Shiny App

Complementing the Mbaza AI algorithm is an interactive data explorer interface – Mbaza Shiny App. The Mbaza Shiny App intakes the data from the AI model and allows for analysis and visualization in an interactive dashboard. Shiny Dashboards are an excellent tool for telling data-centric stories.

“The Mbaza Shiny App integrates with the Mbaza desktop app for camera trap image analyses and can be used to automatically calculate daily activity patterns of different animal species, create maps and calculate measures of relative abundance with no coding or statistical knowledge.” – Robin Whytock, PhD, former Postdoctoral Researcher at The University of Stirling

[Figure: Mbaza AI and Shiny App workflow]

Improving the Mbaza Shiny App for visualizing data produced by our AI model offered a great opportunity to leverage another aspect of our technical expertise: R Shiny development.

By combining our data science skills across AI & R Shiny programming, we found a way to use technology to accelerate nature conservation efforts in Gabon. In short, we used technology to fight threats to biodiversity and we made it available to you.

Challenges

Projects like this require a good understanding of the end user’s needs. In this case, our primary users are researchers, ecologists, and park rangers. These users are working tirelessly in the field and typically do not have extensive programming knowledge. 

This is just one side of the equation, though. Product development requires knowledge exchange, flowing both ways. We, as experts and consultants in our tech niches, have to explain in a comprehensive way what is and isn’t feasible. And our clients must share their domain knowledge.

Are you looking to apply AI to nature conservation? Discover the technology helping endangered animals.

When exchanging domain knowledge and working with specialists, it’s important to overcome the following challenges:

  1. Succinctly express and understand the ultimate goal of the project (biodiversity conservation) and all intrinsic aspects of the end users’ (wildlife conservationists) day-to-day responsibilities.
  2. Address working with unique data. This was the first time we were tasked with visualizing camera-trap data, and due to the remote nature of the project access to end users was limited.
  3. Explaining R/Shiny strengths and limitations to manage expectations.
  4. Balancing all aspects of the work (time, quality, and functionality) so that both parties were satisfied with the end product.

By doing so, the end product will better serve its users and the project timeline will be met.

Approach

Close collaboration rooted in open communication and mutual trust set the stage for overcoming these challenges. We discussed all requirements and pain points from our partners, and the whole team put themselves in the shoes of the end users. 

This is how we approached the implementation:

  1. Constant feedback from the partner – closely listening to the experts.
  2. Incorporate data and algorithms prepared by biologists into our application – combining domain knowledge and new technologies.
  3. Sound planning and robust quality assurance process throughout the entirety of the project.

You can apply this approach to any Shiny development project to build higher-quality products that are more likely to see user adoption.

Results

Today, the Mbaza Shiny App has progressed with improved functionalities. We’ve accelerated effective biodiversity conservation in Gabon, by speeding up the data analysis process. The user base is growing and user feedback is overwhelmingly positive. 

And at the same time, we know it is just the beginning. We want to reach more people and more projects across the globe. We want to help them efficiently protect endangered animals with the power of AI and R Shiny!

We’ve learned a lot from our fruitful journey with the Mbaza Shiny App. In many ways, we improved ourselves as well as efforts in biodiversity conservation:

  1. The team expanded our technical skillset in utilizing technology for biodiversity conservation.
  2. We gained a sense of fulfillment in delivering a useful tool for nature and society.
  3. We see the project continuing and expanding into something even greater.

Both we and the clients came away satisfied with the end product. You can explore the Mbaza Shiny App demo.

Summary

“These tools greatly enhance the ability of protected area managers to rapidly analyze camera trap data and make informed conservation decisions. The Shiny app is in the development phase but we hope to roll it out soon as a core part of the Mbaza image analysis pipeline.” – Robin Whytock, PhD, former Postdoctoral Researcher at The University of Stirling

If you’re looking to accelerate your biodiversity monitoring and conservation efforts with an R Shiny dashboard, check out Appsilon’s free-use Shiny Dashboard Templates. Simplify and speed up your Shiny dashboard build with our ready-to-use templates. The bundle contains several beautiful and easy-to-use templates with a range of features and tech stacks. The best part is – it’s entirely free!

Discover what it means to be an RStudio Full Service Certified Partner and how we can serve you.

If you need a more customized, advanced option – reach out to us. We’re here to help. Appsilon is an RStudio Full Service Certified Partner. We develop advanced R Shiny applications for Fortune 500 companies across the globe. We’d be happy to help you choose the right options for your use case. Let’s talk and see how Shiny can help you grow.

Tired of manual data labeling and mislabeled training data? See how the Appsilon ML team built a streamlit widget for cleaning ML labels.

This article was co-authored by Appsilon Project Manager Konrad Żurawski and D4G Lead Andrzej Białaś.

The post Mbaza Shiny App Case Study appeared first on Appsilon | Enterprise R Shiny Dashboards.


Campus useR Group Frankfurt Using Non-Traditional Techniques to Increase Information Sharing

[This article was first published on R Consortium, and kindly contributed to R-bloggers.]

R Consortium recently talked to Till Straube of the Campus useR Group Frankfurt, Germany, about the group’s aim to provide an informal knowledge-sharing environment for R users on campus. The group’s unique format was difficult to replicate in online events, and they look forward to returning to in-person meetings. He explained that the group constantly strives to be inclusive of R users from all backgrounds and levels of expertise.

Till is a geographer who works at Goethe University Frankfurt in the Department of Human Geography. His research interests center on critical data science, digital infrastructures, and security technologies.


What is the R community like in Germany?

Before moving to Frankfurt, I was working in Bangkok, and we had very active user groups for software engineers. These were casual gatherings, where we got together and talked about different technologies. I was looking for similar groups in Frankfurt to connect on the same level. I found one user group and tried reaching out to them, but it had been inactive, and I didn’t get a response from them right away. So I decided to start a new user group for R in Frankfurt and started looking for allies. I found Janine Buchholz, who also worked on campus here, and we started this group together. 

After we had announced our first meeting, we heard from the existing user group. A company had been organizing it, and they had a very different approach. At first, we considered combining the two groups, but our different focuses became apparent. So we decided it would be fitting to have a university-focused group.

We started the Campus useR Group Frankfurt as a platform for everyone who works with R on campus to connect in a more casual way than the classic format of expert talks. We experimented a lot with different formats that were all designed to get people in the room talking about R in a way that was comfortable for everyone. In terms of topics, we found our niche in questions related to research, publishing, and teaching, but we also discussed working with R more generally from the beginning. 

There are also Data Science meetups here where they talk about R as well, but they tend to be rather business focused. My impression is that few academic users of R are committed enough to join those conversations outside of the university. Also, R-Ladies Frankfurt was founded around the same time. Some members of our group are also part of that group, so there has been a lively exchange between the two groups.

  Campus UseR Group Logo

How has COVID affected your ability to connect with members?

Before COVID, we had regular meetings in person once a month. It was around March 2020 when we could no longer meet on campus. We switched to online meetings pretty much right away.

At that time, it seemed many people were looking for online meetings on the Meetup platform. Interestingly, there were suddenly people joining our meetups from all over the world, sometimes just listening in.

We held on to our casual and diverse formats, but the casual, information-sharing experience was missing. Even though we were successful in creating an online setting that remained easy-going and fun, it was a lot more work. It was just not the same, as many of the formats from our in-person meetups didn't work well online. The spirit of casually sharing ideas or coding together on a laptop couldn't translate so well online.

At the end of last year, our co-founder Janine moved away for job reasons, and I was very busy finishing my Ph.D. and preparing the defense. So we haven’t had meetings for 4-5 months. But we are planning to return with in-person events at the university because that’s possible now. Probably with the new semester, we will start again.

In the past year, did you have to change your techniques to connect and collaborate with members? For example, did you use GitHub, videoconferencing, online discussion groups more? Can these techniques be used to make your group more inclusive to people that cannot attend physical events in the future?  

We have been using Zoom for video conferencing, and it worked fine, more or less. We were also sharing data for doing exercises, but we had been doing that before the pandemic, so this was not a recent development for us. We don’t have a dedicated GitHub repository. Our preparation is much more casual, and the focus is on the meetings. We have a “no homework” policy so that anyone can join at the spur of the moment. 

As far as the idea of being more inclusive by offering online formats is concerned, we are focused on creating unique learning experiences that are only possible in person. The formats we had in mind for our group didn't translate very well online. However, we do put in effort to stay more inclusive in terms of language, for example. All our events are in English even if all the people preparing them are native German speakers. In addition, we always emphasize that our meetings are for R users of all backgrounds and all levels of expertise. So I would rather focus on keeping our group inclusive in those ways than looking at online formats as a quick fix.

Can you tell us about one recent presentation or speaker that was especially interesting and what was the topic and why was it so interesting? 

We have had some amazing talks from members of our group. As I said, we try to avoid the classic format of having an expert talk in presentation style because we have so much of that in the university already. We usually choose a topic and share our experiences. And no matter the topic, it usually turns out that someone in the group has had valuable experiences that everyone can learn from.

One meeting I remember fondly was called “Teach me Shiny,” where I was introduced as the “non-expert” while the audience was the experts. I had heard of Shiny at the time, but I hadn’t used it. I would be in the lead sharing my screen, but the audience had to tell me what to do. In the end, I put together a simple Shiny app, and I think that was a very interesting format and a learning experience for everybody.

What trends do you see in R language affecting your organization over the next year?

Publishing tools and workflows involving R have huge potential at the university. I am not sure if it will affect our organization, but I really hope more people continue to look into it. I do all the scripts for my lectures in Bookdown because it is so easy to share information and collaborate. R does not get enough recognition as a publishing platform. I don’t see many people using it for teaching materials, for example, using R Markdown for making slides, writing papers, etc. That’s really where I save a lot of time: I can stay in my text editor and don’t have to fiddle with other software. I can’t predict the future, but I hope more people at university realize this potential and start playing around with R in this way.

Do you know of any data journalism efforts by your members? If not, are there particular data journalism projects that you’ve seen in the last year that you feel had a positive impact on society?

I should be more in touch with what’s happening in data journalism, but unfortunately, I am not up to date at the moment. I know there are really interesting data journalism efforts that are connected to far-right violence in Germany, which I think is a really important topic. It’s something that lends itself to giving it more visibility through maps and data journalism. 

Of the Funded Projects by the R Consortium,  do you have a favorite project?  Why is it your favorite?

As a geographer, I am really interested in the spatial features of R, so I was really pleased to see the Tidy Spatial Networks project’s efforts. I have also used d3 before, and d3po is also one of the funded projects. Overall, I find projects with a focus on spatial data very interesting. Also, it’s great to see that R-Ladies is getting funded support.

When is your next event? Please give details!

We are planning to return with an in-person meeting at the university. Unfortunately, we will have to wait until the start of the new semester in October. I plan to celebrate this relaunch by announcing it beforehand and reaching out to everyone. With this event, we want to revive the format of casual meetups with everybody sharing ideas and doing some coding hands-on. I feel that it really fills an important gap in the R landscape.


How do I Join?

R Consortium’s R User Group and Small Conference Support Program (RUGS) provides grants to help R groups around the world organize, share information and support each other. We have given grants over the past four years, encompassing over 65,000 members in 35 countries. We would like to include you! Cash grants and meetup.com accounts are awarded based on the intended use of the funds and the amount of money available to distribute. We are now accepting applications!

The post Campus useR Group Frankfurt Using Non-Traditional Techniques to Increase Information Sharing appeared first on R Consortium.

To leave a comment for the author, please follow the link and comment on their blog: R Consortium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Campus useR Group Frankfurt Using Non-Traditional Techniques to Increase Information Sharing

Systematic error messages

[This article was first published on R – Blog – Mazama Science , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Anyone writing code for use in data processing systems needs to have a well thought-out protocol for generating error messages and logs. When a complex pipeline breaks, good logs and recognizable error messages are key to debugging the problem. This post describes improvements to the MazamaCoreUtils package that help you create systematic error messages that can be better handled by calling functions.

Easy Error Handling

Error handling in the MazamaCoreUtils package has been described in previous blog posts:

We are a little obsessed with logging and error handling, but for a very good reason: Whenever anyone complains that one of our automated data processing pipelines isn’t working properly, a quick look at the log files tells us what, if anything, is wrong with our code or whether the problem lies further upstream with the data we ingest.

Good error messages and detailed logging can save you LOTS of time!

In an effort to make error handling as easy as possible, our previous recommendation was to put any code that might fail into a try block and test the result:

result <- try({
  # ...
  # lines of R code
  # ...
}, silent = FALSE)
stopOnError(result)

The stopOnError() function tests whether result is of class “try-error” and stops with the contents of geterrmessage() or a custom error message provided by the user. If logging is enabled, this message is also logged at the ERROR level. This has been a very useful construct over the years.
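
As a rough sketch of what that test amounts to (my own illustration, not the package's internal code), the pattern is the familiar base R check on the class of a try() result:

result <- try(log("not a number"), silent = TRUE)

if ("try-error" %in% class(result)) {
  # in the real function this is also where the message would be logged
  stop(geterrmessage(), call. = FALSE)
}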

Improvements to stopOnError()

After many months using this function operationally in our data processing pipelines, a few improvements have been added and are described below. The new function signature is:

stopOnError <- function(
  result,
  err_msg = "",
  prefix = "",
  maxLength = 500,
  truncatedLength = 120,
  call. = FALSE
) {

result

As in the previous version, this is the result of the try({ ... }) block.

err_msg

As in the previous version, a custom error message will be used if provided.

prefix

Sometimes it is nice to retain the detailed information obtained with geterrmessage() while assigning this message to a particular type. Providing a prefix allows developers to approximate Python-style exceptions by prefixing an error message with their own information, e.g.:

stopOnError(result, prefix = "USER_INPUT_ERROR")
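
One reason such a prefix is handy is that calling code can branch on it. A hedged sketch (the prefix string and the recovery action here are illustrative only, not part of MazamaCoreUtils):

result <- try(stop("USER_INPUT_ERROR: non-numeric argument"), silent = TRUE)

# pull the bare condition message out of the try-error object
msg <- conditionMessage(attr(result, "condition"))

if (grepl("^USER_INPUT_ERROR", msg)) {
  message("Problem traced to user input; ask for a corrected value")
}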

maxLength, truncatedLength

We have found that the contents of geterrmessage() can sometimes include huge amounts of text. For example, when requesting data from a web service, we might get back the html contents of an error status page. When this is written to a log file, that log file becomes much more difficult for a human to scan.

Truncating messages to some reasonable length ensures that we get useful information without wrecking the readability of our log files.
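
As a back-of-the-envelope illustration of what truncation like this amounts to (my own sketch; the " ... [truncated]" suffix is a placeholder, not necessarily what the package appends):

err_msg <- paste(rep("x", 1000), collapse = "")  # stand-in for a huge HTML error page

maxLength <- 500
truncatedLength <- 120

if (nchar(err_msg) > maxLength) {
  err_msg <- paste0(substr(err_msg, 1, truncatedLength), " ... [truncated]")
}

nchar(err_msg)  # now 136 characters instead of 1000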

call.

The previous version of stopOnError() always stopped with stop(err_msg, call. = FALSE). This restriction is not necessary, so we elevated this argument into the stopOnError() function signature.

Example Usage

Our new, preferred style of error handling looks like this:

library(MazamaCoreUtils)

# A function that might fail
myFunc <- function(x) { return(log(x)) }

# Bad user input
userInput <- "10"

# Default error message
try({
  myFunc(x = userInput)
}, silent = TRUE) %>%
  stopOnError()   # the try() result is piped in as `result`

# Custom error message
try({
  myFunc(x = userInput)
}, silent = TRUE) %>%
  stopOnError(err_msg = "Unable to process user input")

# Default error message with prefix
try({
  myFunc(x = userInput)
}, silent = TRUE) %>%
  stopOnError(prefix = "USER_INPUT_ERROR")

# Truncating prefixed default message
try({
  myFunc(x = userInput)
}, silent = TRUE) %>%
  stopOnError(
    prefix = "USER_INPUT_ERROR",
    maxLength = 40,
    truncatedLength = 32
  )

Best wishes for better error handling in all your code.

To leave a comment for the author, please follow the link and comment on their blog: R – Blog – Mazama Science .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Systematic error messages

Convert multiple columns into a single column-tidyr Part4

[This article was first published on Data Science Tutorials, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Convert multiple columns into a single column-tidyr Part4 appeared first on Data Science Tutorials

Convert multiple columns into a single column: to combine numerous data frame columns into one column, use the unite() function from the tidyr package.

Convert multiple columns into a single column

The basic syntax used by this function is as follows.

unite(data, col, ..., sep)

where:

data: Name of the data frame

col: Name of the new united column

...: the names of the columns to be combined, supplied as a vector

sep: the separator to place between the combined values in the new united column

The practical application of this function is demonstrated in the examples that follow.

5 Free Books to Learn Statistics For Data Science – Data Science Tutorials

Example 1: Unite Two Columns into One Column

Let’s say we have the R data frame shown below:

Let’s create a data frame

df <- data.frame(player=c('P1', 'P1', 'P2', 'P2', 'P3', 'P3'),
year=c(1, 2, 1, 2, 1, 2),
points=c(202, 290, 180, 101, 312, 219),
assists=c(92, 63, 76, 88, 65, 52))

Now we can view the data frame

df
  player year points assists
1     P1    1    202      92
2     P1    2    290      63
3     P2    1    180      76
4     P2    2    101      88
5     P3    1    312      65
6     P3    2    219      52

The “points” and “assists” columns can be combined into a single column using the unite() function.

library(tidyr)

combine the columns for points and assists into one column.

Best Books on Data Science with Python – Data Science Tutorials

unite(df, col='points-assists', c('points', 'assists'), sep='-')
  player year points-assists
1     P1    1         202-92
2     P1    2         290-63
3     P2    1         180-76
4     P2    2         101-88
5     P3    1         312-65
6     P3    2         219-52
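
As a quick aside (not from the original tutorial), unite() also has a remove argument; setting remove=FALSE keeps the original columns alongside the new united one:

unite(df, col='points-assists', c('points', 'assists'), sep='-', remove=FALSE)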

Example 2: Unite More Than Two Columns

Let’s say we have the R data frame shown below:

Let’s create a data frame

df2 <- data.frame(player=c('P1', 'P1', 'P2', 'P2', 'P3', 'P3'),
year=c(1, 2, 1, 2, 1, 2),
points=c(228, 229, 198, 151, 412, 325),
assists=c(82, 93, 66, 45, 89, 95),
blocks=c(28, 36, 32, 82, 18, 12))

Let’s view the data frame

df2
  player year points assists blocks
1     P1    1    228      82     28
2     P1    2    229      93     36
3     P2    1    198      66     32
4     P2    2    151      45     82
5     P3    1    412      89     18
6     P3    2    325      95     12

The points, assists, and blocks columns can be combined into a single column using the unite() function.

library(tidyr)

combine the points, assists, and blocks columns into one

unite(df2, col='stats', c('points', 'assists', 'blocks'), sep='/')
  player year     stats
1     P1    1 228/82/28
2     P1    2 229/93/36
3     P2    1 198/66/32
4     P2    2 151/45/82
5     P3    1 412/89/18
6     P3    2 325/95/12
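
As a side note (an addition here, not part of the original tutorial), the operation can be reversed with tidyr's separate() function, which splits the united column back apart on the separator; convert=TRUE turns the pieces back into numbers:

library(tidyr)

stats_df <- unite(df2, col='stats', c('points', 'assists', 'blocks'), sep='/')

separate(stats_df, col='stats', into=c('points', 'assists', 'blocks'),
         sep='/', convert=TRUE)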

Best Books to learn Tensorflow – Data Science Tutorials

Have you liked this article? If you could email it to a friend or share it on Facebook, Twitter, or LinkedIn, I would be eternally grateful.

Please use the like buttons below to show your support. Please remember to share and comment below. 

The post Convert multiple columns into a single column-tidyr Part4 appeared first on Data Science Tutorials

To leave a comment for the author, please follow the link and comment on their blog: Data Science Tutorials.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Convert multiple columns into a single column-tidyr Part4

R code snippet : Transform from long format to wide format

[This article was first published on SH Fintech Modeling, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post introduces a useful R code snippet for transforming long format data to the wide format. We occasionally encounter long format data, such as yield curve data, since a yield curve has two dimensions: maturity and date. To this end, we can use the built-in reshape() function.

Long and Wide formats

What are the long and wide formats? A picture paints a thousand words.
R snippet : Transform from long format to wide format
We want to transform long format data to wide format data for panel or time series analysis, such as a term structure of interest rates. As financial data is usually extracted from a database system, we often encounter the long format. For example, the data in the above figure is a sample of the Euro area yield curve in the long format. To facilitate an empirical analysis, the wide format is more appropriate. Transforming between the long and wide formats can be carried out with the reshape() function. No further explanation is needed; let's see the R code below.

R code

The following R code reads the sample data and transforms the long format to the wide format and vice versa. When using the reshape() function, we need to set the direction argument to "long" or "wide". In particular, we need to add a prefix with some delimiter (".", "_", etc.) to the column names of the wide format data when we transform it to the long format.
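
Before the full yield-curve listing below, here is a tiny self-contained illustration of reshape() in both directions, using toy data made up for this post (not the ECB sample used later):

# toy long data: two ids observed at two times
long <- data.frame(id = c("a", "a", "b", "b"),
                   t  = c(1, 2, 1, 2),
                   v  = c(10, 11, 20, 21))

# long -> wide: one row per id, with columns v.1 and v.2
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "t")
wide

# wide -> long: the "v." prefix plus sep = "." tells reshape() how to split the names
long2 <- reshape(wide, direction = "long", idvar = "id",
                 varying = c("v.1", "v.2"), sep = ".")
long2
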
#========================================================#
# Quantitative ALM, Financial Econometrics & Derivatives 
# ML/DL using R, Python, Tensorflow by Sang-Heon Lee 
#
# https://kiandlee.blogspot.com
#——————————————————–#
# Long to wide format and vice versa for yield data
#========================================================#
 
graphics.off(); rm(list = ls())
 
# sample data : ECB zero yields
str_data <- "term    date    rate
            3M    2021-01-29    -0.625
            3M    2021-02-26    -0.612
            3M    2021-03-31    -0.636
            3M    2021-04-30    -0.628
            3M    2021-05-31    -0.632
            3M    2021-06-30    -0.650
            3M    2021-07-30    -0.663
            3M    2021-08-31    -0.676
            3M    2021-09-30    -0.712
            3M    2021-10-29    -0.736
            3M    2021-11-30    -0.895
            3M    2021-12-31    -0.731
            3Y    2021-01-29    -0.771
            3Y    2021-02-26    -0.648
            3Y    2021-03-31    -0.711
            3Y    2021-04-30    -0.684
            3Y    2021-05-31    -0.666
            3Y    2021-06-30    -0.672
            3Y    2021-07-30    -0.813
            3Y    2021-08-31    -0.760
            3Y    2021-09-30    -0.677
            3Y    2021-10-29    -0.537
            3Y    2021-11-30    -0.766
            3Y    2021-12-31    -0.620
            10Y    2021-01-29    -0.512
            10Y    2021-02-26    -0.246
            10Y    2021-03-31    -0.279
            10Y    2021-04-30    -0.180
            10Y    2021-05-31    -0.146
            10Y    2021-06-30    -0.203
            10Y    2021-07-30    -0.440
            10Y    2021-08-31    -0.393
            10Y    2021-09-30    -0.170
            10Y    2021-10-29    -0.069
            10Y    2021-11-30    -0.350
            10Y    2021-12-31    -0.188
            20Y    2021-01-29    -0.176
            20Y    2021-02-26    0.103
            20Y    2021-03-31    0.142
            20Y    2021-04-30    0.252
            20Y    2021-05-31    0.287
            20Y    2021-06-30    0.201
            20Y    2021-07-30    -0.059
            20Y    2021-08-31    -0.033
            20Y    2021-09-30    0.195
            20Y    2021-10-29    0.103
            20Y    2021-11-30    -0.115
            20Y    2021-12-31    0.056"
 
#==========================================
# Read a sample of ECB zero coupon yields
#==========================================
df_long <- read.table(text = str_data, header = TRUE)
 
#==========================================
# Transform LONG to WIDE format
#==========================================
 
# using "wide" option
df_wide <- reshape(df_long, direction = "wide",
                   idvar = "date",
                   timevar = "term")
df_wide
 
# initialize row names
rownames(df_long) <- NULL
 
# delete an unnecessary prefix in column names
colnames(df_wide) <- gsub("rate.", "", colnames(df_wide))
df_wide
 
#==========================================
# Transform WIDE to LONG format
#==========================================
df_wide2 <- df_wide
 
# need to add a new column name prefix to the value columns
colnames(df_wide2)[-1] <-
    paste0("term.", colnames(df_wide)[-1])
 
# using "long" option
df_long2 <- reshape(df_wide2, direction = "long",
                    idvar = "date",
                    varying = colnames(df_wide2)[-1],
                    sep = ".")
 
# initialize row names
rownames(df_long2) <- NULL
df_long2
 
Running the above R code produces the following wide format of the yield curve data.
R code snippet : Transform from long format to wide format
We can also transform the wide format data back to the long format.
R code snippet : Transform from long format to wide format
To leave a comment for the author, please follow the link and comment on their blog: SH Fintech Modeling.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: R code snippet : Transform from long format to wide format

Developer diary for {ggshakeR} 0.2.0 (a package for soccer analytics viz): Working smoothly as a team on GitHub for R package development!

[This article was first published on R by R(yo), and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

{ggshakeR} 0.2.0, a package for soccer analytics visualizations for R, is released! This version brings a huge amount of new functionality as well as changes to existing functions.

devtools::install_github("abhiamishra/ggshakeR")

This blog post however, is more of a developer diary that seeks to talk more about what goes on ‘under the hood’ and also to teach my fellow authors Abhishek and Harsh more about working collaboratively as a team on Git/Github for R package development. Ideally, I would’ve liked to have written this for them before we started working on v0.2.0 but a lot of what I wrote here are various lessons and tips I have been giving them throughout the development window for this latest release. This blog post is more of a collection of the various best practices (from my own POV, of course) for R package maintenance/building and using Git/Github, for them as well as any other would-be R package creators out there, especially in the soccer analytics space.

But first, I’ll briefly go over some of the new features.

Let’s get started!

New features in v0.2.0

You can check out the changes in more detail in NEWS.md but the major highlights are:

  • Addition of new functions (and improvement of some existing ones):
    • plot_convexhull()
    • plot_voronoi()
    • plot_passnet()
    • calculate_epv()
  • Standardization of argument and function names: This was the big issue that I worked on for this release, you can see my thought process in the Guide to v0.2.0 changes vignette
  • Increasing test coverage: I had set a ball-park target of around 80% coverage but what was more important to me was that the general functionality of every function was covered and, thanks to the efforts of Abhishek and Harsh, we were largely able to achieve that.

All of the vignettes have been updated, which you can check out in the package website. The existing vignettes have been updated for the new functions in this release and there is also the Guide to Version 0.2.0 that covers all of the syntax standardization changes that I made for v0.2.0.

Workflow

This is mainly about the workflow related to creating new features or fixing bugs. Following the release of version 0.1.2 (see previous developer diary here), things were set in motion for the next big release through discussions between the package authors. With the various Github Actions CI tools (codecov, lintr, package checks) that I was able to implement, I wanted to take things a bit further in working more as a functional team on Github. So I started outlining the various things I do at my regular job and how they could apply to this open source project team.

Github Issue

  • Title: Start with a verb describing the main action that the issue is supposed to fix, then a short description.

    • Create..., Fix..., Simplify..., etc.
  • Description: Use the first comment box once you’ve created the issue to describe in a bit more detail. Specify the function(s) you want to work on (you might have already mentioned it in the title), possible steps you want to take, some brainstorming thoughts. For bug issues sometimes you may not have too much to say here… yet.

  • Assignments: On the right side-bar of every issue are various buttons that you can use to tag and organize your issue.

    • Assign a person via Assignees
    • Label your issue as bug, enhancement, etc. (you can customize these to fit your project)
    • Project/Milestone: If this is part of a larger release ‘Project’ or ‘Milestone’ you can add the issue to those from here.

Branch

  • Branch:
    • Create a new branch based on the issue: I like to name it in a similar vein to the issue. Start with an action verb then a short description (can be difficult at times especially as I prefer to keep it to less than 5 whole words…) but this time I also append the Github issue number at the end.
    • Example: create-passnetwork-function-#56.

The important thing is that one issue should be addressing one ‘topic’, or at least as much as possible. Keeping an issue focusing on one feature or bug also helps when we start having multiple branches for separate individual issues and in general it keeps information about an issue in one particular place rather than spread out over multiple places. Of course, when it comes to stuff like bugs then certain things can definitely cross-pollinate so you need to be mindful of referring/mentioning across these multiple issues.

There are many times that you may find a new problem while you’re working on an issue. In my point of view, if that new problem does not directly affect the current problem you’re working on in that issue/branch, I simply open a new issue (and later branch) rather than shoving changes unrelated to the current working issue/branch.

While working on the branch, every commit message should give some idea of what was done and where. At the end of the commit message reference the issue connected to the branch you’re working on, ex. references #56 or #56 (this is why I like having the issue number in the branch name because it helps me remember what the issue number is). In the issue itself on Github, I like to take notes and brainstorm stuff so if I return after a holiday/weekend/whatever I can remember what I was doing (talking myself through the problem helps me quite a bit).

So a Github issue will have a stream of various commits you made (via references in the commit messages) as well as your own messages to yourself or team members. While I personally like making lots of small commits to keep track of things, I’ve heard some people like to make a separate commit for each separate file they’re making changes in, but I find that excessive. I did that quite a few times in issue #27 but it was more to make a point to my fellow authors about making sure to reference the issue you're working on in each commit so that it shows up in the respective Github issue. When it comes to actually merging everything to master, I use the Squash and merge option to clean them all up and organize them into a single message anyways, so concerns about a bunch of tiny commits flooding the git log on the master branch are mitigated.

Creating a new function

When creating functions I like to be very explicit when referencing dependencies by using @importFrom rather than @import. This is very helpful as it lets me know at a quick glance what exact functions are being used within my own functions. Which means I can keep track of things better when I add or remove dependencies to my package. When first creating the function, I like to explicitly specify functions via :: notation because this lets me automatically generate the @importFrom roxygen lines (and all the other roxygen documentation!) via the {sinew} package and its RStudio Add-on.
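
A hypothetical sketch of what this looks like in practice (the function and the dplyr calls are illustrative only, not actual {ggshakeR} code): the body uses :: references, and the @importFrom lines are the kind of roxygen tags that {sinew} can generate from them.

#' Plot-prep helper (illustrative example only)
#'
#' @importFrom dplyr filter mutate
prep_example <- function(data) {
  data <- dplyr::filter(data, !is.na(x))
  data <- dplyr::mutate(data, x2 = x^2)
  data
}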

sinew example gif (hull_fun)

On the other hand, you do not need to explicitly specify auxiliary functions from within {ggshakeR} nor do we need to do it for the various {testthat} functions within the test scripts. When I start a developing session, I always call devtools::load_all() to load all the dependencies in my package (as listed in the DESCRIPTION file) instead of typing multiple library() calls every single time.

Some related links:

Pull Requests

Once all the changes are made and you’re ready to create a Pull-Request, the NEWS.md document should be updated. Similarly to all the other writing done, I like to format it as VERB/ACTION + description and then add a link to the specific issue.

0.2.0 NEWS.md

Before every commit + push I like to mash:

  • Ctrl + Shift + D == devtools::document()
  • Ctrl + Shift + B == Install and Restart

and for the final commit + push before a PR I may also run:

  • Ctrl + Shift + T == devtools::test()
  • Ctrl + Shift + E == devtools::check()

Although we have Github Actions running these tests and checks automatically, right before a PR I always like to check on my local environment anyway. I talked about the various GHA workflows I implemented for {ggshakeR} in more detail in the last developer diary but the main ones are as follows:

  • Running R CMD Check
  • Running {lintr} code style checks
  • Building the {pkgdown} documentation website automatically

These GHA workflows are all stored in the .github/workflows directory if you want to have a closer look at what they do. For the purposes of reminding the package authors of what needs to be done when finalizing a branch for the Pull-Request, I created a checklist template that shows up when a PR is created on Github. It’s a simple markdown file that you can create and edit to fit your team’s needs. The documentation says to put it in its own folder inside .github but if you only have one PR template, then you only need to put it inside the .github directory.

PR template

I’ve also created two separate issue templates, one for bug reports and another for feature requests. This is more geared toward our end-users than the authors and is based pretty much on what the {tidyverse} set of packages have implemented. I made these using YAML and storing them as .yml files inside the .github/ISSUE_TEMPLATE directory in the repository. You can see what they look like in action by creating an issue on Github. This is useful to let users know how best they can help us, help them by clearly stating what we expect to see in an issue. (Note: Since I don’t have owner access to the {ggshakeR} repository I used the YAML templates instead of using the UI from the Settings tab)

issue menu

bug report

Github issue template help:

YAML help:

Now you only need to wait for the various Github Actions runs to finish and start checking off the items on the PR list and/or pushing more tweaks to the branch if there were any problems. Once the checklist is crossed off, you can Squash and merge, write down the overall commit message for the entire branch (summarize and/or edit based on the commit messages made throughout the branch), then finally delete the branch after you merge it (these are all things covered in the checklist I mentioned earlier).

A small list of helpful resources for my fellow {ggshakeR} authors // any other package creators reading this blog post:

Closing Comments

Version 0.2.0 was around 4 whole months in the making and we’re delighted for it to finally be released. Big thanks to package creator Abhishek and my fellow author Harsh for all the hard work they put in the past year.

It’s been pretty difficult at times due to the fact that the authors each reside in a different country/timezone and are each at varying stages in their lives (schooling/work/etc.). As mentioned earlier, I wanted to present the information in this blog post before we started development on 0.2.0 via an online chat but it never worked out. So, throughout the past 4 months we went back-and-forth quite a bit on how to work as a team using all of the project management tools on Github as well as on various R package development best practices, which could have been avoided or the process made a bit smoother. But now, with this blog post I’ve hopefully consolidated a lot of the various little things I’ve been trying to get across to my fellow package authors and maybe others might find this helpful as well.

While I’m still fairly early in my career, I do manage my organization’s suite of various R packages, scripts, and etc. as my main day job. Even if some of the things I wrote above aren’t what some people may consider best practice (but that is so subjective and can vary so much by company/industry/etc.) I thought I would contribute in my own way to the general knowledge out here on the interwebs. I do not see a whole lot of these “how to collaborate as a team using R & Git” type posts, especially when it comes to soccer analytics so I thought it was worth it that someone gets the ball rolling on this topic. In the first place, my involvement with {ggshakeR} was mainly a part of a general attempt to introduce or write about the details of software development when it comes to soccer analytics. It’ll be nice to see if this encourages other people in the soccer analytics space to talk more about this sort of thing.

There is still more to come as we aim for a CRAN release later this year. On top of new features I do want to delve further into the package internals and take a deeper look at our testing suite. I’ve been talking about wanting to try out {vdiffr} for a while now but more importantly I want to examine our tests more holistically based on what’s in the ‘Designing your test suite’ section of the newest edition of the R Packages book (not yet released). Otherwise for me, I want to dig into more of the package internals and see what could be made more efficient. For now, head to the {ggshakeR} website to learn more and start creating some great soccer visualizations!

To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Developer diary for {ggshakeR} 0.2.0 (a package for soccer analytics viz): Working smoothly as a team on GitHub for R package development!

rOpenSci News Digest, July 2022

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Dear rOpenSci friends, it’s time for our monthly news roundup!

You can read this post on our blog. Now let’s dive into the activity at and around rOpenSci!

rOpenSci HQ

rOpenSci Code of Conduct update

We are pleased to announce the release of a new version of our Code of Conduct. Based on the feedback of our community we added greater detail about acceptable and unacceptable behaviors in online settings and we have the first translation of the text to Spanish.

Committee changes

We thank Megan Carter for serving as independent community member until June 2022 and welcome back Kara Woo to serve in this role. Yanina Bellini Saibene joins the committee as the new rOpenSci Community Manager.

You can read all the details in our blog post.

Software 📦

New packages

The following three packages recently became a part of our software suite:

  • datefixR, developed by Nathan Constantine-Cooke: There are many different formats dates are commonly represented with: the order of day, month, or year can differ, different separators (“-”, “/”, or whitespace) can be used, months can be numerical, names, or abbreviations, and the year given as two digits or four. datefixR takes dates in all these different formats and converts them to R’s built-in date class. If datefixR cannot standardize a date, such as because it is too malformed, then the user is told which date cannot be standardized and the corresponding ID for the row. datefixR also allows the imputation of missing days and months with user-controlled behavior. It is available on CRAN. It has been reviewed by Kaique dos S. Alves and Al-Ahmadgaid B. Asaad.

  • epair, developed by G.L. Orozco-Mulfinger together with Madyline Lawrence, and Owais Gilani: Aids the user in making queries to the EPA API site found at https://aqs.epa.gov/aqsweb/documents/data_api. This package combines API calling methods from various web scraping packages with specific strings to retrieve data from the EPA API. It also contains easy-to-use loaded variables that help a user navigate services offered by the API and aid the user in determining the appropriate way to make an API call.

  • readODS, developed by Chung-hong Chan together with Gerrit-Jan Schutten, and Thomas J. Leeper: Reads ODS (OpenDocument Spreadsheet) files into R as data frames, and also supports writing data frames to ODS files. It is available on CRAN. It has been reviewed by Emma Mendelsohn and Adam H. Sparks.

Discover more packages, read more about Software Peer Review.

New versions

The following eight packages have had an update since the last newsletter: datefixR (v1.0.0), dittodb (v0.1.4), EDIutils (v1.0.1), jagstargets (1.0.3), lingtypology (v1.1.9), restez (v2.0.0), rtweet (v1.0.2), and tidyqpcr (v1.0.0).

Software Peer Review

There are fifteen recently closed and active submissions and two submissions on hold. Issues are at different stages:

Find out more about Software Peer Review and how to get involved.

On the blog

Tech Notes

Call for maintainers

We’re looking for a new maintainer, or a new maintainer team, for each of the following packages:

If you’re interested, please comment in the issues or email info@ropensci.org.

For more info, see

Package development corner

Some useful tips for R package developers. 👀

pak::pak()

Say you cloned a repository and are now getting ready to debug it. How do you make sure you have all its development dependencies installed? Simply run pak::pak()! Easy to remember and to type, and it works!

Update to CRAN NEWS.md parsing

If you maintain a changelog for your package, as you should, and have chosen the Markdown format (NEWS.md) to do so, you might need to pay attention to its formatting for optimal parsing by CRAN.

How to handle CRAN checks with _R_CHECK_DEPENDS_ONLY_=true

In some cases CRAN might run checks without installing the Suggested dependencies. How to ensure your vignettes still “work”, that is to say, that R CMD check will not produce any error or warning?

  • pre-build your vignettes;
  • make them pkgdown articles instead (no vignette, no vignette error!);
  • execute code conditionally based on the availability of packages, with knitr eval chunk option for instance.
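
For that last option, one common pattern (a sketch; "somePkg" is a placeholder for whichever Suggested package the vignette needs) is to compute a flag in the setup chunk and let it switch evaluation off for the rest of the document:

# in the vignette's setup chunk
can_eval <- requireNamespace("somePkg", quietly = TRUE)
knitr::opts_chunk$set(eval = can_eval)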

Last words

Thanks for reading! If you want to get involved with rOpenSci, check out our Contributing Guide that can help direct you to the right place, whether you want to make code contributions, non-code contributions, or contribute in other ways like sharing use cases.

If you haven’t subscribed to our newsletter yet, you can do so via a form. Until it’s time for our next newsletter, you can keep in touch with us via our website and Twitter account.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: rOpenSci News Digest, July 2022

Unhappy in its Own Way

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

“Happy families are all alike; every unhappy family is unhappy in its own way” runs the opening sentence of Anna Karenina. Hadley Wickham echoes the sentiment in a somewhat different context: “Tidy datasets are all alike, but every messy dataset is messy in its own way”. Data analysis is mostly data wrangling. That is, before you can do anything at all with your data, you need to get it into a format that your software can read. More often than not, the stage between having collected (or found) some data and being able to analyze it is frustrating, awkward, and filled with difficulties particular to the data you are working with. Things are encoded this way rather than that; every fifth line has an extra column; the data file contains subtotals and running headers; the tables are only available as PDFs; the source website has no API, and so on.

This makes teaching data wrangling a little awkward, too. On the one hand, R’s tidyr package has a bomb-disposal squad of functions designed to defuse these problems. But looking at them piecemeal might make any particular one seem highly specialized and hard to motivate in general. On the other hand, any dataset in need of wrangling will likely have all kinds of idiosyncratic problems, so a worked example may end up seeming far too specific.

In practice we bridge the two extremes by repeatedly showing tools in use, moving back and forth between some specific file problem and more general heuristics for diagnosing and solving problems; or between some specific function and the more general theory of data that it is trying to help you apply.

Here’s a real life case that I had fun with this week. I learned more about a general idea in the process of trying to solve a particular problem that had come up in the middle of trying to transform a PDF of more-or-less formatted tables into a tidy dataset.

Run-Length Encoding

The general idea I learned more about was the notion of run-length encoding. This is

a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.

Wikipedia’s example comes from the transmission of TV signals:

Consider a screen containing plain black text on a solid white background, over hypothetical scan line, it can be rendered as follows: 12W1B12W3B24W1B14W. This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc., and represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same.

Though primarily a way of compressing data, we can use a run-length encoding to keep track of how a sequence unfolds. For example, here’s a vector of TRUE and FALSE values:

x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, 
   FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

length(x)

## [1] 14

In R we can get a run-length encoding of this vector with the rle() function. It returns an object with two pieces, the lengths and the values.

rle(x)

## Run Length Encoding
##   lengths: int [1:10] 2 1 1 3 1 1 1 2 1 1
##   values : logi [1:10] TRUE FALSE TRUE FALSE TRUE FALSE ...

So here we have ten runs. TRUE twice, then FALSE once, then TRUE once, then FALSE three times, and so on. We can get the run lengths alone with

rle(x)$lengths

##  [1] 2 1 1 3 1 1 1 2 1 1

If we use sequence() to count the runs back out, the result will be equal to the length of the original vector:

sequence(rle(x)$lengths)

##  [1] 1 2 1 1 1 2 3 1 1 1 1 2 1 1

This makes a sequence that starts counting at 1 and resets to 1 whenever a new value is seen. Because the values are TRUE and FALSE only, once we know the start value we can reconstitute the sequence knowing just it and the runs. We can take advantage of the binary TRUE/FALSE data (implicitly 1/0 numerically) in a different way, too. With the original vector and the sequenced run, we can get different ways of counting the sequence depending on how we multiply it by the values of x, or negate those values before multiplying them.

library(tidyverse)  # tibble() here; stringr and purrr functions are used further below

tibble(x = x, 
   seq1 = sequence(rle(x)$lengths), 
   seq2 = sequence(rle(!x)$lengths) * x, 
   seq3 = sequence(rle(!x)$lengths) * !x)

## # A tibble: 14 × 4
##    x      seq1  seq2  seq3
##    <lgl> <int> <int> <int>
##  1 TRUE      1     1     0
##  2 TRUE      2     2     0
##  3 FALSE     1     0     1
##  4 TRUE      1     1     0
##  5 FALSE     1     0     1
##  6 FALSE     2     0     2
##  7 FALSE     3     0     3
##  8 TRUE      1     1     0
##  9 FALSE     1     0     1
## 10 TRUE      1     1     0
## 11 FALSE     1     0     1
## 12 FALSE     2     0     2
## 13 TRUE      1     1     0
## 14 FALSE     1     0     1

So for logical vectors we can make it so that our counter starts at either 1 or 0 for TRUE, and counts up from there. We can also have FALSE always coded as 0 (no matter how many times it’s repeated) but TRUE always coded as 1 and then counted up from there. Different indexes for the sequence can be more or less convenient depending on what we might need to keep track of in some vector.
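
(A quick check, added here, of the earlier claim that the values plus the run lengths are enough to reconstitute the original vector: base R's inverse.rle() undoes rle() exactly.)

identical(inverse.rle(rle(x)), x)

## [1] TRUE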

A PDF of almost-regular tables

That’s the general bit. Now for the idiosyncratic one. This week I was working with about a hundred pages of legal billing data. Every page of the PDF had information about the date, rate, and hours of service billed, with a description of the service that could be anywhere from one to ten or so lines long. I wanted to get this PDF into R to look at the records more systematically.

Now, in many cases, the most straightforward way to get a PDF table into some sort of plain-text format will not be to write any code at all. Instead you can use Excel, for instance, to read PDF tables a page at a time, or try Adobe’s PDF to Excel service if you’re a subscriber. These are useful tools, and if your job is straightforward you should consider starting with them.

We’re going to use R, because that’s how we roll around here. Even within R there are some decisions to make. For example, I might have used tabula, the Java-driven PDF-to-text engine, via the tabulizer package. But I didn’t have tabula installed, so instead to get the PDF file into R we’ll use pdftools, which depends on the poppler rendering library, which I did have installed. From there we use some standard tidyverse packages, notably dplyr and tidyr.

One thing to bear in mind is that, as often happens, the wrangling and cleaning sequence I’m about to show here developed fairly organically. That is, I was trying to get the hang of the data and this is how I proceeded. Some steps were driven partly by a desire to keep seeing the data in a way that was immediately comprehensible, just so that I could get a quick sense of whether I was making any big mistakes. If this were the sort of problem where doing it as efficiently as possible really mattered, the next step would be to investigate various “roads not taken” a bit more thoroughly. I guarantee you that there are better ways to do this. But for my purposes, this way was good enough.

Ultimately we’re going to end up with a table containing columns with these names:

varnames <- c("date", "timekeeper", "description", 
  "hours", "rate", "fees", "notes")

Getting there will take a little while. A PDF file is an unfriendly thing to try to extract plain-text data from. (Turning typeset tables or text back into plain-text is sometimes likened to trying to reconstitute a pig from a packet of sausages.) In this case, the data is all in there, which is to say that the numbers and the text are all there in a way we can get to, but when we import it into R we will lose all the tabular formatting. All we will have left is a stream of text with lines designated by the special \n character.

pgs <- pdftools::pdf_text("data/appendix_a.pdf")

pgs <- set_names(pgs, nm = 1:length(pgs)) 

length(pgs)

## [1] 92

In this next part, we’ll use some regular expressions inside stringr functions to strip some lines we don’t need (such as the “Page n of 92” header lines); trim the strings of excess whitespace at the start and end; and finally anonymize the data for the purposes of this post. I do this last bit by looking for every word and replacing it with a random word. The words come from stringr’s built-in vector words, which it uses for examples. It has about a thousand miscellaneous words. The regular expression \p{L} will match any single unicode character that is a letter. Writing \p{L}+ will make the expression keep matching until a non-letter is encountered (e.g. a punctuation mark or a space). Finally, because we are inside R, we need to double-escape the special \p code so that it is \\p, in order for the backslash to be preserved.
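
To make that concrete (a small aside of my own, not in the original post), here is what \\p{L}+ picks out of a throwaway string:

stringr::str_extract_all("Page 3 of 92, done.", "\\p{L}+")

## [[1]]
## [1] "Page" "of"   "done"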

eg <- pgs |> 
  str_remove_all("Page \\d{1,2} of 92") |> 
  str_trim() |>
  str_replace_all("\\p{L}+", function(x) sample(stringr::words, 1)) 

OK, so now we can look at what we have. It’s a vector with 92 elements:

length(eg)

## [1] 92

Each of these elements is a single long, messy string that contains all the characters from one page, including special ones like \n for newline. The newline characters are represented by the text \n rather than being interpreted as actual new lines. Here, for example, is page four:

eg[4]

## [1] "4/22/14   stickcontact for world insure fun real (.3); correct shut‐realise cat, apart 1.2635762.00\nexample but trust invest, over door head responsible introduce involve, figure\naddress cheap english company (.9)\n4/22/14   wage   at soon‐lose commit television break care'try lay (.2); admit hullo whole luck 0.4749299.60\nview strong involve\n4/27/14   charactershould between field double air look' clean‐week occasion 1.3635 825.50\n4/28/14   produceguess busy collect tree kill double' deep‐introduce lady 5.4635   3,429.00\n4/29/14   ideacall wee person common succeed catch' system‐big house 9.1635   5,778.50\n5/2/14available   product hit worse stage' seem true can; house begin company   0.4749 299.60 0.8 product seem call/ year hot\nnotice presume\n5/9/14nationseat east operate farm common next0.1635 63.50\n5/15/14   individualnever not jump refer document monday consider (.5); new original then door chance  0.4635254.00 0.9 dog indeed bother/ function bad\nintroduce drop talk (.4)\n5/15/14   word   little such tell paper knock lie town hundred with type bother   0.2749149.80 0.5 shop if council/ enough cake\n\n5/20/14   cup   world right like top (1); clean total family (1.5); help 3.1749   2,321.90 6.2 confer hot sing/ switch a\ncar lead cup (3.7)\n5/23/14   poundprogramme rule stupid   0.1635  63.50\n5/23/14   question   vote plus hospital what suit. top   0.1749  74.90\n6/9/14issue   bed system now discuss converse'age though eight round (hospital kitchen2.7749   2,022.30 5.5 study large resource/ system arm\nlabour country); receive figure'meet he speak afternoon, specific common stand\nwhat list aware; television further, join side bar tell\n\n6/10/14   servicealthough front view pair occasion  0.3635190.50\n6/11/14   blue   report day town too garden line call wear attend enough0.2749149.80\n6/16/14   amount   to hell list simple life leg encourage; as contract  4.2749   3,145.80 8.5 post nine friday/ wrong honest\ncommon brief night student (.4); date 2elect fair beauty couple suggest unless function 2always finish\nmay type trouble (1.7); matter britain oppose lunch america out;\nscotland consider many without represent house during 2nine colleague tape\n(2.1); the down lead clock opportunity water individual thank simple bit\nhot (2.5); clothe balance suggest never moment company (1.8)\n\n6/17/14   class   might bottom work afternoon yes hope  0.6749 449.40 1.3 true income air/ hall doubt\n6/22/14   worth   educate air quick save, health six slow hospital share 3.2749   2,396.80 6.5 happen million lady/ sister field\ndrop choose regard yet\n6/23/14   createscotland next amount love favour2.9635   1,841.50 5.8 bar week cross/ service turn"

It goes on for a long way off to the right there.

Breaking up the lines

The first thing we will do is take each element in the vector (i.e. the long, long string that is each page), and break it on every newline character. That is, we’ll split the string so that every time we encounter the sequence \n we will make it an actual new line. We use the str_split() function from stringr, with \\n as the split pattern:

r
eg <- str_split(eg, pattern = "\\n") 

Because we have split each page's string at every \n character, the structure of our eg data object has changed. It used to be a vector. But now it is a list:

r
class(eg)

## [1] "list"

length(eg)

## [1] 92

It’s a list of the same length the vector was, but now each list element contains a bunch of lines. Here is page 4 again:

r
eg[4]  

## [[1]]
##  [1] "4/22/14   stickcontact for world insure fun real (.3); correct shut‐realise cat, apart 1.2635762.00"
##  [2] "example but trust invest, over door head responsible introduce involve, figure" 
##  [3] "address cheap english company (.9)" 
##  [4] "4/22/14   wage   at soon‐lose commit television break care'try lay (.2); admit hullo whole luck 0.4749299.60"   
##  [5] "view strong involve"
##  [6] "4/27/14   charactershould between field double air look' clean‐week occasion 1.3635 825.50" 
##  [7] "4/28/14   produceguess busy collect tree kill double' deep‐introduce lady 5.4635   3,429.00"
##  [8] "4/29/14   ideacall wee person common succeed catch' system‐big house 9.1635   5,778.50" 
##  [9] "5/2/14available   product hit worse stage' seem true can; house begin company   0.4749 299.60 0.8 product seem call/ year hot"  
## [10] "notice presume" 
## [11] "5/9/14nationseat east operate farm common next0.1635 63.50" 
## [12] "5/15/14   individualnever not jump refer document monday consider (.5); new original then door chance  0.4635254.00 0.9 dog indeed bother/ function bad"
## [13] "introduce drop talk (.4)"   
## [14] "5/15/14   word   little such tell paper knock lie town hundred with type bother   0.2749149.80 0.5 shop if council/ enough cake"
## [15] ""   
## [16] "5/20/14   cup   world right like top (1); clean total family (1.5); help 3.1749   2,321.90 6.2 confer hot sing/ switch a"   
## [17] "car lead cup (3.7)" 
## [18] "5/23/14   poundprogramme rule stupid   0.1635  63.50"   
## [19] "5/23/14   question   vote plus hospital what suit. top   0.1749  74.90" 
## [20] "6/9/14issue   bed system now discuss converse'age though eight round (hospital kitchen2.7749   2,022.30 5.5 study large resource/ system arm"   
## [21] "labour country); receive figure'meet he speak afternoon, specific common stand" 
## [22] "what list aware; television further, join side bar tell"
## [23] ""   
## [24] "6/10/14   servicealthough front view pair occasion  0.3635190.50"   
## [25] "6/11/14   blue   report day town too garden line call wear attend enough0.2749149.80"   
## [26] "6/16/14   amount   to hell list simple life leg encourage; as contract  4.2749   3,145.80 8.5 post nine friday/ wrong honest"   
## [27] "common brief night student (.4); date 2elect fair beauty couple suggest unless function 2always finish" 
## [28] "may type trouble (1.7); matter britain oppose lunch america out;"   
## [29] "scotland consider many without represent house during 2nine colleague tape" 
## [30] "(2.1); the down lead clock opportunity water individual thank simple bit"   
## [31] "hot (2.5); clothe balance suggest never moment company (1.8)"   
## [32] ""   
## [33] "6/17/14   class   might bottom work afternoon yes hope  0.6749 449.40 1.3 true income air/ hall doubt"  
## [34] "6/22/14   worth   educate air quick save, health six slow hospital share 3.2749   2,396.80 6.5 happen million lady/ sister field"   
## [35] "drop choose regard yet" 
## [36] "6/23/14   createscotland next amount love favour2.9635   1,841.50 5.8 bar week cross/ service turn"

You can see from that double-bracketed [[1]] at the top (“the first element you asked for”) that we’re looking at a list. Inside each list element is a character vector with one entry for every line we got on that particular page.

We could keep working with this list just as a list. But to make things a little more convenient (really just to make it easier to look at its contents as we go), we’re going to convert the series of lines inside each list element to a tibble (i.e. a nice data table). To begin with, it will have just three columns: text, which contains each line of text on the page; line, which numbers the lines; and page, which records the page number. We’re doing the same thing to each element of the list, so this is a kind of iteration. But rather than write a loop, we apply or map a function or series of actions to each element of the list, which lets us keep everything in a pipeline. To generate the page variable we use imap(), which can access the name (or, as here, the position) of the current list element. In this case that’s just a number corresponding to the page.

If you’re not used to reading a pipeline like this, read |> as “and then”. Imagine starting with the data, eg, and then doing a series of things to it. Each line hands on the result to the next line, which takes it as input.

r
eg <- eg |>   
  map(~ tibble(text = unlist(.x), 
   line = 1:length(text))) |> 
  imap(~ .x |>  mutate(page = .y)) 

So what did that do? Here’s page 4 again:

r
eg[4]  

## [[1]]
## # A tibble: 36 × 3
##       text   						line page
##	<chr> 							<int> <int>
##  1 "4/22/14   stickcontact for world insure fun real (.3); corr… 1 4
##  2 "example but trust invest, over door head respon… 2 4
##  3 "address cheap english company (.9)"  3 4
##  4 "4/22/14   wage   at soon‐lose commit television break care'try … 4 4
##  5 "view strong involve" 5 4
##  6 "4/27/14   charactershould between field double air look' cl… 6 4
##  7 "4/28/14   produceguess busy collect tree kill double' deep‐… 7 4
##  8 "4/29/14   ideacall wee person common succeed catch' system‐… 8 4
##  9 "5/2/14available   product hit worse stage' seem true can; h… 9 4
## 10 "notice presume" 10 4
## # … with 26 more rows

As you can see, page 4 has 36 lines of text. All our data is in that text column. We’re beginning to see what it’s going to look like as a table with proper columns.

Let’s do a little preliminary cleaning and reorganization before we try splitting up text into separate columns. Again we are using map() because we have to do this to each page individually. Some of these lines (e.g. removing section headers) deal with things that I know are in the data but which I don’t need, because I looked at the PDF. At the end of this process we bind all the list elements together by row, so that they become a single big tibble.

r
eg <- eg |>   
  map(~ .x |> relocate(page, line) |> # move page and line to the left
filter(!str_detect(text, '^$')) |> # Remove any blank lines
filter(!str_detect(text, "^\\d{1,2}\\. ")) # Remove section headers
  ) |>  
  bind_rows() |> # Convert to single tibble
  tail(-2) # Strip very first two lines (from page 1), they're not needed

Now we have a single tibble, rather than a list of tibbles.

r
eg

## # A tibble: 3,158 × 3
## page  line text 
##<int> <int> <chr>
##  1 1 4 "6/25/13  photographmust: germany allow direct accou…
##  2 1 5 "6/26/13  settleoppose paragraph wish what feed defi…
##  3 1 6 "6/27/13  strongmight new use course full afternoon …
##  4 1 7 "6/28/13  experiencegrow well improve need bet  …
##  5 1 8 "7/1/13   middledanger bother okay bloke per…
##  6 1 9 "7/2/13   eggthis across also play toward (1.6); hun…
##  7 110 "7/2/13   morning   radio pair stuff effect name friday …
##  8 111 "   god south authority (.3)"
##  9 112 "7/3/13   carry   instead yet private hospital responsib…
## 10 113 "   bear once king fit link" 
## # … with 3,148 more rows

This means that if we want to see page 4 now, we filter on the page column:

r
eg |> 
  filter(page == 4)

## # A tibble: 33 × 3
##     page  line text
##    <int> <int> <chr>
##  1     4     1 "4/22/14   stickcontact for world insure fun real (.3); corr…
##  2     4     2 "example but trust invest, over door head respon…
##  3     4     3 "address cheap english company (.9)"
##  4     4     4 "4/22/14   wage   at soon‐lose commit television break care'try …
##  5     4     5 "view strong involve"
##  6     4     6 "4/27/14   charactershould between field double air look' cl…
##  7     4     7 "4/28/14   produceguess busy collect tree kill double' deep‐…
##  8     4     8 "4/29/14   ideacall wee person common succeed catch' system‐…
##  9     4     9 "5/2/14available   product hit worse stage' seem true can; h…
## 10     4    10 "notice presume"
## # … with 23 more rows

At this point we can see that we are in for a little trouble. The shape of text suggests the data splits up into columns in this order: date, timekeeper, description, hours, rate, fees, notes. (They may be out of view here, but the ends of the lines of text have the numbers corresponding to fees and so on.) Unfortunately, we can also see, e.g. on lines 2 and 3 of page 4, that some rows are blank except for text and do not start with anything that looks like a date. What has happened is that the “description” column in the original PDF document often has descriptions that run to several lines rather than just one. When that happens, either the original data entry person or (more likely) the PDF-generating software has inserted newlines so that the full description will display on the page in its table cell. That’s annoying, because the effect, after we have split our file on \n characters, is that some records have part of their “description” content moved to a new line, or series of lines. If the description had been the last column in the original PDF then rejoining these description-fragments to their correct row would have been easier. But because it’s in the middle of the table, things are more difficult. The number of added description-fragments varies irregularly, too, anywhere from one to eight or nine additional lines.

As noted earlier, if this were the sort of problem where I knew I’d be encountering data in just this form often (a new PDF of a hundred-odd pages of billing data coming in every week, or something) then this would be the ideal spot to stop and ask, “How can I avoid getting myself into this situation in the first place?” I’d go back to the initial reading-in stage and try to see if there was something I could do to immediately distinguish the description-fragment lines from “proper” lines. I did think about it briefly, but no obvious (to me) solution immediately presented itself. So, instead, I’m just going to keep going and solve the problem as it stands.

As a first move towards fixing this problem, we can create a new column that flags whether a line of text begins with something that looks like a date. I know from the original records that every distinct billing entry does in fact begin with a date, so this will be helpful.

r
eg <- eg |> 
  mutate(has_date = str_detect(text, "^\\d{1,2}/\\d{1,2}/\\d{1,2}")) 

eg

## # A tibble: 3,158 × 4
## page  line text has_date
##<int> <int> <chr><lgl>   
##  1 1 4 "6/25/13  photographmust: germany allow dir… TRUE
##  2 1 5 "6/26/13  settleoppose paragraph wish what … TRUE
##  3 1 6 "6/27/13  strongmight new use course full a… TRUE
##  4 1 7 "6/28/13  experiencegrow well improve need … TRUE
##  5 1 8 "7/1/13   middledanger bother okay bloke pe… TRUE
##  6 1 9 "7/2/13   eggthis across also play toward (… TRUE
##  7 110 "7/2/13   morning   radio pair stuff effect nam… TRUE
##  8 111 "   god south authority (.3)"FALSE   
##  9 112 "7/3/13   carry   instead yet private hospital … TRUE
## 10 113 "   bear once king fit link" FALSE   
## # … with 3,148 more rows

Separating out the columns

So, with the foreknowledge that this is not going to work properly, we trim the front and end of each line of text. Then we separate out all the columns into what ought to be the correct series of variable names (which we wrote down above as varnames). We tell the separate() function to split wherever it encounters two or more spaces in a row, and to name the new columns. (I know, because I checked, that there aren't any sentences where a period is followed by two spaces. Two-spacers are moral monsters, by the way.) If a line yields fewer pieces than we have column names, the missing columns are filled with NA on the right; any extra material gets merged into the last column. In this step we also add an explicit row id, just to help us keep track of things.

r
eg <- eg |> 
  mutate(text = str_trim(text)) |> 
  separate(text, sep = "\\s{2,}", into = varnames, 
   fill = "right",
   extra = "merge") |> 
  rowid_to_column()

Now what we have is finally starting to look more like a dataset:

r
eg |> 
  filter(page == 4)

## # A tibble: 33 × 11
##rowid  page  line date timekeeper description hours rate  fees  notes
##<int> <int> <int> <chr><chr>  <chr>   <chr> <chr> <chr> <chr>
##  1   100 4 1 4/22/14  stick  contact fo… 1.2   635   762.… <NA> 
##  2   101 4 2 example but… <NA>   <NA><NA>  <NA>  <NA>  <NA> 
##  3   102 4 3 address che… <NA>   <NA><NA>  <NA>  <NA>  <NA> 
##  4   103 4 4 4/22/14  wage   at soon‐lo… 0.4   749   299.… <NA> 
##  5   104 4 5 view strong… <NA>   <NA><NA>  <NA>  <NA>  <NA> 
##  6   105 4 6 4/27/14  character  should bet… 1.3   635   825.… <NA> 
##  7   106 4 7 4/28/14  produceguess busy… 5.4   635   3,42… <NA> 
##  8   107 4 8 4/29/14  idea   call wee p… 9.1   635   5,77… <NA> 
##  9   108 4 9 5/2/14   available  product hi… 0.4   749   299.… <NA> 
## 10   109 410 notice pres… <NA>   <NA><NA>  <NA>  <NA>  <NA> 
## # … with 23 more rows, and 1 more variable: has_date <lgl>

You can see that, as expected, things did not go exactly as we would have liked, thanks to the line fragments from the description column. Each time we hit one of those we get a new line where the description content ends up in the date column and all the other columns are empty, hence the <NA> missing value designation. We need to fix this.

Dealing with the description-fragments

Our goal is to get each description fragment re-attached to the description field in the correct row, in the right order. How to do this? The answer is always "Well, there's more than one way to do it." But here's what I did. I know that if I can reliably group lines with an identifier that says "These lines are all really from the same billing record", I can solve my problem. So the problem becomes how to do that. To begin with, I know that every row that starts with a date (where has_date is TRUE) is the first line of a valid record. Many records have only one line. But many others are followed by some number of description-fragments, which continue for as long as they do. Then we move on to the next record. So we need to distinguish the boundaries. This is what led me to mess around with run-length encoding. In the end, I didn't actually need it to solve the problem, but I am going to keep it here anyway. First I'll show the simpler way to get the group id we need. I'll pick out just a few columns to make things easier to read. To get a groupid, we first assign the rowid to every row that has a date:

r
eg |> 
  select(page, rowid, has_date, description) |> 
  mutate(groupid = ifelse(has_date == TRUE, rowid, NA), 
   .after = has_date) |> 
  filter(page == 4)

## # A tibble: 33 × 5
## page rowid has_date groupid description 
##<int> <int> <lgl>  <int> <chr>   
##  1 4   100 TRUE 100 contact for world insure fun real (.3); correct…
##  2 4   101 FALSE NA <NA>
##  3 4   102 FALSE NA <NA>
##  4 4   103 TRUE 103 at soon‐lose commit television break care'try l…
##  5 4   104 FALSE NA <NA>
##  6 4   105 TRUE 105 should between field double air look' clean‐wee…
##  7 4   106 TRUE 106 guess busy collect tree kill double' deep‐intro…
##  8 4   107 TRUE 107 call wee person common succeed catch' system‐bi…
##  9 4   108 TRUE 108 product hit worse stage' seem true can; house b…
## 10 4   109 FALSE NA <NA>
## # … with 23 more rows

Now, for every NA value of groupid we encounter, we use fill() to copy down the nearest previous valid groupid. And those are our groups.

r
eg |> 
  select(page, rowid, has_date, description) |> 
  mutate(groupid = ifelse(has_date == TRUE, rowid, NA), 
   .after = has_date) |> 
  fill(groupid) |> 
  filter(page == 4)

## # A tibble: 33 × 5
## page rowid has_date groupid description 
##<int> <int> <lgl>  <int> <chr>   
##  1 4   100 TRUE 100 contact for world insure fun real (.3); correct…
##  2 4   101 FALSE100 <NA>
##  3 4   102 FALSE100 <NA>
##  4 4   103 TRUE 103 at soon‐lose commit television break care'try l…
##  5 4   104 FALSE103 <NA>
##  6 4   105 TRUE 105 should between field double air look' clean‐wee…
##  7 4   106 TRUE 106 guess busy collect tree kill double' deep‐intro…
##  8 4   107 TRUE 107 call wee person common succeed catch' system‐bi…
##  9 4   108 TRUE 108 product hit worse stage' seem true can; house b…
## 10 4   109 FALSE108 <NA>
## # … with 23 more rows

That’s all we need to proceed. But because I experimented with it first, here are two other variables that might be useful under other circumstances. The first is the run-length sequence, written in a way where every row that’s the start of a record has a value of zero, and every description-fragment has a counter starting from one. The second, calculated from that, is a variable that flags whether a line is a single-line record or part of a multi-line record. We get this by flagging it as “Multi” if either the current row’s run-length counter is greater than zero or the run-length counter of the row below it is greater than zero.

r
eg |> 
  mutate(groupid = ifelse(has_date == TRUE, rowid, NA),
 rlecount = sequence(rle(!has_date)$lengths) * !has_date, 
 one_or_multi = if_else(rlecount > 0 | lead(rlecount > 0), "Multi", "One")) |> 
  fill(groupid) |> 
  relocate(has_date, one_or_multi, 
   rlecount, groupid, 
    .after = line) |> 
  filter(page == 4)

## # A tibble: 33 × 14
##rowid  page  line has_date one_or_multi rlecount groupid date  timekeeper
##<int> <int> <int> <lgl><chr>   <int>   <int> <chr> <chr> 
##  1   100 4 1 TRUE Multi   0 100 4/22/14   stick 
##  2   101 4 2 FALSEMulti   1 100 example … <NA>  
##  3   102 4 3 FALSEMulti   2 100 address … <NA>  
##  4   103 4 4 TRUE Multi   0 103 4/22/14   wage  
##  5   104 4 5 FALSEMulti   1 103 view str… <NA>  
##  6   105 4 6 TRUE One 0 105 4/27/14   character 
##  7   106 4 7 TRUE One 0 106 4/28/14   produce   
##  8   107 4 8 TRUE One 0 107 4/29/14   idea  
##  9   108 4 9 TRUE Multi   0 108 5/2/14available 
## 10   109 410 FALSEMulti   1 108 notice p… <NA>  
## # … with 23 more rows, and 5 more variables: description <chr>, hours <chr>,
## #   rate <chr>, fees <chr>, notes <chr>
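
If the sequence(rle(...)) trick looks opaque, here is a quick toy illustration of my own (not from the original post), using a made-up logical vector in place of has_date:

r
# TRUE marks a line that starts a record; FALSE marks a description-fragment.
has_date <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)

rle(!has_date)$lengths
## [1] 1 2 1 1 1

sequence(rle(!has_date)$lengths)
## [1] 1 1 2 1 1 1

# Multiplying by !has_date zeroes out the rows that start a record, leaving a
# counter that restarts at 1 for each run of fragments:
sequence(rle(!has_date)$lengths) * !has_date
## [1] 0 1 2 0 1 0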

Let’s put all that into an object called eg_gid for the whole data set.

r
eg_gid <- eg |> 
  mutate(groupid = ifelse(has_date == TRUE, rowid, NA),
 rlecount = sequence(rle(!has_date)$lengths) * !has_date, 
 one_or_multi = if_else(rlecount > 0 | lead(rlecount > 0), "Multi", "One")) |> 
  fill(groupid) |> 
  relocate(has_date, one_or_multi, 
   rlecount, groupid, 
   .after = line) 

Next, we extract all the "true" first rows. We can do this by filtering on has_date:

r
eg_first <- eg_gid |> 
  filter(has_date == TRUE) 

… or, for the same effect, group by groupid and slice out the first row of each group:

r
eg_first <- eg_gid |> 
  group_by(groupid) |> 
  slice_head(n = 1) 

eg_first |> 
  filter(page == 4)

## # A tibble: 19 × 14
## # Groups:   groupid [19]
##rowid  page  line has_date one_or_multi rlecount groupid datetimekeeper
##<int> <int> <int> <lgl><chr>   <int>   <int> <chr>   <chr> 
##  1   100 4 1 TRUE Multi   0 100 4/22/14 stick 
##  2   103 4 4 TRUE Multi   0 103 4/22/14 wage  
##  3   105 4 6 TRUE One 0 105 4/27/14 character 
##  4   106 4 7 TRUE One 0 106 4/28/14 produce   
##  5   107 4 8 TRUE One 0 107 4/29/14 idea  
##  6   108 4 9 TRUE Multi   0 108 5/2/14  available 
##  7   110 411 TRUE One 0 110 5/9/14  nation
##  8   111 412 TRUE Multi   0 111 5/15/14 individual
##  9   113 414 TRUE One 0 113 5/15/14 word  
## 10   114 416 TRUE Multi   0 114 5/20/14 cup   
## 11   116 418 TRUE One 0 116 5/23/14 pound 
## 12   117 419 TRUE One 0 117 5/23/14 question  
## 13   118 420 TRUE Multi   0 118 6/9/14  issue 
## 14   121 424 TRUE One 0 121 6/10/14 service   
## 15   122 425 TRUE One 0 122 6/11/14 blue  
## 16   123 426 TRUE Multi   0 123 6/16/14 amount
## 17   129 433 TRUE One 0 129 6/17/14 class 
## 18   130 434 TRUE Multi   0 130 6/22/14 worth 
## 19   132 436 TRUE One 0 132 6/23/14 create
## # … with 5 more variables: description <chr>, hours <chr>, rate <chr>,
## #   fees <chr>, notes <chr>

Bear in mind that eg_first contains the first rows of all the records in the data. It includes both the records that are just one row in length and the first line only of any record that also has description-fragments on subsequent lines.

With this table in hand, we can take advantage of dplyr’s core competence of summarizing tables. We start with all the data, but this time we use filter to get all those rows where rlecount is not zero. Then we group those rows by groupid and summarize their date field (which is where all the description-fragments are, remember). How do we summarize this text? In this case, by creating a new column called extra_text and pasting all the date text from any rows within the group into a single string:

r
eg_gid |> 
  filter(rlecount != 0) |> 
  group_by(groupid) |> 
  summarize(extra_text = paste0(date, collapse = ""))

## # A tibble: 891 × 2
##groupid extra_text   
##  <int> <chr>
##  1   7 god south authority (.3) 
##  2   9 bear once king fit link  
##  3  20 visit base club  
##  4  22 (.3); beat occasion place send (.3)  
##  5  24 month seven ring. apparent father raise  
##  6  28 write show stuff shall; identify issue account jump please; figureim…
##  7  31 each major plan welcome (.7); situate scotland region real trust ove…
##  8  35 strike two up; god morning even amount require extra; telephone  
##  9  44 play; nice tax situate stop rule chairman sing floor call arrange; t…
## 10  46 lady close plus; sudden health music collect gas (1.1); sign departm…
## # … with 881 more rows

Now we can get all the description-fragments down to one row per record, and have them identified by their groupid. This means we can join them to eg_first, the table of all record first-rows. So we go back to eg_gid, do our summarizing, then right-join this result to the true first-row table, eg_first, joining by groupid. Then we make a new column called full_description and paste the description and extra_text fields together.

r
eg_clean <- eg_gid |> 
  filter(rlecount != 0) |> 
  group_by(groupid) |> 
  summarize(extra_text = paste0(date, collapse = "")) |> 
  right_join(eg_first, by = "groupid") |> 
  mutate(full_description = paste(description, extra_text)) |> 
  select(rowid, groupid, page:timekeeper, hours:fees, 
         full_description) 

We’re nearly finished. How is page 4 doing?

r
eg_clean |> 
  select(page, line, groupid, one_or_multi, full_description) |> 
  filter(page == 4)

## # A tibble: 19 × 5
## page  line groupid one_or_multi full_description
##<int> <int>   <int> <chr><chr>   
##  1 4 1 100 Multicontact for world insure fun real (.3); cor…
##  2 4 4 103 Multiat soon‐lose commit television break care't…
##  3 4 9 108 Multiproduct hit worse stage' seem true can; hou…
##  4 412 111 Multinever not jump refer document monday consid…
##  5 416 114 Multiworld right like top (1); clean total famil…
##  6 420 118 Multibed system now discuss converse'age though …
##  7 426 123 Multito hell list simple life leg encourage; as …
##  8 434 130 Multieducate air quick save, health six slow hos…
##  9 4 6 105 One  should between field double air look' clean…
## 10 4 7 106 One  guess busy collect tree kill double' deep‐i…
## 11 4 8 107 One  call wee person common succeed catch' syste…
## 12 411 110 One  seat east operate farm common next NA   
## 13 414 113 One  little such tell paper knock lie town hundr…
## 14 418 116 One  programme rule stupid NA
## 15 419 117 One  vote plus hospital what suit. top NA
## 16 424 121 One  although front view pair occasion NA
## 17 425 122 One  report day town too garden line call wear a…
## 18 433 129 One  might bottom work afternoon yes hope NA 
## 19 436 132 One  scotland next amount love favour NA

Looking good. You can see that, after the join, all the Multi rows come in one block, followed by all the One rows. Now all that remains is for us to clean up the remaining columns. For example, some entries in the rate and fees columns have additional notes after their numbers, so we separate those out on the space character. We strip unnecessary commas from the numeric columns. And then we fix the column types, turning hours, rate, and fees from character to numeric and making the date column a proper date.

r
eg_clean <- eg_clean |> 
  separate(fees, sep = "\\s{1}", 
   into = c("fees", "fee_note"), 
   extra = "merge", fill = "right") |> 
  separate(rate, sep = "\\s{1}", 
   into = c("rate", "rate_note"), 
   extra = "merge", 
   fill = "right") |> 
  mutate(fees = str_remove_all(fees, ","), # strip commas from numbers
    rate = str_remove_all(rate, ","),
    fees = as.numeric(fees),
    rate = as.numeric(rate),
    hours = as.numeric(hours), 
    date = lubridate::mdy(date), 
    full_description = str_remove(full_description, " NA$")) |> 
  arrange(rowid)

And we’re done:

r
eg_clean |> 
  filter(page == 4) |> 
  select(date, hours, rate, full_description)

## # A tibble: 19 × 4
##date   hours  rate full_description  
##<date> <dbl> <dbl> <chr> 
##  1 2014-04-22   1.2   635 contact for world insure fun real (.3); correct shut‐…
##  2 2014-04-22   0.4   749 at soon‐lose commit television break care'try lay (.2…
##  3 2014-04-27   1.3   635 should between field double air look' clean‐week occa…
##  4 2014-04-28   5.4   635 guess busy collect tree kill double' deep‐introduce l…
##  5 2014-04-29   9.1   635 call wee person common succeed catch' system‐big house
##  6 2014-05-02   0.4   749 product hit worse stage' seem true can; house begin c…
##  7 2014-05-09   0.1   635 seat east operate farm common next
##  8 2014-05-15   0.4   635 never not jump refer document monday consider (.5); n…
##  9 2014-05-15   0.2   749 little such tell paper knock lie town hundred with ty…
## 10 2014-05-20   3.1   749 world right like top (1); clean total family (1.5); h…
## 11 2014-05-23   0.1   635 programme rule stupid 
## 12 2014-05-23   0.1   749 vote plus hospital what suit. top 
## 13 2014-06-09   2.7   749 bed system now discuss converse'age though eight roun…
## 14 2014-06-10   0.3   635 although front view pair occasion 
## 15 2014-06-11   0.2   749 report day town too garden line call wear attend enou…
## 16 2014-06-16   4.2   749 to hell list simple life leg encourage; as contract c…
## 17 2014-06-17   0.6   749 might bottom work afternoon yes hope  
## 18 2014-06-22   3.2   749 educate air quick save, health six slow hospital shar…
## 19 2014-06-23   2.9   635 scotland next amount love favour

A clean and tidy dataset ready to be investigated properly. Unhappy in its own way, but now at least in a form where we can ask it some questions.

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

Continue reading: Unhappy in its Own Way

A Kaggle Dataset of R Package History for rstudio::conf(2022)

[This article was first published on R on head spin - the Heads or Tails blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

It’s summer, and the long-awaited Rstudio conference for 2022 is only days away. Next week, a large number of R aficionados will gather in Washington DC for the first time in person since the beginning of the pandemic. A pandemic, mind you, that is far from over. But Covid precautions are in place, and I trust the R community more than most to be responsible and thoughtful. With masks, social distance, and outdoor events: I’m excited to meet new people and see again many familiar faces from my first Rstudio conference in 2020.

To create even more excitement, this time I’m giving a talk about the Kaggle and R communities, and all the good things that can happen when those worlds interact. In addition to this talk, which aims to introduce an R audience to the opportunities of Kaggle, I have also prepared a new Kaggle dataset for this audience to get started on the platform. This post is about that dataset: comprehensive data on all R packages currently on CRAN, and on their full release history.

Let’s get started with the packages we’ll need, including those that I found instrumental for querying info from CRAN: the powerful tools package and the more specialised packageRank package. Together, the functions in those packages made my task much easier than expected.

libs <- c('dplyr', 'tibble',       # wrangling
          'tidyr', 'stringr',      # wrangling
          'readr',                 # read files
          'tools', 'packageRank',  # CRAN package info
          'ggplot2', 'ggthemes',   # plots
          'gt', 'lubridate')       # tables & time
invisible(lapply(libs, library, character.only = TRUE))

Complete list of CRAN packages

Initially, my thought was to scrape the package information directly from CRAN, the Comprehensive R Archive Network. It is the central repository for R packages. CRAN describes itself as “a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.” If you’re installing an R package in the standard way then it is provided by one of the CRAN mirrors. (The install.packages function takes a repos argument that you can set to any of the mirrors or to the central “http://cran.r-project.org”.)
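
As a quick illustration of my own (not from the original post), pointing repos at the central address looks like this; the package name here is just an arbitrary example:

install.packages("packageRank", repos = "http://cran.r-project.org")  # central CRAN rather than a named mirror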

CRAN provides full lists of all available packages by name and by date of publication. The latter page in particular has a nice html table with all package names, titles, and dates. This would be easy to scrape; a rough sketch of how that might look follows below. If you want to get an intro to web scraping with the rvest package then check out a previous blogpost of mine.
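
Here is that rough sketch (mine, not the post author’s). The URL is my assumption of where the by-date listing lives; the snippet simply reads the page and pulls its html table into a data frame:

library(rvest)

# Assumed location of the "available packages by date" listing page.
pkgs_by_date <- read_html("https://cran.r-project.org/web/packages/available_packages_by_date.html") |>
  html_element("table") |>
  html_table()

head(pkgs_by_date)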

However, the R community had once again made my task much easier. As I was pondering a respectful and responsible scraping strategy, I came across this post on Scraping Responsibly with R by Steven Mortimer, who was working on scraping CRAN downloads. In it, he quoted a tweet by Maëlle Salmon recommending to use tools::CRAN_package_db instead as a gentler approach.

This tool is indeed very fast and powerful. It provides a lot of columns. For the sake of a simple dataset, I’m only selecting a subset of those features here. Feel free to explore the full range.

df <- tools::CRAN_package_db() %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  select(package, version, depends, imports, license, needs_compilation, author,
         bug_reports, url, date_published = published, description, title) %>%
  mutate(needs_compilation = needs_compilation == "yes")

The columns I picked include names, versions, dates, and information about dependencies, authors, descriptions, and web links. Here are the first 50 rows:

df %>%
  head(50) %>%
  gt() %>%
  tab_header(
    title = md("**A full list of R packages on CRAN derived via tools::CRAN_package_db**")
  ) %>%
  opt_row_striping() %>%
  tab_options(container.height = px(600))
A full list of R packages on CRAN derived via tools::CRAN_package_db
package | version | depends | imports | license | needs_compilation | author | bug_reports | url | date_published | description | title
A31.0.0R (>= 2.15.0), xtable, pbapplyNAGPL (>= 2)FALSEScott Fortmann-RoeNANA2015-08-16Supplies tools for tabulating and analyzing the results of predictive models. The methods employed are applicable to virtually any predictive model and make comparisons between different methodologies straightforward.Accurate, Adaptable, and Accessible Error Metrics for PredictiveModels
AATtools0.0.1R (>= 3.6.0)magrittr, dplyr, doParallel, foreachGPL-3FALSESercan Kahveci [aut, cre]https://github.com/Spiritspeak/AATtools/issuesNA2020-06-14Compute approach bias scores using different scoring algorithms, compute bootstrapped and exact split-half reliability estimates, and compute confidence intervals for individual participant scores.Reliability and Scoring Routines for the Approach-Avoidance Task
ABACUS1.0.0R (>= 3.1.0)ggplot2 (>= 3.1.0), shiny (>= 1.3.1),GPL-3FALSEMintu Nath [aut, cre]NAhttps://shiny.abdn.ac.uk/Stats/apps/2019-09-20A set of Shiny apps for effective communication and understanding in statistics. The current version includes properties of normal distribution, properties of sampling distribution, one-sample z and t tests, two samples independent (unpaired) t test and analysis of variance.Apps Based Activities for Communicating and UnderstandingStatistics
abbreviate0.1NANAGPL-3FALSESigbert Klinke [aut, cre]NAhttps://github.com/sigbertklinke/abbreviate (development version)2021-12-14Strings are abbreviated to at least "minlength" characters, such that they remain unique (if they were). The abbreviations should be recognisable.Readable String Abbreviation
abbyyR0.5.5R (>= 3.2.0)httr, XML, curl, readr, plyr, progressMIT + file LICENSEFALSEGaurav Sood [aut, cre]http://github.com/soodoku/abbyyR/issueshttp://github.com/soodoku/abbyyR2019-06-25Get text from images of text using Abbyy Cloud Optical Character Recognition (OCR) API. Easily OCR images, barcodes, forms, documents with machine readable zones, e.g. passports. Get the results in a variety of formats including plain text and XML. To learn more about the Abbyy OCR API, see .Access to Abbyy Optical Character Recognition (OCR) API
abc2.2.1R (>= 2.10), abc.data, nnet, quantreg, MASS, locfitNAGPL (>= 3)FALSECsillery Katalin [aut], Lemaire Louisiane [aut], Francois Olivier [aut], Blum Michael [aut, cre]NANA2022-05-19Implements several ABC algorithms for performing parameter estimation, model selection, and goodness-of-fit. Cross-validation tools are also available for measuring the accuracy of ABC estimates, and to calculate the misclassification probabilities of different models.Tools for Approximate Bayesian Computation (ABC)
abc.data1.0R (>= 2.10)NAGPL (>= 3)FALSECsillery Katalin [aut], Lemaire Louisiane [aut], Francois Olivier [aut], Blum Michael [aut, cre]NANA2015-05-05Contains data which are used by functions of the 'abc' package.Data Only: Tools for Approximate Bayesian Computation (ABC)
ABC.RAP0.9.0R (>= 3.1.0)graphics, stats, utilsGPL-3FALSEAbdulmonem Alsaleh [cre, aut], Robert Weeks [aut], Ian Morison [aut], RStudio [ctb]NANA2016-10-20It aims to identify candidate genes that are “differentially methylated” between cases and controls. It applies Student’s t-test and delta beta analysis to identify candidate genes containing multiple “CpG sites”.Array Based CpG Region Analysis Pipeline
abcADM1.0NARcpp (>= 1.0.1)GPL-3TRUEZongjun Liu [aut], Chun-Hao Yang [aut], John Burkardt [ctb], Samuel W.K. Wong [aut, cre]NANA2019-11-13Estimate parameters of accumulated damage (load duration) models based on failure time data under a Bayesian framework, using Approximate Bayesian Computation (ABC). Assess long-term reliability under stochastic load profiles. Yang, Zidek, and Wong (2019) .Fit Accumulated Damage Models and Estimate Reliability using ABC
ABCanalysis1.2.1R (>= 2.10)plotrixGPL-3FALSEMichael Thrun, Jorn Lotsch, Alfred UltschNAhttps://www.uni-marburg.de/fb12/datenbionik/software-en2017-03-13For a given data set, the package provides a novel method of computing precise limits to acquire subsets which are easily interpreted. Closely related to the Lorenz curve, the ABC curve visualizes the data by graphically representing the cumulative distribution function. Based on an ABC analysis the algorithm calculates, with the help of the ABC curve, the optimal limits by exploiting the mathematical properties pertaining to distribution of analyzed items. The data containing positive values is divided into three disjoint subsets A, B and C, with subset A comprising very profitable values, i.e. largest data values ("the important few"), subset B comprising values where the yield equals to the effort required to obtain it, and the subset C comprising of non-profitable values, i.e., the smallest data sets ("the trivial many"). Package is based on "Computed ABC Analysis for rational Selection of most informative Variables in multivariate Data", PLoS One. Ultsch. A., Lotsch J. (2015) .Computed ABC Analysis
abclass0.3.0R (>= 3.5.0)Rcpp, statsGPL (>= 3)TRUEWenjie Wang [aut, cre] (), Eli Lilly and Company [cph]https://github.com/wenjie2wang/abclass/issueshttps://wwenjie.org/abclass,https://github.com/wenjie2wang/abclass2022-05-28Multi-category angle-based large-margin classifiers. See Zhang and Liu (2014) for details.Angle-Based Large-Margin Classifiers
ABCoptim0.15.0NARcpp, graphics, stats, utilsMIT + file LICENSETRUEGeorge Vega Yon [aut, cre], Enyelbert Muñoz [ctb]NAhttp://github.com/gvegayon/ABCoptim, http://mf.erciyes.edu.tr/abc/2017-11-06An implementation of Karaboga (2005) Artificial Bee Colony Optimization algorithm . This (working) version is a Work-in-progress, which is why it has been implemented using pure R code. This was developed upon the basic version programmed in C and distributed at the algorithm's official website.Implementation of Artificial Bee Colony (ABC) Optimization
ABCp21.2MASSNAGPL-2FALSEM. Catherine Duryea, Andrew D. Kern, Robert M. Cox, and Ryan Calsbeek NANA2016-02-04Tests the goodness of fit of a distribution of offspring to the Normal, Poisson, and Gamma distribution and estimates the proportional paternity of the second male (P2) based on the best fit distribution.Approximate Bayesian Computational Model for Estimating P2
abcrlda1.0.3NAstatsGPL-3FALSEDmitriy Fedorov [aut, cre], Amin Zollanvari [aut], Aresh Dadlani [aut], Berdakh Abibullaev [aut]NAhttps://ieeexplore.ieee.org/document/8720003/,https://dx.doi.org/10.1109/LSP.2019.29184852020-05-28Offers methods to perform asymptotically bias-corrected regularized linear discriminant analysis (ABC_RLDA) for cost-sensitive binary classification. The bias-correction is an estimate of the bias term added to regularized discriminant analysis (RLDA) that minimizes the overall risk. The default magnitude of misclassification costs are equal and set to 0.5; however, the package also offers the options to set them to some predetermined values or, alternatively, take them as hyperparameters to tune. A. Zollanvari, M. Abdirash, A. Dadlani and B. Abibullaev (2019) .Asymptotically Bias-Corrected Regularized Linear DiscriminantAnalysis
abctools1.1.3R (>= 2.10), abc, abind, parallel, plyr, HmiscNAGPL (>= 2)TRUEMatt Nunes [aut, cre], Dennis Prangle [aut], Guilhereme Rodrigues [ctb]https://github.com/dennisprangle/abctools/issueshttps://github.com/dennisprangle/abctools2018-07-17Tools for approximate Bayesian computation including summary statistic selection and assessing coverage.Tools for ABC Analyses
abd0.2-8R (>= 3.0), nlme, lattice, grid, mosaicNAGPL-2FALSEKevin M. Middleton , Randall Pruim NANA2015-07-03The abd package contains data sets and sample code for The Analysis of Biological Data by Michael Whitlock and Dolph Schluter (2009; Roberts & Company Publishers).The Analysis of Biological Data
abdiv0.2.0NAapeMIT + file LICENSEFALSEKyle Bittinger [aut, cre]https://github.com/kylebittinger/abdiv/issueshttps://github.com/kylebittinger/abdiv2020-01-20A collection of measures for measuring ecological diversity. Ecological diversity comes in two flavors: alpha diversity measures the diversity within a single site or sample, and beta diversity measures the diversity across two sites or samples. This package overlaps considerably with other R packages such as 'vegan', 'gUniFrac', 'betapart', and 'fossil'. We also include a wide range of functions that are implemented in software outside the R ecosystem, such as 'scipy', 'Mothur', and 'scikit-bio'. The implementations here are designed to be basic and clear to the reader.Alpha and Beta Diversity Measures
abe3.0.1NANAGPL (>= 2)FALSERok Blagus [aut, cre], Sladana Babic [ctb]NANA2017-10-30Performs augmented backward elimination and checks the stability of the obtained model. Augmented backward elimination combines significance or information based criteria with the change in estimate to either select the optimal model for prediction purposes or to serve as a tool to obtain a practically sound, highly interpretable model. More details can be found in Dunkler et al. (2014) . Augmented Backward Elimination
abess0.4.5R (>= 3.1.0)Rcpp, MASS, methods, MatrixGPL (>= 3) | file LICENSETRUEJin Zhu [aut, cre] (), Liyuan Hu [aut], Junhao Huang [aut], Kangkang Jiang [aut], Yanhang Zhang [aut], Zezhi Wang [aut], Borui Tang [aut], Shiyun Lin [aut], Junxian Zhu [aut], Canhong Wen [aut], Heping Zhang [aut] (), Xueqin Wang [aut] (), spectra contributors [cph] (Spectra implementation)https://github.com/abess-team/abess/issueshttps://github.com/abess-team/abess,https://abess-team.github.io/abess/,https://abess.readthedocs.io2022-03-22Extremely efficient toolkit for solving the best subset selection problem . This package is its R interface. The package implements and generalizes algorithms designed in that exploits a novel sequencing-and-splicing technique to guarantee exact support recovery and globally optimal solution in polynomial times for linear model. It also supports best subset selection for logistic regression, Poisson regression, Cox proportional hazard model, Gamma regression, multiple-response regression, multinomial logistic regression, ordinal regression, (sequential) principal component analysis, and robust principal component analysis. The other valuable features such as the best subset of group selection and sure independence screening are also provided. Fast Best Subset Selection
abglasso0.1.1NAMASS, pracma, stats, statmodGPL-3FALSEJarod Smith [aut, cre] (), Mohammad Arashi [aut] (), Andriette Bekker [aut] ()NANA2021-07-13Implements a Bayesian adaptive graphical lasso data-augmented block Gibbs sampler. The sampler simulates the posterior distribution of precision matrices of a Gaussian Graphical Model. This sampler was adapted from the original MATLAB routine proposed in Wang (2012) .Adaptive Bayesian Graphical Lasso
ABHgenotypeR1.0.1NAggplot2, reshape2, utilsGPL-3FALSEStefan Reuscher [aut, cre], Tomoyuki Furuta [aut]http://github.com/StefanReuscher/ABHgenotypeR/issueshttp://github.com/StefanReuscher/ABHgenotypeR2016-02-04Easy to use functions to visualize marker data from biparental populations. Useful for both analyzing and presenting genotypes in the ABH format.Easy Visualization of ABH Genotypes
abind1.4-5R (>= 1.5.0)methods, utilsLGPL (>= 2)FALSETony Plate and Richard HeibergerNANA2016-07-21Combine multidimensional arrays into a single array. This is a generalization of 'cbind' and 'rbind'. Works with vectors, matrices, and higher-dimensional arrays. Also provides functions 'adrop', 'asub', and 'afill' for manipulating, extracting and replacing data in arrays.Combine Multidimensional Arrays
abjData1.1.2R (>= 3.3.1)NAMIT + file LICENSEFALSEJulio Trecenti [aut, cre] (), Renata Hirota [ctb], Katerine Witkoski [aut] (), Associação Brasileira de Jurimetria [cph, fnd]NAhttps://abjur.github.io/abjData/2022-06-15The Brazilian Jurimetrics Association (ABJ in Portuguese, see for more information) is a non-profit organization which aims to investigate and promote the use of statistics and probability in the study of Law and its institutions. This package has a set of datasets commonly used in our book.Databases Used Routinely by the Brazilian JurimetricsAssociation
abjutils0.3.2R (>= 3.6)dplyr, magrittr, purrr, rlang, rstudioapi, stringi, stringr,tidyrMIT + file LICENSEFALSECaio Lente [aut, cre] (), Julio Trecenti [aut] (), Katerine Witkoski [ctb] (), Associação Brasileira de Jurimetria [cph, fnd]NAhttps://github.com/abjur/abjutils2022-02-01The Brazilian Jurimetrics Association (ABJ in Portuguese, see for more information) is a non-profit organization which aims to investigate and promote the use of statistics and probability in the study of Law and its institutions. This package implements general purpose tools used by ABJ, such as functions for sampling and basic manipulation of Brazilian lawsuits identification number. It also implements functions for text cleaning, such as accentuation removal.Useful Tools for Jurimetrical Analysis Used by the BrazilianJurimetrics Association
(Preview rows of the CRAN package table, from abn to ACDm: each record lists the package name, version, dependencies, license, authors, bug-report and home-page URLs, date published, description, and title.)

Now you could take this data, aggregate the date_published by month, and plot the growth of the R ecosystem for yourself. For instance like this:

df %>%
  mutate(date = floor_date(ymd(date_published), unit = "month")) %>%
  filter(!is.na(date)) %>%
  count(date) %>%
  arrange(date) %>%
  mutate(cumul = cumsum(n)) %>%
  ggplot(aes(date, cumul)) +
  geom_line(col = "blue") +
  theme_minimal() +
  labs(x = "Date", y = "Cumulative count",
       title = "Cumulative count of CRAN packages by date of latest version")

But this doesn’t really show you the true historical growth, does it? The x-axis range and my plot title are already telling you what’s going on here. The date_published that CRAN_package_db gives us (and that the CRAN website lists) corresponds to the last published version of the package. See for instance the entry for dplyr:

df %>%
  filter(package == "dplyr") %>%
  select(package, version, date_published) %>%
  gt()
package  version  date_published
dplyr    1.0.9    2022-04-28

So the cornerstone of the tidyverse was apparently first published a little while before April 2022? Of course not: version 1.0.9 is simply its most recent release at the time of writing.

Naturally, that means that in this table packages with frequent updates are weighted towards more recent dates. That is perfectly fine if you are only interested in the most recent version of each package. But if you, like me, want to see how the R ecosystem grew over time, then you need the historical dates of the first published versions. This is where our next package comes in.

Package history

I found the packageRank package by googling “R package history”. Its documentation on GitHub is detailed, and it performs very well for me. Its packageHistory function only needs a package name and does the rest. Let’s find out more about the release history of our favourite dplyr package:

df_hist_dplyr <- packageRank::packageHistory(package = "dplyr", check.package = TRUE) %>%
  as_tibble() %>%
  janitor::clean_names()

df_hist_dplyr %>%
  gt() %>%
  tab_header(
    title = md("**The release history of the dplyr package**")
  ) %>%
  opt_row_striping() %>%
  tab_options(container.height = px(400))
The release history of the dplyr package
package  version  date        repository
dplyr    0.1      2014-01-16  Archive
dplyr    0.1.1    2014-01-29  Archive
dplyr    0.1.2    2014-02-24  Archive
dplyr    0.1.3    2014-03-15  Archive
dplyr    0.2      2014-05-21  Archive
dplyr    0.3      2014-10-04  Archive
dplyr    0.3.0.1  2014-10-08  Archive
dplyr    0.3.0.2  2014-10-11  Archive
dplyr    0.4.0    2015-01-08  Archive
dplyr    0.4.1    2015-01-14  Archive
dplyr    0.4.2    2015-06-16  Archive
dplyr    0.4.3    2015-09-01  Archive
dplyr    0.5.0    2016-06-24  Archive
dplyr    0.7.0    2017-06-09  Archive
dplyr    0.7.1    2017-06-22  Archive
dplyr    0.7.2    2017-07-20  Archive
dplyr    0.7.3    2017-09-09  Archive
dplyr    0.7.4    2017-09-28  Archive
dplyr    0.7.5    2018-05-19  Archive
dplyr    0.7.6    2018-06-29  Archive
dplyr    0.7.7    2018-10-16  Archive
dplyr    0.7.8    2018-11-10  Archive
dplyr    0.8.0    2019-02-14  Archive
dplyr    0.8.0.1  2019-02-15  Archive
dplyr    0.8.1    2019-05-14  Archive
dplyr    0.8.2    2019-06-29  Archive
dplyr    0.8.3    2019-07-04  Archive
dplyr    0.8.4    2020-01-31  Archive
dplyr    0.8.5    2020-03-07  Archive
dplyr    1.0.0    2020-05-29  Archive
dplyr    1.0.1    2020-07-31  Archive
dplyr    1.0.2    2020-08-18  Archive
dplyr    1.0.3    2021-01-15  Archive
dplyr    1.0.4    2021-02-02  Archive
dplyr    1.0.5    2021-03-05  Archive
dplyr    1.0.6    2021-05-05  Archive
dplyr    1.0.7    2021-06-18  Archive
dplyr    1.0.8    2022-02-08  Archive
dplyr    1.0.9    2022-04-28  CRAN

First released in January 2014. It has been an interesting journey for the tidyverse since then. I first started using the tidy packages in 2017, and I would find it hard to go back to base R now.

With data like this for a single R package, you could for instance investigate the yearly frequency of releases over time:

df_hist_dplyr %>%
  mutate(year = floor_date(date, unit = "year")) %>%
  count(year) %>%
  ggplot(aes(year, n)) +
  geom_col(fill = "purple") +
  scale_x_date() +
  theme_hc() +
  labs(x = "", y = "", title = "Number of releases of 'dplyr' per year - in July 2022")

Interesting pattern there from 2014 to 2016. Since then, the number of releases has been pretty consistent. At the time of writing, we are still in the middle of 2022.

(As a side note, you’ll see that I’ve decided not to use an x-axis label here. I feel like a year axis is often self-explanatory; and I’ve used a descriptive title to prevent misinterpretation. Let me know if you disagree.)

To get the history for all the entries in our complete list of CRAN packages, we can then simply loop through the package names. The loop takes about an hour, but you don’t have to run it yourself. This is what I created the Kaggle dataset for. You can download “cran_package_history.csv” and start working with it immediately.
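The collection loop itself is not shown in this post’s output, but a minimal sketch of it could look like the code below. It assumes the package names sit in the package column of df from above and that purrr and janitor are installed; the actual loop behind the Kaggle dataset may differ in its details.

# Sketch of the collection loop (slow: one packageHistory() call per package,
# which is why the full run takes about an hour)
library(purrr)

df_hist <- map_dfr(
  df$package,
  ~ packageRank::packageHistory(package = .x) %>%
      as_tibble() %>%
      janitor::clean_names()
)

# readr::write_csv(df_hist, "cran_package_history.csv")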

df_hist <- read_csv("../../static/files/cran_package_history.csv", col_types = cols())

Here are the first 50 rows:

df_hist %>%
  head(50) %>%
  gt() %>%
  tab_header(
    title = md("**The first rows of the cran_package_history.csv table**")
  ) %>%
  opt_row_striping() %>%
  tab_options(container.height = px(400))
The first rows of the cran_package_history.csv table
package       version  date        repository
A3            0.9.1    2013-02-07  Archive
A3            0.9.2    2013-03-26  Archive
A3            1.0.0    2015-08-16  CRAN
AATtools      0.0.1    2020-06-14  CRAN
ABACUS        1.0.0    2019-09-20  CRAN
abbreviate    0.1      2021-12-14  CRAN
abbyyR        0.1      2015-06-12  Archive
abbyyR        0.2      2015-09-12  Archive
abbyyR        0.2.1    2015-11-04  Archive
abbyyR        0.2.2    2015-11-06  Archive
abbyyR        0.2.3    2015-12-06  Archive
abbyyR        0.3      2016-02-04  Archive
abbyyR        0.4.0    2016-05-16  Archive
abbyyR        0.5.0    2016-06-20  Archive
abbyyR        0.5.1    2017-04-12  Archive
abbyyR        0.5.3    2018-05-28  Archive
abbyyR        0.5.4    2018-05-30  Archive
abbyyR        0.5.5    2019-06-25  CRAN
abc           1.0      2010-10-05  Archive
abc           1.1      2010-10-11  Archive
abc           1.2      2011-01-15  Archive
abc           1.3      2011-05-10  Archive
abc           1.4      2011-09-04  Archive
abc           1.5      2012-08-08  Archive
abc           1.6      2012-08-14  Archive
abc           1.7      2013-06-06  Archive
abc           1.8      2013-10-29  Archive
abc           2.0      2014-07-11  Archive
abc           2.1      2015-05-05  Archive
abc           2.2.1    2022-05-19  CRAN
abc.data      1.0      2015-05-05  CRAN
ABC.RAP       0.9.0    2016-10-20  CRAN
abcADM        1.0      2019-11-13  CRAN
ABCanalysis   1.0      2015-02-13  Archive
ABCanalysis   1.0.1    2015-04-20  Archive
ABCanalysis   1.0.2    2015-06-15  Archive
ABCanalysis   1.1.0    2015-09-28  Archive
ABCanalysis   1.1.1    2016-06-15  Archive
ABCanalysis   1.1.2    2016-08-23  Archive
ABCanalysis   1.2.1    2017-03-13  CRAN
abclass       0.1.0    2022-03-07  Archive
abclass       0.2.0    2022-04-12  Archive
abclass       0.3.0    2022-05-28  CRAN
ABCoptim      0.13.10  2013-10-21  Archive
ABCoptim      0.13.11  2013-11-06  Archive
ABCoptim      0.14.0   2016-11-17  Archive
ABCoptim      0.15.0   2017-11-06  CRAN
ABCp2         1.0      2013-04-10  Archive
ABCp2         1.1      2013-07-23  Archive
ABCp2         1.2      2016-02-04  CRAN

Now we can keep only the initial release date for each package and visualise the cumulative number of CRAN packages created over time:

df_hist %>%
  group_by(package) %>%
  slice_min(order_by = date, n = 1) %>%
  ungroup() %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  count(month) %>%
  arrange(month) %>%
  mutate(cumul = cumsum(n)) %>%
  ggplot(aes(month, cumul)) +
  geom_line(col = "purple") +
  theme_minimal() +
  labs(x = "Date", y = "Cumulative count",
       title = "Cumulative count of CRAN packages by date of first release")

Still impressive growth, and now we give proper emphasis to the early history of CRAN, reaching all the way back to before the year 2000. There are many more angles and visuals that this dataset will allow you to explore.


Notes and suggestions:

  • I will keep updating this dataset on a monthly basis. After the initial collection of all the version histories, I only need to refresh the histories of those packages that have released a new version since. This should speed up the process significantly (see the sketch after this list for one way to find those packages).

  • Some ideas for EDA and analysis: how long did packages take from their first release to version 1.0? What type of packages were most frequent in different years? Who are the most productive authors? Can you predict the growth toward 2025?

  • All of this analysis can be done directly on the Kaggle platform! On the dataset page, on the top right, you will see a button called “New Notebook”. Click that to get an interactive editor in R or Python and start exploring immediately.
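For the monthly refresh mentioned in the first note, one possible approach (a sketch only, assuming df and df_hist from above are in memory and that lubridate is loaded) is to compare the latest date_published in df with the dates already recorded in df_hist and to re-query only the packages that changed:

# Find packages with a release newer than what df_hist already records
last_recorded <- df_hist %>%
  group_by(package) %>%
  summarise(last_date = max(date), .groups = "drop")

to_refresh <- df %>%
  select(package, date_published) %>%
  left_join(last_recorded, by = "package") %>%
  filter(is.na(last_date) | ymd(date_published) > last_date) %>%
  pull(package)

# Re-run packageHistory() only for these packages and append the new rows
new_histories <- purrr::map_dfr(
  to_refresh,
  ~ packageRank::packageHistory(package = .x) %>%
      as_tibble() %>%
      janitor::clean_names()
)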

Have fun!

To leave a comment for the author, please follow the link and comment on their blog: R on head spin - the Heads or Tails blog.

Continue reading: A Kaggle Dataset of R Package History for rstudio::conf(2022)

Read Data from Multiple Excel Sheets and Convert them to Individual Data Frames

[This article was first published on R | Fahim Ahmad, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I work with a team of data enthusiasts where we deal with a large amount of data every single day. Sometimes it happens that we end up storing the data into multiple Excel files or multiple Excel sheets.

During data analysis, though, we have to read the data from the different sheets and convert them to individual data frames; this means the same import function must be called several times.

I am sure I am not the only one who works with multiple data sets at once, so I decided to write this post to explore an efficient way of reading data from several Excel sheets and storing them in individual data frames all at once. I hope it can be of some help to those who face the same challenge, and it also gives me a record of the script for my future work.

Step 1: Reading data

Suppose you have an Excel file named data.xlsx with data in several sheets and you aim to import the data from every single sheet at once. There are at least two ways of doing this: 1) using the lapply() function, 2) using the map() function from the purrr package.

using lapply()

library(readxl)

# Read every sheet of data.xlsx into a list of data frames
df_list <- lapply(excel_sheets("data.xlsx"), function(x)
  read_excel("data.xlsx", sheet = x)
)

using map()

library(purrr)
# set_names() turns the vector of sheet names into a named vector,
# so the resulting list is named after the sheets
df_list <- map(set_names(excel_sheets("data.xlsx")),
  read_excel, path = "data.xlsx"
)

Although both lapply() and map() store the final output as a list, the map() call above creates a named list, where the name of each element corresponds to the sheet the data come from. Thus, later on you can easily identify which Excel sheet is stored in which element of the list of data frames.
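If you prefer the lapply() approach but still want a named list, you can simply add the sheet names yourself afterwards; a small sketch, assuming the same data.xlsx file:

# Name the lapply() result after the sheets so it matches the map() output
sheet_names <- excel_sheets("data.xlsx")
df_list <- lapply(sheet_names, function(x) read_excel("data.xlsx", sheet = x))
names(df_list) <- sheet_names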

Step 2: Converting the list of data frames into individual data frames

Once you have read the data from the Excel sheets and stored them in a list, the next step is to convert them to individual data frames - unless you want to apply some list-wise operations first, such as removing a particular row or column from all data frames at once.
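For example, a list-wise operation of that kind could look like the sketch below; the column name drop_me is purely hypothetical and only there to illustrate the idea:

# Remove a (hypothetical) column from every data frame in the list at once
df_list <- lapply(df_list, function(d) d[, setdiff(names(d), "drop_me")])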

There are several ways of doing this. The most straightforward way, which I found most appealing and simple to use, is the list2env() function. You only need to feed it the list in which the data frames are stored and it will turn each component of the list into a separate object in the chosen environment. Note that list2env() requires a named list, so it pairs naturally with the map()/set_names() approach above.

list2env(df_list, envir = .GlobalEnv)

Apart from that, the same task can be accomplished using the assign() function together with lapply() or map(), as shown below.

using lapply()

lapply(names(df_list), function(x)
assign(x, df_list[[x]], envir = .GlobalEnv)
)

using map()

purrr::map(names(df_list),
~assign(.x, df_list[[.x]], envir = .GlobalEnv)
)
To leave a comment for the author, please follow the link and comment on their blog: R | Fahim Ahmad.

Continue reading: Read Data from Multiple Excel Sheets and Convert them to Individual Data Frames

Programming a simple minimax chess engine in R

[This article was first published on R programming tutorials and exercises for data science and mathematics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Why write a chess engine in R?

Chess has always fascinated me. This is a game with simple rules that takes a lot of practice to master. No information is hidden from the players: the game state is clearly visible on the board, but strategies can be complex and subtle. This combination of simple rules and complex strategy becomes very relevant when attempting to write a chess engine. As we will see, a single line of R code is sufficient to implement a program that returns random legal moves. Creating a chess engine that actually plays well is another story.

Powerful chess engines have been developed that can currently defeat most human players. These are highly optimized pieces of software that:

  • search through the space of available moves very quickly to query positions many moves ahead
  • often use expert knowledge of the game or pre-computed results to evaluate positions accurately (examples include opening books, endgame databases and evaluation heuristics)
  • rely on vast amounts of computer resources for training and leverage advanced machine learning models such as neural networks for position evaluation

My goal is not to recreate any of these advanced efforts. That would be unrealistic and, as a casual chess player and avid R programmer, I am more interested in discovering the opportunities that R offers around chess. For example, multiple packages are available to plot game positions or to generate lists of legal moves. I am also curious to explore some of the standard algorithms used in chess engines (starting with minimax) and to see how simple R implementations of these approaches translate into enjoyable chess games.

In essence, my simple R chess engine should be fun and informative to write and also entertaining to play against as a casual player. It serves as an opportunity to explore an uncommon but fun application of R.

With this introduction let’s dive into the world of R chess!

Getting started: basic chess functions in R and first ‘chess engine’

Before we start implementing any game logic it is important to ensure that some basic chess functionality is available. This includes:

  • plotting the game board
  • listing all legal moves for a given position
  • recognizing when a game is won or drawn

The rchess package provides excellent functions that address these points.

Let’s start a new chess game and plot the board:

# Load the rchess package:
library(rchess)

# Create a new game:
chess_game <- Chess$new()

# Plot the current position:
plot(chess_game)
(Interactive board plot; FEN: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1)

What moves are available for white? We can get a list of all legal moves via the moves function:

chess_game$moves()
##  [1] "a3"  "a4"  "b3"  "b4"  "c3"  "c4"  "d3"  "d4"
##  [9] "e3"  "e4"  "f3"  "f4"  "g3"  "g4"  "h3"  "h4"
## [17] "Na3" "Nc3" "Nf3" "Nh3"

We can pick a move and update the game accordingly:

chess_game$move("e4")
plot(chess_game)
(Interactive board plot; FEN: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1)

The next player to move is black:

chess_game$turn()
## [1] "b"

and the available moves are:

chess_game$moves()
##  [1] "Nc6" "Na6" "Nh6" "Nf6" "a6"  "a5"  "b6"  "b5"
##  [9] "c6"  "c5"  "d6"  "d5"  "e6"  "e5"  "f6"  "f5"
## [17] "g6"  "g5"  "h6"  "h5"

Using the moves function we can already write a first chess engine that returns a random legal move:

# Random move generator for a given position:
get_random_move <- function(chess_game) {
  return(sample(chess_game$moves(), size = 1))
}

We can use this simple function to generate full chess games with both sides played by the computer:

# Set up a new game
chess_game <- Chess$new()

# Perform random legal moves until the game ends in mate or draw
while (!chess_game$game_over()) {
  chess_game$move(get_random_move(chess_game))
}

# Plot the final position
plot(chess_game)
(Interactive board plot; FEN: 7N/r1nb4/4kQ2/1p1pP1p1/p1p3p1/1P5P/P1P3P1/R2K1B1R b - - 0 43)

This game ended in a win for white and took only a few seconds to run. We can also get a summary of the board in FEN notation by using the fen function:

chess_game$fen()
## [1] "7N/r1nb4/4kQ2/1p1pP1p1/p1p3p1/1P5P/P1P3P1/R2K1B1R b - - 0 43"

This FEN representation of the game state allows us to easily determine which pieces are still available for each player.
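For instance, one quick way to tally the remaining pieces (upper-case letters are white pieces, lower-case letters are black) is to split the board part of the FEN string into single characters and count them; a small sketch:

# Count the pieces on the board from the FEN string
board_fen <- strsplit(chess_game$fen(), split = " ")[[1]][1]
table(strsplit(gsub("[/0-9]", "", board_fen), split = "")[[1]])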

We can also confirm that the game is over, that checkmate occurred and that black is in check:

chess_game$game_over()
## [1] TRUE

chess_game$in_checkmate()
## [1] TRUE

chess_game$in_check()
## [1] TRUE

Improving our chess logic: heuristic position evaluation function

While playing a full random game is surprisingly simple with the help of the rchess package, we would like to start improving our engine in two ways:

  • Evaluation function: implement a simple heuristic to evaluate individual positions based on the pieces on the board and the safety of the king
  • Search function: implement an algorithm that allows our engine to search through positions multiple moves ahead

Let’s first work on the position evaluation, which is usually done from the perspective of white. More precisely, scores greater than zero indicate an advantage for white and scores less than zero point to an advantage for black. Our scores are based on three aspects of the position being evaluated:

  • Is this position a win for one of the players or a draw? If white won, the score is 1000. If black won, the score is -1000. In case of a draw, the score is 0.
  • What is the material advantage for white? Each piece receives a value (9 points for queens, 5 points for rooks, 3 points for bishops and knights and 1 point for pawns) and the total piece value for black is subtracted from the total piece value for white.
  • If the white king is in check subtract one point from the position score. If the black king is in check add one point to the position score.

This is a fairly standard (although highly simplified) way to define an evaluation function: white will choose moves with high scores and black will prefer moves with low scores. Both players can thus use the same position evaluation function, which simplifies our program.

Let’s implement these simple heuristics in R:

# Position evaluation function
evaluate_position <- function(position) {
  # Test if black won
  if (position$in_checkmate() & (position$turn() == "w")) {
    return(-1000)
  }
  # Test if white won
  if (position$in_checkmate() & (position$turn() == "b")) {
    return(1000)
  }
  # Test if game ended in a draw
  if (position$game_over()) {
    return(0)
  }
  # Compute material advantage
  position_fen <- strsplit(strsplit(position$fen(), split = " ")[[1]][1], split = "")[[1]]
  white_score <- length(which(position_fen == "Q")) * 9 +
    length(which(position_fen == "R")) * 5 +
    length(which(position_fen == "B")) * 3 +
    length(which(position_fen == "N")) * 3 +
    length(which(position_fen == "P"))
  black_score <- length(which(position_fen == "q")) * 9 +
    length(which(position_fen == "r")) * 5 +
    length(which(position_fen == "b")) * 3 +
    length(which(position_fen == "n")) * 3 +
    length(which(position_fen == "p"))
  # Evaluate king safety
  check_score <- 0
  if (position$in_check() & (position$turn() == "w")) check_score <- -1
  if (position$in_check() & (position$turn() == "b")) check_score <- 1
  # Return final position score
  return(white_score - black_score + check_score)
}

Applying this function to our previous game we see that it correctly recognizes a win by white by returning a score of 1000:

evaluate_position(chess_game)
## [1] 1000

Improving our chess logic: minimax search

So far we have implemented a function that quantifies whether a given position is advantageous for white or black. The second part of our engine builds on this result and focuses on searching through potential next moves to identify the best choice. We will use an algorithm called minimax to perform this search.

The figure below shows how minimax works on a toy example:


Figure 1: Example of minimax search for the best next move (white to move in the current position)

White starts at the root position and has two move options. For each option, black has two replies available. Figure 1 summarizes this tree of move possibilities and shows the heuristic score for each of the leaf nodes based on the position evaluation function we just defined.

Black will always pick the move that minimizes the heuristic score. In the right branch black will play the move that results in a score of -3, so white can predict that picking the right branch move will lead to a position with score -3. Picking the left branch move however will result in a score of 0, since black cannot capture any material. Based on this reasoning, white will choose the left branch since it leads to a more advantageous outcome even after black plays its best reply.

This minimax strategy is powerful in chess since it looks multiple moves ahead and evaluates positions assuming that each player makes optimal choices. We can also increase the height of this tree: the position evaluation will be more precise since more moves are tested but the runtime will also increase.
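To get a feeling for how quickly the runtime grows, assume (as a rough rule of thumb, not something the engine computes) that a typical position offers on the order of 30 legal moves. The number of leaf positions to score then grows roughly like 30^depth:

# Back-of-the-envelope estimate of positions scored per search depth,
# assuming ~30 legal moves per position (an approximation)
branching_factor <- 30
data.frame(depth = 1:4, approx_positions = branching_factor^(1:4))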

Let’s implement this algorithm in R:

# Score position via minimax strategy
minimax_scoring <- function(chess_game, depth) {
  # If the game is already over or the depth limit is reached
  # then return the heuristic evaluation of the position
  if (depth == 0 | chess_game$game_over()) {
    return(evaluate_position(chess_game))
  }
  # Run the minimax scoring recursively on every legal next move,
  # making sure the search depth is not exceeded
  next_moves <- chess_game$moves()
  next_move_scores <- vector(length = length(next_moves))
  for (i in 1:length(next_moves)) {
    chess_game$move(next_moves[i])
    next_move_scores[i] <- minimax_scoring(chess_game, depth - 1)
    chess_game$undo()
  }
  # White will select the move that maximizes the position score
  # Black will select the move that minimizes the position score
  if (chess_game$turn() == "w") {
    return(max(next_move_scores))
  } else {
    return(min(next_move_scores))
  }
}

This completes the minimax_scoring function that returns a score given an input position. Now we just need a wrapper function to evaluate all legal next moves via minimax_scoring and return the optimal choice:

# Select the next move based on the minimax scoring
get_minimax_move <- function(chess_game) {
  # Score all next moves via minimax
  next_moves <- chess_game$moves()
  next_move_scores <- vector(length = length(next_moves))
  for (i in 1:length(next_moves)) {
    chess_game$move(next_moves[i])
    # To ensure fast execution of the minimax function we select a depth of 1
    # This depth can be increased to enable stronger play at the expense of longer runtime
    next_move_scores[i] <- minimax_scoring(chess_game, 1)
    chess_game$undo()
  }
  # For white return the move with maximum score
  # For black return the move with minimum score
  # If the optimal score is achieved by multiple moves, select one at random
  # This random selection from the optimal moves adds some variability to the play
  if (chess_game$turn() == "w") {
    return(sample(next_moves[which(next_move_scores == max(next_move_scores))], size = 1))
  } else {
    return(sample(next_moves[which(next_move_scores == min(next_move_scores))], size = 1))
  }
}

The function get_minimax_move defines the complete logic for our computer chess player. We are now ready to test the performance of our algorithm.

Testing our R minimax chess engine

Let’s first test the performance of our minimax-based chess engine by letting it play 10 games as white and 10 games as black against the random move generator we defined at the beginning of this article. If everything works as expected then our engine should win every time:

# Function that takes a side as input ("w" or "b") and plays 10 games
# The selected side will choose moves based on the minimax algorithm
# The opponent will use the random move generator
play_10_games <- function(minimax_player) {
  game_results <- vector(length = 10)
  for (i in 1:10) {
    chess_game <- Chess$new()
    while (!chess_game$game_over()) {
      if (chess_game$turn() == minimax_player) {
        # Selected player uses the minimax strategy
        chess_game$move(get_minimax_move(chess_game))
      } else {
        # Opponent uses the random move generator
        chess_game$move(get_random_move(chess_game))
      }
    }
    # Record the result of the current finished game
    # If mate: the losing player is recorded
    # If draw: record a 0
    if (chess_game$in_checkmate()) {
      game_results[i] <- chess_game$turn()
    } else {
      game_results[i] <- "0"
    }
  }
  # Print the outcome of the 10 games
  print(table(game_results))
}

White wins every game by playing the minimax strategy against the random move generator:

play_10_games("w")
## game_results
##  b
## 10

Black also wins every game as the minimax player against random white moves:

play_10_games("b")
## game_results
##  w
## 10

This simple experiment shows that our minimax chess engine is indeed more powerful than a random move generator, but this is a fairly low bar. I have also tested this program against multiple chess apps and it was able to win games against opponents playing at a casual difficulty level. Increasing the minimax depth leads to stronger play, but evaluating more than four moves in advance was not feasible on my computer due to long runtimes.

Conclusion and next steps

Overall this simple R chess engine is fun to play against for a beginner or casual player, and writing it was a great way to practice the minimax algorithm. Numerous improvements can still be made.

Some updates that I plan to discuss in future posts are:

  • Implement alpha-beta pruning to decrease the number of positions evaluated by minimax.
  • Use negamax to further simplify the code.
  • Leverage neural networks to define a better position evaluation function. The tensorflow and keras R packages make it possible to build very flexible neural networks within R.

Programming a simple chess engine is an exciting journey through the many machine learning capabilities offered by R. Looking forward to the next steps!

For more R programming tutorials and exercises visit my website codertime.org and let me know your comments at codertime.contact@gmail.com.

To leave a comment for the author, please follow the link and comment on their blog: R programming tutorials and exercises for data science and mathematics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Programming a simple minimax chess engine in R

What’s the fastest way to search and replace strings in a data frame?

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve tweeted this:

Just changed like 100 grepl calls to stringi::stri_detect and my pipeline now runs 4 times faster #RStats

— Bruno Rodrigues (@brodriguesco) July 20, 2022

Much discussion ensued. Some people were surprised because, in their experience, grepl() was faster than the alternatives, especially if you set the perl parameter in grepl() to TRUE. My use case was quite simple; I have a relatively large data set (half a million lines) with one column containing several misspellings of city names. So I painstakingly wrote some code to correct the spelling of the major cities (those that came up often enough to matter; minor cities were set to “Other”. Sorry, Wiltz!)
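To make that use case concrete, the clean-up looked roughly like the sketch below. The column name, the misspelled city values and the regular expressions are invented here purely for illustration; this is not the actual work code.

library(dplyr)
library(stringi)

# Toy data standing in for the column of misspelled city names
cities <- tibble::tibble(
  city = c("Luxmbourg", "Luxembourg-Ville", "Esch/Alzette", "esch sur alzette", "Wiltz")
)

cities |>
  mutate(city_clean = case_when(
    stri_detect(city, regex = "(?i)lux")  ~ "Luxembourg",
    stri_detect(city, regex = "(?i)esch") ~ "Esch-sur-Alzette",
    TRUE                                  ~ "Other"   # sorry, Wiltz!
  ))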

So in this short blog post, I benchmark some code to see if what I did the other day was a fluke. Maybe something weird with my R installation on my work laptop running Windows 10 somehow made stri_detect() run faster than grepl()? I don’t even know if something like that is possible. I’m writing these lines on my Linux machine, unlike the code I run at work. So maybe if I find some differences, they could be due to the different OS. I don’t want to have to deal with Windows on my days off (for my blood pressure’s sake), so I’m not running this benchmark on my work laptop. So that part we’ll never know.

Anyways, let’s start by getting some data. I’m not commenting on the code below, because that’s not the point of this post.

library(dplyr)
library(stringi)
library(stringr)
library(re2)

adult <- vroom::vroom(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")

adult_colnames <- readLines(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names")

adult_colnames <- adult_colnames[97:110] %>%
  str_extract(".*(?=:)") %>%
  str_replace_all("-", "_")

adult_colnames <- c(adult_colnames, "wage")

colnames(adult) <- adult_colnames

adult

## # A tibble: 32,560 × 15
##      age workclass    fnlwgt educa…¹ educa…² marit…³ occup…⁴ relat…⁵ race  sex  
##    <dbl> <chr>         <dbl> <chr>     <dbl> <chr>   <chr>   <chr>   <chr> <chr>
##  1    50 Self-emp-no…  83311 Bachel…      13 Marrie… Exec-m… Husband White Male 
##  2    38 Private      215646 HS-grad       9 Divorc… Handle… Not-in… White Male 
##  3    53 Private      234721 11th          7 Marrie… Handle… Husband Black Male 
##  4    28 Private      338409 Bachel…      13 Marrie… Prof-s… Wife    Black Fema…
##  5    37 Private      284582 Masters      14 Marrie… Exec-m… Wife    White Fema…
##  6    49 Private      160187 9th           5 Marrie… Other-… Not-in… Black Fema…
##  7    52 Self-emp-no… 209642 HS-grad       9 Marrie… Exec-m… Husband White Male 
##  8    31 Private       45781 Masters      14 Never-… Prof-s… Not-in… White Fema…
##  9    42 Private      159449 Bachel…      13 Marrie… Exec-m… Husband White Male 
## 10    37 Private      280464 Some-c…      10 Marrie… Exec-m… Husband Black Male 
## # … with 32,550 more rows, 5 more variables: capital_gain <dbl>,
## #   capital_loss <dbl>, hours_per_week <dbl>, native_country <chr>, wage <chr>,
## #   and abbreviated variable names ¹​education, ²​education_num, ³​marital_status,
## #   ⁴​occupation, ⁵​relationship
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Let’s now write the functions used for benchmarking. There will be 5 of them:

  • One using grepl() without any fancy options;
  • One using grepl() where perl is set to TRUE;
  • One that uses stringi::stri_detect();
  • One that uses stringr::str_detect();
  • One that uses re2::re2_detect().

Below you can read the functions. They’re all pretty much the same; only the function looking for the string changes. These functions look for a string in the marital_status variable and create a new variable with a corresponding integer.

with_grepl <- function(dataset){
  dataset |>
    mutate(married = case_when(
             grepl("Married", marital_status) ~ 1,
             grepl("married", marital_status) ~ 2,
             TRUE ~ 3)
           )
}

with_grepl_perl <- function(dataset){
  dataset |>
    mutate(married = case_when(
             grepl("Married", marital_status, perl = TRUE) ~ 1,
             grepl("married", marital_status, perl = TRUE) ~ 2,
             TRUE ~ 3)
           )
}

with_stringi <- function(dataset){
  dataset |>
    mutate(married = case_when(
             stri_detect(marital_status, regex = "Married") ~ 1,
             stri_detect(marital_status, regex = "married") ~ 2,
             TRUE ~ 3)
           )
}

with_stringr <- function(dataset){
  dataset |>
    mutate(married = case_when(
             str_detect(marital_status, "Married") ~ 1,
             str_detect(marital_status, "married") ~ 2,
             TRUE ~ 3)
           )
}

with_re2 <- function(dataset){
  dataset |>
    mutate(married = case_when(
             re2_detect(marital_status, "Married") ~ 1,
             re2_detect(marital_status, "married") ~ 2,
             TRUE ~ 3)
           )
}

Now I make extra sure these functions actually return the exact same thing. To do so, I run them once on the data and compare the outputs with testthat::expect_equal(). It’s a bit unwieldy, so if you have a better way of doing this, please let me know.

run_grepl <- function(){
  with_grepl(adult) %>%
    count(married, marital_status)
}

one <- run_grepl()

run_grepl_perl <- function(){
  with_grepl_perl(adult) %>%
    count(married, marital_status)
}

two <- run_grepl_perl()

run_stringi <- function(){
  with_stringi(adult) %>%
    count(married, marital_status)
}

three <- run_stringi()

run_stringr <- function(){
  with_stringr(adult) %>%
    count(married, marital_status)
}

four <- run_stringr()

run_re2 <- function(){
  with_re2(adult) %>%
    count(married, marital_status)
}

five <- run_re2()

one_eq_two <- testthat::expect_equal(one, two)

one_eq_three <- testthat::expect_equal(one, three)

three_eq_four <- testthat::expect_equal(three, four)

testthat::expect_equal(
  one_eq_two,
  one_eq_three
)

testthat::expect_equal(
  one_eq_three,
  three_eq_four
)

testthat::expect_equal(
  one,
  five
)
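As an aside, one more compact way to express the same checks could be to collect the outputs in a list and compare everything against a single baseline. This is only a sketch, not necessarily better:

# Collect all outputs and compare each one against the grepl() baseline
results <- list(
  grepl      = run_grepl(),
  grepl_perl = run_grepl_perl(),
  stringi    = run_stringi(),
  stringr    = run_stringr(),
  re2        = run_re2()
)

invisible(lapply(results[-1], testthat::expect_equal, expected = results[[1]]))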

testthat::expect_equal() does not complain, so I’m pretty sure my functions, while different, return the exact same thing. Now we’re ready for the benchmark itself. Let’s run these functions 500 times using {microbenchmark}:

microbenchmark::microbenchmark(
     run_grepl(),
     run_grepl_perl(),
     run_stringi(),
     run_stringr(),
     run_re2(),
     times = 500)

## Unit: milliseconds
##              expr      min       lq     mean   median       uq      max neval
##       run_grepl() 24.37832 24.89573 26.64820 25.50033 27.05967 115.0769   500
##  run_grepl_perl() 19.03446 19.41323 20.91045 19.89093 21.16683 104.3917   500
##     run_stringi() 23.01141 23.40151 25.00304 23.82441 24.83598 104.8065   500
##     run_stringr() 22.98317 23.44332 25.32851 23.92721 25.18168 145.5861   500
##         run_re2() 22.22656 22.60817 24.07254 23.05895 24.22048 108.6825   500

There you have it, folks! The winner is grepl() with perl = TRUE; after that it’s pretty much a tie between stringi(), stringr() and re2() (maybe with a slight edge for re2()), and grepl() without perl = TRUE comes last. But don’t forget that this is running on my machine with Linux installed on it; you may well get different results on different hardware and operating systems! So if you rely a lot on grepl() and other such string manipulation functions, maybe run a benchmark on your hardware first. Why switching from grepl() (without perl = TRUE, though) to stri_detect() made my pipeline at work run 4 times faster, I don’t know. Maybe it also has to do with the size of the data and the complexity of the regular expression used to detect the problematic strings.
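If you want to repeat this kind of check on your own data without the surrounding dplyr pipeline, a minimal template could look like the following; swap in your own character vector and pattern (the column and pattern below simply reuse the example data from this post):

x <- adult$marital_status  # replace with your own character vector

microbenchmark::microbenchmark(
  grepl      = grepl("Married", x),
  grepl_perl = grepl("Married", x, perl = TRUE),
  stringi    = stringi::stri_detect(x, regex = "Married"),
  stringr    = stringr::str_detect(x, "Married"),
  re2        = re2::re2_detect(x, "Married"),
  times = 100
)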

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: What’s the fastest way to search and replace strings in a data frame?

rstudio::conf(2022) Starts Today!

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The day you’ve been waiting for is here: rstudio::conf(2022) starts today! We hope you’re ready for four days, 16 workshops, four keynotes, and countless ways to learn and collaborate!

With so much happening, we wanted to give an overview of what to expect and how to get ready.

  • Check out the conf schedule app. Create your itinerary on your Apple or Android device.
    • Find out more on the schedule page by clicking the Mobile App + iCal button.
  • Participate in a Birds of a Feather event. BoFs are short, scheduled sessions where people from similar backgrounds meet during rstudio::conf. BoFs include industry-specific groups such as academic research, finance, and pharma and groups with similar interests such as natural language processing or machine learning. Many thanks to the organizations hosting the sponsored BoFs.
    • Add them to your calendar by filtering the schedule to “Social” events.
  • Join us in The Lounge! Come hang out and chat with RStudio employees about how you do data science in your day-to-day, what challenges you’re facing, how to learn or teach R on a broad scale, or check out the latest in our open source packages. You may even bump into your favorite package developer or software engineer.
  • Take part in other social events to connect with the community. We have a book signing reception for workshop attendees on Monday evening, the welcome reception on Tuesday evening, dinner and activities on Wednesday evening, and an R-Ladies reception on Thursday evening.

We have more ways to enjoy conf, whether you’re joining us in person or online!

  • Watch the keynotes and talks via live stream. The live streams will be available on the rstudio::conf(2022) website. No registration is required; they are open and free to all! Tune in and ask questions alongside other attendees.
  • Join our Discord server to chat and network with RStudio folks and other attendees. Participate in fun community events, AMAs, and more! Sign up on the conference website.
  • Follow us on social media. We’ll be active on RStudio Twitter, RStudio Glimpse, LinkedIn, Instagram, and TikTok!
  • Use the #rstudioconf and #rstudioconf2022 hashtags to engage and share with others.

If you are wondering how to fit everything you want to do in these four days, Tracy Teal, RStudio’s Open Source Program Director, recommends coming to conf with one or two goals in mind. Is it someone you want to meet, something you want to learn, a talk you want to see? Then organize your activities accordingly so that you can make them happen! And of course, please reach out to RStudio staff if we can help you achieve your goals.

We can’t wait to see you at rstudio::conf(2022)!

To leave a comment for the author, please follow the link and comment on their blog: RStudio | Open source & professional software for data science teams on RStudio.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: rstudio::conf(2022) Starts Today!

Interview with Ehouman Evans – Experience with R and Use with Agroforestry in Cotonou, West Africa

[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

View the full interview: https://www.youtube.com/watch?v=4yW5TRZslj8 

Ehouman Evans (Ph.D.), Agroforestry Project Manager at CIRAD, gives an interview to the R Consortium about his journey with R, his career, how he’s applied his R knowledge to Agroforestry, and more. The interview was conducted by Kevin O’Brien of the Why R? Foundation.

In this interview, Ehouman began by sharing about the city of Cotonou in West Africa, its well-known surrounding cities, and what the R community is like there. Afterward, Ehouman discussed his work in Agroforestry and the impact it is having on locations in West Africa. He also connected his sustainable development goals with his work as a “plant scientist.” 

Following this, Ehouman gave more background on his career path and then moved on to his journey with R and how he uses it in his work. He discussed the R community in Cotonou and surrounding countries, as well as his personal journey with learning and using R. Ehouman also shared about the networking he has done with other R communities around the world. The interview wraps up with Ehouman sharing some of his favorite R packages and giving his advice for those who are beginning their journey with R. 

Main Sections

0:00 Introduction

0:31 Sharing about Cotonou, West Africa

5:01 Plant Science, Numerical Ecology, and location

7:15 Sustainable development goals 

9:00 Sharing about Ehouman’s career path

14:40 Journey with R 

17:33 R in the community of Cotonou

20:19 News about R Community in Cotonou

22:26 Connections with other Francophone countries 

25:38 Ehouman gives advice 

28:05 International connectivity

29:33 Ehouman’s favorite R package

30:46 Thank you!

More Resources

Main Site: https://www.r-consortium.org/ 

News: https://www.r-consortium.org/news 

Blog: https://www.r-consortium.org/news/blog 

Join: https://www.r-consortium.org/about/join 

Twitter: https://twitter.com/Rconsortium 

LinkedIn: https://www.linkedin.com/company/r-consortium/

The post Interview with Ehouman Evans – Experience with R and Use with Agroforestry in Cotonou, West Africa appeared first on R Consortium.

To leave a comment for the author, please follow the link and comment on their blog: R Consortium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Interview with Ehouman Evans – Experience with R and Use with Agroforestry in Cotonou, West Africa

Simulating Turnout in Tunisia’s Constitutional Referendum

[This article was first published on R on Robert Kubinec, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am writing this post in response to questions about estimating turnout for Tunisia’s constitutional referendum today. Turnout is an important aspect of this referendum because high turnout would signal higher legitimacy for President Kais Saied’s dramatic changes to Tunisia’s democracy. His proposed constitution would replace Tunisia’s democratic institutions with a new dictatorship that would concentrate power in the president’s hands.

Because of Saied’s interference in the election commission, we are much less sure about the accuracy of results from the official election commission, the ISIE. Saied has also banned foreign election observers from arriving, leaving only one local NGO, Mourakiboun, observing polling stations. Mourakiboun has done significant work in prior elections, but reportedly faces a shortfall of local participants because of low interest in the referendum.

Mourakiboun uses a method known as parallel vote tabulation (PVT) to estimate turnout. It is a method used by NGOs like NDI to provide an independent estimate of turnout. According to NDI, to estimate turnout through observers without bias, Mourakiboun has to select a random sample of polling stations and then accurately record all votes from these polling stations. I will use some code-based simulations to see how well this method will work depending on levels of turnout. If turnout is low, there is likely to be much more sampling uncertainty because there are fewer voters at polling stations to base estimates on. I will also examine what happens if Mourakiboun does not select polling stations at random, which is likely to happen for logistical reasons. For example, Mourakiboun may put more observers at stations with higher turnout in order to save volunteer hours.

The results of the simulation show that PVT/Mourakiboun turnout estimates are probably going to be less precise at low levels of turnout compared to high levels of turnout. In addition, when turnout is low and selection of polling stations is somewhat biased, such as selecting higher-turnout polling stations, the inaccuracies in the estimates get much worse. In other words, if Mourakiboun is using an imperfect methodology and turnout is low, their estimates will be farther off than if turnout in the referendum was high. If 50% of Tunisians voted, this wouldn’t be as much of a problem, but if 10% vote, it will be much harder to get an accurate estimate using PVT via polling station counts.

Without digging into the methods below, the results can be summarized as:

If turnout is large, these biases have a much smaller role, but when turnout is small, as is likely to be the case with the constitutional referendum, the biases can have a much bigger impact on the final estimates, making Mourakiboun’s methods more likely to fail.

Simulation

First, we randomly create polling stations roughly corresponding to Tunisia’s population of 7 million registered voters and 4,500 polling stations:

N <- 7000000
stations <- 4500
vote_assign <- sample(1:stations, N, replace = T,
                      prob = sample(1:3, stations, replace = T))
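As a quick check, the station sizes implied by this assignment can be inspected directly; the exact minimum and maximum will differ from run to run because no random seed is set:

# Number of registered voters assigned to each polling station
station_sizes <- table(vote_assign)
c(smallest = min(station_sizes), largest = max(station_sizes))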

What we will do is test how accurate samples will be depending on total turnout and the quality of samples. To do so, I’ll first vary true turnout from 5% up to 50% of the population. We’ll assume as well that turnout varies by polling station, which is much more accurate than assuming a constant rate of turnout for all polling stations.

We’ll assume that the stations can vary in size by up to a factor of three, i.e., some stations may be three times smaller than others. For our sample, we have a polling station as large as 2533 voters and one as small as 689. We’ll then assume, for now, that Mourakiboun selects a random sample of 50 polling stations and records all votes at these stations. We’ll repeat this experiment 1,000 times and record what estimated turnout Mourakiboun would report and how that compares to real turnout:

# loop over turnout, sample polls, estimate turnout
over_turnout <- parallel::mclapply(seq(.05, .5, by = .1), function(t) {

  # polling station varying turnout rates
  station_rates <- rbeta(n = N, t * 20, (1 - t) * 20)

  # randomly let voters decide to vote depending on true station-level turnout rate
  pop_turnout <- lapply(1:stations, function(s) {
    tibble(turnout = rbinom(n = sum(vote_assign == s), size = 1,
                            prob = station_rates[s]),
           station = s)
  }) %>% bind_rows

  over_samples <- lapply(1:1000, function(i) {

    # sample 50 random polling stations, repeated 1,000 times
    sample_station <- sample(1:stations, size = 50)

    turn_est <- mean(pop_turnout$turnout[pop_turnout$station %in% sample_station])

    return(tibble(mean_est = turn_est,
                  experiment = i))

  }) %>% bind_rows %>%
    mutate(Turnout = t)

  over_samples

}, mc.cores = 10) %>%
  bind_rows
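As an aside on the rbeta() call above: the Beta(t * 20, (1 - t) * 20) distribution has mean t, so the simulated station-level turnout rates scatter around the target turnout level while still varying from station to station. A quick numerical check:

t <- 0.15
mean(rbeta(1e5, t * 20, (1 - t) * 20))  # should be close to 0.15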

We can now plot estimated versus actual turnout. If we have random samples and no problems with recording votes, this works fairly well. The dot shows the average of all samples, which is unbiased, and the vertical line shows where 95% of the samples fall, which is an estimate of potential sampling error. In other words, if Mourakiboun does everything right with a sample of 50 polling stations, they would expect this kind of error from random chance alone. If true turnout is 15%, they could see estimates that range from about 12% to 17%.

over_turnout %>%
  group_by(Turnout) %>%
  summarize(pop_est = mean(mean_est),
            low_est = quantile(mean_est, .05),
            high_est = quantile(mean_est, .95)) %>%
  ggplot(aes(y = pop_est, x = Turnout)) +
  geom_pointrange(aes(ymin = low_est,
                      ymax = high_est), size = .5, fatten = 1) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, colour = "red") +
  theme_tufte() +
  theme(text = element_text(family = "")) +
  labs(y = "Estimated Turnout", x = "True Turnout",
       caption = stringr::str_wrap("Comparison of Mourakiboun estimated (y axis) versus actual turnout (x axis). Red line shows where true and estimated values are equal. Based on random samples of 50 polling stations and assuming no problems with recording votes."))

Unfortunately, this kind of accuracy only happens in a computer simulation. It is quite tricky to do a true random sample of polling stations because there may not be volunteers to cover all of the polling stations, as is the case with the current referendum. Let’s assume in the following simulation then that a polling station is more likely to be sampled if the true level of turnout is higher:

# loop over turnout, sample polls, estimate turnout
over_turnout_biased <- parallel::mclapply(seq(.05, .5, by = .1), function(t) {

  # polling station varying turnout rates
  station_rates <- rbeta(n = stations, t * 20, (1 - t) * 20)

  # randomly let voters decide to vote depending on true station-level turnout rate
  pop_turnout <- lapply(1:stations, function(s) {
    tibble(turnout = rbinom(n = sum(vote_assign == s), size = 1,
                            prob = station_rates[s]),
           station = s)
  }) %>% bind_rows

  over_samples <- lapply(1:1000, function(i) {

    # sample 50 polling stations, with higher-turnout stations more likely to be picked
    sample_station <- sample(1:stations, size = 50, prob = station_rates)

    turn_est <- mean(pop_turnout$turnout[pop_turnout$station %in% sample_station])

    return(tibble(mean_est = turn_est,
                  experiment = i))

  }) %>% bind_rows %>%
    mutate(Turnout = t)

  over_samples

}, mc.cores = 10) %>%
  bind_rows

We can now plot estimated versus actual turnout for this simulation where the polling stations were more likely to be sampled if they had higher turnout:

over_turnout_biased %>%
  group_by(Turnout) %>%
  summarize(pop_est = mean(mean_est),
            low_est = quantile(mean_est, .05),
            high_est = quantile(mean_est, .95)) %>%
  ggplot(aes(y = pop_est, x = Turnout)) +
  geom_pointrange(aes(ymin = low_est,
                      ymax = high_est), size = .5, fatten = 1) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, colour = "red") +
  theme_tufte() +
  theme(text = element_text(family = "")) +
  labs(y = "Estimated Turnout", x = "True Turnout",
       caption = stringr::str_wrap("Comparison of Mourakiboun estimated (y axis) versus actual turnout (x axis). Red line shows where true and estimated values are equal. Based on biased samples of 50 polling stations with higher turnout stations more likely to be sampled. However, simulation assumes no problems with recording votes.")) +
  ylim(c(0, 0.5)) +
  xlim(c(0, 0.5))

We see in the plot above that the black dot estimates are always higher than the red line showing the true values. Furthermore, the bias appears worse when turnout is smaller. The bias is also substantial - at low levels of turnout, such as at 5%, the estimated turnout can be twice as high.

As can be seen in this simulation study, it is quite possible for the method to work, but only if the polling stations are selected at random. I did not look into what would happen with other possible errors, such as being unable to accurately record all votes from a given polling station. For these reasons, while this method can certainly work, it is necessary to confirm what methodology was used to select the polling stations and also how strictly it was followed in implementation. There are many points at which this type of analysis could either intentionally or unintentionally result in over/under reporting of turnout.

If turnout is large, some of these biases have a minimal effect, but when turnout is small, as is likely to be the case with the constitutional referendum, they can have a much bigger impact on the final estimates.

To leave a comment for the author, please follow the link and comment on their blog: R on Robert Kubinec.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Simulating Turnout in Tunisia’s Constitutional Referendum