
Accessibility Web Development – How to Make R Shiny Apps Accessible

[This article was first published on Tag: r – Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers.]

It’s crucial you address R Shiny accessibility early on. Why? Because you want to make sure your data products can be used by all users – inclusivity matters. In a nutshell, that’s what web accessibility development means, but you’ll learn more about it in the context of R Shiny apps today.

To kick things off, we’ll start with the theory behind R Shiny accessibility and web accessibility in general. There are some fairly strict rules you need to follow if you want to provide the most inclusive user experience.

Want to see how users use your R Shiny dashboard? Here are three tools for monitoring user adoption.

Web Accessibility Theory Explained

As mentioned earlier, accessibility means your data products (e.g., dashboards) can be used equally by all users. To get started, you should look no further than the A11Y project. The term “A11Y” is a numeronym for “accessibility” – there are 11 letters between its first and last letters.

The entire project breaks down accessibility into a checklist anyone can follow. Let’s go over a couple of their points.

Don’t rely on color to explain the data

Some people are color-blind, so a stacked bar chart with each category represented by a different color doesn’t always translate. Also, your charts can be printed in a black and white color scheme, making the coloring super confusing.

The ggpattern R package can help. To demonstrate, we’ll borrow stacked bar chart source code from the ggplot2 example gallery and compare the two packages. The code snippet below creates a traditional stacked bar chart, where each colored segment of a bar represents one category:

library(ggplot2)

specie <- c(rep("sorgho", 3), rep("poacee", 3), rep("banana", 3), rep("triticum", 3))
condition <- rep(c("normal", "stress", "Nitrogen"), 4)
value <- abs(rnorm(12, 0, 15))
data <- data.frame(specie, condition, value)

ggplot(data, aes(fill = condition, y = value, x = specie)) +
    geom_bar(position = "stack", stat = "identity")
Image 1 – ggplot2 stacked bar chart

Just imagine if you were color-blind – it would be either extremely difficult or impossible to distinguish between the categories. That’s why adding patterns is helpful. The following code snippet adds a distinct pattern to each category:

library(ggpattern)

ggplot(data, aes(fill = condition, y = value, x = specie)) +
    geom_col_pattern(
        aes(pattern = condition, fill = condition, pattern_fill = condition),
        colour = "black",
        pattern_colour = "white",
        pattern_density = 0.15,
        pattern_key_scale_factor = 1.3
    )
Image 2 – ggpattern stacked bar chart

It doesn’t matter if you can’t tell the difference between colors, or if the chart gets printed in shades of gray – everyone can spot a pattern and identify the individual categories.
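Patterns aren’t the only option. A colorblind-safe palette that also survives grayscale printing is another easy win; the viridis scales ship with ggplot2 (version 3.0 and up). A minimal sketch, reusing the `data` frame built in the first snippet:

library(ggplot2)

# Same stacked bar chart, but with the colorblind-friendly viridis palette.
# Assumes the `data` frame created in the first snippet above.
ggplot(data, aes(fill = condition, y = value, x = specie)) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_viridis_d()

Combining a safe palette with ggpattern’s patterns covers both screen and print.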

Want to expand your bar chart skills in ggplot2? Check out our guide to bar charts and try to make them more accessible.

Don’t use very bright or low-contrast colors

Your background and foreground colors should contrast enough. But what counts as enough? That can be tricky to decide.

Luckily, free online tools such as contrastchecker.com let you verify whether your colors pass the tests. There are currently six tests on the website. Don’t worry about what each one means; just remember that green is good and red is bad.

Here’s an example. A white background color with black text is clear and easy for everyone to see:

Image 3 – Contrast checker – white and black

Black and white are the exact opposite colors, so the human eye has little difficulty identifying the characters. But what about yellow letters on a white background? Let’s see:

Image 4 – Contrast checker – white and yellow

Suddenly, things go south. Reading yellow text on a white background strains your eyes; it’s difficult even with typical color perception. For that reason, all six tests fail.
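If you’re curious what those tests are actually measuring, the core number is the WCAG contrast ratio, which you can compute yourself. Below is a minimal R sketch of the WCAG 2.x formula (the function names are our own, not from any package). Black on white scores the maximum 21:1, while yellow on white lands around 1.07:1, far below the 4.5:1 that WCAG AA requires for normal-size text:

relative_luminance <- function(hex) {
    # Convert a hex color to linearized sRGB channels, then weight them
    rgb <- as.numeric(grDevices::col2rgb(hex)) / 255
    lin <- ifelse(rgb <= 0.03928, rgb / 12.92, ((rgb + 0.055) / 1.055)^2.4)
    sum(c(0.2126, 0.7152, 0.0722) * lin)
}

contrast_ratio <- function(foreground, background) {
    l <- sort(c(relative_luminance(foreground), relative_luminance(background)))
    (l[2] + 0.05) / (l[1] + 0.05) # lighter luminance on top, so the ratio is >= 1
}

contrast_ratio("#000000", "#FFFFFF") # black on white: 21
contrast_ratio("#FFFF00", "#FFFFFF") # yellow on white: ~1.07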

Don’t overwhelm the user with information

If you’re a statistician or a data scientist, it’s easy to think everyone will understand your dashboards just because you do. It’s a common fallacy; in reality, most users won’t follow your internal logic.

Don’t include dozens of charts on a single dashboard page! It’s a bad design practice that will leave your users confused, concerned, and likely disoriented. Make one chart a focus of the dashboard and add a couple of helper charts around it. That’s enough. If you can’t convey a message with a handful of data visualizations, consider simplifying your message or re-evaluate your approach.

Also, use consistent language and color. If two charts share the same X-axis variable, don’t label it differently across plots. Likewise, using red and blue in one chart and green and yellow in another is just confusing, especially when the chart types are identical. Be consistent!
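One low-effort way to enforce that consistency is to define the category-to-color mapping once, as a named vector, and pass it to every plot. A minimal sketch, reusing the `data` frame from the first snippet (the hex values are arbitrary):

library(ggplot2)

# One named palette, defined once and shared by every chart
condition_colors <- c("normal" = "#1b9e77", "stress" = "#d95f02", "Nitrogen" = "#7570b3")

ggplot(data, aes(x = specie, y = value, fill = condition)) +
    geom_col() +
    scale_fill_manual(values = condition_colors)

ggplot(data, aes(x = condition, y = value, fill = condition)) +
    geom_col() +
    scale_fill_manual(values = condition_colors) # same colors, same meaning

Because the vector is named, each condition keeps its color even if a plot contains only a subset of the categories.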

Translate the data into clear language

Every data visualization can be simplified. But simplifying doesn’t have to mean removing elements – you can also simplify by adding information.

Even if a chart looks simple enough to you, you can make it clearer by providing additional context. The ggplot2 R package lets you easily add titles and subtitles, but that’s only the starting point.

Let’s take a look at the following visualization. It’s a sort of progress bar, showing how many users have registered for the company’s newsletter as of 2022/05/15 vs. how many registrations the company needs:

library(ggplot2)

registered_users <- 559
target_users <- 1500

ggplot() +
    geom_col(aes("", target_users), alpha = 0.5, color = "black") +
    geom_col(aes("", registered_users), fill = "#0199f8") +
    coord_flip() +
    ggtitle("Number of user registrations vs. the target")
Image 5 – A bad example of a progress bar

There are many things wrong with this visualization. For starters, it’s nearly impossible to read how many users have registered so far. We also don’t know the date through which the chart shows data, nor what the data really represents. We know these are registrations, but for what?

Always provide additional context! It makes the data interpretation clearer and more accessible. Here’s an example:

library(ggplot2)

registered_users <- 559
target_users <- 1500

format_text <- function(registered, target) {
    pct <- round((registered / target) * 100, 2)
    paste0(registered, "/", target, " users registered - ", pct, "%")
}

ggplot() +
    geom_col(aes("", target_users), alpha = 0.5, color = "black") +
    geom_col(aes("", registered_users), fill = "#0199f8") +
    geom_text(
        aes("", y = 65, label = format_text(registered_users, target_users)),
        hjust = 0,
        fontface = "bold",
        size = 10,
        color = "white"
    ) +
    coord_flip() +
    theme_minimal() +
    theme(
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()
    ) +
    labs(
        title = "Number of user registrations vs. the target",
        subtitle = paste0("As of 2022/05/15, ", registered_users, " users have registered for our company's weekly newsletter. Our target is ", target_users, " and we plan to achieve it by the end of the year.")
    )
Image 6 – A good example of a progress bar

Other A11Y guidelines

We’ve covered only a handful of A11Y guidelines here, but there’s an exhaustive checklist on the project’s website, and we encourage you to follow it.

In general, here are the guidelines you should remember about R Shiny accessibility. We gathered them from an RStudio Meetup on data visualization accessibility presented by two advocates and contributors to the R and Shiny community, Maya Gans and Mara Averick. We recommend you watch the video; you’ll even catch some of these points put into action in the presentations:

  • Don’t rely on color to explain the data. Use chart patterns as we explained earlier.
  • Don’t use very bright or low-contrast colors. It just looks bad and in some cases makes it impossible to read the text.
  • Don’t hide important data behind interactions. Hover events aren’t available on touch devices, and many users will access your dashboard from a smartphone (see the direct-labeling sketch after this list).
  • Don’t overwhelm users with information. Showing 500 charts on a single dashboard page is just awful.
  • Use accessibility tools when designing. Google Lighthouse and the A11Y project checklist are good places to start.
  • Use labels and legends. Otherwise, how will the user know what the data represents?
  • Translate the data into clear language. Every chart can be simplified. Make sure you have your target user in mind.
  • Provide context and explain the visualization. Annotate individual data points on a chart (e.g., all-time high, all-time low, dates).
  • Focus on accessibility during user interviews. Create narratives of people who will use the app. Person X in HR will use the dashboard differently than Person Y, the CFO.
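To illustrate the point about not hiding data behind interactions: values that would otherwise live only in a hover tooltip can be printed directly on the chart. A minimal sketch with made-up data:

library(ggplot2)

# Toy data; in an interactive chart these values might only appear on hover
sales <- data.frame(
    month = factor(month.abb[1:6], levels = month.abb[1:6]),
    revenue = c(120, 135, 128, 150, 170, 165)
)

ggplot(sales, aes(x = month, y = revenue, group = 1)) +
    geom_line() +
    geom_point() +
    geom_text(aes(label = revenue), vjust = -0.8) + # values readable without hovering
    ylim(0, 200)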

Now you know the theory, so next, we’ll see how accessibility translates to R Shiny.

A11Y and Shiny – Get Started with R Shiny Accessibility

Here’s the good news – you barely need to lift a finger regarding R Shiny accessibility, as someone has already done the heavy lifting for you. The shinya11y package is a huge time saver and comes with every A11Y guideline and checklist. The best part? It integrates seamlessly into your Shiny apps.

To start, let’s install the package:

devtools::install_github("ewenme/shinya11y")

You can launch the demo Shiny app from the R console to get the gist of it:

shinya11y::demo_tota11y()
Image 7 – Demo shinya11y app

The button on the bottom left corner of the app allows you to toggle various checklists. You can, for example, see if your app breaks one or more A11Y rules for accessibility. If a highlighted portion is red, you know there’s something you’ll need to address.

But the best part about the package is that you can easily integrate it into your own R Shiny apps.

For demonstration, we’ll reuse an R Shiny application from our Tools for Monitoring User Adoption article. The dashboard applies a clustering algorithm to the Iris dataset and lets you change the columns you want to see on a scatter plot:

library(shiny)

ui <- fluidPage(
    headerPanel("Iris k-means clustering"),
    sidebarLayout(
        sidebarPanel(
            selectInput(
                inputId = "xcol",
                label = "X Variable",
                choices = names(iris)
            ),
            selectInput(
                inputId = "ycol",
                label = "Y Variable",
                choices = names(iris),
                selected = names(iris)[[2]]
            ),
            numericInput(
                inputId = "clusters",
                label = "Cluster count",
                value = 3,
                min = 1,
                max = 9
            )
        ),
        mainPanel(
            plotOutput("plot1")
        )
    )
)

server <- function(input, output, session) {
    selectedData <- reactive({
        iris[, c(input$xcol, input$ycol)]
    })
    clusters <- reactive({
        kmeans(selectedData(), input$clusters)
    })
    output$plot1 <- renderPlot({
        palette(c(
            "#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
            "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"
        ))
        par(mar = c(5.1, 4.1, 0, 1))
        plot(selectedData(),
             col = clusters()$cluster,
             pch = 20, cex = 3
        )
        points(clusters()$centers, pch = 4, cex = 4, lwd = 4)
    })
}

shinyApp(ui = ui, server = server)
Image 8 – Clustering R Shiny application

To add shinya11y, you’ll need to modify two things:

  1. Import the shinya11y library.
  2. Add a call to use_tota11y() at the top of Shiny UI.

Here’s what the modified app looks like in code:

library(shiny)
library(shinya11y)

ui <- fluidPage(
    use_tota11y(),
    headerPanel("Iris k-means clustering"),
    sidebarLayout(
        sidebarPanel(
            selectInput(
                inputId = "xcol",
                label = "X Variable",
                choices = names(iris)
            ),
            selectInput(
                inputId = "ycol",
                label = "Y Variable",
                choices = names(iris),
                selected = names(iris)[[2]]
            ),
            numericInput(
                inputId = "clusters",
                label = "Cluster count",
                value = 3,
                min = 1,
                max = 9
            )
        ),
        mainPanel(
            plotOutput("plot1")
        )
    )
)

server <- function(input, output, session) {
    selectedData <- reactive({
        iris[, c(input$xcol, input$ycol)]
    })
    clusters <- reactive({
        kmeans(selectedData(), input$clusters)
    })
    output$plot1 <- renderPlot({
        palette(c(
            "#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
            "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"
        ))
        par(mar = c(5.1, 4.1, 0, 1))
        plot(selectedData(),
             col = clusters()$cluster,
             pch = 20, cex = 3
        )
        points(clusters()$centers, pch = 4, cex = 4, lwd = 4)
    })
}

shinyApp(ui = ui, server = server)

And here’s what it looks like when launched:

Image 9 – Clustering R Shiny application with accessibility tools

It really is that simple. You can now add these two lines to any existing Shiny app and see if there’s anything that needs improvement.


Summary of Web Accessibility Development for R Shiny Accessibility

If you’re just starting out, accessibility might be tough to wrap your head around. It’s easy to neglect some rules and guidelines if you don’t experience the same challenges others face. But if you go the extra mile, you’ll ensure that your Shiny app, data visualization, or presentation will be inclusive.

We believe building tech solutions should be no different than any other engineering project; it should be inviting, accessible to all, and provide high-quality service to every user.

Now it’s time for you to shine. For a homework assignment, pick up any of your R Shiny dashboards and plug in the shinya11y package. See what needs addressing and make sure to do so. If you’re ready, share your results with us on Twitter – @appsilon. We’d love to see how accessible Shiny can become.

Ready to take your data visualization skills to the next level? Here are 5 key data visualization principles you must know.



How to use %in% operator in R

[This article was first published on Data Analysis in R, and kindly contributed to R-bloggers.]


Want to know quickly whether a value is included in an R vector? You are probably looking for R’s %in% operator.

How to use %in% operator in R

The operator takes a single value and returns a logical TRUE/FALSE result indicating whether that value is present in the vector you’re examining. Below is a code sample.

%in% in R example

x<- c(11,22,33,44,55,63,12,44,32,27,12)
x
[1] 11 22 33 44 55 63 12 44 32 27 12

%in% in R example – value absent from the vector

31 %in% x
[1] FALSE

%in% in R example – value present in the vector

44 %in% x
[1] TRUE

%in% in R – using variables for the search value and vector

value <- 12
value %in% x
[1] TRUE

%in% in R applied to data frames

To determine whether a specific value is present in a column of an R data frame, the same operator can be used.

In the example that follows, we’ll examine some time stamps from the ChickWeight data set, one of R’s built-in data sets, which tracks how quickly chicks grow on various diets.

This analysis’s (fictitious) objective is to determine whether there are data points available for particular days.

head(ChickWeight)
   weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

%in% in R – value absent

5 %in% ChickWeight$Time
[1] FALSE

%in% in R – value present

8 %in% ChickWeight$Time
[1] TRUE

Comparing two vectors using the %in% operator

The %in% operator can also compare two vectors.

It returns a logical vector with the same number of elements as the left-hand vector, each element indicating whether the corresponding value is present in the right-hand vector (TRUE or FALSE).

This is a useful way to handle complicated comparisons quickly, where there may be several conditions on either side of the logic statement.

%in% in R example – comparing vectors

Check a vector of multiple values with the %in% operator

c(10, 11) %in% x
[1] FALSE  TRUE
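The same vector comparison drives row filtering. For instance, to keep only the ChickWeight rows measured on a handful of days, a sketch using base R’s subset():

days_of_interest <- c(0, 2, 4)

# Rows whose Time matches one of the days of interest
subset(ChickWeight, Time %in% days_of_interest)

# The negation: rows NOT measured on those days
subset(ChickWeight, !(Time %in% days_of_interest))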

Why the percent signs around “in”?

There are several options available here, as there are in most other areas of the R programming language. Programmers frequently need to determine whether a value is present in a vector, and we appreciate how elegantly simple and readable the %in% operator makes that check.

It is often said that software should be written not just for computers to process, but for people to read. The %in% operator realizes this ambition with its exquisite simplicity and ease of use.
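For the curious: %in% is defined in base R as a thin wrapper around match(), and the percent-sign syntax is open to everyone. Any function whose name is wrapped in percent signs becomes an infix operator, which is how you can roll your own, such as the popular “not in” check:

# Base R defines %in% as: "%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
# Defining a custom infix operator works the same way:
"%notin%" <- function(x, table) !(x %in% table)

31 %notin% c(11, 22, 33)
[1] TRUE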

You can find additional information here: Pipe Operator in R – Simplify Your Code with %>%.


Have you found this article interesting? We’d be glad if you could forward it to a friend or share it on Twitter or LinkedIn to help it spread.

If you are interested to learn more about data science, you can find more articles here finnstats.


Allocating Wealth Both Across Goals and Across Investments

[This article was first published on R – Franklin J. Parker, CFA, and kindly contributed to R-bloggers.]

This is the supplement to Chapter 3 of my book, Goals-Based Portfolio Theory, demonstrating the techniques and offering some code examples. If you are reading this having not purchased/read the book, you are missing much of the narrative. You can pick up a copy from Wiley, Amazon.com, or Barnes & Noble.

A fundamental idea in goals-based portfolio theory (though a relatively new one in the literature) is that investors must allocate a limited pool of wealth across many goals, and then allocate that wealth within each goal to a portfolio of investments. Each investor goal, then, will have a unique portfolio of investments.

But the challenge of allocating both within and across goals is not straightforward. In this post I will demonstrate how to accomplish this using R code.

First, we will need some common data to work from. To run this optimization, we will need the following CSV files, which I have reproduced here as tables (but you’ll need to create these CSVs and save them to your working directory).

The model is recursive, so the optimal allocation of wealth to investments within a goal depends on the allocation of wealth to the goal. But the optimal allocation of wealth to a goal depends on the optimal allocation of wealth within that goal.

To solve this recursion problem, we break the optimization into two steps. First, we determine the optimal allocation of investments within a goal at discrete levels of across-goal wealth allocation. In this example, we’ll use 1%-point intervals (an optimal portfolio for a 1% allocation of wealth, 2%, 3%, and so on, up to 100%). Second, we will use a simple Monte Carlo method to align the across-goal allocations with the within-goal allocations.

Let’s begin by loading our libraries and building the functions we need.

# Load Dependencies ====================================================
library(ggplot2)
library(RColorBrewer)
library(Rsolnp)
library(nloptr)

# Define Functions =====================================================
# This function will take proposed portfolio weights, forecast volatilities,
# and forecast covariance and return the forecast portfolio volatility.
sd.f = function(weight_vector, covar_table){
  covar_vector = 0
  for(z in 1:length(weight_vector)){
    covar_vector[z] = sum(weight_vector * covar_table[,z])
  }
  return( sqrt( sum( weight_vector * covar_vector) ) )
}

# This function will return the expected portfolio return, given the
# forecasted returns and proposed portfolio weights
mean.f = function(weight_vector, return_vector){
  return( sum( weight_vector * return_vector ) )
}

# This function will return the probability of goal achievement, given
# the goal variables, allocation to the goal, expected return of the
# portfolio, and expected volatility of the portfolio
phi.f = function(goal_vector, goal_allocation, pool, mean, sd){
  required_return = (goal_vector[2]/(pool * goal_allocation))^(1/goal_vector[3]) - 1
  if( goal_allocation * pool >= goal_vector[2]){
    return(1)
  } else {
    return( 1 - pnorm( required_return, mean, sd, lower.tail=TRUE ) )
  }
}

# For use in the optimization function later, this is failure probability,
# which we want to minimize.
optim_function = function(weights){
  1 - phi.f(goal_vector, allocation, pool,
            mean.f(weights, return_vector),
            sd.f(weights, covar_table) ) +
    100 * (sum(weights) - 1)^2
}

# For use in the optimization function later, this allows the portfolio
# weights to sum to 1.
constraint_function = function(weights){
  sum(weights)
}

# For use in mean-variance optimization.
mvu.f = function(weights){
  -(mean.f(weights, return_vector) - 0.5 * gamma * sd.f(weights, covariances)^2)
}

# Required return function
r_req.f = function(goal_vector, goal_allocation, pool){
  (goal_vector[2]/(goal_allocation * pool))^(1/goal_vector[3]) - 1
}

Next, let’s load our relevant data sets:

# Load & Parse Data ====================================================n_trials = 10^5 # number of trials to run in MC simulation# Need to set the directories to the location where you save the files.# .:.goal_data_raw = read.csv( "~/Example Goal Details.csv")capital_market_expectations_raw = read.csv( "~/Capital Market Expectations.csv")correlations_raw = read.csv( "~/Correlations - Kitchen Sink.csv")# Record number of potential investmentsnum_assets = length(capital_market_expectations_raw[,2])# Record number of goalsnum_goals = ncol(goal_data_raw) - 1# Create vector of expected returnsreturn_vector = capital_market_expectations_raw[,2]# Change correlation table to just numberscorrelations = data.frame( correlations_raw[1:15, 2:16] )# Build a covariance table by merging forecast vol with forecast correlations# This is an empty matrix to fill with covariancescovariances = matrix(nrow=num_assets, ncol=num_assets)# Iterate through rows and columns to fill covariance matrixfor(i in 1:num_assets){ # columns  for(j in 1:num_assets){ # rows    covariances[j,i] = capital_market_expectations_raw[i,3] *       capital_market_expectations_raw[j,3] * correlations[j,i]  }}# Pull raw goal data and parse into individual goals. Put goal details# into a vector of the form (value ratio, funding requirement, time horizon)goal_A = c(goal_data_raw[1,2], goal_data_raw[2,2], goal_data_raw[3,2])goal_B = c(goal_data_raw[1,3], goal_data_raw[2,3], goal_data_raw[3,3])goal_C = c(goal_data_raw[1,4], goal_data_raw[2,4], goal_data_raw[3,4])goal_D = c(goal_data_raw[1,5], goal_data_raw[2,5], goal_data_raw[3,5])pool = 4654000 # Total pool of wealth

And now we are ready to run our optimization algorithm! As mentioned, we start by finding the optimal investment weights for each level of across-goal allocation. To accomplish this I am using the Rsolnp package’s solnp() function. These optimal allocations are logged to a matrix (a different matrix for each goal) where every row represents a level of across-goal allocation and each column is a potential investment.

Here it is in action:

# STEP 1: Optimal Within-Goal Allocation =============================================
# The first step is to vary the goal allocation and find the optimal
# investment portfolio and its characteristics for each level of across-goal
# allocation. This uses a non-linear optimization engine to find optimal portfolios.

# Start by enumerating the various possible across-goal allocations, (0% to 100%]
goal_allocation = seq(0.01, 1, 0.01)
starting_weights = runif(num_assets,0,1) # Weight seeds to kick-off optim
starting_weights = starting_weights/sum(starting_weights) # Ensure they sum to 1.

# Iterate through each potential goal allocation and find the investment
# weights that deliver the highest probability of goal achievement. Those
# weights will be specific to each goal, so we will log them into a matrix,
# where each row corresponds to a potential goal allocation.
optimal_weights_A = matrix(nrow=length(goal_allocation), ncol=num_assets)
optimal_weights_B = matrix(nrow=length(goal_allocation), ncol=num_assets)
optimal_weights_C = matrix(nrow=length(goal_allocation), ncol=num_assets)
optimal_weights_D = matrix(nrow=length(goal_allocation), ncol=num_assets)

for(i in 1:length(goal_allocation)){

  # Use nonlinear optimization function, with constraints
  allocation = goal_allocation[i]
  covar_table = covariances

  # Goal A Optimization
  goal_vector = goal_A

  if( goal_A[2] <= pool * goal_allocation[i] ){
    # If the allocation is enough to fully-fund the goal, force the allocation to all cash.
    optimal_weights_A[i,] = c( rep(0, num_assets-1 ), 1 )
  } else {
    # Otherwise optimize as normal.
    result = solnp( starting_weights, # Starting weight values - these are random
                    optim_function, # Function to minimize - min prob of failure
                    eqfun = constraint_function, # subject to the constraint function
                    eqB = 1, # the constraint function must equal 1
                    LB = rep(0, num_assets), # lower bound values of 0
                    UB = rep(1, num_assets) ) # upper bound values of 1
    optimal_weights_A[i,] = result$pars # Log result
  }

  # Goal B Optimization, same pattern as Goal A.
  goal_vector = goal_B

  if( goal_B[2] <= pool * goal_allocation[i] ){
    optimal_weights_B[i,] = c( rep(0, num_assets-1), 1 )
  } else {
    result = solnp( starting_weights,
                    optim_function,
                    eqfun = constraint_function,
                    eqB = 1,
                    LB = rep(0, num_assets),
                    UB = rep(1, num_assets) )
    optimal_weights_B[i,] = result$pars
  }

  # Goal C Optimization
  goal_vector = goal_C

  if( goal_C[2] <= pool * goal_allocation[i] ){
    optimal_weights_C[i,] = c( rep(0, num_assets-1), 1 )
  } else {
    result = solnp( starting_weights,
                    optim_function,
                    eqfun = constraint_function,
                    eqB = 1,
                    LB = rep(0, num_assets),
                    UB = rep(1, num_assets) )
    optimal_weights_C[i,] = result$pars
  }

  # Goal D Optimization
  goal_vector = goal_D

  if( goal_D[2] <= pool * goal_allocation[i] ){
    optimal_weights_D[i,] = c( rep(0, num_assets-1), 1 )
  } else {
    result = solnp( starting_weights,
                    optim_function,
                    eqfun = constraint_function,
                    eqB = 1,
                    LB = rep(0, num_assets),
                    UB = rep(1, num_assets) )
    optimal_weights_D[i,] = result$pars
  }
}

For each goal, we need to find what each level of across-goal allocation yields in goal achievement probability.

# Using the optimal weights for each level of goal allocation, we will
# log the best phis for each level of goal allocation. This will be
# used in the next step to help determine utility.
phi_A = 0
phi_B = 0
phi_C = 0
phi_D = 0

for(i in 1:length(goal_allocation)){
  phi_A[i] = phi.f( goal_A, goal_allocation[i], pool,
                    mean.f(optimal_weights_A[i,], return_vector),
                    sd.f(optimal_weights_A[i,], covariances) )

  phi_B[i] = phi.f( goal_B, goal_allocation[i], pool,
                    mean.f(optimal_weights_B[i,], return_vector),
                    sd.f(optimal_weights_B[i,], covariances) )

  phi_C[i] = phi.f( goal_C, goal_allocation[i], pool,
                    mean.f(optimal_weights_C[i,], return_vector),
                    sd.f(optimal_weights_C[i,], covariances) )

  phi_D[i] = phi.f( goal_D, goal_allocation[i], pool,
                    mean.f(optimal_weights_D[i,], return_vector),
                    sd.f(optimal_weights_D[i,], covariances) )
}

And, now, we can go about finding the optimal across-goal allocation by plugging the achievement probabilities into the goals-based utility function.

# STEP 2: Optimal Across-Goal Allocation =======================================
# Now that we have the characteristics of the within-goal allocations, we can
# use them to find the best across-goal allocation.
# Begin by building a matrix of simulated goal weights, then return the utility
# for each simulated portfolio.
sim_goal_weights = matrix(ncol=num_goals, nrow=n_trials)

for(i in 1:n_trials){
  rand_vector = runif(num_goals, 0, 1)
  normalizer = sum(rand_vector)
  # Since you cannot have an allocation to a goal of 0, this ensures that the
  # minimum allocation is 1.
  sim_goal_weights[i,] = ifelse( round( (rand_vector/normalizer)*100, 0 ) < 1,
                                 1,
                                 round( (rand_vector/normalizer)*100 ) )
}

# Find the utility of each trial.
utility = goal_A[1] * phi_A[ sim_goal_weights[,1] ] +
  goal_A[1] * goal_B[1] * phi_B[ sim_goal_weights[,2] ] +
  goal_A[1] * goal_B[1] * goal_C[1] * phi_C[ sim_goal_weights[,3] ] +
  goal_A[1] * goal_B[1] * goal_C[1] * goal_D[1] * phi_D[ sim_goal_weights[,4] ]

# Which simulated portfolio delivered the highest utility
index = which( utility == max(utility) )

# Optimal goal weights
optimal_goal_weights = sim_goal_weights[index,]

We now have our optimal wealth allocation—both across our goals and within each goal! Let’s see our results and generate some visualizations.

# Step 3: Return Optimal Subportfolios & Optimal Aggregate Portfolio ===========
# Optimal subportfolio allocations
optimal_subportfolios = matrix( nrow=num_goals, ncol=num_assets )
goals = c("A", "B", "C", "D")

for(i in 1:num_goals){
  optimal_subportfolios[i,] =
    get( paste("optimal_weights_", goals[i], sep="") )[ optimal_goal_weights[i], ]
}
rownames(optimal_subportfolios) = goals

# Optimal Aggregate Investment Portfolio
optimal_aggregate_portfolio = 0
for(i in 1:num_assets){
  optimal_aggregate_portfolio[i] = sum((optimal_goal_weights/100) *
                                         optimal_subportfolios[,i])
}

# Visualize Results ============================================================
# Plot allocation as a function of subportfolio allocation, Goal A
# Data_viz matrix will be long-form.
optimal_weights_A[14:30,12] <- 0 # correct for unstable optim
optimal_weights_A[14:26,13] <- 1.00 # same

asset_names = as.character(capital_market_expectations_raw[,1])

data_viz_1 = data.frame( "Weight" = optimal_weights_A[,1],
                         "Asset.Name" = rep(asset_names[1], length(optimal_weights_A[,1])),
                         "Theta" = seq(1, 100, 1) )
for(i in 2:num_assets){
  data = data.frame( "Weight" = optimal_weights_A[,i],
                     "Asset.Name" = rep(asset_names[i], length(optimal_weights_A[,i])),
                     "Theta" = seq(1, 100, 1) )
  data_viz_1 = rbind(data_viz_1, data)
}

# Visualize Goal A's subportfolio allocation as a function of Goal A's across-goal
# allocation.
ggplot( data_viz_1, aes( x=Theta, y=Weight, fill=Asset.Name) )+
  geom_area( linetype=1, size=0.5, color="black" )+
  annotate('text', x = 8, y = 0.25, label = 'Angel Venture')+
  annotate('text', x = 35, y = 0.35, label = 'Venture Capital')+
  annotate('text', x = 45, y = 0.85, label = 'Private Equity')+
  annotate('text', x = 70, y = 0.80, label = 'Small Cap')+
  annotate('text', x = 71, y = 0.34, label = 'Gold')+
  annotate('text', x = 90, y = 0.60, label = 'US Agg Bond')+
  annotate('text', x = 92, y = 0.22, label = 'US Treasury')+
  xlab("Goal Allocation")+
  ylab("Investment Weight")+
  labs(fill = "Asset" )+
  theme_minimal()+
  theme( axis.text = element_text(size=14),
         legend.text = element_text(size=14),
         legend.title = element_text(size=16, face="bold"),
         axis.title = element_text(size=16, face="bold") )

# Print the optimal across-goal allocation
optimal_goal_weights

# Print the optimal aggregate investment allocation
optimal_aggregate_portfolio

Which yields the following plot.

Optimal Investment Portfolio for each level of across-goal allocation to Goal A, given the inputs.

This is the optimal investment allocation for each level of across-goal allocation made to Goal A. Notice how the lottery-like investments take prominence for lower levels of across-goal allocation, and a more traditional portfolio takes over for higher levels of across-goal allocation. This is all different than the mean-variance approach, though I won’t repeat the narrative from the book here.

With Mean-Variance Constraints

Many investors, however, are mean-variance constrained (that is, they want to stay on the efficient frontier). We can use the across-goal allocation method from goals-based portfolio theory but make some adjustments to ensure that we are always on the mean-variance efficient frontier.

Continuing with our code from above (the following code assumes you have run all the code in the previous section), we begin by setting up the efficient frontier and finding the last point on it (we will need to ensure we don’t pass that last point later).

# With Mean-Variance Constraints ===============================================
# We have assumed the no short-sale and no-leverage constraints, as this is
# a common constraint on goals-based investors.
g = seq(60, 1, -1) # vary gamma to build frontier
m = 0 # list to store resultant optimal returns
s = 0 # list to store resultant optimal standard devs
covar_table = covariances
starting_weights = runif(num_assets,0,1) # Weight seeds to kick-off optim
starting_weights = starting_weights/sum(starting_weights) # Ensure they sum to 1.
optimal_weights_mv = data.frame( matrix(nrow=1, ncol=num_assets) )

# Iterate through each level of gamma to find optimal portfolio weights
for(i in 1:length(g)){
  gamma = g[i]
  result = solnp( starting_weights,
                  mvu.f,
                  eqfun = constraint_function,
                  eqB = 1,
                  LB = rep(0, num_assets),
                  UB = rep(1, num_assets) )

  optimal_weights_mv = rbind(optimal_weights_mv, result$pars)
  m[i] = mean.f(result$pars, return_vector)
  s[i] = sd.f(result$pars, covariances)
}

# Visualize MV Efficient Frontier
mv_frontier_data = data.frame( "Volatility" = s,
                               "Return" = m )
ggplot( mv_frontier_data, aes(x = Volatility, y = Return ) )+
  geom_line( size=2, col="dodgerblue" )+
  theme( axis.text = element_text(size=14),
         legend.text = element_text(size=14),
         legend.title = element_text(size=16, face="bold"),
         axis.title = element_text(size=16, face="bold") )

# Find last point on the mean-variance efficient frontier. For portfolios with
# less required return than that on offer by the maximum on the frontier, the
# optimal weights are synonymous with mean-variance optimization; otherwise
# the optimization is probability maximization. See Parker (20XX) for discussion.
gamma = 0.01
result = solnp( starting_weights,
                mvu.f,
                eqfun = constraint_function,
                eqB = 1,
                LB = rep(0, num_assets),
                UB = rep(1, num_assets) )
last_weights = result$pars
last_m = mean.f(result$pars, return_vector)
optimal_weights_mv = rbind(optimal_weights_mv, last_weights)
m[length(g)+1] = mean.f(last_weights, return_vector)
s[length(g)+1] = sd.f(last_weights, covariances)
last_index = nrow(optimal_weights_mv) # row number of last portfolio on
                                      # mv efficient frontier

# Convert means and volatilities to phis, this sets up lists to hold them
mv_phi_A = 0
mv_phi_B = 0
mv_phi_C = 0
mv_phi_D = 0

# Store the optimal weights into data frames
optimal_mv_weights_A = data.frame(matrix(nrow=length(goal_allocation), ncol=num_assets))
optimal_mv_weights_B = data.frame(matrix(nrow=length(goal_allocation), ncol=num_assets))
optimal_mv_weights_C = data.frame(matrix(nrow=length(goal_allocation), ncol=num_assets))
optimal_mv_weights_D = data.frame(matrix(nrow=length(goal_allocation), ncol=num_assets))

Then we apply a similar process to the one in the previous section—optimizing the within-goal investment portfolio for each level of across-goal allocation.

# STEP 1: Optimal Within-Goal Allocation ===============================================
for(i in 1:length(goal_allocation)){
  # Iterate through each level of goal allocation:
  #
  # If the required return is greater than the largest return on
  # offer by the MV frontier, then the allocation maintains exposure
  # to the endpoint of the frontier.
  #
  # If the required return is less than the largest return on offer
  # by the MV frontier, then probability maximization is synonymous
  # with mean-variance maximization. See Parker (20XX) for discussion.

  # For Goal A
  if(r_req.f(goal_A, goal_allocation[i], pool) > last_m){

    optimal_mv_weights_A[i,] = last_weights
    mv_phi_A[i] = phi.f(goal_A, goal_allocation[i], pool,
                        mean.f(optimal_weights_mv[last_index,], return_vector),
                        sd.f(optimal_weights_mv[last_index,], covariances) )

  } else {

    optimal_mv_weights_A[i,] = optimal_weights_A[i,]
    mv_phi_A[i] = phi.f(goal_A, goal_allocation[i], pool,
                        mean.f(optimal_weights_A[i,], return_vector),
                        sd.f(optimal_weights_A[i,], covariances) )
  }

  # Goal B
  if(r_req.f(goal_B, goal_allocation[i], pool) > last_m){

    optimal_mv_weights_B[i,] = last_weights
    mv_phi_B[i] = phi.f(goal_B, goal_allocation[i], pool,
                        mean.f(optimal_weights_mv[last_index,], return_vector),
                        sd.f(optimal_weights_mv[last_index,], covariances) )

  } else {

    optimal_mv_weights_B[i,] = optimal_weights_B[i,]
    mv_phi_B[i] = phi.f(goal_B, goal_allocation[i], pool,
                        mean.f(optimal_weights_B[i,], return_vector),
                        sd.f(optimal_weights_B[i,], covariances) )
  }

  # Goal C
  if(r_req.f(goal_C, goal_allocation[i], pool) > last_m){

    optimal_mv_weights_C[i,] = last_weights
    mv_phi_C[i] = phi.f(goal_C, goal_allocation[i], pool,
                        mean.f(optimal_weights_mv[last_index,], return_vector),
                        sd.f(optimal_weights_mv[last_index,], covariances) )

  } else {

    optimal_mv_weights_C[i,] = optimal_weights_C[i,]
    mv_phi_C[i] = phi.f(goal_C, goal_allocation[i], pool,
                        mean.f(optimal_weights_C[i,], return_vector),
                        sd.f(optimal_weights_C[i,], covariances) )
  }

  # Goal D
  if(r_req.f(goal_D, goal_allocation[i], pool) > last_m){

    optimal_mv_weights_D[i,1:num_assets] = last_weights
    mv_phi_D[i] = phi.f(goal_D, goal_allocation[i], pool,
                        mean.f(optimal_weights_mv[last_index,], return_vector),
                        sd.f(optimal_weights_mv[last_index,], covariances) )

  } else {

    optimal_mv_weights_D[i,] = optimal_weights_D[i,]
    mv_phi_D[i] = phi.f(goal_D, goal_allocation[i], pool,
                        mean.f(optimal_weights_D[i,], return_vector),
                        sd.f(optimal_weights_D[i,], covariances) )
  }
}

Note the if-else statement in the loop. This tests the result to see if the required return of the portfolio falls outside of the efficient frontier—if it does, we maintain exposure to the last portfolio on the frontier. If it does not, then we proceed with the optimization (the justification for this is discussed in the book).

Next, we proceed with the optimal across-goal allocation using the goals-based utility function, and we can visualize the results.

# STEP 2: Optimal Across-Goal Allocation ==========================================
# Using the Monte Carlo trials from the previous section, return the utility for
# each trial of simulated goal weights
utility_mv = goal_A[1] * mv_phi_A[ sim_goal_weights[,1] ] +
  goal_A[1] * goal_B[1] * mv_phi_B[ sim_goal_weights[,2] ] +
  goal_A[1] * goal_B[1] * goal_C[1] * mv_phi_C[ sim_goal_weights[,3] ] +
  goal_A[1] * goal_B[1] * goal_C[1] * goal_D[1] * mv_phi_D[ sim_goal_weights[,4] ]

index_mv = which( utility_mv == max(utility_mv) )

# Optimal across-goal allocation for MV constraints; print the optimal goal
# weights given mean-variance constraints
(optimal_goal_weights_mv = sim_goal_weights[index_mv, ])

# Visualize MV Results ============================================================
# for Goal A
data_viz_2 = data.frame( "Weight" = optimal_mv_weights_A[,1],
                         "Asset" = rep(asset_names[1], length(optimal_mv_weights_A[,1])),
                         "Theta" = seq(1, 100, 1) )
for(i in 2:num_assets){
  data = data.frame( "Weight" = optimal_mv_weights_A[,i],
                     "Asset" = rep(asset_names[i], length(optimal_mv_weights_A[,i])),
                     "Theta" = seq(1, 100, 1) )
  data_viz_2 = rbind(data_viz_2, data)
}

ggplot(data_viz_2, aes(x=Theta, y=Weight, fill=Asset) )+
  geom_area( linetype=1, size=0.5, color="black" )+
  xlab("Goal Allocation")+
  ylab("Investment Weight")+
  labs(fill = "Asset" )+
  theme( axis.text = element_text(size=14),
         legend.text = element_text(size=14),
         legend.title = element_text(size=16, face="bold"),
         axis.title = element_text(size=16, face="bold") )

# Compare probability of achievement for the goals-based approach and the
# mean-variance approach, for Goal A
data_viz_3 = data.frame( "Theta" = rep( seq(1, 100, 1), 2),
                         "Phi" = c( phi_A, mv_phi_A ),
                         "Name" = c( rep( "Goals-Based", 100), rep( "Mean-Variance", 100 ) ) )

ggplot( data_viz_3, aes(x=Theta, y=Phi, lty=Name) )+
  geom_line( size=1.5 )+
  xlab("Goal Allocation, %")+
  ylab("Probability of Achievement")+
  labs(color = "")+
  theme_minimal()+
  theme( axis.text = element_text(size=14),
         legend.text = element_text(size=14),
         legend.title = element_blank(),
         legend.position = 'top',
         axis.title = element_text(size=16, face="bold") )

Running the code, you will notice that the lottery-like “Angel Venture” asset is eliminated for mean-variance-constrained investors. This, of course, results in a lower probability of goal achievement for those lower levels of wealth allocation.

In any event, this is an example of how to optimize a goals-based portfolio—both within and across goals.


Robin Donatello Talks About Growing an R Community at a State University

[This article was first published on R Consortium, and kindly contributed to R-bloggers.]

Growing a user base for R at a university can be challenging at the best of times, especially when dealing with the silos that are present in the university system. Robin Donatello with the Chico R Users Group talks about how this issue became both easier and harder to deal with due to the pandemic.

Robin Donatello is an Associate Professor of Statistics and Data Science at California State University, Chico. Robin has also helped host the ASA DataFest, a data analysis competition in which undergraduate students from various majors get to work in teams on large, complex, real-world data.


What is the R community like at CSU Chico?

RD: It is struggling to have cohesion right now. For the most part, I have been a one-person show, but recent hires in Statistics have brought some new energy to the University and additional interest in building a data science community. We have about 20 faculty that use R in about 40 classes, and about 100 or so students each semester being exposed to the language, but for the most part, everyone is doing their own thing.

I currently have a USDA HSI Education grant where I was able to fund 3 other faculty to become Certified Carpentry instructors. The goal is to increase the pool of faculty who want to teach these skills outside of the traditional class so that we can feel like we have a community.  A lot of faculty are still burnt out right now and aren’t interested in taking on any additional work. As a teaching institution, our primary responsibility is in-class teaching and support. I’m hoping that soon we’ll get something other than an ASA Datafest. We are actively working on a new Masters program in Data Science and Analytics expected to roll out in Fall 23. We envision teaching both in R and Python, and I look forward to exploring how the two languages complement each other.

How has COVID affected your ability to connect with members?

RD: One of the things that we had going pretty well before the pandemic was a thing called Community Coding. This was started based on the idea behind UC Davis’ “Meet and Analyze Data”. Faculty, staff, and students can come and do work, and people would be there to help. The hard part was scheduling a room: trying to have a centralized room on campus, and then trying to have a faculty member or student be there at that time. We had 10-15 students, each coming in about twice a week for direct help; they knew someone would be there since it would be like drop-in tutoring. When we switched to online in 2020, they all were gone. No one showed up. From 2021-22 it has gotten a little better. It’s still sparse, but people are getting used to virtual now. Getting help on Zoom is less of a barrier than it used to be. I am seeing more non-traditional students and students not in my class drop in and ask questions. The virtual nature has allowed us to expand the help across campus because faculty are more open to doing Zoom hours than to going across campus and sitting in a room. We were able to offer 14 hours/week of help in Spring 21, but the support was still very underutilized. Holding Community Coding virtually also allows us to meet people in the larger R community, not just at our institution.

We also had to cancel our ASA DataFest in 2020 and 2021 due to Covid, which is typically a good networking and community-building event. 

In the past year, did you have to change your techniques to connect and collaborate with members?  For example, did you use GitHub, video conferencing, online discussion groups more?  Can these techniques be used to make your group more inclusive to people that are unable to attend physical events in the future?  

RD: We will do this in the future. We have a lot of students who are working as well, so we will keep these events online. We have graduate students who can’t attend during the day, and we have professional students and others who just can’t participate during the day for one reason or another. There will always be some online component, at least for the Community Coding events. The Carpentry workshops are often offered in a hybrid format, allowing for both online and in-person audiences.

Can you tell us about one recent presentation or speaker that was especially interesting and what was the topic and why was it so interesting? 

RD: We haven’t had a presentation recently. We are discussing ways to bring back regular meetups that can re-engage our community.

What trends do you see in R language affecting your organization over the next year?

RD: The only thing that I can think of is that the growth of the tidyverse has made teaching students in the applied sciences easier. It helps, but sometimes knowing base R is more helpful for the advanced stuff. The typical student, however, will probably only use it in their analysis and some classes, and will not become an advanced R user. It has made my life as a teacher easier, and I’d like to see that continue.

Do you know of any data journalism efforts by your members?  If not, are there particular data journalism projects that you’ve seen in the last year that you feel had a positive impact on society?

RD: There is an adjunct professor, a certified RStudio and Carpentry instructor, who is teaching a data journalism class. It started a few years ago with an older professor; when that professor left, it stopped being taught. When the current professor came back, he took up the class again. However, he’s an adjunct, so he may or may not stay. It was offered for the first time this semester with no advertisement, so enrollment was low. Advertising and getting the word out about these things is a constant battle for us. I’d like to see this class get into a major, like journalism or data science. I will add it as an elective to data science and as a class in the data analytics minor.

Of the Funded Projects by the R Consortium, do you have a favorite project? Why is it your favorite?

RD: Data Carpentry because building capacity at the faculty level is the only way that we will build a community and get it out to more students.

Of the Active Working Groups, which is your favorite? Why is it your favorite?

RD: I wasn’t aware of the groups ahead of time, mostly due to my schedule. If I had to pick, it would be the one on diversity and inclusion, mostly because we are a Hispanic-serving institution. Over 40% of our students identify as Hispanic in origin, and the faculty at Chico does not reflect the same diversity. The main aim of my current grant is to empower traditionally underserved populations to engage in research and data analysis using R and to support their growth in scientific fields.

Four projects are R Consortium Top-Level Projects. If you could add another project to this list for guaranteed funding for 3 years and a voting seat on the ISC, which project would you add?


RD: I would like to see a service similar to DataCamp created. That was an amazing tool that I will no longer use due to their corporate behavior. The platform was so helpful for teaching students. If someone were to make something similar that was community-driven rather than profit-driven, that would be amazing and super helpful for instructors. Even enhancing the learnr and gradethis packages to offer more tools and be easier to install and use for new learners would help. A group designed to make it easier for teachers to teach R is what I would like to see.

When is your next event? Please give details!

ASA DataFest in Spring 2022 returned as a success. We had 28 students from two universities attend, which is not much lower than our pre-pandemic number of 35. A common theme students mentioned was that they wished they had more experience with R before going into the weekend. This is an opportunity for us to do more pre-event activities and meetups in Spring 23.

Our next Data Carpentry workshop will be in August, and again in January. We try to offer about two each year, but participation has still been pretty low compared to pre-COVID levels.


How do I Join?

R Consortium’s R User Group and Small Conference Support Program (RUGS) provides grants to help R groups around the world organize, share information and support each other. We have given grants over the past four years, encompassing over 65,000 members in 35 countries. We would like to include you! Cash grants and meetup.com accounts are awarded based on the intended use of the funds and the amount of money available to distribute. We are now accepting applications!

The post Robin Donatello Talks About Growing an R Community at a State University appeared first on R Consortium.


How to Use Gather Function in R?-tidyr Part2

[This article was first published on Data Science Tutorials, and kindly contributed to R-bloggers].

The post How to Use Gather Function in R?-tidyr Part2 appeared first on Data Science Tutorials

To “gather” a key-value pair across multiple columns, use the gather() function from the tidyr package.

The basic syntax of this function is as follows:

gather(data, key, value, ...)

where:

data: Name of the data frame

key: Name of the key column to create

value: Name of the newly created value column

...: The columns to gather data from

How to Use Gather Function in R

The practical application of this function is demonstrated in the examples that follow.

dplyr Techniques and Tips – Data Science Tutorials

Example 1: Gather Values From Two Columns

Let’s say we have the R data frame shown below:

# make a data frame

df <- data.frame(player=c('P1', 'P2', 'P3', 'P4'),
year1=c(212, 215, 319, 129),
year2=c(232, 229, 158, 122))

Let’s view the data frame

df
  player year1 year2
1     P1   212   232
2     P2   215   229
3     P3   319   158
4     P4   129   122

The gather() function can be used to collect the data from columns 2 and 3 into two new columns named “year” and “points”:

library(tidyr)

# gather data from columns 2 and 3

gather(df, key="year", value="points", 2:3)
   player  year points
1     P1 year1    212
2     P2 year1    215
3     P3 year1    319
4     P4 year1    129
5     P1 year2    232
6     P2 year2    229
7     P3 year2    158
8     P4 year2    122

Example 2: Collect Information From More Than Two Columns

Let’s say we have the R data frame shown below:

Let’s create a data frame

df2 <- data.frame(player=c('P1', 'P2', 'P3', 'P4'),
                  year1=c(213, 112, 112, 114),
                  year2=c(122, 229, 215, 211),
                  year3=c(123, 122, 122, 129))

Now we can view the data frame

df2
  player year1 year2 year3
1     P1   213   122   123
2     P2   112   229   122
3     P3   112   215   122
4     P4   114   211   129

The data from columns 2, 3, and 4 can be “gathered” into two new columns called “year” and “points” using the gather() function as follows:

library(tidyr)

# gather data from columns 2, 3, and 4

gather(df2, key="year", value="points", 2:4)
    player  year points
1      P1 year1    213
2      P2 year1    112
3      P3 year1    112
4      P4 year1    114
5      P1 year2    122
6      P2 year2    229
7      P3 year2    215
8      P4 year2    211
9      P1 year3    123
10     P2 year3    122
11     P3 year3    122
12     P4 year3    129
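
Note that since tidyr 1.0.0, gather() has been superseded by pivot_longer(), which performs the same reshape with more descriptive argument names. A minimal equivalent of the example above:

library(tidyr)

# same reshape as gather(df2, key = "year", value = "points", 2:4)
pivot_longer(df2, cols = year1:year3, names_to = "year", values_to = "points")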

Best Books to learn Tensorflow – Data Science Tutorials

Have you liked this article? If you could email it to a friend or share it on Facebook, Twitter, or LinkedIn, I would be eternally grateful.

Please use the like buttons below to show your support. Please remember to share and comment below. 

The post How to Use Gather Function in R?-tidyr Part2 appeared first on Data Science Tutorials


simstudy updated to version 0.5.0

[This article was first published on ouR data generation, and kindly contributed to R-bloggers].

A new version of simstudy is available on CRAN. There are two major enhancements and several new features. In the “major” category, I would include (1) changes to survival data generation that accommodate hazard ratios that can change over time, as well as competing risks, and (2) the addition of functions that allow users to sample from existing data sets with replacement to generate “synthetic” data with real-life distribution properties. Other less monumental, but important, changes were made: updates to the functions genFormula and genMarkov, and two added utility functions, survGetParams and survParamPlot. (I did describe the survival data generation functions in two recent posts, here and here.)

Here are the highlights of the major enhancements:

Non-proportional hazards

If we want to simulate a scenario where survival time is a function of sex and the relative risk of death (comparing males to females) changes after 150 days, we cannot use the proportional hazards assumption that simstudy has typically assumed. Rather, we need to be able to specify different hazards at different time points. This is now implemented in simstudy by using the defSurv function and the transition argument.

In this case, the same outcome variable “death” is specified multiple times (currently the limit is actually two times) in defSurv, and the transition argument indicates the point at which the hazard ratio (HR) changes. In the example below, the log(HR) comparing males and females between day 0 and 150 is -1.4 (HR = 0.25), and after 150 days the hazards are more closely aligned, log(HR) = -0.3 (HR = 0.74). The data definitions determine the proportion of males in the sample and specify the time to death outcomes:

library(simstudy)
library(survival)
library(gtsummary)

def <- defData(varname = "male", formula = 0.5, dist = "binary")
defS <- defSurv(varname = "death", formula = "-14.6 - 1.4 * male",
  shape = 0.35, transition = 0)
defS <- defSurv(defS, varname = "death", formula = "-14.6 - 0.3 * male",
  shape = 0.35, transition = 150)

If we generate the data and take a look at the survival curves, it is possible to see a slight inflection point at 150 days where the HR shifts:

set.seed(10)
dd <- genData(600, def)
dd <- genSurv(dd, defS, digits = 2)

If we fit a standard Cox proportional hazard model and test the proportionality assumption, it is quite clear that the assumption is violated (as the p-value < 0.05):

coxfit <- coxph(formula = Surv(death) ~ male, data = dd)
cox.zph(coxfit)

##        chisq df       p
## male    12.5  1 0.00042
## GLOBAL  12.5  1 0.00042

If we split the data at the proper inflection point of 150 days, and refit the model, we can recover the parameters (or at least get pretty close):

dd2 <- survSplit(Surv(death) ~ ., data = dd, cut = c(150),
                 episode = "tgroup", id = "newid")
coxfit2 <- coxph(Surv(tstart, death, event) ~ male:strata(tgroup), data = dd2)
tbl_regression(coxfit2)
Characteristic          log(HR)   95% CI         p-value
male * strata(tgroup)
  male * tgroup=1       -1.3      -1.6, -1.0     <0.001
  male * tgroup=2       -0.51     -0.72, -0.29   <0.001

HR = Hazard Ratio, CI = Confidence Interval

Competing risks

A new function addCompRisk generates a single time to event outcome from a collection of time to event outcomes, where the observed outcome is the earliest event time. This can be accomplished by specifying a timeName argument that will represent the observed time value. The event status is indicated in the field set by the eventName argument (which defaults to “event”). And if a variable name is indicated using the censorName argument, the censored events automatically have a value of 0.

To use addCompRisk, we first define and generate unique events - in this case event_1, event_2, and censor:

set.seed(1)
dS <- defSurv(varname = "event_1", formula = "-10", shape = 0.3)
dS <- defSurv(dS, "event_2", "-6.5", shape = 0.5)
dS <- defSurv(dS, "censor", "-7", shape = 0.55)

dtSurv <- genData(1001)
dtSurv <- genSurv(dtSurv, dS)
dtSurv

##         id censor event_1 event_2
##    1:    1     55    15.0     9.7
##    2:    2     47    19.8    23.4
##    3:    3     34     8.0    33.1
##    4:    4     13    25.2    40.8
##    5:    5     61    28.6    18.9
##   ---
##  997:  997     30    22.3    33.7
##  998:  998     53    22.3    20.5
##  999:  999     62    19.8    12.1
## 1000: 1000     55    11.1    22.1
## 1001: 1001     37     7.2    33.9

Now we generate a competing risk outcome “obs_time” and an event indicator “delta”:

dtSurv <- addCompRisk(dtSurv, events = c("event_1", "event_2", "censor"),
  eventName = "delta", timeName = "obs_time", censorName = "censor")
dtSurv

##         id obs_time delta    type
##    1:    1      9.7     2 event_2
##    2:    2     19.8     1 event_1
##    3:    3      8.0     1 event_1
##    4:    4     13.0     0  censor
##    5:    5     18.9     2 event_2
##   ---
##  997:  997     22.3     1 event_1
##  998:  998     20.5     2 event_2
##  999:  999     12.1     2 event_2
## 1000: 1000     11.1     1 event_1
## 1001: 1001      7.2     1 event_1

The competing risk data can be plotted using the cumulative incidence functions (rather than the survival curves).

The data generation can be done in two (instead of three) steps by including the timeName and eventName arguments in the call to genSurv. By default, the competing events will be all the events defined in defSurv:

set.seed(1)
dtSurv <- genData(1001)
dtSurv <- genSurv(dtSurv, dS, timeName = "obs_time",
  eventName = "delta", censorName = "censor")
dtSurv

##         id obs_time delta    type
##    1:    1      9.7     2 event_2
##    2:    2     19.8     1 event_1
##    3:    3      8.0     1 event_1
##    4:    4     13.0     0  censor
##    5:    5     18.9     2 event_2
##   ---
##  997:  997     22.3     1 event_1
##  998:  998     20.5     2 event_2
##  999:  999     12.1     2 event_2
## 1000: 1000     11.1     1 event_1
## 1001: 1001      7.2     1 event_1

Synthetic data

Sometimes, it may be useful to generate data that represents the distributions of an existing data set. Two new functions, genSynthetic and addSynthetic, make it fairly easy to do this.

Let’s say we start with an existing data set \(A\) that has fields \(a\), \(b\), \(c\), and \(d\):

A

##       index    a b c    d
##    1:     1 2.74 8 0 11.1
##    2:     2 4.57 4 1 13.6
##    3:     3 2.63 4 0  8.0
##    4:     4 4.74 7 0 12.5
##    5:     5 1.90 4 0  7.2
##   ---
##  996:   996 0.92 3 0  5.2
##  997:   997 2.89 4 0  8.5
##  998:   998 2.80 7 0 10.9
##  999:   999 2.47 6 0  8.1
## 1000:  1000 2.63 6 0 12.5

We can create a synthetic data set by sampling records with replacement from data set \(A\):

S <- genSynthetic(dtFrom = A, n = 250, id = "index")
S

##      index   a b c    d
##   1:     1 4.0 6 0 11.4
##   2:     2 3.2 4 1  9.5
##   3:     3 2.7 4 0  6.5
##   4:     4 1.7 4 0  6.2
##   5:     5 4.2 4 0  8.9
##  ---
## 246:   246 1.1 5 0  6.5
## 247:   247 3.1 4 1  8.7
## 248:   248 3.3 2 0  1.2
## 249:   249 3.6 6 0  9.3
## 250:   250 3.1 3 0  6.2

The distribution of variables in \(S\) matches their distribution in \(A\). Here are the univariate distributions for each variable in each data set:

summary(A[, 2:5])

##        a             b              c              d
##  Min.   :0.0   Min.   : 0.0   Min.   :0.00   Min.   : 0.1
##  1st Qu.:2.3   1st Qu.: 4.0   1st Qu.:0.00   1st Qu.: 6.9
##  Median :3.0   Median : 5.0   Median :0.00   Median : 9.0
##  Mean   :3.0   Mean   : 5.1   Mean   :0.32   Mean   : 9.1
##  3rd Qu.:3.8   3rd Qu.: 6.0   3rd Qu.:1.00   3rd Qu.:11.2
##  Max.   :6.0   Max.   :13.0   Max.   :1.00   Max.   :18.1

summary(S[, 2:5])

##        a             b              c              d
##  Min.   :0.1   Min.   : 0.0   Min.   :0.00   Min.   : 0.6
##  1st Qu.:2.3   1st Qu.: 3.0   1st Qu.:0.00   1st Qu.: 6.6
##  Median :3.0   Median : 5.0   Median :0.00   Median : 8.6
##  Mean   :3.0   Mean   : 4.7   Mean   :0.33   Mean   : 8.6
##  3rd Qu.:3.8   3rd Qu.: 6.0   3rd Qu.:1.00   3rd Qu.:10.5
##  Max.   :5.2   Max.   :12.0   Max.   :1.00   Max.   :18.1

And here are the correlation matrices for both:

cor(A[, cbind(a, b, c, d)])

##         a       b       c    d
## a  1.0000 -0.0283 -0.0019 0.30
## b -0.0283  1.0000  0.0022 0.72
## c -0.0019  0.0022  1.0000 0.42
## d  0.3034  0.7212  0.4205 1.00

cor(S[, cbind(a, b, c, d)])

##        a     b      c    d
## a  1.000 0.033 -0.028 0.33
## b  0.033 1.000  0.052 0.76
## c -0.028 0.052  1.000 0.39
## d  0.335 0.764  0.388 1.00

Part 3 of 3: 300+ milestone for Big Book of R

[This article was first published on R programming – Oscar Baruffa, and kindly contributed to R-bloggers].

This post is the final installment of a 3-part series highlighting 35 new entries to Big Book of R.

Read part 1 and part 2.

The site now has well over 300 free R programming titles.

Onto the third batch of new books!


Financial Econometrics – R Tutorial Guidance

by Yizhi Wang, Samuel Vigne

This is an R tutorial book for Financial Econometrics in PDF format.

https://www.bigbookofr.com/finance.html#financial-econometrics—r-tutorial-guidance

R Guide to Accompany Introductory Econometrics for Finance

by Robert Wichmann, Chris Brooks

This free software guide for R with freely downloadable datasets brings the econometric techniques to life, showing readers how to implement the approaches presented in Introductory Econometrics for Finance using this highly popular software package. Designed to be used alongside the main textbook, the guide will give readers the confidence and skills to estimate and interpret their own models while the textbook will ensure that they have a thorough understanding of the conceptual underpinnings.

https://www.bigbookofr.com/finance.html#r-guide-to-accompany-introductory-econometrics-for-finance

R Companion to Real Econometrics

by Tony Carilli

The intended audience for this book is anyone making use of Real Econometrics: The Right Tools to Answer Important Questions, 2nd ed., by Michael Bailey, who would like to learn to use R, RStudio, and the tidyverse to complete empirical examples from the text. This book will be useful to anyone wishing to integrate R and the tidyverse into an econometrics course.

https://www.bigbookofr.com/finance.html#r-companion-to-real-econometrics

Partial Least Squares Structural Equation Modeling (PLS-SEM) Using R

by Joseph F. Hair Jr., G. Tomas M. Hult, Christian M. Ringle, Marko Sarstedt, Nicholas P. Danks, Soumya Ray

An open access (free and unlimited) book with concise guidelines on how to apply and interpret Partial Least Squares Structural Equation Modeling (PLS-SEM). It includes an illustrative, step-by-step application of PLS-SEM using the highly user-friendly SEMinR package. It adopts a case-study approach that focuses on the illustration of relevant analysis steps.

https://www.bigbookofr.com/statistics.html#partial-least-squares-structural-equation-modeling-pls-sem-using-r

pipeR Tutorial 

by Kun Ren

pipeR is an R package that helps you better organize your code in pipelines built with %>>%, Pipe() or pipeline(), which are much easier to read, write, and maintain.

https://www.bigbookofr.com/packages.html#piper-tutorial

An(other) introduction to R

by Felix Lennert

In the following, you will receive a gentle introduction to R and how you can use it to work with data. This tutorial was heavily inspired by Richard Cotton’s “Learning R” (Cotton 2013) and Hadley Wickham’s and Garrett Grolemund’s “R for Data Science” (abbreviated with R4DS).

https://www.bigbookofr.com/r-programming.html#another-introduction-to-r

Text Mining for Social Scientists

by Felix Lennert

This script will cover the pre-processing of text, the implementation of supervised and unsupervised approaches to text, and in the end, I will briefly touch upon word embeddings and how social science can use them for inquiry.

https://www.bigbookofr.com/social-science.html#text-mining-for-social-scientists

An Introduction to Text Processing and Analysis with R

by Michael Clark

Dealing with text is typically not even considered in the applied statistical training of most disciplines. This is in direct contrast with how often it has to be dealt with prior to more common analysis, or how interesting it might be to have text be the focus of analysis. This document and corresponding workshop will aim to provide a sense of the things one can do with text, and the sorts of analyses that might be useful.

https://www.bigbookofr.com/text-analysis.html#an-introduction-to-text-processing-and-analysis-with-r

R Shiny Applications in Finance, Medicine, Pharma and Education Industry

by Loan Robinson

The book is a guide to help you understand the codes of five applications you will receive after you purchase the book. If you can go through all of the codes, you can easily create a complex and brilliant R Shiny application.

Instead of spending hours and hours trying to understand, come up with ideas, write the code, and apply application features, you can use the provided code to apply and learn quickly. There are many advanced features that take years to learn; now you have them in hand to work through.

https://www.bigbookofr.com/shiny.html#r-shiny-applications-in-finance-medicine-pharma-and-education-industry

Biostatistics for Biomedical Research

by Frank E Harrell Jr

The book is aimed at exposing biomedical researchers to modern biostatistical methods and statistical graphics, highlighting those methods that make fewer assumptions, including nonparametric statistics and robust statistical measures. In addition to covering traditional estimation and inferential techniques, the course contrasts those with the Bayesian approach, and also includes several components that have been increasingly important in the past few years, such as challenges of high-dimensional data analysis, modeling for observational treatment comparisons, analysis of differential treatment effect (heterogeneity of treatment effect), statistical methods for biomarker research, medical diagnostic research, and methods for reproducible research. 

https://www.bigbookofr.com/life-sciences.html#biostatistics-for-biomedical-research

R Workflow for Reproducible Data Analysis and Reporting

by Frank E Harrell Jr

This work is intended to foster best practices in reproducible data documentation and manipulation, statistical analysis, graphics, and reporting. It will enable the reader to efficiently produce attractive, readable, and reproducible research reports while keeping code concise and clear. Readers are also guided in choosing statistically efficient descriptive analyses that are consonant with the type of data being analyzed.

https://www.bigbookofr.com/workflow.html#r-workflow-for-reproducible-data-analysis-and-reporting

HR Analytics in R

by Chester Ismay, Albert Y. Kim, Hendrik Feddersen

The intention of this book is to encourage more ‘data-driven’ decisions in HR. HR Analytics is no longer a nice-to-have add-on but rather the way HR practitioners should conduct HR decision-making in the future. Where applicable, human judgement is ‘added’ onto a rigorous analysis of the data done in the first place.

To achieve this ideal world, I need to equip you with some fundamental knowledge of R and RStudio, which are open-source tools for data scientists. I am well aware that on one side you want to do something for your career in HR; however, you are most likely completely new to coding.

https://www.bigbookofr.com/field-specific.html#hr-analytics-in-r

Reproducible statistics for psychologists with R: Lab Tutorials

by Matthew J. C. Crump

This is a series of labs/tutorials for a two-semester graduate-level statistics sequence in Psychology @ Brooklyn College of CUNY. The goal of these tutorials is to 1) develop a deeper conceptual understanding of the principles of statistical analysis and inference; and 2) develop practical skills for data-analysis, such as using the increasingly popular statistical software environment R to code reproducible analyses.

https://www.bigbookofr.com/social-science.html#reproducible-statistics-for-psychologists-with-r-lab-tutorials

Computational Social Science: Theory & Application

by Paul C. Bauer

The goals for this course are twofold. First, I hope you will gain a solid understanding of how access to big data (digital traces) is changing the social sciences in terms of a) new substantial and theoretical insights, and in terms of b) new methodologies. Second, I hope you will learn which and how big data could be used to answer further pressing questions you might encounter in the future.

https://www.bigbookofr.com/social-science.html#computational-social-science-theory-application



Subscribe for updates. I write about R, data and careers.

Subscribers get a free copy of Project Management Fundamentals for Data Analysts worth $12


The post Part 3 of 3: 300+ milestone for Big Book of R appeared first on Oscar Baruffa.


When To Use Which in R?

[This article was first published on Data Analysis in R, and kindly contributed to R-bloggers].

The post When To Use Which in R? appeared first on finnstats.

If you are interested in learning more about data science, you can find more articles at finnstats.

Do you need to determine which elements of a vector, data frame, or matrix satisfy a given set of criteria? Then you’re in the right place.

The which() function has the following basic syntax:

which(x, arr.ind = FALSE)

x: any logical test vector

When working with matrices (multi-dimensional arrays), the optional argument arr.ind specifies that you want array indices (row and column positions, for example) rather than the single linear index used for one-dimensional structures.

When scanning a data structure in R, you may use the function which to find the items that satisfy a given criterion.

The indexes of the matching elements in the data object are returned by the which function.

Examples of which() in R

Let’s get straight to some examples of the which() function.

list <- c('Hello','welcome','to','finnstats.com','You','can','find','more','article','here')
which(list == 'finnstats.com')
[1] 4

finnstats.com is indeed at position 4 in the list

which(list=='datascience')
integer(0)

Here no element matches, so which() returns a zero-length integer vector, integer(0).

Using which() with R Data Frames

R data frames can be used with the which function as well. In the example that follows, we’ll find the indexes of a portion of the ChickWeight data frame that satisfy a specific requirement (Time = day 20).

The evolution of a flock of chickens fed various diets over time is shown in this built-in data set for R.


which(ChickWeight$Time == 20)
[1]  11  23  35  47  59  71  83  95 106 118 130 142 154 166 193 207 219 231 243 255 267 279 291 303 315 327
    339 351 363 375 387 399 411 423 435 447 459 471 483 495 517 529 541 553 565 577

As we can see, when the time was equal to 20, the which function retrieved the indices for all of the data points.

Additional Applications

The function need not only be looking for equality; it can also be used with any other logical test. The function can be applied to multidimensional matrices, as was already indicated.
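
A quick sketch of both points, using only base R:

# which() accepts any logical test, not just equality
x <- c(3, 8, 1, 9, 5)
which(x > 4)
[1] 2 4 5

# for matrices, arr.ind = TRUE returns row/column positions
m <- matrix(1:9, nrow = 3)
which(m %% 2 == 0, arr.ind = TRUE)
     row col
[1,]   2   1
[2,]   1   2
[3,]   3   2
[4,]   2   3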

What Does This Mean for Which in R?

It is actually a very helpful feature of the R toolkit, particularly if you’re writing programs to clean data and apply statistical methods.

This method captures the process of inspecting a current data structure and flagging the elements that could need further processing.

It may also be applied before a filtering procedure. Use which to identify the observations in a data frame that should be analyzed as part of the whole or separately, return them as a vector of values, and handle them as necessary.

In terms of functional programming, this IS the fundamental underlying operation for “filter” (returning pointers vs. creating an array).
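
To illustrate that filtering pattern with the built-in ChickWeight data used above:

# keep only the rows measured on day 20
idx <- which(ChickWeight$Time == 20)
day20 <- ChickWeight[idx, ]
head(day20)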


Have you found this article interesting? We’d be glad if you could forward it to a friend or share it on Twitter or LinkedIn to help it spread.


The post When To Use Which in R? appeared first on finnstats.


R Color

[This article was first published on R feed, and kindly contributed to R-bloggers].

We can visually improve our plots by coloring them. This is generally done with the col graphical parameter.

We can specify the name of the color we want as a string. For example, if we want our plot to be a red color, we pass col = "red".


Add Color to Plot in R

We use the following temp vector to create a barplot throughout this section.

# create a vector named temp
temp <- c(5,7,6,4,8)

# barplot of temp without coloring
barplot(temp, main="By default")

# barplot of temp with coloring
barplot(temp, col="coral", main="With coloring")

Output

Here, we have passed col = "coral" inside the barplot() function to color our barplot with coral color.

Try replacing it with "green", "blue", "violet", etc. and look at the difference.


Using Color Names to Change Plot Color in R

R programming has names for 657 colors. We can take a look at them all with the colors() function, or simply check this R color pdf.

# display all color names
colors()

Output

[1] "white"           "aliceblue"        "antiquewhite"        
[4] "antiquewhite1"   "antiquewhite2"    "antiquewhite3"       
[7] "antiquewhite4"   "aquamarine"       "aquamarine1"         
...
... 
[649] "wheat3"        "wheat4"           "whitesmoke"          
[652] "yellow"        "yellow1"          "yellow2"             
[655] "yellow3"       "yellow4"          "yellowgreen"

Here, the colors() function returns a vector of all the color names in alphabetical order with the first element being "white".

We can color our plot by indexing this vector. For example, col=colors()[655] is the same as col="yellow3".
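
Here is that indexing in action:

# create a vector named temp
temp <- c(5,7,6,4,8)

# index into colors(); equivalent to col = "yellow3"
barplot(temp, col = colors()[655], main = "colors()[655]")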


Using Hex Values as Colors in R

In R, instead of using a color name, color can also be defined with a hexadecimal value.

We define a color as a six-digit hexadecimal number of the form #RRGGBB, where RR stands for red, GG for green and BB for blue, and each value ranges from 00 to FF.

For example, #FF0000 is red and #00FF00 is green; similarly, #FFFFFF is white and #000000 is black.

Let's take a look at how to implement hex values as colors in R,

# create a vector named temp
temp <- c(5,7,6,4,8)

# using hex value #c00000
barplot(temp, col="#c00000", main="#c00000")

# using hex value #AE4371
barplot(temp, col="#AE4371", main="#AE4371")

Output

In the above example, we have passed the hex value for the col parameter inside the barplot() function.

Here,

  • #c00000 - this hex is composed of 75.3% red, 0% green and 0% blue
  • #AE4371 - this hex is composed of 68.24% red, 26.27% green and 44.31% blue

Using RGB Values to Color Plot in R

The rgb() function in R allows us to specify red, green and blue components with a number between 0 and 1.

This function returns the corresponding hex code discussed above. For example,

rgb(0, 1, 0) # prints "#00FF00"

rgb(0.3, 0.7, 0.9) # prints "#4DB3E6"

We can directly pass rgb() to the col parameter as:

# create a vector named temp
temp <- c(5,7,6,4,8)

# using rgb() to color barplot
barplot(temp, col = rgb(0.3, 0.7, 0.9), main="Using RGB Values")

Output

Here, we have passed rgb() to the col parameter inside barplot().

So the plot is colored according to the rgb value.


Color Cycling in R

We can color each bar of the barplot with a different color by providing a vector of colors.

If the number of colors provided is less than the number of bars, the color vector is recycled. For example,

# create a vector named temp
temp <- c(5,7,6,4,8)

# color with 5 different colors
barplot(temp, col=c("red", "coral", "blue", "yellow", "pink"), main="With 5 Colors")

# color with 3 different colors; the last two bars recycle them
barplot(temp, col=c("red", "coral", "blue"), main="With 3 Colors")

Output

In the above example, at first we colored each bar of the barplot by providing a vector with 5 colors for 5 different bars.

For the second barplot, we have provided a vector with 3 different colors, so the color is recycled for the last 2 bars.


Using Color Palette in R

R programming offers four built-in color palettes which can be used to quickly generate color vectors of a desired length.

They are: rainbow(), heat.colors(), terrain.colors(), and topo.colors(). We pass in the number of colors that we want.

Let's take a look at the example,

# use rainbow() to generate color palette
rainbow(5)

# Output: "#FF0000FF" "#CCFF00FF" "#00FF66FF" "#0066FFFF" "#CC00FFFF"

Here, notice that the hexadecimal numbers are 8 digits long. The last two digits are the transparency level with FF being opaque and 00 being fully transparent.
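
The rgb() function covered earlier can also produce these 8-digit values through its alpha argument:

# a semi-transparent red; alpha = 0.5 corresponds to hex 80
rgb(1, 0, 0, alpha = 0.5)

# Output: "#FF000080"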

Example: Using Color Palette in R

# create a vector named temp
temp <- c(5,7,6,4,8)

# using rainbow()
barplot(temp, col=rainbow(5), main="rainbow")

# using heat.colors()
barplot(temp, col=heat.colors(5), main="heat.colors")

# using terrain.colors()
barplot(temp, col=terrain.colors(5), main="terrain.colors")

# using topo.colors()
barplot(temp, col=topo.colors(5), main="topo.colors")

Output

Here, we have used each of the four built-in palettes to quickly generate a color vector for the same barplot.


ISF2022: How to make ETS work with ARIMA

[This article was first published on R – Open Forecasting, and kindly contributed to R-bloggers].

This time ISF took place in Oxford. I acted as the programme chair of the event and was quite busy with the schedule and some other minor organisational things, but I still found time to present something new. Specifically, I talked about one specific part of ADAM: the part implementing ETS+ARIMA. The idea is that the two models are typically considered competitors, belonging to different families. But we have known how to unite them since at least 1985. So, it is about time to make this brave step and implement ETS with ARIMA elements.

ETS+ARIMA love story with happy ending…

This talk was based on Chapter 9 of ADAM monograph, and more specifically on Section 9.4.

The slides of the presentation are available here.


Shiny and Reactive Multimedia

[This article was first published on Ashley's Blog, and kindly contributed to R-bloggers].

Shiny has always been the framework for creating dashboards in R, but as time goes on, the potential use cases for shiny applications keep increasing. One area that I wanted to explore was the inclusion of multimedia. This might be an introductory video, or an embedded racing bar chart.

Standard HTML Players

Most web browsers can handle <audio> and <video> tags and provide their own built-in controls, but there are a few issues you might face when including an audio or video element in a shiny application.

Inconsistent UI

Each web browser has its own flavour of styling when it comes to audio and video controls. Whilst the controls might look cohesive in the browser you develop on, you may find that they “stick out” in another.

Audio

<audio controls src="example.mp3">
Chromium (Chrome/Edge)
UI of an audio tag on Chromium web browser
Mozilla Firefox
UI of an audio tag on Mozilla Firefox web browser

Video

<video width="400" controls><source type="video/mp4" src="example.mp4"></video>
Chromium (Chrome/Edge)
UI of an video tag on Chromium web browser
Mozilla Firefox
UI of an video tag on Mozilla Firefox web browser

Along with the visual differences, there is also some functionality that exists in the Chromium based browsers that isn’t present in the Firefox browser. Chromium adds the ability to download the track or change the playback speed using the vertical ellipsis.

The two browsers also calculate the length of tracks differently. In the example video Chromium floors the video duration of 46.6 seconds to 0:46, whereas Firefox rounds up to 0:47.

Server-Side Interaction

Okay, whilst it is possible from the server side to use shinyjs::runjs to tell an audio or video element to play or pause, there is currently little available in the other direction. It might be useful to know whether or not the multimedia is playing, or where in the track the user currently is in order to trigger an event.
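
For example, the play/pause direction mentioned above can be handled with shinyjs; here is a minimal sketch, where the "audio1" id and example.mp3 file are placeholders (not from the packages discussed below):

library(shiny)
library(shinyjs)

ui <- fluidPage(
  useShinyjs(),
  # a standard HTML audio element
  tags$audio(id = "audio1", src = "example.mp3", controls = NA),
  actionButton("play", "Play from the server")
)

server <- function(input, output, session) {
  observeEvent(input$play, {
    # tell the browser to start playback of the audio element
    runjs("document.getElementById('audio1').play();")
  })
}

shinyApp(ui, server)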

These, along with a general curiosity of adding multimedia into a shiny application, have brought along the creation of two R packages using a couple of JavaScript libraries.

Audio with howler.js

howler.js is an audio library that makes working with audio in JavaScript easy and reliable across all platforms. {howler} is a wrapper for howler.js, creating an audio player that can handle multiple tracks and seamlessly switch between them. A variety of buttons are available that can trigger events on the player, and are easily customisable.

On the server side, there are 4 input values for any howler player:

  • {id}_playing A logical value as to whether or not the howler is playing audio
  • {id}_track Basename of the file currently loaded
  • {id}_seek The current time (in seconds) of the track loaded
  • {id}_duration The duration (in seconds) of the track loaded

There are also two modules available: a basic module that tries to emulate the UI of the Chromium audio player, and a full module with a few extra components, such as the track name and the ability to switch tracks. If the audio player is purely a UI element, the server-side module is not required; however, it does provide the standard information that comes with the howler widget.

Example

library(shiny)
library(howler)

ui <- fluidPage(
  h1("Howler Audio Player"),
  howler::howlerModuleUI(
    id = "sound",
    files = list(
      "Winning Elevation" = "https://cdn.pixabay.com/download/audio/2022/05/16/audio_db6591201e.mp3"
    )
  ),
  howler::howlerBasicModuleUI(
    id = "sound2",
    files = list(
      "Winning Elevation" = "https://cdn.pixabay.com/download/audio/2022/05/16/audio_db6591201e.mp3"
    )
  )
)

server <- function(input, output, session) {}

shinyApp(ui, server)

This produces the same UI in each of the three main browsers:

UI of howler modules in Chrome, Firefox and Edge, all 3 players have identical UI

Video with video.js

video.js is another JavaScript library, focussing on providing consistent video behaviour across all web browsers, being able to embed videos from a variety of different sources. {video} is the wrapper for this library, and creates a video player that is easy to communicate between the UI and server of shiny applications.

Three of the four audio player inputs are available in the video player:

  • {id}_playing A logical value as to whether or not the howler is playing audio
  • {id}_seek The current time (in seconds) of the track loaded
  • {id}_duration The duration (in seconds) of the track loaded

Example

library(shiny)
library(video)

ui <- fluidPage(
  h1("Video Player"),
  video("https://vjs.zencdn.net/v/oceans.mp4", width = 600, height = NA)
)

server <- function(input, output, session) {}

shinyApp(ui, server)

There is one difference between Firefox and Chromium, and that is picture-in-picture. The button is visible when hovering in Firefox (similar to the standard video player), but is kept in the controls bar in Chromium. Apart from that, the players are identical.

UI of video.js players in Chrome, Mozilla and Edge, all 3 players have similar UI

If you aren’t satisfied with the basic skin of the video.js player, there are a collection of skins available on GitHub, including one that looks like the Netflix video player.

An added benefit of video.js is that all videos are easily accessible in JavaScript by using videojs('id') to find any video by just referencing the ID of the HTML tag (so if there is something currently unavailable in {video} you can use this to create your own custom call!).

Server-Side

There are a collection of functions available in both packages to manipulate the multimedia from the server in a shiny application:

  • playHowl/playVideo - resume playing the current track*
  • pauseHowl/pauseVideo - pause the current track
  • stopHowl/stopVideo - pause and return to the start of the current track
  • seekHowl/seekVideo - move the current track to a specified point in time
  • addTrack/changeVideo - change the current track to a new one

Because {howler} can handle multiple tracks attached to a player, there is also the potential to change between tracks using changeTrack without having to add a new track.
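
As a rough sketch of how these pieces fit together on the server, assuming the howler module with id "sound" from the earlier example (argument details may differ slightly from the released package):

server <- function(input, output, session) {
  # input$sound_seek updates as the track plays
  observeEvent(input$sound_seek, {
    # pause from the server once playback passes 30 seconds
    if (isTRUE(input$sound_playing) && input$sound_seek >= 30) {
      pauseHowl("sound")
    }
  })
}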

{htmlwidgets}

It is worth mentioning that both of these packages have been facilitated with {htmlwidgets}, a package that provides an easy way to create R bindings to JavaScript libraries. The JavaScript for R book was a great aid in creating widgets for these; it made writing the connections between the UI and server a lot easier (here is what the howler shiny wrapper looked like before updating to {htmlwidgets}).

Summary

Whilst the <audio> and <video> tags allow easy use of including multimedia in web pages, the use of JavaScript libraries enables a level of consistency across all web browsers plus more flexibility around playing the audio or video track.

{howler} and {video} are in the process of being submitted to CRAN and hopefully both will be available in the next couple of days.


RStudio 2022.07.0: What’s New

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers].

This post highlights some of the improvements in the latest RStudio IDE release 2022.07.0, code-named “Spotted Wakerobin”. To read about all of the new features and updates available in this release, check out the latest Release Notes.

Find in Files improvements

We’ve made some significant improvements to the Find in Files pane across all platforms, with particularly significant improvements on Windows.

  • The Find in Files pane has gained a Refresh button, so that users can manually refresh possible matches/replacements to capture any changes to the files since the search was last run. (Screenshot: the Refresh button in Find in Files.)

  • We’ve upgraded the version of grep we use with RStudio on Windows. This more modern version of grep enables improved searching through directories and subdirectories with non-ASCII characters in the path name, such as C:\Users\me\Éñçĥìłăḏà or C:\你好\你好.

  • We’ve also changed the flavor of regular expressions supported by the Find in Files search. Previously, Find in Files supported only POSIX Basic Regular Expressions. As of this release, Find in Files is now powered by Extended Regular Expressions, when the Regular expression checkbox is checked. What does this mean for your searches? Previously, if you used the special characters ?, +, |, {}, or (), they were treated as character literals; now they will be interpreted according to their special regex meaning when unescaped, and as character literals only when escaped with a single backslash. This change also adds additional support for Find and Replace using regular expressions with \b, \w, \d, \B, \W, and \D, which now return the expected results in both Find and Replace mode. These changes bring Find in Files search more closely in line with the flavor of regular expressions supported by R’s base grep function (using perl = FALSE), but note that where the grep function within R requires double backslashes, Find in Files requires only a single backslash as the escape character.
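
For instance, to match one or more digits, R source code needs the backslash doubled, while the Find in Files pane takes the single-backslash form (a quick sketch using base R's grep):

x <- c("release-2022", "draft", "v4.2.0")

# In R code the escape character itself must be escaped, so \d+ is written "\\d+"
grep("\\d+", x, value = TRUE)
#> [1] "release-2022" "v4.2.0"

# In the Find in Files pane (with "Regular expression" checked), the same
# search is typed as: \d+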

(Screenshots: regular expression Find and Replace in the Find in Files pane.)

  • When using Find in Files with the search directory set to a Git repository, users will by default have the option to ignore searching through any files or subdirectories listed within the .gitignore for that repo. Users can uncheck this option if they wish to include these files in their search.

(Screenshot: the Find in Files option to exclude files matched by .gitignore.)

A number of other small bug fixes have been included in this release to improve the reliability and usability of Find in Files search. We hope this makes the feature more powerful and straightforward for users.

Hyperlinks

Support for hyperlinks, as generated by cli::style_hyperlink(), has been added to the console output, build pane and various other places. Depending on the URL, clicking a hyperlink will (see the example after this list):

  • go to a website cli::style_hyperlink("tidyverse", "https://www.tidyverse.org"), a local file cli::style_hyperlink("file", "file:///path/to/file"), or a specific line/column of a file cli::style_hyperlink("file", "file:///path/to/file", params = c(line = 10, col = 4))

  • open a help page cli::style_hyperlink("summarise()", "ide:help:dplyr::summarise") or a vignette cli::style_hyperlink("intro to dplyr", "ide:vignette:dplyr::dplyr"), with some preview information in the popup when the link is hovered over. (Screenshot: help page hyperlink popup.)

  • run code in the console cli::style_hyperlink("Show last error", "ide:run::rlang::last_error()"). This also shows information about the function that will run when the link is clicked. (Screenshot: run hyperlink popup.)
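
A minimal example you can paste into the console to see one of these links (clickable in terminals and IDE builds that support hyperlinks):

library(cli)

# Renders "tidyverse" as clickable text pointing at the website
cat(style_hyperlink("tidyverse", "https://www.tidyverse.org"), "\n")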

Some packages (e.g. testthat and roxygen2) have started to take advantage of this feature to improve their user experience, and we hope this will inspire other packages. (Screenshot: a testthat failure link.)

Support for R (>= 4.2.0)

R 4.2+, officially released in April 2022, received extensive IDE support in the previous release of the RStudio IDE. In this release, we add support for some additional features as well as some critical bug fixes.

  • We resolved an issue where files would appear to be blank when opened in projects not using UTF-8 encoding on Windows with R 4.2.0, which could result in users inadvertently overwriting their files with an empty file.
  • We added further support for the R native pipe, first introduced in R 4.1. Code diagnostics now recognize and support the use of unnamed arguments in conjunction with the native pipe (e.g. LETTERS |> length()) as well as the use of the new placeholder character (e.g. mtcars |> lm(mpg ~ cyl, data = _)) added in R 4.2.0.
  • We’ve also made it easier for users to configure whether they want to use the native R pipe |> or the magrittr pipe %>% when using the Insert pipe command (Cmd/Ctrl + Shift + M). Previously, this was only configurable at the global level, from the Global Options pane. As of this release, you can now inherit or override the global option in Project Options as well, to help maintain code style consistency within an RStudio project (a quick side-by-side of the two pipes follows this list). (Screenshot: setting the pipe operator in Project Options.)
  • R 4.2 also introduced extensive changes to the Help system; we’ve updated support for this new enhanced Help system to ensure it displays crisply and legibly in the IDE, especially when using a dark theme. (Screenshot: enhanced Help in R 4.2.)
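
As referenced above, a quick side-by-side of the two pipes (the _ placeholder needs R >= 4.2):

library(magrittr)

LETTERS |> length()                  # native pipe, R >= 4.1
#> [1] 26

mtcars %>% lm(mpg ~ cyl, data = .)   # magrittr pipe with the . placeholder
mtcars |> lm(mpg ~ cyl, data = _)    # native pipe with the _ placeholder, R >= 4.2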

More info

There’s lots more in this release, and it’s available for download today. You can read about all the features and bugfixes in the RStudio 2022.07.0 “Spotted Wakerobin” release in the RStudio Release Notes. We’d love to hear your feedback about the new release on our community forum.

To leave a comment for the author, please follow the link and comment on their blog: RStudio | Open source & professional software for data science teams on RStudio.


Shiny showcase at rstudio::conf(2022)

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

rstudio::conf Shiny Talks Schedule

Beginning with the invitation-only 2016 Shiny Developer Conference, Shiny has played a prominent part in all RStudio conferences, and rstudio::conf(2022) is no exception. Two workshops and eighteen talks showcase Shiny’s multiple, ever-increasing capabilities. What started out as a way to introduce R’s interactive statistical computations to the web has grown into a production-grade tool that supports serious data science workflows and facilitates the communication of data-generated insights throughout large organizations in both industry and government.

Here is your Shiny guide to rstudio::conf(2022). But before you attend the show or see the movie, you may want to have a look at the book.

Keynotes and talks will be livestreamed on the rstudio::conf(2022) website, free and open to all. No registration is required. If you would like, you can sign up to get access to our Discord server to meet and chat with attendees during conf.

On July 27-28th, head to the conference website to watch the livestreams and ask questions alongside other attendees.

Keynotes

  1. The Past and Future of Shiny

Workshops

  1. Building Production-Quality Shiny Applications

  2. Getting Started with Shiny

Talks

  1. R Shiny – From Conception to the Cloud

  2. Optimal allocation of COVID-19 vaccines in West Africa – A Shiny success story

  3. R Markdown + RStudio Connect + R Shiny: A Recipe for Automated Data Processing, Error Logging, and Process Monitoring

  4. I made an entire e-commerce platform on Shiny

  5. Let your mobile shine – Leveraging CSS concepts to make shiny apps mobile responsive

  6. {shinyslack}: Connecting Slack Teams to Shiny Apps

  7. leafdown: Interactive multi-layer maps in Shiny apps

  8. Say Hello! to Multilingual Shiny Apps

  9. Cross-Industry Anomaly Detection Solutions with R and Shiny

  10. Running Shiny without a server

  11. A new way to build your Shiny app’s UI

  12. Creating a Design System for Shiny and RMarkdown

  13. Shiny Dashboards for Biomedical Research Funding

  14. Dashboard-Builder: Building Shiny Apps without writing any code

  15. Introducing Rhino: Shiny application framework for enterprise

  16. A Robust Framework for Automated Shiny App Testing

  17. {shinytest2}: Unit testing for Shiny applications

To leave a comment for the author, please follow the link and comment on their blog: R Views.


Separate a data frame column into multiple columns-tidyr Part3

[This article was first published on Data Science Tutorials, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


To divide a data frame column into multiple columns, use the separate() function from the tidyr package.


The basic syntax used by this function is as follows.

separate(data, col, into, sep)

where:

data: Name of the data frame

col: Name of the column to separate

into: a list of names to divide the column into

sep: The separator to use when splitting the column


The practical application of this function is demonstrated in the examples that follow.

Example 1: Dividing a column into two

Let’s say we have the R data frame shown below.

Let’s create a data frame

df <- data.frame(player=c('P1', 'P1', 'P2', 'P2', 'P3', 'P3'),
year=c(1, 2, 1, 2, 1, 2),
stats=c('25-2', '22-3', '28-5', '21-9', '22-5', '29-3'))

Now we can view the data frame


df
   player year stats
1     P1    1  25-2
2     P1    2  22-3
3     P2    1  28-5
4     P2    2  21-9
5     P3    1  22-5
6     P3    2  29-3

The stats column can be divided into two new columns labelled “points” and “assists” using the separate() function as follows:

library(tidyr)

Divide the stats column into columns for points and assists:

separate(df, col=stats, into=c('points', 'assists'), sep='-')
   player year points assists
1     P1    1     25       2
2     P1    2     22       3
3     P2    1     28       5
4     P2    2     21       9
5     P3    1     22       5
6     P3    2     29       3

Example 2: Dividing a column into more than two columns

The stats column can be divided into three distinct columns using the separate() function as follows.


library(tidyr)
df <- data.frame(player=c('P1', 'P1', 'P2', 'P2', 'P3', 'P3'),
year=c(1, 2, 1, 2, 1, 2),
stats=c('25-2-3', '22-3-3', '28-5-3', '21-9-2', '22-5-1', '29-3-0'))
df
player year  stats
1     P1    1 25-2-3
2     P1    2 22-3-3
3     P2    1 28-5-3
4     P2    2 21-9-2
5     P3    1 22-5-1
6     P3    2 29-3-0

The stats column is split into three new columns:

separate(df, col=stats, into=c('points', 'assists', 'steals'), sep='-')
player year points assists steals
1     P1    1     25       2      3
2     P1    2     22       3      3
3     P2    1     28       5      3
4     P2    2     21       9      2
5     P3    1     22       5      1
6     P3    2     29       3      0


Have you liked this article? If you could email it to a friend or share it on Facebook, Twitter, or LinkedIn, I would be eternally grateful.

Please use the like buttons below to show your support. Please remember to share and comment below. 


To leave a comment for the author, please follow the link and comment on their blog: Data Science Tutorials.


Adding continent and country names with {countrycode}, and subsetting a data frame using sample()

[This article was first published on Ronan's #TidyTuesday blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

In this post, the Technology Adoption data set is used to illustrate data exploration in R and adding information using the {countrycode} package. During data exploration, the tt$technology data set is filtered to select for the “Energy” category, and the distinct values for “variable” and “label” are printed. A subset is then created to test adding full country names and corresponding continents based on the 3-letter ISO codes in the data set using the countrycode() function. The full data set is then wrangled into two tibbles for fossil fuel and low-carbon electricity production: the distribution for each energy source is plotted according to the corresponding continent. The full source for this blog post is available on GitHub.

Setup

Loading the R libraries and data set.

# Loading libraries
library(tidytuesdayR)
library(countrycode)
library(tidyverse)
library(ggthemes)

# Loading data
tt <- tt_load("2022-07-19")

    Downloading file 1 of 1: `technology.csv`

Exploring tt$technology: selecting distinct values after filtering, and testing adding a “continent” variable

# Printing a summary of tt$technology
tt$technology

# A tibble: 491,636 × 7
   variable label                      iso3c  year group categ…¹ value
   <chr>    <chr>                      <chr> <dbl> <chr> <chr>   <dbl>
 1 BCG      % children who received a… AFG    1982 Cons… Vaccin…    10
 2 BCG      % children who received a… AFG    1983 Cons… Vaccin…    10
 3 BCG      % children who received a… AFG    1984 Cons… Vaccin…    11
 4 BCG      % children who received a… AFG    1985 Cons… Vaccin…    17
 5 BCG      % children who received a… AFG    1986 Cons… Vaccin…    18
 6 BCG      % children who received a… AFG    1987 Cons… Vaccin…    27
 7 BCG      % children who received a… AFG    1988 Cons… Vaccin…    40
 8 BCG      % children who received a… AFG    1989 Cons… Vaccin…    38
 9 BCG      % children who received a… AFG    1990 Cons… Vaccin…    30
10 BCG      % children who received a… AFG    1991 Cons… Vaccin…    21
# … with 491,626 more rows, and abbreviated variable name ¹ category
# ℹ Use `print(n = ...)` to see more rows

# Printing the distinct "variable" and "label" pairs for the "Energy" category
## This will be used as a reference to create the "energy_type" column/variable
tt$technology %>%
  filter(category == "Energy") %>%
  select(variable, label) %>%
  distinct()

# A tibble: 11 × 2
   variable              label
   <chr>                 <chr>
 1 elec_coal             Electricity from coal (TWH)
 2 elec_cons             Electric power consumption (KWH)
 3 elec_gas              Electricity from gas (TWH)
 4 elec_hydro            Electricity from hydro (TWH)
 5 elec_nuc              Electricity from nuclear (TWH)
 6 elec_oil              Electricity from oil (TWH)
 7 elec_renew_other      Electricity from other renewables (TWH)
 8 elec_solar            Electricity from solar (TWH)
 9 elec_wind             Electricity from wind (TWH)
10 elecprod              Gross output of electric energy (TWH)
11 electric_gen_capacity Electricity Generating Capacity, 1000 kilowa…

# Setting a seed to make results reproducible
set.seed("20220719")

# Using sample() to select six rows of tt$technology at random
sample_rows <- sample(x = rownames(tt$technology), size = 6)

# Creating a subset using the random rows
technology_sample <- tt$technology[sample_rows, ]

# Printing a summary of the randomly sampled subset
technology_sample

# A tibble: 6 × 7
  variable        label               iso3c  year group categ…¹  value
  <chr>           <chr>               <chr> <dbl> <chr> <chr>    <dbl>
1 Pol3            % children who rec… PRY    1993 Cons… Vaccin… 6.6 e1
2 pct_ag_ara_land % Arable land shar… LBR    1991 Non-… Agricu… 3.08e1
3 fert_total      Aggregate kg of fe… CHE    1988 Prod… Agricu… 1.78e8
4 railp           Thousands of passe… TUR    1948 Cons… Transp… 4.9 e1
5 ag_land         Land agricultural … TUN    2013 Non-… Agricu… 9.94e3
6 tv              Television sets     NIC    1981 Cons… Commun… 1.14e5
# … with abbreviated variable name ¹ category

# Adding continent and country name columns/variables to the sample subset,
# using the countrycode::countrycode() function
technology_sample <- technology_sample %>%
  mutate(continent = countrycode(iso3c, origin = "iso3c",
    destination = "continent"),
    country = countrycode(iso3c, origin = "iso3c", destination = "country.name"))

# Selecting the country ISO code, continent and country name of the sample
# subset, to confirm that countrycode() worked as intended
technology_sample %>% select(iso3c, continent, country)

# A tibble: 6 × 3
  iso3c continent country
  <chr> <chr>     <chr>
1 PRY   Americas  Paraguay
2 LBR   Africa    Liberia
3 CHE   Europe    Switzerland
4 TUR   Asia      Turkey
5 TUN   Africa    Tunisia
6 NIC   Americas  Nicaragua

Wrangling tt$technology into two electricity production tibbles: fossil fuels and low-carbon sources

# Adding the corresponding continent for each country in tt$technology;
# filtering to select for the "Energy" category; adding a more succinct
# "energy_type" variable; and dropping rows with missing values
energy_tbl <- tt$technology %>%
  mutate(continent = countrycode(iso3c, origin = "iso3c",
    destination = "continent")) %>%
  filter(category == "Energy") %>%
  mutate(energy_type = fct_recode(variable,
    "Consumption" = "elec_cons", "Coal" = "elec_coal", "Gas" = "elec_gas",
    "Hydro" = "elec_hydro", "Nuclear" = "elec_nuc", "Oil" = "elec_oil",
    "Other renewables" = "elec_renew_other", "Solar" = "elec_solar",
    "Wind" = "elec_wind", "Output" = "elecprod",
    "Capacity" = "electric_gen_capacity")) %>%
  drop_na()

# Printing a summary of energy_tbl
energy_tbl

# A tibble: 66,300 × 9
   variable  label     iso3c  year group categ…¹ value conti…² energ…³
   <chr>     <chr>     <chr> <dbl> <chr> <chr>   <dbl> <chr>   <fct>
 1 elec_coal Electric… ABW    2000 Prod… Energy      0 Americ… Coal
 2 elec_coal Electric… ABW    2001 Prod… Energy      0 Americ… Coal
 3 elec_coal Electric… ABW    2002 Prod… Energy      0 Americ… Coal
 4 elec_coal Electric… ABW    2003 Prod… Energy      0 Americ… Coal
 5 elec_coal Electric… ABW    2004 Prod… Energy      0 Americ… Coal
 6 elec_coal Electric… ABW    2005 Prod… Energy      0 Americ… Coal
 7 elec_coal Electric… ABW    2006 Prod… Energy      0 Americ… Coal
 8 elec_coal Electric… ABW    2007 Prod… Energy      0 Americ… Coal
 9 elec_coal Electric… ABW    2008 Prod… Energy      0 Americ… Coal
10 elec_coal Electric… ABW    2009 Prod… Energy      0 Americ… Coal
# … with 66,290 more rows, and abbreviated variable names ¹ category,
#   ² continent, ³ energy_type
# ℹ Use `print(n = ...)` to see more rows

# Filtering energy_tbl for fossil fuel rows
fossil_fuel_tbl <- energy_tbl %>%
  filter(energy_type != "Consumption" & energy_type != "Output"
    & energy_type != "Capacity") %>%
  filter(energy_type == "Coal" | energy_type == "Gas" | energy_type == "Oil")

# Printing a summary of the tibble
fossil_fuel_tbl

# A tibble: 13,914 × 9
   variable  label     iso3c  year group categ…¹ value conti…² energ…³
   <chr>     <chr>     <chr> <dbl> <chr> <chr>   <dbl> <chr>   <fct>
 1 elec_coal Electric… ABW    2000 Prod… Energy      0 Americ… Coal
 2 elec_coal Electric… ABW    2001 Prod… Energy      0 Americ… Coal
 3 elec_coal Electric… ABW    2002 Prod… Energy      0 Americ… Coal
 4 elec_coal Electric… ABW    2003 Prod… Energy      0 Americ… Coal
 5 elec_coal Electric… ABW    2004 Prod… Energy      0 Americ… Coal
 6 elec_coal Electric… ABW    2005 Prod… Energy      0 Americ… Coal
 7 elec_coal Electric… ABW    2006 Prod… Energy      0 Americ… Coal
 8 elec_coal Electric… ABW    2007 Prod… Energy      0 Americ… Coal
 9 elec_coal Electric… ABW    2008 Prod… Energy      0 Americ… Coal
10 elec_coal Electric… ABW    2009 Prod… Energy      0 Americ… Coal
# … with 13,904 more rows, and abbreviated variable names ¹ category,
#   ² continent, ³ energy_type
# ℹ Use `print(n = ...)` to see more rows

# Filtering energy_tbl for low-carbon energy source rows
low_carbon_tbl <- energy_tbl %>%
  filter(energy_type != "Consumption" & energy_type != "Output"
    & energy_type != "Capacity") %>%
  filter(energy_type != "Coal" & energy_type != "Gas" & energy_type != "Oil")

# Printing a summary of the tibble
low_carbon_tbl

# A tibble: 26,890 × 9
   variable   label    iso3c  year group categ…¹ value conti…² energ…³
   <chr>      <chr>    <chr> <dbl> <chr> <chr>   <dbl> <chr>   <fct>
 1 elec_hydro Electri… ABW    2000 Prod… Energy      0 Americ… Hydro
 2 elec_hydro Electri… ABW    2001 Prod… Energy      0 Americ… Hydro
 3 elec_hydro Electri… ABW    2002 Prod… Energy      0 Americ… Hydro
 4 elec_hydro Electri… ABW    2003 Prod… Energy      0 Americ… Hydro
 5 elec_hydro Electri… ABW    2004 Prod… Energy      0 Americ… Hydro
 6 elec_hydro Electri… ABW    2005 Prod… Energy      0 Americ… Hydro
 7 elec_hydro Electri… ABW    2006 Prod… Energy      0 Americ… Hydro
 8 elec_hydro Electri… ABW    2007 Prod… Energy      0 Americ… Hydro
 9 elec_hydro Electri… ABW    2008 Prod… Energy      0 Americ… Hydro
10 elec_hydro Electri… ABW    2009 Prod… Energy      0 Americ… Hydro
# … with 26,880 more rows, and abbreviated variable names ¹ category,
#   ² continent, ³ energy_type
# ℹ Use `print(n = ...)` to see more rows

Plotting distributions of electricity produced from fossil fuels and low-carbon sources

# Plotting distributions of electricity produced from fossil fuels
fossil_fuel_tbl %>%
  ggplot(aes(x = fct_reorder(energy_type, value), y = value, fill = energy_type)) +
  geom_boxplot() +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none") +
  scale_colour_discrete() +
  scale_y_log10() +
  facet_wrap(~continent, scales = "free") +
  labs(
    title = "Electricity generated from fossil fuels by continent",
    y = "Output in log terawatt-hours: log10(TWh)",
    x = "Source")
Figure 1: Box plots of electricity produced from fossil fuels, faceted by continent.

# Plotting distributions of electricity produced from low-carbon sources
low_carbon_tbl %>%
  ggplot(aes(x = fct_reorder(energy_type, value), y = value, fill = energy_type)) +
  geom_boxplot() +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none") +
  scale_colour_discrete() +
  scale_y_log10() +
  facet_wrap(~continent, scales = "free") +
  labs(
    title = "Electricity generated from low-carbon sources by continent",
    y = "Output in log terawatt-hours: log10(TWh)",
    x = "Source")
Figure 2: Box plots of electricity produced from low-carbon energy sources, faceted by continent.

To leave a comment for the author, please follow the link and comment on their blog: Ronan's #TidyTuesday blog.


Boosted Configuration (_neural_) Networks for classification

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A few years ago in 2018, I discussed Boosted Configuration (neural) Networks (BCN, for multivariate time series forecasting) in this document. Unlike Stochastic Configuration Networks, which inspired them, BCNs aren’t randomized. Rather, they are closer to Gradient Boosting Machines and Matching Pursuit algorithms, with base learners being single-layered feedforward neural networks that are actually optimized at each iteration of the algorithm.

The mathematician in you has certainly been asking questions about the convexity of the problem at line 4, algorithm 1 (in the document). As of July 2022, there is unfortunately no answer to that question. BCNs work well empirically, as we’ll see, and finding the maximum at line 4 of the algorithm is achieved, by default, with R’s stats::nlminb. Other derivative-free optimizers are available in R package bcn.

As will be shown in this document, BCNs can be used for classification. For this purpose, and as implemented in R package bcn, the response (variable to be explained) containing the classes is one-hot encoded as a matrix of probabilities equal to 0 or 1. Then, the classification technique dealing with a one-hot encoded response matrix is similar to the one presented in this post.
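
As a minimal sketch of that one-hot encoding, independent of how {bcn} implements it internally (the factor levels below are illustrative), base R's model.matrix() does the job:

y <- factor(c("setosa", "versicolor", "virginica", "setosa"))

# One indicator column per class, filled with 0/1 "probabilities"
Y <- model.matrix(~ y - 1)
print(Y)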

6 toy datasets are used for this basic demo of R package bcn: Iris, Wine, Ionosphere, Wisconsin Breast, Digits, Penguins. For each dataset, hyperparameter tuning has already been done. Repeated 5-fold cross-validation was carried out on 80% of the data, for each dataset, and the accuracy reported in the table below is calculated on the remaining 20% of the data. BCN results are compared to Random Forest’s (with default parameters), in order to verify that BCN results are not absurd – it’s not a competition between Random Forest and BCN here.

In the examples, you’ll also notice that fitting a BCN to a dataset can be relatively slow when the number of explanatory variables is high (>30; see the digits dataset example, initially with 64 covariates). This is, again, because of line 4, algorithm 1: an optimization problem in a high-dimensional space is solved at each iteration of the algorithm.

The future for R package bcn (in no particular order)?

  • Implement BCN for regression (a continuous response)
  • Improve the speed of execution for high dimensional problems
  • Implement a Python version
Dataset          BCN test set accuracy   Random Forest test set accuracy
iris             100%                    93.33%
Wine             97.22%                  94.44%
Ionosphere       90.14%                  95.77%
Breast cancer    99.12%                  94.73%
Digits           97.5%                   98.61%
Penguins         100%                    100%

Content

0 - Installing and loading packages

Installing bcn from GitHub:

devtools::install_github("Techtonique/bcn")

# Browse the bcn manual pages
help(package = 'bcn')

Installing bcn from R universe:

# Enable repository from techtonique
options(repos = c(
  techtonique = 'https://techtonique.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
  
# Download and install bcn in R
install.packages('bcn')

# Browse the bcn manual pages
help(package = 'bcn')

Loading packages:

library(bcn) # Boosted Configuration networks (only for classification, for now)
library(mlbench) # Machine Learning Benchmark Problems
library(caret)
library(randomForest)
library(pROC)

1 - iris dataset

data("iris")

head(iris)

dim(iris)

set.seed(1234)
train_idx <- sample(nrow(iris), 0.8 * nrow(iris))
X_train <- as.matrix(iris[train_idx, -ncol(iris)])
X_test <- as.matrix(iris[-train_idx, -ncol(iris)])
y_train <- iris$Species[train_idx]
y_test <- iris$Species[-train_idx]

ptm <- proc.time()
fit_obj <- bcn::bcn(x = X_train, y = y_train, B = 10L, nu = 0.335855,
                    lam = 10**0.7837525, r = 1 - 10**(-5.470031), tol = 10**-7,
                    activation = "tanh", type_optim = "nlminb", show_progress = FALSE)
cat("Elapsed: ", (proc.time() - ptm)[3])

plot(fit_obj$errors_norm, type='l')

preds <- predict(fit_obj, newx = X_test)

mean(preds == y_test)

table(y_test, preds)

rf <- randomForest::randomForest(x = X_train, y = y_train)
mean(predict(rf, newdata=as.matrix(X_test)) == y_test)

print(head(predict(fit_obj, newx = X_test, type='probs')))
print(head(predict(rf, newdata=as.matrix(X_test), type='prob')))

2 - wine dataset

data(wine)

head(wine)

dim(wine)

set.seed(1234)
train_idx <- sample(nrow(wine), 0.8 * nrow(wine))
X_train <- as.matrix(wine[train_idx, -ncol(wine)])
X_test <- as.matrix(wine[-train_idx, -ncol(wine)])
y_train <- as.factor(wine$target[train_idx])
y_test <- as.factor(wine$target[-train_idx])

ptm <- proc.time()
fit_obj <- bcn::bcn(x = X_train, y = y_train, B = 6L, nu = 0.8715725,
                    lam = 10**0.2143678, r = 1 - 10**(-6.1072786),
                    tol = 10**-4.9605713, show_progress = FALSE)
cat("Elapsed: ", (proc.time() - ptm)[3])

plot(fit_obj$errors_norm, type='l')

preds <- predict(fit_obj, newx = X_test)

mean(preds == y_test)

table(y_test, preds)

rf <- randomForest::randomForest(x = X_train, y = y_train)
mean(predict(rf, newdata=as.matrix(X_test)) == y_test)

print(head(predict(fit_obj, newx = X_test, type='probs')))
print(head(predict(rf, newdata=as.matrix(X_test), type='prob')))

3 - Ionosphere dataset

data("Ionosphere")

head(Ionosphere)

dim(Ionosphere)

Ionosphere$V1 <- as.numeric(Ionosphere$V1)
Ionosphere$V2 <- NULL
set.seed(1234)
train_idx <- sample(nrow(Ionosphere), 0.8 * nrow(Ionosphere))
X_train <- as.matrix(Ionosphere[train_idx, -ncol(Ionosphere)])
X_test <- as.matrix(Ionosphere[-train_idx, -ncol(Ionosphere)])
y_train <- as.factor(Ionosphere$Class[train_idx])
y_test <- as.factor(Ionosphere$Class[-train_idx])

ptm <- proc.time()
fit_obj <- bcn::bcn(x = X_train,
                     y = y_train, B = 50L,
                     nu = 0.5182606,
                     lam = 10**1.323274,
                     r = 1 - 10**(-6.694688),
                     col_sample = 0.7956659,
                     tol = 10**-7,
                     verbose=FALSE,
                     show_progress = FALSE)
cat("Elapsed: ", (proc.time() - ptm)[3])

plot(fit_obj$errors_norm, type='l')

preds <- predict(fit_obj, newx = X_test)

mean(preds == y_test)

table(y_test, preds)

rf <- randomForest::randomForest(x = X_train, y = y_train)
mean(predict(rf, newdata=as.matrix(X_test)) == y_test)

print(head(predict(fit_obj, newx = X_test, type='probs')))
print(head(predict(rf, newdata=as.matrix(X_test), type='prob')))

roc_obj <- pROC::roc(as.numeric(y_test), as.numeric(preds))
pROC::auc(roc_obj)

roc_obj_rf <- pROC::roc(as.numeric(y_test), as.numeric(predict(rf, newdata=as.matrix(X_test))))
pROC::auc(roc_obj_rf)

4 - breast cancer dataset

data("breast_cancer")

head(breast_cancer)

dim(breast_cancer)

set.seed(1234)
train_idx <- sample(nrow(breast_cancer), 0.8 * nrow(breast_cancer))
X_train <- as.matrix(breast_cancer[train_idx, -ncol(breast_cancer)])
X_test <- as.matrix(breast_cancer[-train_idx, -ncol(breast_cancer)])
y_train <- as.factor(breast_cancer$target[train_idx])
y_test <- as.factor(breast_cancer$target[-train_idx])

ptm <- proc.time()
fit_obj <- bcn::bcn(x = X_train, y = y_train, B = 31L, nu = 0.4412851,
                    lam = 10**-0.2439358, r = 1 - 10**(-7), col_sample = 0.5, tol = 10**-2, show_progress = FALSE)
cat("Elapsed: ", (proc.time() - ptm)[3])

plot(fit_obj$errors_norm, type='l')

preds <- predict(fit_obj, newx = X_test)

mean(preds == y_test)

table(y_test, preds)

rf <- randomForest::randomForest(x = X_train, y = y_train)
mean(predict(rf, newdata=as.matrix(X_test)) == y_test)

print(head(predict(fit_obj, newx = X_test, type='probs')))
print(head(predict(rf, newdata=as.matrix(X_test), type='prob')))

roc_obj <- pROC::roc(as.numeric(y_test), as.numeric(preds))
pROC::auc(roc_obj)

roc_obj_rf <- pROC::roc(as.numeric(y_test), as.numeric(predict(rf, newdata=as.matrix(X_test))))
pROC::auc(roc_obj_rf)

5 - digits dataset

data("digits")

head(digits)

dim(digits)

set.seed(1234)
train_idx <- sample(nrow(digits), 0.8 * nrow(digits))
X_train <- as.matrix(digits[train_idx, -ncol(digits)])
X_test <- as.matrix(digits[-train_idx, -ncol(digits)])
y_train <- as.factor(digits$target[train_idx])
X_train <- X_train[, -caret::nearZeroVar(X_train)]
y_test <- as.factor(digits$target[-train_idx])
X_test <- X_test[, colnames(X_train)]

ptm <- proc.time()
fit_obj <- bcn::bcn(x = X_train,
                    y = y_train, B = 50L,
                    nu = 0.6549268,
                    lam = 10**0.4635435,
                    r = 1 - 10**(-7),
                    col_sample = 0.8928518,
                    tol = 10**-5.483609,
                    verbose=FALSE,
                    show_progress = FALSE)
cat("Elapsed: ", (proc.time() - ptm)[3])

plot(fit_obj$errors_norm, type='l')

preds <- predict(fit_obj, newx = X_test)

mean(preds == y_test)

table(y_test, preds)

rf <- randomForest::randomForest(x = X_train, y = y_train)
mean(predict(rf, newdata=as.matrix(X_test)) == y_test)

print(head(predict(fit_obj, newx = X_test, type='probs')))
print(head(predict(rf, newdata=as.matrix(X_test), type='prob')))

6 - Penguins dataset

data("penguins")

penguins_ <- as.data.frame(penguins)

replacement <- median(palmerpenguins::penguins$bill_length_mm, na.rm = TRUE)
penguins_$bill_length_mm[is.na(palmerpenguins::penguins$bill_length_mm)] <- replacement

replacement <- median(palmerpenguins::penguins$bill_depth_mm, na.rm = TRUE)
penguins_$bill_depth_mm[is.na(palmerpenguins::penguins$bill_depth_mm)] <- replacement

replacement <- median(palmerpenguins::penguins$flipper_length_mm, na.rm = TRUE)
penguins_$flipper_length_mm[is.na(palmerpenguins::penguins$flipper_length_mm)] <- replacement

replacement <- median(palmerpenguins::penguins$body_mass_g, na.rm = TRUE)
penguins_$body_mass_g[is.na(palmerpenguins::penguins$body_mass_g)] <- replacement

# replacing NA's by the most frequent occurrence
penguins_$sex[is.na(palmerpenguins::penguins$sex)] <- "male" # most frequent

print(summary(penguins_))
print(sum(is.na(penguins_)))

# one-hot encoding for covariates
penguins_mat <- model.matrix(species ~., data=penguins_)[,-1]
penguins_mat <- cbind(penguins_$species, penguins_mat)
penguins_mat <- as.data.frame(penguins_mat)
colnames(penguins_mat)[1] <- "species"

print(head(penguins_mat))
print(tail(penguins_mat))

y <- as.integer(penguins_mat$species)
X <- as.matrix(penguins_mat[,2:ncol(penguins_mat)])

n <- nrow(X)
p <- ncol(X)

set.seed(1234)
index_train <- sample(1:n, size=floor(0.8*n))
X_train <- X[index_train, ]
y_train <- factor(y[index_train])
X_test <- X[-index_train, ]
y_test <- factor(y[-index_train])

ptm <- proc.time()
fit_obj <- bcn::bcn(x = X_train, y = y_train, B = 23, nu = 0.470043,
                    lam = 10**-0.05766029, r = 1 - 10**(-7.905866), tol = 10**-7, 
                    show_progress = FALSE)
cat("Elapsed: ", (proc.time() - ptm)[3])

plot(fit_obj$errors_norm, type='l')

preds <- predict(fit_obj, newx = X_test)

mean(preds == y_test)

table(y_test, preds)

rf <- randomForest::randomForest(x = X_train, y = y_train)
mean(predict(rf, newdata=as.matrix(X_test)) == y_test)

print(head(predict(fit_obj, newx = X_test, type='probs')))
print(head(predict(rf, newdata=as.matrix(X_test), type='prob')))
To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.


An introductory workshop in Shiny, July 25th to 28th

[This article was first published on Pachá, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Buy tickets at https://www.buymeacoffee.com/pacha/e/80760

This workshop aims to introduce people with basic R knowledge to developing interactive web applications with the Shiny framework.

The course consists of a one-hour session, where we will demonstrate basic UI, reactive UI, CSS personalization and dashboard creation. Questions are super welcome!

Previous knowledge required: Basic R (examples: reading a CSV file, transforming columns and making graphs using ggplot2).

The course will be delivered online using Zoom from July 25th to 28th at different times: 16.00, 17.30, and 19.00. Check the timezone: for this workshop, it is New York time (https://www.timeanddate.com/worldclock/usa/new-york).

Each day-hour block is limited to a maximum of 8 participants, to foster participation.

Finally, here’s a short demo of a part of what this workshop covers https://youtu.be/DW-HPfohfwg.

To leave a comment for the author, please follow the link and comment on their blog: Pachá.


How to compare the values of two vectors in R?

[This article was first published on Data Analysis in R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)



How to compare the values of two vectors in R? Using match() and %in%, we can compare vectors.

Today, we’ll talk about comparing two R vectors to see which components (values) they have in common.

Here, we have two choices:

  1. The match() function in R returns the indexes of common elements, indicating where a value from the first vector is present in the second.
  2. The %in% operator produces a vector of TRUE or FALSE answers.

Finding Values in Vectors using R Match

Let’s start with the R match() function.

Value <- c(15, 13, 12, 14, 12, 15, 30)

match() returns the first position of the matching value:

match(12, Value)
[1] 3

An example of match() with multiple values:

match(c(13, 12), Value)
[1] 2 3

When we pass it a vector of multiple values, the R match() function returns the first position of each of the two values.
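
When a value has no match at all, match() returns NA for it:

match(c(10, 12), Value)
[1] NA  3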

The %in% Operator – a Boolean Equivalent

Do you only require a True/False response?

If so, use the %in% operator. It carries out a similar lookup, but returns the result as a Boolean indicating whether the value is present.

An example of the R %in% operator:

Checking for a single value:

14 %in% Value
[1] TRUE

Use the %in% operator to check a vector of multiple values:

c(10, 12) %in% Value
[1] FALSE  TRUE

Values that the %in% operator can’t find are simply returned as FALSE:

c(10, 12, 14) %in% Value
[1] FALSE  TRUE  TRUE

Summary

R match() and the %in% operator: we can compare the values of two vectors using these two tools.



To leave a comment for the author, please follow the link and comment on their blog: Data Analysis in R.


ggdensity: A new R package for plotting high-density regions

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

As data scientists, it can be downright impossible to drill into messy data. Fortunately, there’s a new R package that helps us focus on a “high-density region”, which is simply an area in a scatter plot defined by a high percentage of the data points. It’s called ggdensity.

High Density Regions on a Scatter Plot

In this R-tip, I’m going to show you how to home in on high-density regions in under 5 minutes:

  1. Learn how to make high-density scatter plots with ggdensity
  2. BONUS: Make faceted density plots to drill into over-plotted high-density region data

R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

Here are the links to get set up. 👇

Video Tutorial

I have a companion video tutorial that shows even more secrets (plus mistakes to avoid).

What you make in this R-Tip

By the end of this tutorial, you’ll use high-density regions to draw insights from groups within your data. For example, here we can see how each Class of Vehicle compares in terms of engine displacement (displ) and highway fuel economy (hwy), answering questions like:

  • Is vehicle class a good way to describe vehicle clusters?
  • Which vehicle classes have the greatest variation in highway fuel economy versus displacement?
  • Which vehicle classes have the highest / lowest highway fuel economy?

Do you see how powerful ggdensity is?

Uncover insights with ggdensity

Thank You Developers.

Before we move on, please recognize that ggdensity was developed by James Otto, Doctoral Candidate at the Department of Statistical Science, Baylor University. Thank you for everything you do! Also, the full documentation for ggdensity can be accessed here.

Before we get started, get the R Cheat Sheet

ggdensity is great for extending ggplot2 with advanced features. But, you’ll need to learn ggplot2 to take full advantage. For these topics, I’ll use the Ultimate R Cheat Sheet to refer to ggplot2 code in my workflow.

Quick Example:

Download the Ultimate R Cheat Sheet. Then Click the “CS” hyperlink to “ggplot2”.

Now you’re ready to quickly reference the ggplot2 cheat sheet. This shows you the core plotting functions available in the ggplot library.

ggplot2 cheat sheet

Onto the tutorial.

ggdensity Tutorial

Let’s dive into using ggdensity so we can show you how to make high-density regions on your scatter plots.

Important: All of the data and code shown can be accessed through our Business Science R-Tips Project.

Plus I have a surprise at the end (for everyone)!

💡 Step 1: Load the Libraries and Data

First, run this code to load the R libraries:

Load tidyverse, tidyquant, and ggdensity.

Get the code.
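
If you’d rather not grab the project right away, the loading step boils down to something like this (a minimal sketch):

library(tidyverse)
library(tidyquant)
library(ggdensity)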

Next, run this code to pull in the data.

We’ll read in the mpg data set that comes with ggplot2.

Get the Data.

We want to understand how highway fuel economy relates to engine size (displacement) and to see if there are clusters by vehicle class.
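
A minimal sketch of the data step (the R-Tips project code may differ slightly; mpg_df is just the name used here):

# mpg ships with ggplot2, which tidyverse loads
mpg_df <- ggplot2::mpg

glimpse(mpg_df)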

💡 Step 2: Make a basic ggplot

Next, make a basic ggplot using the following code. This creates a scatter plot with the colors that change by vehicle class. I won’t go into all of the mechanics, but you can download my R cheat sheet to learn more about ggplot and the grammar of graphics.

Get the code.
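
A minimal sketch of such a scatter plot, using the mpg_df from the data step (the tutorial’s exact theming may differ):

ggplot(mpg_df, aes(x = displ, y = hwy, color = class)) +
  geom_point()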

Here’s what the plot looks like. Do you see how it’s really tough to pull out the clusters in there? Each of the points overlap which makes understanding the group structure in the data very tough.

Step 3: Add High Density Regions

Ok, now that we have a basic scatter plot, we can make a quick alteration by adding high density regions that capture 90% and 50% of the data. We use geom_hdr(probs = c(0.9, 0.5), alpha = 0.35) to accomplish the next plot.

Get the code.
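
In context, the layer looks roughly like this (a sketch, not the project’s exact code):

ggplot(mpg_df, aes(x = displ, y = hwy, fill = class)) +
  geom_hdr(probs = c(0.9, 0.5), alpha = 0.35)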

Let’s see what we have here.

We can now see where the clusters have the highest density. But there’s still a problem called “overplotting”, which is when too many graphics get plotted on top of each other.

💡 BONUS: Overplotting solved!

Here’s the problem we’re facing: overplotting. We simply have too many groups that are too close together. Let’s see how to fix this.

The fix is pretty simple. Just use facetting from ggplot2.

Get the code.
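
A sketch of the faceted version:

ggplot(mpg_df, aes(x = displ, y = hwy, fill = class)) +
  geom_hdr(probs = c(0.9, 0.5), alpha = 0.35) +
  facet_wrap(~ class)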

And, voila! We can easily inspect the clusters by vehicle class.

💡 Conclusions

You learned how to use the ggdensity library to create high-density regions that help us understand the clusters within our data. Great work! But, there’s a lot more to becoming a Business Scientist.

If you’d like to become a Business Scientist (and have an awesome career, improve your quality of life, enjoy your job, and all the fun that comes along), then I can help with that.

Step 1: Watch my Free 40-Minute Webinar

Learning data science on your own is hard. I know because IT TOOK ME 5-YEARS to feel confident.

AND, I don’t want it to take that long for you.

So, I put together a FREE 40-minute webinar (a masterclass) that provides a roadmap for what worked for me.

Literally 5-years of learning, consolidated into 40-minutes. It’s jam-packed with value. I wish I had seen this when I was starting… It would have made a huge difference.

Step 2: Take action

For my action-takers, if you are ready to become a Business Scientist, then read on.

If you need to take your skills to the next level and DON’T want to wait 5-years to learn data science for business, AND you want a career you love that earns you a $100,000+ salary (plus bonuses), AND you’d like someone to help you do this in UNDER 6 MONTHS….

Then I can help with that too.

Surprise!

There’s a link in the FREE 40-minute webinar for a special price (because you are special!) and taking that action will kickstart your journey with me in your corner.

Get ready. The ride is wild. And the destination is AMAZING!

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


Learning Path: Introduction to R Shiny

[This article was first published on Mirai Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Enhance your data science toolkit with our “Introduction to R Shiny” learning path: build your first shiny app, make it shine and robust, build a package and bring it to production.

R is among the most popular programming, scripting, and markup languages. Written by statisticians for statisticians, it is an incredible tool for data exploration, data manipulation, visualization and data analysis.

R Shiny empowers data science through an interactive interface. By providing your R code with a UI, it allows you to share the power of data analysis with business users, enables decision-making through visualization and interaction, eases collaboration, and lets you bring your work from prototype all the way to production.

With our “Shiny” Learning Path we offer four workshops bringing you from a total novice all the way to a professional Shiny developer. Learn with industry experts, who use Shiny professionally, how to build an efficient, effective, professional User Interface integrated with your R work.

Four consecutive Wednesdays, from 3 PM (NEW) to 5:30 PM (CEST): a 2.5-hour hands-on experience.

Learn from industry experts with more than ten years of experience using R professionally how to use R efficiently and exploit the wide range of its possibilities.

  • On August 31st, learn the basics of Shiny, build your very first Shiny application and get an understanding of reactivity with the “Build your first Shiny App” workshop.

  • Follow up on September 7th: discover the many packages that expand the Shiny universe and explore how to customize your Shiny app to tailor it to your use case and taste with the “Make a Shiny App sparkle” workshop.

  • Advance your R Shiny development skills on September 14th with the “Advanced Shiny Development” workshop. Get familiar with more complex Shiny constructs and put into practice best practices like using a package structure & testing.

  • Closing up the series on September 21st, the “Bring a Shiny App to Production” workshop will introduce you to the most common DevOps best practices, including automation and collaboration (Git), to bring your Shiny app all the way to production, deploying on ShinyApps.io.

With this series we aim to give you an overview of programming in R Shiny, provide practical examples of effective usage of the language, and describe best development practices and the most common data analysis and technical tools.

Register now on our web page and get a discount for the full learning path!

Full shiny workshop offer

Learning path: Shiny

Learn how to build a web app using R, exercise reactivity with R Shiny, test the app, and improve your DevOps with an agile approach.

Each workshop can be taken stand-alone, but there are convenient deals for following the whole learning path or a part of it.

Register directly on our website for multiple workshops.

  • Early bird discount.
  • Register more attendees and get an additional discount.
  • Register for two or more workshops in the path for a further discount.

NEW: the workshops will be recorded. Recordings will be available for separate purchase (attendees only) after the end of a workshop.

Contact us for any further information or if you would like to have a custom workshop tailored to your needs.

To leave a comment for the author, please follow the link and comment on their blog: Mirai Solutions.
