
Time Series Demand Forecasting


[This article was first published on business-science.io, and kindly contributed to R-bloggers].

R Tutorials Update

Interested in more time series tutorials? Learn more R tips:

👉Register for our blog to get new articles as we release them.


Time Series Demand Forecasting of Brazilian Commodities

Demand forecasting is a technique for estimating the probable demand for a product or service. It is based on the analysis of past demand for that product or service under present market conditions. Demand forecasting should be done on a scientific basis, and the facts and events related to the forecast should be taken into account.

After gathering information about various aspects of the market and past demand, it is possible to estimate future demand. This is what we call demand forecasting.

For example, suppose we sold 200, 250, and 300 units of product X in January, February, and March respectively. Based on that history, we can estimate that there will be a demand for approximately Y units of product X in April.
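To make that idea concrete, here is a minimal sketch of the naive trend extrapolation described above (the numbers are the hypothetical ones from the example; a real forecast would use much more history and a proper time series model):

# Hypothetical monthly sales from the example above
sales <- data.frame(month = 1:3, units = c(200, 250, 300))

# Fit a simple linear trend and extrapolate one month ahead (April)
fit <- lm(units ~ month, data = sales)
predict(fit, newdata = data.frame(month = 4))
## 350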

Demand forecasting key advantages:

  • More effective production scheduling
  • Inventory management and reduction
  • Cost reduction
  • Optimized transport logistics
  • Increased customer satisfaction

Those are just some of the benefits of forecasting demand. As this task is applicable in almost any business area, the concepts and approaches used here can be extrapolated to your problem too, with specific adjustments.

This project aims to forecast the next 3 years of exports of the top 3 Brazilian commodities: soybean, corn, and sugar. Along the way, I'll also cover the usual steps of a data science project: exploratory data analysis, data preparation, data cleaning, feature engineering, and modeling.

The dataset used here comes from a public source, available by clicking here.

Software Requirement

If you want to reproduce the project in your environment, I suggest installing the following packages first, before loading them.
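If any of these are missing, a small sketch like the one below (assuming a standard CRAN setup) installs only the packages that are not already present:

# Install any required packages that are not already installed
pkgs <- c("tidyverse", "skimr", "lubridate", "tidytext", "timetk",
          "gt", "tidyquant", "tidymodels", "modeltime")
install.packages(setdiff(pkgs, rownames(installed.packages())))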

# Data Exploration
library(tidyverse)
library(skimr)
library(lubridate)
library(tidytext)
library(timetk)
library(gt)

# color palette
library(tidyquant)

# modeling
library(tidymodels)
library(modeltime)

Exploratory Data Analysis

As mentioned before, the dataset came from a public source called COMEX STAT. This website provides free access to Brazilian foreign trade statistics.

# reading the data and converting categorical features to factor
exp_imp_tbl <- read_csv("data/data_comexstat.csv") %>%
  mutate_if(is.character, as_factor)

As usual, I like to create a dictionary of the dataset I'm working with, just to keep the meaning of each variable in mind. See the following list:

  • date: date when the export or import transaction occurred (our time series index).
  • product: commodities (sugar, soybean meal, soybean oil, soybean, corn and wheat).
  • state: State responsible for the production.
  • country: Country responsible for the transaction.
  • type: whether the transaction is an export or an import.
  • route: route used to transport the commodity.
  • tons: quantity exported/imported, in tons.
  • usd: transaction value in U.S. dollars.

Overview

The data contains monthly import and export records for a range of products, by Brazilian state, by route (air, sea, ground, etc.) and by origin/destination country.

At the beginning of the process it is a good idea to take a general overview of the data, and for that I love the skimr::skim() function, which is very handy for getting the big picture of your data.

 skim(exp_imp_tbl) 
── Data Summary ────────────────────────
                           Values
Name                       exp_imp_tbl
Number of rows             117965
Number of columns          8
Column type frequency:     Date 1, factor 5, numeric 2
Group variables            None

── Variable type: Date ──────────────────────────
  skim_variable n_missing complete_rate min        max        median     n_unique
1 date                  0             1 1997-01-01 2019-12-01 2012-10-01      276

── Variable type: factor ──────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 product               0             1 FALSE          6 sug: 35202, soy: 22914, cor: 21872, soy: 18215
2 state                 0             1 FALSE         27 SP: 28713, PR: 17155, MT: 16837, GO: 10981
3 country               0             1 FALSE        212 Chi: 7437, Par: 7160, Net: 7158, Arg: 4842
4 type                  0             1 FALSE          2 Exp: 105861, Imp: 12104
5 route                 0             1 FALSE          5 Sea: 93870, Gro: 13038, Oth: 6374, Air: 2918

── Variable type: numeric ──────────────────────────
  skim_variable n_missing complete_rate     mean        sd p0   p25    p50     p75      p100 hist
1 tons                  0             1   14537.    49779.  0  125.   2000  13534.  1798446. ▇▁▁▁▁
2 usd                   0             1 4813150. 19494116.  0 71552 725000 3895943 903930411 ▇▁▁▁▁

As we can see, our dataset has:

  • 1 date variable
  • 5 categorical variables
  • 2 numerical variables

Luckily, there is no missing data in any of these columns. Two things pop up when we look at the type and route variables:

  • type: Brazil exports much more than it imports (more than 100k observations in the export category).
  • route: The route Brazil uses least (considering exports and imports) is air, and the one it uses most is sea.

Production over time

We also saw the range of the date feature: from 1997-01-01 to 2019-12-01. Looking more closely at it, we can investigate how Brazilian exports, considering all states and all destinations, evolved over time.

# data wrangling before plot
# adding two columns: year and month
exp_imp_year_month_tbl <- exp_imp_tbl %>%
  mutate(year  = year(date),
         month = month(date, abbr = FALSE, label = TRUE, locale = Sys.setlocale("LC_COLLATE", "C")))

# total of tons by year (considering just exports)
expt_year_total_tbl <- exp_imp_year_month_tbl %>%
  filter(type == "Export") %>%
  group_by(year) %>%
  summarise(total_tons = sum(tons)) %>%
  ungroup()

# total of tons for each month of the year (considering just exports)
expt_year_month_total_tbl <- exp_imp_year_month_tbl %>%
  filter(type == "Export") %>%
  group_by(year, month) %>%
  summarise(total_tons = sum(tons)) %>%
  ungroup()

# visualizing the monthly total of tons (1997 - 2020)
expt_year_month_total_tbl %>%
  mutate(month = month %>% str_to_title() %>% as_factor()) %>%
  ggplot(aes(x = year, y = total_tons)) +
  geom_point(size = .8) +
  geom_line() +
  facet_wrap(~month) +
  theme_tq() +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  scale_x_continuous(breaks = c(1997, 2000, 2005, 2010, 2015, 2019)) +
  labs(
    title = "Total of Tons for the Monthly Exports Between 1997 and 2020",
    subtitle = "Considering all Brazilian States, and Soybean, Soybeans Meal, Soybean Oil, Sugar, Corn and Wheat commodities",
    x = "", y = "Millions of Tons",
    caption = "linkedin.com/in/lucianobatistads/"
  ) +
  theme(axis.text.x = element_text(size = 7))

In this monthly chart, we see a more pronounced growth trend in exports during the months from March to August.

# building visualizations
# visualizing the annual total of tons exported (1997 - 2020)
expt_year_total_tbl %>%
  ggplot(aes(x = year, y = total_tons)) +
  geom_point() +
  geom_line() +
  theme_tq() +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  scale_x_continuous(breaks = c(1997, 2000, 2005, 2010, 2015, 2019)) +
  labs(
    title = "Total of Tons for the Annual Exports Between 1997 and 2020",
    subtitle = "Considering all Brazilian States, and Soybean, Soybeans Meal, Soybean Oil, Sugar, Corn and Wheat commodities",
    x = "", y = "Millions of Tons",
    caption = "linkedin.com/in/lucianobatistads/"
  ) +
  theme(axis.text.x = element_text(size = 10))

As expected, Brazilian exports follow a growing trend over the years. Even though 2020 brought a terrible result because of COVID-19, exports are likely to return to the previous trend in the next year.

Most Important Commodities

We saw before that our data has 6 different commodities: Soybean, Sugar, Soybeans Meal, Corn, Soybean Oil and Wheat. Let's look at them and see which have been exported the most in the last 5 years.

# data wrangling before plot
# filtering for recent years and type == "Export"
# total of tons for these groups
top_3_product_exp_tbl <- exp_imp_year_month_tbl %>%
  filter(year %in% c(2019:2015)) %>%
  filter(type == "Export") %>%
  group_by(product) %>%
  summarise(total_tons_exp = sum(tons)) %>%
  ungroup() %>%
  slice_max(total_tons_exp, n = 3)

top_3_product_exp_tbl %>%
  mutate(product_str = case_when(
    product == "soybeans" ~ "Soybean",
    product == "corn" ~ "Corn",
    TRUE ~ "Sugar"
  )) %>%
  ggplot(aes(x = total_tons_exp, y = fct_reorder(product_str, total_tons_exp), fill = product_str)) +
  geom_col() +
  scale_fill_manual(values = c("#7EBEF7", "#2595F5", "#BBD7F0")) +
  scale_x_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  guides(fill = FALSE) +
  theme_tq() +
  labs(
    title = "Top 3 - Brazilian Commodities Exports",
    subtitle = "Considering the Last 5 Years",
    caption = "linkedin.com/in/lucianobatistads/",
    x = "Millions of Tons", y = "Commodities"
  )

# good table to plot
top_3_product_exp_tbl %>%
  rename(Commodities = product, `Total of Tons` = total_tons_exp) %>%
  mutate(`Total of Tons` = `Total of Tons` %>% scales::number(scale = 1e-6, suffix = "M")) %>%
  gt()

The plot above shows the top 3 commodities exported by Brazil over the last 5 years: soybean, corn, and sugar, with soybean being the most important. Compared with the others, soybean exports are 55.5% higher than the second (corn) and 63.2% higher than the third (sugar), an enormous difference.
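As a quick sanity check on those percentages, a small sketch (assuming the top_3_product_exp_tbl object created in the chunk above) computes how far the top commodity sits above each of the others:

# How much larger is the top commodity relative to each of the others?
top_3_product_exp_tbl %>%
  arrange(desc(total_tons_exp)) %>%
  mutate(pct_above = (max(total_tons_exp) - total_tons_exp) / total_tons_exp)
# For corn and sugar this ratio should land near 55.5% and 63.2%, as quoted above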

Routes

The commodities we have looked at so far are exported by different routes: sea, ground, air, river, and others. Let's investigate whether certain products tend to be shipped by certain routes.

Before building those visualizations, it sounds like a good idea to keep in mind the most chosen routes, considering all products, to establish a big picture of the situation. Look at the table below:

# data wrangling before plot
# top export routes in recent years
tops_routes_exp <- exp_imp_year_month_tbl %>%
  filter(type == "Export") %>%
  filter(year > 2000) %>%  # selecting most recent years
  count(route) %>%
  mutate(prop = n / sum(n))

# table
tops_routes_exp %>%
  rename(Route = route, Percent = prop) %>%
  mutate(Percent = Percent %>% scales::number(scale = 100, suffix = "%", accuracy = .1)) %>%
  select(-n) %>%
  gt() %>%
  tab_header(title = "Participation of Routes of Exports")

Now let's see what is happening by product and route:

# data wrangling before plot
exp_by_route_product_tbl <- exp_imp_year_month_tbl %>%
  mutate(product = case_when(
    product == "corn" ~ "Corn",
    product == "soybean_meal" ~ "Soybeans Meal",
    product == "soybean_oil" ~ "Soybean Oil",
    product == "sugar" ~ "Sugar",
    product == "soybeans" ~ "Soybean",
    TRUE ~ "Wheat"
  )) %>%
  filter(type == "Export") %>%
  group_by(route, product) %>%
  summarise(n = n()) %>%
  mutate(prop_by_route = n / sum(n)) %>%
  ungroup()

exp_by_route_product_tbl %>%
  ggplot(aes(x = tidytext::reorder_within(product, prop_by_route, route), y = prop_by_route)) +
  geom_col(aes(fill = prop_by_route)) +
  tidytext::scale_x_reordered() +
  coord_flip() +
  facet_wrap(~route, scales = "free") +
  scale_fill_gradient(high = "#144582", low = "#D4E8FF") +
  scale_y_continuous(labels = scales::percent_format(scale = 100, suffix = "%")) +
  labs(
    title = "Proportion of Commodities Exports for all Different Routes",
    caption = "linkedin.com/in/lucianobatistads/",
    y = "", x = "Commodities"
  ) +
  theme_tq() +
  labs(fill = "Proportion by Routes")

Although most products are transported by sea (table above), we observe that, depending on the route, there is a preference for which product is exported.

Considering the three major products exported on each route, we have:

  • Sea: sugar, soybean and soybeans meal.
  • Ground: soybean oil, sugar and corn.
  • Air: corn (much more), soybean and sugar.
  • Other: sugar, soybean oil and corn.
  • River: soybean, corn and sugar.

And a closer look at soybean gives us the following chart:

# data wrangling before plot
tops_routes_sugar_exp <- exp_imp_year_month_tbl %>%
  filter(product == "soybeans") %>%
  filter(type == "Export") %>%
  filter(year > 2000) %>%  # selecting most recent years
  count(route) %>%
  mutate(prop = n / sum(n))

tops_routes_sugar_exp %>%
  ggplot(aes(x = prop, y = fct_reorder(route, prop), fill = route)) +
  geom_col() +
  scale_x_continuous(labels = scales::percent_format(scale = 100, suffix = "%")) +
  scale_fill_manual(values = c("#2595F5", "#BBD7F0", "#BBD7F0", "#BBD7F0", "#7EBEF7")) +
  theme_tq() +
  labs(
    title = "Participation in Each Route of Brazilian Soybean Exports (1997 - 2020)",
    caption = "linkedin.com/in/lucianobatistads/",
    x = "Proportions", y = "Exportation Routes"
  ) +
  guides(fill = F)

Yeah, a very high concentration in the sea transportation route.

Trade Partners

Let's look at our data from another perspective: trade partners. Brazil has a lot of trade partners, that is, the countries with which Brazil exports and imports the most, and it sounds like a good idea to know which countries Brazil has been doing business with.

# data wrangling before plot
# filtering by the last 5 years
# first look on the exportation
exp_trade_partners_corn_sugar_tbl <- exp_imp_tbl %>%
  mutate(year = year(date)) %>%
  filter(year %in% c(2019:2014)) %>%
  filter(type == "Export") %>%
  # removing special characters
  mutate(country2 = country %>% str_replace_all("[[:punct:]]", "") %>% str_trim(side = "both")) %>%
  group_by(year, country2) %>%
  summarise(total_usd = sum(usd)) %>%
  ungroup() %>%
  mutate(year_fc = as.factor(year),
         name = reorder_within(country2, total_usd, year_fc))

exp_trade_partners_corn_sugar_tbl %>%
  # threshold to selecting some countries
  filter(total_usd > 100000000) %>%
  ggplot(aes(x = name, y = total_usd, fill = year)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
  facet_wrap(~year, scales = "free", nrow = 3) +
  labs(
    title = "Trade Partners Evaluation for the last 5 years of Exportation",
    subtitle = "Considering all Brazilian States, and Soybean, Soybeans Meal, Soybean Oil, Sugar, Corn and Wheat Commodities",
    x = "", y = "Millions of U.S. Dollars",
    caption = "linkedin.com/in/lucianobatistads/"
  ) +
  theme_tq()

It's clear that China is our most important export trade partner. The other positions have been rotating between the Netherlands, Spain, and Iran.

Now, from the imports perspective:

# this is the same code, but filtering by importation
imp_trade_partners_corn_sugar_tbl <- exp_imp_tbl %>%
  mutate(year = year(date)) %>%
  filter(year %in% c(2019:2014)) %>%
  filter(type == "Import") %>%
  mutate(country2 = country %>% str_replace_all("[[:punct:]]", "") %>% str_trim(side = "both")) %>%
  group_by(year, country2) %>%
  summarise(total_usd = sum(usd)) %>%
  ungroup() %>%
  mutate(year_fc = as.factor(year),
         name = reorder_within(country2, total_usd, year_fc))

imp_trade_partners_corn_sugar_tbl %>%
  group_by(year) %>%
  slice_max(total_usd, n = 3) %>%
  ungroup() %>%
  ggplot(aes(x = name, y = total_usd, fill = year)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
  facet_wrap(~year, scales = "free") +
  labs(
    title = "Top 3 - Trade Partners Evaluation for the last 5 years of Imports",
    subtitle = "Considering Soybean, Soybeans Meal, Soybean Oil, Sugar, Corn and Wheat Commodities",
    caption = "linkedin.com/in/lucianobatistads/",
    y = "Millions of U.S. Dollars", x = ""
  ) +
  theme_tq()

Brazil's major import trade partners alternate between Argentina, Paraguay, and the USA. Curiously, the participation of the USA seems to have been decreasing over time, as opposed to Argentina.

States and Commodities

Brazil is a huge country, the fifth largest in the world, which gives us different temperature ranges depending on the area you're looking at. This geographical aspect leads to certain crops being produced in some regions rather than others.

Let's see which regions the production of our commodities comes from, in terms of exports:

quest_5_tbl <- exp_imp_tbl %>%
  filter(type == "Export") %>%
  group_by(state, product) %>%
  summarise(total_usd = sum(usd)) %>%
  ungroup() %>%
  group_by(product) %>%
  top_n(5)

quest_5_tbl %>%
  mutate(product = case_when(
    product == "corn" ~ "Corn",
    product == "soybean_meal" ~ "Soybeans Meal",
    product == "soybean_oil" ~ "Soybean Oil",
    product == "sugar" ~ "Sugar",
    product == "soybeans" ~ "Soybeans",
    TRUE ~ "Wheat"
  )) %>%
  ggplot(aes(x = total_usd, y = reorder_within(state, total_usd, product), fill = product)) +
  geom_col() +
  facet_wrap(~product, scales = "free") +
  scale_fill_manual(values = c("#1B8BB5", "#695905", "#1B60B5", "#B55C24", "#B59C12", "#69381A")) +
  scale_y_reordered() +
  scale_x_continuous(labels = scales::number_format(scale = 1e-9, suffix = "Bi", prefix = "$")) +
  guides(fill = F) +
  theme_tq() +
  labs(
    x = "Billions of U.S. Dollars", y = "",
    title = "Top 5 - Most Important Brazilian States by Commodities",
    subtitle = "Considering Exports",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Mato Grosso concentrates most of the exports of soybean oil, soybeans, and soybean meal, along with Rio Grande do Sul and Paraná. São Paulo, on the other hand, participates strongly in sugar exports. And very few states export wheat; the most expressive values come from Rio Grande do Sul, Paraná, and Santa Catarina.

Data Modeling

As we know, there are 3 time series for which to predict the next 3 years of demand in tons: soybean, corn, and sugar. I'll model each one separately, because this way it is easier to understand the underlying rationale behind each series.

Something important to say is that I have already tried different feature engineering approaches, and here I'll show you the one that had the best performance. Another point to clarify is that I'll be using the modeltime framework workflow, integrated with tidymodels principles (quick review down below). As this is the first part of this project, we'll be using just the ARIMA family of models; more advanced topics on modeling and feature engineering will be covered in the second part.

A quick review of modeltime

For those who aren't familiar with the modeltime framework, it's an R package that sets up a time series analysis workflow in a very streamlined way. The package works as an extension of tidymodels applied to time series problems.

I'll summarize the principal steps here (a minimal skeleton follows the list):

  1. Collect data and split into training and test sets
  2. Create & Fit Multiple Models
  3. Add fitted models to a Model Table
  4. Calibrate the models to a testing set.
  5. Perform Testing Set Forecast & Accuracy Evaluation
  6. Refit the models to Full Dataset & Forecast Forward
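The sketch below is a generic skeleton of those six steps; object and column names such as data_tbl, value, and date are placeholders, not objects from this project:

# 1. Split the series into training and testing periods
splits <- time_series_split(data_tbl, assess = "5 years", cumulative = TRUE)

# 2. Create & fit a model on the training set
model_fit <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(value ~ date, data = training(splits))

# 3. Add the fitted model to a model table and 4. calibrate it on the test set
calib_tbl <- modeltime_table(model_fit) %>%
  modeltime_calibrate(new_data = testing(splits))

# 5. Test-set forecast & accuracy evaluation
calib_tbl %>% modeltime_accuracy()

# 6. Refit on the full dataset and forecast forward
calib_tbl %>%
  modeltime_refit(data = data_tbl) %>%
  modeltime_forecast(h = "3 years", actual_data = data_tbl)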

To get more details you can access the documentation here.

Soybean

The first thing to do is set up our tibble with the right timestamp. As we already know, the dataset has a monthly, equally spaced periodicity (a regular time series).

# data wrangling
ts_soybean_tbl <- exp_imp_tbl %>%
  select(date, product, tons) %>%
  filter(product == "soybeans") %>%
  group_by(date, product) %>%
  summarise(total_tons = sum(tons)) %>%
  ungroup()

# visualizing
ts_soybean_tbl %>%
  plot_time_series(.date_var = date, .value = total_tons, .interactive = F, .smooth = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    y = "Millions of Tons",
    title = "Soybean Time Series",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Just by analyzing this visualization, we can see that there is a clear annual seasonality with a multiplicative behavior (values grow over time). We can verify these assumptions with ACF/PACF charts.

# High annual correlation
ts_soybean_tbl %>%
  plot_acf_diagnostics(.date_var = date, total_tons,
                       .interactive = F,
                       .show_white_noise_bars = T) +
  labs(
    title = "Soybean Lag Diagnostics",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Here we confirm the high correlation with annual lags, and also high partial correlations at lags 9, 10, and 11. It is possible to use those features to improve performance, but here we'll be working with the auto_arima engine (based on forecast::auto.arima), which automatically searches for lags during training.

ts_soybean_tbl %>%
  plot_seasonal_diagnostics(.date_var = date, log(total_tons), .interactive = F) +
  labs(
    title = "Soybean Seasonal Diagnostics",
    y = "Log scale",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Here we can see that there is a quarterly pattern as well: exports increase in the second and third quarters of each year.

ts_soybean_tbl %>%
  mutate(year = year(date) %>% as_factor(),
         month = month(date, label = TRUE, locale = Sys.setlocale("LC_COLLATE", "C")) %>% as_factor()) %>%
  ggplot(aes(x = year, y = fct_rev(month), fill = total_tons)) +
  scale_fill_distiller(labels = scales::number_format(scale = 1e-6, suffix = "M"), direction = 1) +
  geom_tile(color = "grey40") +
  labs(
    title = "Totals of Exports of Soybeans Across the Years and Months",
    y = "Months", x = "",
    fill = "Millions of\nTons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Now we have the big picture of what is happening. The cells in the second and third quarters of practically every year are darker, indicating a higher amount of exports in those periods.

Modeling soybean time series

The first step will be to standardize our data by applying a Box-Cox transformation, a power transformation used to reduce variance. As we'll be using the ARIMA family, it is convenient to work this way.

We will also keep track of the lambda value, which is important for back-transforming our data after the modeling phase.
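As a minimal illustration of that round trip (on a made-up vector; the real series is transformed in the chunk below), the transform and its inverse recover the original values as long as the same lambda is reused:

# Box-Cox transform and back-transform on a toy vector
x      <- c(10, 200, 3000, 45000)
lambda <- 0.3   # arbitrary lambda for illustration; box_cox_vec() prints the auto-selected one
x_bc   <- box_cox_vec(x, lambda = lambda)          # forward transform
x_back <- box_cox_inv_vec(x_bc, lambda = lambda)   # inverse transform
all.equal(x, x_back)
## [1] TRUE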

ts_boxcox_soybean_tbl <- ts_soybean_tbl %>%
  select(-product) %>%
  mutate(total_tons = box_cox_vec(total_tons))

boxcox_soybean_lambda <- 0.382376781152543

We don't have that much data to work with; our time series has 265 observations. So I split the data keeping 5 years for assessment and using the rest for training.

soybean_boxcox_splits <- time_series_split(ts_boxcox_soybean_tbl, assess = "5 years", cumulative = TRUE)

train_soybean_boxcox_tbl <- training(soybean_boxcox_splits)
test_soybean_boxcox_tbl  <- testing(soybean_boxcox_splits)

soybean_boxcox_splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, total_tons, .interactive = F) +
  labs(
    title = "Soybean Training and Testing Splits",
    y = "BoxCox transformed values"
  )

Now we can start working through the modeltime workflow shown before.

auto_arima_formula <- formula(total_tons ~ .)

# training
auto_arima_boxcox_fit <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(auto_arima_formula, train_soybean_boxcox_tbl)

# testing
calibration_boxcox_tbl <- modeltime_table(
  auto_arima_boxcox_fit
) %>%
  modeltime_calibrate(new_data = test_soybean_boxcox_tbl)

# accuracy on testing data
soybean_boxcox_accuracy <- calibration_boxcox_tbl %>%
  modeltime_accuracy()

A brief explanation about the auto-arima implementation: the algorithm uses the AIC metric to optimize the p, d, q and P, D, Q parameters, looking for the best values. This metric works a bit like R-squared in the sense that it points the search in the right direction.

You can see the best parameters chosen by the model in the .model_desc column, or in the legend of the following plots.
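A small sketch of how you might inspect both pieces of information (assuming the calibration_boxcox_tbl and soybean_boxcox_accuracy objects created above):

# The model description holds the (p,d,q)(P,D,Q) order chosen by the search
calibration_boxcox_tbl$.model_desc

# The accuracy tibble can be rendered as a table (MAE, RMSE, R-squared, ...)
soybean_boxcox_accuracy %>%
  table_modeltime_accuracy(.interactive = FALSE)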

We get a good R-squared (0.792), but it is a good idea to visualize how well the model fits:

# visualizing the forecast
calibration_boxcox_tbl %>%
  modeltime_forecast(new_data = test_soybean_boxcox_tbl,
                     actual_data = ts_boxcox_soybean_tbl) %>%
  plot_modeltime_forecast(.interactive = F) +
  labs(
    title = "Soybean - Model Performance on Assessment Data",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

I really liked this fit, and we'll stick with this model; it seems to capture the correct seasonality and trend. The next step is to refit the model on all the data and see how it behaves. If needed, the algorithm will update the coefficients to capture the general pattern.

refit_boxcox_tbl <- calibration_boxcox_tbl %>%
  modeltime_refit(data = ts_boxcox_soybean_tbl)

refit_boxcox_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = ts_boxcox_soybean_tbl) %>%
  plot_modeltime_forecast(.interactive = F) +
  labs(
    title = "Soybean - Demand Forecast",
    subtitle = "Prediction for the next 3 years of Soybean Exports",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

refit_boxcox_tbl %>%
  modeltime_accuracy()

Note: whenever you see "UPDATE" as a prefix in the model description, it means the model found better coefficients to explain the data during refitting.

Post-Processing step

We need to back-transform our data because of the Box-Cox transformation applied at the beginning; otherwise the values don't represent export quantities.

forecast_boxcox_soybean_tbl <- refit_boxcox_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = ts_boxcox_soybean_tbl)

forecast_soybean_tbl <- forecast_boxcox_soybean_tbl %>%
  mutate(.value   = box_cox_inv_vec(.value, lambda = boxcox_soybean_lambda),
         .conf_lo = box_cox_inv_vec(.conf_lo, lambda = boxcox_soybean_lambda),
         .conf_hi = box_cox_inv_vec(.conf_hi, lambda = boxcox_soybean_lambda))

forecast_soybean_tbl %>%
  plot_modeltime_forecast(.interactive = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Soybean - Demand Forecast",
    subtitle = "Prediction for the next 3 years of Soybean Exports",
    y = "Millions of Tons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

That is our final result for the demand forecast of the next 3 years of soybean exports, with a 95% confidence interval.

Corn

Here we'll follow the same workflow as in the soybean demand forecast shown before.

ts_corn_tbl <- exp_imp_tbl %>%
  select(date, product, tons) %>%
  filter(product == "corn") %>%
  group_by(date, product) %>%
  summarise(total_tons = sum(tons)) %>%
  ungroup()

ts_corn_tbl %>%
  plot_time_series(.date_var = date, .value = total_tons, .interactive = F, .smooth = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    y = "Millions of Tons",
    title = "Corn Time Series",
    caption = "linkedin.com/in/lucianobatistads/"
  )

We also have annual seasonality with multiplicative behavior. Let's look at the lag diagnostics.

# High annual correlation
ts_corn_tbl %>%
  plot_acf_diagnostics(.date_var = date, total_tons,
                       .interactive = F,
                       .show_white_noise_bars = T) +
  labs(
    title = "Corn Lag Diagnostics",
    caption = "linkedin.com/in/lucianobatistads/"
  )

This confirms our assumption of annual seasonality.

ts_corn_tbl %>%
  plot_seasonal_diagnostics(.date_var = date, total_tons, .interactive = F) +
  labs(
    title = "Corn Seasonal Diagnostics",
    y = "Log scale",
    caption = "linkedin.com/in/lucianobatistads/"
  )

The interesting thing about this chart is that we can see a quarterly seasonality too (similar to the soybean seasonal diagnostics), this time in the third and fourth quarters.

ts_corn_tbl %>%
  mutate(year = year(date) %>% as_factor(),
         month = month(date, label = TRUE, locale = Sys.setlocale("LC_COLLATE", "C")) %>% as_factor()) %>%
  ggplot(aes(x = year, y = fct_rev(month), fill = total_tons)) +
  scale_fill_distiller(labels = scales::number_format(scale = 1e-6, suffix = "M"), direction = 1) +
  geom_tile(color = "grey40") +
  labs(
    title = "Totals of Corn Exports Across the Years and Months",
    y = "Months", x = "",
    fill = "Millions of\nTons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Looking at this heatmap, it is visible that exports have been growing through the years, and that the period of the year with the most exports is the 3rd and 4th quarters.

Modeling corn time series

# transforming target ----
# boxcox
ts_boxcox_corn_tbl <- ts_corn_tbl %>%
  select(-product) %>%
  mutate(total_tons = box_cox_vec(total_tons))

boxcox_corn_lambda <- 0.0676121372845911

# visualizing the transformations ----
ts_boxcox_corn_tbl %>% plot_time_series(date, total_tons)

# splits ----
# boxcox
corn_boxcox_splits <- time_series_split(ts_boxcox_corn_tbl, assess = "5 years", cumulative = TRUE)

train_corn_boxcox_tbl <- training(corn_boxcox_splits)
test_corn_boxcox_tbl  <- testing(corn_boxcox_splits)

corn_boxcox_splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, total_tons, .interactive = F) +
  labs(
    title = "Corn Training and Testing Splits",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Our formula here will be different: by including these features, our model can better capture the seasonality.

# modeling with modeltime ----
# formula
auto_arima_formula <- formula(total_tons ~ . +
                                year(date) +
                                month(date, label = TRUE))

# training
auto_arima_fit <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(auto_arima_formula, train_corn_boxcox_tbl)

# testing
calibration_boxcox_tbl <- modeltime_table(
  auto_arima_fit
) %>%
  modeltime_calibrate(new_data = test_corn_boxcox_tbl)

# accuracy on testing data
corn_accuracy <- calibration_boxcox_tbl %>%
  modeltime_accuracy()

# visualizing the forecast
calibration_boxcox_tbl %>%
  modeltime_forecast(new_data = test_corn_boxcox_tbl,
                     actual_data = ts_boxcox_corn_tbl) %>%
  plot_modeltime_forecast(.interactive = F) +
  labs(
    title = "Corn - Model Performance on Assessment Data",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

The R-squared here is about 0.643, with a good understanding of the seasonality, but the model could not capture the troughs of the time series. We'll stick with this model for now.

Now let’s refit the data:

# refitting ----
# boxcox
refit_boxcox_tbl <- calibration_boxcox_tbl %>%
  modeltime_refit(data = ts_boxcox_corn_tbl)

refit_boxcox_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = ts_boxcox_corn_tbl) %>%
  plot_modeltime_forecast(.interactive = F) +
  labs(
    title = "Corn - Demand Forecast",
    subtitle = "Prediction for the next 3 years of Corn Exports",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

This was our final model.

Post-Processing step

# inverting the transformation
forecast_boxcox_corn_tbl <- refit_boxcox_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = ts_boxcox_corn_tbl)

forecast_corn_tbl <- forecast_boxcox_corn_tbl %>%
  mutate(.value   = box_cox_inv_vec(.value, lambda = boxcox_corn_lambda),
         .conf_lo = box_cox_inv_vec(.conf_lo, lambda = boxcox_corn_lambda),
         .conf_hi = box_cox_inv_vec(.conf_hi, lambda = boxcox_corn_lambda))

forecast_corn_tbl %>%
  plot_modeltime_forecast(.interactive = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Corn - Demand Forecast",
    subtitle = "Prediction for the next 3 years of Corn Exports",
    y = "Millions of Tons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Although our 95% confidence intervals are quite wide, the forecast captures a trend and seasonality similar to previous years.

Sugar

Let’s investigate the final one.

ts_sugar_tbl <- exp_imp_tbl %>%
  select(date, product, tons) %>%
  filter(product == "sugar") %>%
  group_by(date, product) %>%
  summarise(total_tons = sum(tons)) %>%
  ungroup()

ts_sugar_tbl %>%
  tk_summary_diagnostics()

ts_sugar_tbl %>%
  plot_time_series(.date_var = date, .value = total_tons, .interactive = F, .smooth = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    y = "Millions of Tons",
    title = "Sugar Time Series",
    caption = "linkedin.com/in/lucianobatistads/"
  )

This time series seems to change behavior after 2012, with a high spike and a significant increase in the quantity of exports.

What do the ACF and PACF tell us?

# High annual correlation
# lags 11 and 23 also give a good indication of correlation
ts_sugar_tbl %>%
  plot_acf_diagnostics(.date_var = date, total_tons,
                       .interactive = F,
                       .show_white_noise_bars = T) +
  labs(
    title = "Sugar Lag Diagnostics",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Here we see a high correlation mostly within the most recent 70 lags, and negative correlations at older lags. In the PACF plot, lags 2 and 9 seem important for our model.

ts_sugar_tbl %>%
  plot_seasonal_diagnostics(.date_var = date, log(total_tons), .interactive = F) +
  labs(
    title = "Sugar Seasonal Diagnostics",
    y = "Log scale",
    caption = "linkedin.com/in/lucianobatistads/"
  )

As we confirmed, exports have been higher since 2012. But looking at this plot, we don't see any clear seasonality over time.

So let's filter the data and analyse the seasonality after 2012.

ts_sugar_tbl %>%
  filter_by_time(date, .start_date = "2012-01-01") %>%
  plot_seasonal_diagnostics(.date_var = date, total_tons, .interactive = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Sugar Seasonal Diagnostics",
    subtitle = "Seasonal diagnostics considering data since 2012",
    y = "Millions of Tons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Now we can capture an interesting behavior: it seems that the first and third quarters show an increase in exports.

Searching for why this change happened in 2012, I found some events that are probably related to it.

  1. 2012 was the year Brazil increased its production of ethanol.
  2. To produce more ethanol, more sugar cane (the base of ethanol production) had to be planted.
  3. Sugar also comes from sugar cane, so with more sugar cane cultivation we saw an increase in sugar production, which is reflected in its exports.

So, it is very likely that our time series really did change its behavior. Another point is that there is a quarterly seasonality that matches exactly the period of sugar production: 90 days in the summer and 100 days in the winter.

With this context in mind, I’ll use just the data after 2012 for now.

ts_sugar_tbl %>%
  mutate(year = year(date) %>% as_factor(),
         month = month(date, label = TRUE, locale = Sys.setlocale("LC_COLLATE", "C")) %>% as_factor()) %>%
  ggplot(aes(x = year, y = fct_rev(month), fill = total_tons)) +
  scale_fill_distiller(labels = scales::number_format(scale = 1e-6, suffix = "M"), direction = 1) +
  geom_tile(color = "grey40") +
  labs(
    title = "Totals of Sugar Exports Across the Years and Months",
    y = "Months", x = "",
    fill = "Millions of\nTons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Looking at this heatmap, it's visible that exports have intensified by a huge amount since 2012.

I'll therefore use only the years from 2012 onward to model this time series.

Modeling sugar time series

# transforming target ----
# boxcox
ts_boxcox_sugar_tbl <- ts_sugar_tbl %>%
  select(-product) %>%
  filter_by_time(date, .start_date = "2012-01-01") %>%
  mutate(total_tons = box_cox_vec(total_tons))

boxcox_sugar_lambda <- 0.645806609678906

# visualizing the transformations ----
ts_boxcox_sugar_tbl %>% plot_time_series(date, total_tons)

# splits ----
# boxcox
sugar_boxcox_splits <- time_series_split(ts_boxcox_sugar_tbl, assess = "4 years", cumulative = TRUE)

train_sugar_boxcox_tbl <- training(sugar_boxcox_splits)
test_sugar_boxcox_tbl  <- testing(sugar_boxcox_splits)

sugar_boxcox_splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, total_tons, .interactive = F) +
  labs(
    title = "Sugar Training and Testing Splits",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Here I needed to change the amount of data used as assessment data to 4 years instead of 5.

# modeling with modeltime ----
# Modeling boxcox transformed data
# formula (month terms help capture the quarterly seasonality)
auto_arima_formula <- formula(total_tons ~ . +
                                month(date, label = TRUE))

# training
auto_arima_boxcox_fit <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(auto_arima_formula, train_sugar_boxcox_tbl)

# testing
calibration_boxcox_tbl <- modeltime_table(
  auto_arima_boxcox_fit
) %>%
  modeltime_calibrate(new_data = test_sugar_boxcox_tbl)

# accuracy on testing data
sugar_boxcox_accuracy <- calibration_boxcox_tbl %>%
  modeltime_accuracy()

# visualizing the forecast
calibration_boxcox_tbl %>%
  modeltime_forecast(new_data = test_sugar_boxcox_tbl,
                     actual_data = ts_boxcox_sugar_tbl) %>%
  plot_modeltime_forecast(.interactive = F) +
  labs(
    title = "Sugar - Model Performance on Assessment Data",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Although the fit is a little off from the real values, the model captures the general seasonality and trend. We'll stick with this model for now.

# refitting ----
# boxcox
refit_boxcox_tbl <- calibration_boxcox_tbl %>%
  modeltime_refit(data = ts_boxcox_sugar_tbl)

refit_boxcox_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = ts_boxcox_sugar_tbl) %>%
  plot_modeltime_forecast(.interactive = F) +
  labs(
    title = "Sugar - Demand Forecast",
    subtitle = "Prediction for the next 3 years of Sugar Exports",
    y = "BoxCox transformed values",
    caption = "linkedin.com/in/lucianobatistads/"
  )

Post-Processing step

# inverting the transformation
forecast_boxcox_sugar_tbl <- refit_boxcox_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = ts_boxcox_sugar_tbl)

forecast_sugar_model2_tbl <- forecast_boxcox_sugar_tbl %>%
  mutate(.value   = box_cox_inv_vec(.value, lambda = boxcox_sugar_lambda),
         .conf_lo = box_cox_inv_vec(.conf_lo, lambda = boxcox_sugar_lambda),
         .conf_hi = box_cox_inv_vec(.conf_hi, lambda = boxcox_sugar_lambda))

forecast_sugar_model2_tbl %>%
  plot_modeltime_forecast(.interactive = F) +
  scale_y_continuous(labels = scales::number_format(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Sugar - Demand Forecast",
    subtitle = "Prediction for the next 3 years of Sugar Exports",
    y = "Millions of Tons",
    caption = "linkedin.com/in/lucianobatistads/"
  )

So, this is our final model to predict the next 3 years of sugar exports, and also the end of this first phase of the project.

Next Steps! (Important)

Throughout this project, we could see that the ARIMA family of models is a very powerful method. But there are so many machine learning and deep learning algorithms available for time series forecasting that this article would be too long if I put them all here.

Now that we have a great understanding of our dataset, and a really nice baseline model for all three commodities (soybean, corn, and sugar), we can go deeper into modeling and cover advanced topics.

As a spoiler to the next part of this project, check this list:

  • Many more models (modeltime)
  • A lot more feature engineering (recipes)
  • Hyperparameter tuning (tune)
  • Resampling techniques (modeltime.resample)
  • Stacking and ensemble models (modeltime.ensemble)

Author: Luciano Oliveira Batista. Luciano is a chemical engineer and data scientist in training. Learn more on his blog at lobdata.com.


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.




Jumping Rivers and WhyR partnership


[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers].

We love supporting the community around the open source tools that we use on a daily basis. In the past, Jumping Rivers has helped useR user groups and SatRdays events to happen by enabling frictionless sponsorship for European groups. We believe that it is our duty to help grow the community that helps us.

With that in mind, it is our honour to announce that we are proudly sponsoring a new season of events hosted by WhyR.

WhyR is a team of R enthusiasts, from both business and academia, who aim to support local R communities around the world. The foundation organises events focused on the worldwide use of the R statistical software, such as conferences, hackathons and webinars. The initiative is well known for its remarkable commitment to knowledge and diversity in the R programming language and open source sphere.

Their main goals are:

  • spreading knowledge about the R statistical package,
  • supporting programs and pro-development initiatives in the fields of economics, mathematics, statistics and data science to serve educational activities,
  • supporting cooperation between the scientific and business environment.

We find that those goals align well with the Jumping Rivers mission, and we are delighted to work alongside such an honourable organisation.

Stay tuned on our Twitter and LinkedIn to hear more about the events.

All webinars will be streamed on WhyR's YouTube page.


Jumping Rivers are full service, RStudio certified partners. Part of our role is to offer support in RStudio Pro products. If you use any RStudio Pro products, feel free to contact us (info@jumpingrivers.com).

The post Jumping Rivers and WhyR partnership appeared first on Jumping Rivers.


To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.



LondonR Talks – Computer Vision Classification – Turning a Kaggle example into a clinical decision making tool


[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers].

I had the pleasure of speaking at the last LondonR event of 2020. What a strange year it has been! But this put the icing on the cake.

The premise

The premise of my talk was to take a novel Kaggle parasite-cell dataset and show how this type of classification task could be carried over to other areas such as clinical X-ray scanning, diagnostic image condition detection, etc.

The live event

To view the talk, have a look at the LondonR event below. I was on first, and then two very interesting talks followed: Gwynn Sturdevant – FasteR coding: vectorizing computations in R, and Stuart Lodge – Raindrops on roses and whiskers on kittens – a few small things that make me a happy R developer:

Gary Hutson @ London R Event, plus others

Where to get the content

The presentations from the session are available here. The GitHub code for the convolutional neural network can be found by clicking the GitHub button:

I have written a tutorial about this in my previous blog: https://hutsons-hacks.info/nhs-r-community-lightening-talk-computer-vision-classification-how-it-can-aid-clinicians-malaria-cell-case-study-with-r.

New additions

This presentation had two new additions on top of what was presented recently at an NHS-R Community Conference event.

Face Mask Detector

The first of the new additions delved into how computer vision classification can be used with localisation (bounding box) detection to create novel ideas such as a Face Mask Detector:

Computer Vision Classification with FaceNet bounding box localisation

Facial recognition

The Mango team were used as an example of how facial recognition – specifically the YOLO framework – can be used to detect faces in Python:

Mango GIF with YOLO face detection

The code for this Python file is also contained in the LondonR file.

Enough of my ramble

This was a really fun experience and I would urge anyone with an R-based project to sign up for LondonR. The hosts, Mango, are great and you will be treated to a feast of discussions, as well as really friendly people to boot.

Look out for my next blog post on this site and please follow the social icons below to connect with me.


To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsons-hacks.



How to solve Sudoku with R


[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers].


In this post we discuss how to write an R script to solve any Sudoku puzzle. There are some R packages that handle this, but in our case we'll write our own solution. For our purposes, we'll assume the input Sudoku is a 9×9 grid. In the final result, each row, column, and 3×3 box needs to contain exactly one of each integer 1 through 9.
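Before building the solver, here is a small helper, not part of the original solution, that encodes those rules and can be used at the end to sanity-check a completed board:

# Check that every row, column, and 3x3 box of a completed board contains the digits 1-9
is_solved <- function(board) {
  ok_group <- function(v) setequal(v, 1:9)
  rows  <- all(apply(board, 1, ok_group))
  cols  <- all(apply(board, 2, ok_group))
  boxes <- all(sapply(0:8, function(k) {
    r <- (k %/% 3) * 3 + 1:3   # rows of box k
    c <- (k %%  3) * 3 + 1:3   # columns of box k
    ok_group(board[r, c])
  }))
  rows && cols && boxes
}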

Learn more about data science by checking out the great curriculum at 365 Data Science!

Step 0) Define a sample board

Let’s define a sample Sudoku board for testing. Empty cells will be represented as zeroes.

board <- matrix(
  c(0,0,0,0,0,6,0,0,0,
    0,9,5,7,0,0,3,0,0,
    4,0,0,0,9,2,0,0,5,
    7,6,4,0,0,0,0,0,3,
    0,0,0,0,0,0,0,0,0,
    2,0,0,0,0,0,9,7,1,
    5,0,0,2,1,0,0,0,9,
    0,0,7,0,0,5,4,8,0,
    0,0,0,8,0,0,0,0,0),
  byrow = T,
  ncol = 9)

sample sudoku board

Step 1) Find the empty cells

In the first step, let’s write a function that will find all of the empty cells on the board.

find_empty_cells <- function(board) {
  which(board == 0, arr.ind = TRUE)
}

Step 2) Make sure cell placement is valid

Next, we need a function that will check if a cell placement is valid. In other words, if we try putting a number into a particular cell, we need to ensure that the number appears only once in that row, column, and box. Otherwise, the placement would not be valid.

is_valid <- function(board, num, row, col) {
  # Check if any cell in the same row has value = num
  if(any(board[row, ] == num)) {
    return(FALSE)
  }

  # Check if any cell in the same column has value = num
  if(any(board[, col] == num)) {
    return(FALSE)
  }

  # Get cells in num's box
  box_x <- floor((row - 1) / 3) + 1
  box_y <- floor((col - 1) / 3) + 1

  # Get subset of matrix containing num's box
  box <- board[(3 * box_x - 2):(3 * box_x), (3 * box_y - 2):(3 * box_y)]

  # Check if the number appears elsewhere in its box
  if(any(box == num)) {
    return(FALSE)
  }

  return(TRUE)
}

Step 3) Recursively solve the Sudoku

In the third step, we write the function that solves the Sudoku. This function returns TRUE if the input Sudoku is solvable; otherwise, it returns FALSE. The final result is stored in a separate variable.

solve_sudoku <- function(board, needed_cells = NULL, index = 1) {
  # Find all empty cells
  if(is.null(needed_cells))
    needed_cells <- find_empty_cells(board)

  if(index > nrow(needed_cells)) {
    # Set result equal to current value of board
    # and return TRUE
    result <<- board
    return(TRUE)
  } else {
    row <- needed_cells[index, 1]
    col <- needed_cells[index, 2]
  }

  # Solve the Sudoku
  for(num in 1:9) {
    # Test for valid answers
    if(!is_valid(board, num, row, col)) {next} else {
      board2 <- board
      board2[row, col] <- num

      # Retest with input
      if(solve_sudoku(board2, needed_cells, index + 1)) {
        return(TRUE)
      }
    }
  }

  # If not solvable, return FALSE
  return(FALSE)
}

Calling the Sudoku solver

Lastly, we call our Sudoku solver. The result is stored in the variable “result”, as can be seen below.

solve_sudoku(board)
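If the call returns TRUE, the completed grid sits in result (assigned with <<- inside the function). A quick, self-contained way to look at it and sanity-check the rows and columns:

result                                                # the solved 9x9 grid
all(apply(result, 1, function(v) setequal(v, 1:9)))   # every row contains 1-9
all(apply(result, 2, function(v) setequal(v, 1:9)))   # every column contains 1-9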

solve sudoku puzzle

Conclusion

That’s it for this post! If you enjoyed reading this and want to learn more about R or Python, check out the great data science program at 365 Data Science.

The post How to solve Sudoku with R appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.



Maximum Likelihood Distilled


[This article was first published on R on Jorge Cimentada, and kindly contributed to R-bloggers].

We all hear about Maximum Likelihood Estimation (MLE), and we often see hints of it in our model output. As usual, doing things manually can give a better grasp of how our models work. Here's a very short example implementing MLE based on the explanation from Gelman and Hill (2007), pages 404-405.

The likelihood is literally how compatible our outcome variable Y is with our predictor X, given a set of parameters. We compute this measure of compatibility with the probability density function of the normal distribution. In R, dnorm returns this likelihood. The plot on this website gives a very clear intuition of what dnorm returns: it is literally the height of the density curve, in other words, the likelihood. We of course want the highest likelihood, as it indicates greater compatibility.

For example, assuming parameters is a vector with the intercept a, the coefficient b and an error term sigma, we can compute the likelihood for any random values of these coefficients:

loglikelihood <- function(parameters, predictor, outcome) {
  # intercept
  a <- parameters[1]
  # beta coef
  b <- parameters[2]
  # error term
  sigma <- parameters[3]

  # Calculate the likelihood of `y` given `a + b * x`
  ll.vec <- dnorm(outcome, a + b * predictor, sigma, log = TRUE)

  # sum that likelihood over all the values in the data
  sum(ll.vec)
}

# Generate three random values for intercept, beta and error term
inits <- runif(3)

# Calculate the likelihood given these three values
loglikelihood(
  inits,
  predictor = mtcars$disp,
  outcome = mtcars$mpg
)
## [1] -11687.41

That's the log likelihood given the random values for the intercept, the coefficient, and sigma. How does a typical linear model estimate the maximum of this likelihood? It performs an optimization search, trying out a sliding set of values for these unknowns and searching for the combination that returns the maximum:

mle <-
  optim(
    inits, # The three random values for intercept, beta and sigma
    loglikelihood, # The loglik function
    lower = c(-Inf, -Inf, 1.e-5), # The lower bound for the three values (all can be negative except sigma, which is 1.e-5)
    method = "L-BFGS-B",
    control = list(fnscale = -1), # This signals to search for the maximum rather than the minimum
    predictor = mtcars$disp,
    outcome = mtcars$mpg
  )

mle$par[1:2]
## [1] 29.59985346 -0.04121511

Let’s compare that to the result of lm:

coef(lm(mpg ~ disp, data = mtcars))
## (Intercept)        disp 
## 29.59985476 -0.04121512

In layman's terms, MLE checks how compatible the data is with the outcome with respect to a set of coefficients. It repeats that step many times until it finds the combination of coefficients that maximizes the likelihood.
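One more detail worth checking with the objects above: the third element of mle$par is the estimated sigma, and it will be slightly smaller than lm()'s residual standard error, because the MLE divides the sum of squared residuals by n while lm() divides by n - 2.

mle$par[3]                            # sigma estimated by maximum likelihood
sigma(lm(mpg ~ disp, data = mtcars))  # residual standard error from lm()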


To leave a comment for the author, please follow the link and comment on their blog: R on Jorge Cimentada.



3 Top Business Intelligence Tools Compared: Tableau, PowerBI, and Sisense


[This article was first published on r – Appsilon | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers].


Business Intelligence (BI) is used to transform data into actionable insights that provide value to an organization and help achieve its business goals. Reports and dashboards are go-to approaches for modern-day business intelligence tools.

At Appsilon, we are big advocates of R and R Shiny, but we also have significant experience with business intelligence tools. Besides, coding custom solutions from scratch (or coding in general) isn’t always the right answer, especially for simple tasks. Custom-built solutions have their place, but more on that towards the end of the article.

Are you an Excel user who is curious about R Shiny? Read How to Switch from Excel to R Shiny: First Steps

All of the BI tools you’ll see in this article are top contenders in Gartner’s Magic Quadrant. Because of that, all of them are considered to be go-to solutions for enterprise businesses. Here’s a list of our top BI tools:

  • Tableau
  • Microsoft PowerBI
  • Sisense

In addition to evaluating these tools, we’ll also answer the question of whether you should develop custom dashboards from scratch, and when would be an appropriate time for that.

Tableau

This BI tool doesn’t need an introduction. If you’re reading this article, you probably know what Tableau is and what it can do. It is considered one of the leading analytics platforms for good reasons – it is intuitive and easy to use and has remarkable visualization capabilities that look great by default.

Example Tableau Dashboard

Tableau Showcase: Global Market Overview dashboard (https://www.tableau.com/learn/articles/business-intelligence-dashboards-examples)

Further, Tableau can connect to almost any data source you can imagine, handles larger datasets with ease, and has a great community behind it. 

  • Stuck on some visualization problem? A quick Google search will lead you to Tableau forums with an appropriate solution. 
  • Need to track your data on the go? Download the mobile app.

If this wasn’t the case, we seriously doubt that giants like Charles Schwab, PepsiCo, Verizon, and St. Mary’s Bank would use it on a daily basis. 

The impressive features and ease of use don’t mean there aren’t any downsides to using Tableau. A lot of Tableau users complain about inadequate support. In many cases, all that the support team does is recommend that you buy another feature, promising that it will solve your issues. With Tableau, you are also limited in embedding features. If you have many visual elements, it becomes a real challenge to build a responsive dashboard. Further, we’ve found it almost impossible to build an external component that interacts with Tableau charts.

Next, let’s discuss pricing. Tableau offers a 14-day free trial, but once that’s over, the only viable option is to pay $840 for a year-long subscription. That comes to $70 per month, which is by no means cheap (and since we’re all stuck at home – we already have multiple streaming service subscriptions to pay for monthly). Scale this cost to a 10-person team, and that’s an expense you’ll notice ($8,400 per year!). More affordable options are available, but these are either limited in their capabilities or can’t be purchased for a single user. 

For example, maybe some team members only want to look at charts and dashboards – so the Viewer license is enough for them. That will only set you back $12 per month instead of $70. If you want to use Tableau on the web only, the Explorer license is available. This will cost you $35 per user/month, but it requires you to purchase at least five licenses.

To conclude: Tableau is an excellent and easy-to-use BI tool, but it’s not cheap and has multiple potential deal-breaker limitations.


Microsoft PowerBI

PowerBI is a collection of software services, apps, and connectors that work together to turn unrelated sources into coherent, visually immersive, and interactive insights – at least according to Microsoft.

Example PowerBI Dashboard

PowerBI Showcase: Sales and Marketing dashboard (https://docs.microsoft.com/en-us/power-bi/create-reports/sample-sales-and-marketing)

PowerBI is one of the most widely recognized BI tools due to its intuitive interface, various visualization options, and because it looks good by default. Even if something isn’t supported, PowerBI can easily connect to R and Python.

PowerBI comes in a few different flavors: 

  1. PowerBI Desktop – a Windows-only application used to perform analysis, create visualizations and reports
  2. PowerBI Service (Pro) – web application used to create dashboards. Provides a way for sharing results through a collaboration mode
  3. PowerBI Mobile – a mobile application used only for data access, not for analysis. It is available for both Android and iOS

All of these options and plenty of other more technical details made PowerBI a tool of choice for giant corporations, such as Walmart, Apple, and ExxonMobil.

PowerBI is easy to get started with. You’ll only need a couple of crash courses to create impressive data visualizations. Don’t let this fool you though – PowerBI is tough to master, as this requires learning the whole suite of Microsoft tools. At the same time, that’s the biggest reason to use PowerBI – it includes full integration with the Microsoft ecosystem (authorization, Azure, Office 365, and so on). If you are a big fan of Microsoft’s UI, it’s safe to say you’ll find PowerBI somewhat familiar and comfortable.

As with Tableau, PowerBI isn’t all sunshine and rainbows. PowerBI is read-only, and you can’t access the source code. As a result, it is nearly impossible to maintain proper version control. This is a problem that PowerBI shares with Microsoft Excel.

Also, you can’t do too much about visuals within PowerBI. Don’t get us wrong, there is a lot you can tweak, but the amount of options doesn’t necessarily correlate with how charts end up looking. The UI can also feel overwhelming at first because there are too many icons and menus to look at, which takes focus away from the visualizations.

When it comes to pricing, things look better with PowerBI than with Tableau. PowerBI desktop is entirely free, provided you have a Windows machine. It is limited to approximately 2GB of data, but you can always upgrade to the Pro version if you need more. It will set you back $9.99 per month per user, so this shouldn’t be a deal-breaker. There are more advanced Premium versions of PowerBI which are used in enterprise situations, and those currently cost $4,995 per month per dedicated cloud compute and storage resource.

To conclude: PowerBI can be entirely free for individuals and small teams depending on the use case. The tool has its quirks, and it’s not as intuitive as Tableau, but it won’t cost you a small fortune to get started.


Sisense

Sisense falls into a category of excellent but lesser-known BI tools. It’s not that no one is using Sisense – just the opposite – but it’s the lack of brand awareness that makes you think of Tableau or PowerBI first. Even so, we think Sisense gives PowerBI and Tableau a run for their money.

Example Sisense Dashboard

Sisense Showcase: Marketing dashboard (https://www.sisense.com/dashboard-examples/marketing/)

Sisense provides a drag and drop solution for analytics and dashboards, both on-premise and in the cloud. Dashboards and insights can be shared with multiple users, and they can even run their queries, depending on the set permission level.

Sisense has been placed in the Visionary quadrant of the Gartner Magic Quadrant for BI and Analytic platforms (read more). That’s a huge accomplishment because the competition is harsh in this department. Further, Sisense is well-recognized and used by companies such as Hewlett Packard (HP), Phillips, Nasdaq, Motorola, and Wix.

Sisense is divided into two parts:

  • Web interface
  • ElastiCube – an analytical database which must be downloaded locally

In the ElastiCube, you can load and establish connections to the data. Keep in mind that it requires a Windows machine to function. If you are not on Windows, the cloud is the only option. Any form of analysis and visualization is done through the web interface. Sisense is known for its good performance on larger datasets and low-ish hardware requirements.

That all sounds great, but let’s consider the downsides. It has been reported many times that Sisense dashboards only work well on the web. The other big downside is that Sisense doesn’t support sending scheduled reports via email. Besides that, Sisense is a great BI tool that doesn’t lag behind its more well-known competitors. 

With regards to the pricing, there’s not much we can tell you. They offer an annual licensing model tailored to individual business needs, but how much it actually costs for an arbitrary user is unknown.


Should You Develop Dashboards From Scratch?

Answering this question isn’t easy. The only truly correct answer is “it depends”, but you are unlikely to view that as a satisfying answer. We’ll try to demystify this question as much as possible in this section.

Using a BI tool is sometimes enough. If you are analyzing and visualizing data for yourself, it doesn’t have to look amazing. The only important thing is that it gets the job done. This might also be the case for internal analysis and visualization projects in smaller companies.

But what if you are performing and delivering analysis for a client? Can you afford to make your reports look only somewhat decent instead of amazing/impressive? Probably not. That’s where custom-built web applications in the form of a data analytics dashboard come into play.

Most BI tools fail whenever you try to integrate them within other web apps, and you are forced to use their products individually. Most of our clients want to embed dashboards within other tools, keep them fully controlled by external UI elements, and use custom authorization setups. This BI limitation often results in having multiple BI tools within one organization rather than a single integrated solution.

At Appsilon, we are big fans and avid users of R and R Shiny. These technologies have helped us create numerous dashboards for small, medium, large, and even Fortune 500 companies. If you want to leave a professional impression on your clients, developing an analytical solution from scratch isn’t just one of the options – it is the only option.

Example Shiny Dashboard

R Shiny Enterprise Demo Dashboard (source: https://demo.appsilon.ai/shiny-enterprise-demo/)

Our open-source team has developed R packages for UI, routing, and translations to make dashboard development from scratch easier. These packages allow you to deliver an impressive all-round solution in a fraction of the time.
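
To give a flavour of what building ‘from scratch’ looks like in practice, here is a minimal, generic Shiny skeleton (a sketch only; it does not use Appsilon’s packages or reflect their production stack):

library(shiny)

ui <- fluidPage(
  titlePanel("Minimal custom dashboard"),
  sidebarLayout(
    sidebarPanel(selectInput("var", "Variable", names(mtcars))),
    mainPanel(plotOutput("hist"))
  )
)

server <- function(input, output, session) {
  # Draw a histogram of whichever column the user picks
  output$hist <- renderPlot(hist(mtcars[[input$var]], main = input$var))
}

shinyApp(ui, server)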


Conclusion

Covering all of the available business intelligence tools isn’t feasible for a single article. Instead, we wanted to cover a few of the top market leaders. Tableau, PowerBI, and Sisense have their place in modern data analytics and visualization, and it’s up to you to decide which one(s) you’ll use.

If price is your main concern, start with PowerBI. The free version comes with more than enough features for many users, but if you want something completely customizable, R Shiny and other code-based solutions are your only real options.

Learn more: Why You Should Use R Shiny for Enterprise Application Development

With Shiny, you can quickly get up to speed in a couple of weeks or even a couple of days, depending on your prior knowledge of programming. If you want to make a scalable enterprise Shiny dashboard, then you can always reach out to Appsilon for help. We’re continually pushing the limits of what’s possible with Shiny, and we’d be happy to guide you and your company.


Appsilon is hiring! We are primarily seeking senior-level developers with management experience. See our Careers page for all new openings, including openings for a Project Manager and Community Manager.

Article 3 Top Business Intelligence Tools Compared: Tableau, PowerBI, and Sisense comes from Appsilon | End­ to­ End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon | End­ to­ End Data Science Solutions.


The post 3 Top Business Intelligence Tools Compared: Tableau, PowerBI, and Sisense first appeared on R-bloggers.

lmDiallel: a new R package to fit diallel models. The Hayman’s model (type 1)


[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In a previous post we have presented our new ‘lmDiallel’ package (see this link here and see also the original paper in Theoretical and Applied Genetics). It provides several extensions to the lm() function in R, to fit a class of linear models of interest for plant breeders or geneticists, the so-called diallel models. For those who are interested in this topic, we would like to present some examples of diallel models and how to fit them in R. Please, sit back and relax and, if you have comments, let us know, using the email link at the bottom of this post.

But… what is a ‘diallel’?

If you are not a plant breeder or a geneticist in general, you may be asking this question. From the ancient Greek language, the ‘diallel’ word means ‘reciprocating’ and a diallel cross is a set of several possible crosses and selfs between some parental lines. For example, if we take the male lines A, B and C together with the same female lines A, B and C and we imagine to cross those lines with one another, we obtain the selfs A\(\times\)A, B\(\times\)B and C\(\times\)C, the crosses A\(\times\)B, A\(\times\)C and B\(\times\)C and, in some instances, the reciprocals B\(\times\)A, C\(\times\)A and C\(\times\)B (where the father and mother are swapped). The performances of crosses and/or selfs and/or reciprocals can be compared by planning field experiments, usually known as diallel experiments and designed as randomised complete blocks with 3-4 replicates.

The example

Depending on how the experiment is planned, we can have four experimental methods:

  1. Crosses + reciprocals + selfs (complete diallel)
  2. Crosses and reciprocals (no selfs)
  3. Crosses and selfs (no reciprocals)
  4. Only crosses (no selfs, no reciprocals)

In this post we will concentrate on the first design (complete diallel) and we will use a simple example with three parental lines (A, B and C). The csv file (‘diallel1.csv’) is available in an external repository; in the box below we load the data and we use the group_by() function in the ‘dplyr’ package to obtain the means for all crosses and selfs.

library(tidyverse)
rm(list = ls())
df <- read_csv("https://www.casaonofri.it/_datasets/diallel1.csv")
df$Block <- factor(df$Block)
dfM <- df %>% 
  group_by(Par1, Par2) %>% 
  summarise(YieldM = mean(Yield), SEs = sd(Yield)/sqrt(4))
dfM
## # A tibble: 9 x 4
## # Groups:   Par1 [3]
##   Par1  Par2  YieldM   SEs
##   <fct> <fct>   <dbl> <dbl>
## 1 A     A         12 0.740
## 2 A     B         13 0.600
## 3 A     C         14 0.498
## 4 B     A         11 1.00 
## 5 B     B         15 0.332
## 6 B     C         21 0.273
## 7 C     A         17 0.295
## 8 C     B         16 0.166
## 9 C     C         19 1.90

What model do we use?

In order to describe the above dataset, we might think of a two-way ANOVA model, where the ‘father’ and ‘mother’ lines (the ‘Par1’ and ‘Par2’ variables, respectively) are used as the explanatory factors.

This is a very tempting solution, but we should resist: a two way ANOVA model regards the ‘father’ and ‘mother’ effects as two completely different series of treatments, neglecting the fact that they are, indeed, the same genotypes in different combinations. That is exactly why we need specific diallel models to describe the results of diallel experiments!
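
Just to make the comparison concrete (a sketch, not part of the original analysis), the tempting two-way ANOVA on the df data frame loaded above would be fitted as:

# Naive approach: ‘father’ (Par1) and ‘mother’ (Par2) treated as unrelated factors
naiveMod <- lm(Yield ~ Block + Par1 * Par2, data = df)
anova(naiveMod)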

The “Hayman” model (type 1)

The first diallel model was proposed by Hayman (1954) and it was devised for complete diallel experiments, where reciprocals are available. Neglecting the design effects (blocks and/or environments), Hayman’s model is defined as:

\[y _{ijk} = \mu + \textrm{g}_i + \textrm{g}_j + \textrm{ts}_{ij} + \textrm{rg}^a_{i} + \textrm{rg}^b_{j} + rs_{ij} + \varepsilon_{ijk} \quad\quad\quad (Eq. 1)\]

where \(\mu\) is expected value (the overall mean, in the balanced case) and \(\varepsilon_{ijk}\) is the residual random error terms for the observation in the \(k^{th}\) block and with the parentals \(i\) and \(j\). All the other terms correspond to genetic effects, namely:

  1. the \(\textrm{g}_i\) and \(\textrm{g}_j\) terms are the general combining abilities (GCAs) of the \(i^{th}\) and \(j^{th}\) parents. Each term relates to the average performance of a parental line in all its hybrid combinations, under the sum-to-zero constraint (i.e. the sum of \(g\) values for all parentals must be zero). For example, with our balanced experiment, the overall mean is \(\mu = 15.33\), while the mean for the A parent when used as the ‘father’ is \(\mu_{A.} = 13\) and the mean for the same parent A when used as the ‘mother’ is \(\mu_{.A} = 13.33\). Consequently: \[g_A = \left(13 + 13.33 \right)/2 - 15.33 = -2.167\] Analogously, it is \(g_B = -0.167\). (A short numerical check of this calculation is sketched right after this list.)
  2. The \(rg^a_i\) and \(rg^b_j\) terms are the reciprocal general combining abilities (RGCAs) for the \(i^{th}\) and \(j^{th}\) parents. Each term relates to the discrepancy between the effect of a parent when it is used as father/mother and its average effect in all its combinations. For example, considering the parent A, the term \(rg^a_A\) is: \[rg^a_A = \mu_{A.} - \frac{\mu_{A.} + \mu_{.A}}{2} = 13 - 13.167 = -0.167\] Obviously, it must be \(rg^a_A = - rg^b_A\) and it must also be that the sum of all \(rg^a\) terms is zero (sum-to-zero constraint).
  3. The \(\textrm{ts}_{ij}\) term is the total specific combining ability (tSCA) for the combination between the \(i^{th}\) and \(j^{th}\) parents. It relates to the discrepancy from additivity for a specific combination of two parentals. For example, considering the ‘A \(\times\) B’ cross, the expected yield under additivity would be: \[\mu_{A:B} = \mu + \textrm{g}_A + \textrm{g}_B + \textrm{rg}^a_{A} + \textrm{rg}^b_{B} =\]\[ = 15.33 - 2.167 - 0.167 - 0.167 - 0.5 = 12.333\] while the observed yield is 13, a difference (observed minus expected) of \(0.667\). On the other hand, considering the ‘B \(\times\) A’ reciprocal cross, the expected yield under additivity would be: \[\mu_{B:A} = \mu + \textrm{g}_A + \textrm{g}_B + \textrm{rg}^a_{B} + \textrm{rg}^b_{A} =\]\[= 15.33 - 2.167 - 0.167 + 0.167 + 0.5 = 13.667\] while the observed yield is 11, a difference of \(-2.667\). The tSCA for the cross between A and B (regardless of the reciprocal) is the average of these differences, that is \(\textrm{ts}_{AB} = (0.667 - 2.667)/2 = -1\), which matches the estimate reported by the model fit below.
  4. The \(rs_{ij}\) term is the reciprocal specific combining ability (RSCA) for a specific \(ij\) combination, that is the discrepancy between the performances of the two reciprocals (e.g., A \(\times\) B vs. B \(\times\) A). For example, the \(\textrm{rs}_{AB}\) term is the deviation of the observed A \(\times\) B yield from its expectation once the tSCA is included, that is \(\textrm{rs}_{AB} = 0.667 - (-1) = 1.667\), which is the opposite of \(\textrm{rs}_{BA}\).
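
As a quick numerical check of the hand calculations above (a minimal sketch that reuses the dfM means computed earlier), the GCA of parent A can be reproduced directly:

mu <- mean(dfM$YieldM)                           # overall mean, 15.33
muA_father <- mean(dfM$YieldM[dfM$Par1 == "A"])  # mean of A used as father, 13
muA_mother <- mean(dfM$YieldM[dfM$Par2 == "A"])  # mean of A used as mother, 13.33
(muA_father + muA_mother) / 2 - mu               # g_A = -2.167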

Model fitting with R

Hand calculations based on means may be useful to understand the meaning of genetical effects, although they are biased with unbalanced designs and, above all, they are totally uninteresting from a practical point of view: we’d rather fit the model with statistical software.

Let’s assume that all effects are fixed, apart from the residual error. This is a reasonable assumption, as we have a very low number of parentals, which would make the estimation of variance components totally unreliable. We clearly see that Hayman’s model above is a specific parameterisation of a general linear model, and we should be able to fit it with the usual lm() function and related methods. We can, indeed, do so by using our ‘lmDiallel’ extension package, which provides the facilities to generate the correct design matrices for Hayman’s model (and for other diallel models, as we will show in future posts).

At the beginning, we have to install (if necessary) and load the ‘lmDiallel’ package (see box below). Model fitting can be performed by using the GCA(), tSCA(), RGCA() and RSCA() functions as shown in the box below: the resulting lm object can be explored by the usual R methods, such as summary() and anova().

# library(devtools) # Install if necessary
# install_github("OnofriAndreaPG/lmDiallel")
library(lmDiallel)
dMod <- lm(Yield ~ Block + GCA(Par1, Par2) + tSCA(Par1, Par2) +
              RGCA(Par1, Par2) + RSCA(Par1, Par2), data = df)
summary(dMod)
## 
## Call:
## lm(formula = Yield ~ Block + GCA(Par1, Par2) + tSCA(Par1, Par2) + 
##     RGCA(Par1, Par2) + RSCA(Par1, Par2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3500 -0.5644  0.0606  0.4722  2.7911 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.558e+01  5.780e-01  26.962  < 2e-16 ***
## Block2                 -3.772e-01  8.174e-01  -0.461   0.6486    
## Block3                 -3.011e-01  8.174e-01  -0.368   0.7158    
## Block4                 -3.261e-01  8.174e-01  -0.399   0.6935    
## GCA(Par1, Par2)g_A     -2.167e+00  2.890e-01  -7.497 9.77e-08 ***
## GCA(Par1, Par2)g_B     -1.667e-01  2.890e-01  -0.577   0.5695    
## tSCA(Par1, Par2)ts_A:A  1.000e+00  5.780e-01   1.730   0.0965 .  
## tSCA(Par1, Par2)ts_A:B -1.000e+00  4.570e-01  -2.188   0.0386 *  
## tSCA(Par1, Par2)ts_B:B  1.230e-16  5.780e-01   0.000   1.0000    
## RGCA(Par1, Par2)rg_A   -1.667e-01  2.890e-01  -0.577   0.5695    
## RGCA(Par1, Par2)rg_B    5.000e-01  2.890e-01   1.730   0.0965 .  
## RSCA(Par1, Par2)        1.667e+00  3.540e-01   4.709 8.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.734 on 24 degrees of freedom
## Multiple R-squared:  0.8269, Adjusted R-squared:  0.7476 
## F-statistic: 10.42 on 11 and 24 DF,  p-value: 1.129e-06
anova(dMod)
## Analysis of Variance Table
## 
## Response: Yield
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## Block             3   0.784   0.261  0.0869    0.9665    
## GCA(Par1, Par2)   2 244.000 122.000 40.5743 1.999e-08 ***
## tSCA(Par1, Par2)  3  24.000   8.000  2.6606    0.0710 .  
## RGCA(Par1, Par2)  2   9.333   4.667  1.5520    0.2323    
## RSCA(Par1, Par2)  1  66.667  66.667 22.1717 8.710e-05 ***
## Residuals        24  72.164   3.007                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the sake of simplicity, we also built a wrapper function named lm.diallel(), which can be used in the very same fashion as lm(). The syntax is:

lm.diallel(formula, Block, Env, data, fct)

where ‘formula’ specifies the response variable and the two variables for parentals (e.g., Yield ~ Par1 + Par2) and the two arguments ‘Block’ and ‘Env’ are used to specify optional variables, coding for blocks and environments, respectively. The argument ‘data’ is a ‘dataframe’ where to look for the explanatory variables and, finally, ‘fct’ is a string variable coding for the selected model (“HAYMAN1”, for this example; see below).

dMod2 <- lm.diallel(Yield ~ Par1 + Par2, Block = Block,
                    data = df, fct = "HAYMAN1")
summary(dMod2)
## 
## Call:
## lm.diallel(formula = Yield ~ Par1 + Par2, Block = Block, fct = "HAYMAN1", 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3500 -0.5644  0.0606  0.4722  2.7911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## Intercept  1.533e+01  2.890e-01  53.056  < 2e-16 ***
## Block1     2.511e-01  5.006e-01   0.502   0.6205    
## Block2    -1.261e-01  5.006e-01  -0.252   0.8032    
## Block3    -5.000e-02  5.006e-01  -0.100   0.9213    
## g_A       -2.167e+00  2.890e-01  -7.497 9.77e-08 ***
## g_B       -1.667e-01  2.890e-01  -0.577   0.5695    
## ts_A:A     1.000e+00  5.780e-01   1.730   0.0965 .  
## ts_A:B    -1.000e+00  4.570e-01  -2.188   0.0386 *  
## ts_B:B     6.152e-16  5.780e-01   0.000   1.0000    
## rg_A      -1.667e-01  2.890e-01  -0.577   0.5695    
## rg_B       5.000e-01  2.890e-01   1.730   0.0965 .  
## rs_A:B     1.667e+00  3.540e-01   4.709 8.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.734 on 24 degrees of freedom
## Multiple R-squared:  0.8269, Adjusted R-squared:  0.7476 
## F-statistic: 10.42 on 11 and 24 DF,  p-value: 1.129e-06
anova(dMod2)
## Analysis of Variance Table
## 
## Response: Yield
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Block      3   0.784   0.261  0.0869    0.9665    
## GCA        2 244.000 122.000 40.5743 1.999e-08 ***
## tSCA       3  24.000   8.000  2.6606    0.0710 .  
## RGCA       2   9.333   4.667  1.5520    0.2323    
## RSCA       1  66.667  66.667 22.1717 8.710e-05 ***
## Residuals 24  72.164                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above function works very much like the lm() function and makes use of the general purpose linear model solver lm.fit(). Apart from simplicity, another advantage is that the call to lm.diallel() returns an object of both ‘lm’ and ‘diallel’ classes. For this latter class, we built several specific S3 methods, such as the usual anova(), summary() and model.matrix() methods, partly shown in the box above.

Considering that diallel models are usually fitted to determine genetical parameters, we also built the glht.diallelMod() method and the diallel.eff() function, which can be used with the ‘multcomp’ package, to retrieve the complete list of genetical parameters, as shown in the box below.

library(multcomp)
gh <- glht(linfct = diallel.eff(dMod2))
summary(gh, test = adjusted(type = "none")) 
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)    
## Intercept == 0  1.533e+01  2.890e-01  53.056  < 2e-16 ***
## g_A == 0       -2.167e+00  2.890e-01  -7.497 5.85e-08 ***
## g_B == 0       -1.667e-01  2.890e-01  -0.577   0.5691    
## g_C == 0        2.333e+00  2.890e-01   8.074 1.49e-08 ***
## ts_A:A == 0     1.000e+00  5.780e-01   1.730   0.0955 .  
## ts_A:B == 0    -1.000e+00  4.570e-01  -2.188   0.0378 *  
## ts_A:C == 0     1.443e-15  4.570e-01   0.000   1.0000    
## ts_B:A == 0    -1.000e+00  4.570e-01  -2.188   0.0378 *  
## ts_B:B == 0     6.152e-16  5.780e-01   0.000   1.0000    
## ts_B:C == 0     1.000e+00  4.570e-01   2.188   0.0378 *  
## ts_C:A == 0     1.443e-15  4.570e-01   0.000   1.0000    
## ts_C:B == 0     1.000e+00  4.570e-01   2.188   0.0378 *  
## ts_C:C == 0    -1.000e+00  5.780e-01  -1.730   0.0955 .  
## j_A == 0       -1.667e-01  2.890e-01  -0.577   0.5691    
## j_B == 0        5.000e-01  2.890e-01   1.730   0.0955 .  
## j_C == 0       -3.333e-01  2.890e-01  -1.153   0.2592    
## rs_A:B == 0     1.667e+00  3.540e-01   4.709 7.25e-05 ***
## rs_A:C == 0    -1.667e+00  3.540e-01  -4.709 7.25e-05 ***
## rs_B:A == 0    -1.667e+00  3.540e-01  -4.709 7.25e-05 ***
## rs_B:C == 0     1.667e+00  3.540e-01   4.709 7.25e-05 ***
## rs_C:A == 0     1.667e+00  3.540e-01   4.709 7.25e-05 ***
## rs_C:B == 0    -1.667e+00  3.540e-01  -4.709 7.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

Model fitting in two steps

In some cases, the analysis is performed in two steps and a diallel model is fitted to the means of selfs and crosses, which are calculated in the first step. Under the assumption of variance homogeneity and an equal number of replicates, we can fit Hayman’s model by using the lm.diallel() function without the ‘Block’ argument.

dMod3 <- lm.diallel(YieldM ~ Par1 + Par2, 
                    data = dfM, fct = "HAYMAN1")

In this case, we have no reliable estimate of residual error, but the summary() and anova() methods have been enhanced to give us the possibility of passing some information from the first step, i.e. an appropriate estimate of the residual mean square and degrees of freedom; the residual mean square from the first step needs to be appropriately weighted for the number of replicates (i.e., for this example, MSE = 3.007/4 with 24 degrees of freedom).

anova(dMod3, MSE = 3.007/4, dfr = 24)
## Analysis of Variance Table
## 
## Response: YieldM
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## GCA        2 61.000 30.5000 40.5720 2.000e-08 ***
## tSCA       3  6.000  2.0000  2.6605   0.07101 .  
## RGCA       2  2.333  1.1667  1.5519   0.23236    
## RSCA       1 16.667 16.6667 22.1705 8.713e-05 ***
## Residuals 24         0.7518                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(dMod3, MSE = 3.007/4, dfr = 24)
##                Estimate        SE       t value     Pr(>|t|)
## Intercept  1.533333e+01 0.2890117  5.305436e+01 2.157713e-26
## g_A       -2.166667e+00 0.2890117 -7.496812e+00 9.771159e-08
## g_B       -1.666667e-01 0.2890117 -5.766779e-01 5.695269e-01
## ts_A:A     1.000000e+00 0.5780235  1.730034e+00 9.646589e-02
## ts_A:B    -1.000000e+00 0.4569677 -2.188339e+00 3.861373e-02
## ts_B:B     2.417819e-15 0.5780235  4.182908e-15 1.000000e+00
## rg_A      -1.666667e-01 0.2890117 -5.766779e-01 5.695269e-01
## rg_B       5.000000e-01 0.2890117  1.730034e+00 9.646589e-02
## rs_A:B     1.666667e+00 0.3539656  4.708555e+00 8.712864e-05

The genetical parameters can be obtained by using the glht() function and passing the information from the first step within the call to the diallel.eff() function.

gh2 <- glht(linfct = diallel.eff(dMod3, MSE = 3.007/4, dfr = 24))
summary(gh2, test = adjusted(type = "none")) 
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)    
## Intercept == 0  1.533e+01  2.890e-01  53.054  < 2e-16 ***
## g_A == 0       -2.167e+00  2.890e-01  -7.497 5.85e-08 ***
## g_B == 0       -1.667e-01  2.890e-01  -0.577   0.5691    
## g_C == 0        2.333e+00  2.890e-01   8.073 1.49e-08 ***
## ts_A:A == 0     1.000e+00  5.780e-01   1.730   0.0955 .  
## ts_A:B == 0    -1.000e+00  4.570e-01  -2.188   0.0378 *  
## ts_A:C == 0    -8.882e-16  4.570e-01   0.000   1.0000    
## ts_B:A == 0    -1.000e+00  4.570e-01  -2.188   0.0378 *  
## ts_B:B == 0     2.418e-15  5.780e-01   0.000   1.0000    
## ts_B:C == 0     1.000e+00  4.570e-01   2.188   0.0378 *  
## ts_C:A == 0    -8.882e-16  4.570e-01   0.000   1.0000    
## ts_C:B == 0     1.000e+00  4.570e-01   2.188   0.0378 *  
## ts_C:C == 0    -1.000e+00  5.780e-01  -1.730   0.0955 .  
## j_A == 0       -1.667e-01  2.890e-01  -0.577   0.5691    
## j_B == 0        5.000e-01  2.890e-01   1.730   0.0955 .  
## j_C == 0       -3.333e-01  2.890e-01  -1.153   0.2593    
## rs_A:B == 0     1.667e+00  3.540e-01   4.709 7.25e-05 ***
## rs_A:C == 0    -1.667e+00  3.540e-01  -4.709 7.25e-05 ***
## rs_B:A == 0    -1.667e+00  3.540e-01  -4.709 7.25e-05 ***
## rs_B:C == 0     1.667e+00  3.540e-01   4.709 7.25e-05 ***
## rs_C:A == 0     1.667e+00  3.540e-01   4.709 7.25e-05 ***
## rs_C:B == 0    -1.667e+00  3.540e-01  -4.709 7.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

Estimation of variance components (random genetic effects)

In some cases, genetic effects are regarded as random and the aim is to estimate variance components. For this, we can use the mmer() function in the ‘sommer’ package (Covarrubias-Pazaran, 2019), although we need to code two dummy variables:

  1. ‘dr’: assuming a value of zero for selfs, a value of -1 for crosses and 1 for the respective reciprocals;
  2. ‘combination’: which gives the reciprocals the same label (i.e. BA is given the same label as AB), so that any difference is erased.

By using the overlay() function in the ‘sommer’ package and the above dummy variables, we can code the following effects:

  1. GCA effect: ~overlay(Par1, Par2)
  2. RGCA effect: ~overlay(Par1, Par2):dr (please, note that the interaction with ‘dr’ excludes the selfs, which are coded as 0s, and distinguishes the reciprocals, as they are coded with opposite signs)
  3. SCA effect : ~combination (reciprocals are not considered)
  4. RSCA effect: ~combination:dr (same note as above for RGCA)

It would make no sense to estimate the variance components for genetic effects with a diallel experiment based on three parentals. Therefore, we give an example based on the ‘hayman54’ dataset, as available in the ‘lmDiallel’ package and relating to a complete diallel experiment with eight parentals (Hayman, 1954).

rm(list=ls())
library(sommer)
library(lmDiallel)
data(hayman54)

# Dummy variables
hayman54$dr <- ifelse(as.character(hayman54$Par1) < as.character(hayman54$Par2), -1,
                ifelse(as.character(hayman54$Par1) == as.character(hayman54$Par2), 0, 1))
hayman54$combination <- factor( ifelse(as.character(hayman54$Par1) <=
                                         as.character(hayman54$Par2),
                                 paste(hayman54$Par1, hayman54$Par2, sep =""),
                                 paste(hayman54$Par2, hayman54$Par1, sep ="")) )
mod1h <- mmer(Ftime ~ Block, data = hayman54, 
              random = ~ overlay(Par1, Par2)
              + overlay(Par1, Par2):dr
              + combination
              + combination:dr, verbose = F)
summary(mod1h)$varcomp
##                                       VarComp VarCompSE    Zratio Constraint
## overlay(Par1, Par2).Ftime-Ftime    1276.49877 750.02253 1.7019472   Positive
## overlay(Par1, Par2):dr.Ftime-Ftime   17.97648  19.91001 0.9028864   Positive
## combination.Ftime-Ftime            1108.86771 329.97440 3.3604659   Positive
## combination:dr.Ftime-Ftime           29.41085  46.54826 0.6318356   Positive
## units.Ftime-Ftime                   422.99303  75.36714 5.6124330   Positive

We do hope that you enjoyed this post; if you are interested in diallel models, please, stay tuned: we have other examples on the way.

Thanks for reading

Prof. Andrea Onofri, Department of Agricultural, Food and Environmental Sciences, University of Perugia (Italy), andrea.onofri@unipg.it


References


To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.


The post lmDiallel: a new R package to fit diallel models. The Hayman's model (type 1) first appeared on R-bloggers.

Custom Google Analytics Dashboards with R: Downloading Data


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


This week, I’m taking a break from our regular blog content to write about some of the nitty-gritty data science we do within RStudio. Specifically, I’m going to address a question I was asked soon after joining the blogging team:

Which of your blog articles received the most views in the first 15 days they were posted?

I have access to Google Analytics for the blog, and it’s a powerful tool for understanding the flow of visitors through our web site. Nevertheless, the web interface makes it very difficult to compare 15- and 30-day windows of traffic that we need to evaluate blog posts. And given my need to report on the success of the blog to our stakeholders, this turned into a very tedious manual chore.

This Sounds Like A Job For Code-Based Data Science

If you’ve been reading our blog over the past few months, we’ve been writing about how we like to use code-based data science to hide complexity and improve reproducibility. I decided that our tedious process for extracting Google Analytics data posed a great opportunity to practice what we preach and build a custom dashboard in R that will:

  1. Download raw Google Analytics visitor data locally for analysis. Google provides fairly easy access to sampled visitor data through its Application Programming Interface (API), but I wanted to use the full data set.
  2. Give stakeholders a simple graphical interface to interact with the data. I wanted to hide the complexity available in the Google Analytics user interface and give stakeholders one-click access to blog metrics. I also wanted to give them a way to interact with and drill into the data. This also meant writing the code in such a way that it could run in production on a server.
  3. Provide measurements of blog post viewership over their first 15 days of availability. Blog posts receive most of their traffic immediately after being posted and then experience a long-tail of diminishing visits. To properly compare the popularity of blog posts, we had to aggregate their metrics over fixed windows of time beginning on their posting dates.

While it took a few weeks to get the first dashboard working, we now regularly use these R-based dashboards to measure our blog post effectiveness. However, the process of getting the Google Analytics API working was tricky enough that I thought others might find documentation of the process useful.

To achieve this goal, I’ll address each of these steps listed above in its own blog post over the coming weeks. I’ll provide both code and screen shots along the way as well.

Getting Started: How To Download Data From Google Analytics

Before we begin, I want to start with a few caveats:

  1. Google APIs often change over time. The approach I’m showing you works with Version 4 of the Google Analytics API. Google APIs often change over time, and I can’t guarantee that the code I show will work months or years from now.
  2. R isn’t one of the official Google languages listed in their documentation. I use the excellent googleAnalyticsR package written by Mark Edmondson in my dashboard, but Google only officially documents Java, Python, and PHP interfaces.
  3. This may not be the most efficient or best solution. I hope readers who find simpler, easier, or better methods of authorizing the use of the Google Analytics API will document their methods as well. I encourage you to share them with the RStudio community using the googleAnalyticsR tag at https://community.rstudio.com/tag/googleanalyticsr.

With all that said, the dashboards that use this API provide insights into our blog use that would require a great deal of manual work to reproduce using the GA web interface.

Install These Packages If You Don’t Have Them Already

While the finished dashboards use 16 different R packages, the essential ones I use are:

  • gargle. This package helps us set up our Google Analytics (henceforth abbreviated as GA) authorization and credentials.
  • googleAnalyticsR. This essential package allows us to download the raw visitor data using the Google Analytics API.
  • flexdashboard. This package allows us to present the results in a simple Web interface using R Markdown.
  • reactable. This package allows users of the dashboard to browse, search, reorder, and interact with the data presented.

All of these packages are available for download at CRAN using install.packages().

You should also create an R project for your dashboard at this time. We will need a place to store our Google Analytics credentials, and having a project ready to store them will keep things organized.
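
For instance, here is a minimal setup sketch (the package list and the .secrets folder name simply mirror what is used later in this post):

# Install the key packages if you don't have them already
install.packages(c("gargle", "googleAnalyticsR", "flexdashboard", "reactable"))

# Inside the new project, create a folder for credentials and keep it out of version control
dir.create(".secrets", showWarnings = FALSE)
cat(".secrets/\n", file = ".gitignore", append = TRUE)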

Google Analytics credentials: The secrets to success

I want to begin by talking about what I found to be one of the most challenging pieces of the entire project: Creating, authorizing, and applying Google Analytics credentials. It’s not hugely difficult, but it does have a lot of steps you must get right before you can get any data.

Here’s a high-level overview of what we’ll need to get the visitor data for our Google Analytics dashboard. To use the Google Analytics API, we need to present two types of credentials that represent:

  1. The user represented by the client: The API only provides service on behalf of an authorized user. Most people are pretty familiar with this type of authorization; it’s the equivalent of logging into Google with an email address and password. The trick in this case is that the email address we’ll use will be one Google creates for our particular client.
  2. The client requesting service: The API requires that we authorize and provide credentials for each client making requests. In our case, our client will be an R program, which is considered a desktop client. We’ll request a service account to represent our R program and allow direct server-to-server interactions without human involvement.

For any of this to work, the author of the dashboard has to be an authorized user of Google Analytics. You can test this by going to the Google Analytics Home page (analytics.google.com). If you are an authorized user, you’ll see the web dashboard. If you aren’t, you’ll get an error message and will have to ask for access from your Google Analytics Administrator. Keep the contact information for your Google Analytics administrator handy; we’ll need that information again later.

We must perform six steps to download data using the GA API. We need to:

  • Step 1: Request a service account from Google.
  • Step 2: Download the service account key and securely store its JSON file.
  • Step 3: Enable API access to your project.
  • Step 4: Add your service account credentials to your project.
  • Step 5: Create and download your project’s OAUTH credentials from Google.
  • Step 6: Submit both pieces of information to the Google Analytics API and make a test data request.

I’ll walk you through each step individually in the following sections. For readers not interested in the gory details, you can skip ahead to the conclusion of this piece where I’ll recap what we got out of this process and what the next steps are.

Step 1: Request a service account from Google

Google has written a comprehensive document on how to do API authentication. Because we want to build a stand-alone dashboard, we’re going to use the service account option, which Google describes this way:

Service accounts are useful for automated, offline, or scheduled access to Google Analytics data for your own account. For example, to build a live dashboard of your own Google Analytics data and share it with other users.

This sounds like exactly what we want, so let’s use that option. It will take a few sub-steps, but they are fairly straightforward. Jenny Bryan has written a nice overview about how this process works as part of her gargle package; the description of service accounts is at the bottom of the page.

To create your service account, you should:

  1. Go to https://console.cloud.google.com/cloud-resource-manager and click on Create Project (see figure below). While you could also use the web-based Google setup tool recommended by the Google document, I find that using the cloud resource manager page referenced above simplifies naming your project something other than “My Project”.

  1. Give your project a name. Here, we’ve named our project Test Project. Click Create once you’ve entered a name.

  1. Click on Go to project settings on the Google Cloud Dashboard project card. Usually the new project’s card will be at the top left of the page and should have the project name and number. You’ll now go to your project settings page to create a service account to access this project.

  1. Select Create Service Account. You now have a project created, but you don’t yet have a Google user account that can be used with that project. We’ll create that on the Service Account Details screen.

  1. Give your service account a name. The name will automatically populate the Service Account ID field. Record the full Service Account ID generated somewhere; we’ll need to register that account with your Google Analytics administrator. It’s also a good idea to provide a long description of what you intend to do with this account. When you’ve finished with filling in these fields, click Create.

This completes the creation of our service account.

Step 2: Download the service account key

  1. Now that your service account exists, download your key from the three vertical dots menu. Once your account is created, the dashboard will take you back to the Service Accounts page as shown below.

  1. Create your private service account key. Make your browser window wide enough to see the Actions menu with the three vertical dots. Click on those 3 vertical dots, and you’ll see a pop-up menu. Click on Create key.

  1. Select JSON as your key format. The googleAnalyticsR package requires the key to be in JSON format. Once you’ve selected that format, click Create.

  1. Store the downloaded key in a folder within your R project. I typically create a folder in my dashboard project named .secrets where I keep such keys.

At this point, you have the service key credentials you need to make requests. However, we still have a couple more steps to do before we can use the API.

Step 3: Enable API access to your project

The fact you have a valid service key is not enough to start making requests. You still need to enable the API from the Google Dashboard. To do this you:

  1. Go to https://console.cloud.google.com/apis as shown in the screenshot below and then click on Enable APIs and Services.

  1. Search for and click on the Google Analytics API.

  1. Click on Enable to make the API for your project active.

Sadly, the fact you have a valid service key is not enough to start making requests yet. We still need to authorize the user account with GA.

Step 4: Add your service account user credentials to your project

You now need to add the email associated with that key to the list of authenticated project users. To do this, we’re going to return to the Cloud Resource Manager pane at https://console.cloud.google.com/cloud-resource-manager.

Please note that for many Google Analytics configurations, only GA administrators may add new members to a project. If that is the case for you, you will not see the screens shown below. Instead, you must contact your GA administrator and ask them to add your service account email to the project with Viewer rights.

If you do have the appropriate permissions, however, perform the following 3 tasks:

  1. Click on the IAM selection on the left-hand-side menu and select ADD from the top submenu as shown below:

  1. Add your service account user to the project. Enter the email address for your service account, select Viewer as the role, and click Save as shown below.

  1. Verify that your service account email has now been added by observing it in the list of members for this project.

Step 5: Create and download your project’s OAUTH credentials from Google

While you may be questioning why you ever started this seemingly endless project at this point, fear not; we’re almost done. All that remains to do is to create and download the OAUTH credentials for your service key.

Now if you’re anything like me, you’re probably thinking “Wait a minute, I created a service key to bypass all this OAUTH complexity. Why do I need an OAUTH project file now?” I’m glad you asked; it’s because Google:

  • Gathers API statistics on a per-project basis. Google needs to know what project to aggregate your Google Analytics API calls under for reporting and accounting purposes.
  • Needs to defend against excessive API calls. Because you are accessing the API from a computer program, Google has to defend its API against infinite loops and automated attacks. Should Google detect excessive API calls associated with your project, it can throttle its responses to you without affecting other users.

You don’t actually need a project client ID for debugging purposes because the GoogleAnalyticsR package has a default project associated with it. However, this project ID is shared among all programs using the package, and you may find your API calls denied because too many users are actively using the package. You can avoid this issue entirely by setting your own project client ID as shown below.

In my opinion, acquiring an OAuth 2.0 client ID for a service account is poorly documented on the Google API dashboard, in the Google documentation, and in the GoogleAnalyticsR package. I found this process difficult to reproduce for our test project even though I’d been through it for my own dashboards. With that said, it’s fairly straightforward if you start in the proper place as shown below:

  1. Go to the site https://console.developers.google.com/apis/api/analyticsreporting.googleapis.com/. Please note that this is not the Google Cloud API dashboard we went to in Step 3; this is the Google Analytics Report API dashboard. You probably will have no OAuth 2.0 client IDs shown. Click on + CREATE CREDENTIALS at the top of the page.

  1. Select OAuth client ID as the credential you wish to create.

  3. Select Desktop app as the application type and enter a name for your client. I chose the name “Test Google Analytics script.”

  1. Click OK to acknowledge the ID being created, which will return you to the Google Analytics dashboard.

  1. Click the down arrow button next to your new Client ID to download the client ID JSON file. I typically put this file into my .secrets folder where I also keep my service account private key.

Step 6: Submit both pieces of information to the Google Analytics API and make a test data request.

While this multi-step process may have seemed like something out of Lord of the Rings, you should now have all the credentials and permissions needed to make API requests to Google Analytics. So let’s write code to fetch one day’s Google Analytics data for the rstudio.com site.

library(googleAnalyticsR)
library(dplyr)
library(ggplot2)
library(lubridate)
library(reactable)
library(stringr)

## First, authenticate with our client OAUTH credentials from step 5 of the blog post.
googleAuthR::gar_set_client(json = "secrets/oauth-account-key.json")

## Now, provide the service account email and private key
ga_auth(email = "ga-analysis@test-project-291617.iam.gserviceaccount.com",
        json_file = "secrets/service-account-key.json")

## At this point, we should be properly authenticated and ready to go. We can test this
## by getting a list of all the accounts that this test project has access to. Typically,
## this will be only one if you've created your own service key. If it isn't your only
## account, select the appropriate viewId from your list of accounts.
my_accounts <- ga_account_list()
my_id <- my_accounts$viewId     ## Modify this if you have more than one account

## Let's look at all the visitors to our site. This segment is one of several provided
## by Google Analytics by default.
all_users <- segment_ga4("AllTraffic", segment_id = "gaid::-1")

## Let's look at just one day.
ga_start_date <- today()
ga_end_date <- today()

## Make the request to GA
data_fetch <- google_analytics(my_id,
                               segments = all_users,
                               date_range = c(ga_start_date, ga_end_date),
                               metrics = c("pageviews"),
                               dimensions = c("landingPagePath"),
                               anti_sample = TRUE)

## Let's just create a table of the most viewed posts
most_viewed_posts <- data_fetch %>% 
  mutate(Path = str_trunc(landingPagePath, width = 40)) %>% 
  count(Path, wt = pageviews, sort = TRUE)

head(most_viewed_posts, n = 5)

Assuming you have the appropriate permissions, client ID, and service key, you should get a result that looks similar to this one I pulled from the rstudio.com web site.

  Path                                        n
1 rstudio.cloud/index.html                22173
2 rstudio.com/index.html                  18240
3 rstudio.com/products/rstudio/download…  16120
4 www.shinyapps.io/admin/index.html        8327
5 support.rstudio.com/hc/en-us/articles…   7486

Wrapping Up Our First Custom Google Analytics Script

While many of the details of the Google Analytics API may seem elaborate and arcane, I want to emphasize some of the main ideas behind this process:

  • You don’t have to settle for what the Google Analytics user interface gives you. The GA UI contains many general-purpose analytical views. However, if your organization wants to manage to web metrics that its interface doesn’t provide, the GA API and custom code allows you to create your own web metrics from raw GA data.
  • Two credentials unlock your ability to create your own web analytics. While the setup process to access the GA API seems complicated, it really boils down to agreeing on two basic credentials: one for the user authorizing the request and the other for the client program running it.
  • Once you can download your own GA data, you can apply ordinary R code to understand it. While the Google Analytics UI may take you days to learn, once you can download the raw GA data, you can return to R and Python tools to wrangle that data into the web characteristics you want to measure. Best of all, once that code is written, you can hand it to others who don’t have to understand anything about GA or its APIs. Your program becomes a useful tool instead of just a big toolbox.

This post has focused entirely on getting authorized to download Google Analytics data. The next post will focus on how to create a flex dashboard for stakeholders to interact with the data. The last post in this series will show how to create windowed views of this data and publish a self-contained dashboard that can be used on demand from RStudio Connect.

To Learn More

This post only focused on using GoogleAnalyticsR to download data through the GA API, but the package is capable of much, much more. I highly recommend taking a look at the extensive package documentation and its GitHub repository, as well as author Mark Edmondson’s blog.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


The post Custom Google Analytics Dashboards with R: Downloading Data first appeared on R-bloggers.


A Mini MacroEconometer for the Good, the Bad and the Ugly


[This article was first published on R – Data Science Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

 

Gross domestic product (GDP) is the total monetary or market value of all the finished goods and services produced within a country’s/state’s borders in a specific time period. As a broad measure of overall domestic production, it functions as a comprehensive scorecard of a given country’s/state’s economic health. But is this wealth and economic health distributed in a manner that can reduce poverty and narrow the gap between the rich and the poor?

Early each month, the Bureau of Labor Statistics (BLS) of the U.S. Department of Labor announces the total number of employed and unemployed people in the United States for the previous month, along with many characteristics about them. These figures, particularly the unemployment rate—which tells us the percentage of the labor force that is unemployed—receive wide coverage in the media. Because unemployment insurance records relate only to people who have applied for such benefits, and since it is impractical to count every unemployed person each month, the government conducts a monthly survey called the Current Population Survey (CPS) to measure the extent of unemployment in the country [1].

The Census Bureau uses a set of money income thresholds that vary by family size and composition to determine who is in poverty. If a family’s total income is less than the family’s threshold, then that family and every individual in it is considered in poverty. The official poverty thresholds do not vary geographically, but they are updated for inflation using the Consumer Price Index (CPI-U). The official poverty definition uses money income before taxes and does not include capital gains or non-cash benefits (such as public housing, Medicaid, and food stamps) [2].

In this project, I’ve been looking at GDP, the unemployment rate, and the percentage of people in poverty across all the states over the period 2001-2018 to see if there is a meaningful relationship between these three parameters.

There are several popular variations of GDP measurements which can be useful for different purposes. Here, we use Real GDP, an inflation-adjusted measure (in our dataset based on 2012 dollars) that reflects the quantity of goods and services produced by an economy in a given year, with prices held constant from year to year in order to separate out the impact of inflation or deflation from the trend in output over time.

Wealthy States

There are some states which are certainly doing better than the rest according to their GDP numbers. Massachusetts, Connecticut, Alaska, Delaware and New York keep their place in the top ten from 2001-2018 (Fig.1).

Fig. 1

Massachusetts’s economic success is largely a product of possessing the nation’s most educated workforce; along with the close clustering of research institutions and businesses in STEM sectors, this makes for a bustling incubator of innovation and investment. For New York, the financial services sector is the most important area of the state. Professional and business services such as legal advice, administrative services, and management consulting have a big share as well. In addition to Wall Street, New York is steadily growing its technology and entrepreneurship presence. Finance, insurance, real estate, rental, and leasing are the most important areas of Connecticut’s economy. The state’s economic growth is also tied to manufacturing activity; United Technologies Corporation is based in Connecticut. Alaska’s high GDP per capita is due to its small population and its high production output of petroleum, natural gas, coal, gold, zinc, and other precious metals. Other prominent export goods from Alaska include seafood products, such as salmon and cod. Delaware has a reputation as one of the best places in the country for publicly traded American companies to incorporate, largely because of its business-friendly corporate tax laws. More than 50% of publicly traded American companies are incorporated in the state, including 63% of the Fortune 500. Combined with low labor costs, total business costs in the state are 21% below the U.S. average, among the lowest in the country [3].

Frozen, unemployed and rich!?

Among all the states, Alaska certainly is an interesting case. Over the period 2001-2018, the state’s GDP is among the top ten and its percentage of people in poverty is low, yet it has a high unemployment rate (Fig. 2).

Fig. 2

A big portion of Alaska’s jobs are seasonal and only occur in summer. This includes tourism, commercial fishing, firefighting, and others. These seasonal jobs can create large fluctuations in Alaska’s monthly unemployment data and play a role in pushing up the average unemployment. There are few large centers of wage employment in Alaska — Anchorage, Fairbanks, Juneau and a few others, plus the North Slope oil industry, whose workers are mainly commuters from other states. As a result, much of the economy of Alaska is informal and doesn’t make its way into surveys and statistics. Like all statistics, macroeconomic indicators work best when there is a large population. The other cause is Alaska’s small population, which exaggerates the weight of each data point [4].

GDP, Unemployment and Poverty

It is a common view in economics that growth in an economy’s GDP increases employment and reduces unemployment. This theoretical proposition relating output and unemployment is called “Okun’s Law”. However, Okun reported a negative short-run correlation between output growth and unemployment, which doesn’t necessarily hold up for longer-term analysis. The correlation coefficients for our three key economic parameters are shown in Fig. 3.

Fig. 3

As we can see in Fig. 3, there isn’t any unified pattern for the correlation between GDP growth rate and the percentage of people in poverty; their correlation varies across states. It’s interesting to see that for a state like Pennsylvania, GDP growth and poverty move in the same direction, which can potentially create even larger economic and financial injustice.
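As an illustration, state-level correlations like the ones behind Fig. 3 and Fig. 4 can be computed with a few lines of dplyr. The data frame econ and its column names below are hypothetical stand-ins for the BLS/Census data used in this project, so this is only a sketch of the approach, not the exact code used here.

library(dplyr)

# hypothetical input: one row per state and year, with GDP growth,
# unemployment rate and poverty percentage as columns
state_correlations <- econ %>%
  group_by(state) %>%
  summarise(
    gdp_vs_poverty      = cor(gdp_growth, poverty_pct),
    gdp_vs_unemployment = cor(gdp_growth, unemployment_rate),
    unemp_vs_poverty    = cor(unemployment_rate, poverty_pct)
  )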

The case is slightly better for the relation between GDP growth and the unemployment rate, which shows a negative correlation for most of the states, up to a maximum absolute value of 0.89 for Michigan (Fig. 4).

In contrast with the two previous cases, the unemployment rate and the poverty percentage show a unified pattern: for more than half of the states, the correlation lies between 0.5 and 0.91, with California having the highest value (Fig. 4).

Fig. 4

In summary, although GDP shows the direction the economy is moving in, it doesn’t guarantee that all individuals will share the prosperity in the same way. To fight poverty, economic policies should focus on reducing the unemployment rate to make sure that the financial and economic gap between the rich and the poor is narrowed, and that we are on the route to financial justice.

 

References

[1] www.bls.gov

[2] www.census.gov

[3] www.investopedia.com

[4] www.kingeconomicsgroup.com

The post A Mini MacroEconometer for the Good, the Bad and the Ugly first appeared on Data Science Blog.


To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post A Mini MacroEconometer for the Good, the Bad and the Ugly first appeared on R-bloggers.

Monte Carlo Simulation of Bernoulli Trials in R


[This article was first published on Data Science Depot, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Background

A user on Stackoverflow recently asked a question about a program to generate Monte Carlo simulations on Bernoulli trials to calculate coverage percentages using Wald confidence intervals. One of the problems in the code is that probability value calculations are executed on individual observations rather than sums of successes and failures. R is primarily a vector processor and the code does not aggregate the individual observations to counts of successes and failures in order to calculate observed probability values for each sample.

This causes the code to generate a coverage percentage of 0 for p-values above 0.01 at the sample sizes tested in the original post. We use code from the original post to isolate where the error is introduced into the algorithm.

We set a seed, and assign values to m, n, and p, and attempt to generate 10,000 Bernoulli trials of size n.

set.seed(95014)
m <- 10000
n <- 5
p <- 0.01
x <- rbinom(m, size = 1, prob = p)

At this point x is a vector containing 10,000 true = 1, false = 0 values.

> table(x)
x
   0    1 
9913   87

However, x is NOT 10,000 runs of samples of 5 Bernoulli trials. Given this fact, all subsequent processing by the algorithm in the original code will be incorrect.

The next line of code calculates a value for p.hat. This should be a single value, not a vector of 10,000 elements.

p.hat <- x / n
table(p.hat)

> table(p.hat)
p.hat
   0  0.2 
9913   87

An accurate calculation for p.hat would be the following:

> p.hat <- sum(x) / length(x)
> p.hat
[1] 0.0087

…which is very close to the population p-value of 0.01 that we assigned earlier in the code, but still does not represent 10,000 trials of sample size 5. Instead, p.hat as defined above represents one Bernoulli trial with sample size 10,000.

Two minor changes to fix the code

After independently developing a Monte Carlo simulator for Bernoulli trials (see below for details), it becomes clear that with a couple of tweaks we can remediate the code from the original post to make it produce valid results.

First, we multiply m by n in the first argument to rbinom(), so the number of trials produced is 10,000 times sample size. We also cast the result as a matrix with 10,000 rows and n columns.

Second, we use rowSums() to sum the trials to counts of successes, and divide the resulting vector of 10,000 elements by n, producing correct values for p.hat, given sample size. Once p.hat is corrected, the rest of the code works as originally intended.

f3 <- function(n, probs) {
  res1 <- lapply(n, function(i) {
    setNames(lapply(probs, function(p) {
      m <- 10000
      n <- i
      p <- p
      # make number of trials m*n, and store
      # as a matrix of 10,000 rows * n columns
      x <- matrix(rbinom(m * n, size = 1, prob = p), nrow = 10000, ncol = i)
      # p.hat is simply rowSums(x) divided by n
      p.hat <- rowSums(x) / n
      lower.Wald <- p.hat - 1.96 * sqrt(p.hat * (1 - p.hat) / n)
      upper.Wald <- p.hat + 1.96 * sqrt(p.hat * (1 - p.hat) / n)
      p.in.CI <- (lower.Wald < p & p < upper.Wald)
      # the remainder of this function was truncated in the original page
      # extraction; it aggregates p.in.CI into a coverage percentage,
      # along the lines of:
      mean(p.in.CI) * 100
    }), probs)
  })
  res1
}

Starting from scratch: a basic simulator for one p-value / sample size

Here we develop a solution that iteratively builds on a set of basic building blocks: one p-value, one sample size, and a 95% confidence interval. The simulator also tracks parameters so we can combine results from multiple simulations into data frames that are easy to read and interpret.

First, we create a simulator that tests 10,000 samples of a given size, drawn from a Bernoulli distribution with a given probability value. It aggregates successes and failures, calculates Wald confidence intervals, and generates an output data frame. For the purposes of the simulation, the p-values we pass to the simulator represent the “true” population probability value. We will see how frequently the simulations include the population p-value in their confidence intervals.

We set parameters to represent a true population p-value of 0.5, a sample size of 5, and z-value of 1.96 representing a 95% confidence interval. We created function arguments for these constants so we can vary them in subsequent code. We also use set.seed() to make the results reproducible.

set.seed(90125)
simulationList <- lapply(1:10000, function(x, p_value, sample_size, z_val) {
  trial <- x
  successes <- sum(rbinom(sample_size, size = 1, prob = p_value))
  observed_p <- successes / sample_size
  z_value <- z_val
  lower.Wald <- observed_p - z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
  upper.Wald <- observed_p + z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
  data.frame(trial, p_value, observed_p, z_value, lower.Wald, upper.Wald)
}, 0.5, 5, 1.96)

A key difference between this code and the code from the original question is that we take samples of 5 from rbinom() and immediately sum the number of true values to calculate the number of successes. This allows us to calculate observed_p as successes / sample_size. Now we have an empirically generated version of what was called p.hat in the original question.

The resulting list includes a data frame summarizing the results of each trial.

We combine the list of data frames into a single data frame with do.call()

simulation_df <- do.call(rbind,simulationList)

At this point simulation_df is a data frame containing 10,000 rows and 6 columns. Each row represents the results from one simulation of sample_size Bernoulli trials. We’ll print the first few rows to illustrate the contents of the data frame.

> dim(simulation_df)
[1] 10000     6
> head(simulation_df)
  trial p_value observed_p z_value  lower.Wald upper.Wald
1     1     0.5        0.6    1.96  0.17058551  1.0294145
2     2     0.5        0.2    1.96 -0.15061546  0.5506155
3     3     0.5        0.6    1.96  0.17058551  1.0294145
4     4     0.5        0.2    1.96 -0.15061546  0.5506155
5     5     0.5        0.2    1.96 -0.15061546  0.5506155
6     6     0.5        0.4    1.96 -0.02941449  0.8294145

Notice how the observed_p values are distinct values in increments of 0.2. This is because when sample size is 5, the number of TRUE values in each sample can vary between 0 and 5. A histogram of observed_p makes this clear.

[Histogram of observed_p]

Even with a sample size of 5, we can see the shape of a binomial distribution emerging in the histogram.
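For reference, a histogram like the one above can be reproduced with base graphics. This is a minimal sketch using the simulation_df object built earlier; the breakpoints simply give each of the six possible observed_p values its own bar.

hist(simulation_df$observed_p,
     breaks = seq(-0.1, 1.1, by = 0.2),
     main = "observed_p for 10,000 simulations of sample size 5",
     xlab = "observed_p")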

Next, we calculate the coverage percentage by summing the rows where the population p-value (represented as p_value) is within the Wald confidence interval.

# calculate coverage: % of simulations where population p-value is
# within Wald confidence limits generated via simulation
sum(simulation_df$p_value > simulation_df$lower.Wald &
      simulation_df$p_value < simulation_df$upper.Wald) / 10000 * 100

> sum(simulation_df$p_value > simulation_df$lower.Wald &
+       simulation_df$p_value < simulation_df$upper.Wald) / 10000 * 100
[1] 93.54

Coverage of 93.54% is a reasonable result for the simulation, given that we calculated a 95% confidence interval. We interpret the result as 93.5% of the samples generated Wald confidence intervals that included the population p-value of 0.5.

Therefore, we conclude that our simulator appears to be generating valid results. We will build on this basic design to execute simulations with multiple p-values and sample sizes.

Simulating multiple p-values for a given sample size

Next, we’ll vary the probability values to see the percentage coverage for 10,000 samples of 5 observations. Since the statistics literature such as Sauro and Lewis, 2005 tells us that Wald confidence intervals have poor coverage for very low and very high p-values, we’ve added an argument to calculate Adjusted Wald scores. We’ll set this argument to FALSE for the time being.

p_val_simulations <- lapply(c(0.01, 0.1, 0.4, .5, .8), function(p_val) {
  aSim <- lapply(1:10000, function(x, p_value, sample_size, z_val, adjWald) {
    trial <- x
    successes <- sum(rbinom(sample_size, size = 1, prob = p_value))
    if (adjWald) {
      successes <- successes + 2
      sample_size <- sample_size + 4
    }
    observed_p <- sum(successes) / (sample_size)
    z_value <- z_val
    lower.Wald <- observed_p - z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
    upper.Wald <- observed_p + z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
    data.frame(trial, p_value, sample_size, observed_p, z_value, adjWald, lower.Wald, upper.Wald)
  }, p_val, 5, 1.96, FALSE)
  # bind results to 1 data frame & return
  do.call(rbind, aSim)
})

The resulting list, p_val_simulations contains one data frame for each p-value run through the simulation.

We combine these data frames and calculate coverage percentages as follows.

do.call(rbind, lapply(p_val_simulations, function(x) {
  p_value <- min(x$p_value)
  adjWald <- as.logical(min(x$adjWald))
  sample_size <- min(x$sample_size) - (as.integer(adjWald) * 4)
  coverage_pct <- (sum(x$p_value > x$lower.Wald &
                         x$p_value < x$upper.Wald) / 10000) * 100
  data.frame(p_value, sample_size, adjWald, coverage_pct)
}))

As expected, the coverage is very poor the further we are away from a p-value of 0.5.

  p_value sample_size adjWald coverage_pct
1    0.01           5   FALSE         4.53
2    0.10           5   FALSE        40.23
3    0.40           5   FALSE        83.49
4    0.50           5   FALSE        94.19
5    0.80           5   FALSE        66.35

However, when we rerun the simulation with adjWald = TRUE, we get the following results.

  p_value sample_size adjWald coverage_pct
1    0.01           5    TRUE        95.47
2    0.10           5    TRUE        91.65
3    0.40           5    TRUE        98.95
4    0.50           5    TRUE        94.19
5    0.80           5    TRUE        94.31

These are much better coverage values, particularly for p-values close to the ends of the distribution.
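To make the adjustment explicit outside the simulation code, here is a small standalone sketch of the Adjusted Wald interval used above: add two successes and two failures, then apply the usual Wald formula.

adjusted_wald <- function(successes, n, z = 1.96) {
  successes <- successes + 2
  n <- n + 4
  p_hat <- successes / n
  half_width <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half_width, upper = p_hat + half_width)
}

# even a sample with zero successes yields a usable interval
adjusted_wald(successes = 0, n = 5)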

The final task remaining is to modify the code so it executes Monte Carlo simulations at varying levels of sample size. Before proceeding further, we calculate the runtime for the code we’ve developed thus far.

system.time() tells us that the code to run 5 different Monte Carlo simulations of 10,000 Bernoulli trials each, with a sample size of 5, takes about 38 seconds on a MacBook Pro 15 with a 2.5 GHz Intel i7 processor. Therefore, we expect that the next simulation will take multiple minutes to run.
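If you want to reproduce such a measurement, wrap the lapply() call shown above in system.time(). The toy expression below only illustrates the mechanics (it is not the full simulation, and elapsed times will vary by machine):

timing <- system.time({
  replicate(10000, sum(rbinom(5, size = 1, prob = 0.5)))
})
timing["elapsed"]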

Varying p-value and sample size

We add another level of lapply() to account for varying the sample size. We have also set the adjWald parameter to FALSE so we can see how the base Wald confidence interval behaves at p = 0.01 and 0.10.

set.seed(95014)
system.time(sample_simulations <- lapply(c(10, 15, 20, 25, 30, 50, 100, 150, 200), function(s_size) {
  lapply(c(0.01, 0.1, 0.8), function(p_val) {
    aSim <- lapply(1:10000, function(x, p_value, sample_size, z_val, adjWald) {
      trial <- x
      successes <- sum(rbinom(sample_size, size = 1, prob = p_value))
      if (adjWald) {
        successes <- successes + 2
        sample_size <- sample_size + 4
      }
      observed_p <- sum(successes) / (sample_size)
      z_value <- z_val
      lower.Wald <- observed_p - z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
      upper.Wald <- observed_p + z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
      data.frame(trial, p_value, sample_size, observed_p, z_value, adjWald, lower.Wald, upper.Wald)
    }, p_val, s_size, 1.96, FALSE)
    # bind results to 1 data frame & return
    do.call(rbind, aSim)
  })
}))

Elapsed time on the MacBook Pro was 217.47 seconds, or about 3.6 minutes. Given that we ran 27 different Monte Carlo simulations, the code completed one simulation every 8.05 seconds.

The final step is to process the list of lists to create an output data frame that summarizes the analysis. We aggregate the content, combine rows into data frames, then bind the resulting list of data frames.

summarizedSimulations <- lapply(sample_simulations, function(y) {
  do.call(rbind, lapply(y, function(x) {
    p_value <- min(x$p_value)
    adjWald <- as.logical(min(x$adjWald))
    sample_size <- min(x$sample_size) - (as.integer(adjWald) * 4)
    coverage_pct <- (sum(x$p_value > x$lower.Wald &
                           x$p_value < x$upper.Wald) / 10000) * 100
    data.frame(p_value, sample_size, adjWald, coverage_pct)
  }))
})

results <- do.call(rbind, summarizedSimulations)

One last step, we sort the data by p-value to see how coverage improves as sample size increases.

results[order(results$p_value,results$sample_size),]

…and the output:

> results[order(results$p_value,results$sample_size),]
   p_value sample_size adjWald coverage_pct
1     0.01          10   FALSE         9.40
4     0.01          15   FALSE        14.31
7     0.01          20   FALSE        17.78
10    0.01          25   FALSE        21.40
13    0.01          30   FALSE        25.62
16    0.01          50   FALSE        39.65
19    0.01         100   FALSE        63.67
22    0.01         150   FALSE        77.94
25    0.01         200   FALSE        86.47
2     0.10          10   FALSE        64.25
5     0.10          15   FALSE        78.89
8     0.10          20   FALSE        87.26
11    0.10          25   FALSE        92.10
14    0.10          30   FALSE        81.34
17    0.10          50   FALSE        88.14
20    0.10         100   FALSE        93.28
23    0.10         150   FALSE        92.79
26    0.10         200   FALSE        92.69
3     0.80          10   FALSE        88.26
6     0.80          15   FALSE        81.33
9     0.80          20   FALSE        91.88
12    0.80          25   FALSE        88.38
15    0.80          30   FALSE        94.67
18    0.80          50   FALSE        93.44
21    0.80         100   FALSE        92.96
24    0.80         150   FALSE        94.48
27    0.80         200   FALSE        93.98

Interpreting the results

The Monte Carlo simulations illustrate that Wald confidence intervals provide poor coverage at a p-value of 0.01, even with a sample size of 200. Coverage improves at a p-value of 0.10, where all but two of the simulations at sample sizes 25 and above exceeded 90%. Coverage is even better for the p-value of 0.80, where all but one of the sample sizes above 15 exceeded 91% coverage.

Coverage improves further when we calculate Adjusted Wald confidence intervals, especially at lower p-values.

results[order(results$p_value,results$sample_size),]
   p_value sample_size adjWald coverage_pct
1     0.01          10    TRUE        99.75
4     0.01          15    TRUE        98.82
7     0.01          20    TRUE        98.30
10    0.01          25    TRUE        97.72
13    0.01          30    TRUE        99.71
16    0.01          50    TRUE        98.48
19    0.01         100    TRUE        98.25
22    0.01         150    TRUE        98.05
25    0.01         200    TRUE        98.34
2     0.10          10    TRUE        93.33
5     0.10          15    TRUE        94.53
8     0.10          20    TRUE        95.61
11    0.10          25    TRUE        96.72
14    0.10          30    TRUE        96.96
17    0.10          50    TRUE        97.28
20    0.10         100    TRUE        95.06
23    0.10         150    TRUE        96.15
26    0.10         200    TRUE        95.44
3     0.80          10    TRUE        97.06
6     0.80          15    TRUE        98.10
9     0.80          20    TRUE        95.57
12    0.80          25    TRUE        94.88
15    0.80          30    TRUE        96.31
18    0.80          50    TRUE        95.05
21    0.80         100    TRUE        95.37
24    0.80         150    TRUE        94.62
27    0.80         200    TRUE        95.96

The Adjusted Wald confidence intervals provide consistently better coverage across the range of p-values and sample sizes, with an average coverage of 96.72% across the 27 simulations. This is consistent with the literature that indicates Adjusted Wald confidence intervals are more conservative than unadjusted Wald confidence intervals.

At this point we have a working Monte Carlo simulator that produces valid results for multiple p-values and sample sizes. We can now review the code to find opportunities to optimize its performance.

Optimizing the solution

Following the old programming aphorism of “Make it work, make it right, make it fast,” working the solution out in an iterative manner helped me develop a solution that produces valid results.

Understanding how to make it right enabled me not only to see the flaw in the code posted in the question, but also to envision a solution. That solution, using rbinom() once with an argument of m * n, casting the result as a matrix(), and then using rowSums() to calculate p-values, led me to see how I could optimize my own solution by eliminating thousands of rbinom() calls from each simulation.

Refactoring for performance

We create a function, binomialSimulation(), that generates Bernoulli trials and Wald confidence intervals with a single call to rbinom(), regardless of the number of trials in a single simulation. We also aggregate results so each simulation generates a data frame containing one row describing the results of the test.

set.seed(90125)
binomialSimulation <- function(trial_size, p_value, sample_size, z_value) {
  trials <- matrix(rbinom(trial_size * sample_size, size = 1, prob = p_value),
                   nrow = trial_size, ncol = sample_size)
  observed_p <- rowSums(trials) / sample_size
  lower.Wald <- observed_p - z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
  upper.Wald <- observed_p + z_value * sqrt(observed_p * (1 - observed_p) / sample_size)
  coverage_pct <- sum(p_value > lower.Wald &
                        p_value < upper.Wald) / 10000 * 100
  data.frame(sample_size, p_value, avg_observed_p = mean(observed_p), coverage_pct)
}

We run the function with a population p-value of 0.5, a sample size of 5, 10,000 trials, and a 95% confidence interval, and track the execution time with system.time(). The optimized function is 99.8% faster than the original implementation described earlier in the article, which runs in about 6.09 seconds.

system.time(binomialSimulation(10000, 0.5, 5, 1.96))

> system.time(binomialSimulation(10000, 0.5, 5, 1.96))
   user  system elapsed 
  0.015   0.000   0.015

We skip the intermediate steps from the solution above, instead presenting the optimized version of the iteratively developed solution.

system.time(results <- do.call(rbind, lapply(c(5, 10, 15, 20, 25, 50, 100, 250),
                                             function(aSample_size, p_values) {
  do.call(rbind, lapply(p_values, function(a, b, c, d) {
    binomialSimulation(p_value = a,
                       trial_size = b,
                       sample_size = aSample_size,
                       z_value = d)
  }, 10000, 5, 1.96))
}, c(0.1, 0.4, 0.8))))

As expected, elimination of the thousands of unnecessary calls to rbinom() radically improves performance of the solution.

   user  system elapsed 
  0.777   0.053   0.830

Given that our prior solution ran in 217 seconds, performance of the optimized version is really impressive. Now we have a solution that not only generates accurate Monte Carlo simulations of Bernoulli trials, but it’s also very fast.


To leave a comment for the author, please follow the link and comment on their blog: Data Science Depot.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Monte Carlo Simulation of Bernoulli Trials in R first appeared on R-bloggers.

Generalized fiducial inference on quantiles


[This article was first published on Saturn Elephant, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My new package ‘gfiExtremes’ is on CRAN now. So it is time to present it.

This package allows you to get confidence intervals for the quantiles of any reasonable distribution (although the inference is based on a parametric model). The statistical inference is fiducial.

To give an illustration, I’m taking a sample of length 100 randomly generated from a Weibull distribution:

set.seed(1111111111L)
X <- rweibull(100L, shape = 1.5)
plot(X, pch = 19, main = "Data")

The model used for the fiducial inference assumes a generalized Pareto distribution above a certain threshold. For an unknown value of this threshold, the function to use is gfigpd2. It runs an MCMC sampler, and one has to specify the length of the burnin phase, the desired length of the MCMC chains after the burnin, and the thin value (e.g. a thin of 2 means that every second sampled value is dropped). One also has to specify the desired probability levels of the quantiles we are interested in.

library(gfiExtremes)
chains <- gfigpd2(
  X,                                          # data
  beta = c(99, 99.5, 99.9)/100,               # probability levels
  threshold.init = 0.7,                       # initial threshold value
  burnin = 20000L, iter = 20000L, thin = 10L  # MCMC chains
)

By default, gfigpd2 runs four MCMC chains and they are generated in parallel.

The output of gfigpd2 is an R object ready for analysis with the ‘coda’ package, which is loaded by ‘gfiExtremes’. In particular, it has a summary method:

summary(chains)
## 
## Iterations = 20001:219991
## Thinning interval = 10 
## Number of chains = 4 
## Sample size per chain = 20000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## beta1 3.297  1.172 0.004145        0.01229
## beta2 3.892  3.539 0.012513        0.03705
## beta3 6.789 80.903 0.286035        0.72901
## 
## 2. Quantiles for each variable:
## 
##        2.5%   25%   50%   75%  97.5%
## beta1 2.512 2.858 3.096 3.444  5.240
## beta2 2.765 3.157 3.459 3.963  7.307
## beta3 3.250 3.740 4.268 5.362 17.127

The ‘coda’ package provides the HPDinterval function which gives the shortest confidence intervals:

HPDinterval(joinMCMCchains(chains))
##          lower     upper
## beta1 2.371387  4.591365
## beta2 2.524189  5.910930
## beta3 2.893593 11.091305
## attr(,"Probability")
## [1] 0.95

Below are the true values of the Weibull quantiles; they are caught by the confidence intervals:

qweibull(c(99, 99.5, 99.9)/100, shape = 1.5)
## [1] 2.767985 3.039196 3.627087
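This containment can also be checked programmatically; here is a small sketch reusing the HPD intervals computed above:

true_q <- qweibull(c(99, 99.5, 99.9)/100, shape = 1.5)
hpd <- HPDinterval(joinMCMCchains(chains))
# should be TRUE: every true quantile lies inside its interval
all(true_q > hpd[, "lower"] & true_q < hpd[, "upper"])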

Convergence diagnostics

Now one has to check that the MCMC chains have entered their stationary phase. It is better to take the logarithm of the simulations of the fiducial distributions of the quantiles:

logChains <- as.mcmc.list(lapply(chains, log))

The ‘ggmcmc’ package is helpful here. Firstly, let’s have a look at the traces:

library(ggmcmc)
gglogChains <- ggs(logChains)

library(ggthemes)
ggs_traceplot(gglogChains) + theme_fivethirtyeight()

Visually, nothing indicates a departure from the convergence. Let’s look at the estimated densities now:

ggs_density(gglogChains) + theme_fivethirtyeight()

The running means quickly stabilize:

ggs_running(gglogChains) + theme_fivethirtyeight()

Below are the densities of the whole chains compared with the densities of their last part:

ggs_compare_partial(gglogChains) + theme_fivethirtyeight()

The autocorrelations nicely decrease:

ggs_autocorrelation(gglogChains) + theme_fivethirtyeight()

Let’s also have a look at the Gelman-Rubin diagnostic:

gelman.diag(logChains)
## Potential scale reduction factors:
## 
##       Point est. Upper C.I.
## beta1          1       1.00
## beta2          1       1.01
## beta3          1       1.01
## 
## Multivariate psrf
## 
## 1

The upper Rhat values are close to 1, indicating a successful diagnostic.

Finally, let’s look at the Heidelberger & Welch diagnostic:

heidel.diag(logChains)
## [[1]]
##                                     
##       Stationarity start     p-value
##       test         iteration        
## beta1 passed       1         0.591  
## beta2 passed       1         0.356  
## beta3 passed       1         0.189  
##                               
##       Halfwidth Mean Halfwidth
##       test                    
## beta1 passed    1.17 0.00818  
## beta2 passed    1.31 0.01201  
## beta3 passed    1.60 0.03032  
## 
## [[2]]
##                                     
##       Stationarity start     p-value
##       test         iteration        
## beta1 passed       1         0.652  
## beta2 passed       1         0.571  
## beta3 passed       1         0.400  
##                               
##       Halfwidth Mean Halfwidth
##       test                    
## beta1 passed    1.17 0.00713  
## beta2 passed    1.30 0.01047  
## beta3 passed    1.58 0.02423  
## 
## [[3]]
##                                     
##       Stationarity start     p-value
##       test         iteration        
## beta1 passed       1         0.0720 
## beta2 passed       1         0.0687 
## beta3 passed       1         0.0789 
##                               
##       Halfwidth Mean Halfwidth
##       test                    
## beta1 passed    1.17 0.00856  
## beta2 passed    1.31 0.01278  
## beta3 passed    1.59 0.03042  
## 
## [[4]]
##                                     
##       Stationarity start     p-value
##       test         iteration        
## beta1 passed       1         0.444  
## beta2 passed       1         0.276  
## beta3 passed       1         0.164  
##                               
##       Halfwidth Mean Halfwidth
##       test                    
## beta1 passed    1.16 0.00709  
## beta2 passed    1.30 0.01008  
## beta3 passed    1.57 0.01926

All tests passed.


To leave a comment for the author, please follow the link and comment on their blog: Saturn Elephant.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Generalized fiducial inference on quantiles first appeared on R-bloggers.

Gold-Mining Week 12 (2020)


[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Week 12 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 12 (2020) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Gold-Mining Week 12 (2020) first appeared on R-bloggers.

Using LaTeX in R/exams: What, Why, How?


[This article was first published on R/exams, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

LaTeX is a document preparation system that is very powerful but can sometimes also be a bit confusing. Hence, some useful details are provided that should help R/exams users get LaTeX installed and use it effectively for preparing exams.

Using LaTeX in R/exams: What, Why, How?

What?

LaTeX is a software system for document preparation, see the corresponding Wikipedia page for an overview. The system encompasses various elements that are important to distinguish:

  • It uses a markup language that can be written in any simple text editor with commands structuring a document (title, sections, etc.), formatting text (italics, bold face, etc.), and providing a powerful mathematical notation.
  • Subsequently, a TeX engine can be used to process such a text file and produce an output document that can be viewed, printed, etc. The probably most commonly-used TeX engine is pdfTeX rendering LaTeX files into PDF documents.
  • TeX distributions ship collections of such TeX engines along with further utilities, macros, and fonts for processing LaTeX files. The most widely-used TeX distributions are TeX Live available on all major operating systems and MikTeX for Windows.

Why?

In general, LaTeX is very useful for professional typesetting, especially of scientific documents, and is the most widely-used markup language for mathematical notation. However, the complexity of the LaTeX markup language can also be confusing, especially for newcomers, and hence it can be easier to use other markup languages such as Markdown for many documents. But even when using Markdown as the main markup for a document, LaTeX may still be involved in producing an output document: either only for mathematical notation (e.g., in HTML output) and/or for rendering PDF output. In short, due to its power and flexibility, LaTeX is used either explicitly or implicitly in the preparation of a lot of output documents.

Consequently, R/exams leverages LaTeX, mainly for three purposes:

  • As the main markup language for authoring exercises and formatting text in R/LaTeX (Rnw) format. However, this is optional and the R/Markdown (Rmd) format for authoring exercises is probably more widely used. See the First Steps tutorial for a side-by-side comparison.
  • As the markup language for mathematical notation. This is used in both Rnw and Rmd exercises and can be displayed in various ways. See the Mathematical Notation tutorial for some guidance, especially when rendering online exams.
  • As the engine for producing PDF output. This is also used for both Rnw and Rmd exercises, especially when using exams2pdf() for flexible, customizable PDF output and exams2nops() for standardized PDF output that can be automatically scanned and evaluated.

How?

If you use R/exams for rendering exercises without mathematical notation into online exams then you likely do not need LaTeX at all (and can stop reading here).

If you only use mathematical notation but still only generate online HTML-based exams, especially for e-learning materials, then you need a bit of the markup language but do not actually need to install a TeX engine as part of a TeX distribution.

However, when you produce PDF output, be it from Rnw or Rmd exercises, then you need to install a TeX distribution with suitable TeX engines. In case of R/Markdown exercises in Rmd format, R/exams converts the Markdown markup to LaTeX behind the scenes (using the powerful pandoc document converter), before using a TeX engine to render the PDF output. The following comments should help you choose an appropriate way to install and interface a TeX distribution when working with R/exams.
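As a minimal sketch of what that looks like in practice (the exercise file name "myexercise.Rmd" is a placeholder, not a file shipped with the package): once a suitable TeX distribution is available, a single Rmd exercise can be rendered to a PDF exam with one call.

library(exams)

# "myexercise.Rmd" is a hypothetical exercise file in the working directory
exams2pdf("myexercise.Rmd", n = 1)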

The TeX distribution providing the TeX engine(s) can either be installed “as usual” on the system, e.g., by downloading MikTeX for Windows or installing TeX Live from a standard repository under Linux or OS X. Alternatively, the R package tinytex offers the possibility for installing a lightweight TeX distribution called “TinyTeX”, based on TeX Live. The latter option is particularly attractive for R/exams users that do not actually use LaTeX for other projects. This is due to a couple of reasons: Everything is handled through R; no special super user rights are needed for the installation; and the installation is minimal with additional style files etc. being installed automatically only if needed. To install the R package tinytex you just need

install.packages("tinytex")

or alternatively use the graphical user interface in RStudio etc. To install the TeX distribution “TinyTeX” you just need

tinytex::install_tinytex()

Finally, even when you have installed a TeX distribution (other than TinyTeX) on your system (e.g., TeX Live or MikTeX), the R package tinytex may (or may not) still be useful for producing PDF exams. Three cases need to be distinguished.

  1. System LaTeX via tools::texi2dvi(). When: The R package tinytex is not installed, or when setting options(exams_tex = "tools") (see the sketch after this list). Why: For R/exams users that also use LaTeX for other projects, do not need any further assistance, and hence do not need to introduce another dependency in their code.

  2. System LaTeX via tinytex::latexmk(). When: The R package tinytex is installed but not the TinyTeX distribution (and the exams_tex option is not set to "tools"). Why: This tries to resolve some dependencies automatically (e.g., additional style files). Also, it not only supports the "pdflatex" engine but also "xelatex" or "lualatex" etc., which may be useful/necessary for handling certain fonts.

  3. TinyTeX via tinytex::latexmk(). When: The R package tinytex is installed and tinytex::install_tinytex() was used to install the TinyTeX distribution (and the exams_tex option is not set to "tools"). Why: For users that do not use LaTeX outside of R, as it is more lightweight and very easy to install/maintain.
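For example, to force case 1 even when tinytex is installed, set the exams_tex option before rendering. This is only a sketch; the exercise file name is again a placeholder.

options(exams_tex = "tools")
exams2pdf("myexercise.Rmd", n = 1)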


To leave a comment for the author, please follow the link and comment on their blog: R/exams.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Using LaTeX in R/exams: What, Why, How? first appeared on R-bloggers.

Surge: Data Labeling You Can Trust


[This article was first published on Edwin Chen's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

tl;dr I started Surge earlier this year to fix the problems I’ve always encountered with getting high-quality, human-labeled data at scale. Think MTurk 2.0—but with an obsessive focus on quality and speed, and an elite workforce you can trust. If you’ve ever had problems getting human-annotated data …


To leave a comment for the author, please follow the link and comment on their blog: Edwin Chen's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Surge: Data Labeling You Can Trust first appeared on R-bloggers.

rstudio::global(2021) Diversity Scholarships


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

rstudio global

rstudio::global(2021) will be a very different conference from previous years. We will miss being together physically, but we are enthusiastic about planning this free, virtual event designed to be inclusive of R users in every time zone. Even though the conference itself is free, we are continuing our tradition of diversity scholarships, but with a different focus.

This year, we have planned for 70 diversity scholarships available to individuals around the world who are members of groups that are underrepresented at rstudio::conf(). These groups include people of color, those with disabilities, elders/older adults, LGBTQ folks, and women/minority genders. In past years, we have had to limit our scholarships geographically due to visa issues, and we are happy to have no such limitations this year.

The scholarships will have three main components:

  • Opportunities for online networking and support before and during the virtual conference
  • Two workshops, taught online the week after rstudio::global(2021)
  • Practical support, if needed, to enable participation in the virtual conference (such as an accessibility aid, a resource for internet access, or childcare)

The two workshops will be taught by some of RStudio’s most skilled and experienced educators, focusing on topics about sharing knowledge and teaching others.

Mine Çetinkaya-Rundel will lead a workshop titled: “What they forgot to teach you about teaching R”:

In this workshop, you will learn about using the RStudio IDE to its full potential for teaching R. Whether you’re an educator by profession, or you do education as part of collaborations or outreach, or you want to improve your workflow for giving talks, demos, and workshops, there is something for you in this workshop. During the workshop we will cover live coding best practices, tips for using RStudio Cloud for teaching and building learnr tutorials, and R Markdown based tools for developing instructor and student facing teaching materials.

Alison Hill will lead a workshop on building websites using R Markdown:

“You should have a website!” You may have heard this one before, or even said it yourself. In this workshop, you’ll learn how to build and customize a website from the comfort of the RStudio IDE using the blogdown package. We’ll also cover basic website care and feeding like using R Markdown to create content, and how to use GitHub and Netlify to improve your workflow. Pre-work will be shared with participants ahead of time, but to get the most out of this workshop, you’ll want to have a GitHub account and be able to push and pull files from a GitHub repository using your local RStudio IDE.

Since this year’s diversity scholarships focus on skills for teaching and sharing, applications will be evaluated considering experience and plans relevant to those skills. We know that people with marginalized identities are often experts and leaders investing in our communities, not only beginners. There are two main criteria:

  • Do you already have some experience with R and GitHub? The workshops will assume some working knowledge of both, so show us some of your current work. (If you are new to GitHub, check out GitHub’s Hello World to get some of your content up in under an hour.)
  • What are your current experiences and future plans for knowledge sharing and community building? This is the main theme and focus of our diversity scholarships this year, and we want to multiply our impact through individuals who will spread the love.

Even with 70 scholarships this year, we expect them to remain competitive, so be sure to highlight your own unique perspective. The application deadline is December 18, 2020.

Apply now!


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post rstudio::global(2021) Diversity Scholarships first appeared on R-bloggers.


Brain image segmentation with torch


[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When classification is not enough

True, sometimes it’s vital to distinguish between different kinds of objects. Is that a car speeding towards me, in which case I’d better jump out of the way? Or is it a huge Doberman (in which case I’d probably do the same)? Often in real life though, instead of coarse-grained classification, what is needed is fine-grained segmentation.

Zooming in on images, we’re not looking for a single label; instead, we want to classify every pixel according to some criterion:

  • In medicine, we may want to distinguish between different cell types, or identify tumors.

  • In various earth sciences, satellite data are used to segment terrestrial surfaces.

  • To enable use of custom backgrounds, video-conferencing software has to be able to tell foreground from background.

Image segmentation is a form of supervised learning: some kind of ground truth is needed. Here, it comes in the form of a mask: an image, of spatial resolution identical to that of the input data, that designates the true class for every pixel. Accordingly, classification loss is calculated pixel-wise; losses are then summed up to yield an aggregate to be used in optimization.
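As a minimal sketch of what "pixel-wise" means in practice (using made-up tensors, not the training loop developed later in this post), binary cross-entropy is evaluated at every spatial location and then reduced to a single scalar:

library(torch)

# fake prediction and mask, shaped (batch, channels, height, width)
pred <- torch_rand(4, 1, 256, 256)
mask <- (torch_rand(4, 1, 256, 256) > 0.5)$to(dtype = torch_float())

# one loss value per pixel, averaged into a single scalar for optimization
loss <- nnf_binary_cross_entropy(pred, mask)
loss$item()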

The “canonical” architecture for image segmentation is U-Net (around since 2015).

U-Net

Here is the prototypical U-Net, as depicted in the original Ronneberger et al. paper (Ronneberger, Fischer, and Brox 2015).

Of this architecture, numerous variants exist. You could use different layer sizes, activations, ways to achieve downsizing and upsizing, and more. However, there is one defining characteristic: the U-shape, stabilized by the “bridges” crossing over horizontally at all levels.

In a nutshell, the left-hand side of the U resembles the convolutional architectures used in image classification. It successively reduces spatial resolution. At the same time, another dimension – the channels dimension – is used to build up a hierarchy of features, ranging from very basic to very specialized.

Unlike in classification, however, the output should have the same spatial resolution as the input. Thus, we need to upsize again – this is taken care of by the right-hand side of the U. But, how are we going to arrive at a good per-pixel classification, now that so much spatial information has been lost?

This is what the “bridges” are for: At each level, the input to an upsampling layer is a concatenation of the previous layer’s output – which went through the whole compression/decompression routine – and some preserved intermediate representation from the downsizing phase. In this way, a U-Net architecture combines attention to detail with feature extraction.
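Concretely, each "bridge" is just a concatenation along the channels dimension. Here is a tiny sketch with hypothetical tensor shapes (not the actual up_block code used below):

upsampled <- torch_rand(1, 64, 64, 64)  # output of the up-sampling step
bridge    <- torch_rand(1, 64, 64, 64)  # stored output from the down-sizing phase

# channels are dimension 2 in torch's 1-based (batch, channels, height, width) layout
combined <- torch_cat(list(upsampled, bridge), dim = 2)
combined$shape  # 1 x 128 x 64 x 64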

Brain image segmentation

With U-Net, domain applicability is as broad as the architecture is flexible. Here, we want to detect abnormalities in brain scans. The dataset, used in Buda, Saha, and Mazurowski (2019), contains MRI images together with manually created FLAIR abnormality segmentation masks. It is available on Kaggle.

Nicely, the paper is accompanied by a GitHub repository. Below, we closely follow (though not exactly replicate) the authors’ preprocessing and data augmentation code.

As is often the case in medical imaging, there is notable class imbalance in the data. For every patient, sections have been taken at multiple positions. (Number of sections per patient varies.) Most sections do not exhibit any lesions; the corresponding masks are colored black everywhere.
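To get a feeling for the imbalance, one can count how many masks contain at least one lesion pixel. This is a rough sketch, not part of the original post, and it assumes the images have already been downloaded and unpacked into data/mri_train as done in the Data section below:

library(magick)

mask_files <- list.files("data/mri_train", pattern = "mask",
                         recursive = TRUE, full.names = TRUE)
has_lesion <- vapply(mask_files, function(f) {
  m <- as.integer(magick::image_data(image_read(f), channels = "gray"))
  sum(m) > 0
}, logical(1))

# fraction of sections whose mask marks at least one abnormal pixel
mean(has_lesion)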

Here are three examples where the masks do indicate abnormalities:

Let’s see if we can build a U-Net that generates such masks for us.

Data

Before you start typing, here is a Colaboratory notebook to conveniently follow along.

We use pins to obtain the data. Please see this introduction if you haven’t used that package before.

# deep learning (incl. dependencies)
library(torch)
library(torchvision)

# data wrangling
library(tidyverse)
library(zeallot)

# image processing and visualization
library(magick)
library(cowplot)

# dataset loading
library(pins)
library(zip)

torch_manual_seed(777)
set.seed(777)

# use your own kaggle.json here
pins::board_register_kaggle(token = "~/kaggle.json")

files <- pins::pin_get("mateuszbuda/lgg-mri-segmentation", board = "kaggle", extract = FALSE)

The dataset is not that big – it includes scans from 110 different patients – so we’ll have to make do with just a training and a validation set. (Don’t do this in real life, as you’ll inevitably end up fine-tuning on the latter.)

train_dir <- "data/mri_train"valid_dir <- "data/mri_valid"if(dir.exists(train_dir)) unlink(train_dir, recursive = TRUE, force = TRUE)if(dir.exists(valid_dir)) unlink(valid_dir, recursive = TRUE, force = TRUE)zip::unzip(files, exdir = "data")file.rename("data/kaggle_3m", train_dir)# this is a duplicate, again containing kaggle_3m (evidently a packaging error on Kaggle)# we just remove itunlink("data/lgg-mri-segmentation", recursive = TRUE)dir.create(valid_dir)

Of those 110 patients, we keep 30 for validation. Some more file manipulations, and we’re set up with a nice hierarchical structure, with train_dir and valid_dir holding their per-patient sub-directories, respectively.

# the patient directories have to be listed before sampling validation indices
patients <- list.dirs(train_dir, recursive = FALSE)
valid_indices <- sample(1:length(patients), 30)

for (i in valid_indices) {
  dir.create(file.path(valid_dir, basename(patients[i])))
  for (f in list.files(patients[i])) {
    file.rename(file.path(train_dir, basename(patients[i]), f), file.path(valid_dir, basename(patients[i]), f))
  }
  unlink(file.path(train_dir, basename(patients[i])), recursive = TRUE)
}

We now need a dataset that knows what to do with these files.

Dataset

Like every torch dataset, this one has initialize() and .getitem() methods. initialize() creates an inventory of scan and mask file names, to be used by .getitem() when it actually reads those files. In contrast to what we’ve seen in previous posts, though, .getitem() does not simply return input-target pairs in order. Instead, whenever the parameter random_sampling is true, it will perform weighted sampling, preferring items with sizable lesions. This option will be used for the training set, to counter the class imbalance mentioned above.
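The weighted-sampling idea itself is just base R's sample() with a prob argument; a minimal sketch with made-up weights:

# one weight per slice, proportional to lesion area (cf. calc_slice_weights() below)
weights <- c(0.05, 0.40, 0.05, 0.50)

# slices with larger weights are drawn more often
idx <- sample(seq_along(weights), size = 1, prob = weights)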

The other way training and validation sets will differ is use of data augmentation. Training images/masks may be flipped, re-sized, and rotated; probabilities and amounts are configurable.

An instance of brainseg_dataset encapsulates all this functionality:

brainseg_dataset <- dataset(
  name = "brainseg_dataset",

  initialize = function(img_dir,
                        augmentation_params = NULL,
                        random_sampling = FALSE) {
    self$images <- tibble(
      img = grep(
        list.files(img_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),
        pattern = 'mask',
        invert = TRUE,
        value = TRUE
      ),
      mask = grep(
        list.files(img_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),
        pattern = 'mask',
        value = TRUE
      )
    )
    self$slice_weights <- self$calc_slice_weights(self$images$mask)
    self$augmentation_params <- augmentation_params
    self$random_sampling <- random_sampling
  },

  .getitem = function(i) {
    index <-
      if (self$random_sampling == TRUE)
        sample(1:self$.length(), 1, prob = self$slice_weights)
      else
        i

    img <- self$images$img[index] %>%
      image_read() %>%
      transform_to_tensor()
    mask <- self$images$mask[index] %>%
      image_read() %>%
      transform_to_tensor() %>%
      transform_rgb_to_grayscale() %>%
      torch_unsqueeze(1)

    img <- self$min_max_scale(img)

    if (!is.null(self$augmentation_params)) {
      scale_param <- self$augmentation_params[1]
      c(img, mask) %<-% self$resize(img, mask, scale_param)

      rot_param <- self$augmentation_params[2]
      c(img, mask) %<-% self$rotate(img, mask, rot_param)

      flip_param <- self$augmentation_params[3]
      c(img, mask) %<-% self$flip(img, mask, flip_param)
    }
    list(img = img, mask = mask)
  },

  .length = function() {
    nrow(self$images)
  },

  calc_slice_weights = function(masks) {
    weights <- map_dbl(masks, function(m) {
      img <-
        as.integer(magick::image_data(image_read(m), channels = "gray"))
      sum(img / 255)
    })

    sum_weights <- sum(weights)
    num_weights <- length(weights)

    weights <- weights %>% map_dbl(function(w) {
      w <- (w + sum_weights * 0.1 / num_weights) / (sum_weights * 1.1)
    })
    weights
  },

  min_max_scale = function(x) {
    min = x$min()$item()
    max = x$max()$item()
    x$clamp_(min = min, max = max)
    x$add_(-min)$div_(max - min + 1e-5)
    x
  },

  resize = function(img, mask, scale_param) {
    img_size <- dim(img)[2]
    rnd_scale <- runif(1, 1 - scale_param, 1 + scale_param)
    img <- transform_resize(img, size = rnd_scale * img_size)
    mask <- transform_resize(mask, size = rnd_scale * img_size)
    diff <- dim(img)[2] - img_size
    if (diff > 0) {
      top <- ceiling(diff / 2)
      left <- ceiling(diff / 2)
      img <- transform_crop(img, top, left, img_size, img_size)
      mask <- transform_crop(mask, top, left, img_size, img_size)
    } else {
      img <- transform_pad(img,
                           padding = -c(ceiling(diff / 2), floor(diff / 2),
                                        ceiling(diff / 2), floor(diff / 2)))
      mask <- transform_pad(mask,
                            padding = -c(ceiling(diff / 2), floor(diff / 2),
                                         ceiling(diff / 2), floor(diff / 2)))
    }
    list(img, mask)
  },

  rotate = function(img, mask, rot_param) {
    rnd_rot <- runif(1, 1 - rot_param, 1 + rot_param)
    img <- transform_rotate(img, angle = rnd_rot)
    mask <- transform_rotate(mask, angle = rnd_rot)

    list(img, mask)
  },

  flip = function(img, mask, flip_param) {
    rnd_flip <- runif(1)
    if (rnd_flip > flip_param) {
      img <- transform_hflip(img)
      mask <- transform_hflip(mask)
    }

    list(img, mask)
  }
)

After instantiation, we see we have 2977 training pairs and 952 validation pairs, respectively:

train_ds <- brainseg_dataset(
  train_dir,
  augmentation_params = c(0.05, 15, 0.5),
  random_sampling = TRUE
)

length(train_ds)
# 2977

valid_ds <- brainseg_dataset(
  valid_dir,
  augmentation_params = NULL,
  random_sampling = FALSE
)

length(valid_ds)
# 952

As a correctness check, let’s plot an image and associated mask:

par(mfrow = c(1, 2), mar = c(0, 1, 0, 1))

img_and_mask <- valid_ds[27]
img <- img_and_mask[[1]]
mask <- img_and_mask[[2]]

img$permute(c(2, 3, 1)) %>% as.array() %>% as.raster() %>% plot()
mask$squeeze() %>% as.array() %>% as.raster() %>% plot()

With torch, it is straightforward to inspect what happens when you change augmentation-related parameters. We just pick a pair from the validation set, which has not had any augmentation applied as yet, and call the augmentation functions – valid_ds$resize(), valid_ds$rotate() and valid_ds$flip() – directly. Just for fun, let’s use more “extreme” parameters here than we do in actual training. (Actual training uses the settings from Mateusz’ GitHub repository, which we assume have been carefully chosen for optimal performance.1)

img_and_mask <- valid_ds[77]
img <- img_and_mask[[1]]
mask <- img_and_mask[[2]]

imgs <- map(1:24, function(i) {

  # scale factor; train_ds really uses 0.05
  c(img, mask) %<-% valid_ds$resize(img, mask, 0.2)
  c(img, mask) %<-% valid_ds$flip(img, mask, 0.5)
  # rotation angle; train_ds really uses 15
  c(img, mask) %<-% valid_ds$rotate(img, mask, 90)

  img %>%
    transform_rgb_to_grayscale() %>%
    as.array() %>%
    as_tibble() %>%
    rowid_to_column(var = "Y") %>%
    gather(key = "X", value = "value", -Y) %>%
    mutate(X = as.numeric(gsub("V", "", X))) %>%
    ggplot(aes(X, Y, fill = value)) +
    geom_raster() +
    theme_void() +
    theme(legend.position = "none") +
    theme(aspect.ratio = 1)

})

plot_grid(plotlist = imgs, nrow = 4)

Now we still need the data loaders, and then, nothing keeps us from proceeding to the next big task: building the model.

batch_size <- 4

train_dl <- dataloader(train_ds, batch_size)
valid_dl <- dataloader(valid_ds, batch_size)

Model

Our model nicely illustrates the kind of modular code that comes “naturally” with torch. We approach things top-down, starting with the U-Net container itself.

unet takes care of the global composition – how far “down” do we go, shrinking the image while incrementing the number of filters, and then how do we go “up” again?

Importantly, it also acts as the system’s memory: in forward(), it keeps track of layer outputs seen going “down”, to be added back in going “up”.

unet <- nn_module(
  "unet",

  initialize = function(channels_in = 3,
                        n_classes = 1,
                        depth = 5,
                        n_filters = 6) {

    self$down_path <- nn_module_list()

    prev_channels <- channels_in
    for (i in 1:depth) {
      self$down_path$append(down_block(prev_channels, 2 ^ (n_filters + i - 1)))
      prev_channels <- 2 ^ (n_filters + i - 1)
    }

    self$up_path <- nn_module_list()

    for (i in ((depth - 1):1)) {
      self$up_path$append(up_block(prev_channels, 2 ^ (n_filters + i - 1)))
      prev_channels <- 2 ^ (n_filters + i - 1)
    }

    self$last = nn_conv2d(prev_channels, n_classes, kernel_size = 1)
  },

  forward = function(x) {

    blocks <- list()

    for (i in 1:length(self$down_path)) {
      x <- self$down_path[[i]](x)
      if (i != length(self$down_path)) {
        blocks <- c(blocks, x)
        x <- nnf_max_pool2d(x, 2)
      }
    }

    for (i in 1:length(self$up_path)) {
      x <- self$up_path[[i]](x, blocks[[length(blocks) - i + 1]]$to(device = device))
    }

    torch_sigmoid(self$last(x))
  }
)

unet delegates to two containers just below it in the hierarchy: down_block and up_block. While down_block is “just” there for aesthetic reasons (it immediately delegates to its own workhorse, conv_block), in up_block we see the U-Net “bridges” in action.

down_block <- nn_module(
  "down_block",

  initialize = function(in_size, out_size) {
    self$conv_block <- conv_block(in_size, out_size)
  },

  forward = function(x) {
    self$conv_block(x)
  }
)

up_block <- nn_module(
  "up_block",

  initialize = function(in_size, out_size) {

    self$up = nn_conv_transpose2d(in_size,
                                  out_size,
                                  kernel_size = 2,
                                  stride = 2)
    self$conv_block = conv_block(in_size, out_size)
  },

  forward = function(x, bridge) {

    up <- self$up(x)
    torch_cat(list(up, bridge), 2) %>%
      self$conv_block()
  }
)

Finally, a conv_block is a sequential structure containing convolutional, ReLU, and dropout layers.

conv_block <- nn_module(
  "conv_block",

  initialize = function(in_size, out_size) {

    self$conv_block <- nn_sequential(
      nn_conv2d(in_size, out_size, kernel_size = 3, padding = 1),
      nn_relu(),
      nn_dropout(0.6),
      nn_conv2d(out_size, out_size, kernel_size = 3, padding = 1),
      nn_relu()
    )
  },

  forward = function(x) {
    self$conv_block(x)
  }
)

Now instantiate the model, and possibly, move it to the GPU:

device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

model <- unet(depth = 5)$to(device = device)

Optimization

We train our model with a combination of cross entropy and dice loss.

The latter, though not shipped with torch, may be implemented manually:

calc_dice_loss <- function(y_pred, y_true) {

  smooth <- 1
  y_pred <- y_pred$view(-1)
  y_true <- y_true$view(-1)
  intersection <- (y_pred * y_true)$sum()

  1 - ((2 * intersection + smooth) /
         (y_pred$sum() + y_true$sum() + smooth))
}

dice_weight <- 0.3

Optimization uses stochastic gradient descent (SGD), together with the one-cycle learning rate scheduler introduced in the context of image classification with torch.

optimizer <- optim_sgd(model$parameters, lr = 0.1, momentum = 0.9)

num_epochs <- 20

scheduler <- lr_one_cycle(
  optimizer,
  max_lr = 0.1,
  steps_per_epoch = length(train_dl),
  epochs = num_epochs
)

Training

The training loop then follows the usual scheme. One thing to note: Every epoch, we save the model (using torch_save()), so we can later pick the best one, should performance have degraded thereafter.

train_batch <- function(b) {

  optimizer$zero_grad()
  output <- model(b[[1]]$to(device = device))
  target <- b[[2]]$to(device = device)

  bce_loss <- nnf_binary_cross_entropy(output, target)
  dice_loss <- calc_dice_loss(output, target)
  loss <- dice_weight * dice_loss + (1 - dice_weight) * bce_loss

  loss$backward()
  optimizer$step()
  scheduler$step()

  list(bce_loss$item(), dice_loss$item(), loss$item())
}

valid_batch <- function(b) {

  output <- model(b[[1]]$to(device = device))
  target <- b[[2]]$to(device = device)

  bce_loss <- nnf_binary_cross_entropy(output, target)
  dice_loss <- calc_dice_loss(output, target)
  loss <- dice_weight * dice_loss + (1 - dice_weight) * bce_loss

  list(bce_loss$item(), dice_loss$item(), loss$item())
}

for (epoch in 1:num_epochs) {

  model$train()
  train_bce <- c()
  train_dice <- c()
  train_loss <- c()

  for (b in enumerate(train_dl)) {
    c(bce_loss, dice_loss, loss) %<-% train_batch(b)
    train_bce <- c(train_bce, bce_loss)
    train_dice <- c(train_dice, dice_loss)
    train_loss <- c(train_loss, loss)
  }

  torch_save(model, paste0("model_", epoch, ".pt"))

  cat(sprintf("\nEpoch %d, training: loss:%3f, bce: %3f, dice: %3f\n",
              epoch, mean(train_loss), mean(train_bce), mean(train_dice)))

  model$eval()
  valid_bce <- c()
  valid_dice <- c()
  valid_loss <- c()

  i <- 0
  for (b in enumerate(valid_dl)) {

    i <<- i + 1
    c(bce_loss, dice_loss, loss) %<-% valid_batch(b)
    valid_bce <- c(valid_bce, bce_loss)
    valid_dice <- c(valid_dice, dice_loss)
    valid_loss <- c(valid_loss, loss)
  }

  cat(sprintf("\nEpoch %d, validation: loss:%3f, bce: %3f, dice: %3f\n",
              epoch, mean(valid_loss), mean(valid_bce), mean(valid_dice)))
}

Epoch 1, training: loss:0.304232, bce: 0.148578, dice: 0.667423
Epoch 1, validation: loss:0.333961, bce: 0.127171, dice: 0.816471

Epoch 2, training: loss:0.194665, bce: 0.101973, dice: 0.410945
Epoch 2, validation: loss:0.341121, bce: 0.117465, dice: 0.862983

[...]

Epoch 19, training: loss:0.073863, bce: 0.038559, dice: 0.156236
Epoch 19, validation: loss:0.302878, bce: 0.109721, dice: 0.753577

Epoch 20, training: loss:0.070621, bce: 0.036578, dice: 0.150055
Epoch 20, validation: loss:0.295852, bce: 0.101750, dice: 0.748757
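Since a checkpoint is written every epoch, one could also pick the best one programmatically after training. Here is a minimal sketch (not part of the original post), assuming the per-epoch validation losses had been collected in a hypothetical vector epoch_valid_loss:

# Hypothetical: epoch_valid_loss[e] holds mean(valid_loss) for epoch e,
# appended at the end of each iteration of the training loop above.
best_epoch <- which.min(epoch_valid_loss)
best_model <- torch_load(paste0("model_", best_epoch, ".pt"))
best_model$eval()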

Evaluation

In this run, it is the final model that performs best on the validation set. Still, we’d like to show how to load a saved model, using torch_load().

Once loaded, put the model into eval mode:

saved_model <- torch_load("model_20.pt")

model <- saved_model
model$eval()

Now, since we don’t have a separate test set, we already know the average out-of-sample metrics; but in the end, what we care about are the generated masks. Let’s view some, displaying ground truth and MRI scans for comparison.

# without random sampling, we'd mainly see lesion-free patches
eval_ds <- brainseg_dataset(valid_dir, augmentation_params = NULL, random_sampling = TRUE)
eval_dl <- dataloader(eval_ds, batch_size = 8)

batch <- eval_dl %>% dataloader_make_iter() %>% dataloader_next()

par(mfcol = c(3, 8), mar = c(0, 1, 0, 1))

for (i in 1:8) {

  img <- batch[[1]][i, .., drop = FALSE]
  inferred_mask <- model(img$to(device = device))
  true_mask <- batch[[2]][i, .., drop = FALSE]$to(device = device)

  bce <- nnf_binary_cross_entropy(inferred_mask, true_mask)$to(device = "cpu") %>%
    as.numeric()
  dc <- calc_dice_loss(inferred_mask, true_mask)$to(device = "cpu") %>% as.numeric()
  cat(sprintf("\nSample %d, bce: %3f, dice: %3f\n", i, bce, dc))

  inferred_mask <- inferred_mask$to(device = "cpu") %>% as.array() %>% .[1, 1, , ]
  inferred_mask <- ifelse(inferred_mask > 0.5, 1, 0)

  img[1, 1, , ] %>% as.array() %>% as.raster() %>% plot()
  true_mask$to(device = "cpu")[1, 1, , ] %>% as.array() %>% as.raster() %>% plot()
  inferred_mask %>% as.raster() %>% plot()
}

We also print the individual cross entropy and dice losses; relating those to the generated masks might yield useful information for model tuning.

Sample 1, bce: 0.088406, dice: 0.387786
Sample 2, bce: 0.026839, dice: 0.205724
Sample 3, bce: 0.042575, dice: 0.187884
Sample 4, bce: 0.094989, dice: 0.273895
Sample 5, bce: 0.026839, dice: 0.205724
Sample 6, bce: 0.020917, dice: 0.139484
Sample 7, bce: 0.094989, dice: 0.273895
Sample 8, bce: 2.310956, dice: 0.999824
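A metric that may be easier to interpret than the dice loss is a thresholded Dice score (1 = perfect overlap). Here is a minimal sketch of how one could compute it per sample, assuming pred and target are tensors shaped like inferred_mask and true_mask in the loop above (this helper is my own addition, not part of the original post):

# Thresholded Dice score: binarize the prediction, then measure overlap.
dice_score <- function(pred, target, threshold = 0.5, eps = 1e-7) {
  pred_bin <- (pred > threshold)$to(dtype = torch_float())
  target <- target$to(dtype = torch_float())
  intersection <- (pred_bin * target)$sum()
  ((2 * intersection + eps) / (pred_bin$sum() + target$sum() + eps))$item()
}

For sample 8 above, whose dice loss is close to 1, this score would come out close to 0, while well-predicted masks approach 1.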

While far from perfect, most of these masks aren’t that bad – a nice result given the small dataset!

Wrapup

This has been our most complex torch post so far; however, we hope you’ve found the time well spent. For one, among applications of deep learning, medical image segmentation stands out as highly societally useful. Secondly, U-Net-like architectures are employed in many other areas. And finally, we once more saw torch’s flexibility and intuitive behavior in action.

Thanks for reading!

Buda, Mateusz, Ashirbani Saha, and Maciej A. Mazurowski. 2019. “Association of Genomic Subtypes of Lower-Grade Gliomas with Shape Features Automatically Extracted by a Deep Learning Algorithm.” Computers in Biology and Medicine 109: 218–25. https://doi.org/https://doi.org/10.1016/j.compbiomed.2019.05.002.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” CoRR abs/1505.04597. http://arxiv.org/abs/1505.04597.


  1. Yes, we did a few experiments, confirming that more augmentation isn’t better … what did I say about inevitably ending up doing optimization on the validation set …?↩


To leave a comment for the author, please follow the link and comment on their blog: RStudio AI Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Brain image segmentation with torch first appeared on R-bloggers.

Little useless-useful R functions – Script that generates calculator script


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This time the simple useless function will generate a long list of IF clauses that will in fact return the result of a mathematical operation between two numbers.

This is by no means any type of automatic code generation, nor a script that will write itself.

The concept is fairly simple. Create a function that returns the result of a mathematical operation (addition, multiplication, division or subtraction) for two positive integers (whole numbers) between 1 and 10. That’s it 🙂

# Basic concept
calc <- function(a, b, oper){
  if(a==1 & b==1 & oper=="+"){print("Result is 2")}
  if(a==1 & b==1 & oper=="-"){print("Result is 0")}
  if(a==1 & b==1 & oper=="*"){print("Result is 1")}
  if(a==1 & b==1 & oper=="/"){print("Result is 1")}
}

calc(1,1,"-")
calc(1,1,"+")
calc(1,1,"*")
calc(1,1,"/")

I have played a little bit with combinations of the 4 operations and two sets of 10 numbers, yielding 400 combinations. The function that generates the final calc function is the following:

# set all combinations
df <- data.frame(merge(merge(c(1:10), c(1:10), by=NULL), c("+","-","/","*"), by=NULL))
colnames(df) <- c("numberA", "numberB", "oper")

f <- "calc <- function(a,b,oper){"

for (i in 1:nrow(df)){
  res <- paste0(as.character(df$numberA[i]), df$oper[i], as.character(df$numberB[i]))
  rr <- eval(parse(text=res))
  f1 <- paste0(' if(a==', as.character(df$numberA[i]), ' & b==', as.character(df$numberB[i]),
               ' & oper==', '"', as.character(df$oper[i]), '"',
               '){print("Result is ', as.character(rr), '")}', '\n\r', collapse=NULL)
  f <<- paste0(f, f1, collapse = NULL)
  if(i==nrow(df)){
    f <<- paste0(f, "}", collapse = NULL)
    eval(parse(text=f))
  }
}

calc(4,5,"/")

Once the calc function for all 400 combinations is created, feel free to use it.

For some comparison, let’s crunch some numbers. I wanted to see how long it takes as the function gets longer. The code for measuring the execution time:

start.time <- Sys.time()
calc(20,3,"+")
end.time <- Sys.time()
end.time - start.time

And the results of adding more positive integers are, to say the least, interesting, if not useless.

Number of integers | Number of combinations | First execution time (sec) | Average next execution time (sec) | Size of function
10                 | 400                    | 5.6 s                      | 0.0012                            | 2.3 Mb
20                 | 1600                   | 11.5 s                     | 0.0022                            | 9.3 Mb
25                 | 2500                   | 28.4 s                     | 0.0027                            | 14.5 Mb
50                 | 10000                  | 717 s                      | 0.0081                            | 58 Mb
100                | 40000                  | /                          | /                                 | 232.1 Mb

Hitting the first 100 positive integers, the function increased to over 230 Mb, taking up space in memory. And based on the exponential growth of the size of the function, my laptop would hit the “out of memory” boundary by the integer 1000. This makes you appreciate the usefulness of CPU architecture 🙂
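The sizes in the table can be checked directly; a minimal sketch, assuming the generated calc() from above exists in your session (object.size() is only an approximation for closures, but it shows the growth clearly):

# Rough check of the generated function's in-memory footprint
format(object.size(calc), units = "Mb")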

As always, code is available on Github.

Happy R-Coding and stay healthy!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Little useless-useful R functions – Script that generates calculator script first appeared on R-bloggers.

Corporate identity graphics in R


[This article was first published on Rbloggers – The Analytics Lab, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This blog is about getting corporate identity graphics ready in R using ggplot. Many corporates have decent identity PowerPoint decks and Excel templates available to work with; they might even have developed a PowerBI/Tableau template to fit their corporate identity. Yet identity templates for R (or other languages used) are often not readily available: marketing often does not have these tools in scope when developing templates for a newly launched corporate identity. The default graphics that ggplot provides in R are already of good quality. Sometimes the color palette needs some improvement and occasionally the axes need some editing before things look sharp. Wouldn’t it be great if all graphics you produce during an analysis matched the colors, fonts and layout of your brand by default?

If you work as an analyst or a consultant like me and create visuals for clients in R, this post is useful for you. More than a few times I’ve noticed that clients really appreciate being presented with visuals that match their corporate identity style. And here is the thing: my company – Cmotions – recently published a new corporate identity. A few weeks ago I was developing materials for some courses at our academy and created new content for those courses in R matching our new identity. I read a few examples and blog posts from analysts with a similar question. In this blog I share my experience with you; it turned out to be rather easy to create a ggplot template in R matching a corporate identity.

The first thing you need to do is register the Windows fonts in R; by default, the fonts available to choose from in R are limited.

# Make sure all Windows fonts are available within R
library(extrafont)

# Auto detect all Windows TTFs. Use fonts() or fonttable() for a complete overview afterwards
font_import()

# Register fonts for Windows bitmap output
loadfonts(device="win")

Next you need to create a color palette matching the corporate identity you want to be reflected in ggplot output.

# Cmotions default palette
palette_cmotions <- c("#003D7C", "#F5A507", "#7C3F00", "#000000", "#FFDC51", "#ACACAC")

You have two options to go forward: first is to create a new ggplot template from scratch by specifying all elements. It will give you maximum control and also a lot of work. Second option is to use an existing theme and only edit those part as you see fit. I’m choosing the theme_bw that comes with ggplot and alter only those parts I think are necessary to match my corporate identity. As you can see in the code below I’ve added many element_blank() tags so my legend does not have a title, the panel does not have a border and by default the plot has no background.

# load ggplot
library(ggplot2)

# Create Cmotions theme for ggplot
theme_cmotions <- function() {
  theme_bw(base_size=8, base_family="Verdana") %+replace% # use theme_bw as default, replace font family
    theme(
      # adjust legend
      legend.background = element_blank(),
      legend.title = element_blank(),
      # adjust axis
      axis.ticks = element_blank(),
      axis.title.y = element_blank(),
      axis.title.x = element_blank(),
      axis.text = element_text(color = "black"),
      # adjust panel
      panel.background = element_rect(colour = "white"),
      panel.border = element_blank(),
      panel.grid = element_blank(),
      # adjust plot
      plot.background = element_blank(),
      plot.title = element_text(family="Arial Black", face="bold", colour="black", size=14),
      complete = TRUE
    )
}

Using the created template is easy, just specify which theme you want to use in the ggplot command. First let me set up some dummy data to use.

# create data for example plots
df <- data.frame(
  category = c('cat1', 'cat1', 'cat1', 'cat2', 'cat2', 'cat2', 'cat3', 'cat3', 'cat3',
               'cat4', 'cat4', 'cat4', 'cat5', 'cat5', 'cat5', 'cat6', 'cat6', 'cat6'),
  year = c('2018', '2019', '2020', '2018', '2019', '2020', '2018', '2019', '2020',
           '2018', '2019', '2020', '2018', '2019', '2020', '2018', '2019', '2020'),
  value = as.numeric(c('2', '5', '6', '4', '7', '10', '5', '8', '12',
                       '2', '4', '6', '4', '6', '8', '8', '8', '12'))
)

Next I create 4 different plots and specify the new theme I would like to use and the color palette.

# Example for bar chart - no legend
bar <- ggplot(df, aes(category, value, fill = category)) +
  geom_col(show.legend = FALSE) +
  theme_cmotions() +
  ggtitle("Category example") +
  scale_fill_manual(values = palette_cmotions)

# Example for stacked bar chart - only 3 categories
stacked <- ggplot(df[1:9,], aes(year, value, fill = category)) +
  geom_col() +
  theme_cmotions() +
  ggtitle("Category stacked example") +
  scale_fill_manual(values = palette_cmotions)

# Example for line chart
line <- ggplot(df[1:9,], aes(year, value, group = category, color = category)) +
  geom_line(size=2) +
  theme_cmotions() +
  ggtitle("Category trend") +
  scale_color_manual(values = palette_cmotions)

# Example for bar chart - facet
facet <- ggplot(df, aes(year, value, fill = category)) +
  geom_bar(stat="identity") +
  theme_cmotions() +
  ggtitle("Category example - facet") +
  facet_wrap(~ category, nrow=2, ncol=3) +
  scale_fill_manual(values = palette_cmotions) +
  theme(strip.background = element_blank(), strip.text.x = element_blank()) # without the strip

# Display examples
library(gridExtra)
grid.arrange(bar, stacked, line, facet, ncol=2)

Corporate identity graphics using ggplot in R

If you want to use this new theme as a default theme just add the lines below to your .Rprofile and the theme will be set during R startup.

# code to add to .Rprofile for Cmotions theme as default
# (theme_cmotions() itself must also be defined or sourced in .Rprofile)
setHook(packageEvent("ggplot2", "onLoad"),
        function(...) ggplot2::theme_set(theme_cmotions()))

Of course more changes can be made, depending on your needs. I hope this basic example is useful to you when you want your corporate identity reflected in your R output.
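One such tweak – a sketch of my own, not taken from the original post – is to wrap the palette in convenience scale functions (the hypothetical helpers scale_fill_cmotions() and scale_color_cmotions() below), so you don’t have to repeat scale_fill_manual(values = palette_cmotions) for every chart:

# Convenience wrappers around the Cmotions palette (hypothetical helpers)
scale_fill_cmotions <- function(...) {
  scale_fill_manual(values = palette_cmotions, ...)
}
scale_color_cmotions <- function(...) {
  scale_color_manual(values = palette_cmotions, ...)
}

# usage
ggplot(df, aes(category, value, fill = category)) +
  geom_col(show.legend = FALSE) +
  theme_cmotions() +
  scale_fill_cmotions()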

this article is written by

Wouter van Gils

w.v.gils@cmotions.nl


To leave a comment for the author, please follow the link and comment on their blog: Rbloggers – The Analytics Lab.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Corporate identity graphics in R first appeared on R-bloggers.

Version 0.10.1 of NIMBLE released


[This article was first published on R – NIMBLE, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We’ve released the newest version of NIMBLE on CRAN and on our website. NIMBLE is a system for building and sharing analysis methods for statistical models, especially for hierarchical models and computationally-intensive methods (such as MCMC and SMC).
We’ve released version 0.10.1, which is primarily a bug fix release:

- In particular, it fixes a bug in retrieving parameter values from distributions that was introduced in version 0.10.0. The bug can cause incorrect behavior of conjugate MCMC samplers under certain model structures (such as particular state-space models), so we strongly encourage users to upgrade to 0.10.1.
- In addition, version 0.10.1 restricts use of WAIC to the conditional version of WAIC (conditioning on all parameters directly involved in the likelihood). Previous versions of nimble gave incorrect results when not conditioning on all parameters directly involved in the likelihood (i.e., when not monitoring all such parameters). In a future version of nimble we plan to make a number of improvements to WAIC, including allowing use of marginal versions of WAIC, where the WAIC calculation integrates over random effects.

Please see the release notes on our website for more details.

To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Version 0.10.1 of NIMBLE released first appeared on R-bloggers.

Trees and forests


[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For my ACT6100 weekly quiz, I usually generate some datasets and then ask students to compare various predictive algorithms. Last week, it was about classification trees and random forests. And students were surprised to see such large differences (they had to estimate the probability of a specific label, for the barycenter of the covariates).

Usually, I use the following to generate some (here 12) covariates that could be correlated

library(FactoMineR)
library(clusterGeneration)
library(mnormt)

n = 279
k = 12
S = genPositiveDefMat("unifcorrmat", dim = k)
X = round(rmnorm(n, varcov = S$Sigma) + 8, 2)
rownames(X) = 1:n
colnames(X) = LETTERS[1:k]

Then I need to generate some data, based on some covariates (5 out of 12), with various strengths

idx = sample(1:k, size = 5)
u = sample(c(-(4:1), 1:4), 5)
beta = rep(0, k)
beta[idx] = u
U = X %*% beta
U = U - min(U)
U = U / max(U) * 6 - 3
p = exp(U) / (1 + exp(U))
Y = rbinom(n, size = 1, prob = p)
df = data.frame(Y = as.factor(Y), X)
levels(df$Y) = c("blue", "red")

We can run a classification tree

library(rpart)
arbre = rpart(Y ~ ., data = df)

and a random forest,

library(randomForest)
set.seed(1)
arbres = randomForest(Y ~ ., data = df)

Here are the partial plots for 4 of the explanatory variables that actually have an impact

partialPlot(arbres, pred.data = df, x.var = "A")

Predictions for the “average” point of the dataset are here

(parbre = predict(arbre, newdata = data.frame(t(apply(df[,-1], 2, mean))), type = "prob"))

       blue       red
1 0.8064516 0.1935484

(parbres = predict(arbres, newdata = data.frame(t(apply(df[,-1], 2, mean))), type = "prob"))

   blue   red
1 0.422 0.578
attr(,"class")
[1] "matrix" "votes"

and there is a substantial difference, with a probability of 19% with a single tree, 58% with 500 trees (the default value of the function).

To understand why we can have such a difference, we should not only focus on the bagging strategy, but also look at the variability of the predictions obtained with individual trees,

B = 1e4
parbres = rep(NA, B)
m = data.frame(t(apply(df[,-1], 2, mean)))

for(b in 1:B){
  idx = sample(1:nrow(df), size = nrow(df), replace = TRUE)
  arbre = rpart(Y ~ ., data = df[idx,])
  parbres[b] = predict(arbre, newdata = m, type = "prob")[2]
}

hist(parbres)

Surprisingly, we have here a bimodal distribution for \(\hat{y}\), which is either very small for some trees, or very large for others. On average, we have a value close to 55%… I think I will use that generative algorithm more for future quizzes…
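As a quick check of that last statement – a small sketch, not part of the original code – we can compare the average of the bootstrapped single-tree predictions with the random forest prediction for the same point:

# mean of the 10,000 single-tree bootstrap predictions (close to 0.55 here)
mean(parbres)

# random forest prediction of "red" for the same "average" observation (0.578 above)
predict(arbres, newdata = m, type = "prob")[, "red"]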


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Trees and forests first appeared on R-bloggers.


