
DALEX 2.1.0 is live on GitHub!

This article was first published on R in ResponsibleML on Medium and kindly contributed to R-bloggers.

Dreamworks animation. Image from: https://www.theatlantic.com/entertainment/archive/2014/11/a-kids-eye-view-of-penguins-of-madagascar/383187/

After the awesome reception of my last blog about improvements to DALEX and DALEXtra, I couldn't wait for the next opportunity to share some details about new features with you. Sadly, at some point I realized that it's necessary to implement them first. It took a while, but finally, after countless failed GitHub Actions builds, I'm glad to announce: "What's new in DALEX 2.1.0?". With the new version, we can choose which column of the output matrix is taken into consideration while preparing explanations. Due to the nature of that change, it affects only classification tasks. But without further ado, let's move to working examples and explain what exactly that change means. For anyone interested in my previous blogs about DALEX, here is the link.

Penguins

Our journey starts in Antarctica. Quite a weird place for XAI stuff, one may say. However, in today's blog, 3 different species of penguins from exactly that continent will serve as our companions. penguins is a dataset coming from the palmerpenguins R package. As the README states, the data were collected and published by Dr. Kristen Gorman and contain information about 344 penguins living on 3 different islands in the Palmer Archipelago. The authors of the package see penguins as an alternative to iris, so let's see!

library(palmerpenguins)
data_penguins <- na.omit(palmerpenguins::penguins)

Predict function target column

DALEX 2.1.0 brings a new parameter to the explain() function: predict_function_target_column. It lets users steer which part of the model's response is taken into consideration when explaining classification models. In previous versions, whenever the default predict_function was used, DALEX returned the second column of the output probability matrix for binary classification, and the whole probability matrix for multiclass classification. That default behavior is preserved, but now, by passing the predict_function_target_column parameter, we can force DALEX to take the column we are interested in without having to write a custom predict_function. What's even more fantastic is that you can do it not only for binary classification models but also for multiclass classification, turning the task into a one-vs-others binary classification. It's a super handy tool whenever one of the classes in your multiclass task is far more important than the others and an analysis made from both perspectives may be useful. The parameter accepts both numeric and character inputs, understood as either the position of the column in the probability matrix or the name of the column that should be extracted; to be precise, the parameter's value is used directly to index the column. Keep in mind that some engines, like gbm, return a single vector for multiclass classification; the change does not affect such models.

Creation of model and explainer

That being said, it's high time to make some models and show the new functionality in practice. For that purpose, we will need two predictive models. Let them be simple; performance is not what we seek today. The first model is going to be a binary classifier. For this purpose, we will need to create a new variable, is_adelie, which I think is really self-explanatory. The second model will be a multiclass classifier for the species variable. The engine behind both of them will be ranger.

library("ranger")
library("DALEX")

model_multiclass <- ranger(species ~ ., data = data_penguins,
                           probability = TRUE, num.trees = 100)

explain_multiclass_one_vs_others <- explain(
    model_multiclass,
    data_penguins,
    data_penguins$species == "Adelie",
    label = "Ranger penguins multiclass",
    predict_function_target_column = "Adelie")

explain_multiclass <- explain(
    model_multiclass,
    data_penguins,
    data_penguins$species,
    label = "Ranger penguins multiclass",
    colorize = FALSE)

model_binary <- ranger((species == "Adelie") ~ ., data = data_penguins,
                       probability = TRUE, num.trees = 100)

explain_binary <- explain(
    model_binary,
    data_penguins,
    data_penguins$species == "Adelie",
    label = "Ranger penguins binary",
    colorize = FALSE,
    predict_function_target_column = 2)

As you can see, the usage is very simple! Now let's explain some models!
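For comparison, before 2.1.0 the same one-vs-others focus required writing a custom predict_function by hand. Here is a minimal sketch that should be equivalent to explain_multiclass_one_vs_others above; it reuses model_multiclass and data_penguins and relies on ranger's probability matrix having class-named columns (which it does for probability forests).

explain_multiclass_custom <- explain(
    model_multiclass,
    data_penguins,
    data_penguins$species == "Adelie",
    label = "Ranger penguins multiclass",
    predict_function = function(model, newdata) {
        # extract only the "Adelie" column of the probability matrix
        predict(model, newdata)$predictions[, "Adelie"]
    })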

Model performance is always a good place to start. An important note: multiclass models explained with the predict_function_target_column parameter are treated as standard binary classification, so binary classification measures will be displayed. Let's see the difference.

(mp_one_vs_others <- model_performance(explain_multiclass_one_vs_others))
(mp <- model_performance(explain_multiclass))
model_performance for both models

Another possibility that the new option opens up is calculating feature importance using different measures. The default loss measures are cross-entropy for multiclass and one minus AUC for binary classification. With the change, we can calculate which features are most important for the predictions of one specific class. Isn't that amazing?

fi_one_vs_others <- model_parts(explain_multiclass_one_vs_others)
fi <- model_parts(explain_multiclass)
plot(fi_one_vs_others)
plot(fi)
feature importance for a multiclass model with the chosen Adelie class
feature importance for a standard multiclass model

I'm quite sure you are familiar with how Predict Profile and Model Profile explanations handle multiclass models. They simply calculate explanations for each level of y and then combine them on the plot. But we are not always interested in profiles or breakdowns for all of the levels, right? Here comes another area where the new parameter can be utilized: we can simply choose which class should be included.

pdp_one_vs_others <- model_profile(
    explainer = explain_multiclass_one_vs_others,
    variables = "bill_length_mm")
pdp <- model_profile(
    explainer = explain_multiclass,
    variables = "bill_length_mm")
plot(pdp_one_vs_others)
plot(pdp)
pdp for a multiclass model with the chosen Adelie class
pdp for a standard multiclass model
bd_one_vs_others <- predict_parts(
    explainer = explain_multiclass_one_vs_others,
    new_observation = data_penguins[1, ])
bd <- predict_parts(
    explainer = explain_multiclass,
    new_observation = data_penguins[1, ])
plot(bd_one_vs_others)
plot(bd)
iBreakDown for a multiclass model with the chosen Adelie class
iBreakDown for a standard multiclass model

Summary

That will be enough for today! I hope you are as excited about the new feature as I am. I didn't focus on the methods used today, so if you want to know more about pdp, iBreakDown, or feature importance, I encourage you to visit the XAI tools page, where you can find an overview of different solutions for XAI in R and Python. There is also an excellent book, Explanatory Model Analysis, on that subject. As always, in case of any questions or problems feel free to open issues in the https://github.com/ModelOriented/DALEX or https://github.com/ModelOriented/DALEXtra repos. We look forward to your suggestions regarding the future of our software.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.



How to Make Impressive Shiny Dashboards in Under 10 Minutes with semantic.dashboard

This article was first published on r – Appsilon | End to End Data Science Solutions and kindly contributed to R-bloggers.


Introducing semantic.dashboard

Dashboards allow you to structure reports intuitively and break them down into easy-to-read chunks. As a result, end-users can navigate and explore data much easier than with traditional reports.

The shinydashboard R package has been out for ages, and it is a good option with a decent amount of features. However, apps built with it tend to look alike – especially if you’re using it daily on multiple projects. A lot of simplistic, aesthetically identical dashboards can leave a bad impression on clients and stakeholders. That’s where semantic.dashboard comes into play.

The semantic.dashboard package is an open-source alternative to shinydashboard created by Appsilon. It allows you to include Fomantic UI components in R Shiny apps without breaking a sweat.

For example, let’s take a look at two identical applications – the first built with shinydashboard, and the second one with semantic.dashboard:

Dashboard built with shinydashboard

Image 1 – Dashboard built with shinydashboard

Dashboard built with semantic.dashboard

Image 2 – Dashboard built with semantic.dashboard

Both look good – that’s guaranteed, but the one built with semantic.dashboard doesn’t look as generic. You’ll create a somewhat simplified version of this dashboard today.

You can download the source code of this article here.

Want to learn more about semantic.dashboard? Visit the official GitHub page. Feel free to leave us a star!


To learn more about Appsilon’s open-source packages, see our new Open-Source Landing Page: 

Appsilon’s shiny.tools landing page for our open source packages.

Installation and Your First Dashboard

The semantic.dashboard package is available on CRAN (Comprehensive R Archive Network). To install it, execute the following line from the R console:

install.packages("semantic.dashboard")

You can now proceed by creating an empty dashboard:

library(shiny)
library(semantic.dashboard)

ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(sidebarMenu()),
  dashboardBody()
)

server <- function(input, output) { }

shinyApp(ui, server)

Here’s the corresponding output:

Empty shiny.semantic dashboard

Image 3 – Empty shiny.semantic dashboard

A semantic.dashboard app's UI is made of a dashboardPage, which is further split into three elements:

  1. Header – dashboardHeader
  2. Sidebar – dashboardSidebar
  3. Body – dashboardBody

This structure is identical to shinydashboard's, making things easier to learn. Let's see how to tweak all of them.

There are a lot of things you can do with dashboardHeader. For example, you can change the color by specifying a value for the color parameter. You could also include a logo by setting the logo_path and logo_align parameters. If you want to remove the header, specify disable = TRUE.

Here’s how to change the color from white to something less boring:

dashboardHeader(color = "blue", inverted = TRUE)

The inverted parameter applies the color to the header background instead of to the header text.

Here’s the corresponding output:

Styling header of the semantic dashboard

Image 4 – Styling the header of the semantic dashboard
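As a quick sketch of the other header options mentioned above (the logo file is a placeholder you would ship with your app; check ?dashboardHeader in your installed version for the exact argument names):

dashboardHeader(
  color = "blue",
  inverted = TRUE,
  logo_path = "logo.png",   # placeholder logo file bundled with the app
  logo_align = "center"
)

# ...or remove the header entirely
dashboardHeader(disable = TRUE)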

Next, let's see how to add elements to the dashboardSidebar. You can choose which side it appears on with the side parameter (left by default), and you can also tweak its width with the size parameter, as you'll see in a minute. Finally, you can disable the sidebar altogether by setting disable = TRUE.

Here’s how to make the sidebar wider:

dashboardSidebar(
  size = "wide",
  sidebarMenu(
    menuItem(tabName = "panel1", text = "Panel 1"),
    menuItem(tabName = "panel2", text = "Panel 2")
  )
)

Here are the results:

Styling sidebar of the semantic dashboard

Image 5 – Styling the sidebar of the semantic dashboard

That adds the elements to the sidebar, but how can you display different content when the tab is clicked? That’s where tweaks to dashboardBody come into play.

Let's add tabItems with two tabs, corresponding to the two options in the sidebar. The first option is selected by default, as specified with the selected parameter. Only a single text box should be visible on each panel, indicating which panel you're currently on. Here's the code snippet:

dashboardBody(
  tabItems(
    selected = 1,
    tabItem(
      tabName = "panel1",
      textOutput(outputId = "text1")
    ),
    tabItem(
      tabName = "panel2",
      textOutput(outputId = "text2")
    )
  )
)

To make this work, you’ll need to make some tweaks to the server function. You’ll have to render text on the corresponding outputs. Here’s how:

server <- function(input, output) {
  output$text1 <- renderText("This is Panel 1")
  output$text2 <- renderText("This is Panel 2")
}

Your dashboard should look like this now:

Initial dashboard with two panels

Image 6 – Initial dashboard with two panels

Now you know the basics of semantic.dashboard. Let’s see how to take it a step further and display an interactive data-driven map.

Build a Fully Interactive Dashboard

R comes with a lot of built-in datasets, quakes being one of them. It shows geolocations of 1000 seismic events that occurred near Fiji since 1964. Here’s what the first couple of rows look like:

First couple of rows of the Quakes dataset

Image 7 – First couple of rows of the Quakes dataset

You’ll now see how to develop a semantic dashboard with the following tabs:

  • Interactive map – display geographical area near Fiji with markers representing the magnitude of the seismic event
  • Table – shows the source dataset formatted as a table

You’ll create the interactive map with the leaflet package, so make sure to have it installed:

install.packages("leaflet")

The UI follows the pattern discussed in the previous section – there’s a header, sidebar, and a body. The header will be empty this time. Most of the differences are in the dashboardBody. The structure should look familiar, but there are two new functions:

  • leafletOutput() – used to display the interactive map
  • dataTableOutput() – used to display the data table

To make the map as large as possible, you can set some inline CSS styles. In the code below, the height is modified so the map always takes up almost the entire screen height (minus 80 pixels as a margin).

Here’s the code for the UI:

library(shiny)
library(shiny.semantic)
library(semantic.dashboard)
library(leaflet)

ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    size = "wide",
    sidebarMenu(
      menuItem(tabName = "map", text = "Map", icon = icon("map")),
      menuItem(tabName = "table", text = "Table", icon = icon("table"))
    )
  ),
  dashboardBody(
    tabItems(
      selected = 1,
      tabItem(
        tabName = "map",
        tags$style(type = "text/css", "#map {height: calc(100vh - 80px) !important;}"),
        leafletOutput("map")
      ),
      tabItem(
        tabName = "table",
        fluidRow(
          h1("Quakes Table"),
          semantic_DTOutput("quakesTable")
        )
      )
    )
  )
)

In order to make this dashboard work, you’ll have to modify the server function. Inside it lies the code for rendering both the map and the table. The coordinates for the map were chosen arbitrarily, after a quick Google search. 

The magnitude of the seismic activity determines the size of a marker. Every marker is clickable – showing the magnitude, depth in kilometers, and the number of stations that reported the seismic activity.

Here’s the code for the server:

server <- function(input, output) {
  output$map <- renderLeaflet({
    leaflet() %>%
      setView(lng = 179.3355929, lat = -20.4428959, zoom = 6.5) %>%
      addProviderTiles("Esri.WorldStreetMap") %>%
      addCircles(
        data = quakes,
        radius = sqrt(10^quakes$mag) * 30,
        color = "#000000",
        fillColor = "#ffffff",
        fillOpacity = 0.5,
        popup = paste0(
          "Magnitude: ", quakes$mag, "<br>",
          "Depth (km): ", quakes$depth, "<br>",
          "Num. stations reporting: ", quakes$stations
        )
      )
  })

  output$quakesTable <- DT::renderDataTable(
    semantic_DT(quakes)
  )
}

And here’s the final dashboard:

Final Quakes dashboard

Image 8 – Final Quakes dashboard

Conclusion

In this short hands-on guide, you’ve learned how to develop simple and aesthetically-pleasing Shiny dashboards. You’ve learned what semantic.dashboard is, what it brings to the table, and how to pair it with other libraries to produce stunning and intuitive dashboards. You can learn more about semantic.dashboard by visiting the links below.

Looking for inspiration? Check out component demos and complete dashboards here.

Appsilon is hiring globally! We are primarily seeking an Engineering Manager who can lead a team of 6-8 ambitious software engineers. See our Careers page for all new openings, including openings for a Project Manager and Community Manager.


Docker for Data Science: An Important Skill for 2021 [Video]

This article was first published on business-science.io and kindly contributed to R-bloggers.

Data Science Technology Trends

A year ago I wrote about the technologies Data Scientists should focus on based on industry trends. Moving into 2021, these trends remain clear: organizations want Data Science, Cloud, and Apps. Here's what's happening and how Docker plays a part in the essential skills of 2020-2021 and beyond.

Articles in Series

  1. Part 1 – Five Full-Stack Data Science Technologies
  2. Part 2 – AWS Cloud
  3. Part 3 – Docker (You Are Here)
  4. Part 4 – Git Version Control
  5. Part 5 – H2O Automated Machine Learning (AutoML)
  6. Part 6 – R Shiny vs Tableau (3 Business Application Examples)
  7. [NEW BOOK] – The Shiny Production with AWS Book

Changing Trends in Tech Jobs

Indeed, the popular employment-related search engine, released an article this past Tuesday showing changing trends from 2015 to 2019 in "Technology-Related Job Postings". We can see a number of changes in key technologies; the one we are particularly interested in is the 4000% increase in Docker.

Today's Top Tech Skills

Source: Indeed Hiring Lab.

Drivers of Change

There are 3 Key Drivers of changes in technologies:

  1. Rise of Machine Learning (and more generically Data Science) – Unlock Business Insights

  2. Businesses Shifting to the Cloud Services versus On-Premise Infrastructure – Massive Cost Savings and Flexibility Increase

  3. Businesses Shifting to Distributed Applications versus Ad-Hoc Executive Reports – Democratize Data and Improve Decision-Making within the Organization

If you aren’t gaining experience in data science, cloud, and web applications, you are risking your future.

Machine Learning (Point 1)

Data Science is shifting. We already know the importance of Machine Learning. But a NEW CHANGE is happening. Organizations need distributed data science. This requires a new set of skills – Docker, Git, and Apps. (More on this in a minute).

Cloud Services (Point 2)

Last week, I released “Data Science with AWS”. In the article, I spoke about the shift to Cloud Services and the need to learn AWS (No. 6 on Indeed’s Skill Table, 418% Growth). I’ll reiterate – AWS is my Number 1 skill that you must learn going into 2020.

Azure (No. 17, 1107% Growth) is in the same boat along with Google Cloud Platform for Data Scientists in Digital Marketing.

The nice thing about cloud – If you learn one, then you can quickly switch to the others.

Distributed Web Applications (Point 3)

Businesses now need Apps + Cloud. I discuss this at length in this YouTube video.

Watch on YouTube   Download the Slides

Let’s talk about the BIG CHANGE from the video…

The Big Change: From 2015 to 2020, apps now essential to business strategy

The landscape of Data Science is changing from reporting to application building:

  • In 2015 – Businesses need reports to make better decisions
  • In 2020 – Businesses need apps to empower better decision making at all levels of the organization

This transition is challenging the Data Scientist to learn new technologies to stay relevant…

In fact, it’s no longer sufficient to just know machine learning. We also need to know how to put machine learning into production as quickly as possible to meet the business needs.

To do so, we need to learn from the Programmers the basics of Software Engineering that can help in our quest to unleash data science at scale and unlock business value.

Learning from programmers

Programmers need applications to run no matter where they are deployed, which is the definition of reproducibility.

The programming community has developed amazing tools that help solve this issue of reproducibility for software applications:

  • Docker (and DockerHub) – Controls the software environment state

  • Git (and GitHub) – Controls the file state

It turns out that Data Scientists can use these tools to build apps that work.

We’ll focus on Docker (and DockerHub), and we’ll make a separate article for Git (and GitHub).

What is Docker?

Let’s look at a (Shiny) application to see what Docker does and how it helps.

Application Internals

We can see that an application consists of 2 things:

  • Files – The set of instructions for the app. For a Shiny app, this includes an app.R file that contains layout instructions, server control instructions, database instructions, etc.

  • Software – The code external to your files that your application files depend on. For a Shiny app, this is R, Shiny Server, and any libraries your app uses.

Docker “locks down” the Software Environment. This means your software is 100% controlled so that your application uses the same software every time.

Key terminology

Dockerfile

A Dockerfile contains the set of instructions to create a Docker Container. Here’s an example from my Shiny Developer with AWS Course.

Dockerfile

Dockerfile – Used to create a Docker Container From Shiny Developer with AWS Course

Docker Container

A Docker Container is a stored version of a built software environment. Think of it as a saved state that can be reproduced on any server (or computer).

Docker Containers are a productivity booster. It usually takes 30 minutes or so to build a software environment in Docker, but once built the container can be stored locally or on DockerHub. The Docker Container can then be installed in minutes on a server or computer.

Without Docker Containers, it would take 30 minutes per server/computer to build an equivalent environment.

Key Point: Docker Containers not only save the state of the software environment making apps reproducible, but they also enhance productivity for data scientists trying to meet the ever-changing business needs.

DockerHub

DockerHub is a repository for Docker Containers that have been previously built.

You can install these containers on computers or use these Containers as the base for new containers.

DockerHub

DockerHub – Used to share Docker Containers From Shiny Developer with AWS Course

Real Docker Use Case Example

In Shiny Developer with AWS, we use the following application architecture, which uses AWS EC2 to create an Ubuntu Linux server that hosts a Shiny app called the Stock Analyzer in the cloud.

Data Science Web Application Architecture From Shiny Developer with AWS Course

We use a Dockerfile that is based on the rocker/shiny-verse:latest image.

We build on top of the “shiny-verse” container to increase the functionality by adding libraries:

  • mongolite for connecting to NoSQL databases
  • shiny libraries like shinyjs, shinywidgets to increase shiny functionality
  • shinyauthr for authentication

We then deploy our “Stock Analyzer” application using this Docker Container called shinyauth. The application gets hosted on our Amazon AWS EC2 instance.

Dockerfile Zoomed

If you are ready to learn how to build and deploy Shiny Applications in the cloud using AWS, then I recommend my NEW 4-Course R-Track System, which includes:

  • Business Analysis with R (Beginner)
  • Data Science for Business (Advanced)
  • Shiny Web Applications (Intermediate)
  • Expert Shiny Developer with AWS (Advanced) – NEW COURSE!!

I look forward to providing you the best data science for business education.

Matt Dancho

Founder, Business Science

Lead Data Science Instructor, Business Science University


Dataset of All ISL Results Season 1 and 2

This article was first published on Swimming + Data Science and kindly contributed to R-bloggers.

The second ISL season wrapped up a couple of weeks ago, meaning that now we’re in the off-season. I loved watching the CBC’s coverage of the ISL. The races were exciting, the swimmers seemed like they were having a great time, they were absolutely swimming great times. My other sporting love is the NBA and I massively enjoyed the games in the Orlando bubble, but I never for a second wished I was down at Disney World seeing them in person. The ISL Budapest bubble though – man did I ever want to be there.

In lieu of being on deck though I stayed home and worked on the ISL-centric features of SwimmeR, namely the swim_parse_ISL function. I’m doing this because I love swimming and I want it to thrive and I want the ISL to succeed. SwimmeR is my tiny contribution to that effort. I’m skeptical though, skeptical about the viability of a professional swimming league, and skeptical about this particular incarnation of such a league.

One area that helps a sporting league make money and sustain itself is being able to drive interest even during the off-season. The NBA is great about this, especially over the last few years, as many articles have noted. Within the swimming world the importance of driving interest year round is recognized as well, as Torrey Hart recently discussed with Mel Stewart, following her article on the business of the ISL. One of the things that makes the NBA work so well in the off-season is all the trades and salary cap machinations and the associated analytics. People love that stuff and they pay attention to it.

My original plan for this post was to do a bit of web scraping using swim_parse_ISL to build datasets of all the International Swimming League results in the two-year history of the league. Then I was going to put on my best Zach Lowe disguise and do some slick analytics on that data, making some trenchant points about roster construction for each team. I'm still going to do the first part, building the dataset for all of you. The second part though, about the analytics, is premature.

I have a theory about sports analytics: they basically operate in the margins. It's not a terribly revolutionary theory; in fact, I suspect most analytics aficionados would agree. Let's examine what that means in the context of two different leagues, the National Basketball Association (NBA) and the International Swimming League (ISL).

There are two relevant characteristics of the NBA with respect to analytics. One is the totality of the NBA. All, or very nearly all, eligible players who are known, or even thought, to be capable of contributing to an NBA team are already on NBA rosters. Yes, there's the occasional Sergio Llull who decides to play at a high level in other leagues, and there are plenty of guys who are talented but struggle with the professionalism the NBA requires, but it wouldn't be possible to assemble a competitive NBA roster from eligible players not in the league. So the thinking goes, at least. NBA front offices, and their analytics departments, spend their time looking for players that other teams don't think can make it, but actually can – the Duncan Robinsons and Fred VanVleets of the world. The difference between the consensus about Robinson (can't play in the league) and the reality (can excel in the league) is the margin in which the Miami Heat front office is operating. Some strong analytics can make that margin bigger – that's the goal.

The situation in the ISL is quite different. A very plausible plan for any ISL team looking to improve is to sign Simone Manuel, Olympic and world champs gold medalist in the 100 freestyle, or Xu Jiayu, 100 backstroke world championships gold medalist, or any of the many other top-flight swimmers who have never swum in the ISL. Heck, since ISL swimmers don't have contracts like NBA players, just do what London Roar GM Rob Woodhouse suggested on the championship broadcast and sign Caeleb Dressel away from the Condors – that would be great for any team. Tinkering with season 3 roster spots for potential or fringe contributors just doesn't make sense until we know what the top-level talent is doing.

The second key difference between the NBA and ISL is the salary cap structure. The NBA cap is extremely complex, but for our purposes it's enough to say that there's a limit to the total salary that each team is allowed to pay its players. Since individual players would like to be paid as much as possible, this sets up a market, where teams will offer players different salaries, and the player will choose to play for the team that offers him the largest one. (I realize in practice this doesn't exactly happen, but stick with me.) Teams, and their analytics departments, aren't trying to determine if a player is good or not so much as they're trying to determine how good a player is in relation to his market value. Determining that a player is worth a 10 million dollar/year salary and then paying him that 10 million dollar/year salary isn't a win for an analytics department. A win is getting 10 mil/year worth of value from a player for less than 10 mil/year – the difference between value and salary is the margin. The salary cap also means that NBA teams can't just go and sign a full roster of A-list players. Those players want the big salaries that they earn with their excellent play, and teams are only allowed to pay out so much salary.

ISL athletes aren't paid in the same way though, and there's no cap. There's nothing to prevent the reigning champion Condors from signing more awesome swimmers, and they don't have to worry about exactly how to value a Simone Manuel-level addition; they just need to know if she's better than whoever is in their last roster spot. It's a much simpler decision.

The ISL isn't ready yet for the kind of analytics in use in the NBA. It's not needed. If you're an ISL GM, what you can do to improve your team over the off-season is just find the fastest swimmers available and get them to join. No special insights are required to identify them – their medals are an easy tell. Once all that's sorted though, then maybe some analytics in the margins.

On a larger level, the uncertainty about the ISL – about what Season 3 will look like, and frankly when and if it will happen – is what makes off-season coverage so difficult and makes it hard to sustain excitement. I watched the excellent championship match, with all of its close races and world records and fantastic performances, and when it was over I wasn't sure it would ever happen again. I'm still not, but I am hopeful.

Okay, R people – you’ve waited very patiently – here’s how we’re going to put together the ISL data sets.


Web Scraping

library(SwimmeR)
library(rvest)
library(stringr)
library(dplyr)
library(purrr)
library(ggplot2)

ISL releases their results as .pdf files, and I’ve collected all of those files on github.

The process here is about the same as in my previous web-scraping post.

  1. Get web address for where the stuff of interest is
  2. Identify CSS for the stuff
  3. Make list of links to the stuff with rvest functions html_nodes and html_attr
  4. Clean up list of links
  5. Use SwimmeR functions to grab contents of list

We’ll start with season 2, since it’s the most recent one. Season 1 will be handled later with the same methods.

The web address is easy, just copy it out of your browser window.

# Part 1 - web address
web_url_season_2 <- "https://github.com/gpilgrim2670/Pilgrim_Data/tree/master/ISL/Season_2_2020"

CSS can be a bit more complicated, but some clicking around with the selector gadget got me what we need.

# Part 2 - CSS
selector <- ".js-navigation-open"

Now for rvest. We’ll use read_html to get the contents of web_url_season_2 and then html_attr and html_nodes to get those contents which are links.

# Part 3 - rvest fun
page_contents <- read_html(web_url_season_2)
links_season_2 <- html_attr(html_nodes(page_contents, selector), "href")

head(links_season_2, 10)
##  [1] ""
##  [2] ""
##  [3] ""
##  [4] "/gpilgrim2670/Pilgrim_Data/tree/master/ISL"
##  [5] "/gpilgrim2670/Pilgrim_Data/blob/master/ISL/Season_2_2020/ISL%202020%20Season%202%20Notes.txt"
##  [6] "/gpilgrim2670/Pilgrim_Data/blob/master/ISL/Season_2_2020/ISL_01112020_Budapest_Match_6.pdf"
##  [7] "/gpilgrim2670/Pilgrim_Data/blob/master/ISL/Season_2_2020/ISL_05112020_Budapest_Match_7.pdf"
##  [8] "/gpilgrim2670/Pilgrim_Data/blob/master/ISL/Season_2_2020/ISL_05112020_Budapest_Match_8.pdf"
##  [9] "/gpilgrim2670/Pilgrim_Data/blob/master/ISL/Season_2_2020/ISL_09112020_Budapest_Match_10.pdf"
## [10] "/gpilgrim2670/Pilgrim_Data/blob/master/ISL/Season_2_2020/ISL_09112020_Budapest_Match_9.pdf"

The first 5 elements (R indexes start at 1) I don't want. Maybe if I were better at CSS selectors I could have avoided them outright, but I didn't, so I'll just get rid of them here. The links are also partials, missing their beginnings, so we'll add "https://github.com" to the beginning of each with paste0. We'll also need to change "blob" to "raw" so that we get just the .pdfs, rather than the .pdfs and a GitHub landing page. Little GitHub trick there.

# Part 4 - cleaning
links_season_2 <- links_season_2[6:17] # only want links 6-17
links_season_2 <- paste0("https://github.com", links_season_2) # add beginning to link
links_season_2 <- str_replace(links_season_2, "blob", "raw") # replace blob with raw

Now it's just up to SwimmeR, with a bit of help from purrr. We'll use read_results inside of map to read in all of the elements of links_season_2. Then we'll map again to apply swim_parse_ISL, wrapped in safely, to all the results we read in. We'll also set splits = TRUE and relay_swimmers = TRUE to capture both splits and relay swimmers. After that we'll name each of our results based on its source link and then stick them all together with bind_rows.

# Part 5 - deploy SwimmeR
season_2 <- map(links_season_2, read_results)
season_2_parse <- map(season_2, safely(swim_parse_ISL, otherwise = NA),
                      splits = TRUE, relay_swimmers = TRUE)
names(season_2_parse) <- str_split_fixed(links_season_2, "/", n = 10)[, 10]
season_2_parse <- SwimmeR:::discard_errors(season_2_parse)
season_2_df <- bind_rows(season_2_parse, .id = "ID")

A Season 2 Demo

Just to show you what we've got in this dataset, here's a little exercise to get the major contributors to each team's point total. Since the ISL is focusing on points and contributing to one's team rather than times, we'll do the same.

season_2_df <- season_2_df %>%
  mutate(ID = str_remove(ID, "\\.pdf\\.result")) %>%
  mutate(Start_Date = as.Date(str_split_fixed(ID, "_", 4)[, 2], format = "%d%m%Y")) %>%
  mutate(Match = str_split_fixed(ID, "_", 4)[, 4]) %>%
  mutate(Season = 2) %>%
  arrange(Start_Date) %>%
  select(-ID)

ISL teams have 28 swimmers eligible for individual events. If they all contributed equally to the team score, each would account for about 3.5% of the total. That doesn't happen though: most teams have a few athletes who score most of the points, and the rest are what Shaquille O'Neal refers to as "others" - role players who contribute in different ways. We're going to name all the athletes who scored more than 7% of their team's total as the major contributors - those athletes are worth at least twice their equal share of points.

What we'll do here is collect all athletes scoring less than 7% of their team's total points as "Other" and look at the breakdown for each team. This is also useful because trying to differentiate 28 different colors or point shapes or whatever on a plot, one for each athlete, is a real pain.

First we'll sum up each team's total score. Mind you, some teams competed in the post-season and some did not. Teams that participated in the Semis and Championships had the most opportunity to earn points. By dividing each athlete's total by their team's total we're removing this effect. It's still worth noting though that an athlete scoring 15% of the league-leading California Condors' points has scored a lot more points than another athlete who scored 15% of the league-trailing DC Trident's.

team_scores <- season_2_df %>%
  filter(is.na(Name) == FALSE) %>%
  group_by(Team) %>%
  summarise(Team_Score = sum(Points, na.rm = TRUE)) # score for each team

Now we need the total scores for each swimmer as a percentage of their team’s total score.

individual_scores <- season_2_df %>%
  filter(is.na(Name) == FALSE) %>%
  group_by(Name) %>%
  summarise(
    Score = sum(Points, na.rm = TRUE), # score for each athlete
    Team = unique(Team)
  ) %>%
  group_by(Team) %>%
  left_join(team_scores) %>% # get team scores
  mutate(Score_Perc = 100 * round(Score/Team_Score, 5)) %>% # calculate percentage of team score for each athlete
  mutate(Name = case_when(Score_Perc < 7 ~ "Other", # rename athletes with less than 7% as other
                          TRUE ~ Name)) %>%
  group_by(Name, Team) %>%
  summarise(Score = sum(Score, na.rm = TRUE), # collect all "others" together
            Score_Perc = sum(Score_Perc)) %>%
  group_by(Team) %>%
  mutate(Score_Rank = rank(desc(Score_Perc))) # get numeric rank for each athlete and other in terms of their percent of the total team score

Here’s the Condors. Four swimmers scored almost 50% of their points, including our favorite, the undefeated queen of ISL 100 breaststroke, Lilly King!

individual_scores %>%
  filter(Team == "CAC") %>% # just want cali condors
  arrange(Score_Perc) %>%
  mutate(Name = factor(Name, unique(Name))) %>% # order athlete names by score_perc
  ggplot(aes(
    x = "",
    y = Score_Perc,
    fill = Name,
    label = paste0(round(Score_Perc, 2), "%") # put score_perc values in plot
  )) +
  geom_bar(width = 1, stat = "identity") +
  geom_text(size = 4, position = position_stack(vjust = 0.5)) +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank()) +
  labs(title = "California Condors",
       y = "Percentage of Total Team Score")

Pie charts aren’t terribly well loved in the data-vis community. They’re fairly panned as being hard to get values off of, and the colors can make them difficult to read. Pie charts are pretty though - they look like flowers. Just look at this, ten pretty, pretty data flowers, one for each team.

individual_scores %>%
  ggplot(aes(x = "", y = Score_Perc, fill = as.factor(Score_Rank))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) + # converts bar chart into pie chart
  scale_fill_discrete(name = "Point Source",
                      labels = c("Other", "#1 Swimmer", "#2 Swimmer",
                                 "#3 Swimmer", "#4 Swimmer", "#5 Swimmer")) + # labels for colors
  facet_wrap(. ~ Team) + # one plot per team
  theme_void() +
  theme(axis.text.x = element_blank()) +
  # legend.position = "none") +
  labs(title = "ISL Season 2")

We can convey a similar point with a chart like this - the percent of each team’s points scored by their top (non-other) athletes.

individual_scores %>%
  filter(Name != "Other") %>%
  group_by(Team) %>%
  summarise(Score_Perc = sum(Score_Perc)) %>%
  arrange(Score_Perc) %>%
  mutate(Team = factor(Team, unique(Team))) %>%
  ggplot(aes(x = reorder(Team, Score_Perc), y = Score_Perc)) +
  geom_point() +
  scale_y_continuous(breaks = seq(5, 50, 5)) +
  theme_bw() +
  labs(title = "Percentage Of Total Points Scored by Top Swimmers",
       x = "Team",
       y = "Score Percentage")

Some teams, like the London Roar, are more balanced, while others, like the Condors, are more top-heavy. There's not an obvious lesson to be drawn here though, because LON, CAC, ENS and LAC were the best teams in the league in Season 2 and they're all over this plot. Also, there's nothing to be surmised about next season, because we don't know anything about next season. We're just doing this for fun.


Season 1 Data Set

The ISL made a lot of changes between season 1 and season 2, so they’re not directly comparable. It’s still possible to build a data set from season 1 though. The methods are exactly the same as for season 2 above.

web_url_season_1 <- "https://github.com/gpilgrim2670/Pilgrim_Data/tree/master/ISL/Season_1_2019"
selector <- ".js-navigation-open"

page_contents <- read_html(web_url_season_1)
links_season_1 <- html_attr(html_nodes(page_contents, selector), "href")
links_season_1 <- links_season_1[5:18]
links_season_1 <- paste0("https://github.com", links_season_1)
links_season_1 <- str_replace(links_season_1, "blob", "raw")

season_1 <- map(links_season_1, read_results)
season_1_parse <- map(season_1, safely(swim_parse_ISL, otherwise = NA),
                      splits = TRUE, relay_swimmers = TRUE)
names(season_1_parse) <- str_split_fixed(links_season_1, "/", n = 10)[, 10]
season_1_parse <- SwimmeR:::discard_errors(season_1_parse)

season_1_df <- bind_rows(season_1_parse, .id = "ID") %>%
  mutate(ID = str_remove(ID, "\\.pdf\\.result")) %>%
  mutate(Start_Date = as.Date(str_split_fixed(ID, "_", 4)[, 2], format = "%d%m%Y")) %>%
  mutate(Match = str_split_fixed(ID, "_", 5)[, 3]) %>%
  mutate(Season = 1) %>%
  arrange(Start_Date) %>%
  select(-ID)

In closing

Thank you for once again joining us here at Swimming + Data Science. We hope to see you back next time, just like we hope to see the ISL back next season!


Tune random forests for #TidyTuesday IKEA prices

This article was first published on rstats | Julia Silge and kindly contributed to R-bloggers.

This is the latest in my series of screencasts demonstrating how to use the tidymodels packages, from starting out with first modeling steps to tuning more complex models. Today’s screencast walks through how to get started quickly with tidymodels via usemodels functions for code scaffolding and generation, using this week’s #TidyTuesday dataset on IKEA furniture prices. 🛋

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore the data

Our modeling goal is to predict the price of IKEA furniture from other furniture characteristics like category and size. Let’s start by reading in the data.

library(tidyverse)

ikea <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv")

How is the price related to the furniture dimensions?

ikea %>%
  select(X1, price, depth:width) %>%
  pivot_longer(depth:width, names_to = "dim") %>%
  ggplot(aes(value, price, color = dim)) +
  geom_point(alpha = 0.4, show.legend = FALSE) +
  scale_y_log10() +
  facet_wrap(~dim, scales = "free_x") +
  labs(x = NULL)

There are lots more great examples of #TidyTuesday EDA out there to explore on Twitter! Let’s do a bit of data preparation for modeling. There are still lots of NA values for furniture dimensions but we are going to impute those.

ikea_df <- ikea %>%
  select(price, name, category, depth, height, width) %>%
  mutate(price = log10(price)) %>%
  mutate_if(is.character, factor)

ikea_df
## # A tibble: 3,694 x 6
##    price name                  category      depth height width
##  1  2.42 FREKVENS              Bar furniture    NA     99    51
##  2  3.00 NORDVIKEN             Bar furniture    NA    105    80
##  3  3.32 NORDVIKEN / NORDVIKEN Bar furniture    NA     NA    NA
##  4  1.84 STIG                  Bar furniture    50    100    60
##  5  2.35 NORBERG               Bar furniture    60     43    74
##  6  2.54 INGOLF                Bar furniture    45     91    40
##  7  2.11 FRANKLIN              Bar furniture    44     95    50
##  8  2.29 DALFRED               Bar furniture    50     NA    50
##  9  2.11 FRANKLIN              Bar furniture    44     95    50
## 10  3.34 EKEDALEN / EKEDALEN   Bar furniture    NA     NA    NA
## # … with 3,684 more rows

Build a model

We can start by loading the tidymodels metapackage, splitting our data into training and testing sets, and creating resamples.

library(tidymodels)

set.seed(123)
ikea_split <- initial_split(ikea_df, strata = price)
ikea_train <- training(ikea_split)
ikea_test <- testing(ikea_split)

set.seed(234)
ikea_folds <- bootstraps(ikea_train, strata = price)

ikea_folds
## # Bootstrap sampling using stratification
## # A tibble: 25 x 2
##    splits             id
##  1 Bootstrap01
##  2 Bootstrap02
##  3 Bootstrap03
##  4 Bootstrap04
##  5 Bootstrap05
##  6 Bootstrap06
##  7 Bootstrap07
##  8 Bootstrap08
##  9 Bootstrap09
## 10 Bootstrap10
## # … with 15 more rows

In this analysis, we are using a function from usemodels to provide scaffolding for getting started with tidymodels tuning. The two inputs we need are:

  • a formula to describe our model price ~ .
  • our training data ikea_train
library(usemodels)
use_ranger(price ~ ., data = ikea_train)
## lots of options, like use_xgboost, use_glmnet, etc

The output that we get from the usemodels scaffolding sets us up for random forest tuning, and we can add just a few more feature engineering steps to take care of the numerous factor levels in the furniture name and category, “cleaning” the factor levels, and imputing the missing data in the furniture dimensions. Then it’s time to tune!

library(textrecipes)

ranger_recipe <-
  recipe(formula = price ~ ., data = ikea_train) %>%
  step_other(name, category, threshold = 0.01) %>%
  step_clean_levels(name, category) %>%
  step_knnimpute(depth, height, width)

ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_mode("regression") %>%
  set_engine("ranger")

ranger_workflow <-
  workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(ranger_spec)

set.seed(8577)
doParallel::registerDoParallel()
ranger_tune <-
  tune_grid(ranger_workflow,
    resamples = ikea_folds,
    grid = 11
  )

The usemodels output required us to decide for ourselves on the resamples and grid to use; it provides sensible defaults for many options based on our data but we still need to use good judgment for some modeling inputs.

Explore results

Now let’s see how we did. We can check out the best-performing models in the tuning results.

show_best(ranger_tune, metric = "rmse")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config
## 1     2     4 rmse    standard   0.342    25 0.00211 Preprocessor1_Model10
## 2     4    10 rmse    standard   0.348    25 0.00234 Preprocessor1_Model05
## 3     5     6 rmse    standard   0.349    25 0.00267 Preprocessor1_Model06
## 4     3    18 rmse    standard   0.351    25 0.00211 Preprocessor1_Model01
## 5     2    21 rmse    standard   0.355    25 0.00197 Preprocessor1_Model08

show_best(ranger_tune, metric = "rsq")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config
## 1     2     4 rsq     standard   0.714    25 0.00336 Preprocessor1_Model10
## 2     4    10 rsq     standard   0.704    25 0.00367 Preprocessor1_Model05
## 3     5     6 rsq     standard   0.703    25 0.00408 Preprocessor1_Model06
## 4     3    18 rsq     standard   0.698    25 0.00336 Preprocessor1_Model01
## 5     2    21 rsq     standard   0.694    25 0.00324 Preprocessor1_Model08

How did all the possible parameter combinations do?

autoplot(ranger_tune)

We can finalize our random forest workflow with the best performing parameters.

final_rf <- ranger_workflow %>%
  finalize_workflow(select_best(ranger_tune))

final_rf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_other()
## ● step_clean_levels()
## ● step_knnimpute()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
##   mtry = 2
##   trees = 1000
##   min_n = 4
##
## Computational engine: ranger

The function last_fit() fits this finalized random forest one last time to the training data and evaluates it one last time on the testing data.

ikea_fit <- last_fit(final_rf, ikea_split)

ikea_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 x 6
##   splits        id           .metrics      .notes      .predictions    .workflow
## 1

The metrics in ikea_fit are computed using the testing data.

collect_metrics(ikea_fit)
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config
## 1 rmse    standard       0.314 Preprocessor1_Model1
## 2 rsq     standard       0.769 Preprocessor1_Model1

The predictions in ikea_fit are also for the testing data.

collect_predictions(ikea_fit) %>%
  ggplot(aes(price, .pred)) +
  geom_abline(lty = 2, color = "gray50") +
  geom_point(alpha = 0.5, color = "midnightblue") +
  coord_fixed()

We can use the trained workflow from ikea_fit for prediction, or save it to use later.

predict(ikea_fit$.workflow[[1]], ikea_test[15, ])
## # A tibble: 1 x 1
##   .pred
## 1  2.72
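Saving that trained workflow for later use is straightforward with base R serialization; here is a minimal sketch (the file name is just an example):

trained_wf <- ikea_fit$.workflow[[1]]
saveRDS(trained_wf, "ikea_price_rf.rds") # illustrative file name

# in a later session, reload and predict as before
trained_wf <- readRDS("ikea_price_rf.rds")
predict(trained_wf, ikea_test[15, ])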

Lastly, let’s learn about feature importance for this model using the vip package. For a ranger model, we do need to go back to the model specification itself and update the engine with importance = "permutation" in order to compute feature importance. This means fitting the model one more time.

library(vip)

imp_spec <- ranger_spec %>%
  finalize_model(select_best(ranger_tune)) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(imp_spec) %>%
  fit(ikea_train) %>%
  pull_workflow_fit() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))


Visualizing geospatial data in R—Part 2: Making maps with ggplot2

This article was first published on Articles - The Analyst Code and kindly contributed to R-bloggers.

Introduction

This is part 2 of a 4-part series on how to build maps using R.

  1. How to load geospatial data into your workspace and prepare it for visualization

  2. How to make static maps using ggplot2

  3. How to make interactive maps (pan, zoom, click) using leaflet

  4. How to add interactive maps to a Shiny dashboard

Review: Load and clean data

With minimal commentary, below is code to load and clean two datasets for visual analysis in the rest of the post. The first dataset is a .geojson file containing geospatial descriptions of Philadelphia’s neighborhoods, courtesy of OpenDataPhilly. This dataset is polygon data and will form our basemap for layering on additional, more interesting, features.

The second dataset, also courtesy of OpenDataPhilly, is our dataset of interest: a 2016 inventory of all of the trees in Philadelphia. This dataset is point data, where each tree has associated coordinates identifying its precise location. Don't ask who collected this data or why; just be thankful that it exists. As for its accuracy, I can attest that when I step out on my stoop and look up and down the street, what I see is spot on with the city's data. At any rate, it is more than sufficient for illustrative purposes.

Without further ado, let’s get the data loaded into our workspace.

# SETUP        #### #### #### #### #### ####
library(dplyr)
library(sf)
library(ggplot2)

# LOAD DATA    #### #### #### #### #### ####

## Simple features data: Philadelphia neighborhoods
# Source: OpenDataPhilly. https://www.opendataphilly.org/dataset/philadelphia-neighborhoods
neighborhoods_geojson <- "https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson"
neighborhoods_raw <- sf::read_sf(neighborhoods_geojson)

head(neighborhoods_raw)
#> Simple feature collection with 6 features and 8 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -75.28027 ymin: 39.96271 xmax: -75.01684 ymax: 40.09464
#> geographic CRS: WGS 84
#> # A tibble: 6 x 9
#>   name  listname mapname shape_leng shape_area cartodb_id created_at         
#> 1 PENN~ Pennypa~ Pennyp~     87084.  60140756.          9 2013-03-19 13:41:50
#> 2 OVER~ Overbro~ Overbr~     57005.  76924995.        138 2013-03-19 13:41:50
#> 3 GERM~ Germant~ Southw~     14881.  14418666.         59 2013-03-19 13:41:50
#> 4 EAST~ East Pa~ East P~     10886.   4231000.        129 2013-03-19 13:41:50
#> 5 GERM~ Germany~ German~     13042.   6949968.         49 2013-03-19 13:41:50
#> 6 MOUN~ Mount A~ East M~     28846.  43152470.          6 2013-03-19 13:41:50
#> # ... with 2 more variables: updated_at, geometry

## Simple features data: Philadelphia urban forest
# Source: OpenDataPhilly. https://www.opendataphilly.org/dataset/philadelphia-street-tree-inventory
trees_geojson <- "http://data.phl.opendata.arcgis.com/datasets/957f032f9c874327a1ad800abd887d17_0.geojson"
trees_raw <- sf::read_sf(trees_geojson)

head(trees_raw)
#> Simple feature collection with 6 features and 4 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: -75.15902 ymin: 39.9944 xmax: -75.15441 ymax: 39.99517
#> geographic CRS: WGS 84
#> # A tibble: 6 x 5
#>   OBJECTID SPECIES STATUS DBH               geometry
#> 1        1                     (-75.15608 39.99506)
#> 2        2                      (-75.1578 39.99459)
#> 3        3                     (-75.15441 39.99517)
#> 4        4                     (-75.15446 39.99495)
#> 5        5                      (-75.15752 39.9944)
#> 6        6                      (-75.15902 39.9946)

# CLEAN DATA   #### #### #### #### #### ####
neighborhoods <- neighborhoods_raw %>% 
  dplyr::transmute(
    NEIGHBORHOOD_ID = cartodb_id, 
    LABEL = mapname,
    AREA = shape_area/43560 # sq. ft to acres
  )

trees <- trees_raw %>% 
  dplyr::select(TREE_ID = OBJECTID)

As you can see, we now have two objects–neighborhoods and trees–prepared for our analysis. Since the original data was relatively clean, the only modifications we have made are to drop columns, rename columns, and convert square feet into acres.

Geospatial layers in ggplot2

Your first map

To draw static maps in R, we will use ggplot2, which is not only the standard package for drawing traditional bar plots, line plots, histograms, and other standard visualizations of discrete or continuous data, but is also the standard package for drawing maps. A few other packages are excellent alternatives, including sf and maps. We prefer ggplot2 because it has a consistent grammar of graphics between its various applications and offers a robust set of geospatial graphing functions.

Let’s take a moment to refresh ourselves on ggplot2’s functionality. To make a plot, you need three steps: (1) initiate the plot, (2) add as many data layers as you want, and (3) adjust plot aesthetics, including scales, titles, and footnotes.

To (1) initiate the plot, we first call ggplot(), and to (2) add data layers, we next call geom_sf() once for each layer. We have the option to add data = neighborhoods to provide simple features data to our plot either in the ggplot() call or in the geom_sf() call. In ggplot2, functions inherit from functions called higher up. Thus, if we call ggplot(data = neighborhoods), we do not need to subsequently specify geom_sf(data = neighborhoods), as the data argument is inherited in geom_sf(). The same goes for aesthetics. If we specify ggplot(data = neighborhoods, aes(fill = FILL_GROUP)), the subsequent geom_sf() call will already know how we want to color our plot. The behavior of inheriting aesthetics is convenient for drawing graphs that only use one dataset, which is most common in non-geospatial visualization. A line plot might call both geom_line() and geom_point() and want both the lines and points to inherit the same dataset. Since maps frequently have many layers from different sources, we will elect to specify our data and aesthetics within the geom_sf() calls and not in the ggplot() calls.
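To make that inheritance concrete, here is a minimal sketch using the neighborhoods object from above; the AREA fill is just an illustrative choice and not a plot from this post.

# Data and aesthetics supplied in ggplot(): inherited by geom_sf()
ggplot2::ggplot(data = neighborhoods, aes(fill = AREA)) +
  ggplot2::geom_sf()

# Equivalent plot with data and aesthetics supplied in geom_sf() itself
ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, aes(fill = AREA))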

Finally, to (3) adjust overall aesthetics, we will use a range of functions, such as the theme_* family of functions, the scale_fill_* family of functions, and the coord_sf() function. Before we get there, however, we will start with a simple plot with just our two data layers.

ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods) +
  ggplot2::geom_sf(data = trees)
basic_map.png

Simple formatting adjustments

Let’s clean up the colors to make this a little bit easier on the eyes.

ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, fill = "#d3d3d3") +
  ggplot2::geom_sf(data = trees, color = "#74b560") +
  ggplot2::theme_bw()
colored_map.png

Zoom in on a region of interest

By default, ggplot2 will zoom out so that all of the mapping objects are in the image. Suppose, however, that we are interested in a smaller region of the map: Center City Philadelphia. We can use ggplot2::coord_sf() to specify the coordinates to display. By default, geom_sf() calls coord_sf() in the background, but by explicitly calling it ourselves, we can override the default parameters. Below, we will specify our latitude and longitude, and set expand = FALSE. By default, expand is true, which puts a small buffer around the coordinates we specify. It’s an aesthetic choice.

While we’re here, let’s make a brief comment on CRSs in ggplot2 mapping. If you recall from Part 1 of this series, the CRS is the ellipsoid and datum used to reference points on the globe. ggplot2 will take the first CRS provided (in this case, in our neighborhoods dataset) and ensure that all subsequent layers use the same CRS. It automatically converts any mismatched CRSs to the first one provided. Using coord_sf(), we have options to change the CRS and the datum. Changing the datum won’t affect plotting, but will affect where graticules (latitude/longitude lines) are drawn if you choose to include them. By default, ggplot2 draws graticules using WGS 84 (EPSG: 4326), which happens to be the CRS of our two datasets. If we had needed to, we could have changed to NAD 83 (EPSG: 4269) using datum = sf::st_crs(4269).
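As a quick illustration of that last point, here is a minimal sketch of overriding the graticule datum; the EPSG code comes straight from the paragraph above, while the rest mirrors the earlier plotting code:

# Same layers as before, but draw graticules using NAD 83 instead of WGS 84
ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, fill = "#d3d3d3") +
  ggplot2::geom_sf(data = trees, color = "#74b560") +
  ggplot2::coord_sf(datum = sf::st_crs(4269)) +
  ggplot2::theme_bw()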

ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, fill = "#d3d3d3") +
  ggplot2::geom_sf(data = trees, color = "#74b560") +
  ggplot2::coord_sf(
    xlim = c(-75.185, -75.13), ylim = c(39.93, 39.96),
    expand = FALSE
  ) + 
  ggplot2::theme_bw()
cc_philadelphia_map.png

Add labels for clearer communication

Now that we have zoomed in to a smaller region, we have space on our map to add labels. Here, we use the function geom_sf_text(), specifying the label as an aesthetic (aes(...)). We use the aesthetic because the label comes from within our data object and changes with each row in the object, as opposed to being a single value specified at time of plotting, as we had done with fill and color above. Notice that by adding the labels after the “trees” data, the label appears on top. geom_sf_text() adds text, and geom_sf_label() adds text enclosed in a small box for easier reading. It’s an aesthetic choice.

ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, fill = "#d3d3d3") +
  ggplot2::geom_sf(data = trees, color = "#74b560") +
  ggplot2::geom_sf_text(
    data = neighborhoods, aes(label = LABEL),
    fontface = "bold", check_overlap = TRUE
  ) +
  ggplot2::coord_sf(
    xlim = c(-75.185, -75.13), ylim = c(39.93, 39.96),
    expand = FALSE
  ) + 
  ggplot2::theme_bw()
labeled_cc_philadelphia_map.png

Add highlights and annotations

I want to highlight and annotate my favorite tree. Since the highlight rule needs to be determined tree-by-tree, we need to adjust our simple features object and add an appropriate aesthetic call to our plot. First we adjust our simple features object by adding columns for the color group and the label text. Then, we adjust our plot by including aes(color = COLOR) to define color groups and simultaneously adding scale_color_manual() to specify the colors we want for each group. At the same time, we optionally set show.legend = FALSE to hide the legend. We also add the label text using geom_sf_label() using the aes(label = LABEL) to specify the text and other parameters to adjust how it appears on the plot.

trees_highlight <- trees %>% 
  dplyr::mutate(
    HIGHLIGHT_IND = dplyr::if_else(TREE_ID == 39547, 1, 0),
    COLOR = as.factor(HIGHLIGHT_IND),
    LABEL = dplyr::if_else(HIGHLIGHT_IND == 1, "My favorite", NULL)
  ) %>%
  dplyr::select(-HIGHLIGHT_IND)

ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, fill = "#d3d3d3") +
  ggplot2::geom_sf(
    data = trees_highlight, aes(color = COLOR), 
    show.legend = FALSE
  ) +
  ggplot2::geom_sf_text(
    data = neighborhoods, aes(label = LABEL),
    fontface = "bold", check_overlap = TRUE
  ) +
  ggplot2::geom_sf_label(
    data = trees_highlight, aes(label = LABEL),
    color = "#cb7123", fontface = "bold",
    nudge_x = 0.005, na.rm = TRUE
  ) +
  ggplot2::coord_sf(
    xlim = c(-75.185, -75.13), ylim = c(39.93, 39.96),
    expand = FALSE
  ) + 
  ggplot2::scale_color_manual(values = c("#74b560", "#cb7123")) +
  ggplot2::theme_bw()
highlighted_cc_philadelphia_map.png

Final beautification

The options for beautifying your map are endless. Some people like to add a scale and north arrow using the ggspatial package. I prefer to leave it off but to add axis labels, a title and subtitle, a source note, and to make a few additional adjustments using the theme() function. The aesthetic choice is yours.

ggplot2::ggplot() +
  ggplot2::geom_sf(data = neighborhoods, fill = "#d3d3d3") +
  ggplot2::geom_sf(
    data = trees_highlight, aes(color = COLOR), 
    show.legend = FALSE
  ) +
  ggplot2::geom_sf_text(
    data = neighborhoods, aes(label = LABEL),
    fontface = "bold", check_overlap = TRUE
  ) +
  ggplot2::geom_sf_label(
    data = trees_highlight, aes(label = LABEL),
    color = "#cb7123", fontface = "bold",
    nudge_x = 0.006, na.rm = TRUE
  ) +
  ggplot2::coord_sf(
    xlim = c(-75.185, -75.13), ylim = c(39.93, 39.96),
    expand = FALSE
  ) + 
  ggplot2::xlab("Longitude") +
  ggplot2::ylab("Latitude") +
  ggplot2::labs(
    title = "The Urban Forest of Center City Philadelphia",
    subtitle = "2016 virtual assessment of Philadelphia's street trees",
    caption = "Source: OpenDataPhilly"
  ) + 
  ggplot2::scale_color_manual(values = c("#74b560", "#cb7123")) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    panel.grid.major = ggplot2::element_line(
      color = gray(0.5), linetype = "dashed", size = 0.5
    ),
    panel.background = ggplot2::element_rect(fill = gray(0.75))
  )
final_cc_philadelphia_map.png

Choropleths in ggplot2

We must cover one final topic in ggplot2 before wrapping up this article. This is the concept of a “choropleth” map, which colors regions to represent a statistical variable. For instance, we may want to color our neighborhoods by the number of trees in each, or (more appropriately) the number of trees per acre.

Merge and clean data

To color our neighborhoods, we must first merge what had been two distinct layers of neighborhoods and trees into a single simple features object. By combining, we lose our point data and convert it into a statistic (the count of trees and trees per acre) associated with each neighborhood.

The sf::st_join() function has options for the join method. By default, it uses st_intersects, but we can change it to other options such as st_contains or st_covered_by, depending on our needs (a small sketch follows below). When joining polygon data and point data, we may have some unusual cases for trees that are on the exact border between neighborhoods. For this example, we can safely ignore those edge cases.
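As a small, hedged sketch of what swapping the join predicate looks like (st_contains is one of the alternatives mentioned above, and the object names reuse the ones created earlier):

# Default predicate: a tree matches any neighborhood polygon it intersects
joined_default <- sf::st_join(neighborhoods, trees)

# Alternative predicate: a tree must lie strictly inside the polygon,
# so points sitting exactly on a shared border are dropped
joined_contains <- sf::st_join(neighborhoods, trees, join = sf::st_contains)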

The join will result in a very long object, with one row for every combination of neighborhood and tree. Thus, we need to group_by and summarise as quickly as we can to return to the proper-sized data frame.

tree_count <- sf::st_join(neighborhoods, trees) %>% 
  dplyr::group_by(NEIGHBORHOOD_ID, LABEL, AREA) %>% 
  dplyr::summarise(
    COUNT = n(),
    .groups = "drop"
  ) %>% 
  dplyr::mutate(DENSITY = COUNT/AREA) %>% 
  dplyr::select(NEIGHBORHOOD_ID, LABEL, DENSITY)

head(tree_count)
#> Simple feature collection with 6 features and 3 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -75.23049 ymin: 39.9849 xmax: -75.0156 ymax: 40.11269
#> geographic CRS: WGS 84
#> # A tibble: 6 x 4
#>   NEIGHBORHOOD_ID LABEL     DENSITY                                     geometry
#> 1               1 Bridesbu~   0.459 (((-75.06773 40.0054, -75.06765 40.0052, -7~
#> 2               2 Bustleton   0.714 (((-75.0156 40.09487, -75.01768 40.09276, -~
#> 3               3 Cedarbro~   1.37  (((-75.18848 40.07273, -75.18846 40.07275, ~
#> 4               4 Chestnut~   1.34  (((-75.21221 40.08603, -75.21211 40.08603, ~
#> 5               5 East Fal~   1.55  (((-75.18479 40.02837, -75.18426 40.02789, ~
#> 6               6 East Mou~   2.50  (((-75.18088 40.04325, -75.18096 40.04317, ~

Draw the choropleth

Now we are ready to draw our map. After the exercises above, this map will seem surprisingly simple, since we have only one data object. Since we have only one object, we can safely place it in the initial ggplot() call and allow our subsequent functions to inherit the data and aes.

In order to color our object, we need to specify a scale. In this case, we will use scale_fill_viridis_c (where “c” stands for “continuous”), with an alpha (opacity) of 0.75. This choropleth, as opposed to the previous graphs of the tree locations, immediately shows the reader how variable tree density is across the city, with Center City, Fairmount, and University City much more tree-filled than all other regions.

ggplot2::ggplot(data = tree_count, aes(fill = DENSITY)) +
  ggplot2::geom_sf() +
  ggplot2::scale_fill_viridis_c(alpha = 0.75) +
  ggplot2::theme_bw()
philadelphia_choropleth.png

Final beautification

As before, the options for beautifying your map are endless. We now have the added option of formatting our legend title with labs(fill = "Trees per acre"). If you desire, you can apply the same skills of zooming, labeling, and annotating as we used above in our non-choropleth maps to highlight particular regions of interest.

ggplot2::ggplot(data = tree_count, aes(fill = DENSITY)) +
  ggplot2::geom_sf() +
  ggplot2::scale_fill_viridis_c(alpha = 0.75) +
  ggplot2::xlab("Longitude") +
  ggplot2::ylab("Latitude") +
  ggplot2::labs(
    title = "The Urban Forest of Philadelphia",
    subtitle = "Number of trees per acre",
    caption = "Source: OpenDataPhilly",
    fill = "Trees per acre"
  ) + 
  ggplot2::theme_bw() +
  ggplot2::theme(
    panel.grid.major = ggplot2::element_line(
      color = gray(0.5), linetype = "dashed", size = 0.5
    ),
    panel.background = ggplot2::element_rect(fill = gray(0.75))
  )
final_philadelphia_choropleth.png

Conclusion

To conclude, we have now seen how to add multiple layers of simple features data to a single ggplot2 plot and adjust the aesthetics of those layers to suit our needs. This includes the ability to zoom, add labels, and highlight geospatial features. We have also seen how to join two geospatial datasets together in order to create a choropleth map, coloring regions according to a desired summary statistic.

These two types of maps–multiple layers and choropleths–form the basis for most mapping applications. With the appropriate combination of the tools demonstrated here and sufficient patience, you can create a wide range of maps with colors, points of interest, labels, highlights, and annotations.

Maps with ggplot2 are static images, perfect for export and sharing as a .jpg. The downside, however, is that viewers of the map are limited to what you choose to show them. You as the creator must choose the zoom level and decide which features are worth labeling. Users can evaluate your maps but cannot do any new exploration of their own. In this post, we focused on Center City Philadelphia, and you as the reader were limited in your ability to explore the rest of Philadelphia’s urban forest in the same amount of detail.

Continue reading in Part 3 to learn how to create an interactive plot with leaflet, which lets users pan, zoom, and hover for detail in order to give them greater flexibility of how they extract insights from maps you create.


To leave a comment for the author, please follow the link and comment on their blog: Articles - The Analyst Code.


The post Visualizing geospatial data in R—Part 2: Making maps with ggplot2 first appeared on R-bloggers.

Championing the R Community for the NHS


[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The NHS is one of the UK’s most valued institutions and serves as the healthcare infrastructure for millions of people. Mango has had the pleasure of supporting their internal NHS-R community over the last few years, supporting the initiative from its inception and sharing our knowledge and expertise at their events as they seek to promote the wider usage and adoption of R and develop best practice solutions to NHS problems.

According to a recent survey by Udemy, 62% of organisations are focusing on closing skills gaps, essential to keeping teams competitive, up to date and armed with the relevant skills to adapt to future challenges. For many institutions, an important first step is connecting their analytics teams and data professionals to encourage the collaboration and sharing of knowledge. With ‘Data literacy’ fast becoming the new computer literacy, workforces with strong data skills are quickly realising the strength and value of such skills across the whole organisation.

As the UK’s largest employer, comprising 207 clinical commissioning groups, 135 acute non-specialist trusts and 17 acute specialist trusts in England alone, the NHS faces a particularly daunting task when it comes to connecting their data professionals, a vast group which includes clinicians as well as performance, information and health analysts.

The NHS-R community was the brainchild of Professor Mohammed Mohammed, Principal Consultant (Strategy Unit), Professor of Healthcare, Quality & Effectiveness at the University of Bradford. He argues,  “I’m pretty sure there is enough brain power in NHS to tackle any analytical challenge, but what we have to do is harness that power, promoting R as the incredible tool that it is, and one that can enable the growing NHS analytics community to work collaboratively, rather than in silos”.

Three years in and the NHS-R Community has begun to address that issue, bringing together once disparate groups and individuals to create a community, sharing insights, use cases, best practices and approaches, designed to create better outputs across the NHS with a key aim of improving patient outcomes.  Having delivered workshops at previous NHS-R conferences, Mango consultants were pleased to support the most recent virtual conference with two workshops – An Introduction to the Tidyverse and Text Analysis in R. These courses proved to be a popular choice with the conference attendees, attracting feedback such as “The workshop has developed my confidence for using R in advanced analysis” and “An easy to follow and clear introduction to the topic.”

Liz Mathews, Mango’s Head of Community, has worked with Professor Mohammed from the beginning, sharing information and learnings from our own R community work and experience.  Professor Mohammed commented:

“The NHS-R community has, from its very first conference, enjoyed support from Mango who have a wealth of experience in using R for government sector work and great insight in how to develop and support R based communities. Mango hosts the annual R in Industry conference (EARL) to which NHS-R Community members are invited and from which we have learned so much. We see Mango as a friend and a champion for the NHS-R Community.”

The post Championing the R Community for the NHS appeared first on Mango Solutions.


To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.


The post Championing the R Community for the NHS first appeared on R-bloggers.

Bayesian forecasting for uni/multivariate time series


[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post is about Bayesian forecasting of univariate/multivariate time series in nnetsauce.

For each statistical/machine learning (ML) model presented below, its default hyperparameters are used. A further tuning of their respective hyperparameters could, of course, result in a much better performance than what’s showcased here.

1 – univariate time series

The Nile dataset is used as univariate time series. It contains measurements of the annual flow of the river Nile at Aswan (formerly Assuan), 1871–1970, in 10^8 m^3, “with apparent changepoint near 1898” (Cobb(1978), Table 1, p.249).

library(datasets)
plot(Nile)

pres-image

Split dataset into training/testing sets:

X <- matrix(Nile, ncol=1)
index_train <- 1:floor(nrow(X)*0.8)
X_train <- matrix(X[index_train, ], ncol=1)
X_test <- matrix(X[-index_train, ], ncol=1)

sklearn’s BayesianRidge() is the workhorse here, for nnetsauce’s MTS. It could actually be any Bayesian ML model possessing methods fit and predict (there’s literally an infinity of possibilities here for class MTS).

obj <- nnetsauce::sklearn$linear_model$BayesianRidge()
print(obj$get_params())

Fit and predict using obj:

fit_obj <- nnetsauce::MTS(obj = obj)
fit_obj$fit(X_train)
preds <- fit_obj$predict(h = nrow(X_test), level=95L,
                         return_std=TRUE)

95% credible intervals:

n_test <- nrow(X_test)
xx <- c(1:n_test, n_test:1)
yy <- c(preds$lower, rev(preds$upper))

plot(1:n_test, drop(X_test), type='l', main="Nile",
     ylim = c(500, 1200))
polygon(xx, yy, col = "gray", border = "gray")
points(1:n_test, drop(X_test), pch=19)
lines(1:n_test, drop(X_test))
lines(1:n_test, drop(preds$mean), col="blue", lwd=2)

pres-image
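As noted above, BayesianRidge() is just one choice; any Bayesian regressor exposing fit and predict (with a return_std option) should slot into MTS the same way. Here is a minimal sketch using sklearn's ARDRegression, assuming it is reachable through nnetsauce's sklearn binding exactly like BayesianRidge (an assumption, not something demonstrated in the original post):

# Swap in another Bayesian linear model from scikit-learn
# (assumption: exposed via nnetsauce::sklearn just like BayesianRidge)
obj_ard <- nnetsauce::sklearn$linear_model$ARDRegression()

fit_obj_ard <- nnetsauce::MTS(obj = obj_ard)
fit_obj_ard$fit(X_train)
preds_ard <- fit_obj_ard$predict(h = nrow(X_test), level = 95L,
                                 return_std = TRUE)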

2 - multivariate time series

The usconsumption dataset is used as an example of multivariate time series. It contains percentage changes in quarterly personal consumption expenditure and personal disposable income for the US, 1970 to 2010. (Federal Reserve Bank of St Louis. http://data.is/AnVtzB. http://data.is/wQPcjU.)

library(fpp)
plot(fpp::usconsumption)

pres-image

Split dataset into training/testing sets:

X <- as.matrix(fpp::usconsumption)
index_train <- 1:floor(nrow(X)*0.8)
X_train <- X[index_train, ]
X_test <- X[-index_train, ]

Fit and predict:

obj <- nnetsauce::sklearn$linear_model$BayesianRidge()
fit_obj2 <- nnetsauce::MTS(obj = obj)
fit_obj2$fit(X_train)
preds <- fit_obj2$predict(h = nrow(X_test), level=95L,
                          return_std=TRUE) # standardized output
# plot against X_test below

95% credible intervals:

n_test <- nrow(X_test)
xx <- c(1:n_test, n_test:1)
yy <- c(preds$lower[,1], rev(preds$upper[,1]))
yy2 <- c(preds$lower[,2], rev(preds$upper[,2]))

par(mfrow=c(1, 2))

# 95% credible intervals
plot(1:n_test, X_test[,1], type='l', ylim=c(-2.5, 3),
     main="consumption")
polygon(xx, yy, col = "gray", border = "gray")
points(1:n_test, X_test[,1], pch=19)
lines(1:n_test, X_test[,1])
lines(1:n_test, preds$mean[,1], col="blue", lwd=2)

plot(1:n_test, X_test[,2], type='l', ylim=c(-2.5, 3),
     main="income")
polygon(xx, yy2, col = "gray", border = "gray")
points(1:n_test, X_test[,2], pch=19)
lines(1:n_test, X_test[,2])
lines(1:n_test, preds$mean[,2], col="blue", lwd=2)

pres-image


To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.


The post Bayesian forecasting for uni/multivariate time series first appeared on R-bloggers.


MazamaSpatialUtils R package


[This article was first published on R – Blog – Mazama Science , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Version 0.7 of the MazamaSpatialUtils is now available on CRAN and includes an expanded suite of spatial datasets with even greater cleanup and harmonization than in previous versions. If your work involves environmental monitoring of any kind, this package may be of use. Here is the description:

A suite of conversion functions to create internally standardized spatial polygons dataframes. Utility functions use these data sets to return values such as country, state, timezone, watershed, etc. associated with a set of longitude/latitude pairs. (They also make cool maps.)

In this post we discuss the reasons for creating this package and describe its main features.

At Mazama Science we often work with data that is geo-located:

  • biological and chemical samples from streams
  • seismic sensor data
  • pollution monitoring data
  • output from gridded atmospheric models
  • forest management and geomorphology data
  • national and state demographic and economic data

Using the sp and sf and raster packages, among others, all of these types of data can be plotted on maps. (For an introduction to spatial data in R see rspatial.org.)

When working with geo-located environmental time series data, one of the important tasks is supplementing the ‘spatial metadata’ associated with each location. Data from monitoring devices will invariably contain a longitude, latitude and device identifier but sometimes not much else. Additional spatial metadata can allow us to ask more detailed, more interesting questions of our collection of time series. To understand diurnal patterns we need the local timezone at each location. To create county-wide averages, we must know the state and county. To pursue environmental justice issues, we may need census tract information.

The long term goal of the MazamaSpatialUtils package is to make it easier for us to work with GIS shapefile and geodatabase data we discover on the web as we create a library of interesting spatial datasets for our work in R. The package addresses the following specific issues:

  1. creating a scalable system for working with spatial data
  2. creating simplified versions of large spatial datasets
  3. cleaning polygon topologies
  4. standardizing identifiers in spatial data
  5. quickly finding spatial information based on a set of locations

Creating a scalable system

Shapefiles with high resolution features are by nature quite large. Working with the World Timezones dataset we see that the largest single timezone polygon, ‘Europe/Berlin’, takes up 2.47 Mb of RAM because of the highly detailed, every-curve-in-the-river border delineation between Germany and her neighbors.

Spatial datasets can be large and their conversion from shapefile to SpatialPolygonsDataFrame can be time consuming. In addition, there is little uniformity to the dataframe data found in these datasets. The MazamaSpatialUtils package addresses these issues in multiple ways:

  1. It provides a package state variable called SpatialDataDir which is used internally as the location for all spatial datasets.
  2. It defines a systematic process for converting spatial data into SpatialPolygonsDataFrames objects with standardized data columns.
  3. A suite of useful spatial data has been pre-processed with topology correction to result in full resolution and multiple, increasingly simplified versions of each dataset.

Spatial data directory

Users will want to maintain a directory where their .rda versions of spatial data reside. The package provides a setSpatialDataDir() function which sets a package state variable storing the location. Internally, getSpatialDataDir() is used whenever data need to be accessed. (Hat tip to Hadley Wickham’s description of Environments and package state.)
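A minimal sketch of how that looks in practice (the directory path below is just an example):

library(MazamaSpatialUtils)

# Tell the package where converted .rda spatial datasets live
setSpatialDataDir("~/Data/Spatial")

# Functions that need data look this up internally via getSpatialDataDir()
getSpatialDataDir()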

Standardized data

The package comes with several convert~() functions that download, convert, standardize and clean spatial datasets available on the web. Version 0.7 of the package has 21 such scripts that walk through the same basic steps with minor differences depending on the needs of the source data. With these as examples, users should be able to create their own convert~() functions to process other spatial data. Once converted and normalized, each dataset will benefit from other package utility functions that depend on the consistent availability and naming of certain columns in the @data slot of each SpatialPolygonsDataFrame.
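For readers who want to write their own converter, here is a rough, hedged skeleton of the general shape such a convert~() function might take. The column names follow the standards described later in this post, but the function name, download URL, field values and cleanup steps are placeholders rather than code from the package:

# Hypothetical skeleton of a user-written converter (not from the package)
convertMyRegions <- function(nameOnly = FALSE) {
  datasetName <- "MyRegions"
  if (nameOnly) return(datasetName)

  dataDir <- getSpatialDataDir()

  # 1) Download and read the source geojson/shapefile (placeholder URL)
  url <- "https://example.org/my_regions.geojson"
  SFDF <- sf::read_sf(url)

  # 2) Convert to SpatialPolygonsDataFrame and standardize identifiers
  SPDF <- sf::as_Spatial(SFDF)
  SPDF@data$countryCode <- "US"                       # ISO 3166-1 alpha-2
  SPDF@data$polygonID <- as.character(seq_len(nrow(SPDF@data)))

  # 3) Save the full-resolution .rda into the spatial data directory
  assign(datasetName, SPDF)
  save(list = datasetName,
       file = file.path(dataDir, paste0(datasetName, ".rda")))

  invisible(datasetName)
}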

Cleaned and Simplified files

Part of the process of data conversion involves using the cleangeo package to fix any topologies that are found to be invalid. In our experience, this is especially important if you end up working with the data in the sf or raster packages. The final step in the data processing is the creation of simplified datasets using the rmapshaper package. If you work with GIS data and are unfamiliar with mapshaper, you should go to mapshaper.org and try it out. It’s astonishing how well this javascript package performs in real-time.

Simplified datasets are important because they dramatically speed up both spatial searches and the creation of plots when fast is more important than hyper-accurate. The conversion of the WorldTimezone dataset used above generates a .rda file at full resolution as well as additional versions with 5%, 2% and 1% as many vertices. In the plot above, the 5% version was used to create the last three plots, where high resolution squiggles would never be seen. File sizes for the WorldTimezone .rda files are 67 M, 3.4 M, 1.4 M and 717 K respectively.

Normalizing identifiers

The great thing about working with spatial data stored as a shapefile or geodatabase is that these are the defacto standard formats for spatial data. We LOVE standards! Many shapefiles, but not all, also use the ISO 3166-1 alpha-2 character encoding for identifying countries. However, there seems to be no agreement at all about what to call this encoding. We have seen  ‘ISO’, ‘ISO2’, ‘country’, ‘CC’ and many more. The ISOcodes package calls this column of identifiers ‘Alpha_2’ which is not particularly descriptive outside the context of ISO codes. From here on out, we will call this column the countryCode.

Of course there are many spatial datasets that do not include a column with the countryCode. Sometimes it is because they use FIPS or ISO 3166-1 alpha-3 or some (non-standardized) version of the plain English name. Other times it is because the data are part of a national dataset and the country is assumed.

Wouldn’t it be nice if every spatial dataset you worked with was guaranteed to have a column named countryCode with the ISO 3166-1 alpha-2 encoding? We certainly think so!

The heart of spatial data standardization in this package is the conversion of various spatial datasets into SpatialPolygonsDataFrames files with guaranteed and uniformly named identifiers. The package internal standards are very simple:

1) Every spatial dataset must contain the following data columns:

  • polygonID – unique identifier for each polygon
  • countryCode – country at centroid of polygon (ISO 3166-1 alpha-2)

2) Spatial datasets with timezone data must contain the following column:

  • timezone – Olson timezone

3) Spatial datasets at scales smaller than the nation-state should contain the following column:

  • stateCode – ‘state’ at centroid of polygon (ISO 3166-2 alpha-2)

If other columns contain these data, those columns must be renamed or duplicated with the internally standardized name. This simple level of consistency makes it possible to generate maps for any data that is ISO encoded. It also makes it possible to create functions that return the country, state or timezone associated with a set of locations.

Searching for ‘spatial metadata’

The MazamaSpatialUtils package began as an attempt to create an off-line answer to the following question: “How can we determine the timezones associated with a set of locations?”

We arrived at that question because we often work with pollution monitoring data collected by sensors around the United States. Data are collected hourly and aggregated into a single multi-day dataset with a shared UTC time axis. So far so good. Not surprisingly, pollution levels show a strong diurnal signal so it is useful to identify measurements as being either during the daytime or nighttime. Luckily, the maptools package has a suite of ‘sun-methods’ for calculating the local sunrise and sunset if you provide a longitude, latitude and POSIXct object with the proper timezone.

Determining the timezone associated with a location is an inherently spatial question and can be addressed with a point-in-polygon query as enabled by the sp package. Once we enabled this functionality with a timezone dataset we realized that we could extract more spatial metadata for our monitoring stations from other spatial datasets: country, state, watershed, legislative district, etc. etc.
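Putting those two pieces together, a small sketch of that original motivating workflow might look like the following; the maptools call and its arguments are written from memory and should be treated as an assumption rather than code from this post, and the example location is arbitrary:

library(MazamaSpatialUtils)
library(maptools)

lon <- -122.33 ; lat <- 47.61          # example location (Seattle)
tz <- getTimezone(lon, lat)            # e.g. "America/Los_Angeles"

# Local sunrise for a given day, using the timezone we just looked up
pt <- sp::SpatialPoints(cbind(lon, lat),
                        proj4string = sp::CRS("+proj=longlat +datum=WGS84"))
day <- as.POSIXct("2020-12-01", tz = tz)
maptools::sunriset(pt, day, direction = "sunrise", POSIXct.out = TRUE)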

get~ functions

The package comes with several ‘get’ functions that rely on the consistency of datasets to provide a uniform interface. Current functionality includes the following functions that are all called in the same way. Any ~ below means there are two versions of this function, one each to return the Code or Name:

  • getCountry~(longitude, latitude, ...) – returns names, ISO codes and other country-level data
  • getState~(longitude, latitude, ...) – returns names, ISO codes and other state-level data
  • getUSCounty(longitude, latitude, ...) – returns names and other county-level data
  • getTimezone(longitude, latitude, ...) – returns Olson timezones and other data
  • getHUC~(longitude, latitude, ...) – returns USGS Hydrologic Unit Codes and other data
  • getSpatialData(longitude, latitude, ...) – returns all data
  • getVariable(longitude, latitude, ...) – returns a single variable

Simple search

Here is an example demonstrating a search for Olson timezone identifiers:

library(MazamaSpatialUtils)

# Vector of lons and lats
lons <- seq(-120, -80, 2)
lats <- seq(20, 60, 2)

# Get Olson timezone names
timezones <- getTimezone(lons, lats)
print(timezones)

 [1] NA                   NA                   NA                  
 [4] NA                   NA                   "America/Hermosillo"
 [7] "America/Denver"     "America/Denver"     "America/Denver"    
[10] "America/Denver"     "America/Chicago"    "America/Chicago"   
[13] "America/Chicago"    "America/Chicago"    "America/Chicago"   
[16] "America/Nipigon"    "America/Nipigon"    "America/Nipigon"   
[19] "America/Iqaluit"    "America/Iqaluit"    "America/Iqaluit"  

Additional data

Additional information is available by specifying allData = TRUE:

# Additional fields
names(SimpleTimezones)

[1] "timezone"       "UTC_offset"     "UTC_DST_offset" "countryCode"   
[5] "longitude"      "latitude"       "status"         "notes"         
[9] "polygonID"     

getTimezone(lons, lats, allData = TRUE) %>%
  dplyr::select(timezone, countryCode, UTC_offset)

             timezone countryCode UTC_offset
1                                         NA
2                                         NA
3                                         NA
4                                         NA
5                                         NA
6  America/Hermosillo          MX         -7
7      America/Denver          US         -7
8      America/Denver          US         -7
9      America/Denver          US         -7
10     America/Denver          US         -7
11    America/Chicago          US         -6
12    America/Chicago          US         -6
13    America/Chicago          US         -6
14    America/Chicago          US         -6
15    America/Chicago          US         -6
16    America/Nipigon          CA         -5
17    America/Nipigon          CA         -5
18    America/Nipigon          CA         -5
19    America/Iqaluit          CA         -5
20    America/Iqaluit          CA         -5
21    America/Iqaluit          CA         -5

Subset by country

Because every dataset is guaranteed to have a countryCode variable, we can use this for subsetting.

# Canada only
subset(SimpleTimezones, countryCode == 'CA') %>%
  dplyr::select(timezone, UTC_offset)

                 timezone UTC_offset
71       America/Atikokan       -5.0
77   America/Blanc-Sablon       -4.0
81  America/Cambridge_Bay       -7.0
90        America/Creston       -7.0
94         America/Dawson       -7.0
95   America/Dawson_Creek       -7.0
99       America/Edmonton       -7.0
102   America/Fort_Nelson       -7.0
104     America/Glace_Bay       -4.0
105     America/Goose_Bay       -4.0
112       America/Halifax       -4.0
123        America/Inuvik       -7.0
124       America/Iqaluit       -5.0
146       America/Moncton       -4.0
152       America/Nipigon       -5.0
161   America/Pangnirtung       -5.0
169   America/Rainy_River       -6.0
170  America/Rankin_Inlet       -6.0
172        America/Regina       -6.0
173      America/Resolute       -6.0
182      America/St_Johns       -3.5
187 America/Swift_Current       -6.0
190   America/Thunder_Bay       -5.0
192       America/Toronto       -5.0
194     America/Vancouver       -8.0
195    America/Whitehorse       -7.0
196      America/Winnipeg       -6.0
198   America/Yellowknife       -7.0

Optimized searches

One important feature of the package is the ability to optimize spatial searches by balancing speed and accuracy. By default, the getTimezone() function uses the WorldTimezones_02 dataset to return results quickly. But, if you are very concerned about getting the right timezone on either side of the Roode Beek/Rothenbach border between the Netherlands and Germany, then you will want to use the full resolution dataset. Luckily, the function signature for getTimezone() and the other ‘get’ functions includes a dataset parameter:

getTimezone(
  longitude,
  latitude,
  dataset = "SimpleTimezones",
  countryCodes = NULL,
  allData = FALSE,
  useBuffering = FALSE
)

By specifying dataset = "WorldTimezones" you can perform hyper-accurate (and hyper-slow) searches.

Buffering

For timezone and country searches, we have chosen default datasets that merge detailed borders on land and smoothed, all-encompassing borders off-shore. This avoids issues with peninsulas and islands disappearing with low-resolution datasets. But many datasets attempt to follow coastlines quite closely and lose some of the finer details. When working with these datasets it is useful to specify useBuffering = TRUE. This will make an initial pass at finding the polygon underneath each point location. Any locations that remain unassociated after the first pass will be expanded into a small circle and another pass will be performed looking for the overlap of these circles with the spatial polygons. This process is repeated with increasing radii up to 200 km.
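Based on the signature shown above, switching datasets and turning on buffering is just a matter of passing those arguments. A hedged sketch follows; it assumes the spatial data directory is set and that the full-resolution WorldTimezones dataset has been installed and loaded with the package's install/load helpers (installSpatialData() is described in the next section, and loadSpatialData() is, as I recall, the companion loader):

# Hyper-accurate timezone lookup with buffering for near-coastal points
setSpatialDataDir("~/Data/Spatial")
loadSpatialData("WorldTimezones")

getTimezone(lons, lats,
            dataset = "WorldTimezones",
            useBuffering = TRUE)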

Available data

Pre-processed datasets can be viewed and installed locally with the installSpatialData() function. Currently available data include:

  • CA_AirBasins – California regional air basin boundaries
  • EEZCountries – Country boundaries including Exclusive Economic Zones
  • EPARegions – US EPA region boundaries
  • GACC – Geographic Area Coordination Center (GACC) boundaries
  • GADM – GADM administrative area boundaries
  • HIFLDFederalLands – US federal lands
  • HMSSmoke – NOAA Hazard Mapping System Smoke (HMSS) areas
  • HouseLegislativeDistricts – US state legislative districts, by chamber
  • MTBSBurnAreas – MTBS burn areas from 1984 – 2017
  • NaturalEarthAdm1 – State/province/oblast level boundaries
  • NWSFireZones – NWS fire weather forecast zones
  • OSMTimezones – OpenStreetMap time zones
  • PHDs – Public Health Districts for Washington, Oregon, Idaho, and California
  • SimpleCountries – Simplified version of the TMWorldBorders
  • SimpleCountriesEEZ – Simplified version of EEZCountries
  • SimpleTimezones – Simplified version of WorldTimezones
  • TerrestrialEcoregions – Terrestrial eco-regions
  • TMWorldBorders – Country level boundaries
  • USCensus116thCongress – US congressional districts
  • USCensusCBSA – US Core Based Statistical Areas
  • USCensusCounties – US county level boundaries
  • USCensusStates – US state level boundaries
  • USCensusUrbanAreas – US urban areas
  • USFSRangerDistricts – US Forest Service ranger districts
  • USIndianLands – US tribal boundaries
  • WBDHU2 – Watershed boundary level-2 hydrologic units
  • WBDHU4 – Watershed boundary level-4 hydrologic units
  • WBDHU6 – Watershed boundary level-6 hydrologic units
  • WBDHU8 – Watershed boundary level-8 hydrologic units
  • weatherZones – NWS public weather forecast zones
  • WorldEEZ – Country boundaries including Exclusive Economic Zones over water
  • WorldTimezones – Timezone

We encourage interested parties to contribute convert~() functions for their own favorite spatial datasets. If they produce SpatialPolygonsDataFrames that adhere to the package standards, we’ll include them in the next release.

Happy Mapping!


To leave a comment for the author, please follow the link and comment on their blog: R – Blog – Mazama Science .


The post MazamaSpatialUtils R package first appeared on R-bloggers.

Advent of 2020, Day 4 – Creating your first Azure Databricks cluster


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Series of Azure Databricks posts:

On day 4 we have come far enough to explore how to create an Azure Databricks cluster. We have already learned that a cluster is backed by Azure VMs, created in the background to give compute power, storage and scalability to the Azure Databricks platform.

On the vertical navigation bar, select Clusters in order to get to the Clusters subpage.

This page will give you the list of existing clusters:

  • name of the cluster
  • Status (Running, Terminated, deleted, etc.)
  • Nodes
  • Runtime (Spark version installed on the VM)
  • Driver type (Type of computer used for running this cluster)
  • Worker (type of VM, e.g.: 4 Cores, 0.90 DBU, etc.)
  • Creator
  • Actions (by hovering over, you will receive additional information)

By clicking on an existing cluster, you will see the following information, some of which you can configure (not all, as some fields are grayed out, as seen in the screenshot). You can also attach the cluster to notebooks, install additional packages, and access the Spark UI, Driver Logs and Metrics for easier troubleshooting.

But when creating a new Azure Databricks cluster, you will get almost all attributes available to define, in order to create a cluster tailored to your needs.

You will need to provide the following information for creating a new cluster:

  1. Cluster Name – go creative, but still stick to naming convention and give a name that will also include the Worker Type, Databricks Runtime, Cluster Mode, Pool, Azure Resource Group, Project name (or task you are working on) and environment type (DEV, TEST, UAT, PROD). The more you have, the better
  2. Cluster Mode – Azure Databricks supports three types of clusters: Standard, High Concurrency and Single Node. Standard is the default selection and is primarily used for single-user environments, and supports any workload using languages such as Python, R, Scala, Spark or SQL. High Concurrency mode is designed to handle workloads for many users and is a managed cloud resource. Its main benefit is that it provides an Apache Spark native environment for maximum resource sharing and utilisation and provides minimum query latencies. It supports languages such as Python, R, Spark and SQL, but does not support Scala, because Scala does not support running user code in separate processes. This cluster mode also supports TAC – table access control – for a finer, more granular level of access security, granting more detailed permissions on SQL tables. Single Node gives no workers and runs Spark jobs on the driver node. What does this mean in simple English: work will not be distributed among workers, resulting in poorer performance.
  3. Pool – as of writing this post, this feature is still in public preview. It creates a pool of clusters (so you need more predefined clusters) for better response and up-times. A pool keeps a defined number of instances in ready (idle) mode to reduce the cluster start time. A cluster needs to be attached to the pool (after creation of the cluster, or, if you already have a pool, it will automatically be available) in order to have its driver and worker nodes allocated from the pool.
  4. Databricks runtime version – is an image of the Databricks version that will be created on every cluster. Images are designed for particular types of jobs (Genomics, Machine Learning, Standard workloads) and for different versions of Spark or Databricks. When selecting the right image, pay attention to the abbreviations and versions. Each image will have a version of Scala/Spark, and there are some significant differences: general images will have up to 6 months of bug fixes and 12 months of Databricks support, and if the image is marked LTS (Long Term Support) this period extends to 24 months of support. In addition, the ML abbreviation stands for Machine Learning, bringing additional packages for machine learning tasks to the image (these can also be added to a general image, but the out-of-the-box solution will be better). And GPU denotes software optimized for GPU tasks.

5. Worker and driver type gives you the option to select the VM that will suit your needs. For first-timers, keep the default worker and driver type selected; later you can explore and change the DBU (Databricks Units) for higher performance. There are three types of workloads to understand – All-purpose, Job Compute and Light-job Compute – and many more instance types: General, Memory Optimized, Storage Optimized, Compute Optimized and GPU Optimized. All come with different pricing plans and sets of tiers and regions.

All workers will have a minimum and maximum number of nodes available. The more you want to scale out, the more workers you should give your cluster. The DBU cost will change as more workers are added.

6. Autoscaling – is the tick option that gives you the capability to scale automatically between the minimum and maximum number of nodes (workers) based on the workload.

7. Termination – is the timeout in minutes; when there is no work after the given period, the cluster will terminate. Expect different behaviour when the cluster is attached to a pool.

Also explore the advanced options, where additional Spark configuration and runtime variables can be set; this is very useful when fine-tuning the behaviour of the cluster at startup. Add also Tags (as key-value pairs) to keep additional metadata on your cluster. You can also give an init script, stored on DBFS, that can initiate a job or load some data or models at start time.

Once you have selected the cluster options suited for your needs, you are ready to hit that “Create cluster” button.
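If you prefer to script cluster creation instead of clicking through the UI, the same settings map onto the Databricks Clusters REST API. The sketch below is an assumption based on the publicly documented API (the endpoint, field names and example values are not from this post), called from R with httr:

library(httr)

# Placeholder workspace URL and token; both are assumptions for illustration
workspace_url <- "https://adb-1234567890123456.7.azuredatabricks.net"
token <- Sys.getenv("DATABRICKS_TOKEN")

cluster_spec <- list(
  cluster_name = "dev-rg-databricks-project-standard",
  spark_version = "7.3.x-scala2.12",          # a Databricks runtime image
  node_type_id = "Standard_DS3_v2",           # worker/driver VM type
  autoscale = list(min_workers = 2, max_workers = 8),
  autotermination_minutes = 30,
  custom_tags = list(environment = "DEV", project = "advent2020")
)

resp <- httr::POST(
  url = paste0(workspace_url, "/api/2.0/clusters/create"),
  httr::add_headers(Authorization = paste("Bearer", token)),
  body = cluster_spec,
  encode = "json"
)
httr::content(resp)   # returns the new cluster_id on success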

Tomorrow we will cover basics on architecture of clusters, workers, DBFS storage and how Spark handles jobs.

Complete set of code and Notebooks will be available at the Github repository.

Happy Coding and Stay Healthy!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


The post Advent of 2020, Day 4 – Creating your first Azure Databricks cluster first appeared on R-bloggers.

2020 Earl Conference Insights


[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This year the EARL conference was held virtually due to the restrictions imposed by COVID-19. Although this removed the valuable networking element of the conference, the ‘VirtuEARL’ virtual approach meant we reached a geographically wider audience and ensured a successful conference. Thought leaders from academia and industry logged in to discover how R can be used in business, and over 300 data science professionals convened to join workshops or hear presenters share their novel and interesting applications of R. The flexibility of scheduling allowed talks to be picked according to personal or team interests.

The conference kicked off with workshops delivered by Mango data scientists and guest presenters, Max Kuhn of RStudio and Colin Fay from ThinkR, with topics including data visualisation, text analysis and modelling. The presentation day both began and finished with keynote presentations: Annarita Roscino from Zurich spoke about her journey from data practitioner to data & analytics leader – sharing key insights from her role as a Head of Predictive Analytics, and Max Kuhn from RStudio used his keynote to introduce tidymodels – a collection of packages for modelling and machine learning using tidyverse principles.

Between these great keynotes, EARL offered a further 11 presentations from across a range of industry sectors and topics. A snapshot of these shows just some of the ways that R is being used commercially: Eryk Walczak from the Bank of England revealed his use of text analysis in R to study financial regulations, Joe Fallon and Gavin Thompson from HMRC presented on their impressive work behind the Self Employment Income Support Scheme launched by the Government in response to the Covid-19 outbreak, Dr. Lisa Clarke from Virgin Media gave an insightful and inspiring talk on how to maximize an analytics team’s productivity, whilst Dave Goody, lead data scientist from the Department of Education, presented on using R shiny apps at scale across a team of 100 to drive operational decision making.

Long time EARL friend and aficionado, Jeremy Horne of DataCove, demonstrated how to build an engaging marketing campaign using R, and Dr Adriana De Palma from the Natural History Museum showed her use of R to predict biodiversity loss.

Charity donation (awaiting confirmation)

Due to the reduced overheads of delivering the conference remotely this year, the Mango team decided to donate the profits of the 2020 EARL conference to Data for Black Lives. This is a great non-profit organization dedicated to using data science to create concrete and measurable improvements to the lives of Black people. They aim to use data science to fight bias, promote civic engagement and build progressive movements. The pledge has been made and details of the donation will be announced shortly.

Whilst EARL 2020 was our first such virtual event, the conference was highly successful. Attendees described it as an “unintimidating and friendly conference,” with “high quality presentations from experts in their respective fields” and were delighted to see how R and data science in general were being used commercially. One attendee best described the conference: “EARL goes beyond introducing new packages and educates attendees on how R is being used around the world to make difficult decisions”.

If you’d like to learn more about EARL 2020 or see the conference presentations in full, click here.

The post 2020 Earl Conference Insights appeared first on Mango Solutions.


To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.


The post 2020 Earl Conference Insights first appeared on R-bloggers.

Top 5 Best Articles on R for Business [November 2020]


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R Tutorials Update

Interested in more R for Business articles?

👉Register for our blog to get the Top Articles every month.

No. 1: Forecasting Time Series ARIMA Models [Video Tutorial] (Time Series)

Making multiple ARIMA time series models in R used to be difficult. But with the purrr nest() function and modeltime, forecasting has never been easier. Learn how to make many ARIMA models in this tutorial. This article was written by Business Science's very own Matt Dancho.
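
For a flavour of the pattern the tutorial describes, a minimal sketch might look like this (the m4_monthly sample data from timetk and its id/date/value columns are assumptions of mine, not code from the article):

# A minimal sketch of the nest-and-map pattern described above:
# one auto-ARIMA per group via purrr::nest() and modeltime.
library(dplyr)
library(tidyr)
library(purrr)
library(parsnip)
library(modeltime)
library(timetk)

nested_fits <- m4_monthly %>%
  group_by(id) %>%
  nest() %>%                                  # one nested data frame per series
  mutate(fit = map(data, ~ arima_reg() %>%
                     set_engine("auto_arima") %>%
                     fit(value ~ date, data = .x)))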

No. 2: Python and R – Part 2: Visualizing Data with Plotnine (Python and R)

In this post, we load our cleaned-up Big MT Cars data set in order to be able to refer directly to the variables without a short code or the f function from our datatable. This article was written by David Lucey of Redwall Partners.

No. 3: Detect Relationships With Linear Regression (Tidyverse Functions)

Detect relationships with Group Split and Map, my data science SECRET TOOLS. Combining them will help us scale up to 15 linear regression summaries to assess relationship strength and combine them in a gt table. Written by Matt Dancho.
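
As a rough illustration of that group_split()-and-map() pattern (my own toy example on mtcars, not code from the article):

# One linear model per group, then one row of fit statistics per model.
library(dplyr)
library(purrr)
library(broom)

mtcars %>%
  group_split(cyl) %>%                      # one data frame per group
  map(~ lm(mpg ~ wt, data = .x)) %>%        # one regression per group
  map_dfr(glance)                           # combine the model summaries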

No. 4: Time Series Demand Forecasting (Time Series)

Learn how to forecast Brazilian commodities demand using R. Demand forecasting is a technique for estimating the probable demand for a product or service, based on the analysis of past demand for that product or service in the present market conditions. Written by Luciano Oliveira Batista.

No. 5: A/B Testing with Machine Learning – A Step-by-Step Tutorial (Throwback)

Here’s our throwback article of the month. With the rise of digital marketing led by tools including Google Analytics, Google Adwords, and Facebook Ads, a key competitive advantage for businesses is using A/B testing to determine effects of digital marketing efforts. Learn to use A/B testing to determine whether changes in landing pages, popup forms, article titles, and other digital marketing decisions improve conversion rates. Written by Matt Dancho.

Finding it Difficult to Learn R for Business?

Finding it difficult to learn R for Business? Here’s the solution: My NEW 5-Course R-Track System that will take you from beginner to expert in months, not years. Learn what has taken me 10+ years to learn, the tools that connect data science with the business, and accelerate your career in the process. Here’s what’s included:

  • High-Performance Time Series Forecasting
  • Shiny Developer with AWS
  • Shiny Dashboards
  • Advanced Machine Learning and Business Consulting
  • Data Science Foundations


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Top 5 Best Articles on R for Business [November 2020] first appeared on R-bloggers.

Happy Anniversary Practical Data Science with R 2nd Edition!


[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Our book, Practical Data Science with R, 2nd Edition, just had its first anniversary!

The book is doing great; if you are working with R and data, I recommend you check it out.

(link)


To leave a comment for the author, please follow the link and comment on their blog: R – Win Vector LLC.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Happy Anniversary Practical Data Science with R 2nd Edition! first appeared on R-bloggers.

Accounting for the experimental design in linear/nonlinear regression analyses


[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I am going to talk about an issue that is often overlooked by agronomists and biologists. The point is that field experiments are very often laid down in blocks, using split-plot designs, strip-plot designs or other types of designs with grouping factors (blocks, main-plots, sub-plots). We know that these grouping factors should be appropriately accounted for in data analyses: ‘analyze them as you have randomized them’ is a common saying attributed to Ronald Fisher. Indeed, observations in the same group are correlated, as they are more alike than observations in different groups. What happens if we neglect the grouping factors? We break the independence assumption and our inferences are invalid (Onofri et al., 2010).

In my experience, field scientists are totally aware of this issue when they deal with ANOVA-type models (e.g., see Jensen et al., 2018). However, a brief survey of the literature shows that there is not the same awareness when field scientists deal with linear/nonlinear regression models. Therefore, I decided to sit down and write this post, in the hope that it may be useful to obtain more reliable data analyses.

An example with linear regression

Let's take a look at the 'yieldDensityB.csv' dataset, which is available on GitHub. It represents an experiment where sunflower was tested with increasing weed densities (0, 14, 19, 28, 32, 38, 54, 82 plants per \(m^2\)), on a randomised complete block design with 10 blocks. A swift plot shows that yield is linearly related to weed density, which calls for linear regression analysis.

rm(list=ls())
library(nlme)
library(lattice)
dataset <- read.csv("https://raw.githubusercontent.com/OnofriAndreaPG/agroBioData/master/yieldDensityB.csv", header=T)
dataset$block <- factor(dataset$block)
head(dataset)
##   block density yield
## 1     1       0 29.90
## 2     2       0 34.23
## 3     3       0 37.12
## 4     4       0 26.37
## 5     5       0 34.48
## 6     6       0 33.70
plot(yield ~ density, data = dataset)

We might be tempted to neglect the block effect and run a linear regression analysis of yield against density. This is clearly wrong (we would be violating the independence assumption) and inefficient, as any block-to-block variability goes into the residual error term, which is, therefore, inflated.
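
For reference, the naive fit being warned against here (not shown in the original post) would simply be:

# The naive fit: block-to-block variability is left in the residuals,
# so the residual error (and hence the standard errors) is inflated.
mod.naive <- lm(yield ~ density, data = dataset)
summary(mod.naive)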

Some of my colleagues would take the means for densities and use those to fit a linear regression model (a two-step analysis). By doing so, block-to-block variability is cancelled out and the analysis becomes more efficient. However, such a solution is not general, as it is not feasible, e.g., when we have unbalanced designs and heteroscedastic data. With the appropriate approach, sound analyses can also be made in two steps (Damesa et al., 2017). From my point of view, it is reasonable to search for more general solutions to deal with one-step analyses.
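
A bare-bones version of that two-step analysis (my sketch, sensible only for balanced, homoscedastic data) would be:

# Two-step analysis: average over blocks first, then regress the density means.
means <- aggregate(yield ~ density, data = dataset, FUN = mean)
mod.twostep <- lm(yield ~ density, data = means)
summary(mod.twostep)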

Based on our experience with traditional ANOVA models, we might think of taking the block effect as fixed and fitting it as an additive term. See the code below.

mod.reg <- lm(yield ~ block + density, data=dataset)
summary(mod.reg)
## 
## Call:
## lm(formula = yield ~ block + density, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6062 -0.8242 -0.3315  0.7505  4.6244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.10462    0.57750  50.397  < 2e-16 ***
## block2       4.57750    0.74668   6.130 4.81e-08 ***
## block3       7.05875    0.74668   9.453 4.49e-14 ***
## block4      -3.98000    0.74668  -5.330 1.17e-06 ***
## block5       6.17625    0.74668   8.272 6.37e-12 ***
## block6       5.92750    0.74668   7.938 2.59e-11 ***
## block7       1.23750    0.74668   1.657  0.10199    
## block8       1.25500    0.74668   1.681  0.09733 .  
## block9       2.34875    0.74668   3.146  0.00245 ** 
## block10      2.25125    0.74668   3.015  0.00359 ** 
## density     -0.26744    0.00701 -38.149  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.493 on 69 degrees of freedom
## Multiple R-squared:  0.9635, Adjusted R-squared:  0.9582 
## F-statistic: 181.9 on 10 and 69 DF,  p-value: < 2.2e-16

With regression, this solution is not convincing. Indeed, the above model assumes that the blocks produce an effect only on the intercept of the regression line, while the slope is unaffected. Is this a reasonable assumption? I vote no.

Let’s check this by fitting a different regression model per block (ten different slopes + ten different intercepts):

mod.reg2 <- lm(yield ~ block/density + block, data=dataset)
anova(mod.reg, mod.reg2)
## Analysis of Variance Table
## 
## Model 1: yield ~ block + density
## Model 2: yield ~ block/density + block
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     69 153.88                              
## 2     60 115.75  9    38.135 2.1965 0.03465 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value confirms that the block had a significant effect both on the intercept and on the slope. To describe such an effect we need 20 parameters in the model, which is not very parsimonious. And above all: which regression line do we use for predictions? Taking the block effect as fixed is clearly sub-optimal with regression models.

The question is: can we fit a simpler and clearer model? The answer is: yes. Why don’t we take the block effect as random? This is perfectly reasonable. Let’s do it.

modMix.1 <- lme(yield ~ density, random = ~ density|block, data=dataset)summary(modMix.1)## Linear mixed-effects model fit by REML##  Data: dataset ##        AIC      BIC    logLik##   340.9166 355.0569 -164.4583## ## Random effects:##  Formula: ~density | block##  Structure: General positive-definite, Log-Cholesky parametrization##             StdDev     Corr  ## (Intercept) 3.16871858 (Intr)## density     0.02255249 0.09  ## Residual    1.38891957       ## ## Fixed effects: yield ~ density ##                Value Std.Error DF   t-value p-value## (Intercept) 31.78987 1.0370844 69  30.65311       0## density     -0.26744 0.0096629 69 -27.67704       0##  Correlation: ##         (Intr)## density -0.078## ## Standardized Within-Group Residuals:##        Min         Q1        Med         Q3        Max ## -1.9923722 -0.5657555 -0.1997103  0.4961675  2.6699060 ## ## Number of Observations: 80## Number of Groups: 10

The above fit shows that the random effects (slope and intercept) are slightly correlated (r = 0.091). We might like to try a simpler model, where random effects are independent. To do so, we need to consider that the above model is equivalent to the following model:

modMix.1 <- lme(yield ~ density, random = list(block = pdSymm(~density)), data=dataset)

It’s just two different ways to code the very same model. However, this latter coding, based on a ‘pdMat’ structure, can be easily modified to remove the correlation. Indeed, ‘pdSymm’ specifies a totally unstructured variance-covariance matrix for random effects and it can be replaced by ‘pdDiag’, which specifies a diagonal matrix, where covariances (off-diagonal terms) are constrained to 0. The coding is as follows:

modMix.2 <- lme(yield ~ density, random = list(block = pdDiag(~density)), data=dataset)summary(modMix.2)## Linear mixed-effects model fit by REML##  Data: dataset ##       AIC      BIC   logLik##   338.952 350.7355 -164.476## ## Random effects:##  Formula: ~density | block##  Structure: Diagonal##         (Intercept)    density Residual## StdDev:    3.198267 0.02293222 1.387148## ## Fixed effects: yield ~ density ##                Value Std.Error DF   t-value p-value## (Intercept) 31.78987 1.0460282 69  30.39102       0## density     -0.26744 0.0097463 69 -27.44020       0##  Correlation: ##         (Intr)## density -0.139## ## Standardized Within-Group Residuals:##        Min         Q1        Med         Q3        Max ## -1.9991174 -0.5451478 -0.1970267  0.4925092  2.6700388 ## ## Number of Observations: 80## Number of Groups: 10anova(modMix.1, modMix.2)##          Model df      AIC      BIC    logLik   Test    L.Ratio p-value## modMix.1     1  6 340.9166 355.0569 -164.4583                          ## modMix.2     2  5 338.9520 350.7355 -164.4760 1 vs 2 0.03535079  0.8509

The model could be further simplified. For example, the code below shows how we could fit models with either random intercept or random slope.

# Model with only random intercept
modMix.3 <- lme(yield ~ density, random = list(block = ~1), data=dataset)
# Alternative
# random = ~ 1|block

# Model with only random slope
modMix.4 <- lme(yield ~ density, random = list(block = ~ density - 1), data=dataset)
# Alternative
# random = ~density - 1 | block

An example with nonlinear regression

The problem may become trickier if we have a nonlinear relationship. Let's have a look at another similar dataset ('YieldLossB.csv'), which is also available on GitHub. It represents another experiment where sunflower was grown with the same increasing densities of another weed (0, 14, 19, 28, 32, 38, 54, 82 plants per \(m^2\)), on a randomised complete block design with 8 blocks. In this case, the yield loss was recorded and analysed.

rm(list=ls())
dataset <- read.csv("https://raw.githubusercontent.com/OnofriAndreaPG/agroBioData/master/YieldLossB.csv", header=T)
dataset$block <- factor(dataset$block)
head(dataset)
##   block density yieldLoss
## 1     1       0     1.532
## 2     2       0    -0.661
## 3     3       0    -0.986
## 4     4       0    -0.697
## 5     5       0    -2.264
## 6     6       0    -1.623
plot(yieldLoss ~ density, data = dataset)

A swift plot shows that the relationship between density and yield loss is not linear. Literature references (Cousens, 1985) show that this could be modelled by using a rectangular hyperbola:

\[YL = \frac{i \, D}{1 + \frac{i \, D}{a}}\]

where \(YL\) is the yield loss, \(D\) is weed density, \(i\) is the slope at the origin of the axes and \(a\) is the maximum asymptotic yield loss. This function, together with self-starters, is available as the 'NLS.YL()' function in the 'aomisc' package, which is the accompanying package for this blog. If you do not have this package, please refer to this link to download it.
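
If you just want to see the curve itself, the equation above is easy to write down in plain R (a simple sketch; the 'NLS.YL()' self-starter is what the post actually uses, and the parameter values below are only indicative of the fit reported later):

# Plain-R version of the rectangular hyperbola above:
# i = slope at the origin, a = maximum asymptotic yield loss.
yl_hyperbola <- function(D, i, a) (i * D) / (1 + i * D / a)
curve(yl_hyperbola(x, i = 1.2, a = 68), from = 0, to = 80,
      xlab = "Weed density", ylab = "Yield loss")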

The problem is the very same as above: the block effect may produce random fluctuations for both model parameters. The only difference is that we need to use the 'nlme()' function instead of 'lme()'. With nonlinear mixed models, I strongly suggest you use a 'groupedData' object, which helps avoid several problems. The second line below shows how to turn a data frame into a 'groupedData' object.

library(aomisc)datasetG <- groupedData(yieldLoss ~ 1|block, dataset)nlin.mix <- nlme(yieldLoss ~ NLS.YL(density, i, A), data=datasetG,                         fixed = list(i ~ 1, A ~ 1),            random = i + A ~ 1|block)summary(nlin.mix)## Nonlinear mixed-effects model fit by maximum likelihood##   Model: yieldLoss ~ NLS.YL(density, i, A) ##  Data: datasetG ##        AIC      BIC    logLik##   474.8228 491.5478 -231.4114## ## Random effects:##  Formula: list(i ~ 1, A ~ 1)##  Level: block##  Structure: General positive-definite, Log-Cholesky parametrization##          StdDev    Corr ## i        0.1112839 i    ## A        4.0444538 0.195## Residual 1.4142272      ## ## Fixed effects: list(i ~ 1, A ~ 1) ##      Value Std.Error  DF  t-value p-value## i  1.23238 0.0382246 104 32.24038       0## A 68.52305 1.9449745 104 35.23082       0##  Correlation: ##   i     ## A -0.408## ## Standardized Within-Group Residuals:##        Min         Q1        Med         Q3        Max ## -2.4416770 -0.7049388 -0.1805690  0.3385458  2.8788981 ## ## Number of Observations: 120## Number of Groups: 15

Similarly to linear mixed models, the above coding implies correlated random effects (r = 0.194). Alternatively, the above model can be coded by using a 'pdMat' construct, as follows:

nlin.mix2 <- nlme(yieldLoss ~ NLS.YL(density, i, A), data=datasetG,                               fixed = list(i ~ 1, A ~ 1),                  random = pdSymm(list(i ~ 1, A ~ 1)))summary(nlin.mix2)## Nonlinear mixed-effects model fit by maximum likelihood##   Model: yieldLoss ~ NLS.YL(density, i, A) ##  Data: datasetG ##        AIC      BIC    logLik##   474.8225 491.5475 -231.4113## ## Random effects:##  Formula: list(i ~ 1, A ~ 1)##  Level: block##  Structure: General positive-definite##          StdDev    Corr ## i        0.1112839 i    ## A        4.0466971 0.194## Residual 1.4142009      ## ## Fixed effects: list(i ~ 1, A ~ 1) ##      Value Std.Error  DF  t-value p-value## i  1.23242  0.038225 104 32.24107       0## A 68.52068  1.945173 104 35.22600       0##  Correlation: ##   i     ## A -0.409## ## Standardized Within-Group Residuals:##        Min         Q1        Med         Q3        Max ## -2.4414051 -0.7049356 -0.1805322  0.3385275  2.8787362 ## ## Number of Observations: 120## Number of Groups: 15

Now we can try to simplify the model, for example by excluding the correlation between random effects.

nlin.mix3 <- nlme(yieldLoss ~ NLS.YL(density, i, A), data=datasetG,                               fixed = list(i ~ 1, A ~ 1),                  random = pdDiag(list(i ~ 1, A ~ 1)))summary(nlin.mix3)## Nonlinear mixed-effects model fit by maximum likelihood##   Model: yieldLoss ~ NLS.YL(density, i, A) ##  Data: datasetG ##        AIC      BIC    logLik##   472.9076 486.8451 -231.4538## ## Random effects:##  Formula: list(i ~ 1, A ~ 1)##  Level: block##  Structure: Diagonal##                 i        A Residual## StdDev: 0.1172791 4.389173 1.408963## ## Fixed effects: list(i ~ 1, A ~ 1) ##      Value Std.Error  DF  t-value p-value## i  1.23243 0.0393514 104 31.31852       0## A 68.57655 1.9905549 104 34.45097       0##  Correlation: ##   i     ## A -0.459## ## Standardized Within-Group Residuals:##        Min         Q1        Med         Q3        Max ## -2.3577291 -0.6849962 -0.1785860  0.3255925  2.8592764 ## ## Number of Observations: 120## Number of Groups: 15

With a little imagination, we can easily code several alternative models to represent alternative hypotheses about the observed data. Obviously, the very same method can be used (and SHOULD be used) to account for other grouping factors, such as main-plots in split-plot designs or plots in repeated-measures designs.

Happy coding!

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Borgo XX Giugno 74
I-06121 - PERUGIA


References

  1. Cousens, R., 1985. A simple model relating yield loss to weed density. Annals of Applied Biology 107, 239–252. https://doi.org/10.1111/j.1744-7348.1985.tb01567.x
  2. Jensen, S.M., Schaarschmidt, F., Onofri, A., Ritz, C., 2018. Experimental design matters for statistical analysis: how to handle blocking: Experimental design matters for statistical analysis. Pest Management Science 74, 523–534. https://doi.org/10.1002/ps.4773
  3. Onofri, A., Carbonell, E.A., Piepho, H.-P., Mortimer, A.M., Cousens, R.D., 2010. Current statistical issues in Weed Research. Weed Research 50, 5–24.

To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Accounting for the experimental design in linear/nonlinear regression analyses first appeared on R-bloggers.

Animated map of World War I UK ship positions by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The other day while looking for something else altogether, I stumbled across naval-history.net, a website aiming to preserve historical naval documents and make them more available. I’m not sure if it’s still being actively worked on; the creator and key driving force Gordon Smith passed away in late 2016.

The interesting collection of material includes transcriptions of 350,000 pages of logs from 314 Royal Navy (UK) ships of the World War 1 Era. From the naval-history.net website:

“The UK Meteorological Office and Naval-History.Net, under the guidance of the Zooniverse, worked with large numbers of online volunteers at Old Weather from 2010 to 2012 to transcribe historical weather data and naval events from the logbooks of the 314 Royal Navy ships of the World War 1-era that are presented here.”

Each of the 314 ships has a webpage with their transcribed log, a photo of the ship itself, links to relevant charts, and a map of the ship's daily locations. However, I couldn't find a visualisation of all 314 ships together, one that would give a sense of the sheer scale and complexity of British naval operations during the war. So I made one myself, in the form of an animation over time, which seems the natural way to represent this.

Here’s how that finished up:

(I suggest turning up the definition to 1080p, particularly if you zoom it up to full screen; I couldn't find a way to set the definition myself, as Google seems to have deliberately made resolution a choice for the end user or their software.)

A few historical reflections

World War I was the climax of the battleship era for naval warfare. The Battle of Jutland in 1916 was only the third ever – and last – full scale clash of steel battleships (the first two were a few years earlier in the Russo-Japanese war of 1904-1905). By the time of World War II, Germany did not have a large surface fleet, and the conflict with Japan was dominated by a new form of naval asset, the aircraft carrier.

In World War I, the UK and its allies dominated Germany's fleet on paper, if all the assets were matched against each other in an orderly fashion. However, the Royal Navy faced a requirement to assert itself globally to protect its country's maritime lifeline, while also threatened by the German High Seas Fleet, a 'fleet in being' within a day's steaming of the UK homeland. This situation led to difficulties in translating naval dominance into strategic outcomes. The UK struggled with limited success to leverage its power through blockade and (in one large, controversial campaign) movement of troops to open a new front; but if at any moment it lost its dominance, and hence the ability to control surface and submarine raiders preying on its commercial shipping, it would not be able to stay in the war.

This is what Winston Churchill (who was First Lord of the Admiralty when the war broke out) meant when he wrote later that British Admiral Sir John Jellicoe (Admiral of the Grand Fleet, responsible for keeping the German fleet in check) was "the only man on either side who could lose the war in an afternoon". Even a dramatic win in a meeting of the two battle fleets would not win the war for the UK, but a dramatic loss could lose it.

The UK strategy was to try to engineer a decisive confrontation in the North Sea on favourable terms as quickly as possible, to free up assets for protecting maritime trade from commerce raiders and submarines. Whereas the German strategy was to postpone such a confrontation and concentrate on throttling the UK’s mercantile marine; all the while leaving enough of a plausible threat in UK home waters to keep the UK Grand Fleet as large and anxious as possible.

I guess a motivation for a map like mine above is to try to give at least a taste of the scale of the Royal Navy's global coverage at the time. While the 314 ships for which I have data are only a fraction of the ships that actually saw service, it's enough to get a good global picture.

I did find a few things interesting in actually watching my map once it was finished. Of course, I’d expected to see a wide range of operations with focus points on the historical UK naval power locations of Scapa Flow, Gibraltar, Malta, Cape Town and Alexandria; but I hadn’t expected to see gunboats and other vessels operating far up-river in mainland China. Similarly, I was vaguely aware of expeditions to German colonies in what is now Samoa, Solomon Islands and Papua New Guinea but it was still a surprise to see the various (incomplete) coloured dots moving around in that area at different times.

One bit of value-add from me was to highlight the locations of key naval battles that did take place, including some of those by smaller groups of ships. The Battle of Coronel off the west coast of South America in 1914 and its sequel a few weeks later in the Falklands show up nicely, for example. In general, an annotation layer is important for turning a statistical graphic into a powerful communication tool, and never more so than in an animated map.

Making the map

All of that is by the by. How did I go about building this map? The R code to do so is in its own repository on GitHub. The code extracts below aren't self-sufficient; you'd need to clone the repository in full to make them work.

Webscraping

Getting hold of the data from the old static website was a reasonably straightforward webscraping job. Each ship gets its own page, and there is a single index page with links to all of the ships’ pages. The pages themselves are fairly tidy and probably have been generated by a database somewhere (for example, dates are all in the same format). This is a fairly common pattern in webscraping; it works well if you’re just re-creating data that is in a database somewhere, inaccessible to you, but which is apparent in the structure of the web pages you’re getting data from.

Here’s a chunk that grabs all the links to ship-specific pages. It stores the links in a character vector all_links, and sets up an empty list vessel_logs_l which is going to store, one element per page, the results of scraping each ship’s page.

#-------------------Main page and all the links----------------
# Main page, which has links to all the ship-specific pages
url1 <- "https://www.naval-history.net/OWShips-LogBooksWW1.htm"
p1 <- read_html(url1)


all_links <- p1 %>%
  html_nodes("a") %>%
  str_extract("OWShips-WW1.*?\\.htm") %>%
  unique()


vessel_logs_l <- list()
all_links <- all_links[!is.na(all_links)]

# There is an error: JMS Welland, should be HMS Welland. URL is correct but
# 

Now here's the main loop, which iterates through each ship's page. It extracts the vessel's name (which can be deduced from the URL - it runs from the 16th character to the 5th character from the end of the URL); the type of vessel (destroyer, sloop etc), which can be deduced from the HTML page title; and then the text of the log entries themselves, in the form of a big column of character strings. This is fairly simple with a few regular expressions. The trick in the pattern used below is to create TRUE/FALSE vectors that in effect label each line of the log: is this line the position of the ship? is it the weather? is it the description of the position (ie the English name of the location)? etc. Then these columns are used as part of the process to turn the data into one row per day (for each specified ship), with summary columns containing relevant extracts from the various lines of the log entry.

#-----------------Main loop - one ship at a time----------------

# Loop from "i" to the end means if we get interrupted we can start
# the loop again from wherever it got up to. This loop takes about 30-60 minutes
# to run through all 314 ships.
i = 1
for(i in i:length(all_links)){
  cat(i, " ")
  the_url <- glue("https://www.naval-history.net/{all_links[i]}")
  
  the_vessel <- str_sub(all_links[i], 16, -5) %>%
    str_replace("_", " ")
  
  this_ship_page <- read_html(the_url)
  
  vessel_type <- this_ship_page %>%
    html_nodes("title") %>%
    html_text() %>%
    drop_rn() %>%
    str_replace(" - British warships of World War 1", "") %>%
    str_replace(" - British Empire warships of World War 1", "") %>%
    str_replace(" - British auxiliary ships of World War 1", "") %>%
    str_replace(" - logbooks of British warships of World War 1", "") %>%
    str_replace(".*, ", "")
  
  txt <- this_ship_page %>%
    html_nodes("p") %>%
    html_text()
  
  d <- tibble(txt) %>%
   mutate(txt2 = drop_rn(txt)) %>%
    mutate(is_date = grepl("^[0-9]+ [a-zA-Z]+ 19[0-9][0-9]$", txt2),
           entry_id = cumsum(is_date),
           is_position = grepl("^Lat.*Long", txt2),
           is_position_description = lag(is_date),
           is_weather = grepl("^Weather", txt2),
           last_date = ifelse(is_date, txt2, NA),
           last_date = as.Date(last_date, format = "%d %b %Y")) %>%
    fill(last_date) %>%
    filter(entry_id >= 1)
  
  vessel_logs_l[[i]] <- d %>%
    group_by(entry_id) %>%
    summarise(date = unique(last_date),
              position = txt2[is_position],
              # position_description is a bit of a guess, sometimes there are 0,
              # 1 or 2 of them (not necessarily correct), so we just take the
              # first one and hope for the best.
              position_description = txt2[is_position_description][1],
              weather = txt2[is_weather][1],
              log_entry = paste(txt2, collapse = "\n"),
              .groups = "drop") %>%
    mutate(url = the_url,
           vessel = the_vessel,
           vessel_type = vessel_type,
           vessel_id = i,
           lat = str_extract(position, "Lat.*?\\.[0-9]+"),
           long = str_extract(position, "Lon.*?\\.[0-9]+"),
           lat = as.numeric(gsub("Lat ", "", lat)),
           long = as.numeric(gsub("Long ", "", long)),
           weather = str_squish(gsub("Weather:", "", weather, ignore.case = TRUE)))
}

# save version with all the text (about 25 MB)
vessel_logs <- bind_rows(vessel_logs_l)
save(vessel_logs, file = "data/vessel_logs.rda")

# Cut down version of the data without the original log text (about 2MB):
vessel_logs_sel <- select(vessel_logs, -log_entry)
save(vessel_logs_sel, file = "data/vessel_logs_sel.rda")

Drawing the map

Drawing each daily frame of the map itself is surprisingly easy, thanks to the wonders of ggplot2 and the neat coordinate transformations offered by simple features and sf. The "layered grammar of graphics" philosophy of Wickham's ggplot2 really comes into its own here, providing the ability to neatly specify:

  • a default dataset
  • six different layers including land borders, solid points for each ship, hollow circular points for any battles present on the day, text annotating those battles, and text annotations for today’s date and the description of the stage of the war
  • a coordinate system to give a good presentation of the round world in a rectangle of real estate
  • scales to govern the colours of the ships
  • fine thematic control of background and text colours, fonts, etc

Skipping over a chunk of data management to define the times and labels used for the various annotations, here is the code for the actual drawing of the map with a single day’s data:

m <- ships_data %>%
    ggplot(aes(x = long, y = lat)) +
    borders(fill = "grey", colour = NA) +
    geom_point(aes(colour = vessel_type_lumped), size = 0.8) +
    geom_point(data = battle_data,
               aes(size = point_size),
               shape = 1, colour = battle_col) +
    geom_text(data = battle_data, 
              aes(label = battle), 
              family = main_family, 
              hjust = 0,
              nudge_x = 5,
              size = 2,
              colour = battle_col) +
    scale_size_identity() +
    coord_sf() +
    theme_void(base_family = main_family) +
    # The date, in the South Atlantic:
    annotate("text", x = 22, y = -64, label = format(the_date, '%e %B %Y'), 
             colour = date_col, hjust = 1) +
    # Summary text next the date:
    annotate("text", x = 24, y = -63, 
             label = glue("{date_sum_text}: {unique(ships_data$phase)}"), 
             colour = comment_col, 
             hjust = 0, size = 2.5) +
    scale_colour_manual(values = pal, labels = names(pal), drop = FALSE) +
    labs(title = glue("Daily locations of Royal Navy Ships 1914 to 1919"),
         colour = "",
         caption = str_wrap("Locations of 314 UK Royal Navy from log books compiled by 
         naval-history.net; map by freerangestats.info. Ships that survived the 
         war and that travelled out of UK home waters were more likely to be selected 
         for transcription, which was by volunteers for the 'Zooniverse Old Weather Project'.", 
                            # margin() theme on left and right doesn't work for plot.caption so we add our own:
                            width = 180, indent = 2, exdent = 2)) +
    theme(legend.position = "bottom",
          plot.title = element_text(family = "Sarala", hjust = 0.5),
          plot.caption = element_text(colour = "grey70", size = 8, hjust = 0),
          legend.spacing.x = unit(0, "cm"),
          legend.text = element_text(hjust = 0, margin = margin(l = -2, r = 15)),
          legend.background = element_rect(fill = sea_col, colour = NA),
          panel.background = element_rect(fill = sea_col, colour = NA))

Making movies

The loop that this sits within draws one frame for each day, 2000 pixels wide in 16:9 ratio. I used ImageMagick to create an animated GIF out of a subset of 40 of those frames, and Windows Video Editor to make the full-length movie above.
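
For anyone wanting to stay in R for that last step, a rough equivalent with the magick package (not the workflow actually used here; the frames/ directory and file pattern are assumptions) would be something like:

# Animate a subset of the rendered frames into a GIF from within R.
library(magick)

frames <- list.files("frames", pattern = "\\.png$", full.names = TRUE)[1:40]
gif <- image_animate(image_read(frames), fps = 4)   # 4 frames per second
image_write(gif, "ships-sample.gif")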

So that's all, folks. Just a tiny slice of history. Oh, and if you think that 6.5 minutes of video is long to watch, imagine what it was like to live through. It didn't begin or end there, either. We might think 2020 has been tough, but I'd still rather have what we've just gone through than many of the years in the first half of last century.


The post Animated map of World War I UK ship positions by @ellis2013nz first appeared on R-bloggers.


Advent of 2020, Day 5 – Understanding Azure Databricks cluster architecture, workers, drivers and jobs


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Series of Azure Databricks posts:

Yesterday we unveiled a couple of concepts about workers, drivers and how autoscaling works. In order to explore the services behind them, start up the cluster we created yesterday (if it was automatically terminated or you have stopped it manually).

The cluster is starting up (when it has started, the green loading circle will become full):

My cluster is a Standard DS3_v2 cluster (4 cores) with min 2 and max 8 workers. The same applies to the driver. Once the cluster is up and running, go to the Azure Portal and look for the resource group that you created at the beginning (Day 2), when we started the Databricks service. I have named my resource group "RG_DB_py" (naming is important! RG – resource group; DB – Databricks service; py – my project name). Search for the correct resource:

Select "Resource groups" and find your resource group. I have a lot of resource groups, since I try to bundle closely related projects into small groups:

Find yours and select it, and you will find the Azure Databricks service that belongs to this resource group.

Databricks creates an additional (automatically generated) resource group to hold all the underlying services (storage, VMs, network, etc.). It follows this naming convention:

RG_DB_py is my resource group. What Azure does in this case is prefix and suffix your resource group name, giving: databricks_rg_DB_py_npkw4cltqrcxe. The prefix will always be "databricks_rg" and the suffix is a random 13-character string for uniqueness; in my case, npkw4cltqrcxe. Why a separate resource group? The services used to be under the same resource group, but decoupling them into a separate group makes it easier to start/stop services, manage IAM, create pools and scale. Find this resource group and see what is inside:

In the detailed list you will find the following resources (in accordance with my Standard DS3_v2 cluster):

  • Disk (9x Resources)
  • Network Interface (3x resources)
  • Network Security group (1x resource)
  • Public IP address (3x resources)
  • Storage account (1x resource)
  • Virtual Machine (3x resources)
  • Virtual network (1x resource)

Inspecting the naming of these resources, you can see that the names are GUID-based, but they repeat across the different resources and can easily be bundled together. Drawing the components together gives the full picture:

At a high level, the Azure Databricks service manages the worker nodes and driver node in the separate resource group, which is tied to the same Azure subscription (for easier scalability and management). The platform, or "appliance", or "managed service", is deployed as a set of Azure resources, and Databricks manages all other aspects. The additional VNet, security groups, IP addresses, and storage accounts are ready to be used by the end user and managed through the Azure Databricks portal (UI). Storage is also replicated (geo-redundant replication) for disaster scenarios and fault tolerance. Even when the cluster is turned off, the data is persisted in storage.

Each cluster node is a virtual machine with blob storage attached to it. The virtual machine runs Linux Ubuntu (16.04 as of writing this) and has 4 vCPUs and 14 GiB of RAM. The workers use two such virtual machines, and the same virtual machine size is reserved for the driver. This is what we set up on Day 2.

Since each VM is the same (for worker and driver), the workers can be scaled up based on vCPUs. Two worker VMs with 4 cores each give a maximum of 8 workers, so each vCPU/core is considered one worker. The driver machine (also a Linux Ubuntu VM) is a manager machine for load distribution among the workers.

Each virtual machine is set up with a public and a private subnet, and all are mapped together in a virtual network (VNet) for secure connectivity and communication of workloads and data results. Each VM also has a dedicated public IP address for communication with other services or Databricks Connect tools (I will talk about this in later posts).

Disks are also bundled into three types for each VM. These types are:

  • Scratch Volume
  • Container Root Volume
  • Standard Volume

Each type has a specific function, but all are designed for optimised data-caching performance, especially delta caching. This means faster data reading: copies of remote files are created in the nodes' local storage using a fast intermediate data format. The data is cached automatically, even when a file has to be fetched from a remote location, and this also performs well for repetitive and successive reads.

Delta caching (as part of Spark caching) is supported only for reading Parquet files in DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. Optimized storage (Spark caching) does not support file types such as CSV, JSON, TXT, ORC, XML.

When a request is pushed from the Databricks portal (UI), the driver accepts the request and, using Spark jobs, pushes the workload down to each node. Each node holds shards and copies of the data, or gets them through DBFS from Blob storage, and executes the job. After execution, the results of each worker node are gathered and summarised by the driver, and the driver node returns the results back to the UI.

The more worker nodes you have, the more the request can be executed in parallel. And the more workers you have available (or in ready mode), the more you can scale your workloads.

Tomorrow we will start working our way up to importing and storing data, see how it is stored on blob storage, and explore the different types of storage that Azure Databricks provides.

The complete set of code and notebooks will be available in the GitHub repository.

Happy Coding and Stay Healthy!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Advent of 2020, Day 5 – Understanding Azure Databricks cluster architecture, workers, drivers and jobs first appeared on R-bloggers.

BoardGames choices explored


[This article was first published on R – Data Science Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Board games have been on the rise since 2012. A star, Wil Wheaton, created the 'TableTop' show on YouTube, where he gathers with his friends around a table and plays a board game. Your favorite Barnes and Noble has a wall dedicated to games for all age categories. There are board game cafes and stations. Due to the huge variety of choices, it might be tricky to select a game for one of these places. The audiences come from vast backgrounds and, in general, people have different tastes.

Names | Total Votes | Average | Board Game Category
7 Wonders Duel | 564 | 8.1 | 'Ancient', 'Card Game', 'City Building', 'Civilization'
The 7th Continent | 422 | 8.3 | 'Adventure', 'Card Game', 'Exploration', 'Science Fiction'
Concordia | 374 | 8.1 | 'Ancient', 'Economic', 'Nautical'
The Castles of Burgundy | 1037 | 8.1 | 'Dice', 'Medieval', 'Territory Building'

Let’s explore a dataset of 20,000 games. The table above contains the columns I’m going to talk about.

Each game in the dataset receives a number of votes (see the 'total votes' column). We'll dismiss the games that don't have sufficient data, leaving those with 100 or more votes. Next, let's look at the ratings for each game. Fans rate a game on a scale of 1-10, and the average is stored in the dataset in a column called 'average'. Let's dismiss the games rated below 6 points. The dataset shrinks to 950 games in total.

There's a column called 'board game category', and each game is listed under several categories. If we group the whole dataset by category, we can find the most popular categories and select the top ten. However, some of these categories are very niche and contain 3 or fewer games. This can be avoided by looking at how the highest-voted categories relate to their popularity, meaning the number of games listed under each.
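
A rough sketch of that filtering and category counting in dplyr/tidyr (the data frame name, column names and the comma-separated category format are assumptions about the scraped data):

# Filter to games with enough data, then count games per category.
library(dplyr)
library(tidyr)

top_categories <- games %>%
  filter(total_votes >= 100, average >= 6) %>%           # ~950 games remain
  separate_rows(board_game_category, sep = ",\\s*") %>%  # one row per category
  count(board_game_category, sort = TRUE) %>%
  slice_head(n = 10)                                     # ten most common categories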

Finally, after arriving at the top ten board game categories, we can easily select the top ten games within each of them. Ta-da, the work is done. The link to my Shiny project on GitHub is here.

My Shiny app can be found here

Created for tik-tok anticafe

Thank you for reading,

Khamanna

The post BoardGames choices explored first appeared on Data Science Blog.


To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post BoardGames choices explored first appeared on R-bloggers.

Poorman’s automated translation with R and Google Sheets using {googlesheets4}


[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A little trick I thought about this week: using Google Sheets, which includes a "googletranslate()" function, to translate a survey that we're preparing at work from French to English, and using R of course. You'll need a Google account for this. Also, keep in mind that you'll be sending the text you want to translate to Google, so don't go sending out anything sensitive.

First, let’s load the needed packages:

library(googlesheets4)
library(dplyr)
library(tibble)

As an example, I'll be defining a tibble with one column and two rows. Each cell contains a sentence in French from the best show in the entire French-speaking world, Kaamelott:

my_french_tibble <- tribble(~french,
                  "J'apprécie les fruits au sirop",
                  "C'est pas faux")

To this tibble, I'm now adding two more columns that contain the following string: "=googletranslate(A:A, "fr", "en")". This is exactly what you would write in the formula bar in Sheets. Then, we need to convert that to an actual Google Sheets formula using gs4_formula():

(my_french_tibble <- my_french_tibble %>%
  mutate(english = '=googletranslate(A:A, "fr", "en")') %>%
  mutate(portuguese = '=googletranslate(A:A, "fr", "pt")') %>%
  mutate(english = gs4_formula(english),
         portuguese = gs4_formula(portuguese)))
## # A tibble: 2 x 3
##   french     english                           portuguese                       
## 1 J'appréci… =googletranslate(A:A, "fr", "en") =googletranslate(A:A, "fr", "pt")
## 2 C'est pas… =googletranslate(A:A, "fr", "en") =googletranslate(A:A, "fr", "pt")

We're ready to send this to Google Sheets. As soon as the sheet gets uploaded, the formulas will be evaluated, yielding translations in both English and Portuguese.

To upload the tibble to sheets, run the following:

french_sheet <- gs4_create("repliques_kaamelott",
                           sheets = list(perceval = my_french_tibble))

You'll be asked if you want to cache your credentials so that you don't need to re-authenticate between R sessions:

Your browser will then open a tab asking you to log in to Google:

At this point, you might get a notification on your phone, alerting you that there was a login to your account:

If you go on your Google Sheets account, this is what you’ll see:

And if you open the sheet:

Pretty nice, no? You can of course download the workbook, or better yet, never leave your R session at all and simply get the workbook back, using either the {googledrive} package, which simply needs the name of the workbook ({googledrive} also needs authentication):

(translations <- googledrive::drive_get("repliques_kaamelott") %>%
  read_sheet)

You’ll get a new data frame with the translation:

Reading from "repliques_kaamelott"
Range "perceval"
# A tibble: 2 x 3
  french                    english                     portuguese              
1 J'apprécie les fruits au… I appreciate the fruits in… I apreciar os frutos em…
2 C'est pas faux            It is not false             Não é falsa             

Or you can use the link to the sheet (which does not require re-authenticating at this point):

translations <- read_sheet("the_link_goes_here", "perceval")

You could of course encapsulate all these steps into a function and have any text translated very easily! Just be careful not to send any confidential information out…
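
A possible wrapper around the steps above might look like this (a sketch of mine, not code from the post; the function name, sheet name and column names are made up):

# Translate a vector of sentences via a temporary Google Sheet.
library(googlesheets4)
library(tibble)
library(dplyr)

translate_with_sheets <- function(texts, from = "fr", to = "en",
                                  sheet_name = "temp_translation") {
  formula_txt <- sprintf('=googletranslate(A:A, "%s", "%s")', from, to)
  input <- tibble(original = texts) %>%
    mutate(translated = gs4_formula(formula_txt))
  ss <- gs4_create(sheet_name, sheets = list(data = input))
  read_sheet(ss, "data")   # formulas are evaluated by Sheets before reading back
}

# translate_with_sheets(c("C'est pas faux"), from = "fr", to = "en")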

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!




To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Poorman's automated translation with R and Google Sheets using {googlesheets4} first appeared on R-bloggers.

R is Getting an Official Pipe Operator


[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It looks like R is getting an official pipe operator (ref). R doesn't work under an RFC process, so we hear about these things as they are discussed on the R-devel mailing list.

I’ve written on this topic before (ref), and I have taped some new comments.

(link)

My feeling is: argument-inserting syntax translation is indeed a tempting way to go. I think this misses some great coding opportunities that arise when one can guarantee the piped entity is a realized value and also use placeholders. Also, a lot is lost when one fails to work out some of the niceties of piping into expressions instead of just functions. But I understand nobody has asked me.
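
For readers who have not yet seen it, a small illustration of that syntax-transform behaviour (requires R 4.1.0 or later, or R-devel at the time this was written):

# The native pipe is handled at parse time and rewritten into an ordinary call:
quote(mtcars |> head(2))
## head(mtcars, 2)

# In use it looks much like magrittr's %>%, but with no placeholder and a
# plain-function-call restriction on the right-hand side:
mtcars |> subset(cyl == 4) |> head(2)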

Function-oriented syntax transform appears to be both what RStudio wants and what R-core is implementing (please keep in mind, these are two different actors in the R space). So it is what it is.


To leave a comment for the author, please follow the link and comment on their blog: R – Win Vector LLC.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post R is Getting an Official Pipe Operator first appeared on R-bloggers.

Data Science Courses on Udemy: Comparative Analysis


[This article was first published on R – Data Science Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While there's an abundance of many different course topics on Udemy, four out of five of the top-rated Udemy instructors teach data science. It's easy to understand why people go for them.

The courses are not expensive and you own them for a lifetime. At the same time, the number of instructors seems to be multiplying each year. This article gives a clue as to why: the cost of starting your own course is practically zero. You pay Udemy out of your revenues, and, additionally, it does all the advertising for you. I decided to scrape Udemy Data Science courses to see where a Udemy newbie might want to start. I looked at three different topics: Python for Data Science, R-programming for Data Science and Statistics for Data Science. All three were filtered for the English language.

I dismissed exploring Python courses, as the topic seems to be extremely saturated: even when filtered for Data Science and the English language it returned 10,000 results. Unlike Python, Statistics and R-programming filtered for Data Science returned 154 and 127 results respectively when scraped. See Statistics and R-programming.

Udemy loads its content with JSON scripts, so I had to use Selenium for scraping. As previously mentioned, I scraped two different URLs: Statistics for Data Science and R for Data Science.
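
The post does not show the scraping code itself; a minimal R-flavoured sketch with RSelenium and rvest (an assumption about tooling, since the original scraping may well have used Python's selenium, and the search URL is purely illustrative) would look something like:

# Render the JavaScript-heavy page in a real browser, then parse the HTML.
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- driver$client
remDr$navigate("https://www.udemy.com/courses/search/?lang=en&q=r%20data%20science")
Sys.sleep(5)                                   # let the JavaScript finish loading
page <- read_html(remDr$getPageSource()[[1]])
# ...extract course titles, ratings, votes and prices with html_elements()...
remDr$close()
driver$server$stop()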

Let’s look at the number of votes for courses under each topic to have an idea of how popular the topic is. Both R-programming and Statistics courses received a fair share of votes. 

If we treat the voters as our population we could look for clues when studying the distribution of the votes. 

The graph for R-programming looks pretty dispersed. The great majority of courses are rated between 3.6 and 4.8:

The Statistics courses graph is visibly pointier, with the votes centered around the range from 4.1 to 4.8 and with a solid number of courses hugely downrated. Also, a noticeable portion of the courses has been awarded 5 stars.

The difference is more obvious when we put the graphs together to compare the ratings.

One possible conclusion would be that it’s riskier to start your own course in Statistics than in R-programming. 

You may think that the dispersion is due to the higher number of votes for R courses. Interestingly, there’s no correlation between the number of voters and received ratings. 

Let's explore the pricing and its distribution for both topics. On the website the courses appear highly priced; however, scraping returned much lower prices, since it grabs the prices after all the discounts have been applied to them.

 

The larger chunk of R courses is priced from $9.99 to $14.99, while Statistics courses cost a bit more, starting at $12.99. The recommendation for a newcomer would be to price their course accordingly and not deviate much from the prices within the topic. However, a closer look at the outlier in Statistics, priced at $30.99, suggests that a deviation in pricing is not penalized if the course material is good. See the outlier in the table below:

With well over 5,000 voters, the outlier (when it comes to pricing) is rated as high as 4.6.

Overall, Udemy is worth looking into if you have enough material for a course and you believe your course will be of value. The suggestion is to start with the less risky topic, which is R-programming in this case, and price it accordingly.

My GitHub can be found here.

 

The post Data Science Courses on Udemy: Comparative Analysis first appeared on Data Science Blog.


To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Data Science Courses on Udemy: Comparative Analysis first appeared on R-bloggers.
