
“Clearing the Confusion” series

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers].

In recent weeks, I’ve posted three tutorials with Clearing the Confusion titles, all in my regtools GitHub repo. The topics are unbalanced classification data, k-fold cross-validation, and scaling in PCA. Comments welcome!



SHARPEn your portfolio

[This article was first published on R on OSM, and kindly contributed to R-bloggers].

In our last post, we started building the intuition around constructing a reasonable portfolio to achieve an acceptable return. The hero of our story had built up a small nest egg and then decided to invest it equally across the three major asset classes: stocks, bonds, and real assets. For that we used three liquid ETFs (SPY, SHY, and GLD) as proxies. But our protagonist was faced with some alternative scenarios offered by his cousin and his co-worker: a Risky portfolio of almost all stocks and a Naive portfolio of 50/50 stocks and bonds.

After seeing the outcomes of the different portfolios, our hero wondered if there were a better alternative. To accommodate, we simulated the range of outcomes one could potentially expect based on the risk, return, and correlation profiles of the three ETFs. We did this by creating 1,000 randomly weighted portfolios. When we graphed the results of the simulation, our hero could see how his portfolio (red dot) compared with the risky (purple dot), naive (black dot), and many other portfolios, as shown below. Additionally, the scatter plot showed our hero that for a given level of risk, he could find a portfolio that offered the best possible return, or, for a given level of return, he could decide how much risk he wanted to take. The portfolio with the highest return for a given level of risk “dominated” the other portfolios at that level of risk.
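
As a minimal sketch of what that simulation involves (an illustration, not the exact weighting scheme used in the full code listing at the end of the post), one can draw random weights that sum to one and compute each portfolio's return and risk from the assets' mean returns (mean_ret) and covariance matrix (cov_port), both of which are built in the full listing:

# Sketch only: 1,000 randomly weighted three-asset portfolios.
# Assumes `mean_ret` (mean monthly returns) and `cov_port` (covariance matrix)
# exist, as created in the full listing at the end of the post.
set.seed(123)
sim <- t(replicate(1000, {
  w <- runif(3)
  w <- w / sum(w)  # normalize so the weights sum to one
  c(returns = sum(w * mean_ret),
    risk = sqrt(t(w) %*% cov_port %*% w))
}))
head(sim)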

However, we showed that some of these dominant portfolios might not be intuitively acceptable even if mathematically optimal. For example, when our hero thought that he was fine with the current riskiness of his portfolio, but wanted to eke out a bit more return, the solution was to increase his exposure to stocks by over 20 points and increase his exposure to gold by four points, all at the expense of bonds. But this only resulted in a one-to-two point improvement in returns. If he wanted to improve returns more than that, he would have to alter how much risk he would be willing to accept.

That begged the question of whether there was an alternative solution. Let’s resume where we left off…

Now that we’ve seen that a major change in the portfolio weights doesn’t yield that much improvement in returns, should we find a different metric? Maybe we should be looking for the best risk-adjusted return. Let’s graph the same random portfolios, but color them according to their risk-adjusted returns—in this case, simply return over risk—and we’ll call this the Sharpe ratio after the Nobel Prize winner William F. Sharpe, who developed the concept.1 The higher the Sharpe ratio, the greener the point; the lower, the redder.
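
A quick note on mechanics: the Sharpe ratio here is just the portfolio's return divided by its risk. The full code listing at the end of the post colors points by a sharpe column without showing its creation, so something like the following is assumed (a sketch consistent with the text, not taken from the original listing):

# Assumed definition: risk-adjusted return for each simulated portfolio,
# i.e. return over risk, added to the `port` data frame built in the listing.
port$sharpe <- port$returns / port$risk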

Interestingly, the highest risk-adjusted returns appear to be at the low end of the graph. In fact, the highest risk-adjusted return also happens to be the portfolio with the lowest risk. That isn’t exactly counter-intuitive. But it raises the question of how much additional return you’re getting for taking on more risk. To see this we add a line whose slope matches a one-to-one correspondence between change in risk and change in return. This is shown in the graph below.

What’s interesting about this line is that it tells you which portfolios generate more than one unit of return per unit of risk and which ones generate less. Let’s spend a few moments on the graph.

The points that lie above the purple line represent portfolios where your return per unit of risk is greater than one for one. The portfolios below it are, obviously, the opposite. It’s important to remember that in this case, a unit of risk is not the same as a unit of return. Volatility (or the standard deviation of returns) is used as a proxy for risk. Hence, risk is a standardized range, while return is a point. So it’s not the same as risking a dollar to make a dollar. You’re risking a likely range of dollars to make a dollar. If one of the portfolios has a 5% average return and a 10% risk, that means the returns of the portfolio could be -5% to 15% close to 70% of the time. Hence, when risk increases by one unit, the range of possible outcomes widens by two units. In the previous example, the range of values (-5% to 15%) based on risk was about 20 percentage points. If that risk increased by one unit to 11%, then the range would be 22 percentage points (-6% to 16%).
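
To make the arithmetic behind that “close to 70% of the time” claim concrete (a hedged aside that assumes normally distributed returns, which the post does not state explicitly): a one-standard-deviation band around the mean covers roughly 68% of outcomes.

# Sketch: 5% mean return, 10% volatility
mu <- 0.05
sigma <- 0.10
c(lower = mu - sigma, upper = mu + sigma)  # -5% to 15%
pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)  # about 0.68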

For those not indoctrinated by portfolio theory this isn’t the most intuitive concept on first blush. But think about it this way: embedded in that range of potential values is a risk of loss. By bearing that potential loss, you’re expecting a potential gain. So the purple line separates the portfolios whose expected upside potential exceeds the downside potential from those where it doesn’t. In other words, above the line the upside is greater than a reasonable expectation of the downside, and below it the reverse is true. Most folks prefer the upside to be greater than the down. In future posts, we’ll re-arrange this to look at risk only as expected loss. But we need to walk before we can run.

Let’s go back and see where the three portfolios are in relation to the purple line.

None of the portfolios enjoy a one-to-one relationship between return and risk. That doesn’t mean they’re “bad” portfolios. If your required return is greater than 5%—roughly the point above which return starts to lag risk—then to achieve that you’ll need to accept a poorer risk-adjusted return profile. That begs the question of whether this trade-off of accepting incrementally more downside potential for incrementally less upside potential is worth it.

Answering the “worth it” question takes us out of the realm of numbers and into the realm of preferences, psychology, and behavior. We won’t dwell too long on this because it’s hard to generalize individual preferences. Behavioral finance attempts to identify and explain the motivation and effect of such preferences. But that is way beyond the scope of this post.

Let’s move on to look at what the average weights for those greater than one-for-one return-to-risk portfolios actually look like.

On average we see a very high allocation to bonds and not much to stocks or gold. For people that don’t require a high return, this would probably be a good portfolio mix. But let’s assume our protagonist needs more than that, yet he doesn’t want to stray too far from a relatively evenly balanced weighting. So we’ll keep close to the same volatility and see what types of returns we can generate along with the implied portfolio weights. Here’s the original table of returns and risk.

Table 1: Annualized performance metrics
Asset    Return (%)   Risk (%)   Sharpe ratio
Equal    6.0          9.2        0.66
Naive    6.0          6.9        0.81
Risky    8.4          12.4       0.64

Let’s look at the portfolios between the two bands that represent one percentage point more or less risk than the equal-weighted portfolio in the graph below.
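
One housekeeping note: the code listing at the end of the post filters on down_band and up_band without defining them. Based on this description, they are presumably one percentage point below and above the equal-weighted portfolio's annualized risk, something like:

# Assumed definitions (not shown in the original listing): risk bands one
# percentage point either side of the equal-weighted portfolio's annualized risk.
down_band <- port_exam[1, 3] * sqrt(12) * 100 - 1
up_band <- port_exam[1, 3] * sqrt(12) * 100 + 1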

Now we’ll see what the average returns and risk are for those risk bands. One thing should stand out: while both average returns and risk are higher, so is the Sharpe ratio. In general, then, our hero can achieve better returns and risk-adjusted returns by widening his risk parameters.

Table 2: Average returns and risk for risk bands (%)
Returns   Risk   Sharpe
7.4       9.7    0.76

That doesn’t seem too bad. Let’s graph the average weights.

Seems reasonable. But you’ll note that this change in allocation isn’t too different from switching to the dominant portfolio we calculated earlier. So our hero thinks that maybe the gold allocation is too high. He wants to see if there are any portfolios that would afford him a similar risk and return, but allocate no more than 20% to gold. Indeed, there are, as shown below.

Table 3: Average returns and risk for risk bands with gold constraint (%)
Returns   Risk   Sharpe
7.3       9.7    0.75

However, to get gold below 20%, we need to raise the allocation to stocks to over 60%. Our hero’s not sure if this is the type of allocation he wants, so he asks if it’s possible to lower the exposure to stocks a little. Unfortunately, no luck there. So what does the average weighting look like?

Almost two-thirds of the portfolio is allocated to stocks and the remainder is relatively evenly divided between bonds and commodities. Is this acceptable? The weighting to stocks more than doubles, the weighting to gold is almost chopped in half, and the Sharpe ratio improves by over 10%. Our hero may not like the higher weighting to stocks, but at least his risk-adjusted return is much better. Only our hero can tell if he’s comfortable with the new portfolio. Whatever the case, we’re far off from an “optimal” portfolio. Where does the “average” portfolio lie on the scatter plot? The yellow dot is that portfolio; we’ll call it the “sufficient portfolio.”

What does this tell us? While the sufficient portfolio doesn’t offer the highest return for the given level of risk, it does offer a higher return for only a moderate increase of risk and with an allocation our hero may prefer relative to his cousin’s, co-worker’s, or the remainder of the options. But then again it might not, in which case we’d have to re-run the calculations with different weight constraints. Let’s at least look at how the sufficient portfolio would have performed historically, as shown in the graph below with the wider purple line, before we summarize.

What’s the key takeaway? Portfolios that offer the highest return for a given level of risk may not be an allocation that many investors would be comfortable with. And the highest risk-adjusted return portfolio may not offer the required return. But it is possible to find a portfolio that offers most of the necessary requirements and improves risk-adjusted returns if the range of acceptable outcomes is broadened and the constraints aren’t overly stringent. Finding these portfolios becomes more of an iterative process than a closed form solution. Would the new portfolio be satisfactory? At the very least, that depends on the cost of adherence. If the psychological cost to maintain the portfolio is low—that is, it doesn’t keep you up at night—then, provided the portfolio satisfies the other requirements, it is a more “satisfactory” portfolio. This obviously touches on behavioral elements that would require separate posts, but the goal is to view most of these concepts through the lens of what has an intuitive appeal. Eventually, we might find a portfolio that satisfices our hero’s risk and return requirements. That is, it satisfies his needs based on sufficient thresholds. We won’t delve into that concept more in this post, but it will underlie the rest of this series on portfolio construction.

And speaking of this series, there’s still more ground to cover. Over the next few posts we’ll examine benchmarking, rebalancing, semi-deviation, capital market expectations, and time dependence. Stick with us and let us know if there’s something you’d like to see. Until then, here’s the code:

# Load packages
library(tidyverse)   # added: the pipeline below uses dplyr, tidyr, purrr and ggplot2 verbs
library(tidyquant)

# Get data
symbols <- c("SPY", "EEM", "SHY", "IYR", "GLD")
symbols_low <- tolower(symbols)
prices <- getSymbols(symbols, src = "yahoo",
                     from = "1990-01-01",
                     auto.assign = TRUE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols_low)
prices_monthly <- to.monthly(prices, indexAt = "last", OHLC = FALSE)
ret <- ROC(prices_monthly)["2005/2019"]
naive <- ret[, c("spy", "shy")]
basic <- ret[, c("spy", "shy", "gld")]

# Create different weights and portfolios
wt1 <- rep(1/(ncol(basic)), ncol(basic))
port1 <- Return.portfolio(basic, wt1) %>%
  `colnames<-`("ret")
wt2 <- c(0.9, 0.10, 0)
port2 <- Return.portfolio(basic, weights = wt2) %>%
  `colnames<-`("ret")
wtn <- c(0.5, 0.5)
portn <- Return.portfolio(naive, wtn)
port_comp <- data.frame(date = index(port1),
                        equal = as.numeric(port1),
                        wtd = as.numeric(port2),
                        naive = as.numeric(portn))

port_comp %>%
  gather(key, value, -date) %>%
  group_by(key) %>%
  mutate(value = cumprod(value + 1)) %>%
  ggplot(aes(date, value*100, color = key)) +
  geom_line() +
  scale_color_manual("", labels = c("Equal", "Naive", "Risky"),
                     values = c("blue", "black", "red")) +
  labs(x = "",
       y = "Index",
       title = "Three portfolios, which is best?",
       caption = "Source: Yahoo, OSM estimates") +
  theme(legend.position = "top",
        plot.caption = element_text(hjust = 0))

# Portfolio inputs
mean_ret <- apply(ret[, c("spy", "shy", "gld")], 2, mean)
cov_port <- cov(ret[, c("spy", "shy", "gld")])
port_exam <- data.frame(ports = colnames(port_comp)[-1],
                        ret = as.numeric(apply(port_comp[, -1], 2, mean)),
                        vol = as.numeric(apply(port_comp[, -1], 2, sd)))

# Weighting that ensures more variation and random weighting to stocks
set.seed(123)
wts <- matrix(nrow = 1000, ncol = 3)
for(i in 1:1000){
  a <- runif(1, 0, 1)
  b <- c()
  for(j in 1:2){
    b[j] <- runif(1, 0, 1 - sum(a, b))
  }
  if(sum(a, b) < 1){
    inc <- (1 - sum(a, b))/3
    vec <- c(a + inc, b + inc)
  }else{
    vec <- c(a, b)
  }
  wts[i, ] <- sample(vec, replace = FALSE)
}

# Calculate random portfolios
port <- matrix(nrow = 1000, ncol = 2)
for(i in 1:1000){
  port[i, 1] <- as.numeric(sum(wts[i, ] * mean_ret))
  port[i, 2] <- as.numeric(sqrt(t(wts[i, ] %*% cov_port %*% wts[i, ])))
}
colnames(port) <- c("returns", "risk")
port <- as.data.frame(port)

# Risk-adjusted return (return over risk). Note: the `sharpe` column is used in
# plots below but was not defined in the original listing; this definition is an
# assumption consistent with the text.
port <- port %>% mutate(sharpe = returns/risk)

# Graph with points
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200)) +
  geom_point(color = "blue", size = 1.2, alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ log(x), se = FALSE, color = "slategrey") +
  geom_point(data = port_exam, aes(port_exam[1,3]*sqrt(12)*100,
                                   port_exam[1,2]*1200),
             color = "red", size = 6) +
  geom_point(data = port_exam, aes(port_exam[2,3]*sqrt(12)*100,
                                   port_exam[2,2]*1200),
             color = "purple", size = 7) +
  geom_point(data = port_exam, aes(port_exam[3,3]*sqrt(12)*100,
                                   port_exam[3,2]*1200),
             color = "black", size = 5) +
  scale_x_continuous(limits = c(0,14)) +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios")

# Find max and equivalent risk for Equal risk slice
equal_max <- port %>%
  filter(risk < port_exam[1,3] + 0.0005,
         risk > port_exam[1,3] - 0.0005) %>%
  mutate(returns = returns*1200,
         risk = risk * sqrt(12)*100) %>%
  arrange(desc(returns)) %>%
  slice(1)

# Find weights for dominant portfolio
eq_wt <- port %>%
  mutate(spy_wt = wts[,1],
         shy_wt = wts[,2],
         gld_wt = wts[,3],
         returns = returns * 1200,
         risk = risk * sqrt(12) * 100) %>%
  filter(returns == equal_max$returns,
         risk == equal_max$risk) %>%
  select(spy_wt, shy_wt, gld_wt)

# Graph weights
eq_wt %>%
  rename("SPY" = spy_wt,
         "SHY" = shy_wt,
         "GLD" = gld_wt) %>%
  gather(key, value) %>%
  ggplot(aes(factor(key, level = c("SPY", "SHY", "GLD")), value*100)) +
  geom_bar(stat = 'identity', fill = "blue") +
  geom_text(aes(label = round(value,2)*100), nudge_y = 5) +
  labs(x = "Assets",
       y = "Weights (%)",
       title = "Derived weighting to improve returns")

# Portfolio with Sharpe ratio
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200, color = sharpe)) +
  geom_point(size = 1.2, alpha = 0.4) +
  geom_point(data = port_exam, aes(port_exam[1,3]*sqrt(12)*100,
                                   port_exam[1,2]*1200),
             color = "red", size = 6) +
  geom_point(data = port_exam, aes(port_exam[2,3]*sqrt(12)*100,
                                   port_exam[2,2]*1200),
             color = "purple", size = 7) +
  geom_point(data = port_exam, aes(port_exam[3,3]*sqrt(12)*100,
                                   port_exam[3,2]*1200),
             color = "black", size = 5) +
  scale_x_continuous(limits = c(0,14)) +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios",
       color = "Sharpe ratio") +
  scale_color_gradient(low = "red", high = "green") +
  theme(legend.position = "top", legend.key.size = unit(.5, "cm"))

# Portfolio with sharpe line
max_sharpe <- max(port$sharpe)*sqrt(12)
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200, color = sharpe)) +
  geom_point(size = 1.2, alpha = 0.4) +
  geom_abline(intercept = 0, slope = max_sharpe, color = "blue") +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios",
       color = "Sharpe ratio") +
  scale_color_gradient(low = "red", high = "green") +
  theme(legend.position = "top", legend.key.size = unit(.5, "cm"))

# Graph with one-to-one
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200, color = sharpe)) +
  geom_point(size = 1.2, alpha = 0.4) +
  geom_abline(intercept = 0, slope = max_sharpe, color = "blue") +
  geom_abline(color = "purple", lwd = 1.25) +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios",
       color = "Sharpe ratio") +
  scale_color_gradient(low = "red", high = "green") +
  geom_text(aes(x = 5, y = 7),
            label = "Purple line is \none-to-one \nreturn-to-risk.",
            color = "purple")

# Three portfolios with purple line
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200, color = sharpe)) +
  geom_point(size = 1.2, alpha = 0.4) +
  geom_point(data = port_exam, aes(port_exam[1,3]*sqrt(12)*100,
                                   port_exam[1,2]*1200),
             color = "red", size = 6) +
  geom_point(data = port_exam, aes(port_exam[2,3]*sqrt(12)*100,
                                   port_exam[2,2]*1200),
             color = "purple", size = 7) +
  geom_point(data = port_exam, aes(port_exam[3,3]*sqrt(12)*100,
                                   port_exam[3,2]*1200),
             color = "black", size = 5) +
  geom_abline(color = "purple", size = 1.1) +
  scale_x_continuous(limits = c(0,14)) +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios",
       color = "Sharpe ratio") +
  scale_color_gradient(low = "red", high = "green")

# High return to risk
port %>%
  mutate(SPY = wts[,1],
         SHY = wts[,2],
         GLD = wts[,3],
         returns = returns * 1200,
         risk = risk * sqrt(12) * 100,
         sharpe = sharpe*sqrt(12)) %>%
  filter(sharpe >= 1) %>%
  summarise_all(mean) %>%
  gather(key, value) %>%
  filter(!key %in% c("returns", "risk", "sharpe")) %>%
  ggplot(aes(factor(key, labels = c("SPY", "SHY", "GLD")), value * 100)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "",
       y = "Weight (%)",
       title = "Average weights for high risk-adjusted return portfolios") +
  geom_text(aes(label = round(value,2)*100), nudge_y = 4)

# Table
port_comp %>%
  rename("Equal" = equal,
         "Naive" = naive,
         "Risky" = wtd) %>%
  gather(Asset, value, -date) %>%
  group_by(Asset) %>%
  summarise(`Mean (%)` = round(mean(value, na.rm = TRUE),3)*1200,
            `Volatility (%)` = round(sd(value, na.rm = TRUE)*sqrt(12),3)*100,
            `Risk-adjusted (%)` = round(mean(value, na.rm = TRUE)/sd(value, na.rm=TRUE)*sqrt(12),3)*100,
            `Cumulative (%)` = round(prod(1+value, na.rm = TRUE),3)*100) %>%
  knitr::kable(caption = "Annualized performance metrics")

# Risk bands. Note: `up_band` and `down_band` were not defined in the original
# listing; per the text, they are assumed to be one percentage point above and
# below the equal-weighted portfolio's annualized risk.
down_band <- port_exam[1,3]*sqrt(12)*100 - 1
up_band <- port_exam[1,3]*sqrt(12)*100 + 1

# Graph with risk bands
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200)) +
  geom_point(color = "blue", size = 1.2, alpha = 0.4) +
  geom_point(data = port_exam, aes(port_exam[1,3]*sqrt(12)*100,
                                   port_exam[1,2]*1200),
             color = "red", size = 6) +
  geom_vline(xintercept = up_band, color = "slateblue") +
  geom_vline(xintercept = down_band, color = "slateblue") +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios")

# Portfolio band output for blog
port %>%
  mutate(spy_wt = wts[,1],
         shy_wt = wts[,2],
         gld_wt = wts[,3],
         returns = returns * 1200,
         risk = risk * sqrt(12) * 100,
         sharpe = returns/risk) %>%
  filter(returns > port_exam[1,2]*1200 + 1,
         risk >= down_band,
         risk < up_band) %>%
  summarise_all(function(x) round(mean(x),1)) %>%
  select(returns, risk, sharpe) %>%
  rename("Returns" = returns,
         "Risk" = risk,
         "Sharpe" = sharpe) %>%
  knitr::kable(caption = "Average returns and risk for risk bands (%)")

port %>%
  mutate(SPY = wts[,1],
         SHY = wts[,2],
         GLD = wts[,3],
         returns = returns * 1200,
         risk = risk * sqrt(12) * 100,
         sharpe = returns/risk) %>%
  filter(returns > port_exam[1,2]*1200 + 1,
         risk >= down_band,
         risk < up_band) %>%
  summarise_all(mean) %>%
  gather(key, value) %>%
  filter(key %in% c("SPY", "SHY", "GLD")) %>%
  ggplot(aes(factor(key, levels = c("SPY", "SHY", "GLD")), value*100)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "",
       y = "Weight (%)",
       title = "Average weights for high risk-adjusted return portfolios") +
  geom_text(aes(label = round(value,2)*100), nudge_y = 5)

# Portfolio band output for blog (gold constrained to 20%)
port %>%
  mutate(spy_wt = wts[,1],
         shy_wt = wts[,2],
         gld_wt = wts[,3],
         returns = returns * 1200,
         risk = risk * sqrt(12) * 100,
         sharpe = returns/risk) %>%
  filter(returns > port_exam[1,2]*1200 + 1,
         risk >= down_band,
         risk < up_band,
         gld_wt <= 0.2) %>%
  summarise_all(function(x) round(mean(x),2)) %>%
  select(returns, risk, sharpe) %>%
  rename("Returns" = returns,
         "Risk" = risk,
         "Sharpe" = sharpe) %>%
  knitr::kable(caption = "Average returns and risk for risk bands (%)")

# Bar chart of weights
port %>%
  mutate(SPY = wts[,1],
         SHY = wts[,2],
         GLD = wts[,3],
         returns = returns * 1200,
         risk = risk * sqrt(12) * 100,
         sharpe = returns/risk) %>%
  filter(returns > port_exam[1,2]*1200 + 1,
         risk >= down_band,
         risk < up_band,
         GLD <= 0.2) %>%
  summarise_all(mean) %>%
  gather(key, value) %>%
  filter(key %in% c("SPY", "SHY", "GLD")) %>%
  ggplot(aes(factor(key, levels = c("SPY", "SHY", "GLD")), value*100)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "",
       y = "Weight (%)",
       title = "Average weights for high risk-adjusted return portfolios") +
  geom_text(aes(label = round(value,2)*100), nudge_y = 5)

# Note: `suff_port` (annualized risk and return of the "sufficient" portfolio)
# and `suff_port_wts` (its weights) are not defined in the original listing;
# they are presumably the averages from the constrained risk-band filter above.
port %>%
  ggplot(aes(risk*sqrt(12)*100, returns*1200)) +
  geom_point(color = "blue", size = 1.2, alpha = 0.4) +
  geom_point(data = port_exam, aes(port_exam[1,3]*sqrt(12)*100,
                                   port_exam[1,2]*1200),
             color = "red", size = 6) +
  geom_point(data = port_exam, aes(port_exam[2,3]*sqrt(12)*100,
                                   port_exam[2,2]*1200),
             color = "purple", size = 7) +
  geom_point(data = port_exam, aes(port_exam[3,3]*sqrt(12)*100,
                                   port_exam[3,2]*1200),
             color = "black", size = 5) +
  geom_point(data = suff_port, aes(risk, returns),
             color = "yellow", size = 8) +
  geom_vline(xintercept = up_band, color = "slateblue") +
  geom_vline(xintercept = down_band, color = "slateblue") +
  scale_x_continuous(limits = c(0,14)) +
  labs(x = "Risk (%)",
       y = "Return (%)",
       title = "Simulated portfolios with sufficient allocation")

# Add portfolio
port_suff <- Return.portfolio(basic, suff_port_wts) %>%
  `colnames<-`("suff")

# Graph
port_comp %>%
  mutate(suff = as.numeric(port_suff)) %>%
  gather(key, value, -date) %>%
  group_by(key) %>%
  mutate(value = cumprod(value+1)) %>%
  ggplot(aes(date, value*100, color = key)) +
  geom_line(aes(size = key)) +
  scale_color_manual("", labels = c("Equal", "Naive", "Sufficient", "Risky"),
                     values = c("blue", "black", "purple", "red")) +
  scale_size_manual(values = c(1,1,2,1), guide = 'none') +
  labs(x = "",
       y = "Index",
       title = "Adding the sufficient portfolio",
       caption = "Source: Yahoo, OSM estimates") +
  theme(legend.position = "top",
        plot.caption = element_text(hjust = 0))

  1. The Sharpe ratio was developed by William Sharpe to measure the excess return of an asset over risk-free rates adjusted for volatility.↩


Electricity demand data in tsibble format

[This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers].

The tsibbledata package contains the vic_elec data set, containing half-hourly electricity demand for the state of Victoria, along with corresponding temperatures from the capital city, Melbourne. These data cover the period 2012–2014.
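
If you just want to poke at the data, it is a regular tsibble (assuming the tsibbledata, tsibble, dplyr and lubridate packages are installed), for example:

library(tsibble)
library(tsibbledata)
library(dplyr)
library(lubridate)
vic_elec
# e.g. aggregate the half-hourly demand to daily totals
vic_elec %>%
  index_by(Day = as_date(Time)) %>%
  summarise(Demand = sum(Demand))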

Other similar data sets are also available, and these may be of interest to researchers in the area.

For people new to tsibbles, please read my introductory post.

 

Australian state-level demand

The raw data for the other states are also stored in the tsibbledata GitHub repository (under the data-raw folder), but they are not included in the package to satisfy CRAN space constraints. However, anyone can still load and use the data with the following code.

library(tidyverse)
library(lubridate)
library(tsibble)
repo <- "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/"
states <- c("NSW","QLD","SA","TAS","VIC")
dirs <- paste0(repo, states, "2015")

# Read holidays data
holidays <- paste0(dirs,"/holidays.txt") %>%
  as.list() %>%
  map_dfr(read_csv, col_names=FALSE, .id="State") %>%
  transmute(
    State = states[as.numeric(State)],
    Date = dmy(X1), 
    Holiday = TRUE
  )
# Read temperature data
temperatures <- paste0(dirs,"/temperature.csv") %>%
  as.list() %>%
  map_dfr(read_csv, .id = "State") %>%
  mutate(
    State = states[as.numeric(State)],
    Date = as_date(Date, origin = ymd("1899-12-30"))
  )
# Read demand data
demands <- paste0(dirs,"/demand.csv") %>%
  as.list() %>%
  map_dfr(read_csv, .id = "State") %>%
  mutate(
    State = states[as.numeric(State)],
    Date = as_date(Date, origin = ymd("1899-12-30"))
  )
# Join demand, temperatures and holidays
aus_elec <- demands %>%
  left_join(temperatures, by = c("State", "Date", "Period")) %>%
  transmute(
    State,
    Time = as.POSIXct(Date + minutes((Period-1) * 30)),
    Period,
    Date = as_date(Time),
    DOW = wday(Date, label=TRUE),
    Demand = OperationalLessIndustrial, 
    Temperature = Temp,
  ) %>%
  left_join(holidays, by = c("State", "Date")) %>%
  replace_na(list(Holiday = FALSE))
# Remove duplicates and create a tsibble
aus_elec <- aus_elec %>%
  filter(!are_duplicated(aus_elec, index=Time, key=State)) %>%
  as_tsibble(index = Time, key=State)

This block of code reads in raw data files containing holiday information, temperatures and electricity demand for each state, and then joins them into a single tsibble. For some reason, there are duplicated rows from South Australia, so the last few lines remove the duplicates before forming a tsibble, keyed by State.
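
If you want to inspect the offending rows yourself (a small aside, not part of the original code), tsibble's duplicates() returns them when run on the joined data frame before the de-duplication step:

# Show the duplicated (South Australian) observations for the same index and key
aus_elec %>%
  duplicates(index = Time, key = State)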

aus_elec
## # A tsibble: 1,155,408 x 8 [30m] 
## # Key:       State [5]
##    State Time                Period Date       DOW   Demand Temperature
##                                   
##  1 NSW   2002-01-01 00:00:00      1 2002-01-01 Tue    5714.        26.3
##  2 NSW   2002-01-01 00:30:00      2 2002-01-01 Tue    5360.        26.3
##  3 NSW   2002-01-01 01:00:00      3 2002-01-01 Tue    5015.        26.3
##  4 NSW   2002-01-01 01:30:00      4 2002-01-01 Tue    4603.        26.3
##  5 NSW   2002-01-01 02:00:00      5 2002-01-01 Tue    4285.        26.3
##  6 NSW   2002-01-01 02:30:00      6 2002-01-01 Tue    4075.        26.3
##  7 NSW   2002-01-01 03:00:00      7 2002-01-01 Tue    3943.        26.3
##  8 NSW   2002-01-01 03:30:00      8 2002-01-01 Tue    3884.        26.3
##  9 NSW   2002-01-01 04:00:00      9 2002-01-01 Tue    3878.        26.3
## 10 NSW   2002-01-01 04:30:00     10 2002-01-01 Tue    3838.        26.3
## # … with 1,155,398 more rows, and 1 more variable: Holiday 

This data set contains half-hourly data from all states from 1 January 2002 – 1 March 2015 (and in the case of Queensland to 1 April 2015). The temperature variable is from a weather station in the capital city of each state.

 

GEFCOM 2017

The Global Energy Forecasting Competition in 2017 involved data on hourly zonal loads of ISO New England from March 2003 to April 2017. The data have already been packaged into tibble format by Cameron Roach in the gefcom2017data GitHub repository, so it is relatively easy to convert this to a tsibble.

devtools::install_github("camroach87/gefcom2017data")
library(gefcom2017data)
gefcom2017 <- gefcom %>% 
  ungroup() %>%
  as_tsibble(key=zone, index=ts)
gefcom2017
## # A tsibble: 1,241,710 x 15 [1h] 
## # Key:       zone [10]
##    ts                  zone  demand drybulb dewpnt date        year month
##                                
##  1 2003-03-01 00:00:00 CT      3386      25     19 2003-03-01  2003 Mar  
##  2 2003-03-01 01:00:00 CT      3258      23     18 2003-03-01  2003 Mar  
##  3 2003-03-01 02:00:00 CT      3189      22     18 2003-03-01  2003 Mar  
##  4 2003-03-01 03:00:00 CT      3157      22     19 2003-03-01  2003 Mar  
##  5 2003-03-01 04:00:00 CT      3166      23     19 2003-03-01  2003 Mar  
##  6 2003-03-01 05:00:00 CT      3255      23     20 2003-03-01  2003 Mar  
##  7 2003-03-01 06:00:00 CT      3430      24     20 2003-03-01  2003 Mar  
##  8 2003-03-01 07:00:00 CT      3684      24     20 2003-03-01  2003 Mar  
##  9 2003-03-01 08:00:00 CT      3977      25     21 2003-03-01  2003 Mar  
## 10 2003-03-01 09:00:00 CT      4129      27     22 2003-03-01  2003 Mar  
## # … with 1,241,700 more rows, and 7 more variables: hour ,
## #   day_of_week , day_of_year , weekend ,
## #   holiday_name , holiday , trend 

Details of the data (and the competition) are available on Tao Hong’s website.


Comparing Ensembl GTF and cDNA

[This article was first published on R on From System to System, and kindly contributed to R-bloggers].

It seems that most people think Ensembl’s GTF file and cDNA fasta file mean the same transcripts:

Watch out! @ensembl‘s Fasta and GTF annotation files available via https://t.co/2AhCSnL7py do not match (there are transcripts in the GTF not found in the Fasta file. Anyone else expected them to match?

— K. Vitting-Seerup (@KVittingSeerup) August 13, 2018

However, my colleagues Joseph Min and Sina Booeshaghi found that for several species, Ensembl’s GTF file and cDNA fasta file do not have the same set of transcripts, so using the cDNA file as a reference for pseudoaligning RNA-seq reads is not the same as extracting the transcript sequences from the genome with the GTF file. But how exactly does the GTF annotation differ from the cDNA? This isn’t very clear on the Ensembl website. In this blog post, I’ll answer the following questions:

  • What kind of genes do those non-overlapping transcripts belong to?
  • For the transcripts present in both, do the GTF annotation and the cDNA fasta file mean the same sequences?

For now, I will analyze Ensembl’s human genome annotations; I suspect that the same rule applies to other species as well, especially vertebrates.

library(tidyverse)
library(VennDiagram)
library(biomartr)
library(ggpubr)
library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)
library(plyranges)
library(GenomeInfoDb)
library(GenomicFeatures)
library(BUSpaRse)
library(here)
library(scales)
source(here("code", "plotting.R")) # See GitHub repo of this blog

# Download cDNA fasta file
if (!file.exists(here("reference", "hs_cdna99.fa.gz"))) {
  download.file("ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
                destfile = here("reference", "hs_cdna99.fa.gz"))
}

# Download GTF file
gtf_fn <- getGTF(db = "ensembl", organism = "Homo sapiens", path = here("reference"))
#> Starting gtf retrieval of 'Homo sapiens' from ensembl ...
#> 
#> File /Users/lambda/Documents/fs2s/reference/Homo_sapiens.GRCh38.99_ensembl.gtf.gz exists already. Thus, download has been skipped.
#> The *.gtf annotation file of 'Homo sapiens' has been downloaded to '/Users/lambda/Documents/fs2s/reference/Homo_sapiens.GRCh38.99_ensembl.gtf.gz' and has been named 'Homo_sapiens.GRCh38.99_ensembl.gtf.gz'.
cdna <- readDNAStringSet(here("reference", "hs_cdna99.fa.gz"))
gtf <- read_gff(gtf_fn)

The sequence names in the Ensembl cDNA fasta file contain genome annotation information, which I’ll compare to the corresponding GTF annotation.

head(names(cdna))
#> [1] "ENST00000434970.2 cdna chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]"                #> [2] "ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]"                #> [3] "ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]"                #> [4] "ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]"#> [5] "ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]"                #> [6] "ENST00000390583.1 cdna chromosome:GRCh38:14:105904497:105904527:-1 gene:ENSG00000211923.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD3-10 description:immunoglobulin heavy diversity 3-10 [Source:HGNC Symbol;Acc:HGNC:5495]"
head(gtf)
#> GRanges object with 6 ranges and 22 metadata columns:#>       seqnames      ranges strand |   source       type     score     phase#>                |      #>   [1]        1 11869-14409      + |   havana       gene            #>   [2]        1 11869-14409      + |   havana transcript            #>   [3]        1 11869-12227      + |   havana       exon            #>   [4]        1 12613-12721      + |   havana       exon            #>   [5]        1 13221-14409      + |   havana       exon            #>   [6]        1 12010-13670      + |   havana transcript            #>               gene_id gene_version   gene_name gene_source#>               #>   [1] ENSG00000223972            5     DDX11L1      havana#>   [2] ENSG00000223972            5     DDX11L1      havana#>   [3] ENSG00000223972            5     DDX11L1      havana#>   [4] ENSG00000223972            5     DDX11L1      havana#>   [5] ENSG00000223972            5     DDX11L1      havana#>   [6] ENSG00000223972            5     DDX11L1      havana#>                             gene_biotype   transcript_id transcript_version#>                                           #>   [1] transcribed_unprocessed_pseudogene                           #>   [2] transcribed_unprocessed_pseudogene ENST00000456328                  2#>   [3] transcribed_unprocessed_pseudogene ENST00000456328                  2#>   [4] transcribed_unprocessed_pseudogene ENST00000456328                  2#>   [5] transcribed_unprocessed_pseudogene ENST00000456328                  2#>   [6] transcribed_unprocessed_pseudogene ENST00000450305                  2#>       transcript_name transcript_source                 transcript_biotype#>                                          #>   [1]                                                         #>   [2]     DDX11L1-202            havana               processed_transcript#>   [3]     DDX11L1-202            havana               processed_transcript#>   [4]     DDX11L1-202            havana               processed_transcript#>   [5]     DDX11L1-202            havana               processed_transcript#>   [6]     DDX11L1-201            havana transcribed_unprocessed_pseudogene#>               tag transcript_support_level exon_number         exon_id#>                           #>   [1]                                                 #>   [2]       basic                        1                    #>   [3]       basic                        1           1 ENSE00002234944#>   [4]       basic                        1           2 ENSE00003582793#>   [5]       basic                        1           3 ENSE00002312635#>   [6]       basic                       NA                    #>       exon_version  protein_id protein_version     ccds_id#>               #>   [1]                                     #>   [2]                                     #>   [3]            1                            #>   [4]            1                            #>   [5]            1                            #>   [6]                                     #>   -------#>   seqinfo: 47 sequences from an unspecified genome; no seqlengths
# Extract transcript ID from fasta sequence name
cdna_tx <- str_extract(names(cdna), "^ENST\\d+")
# Transcript IDs from GTF
gtf_tx <- unique(gtf$transcript_id)
gtf_tx <- gtf_tx[!is.na(gtf_tx)]
length(cdna_tx)
#> [1] 190432
length(gtf_tx)
#> [1] 227818

In total, there are 190432 transcripts in the fasta file, and 227818 in the GTF file.

v <- draw.pairwise.venn(length(gtf_tx), length(cdna_tx),
                        length(intersect(cdna_tx, gtf_tx)),
                        category = c("GTF", "cDNA"),
                        fill = c("purple", "blue"),
                        alpha = c(0.5, 0.5))
grid.draw(v)

grid.newpage()

While most transcripts overlap, a sizable minority don’t.

It would not be so terrible if the transcripts that don’t overlap between the GTF file and cDNA fasta file are all from genes most people don’t care about, such as pseudogenes. Or would those genes be haplotype variants? Is this the case? Here I’ll use Ensembl version 99, which is the most recent as of writing.

Ensembl’s FTP site has README files for each directory. For GTF files, the README file says:

GTF provides access to all annotated transcripts which make up an Ensembl gene set. Annotation is based on alignments of biological evidence (eg. proteins, cDNAs, RNA-seq) to a genome assembly. The annotation dumped here is transcribed and translated from the genome assembly and is not the original input sequence data that we used for alignment. Therefore, the sequences provided by Ensembl may differ from the original input sequence data where the genome assembly is different to the aligned sequence.

For cDNA files, the README says:

These files hold the cDNA sequences corresponding to Ensembl gene predictions. cDNA consists of transcript sequences for actual and possible genes, including pseudogenes, NMD and the like. See the file names explanation below for different subsets of both known and predicted transcripts.

FILE NAMES

The files are consistently named following this pattern: <species>.<assembly>.<sequence type>.<status>.fa.gz

<species>: The systematic name of the species.

<assembly>: The assembly build name.

<sequence type>: cdna for cDNA sequences

<status>:

  • ‘cdna.all’ – the super-set of all transcripts resulting from Ensembl gene predictions (see more below).
  • ‘cdna.abinitio’ – transcripts resulting from ‘ab initio’ gene prediction algorithms such as SNAP and GENSCAN. In general all ‘ab initio’ predictions are solely based on the genomic sequence and do not use other experimental evidence. Therefore, not all GENSCAN or SNAP cDNA predictions represent biologically real cDNAs. Consequently, these predictions should be used with care.

The one I used is Homo_sapiens.GRCh38.cdna.all.fa.gz, not the abinitio one. However, the README doesn’t seem to be clear about how the GTF annotation differs from that in the cDNA fasta file. Here I’ll find out about such differences.

GTF only

gtf_meta <- as.data.frame(gtf[gtf$type == "transcript"])
gtf_meta <- gtf_meta %>%
  mutate(gtf_only = !transcript_id %in% cdna_tx,
         gene_biotype = str_replace_all(gene_biotype, "_", " "))
n_txs <- gtf_meta %>%
  count(gtf_only, gene_biotype)

How many transcripts are there in each gene biotype, and how many transcripts in each biotype are only in the GTF file? For a description of Ensembl gene biotypes, see this page.

plot_bar_patch(n_txs, 3, "gtf_only", "GTF only",                "Number of transcripts in each gene biotype in GTF file")

Proportion of GTF only transcripts in each biotype

p <- ggplot(gtf_meta, aes(fct_reorder(gene_biotype, gtf_only, .fun = mean), fill = gtf_only)) +
  geom_bar(position = "fill", alpha = 0.5) +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0))) +
  scale_fill_discrete(name = "GTF only") +
  coord_flip() + theme_bw() +
  theme(legend.position = "top", legend.justification = c(0, 0.5),
        legend.margin = margin(t = 14), axis.title = element_blank())
# To place title further to the left; will be fixed in ggplot2 devel
annotate_figure(p, fig.lab = "Proportion of GTF only transcripts in each gene biotype in GTF",
                fig.lab.pos = "top.left", fig.lab.size = 14)

It’s now apparent that some transcripts are only present in the GTF file because their biotypes are excluded from the cDNA file. These GTF-only biotypes are non-coding RNAs, except TEC, which stands for To be Experimentally Confirmed. However, Ensembl has a separate fasta file for lncRNA. Some non-coding RNAs are not polyadenylated (e.g. mature miRNAs), which means they are lost during polyA selection prior to RNA-seq. However, some lncRNAs are polyadenylated, and Cell Ranger’s reference does include lincRNA (long intergenic non-coding RNA).
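
To see which biotypes those are (a small aside using the n_txs count table built above):

# GTF-only transcript counts by gene biotype, largest first
n_txs %>%
  filter(gtf_only) %>%
  arrange(desc(n))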

cDNA fasta only

What about cDNA only transcripts? Are they also from specific gene biotypes?

# Extract annotation from fasta sequence names
cdna_meta <- tibble(transcript_id = cdna_tx,
                    cr = str_extract(names(cdna),
                                     "(?<=((chromosome)|(scaffold)):GRCh38:).*?(?=\\s)"),
                    gene_biotype = str_extract(names(cdna), "(?<=gene_biotype:).*?(?=\\s)"),
                    gene_id = str_extract(names(cdna), "(?<=gene:).*?(?=\\.)"),
                    gene_symbol = str_extract(names(cdna), "(?<=gene_symbol:).*?(?=\\s)"),
                    cdna_only = !transcript_id %in% gtf_tx) %>%
  separate(cr, into = c("seqnames", "start", "end", "strand"), sep = ":") %>%
  mutate(start = as.integer(start),
         end = as.integer(end),
         strand = case_when(
           strand == "1" ~ "+",
           strand == "-1" ~ "-",
           TRUE ~ "*"
         ),
         gene_biotype = str_replace_all(gene_biotype, "_", " "))
head(cdna_meta)
#> # A tibble: 6 x 9#>   transcript_id seqnames  start    end strand gene_biotype gene_id gene_symbol#>                                       #> 1 ENST00000434… 14       2.24e7 2.24e7 +      TR D gene    ENSG00… TRDD2      #> 2 ENST00000415… 14       2.24e7 2.24e7 +      TR D gene    ENSG00… TRDD1      #> 3 ENST00000448… 14       2.24e7 2.24e7 +      TR D gene    ENSG00… TRDD3      #> 4 ENST00000631… CHR_HSC… 1.43e8 1.43e8 +      TR D gene    ENSG00… TRBD1      #> 5 ENST00000632… 7        1.43e8 1.43e8 +      TR D gene    ENSG00… TRBD1      #> 6 ENST00000390… 14       1.06e8 1.06e8 -      IG D gene    ENSG00… IGHD3-10   #> # … with 1 more variable: cdna_only 
n_txs_cdna <- cdna_meta %>%   count(cdna_only, gene_biotype)

Number of transcripts in each biotype and number within each biotype that is only in the fasta file

plot_bar_patch(n_txs_cdna, 3, col_fill = "cdna_only", name = "cDNA only",                title = "Number of transcripts in each gene biotype in cDNA fasta")

Proportion of transcripts that are only in the fasta file in each biotype

p <- ggplot(cdna_meta, aes(fct_reorder(gene_biotype, cdna_only, .fun = mean), fill = cdna_only)) +
  geom_bar(position = "fill", alpha = 0.5) +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0))) +
  scale_fill_discrete(name = "cDNA only") +
  coord_flip() + theme_bw() +
  theme(legend.position = "top", legend.justification = c(0, 0.5),
        legend.margin = margin(t = 14), axis.title = element_blank())
annotate_figure(p,
                fig.lab = "Proportion of cDNA only transcripts in each gene biotype in cDNA fasta",
                fig.lab.pos = "top.left", fig.lab.size = 14)

Apparently, cDNA fasta only transcripts are not specific to a particular biotype.

Chromosomes

chrs <- c(as.character(1:22), "X", "Y", "MT")

Gene annotations often contain information on not only the chromosomes, but also scaffolds.

seqlevels(gtf)
#>  [1] "1"          "2"          "3"          "4"          "5"         #>  [6] "6"          "7"          "8"          "9"          "10"        #> [11] "11"         "12"         "13"         "14"         "15"        #> [16] "16"         "17"         "18"         "19"         "20"        #> [21] "21"         "22"         "X"          "Y"          "MT"        #> [26] "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000213.1"#> [31] "GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000225.1"#> [36] "KI270442.1" "KI270711.1" "KI270713.1" "KI270721.1" "KI270726.1"#> [41] "KI270727.1" "KI270728.1" "KI270731.1" "KI270733.1" "KI270734.1"#> [46] "KI270744.1" "KI270750.1"

The GL* and KI* things are scaffolds, which are regions not assembled into chromosomes. Genomes, such as BSgenome.Hsapiens.UCSC.hg38 and Ensembl’s top level genome (Homo_sapiens.GRCh38.dna.toplevel.fa.gz, downloaded by biomartr::getGenome), also contain haplotype information. Sometimes multiple Ensembl IDs correspond to the same gene symbol, as those Ensembl IDs correspond to different haplotypes. In contrast, Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz does not have the scaffolds and haplotypes.

Are the non-overlapping transcripts only on haplotypes or scaffolds?

gtf_meta %>%
  mutate(seqname_type = case_when(
    seqnames %in% chrs ~ "chromosome",
    str_detect(seqnames, "^CHR_") ~ "haplotype",
    TRUE ~ "scaffold"
  )) %>%
  ggplot(aes(fct_rev(seqnames), fill = gtf_only)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  scale_fill_discrete(name = "GTF only") +
  coord_flip() +
  facet_wrap(~ seqname_type, scales = "free", ncol = 1, strip.position = "right") +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0.1)),
                     breaks = pretty_breaks(n = 7)) +
  labs(title = "Number of transcripts for each seqname for GTF") +
  theme(legend.position = "top", legend.justification = c(0, 0.5),
        axis.title = element_blank())

Apparently GTF only transcripts are not specific to scaffolds or chromosomes, though some scaffolds have a small number of genes, all of which are GTF only. What about in the cDNA file?

cdna_meta <- cdna_meta %>%
  mutate(
    seqname_type = case_when(
      seqnames %in% chrs ~ "chromosome",
      str_detect(seqnames, "^CHR") ~ "haplotype",
      TRUE ~ "scaffold"
    ),
    seqnames = fct_relevel(seqnames, c(chrs, setdiff(unique(seqnames), chrs) %>% sort()))
  )
p <- ggplot(cdna_meta, aes(fct_rev(seqnames), fill = cdna_only)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip() +
  scale_fill_discrete(name = "cDNA only") +
  facet_wrap(~ seqname_type, scales = "free", ncol = 1, strip.position = "right") +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0.1)),
                     breaks = pretty_breaks(n = 7)) +
  theme(legend.position = "top", legend.justification = c(0, 0.5),
        legend.margin = margin(t = 14), axis.title = element_blank())
annotate_figure(p, fig.lab = "Number of transcripts for each seqname for cDNA fasta",
                fig.lab.pos = "top.left", fig.lab.size = 14)

cdna_meta %>%
  count(cdna_only, seqname_type) %>%
  arrange(desc(cdna_only), desc(n)) %>%
  knitr::kable()
cdna_only   seqname_type   n
TRUE        haplotype      18143
FALSE       chromosome     172246
FALSE       scaffold       43

There are hundreds of haplotypes here. All the cDNA-only transcripts are on haplotypes. As haplotypes can confuse alignment, for the purpose of aligning RNA-seq reads to the genome, the haplotypes are better excluded.
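
As a hedged sketch of that clean-up (not code from the original post): the haplotype sequences can be identified from the CHR_ prefix in their coordinates and dropped before building a pseudoalignment index. The output file name is illustrative.

# Keep only cDNA sequences that are not on a CHR_* haplotype
cdna_primary <- cdna[!str_detect(names(cdna), "GRCh38:CHR_")]
length(cdna_primary)
# writeXStringSet(cdna_primary, here("reference", "hs_cdna99_primary.fa.gz"),
#                 compress = TRUE)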

How about the transcripts shared between GTF and cDNA? Do those two sources mean the same sequence for the same transcript?

inter <- gtf_meta %>%   inner_join(cdna_meta, by = c("gene_id", "transcript_id", "seqnames"))
#> Warning: Column `seqnames` joining factors with different levels, coercing to
#> character vector

Do the GTF and cDNA files place the same transcripts at the same genomic ranges?

all.equal(inter$start.x, inter$start.y)
#> [1] TRUE
all.equal(inter$end.x, inter$end.y)
#> [1] TRUE
all.equal(as.character(inter$strand.x), as.character(inter$strand.y))
#> [1] TRUE
all.equal(inter$gene_biotype.x, inter$gene_biotype.y)
#> [1] TRUE

So the genomic ranges, strand, and gene biotypes do match. However, this is just for transcripts; exon annotations are absent from the sequence names of the cDNA fasta file. Are the exons also the same?

unique(inter$seqnames)
#>  [1] "1"          "2"          "3"          "4"          "5"         #>  [6] "6"          "7"          "X"          "8"          "9"         #> [11] "11"         "10"         "12"         "13"         "14"        #> [16] "15"         "16"         "17"         "18"         "20"        #> [21] "19"         "Y"          "22"         "21"         "MT"        #> [26] "KI270728.1" "KI270727.1" "GL000009.2" "GL000194.1" "GL000205.2"#> [31] "GL000195.1" "GL000219.1" "KI270734.1" "GL000213.1" "GL000218.1"#> [36] "KI270731.1" "KI270721.1" "KI270726.1" "KI270711.1" "KI270713.1"

Say we don’t care about the scaffolds. I’ll extract the transcriptome (only for genes also present in the cDNA fasta file) using the GTF file. BSgenome.Hsapiens.UCSC.hg38 denotes chromosomes as something like chr1, while Ensembl just uses 1, so I’ll convert BSgenome.Hsapiens.UCSC.hg38 to Ensembl style.

gn <- BSgenome.Hsapiens.UCSC.hg38
seqlevelsStyle(gn) <- "Ensembl"
# This will discard scaffolds
gl <- BUSpaRse:::subset_annot(gn, gtf)
#> 22 sequences in the annotation absent from the genome were dropped.
#> 430 sequences in the genome are absent from the annotation.
# Only keep overlapping transcripts
gl <- gl[gl$type == "exon" & gl$transcript_id %in% inter$transcript_id]
# Exons are already sorted in ascending order in the GTF file, even for minus strand genes
# Need to sort if not already sorted
gl <- split(gl, gl$transcript_id)
# Extract transcriptome
tx_gtf <- extractTranscriptSeqs(gn, gl)

cdna_compare <- cdna
names(cdna_compare) <- cdna_meta$transcript_id
# Sort transcripts from the cDNA file, discard scaffolds
cdna_compare <- cdna_compare[names(tx_gtf)]

From the cDNA fasta:

cdna_compare
#>   A DNAStringSet instance of length 172246
#>           width seq                                         names
#>      [1]   1032 CTGCTGCTGCTGCGCCCCAT...TAAATTTGCTGTGGTTTGTA ENST00000000233
#>      [2]   2450 AGAGTGGGGCACAGCGAGGC...TAAAAAACAAACAAAACATA ENST00000000412
#>      [3]   2274 GTCAGCTGGAGGAAGCGGAG...TATAATACCGAGCTCAAAAA ENST00000000442
#>      [4]   3715 CCTACCCCAGCTCTCGCGCC...GTGAGGATGTTTTGTTAAAA ENST00000001008
#>      [5]   4732 AGGCAATTTTTTTCCTCCCT...AATAAACCGTGGGGACCCGC ENST00000001146
#>      ...    ... ...
#> [172242]   4105 TAGATGTAACCCTGAGTGAA...AATCACAATTCTGCTAATGT ENST00000674151
#> [172243]   1374 AGGCTGATAAAATACCAGTA...TGAGCACGATGATGATGCAA ENST00000674152
#> [172244]   2789 CCTGCGCAGAGTCTGCGGAG...AAAATGAGCAAAAGTTGATC ENST00000674153
#> [172245]   8288 ATGGCCGAGAATGTGGTGGA...TAAACTGTGTGAGACAGACA ENST00000674155
#> [172246]    898 TCTCTGGATATGAGGCAGGA...ACTCAATTTGTTATTCAAAA ENST00000674156

Sequences extracted from genome with GTF file:

tx_gtf
#>   A DNAStringSet instance of length 172246
#>           width seq                                         names
#>      [1]   1032 CTGCTGCTGCTGCGCCCCAT...TAAATTTGCTGTGGTTTGTA ENST00000000233
#>      [2]   2450 AGAGTGGGGCACAGCGAGGC...TAAAAAACAAACAAAACATA ENST00000000412
#>      [3]   2274 GTCAGCTGGAGGAAGCGGAG...TATAATACCGAGCTCAAAAA ENST00000000442
#>      [4]   3715 CCTACCCCAGCTCTCGCGCC...GTGAGGATGTTTTGTTAAAA ENST00000001008
#>      [5]   4732 AGGCAATTTTTTTCCTCCCT...AATAAACCGTGGGGACCCGC ENST00000001146
#>      ...    ... ...
#> [172242]   4105 TAGATGTAACCCTGAGTGAA...AATCACAATTCTGCTAATGT ENST00000674151
#> [172243]   1374 AGGCTGATAAAATACCAGTA...TGAGCACGATGATGATGCAA ENST00000674152
#> [172244]   2789 CCTGCGCAGAGTCTGCGGAG...AAAATGAGCAAAAGTTGATC ENST00000674153
#> [172245]   8288 ATGGCCGAGAATGTGGTGGA...TAAACTGTGTGAGACAGACA ENST00000674155
#> [172246]    898 TCTCTGGATATGAGGCAGGA...ACTCAATTTGTTATTCAAAA ENST00000674156

Do the transcript sequences at least have the same lengths?

all.equal(width(tx_gtf), width(cdna_compare))
#> [1] TRUE

Are the sequences the same? Since I don’t care how the sequences are different if they are different, no alignment is needed.

all(pcompare(tx_gtf, cdna_compare) == 0)
#> [1] TRUE

Yes, the sequences are the same.

The GTF file contains annotations for non-coding RNAs, while the cDNA fasta file does not. The cDNA file contains haplotypes, while the GTF file does not. For pseudoalignment of RNA-seq reads from polyA selected techniques, non-coding RNAs in the GTF file probably aren’t so important, unless you do care about polyadenylated lncRNAs, so it’s fine to use the cDNA fasta file, but we should remove the haplotypes as they may cause confusion in alignment. However, if you are interested in non-coding RNAs, then download the ncRNA fasta file from Ensembl or extract the transcriptome with the GTF file. We’ve also got example R code here to filter by gene biotypes and to extract transcriptome from the genome with the GTF file.
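
As a rough, simplified sketch of the biotype-filtering option (the linked repository has the actual code; the output file name here is illustrative):

# Keep only protein-coding transcripts from the cDNA fasta and drop haplotypes,
# then write a filtered fasta for indexing
keep <- str_detect(names(cdna), "gene_biotype:protein_coding") &
  !str_detect(names(cdna), "GRCh38:CHR_")
writeXStringSet(cdna[keep], here("reference", "hs_cdna99_protein_coding.fa.gz"),
                compress = TRUE)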

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Catalina 10.15.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#>  [1] stats4    parallel  grid      stats     graphics  grDevices utils
#>  [8] datasets  methods   base
#> 
#> other attached packages:
#>  [1] rlang_0.4.4                       scales_1.1.0
#>  [3] here_0.1                          BUSpaRse_1.0.0
#>  [5] GenomicFeatures_1.38.1            AnnotationDbi_1.48.0
#>  [7] Biobase_2.46.0                    plyranges_1.6.8
#>  [9] BSgenome.Hsapiens.UCSC.hg38_1.4.1 BSgenome_1.54.0
#> [11] rtracklayer_1.46.0                Biostrings_2.54.0
#> [13] XVector_0.26.0                    GenomicRanges_1.38.0
#> [15] GenomeInfoDb_1.22.0               IRanges_2.20.2
#> [17] S4Vectors_0.24.3                  BiocGenerics_0.32.0
#> [19] ggpubr_0.2.4                      magrittr_1.5
#> [21] biomartr_0.9.2                    VennDiagram_1.6.20
#> [23] futile.logger_1.4.3               forcats_0.4.0
#> [25] stringr_1.4.0                     dplyr_0.8.4
#> [27] purrr_0.3.3                       readr_1.3.1
#> [29] tidyr_1.0.2                       tibble_2.1.3
#> [31] ggplot2_3.2.1                     tidyverse_1.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] colorspace_1.4-1            ggsignif_0.6.0
#>  [3] ellipsis_0.3.0              rprojroot_1.3-2
#>  [5] fs_1.3.1                    rstudioapi_0.10
#>  [7] farver_2.0.3                bit64_0.9-7
#>  [9] fansi_0.4.1                 lubridate_1.7.4
#> [11] xml2_1.2.2                  knitr_1.27
#> [13] zeallot_0.1.0               jsonlite_1.6.1
#> [15] Rsamtools_2.2.1             broom_0.5.4
#> [17] dbplyr_1.4.2                compiler_3.6.2
#> [19] httr_1.4.1                  backports_1.1.5
#> [21] assertthat_0.2.1            Matrix_1.2-18
#> [23] lazyeval_0.2.2              cli_2.0.1
#> [25] formatR_1.7                 htmltools_0.4.0
#> [27] prettyunits_1.1.1           tools_3.6.2
#> [29] gtable_0.3.0                glue_1.3.1
#> [31] GenomeInfoDbData_1.2.2      rappdirs_0.3.1
#> [33] Rcpp_1.0.3                  cellranger_1.1.0
#> [35] vctrs_0.2.2                 nlme_3.1-144
#> [37] blogdown_0.17               xfun_0.12
#> [39] rvest_0.3.5                 lifecycle_0.1.0
#> [41] ensembldb_2.10.2            XML_3.99-0.3
#> [43] zlibbioc_1.32.0             ProtGenerics_1.18.0
#> [45] hms_0.5.3                   SummarizedExperiment_1.16.1
#> [47] AnnotationFilter_1.10.0     lambda.r_1.2.4
#> [49] yaml_2.2.1                  curl_4.3
#> [51] gridExtra_2.3               memoise_1.1.0
#> [53] biomaRt_2.42.0              stringi_1.4.5
#> [55] RSQLite_2.2.0               highr_0.8
#> [57] BiocParallel_1.20.1         pkgconfig_2.0.3
#> [59] bitops_1.0-6                matrixStats_0.55.0
#> [61] evaluate_0.14               lattice_0.20-38
#> [63] labeling_0.3                GenomicAlignments_1.22.1
#> [65] cowplot_1.0.0               bit_1.1-15.1
#> [67] tidyselect_1.0.0            bookdown_0.17
#> [69] R6_2.4.1                    generics_0.0.2
#> [71] DelayedArray_0.12.2         DBI_1.1.0
#> [73] pillar_1.4.3                haven_2.2.0
#> [75] withr_2.1.2                 RCurl_1.98-1.1
#> [77] modelr_0.1.5                crayon_1.3.4
#> [79] futile.options_1.0.1        utf8_1.1.4
#> [81] BiocFileCache_1.10.2        rmarkdown_2.1
#> [83] progress_1.2.2              readxl_1.3.1
#> [85] data.table_1.12.8           blob_1.2.1
#> [87] reprex_0.3.0                digest_0.6.23
#> [89] openssl_1.4.1               RcppParallel_4.4.4
#> [91] munsell_0.5.0               askpass_1.1



To leave a comment for the author, please follow the link and comment on their blog: R on From System to System.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bio7 3.1 Released


[This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

07.02.2020

A new release of Bio7 is available. This update comes with a plethora of new features, improvements and bugfixes.

R ImageJ Analysis Perspective (MacOSX Catalina, Dark Mode)

For those who don’t know Bio7: the application Bio7 is an integrated development environment for ecological modeling, scientific image analysis and statistical analysis.

It also contains a feature-complete development environment for R with an advanced R editor, R developer tools, and interfaces to perform scientific image analysis with R and the embedded ImageJ application.

New and Noteworthy

General:

  • Bio7 3.1 RCP (Rich Client Platform) built upon Eclipse 4.14
  • Bundled with AdoptOpenJDK 13.0.2
  • Bundled with R 3.6.2 (Windows only!)
  • New main toolbar action available to create different projects and files in one shot
  • Added a toolbar submenu action to open the ‘Control’ view
  • ‘Dark Mode’ theme automatically enabled on supported OS (MacOSX only!)
  • ‘2D’ perspective (e.g., for cellular automata simulations) is now hidden by default.
  • Removed the ‘Image’ perspective and the ‘Points Panel’ view and actions
  • Updated libraries (SSH for Windows, etc.)
  • Improved the ‘WorldWind’, ‘2D’ and ‘3D’ perspectives to open and arrange the editor area in a user-friendly programming layout by default (see screenshots below).

WorldWind Perspective with opened Java file (MacOSX Catalina, Dark Mode)

2D Perspective with opened Java file (MacOSX Catalina, Dark Mode)

  • JavaFX has been disabled by default in the preferences for Linux HighDPI displays (e.g., GTK scale 2) because of an open bug when embedded in an SWT canvas.
  • Improved the default Bio7 CSS interface colors

R:

  • New toolbar wizard to create quickly R, R Markdown and Shiny files in a project
  • New R Packages tab available to show and update installed R packages (see screenshot below)

Load Packages Tab (MacOSX Catalina, Dark Mode)

  • Changed the ‘Image-Methods’ view to show only R/ImageJ related actions (bidirectional transfer of image data, selection data, particle and cluster methods, etc.)
  • Added the ‘Geometry Selections’ view to the ‘R ImageJ Analysis’ perspective to transfer ROI selection coordinates as a list or georeferenced data geometries (using package ‘sp’ and Java GDAL)
  • Added the KMeans cluster algorithm to the ‘Cluster Pic’ action in the ‘Image-Methods’ view
  • The ‘R ImageJ Analysis’ perspective is now visible by default
  • Improved the ‘Pixel RM’ and ‘Pixel RM Stack’ action to transfer all ROI selections more efficiently, especially for VirtualStacks of ImageJ
  • Added a new option in the ‘Pixel RM’ and ‘Pixel RM Stack’ action to transfer ROI Manager selections signature names according to a separator char (handy for supervised classification with ImageJ and R)
  • Added a new stack transfer API method (‘imageFeatureStackToR’) to transfer whole stacks in a column layout (for pixel classification)
  • The ‘R-Shell’ view list now scrolls to the last selected variables after a workspace refresh
  • New R editor ‘Quickfixes’ available (warning for an assignment operator in function call, check for null and na in comparison operators)
  • Improved the code completion of the ‘R-Shell’ view and the R editor
  • Improved the plot layout in ImageJ (when using the preference option ‘ImageJ View Display Size’ in the Rserve plot preferences)
  • Added a Shiny timeout to improve the server startup
  • Updated the R ANTLR parser

ImageJ:

  • Updated ImageJ to version 1.52t(.51)
  • Added a new ImageJ macro debugger interface
  • Added a new thumbnail action to the context menu of the ‘Navigator’ view to open image and LUT files of a selected directory
  • Added a ‘Detach All Images’ action to transfer all opened tab images to a view (which can be moved and arranged like a separate window)
  • Improved the dark theme for all AWT components
  • Added new default macro editor templates
  • Improved the ImageJ plot windows
  • Improved the detached image view layout
  • Improved the canvas layout for HighDPI displays on Windows
  • A huge number of other changes since the last release. For all changes, see: https://github.com/Bio7/EclipseImageJ1Plugin

Java:

  • Dynamic compilation now supports Java 13 by default
  • Added a new option to dynamically compile a class without a constructor call (without object creation). The object has to be created in the main method (as known from Java).
  • Updated the Java OpenGL (JOGL) 3D interfaces
  • Updated Java WorldWind to the latest version
  • Fixed the OpenGL view visibility on MacOSX (the ‘3D’ and ‘WorldWind’ view stayed visible when switching to another perspective)
  • Added an experimental fullscreen option to open WorldWind and the ‘3D’ view on a secondary monitor (press F3)
  • Updated Java libraries

Download and Installation:

Windows:

Just download the *.zip distribution file from https://bio7.org and unzip it in your preferred location. Bio7 comes bundled with a Java Runtime Environment, R and Rserve distribution and works out of the box.

Linux:

Download and extract the installation file from https://bio7.org. For Linux you have to install R and Rserve (see Rserve installation below!).

MacOSX:

Download and extract the installation file from https://bio7.org.

If you start Bio7, a warning or error can occur because of changes in how Apple treats signatures! To allow Bio7 to start, see these instructions for Yosemite, Sierra and Mojave:

First try to open the app with the context menu to allow the execution. If that doesn’t work, try the following:

Yosemite: Open an app from an unidentified developer

Sierra: Open an app from an unidentified developer

Mojave and High Sierra: How to fix “Application” is damaged and can’t be opened error in macOS Mojave and High Sierra.

In addition for MacOSX you have to install R and Rserve (see below!).

Linux and MacOSX Rserve (compiled for cooperative mode) installation:

To install Rserve, open the ‘Native R’ console in the ‘Console’ view and then execute the view menu action “Options -> Install Rserve (coop. mode) for R …” for different R versions (SSL 1.1 version for Linux Ubuntu > 19.10). This will download and install Rserve in your default R library location; see the video below (please make sure that your default Linux R library install location has write permissions!):

How to install Rserve for Linux and MacOSX: https://youtu.be/tF7HbRBRIF

In cooperative mode only one connection at a time is allowed (which is what we want for this desktop application) and all subsequent connections share the same namespace (default on Windows)!

Bio7 Documentation

For more information about Bio7 please consult the soon-to-be-updated Bio7 User Guide.

A plethora of Bio7 video tutorials for an introduction can be found on YouTube.


To leave a comment for the author, please follow the link and comment on their blog: R – Bio7 Website.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Generating correlation matrix for AR(1) model


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Assume that we are in the time series data setting, where we have data at equally-spaced times 1, 2, \dots which we denote by random variables X_1, X_2, \dots. The AR(1) model, commonly used in econometrics, assumes that the correlation between X_i and X_j is \text{Cor}(X_i, X_j) = \rho^{|i-j|}, where \rho is some parameter that usually has to be estimated.

If we were writing out the full correlation matrix for n consecutive data points X_1, \dots, X_n, it would look something like this:

\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} &\dots & 1 \end{pmatrix}

(Side note: This is an example of a correlation matrix which has Toeplitz structure.)

Given \rho, how can we generate this matrix quickly in R? The function below is my (current) best attempt:

ar1_cor <- function(n, rho) {
  exponent <- abs(matrix(1:n - 1, nrow = n, ncol = n, byrow = TRUE) -
                    (1:n - 1))
  rho^exponent
}

In the function above, n is the number of rows in the desired correlation matrix (which is the same as the number of columns), and rho is the \rho parameter. The function makes use of the fact that when subtracting a vector from a matrix, R automatically recycles the vector to have the same number of elements as the matrix, and it does so in a column-wise fashion.

Here is an example of how the function can be used:

ar1_cor(4, 0.9)
#       [,1] [,2] [,3]  [,4]
# [1,] 1.000 0.90 0.81 0.729
# [2,] 0.900 1.00 0.90 0.810
# [3,] 0.810 0.90 1.00 0.900
# [4,] 0.729 0.81 0.90 1.000

Such a function might be useful when trying to generate data that has such a correlation structure. For example, it could be passed as the Sigma parameter for MASS::mvrnorm(), which generates samples from a multivariate normal distribution.
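A minimal sketch of that use, with purely illustrative numbers:

library(MASS)

# 100 draws from a 4-dimensional normal whose correlation matrix has
# AR(1) structure with rho = 0.9; cor(x) should be close to Sigma.
Sigma <- ar1_cor(4, 0.9)
x <- mvrnorm(n = 100, mu = rep(0, 4), Sigma = Sigma)
cor(x)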

Can you think of other ways to generate this matrix?
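One compact possibility, leaning on the Toeplitz structure noted above, is stats::toeplitz(), which builds a symmetric Toeplitz matrix from its first row:

ar1_cor_toeplitz <- function(n, rho) toeplitz(rho^(0:(n - 1)))
all.equal(ar1_cor_toeplitz(4, 0.9), ar1_cor(4, 0.9))  # TRUE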


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Three ways to calculate distances in R


[This article was first published on Bluecology blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Calculating a distance on a map sounds straightforward, but it can be confusing how many different ways there are to do this in R.

This complexity arises because there are different ways of defining ‘distance’ on the Earth’s surface.

The Earth is spherical. So do you want to calculate distances around the sphere (‘great circle distances’) or distances on a map (‘Euclidean distances’)?

Then there are barriers. For example, for distances in the ocean, we often want to know the nearest distance around islands.

Then there is the added complexity of the different spatial data types. Here we will just look at points, but these same concepts apply to other data types, like shapes.

Example data

Let’s look at some example data. It is just a series of points across the island of Tasmania. We are going to calculate how far apart these points are from each other.

We’ll use sf for spatial data and tmap for mapping.
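The code below assumes two sf objects that the original post builds from its example data: stas (Tasmania polygons) and pts (three labelled points with a "pt" column). Only the package setup is sketched here:

library(sf)    # simple features classes and the st_* functions used below
library(tmap)  # thematic mapping (tm_shape(), tm_polygons(), ...)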

Here’s a map:

tm_shape(stas) +
  tm_polygons() +
  tm_graticules(col = "grey60") +
  tm_shape(pts) +
  tm_symbols(col = "black") +
  tm_scale_bar(position = c("left", "bottom")) +
  tm_shape(pts) +
  tm_text("pt", ymod = -1)

Note I’ve included a scale bar, but of course the distance between longitude lines gets closer at higher latitudes.

Great circle distances

The first method is to calculate great circle distances, that account for the curvature of the earth. If we use st_distance() with unprojected coordinates (ie in lon-lat) then we get great circle distances (in metres).

m <- st_distance(pts)
m/1000
## Units: [m]
##          [,1]     [,2]      [,3]
## [1,]    0.000 821.5470 1200.7406
## [2,]  821.547   0.0000  419.5004
## [3,] 1200.741 419.5004    0.0000

The matrix m gives the distances between points (we divided by 1000 to get distances in km).

Euclidean distances

Another option is to first project the points to a projection that preserves distances and then calculate the distances. This option is computationally faster, but can be less accurate, as we will see.

We will use the local UTM projection. So you can see what this looks like, we will project the land too.

tas_utm <- st_crs("+proj=utm +zone=55 +datum=WGS84 +units=m +no_defs")
stas2 <- st_transform(stas, crs = tas_utm)
pts2 <- st_transform(pts, crs = tas_utm)
tm_shape(stas2) +
  tm_polygons() +
  tm_graticules(col = "grey60") +
  tm_shape(pts2) +
  tm_symbols(col = "black") +
  tm_scale_bar(position = c("left", "bottom")) +
  tm_shape(pts) +
  tm_text("pt", ymod = -1)

Note how it now bends the lat/long lines. This happens because we are projecting a sphere onto a flat surface. The UTM will be most accurate at the centre of its zone (we used Zone 55 which is approximately centred on Tasmania).

If we were interested in mapping the mainland of Australia accurately, we’d use a different UTM zone.

Now we can calculate Euclidean distances:

m2 <- st_distance(pts2)
m2/1000
## Units: [m]
##           [,1]     [,2]      [,3]
## [1,]    0.0000 824.8996 1203.6228
## [2,]  824.8996   0.0000  419.4163
## [3,] 1203.6228 419.4163    0.0000

Compare these to our great circle distances:

m/1000
## Units: [m]
##          [,1]     [,2]      [,3]
## [1,]    0.000 821.5470 1200.7406
## [2,]  821.547   0.0000  419.5004
## [3,] 1200.741 419.5004    0.0000

Note the slight differences, particularly between point 1 and the other points. The first method (great circle) is the more accurate one, but is also a bit slower. The Euclidean distances become a bit inaccurate for point 1, because it is so far outside the zone of the UTM projection.

Points 2 & 3 are within the UTM zone, so the distance between these points is almost identical to the great circle calculation.

Distances around a barrier

The basic idea here is that we turn the data into a raster grid and then use the gridDistance() function to calculate distances around barriers (land) between points.

So first we need to rasterize the land. The package fasterize has a fast way to turn sf polygons into land:

library(fasterize)
library(raster)
library(dplyr)
r <- raster(extent(stas2), nrows = 50, ncols = 50)
rtas <- fasterize(summarize(stas2), r)

I made the raster pretty blocky (50 x 50). You could increase the resolution to improve the accuracy of the distance measurements. Here’s how it looks:

Now we need to identify the raster cells where the points fall. We do this by extracting coordinates from pts2 and asking for their unique raster cell numbers:

rtas_pts <- rtas
xy <- st_coordinates(pts2)
icell <- cellFromXY(rtas, xy)

Now, we set the cells of our raster corresponding to the points to a different number than the rest. I will just use the 3rd point (if we used all points then we get nearest distance around barriers to any point).

rtas_pts[icell[3]] <- 2

This will look like the same raster, but with a spot where the 3rd point fell (note red box):

Now just run gridDistance telling it to calculate distances from the cells with a value of 2 (just one cell in this case) and omit values of 1 (land) when doing the distances:

d <- gridDistance(rtas_pts, origin = 2,
                  omit = 1)/1000

This will be slow for larger rasters (or very high res). Let’s see how it looks:

Colours correspond to distances from point 3 (the location we gave a value of ‘2’ to in the raster).

Now we can just ask for the distance values at the cells of the other points:

d[icell]
## [1] 1310.5141  612.1404    0.0000

So 612 km around Tasmania from point 3 to 2, as the dolphin swims. It was only 419 km if we could fly straight over Tasmania:

m[2,3]/1000
## 419.5004 [m]

(note it says metres, but that is because R doesn’t know we’ve divided by 1000)


To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RProtoBuf 0.4.15: One fix, some updates, deprecation coming


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new release 0.4.15 of RProtoBuf just arrived at CRAN. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed very widely in numerous projects as a language and operating-system agnostic protocol.

This release contains a small bug fix for repeated messages and groups. While making changes, I used the opportunity to change the unit testing framework to the excellent and lightweight tinytest package permitting, among other things, tests of the installed package, and also simplified the build by using pre-made pdf vignettes. A list of changes follows below.

As one heads-up and deprecation message, we are planning to remove the (entirely unused as best as we can tell, and minimal) remote procedure call feature. Protocol Buffers itself always lacked this, but eventually gRPC arrived to fill that void. All use cases should rely on it. So we prepared a test release 0.4.15.1 corresponding to the feature/retire_minimal_rpc branch. You can install this dev release from the ghrr drat repo via one of the drat-assisted commands, or directly via install.packages("RProtoBuf", repos="https://ghrr.github.io/drat"). Please do so and test if you suspect that the change may affect you. Otherwise the removal is likely to happen in the next release (but we will strive to wait a couple of months before doing so).
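For instance, one drat-assisted route could look like this (a sketch assuming the drat package is installed):

# Add the ghrr drat repository for the session, then install from it.
drat::addRepo("ghrr")
install.packages("RProtoBuf")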

Changes in RProtoBuf version 0.4.15 (2020-02-08)

  • Repeated Message and Group objects are now returned correctly (Dirk in #64 fixing #62).

  • The unit tests are now run by tinytest (Dirk in #65).

  • The vignettes are now included pre-made (Dirk in #67).

CRANberries provides the usual diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, and the pre-print of our JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


New improved cdata instructional video


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We have a new improved version of the “how to design a cdata/data_algebra data transform” up!

The original article, the Python example, and the R example have all been updated to use the new video.

Please check it out!


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Alternatives for retrieving sensor data from Arduino compatible microcontrollers into R


[This article was first published on R on Andres' Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Have you ever wanted to get your own real-world data using sensors? With the advent of the Open-Source Hardware movement, this has become more accessible to the general public: microcontrollers and sensors have become cheaper and easier to program with human-friendly languages like Arduino (C++) and MicroPython (Python), so now anybody can produce their own real-world sensor data. But there is still a problem: how do we get this data into R so we can analyze it?

Depending on your application’s requirements you could choose among a plethora of options. For example, if you only need data in batches and you can have easy physical access to your device, you could simply add an SD card module to your project and log the data to a .csv file on an SD card, but that wouldn’t be fun, would it? A much cooler solution would be to retrieve data remotely, and preferably wirelessly, don’t you think? Let’s explore some options.

Serial Connection

If you have never thought about getting data from microcontrollers before, I bet the first thought that comes to your mind is to somehow directly connect the microcontroller with R to get the data. This is currently the most technically challenging option: it can be done using a serial connection between the microcontroller and your computer, but as far as I know, there is only one R package implementing this, and it only runs on POSIX-compatible systems (so no Windows).

I can’t test this approach myself since I don’t have a physical Linux or macOS machine to connect a microcontroller to, so for this case, I’m just going to link to a blog post by @haozhu233 (the package author) showcasing its functionality.

If your application requires real-time data acquisition or very low latency this might be your only option and you may want to dive deeper into this.

Streaming arduino sensor signals right into @rstudio! I wonder if there is really anything #rstats can’t do 😎😎😎pic.twitter.com/nT6q6TT6xc

— Hao Zhu (@haozhu233) January 25, 2019

Writing to a SQL server

There are two ways of doing this, one is to directly connect the microcontroller to the database and the other is to use another service to pull data from the microcontroller and write it into the database.

The first option is harder to accomplish. As far as I know, there is only one Arduino library that provides a SQL connector, and it is for MySQL servers; there are no ready-made solutions for other SQL servers, and although you could write your own connector for other open-source SQL servers (like PostgreSQL), that would require a considerable amount of technical skill.

The second option is easier to implement, and my personal favorite.

The first step is to broadcast the data from the microcontroller. One way of doing this is to implement a simple web server on the microcontroller itself and serve the data in a suitable format like HTML, JSON or CSV.

This is a simple example of an Arduino Sketch for an ESP8266 board (Wemos D1 mini) and DS18B20 temperature probe sensors, that serves HTML and JSON outputs over a web server.

The second step is to pull the data out of the microcontroller and into a SQL server, there are a lot of approaches and languages you can use to implement this part but obviously we are going to use R. You can write an R script that connects to the microcontroller, retrieves the data and writes it into a SQL server, and schedule the script with a cron job.

#! /usr/bin/env Rscript

# DATA ACQUISITION #############################################################
raw_data <- rjson::fromJSON(file = "http://192.168.0.102/reading.json")
current_time <- as.character(Sys.time())

# PARAMETERS ###################################################################
current_location = "some_place"
point_names = c("probe0" = "measurement_point_0",
                "probe1" = "measurement_point_1",
                "probe2" = "measurement_point_2")

# LIBRARIES ####################################################################
library(dplyr)

# DATA CLEANING ################################################################
tidy_data <- as_tibble(raw_data) %>%
    tidyr::gather(measurement_point, value) %>%
    mutate(time = current_time,
           location = current_location,
           variable = stringr::str_extract(measurement_point, "^.+(?=-)"),
           measurement_point = stringr::str_extract(measurement_point, "(?<=-).+$"),
           measurement_point = stringr::str_replace_all(measurement_point, point_names)
           ) %>%
    select(time, location, variable, measurement_point, value)

# LOAD DATA ####################################################################
connection_string <- glue::glue("Driver={{PostgreSQL ANSI}};\\
                                Uid={Sys.getenv('MY_UID')};\\
                                Pwd={Sys.getenv('MY_PWD')};\\
                                Server=localhost;\\
                                Port=5432;\\
                                Database=sensors;")

con <- odbc::dbConnect(odbc::odbc(),
                       .connection_string = connection_string,
                       encoding = "utf8")

DBI::dbAppendTable(conn = con,
                   name = 'sensors_data',
                   value = tidy_data)

odbc::dbDisconnect(con)

unlink(x = "*.log", force = TRUE)

Now that you have the data in a SQL server, you are one SQL query away from getting it into R.

library(odbc)
library(glue)

connection_string <- glue::glue("Driver={{PostgreSQL ANSI}};\\
  Uid={Sys.getenv('MY_UID')};\\
  Pwd={Sys.getenv('MY_PWD')};\\
  Server={Sys.getenv('MY_REMOTE')};\\
  Port=5432;\\
  Database=sensors;")

con <- dbConnect(drv = odbc::odbc(),
                 .connection_string = connection_string,
                 encoding = "utf8")

query <- "SELECT * FROM public.sensors_data"

raw_data <- dbGetQuery(conn = con,
                       statement = query)

dbDisconnect(con)

What makes this approach my personal favorite is that it is fairly easy to implement, very flexible and allows you to implement almost any security measure you might need.

Using the MQTT protocol

For more resource-constrained applications you can use a similar approach to the previous one, but with a different technology. The MQTT protocol is ideal for low-power devices over slow and unreliable connections; moreover, it is also a way to get data from LoRa networks like the publicly available TTN (The Things Network), which provides an API over MQTT, giving you access to a low-power wide-area network for remote sensor deployment over greater distances.

This approach also allows you to write data to a SQL server as in the previous example, but a more interesting application would be to get instant status updates to show on dashboards or shiny apps. Consider this simple shiny app that receives a message from a simulated remote sensor and updates the color of a circle accordingly. You can update the temperature value of the simulated sensor by sending an MQTT message to the test topic from a system terminal on any computer with the command mosquitto_pub -h test.mosquitto.org -t simulated_sensor -q 1 -m 25.9 -r, and see how the color changes.

✏ In order for this to work you need to have the mosquitto client installed in your system.

sudo apt install mosquitto-clients
library(shiny)
library(ggplot2)
library(rmqtt)
# This package is just a simple wrapper for an MQTT client using
# system() calls, you need to have the mosquitto client installed
# in your system for this to work.

# Initialize Simulated Sensor
mqtt_topic_publish(topic = "simulated_sensor",
                   message_to_send = "24.9",
                   host = "test.mosquitto.org",
                   port = 1883,
                   qos = 1,
                   retain_message = TRUE)

ui <- fluidPage(
    titlePanel("Dashboard"),
    mainPanel(
        h2("Temperature Indicator"),
        plotOutput("status_indicator")
    )
)

server <- function(input, output, session) {

    get_color <- function() {
        message <- mqtt_topic_subscribe(topic = "simulated_sensor",
                                        host = "test.mosquitto.org",
                                        port = 1883,
                                        qos = 1,
                                        intern = TRUE,
                                        num.messages = 1)
        if (as.numeric(message) > 25) {
            color = 'red'
        } else {
            color = 'green'
        }
        return(color)
    }

    pollData <- reactivePoll(100, session,
                             checkFunc = get_color,
                             valueFunc = get_color)

    output$status_indicator <- renderPlot({
        ggplot(data = data.frame(x = 1, y = 1), aes(x, y)) +
            geom_point(size = 50, color = pollData()) +
            scale_x_discrete() +
            scale_y_discrete() +
            theme_void() +
            theme(axis.title = element_blank())
    })
}

shinyApp(ui = ui, server = server)

If you are interested in testing this, here is a GitHub Gist with the Arduino Sketch for an ESP8266 board with a DS18B20 temperature probe sensor that can update the values in the example app.

I know this doesn’t look very interesting right now but consider the implications. You could be getting instant updates in your dashboard from sensors deployed at kilometers downrange and you could even control remote actuators from your shiny app by sending an MQTT message from it with the click of a button!

The downside of this method is that you need to set up an MQTT broker (you should not abuse the publicly available ones that are intended for testing), but it can be done for little or no money on a Raspberry Pi or an EC2 AWS instance.

Disclaimer

I want to make clear that this is by no means an exhaustive list; there are other alternatives and countless variations of the ones presented here, so if none of these fits your particular project’s needs, just keep looking. Very likely someone has come up with a more suitable method.

✏ If you want to see a detailed step by step example of any of these options, write it on the comments and I’ll try to make a blog post about it.


To leave a comment for the author, please follow the link and comment on their blog: R on Andres' Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

New Package to Process TVDI index and Filter Golay Savitzky Raster


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Description

  • Use MODIS image to calculate TVDI index
  • Make multiple Raster images at the same time
  • Can be used to calculate large image files
  • UI interface calculates TVDI index
  • UI interface exports Golay Savitzky filter images
  • The functions in the TVDI package
    • Golay_Raster
    • Golay_GUI (may fail if you don’t have GTK+)
    • Mean_Raster
    • Mask_Multi_Raster
    • IQR_Raster
    • TVDI_process
    • TVDI_Largefiles_process
    • TVDI_GUI (may fail if you don’t have GTK+)

How to Download and Install

  • Download and Install from Github
install.packages("devtools")library("devtools")install_github("nguyenduclam/TVDIpk")library("TVDIpk")
  • Install from Cran (waiting for update in Cran)
install.packages("TVDIpk")
  • Note that if the GTK+ library is not already installed on your system, installation may fail. In that case, please install and load the gWidgetsRGtk2 library beforehand:
install.packages("gWidgetsRGtk2")library("gWidgetsRGtk2")

How to use Packages

  1. Golay UI
    • Golay_GUI()
  2. TVDI UI
    • TVDI_GUI()

References


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This blog is now on R-bloggers!


[This article was first published on modTools, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I should have done this long ago, but it’s never too late: modTools has finally been added to R-bloggers— a content aggregator contributed by bloggers who write about R, empowering bloggers to empower other R users. I hope the community will find it useful. Constructive feedback welcome!


To leave a comment for the author, please follow the link and comment on their blog: modTools.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

cran2copr: RPM repos with 15k binary R packages


[This article was first published on R – Enchufa2, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bringing R packages to Fedora (in fact, to any distro) is an Herculean task, especially considering the rate at which CRAN grows nowadays. So I am happy to announce the cran2copr project, which is an attempt to maintain binary RPM repos for most of CRAN (~15k packages as of Feb. 2020) in an automated way using Fedora Copr.

Are you a Fedora user? Enable the CRAN Copr repo for your system:

$ sudo dnf copr enable iucar/cran

and you are ready to go. Packages are prefixed with R-CRAN-, e.g.:

$ sudo dnf install R-CRAN-rstanarm

Currently, only x86_64 chroots for supported (non-EOL) versions of Fedora, including rawhide, are enabled. If you are interested in other chroots (from the supported architectures and distros), please open an issue on GitHub expressing so, but it is unlikely that it will be enabled in the short to medium term due to current storage limitations in the Copr infrastructure.

These repos are automatically synchronized with CRAN every day at 00:00 UTC through a GitHub Action that removes archived packages and builds the most recent updates. If you find any issue with any of the supported packages (see details and limitations below), please open an issue on GitHub.

Acknowledgements

Thanks to the authors of cran2deb for the inspiration. Thanks to RedHat and, particularly, the Copr team for developing this tool and maintaining the Fedora Copr service for the Fedora community. And thanks to AWS too, because they provide a CDN for free.

Article originally published in Enchufa2.es: cran2copr: RPM repos with 15k binary R packages.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Le Monde puzzle [#1129]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A number challenge as the current Le Monde weekly mathematical puzzle:

Consider the three consecutive numbers 110, 111 and 112: they are all multiples of the sum of their digits. Are there 4 consecutive three-digit numbers like this? A contrario, do there exist 17 consecutive three-digit numbers such that none of them can be divided by the sum of its digits? 18?

A brute-force R search returns 510, 511, 512, 513 as the solution to the first question:

library(gtools)
bez=!(100:999)%%apply(baseOf(100:999),1,sum)
> (100:897)[bez[-(1:3)]*bez[-c(1:2,900)]*bez[-c(1,899:900)]*bez[-(898:900)]==1]
[1] 510

And to the second one:

> max(diff((1:899)[!!diff(bez)]))
[1] 17
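For what it’s worth, the same runs can be cross-checked in base R without gtools; digit_sum below is a small helper written for this sketch:

digit_sum <- function(x) sapply(strsplit(as.character(x), ""),
                                function(d) sum(as.integer(d)))
ok <- (100:999) %% digit_sum(100:999) == 0
runs <- rle(ok)
max(runs$lengths[runs$values])   # longest run of numbers divisible by their digit sum
max(runs$lengths[!runs$values])  # longest run of numbers that are not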

 


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RcppArmadillo 0.9.850.1.0


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

armadillo image

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 685 other packages on CRAN.

A new upstream release 9.850.1 of Armadillo was just released. And as some will undoubtedly notice, Conrad opted for an increment of 50 rather than 100. We wrapped this up as version 0.9.850.1.0, having prepared a full (github-only) tarball and the release candidate 9.850.rc1 a few days ago. Both the release candidate and the release got the full reverse-depends treatment, and no issues were found.

Changes in the new release below.

Changes in RcppArmadillo version 0.9.850.1.0 (2020-02-09)

  • Upgraded to Armadillo release 9.850.1 (Pyrocumulus Wrath)

    • faster handling of compound expressions by diagmat()

    • expanded .save() and .load() to handle CSV files with headers via csv_name(filename,header) specification

    • added log_normpdf()

    • added .is_zero()

    • added quantile()

  • The sparse matrix test using scipy, if available, is now simplified thanks to recently added reticulate conversions.

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


What is New For vtreat 1.5.2?


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

vtreat version 1.5.2 just became available from CRAN.

We have logged a few improvements in the NEWS. The changes are small and incremental, as the package is already in a great stable state for production use.

One of the biggest improvements is documentation clean up, and adapting the examples to use wrapr unpack/to multiple assignment notation. An example of the new documentation can be found here, and the larger tutorials are here.
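For readers who haven’t seen it, the wrapr unpack/to idea is roughly the following (a sketch; see the wrapr documentation for the exact forms used in the vtreat examples):

library(wrapr)

# A named list standing in for the kind of multi-part result vtreat returns.
res <- list(cross_frame = data.frame(x = 1:3), treatments = "treatment plan")

# unpack assigns each same-named element of the list to a variable in one step.
unpack[cross_frame, treatments] <- res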


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Code generation in R packages


[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If you use the same code three times, write a function. If you write three such related functions, set up a package. But if you write three embarrassingly similar functions… write code to generate their code for you? In this post, we’ll deal with source code generation. We’ll differentiate scaffolding from generating code, and we’ll present various strategies observed in the wild.

This post was inspired by an excellent Twitter thread started by Miles McBain, from which we gathered examples. Thank you Miles!

Miles furthermore mentioned Alicia Schep’s rstudio::conf talk “Auto-magic package development” to us, that was a great watch/read!

Introduction

If you can repeat yourself, you’re lucky

When would you need to generate code? A possible use case is wrapping a web API with many, many endpoints that have a predictable structure (parameters, output format) that’s well documented (“API specs”, “API schema”).

In any case, to be able to generate code, you’ll have some sort of underlying data/ontology. Having that data (specs of a web API, of an external tool you’re wrapping, structured list of all your ideas, etc.), and some consistency in the different items, is quite cool, lucky you! Some of us deal with less tidy web APIs. 😉

Scope of this post

In this post, we’ll look into scaffolding code (when your output is some sort of skeleton that still needs some human action before being integrated in a package) and generating code (you hit a button and end up with more functions and docs in the package for its users to find). We won’t look into packages exporting function factories.

Scaffolding code

“There was no way I was writing 146 functions from scratch”. Bob Rudis, GitHub comment.

Even without getting to the dream situation of code being cleanly generated, it can help your workflow to create function skeletons based on data.

  • The quote by Bob Rudis above refers to his work on crumpets where he used the Swagger spec of the Gitea API to generate drafts of many, many functions. The idea was to have following commits edit functions enough to make them work without, as he said, starting from scratch.

  • The experimental scaffolder package by Yuan Tang “provides a comprehensive set of tools to automate the process of scaffolding interfaces to modules, classes, functions, and documentations written in other programming languages. As initial proof of concept, scaffolding R interfaces to Python packages is supported via reticulate.”. The scaffold_py_function_wrapper() function takes a Python function as input and generates a R script skeleton (R code, and docs, both of them needing further editing).

In these two cases, what’s generated is a template for both R code and the corresponding roxygen2 docs.

Generating code

“odin works using code generation; the nice thing about this approach is that it never gets bored. So if the generated code has lots of tedious repetitive bits, they’re at least likely to be correct (compared with implementing yourself).” Rich FitzJohn, odin README.

Quite convincing, right? But when and how does one generate code for an R package?

Generating code once or once in a while

  • For the package whose development prompted him to start the Twitter thread mentioned earlier, Miles McBain used code generation. The package creates wrappers around dplyr functions, that can in particular automatically ungroup() your data. Now say Miles decides to wrap a further dplyr function.

Code generating a function

build_fn <- function(fn) {
  fn_name <- name(fn)
  glue::glue("{fn_name} <- function(...) {{\n",
             "  dplyr::ungroup(\n",
             "    {fn}(...)\n",
             "  )\n",
             "}}\n")
}

Code generating docs

build_fn_doco <- function(fn) {
  fn_name <- name(fn)
  glue::glue(
    "##' Ungrouping wrapper for {fn_name}\n",
    "##'\n",
    "##' The {PKGNAME} package provides a wrapper for {fn_name} that always returns\n",
    "##' ungrouped data. This avoids mistakes associated with forgetting to call ungroup().\n",
    "##'\n",
    "##' For original documentation see [{fn}()].\n",
    "##'\n",
    "##' Use [{fn_name}...()] to retain groups as per `{fn}`, whilst\n",
    "##' signalling this in your code.\n",
    "##'\n",
    "##' @title {fn_name}\n",
    "##' @param ... parameters for {fn}\n",
    "##' @return an ungrouped dataframe\n",
    "##' @author Miles McBain\n",
    "##' @export\n",
    "##' @seealso {fn}, {fn_name}..."
  )
}

Voilà, there’s an updated R/ folder, and after running devtools::document() an updated man/ folder and NAMESPACE, and it all works. You’ll have noticed the use of the glue package, that Alicia Schep also praised in her rstudio::conf talk, and that we’ve seen in many of the examples we’ve collected for this post.
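As a rough sketch of how such generators can be driven (name(), PKGNAME and the file name below are stand-ins, not the actual build script):

# Stand-ins for this sketch: strip the namespace prefix to get the bare name,
# and hard-code the package name used in the documentation template.
name <- function(fn) sub("^.*::", "", fn)
PKGNAME <- "wisegroup"

fns <- c("dplyr::mutate", "dplyr::summarise", "dplyr::filter")
generated <- vapply(
  fns,
  function(fn) paste0(build_fn_doco(fn), "\n", build_fn(fn), "\n"),
  character(1)
)

# Write the generated functions (with their roxygen2 docs) into R/,
# then devtools::document() refreshes man/ and NAMESPACE.
writeLines(generated, file.path("R", "generated-wrappers.R"))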

Code generator in a dedicated package

All the examples from the previous subsections had some sort of build scripts living in their package repo. There’s no convention on what to call them and where to store them. Now, R developers like their code packaged in package form. Alicia Schep actually stores a package in the build/ folder of vlbuildr, vlmetabuildr, that creates vlbuildr anew from the Vegalite schema! That’s meta indeed! Fret not, the build/ folder also holds a script called build.R that unleashes the auto-magic. Let us mention Alicia’s rstudio::conf talk again.

When to update the package?

We haven’t seen any code generating workflow relying on a Makefile or on a hook to an external source, so we assume such packages are updated once in a while when their maintainer amends, or notices an amendment of, the underlying ontology. See e.g. the PR updating vlbuildr to support Vegalite 4.0, or the commit regenerating redis commands for 3.2 in redux.

Generating code at install time

In the previous cases of code generation, the R package source was similar to many R package sources out there. Now, we’ve also seen cases where the code is generated when installing the package. It means that the code generation has to be perfect, since there isn’t any human edit between the code generation and the code use. Let’s dive into a few examples.

Generating icon aliases in icon

In icon, an R package by Mitchell O’Hara-Wild that allows easy insertion of icons from Font Awesome, Academicons and Ionicons into R Markdown, to insert an archive icon one can type icon::fa("archive") or icon::fa_archive(), i.e. every possible icon has its own alias function which pairs well with autocompletion e.g. in RStudio when starting to type icon::fa_. When typing ?icon::fa_archive one gets a man page entitled “Font awesome alias”, the same for all aliases. How does it work?

Font files related to the fonts are stored in inst/. It’s the same for all three fonts, but let’s focus on what happens for Font Awesome. In the R code (that’s executed when installing the package), there’s a line reading the icon names from a font file. Further below are a few very interesting lines

#' @evalRd paste("\\keyword{internal}", paste0('\\alias{fa_', gsub('-', '_', fa_iconList), '}'), collapse = '\n')
#' @name fa-alias
#' @rdname fa-alias
#' @exportPattern ^fa_
fa_constructor <- function(...) fa(name = name, ...)
for (icon in fa_iconList) {
  formals(fa_constructor)$name <- icon
  assign(paste0("fa_", gsub("-", "_", icon)), fa_constructor)
}
rm(fa_constructor)

When documenting the package, the man page “fa-alias” is created. The @evalRd tag ensures aliases for all icons from fa_iconList get an alias{} line in the “fa-alias” man page. The @exportPattern tag ensures a line exporting all functions whose names start with fa_ is added to NAMESPACE. This part happens before installing the package, every time the documentation is updated by the package maintainer. The fa_ functions are created at install time by the for loop. The function factory fa_constructor is then removed.
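To see the mechanics in isolation, here is a toy version of the same trick with made-up names: one constructor, many aliases bound at load time by rewriting a default argument and calling assign().

# One constructor whose 'name' default is rewritten on each loop pass;
# assign() then binds a snapshot of it under an alias.
greet <- function(...) paste0("hello, ", name)
for (who in c("world", "useR")) {
  formals(greet)$name <- who
  assign(paste0("greet_", who), greet)
}
rm(greet)

greet_world()  # "hello, world"
greet_useR()   # "hello, useR"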

The code generation allows an easy update to new Font Awesome versions, with a very compact source code.

Generating an up-to-date API wrapper in civis

Another interesting example is provided by the civis package, an R client for the Civis platform. Its installation instructions state that when installing the package from source, all functions corresponding to the latest API version will be created. What happens exactly when the package is installed from source? A configure script is run (configure or configure.win). Such scripts are automatically run when installing a package from source. Here’s what this script does: sourcing tools/run_generate_client.R.

"${R_HOME}"/bin/Rscript tools/run_generate_client.R

The sourced tools/run_generate_client.R fetches the API spec and writes code and roxygen2 docs to R/generated_client.R. When the package is not installed from source, users get the R/generated_client.R that was last generated by the package maintainer, so if the Civis platform itself was updated in the meantime, they might find that a platform endpoint is missing from the civis package. The approach used by civis has the clear advantage of allowing perfect synchronization between the wrapped platform and the package.
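For illustration, such a generator script could look roughly like the sketch below. This is not civis's actual code: the spec URL, the spec structure and the api_call() helper are all made up.

# tools/run_generate_client.R (sketch only)
spec <- jsonlite::fromJSON("https://example.org/api/spec.json", simplifyVector = FALSE)

make_wrapper <- function(endpoint) {
  paste0(
    "#' Call the ", endpoint$name, " endpoint\n",
    "#' @export\n",
    endpoint$name, " <- function(...) {\n",
    "  api_call(\"", endpoint$path, "\", ...)\n",
    "}\n"
  )
}

code <- vapply(spec$endpoints, make_wrapper, character(1))
writeLines(
  c("# Generated by tools/run_generate_client.R -- do not edit by hand.", code),
  "R/generated_client.R"
)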

Creating functions lists and R6 methods in minicss

In minicss by mikefc, "Lists of CSS property information is turned into function lists and R6 methods." See aaa.R and prop_transform.R. As in most examples the code is generated as a string, but in this case it's not written to disk: it becomes code via the use of eval() and parse().
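The general idea resembles this toy example (not minicss's actual code; the css_color() function is made up): build the code as a string, then turn it into live R objects without touching the disk.

code <- "css_color <- function(value) paste0('color: ', value, ';')"
eval(parse(text = code))
css_color("red")
## [1] "color: red;"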

Generate C++ bindings with Rcpp::compileAttributes()

Rcpp::compileAttributes() generates code (the bindings required to call C++ functions from R) after scanning a package's source files. Find more information in the Rcpp vignette about attributes. You could call the function "whenever functions are added, removed, or have their signatures changed", but the aforementioned vignette also states that "if you are using either RStudio or devtools to build your package then the compileAttributes function is called automatically whenever your package is built".
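In practice, the manual call is a one-liner run from the package root; it regenerates R/RcppExports.R and src/RcppExports.cpp from the C++ functions tagged with // [[Rcpp::export]]:

# run from the package root after adding or changing exported C++ functions
Rcpp::compileAttributes(pkgdir = ".", verbose = TRUE)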

Generating code on-the-fly

One step further, one might generate code on-the-fly, i.e. as users run the package. In the chromote package, for instance, there are lines

# Populate methods while the connection is being established.
protocol_spec <- jsonlite::fromJSON(self$url("/json/protocol"), simplifyVector = FALSE)
self$protocol <- process_protocol(protocol_spec, self$.__enclos_env__)
# self$protocol is a list of domains, each of which is a list of
# methods. Graft the entries from self$protocol onto self
list2env(self$protocol, self)

that are called when creating a chromote object. The process_protocol() function converts the Chrome Devtools Protocol JSON to a list of functions.
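The final list2env() call grafts those generated functions onto the object itself, roughly as in this toy example (the method names are made up, this is not chromote's actual code):

methods <- list(
  Page_navigate = function(url) message("navigating to ", url),
  Page_reload   = function() message("reloading")
)
self <- new.env()
list2env(methods, envir = self)
self$Page_navigate("https://example.com")
## navigating to https://example.com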

  • In stevedore by Rich FitzJohn, Docker client for R, functions are generated when one connects to the Docker server via stevedore::docker_client(), selecting the most appropriate version based on the server (possible specs are stored in inst/spec as compressed YAML files). In the author’s own words, in this package the approach is “not going through the text representation at all and using things like as.function and call/as.call to build up functions and expressions directly”. This happens in swagger_args.R. Thanks to Rich for many useful comments on this post.
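A rough illustration of that last approach, building a function directly from language objects without ever deparsing to text, could look like this; the http_request() helper and the endpoint are made up, this is not stevedore's actual code:

make_endpoint_fn <- function(verb, path) {
  fn <- function(...) NULL
  # build the call `http_request(verb, path, ...)` directly as a language object
  body(fn) <- as.call(list(quote(http_request), verb, path, quote(...)))
  fn
}
docker_ping <- make_endpoint_fn("GET", "/_ping")
body(docker_ping)
## http_request("GET", "/_ping", ...)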

Conclusion

In this post we explored different aspects of source code scaffolding and generation in R packages. We've mentioned examples of code scaffolding (gitea, scaffolder), of code generation by a script (wisegroup, eml.build, redux, xaringanthemer) or by a meta package (vlbuildr and vlmetabuildr) before package shipping, of code generation at install time (icon, civis, minicss, Rcpp::compileAttributes()) and of code generation at run time (chromote, stevedore). Many of these examples used some form of string manipulation, in base R or with glue, either to generate an R script and its roxygen2 docs, or to create code on the fly with eval() and parse() (minicss). One of them doesn't use any text representation at all, relying on as.function() and call()/as.call() instead (stevedore). icon also doesn't write R files.

In the more general context of automatic programming, there are also things called “generative programming”, and “low-code applications” (like tidyblocks?). As much as one enjoys writing R code, it’s great to be able to write less of it sometimes, especially when it gets too routine.

Do you use source code generation in R? Don’t hesitate to add your own use case and setup in the comments below.


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.


Last call for the course on Advanced R programming


[This article was first published on bnosac :: open analytical helpers - bnosac :: open analytical helpers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Next week we will hold our yearly course on Advanced R programming at LStat, Leuven. If you are interested in learning one of the following techniques, don’t hesitate to subscribe at https://lstat.kuleuven.be/training/coursedescriptions/AdvancedprogramminginR.html

  • Functions, the apply family of functions, parallelisation, advanced data manipulation with R
  • S3 programming
  • Building reports with markdown / Sweave
  • Build an R package

Interested in other trainings? Visit http://bnosac.be/index.php/training


See you next week!


To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers - bnosac :: open analytical helpers.


tempdisagg: converting quarterly time series to daily


[This article was first published on R – usefulr, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Not having a time series at the desired frequency is a common problem for researchers and analysts. For example, instead of quarterly sales, they only have annual sales. Instead of a daily stock market index, they only have a weekly index. While there is no way to fully make up for the missing data, there are useful workarounds: with the help of one or more high frequency indicator series, the low frequency series may be disaggregated into a high frequency series.

The package tempdisagg implements the standard methods for temporal disaggregation: Denton, Denton-Cholette, Chow-Lin, Fernandez and Litterman. Our article on temporal disaggregation of time series in the R-Journal describes the package and the theory of temporal disaggregation in more detail.

The package has been around for eight years, enabling the standard disaggregation from annual or quarterly series to quarterly or monthly ones. With version 1.0, there are now some major new features: disaggregation can be performed from any frequency to any frequency, and tempdisagg now supports time series classes other than ts.

Convert between any frequency

tempdisagg can now convert between most frequencies, e.g., it can disaggregate a monthly series to daily. It is no longer restricted to regular conversions, where each low frequency period had the same number of high frequency periods. Instead, a low frequency period (e.g. month) can contain any number of high-frequency periods (e.g. 31, 28 or 29 days). Thanks to Roger Kissling and Stella Sim who have suggested this idea.

We can convert not only months to days, but also years to days, weeks to seconds, academic years to seconds, or lunar years to hours… The downside is that the computation time depends on the number of observations, so for longer high-frequency series, the computation may take a while.

In the following, we try to disaggregate quarterly GDP of Switzerland to a hypothetical daily GDP series. The example series are shipped with the package.

library(tempdisagg)
data(tempdisagg)
head(gdp.q)
##         time    value
## 1 2005-01-01 133101.3
## 2 2005-04-01 136320.4
## 3 2005-07-01 137693.7
## 4 2005-10-01 139475.9
## 5 2006-01-01 139204.7
## 6 2006-04-01 141112.5

Time series can be stored in data frames

Because we are dealing with daily data, we keep the data in a data.frame, rather than in a ts object. Other time series objects, such as xts and tsibble, are possible as well. For conversion and visualization, we use the tsbox package.

library(tsbox)
ts_plot(gdp.q, title = "Swiss GDP", subtitle = "real, not seasonally adjusted")

Disaggregation to daily frequency

We use Swiss stock market data as an indicator series to disaggregate GDP. Data of a stock market index, the SPI (Swiss Performance Index), is also included in tempdisagg. Weekend and holiday values have been interpolated, because td() does not allow missing values.

ts_plot(spi.d, title = "Swiss Performance Index", subtitle = "daily values, interpolated")

The following uses the Chow-Lin method to disaggregate the series. A high rho parameter takes into account that the two series are unlikely to be co-integrated.

m.d.stocks <- td(gdp.q ~ spi.d, method = "chow-lin-fixed", fixed.rho = 0.9)
summary(m.d.stocks)
## 
## Call:
## td(formula = gdp.q ~ spi.d, method = "chow-lin-fixed", fixed.rho = 0.9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max
## -10656  -1760   1076   3796   8891
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.320e+03  2.856e+01   46.22   <2e-16 ***
## spi.d       5.512e-02  3.735e-03   14.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 'chow-lin-fixed' disaggregation with 'sum' conversion
## 59 low-freq. obs. converted to 5493 high-freq. obs.
## Adjusted R-squared: 0.7928 AR1-Parameter:   0.9

And here is the result: A daily series of GDP

gdp.d.stocks <- predict(m.d.stocks)
ts_plot(
  ts_scale(ts_c(gdp.d.stocks, gdp.q)),
  title = "Daily disaggregated GDP",
  subtitle = "one indicator"
)

Like with all disaggregation methods in tempdisagg, the resulting series fulfills the aggregation constraint (the resulting series is as long as the indicator, and needs to be shortened for a comparison):

all.equal(
  ts_span(ts_frequency(gdp.d.stocks, "quarter", aggregate = "sum"), end = "2019-07-01"),
  gdp.q
)
## [1] TRUE

 


To leave a comment for the author, please follow the link and comment on their blog: R – usefulr.


The significance of the sector on the salary of engineers in Sweden


[This article was first published on R Analystatistics Sweden , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

So far I have analysed the effect of experience, education, gender, year and region on the salary of engineers in Sweden. In this post, I will have a look at the effect of the sector on the salary of engineers in Sweden.

Statistics Sweden use NUTS (Nomenclature des Unités Territoriales Statistiques), which is the EU’s hierarchical regional division, to specify the regions.

First, define libraries and functions.

library(tidyverse)
## -- Attaching packages -------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(broom)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(swemaps) # devtools::install_github('reinholdsson/swemaps')
library(sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car
##   influence.merMod                car
##   dfbeta.influence.merMod         car
##   dfbetas.influence.merMod        car
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(leaps)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
readfile <- function(file1) {
  read_csv(file1, col_types = cols(), locale = readr::locale(encoding = "latin1"), na = c("..", "NA")) %>%
    gather(starts_with("19"), starts_with("20"), key = "year", value = salary) %>%
    drop_na() %>%
    mutate(year_n = parse_number(year))
}
nuts <- read.csv("nuts.csv") %>%
  mutate(NUTS2_sh = substr(NUTS2, 1, 4))
map_ln_n <- map_ln %>%
  mutate(lnkod_n = as.numeric(lnkod))

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 000000CG.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

I have renamed the file to 000000CG_sector.csv because the filename 000000CG.csv was used in a previous post.

The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by region, sector, occupational group (SSYK 2012) and sex, Year 2014 – 2018. I use the measure "Monthly salary" and the sectors "1-3 public sector" and "4-5 private sector".

We expect that the sector is an important factor in salaries. As a null hypothesis, we assume that the sector is not related to the salary and examine if we can reject this hypothesis with the data from Statistics Sweden.

tb <- readfile("000000CG_sector.csv") %>%
  filter(`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>%
  left_join(nuts %>% distinct(NUTS2_en, NUTS2_sh), by = c("region" = "NUTS2_en"))
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,
## coercing into character vector
tb_map <- readfile("000000CG_sector.csv") %>%
  filter(`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>%
  left_join(nuts, by = c("region" = "NUTS2_en"))
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,
## coercing into character vector
tb_map %>%
  filter(sector == "1-3 public sector") %>%
  right_join(map_ln_n, by = c("Länskod" = "lnkod_n")) %>%
  ggplot() +
    geom_polygon(mapping = aes(x = ggplot_long, y = ggplot_lat, group = lnkod, fill = salary)) +
    facet_grid(. ~ year) +
    coord_equal()

Figure 1: SSYK 214, Architects, engineers and related professionals, public sector, Year 2014 – 2018

tb_map %>%
  filter(sector == "4-5 private sector") %>%
  right_join(map_ln_n, by = c("Länskod" = "lnkod_n")) %>%
  ggplot() +
    geom_polygon(mapping = aes(x = ggplot_long, y = ggplot_lat, group = lnkod, fill = salary)) +
    facet_grid(. ~ year) +
    coord_equal()

Figure 2: SSYK 214, Architects, engineers and related professionals, private sector, Year 2014 – 2018

tb %>%
  ggplot() +
    geom_point(mapping = aes(x = year_n, y = salary, colour = region, shape = sex)) +
    facet_grid(. ~ sector) +
    labs(
      x = "Year",
      y = "Salary (SEK/month)"
    )

Figure 3: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

Before I investigate all possible combinations of the sector and the other factors I shall see if there is some way to predict what factors and interactions that are most significant.

First, use regsubsets to find the model which minimises AIC (Akaike information criterion). Regsubsets is a generic function for regression subset selection with methods for formula and matrix arguments.

b <- regsubsets(log(salary) ~ sector * (year_n + sex + NUTS2_sh), data = tb, nvmax = 20)
rs <- summary(b)
AIC <- 50 * log(rs$rss / 50) + (2:20) * 2
which.min(AIC)
## [1] 13
names(rs$which[13, ])[rs$which[13, ]]
##  [1] "(Intercept)"
##  [2] "sector4-5 private sector"
##  [3] "year_n"
##  [4] "sexwomen"
##  [5] "NUTS2_shSE12"
##  [6] "NUTS2_shSE21"
##  [7] "NUTS2_shSE22"
##  [8] "NUTS2_shSE33"
##  [9] "sector4-5 private sector:year_n"
## [10] "sector4-5 private sector:NUTS2_shSE21"
## [11] "sector4-5 private sector:NUTS2_shSE23"
## [12] "sector4-5 private sector:NUTS2_shSE31"
## [13] "sector4-5 private sector:NUTS2_shSE32"
## [14] "sector4-5 private sector:NUTS2_shSE33"

As a complement, I use stepwise model selection to find the model which fits the data best. StepAIC performs stepwise model selection by AIC.

model <- lm(log(salary) ~ year_n * sex * NUTS2_sh * sector, data = tb)
b <- stepAIC(model, direction = c("both"))
## Start:  AIC=-1200.79
## log(salary) ~ year_n * sex * NUTS2_sh * sector
## 
##                              Df Sum of Sq      RSS     AIC
## - year_n:sex:NUTS2_sh:sector  7  0.001441 0.041008 -1209.1
## <none>                                    0.039567 -1200.8
## 
## Step:  AIC=-1209.07
## log(salary) ~ year_n + sex + NUTS2_sh + sector + year_n:sex + 
##     year_n:NUTS2_sh + sex:NUTS2_sh + year_n:sector + sex:sector + 
##     NUTS2_sh:sector + year_n:sex:NUTS2_sh + year_n:sex:sector + 
##     year_n:NUTS2_sh:sector + sex:NUTS2_sh:sector
## 
##                              Df Sum of Sq      RSS     AIC
## <none>                                    0.041008 -1209.1
## - year_n:sex:NUTS2_sh         7 0.0047401 0.045748 -1205.6
## - year_n:sex:sector           1 0.0022478 0.043256 -1202.5
## - year_n:NUTS2_sh:sector      7 0.0058131 0.046821 -1201.9
## + year_n:sex:NUTS2_sh:sector  7 0.0014410 0.039567 -1200.8
## - sex:NUTS2_sh:sector         7 0.0080176 0.049026 -1194.5
model <- lm(log(salary) ~ year_n + sex + NUTS2_sh + sector + year_n:sex +
    year_n:NUTS2_sh + sex:NUTS2_sh + year_n:sector + sex:sector +
    NUTS2_sh:sector + year_n:sex:NUTS2_sh + year_n:sex:sector +
    year_n:NUTS2_sh:sector + sex:NUTS2_sh:sector, data = tb)
summary(model)$adj.r.squared
## [1] 0.9135882
Anova(model, type = 2) %>%
  tidy() %>%
  arrange(desc(statistic)) %>%
  filter(p.value < 0.05) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Anova report from linear model fit'
  )
Table 1: Anova report from linear model fit

term                     sumsq  df   statistic    p.value
year_n               0.2069351   1  519.760278  0.0000000
sex                  0.1113983   2  139.899908  0.0000000
sector               0.0952663   2  119.640560  0.0000000
NUTS2_sh             0.2322097  14   41.660196  0.0000000
year_n:sector        0.0120669   1   30.308411  0.0000003
NUTS2_sh:sector      0.0523275   7   18.775900  0.0000000
year_n:sex           0.0023493   1    5.900761  0.0168659
year_n:sex:sector    0.0022478   1    5.645699  0.0193467
sex:sector           0.0018231   1    4.579079  0.0347260
sex:NUTS2_sh         0.0106289   7    3.813803  0.0010092
sex:NUTS2_sh:sector  0.0080176   7    2.876825  0.0087375
year_n:NUTS2_sh      0.0078670   7    2.822810  0.0098854

Some interactions between the different factors are significant, i.e. have a p-value less than 0.05, but do not qualify because their inclusion in the model does not lower the AIC value. The tradeoff between the goodness of fit of the model and the simplicity of the model leads me to exclude those interactions from the model we will examine further.

The model I chose based on the AIC results is: log(salary) ~ year_n * sector + NUTS2_sh * sector + sex

From this model, the F-value from the Anova table for the sector is 146 (Pr(>F) < 2.2e-16), sufficient for rejecting the null hypothesis that the sector has no effect on the salary, holding year constant. The adjusted R-squared value is 0.870, implying a good fit of the model.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
tb <- bind_cols(tb, as_tibble(exp(predict(model, tb, interval = "confidence"))))
tb %>%
  ggplot() +
    geom_point(mapping = aes(x = year_n, y = fit, colour = region, shape = sex)) +
    facet_grid(. ~ sector) +
    labs(
      x = "Year",
      y = "Salary (SEK/month)"
    )

Figure 4: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

summary(model) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Summary from linear model fit'
  )

Table 2: Summary from linear model fit

term                                      estimate  std.error    statistic    p.value
(Intercept)                            -52.8857464  3.9015473  -13.5550700  0.0000000
year_n                                   0.0315705  0.0019353   16.3130867  0.0000000
sector4-5 private sector                24.7466021  5.5176204    4.4850135  0.0000150
NUTS2_shSE12                            -0.0633886  0.0109476   -5.7901587  0.0000000
NUTS2_shSE21                            -0.0951854  0.0109476   -8.6946021  0.0000000
NUTS2_shSE22                            -0.0542415  0.0109476   -4.9546264  0.0000020
NUTS2_shSE23                            -0.0304669  0.0109476   -2.7829655  0.0061252
NUTS2_shSE31                            -0.0213974  0.0109476   -1.9545201  0.0526182
NUTS2_shSE32                            -0.0304128  0.0109476   -2.7780207  0.0062142
NUTS2_shSE33                            -0.0700399  0.0109476   -6.3977139  0.0000000
sexwomen                                -0.0523393  0.0038706  -13.5223569  0.0000000
year_n:sector4-5 private sector         -0.0122815  0.0027369   -4.4873679  0.0000149
sector4-5 private sector:NUTS2_shSE12    0.0069109  0.0154823    0.4463758  0.6560106
sector4-5 private sector:NUTS2_shSE21   -0.0344624  0.0154823   -2.2259214  0.0276066
sector4-5 private sector:NUTS2_shSE22    0.0089387  0.0154823    0.5773509  0.5646232
sector4-5 private sector:NUTS2_shSE23   -0.0206495  0.0154823   -1.3337474  0.1844371
sector4-5 private sector:NUTS2_shSE31   -0.0765503  0.0154823   -4.9443769  0.0000021
sector4-5 private sector:NUTS2_shSE32   -0.0832467  0.0154823   -5.3768944  0.0000003
sector4-5 private sector:NUTS2_shSE33   -0.0711249  0.0154823   -4.5939480  0.0000096
summary(model)$adj.r.squared 
## [1] 0.8699372
Anova(model, type = 2) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Anova report from linear model fit'
  )

Table 3: Anova report from linear model fit

term                 sumsq   df  statistic   p.value
year_n           0.2069351    1  345.32122  0.00e+00
sector           0.0872899    1  145.66429  0.00e+00
NUTS2_sh         0.1798897    7   42.88421  0.00e+00
sex              0.1095761    1  182.85414  0.00e+00
year_n:sector    0.0120669    1   20.13647  1.49e-05
sector:NUTS2_sh  0.0523275    7   12.47444  0.00e+00
Residuals        0.0844948  141         NA        NA
plot(model, which = 1)

Figure 5: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

tb[38,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##
## 1 SE12 ~ 1-3 p~ 214 Engineering~ women 2015   37600   2015 SE12     40664.
## # ... with 2 more variables: lwr , upr
tb[55,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##
## 1 SE31 ~ 4-5 p~ 214 Engineering~ men   2015   37900   2015 SE31     41366.
## # ... with 2 more variables: lwr , upr
tb[76,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##
## 1 SE21 ~ 4-5 p~ 214 Engineering~ women 2016   34600   2016 SE21     38773.
## # ... with 2 more variables: lwr , upr

Let’s check what we have found.

For the sake of comparison, a model with no interactions.

model <- lm(log(salary) ~ year_n + sex + NUTS2_sh + sector, data = tb)
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
## Warning: Package `see` needed to plot multiple panels in one integrated figure.
## Please install it by typing `install.packages("see", dependencies = TRUE)` into
## the console.
## [[1]]

Figure 6: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

## 
## [[2]]

Figure 7: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

First, we investigate the interaction between region and sector. All plots below are done with the model which minimised the AIC.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("NUTS2_sh", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 8: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

Also, examine the relationship between gender and sector.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("sector", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 9: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

And the interaction between year and sector.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("year_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 10: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

The relationship between gender, sector and region.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("NUTS2_sh", "sector", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 11: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

The relationship between gender, sector and year.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("year_n", "sector", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 12: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

The relationship between region, sector and year.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 13: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

The relationship between gender, region, sector and year.

model <- lm(log(salary) ~ year_n * sector + NUTS2_sh * sector + sex, data = tb)
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n", "sector", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
## Warning: Package `see` needed to plot multiple panels in one integrated figure.
## Please install it by typing `install.packages("see", dependencies = TRUE)` into
## the console.
## [[1]]

Figure 14: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

## 
## [[2]]

Figure 15: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018


To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .

