
Gold-Mining Week 7 (2019)

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers.]

Week 7 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 7 (2019) appeared first on Fantasy Football Analytics.



Split Intermixed Names into First, Middle, and Last

[This article was first published on RLang.io | R Language Programming, and kindly contributed to R-bloggers.]

Data cleaning can be a challenge, so I hope this helps the process for someone out there. This is a tiny but valuable function for those who deal with data collected from non-ideal forms. As nearly always, it depends on the tidyverse library. You may want to rename the function from fml, but that name does best describe dealing with mangled data.

This function returns the first, middle, and last names for a given name or list of names. Missing data is represented as NA.

Usage on Existing Dataframe

Setting up a dataframe with mangled names and missing first, middle, and last names:

library(tidyverse)

df <- data.frame(names = c("John Jacbon Jingle",
                           "Heimer Schmitt",
                           "Cher",
                           "John Jacbon Jingle Heimer Schmitt",
                           "Mr. Anderson",
                           "Sir Patrick Stewart",
                           "Sammy Davis Jr.")) %>%
  add_column(First = NA) %>%
  add_column(Middle = NA) %>%
  add_column(Last = NA)

Row | names                             | First | Middle | Last
1   | John Jacbon Jingle                | NA    | NA     | NA
2   | Heimer Schmitt                    | NA    | NA     | NA
3   | Cher                              | NA    | NA     | NA
4   | John Jacbon Jingle Heimer Schmitt | NA    | NA     | NA
5   | Mr. Anderson                      | NA    | NA     | NA
6   | Sir Patrick Stewart               | NA    | NA     | NA
7   | Sammy Davis Jr.                   | NA    | NA     | NA

Replacing the first, middle, and last name values…

df[, c("First", "Middle", "Last")] <- df$names %>% fml

Row | names                             | First   | Middle               | Last
1   | John Jacbon Jingle                | John    | Jacbon               | Jingle
2   | Heimer Schmitt                    | Heimer  | NA                   | Schmitt
3   | Cher                              | Cher    | NA                   | NA
4   | John Jacbon Jingle Heimer Schmitt | John    | Jacbon-Jingle-Heimer | Schmitt
5   | Mr. Anderson                      | NA      | NA                   | Anderson
6   | Sir Patrick Stewart               | Patrick | NA                   | Stewart
7   | Sammy Davis Jr.                   | Sammy   | NA                   | Davis

Values Changed

  • In row 1 all names were found
  • In row 2 the middle name was skipped
  • In row 3 only a first name was found
  • In row 4 the middle names were collapsed
  • In row 5 only a last name was found
  • In row 6 the title Sir was omitted
  • In row 7 the title Jr. was omitted

Using the function with a single name:

fml("Matt Sandy")

           | V1   | V2 | V3
Matt Sandy | Matt | NA | Sandy

The Function

fml <- function(mangled_names) {
  # titles and honorifics to strip before splitting the name into parts
  titles <- c("MASTER", "MR", "MISS", "MRS", "MS",
              "MX", "JR", "SR", "M", "SIR", "GENTLEMAN",
              "SIRE", "MISTRESS", "MADAM", "DAME", "LORD",
              "LADY", "ESQ", "EXCELLENCY", "EXCELLENCE",
              "HER", "HIS", "HONOUR", "THE",
              "HONOURABLE", "HONORABLE", "HON", "JUDGE")
  mangled_names %>% sapply(function(name) {
    split <- str_split(name, " ") %>% unlist
    original_length <- length(split)
    # drop tokens that match a title once upper-cased and stripped of punctuation
    split <- split[which(!(split %>%
                             toupper %>%
                             str_replace_all('[^A-Z]', '') %in% titles))]
    # map the remaining tokens to first/middle/last
    case_when(
      # only a title was dropped and a single token remains: treat it as a last name
      (length(split) < original_length) &
        (length(split) == 1) ~ c(NA, NA, split[1]),
      length(split) == 1 ~ c(split[1], NA, NA),
      length(split) == 2 ~ c(split[1], NA, split[2]),
      length(split) == 3 ~ c(split[1], split[2], split[3]),
      # four or more tokens: collapse everything in the middle
      length(split) > 3 ~ c(split[1],
                            paste(split[2:(length(split) - 1)], collapse = "-"),
                            split[length(split)])
    )
  }) %>% t %>% return
}

Improvements

I recommend improving upon this if you want to integrate this function (or attributes of it) into your workflow. Naming the output or using lists so you can get partial returns, e.g. fml("John Smith")$Last, could come in handy.
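For example, a thin wrapper along these lines (my sketch, not part of the original post) names the columns so that parts can be pulled out with $:

fml_df <- function(mangled_names) {
  # wrap the matrix returned by fml() in a data frame with named columns
  out <- as.data.frame(fml(mangled_names), stringsAsFactors = FALSE)
  names(out) <- c("First", "Middle", "Last")
  out
}

fml_df("John Smith")$Last
## [1] "Smith"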

Additional cases could also be created, such as when names are entered as Last, First M. (see the sketch below). Tailoring the function to your project will yield the best results.
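One way to sketch that case (the helper normalize_name is hypothetical, my own addition) is to reorder comma-separated names before handing them to fml():

# turn "Last, First M." into "First M. Last" and leave other names untouched
normalize_name <- function(name) {
  if (grepl(",", name)) {
    parts <- str_split(name, ",\\s*")[[1]]
    paste(parts[2], parts[1])
  } else {
    name
  }
}

fml(sapply("Jingle, John Jacbon", normalize_name))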


Trends in U.S. Border Crossing Entry since 1996

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers.]

Introduction

Since the 2016 election, U.S. border security has been a huge topic. Construction of the new border wall has started, and the tension between Mexico and the U.S. has intensified along with it. Many people predicted not only a decrease in the number of illegal border entries but also a decrease in the number of legal border entries, which could hurt tourism and discourage trade across the borders. Now, at the end of 2019, three years after the 2016 election campaign, what can we learn from the statistics on inland U.S. border entry? Did the political agenda affect the way people come into our country? I would like to answer these questions by visualizing the findings.

Data

I used a dataset from Kaggle.com. The data is originally collected by U.S. Customs and Border Protection (CBP) every quarter and is then cleaned, assessed, and maintained by the Bureau of Transportation Statistics (BTS). It contains statistics for inbound crossings at the U.S.-Canada and U.S.-Mexico borders at the port level for every month since the beginning of 1996.

The original dataset includes 349,000 rows and eight columns: Port Name, State, Port Code, Border, Date, Measure, Value, and Location. Measure is the method of transportation used for border entry, with 12 categories: Bus Passengers, Buses, Pedestrians, Personal Vehicle Passengers, Personal Vehicles, Rail Containers Empty, Rail Containers Full, Train Passengers, Trains, Truck Containers Empty, Truck Containers Full, and Trucks. The Value column contains the total number of crossings.

It is important to be aware that this dataset does not count unique vehicles, passengers, or pedestrians, but rather the number of crossings. For example, the same truck can go back and forth across the border many times a day, and each crossing is recorded. The data also includes neither the nationality of the passengers or pedestrians nor the reason for the border crossing.

I used the R package dplyr to clean the data. The Border names were shortened to Canada and Mexico, and the Location column was split into two parts: Longitude and Latitude. The year 2019 was excluded from the analysis because its data is not yet complete and would not provide good insight for this project. The exploratory data analysis (EDA) was done mostly with the R packages ggplot2 and leaflet. I used shinydashboard for the visualizations and shinyapps.io as the server to present the findings.
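A sketch of those cleaning steps could look as follows (the column names, the POINT coordinate format, and the date format are my assumptions about the Kaggle file, not the author's actual code):

library(dplyr)
library(tidyr)
library(stringr)

border <- read.csv("Border_Crossing_Entry_Data.csv", stringsAsFactors = FALSE) %>%
  # shorten the border names to Canada and Mexico
  mutate(Border = recode(Border,
                         "US-Canada Border" = "Canada",
                         "US-Mexico Border" = "Mexico")) %>%
  # split Location (assumed "POINT (lon lat)") into Longitude and Latitude
  extract(Location, into = c("Longitude", "Latitude"),
          regex = "POINT \\((-?[0-9.]+) (-?[0-9.]+)\\)", convert = TRUE) %>%
  # drop the incomplete year 2019
  filter(as.integer(str_extract(Date, "[0-9]{4}")) < 2019)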

ShinyApp / Analysis

First, I wanted to observe the locations of the border ports and their distribution across the U.S. to see the big picture.

 

 

There are a total of 116 ports in this dataset: 89 on the U.S.-Canada border and 27 on the U.S.-Mexico border. So there are about three times more ports on the U.S.-Canada border than on the U.S.-Mexico border.

However, the total number of incoming crossings from 1996 to 2018 shows the exact opposite: about 7 billion crossings at the U.S.-Mexico border versus about 2.6 billion at the U.S.-Canada border. Even though there are more ports on the northern border, fewer people come in through them.

Moving on, I wanted to find out which methods of border entry are used and how they differ between the U.S.-Canada and U.S.-Mexico borders.

Overall, the most common method of entry was the personal vehicle. Here, Personal Vehicles counts the personal cars entering at the border, whereas Personal Vehicle Passengers counts the people inside those vehicles. The next highest value was, surprisingly, Pedestrians, followed by Bus Passengers and Trucks.

When the measures were compared between the two borders, a clear difference emerged: the Mexico border has a significant number of pedestrians, whereas the Canada border does not.

Looking at the number of entries by state shows a trend similar to the comparison of the two borders. There are more entries across the southern U.S. border, especially in Texas, California, and Arizona. On the northern U.S. border, the largest numbers of entries were in New York and Michigan.

Most of the southern states show a transportation mix similar to Texas, as shown in the left graph above, dominated by Personal Vehicles and Pedestrians. The northern states look similar to New York, as shown in the right graph above. This indicates that the northern and southern borders differ greatly in the number of people walking into our country.

 

Two exceptions to this pattern were Alaska and Ohio. Alaska had a notably higher share of Bus Passengers and Train Passengers, suggesting that most of the people entering the U.S. through Alaskan border ports are travelers. Ohio, the state with the lowest number of border entries, reported only one method of transportation: Personal Vehicles.

Looking at the change in the number of trucks entering the U.S. at the U.S.-Canada and U.S.-Mexico borders, I saw some patterns. First, both the Canada and Mexico sides show a sudden drop in certain years, such as 2009, the year of the global financial crisis. Trucks are used for overland trade, and it is obvious from the graph above that the economy has a big impact on border entry. Second, the number of trucks entering at the Mexico border has been increasing, possibly suggesting better trade conditions between the U.S. and Mexico.

As I wanted to see the impact of the 2016 election and the heightened border security debate, I looked at the numbers of Pedestrians and Personal Vehicle Passengers over the years. Surprisingly, contrary to my guess that the border security debate and the building of the border wall would discourage legal border entry at the U.S.-Mexico border, the statistics show an increase.

The numbers of incoming buses and bus passengers seem to be decreasing on both the Canada and Mexico sides. Buses are often used for tourism, and as other means of travel such as flights and trains have advanced over the years, the use of the bus for traveling seems to have declined.

Two methods of transportation that look distinctly different are Pedestrians and Train Passengers. As seen in the previous graphs, most pedestrians come into the U.S. across the U.S.-Mexico border, where walking across is more accessible. Trains are used more often at the U.S.-Canada border, and their use has been increasing, which may be related to the declining use of buses.

The different methods of transportation also show distinct trends when the data is viewed by month. Transportation methods used for trade, such as Trucks, Truck Containers Full, and Rail Containers Full, stay steady at both the Mexico and Canada borders no matter the month. This indicates that business-related border entry is constant year-round.

This changes, however, for travel-related transportation such as Personal Vehicle Passengers, Bus Passengers, and Train Passengers. The number of entries at the U.S.-Canada border increases significantly in the summer months, while the number at the U.S.-Mexico border stays roughly the same all year round. As the northern border gets a harsh, cold winter, it is natural to see more travelers in the summer months.

Conclusion / Further Research

Judging from the visualizations and statistics, the number of border entries depends on the economy and business rather than on politics. However, this dataset by itself cannot explain the reasons behind the changes, as there are still many factors to consider. For further research, I would like to obtain data on the citizenship of the people entering at U.S. borders as well as their intention or reason for entry. That could further show how the economy or tourism changes border entry trends.

Thank you for reading my findings on the U.S. border crossing entry data. If you are interested in the dataset I used, my Shiny app, and the code, you can follow the links below.

Dataset

ShinyApps

GitHub

 


digest 0.6.22: More goodies!

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new version of digest arrived at CRAN earlier today, and I just sent an updated package to Debian too.

digest creates hash digests of arbitrary R objects (using the md5, sha-1, sha-256, sha-512, crc32, xxhash32, xxhash64, murmur32, and spookyhash algorithms) permitting easy comparison of R language objects. It is a fairly widely-used package (currently listed at 868k monthly downloads) as many tasks may involve caching of objects for which it provides convenient general-purpose hash key generation.
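As a quick illustration of that core use, hashing an arbitrary R object takes a single call (a minimal sketch):

library(digest)

digest(mtcars)                  # md5 by default
digest(mtcars, algo = "sha256") # any of the algorithms listed above

# identical objects yield identical digests, which makes them handy cache keys
identical(digest(mtcars), digest(mtcars))
## [1] TRUE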

This release comes pretty much exactly one month after the very nice 0.6.21 release and contains five new pull requests. Matthew de Queljoe did a little bit of refactoring of the vectorised digest function he added in 0.6.21. Ion Suruceanu added a CFB cipher for AES. Bill Denney both corrected and extended sha1. And Jim Hester made the Windows-side treatment of filenames UTF-8 compliant.

CRANberries provides the usual summary of changes to the previous version.

For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Access the free economic database DBnomics with R

[This article was first published on Macroeconomic Observatory - R, and kindly contributed to R-bloggers.]

DBnomics: the world’s economic database

Explore all the economic data from different providers (national and international statistical institutes, central banks, etc.), for free, following the link db.nomics.world.

You can also retrieve all the economic data through the rdbnomics package here. This blog post describes the different ways to do so.

Fetch time series by ids

First, let’s assume that we know which series we want to download. A series identifier (ids) is defined by three values, formatted like this: provider_code/dataset_code/series_code.

Fetch one series from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

library(magrittr)
library(dplyr)
library(ggplot2)
library(rdbnomics)
df <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN') %>%
  filter(!is.na(value))

In the resulting data.frame (also available as a data.table or tibble), you will always find at least the following columns:

  • provider_code
  • dataset_code
  • dataset_name
  • series_code
  • series_name
  • original_period (character string)
  • period (date of the first day of original_period)
  • original_value (character string)
  • value
  • @frequency (harmonized frequency generated by DBnomics)

The other columns depend on the provider and on the dataset. They always come in pairs (for the code and the name). In the data.frame df, you have:

  • unit (code) and Unit (name)
  • geo (code) and Country (name)
  • freq (code) and Frequency (name)

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

In the event that you only use the argument ids, you can drop it and run:
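df <- rdb('AMECO/ZUTN/EA19.1.0.0.0.ZUTN') %>%
  filter(!is.na(value))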

Fetch two series from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

df <- rdb(ids = c('AMECO/ZUTN/EA19.1.0.0.0.ZUTN',
                  'AMECO/ZUTN/DNK.1.0.0.0.ZUTN')) %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Fetch two series from different datasets of different providers

df <- rdb(ids = c('AMECO/ZUTN/EA19.1.0.0.0.ZUTN',
                  'Eurostat/une_rt_q/Q.SA.TOTAL.PC_ACT.T.EA19')) %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics() +
  theme(legend.text = element_text(size = 7))

Fetch time series by mask

The code mask notation is a very concise way to select one or many time series at once. It is compatible only with some providers: BIS, ECB, Eurostat, FED, ILO, IMF, INSEE, OECD, WTO.

Fetch one series from dataset ‘Balance of Payments’ (BOP) of IMF

df <- rdb('IMF', 'BOP', mask = 'A.FR.BCA_BP6_EUR') %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

In the event that you only use the arguments provider_code, dataset_code and mask, you can drop the name mask and run:
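df <- rdb('IMF', 'BOP', 'A.FR.BCA_BP6_EUR') %>%
  filter(!is.na(value))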

Fetch two series from dataset ‘Balance of Payments’ (BOP) of IMF

You just have to add a + between two different values of a dimension.

df <- rdb('IMF', 'BOP', mask = 'A.FR+ES.BCA_BP6_EUR') %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Fetch all series along one dimension from dataset ‘Balance of Payments’ (BOP) of IMF

df <- rdb('IMF', 'BOP', mask = 'A..BCA_BP6_EUR') %>%
  filter(!is.na(value)) %>%
  arrange(desc(period), REF_AREA) %>%
  head(100)


Fetch series along multiple dimensions from dataset ‘Balance of Payments’ (BOP) of IMF

df <- rdb('IMF', 'BOP', mask = 'A.FR.BCA_BP6_EUR+IA_BP6_EUR') %>%
  filter(!is.na(value)) %>%
  group_by(INDICATOR) %>%
  top_n(n = 50, wt = period)


Fetch time series by dimensions

Searching by dimensions is a less concise way to select time series than using the code mask, but it works with all the different providers. You will find a “Description of series code” at the bottom of each dataset page on the DBnomics website.

Fetch one value of one dimension from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

df <- rdb('AMECO', 'ZUTN', dimensions = list(geo = "ea19")) %>%
  filter(!is.na(value))
# or
# df <- rdb('AMECO', 'ZUTN', dimensions = '{"geo": ["ea19"]}') %>%
#   filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Fetch two values of one dimension from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

df <- rdb('AMECO', 'ZUTN', dimensions = list(geo = c("ea19", "dnk"))) %>%
  filter(!is.na(value))
# or
# df <- rdb('AMECO', 'ZUTN', dimensions = '{"geo": ["ea19", "dnk"]}') %>%
#   filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Fetch several values of several dimensions from dataset ‘Doing business’ (DB) of World Bank

df <- rdb('WB', 'DB',
          dimensions = list(country = c("DZ", "PE"),
                            indicator = c("ENF.CONT.COEN.COST.ZS",
                                          "IC.REG.COST.PC.FE.ZS"))) %>%
  filter(!is.na(value))
# or
# df <- rdb('WB', 'DB', dimensions = '{"country": ["DZ", "PE"], "indicator": ["ENF.CONT.COEN.COST.ZS", "IC.REG.COST.PC.FE.ZS"]}') %>%
#   filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Fetch time series found on the web site

When you don’t know the codes of the dimensions, provider, dataset or series, you can:

  • go to the page of a dataset on DBnomics website, for example Doing Business,

  • select some dimensions by using the input widgets of the left column,

  • click on “Copy API link” in the menu of the “Download” button,

  • use the rdb_by_api_link function such as below.

df <- rdb_by_api_link("https://api.db.nomics.world/v22/series/WB/DB?dimensions=%7B%22country%22%3A%5B%22FR%22%2C%22IT%22%2C%22ES%22%5D%7D&q=IC.REG.PROC.FE.NO&observations=1&format=json&align_periods=1&offset=0&facets=0") %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_step(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Fetch time series from the cart

On the cart page of the DBnomics website, click on “Copy API link” and copy-paste it as an argument of the rdb_by_api_link function. Please note that when you update your cart, you have to copy this link again, because the link itself contains the ids of the series in the cart.

df <- rdb_by_api_link("https://api.db.nomics.world/v22/series?observations=1&series_ids=BOE/6008/RPMTDDC,BOE/6231/RPMTBVE") %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  dbnomics()

Proxy configuration or the connection error “Could not resolve host”

When using the functions rdb or rdb_..., you may come across the following error:

Error in open.connection(con, "rb"): Could not resolve host: api.db.nomics.world

To get round this situation, you have two options:

  1. configure curl to use a specific and authorized proxy.

  2. use the default R internet connection i.e. the Internet Explorer proxy defined in internet2.dll.

Configure curl to use a specific and authorized proxy

In rdbnomics, the function curl_fetch_memory (from the package curl) is used by default to fetch the data. If a specific proxy must be used, it can be defined permanently with the package option rdbnomics.curl_config or on the fly through the argument curl_config. Because the object is a named list, its elements are passed to the connection (the curl_handle object created internally with new_handle()) via handle_setopt() before curl_fetch_memory is called.

To see the available parameters, run names(curl_options()) in R or visit the website https://curl.haxx.se/libcurl/c/curl_easy_setopt.html. Once they are chosen, you define the curl object as follows:

h <- list(
  proxy = "<proxy>",
  proxyport = <port>,
  proxyusername = "<username>",
  proxypassword = "<password>"
)

Set the connection up for a session

The curl connection can be set up for a session by modifying the following package option:

options(rdbnomics.curl_config = h)

When fetching the data, the following command is executed:

hndl <- curl::new_handle()
curl::handle_setopt(hndl, .list = getOption("rdbnomics.curl_config"))
curl::curl_fetch_memory(url = <...>, handle = hndl)

After configuration, just use the standard functions of rdbnomics e.g.:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN')

This option of the package can be disabled with:

options(rdbnomics.curl_config = NULL)

Use the connection only for a function call

If a complete configuration is not needed but just an “on the fly” execution, then use the argument curl_config of the functions rdb and rdb_...:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN', curl_config = h)

Use the default R internet connection

To retrieve the data with the default R internet connection, rdbnomics will use the base function readLines.

Set the connection up for a session

To activate this feature for a session, you need to enable an option of the package:

options(rdbnomics.use_readLines = TRUE)

And then use the standard function as follows:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN')

This configuration can be disabled with:

options(rdbnomics.use_readLines = FALSE)

Use the connection only for a function call

If you just want to do it once, you may use the argument use_readLines of the functions rdb and rdb_...:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN', use_readLines = TRUE)

Understanding Blockchain Technology by building one in R

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]

By now you will know that it is a good tradition of this blog to explain stuff by rebuilding toy examples of it in R (see e.g. Understanding the Maths of Computed Tomography (CT) scans, So, what is AI really? or Google’s Eigenvector… or how a Random Surfer finds the most relevant Webpages). This time we will do the same for the hyped Blockchain technology, so read on!

Everybody is talking about blockchains and their applications, e.g. the so-called cryptocurrencies (like Bitcoin) or smart contracts, and the big business potential behind them. Alas, not many people know what the technological basis is. The truth is that blockchain technology, like any database technology, can be used for any conceivable content, not only new currencies. Business and governmental transactions as well as research results, data about organ transplants, and items gained in online games can be stored, as can examination results and all kinds of certificates; the possibilities are endless. There are two big advantages:

  • It is very hard to alter the content and
  • you don’t need some centralized trustee.

To understand why, let us create a toy example of a blockchain in R. We will use three simple transactions as content:

trnsac1 <- "Peter buys car from Michael"
trnsac2 <- "John buys house from Linda"
trnsac3 <- "Jane buys car from Peter"

It is called a chain because the transactions are concatenated like so:

To understand this picture we need to know what a hash is. Basically, a hash (or better, a cryptographic hash in this case) is just some function to encode messages. For educational purposes let us take the following (admittedly not very sophisticated) hash function:

# very simple (and not very good ;-)) hash function
hash <- function(x, l = 5) {
  hash <- sapply(unlist(strsplit(x, "")), function(x) which(c(LETTERS, letters, 0:9, "-", " ") == x))
  hash <- as.hexmode(hash[quantile(1:length(hash), (0:l)/l)])
  paste(hash, collapse = "")
}
hash(trnsac1)
## [1] "104040200d26"
hash(trnsac2)
## [1] "0a1c2240401b"

We will use this function to hash the respective transaction and (and this is important here!) the header of the transaction before it. In this way, a header is created and the transactions form a chain (have a look at the pic again).

We will create the blockchain via a simple data frame, but of course it can also be distributed across several computers (this is why the technology is also sometimes called a distributed ledger). Have a look at the function that adds a transaction to an existing blockchain, or creates a new one in case you start with NULL.

add_trnsac <- function(bc, trnsac) {
  if (is.null(bc)) bc <- data.frame(Header = hash(sample(LETTERS, 20, replace = TRUE)),
                                    Hash = hash(trnsac),
                                    Transaction = trnsac,
                                    stringsAsFactors = FALSE)
  else bc <- rbind(bc, data.frame(Header = hash(paste0(c(bc[nrow(bc), "Header"]), bc[nrow(bc), "Hash"])),
                                  Hash = hash(trnsac),
                                  Transaction = trnsac))
  bc
}

We are now ready to create our little blockchain and add the transactions:

# create blockchain
set.seed(1234)
bc <- add_trnsac(NULL, trnsac1)
bc
##         Header         Hash                 Transaction
## 1 10050502060e 104040200d26 Peter buys car from Michael

# add transactions
bc <- add_trnsac(bc, trnsac2)
bc <- add_trnsac(bc, trnsac3)
bc
##         Header         Hash                 Transaction
## 1 10050502060e 104040200d26 Peter buys car from Michael
## 2 36353b35373b 0a1c2240401b  John buys house from Linda
## 3 38383c1b391c 0a404040402c    Jane buys car from Peter

To test the integrity of the blockchain we just recalculate the hash values and stop when they don’t match:

test_bc <- function(bc) {
  integrity <- TRUE
  row <- 2
  while (integrity && row <= nrow(bc)) {
    if (hash(paste0(c(bc[(row-1), "Header"]), hash(bc[(row-1), "Transaction"]))) != bc[row, "Header"]) integrity <- FALSE
    row <- row + 1
  }
  if (integrity) {
    TRUE
  } else {
    warning(paste("blockchain is corrupted at row", (row-2)))
    FALSE
  }
}

# test integrity of blockchain
test_bc(bc)
## [1] TRUE

Let us now manipulate a transaction in the blockchain! Mafia-Joe hacks his way into the blockchain and manipulates the second transaction so that not John but he owns Linda’s house. He even changes the hash value of the transaction so that it is consistent with the manipulated transaction:

# manipulate blockchain, even with consistent hash-value!
bc[2, "Transaction"] <- "Mafia-Joe buys house from Linda"
bc[2, "Hash"] <- hash("Mafia-Joe buys house from Linda")
bc
##         Header         Hash                     Transaction
## 1 10050502060e 104040200d26     Peter buys car from Michael
## 2 36353b35373b 0d0a332d271b Mafia-Joe buys house from Linda
## 3 38383c1b391c 0a404040402c        Jane buys car from Peter

test_bc(bc)
## Warning in test_bc(bc): blockchain is corrupted at row 2
## [1] FALSE

Bingo, the integrity test cries foul! The consistency of the chain is corrupted and Mafia-Joe’s hack doesn’t work!

One last thing: in our toy implementation, verifying a blockchain and creating a new one use the same amount of computing power. This is a gross oversimplification of what goes on in real-world systems: there, creating a new block is computationally much more expensive than verifying an existing one. Huge numbers of possible hash values have to be tried out, because a valid hash has to fulfill certain criteria (e.g. a number of leading zeros). This makes the blockchain extremely safe.
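To make the idea concrete, here is a minimal proof-of-work-style sketch on top of our toy hash() function (my own illustration, not part of the original post; with this weak hash we can only demand a single leading zero, whereas real systems demand many more):

# vary the block with a random "nonce" until the hash fulfills the criterion
mine <- function(contents, prefix = "0") {
  tries <- 0
  repeat {
    tries <- tries + 1
    nonce <- paste(sample(LETTERS, 5, replace = TRUE), collapse = "")
    candidate <- hash(paste0(nonce, contents))
    if (startsWith(candidate, prefix)) return(list(nonce = nonce, tries = tries, hash = candidate))
  }
}

mine(trnsac1)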

In the cryptocurrency world people (so-called miners) get paid (of course also in cryptocurrency) for finding valid hash values (called mining). Indeed big mining farms have been established which consume huge amounts of computing power (and therefore electricity, which is one of the disadvantages of this technology). For more details consult my question on Bitcoin.SE and the answers and references given there: Is verification of a blockchain computationally cheaper than recreating it?

I hope that this post helped you understand the technological basis of this fascinating trend. Please share your thoughts on the technology and its potential in the comments below!


Super Solutions for Shiny Architecture #5 of 5: Automated Tests

[This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers.]

Automated Tests. Super Solutions for Shiny Architecture

TL;DR

This post describes best practices for setting up an automated test architecture for Shiny apps. Automate and test early and often with unit tests, user interface tests, and performance tests.

Best Practices for Testing Your Shiny App

Even your best apps will break down at some point during development or during User Acceptance Tests. I can bet on this. It's especially true when developing big, productionalized applications with various team members contributing and under a client's deadline pressure. It's best to find those bugs early. Automated testing can assure your product's quality. Investing time and effort in automated tests brings a huge return. It may seem like a burden at the beginning, but imagine the alternative: fixing the same misbehaviour of the app for the third time, e.g. when a certain button is clicked. What is worse, bugs are sometimes spotted after changes are merged to the master branch, and you have no idea which code change let the door open for them, as no one checked that particular functionality for a month or so. Manual testing is a solution to some extent, but I can confidently assume that you would rather spend testing time on improving the user experience than on looking for a missing comma in the code.

How do we approach testing at Appsilon? We aim to organize our test structure according to the “pyramid” best practice:

(diagram: the testing pyramid)

FYI, there is also an anti-pattern called the “test cone” (shown below). Even such a test architecture in an app I would consider a good sign; after all, the app is (automatically) tested, which is unfortunately often not the case at all. Nevertheless, switching to the “pyramid” makes your tests more reliable and effective, and less time-consuming.

(diagram: the test cone anti-pattern)

No matter how extensively you are testing or planning to test your app, take this piece of advice: set up your working environment so that automated tests are triggered before any pull request is merged (check tools like CircleCI for this). Otherwise you will soon hate finding bugs caused by developers: “Aaaa, yeaaa, it's on me, haven't run the tests, but I thought that the change is so small and not related to anything crucial!” (I assume it goes without saying that no changes go into ‘master’ or ‘development’ branches without a proper pull request procedure and review.)

Let’s now describe the different types of tests in detail:

Unit Tests

… are the simplest to implement and the most low-level kind of tests. The term refers to testing the behaviour of functions by comparing their output against expected values. It's a case-by-case approach – hence the name. Implementing them will allow you to recognize all the edge cases and understand the logic of your function better. Believe me – you will be surprised by what your function can return when fed unexpected input. This idea is pushed to its boundaries with the so-called Test Driven Development (TDD) approach. Whether you're a fan or a skeptic, at the end of the day you should have good unit tests implemented for your functions.

How do you achieve this in practice? The popular and well-known package testthat should be your weapon of choice. Add a tests folder to your source code. Inside it, add another folder testthat and a script testthat.R. The script's only job is to trigger all of your tests stored in the testthat folder, in which you define scripts for your tests (one script per functionality or single function – names should start with "test_" plus some name that reflects the functionality, or even just the name of the function). Start each test script with context() – write inside some text that will help you understand what the included tests are about. Now you can start writing down your tests, one by one. Every test is wrapped in a test_that() call, with a text description of what exactly is tested, followed by the test itself – commonly just calling the function with a set of parameters and comparing the result with the expected output, e.g.

result <- sum(2, 2)
expect_equal(result, 4)
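Putting the pieces together, a complete test script (the file and test names are just illustrative) could look like this:

# tests/testthat/test_addition.R
context("Addition")

test_that("sum() adds two numbers", {
  result <- sum(2, 2)
  expect_equal(result, 4)
})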

Continue adding tests for each function, and scripts for all functions. Once this is ready, we can set up the main testthat.R script. There you can use test_check("yourPackageName") for apps structured as packages, or in general test_results <- test_dir("tests/testthat", reporter = "summary", stop_on_failure = TRUE).

User Interface (UI) Tests

The core of these tests is to compare the actual app behaviour with what is expected to be displayed after various user actions. Usually this is done by comparing screen snapshots with reference images. The crucial part, though, is setting up the architecture to automatically perform human-like user actions and take snapshots.

Why are User Interface (UI) tests needed? It is common in an app development project that all of the functions work fine, yet the app still crashes. It might be, for example, due to JS code that used to do the job but suddenly stopped working because the object it looks for now appears on the screen with a slight delay compared to before. Or a modal ID has been changed and clicking the button no longer triggers anything. The point is this: Shiny apps are much more than R code, with all the JS, CSS, and browser dependencies, and at the end of the day what truly matters is whether the users get the expected, bug-free experience.

The great folks from RStudio figured out a way to aid developers in taking snapshots. Check this article to get more information on the shinytest package. It basically allows you to record actions in the app and select when snapshots should be created, to be checked during tests. Importantly, shinytest saves the snapshots as json files describing the content. This fixes the usual problem with comparing images, where small differences in colors or fonts across browsers are flagged as errors. An image is also generated, to make it easy for the human eye to check that everything is OK.
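A minimal shinytest workflow, sketched here with a placeholder app directory, looks roughly like this:

library(shinytest)

# record user actions interactively; this saves a test script plus the
# expected json/png snapshots under the app's tests/ directory
recordTest("path/to/app")

# replay the recorded scripts (e.g. on CI) and compare fresh snapshots
# against the stored ones
testApp("path/to/app")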

The RSelenium package is also worth mentioning. It connects R with the Selenium WebDriver API for automating web browsers. It is harder to configure than shinytest, but it does the job.

As shinytest is quite a new solution, at Appsilon we had already developed our own internal architecture for tests. The solution is based on puppeteer and BackstopJS. The test scenarios are written in JavaScript, so it is quite easy to produce them. Plus, BackstopJS has very nice-looking reports.

I guess the best strategy is to start with shinytest and, if you run into problems using it, switch to a more general solution for web applications.

Performance Tests

Yes, Shiny applications can scale. They just need the appropriate architecture. Check our case study and architecture description blog posts to learn how we build large-scale apps. As a general rule, you should always check how your app performs under extreme usage conditions. The source code should be profiled and optimised. The application's behaviour under heavy usage can be tested with RStudio's recent package shinyloadtest (see the sketch below). It will help you estimate how many users your application can support and where the bottlenecks are located. This is achieved by recording a “typical” user session and then replaying it in parallel at huge scale.
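A rough shinyloadtest round trip, with placeholder URLs and paths (the parallel replay step uses the companion shinycannon command-line tool):

library(shinyloadtest)

# 1. record a "typical" user session against the running app
record_session("http://localhost:3838/myapp")  # writes recording.log

# 2. replay it in parallel from a shell, e.g.:
#    shinycannon recording.log http://localhost:3838/myapp --workers 10 --output-dir run1

# 3. analyse the runs and generate the html report
df <- load_runs("run1")
shinyloadtest_report(df, "report.html")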

So, please test. Test automatically, early and often.

giant alien bugs from the Starship Troopers film

Smash down all the bugs before they become big, strong and dangerous insects!

Follow Appsilon Data Science on Social Media

Follow @Appsilon on Twitter! Follow us on LinkedIn! Don’t forget to sign up for our newsletter. And try out our R Shiny open source packages!

Article Super Solutions for Shiny Architecture #5 of 5: Automated Tests comes from Appsilon Data Science | End­ to­ End Data Science Solutions.


Council spending – open data

[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers.]

My local authority recently decided to publish all spending over £500 in an effort to be more transparent. Here's a post taking an overview of what they've published. I've used R for the analysis. The dataset doesn't contain much detail, but if you have analysis suggestions, please add them in the comments!

You can download the spending data here. It’s available in pdf (why?!) and xlsx (plain text would be more open).

First off, some packages:

library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
library(formattable)

Read in the dataset:

df = read_excel("~/Downloads/midlothian_payments_over_500_01042019_to_15092019.xlsx") %>%
  clean_names()

We’ve got six columns:

  • type
  • date_paid
  • supplier
  • amount
  • our_ref
  • financial_year

 

Busiest day:

df %>%
  mutate(day = weekdays(date_paid)) %>%
  group_by(day) %>%
  summarise(transactions = n(),
            thousands_pounds_spent = sum(amount) / 1000) %>%
  mutate(day = fct_relevel(day, rev(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))) %>%
  gather(var, value, -day) %>%
  ggplot(aes(day, value)) +
  geom_col() +
  facet_wrap(~var, scales = "free_x") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Busiest day of the week",
       x = "",
       y = "")


Busiest time of year:

df %>%
  mutate(dow = weekdays(date_paid),
         dow = if_else(dow == "Tuesday" | dow == "Friday", "Tue/Fri", "Other")) %>%
  group_by(date_paid, dow) %>%
  summarise(transactions = n(),
            pounds_spent = sum(amount)) %>%
  gather(var, value, -date_paid, -dow) %>%
  ggplot(aes(date_paid, value, colour = dow)) +
  geom_point() +
  facet_wrap(~var, scales = "free_y") +
  scale_y_log10(labels = scales::comma) +
  scale_colour_brewer(type = "qual", palette = "Set2") +
  labs(title = "Busiest day of the year",
       x = "",
       y = "")


Top 10 payees by value:

df %>%
  group_by(supplier) %>%
  summarise(pounds_spent = sum(amount),
            transactions = n()) %>%
  arrange(desc(pounds_spent)) %>%
  top_n(n = 10, wt = pounds_spent) %>%
  mutate(pounds_spent = currency(pounds_spent, "£", digits = 0L)) %>%
  formattable(list(`pounds_spent` = color_bar("#FA614B"),
                   `transactions` = color_bar("lightpink")))


In Scotland, local authorities collect water charges on behalf of the water authority, which they then pass on. It's no surprise that Scottish Water is the biggest supplier.

Top 10 payees by frequency:

df %>%
  group_by(supplier) %>%
  summarise(pounds_spent = sum(amount),
            transactions = n()) %>%
  arrange(desc(transactions)) %>%
  top_n(n = 10, wt = transactions) %>%
  mutate(pounds_spent = currency(pounds_spent, "£", digits = 0L)) %>%
  formattable(list(`pounds_spent` = color_bar("lightpink"),
                   `transactions` = color_bar("#FA614B")))


As a final note, writing this post reminds me again that I should move away from WordPress, because incorporating code and output would be much easier with markdown/blogdown! As always, legacy is holding me back.



pkgKitten 0.1.5: Creating R Packages that purr

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]


Another minor release, 0.1.5, of pkgKitten just hit CRAN today, after a break of almost three years.

This release provides a few small changes. The default per-package manual page now benefits from a second refinement (building on what was introduced in the 0.1.4 release) by using the Rd macros referring to the DESCRIPTION file rather than duplicating information. Several pull requests fixed sloppy typing in README.md, NEWS.Rd and the manual page—thanks to all contributors for fixing these. Details below.

Changes in version 0.1.5 (2019-10-22)

  • More extensive use of newer R macros in package-default manual page.

  • Install .Rbuildignore and .gitignore files.

  • Use the updated Travis run script.

  • Use more Rd macros in default ‘stub’ manual page (#8).

  • Several typos were fixed in README.md, NEWS.Rd and the manual page (#9, #10)

More details about the package are at the pkgKitten webpage and the pkgKitten GitHub repo.

Courtesy of CRANberries, there is also a diffstat report for this release.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


linl 0.0.4: Now with footer

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new release of our linl package for writing LaTeX letters with (R)markdown just arrived on CRAN. linl makes it easy to write letters in markdown, with some extra bells and whistles thanks to some cleverness chiefly by Aaron.

This version now supports a (pdf, png, …) footer along with the already-supported header, thanks to an initial PR by Michal Bojanowski to which Aaron added nice customization for scale and placement (as supported by the LaTeX package wallpaper). I also added support for continuous integration testing at Travis CI via a custom Docker RMarkdown container—which is something I should actually say more about at another point.

Here is a screenshot of the vignette showing the simple input for some moderately fancy output (now with a footer):

The NEWS entry follows:

Changes in linl version 0.0.4 (2019-10-23)

  • Continuous integration tests at Travis are now running via custom Docker container (Dirk in #21).

  • A footer for the letter can now be specified (Michal Bojanowski in #23 fixing #10).

  • The header and footer options can be customized more extensively, and are documented (Aaron in #25 and #26).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Horizontal scaling of data science applications in the cloud

[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers.]

Prediction models, machine learning algorithms, and scripts for data storage: the modern data science application not only grows ever more complex, but also puts existing infrastructure to the test with temporary resource peaks. In this article, we show how tools such as the RStudio Job Launcher in conjunction with a Kubernetes cluster can be used to offload the execution of arbitrary analysis scripts to the cloud, scale them, and return the results to the local infrastructure.

A brief introduction to Kubernetes and the Job Launcher

Kubernetes, designed by Google in 2014, is an open-source container-orchestration system. The focus of such systems is the automated deployment, scaling, and management of container applications. A Kubernetes cluster provides so-called (worker) nodes that can be addressed by other applications; within the nodes, the necessary containers are booted up in pods and made available. In a statistics/analysis context, the offloading and horizontal scaling of computation-intensive analyses is particularly interesting. In a multi-user environment, distributing jobs among the worker nodes ensures that exactly the required amount of resources is made available, depending on the workload. In the R analysis context, the RStudio Job Launcher, an independent tool of RStudio Server, can play to its strengths and send sessions and scripts directly to a Kubernetes cluster via a plugin.

 

On the one hand, this avoids the additional costs of servers at rest; on the other hand, it prevents the bottlenecks that frequently occur during workload peaks on fixed-size systems. Building on this basic idea, the RStudio Job Launcher can also be used from local sessions by executing individual R scripts in the Kubernetes cluster and returning their results. Data science use cases include resource-intensive scripts, the simultaneous training of different analysis models, and compilation tasks that can be offloaded to external nodes.

Our conclusion

Scalability, combined with on-demand provisioning and use of resources, is an ideal scenario for organizations that need to keep their data in the local data center and cannot move entirely to the cloud. In addition, by outsourcing computationally intensive processes, the local data center does not need to grow unnecessarily. This saves the purchase of additional servers that would only be used at temporary resource peaks.

In our opinion, this scenario will be particularly interesting for companies that are not allowed to store their data in the cloud, in the face of constant data growth and ever more complex requirements on the analysis infrastructure.

In addition to the advantage of processing locally hosted, on-premise data in a computing cluster, analyses can also be based on different frameworks thanks to Docker images. Flexible requirements on the analysis infrastructure, such as executing certain analyses on a GPU or CPU cluster or booting additional worker nodes, are also easily implemented. Scaling computation-intensive processes horizontally can be achieved with little effort, because access to a cluster is easier than ever, for example through Amazon's EKS service, which provides a completely cloud-based Kubernetes cluster.

This approach solves numerous challenges for data scientists and data engineers. For this reason, we are happy to support and advise you in the planning and implementation of an IT infrastructure in your company. Learn more about aicon | analytic infrastructure consulting!


Decision Making Support Systems #3: Differences between IA and AI


[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Batou the cyborg confers with the think-tank Tachikoma in Ghost in the Shell.

The Differences between Artificial Intelligence and Augmented Intelligence

In previous posts, we looked at the definition of Artificial Intelligence (AI) and the definition of Intelligence Augmentation (IA). So, what are the differences between the two? Intelligence Augmentation has always been concerned with aiding human decision making and keeping humans-in-the-loop, whereas the AI endeavor seeks to “build machines that emulate and then exceed the full range of human cognition,” with an eye on replacing them.

Rob Bromage puts it well in “Artificial intelligence vs intelligence augmentation”: “AI is an autonomous system that can be taught to imitate and replace human cognitive functions. To put it simply, the machine completely replaces human intervention and interaction. IA, on the other hand, plays more of an assistive role by leveraging AI to enhance human intelligence, rather than replace it.”

In “Augmented Intelligence, not Artificial Intelligence, is the Future” Aaron Masih writes that “While the underlying technologies powering AI and IA are the same, the goals and applications are fundamentally different: AI aims to create systems that run without humans, whereas IA aims to create systems that make humans better.”  

A still from the film Interstellar: a beneficial partnership between human and machine.

IA systems are able to exceed system boundaries (the parts of an environment not covered by an AI model’s data inputs) because of the “unsung heroes of the AI revolution”: humans! AI systems can only work with the datasets that they’re trained on and that feed them. Subject-matter-expert humans excel at applying context and intuition to problems, and humans can grasp and explain causality. Alex Bates recalls from his experience running an AI/prescriptive maintenance firm that “what was remarkable about the human process engineers and maintenance engineers at these plants were all the clues they incorporated into their assessments of equipment failure, allowing them to identify what was failing and what to repair. Where they struggled, and where the AI helped, was in making sense of the massive amount of sensor data coming off the equipment.”

Another major difference? To state the obvious, one receives much more attention than the other. Even the casual observer of the industry can see that augmenting human intelligence represents a tiny fraction of the total research and development devoted to Artificial Intelligence. The lack of investment in the area seems like a missed opportunity.

So the AI approach seeks to minimize or replace the role of humans, while an IA solution seeks to amplify the abilities and performance of the humans that participate in a given activity. The IA approach benefits from the human ability to think outside system boundaries. But despite the lack of investment and press attention on Intelligence Augmentation, businesses and other organizations continue to benefit from IA applications…   

Examples of IA at work today 

Appsilon CEO Filip Stachura presenting at useR! Toulouse 2019.

I asked Filip Stachura, CEO of Appsilon Data Science, specialists in Data Science and Machine Learning: how much of what you do as a company qualifies as Intelligence Augmentation?

FS: Most of what we do qualifies as IA. Decision support systems can be more realistic in many business cases. Even a recommendation engine application is a sales-support decision system, not an AI that does the selling itself.

FS: For one firm we did the following: prices for products that our client sells change frequently and depend heavily on negotiations. When a client requests a product, the salesperson opens the application that we built and fills in the client and product name. They then see the history of deals made with the client and suggested prices/discounts for the product. The prices vary depending on the segment of the business, the category/size of the client, and the size of the order. Ultimately, the human salespeople make the pricing decision, but the application provides the most up-to-date information to optimize decision-making. It’s sales-support, not salesperson replacement.
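
To make the idea concrete, here is a deliberately simplified Shiny sketch of this kind of sales-support lookup. The deal data and the pricing rule are invented for illustration; this is not Appsilon's actual application.

library(shiny)

# Toy deal history standing in for the client's real data
deals <- data.frame(
  client  = c("Acme", "Acme", "Globex"),
  product = c("Widget", "Widget", "Sprocket"),
  price   = c(95, 92, 180)
)

ui <- fluidPage(
  selectInput("client", "Client", unique(deals$client)),
  selectInput("product", "Product", unique(deals$product)),
  tableOutput("history"),
  textOutput("suggestion")
)

server <- function(input, output) {
  # Deal history for the selected client/product pair
  hist_df <- reactive(deals[deals$client == input$client &
                            deals$product == input$product, ])
  output$history <- renderTable(hist_df())
  output$suggestion <- renderText({
    if (nrow(hist_df()) == 0) return("No deal history for this pair.")
    # Naive suggestion: mean of past prices; a real system would segment
    # by client category and order size, as described above
    paste("Suggested price:", round(mean(hist_df()$price), 2))
  })
}

# shinyApp(ui, server)  # run interactively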

FS: Here is another example. What if a manager has to optimize the usage of hundreds of varying models of cleaning machines that are based in different locations in a region? Each location has its own logistical needs and has staff of varying levels of expertise in operating the machines. How do you optimize performance and cost with such diverse conditions? Is the answer to eliminate the manager in charge, or is it to augment the manager’s capabilities with a decision-support system? Obviously it’s the latter, since a fully automated management system would have great difficulty in assessing non-standard situations and communicating with the various teams at the various locations.

Appsilon Data Science co-founder Damian Rodziewicz added:

DR: We also worked with Dr. Ken Benoit and his team from the London School of Economics to make their Quanteda text analysis tool available to a much greater number of social scientists and practitioners from other fields, including medicine and law. Now the user doesn’t need to know the R programming language in order to make use of the powerful Quanteda R package. A human researcher can use the Quanteda package to quickly evaluate language and social trends from millions of documents, truly extending the human’s research domain by many orders of magnitude.
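
For a flavor of what this looks like in practice, here is a minimal, illustrative quanteda snippet (a quanteda 1.x-style call) using one of the package's bundled sample corpora; it is not the LSE team's actual workflow:

library(quanteda)
# Corpus of the 2010 UK party manifestos on immigration, bundled with quanteda
corp <- corpus(data_char_ukimmig2010)
# Document-feature matrix with stopwords and punctuation removed
d <- dfm(corp, remove = stopwords("en"), remove_punct = TRUE)
# The ten most frequent terms across the documents
topfeatures(d, 10)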

The man and the machine: Dr. Ken Benoit and the Quanteda text analysis tool.

Here is another example, and I have transcribed and hijacked it from Andrew Ng’s Amazon re:MARS 2019 presentation in Las Vegas, Nevada. He gave it as an example of AI, but I think it’s really an example of IA.

Radiologists do a lot of things. They read x-ray images, they also consult with patients, they do surgical planning, and mentor younger doctors… out of all of these tasks, one of them seems amenable to AI automation or acceleration — that’s why many teams including mine are building systems like these to have AI enter this task of reading x-ray images. 

2016: @GeoffreyHinton says we should stop training radiologists, because radiologists will soon be replaced by deep learning

2019: There is a shortage of radiologists. # replaced ≈ 0

What did Hinton miss?

by @garymarcus & @MaxALittle https://t.co/rxTTNxaeBN

— Gary Marcus (@GaryMarcus) October 23, 2019

Does anyone really want to replace radiologists with machines at this time?  With human lives at stake? Probably not. But a machine-assisted radiologist?  That is interesting. After the first versions are released, radiologists can work with engineers to teach the machines to do a better job in finding problems.  And eventually there can be a feature set that allows the radiologists to teach the machines directly, without the constant participation of the engineers. And training data from all over the world can be shared to further optimize results.   

When is IA superior? 

Another beneficial human-machine partnership: the pilot and the ship's AI in “Lucky 13,” a Netflix short from the Love, Death & Robots series.

In short, IA is superior when…

…only a limited amount of labeled data, or only unlabeled data exists for a given task

…a task requires empathy and/or negotiation between humans

…a task requires a notion of causality, not just correlation

…system boundaries exist: crucial data cannot be captured by sensors, or the data inputs don’t cover the entirety of the problem

…a problem exists within a regulated environment in which machines still struggle with decisions, such as in medicine and surgery

…a task is so critical that only a human can make the final decision

…a task requires more than 2-3 seconds of human thought

The above criteria probably describe most business and research problem scenarios. Consider moving away from the approach of “how do we replace our staff with Artificial Intelligence agents,” and instead move towards “what repeatable, routinizable task can we automate in order to free up time for our human teammates? How do we unleash more intuitive and brilliant ideas by increasing their bandwidth? How do we prevent problems in our facilities by partnering machines and humans?”

Thanks for reading. In the next post, we’ll look at “How to Implement an IA Solution.”

Here are the previous posts:

What Is Artificial Intelligence?

What Is Intelligence Augmentation?

Follow me on Twitter @_joecha_

Follow Appsilon Data Science on Social Media

Follow @Appsilon on Twitter! Follow us on LinkedIn! Don’t forget to sign up for our newsletter. And try out our R Shiny open source packages!

Article Decision Making Support Systems #3: Differences between IA and AI comes from Appsilon Data Science | End­ to­ End Data Science Solutions.


Gold-Mining Week 8 (2019)


[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Week 8 Gold Mining and Fantasy Football Projection Roundup now available, on time and ready to go!

The post Gold-Mining Week 8 (2019) appeared first on Fantasy Football Analytics.


RStudio Professional Drivers 1.6.0


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Access to data is crucial for data science. Unfortunately, servers that run RStudio are often disconnected from databases, especially in organizations that are new to R. In order to help data scientists access their databases, RStudio offers ODBC data connectors that are supported, easy to install, and designed to work everywhere you use RStudio professional products. The 1.6.0 release of RStudio Professional Drivers includes a few important updates.

New data sources


The 1.6.0 release includes new drivers for the following data sources:

  • Amazon Athena
  • Google BigQuery
  • Apache Cassandra
  • MongoDB
  • MySQL
  • IBM Netezza

These six new drivers complement the eight existing drivers from the prior release: Amazon Redshift, Apache Hive, Apache Impala, Oracle, PostgreSQL, Microsoft SQL Server, Teradata, and Salesforce. The existing drivers have also been updated with new features and improvements in the 1.6.0 release. For example, the SQL Server driver now supports the NTLM security protocol. For a full list of changes, refer to the RStudio Professional Drivers 1.6.0 release notes.
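
Once a driver is installed and registered, connecting from R goes through the DBI and odbc packages. A minimal sketch, assuming the driver name matches an entry in odbcinst.ini and using placeholder connection details:

library(DBI)
con <- dbConnect(
  odbc::odbc(),
  Driver   = "MySQL",           # must match a driver entry in odbcinst.ini
  Server   = "db.example.com",  # placeholder host
  Port     = 3306,
  Database = "analytics",       # placeholder database
  UID      = "my_user",
  PWD      = "my_password"
)
dbGetQuery(con, "SELECT 1")     # quick connectivity check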

New packaging (.rpm / .deb)

Installations of drivers from the prior release of RStudio Professional Drivers relied on an installer script. In this release, the installer script has been eliminated and instead the drivers use standard Linux package management tools – .rpm and .deb packages – that we provide for RHEL/CentOS 6/7, Debian/Ubuntu, and SUSE 12/15. Standardized packaging makes installations and upgrades easier for administrators. Those needing custom installations (e.g. installations into a non-standard directory) can still download the .tar file. For step-by-step instructions see Installing RStudio Professional Drivers.

  • Breaking change. Installing 1.6.0 drivers on top of existing drivers could cause issues. Administrators should uninstall existing drivers and remove driver entries in odbcinst.ini before installing version 1.6.0. See Installing RStudio Professional Drivers.
  • Breaking change. Installing 1.6.0 drivers no longer updates odbcinst.ini. Administrators should manually add entries to odbcinst.ini based on odbcinst.ini.sample which is included in driver packaging. See Installing RStudio Professional Drivers.

Using with Python

RStudio Professional Drivers can be used with both R and Python. You can use the drivers with Jupyter Notebooks and JupyterLab sessions that launch from RStudio Server Pro 1.2.5. You can also use the drivers with Jupyter Notebooks that are published to RStudio Connect 1.7.0+.

A note about write-backs

RStudio Professional Drivers are just one part of a complex ODBC connection chain designed for doing data science. Typical data science tasks involve querying and extracting subsets of data into R. It can be tempting to use the ODBC connection chain for data engineering tasks such as bulk loads, high speed transactions, and general purpose ETL. However, heavy-duty data engineering tasks are better done with specialized third-party tools. We recommend using the ODBC connection chain primarily for querying and analyzing data.


While doing data science, it is often handy to write data from R into databases. ODBC write-backs can be challenging when creating tables or inserting records. Standards vary wildly across data sources, and matching data types to data sources can be exacting. Compared to specialized third-party tools, ODBC write-backs tend to be slow. We recommend ODBC write-backs from R only when appropriate and only for small tables.
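
In practice, an appropriate ODBC write-back is a small result set via DBI, for example (reusing a connection like the one above; the table and data are illustrative):

# Write a small summary table back to the database
summary_df <- data.frame(segment = c("A", "B"), revenue = c(1200, 3400))
DBI::dbWriteTable(con, "revenue_summary", summary_df, overwrite = TRUE)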


Ordering Sentinel-2 products from Long Term Archive with sen2r


[This article was first published on R on Luigi Ranghetti Website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Until August 2019, all Sentinel-2 satellite data could be directly downloaded from the ESA Data Hub, both through the interactive Open Hub or using an API interface.

Recently this policy was changed: typically, only the most recent products are available for direct download, while the older ones (level-2A archives older than 18 months and level-1C archives older than one year) are stored in the so-called Long Term Archive and must be ordered by the user; they are then made available for download after a while (no notification is sent to the user). For further details, see the announcement of the activation of LTA access for Sentinel-2 and 3.

R users can exploit a functionality recently implemented in the package sen2r to deal with products that are no longer available for direct download.

Overview and installation

sen2r is a package devoted to downloading and preprocessing Sentinel-2 satellite imagery via an accessible and easy-to-install interface, which can also be used to manage massive processing operations and to schedule automatic processing chains. For an overview of the package's functionalities, see this previous post.

sen2r was recently released on CRAN, but the current CRAN release (1.1.0) does not yet support ordering products from the Long Term Archive (LTA); to exploit this functionality, it is necessary to use the GitHub release (version 1.2.0 or higher), which can be installed using the package remotes:

remotes::install_github("ranghetti/sen2r")

(refer to this page for a detailed installation guide, including the installation of system dependencies).

After installing the package, the following function should be used to save the SciHub credentials:

library(sen2r)
write_scihub_login("my_user", "my_password")

Automatic orders

sen2r can be used both in online and offline mode: in the first case, Sentinel-2 source archives (in SAFE format) are searched and downloaded from ESA SciHub, while in the second case, only archives already present on the user machine are used. This post refers to the online usage of the package (nothing changes using it in offline mode).

Starting from version 1.2.0, SAFE archives which are not available for direct download are automatically ordered. This means that, if some of the archives required to produce the user's outputs are stored in the LTA:

  1. they are ordered;
  2. the processing chain launched by the user continues, skipping these archives (and thus without producing the corresponding output rasters);
  3. the user can check the order status with a specific command.

Let’s see some examples of that.

Note: in the following examples, we will search / download an area of interest located in a random Sentinel-2 tile of Kazakhstan, in a time window of 20 days. Randomness is necessary to catch products not available for download (otherwise, archives overlapping a fixed area of interest would be ordered the first time I ran the examples, making them available and compromising subsequent executions). This expedient does not ensure reproducibility, since the randomly selected products could already be available for different reasons.

Using the function sen2r()

sen2r() is the main function of the package, which can be used to manage a whole processing chain. In the following example sen2r() will be used to launch a processing chain to generate RGB colour images from level 1C (top of atmosphere reflectances) data. First of all, the boundaries of Kazakhstan are downloaded and used to select the Sentinel-2 tiles intersecting this country. Then, a specific area of interest (the centroid of a random tile, with a buffer of 5 km) is created.

library(sf); library(magrittr)
download.file(
  "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_KAZ_0_sf.rds",
  kaz_sf_path <- file.path(tempdir(), "gadm36_KAZ_0_sf.rds")
)
kaz_sf <- readRDS(kaz_sf_path)
s2tiles_kaz <- tiles_intersects(kaz_sf, all = TRUE, out_format = "sf")

In this example, tiles_intersects() is a sen2r function which allows obtaining the Sentinel-2 tiles which cover a specific area of interest. Then, the processing is launched:

safe_folder <- tempfile(pattern = "safe_")
out_folder_1 <- tempfile(pattern = "Example1_")
sel_tile_1 <- s2tiles_kaz[sample(nrow(s2tiles_kaz), 1), ] %>%
  sf::st_transform(3857)
sel_tile_1$tile_id
## [1] "40TET"
sel_extent_1 <- sf::st_centroid(sel_tile_1) %>% st_buffer(5e3)
out1 <- sen2r(
  gui = FALSE,
  extent = sel_extent_1,
  timewindow = c("2018-02-21", "2018-03-02"),
  list_rgb = "RGB432T",
  path_l1c = safe_folder,
  path_l2a = safe_folder,
  path_out = out_folder_1,
  log = log_path_1 <- tempfile()
)

This is the log of the function (by default it is returned at standard output, while in this example it was redirected to a temporary file by setting the argument log):

## [2019-10-24 15:27:02] Starting sen2r execution.
## [2019-10-24 15:27:02] Searching for available SAFE products on SciHub...
## [2019-10-24 15:27:09] Ordering 1 Sentinel-2 images stored in the Long Term Archive...
## [2019-10-24 15:27:09] 1 of 1 Sentinel-2 images were correctly ordered. You can check at a later time if the ordered products were made available using the command:
##
## safe_is_online("/home/lranghetti/.sen2r/lta_orders/lta_20191024_152709.json")
##
## [2019-10-24 15:27:09] Computing output names...
## [2019-10-24 15:27:10] Starting to download the required level-2A SAFE products.
## [2019-10-24 15:27:10] Download of level-2A SAFE products terminated.
## [2019-10-24 15:27:10] Starting to download the required level-1C SAFE products.
## [2019-10-24 15:27:10] Check if products are available for download...
## [2019-10-24 15:27:11] 1 Sentinel-2 images are already available and will not be ordered.
## [2019-10-24 15:27:11] Downloading Sentinel-2 image 1 of 1 (S2B_MSIL1C_20180301T070819_N0206_R106_T40TET_20180301T105657.SAFE)...
## [2019-10-24 15:28:45] Download of level-1C SAFE products terminated.
## [2019-10-24 15:28:45] Updating output names...
## [2019-10-24 15:28:45] Starting to translate SAFE products in custom format.
## GDAL version in use: 2.2.3
## Using UTM zone 40.
## 1 output files were correctly created.
## [2019-10-24 15:28:45] Starting to merge tiles by orbit.
## Using projection "+proj=utm +zone=40 +datum=WGS84 +units=m +no_defs".
## [2019-10-24 15:28:46] Starting to edit geometry (clip, reproject, rescale).
## [2019-10-24 15:28:46] Producing required RGB images.
## 1 output RGB files were correctly created.
## [2019-10-24 15:28:50] Generating thumbnails.
## 1 output files were correctly created.
## [2019-10-24 15:28:51] Execution of sen2r session terminated.

The required archive, not available for direct download, was ordered. After that, the function continued processing the available archives.

At the end of the processing, a warning could appear if the user has exceeded the order quota:

##  of  Sentinel-2 images were not correctly ordered because user '' offline products retrieval quota exceeded. Please retry later, otherwise use different SciHub credentials (see ?write_scihub_login or set a specific value for argument "apihub"). 

The function cited in the log can be used to check if the order was processed:

## safe_is_online("/home/lranghetti/.sen2r/lta_orders/lta_20191024_152709.json")
## S2A_MSIL1C_20180224T070851_N0206_R106_T40TET_20180224T110454.SAFE
##                                                              FALSE

(the JSON file, automatically created by sen2r(), contains the URLs of the ordered products).

Once this function returns TRUE for all the listed archives, the previous sen2r() execution can be re-launched to complete the processing.

If the user does not want to order missing products, it is sufficient to set the sen2r() argument order_lta to FALSE, as sketched below. When launching the function with the Graphical User Interface, this setting can be changed in the first sheet.
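
A minimal sketch, reusing the objects from the first example above:

out1_noorder <- sen2r(
  gui = FALSE,
  extent = sel_extent_1,
  timewindow = c("2018-02-21", "2018-03-02"),
  list_rgb = "RGB432T",
  path_l1c = safe_folder,
  path_l2a = safe_folder,
  path_out = out_folder_1,
  order_lta = FALSE  # missing LTA products are skipped, not ordered
)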

Using specific functions

Other package functions can be used to perform specific steps of a processing chain:

  • s2_list() allows searching the SAFE archives matching the required parameters. By default, all archives (downloadable / stored in LTA) are returned; setting the new argument availability to "online" or "lta" returns only available archives or LTA products, respectively. Setting availability = "check" allows distinguishing which returned products are available for download and which are not:

out_folder_2 <- tempfile(pattern = "Example2_")
sel_tile_2 <- s2tiles_kaz[sample(nrow(s2tiles_kaz), 1), ]
sel_tile_2$tile_id
    ## [1] "42UXF"
out_list_2 <- s2_list(
  tile = sel_tile_2$tile_id,
  time_interval = c("2018-02-21", "2018-03-02"),
  availability = "check"
)
# Show available products
out_list_2[attr(out_list_2, "online")]
##                                      S2A_MSIL1C_20180301T061801_N0206_R034_T42UXF_20180301T082121.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('ef9d324a-1172-42f5-8396-b120ab4dab89')/$value"
##                                      S2B_MSIL1C_20180302T063759_N0206_R120_T42UXF_20180302T102521.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('9efefd9f-563c-4ab4-bdb1-e7178688c551')/$value"
# Show products stored in LTA
out_list_2[attr(out_list_2, "lta")]
##                                      S2A_MSIL1C_20180222T062851_N0206_R077_T42UXF_20180222T101014.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('e6f94a43-804c-4fa7-a216-8ffc6620b8ed')/$value"
##                                      S2B_MSIL1C_20180224T061829_N0206_R034_T42UXF_20180224T101548.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('7766d2f8-da66-427f-8e19-b134c4c482f3')/$value"
##                                      S2A_MSIL1C_20180225T063831_N0206_R120_T42UXF_20180225T103712.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('bff08819-ce0e-44b5-8327-4e379fc7542c')/$value"
##                                      S2B_MSIL1C_20180227T062809_N0206_R077_T42UXF_20180227T083242.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('d44b1929-5b77-47a1-865c-7515dcab9960')/$value"
  • s2_download() can be used to download specific SAFE products. By default, products stored in LTA are automatically ordered (unless argument order_lta is set to FALSE by the user):

    s2_download(out_list_2, order_lta = FALSE)

    This function can also be used by passing it the path of a JSON file created by sen2r() (see above) or by s2_order() (see below) containing the URLs of the archives to be downloaded.

  • The safe_is_online() function, which was already shown, can also be used with a list of required archives:

    safe_is_online(out_list_2)
## S2A_MSIL1C_20180222T062851_N0206_R077_T42UXF_20180222T101014.SAFE
##                                                              FALSE
## S2B_MSIL1C_20180224T061829_N0206_R034_T42UXF_20180224T101548.SAFE
##                                                              FALSE
## S2A_MSIL1C_20180225T063831_N0206_R120_T42UXF_20180225T103712.SAFE
##                                                              FALSE
## S2B_MSIL1C_20180227T062809_N0206_R077_T42UXF_20180227T083242.SAFE
##                                                              FALSE
## S2A_MSIL1C_20180301T061801_N0206_R034_T42UXF_20180301T082121.SAFE
##                                                               TRUE
## S2B_MSIL1C_20180302T063759_N0206_R120_T42UXF_20180302T102521.SAFE
##                                                               TRUE

Manual orders

The new function s2_order() can be used to manually order SAFE archives stored in the LTA. Similarly to s2_download() and safe_is_online(), s2_order() accepts a list of SAFE URLs to be downloaded, either as an R vector or as the path of a JSON file containing them:

ordered_list <- s2_order(out_list_2)
## [2019-10-24 15:29:19] Check if products are already available for download...
## [2019-10-24 15:29:22] 2 Sentinel-2 images are already available and will not be ordered.
## [2019-10-24 15:29:22] Ordering 4 Sentinel-2 images stored in the Long Term Archive...
## [2019-10-24 15:29:38] 4 of 4 Sentinel-2 images were correctly ordered. You can check at a later time if the ordered products were made available using the command:
##
## safe_is_online("/home/lranghetti/.sen2r/lta_orders/lta_20191024_152938.json")
ordered_list
##                                      S2A_MSIL1C_20180222T062851_N0206_R077_T42UXF_20180222T101014.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('e6f94a43-804c-4fa7-a216-8ffc6620b8ed')/$value"
##                                      S2B_MSIL1C_20180224T061829_N0206_R034_T42UXF_20180224T101548.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('7766d2f8-da66-427f-8e19-b134c4c482f3')/$value"
##                                      S2A_MSIL1C_20180225T063831_N0206_R120_T42UXF_20180225T103712.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('bff08819-ce0e-44b5-8327-4e379fc7542c')/$value"
##                                      S2B_MSIL1C_20180227T062809_N0206_R077_T42UXF_20180227T083242.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('d44b1929-5b77-47a1-865c-7515dcab9960')/$value"
## attr(,"available")
##                                      S2A_MSIL1C_20180301T061801_N0206_R034_T42UXF_20180301T082121.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('ef9d324a-1172-42f5-8396-b120ab4dab89')/$value"
##                                      S2B_MSIL1C_20180302T063759_N0206_R120_T42UXF_20180302T102521.SAFE
## "https://scihub.copernicus.eu/apihub/odata/v1/Products('9efefd9f-563c-4ab4-bdb1-e7178688c551')/$value"
## attr(,"notordered")
## named character(0)
## attr(,"path")
## [1] "/home/lranghetti/.sen2r/lta_orders/lta_20191024_152938.json"

The function returns a vector of ordered products. Depending on the outcome, some attributes are added (see the sketch after this list):

  • "available" contains products not ordered because already available for download;
  • "notordered" contains products whose orders failed (commonly because the user exceeded his quota);
  • "path" contains the path of the saved JSON file containing the URLs of the ordered products, which can be used with the functions safe_is_online() and s2_download() to check if the order was processed and then to download the products (the creation of this file can be skipped by setting export_prodlist = FALSE).
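
For instance (a sketch reusing the ordered_list object from above):

attr(ordered_list, "available")   # archives already available for download
attr(ordered_list, "notordered")  # orders which failed (e.g. quota exceeded)
attr(ordered_list, "path")        # JSON file with the ordered products' URLs
safe_is_online(attr(ordered_list, "path"))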

Conclusions

Starting from version 1.2.0, sen2r is able to manage Sentinel-2 products stored in the Long Term Archive (LTA), which are not directly downloadable. The default behaviour of the package functions is to order unavailable products, processing only the available archives and providing a way to check the order status through a single line of code. Once the order has been processed, the user can relaunch the same code to complete the processing. New functions can be exploited to perform specific steps: safe_is_online() can be used to check if SAFE archives are / were made available, and s2_order() to manually order them. The implementation of additional LTA-related features is planned for the future (e.g. an additional HTML report with the list of used / ordered products).

These features were implemented recently, so users could encounter some issues when exploiting them. With the exception of issues not depending on this package (e.g. errors due to invalid SciHub credentials or an exceeded user quota), users are encouraged to report them on GitHub.

Credits

sen2r is developed by Luigi Ranghetti and Lorenzo Busetto (IREA-CNR), and it is released under the GNU GPL-3 license.

Using sen2r for production (including scientific products) requires citing it (use this entry).



The politics of New Mexico: a brief historical-visual account


[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post, we piece together a brief political history of New Mexico using a host of data sources, including Wikipedia, the US Census, the New Mexico State Legislature (NMSL), VoteView, and the National Conference of State Legislatures. A bit of a show/tell post, and one that piggybacks some on a guide I have developed for working with US political data using R.

Here, we focus on the political leanings of voters in New Mexico as attested by who they have backed historically in presidential elections, who they have sent to the US Congress, and who they have elected as representation in the state legislature & governorship. As we will see, it has been a curious history since statehood in 1912.

In the process, we demonstrate some methods for accessing/cleaning online data sets made available in a variety of formats. Many of these data (especially those made available by the NMSL) have not really seen the light of day; so, we let these data breathe some. Fully reproducible. Open methods. Open government.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, tigris, nmelectiondatr)  # devtools::install_github("jaytimm/nmelectiondatr")
options(tigris_use_cache = TRUE, tigris_class = "sf")

New Mexico demographics via tidycensus

We first take a quick look at some socio-demographic indicators in New Mexico (relative to other states in the Union) using the tidycensus package. The violin plots below summarize percentages of the population that are Hispanic (Per_Hispanic), White (Per_White), living below the poverty line (Per_BPL), have a Bachelor’s degree or higher (Per_Bachelors_Plus), and have a high school degree or higher (Per_HS_Plus). Also included are median household incomes (Median_HH_Income).

vars <- c(Per_Hispanic = 'DP05_0071P',
          Per_Bachelors_Plus = 'DP02_0067P',
          Per_BPL = 'DP03_0128P',
          Per_White = 'DP05_0077P',
          Per_HS_Plus = 'DP02_0066P',
          Median_HH_Income = 'DP03_0062')
m90 <- tidycensus::get_acs(geography = "state",
                           variables = vars,
                           year = 2017,
                           output = "tidy",
                           survey = 'acs1') %>%
  filter(!GEOID %in% c('11', '72'))

So, New Mexico is browner, less financially heeled, and less educated relative to other states in the USA. A very simple overview of a fairly complicated state.

labs <- m90 %>%
  filter(NAME == 'New Mexico')
m90 %>%
  ggplot(aes(x = 1, y = estimate)) +
  geom_violin() +
  geom_point() +
  ggrepel::geom_label_repel(data = labs,
                            aes(x = 1, y = estimate, label = 'NM'),
                            color = 'steelblue',
                            nudge_x = .01) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        legend.position = "none") +
  facet_wrap(~variable, scales = 'free') +
  labs(title = "Some socio-demographics",
       caption = 'American Community Survey, 1-Year Estimates, 2017.')

2016 Presidential Election

To get our bearings, we briefly consider how New Mexico voted in the 2016 presidential election. While Hillary Clinton carried the state by roughly eight points, here we investigate election results at the precinct level. My nmelectiondatr package (available on GitHub) makes available election returns in New Mexico for state & federal elections from 2014 – 2018. These data live on the New Mexico Legislature website as fairly inaccessible spreadsheets. I have cleaned things up some, and collated returns as simple data tables. For non-R users, data also live as simple csv/excel files.

Here, we access precinct-level returns for the 2016 presidential election.

precincts_raw <- #nmelectiondatr::nmel_pol_geos$nm_precincts %>%
  nmelectiondatr::nmel_results_precinct %>%
  filter(Type == 'President and Vice President of the United States') %>% #&
  group_by(County_Name, Precinct_Num) %>%
  mutate(per = round(Votes/sum(Votes), 3)) %>%
  select(County_Name, Precinct_Num, Party, per) %>%
  filter(Party %in% c('DEM', 'REP')) %>%
  spread(Party, per) %>%
  mutate(Trump_Margin = REP - DEM)
base <- nmelectiondatr::nmel_pol_geos$nm_precincts %>%
  inner_join(precincts_raw) %>%
  ggplot() +
  geom_sf(aes(fill = cut_width(Trump_Margin, 0.2)),
          color = 'darkgray') +
  scale_fill_brewer(palette = 'RdBu', direction = -1, name = 'Margin')

The map below summarizes Trump vote margins by precinct in New Mexico. So, an airplane-red state, much like the country as a whole, with larger, more rural precincts dominating the map.

base +
  ggsflabel::geom_sf_text_repel(data = nmel_pol_geos$nm_places %>%
                                  filter(LSAD == '25'),
                                aes(label = NAME), size = 2.5) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = 'right') +
  labs(title = "2016 Trump margins by precinct")

When we zoom in some to New Mexico’s four largest metro areas, the state becomes a bit blue-er. Rio Rancho is the friendliest to 45 – where Trump held a rally in mid-September.

Presidential elections in New Mexico historically

A blue state in 2016, next we consider how New Mexico has voted in presidential elections since its statehood in 1912. We first grab a simple list of US presidents and their party affiliations via GitHub.

url1 <- 'https://gist.githubusercontent.com/namuol/2657233/raw/74135b2637e624848c163759be9cd14ae33f5153/presidents.csv'
us_pres <- read.csv(url(url1)) %>%  #select(Year, Party) %>%
  mutate(Party = trimws(Party),
         Party = gsub('/.*$', '', Party),
         year = as.numeric(gsub('^.*/', '', Took.office)) - 1,
         President = gsub(' \\(.*$', '', President)) %>%
  select(year, President, Party) %>%
  mutate(Party = gsub('Democratic', 'Democrat', Party)) %>%
  bind_rows(data.frame(year = 2016,
                       President = 'Donald Trump',
                       Party = 'Republican'))

Then we access New Mexico’s presidential election voting history via Wikipedia.

url <- 'https://en.wikipedia.org/wiki/United_States_presidential_elections_in_New_Mexico'
nm_returns <- url %>%
  xml2::read_html() %>%
  rvest::html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
  rvest::html_table(fill = TRUE)
nm_returns <- nm_returns[, c(1:2, 4:5, 7)]
colnames(nm_returns) <- c('year', 'winner', 'winner_per', 'loser', 'loser_per')
nm_returns1 <- nm_returns %>%
  mutate(state_winner = ifelse(winner_per < loser_per, loser, winner)) %>%
  rowwise() %>%
  mutate(other = round(100 - sum(winner_per, loser_per), 2)) %>%
  left_join(us_pres %>% select(-year), by = c('winner' = 'President'))
wins <- nm_returns1 %>%
  select(year, state_winner, Party, winner_per) %>%
  rename(per = winner_per)
loss <- nm_returns1 %>%
  select(year, state_winner, Party, loser_per) %>%
  mutate(Party = ifelse(Party == 'Democrat', 'Republican', 'Democrat')) %>%
  rename(per = loser_per)
others <- nm_returns1 %>%
  select(year, state_winner, Party, other) %>%
  rename(per = other) %>%
  mutate(Party = 'Other')
new <- bind_rows(wins, loss, others)
new$Party <- factor(new$Party, levels = c('Other', 'Democrat', 'Republican'))

Based on these data, the plot below summarizes historical election results by party affiliation. Labeled are the candidates that won New Mexico. The gray portions of the plot reflect vote shares for “other”/ non-predominant political parties.

flip_dets <- c('Other', 'Democrat', 'Republican')
flip_pal <- c('#b0bcc1', '#395f81', '#9e5055')
names(flip_pal) <- flip_dets
dems <- new %>%
  group_by(year) %>%
  filter(per == max(per)) %>%
  filter(Party == 'Democrat')
pres_labels <- new %>%
  group_by(year) %>%
  filter(Party == 'Democrat') %>%
  mutate(percent1 = ifelse(state_winner %in% dems$state_winner,
                           per + 7, per - 7),
         state_winner = toupper(sub('^.* ', '', state_winner)))
new %>%
  ggplot(aes(x = year, y = per, fill = Party)) +
  geom_bar(alpha = 0.85, color = 'white', stat = 'identity') +
  annotate(geom = "text",
           x = pres_labels$year,
           y = pres_labels$percent1,
           label = pres_labels$state_winner,
           size = 3, angle = 90, color = 'white') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  #geom_hline(yintercept = 50, color = 'white', linetype = 2) +
  theme(legend.position = "none") +
  #guides(fill = guide_legend(reverse = TRUE)) +
  scale_fill_manual(values = flip_pal) +
  #ggthemes::scale_fill_stata() +
  scale_x_continuous(breaks = seq(1912, 2016, 4)) + xlab('') +
  ggtitle('Presidential election results in New Mexico')

Margins historically

For a slightly different perspective, we consider Republican-Democrat vote margins historically. As the plot below attests, the state swings quite a bit, though it has settled some as blue more recently, post-Bush v2.

new %>%
  select(-state_winner) %>%
  spread(Party, per) %>%
  mutate(margin = Republican - Democrat,
         Party = ifelse(margin > 0, 'Republican', 'Democrat')) %>%
  ggplot(aes(x = year, y = margin, fill = Party)) +
  geom_bar(alpha = 0.85, color = 'white', stat = 'identity') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.position = "none") +
  ggthemes::scale_fill_stata() +
  scale_x_continuous(breaks = seq(1912, 2016, 4)) + xlab('') +
  ggtitle('Presidential vote margins in New Mexico since 1912')

New Mexico as bellwether?

While New Mexico got Clinton wrong in 2016, the state seems a fairly consistent bellwether of presidential winners historically, supporting JFK, Ronald Reagan, and Barack Obama alike. Here, then, we consider how New Mexico stacks up against other states in the Union in terms of voting for the winning presidential candidate.

Below, we extract voting histories for all US states from Wikipedia. Tables are mostly uniform across states (eg, New Mexico). California & Pennsylvania tables, eg, are structured a bit differently, and require some individual tweaking.

base_url <- 'https://en.wikipedia.org/wiki/United_States_presidential_elections_in_'
states <- uspoliticalextras::uspol_csusa_senate_bios %>%
  filter(congress == 116) %>%
  select(state_fips:state_abbrev) %>%
  distinct() %>%
  filter(!state %in% c('Pennsylvania', 'California')) %>%
  mutate(which_table = ifelse(state %in% c('New York', 'Missouri'), 3, 2))
states_correct <- list()
for (i in 1:nrow(states)) {
  states_correct[[i]] <-
    paste0(base_url, states$state[i]) %>%
    xml2::read_html() %>%
    rvest::html_node(xpath = paste0('//*[@id="mw-content-text"]/div/table[',
                                    states$which_table[i], ']')) %>%
    rvest::html_table(fill = TRUE)
  states_correct[[i]] <- states_correct[[i]][, c(1:2, 4:5, 7)]
  colnames(states_correct[[i]]) <- c('year', 'winner', 'winner_per', 'loser', 'loser_per')
  states_correct[[i]] <- states_correct[[i]] %>%
    mutate(winner_per = as.numeric(gsub('^$|-|%', 0, winner_per)),
           loser_per = as.numeric(gsub('^$|-|%', 0, loser_per)),
           state_winner = ifelse(winner_per < loser_per, loser, winner),
           correct = ifelse(state_winner == winner, 'correct', 'incorrect'),
           correct = ifelse(winner_per == 'n/a', NA, correct),
           year = as.integer(gsub("\\D+", "", year)),
           year = substr(year, 1, 4))
}
names(states_correct) <- states$state
states_correct1 <- states_correct %>%
  bind_rows(.id = 'state')

The table below summarizes how states have fared in predicting presidential election winners since 1912, a total of 27 elections. New Mexico, then, is tied for second with Missouri. Nevada and Ohio voters have only missed two winners since 1912.

correct <- states_correct1 %>%
  select(state, year, correct) %>%
  mutate(year = as.integer(year)) %>%
  # returns1 / returns_ca hold the Pennsylvania & California results,
  # scraped separately (code not shown)
  bind_rows(returns1, returns_ca) %>%
  filter(year > 1911) %>%
  group_by(state, correct) %>%
  summarize(n = n()) %>%
  filter(!is.na(correct)) %>%
  spread(correct, n) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(per_correct = round(correct/sum(correct, incorrect), 3))
correct %>%
  arrange(desc(per_correct)) %>%
  DT::datatable(rownames = FALSE) %>%
  DT::formatStyle('per_correct',
    background = DT::styleColorBar(range(correct[4]), 'lightblue'),
    backgroundSize = '80% 70%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'right')

States ranked by share of elections voting with the winner since 1912 (correct / incorrect elections, share correct):

  • Nevada, Ohio: 25/2 (.926)
  • Missouri, New Mexico: 24/3 (.889)
  • Florida, Tennessee: 23/4 (.852)
  • Illinois, Kentucky, Montana: 22/5 (.815)
  • Arizona, California, Colorado, Delaware, Idaho, Maryland, New Hampshire, New Jersey, Utah, Wisconsin: 21/6 (.778)
  • Arkansas, Iowa, New York, North Carolina, Oklahoma, Pennsylvania, Texas, Virginia, West Virginia, Wyoming: 20/7 (.741)
  • Connecticut, Louisiana, Massachusetts, Michigan, Oregon, Rhode Island, Washington: 19/8 (.704)
  • Indiana, Kansas, Minnesota, Nebraska, North Dakota: 18/9 (.667)
  • Georgia, South Carolina: 17/10 (.630)
  • Alaska, Hawaii: 9/6 (.600; fewer elections since statehood)
  • Mississippi, South Dakota: 16/11 (.593)
  • Alabama, Maine, Vermont: 15/12 (.556)

Elections New Mexico got wrong

The table below summarizes results for elections that New Mexico, Ohio, Missouri, and Nevada got wrong since 1912. So, Ohio has been spot on since 1960; Missouri has (seemingly) gone full red. Also of note: two of the three presidential nominees that New Mexico got wrong won the popular vote nationally, ie, Al Gore & HR Clinton.

states_correct1 %>%
  filter(state %in% c('Ohio', 'Nevada', 'New Mexico', 'Missouri') &
           correct == 'incorrect' & year > 1911) %>%
  select(state:loser_per) %>%
  knitr::kable()
| state      | year | winner                    | winner_per | loser              | loser_per |
|------------|------|---------------------------|------------|--------------------|-----------|
| Missouri   | 2012 | Barack Obama              | 44.38      | Mitt Romney        | 53.76     |
| Missouri   | 2008 | Barack Obama              | 49.29      | John McCain        | 49.43     |
| Missouri   | 1956 | Dwight D. Eisenhower      | 49.89      | Adlai Stevenson II | 50.11     |
| Nevada     | 2016 | Donald Trump              | 45.50      | Hillary Clinton    | 47.92     |
| Nevada     | 1976 | Jimmy Carter              | 45.81      | Gerald Ford        | 50.17     |
| New Mexico | 2016 | Donald Trump              | 40.04      | Hillary Clinton    | 48.26     |
| New Mexico | 2000 | George W. Bush            | 47.85      | Al Gore            | 47.91     |
| New Mexico | 1976 | Jimmy Carter              | 48.28      | Gerald Ford        | 50.75     |
| Ohio       | 1960 | John F. Kennedy (D)       | 46.72      | Richard Nixon (R)  | 53.28     |
| Ohio       | 1944 | Franklin D. Roosevelt (D) | 49.82      | Thomas E. Dewey (R)| 50.18     |

The map below illustrates state ranks for voting with the presidential winner since 1912 (based on table above). States in darker blue are better bellwethers. So, the Southwest is generally quite good, while the South & New England less so.

correct_labels <- correct %>%
  ungroup() %>%
  arrange(desc(correct)) %>%
  mutate(rank = dplyr::dense_rank(desc(correct)))
out <- uspoliticalextras::uspol_dvos_equalarea_sf$tile_outer %>%
  left_join(correct_labels)
inner <- uspoliticalextras::uspol_dvos_equalarea_sf$tile_inner %>%
  left_join(correct_labels) %>%
  mutate(rank_label = paste0(state_abbrev, '\n', rank))
out %>%
  ggplot() +
  geom_sf(aes(fill = rank),
          color = 'black') +
  ggsflabel::geom_sf_text(data = inner,
                          aes(label = rank_label),
                          size = 3.5,
                          color = 'black') +
  theme_minimal() +
  scale_fill_gradient(low = "#2c7bb6", high = "#ffffbf") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        legend.title = element_blank(),
        legend.position = 'none') +
  labs(title = "Voting with winning candidate since 1912 by rank")

Congressional delegation historically

Next, we consider the composition of New Mexico’s congressional delegation historically. Here we access data made available via the VoteView project and the R package Rvoteview.

House of Representatives

The table below details the names & party affiliations of the (3) representatives New Mexico has sent to Washington over the last 15 congresses. District 3, which is comprised of the northern half of the state and includes the Santa Fe metro, has generally gone blue during this time period. District 1 (the ABQ metro area) has flipped from red to blue since the election of Obama in 2008. District 2 (the southern half of the state) has been a GOP stronghold, with the 111th and 116th congresses being exceptions.

js <- "(/Rep/).test(value) ? '#fcdbc7' : (/Dem/).test(value) ? '#d1e6ef' : ''"
dat <- Rvoteview::member_search(chamber = 'House',
                                state = 'NM',
                                congress = 102:116) %>%
  mutate(bioname = gsub('\\(Tom\\)', '', bioname),
         bioname = ifelse(party_name == 'Democratic Party',
                          paste0(bioname, ' (Dem)'),
                          paste0(bioname, ' (Rep)'))) %>%
  select(congress, bioname, district_code) %>%
  group_by(congress, district_code) %>%
  slice(1) %>%
  ungroup() %>%
  spread(district_code, bioname)
dat %>%
  DT::datatable(rownames = FALSE,
                options = list(pageLength = 15, dom = 't')) %>%
  DT::formatStyle(1:ncol(dat), backgroundColor = htmlwidgets::JS(js))

| Congress | District 1                    | District 2                  | District 3                    |
|----------|-------------------------------|-----------------------------|-------------------------------|
| 102      | SCHIFF, Steven Harvey (Rep)   | SKEEN, Joseph Richard (Rep) | RICHARDSON, Bill (Dem)        |
| 103      | SCHIFF, Steven Harvey (Rep)   | SKEEN, Joseph Richard (Rep) | RICHARDSON, Bill (Dem)        |
| 104      | SCHIFF, Steven Harvey (Rep)   | SKEEN, Joseph Richard (Rep) | RICHARDSON, Bill (Dem)        |
| 105      | SCHIFF, Steven Harvey (Rep)   | SKEEN, Joseph Richard (Rep) | REDMOND, William Thomas (Rep) |
| 106      | WILSON, Heather (Rep)         | SKEEN, Joseph Richard (Rep) | UDALL, Thomas (Dem)           |
| 107      | WILSON, Heather (Rep)         | SKEEN, Joseph Richard (Rep) | UDALL, Thomas (Dem)           |
| 108      | WILSON, Heather (Rep)         | PEARCE, Stevan (Rep)        | UDALL, Thomas (Dem)           |
| 109      | WILSON, Heather (Rep)         | PEARCE, Stevan (Rep)        | UDALL, Thomas (Dem)           |
| 110      | WILSON, Heather (Rep)         | PEARCE, Stevan (Rep)        | UDALL, Thomas (Dem)           |
| 111      | HEINRICH, Martin (Dem)        | TEAGUE, Harry (Dem)         | LUJÁN, Ben Ray (Dem)          |
| 112      | HEINRICH, Martin (Dem)        | PEARCE, Stevan (Rep)        | LUJÁN, Ben Ray (Dem)          |
| 113      | LUJAN GRISHAM, Michelle (Dem) | PEARCE, Stevan (Rep)        | LUJÁN, Ben Ray (Dem)          |
| 114      | LUJAN GRISHAM, Michelle (Dem) | PEARCE, Stevan (Rep)        | LUJÁN, Ben Ray (Dem)          |
| 115      | LUJAN GRISHAM, Michelle (Dem) | PEARCE, Stevan (Rep)        | LUJÁN, Ben Ray (Dem)          |
| 116      | HAALAND, Debra (Dem)          | TORRES SMALL, Xochitl (Dem) | LUJÁN, Ben Ray (Dem)          |

So, 2018 (the 116th) was only the second time in the last thirty years that New Mexico elected an all-Democrat delegation to the House. See this post for some thoughts on how Torres Small carried New Mexico’s second district in 2018.

US Senate

Next we consider the political affiliations & ideologies of US Senators from New Mexico since 1947. I have discussed VoteView’s political ideology scores in previous posts (eg), and have also demonstrated their derivation using roll call data from New Mexico’s 53rd State Legislature as an example.

Here we utilize Nokken-Poole political ideology scores, which are congress-specific scores. These data are not available via the Rvoteview package; instead, we download these scores directly from the VoteView website.

voteview_nokken_poole <-
  read.csv(url("https://voteview.com/static/data/out/members/HSall_members.csv"),
           stringsAsFactors = FALSE)
base1 <- voteview_nokken_poole %>%
  filter(!is.na(nokken_poole_dim1),
         chamber == 'Senate',
         party_code %in% c('100', '200') &
           congress > 79)
nm <- base1 %>%
  filter(state_abbrev == 'NM')
nm_labels <- nm %>%
  group_by(bioname) %>%
  filter(congress == max(congress)) %>%
  ungroup() %>%
  mutate(bioname = gsub(',.*$', '', bioname))

Below, the names and political ideology scores (first dimension) of Senators from New Mexico are presented relative to the median ideology for each major party historically. So, a history of fairly moderate representation in the Senate, dominated until recently by the split delegation of Domenici (R) and Bingaman (D), both of whom voted toward the center of their respective party medians. Udall (D) and Heinrich (D) may be drifting left, but this would reflect the state's shifting ideology in general.

base2 <- base1 %>%
  group_by(congress, party_code) %>%
  summarize(med = median(nokken_poole_dim1)) %>%
  ungroup() %>%
  ggplot() +
  geom_line(aes(x = congress, y = med, color = as.factor(party_code)),
            size = 1.25) +
  ylim(-.5, .5) +
  theme_minimal() +
  ggthemes::scale_color_stata() +
  theme(legend.position = 'none') +
  labs(title = "Median ideologies for major parties: Houses 80 to 116")
base2 +
  geom_line(data = nm,
            aes(x = congress, y = nokken_poole_dim1, color = as.factor(bioname)),
            linetype = 2) +
  geom_text(data = nm_labels,
            aes(label = bioname,
                x = congress, y = nokken_poole_dim1),
            size = 3)

New Mexico State Government

Finally, we consider the composition of New Mexico’s state government historically, namely the governorship and the bicameral house.

State Control in 2019

For a quick look at the current & aggregate composition of New Mexican state leadership relative to other US states, we access data made available by the National Conference of State Legislatures (NCSL).

x <- 'http://www.ncsl.org/Portals/1/Documents/Elections/Legis_Control_2019_August%2026th.pdf'
tmp <- tempfile()
curl::curl_download(x, tmp)

tab <- tabulizer::extract_tables(tmp,
                                 output = "data.frame",
                                 encoding = 'UTF-8')[[1]] %>%
  slice(2:51) %>%
  select(1, 4:5, 7:8, 11:13) %>%
  separate(col = `Total.House`, into = c('total_house', 'dem'), sep = ' ') %>%
  select(-total_house) %>%
  filter(X != 'Nebraska')

colnames(tab) <- c('state', 'sen_dem', 'sen_rep', 'house_dem', 'house_rep',
                   'legis_control', 'governor', 'state_control')

tab$state_control <- factor(tab$state_control, levels = c('Divided', 'Dem', 'Rep'))

The table below presents a sample of the NCSL data set, which includes the composition of bicameral state houses by party affiliation, as well as the political affiliation of current governors. The legis_control variable indicates whether the House & Senate are controlled by the same party or not. E.g., both houses in California are controlled by Democrats, while in Minnesota, Republicans control the House and Democrats the Senate, i.e., the houses are divided. The state_control variable indicates whether both houses and the governorship are controlled by the same party or not.

set.seed(89)
tab %>%
  sample_n(5) %>%
  knitr::kable() %>%
  kableExtra::kable_styling("striped")
state        sen_dem  sen_rep  house_dem  house_rep  legis_control  governor  state_control
California        29       11         61         18  Dem            Dem       Dem
Minnesota         32       35         75         59  Divided        Dem       Divided
New York          40       22        106         43  Dem            Dem       Dem
Connecticut       22       14         91         60  Dem            Dem       Dem
Iowa              18       32         46         53  Rep            Rep       Rep

As far as state legislatures go, very little division exists nationally. The table below summarizes the distribution of legislature control-types (Republican, Democrat, or divided) for the 49 states in the Union with bicameral state houses. Only Minnesota has a divided legislature.

Dem  Divided  Rep
 18        1   30

State control is a bit more divided nationally, with thirteen states having divided gubernatorial and legislative party control. Thirty-six have state government trifectas.

Divided  Dem  Rep
     13   14   22
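
For reference, a minimal sketch of how these two summaries can be reproduced from the NCSL table built above (assuming the tab data frame as constructed earlier):

table(tab$legis_control)   # legislature control: Dem / Divided / Rep
table(tab$state_control)   # control of both houses plus the governorship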

The map below illustrates government control by state and party affiliation for 2019. New Mexico, then, is one of fourteen states with a Democratic state government trifecta. This is new: during the previous eight years (at least), state control in New Mexico was divided.

flip_dets <- c('Divided', 'Dem', 'Rep')
flip_pal <- c('gray', '#395f81', '#9e5055')
names(flip_pal) <- flip_dets

uspoliticalextras::uspol_dvos_equalarea_sf$tile_outer %>%
  left_join(tab) %>%
  ggplot() +
  geom_sf(aes(fill = state_control),
          color = 'black',
          alpha = .75) +
  ggsflabel::geom_sf_text(data = uspoliticalextras::uspol_dvos_equalarea_sf$tile_inner,
                          aes(label = state_abbrev),
                          size = 3.5,
                          color = 'white') +
  theme_minimal() +
  scale_fill_manual(values = flip_pal) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        legend.title = element_blank(),
        legend.position = 'bottom') +
  labs(title = "State government control in 2019")

Governors historically

Next, we investigate the party affiliation of New Mexico’s governors since its statehood in 1912. These data are made available as a PDF from the New Mexico State Legislature website.

url <- 'https://www.nmlegis.gov/Publications/Handbook/leadership_since_statehood_17.pdf'
tmp <- tempfile()
curl::curl_download(url, tmp)
tab <- tabulizer::extract_tables(tmp, output = "data.frame")

xx <- c('year', 'speaker', 'pro_tem', 'governor', 'president')

tab1 <- lapply(tab, function(x) {
  colnames(x) <- xx
  return(x) }) %>%
  bind_rows() %>%
  mutate(governor = gsub('\\(died\\)|\\(resigned\\)', NA, governor),
         president = gsub('\\(died\\)|\\(resigned\\)', NA, president),
         president = gsub('^$', NA, president)) %>%
  tidyr::fill(governor, .direction = 'up') %>%
  tidyr::fill(president, .direction = 'up') %>%
  filter(!is.na(year)) %>%
  mutate(gov_party = gsub('^.*\\(([A-Z])\\)', '\\1', governor),
         pres_party = gsub('^.*\\(([A-Z])\\)', '\\1', president),
         governor = gsub('\\(.\\)', '', governor),
         president = gsub('\\(.\\)', '', president)) %>%
  select(-speaker, -pro_tem)

# Tabulizer is not perfect. PDF is not up-to-date.
hand_edits <- data.frame(
  year = c(1912, 1951:1953, 2000, 2018:2019),
  governor = c('McDonald', 'Horn', 'Horn', 'Stockton',
               'Sanchez', 'Martinez', 'Lujan Grisham'),
  president = c('Wilson', 'Truman', 'Truman', 'Eisenhower',
                'Clinton', 'Trump, D.', 'Trump, D.'),
  gov_party = c('D', 'D', 'D', 'R', 'D', 'R', 'D'),
  pres_party = c('D', 'D', 'D', 'R', 'D', 'R', 'R'))

tab1 <- tab1 %>% bind_rows(hand_edits) %>% arrange(year)

After some cleaning, a sample of our data set is presented below. Included are the names of sitting US Presidents and their political affiliation.

year  governor   president  gov_party  pres_party
1912  McDonald   Wilson     D          D
1913  McDonald   Wilson     D          D
1914  McDonald   Wilson     D          D
1915  McDonald   Wilson     D          D
1916  McDonald   Wilson     D          D
1917  C de Baca  Wilson     D          D

The table below summarizes the total number of years (since 1912) that each party has held the governor’s office, cross-tabbed with the political affiliation of the US President during the same time period. First to note is that Democrats have held gubernatorial control in 70/108 years.

Second to note is that in 59 (39 + 20) of those years the New Mexico governor shared party affiliation with the sitting US President; in 49 (18 + 31) of those years, the two were divided. Roughly a 50-50 split historically, which is pretty interesting.

table(tab1$gov_party, tab1$pres_party) %>%
  data.frame() %>%
  rename(Gov_Party = Var1, Pres_Party = Var2) %>%
  spread(Pres_Party, Freq) %>%
  knitr::kable() %>%
  kableExtra::kable_styling("striped", full_width = F) %>%
  kableExtra::add_header_above(c(" " = 1, "Pres_Party" = 2))
             Pres_Party
Gov_Party     D    R
D            39   31
R            18   20

In rank order by total years, then:

  • [Dem Gov/Dem Pres (39)] > [Dem Gov/Rep Pres (31)] > [Rep Gov/Rep Pres (20)] > [Rep Gov/Dem Pres (18)]
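
As a quick sanity check on these tallies, the shared- versus divided-control counts quoted above can be recomputed directly from tab1 (a sketch, assuming the gov_party and pres_party columns built earlier):

with(tab1, sum(gov_party == pres_party))  # shared affiliation: 59 years
with(tab1, sum(gov_party != pres_party))  # divided: 49 years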

The plot below illustrates the political affiliation of New Mexico governors and US presidents since statehood in 1912. Lots of back and forth, for sure. It would seem that New Mexicans hedge their bets when it comes to gubernatorial elections, tempering federal leadership with state leadership from the opposing party, with the exception of the ~FDR years.

tab1 %>%
  mutate(gov_val = ifelse(gov_party == 'D', .75, -.75),
         pres_val = ifelse(pres_party == 'D', 1, -1)) %>%
  ggplot() +
  geom_line(aes(x = year, y = gov_val), size = 1.25, color = '#b0bcc1') +
  geom_line(aes(x = year, y = pres_val), size = 1.25, color = '#55752f') +
  ylim(-1.25, 1.25) +
  theme_minimal() +
  annotate("text", x = 1920, y = 1.25, label = "DEMOCRAT") +
  annotate("text", x = 1920, y = -1.25, label = "REPUBLICAN") +
  annotate("text", x = 1914, y = 1.05, label = "President") +
  annotate("text", x = 1914, y = .8, label = "Governor") +
  theme(legend.position = 'none',
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_continuous(breaks = seq(1912, 2018, 4)) +
  labs(title = "Presidential & Gubernatorial Party Affiliation by Year")

State legislature composition historically

Lastly, we investigate the party-based composition of the New Mexico state houses historically. Again, we access these data via a PDF made available at the New Mexico State Legislature website.

url_state <- 'https://www.nmlegis.gov/Publications/Handbook/political_control_17.pdf'
tmp <- tempfile()
curl::curl_download(url_state, tmp)
tab <- tabulizer::extract_tables(tmp,
                                 output = "data.frame",
                                 encoding = 'UTF-8')

current <- data.frame(year = c(2019, 2019),
                      count = c(26, 16, 46, 24),
                      house = c('senate', 'senate', 'house', 'house'),
                      party = c('dem', 'rep', 'dem', 'rep'))

xx <- c('year', 'house', 'house_dem', 'house_rep', 'house_other',
        'senate_dem', 'senate_rep', 'senate_other')

tab2 <- lapply(tab, function(x) {
  x <- x[, c(1:4, 6, 8:9, 11)]
  colnames(x) <- xx
  x$house_other <- as.numeric(x$house_other)
  x$senate_other <- as.numeric(x$senate_other)
  return(x) }) %>%
  bind_rows() %>%
  filter(year %% 2 == 1 | house == '31st,2nd') %>%
  filter(!grepl('SS', house)) %>%
  mutate(house = gsub(',.*$', '', house)) %>%
  gather(key = 'type', value = 'count', -year, -house) %>%
  separate(type, into = c('house', 'party'), sep = '_') %>%
  mutate(count = ifelse(is.na(count), 0, count)) %>%
  bind_rows(current) %>%
  group_by(year, house) %>%
  mutate(per = round(count / sum(count), 2)) %>%
  ungroup()

tab2$party <- factor(tab2$party, levels = c('other', 'dem', 'rep'))

Per the plot below, then, a post-Depression-era stronghold for Democrats, with a couple of exceptions, most recently in the 52nd House (which took office in 2015). A bit of a different story relative to the state's swingier tendencies in the other offices considered here.

flip_dets <- c('other', 'dem', 'rep')
flip_pal <- c('#b0bcc1', '#395f81', '#9e5055')
names(flip_pal) <- flip_dets

tab2 %>%
  ggplot(aes(x = year, y = per, fill = party)) +
  geom_area(alpha = 0.85, color = 'white', stat = 'identity') +
  geom_hline(yintercept = .50, color = 'white', linetype = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.position = "none") +
  scale_fill_manual(values = flip_pal) +
  scale_x_continuous(breaks = seq(1913, 2019, 8)) +
  xlab('') +
  ggtitle('Composition of New Mexico state houses since statehood') +
  facet_wrap(~house)

Summary

At present, then, New Mexico is a blue state. While Trump rallied in Rio Rancho in September in hopes of capturing the state in 2020, New Mexico seems to have lost some of the SWING that has defined the state through much of its history, and that made it an excellent bellwether for presidential election winners.

The state supported Clinton in 2016, sends two Democrats to the Senate, 3/3 Democrats to the House, and has a Democratic state government trifecta. These things are fluid for sure, but the state’s demographics continue to move the state’s political ideology leftwards. So, we’ll see. Open data. Open government.



dang 0.0.11: Small improvements


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new release of what may be my most minor package, dang, is now on CRAN. The dang package regroups a few functions of mine that had no other home: lsos() from a StackOverflow question from 2009 (!!) is one; the overbought/oversold price band plotter from an older blog post is another. More recently added were helpers for data.table to xts conversion and a git repo root finder.
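
A hedged usage sketch of those helpers follows (the function names are as documented; the argument-free calls assume their defaults):

library(dang)
lsos()          # list the largest objects in the current session
getGitRoot()    # root of the enclosing git repository (defaults assumed)
inGit()         # is the working directory inside a git repo?
isConnected()   # is a network connection available?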

Some of these functions (like lsos()) were hanging in my .Rprofile, others just lived in scripts, so some years ago I started to collect them in a package, and as of February this is now on CRAN too, for reasons that are truly too bizarre to go into. It's a weak and feeble variant of the old Torvalds line about backups and ftp sites …

As I didn’t blog about the 0.0.10 release, the NEWS entry for both follows:

Changes in version 0.0.11 (2019-10-24)

  • New functions getGitRoot, inGit and isConnected.

  • Improved function as.data.table.xts.

Changes in version 0.0.10 (2019-02-10)

  • Initial CRAN release. See ChangeLog for earlier changes.

Courtesy of CRANberries, there is a comparison to the previous release. For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Pivoting tidily


[This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One of the fun bits of my job is that I have actual time dedicated to helping colleagues and grad students with statistical or computational problems. Recently I've been helping one of our Lab Instructors with some data from their Plant Physiology Lab course. Whilst I was writing some R code to import the raw data for the lab from an Excel sheet, it occurred to me that this would be a good excuse to look at the new pivot_longer() and pivot_wider() functions from the tidyr package. In this post I show how these new functions facilitate common data processing steps; I was personally surprised how little data wrangling was actually needed in the end to read in the data from the lab.

In the lab course the students conduct an experiment to study the effect of the plant hormone gibberellin on plant growth. Over a number of weeks the students apply gibberellic acid (in two concentrations) or daminozide, a gibberellic acid antagonist, to the tips of the leaves of pea plants that are grown in a growth chamber with a 16-hour photoperiod. The students work in groups, with some of the groups growing the wild-type cultivar, whilst others work with a mutant dwarf cultivar. Each group has six plants per treatment level, and every seven days the students measure the height of each plant and the number of internodes that each plant has. On the last day of the experiment the plants are harvested and their fresh weight measured.

The pea plants from the 2019 Plant Physiology Lab course, toward the end of the experimental period

Originally the data were recorded in a less than satisfactory way — let’s just say the original data sheets would have been good candidates for one of Jenny Bryan’s talks on spreadsheets. After being cleaned up a bit, we have something that looks like this in Excel

Raw data in the Excel Workbook

This isn’t perfect as we have data in the column names — the numbers after the colons are the day of observation — but it is a pretty simple layout for the students to complete, and this is how we decided to ask the students to record the data during the 2019 lab course, so this is what we have to work with going forward.

Ultimately we want to be able to refer to columns named height, internodes, etc depending on the statistical analysis the students will do, and we’re going to need a column with the observation days in it.

Pivoting

If you’re not familiar with pivoting, it is important to realize that we can store the same data in a wide rectangle or a long (or tall) rectangle

Examples of wide and long representations of the same data. Source: Garrick Aden-Buie's (@grrrck) Tidy Animated Verbs

The same information is stored in both the long and wide representations, but the two representations differ in how useful they are for certain types of operation or how easily they can be used in a statistical analysis. It’s also worth noting that there are more than just long or wide representations of the data; as we’ll see shortly, the long representation of the Plant Physiology Lab data is too general and we’ll need to arrange the data in a slightly wider form.
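
To make the idea concrete before we touch the lab data, here is a toy example (not the lab data) of the same values stored wide and then pivoted long:

wide <- data.frame(id = 1:2, t0 = c(5, 7), t1 = c(9, 11))
long <- tidyr::pivot_longer(wide, -id, names_to = "time", values_to = "value")
long   # 4 rows: one per id/time combination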

Moving between long and wide representations is known as pivoting. The animation below shows the general idea of how the cells in one format are rearranged into the other format, with the relevant metadata that doesn't get rearranged being extended or reduced as needed so we don't lose any information.

Pivoting between wide and long representations of the same data. Source: Garrick Aden-Buie's (@grrrck) Tidy Animated Verbs, modified by Mara Averick (@dataandme)

With the lab data I showed earlier, we’re going to need to pivot from the original wide format into a longer format — just as the animation above shows. As we want to output an object that is longer than the input we will use the pivot_longer() function.

To start we will need to import the data from the .xls sheet, which I’ll do using the readxl package

library('curl')     # download files
library('readxl')   # read from Excel sheets
library('tidyr')    # data processing
library('dplyr')    # mo data processing
library('forcats')  # mo mo data processing
library('ggplot2')  # plotting

theme_set(theme_bw())

## Load Data
tmp <- tempfile()
curl_download("https://github.com/gavinsimpson/plant-phys/raw/master/f18ph.xls",
              tmp)
plant <- read_excel(tmp, sheet = 1)

We have to download the data first — which I do using curl_download() from the curl package — because read_excel() doesn't know how to read from URLs at the moment.

Now we have our plant data within R, stored in a data frame

plant
# A tibble: 24 x 12
   treatment cultivar plantid `height:0` `internodes:0` `height:7`
 1 control   wt             1        235              4        525
 2 control   wt             2        182              3        391
 3 control   wt             3        253              3        452
 4 control   wt             4        151              3        350
 5 control   wt             5        195              3        335
 6 control   wt             6        187              4        190
 7 ga10      wt             1        250              4        458
 8 ga10      wt             2        220              4        345
 9 ga10      wt             3        180              2        300
10 ga10      wt             4        230              4        510
# … with 14 more rows, and 6 more variables: `internodes:7`,
#   `height:14`, `internodes:14`, `height:21`,
#   `internodes:21`, `freshwt:21`

To go to the long representation we have to tell pivot_longer() a couple of bits of information

  • the name of the object to pivot,
  • which columns contain the data we want to pivot (or alternatively which columns not to pivot if that is easier),
  • the name we want to call the new column that will contain the variable name information from the original data, and
  • optionally, the name of the new column that will contain the data values. The default is to name this column value so you don’t need to change this if you’re happy with that.

So, to get our wide plant data into a longer format we would do this

pivot_longer(plant, -(1:3), names_to = "variable")
# A tibble: 216 x 5
   treatment cultivar plantid variable       value
 1 control   wt             1 height:0       235
 2 control   wt             1 internodes:0     4
 3 control   wt             1 height:7       525
 4 control   wt             1 internodes:7     5
 5 control   wt             1 height:14      810
 6 control   wt             1 internodes:14   10
 7 control   wt             1 height:21     1090
 8 control   wt             1 internodes:21   14
 9 control   wt             1 freshwt:21       7.2
10 control   wt             2 height:0       182
# … with 206 more rows

The -(1:3) is short-hand for excluding the first three columns of plant from the pivot. Here, we’re creating a new variable called (imaginatively!) variable. As you can see we now have our data in a much longer representation, with a single column containing all of the observations that this group of students made.

However, we have a bit of a problem: we have the added complication that some of the column names contain actual data that we want to use. While we have a column containing this information — it is not lost — the observation day or variable name information is not directly accessible in this format. What we could do is split the strings in this new variable column on “:” and form two new columns from there.

Thankfully, this is such a common operation that pivot_longer() (and its predecessor, gather()) can do this for you — all you have to do is tell pivot_longer() what character to split on, and what names you want for the columns that result from splitting the strings up.

pivot_longer(plant, -(1:3), names_sep = ":", names_to = c("variable", "day"))
# A tibble: 216 x 6
   treatment cultivar plantid variable   day    value
 1 control   wt             1 height     0      235
 2 control   wt             1 internodes 0        4
 3 control   wt             1 height     7      525
 4 control   wt             1 internodes 7        5
 5 control   wt             1 height     14     810
 6 control   wt             1 internodes 14      10
 7 control   wt             1 height     21    1090
 8 control   wt             1 internodes 21      14
 9 control   wt             1 freshwt    21       7.2
10 control   wt             2 height     0      182
# … with 206 more rows

The changes we made above were to specify names_sep with the correct separator, and we pass a vector of new column names to names_to rather than the single name we provided previously.

Those of you with good eyes may have noticed another problem that we will encounter if we stopped here. The day variable that was just created is stored as a character vector. It is likely that we'll want this information stored as a number if we're going to analyze the data. We can do the required conversion within the pivot_longer() call by specifying what the developers have started calling a prototype across many of the tidyverse packages. A prototype is an object that has the same properties that you want objects built from that prototype to take. Here we want the day variable as a column of integer numbers, so we set the prototype for this vector to integer() using the names_ptypes argument

plant <- pivot_longer(plant, -(1:3),
                      names_sep = ":",
                      names_to = c("variable", "day"),
                      names_ptypes = list(day = integer()))
plant
# A tibble: 216 x 6
   treatment cultivar plantid variable     day  value
 1 control   wt             1 height         0  235
 2 control   wt             1 internodes     0    4
 3 control   wt             1 height         7  525
 4 control   wt             1 internodes     7    5
 5 control   wt             1 height        14  810
 6 control   wt             1 internodes    14   10
 7 control   wt             1 height        21 1090
 8 control   wt             1 internodes    21   14
 9 control   wt             1 freshwt       21    7.2
10 control   wt             2 height         0  182
# … with 206 more rows

Notice that we pass names_ptypes a named list of prototypes, with the list name matching one or more of the variables listed in names_to.

Now we have successfully wrangled the data into a long format and recovered the information hidden in the column names of the original data file. However, as it stands, we can’t easily use the data in this format in a statistical model. We want the students on the course to analyze the data to estimate what effects the treatments have on the height of the plants over the course of the experiment. With the data in this long format we don’t have a variable height containing just the height of the plants that we can refer to in a linear model say.

What we want is to create new columns for height, internodes and freshwt and pivot the value data out into those columns. As we’re adding columns we’re making the data wider, so we can use the pivot_wider() function to do what we want. Now we need to tell pivot_wider()

  • where to take the names of the new variables from — here, that's the variable column, and
  • where to take the data values from that are going to be put into these new columns — here, that’s the value column
plant <- pivot_wider(plant, names_from = variable, values_from = value)
plant
# A tibble: 96 x 7
   treatment cultivar plantid   day height internodes freshwt
 1 control   wt             1     0    235          4    NA
 2 control   wt             1     7    525          5    NA
 3 control   wt             1    14    810         10    NA
 4 control   wt             1    21   1090         14     7.2
 5 control   wt             2     0    182          3    NA
 6 control   wt             2     7    391          5    NA
 7 control   wt             2    14    615          9    NA
 8 control   wt             2    21    810         12     3.8
 9 control   wt             3     0    253          3    NA
10 control   wt             3     7    452          6    NA
# … with 86 more rows

As with other tidyverse package, we don’t have to quote the names of the columns we want to pull data from.

There are a couple of other things we need to do to make the data fully useful:

  1. it would be helpful to have a unique identifier for each individual plant — currently the plantid is just the values 1:6 repeated for each treatment group,
  2. it would also be good practice to convert treatment into a factor, and to set the control treatment as the reference level against which the other treatment levels will be compared — if we didn’t do that, the b9 level (daminozide treatment) would be the reference level

We can do those data processing steps quite easily now we have the data imported and arranged nicely the way we want them

plant <- mutate(plant,
                id = paste0(cultivar, "_", treatment, "_", plantid),
                treatment = fct_relevel(treatment, 'control'))
plant
# A tibble: 96 x 8
   treatment cultivar plantid   day height internodes freshwt id
 1 control   wt             1     0    235          4    NA   wt_control_1
 2 control   wt             1     7    525          5    NA   wt_control_1
 3 control   wt             1    14    810         10    NA   wt_control_1
 4 control   wt             1    21   1090         14     7.2 wt_control_1
 5 control   wt             2     0    182          3    NA   wt_control_2
 6 control   wt             2     7    391          5    NA   wt_control_2
 7 control   wt             2    14    615          9    NA   wt_control_2
 8 control   wt             2    21    810         12     3.8 wt_control_2
 9 control   wt             3     0    253          3    NA   wt_control_3
10 control   wt             3     7    452          6    NA   wt_control_3
# … with 86 more rows

Here I just pasted together the cultivar, treatment and plantid information into a unique id for each individual plant. This won’t be used directly by the students in any analysis they do as this is a second year course and they don’t know about mixed models (yet), but it is handy to have this id available for plotting. The treatment variable is converted to a factor and the reference level set to be “control” using the fct_relevel() function from the forcats package.
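
A quick check that the releveling worked as intended (the level names are assumed from the treatment codes in the sheet):

levels(plant$treatment)   # 'control' should now be listed first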

The students will do one other step before proceeding to look at the data — each sheet in the .xls file contains observations from a single group and hence a single cultivar, and we want the students to compare cultivars. So they will repeat the steps above to import a second sheet of data containing data from the cultivar they didn’t work with, and then stick the two data sets together. But I’ll spare you having to repeat that.

If you’re interested, this is what the data look like, for a single cultivar and single group

ggplot(plant, aes(x = day, y = height, group = id, colour = treatment)) +
  geom_point() +
  geom_line() +
  labs(y = 'Height (mm)', x = 'Day', colour = 'Treatment')
Plot of the plant growth data

(and now you can see why I needed a unique plant identifier even though the students will essentially ignore this clustering in the data when they analyse it.)

The .xls file we downloaded at the start of the script contains multiple sheets all formatted the same way, so we could pull all the data into one big analysis if you wanted, but in the lab we're just giving the students one set of wild-type and mutant cultivars. I'm grateful to Dr. Maria Davis, the lab instructor for the course, for making the data from the course available to anyone who wants to use it — if you do use it, be sure to give Maria and the 2018 cohort of BIOL266 Plant Physiology students at the University of Regina an acknowledgement.

If you’re interested in the statistical analyses that we’ll be getting the students to do in the lab, I have an (at the time of writing this, almost finished) Rmd file in the GitHub repo for the lab course with all the instructions. It’s pretty simple ANOVA and ANCOVA analyses, but we do get the students to do post hoc testing using the excellent emmeans package, if you’re interested.

Finally, none of the data wrangling I did above is that complex, and I certainly didn't need to use tidyr and dplyr etc to achieve the result I wanted. It is quite trivial to do this pivoting and wrangling in base R; we could just use the reshape() function, strsplit(), etc. However, if you've ever used reshape() you'll know that the argument names for that function make no sense to anyone except perhaps the person that wrote the function. The real advantage of doing the wrangling using tidyr and dplyr is that we end up with code that is much more easy to read and understand, which is very important for students on these courses, who will have had little to no exposure to programming and related data science techniques.
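
For the curious, here is a rough base R equivalent of the long pivot (a sketch only: plant_wide stands for the originally imported sheet, and the varying column indices 4:12 are assumed from its layout):

plant_wide <- as.data.frame(plant_wide)   # reshape() wants a plain data frame
plant_long <- reshape(plant_wide,
                      direction = "long",
                      varying   = 4:12,   # the 'height:0', ... columns
                      v.names   = "value",
                      timevar   = "variable",
                      times     = names(plant_wide)[4:12],
                      idvar     = c("treatment", "cultivar", "plantid"))
# the 'variable' column would then still need strsplit()-ing on ':'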

Anyway, happy pivoting!


RAthena 1.3.0 has arrived


[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Recap:

RAthena is an R package that interfaces with Amazon Athena. However, it doesn't use the standard ODBC and JDBC drivers that AWR.Athena and metis use. Instead, RAthena utilises Boto3, Amazon's Python SDK (software development kit), via the reticulate package, which provides an interface into Python. What this means is that RAthena doesn't require any driver installation or setup, which can be particularly difficult when you are setting up ODBC drivers and are not familiar with how ODBC works on your current operating system. If you wish to use ODBC, RStudio has provided a good user guide, Setting up ODBC Drivers, to help set up ODBC drivers on your system. If you do not wish to go down that route, RAthena might be a good option for you.
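
For the unfamiliar, connecting looks like standard DBI usage; a hedged sketch (the staging bucket below is a placeholder you would replace with your own):

library(DBI)
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://your-athena-query-results/")
dbListTables(con)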

New Features in RAthena:

Anyway, getting back to RAthena and what the new update provides. One of the key changes in RAthena is the method of transferring data to and from AWS Athena: RAthena now utilises data.table for this process. The reason for this change is the raw speed of data.table. When transferring data to and from AWS Athena, the last thing you want is a bottleneck in R just preparing the data before it is even transferred to AWS Athena. This bottleneck can easily be 50-100x longer without the use of data.table.

The next change concerns bigint, and how it is converted between AWS Athena and R. In the past, RAthena would convert integer64 to bigint when writing to AWS Athena; however, it would then read bigint back into R as a normal integer, which is constrained to 32-bit values. This has now been fixed: when reading bigint from AWS Athena, RAthena will convert it into integer64.
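
A small illustration of why this matters (bit64 is the package that provides the integer64 class):

.Machine$integer.max                      # 2147483647, the 32-bit ceiling
bit64::as.integer64("9007199254740993")   # survives the round trip as integer64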

Sum Up:

RAthena now provides a faster method for reading and writing data from AWS Athena (thanks, data.table), along with correct handling of AWS Athena's bigint. So please give RAthena a try and let me know what you think of the package. Suggestions/bugs/enhancements are always welcome and will help the package improve: https://github.com/DyfanJones/RAthena/issues.

Installation methods:

Just in case you are not aware, RAthena is available on CRAN and GitHub.

CRAN:

install.packages("RAthena")

GitHub development version:

remotes::install_github("dyfanjones/RAthena")


A Comprehensive Introduction to Command Line for R Users


[This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this tutorial, you will be introduced to the command line. We have selected a set of commands we think will be useful to a wide range of audiences. We have created an RStudio Cloud Project to ensure that all readers are using the same environment while going through the tutorial. Our goal is that, after completing this tutorial, readers are able to use the shell for version control, manage cloud services (like deploying your own Shiny server), execute commands from R & RMarkdown, and execute R scripts in the shell. Apart from learning shell commands, the tutorial will also focus on

  • exploring R release names
  • mapping shell commands to R functions
  • RStudio Terminal
  • executing shell commands from R using system2() or processx::run() (see the sketch after this list)
  • execute shell commands in RMarkdown
  • execute R scripts in the shell
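
Here is a minimal sketch of that last idea, run from an R session (system2() is base R, run() is from the processx package; the ls flags are just an example):

system2("ls", args = "-lh", stdout = TRUE)   # capture shell output in R
processx::run("ls", c("-lh"))$stdout         # same idea via processx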

If you want a deeper understanding of using command line for data science, we suggest you read Data Science at the Command Line. Software Carpentry too has a lesson on shell. We have listed more references at the end of the tutorial for the benefit of the readers.

Resources

Below are the links to all the resources related to this post:

You can try our free online course Command Line Basics for R Users if you prefer to learn through self-paced online courses, or our ebook if you like to read the tutorial in a book format.

Introduction

What is Shell/Terminal?

Shell is a text-based application for viewing, handling & manipulating files. It takes in commands and passes them on to the operating system. It is also known as

  • CLI (Command Line Interface)
  • Bash (Bourne Again Shell)
  • Terminal

It is sufficient to know a handful of commands to get started with the shell.

Launch Terminal

Although we will use the terminal in RStudio on RStudio Cloud, we should still know how to launch the terminal in different operating systems.

mac

Applications -> Utility -> Terminal

Windows

Option 1

Go to the Start Menu or screen and enter Command Prompt in the search field.

Option 2

Start Menu -> Windows System -> Command Prompt

Option 3

Hold the Windows key and press the R key to get a Run window. Type cmd in the box and click on the OK button.

Linux

  • Applications -> Accessories -> Terminal
  • Applications -> System -> Terminal

Windows Subsystem for Linux

If you want to use bash on Windows, try the Windows Subsystem for Linux. It only works on 64-bit Windows 10. Below are the steps to enable the Windows Subsystem for Linux:

Step 1 – Enable Developer Mode

To enable Developer Mode open the Settings app and head to Update & Security > For Developers. Press the Developer Mode switch.

Step 2 – Enable Windows Subsystem for Linux

To enable the Windows Subsystem for Linux (Beta), open the Control Panel, click Programs and Features, and click Turn Windows Features On or Off in left side bar under Programs and Features. Enable the Windows Subsystem for Linux (Beta) option in the list here and click OK. After you do, you’ll be prompted to reboot your computer. Click Restart Now to reboot your computer and Windows 10 will install the new feature.

Step 3 – Install your Linux Distribution of Choice

Open the Microsoft store and choose your favorite Linux distribution.

In the distro’s page, click on “Get”.

Launch the distro from the Start Menu.

You can learn more about the Windows Subsystem for Linux here.

RStudio Terminal

RStudio introduced the terminal with version 1.1.383. The terminal tab is next to the console tab. If it is not visible, use any of the below methods to launch it

  • Shift + Alt + T
  • Tools -> Terminal -> New Terminal

Note, the terminal depends on the underlying operating system. To learn more about the RStudio terminal, read this article or watch this webinar. In this book, we will use the RStudio terminal on RStudio Cloud to ensure that all users have access to Linux bash. You can try all the commands used in this book on your local system as well, with the exception of Windows users, since the commands here assume Linux bash.

Prompt

As soon as you launch the terminal, you will see the hostname, machine name and the prompt. For mac & Linux users, the prompt is $; for Windows users, it is >.

OS        Prompt
macOS     $
Linux     $
Windows   >

Get Started

To begin with, let us learn to display

  • basic information about the user
  • the current date & time
  • the calendar
  • and clear the screen.
Command   Description
whoami    Who is the user?
date      Get date, time and timezone
cal       Display calendar
clear     Clear the screen

whoami prints the effective user id i.e. the name of the user who runs the command. Use it to verify the user as which you are logged into the system.

whoami
## aravind

date will display or change the value of the system’s time and date information.

date
## Sat Oct 26 11:37:36 IST 2019

cal will display a formatted calendar and clear will clear all text on the screen and display a new prompt. You can clear the screen by pressing Ctrl + L as well.

cal

In R, we can get the user information from Sys.info() or whoami() from the whoami package. The current date & time are returned by Sys.Date() & Sys.time(). To clear the R console, we use Ctrl + L.

Command   R
whoami    Sys.info() / whoami::whoami()
date      Sys.Date() / Sys.time()
cal       -
clear     Ctrl + L
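
The R-side equivalents can be run directly in the console (whoami is a CRAN package; everything else is base R):

Sys.info()[["user"]]       # or whoami::whoami()
Sys.Date()                 # today's date
Sys.time()                 # date-time with timezone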

Help/Documentation

Before we proceed further, let us learn to view the documentation/manual pages of the commands.

Command   Description
man       Display manual pages for a command
whatis    Single line description of a command

man is used to view the system’s reference manual. Let us use it to view the documentation of the whatis command which we will use next.

man whatis
## WHATIS(1)                     Manual pager utils                     WHATIS(1)
##
## NAME
##        whatis - display one-line manual page descriptions
##
## SYNOPSIS
##        whatis [-dlv?V] [-r|-w] [-s list] [-m system[,...]] [-M path]
##               [-L locale] [-C file] name ...
##
## DESCRIPTION
##        Each manual page has a short description available within it.  whatis
##        searches the manual page names and displays the manual page
##        descriptions of any name matched.
##
##        name may contain wildcards (-w) or be a regular expression (-r).
##        Using these options, it may be necessary to quote the name or escape
##        (\) the special characters to stop the shell from interpreting them.
##
##        Index databases are used during the search, and are updated by the
##        mandb program.  Depending on your installation, this may be run by a
##        periodic cron job, or may need to be run manually after new manual
##        pages have been installed.
##
## OPTIONS
##        -d, --debug       Print debugging information.
##        -v, --verbose     Print verbose warning messages.
##        -r, --regex       Interpret each name as a regular expression.
##        -w, --wildcard    Interpret each name as a pattern containing shell
##                          style wildcards.
##        -l, --long        Do not trim output to the terminal width.
##        -s list           Search only the given manual sections.
##        -m system[,...]   Search another operating system's manual pages.
##        -M path           Specify an alternate set of manual page hierarchies.
##        -L locale         Override the locale used for the search.
## ...

whatis displays short manual page descriptions (each manual page has a short description available within it).

whatis ls
## ls (1)               - list directory contents

You will find tldr.sh very useful while exploring new commands and there is a related R package, tldrrr as well.

# devtools::install_github("kirillseva/tldrrr")
tldrrr::tldr("pwd")
## pwd
##
## Print name of current/working directory.
##
## • Print the current directory:
##
##   pwd
##
## • Print the current directory, and resolve all symlinks (i.e. show the "physical" path):
##
##   pwd -P



Navigating File System

In this section, we will learn commands that will help us

  • navigate between different folders/directories
  • return current working directory
  • list all the files & folders in a directory
  • create and delete directories
Command   Description
pwd       Print working directory
ls        List directory contents
cd        Change current working directory
mkdir     Create directory
rmdir     Remove/delete directory

pwd displays the name of the present working directory.

pwd
## /mnt/j/R/Others/blogs/content/post/cline

ls displays information about files and directories in the current directory along with their associated metadata such as

  • size
  • ownership
  • modification date

With no options, it will list the files and directories in the current directory, sorted alphabetically.

ls
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject1
## myproject2
## myproject3
## myproject4
## package_names.txt
## pkg_names.txt
## r
## release_names.tar
## release_names.tar.gz
## release_names.txt
## release_names_18.txt
## release_names_19.txt
## sept_15.csv.gz
## urls.txt
## zip_example.zip

cd (change directory) changes the current working directory. It is among the most used commands as it allows the user to move around the file system.

cd r
pwd
## /mnt/j/R/Others/blogs/content/post/cline/r

mkdir will create new directory. It will allow you to set file mode (permissions associated with the directory) i.e. who can open/modify/delete the directory.

mkdir rfiles
ls
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject1
## myproject2
## myproject3
## myproject4
## package_names.txt
## pkg_names.txt
## r
## release_names.tar
## release_names.tar.gz
## release_names.txt
## release_names_18.txt
## release_names_19.txt
## rfiles
## sept_15.csv.gz
## urls.txt
## zip_example.zip

rmdir will remove empty directories from the file system. It can be used to remove multiple empty directories as well. If the directory is not empty, rmdir will not remove it and instead display a warning that the directory is not empty.

rmdir rfiles
ls
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject1
## myproject2
## myproject3
## myproject4
## package_names.txt
## pkg_names.txt
## r
## release_names.tar
## release_names.tar.gz
## release_names.txt
## release_names_18.txt
## release_names_19.txt
## sept_15.csv.gz
## urls.txt
## zip_example.zip
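
Both operations have base R equivalents, so you can stay in R if you prefer (dir.create() and unlink() are base R):

dir.create("rfiles")                  # mkdir rfiles
unlink("rfiles", recursive = TRUE)    # rmdir rfiles (works even if non-empty)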

Change Working Directory

Let us focus a bit more on changing working directory. The below table shows commands for changing working directory to

  • up one level
  • previous working directory
  • home directory
  • and root directory
Command   Description
cd .      Stay in the current directory
cd ..     Go up one level
cd -      Go to previous working directory
cd ~      Change directory to home directory
cd /      Change directory to root directory

All files and directories stem from one main directory, the root directory. All the other directories in the system are sub-directories of the root directory and the root directory has no parent directory. It is represented by a single slash (/). cd / will change the current working directory to the root directory. In RStudio Cloud, use cd and cd .. to navigate back to the current working directory. If you get confused, close the terminal and relaunch it.

cd /
pwd
## /

The parent directory i.e. the directory one level up from the current directory which contains the directory we are in now is represented by two dots (..). cd .. will change us into the parent directory of the current directory.

cd ..
pwd
## /mnt/j/R/Others/blogs/content/post

The home directory is the directory we are placed in, by default, when we launch a new terminal session. It is represented by the tilde (~). In RStudio Cloud, use cd and cd .. to navigate back to the current working directory. If you get confused, close the terminal and relaunch it.

cd ~
pwd
## /home/aravind

To change into the previous working directory, use cd -.

cd -
pwd
## /mnt/j/R/Others/blogs/content/post
## /mnt/j/R/Others/blogs/content/post

The current working directory is represented by a single dot (.). cd . will change us into the current directory i.e. it will do nothing.
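
The same navigation is available from R itself, where getwd() plays the role of pwd and setwd() the role of cd:

getwd()        # pwd
setwd("..")    # cd ..  (up one level)
setwd("~")     # cd ~   (home directory)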

List Directory Contents

ls will list the contents of a directory. Using different arguments, we can

  • list hidden files
  • view file permissions, ownership, size & modification date
  • sort by size & modification date
Command   Description
ls        List directory contents
ls -l     List files one per line
ls -a     List all files including hidden files
ls -la    Display file permissions, ownership, size & modification date
ls -lh    Long format list with size displayed in human readable format
ls -lS    Long format list sorted by size
ls -ltr   Long format list sorted by modification date

List files one per line

ls -l
## total 31108
## -rwxrwxrwx 1 aravind aravind       12 Oct 25 23:23 analysis.R
## -rwxrwxrwx 1 aravind aravind      430 Oct 25 23:23 bash.R
## -rwxrwxrwx 1 aravind aravind      145 Oct 25 23:23 bash.Rmd
## -rwxrwxrwx 1 aravind aravind      574 Oct 25 23:23 bash.html
## -rwxrwxrwx 1 aravind aravind    16242 Oct 25 23:23 bash.sh
## -rwxrwxrwx 1 aravind aravind       35 Oct 25 23:23 imports_blorr.txt
## -rwxrwxrwx 1 aravind aravind       34 Oct 25 23:23 imports_olsrr.txt
## -rwxrwxrwx 1 aravind aravind    39501 Oct 25 23:23 lorem-ipsum.txt
## -rwxrwxrwx 1 aravind aravind     9291 Oct 25 23:23 main_project.zip
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myfiles
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 mypackage
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject1
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject2
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject3
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject4
## -rwxrwxrwx 1 aravind aravind     1498 Oct 25 23:23 package_names.txt
## -rwxrwxrwx 1 aravind aravind     1082 Oct 25 23:23 pkg_names.txt
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 r
## -rwxrwxrwx 1 aravind aravind    10240 Oct 25 23:23 release_names.tar
## -rwxrwxrwx 1 aravind aravind      632 Oct 25 23:23 release_names.tar.gz
## -rwxrwxrwx 1 aravind aravind      546 Oct 25 23:23 release_names.txt
## -rwxrwxrwx 1 aravind aravind       65 Oct 25 23:23 release_names_18.txt
## -rwxrwxrwx 1 aravind aravind       53 Oct 25 23:23 release_names_19.txt
## -rwxrwxrwx 1 aravind aravind 31754998 Oct 25 23:23 sept_15.csv.gz
## -rwxrwxrwx 1 aravind aravind      157 Oct 25 23:23 urls.txt
## -rwxrwxrwx 1 aravind aravind     4398 Oct 25 23:23 zip_example.zip

List all files including hidden files

ls -a
## .
## ..
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject1
## myproject2
## myproject3
## myproject4
## package_names.txt
## pkg_names.txt
## r
## release_names.tar
## release_names.tar.gz
## release_names.txt
## release_names_18.txt
## release_names_19.txt
## sept_15.csv.gz
## urls.txt
## zip_example.zip

Display file permissions, ownership, size & modification date

ls -la
## total 31108
## drwxrwxrwx 1 aravind aravind     4096 Oct 26 11:37 .
## drwxrwxrwx 1 aravind aravind     4096 Oct 26 11:36 ..
## -rwxrwxrwx 1 aravind aravind       12 Oct 25 23:23 analysis.R
## -rwxrwxrwx 1 aravind aravind      430 Oct 25 23:23 bash.R
## -rwxrwxrwx 1 aravind aravind      145 Oct 25 23:23 bash.Rmd
## -rwxrwxrwx 1 aravind aravind      574 Oct 25 23:23 bash.html
## -rwxrwxrwx 1 aravind aravind    16242 Oct 25 23:23 bash.sh
## -rwxrwxrwx 1 aravind aravind       35 Oct 25 23:23 imports_blorr.txt
## -rwxrwxrwx 1 aravind aravind       34 Oct 25 23:23 imports_olsrr.txt
## -rwxrwxrwx 1 aravind aravind    39501 Oct 25 23:23 lorem-ipsum.txt
## -rwxrwxrwx 1 aravind aravind     9291 Oct 25 23:23 main_project.zip
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myfiles
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 mypackage
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject1
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject2
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject3
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject4
## -rwxrwxrwx 1 aravind aravind     1498 Oct 25 23:23 package_names.txt
## -rwxrwxrwx 1 aravind aravind     1082 Oct 25 23:23 pkg_names.txt
## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 r
## -rwxrwxrwx 1 aravind aravind    10240 Oct 25 23:23 release_names.tar
## -rwxrwxrwx 1 aravind aravind      632 Oct 25 23:23 release_names.tar.gz
## -rwxrwxrwx 1 aravind aravind      546 Oct 25 23:23 release_names.txt
## -rwxrwxrwx 1 aravind aravind       65 Oct 25 23:23 release_names_18.txt
## -rwxrwxrwx 1 aravind aravind       53 Oct 25 23:23 release_names_19.txt
## -rwxrwxrwx 1 aravind aravind 31754998 Oct 25 23:23 sept_15.csv.gz
## -rwxrwxrwx 1 aravind aravind      157 Oct 25 23:23 urls.txt
## -rwxrwxrwx 1 aravind aravind     4398 Oct 25 23:23 zip_example.zip

Display size in human readable format

ls -lh
## total 31M
## -rwxrwxrwx 1 aravind aravind   12 Oct 25 23:23 analysis.R
## -rwxrwxrwx 1 aravind aravind  430 Oct 25 23:23 bash.R
## -rwxrwxrwx 1 aravind aravind  145 Oct 25 23:23 bash.Rmd
## -rwxrwxrwx 1 aravind aravind  574 Oct 25 23:23 bash.html
## -rwxrwxrwx 1 aravind aravind  16K Oct 25 23:23 bash.sh
## -rwxrwxrwx 1 aravind aravind   35 Oct 25 23:23 imports_blorr.txt
## -rwxrwxrwx 1 aravind aravind   34 Oct 25 23:23 imports_olsrr.txt
## -rwxrwxrwx 1 aravind aravind  39K Oct 25 23:23 lorem-ipsum.txt
## -rwxrwxrwx 1 aravind aravind 9.1K Oct 25 23:23 main_project.zip
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 myfiles
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 mypackage
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 myproject
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 myproject1
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 myproject2
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 myproject3
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 myproject4
## -rwxrwxrwx 1 aravind aravind 1.5K Oct 25 23:23 package_names.txt
## -rwxrwxrwx 1 aravind aravind 1.1K Oct 25 23:23 pkg_names.txt
## drwxrwxrwx 1 aravind aravind 4.0K Oct 25 23:23 r
## -rwxrwxrwx 1 aravind aravind  10K Oct 25 23:23 release_names.tar
## -rwxrwxrwx 1 aravind aravind  632 Oct 25 23:23 release_names.tar.gz
## -rwxrwxrwx 1 aravind aravind  546 Oct 25 23:23 release_names.txt
## -rwxrwxrwx 1 aravind aravind   65 Oct 25 23:23 release_names_18.txt
## -rwxrwxrwx 1 aravind aravind   53 Oct 25 23:23 release_names_19.txt
## -rwxrwxrwx 1 aravind aravind  31M Oct 25 23:23 sept_15.csv.gz
## -rwxrwxrwx 1 aravind aravind  157 Oct 25 23:23 urls.txt
## -rwxrwxrwx 1 aravind aravind 4.3K Oct 25 23:23 zip_example.zip

Sort list by size

ls -lS
## total 31108## -rwxrwxrwx 1 aravind aravind 31754998 Oct 25 23:23 sept_15.csv.gz## -rwxrwxrwx 1 aravind aravind    39501 Oct 25 23:23 lorem-ipsum.txt## -rwxrwxrwx 1 aravind aravind    16242 Oct 25 23:23 bash.sh## -rwxrwxrwx 1 aravind aravind    10240 Oct 25 23:23 release_names.tar## -rwxrwxrwx 1 aravind aravind     9291 Oct 25 23:23 main_project.zip## -rwxrwxrwx 1 aravind aravind     4398 Oct 25 23:23 zip_example.zip## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myfiles## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 mypackage## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject1## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject2## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject3## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject4## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 r## -rwxrwxrwx 1 aravind aravind     1498 Oct 25 23:23 package_names.txt## -rwxrwxrwx 1 aravind aravind     1082 Oct 25 23:23 pkg_names.txt## -rwxrwxrwx 1 aravind aravind      632 Oct 25 23:23 release_names.tar.gz## -rwxrwxrwx 1 aravind aravind      574 Oct 25 23:23 bash.html## -rwxrwxrwx 1 aravind aravind      546 Oct 25 23:23 release_names.txt## -rwxrwxrwx 1 aravind aravind      430 Oct 25 23:23 bash.R## -rwxrwxrwx 1 aravind aravind      157 Oct 25 23:23 urls.txt## -rwxrwxrwx 1 aravind aravind      145 Oct 25 23:23 bash.Rmd## -rwxrwxrwx 1 aravind aravind       65 Oct 25 23:23 release_names_18.txt## -rwxrwxrwx 1 aravind aravind       53 Oct 25 23:23 release_names_19.txt## -rwxrwxrwx 1 aravind aravind       35 Oct 25 23:23 imports_blorr.txt## -rwxrwxrwx 1 aravind aravind       34 Oct 25 23:23 imports_olsrr.txt## -rwxrwxrwx 1 aravind aravind       12 Oct 25 23:23 analysis.R

Sort list by modification time (oldest first)

ls -ltr
## total 31108## -rwxrwxrwx 1 aravind aravind      546 Oct 25 23:23 release_names.txt## -rwxrwxrwx 1 aravind aravind    10240 Oct 25 23:23 release_names.tar## -rwxrwxrwx 1 aravind aravind     4398 Oct 25 23:23 zip_example.zip## -rwxrwxrwx 1 aravind aravind     9291 Oct 25 23:23 main_project.zip## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 r## -rwxrwxrwx 1 aravind aravind      632 Oct 25 23:23 release_names.tar.gz## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myfiles## -rwxrwxrwx 1 aravind aravind       53 Oct 25 23:23 release_names_19.txt## -rwxrwxrwx 1 aravind aravind       65 Oct 25 23:23 release_names_18.txt## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject3## -rwxrwxrwx 1 aravind aravind       35 Oct 25 23:23 imports_blorr.txt## -rwxrwxrwx 1 aravind aravind     1498 Oct 25 23:23 package_names.txt## -rwxrwxrwx 1 aravind aravind       34 Oct 25 23:23 imports_olsrr.txt## -rwxrwxrwx 1 aravind aravind      157 Oct 25 23:23 urls.txt## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject1## -rwxrwxrwx 1 aravind aravind     1082 Oct 25 23:23 pkg_names.txt## -rwxrwxrwx 1 aravind aravind 31754998 Oct 25 23:23 sept_15.csv.gz## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 mypackage## -rwxrwxrwx 1 aravind aravind    16242 Oct 25 23:23 bash.sh## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject2## -rwxrwxrwx 1 aravind aravind      145 Oct 25 23:23 bash.Rmd## -rwxrwxrwx 1 aravind aravind       12 Oct 25 23:23 analysis.R## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject## -rwxrwxrwx 1 aravind aravind    39501 Oct 25 23:23 lorem-ipsum.txt## -rwxrwxrwx 1 aravind aravind      430 Oct 25 23:23 bash.R## drwxrwxrwx 1 aravind aravind     4096 Oct 25 23:23 myproject4## -rwxrwxrwx 1 aravind aravind      574 Oct 25 23:23 bash.html

R Functions

In R, getwd() will return the current working directory. You can use here() from the here package as well. To change the current working directory, use setwd(). The fs package provides useful functions for file operations.

Command : R
pwd : getwd() / here::here()
ls : dir() / list.files() / list.dirs() / fs::dir_ls() / fs::dir_info()
cd : setwd()
mkdir : dir.create() / fs::dir_create()
rmdir : fs::dir_delete()
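
As a quick illustration of the mapping above, here is a minimal sketch in R (myproject5 is a hypothetical folder name used only for this example):

# print the current working directory (pwd)
getwd()
# list files & folders (ls)
list.files()
# create and then delete a directory (mkdir/rmdir; dir_delete() needs the fs package)
dir.create("myproject5")
fs::dir_delete("myproject5")
# change the working directory (cd)
setwd("myproject")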

File Management

In this section, we will explore commands for file management including:

  • create new file/change timestamps
  • copying files
  • renaming/moving files
  • deleting files
  • comparing files
Command : Description
touch : Create empty file(s)/change timestamp
cp : Copy files & folders
mv : Rename/move file
rm : Remove/delete file
diff : Compare files

Create new file

touch modifies file timestamps, i.e. the information associated with file access and modification. The timestamp can be any of the following:

  • access time (the last time the file was read)
  • modification time (the last time the contents of the file were changed)
  • change time (the last time the file’s metadata was changed)

If the file does not exist, touch will create an empty file with that name. Let us use touch to create a new file, myanalysis.R.

touch myanalysis.R
ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myanalysis.R## myfiles## mypackage## myproject## myproject1## myproject2## myproject3## myproject4## package_names.txt## pkg_names.txt## r## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Copy Files/Folders

cp makes copies of files and directories. The general form of the command is cp source destination. By default, it will overwrite files without prompting for confirmation so be cautious while copying files or folders.

Copy files in same folder

Let us create a copy of release_names.txt file and name it as release_names_2.txt.

cp release_names.txt release_names_2.txt
ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myanalysis.R## myfiles## mypackage## myproject## myproject1## myproject2## myproject3## myproject4## package_names.txt## pkg_names.txt## r## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## release_names_2.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Copy files into different folder

To copy a file into a different directory/folder, we need to specify the name of the destination folder. If the copied file should have a different name, then we need to specify the new name of the file as well. Let us copy the release_names.txt file into the r_releases folder (we will retain the same name for the file as we are copying it into a different folder).

cp release_names.txt r_releases/release_names.txt

Let us check if the file has been copied by listing the files in the r_releases folder using ls.

ls r_releases
## release_names.txt

Copy folders

How about making copies of folders? Use the -r option to copy entire folders. Let us create a copy of the r folder and name it as r2. The -r option stands for --recursive i.e. copy directories recursively.

cp -r r r2
ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myanalysis.R## myfiles## mypackage## myproject## myproject1## myproject2## myproject3## myproject4## package_names.txt## pkg_names.txt## r## r2## r_releases## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## release_names_2.txt## release_names_3.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Move/Rename Files

mv moves and renames files and directories. Using different options, we can ensure

  • files are not overwritten
  • user is prompted for confirmation before overwriting files
  • details of files being moved are displayed
Command : Description
mv : Move or rename files/directories
mv -f : Do not prompt for confirmation before overwriting files
mv -i : Prompt for confirmation before overwriting files
mv -n : Do not overwrite existing files
mv -v : Move files in verbose mode

Let us move the release_names_2.txt file to the r_releases folder.

mv release_names_2.txt r_releases

Use ls to verify that the file has been moved. As you can see, release_names_2.txt is no longer present in the current working directory.

ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myanalysis.R## myfiles## mypackage## myproject## myproject1## myproject2## myproject3## myproject4## package_names.txt## pkg_names.txt## r## r2## r_releases## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## release_names_3.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Let us check if release_names_2.txt is present in the r_releases folder. Great! We have successfully moved the file into a different folder.

ls r_releases
## release_names.txt## release_names_2.txt

Move files in verbose mode

To view the details of the files being moved/renamed, use the -v option. In the below example, we move the release_names_3.txt file into the r_releases folder using mv.

mv -v release_names_3.txt r_releases
## renamed 'release_names_3.txt' -> 'r_releases/release_names_3.txt'

Do not overwrite existing files

How do we ensure that existing files are never overwritten? In the below example, we will try to overwrite release_names_2.txt in the r_releases folder using mv and see what happens. But first, let us look at the contents of the release_names_2.txt file using the cat command.

We will look into the cat command in more detail in the next chapter but for the time being it is sufficient to know that it prints contents of a file. The file contains release names of different R versions.

cat r_releases/release_names_2.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

In our current working directory, we will create another file of the same name i.e. release_names_2.txt but its contents are different from the file in the r_releases folder. It contains the string release_names and nothing else. We will now move this file into the r_releases folder but use the option -n to ensure that the file in the r_releases folder is not overwritten. We can confirm this by printing the contents of the file in the r_releases folder.

The echo command is used to print text to the terminal or to write to a file. We will explore it in more detail in the next chapter.

echo "release_names" > release_names_2.txt mv -n release_names_2.txt r_releasescat r_releases/release_names_2.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

As you can observe, the contents of the file in the r_releases folder have not changed. In the next section, we will learn to overwrite the contents using the -f option.

Do not prompt for confirmation before overwriting files

What if we actually intend to overwrite a file and do not want to be prompted for confirmation? In this case, we can use the -f option, which stands for --force, i.e. do not prompt before overwriting. Let us first print the contents of the release_names_2.txt file in the r_releases folder.

cat r_releases/release_names_2.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

Now we will create another file of the same name in the current working directory but with different content, and use the -f option to overwrite the file in the r_releases folder. You can see that the contents of the file in the r_releases folder have changed.

echo "release_names" > release_names_2.txt mv -f release_names_2.txt r_releasescat r_releases/release_names_2.txt
## release_names

Remove/Delete Files

The rm command is used to delete/remove files & folders. Using additional options, we can

  • remove directories & sub-directories
  • forcibly remove directories
  • interactively remove multiple files
  • display information about files removed/deleted
Command : Description
rm : Remove files/directories
rm -r : Recursively remove a directory & all its subdirectories
rm -rf : Forcibly remove directory without prompting for confirmation or showing error messages
rm -i : Interactively remove multiple files, with a prompt before every removal
rm -v : Remove files in verbose mode, printing a message for each removed file

Remove files

Let us use rm to remove the file myanalysis.R (we created it earlier using the touch command).

rm myanalysis.R
ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myfiles## mypackage## myproject## myproject1## myproject2## myproject3## myproject4## package_names.txt## pkg_names.txt## r## r2## r_releases## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Recursive Deletion

How about folders or directories? We can remove a directory and all its contents including sub-directories using the option -r which stands for --recursive and removes directories and their contents recursively. Let us remove the myproject1 folder and all its contents.

rm -r myproject1
ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myfiles## mypackage## myproject## myproject2## myproject3## myproject4## package_names.txt## pkg_names.txt## r## r2## r_releases## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Force Removal

Use the -f option, which stands for --force, to forcibly remove a directory and all its contents without prompting for confirmation or showing error messages. Let us remove the myproject2 folder and all its contents.

rm -rf myproject2
ls
## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myfiles## mypackage## myproject## myproject3## myproject4## package_names.txt## pkg_names.txt## r## r2## r_releases## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_19.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Verbose Mode

Remove files in verbose mode, printing a message for each removed file. This is useful when you want to see the details of the files being removed. In the below example, we will remove all files with .txt extension from the myfiles folder. Instead of specifying the name of each text file, we use the wildcard * along with .txt i.e. any file with the extension .txt will be removed.

cd myfiles
rm -v *.txt
## removed 'release_names.txt'## removed 'release_names_18.txt'## removed 'release_names_19.txt'

Compare Files

diff stands for difference. It is used to compare files line by line and display differences. It also indicates which lines in one file must be changed to make the files identical. Using additional options, we can

  • ignore white spaces while comparing files
  • show differences side by side
  • show differences in unified format
  • compare directories recursively
  • display names of files that differ
Command : Description
diff : Compare files & directories
diff -w : Compare files; ignoring white spaces
diff -y : Compare files; showing differences side by side
diff -u : Compare files; show differences in unified format
diff -r : Compare directories recursively
diff -rq : Compare directories; show the names of files that differ

Compare Files

Let us compare the contents of the following files

  • imports_olsrr.txt
  • imports_blorr.txt

The files contain the names of R packages imported by the olsrr and blorr packages respectively. (Full disclosure: both of the above R packages are developed by Rsquared Academy.)

diff uses certain special symbols and gives instructions to make the files identical. The instructions are on how to change the first file to make it identical to the second file. We list the symbols below

  • a for add
  • c for change
  • d for delete

We will use the -w option to ignore white spaces while comparing the files.

diff -w imports_olsrr.txt imports_blorr.txt
## 1a2## > caret## 3d3## < cli## 4a5## > cli

Let us interpret the results. 4a5 means: after line 4 in file 1, add line 5 from file 2, i.e. add cli (line 5 in imports_blorr.txt) after line 4 in imports_olsrr.txt. Applying all the listed instructions will make the two files identical.

Let us change the file order and see the instructions from diff.

diff -w imports_blorr.txt imports_olsrr.txt
## 2d1## < caret## 4d2## < clisymbols## 5a4## > clisymbols

2d1 means: delete line 2 from file 1, i.e. delete caret (line 2 in imports_blorr.txt), so that both files match at that point.

Side By Side

To view the differences between the files side by side, use the -y option.

diff -y imports_olsrr.txt imports_blorr.txt
## car                                   | car## checkmate                                  | caret## cli                                  | checkmate## clisymbols                                  | clisymbols##                                > cli

Unified Format

To view the differences between the files in a unified format, use the -u option.

diff -u imports_olsrr.txt imports_blorr.txt
## --- imports_olsrr.txt    2019-09-20 13:36:03.000000000 +0530## +++ imports_blorr.txt    2019-09-20 13:36:35.000000000 +0530## @@ -1,4 +1,5 @@## -car ## -checkmate## -cli## -clisymbols## +car## +caret## +checkmate## +clisymbols## +cli

Compare Recursively

To compare recursively, use the -r option. Let us compare the mypackage and myproject folders.

diff -r mypackage myproject
## Only in mypackage: .Rbuildignore## Only in mypackage: DESCRIPTION## Only in mypackage: LICENSE## Only in mypackage: NAMESPACE## Only in mypackage: NEWS.md## Only in mypackage: R## Only in myproject/data: processed## Only in myproject/data: raw## Only in mypackage: docs## Only in mypackage: man## Only in myproject: output## Only in myproject: run_analysis.R## Only in mypackage: tests## Only in mypackage: vignettes

File Details

To compare directories and view the names of files that differ, use the -rq option. In the below example, we look at the names of files that differ in mypackage and myproject folders.

diff -rq mypackage myproject
## Only in mypackage: .Rbuildignore## Only in mypackage: DESCRIPTION## Only in mypackage: LICENSE## Only in mypackage: NAMESPACE## Only in mypackage: NEWS.md## Only in mypackage: R## Only in myproject/data: processed## Only in myproject/data: raw## Only in mypackage: docs## Only in mypackage: man## Only in myproject: output## Only in myproject: run_analysis.R## Only in mypackage: tests## Only in mypackage: vignettes

R Functions

In R, file operations can be performed using functions from both base R and the fs package.

Command : R
touch : file.create() / fs::file_create() / fs::file_touch()
cp : file.copy() / fs::file_copy() / fs::dir_copy()
mv : file.rename() / fs::file_move()
rm : file.remove() / fs::file_delete()
diff : (no direct base R equivalent)
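
A minimal sketch of the file management equivalents above, using hypothetical file names (for diff there is no direct base R analogue, though tools::Rdiff() offers a rough file comparison):

# create an empty file (touch)
file.create("myanalysis.R")
# copy it (cp)
file.copy("myanalysis.R", "myanalysis_2.R")
# rename/move it (mv)
file.rename("myanalysis_2.R", "myanalysis_3.R")
# remove both files (rm)
file.remove("myanalysis.R", "myanalysis_3.R")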

Input/Output

In this section, we will explore commands that will

  • display messages
  • print file contents
  • sort file contents
Command : Description
echo : Display messages
cat : Print contents of a file
head : Prints first ten lines of a file by default
tail : Prints last ten lines of a file by default
more : Open a file for interactive reading, scrolling & searching
less : Open a file for interactive reading, scrolling & searching
sort : Sort a file in ascending order

Display Messages

The echo command prints text to the terminal. It can be used for writing or appending messages to a file as well.

Command : Description
echo : Display messages
echo -n : Print message without trailing new line
echo > file : Write message to a file
echo >> file : Append message to a file
echo -e : Enable interpretation of special characters
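
In R, cat() and writeLines() cover most of what echo does at the shell; a minimal sketch (release.txt mirrors the file used in the redirection example below):

# print a message to the console (echo)
cat("Action of the Toes\n")
# write a message to a file (echo >)
cat("Great Truth\n", file = "release.txt")
# append a message to a file (echo >>)
cat("Planting of a Tree\n", file = "release.txt", append = TRUE)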

Print Message

Let us start with a simple example. We will print the text Action of the Toes to the terminal. It is the release name for R version 3.6.1.

echo Action of the Toes

Redirect Output

What if we want to redirect the output? Instead of printing the text to the terminal, we want to write it to a file. In such cases, use > along with the file name to redirect the output to the file. Keep in mind that > will overwrite files. If you want to append to files instead of overwriting, use >>.

echo Great Truth > release.txt

Print & Concatenate Files

The cat command reads data from files and outputs their contents. It is the simplest way to display the contents of a file at the command line. It can be used to overwrite or append new content to files as well. cat stands for concatenate and can be used to

  • display text files
  • copy text files into a new document
  • append the contents of a text file to the end of another text file, combining them
Command : Description
cat : Print & concatenate files
cat > : Concatenate several files into the target file
cat >> : Append several files into the target file
cat -n : Number all output lines

Print Content

Let us print the content of the release_names.txt file (it contains R release names).

cat release_names.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

Number All Output Lines

If you want to number the output lines, use the -n option.

cat -n release_names.txt
##      1   Unsuffered Consequences##      2   Great Pumpkin##      3   December Snowflakes##      4   Gift-Getting Season##      5   Easter Beagle##      6   Roasted Marshmallows##      7   Trick or Treat##      8   Security Blanket##      9   Masked Marvel##     10   Good Sport##     11   Frisbee Sailing##     12   Warm Puppy##     13   Spring Dance##     14   Sock it to Me##     15   Pumpkin Helmet##     16   Smooth Sidewalk##     17   Full of Ingredients##     18   World-Famous Astronaut##     19   Fire Safety##     20   Wooden Christmas Tree##     21   Very Secure Dishes##     22   Very, Very Secure Dishes##     23   Supposedly Educational##     24   Bug in Your Hair##     25   Sincere Pumpkin Patch##     26   Another Canoe##     27   You Stupid Darkness##     28   Single Candle##     29   Short Summer##     30   Kite Eating Tree

Concatenate Several Files

To concatenate the contents of several files into a target file, use >. In the below example, we concatenate the contents of the files release_names_18.txt and release_names_19.txt into a single file release_names_18_19.txt. In this case we are not printing the contents of the file to the terminal and instead we concatenate the contents from both the files and redirect the output to the target file.

cat release_names_18.txt release_names_19.txt > release_names_18_19.txt
cat release_names_18_19.txt
## Someone to Lean On## Joy in Playing## Feather Spray## Eggshell IglooGreat Truth## Planting of a Tree## Action of the Toes

Head

The head command will display the first 10 lines of a file (or files) by default. It can be used to display the first few lines or bytes of a file as well.

Command : Description
head : Output the first parts of a file
head -n : Output the first n lines of a file
head -c : Output the first c bytes of a file
head -n -x : Output everything but the last x lines of a file
head -c -x : Output everything but the last x bytes of a file

Output the first parts of a file

Let us use head to display the first 10 lines of the release_names.txt file.

head release_names.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport

Output the first n lines of a file

Using the -n option, we can specify the number of lines to be displayed. In the below example, we display the first 5 lines.

head -n 5 release_names.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle

Output the first c bytes of a file

The -c option can be used to display bytes instead of lines. Let us display the first 5 bytes of the release_names.txt file.

head -c 5 release_names.txt
## Unsuf

Output everything but the last 5 lines of a file

To omit the last parts of a file, use - while specifying the number of lines. In the below example, we display everything except the last 5 lines of the file.

head -n -5 release_names.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch

Output everything but the last 3 bytes of a file

In this example, we omit the last 3 bytes of the file using the -c option and - while specifying the number of bytes.

head -c -3 release_names.txt
## Unsuffered Consequences## Great Pumpkin## December Snowflakes## Gift-Getting Season## Easter Beagle## Roasted Marshmallows## Trick or Treat## Security Blanket## Masked Marvel## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tre
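
base R's head() and tail() mirror these options closely; a minimal sketch, assuming release_names.txt is in the working directory:

releases <- readLines("release_names.txt")
# first 5 lines (head -n 5)
head(releases, n = 5)
# everything but the last 5 lines (head -n -5)
head(releases, n = -5)
# tail() works the same way for the last lines (see the next section)
tail(releases, n = 5)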

Tail

The tail command displays the last 10 lines of a file(s) by default. It can be used to display the last few lines or bytes of a file as well.

Command : Description
tail : Display the last part of a file
tail -n num : Show the last num lines of a file
tail -n +num : Show all contents of the file starting from line num
tail -c num : Show last num bytes of a file
tail -f : Keep reading file until Ctrl + C
tail -F : Keep reading file until Ctrl + C; even if the file is rotated

Display the last parts of a file

Let us use tail to display the last 10 lines of the file.

tail release_names.txt
## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

Display the last 5 lines of a file

As we did in the previous section, use the -n option to specify the number of lines to be displayed.

tail -n 5 release_names.txt
## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

Display all contents from line 10

We can use tail to display all contents of a file starting from a specific line. In the below example, we display all contents of the file starting from the 10th line using the -n option and the + prefix while specifying the line number.

tail -n +10 release_names.txt
## Good Sport## Frisbee Sailing## Warm Puppy## Spring Dance## Sock it to Me## Pumpkin Helmet## Smooth Sidewalk## Full of Ingredients## World-Famous Astronaut## Fire Safety## Wooden Christmas Tree## Very Secure Dishes## Very, Very Secure Dishes## Supposedly Educational## Bug in Your Hair## Sincere Pumpkin Patch## Another Canoe## You Stupid Darkness## Single Candle## Short Summer## Kite Eating Tree

Display the last 7 bytes of a file

Use the -c option to display the last 7 bytes of a file.

tail -c 7 release_names.txt
##  Tree

More

The more command displays text, one screen at a time. It opens a file for

  • interactive reading
  • scrolling
  • and searching

Press space to scroll down the page, the forward slash (/) for searching strings, n to go to the next match and q to quit.

Command : Description
more : Open a file for interactive reading, scrolling & searching
space : Page down
/ : Search for a string; press n to go to the next match
q : Quit

Less

The less command is similar to more but offers more features. It allows the user to scroll up and down, go to the beginning and end of the file, search forward and backward, and jump to the next and previous match while searching the file.

Command : Description
less : Open a file for interactive reading, scrolling & searching
space : Page down
b : Page up
G : Go to the end of file
g : Go to the start of file
/ : Forward search
? : Backward search
n : Go to next match
N : Go to previous match
q : Quit
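
Since more and less are interactive, they cannot be demonstrated in static output. In R, file.show() opens a file in a simple pager, a rough equivalent rather than a feature-for-feature match:

# open a file for interactive reading (more/less)
file.show("release_names.txt")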

Sort

The sort command will sort the contents of a text file, line by line. Using additional options, we can

  • sort a file in ascending/descending order
  • ignore case while sorting
  • use numeric order for sorting
  • preserve only unique lines while sorting

Using the sort command, the contents can be sorted numerically and alphabetically. By default, the rules for sorting are:

  • lines starting with a number will appear before lines starting with a letter.
  • lines starting with a letter that appears earlier in the alphabet will appear before lines starting with a letter that appears later in the alphabet.
  • lines starting with an uppercase letter will appear before lines starting with the same letter in lowercase (the exact order depends on the locale; the output below reflects the default C locale).

Using additional options, the rules for sorting can be changed. We list the options in the below table.

Command : Description
sort : Sort lines of text files
sort -r : Sort a file in descending order
sort --ignore-case : Ignore case while sorting
sort -n : Use numeric order for sorting
sort -u : Preserve only unique lines while sorting
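
An equivalent sketch in R; note that the exact ordering depends on the locale, so it may differ slightly from the shell output below:

pkgs <- readLines("pkg_names.txt")
# ascending order (sort)
sort(pkgs)
# descending order (sort -r)
sort(pkgs, decreasing = TRUE)
# unique lines only (sort -u)
sort(unique(pkgs))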

Sort

Let us sort the contents of the pkg_names.txt file. It contains the names of R packages randomly selected from CRAN.

sort pkg_names.txt
## ASIP## AdMit## AnalyzeTS## AzureStor## AzureStor## BIGDAWG## BIOMASS## BIOMASS## BenfordTests## BinOrdNonNor## BioCircos## ClimMobTools## CombinePValue## Eagle## FField## ICAOD## MARSS## MIAmaxent## MIAmaxent## MIAmaxent## MVB## MVTests## MaXact## MaxentVariableSelection## OptimaRegion## OxyBS## PathSelectMP## PropScrRand## RJDBC## RPyGeo## SCRT## SMARTp## SPEDInstabR## SemiParSampleSel## SetMethods## SmallCountRounding## SpatioTemporal## SphericalK## SuppDists## Survgini## TIMP## TSeriesMMA## VineCopula## WGScan## WPKDE## accept## accept## addhaz## alfr## aweek## aweek## bayesbio## blink## breakfast## cbsem## corclass## crsra## cyclocomp## dagitty## disparityfilter## edfReader## errorlocate## expstudies## fermicatsR## foretell## gLRTH## gazepath## generalhoslem## geoknife## hdnom## hindexcalculator## ibd## interplot## kfigr## logNormReg## ltxsparklines## lue## mbir## mcmcabn## mev## mgcViz## mined## mlflow## mongolite## mongolite## mvShapiroTest## odk## overlapping## pAnalysis## pls## pmdplyr## poisbinom## randtests## redcapAPI## rgw## rless## rsed## rstudioapi## solitude## splithalfr## sspline## sybilccFBA## tailr## tailr## tictactoe## viridisLite## vqtl## widyr## widyr

Descending Order

Using the -r option, which stands for --reverse, the contents of the file can be sorted in descending/reverse order. Let us now sort the contents of the pkg_names.txt file in reverse order.

sort -r pkg_names.txt
## widyr## widyr## vqtl## viridisLite## tictactoe## tailr## tailr## sybilccFBA## sspline## splithalfr## solitude## rstudioapi## rsed## rless## rgw## redcapAPI## randtests## poisbinom## pmdplyr## pls## pAnalysis## overlapping## odk## mvShapiroTest## mongolite## mongolite## mlflow## mined## mgcViz## mev## mcmcabn## mbir## lue## ltxsparklines## logNormReg## kfigr## interplot## ibd## hindexcalculator## hdnom## geoknife## generalhoslem## gazepath## gLRTH## foretell## fermicatsR## expstudies## errorlocate## edfReader## disparityfilter## dagitty## cyclocomp## crsra## corclass## cbsem## breakfast## blink## bayesbio## aweek## aweek## alfr## addhaz## accept## accept## WPKDE## WGScan## VineCopula## TSeriesMMA## TIMP## Survgini## SuppDists## SphericalK## SpatioTemporal## SmallCountRounding## SetMethods## SemiParSampleSel## SPEDInstabR## SMARTp## SCRT## RPyGeo## RJDBC## PropScrRand## PathSelectMP## OxyBS## OptimaRegion## MaxentVariableSelection## MaXact## MVTests## MVB## MIAmaxent## MIAmaxent## MIAmaxent## MARSS## ICAOD## FField## Eagle## CombinePValue## ClimMobTools## BioCircos## BinOrdNonNor## BenfordTests## BIOMASS## BIOMASS## BIGDAWG## AzureStor## AzureStor## AnalyzeTS## AdMit## ASIP

Ignore case

To ignore case while sorting contents, use the --ignore-case option. Time to sort the pkg_names.txt file while ignoring case.

sort --ignore-case pkg_names.txt
## accept## accept## addhaz## AdMit## alfr## AnalyzeTS## ASIP## aweek## aweek## AzureStor## AzureStor## bayesbio## BenfordTests## BIGDAWG## BinOrdNonNor## BioCircos## BIOMASS## BIOMASS## blink## breakfast## cbsem## ClimMobTools## CombinePValue## corclass## crsra## cyclocomp## dagitty## disparityfilter## Eagle## edfReader## errorlocate## expstudies## fermicatsR## FField## foretell## gazepath## generalhoslem## geoknife## gLRTH## hdnom## hindexcalculator## ibd## ICAOD## interplot## kfigr## logNormReg## ltxsparklines## lue## MARSS## MaXact## MaxentVariableSelection## mbir## mcmcabn## mev## mgcViz## MIAmaxent## MIAmaxent## MIAmaxent## mined## mlflow## mongolite## mongolite## MVB## mvShapiroTest## MVTests## odk## OptimaRegion## overlapping## OxyBS## pAnalysis## PathSelectMP## pls## pmdplyr## poisbinom## PropScrRand## randtests## redcapAPI## rgw## RJDBC## rless## RPyGeo## rsed## rstudioapi## SCRT## SemiParSampleSel## SetMethods## SmallCountRounding## SMARTp## solitude## SpatioTemporal## SPEDInstabR## SphericalK## splithalfr## sspline## SuppDists## Survgini## sybilccFBA## tailr## tailr## tictactoe## TIMP## TSeriesMMA## VineCopula## viridisLite## vqtl## WGScan## widyr## widyr## WPKDE

Numeric Order

To sort numerically, use the -n option which stands for --numeric-sort. In this example, we will use a different file, package_names.txt where the package names are prefixed by random numbers between 1 and 100.

sort -n package_names.txt
## 1. cyclocomp## 2. odk## 3. redcapAPI## 4. TIMP## 5. pls## 6. BinOrdNonNor## 7. bayesbio## 8. MVTests## 9. pAnalysis## 10. aweek## 11. hdnom## 12. ltxsparklines## 13. MaXact## 14. RJDBC## 15. MIAmaxent## 16. randtests## 17. ASIP## 18. gazepath## 19. mcmcabn## 20. rless## 21. corclass## 22. vqtl## 23. disparityfilter## 24. SCRT## 25. RPyGeo## 26. blink## 27. gLRTH## 28. splithalfr## 29. sspline## 29. sspline## 30. logNormReg## 31. BIGDAWG## 31. BIGDAWG## 32. SPEDInstabR## 33. tailr## 33. tailr## 34. ibd## 35. fermicatsR## 36. mlflow## 37. CombinePValue## 38. BenfordTests## 39. mev## 40. MaxentVariableSelection## 41. rstudioapi## 42. OptimaRegion## 43. accept## 44. expstudies## 45. solitude## 45. solitude## 46. cbsem## 47. SMARTp## 48. geoknife## 49. SemiParSampleSel## 50. mbir## 51. interplot## 52. ClimMobTools## 53. MVB## 54. OxyBS## 55. hindexcalculator## 56. MARSS## 57. generalhoslem## 58. alfr## 59. AdMit## 60. Eagle## 61. PropScrRand## 62. lue## 63. dagitty## 64. viridisLite## 65. mined## 65. mined## 66. SuppDists## 67. tictactoe## 68. AzureStor## 68. AzureStor## 69. FField## 70. rsed## 70. rsed## 71. kfigr## 72. overlapping## 72. overlapping## 73. VineCopula## 74. crsra## 75. pmdplyr## 76. errorlocate## 77. SetMethods## 78. sybilccFBA## 79. mvShapiroTest## 80. SpatioTemporal## 81. mgcViz## 82. breakfast## 83. WPKDE## 84. BIOMASS## 85. edfReader## 86. mongolite## 87. WGScan## 88. SphericalK## 89. foretell## 90. widyr## 91. rgw## 92. BioCircos## 93. PathSelectMP## 94. ICAOD## 95. TSeriesMMA## 96. poisbinom## 97. AnalyzeTS## 98. SmallCountRounding## 99. Survgini## 100. addhaz

Preserve Only Unique Lines

The -u option which stands for --unique will preserve only unique lines while sorting the contents of the file. In the below example, we remove all duplicate lines from the pkg_names.txt while sorting.

sort -u pkg_names.txt
## ASIP## AdMit## AnalyzeTS## AzureStor## BIGDAWG## BIOMASS## BenfordTests## BinOrdNonNor## BioCircos## ClimMobTools## CombinePValue## Eagle## FField## ICAOD## MARSS## MIAmaxent## MVB## MVTests## MaXact## MaxentVariableSelection## OptimaRegion## OxyBS## PathSelectMP## PropScrRand## RJDBC## RPyGeo## SCRT## SMARTp## SPEDInstabR## SemiParSampleSel## SetMethods## SmallCountRounding## SpatioTemporal## SphericalK## SuppDists## Survgini## TIMP## TSeriesMMA## VineCopula## WGScan## WPKDE## accept## accept## addhaz## alfr## aweek## bayesbio## blink## breakfast## cbsem## corclass## crsra## cyclocomp## dagitty## disparityfilter## edfReader## errorlocate## expstudies## fermicatsR## foretell## gLRTH## gazepath## generalhoslem## geoknife## hdnom## hindexcalculator## ibd## interplot## kfigr## logNormReg## ltxsparklines## lue## mbir## mcmcabn## mev## mgcViz## mined## mlflow## mongolite## mvShapiroTest## odk## overlapping## pAnalysis## pls## pmdplyr## poisbinom## randtests## redcapAPI## rgw## rless## rsed## rstudioapi## solitude## splithalfr## sspline## sybilccFBA## tailr## tictactoe## viridisLite## vqtl## widyr

Word Count

wc will print the newline, word, and byte counts for file(s). If more than one file is specified, it will also print a total line.

Count words, bytes and lines

wc release_names.txt
##  30  73 546 release_names.txt

Count lines in a file

wc -l release_names.txt
## 30 release_names.txt

Count words in a file

wc -w release_names.txt
## 73 release_names.txt

Count characters (bytes) in a file

wc -c release_names.txt
## 546 release_names.txt
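
The same counts can be reproduced in R; a minimal sketch (the word count splits on whitespace and may differ from wc for unusual spacing):

lines <- readLines("release_names.txt")
# count lines (wc -l)
length(lines)
# count words (wc -w)
sum(lengths(strsplit(lines, "\\s+")))
# count bytes (wc -c)
file.size("release_names.txt")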

Search & Regular Expressions

In this section, we will explore commands that will

  • search for a given string in a file
  • find files using names
  • and search for binary executable files
Command : Description
grep : Search for a given string in a file
find : Find files using filenames
which : Search for binary executable files

grep

The grep command is used for pattern matching. Along with additional options, it can be used to

  • match pattern in input text
  • ignore case
  • search recursively for an exact string
  • print filename and line number for each match
  • invert match for excluding specific strings

grep processes text line by line, and prints any lines which match a specified pattern. grep, which stands for global regular expression print, is a powerful tool for matching a regular expression against text in a file, multiple files, or a stream of input.

Command : Description
grep : Matches pattern in input text
grep -i : Ignore case
grep -RI : Search recursively for an exact string
grep -E : Use extended regular expression
grep -Hn : Print file name & corresponding line number for each match
grep -v : Invert match for excluding specific strings

Match Pattern in Input Text

Using grep, let us search for packages that include the letter R in their names.

grep R package_names.txt
## 14. RJDBC## 30. logNormReg## 27. gLRTH## 35. fermicatsR## 42. OptimaRegion## 61. PropScrRand## 25. RPyGeo## 47. SMARTp## 24. SCRT## 56. MARSS## 85. edfReader## 32. SPEDInstabR## 98. SmallCountRounding

Ignore Case

In the previous case, grep returned only those packages whose name included R but not r i.e. it did not ignore the case of the letter. Using the -i option, we will now search while ignoring the case of the letter.

grep -i R package_names.txt
## 14. RJDBC## 58. alfr## 64. viridisLite## 99. Survgini## 30. logNormReg## 27. gLRTH## 71. kfigr## 72. overlapping## 90. widyr## 33. tailr## 40. MaxentVariableSelection## 33. tailr## 72. overlapping## 16. randtests## 12. ltxsparklines## 91. rgw## 35. fermicatsR## 21. corclass## 68. AzureStor## 42. OptimaRegion## 61. PropScrRand## 74. crsra## 80. SpatioTemporal## 23. disparityfilter## 49. SemiParSampleSel## 76. errorlocate## 88. SphericalK## 28. splithalfr## 89. foretell## 25. RPyGeo## 50. mbir## 51. interplot## 6. BinOrdNonNor## 47. SMARTp## 38. BenfordTests## 79. mvShapiroTest## 92. BioCircos## 55. hindexcalculator## 41. rstudioapi## 57. generalhoslem## 24. SCRT## 95. TSeriesMMA## 82. breakfast## 56. MARSS## 70. rsed## 68. AzureStor## 85. edfReader## 20. rless## 75. pmdplyr## 32. SPEDInstabR## 3. redcapAPI## 70. rsed## 98. SmallCountRounding

Highlight

The --color option will highlight the matched strings.

grep -i --color R package_names.txt
## 14. RJDBC## 58. alfr## 64. viridisLite## 99. Survgini## 30. logNormReg## 27. gLRTH## 71. kfigr## 72. overlapping## 90. widyr## 33. tailr## 40. MaxentVariableSelection## 33. tailr## 72. overlapping## 16. randtests## 12. ltxsparklines## 91. rgw## 35. fermicatsR## 21. corclass## 68. AzureStor## 42. OptimaRegion## 61. PropScrRand## 74. crsra## 80. SpatioTemporal## 23. disparityfilter## 49. SemiParSampleSel## 76. errorlocate## 88. SphericalK## 28. splithalfr## 89. foretell## 25. RPyGeo## 50. mbir## 51. interplot## 6. BinOrdNonNor## 47. SMARTp## 38. BenfordTests## 79. mvShapiroTest## 92. BioCircos## 55. hindexcalculator## 41. rstudioapi## 57. generalhoslem## 24. SCRT## 95. TSeriesMMA## 82. breakfast## 56. MARSS## 70. rsed## 68. AzureStor## 85. edfReader## 20. rless## 75. pmdplyr## 32. SPEDInstabR## 3. redcapAPI## 70. rsed## 98. SmallCountRounding

Print Filename

If there is more than one file to search, use the -H option to print the filename for each match.

grep -i --color -H bio package_names.txt
## package_names.txt:84. BIOMASS## package_names.txt:92. BioCircos## package_names.txt:7. bayesbio

Print Corresponding Line Number

The -n option will print the corresponding line number of the match in the file.

grep -i --color -n bio package_names.txt
## 59:84. BIOMASS## 71:92. BioCircos## 88:7. bayesbio

Print Filename & Line Number

Let us print both the file name and the line number for each match.

grep -i --color -Hn R package_names.txt
## package_names.txt:1:14. RJDBC## package_names.txt:3:58. alfr## package_names.txt:8:64. viridisLite## package_names.txt:14:99. Survgini## package_names.txt:15:30. logNormReg## package_names.txt:16:27. gLRTH## package_names.txt:18:71. kfigr## package_names.txt:20:72. overlapping## package_names.txt:21:90. widyr## package_names.txt:22:33. tailr## package_names.txt:23:40. MaxentVariableSelection## package_names.txt:26:33. tailr## package_names.txt:27:72. overlapping## package_names.txt:30:16. randtests## package_names.txt:31:12. ltxsparklines## package_names.txt:32:91. rgw## package_names.txt:33:35. fermicatsR## package_names.txt:37:21. corclass## package_names.txt:38:68. AzureStor## package_names.txt:41:42. OptimaRegion## package_names.txt:42:61. PropScrRand## package_names.txt:43:74. crsra## package_names.txt:51:80. SpatioTemporal## package_names.txt:52:23. disparityfilter## package_names.txt:54:49. SemiParSampleSel## package_names.txt:55:76. errorlocate## package_names.txt:57:88. SphericalK## package_names.txt:61:28. splithalfr## package_names.txt:62:89. foretell## package_names.txt:63:25. RPyGeo## package_names.txt:64:50. mbir## package_names.txt:65:51. interplot## package_names.txt:66:6. BinOrdNonNor## package_names.txt:67:47. SMARTp## package_names.txt:68:38. BenfordTests## package_names.txt:69:79. mvShapiroTest## package_names.txt:71:92. BioCircos## package_names.txt:75:55. hindexcalculator## package_names.txt:78:41. rstudioapi## package_names.txt:80:57. generalhoslem## package_names.txt:84:24. SCRT## package_names.txt:85:95. TSeriesMMA## package_names.txt:87:82. breakfast## package_names.txt:96:56. MARSS## package_names.txt:97:70. rsed## package_names.txt:98:68. AzureStor## package_names.txt:100:85. edfReader## package_names.txt:101:20. rless## package_names.txt:102:75. pmdplyr## package_names.txt:103:32. SPEDInstabR## package_names.txt:104:3. redcapAPI## package_names.txt:106:70. rsed## package_names.txt:107:98. SmallCountRounding

Invert Match

Use the -v option to select non-matching lines. In the below example, we search for packages whose name does not include R while ignoring the case.

grep -v -i R package_names.txt
## 36. mlflow## 10. aweek## 31. BIGDAWG## 22. vqtl## 29. sspline## 39. mev## 66. SuppDists## 15. MIAmaxent## 31. BIGDAWG## 29. sspline## 60. Eagle## 83. WPKDE## 11. hdnom## 26. blink## 18. gazepath## 52. ClimMobTools## 44. expstudies## 65. mined## 81. mgcViz## 45. solitude## 9. pAnalysis## 65. mined## 94. ICAOD## 48. geoknife## 45. solitude## 67. tictactoe## 46. cbsem## 93. PathSelectMP## 96. poisbinom## 17. ASIP## 5. pls## 84. BIOMASS## 59. AdMit## 77. SetMethods## 53. MVB## 2. odk## 86. mongolite## 4. TIMP## 97. AnalyzeTS## 87. WGScan## 63. dagitty## 69. FField## 13. MaXact## 73. VineCopula## 7. bayesbio## 34. ibd## 8. MVTests## 19. mcmcabn## 43. accept## 78. sybilccFBA## 62. lue## 100. addhaz## 37. CombinePValue## 1. cyclocomp## 54. OxyBS

Recursive Search

Use the -r option to search recursively. In the below example, we search all files with the .txt extension for the string bio while ignoring the case.

grep -i --color -r bio *.txt
## package_names.txt:84. BIOMASS## package_names.txt:92. BioCircos## package_names.txt:7. bayesbio## pkg_names.txt:BIOMASS## pkg_names.txt:BioCircos## pkg_names.txt:BIOMASS## pkg_names.txt:bayesbio
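
base R's grep() mirrors several of these options; a minimal sketch using the same files:

pkgs <- readLines("package_names.txt")
# match a pattern while ignoring case (grep -i)
grep("bio", pkgs, ignore.case = TRUE, value = TRUE)
# invert the match (grep -v)
grep("R", pkgs, invert = TRUE, value = TRUE)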

find

The find command can be used for searching files and directories. Using additional options, we can

  • search files by extension type
  • ignore case while searching files/directories

find is a powerful tool for working with files. It can be used on its own to locate files, or in conjunction with other programs to perform operations on those files.

Command : Description
find : Find files or directories under the given directory; recursively
find -name '*.txt' : Find files by extension
find -type d -iname : Find directories matching a given name, in case-insensitive mode
find -type d -name : Find directories matching a given name, in case-sensitive mode

Search Recursively

Let us use find to search for the file release_names.txt recursively. The -name option is used to specify the name of the file we are searching for.

find -name release_names.txt
## ./release_names.txt## ./r_releases/release_names.txt

There are two files with the name release_names.txt, one in the current working directory and one in the r_releases directory.

Search by Extension

Let us search for all files with .txt extension in the r_releases folder.

find r_releases -name '*.txt'
## r_releases/release_names.txt## r_releases/release_names_2.txt## r_releases/release_names_3.txt

There are 3 files with the .txt extension in r_releases folder.

Case-insensitive Mode

Search for all folders with the name R or r. Here we use the -iname option to ignore case while searching. The -type option specifies whether we are searching for files or folders; since we are searching for a directory, we pass d (for directory) to indicate that we are searching for directories and not files.

find -type d -iname R
## ./mypackage/R## ./r

Case-sensitive Mode

Search for all folders with the name r. It should exclude any folder with the name R.

find -type d -name r
## ./r
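
In R, list.files() and list.dirs() cover the common find use cases; a minimal sketch:

# find files by extension, searching recursively (find -name '*.txt')
list.files("r_releases", pattern = "\\.txt$", recursive = TRUE)
# find directories matching a name, ignoring case (find -type d -iname r)
dirs <- list.dirs(recursive = TRUE)
grep("(^|/)r$", dirs, ignore.case = TRUE, value = TRUE)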

Data Transfer & Network

In this section, we will explore commands that will allow us to download files from the internet.

Command : Description
wget : Download files from the web
curl : Transfer data from or to a server
hostname : Name of the current host
ping : Ping a remote host
nslookup : Name server details

We have not executed the commands in this ebook, as downloading multiple files from the internet would take a lot of time or result in errors, but we have checked all the commands offline to ensure that they work.

wget

The wget command will download contents of a URL and files from the internet. Using additional options, we can

  • download contents/files to a file
  • continue incomplete downloads
  • download multiple files
  • limit download speed and number of retries
Command : Description
wget url : Download contents of a url
wget -O file url : Download contents of url to a file
wget -c : Continue an incomplete download
wget -P folder_name -i urls.txt : Download all urls stored in a text file to a specific directory
wget --limit-rate : Limit download speed
wget --tries : Limit number of retries
wget --quiet : Turn off output
wget --no-verbose : Print basic information
wget --progress-dot : Change progress bar type to dot
wget --timestamping : Check if the timestamp of the file has changed before downloading
wget --wait : Wait between retrievals

Download URL

Let us first use wget to download the contents of a URL. Note that we are not downloading a file as such, but just the content of the URL. We will use the URL of the home page of the R project.

wget https://www.r-project.org/

If you look at the list of files, you can see a new file, index.html, which we just downloaded using wget. Downloading contents this way will lead to confusion if we are dealing with multiple URLs. Let us learn to save the contents to a file (we can specify the name of the file, which should help avoid confusion).

Specify Filename

In this example, we download contents from the same URL and in addition specify the name of the file in which the content must be saved. Here we save it in a new file, rhomepage.html, using the -O option followed by the filename. (Note that the uppercase -O writes the downloaded content to the given file; the lowercase -o instead writes wget's log messages to a file.)

wget -O rhomepage.html https://www.r-project.org/

Download File

How about downloading a file instead of a URL? In this example, we will download a logfile from the RStudio CRAN mirror. It contains the details of R downloads and individual package downloads. If you are a package developer and would want to know the countries in which your packages are downloaded, you will find this useful. We will download the file for 29th September and save it as sep_29.csv.gz.

wget -O sep_29.csv.gz http://cran-logs.rstudio.com/2019/2019-09-29.csv.gz

Download Multiple URLs

How do we download multiple URLs? One way is to specify the URLs one after the other separated by a space or save all URLs in a file and read them one by one. In the below example, we have saved multiple URLs in the file urls.txt.

cat urls.txt
## http://cran-logs.rstudio.com/2019/2019-09-26.csv.gz## http://cran-logs.rstudio.com/2019/2019-09-27.csv.gz## http://cran-logs.rstudio.com/2019/2019-09-28.csv.gz

We will download all the above URLs and save them in a new folder downloads. The -i indicates that the URLs must be read from a file (local or external). The -P option allows us to specify the directory into which all the files will be downloaded.

wget -P downloads -i urls.txt     

Quiet

The --quiet option will turn off wget output. It will not show any of the following details:

  • name of the file being saved
  • file size
  • download speed
  • ETA etc.
wget --quiet http://cran-logs.rstudio.com/2019/2019-10-06.csv.gz

No Verbose

Using the -nv or --no-verbose option, we can turn off verbose output without being completely quiet (as we were in the previous example). Any error messages and basic information will still be printed.

wget --no-verbose http://cran-logs.rstudio.com/2019/2019-10-13.csv.gz

Check Timestamp

Let us say we have already downloaded a file from a URL. The file is updated from time to time and we intend to keep the local copy up to date as well. With the --timestamping option, wget compares the timestamps of the local and remote files; if the remote file is not newer (i.e. it has not been updated), no download occurs. This is very useful for large files that you do not want to download again unless they have changed.

wget --timestamping http://cran-logs.rstudio.com/2019/2019-10-13.csv.gz

curl

The curl command will transfer data from or to a server. We will only look at downloading files from the internet.

Command : Description
curl url : Download contents of a url
curl url -o file : Download contents of url to a file
curl url > file : Download contents of url to a file
curl -s : Download in silent or quiet mode

Download URL

Let us download the home page of the R project using curl.

curl https://www.r-project.org/

Specify File

Let us download another log file from the RStudio CRAN mirror and save it into a file using the -o option.

curl http://cran-logs.rstudio.com/2019/2019-09-08.csv.gz -o sept_08.csv.gz 

Another way to save a downloaded file is to use > followed by the name of the file as shown in the below example.

curl http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz > sep_01.csv.gz

Download Silently

The -s option will allow you to download files silently. It will mute curl and will not display progress meter or error messages.

curl http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz -o sept_01.csv.gz -s

R Functions

In R, we can use download.file() to download files from the internet. The following packages offer functionalities that you will find useful.

Command : R
wget : download.file()
curl : curl::curl_download()
hostname : R.utils::getHostname.System()
ping : pingr::ping()
nslookup : curl::nslookup()
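
A minimal sketch of the two download equivalents (the URL is one of the CRAN log files used earlier in this chapter):

url <- "http://cran-logs.rstudio.com/2019/2019-09-29.csv.gz"
# download a file (wget -O / curl -o)
download.file(url, destfile = "sep_29.csv.gz")
# the same using the curl package
curl::curl_download(url, destfile = "sep_29.csv.gz")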

sudo

sudo (Super User DO) is a prefix for commands that only a superuser or root user is allowed to run. It is similar to the run as administrator option in Windows. It is used to install, update and remove software. We will use it in the next section to install & update packages. If you are using RStudio Cloud, you will not be able to run sudo (users do not have root privileges).

Command : Description
dpkg --list : List installed packages
sudo apt-get update : Update packages
sudo apt-get install : Install packages
sudo apt-get remove : Remove packages (retains configuration, plugins and settings)
sudo apt-get purge : Remove packages including personalized settings
sudo apt-get autoremove : Remove any dependencies no longer in use

File Compression

tar

The tar command is used for archiving files (and, combined with gzip, compressing them). It works with both tar and tar.gz extensions. It is used to

  • list files
  • extract files
  • create archives
  • append file to existing archives

tar creates, maintains, modifies, and extracts files that are archived in the tar format. Tar stands for tape archive and is an archiving file format.

Command : Description
tar tvf : List an archive
tar tvfz : List a gzipped archive
tar xvf : Extract an archive
tar xvfz : Extract a gzipped archive
tar cvf : Create an uncompressed tar archive
tar cvfz : Create a tar gzipped archive
tar rvf : Add a file to an existing archive
tar rvfz : Add a file to an existing gzipped archive

We will use different options along with the tar command for listing, extracting, creating and adding files. The vf option (v for verbosely showing progress and f for specifying the archive file name) is common to all the above operations, while the following are specific:

  • t for listing
  • x for extracting
  • c for creating
  • r for adding files

While dealing with tar.gz archives we will use z in addition to vf and the above options.

List

Let us list all the files & folders in release_names.tar. As mentioned above, to list the files in the archive, we use the t option.

tar -tvf release_names.tar 
## -rwxrwxrwx aravind/aravind 546 2019-09-16 15:59 release_names.txt## -rwxrwxrwx aravind/aravind  65 2019-09-16 15:58 release_names_18.txt## -rwxrwxrwx aravind/aravind  53 2019-09-16 15:59 release_names_19.txt

Extract

Let us extract files from release_names.tar using the x option in addition to vf.

tar -xvf release_names.tar
ls
## release_names.txt## release_names_18.txt## release_names_19.txt## analysis.R## bash.R## bash.Rmd## bash.html## bash.sh## imports_blorr.txt## imports_olsrr.txt## lorem-ipsum.txt## main_project.zip## myfiles## mypackage## myproject## myproject3## myproject4## package_names.txt## pkg_names.txt## r## r2## r_releases## release_names.tar## release_names.tar.gz## release_names.txt## release_names_18.txt## release_names_18_19.txt## release_names_19.txt## sept_15.csv.gz## urls.txt## zip_example.zip

Add

To add a file to an existing archive, use the r option. Let us add release_names_18.txt and release_names_19.txt to the release_names.tar archive.

tar -rvf release_names.tar release_names_18.txt release_names_19.txt
## release_names_18.txt
## release_names_19.txt

Create

Using the c option we can create tar archives. In the below example, we are using a single file but you can specify multiple files and folders as well.

tar -cvf pkg_names.tar pkg_names.txt
## pkg_names.txt

gzip

Command     Description
gzip        Compress a file
gzip -d     Decompress a file
gzip -c     Compress a file and specify the output file name
zip -r      Compress a directory
zip         Add files to an existing zip file
unzip       Extract files from a zip file
unzip -d    Extract files from a zip file into a specified directory
unzip -l    List contents of a zip file

The gzip, gunzip, and zcat commands are used to compress or expand files in the GNU GZIP format, i.e. files with the .gz extension.

Compress

Let us compress the release_names.txt file using gzip.

gzip release_names.txt
ls
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject3
## myproject4
## package_names.txt
## pkg_names.tar
## pkg_names.txt
## r
## r2
## r_releases
## release_names.tar
## release_names.tar.gz
## release_names.txt.gz
## release_names_18.txt
## release_names_18_19.txt
## release_names_19.txt
## sept_15.csv.gz
## urls.txt
## zip_example.zip

Decompress

Use the -d option with gzip to decompress a file. In the below example, we decompress the sept_15.csv.gz file (downloaded using wget or curl earlier). You can also use gunzip for the same result.

gzip -d sept_15.csv.gz
ls
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject3
## myproject4
## package_names.txt
## pkg_names.tar
## pkg_names.txt
## r
## r2
## r_releases
## release_names.tar
## release_names.tar.gz
## release_names.txt
## release_names_18.txt
## release_names_18_19.txt
## release_names_19.txt
## sept_15.csv
## urls.txt
## zip_example.zip

Specify Filename

Use -c and > to specify a different file name while compressing using gzip. In the below example, gzip will create releases.txt.gz instead of release_names.txt.gz.

gzip -c release_names.txt > releases.txt.gz
ls
## analysis.R
## bash.R
## bash.Rmd
## bash.html
## bash.sh
## imports_blorr.txt
## imports_olsrr.txt
## lorem-ipsum.txt
## main_project.zip
## myfiles
## mypackage
## myproject
## myproject3
## myproject4
## package_names.txt
## pkg_names.tar
## pkg_names.txt
## r
## r2
## r_releases
## release_names.tar
## release_names.tar.gz
## release_names.txt
## release_names_18.txt
## release_names_18_19.txt
## release_names_19.txt
## releases.txt.gz
## sept_15.csv
## urls.txt
## zip_example.zip

zip & unzip

zip creates ZIP archives while unzip lists and extracts compressed files in a ZIP archive.

List

Let us list all the files and folders in main_project.zip using unzip and the -l option.

unzip -l main_project.zip
## Archive:  main_project.zip
##   Length      Date    Time    Name
## ---------  ---------- -----   ----
##         0  2019-09-23 18:07   myproject/
##         0  2019-09-20 14:02   myproject/.gitignore
##         0  2019-09-23 18:07   myproject/data/
##         0  2019-09-20 14:02   myproject/data/processed/
##         0  2019-09-20 14:02   myproject/data/raw/
##         0  2019-09-20 14:02   myproject/output/
##         0  2019-09-20 14:02   myproject/README.md
##        13  2019-09-20 14:02   myproject/run_analysis.R
##         0  2019-09-20 14:02   myproject/src/
##         0  2019-09-23 18:07   mypackage/
##         0  2019-09-20 14:11   mypackage/.gitignore
##         0  2019-09-20 14:11   mypackage/.Rbuildignore
##         0  2019-09-20 14:10   mypackage/data/
##         0  2019-09-20 14:11   mypackage/DESCRIPTION
##         0  2019-09-20 14:10   mypackage/docs/
##         0  2019-09-20 14:11   mypackage/LICENSE
##         0  2019-09-20 14:10   mypackage/man/
##         0  2019-09-20 14:11   mypackage/NAMESPACE
##         0  2019-09-20 14:11   mypackage/NEWS.md
##         0  2019-09-20 14:10   mypackage/R/
##         0  2019-09-20 14:11   mypackage/README.md
##         0  2019-09-20 14:11   mypackage/src/
##         0  2019-09-20 14:10   mypackage/tests/
##         0  2019-09-20 14:10   mypackage/vignettes/
##         0  2019-09-23 18:07   myfiles/
##        12  2019-09-20 15:30   myfiles/analysis.R
##         7  2019-09-20 15:31   myfiles/NEWS.md
##         9  2019-09-20 15:31   myfiles/README.md
##       546  2019-09-20 15:29   myfiles/release_names.txt
##        65  2019-09-20 15:29   myfiles/release_names_18.txt
##        53  2019-09-20 15:30   myfiles/release_names_19.txt
##        12  2019-09-20 15:30   myfiles/visualization.R
##     15333  2019-10-01 16:58   bash.sh
##         0  2019-09-16 12:42   r/
## ---------                     -------
##     16050                     34 files

Extract

Using unzip, let us now extract files and folders from zip_example.zip.

unzip zip_example.zip
## Archive:  zip_example.zip
##    creating: zip_example/
##   inflating: zip_example/bash.sh
##   inflating: zip_example/pkg_names.txt

Using the -d option, we can extract the contents of zip_example.zip to a specific folder. In the below example, we extract it to a new folder, examples.

unzip zip_example.zip -d examples
## [1] "Archive:  zip_example.zip"                        ## [2] "   creating: examples/zip_example/"               ## [3] "  inflating: examples/zip_example/bash.sh  "      ## [4] "  inflating: examples/zip_example/pkg_names.txt  "

Compress

Use the -r option along with zip to create a ZIP archive. In the below example, we create a ZIP archive of the myproject folder.

zip -r myproject.zip myproject
ls
##   adding: myproject/ (stored 0%)
##   adding: myproject/.gitignore (stored 0%)
##   adding: myproject/data/ (stored 0%)
##   adding: myproject/data/processed/ (stored 0%)
##   adding: myproject/data/raw/ (stored 0%)
##   adding: myproject/output/ (stored 0%)
##   adding: myproject/README.md (stored 0%)
##   adding: myproject/run_analysis.R (stored 0%)
##   adding: myproject/src/ (stored 0%)

We can compress multiple directories using zip. The names of the directories must be separated by a space as shown in the below example where we compress myproject and mypackage into a single ZIP archive.

zip -r packproj.zip myproject mypackage
ls
##   adding: myproject/ (stored 0%)
##   adding: myproject/.gitignore (stored 0%)
##   adding: myproject/data/ (stored 0%)
##   adding: myproject/data/processed/ (stored 0%)
##   adding: myproject/data/raw/ (stored 0%)
##   adding: myproject/output/ (stored 0%)
##   adding: myproject/README.md (stored 0%)
##   adding: myproject/run_analysis.R (stored 0%)
##   adding: myproject/src/ (stored 0%)
##   adding: mypackage/ (stored 0%)
##   adding: mypackage/.gitignore (stored 0%)
##   adding: mypackage/.Rbuildignore (stored 0%)
##   adding: mypackage/data/ (stored 0%)
##   adding: mypackage/DESCRIPTION (stored 0%)
##   adding: mypackage/docs/ (stored 0%)
##   adding: mypackage/LICENSE (stored 0%)
##   adding: mypackage/man/ (stored 0%)
##   adding: mypackage/NAMESPACE (stored 0%)
##   adding: mypackage/NEWS.md (stored 0%)
##   adding: mypackage/R/ (stored 0%)
##   adding: mypackage/README.md (stored 0%)
##   adding: mypackage/src/ (stored 0%)
##   adding: mypackage/tests/ (stored 0%)
##   adding: mypackage/vignettes/ (stored 0%)

Add

To add a new file/folder to an existing archive, specify the name of the archive followed by the name of the file or the folder. In the below example, we add the bash.sh file to the myproject.zip archive created in a previous step.

zip myproject.zip bash.sh
##   adding: bash.sh (deflated 78%)

R Functions

tar & tar.gz

In R, we can use the tar() and untar() functions from the utils package to handle .tar and .tar.gz archives, as shown in the table and the sketch below.

Command     R
tar tvf     utils::untar('archive.tar', list = TRUE)
tar tvfz    utils::untar('archive.tar.gz', list = TRUE)
tar xvf     utils::untar('archive.tar')
tar xvfz    utils::untar('archive.tar.gz')
tar cvf     utils::tar('archive.tar')
tar cvfz    utils::tar('archive.tar', compression = 'gzip')
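
Here is a minimal sketch, assuming release_names.tar and release_names.txt from the earlier examples are present in the working directory (the output file name releases.tar.gz is made up for illustration):

# list the contents of the archive without extracting
utils::untar("release_names.tar", list = TRUE)

# extract all files from the archive
utils::untar("release_names.tar")

# create a gzipped archive containing a single file
utils::tar("releases.tar.gz", files = "release_names.txt", compression = "gzip")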

zip & gzip

The zip package provides functions for handling ZIP archives, and the tar() and untar() functions from the utils package can handle GZIP archives; see the sketch after the table.

Command     R
gzip        utils::tar(compression = 'gzip') / R.utils::gzip()
gzip -d     utils::untar() / R.utils::gunzip()
gzip -c     utils::untar(exdir = filename)
zip -r      zip::zip()
zip         zip::zipr_append()
unzip       zip::unzip()
unzip -d    zip::unzip(exdir = dir_name)
unzip -l    zip::zip_list()
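
And a minimal sketch for this table, assuming the R.utils and zip packages are installed and the files from the earlier examples exist (releases.txt.gz and the examples folder are illustrative names):

# compress a file, keeping the original (R.utils)
R.utils::gzip("release_names.txt", destname = "releases.txt.gz", remove = FALSE)
R.utils::gunzip("releases.txt.gz")

# create, list and extract a ZIP archive (zip)
zip::zip("myproject.zip", files = "myproject")
zip::zip_list("myproject.zip")
zip::unzip("myproject.zip", exdir = "examples")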

System Info

In this section, we will explore commands that will allow us to

  • display information about the system
  • display memory usage information
  • display file system disk space usage
  • exit the terminal
  • run commands as a superuser
  • shutdown the system
Command     Description
uname       Display important information about the system
free        Display free, used and swap memory in the system
df          Display file system disk space usage
exit        Exit the terminal
sudo        Run command as superuser
shutdown    Shutdown the system

uname

The uname command is used to view important information about the system. Using additional options, we can

  • print details about the operating system
  • print hardware & software related information
Command     Description
uname       Print details about the current machine and the operating system running on it
uname -mp   Hardware related information; machine & processor
uname -srv  Software related information; operating system, release number and version
uname -n    Nodename of the system
uname -a    Print all available information about the system

Print all available information about the system

uname -a
## Linux Aravind 4.4.0-18362-Microsoft #1-Microsoft Mon Mar 18 12:02:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux
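
As an aside (this mapping is not part of the shell workflow above), R's built-in Sys.info() returns similar details:

# sysname/release mirror uname -srv, machine mirrors -m, nodename mirrors -n
Sys.info()[c("sysname", "release", "machine", "nodename")]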

Display free, used and swap memory in the system

free
##               total        used        free      shared  buff/cache   available
## Mem:        3621900     2880840      511708       17720      229352      607328
## Swap:      11010048      316268    10693780

Display file system disk space usage

df
## Filesystem     1K-blocks      Used Available Use% Mounted on
## rootfs         188482144 134461208  54020936  72% /
## none           188482144 134461208  54020936  72% /dev
## none           188482144 134461208  54020936  72% /run
## none           188482144 134461208  54020936  72% /run/lock
## none           188482144 134461208  54020936  72% /run/shm
## none           188482144 134461208  54020936  72% /run/user
## cgroup         188482144 134461208  54020936  72% /sys/fs/cgroup
## C:\            188482144 134461208  54020936  72% /mnt/c
## D:\             18660348  17154312   1506036  92% /mnt/d
## F:\              3196924    231760   2965164   8% /mnt/f
## G:\             86383612  30395584  55988028  36% /mnt/g
## H:\             86383612  14755908  71627704  18% /mnt/h
## J:\             83185660  16892352  66293308  21% /mnt/j

Others

In this section, let us look at a few other useful commands that will allow us to

  • see how long a command takes to execute
  • delay activity
  • display and clear command history list
Command     Description
time        See how long a command takes to execute
sleep       Delay activity in seconds
sleep 1m    Delay activity in minutes
sleep 1h    Delay activity in hours
history     Display command history list with line numbers
history -c  Clear the command history list

Funny Commands

Below are a few funny commands for you to try out. Use sudo apt-get install to install fortune and banner before trying them.

Command     Description
fortune     Poignant, inspirational & silly phrases
yes         Output a string repeatedly until killed
banner      ASCII banner
rev         Reverse the characters in each line

R Functions

In R, we can use Sys.sleep() to delay activity and history() to view command history.

Command     R
sleep       Sys.sleep()
history     history()
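
For instance, to pause a script for two seconds and then inspect the commands run so far in the current interactive session:

Sys.sleep(2)   # delay activity for 2 seconds
history()      # display command history (interactive sessions only)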



Execute Commands from R

Now, let us turn our attention to executing commands from R using system2(). Here we will focus on the following

  • execute a command without arguments
  • execute commands with arguments
  • redirect output

Let us try to execute a command without any additional arguments. We will execute the ls command to list all files and directories. Use system2() and specify the command using the command argument. Whenever you are trying to execute a command from R, the first argument or input should be the command and it must be enclosed in quotes.

system2(command = "ls")
##   [1] "2016-02-07-variables.html"
##   [2] "2016-02-17-data-types-in-r.html"
##   [3] "2017-02-05-variables.Rmd"
##   [4] "2017-02-05-variables.html"
##   [5] "2017-02-17-data-types-in-r.Rmd"
##   [6] "2017-02-17-data-types-in-r.html"
##   [7] "2017-03-01-getting-help-in-r.html"
##   [8] "2017-03-13-beginners-guide-to-r-package-ecosystem.Rmd"
##   [9] "2017-03-13-beginners-guide-to-r-package-ecosystem.html"
##  [10] "2017-03-25-vectors.Rmd"
##  [11] "2017-03-25-vectors.html"
##  [12] "2017-03-29-vectors-part-2.Rmd"
##  [13] "2017-03-29-vectors-part-2.html"
##  [14] "2017-04-03-vectors-part-3.Rmd"
##  [15] "2017-04-03-vectors-part-3.html"
##  [16] "2017-04-06-matrix.Rmd"
##  [17] "2017-04-06-matrix.html"
##  [18] "2017-04-12-matrix-part-2.Rmd"
##  [19] "2017-04-12-matrix-part-2.html"
##  [20] "2017-04-18-lists.Rmd"
##  [21] "2017-04-18-lists.html"
##  [22] "2017-04-30-factors.Rmd"
##  [23] "2017-04-30-factors.html"
##  [24] "2017-05-12-dataframes.Rmd"
##  [25] "2017-05-12-dataframes.html"
##  [26] "2017-05-24-data-visualization-with-r-introduction.Rmd"
##  [27] "2017-05-24-data-visualization-with-r-introduction.html"
##  [28] "2017-06-05-data-visualization-with-r-title-and-axis-labels.Rmd"
##  [29] "2017-06-05-data-visualization-with-r-title-and-axis-labels.html"
##  [30] "2017-06-17-data-visualization-with-r-scatter-plots.Rmd"
##  [31] "2017-06-17-data-visualization-with-r-scatter-plots.html"
##  [32] "2017-06-29-data-visualization-with-r-line-graphs.Rmd"
##  [33] "2017-06-29-data-visualization-with-r-line-graphs.html"
##  [34] "2017-07-11-data-visualization-with-r-bar-plots.Rmd"
##  [35] "2017-07-11-data-visualization-with-r-bar-plots.html"
##  [36] "2017-07-23-data-visualization-with-r-box-plots.Rmd"
##  [37] "2017-07-23-data-visualization-with-r-box-plots.html"
##  [38] "2017-08-04-data-visualization-with-r-histogram.Rmd"
##  [39] "2017-08-04-data-visualization-with-r-histogram.html"
##  [40] "2017-08-16-data-visualization-with-r-legends.Rmd"
##  [41] "2017-08-16-data-visualization-with-r-legends.html"
##  [42] "2017-08-28-data-visualization-with-r-text-annotations.Rmd"
##  [43] "2017-08-28-data-visualization-with-r-text-annotations.html"
##  [44] "2017-09-09-data-visualization-with-r-combining-plots.Rmd"
##  [45] "2017-09-09-data-visualization-with-r-combining-plots.html"
##  [46] "2017-10-03-ggplot2-quick-tour.Rmd"
##  [47] "2017-10-03-ggplot2-quick-tour.html"
##  [48] "2017-10-15-ggplot2-introduction-to-geoms.Rmd"
##  [49] "2017-10-15-ggplot2-introduction-to-geoms.html"
##  [50] "2017-10-27-ggplot2-introduction-to-aesthetics.Rmd"
##  [51] "2017-10-27-ggplot2-introduction-to-aesthetics.html"
##  [52] "2017-11-08-ggplot2-axis-plot-labels.Rmd"
##  [53] "2017-11-08-ggplot2-axis-plot-labels.html"
##  [54] "2017-11-20-ggplot2-text-annotations.Rmd"
##  [55] "2017-11-20-ggplot2-text-annotations.html"
##  [56] "2017-12-02-ggplot2-scatter-plots.Rmd"
##  [57] "2017-12-02-ggplot2-scatter-plots.html"
##  [58] "2017-12-14-ggplot2-line-graphs.Rmd"
##  [59] "2017-12-14-ggplot2-line-graphs.html"
##  [60] "2017-12-26-ggplot2-bar-plots.Rmd"
##  [61] "2017-12-26-ggplot2-bar-plots.html"
##  [62] "2018-01-07-ggplot2-box-plots.Rmd"
##  [63] "2018-01-07-ggplot2-box-plots.html"
##  [64] "2018-01-19-ggplot2-histogram.Rmd"
##  [65] "2018-01-19-ggplot2-histogram.html"
##  [66] "2018-01-31-ggplot2-guides-axes.Rmd"
##  [67] "2018-01-31-ggplot2-guides-axes.html"
##  [68] "2018-02-12-ggplot2-guides-legends.Rmd"
##  [69] "2018-02-12-ggplot2-guides-legends.html"
##  [70] "2018-02-24-guides-legends-part-2.Rmd"
##  [71] "2018-02-24-guides-legends-part-2.html"
##  [72] "2018-03-08-legend-part-3.Rmd"
##  [73] "2018-03-08-legend-part-3.html"
##  [74] "2018-03-20-legend-part-4.Rmd"
##  [75] "2018-03-20-legend-part-4.html"
##  [76] "2018-04-01-legend-part-5.Rmd"
##  [77] "2018-04-01-legend-part-5.html"
##  [78] "2018-04-13-legend-part-6.Rmd"
##  [79] "2018-04-13-legend-part-6.html"
##  [80] "2018-04-25-ggplot2-facets-combine-multiple-plots.Rmd"
##  [81] "2018-04-25-ggplot2-facets-combine-multiple-plots.html"
##  [82] "2018-05-07-ggplot2-themes.Rmd"
##  [83] "2018-05-07-ggplot2-themes.html"
##  [84] "2018-07-30-importing-data-into-r-part-1.Rmd"
##  [85] "2018-07-30-importing-data-into-r-part-1.html"
##  [86] "2018-08-11-importing-data-into-r-part-2.Rmd"
##  [87] "2018-08-11-importing-data-into-r-part-2.html"
##  [88] "2018-08-23-data-wrangling-with-dplyr-part-1.Rmd"
##  [89] "2018-08-23-data-wrangling-with-dplyr-part-1.html"
##  [90] "2018-09-04-data-wrangling-with-dplyr-part-2.Rmd"
##  [91] "2018-09-04-data-wrangling-with-dplyr-part-2.html"
##  [92] "2018-09-16-data-wrangling-with-dplyr-part-3.Rmd"
##  [93] "2018-09-16-data-wrangling-with-dplyr-part-3.html"
##  [94] "2018-09-28-introduction-to-tibbles.Rmd"
##  [95] "2018-09-28-introduction-to-tibbles.html"
##  [96] "2018-10-10-readable-code-with-pipes.Rmd"
##  [97] "2018-10-10-readable-code-with-pipes.html"
##  [98] "2018-10-22-hacking-strings-with-stringr.Rmd"
##  [99] "2018-10-22-hacking-strings-with-stringr.html"
## [100] "2018-11-03-working-with-dates-in-r.Rmd"
## [101] "2018-11-03-working-with-dates-in-r.html"
## [102] "2018-11-15-working-with-categorical-data-using-forcats.Rmd"
## [103] "2018-11-15-working-with-categorical-data-using-forcats.html"
## [104] "2018-11-27-quick-guide-r-sqlite.Rmd"
## [105] "2018-11-27-quick-guide-r-sqlite.html"
## [106] "2018-12-09-data-wrangling-with-dbplyr.Rmd"
## [107] "2018-12-09-data-wrangling-with-dbplyr.html"
## [108] "2018-12-21-sql-for-data-science.Rmd"
## [109] "2018-12-21-sql-for-data-science.html"
## [110] "2019-01-02-sql-for-data-science-part-2.Rmd"
## [111] "2019-01-02-sql-for-data-science-part-2.html"
## [112] "2019-02-08-introducing-olsrr.Rmd"
## [113] "2019-02-08-introducing-olsrr.html"
## [114] "2019-02-12-introducing-rfm.Rmd"
## [115] "2019-02-12-introducing-rfm.html"
## [116] "2019-02-19-introducing-descriptr.Rmd"
## [117] "2019-02-19-introducing-descriptr.html"
## [118] "2019-02-20-introducing-descriptr.Rmd"
## [119] "2019-02-20-introducing-descriptr.html"
## [120] "2019-02-26-introducing-blorr.Rmd"
## [121] "2019-02-26-introducing-blorr.html"
## [122] "2019-03-05-getting-help-in-r.Rmd"
## [123] "2019-03-05-getting-help-in-r.html"
## [124] "2019-03-12-introducing-rbin.Rmd"
## [125] "2019-03-12-introducing-rbin.html"
## [126] "2019-03-14-introducing-vistributions.Rmd"
## [127] "2019-03-14-introducing-vistributions.html"
## [128] "2019-04-01-shiny-apps.Rmd"
## [129] "2019-04-01-shiny-apps.html"
## [130] "2019-04-11-web-scraping.Rmd"
## [131] "2019-04-11-web-scraping.html"
## [132] "2019-04-13-web-scraping-note.Rmd"
## [133] "2019-04-13-web-scraping-note.html"
## [134] "2019-05-02-mba.Rmd"
## [135] "2019-05-02-mba.html"
## [136] "2019-05-27-regex.Rmd"
## [137] "2019-05-27-regex.html"
## [138] "2019-07-05-pkginfo.Rmd"
## [139] "2019-07-05-pkginfo.html"
## [140] "2019-07-22-customer-segmentation-using-rfm-analysis.Rmd"
## [141] "2019-07-22-customer-segmentation-using-rfm-analysis.html"
## [142] "2019-08-08-working-with-databases-using-r.Rmd"
## [143] "2019-08-08-working-with-databases-using-r.html"
## [144] "2019-10-26-command-line-crash-course.Rmd"
## [145] "airline.dta"
## [146] "airline.sas7bdat"
## [147] "analysis.R"
## [148] "cline"
## [149] "config.yml"
## [150] "employee.sav"
## [151] "examples"
## [152] "hsb2.csv"
## [153] "hsb3.csv"
## [154] "hsb4.csv"
## [155] "imports_blorr.txt"
## [156] "imports_olsrr.txt"
## [157] "mydatabase.db"
## [158] "mypackage"
## [159] "myproject"
## [160] "online-retail.xlsx"
## [161] "options.R"
## [162] "package_names.csv"
## [163] "release.txt"
## [164] "sample.xls"
## [165] "transaction_data.csv"
## [166] "zip_example.zip"

Great! Now, how do we specify the options? The additional options of a command must be stored as a character vector and supplied via the args argument. In the below example, we delete the examples folder we created earlier while extracting zip_example.zip.

system2(command = "rm",        args    = c("-r", "examples"))
## character(0)

In some cases, we might want to redirect the output. Let us say we are writing a message to a file using the echo command. In this case, we want the output to be redirected to the release.txt file. The stdout argument can be used to redirect output to a file or to the R console. In the below example, we redirect the output to a file.

system2(command = "echo",         args    = c("Great Truth"),         stdout  = "release.txt")

In the next example, we redirect the output to the R console by setting the value of the stdout argument to TRUE. If you are curious, set the value to FALSE and see what happens.

system2(command = "diff",         args    = c("imports_olsrr.txt", "imports_blorr.txt"),          stdout  = TRUE)
## Warning in system2(command = "diff", args = c("imports_olsrr.txt",
## "imports_blorr.txt"), : running command '"diff" imports_olsrr.txt
## imports_blorr.txt' had status 1
##  [1] "1,4c1,5"      "< car "       "< checkmate"  "< cli"
##  [5] "< clisymbols" "---"          "> car"        "> caret"
##  [9] "> checkmate"  "> clisymbols" "> cli"
## attr(,"status")
## [1] 1

The run() command from the processx package can be used to execute shell commands as well; a minimal sketch follows.
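
As a minimal sketch (assuming the processx package is installed), the diff comparison above could be run as:

library(processx)

# diff exits with status 1 when the files differ, so do not treat
# a non-zero exit status as an R error
res <- run(
  "diff",
  args = c("imports_olsrr.txt", "imports_blorr.txt"),
  error_on_status = FALSE
)
cat(res$stdout)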

Execute Shell Commands in RStudio

In RStudio, commands can be executed from shell scripts by pressing Ctrl + Enter. Instead of sending the command to the R console, it is redirected to the terminal, where it is executed.

RMarkdown

RMarkdown supports bash, sh and awk. This post was initially created using sh, as the underlying operating system was Windows. Later, we used bash after installing the Windows Subsystem for Linux.

R in the Shell

In this section, we will learn to execute R commands and scripts in the command line using:

  • R -e
  • Rscript -e
  • R CMD BATCH

The -e option allows us to specify R expression(s). R -e will launch R and then execute the code specified within quotes. Use a semicolon to execute multiple expressions, as shown below. You will be able to run the below commands only if you can launch R from the command line; Windows users need to ensure that R is added to the PATH environment variable.

R -e "head(mtcars); tail(mtcars)"
##
## R version 3.6.1 (2019-07-05) -- "Action of the Toes"
## Copyright (C) 2019 The R Foundation for Statistical Computing
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## R is free software and comes with ABSOLUTELY NO WARRANTY.
## You are welcome to redistribute it under certain conditions.
## Type 'license()' or 'licence()' for distribution details.
##
##   Natural language support but running in an English locale
##
## R is a collaborative project with many contributors.
## Type 'contributors()' for more information and
## 'citation()' on how to cite R or R packages in publications.
##
## Type 'demo()' for some demos, 'help()' for on-line help, or
## 'help.start()' for an HTML browser interface to help.
## Type 'q()' to quit R.
##
## > head(mtcars); tail(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
## >
## >

Rscript -e will run the code without launching the interactive R console.

Rscript -e "head(mtcars)"
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We can use Rscript to execute an R script as well. In the below example, we execute the code in the analysis.R file.

Rscript analysis.R
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If you want to explore further, try the littler package as well.

What we have not covered…

  • shell scripting
  • editing files
  • file permissions
  • user information
  • pipes
  • awk
  • sed

Summary

  • Shell is a text-based application for viewing, handling and manipulating files
  • It is also known by the following names
    • CLI (Command Line Interface)
    • Terminal
    • Bash (Bourne Again Shell)
  • Use system2() and processx::run() in R to execute shell commands
  • Use Rscript -e or R -e to execute R scripts from the command line
  • RStudio includes a Terminal (from version 1.1.383)
  • Execute commands from a shell script in RStudio using Ctrl + Enter
  • RMarkdown supports bash, sh and awk

Feedback

If you see mistakes or want to suggest changes, please create an issue on the source repository or reach out to us at support@rsquaredacademy.com.


To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog.

