Querying Zenodo.org repository with R

(This article was first published on Carsten's blog, and kindly contributed to R-bloggers)

Zenodo

Zenodo is a repository that lets anyone deposit any type of research output, from any scientific discipline, free of charge.

EFSA is piloting its use for creating a knowledge base on all types of food-safety-related evidence (data, documents, models).

Zenodo has an API and can be queried using the standard OAI-PMH protocol, which allows the metadata of all deposits to be harvested.
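An OAI-PMH request is an ordinary HTTP GET whose query string names a verb and its arguments. The following is a minimal sketch in base R of how such a request URL could be assembled; the helper build_oai_url is hypothetical (it is not part of any package), while the endpoint, metadata prefix and set name are the ones used later in this post.

```r
# Hypothetical helper: assemble an OAI-PMH request URL from a verb and its arguments
build_oai_url <- function(base_url, verb, ...) {
  params <- c(list(verb = verb), list(...))
  # Percent-encode every value so it is safe inside a query string
  values <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(params), "=", values, collapse = "&"))
}

# ListRecords asks the repository for all records of one set in one metadata format
url <- build_oai_url(
  "https://zenodo.org/oai2d",
  verb = "ListRecords",
  metadataPrefix = "oai_datacite",
  set = "user-efsa-pilot"
)
print(url)
```

The oai package builds and sends requests of this kind itself, so the sketch is only meant to show what travels over the wire.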

‘oai’ package

R has a package available to query any OAI-PMH repository, including Zenodo. It can be installed from CRAN like this:

install.packages("oai")

The development version is available on GitHub at https://github.com/ropensci/oai

The libraries I use in this tutorial are:

library(knitr)
library(tidyverse)
library(httr)
library(oai)
library(xml2)
opts_chunk$set(echo = TRUE)

Retrieve records from Zenodo

The oai package can retrieve all records of a given Zenodo community, in this case the EFSA pilot community. The following code lists all records of the community with their digital object identifier and title.

record_list <- list_records("https://zenodo.org/oai2d",
                            metadataPrefix = "oai_datacite",
                            set = "user-efsa-pilot")
kable(record_list %>% select(identifier.3, title))
| identifier.3 | title |
|:--|:--|
| 10.5281/zenodo.57132 | EFSA Source Attribution Model (EFSA_SAM) |
| 10.5281/zenodo.57017 | PRIMo rev.1 – Pesticide Residue Intake Model |
| 10.5281/zenodo.56662 | Bee-Tool V.1 |
| 10.5281/zenodo.56668 | Bee-Tool V.2 |
| 10.5281/zenodo.154720 | Egg Pooling Module |
| 10.5281/zenodo.161300 | GMOANALYSIS VERSION 2.1.0 – 10 JULY 2014 |
| 10.5281/zenodo.159163 | Pesticide Residues Overview File: PROFile (3.0) |
| 10.5281/zenodo.154725 | Food Additives Intake Model (FAIM) – Version 1.1 – July 2013 |
| 10.5281/zenodo.163080 | Modelling continental-scale spread of Schmallenberg virus in Europe |
| 10.5281/zenodo.57079 | C-TSEMM – Cattle TSE Monitoring Model |
| 10.5281/zenodo.57505 | TSEi – TSE Infectivity Model |
| 10.5281/zenodo.159414 | Dietary Exposure Calculator Smoke Flavouring |
| 10.5281/zenodo.159890 | CHIP: Commodity based Hazard Identification Tool |
| 10.5281/zenodo.56287 | PRIMo rev.2 – Pesticide Residue Intake Model |
| 10.5281/zenodo.56669 | Bee-Tool V.3 |
| 10.5281/zenodo.161298 | Exposure of operators, workers, residents and bystanders in risk assessment for plant protection products calculator (Version 30MAR2015) |
| 10.5281/zenodo.163026 | Within farms transmission model for Schmallenberg Virus |
| 10.5281/zenodo.154724 | User-friendly interface version of the QMRA model for Salmonella in pigs |

Currently there are 18 records available.

Statistics on keywords

Query records from Zenodo

I was also interested in the current distribution of the keywords each record is tagged with. Zenodo supports two types of keywords: simple free-text keywords and ‘subjects’. Subjects need to come from a controlled vocabulary in which each topic has a URI.

EFSA uses the GACS vocabulary, so a topic such as ‘salmonella’ is represented by the URI ‘http://browser.agrisemantics.org/gacs/en/page/C2225’.

The API therefore returns only the URI for each subject, which is unique and unambiguous but not user-friendly as a label. Additional information is available at the URI of each ‘subject’.

The following code retrieves all records and extracts all their subjects (which have the XPath //d3:subject). The current oai package has some problems with some Zenodo-specific metadata, so I parse the raw XML by hand.

The OAI-PMH standard and the oai::get_records function let the client select the metadata format in which it receives the metadata. Here I selected ‘oai_datacite’ because it is recommended by the Zenodo API documentation and should contain all the metadata Zenodo supports, while other metadata formats might only support a smaller subset.

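# Fetch the raw XML of every record in the community
record_data_xml <- get_records(record_list$identifier,
                               url = "https://zenodo.org/oai2d",
                               prefix = "oai_datacite",
                               as = "raw")

# Parse each record, extract its subjects and count how often each one occurs
keyword_counts <- record_data_xml %>%
  map(read_xml) %>%
  map(xml_find_all, "//d3:subject") %>%
  map(xml_text) %>%
  reduce(c) %>%
  table() %>%
  tbl_df()

# Show only the entries containing "C22" or starting with "food"
kable(keyword_counts %>% filter(grepl(".*C22.*|^food", `.`)))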
| . | n |
|:--|--:|
| food additives | 1 |
| food additives intake model | 1 |
| food composition difference testing | 1 |
| http://id.agrisemantics.org/gacs/C22070 | 2 |
| http://id.agrisemantics.org/gacs/C22092 | 1 |
| http://id.agrisemantics.org/gacs/C2225 | 3 |

I use the ‘map’ function from the ‘purrr’ package to apply a number of transformations to every element of the result (each of which starts out as an XML string):

  1. read_xml() – to convert from string to class xml_document
  2. xml_find_all() – to find all xml nodes given by xpath expression
  3. xml_text() – get the text from the xml node

Then I combine all of these via c() and the reduce() function to obtain a single vector of all subjects.

The API returns both types of subjects, the generic keywords and the terms referring to a controlled vocabulary.

The table() command then produces a frequency table for them, of which I show a subset here. This table contains entries with an English label and some with a GACS URI.
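The extraction steps can be tried offline on two small hand-written records. This is only a sketch: the XML snippets and the namespace URL http://datacite.org/schema/kernel-3 are assumptions made for illustration, and the prefix dc is bound explicitly here instead of relying on the d3 prefix, which comes from the names (d1, d2, ...) that xml2 assigns automatically to default namespaces.

```r
library(xml2)
library(purrr)

# Two hand-written stand-ins for the raw XML of a record (illustrative only)
records <- c(
  '<resource xmlns="http://datacite.org/schema/kernel-3"><subjects>
     <subject>food additives</subject>
     <subject>http://id.agrisemantics.org/gacs/C2225</subject>
   </subjects></resource>',
  '<resource xmlns="http://datacite.org/schema/kernel-3"><subjects>
     <subject>http://id.agrisemantics.org/gacs/C2225</subject>
   </subjects></resource>'
)

# Bind a prefix of our own choosing to the (assumed) DataCite namespace
ns <- c(dc = "http://datacite.org/schema/kernel-3")

subjects <- records %>%
  map(read_xml) %>%                                                # string -> xml_document
  map(function(doc) xml_find_all(doc, "//dc:subject", ns = ns)) %>% # all subject nodes
  map(xml_text) %>%                                                # node -> text
  reduce(c)                                                        # one character vector

counts <- table(subjects)
print(counts)
```

Binding the prefix explicitly keeps the XPath stable even if the number or order of default namespaces in the response changes.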

Add human readable label to GACS topics

To add a human-readable label to each GACS URI, I use the GACS API, which allows querying information on each topic. I call the API for each URI and build a list in which each element holds a (URI, label) pair. This then gets converted into a table with bind_rows().

I again use the ‘map’ function with an anonymous function that makes the call to the GACS API. GACS uses the Skosmos software (https://github.com/NatLibFi/Skosmos), so it has an API (http://api.finto.fi/doc/) to query the vocabulary.

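# Keep only the subjects that are GACS URIs
gacs <- keyword_counts %>% filter(grepl("gacs", `.`, fixed = TRUE))

# Ask the GACS API for the English label of each URI
gacs_label_en <- map(gacs$`.`, function(uri) {
  r <- GET("http://browser.agrisemantics.org/rest/v1/gacs/label",
           query = list(uri = uri, lang = "en"))
  list(uri = uri, label = content(r)$prefLabel)
}) %>%
  bind_rows()

kable(gacs_label_en[1:5, ])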
| uri | label |
|:--|:--|
| http://id.agrisemantics.org/gacs/C10152 | Bayesian theory |
| http://id.agrisemantics.org/gacs/C10826 | commodities |
| http://id.agrisemantics.org/gacs/C12237 | flavourings |
| http://id.agrisemantics.org/gacs/C1263 | screening |
| http://id.agrisemantics.org/gacs/C14046 | emerging infectious diseases |
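The pattern of mapping a lookup over URIs and binding the results into one table can be tried without network access. In this sketch the function lookup_label and the vector fake_labels are hypothetical stand-ins for the GACS API call, and the third URI is invented to show what happens when no label is found.

```r
library(dplyr)
library(purrr)

# Hypothetical local stand-in for the label service (not the real GACS API)
fake_labels <- c(
  "http://id.agrisemantics.org/gacs/C2225" = "Salmonella",
  "http://id.agrisemantics.org/gacs/C284"  = "pesticides"
)

# Return a one-row tibble per URI; the label is NA when the URI is unknown
lookup_label <- function(uri) {
  tibble(uri = uri, label = unname(fake_labels[uri]))
}

uris <- c(
  names(fake_labels),
  "http://id.agrisemantics.org/gacs/C0000"  # invented URI with no label
)

# One lookup per URI, then stack the one-row tibbles into a single table
labels <- uris %>%
  map(lookup_label) %>%
  bind_rows()

print(labels)
```

Returning an explicit NA for a missing label keeps every URI in the result, which makes gaps visible in the later join.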

Distributions of labels in efsa-pilot community

To get the final table, I join the label-URI pairs with the earlier frequency table and do some clean-up with functions from the tidyr package.

The table is then sorted by frequency and shown on the screen.

As we can see, the most frequent terms are ‘risk assessment’ and ‘exposure assessment’, which is no surprise as these are at the core of EFSA’s scientific work.

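# Join counts with labels, build a "label - URI" column and sort by frequency
keyword_table <- left_join(keyword_counts, gacs_label_en, by = c("." = "uri")) %>%
  replace_na(list(label = "")) %>%
  unite("label", c(label, `.`), sep = " - ") %>%
  mutate(label = gsub("^ - ", "", label)) %>%
  rename(count = n) %>%
  arrange(desc(count))

write.csv(keyword_table, "keywords.csv", row.names = FALSE)
knitr::kable(keyword_table %>% slice(1:20))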
| label | count |
|:--|--:|
| risk assessment - http://id.agrisemantics.org/gacs/C1470 | 8 |
| quantitative analysis - http://id.agrisemantics.org/gacs/C603 | 7 |
| exposure assessment - http://id.agrisemantics.org/gacs/C29232 | 6 |
| population - http://id.agrisemantics.org/gacs/C2955 | 5 |
| prion diseases - http://id.agrisemantics.org/gacs/C18728 | 4 |
| pesticides - http://id.agrisemantics.org/gacs/C284 | 4 |
| Apoidea - http://id.agrisemantics.org/gacs/C1932 | 3 |
| Salmonella - http://id.agrisemantics.org/gacs/C2225 | 3 |
| pesticide residues - http://id.agrisemantics.org/gacs/C3009 | 3 |
| linear models - http://id.agrisemantics.org/gacs/C3504 | 3 |
| model validation - http://id.agrisemantics.org/gacs/C4332 | 3 |
| time - http://id.agrisemantics.org/gacs/C4525 | 3 |
| pollinators - http://id.agrisemantics.org/gacs/C5325 | 3 |
| decision support systems - http://id.agrisemantics.org/gacs/C8154 | 3 |
| acute risk assesment | 2 |
| chronic risk assesment | 2 |
| Epidemiology | 2 |
| exposure assessment | 2 |
| bovine spongiform encephalopathy - http://id.agrisemantics.org/gacs/C14182 | 2 |
| calculation - http://id.agrisemantics.org/gacs/C15337 | 2 |
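The join and clean-up step can be reproduced on a toy example. This sketch is a variant, not the code used for the table above: it builds the combined label with if_else() instead of replace_na(), unite() and gsub(), and the column name keyword and the two input rows are chosen for illustration only.

```r
library(dplyr)

# Toy frequency table: one GACS URI and one free-text keyword
keyword_counts <- tibble(
  keyword = c("food additives", "http://id.agrisemantics.org/gacs/C2225"),
  n = c(1L, 3L)
)

# Toy label table, as the label lookup would return it
gacs_labels <- tibble(
  uri = "http://id.agrisemantics.org/gacs/C2225",
  label = "Salmonella"
)

result <- keyword_counts %>%
  left_join(gacs_labels, by = c("keyword" = "uri")) %>%
  # Free-text keywords have no label after the join, so keep them unchanged;
  # URIs get "label - URI"
  mutate(label = if_else(is.na(label), keyword,
                         paste(label, keyword, sep = " - "))) %>%
  select(label, count = n) %>%
  arrange(desc(count))

print(result)
```

Handling the missing label in one conditional avoids creating a leading separator that then has to be stripped again.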

Monitoring this distribution regularly can help keep the list of keywords clean and can eventually suggest additional subjects to propose for the GACS vocabulary.

Session info

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
## 
## Matrix products: default
## BLAS: /usr/lib/libblas_nehalemp-r0.2.19.so
## LAPACK: /usr/lib/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] bindrcpp_0.2      xml2_1.1.1        oai_0.2.2        
##  [4] httr_1.3.1        dplyr_0.7.3.9000  purrr_0.2.3      
##  [7] readr_1.1.1       tidyr_0.7.1       tibble_1.3.4.9001
## [10] ggplot2_2.2.1     tidyverse_1.1.1   knitr_1.17       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12      highr_0.6         cellranger_1.1.0 
##  [4] pillar_0.0.0.9000 compiler_3.4.1    plyr_1.8.4       
##  [7] bindr_0.1         methods_3.4.1     forcats_0.2.0    
## [10] tools_3.4.1       digest_0.6.12     lubridate_1.6.0  
## [13] jsonlite_1.5      evaluate_0.10.1   nlme_3.1-131     
## [16] gtable_0.2.0      lattice_0.20-35   pkgconfig_2.0.1  
## [19] rlang_0.1.2       psych_1.7.8       parallel_3.4.1   
## [22] haven_1.1.0       stringr_1.2.0     hms_0.3          
## [25] tidyselect_0.2.0  grid_3.4.1        glue_1.1.1       
## [28] R6_2.2.2          readxl_1.0.0      foreign_0.8-69   
## [31] modelr_0.1.1      reshape2_1.4.2    magrittr_1.5     
## [34] servr_0.7         scales_0.5.0      rvest_0.3.2      
## [37] assertthat_0.2.0  mnormt_1.5-5      colorspace_1.3-2 
## [40] httpuv_1.3.5      stringi_1.1.5     lazyeval_0.2.0   
## [43] munsell_0.4.3     broom_0.4.2
