ojsr allows you to crawl OJS archives, issues, articles, galleys, and search results, and to retrieve metadata from articles.
Important Notes:
(from the OJS documentation https://pkp.sfu.ca/ojs/, as of Jan.2020)
Open Journal Systems (OJS) is a journal management and publishing system that has been developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research.
OJS assists with every stage of the refereed publishing process, from submissions through to online publication and indexing. Through its management systems, its finely grained indexing of research, and the context it provides for research, OJS seeks to improve both the scholarly and public quality of refereed research.
OJS is open source software made freely available to journals worldwide for the purpose of making open access publishing a viable option for more journals, as open access can increase a journal’s readership as well as its contribution to the public good on a global scale (see PKP Publications).
Since v3.1 (https://docs.pkp.sfu.ca/dev/api/ojs/3.1), OJS provides a REST API. We’re positive that a better R interface should use that API instead of web scraping.
So, why ojsr? According to https://pkp.sfu.ca/ojs/ojs-usage/ojs-stats/, as of 2019 (when v3.1+ was launched), OJS was being used by at least 10,000 journals worldwide. OJS is an excellent free publishing solution for institutions that probably could not publish otherwise and, presumably, cannot afford to update constantly. ojsr aims to help with crawling and retrieving information from OJS during this legacy period.
Let’s say we want to scrape metadata from a collection of journals in order to compare them. We have their names and urls, and can use ojsr to scrape their issues, articles, and metadata.
# first, load the library
library(ojsr)
# we'll use dplyr and ggplot on this example
library(tidyverse)
# our collection of journals
journals <- data.frame(
  name = c("PSocial", "Odisea"),
  url = c(
    "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial",
    "https://publicaciones.sociales.uba.ar/index.php/odisea"
  ),
  stringsAsFactors = FALSE
)
# we are using the journal url as input to retrieve the issues
issues <- ojsr::get_issues_from_archive(input_url = journals$url)
# we are using the issue urls we just scraped as the input to retrieve the articles
articles <- ojsr::get_articles_from_issue(input_url = issues$output_url)
# we are using the article urls we just scraped as the input to retrieve the metadata
metadata <- ojsr::get_html_meta_from_article(input_url = articles$output_url)
Before doing any analysis, let’s bind these together to get a better understanding of our journals. Since we are interested in summarizing by journal, we can use ojsr::parse_base_url() on our tables to get a common key to join on.
# we are including the base_url on each table to simplify joining
journals$base_url <- ojsr::parse_base_url(journals$url)
issues$base_url <- ojsr::parse_base_url(issues$input_url)
articles$base_url <- ojsr::parse_base_url(articles$input_url)
metadata$base_url <- ojsr::parse_base_url(metadata$input_url)
# a journal / issue / articles/ metadata / keywords / keywords/article table
journals %>%
left_join( issues %>% group_by( base_url ) %>% summarise(n_issues=n()) , by="base_url") %>%
left_join( articles %>% group_by( base_url ) %>% summarise(n_articles=n()) , by="base_url") %>%
left_join( metadata %>% group_by( base_url ) %>% summarise(n_metadata=n()) , by="base_url") %>%
left_join( metadata %>% filter(meta_data_name=="citation_keywords") %>% group_by( base_url ) %>% summarise(n_keywords=n()) , by="base_url") %>%
mutate( key_art = n_keywords/n_articles ) %>%
select(name, n_issues, n_articles, n_metadata, n_keywords, key_art)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#>      name n_issues n_articles n_metadata n_keywords  key_art
#> 1 PSocial       11         69       3176        160 2.318841
#> 2  Odisea        6         66       3063        350 5.303030
Now, we can do our analysis: exploring the main keywords per journal. For this, we keep only non-empty keywords metadata in Spanish; then, we pick the 3 most frequent keywords (normally you would do some cleanup and normalization first); finally we plot by journal.
metadata %>% filter(meta_data_name=="citation_keywords", meta_data_xmllang=="es") %>% # filtering keywords
group_by(base_url, keyword = meta_data_content) %>% tally(sort=TRUE) %>% top_n(wt = n, n = 3) %>% # 3 most frequent keywords by journal
left_join( journals , by="base_url") %>% # let's include the journal names
ggplot(aes(x=reorder(keyword,n),y=n)) + facet_wrap(~name, scales = "free") + geom_bar(stat = "identity") + coord_flip()
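The example above skips the keyword cleanup and normalization step. A minimal sketch of what that step might look like, using base R on toy data; the `normalize_keywords()` helper and the sample strings are hypothetical and not part of ojsr:

```r
# Hypothetical helper (not part of ojsr): split multi-keyword strings,
# trim whitespace, lower-case, and drop empty entries.
normalize_keywords <- function(x) {
  parts <- unlist(strsplit(x, "[;,]"))  # split on common separators
  parts <- tolower(trimws(parts))       # normalize case and spacing
  parts[nzchar(parts)]                  # drop empty strings
}

# Toy data standing in for metadata$meta_data_content
raw_keywords <- c("Identidad; Trabajo", " identidad ", "Trabajo, Salud", "")
clean_keywords <- normalize_keywords(raw_keywords)

# Frequency table, most frequent first
keyword_counts <- sort(table(clean_keywords), decreasing = TRUE)
print(keyword_counts)
```

Real keyword fields usually need more than this (accent handling, singular/plural merging), so treat it as a starting point only.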
get_issues_from_archive() takes a vector of OJS urls and scrapes the issue urls from the issue archive (e.g., https://papiro.unizar.es/ojs/index.php/rc51-jos/issue/archive).
You don’t need to provide the actual url of the issue archive: get_issues_from_archive() parses the url you provide to compose it. It then looks for links containing “/issue/view” in the href. Links are post-processed to comply with OJS routing conventions before being returned.
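The link-filtering step described above can be illustrated in base R. This is only a sketch of the idea, not the package's implementation; the hrefs are made up and the normalization rule (keep the url up to the issue ID) is an assumption:

```r
# Toy hrefs, as they might be collected from an archive page (made up)
hrefs <- c(
  "https://example.org/index.php/journal/issue/view/12",
  "https://example.org/index.php/journal/issue/view/12/showToc",
  "https://example.org/index.php/journal/about",
  "https://example.org/index.php/journal/issue/view/7"
)

# Keep only links that point to an issue view
issue_links <- hrefs[grepl("/issue/view", hrefs, fixed = TRUE)]

# Normalize to ".../issue/view/<id>" and drop duplicates (assumed rule)
issue_links <- unique(sub("(/issue/view/[^/?#]+).*$", "\\1", issue_links))
print(issue_links)
```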
journals <- c(
'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive', # issue archive
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903' # article
)
issues <- ojsr::get_issues_from_archive(input_url = journals)
The result is a long-format data frame: one input_url may yield several rows, one for each output_url.
get_articles_from_issue() takes a vector of OJS (issue) urls and scrapes the links to articles from each issue’s table of contents (e.g., https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/issue/view/319/showToc).
You don’t need to provide the actual url of the issue’s ToC, but you must provide urls that include the issue ID (article urls do not include this info!). get_articles_from_issue() parses the url you provide to compose the ToC url. It then looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before being returned.
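To see why the issue ID is required, here is a minimal sketch of composing a ToC url from an issue url. The `compose_toc_url()` helper is hypothetical (not part of ojsr) and assumes the ".../issue/view/<id>/showToc" pattern shown in the example url above:

```r
# Hypothetical helper (not part of ojsr): build the table-of-contents url
# for an issue url, assuming the ".../issue/view/<id>/showToc" pattern.
compose_toc_url <- function(issue_url) {
  # keep everything up to and including the issue ID
  trimmed <- sub("(/issue/view/[^/?#]+).*$", "\\1", issue_url)
  has_id <- grepl("/issue/view/[^/?#]+", issue_url)
  # urls without an issue ID cannot be resolved, so they yield NA
  ifelse(has_id, paste0(trimmed, "/showToc"), NA_character_)
}

toc_urls <- compose_toc_url(c(
  "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/issue/view/319",
  "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903"
))
print(toc_urls)
```

The second input is an article url, which carries no issue ID, so the sketch returns NA for it.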
issues <- c(
'https://revistas.ucn.cl/index.php/saludysociedad/issue/view/65', # issue including ToC
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/issue/view/31' # no ToC nor links
)
articles <- ojsr::get_articles_from_issue(input_url = issues)
The result is a long-format data frame: one input_url may yield several rows, one for each output_url.
get_articles_from_search() takes a vector of OJS urls and a search string, composes the search result urls, and then scrapes them to retrieve the articles’ urls.
You don’t need to provide the actual url of the search result pages. get_articles_from_search() parses the url you provide to compose the url of the search result page(s). If the results are paginated, the links to the additional pages are followed as well. It then looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before being returned.
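As a rough illustration of the url composition step, the sketch below builds a search url from a journal base url and a search string. The `compose_search_url()` helper is hypothetical, and the "search/search?query=" route is an assumption about OJS routing that may differ between OJS versions; it is not taken from the package's code:

```r
# Hypothetical helper (not part of ojsr). The "search/search?query=" route
# is an assumption and may not match every OJS version.
compose_search_url <- function(journal_url, search_criteria) {
  base <- sub("/+$", "", journal_url)  # drop trailing slashes
  query <- utils::URLencode(search_criteria, reserved = TRUE)  # escape spaces etc.
  paste0(base, "/search/search?query=", query)
}

search_url <- compose_search_url(
  "https://revistapsicologia.uchile.cl/index.php/RDP/",
  "psicologia social"
)
print(search_url)
```

The sketch expects a journal base url as input; it does not reduce deeper urls (such as an issue page) to the base first.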
journals <- c(
'https://revistapsicologia.uchile.cl/index.php/RDP/',
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/issue/current'
)
criteria <- "psicologia"
articles_search <- ojsr::get_articles_from_search(input_url = journals, search_criteria = criteria)
The result is a long-format data frame: one input_url may yield several rows, one for each output_url.
Galleys are the final presentation version of the content of the articles. Most of the time, these include the full content in PDF and other reading formats. Less often, they are supplementary files (tables, datasets) in different formats.
get_galleys_from_article() takes a vector of OJS urls and scrapes all the galley urls from the article view (e.g., https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/593).
You may provide any article-level url (article abstract view, inline view, pdf direct download, etc.). get_galleys_from_article() parses the url you provide to compose the url of the article view. It then looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before being returned (i.e., they must include a galley ID).
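The galley-ID condition can be sketched as a simple pattern check: a galley link has two path segments after "/article/view/" (article ID, then galley ID). The hrefs below are made up, and this is an illustration of the rule rather than the package's actual code:

```r
# Toy hrefs, as they might appear on an article page (made up)
hrefs <- c(
  "https://example.org/index.php/journal/article/view/593",          # abstract view
  "https://example.org/index.php/journal/article/view/593/1024",     # galley
  "https://example.org/index.php/journal/article/download/593/1024", # direct download
  "https://example.org/index.php/journal/article/view/593/1025"      # another galley
)

# A galley link carries both an article ID and a galley ID after "/article/view/"
is_galley <- grepl("/article/view/[^/?#]+/[^/?#]+", hrefs)
galley_links <- hrefs[is_galley]
print(galley_links)
```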
articles <- c(
'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/55657', # galleys pdf and mp3
'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # inline reader
)
galleys <- ojsr::get_galleys_from_article(input_url = articles)
The result is a long-format data frame: one input_url may yield several rows, one for each output_url.
get_html_meta_from_article() takes a vector of OJS urls and scrapes all the metadata written in the html of the article view (e.g., https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/593).
You may provide any article-level url (article abstract view, inline view, pdf direct download, etc.). get_html_meta_from_article() parses the url you provide to compose the url of the article view. It then looks for <meta> tags in the <head> section of the html. Important! This may retrieve more than bibliographic metadata: any other “meta” property present in the html will also be returned (e.g., descriptions meant for sharing on social networks).
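To make the idea concrete, the sketch below pulls <meta> tags out of a small hand-written html string with base R regular expressions. The html, the `get_attr()` helper, and the output column names are all made up for illustration; regex-based parsing is fragile, and a real scraper would normally use an HTML parser (for example the xml2 or rvest packages):

```r
# Toy html standing in for an article page (made up)
html <- paste(
  "<html><head>",
  '<meta name="citation_title" content="A sample title"/>',
  '<meta name="citation_keywords" xml:lang="es" content="identidad"/>',
  '<meta property="og:description" content="Shared on social networks"/>',
  "</head><body></body></html>",
  sep = "\n"
)

# Extract every <meta ...> tag
meta_tags <- regmatches(html, gregexpr("<meta[^>]*>", html))[[1]]

# Hypothetical helper: pull one attribute value out of each tag
# (NA when the attribute is absent)
get_attr <- function(tags, attr) {
  pattern <- paste0('.*\\s', attr, '="([^"]*)".*')
  ifelse(grepl(pattern, tags), sub(pattern, "\\1", tags), NA_character_)
}

meta <- data.frame(
  name    = get_attr(meta_tags, "name"),
  content = get_attr(meta_tags, "content"),
  lang    = get_attr(meta_tags, "xml:lang"),
  stringsAsFactors = FALSE
)
print(meta)
```

The third tag has no name attribute and is not bibliographic, which is the kind of extra "meta" property the warning above refers to.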
articles <- c(
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2137', # article
'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # xml galley
)
metadata <- ojsr::get_html_meta_from_article(input_url = articles)
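The result is a long-format data frame: one input_url may yield several rows, one for each metadata item found. The columns used in this document include input_url, meta_data_name, meta_data_content, and meta_data_xmllang.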
An alternative to web-scraping metadata from the html of article pages is to retrieve their OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) records; see http://www.openarchives.org/OAI/openarchivesprotocol.html
get_oai_meta_from_article() will try to access the OAI record within the OJS for each article url you provide (e.g., https://fundacionmenteclara.org.ar/revista/index.php/RCA/oai/?verb=GetRecord&identifier=oai:ojs.fundacionmenteclara.org.ar:article/43&metadataPrefix=oai_dc).
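The example url above follows the OAI-PMH GetRecord request format. A minimal sketch of composing such a url is shown below; the `compose_oai_record_url()` helper is hypothetical (not part of ojsr), and the repository identifier is site-specific, so the sketch takes it as an explicit argument rather than guessing it:

```r
# Hypothetical helper (not part of ojsr): compose an OAI-PMH GetRecord url.
# The repository identifier is configured per site, so it is passed in explicitly.
compose_oai_record_url <- function(base_url, repository_id, article_id) {
  identifier <- paste0("oai:", repository_id, ":article/", article_id)
  paste0(
    base_url, "/oai/?verb=GetRecord",
    "&identifier=", identifier,
    "&metadataPrefix=oai_dc"
  )
}

record_url <- compose_oai_record_url(
  base_url = "https://fundacionmenteclara.org.ar/revista/index.php/RCA",
  repository_id = "ojs.fundacionmenteclara.org.ar",
  article_id = 43
)
print(record_url)
```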
articles <- c(
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2137', # article
'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # xml galley
)
metadata_oai <- ojsr::get_oai_meta_from_article(input_url = articles)
The result is a long-format data frame (one input_url may yield several rows), laid out like the get_html_meta_from_article() results.
Note: This function is at a very preliminary stage. If you are interested in working with OAI records, you may want to check Scott Chamberlain’s OAI package for R https://CRAN.R-project.org/package=oai. If you only have the OJS home url and would like to check all the articles’ OAI records in one shot, an interesting option is to parse it with ojsr::parse_oai_url() and pass the resulting url to oai::list_identifiers().
parse_base_url() takes a vector of OJS urls and returns their base urls, according to OJS routing conventions.
mix_links <- c(
'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903'
)
base_url <- ojsr::parse_base_url(input_url = mix_links)
The result is a vector of the same length as your input.
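For intuition, base-url parsing can be approximated with a single regular expression. The `sketch_base_url()` function below is a hypothetical re-implementation, not ojsr's code; it assumes urls of the form ".../index.php/<journal>/<page>/..." and returns NA for anything else (for example, installations configured without "index.php"):

```r
# Hypothetical re-implementation (not ojsr's code): keep everything up to and
# including the journal segment that follows "index.php".
sketch_base_url <- function(url) {
  pattern <- "^(.*/index\\.php/[^/?#]+).*$"
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

base_sketch <- sketch_base_url(c(
  "https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive",
  "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903",
  "https://example.org/no-ojs-routing-here"  # made-up url without "index.php"
))
print(base_sketch)
```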
parse_oai_url() takes a vector of OJS urls and returns their OAI entry urls, according to OJS routing conventions.
mix_links <- c(
'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903'
)
oai_url <- ojsr::parse_oai_url(input_url = mix_links)
The result is a vector of the same length as your input.