Introduction
This document describes how to use the dataone R package to upload data to DataONE, and how to perform maintenance operations on the data after upload.
The dataone R package provides methods to enable R scripts to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN), to search for, download, upload and update data and metadata. The dataone R package takes care of the details of calling the corresponding DataONE web service on a DataONE node. For example, the dataone createObject R method calls the DataONE web service MNStorage.create() that uploads a dataset to a DataONE MN.
Before uploading any data to a DataONE MN, it is necessary to obtain a DataONE user identity that will be provided with each request to upload or update data. The method that DataONE uses to achieve this is known as user identity authentication, and requires that an authentication token, which is a character string, be provided during upload. The process to obtain this token is described in the DataONE Federation vignette, in the section DataONE User Authentication With Tokens, which is viewable with the R command vignette("dataone-overview"). (Note: DataONE originally used X.509 certificates for authentication, which are still supported.)
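As a minimal sketch of how a token might be supplied once it has been obtained: to the best of our knowledge the dataone package reads the token from an R option, dataone_test_token for the test environments and dataone_token for production. The token strings below are placeholders, not real values, and the dataone-overview vignette remains the authoritative description.

```r
# Placeholder token: replace with the string copied from your DataONE profile.
# Avoid committing a real token to a script or a repository.
options(dataone_test_token = "paste-your-test-environment-token-here")

# For the production environment the corresponding option is dataone_token:
# options(dataone_token = "paste-your-production-token-here")

# Confirm that a token is set for this R session without printing it.
tokenIsSet <- !is.null(getOption("dataone_test_token"))
tokenIsSet
```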
Uploading A Package Using uploadDataPackage
Datasets and metadata can be uploaded individually or as a collection. Such a collection, whether contained in local R objects or existing on a DataONE repository, will be informally referred to as a package or ‘data package’. Figure 1 is a diagram of a typical DataONE package, showing a metadata file that describes, or documents, the data granules that the package contains.
The steps necessary to prepare and upload a package to DataONE using the uploadDataPackage method will be shown. A complete script that uses these steps is shown here:
library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
dp <- addMember(dp, metadataObj)
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)
dp <- addMember(dp, sourceObj, metadataObj)
progFile <- system.file("extdata/filterObs.R", package="dataone")
progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc")
dp <- addMember(dp, progObj, metadataObj)
outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone")
outputObj <- new("DataObject", format="text/csv", filename=outputData)
dp <- addMember(dp, outputObj, metadataObj)
myAccessRules <- data.frame(subject="http://orcid.org/0000-0002-2192-403X", permission="changePermission")
This particular package contains the R script filterObs.R, the input file OwlNightj.csv that was read by the script, and the output file Strix-occidentalis-obs.csv that was created by the R script, which was run at a previous time.
The following sections describe each line of the above script in detail.
1. Create a DataPackage object.
In order to use uploadDataPackage, it is necessary to prepare an R DataPackage object, which is a container for the set of files that will be included in the package. The following commands load the required libraries and create an empty DataPackage object that will be added to later:
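The corresponding lines of the complete script above are repeated here as a sketch, with comments added for illustration.

```r
library(dataone)   # client for DataONE Coordinating and Member Nodes
library(datapack)  # DataPackage and DataObject classes
library(uuid)      # UUID generation for identifiers

# Create an empty DataPackage; members are added in the following steps.
dp <- new("DataPackage")
```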
When using the uploadDataPackage method, data structures that are required by DataONE are created, configured and uploaded automatically with the package. These data structures include a ResourceMap that details the contents of the package, and SystemMetadata objects that contain DataONE system information for each of the science datasets and associated science metadata.
A dataone DataObject is a container that holds both the data bytes and the system information for a metadata file, data or other type of file. A DataObject is created for each file that will be included in a DataPackage.
2. Prepare a metadata file that will describe the files in the package
The next step is to prepare a metadata file that will describe the science datasets and other files in the package. The most common metadata format used in the DataONE network is the Ecological Metadata Language (EML). Other supported formats include FGDC and ISO 19115, among others. Additional information about EML is available at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html.
Detailed directions regarding authoring metadata documents are outside the scope of this document.
DataONE requires that any file uploaded to a member node have a unique identifier associated with it.
When a DataObject is created, a unique identifier is generated for the DataObject if one is not specified using the id parameter. This automatically generated identifier has the format “urn:uuid:” followed by a UUID, for example “urn:uuid:c3443142-6260-4ea5-aaa1-1114981e04ad”.
The following commands create the DataObject for the science metadata, using an automatically generated identifier:
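A sketch of these commands, using the sample EML file from the complete script above. The final call to getIdentifier() is added here only to show the generated identifier.

```r
library(datapack)

# Sample EML file shipped with the dataone package.
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")

# No id argument is given, so an identifier is generated automatically.
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)

# Inspect the generated identifier.
getIdentifier(metadataObj)
```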
Now add the metadata object to the DataPackage:
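A sketch of this step follows. The setup lines repeat the earlier steps so that the snippet can be run on its own.

```r
library(datapack)

# Setup repeated from the previous steps so this snippet runs on its own.
dp <- new("DataPackage")
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)

# Add the metadata object first so that later members can reference it.
dp <- addMember(dp, metadataObj)
```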
Files are considered members of a package when they are enumerated and described by a metadata file, and a relationship between the metadata and data object is explicitly stated.
DataONE (and the dataone R package) has adopted the package guidelines detailed by the DataONE package implementation. In this specification, the relationship that links a metadata object to a science object is the CiTO (Citation Typing Ontology) documents relationship.
This relationship between the science metadata and the data objects is added to the DataPackage automatically for each data object, provided that the metadata object is added first and is then referenced as each DataObject is added.
As the metadata object has already been added, it can be referenced as each DataObject is added.
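A sketch of adding a data object while referencing the metadata object, based on the complete script above. The call to getRelationships() is added only to display the relationships that were recorded.

```r
library(datapack)

# Setup repeated from the previous steps so this snippet runs on its own.
dp <- new("DataPackage")
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
dp <- addMember(dp, metadataObj)

# Create the DataObject for the input data file.
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)

# Passing metadataObj as the third argument records that it documents sourceObj.
dp <- addMember(dp, sourceObj, metadataObj)

# List the relationships now stored in the package.
getRelationships(dp)
```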
Since metadataObj is included as the third argument here, the CiTO documents relationship will automatically be added between metadataObj and sourceObj.
Alternatively, this relationship between the metadata and science objects can be made explicitly using the insertRelationship() method:
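A minimal sketch of the explicit form, assuming that the data object was added to the package without referencing the metadata object:

```r
library(datapack)

# Setup repeated so this snippet runs on its own.
dp <- new("DataPackage")
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)

# Add both members without stating any relationship between them.
dp <- addMember(dp, metadataObj)
dp <- addMember(dp, sourceObj)

# State explicitly that the metadata object documents the data object.
dp <- insertRelationship(dp, subjectID=getIdentifier(metadataObj), objectIDs=getIdentifier(sourceObj))
```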
Note that the relationship type does not have to be specified with the predicate argument of insertRelationship() in this case, because the CiTO documents relationship is the default value for insertRelationship.
3. Create and add a DataObject for each data file
A DataObject must be created for each metadata file, data file or any other type of file that will be included in the package.
A dataone SystemMetadata R object will be created automatically and stored in each DataObject. The information from the SystemMetadata R object will be used by DataONE to maintain low-level information about the dataset, such as the access policy, the user identity of the rightsholder (the user identity that can modify access to the dataset), which Member Nodes it can be replicated to, etc.
The example below creates a DataObject for a science dataset:
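A sketch using the input file from the complete script above. The slot accesses at the end are added for illustration, to show some of the system metadata that is generated along with the DataObject.

```r
library(datapack)

# Create a DataObject for the input data file.
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)

# The SystemMetadata object is stored in the sysmeta slot of the DataObject.
sourceObj@sysmeta@formatId   # the format given when the object was created
sourceObj@sysmeta@size       # size of the file in bytes
sourceObj@sysmeta@checksum   # checksum computed from the file contents
```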
An optional user argument can be specified when creating a DataObject, which will be used to set the DataONE submitter and rightsholder of the dataset when it is uploaded. The rightsholder is granted all access privileges to the object.
If user is not specified for a DataObject, then the submitter and rightsholder for an object will automatically be set, when the object is uploaded to DataONE, to the DataONE user that created the authentication token or X.509 certificate.
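As an illustration of the user argument, the sketch below sets the submitter and rightsholder when the DataObject is created. The ORCID is the example identity used elsewhere in this document, and the variable names submitterId and userObj are chosen here only for illustration.

```r
library(datapack)

sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")

# Example identity only; substitute your own DataONE user identity.
submitterId <- "http://orcid.org/0000-0002-2192-403X"
userObj <- new("DataObject", format="text/csv", user=submitterId, filename=sourceData)

# Both values are taken from the user argument.
userObj@sysmeta@submitter
userObj@sysmeta@rightsHolder
```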
Now DataObjects for an R script and for a file created by the R script will be created:
progFile <- system.file("extdata/filterObs.R", package="dataone")
progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc")
dp <- addMember(dp, progObj, mo=metadataObj)
outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone")
outputObj <- new("DataObject", format="text/csv", filename=outputData)
dp <- addMember(dp, outputObj, mo=metadataObj)
4. Determine what access your data and metadata should have
DataONE provides a mechanism that allows data submitters to control access to their data.
The levels of access available to objects in DataONE are “read”, “write”, and “changePermission”.
The “read” permission allows a user to view the content of a DataONE object. The “write” permission allows a user to change the content of an object via update services. The “changePermission” permission allows a user to change the access policy for an object and includes both read and write permissions.
The access rules that are added to DataObjects in a DataPackage will determine the access that is granted to users accessing the package after it is uploaded to DataONE.
Each of these permissions can be granted to a single user, a group of users, or the special public user, which represents all users.
Each object in DataONE can have one or more access rules that control the access of that object. The complete set of access rules for an object is referred to as its access policy.
Access rules can be added to each DataObject individually after it has been created.
Alternatively, access rules can be specified for all package members when a package is uploaded using uploadDataPackage. This method is shown at the end of this section.
To grant read permission to all users:
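A minimal sketch using the datapack setPublicAccess method on a single DataObject. Displaying the accessPolicy slot afterwards is added for illustration.

```r
library(datapack)

sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)

# Add a rule granting the special "public" user read permission.
sourceObj <- setPublicAccess(sourceObj)

# The access policy is held as a data frame of subject and permission pairs.
sourceObj@sysmeta@accessPolicy
```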
Access rules for an individual DataONE user identity can also be added to the access policy.
Access rules are added to a DataObject using the addAccessRule method. The following access rule will grant the user with the ORCID http://orcid.org/0000-0002-2192-403X changePermission access to the dataset:
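A sketch of such a rule, built as a data frame in the same form as myAccessRules in the complete script above and then applied to one DataObject:

```r
library(datapack)

sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)

# One row per rule: the subject is the user identity, the permission is the access level.
accessRules <- data.frame(subject="http://orcid.org/0000-0002-2192-403X", permission="changePermission")

# Add the rule to the access policy of this DataObject.
sourceObj <- addAccessRule(sourceObj, accessRules)
sourceObj@sysmeta@accessPolicy
```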
DataONE user identities and user authentication are described in the section DataONE User Authentication in the vignette dataone-overview (to view this vignette, type this command in the R console: vignette("dataone-overview")).
5. Upload the DataPackage
When all DataObjects have been added to the DataPackage, call the uploadDataPackage method to upload the entire DataPackage.
As mentioned previously, as an alternative to adding access rules to each DataObject individually before adding it to the DataPackage, the access rules can be specified once when the package is uploaded to DataONE. For example, to add public access to every object in the package, and to add the custom access rule shown above, the public and accessRules arguments are used when calling uploadDataPackage:
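A sketch of the upload call follows. The Member Node identifier urn:node:mnStageUCSB2 is an assumption chosen for illustration and may not match a node available to you. The check on the dataone_test_token option is added here so that the snippet does nothing when no token is configured; it is not part of the upload itself. For brevity the package is rebuilt with only the metadata file and one data file.

```r
library(dataone)
library(datapack)

# Minimal package rebuilt here so the snippet runs on its own.
dp <- new("DataPackage")
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
dp <- addMember(dp, metadataObj)
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)
dp <- addMember(dp, sourceObj, metadataObj)

# Custom rule applied to every member of the package at upload time.
myAccessRules <- data.frame(subject="http://orcid.org/0000-0002-2192-403X", permission="changePermission")

# Upload only when a test-environment token is available in this session.
if (is.null(getOption("dataone_test_token"))) {
  message("No dataone_test_token option set; skipping the upload.")
} else {
  # Assumed node: a Member Node in the STAGING test environment.
  d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
  packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE)
  message(sprintf("Uploaded package with identifier %s", packageId))
}
```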
(Note that the example uses a DataONE test environment STAGING, and not the production environment.)
After uploadDataPackage has been called successfully, the package can be viewed on the member node and searched for using the DataONE search facility. Note that if objects in DataONE are not publicly readable, and the authenticated user performing the search isn’t granted access in an object’s access policy, then the objects will not be viewable or discoverable via the search facility for that user.
Maintaining Uploaded Datasets
After data has been uploaded to DataONE, maintenance operations can be performed on these objects using the methods described in the following sections.
Replace an object with a newer version (MNode: updateObject)
The updateObject method updates an existing object by creating a new object, identified by a new PID, on the Member Node. The new object replaces and obsoletes the old object. An obsoleted object in DataONE does not appear in search results, but it is still available for download if the identifier is known.
# Update object from previous example with a new version.
# Assumes `mn` is an MNode object for the target Member Node and `pid` is the
# identifier of an object that was previously uploaded to it.
library(dataone)
library(uuid)    # UUIDgenerate()
library(digest)  # digest()
updateid <- sprintf("urn:uuid:%s", UUIDgenerate())
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Start with the old object's sysmeta, then modify it to match
# the new object. We could have also created a sysmeta from scratch.
sysmeta <- getSystemMetadata(mn, pid)
sysmeta@identifier <- updateid
sysmeta@size <- size
sysmeta@checksum <- sha1
# The checksum above is SHA-1, so record the matching algorithm in case the
# old object's system metadata used a different one.
sysmeta@checksumAlgorithm <- "SHA-1"
sysmeta@obsoletes <- pid
# Now update the object on the member node.
response <- updateObject(mn, pid, csvfile, updateid, sysmeta)
# Get the new, updated sysmeta and check it to ensure that the update
# worked, i.e. "obsoletes" is the old pid that was replaced by the update.
updsysmeta <- getSystemMetadata(mn, updateid)
updsysmeta@obsoletes
The Member Node marks the replaced object as obsolete by setting a property in that object's system metadata. As noted above, an obsolete object no longer appears in search results but can still be downloaded if its PID is known.
Remove an object from DataONE search
An object can be removed from searches done with the DataONE search mechanism by calling the archive method with the PID of the object. This operation does not delete the object bytes, but instead updates the system metadata for the object to set the archived flag to true. The object can still be referenced with its PID and downloaded, but it will not appear in any search results.
Objects that are archived cannot be updated using the updateObject method. Once an object is archived, it cannot be un-archived.
The following statement archives the object that was just created in the previous example with the updateObject method.
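A sketch of the archive call, assuming that updateid still holds the identifier created in the updateObject example. The node identifier urn:node:mnStageUCSB2 is an assumption for illustration, and the guard is added only so that the snippet does nothing when no identifier or authentication token is present in the session.

```r
library(dataone)

# Assumptions: `updateid` comes from the updateObject example, and the node
# identifier below names a Member Node in the STAGING test environment.
if (exists("updateid") && !is.null(getOption("dataone_test_token"))) {
  cn <- CNode("STAGING")
  mn <- getMNode(cn, "urn:node:mnStageUCSB2")
  # archive() asks the Member Node to set the archived flag for this object.
  response <- archive(mn, updateid)
} else {
  message("Nothing to archive: no identifier or no authentication token in this session.")
}
```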
The following commands can be used to verify that the object was archived.
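A sketch of the check, assuming that mn (an MNode object for the Member Node) and updateid (the identifier that was archived) are still defined in the session:

```r
library(dataone)

# Assumes `mn` and `updateid` are defined by the preceding examples.
if (exists("mn") && exists("updateid")) {
  # Fetch the current system metadata and inspect the archived flag,
  # which is expected to be TRUE after a successful archive call.
  sysmeta <- getSystemMetadata(mn, updateid)
  print(sysmeta@archived)
} else {
  message("Define mn and updateid first; see the preceding examples.")
}
```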