The guiding principle behind duawranglr is to make it easier for
organizations to share data that contain protected elements and/or
personally identifiable information (PII) with researchers. There are
two key problems this package attempts to solve:
- Data owners and researchers may wish to collaborate on multiple
projects, each with a different level of data security required;
executing a unique data usage agreement (DUA) for each project can be
time-consuming and inefficient.
- Administrators tasked with approving data requests do not always
have the time or technical proficiency to closely review the code that
reads, subsets, filters, and deidentifies data files according to a
DUA.
Data usage agreements
The duawranglr package is designed with the idea that rather than
setting a new DUA for each project in an ongoing collaboration between
researchers and data partners, two things will happen instead:
- An overarching DUA will be signed that establishes a general
framework for collaboration with multiple pre-established levels of
data restriction; for each new project, these levels (e.g., I, II,
& III) are invoked and used to determine which variables may be
shared, with whom, and under what conditions according to the DUA.
- An associated crosswalk file, which can be a spreadsheet that is easy
to modify and share, will list the names of data elements that are
restricted at each level. This crosswalk is then used to clearly
transform raw restricted data files into those that can be shared
under the conditions of the DUA.
An example DUA crosswalk
An example crosswalk file (e.g., a CSV file or Excel spreadsheet)
might look like this:
| level_i | level_ii | level_iii |
|---------|----------|-----------|
| sid     | sid      | sid       |
| sname   | sname    | sname     |
| dob     | dob      |           |
| gender  |          |           |
| raceeth |          |           |
| tid     |          |           |
| tname   | tname    | tname     |
| zip     | zip      |           |
Each column represents a restriction level (level_i, level_ii, or
level_iii) along with the corresponding data element names that are
restricted at that level. In this crosswalk, like variable names have
been aligned so that they are easier to compare, but the elements can
be included in whichever way makes most sense to the data
administrator.
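To make the column-per-level layout concrete, the following is a minimal sketch of how such a crosswalk could be read into one set of restricted names per level. It is written in plain Python with only the standard library and is not duawranglr's own implementation or API. The function name read_crosswalk and the inline CSV text are assumptions made for illustration.

```python
import csv
import io

# The example crosswalk, written out as CSV text (an illustrative stand-in
# for a real file). Blank cells only keep like names aligned across columns.
CROSSWALK_CSV = """level_i,level_ii,level_iii
sid,sid,sid
sname,sname,sname
dob,dob,
gender,,
raceeth,,
tid,,
tname,tname,tname
zip,zip,
"""


def read_crosswalk(text):
    """Return {level name: set of element names restricted at that level}."""
    reader = csv.DictReader(io.StringIO(text))
    restricted = {level: set() for level in reader.fieldnames}
    for row in reader:
        for level, name in row.items():
            name = (name or "").strip()
            if name:  # skip blank alignment cells
                restricted[level].add(name)
    return restricted


crosswalk = read_crosswalk(CROSSWALK_CSV)
for level, names in crosswalk.items():
    print(level, sorted(names))
```

Because each column is collected into a set, the row order and the blank alignment cells have no effect on the result, which matches the point that elements can be listed in whichever way suits the data administrator.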
The restriction level names are arbitrary as far as the package goes,
but in conjunction with a DUA, they have meaning:
- Level I: The first level produces data sets that can be shared more
widely, but at the cost of losing access to many data elements in
the final data set.
- Level II: The second level has slightly fewer data element
restrictions, making it better for more research projects. Data
produced at this level likely come with more sharing and storage
restrictions than those produced at the first level.
- Level III: The third level has the fewest restrictions: only names
and the student's ID cannot be contained in the final data set. Data
produced at this level will have the strongest restrictions on who
can use them and how they are stored by the research team.
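As a rough illustration of how the same raw data yield different shareable data sets at each level, the sketch below checks a small table against the example crosswalk and drops whatever is restricted. It is plain Python, not the package's API. The helper names restricted_columns and drop_restricted, the records, and the score variable are all hypothetical.

```python
# Restricted element names per level, copied from the example crosswalk.
RESTRICTED = {
    "level_i": {"sid", "sname", "dob", "gender", "raceeth", "tid", "tname", "zip"},
    "level_ii": {"sid", "sname", "dob", "tname", "zip"},
    "level_iii": {"sid", "sname", "tname"},
}


def restricted_columns(columns, level):
    """Return the column names that may not appear at the given level."""
    return sorted(set(columns) & RESTRICTED[level])


def drop_restricted(rows, level):
    """Return copies of the rows without the columns restricted at the level."""
    blocked = RESTRICTED[level]
    return [{k: v for k, v in row.items() if k not in blocked} for row in rows]


# Made-up records for illustration; "score" is a hypothetical variable
# that does not appear in the crosswalk and so is never restricted.
raw = [
    {"sid": "001", "sname": "A. Student", "gender": "F", "zip": "00000", "score": 91},
    {"sid": "002", "sname": "B. Student", "gender": "M", "zip": "00001", "score": 84},
]

for level in RESTRICTED:
    shareable = drop_restricted(raw, level)
    print(level, "drops", restricted_columns(raw[0], level),
          "keeps", sorted(shareable[0]))
```

In this sketch the level with the most restricted elements keeps the fewest columns, which mirrors the trade-off described above: data that can be shared more widely retain fewer elements.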
The benefit of this level-plus-crosswalk system is two-fold:
- Data element restrictions are clearly defined for each level, which
in turn has its own clearly defined scope for data storage and
sharing. When starting a new project under the scope of the DUA,
researchers and data partners need only assign a proper level
based on the needs of the analyses.
- Because the crosswalk is a simple tabular file, data element names
can easily be added or deleted by data partners who do not
typically use data analysis software. This helps keep the process
transparent for all team members.
What duawranglr does not do
Functions in the package do not:
- Replace existing data wrangling functions
- Guarantee data security
There are many packages, such as those in the tidyverse suite, that
are already well suited to data wrangling tasks. There is no need to
replicate those functions in this package.
It should also go without saying that the package cannot force anyone
to use it: users can simply choose not to call its functions when
attempting to secure restricted data. What this package does is offer
a framework and a set of useful functions that, when followed, help
users secure data in a clear and replicable manner that allows data
administrators to more easily participate in the process.