Making our data FAIR
- why should we care?
Margareta Hellström
ICOS Carbon Portal and LU domain specialist for SND
FAIR data seminar @ LU, 2019-02-05
This presentation by M. Hellström is distributed under a Creative Commons CC-BY license.
Today’s research: survival in the data ocean!
• Big Data tsunami
• Not enough metadata
• Fragmentation of information
• Machine inoperability
• Human intervention needed
• Increasing data & metadata losses
• Too many standards & formats
• Reproducibility problem
The fate of research data
• Big data: Massive datasets produced through large science projects,
government records, social media, large corporations, …
• Long-tail data: Large amounts of small-to-medium size datasets
from very heterogeneous sources
• Literature limit: Many data files are never published,
catalogued or even explicitly mentioned in scientific
literature. They become “dark data”.
[Figure: data size plotted against number of data sets, with regions labelled organized “big data”, long-tail data, and unpublished and dark data, and the literature limit marked]
A. R. Ferguson et al. (2014), “Big data from small data: data-sharing in the 'long tail' of neuroscience”, Nature Neuroscience, 17(11), 1442-1447.
FAIR principles to the rescue!
• stands for Findable, Accessible, Interoperable, Reusable
• not a standard, but a set of 15 guiding principles*)
• aims to free up researchers from “data wrangling”,
leaving them time to “do science”
• was coined by FORCE11 in 2014, out of discussions in the
Life Sciences community
• has become the new fashion (and Holy Grail!)
• is increasingly called for by funders & policy makers
*) see the last 4 slides for a list of the principles!
What FAIR isn’t
• FAIR is not a standard
• FAIR is not equal to ‘Open’ or ‘Free’
• Data are often Open but not FAIR
• Data could be Closed, yet perfectly FAIR
• FAIR is not equal to Linked Data, Semantic Web or RDF
• FAIR is not assuming that only humans can find and re-use data
• FAIR is not for humans only but for machines as well
• Data that are not FAIR are pretty ‘Re-useless’…
B. Mons et al. (2017): Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science
Cloud, http://doi.org/10.3233/ISU-170824 and GO-FAIR (2018): FAIR Data Stewardship Awareness Course.
The research data lifecycle
Research projects can be broken
down into steps or phases
Correspondingly, data comes in
several levels, from raw values to
finalized analysis results
FAIRness needs to be applied
where it makes sense
Diagrams from Z. Zhao (2018) and U. Schwardmann (2016)
What’s in it for me?
Making your data “FAIR enough” gives you better control of what
happens to your data by:
• helping to make your data sustainable
• ensuring your data can be found by others
• facilitating collection of relevant metadata
• guaranteeing data can be cited when used
• enabling collection of data usage statistics
• supporting the creation of data management plans
• simplifying reporting to funders & employers
• streamlining estimations of data curation & archival costs
How can my data become FAIR?
• Make a plan for the data before you start a project!
• Collect detailed descriptive information (= metadata)
throughout
• Use standards and formats common to your discipline
• Store the data in a trusted & sustainable repository or
data center
• See to it that the data get persistent identifiers (e.g. DOIs)
• Apply a suitable usage license
• Provide end users with information on “intended use”
• Make the data “as open as possible, as closed as
necessary”
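As a rough illustration of the checklist above, the sketch below reports which items are still missing from a metadata record. The field names, the example values and the helper missing_fields are assumptions made for this example; they do not follow any particular repository's schema.

```python
# Field names are hypothetical; real repositories define their own metadata schemas.
CHECKLIST_FIELDS = (
    "description",   # descriptive information (metadata)
    "format",        # standard format common to the discipline
    "repository",    # trusted, sustainable repository or data center
    "identifier",    # persistent identifier such as a DOI
    "license",       # usage license
    "intended_use",  # information on intended use
)


def missing_fields(record):
    """Return the checklist fields that are absent or blank in a metadata record."""
    missing = []
    for field in CHECKLIST_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Example record with placeholder values; license and intended use are not filled in.
example_record = {
    "description": "Hourly CO2 flux measurements from a hypothetical station",
    "format": "NetCDF",
    "repository": "Example Data Centre",
    "identifier": "doi:10.xxxx/placeholder",
    "license": "",
}

print(missing_fields(example_record))
```

A check like this only confirms that the fields are filled in. Whether the content is adequate (for example, whether the license suits the data) still needs human judgement.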
Learn more!
• “Turning FAIR into reality”, S. Jones ed. (2018), Report from the
European Commission’s high level expert group on FAIR data,
http://doi.org/10.2777/1524
• “The FAIR data principles”, FORCE11 (2014),
https://www.force11.org/group/fairgroup/fairprinciples
• “Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles
for the European Open Science Cloud”, B. Mons et al. (2017),
http://doi.org/10.3233/ISU-170824
• “A design framework and exemplar metrics for FAIRness”, M.
Wilkinson et al. (2017), bioRxiv preprint,
http://dx.doi.org/10.1101/225490
Thanks for your attention!
• Comments or suggestions?
• Angry or happy about FAIR?
• Want to discuss (environmental & earth
science) data management?
• E-mail margareta.hellstrom@nateko.lu.se,
or come see me at Geocentrum II!
Some extras…
Motivation behind the FAIR principles
• The challenge of enabling optimal use of research data and methods
is a complex one with multiple stakeholders: Researchers,
Professional data publishers, Funding agencies (private and public),
and a Data Science community
• Computational analysis to discover meaningful patterns in massive,
interlinked datasets is rapidly becoming a routine research activity.
• Providing machine-readable data as the main substrate for
Knowledge Discovery and for these eScientific processes to run
smoothly and sustainably is one of the Grand Challenges of eScience.
• The FAIR principles were formulated as guidelines & best practices for
both data producers and data consumers
FORCE11, 2014 https://www.force11.org/fairprinciples
F for Findable
• F1. (meta)data are assigned a globally unique
and persistent identifier
• F2. data are described with rich metadata
(defined by R1)
• F3. metadata clearly and explicitly include the
identifier of the data it describes
• F4. (meta)data are registered or indexed in a
searchable resource
FORCE11, 2014 https://www.force11.org/fairprinciples
A for Accessible
• A1. (meta)data are retrievable by their identifier
using a standardised communications protocol
• A1.1 the protocol is open, free, and universally
implementable
• A1.2 the protocol allows for an authentication and
authorization procedure, where necessary
• A2. metadata are accessible, even when the data
are no longer available
FORCE11, 2014 https://www.force11.org/fairprinciples
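One way to picture A1 is a DOI resolved over HTTPS, an open and freely implementable protocol. The sketch below builds such a request without sending it. The helper build_doi_request is hypothetical, and the requested media type is an assumption: which metadata formats are actually returned depends on the agency that registered the DOI.

```python
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"


def build_doi_request(doi, media_type="application/vnd.citationstyles.csl+json"):
    """Build (but do not send) an HTTPS request asking for metadata about a DOI."""
    # Keep the slash between DOI prefix and suffix; escape any other unsafe characters.
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    # The Accept header asks for machine-readable metadata instead of a landing page.
    return urllib.request.Request(url, headers={"Accept": media_type})


# DOI taken from the reference list of this presentation.
request = build_doi_request("10.3233/ISU-170824")
print(request.full_url)
print(request.get_header("Accept"))

# Sending the request would need network access, e.g.:
#     with urllib.request.urlopen(request) as response:
#         metadata = response.read()
```

Because the identifier and the protocol are both standardised, the same few lines work for any DOI, which is what makes retrieval automatable for machines as well as humans.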
I for Interoperable
• I1. (meta)data use a formal, accessible, shared,
and broadly applicable language for knowledge
representation
• I2. (meta)data use vocabularies that follow FAIR
principles
• I3. (meta)data include qualified references to
other (meta)data
FORCE11, 2014 https://www.force11.org/fairprinciples
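A minimal sketch of what I1 to I3 could look like in practice, assuming the schema.org vocabulary serialised as JSON-LD. This is only one possible choice: FAIR does not prescribe a specific vocabulary or format. All identifiers and values below are placeholders.

```python
import json

# Placeholder identifiers; a real record would use the dataset's actual DOI.
dataset_description = {
    "@context": "https://schema.org/",   # shared, published vocabulary (I1, I2)
    "@type": "Dataset",
    "@id": "https://doi.org/10.xxxx/placeholder-dataset",
    "name": "Example flux measurements",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Qualified reference (I3): the property name states how the two items relate.
    "isBasedOn": {"@id": "https://doi.org/10.xxxx/placeholder-raw-data"},
}

serialized = json.dumps(dataset_description, indent=2)
print(serialized)
```

The point of the named relation (here isBasedOn) is that a machine can tell why two records are linked, not merely that they are.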
R for Reusable (and Reproducible)
• R1. (meta)data are richly described with a plurality
of accurate and relevant attributes
• R1.1. (meta)data are released with a clear and
accessible data usage license
• R1.2. (meta)data are associated with detailed
provenance
• R1.3. (meta)data meet domain-relevant community
standards
FORCE11, 2014 https://www.force11.org/fairprinciples