Data preservation and curation are ongoing processes which should be planned for throughout the lifecycle of research data. Lots of hard work is put into collecting and analysing data, but with a little extra work the data can be protected for the long term, become the legacy of the research, and receive the full recognition it deserves.
Preservation actions include:
Tip: Outline your preservation plan as part of a Data Management Plan at the beginning of a project, and consider how datasets with long term value will be preserved beyond the lifetime of the project. This involves planning how to prepare and document data for sharing and archiving. Think about where the data could be deposited and whether additional resources are needed for deposit.
Benefits of preservation:
It is not realistic, or desirable, to preserve all research data for all time. Efforts to preserve data should be focused on the data which are likely to generate the most value over the long term.
Reasons for choosing what data to keep:
Decisions about preserving or disposing of data need to be made during the data management planning stage, taking into account institutional policy, funder requirements and the data repository requirements. The best time to actually select the data is well before the end of the project, or periodically if it is a longitudinal or reference data collection.
The criteria for selecting data will vary depending on discipline-specific factors; community curation is an emerging area, and there is a widespread expectation that ‘community’ ratings or comments will become established as a means for peer review of datasets for preservation. Seven general criteria from the Digital Curation Centre are listed below:
The researcher who created the data is ultimately responsible for deciding what data to preserve, because they have the expertise about the data. This process will involve a subjective judgement, as nobody knows exactly what information is going to be wanted in the future. Data selection needs to be thought through carefully, adhering to funder and University policies and documenting decisions and the reasons for them. The EPSRC state that "it may be more effective to preserve the means to recreate the data by preserving the generating code and environment, rather than preserving the data themselves".
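Where the EPSRC suggestion of preserving the generating code and environment applies, a record of the software environment can be kept alongside the code. The following is a minimal sketch, assuming the data are generated with Python; the function names and the output file name are hypothetical, and a real project may need to record more (for example operating system libraries or hardware details).

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Return a dictionary describing the interpreter, OS and installed packages."""
    packages = sorted(
        {
            (dist.metadata.get("Name") or "unknown", str(dist.version))
            for dist in metadata.distributions()
        }
    )
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": [{"name": name, "version": version} for name, version in packages],
    }


def write_environment_record(path):
    """Save the environment description as JSON, e.g. next to the generating code."""
    record = describe_environment()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    return record
```

Calling `write_environment_record("environment.json")` when the data are generated gives a plain-text record that can be deposited with the code.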
More detailed guidance is available from the Digital Curation Centre:
The most reliable way to dispose of data is physical destruction, as deleting electronic files or reformatting a hard drive will not prevent the possible recovery of data from the drive. University of Salford guidance states that confidential material should be disposed of by the following means:
For more information, please contact the Information Governance team.
It is important to outline and justify the file format data will be stored in when planning a project. Technology is changing rapidly and researchers need to plan for preventing data obsolescence and ensuring the longevity of file formats for long term readability, access and re-use.
The choice of file format can be driven by the software used, disciplinary conventions, staff expertise, standards accepted by data repositories, preferences for open formats or how the data will be analysed and sorted.
However, it is also recommended to consider the following:
Reasons for data obsolescence: data can become unreadable when the computer hardware, software, technology, services or practices it depends on are no longer used, even if they are still in working condition. Technology becomes obsolete when it is superseded by a newer or better technology, which is often more complex. Other reasons include software upgrades that fail to support legacy data, low uptake of certain formats, and software supporting certain file formats failing in the marketplace or being bought by a competitor and withdrawn.
Proprietary formats belong to a company, organisation or individual, and are at greater risk of obsolescence because their continued support depends on the owner. Examples include the formats used by Microsoft Office products, such as Word and Excel.
Open formats do not have restrictions on their use and no one claims intellectual property rights over them. They are free and open to everyone, and are more desirable for long term preservation because they are standardised and interchangeable, e.g. the formats used by Open Office products.
Examples of preferred formats and characteristics for long term preservation include:
However, open formats may not support all the functionality found within a proprietary format, or they might result in larger files because they offer less efficient compression. In some cases, it may be best to use one format for data collection and analysis, and then convert the data to another format for long term preservation. Be sure to save a copy of the most important files actively being worked on in an open format, either at the end of the project or throughout. If there is not space to store multiple formats or time to convert them, pick the most vital files and be sure to keep the version that will remain accessible for longer.
JISC Digital Media have an infokit specifically about Digital File Formats for images, audio and video. Guidance about common image formats is available from the University of Cambridge. The UK Data Archive recommend formats for long term preservation.
GIS data (and other vector systems) are highly vulnerable to partial or complete data loss over time because the software is usually proprietary, changing rapidly, and has moving parts and links between files. The following suggestions should reduce this risk:
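The advice to keep an open-format copy of important files can be checked with a short script. The following is a minimal sketch that lists files in a project folder which are in a proprietary format and have no open-format copy with the same name beside them. The extension mapping is an assumption based on the Microsoft Office and Open Office examples above, and the function name is hypothetical; adjust both to the formats your chosen repository recommends.

```python
from pathlib import Path

# Assumed mapping from proprietary extensions to acceptable open-format
# extensions. This is illustrative, not an authoritative list.
OPEN_EQUIVALENTS = {
    ".doc": {".odt", ".txt"},
    ".docx": {".odt", ".txt"},
    ".xls": {".ods", ".csv"},
    ".xlsx": {".ods", ".csv"},
}


def files_missing_open_copy(project_dir):
    """List proprietary-format files with no open-format copy beside them."""
    missing = []
    for path in sorted(Path(project_dir).rglob("*")):
        if not path.is_file():
            continue
        alternatives = OPEN_EQUIVALENTS.get(path.suffix.lower())
        if alternatives is None:
            continue  # not a format this sketch treats as proprietary
        # Look for a file with the same name but an open-format extension.
        if not any(path.with_suffix(ext).exists() for ext in alternatives):
            missing.append(path)
    return missing
```

Running `files_missing_open_copy("my_project")` before deposit gives a list of files that still need converting. The script only compares file names, so it cannot confirm that the open copy is complete or up to date.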
A data repository provides online archival storage, which is usually open access, and cares for digital materials, ensuring that they remain usable over time. When software and hardware advance, repositories migrate digital materials into new formats so that they stay readable and shareable. Data repositories provide a catalogue for discovery and access, making data more easily citable. Wherever possible, research data should be deposited permanently within a national collection.
Repositories provide support for documentation and metadata and many provide additional services such as advice and assistance with data management, formats, security, and intellectual property rights concerns.
Benefits of using a data repository include:
JISC provide more information on the value and impact of data sharing and curation.
Many funding bodies state that research outputs produced using a research grant must be deposited with a designated repository. ESRC, NERC and STFC all run their own services, while BBSRC, Cancer Research UK, the MRC and Wellcome Trust are partners in PubMed Central. The World Health Organization also runs the Global Health Observatory Data Repository.
The only Councils which do not provide a repository for published outputs are the AHRC and EPSRC. Researchers supported by these Councils are expected to use institutional or subject based repositories.
There are many discipline specific repositories available, such as the UK Data Service and Biosharing, whilst some repositories are multidisciplinary such as Zenodo and Dryad. Nature's Scientific Data journal also maintains a list of recommended data archives. Most data repositories are free to deposit and access, but some require registration and may charge for the service.
The best repository to choose for your data will be a national data centre or discipline specific repository; they have the expertise and resources to deal with particular types of data. An institutional data repository is suitable for the preservation of data where no other repository is available – a pilot data repository is available for researchers funded by the EPSRC.
It is also important to consider the following when choosing a data repository:
Understanding the factors that influence how long data should be kept is important for making the correct decisions when planning, selecting storage and depositing data.
As an activity, research involves the creation, collection and collation of a great deal of information: some relates to the management of the research project itself, some to publications and presentations reporting the outcome of the research, and some is the data upon which the research is based.
For the purposes of clarity, the following definitions are used:
Research information: All information, records and data relating to the conduct of research including research records and research data
Research records: Information relating to the management and conduct of the research project
Research data: The information collected or created in the course of research but which isn't a published research output
There is no hard and fast rule on how long research information should be retained as research can cover such a range of types of activity over a broad array of subjects.
How long research information should be retained for may depend on a number of factors such as:
In many cases the institution funding the research may require some or all research information or research data to be held. Some do not have prescribed retention periods, but all UK funding councils and other significant funding bodies require research data to be held in a safe and accessible way. Those conducting research are therefore required to consider how best to manage their research data in order to facilitate this (see table below).
The University of Salford advises that:
Research records should be held for a minimum of six years after the completion of the research.
Research data should be held for a minimum of ten years after the completion of the research. The actual retention period may be longer where the data is actively used or where retention is otherwise required as a condition of the research funding.
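The minimum periods above can be turned into an earliest disposal date. The following is a minimal sketch: the function names are hypothetical, and the optional `funder_years` parameter assumes the funder's period is counted from the same completion date, which is not true of every funder (some count from generation or last access).

```python
from datetime import date

# Minimum retention periods in years, taken from the University advice above.
MINIMUM_YEARS = {"records": 6, "data": 10}


def add_years(start, years):
    """Return the date `years` after `start`, using 28 Feb when 29 Feb does not exist."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(completed, kind, funder_years=0):
    """Return the earliest date on which the material may be considered for disposal."""
    if kind not in MINIMUM_YEARS:
        raise ValueError("kind must be 'records' or 'data'")
    # Use whichever requirement is longer: the University minimum or the funder's.
    years = max(MINIMUM_YEARS[kind], funder_years)
    return add_years(completed, years)


print(earliest_disposal_date(date(2020, 3, 31), "data"))     # 2030-03-31
print(earliest_disposal_date(date(2020, 3, 31), "records"))  # 2026-03-31
```

The result is only a lower bound. Data that are still actively used, or that fall under other legal or contractual requirements, may need to be kept for longer.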
For more information, please contact the University’s Information Governance Team who can provide more specific advice in individual cases.
|Availability and Retention Requirements of Funding Bodies|
|Organisation||Expectations of when data should be accessible to others||Required retention period|
|AHRC||Within 3 months of end of project||At least 3 years from the end of the project|
|BBSRC||Timely; no later than publication of main findings; best practice||10 years from the end of project|
|Cancer Research UK||Timely; no later than acceptance for publication of the main findings||At least 5 years from the end of the project|
|EPSRC||Metadata available within 12 months of generation; data - timely||10 years from the date of last access|
|ESRC||Within 3 months of end of project|
|MRC||Timely||10 years from generation (in original form)|
|NERC||As soon as possible after the end of data collection|
|STFC||Within 6 months of relevant publication||At least 10 years from the end of the project; permanently for data that cannot be re-measured or reproduced|
|Wellcome Trust||Timely; linked to publication||At least 10 years from the end of the project|
Retention requirements are also established by these bodies and legislation: