Preserve

Data preservation and curation are ongoing processes which should be planned for throughout the lifecycle of research data. Lots of hard work is put into collecting and analysing data, but with a little extra work the data can be protected for the long term, become the legacy of the research, and receive the full recognition it deserves.

Preservation actions include:

[Figure: preservation actions]

Benefits of preservation:

  • Raise the impact of your research
  • Increase collaboration opportunities and stimulate novel interdisciplinary research
  • Increase citations of your data; the recognition and prestige associated with data citations are rising
  • Verify research results
  • Protect your data from becoming obsolete in the future
  • Meet funder requirements about data preservation



It is not realistic, or desirable, to preserve all research data for all time. Efforts to preserve data should be focused on the data which are likely to generate the most value over the long term.

Reasons for choosing what data to keep:

  • Discovery becomes harder when everything is kept. Searches return more results, so users need more effort to filter out what they need
  • Data preservation can be expensive. It commits you to future costs, so what to keep needs careful consideration
  • Many funders specify which data need preserving, how long for, and where to deposit the data
  • Anything you keep is subject to a Freedom of Information (FOI) request.

Decisions about preserving or disposing of data need to be made during the data management planning stage, taking into account institutional policy, funder requirements and the data repository requirements. The best time to actually select the data is well before the end of the project, or periodically if it is a longitudinal or reference data collection.

Selection criteria

The criteria for selecting data will vary depending on discipline-specific factors; community curation is an emerging area, and there is a widespread expectation that ‘community’ ratings or comments will become established as a means for peer review of datasets for preservation. Seven general criteria from the Digital Curation Centre are listed below:

  • Relevance to Mission: The data content fits the research group’s remit and any priorities stated in the University of Salford or research funder’s current strategy, including any legal requirement to retain the data beyond its immediate use
  • Scientific or Historical Value: Is the data scientifically, socially, or culturally significant? Evidence of current research and educational value can be used to assess possible future uses
  • Uniqueness: The extent to which the data is the only copy or most complete source of the information that can be derived from it, and whether it is at risk of loss if not preserved
  • Potential for Redistribution: The extent to which data are understandable and usable, and kept in formats meeting designated technical, ethical and contractual criteria
  • Non-Replicability: It would not be feasible to replicate the data or doing so would not be financially viable
  • Economic Case: estimated costs for preserving the data are justifiable when assessed against evidence of potential future benefits
  • Full Documentation: the information necessary to discover, interpret, understand and use data is comprehensive and correct

The researcher who created the data is ultimately responsible for deciding what data to preserve, because they have the expertise about the data. This process will involve a subjective judgement, as nobody knows exactly what information is going to be wanted in the future. Data selection needs to be thought through carefully, adhering to funder and University policies and documenting decisions and the reasons for them. The EPSRC state that "it may be more effective to preserve the means to recreate the data by preserving the generating code and environment, rather than preserving the data themselves".
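One way to document selection decisions and the reasons for them is a small structured record per dataset. The following is a minimal sketch, not an official University or DCC tool: the `SelectionDecision` class, its field names and the example dataset name are all hypothetical; only the seven criterion names come from the list above.

```python
from dataclasses import dataclass, field
from datetime import date

# The seven general criteria listed above (Digital Curation Centre).
CRITERIA = (
    "Relevance to Mission",
    "Scientific or Historical Value",
    "Uniqueness",
    "Potential for Redistribution",
    "Non-Replicability",
    "Economic Case",
    "Full Documentation",
)


@dataclass
class SelectionDecision:
    """Hypothetical record of one keep-or-dispose assessment for a dataset."""

    dataset: str
    decided_on: date
    # Maps a criterion name to (met?, short justification).
    assessments: dict = field(default_factory=dict)

    def assess(self, criterion: str, met: bool, reason: str) -> None:
        """Record a judgement and its justification for one criterion."""
        if criterion not in CRITERIA:
            raise ValueError(f"Unknown criterion: {criterion}")
        self.assessments[criterion] = (met, reason)

    def unassessed(self) -> list:
        """Criteria that still need a documented judgement."""
        return [c for c in CRITERIA if c not in self.assessments]

    def summary(self) -> str:
        """Plain-text summary that can be filed with the project records."""
        lines = [f"{self.dataset} (decided {self.decided_on.isoformat()})"]
        for name in CRITERIA:
            if name in self.assessments:
                met, reason = self.assessments[name]
                lines.append(f"  {name}: {'yes' if met else 'no'} - {reason}")
            else:
                lines.append(f"  {name}: not yet assessed")
        return "\n".join(lines)


# Example with a made-up dataset name.
decision = SelectionDecision("interview-transcripts", date(2015, 1, 5))
decision.assess("Uniqueness", True, "only complete copy of the interviews")
print(decision.summary())
```

The record deliberately stores a justification alongside each yes/no judgement, since the judgement is subjective and the reasoning is what a later reviewer will need.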

Data selected for preservation should be deposited in an appropriate repository. Data which have fulfilled their purpose should be selected for deletion and must be disposed of securely.

More detailed guidance is available from the Digital Curation Centre.

The most reliable way to dispose of data is physical destruction, because deleting electronic files or reformatting a hard drive does not prevent the data from being recovered from the drive. University of Salford guidance states that confidential material should be disposed of by the following means:

  • Paper should be shredded and then disposed of in waste sacks and recycled with other waste paper. Paper shredders should be available in your area in order to shred sensitive documentation, but there is no need to use them for everything
  • Confidential waste can be bagged up for removal and destruction by a commercial secure waste disposal company. If your Unit does not have arrangements, special waste sacks for confidential waste can be ordered from Estates and Facilities Management, who will send you the sacks and then take them away when they are full. The Cleaning helpline is 53091
  • Electronic media such as disks and tapes should be reformatted and then cut up. Diskettes should be pulled apart and the real 'floppy' disk cut up with scissors.
  • In order to ensure the removal of confidential information from hard drives, they should be formatted at least three times and then completely overwritten to ensure they are effectively scrambled and remain inaccessible (re-formatted disks still contain information and can be read unless they have been formatted to government standards)
  • Flash-based solid state storage, such as memory sticks, is constructed differently to hard drives, and the techniques for securely erasing files mentioned above cannot be relied upon. Physical destruction is advised as the only certain way to erase files
  • Hardware and software must be disposed of by secure methods through the ITS Service Desk and not thrown on the local tip. ITS either ensures controlled physical destruction of the relevant parts or provides hardware to a third party organisation which first completely wipes all data using a strong magnetic current (guaranteed under legal contract) and then passes on the hardware to organisations such as schools
  • Highly Confidential files should be deleted using Entrust True Delete or equivalent software.

For more information please contact the Information Governance team.

When planning a project, it is important to outline and justify the file formats in which data will be stored. Technology is changing rapidly, and researchers need to plan to prevent data obsolescence and to ensure that file formats remain readable, accessible and re-usable in the long term.

The choice of file format can be driven by the software used, disciplinary conventions, staff expertise, standards accepted by data repositories, preferences for open formats or how the data will be analysed and sorted.

However, it is also recommended to consider the following:

  • Formats for sharing data with colleagues in future projects
  • Formats with reduced risk of obsolescence
  • Formats for easy annotation with metadata

Proprietary versus Open formats

Proprietary formats belong to a company, organisation or individual, and are at greater risk of obsolescence because the owner may change or withdraw support for them. Examples include the formats used by Microsoft Office products, such as Word and Excel.

Open formats do not have restrictions on their use and no one claims intellectual property rights. They are free and open to everyone, and are more desirable for long term preservation because they are standardised and interchangeable, e.g. the formats used by Open Office products.

Examples of preferred formats and characteristics for long term preservation include:

  • PDF/A rather than Microsoft Word
  • CSV or ASCII rather than Excel
  • TIFF or JPEG2000 rather than GIF or JPG
  • MPEG-4 rather than QuickTime
  • XML or RDF rather than RDBMS
  • Common usage by the research community
  • Standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed

However, open formats may not support all the functionality found within a proprietary format, or they might result in larger files because they offer less efficient compression. In some cases, it may be best to use one format for data collection and analysis, and to convert the data to another format for long term preservation. Be sure to save a copy of the most important working files in an open format, either at the end of the project or throughout. If there is not enough space to store multiple formats, or not enough time to convert everything, pick the most vital files and be sure to keep the version in the format that will remain accessible for longer.

JISC Digital Media have an infokit specifically about Digital File Formats for images, audio and video. Guidance about common image formats is available from the University of Cambridge. The UK Data Archive recommend formats for long term preservation.
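As a rough illustration of how the preferred formats listed above could be applied when reviewing a project folder, the sketch below flags files whose extension suggests a less-preferred format and names the alternative from that list. The extension mapping and the function names are assumptions made for this example, not an official list; the script only reports suggestions and does not convert anything.

```python
from pathlib import Path

# Illustrative mapping from file extensions to the preferred alternative
# named in the list above. The choice of extensions is an assumption.
PREFERRED_ALTERNATIVE = {
    ".doc": "PDF/A",
    ".docx": "PDF/A",
    ".xls": "CSV or ASCII",
    ".xlsx": "CSV or ASCII",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
    ".mov": "MPEG-4",
}


def suggest_preservation_format(filename):
    """Return the suggested preservation format, or None if there is no suggestion."""
    return PREFERRED_ALTERNATIVE.get(Path(filename).suffix.lower())


def review(filenames):
    """Map each flagged filename to its suggested preservation format."""
    flagged = {}
    for name in filenames:
        suggestion = suggest_preservation_format(name)
        if suggestion is not None:
            flagged[name] = suggestion
    return flagged


# Example with made-up file names.
for name, suggestion in review(["results.xlsx", "report.docx", "readings.csv"]).items():
    print(f"{name}: consider keeping a copy as {suggestion}")
```

Files with no entry in the mapping are simply left out of the report, which does not mean their format is suitable; disciplinary conventions and repository requirements still need checking by hand.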

A data repository provides online archival storage, which is usually open access, and cares for digital materials, ensuring that they remain usable over time. When software and hardware advance, repositories migrate digital materials into new formats so that they stay readable and shareable. Data repositories provide a catalogue for discovery and access, making data more easily citable. Wherever possible, research data should be deposited permanently within a national collection.

Repositories provide support for documentation and metadata and many provide additional services such as advice and assistance with data management, formats, security, and intellectual property rights concerns.

Benefits of using a data repository include:

  • increase the efficiency of research, teaching and studying
  • increase returns on investment in the creation and collection of the hosted data, by facilitating additional use
  • ensure long term access to the data
  • focus dissemination of data on the academic community

JISC provide more information on the value and impact of data sharing and curation.

Many funding bodies state that research outputs produced using a research grant must be deposited with a designated repository. ESRC, NERC and STFC all run their own services, while BBSRC, Cancer Research UK, the MRC and Wellcome Trust are partners in PubMed Central. The World Health Organization also runs the Global Health Observatory Data Repository.

The only Councils which do not provide a repository for published outputs are the AHRC and EPSRC. Researchers supported by these Councils are expected to use institutional or subject based repositories.

Examples of data repositories

An international list of research data repositories is available from Re3data, and OpenDOAR (Directory of Open Access Repositories) maintains an online list of open access digital repositories.

There are many discipline specific repositories available, such as the UK Data Service and Biosharing, whilst some repositories are multidisciplinary such as Zenodo and Dryad. Nature's Scientific Data journal also maintains a list of recommended data archives. Most data repositories are free to deposit and access, but some require registration and may charge for the service.

Examples of specialist research data centres

The best repository to choose for your data will be a national data centre or discipline specific repository; they have the expertise and resources to deal with particular types of data. An institutional data repository is suitable for the preservation of data where no other repository is available – a pilot data repository is available for researchers funded by the EPSRC.

It is also important to consider the following when choosing a data repository:

  • Will the repository issue the data with a persistent identifier such as a Digital Object Identifier (DOI) that can be included in a data access statement?
  • Are access restrictions or embargoes permitted?
  • Will the repository ensure that confidential or personal data are secured?
  • Do the repository's terms and conditions fit with the University's Intellectual Property Policy (under development at Salford)? For example, does the archive require that you assign any copyright in the data to the archive? We recommend avoiding using archives that require transfer of rights.
  • What licenses are available?

More guidance on where to keep research data is available from the DCC. MIT Libraries have also created a Data Repository Comparison Tool, to help compare features of different repositories.

Understanding the factors that influence how long data should be kept is important for making the correct decisions when planning, selecting storage and depositing data.

Research involves creating, collecting and collating a great deal of information: some relates to the management of the research project itself, some to the publications and presentations that report the outcome of the research, and some is the data on which the research is based.

For the purposes of clarity, the following definitions are used:

Research information: All information, records and data relating to the conduct of research including research records and research data

Research records: Information relating to the management and conduct of the research project

Research data: The information collected or created in the course of research but which isn't a published research output

There is no hard and fast rule on how long research information should be retained as research can cover such a range of types of activity over a broad array of subjects.

How long research information should be retained for may depend on a number of factors such as:

  • Impact of the research
  • Academic reputation
  • Derived and linked publications
  • Statutory/legal obligations
  • On-going or further research
  • Validation/testing

Funder Requirements

In many cases the institution funding the research may require some or all research information or research data to be held. Some do not have prescribed retention periods but all UK funding councils and other significant funding bodies require research data to be held in a safe and an accessible way. Those conducting research are therefore required to consider how best to manage their research data in order to facilitate this (see table below).

The University of Salford advises that:

Research records should be held for a minimum of six years after the completion of the research

Research data should be held for a minimum of ten years after the completion of the research. The actual retention period may be longer where the data are actively used or where longer retention is required as a condition of the research funding.
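The minimum periods above can be turned into an earliest review date with simple date arithmetic. This is a minimal sketch under stated assumptions: the function names and the `funder_years` parameter are hypothetical, and the calculation assumes any funder period is counted from the completion of the research, which is not true of every funder (some count from generation or from the date of last access).

```python
from datetime import date

# University of Salford minimum retention periods, in years,
# counted from the completion of the research.
MIN_RETENTION_YEARS = {"records": 6, "data": 10}


def add_years(start, years):
    """Return the same calendar day a number of years later."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(completed, kind, funder_years=0):
    """Earliest date on which disposal could be considered.

    kind is "records" or "data"; funder_years is an optional longer
    period required by the funder (assumed to run from completion).
    """
    years = max(MIN_RETENTION_YEARS[kind], funder_years)
    return add_years(completed, years)


# Example: research completed on 31 March 2020.
completed = date(2020, 3, 31)
print(earliest_disposal_date(completed, "records"))  # 2026-03-31
print(earliest_disposal_date(completed, "data"))     # 2030-03-31
```

The result is only a lower bound: data that are still actively used, or that are subject to other legal or contractual requirements, may need to be kept for longer.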

For more information, please contact the University’s Information Governance Team who can provide more specific advice in individual cases.

Availability and Retention Requirements of Funding Bodies

Organisation | Expectations of when data should be accessible to others | Required retention period
AHRC | Within 3 months of end of project | At least 3 years from the end of the project
BBSRC | Timely; no later than publication of main findings; best practice | 10 years from the end of project
Cancer Research UK | Timely; no later than acceptance for publication of the main findings | At least 5 years from the end of the project
EPSRC | Metadata available within 12 months of generation; data: timely | 10 years from the date of last access
ESRC | Within 3 months of end of project | (not stated)
MRC | Timely | 10 years from generation (in original form)
NERC | As soon as possible after the end of data collection | (not stated)
STFC | Within 6 months of relevant publication | At least 10 years from the end of the project; permanently for data that cannot be re-measured or reproduced
Wellcome Trust | Timely; linked to publication | At least 10 years from the end of the project

Retention requirements are also established by these bodies and legislation:

  • The Data Protection Act states that personal data contained in a research dataset should not be kept for longer than is necessary
  • The Medicines for Human Use (Clinical Trials) Amendment Regulations 2006 state that essential documents related to clinical trials must be stored for 5 years after the trial's end
  • A data provider may require the research to comply with a Data Transfer Agreement that states specific requirements for the storage and management of their data.