Create and organise

Data files should be organised and named in a consistent and practical system, to make it easier to find and keep track of data.

Benefits

  • Save time and frustration by finding files quickly
  • Reduce errors from confusion between file versions
  • Understand the data you hold and identify any gaps
  • Identify duplicated data files
  • Verification – evidence of logical processes and methods

File names should help classify files, uniquely identify files and provide information about the content and status of the file.

Best practice when choosing a file name:

  • Keep names short
  • Use at least two digits for numbers, e.g. 01, 02, 03 rather than 1, 2, 3, unless the number is a year or already has more than two digits
  • Consider how file names will sort in directory listings, e.g. put common elements at the start of the name so that related files are grouped together
  • Unless the system you are using automatically maintains version histories, add version numbering to file names to indicate revisions or edits; use discrete or continuous numbering depending on whether revisions are major or minor
  • Decide on a file naming convention at the start of your project – this involves decisions about punctuation, date formats, number of digits and the order of each element
  • Use controlled vocabularies within your research discipline to enable effective searching and retrieval of files by different people
  • Name files suitably as soon as you create them
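As an illustration of the points about two-digit numbers, date formats and sort order, here is a minimal Python sketch. The function name `build_file_name` and the PREFIX_NN_YYYY-MM-DD pattern are assumptions made for this example, not a prescribed convention; adapt them to whatever your project agrees.

```python
from datetime import date


def build_file_name(prefix: str, number: int, when: date, ext: str) -> str:
    """Assemble a name as PREFIX_NN_YYYY-MM-DD.ext (a hypothetical convention)."""
    # Zero-pad to two digits so that 02 sorts before 10 in a directory listing.
    # Year-first dates also sort chronologically when compared as plain text.
    return f"{prefix}_{number:02d}_{when.isoformat()}.{ext}"


# Names built this way sort in numeric order even though they are text.
names = [build_file_name("Int", n, date(2008, 6, 5), "txt") for n in (10, 2, 1)]
print(sorted(names))
```

Without the zero-padding, a plain text sort would place Int_10 before Int_2.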

Avoid:

  • names that are meaningless, or that only mean something to you
  • names relating to individuals
  • unnecessary repetition and redundant words
  • spaces; use underscores or hyphens instead
  • non-alphanumeric characters
  • common words, e.g. "draft", at the start of the file name
  • characters that perform specific functions in some operating systems, such as the reserved characters / \ " * :
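Some of the points in this list can be checked automatically. The sketch below is one possible approach, not an official tool: the function `name_problems` and the `COMMON_PREFIXES` word list are hypothetical, and checks such as "meaningless names" still need human judgement.

```python
# Characters named in the guidance as reserved on some operating systems.
RESERVED = set('/\\"*:')
# Hypothetical list of common words to keep away from the start of a name.
COMMON_PREFIXES = ("draft", "final", "copy")


def name_problems(file_name: str) -> list[str]:
    """Return the guideline problems found in file_name (empty list if none)."""
    # Ignore the extension: the dot before it is expected.
    stem = file_name.rsplit(".", 1)[0]
    problems = []
    if " " in stem:
        problems.append("contains spaces")
    if any(ch in RESERVED for ch in stem):
        problems.append("contains reserved characters")

    def allowed(ch: str) -> bool:
        # Spaces and reserved characters are reported separately above.
        return (ch.isascii() and ch.isalnum()) or ch in "_- " or ch in RESERVED

    if any(not allowed(ch) for ch in stem):
        problems.append("contains other non-alphanumeric characters")
    if stem.lower().startswith(COMMON_PREFIXES):
        problems.append("starts with a common word")
    return problems


print(name_problems("FG1_CONS_2010-02-12"))   # no problems reported
print(name_problems("draft report: v2.doc"))  # several problems reported
```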

Examples from the UK Data Archive:

  • FG1_CONS_2010-02-12 is the file that contains the transcript of the first focus group with consumers, that took place on 12 February 2010
  • Int024_AP_2008-06-05 is an interview with participant 024, interviewed by Anne Parsons on 5 June 2008
  • BDHSurveyProcedures_00_04.pdf is version 4 of the survey procedures for the British Dental Health Survey

Examples of filenames for scientific data are available from the Centre for Environmental Data Archival.

More detailed advice about file naming is available from JISC.

It is important to take the time to plan how to structure files in folders to enable quick location, especially when working in collaboration with others.

Best practice for structuring files and folders:

  • Group files within folders so information on a particular topic is located in one place. For example, experimental work could be stored in folders organised by the date of the experiment, or by a key experimental condition
  • Apply logical structuring of files within folders relating to projects or issues
  • Don’t leave files unsorted, hanging under top level folders
  • Separate current and completed work or versions, e.g. where a document has many versions and multiple contributors, consider a "Current Version" folder
  • Structure folders hierarchically by starting with a limited number of high level folders for broader topics, and create more specific folders within these. It helps to restrict the level of folders to three or four deep and not to have more than ten items in each list
  • Use existing conventions and procedures, from your project team or Research Centre, to structure folders.
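The advice on depth and folder size can be checked with a short script. The following Python sketch assumes limits of four levels and ten items, taken from the guidance above; the function name `check_structure` and the folder names in the demonstration are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def check_structure(root: Path, max_depth: int = 4, max_items: int = 10) -> list[str]:
    """Report folders under root that are too deep or hold too many items."""
    warnings = []
    for folder, subfolders, files in os.walk(root):
        rel = Path(folder).relative_to(root)
        depth = len(rel.parts)  # 0 for the root folder itself
        items = len(subfolders) + len(files)
        if depth > max_depth:
            warnings.append(f"{rel.as_posix()}: {depth} levels deep")
        if items > max_items:
            warnings.append(f"{rel.as_posix()}: {items} items")
    return warnings


# Demonstration on a throwaway folder tree (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data" / "raw" / "2015" / "june" / "week1").mkdir(parents=True)
    for i in range(10):
        (root / "data" / f"file_{i:02d}.csv").touch()
    report = check_structure(root)

print(report)
```

Here the "data" folder is flagged for holding eleven items (ten files plus one subfolder), and "week1" is flagged for sitting five levels below the root.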

An example folder structure is available from the UK Data Archive; it separates data files from documentation files, and then organises each by type and research activity.


The London School of Hygiene and Tropical Medicine also has a sample organisation structure for longitudinal data.

Version control is the management of different versions and drafts of a document, file, record or dataset. It provides an audit trail for the revision and update of draft and final versions. Version control is important when working on collaborative documents with a number of contributors, and for knowing which version of a file is currently in use or in force.

Some systems automatically maintain version histories; but if not, the following best practice should be used:

  • Use appropriate labels to differentiate between statuses:
    • ‘d’ for draft revisions that are still in development e.g. d1, d2, d3
    • ‘v’ for versions that are intended to be seen by others e.g. v1, v2, v3
    • Agree who will finalise documents and mark them as 'final'
  • Use a 'revision' numbering system:
    • Draft versions should be numbered v0-1, v0-2 etc. until the document becomes the final approved version, v1-0
    • Final approved versions and major changes should be indicated by whole numbers e.g. v1-0 would be the first major version, v2-0 the second major version
    • Minor changes can be indicated by increasing the figure after the dash, for example v1-1 indicates that a minor change has been made to the first version, and v3-1 that a minor change has been made to the third version
  • Although full stops are usually used in numbering, dashes should be used for electronic filenames e.g. v1-0
  • Apply version numbers consistently e.g. v1-0, v2-0, v2-1, v3-0, rather than v1, v2, v2.1, v3
  • Include the version as part of both the file name, and within the document itself. In the header or footer of a document identify the author, filename, page number and date the document was created/revised
  • If you store the same data in different file formats, ensure that the filename and version are the same e.g. ‘SmithB-transcript-v10.doc’, ‘SmithB-transcript-v10.rtf’, ‘SmithB-transcript-v10.pdf’
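The 'revision' numbering system above is simple enough to automate. This Python sketch shows one way to read and increment labels such as v1-0; the helper names `parse_version` and `bump` are assumptions for illustration.

```python
import re

# A label is 'v', a major number, a dash, then a minor number, e.g. v1-0.
VERSION = re.compile(r"v(\d+)-(\d+)")


def parse_version(label: str) -> tuple[int, int]:
    """Split a label such as 'v1-0' into (major, minor) integers."""
    match = VERSION.fullmatch(label)
    if match is None:
        raise ValueError(f"not a vN-N version label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def bump(label: str, major: bool = False) -> str:
    """Return the next label; a major change resets the minor figure to 0."""
    maj, minor = parse_version(label)
    return f"v{maj + 1}-0" if major else f"v{maj}-{minor + 1}"


print(bump("v0-2"))              # next draft: v0-3
print(bump("v0-2", major=True))  # approved as first major version: v1-0
print(bump("v1-0"))              # minor amendment: v1-1

# Sorting on the parsed numbers keeps v0-10 after v0-2.
labels = ["v2-1", "v1-0", "v0-2", "v3-0", "v0-10"]
print(sorted(labels, key=parse_version))
```

Sorting on the parsed numbers rather than the raw text matters once a minor figure reaches two digits.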

Examples of file versions from the UK Data Archive:

  • date recorded in the file name or embedded within the file: HealthTest_06-04-2008
  • version numbering in the file name: BGHSurveyProcedures_v1-3
  • version description in the file name or embedded within the file (draft, final):
    FoodInterview_1_draft
    FoodInterview_1_final


Version | Date | Description of Change | Name
0-1 | 17/06/2015 | First draft sent to project team | Miss A. Researcher
0-2 | 22/06/2015 | Updates made from project team feedback, including changes to the method, because an alternative method was discovered since the first draft and is better suited to the project | Miss A. Researcher
1-0 | 29/06/2015 | Final version – approved by Research Committee | Miss A. Researcher
1-1 | 16/07/2015 | Minor amendment to section 3 | Miss A. Researcher

Version control can also be maintained through:

  • version control facilities within the software used, e.g. Microsoft Word
  • using versioning software, e.g. in SharePoint or GitHub
  • using file sharing services such as Syncplicity
  • manual merging of entries or edits by multiple users

To demonstrate the authenticity of data and prevent unauthorised changes to it, follow this best practice from the UK Data Archive:

  • keep a single master file of data
  • assign responsibility for master files to a single project team member
  • master versions of data files should be given read-only status to general readership
  • record all changes to master files
  • maintain old master files in case later ones contain errors
  • archive copies of master files at regular intervals
  • develop a formal procedure for the destruction of master files
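Two of these points, read-only status and recording changes, can be supported with a few lines of Python. This is a minimal sketch under stated assumptions: the file name `master_v1-0.csv` and the helper names are hypothetical, and clearing write permissions only guards against accidental edits by ordinary users, not against deliberate changes by someone with sufficient rights.

```python
import hashlib
import stat
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path: Path) -> None:
    """Clear the write permission bits so the file is not edited by accident."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


# Demonstration on a temporary file standing in for a master data file.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master_v1-0.csv"
    master.write_bytes(b"abc")
    checksum = sha256_of(master)  # record this alongside the change log
    make_read_only(master)
    still_writable = bool(master.stat().st_mode & stat.S_IWUSR)

print(checksum)
print(still_writable)
```

Recomputing the checksum later and comparing it with the recorded value shows whether the master file has changed since it was logged.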

Data documentation encompasses all the information necessary to discover, interpret, understand and use data. This is important for collaborators, original researchers returning to data, or new users of data. Good documentation is vital for successful data preservation as data can quickly become unusable if key details of the context have been forgotten.

Data documentation should include detailed data description and annotation:

Study level description

  • Research aims, objectives, questions
  • Why and how the data were created, prepared or digitised (e.g. data collection methodologies, analytical information, classification schemes used, details of how a sample was chosen, assumptions)
  • Instruments, measures and secondary data sources used
  • Data validation and quality checking procedures
  • Data ownership, confidentiality, access and use conditions

Data level description

  • Data context (e.g. names and definitions, units of measurements, geographic location and time period)
  • Data content and structure (e.g. data format and volume)
  • Data alterations or coding (e.g. algorithm or command file used, plus reasons for missing values)
  • Weighting and grossing variables
  • Data list describing cases, individuals or items studied, for example for logging qualitative interviews
  • Audit trail of activities performed when capturing, processing, and analysing contained content
  • Relationships between individual files or entities (e.g. X is later superseded by Y)

All of this extra information is collectively known as metadata.

It is important to consider any third party requirements before describing the data, as there are many standards.

Creating data documentation from the start of a project makes it easier to manage and understand data later in the research lifecycle. Therefore, include procedures for documentation in your data planning.

How to capture data documentation

  • Embedding documentation within data or documents:
    • statistical e.g. SPSS
      variable descriptions and attributes (codes, data type, missing values) of each variable in the data file can be documented in 'Variable View' or via syntax, whereby embedded data documentation is then contained in the SPSS command file
    • databases e.g. Microsoft Access
      variable descriptions and attributes can be documented in 'Design View' and relationships between tables and files can be created
    • GIS e.g. ArcGIS
      shapefiles (layers) and tables can be organised in a geo-database with rich metadata created in ArcCatalog
    • spreadsheets e.g. Microsoft Excel
      an additional worksheet within the data file can contain data-related documentation
    • text files e.g. Microsoft Word
      add documentation to header or footer
  • Supporting documentation accompanying data - in separate ‘read me’ files, final reports for funders before depositing data, supplementary materials underpinning published articles
  • Catalogue metadata – a subset of core data documentation providing standardised structured information, usually associated with the data required when depositing in a repository. Metadata are typically used for discovery, providing searchable information that helps users to find existing data, as a bibliographic record for citation, or for online data browsing.
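Where the software does not provide a place for embedded documentation, a simple data dictionary stored alongside the data is one option for supporting documentation. The Python sketch below writes variable descriptions to CSV text; the column names and the two example variables are hypothetical and should be replaced by whatever your discipline or repository expects.

```python
import csv
import io

FIELDS = ["name", "description", "unit", "missing"]

# Hypothetical variables, for illustration only.
variables = [
    {"name": "age", "description": "Age at interview", "unit": "years", "missing": "-9"},
    {"name": "height", "description": "Standing height", "unit": "cm", "missing": "-9"},
]


def data_dictionary_csv(rows: list[dict]) -> str:
    """Render variable descriptions as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(data_dictionary_csv(variables))
```

The resulting text can be saved as a separate file next to the data file, following the same naming and versioning convention as the data it describes.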

Examples:

Detailed guidance is available from the UK Data Service for both tabular data and qualitative data, and guidelines are available about documenting qualitative data using NVivo 9.

Further advice on documentation for describing images is available from JISC Digital Media

Not all research data is digital. Hand-drawn sketches, hand-written laboratory notebooks, journals and other physical materials are at particular risk of loss. Digitisation can help organise and protect non-digital data:

  • Anything stored on paper can be scanned fairly easily: find out how to scan directly to your University storage using any of the managed printers on campus
  • Take a digital photo, but check the quality of the image to make sure you can use it if you lose the original
  • Audio recordings can be turned into digital sound files, or transcribed if only the words are needed. This can be done individually or by employing a professional transcription service

If the data or artefact absolutely cannot be digitised, consider other options for protection, such as a fireproof safe.

Best practice and standards for digitising analogue media are available from JISC Digital Media.