Document and describe your data

Documentation and metadata

Digital data by definition are machine-readable, but understanding their meaning is a job for human beings.

Documenting your research involves attaching descriptive and administrative information (metadata) to your research data and records.

  • This documentation gives context and meaning to the data and makes it easier for you (and others) to understand now, and into the future.
  • Documenting your data during the collection and analysis phases of your research is very important. It records your thinking, your methodology and the research processes you undertook, and how these may be reviewed during the research process. This is especially important if your research is going to be part of the scholarly record. It is also critical for the potential re-use of your data, should this be an intended output.

Research data need to be documented at various levels.

  • Project level: What the research sets out to do; the new knowledge it contributes to the field; the research questions or hypotheses tested; the methodologies, sampling frames, instruments and measures used; the phenomena it explores; and so on.
  • File or database level: How all the files (or tables in a database) that make up the dataset relate to each other; what format(s) they are in; whether they supersede or are superseded by previous files. A readme.txt file is the classic way of accounting for all the files and folders in a project. If there are also non-digital objects in your dataset, what are the relationships between these objects and the digital files?
  • Variable or item level: The key to understanding research results is knowing exactly how an object of analysis came about. This means recording not just, for example, a variable name at the top of a spreadsheet table, but the full label explaining the meaning of that variable.
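
Variable-level documentation can be kept as a small data dictionary (codebook) alongside the data. The sketch below is illustrative only: the variable names, labels, units and sample rows are hypothetical, and the check simply reports any spreadsheet column that has no entry in the dictionary.

```python
import csv
import io

# Hypothetical data dictionary: each short variable name is paired with
# a full label and a unit, so the meaning is not left in someone's head.
DATA_DICTIONARY = {
    "site_id": {"label": "Identifier of the field site", "unit": None},
    "svl_mm": {"label": "Snout-vent length of the animal", "unit": "millimetres"},
    "temp_c": {"label": "Air temperature at time of capture", "unit": "degrees Celsius"},
}


def undocumented_columns(csv_text, dictionary):
    """Return the column names in the CSV header that have no dictionary entry."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [name for name in header if name not in dictionary]


# Hypothetical sample data with one column ("notes") that nobody documented.
sample = "site_id,svl_mm,temp_c,notes\nA1,42.5,18.2,juvenile\n"
missing = undocumented_columns(sample, DATA_DICTIONARY)
print(missing)  # ['notes']
```

Running a check like this whenever a file changes is one way to stop documentation drifting behind the data; the same dictionary could equally be kept as a plain table in a readme.txt file.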

Importance of data documentation

You have an intimate understanding of your dataset while you are collecting and analysing it, and the temptation is to rely on the information in your head rather than writing it down. Documentation is often the task that gets put off (to tomorrow, next week, next month) when you have multiple competing demands on your time. The chances that you will still remember exact details after a few months, a year, or more are slim. Regularly creating and updating documentation and metadata can play a vital role in the longevity of your research data. Remember that the person most likely to want to re-use your data is you!

Help others:

Other people might have many reasons to examine or use your data:

  • To understand your findings
  • To verify your findings
  • To review your submitted publication
  • To replicate your results
  • To design a similar study
  • To preserve your data for access and re-use by others in the future.

Metadata

An important component of research record keeping and documentation is the management of different types of metadata associated with research data and records.

Metadata is structured information associated with an object for purposes of discovery, description, use, management and preservation (NISO, 2007). Put simply, metadata is data about data, or information about information. Metadata adds value to documents or images. For scientific data, metadata is very important because it provides the context needed to make sense of what would otherwise be a collection of numbers.

Collection level metadata is used to describe an aggregation of objects. For example, a photo album (or folder) that contains a group of photographs might use descriptors that provide a connection or give context to the photos within. Collection metadata may contain information relating to the size of the collection, who took the photographs (there may be more than one person), the time period over which the photographs were taken, and so on. Collection level metadata can assist with the discovery of a large aggregation of objects.

There are many labels used for metadata but essentially there are three main types of metadata based on core functions that the information relates to:

Descriptive metadata: Describes a resource in sufficient detail to uniquely identify it and enable its retrieval/discovery. It ensures that an object or group of objects can be distinguished from one another and will maintain meaning over time. Examples include author, title, and project.

Structural metadata: Provides information about relationships within and among objects in a resource. It helps users navigate complex objects while also understanding how objects relate to each other and other entities. Examples include how pages are ordered in chapters, how images are related to text or other data, or whether some objects are contained within others.

Administrative metadata: Provides information relating to the provenance and management of the resource including when and how it was created, file formats, technical details, and access rights. This information helps data managers to keep track of objects in a resource. Administrative metadata includes:

  • Rights management - information which manages legal issues such as intellectual property rights, privacy and confidentiality;
  • Preservation metadata - which includes the information required to archive and preserve the resource.
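
To make the three types concrete, the following sketch groups a single metadata record into descriptive, structural and administrative sections. It is an assumption-laden illustration rather than a recognised schema: every field name and value is hypothetical, and a real project should follow the schema used in its discipline.

```python
import json

# Hypothetical metadata record for one dataset, grouped by function.
record = {
    "descriptive": {
        "title": "Example frog call recordings",
        "creator": "A. Researcher",
        "project": "Example wildfire ecology project",
    },
    "structural": {
        # How the files in the dataset relate to each other.
        "files": ["calls.csv", "sites.csv"],
        "relations": {"calls.csv": "each row refers to a site in sites.csv"},
    },
    "administrative": {
        "created": "2017-02-03",
        "file_format": "text/csv",
        "rights": {"licence": "CC BY 4.0", "contains_personal_data": False},
        "preservation": {"checksum_algorithm": "SHA-256"},
    },
}


def missing_sections(rec):
    """Return the names of any of the three sections that are absent or empty."""
    required = ("descriptive", "structural", "administrative")
    return [name for name in required if not rec.get(name)]


print(json.dumps(record, indent=2))
print(missing_sections(record))  # []
```

Keeping the record in a structured form such as JSON means it stays machine-readable while still being legible to a person opening the file.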

Metadata schemas

You and your supervisor are best placed to find out which standard metadata schemas are used in your discipline or community. Ask your supervisor or colleagues which metadata standards they use and why they chose them.

By choosing a well-supported metadata schema or standard, you will maximise the chance that your data can be re-used and understood by other researchers now, and into the future. If you don't know where to start, the Research Data Alliance have created a directory of metadata schemas that may get you started.

Below are links to collections of a number of well-known metadata schemas from a range of research areas put together by the Digital Curation Centre (DCC).

Follow the most appropriate link to find the schema most relevant to your area of research as well as additional tools, services and programs related to the schema. You may be surprised by how much work has already been done in your discipline. Even if a schema doesn't cover everything you want, you may be able to use a part of a schema, or a combination of schemas to describe your data.

For more examples of metadata schemas or further information about metadata schemas refer to the Digital Curation Centre (DCC) metadata registry.

Documentation

Projects often have many relationships that need to be documented if they are to be understood into the future. Projects have a natural hierarchy, and documenting information about your data at the correct level in that hierarchy will help keep your data and its documentation organised.

Some of the things you might need to document include:

  • Laboratory notebooks & experimental protocols
  • Questionnaires, codebooks, data dictionaries
  • Visual diaries, notebooks, scrapbooks
  • Software syntax and output files
  • Information about equipment settings & instrument calibration
  • Database schema
  • Methodology reports
  • Provenance information about objects or about sources of derived or digitised data

Research notebooks

Researchers can benefit by keeping a systematic record of their research. For many disciplines this may take the form of a diary or notebook to record ideas, articles and references. Laboratory-based researchers usually complete laboratory notebooks as crucial components of data management. Look to the local practices in your research group or lab for guidance.

Lab notebooks
Lab notebooks can play an important role in supporting claims relating to intellectual property developed by University researchers, and even in defending against allegations of scientific fraud. They can also be particularly important in the patent registration process. In less extreme circumstances, they demonstrate adherence to standards of good practice, academic and ethical integrity, and compliance with contractual provisions permitting sponsors to audit work carried out in pursuit of sponsored research. Thorough and effective management of laboratory data, and routine documentation of all lab procedures, is therefore an important responsibility for all laboratory researchers.

The University also provides access to an Electronic Laboratory Notebook (ELN) platform called LabArchives. Any student or staff member at the University can log into LabArchives using their University credentials and create their own electronic notebook to record, manage, and safely store their digital research data. LabArchives makes it easy to collaborate on research projects, even with external partners, and lets graduate research supervisors and students keep track of a research project.

Data citation

What is data citation?
Data citation refers to the practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to other scholarly resources. While data has often been shared in the past, it was seldom cited in the same way as journal articles or other publications. This culture is rapidly changing.

Did you know?

  • Many journal publishers now encourage or require citation of research data.
  • There is a global network of discipline and institutional data repositories where research data collections are described with a preformatted citation statement provided.
  • Only cited data can be counted and tracked (in a similar manner to journal articles) to measure impact.
  • A Digital Object Identifier (DOI) may be assigned to data in the same way as journal articles.
  • Data citation information may soon be incorporated into practices for research evaluation and reward.
  • Some bibliographic management systems (e.g. EndNote) now include a template for research data citations.
  • Some referencing styles now include guidelines on how to cite datasets, so check the guide for your referencing style if you are citing a dataset. The DataCite DOI Citation Formatter can format a citation in many different styles, but only some of those styles formally cover datasets; for the rest, the DataCite tool extrapolates from the existing style.

Unpublished data should also be cited to acknowledge the data creator even if only in an informal way.

How do I cite data?
Data is cited in the same way as a publication. However, it is important to include the right elements when citing data. A citation should include enough information to locate the data.

Key elements include:

  • Creator/Author
  • Publication Year
  • Title
  • Edition/Version
  • Publisher - usually the archive or data repository
  • Access information - preferably a persistent unique identifier: DOI or Handle

If the data has a DOI, then it is important to cite this - this allows the link to always point to the right location and helps systems manage the linking and analysis of citations.
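
As a rough illustration of how the key elements fit together, the sketch below assembles a citation string in the "Creator (Year): Title. Publisher. Identifier" pattern. The function name and its parameters are hypothetical, the pattern is only one of many possible styles, and the sample values are taken from one of this guide's example citations; check your own referencing style before relying on any particular layout.

```python
def format_data_citation(creators, year, title, publisher, identifier, version=None):
    """Join the key citation elements into a single string.

    Hypothetical helper: creators is a list of "Family, Given" names and
    identifier is a persistent link such as a DOI URL.
    """
    parts = [f"{'; '.join(creators)} ({year}): {title}."]
    if version:
        # Edition/version is optional and only included when supplied.
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(identifier)
    return " ".join(parts)


citation = format_data_citation(
    creators=["Pintor, Anna", "Krockenberger, Andrew", "Schwarzkopf, Linda"],
    year=2015,
    title="Hydroregulation in the tropical skink Carlia rubrigularis",
    publisher="James Cook University",
    identifier="https://doi.org/10.4225/28/55B58C5D690A8",
)
print(citation)
```

In practice a repository or citation tool will usually generate this string for you; the point of the sketch is only to show that each key element has a fixed place in the citation.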

Examples of data citations:

  • PARRIS, KIRSTEN (2016): Raw data, Potvin et al. (2016) Genetic erosion and escalating extinction risk in frogs with increasing wildfire frequency. University of Melbourne. https://doi.org/10.4225/49/57DF2AEE61B76. Retrieved: 03 21, Feb 03, 2017 (GMT)
  • Pintor, Anna; Krockenberger, Andrew; Schwarzkopf, Linda (2015): Hydroregulation in the tropical skink Carlia rubrigularis. James Cook University. https://doi.org/10.4225/28/55B58C5D690A8
  • H. E. M. Cool, Mark Bell (2011) Excavations at St Peter's Church, Barton-upon-Humber [data-set]. York: Archaeology Data Service [distributor] https://doi.org/10.5284/1000389

Many archives or repositories will give guidance on how to cite data. There is even an online data citation formatter, provided by DataCite, for any data that has a DOI. For more information on data citation, see the ARDC guide to Data Citation.

The University’s data repository, Melbourne Figshare, provides a citation format for your data that includes a DOI. Publishing your data on this platform makes it easy to share your data in a way that is citable and trackable.