This page offers recommendations and practical examples for making data compliant with the FAIR principles. Compliance starts with preparing and organizing files correctly, choosing suitable file formats, and describing the data set with the necessary metadata. Depending on the safety and ethical considerations of the research, the researcher must decide which data to publish and which to withhold. After the data is published, it is recommended to link it to scientific articles and other research results: the data gains visibility and the articles gain credibility. Useful links are also collected here to improve research data management knowledge and skills and to find examples of good practice.

Research data life cycle

The research data lifecycle includes the following stages:

  • Study planning - before data collection, research objectives are determined, methods are selected, and data storage, sharing, and ethical considerations, as well as compliance with funder and institutional requirements, are considered.
  • Data collection - data is obtained, for example, by conducting experiments, observations, surveys, or simulations, based on a predetermined methodology.
  • Data processing - raw data is cleaned, formatted, and organized for analysis.
  • Data analysis - insights and conclusions are obtained using statistical, computational, or qualitative methods.
  • Data storage - data is securely stored, applying appropriate backup and access control protocols to ensure data integrity and quality and to prevent data loss.
  • Data sharing - data is made available to others by publishing it in digital repositories or controlled access systems (depending on its sensitivity), providing the opportunity to verify, validate, use, or replicate the research results.
  • Reuse and preservation - data is reused in new studies or archived for long-term preservation.

File formats

Choosing a file format is an important step in ensuring that your data remains readable and usable in the future. Formats are better suited to long-term preservation if they:

  • are non-proprietary, meaning they are not tied to a specific software license that may change or disappear;
  • are open and based on well-documented international standards that ensure a well-understood data structure and meaning even after years;
  • use a universal character encoding, such as Unicode - UTF-8, to avoid problems with special characters or languages;
  • are uncompressed or use a simple and reversible compression method that allows you to avoid data loss and fully restore it.

More information about data preparation and formats can be found here.

Data preparation

Spreadsheets are one of the most commonly used tools for collecting and processing data, but if such files are not well thought out, they can become difficult to read for both humans and computer programs. To increase the accessibility and reusability of such data, it is necessary to follow certain good practice steps.

It is recommended to:

  • give each column a short title that describes the data it contains;
  • use a single header row and make sure that the table starts from the first cell - “A1”;
  • include an explanation of the titles and labels to describe each spreadsheet;
  • save each file with a name that adequately reflects its content;
  • if the data set consists of multiple tables or worksheets, deposit each of them as a separate file.

It is not recommended to:

  • include charts, comments, or multiple tables in a spreadsheet at the same time;
  • use color coding, as computer systems cannot interpret this;
  • use special (non-alphanumeric) symbols, including commas;
  • merge cells, as this can cause problems during conversion;
  • deposit multiple worksheets in a single file, such as a Microsoft Excel workbook, as the CSV and TAB formats do not support this.

Open Research Europe Open Data, Software and Code Guidelines

It is safest to store and deposit data in CSV or TAB format, as they are simple, open and widely supported. If the spreadsheet contains variable labels, code labels or missing values, it is more appropriate to use SAV, SAS or POR formats, defining variable names in English.
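The spreadsheet recommendations above can be checked automatically before deposit. The following is a minimal sketch in Python, not an official RSU tool: the function names and the header rule (letters, digits and underscores only) are assumptions made for illustration.

```python
import csv
import io
import re

# Assumed rule for this sketch: a title starts with a letter and uses only
# letters, digits and underscores (no spaces, commas or other symbols).
HEADER_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_headers(headers):
    """Return a list of human-readable problems found in the header row."""
    problems = []
    seen = set()
    for position, title in enumerate(headers, start=1):
        if not title.strip():
            problems.append(f"column {position}: empty title")
        elif not HEADER_PATTERN.match(title):
            problems.append(f"column {position}: '{title}' contains special characters")
        if title in seen:
            problems.append(f"column {position}: '{title}' is duplicated")
        seen.add(title)
    return problems


def table_to_csv(headers, rows):
    """Serialize one table as CSV text with a single header row."""
    problems = check_headers(headers)
    if problems:
        raise ValueError("; ".join(problems))
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `table_to_csv(["participant_id", "age_years"], [[1, 34], [2, 51]])` returns the text `participant_id,age_years`, `1,34`, `2,51` on three lines. When saving the result to disk, opening the file with `encoding="utf-8"` keeps it in line with the character encoding recommendation given under file formats.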

More information on organizing data in spreadsheets can be found here.

Data cleaning script for use in R

Describing and documenting data

For research results to be reliable and repeatable, data must be documented and carefully described. The prepared documentation must be deposited alongside the data. It must include information that explains how the data was generated, structured and interpreted.

Data should be coded using variables, for example numerical codes. Internationally recognized classifications commonly used by researchers in a given field make this easier. Using such standards makes data more accessible, understandable and usable outside the institution or country. Existing classifications will not always suit a specific study, however; in such cases, provide a detailed description of your own approach and deposit it with the dataset to avoid inaccurate interpretation of the data.

The following types of documents are most often published alongside the data (but not limited to):

  • descriptions of methodology - study design, methods used, data collection or generation process;
  • codebooks - technical description of the data, variables and numbers used, explanation of their values, data structure and other contextual information depending on the research field;
  • questionnaires - especially in the case of survey data, it is important to attach the questionnaire files so that users can see the questions as they were formulated;
  • laboratory notes and experimental protocols - detailed records of the course of experiments;
  • software-related documentation - if the data was processed with less well-known or open source software, it is recommended to attach a description and code;
  • ReadMe file - simple text with instructions on how to use the data and reproduce the data analysis used in the study (example of a ReadMe file).

In the case of large projects, researchers can also publish information on the taxonomies/ontologies used (if they are not yet publicly available), different types of mappings (in cases where there are many files), contextual information describing the project and policies related to the research topic, etc. It should be noted that different research fields have different practices regarding documentation and required additional materials.
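To illustrate what a codebook records, the sketch below keeps variable descriptions and value labels in one structure, exports them as a CSV table that could be deposited alongside the data, and uses the same structure to decode a record. All variable names, codes and labels are invented for the example and do not come from any RSU dataset.

```python
import csv
import io

# Hypothetical codebook: variable names, descriptions and value labels are invented.
CODEBOOK = {
    "sex": {
        "description": "Sex of the respondent",
        "values": {1: "female", 2: "male", 9: "not reported"},
    },
    "smoker": {
        "description": "Current smoking status",
        "values": {0: "no", 1: "yes", 9: "not reported"},
    },
}


def codebook_to_csv(codebook):
    """Flatten the codebook into one CSV row per variable/code pair."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["variable", "description", "code", "label"])
    for variable, entry in codebook.items():
        for code, label in entry["values"].items():
            writer.writerow([variable, entry["description"], code, label])
    return buffer.getvalue()


def decode_record(record, codebook):
    """Replace numeric codes with their labels; unknown codes raise KeyError."""
    return {name: codebook[name]["values"][code] for name, code in record.items()}
```

With this codebook, `decode_record({"sex": 1, "smoker": 0}, CODEBOOK)` gives `{"sex": "female", "smoker": "no"}`. Keeping the missing-value code (here 9) in the codebook documents it explicitly, so it is not left to be inferred from the data.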

Naming and organizing files

A properly designed file naming and organization system is one of the foundations for making research data transparent, easily manageable and reusable. This is especially important when working with several projects at the same time and/or processing large amounts of data.

The structure and names of files should be:

  • distinguishable, human-readable and understandable - such that they reflect the content;
  • consistent, preferably understandable by computer systems;
  • systematic, with a thoughtful approach to organizing files in folders;
  • computer-readable, avoiding repetition of semantic elements;
  • appropriate to the format, i.e. with extensions that accurately indicate the file type.

For each file and data set, it is recommended to specify:

  • one short and concise title that describes the content of the file;
  • a more detailed description of the notations, including explanations of the acronyms and notations used, so that the files are easily distinguishable;
  • numbering starting with “01”, for example “01_ReadMe”.

This approach makes it easier to navigate the dataset not only for researchers involved in the specific study, but also for collaboration partners and later other researchers who will want to reuse the study data.

To keep names simple and stable in the long term, it is recommended to:

  • avoid using spaces, replacing them with underscores (code_gram), hyphens (code-gram) or capital letters (CodeGram);
  • use only Latin alphabet letters (A–Z, a–z), avoid diacritical marks (Āā, Čč, Ēē, Ģģ, etc.), symbols of other languages (Ææ, Øø, Åå, Öö, Ää, etc.) or special characters (*, /, ?, :, ", <, |, %, #, [, {, etc.);
  • indicate dates according to the international convention - YYYY-MM-DD (2017-10-25);
  • make sure that the file name in the original format matches the file in the desired format for deposit.
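The naming rules above can be applied mechanically. The following sketch is one possible approach, with an assumed function name and parameter set: it removes diacritical marks, replaces spaces and special characters with underscores, and optionally adds a two-digit number and an ISO date.

```python
import re
import unicodedata
from datetime import date


def safe_file_name(title, extension, created=None, number=None):
    """Build a file name from a free-text title following the rules above."""
    # Decompose accented letters (for example "ī" becomes "i" plus a combining
    # mark), then drop everything outside ASCII. Letters that have no
    # decomposition, such as "ø", are lost and should be replaced by hand.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of characters other than letters and digits with one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", ascii_title).strip("_")
    parts = []
    if number is not None:
        parts.append(f"{number:02d}")  # 1 -> "01"
    if created is not None:
        parts.append(created.isoformat())  # YYYY-MM-DD
    parts.append(stem)
    return "_".join(parts) + "." + extension.lstrip(".").lower()
```

For example, `safe_file_name("Aptaujas dati: Rīga (v2)", "CSV", created=date(2017, 10, 25), number=2)` returns `02_2017-10-25_Aptaujas_dati_Riga_v2.csv`, and `safe_file_name("ReadMe", "txt", number=1)` returns `01_ReadMe.txt`.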

The way in which files are organized depends on their type and research field, as well as the best practices observed in it.

Metadata

Metadata is structured information about a dataset that allows us to understand what is included in it, how it was created, and how to use it. Without metadata, public databases would not be possible.

Many open data portals include tools that help create and maintain metadata when publishing new data. In addition, some open data portals update metadata automatically when editing datasets. RSU Dataverse also sets minimum requirements for metadata that researchers must specify (see here).

Core metadata elements provide key information about the data and help users find the data and evaluate its relevance to their needs. These elements are often displayed in live data catalogues or search tools, so their quality and accuracy are particularly important.

Essential

Persistent Identifier (PID) - a unique reference to a dataset, so that it can be identified and located even if it is moved (RSU Dataverse automatically assigns a DOI to a dataset as soon as it is published).

Title - the name of the dataset in English (often also the title of the study), which briefly and clearly reflects its content.

Publisher - the organization or institution that ensures the publication of the dataset, for example, Rīga Stradiņš University.

Author(s) - the author or authors of the dataset (it is recommended to indicate both the main author and co-authors, as well as their institutional affiliation and ORCID identifier).

Contact person - one or two people who can be contacted about the dataset.

Description - a short summary of the dataset, detailed enough for a potential user to judge quickly whether the data is useful to them.

Scientific field - the main scientific field to which the dataset is attributed according to the classification.

Keywords - labels or terms that help users find the dataset (it is recommended to include words and phrases that would be used not only by specialists, but also by the wider community; keyword lists (vocabularies) can also be specified).

Language - the language used in the dataset.

Creation date and place - the date (in YYYY-MM-DD format) and place where the dataset was created (not published).

Collection date - the time period (in YYYY-MM-DD - YYYY-MM-DD format) in which the data was collected/generated.

Data type - one or more types of data included in the dataset, for example, survey data or clinical data, etc.

License - the license applicable to the dataset, which determines the terms of its use (in the case of RSU Dataverse, the data is automatically granted public domain status).

Public accessibility level - the level of access granted to the dataset, which must be determined even if the data is not made public or an embargo period is set for it (this must be specified and justified). The following access options are available in RSU Dataverse:

  • open (anyone can access the data without restrictions);
  • restricted, request access (access is restricted, but a request with a collaboration proposal can be submitted to the authors);
  • restricted, no access (access denied, files will be opened only through the author's contact person);
  • embargo period.

Version number - information about the latest version of the dataset (date when it was last changed, supplemented or modified, as well as a description of the main changes).

Optional

Grant information - the source of funding for the project/research, including the name of the funding institution, the project number or ID, and the full project title.

Time period covered - the time period (in the format YYYY-MM-DD - YYYY-MM-DD) to which the dataset applies (especially important in the case of historical or longer studies).

Software - instructions on the application software required to open, read, and use the files.

Related materials and datasets - references to other datasets, research results, or scientific publications directly related to the specific dataset (it is recommended to add a direct link or DOI).

Embargo period - information on whether the dataset is subject to an embargo period, its duration, and the reasons for this restriction.

Other

In addition to the essential and optional elements, metadata can also include a range of other information that may be useful in a specific research context, such as classifications, controlled vocabularies, taxonomies and ontologies you use, geographical data, etc., based on the requirements of the scientific field and established metadata standards. Such elements help make the dataset easier to find, understand and reuse.
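Before submitting a metadata record, it can help to check that the essential elements are filled in and that dates follow the stated formats. The sketch below shows one way to do this. The field names are hypothetical and are not the RSU Dataverse schema, the example values are invented, and the persistent identifier is left out because, as noted above, it is assigned on publication.

```python
import re

# Hypothetical field names for the essential elements listed above.
ESSENTIAL_FIELDS = [
    "title", "publisher", "authors", "contact", "description", "scientific_field",
    "keywords", "language", "creation_date", "collection_period", "data_type",
    "license", "access_level", "version",
]
# Access options as listed for RSU Dataverse in the text above.
ACCESS_LEVELS = {
    "open", "restricted, request access", "restricted, no access", "embargo period",
}
DATE = r"\d{4}-\d{2}-\d{2}"  # checks the shape only, not calendar validity


def validate_metadata(record):
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing: {name}" for name in ESSENTIAL_FIELDS if not record.get(name)]
    creation = record.get("creation_date")
    if creation and not re.fullmatch(DATE, creation):
        problems.append("creation_date is not in YYYY-MM-DD format")
    period = record.get("collection_period")
    if period and not re.fullmatch(rf"{DATE} - {DATE}", period):
        problems.append("collection_period is not in YYYY-MM-DD - YYYY-MM-DD format")
    access = record.get("access_level")
    if access and access not in ACCESS_LEVELS:
        problems.append("access_level is not one of the listed options")
    return problems


# Invented example record, for illustration only.
EXAMPLE = {
    "title": "Example survey dataset",
    "publisher": "Rīga Stradiņš University",
    "authors": [{"name": "Jane Doe", "affiliation": "Rīga Stradiņš University"}],
    "contact": "Jane Doe",
    "description": "Responses to an example questionnaire.",
    "scientific_field": "Public Health",
    "keywords": ["survey", "example"],
    "language": "English",
    "creation_date": "2024-04-10",
    "collection_period": "2024-01-15 - 2024-03-31",
    "data_type": "survey data",
    "license": "CC BY",
    "access_level": "open",
    "version": "1.0",
}
```

Calling `validate_metadata(EXAMPLE)` returns an empty list, while a record with an empty keyword list and a creation date written as `25.10.2017` is reported with two problems.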

Data that cannot be shared

It is not always possible to publish all the data obtained in a study. This is especially true when the data includes personal data or sensitive information. In such situations, it is important to consult with more experienced colleagues, data protection specialists or data curators in advance to assess potential risks and find a safer solution.

The conditions for publication also depend on the research topics, project contract or agreements with industry representatives.

Data cannot be shared if they contain:

  • personal data that allows direct or indirect identification of an individual;
  • confidential commercial information;
  • information about the security conditions of systems, the publication of which may pose risks or threats;
  • content, the publication of which may result in the loss of intellectual property rights protection;
  • very large volumes of data, e.g. >50 GB, the deposit of which may be technically difficult.

In such cases, researchers are advised to provide:

  • detailed metadata (excluding any confidential information);
  • justification for access restrictions;
  • clear conditions under which data can still be accessed.

Open Research Europe Open Data, Software and Code Guidelines

If data includes personal information, it should be appropriately anonymized or pseudonymized. We recommend using the R anonymizer package for this purpose. Sensitive data can be coded and grouped to reduce the risk of identifying individuals.
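As a language-neutral illustration of the two ideas in the paragraph above (replacing identifiers and grouping values), here is a minimal Python sketch. It is not the R package mentioned above and does not reproduce its interface; the function names, record fields and the 10-year band width are assumptions. Pseudonymized data can still count as personal data, so a sketch like this does not replace a risk assessment with a data protection specialist.

```python
import hashlib
import hmac


def pseudonym(identifier, secret_key):
    """Derive a stable pseudonym from an identifier using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability


def age_band(age, width=10):
    """Group an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def pseudonymize(record, secret_key):
    """Return a new record with the direct identifier replaced and the age grouped."""
    return {
        "pseudonym": pseudonym(record["personal_code"], secret_key),
        "age_band": age_band(record["age"]),
        "answer": record["answer"],
    }
```

The secret key (a `bytes` value) must be stored separately from the deposited data. Without it the pseudonyms cannot be recomputed from known identifiers, and with it the same person receives the same pseudonym across files.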

If you are unable to share data for any reason not mentioned here, or if you have additional questions about data sharing, please contact the RSU Data Curators at datukuratori[at]rsu[dot]lv.

Connecting a scientific article to a dataset

Sharing information about the data used ensures the transparency, verifiability and credibility of scientific articles and their content, and can also facilitate new collaboration opportunities.

When submitting an article to a journal, it is recommended to indicate the data set involved. Depending on the journal, this is usually possible by indicating the unique identifier of the data set, such as a DOI, or other access information. This practice is supported by most scientific journals and databases. In cases where the data cannot be publicly available, it is still possible to link it to the publication by indicating the identifier and an explanation of the reasons for restricted access.

Once the article is published, it is also recommended to add a reference in the opposite direction, linking the data set to the publication using the DOI of the article, which is sent by e-mail after its publication. This strengthens the credibility of the article, makes the data more discoverable, and makes the researcher's scientific work more visible.
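DOIs are written in several forms (bare, with a `doi:` prefix, or as a link), which can make cross-references inconsistent. The sketch below normalizes a DOI to a link and builds a data availability sentence. The pattern only checks the general shape `10.<registrant>/<suffix>`, the wording of the sentence is an assumption rather than a journal requirement, and the DOI in the example is a placeholder.

```python
import re

# Shape check only: a DOI starts with "10.", a numeric registrant code and a slash.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
KNOWN_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def doi_url(doi):
    """Normalize a DOI written in various ways into an https://doi.org/ link."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


def availability_statement(dataset_doi, repository="RSU Dataverse"):
    """Build a one-sentence data availability statement (wording is illustrative)."""
    return (
        f"The data supporting this article are available in {repository} "
        f"at {doi_url(dataset_doi)}."
    )
```

For example, `doi_url("doi:10.1234/example.dataset")` returns `https://doi.org/10.1234/example.dataset`.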

RSU researchers are encouraged to inform RSU Dataverse (dataverse[at]rsu[dot]lv) about new publications or datasets related to already deposited datasets even after their publication. This also applies to RSU's scientific activity information system Pure, where new links to scientific articles can be added to registered datasets.

Dataset licensing

Dataset authors should choose an appropriate data usage license, which serves as a tool to inform interested parties about the conditions for reuse.

A license is an agreement by which the author of a creative work grants permission to use it to the wider public, specifying restrictions on its distribution and use. A license provides broader rights to the material, while preserving authorship.

Datasets deposited in RSU Dataverse can be released under one of the following Creative Commons licenses.

CC BY

The license allows modification of the creative work. Commercial use of the creative work is permitted. The requirement to attribute the author is retained. This is the most commonly used type of CC license.

License URI: http://creativecommons.org/licenses/by/4.0/

CC BY-SA

The license allows for modification of the creative work. Commercial use of the creative work is permitted. When modifying the creative work, the license of the original creative work must be retained. The requirement to attribute the author is retained.

License URI: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-ND

The license prohibits modification of the creative work, but allows its distribution for commercial and non-commercial purposes. The requirement to attribute the author is retained.

License URI: https://creativecommons.org/licenses/by-nd/4.0/

CC BY-NC

The license prohibits commercial use of the creative work. Modification is permitted, but derivative works do not need to be granted the same usage rights as the original work. The requirement to attribute the author is retained.

License URI: http://creativecommons.org/licenses/by-nc/4.0/

CC BY-NC-SA

The license allows you to modify the creative work and use it for non-commercial purposes. When doing so, you must retain the original creative work's license and attribute it to the author.

License URI: http://creativecommons.org/licenses/by-nc-sa/4.0/

CC BY-NC-ND

The most restrictive of all CC licenses. The license prohibits modification of the creative work and its commercial use. The work can only be downloaded and shared, with attribution to the author.

License URI: http://creativecommons.org/licenses/by-nc-nd/4.0/
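The six licenses above differ in three conditions: whether commercial use is allowed, whether modification is allowed, and whether modified works must keep the same license. The sketch below turns those conditions into a lookup. The function name and parameters are illustrative, and the result is only a starting point for choosing a license, not legal advice.

```python
# Key: (commercial use allowed, modification allowed, same license required for
# modified works). Value: (license name, license URI).
LICENSES = {
    (True, True, False): ("CC BY", "https://creativecommons.org/licenses/by/4.0/"),
    (True, True, True): ("CC BY-SA", "https://creativecommons.org/licenses/by-sa/4.0/"),
    (True, False, False): ("CC BY-ND", "https://creativecommons.org/licenses/by-nd/4.0/"),
    (False, True, False): ("CC BY-NC", "https://creativecommons.org/licenses/by-nc/4.0/"),
    (False, True, True): ("CC BY-NC-SA", "https://creativecommons.org/licenses/by-nc-sa/4.0/"),
    (False, False, False): ("CC BY-NC-ND", "https://creativecommons.org/licenses/by-nc-nd/4.0/"),
}


def choose_license(commercial, modification, share_alike=False):
    """Return (name, URI) of the CC license matching the three conditions."""
    if share_alike and not modification:
        # No such license exists: share-alike governs modified works only.
        raise ValueError("share-alike only applies when modification is allowed")
    return LICENSES[(commercial, modification, share_alike)]
```

For example, `choose_license(False, True, share_alike=True)` returns the name and URI of CC BY-NC-SA. All six licenses require attribution, so attribution is not a parameter.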

If the dataset is to be deposited elsewhere for long-term storage, the license selection tool is available here: https://choosealicense.com/.

RSU Dataverse

RSU Dataverse is an institutional research data repository built on open source software provided by Harvard Dataverse.

Dataverse is one of the most popular types of academic research data repositories in the world, and is regularly updated to improve its accessibility for researchers and machine readability. Dataverse includes all necessary protocols to store datasets in the most FAIR-compliant manner possible.

RSU Dataverse is registered in the Register of Research Data Repositories, OpenAIRE and EOSC as a resource and service for depositing research data. Thus, RSU Dataverse is available to researchers across Europe, and also facilitates collaboration, data exchange, monitoring and machine readability.

The CoreTrustSeal certification process has now been launched. Certification confirms that a repository meets internationally recognized standards for trustworthy research data management and long-term preservation, thus ensuring data accessibility, understandability and reusability, as well as promoting research transparency, reproducibility and compliance with FAIR principles.

RSU Dataverse is divided into the following sectors:

  • Medicine;
  • Public Health;
  • Social Sciences.

RSU Dataverse was created so that RSU researchers can deposit their data there after the completion of research projects or research activities, especially in cases where there is no appropriate and reliable repository in the relevant field. Data sets stored in RSU Dataverse can be freely available or with limited or closed access. In order to collaborate and share data sets, it is always possible to contact the authors of the data sets.

To deposit data in RSU Dataverse, write to dataverse[at]rsu[dot]lv, briefly describing the nature of the data and other relevant aspects related to it:

  • datasets can be sent to dataverse[at]rsu[dot]lv for placement in the repository or shared by creating a link to Nextcloud or SharePoint, depending on the sensitivity of the data;
  • before depositing datasets, it is necessary to fill in the minimum metadata questionnaire (it can be supplemented later);
  • different types* of datasets can be published (tabular data, audio and video recordings, transcripts of interviews or focus group discussions, etc.);
  • a codebook must also be prepared to describe the approaches used in the dataset to standardize the data.

The maximum amount of data to be deposited is 50 GB. In cases where the dataset exceeds this amount, contact datukuratori[at]rsu[dot]lv by e-mail to find the most suitable option for data storage.
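The total size of a dataset folder can be checked against this limit before contacting the repository. The sketch below is a simple illustration: the function names are invented, and it assumes 1 GB means 10^9 bytes, which the text above does not specify.

```python
from pathlib import Path

# Assumption: 1 GB is taken as 10**9 bytes.
MAX_DEPOSIT_BYTES = 50 * 10**9


def dataset_size_bytes(folder):
    """Sum the sizes of all files in the folder and its subfolders."""
    return sum(path.stat().st_size for path in Path(folder).rglob("*") if path.is_file())


def fits_deposit_limit(folder, limit=MAX_DEPOSIT_BYTES):
    """Return True if the folder's total size does not exceed the limit."""
    return dataset_size_bytes(folder) <= limit
```

Calling `fits_deposit_limit("my_dataset")` on a hypothetical folder named `my_dataset` returns `True` when the folder's files total 50 GB or less under this assumption.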

To access RSU Dataverse, RSU employees can use RSU authentication. Authors or co-authors are granted full access to their datasets. Researchers outside RSU can also register and access RSU Dataverse - the RSU Dataverse administrator will grant the appropriate access level to co-authors of datasets.

* The exception is genomic and sequencing raw data.

ZDIS Pure

The RSU Science Portal is part of the Latvian Scientific Activity Information System, which is based on the Elsevier Pure platform. It collects information about the results of the scientific activity of RSU academic staff: publications, projects, awards, research activities, data sets, public appearances, press and media coverage, and other achievements.

Pure provides:

  • centralized storage of research information, compilation of publications, projects, conferences and other metadata related to scientific activity;
  • integration with international registers and systems - ORCID, SCOPUS, etc.;
  • implementation of open science principles, helping to maintain open repositories and supporting data management plans;
  • the ability to track the impact of scientific activity and publications (citation rates, metrics and other statistical indicators).

Useful resources

General guidelines and training

Data preparation

Metadata