The AgBioData consortium was formed in 2015 in response to the need for GGB personnel to work together to come up with better, more efficient database solutions. The mission of the consortium, whose members are responsible for over 25 GGB databases and allied resources, is to work together to identify ways to consolidate and standardize common GGB database operations and to create database products with greater interoperability. FAIR principles have rapidly become standard guidelines for proper data management, as they outline a road map to maximize data reuse across repositories. However, more specific guidelines on how to implement FAIR principles for agricultural GGB data are needed to assist and streamline implementation across GGB databases. The results of a survey of AgBioData members were used to focus and foster the workshop discussions. Here we present the current challenges facing GGB databases in each of the seven areas addressed at the workshop, along with recommendations for best practices, incorporating discussions from the Salt Lake City meeting and results of the survey. The purpose of this paper is 3-fold: first, to document the current challenges and opportunities of GGB databases and online resources regarding the collection, integration and provision of data in a standardized way; second, to outline a set of standards and best practices for GGB databases and their curators; and third, to inform policy and decision makers in the federal government, funding agencies, scientific publishers and academic institutions about the growing importance of scientific data curation and management to the research community. The paper is organized by the seven topics discussed at the Salt Lake City workshop. For each topic, we provide an overview, challenges and opportunities, and recommendations. The acronym 'API' (application programming interface) appears frequently in this paper, referring to the means by which software components communicate with each other, i.e. a set of instructions and data transfer protocols.
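For readers less familiar with APIs, the minimal sketch below illustrates the request/response pattern; the endpoint URL, query parameters and response structure are hypothetical and do not correspond to any specific AgBioData resource.

```python
# Minimal, hypothetical sketch of programmatic access to a GGB database API.
# The base URL, query parameters and JSON structure are illustrative only;
# each real database defines its own endpoints, parameters and formats.
import requests

response = requests.get(
    "https://example-ggb-database.org/api/v1/genes",   # hypothetical endpoint
    params={"species": "Zea mays", "format": "json"},  # hypothetical parameters
    timeout=30,
)
response.raise_for_status()

# Assumes the service returns a JSON list of gene records.
for gene in response.json():
    print(gene)
```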
We envision this paper will be helpful to scientists in the GGB database community, publishers, funders, policy makers and agricultural scientists who want to broaden their understanding of FAIR data practices.

Biocurators strive to present an accessible, accurate and comprehensive representation of biological knowledge. Biocuration is the process of selecting and integrating biological knowledge, data and metadata within a structured database so that it can be accessible, understandable and reusable by the research community. Data and metadata are taken from peer-reviewed publications and other sources and integrated with other data to deliver a value-added product to the public for further research. Biocuration is a multidisciplinary effort that involves subject area experts, software developers, bioinformaticians and researchers. The curation process usually includes a mixture of manual, semi-automated and fully automated workflows. Manual biocuration is the process of an expert reading one or several related publications, assessing and/or validating the quality of the data and entering the data manually into a database using curation tools, or by providing spreadsheets to the database manager. It also encompasses the curation of facts or knowledge, in addition to raw data; for example, the role a gene plays in a particular pathway. These data include information on genes, proteins, DNA or RNA sequences, pathways, mutant and non-mutant phenotypes, mutant interactions, qualitative and quantitative traits, genetic variation, diversity and population data, genetic stocks, genetic maps, chromosomal information, genetic markers and any other information from the publication that the curator deems valuable to the database consumers. Manual curation includes determining and attaching appropriate ontology and metadata annotations to data. This sometimes requires interaction with authors to ensure the data are represented correctly and completely, and indeed to ask where the data reside if they are not linked to a publication. In well-funded large GGB databases, manually curated data may be reviewed by one, two or even three additional curators.
Manual biocuration is perhaps the best way to curate data, but no GGB database has enough resources to curate all data manually. Moreover, the number of papers produced by each research community continues to grow rapidly. Thus, semi-automated and fully automated workflows are also used by most databases. For example, a species-specific database may want to retrieve all Gene Ontology (GO) annotations for genes and proteins of their species from a multi-species database like UniProt. In this case, a script might be written and used to retrieve those data 'en masse'. Prediction of gene homologs, orthologs and function can also be automated. Some of these standard automated processes require intervention at defined points from expert scientists to choose appropriate references and cut-off values, perform verifications and do quality checks. All biocuration aims to add value to data. Harvesting biological data from published literature, linking it to existing data and adding it to a database enables researchers to access the integrated data and use it to advance scientific knowledge. The manual biocuration of genes, proteins and pathways in one or more species often leads to the development of algorithms and software tools that have wider applications and contribute to automated curation processes. For example, The Arabidopsis Information Resource (TAIR) has been manually adding GO annotations to thousands of Arabidopsis genes from the literature since 1999. This manual GO annotation is now the gold-standard reference set for all other plant GO annotations and is used for inferring gene function of related sequences in all other plant species. Another example is the manually curated metabolic pathways in EcoCyc, MetaCyc and PlantCyc, which have been used to predict genome-scale metabolic networks for several species based on gene sequence similarity. The recently developed Plant Reactome database has further streamlined the process of orthology-based projections of plant pathways by creating simultaneous projections for 74 species. These projections are routinely updated along with the curated pathways from the Reactome reference species Oryza sativa. Without manual biocuration of experimental data from Arabidopsis, rice and other model organisms, the plant community would not have the powerful gene function prediction workflows we have today, nor would the development of the wide array of existing genomic resources and automated protocols have been possible. Biocurators continue to provide feedback to improve automated pipelines for prediction workflows and help to streamline data sets for their communities and/or add value to the primary data.
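As a concrete sketch of the kind of 'en masse' retrieval script mentioned above, the example below queries UniProt's REST API for GO annotations of reviewed Arabidopsis thaliana entries; the query string, field names and pagination handling are simplified and should be checked against the current UniProt API documentation before use.

```python
# Sketch of a semi-automated retrieval of GO annotations for one species from
# UniProt's REST API (https://rest.uniprot.org). Only the first page of results
# is fetched here; a production script would follow the pagination links.
import requests

URL = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "organism_id:3702 AND reviewed:true",  # 3702 = Arabidopsis thaliana
    "fields": "accession,gene_names,go_id",         # accession, gene symbols, GO term IDs
    "format": "tsv",
    "size": 500,
}

response = requests.get(URL, params=params, timeout=60)
response.raise_for_status()

for line in response.text.splitlines()[1:]:         # skip the TSV header row
    accession, gene_names, go_ids = line.split("\t")
    print(accession, gene_names, go_ids)
```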
All biocuration is time consuming and requires assistance from expert biologists. Current efforts in machine learning and automated text mining to extract data, or to rank journal articles for more effective curation, work to some extent, but so far these approaches are not able to synthesize a clear narrative and thus cannot yet replace biocurators. The manual curation of literature, genes, proteins, pathways etc. by expert biologists remains the gold standard used for developing and testing text-mining tools and other automated workflows. We expect that although text-mining tools will help biocurators achieve higher efficiency, biocurators will remain indispensable to ensure the accuracy and relevance of biological data. GGB databases can increase researchers' efficiency, increase the return on research funding investment by maximizing reuse and provide use metrics for those who desire to quantify research impact. We anticipate that the demand for biocurators will increase as the tsunami of 'big data' continues. Despite the fact that the actual cost of data curation is estimated to be less than 0.1% of the cost of the research that generated the primary data, data curation remains underfunded.

Databases are focused on serving the varied needs of their stakeholders. Because of this, different GGB databases may curate different data types, or curate similar data types to varying depths, and are likely to be duplicating efforts to streamline curation. In addition, limited resources for most GGB databases often prevent timely curation of the rapidly growing data in publications.

The size and complexity of biological data resulting from recent technological advances require the data to be stored in computable or standardized form for efficient integration and retrieval. Use of ontologies to annotate data is important for integrating disparate data sets. Ontologies are structured, controlled vocabularies that represent specific knowledge domains. Examples include the GO for attributes of gene products such as subcellular localization, molecular function or biological role, and the Plant Ontology (PO) for plant attributes such as developmental stages or anatomical parts. When data are associated with appropriate ontology terms, data interoperability, retrieval and transfer are more effective. In this section, we review the challenges and opportunities in the use of ontologies and provide a set of recommendations for data curation with ontologies.

To identify the current status and challenges in ontology use, an online survey was offered to AgBioData members. The survey results for ontology use in databases for each data type are provided in Table 1, and a summary of other survey questions, such as barriers to using ontologies, is provided in supplementary material 1. In addition, the ways ontologies are used in data descriptions in some GGB databases are described in supplementary material 2. To facilitate the adoption of ontologies by GGB databases, we describe the challenges identified by the survey along with some opportunities to meet these challenges, including a review of currently available ontologies for agriculture, ontology libraries and registries, and tools for working with ontologies.
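To make the value of ontology annotation concrete, the record below is a hypothetical, simplified example (not the schema of any particular GGB database) in which a gene is tagged with real GO and PO term identifiers so that the annotation is computable and interoperable.

```python
# Hypothetical curated record illustrating how ontology term IDs make an
# annotation computable. The record structure is illustrative only; the GO
# and PO identifiers shown are real ontology terms.
annotation = {
    "gene": "AT1G01010",  # an Arabidopsis thaliana gene identifier
    "gene_ontology": [
        {"id": "GO:0005634", "label": "nucleus"},
        {"id": "GO:0003700", "label": "DNA-binding transcription factor activity"},
    ],
    "plant_ontology": [
        {"id": "PO:0009010", "label": "seed"},
    ],
    "evidence_code": "IDA",             # GO evidence code: Inferred from Direct Assay
    "reference": "PMID:<placeholder>",  # supporting publication would go here
}

# Ontology IDs can be matched across databases regardless of free-text labels.
print(annotation["gene"], [term["id"] for term in annotation["gene_ontology"]])
```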
A key component of the FAIR data principles is that data can be found, read and interpreted using computers. APIs and other mechanisms for providing machine-readable data allow researchers to discover data, facilitate the movement of data among different databases and analysis platforms and, when coupled with good practices in curation, ontologies and metadata, are fundamental to building a web of interconnected data covering the full scope of agricultural research. Without programmatic access to data, the goals laid out in the introduction to this paper cannot be reached, because it is simply not possible to store all data in one place, nor is it feasible to work across a distributed environment without computerized support. After a brief description of the current state of data access technology across GGB databases and other online resources, we more fully describe the need for programmatic data access under Challenges and Opportunities and end with recommendations for best practices.

Sharing among AgBioData databases is already widespread, either through programmatic access or other means. The results of the AgBioData survey of its members indicate that GGB databases and resources vary in how they acquire and serve their data, particularly to other databases. All but 3 out of 32 GGB databases share data with other databases, and all but two have imported data from other databases. Some make use of platforms such as InterMine, Ensembl and Tripal to provide programmatic access to data that is standard within, but not across, the different options. Other databases develop their own programmatic access or use methods such as file transfer protocol (FTP). Finally, some databases provide no programmatic access to data. A number of infrastructure projects already exist that support AgBioData data access needs, most of which have been adopted to some degree by different GGB platforms. A more recent approach to facilitate data search, access and exchange is to define a common API that is supported by multiple database platforms. An example of this is BrAPI (the Breeding API), which defines querying methods and data exchange formats without requiring any specific database implementation. Each database is free to choose an existing implementation or to develop its own. However, BrAPI's utility is restricted to specific types of data. Alternatively, the Agave API provides a set of services that can be used to access, analyse and manage any type of data from registered systems, but is not customized to work with GGB databases.

Aside from primary repositories like GenBank, model organism and specialty databases remain the primary means of serving data to researchers, particularly for curated or otherwise processed data. These databases represent different community interests, funding sources and data types. They have grown in an ad hoc fashion and distribute data in multiple formats, which are often unique to each database and may lack programmatic access. Below, we lay out some of the challenges and opportunities in programmatic data access faced by GGB researchers using the current landscape of databases. Exploration of these use cases yielded a set of common data access requirements under five different themes, summarized in Table 7. Large comparative genomic portals exist but have limitations in their utility for specialized communities, such as not incorporating data from minor crop species or crop wild relatives, or rarely handling multiple genomes for the same species.
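Returning to the common-API approach described above, the sketch below shows how a client might query germplasm records from a BrAPI-compliant server; the base URL is a placeholder, and the endpoint, parameters and response envelope follow the published BrAPI v2 specification but should be verified against the version a given server implements.

```python
# Sketch of a BrAPI v2 client call against a placeholder server URL.
# The /germplasm endpoint and the metadata/result response envelope follow
# the BrAPI specification (https://brapi.org); details may vary by version.
import requests

BASE_URL = "https://brapi.example.org/brapi/v2"  # placeholder BrAPI-compliant server

response = requests.get(
    f"{BASE_URL}/germplasm",
    params={"commonCropName": "wheat", "pageSize": 10},  # standard BrAPI filter/paging params
    timeout=30,
)
response.raise_for_status()
payload = response.json()

# BrAPI responses wrap records in a "result" object with a "data" list.
for germplasm in payload["result"]["data"]:
    print(germplasm.get("germplasmName"), germplasm.get("accessionNumber"))
```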