All biocuration is time-consuming and requires assistance from expert biologists

We are in an exciting time in biology. Genomic discovery on a large scale is cheaper, easier and faster than ever. Picture a world where every piece of biological data is available to researchers from easy-to-find and well-organized resources; the data are accurately described and available in accessible and standard formats; the experimental procedures, samples and time points are all completely documented; and researchers can find answers to any question they have about the data. Imagine that, with just a few mouse clicks, you could determine the expression level of any gene under every condition and developmental stage that has ever been tested. You could explore genetic diversity in any gene to find mutations with consequences. Imagine seamless and valid comparisons between experiments from different groups. Picture a research environment where complete documentation of every experimental process is available, and where data are always submitted to permanent public repositories, where they can be easily found and examined.

We ‘can’ imagine that world, and feel strongly that all outcomes of publicly funded research can and should contribute to such a system. It is simply too wasteful ‘not’ to achieve this goal. Proper data management is a critical aspect of research and publication. Scientists working on federally funded research projects are expected to make research findings publicly available. Data are the lifeblood of research, and their value often does not end with the original study, as they can be reused for further investigation if properly handled. Data become much more valuable when integrated with other data and information. For example, traits, images, seed/sample sources, sequencing data and high-throughput phenotyping results become much more informative when integrated with germplasm accessions and pedigree data. Access to low-cost, high-throughput sequencing, large-scale phenotyping and advanced computational algorithms, combined with significant funding from the National Science Foundation, the US Department of Agriculture and the US Department of Energy for cyberinfrastructure and agriculture-related research, has fueled the growth of databases to manage, store, integrate, analyse and serve these data and tools to scientists and other stakeholders. To describe agriculture-related databases, we use the term ‘GGB (genomics, genetics and breeding) database’. GGB databases include any online resource that holds genomic, genetic, phenotypic and/or breeding-related information and that is organized via a database schema and contained within a database management system or non-relational storage system. GGB databases play a central role in the communities they serve by curating and distributing published data, by facilitating collaborations between scientists and by promoting awareness of what research is being done and by whom in the community. GGB databases also help prevent duplicated research efforts and foster communication and collaboration between laboratories.

As more and more organisms are sequenced, cross-species investigations become increasingly informative, requiring researchers to use multiple GGB databases and requiring that GGB databases share data and use compatible software tools. Use of common data standards, vocabularies, ontologies and tools will make curation more effective, promote data sharing and facilitate comparative studies. The AgBioData consortium was formed in 2015 in response to the need for GGB database personnel to work together on better, more efficient database solutions. The mission of the consortium, whose members are responsible for over 25 GGB databases and allied resources, is to identify ways to consolidate and standardize common GGB database operations and to create database products with greater interoperability. The AgBioData consortium joins the larger scientific community in embracing the Findable, Accessible, Interoperable and Reusable (FAIR) data principles established by stakeholders from the scientific, publishing and library communities. FAIR principles have rapidly become standard guidelines for proper data management, as they outline a road map to maximize data reuse across repositories. However, more specific guidelines on how to implement FAIR principles for agricultural GGB data are needed to assist and streamline implementation across GGB databases. Members of the AgBioData consortium convened in Salt Lake City, UT, on 18 and 19 April 2017 to describe challenges and recommendations for seven topics relevant to GGB databases: Biocuration, Ontologies, Metadata and persistence, GGB database platforms, Programmatic access to data, Communication and Sustainability. Preceding this workshop, a survey on the seven topics was sent to all AgBioData members in order to identify their concerns and challenges; the results were used to focus and foster the workshop discussions. Here we present the current challenges facing GGB databases in each of these seven areas and recommendations for best practices, incorporating discussions from the Salt Lake City meeting and results of the survey.

The purpose of this paper is 3-fold: first, to document the current challenges and opportunities of GGB databases and online resources regarding the collection, integration and provision of data in a standardized way; second, to outline a set of standards and best practices for GGB databases and their curators; and third, to inform policy and decision makers in the federal government, funding agencies, scientific publishers and academic institutions about the growing importance of scientific data curation and management to the research community. The paper is organized by the seven topics discussed at the Salt Lake City workshop. For each topic, we provide an overview, challenges and opportunities, and recommendations. The acronym ‘API’ appears frequently in this paper, referring to the means by which software components communicate with each other: i.e. a set of instructions and data transfer protocols. We envision this paper will be helpful to scientists in the GGB database community, publishers, funders, policy makers and agricultural scientists who want to broaden their understanding of FAIR data practices.

Biocurators strive to present an accessible, accurate and comprehensive representation of biological knowledge. Biocuration is the process of selecting and integrating biological knowledge, data and metadata within a structured database so that it can be accessible, understandable and reusable by the research community. Data and metadata are taken from peer-reviewed publications and other sources and integrated with other data to deliver a value-added product to the public for further research. Biocuration is a multidisciplinary effort that involves subject area experts, software developers, bioinformaticians and researchers. The curation process usually includes a mixture of manual, semi-automated and fully automated workflows. Manual biocuration is the process of an expert reading one or several related publications, assessing and/or validating the quality of the data and entering the data manually into a database using curation tools, or by providing spreadsheets to the database manager. It also encompasses the curation of facts or knowledge, in addition to raw data; for example, the role a gene plays in a particular pathway. These data include information on genes, proteins, DNA or RNA sequences, pathways, mutant and non-mutant phenotypes, mutant interactions, qualitative and quantitative traits, genetic variation, diversity and population data, genetic stocks, genetic maps, chromosomal information, genetic markers and any other information from the publication that the curator deems valuable to the database consumers. Manual curation includes determining and attaching appropriate ontology and metadata annotations to data. This sometimes requires interaction with authors to ensure that data are represented correctly and completely, and indeed to ask where the data reside if they are not linked to a publication. In well-funded large GGB databases, manually curated data may be reviewed by one, two or even three additional curators. Manual biocuration is perhaps the best way to curate data, but no GGB database has enough resources to curate all data manually. Moreover, the number of papers produced by each research community continues to grow rapidly. Thus, semi-automated and fully automated workflows are also used by most databases.
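As a concrete illustration of the kind of structured record that manual curation produces, the following is a minimal sketch and not the schema of any particular GGB database: a gene is linked to an ontology term together with an evidence code, the supporting publication and the curator. All field names and identifiers are illustrative, and the reference shown is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class CuratedAnnotation:
    """One manually curated statement linking a gene to an ontology term.

    Illustrative record layout only; not the schema of any specific database.
    """
    gene_id: str             # database identifier for the gene
    gene_symbol: str
    ontology_term_id: str    # e.g. a GO or Plant Ontology accession
    ontology_term_label: str
    evidence_code: str       # e.g. 'IDA' (inferred from direct assay)
    reference: str           # PubMed ID or DOI of the source publication
    curator: str
    curation_date: date = field(default_factory=date.today)
    notes: List[str] = field(default_factory=list)

# Example: a curator records an experimentally supported role in flower
# development for an Arabidopsis gene (identifiers shown for illustration).
annotation = CuratedAnnotation(
    gene_id="AT5G61850",
    gene_symbol="LFY",
    ontology_term_id="GO:0009908",
    ontology_term_label="flower development",
    evidence_code="IDA",
    reference="PMID:1234567",   # placeholder reference
    curator="jdoe",
)
print(annotation)
```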
For example, a species-specific database may want to retrieve all Gene Ontology (GO) annotations for the genes and proteins of its species from a multi-species database such as UniProt. In this case, a script might be written to retrieve those data ‘en masse’ (a minimal sketch follows this paragraph). Prediction of gene homologs, orthologs and function can also be automated. Some of these standard automated processes require intervention at defined points from expert scientists to choose appropriate references and cut-off values, perform verifications and do quality checks. All biocuration aims to add value to data. Harvesting biological data from published literature, linking it to existing data and adding it to a database enables researchers to access the integrated data and use it to advance scientific knowledge. The manual biocuration of genes, proteins and pathways in one or more species often leads to the development of algorithms and software tools that have wider applications and contribute to automated curation processes.
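Returning to the UniProt example above, such a retrieval can be a single REST query. The sketch below assumes UniProt's current REST search endpoint, query syntax and field names (organism_id, go_id, tsv output), which should be verified against the UniProt API documentation; pagination of large result sets via the HTTP Link header is omitted for brevity.

```python
"""Minimal sketch: bulk retrieval of GO annotations for one species
from the UniProt REST search endpoint."""
import csv
import io
import requests

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

def fetch_go_annotations(taxon_id: str, page_size: int = 500):
    """Yield rows of (accession, gene names, GO IDs) for one NCBI taxon."""
    params = {
        "query": f"organism_id:{taxon_id} AND reviewed:true",
        "fields": "accession,gene_names,go_id",
        "format": "tsv",
        "size": page_size,   # only the first page is fetched in this sketch
    }
    resp = requests.get(UNIPROT_SEARCH, params=params, timeout=60)
    resp.raise_for_status()
    reader = csv.DictReader(io.StringIO(resp.text), delimiter="\t")
    for row in reader:
        yield row

if __name__ == "__main__":
    # 3702 = Arabidopsis thaliana; print the first few records.
    for i, row in enumerate(fetch_go_annotations("3702")):
        print(row)
        if i >= 4:
            break
```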

For example, The Arabidopsis Information Resource (TAIR) has been manually adding GO annotations to thousands of Arabidopsis genes from the literature since 1999. This manual GO annotation is now the gold-standard reference set for all other plant GO annotations and is used for inferring gene function of related sequences in all other plant species. Another example is the manually curated metabolic pathways in EcoCyc, MetaCyc and PlantCyc, which have been used to predict genome-scale metabolic networks for several species based on gene sequence similarity. The recently developed Plant Reactome database has further streamlined the process of orthology-based projection of plant pathways by creating simultaneous projections for 74 species. These projections are routinely updated along with the curated pathways from the Reactome reference species Oryza sativa. Without manual biocuration of experimental data from Arabidopsis, rice and other model organisms, the plant community would not have the powerful gene function prediction workflows we have today, nor would the development of the wide array of existing genomic resources and automated protocols have been possible. Biocurators continue to provide feedback to improve automated prediction pipelines and help to streamline data sets for their communities and/or add value to the primary data.

Current efforts in machine learning and automated text mining to extract data or to rank journal articles for curation work to some extent, but so far these approaches cannot synthesize a clear narrative and thus cannot yet replace biocurators. The manual curation of literature, genes, proteins, pathways etc. by expert biologists remains the gold standard used for developing and testing text-mining tools and other automated workflows. We expect that although text-mining tools will help biocurators achieve higher efficiency, biocurators will remain indispensable to ensure the accuracy and relevance of biological data. Well-curated GGB databases play an important role in the data lifecycle by facilitating dissemination and reuse. GGB databases can increase researchers’ efficiency, increase the return on research funding investment by maximizing reuse and provide use metrics for those who desire to quantify research impact. We anticipate that the demand for biocurators will increase as the tsunami of ‘big data’ continues. Despite the fact that the actual cost of data curation is estimated to be less than 0.1% of the cost of the research that generated the primary data, data curation remains underfunded.

Databases are focused on serving the varied needs of their stakeholders. Because of this, different GGB databases may curate different data types, or curate similar data types to varying depths, and are likely to be duplicating efforts to streamline curation. In addition, limited resources for most GGB databases often prevent timely curation of the rapidly growing data in publications.

The size and complexity of biological data resulting from recent technological advances require the data to be stored in computable or standardized form for efficient integration and retrieval. Use of ontologies to annotate data is important for integrating disparate data sets. Ontologies are structured, controlled vocabularies that represent specific knowledge domains; a minimal illustration of how such a structure supports data integration is sketched below.
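The following self-contained sketch uses a tiny, hand-coded and simplified fragment of a GO-like term hierarchy (a real workflow would load the terms and relationships from the ontology's OBO or OWL release). Annotating records from two sources with terms from the same hierarchy lets a database roll both up to a shared parent term and retrieve them together.

```python
# Minimal sketch: using an ontology's is_a hierarchy to integrate
# annotations from two data sets. The three terms below are a tiny,
# simplified fragment used only for illustration.
IS_A = {
    "GO:0009908": "GO:0048856",  # flower development -> anatomical structure development
    "GO:0048364": "GO:0048856",  # root development   -> anatomical structure development
    "GO:0048856": None,          # local root of this fragment
}

def ancestors(term: str) -> set:
    """Return the term plus all of its is_a ancestors in this fragment."""
    result = set()
    while term is not None:
        result.add(term)
        term = IS_A.get(term)
    return result

# Annotations contributed by two hypothetical data sets.
dataset_a = {"geneA1": "GO:0009908"}   # curated from publications
dataset_b = {"geneB7": "GO:0048364"}   # imported from another database

def genes_annotated_to(term: str, *datasets) -> list:
    """Find genes annotated to the term or to any of its descendants."""
    hits = []
    for data in datasets:
        for gene, anno in data.items():
            if term in ancestors(anno):
                hits.append(gene)
    return hits

# Both genes are retrieved together under the shared parent term.
print(genes_annotated_to("GO:0048856", dataset_a, dataset_b))
# -> ['geneA1', 'geneB7']
```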