Colombia's cyberinfrastructure for biodiversity: Building data infrastructure in emerging countries to foster socioeconomic growth

Citation for published version (APA): de Vega, J., Davey, R. P., Duitama, J., Escobar, D., Cristancho-Ardila, M. A., Etherington, G. J., Minotto, A., Arenas-Suarez, N. E., Pineda-Cardenas, J. D., Correa-Alvarez, J., Camargo Rodriguez, A. V., Haerty, W., Mallarino-Robayo, J. P., Barreto-Hernandez, E., Muñoz-Torres, M., Fernandez Fuentes, N., & Di Palma, F. (2019). Colombia's cyberinfrastructure for biodiversity: Building data infrastructure in emerging countries to foster socioeconomic growth. Plants, People, Planet. https://doi.org/10.1002/ppp3.10086


| DATA-DRIVEN INNOVATION AND DEVELOPMENT
Science and innovation are not a luxury but a prerequisite for social and economic development (Annan, 2003). Across different fields, the acquisition and analysis of large amounts of data have become common practice to drive innovation (Yang, Huang, Li, Liu, & Hu, 2017), particularly with today's highly instrumented data collection methods (Borgman, Wallis, & Mayernik, 2012). The efficient analysis of such data has unprecedented potential to transform how we tackle the major challenges faced by humanity, from climate change to food security (Hilbert, 2016).
Data-driven innovation can only be achieved through greater access to data, through effective and efficient enabling resources, and by ensuring that the best available expertise is harnessed through them. This is particularly the case when collaboration is needed to address research questions at a continental scale, such as the effect of global impacts on rich, vast ecological systems under the present climate change scenario (Peters, Loescher, SanClements, & Havstad, 2014).
One way of ensuring these conditions is to cultivate and foster a research data infrastructure or cyberinfrastructure (Florio & Sirtori, 2016), which aims to meet the needs of the research community for democratic access to digital resources and collaborative environments around common practices (Atkins, 2003). A cyberinfrastructure includes high-performance computing (HPC) and large shared data storage, a platform or stack of services that provides methods for leveraging those physical resources, and a community of people and institutes that manage these resources in a sustainable, secure, collaborative, and interoperable way (Goff et al., 2011).

| COLOMBIA'S BIODIVERSITY FOSTERING SOCIOECONOMIC GROWTH
Colombia's topography and location near the equator make it a highly biodiverse country (Rangel-Ch, 2015). The country is one of the 17 "megadiverse countries" in the world according to the United Nations Environment Programme (UNEP). Colombia has suffered a costly internal conflict for five decades, which was only recently alleviated through a peace agreement in late 2016 (Baptiste et al., 2016). Lack of stability and limited opportunities in at least half of the country, particularly remote rural regions, have resulted in evident negative socioeconomic and ecological impacts (Baumann & Kuemmerle, 2016).
The "Colombia BIO" programme led by the Colombian Research Council (Colciencias) is seeking to make sustainable use of this natural capital to drive the growth of the Colombian bioeconomy, social equality, and a long-lasting peace (Sierra et al., 2017). In the "Colombia BIO" expeditions, large amounts of data about Colombia's ecosystems are being collected, including novel biodiversity in regions that were previously unexplored due to the internal conflict (Gonzalez, Arenas, Tovar, Pulido, & Tenorio, 2017). As of 2019, Colombia is one of the 11 country funders of the "Earth Biogenome Project" (EBP; earthbiogenome.org). The EBP "can be viewed as infrastructure for the new biology" that aims to sequence, catalogue, and characterize the genomes of all known eukaryotes to inform ecosystem preservation under the growing impacts from climate change and overexploitation (Lewin et al., 2018). The EBP consortium in Colombia is led by the University of Los Andes and "BRIDGE Colombia" (Prof F. Di Palma, personal communication).
The capacity to share and analyze this information needs to keep pace with the wealth of information gleaned from these new and upcoming explorations (Canhos et al., 2015). To date, the national catalogue of Colombian biodiversity (SiB Colombia) (Abud et al., 2017) includes 7,848 endemic species and around 10% of all known species. Researchers and policymakers need to be provided with comprehensive evidence to support decisions on biodiversity management and protection.

| COLOMBIA'S ASSESSMENT BY "C3BIODIVERSIDAD": INTRODUCING A REFERENCE FRAMEWORK FOR EMERGING COUNTRIES
• Facilitate skills sharing.
Open-science framework categories*: Open research infrastructures; Enabling e-infrastructures.
• Survey resource needs and availability.
• Facilitate physical connectivity among institutions.
• Formalize an advisory community about physical computational resources.
• Implement a recognition scheme for resource providers in evaluations and panels.

Grow training in scientific data analysis for users and providers
A generation skilled in scientific data analysis.
Promote a coordinated accessible programme of training in scientific data analysis.
• Coordinate advanced tailored training.
Open-science framework category*: "Cross-cutting issue"
• Develop a national online network of trainers and communities in data analysis.
• Coordinate with global and regional training networks such as GOBLET.
• Support hosting of international trainers.
• Support staff exchanges from smaller to larger institutions within the country.

Develop and enforce a national policy for research data
Data-supported decision-making.
Develop and enforce a national policy for research data.
• Incentivize excellence in research.
• Facilitate access to research data.
Open-science framework categories*: Open research data; Open licenses and IPR.
• Involve stakeholders in new policy design.
• Require open access to taxpayer-funded research.
• Promote an institution to coordinate repositories and databases within the country.
• Reward researchers involved in data partnerships.

Engage diverse stakeholders in research projects and funding planning
A society highly involved and interested in science and technology.
Develop transversal research schemes that reward stakeholder engagement.
• Support private-public partnerships.
• Support third-sector involvement.
• Implement funding calls for public-private projects.
• Extend the role of research-support offices.
• Catalogue and disseminate networking opportunities.

*Based on the conceptual open-science (OS) framework defined by the Colombian Research Council Colciencias (Colciencias, 2018).

| IMPROVING THE PROVISIONING AND AVAILABILITY OF PHYSICAL DATA INFRASTRUCTURE
Biodiversity cyberinfrastructures increase data access and reusability, and also support education and effective public policies. To balance the potential costs against the scientific benefits, the research community often self-organizes to identify the broad-scale questions that require large data-driven analyses and can only be addressed with expensive infrastructure, which is then funded by research councils, usually on the condition that it be shared as a community resource.
The main challenge Colombia and other middle- and upper-middle-income countries currently face is their limited access to computational capacity and physical connectivity between research institutions.
It is a priority to deploy high-performance computational platforms in the institutions of a country as a requirement to accelerate research and skills training. We believe the best option is to progressively integrate existing resources, in increasing orders of complexity, under a fair-sharing policy that prioritises the host institution, while promoting the sharing of new computational and data storage capacities through capital investments and incentives. Infrastructures require substantial financial investments in the hardware itself, physical space, environment control, management, and maintenance. For example, the CyVerse cyberinfrastructure leverages considerable investment from the USA's National Science Foundation (NSF) (Goff et al., 2011).
Distributed infrastructures are composed of multiple independent and distributed resources that act as one, often with resources provided by different institutions (Towns et al., 2014), so the initial costs and complexity are distributed. These are usually rolled out in stages of increasing decentralization (Chaterji et al., 2017).
A federation of heterogeneous computing resources of the kind proposed needs to address two managerial requirements in order to be successful. Firstly, because most institutions want to retain the right to define their own policies on data management and execution priorities, the system must guarantee that users can access each resource at the right level of privilege. As a result, a distributed system typically "authenticates, authorises, and accounts" for the user on each individual system through a centralized server (an AAA system). Secondly, when computational resources and data are dispersed in storage locations among participating organizations, end users should be relieved of the complexities associated with negotiating access rights with individual organizations, moving data back and forth, or porting programs to process the data (Langmead & Nellore, 2018). Technical software solutions, for example data management middleware such as the open-source iRODS software (Rajasekar et al., 2010), workflow software, and virtual machines (Boettiger, 2015; Köster & Rahmann, 2012), provide tested options for data federation, data replication, quota management, and access control.
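The first requirement can be illustrated with a minimal Python sketch of a central service that authenticates, authorises, and accounts for each request. Every name in it (the `CentralAAA` class, the site names, the privilege levels, and the token scheme) is hypothetical and chosen only for illustration; a real federation would rely on established middleware such as iRODS rather than on code like this.

```python
from dataclasses import dataclass, field

# Hypothetical privilege levels, ordered from least to most privileged.
LEVELS = {"none": 0, "read": 1, "submit": 2, "admin": 3}


@dataclass
class CentralAAA:
    """Toy central server that authenticates, authorises, and accounts."""
    # user -> secret token (a stand-in for real credentials)
    tokens: dict = field(default_factory=dict)
    # (user, site) -> privilege level granted under that site's own policy
    grants: dict = field(default_factory=dict)
    # accounting log of (user, site, requested level, allowed)
    log: list = field(default_factory=list)

    def register(self, user, token):
        self.tokens[user] = token

    def grant(self, user, site, level):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.grants[(user, site)] = level

    def authenticate(self, user, token):
        # Unknown users are rejected before any token comparison.
        return user in self.tokens and self.tokens[user] == token

    def authorise(self, user, site, needed):
        # Each site keeps its own grant; no grant means no access.
        held = self.grants.get((user, site), "none")
        return LEVELS[held] >= LEVELS[needed]

    def request(self, user, token, site, needed):
        allowed = (self.authenticate(user, token)
                   and self.authorise(user, site, needed))
        # Accounting: every request is recorded, allowed or not.
        self.log.append((user, site, needed, allowed))
        return allowed


aaa = CentralAAA()
aaa.register("ana", "s3cret")
aaa.grant("ana", "site_a", "submit")  # site_a lets ana submit jobs
aaa.grant("ana", "site_b", "read")    # site_b only lets ana read data

print(aaa.request("ana", "s3cret", "site_a", "submit"))  # True
print(aaa.request("ana", "s3cret", "site_b", "submit"))  # False
print(aaa.request("ana", "wrong", "site_a", "read"))     # False
```

In this sketch each host institution keeps control of its own policy through the per-site grants, while the user deals with a single point of entry, which is the balance the paragraph above describes.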
A successful precedent of a distributed high-performance computational platform is the Iberian-American Network for High-Performance Computing (RICAP, 2017-2020). RICAP's resources are distributed across 11 sites in various Latin-American countries, which are connected through RedClara, the network of Latin America's academic networks (Cazar, 2018). The existence in many emerging countries of state-sponsored high-speed academic-network providers (Red Nacional Académica de Tecnología Avanzada, RENATA, in the case of Colombia) is key to facilitating the necessary physical connectivity between institutions. However, our SWOT analysis highlighted that Colombian research institutions actively use the connectivity services from private providers too (Table S1).

| IMPLEMENTING A NATIONAL FAIR POLICY FOR RESEARCH DATA MANAGEMENT
The absence of a comprehensive policy that regulates and enforces access to research data restrains research. It is a priority to develop and implement a national policy for research data that regulates the access, processing and sharing of data in a standardized way. This would facilitate data-supported decision-making, as well as scientific excellence and innovation. In the case of biodiversity, the national implementations on "Access and Benefit Sharing" of genetic resources, designed to give greater control over the natural capital, have also generated regulatory regimes fraught with unintended consequences; this is not exclusive to Colombia (Prathapan et al., 2018; Wight, 2019). In Colombia and other emerging countries, there are well-developed policies that regulate other data types, such as Government (e-gov) and personal data, that can serve as examples to develop research data policy (Sanabria, Pliscoff, & Gomes, 2014). We recommend requiring open access to taxpayer-funded research, including both generated data and research publications, as recommended by the Organisation for Economic Co-operation and Development (OECD) (Arzberger et al., 2004). When funding is a limiting factor, policy needs to maximize return on investment in data generation.
FIGURE 1 A reference framework consisting of four priorities to facilitate the socioeconomic growth in emerging countries through innovation by developing a research cyberinfrastructure
Good data management is not a goal in itself, but rather is the key conduit guaranteeing experimental reproducibility (Baker, 2016) and maximizing return on investment in data generation by facilitating its reuse by third parties. Four foundational principles, findability, accessibility, interoperability, and reusability (FAIR), usually guide good data management practices among producers and publishers (Wilkinson et al., 2016). In Colombia, Colciencias has recently published its vision to promote "open science" in the country based on the FAIR principles (Colciencias, 2018). Significant challenges to implementing data management arise from the size and complexities of modern scientific collaboration (Borgman et al., 2012). Still, when psychology researchers were asked to rank barriers to data sharing, technological barriers (such as "My dataset is too big" or "There is no suitable repository to share my data") were at the bottom of the list (Houtkoop et al., 2018). Similar results were obtained in other disciplines (Kaye, Bruce, & Fripp, 2017; Van den Eynden et al., 2016), and in the specific case of Colombian researchers (OCyT, 2017).
Data sharing can be incentivized by normative pressure, for example through a strong centralized information system or through requirements of funding agencies and journals to release research data at the time of publication or end of funding (Wolkovich, Regetz, & O'Connor, 2012). In large projects, funding agencies and international directorates will need to work together in joint initiatives to overcome cultural barriers and geopolitical constraints among countries (Vargas et al., 2012). However, regardless of journal or funder requirements, data are routinely shared in some scientific fields as a result of a cultural shift, scholarly altruism, and peer approval (Kim & Stanton, 2012; OCyT, 2017). Also, data sharing can be promoted by recognizing those who analyze it as creative collaborators in need of career paths (Chang, 2015). Highlighting and disseminating specific research communities and projects that follow standards, curation and preservation approaches can serve as showcases (Canhos et al., 2015; Sanabria et al., 2014). For example, SiB Colombia was re-
ticians (Tan, Lim, Khan, & Ranganathan, 2009; Welch et al., 2014).
The Global Organisation for Bioinformatics Learning, Education and Training (GOBLET) surveys provide "perspectives on the current status of training gaps" and evidence that "the need for bioinformatics training is both real and urgent, and requires worldwide solutions" (Attwood et al., 2015).
Running effective courses and workshops means having tailored teaching materials and instructors trained in how to teach students who may come from different backgrounds and have different goals.
Not surprisingly, the completion rate for self-paced Massive Open Online Courses (MOOCs) is less than 10% (Jordan, 2014). However, trainers are available in Colombia and equivalent countries. For example, there is an academic network in Colombia focused on bioinformatics, as well as a biannual national bioinformatics conference, which is often organized in collaboration with other scientific societies. Another key strength is the availability of graduate system administrators and developers; formal training is available through at least four M.Sc. programmes in bioinformatics, data science, or computational biology, as well as several in computational sciences.
On the one hand, Train-the-Trainers (TTT) workshops, where future instructors are equipped with practical skills to teach effectively, are a cost-effective way to prepare instructors (Pfund et al., 2015; Via et al., 2017). On the other hand, the "keep training local but act to deliver and develop training materials globally" motto highlights how a community might break down the effort of producing training materials in a modular way (Williams & Teal, 2017). This decentralized approach allows training to become more accessible to more people while "responding at scale to rapidly evolving science" (Teal et al., 2015). For example, Software Carpentry and Data Carpentry lessons are developed collaboratively on GitHub by volunteers.

| SECURING THE ENGAGEMENT OF DIVERSE STAKEHOLDERS IN PLANNING
We believe it is a priority to secure the engagement of a diverse range of stakeholders in research planning, and particularly in cyberinfrastructure planning and execution. Researchers are the driving force in the innovation process, and they will only engage with the cyberinfrastructure if they perceive it as a way to ease data management and analysis. There is consequently a need to survey a priori the needs of the community (Cutcher-Gershenfeld et al  S2, Colombian researchers' priorities were to "develop strategies and tools" (91%), "promote skills exchange" (83%), "design incentives" (80%), and "support best practices" (78%); and also found "data availability" (72%), "digital technology and capacities" (62%), and "new ways for dissemination (59%) and collaboration (55%)" as courses of action (Table S2).
The workshop results also proposed promoting private-public partnerships and extending the involvement of the third sector (non-profit associations, charities, cooperatives, etc.) in research.
While researchers are the driving force in the innovation process, the environment where each researcher works (industry, academia, nonprofit, general public, or government) frames how researchers can conduct that research. Our analysis in Colombia highlighted that there is a limited number of initiatives to engage stakeholders in research and a variable interest in research from different sectors. Partnerships between industry, third sector, government and academia appear to be more established in the agricultural and environmental sectors, for example. We identified the following three positive recent initiatives in Colombia: 1. specific public research funding opportunities involving industry; 2. a new research funding system from the regions to promote regional redistribution; and 3. increasing international investment after Colombia's accession to the OECD and the peace agreement process.
Finally, secondary stakeholders (citizens, educators, librarians, policymakers, funding officers, editors, professional societies, etc.) have their particular interests and priorities, and consequently a say in planning.When asked about the impact of open science on society, researchers in Colombia highlighted the mutual benefits of improving the social awareness, reproducibility and general efficiency of science (OCyT, 2017).
• Require engagement plan annexed to research projects.
The technological gap in computational capacity and physical connectivity between research institutions is mostly the result of limited funding, the high cost of foundational infrastructure, inconsistent interest from multinational vendors, and short-term strategic planning. As a result, key academic and industrial institutions prioritise limiting uncertainty, unforeseen overheads, and imported commodities. Still, major universities and centers in Colombia and other emerging countries have access to HPC infrastructure (Cazar, 2018). However, these infrastructures are primarily, and usually exclusively, implemented to meet the internal needs of the host institution.
figure of "Data Champions" (volunteers who advise researchers in their institutions on good research data management and promote FAIR guidelines) and promoting a model where institutional repositories would coexist with a centralized national data management repository.