The data rescue intern:
Data collection for the long-term ecological database of the Turkey Lakes Watershed in Ontario began in 1979, with the aim of evaluating anthropogenic and disturbance impacts on the watershed by monitoring its air, water, land, and biological communities. The program was initiated by Fisheries and Oceans Canada (DFO), Natural Resources Canada (NRCAN), and Environment and Climate Change Canada (ECCC) with the initial mission of determining the impacts of acid rain in the watershed, but many other questions have been considered since. The main goal of the Data Rescue Internship was to prepare the water chemistry and temperature, chlorophyll, phytoplankton, fish, benthos, and zooplankton data collected by the DFO in the Turkey Lakes from 1979 to 2008 for submission to a public repository, the Government of Canada’s Open Data Portal. Data files existed in a variety of formats, including paper, Excel, and a Microsoft Access database. Over the course of the internship, the project was expanded to include additional data collected by the DFO in the Magpie and Batchawana Rivers and various headwater lakes in Ontario.
Because of the size of the database, the Fall 2021 internship worked only with chlorophyll a data. The original data existed in multiple formats and were often repeated across multiple files. Files needed to be closely inspected to find all pertinent data and then reformatted to be compatible with programs like R without losing information (such as analysis notes). Once loaded into R, the data were cleaned and standardized so that they could be joined into a single data file covering all years and sampling locations. Because these data were collected over many years by many people, the same sampling location often had multiple codes, and the files from each year differed slightly. Reproducible methods were used to reassign sample location names and numbering by using relational tables and ‘soft coding.’ Data cleaning revealed some remaining issues in the data, such as abnormally high values over a two-year period in the 1980s. This discovery led to an investigation into the causes of these values and inspired the inclusion of a transparent method to communicate data quality: a binary-coded column was added to recommend data for use in, or exclusion from, future analyses. Another goal of the internship was to improve the design and functionality of the relational database. This included simplifying the files (entities) in the relational database, converting data and files to ‘tidy’ format, eliminating redundant information and files, and simplifying the relationships between files. These changes will make the database more intuitive to use and less prone to error, and will facilitate ongoing data entry and extraction for researchers from the DFO.
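The lookup-table recoding and the binary quality flag described above can be illustrated with a minimal sketch. The internship work was done in R; this sketch uses Python's standard library only to show the idea, and it is not the project's actual code. The site codes, standard site names, field names, and the two flagged years are all hypothetical placeholders, since the summary does not give the real codes or say which two years in the 1980s were affected.

```python
# Hypothetical lookup (relational) table: each legacy code maps to one
# standard site name. The real codes and names are not given in the summary.
SITE_LOOKUP = {
    "L1": "Lake 1",
    "LAKE1": "Lake 1",
    "TL-1": "Lake 1",
    "L2": "Lake 2",
}

# Placeholder years standing in for the two-year period of abnormally
# high values; the actual years are an assumption made for illustration.
SUSPECT_YEARS = {1985, 1986}


def standardize(records, site_lookup, suspect_years):
    """Recode site names via the lookup table and add a binary use flag.

    The mapping lives in data (the lookup table), not in the logic, so a
    new legacy code is handled by adding a row rather than editing code.
    """
    cleaned = []
    for rec in records:
        raw_code = rec["site_code"].strip().upper()
        if raw_code not in site_lookup:
            # Fail loudly so unknown codes are reviewed, not silently dropped.
            raise KeyError(f"Unmapped site code: {rec['site_code']!r}")
        cleaned.append(
            {
                "site": site_lookup[raw_code],
                "year": rec["year"],
                "chl_a": rec["chl_a"],
                # 1 = recommended for use, 0 = recommended for exclusion.
                "use_flag": 0 if rec["year"] in suspect_years else 1,
            }
        )
    return cleaned


# Made-up example rows, as if read from two differently coded yearly files.
example_rows = [
    {"site_code": " l1 ", "year": 1984, "chl_a": 2.1},
    {"site_code": "TL-1", "year": 1985, "chl_a": 40.0},
]
example_cleaned = standardize(example_rows, SITE_LOOKUP, SUSPECT_YEARS)
```

Keeping the flag as a separate column, rather than deleting the questionable rows, leaves the original measurements in the file and lets each analyst decide whether to follow the recommendation.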
The work of this internship was supported by a large team from the DFO and benefited from the input of experts on the various DFO projects, the history of the Turkey Lakes Watershed, and relational database design and data management. Although much of the data remains to be cleaned, the structure of the database was greatly simplified and improved, which should facilitate the remaining data cleaning and future use of the database. Data cleaning and standardization were carefully documented to allow for smooth transitions between the future data management experts who will work to make these data available for public use.