The data rescue intern: Sophia Fan
Forest ecosystems are crucial for many of the natural processes humans and wildlife need to survive. For example, they affect water flow and water quality in watersheds, prevent soil erosion, provide habitat for wildlife, and are key for mitigating climate change through carbon sequestration. However, forests often undergo many forms of disturbance. Some are beneficial or even needed for proper forest regeneration (e.g., fire regimes), while others may have negative consequences (e.g., overharvesting, insect pests). Because forest ecosystems take a long time to grow and regenerate, the effects of a disturbance may take decades to become detectable. Understanding and predicting forest resilience to these disturbances, in addition to properly managing post-disturbance forests, is therefore crucial to mitigating the loss of forests and their services.
Within the Canadian Forest Service, several independent researchers have conducted studies to understand post-disturbance forest recovery in different parts of eastern Canada, each with different objectives and data collection methods. The challenge here is to combine these disparate data sources into a common database so that analysis of forest recovery can take on a much larger scale across geographical locations. Hence, this project is a part of a broader effort to collect, connect, and synthesize post-disturbance forest recovery research across Canada.
In partnership with the Canadian Forest Service’s Laurentian Forestry Centre, Dr. Kaysandra Waldron, and Dr. Dominique Boucher, this data rescue internship aimed to clean, standardize, and combine data from three sources: the Canadian Boreal Forest Agreement (CBFA), the St. Germain et al. (2004) study, and the Martineau et al. (2019) study. The CBFA collected data from 100 plots in Newfoundland, Ontario, and Quebec for up to 40 years following fire and spruce budworm outbreaks, with the goal of assessing impacts of salvage logging following natural disturbance. The St. Germain et al. (2004) study evaluated beetle community richness and abundance following different periods of post-fire recovery in Grands-Jardins Provincial Park, whereas Martineau et al. (2019) examined the impact of cumulative disturbances (insect outbreaks combined with logging) on short-term responses of boreal forest soil properties in the Laurentian Mountains. Each study included several types of datasets covering tree, soil, and vegetation data. Due to time constraints, the final database contains only tree data.
The most time-consuming aspect of this project was understanding and cleaning old datasets. In particular, some tree data was recorded in individual Excel worksheets. This layout facilitated printing and data collection in the field; however, it made the data more difficult to work with than it would have been in a ‘tidy’ format (one variable per column). As a result, the data had to be restructured into whole new datasets. The most challenging part of the project, however, was understanding how each individual variable was collected and measured for each study, and how the variables could be standardized and combined into a single database. Each of these studies measured different variables, and often encoded them in different ways; for example, tree species abbreviations differed between studies. Additionally, some variables required standardization or, in some cases, omission. For example, a “burn percentage” variable had to be converted into a binned “burn class” variable to standardize across all studies, and decay classifications were too different between studies to reliably standardize and were thus omitted. The data owners also wanted some additional variables, which had to be either extracted from the raw data or coded, documented, and created from scratch. For example, disturbance type (fire, insect outbreak, etc.), starting year, and ending year information had to be manually transferred from metadata and original published papers. Another fun variable was “status,” where I had to play detective and find out whether each individual tree from each dataset was dead or alive!
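To make the standardization step concrete, here is a minimal sketch in Python of binning a burn percentage into a burn class and mapping study-specific species abbreviations onto a shared code. The bin boundaries, class labels, study names, and species codes below are illustrative assumptions, not the values actually used in the project.

```python
# Hypothetical bin edges: (inclusive upper bound in percent, class label).
# The real burn classes used in the database are not given in this post.
BURN_CLASS_UPPER_BOUNDS = [
    (0, "unburned"),
    (33, "low"),
    (66, "moderate"),
    (100, "high"),
]

# Hypothetical crosswalk from (study, original abbreviation) to one shared code.
SPECIES_CROSSWALK = {
    ("study_a", "BS"): "PIMA",   # black spruce
    ("study_b", "EPN"): "PIMA",  # black spruce, French abbreviation
    ("study_a", "BF"): "ABBA",   # balsam fir
    ("study_b", "SAB"): "ABBA",  # balsam fir, French abbreviation
}


def burn_class(burn_percentage):
    """Return the label of the first bin whose upper bound covers the value."""
    if burn_percentage is None:
        return None  # keep missing values missing
    if not 0 <= burn_percentage <= 100:
        raise ValueError(f"burn percentage out of range: {burn_percentage}")
    for upper_bound, label in BURN_CLASS_UPPER_BOUNDS:
        if burn_percentage <= upper_bound:
            return label


def standard_species(study, code):
    """Look up the shared species code; return None if the pair is unknown."""
    return SPECIES_CROSSWALK.get((study, code.strip().upper()))
```

Returning None for an unknown abbreviation, rather than guessing, leaves unmatched codes visible so they can be checked against the original metadata by hand.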
As such, determining common variables between the studies, weighing them by relative importance, creating new variables, and ultimately combining all data together was quite challenging. For example, not all studies measured the height of individual trees, but it was important to have this data in the database, so studies that did not measure height received NAs in that column. On top of all this, much of the data and metadata was in French. As someone whose last interaction with any sort of French curriculum was in high school, I relied on a lot of Google-translating and (frequently) soliciting definitions from project leads. Their patience truly knows no bounds.
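The idea of combining studies that measured different variables, with missing measurements recorded as NA, can be sketched as follows. This is only an illustration in plain Python: the column names and the sample records are hypothetical, and None stands in for NA.

```python
# Hypothetical set of shared columns for the combined tree database.
STANDARD_COLUMNS = ["study", "tree_id", "species", "height_m", "status"]


def to_standard_row(study, record):
    """Map one study record onto the shared columns; absent values become None."""
    row = {column: record.get(column) for column in STANDARD_COLUMNS}
    row["study"] = study  # always record which study the tree came from
    return row


def combine(studies):
    """Stack the records of every study into one list of standardized rows."""
    combined = []
    for study, records in studies.items():
        for record in records:
            combined.append(to_standard_row(study, record))
    return combined


# Made-up example: study_b did not measure tree height.
example_studies = {
    "study_a": [{"tree_id": 1, "species": "PIMA", "height_m": 12.5, "status": "alive"}],
    "study_b": [{"tree_id": 7, "species": "ABBA", "status": "dead"}],
}
example_rows = combine(example_studies)
```

Every row ends up with the same columns in the same order, so a study that never measured height simply carries None in that column instead of being dropped.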
The final database now contains 13,415 entries combined from all three studies, spanning from Newfoundland to Ontario with forest disturbances dating back to 1955. A total of 14 datasets from the original pool were included in the final database file. In addition, a master metadata file describes the database, with links to existing metadata from each study, original papers and sampling protocols, as well as both the original and cleaned datasets used in database construction. Taken together, this database will be used to analyze forest disturbances, such as fire, salvage logging, and insect outbreaks, and ultimately inform post-disturbance management actions.