By Kelsey Butler
My name is Kelsey Butler, and in March of this year I joined the Colubri Lab at UMass Chan Medical School. Our lab uses computational approaches to create new tools for biomedical research and STEM education. Because the lab was getting started right in the middle of the COVID pandemic, our initial meetings comprised just the two of us in a nearly empty research building.
One of my projects is a continuation of a longstanding collaboration between my PI, Dr. Andrés Colubri, the Sabeti Lab at the Broad Institute of MIT and Harvard, and the Irrua Specialist Teaching Hospital (ISTH) in Edo State, Nigeria. Back in 2016, Andrés, together with his Broad and ISTH colleagues, created a CommCare mobile app that allows healthcare workers at ISTH to collect detailed clinical data on Lassa fever patients. Lassa fever is a viral hemorrhagic fever endemic to West Africa, which was first identified as an emerging disease only in 1969. For those interested, I would recommend checking out the fascinating (and scary) story of the nurses and doctors who risked their lives to save patients sick with this mysterious and deadly disease.
Prior to the introduction of this app, clinical records were stored on paper and had to be laboriously transcribed to digital spreadsheets for data analysis. By moving record keeping into the app, we compiled the largest clinical Lassa fever dataset to date, consisting of nearly 1,000 patients treated at ISTH during the past four years.
This dataset is an important source of information that can help us learn more about this neglected emerging disease, and hopefully construct accurate computational models for disease severity and prognosis that might inform patient care in the future.
We started by analyzing demographic information, clinical symptoms, vital signs, laboratory results, and treatment interventions for patients in the dataset. Of 841 patients enrolled in the study, 714 patients with known outcome (died or survived) were included in the analysis. The median age was 33 years and 42.9% (294/684) were female. The overall case-fatality rate was 17.8% (127/714), showing that even with dedicated professional care (ISTH has one of the first specialized Lassa fever wards in Nigeria), Lassa fever is a highly deadly disease. Our analysis found that bleeding and severe central nervous system (CNS) symptoms were more prevalent in patients who died, confirming results from earlier studies. Both are significant predictors of mortality, with odds ratios of 21.95 (95% CI: 10.5–45.8) for severe CNS symptoms and 10.28 (95% CI: 5.07–20.86) for bleeding. Bleeding represents an aggregate variable and can range in severity, although patients who present with this symptom are generally at a more advanced stage of disease. The presentation of Lassa fever varies widely, from mild to very severe illness. Figuring out what accounts for the difference between mild and serious manifestations of the disease is one of the big challenges in Lassa fever research!
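To make the arithmetic behind these figures concrete, here is a minimal Python sketch. The case-fatality rate uses the counts quoted above. The 2x2 counts for the odds ratio are hypothetical, and the Woolf (log-scale) interval is the textbook unadjusted method, which is not necessarily how the odds ratios reported here were estimated.

```python
import math


def case_fatality_rate(deaths, known_outcomes):
    """Proportion of patients with a known outcome who died."""
    return deaths / known_outcomes


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-scale) confidence interval.

    a: symptom present, died      b: symptom present, survived
    c: symptom absent, died       d: symptom absent, survived
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from the text: 127 deaths among 714 patients with known outcome.
cfr = case_fatality_rate(127, 714)
print(f"Case-fatality rate: {cfr:.1%}")

# HYPOTHETICAL 2x2 counts, for illustration only (not from the dataset).
or_, lo, hi = odds_ratio_ci(a=20, b=10, c=30, d=90)
print(f"Odds ratio: {or_:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

An interval that excludes 1 is what makes a symptom a statistically significant predictor in this unadjusted setting.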
After completing a descriptive analysis of the dataset, we analyzed patterns of missing data to inform variable selection. It is common to have high missingness in clinical datasets, and this is especially true for data collected in clinical settings with limited resources. Our goal was to create clinically useful models that take into consideration this limited availability of data. Consider the patterns of missing values of aspartate aminotransferase (AST) and potassium, markers of organ failure and dehydration, respectively. These variables are rarely reported together, but both are important predictors of mortality. We found that over 90% of the patients do not have AST lab results within the first two days after admission, while 80% do not have potassium levels available, and only a handful of patients have both.
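A missingness check of this kind can be sketched in a few lines of plain Python. The records below are hypothetical toy data (not patients from the dataset), and the column names `ast` and `potassium` are placeholders chosen for illustration. The idea is to look at both the per-variable missing fraction and how often each combination of present/missing values occurs.

```python
from collections import Counter

# HYPOTHETICAL toy records; None means the value was not recorded.
records = [
    {"age": 34, "ast": None, "potassium": 4.1},
    {"age": 51, "ast": 210.0, "potassium": None},
    {"age": 27, "ast": None, "potassium": None},
    {"age": 45, "ast": None, "potassium": 5.6},
    {"age": 60, "ast": 88.0, "potassium": 3.9},
    {"age": 19, "ast": None, "potassium": None},
]


def missing_fraction(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def missingness_patterns(rows, columns):
    """Count each combination of present (True) / missing (False) values."""
    return Counter(
        tuple(row[col] is not None for col in columns) for row in rows
    )


for col in ("ast", "potassium"):
    print(f"{col}: {missing_fraction(records, col):.0%} missing")

patterns = missingness_patterns(records, ("ast", "potassium"))
for (has_ast, has_k), count in sorted(patterns.items()):
    print(f"AST present={has_ast}, potassium present={has_k}: {count} rows")
```

When the "both present" pattern is rare, as it is for AST and potassium here, a single model that requires both variables would have very few complete cases to learn from.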
We reasoned that these lab tests are ordered in distinct clinical scenarios and trained a separate model for each one. Model 1 is suitable for patients who have laboratory tests ordered for suspected organ failure, while Model 2 is suitable for patients with suspected dehydration. Model 1 uses age, severe CNS symptoms, bleeding, creatinine, and AST as predictors. It has an AUC of 0.95 (95% CI 0.86–1), a sensitivity of 90% (63–100%) and specificity of 87% (52–100%). Model 2 uses age, severe CNS symptoms, bleeding, creatinine, and potassium as predictors. It has an AUC of 0.86 (95% CI 0.73–0.99), a sensitivity of 96% (67–100%) and specificity of 65% (6–91%).
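For readers less familiar with AUC: it is the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen survivor, so it does not depend on any particular probability threshold. The sketch below computes it directly from that definition. The labels and scores are hypothetical toy values, not output from either model.

```python
def auc(labels, scores):
    """AUC as the share of (died, survived) pairs ranked correctly.

    labels: 1 = died, 0 = survived. Tied scores count as half a win.
    """
    died = [s for y, s in zip(labels, scores) if y == 1]
    survived = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if d > s else 0.5 if d == s else 0.0
        for d in died
        for s in survived
    )
    return wins / (len(died) * len(survived))


# HYPOTHETICAL outcomes and predicted mortality risks, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.25, 0.40, 0.20, 0.10, 0.35, 0.05]

model_auc = auc(labels, scores)
print(f"AUC: {model_auc:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two outcome groups.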
Note that in Model 2, specificity is 65% and its confidence interval is wide. This indicates that the model might produce false positives, which in this case would mean predicting mortality for patients who will ultimately survive. The probability threshold can be adjusted depending on the clinical situation, and here we set it to 30% to maximize the ability of the models to identify the most at-risk cases.
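The trade-off controlled by the threshold can be illustrated with a short sketch. The outcomes and predicted risks are hypothetical toy values. Only the 30% threshold comes from the text, and the 50% threshold is included purely for comparison.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Classify as high-risk when score >= threshold; 1 = died, 0 = survived."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# HYPOTHETICAL outcomes and predicted mortality risks, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.45, 0.35, 0.40, 0.20, 0.10, 0.32, 0.05]

results = {}
for threshold in (0.50, 0.30):
    sens, spec = sensitivity_specificity(labels, scores, threshold)
    results[threshold] = (sens, spec)
    print(f"threshold {threshold:.2f}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

In this toy example, lowering the threshold from 50% to 30% catches every patient who died, at the cost of flagging some survivors as high-risk, which is the same trade-off described above.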
One of the most challenging aspects of working with this dataset is dealing with missing data. The distributions of AST (92% missing) and potassium (83% missing) illustrate the problem. Potassium contains 145 values and has a right-skewed distribution. This variable could potentially be a candidate for imputation, but the high percentage of missing data makes it a gray area. AST, on the other hand, is not suitable for imputation because the 67 values do not have a predictable distribution. Despite the large amount of missing data, we must keep in mind that this is the largest dataset to date describing Lassa fever patients, and we want to extract as much information as possible to improve clinical outcomes and increase our understanding of disease presentation. The missing data also contributes to the large uncertainty in our calculations, as seen in the wide confidence intervals above.
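As a quick sanity check, the missing percentages follow from the number of recorded values. This sketch assumes the denominator is the 841 enrolled patients mentioned earlier. That is an assumption on my part, but it reproduces the quoted percentages.

```python
# ASSUMPTION: missingness is computed over all 841 enrolled patients.
ENROLLED = 841

# Number of recorded values quoted in the text.
observed = {"AST": 67, "potassium": 145}


def percent_missing(n_observed, n_total):
    """Percentage of patients without a recorded value."""
    return 100 * (1 - n_observed / n_total)


for name, n in observed.items():
    print(f"{name}: {n} recorded, {percent_missing(n, ENROLLED):.0f}% missing")
```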
Dealing with missing data is just one aspect of working with complex clinical datasets, and it is a skill that I will continue to develop as I progress through my degree. I am fortunate to be in a program and lab where I am surrounded by researchers who have built their careers on extracting meaningful information from large, unwieldy datasets and presenting it in succinct, impactful ways. If you’re interested in learning more about this project, I’ll be presenting the results at the Annual Meeting of the American Society of Tropical Medicine and Hygiene in a few weeks!
And as a final note, I’d like to mention that the lab has been growing since I joined; take a look at the team page to learn more about its current members. Since we are a 100% computational lab, we have adopted a hybrid model where we do the majority of our work remotely and meet on site weekly for in-person communication and discussion. We also get together for occasional field trips, like the one below to Fruitlands Museum :-)