Statistics and Probability, Statistics, Probability and Uncertainty
15
Scopus Publications
Aggregation in Ill-Conditioned Regression Models: A Comparison with Entropy-Based Methods Ana Helena Tavares, Ana Silva, Tiago Freitas, Maria Costa, Pedro Macedo, et al. Entropy, 2025 Despite the advances in data analysis methodologies in recent decades, most traditional regression methods cannot be directly applied to large-scale data. Although aggregation methods are especially designed to deal with large-scale data, their performance may be strongly reduced in ill-conditioned problems (due to collinearity issues). This work compares the performance of a recent approach based on normalized entropy, a concept from information theory and info-metrics, with bagging and magging, two well-established aggregation methods in the literature, providing valuable insights for applications in regression analysis with large-scale data. While the results reveal a similar performance between methods in terms of prediction accuracy, the approach based on normalized entropy largely outperforms the other methods in terms of precision accuracy, even considering a smaller number of groups and observations per group, which represents an important advantage in inference problems with large-scale data. This work also warns of the risk of using the OLS estimator, particularly under collinearity scenarios, given that data scientists frequently use linear models as a simplified view of reality in big data analysis, and the OLS estimator is routinely used in practice. Beyond the promising findings of the simulation study, our estimation and aggregation strategies show strong potential for real-world applications in fields such as econometrics, genomics, environmental sciences, and machine learning, where data challenges such as noise and ill-conditioning are persistent.
Predicting Red Blood Cell Transfusion in Elective Cardiac Surgery: A Machine Learning Approach Beatriz Lau, Daniel Ramos, Vera Afreixo, Luís M. Silva, Ana Helena Tavares, et al. Mathematical and Computational Applications, 2025 The benefits of Patient Blood Management can vary depending on a patient’s risk profile for requiring a blood transfusion. The objective of this study is to develop and analyse machine learning models that can identify patients at risk of requiring red blood cell transfusion. This retrospective cohort study was conducted at a tertiary northern Portuguese hospital between 2018 and 2023. Two machine learning algorithms, extreme gradient boosting and neural networks, were employed due to their efficiency in handling complex feature interactions. Shapley additive explanations values were analysed to assess the contribution of each feature to the predictions generated by the models. The neural network achieved an accuracy of 0.735 and an area under the receiver operating characteristic curve of 0.798 (95% CI 0.747 to 0.849). The extreme gradient boosting model achieved an accuracy of 0.700 and an area under the receiver operating characteristic curve of 0.762 (95% CI 0.707 to 0.817). An analysis of Shapley additive explanations values revealed that the most important variable was preoperative haemoglobin levels, which can be optimised through the Patient Blood Management approach. These machine learning models demonstrate the potential to improve the accuracy of transfusion prediction at hospital admission, despite the absence of key variables such as surgeon identity and anaemia diagnosis.
Stable Variable Selection Method with Shrinkage Regression Applied to the Selection of Genetic Variants Associated with Alzheimer’s Disease Vera Afreixo, Ana Helena Tavares, Vera Enes, Miguel Pinheiro, Leonor Rodrigues, et al. Applied Sciences Switzerland, 2024 In this work, we aimed to establish a stable and accurate procedure with which to perform feature selection in datasets with a much higher number of predictors than individuals, as in genome-wide association studies. Due to the instability of feature selection where many potential predictors are measured, a variable selection procedure is proposed that combines several replications of shrinkage regression models. A weighted formulation is used to define the final predictors. The procedure is applied for the investigation of single nucleotide polymorphism (SNP) predictors associated with Alzheimer’s disease in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. Furthermore, the two following data scenarios are investigated: one that solely considers the set of SNPs, and another with the covariates of age, sex, educational level, and ε4 allele of the Apolipoprotein E (APOE4) genotype. The SNP rs2075650 and the APOE4 genotype are provided as risk factors for Alzheimer’s disease, which is in line with the literature, and another four new SNPs are indicated, thus cultivating new hypotheses for in vivo analyses. These experiments demonstrate the potential of the new method for stable feature selection.
Analysis of Potential Risk Factors for Multidrug-Resistance at a Burn Unit Luís Cabral, Leonor Rodrigues, Ana H. Tavares, Gonçalo Tomé, Marisa Caetano, et al. European Burn Journal, 2023 Background: Infections by multidrug-resistant (MDR) microorganisms are associated with increased morbidity and mortality in burn patients. This study aimed to analyze the evolution of MDR bacteria over a five-year period at Coimbra Burns Unit (CBU) in Portugal, seeking to assess the possible associations of specific bacteria with presumed risk factors. Methods: The data obtained consisted of identified bacteria present in any microbiological sample from each patient (including blood, central venous catheter, urine, tracheal aspirate and/or wound exudate). Univariate models and a multivariate model were constructed for each of the MDR bacteria species that infected at least 50 patients or that had five or more MDR strains. Statistical hypothesis tests with a p-value less than 0.05 were considered significant. Results: Of a total of 341 samples obtained, 107 were MDR, corresponding to 10 species. Globally, there was no significant variation in MDR bacteria frequency over the period under analysis. Some risk factors and/or trends were identified for some species, but none was linked to all of them. Conclusions: The risks for the development of MDR in bacteria in burn patients are multifactorial, mainly linked to longer hospital stays, the use of invasive devices and inadequate antimicrobial treatment. However, the influence of these risks regarding specific bacterial species is not straightforward and may rely on individual characteristics, type of treatment and/or local prevalent flora. Due to the severity of multidrug-resistant infections, continued microbiological surveillance with the aid of rapid diagnostic tests and prompt institution of appropriate antimicrobial therapy are crucial to improving outcomes for burn patients.
Responsive Regulation and Tax Investigations: The Tax Investigation Diamond Model Assumptions João Araújo Marques, Ana Helena Tavares Journal of White Collar and Corporate Crime, 2023 This is an exploratory study to analyse whether the Tax Investigation Diamond (TID) model is a necessary and useful tool for tax investigations. The TID is meant to have a special role within tax compliance as an alternative strategy mechanism to maximise the likelihood of restoration by the offender. Through an analysis of interviews with Portuguese Tax Inspectors, we show that in tax investigations it is not always desirable to fulfil all steps of escalation. This study provides data indicating that TID can be an innovative strategy for adapting responsive regulation strategies to tax investigations, based on the tax inspector’s perception of the taxpayer’s motivational posture and the way in which power actions are used according to these motivational postures.
A Multiobjective Optimization Approach to Pulmonary Rehabilitation Effectiveness in COPD Jorge Cabral, Vera Afreixo, Cristiana J Silva, Ana Tavares, Alda Marques Statistics Optimization and Information Computing, 2023 Chronic obstructive pulmonary disease (COPD) is a common disease that accounts for a significant individual and societal burden. Pulmonary rehabilitation (PR) is a key management strategy but it is highly inaccessible, making prioritisation highly needed. This study aimed to determine and optimize predictive models of PR outcomes and build a tool to help healthcare professionals in their clinical decision-making about PR prioritisation. Data from patients who performed a 12-week community-based PR programme were analysed. Exercise capacity with the six-minute walk test distance (6MWD), isometric quadriceps muscle strength with handheld dynamometry (QMS) and dyspnoea with the modified Medical Research Council dyspnoea scale (mMRC) were assessed before and after PR. Multiple linear regression models were determined based on the Akaike information criterion and a cross-validation method. The resultant multiobjective problem was solved using the Nondominated Sorting Genetic Algorithm-II. The R Shiny package was used to create a web-based user interface. Data from 95 patients with COPD (median age of 69 years, 19 female and generally overweight) resulted in linear predictive models for the post-pre difference of the 6MWD, QMS and mMRC with cross-validation R² of 0.49, 0.53 and 0.51, respectively. 6MWD and mMRC were common statistically significant predictors. Pareto front patients were obese ex-smoker women who do not do long-term oxygen therapy and who performed PR. The distance to the Pareto front along with the estimates given by our models are easily obtained using the designed R Shiny interface and may help healthcare professionals decide on the prioritisation to PR programmes.
COPD profiles and treatable traits using minimal resources: identification, decision tree and stability over time Alda Marques, Sara Souto-Miranda, Ana Machado, Ana Oliveira, Cristina Jácome, et al. Respiratory Research, 2022 Background and objective: Profiles of people with chronic obstructive pulmonary disease (COPD) often do not describe treatable traits, lack validation and/or their stability over time is unknown. We aimed to identify COPD profiles and their treatable traits based on simple and meaningful measures; to develop and validate a decision tree and to explore profile stability over time. Methods: An observational, prospective study was conducted. Clinical characteristics, lung function, symptoms, impact of the disease (COPD Assessment Test—CAT), health-related quality of life, physical activity, lower-limb muscle strength and functional status were collected cross-sectionally and a subsample was followed up monthly over six months. A principal component analysis and a clustering procedure with k-medoids were applied to identify profiles. A decision tree was developed and validated cross-sectionally. Stability was explored over time with the ratio between the number of timepoints that a participant was classified in the same profile and the total number of timepoints (i.e., 6). Results: 352 people with COPD (67.4 ± 9.9 years; 78.1% male; FEV1 = 56.2 ± 20.6% predicted) participated and 90 (67.6 ± 8.9 years; 85.6% male; FEV1 = 52.1 ± 19.9% predicted) were followed up. Four profiles were identified with distinct treatable traits. The decision tree included CAT (< 18 or ≥ 18 points); age (< 65 or ≥ 65 years) and FEV1 (< 48 or ≥ 48% predicted) and had an agreement of 71.7% (Cohen’s Kappa = 0.62, p < 0.001) with the actual profiles. 48.9% of participants remained in the same profile whilst 51.1% moved between two (47.8%) or three (3.3%) profiles over time. Overall stability was 86.8 ± 15%. Conclusion: Four profiles and treatable traits were identified with simple and meaningful measures possibly available in low-resource settings. A decision tree with three commonly used variables in the routine assessment of people with COPD is now available for quick allocation to the identified profiles in clinical practice. Profiles and treatable traits may change over time in people with COPD; hence, regular assessments to deliver goal-targeted personalised treatments are needed.
Respiratory function and upper extremity functional activity performance in people with dementia: A shout for attention Cátia Paixão, Ana Tavares, Alda Marques Journal of Aging and Physical Activity, 2021 The aim of this study was to explore respiratory function and upper extremity functional activity in people with dementia (PWD) and the associations between these variables and cognitive function (n = 22 institutionalized PWD, 28 community-dwelling PWD, and 26 healthy older people). All measures were significantly lower in PWD who live in an institution, such as a nursing home or long-term care facility, or who attend adult daycare than in PWD who live in a community dwelling. The values from these two groups were significantly lower than those from healthy older people. Moderate to high negative correlations between upper extremity functional activity and respiratory function (−.73 < rs < −.49) and cognitive function (rs = −.83), and between cognitive function and respiratory function (−.74 < rs < −.58) were identified (p < .001). When adjusted for cognitive function (−.38 < rs < −.29; p < .05), the association between upper limb functional activity and respiratory function decreased. The decline demonstrates the importance of physical activity and cognitive and respiratory function in PWD.
Dissimilar symmetric word pairs in the human genome Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, et al. Advances in Intelligent Systems and Computing, 2017