“This dataset describes pediatric patients with several hematologic diseases, who were subject to the unmanipulated allogeneic unrelated donor hematopoietic stem cell transplantation.” (UC Irvine).
CD34+ cells, the population that includes hematopoietic stem cells (HSCs), self-renew and produce mature blood cells, including erythrocytes, platelets, and leukocytes such as lymphocytes. As the source of all blood lineages, CD34+ cells are critical in hematopoietic stem cell transplantation (HSCT) because they play a central role in shaping the immune environment after transplantation. In pediatric HSCT studies, CD34+ cell dynamics help evaluate immune recovery and treatment efficacy; this study examines the relationship between immune function and CD34+ stem cell transplantation outcomes.
Dataset Source: UCI Bone Marrow Transplant Children (187 observations x 36 features)
![Screenshot 2024-12-25 at 1 00 11 AM](https://private-user-images.githubusercontent.com/158855066/398522117-9a499ecc-f586-4137-b188-0916f022d3e2.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzY0MTAsIm5iZiI6MTczOTE3NjExMCwicGF0aCI6Ii8xNTg4NTUwNjYvMzk4NTIyMTE3LTlhNDk5ZWNjLWY1ODYtNDEzNy1iMTg4LTA5MTZmMDIyZDNlMi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMFQwODI4MzBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02MjA4MjdlNmU4ZTMyYWZhMzExY2IxNzU5ODMxZWYxYjlmMGY3NjQ5Y2QzZDAwMDc5YzMyZDMzNzZlNWJhYjhkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.OZA2tWbLLjtQEI3DXS3GDCm5lFjudEKpDXg6R1rKT6U)
The UCI Bone Marrow dataset was analyzed to predict two key outcomes:
- Survival Status (categorical)
- Survival Time (continuous)
Our main objective was to determine which variables best predict these outcomes and to compare different supervised learning models in terms of their predictive performance.
- Inspected missing data using visualizations (`vis_miss`, `gg_miss_var`).
- Examined correlations (`corrplot`) and outliers using the IQR rule.
- Explored distributions via histograms, density plots, and scatterplot matrices.
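The IQR rule mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's own code: the helper name `iqr_outliers`, the conventional multiplier `k=1.5`, and the sample values are all assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates quartiles over the observed range;
    # other quantile conventions give slightly different fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical ages for illustration only, not taken from the dataset.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 36, 70]
print(iqr_outliers(ages))  # [70]
```

Here Q1 is 27.5 and Q3 is 34.5, so the fences are 17 and 45 and only the value 70 is flagged.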
- Survival Status (Classification)
  - Logistic Regression
  - Random Forest
- Survival Time (Regression)
  - Linear Regression
  - Lasso (L1 Regularization)
  - Random Forest
Three main strategies were used to identify important features:
- Stepwise Selection (using AIC-based forward/backward selection)
- Lasso Regularization (to shrink less important coefficients to zero)
- Random Forest Feature Importance (ranking variables by mean decrease in node purity)
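To make the "shrink less important coefficients to zero" behaviour of Lasso concrete, the sketch below applies the soft-thresholding operator, which is the closed-form Lasso solution when predictors are orthonormal. It is an illustration in plain Python under that simplifying assumption, not the fitting procedure used in this analysis; the feature names, coefficient values, and penalty `lam` are hypothetical.

```python
import math

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; any |z| <= lam becomes exactly zero."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

# Hypothetical unpenalized coefficients, not estimates from the dataset.
coefs = {"feature_a": 2.5, "feature_b": -0.3, "feature_c": 0.8}
lam = 0.5  # assumed penalty strength

shrunk = {name: soft_threshold(b, lam) for name, b in coefs.items()}
# Features whose coefficients survive the penalty are the "selected" ones.
selected = [name for name, b in shrunk.items() if b != 0.0]
print(selected)  # ['feature_a', 'feature_c']
```

A larger `lam` zeroes out more coefficients, which is why the penalty doubles as a feature-selection step.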
- Most Important Predictors (overlap of stepwise, Lasso, and Random Forest):
  - Relapse
  - extcGvHD
  - Survival Time
  - Txpostrelapse
- Model Comparison
  - Logistic Regression: ~94.44% accuracy
  - Random Forest: ~94.44% accuracy (the same rounded value before and after tuning)
  - With accuracy tied, Logistic Regression remained the best choice in our comparison, even after Random Forest tuning, because of its consistent predictive performance and interpretability.
- Features Identified by Each Method:
  - Stepwise Selection: Stemcellsource, RecipientABO, Disease, Txpostrelapse, extcGvHD, Recipientage, Rbodymass, survival_status, DosageGroup
  - Lasso: Donorage, CD34kgx10d6, CD3dCD34, CD3dkgx10d8, Rbodymass, ANCrecovery, PLTrecovery, time_to_aGvHD_III_IV, survival_status
  - Random Forest: survival_status, extcGvHD, CD3dCD34, PLTrecovery, CD3dkgx10d8, Donorage, CD34kgx10d6, CMVstatus, Rbodymass, HLAgrI
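One way to see how much the three methods agree is to count how many of the lists each feature appears in. The sketch below does this in plain Python using the feature names exactly as listed above; the variable names are illustrative, and the counting approach is an assumption rather than the procedure used in the original analysis.

```python
from collections import Counter

stepwise = ["Stemcellsource", "RecipientABO", "Disease", "Txpostrelapse",
            "extcGvHD", "Recipientage", "Rbodymass", "survival_status",
            "DosageGroup"]
lasso = ["Donorage", "CD34kgx10d6", "CD3dCD34", "CD3dkgx10d8", "Rbodymass",
         "ANCrecovery", "PLTrecovery", "time_to_aGvHD_III_IV",
         "survival_status"]
random_forest = ["survival_status", "extcGvHD", "CD3dCD34", "PLTrecovery",
                 "CD3dkgx10d8", "Donorage", "CD34kgx10d6", "CMVstatus",
                 "Rbodymass", "HLAgrI"]

# Count, for each feature, the number of methods that selected it.
counts = Counter(f for method in (stepwise, lasso, random_forest)
                 for f in set(method))

in_all_three = sorted(f for f, c in counts.items() if c == 3)
in_at_least_two = sorted(f for f, c in counts.items() if c >= 2)

print(in_all_three)     # ['Rbodymass', 'survival_status']
print(in_at_least_two)
```

Only Rbodymass and survival_status are selected by all three methods; extcGvHD is shared by stepwise selection and Random Forest, and the CD34/CD3 dosage variables are shared by Lasso and Random Forest.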
- Model Comparison
  - Stepwise Linear Model
    - R-squared: 0.654
    - RMSE: 494.12
    - AIC: 2814.30
  - Lasso Model
    - R-squared: 0.612
    - RMSE: 523.05
    - AIC: 2817.01
  - Random Forest Model
    - R-squared: 0.656
    - RMSE: 492.40
    - AIC: 2815.03
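- Best Model
  - Random Forest had the highest R-squared (0.656) and the lowest RMSE (492.40), narrowly ahead of the stepwise linear model (0.654 and 494.12), which had the lowest AIC (2814.30). On R-squared and RMSE, Random Forest was the strongest regressor for predicting survival time in this comparison.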
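For readers unfamiliar with the two headline metrics, the sketch below shows how R-squared and RMSE are computed from observed and predicted values. It is plain Python with hypothetical survival times, not the project's evaluation code, and the source does not state whether its metrics were computed on training or held-out data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical survival times in days, not values from the dataset.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(rmse(y_true, y_pred))                 # 10.0
print(round(r_squared(y_true, y_pred), 3))  # 0.992
```

RMSE is expressed in the units of the outcome, so an RMSE near 490 means predictions are typically off by several hundred units of survival time, while R-squared is unitless and easier to compare across models.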
- Survival Status depends primarily on:
  - Relapse, extcGvHD, Survival Time, and Txpostrelapse
  - CD34+ dosage did not appear as a crucial determinant for survival status in the final models.
  - Logistic Regression proved the most reliable for classification.
- Survival Time is strongly influenced by:
  - Survival Status, extcGvHD, CD3dCD34, PLTrecovery, CD3dkgx10d8, Donorage, CD34kgx10d6, CMVstatus, Rbodymass, and HLAgrI
  - CD34+ dosage surfaced as a significant predictor of survival time but does not alone guarantee survival.
- Interaction Between Outcomes
  - Survival Status and Survival Time are interdependent: each appears among the top predictors of the other.
  - Apart from the two outcomes themselves, only extcGvHD was shared as a top predictor across both final models.
  - While higher CD34+ dosage may prolong survival time, it does not ensure survival.
- For categorical survival status predictions, Logistic Regression is recommended; for continuous survival time predictions, Random Forest is most effective.
- The hypothesis that higher CD34+ cell dosage extends survival time is partially supported by the results, but higher dosage is not conclusively linked to improved survival status.