diff --git a/notebooks/Spaceship Titanic.ipynb b/notebooks/Spaceship Titanic.ipynb index feaab4a..398d937 100644 --- a/notebooks/Spaceship Titanic.ipynb +++ b/notebooks/Spaceship Titanic.ipynb @@ -36,6 +36,16 @@ }, "metadata": {}, "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/sebastian/Documents/GitHub/TalkToEBM/t2ebm/graphs.py:318: SyntaxWarning:\n", + "\n", + "invalid escape sequence '\\%'\n", + "\n" + ] } ], "source": [ @@ -65,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -84,7 +94,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -103,7 +113,7 @@ "0.7832087406555491" ] }, - "execution_count": 3, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -126,8 +136,8 @@ { "data": { "text/html": [ - "\n", - "" + "\n", + "" ] }, "metadata": {}, @@ -161,21 +171,24 @@ "name": "stdout", "output_type": "stream", "text": [ - "The graph illustrates the effects of the categorical feature `HomePlanet` on a\n", - "dependent variable, as modeled by a Generalized Additive Model (GAM). The\n", - "feature `HomePlanet` includes three categories: \"Earth,\" \"Europa,\" and \"Mars.\"\n", - "Passengers from Europa exhibit a notably high positive mean effect of 0.5678 on\n", - "the dependent variable, with a tight confidence interval ranging from 0.5116 to\n", - "0.624, suggesting a strong and consistent positive impact. In contrast, Earth\n", - "shows a negative mean effect of -0.3246 with the confidence interval from -0.354\n", - "to -0.2952, indicating a robust negative association. Mars, while also positive,\n", - "has a much milder effect of 0.1713, with its confidence interval spanning from\n", - "0.1256 to 0.2171. 
This pronounced disparity in the effects, especially the\n", - "negative impact associated with Earth, is surprising and could hint at\n", - "underlying socio-economic or contextual factors influencing these outcomes.\n", - "Understanding these patterns might require further investigation into the\n", - "dataset's characteristics, including possible biases or the nature of the\n", - "dependent variable.\n" + "The graph represents the impact of the categorical feature \"HomePlanet\" on a\n", + "predicted outcome in a Generalized Additive Model (GAM), with HomePlanet having\n", + "three possible values: Earth, Europa, and Mars. Each planet shows a distinct\n", + "effect on the outcome: Earth has a negative mean effect of -0.3246, Europa\n", + "exhibits a strong positive mean effect of 0.5678, and Mars shows a moderate\n", + "positive effect of 0.1713. The 95% confidence intervals are tight for each\n", + "category, reinforcing the significance of these effects: Earth's interval ranges\n", + "from -0.354 to -0.2952, Europa's from 0.5116 to 0.624, and Mars' from 0.1256 to\n", + "0.2171. The negative effect associated with Earth is particularly surprising, as\n", + "one might typically expect either a neutral or positive effect from a planet\n", + "potentially representing a larger and more diverse population. Conversely, the\n", + "strong positive effect for Europa could suggest a modeling scenario where Europa\n", + "is conceptualized as a technologically advanced or affluent community. Mars’\n", + "positive but smaller effect might imply a scenario where Martian colonists are\n", + "considered resilient or pioneering. 
Understanding these patterns is crucial as\n", + "they suggest significant roles of planetary origin in influencing the modeled\n", + "outcomes, possibly reflecting narrative elements in simulated or fictional\n", + "settings.\n" ] } ], @@ -227,7 +240,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -346,7 +359,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -359,61 +372,43 @@ "INFO: The graph of feature ShoppingMall was simplified by 0.9%.\n", "INFO: The graph of feature Spa was simplified by 0.9%.\n", "INFO: The graph of feature VRDeck was simplified by 1.0%.\n", - "HomePlanet: The graph from the Generalized Additive Model (GAM) illustrates the impact of the \"HomePlanet\" feature on the likelihood of passengers being transported to an alternate dimension during the Spaceship Titanic's collision with a spacetime anomaly. It categorically compares three home planets: Earth, Europa, and Mars. Passengers from Europa show a notably higher likelihood of transportation with a positive effect (mean = 0.5678), suggesting a significant vulnerability or a unique characteristic of Europa inhabitants in facing spacetime anomalies. In contrast, Earth exhibits a negative impact (mean = -0.3246), indicating a lower probability of their passengers being transported, which could imply better preventive measures or technologies. Mars falls between the two, with a positive but relatively mild effect (mean = 0.1713), hinting at a slight predisposition towards transportation compared to Earth but significantly less than Europa. This pattern highlights potential differences in technology, preparedness, or societal behaviors among the populations of these planets. 
The results are surprising, particularly the high positive effect for Europa and the negative effect for Earth, suggesting underlying differences in interaction with the spacetime anomaly or the emergency response strategies of each planet.\n", - "\n", - "CryoSleep: The graph represents the influence of the \"CryoSleep\" boolean feature on the predictions from a Generalized Additive Model (GAM) regarding the likelihood of passengers being transported to another dimension due to a spacetime anomaly on the Spaceship Titanic. It shows two distinct outcomes based on whether passengers were in CryoSleep (True) or not (False). Specifically, passengers not in CryoSleep (False) have a negative mean effect on the model's prediction (-0.447), suggesting a decreased likelihood of being transported. Contrastingly, passengers in CryoSleep (True) show a positive mean effect (0.814), indicating an increased likelihood of being transported. The confidence intervals for both states are tight, reinforcing the reliability of these estimates. This pattern is somewhat counterintuitive as one might expect that active passengers would have a higher likelihood of transportation due to more extensive interaction with the ship’s environment. The observed effect might be explained by specific interactions between the CryoSleep technology and the anomaly or unique properties of the CryoSleep chambers' location on the ship.\n", - "\n", - "Cabin: The GAM graph for the \"Cabin\" feature on the Spaceship Titanic reveals significant variations in the likelihood of passengers being transported to an alternate dimension based on their cabin location and orientation (Port vs Starboard). Notably, cabins on the Starboard side of decks B (\"B/S\") and C (\"C/S\") show the highest positive effects, suggesting passengers in these locations had a markedly higher likelihood of being transported. 
In contrast, cabins on the Port side and lower decks, especially on G (\"G/P\") and T (\"T/S\"), exhibit negative effects, indicating a lower likelihood of transportation. This asymmetric effect between Port and Starboard sides, particularly on the same decks, is surprising and could imply an uneven exposure to the spacetime anomaly or structural differences in ship design. Additionally, the strong negative impact in higher and potentially more isolated decks such as G and T is counterintuitive, as one might expect these areas to offer safer environments. Overall, the graph suggests a complex interaction between cabin location and the anomaly's effects, with certain areas of the ship being significantly more affected than others. This insight could be crucial for understanding the dynamics of the anomaly and improving safety measures in similar future scenarios.\n", - "\n", - "Destination: The GAM graph for the categorical feature \"Destination\" demonstrates the impact of three destinations on the likelihood of passengers being transported to an alternate dimension during the Spaceship Titanic's collision with a spacetime anomaly. The destinations analyzed are \"55 Cancri e,\" \"PSO J318.5-22,\" and \"TRAPPIST-1e.\" Interestingly, passengers headed to \"55 Cancri e\" exhibit a significantly higher likelihood of being transported, evidenced by a positive mean effect of 0.315 with a confidence interval from 0.2766 to 0.3534. In contrast, the destinations \"PSO J318.5-22\" and \"TRAPPIST-1e\" both show negative mean effects of -0.0924 and -0.0917 respectively, with their confidence intervals also lying entirely below zero, indicating a lower probability of transportation to an alternate dimension for these passengers. 
This pattern is surprising and suggests a unique risk associated with the route or conditions related to the destination \"55 Cancri e.\" Such findings could imply operational, navigational, or environmental factors specifically influencing the anomaly interaction for voyages to this destination. Further investigation into these factors could provide crucial insights into safety protocols and risk management for interstellar travel.\n", - "\n", - "Age: The graph derived from a Generalized Additive Model (GAM) illustrates the relationship between passengers' ages and the likelihood of being transported to an alternate dimension aboard the Spaceship Titanic. The graph reveals that infants (particularly those aged 0-0.5 years) show a notably high probability of transport, with a mean value peaking at 0.65. As age increases, there is a general downward trend in probability, particularly notable from late childhood through to early adulthood (ages 9.5 to around 39 years), where the likelihood even dips into negative values, suggesting a lower chance of being transported. Surprisingly, there is a sharp increase in the likelihood of transport for seniors, particularly around the age range of 73.5 to 74.5 years, where the mean value rises dramatically to about 0.212. However, this is followed by a steep decline for the oldest age group analyzed (77.5-79 years), where the mean probability falls sharply to -0.408. These patterns suggest complex age-related dynamics in the transportation anomaly, with very young and older seniors displaying higher probabilities, and a notably lower likelihood for younger adults and middle-aged individuals. 
Understanding these age-specific vulnerabilities or protections could be crucial for rescue operations and further investigations into the anomaly's mechanics.\n", - "\n", - "VIP: The GAM graph for the \"VIP\" feature from the Spaceship Titanic dataset shows how VIP status influences the likelihood of a passenger being transported to an alternate dimension during a spacetime anomaly. The graph indicates that non-VIP passengers (VIP = False) have a mean effect close to zero (-0.0006), with a wide confidence interval that spans both negative and positive values, suggesting minimal influence on the transportation likelihood. In contrast, VIP passengers (VIP = True) exhibit a significant negative mean effect (-0.145) with a confidence interval entirely below zero (-0.2277 to -0.0623), indicating a notably lower likelihood of being transported. This pattern is somewhat counterintuitive as one might expect that the spacetime anomaly's effect would be uniformly random, or that VIPs might be better protected or equipped to handle such incidents. The observed negative effect for VIPs suggests that their specific accommodations or the ship's protocols for VIP areas may have mitigated the impact of the anomaly. This highlights an interesting aspect of how structural or procedural differences associated with VIP status could influence outcomes in unexpected ways during crises. Overall, the graph provides valuable insights into the interaction between passenger status and the effects of unpredictable spacetime events on the spaceship.\n", - "\n", - "RoomService: The Generalized Additive Model (GAM) graph for the \"RoomService\" feature from the Spaceship Titanic dataset demonstrates a complex relationship between the amount spent on room service and the likelihood of passengers being transported to an alternate dimension. 
Initially, there is a slight positive association for very low expenditures (0.0 to 27.5), suggesting that minimal spending on room service slightly increases the likelihood of the outcome. However, as expenditure increases, the effect becomes progressively negative, particularly after the spending exceeds 201.5, indicating that higher spending on room service correlates with a decreased likelihood of being transported. This negative association deepens significantly at very high levels of expenditure, with particularly sharp declines observed in the intervals (2420.0, 2621.5) and (6377.0, 14327.0). The confidence intervals widen at these higher expenditure levels, reflecting increasing uncertainty in the predictions. Interestingly, there is a slight recovery in the effect in the highest expenditure bracket, suggesting a complex nonlinear relationship. These patterns could reflect underlying passenger behaviors, ship security measures, or data irregularities at extreme spending levels, indicating a need for further exploration into these segments.\n", - "\n", - "FoodCourt: The graph from the Generalized Additive Model (GAM) represents the relationship between the amount spent at the FoodCourt and the likelihood of being transported to an alternate dimension aboard the Spaceship Titanic. It showcases a generally increasing trend: as spending increases, so does the predicted effect on the likelihood of transportation. Initially, for very low spending (0 to about 60), the effect sizes are slightly negative, indicating a minor decrease in the probability of transportation. However, as spending increases, particularly beyond 60 units, there is a noticeable upward trend in the effect sizes, suggesting higher FoodCourt expenditure correlates positively with the chances of being transported. 
Notably, there are surprising jumps and dips in the graph; for instance, a dramatic dip in effect size at a spending range of 1820 to 1894, and a significant jump between 3862.5 and 3871.0 in spending. These unexpected patterns could indicate non-linear relationships or particular thresholds in spending that markedly change the likelihood of transportation. The widening confidence intervals at higher spending levels suggest increased variability, possibly due to fewer observations or more heterogeneous characteristics within these passenger segments.\n", - "\n", - "ShoppingMall: The GAM graph for the \"ShoppingMall\" feature on the Spaceship Titanic dataset illustrates the impact of the amount spent at the shopping mall on the likelihood of a passenger being transported to an alternate dimension. The feature is continuous, and the effect of the spending on the predicted outcome is nonlinear. Initially, there is a negative correlation between the amount spent and the likelihood of transportation, with the effect becoming less negative as spending increases from 0 up to about 627.0. Surprisingly, at around 627.0, there is a sharp transition where the predicted value jumps from negative to positive, suggesting a threshold effect where higher spending suddenly increases the likelihood of being transported. Another notable pattern is the sudden drop and subsequent spike in the predicted values between the intervals 1826.5 to 1849.5, which is counterintuitive and may indicate data anomalies or unique interactions. As the expenditure increases, particularly beyond 1644.0, the confidence intervals widen, indicating increased uncertainty in predictions at higher expenditure levels. 
Overall, the graph shows a complex relationship between ShoppingMall spending and the event of transportation, highlighting specific spending thresholds that significantly alter the likelihood of being transported to an alternate dimension.\n", - "\n", - "Spa: The graph from the Generalized Additive Model (GAM) for the \"Spa\" feature on the Spaceship Titanic dataset illustrates a generally decreasing trend in predicted values as expenditure on spa services increases. Starting with a positive mean value of 0.482 in the lowest expenditure bracket (0.0, 13.5), the prediction gradually declines, transitioning to negative values from the interval (169.0, 204.5) onward. The decline becomes notably steeper in higher expenditure ranges, with a significant drop observed past the (3214.0, 3229.0) interval, where values plummet to -5.084, continuing to decrease to -6.405 in the highest bracket (10567.0, 18572.0). This sharp decline at high expenditures is counterintuitive, suggesting a complex, possibly non-linear relationship between spa spending and the model's outcome. Additionally, there are minor fluctuations in the highly negative ranges, indicating potential outliers or threshold effects influencing the predictions. The confidence intervals provided widen in higher spending brackets, reflecting increasing uncertainty in predictions as expenditures rise. This graph underscores the nuanced interaction between luxury spending and predicted outcomes, revealing insights that could guide further investigation into passenger behavior or model refinement.\n", - "\n", - "VRDeck: The graph from the Generalized Additive Model (GAM) for the \"VRDeck\" feature exhibits a generally decreasing trend in the predicted outcome as the expenditure on the VRDeck increases. Notably, the effect starts slightly positive or neutral at very low expenditures but becomes increasingly negative as spending rises. 
This suggests that higher spending on the VRDeck is associated with a lower likelihood of being transported to an alternate dimension. There are specific intervals, such as [(785.5, 789.5)], [(1898.5, 1909.5)], and especially [(4105.0, 4147.0)] where the model output shows counterintuitive or abrupt changes. These include unexpected less negative peaks or sharp declines in the predicted effect, suggesting complex interactions or anomalies within the data that deviate from the overall trend. Such observations highlight the potential influence of other unmodeled factors or unique groups of passengers. Understanding these irregularities could require further investigation into the interaction with other features or a deeper examination of the data quality and distribution in these specific ranges.\n", - "The Generalized Additive Model (GAM) applied to the Spaceship Titanic dataset\n", - "has uncovered significant relationships between passenger features and the\n", - "likelihood of being transported to an alternate dimension during a spacetime\n", - "anomaly. Here's a condensed summary of the key findings: 1. **CryoSleep**\n", - "(Feature Importance: 0.56): - Passengers in CryoSleep are significantly more\n", - "likely to be transported (mean effect = 0.814) compared to those not in\n", - "CryoSleep (mean effect = -0.447). This suggests unique interactions between the\n", - "CryoSleep technology and the anomaly. 2. **Spa** (Feature Importance: 0.72):\n", - "- There is a strong negative correlation between spa spending and the likelihood\n", - "of transportation, with a steep decline in likelihood as expenditures increase.\n", - "This counterintuitive pattern points to complex, non-linear effects of luxury\n", - "expenditures on outcomes. 3. **VRDeck** (Feature Importance: 0.63): - Higher\n", - "spending on the VRDeck is associated with a lower likelihood of being\n", - "transported, with an increasing negative effect as expenditure rises. 
4.\n", - "**RoomService** (Feature Importance: 0.48): - Minimal initial spending on\n", - "room service slightly increases the likelihood of being transported, but as\n", - "spending rises, the effect becomes negatively pronounced. This indicates complex\n", - "interactions influenced by passenger behavior or security measures. 5.\n", - "**HomePlanet** (Feature: 0.35): - Europa passengers are notably more likely\n", - "to be transported (mean = 0.5678) compared to those from Earth (mean = -0.3246)\n", - "and Mars (mean = 0.1713), suggesting differences in technology or emergency\n", - "responses among the planets. 6. **Cabin** (Feature: 0.39): - Cabin location\n", - "significantly affects transportation likelihood, with notable disparities\n", - "between cabin sides (Port vs. Starboard) on the same deck, indicating uneven\n", - "exposure to the anomaly or structural differences in the ship. Surprising\n", - "Patterns: - **Age**: There's a non-linear relationship with age, where both very\n", - "young and older seniors show higher transportation probabilities, highlighting\n", - "age-specific dynamics in anomaly interaction. - **Destination**: Passengers to\n", - "\"55 Cancri e\" have a higher transportation likelihood compared to other\n", - "destinations, indicating unique risks associated with this route. This GAM\n", - "analysis provides crucial insights into the dynamics of spacetime anomalies in\n", - "interstellar travel, emphasizing the importance of considering complex and non-\n", - "linear interactions between features and outcomes for safety and operational\n", - "strategies.\n" + "The Generalized Additive Model (GAM) used for analyzing the Spaceship Titanic\n", + "anomaly provides crucial insights into factors influencing the likelihood of\n", + "passengers being transported to an alternate dimension. Here’s a concise summary\n", + "of the most impactful features: 1. 
**CryoSleep**: This feature significantly\n", + "affects the outcome, with passengers in cryosleep more likely to be transported\n", + "(mean effect size 0.814) compared to those who are not (mean effect -0.447).\n", + "This suggests a unique interaction between the cryosleep state and the anomaly,\n", + "potentially due to the location or conditions of cryosleep chambers. 2.\n", + "**Spa**: Expenditures on spa services show a strong negative correlation with\n", + "the likelihood of transportation, particularly at higher spending levels. The\n", + "effect becomes extremely strong (below -5) at the highest expenditures,\n", + "indicating a protective factor potentially linked to socioeconomic status or\n", + "specific behaviors. 3. **VRDeck**: Similar to spa spending, expenditure on the\n", + "VRDeck is negatively correlated with the probability of transportation,\n", + "intensifying with higher spending. This suggests that engagement in VRDeck\n", + "amenities might be associated with safer areas or protective behaviors on the\n", + "ship. 4. **RoomService**: Initially, a slight increase in transportation\n", + "likelihood is observed at very low spending levels on room service, but it\n", + "shifts to a significant negative correlation as spending increases. High\n", + "expenditures on room service might correlate with safer locations on the ship.\n", + "5. **HomePlanet**: Passengers from Europa are much more likely to be transported\n", + "(mean effect 0.5678) compared to those from Earth (mean effect -0.3246) and Mars\n", + "(mean effect 0.1713). This indicates that planetary origin, reflecting differing\n", + "socio-economic or technological contexts, significantly influences\n", + "susceptibility to the anomaly. 6. **Cabin**: The cabin location, particularly\n", + "differences between Port and Starboard sides, significantly impacts the\n", + "likelihood of transportation. 
For instance, Starboard side cabins, especially on\n",
+      "specific decks (e.g., \"C/S\" with mean = 2.016), show higher positive effects.\n",
+      "7. **Destination**: The intended destination affects transportation likelihood,\n",
+      "with passengers destined for 55 Cancri e exhibiting a higher likelihood compared\n",
+      "to those heading to PSO J318.5-22 and TRAPPIST-1e. This might be influenced by\n",
+      "route or operational parameters specific to each destination. The model\n",
+      "highlights the importance of understanding interactions between passenger states\n",
+      "(like cryosleep), cabin locations, spending on ship amenities, and origins in\n",
+      "assessing risks from spacetime anomalies. These factors play crucial roles in\n",
+      "the model's predictive accuracy and offer insights for enhancing safety and\n",
+      "design in future interstellar travel scenarios.\n"
      ]
     }
    ],
diff --git a/setup.py b/setup.py
index 4ea166e..addb9fe 100644
--- a/setup.py
+++ b/setup.py
@@ -7,9 +7,14 @@ with open("README.md", "r", encoding="utf-8") as fh:
     long_description = fh.read()
 
 
+# Read the version from the version file
+version = {}
+with open("t2ebm/version.py") as fp:
+    exec(fp.read(), version)
+
 setuptools.setup(
     name="t2ebm",
-    version="0.1.0",
+    version=version["__version__"],
     author="Sebastian Bordt, Ben Lengerich, Harsha Nori, Rich Caruana",
     author_email="sbordt@posteo.de",
     description="A Natural Language Interface to Explainable Boosting Machines",
diff --git a/t2ebm/__init__.py b/t2ebm/__init__.py
index 788fdce..8833ec2 100644
--- a/t2ebm/__init__.py
+++ b/t2ebm/__init__.py
@@ -2,7 +2,7 @@
 TalkToEBM: A Natural Language Interface to Explainable Boosting Machines
 """
 
-__version__ = "0.1.0"
+from .version import __version__
 
 # high-level functions
 from .functions import (
diff --git a/t2ebm/prompts.py b/t2ebm/prompts.py
index c795ff5..5da8687 100644
--- a/t2ebm/prompts.py
+++ b/t2ebm/prompts.py
@@ -65,7 +65,7 @@ def describe_graph_cot(graph, num_sentences=7, **kwargs):
         {"role": "assistant", "temperature": 0.7, "max_tokens": 2000},
         {
             "role": "user",
-            "content": f"Thanks. Now please provide a brief, at most {num_sentences} sentence description of the graph. Be sure to include any important suprising patterns in the description. You can assume that the user knows that the graph is from a Generalized Additive Model (GAM).",
+            "content": f"Thanks. Now please provide a brief, at most {num_sentences} sentence description of the graph. Be sure to include any important surprising patterns in the description. You can assume that the user knows that the graph is from a Generalized Additive Model (GAM).",
         },
         {"role": "assistant", "temperature": 0.7, "max_tokens": 2000},
     ]
@@ -87,10 +87,10 @@ def summarize_ebm(
     user_msg = """Your task is to summarize a Generalized Additive Model (GAM). To perform this task, you will be given
     - The global feature importances of the different features in the model.
     - Summaries of the graphs for the different features in the model. There is exactly one graph for each feature in the model. """
-    user_msg += f"Here are the global feature importaces.\n\n{feature_importances}\n\n"
+    user_msg += f"Here are the global feature importances.\n\n{feature_importances}\n\n"
     user_msg += f"Here are the descriptions of the different graphs.\n\n{graph_descriptions}\n\n"
     if dataset_description is not None and len(dataset_description) > 0:
-        user_msg += f"Here is a description of the dataset that the model was trained on.\n\n{graph_descriptions}\n\n"
+        user_msg += f"Here is a description of the dataset that the model was trained on.\n\n{dataset_description}\n\n"
     user_msg += """Now, please provide a summary of the model.
 
     The summary should contain the most important features in the model and their effect on the outcome. Unimportant effects and features can be ignored. 
diff --git a/t2ebm/version.py b/t2ebm/version.py
new file mode 100644
index 0000000..d1f2e39
--- /dev/null
+++ b/t2ebm/version.py
@@ -0,0 +1 @@
+__version__ = "0.1.1"
\ No newline at end of file