fix: fix duplicate
DiogoRibeiro7 committed Oct 18, 2024
1 parent f862366 commit dc11e6d
Showing 239 changed files with 5,087 additions and 3,442 deletions.
@@ -6,7 +6,9 @@ categories:
- Machine Learning
classes: wide
date: '2019-12-29'
excerpt: Splines are powerful tools for modeling complex, nonlinear relationships in data. In this article, we'll explore what splines are, how they work, and how they are used in data analysis, statistics, and machine learning.
excerpt: Splines are powerful tools for modeling complex, nonlinear relationships
in data. In this article, we'll explore what splines are, how they work, and how
they are used in data analysis, statistics, and machine learning.
header:
image: /assets/images/data_science_19.jpg
og_image: /assets/images/data_science_19.jpg
@@ -16,25 +18,30 @@ header:
twitter_image: /assets/images/data_science_19.jpg
keywords:
- Splines
- Spline Regression
- Nonlinear Models
- Data Smoothing
- Statistical Modeling
- python
- bash
- go
seo_description: Splines are flexible mathematical tools used for smoothing and modeling complex data patterns. Learn what they are, how they work, and their practical applications in regression, data smoothing, and machine learning.
- Spline regression
- Nonlinear models
- Data smoothing
- Statistical modeling
- Python
- Bash
- Go
seo_description: Splines are flexible mathematical tools used for smoothing and modeling
complex data patterns. Learn what they are, how they work, and their practical applications
in regression, data smoothing, and machine learning.
seo_title: What Are Splines? A Deep Dive into Their Uses in Data Analysis
seo_type: article
summary: Splines are flexible mathematical functions used to approximate complex patterns in data. They help smooth data, model non-linear relationships, and fit curves in regression analysis. This article covers the basics of splines, their various types, and their practical applications in statistics, data science, and machine learning.
summary: Splines are flexible mathematical functions used to approximate complex patterns
in data. They help smooth data, model non-linear relationships, and fit curves in
regression analysis. This article covers the basics of splines, their various types,
and their practical applications in statistics, data science, and machine learning.
tags:
- Splines
- Regression
- Data Smoothing
- Nonlinear Models
- python
- bash
- go
- Data smoothing
- Nonlinear models
- Python
- Bash
- Go
title: 'Understanding Splines: What They Are and How They Are Used in Data Analysis'
---

@@ -5,7 +5,9 @@ categories:
- Machine Learning
classes: wide
date: '2019-12-30'
excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus on Precision and Recall, offers a better evaluation for handling rare events.
excerpt: AUC-ROC and Gini are popular metrics for evaluating binary classifiers, but
they can be misleading on imbalanced datasets. Discover why AUC-PR, with its focus
on Precision and Recall, offers a better evaluation for handling rare events.
header:
image: /assets/images/data_science_8.jpg
og_image: /assets/images/data_science_8.jpg
@@ -14,21 +16,28 @@ header:
teaser: /assets/images/data_science_8.jpg
twitter_image: /assets/images/data_science_8.jpg
keywords:
- AUC-PR
- Precision-Recall
- Binary Classifiers
- Imbalanced Data
- Machine Learning Metrics
seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves provide a clearer picture of model performance on rare events.
- Auc-pr
- Precision-recall
- Binary classifiers
- Imbalanced data
- Machine learning metrics
seo_description: When evaluating binary classifiers on imbalanced datasets, AUC-PR
is a more informative metric than AUC-ROC or Gini. Learn why Precision-Recall curves
provide a clearer picture of model performance on rare events.
seo_title: 'AUC-PR vs. AUC-ROC: Evaluating Classifiers on Imbalanced Data'
seo_type: article
summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve) is a superior metric for evaluating binary classifiers on imbalanced datasets compared to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics and provide real-world examples of why Precision-Recall curves give a clearer understanding of model performance on rare events.
summary: In this article, we explore why AUC-PR (Area Under Precision-Recall Curve)
is a superior metric for evaluating binary classifiers on imbalanced datasets compared
to AUC-ROC and Gini. We discuss how class imbalance distorts performance metrics
and provide real-world examples of why Precision-Recall curves give a clearer understanding
of model performance on rare events.
tags:
- Binary Classifiers
- Imbalanced Data
- AUC-PR
- Precision-Recall
title: 'Evaluating Binary Classifiers on Imbalanced Datasets: Why AUC-PR Beats AUC-ROC and Gini'
- Binary classifiers
- Imbalanced data
- Auc-pr
- Precision-recall
title: 'Evaluating Binary Classifiers on Imbalanced Datasets: Why AUC-PR Beats AUC-ROC
and Gini'
---

When working with binary classifiers, metrics like **AUC-ROC** and **Gini** have long been the default for evaluating model performance. These metrics offer a quick way to assess how well a model discriminates between two classes, typically a **positive class** (e.g., detecting fraud or predicting defaults) and a **negative class** (e.g., non-fraudulent or non-default cases).
@@ -4,7 +4,8 @@ categories:
- Statistics
classes: wide
date: '2019-12-31'
excerpt: Let's examine why multiple imputation, despite being popular, may not be as robust or interpretable as it's often considered. Is there a better approach?
excerpt: Let's examine why multiple imputation, despite being popular, may not be
as robust or interpretable as it's often considered. Is there a better approach?
header:
image: /assets/images/data_science_20.jpg
og_image: /assets/images/data_science_20.jpg
@@ -13,18 +14,22 @@ header:
teaser: /assets/images/data_science_20.jpg
twitter_image: /assets/images/data_science_20.jpg
keywords:
- multiple imputation
- missing data
- single stochastic imputation
- deterministic sensitivity analysis
seo_description: Exploring the issues with multiple imputation and why single stochastic imputation with deterministic sensitivity analysis is a superior alternative.
- Multiple imputation
- Missing data
- Single stochastic imputation
- Deterministic sensitivity analysis
seo_description: Exploring the issues with multiple imputation and why single stochastic
imputation with deterministic sensitivity analysis is a superior alternative.
seo_title: 'The Case Against Multiple Imputation: An In-depth Look'
seo_type: article
summary: Multiple imputation is widely regarded as the gold standard for handling missing data, but it carries significant conceptual and interpretative challenges. We will explore its weaknesses and propose an alternative using single stochastic imputation and deterministic sensitivity analysis.
summary: Multiple imputation is widely regarded as the gold standard for handling
missing data, but it carries significant conceptual and interpretative challenges.
We will explore its weaknesses and propose an alternative using single stochastic
imputation and deterministic sensitivity analysis.
tags:
- Multiple Imputation
- Missing Data
- Data Imputation
- Multiple imputation
- Missing data
- Data imputation
title: A Deep Dive into Why Multiple Imputation is Indefensible
---

180 changes: 113 additions & 67 deletions _posts/2020-01-01-causality_correlation.md
@@ -4,7 +4,8 @@ categories:
- Statistics
classes: wide
date: '2020-01-01'
excerpt: Understand how causal reasoning helps us move beyond correlation, resolving paradoxes and leading to more accurate insights from data analysis.
excerpt: Understand how causal reasoning helps us move beyond correlation, resolving
paradoxes and leading to more accurate insights from data analysis.
header:
image: /assets/images/data_science_4.jpg
og_image: /assets/images/data_science_1.jpg
@@ -18,10 +19,14 @@ keywords:
- Berkson's paradox
- Correlation
- Data science
seo_description: Explore how causal reasoning, through paradoxes like Simpson's and Berkson's, can help us avoid the common pitfalls of interpreting data solely based on correlation.
seo_description: Explore how causal reasoning, through paradoxes like Simpson's and
Berkson's, can help us avoid the common pitfalls of interpreting data solely based
on correlation.
seo_title: 'Causality Beyond Correlation: Understanding Paradoxes and Causal Graphs'
seo_type: article
summary: An in-depth exploration of the limits of correlation in data interpretation, highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as a tool for uncovering true causal relationships.
summary: An in-depth exploration of the limits of correlation in data interpretation,
highlighting Simpson's and Berkson's paradoxes and introducing causal graphs as
a tool for uncovering true causal relationships.
tags:
- Simpson's paradox
- Berkson's paradox
@@ -36,20 +41,41 @@ In today's data-driven world, we often rely on statistical correlations to make
This article is aimed at anyone who works with data and is interested in gaining a more accurate understanding of how to interpret statistical relationships. Here, we will explore how to uncover **causal relationships** in data, how to resolve confusing situations like **Simpson's Paradox** and **Berkson's Paradox**, and how to use **causal graphs** as a tool for making better decisions. The goal is to demonstrate that by understanding causality, we can avoid the pitfalls of over-relying on correlation and make more informed decisions.

---

## Correlation and Causation: Why the Distinction Matters

In statistics, **correlation** measures the strength of a relationship between two variables. For example, if you observe that ice cream sales increase as temperatures rise, you might conclude that warmer weather causes more ice cream to be sold. This conclusion feels intuitive, but what about cases where the data is less obvious? Imagine a study finds a correlation between shark attacks and ice cream sales. Does one cause the other? Clearly not—but the correlation exists because both are influenced by a common factor: hot weather.

This example underscores the central problem: **correlation does not imply causation**. Just because two variables move together doesn’t mean one causes the other. Correlation can arise for several reasons:

- **Direct causality**: One variable causes the other.
- **Reverse causality**: The relationship runs in the opposite direction.
- **Confounding variables**: A third variable influences both.
- **Coincidence**: The relationship is due to chance.

To understand the true nature of relationships in data, we need to go beyond correlation and ask **why** the variables are related. This is where **causal inference** comes in.
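The shark-attack example above is easy to reproduce with a toy simulation (all numbers here are invented for illustration): two series that never influence each other still correlate strongly once a common cause drives both.

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)

# Hot weather (the confounder) drives both quantities; neither causes the other.
temperature = [random.gauss(25, 5) for _ in range(10_000)]
ice_cream_sales = [2.0 * t + random.gauss(0, 3) for t in temperature]
shark_attacks = [0.5 * t + random.gauss(0, 3) for t in temperature]

r = pearson(ice_cream_sales, shark_attacks)
print(f"correlation(ice cream, shark attacks) = {r:.2f}")  # strongly positive
```

Neither variable appears anywhere in the other's formula, yet the correlation is large because both inherit temperature's variation.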

---

## The Importance of Causal Inference
Expand All @@ -61,23 +87,41 @@ In most real-world scenarios, we rely on **observational data**, which is data c
Fortunately, researchers have developed methods to uncover causal relationships from observational data by combining **statistical reasoning** with a deep understanding of the data's context. This is where **causal graphs** and tools like **Simpson's Paradox** and **Berkson's Paradox** come into play.

---

## Simpson's Paradox: The Danger of Aggregating Data

Simpson's Paradox is a statistical phenomenon in which a trend that appears in different groups of data disappears or reverses when the groups are combined. This paradox occurs because of a **lurking confounder**, a variable that influences both the independent and dependent variables, skewing the relationship between them.

### The Classic Example

Imagine you're analyzing the effectiveness of a new drug across two groups: younger patients and older patients. Within each group, the drug seems to improve health outcomes. However, when you combine the two groups, the overall analysis shows that the drug is **less** effective.

This reversal happens because age, a **confounding variable**, is driving the overall result. If more older patients received the drug and older patients have worse outcomes in general, it can skew the overall data. Thus, the combined analysis gives a misleading result, suggesting the drug is less effective when it actually benefits each group.

### Why Does This Happen?

Simpson’s Paradox occurs because the relationship between variables changes when data is aggregated. In the example above, **age** confounds the relationship between the drug and health outcomes. It’s important to note that combining data from different groups without accounting for confounders can hide the true relationships within each group.

This paradox demonstrates why it’s crucial to understand the **story behind the data**. If we simply relied on the overall correlation, we would draw the wrong conclusion about the drug’s effectiveness.
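A minimal numerical sketch of the reversal, using hypothetical recovery counts in the spirit of the example (the figures are invented, not from any real trial):

```python
# Hypothetical (recovered, total) counts for each arm, split by age group.
data = {
    "young": {"drug": (81, 87),   "no_drug": (234, 270)},
    "old":   {"drug": (192, 263), "no_drug": (55, 80)},
}

def rate(recovered, total):
    return recovered / total

# Within each age group, the drug looks better...
for group, arms in data.items():
    assert rate(*arms["drug"]) > rate(*arms["no_drug"])

# ...but pooling the groups reverses the conclusion, because most
# drug recipients are older patients with worse baseline outcomes.
drug_total = [sum(x) for x in zip(*(arms["drug"] for arms in data.values()))]
ctrl_total = [sum(x) for x in zip(*(arms["no_drug"] for arms in data.values()))]
print(f"pooled drug rate:    {rate(*drug_total):.0%}")  # 78%
print(f"pooled control rate: {rate(*ctrl_total):.0%}")  # 83%
```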

---

## Berkson's Paradox: The Pitfall of Selection Bias
@@ -99,39 +143,41 @@ Berkson's Paradox illustrates the problem of **selection bias**—when we restri
The key takeaway from Berkson’s Paradox is that we need to be careful about **how we select data for analysis**. If we focus only on a specific group without understanding how that group was selected, we can introduce misleading correlations.

---

## Causal Graphs: A Tool for Visualizing Relationships

To avoid falling into the traps of Simpson’s and Berkson’s Paradoxes, it’s helpful to use **causal graphs** to visualize the relationships between variables. These graphs, also known as **Directed Acyclic Graphs (DAGs)**, allow us to represent the causal structure of a system and identify which variables are influencing others.

### What Are Causal Graphs?

A **causal graph** is a diagram that represents variables as **nodes** and the causal relationships between them as **directed edges** (arrows). A directed edge from variable **A** to variable **B** indicates that **A** has a causal influence on **B**.

Causal graphs are powerful because they help us:

1. **Identify confounders**: Variables that influence both the independent and dependent variables.
2. **Clarify causal relationships**: Show which variables are direct causes and which are effects.
3. **Avoid incorrect controls**: Help us decide which variables to control for in statistical analysis.
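As a sketch, such a DAG can be encoded as a plain adjacency mapping, with candidate confounders read off as common ancestors of treatment and outcome (the node names and helper functions here are illustrative, not a standard library API):

```python
# A tiny DAG encoded as {node: set of direct causes (parents)}.
# Hypothetical encoding of the drug-trial example from this article.
dag = {
    "age": set(),
    "drug_use": {"age"},
    "health_outcome": {"age", "drug_use"},
}

def ancestors(dag, node):
    # All direct and indirect causes of `node`, via depth-first search.
    seen = set()
    stack = list(dag[node])
    while stack:
        cause = stack.pop()
        if cause not in seen:
            seen.add(cause)
            stack.extend(dag[cause])
    return seen

def confounders(dag, treatment, outcome):
    # Common causes of both treatment and outcome: candidates to control for.
    return (ancestors(dag, treatment) & ancestors(dag, outcome)) - {treatment}

print(confounders(dag, "drug_use", "health_outcome"))  # {'age'}
```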

### Using Causal Graphs to Resolve Simpson's Paradox

Let’s return to the example of the drug trial. A causal graph for this scenario might look like this:

- **Age** influences both **Drug Use** and **Health Outcome**.
- **Drug Use** directly affects **Health Outcome**.

In this case, **Age** is a **confounder** because it influences both the independent variable (**Drug Use**) and the dependent variable (**Health Outcome**). When we control for **Age**, we remove its confounding effect and can properly assess the impact of the drug on health outcomes.
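One simple way to "control for Age" is direct standardization: compute each arm's recovery rate within each age group, then average those rates weighted by each age group's share of the whole sample. A sketch with invented counts for the drug example:

```python
# Hypothetical (recovered, total) counts per arm, by age group.
counts = {
    "young": {"drug": (81, 87),   "no_drug": (234, 270)},
    "old":   {"drug": (192, 263), "no_drug": (55, 80)},
}

def adjusted_rate(counts, arm):
    # Standardize: weight each age group's rate by that group's share
    # of the combined population (drug + control together).
    sizes = {g: a["drug"][1] + a["no_drug"][1] for g, a in counts.items()}
    total = sum(sizes.values())
    return sum(
        (arms[arm][0] / arms[arm][1]) * sizes[g] / total
        for g, arms in counts.items()
    )

drug = adjusted_rate(counts, "drug")
control = adjusted_rate(counts, "no_drug")
print(f"age-adjusted: drug {drug:.0%} vs control {control:.0%}")  # 83% vs 78%
```

After adjustment the drug comes out ahead, matching the within-group comparisons rather than the misleading pooled one.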

### Using Causal Graphs to Resolve Berkson's Paradox

In the case of celebrities, a causal graph might look like this:

- **Talent** and **Attractiveness** are independent in the general population.
- **Celebrity Status** depends on both **Talent** and **Attractiveness**.

Here, **Celebrity Status** is a **collider**, a variable that is influenced by both **Talent** and **Attractiveness**. When we condition on a collider (i.e., focus only on celebrities), we create a spurious correlation between **Talent** and **Attractiveness**. The key is to recognize that the negative correlation between these variables only exists because we have selected a specific subset of the population (celebrities), not because there is a true relationship between talent and attractiveness.
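The collider effect is easy to demonstrate by simulation (the cutoff and sample size below are arbitrary choices): draw talent and attractiveness independently, keep only the "celebrities" whose sum clears a threshold, and the selected subset shows a clear negative correlation.

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)

# Talent and attractiveness are drawn independently in the population.
talent = [random.gauss(0, 1) for _ in range(50_000)]
attract = [random.gauss(0, 1) for _ in range(50_000)]

# "Celebrity" is a collider: you make the cut when talent + attractiveness
# is high, so among celebrities the two traits trade off against each other.
celebs = [(t, a) for t, a in zip(talent, attract) if t + a > 2.0]

r_all = pearson(talent, attract)
r_celebs = pearson(*zip(*celebs))
print(f"population r = {r_all:+.2f}, celebrity-only r = {r_celebs:+.2f}")
```

In the full population the correlation hovers near zero; conditioning on the collider manufactures a strong negative one out of nothing.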

---

## The Broader Implications of Causality in Data Analysis