DNSC6290 Responsible Machine Learning

Baisc Information

Person or organization developing model: Patirck Hall, [email protected] & Shuning Ma, [email protected]
Model date: June, 2022
Model version: 1.0
License: Apache License 2.0
Model implementation code: DNSC6290

Assignment workflows

Final selected model is Monotonic XGBoost, and the other models are here for comparison.

Training basic interpretable models
Post-hoc explanations
Bias testing
- Explainable Boosting Machine
- Monotonic XGBoost
Red-teaming
- Explainable Boosting Machine
- Monotonic XGBoost
Model debugging
- Explainable Boosting Machine
- Monotonic XGBoost

Intended Use

Primary intended uses: This model is an example probability of XGBoost model, with an example use case for determining whether (1) or not (0) the annual percentage rate (APR) charged for a mortgage is 150 basis points (1.5%) or more above a survey-based estimate of similar mortgages.
Primary intended users: Students in GWU DNSC 6290 course.
Out-of-scope use cases: Any use beyond an educational example is out-of-scope.

Training Data

Data dictionary:

Name	Modeling Role	Measurement Level	Description
row_id	ID	int	unique row indentifier
black	demographic information	binary	1 = black; 0 = non-black
asian	demographic information	binary	1 = asian; 0 = non-asian
white	demographic information	binary	1 = white; 0 = non-white
amind	demographic information	binary	1 = amind; 0 = non-amind
hipac	demographic information	binary	1 = hipac; 0 = non-hipac
hispanic	demographic information	binary	1 = hispanic; 0 = non-hispanic
non_hispanic	demographic information	binary	1 = non_hispanic; 0 = hispanic
male	demographic information	binary	1 = male; 0 = female
female	demographic information	binary	1 = female; 0 = male
agegte62	demographic information	binary	1 = agegte62; 0 = agelt62
agelt62	demographic information	binary	1 = agelt62; 0 = agegte62
conforming	inputs	binary	1 = the mortgage conforms to normal standards; 0 = the loan is different
debt_to_income_ratio_std	inputs	float	standardized debt-to-income ratio for mortgage applicants
debt_to_income_ratio_missing	inputs	binary	1 = missing debt_to_income_ratio_std; 0 = non-missing debt_to_income_ratio_std
income_std	inputs	float	standardized income for mortgage applicants
loan_amount_std	inputs	float	standardized amount of the mortgage for applicants
intro_rate_period_std	inputs	float	standardized introductory rate period for mortgage applicants
loan_to_value_ratio_std	inputs	float	ratio of the mortgage size to the value of the property for mortgage applicants
no_intro_rate_period_std	inputs	binary	-4.09174686250922 = a mortgage does not include an introductory rate period; 0.244394395255126 = a mortgage includes an introductory rate period
property_value_std	inputs	float	value of the mortgaged property
term_360	inputs	binary	1 = the mortgage is a standard 360 month mortgage; 0 = the mortgage is a different type of mortgage
high_priced	target	binary	whether the annual percentage rate (APR) charged for a mortgage is (not) 150 basis points (1.5%) or more above a survey-based estimate of similar mortgages, 1 = is; 0 = not

Source of training data: Home Mortgage Disclosure Act (HMDA) data
How training data was divided into training and validation data: 70% training, 30% validation
Number of rows in training and validation data:
- Training rows: 112,253
- Validation rows: 48,085

Evaluation Data

Source of test data: Home Mortgage Disclosure Act (HMDA) data
Number of rows in test data: 19,831
State any differences in columns between training and test data: The test data does not contain the target column.

Model Details

Final selected model: Monotonic XGBoost

Columns used as inputs in the final model: 'term_360', 'conforming', 'debt_to_income_ratio_missing', 'loan_amount_std', 'loan_to_value_ratio_std', 'no_intro_rate_period_std', 'intro_rate_period_std', 'property_value_std', 'income_std', 'debt_to_income_ratio_std'

Column(s) used as target(s) in the final model: 'high_priced'
Type of model: Monotonic XGBoost model
Software used to implement the model: h2o, interpret, jupyter, matplotlib, numpy, pandas, scikit-learn, seaborn, xgboost
Version of the modeling software:
- h2o==3.32.1.3
- interpret==0.2.4
- jupyter==1.0.0
- matplotlib==3.3.4
- numpy==1.19.5
- pandas==1.1.5
- scikit-learn==0.24.2
- seaborn==0.11.1
- xgboost==1.4.2
Hyperparameters or other settings of your model:
- 'colsample_bytree': 0.3,
- 'colsample_bylevel': 0.3,
- 'eta': 0.05,
- 'max_depth': 3,
- 'reg_alpha': 0.0005,
- 'reg_lambda': 0.0005,
- 'subsample': 0.3,
- 'min_child_weight': 10,
- 'gamma': 0.2
- 'n_jobs': 4,
- 'random_state': 12345,
- 'mono_constraints': (1, 1, 1, -1, 0, -1, 0, 1, 0, 1)

Quantitative Analysis

Metrics used to evaluate your final model: AUC & AIR
State the final values of the metrics for all data: training, validation, and test data:

Partition AUC

Training 0.7894

Validation 0.7907

Evaluation 0.8294

Compare v. Control AIR

Asian people vs. White people 1.146

Black people vs. White people 0.805

Females vs. Males 0.949
Provide any plots related to your data or final model -- be sure to label the plots!:
- Global feature importance:
- Local feature importance:
- Partial dependence:
- Bias testing grid search results:
- Variable importance for stolen model:
- Adversarial examples:
- Global logloss residuals:

Ethical Considerations

Describe potential negative impacts of using your model
- Math or software problems: Due to compatibility issues between h2o and xgboost, the stolen decision tree model generated had unusual big numbers, causing adversial examples full of missing values.
- Real-world risks: who, what, when or how: The adverse impact ratio of Black-to-White, although > 0.8, is relatively low (0.805), which could lead to unfair difference in outcomes. When using this model to predict high-priced mortgages, for every 100,000 low-priced loans to white people, there are only 80,500 low-priced loans to Black people.
Describe potential uncertainties relating to the impacts of using your model
- Math or software problems: With different versions of software, some codes may need to be changed, especially for xgboost.
- Real-world risks: who, what, when or how: Hackers may steal the model to do model extraction attack. According to adversial examples results in Red-teaming, the hackers can get predictions they want by inputting missing values in multiple features.
Describe any unexpected or results: Although this model was tested and remediated for bias, there is much more to bias than models and data, and this model should be monitored for bias issues moving forward.

Name		Name	Last commit message	Last commit date
Latest commit History 105 Commits
Assignment 1		Assignment 1
Assignment 2		Assignment 2
Assignment 3		Assignment 3
Assignment 4		Assignment 4
Assignment 5		Assignment 5
Data		Data
Image		Image
LICENSE.md		LICENSE.md
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DNSC6290 Responsible Machine Learning

Baisc Information

Assignment workflows

Intended Use

Training Data

Evaluation Data

Model Details

Quantitative Analysis

Ethical Considerations

About

Releases

Packages

Languages

Compare v. Control	AIR
Asian people vs. White people	1.146
Black people vs. White people	0.805
Females vs. Males	0.949

Partition	AUC
Training	0.7894
Validation	0.7907
Evaluation	0.8294

License

Mshuning/DNSC6290-Home-Mortgage-Disclosure-Act-Analysis-Project

Folders and files

Latest commit

History

Repository files navigation

DNSC6290 Responsible Machine Learning

Baisc Information

Assignment workflows

Intended Use

Training Data

Evaluation Data

Model Details

Quantitative Analysis

Ethical Considerations

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages