What Do These Models Mean?

Remember, our goal in this research project is to improve financial risk modeling using machine learning techniques.

To do this, we generated multiple machine learning models that predict mortgage default. By building multiple models with different techniques and processes, we can compare (and visualize!) what works and what doesn't.

We will define and explore these models below.

Psst! If you want to skip straight to results, check out our Results tab!

General Model Decisions:

AKA: What is consistent across all of the models we implemented?

Model Architecture: Thanks to domain research and a suggestion from our mentor, we chose to build every one of our machine learning models with a single library: XGBoost. XGBoost excels[1] at synthesizing and modeling dense, tabular time-series data (which is exactly the kind of data we are using).
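To make this concrete, here is a minimal sketch of the kind of XGBoost classifier we mean. The toy data, feature count, and hyperparameters are illustrative assumptions, not our exact configuration:

```python
import numpy as np
import xgboost as xgb

# Toy stand-in for tabular loan data: rows of features,
# labels are 1 if the mortgage defaulted, 0 otherwise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))      # 4 hypothetical feature columns
y = rng.integers(0, 2, size=1000)   # hypothetical default labels

model = xgb.XGBClassifier(
    objective="binary:logistic",  # default prediction is a binary task
    n_estimators=200,
    max_depth=4,
    eval_metric="auc",
)
model.fit(X, y)
default_probabilities = model.predict_proba(X)[:, 1]
```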

Language: We worked in Python[2] as we were most familiar with the language.

Environment: To ensure we were all working on the same OS, we ran and hosted our code on UCSD's DSMLP. You can find out more about how we used DSMLP here. We also used Docker images[3] to standardize our OS.


1. ["XGboost for Timeseries Forecasting", Analytics Vidha]
2. [Python.org]
3. [Docker Docs]

Specific Model Architectures

Now let's dive into the specific models we are testing.


The Baseline Model

We wanted this model to represent how banks currently predict mortgage defaults using credit score. For simplicity's sake, we assume the bank uses credit score as its single input.

We will use this baseline to contextualize our other models and see if/how much they "improved" performance compared to existing technologies. (A minimal sketch of this setup follows the feature list below.)

  • Features: Credit Score only
  • Data Loading Tool: Pandas (only able to load 5,000 data points for train/test/val)
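Here is a hedged sketch of what the baseline looks like in code. The file path and the column names (CREDIT_SCORE, DEFAULTED) are illustrative assumptions about the Freddie Mac extract, not the exact schema:

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Pandas loads everything into memory at once, which is why this
# iteration is capped at roughly 5,000 rows for train/test/val.
# File path and column names below are hypothetical.
df = pd.read_csv("freddie_mac_sample.csv", nrows=5_000)

X = df[["CREDIT_SCORE"]]   # the single baseline feature
y = df["DEFAULTED"]        # hypothetical binary default label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = xgb.XGBClassifier(objective="binary:logistic", eval_metric="auc")
baseline.fit(X_train, y_train)

probs = baseline.predict_proba(X_test)[:, 1]
print("Baseline AUC:", roc_auc_score(y_test, probs))
```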

Next, we created model iterations with new techniques/tools to try to beat the baseline model.


The Original Model

This model predicts mortgage defaults using credit score plus additional contextual information pulled from the Freddie Mac dataset. The additional context (extra features/columns) includes the number of co-signers, which bank the mortgage was approved at, and so on.

  • Features: Credit Score + Freddie Mac Contextual Columns
  • Data Loading Tool: Pandas (only able to load 5,000 data points for train/test/val)
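Some of those contextual columns are categorical (for example, the originating bank), so they need encoding before XGBoost can split on them. A small sketch, with hypothetical column names:

```python
import pandas as pd
import xgboost as xgb

# Hypothetical Freddie Mac columns: numeric plus one categorical.
df = pd.read_csv("freddie_mac_sample.csv", nrows=5_000)

features = df[["CREDIT_SCORE", "NUM_COSIGNERS", "SELLER_NAME"]]
# One-hot encode categorical columns such as the originating bank
# so the tree ensemble can use them as split candidates.
features = pd.get_dummies(features, columns=["SELLER_NAME"])

model = xgb.XGBClassifier(objective="binary:logistic", eval_metric="auc")
model.fit(features, df["DEFAULTED"])
```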

The Original + Parquet Model

This code uses the same features/columns as the previous model, but loads data from Parquet files. Because Parquet is a compressed, columnar format that can be read in partitions, far more data can be loaded without overwhelming the system's memory.

  • Features: Credit Score + Freddie Mac Contextual Columns
  • Data Loading Tool: Parquet (able to load 3,000,000+ data points for train/test/val)
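A sketch of partitioned Parquet loading via pyarrow; the dataset path and column names are hypothetical:

```python
import pandas as pd
import pyarrow.dataset as ds

# Open the (hypothetical) Parquet dataset lazily; nothing is read yet.
dataset = ds.dataset("freddie_mac_parquet/", format="parquet")

# Stream the data in record batches instead of loading it all at once.
# Reading only the columns we need keeps the final frame small enough
# for 3,000,000+ rows to fit in memory.
frames = []
for batch in dataset.to_batches(
    columns=["CREDIT_SCORE", "NUM_COSIGNERS", "DEFAULTED"]
):
    frames.append(batch.to_pandas())

df = pd.concat(frames, ignore_index=True)
print(f"Loaded {len(df):,} rows")
```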

The Original + Parquet + Feature Engineering Model

This code uses the same features and Parquet loading as the previous model, but also applies feature engineering to the columns. A package called featuretools[4] recursively synthesizes new features from the existing columns, providing the model with more insight and context.

An example of feature engineering: given a column Number of Co-Signers, feature engineering can iterate over the column and generate the mean, median, and mode values per zip code. This might help the model put a raw number in context and decide whether it helps classify default or not.
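That per-zip-code aggregation can be expressed directly in pandas. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical loan-level frame.
df = pd.DataFrame({
    "ZIP_CODE":      ["92093", "92093", "90210", "90210", "90210"],
    "NUM_COSIGNERS": [0, 2, 1, 1, 3],
})

# Broadcast per-zip-code statistics back onto each row,
# giving the model local context for the raw value.
grouped = df.groupby("ZIP_CODE")["NUM_COSIGNERS"]
df["COSIGNERS_ZIP_MEAN"] = grouped.transform("mean")
df["COSIGNERS_ZIP_MEDIAN"] = grouped.transform("median")
df["COSIGNERS_ZIP_MODE"] = grouped.transform(lambda s: s.mode().iloc[0])
print(df)
```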

Note: this model iteration uses the most data (both in terms of rows loaded and columns generated) out of all the model iterations.

  • Features: Credit Score + Freddie Mac Contextual Columns + Feature Engineered Context Columns
  • Data Loading Tool: Parquet (able to load 3,000,000+ data points for train/test/val)
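On the featuretools side, the library's Deep Feature Synthesis (DFS) automates aggregations like the one above. A rough sketch, assuming a loan-level dataframe with a loan_id index and a zip_code column (all names hypothetical):

```python
import featuretools as ft
import pandas as pd

# Hypothetical loan-level data.
loans = pd.DataFrame({
    "loan_id":       [1, 2, 3, 4],
    "zip_code":      ["92093", "92093", "90210", "90210"],
    "credit_score":  [700, 640, 710, 590],
    "num_cosigners": [0, 2, 1, 1],
})

es = ft.EntitySet(id="mortgages")
es = es.add_dataframe(dataframe_name="loans", dataframe=loans, index="loan_id")

# Split zip codes into their own dataframe so DFS can aggregate per zip.
es = es.normalize_dataframe(
    base_dataframe_name="loans",
    new_dataframe_name="zip_codes",
    index="zip_code",
)

# DFS recursively builds aggregate features, e.g.
# zip_codes.MEAN(loans.num_cosigners), joined back onto each loan.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="loans",
    agg_primitives=["mean", "median"],
    max_depth=2,
)
print(feature_defs)
```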

4. [Featuretools Documentation, Alteryx]