Credit Risk Analysis

Data Analysis, SQL, Random Forest, Pandas, Machine Learning

I built a credit risk scoring project using the UCI Credit Card dataset, which contains 30,000 customer records with demographic, financial, and repayment history data. The goal was to explore real-world credit behavior, uncover patterns in default risk, and build a model that can predict which customers are more likely to default on their next payment.

Most Likely Default Analysis

The project began with extensive data cleaning and exploration using Pandas and SQL. Through visualization and analysis, I examined how credit limits, spending behavior, and default rates vary across education levels, marital status, age groups, and sex.
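The group-level comparisons described above can be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the rows are made-up toy values, the `default` column name and the `summarize_by` helper are hypothetical, and `LIMIT_BAL`, `EDUCATION`, and `AGE` are assumed to follow the dataset's usual column naming.

```python
import pandas as pd

# Toy rows for illustration only; these values are NOT from the real dataset.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 500000, 30000],
    "EDUCATION": [2, 2, 1, 1, 3, 3],
    "AGE": [24, 26, 34, 37, 57, 45],
    "default": [1, 1, 0, 0, 0, 1],  # hypothetical name for the target column
})

def summarize_by(frame, column):
    """Return customer count, mean credit limit, and default rate per group."""
    return (
        frame.groupby(column, observed=True)
        .agg(
            customers=("default", "size"),
            avg_limit=("LIMIT_BAL", "mean"),
            default_rate=("default", "mean"),
        )
        .reset_index()
    )

# Bucket ages into decades so default rates can be compared across age groups.
df["age_group"] = pd.cut(
    df["AGE"],
    bins=[20, 30, 40, 50, 60],
    right=False,  # intervals are [20, 30), [30, 40), ...
    labels=["20-29", "30-39", "40-49", "50-59"],
)

by_education = summarize_by(df, "EDUCATION")
by_age = summarize_by(df, "age_group")

print(by_education)
print(by_age)
```

The same helper can be pointed at marital status or sex by passing a different column name, which keeps each comparison to a single line.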

Sample Data Format
Average Limits Analysis

After understanding the data, I trained a Random Forest model to classify customers as likely to default or not. The model learned from features such as credit limit, utilization ratio, age, education, marital status, and repayment history. Random Forest was chosen for its ability to capture complex patterns by combining many decision trees into a single, more stable prediction.
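A training step along these lines might look like the sketch below, using scikit-learn. Everything here is an assumption made for illustration: the data is synthetic (generated so the example runs on its own), the lowercase feature names are hypothetical stand-ins, the utilization ratio is assumed to be bill amount divided by credit limit, and the hyperparameters are not the ones used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the cleaned dataset (hypothetical column names).
data = pd.DataFrame({
    "limit_bal": rng.integers(10_000, 500_000, size=n),
    "bill_amt": rng.integers(0, 200_000, size=n),
    "age": rng.integers(21, 70, size=n),
    "education": rng.integers(1, 5, size=n),
    "marriage": rng.integers(1, 4, size=n),
    "pay_delay": rng.integers(-1, 4, size=n),  # months of repayment delay
})

# Assumed definition of utilization: billed amount relative to credit limit.
data["utilization"] = data["bill_amt"] / data["limit_bal"]

# Synthetic label: default is made more likely by delays and high utilization.
logit = -2.0 + 0.9 * data["pay_delay"] + 0.5 * data["utilization"].clip(upper=3)
data["default"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

features = ["limit_bal", "utilization", "age", "education", "marriage", "pay_delay"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features],
    data["default"],
    test_size=0.25,
    stratify=data["default"],  # keep the default rate similar in both splits
    random_state=0,
)

# Illustrative settings; the averaged vote of many trees is what makes the
# prediction more stable than any single decision tree.
model = RandomForestClassifier(
    n_estimators=200,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=0,
)
model.fit(X_train, y_train)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

Stratifying the split matters for default prediction because defaulters are usually the minority class, and an unlucky split could otherwise leave the test set with too few of them.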

Feature Importance

The model achieved a ROC AUC of 0.86, indicating strong performance in distinguishing higher-risk customers from lower-risk ones. The analysis also highlighted clear behavioral patterns, including higher default rates among younger customers, increased risk in the older age groups, and noticeable differences across education levels and sex.
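To make the AUC figure concrete: it can be read as the probability that a randomly chosen defaulter receives a higher risk score than a randomly chosen non-defaulter. The sketch below checks that reading on four made-up scores (not project results) by comparing scikit-learn's `roc_auc_score` with a hand-rolled pairwise count; `pairwise_auc` is a hypothetical helper written for this illustration.

```python
from itertools import product

from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels (1 = default) and predicted risk scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve as computed by scikit-learn.
auc = roc_auc_score(y_true, y_score)

def pairwise_auc(labels, scores):
    """Fraction of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p, q in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# The points that would be plotted to draw the ROC curve itself.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(auc, pairwise_auc(y_true, y_score))
```

Here three of the four defaulter/non-defaulter pairs are ordered correctly, so both calculations give 0.75. In practice the scores would come from the model's `predict_proba` output on held-out data.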

ROC Curve

Real-World Impact

This project stood out to me because of its practical relevance. Financial institutions can use credit risk models like this one to inform lending decisions, design adaptive credit limits, and help prevent individuals from falling into unmanageable debt. Through this work, I strengthened my skills in SQL, Pandas, data analysis, and machine learning while gaining hands-on experience turning raw data into actionable findings.