Mintos meets Data Science #1


In this article, I’m going to walk through a first approach to applying Data Science to boost the performance of our Mintos portfolios.

If you are not using Mintos yet, read my Mintos review. Sign up through my affiliate link to receive up to 1% cashback after the 30th day of registration. Holy moly!

Let’s crack the code

First off, we can download the Excel file that Mintos provides on its Statistics page (you must be logged in first). It can easily be loaded into a pandas dataframe like this:

# Required imports (not shown in the original snippet)
import glob
import pandas as pd

# The folder which contains the Excel files
files_xls = glob.glob('/home/mllamaza/projects/mintos/excel/mintos*.xlsx')

# Collect one dataframe per downloaded file
frames = []

# Loop over the downloaded files
for f in files_xls:
    print('Loading', f, '...')

    df_tmp = pd.read_excel(f,
                 sheet_name='Sheet1',
                 skiprows=1,
                 names=['id', 'issue_date', 'listing_date', 'country', 'loan_orig', 'loan_type',
                        'interest', 'term', 'collateral', 'initial_ltv', 'ltv', 'loan_status',
                        'buyback_reason', 'ini_loan_amount', 'rem_loan_amount', 'currency', 'buyback'])

    frames.append(df_tmp)

# Concatenate everything into a single dataframe and drop null rows
df = pd.concat(frames, ignore_index=True)
df.dropna(inplace=True)

Lovely! So the very first step is to ask ourselves: what do we want to know? And afterwards: how can we define the best target for that goal?

What do we want to know?

Which portfolio configuration will give us the highest returns, while playing only with trusted loan originators and with the interest-rate and term-duration ranges that have the lowest rate of payment delays. Some investors prefer high-interest contracts even if they are not paid on time, but late payments mean we lose the chance of reinvesting that money in the meantime, which can hurt our overall ROI.

How can we define the best target for that goal?

In the Excel file they provide, combining the Loan Status and Buyback reason columns, we can determine the current status of each active loan (Current, Late, Default, etc.), and we can also see how past loans ended (Early payment, As scheduled, Bad debt, etc.).

What’s the catch? We can only see the status of the last payment, not the previous ones. Let’s suppose the first payment of a loan is paid Late +60 but the second one is paid on time: the Excel will simply display Current for this contract. (Although we could handle this by creating a script that accesses each loan URL by its id and parses the HTML of its payments table, we’d rather use a simpler method, at least for this first approach.)

I ended up applying different weights to our active loans depending on their status. Why? Simply because a Late +60 status is far worse than a Grace Period.
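
To give an idea of what I mean, here is a minimal sketch of that weighting; the status labels and the weight values below are illustrative assumptions, not the exact ones used in the real script:

# Illustrative status weights (assumed labels and values): lower = worse status
status_weights = {
    'Current': 1.0,
    'Grace Period': 0.8,
    'Late 1-15': 0.6,
    'Late 16-30': 0.4,
    'Late 31-60': 0.2,
    'Late 60+': 0.0,
}

# Map each active loan to its weight; unknown statuses become NaN and can be dropped
df['status_weight'] = df['loan_status'].map(status_weights)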

What does the data say?

Now let’s see a fancy correlation map taking our target column as a reference:
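
For the curious, a map like this can be generated with pandas and seaborn; the sketch below assumes the weighted target lives in a column called status_weight (the assumed name from the earlier snippet), so take it as an illustration rather than the exact code:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation of every numeric column against the assumed target column
corr = (df.select_dtypes(include='number')
          .corr()[['status_weight']]
          .sort_values('status_weight', ascending=False))

# One-column heatmap, centred on zero so the sign of each correlation stands out
plt.figure(figsize=(6, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.show()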

Interesting, huh?

Designing the algorithm

These are the steps I followed:

  1. Data gathering: download the data from their website and load it all into our database.
  2. Preprocessing: cast float columns, trim and clean string columns, handle missing data, remove nulls…
  3. Feature generation: rank each loan originator by trustworthiness and create several metric columns to measure average performance.
  4. Brute-force simulation: simulate the performance of every possible combination of portfolio settings by tuning the interest rate and contract duration (see the sketch after this list).
  5. Output formatting: sort the output based on our criteria, filter out results that don’t have enough data and apply some formatting to make it more readable.
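
To make step 4 concrete, here is a minimal sketch of the brute-force loop. The bucket edges, the grouping columns and the aggregated metrics are assumptions for illustration (status_weight is the assumed column from earlier), not the exact ones I used:

import itertools
import pandas as pd

# Illustrative bucket edges for interest rate (%) and term (months)
int_edges = list(range(8, 21, 2))
term_edges = list(range(0, 73, 12))

results = []
for (int_min, int_max), (term_min, term_max) in itertools.product(
        zip(int_edges[:-1], int_edges[1:]),
        zip(term_edges[:-1], term_edges[1:])):

    # Active loans falling inside this hypothetical portfolio configuration
    subset = df[df['interest'].between(int_min, int_max) &
                df['term'].between(term_min, term_max)]
    if subset.empty:
        continue

    # Aggregate a few example metrics per originator / country / loan type
    grouped = (subset.groupby(['loan_orig', 'country', 'loan_type'])
                     .agg(tot_count=('id', 'count'),
                          int_mean=('interest', 'mean'),
                          accuracy=('status_weight', 'mean'))
                     .reset_index()
                     .assign(int_min=int_min, int_max=int_max,
                             term_min=term_min, term_max=term_max))
    results.append(grouped)

df_results = pd.concat(results, ignore_index=True)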

The outcome

After these steps, we have a dataframe with the following columns: loan_orig, country, loan_type, term_min, term_max, int_min, int_max, tot_count, orig_trust, int_mean, accuracy and rank. Let’s explain the less obvious ones:

  • tot_count: total count of active contracts matching this criterion.
  • orig_trust: the trustworthiness of the loan originator.
  • accuracy: the weighted average of the statuses of the matching active contracts.
  • rank: a custom column created by merging the previous numbers with a weighted average (see the sketch below).
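
As a rough illustration of the last point, rank could be computed along these lines; the weights are invented for the example, and orig_trust is assumed to have been scored elsewhere:

# Illustrative blend: the weights are assumptions, not the real ones.
# int_mean and tot_count are scaled to 0-1 so everything is comparable.
df_results['rank'] = (0.4 * df_results['accuracy'] +
                      0.3 * df_results['orig_trust'] +
                      0.2 * df_results['int_mean'] / df_results['int_mean'].max() +
                      0.1 * df_results['tot_count'] / df_results['tot_count'].max())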

Let’s sort it in reverse order so we can spot the worst portfolio configurations we could ever set:

Kids, don’t try this at home

Finally, I sorted the dataframe properly and applied some fancy colour formatting like this:

# Required imports (not shown in the original snippet)
import numpy as np
import pandas as pd
import seaborn as sns

def plot_dataframe(df_ref):

    # Capitalize text columns
    cols = ['country', 'loan_orig', 'loan_type']
    df_ref[cols] = df_ref[cols].apply(lambda x: x.astype(str).str.title())

    # Assign a group code to each unique term/interest combination
    g = pd.Categorical(df_ref.term_min + df_ref.term_max + df_ref.int_min + df_ref.int_max).codes
    n = np.unique(g).size

    # Define the colour palette (one colour per group)
    palette = sns.color_palette("Pastel2", n).as_hex()

    # Apply the palette to each group and add bars to the numeric columns
    df_ref = df_ref.style.apply(
        lambda x: ['background-color: {}'.format(palette[i]) for i in g],
        subset=['loan_orig', 'country', 'loan_type', 'term_min', 'term_max', 'int_min', 'int_max']
    ).bar(subset=['tot_count', 'orig_trust', 'int_mean', 'accuracy', 'rank', 'rank_mean', 'sel_ind'], color='#ea8685')

    return df_ref
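
In a Jupyter notebook the function can then be called like this, assuming the ranked results live in a dataframe called df_results (a name used only for this example):

# Display the 20 best-ranked portfolio configurations, coloured per configuration
plot_dataframe(df_results.sort_values('rank', ascending=False).head(20))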

Therefore, each portfolio we are supposed to create is coloured differently and ranked according to its performance. This is the final output (I’ve blurred the interesting columns for obvious reasons):

*breathing intensifies*

Hands-free method

But why stop here? I know the importance of rebalancing when managing a portfolio, so we want to do the same here. But I’m too lazy to download the Excel files and run my scripts by hand, so I created a script, hosted on my web server (3 GHz, 4 GB RAM), that does the dirty work for me (the whole process takes 20 minutes-ish):

  1. It is executed automatically (via crontab) on the first day of each month.
  2. It connects to Mintos and downloads the Loan Book Excel.
  3. It loads these data into our database and generates an up-to-date report like the one displayed in the previous section.
  4. Finally, it sends the report via SMTP email (a minimal sketch follows this list).
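
The emailing step can be as simple as Python’s standard smtplib; the addresses, server, credentials, report path and script name below are placeholders, not my real setup:

# Scheduled via crontab, e.g. at 06:00 on the 1st of every month:
# 0 6 1 * * /usr/bin/python3 /home/mllamaza/projects/mintos/monthly_report.py
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Mintos monthly report'
msg['From'] = 'reports@example.com'      # placeholder address
msg['To'] = 'me@example.com'             # placeholder address
msg.set_content('The monthly Mintos report is attached.')

# Attach the generated report (placeholder path)
with open('/tmp/mintos_report.html', 'rb') as f:
    msg.add_attachment(f.read(), maintype='text', subtype='html',
                       filename='mintos_report.html')

# Placeholder SMTP server and credentials
with smtplib.SMTP('smtp.example.com', 587) as server:
    server.starttls()
    server.login('user', 'password')
    server.send_message(msg)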

The bottom line

I hope you have enjoyed this first approach. It can be improved by adding more columns, like Amortization method, to create more specific configurations (being careful not to overfit, of course), or why not use Machine Learning models to classify the loans according to our needs? The possibilities are endless…

Depending on the popularity of this article I’ll release a Part 2, explaining these steps in more detail, or maybe applying some of the fancy new techniques mentioned above. It is completely up to you!
