How To Build a Prediction API in 10 Minutes with Flask, Swagger, and SciPy

I’ve seen a lot of hype around prediction APIs recently. This is obviously a byproduct of the current data science fad.

As a public service, I’m going to show you how you can build your own prediction API … and I’ll do it by creating a very basic version in 10 minutes.

We will build an API that will determine if we should provide credit to someone based on certain demographic information.

We will use Kaggle’s “Give Me Some Credit” dataset as the basis for this example.

Go to the “Give Me Some Credit” page, and download the files.

You will have 4 files:

  • cs-training.csv
  • cs-test.csv
  • sampleEntry.csv
  • DataDictionary.xls

We will only need the cs-training.csv and DataDictionary.xls files for this project.

Create application folder

Use the following commands to create a directory and move into it.

In the directory, create a Python file. We will incrementally build the program in that file.

Load Data Set

We need to parse the cs-training.csv file so that we can make sense of the data. The following shows the first 5 lines of data from cs-training.csv

We can observe the following fields in the file

  • SeriousDlqin2yrs
  • RevolvingUtilizationOfUnsecuredLines
  • age
  • NumberOfTime30-59DaysPastDueNotWorse
  • DebtRatio
  • MonthlyIncome
  • NumberOfOpenCreditLinesAndLoans
  • NumberOfTimes90DaysLate
  • NumberRealEstateLoansOrLines
  • NumberOfTime60-89DaysPastDueNotWorse
  • NumberOfDependents

DataDictionary.xls contains a description of each column except the first. However, the first column is obviously a row identifier.

For this example, we want to predict if someone is likely to be a credit risk based on past data.

The column “SeriousDlqin2yrs” is our “outcome” feature (the label we want to predict) and the rest of the columns are our “target” features (the inputs we feed to the classifier).

We want to create a classifier that, given some target features, can predict the outcome feature. To do that, we need to do what is known as “feature extraction”. The following code will do that with pandas.
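Here is a minimal sketch of that extraction step. The function name `extract_features`, the inline sample rows, and the choice to fill missing values with 0 are all my assumptions for illustration; the sample values are invented and only the column names come from cs-training.csv.

```python
import io

import pandas as pd

# Hypothetical rows laid out like cs-training.csv (values invented for illustration).
SAMPLE_CSV = """,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.77,45,2,0.80,9120,13,0,6,0,2
2,0,0.96,40,0,0.12,2600,4,0,0,0,1
3,0,0.66,38,1,0.09,,2,1,0,0,0
4,0,0.23,30,0,0.04,3300,5,0,0,0,
"""


def extract_features(csv_source):
    """Split the data into target features and the outcome column."""
    # index_col=0 treats the unnamed first column as the row identifier.
    frame = pd.read_csv(csv_source, index_col=0)
    # Assumption: replace missing values with 0 so the classifier only sees numbers.
    frame = frame.fillna(0)
    outcome = frame["SeriousDlqin2yrs"]
    features = frame.drop(columns=["SeriousDlqin2yrs"])
    return features, outcome


features, outcome = extract_features(io.StringIO(SAMPLE_CSV))
print(features.shape)
print(outcome.tolist())
```

To run it against the real data you would pass the path to cs-training.csv instead of the in-memory sample.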

Generate Training and Testing Set

We now have to separate our data into two disjoint sets: a training set, and a testing set.

We have to do this because we will use a simple form of “cross-validation” (a hold-out test set) to measure the accuracy of our predictive model.

We will train our classifier on the training set and test its accuracy on the testing set.

Intuitively, our classifier should classify credit risks in the testing set the same way it would in the real world. This makes the testing set a proxy for how it would behave in production.
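A rough sketch of the split, assuming scikit-learn's `train_test_split` and a 60/40 split. The split ratio, the random seed, and the stand-in random data are my assumptions, not something taken from the original script.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows with 3 features each, plus random 0/1 outcomes.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
outcome = rng.integers(0, 2, size=100)

# Hold out 40% of the rows for testing; random_state makes the split repeatable.
train_X, test_X, train_y, test_y = train_test_split(
    features, outcome, test_size=0.4, random_state=0
)

print(train_X.shape, test_X.shape)
```

The two sets are disjoint by construction: every row ends up in exactly one of them.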

Define Classifier Type

scikit-learn, part of the SciPy ecosystem, comes with a bunch of baked-in classifiers. We will use the default Naive Bayes classifier for this example.

Train Classifier

To train the model, we simply have to feed the classifier the target features and the outcome variable.
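As a sketch of defining and training the classifier, I am assuming the “default” Naive Bayes means scikit-learn's `GaussianNB`. The two synthetic clusters below are invented stand-ins for the real training set.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented training data: two well-separated clusters, one per outcome.
rng = np.random.default_rng(0)
low_risk = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
high_risk = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
train_X = np.vstack([low_risk, high_risk])
train_y = np.array([0] * 50 + [1] * 50)

# Define the classifier, then fit it on the features and outcomes.
classifier = GaussianNB()
classifier.fit(train_X, train_y)

# Ask the trained model about one point near each cluster centre.
print(classifier.predict([[0.0, 0.0], [5.0, 5.0]]))
```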

Validate Classifier

Now that we have our classifier, we can verify its predictions against our test set.

The output for this script is the following

The output shows that we have roughly 93% accuracy (55,847 correct out of 60,135 test rows), with the following breakdown of correct and incorrect predictions

  • 55737 true positives
  • 110 true negatives
  • 212 false positives
  • 4076 false negatives
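The validation step can be sketched with scikit-learn's `accuracy_score` and `confusion_matrix`. The tiny label lists are hypothetical; the last calculation simply recomputes the accuracy from the four counts listed above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test labels and predictions, just to show the calls.
actual = [0, 0, 0, 0, 1, 1, 0, 0]
predicted = [0, 0, 0, 1, 0, 1, 0, 0]

accuracy = accuracy_score(actual, predicted)

# For 0/1 labels, rows are actual classes and columns are predicted classes,
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(accuracy, tn, fp, fn, tp)

# Accuracy is (correct predictions) / (all predictions); here using the counts above.
reported_accuracy = (55737 + 110) / (55737 + 110 + 212 + 4076)
print(round(reported_accuracy, 4))
```

Note how lopsided the counts are: most of the accuracy comes from the majority class, which is one reason accuracy alone is a weak measure for this kind of data.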

Save Classifier

With our classifier done, we can save it so that we can use it in a separate program.
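One way to do this, sketched with the standard library's `pickle` module. The file name `classifier.pkl` and the toy model are my assumptions; the original script may have used a different serializer.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy model standing in for the classifier trained earlier.
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
train_y = np.array([0, 0, 1, 1])
classifier = GaussianNB().fit(train_X, train_y)

# Hypothetical file name; a temp directory keeps the sketch self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# Save the trained model to disk.
with open(model_path, "wb") as handle:
    pickle.dump(classifier, handle)

# A separate program would only need this part to get the model back.
with open(model_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored.predict([[0.1, 0.1], [5.1, 5.0]]))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.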

Create Web API

With our model created, we can now create our web service that can decide if we should give credit to someone based on certain demographic information.

Create the file

Install flask-restplus from the command line

flask-restplus makes creating flask and swagger applications much simpler.

The following code will set up the scaffolding for a flask application

The following code will set up the request parameters for our web service

This code will take the request parameters, feed them into the model, and determine the eligibility for extending credit.
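Here is a rough sketch of that request-to-prediction flow. To keep it short I am using plain Flask rather than flask-restplus, so there is no Swagger page in this version. The `/prediction` route, the two-feature toy model, and the `extend_credit` response field are all hypothetical names I made up for illustration.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature model standing in for the saved classifier.
FEATURE_NAMES = ["age", "DebtRatio"]
classifier = GaussianNB().fit(
    np.array([[25, 0.9], [30, 0.8], [55, 0.1], [60, 0.2]]),
    np.array([1, 1, 0, 0]),  # 1 means a serious delinquency was observed
)

app = Flask(__name__)


@app.route("/prediction", methods=["GET"])
def prediction():
    # Read each feature from the query string and convert it to a number.
    try:
        row = [float(request.args[name]) for name in FEATURE_NAMES]
    except (KeyError, ValueError):
        message = "expected numeric parameters: " + ", ".join(FEATURE_NAMES)
        return jsonify(error=message), 400
    # Feed the parameters into the model; extend credit only if no risk is predicted.
    risky = int(classifier.predict([row])[0])
    return jsonify(extend_credit=(risky == 0))


# Exercise the endpoint in-process instead of starting a server.
with app.test_client() as client:
    response = client.get("/prediction?age=58&DebtRatio=0.15")
    print(response.get_json())
```

The real service would accept all ten features from the dataset and load the saved classifier instead of training a toy one at startup.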

You can start the flask app from the command line

And you can use the web interface by visiting localhost:5000


You can also use curl to get a response from the flask app

You can get the complete code on my github repo.


So there you have it: a prediction API built in about 10 minutes.

I would never actually put this into production. A real production prediction API would need to handle edge cases, and we would need to do model selection.

However, the basic nuts and bolts of a prediction API are pretty straightforward. There really isn’t any magic to building a prediction engine.


How To Create Non-Reproducible Results in Academic Research: A Story of Survivor Bias

I had a discussion today with a friend about the recent news that some researchers couldn’t reproduce a significant number of the results published in academic journals.

He believed that this indicated rampant cheating in the academic community. I disagreed with him, though.

According to Hanlon’s Razor, we should never attribute to malice that which we can attribute to stupidity; so, I argued that we should prefer to believe in massive incompetence instead of some evil grand conspiracy.

I used a very simple thought experiment to illustrate this.

Stupid Is as Stupid Does


Suppose that some postgraduate student wants to research the following question: “Does spanking children increase their likelihood of going to jail?”.

Suppose that after a few months of collecting data, our postgraduate student found that spanked people went to jail at a rate 50% higher than non-spanked people. Does this suggest that spanking children increases the chance of committing crime?

Well, that depends on how many people we included in our sample.

Suppose that the data looked like this:


[Table: counts of Jail and No Jail outcomes for Spanked and Not Spanked people]

This hypothetical study included 200 people, of whom 10 went to jail. If spanking a child has no effect, then we would expect to see the same number of people from both classes (i.e. spanked and not spanked) go to jail. However, in this case we see that 2 extra spanked people went to jail.

Does this suggest that we have a measurable effect worth investigating?

The answer is no, because we could have observed this outcome by mere chance.

Suppose we take any group of 200 people, tag two of them, and randomly assign all 200 people to 4 cells. The probability that both of our tagged people would fall into one cell is around 30%.

We use that fact to claim that our results are not significant.

However, suppose that the data looked like the following, instead:


[Table: counts of Jail and No Jail outcomes for Spanked and Not Spanked people]

In this case, with 20,000 people sampled, we have 200 extra people who have been spanked and went to jail. The probability that all 200 tagged people would randomly fall into one group is less than .01%.

This type of data would suggest that something significant is worth investigating about this phenomenon.

This provides an illustration of the basic theory of “null hypothesis testing” which is also known as “significance testing”.

We use this method to measure the probability of observing some data when we assume that the null hypothesis is correct.

Statisticians call this measurement a “p-value”, and from what I hear, the social sciences heavily use it to determine what they publish and don’t publish.
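To make the p-value idea concrete, here is a sketch that computes one for both examples. Two caveats: the cell counts are my assumption (equal-sized groups, with 6 vs 4 jailed out of 200 and 600 vs 400 jailed out of 20,000, chosen only because they match the totals and the 50% difference described above), and the method is a one-sided exact test based on the hypergeometric distribution rather than the tagging argument I used, so its numbers need not match the rough figures quoted in this post.

```python
from math import comb


def one_sided_p_value(spanked_jailed, spanked_total, jailed_total, population):
    """Chance of seeing at least this many spanked people among the jailed,
    assuming spanking has no effect (the null hypothesis)."""
    not_spanked_total = population - spanked_total
    # Number of ways to pick which people end up in jail.
    denominator = comb(population, jailed_total)
    upper = min(jailed_total, spanked_total)
    # Count the outcomes that are at least as extreme as the observed one.
    numerator = sum(
        comb(spanked_total, k) * comb(not_spanked_total, jailed_total - k)
        for k in range(spanked_jailed, upper + 1)
    )
    return numerator / denominator


# Assumed counts: 100 spanked and 100 not spanked, 10 jailed, 6 of them spanked.
small_study = one_sided_p_value(6, 100, 10, 200)

# Assumed counts: 10,000 per group, 1,000 jailed, 600 of them spanked.
large_study = one_sided_p_value(600, 10_000, 1_000, 20_000)

print(small_study)
print(large_study)
```

Under these assumed counts the small study lands well above the 5% cutoff and the large study lands far below it, which is the same conclusion as the argument above.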

With “significance testing”, we want to reject the null hypothesis, and we use the p-value as the means to reject it. Therefore, we want small p-values, since small p-values provide good evidence against the null hypothesis.

Further, there exists a consensus among some researchers that we should consider p-values less than 5% as significant.

By using 5% as the cutoff for significance testing, the academic community necessarily accepts false positives 5 times out of 100 whenever the null hypothesis is actually true.

In the examples I gave above, we did not have a significant result with 200 sampled people because we had a 30% (i.e. 30 times out of 100) chance to produce that result by accident. However, we did have a significant result with 20,000 sampled people because we had a .01% (i.e. 1 time out of 10,000) chance to produce that result by accident.

Now, suppose that we have 10 postgraduate students researching the question “Does spanking children increase their likelihood of going to jail?”, and they all use a 5% p-value cutoff.

Assuming that spanking has no real effect and that the studies are independent, the probability that at least one of the 10 postgraduate students will get a false positive is roughly 40%.

If we suppose that 20 postgraduate students did this experiment, then the probability that at least one of them will get a false positive is roughly 64%.

If we suppose that 100 postgraduate students did this experiment, then the probability that at least one of them will get a false positive is roughly 99%.

Essentially, if you have enough postgraduate students running the same experiment then you’ll get a “significant” result for pretty much any research question.
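The arithmetic behind those percentages is short enough to check directly. This sketch assumes the studies are independent and that there is no real effect, so each study has a 5% chance of a spurious “significant” result; the function name is my own.

```python
def chance_of_any_spurious_result(studies, threshold=0.05):
    """Probability that at least one of `studies` independent experiments
    crosses the significance threshold purely by chance."""
    # Every study stays below the threshold with probability (1 - threshold).
    chance_all_clean = (1 - threshold) ** studies
    return 1 - chance_all_clean


for studies in (10, 20, 100):
    # Prints roughly 0.40, 0.64 and 0.99 respectively.
    print(studies, round(chance_of_any_spurious_result(studies), 2))
```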

Mediocrity: It Takes A Lot Less Time And Most People Won’t Notice The Difference Until It’s Too Late


The incentives caused by “publish or die” in academia essentially guarantee that we will have a significant number of research papers based on false positives.

In this case, everyone ignores the results that don’t look sexy even when they are true, and focuses only on the sexy results even when they are false.

This is a special case of survivorship bias.

Sadly, we have an easy way to check against this phenomenon that apparently the journals do not use: REPEAT THE EXPERIMENT.

This works because we have only a 1 in 400 chance (5% × 5%) of generating two false positives in a row when we use a 5% p-value cutoff.

The fact that so many articles made it past the peer-review process just exposes how lazy some of our “academics” really are.
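A quick simulation of that replication check, as a sketch. It assumes there is no real effect and that the two runs are independent, in which case each run's p-value is uniformly distributed between 0 and 1; the function name, trial count, and seed are arbitrary choices of mine.

```python
import random


def double_spurious_rate(trials, threshold=0.05, seed=0):
    """Fraction of simulated experiments where both the original run
    and its repeat look significant by pure chance."""
    rng = random.Random(seed)
    both_significant = 0
    for _ in range(trials):
        # Under the null hypothesis, a p-value is a uniform draw from [0, 1).
        first_run = rng.random() < threshold
        repeat_run = rng.random() < threshold
        if first_run and repeat_run:
            both_significant += 1
    return both_significant / trials


exact = 0.05 * 0.05  # 1 in 400
simulated = double_spurious_rate(200_000)
print(exact, simulated)
```

The simulated rate should hover close to the exact value, and it shrinks quickly with each additional repeat.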

Data Science and the Answer to the Ultimate Question of Life, the Universe, and Everything

In “The Hitchhiker’s Guide to the Galaxy”, Douglas Adams tells the story of hyper-intelligent pan-dimensional beings who build a computer named Deep Thought to calculate “the Answer to the Ultimate Question of Life, the Universe, and Everything.” After seven and a half million years, Deep Thought outputs an unintelligible answer: 42.

When they probe Deep Thought for more information, it tells them that they do not understand the answer because they never understood what they had asked.

The moral: make sure you have a good question before you start looking for an answer.

Such is the case with “data science”.

You can employ the most sophisticated data science techniques with the right data crunching technologies, but without clear goals you can’t make sense of the numbers.

Based on this principle, I believe that business analysts contribute the most to the success of any “data science” project: they know what to ask, and they know what an answer should look like.

Unfortunately, I’ve seen many organizations invest heavily in machine learning experts and statisticians who don’t understand the business. They are simply building another Deep Thought that will return unactionable results like “42”.

All this could have been avoided if more people just read science fiction.