Best Machine Learning Practices

What makes a good machine learning model?

A good ML model hinges on the data collection and preparation stages, but the steps taken during modelling are just as important. Give five people the same dataset and it is probable that each will build their model using a different approach. Many of the ML modelling tutorials available online are basic, and most lack adequate information on the do's and don'ts of creating ML models from start to finish. One of the more recent additions to the ML toolkit is the scikit-learn pipeline, which is a good way of building machine learning models in a highly reusable and test-driven manner.

There are also many questions around how to frame the ML problem and how to identify the error metric that fits a specific business problem. Accuracy is not always the answer for a classification model, and neither is RMSE for a regression model.


Model Creation

There are 7 major steps to building a machine learning model. Some of these steps are low-level, but they are an integral part of the coding process.

  1. Data Collection

  2. Data Preparation

    • This includes data exploration, feature engineering and feature selection, e.g. using SelectKBest().
  3. Choose a model

    • Decide on the error metric that fits the business problem and select the algorithms to use to create the model.
  4. Train the models

    • Set a random seed so that results are reproducible.

    • Split the data into train and test sets and cross-validate all the candidate models.

    • Scale the feature variables using MinMaxScaler, StandardScaler, etc. Make sure to fit_transform() the train data and only transform() the test data, so that no information from the test set leaks into the scaler. This is discussed clearly in the article "What and why behind fit_transform() and transform()" (Towards Data Science).

    • Scale when you are using KNN or K-Means (both typically use Euclidean distance), PCA, neural networks, models trained with gradient descent, or any algorithm that computes distances (cosine or Euclidean, for example) or assumes normality.

    • Decision trees do not need scaling.

    • You wouldn't want to scale ordinal features. [1]

    • Evaluate the models.

  5. Refit the selected model using the steps in 4 above.

  6. Tune the hyperparameters of the selected model

  7. Predict using the best performing parameters on unseen datasets. [2]
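The scaling advice in step 4 can be sketched as follows. This is a minimal illustration rather than the author's own code: the Iris dataset, the seed value and the 80/20 split are assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 42  # assumed value; any fixed seed makes the split reproducible

# Iris is used purely as a stand-in dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

scaler = StandardScaler()
# fit_transform() learns the mean and standard deviation from the train data only.
X_train_scaled = scaler.fit_transform(X_train)
# transform() reuses the train statistics, so nothing leaks from the test set.
X_test_scaled = scaler.transform(X_test)

print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_test_scaled.mean(axis=0), 6))
```

The train columns end up with a mean of approximately zero, while the test columns generally do not, which is expected because the scaler never saw the test rows.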




A pipeline is a sequence of steps through which data is transformed. You can have a class for each transformation step and another class that joins those steps together into a complete final pipeline. [3]

NOTE: The script below is used to explain the concept behind pipelines and not necessarily explain the ML steps previously discussed.
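As a minimal sketch of the pipeline idea (an illustration, not the author's original script), a scikit-learn Pipeline can chain scaling, SelectKBest and a classifier, and GridSearchCV can tune the whole chain at once. The dataset, the step names ("scale", "select", "model") and the parameter grid are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset and split; both are assumptions for illustration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each step is (name, transformer); the final step is the estimator.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter>.
param_grid = {"select__k": [2, 3, 4], "model__C": [0.1, 1.0, 10.0]}

# Inside each CV fold the scaler and selector are fitted on the training part only.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Because the preprocessing lives inside the pipeline, the same fitted object can be reused on unseen data with a single predict() call.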



Model Evaluation

The aim of evaluating a model is to minimise both bias and variance. Underfitting occurs when you have high bias and low variance; overfitting occurs when you have high variance and low bias. To understand more about the bias-variance tradeoff in machine learning, read "Bias and Variance Tradeoff | Beginners Guide".


A classification problem may have two (binary) or more (multi-class) class labels.

Class Imbalance

An imbalanced classification problem is one where the data points are unevenly distributed across the known classes, e.g. Class 0 with 100 examples, Class 1 with 90 examples and Class 2 with 10 examples.


Have you ever thought: there are so many classification scoring techniques, which one should I use?

Before you decide which to use, it's important to first understand what each of them means.



Accuracy

It's easy to assume that accuracy is the best error metric for your ML model; however, this is not always the case. You shouldn't use accuracy on imbalanced problems, because it is then easy to get a high accuracy score by simply classifying all observations as the majority class.


Use accuracy:

  • When your problem is balanced. Accuracy is usually a good start then, and an additional benefit is that it is really easy to explain to non-technical stakeholders in your project.

  • When every class is equally important to you. [4]
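A small worked example shows why accuracy misleads on imbalanced data. The labels below are made up for illustration: 95 negatives and 5 positives, scored against a "model" that always predicts the majority class.

```python
# Made-up labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: share of actual positives that were found.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

The accuracy is 0.95 even though the model never finds a single positive (recall 0.0).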



Precision

This indicates what proportion of positive identifications were actually correct.

It is sometimes referred to as Positive Predictive Value (PPV). Say we develop a machine learning model that determines if a stock will be a good investment or not. If we were to pour our entire net worth into this one stock, we had better hope that our model is right. Precision would be the best metric to use here because it measures how often the model's positive predictions are correct. We can afford to miss a few profitable stock investments here and there. Put simply, precision tells us how many of the positively classified items were actually relevant.


Recall

This indicates what proportion of actual positives were identified correctly.

It is sometimes referred to as True Positive Rate or Sensitivity. Let's say we were trying to detect whether an apple was poisonous or not. In this case, we would want to reduce the number of False Negatives because we hope to not miss any poisonous apples in the batch. Recall would be the best evaluation metric to use here because it measures how many poisonous apples we might have missed. We are not too concerned with mislabelling an apple as poisonous because we would rather be safe than sorry. Put simply, recall tells us how good a test is at detecting the positives.

As mentioned earlier, Sensitivity is the same as Recall. Specificity, also known as True Negative Rate, measures how good the test is at avoiding false alarms.
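Precision, recall and specificity can all be read off a confusion matrix. The sketch below uses made-up labels and scikit-learn's confusion_matrix; for binary labels 0/1, ravel() returns the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)    # of the predicted positives, how many were right
recall = tp / (tp + fn)       # of the actual positives, how many were found
specificity = tn / (tn + fp)  # of the actual negatives, how many were left alone

print(precision, recall, specificity)
```

Here the model finds 3 of the 4 positives (recall 0.75) but raises 2 false alarms, so precision is 0.6 and specificity is about 0.67.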

F1 Score

Useful when you cannot rely on precision or recall alone. Optimising for only one of them is a compromise: favouring recall can produce many false positives, while favouring precision can produce many false negatives. The F1 score is the harmonic mean of precision and recall, so it balances the two on the positive class, whereas accuracy looks at correctly classified observations, both positive and negative.

The closer F1 is to 1, the better. Like precision and recall, F1 ranges from 0 to 1.


Use F1:

  • In pretty much every binary classification problem where you care more about the positive class. There is also a multi-class F1 score.

  • When you need a metric that can be easily explained to business stakeholders, which in many cases can be a deciding factor. Always remember, machine learning is just a tool to solve a business problem. [4]
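The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below uses a hypothetical helper, f1_from, to show how the harmonic mean punishes a lopsided precision/recall pair far more than a simple average would; the input values are made up.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.6, 0.75)   # reasonably balanced model
lopsided = f1_from(1.0, 0.1)    # perfect precision, very poor recall
simple_average = (1.0 + 0.1) / 2

print(round(balanced, 3), round(lopsided, 3), simple_average)
```

The lopsided model averages 0.55 on a simple mean but scores only about 0.18 on F1, which is why F1 is a safer single number when both kinds of error matter.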



PRC is short for "Precision Recall Curve". You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. At a certain point, the higher the recall the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.


Use the precision-recall curve:

  • when you want to communicate the precision/recall decision to other stakeholders.
  • when you want to choose the threshold that fits the business problem.
  • when your data is heavily imbalanced. This was discussed extensively in an article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR), it cares less about the frequent negative class.
  • when you care more about the positive than the negative class. If you care more about the positive class, and hence PPV and TPR, you should go with the precision-recall curve and PR AUC (average precision). [4]
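Choosing a threshold from the precision-recall curve can be sketched with scikit-learn's precision_recall_curve and average_precision_score. The labels, scores and the 0.75 precision target are made-up assumptions; in practice the scores would come from a model's predict_proba() output.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up imbalanced example: 3 positives among 10 observations.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # PR AUC (average precision)

# precision and recall have one more entry than thresholds, so drop the last one.
target_precision = 0.75  # assumed business requirement
meets_target = precision[:-1] >= target_precision

# Thresholds are sorted ascending, so the first match keeps the most recall.
chosen_threshold = thresholds[meets_target][0]
recall_at_threshold = recall[:-1][meets_target][0]

print(round(ap, 4), chosen_threshold, recall_at_threshold)
```

With these numbers the lowest threshold that still meets the precision target is 0.45, and it keeps recall at 1.0.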



AUC is short for "Area Under the Curve" and ROC is short for "Receiver Operating Characteristic". The ROC curve is a plot of signal (True Positive Rate) against noise (False Positive Rate) across classification thresholds. The higher the TPR and the lower the FPR at each threshold, the better, so classifiers whose curves sit closer to the top-left corner are better. Model performance is summarised by the area under the ROC curve (AUC). The best possible AUC is 1, while 0.5 (the 45-degree diagonal line) corresponds to random guessing and is the worst in practice: for any value below 0.5 we can simply do the exact opposite of what the model recommends and get the value back above 0.5.


  • You should not use it when your data is heavily imbalanced. This was discussed extensively in an article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: the false positive rate for highly imbalanced datasets is pulled down by the large number of true negatives.
  • You should use it when you care equally about the positive and negative classes. This naturally extends the imbalanced data discussion above: if we care about true negatives as much as we care about true positives, then it makes sense to use ROC AUC.
  • You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read the article by Jason Brownlee if you want to learn about probability calibration). [4]
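ROC AUC can be computed with scikit-learn's roc_auc_score and roc_curve. The four labels and scores below are made up; the second calculation illustrates the point that a model scoring below 0.5 can be inverted to score above it.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC equals the chance that a random positive outranks a random negative.
auc = roc_auc_score(y_true, y_score)

# The points of the ROC curve itself, one per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Inverting the scores mirrors the AUC around 0.5.
flipped_scores = [1 - s for s in y_score]
auc_flipped = roc_auc_score(y_true, flipped_scores)

print(auc, auc_flipped)
```

Three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75; the inverted scores give 0.25.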



Regression models are not ranked using accuracy. There are a number of metrics to use; the most common are briefly described below. To learn more about regression error metrics, read the articles "Understanding Linear Regression and Regression Error Metrics" and "Regression: An Explanation of Regression Metrics And What Can Go Wrong".

MSE (Mean Squared Error)

RMSE (Root Mean Squared Error)

MAE (Mean Absolute Error)

It is good practice to establish a baseline MAE for your dataset using a baseline model. A model that achieves a better MAE than the MAE for the baseline model is preferable.
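One way to establish such a baseline is scikit-learn's DummyRegressor, which ignores the features and always predicts a constant. The sketch below uses made-up data (roughly y = 2x + 1) and the median strategy, since the median is the constant that minimises MAE on the training data; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up data: y is roughly 2x + 1 with a little fixed noise.
x = np.arange(10, dtype=float)
noise = np.array([0.1, -0.2, 0.0, 0.2, -0.1, 0.1, -0.2, 0.0, 0.2, -0.1])
y = 2 * x + 1 + noise
X = x.reshape(-1, 1)

# Even rows for training, odd rows for testing.
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# Baseline: always predict the median of the training targets.
baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(round(baseline_mae, 2), round(model_mae, 2))
```

A candidate model is only worth keeping if its MAE is clearly below the baseline's, as the linear model's is here.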

MAPE (Mean Absolute Percentage Error)

The lower the MAPE, the more accurate the model. E.g. a model with MAPE 2% is better than a model with MAPE 10% on the same data.

R-squared or Coefficient of Determination

Adjusted R-squared
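The regression metrics listed above can be computed directly from their standard definitions. The sketch below uses NumPy on made-up values; n_features (the number of predictors, needed for adjusted R-squared) is an assumed value.

```python
import numpy as np

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])
n_features = 1  # assumed number of predictors in the model

errors = y_true - y_pred
mse = np.mean(errors ** 2)                      # Mean Squared Error
rmse = np.sqrt(mse)                             # Root Mean Squared Error
mae = np.mean(np.abs(errors))                   # Mean Absolute Error
mape = np.mean(np.abs(errors / y_true)) * 100   # in percent; needs non-zero y_true

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # R-squared

n = len(y_true)
# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(mse, round(rmse, 4), mae, round(mape, 2), r2, adj_r2)
```

MSE, RMSE, MAE and MAPE are errors, so lower is better; R-squared and adjusted R-squared measure explained variance, so higher is better.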



Thank you for reading my article. There are a lot of resources for writing and testing ML code in Python. scikit-learn is a very powerful tool with plenty of resource material, support and innovation. I reckon this will be suitable for:

  • Writing production-ready ML scripts and pipelines in our solutions domain projects.

  • Providing the foresight suite with production-ready code.



  1. All about Feature Scaling. Scale data for better performance of… | by Baijayanta Roy | Towards Data Science
  2. 7 Steps to Machine Learning: How to Prepare for an Automated Future | by Dr Mark van Rijmenam | DataSeries | Medium
  3. ML Pipelines using scikit-learn and GridSearchCV | by Nikhil pentapalli | Analytics Vidhya | Medium
  4. F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?
