Preparing for some Azure ML magic!
Let's reiterate the steps required to get Azure ML up and running:
- Feed Azure ML with some data, either historical or current data.
There are a variety of ways to make your data available to Azure ML; the simplest approach is to upload your data set directly to Azure ML.
Azure ML also supports cleaning your data, e.g. removing duplicates, removing columns not relevant to your analysis, etc.
- Define the model that will be used to make predictions
This is of course the most important step and also the step where at least some understanding of Machine Learning is required. Here we build a model, which depends on whether we are dealing with supervised or unsupervised learning and whether we are dealing with, for instance, a classification or a regression problem.
An example of a model could be to predict whether a machine problem is mechanical or electrical. This is a classic classification problem.
- In the case of supervised learning we need to train our model on our known dataset.
Typically you use 80% of your known data for training and then 20% for validation.
- Validate the model using known data.
Apply your model to part of the data you set aside for validation to test if the model is able to predict known results.
- If the model performance is "adequate" (i.e. it is able to make correct predictions on known data at an acceptable rate), expose the model to real-time data.
Azure ML allows you to expose your model as a web service to make it consumable for other software components or to people without Azure ML access or competencies.
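The train/validate workflow in the steps above can also be sketched in code. Here is a minimal scikit-learn analogy (note: the synthetic sensor data and the two fault classes are invented for illustration; Azure ML Studio does all of this through its visual designer):

```python
# Minimal sketch of the train/validate workflow described above,
# using scikit-learn as a stand-in for Azure ML Studio.
# The synthetic "sensor readings" and labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic "known data": two readings per asset plus a label
# (0 = mechanical fault, 1 = electrical fault).
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Step: reserve 80% of the known data for training, 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step: train the model on the known data.
model = LogisticRegression().fit(X_train, y_train)

# Step: validate on the held-out 20% before exposing the model to live data.
accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {accuracy:.2f}")
```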
The data we will feed into Azure ML to do analytics on in this article is contained in two tables.
One table with vehicle mileage data and another table with registered breakdowns:
This is actually a pretty well-chosen data set: in maintenance / CMMS, you very often have a potentially huge number of assets (i.e. machines or machine components), and for each asset a huge number of readings, collected either manually or automatically through PLCs etc. Or maybe even over the internet, in which case you are the proud owner of an "Internet of Things"! Exactly how you collect your data is not really that relevant, but including the "Internet of Things" thing in this article will probably increase the article sharing by an order of magnitude! Just how probable is something we could ask Azure ML... IoT IoT IoT... just to be sure Google picks it up! ;-)
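Before uploading, the two tables can be combined with a join on the vehicle id. A quick pandas sketch of the idea (the column names `vehicle_id`, `mileage` and `breakdown_type`, and the sample rows, are assumptions for illustration, not the actual dataset):

```python
# Hypothetical sketch of combining the mileage and breakdown tables.
# Column names and sample values are invented for illustration.
import pandas as pd

mileage = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "mileage": [120_000, 85_000, 230_000],
})
breakdowns = pd.DataFrame({
    "vehicle_id": [1, 3],
    "breakdown_type": ["electrical", "mechanical"],
})

# A left join keeps every vehicle; vehicles without a registered
# breakdown get NaN in breakdown_type.
combined = mileage.merge(breakdowns, on="vehicle_id", how="left")
print(combined)
```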
Predicting the breakdown type is a classification problem which is what we want to focus on.
Predicting when there will be a breakdown is a regression problem which will not be our focus.
Azure ML supports a wide variety of multi-class classification algorithms like:
- Multi-class Decision Forest
- Multi-class Decision Jungle
- Multi-class Logistic Regression
- Multi-class Neural Network
In Supervised Learning, Azure ML makes it very easy to try out multiple algorithms, so a deep understanding of the different algorithms is not strictly necessary. You simply try out multiple algorithms on known data and see which one performs best. We will get back to that a little later; for now, assume Logistic Regression is our favorite algorithm.
Given the data and an algorithm we can setup our model in Azure ML Studio:
Starting at the top-right of this model we have our dataset containing the mileage and the breakdown tables. Within Azure ML Studio you can massage your dataset in various ways, e.g. remove columns not relevant for the analysis, remove duplicates, etc. This is what takes place in the "Project Columns" node.
The splitting of the dataset is where we reserve some of our known data for training and some for scoring. In this case we reserve 20% for scoring and use 80% for the training.
The result of the scoring of our model where we use Logistic Regression can be visualized in Azure ML:
From the above visualization we can see that the Logistic Regression algorithm is able to predict the correct fault type, e.g. electrical, with an error of 4.1%. In total we have an error percentage of 10.6%. One way to try to bring down the error percentage is to adjust some initialization values, which can be done based on a deeper understanding of the algorithm or based on a Parameter Sweep. We will not go into details on Parameter Sweeping but simply note that if we make use of it in the above example we can reduce the error percentage to 5.8%.
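Parameter sweeping has a direct analogue outside Azure ML as well: a grid search over hyperparameter values, keeping whichever setting scores best under cross-validation. A scikit-learn sketch (the swept parameter `C` and its candidate values are just an example, not what Azure ML's sweep module uses):

```python
# Sketch of a parameter sweep: try several regularization strengths
# and keep the one with the best cross-validated accuracy.
# The grid values are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

sweep = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # values to sweep over
    cv=5,                                       # 5-fold cross-validation
)
sweep.fit(X, y)
print("best C:", sweep.best_params_["C"])
print(f"best CV accuracy: {sweep.best_score_:.2f}")
```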
Another way to try to achieve a lower error percentage is to replace the algorithm. Using Azure ML Studio this is quite easy:
In the above we have simply added a selection of algorithm nodes to our model which we can then compare:
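The same side-by-side comparison can be sketched in code: fit several candidate algorithms on the same training data and score each on the same held-out data. Here the scikit-learn estimators stand in for the Azure ML algorithm nodes listed earlier, and the three-class synthetic data stands in for the breakdown types:

```python
# Sketch of comparing several multi-class algorithms on identical
# held-out data, mirroring the side-by-side comparison in the Studio.
# The synthetic three-class data stands in for the breakdown types.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 2))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)  # 3 classes

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

candidates = {
    "Multi-class Decision Forest": RandomForestClassifier(random_state=0),
    "Multi-class Logistic Regression": LogisticRegression(),
    "Multi-class Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit each candidate on the same training split, score on the same
# validation split, and compare.
scores = {}
for name, clf in candidates.items():
    scores[name] = clf.fit(X_train, y_train).score(X_val, y_val)
    print(f"{name}: {scores[name]:.2f}")
```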
Let's step back a moment and review what we have achieved:
- We have uploaded our known dataset to Azure ML
- We have created our model in Azure ML
- We have trained our model using a subset of the uploaded data
- We have scored our model using the remaining data
- We have compared the performance of various algorithms
All this work is preparation work that only has to take place once for a given model. Now that we have the model in place with a preferred algorithm we can start to expose our model to real-time data. So let's move on to the next (much shorter!) part where we look at how to do predictions.
PS: Credit to Tomas Grubliauskas for providing the hardcore background material for these posts.