Azure Machine Learning was demoed at the Data Insights Summit last week and got me pretty stoked about all the possibilities, so I decided to try it out. Ostensibly, it allows you to analyze a data set and train a model to predict outcomes. The demo they did at the conference predicted your likelihood of surviving the Titanic sinking by looking at your age/gender/class statistics.
I am not a data scientist, but I find all of this pretty fascinating… I found the experience to be simultaneously frustrating and awesome. The interface tries to be user-friendly but falls a little short, though the video demo the first time you load the studio is very helpful. It took half a day of trial and error to get everything working; hopefully I can save someone else a few hours of their life with this.
In the example below I took a CSV export of one of the banks in the CFPB public complaint database (this is where consumers submit complaints about financial institutions) and created a model to predict the outcome of the complaint (basically whether the company had to compensate the consumer or whether it closed the complaint after chatting with them).
How to Create a Machine Learning Model For Beginners
A few tips to start out
- It likes CSVs and reliably formatted data sources. It gets hung up on date formatting, so it's best to remove date columns if they're not relevant to your model.
- Every time you make changes, you have to re-run the model and re-publish to web to see if you “fixed” it.
- When you’re playing with your predictions, you need to fill out ALL the input fields, and match the same syntax as the rest of your data.
- For all of the “actions” used below, you can find them with the search bar, which makes things much faster.
- The machine learning “predicts” one output column per model (as far as I can tell), so start with one and add more later if you need to.
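To make the pre-import cleanup tips above concrete, here's a small Python sketch that drops a date column and skips rows with blank fields before you upload. The column names (“Date received”, etc.) are just from my CFPB export; swap in your own.

```python
import csv
import io

def clean_csv(text, drop_columns=("Date received",)):
    """Drop unhelpful columns (e.g. dates) and rows with blank fields,
    since both tend to trip up the importer."""
    reader = csv.DictReader(io.StringIO(text))
    fields = [f for f in reader.fieldnames if f not in drop_columns]
    rows = []
    for row in reader:
        kept = {f: row[f] for f in fields}
        if all(v.strip() for v in kept.values()):  # skip rows with blanks
            rows.append(kept)
    return fields, rows

raw = """Date received,Product,Company response
2015-03-02,Mortgage,Closed with explanation
2015-03-05,Credit card,
2015-03-09,Credit card,Closed with monetary relief
"""
fields, rows = clean_csv(raw)
print(fields)     # ['Product', 'Company response']
print(len(rows))  # 2 -- the blank-response row is dropped
```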
- Log into Azure ML studio with a free workspace account: https://studio.azureml.net/. Take the tour if you want to, it’s useful. It’ll add a demo project to your space so that you can refer to it if you want to see how they made certain things work.
- Import your data set – click the +new button in the very far bottom left, and select dataset on the right. If you don’t know what to do here or if it’s your first time, import a CSV (I tried an OData feed first and it choked on something it didn’t like in the data model). Remove any “extra” data columns and clean it up before you import it or it’ll be a pain to remove later. It doesn’t love blank fields either.
- Create a new experiment (a blank experiment unless you want to try an existing model). Looks like this:
- Drag the data set you added earlier from the menu on the left onto the working area to start your flowchart.
- Next you need to partition your data: some to train the model and some to validate it. Split your data by dragging the “split data” action under your data source and click-dragging a line to connect them. The Internet says to split 80% for training and 20% for scoring, so I went with that.
- Drag “train model” onto the left side of the flow and direct your split data’s first output to its input. On the right-hand side of the window, “launch the column selector” to choose which column you’re training on. This is the column you are trying to predict from the rest of your data (in my example, I chose the outcome of the CFPB complaint).
- Drag an algorithm above the model and connect it. You can use the guide linked below to pick an algorithm; if you have no idea which to use and are just playing around, you can pick one of each type until you don’t get an error when you run the model. 🙂 https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
- Drag the “score model” action over, connect it to your “train model.”
- Connect the OTHER output of your “split data” action from earlier to the “score model.”
- Drag an “evaluate model” action under the score model and connect it.
- Run your model to see if you need to fix any errors (the run button is in the bar at the bottom).
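Under the hood, the “split data” step in the flow above is just a randomized partition. A minimal Python sketch of the same 80/20 split (with a fixed seed so it's repeatable, like the module's random-seed option):

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle then cut: roughly what a randomized 80/20
    train/score split does."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, score = split_rows(rows)
print(len(train), len(score))  # 80 20
```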
Here’s what a basic example looks like (under the blue scribbles was a column selector, turned out I didn’t need it):
Once you get this running without issue, right-click both the “score model” and “evaluate model” actions and browse through the submenus – you should see an option for “visualize” in each – click for fun times.
Your evaluate model visuals will show you the accuracy of the algorithm you chose vs your data set, you can swap them out and re-run it to see if you can find one with higher accuracy.
Now we need to set up a web service so that we can play around with entering new data and see what it predicts for us! If you’ve made any changes, run again first; then click the “set up web service” button at the very bottom of the screen.
This will give you a “predictive experiment” tab, which should add input/output actions to your flow (if it doesn’t, you can drag them over from the left). This is where you specify which columns go in and which come out the other end. Any columns you used in your “train model” have to be used as inputs; it’ll give an error if you try to remove some. For your outputs, you’ll want to select all of the probability columns created by your model.
Use the “project columns” widget to select the columns for your output (drag it between your “score model” and output, and make sure to connect them). Again, make sure to select the probability columns from the model.
NOTE: it will not let you select columns by name until you’ve run it at least once, so you may want to run it before adding the action. ALSO, I have had issues with it giving me does-not-exist errors for these even if they clearly exist, and had much better luck telling it to exclude all the other non-probability/score label columns by name and pull everything else. YMMV.
Run the model yet again, then use the “deploy web service” button in the bottom to get to the area where you can test it.
At this point, you should have a window with an API key and a TEST button. You can try the test button, but the fields are all plain text and figuring out exactly what to type in them is kind of a pain; it’s much easier to play with it in Excel. To do that, click the “Excel 2013 or later” link in the request/response row to open your model in Excel (unless you have Excel 2007…). This MAY require Windows Excel; I’m not sure whether it works in the Mac version.
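If you'd rather skip Excel and call the request/response API from code, here's a sketch of building the JSON payload. This shape matches the sample code the web service page generates for you, but copy the exact version from your own service page; the API key, endpoint, column names, and values below are all made-up placeholders.

```python
import json

# Placeholders -- copy the real API key, endpoint URL, and column
# names from your own web service page.
API_KEY = "YOUR-API-KEY"
ENDPOINT = "https://YOUR-REGION.services.azureml.net/your-service/execute"
columns = ["Product", "Issue", "State"]
values = [["Mortgage", "Loan servicing", "CA"]]

payload = {
    "Inputs": {"input1": {"ColumnNames": columns, "Values": values}},
    "GlobalParameters": {},
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + API_KEY,
}
body = json.dumps(payload)
# POST `body` with `headers` to ENDPOINT (e.g. with urllib.request)
# and read the scored probabilities out of the JSON response.
print(body)
```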
Excel will ask you if you want to enable editing; say yes. Click your model on the right to open it, and click the “use sample data” button to pull in a table of your sample data. Select the whole table and make sure the input has it as the range (this is what it looks at for the evaluation, and if the number of columns doesn’t exactly match, it has a cow).
For the output, enter the cell name of the first empty cell to the right of your table (I1 for me).
Click “predict” and it should dump the probabilities of your outcomes into the columns to the right of your table. You can see from my example that it added Scored Probabilities columns for all of the possible outcomes; I’ve only expanded one because the names are super long. I believe the “Scored Labels” column on the end is the prediction the model ended up with.
- If it gives you an error about the data model not matching up, your table wasn’t selected properly.
- If it does nothing at all, something is wrong in your flowchart. When you try fixing things, make sure to re-run your model in both tabs and re-publish it to the web.
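My read is that “Scored Labels” is simply the outcome whose probability column is highest. A tiny sketch, with made-up column names and numbers shaped like what Excel shows:

```python
# Hypothetical probability columns for one scored row; the predicted
# label should be whichever outcome scores highest.
probs = {
    'Scored Probabilities for Class "Closed with explanation"': 0.71,
    'Scored Probabilities for Class "Closed with monetary relief"': 0.18,
    'Scored Probabilities for Class "Closed with non-monetary relief"': 0.11,
}
winner = max(probs, key=probs.get)
print(winner)  # the "Closed with explanation" column wins
```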
You can test your model with new values by replacing cell values in your table (make sure they match the syntax of the original data) and re-predicting; it will calculate the probability of each outcome for that combination! You can fill your table down with a ton of new data if you want; just make sure your input range still covers all of it.
You may notice at this point that the column you’re predicting shows up as an input column with data in it. I’m not sure how to make this go away (I tried removing it in a few places; it seems to really want to stay), but it doesn’t seem to affect the probability calculation at all. I just hid the column and ignored it.
If you need to update your model, you can upload a new version of your data set back in the studio – it is smart enough to realize this and asks you if you want to update (provided you use the same filename). Again, you’ll need to re-run all your models, republish, etc to test it. Cool!