Data scientists and analysts spend much of their time on routine work: from time-consuming data preparation to trial-and-error over the methods and algorithms that will work best on a given dataset. And it doesn't matter whether we are talking about key business tasks or auxiliary ones: most of the time still goes to routine.
If you need to speed up these processes and let specialists focus on the most important tasks, AutoML frameworks are worth considering. What is interesting about automated machine learning systems? Which frameworks are suitable for AutoML? What are the current limitations? We answer these questions in this article.
What is AutoML?
In simple technical terms, AutoML automates the selection, construction, and parameterization of machine learning (ML) models. Simply put, AutoML provides methods and processes that accelerate exploration and prediction.
The skyrocketing demand for AI projects, coupled with a shortage of AI experts, means that many complex tasks have to be handed over to automation. However, AutoML is not a general-purpose tool for managing model performance, and it cannot analyze the resulting data for you.
One example of an AutoML constraint involves the hill-climbing algorithm, where the model is tasked with finding a globally optimal solution. An AutoML model will often climb only until it reaches the top of the first "hill" - a local maximum. It looks like a solution has been found, but a data scientist knows this may not be the biggest hill, and a model built around it will become less and less accurate as it is extended. A skilled specialist can quickly widen the search and find the global maximum.
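The local-maximum trap is easy to reproduce. Below is a minimal pure-Python sketch (the scoring function, step size, and starting points are invented for illustration): greedy hill climbing on a curve with two peaks stops at whichever peak is nearest to where it started.

```python
import math

def hill_climb(f, x, step=0.01, max_iters=100_000):
    """Greedy hill climbing: keep moving while a neighbor scores higher."""
    for _ in range(max_iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            break  # no neighbor is better: a (possibly only local) maximum
    return x

def two_hills(x):
    """A made-up score surface: a small hill near x=1, the global peak near x=4."""
    return math.exp(-(x - 1) ** 2) + 2 * math.exp(-(x - 4) ** 2)

print(round(hill_climb(two_hills, 0.0), 2))  # stuck near x=1, the local peak
print(round(hill_climb(two_hills, 3.0), 2))  # reaches x=4, the global peak
```

Starting at 0.0, the climber never crosses the valley between the hills; only a different starting point (or a smarter search strategy) reaches the global optimum.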
Extensive training and testing are what guarantee the long-term viability of a project. This is where the importance of human technological expertise, specifically data scientists, becomes clear.
The concept of Automatic Machine Learning
Suppose there is a dataset from which we want to obtain a predictive model. The traditional machine-learning approach requires the following sequence of actions:
- preliminary data processing;
- feature selection and construction of new features (feature engineering);
- choosing the right learning model;
- optimization of hyperparameters;
- model training with optimal parameters.
The process can be long and, therefore, expensive. Indeed, for a better result the hypothesis must be tested repeatedly, and at each step it may need further refinement.
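To make the manual sequence concrete, here is a condensed scikit-learn sketch of the steps listed above (the dataset, model, and parameter grid are arbitrary illustrative choices, not a recommendation):

```python
# Manual pipeline: preprocessing -> model choice -> hyperparameter
# search -> final training with the best parameters found.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                   # preliminary data processing
    ("model", LogisticRegression(max_iter=5000)),  # chosen learning model
])

# Hyperparameter optimization over a small hand-picked grid.
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)  # refit on the whole set with the best parameters

print(search.best_params_, search.score(X_test, y_test))
```

Every line above encodes a human decision (scaler, model family, grid values); AutoML aims to make those decisions automatically.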
The task of automated machine learning is to automate all or at least some of these steps without losing predictive accuracy. The ideal AutoML strategy assumes that any machine learning user can take raw data, build a model on it, and get predictions with the best possible (for the available sample) accuracy.
But does this mean that the day will come when there is no need for data analysis specialists? Of course not. Automated machine learning technologies are aimed at eliminating the routine sequence of operations and manual enumeration of models so that experts can devote more time to the creative side of the issue.
Consider the “conveyor” of machine learning described above. Each stage requires its own approach. For example, to prepare data, it may be necessary to automate:
- determining the type of each column (numeric data, text, Boolean values, etc.);
- determining semantic content: if a field is text, what it represents (a last name, a date, a geotag, etc.);
- detecting the task type: clustering, ranking, etc.
Particular attention is paid to the process of finding the best model hyperparameters. The two most common methods for finding them are:
- Grid search.
- Random search.
Obviously, the popularity of these methods is explained by their ease of implementation, but both are justified only for a small number of hyperparameters. Other algorithms used for parameter optimization include Bayesian optimization, simulated annealing, and evolutionary algorithms. Let us look in more detail at frameworks that help find a suitable model and configure its parameters.
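The difference between the two simple strategies can be shown with a dependency-free sketch (the scoring function, parameter ranges, and budget are all made up for illustration): grid search enumerates every combination, while random search samples the same number of configurations at random.

```python
import itertools
import random

def score(lr, depth):
    """A made-up validation score that peaks at lr=0.1, depth=6."""
    return 1.0 - (lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

# Grid search: exhaustively try every combination. The cost grows
# multiplicatively with every hyperparameter added to the grid.
lrs = [0.001, 0.01, 0.1, 1.0]
depths = [2, 4, 6, 8]
best_grid = max(itertools.product(lrs, depths), key=lambda p: score(*p))

# Random search: draw the same number of configurations at random,
# sampling the learning rate on a log scale.
random.seed(0)
candidates = [(10 ** random.uniform(-3, 0), random.randint(2, 8))
              for _ in range(len(lrs) * len(depths))]
best_rand = max(candidates, key=lambda p: score(*p))

print("grid:", best_grid)    # (0.1, 6)
print("random:", best_rand)
```

With 16 evaluations each, both land near the optimum here, but the grid's cost explodes as dimensions are added, which is why more sample-efficient methods such as Bayesian optimization take over for larger search spaces.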
Automatic Machine Learning Frameworks
TransmogrifAI is a library built in Scala on top of the SparkML framework. With just a few lines of code, data scientists can perform automated data cleansing, feature engineering, and model selection to obtain a well-performing model, and then continue exploring and iterating.
TransmogrifAI includes five main components of machine learning:
- feature inference;
- transmogrification (i.e., automated feature engineering);
- automated feature validation;
- automated model selection;
- hyperparameter optimization.
AutoGluon is an open-source machine learning library from Amazon Web Services that makes automated machine learning easy to use and to extend. It allows you to achieve high prediction accuracy with modern deep learning methods, without requiring specialist knowledge. It is also a quick way to prototype what you can achieve with a dataset and to get a starting baseline for your machine learning work. With AutoGluon, you can:
- build models for image and text classification;
- perform object detection;
- run tabular prediction.
The tool also provides an API for advanced software engineers who want to dig into the model parameters themselves.
MLJAR is a browser-based platform for rapidly building and deploying machine learning models. It has an intuitive interface and allows you to train models in parallel. Built-in hyperparameter search makes it easier to tune and deploy models. MLJAR integrates with NVIDIA CUDA, Python, TensorFlow, and more.
You only need to follow three steps to create a good model:
- Upload your dataset.
- Train and tune many machine learning algorithms and select the best one.
- Use the best predictive model and share your results.
This AutoML tool is subscription-based, though a free tier with a 0.25 GB dataset limit is available. It's worth a try.
DataRobot is a platform that allows business analysts to build predictive analytics without knowledge of machine learning or programming. The platform uses automated machine learning (AutoML) to build accurate predictive models in a short amount of time.
DataRobot provides a convenient user interface for creating machine learning models. A company can deploy a real-time predictive analytics service powered by an accurate machine learning model in just a few steps.
A huge advantage of DataRobot is the ability to go deeper into the platform and take control of the machine learning workflow; on the one hand, business analysts can use it as a tool, on the other hand, experienced data scientists can tune many parameters on their own to get even more accurate models.
Features of using DataRobot:
- DataRobot uses state-of-the-art distributed processing while running experiments in parallel;
- the solution can be used locally or in the cloud;
- quickly and easily connects to any data source;
- DataRobot offers built-in security for role-based fine-grained authorization and supports Kerberos and LDAP protocols.
The MLBox framework has also proven itself well.
MLBox solves the following machine-learning tasks:
- Data preparation (the most developed part of the library)
- Model selection
- Hyperparameter search
Among the shortcomings, we note that the system is much easier to install on Linux than on Mac or Windows.
6. Auto Sklearn
As the name implies, the Auto Sklearn framework is built on top of the popular scikit-learn machine learning library. What Auto Sklearn can do:
- Feature engineering (a distinctive strength of the framework)
- Model selection
- Hyperparameter tuning
Auto Sklearn does a good job on small datasets but struggles to "digest" large ones.
TPOT is positioned as a framework in which machine learning pipelines are fully automated. It uses a genetic algorithm to find the optimal model: many different pipelines are built, and the best is chosen by predictive accuracy.
Like Auto Sklearn, this framework is built on top of scikit-learn, but TPOT has its own regression and classification algorithms. Its disadvantages include an inability to handle natural language and categorical strings.
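TPOT's genetic search can be illustrated with a toy evolutionary loop. Everything below is a simplified stand-in, not TPOT's actual code: candidate "pipelines" are just hyperparameter pairs, a made-up fitness function stands in for cross-validated accuracy, the fittest half survives each generation, and survivors are mutated to refill the population.

```python
import random

random.seed(42)

def fitness(cfg):
    """Stand-in for the cross-validated accuracy of a pipeline config."""
    lr, depth = cfg
    return 1.0 - (lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

def mutate(cfg):
    """Randomly perturb a surviving configuration."""
    lr, depth = cfg
    return (abs(lr + random.gauss(0, 0.05)),
            max(1, depth + random.choice([-1, 0, 1])))

# Start from a random population of configurations.
population = [(random.uniform(0, 1), random.randint(1, 10)) for _ in range(20)]

for generation in range(30):
    # Selection: keep the fittest half of the population.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    # Mutation: refill the population with perturbed survivors.
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(10)]

best = max(population, key=fitness)
print(best, round(fitness(best), 3))
```

After a few dozen generations the population converges toward the best-scoring configuration; TPOT applies the same idea to entire scikit-learn pipelines, with crossover as well as mutation.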
H2O Flow is an interactive web tool that lets you pull data from various sources, visualize it, and work in a seamless environment for model building, forecasting, scoring, and exporting your model. In my opinion, H2O's strength is its distributed in-memory processing.
H2O is written in Java and supports algorithms commonly used in Data Science, such as GBM, Random Forest, and Stacked Ensembles. H2O works with R, Python, and Scala on Hadoop/Yarn, Spark, or on your laptop.
Advantages of AutoML in H2O:
- open source, distributed (multi-core + multi-node) implementations of advanced ML algorithms;
- core algorithms implemented in high-performance Java, with APIs in R, Python, and Scala, plus a web interface;
- easily deployable models for production as pure Java code;
- easily works on Hadoop, Spark, AWS, your laptop, etc.
9. Auto Keras
Auto Keras follows the design of the classic scikit-learn API but uses neural architecture search, powered by Keras, to find model parameters.
10. Google Cloud AutoML
Cloud AutoML relies on Google's neural architecture search technology. This Google product has a simple user interface for training and deploying models.
However, the platform is paid, and in the long run it makes sense only for commercial projects. On the other hand, a restricted version of Cloud AutoML is available free of charge for research purposes for a year.
11. Uber Ludwig
The goal of the Uber Ludwig project is to automate modern deep learning pipelines with a minimal amount of code. The framework works only with deep learning models, ignoring other ML approaches. And, as is usually the case with deep learning, the amount of data plays a significant role.
So, AutoML is already quite good at supervised learning with high-quality labeled data. But so far it cannot solve unsupervised or reinforcement learning problems. The latter makes scenarios such as the artificial intelligence of a robot operating in the real world, or of an opponent in a game, difficult to implement.
A rare example of successful reinforcement learning is AlphaZero, developed by DeepMind. It shows how playing strength in Go can improve through training in which the artificial intelligence competes against itself.
AutoML frameworks also still struggle with processing complex raw data and with optimizing the construction of new features (feature engineering). For this reason, feature selection remains one of the cornerstones of the model-building process.
However, progress has been observed in all of these areas, which is accelerating with the increasing number of AutoML contests.
Share your list of must-have machine learning frameworks by writing to us at firstname.lastname@example.org ;)