Recently, demand for data science skills has grown faster than the supply of skilled specialists. Today it is hard to imagine a business that would not benefit from the detailed data analysis that data scientists and machine learning algorithms perform. As artificial intelligence spreads into every corner of industry, it is difficult to meet the need for data scientists in every possible use case. To ease the pressure created by this shortage, several companies have begun to develop frameworks that can partially automate the process data scientists commonly follow.
AutoML is a method that automates the process of applying machine learning to data. Typically, a data scientist spends most of their time on pre-processing, feature engineering, model selection and tuning, and then evaluating the results. AutoML can automate these tasks and provide a baseline result. For certain problems it can also deliver high performance and indicate where to continue your research.
In this article, we will look at the Python H2O module and its AutoML function. H2O is Java-based software for data modeling and general computing. According to H2O.ai, “The main purpose of H2O is to serve as a distributed, parallel, in-memory processing engine (up to several hundred gigabytes, via the JVM's Xmx parameter).”
AutoML is a feature in H2O that automates the process of building a large number of models in order to find the “best” model without any prior knowledge. AutoML won’t win you any competitions (however, this information is outdated - editor's note), but it can provide a lot of information that will help you create better models and reduce the time spent studying and testing various models.
The current version of AutoML can train and cross-validate a random forest, an extremely randomized forest, a random grid of gradient boosting machines, a random grid of deep neural networks, and then train a stacked ensemble using all of the models.
Stacking (also called meta-ensembling) is a model ensembling technique used to combine information from several predictive models to create a new model. Often, the combined model (also called a level 2 model) outperforms each of the individual models thanks to its smoothing nature and its ability to emphasize each base model where it performs best and discount each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different.
Stacked ensembles of machine learning models often beat current academic results and are widely used in Kaggle competitions. On the other hand, they usually require substantial computing resources. But if time and resources are not a constraint, even a small percentage improvement in prediction quality can, for example, save a company a lot of money.
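To make the idea concrete, here is a minimal stacking sketch. It uses scikit-learn's `StackingClassifier` on synthetic data rather than H2O's own stacked ensemble, and the choice of base models, meta-learner, and dataset size are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Level 1: deliberately different base models
base_models = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Level 2: a simple model trained on cross-validated predictions of the base models
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy: {stack_accuracy:.3f}")
```

The level 2 model only ever sees out-of-fold predictions from the base models, which is what keeps it from simply memorizing their training-set behavior.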
In this article, we will look at the Mushroom Classification dataset, which is available on Kaggle and provided by UCI Machine Learning. The dataset contains 23 categorical attributes and more than 8000 observations. The data are divided into two classes: edible and poisonous. The classes are fairly evenly distributed, with 52% of the observations in the edible class. There are no missing observations in the data. This is a popular dataset with over 570 Kaggle kernels that we can use to see how well AutoML works compared to traditional approaches.
First you need to install and import the Python H2O module and the H2OAutoML class, as with any other library, and initialize the local H2O cluster (for this article I use Google Colab).
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # start (or connect to) a local H2O cluster
Then we need to load the data. This can be done directly into an H2OFrame or (as I will do for this dataset) into a pandas DataFrame, so that we can apply label encoding and then convert the result to an H2OFrame. Like many things in H2O, an H2OFrame works very similarly to a pandas DataFrame, but with slight differences and a different syntax.
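import pandas as pd
from sklearn import preprocessing

# Load the data into pandas first so it can be label-encoded
path = "./gdrive/My Drive/Mushrooms/mushrooms.csv"
# Alternative: load straight into an H2OFrame
# df = h2o.import_file(path=path, header=1)
df = pd.read_csv(path)

# Encode every categorical column as integer codes (the encoder is refit for each column)
labelEncoder = preprocessing.LabelEncoder()
for col in df.columns:
    df[col] = labelEncoder.fit_transform(df[col])

# Convert to an H2OFrame and mark every column as categorical
df = h2o.H2OFrame(df)
df = df.asfactor()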
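The loop above reuses a single encoder, so the mapping from integer codes back to the original letters is overwritten on each column. If you need that mapping later (for example, to interpret predictions), one option is to keep one encoder per column. The tiny frame below is made-up data that only mimics the style of the mushroom columns, and the `encoders` dictionary is a hypothetical helper, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample in the style of the mushroom data (not the real dataset)
sample = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "b", "x", "s"],
})

encoders = {}            # column name -> fitted LabelEncoder
encoded = sample.copy()
for col in sample.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(sample[col])

print(encoded)

# Recover the original labels for one column
restored = encoders["class"].inverse_transform(encoded["class"])
print(list(restored))
```

`LabelEncoder` assigns codes in sorted order of the labels, so the numbers carry no meaning beyond identity. That is why the article's code converts every column to a factor afterwards.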
Although AutoML will do most of the work for us in the initial stages, it’s important that we still have a good understanding of the data we are trying to analyze so that we can build on its work.
As with the functions in sklearn, we can split the data into train and test sets so that we can evaluate the model on an unseen (test) dataset and check for overfitting. It is important to note that H2O does not split frames exactly. It is designed for big data and uses a probabilistic splitting method rather than an exact one. For example, if you specify a split of 0.70 / 0.15, H2O will produce partitions whose expected sizes are 0.70 / 0.15 (with the remainder going to a third partition), rather than exactly 0.70 / 0.15. For small datasets, the sizes of the resulting partitions will deviate from the expected values more than for large datasets, where they will be very close to exact.
train, test, valid = df.split_frame(ratios=[0.7, 0.15])
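To see why the partition sizes are only approximate, here is a plain-Python sketch of probabilistic splitting. It illustrates the general idea (each row is assigned by an independent random draw) and is not H2O's actual implementation. The function name `probabilistic_split` is hypothetical.

```python
import random

def probabilistic_split(n_rows, ratios, seed=0):
    """Assign each row index to a partition by an independent random draw."""
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. roughly [0.7, 0.85] for ratios [0.7, 0.15]
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    # One partition per ratio, plus one for the remainder
    parts = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for k, t in enumerate(thresholds):
            if u < t:
                parts[k].append(i)
                break
        else:
            parts[-1].append(i)  # draw fell past the last threshold
    return parts

# Compare the realized fractions for a small, a medium, and a large dataset
for n in (100, 8124, 100_000):
    sizes = [len(p) for p in probabilistic_split(n, [0.7, 0.15])]
    print(n, [round(s / n, 4) for s in sizes])
```

Every row lands in exactly one partition, but the realized fractions fluctuate around the requested ratios, and the fluctuation shrinks as the number of rows grows.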
Then we need to get the column names of the dataset so that we can pass them to the function. AutoML takes several parameters: x, y, training_frame, and validation_frame, of which y and training_frame are mandatory and the rest are optional. You can also set max_runtime_secs and max_models, which control when the run stops. H2O's documentation asks that at least one of these stopping criteria be specified. If you do not pass max_models, it defaults to NULL (None in Python). Parameter x is the list of predictor columns from training_frame. If you want to use all the columns in the passed frame (other than the target), you can skip the x parameter.
To solve this problem, we are going to use all the columns in the data frame (except the target) as x and set max_runtime_secs to 10 minutes (some of these models take a long time to train). Now it's time to run AutoML:
y = "class"
x_train = train.columns
x_train.remove(y)  # use every column except the target as a predictor

aml = H2OAutoML(max_runtime_secs=600, seed=1)
aml.train(x=x_train, y=y, training_frame=train)
Here, we capped the run at 10 minutes of training, but we could have specified a maximum number of models instead. If you want to customize the AutoML workflow, there are also many additional parameters that you can pass.
After the run finishes, you can see which models perform best and consider them for further study.
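As a rough sketch, a few commonly used options could be collected like this. The parameter names are `H2OAutoML` constructor arguments, but the values are illustrative assumptions rather than recommendations, and `build_automl` is a hypothetical helper.

```python
# Illustrative settings only; the values are assumptions, not tuned recommendations
automl_settings = {
    "max_models": 20,                   # stop after this many models (stacked ensembles excluded)
    "max_runtime_secs": 600,            # or after 10 minutes, whichever limit is hit first
    "nfolds": 5,                        # k-fold cross-validation for each model
    "sort_metric": "AUC",               # metric used to rank the leaderboard
    "exclude_algos": ["DeepLearning"],  # skip algorithms you do not want to try
    "seed": 1,                          # for reproducibility
}

def build_automl(settings):
    """Create an H2OAutoML object from a settings dict (call h2o.init() first)."""
    from h2o.automl import H2OAutoML  # imported lazily so the dict can be inspected without H2O
    return H2OAutoML(**settings)
```

With a running cluster, `aml = build_automl(automl_settings)` followed by the same `aml.train(...)` call as before would start a run under these limits.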
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # show every model on the leaderboard
To make sure that the model has not overfit, we now run it on the test data:
preds = aml.predict(test)
AutoML achieved accuracy and F1-score values of 1.0 on the test data, which means the model has not overfit.
Obviously, this is an exceptional case for AutoML, since we cannot improve on 100% accuracy on our test dataset without testing on more data. Looking at many of the kernels posted on Kaggle for this dataset, it seems that many people (and even the Kaggle Kernel bot) were also able to get the same result using traditional machine learning methods.
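Accuracy and F1 are computed by comparing predicted labels with the true ones. The sketch below does this in plain Python on made-up label lists. With H2O you would first pull the `predict` column of `preds` and the `class` column of `test` into Python (for example with `as_data_frame()`), which is an assumption about the workflow rather than something shown in the original code.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when the denominator is zero
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Made-up labels: 1 = poisonous, 0 = edible
y_true = [1, 0, 1, 1, 0, 0]
perfect = accuracy_and_f1(y_true, y_true)
one_miss = accuracy_and_f1(y_true, [1, 0, 0, 1, 0, 0])
print(perfect, one_miss)
```

A single missed poisonous mushroom already pulls both numbers below 1.0, which is why a perfect score on held-out data is a meaningful result here.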
The next step is to save the trained model. There are two ways to save the leader model: the binary format and the MOJO format. If you use your model in production, the MOJO format is recommended, as it is optimized for production use.
Now that you have found the best model for the data, you can investigate further steps that would improve its performance. Perhaps the best model on the training data is overfit, and another top model is preferable on the test data. It may be better to prepare the data differently or to select only the most important features for some models. Many of the best models in H2O AutoML use ensemble techniques, and perhaps the models used in those ensembles can be developed further.
Although AutoML alone won’t bring you superiority in machine learning competitions, it is definitely worth considering as an addition to your blended and stacked models.
AutoML can work with various types of problems, including binary classification (as shown here), multiclass classification, and regression.
AutoML is a great tool that helps (rather than replaces) the work that data scientists do. I look forward to seeing the advances that can be made in AutoML frameworks and how they can benefit all of us as data scientists and the organizations we serve. Of course, no single automated solution can exceed human creativity, for example, when it comes to feature engineering, but AutoML is a tool worth exploring in your next data analysis project.
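Both options can be sketched as follows, using `h2o.save_model` for the binary format and the model's `download_mojo` method for MOJO. The helper name `save_leader` and the `./models` directory are hypothetical, and the sketch assumes `aml` is the trained AutoML object from above with the cluster still running.

```python
def save_leader(aml, out_dir="./models"):
    """Save the AutoML leader in both formats and return the two file paths."""
    import h2o  # imported lazily; assumes h2o.init() has already been called

    leader = aml.leader
    # Binary format: loadable only by the same H2O version that produced it
    binary_path = h2o.save_model(model=leader, path=out_dir, force=True)
    # MOJO format: a self-contained artifact intended for production scoring
    mojo_path = leader.download_mojo(path=out_dir)
    return binary_path, mojo_path
```

A call such as `binary_path, mojo_path = save_leader(aml)` would then write both files and report where they ended up.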