Computers work well with structured information, such as tables in databases. But people communicate with each other in words, not in tables, and that is much harder for computers to handle.
Most of the information in the world is unstructured: it is just text in English or some other language. Can machines be taught to extract important data from it? This problem is addressed by a dedicated branch of artificial intelligence: natural language processing, or NLP. In this article, we will look at how it works.
From the very beginning of the computer age, developers have been trying to teach computers to understand ordinary languages such as English. People have been writing things down for thousands of years, and it would be great if machines could read and parse all of that data.
Unfortunately, computers cannot yet fully understand living human language, but they can already do a lot. NLP can do truly magical things and save a huge amount of time.
Reading and understanding English text is a very complicated process. On top of that, people do not always write logically or consistently. For example, what does this news headline mean?
"Environmental regulators grill business owner over illegal coal fires".
Are the regulators questioning a business owner about illegal coal burning? Or are they literally cooking the owner on a grill? You probably guessed correctly. But can a computer?
Solving a complex task with machine learning usually means building a pipeline. The idea is to break the problem into very small parts and solve each one separately. By chaining several models that pass data to one another, you can get great results.
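As a minimal sketch of the pipeline idea, each stage can be an ordinary function whose output becomes the input of the next one. The stage functions below are hypothetical placeholders chosen for illustration, not the models discussed in this article.

```python
from functools import reduce

def run_pipeline(data, stages):
    """Pass data through each stage in order; each stage consumes the previous output."""
    return reduce(lambda value, stage: stage(value), stages, data)

# Hypothetical toy stages, for illustration only.
def split_words(text):
    return text.split()

def lowercase(words):
    return [word.lower() for word in words]

result = run_pipeline("London is the capital", [split_words, lowercase])
print(result)
```

Because every stage only depends on the output of the previous one, individual stages can be replaced or improved without touching the rest.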
Let's take a look at the following excerpt from Wikipedia:
“London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south-east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.”
This passage contains several useful facts. It would be great if the computer could understand that London is a city, that it is located in England, that it was founded by the Romans, and so on. But first, we must teach it the most basic concepts of written language.
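The first stage of the pipeline is to break the text into separate sentences. As a result, we get the following:
1. “London is the capital and most populous city of England and the United Kingdom.”
2. “Standing on the River Thames in the south-east of the island of Great Britain, London has been a major settlement for two millennia.”
3. “It was founded by the Romans, who named it Londinium.”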
We can assume that each sentence expresses a separate thought or idea. It is easier to teach a program to understand a single sentence than a whole paragraph.
The simplest approach is to split the text at certain punctuation marks. But modern NLP pipelines use more sophisticated methods that work even on poorly formatted text.
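The simple punctuation-based approach can be sketched as follows. This is only the naive rule (split after a period, exclamation mark, or question mark that is followed by whitespace and a capital letter), not the more sophisticated methods that real pipelines use. It would fail on abbreviations such as “Dr. Smith”.

```python
import re

TEXT = (
    "London is the capital and most populous city of England and the United Kingdom. "
    "Standing on the River Thames in the south-east of the island of Great Britain, "
    "London has been a major settlement for two millennia. "
    "It was founded by the Romans, who named it Londinium."
)

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

sentences = split_sentences(TEXT)
for sentence in sentences:
    print(sentence)
```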
Now we can process the resulting sentences one by one. Let's start with the first one:
London is the capital and most populous city of England and the United Kingdom.
The next step of the pipeline is to split the sentence into individual words, or tokens. This is called tokenization. The result at this stage looks like this:
«London», «is», «the», «capital», «and», «most», «populous», «city», «of», «England», «and», «the», «United», «Kingdom», « . »
In English, this is easy. We simply split the text each time we encounter a space. Punctuation marks are treated as separate tokens too, because they can carry meaning.
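Here is a minimal tokenization sketch. Splitting on spaces alone would leave the final period glued to “Kingdom.”, so this version uses a regular expression that treats every run of word characters as one token and every punctuation mark as its own token. It is an illustration, not how any particular NLP library tokenizes text.

```python
import re

def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize(
    "London is the capital and most populous city of England and the United Kingdom."
)
print(tokens)
```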
Now let's look at each token and try to guess its part of speech: a noun, a verb, an adjective, or something else. Knowing the role of each word in a sentence helps us start to work out what the sentence means.
At this step, we feed each word, together with the words around it, into a pre-trained classification model:
Input: the word “London” and its neighboring tokens “is”, “the”, “capital” → Output: the part of speech of “London”
This model was trained on millions of English sentences in which the part of speech of each word was already labeled, and it has learned to reproduce that labeling.
Keep in mind that this analysis is purely statistical: the model does not actually understand what the words mean. It only knows how to guess a part of speech based on similar sentences and words it has seen before.
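To make the statistical idea concrete, here is a toy sketch. It counts how often each word received each tag in a tiny hand-labeled corpus. For an unseen word, it guesses the tag from the word that follows it. The training data, the tag names, and the fallback rule are all assumptions made for illustration. Real taggers use far richer features and vastly more data.

```python
from collections import Counter, defaultdict

# Toy hand-labeled training data (hypothetical; real models see millions of sentences).
TRAIN = [
    [("Paris", "PROPN"), ("is", "VERB"), ("the", "DET"), ("capital", "NOUN"),
     ("of", "ADP"), ("France", "PROPN")],
    [("the", "DET"), ("city", "NOUN"), ("is", "VERB"), ("old", "ADJ")],
    [("Rome", "PROPN"), ("is", "VERB"), ("a", "DET"), ("populous", "ADJ"),
     ("city", "NOUN")],
]

def train(sentences):
    by_word = defaultdict(Counter)  # word -> counts of its tags
    by_next = defaultdict(Counter)  # following word -> counts of the current word's tags
    for sent in sentences:
        for i, (word, tag) in enumerate(sent):
            by_word[word.lower()][tag] += 1
            nxt = sent[i + 1][0].lower() if i + 1 < len(sent) else "<END>"
            by_next[nxt][tag] += 1
    return by_word, by_next

def tag(tokens, model):
    by_word, by_next = model
    tags = []
    for i, tok in enumerate(tokens):
        key = tok.lower()
        if key in by_word:
            # Known word: use its most frequent tag.
            tags.append(by_word[key].most_common(1)[0][0])
        else:
            # Unknown word: guess from the neighboring (following) token.
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
            guess = by_next.get(nxt)
            tags.append(guess.most_common(1)[0][0] if guess else "NOUN")
    return tags

model = train(TRAIN)
print(tag(["London", "is", "the", "capital"], model))
```

The word “London” never appears in the training data, yet the sketch still labels it as a proper noun, because in the toy corpus the word before “is” was most often a proper noun. The guess comes from patterns in the data, not from any understanding of what London is.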
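After processing, we get the following result: “London” is a proper noun, “is” is a verb, “the” is a determiner, “capital” is a noun, “and” is a conjunction, “most” is an adverb, “populous” is an adjective, and so on for the rest of the sentence.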
In English and most other languages, words can take many forms. Take a look at the following example:
I had a pony. / I had two ponies.
Both sentences contain the noun “pony”, but with different endings. When a computer processes text, it needs to know the base form of each word so that it can tell that both sentences are about the same concept. Otherwise, the tokens “pony” and “ponies” will look like two completely different words.
In NLP, this process is called lemmatization - finding the base form (lemma) of each word in a sentence.
This is what our sentence looks like after lemmatization:
London be the capital and most populous city of England and the United Kingdom.
The only change is turning “is” into “be”.
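A lemmatizer can be sketched as a lookup table for irregular forms plus a few suffix rules. The table and rules below are tiny and hypothetical, and the part-of-speech tag is assumed to come from the previous step. Without it, a rule such as “strip the final s” would wrongly turn “populous” into “populou”.

```python
# Tiny hypothetical table of irregular forms; real lemmatizers use large dictionaries.
IRREGULAR = {"is": "be", "are": "be", "was": "be", "were": "be",
             "had": "have", "has": "have"}

def lemmatize(token, pos):
    """Return the base form of a token, given its part-of-speech tag."""
    word = token.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if pos == "NOUN":
        # Plural suffix rules are applied to common nouns only.
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
        return word
    return token  # leave everything else unchanged

print(lemmatize("ponies", "NOUN"))
print(lemmatize("is", "VERB"))
```

These rules are deliberately crude (they would mangle a noun like “bus”), which is one reason real lemmatizers rely on dictionaries rather than suffix stripping alone.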
Next, we want to consider the importance of each word in the sentence. English has a lot of filler words, such as “and”, “the”, and “a”. In a statistical analysis of text, these tokens create a lot of noise because they appear far more often than other words. Some NLP pipelines mark them as stop words and filter them out before doing any statistical analysis.
Now our sentence is as follows:
Stop words are usually detected by checking tokens against a ready-made list. However, there is no single standard list that suits every situation. The set of ignored tokens depends on the project.
For example, if you decide to build a rock band search engine, you probably should not ignore the article “the”. It appears in the names of many bands, and one well-known group from the 1980s is even called “The The”!
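Filtering against a list is simple to sketch. The stop word set below is a small assumed example, not a standard list. The second call shows how a project such as the band search engine could adjust it.

```python
# A small, assumed stop word list; real projects pick or tune their own.
STOP_WORDS = {"a", "an", "and", "the", "is", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop word list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["London", "is", "the", "capital", "and", "most", "populous", "city",
          "of", "England", "and", "the", "United", "Kingdom", "."]
filtered = remove_stop_words(tokens)
print(filtered)

# For a band search engine, keep "the" by removing it from the list.
band_stop_words = STOP_WORDS - {"the"}
band_tokens = remove_stop_words(["The", "The"], band_stop_words)
print(band_tokens)
```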
Next, we need to work out how the words in the sentence relate to each other. This is called dependency parsing. The goal of this step is to build a tree in which each token has a single parent. The root of the tree is usually the main verb of the sentence.
After a first pass, we get the following tree:
Besides identifying the parent of each word, we also need to label the type of relationship between the two words:
This parse tree shows that the subject of the sentence is the noun “London” and that it is connected to “capital” through the verb “be”. This is how we learn that London is a capital! If we followed the branches of the tree further (beyond the part shown in the diagram), we could find out that London is the capital of the United Kingdom.
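The tree structure itself is easy to represent in code. The sketch below does not parse anything. It uses a hand-annotated tree for “London is the capital” (an assumption for illustration, with relation labels following a common naming convention) and shows how a fact can be read off once a parser has produced such a tree.

```python
# Hand-annotated parse of "London is the capital" (assumed for illustration;
# a real pipeline would get this from a trained dependency parser).
# Each token: (text, lemma, index of parent token or None for the root, relation label)
TREE = [
    ("London",  "London",  1,    "nsubj"),  # subject of "is"
    ("is",      "be",      None, "ROOT"),   # main verb, root of the tree
    ("the",     "the",     3,    "det"),    # determiner of "capital"
    ("capital", "capital", 1,    "attr"),   # attribute linked to the subject via "be"
]

def find_root(tree):
    """Return the index of the single token that has no parent."""
    roots = [i for i, tok in enumerate(tree) if tok[2] is None]
    assert len(roots) == 1, "a dependency tree has exactly one root"
    return roots[0]

def child_with_relation(tree, parent, relation):
    """Return the text of the first child of `parent` with the given relation."""
    for text, _lemma, head, rel in tree:
        if head == parent and rel == relation:
            return text
    return None

def extract_fact(tree):
    """Read a (subject, verb lemma, attribute) triple off the tree."""
    root = find_root(tree)
    subject = child_with_relation(tree, root, "nsubj")
    attribute = child_with_relation(tree, root, "attr")
    return subject, tree[root][1], attribute

print(extract_fact(TREE))
```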
We have already done all the difficult work. Finally, we can move from school grammar to really interesting tasks.
Our sentence contains the following nouns:
London is the capital and most populous city of England ...
Some of these nouns refer to real things in the world. For example, “London” and “England” are places on a map. It would be great to detect them! With NLP, we can automatically extract a list of the real-world entities mentioned in a document.
The goal of named entity recognition (NER) is to detect such nouns and label them with the real-world concepts they represent. After running each token through a NER model, our sentence will look like this:
London (geographical name) is the capital and most populous city of England (geographical name)...
NER systems do not just look words up in a dictionary. They analyze the context in which a token appears in the sentence and use statistical models to guess what kind of entity it represents. A good NER system can distinguish the actress Brooklyn Decker from the place Brooklyn.
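Typical NER models recognize entity types such as people's names, company names, geographic locations, product names, dates and times, amounts of money, and names of events.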
Since these models make it easy to extract structured data from raw text, they are widely used in many fields. This is one of the easiest ways to get value out of an NLP pipeline.
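The role of context can be illustrated with a deliberately simple rule-based stand-in. It is not a statistical NER model. It uses a small hypothetical list of places plus one context rule: a known place name directly followed by another capitalized word is treated as a person's name. The labels “GPE” and “PERSON” are assumed names for the two entity types.

```python
# Hypothetical gazetteer of place names, for illustration only.
PLACES = {"London", "England", "Brooklyn"}

def tag_entities(tokens):
    """Label place names, using the next token as context to spot person names."""
    entities = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in PLACES and nxt[:1].isupper() and nxt not in PLACES:
            # A place name followed by a capitalized word: treat as first + last name.
            entities.append((tok + " " + nxt, "PERSON"))
            i += 2
        elif tok in PLACES:
            entities.append((tok, "GPE"))
            i += 1
        else:
            i += 1
    return entities

print(tag_entities("London is the capital and most populous city of England".split()))
print(tag_entities("Brooklyn Decker was born far from Brooklyn".split()))
```

A single hand-written rule like this breaks quickly (it would label “London Bridge” as a person), which is why real NER systems learn such context cues statistically from labeled data.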
We already have a rich and useful representation of the sentence. We know how the words relate to each other, what part of speech each one is, and which named entities the sentence mentions.
Nevertheless, we still have a big problem. English is full of pronouns, words like “he”, “she”, and “it”. These are shortcuts that we use instead of repeating names over and over. A human reader can track what these words refer to from sentence to sentence, based on context. But our NLP model does not know what the pronouns refer to, because it looks at only one sentence at a time.
Let's look at the third sentence in our document:
It was founded by the Romans, who named it Londinium.
If we run it through the pipeline, we will learn that “it” was founded by the Romans. Not very useful knowledge, right?
As a human reader, you can easily work out that “it” refers to London. Coreference resolution tracks pronouns across sentences in order to link together all the words that refer to the same entity.
Here is the result of processing the document for the word "London":
By combining this technique with the parse tree and the information about named entities, we can extract a huge amount of useful data from the document.
Coreference resolution is one of the most difficult steps in our pipeline. It is even harder than sentence parsing. Deep learning methods for this task have already appeared, and they are quite accurate, but still not perfect.
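To show what the task involves, here is a naive heuristic, not one of the deep learning methods mentioned above. It links each “it” to the most recently mentioned entity from a given set. The entity set is assumed to come from the NER step and is limited to two place names here.

```python
import re

SENTENCES = [
    "London is the capital and most populous city of England and the United Kingdom.",
    "Standing on the River Thames in the south-east of the island of Great Britain, "
    "London has been a major settlement for two millennia.",
    "It was founded by the Romans, who named it Londinium.",
]

def resolve_pronouns(sentences, entities, pronouns=("it",)):
    """Link each pronoun to the most recently mentioned entity (naive heuristic)."""
    last_entity = None
    links = []
    for index, sentence in enumerate(sentences):
        for token in re.findall(r"\w+", sentence):
            if token in entities:
                last_entity = token
            elif token.lower() in pronouns and last_entity is not None:
                links.append((index, token, last_entity))
    return links

# Entities assumed to come from the NER step (only two places are tracked here).
links = resolve_pronouns(SENTENCES, {"London", "England"})
print(links)
```

The heuristic is fragile. If “Romans” were added to the entity set, the second “it” in the last sentence would be linked to “Romans” instead of London. That fragility hints at why accurate coreference resolution is so hard.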
So we learned a little about NLP!