Extract meaningful information from Big Data using NLP and Machine Learning

Feb 22, 2021

AI systems are often exposed to a wide variety of information, such as a person's voice, images containing text, and raw documents like news articles. To act on this unstructured data, an AI system has to perform a crucial process called Information Extraction (IE). Information Extraction is the process of retrieving the key information embedded in unstructured data; in other words, it extracts structured data from unstructured data.

Let's understand how to build a system that can extract structured information from unstructured text data. I have split the discussion into the following topics:

Need for Information Extraction with a Use Case
Steps involved in extracting information from the text data
Named-Entity Recognition (NER) and Relation Extraction
NER implementation using Machine Learning

Need for Information Extraction

A wide range of NLP-based applications use Information Extraction systems: for instance, extracting summaries from large text corpora such as Wikipedia, conversational AI systems like chatbots, and extracting information about stock market announcements from financial news. In fact, present-day virtual assistants such as Google Assistant, Amazon’s Alexa, and Apple’s Siri use advanced IE systems to extract information from large encyclopedias. For further discussion, let’s take a use case.

Flight Booking System — Use Case

Imagine you are implementing a conversational flight-booking system that can show relevant flights when given a natural-language query such as “show me the flights that arrive in Toronto from Cincinnati on next Monday morning.” To process this query, the system must be able to extract the useful named entities from the unstructured text and convert them into a structured format. The structured format could be JSON or any other set of key-value pairs that can be parsed and used to query a database and retrieve the relevant flight details.

Using these named entities, you can look up a database and retrieve all the relevant flight results. In general, named entities refer to names of people, organizations (IBM, Payoda), places (India, Chennai, New York), specific dates and times (Friday, 7 pm), etc.
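As a rough illustration of the structured format described above, the example query might be reduced to a small key-value record and serialized as JSON. The field names (origin, destination, day, time_of_day) and the build_filter helper are hypothetical, chosen only for this sketch.

```python
import json

# Hypothetical structured form of the query
# "show me the flights that arrive in Toronto from Cincinnati on next Monday morning".
# The key names are assumptions made for this sketch, not a fixed schema.
extracted = {
    "origin": "Cincinnati",
    "destination": "Toronto",
    "day": "Monday",
    "time_of_day": "morning",
}

def build_filter(entities):
    """Keep only the non-empty fields, ready to be used as database query filters."""
    return {key: value for key, value in entities.items() if value}

flight_filter = build_filter(extracted)
print(json.dumps(flight_filter, indent=2))
```

A record like this can be passed to whatever database layer the booking system uses; the extraction steps discussed next are what produce it from raw text.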

Steps involved in extracting information from Raw Text

Although textual data is abundantly available, the complexity of natural language makes it particularly difficult to extract useful information from it. However, no matter how complex the Information Extraction task is, some common steps form the pipeline of almost all IE systems.

Tokenization

Tokenization is a part of Lexical Processing that is usually performed as a preliminary task in NLP applications. Tokenization involves splitting text documents into semantically meaningful units such as sentences and words (tokens).

Generally, sentence tokenization is done by splitting the text at sentence endings (‘.’), and word tokenization by splitting each sentence at blank spaces. More sophisticated methods are used for tokenizing more complex text structures, such as words that often go together, like “Los Angeles”, which are known as collocations.
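The splitting described above can be sketched with plain regular expressions. This is a simplified illustration rather than a production tokenizer: the collocation list is a hypothetical lookup, and the sentence splitter would mis-handle abbreviations such as "U.S.".

```python
import re

def sentence_tokenize(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

def merge_collocations(tokens, collocations):
    """Join adjacent tokens that form a known two-word collocation."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in collocations:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "I want to fly to Los Angeles. Show me the morning flights."
collocations = {("Los", "Angeles"), ("New", "York")}  # hypothetical lookup list

sentences = sentence_tokenize(text)
tokens = [merge_collocations(word_tokenize(s), collocations) for s in sentences]
print(tokens)
```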

Part-of-Speech Tagging

Part-of-Speech (PoS) tagging is the first level of syntactic processing; it tags each word with the role it plays in a sentence. Common PoS tags include nouns, verbs, pronouns, adjectives, adverbs, prepositions, interjections, and conjunctions. Below is an example of PoS tagging.

The set of standard PoS tags used by default in the NLTK library is included in the references section. Tagging a word is not a straightforward task: for instance, in the sentence “the song is a big hit”, the word “hit” is a noun, whereas in the sentence “he hit me”, “hit” is a verb. Handling such ambiguity needs more advanced techniques that deserve a separate blog post.
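To make the ambiguity concrete, here is a toy sketch in which the tag of the previous word decides between the noun and verb readings of “hit”. The lexicon and the rule are assumptions written for this illustration; real taggers learn such preferences from annotated data rather than from a hand-written rule.

```python
# Toy lexicon: each word maps to the tags it can take (Penn Treebank style).
LEXICON = {
    "the": ["DT"], "a": ["DT"], "song": ["NN"], "is": ["VBZ"],
    "big": ["JJ"], "he": ["PRP"], "me": ["PRP"],
    "hit": ["NN", "VBD"],  # ambiguous: noun or verb
}

def tag(tokens):
    """Tag tokens, using the previous tag to choose between candidate tags."""
    tagged, prev_tag = [], None
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["NN"])  # unknown words default to noun
        if len(candidates) == 1:
            choice = candidates[0]
        elif prev_tag in ("DT", "JJ") and "NN" in candidates:
            choice = "NN"  # after a determiner or adjective, prefer the noun reading
        else:
            # otherwise prefer a verb reading if the word has one
            choice = next((t for t in candidates if t.startswith("VB")), candidates[0])
        tagged.append((token, choice))
        prev_tag = choice
    return tagged

print(tag("the song is a big hit".split()))
print(tag("he hit me".split()))
```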

Named-entity Recognition and Relation Extraction

A crucial component in IE systems is Named Entity Recognition (NER). Named-entity recognition is the problem of identifying and classifying entities into categories such as the names of people, locations, organizations, the expressions of quantities, times, measurements, monetary values, and so on. In general terms, entities refer to names of people, organizations (e.g. Jet Airways, American Airlines), places/cities (Bengaluru, Boston), etc.

In entity recognition, every token is tagged with an IOB label, and then nearby tokens are combined based on their labels. IOB labels (I: inside, O: outside, B: beginning) are similar to PoS tags, but they include domain-specific custom labels. For instance, in the user request ‘What is the price of American Airlines flight from New York to Los Angeles’, the first token of each entity (such as ‘New’) gets a B label, the tokens that continue it (such as ‘York’) get an I label, and tokens outside any entity (such as ‘What’) get the O label.
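The combining step can be sketched as follows. The labels are written by hand for this example, in a style similar to the ATIS labels (airline_name, fromloc.city_name, toloc.city_name); treat the exact label names as assumptions.

```python
def group_entities(tokens, labels):
    """Merge B-/I- labeled tokens into (entity_type, text) pairs; O tokens are skipped."""
    entities, current_type, current_words = [], None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(token)  # continue the open entity
            continue
        if current_type is not None:  # close whatever entity was open
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):  # start a new entity (a stray I- is treated as B-)
            current_type, current_words = entity_type, [token]
        else:  # "O": outside any entity
            current_type, current_words = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities

tokens = "What is the price of American Airlines flight from New York to Los Angeles".split()
labels = [
    "O", "O", "O", "O", "O",
    "B-airline_name", "I-airline_name", "O", "O",
    "B-fromloc.city_name", "I-fromloc.city_name", "O",
    "B-toloc.city_name", "I-toloc.city_name",
]
entities = group_entities(tokens, labels)
print(entities)
```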

Once we have the PoS tags and the IOB labels, the remaining task is to map the relationships between the entities.

Models for Entity Recognition

Rule-based models

Chunking is a rule-based text extraction process that is used for building named-entity recognition models. In chunking, a chunker groups the meaningful phrases in a text into chunks. The chunker is built upon a set of production rules, otherwise known as grammar rules. For instance, in the case of NER, the grammar can be a pattern that matches a noun phrase, since named entities are mostly nouns.

Because the chunker rules are based on PoS tags, PoS tagging is a necessary step before chunking. The noun chunks can be further reduced to named entities by applying IOB-based rules on the chunker output.
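A minimal sketch of a noun-phrase chunker using NLTK's RegexpParser is shown below. The grammar rule and the hand-written PoS tags are assumptions for this illustration (the tags are supplied directly so that no tagger model needs to be downloaded); the rules used in the linked notebook may differ.

```python
import nltk
from nltk.chunk import tree2conlltags

# PoS-tagged tokens written by hand so the sketch needs no tagger model download.
tagged = [
    ("show", "VB"), ("me", "PRP"), ("the", "DT"), ("flights", "NNS"),
    ("from", "IN"), ("New", "NNP"), ("York", "NNP"),
    ("to", "TO"), ("Los", "NNP"), ("Angeles", "NNP"),
]

# Grammar rule: a noun phrase is an optional determiner, any adjectives, then one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the text of every NP chunk found by the rule.
noun_chunks = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_chunks)

# The same chunks expressed as IOB tags, one (word, pos, iob) triple per token.
iob_tagged = tree2conlltags(tree)
print(iob_tagged)
```

Note that the rule also picks up “the flights”, which is a noun phrase but not a named entity; this is why a further filtering step is needed after chunking.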

Probabilistic models

Various approaches are available to build a probabilistic NER model, such as unigram and bigram models, Conditional Random Fields, and the Naive Bayes classifier. In the next section, we will implement one of the Machine Learning (ML) based NER models.
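The simplest of these, a unigram model, can be sketched in a few lines: each word is assigned the label it carried most often in the training data. The training sentences and the B-city label below are hypothetical and serve only to show the idea.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Count how often each word carries each label, then keep the most frequent label."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, label in sentence:
            counts[word.lower()][label] += 1
    return {word: labels.most_common(1)[0][0] for word, labels in counts.items()}

def predict(model, tokens, default="O"):
    """Label each token with its most frequent training label; unseen words get `default`."""
    return [model.get(token.lower(), default) for token in tokens]

# Tiny hand-labeled training set (hypothetical, for illustration only).
train = [
    [("flights", "O"), ("from", "O"), ("boston", "B-city"), ("to", "O"), ("denver", "B-city")],
    [("show", "O"), ("flights", "O"), ("to", "O"), ("boston", "B-city")],
]
model = train_unigram(train)
predictions = predict(model, ["flights", "to", "Denver", "from", "Chicago"])
print(predictions)
```

A unigram model ignores context and cannot label words it has never seen (here “Chicago” falls back to O); a bigram model would additionally condition on the previous label.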

Named Entity Recognition (NER) using Machine Learning:

The objective of the machine learning model is to assign appropriate IOB tags to the words in the user query.

Let’s create a classifier using an ML algorithm to classify the words in a sentence and assign an IOB label to each word. It involves the following tasks:

Data Understanding
Preprocessing
Feature Extraction
Model Creation
Model Evaluation

Load and Understand the Data

The Airline Travel Information Systems (ATIS) dataset consists of user queries (in English) for booking and requesting information about flights in the US.

Each word in a user request (query) is labeled according to its entity type. For instance, in the query ‘what flights are available Monday morning from San Francisco to Pittsburgh’, ‘San Francisco’ and ‘Pittsburgh’ are labeled as the ‘source’ and ‘destination’ locations respectively, while ‘morning’ is labeled as ‘time-of-day’.

Take a look at some of the sample user queries:

Structure of the Dataset

Once you load the dataset, you will find two columns, each holding a list of numbers per query. The numbers in the first column (Query) correspond to the words of the query, and the numbers in the second column (label) correspond to the IOB labels of those words.
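Decoding such a row back into readable text might look like the sketch below. The index-to-word and index-to-label dictionaries and the sample row are hypothetical stand-ins; the real mappings come with the dataset.

```python
# Hypothetical index-to-word and index-to-label dictionaries; a real dataset ships its own.
idx2word = {0: "flights", 1: "from", 2: "boston", 3: "to", 4: "denver"}
idx2label = {0: "O", 1: "B-fromloc.city_name", 2: "B-toloc.city_name"}

# One row of the dataset: a query and its labels, both encoded as lists of indices.
query_ids = [0, 1, 2, 3, 4]
label_ids = [0, 0, 1, 0, 2]

def decode(ids, mapping):
    """Translate a list of integer indices back into strings."""
    return [mapping[i] for i in ids]

words = decode(query_ids, idx2word)
labels = decode(label_ids, idx2label)
for word, label in zip(words, labels):
    print(f"{word:10s} {label}")
```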

Preprocessing

Generally, the initial pre-processing tasks involve data cleaning, sentence tokenization, and word tokenization. The ATIS dataset that we have is already in tokenized form, which means we can proceed directly to PoS tagging.

We can apply a PoS tagging function to the query text column to obtain the respective PoS tags. Let’s visualize one of the tagged queries in a tree format.

Feature Extraction

As with any other machine learning classification model, we can define features for sequence labeling tasks. Features could be the morphology of the word (upper/lowercase), the PoS tags of the neighboring words, whether the word is present in a lookup dictionary of common entities such as geographical locations, etc.

Dictionary Lookup

A gazetteer is a directory that stores the names of geographical entities such as cities, states, and countries, along with some other features related to those geographies. An example gazetteer file for the US is given in the references section. This can be used as a lookup table in our case.
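Building lookup sets from a gazetteer could be sketched as below. An inline sample stands in for the real file; the pipe delimiter and the column names are assumptions for this sketch and should be checked against the actual file before use.

```python
import csv
import io

# Inline sample standing in for the gazetteer file. The pipe delimiter and the
# column names are assumptions for this sketch; check them against the real file.
sample = """City|State short|State full
Boston|MA|Massachusetts
Denver|CO|Colorado
San Francisco|CA|California
"""

def load_gazetteer(handle):
    """Build lowercase lookup sets of city and state names from a delimited file."""
    cities, states = set(), set()
    for row in csv.DictReader(handle, delimiter="|"):
        cities.add(row["City"].lower())
        states.add(row["State full"].lower())
    return cities, states

cities, states = load_gazetteer(io.StringIO(sample))
print("boston" in cities, "colorado" in states, "paris" in cities)
```

Because the queries are tokenized word by word, multi-word names such as “san francisco” would need to be split into individual words (or matched as collocations) before they can serve as per-token lookups.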

Let’s create a function to extract some useful predictors of IOB labels, such as the word itself, pos tag, previous pos, word_is_city, word_is_state, and word_is_country. Below is the extracted feature data for one of the user queries.
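A feature function along these lines might look like the following sketch. The lookup sets are small hypothetical samples, the PoS tags are written by hand, and the "<START>" placeholder for the first word is an assumption of this example.

```python
# Small hypothetical lookup sets; in practice these would come from a gazetteer.
CITIES = {"boston", "denver", "pittsburgh"}
STATES = {"colorado", "texas"}
COUNTRIES = {"canada", "mexico"}

def word_features(tagged_query, i):
    """Return the feature dictionary for the i-th (word, pos) pair of a query."""
    word, pos = tagged_query[i]
    prev_pos = tagged_query[i - 1][1] if i > 0 else "<START>"
    lowered = word.lower()
    return {
        "word": lowered,
        "pos": pos,
        "prev_pos": prev_pos,
        "word_is_city": lowered in CITIES,
        "word_is_state": lowered in STATES,
        "word_is_country": lowered in COUNTRIES,
    }

# PoS tags are written by hand here; they would normally come from a tagger.
query = [("flights", "NNS"), ("from", "IN"), ("boston", "NN"), ("to", "TO"), ("denver", "NN")]
features = [word_features(query, i) for i in range(len(query))]
print(features[2])
```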

Model Building and Evaluation

We use NLTK’s built-in classifiers for the classification task. We have implemented a Naive Bayes model and a Decision Tree model to classify the words of a user query into IOB labels. The code implementation can be found in the GitHub notebook shared in the references section.
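For orientation, here is a minimal sketch of training NLTK's Naive Bayes classifier on per-word feature dictionaries. It is not the notebook's code: the training queries, the B-from/B-to labels, and the two features are hypothetical and far smaller than the real setup.

```python
import nltk

def features(words, i):
    """Minimal features for the i-th word: the word itself and its left neighbor."""
    return {"word": words[i], "prev_word": words[i - 1] if i > 0 else "<START>"}

def to_examples(words, labels):
    """Turn one labeled query into (feature_dict, label) pairs."""
    return [(features(words, i), label) for i, label in enumerate(labels)]

# Tiny hand-labeled sample standing in for the real training data (hypothetical).
train_queries = [
    (["flights", "from", "boston", "to", "denver"], ["O", "O", "B-from", "O", "B-to"]),
    (["show", "flights", "from", "denver", "to", "dallas"], ["O", "O", "O", "B-from", "O", "B-to"]),
    (["fares", "from", "dallas", "to", "boston"], ["O", "O", "B-from", "O", "B-to"]),
]
train_set = [ex for words, labels in train_queries for ex in to_examples(words, labels)]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify each word of a new query independently.
query = ["flights", "from", "dallas", "to", "denver"]
predicted = [classifier.classify(features(query, i)) for i in range(len(query))]
print(list(zip(query, predicted)))

# Accuracy on the training sample only; a real evaluation needs a held-out test set.
train_accuracy = nltk.classify.accuracy(classifier, train_set)
print(train_accuracy)
```

The same list of (feature_dict, label) pairs can be passed to nltk.DecisionTreeClassifier.train to try the decision tree instead.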

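We can, of course, also tune the decision tree hyperparameters (maximum depth, number of leaves, minimum samples per split, etc.; NLTK’s DecisionTreeClassifier exposes these as cutoffs such as depth_cutoff, support_cutoff, and entropy_cutoff) to optimize the model as needed.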

In the world of Natural Language Processing and Machine Learning, there is no single best way to do a task, and this applies to Information Extraction as well. I hope this blog has given you some idea of how to build an IE system; there are many more dimensions of NLP that can be used across applications, and I hope we will explore a few more of them in the future.

Image Sources & References:

PoS tags used in NLTK (default): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
ATIS (Airline Travel Information Systems) dataset: https://github.com/santho3110/Information-Extraction/raw/main/data/ATIS%2Bdataset.zip
Gazetteer file for the US: https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
NER code implementation using NLTK: https://github.com/santho3110/Information-Extraction/blob/main/Named%20Entity%20Recognition%20(ATIS).ipynb

Author: Santhosh Rajasingaram

Written by Payoda Technology Inc