Approaches to Natural Language Processing Tasks

Intellexer
DataDrivenInvestor


1. Rule-based

The rule-based approach is probably the oldest of all approaches to NLP, and it is still widely used, though for a restricted range of tasks. Rules are written by skilled experts, usually linguists or other knowledge engineers. The process starts with deep manual data analysis aimed at finding regularities and relations that can then be turned into rules. The core difficulty is a constant search for balance: a rule that is too specific covers only a handful of cases, while a rule that is too general fires where it should not.

Regular expressions and context-free grammars serve as instruments of rule-based approaches to NLP.
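To make this concrete, here is a minimal sketch of a rule-based sentence splitter built on regular expressions. The rule set (split on sentence-ending punctuation, unless it belongs to a known abbreviation) and the abbreviation list are invented for illustration; a real system would need far more rules, which is exactly the maintenance burden described above.

```python
import re

# Toy rule set: a hypothetical illustration, not a production splitter.
# Rule: a sentence ends at ., ! or ? followed by whitespace, unless the
# preceding token is a known abbreviation.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}

def split_sentences(text):
    # Candidate boundaries: sentence-ending punctuation followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences, buffer = [], ""
    for part in parts:
        buffer = f"{buffer} {part}".strip() if buffer else part
        # Keep accumulating while the chunk ends with a known abbreviation
        if any(buffer.endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

Every abbreviation missing from the list produces a wrong split, which is why such rule sets need continuous manual updating.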

Overall, the rule-based approach has proven to work well at the sentence level, e.g. for word boundary disambiguation. It performs well when applied to a narrow domain, but it tends to generalize poorly.

Another obvious drawback of the rule-based approach is that it is labor-intensive and time-consuming: the rules are created manually and need to be continuously updated, which often results in an overly complex system of rules that contradict each other.

2. Traditional Machine Learning

The machine-learning approach doesn’t require any rules written by experts; on the contrary, an ML-based system builds its own knowledge, producing its own rules and classifiers. It can learn to “understand” text without being explicitly programmed. Machine learning relies on statistical methods and training data: the only thing needed is an annotated dataset on which the model is trained.

The obvious advantage of machine learning is its ability to learn on its own, which significantly reduces time expenditures. This approach works best for classification and clustering tasks, e.g. categorizing documents by topic.
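A document-categorization setup of the kind mentioned above can be sketched in a few lines with scikit-learn. The tiny annotated dataset, its labels, and the texts below are invented purely for illustration; a real model would need a far larger training set, which is the cost discussed next.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny hand-annotated dataset (texts and labels are made up for the demo)
train_texts = [
    "stocks fell sharply on the market today",
    "the central bank raised interest rates",
    "the team won the championship game",
    "the striker scored two goals in the match",
]
train_labels = ["finance", "finance", "sports", "sports"]

# Bag-of-words features + Naive Bayes: a classic statistical ML baseline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the market reacted to the bank announcement"]))
```

Note that the model learns its "rules" (word–topic statistics) from the annotated examples rather than from hand-written patterns.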

The main problem with using ML for NLP tasks, as you might have guessed, lies in the training data. Quite often a training dataset cannot be reused, and new data has to be prepared. This process can be very time-consuming, because datasets are mostly created manually and have to be large enough to ensure high model accuracy.

It can also happen that even a small amount of new data added to the training set changes the model so much that it starts behaving unpredictably.

Nevertheless, ML approaches can substantially speed up the development of certain NLP systems, provided good training datasets are available.

3. Neural networks

This approach is similar to “traditional” machine learning, but it does not require feature engineering: the features are learned from a very large dataset by the neural network itself. Different NN architectures can be used to handle NLP tasks; there are even neural networks capable of simulating Markov chains.

Unlike in “traditional” ML, the words in the dataset are represented as vectors distributed in a vector space. Associations between words are found based on their similarity, which is measured using cosine similarity.
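The similarity computation itself is simple. In the sketch below, the three-dimensional "word vectors" and their values are made up for illustration; real embeddings (e.g. word2vec) have hundreds of dimensions and are learned from a large corpus.

```python
import numpy as np

# Toy 3-dimensional "word vectors"; the numbers are invented for the demo.
# Real learned embeddings place related words close together in the space.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much lower
```

Words whose vectors point in similar directions are treated as related, which is how the associations mentioned above are found.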

These associations are established automatically through unsupervised learning, for which no annotated data is needed. The approach requires only a very large corpus of natural language text to learn from.

Recent growth in computational resources and accumulated datasets, combined with improved algorithms, has enabled neural-network methods to dominate many NLP tasks, such as speech recognition and question answering.

All the above-mentioned approaches are used both individually and in combination with each other to deliver better results. The choice of approach depends on the application area and the type of system being developed.

