By: Huifang Ye, RIG Intern Researcher
The Prevalence of Grammatical Error Correction Tools
If you think of Grammarly as just a spell checker, you may miss out on an extraordinary tech product. People attach so much importance to writing because it is a form of productivity: professional academic writing can make you stand out, proper business emails can facilitate transactions, and consistent brand copy can help establish a corporate image. I am a big fan of Grammarly, and I use it to polish my writing and catch grammatical errors.
Intrigued by this tool, I wanted to take a look at its underlying technology stack. The product is powered by a system that combines rules, patterns, and artificial intelligence techniques such as machine learning, deep learning, and natural language processing (NLP) to improve users’ writing. Its success is largely due to its focus on one narrow application of NLP: grammatical error correction. As an enthusiast of computational linguistics, I explore this specific technology in this article.
Grammatical Error Correction Introduction
Grammatical Error Correction (GEC) is an essential task in natural language processing. It detects whether there are grammatical errors in a sentence and automatically corrects the detected error(s). GEC has significant applications in text proofreading and foreign language acquisition.
Current grammatical error correction is mainly implemented with the Seq2Seq framework, in the same way as machine translation. Specifically, the erroneous input sentence is the source sentence and the corrected output sentence is the target sentence. For example, in the figure below, “A B C D” is the erroneous input sentence and “X Y Z” is the corrected output sentence. Given a large-scale parallel corpus of (incorrect sentence, correct sentence) pairs, we can train a generative model and then use it to correct grammatical errors automatically.
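To make the source/target framing concrete, here is a minimal sketch of how one (incorrect, correct) sentence pair could be turned into a Seq2Seq training example. The class name `GecExample`, the function `make_example`, the special tokens `<s>` and `</s>`, and whitespace tokenization are all assumptions chosen for illustration; real systems typically use subword tokenizers and their own conventions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical special tokens; actual systems define their own.
BOS, EOS = "<s>", "</s>"


@dataclass
class GecExample:
    source: List[str]          # erroneous sentence fed to the encoder
    decoder_input: List[str]   # corrected sentence shifted right (starts with BOS)
    decoder_target: List[str]  # corrected sentence the decoder must predict (ends with EOS)


def make_example(incorrect: str, correct: str) -> GecExample:
    """Build one training example from an (incorrect, correct) sentence pair."""
    src = incorrect.split()  # assumption: plain whitespace tokenization
    tgt = correct.split()
    return GecExample(
        source=src + [EOS],
        decoder_input=[BOS] + tgt,
        decoder_target=tgt + [EOS],
    )


example = make_example("She go to school yesterday .", "She went to school yesterday .")
print(example.source)
print(example.decoder_target)
```

At each step the decoder sees the previous corrected tokens in `decoder_input` and is trained to predict the next token in `decoder_target`, exactly as in machine translation.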
Typically, generative models require large-scale parallel corpora for training; machine translation corpora, for example, range from tens to hundreds of millions of sentences. In contrast, corpora for grammatical error correction are relatively scarce, usually only a few hundred thousand sentences in size. Hence, how to overcome data scarcity is a major focus of grammatical error correction research. In addition, the difference between the source and target sentences in grammatical error correction is usually very small, so using Seq2Seq to generate the target sentence from scratch can be “overkill”. Based on this property, some researchers have proposed model structures designed specifically for grammatical error correction and have achieved good results.
Grammatical Error Correction Technology
(1) Automatic corpus expansion method
In response to the problem of insufficient training data for grammatical error correction, some researchers propose increasing the training data by constructing pseudo-data. Wei Zhao et al. from Yuanfudao AI Lab proposed constructing pseudo-data by randomly injecting errors into correct sentences. The process is as follows: delete a word with 10% probability; insert a random word with 10% probability; replace a word with a random word with 10% probability. Finally, the words are lightly shuffled by adding normally distributed noise to their positions and re-sorting them. The resulting sentence is used as the erroneous source sentence, paired with the original correct sentence as the target.
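The noising procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `make_pseudo_source`, the toy vocabulary, the choice to treat the three operations as mutually exclusive per token, and the standard deviation `sigma=0.5` for the position noise are all assumptions.

```python
import random


def make_pseudo_source(tokens, vocab, rng, p=0.1, sigma=0.5):
    """Corrupt a correct sentence into a pseudo-erroneous one.

    Assumption: each token gets exactly one of delete / insert / replace / keep,
    with probability p for each of the first three.
    """
    noised = []
    for tok in tokens:
        r = rng.random()
        if r < p:
            continue                          # delete the token
        elif r < 2 * p:
            noised.append(rng.choice(vocab))  # insert a random word before the token
            noised.append(tok)
        elif r < 3 * p:
            noised.append(rng.choice(vocab))  # replace the token with a random word
        else:
            noised.append(tok)                # keep the token unchanged
    # Light shuffle: add Gaussian noise to each position, then re-sort by noisy position.
    keyed = [(i + rng.gauss(0.0, sigma), tok) for i, tok in enumerate(noised)]
    keyed.sort(key=lambda pair: pair[0])
    return [tok for _, tok in keyed]


rng = random.Random(0)  # fixed seed so the corruption is reproducible
correct = "She went to school yesterday .".split()
toy_vocab = ["the", "a", "go", "at", "in", "book"]  # hypothetical vocabulary
pseudo_source = make_pseudo_source(correct, toy_vocab, rng)
print(pseudo_source, "->", correct)  # (pseudo source, original target) training pair
```

Because the position noise is small, most words stay in place and only neighboring words occasionally swap, which keeps the corrupted sentence close to the original.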
(2) Model improvement method
The difference between the input and output sentences in grammatical error correction is relatively minor: as the following table shows, more than 80% of the words in the input and output sentences are the same. Based on this, Wei Zhao et al. from Yuanfudao AI Lab proposed using the Copy Mechanism for text error correction. Because unchanged words can simply be copied from the input, model components such as attention can concentrate on learning how to correct errors.
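One way to measure this kind of overlap on your own data is to align the source and target tokens and count how many target tokens could have been copied. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function name `copied_fraction` and whitespace tokenization are assumptions for illustration, and this is not necessarily how the figure quoted above was computed.

```python
from difflib import SequenceMatcher


def copied_fraction(source: str, target: str) -> float:
    """Fraction of target tokens that align with identical source tokens."""
    src, tgt = source.split(), target.split()  # assumption: whitespace tokenization
    if not tgt:
        return 0.0
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    # Each matching block is a run of identical tokens shared by source and target.
    copied = sum(block.size for block in matcher.get_matching_blocks())
    return copied / len(tgt)


ratio = copied_fraction("She go to school yesterday .", "She went to school yesterday .")
print(f"{ratio:.0%} of the target tokens are copied from the source")
```

Averaging this ratio over a parallel corpus gives a rough sense of how much of the output a model could produce by copying alone.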
The model structure of the Copy Mechanism is shown below. The main idea is to consider two distributions while generating the output sequence: a probability distribution over the words in the input sequence (the copy distribution) and a probability distribution over the words in the vocabulary (the generation distribution). The weighted sum of the two is used as the final probability distribution for predicting the word generated at each time step. This method effectively exploits the fact that the input and generated sentences share many words.
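The weighted sum at a single decoding step can be illustrated with a small sketch. All names here (`mix_copy_and_generate`, `copy_weights`, `alpha_copy`) and all numbers are made up for illustration; in a real model the two distributions and the balancing weight are produced by the network at every time step.

```python
def mix_copy_and_generate(p_gen, copy_weights, source_tokens, alpha_copy):
    """Combine generation and copy distributions for one decoding step.

    p_gen:         dict mapping vocabulary word -> generation probability (sums to 1)
    copy_weights:  one weight per source position (sums to 1)
    source_tokens: the tokens of the input sentence
    alpha_copy:    balancing weight in [0, 1]; higher means "prefer copying"
    """
    # Copy distribution: total weight placed on every occurrence of each source word.
    p_copy = {}
    for tok, weight in zip(source_tokens, copy_weights):
        p_copy[tok] = p_copy.get(tok, 0.0) + weight
    # Final distribution: p(w) = (1 - alpha) * p_gen(w) + alpha * p_copy(w)
    words = set(p_gen) | set(p_copy)
    return {
        w: (1.0 - alpha_copy) * p_gen.get(w, 0.0) + alpha_copy * p_copy.get(w, 0.0)
        for w in words
    }


source = ["she", "like", "apples"]
p_gen = {"likes": 0.2, "apples": 0.5, "fruit": 0.3}  # made-up generation probabilities
copy_weights = [0.05, 0.05, 0.9]                     # made-up weights over source positions
final = mix_copy_and_generate(p_gen, copy_weights, source, alpha_copy=0.6)
print(max(final, key=final.get))
```

Note that source words missing from the generation vocabulary still receive probability mass through the copy distribution, which is one reason copying helps with rare words and names.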
Current grammatical error correction (GEC) methods face the following main challenges:
(1) The models are too slow to be applied on a large scale. Current grammatical error correction technology mainly uses Seq2Seq generative models, and related tasks such as grammatical error detection use the BERT model. These models are relatively large and often require GPUs, which makes them notably slow in practical applications. These problems greatly limit the application and popularity of grammatical error correction technology. To alleviate this bottleneck, NLP researchers mainly focus on reducing the size of GEC models and speeding up prediction accordingly.
(2) The scale of real training data is limited. Although researchers have proposed various methods to augment the training data, the quality of the generated data is often unsatisfactory. Hence, there is still a long way to go in increasing the scale of real training data.
(3) Models for grammatical error correction are still being developed. At present, most models used in this field are borrowed from machine translation and text summarization, and few are designed specifically for the characteristics of the grammatical error correction task. Designing models that exploit the similarity between the input and generated sentences remains a challenge.
Grammatical error correction is an important research topic in NLP, and researchers today usually approach it with the Seq2Seq methods of machine translation. In response to the lack of data and other problems, researchers have proposed various data augmentation methods and have made steady progress on the task. At present, grammatical error correction still suffers from slow speed and insufficient data, among other problems. With the rapid development of deep learning and NLP technology, many in the data science community expect these problems to be addressed, and more products like Grammarly to make the world a better place.