Creating a summary from a given piece of content is a very abstract process that everyone participates in. Automating such a process can help parse through a lot of data and help humans better use their time to make crucial decisions. With the sheer volume of media out there, one can be very efficient by reducing the fluff around the most critical information. We have already started seeing text summaries across the web that are automatically generated.
If you frequent Reddit, you might have seen the ‘Autotldr bot’, which routinely helps Redditors by summarizing the articles linked in a post. It was created in 2011 and has already saved thousands of person-hours. There is a market for reliable text summaries, as shown by applications that do precisely that, such as Inshorts (summarizing news in 60 words or less) and Blinkist (summarizing books).
Automatic text summarization, thus, is an exciting yet challenging frontier in Natural Language Processing (NLP) and Machine Learning (ML). Current developments in automatic text summarization build on research dating back to the 1950s, when Hans Peter Luhn’s paper titled “The automatic creation of literature abstracts” was published.
This paper outlined the use of features such as word frequency and phrase frequency to extract essential sentences from a document. It was followed by another influential piece of research by Harold P. Edmundson in the late 1960s, which highlighted cue words, title words appearing in the text, and sentence location as features for extracting significant sentences from a document.
Now that the world has made strides in machine learning and newer studies in the field continue to be published, automatic text summarization is on the verge of becoming a ubiquitous tool for interacting with information in the digital age.
Text Summarization in NLP
There are two main approaches to summarizing text in NLP:
1. Extraction-based summarization
As the name suggests, this technique relies on extracting, or pulling out, key phrases from a document. These key phrases are then combined to form a coherent summary.
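To make the idea concrete, here is a minimal sketch of extraction driven by word frequency alone, in the spirit of the frequency features mentioned earlier. It is an illustration under stated assumptions, not a production summarizer: the stop-word list and the function name `frequency_summary` are hypothetical, and sentences are split naively on end punctuation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "on", "for"}


def content_words(text):
    """Lowercase the text and keep only words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def frequency_summary(text, num_sentences=2):
    """Return the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore the original document order
    return " ".join(sentences[i] for i in chosen)
```

Calling `frequency_summary(article_text, num_sentences=3)` on a longer article would return its three highest-scoring sentences in their original order.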
2. Abstraction-based summarization
This technique, unlike extraction, relies on being able to paraphrase and shorten parts of a document. When such abstraction is done well, typically with deep learning models, the resulting summary can stay grammatically consistent. However, this added layer of complexity makes it harder to develop than extraction.
There is another way to produce higher-quality summaries. This approach, called aided summarization, combines human and software effort. It too comes in two flavors:
- Machine-aided human summarization: extractive techniques highlight candidate passages for inclusion, to which the human may add text or from which the human may remove it.
- Human-aided machine summarization: the human simply edits the output of the software.
Apart from the main approaches to summarizing text, summarizers are also classified on other bases. The following are those category heads:
3. Single vs. Multi-document summarization
Single-document summarization relies on the cohesiveness of the source and the infrequent repetition of facts to generate summaries. Multi-document summarization, on the other hand, increases the chance of redundant and recurring information.
4. Indicative vs. informative
The type of summary depends on the user’s end goal. For instance, in an indicative summary, one would expect only the high-level points of an article, whereas in an informative summary, one may expect more topic filtering that lets the reader drill down into the content.
5. Document length and type
The length of the input text heavily influences the sort of summarization approach.
The largest summarization datasets, like Newsroom by Cornell, have focused on news articles, which are about 300-1000 words on average. Extractive summarizers deal with such lengths relatively well. A multipage document or a chapter of a book usually needs more advanced approaches, like hierarchical clustering or discourse analysis, to be summarized adequately.
Additionally, the genre of the text influences the summarizer too. The methods that summarize a technical white paper well may differ radically from those better equipped to summarize a financial statement.
In this article, we will focus on the extraction-based summarization technique in more detail.
The PageRank algorithm helps search engines like Google rank web pages. Let’s understand the algorithm with an example. Assume you have four web pages with different levels of connectivity between them. One may have no links to the other three, one may link to two of the others, one may link to just one, and so on.
We can then model the probabilities of navigating from one page to another by using a matrix with n rows and columns, where n is the number of web pages. Each element within the matrix will represent the probability of transitioning from one webpage to another. By assigning the right probabilities, one can iteratively update such a matrix to come to a web page ranking.
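The four-page example can be sketched in a few lines of plain Python. This is a simplified illustration, not Google's implementation: the page names and link structure are made up, a page with no outgoing links is assumed to jump to any page with equal probability, and the damping factor of 0.85 is a commonly used value that the article itself does not specify.

```python
def transition_matrix(links):
    """Build an n x n matrix where entry [i][j] is the probability of moving from page i to page j."""
    pages = sorted(links)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    matrix = [[0.0] * n for _ in range(n)]
    for page, outgoing in links.items():
        i = index[page]
        if outgoing:
            for target in outgoing:
                matrix[i][index[target]] += 1.0 / len(outgoing)
        else:
            # Assumption: a page without links jumps to any page uniformly.
            matrix[i] = [1.0 / n] * n
    return pages, matrix


def pagerank(matrix, damping=0.85, iterations=100):
    """Iteratively update the rank of every page until the values settle."""
    n = len(matrix)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        ranks = [
            (1.0 - damping) / n
            + damping * sum(ranks[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
    return ranks


# Hypothetical link structure: D has no outgoing links at all.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
pages, matrix = transition_matrix(links)
ranks = pagerank(matrix)
```

Pairing `pages` with `ranks` and sorting by rank gives the page ordering; in this toy graph, the page that receives the most incoming links ends up on top.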
The reason we explored the PageRank algorithm is to show how the same algorithm can be used to rank text instead of web pages. To do so, we change perspective: links between pages are replaced with similarity between sentences, and the PageRank-style matrix holds similarity scores.
Implementing the TextRank algorithm
The following is an explanation of the code behind the extraction summarization technique:
Concatenate all the text you have in the source document into one solid block of text. This makes the next step, splitting the text into sentences, easier to execute.
We then define what counts as a sentence, for example by looking for end punctuation such as a period (.), a question mark (?), or an exclamation mark (!). Once we have this definition, we simply split the text document into sentences.
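The concatenation and splitting steps can be sketched as follows. This is a naive illustration that assumes every period, question mark, or exclamation mark followed by whitespace ends a sentence, so abbreviations such as "Dr." would be split incorrectly; the function name `split_sentences` is hypothetical.

```python
import re


def split_sentences(text):
    """Collapse the text into one block, then split it after ., ? or !."""
    # Join all lines and runs of whitespace into a single block of text.
    block = " ".join(text.split())
    # Split at whitespace that directly follows end punctuation.
    parts = re.split(r"(?<=[.!?])\s+", block)
    return [part for part in parts if part]
```

For real-world text, a dedicated sentence tokenizer from an NLP library is usually a more robust choice than a single regular expression.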
Now that we have access to separate sentences, we find vector representations (word embeddings) for each of those sentences. First, we must understand what vector representations are. Word embeddings are a type of word representation in which words with similar meanings receive similar mathematical descriptions. In actuality, this is an entire class of techniques that represent words as real-valued vectors in a predefined vector space.
Each word is represented by a real-valued vector that has many dimensions (over 100 at times). This distributed representation is based on the usage of words and, thus, allows words used in similar ways to have similar descriptions. This lets us capture the meanings of words naturally through their proximity to other words, which are themselves represented as vectors.
For this guide, we will use Global Vectors for Word Representation (GloVe). GloVe is an open-source distributed word representation algorithm developed by Pennington et al. at Stanford. It combines the features of two model families, namely global matrix factorization and local context window methods.
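As a rough sketch, pretrained GloVe vectors are distributed as plain-text files in which each line holds a word followed by its vector components, separated by spaces. The helper below parses lines in that layout; the function name `parse_glove_lines` is hypothetical, and the file name in the usage comment is only an assumption about which download you have.

```python
def parse_glove_lines(lines):
    """Turn lines of the form 'word v1 v2 ... vk' into a dict of word -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Usage with a downloaded file (the path is an assumption, adjust to your copy):
# with open("glove.6B.100d.txt", encoding="utf-8") as handle:
#     embeddings = parse_glove_lines(handle)
```

Because the function accepts any iterable of lines, it can be tried on a handful of hand-written lines before loading a full file.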
Once we have the vector representation for our words, we have to extend the process to represent entire sentences as vectors. To do so, we may fetch the vector representations of the words that make up a sentence and then take the mean of those vectors to arrive at a consolidated vector for the sentence.
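The averaging step might look like the sketch below. It assumes `embeddings` is a dictionary mapping lowercase words to equal-length lists of floats; words missing from the dictionary are skipped, and a sentence with no known words falls back to a zero vector. Both choices, and the function name `sentence_vector`, are assumptions made for illustration.

```python
import re


def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the known words in a sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    vectors = [embeddings[word] for word in words if word in embeddings]
    if not vectors:
        # Assumption: represent a sentence with no known words as all zeros.
        return [0.0] * dim
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]
```

Averaging ignores word order, which keeps the method simple but means that two sentences with the same words in a different order get the same vector.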
At this point, we have a vector representation for each individual sentence. It is now helpful to quantify similarities between the sentences using the cosine similarity approach. We can then populate an empty matrix with the cosine similarities of the sentences.
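A minimal sketch of the similarity matrix, using only the standard library, is shown below. It assumes each sentence vector is a list of floats of equal length, leaves the diagonal at zero so that a sentence does not vote for itself, and treats the similarity with an all-zero vector as zero. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Fill an n x n matrix with pairwise cosine similarities (diagonal left at zero)."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = cosine_similarity(vectors[i], vectors[j])
    return matrix
```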
Now that we have a matrix populated with the cosine similarities between the sentences, we can convert this matrix into a graph wherein the nodes represent the sentences and the edges represent the similarity between them. It is on this graph that we will use the handy PageRank algorithm to arrive at the sentence ranking.
We now have ranked all sentences in the article in order of importance. We can now extract the top N (say 10) sentences to create a summary.
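The ranking and extraction steps can be sketched as below. The similarity matrix is treated as a weighted graph, each row is normalised into transition probabilities, and a PageRank-style iteration scores the sentences. Several details are assumptions rather than part of the article: negative similarities are clipped to zero, a row with no positive similarity is spread uniformly, the damping factor is 0.85, and the selected sentences are returned in their original order.

```python
def rank_sentences(similarity, damping=0.85, iterations=100):
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    n = len(similarity)
    transition = []
    for row in similarity:
        weights = [max(w, 0.0) for w in row]  # assumption: ignore negative similarity
        total = sum(weights)
        if total > 0.0:
            transition.append([w / total for w in weights])
        else:
            transition.append([1.0 / n] * n)  # isolated sentence: uniform jump
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def top_sentences(sentences, scores, top_n):
    """Pick the top_n highest-scoring sentences, kept in document order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Joining the result of `top_sentences(sentences, rank_sentences(similarity), 10)` with spaces would give a ten-sentence extractive summary.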
Many projects on GitHub provide code for this method; this article, on the other hand, aims to develop an understanding of it.
An important factor in fine-tuning such models is to have a reliable method to judge the quality of the summaries produced. This necessitates good evaluation techniques, which can be broadly classified into the following:
- Intrinsic and extrinsic evaluation:
Intrinsic: Such evaluation tests the summarization system in and of itself. It mainly assesses the coherence and informativeness of the summary.
Extrinsic: Such evaluation tests the summarization based on how it affects some other task. It may test the impact of the summarization on tasks like relevance assessment, reading comprehension, etc.
- Inter-textual and Intra-textual:
Inter-textual: Such evaluations focus on a contrastive analysis of several summarization systems.
Intra-textual: Such evaluations assess the output of a specific summarization system.
- Domain-specific and domain-independent:
Domain-independent: These techniques generally apply sets of general features to identify information-rich text segments.
Domain-specific: These techniques apply the available knowledge specific to the domain of a text. For example, text summarization of medical literature requires the use of sources of medical knowledge and ontologies.
- Evaluating summaries qualitatively:
The major drawback of other evaluation techniques is that they need reference (model) summaries against which the automatically generated summaries can be compared. This makes the task of evaluation hard and expensive. There is work being done to build a corpus of articles/documents and their corresponding summaries to solve this problem.
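To illustrate what comparing against a reference summary can involve, the sketch below computes a simple unigram recall: the share of words in the reference that also appear in the generated summary. This is only one possible scoring scheme, similar in spirit to recall-oriented overlap metrics, and the function name and tokenization are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_recall(candidate, reference):
    """Fraction of reference words that are also present in the candidate summary."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    if not ref_counts:
        return 0.0
    # Clip each word's overlap by how often it appears in the candidate.
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())
```

A score like this depends entirely on having a good reference summary, which is exactly the cost described above.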
Challenges to Text Summarization
Despite highly developed tools to generate and evaluate summaries, challenges remain to find a reliable way for text summarizers to understand what is important and relevant.
As discussed, vector representation and similarity matrices attempt to find word associations, but they still do not have a reliable method to identify the most important sentences.
Another challenge in text summarization is the complexity of human language and the way people express themselves, especially in written text. Language is composed not only of long sentences with adjectives and adverbs that describe something but also of relative clauses, appositions, etc. Although such constructions may add valuable information, they do not help in establishing the main crux of information to be included in the summary.
The “anaphora problem” is another barrier in text summarization. In language, we often replace the subject of the conversation with synonyms or pronouns. Working out which pronoun substitutes for which term is the “anaphora problem.”
The “cataphora problem” is the opposite of the anaphora problem. Here, an ambiguous word or expression is used in the text before the term it refers to has been introduced.
The field of text summarization is experiencing rapid growth, and specialized tools are being developed to tackle more focused summarization tasks. With open-source software and word embedding packages becoming widely available, users are stretching the use case of this technology.
Automatic text summarization is a tool that enables a quantum leap in human productivity by simplifying the sheer volume of information that humans interact with daily. This not only allows people to cut down on the reading necessary but also frees up time to read and understand otherwise overlooked written works. It is only a matter of time before such summarizers get integrated so well that they create summaries indistinguishable from those written by humans.