I have been pondering some of these questions for some time, and after taking a break from chasing paper deadlines, I found some time over Christmas to pen down my thoughts. I’ll discuss four issues facing NLP (or at least the way I see them) and offer suggestions.
Please note that I do not advocate the following as the only way everyone should work. I believe that research is best done when people do what they are excited about, and we should try diverse things with an open mind, especially in the early stages of a field.
1: Too Many Datasets and Not Enough Structure
A few months back, I jokingly commented that there are now “more NLP datasets than NLP researchers”. I do think there is an element of truth here. The field has witnessed explosive growth, and the result has been an overemphasis on too many problems in isolation, without enough work on how these different problems relate to one another.
For reference, consider how the field of mathematics progressed. People first proved theorems grounded in something of immediate interest, such as how to approximate the area of a circle or measure the height of a tower. It is no surprise that early mathematics was mostly real algebra and trigonometry rather than abstract algebra and algebraic geometry. A good, healthy trend in mathematics is that most papers use at least some results from the literature. Unfortunately, in NLP one seldom has the opportunity to use (I do not mean cite but actually use) results from very old research papers, or to use the solution of another problem that has been reliably solved (e.g., POS tags). Why is this the case?
Let’s look at a bit of history first.
NLP from the early 1990s until sometime around 2012 was dominated by the linguistic agenda. This agenda consisted of building a structure, one level at a time, by first solving part-of-speech tagging, then syntactic parsing, then semantics, and finally something in pragmatics/discourse. It made total sense! Surely one cannot really approach discourse if one cannot understand a sentence, and one cannot understand a sentence without understanding the dependencies between its words. These problems were also generally well defined (mostly, one can agree on what the right parse tree is) and interpretable (you can look at the parse tree to see what mistakes were made).
Unfortunately, it is becoming apparent that these beautiful linguistic approaches didn’t yield empirical gains (empirical evaluation was another breakthrough that happened in NLP in the 1990s; I talk about that in point 4 below). The biggest evidence of this is that most NLP papers these days outside the parsing track don’t use a syntactic parser, either in their model or in a baseline. A single LSTM with attention processing the entire sentence outperforms linear models using parse trees. The field realized this and, as a result, many researchers changed their agenda and left the linguistic bandwagon for the deep learning supercar.
It is disappointing that after improving parsing scores on the Penn Treebank from somewhere around 80% to now 96%, we still don’t end up using parse trees as features in our models. But this is mostly a tragedy for the linguistic agenda and not for the field, because across-the-board performance on several datasets has improved (due to better feature representations from deep learning, improved learning and optimization algorithms, and simply more researchers working), new problems have blossomed, and attendance and submissions at conferences have skyrocketed. Another change that has taken place is that more people now work directly on the final problem of interest, like question answering, machine comprehension, instruction-based navigation, etc. The general approach these days appears to be:
- Create a new dataset and problem.
- Improve the performance on this dataset by generally using more and more complicated models.
- Once performance reaches a certain level and further improvement is hard, interest dies and people return to step 1.
A new problem, however, is appearing.
Rarely does a solution to one problem inform the solution to another. The field is littered with dead datasets and solutions that are never used again.
I believe this is happening because we are not building a structure. There are too many problems, and it is not clear how they relate to one another.
For reference, one crowning achievement of complexity theory was to reduce thousands of problems to a few complexity classes and study how these classes relate to one another. There are thousands of different NLP problems. What is our analog of the class NP?
Since the linguistic agenda of studying a hierarchy of linguistic structure hasn’t succeeded after decades of work, what new structure could we use? The following is one suggestion, which I call the Learning Reduction Agenda:
- Narrow down to problems that are of immediate interest and well defined: for example, knowledge-base or visual question answering, vision-based instruction following, and multiple-choice reading comprehension. These will be our analog of the Pythagorean theorem.
- Create new problems that use solutions to step 1 for evaluation or reduction. Here is one example: one can define a machine translation metric in terms of question answering. If the gold French text and the translated French text can be used to answer the same questions, then they have the same content and should be treated as paraphrases (I am assuming there is an executable environment).
- Gradually expand the problem set by repeating step 2.
[Learning Reduction Agenda] Researchers should think carefully about where a new problem stands in this structure. If it is of end interest with an unambiguous evaluation metric, it belongs on the ground floor; otherwise, it should reduce to a problem lower in the structure and thereby reveal its position.
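As a toy illustration of step 2, the QA-based translation metric could be sketched as below. The `answer` function here is a made-up stub standing in for a trained French QA system (a real reduction would call an actual QA model); only the scoring logic is the point.

```python
def answer(text, question):
    """Stub QA model: returns the word that follows the question's last
    token in the text. A real reduction would call a trained QA system."""
    words = text.split()
    key = question.split()[-1]
    for i, w in enumerate(words):
        if w == key and i + 1 < len(words):
            return words[i + 1]
    return ""

def qa_agreement(gold, translated, questions):
    """Translation score: fraction of questions answered identically
    from the gold text and from the translated text."""
    same = sum(answer(gold, q) == answer(translated, q) for q in questions)
    return same / len(questions)
```

A translation that preserves the answers to all probe questions scores 1.0, regardless of its surface form; a translation that changes the content scores lower.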
2: Solving an Ill-Defined Problem
There are several problems in NLP that do not have a well-defined metric. Most of these problems involve generating text: for example, a chatbot having a conversation with a human, machine translation, or image caption generation. The solution has been to defer to humans for evaluation or to use proxy evaluation metrics such as BLEU, ROUGE, METEOR, CIDEr, etc. If one really buys into these metrics, one can use standard methods to maximize performance with respect to them. Some sophisticated researchers prefer an ensemble of metrics or report performance on several metrics.
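For concreteness, here is a minimal sketch of what a metric like BLEU actually computes: clipped n-gram precision combined with a brevity penalty. This is a simplified, unsmoothed sentence-level version for illustration, not a reference implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions up to max_n, times a brevity penalty. No smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0  # zero overlap at some n-gram order
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    # brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

The point to notice is that the score rewards n-gram overlap with the reference, not the usefulness of the content to whoever reads it.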
The main question still remains: what do you want to achieve with the generated text? An image caption that says, “this is an image” is an accurate caption but contains no content. A chatbot that says, “Hmm…” can accurately mimic the attitude of a person who doesn’t want to be bothered. I believe this question is not adequately answered in most research papers. We should understand what the problem is going to be used for. For example, if an image caption generation model is being used to describe images to a visually impaired person, then we must understand what kind of information that person seeks. Are they interested in hearing about the style of the image (e.g., impressionist, cubism), the colors of the image (e.g., a pale blue background with a pink tinge), or the number of people in the image? We should understand what the output of our models will be used for; merely maximizing BLEU, METEOR, etc. is not enough.
Following the Learning Reduction Agenda, one can, for example, define the evaluation metric in terms of the accuracy of a well-trained QA system at answering questions based on the generated text. If you are using a chatbot for retention, then you can use an evaluation metric given by a model that predicts retention time (the chatbot is one of those problems for which it is really hard to think of any scenario where it could be useful).
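A minimal sketch of this retention-based reduction, with `predict_retention` as a made-up stand-in for a model trained on logged retention data:

```python
def predict_retention(dialogue):
    """Stub retention model: pretends each dialogue turn is worth half a
    minute of predicted retention. A real metric would use a predictor
    trained on actual user-retention logs."""
    return 0.5 * len(dialogue)

def chatbot_score(dialogues):
    """Evaluation metric: mean predicted retention over test dialogues."""
    return sum(predict_retention(d) for d in dialogues) / len(dialogues)
```

The chatbot is then optimized against a quantity we actually care about (predicted retention), rather than n-gram overlap with reference responses.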
3: Over Reliance on “Linguistic Intuition”
Another thing one hears about is the idea of “linguistic priors” or “linguistic intuition”. Oftentimes, papers report using a “linguistic prior” as one of their main contributions and back it up by showing performance gains. An example is the use of recursive structures in deep learning (Stanford really championed this research), which was very popular in 2013–2015. It is very well known that sentences have recursive tree structures, and therefore it made “total sense” when recursive tree structures were employed for sentence representation. This was some very popular research at the time. However, it fell out of fashion later, largely due to results showing that the same gains can be obtained without recursive structures, by simply averaging.
The point where it gets slippery is that these linguistic intuitions could be wrong, or, when coupled with other ingredients (models, optimization, learning), may in fact reduce performance. I have often tried changes to my model that I expected would increase performance, only to see performance go down (from what I hear, this is a common experience; for a general discussion, see this NIPS talk by Ali Rahimi).
We should not over-rely on linguistic intuition (the failure of the linguistic agenda should be a warning in itself) and should try a broader range of solutions. After all, the current deep learning trend did not emerge out of the mainstream. Who knows, maybe the answers to NLP are hidden in some abstract mathematical objects.
4: Product-less Offline Evaluation
The field of NLP underwent a revolution with the use of datasets and rigorous evaluation. Nowadays, nearly all papers at NLP conferences either introduce a dataset or evaluate on an existing one. One can say the field is going through a dataset revolution (or bubble?). However, existing datasets have an issue: they only offer offline evaluation, i.e., they are static and not continuously changing. This is different from online evaluation, where one subjects an algorithm to live use by people across the world and computes statistics on hourly (or other timescale) performance. Most applications in industry have an online evaluation: the Amazon Echo Dot and Google Translate engines are evaluated every minute by users who can type anything based on something that just happened in the world. For example, a Korean video called “Oppa Gangnam Style” going live can spur users to try this phrase on these systems. Offline datasets just don’t allow us to make such evaluations (one can simulate an online audience, but this has other issues).
I was told by a researcher in industry that product engineers are not convinced of the utility of a research algorithm by its performance on some datasets. The gold standard in industry is A/B testing: showing improvement on an end metric (e.g., engagement, profit, or another utility) with a real audience in a real setting. Industry standards are quite high, and rightfully so! A Twitter chatbot that becomes racist can be a PR nightmare for the company (you can read more about chatbot fails here). There is too much to lose. The damage is not only to the company but also to real users. Consider the following example:
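For reference, the statistics behind a basic A/B test can be sketched with a two-proportion z-test. This is the standard textbook test, not any particular company's pipeline, and the counts below are invented for illustration.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Z-statistic comparing the success rates of variants A and B
    under a pooled-variance normal approximation."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def p_value(z):
    """Two-sided p-value of a z-statistic under the standard normal."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Example: variant B lifts engagement from 100/1000 to 150/1000 users.
z = two_proportion_ztest(100, 1000, 150, 1000)
```

If the p-value is small, the engineers conclude the new algorithm genuinely moved the end metric with real users, which is exactly the evidence an offline dataset cannot provide.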
Human: “I feel like just killing myself. No one cares about me.”
Some Conversation Engine: “Sure.”
If this exchange is part of an offline dataset, then generating “Sure” results in only a minor reduction in BLEU score, but it can abet a suicide and cause real harm in live testing. I would even argue that this error should get a score of −∞, thus giving a total score of −∞ even when every other response was generated perfectly.
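This argument can be encoded directly in an evaluation metric by gating the aggregate score on a safety check, so that a single harmful response dominates everything else. A sketch, with `is_harmful` as a hypothetical classifier supplied by the evaluator:

```python
def safety_gated_score(responses, scores, is_harmful):
    """Aggregate evaluation score: negative infinity if any response is
    harmful, otherwise the mean of the per-response scores."""
    if any(is_harmful(r) for r in responses):
        return float("-inf")
    return sum(scores) / len(scores)
```

Under this metric, no amount of fluent output elsewhere can compensate for one catastrophic response.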
While using human evaluation and presenting Likert scores is better, I think we should go the full distance. We need online evaluation, where anyone can use these systems and see the performance for themselves. Never Ending Language Learning (NELL) is one popular NLP project that does this. In robotics, I was part of a team that attempted something similar (the project is no longer active). We need more such evaluations, and I also believe that this is the only way to truly judge an NLP system.