Advanced Model Update
What We Are Trying
For our first advanced model, we decided to implement the model presented in Salesforce Research's paper, Dynamic Coattention Networks. The paper is very well written, with clear explanations, and was state-of-the-art on the SQuAD dataset when it came out. At this stage of the project, we are still trying to perfect the model's ability to find an answer within the selected passage it is fed.
As presented in the paper, there are two major components to this model:
- Coattention Encoder - An attention mechanism that accounts for the important words in the passage in light of the question, and the important words in the question in light of the passage. Intuitively, this is like reading the question first and then searching the passage for the answer by looking for words that seem particularly relevant. (A sketch of the core computation appears after this list.)
- Dynamic Decoder - An iterative approach to decoding an answer. The start and end indices are in conversation with each other as they converge to the answer span the model thinks is best. Within each iteration, the paper makes use of an LSTM, as well as a "highway maxout network" for each of the start and end indices.
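To make the encoder concrete, here is a minimal sketch of its core equations in TensorFlow. It assumes batch-first passage and question encodings (e.g., BiLSTM outputs) and omits details from the paper such as the sentinel vectors and the question projection; the tensor names are illustrative, not taken from our code.

```python
import tensorflow as tf

# Toy shapes; names are illustrative, not from our actual code.
batch, n, m, h = 4, 30, 10, 100
D = tf.random.normal((batch, n, h))   # passage encoding (e.g., BiLSTM outputs)
Q = tf.random.normal((batch, m, h))   # question encoding

# Affinity matrix: similarity between every passage word and question word.
L = tf.matmul(D, Q, transpose_b=True)              # (batch, n, m)

# Normalize the affinities both ways.
A_Q = tf.nn.softmax(L, axis=1)  # attention over passage words, per question word
A_D = tf.nn.softmax(L, axis=2)  # attention over question words, per passage word

# Passage summaries in light of each question word.
C_Q = tf.matmul(A_Q, D, transpose_a=True)          # (batch, m, h)

# Coattention: attend over the question and its passage summaries together.
C_D = tf.matmul(A_D, tf.concat([Q, C_Q], axis=2))  # (batch, n, 2h)

# In the paper, [D; C_D] is then fed through a BiLSTM to produce the
# final encoding U that the decoder consumes.
```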
What is Going Wrong
So far, we have implemented the whole model according to the paper in Tensorflow. That said, when we attempt to train the model, the loss stagnates at the same value. As beginners at deep learning, we expect this to take a good deal of debugging. For the time being, we removed the decoder and replaced it with two dense layers with a single output each, for the start and end indices (sketched below). This model was small enough to learn reasonably and produce results comparable to our other models. Moving forward, we will focus on the decoder and see if it is wired incorrectly, or subject to another bug.
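For concreteness, here is a minimal sketch of this simplified prediction head, assuming each dense layer produces one logit per passage position; shapes and variable names are illustrative rather than copied from our code.

```python
import tensorflow as tf

batch, passage_len, hidden = 4, 30, 200
U = tf.random.normal((batch, passage_len, hidden))       # encoder output
start_true = tf.constant([3, 0, 12, 7], dtype=tf.int64)  # gold start indices
end_true = tf.constant([5, 2, 14, 9], dtype=tf.int64)    # gold end indices

# Dense(1) collapses the hidden axis, leaving one logit per passage
# position; a softmax over positions then gives a distribution over
# candidate start (or end) indices.
start_logits = tf.squeeze(tf.keras.layers.Dense(1)(U), axis=-1)  # (batch, passage_len)
end_logits = tf.squeeze(tf.keras.layers.Dense(1)(U), axis=-1)

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=start_true, logits=start_logits)
    + tf.nn.sparse_softmax_cross_entropy_with_logits(labels=end_true, logits=end_logits)
)
pred_start = tf.argmax(start_logits, axis=1)  # predicted start index
pred_end = tf.argmax(end_logits, axis=1)      # predicted end index
```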
Results So Far
The table below shows Bleu and Rouge scores for our baseline and attention models from last week as well as the new coattention model.
| Metric | Baseline | Attention | Coattention |
| --- | --- | --- | --- |
| Bleu-1 | 0.466 | 0.526 | 0.506 |
| Bleu-2 | 0.446 | 0.504 | 0.486 |
| Bleu-3 | 0.432 | 0.490 | 0.472 |
| Bleu-4 | 0.420 | 0.478 | 0.461 |
| Rouge-L | 0.537 | 0.517 | 0.570 |
These scores are based on the location query type with the following hyperparameters:
- `batch_size`: 1024 (baseline), 256 (attention), 512 (coattention)
- `epochs`: 50
- `hidden_size`: 50
- `keep_prob`: 0.5
- `learning_rate`: 0.01
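As a reference point, Bleu-n scores of the kind reported above can be computed with NLTK roughly as follows. This is an illustrative snippet using one of the answers discussed below, not necessarily the exact evaluation script we used.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "tropical climates".split()
candidate = ("researchers believe that the original habitats of the honey bee "
             "are tropical climates and heavily forested areas .").split()

smooth = SmoothingFunction().method1  # avoid zero scores on short references
weights = {
    1: (1.0, 0, 0, 0),
    2: (0.5, 0.5, 0, 0),
    3: (1/3, 1/3, 1/3, 0),
    4: (0.25, 0.25, 0.25, 0.25),
}
for n, w in weights.items():
    score = sentence_bleu([reference], candidate, weights=w,
                          smoothing_function=smooth)
    print(f"Bleu-{n}: {score:.3f}")
```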
The new coattention model has better Bleu scores than the baseline, but it does not perform as well as the attention model. In Rouge-L, however, it performs significantly better than the baseline model, and Rouge-L is a weakness of the attention model.
A comparison of the baseline, attention, and coattention models is available here. The coattention model usually produces answers similar to those of the other two models, but it occasionally differs greatly. For example, for the query “where do bees usually live?”, the answers are as follows:
- Reference: “tropical climates”
- Baseline: “researchers believe that the original habitats of the honey bee are tropical climates and heavily forested areas.”
- Attention: “researchers believe that the original habitats of the honey bee are tropical climates and heavily forested areas.”
- Coattention: “honey bees can thrive in natural or domesticated environments , though they prefer to live in gardens , woodlands , orchards , meadows and other areas where flowering plants are abundant.”
All of these answers are at least somewhat correct. I think that in this case, the coattention model actually provides a better answer than the other two models, since it answers the question “where do bees usually live?” and not “where did bees originally live?”.
The coattention model also performs very well for the question “where does the main aorta run the body?”.
- Reference: “heart”
- Baseline: “heart to the rest of the body.”
- Attention: “the aorta is the main artery that carries blood from the heart to the rest of the body.”
- Coattention: “the aorta comes out from the left ventricle of the heart and travels through the chest and abdomen.”
The reference answer is terrible for this question. I think that it is actually worse than the answers of any of our models. Still, the coattention model produced a good, coherent answer to this question.
Otherwise, there does not seem to be a general pattern of errors in the coattention model. Its answers seem to be very good in general, even if its Bleu scores are not as good as the attention model’s.
Next Steps
- Debugging - The coattention mechanism seems promising. However, as we explained above, our initial implementation has some bugs. We will spend a good amount of time over the next week or two debugging our implementation. Specifically, we ideally want to fix and run our dynamic decoder implementation.
- Passage Relevance - The MS MARCO dataset provides multiple passages for each question. So far, we have been assuming we know which passage contains the answer to our question. Obviously, we can’t assume that for the test set, so we want to find a solution to this problem, and we are looking at three different approaches. The first is to simply concatenate the different passages; it is a naive approach, but an easy one to implement, and it might be worthwhile to try. The second is to build a model that takes the different passages and a question as input and decides which passage is the most relevant, and then use that passage as the input to our main reading comprehension model. The last is to make our model take all ten passages as input and attend to the different passages based on the question. We don’t know the specific details of this approach yet, but either of the last two approaches would need some type of model that assigns relevance scores to different passages based on a given question (a toy example of such a scorer follows below). So, we will be working on this for the next week or two.
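As a starting point for passage relevance, even a simple lexical scorer could serve as a baseline to compare against. Here is a toy sketch using TF-IDF cosine similarity via scikit-learn; this is a hypothetical baseline, not the model we plan to build, and the example passages are abbreviated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_relevant_passage(question, passages):
    """Rank passages by TF-IDF cosine similarity to the question."""
    vectorizer = TfidfVectorizer().fit(passages + [question])
    passage_vecs = vectorizer.transform(passages)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, passage_vecs)[0]
    return int(scores.argmax()), scores

passages = [
    "honey bees can thrive in natural or domesticated environments ...",
    "the aorta is the main artery that carries blood from the heart ...",
]
best_idx, scores = most_relevant_passage("where do bees usually live?", passages)
print(best_idx, scores)  # index of the highest-scoring passage, plus all scores
```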