Strawman I and Evaluation Framework

Training Data

To scope the baseline task into separate modules, we decided to initially build two models:

  1. A model that reads in a question and either ranks the passages by relevance or scores the relevance of each passage.
  2. A model that, given a passage and a question, finds the answer within that passage.

For our initial experiments, we chose to only use samples where the answer can be found within a single passage, without having to synthesize the answer from multiple passages or generate the language to express it. After all, there is no use generating an answer if the model does not know what to look for.

As mentioned in previous posts, we divided the MS MARCO dataset into 5 sets, one for each of the 5 answer types. This lets us initially experiment with training separate models for each answer type before attempting transfer learning or simply using a single model for any question/answer sample.
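
For concreteness, here is a rough sketch of that split (the file name and the one-JSON-object-per-line layout are assumptions about the local copy of the dataset):

import json
from collections import defaultdict

# Group samples by their query_type field, then write one file per type.
subsets = defaultdict(list)
with open("train_v1.1.json") as f:
    for line in f:                      # one JSON sample per line (assumed)
        sample = json.loads(line)
        subsets[sample["query_type"]].append(sample)

for query_type, samples in subsets.items():
    with open("train_{}.json".format(query_type), "w") as out:
        for sample in samples:
            out.write(json.dumps(sample) + "\n")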

For the baseline model discussed below, we chose to focus on the location answer type: it generally fit our initial plan of finding answers that appear verbatim in the passage, had a nice distribution of answer lengths, and was one of the smaller training subsets.

Here is an example sample to give an idea of what the dataset looks like:

{
  "passages": [
    ...
    {
      "is_selected": 0,
      "url": "http:\/\/www.webmd.com\/lung\/picture-of-the-lungs",
      "passage_text": "\u00a9 2014 WebMD, LLC. All rights reserved. The lungs are a pair of spongy, air-filled organs located on either side of the chest (thorax). The trachea (windpipe) conducts inhaled air into the lungs through its tubular branches, called bronchi. The bronchi then divide into smaller and smaller branches (bronchioles), finally becoming microscopic. "
    },
    {
      "is_selected": 1,
      "url": "https:\/\/www.healthtap.com\/topics\/where-are-lungs-located-in-your-back",
      "passage_text": "1. Get help from a doctor now \u203a. under ribs: The lungs in the front and back are inside the rib cage. This is why doctors will place a stethoscope on the back as well as the front to evaluate the function of the lungs. ...Read more. 1 Where are your kidneys located on your back in women. 2  Where are the lungs located in the back. 3  Where are lungs located in your body. 4  Ask a doctor a question free online. 5  Where are the lungs located in the human body."
    },
    {
      "is_selected": 0,
      "url": "http:\/\/www.living-with-back-pain.org\/lungs-back-pain.html",
      "passage_text": "completely unrelated to the lungs. There are many possible causes of upper back pain, but the most common cause is muscle strain. Ligament and tendon strains and sprains may also occur in the upper back. Mechanical back pain or disc herniation is possible. There are also many possible causes of lung or breathing pain, which are not associated with your back or spine. Among these would be a collapsed lung, or an infection of some type, such as pneumonia or an abscess. You may also have pleuritis or pleurisy, an inflammation of the pleura that covers"
    },
    ...
  ],
  "query_id": 19704,
  "answers": [
    "Inside the rib cage."
  ],
  "query_type": "location",
  "query": "where are the lungs located in the back"
}

Making the strong assumption that we have the relevant passage (captured in the is_selected field of the training data) and that it contains the answer verbatim, we built our baseline model to find the answer within that passage.
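
To make this concrete, here is a hedged sketch (not our exact preprocessing code) of how training labels can be derived under that assumption, by scanning the selected passage for the first token window that matches the answer verbatim:

import string

def normalize(token):
    # Lowercase and strip surrounding punctuation so "cage." matches "cage".
    return token.lower().strip(string.punctuation)

def find_answer_span(sample):
    """Return (passage_text, start_token, end_token) for the selected
    passage, or None if the answer does not appear in it verbatim."""
    passage = next((p["passage_text"] for p in sample["passages"]
                    if p["is_selected"] == 1), None)
    if passage is None or not sample["answers"]:
        return None
    tokens = [normalize(t) for t in passage.split()]
    answer = [normalize(t) for t in sample["answers"][0].split()]
    for start in range(len(tokens) - len(answer) + 1):
        if tokens[start:start + len(answer)] == answer:
            return passage, start, start + len(answer) - 1
    return None  # answer not verbatim; drop the sample for the baseline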


Evaluation Framework

For the baseline model, we are specifically focusing on two metrics: the accuracy of the pointer to the start of the answer in the passage, and the accuracy of the pointer to the end of the answer in the passage. Analyzing these two metrics gives us a good intuitive sense of how the model is behaving, but in order to truly evaluate the model, we will be using the official evaluation scripts provided with the MS MARCO dataset, available in the MS MARCO GitHub repository.
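
Both pointer metrics are simply exact-match accuracies over the validation set; a minimal sketch (the array names are illustrative):

import numpy as np

def pointer_accuracy(predicted, labels):
    """Fraction of examples whose predicted index exactly matches the label."""
    return float(np.mean(np.asarray(predicted) == np.asarray(labels)))

# The two metrics are computed independently, e.g.:
# begin_accuracy = pointer_accuracy(predicted_starts, label_starts)
# end_accuracy = pointer_accuracy(predicted_ends, label_ends)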

In order to use the scripts, we output two files from the model: references.json (the labeled answers) and candidates.json (the model's predictions). Both files follow the same schema, shown below:

{"answers": ["bodega lane , bodega , northern california ."], "query_id": 9661}
{"answers": ["south america , caribbean , and the united states ."], "query_id": 9702}
{"answers": ["in the form of a high energy phosphate bond joining the terminal phosphate group to the rest of the molecule ."], "query_id": 9705}
{"answers": ["mexico"], "query_id": 9714}
{"answers": ["the alaskan way viaduct replacement tunnel is a bored road tunnel that is under construction in the city of seattle in the u.s. state of washington ."], "query_id": 9732}
{"answers": ["the phrase what the dickens was coined by william shakespeare and originated in the merry wives of windsor act 3 , scene 2 , 18 -- 23 , it was an oath to the devil said by mrs page ."], "query_id": 9743}
{"answers": ["fargo , north dakota"], "query_id": 9757}
{"answers": ["above the eyes , in the forehead bone ."], "query_id": 9778}
{"answers": ["the parietal lobe is located near the center of the brain , behind the frontal lobe , in front of the occipital lobe , and above the temporal lobe ."], "query_id": 9816}
{"answers": ["puerto rico"], "query_id": 9825}

Running the run.sh script, found within the eval folder in the GitHub repository, will output the Bleu-1 and Rouge-L scores for the candidates when scored against the references. These are standard metrics used by machine translation and text summarization systems to assess their output. Essentially, they score how well one piece of text captures the meaning of another.


Baseline Model

Architecture

For our baseline model, we started with a very basic design. The model has two bi-directional dynamic GRU RNNs. The first RNN takes a matrix of the word embeddings of the question, and the second takes the word embeddings of the context. The output of each RNN is the concatenation of its forward and backward outputs. The outputs of the two RNNs are concatenated and fed into a third bi-directional dynamic GRU RNN, followed by a dense output layer. From the output layer we get the logits for the starting and ending indices of the answer, from which we take the arg max. All three RNNs use dropout as a regularization technique to reduce overfitting during training; the current dropout rate for all RNNs is 0.3.

baseline.py
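
For orientation, here is a minimal sketch of that wiring in TensorFlow 1.x style; the hidden size, embedding size, variable names, and the use of the final question state as the question summary are illustrative assumptions and may not match baseline.py exactly:

import tensorflow as tf

HIDDEN_SIZE = 100   # assumed hidden size
EMB_SIZE = 300      # assumed embedding size
KEEP_PROB = 0.7     # dropout rate of 0.3

def bi_gru(inputs, lengths, scope):
    """Bi-directional dynamic GRU with output dropout; returns per-step
    outputs (fw and bw concatenated) and the concatenated final states."""
    cell_fw = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.GRUCell(HIDDEN_SIZE), output_keep_prob=KEEP_PROB)
    cell_bw = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.GRUCell(HIDDEN_SIZE), output_keep_prob=KEEP_PROB)
    (out_fw, out_bw), (st_fw, st_bw) = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, sequence_length=lengths,
        dtype=tf.float32, scope=scope)
    return tf.concat([out_fw, out_bw], -1), tf.concat([st_fw, st_bw], -1)

question_emb = tf.placeholder(tf.float32, [None, None, EMB_SIZE])
passage_emb = tf.placeholder(tf.float32, [None, None, EMB_SIZE])
question_len = tf.placeholder(tf.int32, [None])
passage_len = tf.placeholder(tf.int32, [None])

_, q_state = bi_gru(question_emb, question_len, "question")
p_out, _ = bi_gru(passage_emb, passage_len, "passage")

# Broadcast the question summary across passage time steps, concatenate
# with the passage outputs, and feed the result to the third RNN.
q_tiled = tf.tile(tf.expand_dims(q_state, 1), [1, tf.shape(p_out)[1], 1])
match, _ = bi_gru(tf.concat([p_out, q_tiled], -1), passage_len, "match")

# Dense layer produces two logits (start, end) at every passage position;
# the arg max over positions gives the predicted answer span.
logits = tf.layers.dense(match, 2)                 # [batch, p_len, 2]
start_logits, end_logits = tf.unstack(logits, axis=-1)
start_idx = tf.argmax(start_logits, axis=1)        # predicted start pointer
end_idx = tf.argmax(end_logits, axis=1)            # predicted end pointer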

Results

Using the location dataset and the baseline model discussed above, we were able to train a model and test it on a validation dataset. We ran 20 epochs (constrained by limited computational resources) and got the following results:

Metric           Result
begin accuracy   0.36619718309859156
end accuracy     0.4422535211267606
bleu_1           0.462772785622494
bleu_2           0.43903421091500255
bleu_3           0.4233918676482152
bleu_4           0.41101951209884013
rouge_l          0.488624329997

Experiments in Progress (Passage Relevance)

Each query has about ten passages associated with it, of which only one or two contain information that is relevant to the answer. For our initial model, we only consider these relevant passages. We plan on training another model to gauge how relevant each passage is. If we can assess passage relevance with high accuracy, then we can combine the two models, as sketched below. If this does not work well, then another option is to implement some sort of attention mechanism in the main model.
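
A rough sketch of the planned combination, with hypothetical interfaces for both models:

def answer_question(query, passages, relevance_model, span_model):
    """Score every passage for relevance, then run the span model on the
    most relevant one. relevance_model and span_model are placeholders
    for the two models described above; their interfaces are assumptions."""
    scores = [relevance_model.score(query, p["passage_text"])
              for p in passages]
    best = passages[scores.index(max(scores))]["passage_text"]
    start, end = span_model.predict(query, best)
    tokens = best.split()
    return " ".join(tokens[start:end + 1])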


