Formal Project Proposal
MS MARCO Dataset
The MS MARCO (Microsoft MAchine Reading COmprehension) dataset is “A Reading Comprehension Dataset for the Artificial Intelligence research community.” It contains approximately 100,000 real-world Bing queries along with human generated answers.
MS MARCO is split into training, development, and test sets at an approximate ratio of 80-10-10. There are 102,023 documents in total, split as follows.
- train: 82,326
- test: 9,650
- dev: 10,047
Each document contains a query as well as a selection of passages related to the query. Passages are extracted from Bing results for the given query. There are between 1 and 12 passages for each query, with an average of 8.2. Passages are between 19 and 1,167 characters in length, with an average of 422.
The first two passages for the query “how much do bartenders make” are as follows:
A bartender’s income is comprised mostly of tips, 55% to be exact. In some states, employers aren’t even required to pay their bartenders the minimum wage and can pay as low as $2.13 per hour, and they depend on their tips almost entirely. Bartending can be a lot of things. For some it is exciting, for others exhausting. At times there is a lot of fun to be had, at others it is rather dull. But for the most part, bartending is almost always rewarding in the financial sense, as long as you stick with it.
According to the Bureau of Labor Statistics, the average hourly wage for a bartender is $10.36, and the average yearly take-home is $21,550. Bartending can be a lot of things. For some it is exciting, for others exhausting. At times there is a lot of fun to be had, at others it is rather dull. But for the most part, bartending is almost always rewarding in the financial sense, as long as you stick with it.
http://www.breakintobartending.com/how-much-do-bartenders-make/
These passages come from the same source and actually share the same ending, which is actually the introduction of the page that the passages come from.
There are two answers for this query, which both come from the second passage:
- “$21,550 per year”,
- “The average hourly wage for a bartender is $10.36 and the average yearly take-home is $21,550.”
The second answer appears directly in the text, whereas the first is a rephrasing of the text. Since answers are human generated, they may restate information in ways that a machine would struggle with. This is what makes the MS MARCO dataset more challenging than its predecessors.
Queries are separated into 5 types with the following counts:
- description: 55,684
- numeric: 28,291
- entity: 10,485
- location: 5,068
- person: 2,495
Below are some example queries and their answers grouped by query type.
- description
- Q: what is a furuncle boil? A: A boil, also called a furuncle, is a deep folliculitis, infection of the hair follicle.It is most commonly caused by infection by the bacterium Staphylococcus aureus.
- Q: what can urinalysis detect? A: Detect and assess a wide range of disorders, such as urinary tract infection, kidney disease and diabetes.
- Q: what is vitamin a used for? A: Shigellosis, diseases of the nervous system, nose infections, loss of sense of smell, asthma, persistent headaches, kidney stones, overactive thyroid, iron-poor blood (anemia), deafness, ringing in the ears, and precancerous mouth sores.
- numeric
- Q: walgreens store sales average? A: Approximately $15,000 per year.
- Q: how much do bartenders make? A: $21,550 per year
- Q: cost to frame basement? A: $2.51 - $3.17 per square foot
- entity
- Q: what is a hummingbird moth? A: The hummingbird moth is an enchanting insect. Many mistake it for a hummingbird, it is that charming! On summer evenings, my husband and I sit outside in front of our flower garden watching hummingbirds.
- Q: what is baseball twine made of? A: Yarn
- Q: what kind of tick has rocky mountain spotted fever? A: An infected American dog tick
- location
- Q: where was movie the birds filmed? A: Bodega Lane, Bodega, Northern California.
- Q: where do roseate spoonbills live? A: South America, Caribbean, and the United States.
- Q: where is energy located in the atp molecule? A: In the form of a high energy phosphate bond joining the terminal phosphate group to the rest of the molecule.
- person
- Q: who is discovered silk? A: Chinese empress and wife of the Yellow Emperor.
- Q: who was philip kiriakis mother on days? A: Kate Roberts
- Q: cat named as director of company? A: Bossy the cat
Description queries require complex answers that synthesize information from multiple passages. The other query types are much more straightforward and can generally be answered by selecting the relevant text from the passage.
Project Objectives
- MVP: A basic GRU RNN with some attention mechanism trained on the MS MARCO dataset to answer questions.
- Stretch Goal: Add more advanced NLP techniques to claim a high rank in the MS MARCO leaderboard, and perhaps publish a preprint in arXiv.
Related Work
Shen et al. ‘16 achieved state-of-the-art performance in multiple reading comprehension datasets and performed well in the MS MARCO dataset. ReasoNet, their model, uses a multi-turn process, in which it repeatedly process the context and the question after digesting intermediate information, with different attention weights in each iteration. Results show that multi-turn reasoning have generally outperformed single-turn reasoning, which basically utilizes an attention mechanisms to emphasize specific parts of the context that are relevant to the question. Moreover, ReasoNet Uses reinforcement learning to dynamically decide when to terminate the inference process in reading comprehension tasks.
Wang et al. ‘16 currently have the best performance in the MS MARCO leader board. Their model, Prediction, uses an end-to-end neural network based on match-LSTM and Pointer Net. match-LSTM is used to predict textual entailment, where given two sentences, a premise and a hypothesis, it predicts whether the premise entails the hypothesis. Pointer Net generates an output sequence whose tokens come from the input. Wang et al. ‘16 propose two ways to use Pointer Net: a sequence model, which outputs pointers to the tokens of the answer in the context, and a boundaries model, which outputs pointers boundaries (first and last tokens) of the answer in the context.
Proposed Methodologies
Below we outline a rough plan in order to provide checkpoints that display tangible progress with a RNN model trained and used on the the MS MARCO dataset.
- Preprocess data, splitting it into 5 sets, one for each of the answer types within the MS MARCO dataset.
- Choose one of the answer types to start training a model with. This is in order to reduce the time spent training the model, and to get a baseline implementation working.
- Implement a simple RNN, either LSTM or GRU, that can take in passages, read the question, and produce the location of the answer within the passages.
- Add an attention mechanism to the RNN.
- Experiment with more advanced attention mechanisms and RNN architectures.
- Experiment with extending the model to multiple of the answer types. This may involve transfer learning or training an initial model or all answer types, and composing it with trained model made specifically with each answer type.
- Begin to synthesize multiple pieces of an answer into a single, succinct answer.
- Explore using match-LSTM and PtrNet
- Explore adding reinforcement learning as in ReasoNet
Available Resources
- Software Packages
- Tensorflow
- word2vec
- GloVe
- The Stanford Parser
- NLTK
- Papers
Evaluation Plan
Luckily, Microsoft has provided scripts on the MS MARCO submission page that we will use to evaluate our model. We may have to edit the scripts if we only end up using a subset of the MS MARCO dataset. The produced Bleu-1 and Rouge-L will still be comparable to numbers on their leaderboard, with the exception that ours may only be based on a subset of the dataset versus it in its entirety.