Project Ideas
MrKnowItAll - TJ Gilbrough, Omar Alhadlaq, Lane Aasen
We will be most likely operate in the research mode, unless we choose the IMDb idea, in which case it would make more sense to go the startup route.
Reading Comprehension
- Description: Microsoft has recently released a new large scale dataset for reading comprehension and question answering called MS MARCO. We are very interested in information extraction and question answering, so we would like to work on training a model that would perform well on the dataset. We think this project will further our knowledge of RNN’s, attention, deep memory, and other NLP techniques.
- MVP: A basic GRU RNN with some attention mechanism trained on the MS MARCO dataset to answer questions.
- Stretch Goal: Add more advanced NLP techniques to claim a high rank in the MS MARCO leaderboard, and perhaps publish a preprint in arXiv.
- References: http://www.msmarco.org
Language Detection
- Description: Wikipedia provides millions of articles in many different languages which could be used to train a language detection system.
- MVP: High accuracy on entire Wikipedia articles. This could be done with a very simple model. A harder goal would be to achieve comparable accuracy to Google Translate on shorter snippets of text.
- Stretch Goal: Find articles about the same topic in different languages using only the text of the article.
IMDb Regression Model
- Description: Given movie scripts, train a model to produce the IMDb user ratings for each movie. Consequently, writers could judge the success of their movie before going through production.
- MVP: Average difference in model’s outputted score and real user score for each movie should be less than 1.0.
- Stretch Goal: Extend the model to work on TV shows, that can account for previous episodes in the series. Another domain this could be used in would be lyrics to songs.