Video question answering aims to pinpoint answers to user-specified questions. However, most question answering technologies rely on integrating rich, specific external knowledge, such as syntactic parsers, which are often unavailable for m