Survey on Community Question Answering System to Solve the Lexical Gap

B. Deeppikaa (1), A. Geetha (2), S. Sri Heera (3)
(1) PG Student/CSE, Easwari Engineering College, Tamil Nadu, Chennai.
(2) Assistant Professor/CSE, Easwari Engineering College, Tamil Nadu, Chennai.
(3) Assistant Professor/CSE, Easwari Engineering College, Tamil Nadu, Chennai.

Abstract: Web search engines give a ranked list of related documents based on the user's keywords. The ranking depends on aspects such as popularity measures, keyword match and frequency of document access, and users have to check each document to find the desired information, which makes information retrieval a prolonged process. Community Question Answering (CQA) systems aim to deliver short and precise answers instead of irrelevant documents. CQA is a specialized information retrieval application that can retrieve the right answers to questions posed in natural language. Natural Language Processing (NLP) techniques are used to process a question; the system then searches for the information the question requires in order to determine the answer accurately. This survey mainly focuses on different approaches to solving the problems that arise from the lexical gap and to ranking accurate answers on question answering sites such as community question answering websites.

Keywords: Community Question Answering system, Information Retrieval, Lexical Gap, Natural Language Processing.

1. INTRODUCTION

In recent years, a large amount of storage has been devoted to archived web pages from which vital information can be retrieved, mainly on sites such as traditional Frequently Asked Questions (FAQ) archives and emerging Community Question Answering (CQA) services such as Yahoo! Answers, Live QnA, HealthTap and Zhihu.com. The content of these websites is usually organized as questions and answers with associated metadata: askers categorize their questions and respondents reply with the best answers. This makes CQA archives valuable resources for tasks such as question answering and knowledge mining. One fundamental task for reusing the content in CQA is to find similar questions for newly queried questions, as questions are the keys to accessing the knowledge in CQA.
Then, the best answers to those similar questions are used to answer the queried questions; this task is called question retrieval.

Fig 1. Question answering system

1.1 QUESTION ANSWERING TYPES

· Question classification: Open-domain QAS deals with questions about nearly everything, whereas closed-domain QAS deals with questions in a specific domain.
· Data source classification: Structured data resides in relational databases, whereas unstructured data consists of documents and web pages on the Internet.
· Answer classification: An extracted answer lies directly in the database, whereas a generated answer needs to be formulated from the retrieved data.

Fig 2. Architecture of question answering system

· Question processing analyzes the question (for example, sentence tagging and sentiment analysis) and classifies it to determine its intention. It also reformulates the question so that the document processing module can handle it.
· Document processing, in the Information Retrieval (IR) module, retrieves the most relevant data from the system.
· Answer processing identifies useful information in the retrieved documents, ranks the answers if there are several, and returns the most relevant answer [4].

1.2 QUESTION CLASSIFICATION AND ANSWER

In question processing, the system should first analyze the type of question. Questions can be classified into two categories:

· Questions with 'WH' question words such as what, where, who, whom, which, how and why.
· Questions with 'modal' or 'auxiliary' verbs, whose answers are Yes/No [4].
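The two categories above can be illustrated with a small rule-based sketch. This is only a minimal illustration, not the classifier used by any of the surveyed systems: the function name `classify_question` and both word lists are assumptions made for the example ("when" is added to the WH list beyond the words named in the text), and a real system would need a fuller lexicon and proper parsing.

```python
import re

# Assumed word lists for illustration only; not taken from any surveyed system.
WH_WORDS = {"what", "where", "who", "whom", "which", "how", "why", "when"}
AUX_OR_MODAL = {
    "is", "are", "was", "were", "do", "does", "did", "has", "have", "had",
    "can", "could", "will", "would", "shall", "should", "may", "might", "must",
}


def classify_question(question: str) -> str:
    """Return 'wh', 'yes_no' or 'unknown' for a natural language question."""
    tokens = re.findall(r"[a-z']+", question.lower())
    if not tokens:
        return "unknown"
    first = tokens[0]
    if first in WH_WORDS:
        return "wh"
    if first in AUX_OR_MODAL:
        # A leading auxiliary or modal verb signals a Yes/No question.
        return "yes_no"
    # A WH word may follow a preposition, as in "In which city ...".
    if any(token in WH_WORDS for token in tokens):
        return "wh"
    return "unknown"


if __name__ == "__main__":
    for q in ("Where is the library?", "Can I park here?", "In which year was it founded?"):
        print(q, "->", classify_question(q))
```

The leading word is checked first because a Yes/No question such as "Is this what you meant?" can still contain a WH word later in the sentence.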
Table 1 shows question words, types of questions and answers.

Table 1. Question classification and answer

1.3 APPROACHES IN QUESTION ANSWERING SYSTEMS

· Linguistic Approach
The linguistic approach relies on understanding natural language text together with linguistic and common knowledge. Linguistic techniques such as tokenization, POS tagging and parsing are applied to the user's question to formulate it into a precise query that extracts the corresponding response from a structured database.

· Statistical Approach
The availability of huge amounts of data on the Internet has increased the importance of statistical approaches. Statistical learning methods are reported to give better results than the other approaches.
Online text repositories and statistical approaches are independent of structured query languages and can formulate queries in natural language form. Most statistical QA systems apply techniques such as support vector machine classifiers, Bayesian classifiers and maximum entropy models.

· Pattern Matching Approach
The pattern matching approach exploits the expressive power of text patterns; it replaces the sophisticated processing involved in other computing approaches. Most pattern matching QA systems use surface text patterns, while some also rely on templates for response generation.

· Deep Learning Approach
Most deep learning methods are used to implement one or more components of a QAS, such as question classification or sentence selection. They convert natural language into a computable form, e.g. using word embeddings, or use a neural language model, e.g. an RNN / LSTM or C.

2. RESEARCH BACKGROUND

2.1 COMMUNITY QUESTION ANSWERING

· Community Question Answering (CQA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions posed by humans in natural language. A CQA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, CQA systems pull answers from an unstructured collection of natural language documents. Table 2 shows the comparison between CQA and QA.
· In many cases, however, community-generated content may not be directly usable because of the vocabulary gap. Users with diverse backgrounds do not necessarily share the same vocabulary.
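The effect of the vocabulary gap on keyword matching can be made concrete with a small sketch. It measures word overlap (Jaccard similarity) between two questions that ask the same thing in different words. The stopword list, the function names and the example questions are all assumptions made up for this illustration; they are not drawn from any CQA archive.

```python
import re

# Small assumed stopword list, for illustration only.
STOPWORDS = {
    "a", "an", "the", "is", "are", "to", "of", "in", "on", "for",
    "how", "do", "i", "my", "what", "can", "it", "and", "be",
}


def content_words(text: str) -> set:
    """Lowercase the text, tokenize it and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def jaccard(question_a: str, question_b: str) -> float:
    """Share of content words that the two questions have in common."""
    words_a, words_b = content_words(question_a), content_words(question_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    q1 = "How do I fix a laptop that will not boot?"
    q2 = "My notebook fails to start up, what can be done?"
    # Same information need, but no shared content words.
    print(jaccard(q1, q2))  # 0.0
```

A retrieval model that relies only on shared words would therefore miss the second question entirely, which is the problem the approaches surveyed later try to bridge.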
On Stack Overflow, a technical question answering site, users ask technical questions that are answered by technical experts. Different users ask differently worded questions that call for similar answers, so a gap exists, either syntactic or semantic, between what is asked and what is answered; this gap is the lexical gap.

2.2 NATURAL LANGUAGE PROCESSING (NLP)

· Natural language processing can be defined as the ability of a machine to analyse, understand, and generate human speech. The goal of NLP is to make interactions between computers and humans feel exactly like interactions between humans [1]. A CQA system takes a natural language question as input, converts the question into a query and forwards it to the next module. When a set of required documents has been retrieved, the CQA system extracts an accurate answer to the user's question [5].
· Deep Analytics: Deep analytics involves the application of advanced data processing techniques in order to extract specific information from large or multi-source data sets. Deep analytics is often used in the financial sector, the scientific community, the pharmaceutical sector, and biomedical industries. Increasingly, however, deep analysis is also being used by organizations and companies interested in mining data of business value from expansive sets of consumer data.
· Machine Translation: Natural language processing is increasingly being used for machine translation programs, in which one human language is automatically translated into another human language.

2.3 INFORMATION RETRIEVAL

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines.
In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching.

2.4 LEXICAL GAP

Lexical gaps are instances of lack of lexicalization detected in a language while comparing two languages, or in a target language during translation.
Although the problem seems to be minor and clear, one gets rather the opposite impression after an excursion through the linguistic literature on lexical gaps. Different living environments and experiences mean that some lexical meanings in one language do not exist in another language. Lexical gaps are filled by means of hypernyms or by antonymous expressions.

2.5 RANK BASED ANSWER GENERATOR

In rank-based answer generation, users post their exact questions and obtain answers from other users. This tends to yield better and more accurate answers than those found through search engines. An enormous number of QA pairs have been gathered in CQA repositories, which also preserve the answers for later searches. The Support Vector Machine algorithm, in its ranking form, serves as a retrieval function that uses pairwise ranking to sort results by how relevant they are to a particular user query.

3. RELATED WORK ON COMMUNITY QUESTION ANSWERING

3.1 Al-Harbi O., Jusoh S., and Norwawi N. (2011) [1]
This work resolves the lexical ambiguity problem by combining two kinds of knowledge, context knowledge and ontology-of-concepts knowledge of the domain of interest, within shallow natural language processing (SNLP). The combination is used to decide the most probable meaning of a word. It does not resolve syntactic ambiguity in natural language questions.

3.2 Bernhard D., and Gurevych I. (2009) [2]
Monolingual translation probabilities have recently been introduced in retrieval models to solve the lexical gap problem. The work evaluates three datasets for training statistical word translation models for use in answer finding: question-answer pairs, manually tagged question reformulations, and glosses for the same term extracted from several lexical semantic resources. The existing system lacks question analysis that automatically identifies question topic and question focus.

3.3 Guangyou Zhou, Zhiwen Xie, Tingting He (2016) [3]
State-of-the-art approaches address these issues by implicitly expanding the queried questions with additional words or phrases using monolingual translation models. This work addresses the task of question retrieval in CQA and represents a question as a Bag-of-Embedded-Words (BoEW) in a continuous space.
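As a rough illustration of why a continuous-space representation can match questions that share no words, the sketch below averages word vectors and compares questions by cosine similarity. Plain averaging is swapped in here for simplicity and is not the model of the cited work. The three-dimensional vectors are hand-picked toy values, not trained embeddings, and the names `TOY_EMBEDDINGS`, `embed` and `similarity` are hypothetical.

```python
import math

# Toy vectors chosen by hand for illustration only; NOT trained embeddings.
TOY_EMBEDDINGS = {
    "laptop": (0.9, 0.1, 0.0),
    "notebook": (0.8, 0.2, 0.0),
    "boot": (0.1, 0.9, 0.0),
    "start": (0.2, 0.8, 0.0),
    "recipe": (0.0, 0.0, 1.0),
    "pasta": (0.0, 0.1, 0.9),
}


def embed(words):
    """Average the vectors of the known words; None if no word is known."""
    vectors = [TOY_EMBEDDINGS[w] for w in words if w in TOY_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(words_a, words_b):
    """Similarity of two questions given as lists of content words."""
    u, v = embed(words_a), embed(words_b)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


if __name__ == "__main__":
    print(similarity(["laptop", "boot"], ["notebook", "start"]))  # high: related words
    print(similarity(["laptop", "boot"], ["pasta", "recipe"]))    # low: unrelated topic
```

With these toy values the first pair scores close to 1 even though the two questions have no word in common, while the unrelated pair scores close to 0.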
The existing system lacks the pairs needed to learn the various translation models that bridge the lexical gap.

3.4 Qiu X., Huang X. (2015) [6]
This work proposes a convolutional neural tensor network architecture that encodes sentences in a semantic space and models their interactions with a tensor layer, which also helps to learn better word embeddings.
The existing system cannot efficiently detect local reuses at the semantic level for large-scale problems.

3.5 Shen Y., Rong W., Jiang N., Peng B., Tang J., and Xiong Z. (2017) [7]
A Word Embedding based Correlation (WEC) model is proposed that integrates the advantages of both the translation model and word embedding. It leverages the continuity and smoothness of continuous-space word representations to deal with new pairs of words that are rare in the training parallel text. Further work is needed on question-question matching tasks and multi-language question retrieval tasks.
It does not solve parallel detection problems.

3.6 Wei-Nan Zhang, Zhao Yan Ming, Yu Zhang (2016) [8]
This work explores a key concept identification approach for query refinement and a pivot language translation based approach for key concept paraphrasing.
These word embedding models contribute the most to performance. The existing system generates noise samples for each input word to estimate the target word, which causes inefficiency.

3.7 Zhang K., Wu W., Wu H., Li Z., Zhou M. (2014) [9]
They are heterogeneous at both the literal level and the level of user behaviours. The authors conduct a series of experiments to evaluate their proposed approaches automatically on large-scale data sets.
The existing system cannot be directly used for large-scale problems.

3.8 Zhou G., Chen Y., Zeng D., and Zhao J. (2013) [11]
A novel Question-Answer Topic Model (QATM) learns the latent topics aligned across question-answer pairs to alleviate the lexical gap problem. A faster and better retrieval model for question search is obtained by leveraging the user-chosen category.
The existing system does not capture the localness and hierarchy intrinsic to natural language problems.

3.9 Zhou G., He T., Zhao J., Hu P. (2015) [12]
The Fisher kernel framework is used to aggregate the word embeddings of a question into a fixed-length vector.
The metadata of category information benefits word embedding learning for question representation. The existing system has problems in several respects, such as extraction methods with or without linguistic knowledge.

3.10 Zhou G., Cai L., Zhao J., and Liu K. (2011) [10]
A machine learning algorithm that aims to predict a ranking among all the possible labels is used to perform question classification. The training process does not need much training data, which is always expensive to obtain in CQA services. Combining bilingual translation or category information can give better question retrieval. Sometimes, when a query is searched in these services, the system reports that it cannot find any results.

4. COMPARISON TABLE

Table 2. Comparison of CQA and QA

5. PROPOSED WORK

We present a novel approach to determine accurate answers in CQA based on user questions instead of retrieving entire documents. It is divided into an offline learning component and an online search component.
In the offline learning component, instead of relying on a prolonged process, we construct positive, neutral, and negative training samples in the form of preference pairs. We then propose a robust pairwise learning-to-rank model that incorporates these three types of training samples. In the online search component, for a given question, we first collect a pool of answer candidates by finding its similar questions. We then employ the offline-learned model to rank the answer candidates via pairwise comparison.

6. CONCLUSION AND FUTURE WORK

· The main goal of a community question answering system is to retrieve answers to questions rather than full documents, as most information retrieval systems do. We therefore analyzed the different approaches to solving the problems that arise from the lexical gap and to ranking accurate answers to the user's question instead of retrieving the entire document. The lexical gap is a complex phenomenon: obtaining exact answers depends on many factors, including classification and translation strategies.
· Future work is directed towards multilingual, multi-knowledge-source CQA systems that are capable of understanding noisy, human natural language input, and towards overcoming the problems of pairwise ranking in CQA, such as noise sensitivity, large-scale preference pairs, and loss of information about finer granularity.

REFERENCES

[1] Al-Harbi O., Jusoh S., and Norwawi N. (2011), 'Lexical Disambiguation in Natural Language Questions (NLQs)', in IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 4, No 2.
[2] Bernhard D., and Gurevych I. (2009), 'Combining lexical semantic resources with question & answer archives for translation-based answer finding', in Proceedings of the ACL, pp. 728–736.
[3] Guangyou Zhou, Zhiwen Xie, Tingting He (2016), 'Question-answer topic model for question retrieval in community question answering', in IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 24, Issue: 7).
[4] Pawan Kumar, RajKumar Goel, and Prem Sagar Sharma (2014), 'A New Architecture of Automatic Question Answering System using Ontology', in International Journal of Computer Applications (0975 – 8887), Volume 97 – No. 20.
[5] Poonam Gupta and Vishal Gupta (2012), 'A Survey of Text Question Answering Techniques', in International Journal of Computer Applications (0975 – 8887), Volume 53 – No. 4.
[6] Qiu X., Huang X. (2015), 'Convolutional neural tensor network architecture for community-based question answering', in IJCAI Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence.
[7] Shen Y., Rong W., Jiang N., Peng B., Tang J., and Xiong Z. (2017), 'Word Embedding Based Correlation Model for Question/Answer Matching', in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
[8] Wei-Nan Zhang, Zhao Yan Ming, Yu Zhang (2016), 'Capturing the Semantics Of Key Phrases Using Multiple Languages For Question Retrieval', in IEEE Transactions on Knowledge and Data Engineering (Volume: 28, Issue: 4).
[9] Zhang K., Wu W., Wu H., Li Z., Zhou M. (2014), 'Question Retrieval With High Quality Answers In Community Question Answering', Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 371-380.
[10] Zhou G., Cai L., Zhao J., and Liu K. (2011), 'Phrase-based translation model for question retrieval in community question answer archives', in Proceedings of the ACL, pp. 653–662.
[11] Zhou G., Chen Y., Zeng D., and Zhao J. (2013), 'Towards Faster And Better Retrieval Models For Question Search', CIKM Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2139-2148.
[12] Zhou G., He T., Zhao J., Hu P. G. (2015), 'Learning Continuous Word Embedding With Metadata For Question Retrieval In Community Question Answering', Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 250–259.