Showing papers in "arXiv: Information Retrieval in 2014"

Posted Content•

Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks

Yuyu Zhang¹, Hanjun Dai², Chang Xu³, Jun Feng⁴, Taifeng Wang⁵, Jiang Bian⁵, Bin Wang¹, Tie-Yan Liu⁵•Institutions (5)

Chinese Academy of Sciences¹, Fudan University², Nankai University³, Tsinghua University⁴, Microsoft⁵

23 Apr 2014-arXiv: Information Retrieval

TL;DR: This paper introduces a novel framework based on Recurrent Neural Networks (RNN) that models the dependency on a user's sequential behaviors directly in the click prediction process through the recurrent structure of the RNN.

Abstract: Click prediction is one of the fundamental problems in sponsored search. Most existing studies use machine learning approaches to predict ad clicks for each ad-view event independently. However, as observed in real-world sponsored search systems, a user's behavior on ads depends heavily on how the user behaved in the past, especially in terms of what queries she submitted, what ads she clicked or ignored, and how long she spent on the landing pages of clicked ads. Inspired by these observations, we introduce a novel framework based on Recurrent Neural Networks (RNN). Compared to traditional methods, this framework directly models the dependency on a user's sequential behaviors in the click prediction process through the recurrent structure of the RNN. Large-scale evaluations on the click-through logs from a commercial search engine demonstrate that our approach can significantly improve click prediction accuracy compared to sequence-independent approaches.

296 citations

Posted Content•

An Information Retrieval Approach to Short Text Conversation

Zongcheng Ji, Zhengdong Lu, Hang Li

29 Aug 2014-arXiv: Information Retrieval

TL;DR: This paper proposes formalizing short text conversation as a search problem at the first step, and employing state-of-the-art information retrieval techniques to carry out the task, investigating the significance as well as the limitation of the IR approach.

Abstract: Human-computer conversation is regarded as one of the most difficult problems in artificial intelligence. In this paper, we address one of its key sub-problems, referred to as short text conversation, in which, given a message from a human, the computer returns a reasonable response to the message. We leverage the vast amount of short conversation data available on social media to study the issue. We propose formalizing short text conversation as a search problem at the first step, and employing state-of-the-art information retrieval (IR) techniques to carry out the task. We investigate the significance as well as the limitations of the IR approach. Our experiments demonstrate that the retrieval-based model can make the system behave rather "intelligently" when combined with a huge repository of conversation data from social media.

245 citations

Posted Content•

Machine learning approach for text and document mining

Vishwanath Bijalwan, Pinki Kumari, Jordan Pascual, Vijay Bhaskar Semwal

06 Jun 2014-arXiv: Information Retrieval

TL;DR: This paper categorizes documents with a KNN-based machine learning approach and then returns the most relevant documents.

Abstract: Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. If a document belongs to exactly one of the categories, it is a single-label classification task; otherwise, it is a multi-label classification task. TC uses several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in recent years from both academic researchers and industry developers. In this paper, we first categorize the documents using a KNN-based machine learning approach and then return the most relevant documents.

172 citations
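
The KNN-based categorization described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the cosine similarity over raw term counts, the majority vote, and the toy `TRAIN` data are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-count vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, labelled_docs, k=3):
    # Rank the labelled documents by similarity to the query,
    # then take a majority vote over the k most similar ones.
    q = Counter(tokenize(query))
    scored = sorted(
        ((cosine(q, Counter(tokenize(text))), label) for text, label in labelled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set, for illustration only.
TRAIN = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("player scores winning goal", "sport"),
]

print(knn_classify("football team scores goal", TRAIN, k=3))
```

The same similarity ranking could also be used to return the most relevant documents within the predicted category, which is the second step the abstract mentions.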

Posted Content•

Real-Time Classification of Twitter Trends

Arkaitz Zubiaga¹, Damiano Spina², Raquel Suriá Martínez², Víctor Fresno²•Institutions (2)

Dublin Institute of Technology¹, National University of Distance Education²

06 Mar 2014-arXiv: Information Retrieval

TL;DR: This work explores the types of triggers that spark trends on Twitter, introducing a typology with the following 4 types: news, ongoing events, memes, and commemoratives, and provides an efficient way to accurately categorize trending topics without need of external data.

Abstract: Social media users give rise to social trends as they share about common interests, which can be triggered by different reasons. In this work, we explore the types of triggers that spark trends on Twitter, introducing a typology with the following four types: 'news', 'ongoing events', 'memes', and 'commemoratives'. While previous research has analyzed trending topics over the long term, we look at the earliest tweets that produce a trend, with the aim of categorizing trends early on. This would make it possible to provide a filtered subset of trends to end users. We analyze and experiment with a set of straightforward language-independent features based on the social spread of trends to categorize them into the introduced typology. Our method provides an efficient way to accurately categorize trending topics without the need for external data, enabling news organizations to discover breaking news in real time, or to quickly identify viral memes that might enrich marketing decisions, among others. The analysis of social features also reveals patterns associated with each type of trend, such as tweets about ongoing events being shorter, as many were likely sent from mobile devices, or memes having more retweets originating from a few trend-setters.

118 citations

Posted Content•

Semantic Modelling with Long-Short-Term Memory for Information Retrieval.

Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, Rabab K. Ward

20 Dec 2014-arXiv: Information Retrieval

TL;DR: Experimental evaluation on an IR task derived from the Bing web search demonstrates the ability of the proposed method in addressing both lexical mismatch and long-term context modelling issues, thereby, significantly outperforming existing state of the art methods for web document retrieval task.

Abstract: In this paper we address the following problem in web document and information retrieval (IR): How can we use long-term context information to gain better IR performance? Unlike common IR methods that use a bag-of-words representation for queries and documents, we treat them as sequences of words and use long short-term memory (LSTM) to capture contextual dependencies. To the best of our knowledge, this is the first time that LSTM is applied to information retrieval tasks. The training strategy differs from that of traditional LSTMs because of the special nature of the information retrieval problem. Experimental evaluation on an IR task derived from the Bing web search demonstrates the ability of the proposed method to address both lexical mismatch and long-term context modelling issues, thereby significantly outperforming existing state-of-the-art methods for the web document retrieval task.

72 citations

Posted Content•

Hete-CF: Social-Based Collaborative Filtering Recommendation using Heterogeneous Relations

Chen Luo, Wei Pang, Zhe Wang

24 Dec 2014-arXiv: Information Retrieval

TL;DR: Hete-CF is a social collaborative filtering algorithm using heterogeneous relations; it can effectively utilize multiple types of relations in a heterogeneous social network and can be used in arbitrary social networks.

Abstract: Collaborative filtering algorithms have been widely used in recommender systems. However, they often suffer from the data sparsity and cold start problems. With the increasing popularity of social media, these problems may be solved by using social-based recommendation. Social-based recommendation, as an emerging research area, uses social information to help mitigate the data sparsity and cold start problems, and it has been demonstrated that social-based recommendation algorithms can efficiently improve recommendation performance. However, few of the existing algorithms have considered using multiple types of relations within one social network. In this paper, we investigate social-based recommendation algorithms on heterogeneous social networks and propose Hete-CF, a Social Collaborative Filtering algorithm using heterogeneous relations. Distinct from the existing methods, Hete-CF can effectively utilize multiple types of relations in a heterogeneous social network. In addition, Hete-CF is a general approach and can be used in arbitrary social networks, including event-based social networks, location-based social networks, and any other types of heterogeneous information networks associated with social information. The experimental results on two real-world data sets, DBLP (a typical heterogeneous information network) and Meetup (a typical event-based social network), show the effectiveness and efficiency of our algorithm.

68 citations

Journal Article•DOI•

SIMD Compression and the Intersection of Sorted Integers

Daniel Lemire¹, Leonid Boytsov², Nathan Kurz•Institutions (2)

Télé-université¹, Carnegie Mellon University²

24 Jan 2014-arXiv: Information Retrieval

TL;DR: The S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression.

Abstract: Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Galloping algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.

66 citations
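
To make the intersection step mentioned in the abstract above more concrete, here is a scalar sketch of galloping (exponential) search intersection of two sorted lists. It is only an illustration of the general galloping idea: the paper's SIMD Galloping algorithm is vectorized and differs in detail, and the function name `galloping_intersect` is hypothetical.

```python
from bisect import bisect_left


def galloping_intersect(small, large):
    """Intersect two sorted lists of distinct integers.

    Intended for the case where `small` is much shorter than `large`:
    for each value of `small`, the search position in `large` advances
    in doubling steps before a binary search pins down the exact spot.
    """
    result = []
    n = len(large)
    lo = 0  # every element of large before index lo is smaller than the current value
    for x in small:
        # Gallop: double the step until we overshoot x or run off the end.
        step = 1
        while lo + step < n and large[lo + step] < x:
            step *= 2
        # The first element >= x, if any, now lies in large[lo : lo + step + 1].
        pos = bisect_left(large, x, lo, min(lo + step + 1, n))
        if pos < n and large[pos] == x:
            result.append(x)
        lo = pos
        if lo >= n:
            break  # large is exhausted, no further matches are possible
    return result


evens = list(range(0, 100, 2))
print(galloping_intersect([2, 4, 9, 98, 200], evens))
```

Galloping pays off when one posting list is far shorter than the other, because the cost per element of the short list grows only logarithmically with the distance skipped in the long list.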

Posted Content•

Evaluating the retrieval effectiveness of Web search engines using a representative query sample

Dirk Lewandowski¹•Institutions (1)

Hamburg University of Applied Sciences¹

09 May 2014-arXiv: Information Retrieval

TL;DR: This study takes a random representative sample of 1,000 informational and 1,000 navigational queries from a major German search engine and compares Google's and Bing's results on it, finding that while Google outperforms Bing on both query types, the difference in performance for informational queries is rather low.

Abstract: Search engine retrieval effectiveness studies are usually small-scale, using only limited query samples. Furthermore, queries are selected by the researchers. We address these issues by taking a random representative sample of 1,000 informational and 1,000 navigational queries from a major German search engine and comparing Google's and Bing's results based on this sample. Jurors were found through crowdsourcing, and data was collected using specialised software, the Relevance Assessment Tool (RAT). We found that while Google outperforms Bing in both query types, the difference in the performance for informational queries was rather low. However, for navigational queries, Google found the correct answer in 95.3 per cent of cases whereas Bing only found the correct answer 76.6 per cent of the time. We conclude that search engine performance on navigational queries is of great importance, as users in this case can clearly identify queries that have returned correct results. So, performance on this query type may contribute to explaining user satisfaction with search engines.

58 citations

Journal Article•DOI•

A Survey on optimization approaches to text document clustering

R.Jensi, Dr.G.Wiselin Jiji

10 Jan 2014-arXiv: Information Retrieval

TL;DR: This paper presents a brief survey of optimization approaches to text document clustering, noting that optimization techniques perform a globalized search over the entire solution space.

Abstract: Text document clustering is one of the fastest growing research areas because of the availability of a huge amount of information in electronic form. A number of techniques have been proposed for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide a localized search in effectively navigating, summarizing, and organizing information. A global optimal solution can be obtained by applying high-speed and high-quality optimization algorithms. The optimization technique performs a globalized search in the entire solution space. In this paper, a brief survey of optimization approaches to text document clustering is presented.

44 citations

Posted Content•

Predicting a Business Star in Yelp from Its Reviews Text Alone.

Mingming Fan¹, Maryam Khademi²•Institutions (2)

University of North Carolina at Charlotte¹, University of California, Irvine²

05 Jan 2014-arXiv: Information Retrieval

TL;DR: This paper predicts a business rating from user-generated review texts alone, which not only provides an overview of plentiful long review texts but also cancels out subjectivity.

Abstract: Yelp online reviews are an invaluable source of information for users choosing where to visit or what to eat among numerous available options. But due to the overwhelming number of reviews, it is almost impossible for users to go through all reviews and find the information they are looking for. To provide a business overview, one solution is to give the business a rating of 1 to 5 stars. This rating can be subjective and biased toward users' personalities. In this paper, we predict a business rating based on user-generated review texts alone. This not only provides an overview of plentiful long review texts but also cancels out subjectivity. Selecting the restaurant category from the Yelp Dataset Challenge, we use a combination of three feature generation methods as well as four machine learning models to find the best prediction result. Our approach is to create a bag of words from the top frequent words in all raw text reviews, or from the top frequent words/adjectives resulting from Part-of-Speech analysis. Our results show a Root Mean Square Error (RMSE) of 0.6 for the combination of Linear Regression with either the top frequent words from raw data or the top frequent adjectives after Part-of-Speech (POS) analysis.

33 citations
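
The feature generation and the error metric named in the abstract above can be sketched as follows. This is an illustrative sketch only: the helper names `top_words`, `bag_of_words`, and `rmse` are hypothetical, the tokenizer is a plain whitespace split, and no regression model or Part-of-Speech tagging is included.

```python
import math
from collections import Counter


def top_words(reviews, k):
    # Vocabulary = the k most frequent words over all raw review texts.
    counts = Counter(word for review in reviews for word in review.lower().split())
    return [word for word, _ in counts.most_common(k)]


def bag_of_words(review, vocab):
    # Count how often each vocabulary word occurs in one review.
    counts = Counter(review.lower().split())
    return [counts[word] for word in vocab]


def rmse(predicted, actual):
    # Root Mean Square Error between predicted and true star ratings.
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


reviews = ["great food great service", "bad food"]  # toy data, not from Yelp
vocab = top_words(reviews, 2)
features = [bag_of_words(review, vocab) for review in reviews]
print(vocab, features)
```

The resulting feature vectors would then be fed to a regression model, and `rmse` compares its predicted star ratings with the true ones.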

Posted Content•

PinView: Implicit Feedback in Content-Based Image Retrieval

Zakria Hussain, Arto Klami, Jussi Kujala, Alex Po Leung, Kitsuchart Pasupa, Peter Auer, Samuel Kaski, Jorma Laaksonen, John Shawe-Taylor

02 Oct 2014-arXiv: Information Retrieval

Abstract: This paper describes PinView, a content-based image retrieval system that exploits implicit relevance feedback collected during a search session. PinView contains several novel methods to infer the intent of the user. From relevance feedback, such as eye movements or pointer clicks, and visual features of images, PinView learns a similarity metric between images which depends on the current interests of the user. It then retrieves images with a specialized online learning algorithm that balances the tradeoff between exploring new images and exploiting the already inferred interests of the user. We have integrated PinView into the content-based image retrieval system PicSOM, which enables applying PinView to real-world image databases. With the new algorithms, PinView outperforms the original PicSOM, and in online experiments with real users the combination of implicit and explicit feedback gives the best results.

Posted Content•

Information Filtering via Balanced Diffusion on Bipartite Networks

Da-Cheng Nie¹, Ya-Hui An¹, Qiang Dong¹, Yan Fu¹, Tao Zhou¹•Institutions (1)

University of Electronic Science and Technology of China¹

24 Feb 2014-arXiv: Information Retrieval

TL;DR: In this paper, the authors investigate the effect of weight assignment in the hybrid of mass diffusion (MD) and heat conduction (HC) algorithms, and find that a new hybrid algorithm of MD and HC with balanced weights will achieve the optimal recommendation results.

Abstract: The recent decade has witnessed the increasing popularity of recommender systems, which help users acquire relevant commodities and services from overwhelming resources on the Internet. Some simple physical diffusion processes have been used to design effective recommendation algorithms for user-object bipartite networks, typically the mass diffusion (MD) and heat conduction (HC) algorithms, which have different advantages in accuracy and diversity respectively. In this paper, we investigate the effect of weight assignment in the hybrid of MD and HC, and find that a new hybrid algorithm of MD and HC with balanced weights achieves the optimal recommendation results; we name it the balanced diffusion (BD) algorithm. Numerical experiments on three benchmark data sets, MovieLens, Netflix and RateYourMusic (RYM), show that the BD algorithm outperforms the existing diffusion-based methods on three important recommendation metrics: accuracy, diversity and novelty. Specifically, it can not only provide accurate recommendation results, but also yield higher diversity and novelty in recommendations by accurately recommending unpopular objects.

Book Chapter•DOI•

A Hybrid Approach Using Ontology Similarity and Fuzzy Logic for Semantic Question Answering

Monika Rani¹, Maybin K. Muyeba², Om Prakash Vyas¹•Institutions (2)

Indian Institute of Information Technology, Allahabad¹, Manchester Metropolitan University²

01 Jan 2014-arXiv: Information Retrieval

TL;DR: A hybrid approach for a Semantic question answering retrieval system using Ontology Similarity and Fuzzy logic to provide retrieval systems with more accurate answers than non-fuzzy Semantic Ontology approach.

Abstract: One of the challenges in information retrieval is providing accurate answers to a user's question, which is often expressed in uncertain words. Most answers are based on a Syntactic approach rather than a Semantic analysis of the query. In this paper our objective is to present a hybrid approach for a Semantic question answering retrieval system using Ontology Similarity and Fuzzy logic. We use a Fuzzy Co-clustering algorithm to retrieve a collection of documents based on Ontology Similarity. The Fuzzy scale uses Fuzzy type-1 for documents and Fuzzy type-2 for words to prioritize answers. The objective of this work is to provide retrieval systems with more accurate answers than the non-fuzzy Semantic Ontology approach.

Posted Content•

Penerapan teknik web scraping pada mesin pencari artikel ilmiah (Application of web scraping techniques to a search engine for scientific articles)

Ahmad Josi, Leon Andretti Abdillah, Suryayusra

18 Oct 2014-arXiv: Information Retrieval

TL;DR: This paper applies web scraping to a search engine for scientific articles: the programmer studies the HTML structure and navigation of the target websites so that the scraping application can mimic them and extract the desired information.

Abstract: Search engines are a combination of computer hardware and software provided by a particular company through a designated website. Search engines collect information from the web through bots or web crawlers that crawl the web periodically. The process of retrieving information from existing websites is called "web scraping": a technique for extracting information from websites. Web scraping is closely related to web indexing. To develop a web scraping technique, the programmer first studies the HTML documents of the website from which information will be taken, in order to find the HTML tags that flank the desired information. The programmer then studies the navigation of that website so that the scraping web application we create can mimic it. It should also be noted that the implementation described here only scrapes free search engines, namely Portal Garuda, the Indonesian Scientific Journal Database (ISJD), and Google Scholar.

Posted Content•

MOOCdb: Developing Standards and Systems to Support MOOC Data Science.

Kalyan Veeramachaneni, Sherif A. Halawa, Franck Dernoncourt, Una-May O'Reilly, Colin Anthony Taylor, Chuong B. Do

08 Jun 2014-arXiv: Information Retrieval

TL;DR: A shared data model for enabling data science in Massive Open Online Courses (MOOCs) is presented and becomes the foundation for a number of collaborative frameworks that enable progress in data science without the need to share the data.

Abstract: We present a shared data model for enabling data science in Massive Open Online Courses (MOOCs). The model captures students' interactions with the online platform. The data model is platform agnostic and is based on some basic core actions that students take on an online learning platform. Students usually interact with the platform in four different modes: observing, submitting, collaborating and giving feedback. In observing mode, students are simply browsing the online platform, watching videos, reading material, reading books or watching forums. In submitting mode, students submit information to the platform. This includes submissions towards quizzes, homework, or any assessment modules. In collaborating mode, students interact with other students or instructors on forums, collaboratively edit wikis, or chat on Google Hangout or other hangout venues. With these basic definitions of activities, and a data model to store events pertaining to these activities, we then create a common terminology to map Coursera and edX data into this shared data model. This shared data model, called MOOCdb, becomes the foundation for a number of collaborative frameworks that enable progress in data science without the need to share the data.

Posted Content•

Using temporal IDF for efficient novelty detection in text streams

Margarita Karkali, François Rousseau, Alexandros Ntoulas, Michalis Vazirgiannis

07 Jan 2014-arXiv: Information Retrieval

TL;DR: A resource-aware mechanism that is able to handle massive text streams such as the ones present today thanks to the burst of social media and the emergence of the Web as the main source of information is described.

Abstract: Novelty detection in text streams is a challenging task that emerges in quite a few different scenarios, ranging from email thread filtering to RSS news feed recommendation on a smartphone. An efficient novelty detection algorithm can save the user a great deal of time and resources when browsing through relevant yet usually previously-seen content. Most of the recent research on detection of novel documents in text streams has built upon either geometric distances or distributional similarities, with the former typically performing better but being much slower due to the need to compare an incoming document with all the previously-seen ones. In this paper, we propose a new approach to novelty detection in text streams. We describe a resource-aware mechanism that is able to handle massive text streams such as the ones present today thanks to the burst of social media and the emergence of the Web as the main source of information. We capitalize on the historical Inverse Document Frequency (IDF), which is known for capturing term specificity well, and we show that it can be used successfully at the document level as a measure of document novelty. This enables us to avoid similarity comparisons with previous documents in the text stream, thus scaling better and leading to faster execution times. Moreover, as the collection of documents evolves over time, we use a temporal variant of IDF not only to maintain an efficient representation of what has already been seen but also to decay the document frequencies as time goes by. We evaluate the performance of the proposed approach on a real-world news articles dataset created for this task. The results show that the proposed method outperforms all of the baselines while managing to operate efficiently in terms of time complexity and memory usage, which are of great importance in a mobile setting scenario.
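
The idea of scoring novelty from decayed document frequencies, without comparing against earlier documents, can be sketched as follows. This is a minimal sketch under stated assumptions and not the authors' scoring function: the class name `TemporalIdfNovelty`, the smoothed IDF formula, the averaging over distinct terms, and the exponential decay factor are all hypothetical choices.

```python
import math


class TemporalIdfNovelty:
    """Toy novelty scorer based on decayed document frequencies (illustrative)."""

    def __init__(self, decay=0.99):
        self.decay = decay   # assumed per-document decay factor in (0, 1]
        self.df = {}         # term -> decayed document frequency
        self.n_docs = 0.0    # decayed number of documents seen so far

    def score(self, tokens):
        # Average smoothed IDF of the distinct terms: rare or unseen terms
        # push the score up, frequently seen terms pull it down.
        terms = set(tokens)
        if not terms:
            return 0.0
        total = sum(
            math.log((self.n_docs + 1.0) / (self.df.get(t, 0.0) + 1.0)) for t in terms
        )
        return total / len(terms)

    def update(self, tokens):
        # Decay everything seen so far, then add the new document.
        # A practical system would decay lazily instead of touching every term.
        for term in self.df:
            self.df[term] *= self.decay
        self.n_docs = self.n_docs * self.decay + 1.0
        for term in set(tokens):
            self.df[term] = self.df.get(term, 0.0) + 1.0


detector = TemporalIdfNovelty(decay=0.5)
detector.update(["election", "results"])
detector.update(["football", "final"])
print(detector.score(["election"]), detector.score(["football"]), detector.score(["earthquake"]))
```

Scoring needs only the term statistics, so the cost per incoming document depends on its length rather than on the number of documents already seen.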

Posted Content•

Multi-Linear Interactive Matrix Factorization

Lu Yu¹, Chuang Liu¹, Zi-Ke Zhang¹•Institutions (1)

Hangzhou Normal University¹

07 Apr 2014-arXiv: Information Retrieval

TL;DR: In this article, a multi-linear interactive matrix factorization (MLIMF) algorithm is proposed to model the interactions between the users and each event associated with their final decisions, which considers not only the user-item rating information but also the pairwise interactions based on some empirically supported factors.

Abstract: Recommender systems, which can significantly help users find items of interest in the information era, have attracted increasing attention from both the scientific and application communities. One of the most widely applied recommendation methods is Matrix Factorization (MF). However, most MF-based approaches focus on the user-item rating matrix while ignoring the ingredients which may have significant influence on users' preferences on items. In this paper, we propose a multi-linear interactive MF algorithm (MLIMF) to model the interactions between the users and each event associated with their final decisions. Our model considers not only the user-item rating information but also the pairwise interactions based on some empirically supported factors. In addition, we compare the proposed model with three other typical methods: user-based collaborative filtering (UCF), item-based collaborative filtering (ICF) and regularized MF (RMF). Experimental results on two real-world datasets, MovieLens 1M and MovieLens 100k, show that our method performs much better than the other three methods in recommendation accuracy. This work may shed some light on the in-depth understanding of modeling user online behaviors and the consequent decisions.

Posted Content•

Inferring gender of a Twitter user using celebrities it follows

Puneet Singh Ludu

26 May 2014-arXiv: Information Retrieval

TL;DR: This paper first evaluates linguistic content-based features using the LIWC dictionary and popular neighborhood features using Wikipedia and Freebase, then combines both feature sets, which yields a significant increase in gender prediction accuracy.

Abstract: This paper addresses the task of user gender classification in social media, with an application to Twitter. The approach automatically predicts gender by leveraging observable information such as the tweet behavior, the linguistic content of the user's Twitter feed and the celebrities followed by the user. This paper first evaluates linguistic content-based features using the LIWC dictionary and popular neighborhood features using Wikipedia and Freebase. It then combines both feature sets, which yields a significant increase in the accuracy of gender prediction. Results show that rich linguistic features combined with popular neighborhood features prove valuable and promising for additional user classification needs.

Posted Content•

Preference Networks: Probabilistic Models for Recommendation Systems

Dinh Phung¹, Svetha Venkatesh¹•Institutions (1)

Curtin University¹

22 Jul 2014-arXiv: Information Retrieval

TL;DR: This work proposes a unified framework called Preference Network (PN) that jointly models various types of domain knowledge for the task of recommendation and is a probabilistic model that systematically combines both content-based filtering and collaborative filtering into a single conditional Markov random field.

Abstract: Recommender systems are important to help users select relevant and personalised information from the massive amounts of data available. We propose a unified framework called Preference Network (PN) that jointly models various types of domain knowledge for the task of recommendation. The PN is a probabilistic model that systematically combines both content-based filtering and collaborative filtering into a single conditional Markov random field. Once estimated, it serves as a probabilistic database that supports various useful queries such as rating prediction and top-N recommendation. To handle the challenging problem of learning large networks of users and items, we employ a simple but effective pseudo-likelihood with regularisation. Experiments on movie rating data demonstrate the merits of the PN.

Journal Article•DOI•

Pareto-depth for Multiple-query Image Retrieval

Ko-Jen Hsiao, Jeff Calder¹, Alfred O. Hero²•Institutions (2)

University of California, Berkeley¹, University of Michigan²

21 Feb 2014-arXiv: Information Retrieval

TL;DR: In this article, the authors proposed a multiple-query information retrieval algorithm that combines the Pareto front method (PFM) with efficient manifold ranking (EMR) for content-based image retrieval.

Abstract: Most content-based image retrieval systems consider either one single query, or multiple queries that include the same object or represent the same semantic information. In this paper we consider the content-based image retrieval problem for multiple query images corresponding to different image semantics. We propose a novel multiple-query information retrieval algorithm that combines the Pareto front method (PFM) with efficient manifold ranking (EMR). We show that our proposed algorithm outperforms state of the art multiple-query retrieval algorithms on real-world image databases. We attribute this performance improvement to concavity properties of the Pareto fronts, and prove a theoretical result that characterizes the asymptotic concavity of the fronts.
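
The Pareto front idea named in the abstract above can be illustrated by peeling successive Pareto fronts from a set of points. This sketch shows only the generic front computation and does not include efficient manifold ranking (EMR); representing each image as a tuple of per-query dissimilarities (lower is better) is an assumption made for the example, and the function names are hypothetical.

```python
def dominates(p, q):
    # p dominates q if p is no worse in every coordinate and strictly better in one.
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_fronts(points):
    """Return lists of point indices: first front, second front, and so on.

    The first front holds the non-dominated points; removing them and
    repeating gives the deeper fronts, so a point's front number is its depth.
    """
    remaining = dict(enumerate(points))
    fronts = []
    while remaining:
        front = [
            i for i, p in remaining.items()
            if not any(dominates(q, p) for j, q in remaining.items() if j != i)
        ]
        fronts.append(front)
        for i in front:
            del remaining[i]
    return fronts


# Each tuple: (dissimilarity to query 1, dissimilarity to query 2), toy values.
images = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(pareto_fronts(images))
```

Images on shallow fronts are the ones no other image beats for both queries at once, which makes front depth a natural ranking signal when the queries carry different semantics.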

Posted Content•

Text Based Approach For Indexing And Retrieval Of Image And Video: A Review.

Avinash N. Bhute¹, Bandu B. Meshram¹•Institutions (1)

Veermata Jijabai Technological Institute¹

05 Apr 2014-arXiv: Information Retrieval

TL;DR: The different techniques for text extraction from images and videos are discussed and the techniques for indexing and retrieval of image and videos by using extracted text are reviewed.


Abstract: Text data present in multimedia contain useful information for automatic annotation and indexing. The extracted information is used for recognition of the overlay or scene text from a given video or image. The extracted text can be used for retrieving videos and images. In this paper, we first discuss the different techniques for text extraction from images and videos. Second, we review the techniques for indexing and retrieval of images and videos using the extracted text.


Posted Content•

Performance Evaluation of Incremental K-means Clustering Algorithm


Sanjay Chakraborty, Naresh Kumar Nagwani

18 Jun 2014-arXiv: Information Retrieval

TL;DR: The basic methodology for the incremental K-means clustering algorithm is defined, and the point of change in the database up to which incremental K-means clustering performs much better than existing K-means clustering is evaluated.


Abstract: The incremental K-means clustering algorithm has already been proposed and analysed in [Chakraborty and Nagwani, 2011]. It is an innovative approach that is applicable in a periodically incremental environment and deals with bulk updates. In this paper, the performance of this incremental K-means clustering algorithm is evaluated using an air pollution database. This paper also compares the performance of existing K-means clustering and incremental K-means clustering on that database. It also evaluates the point of change in the database up to which incremental K-means clustering performs much better than existing K-means clustering. That point of change in the database is known as the 'Threshold value' or '% delta (change in the database)'. This paper also defines the basic methodology for the incremental K-means clustering algorithm.


Posted Content•

Utilizing Online Social Network and Location-Based Data to Recommend Products and Categories in Online Marketplaces


Emanuel Lacic¹, Dominik Kowald¹, Lukas Eberhard¹, Christoph Trattner², Denis Parra, Leandro Balby Marinho•Institutions (2)

Graz University of Technology¹, Norwegian University of Science and Technology²

08 May 2014-arXiv: Information Retrieval

TL;DR: In this article, the authors exploit users' interactions along three data sources (marketplace, social network and location-based) to assess their performance in a barely studied domain: recommending products and domains of interests (i.e., product categories) to people in an online marketplace environment.


Abstract: Recent research has unveiled the importance of online social networks for improving the quality of recommender systems and encouraged the research community to investigate better ways of exploiting the social information for recommendations. To contribute to this sparse field of research, in this paper we exploit users' interactions along three data sources (marketplace, social network and location-based) to assess their performance in a barely studied domain: recommending products and domains of interests (i.e., product categories) to people in an online marketplace environment. To that end we defined sets of content- and network-based user similarity features for each data source and studied them in isolation using a user-based Collaborative Filtering (CF) approach and in combination via a hybrid recommender algorithm, to assess which one provides the best recommendation performance. Interestingly, in our experiments conducted on a rich dataset collected from SecondLife, a popular online virtual world, we found that recommenders relying on user similarity features obtained from the social network data clearly yielded the best results in terms of accuracy in case of predicting products, whereas the features obtained from the marketplace and location-based data sources also obtained very good results in case of predicting categories. This finding indicates that all three types of data sources are important and should be taken into account depending on the level of specialization of the recommendation task.


Posted Content•

Sentiment Analysis Using Collaborated Opinion Mining


Deepali Virmani, Vikrant Malhotra, Ridhi Tyagi

12 Jan 2014-arXiv: Information Retrieval

TL;DR: Sentiment analysis in collaboration with opinion extraction, summarization, and tracking of student records is proposed; by applying the proposed sentiment analysis algorithm, the opinion is extracted and represented.


Abstract: Opinion mining and sentiment analysis have emerged as a field of study since the widespread adoption of the World Wide Web and the internet. Opinion mining refers to the extraction of those lines or phrases in raw, large-scale data that express an opinion. Sentiment analysis, on the other hand, identifies the polarity of the opinion being extracted. In this paper we propose sentiment analysis in collaboration with opinion extraction, summarization, and tracking of the records of the students. The paper modifies the existing algorithm in order to obtain the collaborated opinion about the students. The resultant opinion is represented as very high, high, moderate, low, or very low. The paper is based on a case study in which teachers give their remarks about the students; by applying the proposed sentiment analysis algorithm, the opinion is extracted and represented.


Posted Content•

Name Disambiguation from link data in a collaboration graph using temporal and topological features


Baichuan Zhang, Tanay Kumar Saha, Mohammad Al Hasan

19 Jun 2014-arXiv: Information Retrieval

TL;DR: In this paper, the authors proposed a method for solving the entity disambiguation task from link information obtained from a collaboration network, which is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network.


Abstract: In a social community, multiple persons may share the same name, phone number or some other identifying attributes. This, along with other phenomena such as name abbreviation, name misspelling, and human error, leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, and database integration, and, more importantly, cause improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each decomposed partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, the attempt to collect biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.


Posted Content•

Evaluation of YTEX and MetaMap for clinical concept recognition


John D. Osborne¹, Binod Gyawali¹, Thamar Solorio¹•Institutions (1)

University of Alabama at Birmingham¹

07 Feb 2014-arXiv: Information Retrieval

Abstract: We used MetaMap and YTEX as a basis for the construction of two separate systems to participate in the 2013 ShARe/CLEF eHealth Task 1 [9], the recognition of clinical concepts. No modifications were directly made to these systems, but output concepts were filtered using stop concepts, stop concept text and UMLS semantic type. Concept boundaries were also adjusted using a small collection of rules to increase precision on the strict task. Overall MetaMap had better performance than YTEX on the strict task, primarily due to a 20% performance improvement in precision. In the relaxed task YTEX had better performance in both precision and recall, giving it an overall F-Score 4.6% higher than MetaMap on the test data. Our results also indicated a 1.3% higher accuracy for YTEX in UMLS CUI mapping.


Posted Content•

Opinion Mining In Hindi Language: A Survey


Richa Sharma¹, Shweta Nigam, Rekha Jain²•Institutions (2)

Indian Institute of Information Technology and Management, Gwalior¹, Banasthali Vidyapith²

19 Apr 2014-arXiv: Information Retrieval

TL;DR: This article gives an overview of the work that has been done on opinion mining in Hindi-language text on the Web.


Abstract: Opinions are very important in the life of human beings. These opinions help humans make decisions. As the impact of the Web increases day by day, Web documents can be seen as a new source of opinion for human beings. The Web contains a huge amount of information generated by users through blogs, forum entries, social networking websites, and so on. To analyze this large amount of information, it is necessary to develop a method that automatically classifies the information available on the Web. This domain is called Sentiment Analysis and Opinion Mining. Opinion Mining or Sentiment Analysis is a natural language processing task that mines information from various text forms such as reviews, news, and blogs and classifies them on the basis of their polarity as positive, negative, or neutral. Over the last few years, an enormous increase has been seen in Hindi-language content on the Web. Research in opinion mining has mostly been carried out in English, but it is very important to perform opinion mining in Hindi as well, since a large amount of information in Hindi is also available on the Web. This paper gives an overview of the work that has been done in the Hindi language.


Posted Content•

Information Retrieval (IR) through Semantic Web (SW): An Overview


Gagandeep Singh, Vishal Jain

27 Mar 2014-arXiv: Information Retrieval

TL;DR: In this article, the authors discuss the use of IR technology for handling annotations in Semantic Web (SW) languages and discuss the knowledge representation languages used for retrieving information from documents.


Abstract: A large amount of data is present on the web. It contains a huge number of web pages, and finding suitable information in them is a very cumbersome task. There is a need to organize data in a formal manner so that users can easily access and use them. To retrieve information from documents, we have many Information Retrieval (IR) techniques. Current IR techniques are not advanced enough to exploit semantic knowledge within documents and give precise results. IR technology is a major factor responsible for handling annotations in Semantic Web (SW) languages, and in the present paper the knowledge representation languages used for retrieving information are discussed.


Posted Content•

Why we need an independent index of the Web.


Dirk Lewandowski

09 May 2014-arXiv: Information Retrieval

TL;DR: I describe how building and maintaining a proprietary index is the greatest deterrent to such an undertaking, and how first overcoming this obstacle may establish the conditions necessary to achieve that desired end.


Abstract: The path to greater diversity, as we have seen, cannot be achieved by merely hoping for a new search engine nor will government support for a single alternative achieve this goal. What is instead required is to create the conditions that will make establishing such a search engine possible in the first place. I describe how building and maintaining a proprietary index is the greatest deterrent to such an undertaking. We must first overcome this obstacle. Doing so will still not solve the problem of the lack of diversity in the search engine marketplace. But it may establish the conditions necessary to achieve that desired end.


Posted Content•

Coupled Matrix Factorization within Non-IID Context


Fangfang Li¹, Guandong Xu¹, Longbing Cao¹•Institutions (1)

University of Technology, Sydney¹

08 Apr 2014-arXiv: Information Retrieval

TL;DR: This paper proposes a novel generic coupled matrix factorization (CMF) model by incorporating non-IID coupling relations between users and items and demonstrates that the user/item couplings can be effectively applied in RS and CMF outperforms the benchmark methods.


Abstract: Recommender systems research has gone through different stages, from user preference understanding to content analysis. Typical recommendation algorithms were built on the following bases: (1) assuming users and items are IID, namely independent and identically distributed, and (2) focusing on specific aspects such as user preferences or contents. In reality, complex recommendation tasks involve and require (1) personalized outcomes tailored to heterogeneous subjective preferences; and (2) explicit and implicit objective coupling relationships between users, items, and ratings to be considered as intrinsic forces driving preferences. This inevitably involves non-IID complexity and the need to combine subjective preferences with the objective couplings hidden in recommendation applications. In this paper, we propose a novel generic coupled matrix factorization (CMF) model by incorporating non-IID coupling relations between users and items. Such couplings integrate the intra-coupled interactions within an attribute and inter-coupled interactions among different attributes. Experimental results on two open data sets demonstrate that the user/item couplings can be effectively applied in RS and that CMF outperforms the benchmark methods.

