Author
Laila Abdelhafeez
Bio: Laila Abdelhafeez is an academic researcher at the University of California, Riverside. The author has contributed to research in topics including computer science and spatial analysis, has an h-index of 1, and has co-authored 3 publications receiving 18 citations.
Papers
••
01 Jan 2020
TL;DR: This paper reviews core components that enable large-scale querying and indexing for microblogs data, and discusses system-level issues and ongoing efforts to support microblogs through the rising wave of big data systems.
Abstract: Microblogs data is the micro-length user-generated data that is posted on the web, e.g., tweets, online reviews, comments on news and social media. It has gained considerable attention in recent years due to its widespread popularity, rich content, and value in several societal applications. Nowadays, microblogs applications span a wide spectrum of interests including targeted advertising, market reports, news delivery, political campaigns, rescue services, and public health. Consequently, major research efforts have been spent to manage, analyze, and visualize microblogs to support different applications. This paper gives a comprehensive review of major research and system work in microblogs data management. The paper reviews core components that enable large-scale querying and indexing for microblogs data. A dedicated part focuses on system-level issues and the ongoing effort to support microblogs through the rising wave of big data systems. In addition, we review the major research topics that exploit these core data management components to provide innovative and effective analysis and visualization for microblogs, such as event detection, recommendations, automatic geotagging, and user queries. Throughout the different parts, we highlight the challenges, innovations, and future opportunities in microblogs data research.
23 citations
••
03 Nov 2020
TL;DR: This paper proposes a highly-parallelized query processing framework to efficiently compute the spatial group-by query, which has shown significant superiority over all existing techniques.
Abstract: This paper studies a spatial group-by query over complex polygons. Groups are selected from a set of non-overlapping complex polygons, typically in the order of thousands, while the input is a large-scale dataset that contains hundreds of millions or even billions of spatial points. Given a set of spatial points and a set of polygons, the spatial group-by query returns the number of points that lie within boundaries of each polygon. This problem is challenging because real polygons (like counties, cities, postal codes, voting regions, etc.) are described by very complex boundaries. We propose a highly-parallelized query processing framework to efficiently compute the spatial group-by query. Our experimental evaluation with real data and queries has shown significant superiority over all existing techniques.
3 citations
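The spatial group-by query above can be illustrated with a minimal sequential sketch: count how many points fall inside each polygon using a standard ray-casting point-in-polygon test. The paper's contribution is a highly-parallelized framework for billions of points; this toy version, with hypothetical region names and data, only shows the query's semantics.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices, implicitly closed."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_group_by(points, polygons):
    """Return {polygon_name: number of points inside its boundary}."""
    counts = {name: 0 for name in polygons}
    for x, y in points:
        for name, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # polygons are non-overlapping, per the problem setting
    return counts

# Toy example: two unit squares side by side (hypothetical "regions").
polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5)]
print(spatial_group_by(points, polygons))  # → {'A': 2, 'B': 1}
```

Real county or postal-code boundaries have thousands of vertices, which is why the naive per-point test above becomes the bottleneck the paper's parallel framework addresses.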
••
01 Jun 2022
TL;DR: This research focuses on scaling spatial queries in the context of big data systems, to be able to apply complex algorithms on large-scale spatial datasets in a timely manner.
Abstract: The amount of data in the world is increasing exponentially, a large portion of this data comes from the interactions over mobile devices and the ubiquitous IoT applications. Improving our ability to extract information and insights from these large and complex datasets is crucial to a variety of applications. Our research focuses on scaling spatial queries in the context of big data systems, to be able to apply complex algorithms on large-scale spatial datasets in a timely manner. In particular, this paper studies two spatial queries: (a) spatial group-by polygon query which groups input data points by a given complex polygon set (e.g. world countries), and (b) polygonization query which polygonizes an input set of line strings (e.g. USA road network).
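The polygonization query mentioned above can be sketched in miniature: assemble an input set of line strings (here reduced to individual segments) into closed rings by chaining matching endpoints. The paper targets massive road networks with a scalable algorithm; this greedy toy version only illustrates what the operation computes.

```python
def polygonize(segments):
    """segments: list of ((x1, y1), (x2, y2)) pairs. Returns a list of closed
    rings (vertex lists) built by greedily chaining matching endpoints."""
    remaining = list(segments)
    rings = []
    while remaining:
        a, b = remaining.pop()
        ring = [a, b]
        while ring[-1] != ring[0]:
            for i, (p, q) in enumerate(remaining):
                if p == ring[-1]:
                    ring.append(q)
                    remaining.pop(i)
                    break
                if q == ring[-1]:
                    ring.append(p)
                    remaining.pop(i)
                    break
            else:
                break  # open chain: not part of any polygon in this sketch
        if ring[-1] == ring[0]:
            rings.append(ring[:-1])  # drop the repeated closing vertex
    return rings

# Toy example: the four edges of a unit square polygonize into one ring.
square_edges = [((0, 0), (1, 0)), ((1, 0), (1, 1)),
                ((1, 1), (0, 1)), ((0, 1), (0, 0))]
rings = polygonize(square_edges)
print(len(rings), len(rings[0]))  # → 1 4
```

A production polygonizer (e.g., the one studied over the USA road network) must also node intersecting lines and handle dangling edges, which this sketch ignores.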
••
TL;DR: In this article, the spatial group-by query over complex polygons is studied, and a highly-parallelized query processing framework is proposed to efficiently compute the spatial group-by query on highly skewed spatial data.
Abstract: This paper studies the spatial group-by query over complex polygons. Given a set of spatial points and a set of polygons, the spatial group-by query returns the number of points that lie within the boundaries of each polygon. Groups are selected from a set of non-overlapping complex polygons, typically in the order of thousands, while the input is a large-scale dataset that contains hundreds of millions or even billions of spatial points. This problem is challenging because real polygons (like counties, cities, postal codes, voting regions, etc.) are described by very complex boundaries. We propose a highly-parallelized query processing framework to efficiently compute the spatial group-by query on highly skewed spatial data. We also propose an effective query optimizer that adaptively assigns the appropriate processing scheme based on the query polygons. Our experimental evaluation with real data and queries has shown significant superiority over all existing techniques.
••
20 Apr 2020
TL;DR: DLEEL is a research system that supports scalable spatial queries with multiple predicates on user-generated data streams, such as social media streams, and is the first to address personalized queries on streaming spatial-social data through novel low-overhead indexing that scales for large amounts of data and users.
Abstract: This paper demonstrates DLEEL, a research system that supports scalable spatial queries with multiple predicates on user-generated data streams, such as social media streams. Supported queries include spatial-social queries and spatial-keyword queries, which are popular in different applications but have never been addressed in the challenging environment of streaming data, where data arrives at excessively high rates. DLEEL distinguishes itself with three novel contributions: (1) Indexing spatial-social data for personalized real-time search: DLEEL is the first to address personalized queries on streaming spatial-social data through novel low-overhead indexing that scales for large amounts of data and users. The novel indexing has a hybrid storage architecture that trades off indexing overhead, memory consumption, and query latency. (2) Indexing spatial-keyword data for real-time search: DLEEL is the first to enrich existing spatial-keyword indexes with novel streaming data components. The new components reveal performance losses and gains from a system perspective, trading off the system overhead with the flexibility to support a variety of queries. (3) Scalable query processing: DLEEL exploits the indexes' content to smartly prune the search space on multiple dimensions and support efficient query latency for its different queries on an excessive number of data records. DLEEL is demonstrated using a stream of 5 billion real tweets collected from Twitter APIs and real query locations obtained from a popular web search engine. DLEEL has shown superior performance, serving incoming queries with an average latency of a few milliseconds while digesting hundreds of thousands of data records every second.
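The kind of search-space pruning DLEEL's indexes enable can be sketched with a uniform grid in which each cell keeps a small inverted index from keyword to records, so a spatial-keyword query touches only the cells overlapping its region. All names and data below are hypothetical; DLEEL's actual streaming, memory-aware index structures go far beyond this toy, and a real system would also refine cell-level hits with an exact point check.

```python
from collections import defaultdict

CELL = 10.0  # grid cell size (arbitrary units for the sketch)

# (cell_x, cell_y) -> keyword -> list of matching records
index = defaultdict(lambda: defaultdict(list))

def insert(x, y, keywords, record):
    cell = (int(x // CELL), int(y // CELL))
    for kw in keywords:
        index[cell][kw].append(record)

def query(xmin, ymin, xmax, ymax, keyword):
    """Return records matching `keyword` whose cell overlaps the query box;
    all other cells are pruned without being touched."""
    hits = []
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            hits.extend(index[(cx, cy)].get(keyword, []))
    return hits

insert(5, 5, ["coffee"], "tweet-1")
insert(55, 5, ["coffee"], "tweet-2")  # far away: lands in a different cell
print(query(0, 0, 20, 20, "coffee"))  # → ['tweet-1']
```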
Cited by
•
TL;DR: This paper is the first complete description of the resulting open source AsterixDB system, covering the system's data model, its query language, and its software architecture.
Abstract: AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store.
Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.
168 citations
••
TL;DR: This study demonstrates the potential of multitask models on this type of problem and improves the state-of-the-art results in the fine-grained sentiment classification problem.
Abstract: Traditional sentiment analysis approaches tackle problems like ternary (3-category) and fine-grained (5-category) classification by learning the tasks separately. We argue that such classification tasks are correlated and we propose a multitask approach based on a recurrent neural network that benefits by jointly learning them. Our study demonstrates the potential of multitask models on this type of problem and improves the state-of-the-art results in the fine-grained sentiment classification problem.
53 citations
••
20 Apr 2020
TL;DR: An efficient divide-and-conquer algorithm is proposed to derive bounds of spatial similarity and textual similarity between two semantic trajectories, which enable us prune dissimilar trajectory pairs without the need of computing the exact value of spatio-textual similarity.
Abstract: Matching similar pairs of trajectories, called trajectory similarity join, is a fundamental functionality in spatial data management. We consider the problem of semantic trajectory similarity join (STS-Join). Each semantic trajectory is a sequence of Points-of-interest (POIs) with both location and text information. Thus, given two sets of semantic trajectories and a threshold θ, the STS-Join returns all pairs of semantic trajectories from the two sets with spatio-textual similarity no less than θ. This join targets applications such as term-based trajectory near-duplicate detection, geo-text data cleaning, personalized ridesharing recommendation, keyword-aware route planning, and travel itinerary recommendation. With these applications in mind, we provide a purposeful definition of spatio-textual similarity. To enable efficient STS-Join processing on large sets of semantic trajectories, we develop trajectory pair filtering techniques and consider the parallel processing capabilities of modern processors. Specifically, we present a two-phase parallel search algorithm. We first group semantic trajectories based on their text information. The algorithm's per-group searches are independent of each other and thus can be performed in parallel. For each group, the trajectories are further partitioned based on the spatial domain. We generate spatial and textual summaries for each trajectory batch, based on which we develop batch filtering and trajectory-batch filtering techniques to prune unqualified trajectory pairs in a batch mode. Additionally, we propose an efficient divide-and-conquer algorithm to derive bounds of spatial similarity and textual similarity between two semantic trajectories, which enable us to prune dissimilar trajectory pairs without computing the exact value of spatio-textual similarity.
Experimental study with large semantic trajectory data confirms that our algorithm for processing the semantic trajectory join outperforms our well-designed baseline by a factor of 8–12.
32 citations
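The shape of the STS-Join can be illustrated with a naive nested-loop baseline under an assumed similarity definition: textual similarity as Jaccard over the POI term sets, and spatial similarity decaying with the distance between trajectory centroids. The paper's purposeful similarity definition and its filtering/parallel search are more refined; everything here (weights, toy trajectories) is a hypothetical stand-in.

```python
import math

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def centroid(traj):
    xs = [p[0] for p, _ in traj]
    ys = [p[1] for p, _ in traj]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def similarity(t1, t2, alpha=0.5):
    """Weighted spatio-textual similarity of two semantic trajectories,
    each a list of ((x, y), {terms}) POIs. alpha balances the two parts."""
    terms1 = set().union(*(t for _, t in t1))
    terms2 = set().union(*(t for _, t in t2))
    spatial = 1.0 / (1.0 + math.dist(centroid(t1), centroid(t2)))
    return alpha * spatial + (1 - alpha) * jaccard(terms1, terms2)

def sts_join(set1, set2, theta):
    """Return all index pairs (i, j) with similarity >= theta; O(|A||B|)."""
    return [(i, j) for i, t1 in enumerate(set1)
                   for j, t2 in enumerate(set2)
                   if similarity(t1, t2) >= theta]

a = [[((0, 0), {"cafe"}), ((1, 0), {"museum"})]]
b = [[((0, 1), {"cafe", "museum"})],
     [((50, 50), {"airport"})]]
print(sts_join(a, b, 0.5))  # → [(0, 0)]
```

The quadratic loop above is exactly what the paper's grouping, batch summaries, and similarity bounds are designed to avoid evaluating in full.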
••
TL;DR: The Web of Science core collection was taken as the data source, and traditional statistical methods and CiteSpace software were used to carry out a scientometric analysis of SMBD, which showed the research status, hotspots, and trends in this field.
Abstract: Social Media Big Data (SMBD) is widely used to serve the economic and social development of human beings. However, as a young field of research and practice, the academic understanding of SMBD is still limited and needs to be supplemented. This paper took the Web of Science (WoS) core collection as the data source and used traditional statistical methods and CiteSpace software to carry out a scientometric analysis of SMBD, showing the research status, hotspots, and trends in this field. The results showed that: (1) More and more attention has been paid to SMBD research in academia, and the number of published papers has increased in recent years, mainly in subjects such as Computer Science, Engineering, and Telecommunications. The results were published primarily in IEEE Access, Sustainability, and Future Generation Computer Systems: The International Journal of eScience, among others. (2) In terms of contributions, China, the United States, the United Kingdom, and other countries (regions) have published the most papers in SMBD, and high-yield institutions also mainly come from these countries (regions). There are already some excellent teams in the field, such as the Wanggen Wan team at Shanghai University and the Haoran Xie team at the City University of Hong Kong. (3) We studied the hotspots of SMBD in recent years and summarized the frontier of SMBD based on keywords and co-citation literature, including the deep mining and construction of social media technology, reflections and concerns about the rapid development of social media, and the role of SMBD in solving human social development problems. These studies can provide value and references for SMBD researchers seeking to understand the research status, hotspots, and trends in this field.
29 citations
••
09 Mar 2020
TL;DR: This work proposes solutions that are capable of supporting real-life location-based publish/subscribe applications that process large numbers of SST and RST subscriptions over a realistic stream of spatio-temporal documents.
Abstract: Massive amounts of data that contain spatial, textual, and temporal information are being generated at a rapid pace. With streams of such data, which includes check-ins and geo-tagged tweets, available, users may be interested in being kept up-to-date on which terms are popular in the streams in a particular region of space. To enable this functionality, we aim at efficiently processing two types of general top-k term subscriptions over streams of spatio-temporal documents: region-based top-k spatial-temporal term (RST) subscriptions and similarity-based top-k spatio-temporal term (SST) subscriptions. RST subscriptions continuously maintain the top-k most popular trending terms within a user-defined region. SST subscriptions free users from defining a region and maintain top-k locally popular terms based on a ranking function that combines term frequency, term recency, and term proximity. To solve the problem, we propose solutions that are capable of supporting real-life location-based publish/subscribe applications that process large numbers of SST and RST subscriptions over a realistic stream of spatio-temporal documents. The performance of our proposed solutions is studied in extensive experiments using two spatio-temporal datasets.
29 citations
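The RST (region-based top-k spatio-temporal term) subscription described above can be sketched as a per-subscription term counter over documents that fall in the user-defined region. This minimal version omits the recency and proximity weighting used by SST subscriptions, and all class names and data are hypothetical illustrations rather than the paper's actual structures.

```python
from collections import Counter

class RSTSubscription:
    """Maintain the top-k most popular terms within a rectangular region."""

    def __init__(self, region, k):
        self.xmin, self.ymin, self.xmax, self.ymax = region
        self.k = k
        self.counts = Counter()

    def feed(self, x, y, terms):
        """Digest one spatio-temporal document from the stream."""
        if self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax:
            self.counts.update(terms)

    def top_k(self):
        return [term for term, _ in self.counts.most_common(self.k)]

sub = RSTSubscription(region=(0, 0, 10, 10), k=2)
sub.feed(1, 1, ["rain", "storm"])
sub.feed(2, 3, ["rain"])
sub.feed(50, 50, ["sunny"])  # outside the region: ignored
print(sub.top_k())           # → ['rain', 'storm']
```

Evaluating every subscription against every arriving document, as this sketch does, is exactly what the paper's solutions avoid in order to scale to large numbers of subscriptions over a realistic document stream.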