
Showing papers in "Data in 2022"


Journal ArticleDOI
10 Jan 2022-Data
TL;DR: In this article, a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages is presented; state-of-the-art machine learning models are used to enrich the data with sentiment labels and named entities, and a gender identification approach is proposed to segregate user gender.
Abstract: As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important to understand the emergent situation but difficult to obtain. The widespread usage of social networking sites, especially during mass convergence events, such as health emergencies, provides instant access to citizen-generated data offering rich information about public opinions, sentiments, and situational updates useful for authorities to gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, a gender identification approach is proposed to segregate user gender. Furthermore, a geolocalization approach is devised to geotag tweets at country, state, county, and city granularities, enabling a myriad of data analysis tasks to understand real-world issues at national and sub-national levels. We believe this multilingual data with broader geographical and longer temporal coverage will be a cornerstone for researchers to study impacts of the ongoing global health catastrophe and to manage adverse consequences related to people’s health, livelihood, and social well-being.
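The sentiment-labeling step described above can be approximated with an off-the-shelf multilingual classifier. The sketch below is a minimal illustration, assuming a Hugging Face pipeline; the model name and the example tweet texts are assumptions for demonstration, not the pipeline actually used in the paper.

```python
# Minimal sketch: label multilingual tweets with sentiment using a pretrained
# model. Model name and tweets are illustrative assumptions only.
from transformers import pipeline

tweets = [
    "Hospitals are overwhelmed, please stay home.",      # hypothetical examples
    "Vaccination centres opened in our city today!",
]

# A multilingual Twitter sentiment model from the Hugging Face Hub (assumed).
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

for tweet, result in zip(tweets, sentiment(tweets)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {tweet}")
```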

17 citations


Journal ArticleDOI
13 May 2022-Data
TL;DR: This study investigates the ability of deep neural networks, namely, Long Short-Term Memory (LSTM), Bi-directional LSTM, Convolutional Neural Network (CNN), and a hybrid of CNN and LSTM networks, to automatically classify and identify fake news content related to the COVID-19 pandemic posted on social media platforms.
Abstract: The fast growth of technology in online communication and social media platforms alleviated numerous difficulties during the COVID-19 epidemic. However, it was utilized to propagate falsehoods and misleading information about the disease and the vaccination. In this study, we investigate the ability of deep neural networks, namely, Long Short-Term Memory (LSTM), Bi-directional LSTM, Convolutional Neural Network (CNN), and a hybrid of CNN and LSTM networks, to automatically classify and identify fake news content related to the COVID-19 pandemic posted on social media platforms. These deep neural networks have been trained and tested using the “COVID-19 Fake News” dataset, which contains 21,379 real and fake news instances for the COVID-19 pandemic and its vaccines. The real news data were collected from independent and internationally reliable institutions on the web, such as the World Health Organization (WHO), the International Committee of the Red Cross (ICRC), the United Nations (UN), the United Nations Children’s Fund (UNICEF), and their official accounts on Twitter. The fake news data were collected from different fact-checking websites (such as Snopes, PolitiFact, and FactCheck). The evaluation results showed that the CNN model outperforms the other deep neural networks with the best accuracy of 94.2%.
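As a rough illustration of the CNN text classifier that performed best in this study, the following sketch builds a small 1-D convolutional network with Keras; the vocabulary size, sequence length, and layer sizes are placeholder assumptions rather than the authors' configuration.

```python
# Sketch of a CNN text classifier for real/fake news, assuming texts have
# already been tokenized and padded to MAX_LEN integer ids.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # placeholder vocabulary size
MAX_LEN = 200        # placeholder sequence length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Conv1D(128, kernel_size=5, activation="relu"),   # n-gram-like filters
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),                   # 1 = real, 0 = fake
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```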

16 citations


Journal ArticleDOI
12 Apr 2022-Data
TL;DR: The phenomenon of disinformation is examined through the lens of cyber threat epistemology; it displays the necessary elements required for its appropriate classification, and the authors argue for its recognition as an official and actual cyber threat.
Abstract: This study examines the phenomenon of disinformation as a threat in the realm of cybersecurity. We have analyzed multiple authoritative cybersecurity standards, manuals, handbooks, and literary works. We present the unanimous meaning and construct of the term cyber threat. Our results reveal that although their definitions are mostly consistent, most of them lack the inclusion of disinformation in their list/glossary of cyber threats. We then proceeded to dissect the phenomenon of disinformation through the lens of cyber threat epistemology; it displays the presence of the necessary elements required (i.e., threat agent, attack vector, target, impact, defense) for its appropriate classification. In conjunction with this, we have also included an in-depth comparative analysis of disinformation and its similar nature and characteristics with the prevailing and existing cyber threats. We, therefore, argue for its recognition as an official and actual cyber threat. The significance of this paper, beyond the taxonomical correction it recommends, rests in the hope that it influences future policies and regulations in combatting disinformation and its propaganda.

13 citations


Journal ArticleDOI
26 May 2022-Data
TL;DR: In this article, a new ontology is proposed based on the Digital Twin concept, i.e., the digital counterpart of cultural heritage assets incorporating all the digital information pertaining to them, which creates a Knowledge Base on the cultural heritage data space.
Abstract: The present paper concerns the design of the semantic infrastructure of the data space for cultural heritage as envisaged by the European Commission in its recent documents. Due to the complexity of the cultural heritage data and of their intrinsic inter-relationships, it is necessary to introduce a novel ontology, yet compliant with existing standards and interoperable with previous platforms used in this context, such as Europeana. The data space organization must be tailored to the methods and the theory of cultural heritage, briefly summarized in the introduction. The new ontology is based on the Digital Twin concept, i.e., the digital counterpart of cultural heritage assets incorporating all the digital information pertaining to them. This creates a Knowledge Base on the cultural heritage data space. The paper outlines the main features of the proposed Heritage Digital Twin ontology and provides some examples of its application. Future work will include completing the ontology in all its details and testing it in other real cases and with the various sectors of the cultural heritage community.

13 citations


Journal ArticleDOI
20 Apr 2022-Data
TL;DR: A hybrid stock prediction model combining the prediction rule ensembles (PRE) technique and a deep neural network (DNN) is proposed to deal with nonlinearity in the data; it performs better than the single prediction models, namely DNN and ANN.
Abstract: Stock prices are volatile due to different factors that are involved in the stock market, such as geopolitical tension, company earnings, and commodity prices. Sometimes stock prices react to domestic uncertainty such as reserve bank policy, government policy, inflation, and global market uncertainty. The volatility estimation of a stock is one of the challenging tasks for traders. Accurate prediction of stock prices helps investors to reduce the risk in a portfolio or investment. Stock prices are nonlinear. To deal with nonlinearity in the data, we propose a hybrid stock prediction model using the prediction rule ensembles (PRE) technique and a deep neural network (DNN). First, stock technical indicators are considered to identify the uptrend in stock prices. We considered moving average technical indicators: moving average 20 days, moving average 50 days, and moving average 200 days. Second, the PRE technique was used to compute different rules for stock prediction, and we selected the rules with the lowest root mean square error (RMSE) score. Third, a three-layer DNN is considered for stock prediction. We fine-tuned the hyperparameters of the DNN, such as the number of layers, learning rate, neurons, and number of epochs in the model. Fourth, the results of the PRE and DNN prediction models are averaged. The hybrid stock prediction model results are computed using the mean absolute error (MAE) and RMSE metrics. The performance of the hybrid stock prediction model is better than that of the single prediction models, namely DNN and ANN, with a 5% to 7% improvement in RMSE score. The Indian stock price data are considered for this work.
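A compressed sketch of the workflow described above (moving-average indicators, two separate predictors, and an averaged hybrid scored with MAE/RMSE) might look as follows; the synthetic price series and the two naive stand-in predictors are assumptions for illustration only, not the authors' PRE rules or tuned DNN.

```python
# Sketch: moving-average features and an averaged two-model "hybrid",
# evaluated with MAE and RMSE. Data and models are placeholders.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
close = pd.Series(100 + rng.standard_normal(400).cumsum(), name="close")

# Technical indicators used in the paper: 20-, 50-, and 200-day moving averages.
features = pd.DataFrame({
    "ma20": close.rolling(20).mean(),
    "ma50": close.rolling(50).mean(),
    "ma200": close.rolling(200).mean(),
}).dropna()
target = close.loc[features.index].shift(-1).dropna()   # next-day price
features = features.loc[target.index]

# Two stand-in predictors (the paper uses PRE rules and a tuned 3-layer DNN).
pred_rules = features["ma20"].to_numpy()        # naive rule-like baseline
pred_dnn = features.mean(axis=1).to_numpy()     # naive smooth baseline
pred_hybrid = (pred_rules + pred_dnn) / 2       # average of both models

rmse = np.sqrt(mean_squared_error(target, pred_hybrid))
mae = mean_absolute_error(target, pred_hybrid)
print(f"hybrid RMSE={rmse:.3f}  MAE={mae:.3f}")
```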

12 citations


Journal ArticleDOI
09 May 2022-Data
TL;DR: In this paper, machine learning techniques, including ensembles, are applied to predict customer defection (churn) in retail banking among 602 young adult bank customers.
Abstract: (1) This study aims to predict youth customers’ defection in retail banking. The sample comprised 602 young adult bank customers. (2) The study applied machine learning techniques, including ensembles, to predict the possibility of churn. (3) The absence of mobile banking, zero-interest personal loans, access to ATMs, and customer care and support were critical driving factors of churn. The ExtraTreeClassifier model resulted in an accuracy rate of 92%, and an AUC of 91.88% validated the findings. (4) Customer retention is one of the critical success factors for organizations in enhancing business value. It is imperative for banks to predict the drivers of churn among their young adult customers so as to proactively create and deliver quality services.
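A hedged sketch of the modelling step (a tree-ensemble classifier scored with accuracy and ROC-AUC, as reported above) could look like this; the feature names and synthetic data are assumptions, and scikit-learn's ExtraTreesClassifier is used as a stand-in for the ExtraTreeClassifier the authors mention.

```python
# Sketch: churn classification with an extra-trees ensemble, reporting
# accuracy and AUC as in the paper. Features and data are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 602  # sample size reported in the paper
X = rng.random((n, 4))          # e.g. mobile banking, loans, ATM access, support
y = (X[:, 0] + rng.normal(0, 0.3, n) < 0.5).astype(int)   # 1 = churn (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, proba))
```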

9 citations


Journal ArticleDOI
01 Jun 2022-Data
TL;DR: In this article, an integrated mapping of the georeferenced data is presented using the QGIS and GMT scripting tool set, which has applicability for risk assessment and geological hazard mapping in the Bolivian Andes, South America.
Abstract: In this paper, an integrated mapping of the georeferenced data is presented using the QGIS and GMT scripting tool set. The study area encompasses the Bolivian Andes, South America, notable for complex geophysical and geological parameters and high seismicity. A data integration was performed for a detailed analysis of the geophysical and geological setting. The data included the raster and vector datasets captured from the open sources: the IRIS seismic data (2015 to 2021), geophysical data from satellite-derived gravity grids based on CryoSat, topographic GEBCO data, geoid undulation data from EGM-2008, and geological georeferences’ vector data from the USGS. The techniques of data processing included quantitative and qualitative evaluation of the seismicity and geophysical setting in Bolivia. The result includes a series of thematic maps on the Bolivian Andes. Based on the data analysis, the western region was identified as the most seismically endangered area in Bolivia with a high risk of earthquake hazards in Cordillera Occidental, followed by Altiplano and Cordillera Real. The earthquake magnitude here ranges from 1.8 to 7.6. The data analysis shows a tight correlation between the gravity, geophysics, and topography in the Bolivian Andes. The cartographic scripts used for processing data in GMT are available in the author’s public GitHub repository in open-access with the provided link. The utility of scripting cartographic techniques for geophysical and topographic data processing combined with GIS spatial evaluation of the geological data supported automated mapping, which has applicability for risk assessment and geological hazard mapping of the Bolivian Andes, South America.

8 citations


Journal ArticleDOI
29 Jan 2022-Data
TL;DR: The development of the cybersecurity datasets used to train the algorithms for building IDS detection models is described, and the different well-known internet of things (IoT) attacks are analyzed and summarized.
Abstract: Almost all industrial internet of things (IIoT) attacks happen at the data transmission layer according to a majority of the sources. In IIoT, different machine learning (ML) and deep learning (DL) techniques are used for building the intrusion detection system (IDS) and models to detect the attacks in any layer of its architecture. In this regard, minimizing the attacks could be the major objective of cybersecurity, while knowing that they cannot be fully avoided. The number of people resisting attacks and building protection systems is smaller than the number of those preparing attacks. Well-reasoned and learning-backed problems must be addressed by the cyber machine, using appropriate methods alongside quality datasets. The purpose of this paper is to describe the development of the cybersecurity datasets used to train the algorithms which are used for building IDS detection models, as well as analyzing and summarizing the different and famous internet of things (IoT) attacks. This is carried out by assessing the outlines of various studies presented in the literature and the many problems with IoT threat detection. Hybrid frameworks have shown good performance and high detection rates compared to standalone machine learning methods in a few experiments. It is the researchers’ recommendation to employ hybrid frameworks to identify IoT attacks for the foreseeable future.

8 citations


Journal ArticleDOI
14 Sep 2022-Data
TL;DR: In this article, the authors examined the impact of anxiety-inducing videos on biosignals, particularly electrocardiogram (ECG) and respiration (RES) signals, collected using a portable device.
Abstract: Portable and wearable devices are becoming increasingly common in our daily lives. In this study, we examined the impact of anxiety-inducing videos on biosignals, particularly electrocardiogram (ECG) and respiration (RES) signals, that were collected using a portable device. Two psychological scales (Beck Anxiety Inventory and Hamilton Anxiety Rating Scale) were used to assess overall anxiety before induction. The data were collected at Simon Fraser University from participants aged 18–56, all of whom were healthy at the time. The ECG and RES signals were collected simultaneously while participants continuously watched video clips that stimulated anxiety-inducing (negative experience) and non-anxiety-inducing events (positive experience). The ECG and RES signals were recorded simultaneously at 500 Hz. The final dataset consisted of psychological scores and physiological signals from 19 participants (14 males and 5 females) who watched eight video clips. This dataset can be used to explore the instantaneous relationship between ECG and RES waveforms and anxiety-inducing video clips to uncover and evaluate the latent characteristic information contained in these biosignals.
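Since the dataset pairs 500 Hz ECG recordings with anxiety scores, a typical first exploratory step is R-peak detection and heart-rate estimation. The sketch below is a minimal example using SciPy on a synthetic signal rather than the released data; the spiky stand-in waveform and the peak-detection thresholds are assumptions that would need tuning on real recordings.

```python
# Sketch: estimate heart rate from a 500 Hz ECG trace by detecting R-peaks.
# The signal here is synthetic; thresholds must be tuned on real data.
import numpy as np
from scipy.signal import find_peaks

FS = 500  # sampling rate (Hz), as reported for the dataset
t = np.arange(0, 30, 1 / FS)                    # 30 s of signal
ecg = np.sin(2 * np.pi * 1.2 * t) ** 63         # crude spiky stand-in for ECG

# Require peaks at least 0.4 s apart (i.e. < 150 bpm) and reasonably tall.
peaks, _ = find_peaks(ecg, distance=int(0.4 * FS), height=0.5)

rr_intervals = np.diff(peaks) / FS              # seconds between beats
heart_rate = 60.0 / rr_intervals.mean()
print(f"detected {len(peaks)} beats, mean HR ~ {heart_rate:.1f} bpm")
```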

8 citations


Journal ArticleDOI
11 May 2022-Data
TL;DR: The DriverMVT (Driver Monitoring dataset with Videos and Telemetry) is introduced, which can be used to train and evaluate deep learning models to estimate the driver’s health state, mental state, concentration level, and his/her activity in the cabin.
Abstract: Developing a driver monitoring system that can assess the driver’s state is a prerequisite and a key to improving road safety. With the success of deep learning, such systems can achieve a high accuracy if corresponding high-quality datasets are available. In this paper, we introduce DriverMVT (Driver Monitoring dataset with Videos and Telemetry). The dataset contains information about the driver head pose, heart rate, and driver behaviour inside the cabin, such as drowsiness and an unfastened belt. This dataset can be used to train and evaluate deep learning models to estimate the driver’s health state, mental state, concentration level, and his/her activity in the cabin. Developing such systems that can alert the driver in case of drowsiness or distraction can reduce the number of accidents and increase the safety on the road. The dataset contains 1506 videos for 9 different drivers (7 males and 2 females), with a total of 5119k frames and a total time of over 36 h. In addition, we evaluated the dataset with the multi-task temporal shift convolutional attention network (MTTS-CAN) algorithm. The algorithm’s mean average error on our dataset is 16.375 heartbeats per minute.

7 citations


Journal ArticleDOI
29 Jan 2022-Data
TL;DR: A web-based system for predicting academic performance and identifying students at risk of failure through academic and demographic factors is developed, and the academic factors had a higher impact on students’ academic performance than the demographic factors.
Abstract: Educational Data Mining (EDM) is used to extract and discover interesting patterns from educational institution datasets using Machine Learning (ML) algorithms. There is much academic information related to students available. Therefore, it is helpful to apply data mining to extract factors affecting students’ academic performance. In this paper, a web-based system for predicting academic performance and identifying students at risk of failure through academic and demographic factors is developed. The ML model is developed to predict the total score of a course at the early stages. Several ML algorithms are applied, namely: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Artificial Neural Network (ANN), and Linear Regression (LR). This model applies to the data of female students of the Computer Science Department at Imam Abdulrahman bin Faisal University (IAU). The dataset contains 842 instances for 168 students. Moreover, the results showed that the prediction’s Mean Absolute Percentage Error (MAPE) reached 6.34%, and the academic factors had a higher impact on students’ academic performance than the demographic factors, with the midterm exam score having the highest impact. The developed web-based prediction system is available on an online server and can be used by tutors.
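As a lightweight illustration of the evaluation reported above, the sketch below fits a regression model to predict a course total score and computes the Mean Absolute Percentage Error; the features and data are synthetic assumptions, not the IAU dataset, and a random forest stands in for the several regressors (SVM, RF, KNN, ANN, LR) the study compares.

```python
# Sketch: predict a total course score and report MAPE, the metric used above.
# Synthetic data and a single stand-in regressor for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 842  # number of instances reported in the paper
X = rng.random((n, 3)) * [20, 30, 10]     # e.g. quizzes, midterm, attendance
y = X.sum(axis=1) + rng.normal(0, 3, n)   # "total score" with noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

mape = mean_absolute_percentage_error(y_te, model.predict(X_te)) * 100
print(f"MAPE: {mape:.2f}%")
```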

Journal ArticleDOI
14 Nov 2022-Data
TL;DR: In this article, the authors compared the predictive power of machine learning algorithms and applied SHAP values to interpret the prediction results on the dataset of listed companies in Vietnam from 2010 to 2021.
Abstract: The past decade has witnessed the rapid development of machine learning applied in economics and finance. Recent evidence suggests that machine learning models have produced superior results to traditional statistical models and have become the driving force for dramatic improvement in the financial industry. However, a much-debated question is whether the prediction results from black box machine learning models can be interpreted. In this study, we compared the predictive power of machine learning algorithms and applied SHAP values to interpret the prediction results on the dataset of listed companies in Vietnam from 2010 to 2021. The results showed that the extreme gradient boosting and random forest models outperformed other models. In addition, based on Shapley values, we also found that long-term debts to equity, enterprise value to revenues, account payable to equity, and diluted EPS had greatly influenced the outputs. In terms of practical contributions, the study helps credit rating companies have a new method for predicting the possibility of default of bond issuers in the market. The study also provides an early warning tool for policymakers about the risks of public companies in order to develop measures to protect retail investors against the risk of bond default.
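The interpretation step the authors describe, i.e. computing Shapley values for a tree ensemble, can be sketched as follows; the financial-ratio feature names mirror the ratios discussed above, but the synthetic target, the gradient-boosting stand-in model, and the data are assumptions used only to show the mechanics of the shap library.

```python
# Sketch: explain a tree-based default-risk model with SHAP values.
# Feature names mirror the ratios in the paper; data and target are synthetic.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.random((500, 4)),
    columns=["lt_debt_to_equity", "ev_to_revenue", "payables_to_equity", "diluted_eps"],
)
# Synthetic "default risk" score driven mainly by leverage, for illustration.
y = 0.7 * X["lt_debt_to_equity"] + 0.2 * X["payables_to_equity"] + rng.normal(0, 0.05, 500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # shape: (n_samples, n_features)

# Rank features by mean absolute SHAP value (global importance).
importance = np.abs(shap_values).mean(axis=0)
for name, val in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name:<20} mean |SHAP| = {val:.4f}")
```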

Journal ArticleDOI
01 Jul 2022-Data
TL;DR: This study presents a collection of annotations/segmentations of pulmonary radiological manifestations that are consistent with TB in the publicly available and widely used Shenzhen chest X-ray (CXR) dataset made available by the U.S. National Library of Medicine.
Abstract: Developments in deep learning techniques have led to significant advances in automated abnormality detection in radiological images and paved the way for their potential use in computer-aided diagnosis (CAD) systems. However, the development of CAD systems for pulmonary tuberculosis (TB) diagnosis is hampered by the lack of training data that is of good visual and diagnostic quality, of sufficient size, variety, and, where relevant, containing fine region annotations. This study presents a collection of annotations/segmentations of pulmonary radiological manifestations that are consistent with TB in the publicly available and widely used Shenzhen chest X-ray (CXR) dataset made available by the U.S. National Library of Medicine and obtained via a research collaboration with No. 3. People’s Hospital Shenzhen, China. The goal of releasing these annotations is to advance the state-of-the-art for image segmentation methods toward improving the performance of fine-grained segmentation of TB-consistent findings in digital Chest X-ray images. The annotation collection comprises the following: 1) annotation files in JSON (JavaScript Object Notation) format that indicate locations and shapes of 19 lung pattern abnormalities for 336 TB patients; 2) mask files saved in PNG format for each abnormality per TB patient; 3) a CSV (comma-separated values) file that summarizes lung abnormality types and numbers per TB patient. To the best of our knowledge, this is the first collection of pixel-level annotations of TB-consistent findings in CXRs. Dataset: https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets/Shenzhen-Hospital-CXR-Set/Annotations/index.html.
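Given that each abnormality mask is released as a PNG per patient, a quick sanity check is to load one mask and measure the annotated area; the sketch below assumes a local file path and a binary mask encoding, both of which are hypothetical, so the dataset documentation should be consulted for the actual layout.

```python
# Sketch: load one abnormality mask (PNG) and report the annotated area.
# The path and the assumption that nonzero pixels mark the finding are
# illustrative; check the dataset documentation for the real encoding.
import numpy as np
from PIL import Image

mask_path = "masks/CHNCXR_0327_1_mask_01.png"   # hypothetical file name

mask = np.array(Image.open(mask_path).convert("L"))
abnormal_pixels = int((mask > 0).sum())
coverage = 100.0 * abnormal_pixels / mask.size

print(f"mask shape: {mask.shape}")
print(f"annotated pixels: {abnormal_pixels} ({coverage:.2f}% of the image)")
```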

Journal ArticleDOI
28 Oct 2022-Data
TL;DR: In this article, the authors presented a dataset for predicting academic performance and dropout at the Polytechnic Institute of Portalegre in Portugal, which includes demographic, socioeconomic, macroeconomic, and academic data on enrollment and academic performance at the end of the first and second semesters.
Abstract: Higher education institutions record a significant amount of data about their students, representing a considerable potential to generate information, knowledge, and monitoring. Both school dropout and educational failure in higher education are an obstacle to economic growth, employment, competitiveness, and productivity, directly impacting the lives of students and their families, higher education institutions, and society as a whole. The dataset described here results from the aggregation of information from different disjointed data sources and includes demographic, socioeconomic, macroeconomic, and academic data on enrollment and academic performance at the end of the first and second semesters. The dataset is used to build machine learning models for predicting academic performance and dropout, which is part of a Learning Analytic tool developed at the Polytechnic Institute of Portalegre that provides information to the tutoring team with an estimate of the risk of dropout and failure. The dataset is useful for researchers who want to conduct comparative studies on student academic performance and also for training in the machine learning area.

Journal ArticleDOI
13 Jul 2022-Data
TL;DR: SEN2VENµS is an open-data licensed dataset composed of 10 m and 20 m cloud-free surface reflectance patches from Sentinel-2, with their reference spatially registered surface reflectance patches at 5 m resolution acquired on the same day by the VENµS satellite, which can be used for the training and comparison of super-resolution algorithms.
Abstract: Boosted by the progress in deep learning, Single Image Super-Resolution (SISR) has gained a lot of interest in the remote sensing community, who sees it as an opportunity to compensate for satellites’ ever-limited spatial resolution with respect to end users’ needs. This is especially true for Sentinel-2 because of its unique combination of resolution, revisit time, global coverage and free and open data policy. While there has been a great amount of work on network architectures in recent years, deep-learning-based SISR in remote sensing is still limited by the availability of the large training sets it requires. The lack of publicly available large datasets with the required variability in terms of landscapes and seasons pushes researchers to simulate their own datasets by means of downsampling. This may impair the applicability of the trained model on real-world data at the target input resolution. This paper presents SEN2VENµS, an open-data licensed dataset composed of 10 m and 20 m cloud-free surface reflectance patches from Sentinel-2, with their reference spatially registered surface reflectance patches at 5 m resolution acquired on the same day by the VENµS satellite. This dataset covers 29 locations on earth with a total of 132,955 patches of 256 × 256 pixels at 5 m resolution and can be used for the training and comparison of super-resolution algorithms to bring the spatial resolution of 8 of the Sentinel-2 bands up to 5 m.

Journal ArticleDOI
09 Jun 2022-Data
TL;DR: A new dataset of burned areas in Indonesia is presented; it was collected from several regions of the country, consists of 227 images with a size of 512 × 512 pixels, and can be used to train and evaluate deep learning models for image detection, segmentation, and classification tasks related to burned area mapping.
Abstract: Wildland fire is one of the main causes of deforestation, and it has an important impact on atmospheric emissions, notably CO2. It occurs almost every year in Indonesia, especially during the dry season. Therefore, it is necessary to identify the burned areas from remote sensing images to establish the zoning map of areas prone to wildland fires. Many methods have been developed for mapping burned areas from low-resolution to medium-resolution satellite images. One of the popular approaches for mapping tasks is a deep learning approach using U-Net architecture. However, it needs a large amount of representative training data to develop the model. In this paper, we present a new dataset of burned areas in Indonesia for training or evaluating the U-Net model. We delineate burned areas manually by visual interpretation of Landsat-8 satellite images. The dataset is collected from several regions in Indonesia, and it consists of 227 images with a size of 512 × 512 pixels. Each image contains one or more burn scars or only the background, along with its labeled mask. The dataset can be used to train and evaluate the deep learning model for image detection, segmentation, and classification tasks related to burned area mapping.

Journal ArticleDOI
31 Oct 2022-Data
TL;DR: In this article, a hybrid deep learning model is proposed to predict the future price of cryptocurrencies, integrating a 1-dimensional convolutional neural network and a stacked gated recurrent unit (1DCNN-GRU).
Abstract: Virtual currencies have been declared as one of the financial assets that are widely recognized as exchange currencies. The cryptocurrency trades caught the attention of investors as cryptocurrencies can be considered as highly profitable investments. To optimize the profit of the cryptocurrency investments, accurate price prediction is essential. In view of the fact that the price prediction is a time series task, a hybrid deep learning model is proposed to predict the future price of the cryptocurrency. The hybrid model integrates a 1-dimensional convolutional neural network and stacked gated recurrent unit (1DCNN-GRU). Given the cryptocurrency price data over time, the 1-dimensional convolutional neural network encodes the data into a high-level discriminative representation. Subsequently, the stacked gated recurrent unit captures the long-range dependencies of the representation. The proposed hybrid model was evaluated on three different cryptocurrency datasets, namely Bitcoin, Ethereum, and Ripple. Experimental results demonstrated that the proposed 1DCNN-GRU model outperformed the existing methods with the lowest RMSE values of 43.933 on the Bitcoin dataset, 3.511 on the Ethereum dataset, and 0.00128 on the Ripple dataset.
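A minimal sketch of the 1DCNN-GRU architecture described above, written with Keras, might look as follows; the window length, filter counts, and unit sizes are placeholder assumptions rather than the authors' tuned configuration.

```python
# Sketch: a 1-D CNN followed by stacked GRUs for next-step price prediction.
# Hyperparameters are placeholders, not the paper's tuned values.
import tensorflow as tf
from tensorflow.keras import layers, models

WINDOW = 30   # days of past prices fed to the model (assumption)

model = models.Sequential([
    layers.Input(shape=(WINDOW, 1)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),   # local pattern encoder
    layers.MaxPooling1D(pool_size=2),
    layers.GRU(64, return_sequences=True),                  # stacked GRUs capture
    layers.GRU(32),                                         # long-range dependencies
    layers.Dense(1),                                        # next-day price
])

model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.summary()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20)
```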

Journal ArticleDOI
10 Jan 2022-Data
TL;DR: In this paper, the authors used a systematic literature review (SLR) technique with PRISMA procedures, an analytical hierarchy process, and expert interviews to identify the components of the KM model for smart campuses.
Abstract: The application of smart campuses (SC), especially at higher education institutions (HEI) in Indonesia, is very diverse, and does not yet have standards. As a result, SC practice is spread across various areas in an unstructured and uneven manner. KM is one of the critical components of SC. However, the use of KM to support SC is less clearly discussed. Most implementations and assumptions still consider the latest IT application as the SC component. As such, this study aims to identify the components of the KM model for SC. This study used a systematic literature review (SLR) technique with PRISMA procedures, an analytical hierarchy process, and expert interviews. SLR is used to identify the components of the conceptual model, and AHP is used for model priority component analysis. Interviews were used for validation and model development. The results show that KM, IoT, and big data have the highest trends. Governance, people, and smart education have the highest trends. IT is the highest priority component. The KM model for SC has five main layers grouped in phases of the system cycle. This cycle describes the organization’s intellectual ability to adapt in achieving SC indicators. The knowledge cycle at HEIs focuses on education, research, and community service.
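The AHP step used for prioritising model components reduces to deriving a priority vector from a pairwise comparison matrix and checking its consistency. The sketch below works through that calculation on a hypothetical 3x3 comparison of IT, governance, and people; the comparison values are illustrative assumptions, not the study's expert judgements.

```python
# Sketch: AHP priority weights from a pairwise comparison matrix
# (principal eigenvector method) plus the consistency ratio.
# Comparison values are hypothetical, not the study's expert judgements.
import numpy as np

components = ["IT", "Governance", "People"]
# A[i, j] = how much more important component i is than component j (Saaty scale).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Consistency index / ratio (RI = 0.58 is Saaty's random index for n = 3).
n = A.shape[0]
ci = (eigvals[k].real - n) / (n - 1)
cr = ci / 0.58

for name, w in zip(components, weights):
    print(f"{name:<11} weight = {w:.3f}")
print(f"consistency ratio = {cr:.3f}  (acceptable if < 0.1)")
```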

Journal ArticleDOI
14 Apr 2022-Data
TL;DR: In this paper, the HAGDAVS dataset is presented, fusing the RGB spectral channels and a Digital Surface Model (DSM) for the detection and segmentation of vehicles from aerial drone images, including three vehicle classes: cars, motorcycles, and ghosts (motorcycle or car).
Abstract: Detection and Semantic Segmentation of vehicles in drone aerial orthomosaics has applications in a variety of fields such as security, traffic and parking management, urban planning, logistics, and transportation, among many others. This paper presents the HAGDAVS dataset fusing the RGB spectral channel and Digital Surface Model (DSM) for the detection and segmentation of vehicles from aerial drone images, including three vehicle classes: cars, motorcycles, and ghosts (motorcycle or car). We supply the DSM as an additional variable to be included in deep learning and computer vision models to increase their accuracy. RGB orthomosaic, RG-DSM fusion, and multi-label mask are provided in Tag Image File Format. Geo-located vehicle bounding boxes are provided in GeoJSON vector format. We also describe the acquisition of drone data, the derived products, and the workflow to produce the dataset. Researchers would benefit from using the proposed dataset to improve results in the case of vehicle occlusion, geo-location, and the need for cleaning ghost vehicles. As far as we know, this is the first openly available dataset for vehicle detection and segmentation, comprising RG-DSM drone data fusion and different color masks for motorcycles, cars, and ghosts.

Journal ArticleDOI
24 May 2022-Data
TL;DR: The aim of this research was to build the HateMotiv corpus, a freely available dataset annotated for types of hate crimes and the motivations behind committing them, providing the research community with a unique, novel, and reliable resource.
Abstract: With the rapidly increasing use of social media platforms, much of our lives is spent online. Despite the great advantages of using social media, unfortunately, the spread of hate, cyberbullying, harassment, and trolling can be very common online. Many extremists use social media platforms to communicate their messages of hatred and spread violence, which may result in serious psychological consequences and even contribute to real-world violence. Thus, the aim of this research was to build the HateMotiv corpus, a freely available dataset that is annotated for types of hate crimes and the motivation behind committing them. The dataset was developed using Twitter as an example of social media platforms and could provide the research community with a very unique, novel, and reliable dataset. The dataset is unique as a consequence of its topic-specific nature and its detailed annotation. The corpus was annotated by two annotators who are experts in annotation based on unified guidelines, so they were able to produce an annotation of a high standard with F-scores for the agreement rate as high as 0.66 and 0.71 for type and motivation labels of hate crimes, respectively.

Journal ArticleDOI
11 Apr 2022-Data
TL;DR: A collection of thirty mathematical functions that can be used for optimization purposes is presented and investigated in detail to determine whether each problem is better suited to a metaheuristic approach or to a gradient-based mathematical method.
Abstract: A collection of thirty mathematical functions that can be used for optimization purposes is presented and investigated in detail. The functions are defined in multiple dimensions, for any number of dimensions, and can be used as benchmark functions for unconstrained multidimensional single-objective optimization problems. The functions feature a wide variability in terms of complexity. We investigate the performance of three optimization algorithms on the functions: two metaheuristic algorithms, namely Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), and one mathematical algorithm, Sequential Quadratic Programming (SQP). All implementations are done in MATLAB, with full source code availability. The focus of the study is on the objective functions, the optimization algorithms used, and their suitability for solving each problem. We use the three optimization methods to investigate the difficulty and complexity of each problem and to determine whether the problem is better suited for a metaheuristic approach or for a gradient-based mathematical method. We also investigate how increasing the dimensionality affects the difficulty of each problem and the performance of the optimizers. There are functions that are extremely difficult to optimize efficiently, especially for higher dimensions. Such examples are the last two new objective functions, F29 and F30, which are very hard to optimize, although the optimum point is clearly visible, at least in the two-dimensional case.
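To make the benchmarking setup concrete, the sketch below defines one simple multidimensional test function and minimises it with SciPy's SLSQP solver (a sequential quadratic programming method, comparable in spirit to the SQP solver the authors use in MATLAB); the Rastrigin function shown is a generic illustration, not one of the paper's F1-F30.

```python
# Sketch: minimise an n-dimensional benchmark function with a gradient-based
# SQP-type method. The function and starting point are illustrative only.
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    """Classic multimodal benchmark, defined for any number of dimensions."""
    x = np.asarray(x)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

dim = 10
x0 = np.full(dim, 2.0)                       # deliberately poor starting point
result = minimize(rastrigin, x0, method="SLSQP",
                  bounds=[(-5.12, 5.12)] * dim)

# A gradient-based method typically stops in a nearby local minimum here,
# illustrating why metaheuristics can be preferable on multimodal functions.
print("best value found:", result.fun)
print("global optimum:   0.0 at the origin")
```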

Journal ArticleDOI
31 Mar 2022-Data
TL;DR: In this paper, the authors focus on the specific contribution of the OpenStreetMap (OSM) project to address the early stage of the COVID-19 crisis (approximately from February to May 2020) in Italy.
Abstract: Data and digital technologies have been at the core of the societal response to COVID-19 since the beginning of the pandemic. This work focuses on the specific contribution of the OpenStreetMap (OSM) project to address the early stage of the COVID-19 crisis (approximately from February to May 2020) in Italy. Several activities initiated by the Italian OSM community are described, including: mapping ‘red zones’ (the first municipalities affected by the emergency); updating OSM pharmacies based on the authoritative dataset from the Ministry of Health; adding information on delivery services of commercial activities during COVID-19 times; publishing web maps to offer COVID-19-specific information at the local level; and developing software tools to help collect new data. Those initiatives are analysed from a data ecosystem perspective, identifying the actors, data and data flows involved, and reflecting on the enablers and barriers for their success from a technical, organisational and legal point of view. The OSM project itself is then assessed in the wider European policy context, in particular against the objectives of the recent European strategy for data, highlighting opportunities and challenges for scaling successful approaches such as those to fight COVID-19 from the local to the national and European scales.

Journal ArticleDOI
27 Jan 2022-Data
TL;DR: In this article, data on sustainable building criteria are collected and reviewed, validated by applying both a Delphi consensus and a general consensus method, and the two consensus methods are compared.
Abstract: Data collection and review are the building blocks of academic research regardless of the discipline. The gathered and reviewed data, however, need to be validated in order to obtain accurate information. The Delphi consensus is known as a method for validating the data. However, several studies have shown that this method is time-consuming and requires a number of rounds to complete. Until now, there has been no clear evidence that validating data by a Delphi consensus is more significant than by a general consensus. In this regard, if data validation between both methods are not significantly different, then just using a general consensus method is sufficient, easier, and less time-consuming. Hence, this study aims to find out whether or not data validation by a Delphi consensus method is more significant than by a general consensus method. This study firstly collected and reviewed the data of sustainable building criteria, secondly validated these data by applying each consensus method, and finally made a comparison between both consensus methods. The results showed that seventeen of the valid criteria obtained from the general consensus and reduced by the Delphi consensus were found to be inconsistent for sustainable building assessments in Cambodia. Therefore, this study concludes that using the Delphi consensus method is more significant in validating the gathered and reviewed data. This experiment contributes to the selection and application of consensus methods in validating data, information, or criteria, especially in engineering fields.

Journal ArticleDOI
20 Jan 2022-Data
TL;DR: In this paper, the authors evaluate and provide examples of case studies currently using PDI and use its long-term continental US database (18 locations and 24 years) to test cover crop and grazing effects on soil organic carbon (SOC) storage. Legume and rye (Secale cereale L.) cover crops increased SOC storage by 36% and 50%, respectively, compared with oat (Avena sativa L.) and rye mixtures, while low and high grazing intensities improved the upper SOC by 69–72% compared with a medium grazing intensity.
Abstract: Combining data into a centralized, searchable, and linked platform will provide a data exploration platform to agricultural stakeholders and researchers for better agricultural decision making, thus fully utilizing existing data and preventing redundant research. Such a data repository requires readiness to share data, knowledge, and skillsets and working with Big Data infrastructures. With the adoption of new technologies and increased data collection, agricultural workforces need to update their knowledge, skills, and abilities. The partnerships for data innovation (PDI) effort integrates agricultural data by efficiently capturing them from field, lab, and greenhouse studies using a variety of sensors, tools, and apps and provides a quick visualization and summary of statistics for real-time decision making. This paper aims to evaluate and provide examples of case studies currently using PDI and use its long-term continental US database (18 locations and 24 years) to test the cover crop and grazing effects on soil organic carbon (SOC) storage. The results show that legume and rye (Secale cereale L.) cover crops increased SOC storage by 36% and 50%, respectively, compared with oat (Avena sativa L.) and rye mixtures, while low and high grazing intensities improved the upper SOC by 69–72% compared with a medium grazing intensity. This was likely due to legumes providing a more favorable substrate for SOC formation and high grazing intensity systems having continuous manure deposition. Overall, PDI can be used to democratize data regionally and nationally and therefore can address large-scale research questions aimed at addressing agricultural grand challenges.

Journal ArticleDOI
14 Sep 2022-Data
TL;DR: In this paper, a multilinear regression is performed to correlate retention indices (RIs) and response factors (RFs) with structural properties, which can be used together with the detailed hydrocarbon analysis (DHA) method and be expanded further.
Abstract: The replacement of fossil carbon sources with green bio-oils promotes the importance of several hundred oxygenated hydrocarbons, which substantially increases the analytical effort in catalysis research. A multilinear regression is performed to correlate retention indices (RIs) and response factors (RFs) with structural properties. The model includes a variety of possible products formed during the hydrodeoxygenation of bio-oils with good accuracy (R² = 0.921 for RFs and R² = 0.975 for RIs). The GC parameters are related to the detailed hydrocarbon analysis (DHA) method, which is commonly used for non-oxygenated hydrocarbons. The RIs are determined from a paraffin standard (C5–C15), and the RFs are calculated with ethanol and 1,3,5-trimethylbenzene as internal standards. The method presented here can, therefore, be used together with the DHA method and be expanded further. In addition to the multilinear regression, an increment system has been developed for aromatic oxygenates, which further improves the prediction accuracy of the response factors with respect to the molecular constitution (R² = 0.958). Both predictive models are designed exclusively on structural factors to ensure effortless application. All experimental RIs and RFs are determined under identical conditions. Moreover, a folded Plackett–Burman screening design demonstrates the general applicability of the datasets independent of method- or device-specific parameters.
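The core of the method, a multilinear regression of retention indices (and, analogously, response factors) on structural descriptors, can be sketched as follows; the descriptor choice and the tiny synthetic training table are assumptions standing in for the authors' measured dataset.

```python
# Sketch: multilinear regression of retention indices on structural
# descriptors (carbon count, oxygen count, aromatic ring count).
# The training rows are synthetic placeholders, not measured values.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: n_carbon, n_oxygen, n_aromatic_rings
X = np.array([
    [5, 0, 0], [6, 1, 0], [7, 1, 1], [8, 2, 1],
    [9, 1, 1], [10, 2, 1], [6, 0, 1], [8, 1, 0],
], dtype=float)
retention_index = np.array([500, 640, 880, 1010, 1090, 1230, 760, 830], dtype=float)

model = LinearRegression().fit(X, retention_index)
print("coefficients:", model.coef_)
print("R^2 on training data:", model.score(X, retention_index))

# Predict the RI of a new, hypothetical oxygenated aromatic compound.
print("predicted RI:", model.predict([[7, 2, 1]])[0])
```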

Journal ArticleDOI
07 Apr 2022-Data
TL;DR: The aim of the FakeAds corpus is to study the impact of fake news and false information in advertising and marketing materials for specific products, and to determine which types of products are targeted most on Twitter to draw the attention of consumers.
Abstract: Nowadays, an increasing portion of our lives is spent interacting online through social media platforms, thanks to the widespread adoption of the latest technology and the proliferation of smartphones. Obtaining news from social media platforms is fast, easy, and less expensive compared with other traditional media platforms, e.g., television and newspapers. Therefore, social media is now being exploited to disseminate fake news and false information. This research aims to build the FakeAds corpus, which consists of tweets for product advertisements. The aim of the FakeAds corpus is to study the impact of fake news and false information in advertising and marketing materials for specific products and which types of products (i.e., cosmetics, health, fashion, or electronics) are targeted most on Twitter to draw the attention of consumers. The corpus is unique and novel, in terms of the very specific topic (i.e., the role of Twitter in disseminating fake news related to product promotion and advertisement) and also in terms of its fine-grained annotations. The annotation guidelines were designed with guidance by a domain expert, and the annotation is performed by two domain experts, resulting in a high-quality annotation, with agreement rate F-scores as high as 0.815.

Journal ArticleDOI
09 Apr 2022-Data
TL;DR: Although the OSM dataset is the fundamental and most crucial one used for modeling, the machine learning algorithm’s training was performed on a dataset that was prepared by combining several features from three other datasets, and the results were validated through a comparison with publicly available statistical data.
Abstract: Details on building levels play an essential part in a number of real-world application models. Energy systems, telecommunications, disaster management, the internet-of-things, health care, and marketing are a few of the many applications that require building information. The essential variables that most of these models require are building type, house type, area of living space, and number of residents. In order to acquire some of this information, this paper introduces a methodology and generates corresponding data. The study was conducted for specific applications in energy system modeling. Nonetheless, these data can also be used in other applications. Building locations and some of their details are openly available in the form of map data from OpenStreetMap (OSM). However, data regarding building types (i.e., residential, industrial, office, single-family house, multi-family house, etc.) are only partially available in the OSM dataset. Therefore, a machine learning classification algorithm for predicting the building types on the basis of the OSM buildings’ data was introduced. Although the OSM dataset is the fundamental and most crucial one used for modeling, the machine learning algorithm’s training was performed on a dataset that was prepared by combining several features from three other datasets. The generated dataset consists of approximately 29 million buildings, of which about 19 million are residential, with 72% being single-family houses and the rest multi-family ones that include two-family houses and apartment buildings. Furthermore, the results were validated through a comparison with publicly available statistical data. The comparison of the resulting data with official statistics reveals that there is a percentage error of 3.64% for residential buildings, 13.14% for single-family houses, and −15.38% for multi-family houses classification. Nevertheless, by incorporating the building types, this dataset is able to complement existing building information in studies in which building type information is crucial.
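A compressed sketch of the classification step, predicting a building's type from a few footprint-level attributes, is given below; the feature set, the synthetic data, and the random-forest choice are assumptions for illustration, not the exact features or pipeline of the paper.

```python
# Sketch: predict building type (single-family, multi-family, non-residential)
# from simple footprint features. Data and features are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
# features: footprint area (m^2), number of levels, distance to city centre (km)
X = np.column_stack([
    rng.uniform(40, 2000, n),
    rng.integers(1, 12, n),
    rng.uniform(0, 25, n),
])
y = np.where(X[:, 0] > 800, "non_residential",
     np.where(X[:, 1] > 3, "multi_family", "single_family"))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```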

Journal ArticleDOI
25 Jan 2022-Data
TL;DR: This corpus allows the evaluation and comparison of semantic labeling and modeling approaches across different methodologies, and it is the first corpus that additionally allows textual data documentation to be leveraged for semantic labeling and modeling.
Abstract: Ontology-based data management and knowledge graphs have emerged in recent years as efficient approaches for managing and utilizing diverse and large data sets. In this regard, research on algorithms for automatic semantic labeling and modeling as a prerequisite for both has made steady progress in the form of new approaches. The range of algorithms varies in the type of information used (data schema, values, or metadata), as well as in the underlying methodology (e.g., use of different machine learning methods or external knowledge bases). Approaches that have been established over the years, however, still come with various weaknesses. Most approaches are evaluated on a few small data corpora specific to the approach. This reduces comparability and also limits statements on the general applicability and performance of those approaches. Other research areas, such as computer vision or natural language processing, solve this problem by providing unified data corpora for the evaluation of specific algorithms and tasks. In this paper, we present and publish VC-SLAM to lay the necessary foundation for future research. This corpus allows the evaluation and comparison of semantic labeling and modeling approaches across different methodologies, and it is the first corpus that additionally allows textual data documentation to be leveraged for semantic labeling and modeling. Each of the contained 101 data sets consists of labels, data and metadata, as well as corresponding semantic labels and a semantic model that were manually created by human experts using an ontology that was explicitly built for the corpus. We provide statistical information about the corpus as well as a critical discussion of its strengths and shortcomings, and test the corpus with existing methods for labeling and modeling.

Journal ArticleDOI
30 Jul 2022-Data
TL;DR: In this paper, the authors evaluated the accuracy and suitability of weather forecasts of two parameters, namely temperature and humidity, from the OpenWeatherMap API (an online weather platform) and compared them with actual measurements collected from Brazilian weather stations (INMET).
Abstract: Certain weather conditions are inadvertently related to increased populations of various mosquitoes. In order to predict the burden of mosquito populations in the Global South, it is imperative to integrate weather-related risk factors into such predictive models. There are a lot of online open-source weather platforms that provide historical, current and future weather forecasts which can be utilised for general predictions, and these electronic sources serve as an alternate option for weather data when physical weather stations are inaccessible (or inactive). Before using data from such online sources, it is important to assess their accuracy against some baseline measure. In this paper, we therefore evaluated the accuracy and suitability of weather forecasts of two parameters, namely temperature and humidity, from the OpenWeatherMap API (an online weather platform) and compared them with actual measurements collected from the Brazilian weather stations (INMET). The evaluation was focused on two Brazilian cities, namely, Recife and Campina Grande. The intention is to prepare an early warning model which will harness data from the OpenWeatherMap API for mosquito prediction.
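The comparison described above reduces to aligning the forecast series with the station measurements and computing error statistics; the sketch below assumes two already-downloaded CSV files with a shared timestamp column, which is an illustrative layout rather than the actual OpenWeatherMap or INMET formats.

```python
# Sketch: compare forecast temperature/humidity against station measurements.
# File names and column layout are assumptions, not the real data formats.
import numpy as np
import pandas as pd

forecast = pd.read_csv("openweathermap_recife.csv", parse_dates=["timestamp"])
station = pd.read_csv("inmet_recife.csv", parse_dates=["timestamp"])

# Align the two series on shared timestamps.
merged = forecast.merge(station, on="timestamp", suffixes=("_fc", "_obs"))

for var in ["temperature", "humidity"]:
    err = merged[f"{var}_fc"] - merged[f"{var}_obs"]
    mae = err.abs().mean()
    rmse = np.sqrt((err ** 2).mean())
    bias = err.mean()
    print(f"{var:>11}: MAE={mae:.2f}  RMSE={rmse:.2f}  bias={bias:+.2f}")
```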

Journal ArticleDOI
22 Jun 2022-Data
TL;DR: This research presents the first Instagram Arabic corpus (sub-class categorization (multi-class)) focusing on cyberbullying, and shows that the SVM classifier outperforms the other classifiers.
Abstract: (1) Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies targeted social media to collect their datasets with the aim of automatically detecting offensive language. However, the majority of the datasets were in English, not in Arabic. Even among the few Arabic datasets that were collected, none focused on Instagram, despite it being a major social media platform in the Arab world. (2) Methods: we use the official Instagram APIs to collect our dataset. To consider the dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter-annotator agreement (IAA), as well as examine and evaluate the performance of various learning models (LR, SVM, RFC, and MNB). (3) Results: in this research, we present the first Instagram Arabic corpus (sub-class categorization (multi-class)) focusing on cyberbullying. The dataset is primarily designed for the purpose of detecting offensive language in texts. We end up with 200,000 comments, of which 46,898 comments were annotated by three human annotators. The results show that the SVM classifier outperforms the other classifiers, with an F1 score of 69% for bullying comments and 85% for positive comments.
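A minimal sketch of the best-performing setup reported above (an SVM text classifier evaluated with F1) is shown below; the tiny set of example comments and the TF-IDF plus LinearSVC pipeline are assumptions used to illustrate the approach, not the authors' exact features or data.

```python
# Sketch: TF-IDF features + linear SVM for bullying vs. positive comments,
# evaluated with F1 as in the paper. The example comments are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

comments = ["تعليق إيجابي لطيف", "كلام مسيء وتنمر", "أحسنت عمل رائع", "أنت فاشل"]
labels = ["positive", "bullying", "positive", "bullying"]

# In practice the corpus has ~47k annotated comments; four are shown here.
X_tr, X_te, y_tr, y_te = train_test_split(comments, labels, test_size=0.5,
                                          random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_tr, y_tr)

print("F1 (bullying):", f1_score(y_te, clf.predict(X_te), pos_label="bullying"))
```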