
Showing papers on "Data warehouse" published in 2017


Journal ArticleDOI
TL;DR: Most of the existing research on big data focuses on consumer discretionary, followed by public administration, and these studies give little attention to the tools used for the analysis; to address this gap, the paper discusses the evolution, types and usage of big data tools.
Abstract: The importance of data science and big data analytics is growing very fast as organizations gear up to leverage their information assets to gain competitive advantage. The flexibility offered through big data analytics empowers functional as well as firm-level performance. In the first phase of the study, we analyze the research on big data published in high-quality business management journals. The analysis was visualized using big data and text mining tools to understand the dominant themes and how they are connected. Subsequently, an industry-specific categorization of the studies was done to understand the key use cases. It was found that most of the existing research focuses mainly on consumer discretionary, followed by public administration. Methodologically, the major focus in such exploration is on social media analytics, text mining and machine learning applications for meeting objectives in marketing and supply chain management. However, these studies place little emphasis on the tools used for the analysis. To address this gap, this study also discusses the evolution, types and usage of big data tools. A brief overview of big data technologies, grouped by the services they enable, and some of their applications is presented. The study categorizes these tools into big data analysis platforms, databases and data warehouses, programming languages, search tools, and data aggregation and transfer tools. Finally, based on the review, future directions for exploration in big data are provided for academia and practice.

136 citations


Journal ArticleDOI
TL;DR: The value of BIM is increased by rules extracted from O&M-phase data that appear irregular and disordered, leading to improvements in resource usage and maintenance efficiency during the O&M phase.

78 citations


Journal ArticleDOI
TL;DR: This paper explores how big data technology can be combined with a data warehouse to support the decision-making process and proposes Hadoop as the big data analytics tool for data ingestion/staging.
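
A minimal PySpark sketch of the kind of Hadoop-based ingestion/staging step the TL;DR describes, feeding a warehouse layer; the paths, field names and file formats are illustrative assumptions, not details from the paper.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("staging-to-warehouse").getOrCreate()

# Ingest raw, semi-structured events from the Hadoop staging area (paths are hypothetical).
raw = spark.read.json("hdfs:///staging/sales/2017/*.json")

# Light cleansing before the data is handed to the warehouse layer.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_ts"))
)

# Persist as partitioned Parquet so the decision-support/BI layer can query it.
cleaned.write.mode("overwrite").partitionBy("order_date").parquet("hdfs:///warehouse/sales")
```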

73 citations


Journal ArticleDOI
TL;DR: The resulting standardized costs contained in the data warehouse can be used to create detailed, bottom-up analyses of professional and facility costs of procedures, medical conditions, and patient care cycles without revealing business-sensitive information.
Abstract: Research addressing value in healthcare requires a measure of cost. While there are many sources and types of cost data, each has strengths and weaknesses. Many researchers appear to create study-specific cost datasets, but the explanations of their costing methodologies are not always clear, causing their results to be difficult to interpret. Our solution, described in this paper, was to use widely accepted costing methodologies to create a service-level, standardized healthcare cost data warehouse from an institutional perspective that includes all professional and hospital-billed services for our patients. The warehouse is based on a National Institutes of Health–funded research infrastructure containing the linked health records and medical care administrative data of two healthcare providers and their affiliated hospitals. Since all patients are identified in the data warehouse, their costs can be linked to other systems and databases, such as electronic health records, tumor registries, and disease or treatment registries. We describe the two institutions’ administrative source data; the reference files, which include Medicare fee schedules and cost reports; the process of creating standardized costs; and the warehouse structure. The costing algorithm can create inflation-adjusted standardized costs at the service line level for defined study cohorts on request. The resulting standardized costs contained in the data warehouse can be used to create detailed, bottom-up analyses of professional and facility costs of procedures, medical conditions, and patient care cycles without revealing business-sensitive information. After its creation, a standardized cost data warehouse is relatively easy to maintain and can be expanded to include data from other providers. Individual investigators who may not have sufficient knowledge about administrative data do not have to try to create their own standardized costs on a project-by-project basis because our data warehouse generates standardized costs for defined cohorts upon request.
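
A toy pandas sketch of the kind of standardized-costing step described above: join billed services to a reference fee schedule, then inflation-adjust to a common base year. The column names, adjustment factors and data are illustrative assumptions, not the warehouse's actual algorithm.

```python
import pandas as pd

# Billed services for a study cohort (toy data; real inputs come from administrative systems).
services = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "hcpcs":      ["99213", "71045", "99213"],
    "year":       [2015, 2016, 2016],
})

# Reference file: Medicare-style fee schedule giving a standardized cost per service code.
fee_schedule = pd.DataFrame({
    "hcpcs":    ["99213", "71045"],
    "std_cost": [73.0, 25.0],
})

# Inflation adjustment factors to a common base year (values are made up).
adjust_to_2017 = {2015: 1.03, 2016: 1.01}

costed = services.merge(fee_schedule, on="hcpcs", how="left")
costed["std_cost_2017"] = costed["std_cost"] * costed["year"].map(adjust_to_2017)

# Aggregate to the patient level for cohort-level, bottom-up cost analyses.
print(costed.groupby("patient_id")["std_cost_2017"].sum())
```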

72 citations


Patent
04 Jan 2017
TL;DR: The systems and methods described in this article provide highly dynamic and interactive data analysis user interfaces that enable data analysts to quickly and efficiently explore large-volume data sources by applying filters, joining to other tables in a database, and viewing interactive data visualizations.
Abstract: The systems and methods described herein provide highly dynamic and interactive data analysis user interfaces which enable data analysts to quickly and efficiently explore large volume data sources. In particular, a data analysis system, such as described herein, may provide features to enable the data analyst to investigate large volumes of data over many different paths of analysis while maintaining detailed and retraceable steps taken by the data analyst over the course of an investigation, as captured via the data analyst's queries and user interaction with the user interfaces provided by the data analysis system. Data analysis paths may involve exploration of high volume data sets, such as Internet proxy data, which may include trillions of rows of data. The data analyst may pursue a data analysis path that involves, among other things, applying filters, joining to other tables in a database, viewing interactive data visualizations, and so on.

70 citations


Journal ArticleDOI
TL;DR: This paper presents the integration of spatial operations into standardized SQL queries, making BIM data accessible to a wide range of query capabilities, which will allow much better visibility into BIM data for better decision-making processes.
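
A minimal sketch of what such a spatial SQL query over BIM data might look like, issued from Python; the table and column names are assumptions, and ST_3DIntersects is a PostGIS-style spatial predicate rather than the paper's specific implementation.

```python
import psycopg2

# Hypothetical connection to a spatially enabled database holding BIM geometry.
conn = psycopg2.connect("dbname=bim_dw user=analyst")
cur = conn.cursor()

# Find all building elements whose geometry intersects a named space in 3D.
cur.execute(
    """
    SELECT e.element_id, e.element_type
    FROM building_elements AS e
    JOIN spaces AS s ON s.space_name = %s
    WHERE ST_3DIntersects(e.geometry, s.geometry)
    """,
    ("Room 101",),
)

for element_id, element_type in cur.fetchall():
    print(element_id, element_type)

cur.close()
conn.close()
```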

58 citations


Journal ArticleDOI
TL;DR: The numerical solutions for the data-pricing model indicate that the multi-version strategy achieves a better market segmentation and is more profitable and feasible when the multiple dimensions of data quality are considered.

56 citations


Book ChapterDOI
01 Jan 2017
TL;DR: This survey chapter presents a review of the current big data research, exploring applications, opportunities and challenges, as well as the state-of-the-art techniques and underlying models that exploit cloud computing technologies, such as big data-as-a-service (BDaaS) or analytics-as-a-service (AaaS).
Abstract: The proliferation of data warehouses and the rise of multimedia, social media and the Internet of Things (IoT) generate an increasing volume of structured, semi-structured and unstructured data. Towards the investigation of these large volumes of data, big data and data analytics have become emerging research fields, attracting the attention of the academia, industry and governments. Researchers, entrepreneurs, decision makers and problem solvers view ‘big data’ as the tool to revolutionize various industries and sectors, such as business, healthcare, retail, research, education and public administration. In this context, this survey chapter presents a review of the current big data research, exploring applications, opportunities and challenges, as well as the state-of-the-art techniques and underlying models that exploit cloud computing technologies, such as the big data-as-a-service (BDaaS) or analytics-as-a-service (AaaS).

53 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: The proposal focuses on predictive maintenance of production systems, including manufacturing machines and tools, to increase production process quality; it utilises production data storage built on the Hadoop framework and NoSQL systems, integrated into a traditional data warehouse discovery platform.
Abstract: In this paper, we describe an approach to building a Hadoop-based knowledge discovery platform. The proposal focuses on predictive maintenance of production systems, including manufacturing machines and tools, to increase production process quality. It utilises production data storage built on the Hadoop framework and NoSQL systems, integrated into a traditional data warehouse discovery platform, preserving the well-proven and robust decision support and analytic tools of the data warehouse. An initial proof-of-concept case study is included in the paper.

48 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper proposes a framework, based on recent research, for data mining using big data; it builds on a strong body of work in data integration, mapping and transformations aimed at automated, error-free difference resolution.
Abstract: In the information technology world, the ability to effectively process massive datasets has become integral to a broad range of scientific and other academic disciplines. We are living in an era of data deluge, and as a result the term “Big Data” appears in many contexts, ranging from meteorology, genomics, complex physics simulations, biological and environmental research, finance and business to healthcare. Big Data refers to data streams of higher velocity and higher variety. The infrastructure required to support the acquisition of Big Data must deliver low, predictable latency both in capturing data and in executing short, simple queries; it must be able to handle very high transaction volumes, often in a distributed environment, and support flexible, dynamic data structures. Data processing is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner, which requires differences in data structure and semantics to be expressed in forms that are computer-understandable and then “robotically” resolvable. There is a strong body of work in data integration, mapping and transformations; however, considerable additional work is required to achieve automated, error-free difference resolution. This paper proposes a framework based on recent research for data mining using big data.

41 citations


Proceedings Article
01 Jan 2017
TL;DR: To the authors' knowledge, this is the first complete description of a methodology for electronic patient data acquisition and provisioning that ignores data harmonization at the time of initial storage in favor of downstream transformation to address specific research questions and applications.
Abstract: Academic medical centers commonly approach secondary use of electronic health record (EHR) data by implementing centralized clinical data warehouses (CDWs). However, CDWs require extensive resources to model data dimensions and harmonize clinical terminology, which can hinder effective support of the specific and varied data needs of investigators. We hypothesized that an approach that aggregates raw data from source systems, ignores the initial modeling typical of CDWs, and transforms raw data for specific research purposes would meet investigator needs. The approach has successfully enabled multiple tools that provide utility to the institutional research enterprise. To our knowledge, this is the first complete description of a methodology for electronic patient data acquisition and provisioning that ignores data harmonization at the time of initial storage in favor of downstream transformation to address specific research questions and applications.

Journal ArticleDOI
TL;DR: The characteristics, trends and challenges of big data are introduced, and the benefits and risks that may arise from the integration of big data and cloud computing are investigated.
Abstract: Big data is currently one of the most critical emerging technologies. Big data is used as a concept that refers to the inability of traditional data architectures to efficiently handle new data sets. The 4 V’s of big data – volume, velocity, variety and veracity – make data management and analytics challenging for traditional data warehouses. It is important to think of big data and analytics together: big data is the term used to describe the recent explosion of different types of data from disparate sources, while analytics is about examining data to derive interesting and relevant trends and patterns, which can be used to inform decisions, optimize processes, and even drive new business models. Cloud computing seems to be a perfect vehicle for hosting big data workloads. However, working on big data in the cloud brings its own challenge of reconciling two contradictory design principles: cloud computing is based on the concepts of consolidation and resource pooling, whereas big data systems (such as Hadoop) are built on the shared-nothing principle, where each node is independent and self-sufficient. By integrating big data with cloud computing technologies, businesses and educational institutes can gain a better direction for the future. The capability to store large amounts of data in different forms and process it all at very high speed will result in data that can guide businesses and educational institutes in developing quickly. Nevertheless, there is a large concern regarding privacy and security when moving to the cloud, which is the main reason why businesses and educational institutes hesitate to do so. This paper introduces the characteristics, trends and challenges of big data, and investigates the benefits and risks that may arise from the integration of big data and cloud computing.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: The aim of this survey paper is to provide an overview of big data analytics, its issues and challenges, and the various technologies related to big data.
Abstract: In recent years, internet applications and communication have seen a lot of development and gained reputation in the field of information technology. These internet applications and communications continually generate data of large size, great variety and genuinely complex, multifaceted structure, called big data. As a consequence, we are now in an era of massive automatic data collection, systematically obtaining many measurements without knowing which ones will be relevant to the phenomenon of interest. For example, e-commerce transactions include activities such as online buying, selling or investing, and thus generate data that are high-dimensional and complex in structure. Traditional data storage techniques are not adequate to store and analyse such huge volumes of data. Many researchers are working on dimensionality reduction of big data for more effective analytics and data visualization. Hence, the aim of this survey paper is to provide an overview of big data analytics, its issues and challenges, and the various technologies related to big data.

Patent
01 Dec 2017
TL;DR: In this article, a feature acquirer is constructed to process user portrait data, application list data and client-reported data to obtain standardized feature vectors that meet mathematical modeling requirements; various basic recommendation models are used to make predictions and generate a primary user application recommendation list with corresponding download probabilities.
Abstract: The invention provides a game recommendation method and system based on user portrait behavior analysis. A feature acquirer is constructed to process user portrait data, application list data and client-reported data to obtain standardized feature vectors that meet mathematical modeling requirements; various basic recommendation models are used to make predictions and generate a primary user application recommendation list with corresponding download probabilities; and a final application recommendation list is generated by combining the download probabilities with a fusion model trained on actual labels. User historical behavior logs are subjected to multidimensional analysis for feature extraction in order to construct a user portrait data warehouse. A long short-term memory network is introduced into the basic recommendation models to learn the time-series relationships of user behaviors, so that users' degrees of preference for items are better captured and the recommended game applications better match users' needs. Ensemble learning is added to perform model fusion, integrating the learning results of the individual models and thereby improving the stability and generalization ability of the recommendation algorithm.
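
A minimal Keras sketch of the long short-term memory component described above, which scores candidate apps from a user's behaviour sequence; the catalogue size, layer widths and toy data are illustrative assumptions, not the patent's actual model.

```python
import numpy as np
from tensorflow.keras import layers, models

n_items, seq_len = 5000, 30        # app catalogue size and behaviour-history length (assumed)

model = models.Sequential([
    layers.Embedding(input_dim=n_items, output_dim=64),   # embed app/item identifiers
    layers.LSTM(128),                                      # learn time-series behaviour patterns
    layers.Dense(n_items, activation="softmax"),           # download probability per candidate app
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy training data: sequences of item indices and the next item the user downloaded.
histories = np.random.randint(0, n_items, size=(256, seq_len))
next_item = np.random.randint(0, n_items, size=(256,))
model.fit(histories, next_item, epochs=1, batch_size=32)

# The resulting per-item probabilities could then be blended with other base models
# in an ensemble (model-fusion) step, as the abstract describes.
scores = model.predict(histories[:1])
```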

Journal ArticleDOI
29 Mar 2017
TL;DR: This paper summarizes a demonstration case in which the implementation of a specific Big Data architecture shows how the evolution from a traditional Data Warehouse to a Big Data Warehouse is possible.
Abstract: In the era of Big Data, many NoSQL databases emerged for the storage and later processing of vast volumes of data, using data structures that can follow columnar, key-value, document or graph formats ...

Journal ArticleDOI
TL;DR: By developing the Hadoop distributed computing platform and the HBase NoSQL database schema for solar energy, Energy-CRADLE exemplifies an integrated, scalable, secure, and user-friendly data informatics and analytics system for PV researchers.
Abstract: A nonrelational, distributed computing, data warehouse, and analytics environment (Energy-CRADLE) was developed for the analysis of field and laboratory data from multiple heterogeneous photovoltaic (PV) test sites. This data informatics and analytics infrastructure was designed to process diverse formats of PV performance data and climatic telemetry time-series data collected from a PV outdoor test network, i.e., the Solar Durability and Lifetime Extension global SunFarm network, as well as point-in-time laboratory spectral and image measurements of PV material samples. Using Hadoop/HBase for the distributed data warehouse, Energy-CRADLE does not have a predefined data table schema, which enables ingestion of data in diverse and changing formats. For easy data ingestion and data retrieval, Energy-CRADLE utilizes Hadoop streaming to enable Python MapReduce and provides a graphical user interface, i.e., py-CRADLE. By developing the Hadoop distributed computing platform and the HBase NoSQL database schema for solar energy, Energy-CRADLE exemplifies an integrated, scalable, secure, and user-friendly data informatics and analytics system for PV researchers. An example of Energy-CRADLE enabled scalable, data-driven, analytics is presented, where machine learning is used for anomaly detection across 2.2 million real-world current-voltage (I–V) curves of PV modules in three distinct Koppen–Geiger climatic zones.
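
The abstract notes that Energy-CRADLE uses Hadoop streaming to run Python MapReduce jobs. A minimal mapper in that style is sketched below; the CSV layout and the power computation are assumptions for illustration, not the system's actual schema.

```python
#!/usr/bin/env python
# mapper.py -- a Hadoop-streaming style mapper: read telemetry rows from stdin and
# emit (module_id, instantaneous power) pairs; a companion reducer would aggregate them.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) != 4:
        continue                              # skip malformed telemetry rows
    module_id, _timestamp, voltage, current = fields
    try:
        power = float(voltage) * float(current)
    except ValueError:
        continue
    print("%s\t%.4f" % (module_id, power))
```

Such a mapper would typically be launched through the hadoop-streaming jar with its -mapper and -reducer options, so telemetry resident in HDFS or exported from HBase can be processed without a predefined table schema.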

Journal ArticleDOI
TL;DR: A middleware-based schema model supporting the temporally oriented storage of real-time ANT+ sensor data as hierarchical documents is explained, using an algorithm-based approach for flexible evolution of the model in a document-oriented database, i.e., MongoDB.
Abstract: The proliferation of structured, semi-structured and unstructured data has challenged the scalability, flexibility and processability of traditional relational database management systems (RDBMS). Next-generation systems demand horizontal scaling by distributing data over nodes that can be added autonomously to a running system. For schema flexibility, they also need to process and store different data formats while preserving the sequential ordering of the data. NoSQL approaches are solutions to these demands, which is why big data solutions are vital nowadays. In monitoring scenarios, however, sensors transmit data continuously over certain intervals of time, and the temporal factor is the main property of the data. The key research question is therefore to investigate schema flexibility and temporal data integration together: what data modelling should be adopted for a data-driven real-time scenario, so that the data can be stored effectively and the schema evolved accordingly during data integration in NoSQL environments without losing big data advantages? In this paper we explain a middleware-based schema model to support the temporally oriented storage of real-time ANT+ sensor data as hierarchical documents. We explain how to adopt a schema for data integration using an algorithm-based approach for flexible evolution of the model in a document-oriented database, i.e., MongoDB. The proposed model is logical, compact for storage and evolves seamlessly upon new data integration.
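
A minimal pymongo sketch of the kind of temporally bucketed, hierarchical document storage the paper describes for ANT+ sensor data; the database, collection and field names are assumptions rather than the paper's actual model.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
readings = client["sensors"]["ant_plus_readings"]

now = datetime.now(timezone.utc)
sample = {"ts": now, "heart_rate": 72, "cadence": 88}   # one real-time ANT+ reading

# One document per sensor per hour; samples are pushed into an array, so the schema
# can evolve (new sensor types or fields) without altering existing documents.
readings.update_one(
    {"sensor_id": "ant-1234", "bucket": now.strftime("%Y-%m-%dT%H")},
    {"$push": {"samples": sample}},
    upsert=True,
)
```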

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper compares different aspects of some popular ETL tools (Informatica, Datastage, Ab Initio, Oracle Data Integrator, SSIS), analyses their advantages and disadvantages, and highlights some salient features.
Abstract: A data warehouse (DW) is a repository of strategic data from many sources gathered over a long period of time. Traditional DW operations mainly comprise extracting data from multiple sources, transforming these data into a compatible form, and finally loading them into the DW schema for further analysis. The extract-transform-load (ETL) functions need to be incorporated into appropriate tools so that organisations can use them efficiently as required. A wide variety of such tools is available in the market. In this paper, we compare different aspects of some popular ETL tools (Informatica, Datastage, Ab Initio, Oracle Data Integrator, SSIS) and analyse their advantages and disadvantages. We also highlight some of their salient features (performance optimisation, data lineage, real-time data analysis, cost, language binding, etc.) and present them in a comparative study. Apart from the review of the ETL tools, the paper also provides feedback from the data science industry that indicates the market value and relevance of the tools in practical scenarios. The traditional DW concept, however, is expanding rapidly with the advent of big data, cloud computing, real-time data analysis and the growing need to parse information from structured and unstructured data sources. In this paper, we also identify these factors, which are transforming the definition of data warehousing.
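
For readers unfamiliar with the extract-transform-load pattern that these tools implement, a minimal hand-coded sketch is shown below; the file, table and column names are illustrative assumptions.

```python
import csv
import sqlite3

# Extract: read raw rows from a source file.
with open("sales_raw.csv", newline="") as src:
    rows = list(csv.DictReader(src))

# Transform: normalise types and discard incomplete records.
cleaned = [
    (r["order_id"], r["region"].strip().upper(), float(r["amount"]))
    for r in rows
    if r.get("order_id") and r.get("amount")
]

# Load: append into a warehouse staging table.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS stg_sales (order_id TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", cleaned)
con.commit()
con.close()
```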

Proceedings ArticleDOI
09 May 2017
TL;DR: Data CIVILIZER is an end-to-end big data management system with components for data discovery, data integration and stitching, data cleaning, and querying data from a large variety of storage engines, running in large enterprises.
Abstract: Finding relevant data for a specific task from the numerous data sources available in any organization is a daunting task. This is not only because of the number of possible data sources where the data of interest resides, but also due to the data being scattered all over the enterprise and being typically dirty and inconsistent. In practice, data scientists are routinely reporting that the majority (more than 80%) of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. We propose to demonstrate DATA CIVILIZER to ease the pain faced in analyzing data "in the wild". DATA CIVILIZER is an end-to-end big data management system with components for data discovery, data integration and stitching, data cleaning, and querying data from a large variety of storage engines, running in large enterprises.

Patent
22 Feb 2017
TL;DR: Wang et al. as mentioned in this paper proposed a personal credit ecological platform, which is structured by a display layer, a middle layer and a data layer; the display layer is a front client platform, the middle layer is background service platform, and the data layer is data warehouse system.
Abstract: The invention discloses a personal credit ecological platform, which is structured by a display layer, a middle layer and a data layer; the display layer is a front client platform, the middle layer is a background service platform, and the data layer is a data warehouse system; the whole ecological platform consists of a management end, a personal credit module, a merchant module, a payment flow, and a mobile APP. The platform data base is deep, the data quality is alive, the accuracy is high, the scale is wide, and the time track is rich, the data volume is huge and the dimension is wide. The platform deeply excavates the personal credit data, realizes the credit evaluation, provides the complete personal credit status, realizes the personal credit tracking management, and further controls the risk; besides, the client complaint procedure is added. The platform achieves the recommendation of potential and sustainable developed individual client for merchant users, and realizes the bidirectional push.

Journal ArticleDOI
TL;DR: A team of physicians, systems biologists, engineers, and scientists at Rutgers Cancer Institute of New Jersey have designed, developed, and implemented the Warehouse with information originating from data sources, including Electronic Medical Records, Clinical Trial Management Systems, Tumor Registries, Biospecimen Repositories, Radiology and Pathology archives, and Next Generation Sequencing services.
Abstract: Leading institutions throughout the country have established Precision Medicine programs to support personalized treatment of patients. A cornerstone for these programs is the establishment of enterprise-wide Clinical Data Warehouses. Working shoulder-to-shoulder, a team of physicians, systems biologists, engineers, and scientists at Rutgers Cancer Institute of New Jersey have designed, developed, and implemented the Warehouse with information originating from data sources, including Electronic Medical Records, Clinical Trial Management Systems, Tumor Registries, Biospecimen Repositories, Radiology and Pathology archives, and Next Generation Sequencing services. Innovative solutions were implemented to detect and extract unstructured clinical information that was embedded in paper/text documents, including synoptic pathology reports. Supporting important precision medicine use cases, the growing Warehouse enables physicians to systematically mine and review the molecular, genomic, image-based, and correlated clinical information of patient tumors individually or as part of large cohorts to identify changes and patterns that may influence treatment decisions and potential outcomes.

Journal ArticleDOI
TL;DR: A DW was successfully implemented in an academic ophthalmology environment to support and facilitate research using growing EMR and measurement data, and decision-support software can be developed based on the DW and its structured data.

Patent
09 Feb 2017
TL;DR: In this article, the present invention discloses several embodiments of data mining architectures and methods of accessing external databases that are located behind firewalls or have other security and privacy protections to mine additional data based on data from submitted data files.
Abstract: The present invention discloses several embodiments of data mining architectures. Data mining architectures have components such as secure cloud servers hosting data warehouses, data modelers, analytics engines, and query engines. Systems comprising such data mining architecture process data extracted from user submitted data files in several formats, which could be massive datasets spread across multiple dimensions. Query engines have Application Programming Interfaces to communicate with external databases and services. Also disclosed are architectures and methods of accessing external databases that are located behind firewalls or have other security and privacy protections to mine additional data based on data from user submitted data files. The present invention can be used for applications such as customizing search engine output.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A flexible architecture is proposed, which simplifies data integration, handling and sharing of data over organizational borders, and can be an enabler for automated data analysis of distributed data from sources with heterogeneous data formats in automated production systems.
Abstract: Data heterogeneity and proprietary interfaces present a major challenge for big data analytics. The data generated from a multitude of sources has to be aggregated and integrated first before being evaluated. To overcome this, an automated integration of this data and its provisioning via defined interfaces in a generic data format could greatly reduce the effort for an efficient collection and preparation of data for data analysis in automated production systems. Besides, the sharing of specific data with customers and suppliers, as well as near real-time processing of data can boost the information gain from analysis. Existing approaches for automatic data integration lack the fulfillment of all these requirements. On this basis, a flexible architecture is proposed, which simplifies data integration, handling and sharing of data over organizational borders. Special focus is put on the ability to process near real-time data which is common in the field of automated production systems. An evaluation with technical experts from the field of automation was carried out by adapting the generic concept for specific use cases. The evaluation showed that the proposed architecture could overcome the disadvantages of current systems and reduce the effort spent on data integration. Therefore, the proposed architecture can be an enabler for automated data analysis of distributed data from sources with heterogeneous data formats in automated production systems.
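
A toy sketch of the kind of normalisation step such an architecture performs, mapping records from heterogeneous production data sources into one generic event format; the source formats and field names are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

def to_generic(record, source):
    """Map a source-specific machine record onto one generic event format."""
    if source == "plc_csv":                       # e.g. "M7;2017-03-01T10:15:00;temp;71.5"
        machine, ts, signal, value = record.split(";")
    elif source == "mes_json":                    # JSON with asset/time/signal/value keys
        d = json.loads(record)
        machine, ts, signal, value = d["asset"], d["time"], d["signal"], d["value"]
    else:
        raise ValueError("unknown source: %s" % source)
    return {
        "machine_id": machine,
        "timestamp": ts,
        "signal": signal,
        "value": float(value),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

print(to_generic("M7;2017-03-01T10:15:00;temp;71.5", "plc_csv"))
```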

Journal ArticleDOI
TL;DR: A comprehensive experimental evaluation comparing SETL to a solution made with traditional tools (requiring much more hand-coding) on a concrete use case shows that SETL provides better programmer productivity, knowledge base quality, and performance.

Journal ArticleDOI
TL;DR: The paper describes the processes of creating and producing a data warehouse using a NoSQL data mart and outlines the requirements for the technology necessary for such processes.

Book ChapterDOI
01 Jan 2017
TL;DR: Modelling the process of analysing a person's comprehensive assessment requires consideration of the reasons for the occurrence of missing data in the data warehouse, and using techniques of intelligent and multivariate data analysis of a person's comprehensive assessment will provide the opportunity to improve the first stage of the process of inclusive education.
Abstract: One of the most effective ways of socializing persons with special educational needs is inclusive education, which enables learning and development in the environment of mainstream education. Applying IT support to such education will increase access to it and improve its quality. Information and technological support for the first stage of inclusive education lies in the application of a database management system for psychophysiological surveys. Subsequently, to store a person's comprehensive assessment, the relevant data warehouse is to be developed. The development of information technologies supporting the first stage of inclusive education is based on the construction of the infological and datalogical schemas of the psychophysiological survey database. Modelling the process of analysing a person's comprehensive assessment requires consideration of the reasons for the occurrence of missing data in the data warehouse. Using techniques of intelligent and multivariate data analysis of a person's comprehensive assessment will provide the opportunity to improve the first stage of the process of inclusive education.

Journal ArticleDOI
TL;DR: A data characterization framework for MFA consists of an MFA data terminology, a data characterization matrix, and a procedure for database analysis that facilitates systematic data characterization by cell‐level tagging of data with data attributes.
Abstract: The validity of material flow analyses (MFAs) depends on the available information base, that is, the quality and quantity of available data. MFA data are cross-disciplinary, can have varying formats and qualities, and originate from heterogeneous sources, such as official statistics, scientific models, or expert estimations. Statistical methods for data evaluation are most often inadequate, because MFA data are typically isolated values rather than extensive data sets. In consideration of the properties of MFA data, a data characterization framework for MFA is presented. It consists of an MFA data terminology, a data characterization matrix, and a procedure for database analysis. The framework facilitates systematic data characterization by cell-level tagging of data with data attributes. Data attributes represent data characteristics and metainformation regarding statistical properties, meaning, origination, and application of the data. The data characterization framework is illustrated in a case study of a national phosphorus budget. This work furthers understanding of the information basis of material flow systems, promotes the transparent documentation and precise communication of MFA input data, and can be the foundation for better data interpretation and comprehensive data quality evaluation.

Journal ArticleDOI
TL;DR: A tool called REDCap2SDTM was developed that maps information in the Field Annotation of REDCap to SDTM and dynamically executes data conversion, including when data must be pivoted to accommodate the SDTM format, by parsing the mapping information using R.

Journal ArticleDOI
TL;DR: The use of data mining techniques such as ETL (Extract, Transform, Load) to feed the database, together with machine-learning-based algorithms for knowledge extraction, allowed the authors to obtain a database with quality data and suitable tools for bioinformatics analysis.
Abstract: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics. Here, we describe the development of a data warehouse enriched with ML approaches, designated geminivirus.org. We implemented search modules, bioinformatics tools, and ML methods to retrieve high precision information, demarcate species, and create classifiers for genera and open reading frames (ORFs) of geminivirus genomes. The use of data mining techniques such as ETL (Extract, Transform, Load) to feed our database, as well as algorithms based on machine learning for knowledge extraction, allowed us to obtain a database with quality data and suitable tools for bioinformatics analysis. The Geminivirus Data Warehouse (geminivirus.org) offers a simple and user-friendly environment for information retrieval and knowledge discovery related to geminiviruses.
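
A minimal scikit-learn sketch of the kind of genus classifier the abstract mentions, using k-mer counts over genome sequences as features; the k-mer length, model choice and toy data are assumptions, not geminivirus.org internals.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy labelled genomes (real inputs would be full-length geminivirus sequences).
sequences = ["ATGCCGTTAGCAATGGCC", "TTGACCGATTGGCATACG", "ATGCCGTTAGCAATGGCA", "TTGACCGATTGGCATACC"]
genera    = ["Begomovirus", "Mastrevirus", "Begomovirus", "Mastrevirus"]

classifier = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(6, 6)),   # 6-mer counts as features
    RandomForestClassifier(n_estimators=200, random_state=0),
)
classifier.fit(sequences, genera)

print(classifier.predict(["ATGCCGTTAGCAATGGCG"]))           # predicted genus for a new genome
```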