Proceedings Article

Effects of Clustering Feature Vectors on Bus Travel Time Prediction: A Case Study

05 Jan 2021, pp. 741-746

TL;DR: In this article, the authors analyzed the use of different feature vectors for clustering and the effect on travel time predictions and showed that the prediction accuracy is highest when only travel times are used as a clustering feature vector.

Abstract: Improving the accuracy of travel time predictions depends on providing the correct inputs as well as on the prediction algorithm used. Clustering algorithms can be used to identify patterns in the data, which can improve the inputs to the prediction algorithm. The feature vectors used for clustering greatly affect the clusters formed and, ultimately, the prediction performance. Because clustering is an unsupervised learning technique, the accuracy or correctness of the clusters formed cannot be evaluated directly. A possible solution is to link the problem to prediction accuracy and choose the feature vector combination that yields the highest prediction accuracy. The present study analyses the use of different feature vectors for clustering and their effect on travel time predictions. Three cases are studied: (1) travel time alone; (2) travel time together with features such as time of the day, section index, and day of the week, all treated as numerical features; and (3) the same features treated as a mix of categorical and numerical features. The effect of using each of these cases as the clustering feature vector on travel time predictions is evaluated. It is observed that the prediction accuracy is highest when only travel times are used as the clustering feature vector. The study demonstrates the importance of choosing the correct feature vectors for clustering and their effect on a final application, namely travel time prediction.
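The selection idea in the abstract (score each candidate clustering feature vector by the prediction accuracy it leads to) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the trip data are synthetic, the predictor is simply the mean next travel time of the nearest cluster, the features are left unscaled, and every name (`kmeans`, `mape_for`, `make_trips`, `tt`, `next_tt`) is hypothetical. The mixed categorical and numerical case is omitted because it would need a different distance measure.

```python
import random
from statistics import mean


def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(mean(col) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def mape_for(extract, train, test, k=3):
    """Cluster `train` on extract(trip), predict each test trip's next travel
    time as the mean target of its nearest cluster, and return MAPE in percent."""
    centroids, labels = kmeans([extract(t) for t in train], k)
    overall = mean(t["next_tt"] for t in train)
    cluster_mean = []
    for c in range(k):
        targets = [t["next_tt"] for t, l in zip(train, labels) if l == c]
        cluster_mean.append(mean(targets) if targets else overall)
    errors = [abs(cluster_mean[nearest(extract(t), centroids)] - t["next_tt"])
              / t["next_tt"] for t in test]
    return 100 * mean(errors)


def make_trips(n, seed):
    """Synthetic trips (assumption): travel time depends on section and peak hours."""
    rng = random.Random(seed)
    trips = []
    for _ in range(n):
        hour = rng.randrange(6, 22)
        section = rng.randrange(0, 5)
        base = 60 + 20 * section + (40 if hour in (8, 9, 17, 18) else 0)
        trips.append({"tt": base + rng.gauss(0, 5), "hour": float(hour),
                      "section": float(section),
                      "next_tt": base + rng.gauss(0, 5)})
    return trips


# Candidate clustering feature vectors, each a function of one trip record.
candidates = {
    "travel_time_only": lambda t: (t["tt"],),
    "travel_time_plus_context": lambda t: (t["tt"], t["hour"], t["section"]),
}
train, test = make_trips(300, seed=1), make_trips(100, seed=2)
scores = {name: mape_for(f, train, test) for name, f in candidates.items()}
best = min(scores, key=scores.get)  # feature vector with the lowest error
print(scores, best)
```

Which candidate wins on this synthetic data says nothing about the paper's result; the sketch only shows the evaluation loop, in which the clustering itself is never scored directly and the downstream prediction error decides between feature vectors.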



References
Journal Article
TL;DR: The feasibility of applying SVR in travel-time prediction is demonstrated and it is proved that SVR is applicable and performs well for traffic data analysis.
Abstract: Travel time is a fundamental measure in transportation. Accurate travel-time prediction also is crucial to the development of intelligent transportation systems and advanced traveler information systems. We apply support vector regression (SVR) for travel-time prediction and compare its results to other baseline travel-time prediction methods using real highway traffic data. Since support vector machines have greater generalization ability and guarantee global minima for given training data, it is believed that SVR will perform well for time series analysis. Compared to other baseline predictors, our results show that the SVR predictor can significantly reduce both relative mean errors and root-mean-squared errors of predicted travel times. We demonstrate the feasibility of applying SVR in travel-time prediction and prove that SVR is applicable and performs well for traffic data analysis.

899 citations

Journal Article
Abstract: I was sitting before my TV set, a while back, watching Captain Video and pondering the organizational problems of psychologists, psychometricians, psychodiagnosticians, psycho-somatists, psychosomnabulists, and psychoceramics (crack-pots to you). Wondering what I might do, in my small way, to help out, I decided to enlist Captain Video's help to bring me from the Black Planet that superogalactian hypermetrician, Dr. Idnozs HcahscrorTenib, cosmos-famous discoverer of Serutan. Why delay? The Galaxy was on its way, and in half a light year Dr. Tenib was at my side prepared to devote his gargantuan talents to the task. Seeing no point in confusing the good doctor by trying to describe to him the present administrative hodgepodge, I said, "Doctor, let's start from scratch. I want you to find out for me how these good people who are present at the annual meeting of the APA structure themselves? What families are represented? How many, or better, how few? And who belongs to each?" "We proceed," said the Doctor. "Bring sample of population; I measure." So we set out to design a sample. The problem presented some interesting theoretical aspects, but the final solution was relatively simple. We stationed representatives at each of the three state beverage stores and followed every third badge-wearing individual who came out of a store. We selected only outgoing patrons for obvious reasons. After assisting each respondent to unburden himself, we brought him to Dr. Idnozs (as we came to call him among ourselves) for study. "Now," murmured the Doctor, "we give tests. First is 'Draw-a-Psychiatrist Test.'" "We score this," he confided, "by if it gives horns." Presently we started on the physiological test battery. "We draw off saliva drop by drop," explained our idiot savant, "and see does he drool when we bring in Skinner Box." Later came the Peculiar Preference Blank. "Forced-choice, you know," whispered the Doctor. "Would you rather make mud pies or kiss gorgeous blonde?"

866 citations

Journal Article
TL;DR: In this article, the authors review the techniques available for validating clustering results, with a particular focus on their application to post-genomic data analysis.
Abstract: Motivation: The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. Results: This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation. Availability: The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/ Contact: J.Handl@postgrad.manchester.ac.uk Supplementary information: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/

839 citations

Proceedings Article
27 Aug 1998
TL;DR: A scalable clustering framework applicable to a wide class of iterative clustering is presented; it requires at most one scan of the database and is instantiated and numerically justified with the popular K-Means clustering algorithm.
Abstract: Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on synthetic and publicly available data sets.

800 citations

Journal Article
01 Nov 2007
TL;DR: A clustering algorithm based on k-mean paradigm that works well for data with mixed numeric and categorical features is presented and a new cost function and distance measure based on co-occurrence of values is proposed.
Abstract: Use of the traditional k-mean type algorithm is limited to numeric data. This paper presents a clustering algorithm based on the k-mean paradigm that works well for data with mixed numeric and categorical features. We propose a new cost function and distance measure based on co-occurrence of values. The measures also take into account the significance of an attribute towards the clustering process. We present a modified description of the cluster center to overcome the numeric-data-only limitation of the k-mean algorithm and provide a better characterization of clusters. The performance of this algorithm has been studied on real-world data sets. Comparisons with other clustering algorithms illustrate the effectiveness of this approach.

527 citations