SciSpace (Formerly Typeset)
Author

Siddha Prabhu Chodnekar

Bio: Siddha Prabhu Chodnekar is an academic researcher from VIT University. The author has contributed to research in the topics of cluster analysis and k-means clustering, has an h-index of 2, and has co-authored 3 publications receiving 10 citations.

Papers
Journal ArticleDOI
TL;DR: A density-with-distance-based method that ensures the identification of initial seeds from different clusters, leading to more accurate clustering results; the results are compared with random selection and with Wu's, Cao's, and Khan's methods of initial seed selection.
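
The TL;DR gives only the idea, not the formulas, so the following is a minimal sketch of one plausible density-with-distance seed-selection rule; the neighbourhood-radius default and the density-times-distance score are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def density_distance_seeds(X, k, radius=None):
    """Pick k initial seeds that are both dense and mutually distant."""
    # pairwise Euclidean distances between all points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if radius is None:
        radius = d.mean()  # crude default neighbourhood radius (assumption)
    # density of a point = number of neighbours within the radius
    density = (d < radius).sum(axis=1)
    seeds = [int(density.argmax())]  # densest point becomes the first seed
    while len(seeds) < k:
        # distance from each point to its nearest already-chosen seed;
        # scoring by density * distance pushes seeds into different clusters
        nearest = d[:, seeds].min(axis=1)
        seeds.append(int((density * nearest).argmax()))
    return X[seeds]
```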

9 citations

Journal ArticleDOI
TL;DR: A novel technique for identifying initial seeds for heterogeneous data clustering is proposed, through the introduction of a unique distance measure where the distance of the numerical attributes is scaled such that it is comparable to that of categorical attributes.
Abstract: Data sets to which clustering is applied may be homogeneous (numerical or categorical) or heterogeneous (numerical and categorical) in nature. Handling homogeneous data is easier than handling heterogeneous data. We propose a novel technique for identifying initial seeds for heterogeneous data clustering, through the introduction of a unique distance measure in which the distance over the numerical attributes is scaled so that it is comparable to that over the categorical attributes. The proposed initial seed selection algorithm ensures selection of initial seed points from different clusters of the clustering solution, which are then given as input to the modified K-means clustering algorithm along with the data set. This technique is independent of any user-defined parameter and thus can be easily applied to clusterable data sets with mixed attributes. We have also modified the K-means clustering algorithm to handle mixed attributes by incorporating our novel distance measure for the numerical data and assigning the value one when categorical data are dissimilar and zero when they are similar. Finally, a comparison has been made with existing algorithms to bring out the significance of our approach. We also perform a statistical test to evaluate the statistical significance of our proposed technique.
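
As a rough illustration of the abstract's unified distance (a scaled numerical part plus a one/zero categorical part), consider the sketch below; scaling by each numerical attribute's range is a stand-in assumption, since the paper's exact scaling rule is not reproduced here.

```python
import numpy as np

def mixed_distance(a_num, b_num, a_cat, b_cat, num_scale):
    """Distance between two mixed-attribute records."""
    # numerical part, rescaled so its magnitude is comparable to the
    # categorical part (per-attribute ranges used as a stand-in scaling)
    num_part = np.abs((a_num - b_num) / num_scale).sum()
    # categorical part: one per dissimilar attribute, zero per similar one
    cat_part = int((a_cat != b_cat).sum())
    return num_part + cat_part

# usage: scale by the observed range of each numerical column
# num_scale = X_num.max(axis=0) - X_num.min(axis=0)
```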

5 citations

Book ChapterDOI
01 Jan 2019
TL;DR: From the result analysis, it is found that performance is maximal when the pattern size matches the tile size and is less than 64, a restriction due to the warp size considered.
Abstract: Parallelizing pattern matching in multidimensional images is vital for improving performance in many applications. With SIMT architectures, performance can be greatly enhanced if the hardware threads are utilized to the maximum. In pattern matching algorithms, the main bottleneck arises from the reduction operation that must be performed over the multiple parallel search operations. This can be solved by using Shift-Or operations. The recent trend has shown improved pattern matching using Shift-Or operations for bit pattern matching, and this has to be extended to multidimensional images such as hypercubes. In this paper, we extend Shift-Or pattern matching to multidimensional images. The algorithm is implemented for GPU architectures. The complexity of the proposed algorithm is \( \frac{m \log(n)}{kw} \), where m is the number of dimensions, n is the size of the array when the multidimensional matrix values are placed in a one-dimensional array, k is the size of the pattern, and w is the size of the tile. From the result analysis it is found that performance is maximal when the pattern size matches the tile size and is less than 64. This restriction is due to the size of the warp considered.
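
The multidimensional, GPU-tiled version cannot be reproduced from the abstract alone, but the core bit-parallel Shift-Or automaton it builds on fits in a few lines; this is the classic one-dimensional form, whose single-machine-word state also explains the pattern-size-below-64 restriction noted above.

```python
def shift_or_search(text, pattern):
    """Report all start positions of pattern in text (bit-parallel Shift-Or)."""
    m = len(pattern)
    assert 0 < m <= 63  # state must fit one machine word, cf. the <64 limit
    all_ones = (1 << m) - 1
    # B[c] has a 0-bit at position i exactly where pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    state = all_ones
    occurrences = []
    for j, c in enumerate(text):
        state = ((state << 1) | B.get(c, all_ones)) & all_ones
        if not (state >> (m - 1)) & 1:  # bit m-1 clear: a match ends at j
            occurrences.append(j - m + 1)
    return occurrences

print(shift_or_search("abracadabra", "abra"))  # [0, 7]
```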

1 citation


Cited by
01 Jan 1999
TL;DR: In this article, the Shift-And approach is applied to the problem of pattern matching in LZW compressed text, giving a new algorithm that is fast when the pattern length is at most 32, the word length.
Abstract: This paper considers the Shift-And approach to the problem of pattern matching in LZW compressed text, and gives a new algorithm that solves it. The algorithm is indeed fast when the pattern length is at most 32, i.e., the word length. After an O(m + |Σ|) time and O(|Σ|) space preprocessing of a pattern, it scans an LZW compressed text in O(n + r) time and reports all occurrences of the pattern, where n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. Experimental results show that it runs approximately 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm. Moreover, the algorithm can be extended to generalized pattern matching, to pattern matching with k mismatches, and to multiple pattern matching, like the Shift-And algorithm.
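
For contrast with the Shift-Or sketch earlier, the dual Shift-And update on plain (already decompressed) text looks as follows; the paper's actual contribution, scanning the LZW-compressed text directly in O(n + r) time, is deliberately not reproduced here.

```python
def shift_and_search(text, pattern):
    """Plain-text Shift-And scan; the LZW-compressed scanning is omitted."""
    m = len(pattern)
    assert 0 < m <= 63  # fast regime: pattern fits in one machine word
    # B[c] has a 1-bit at position i exactly where pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    state = 0
    occurrences = []
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & B.get(c, 0)
        if (state >> (m - 1)) & 1:  # bit m-1 set: a match ends at j
            occurrences.append(j - m + 1)
    return occurrences
```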

56 citations

Journal ArticleDOI
TL;DR: An investigation of the global patent databases of digital twin (DT) patents, summarizing related technologies, effects, and applications, reveals that DT has not yet formed a comprehensively connected technology, a typical phenomenon for a technology that remains in its early stage of development.
Abstract: Digital twin (DT) can facilitate interaction between the physical and the cyber worlds and achieve smart manufacturing. However, the DT's state of development in industry remains vague. This study investigates the global patent databases of DT patents and summarizes related technologies, effects, and applications. Patent map analysis is used to uncover the patent development trajectory of DT in the patent databases of the USA, China, the European nations, and the World Intellectual Property Organization. In addition, a nation-based survey is conducted to explore DT patent trends. Findings reveal that DT has not yet formed a comprehensively connected technology, which is a typical phenomenon for a technology that remains in its early stage of development. In the present study, a two-dimensional matrix analysis of patent technology and effect shows that several patents created a variety of effects and reached saturation. Moreover, several technology–effect domains remain open, and DT-related technology gaps exist for a number of potential effects. The DT-related patents are distributed unevenly across industries; for instance, most appear in the manufacturing industry. Furthermore, our K-modes cluster analysis reveals that the DT-related patents are distributed in five subgroups along the three dimensions of technology, effect, and application.
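
The K-modes step mentioned at the end operates on purely categorical patent attributes. A minimal Huang-style k-modes sketch is given below, assuming categories are pre-encoded as small non-negative integers; it illustrates the technique generically, not the authors' exact pipeline.

```python
import numpy as np

def k_modes(X, k, iters=10, seed=0):
    """Cluster rows of integer-coded categorical data into k subgroups."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each row to the center with the fewest mismatching attributes
        mismatches = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = mismatches.argmin(axis=1)
        # move each center to the column-wise mode of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, centers
```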

19 citations

Journal ArticleDOI
TL;DR: In this paper, a modified Roger's distance for mixed quantitative-qualitative phenotypes was developed to select 30 accessions (denoted as the core collection) that had a maximum pairwise genetic distance.
Abstract: Vegetable soybeans [Glycine max (L.) Merr.] are characterized by larger seeds, less beany flavor, tender texture, and green-colored pods and seeds. Rich in nutrients, vegetable soybeans are conducive to preventing neurological disease. Due to changing dietary habits and increasing health awareness, the demand for vegetable soybeans has increased. To conserve vegetable soybean germplasms in Taiwan, we built a core collection of vegetable soybeans with minimum accessions, minimum redundancy, and maximum representation. Initially, a total of 213 vegetable soybean germplasms and 29 morphological traits were used to construct the core collection. After redundant accessions were removed, 200 accessions were retained as the entire collection, which was grouped into nine clusters. Here, we developed a modified Roger's distance for mixed quantitative–qualitative phenotypes to select 30 accessions (denoted as the core collection) that had a maximum pairwise genetic distance. No significant differences were observed in any phenotypic trait (p-values > 0.05) between the entire and the core collections, except plant height. Compared to the entire collection, most traits retained their diversity in the core collection, although diversity in seven traits was slightly reduced (by 2 to 9%). The core collection demonstrated a small percentage of significant mean differences (3.45%) and a large coincidence rate (97.70%), indicating representativeness of the entire collection. Furthermore, large values of variable rate (149.80%) and coverage (92.5%) were in line with the high diversity retained in the core collection. These results suggest that a phenotype-based core collection can retain the diversity and genetic variability of vegetable soybeans, providing a basis for further research and breeding programs.
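
The abstract does not spell out the modified Roger's distance, so the sketch below pairs one plausible mixed-trait, Roger's-style distance with a greedy max-min pick of the 30-accession core; both the normalization and the selection heuristic are assumptions for illustration.

```python
import numpy as np

def mixed_rogers_like(a_q, b_q, a_c, b_c, ranges):
    """Roger's-style distance over mixed quantitative-qualitative traits."""
    quant = ((a_q - b_q) / ranges) ** 2   # range-normalised quantitative part
    qual = (a_c != b_c).astype(float)     # 0/1 mismatch per qualitative trait
    return float(np.sqrt(np.concatenate([quant, qual]).mean()))

def select_core(D, n_core=30):
    """Greedily pick a core with maximal pairwise distances from matrix D."""
    core = [int(i) for i in np.unravel_index(D.argmax(), D.shape)]
    while len(core) < n_core:
        rest = [i for i in range(len(D)) if i not in core]
        # add the accession farthest from its nearest current core member
        core.append(max(rest, key=lambda i: D[i, core].min()))
    return core
```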

9 citations

Journal ArticleDOI
TL;DR: In this article, the adaptive multiple imputation of missing values using the class center (AMICC) approach is proposed to produce effective imputation results efficiently; it is based on the class center and defines a threshold from the weighted distances between the center and other observed data for the imputation step.
Abstract: Big data has become a core technology for providing innovative solutions in many fields. However, datasets collected for data analysis in various domains will contain missing values. Missing value imputation is the primary method for resolving problems involving incomplete datasets: missing attribute values are replaced with values derived from a selected set of observed data using statistical or machine learning methods. Although machine learning techniques can generate reasonably accurate imputation results, they typically require longer imputation durations than statistical techniques. This study proposes the adaptive multiple imputation of missing values using the class center (AMICC) approach to produce effective imputation results efficiently. AMICC is based on the class center and defines a threshold from the weighted distances between the center and other observed data for the imputation step. Additionally, the distance to an adaptive nearest neighborhood or to the center can be used to estimate the missing values. The experiments use numerical, categorical, and mixed datasets from the University of California Irvine (UCI) Machine Learning Repository, with missing-value rates from 10 to 50% introduced into 27 datasets. The proposed AMICC approach outperforms the other missing value imputation methods with a higher average accuracy of 81.48%, about 9–14% above the other methods. Furthermore, its execution time differs from that of the Mean/Mode method by only about seven seconds, while requiring significantly less imputation time than some machine learning approaches, by about 10–14 s.
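
AMICC's weighted-distance threshold and adaptive nearest-neighbour fallback are more involved than a short fragment allows, so the sketch below shows only the baseline class-center step the method builds on: filling each missing numeric cell with the class-wise column mean.

```python
import numpy as np

def class_center_impute(X, y):
    """Replace NaNs in each numeric column with the class-wise column mean."""
    X = X.astype(float).copy()
    for cls in np.unique(y):
        rows = y == cls
        centers = np.nanmean(X[rows], axis=0)  # class center from observed data
        miss = rows[:, None] & np.isnan(X)     # missing cells of this class
        idx = np.where(miss)
        X[idx] = centers[idx[1]]               # fill from the matching column
    return X
```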

6 citations

Journal ArticleDOI
TL;DR: Overhead is reduced by developing the Similarity-based K-means Clustering (SKC) approach, which clusters attributes based on a divergence distance and achieves 98.45% accuracy on a publicly available dataset when compared with existing techniques.
Abstract: Clustering plays a major role in data mining applications because it divides and groups data effectively. In pattern analysis, two major challenges occur in real-life applications: handling categorical data and the limited availability of correctly labeled data. Based on the characteristic of homogeneity, clustering techniques are designed to group unlabeled data. Important issues such as high memory utilization, time consumption, overhead, computational complexity, and less effective results are present in various existing algorithms for numerical data. Therefore, this study implemented clustering techniques based on the similarity of categorical data. Inter- and intra-cluster attribute similarities are identified, and the performance of the proposed method is improved by integrating those similarities. Noise is also removed through pre-processing, so that the similarity between noise-free elements can be estimated. Once these similarities are identified, insignificant attributes are removed and the relevant attributes are chosen from the preprocessed elements. Overhead is reduced by developing the Similarity-based K-means Clustering (SKC) approach, which clusters the attributes based on a divergence distance. The efficiency of SKC is tested in the experimental analysis by means of precision, f-measure, accuracy, clustering error rate, and recall. The results show that the developed method achieves 98.45% accuracy on a publicly available dataset when compared with existing techniques: variations of Particle Swarm Optimization (PSO) and a semi-supervised clustering system.
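
The abstract names a "divergence distance" without defining it; the Jensen-Shannon divergence between value-frequency profiles of an attribute is one common choice, sketched below purely as an illustrative stand-in rather than the SKC formulation.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0  # skip zero-probability terms
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# e.g. value-frequency profiles of one categorical attribute in two clusters
print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```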

6 citations