Proceedings ArticleDOI

Hcpcs2Vec: Healthcare Procedure Embeddings for Medicare Fraud Prediction

TLDR
In this paper, the authors evaluate semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS).
Abstract
This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embeddings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.
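The training setup described in the abstract can be illustrated with a brief sketch. This is a minimal, hypothetical example using the Gensim Word2Vec API (the framework cited in the references below); the sequence construction, hyperparameter values, and example codes are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the Hcpcs2Vec training setup (illustrative only).
from gensim.models import Word2Vec

# Assumption: each inner list is one sequence of co-occurring HCPCS codes,
# e.g. the codes billed together on a provider's claims. The codes shown
# are placeholders; the paper derives sequences from 56M+ claims records.
hcpcs_sequences = [
    ["99213", "99214", "36415"],
    ["99213", "93000", "80053"],
    # ...
]

# sg=0 trains continuous-bag-of-words; sg=1 trains skip-gram. The paper
# sweeps a range of embedding (vector_size) and window sizes.
cbow = Word2Vec(hcpcs_sequences, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(hcpcs_sequences, vector_size=100, window=5, sg=1, min_count=1)

# Qualitative check mentioned in the abstract: nearest neighbors in vector space.
print(skipgram.wv.most_similar("99213", topn=5))
```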

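The downstream evaluation can be sketched similarly. The snippet below is an assumed setup with placeholder data and feature construction: it scores an XGBoost classifier with the two metrics named in the abstract, where the geometric mean is taken over sensitivity and specificity, a common definition for imbalanced fraud data.

```python
# Illustrative evaluation sketch (placeholder data; not the paper's pipeline).
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: X holds embedding-based features per claim, y holds fraud labels.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 100)), rng.integers(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# Area under the receiver operating characteristic curve.
auc = roc_auc_score(y_test, probs)

# Geometric mean of sensitivity and specificity.
tn, fp, fn, tp = confusion_matrix(y_test, probs > 0.5).ravel()
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"AUC={auc:.3f}  G-mean={gmean:.3f}")
```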

Citations
Proceedings ArticleDOI

Encoding Techniques for High-Cardinality Features and Ensemble Learners

TL;DR: In this article, encoding techniques for high-cardinality categorical features were evaluated with bagging and boosting ensembles on the latest Medicare Part B fraud classification data set, where the healthcare procedure code feature alone includes 7,752 unique values.
Journal ArticleDOI

Medical Provider Embeddings for Healthcare Fraud Detection

TL;DR: In this article, the problem of encoding medical provider types was addressed and four techniques for learning dense, semantic embeddings that capture provider specialty similarities were presented, which can be readily adopted and applied in future machine learning applications in the healthcare industry.
Proceedings ArticleDOI

Leveraging LightGBM for Categorical Big Data

TL;DR: In this paper, a study of LightGBM revealed two alternatives for encoding categorical features in a Big Data anomaly-detection classification task: one-hot encoding the data into a sparse representation, and relying entirely on LightGBM's Exclusive Feature Bundling to complete the encoding of categorical features (see the sketch below).
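The one-hot alternative from this TL;DR can be sketched as follows; this is a hypothetical example with made-up data, relying on LightGBM's default Exclusive Feature Bundling (enable_bundle=True) to bundle the mutually exclusive one-hot columns internally.

```python
# Illustrative sketch: sparse one-hot encoding + LightGBM's EFB (assumed data).
import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({
    "procedure_code": ["99213", "99214", "36415", "93000"] * 250,  # placeholder codes
    "label": [0, 1, 0, 0] * 250,
})

# One-hot encode the high-cardinality categorical into a sparse representation.
X_sparse = pd.get_dummies(df[["procedure_code"]], sparse=True)

# Exclusive Feature Bundling is on by default and bundles the mutually
# exclusive one-hot columns during training, completing the encoding.
model = lgb.LGBMClassifier().fit(X_sparse, df["label"])
```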
References
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Efficient Estimation of Word Representations in Vector Space

TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on a test set measuring syntactic and semantic word similarities.
Proceedings ArticleDOI

Deep contextualized word representations

TL;DR: This paper introduces a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., to model polysemy).

Software Framework for Topic Modelling with Large Corpora

TL;DR: This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e., processing corpora one document at a time in a memory-independent fashion. It implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
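The document-streaming idea summarized above can be sketched with a restartable iterator; the file path and one-document-per-line format here are assumptions for illustration.

```python
# Sketch of memory-independent document streaming (hypothetical file/format).
from gensim.models import Word2Vec

class StreamingCorpus:
    """Yields one whitespace-tokenized document per line of a text file,
    never loading the whole corpus into memory; restartable, so models
    can make multiple passes."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Any Gensim model that accepts an iterable of documents can consume the
# stream, keeping memory use independent of corpus size.
model = Word2Vec(StreamingCorpus("hcpcs_sequences.txt"), vector_size=100)
```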