Proceedings ArticleDOI

Hcpcs2Vec: Healthcare Procedure Embeddings for Medicare Fraud Prediction

TLDR
In this paper, the authors evaluate semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS).
Abstract
This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embeddings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.
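The training setup described in the abstract can be illustrated with a brief sketch. This is a minimal, hypothetical example using the Gensim Word2Vec API (the framework cited in the references below); the sequence construction, hyperparameter values, and example codes are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the Hcpcs2Vec training setup (illustrative only).
from gensim.models import Word2Vec

# Assumption: each inner list is one sequence of co-occurring HCPCS codes,
# e.g. the codes billed together on a provider's claims. The codes shown
# are placeholders; the paper derives sequences from 56M+ claims records.
hcpcs_sequences = [
    ["99213", "99214", "36415"],
    ["99213", "93000", "80053"],
    # ...
]

# sg=0 trains continuous-bag-of-words; sg=1 trains skip-gram. The paper
# sweeps a range of embedding (vector_size) and window sizes.
cbow = Word2Vec(hcpcs_sequences, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(hcpcs_sequences, vector_size=100, window=5, sg=1, min_count=1)

# Qualitative check mentioned in the abstract: nearest neighbors in vector space.
print(skipgram.wv.most_similar("99213", topn=5))
```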

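The downstream evaluation can be sketched similarly. The snippet below is an assumed setup with placeholder data and feature construction: it scores an XGBoost classifier with the two metrics named in the abstract, where the geometric mean is taken over sensitivity and specificity, a common definition for imbalanced fraud data.

```python
# Illustrative evaluation sketch (placeholder data; not the paper's pipeline).
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: X holds embedding-based features per claim, y holds fraud labels.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 100)), rng.integers(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# Area under the receiver operating characteristic curve.
auc = roc_auc_score(y_test, probs)

# Geometric mean of sensitivity and specificity.
tn, fp, fn, tp = confusion_matrix(y_test, probs > 0.5).ravel()
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"AUC={auc:.3f}  G-mean={gmean:.3f}")
```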

Citations
Proceedings ArticleDOI

Encoding Techniques for High-Cardinality Features and Ensemble Learners

TL;DR: In this article, encoding techniques for high-cardinality categorical features were evaluated with bagging and boosting ensembles on the latest Medicare Part B fraud classification data set, where the healthcare procedure code feature alone includes 7,752 unique values.
Journal ArticleDOI

Medical Provider Embeddings for Healthcare Fraud Detection

TL;DR: In this article, the problem of encoding medical provider types was addressed and four techniques for learning dense, semantic embeddings that capture provider specialty similarities were presented, which can be readily adopted and applied in future machine learning applications in the healthcare industry.
Proceedings ArticleDOI

Leveraging LightGBM for Categorical Big Data

TL;DR: In this paper, a study of LightGBM revealed two alternatives for encoding categorical features in a Big Data anomaly-detection classification task: one-hot encoding the data into a sparse representation, and relying entirely on LightGBM's Exclusive Feature Bundling to complete the encoding of categorical features (see the sketch below).
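The one-hot alternative from this TL;DR can be sketched as follows; this is a hypothetical example with made-up data, relying on LightGBM's default Exclusive Feature Bundling (enable_bundle=True) to bundle the mutually exclusive one-hot columns internally.

```python
# Illustrative sketch: sparse one-hot encoding + LightGBM's EFB (assumed data).
import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({
    "procedure_code": ["99213", "99214", "36415", "93000"] * 250,  # placeholder codes
    "label": [0, 1, 0, 0] * 250,
})

# One-hot encode the high-cardinality categorical into a sparse representation.
X_sparse = pd.get_dummies(df[["procedure_code"]], sparse=True)

# Exclusive Feature Bundling is on by default and bundles the mutually
# exclusive one-hot columns during training, completing the encoding.
model = lgb.LGBMClassifier().fit(X_sparse, df["label"])
```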
References
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Efficient Estimation of Word Representations in Vector Space

TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on a test set measuring syntactic and semantic word similarities.
Proceedings ArticleDOI

Deep contextualized word representations

TL;DR: This paper introduces a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., to model polysemy).

Software Framework for Topic Modelling with Large Corpora

TL;DR: This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e., processing corpora one document at a time in a memory-independent fashion. It implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
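The document-streaming idea summarized above can be sketched with a restartable iterator; the file path and one-document-per-line format here are assumptions for illustration.

```python
# Sketch of memory-independent document streaming (hypothetical file/format).
from gensim.models import Word2Vec

class StreamingCorpus:
    """Yields one whitespace-tokenized document per line of a text file,
    never loading the whole corpus into memory; restartable, so models
    can make multiple passes."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Any Gensim model that accepts an iterable of documents can consume the
# stream, keeping memory use independent of corpus size.
model = Word2Vec(StreamingCorpus("hcpcs_sequences.txt"), vector_size=100)
```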