Machine Learning with partially labeled Data for Indoor Outdoor Detection

doi:10.1109/CCNC.2019.8651736

HAL Id: hal-02011454

https://hal.archives-ouvertes.fr/hal-02011454

Submitted on 12 Feb 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Illyyne Saar, Marie-Line Alberi-Morel, Kamal Deep Singh, César Viho

To cite this version:

Illyyne Saar, Marie-Line Alberi-Morel, Kamal Deep Singh, César Viho. Machine Learning with par-

tially labeled Data for Indoor Outdoor Detection. CCNC 2019 - 16th IEEE Consumer Communications

& Networking Conference, Jan 2019, Las Vegas, United States. pp.1-7, �10.1109/CCNC.2019.8651736�.

�hal-02011454�

Machine Learning with partially labeled Data for

Indoor Outdoor Detection

Illyyne Saffar

Service Automation,

Nokia Bell Labs

Nozay, France

illyyne.saffar@nokia.com

Marie Line Alberi Morel

Service Automation,

Nokia Bell Labs

Nozay, France

marie line.alberi-morel@nokia.com

Kamal Deep Singh

Laboratoire Hubert Curien,

University of Saint-Etienne,

Saint-Etienne, France

kamal.singh@univ-st-etienne.fr

Cesar Viho

IRISA - INRIA,

University of Rennes 1,

Rennes, France

Cesar.Viho@irisa.fr

Abstract—This paper demonstrates the feasibility of a hybrid, semi-supervised classification method for detecting the environment of an active mobile phone, based on both labeled and unlabeled cellular radio data. Specifically, we answer the following question: is the mobile user indoor or outdoor while he is or was using a mobile service or application? Implementing this method within the mobile network is attractive for mobile operators because it has low complexity, is less intrusive (minimal intervention from mobile users) and is more accurate. The semi-supervised classification algorithm learns to identify the environment from a large set of real, collected 3GPP signal measurements. Compared to existing work, we propose to use, in addition to the parameters already employed for classification, a radio metric called Timing Advance, which is computed within the mobile network. We empirically validate the proposed semi-supervised algorithm using new real-time radio measurements with partial ground-truth information, gathered on a daily, weekly and monthly basis from indoor and outdoor locations across multiple typical and diverse environments crossed by mobile users. The study confirms the effectiveness of the proposed scheme compared to existing supervised classification methods, including SVM and Deep Learning.

Index Terms—Environment classification, Machine Learning, Indoor Outdoor Detection, 3GPP radio measurement, crowdsourcing, real user activity.

I. INTRODUCTION

Recent technological breakthroughs have extended the features, functions and capabilities of mobile phones, which are now used for more than just communicating or running applications. Mobile devices are increasingly used to learn the consumption habits of individuals and communities [1], [2], [3]. Our purpose is to inject this learned cognition into 5G mobile networks to help them grow smarter and more efficient when faced with the increasing complexity of network management combined with numerous new applications and their heterogeneous needs.

As a first step toward bringing such additional knowledge to the network, we target Indoor/Outdoor Detection (IOD) in this paper. IOD refers to estimating the mobile user's environment, that is, inferring whether the user is indoor or outdoor. IOD is a cornerstone of user behavior contextualization, which in turn can be used for learning user behavior, adapting mobile network resources, etc. [4], [5]. The idea is to obtain more information about the user, such as the type of environment or the location.

IOD can be performed automatically and in real time using machine learning techniques, which in turn need data for learning. Data collection is therefore the first phase of designing an IOD solution based on machine learning. Recently, crowdsourcing [6], [7] has become a popular approach for collecting and analyzing large, real network measurement datasets coming from mobile phones or any other connected devices. This method exploits smartphones (with a built-in cellular network interface) and their various measurement sensors. Additionally, data obtained from smartphones reflects the natural mobility of the people carrying them. This ensures cost-effective, continual and fine-grained spatio-temporal monitoring and analysis of mobile networks. In our work, we investigate this concept of large, real crowdsourced measurements for IOD. We also propose to extend it to mobile networks in order to address the challenge of detecting the environmental context of mobile users from the network side. The idea is to collect data that is measured or derived within the network and to use it as input to a machine-learning-based classifier, first for training and then for detection. The data measured by multiple UEs (User Equipments) during their connection is sent to the eNB (the LTE base station) using standardized procedures.

Such solutions are of interest to mobile network operators that wish to exploit knowledge of user behavior to optimize or customize their service delivery with minimal intervention from users. Furthermore, such measurements, as an alternative to coverage modeling or drive tests [6], capture reality well and reveal the real life of a mobile user while being less expensive. This method can be implemented by operators in their networks as a generic solution, independent of any particular manufacturer's implementation. Consequently, it allows the mobile network to exploit direct measurements made on the user side to deduce contextual factors such as the user environment.

In 4G/5G cellular networks, such solutions are technically feasible since an enormous amount of mobile measurement data is collected by the mobile terminal. This data is regularly sent to the network using standardized protocols and interfaces during each UE's connection to the cell (on a per-procedure basis and on a network-defined event basis). This measurement data is referred to as LTE UE Measurement Data (LUMD) [8]. LUMD contains rich information on mobile performance and RF metrics such as signal strength (Reference Signal Received Power, or RSRP) and signal quality relative to the total received power, interference and noise included (Reference Signal Received Quality, or RSRQ). It also includes the Channel Quality Indicator (CQI), which is a function of SINR.

In this work, we aim to achieve the following objectives:

• (1) Infer the user's environmental context from selected LUMD metrics collected in crowdsourcing mode and from the Timing Advance radio metric, which the network assesses when the user is connected to a session. The environment considered is divided into two main types:

– Indoor: at home, in a restaurant, in a cafe, at work or in other building types, etc.

– Outdoor: walking, running or in a car moving at high speed.

• (2) Respect the constraint that the inference shall be done on the network side with minimal human interaction or intervention.

To achieve (1) and (2), we design a method for training an automatic IOD classifier on a weakly or partially labeled crowdsourced dataset. Such a dataset reduces human intervention to a minimum. Indeed, the labeled data used for machine learning training is tagged either manually or automatically. Manual data tagging can be expensive, complex and even infeasible for mobile operators if they have to tag all the collected crowdsourced data.

In this paper, we are interested in Machine Learning (ML), one of the most popular techniques for automatic IOD. Among ML families, we consider supervised learning and, more particularly, semi-supervised learning, which can be seen as a mix of supervised and unsupervised approaches. Supervised learning is well suited to classification tasks: it uses labeled data to learn the mapping between data and labels. Unsupervised learning looks for patterns and structures within the data for tasks such as clustering. Semi-supervised learning, a hybrid approach, is becoming popular with the growing abundance of data. It offers a learning scheme based on a partially or weakly labeled dataset in order to achieve a classification or function approximation task. In our case, semi-supervised learning allows the mobile operator to combine labeled data from a few users with a large amount of unlabeled, easily available data collected from many users. This combination makes it possible to learn all possible environment types related to user behavior.
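To make the idea of combining a few labeled samples with many unlabeled ones concrete, the following is a minimal self-training sketch. It is not the Deep Learning-based approach proposed in this paper: it is a generic pseudo-labeling loop with a nearest-centroid rule on a single feature, assumed here to be RSRP in dBm. The function name `self_train`, the `margin` parameter and all sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean

def self_train(labeled, unlabeled, margin=5.0, max_rounds=10):
    """Nearest-centroid self-training on one feature (illustrative only).

    labeled:   list of (value, label) pairs, label in {"indoor", "outdoor"}
    unlabeled: list of values without labels
    margin:    hypothetical confidence margin, in the unit of the feature
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # Recompute one centroid per class from the current labeled set.
        centroids = {
            cls: mean(v for v, lab in labeled if lab == cls)
            for cls in ("indoor", "outdoor")
        }
        confident, remaining = [], []
        for v in pool:
            d_in = abs(v - centroids["indoor"])
            d_out = abs(v - centroids["outdoor"])
            # Pseudo-label a sample only when one centroid is clearly closer.
            if abs(d_in - d_out) >= margin:
                confident.append((v, "indoor" if d_in < d_out else "outdoor"))
            else:
                remaining.append(v)
        if not confident:
            break  # no new confident pseudo-labels: stop
        labeled.extend(confident)
        pool = remaining
    return centroids, pool

# Made-up RSRP values in dBm, not taken from the paper's dataset.
labeled = [(-110.0, "indoor"), (-105.0, "indoor"),
           (-80.0, "outdoor"), (-75.0, "outdoor")]
unlabeled = [-112.0, -100.0, -92.0, -85.0, -70.0]
centroids, ambiguous = self_train(labeled, unlabeled)
```

In this toy run, four of the five unlabeled values receive a pseudo-label and refine the class centroids, while the value lying almost midway between the two centroids stays unlabeled. This mirrors, in a very simplified way, how a few labeled users can be complemented by abundant unlabeled measurements.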

The rest of this paper is organized as follows. Section II describes the main IOD works in the literature. Section III provides a comparative analysis of the crowdsourcing and drive-test data collection modes. Section IV gives results with supervised classification and clustering algorithms. Section V presents a new Deep Learning-based semi-supervised learning approach proposed for IOD from the network side, and Section VI discusses the results.

II. RELATED WORK

In the literature, the IOD problem has not been widely studied: only a few works address it. Proposed solutions are usually divided into two categories [9]. IOD is treated either as a statistical problem, where a weighted score or a threshold is defined to determine the mobile environment, or as a classification problem that sorts mobile users into multiple classes. Most of these works consider only two classes (Indoor/Outdoor), but some select three (e.g. Indoor/Semi-Outdoor/Outdoor). Figure 1 illustrates how the existing classes relate to one another.

Fig. 1. Example of an IOD classification scheme with 3 main classes

In addition to this categorization, IOD approaches can also be distinguished by where IOD is performed: on the mobile terminal side or on the mobile network side. In the following, we highlight some of the works dealing with IOD, presented according to this distinction.

In the first category, [10] applies a threshold to signals collected from phone sensors (radio signals, cell signal strength, light intensity and the magnetic sensor) to infer whether the mobile user is indoor or outdoor. However, this threshold is specific to the experimental settings in which it was calculated and does not generalize to new environments. Thus, relying on a threshold alone decreases IOD accuracy. The work in [5] uses the same signals as [10] but also considers sound intensity, battery temperature and the proximity sensor. For IOD, the authors propose a semi-supervised approach based on co-training. They run 2 classifiers in parallel and combine them through a weighted score of classification probability to improve the final IOD performance. Each classifier is given a different set of sensors so that it learns different perspectives and patterns. This work shows high performance (more than 90% accuracy) in the detection of new instances in unknown environments. However, its impact is limited because the database is not highly representative: the dataset was collected in only three places (a campus area, a city center and a residential area), which is not enough to train a general IOD system.

The work in [4] proposes a video streaming optimization that adapts to the user's location over time. To that end, IOD is performed by a Bayesian detector that combines measurements from two smartphone sensors to decide the type of user environment.

In the second category, the authors of [11] optimize the use of radio measurements in wireless networks. Specifically, they use radio signal measurements collected in different mobility situations with varying speed (low, medium, high), namely pedestrian, in-car and unmoving. They dynamically estimate the signal attenuation, which helps them classify the mobile user environment (pedestrian, in-car, unmoving) efficiently and ultimately improves the handover process. The authors assume that once the signal power attenuation is estimated correctly, it becomes easy to classify whether the mobile user is a pedestrian, in a car or unmoving, because the measured signal power of an unmoving user shows little variation, unlike the in-car or pedestrian cases. Nevertheless, this proposal is still at an early stage and has not yet been thoroughly developed. In [8], the main goal is to localize the mobile user by estimating longitude and latitude as accurately as possible. To do so, the authors assume that mobile users are outdoor, which highlights the importance of IOD and the need to classify the user environment. For the classification task, they used RSRP and RSRQ signals and tested several algorithms: SVM, logistic regression and random forest. SVM was retained since it performed best.

In this paper, we focus on automating IOD on the network side using machine learning algorithms. They are trained on a large, real dataset while minimizing mobile user interaction (minimal labels). We evaluate the performance of supervised and semi-supervised IOD methods in terms of F1-score. The goal is to determine the minimal amount of labeled data required to obtain good IOD performance.

III. COLLECTED DATA FOR IOD

In this section, we analyze the statistical differences between indoor and outdoor environments by focusing on the empirical cumulative distribution function (CDF), using a large, real dataset collected at multiple places and in many environments. We illustrate the impact of the two environments on the empirical CDFs, according to where the data is collected.
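As a reminder of what an empirical CDF measures, the sketch below builds F(x), the fraction of samples less than or equal to x, for two small sets of values and compares the two curves at one point. The helper name `ecdf` and the RSRP samples are made up for illustration and are not taken from the dataset studied here.

```python
from bisect import bisect_right

def ecdf(samples):
    """Return a function F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return F

# Made-up RSRP samples in dBm, for illustration only.
indoor_rsrp = [-118, -112, -109, -104, -101, -97]
outdoor_rsrp = [-103, -95, -90, -86, -81, -74]

F_in, F_out = ecdf(indoor_rsrp), ecdf(outdoor_rsrp)

# Vertical gap between the two curves at -100 dBm: the larger the gap,
# the better this feature separates the two classes at that level.
gap = F_in(-100) - F_out(-100)
```

With these toy values, five of the six indoor samples but only one of the six outdoor samples lie at or below -100 dBm, so the two curves are far apart at that point. Where the curves come close together, a single feature can no longer distinguish the two environments.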

A. Data Description

Our large dataset consists of Time, 3 LUMD radio signals, the Timing Advance (TA) metric and the label, when it is known. Each record is therefore a vector of 6 fields, the label included:

• Time: time at which the signal was recorded.

• RSRP: the average received power of the Reference Signal (RS) sent by the eNB, ranging from -140 dBm to -44 dBm [12].

• RSRQ: the ratio between RSRP and RSSI (Received Signal Strength Indicator), ranging from -19.5 dB to -3 dB [12]. RSSI represents the total power of the received signal (including the transmitted signal, the noise and the interference).

• CQI: indicator reported by the UE to the eNB that gives the most appropriate modulation and coding scheme to be used for transmission [13].

• TA: used to control uplink signal transmission timing. It is indicated by the eNB to the UE via a Timing Advance command [14].
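One possible in-memory representation of such a record is sketched below, with a sanity check based on the RSRP and RSRQ ranges quoted in the list above. The class name `Measurement`, the field names, the choice of seconds for the time field and the sample values are assumptions made for illustration; the paper does not specify a storage format.

```python
from dataclasses import dataclass
from typing import Optional

RSRP_RANGE_DBM = (-140.0, -44.0)  # range quoted in the text [12]
RSRQ_RANGE_DB = (-19.5, -3.0)     # range quoted in the text [12]

@dataclass
class Measurement:
    time: float                   # record time (seconds is an assumption)
    rsrp_dbm: float
    rsrq_db: float
    cqi: int
    ta: int
    label: Optional[str] = None   # "indoor", "outdoor", or None if unlabeled

    def is_valid(self) -> bool:
        """Check that RSRP and RSRQ fall within the quoted ranges."""
        lo_p, hi_p = RSRP_RANGE_DBM
        lo_q, hi_q = RSRQ_RANGE_DB
        return (lo_p <= self.rsrp_dbm <= hi_p
                and lo_q <= self.rsrq_db <= hi_q)

# Made-up records, for illustration only.
records = [
    Measurement(0.0, -95.0, -10.5, 9, 3, "outdoor"),
    Measurement(15.0, -112.0, -14.0, 5, 1),            # unlabeled
    Measurement(30.0, -150.0, -14.0, 5, 1, "indoor"),  # RSRP out of range
]
valid = [r for r in records if r.is_valid()]
labeled = [r for r in valid if r.label is not None]
```

Keeping the label optional lets labeled and unlabeled measurements share one structure, which is convenient when both kinds of data feed the same learning pipeline.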

These signals were collected over 9 months, around the clock (from October 2017 until June 2018), at an average rate of 1 measurement every 15 seconds while the mobile phone session is active and 1 measurement every 2 minutes otherwise. The dataset is made up of 40% labeled data and 60% unlabeled data. The 9-month collection was performed in many different environments such as mountains, beaches, forests, companies, cafes, streets, bars, parks, restaurants, lakes, etc. It also covered many kinds of places (countryside, villages, small cities, metropolises) and several countries, but in this paper we study only the data collected in France (Figure 2). This long collection period gives us data reflecting all weather types (heavy rain, fog, sun, snow, wind, rain, ...), i.e. almost all 4 seasons. With this measurement campaign, we therefore try to get as close as possible to the complexity and variety of a mobile user moving in the real world.

Fig. 2. Data collection Points in France: multiple environments and places

B. Data collection: crowdsourcing vs. drive-test mode

In crowdsourcing mode, the collected data consists of signals measured by the mobile phone and sent to the eNB. Our dataset, described in the previous subsection, was collected in this mode. Figure 3 shows the empirical cumulative distribution functions (CDFs) of RSRP and CQI obtained with the dataset. The significant offset between the indoor and outdoor curves results from substantial differences in radio signal propagation, mainly due to the reflection, diffraction, dispersion and attenuation experienced in indoor environments. However, we note that the ranges of RSRP and CQI values overlap to some extent. In addition, the extreme values of the indoor and outdoor CDFs (located in the tails) become similar, and the division between the two gets blurred. This behavior at the extremes can be explained by the ambiguous characteristics of the environment when a user is moving at high speed (train, car, ...) or is in a semi-indoor environment (such as a balcony, a semi-open building or near a window). We argue that these points are ambiguous and will pose a real challenge for supervised classification, since they could equally well be classified as indoor or outdoor.

Fig. 3. Empirical CDF F(x) of measured RSRP in dBm (left) and CQI (right) in crowdsourcing mode: multiple environments and places - Indoor (red) and Outdoor (green).

Fig. 4. Empirical CDF F(x) of measured RSRP in dBm (left) and CQI (right) in drive-test type mode: specific environments and places - Indoor (red) and Outdoor (green).

An alternative, widely used data collection mode is the drive test. However, this mode limits how much of reality the collected data can capture. Such data collection campaigns run for a limited number of hours per day, over a short period (a couple of weeks) and at a few specific places. To model this way of collecting data, referred to as drive-test mode, we extract a portion of data (EPD) from the whole dataset. With this EPD, we aimed to stay as close as possible to the type of places where the drive test in [8] was performed, by one of the top 3 American operators in New York City. Therefore, to build the EPD we consider only data collected in a metropolis (Paris and its southern suburbs, see Figure 5). Indeed, Paris, as a metropolis, has a dense and specific architecture that allows a better comparison with NYC. For indoor data, we selected instances where the user was strictly indoor and thus not in “semi-indoor” positions such as semi-open buildings or balconies. For outdoor data, we chose instances where the user was either a pedestrian or in a vehicle in various city streets (limited speed). To mimic a drive test, we consequently ignored data coming from environments such as the subway, the countryside, forests, beaches, mountains, etc. We did this to enable a fair comparison between the two modes. Figure 4 shows well-separated empirical RSRP CDFs for the indoor and outdoor classes. The superimposed points of the two CDFs, which we judged conflicting and which previously led to ambiguity, have disappeared, owing to the significant distance between the indoor and outdoor curves. We notice a similar phenomenon for the CQI CDFs. This analysis leads us to argue that supervised classification will perform better on a labeled dataset collected in drive-test mode than on one obtained through crowdsourcing.
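The selection rules used to build the EPD can be sketched as a simple filter. The dictionary keys (`area`, `label`, `detail`) and the tag values below are hypothetical: the paper does not describe how the environment annotations are stored, so this only illustrates the logic of keeping metropolis data, strictly indoor instances, and pedestrian or in-vehicle city-street instances.

```python
def extract_epd(records):
    """Keep only records resembling a drive-test campaign (illustrative).

    Each record is a dict with hypothetical keys:
      'area'   : e.g. 'metropolis', 'countryside', 'forest', 'subway'
      'label'  : 'indoor' or 'outdoor'
      'detail' : finer tag, e.g. 'strict_indoor', 'balcony',
                 'pedestrian', 'vehicle_city', 'train'
    """
    indoor_ok = {"strict_indoor"}
    outdoor_ok = {"pedestrian", "vehicle_city"}
    epd = []
    for r in records:
        if r["area"] != "metropolis":
            continue  # drop countryside, forest, beach, subway, ...
        if r["label"] == "indoor" and r["detail"] in indoor_ok:
            epd.append(r)
        elif r["label"] == "outdoor" and r["detail"] in outdoor_ok:
            epd.append(r)
    return epd

# Made-up annotated records, for illustration only.
sample = [
    {"area": "metropolis", "label": "indoor", "detail": "strict_indoor"},
    {"area": "metropolis", "label": "indoor", "detail": "balcony"},
    {"area": "metropolis", "label": "outdoor", "detail": "pedestrian"},
    {"area": "countryside", "label": "outdoor", "detail": "pedestrian"},
    {"area": "metropolis", "label": "outdoor", "detail": "train"},
]
epd = extract_epd(sample)
```

Filtering out semi-indoor and high-speed instances removes exactly the ambiguous points discussed for the crowdsourced CDFs, which is why a subset built this way is expected to look cleaner than the full dataset.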

Fig. 5. The Data collection Points of EPD in drive-test like mode: Paris and

southern suburbs

IV. CLASSIFICATION USING SUPERVISED LEARNING OR

CLUSTERING

After analyzing the statistical properties of I/O environments, we first evaluate the accuracy and performance of supervised classifiers for IOD. For this, we use the accuracy metric, which is the ratio of correctly classified instances to the total number of instances, and the F1-score, which is by definition the harmonic mean of Precision and Recall:

$$F_1\text{-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where precision is the number of correct positive results divided by the number of all positive results returned by the classifier, and recall is the number of correct positive results divided by the number of all relevant samples. The F1-score is one of the most widely used metrics for unbalanced classes. Indeed, the statistics of our data show that the proportion between indoor and outdoor classes is unbalanced: 65% Indoor vs. 35% Outdoor. This reflects reality since people, in general, spend more time at home or in indoor environments than outdoors. For the experiments, we divided the dataset as follows: 70% for training, 30% for validation and test. We evaluate the impact of two input pairs, (RSRP, RSRQ), which is the reference input for IOD in the literature, and (RSRP, CQI), in three cases:

• Training and evaluation on labeled EPD collected in

drive-test like mode (see Table I),

• Training on labeled EPD and evaluation on the rest of

the labeled data of crowdsourcing mode, thus operating

with unknown environments (see Table II) and,

• Training and evaluation on labeled data collected in

crowdsourcing mode (see Table III).
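To illustrate why the F1-score is reported alongside accuracy for unbalanced classes, the sketch below computes both metrics for a trivial classifier that always answers "indoor" on a sample with the 65%/35% class proportions mentioned above. The function name `binary_metrics` and the choice of "outdoor" as the positive class are assumptions made for this example only.

```python
def binary_metrics(y_true, y_pred, positive="outdoor"):
    """Return (accuracy, precision, recall, F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# 65 indoor and 35 outdoor samples, matching the proportions in the text.
y_true = ["indoor"] * 65 + ["outdoor"] * 35
# A trivial classifier that always answers "indoor".
y_pred = ["indoor"] * 100

acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
```

Here the trivial classifier reaches 65% accuracy simply by following the majority class, while its F1-score on the outdoor class is 0 because it never detects an outdoor sample. Accuracy alone would therefore overstate the quality of such a classifier.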

As shown in Table I, running either classification (SVM, Random Forest, Neural Network) or clustering (k-means) algorithms on the EPD, obtained in drive-test-like mode, yields excellent performance with an F1-score of 99%, which is
