
Auto-play: A data mining approach to ODI cricket simulation and prediction

TL;DR: A prediction system that takes in historical match data as well as the instantaneous state of a match, and predicts future match events culminating in a victory or loss is built, demonstrating the performance of the algorithms in predicting the number of runs scored, one of the most important determinants of match outcome.
Abstract: Cricket is a popular sport played by 16 countries, is the second most watched sport in the world after soccer, and enjoys a multi-million dollar industry. There is tremendous interest in simulating cricket and more importantly in predicting the outcome of games, particularly in their one-day international format. The complex rules governing the game, along with the numerous natural parameters affecting the outcome of a cricket match, present significant challenges for accurate prediction. Multiple diverse parameters, including but not limited to cricketing skills and performances, match venues and even weather conditions, can significantly affect the outcome of a game. The sheer number of parameters, along with their interdependence and variance, makes it a non-trivial challenge to create an accurate quantitative model of a game. Unlike other sports such as basketball and baseball, which are well researched from a sports analytics perspective, for cricket these tasks have yet to be investigated in depth. In this paper, we build a prediction system that takes in historical match data as well as the instantaneous state of a match, and predicts future match events culminating in a victory or loss. We model the game using a subset of match parameters, using a combination of linear regression and nearest-neighbor clustering algorithms. We describe our model and algorithms and finally present quantitative results, demonstrating the performance of our algorithms in predicting the number of runs scored, one of the most important determinants of match outcome.


Vignesh Veppur Sankaranarayanan, Junaed Sattar and Laks V. S. Lakshmanan
Department of Computer Science
University of British Columbia
Vancouver, B.C. Canada V6T 1Z4
Email: {vsvicky,junaed,laks}@cs.ubc.ca
Keywords
Sports prediction, analytics, ridge regression, attribute
bagging, nearest neighbors
1 Introduction
Primarily played in the member countries of the Com-
monwealth, cricket has grown in following across all con-
tinents. It has the second largest viewership by popula-
tion for any sport, next only to soccer, and generates an
extremely passionate following among the supporters.
There is huge commercial interest in strategic planning
for ensuring victory and in game outcome prediction.
This has motivated thorough and methodical analysis
of individual and team performance, as well as predic-
tion of future games, across all formats of the game.
Currently, team strategists rely on a combination
of personal experience, team constitution and seat-of-the-pants
“cricketing sense” for making instantaneous
strategic decisions. Inherently, the methodology em-
ployed by human experts is to extract and leverage im-
portant information from both past and current game
statistics. However, to our knowledge, the underlying
science behind this has not been clearly articulated.
One of the key problems that needs to be solved in for-
mulating strategies is predicting the outcome of a game.
Our focus in this paper is to address the problem of
accurately modeling game progression towards match
outcome prediction. We learn a model for one-day for-
mat games by mining existing game data. In principle,
our approach is applicable towards modeling any for-
mat of the game; however, we choose to focus our test-
ing and evaluation on the most popular format, namely
one-day international (ODI). By using a combination of
supervised and unsupervised learning algorithms, our
approach learns a number of features from a one-day
cricket dataset which consists of complete records of all
games played in a 19-month period between January
2011 and July 2012. Along with these learned historical
features of the game, our model also incorporates
instantaneous match state data, such as runs scored and
wickets lost, as the game progresses, to predict future
states of an ongoing match. By using a weighted
combination of both historical and instantaneous features,
our approach is thus able to simulate and predict game
progression before and during a match. We motivate
the problem of game modeling and outcome prediction
in Section 2. Along with a brief introduction to cricket,
Section 3 presents the problem formulation, with details
on feature modeling. In Section 4, we present our
algorithm for predicting the game progression and outcome,
with results discussed in Section 5.
2 Related Work
2.1 Data Mining in Other Sports
The problem of match outcome prediction has been studied
extensively in the context of basketball and soccer.
Bhandari et al. [4] developed the Advanced Scout system for
discovering interesting patterns from basketball games,
which is now used by NBA teams. More recently,
Schultz [12] studies how to determine the types and
combinations of players most relevant to winning
matches. In soccer, Luckner et al. [11] predict the
outcome of FIFA World Cup 2006 matches using live
prediction markets. In baseball, Gartheepan et al. [7]
built a data-driven model that helps in deciding when
to ‘pull a starting pitcher’. These works are developed
with sport-specific intuition, which would render them
inapplicable to the sport of cricket.
2.2 Academic Interest in Cricket
One of the earliest pioneering works in cricket was by Duckworth
and Lewis [6] where they introduce the Duckworth-
Lewis or D-L method, which allows fair adjustment of
scores in proportion to the time lost due to match inter-
ruption (often due to adverse weather conditions such
as rain, poor visibility etc.). This proposal has been
adopted by the International Cricket Council (ICC) as a
means to reset targets in matches where time is lost due
to match interruptions. The method proposed in [6],
and subsequently adapted by [14], for capturing the re-
sources of a team during the progression of a match has
found independent use in subsequent work in cricket
modeling and mining [14][2].
Lewis [10], Lemmer [9], Alsopp and Clarke [1], and
Beaudoin [3] develop new performance measures to rate
teams and to find the most valuable players. Raj and
Padma [15] analyze the Indian cricket team’s One-Day
International (ODI) match data and mine association
rules from a set of features, namely toss, home or
away game, batting first or second and game outcome.
Kaluarachchi and Varde [8] employ both association
rules and a naive Bayes classifier to analyze the factors
contributing to a win, also taking day or day-night games
into account. Both approaches use a very limited subset
of high-level features to analyze the factors contributing
to victory. Furthermore, they do not address score
prediction, nor the progression of the game.
Bailey and Clarke [2] use historical match data
and predict the total score of an innings using linear
regression. As data of a match in progress streams
in, the prediction model is updated. Using this, they
analyze the betting market’s¹ sensitivity to the ups and
downs of the game. Swartz et al. [17] use Markov
Chain Monte Carlo methods to simulate ball by ball
outcome of a match using a Bayesian Latent variable
model. Based on the features of current batsman,
bowler, and game situation (number of wickets lost and
number of balls bowled), they estimate the outcome of
the next ball. This model suffers from severe sparsity
as noted by the authors themselves: the likelihood of a
given batsman having previously faced a given bowler
in previous games in the dataset is low.
While both [17] and [2] have built match simulators
for ODI cricket, their models rely on games played
over 10 years ago. ODI cricket has since undergone
a number of major rule modifications. Important
examples include powerplays, free hit after an illegal ball
delivery, and the use of two new balls (as opposed to just
one) in an innings. These changes significantly affect
the team strategies, and essentially render old models
a poor fit. Our focus is on the modern and current
form of ODI cricket, incorporating all recent changes to
the game with support for accommodating future rule
modifications.
3 Game Modeling
3.1 Overview of ODI Cricket Rules and Objectives
We provide a brief overview of ODI cricket and
review its basic rules as they pertain to game modeling
and score prediction. We also introduce the basic
notation and terminology used in the rest of the paper.
Toss: Similar to a number of other sports, an ODI
cricket match starts with a toss. The team that wins
the toss can choose to bat first or can ask the opponents
to bat first. This decision is important and takes
into account the nature of the playing field, weather
conditions, and relative strengths and weaknesses of the
two teams.
Objective: In a game between $Team_A$ and $Team_B$,
suppose $Team_A$ wins the toss and chooses to bat
first. The period during which $Team_A$ bats is called
$innings_1$, in which $Team_A$ has 50 overs to score as
many runs as possible, while $Team_B$ tries to minimize
the scoring by getting $Team_A$’s batsmen out (more
commonly referred to as taking wickets). $Team_B$ can
also restrict scoring by bowling balls that are difficult
to score off and by flawless fielding, where fielders stop
hits by batsmen of $Team_A$ to deny them opportunities
to score runs. $innings_1$ comes to an end when $Team_A$
loses all its wickets or finishes its quota of 50 overs,
whichever happens first. Let $Score_A$ denote the number
of runs accumulated by $Team_A$ at this point. When
$Team_B$ comes in to bat in $innings_2$, it has the same
quota of 50 overs to play, with the goal of scoring at
least $Score_A + 1$ runs; $innings_2$ ends when $Score_B$,
the number of runs scored by $Team_B$, exceeds $Score_A$,
or when $Team_B$ finishes its quota of 50 overs or loses
all its wickets, whichever happens first. $Team_B$ is
deemed the winner in the first case, and $Team_A$ wins
otherwise. A third possibility is a tie, when $Score_A$ and
$Score_B$ are equal at the end of the game.²

¹ There is a vibrant betting market associated with cricket. See,
e.g., http://www.betfair.com/exchange/en-gb/cricket-4/sp/.
Scoring: Teams can accumulate runs in two ways. One
way of scoring is to power-hit the ball outside the
playing area. Four runs are awarded if the ball touches
the ground before rolling past the boundary of the
playing area. If the ball lands directly outside the
playing area, six runs are awarded. Borrowing a term
from baseball, for convenience, we collectively
term runs scored this way home runs. Home runs
yield greater reward in terms of runs scored, but the
batsmen have to take risks to hit them, which increases
their chance of getting out. The other way of scoring
is to hit the ball within the playing area and for the
two batsmen to run and exchange their positions. In
the meantime, the opposing players try to collect the
ball to minimize the number of exchanges. Runs are
awarded based on the number of times the batsmen
exchange their positions before the ball is returned to
one of the positions. There is theoretically no bound on
the number of exchanges possible off a given ball, but this
value typically lies in the range of 1 to 3 runs. This way of
scoring has a lower risk of the batsman getting out but
yields a lower number of runs. We term these non-home
runs. Runs are also awarded to the batting team when the
bowler commits a foul while delivering the ball. Runs
conceded this way are usually few and are accounted
for as non-home runs in our model.
Dismissal: There are eleven ways for a batsman to
lose his wicket, commonly referred to as getting out or
dismissed. The common ways to get dismissed are being
bowled, caught by opponents, run out and Leg-Before-
Wicket (abbreviated as LBW). In our model, we do not
distinguish between the different forms of dismissal.
Target score: The number of runs accumulated by
$Team_A$ at the end of $innings_1$ is $Score_A$.
$Score_A + 1$ runs is set as the Target that the team
batting second tries to achieve or exceed in $innings_2$.
Resources: Overs and Wickets are collectively termed
resources. The batting team consumes the overs
to accumulate runs and loses wickets in the process.
A batting team has 50 overs and 10 wickets at its
disposal at the start of an innings. These resources
continually decrease as the game progresses.

² Currently, there are no tie-breakers in the ODI format,
possibly because ties are extremely rare.
Segment: The batting period of a team is called an
innings, and it lasts until the team runs out of one of the
resources. We split the 50-over window into 10 segments
of 5 overs each, denoted $S_i$, $1 \le i \le 10$. For a
team $T$, $R^T_i$ and $W^T_i$ denote the number of runs
scored and the number of wickets lost in segment $S_i$,
respectively. The total number of runs scored by team $T$
at the end of its innings is given by
$R^T_{eoi} = \sum_{i=1}^{10} R^T_i$. We drop the
superscript $T$ when the team is clear from the
context. Below, we formalize the problem addressed in
this paper.
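The segment notation can be illustrated with a short sketch. It assumes over-by-over totals are available as plain lists; the names `segment_totals`, `runs_per_over` and `wickets_per_over` are illustrative and do not come from the paper.

```python
from typing import List, Tuple

OVERS_PER_SEGMENT = 5
SEGMENTS_PER_INNINGS = 10


def segment_totals(runs_per_over: List[int],
                   wickets_per_over: List[int]) -> List[Tuple[int, int]]:
    """Return (R_i, W_i) for every segment S_i that has been at least partly played."""
    if len(runs_per_over) != len(wickets_per_over):
        raise ValueError("both lists must cover the same overs")
    if len(runs_per_over) > OVERS_PER_SEGMENT * SEGMENTS_PER_INNINGS:
        raise ValueError("an ODI innings has at most 50 overs")
    segments = []
    for start in range(0, len(runs_per_over), OVERS_PER_SEGMENT):
        end = start + OVERS_PER_SEGMENT
        segments.append((sum(runs_per_over[start:end]),
                         sum(wickets_per_over[start:end])))
    return segments


def end_of_innings_score(segments: List[Tuple[int, int]]) -> int:
    """R_eoi: the sum of R_i over all segments of the innings."""
    return sum(runs for runs, _ in segments)
```

An innings that ends early (all out) simply yields fewer than 10 segments in this sketch; how the paper treats a partially played final segment is not stated here.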
3.2 Problem Formulation
The main problem we tackle in this paper is: given the
instantaneous match data up to a certain point in the game,
predict the progression of the remainder of the game, and in
particular, predict the winner. Before we formalize this,
we define a match state at segment $n$, $0 \le n < 10$, as
the pair of numbers consisting of the number of runs
scored and the number of wickets lost so far by the
batting team. Notice that given a match state, the
resources remaining at the batting team’s disposal can
be easily calculated: the number of balls remaining is
$(10 - n) \times 5 \times 6$ and the number of wickets
remaining is $10 - (\#\text{wickets lost so far})$.
More precisely, given a match state associated with
segment $n$, namely
$(R_{known} = \sum_{i=1}^{n} R_i,\ W_{known} = \sum_{i=1}^{n} W_i)$,
predict the number of runs $\hat{R}_i$ for the remaining
segments $i$, $n + 1 \le i \le 10$. Using these predictions,
the total predicted score at the end of the innings
can be obtained as
$$\hat{R}_{eoi} = R_{known} + \sum_{i=n+1}^{10} \hat{R}_i \tag{3.1}$$
If an innings has not commenced, as a special case,
$n = 0$, $R_{known} = 0$ and $W_{known} = 0$.
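As a rough illustration of the match state and of the score projection in (3.1), the following sketch assumes the per-segment predictions are already available from some model. The class `MatchState` and the function `predicted_final_score` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MatchState:
    segments_completed: int  # n, with 0 <= n < 10
    runs: int                # R_known
    wickets_lost: int        # W_known

    def balls_remaining(self) -> int:
        # (10 - n) segments x 5 overs x 6 balls
        return (10 - self.segments_completed) * 5 * 6

    def wickets_remaining(self) -> int:
        return 10 - self.wickets_lost


def predicted_final_score(state: MatchState,
                          predicted_segment_runs: Sequence[float]) -> float:
    """R_known plus the predicted runs for segments n+1 .. 10."""
    expected = 10 - state.segments_completed
    if len(predicted_segment_runs) != expected:
        raise ValueError(f"expected {expected} segment predictions")
    return state.runs + sum(predicted_segment_runs)
```

For instance, a state with four completed segments needs exactly six predicted values, one for each of the segments 5 to 10.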
We follow this segmented prediction approach to
predict $\hat{R}_{eoi}$ for both $innings_1$ and
$innings_2$. $Team_A$ is predicted to be the winner if
$\hat{R}^A_{eoi} > \hat{R}^B_{eoi}$. $Team_B$ is
predicted to be the winner if
$\hat{R}^A_{eoi} < \hat{R}^B_{eoi}$.
3.3 Sub-Problem
We break down the problem of predicting the number of
runs in the next segment $S_{n+1}$, given the match state
up to segment $S_n$, into two subproblems, by recognizing
that home runs and non-home runs are strategized and
scored by different means by the batsmen. We have found
from our analysis and exploration that the number of runs
can be predicted more accurately if we learn separate
models for predicting the home runs $HR_{n+1}$ and
non-home runs $NHR_{n+1}$. More precisely, for any
segment $i$, $R_i = HR_i + NHR_i$ and
$\hat{R}_i = \widehat{HR}_i + \widehat{NHR}_i$, where
$\hat{X}$ is the predicted value of $X$.
While it may seem counter-intuitive to use two
different classes of techniques to predict the overall
total score, this decision was driven by observing the
inherent nature of the game itself, and has been
justified by our experimental results. In a given
game, the non-home run scoring balls greatly
outnumber the home run scoring balls. A linear-regression
based approach to predicting home runs
thus runs into the problem of data sparsity. Attribute
bagging, on the other hand, enables our system to find
matches that have similar home-run scoring patterns,
given the set of match features, and thus avoids the
sparsity issue altogether. Our experiments (see Section 5)
show much degraded performance when using
ridge regression for HR prediction, with the MAE for
$\hat{R}_{eoi}$ increasing from 16.5 runs to 29.4 runs.
Prediction of $\widehat{HR}_i$ and $\widehat{NHR}_i$ is
accomplished using two sets of features, historical
features and instantaneous features, described next. Of
these, historical features are critical for predicting runs
for the first segment, since by definition, no instantaneous
match data is available before the first segment.
3.4 Historical Features
Our model consists of 6 historical features for each team
in the dataset. They are mined from data across all matches
played by a given team. The historical features of a team
are as follows:
(1) Average runs scored (by the team) in an innings;
(2) Average number of wickets lost in an innings;
(3) Frequency of being all-out;³
(4) Average runs conceded in an innings;
(5) Average number of opponent wickets taken in an innings;
(6) Frequency of getting the opposition all-out.
In what follows, we will use $N$ to denote the total
number of matches in the training dataset. Recall that $n$
denotes the segment up to which the match state is known.
The first feature is calculated by dividing the total runs
scored by the given team by the number of matches
it played.
$$\text{AverageScore} = \frac{\sum_{i=1}^{N} (\text{Runs scored in match}_i)}{N} \tag{3.2}$$
The subsequent five features are self-explanatory
and are calculated similarly to (3.2). Out of the 6
features, the first three represent the team’s batting
ability, while the last three represent the team’s bowling
ability.

³ That is, losing all 10 wickets in an innings within 50 overs.
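The six historical features could be computed along the following lines. This is a sketch under stated assumptions: the record layout `TeamMatch`, the dictionary keys, and the use of "10 wickets" as the all-out test are choices made for illustration, and each average is taken over the matches the given team played.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TeamMatch:
    """One match from a single team's point of view (hypothetical layout)."""
    runs_scored: int
    wickets_lost: int
    runs_conceded: int
    wickets_taken: int


def historical_features(matches: List[TeamMatch]) -> Dict[str, float]:
    n = len(matches)
    if n == 0:
        raise ValueError("team has no matches in the training data")
    return {
        # Batting ability
        "avg_runs_scored": sum(m.runs_scored for m in matches) / n,
        "avg_wickets_lost": sum(m.wickets_lost for m in matches) / n,
        "all_out_frequency": sum(m.wickets_lost == 10 for m in matches) / n,
        # Bowling ability
        "avg_runs_conceded": sum(m.runs_conceded for m in matches) / n,
        "avg_wickets_taken": sum(m.wickets_taken for m in matches) / n,
        "opposition_all_out_frequency":
            sum(m.wickets_taken == 10 for m in matches) / n,
    }
```

The first three entries correspond to the batting features and the last three to the bowling features, mirroring the grouping described above.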
3.5 Instantaneous Features
In addition to the features mined from past game data,
i.e., the historical features, we incorporate several
instantaneous match features in our prediction model.
What has happened in the game so far is an important
indicator for predicting the game outcome. We extract the
following instantaneous features from the dataset.
1. Home or Away: This is a binary feature describing
whether the batting team is playing at its home ground. If
the match is played at a neutral venue, this feature carries
no weight for either team.
2. Powerplay: A powerplay is a restriction on the number
of fielders that can be placed by the bowling team
outside a certain range from the batsmen (usually
30 yards, approx. 27.4 meters). This restriction
enables the batsmen to hit the ball aggressively and
try to score home runs, with a relatively reduced
risk of getting out. The first 10 overs of the game
are mandatory powerplays, with two more powerplay
periods chosen by the batting and bowling teams,
one each, to occur at any point in the game
up to the 45th over. For any segment, the powerplay
can occupy between 0 and 5 overs of the segment.
Consequently, the value of this feature ranges from 0
to 1 in increments of 0.2.
3. Target: The goal of the team batting second is to
achieve the Target Score ($= Score_A + 1$ runs). This
is used as a feature.
4. Batsmen performance features: For any given
segment $S_n$, we identify four performance indicators for
each of the two currently playing batsmen. They are
batsman-cluster (to be described in Section 3.6), #runs
scored, #balls faced, and #home runs hit till segment
$S_{n-1}$.
5. Game snapshot: This feature is a pair of game
state variables, namely current score and #wickets (i.e.,
#batsmen) left.
Instantaneous features 4 and 5 are explained in
detail in Sections 3.6 and 3.7.
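To make the powerplay feature from item 2 concrete, here is a minimal sketch of how its value for a segment might be derived. It assumes the powerplay overs of an innings are known as a set of 1-based over numbers; the function name `powerplay_fraction` is illustrative only.

```python
from typing import Set


def powerplay_fraction(segment_index: int, powerplay_overs: Set[int]) -> float:
    """Fraction of the 5 overs in segment S_i that fall under a powerplay.

    segment_index is 1-based (1..10); powerplay_overs holds 1-based over
    numbers (1..50). The result is one of 0, 0.2, 0.4, 0.6, 0.8, 1.
    """
    if not 1 <= segment_index <= 10:
        raise ValueError("segment_index must be between 1 and 10")
    first_over = (segment_index - 1) * 5 + 1
    overs = range(first_over, first_over + 5)
    return sum(1 for over in overs if over in powerplay_overs) / 5
```

With the mandatory powerplay in overs 1 to 10, segments 1 and 2 always take the value 1; the remaining segments depend on where the two chosen powerplays fall.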
3.6 Batsmen Clustering
In our dataset, there are more than 200 players who have
faced at least one ball. Given data corresponding to 125
matches, learning the features for each of the 200
individual players is fraught with extreme sparsity. To
give an example, given a currently playing batsman $b$ and
a current bowler $\ell$, the probability that $b$ has faced
$\ell$ in earlier matches can be quite low. Even when $b$
has faced $\ell$ before, the number of such matches can be
too small to learn any useful signals from, for purposes of
prediction. To quantify, if in a dataset of $M$ matches, the
number of matches played by player $b$ is $m_b$, and by
player $\ell$ is $m_\ell$ (where $M \gg m_b$ and
$M \gg m_\ell$), even assuming independence, the
probability that $b$ and $\ell$ played together is
$\frac{m_b}{M} \times \frac{m_\ell}{M}$. To overcome this
sparsity, we cluster the batsmen according to their batting
skills, using the following four features: (1) Batting
Average; (2) Strike Rate; (3) Home-run hitting ability; and
(4) Milestone reaching ability. The first two features are
standard metrics used to report batsmen stats in cricket.
Although they are used to express a batsman’s quality,
they do not quite capture his skill as observed by cricket
experts and shown by [16] and [1]. Hence, for batsman
clustering, we also use Features 3 and 4, which capture the
quality of batsmen more accurately.
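The sparsity argument can be checked with a small calculation. The match counts below are made up for illustration; only the dataset size of 125 matches comes from the text, and the independence assumption is the one stated above.

```python
def co_occurrence_probability(m_b: int, m_l: int, total_matches: int) -> float:
    """Probability that batsman b and bowler l both appear in a given match,
    assuming their appearances are independent."""
    if total_matches <= 0:
        raise ValueError("total_matches must be positive")
    return (m_b / total_matches) * (m_l / total_matches)


# Hypothetical counts: b played 20 and l played 15 of the 125 matches.
p = co_occurrence_probability(20, 15, 125)
expected_shared_matches = p * 125
```

Under these made-up counts the two players would be expected to share only a handful of matches, which is too few to fit per-pair features and motivates grouping batsmen into clusters instead.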
Batting Average for a batsman is the ratio of the
total number of runs he has scored across all matches,
over the number of times he has gotten out. Strike
Rate is the average number of runs scored per 100
balls, again calculated across all matches played. Both
Batting Average and Strike Rate are standard player
statistics used in cricket.
We measure the ability of a batsman to frequently
hit home runs using
$$\text{HR-hittingAbility} = \frac{\sum_{i=1}^{N} \#\text{home runs hit in match}_i}{\sum_{i=1}^{N} \text{balls faced in match}_i} \tag{3.3}$$
Scoring fifty runs or a hundred runs (commonly
referred to as half-century and century) are considered
batting milestones in cricket. Players who consistently
and frequently reach these milestones are considered to
be of very high caliber. To capture this, we define
a metric called milestone reaching ability (MRA) as
follows:
$$\text{MRA} = \frac{\#\text{ of 50 \& 100 run scores in } N \text{ matches played}}{N} \tag{3.4}$$
MRA is thus a good indication of batsman quality.
Using the above four statistics, we cluster the batsmen
into 5 clusters using k-nearest neighbor clustering.
We chose 5 clusters based on the intuition that a team
consists of opening batsmen, middle-order batsmen,
all-rounders, a wicket-keeper, and tail-enders, having
different batting capabilities.
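The four clustering statistics can be sketched as follows. The record type `BattingInnings` is hypothetical, one record is assumed per match played, an innings of 50 or more runs is counted once as a milestone, and the fallback for a batsman who was never dismissed is an assumption of this sketch rather than something the paper specifies. The clustering step itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BattingInnings:
    """One batsman's innings in one match (hypothetical record layout)."""
    runs: int
    balls_faced: int
    home_runs_hit: int  # number of fours and sixes hit
    dismissed: bool


def batsman_features(innings: List[BattingInnings]) -> Tuple[float, float, float, float]:
    """Return (batting average, strike rate, HR-hitting ability, MRA)."""
    if not innings:
        raise ValueError("need at least one innings")
    total_runs = sum(i.runs for i in innings)
    total_balls = sum(i.balls_faced for i in innings)
    dismissals = sum(1 for i in innings if i.dismissed)
    # Assumption: with no dismissals, fall back to total runs.
    batting_average = total_runs / dismissals if dismissals else float(total_runs)
    strike_rate = 100.0 * total_runs / total_balls if total_balls else 0.0
    hr_ability = (sum(i.home_runs_hit for i in innings) / total_balls
                  if total_balls else 0.0)
    mra = sum(1 for i in innings if i.runs >= 50) / len(innings)
    return batting_average, strike_rate, hr_ability, mra
```

Each batsman is thus reduced to a four-dimensional vector, which is the input that the clustering step described above would operate on.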
3.7 Game Snapshot
Recall that the problem is, given the match state data up
to segment $n < 10$, i.e., runs scored $R_i$ and wickets
lost $W_i$ in segment $i$, $1 \le i \le n$, we need to
predict the number of runs for segment $n + 1$. To
facilitate this, we aggregate all of the information in
segments $S_1$ to $S_{n-1}$ and retain the information in
segment $S_n$ separately. More precisely, we set
$R_{1:n-1} = \sum_{i=1}^{n-1} R_i$ and
$W_{1:n-1} = \sum_{i=1}^{n-1} W_i$. We then incorporate the
instantaneous features $R_{1:n-1}$, $W_{1:n-1}$, $R_n$,
$W_n$ in our model. Since our score prediction is done
separately for home and non-home runs, we use
$HR_{1:n-1} = \sum_{i=1}^{n-1} HR_i$ and
$NHR_{1:n-1} = \sum_{i=1}^{n-1} NHR_i$ as features
instead of $R_{1:n-1}$, and predict the number of home runs
and non-home runs for segment $n + 1$.
For example, to predict the runs in segment $S_6$
(overs 26 to 30), runs scored and wickets lost in segments
$S_1$ to $S_4$ are aggregated. Runs and wickets in segment
$S_5$ are retained as such. This approach provides the
game information up to segment $S_{n-1}$ and the game
information in segment $S_n$ separately to the model. This
provides a broader snapshot of the match state and also
gives more importance to the immediately preceding
segment.
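The aggregation described above can be sketched as a small helper. The function name, the dictionary keys, and the choice to also split the retained segment $S_n$ into home and non-home runs are assumptions made for illustration.

```python
from typing import Dict, Sequence


def game_snapshot(hr: Sequence[int], nhr: Sequence[int],
                  wickets: Sequence[int], n: int) -> Dict[str, int]:
    """Snapshot features for predicting segment n + 1.

    hr, nhr and wickets hold per-segment totals for S_1..S_n
    (index 0 is S_1). Segments S_1..S_{n-1} are aggregated, while
    S_n is kept separately.
    """
    if not 1 <= n <= min(len(hr), len(nhr), len(wickets)):
        raise ValueError("n must refer to a completed segment")
    return {
        "HR_1:n-1": sum(hr[:n - 1]),
        "NHR_1:n-1": sum(nhr[:n - 1]),
        "W_1:n-1": sum(wickets[:n - 1]),
        "HR_n": hr[n - 1],
        "NHR_n": nhr[n - 1],
        "W_n": wickets[n - 1],
    }
```

Calling it with `n = 5` reproduces the example in the text: segments 1 to 4 are summed and segment 5 is kept as is, ready for predicting segment 6.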
Our learning algorithm, described in the next section,
makes use of the aforementioned historical and
instantaneous features up to a given segment to predict
scores for subsequent segments and uses that to predict
the overall score $\hat{R}_{eoi}$. As a special case, when
$n = 0$, the algorithm relies on historical features alone
to make its predictions.
4 Algorithm
4.1 Home-Run Prediction Model
Using the historical and non-historical features discussed
above, we predict the number of home runs $\widehat{HR}_i$
for a segment $S_i$, using the attribute bagging ensemble
method [5] with nearest-neighbor clustering. Here, we
choose random subsets of features for $n$ classifiers with
$l$ features each and aggregate the overall results.
Different sets of features corresponding to the previous
states are chosen randomly and their nearest neighbors are
identified from history, thereby leveraging the Markovian
nature of segments. The number of features for every
classifier is set to the square root of the total number of
features. The number of classifiers is determined
experimentally. The intuition behind using a
nearest-neighbor algorithm is that information from similar
match situations can be “borrowed” from the training
dataset. We use Spearman’s distance metric, which uses rank
correlation to identify the top neighbor.
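A minimal sketch of attribute bagging with a one-nearest-neighbor predictor is given below. It is not the authors' implementation: the function names are hypothetical, averaging is assumed as the aggregation rule, and Euclidean distance stands in as a default where the paper uses a Spearman-based distance (any callable can be passed as `distance`).

```python
import math
import random
from typing import Callable, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bagged_nn_predict(train_x: List[Sequence[float]],
                      train_y: Sequence[float],
                      query: Sequence[float],
                      n_classifiers: int = 25,
                      distance: Callable[[Sequence[float], Sequence[float]], float] = euclidean,
                      seed: int = 0) -> float:
    """Average the targets of the top neighbor found by each classifier.

    Every classifier sees a random subset of sqrt(#features) features.
    """
    if not train_x or len(train_x) != len(train_y):
        raise ValueError("training data must be non-empty and aligned")
    rng = random.Random(seed)
    n_features = len(query)
    subset_size = max(1, round(math.sqrt(n_features)))
    predictions = []
    for _ in range(n_classifiers):
        cols = rng.sample(range(n_features), subset_size)
        q = [query[c] for c in cols]
        # Index of the training row closest to the query on these features.
        best = min(range(len(train_x)),
                   key=lambda r: distance([train_x[r][c] for c in cols], q))
        predictions.append(train_y[best])
    return sum(predictions) / len(predictions)
```

Here each row of `train_x` would be the feature vector of a historical segment and `train_y` the home runs scored in the segment that followed; the number of classifiers is left as a tunable parameter, in line with the text.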
The Spearman distance is a measure of pairwise
linear correlation between ranked variables. Suppose
the sequences of values of two variables
$u = (u_1, \ldots, u_m)$ and $v = (v_1, \ldots, v_m)$ are
rank-ordered; then the Spearman correlation coefficient is
defined as:
$$\rho = \frac{\sum_{i=1}^{m} (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{m} (u_i - \bar{u})^2 \sum_{i=1}^{m} (v_i - \bar{v})^2}} \tag{4.5}$$
It is the same as the Pearson correlation coefficient except
that ranks are used in place of observed values.
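The rank-correlation computation can be sketched in plain Python. Two points are assumptions of this sketch: tied values receive their average rank, and the distance is taken as $1 - \rho$, a common convention that the text does not spell out.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mean_u) ** 2 for a in u)
                    * sum((b - mean_v) ** 2 for b in v))
    if den == 0:
        raise ValueError("correlation is undefined for constant input")
    return num / den


def spearman_rho(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation computed on ranks instead of raw values."""
    return pearson(average_ranks(u), average_ranks(v))


def spearman_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return 1.0 - spearman_rho(u, v)
```

Because only ranks matter, any monotonically increasing relationship between the two sequences gives a correlation of 1 and hence a distance of 0.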
Game features are ranked in the training and test
dataset separately. The distance between a match in the

Citations
More filters
01 Jan 2016
TL;DR: This work suggests that the relative team strength between the competing teams forms a distinctive feature for predicting the winner of a One Day International cricket match.
Abstract: With the advent of statistical modeling in sports, predicting the outcome of a game has been established as a fundamental problem. Cricket is one of the most popular team games in the world. With this article, we embark on predicting the outcome of a One Day International (ODI) cricket match using a supervised learning approach from a team composition perspective. Our work suggests that the relative team strength between the competing teams forms a distinctive feature for predicting the winner. Modeling the team strength boils down to modeling individual player’s batting and bowling performances, forming the basis of our approach. We use career statistics as well as the recent performances of a player to model him. Player independent factors have also been considered in order to predict the outcome of a match. We show that the k-Nearest Neighbor (kNN) algorithm yields better results as compared to other classifiers.

40 citations


Cites methods from "Auto-play: A data mining approach t..."

  • ...[11] uses a combination of linear regression and nearest-neighbor clustering algorithms to predict the outcome of a match....

    [...]

  • ...For instance, we do not have the details on the timings of the matches (day/night) as used by [10], and the instantaneous state of the matches at multiple stages as used by [11]....

    [...]

  • ...The only obstacle we faced while evaluating our approach is the inability to compare against previous models like [10] and [11], due to the different underlying datasets used....

    [...]

Journal ArticleDOI
TL;DR: A supervised learning method using SVM model with linear, and nonlinear poly and RBF kernals to predict the outcome of the game against particular side by grouping the players at different levels in the order of play for both the teams and develops a system which recommends a player for a specific role in a team by considering the past performances.

34 citations


Cites methods from "Auto-play: A data mining approach t..."

  • ...Historic features extracted from the previous matches is combined with the ongoing match features like a number of wickets and runs scored are used in prediction Sankaranarayanan et al. (2014)....

    [...]

Journal ArticleDOI
TL;DR: Attempts are made to investigate the feasibility of using collective knowledge obtained from microposts posted on Twitter to predict the winner of a Cricket match to classify winning team prediction in a Cricket game before the start of game.
Abstract: Social media has become a platform of first choice where one can express his/her feelings with freedom. The sports and matches being played are also discussed on social media such as Twitter. In this article, efforts are made to investigate the feasibility of using collective knowledge obtained from microposts posted on Twitter to predict the winner of a Cricket match. For predictions, we use three different methods that depend on the total number of tweets before the game for each team, fans sentiments toward each team and fans score predictions on Twitter. By combining these three methods, we classify winning team prediction in a Cricket game before the start of game. Our results are promising enough to be used for winning team forecast. Furthermore, the effectiveness of supervised learning algorithms is evaluated where Support Vector Machine (SVM) has shown advantage over other classifiers.

30 citations


Cites background from "Auto-play: A data mining approach t..."

  • ...[21] build a prediction system that analyzes historical Cricket match data and the instantaneous state of a match to predict game progression and the outcome of ODI match....

    [...]

  • ...Cricket is the second most popular sport in the world after Soccer with two to three billion fans [21]....

    [...]

Posted Content
TL;DR: Six machine learning models were trained and used for predicting the outcome of each 2018 IPL match, 15 minutes before the gameplay, immediately after the toss, with Multilayer Perceptron outperforming all other models with an impressive accuracy.
Abstract: Cricket, especially the Twenty20 format, has maximum uncertainty, where a single over can completely change the momentum of the game. With millions of people following the Indian Premier League (IPL), developing a model for predicting the outcome of its matches is a real-world problem. A cricket match depends upon various factors, and in this work, the factors which significantly influence the outcome of a Twenty20 cricket match are identified. Each player's performance in the field is considered to find out the overall weight (relative strength) of the teams. A multivariate regression based solution is proposed to calculate points for each player in the league and the overall weight of a team is computed based on the past performance of the players who have appeared most for the team. Finally, a dataset is modeled based on the identified seven factors which influence the outcome of an IPL match. Six machine learning models were trained and used for predicting the outcome of each 2018 IPL match, 15 minutes before the gameplay, immediately after the toss. Three of the trained models were seen to be correctly predicting more than 40 matches, with Multilayer Perceptron outperforming all other models with an impressive accuracy of 71.66%.

15 citations


Cites background from "Auto-play: A data mining approach t..."

  • ...Similarly [18] discusses modeling home-runs and non-home runs prediction algorithms and considers taking runs, wickets, frequency of being all-out as historical features into their prediction model....

    [...]

Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work analyzed One Day International cricket data of Bangladesh based on seventeen features and identified the most important features, which are sufficient for better prediction and also support decision-making in the authors' analysis.
Abstract: Nowadays data mining is an emerging field in sports analysis. Data mining plays a vital role in choosing the most effective team, predicting a suitable formation for winning a game, or analyzing the weaknesses of the opponent. However, no research has been done yet for the Bangladesh cricket team. So, we analyzed One Day International cricket data of Bangladesh, based on seventeen features, and found the most important features, which are sufficient for better prediction and can also support decisions in our analysis. Our analysis is divided into three sections: before starting the game, after one innings has been played, and the continuous fall of wickets, which leads to a probable prediction of the chances of winning and losing even while the game is in progress. In our analysis, we used the latest version of the decision tree algorithm, C5.0, on our own collected data set and obtained an accuracy of 63.63% before the start of the game, 72.72% and 81.81% when Bangladesh played in the first and second innings, and finally 80% and 70% for the fall-of-wicket analysis. We also used other classification algorithms and show their accuracy levels on our data set.

13 citations


Cites methods from "Auto-play: A data mining approach t..."

  • ...[1] used 6 features and got the accuracy of between 68% and 70% almost....

    [...]

References
Journal ArticleDOI
TL;DR: In this paper, a review of the theory of ridge regression and its relation to generalized inverse regression is presented along with the results of a simulation experiment and three examples of the use of the ridge regression in practice.
Abstract: The use of biased estimation in data analysis and model building is discussed. A review of the theory of ridge regression and its relation to generalized inverse regression is presented along with the results of a simulation experiment and three examples of the use of ridge regression in practice. Comments on variable selection procedures, model validation, and ridge and generalized inverse regression computation procedures are included. The examples studied here show that when the predictor variables are highly correlated, ridge regression produces coefficients which predict and extrapolate better than least squares and is a safe procedure for selecting variables.

742 citations


"Auto-play: A data mining approach t..." refers methods in this paper

  • ...4.2 Non-Home-Run Prediction Using the same historical and instantaneous features, non-home runs of segment Si, N̂HRi, is predicted by means of Ridge Regression [13]....

    [...]


  • ...Ridge Regression and attribute bagging algorithms are used on the features to incrementally predict the runs scored in the innings....

    [...]

  • ...(Line 6) Using the same features Θ and ∆i−1, non-home runs N̂HRi are predicted using Ridge Regression as mentioned earlier in this section (line 7)....

    [...]
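
The excerpts above state only that non-home runs are predicted with Ridge Regression from historical and instantaneous features. As a minimal sketch of the technique itself (not the authors' implementation), the closed-form ridge estimate w = (XᵀX + λI)⁻¹Xᵀy can be written in plain Python. The function names, the absence of an intercept, and the toy data are all assumptions made for illustration.

```python
from typing import List


def solve_linear_system(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b]; copy rows so the inputs are not modified.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ridge_fit(X: List[List[float]], y: List[float], lam: float) -> List[float]:
    """Ridge weights (X^T X + lam * I)^{-1} X^T y, without an intercept term."""
    d = len(X[0])
    XtX = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
         for j in range(d)]
        for i in range(d)
    ]
    Xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(d)]
    return solve_linear_system(XtX, Xty)


def ridge_predict(weights: List[float], x: List[float]) -> float:
    """Linear prediction for one feature vector."""
    return sum(w * v for w, v in zip(weights, x))


# Hypothetical toy data: two made-up features per segment and the
# non-home runs scored in that segment (not taken from the paper).
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_train = [2.0, 3.0, 5.0, 7.0]

weights = ridge_fit(X_train, y_train, lam=1.0)
print(ridge_predict(weights, [1.0, 2.0]))
```

With `lam=0.0` the function reduces to ordinary least squares; a positive `lam` shrinks the weights, which is the property the ridge regression abstract highlights for highly correlated predictors.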

Journal ArticleDOI
TL;DR: It is shown that AB gives consistently better results than bagging, both in accuracy and stability, and it is demonstrated that ranking the attribute subsets by their classification accuracy and voting using only the best subsets further improves the resulting performance of the ensemble.

484 citations


"Auto-play: A data mining approach t..." refers methods in this paper

  • ...1 Home-Run Prediction Model Using the historical and non-historical features discussed above, we predict the number of home runs ĤRi for a segment Si, using attribute bagging ensemble method [5] with nearest-neighbor clustering....

    [...]

Journal ArticleDOI
TL;DR: A method is described for setting revised target scores for the team batting second when a limited-overs cricket match has been forcibly shortened after it has commenced, designed so that neither team benefits or suffers from the shortening of the game.
Abstract: A method is described for setting revised target scores for the team batting second when a limited-overs cricket match has been forcibly shortened after it has commenced. It is designed so that neither team benefits or suffers from the shortening of the game and so is totally fair to both. It is easy to apply, requiring nothing more than a single table of numbers and a pocket calculator, and is capable of dealing with any number of interruptions at any stage of either or both innings.

181 citations


"Auto-play: A data mining approach t..." refers background or methods in this paper


  • ...The method proposed in [6], and subsequently adapted by [14], for capturing the resources of a team during the progression of a match has found independent use in subsequent work in cricket modeling and mining [14][2]....

    [...]

  • ...2.2 Academic Interest in Cricket One of the earliest and pioneering works in cricket was by Duckworth and Lewis [6] where they introduce the Duckworth-Lewis or D-L method, which allows fair adjustment of scores in proportion to the time lost due to match interruption (often due to adverse weather conditions such as rain, poor visibility etc.)....

    [...]

  • ...Lewis [10], Lemmer [9], Alsopp and Clarke [1], and Beaudoin [3] develop new performance measures to rate teams and to find the most valuable players....

    [...]

Journal ArticleDOI
TL;DR: The pre-processing of raw data that the program performs is highlighted, the data mining aspects of the software are described and how the interpretation of patterns supports the process of knowledge discovery is described.
Abstract: Advanced Scout is a PC-based data mining application used by National Basketball Association (NBA) coaching staffs to discover interesting patterns in basketball game data. We describe Advanced Scout software from the perspective of data mining and knowledge discovery. This paper highlights the pre-processing of raw data that the program performs, describes the data mining aspects of the software and how the interpretation of patterns supports the process of knowledge discovery. The underlying technique of attribute focusing as the basis of the algorithm is also described. The process of pattern interpretation is facilitated by allowing the user to relate patterns to video tape.

147 citations


Additional excerpts

  • ...[4] developed the Advanced Scout system for discovering interesting patterns from basketball games, which is now used by the NBA teams....

    [...]

Journal Article
TL;DR: Preliminary results suggest that the market is prone to overreact to events occurring throughout the course of the match, thus creating brief inefficiencies in the wagering market.
Abstract: Millions of dollars are wagered on the outcome of one day international (ODI) cricket matches, with a large percentage of bets occurring after the game has commenced. Using match information gathered from all 2200 ODI matches played prior to January 2005, a range of variables that could independently explain statistically significant proportions of variation associated with the predicted run totals and match outcomes were created. Such variables include home ground advantage, past performances, match experience, performance at the specific venue, performance against the specific opposition, experience at the specific venue and current form. Using a multiple linear regression model, prediction variables were numerically weighted according to statistical significance and used to predict the match outcome. With the use of the Duckworth-Lewis method to determine resources remaining, at the end of each completed over, the predicted run total of the batting team could be updated to provide a more accurate prediction of the match outcome. By applying this prediction approach to a holdout sample of matches, the efficiency of the “in the run” wagering market could be assessed. Preliminary results suggest that the market is prone to overreact to events occurring throughout the course of the match, thus creating brief inefficiencies in the wagering market.

Key Points:

  • In excess of 80% of monies wagered on the outcome of ODI matches are placed after the match has commenced.
  • Using all past data from ODI matches, multiple linear regression models are constructed to predict team totals and margin of victory.
  • By combining match information with prediction models, an ‘in the run’ prediction process is created for ODI matches.

Key words: Linear regression, live prediction, market efficiency, betting

Introduction: The first official one day international (ODI) match was played in 1971 between Australia and England at the Melbourne Cricket Ground. Whilst ODI cricket has developed over the past 35 years (2300 matches), the general principles have remained the same. Both sides bat once for a limited time (maximum 50 overs) with the aim in the first innings to score as many runs as possible, and in the second innings to score more than the target set in the first innings. The high scoring nature of ODI matches ensures that team totals and differences between scores can be well approximated by a normal distribution. As shown by Bailey (2005), this facilitates the use of multiple linear regression to predict a margin of victory (MOV) prior to the commencement of the match. Using a similar approach, a multiple linear regression is also used to predict the number of runs scored by the team batting first. With the use of the Duckworth and Lewis (1999) approach of converting resources available into runs, as each over is bowled, the current total and the predicted total for the remaining overs are combined to produce an updated predicted total for the batting team. The difference between the pre-match predicted total and the updated predicted total provides a measure of how the batting team is performing through the course of their innings. This difference is then used to provide an updated prediction for the MOV.
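
The over-by-over update described in this abstract can be illustrated with a short sketch. The abstract says only that the current total and the predicted total for the remaining overs are combined; scaling the pre-match prediction by the fraction of Duckworth-Lewis resources remaining is an assumption made here, and the function names and numbers are hypothetical.

```python
def updated_predicted_total(current_runs: float,
                            pre_match_predicted_total: float,
                            resources_remaining: float) -> float:
    """Combine runs already scored with the runs expected from remaining resources.

    Assumption: expected remaining runs = pre-match predicted total times the
    fraction of resources (overs and wickets combined) still available.
    """
    if not 0.0 <= resources_remaining <= 1.0:
        raise ValueError("resources_remaining must be a fraction in [0, 1]")
    return current_runs + pre_match_predicted_total * resources_remaining


def performance_vs_expectation(updated_total: float,
                               pre_match_predicted_total: float) -> float:
    """Positive when the batting team is ahead of the pre-match prediction."""
    return updated_total - pre_match_predicted_total


# Hypothetical match state, not data from the paper.
pre_match = 250.0
updated = updated_predicted_total(current_runs=130.0,
                                  pre_match_predicted_total=pre_match,
                                  resources_remaining=0.4)
print(updated, performance_vs_expectation(updated, pre_match))
```

The resource fraction would in practice come from the Duckworth-Lewis table, which is not reproduced here.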

80 citations


"Auto-play: A data mining approach t..." refers background or methods in this paper

  • ...Bailey and Clarke [2] use historical match data and predict the total score of an innings using linear regression....

    [...]

  • ...Lewis [10], Lemmer [9], Alsopp and Clarke [1], and Beaudoin [3] develop new performance measures to rate teams and to find the most valuable players....

    [...]

  • ...[2] propose a model that predicts the R̂eoi of a game in progress which is used to analyze the sensitivity of betting markets....

    [...]

  • ...Figure 8: Mean absolute error in R̂eoi prediction for innings 1 (top) and innings 2 (bottom) for both [2] and our model....

    [...]


Frequently Asked Questions (8)
Q1. What have the authors contributed in "Auto-play: A data mining approach to ODI cricket simulation and prediction"?

In this paper, the authors build a prediction system that takes in historical match data as well as the instantaneous state of a match, and predicts future match events culminating in a victory or loss. The authors describe their model and algorithms and finally present quantitative results, demonstrating the performance of their algorithms in predicting the number of runs scored, one of the most important determinants of match outcome. 

The period during which TeamA bats is called innings1, in which TeamA has 50 overs to score as many runs as possible, while TeamB tries to minimize the scoring by getting TeamA’s batsmen out (more commonly referred to as taking wickets). 

It can be observed that, for 50% of matches, the prediction error is at most 16 runs with the attribute bagging method, whereas with the nearest-neighbor method it is close to 30 runs.

The intuition behind using nearest-neighbor algorithm is that information from similar match situations can be “borrowed” from the training dataset. 

The first 10 overs of the game are mandatory powerplays, with two more instances of powerplay periods arbitrarily chosen by the batting and bowling team each, to occur at any point in the game up to the 45th over. 

4.1 Home-Run Prediction Model. Using the historical and non-historical features discussed above, the authors predict the number of home runs ĤRi for a segment Si using the attribute bagging ensemble method [5] with nearest-neighbor clustering.
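
A minimal sketch of how attribute bagging can be combined with a nearest-neighbor predictor is given here for orientation. It is not the authors' code: it averages the k nearest training rows rather than clustering, it omits the ranking of attribute subsets described for attribute bagging in [5], and the feature layout, parameter defaults and data are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[float],
                query: Sequence[float], features: Sequence[int], k: int) -> float:
    """Average target of the k training rows closest to the query,
    measuring squared Euclidean distance on the selected features only."""
    def distance(row: Sequence[float]) -> float:
        return sum((row[f] - query[f]) ** 2 for f in features)

    ranked = sorted(range(len(train_X)), key=lambda j: distance(train_X[j]))
    return mean(train_y[j] for j in ranked[:k])


def attribute_bagging_predict(train_X: Sequence[Sequence[float]],
                              train_y: Sequence[float],
                              query: Sequence[float],
                              n_subsets: int = 10, subset_size: int = 2,
                              k: int = 3, seed: int = 0) -> float:
    """Average the predictions of several nearest-neighbor models, each of
    which sees a different random subset of the attributes."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_features = len(query)
    predictions: List[float] = []
    for _ in range(n_subsets):
        features = rng.sample(range(n_features), subset_size)
        predictions.append(knn_predict(train_X, train_y, query, features, k))
    return mean(predictions)


# Hypothetical rows: [overs completed, wickets lost, current run rate].
# Features are assumed to be on comparable scales; real data would need scaling.
train_X = [[10, 1, 4.5], [10, 3, 3.8], [25, 2, 5.1],
           [25, 5, 4.0], [40, 4, 5.6], [40, 7, 4.4]]
train_y = [12, 8, 16, 8, 28, 12]  # made-up home runs in the next segment

print(attribute_bagging_predict(train_X, train_y, query=[25, 3, 4.8]))
```

Each ensemble member "borrows" from similar match situations in the training data, and averaging over attribute subsets reduces the dependence on any single feature.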

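Using these predictions, the total predicted score at the end of the innings can be obtained as

$$\hat{R}_{eoi} = R_{known} + \sum_{i=n+1}^{N} \hat{R}_i \qquad (3.1)$$

where the sum runs over the segments that remain after segment $n$, with $N$ denoting the last segment of the innings. If an innings has not commenced, as a special case, $n = 0$, $R_{known} = 0$ and $W_{known} = 0$.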
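
Equation (3.1) amounts to adding the runs already scored to the predicted runs of every remaining segment. A minimal sketch follows; the function name and the example numbers are hypothetical.

```python
from typing import Sequence


def end_of_innings_score(runs_known: float,
                         predicted_remaining_runs: Sequence[float]) -> float:
    """Eq. (3.1): known runs plus the predicted runs of each remaining segment.

    predicted_remaining_runs holds the per-segment predictions for segments
    n+1 onward; before the innings starts, runs_known is 0 and the sequence
    covers every segment.
    """
    return runs_known + sum(predicted_remaining_runs)


# Hypothetical state: 120 runs known, three segments still to be played.
print(end_of_innings_score(120, [30, 35, 40]))  # prints 225
```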

Of these, historical features are critical for predicting runs for the first segment, since by definition, no instantaneous match data is available before the first segment.