Open Access Proceedings Article DOI

Identifying computer-generated text using statistical analysis

TLDR
This work hypothesizes that human-crafted wording is more consistent than that of a computer, and proposes a method to identify computer-generated text on the basis of statistics that achieves better performance and works consistently in various languages.
Abstract
Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation) and has come to play an important role in daily life. However, computer-generated text may produce confusing information due to translation errors and inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections and product advertisements. Previous methods for detecting computer-generated text typically estimate text fluency, but this may not be useful in the near future due to the development of neural-network-based natural language generation that produces wording close to human-crafted wording. A different approach to detecting computer-generated text is thus needed. We hypothesize that human-crafted wording is more consistent than that of a computer. For instance, Zipf's law states that the most frequent word in human-written text has approximately twice the frequency of the second most frequent word, nearly three times that of the third most frequent word, and so on. We found that this is not true in the case of computer-generated text. We hence propose a method to identify computer-generated text on the basis of statistics. First, the word distribution frequencies are compared with the corresponding Zipfian distributions to extract the frequency features. Next, complex phrase features are extracted because human-generated text contains more complex phrases than computer-generated text. Finally, the higher consistency of the human-generated text is quantified at both the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features. The combination of the frequencies, the complex phrases, and the consistency features was evaluated for 100 English books written originally in English and 100 English books translated from Finnish. The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) compared with the most suitable method for books using parsing tree feature extraction. Evaluation using two other languages (French and Dutch) showed similar results. The proposed method thus works consistently in various languages.



Edinburgh Research Explorer
Identifying Computer-Generated Text Using Statistical Analysis
Citation for published version:
Nguyen-Son, H-Q, Tieu, N-DT, Nguyen, HH, Yamagishi, J & Echizen, I 2018, 'Identifying Computer-Generated Text Using Statistical Analysis', in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Institute of Electrical and Electronics Engineers (IEEE), Kuala Lumpur, Malaysia, pp. 1504-1511, 12/12/17. https://doi.org/10.1109/APSIPA.2017.8282270
Digital Object Identifier (DOI):
10.1109/APSIPA.2017.8282270
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s)
and / or other copyright owners and it is a condition of accessing these publications that users recognise and
abide by the legal requirements associated with these rights.
Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer
content complies with UK legislation. If you believe that the public display of this file breaches copyright please
contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and
investigate your claim.
Download date: 10. Aug. 2022

Identifying Machine-Generated Text
Using Statistical Analysis
Hoang-Quoc Nguyen-Son, Ngoc-Dung T. Tieu, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen
National Institute of Informatics, Tokyo, Japan
The University of Edinburgh, Edinburgh, United Kingdom
The Graduate University for Advanced Studies, Kanagawa, Japan
E-mail: {nshquoc, jyamagis, iechizen}@nii.ac.jp, {dungtieu, nhhuy}@nii.ac.jp Tel: +81-34-2122516
Abstract—Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation), and such machine-generated text plays an important role in daily life. However, machine-generated text may contain confusing information due to errors or inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections or product advertisements. Previous methods for detecting machine-generated text typically estimate text fluency, but this may not be useful in the near future because recently proposed neural-network-based natural language generation produces wording close to human-crafted wording.

We instead hypothesize that human writing habits are more consistent than those of a machine. For instance, Zipf's law states that the most frequent word in human-written text occurs approximately twice as often as the second most frequent word, nearly three times as often as the third most frequent word, and so forth. We found that this does not hold for machine-generated text. We hence propose a method to identify machine-generated text on the basis of such statistics. First, word frequency distributions are compared with the Zipfian distribution to extract frequency features. Second, complex phrase features are extracted because human-generated text contains more sophisticated phrases than machine-generated text. Finally, the higher consistency of human-generated text is quantified at both the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features.

The combination of the frequency, complex phrase, and consistency features was evaluated on one hundred original English books and one hundred books translated from Finnish into English. The results show that our method achieves better performance (accuracy = 98.0% and equal error rate = 2.9%) than a state-of-the-art method using parsing tree feature extraction. An advantage of our method is that it can efficiently handle large collections of text such as books. Evaluations in two other languages, French and Dutch, showed similar results, demonstrating that the proposed method works consistently in various languages.
I. INTRODUCTION
Machine-generated text plays a major role in modern life. Techniques for generating text automatically, known as natural language generation, may partly or entirely replace humans in various applications such as text summarization [1], header creation [2], machine translation [3], and image description [4]. Furthermore, speech interfaces such as Apple Siri, Google Assistant, and Microsoft Cortana also have natural language generation components and may use machine-generated text as well as text crafted by humans.

However, the quality and trustworthiness of such text are difficult to verify. As a result, the information in automatically generated content may be incorrect or inappropriate compared with that of original content genuinely written by humans. In the worst cases, untrusted machine-generated information may mislead readers. Moreover, machine-generated text could annoy customers in product advertisements or could give viewers incorrect attitudes in politics¹. Additionally, more formal writing such as scientific papers written by machine, which in fact have been accepted by a few conferences², may damage those conferences' reputations. We thus need a method to determine whether a text was written by a human or a machine.
Numerous researchers have taken an interest in the task of detecting machine-generated text. At the document level, most methods estimate text fluency [5] or quantify word similarity [6]. At the sentence level, parsing trees are extracted as discriminative features [7][8]. Our previous method extracted two features from informal text at the sentence level: a density feature using an N-gram language model and a noise feature that matches unexpected words (misspelled words, translation-error words, etc.) against the original word forms in standard lexica [9]. The drawback of this method, however, is that such unexpected words are easily recognized and corrected by advanced assistant tools in formal text (e.g., books, papers).
Although advanced natural language processing may improve the naturalness and readability of machine-generated text, we hypothesize that human writing habits are still more consistent. For instance, it is known that the word frequency of human-generated text follows the Zipfian distribution [10], known as "Zipf's law". Additionally, we observe that human-generated text commonly uses more complex phrases than computer-generated text, such as idioms ("long time no see"), phrasal verbs ("get rid of"), archaic phrases ("thou"), and cliches ("only time will tell", meaning to become clear over time). Furthermore, the consistency of human-generated text is generally better than that of machine-generated text.

¹ https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
² https://pdos.csail.mit.edu/archive/scigen/
In this paper, we propose a novel method for detecting machine-generated text using statistical features at the document level. Our contributions are listed below:
• We evaluate the word frequency distributions of original and machine-generated documents. We find that human-generated text nearly follows the Zipfian distribution whereas machine-generated text does not. Therefore, a few parameters related to the Zipfian distribution are extracted from the text as frequency features.
• We extract complex phrases from the text, including idioms, cliches, archaic phrases, and dialect phrases, by matching successive lemmas with four standard complex phrase corpora. These extracted phrases are used to calculate complex phrase features.
• We also measure the consistency of the document at the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships. The numbers of phrasal verbs and coreference resolution relationships are used as consistency features.
• We combine these statistical features (the frequency, complex phrase, and consistency features) to build classifiers that determine whether a document is machine- or human-generated (a minimal illustrative sketch follows this list).
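As a concrete illustration of the last contribution, the following is a minimal sketch, not the authors' code: it concatenates the three feature groups for each document and trains a support vector machine, which the paper reports using, on the combined vectors. The data and the scikit-learn setup here are placeholder assumptions; in the actual method the feature values come from the extraction steps described in Sections III to V.

```python
"""Illustrative sketch only: combine frequency, complex-phrase, and
consistency features and train an SVM classifier on synthetic data."""
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def combine_features(freq, phrase, consist):
    """Concatenate the three feature groups into one document vector."""
    return np.concatenate([freq, phrase, consist])

# Synthetic stand-in data: 40 documents, each with 3 frequency features
# (slope a, square root R^2, cost C), 4 complex-phrase features (I, L, A, D),
# and 2 consistency features (phrasal verbs, coreference relationships).
X = np.array([combine_features(rng.normal(size=3),
                               rng.normal(size=4),
                               rng.normal(size=2))
              for _ in range(40)])
y = np.array([0] * 20 + [1] * 20)  # 0 = human-generated, 1 = machine-generated

clf = SVC(kernel="linear")
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())
```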
We evaluated our proposed method using two hundred books in English and Finnish from Project Gutenberg [11]: one hundred English books were treated as human-generated text, and one hundred Finnish books translated into English by the Google translation service [3] were treated as machine-generated text. In the experiments, we compared our method with a parsing-tree-based feature extraction method [6] because it is strongly relevant to ours. The results show that our method achieves higher accuracy (98.0%) and a lower equal error rate (2.9%) than the compared method. We also performed similar experiments in two other languages, French and Dutch, which showed similar results. These experiments demonstrate that the proposed method works well in various languages.
The structure of the paper is as follows. Section II introduces related work. Section III presents the frequency feature extraction based on the estimated Zipfian distribution. The complex phrase feature extraction is discussed in Section IV. Section V then describes the consistency feature extraction. The classifiers based on the combination of the frequency, complex phrase, and consistency features are described in Section VI. In Section VII, the experiments using original and translated books are presented and analyzed. Finally, Section VIII summarizes the main findings and mentions future work.
II. RELATED WORK
The detection of machine-generated text is a well-known research problem. The main methods at the document and sentence levels are summarized below.
A. Document Level
Y. Arase and M. Zhou proposed a method that distinguishes machine-generated text from human-generated text [5] based on the "salad phenomenon". This phenomenon means that each phrase of machine-generated text is grammatically correct but, when the phrases are put together, they are incorrect in terms of collocation [12]. Consequently, the authors estimate the salad phenomenon using an N-gram language model for continuous word sequences and sequential pattern mining for isolated words. This method works well not only at the document level but also at sub-document levels such as sentences or phrases. However, it was only evaluated on machine-translated text from Japanese to English, two languages with completely different word forms.
Other detection methods designed for larger-scale documents are text-similarity-based approaches. For example, C. Labbé and D. Labbé measured the inter-textual similarity of academic papers [13] using word distributions [6]. This approach derives from the abundant duplicated phrase patterns that appear in machine-generated papers. The technique looks only at technical terms and phrases in the corresponding fields (e.g., computer science, physics) because the text similarity of machine-generated papers is nearly uniform, in contrast to that of human-generated papers. However, this characteristic is obviously unsuitable for detecting machine-generated text in the general domain.
B. Sentence Level
Many researchers have successfully detected machine-generated text using parsing trees at the sentence level. For example, J. Chae and A. Nenkova proposed a solution that quantifies text fluency by extracting the main parsing components [7], such as phrase type proportion, phrase type rate, and head noun modifiers. Moreover, they also exploited the use of incomplete sentences, including human-generated headlines and computer-translation errors.
Y. Li et al. proposed another method using the parsing structure [8]. They showed that the parsing trees of human-generated text are more balanced than those of computer-generated text. Based on this finding, the authors extracted several balance-related features such as right-branching nodes, left-branching nodes, and a branching weight index. The authors additionally showed that emotion in human-generated text is more abundant than in computer-generated text.
In our previous work [9], we extracted word density features using an N-gram language model on both an internal limited corpus and a huge external corpus. Furthermore, we found that human-generated text frequently contains particular words such as spoken words (e.g., wanna, gonna) or misspelled words (comin, goin, etc.), whereas machine-generated text frequently includes unexpected words created by mistakes of the generators. These distinguishing words were called noises. We then detected machine-generated sentences using the density and noise features.
In this paper, we further extend the noise features of our previous method. The previous features consider individual words only, matching each word against standard lexica. We extend these features to complex phrases, including idioms, cliches, archaic phrases, and dialect phrases. Moreover, several complex phrases, such as phrasal verbs, are separable, so they cannot be identified by simple matching. We therefore propose a method to detect separable complex phrases using parsing tree tags.

A disadvantage of the parsing-tree-based methods is that they employ characteristics of a single sentence only and thus do not handle the relationships between sentences. We overcome this problem by evaluating the consistency among the sentences of a document; the details of the consistency feature extraction are presented in Section V.

To compare the proposed method with previous methods, we adopted the parsing-based method suggested by Y. Li et al. [8], which calculates distinct parsing features for each sentence of a document. The average of the sentence features is then used to construct a classifier. This method is compared with our proposed method, which combines frequency features, complex phrase features, and consistency features.
III. FREQUENCY FEATURES
We hypothesize that the word frequency distribution of human-written text often follows Zipf's law while that of computer-generated text does not. This law asserts that the most frequent word occurs about twice as often as the second most frequent word, about three times as often as the third most frequent word, and so forth. We use this evidence to distinguish human-generated text from computer-generated text.
Frequency feature extraction estimates how compatible an input document text t is with the Zipfian distribution. The proposed scheme for extracting the frequency features is shown in Fig. 1:
• Step 1 (Extracting the linear regression line feature): Each word in t is normalized to its lemma. The lemma frequency distribution is calculated and used to estimate a linear regression line f = ax + b fitted to the distribution. The slope a of the line is extracted as a feature.
• Step 2 (Extracting information loss, including square root R² and cost value C): The quality of the linear regression line f is evaluated by two standard metrics: the square root R² (coefficient of determination) and a cost value C that measures the information loss.

Fig. 1. The scheme for frequency feature extraction.

The details of each step used to extract the frequency features are described below.
A. Extracting Linear Regression Line Feature (Step 1)
Due to word variations in English (such as "has," "have," "had"), we first normalize the original words in the input text t to their lemmas. The Stanford library [14] is used to convert the variants to the same lemma.

The frequency distribution d_i of the lemmas is then calculated, and we estimate its compatibility with the Zipfian distribution. According to Zipf's law, the distribution d_i of the i-th most common lemma is proportional to 1/i:

$$d_i \propto \frac{1}{i}. \qquad (1)$$
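As a brief check on the reasoning (our addition, not taken from the paper), taking base-10 logarithms of relation (1) with a proportionality constant c shows why a straight-line fit in the log-log domain is appropriate:

$$\log_{10} d_i = -\log_{10} i + \log_{10} c,$$

so an ideal Zipfian distribution appears as a line of slope -1 in the log-log plot, and the fitted slope a measures how closely a text follows Zipf's law.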
The lemma distributions d_i are therefore sorted by rank, and a log-log graph is used to show the relationship between rank and frequency. For instance, the distributions of a human-written book and a machine-generated book are shown in Fig. 2. A linear regression line f is then estimated for each in the log-log domain:

$$f = ax + b, \qquad (2)$$

where a is the slope and b is the y-intercept of the line f.

In Fig. 2, the standard Zipfian distribution is shown as a black dotted line with slope a_Z = -1, and the distributions of human- and machine-generated text are each estimated by a linear regression line. The slope of the human distribution, a_H = -1.22, is closer to the slope of the Zipfian distribution (a_Z = -1) than that of the machine distribution (a_M = -1.35). This shows that human-generated text complies with Zipf's law better than computer-generated text does. Therefore, the slope a is considered a major feature for detecting computer-generated text.
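The slope extraction in Step 1 can be sketched in a few lines. This is an illustrative approximation rather than the authors' implementation: lower-cased tokens stand in for the Stanford lemmas, and numpy.polyfit performs the log-log linear regression.

```python
"""Sketch of Step 1 (assumptions noted above): estimate the log-log
slope a of a document's lemma-frequency distribution."""
import re
from collections import Counter
import numpy as np

def slope_feature(text: str) -> float:
    # Stand-in for lemmatization: lower-cased alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # d_i: frequencies sorted by rank (most frequent first).
    counts = sorted(Counter(tokens).values(), reverse=True)
    ranks = np.arange(1, len(counts) + 1)
    # Fit f = a*x + b in the log10-log10 domain and return the slope a.
    a, b = np.polyfit(np.log10(ranks), np.log10(counts), deg=1)
    return float(a)

# Toy usage; in practice a full book would be passed in. Values near -1
# indicate closer compliance with Zipf's law.
sample = "the cat sat on the mat and the dog sat on the log " * 20
print(slope_feature(sample))
```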
B. Extracting Information Loss (Step 2)
We quantify the information loss of the linear regression f via two standard metrics: the square root R² and the cost value C. The first is calculated as

$$R^2 = 1 - \frac{\sum_{i=0}^{N-1} (y_i - f_i)^2}{\sum_{i=0}^{N-1} (y_i - \bar{y}_i)^2}, \qquad (3)$$

where N is the number of distinct lemmas, y_i is the distribution of the i-th lemma, f_i is the value estimated for the i-th lemma by the linear regression line f, and \bar{y}_i is the value of the i-th lemma on the mean distribution line \bar{y}. These variables are illustrated in Fig. 3.

The other metric used to quantify the information loss is the cost value C, given by

$$C = \frac{1}{2N} \sum_{i=0}^{N-1} (y_i - f_i)^2. \qquad (4)$$
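The two metrics can be transcribed directly from Eqs. (3) and (4). The sketch below is our own illustration, not the authors' code, and assumes y holds the observed (log-scale) lemma frequencies and f the values predicted by the fitted line.

```python
"""Sketch of Step 2: information-loss metrics for the fitted regression line."""
import numpy as np

def r_squared(y: np.ndarray, f: np.ndarray) -> float:
    """Eq. (3): 1 minus the ratio of residual to total sum of squares."""
    return float(1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2))

def cost_value(y: np.ndarray, f: np.ndarray) -> float:
    """Eq. (4): sum of squared residuals scaled by 1 / (2N)."""
    n = len(y)
    return float(np.sum((y - f) ** 2) / (2 * n))

# Toy usage with hypothetical observed and fitted values.
y = np.array([3.0, 2.4, 2.1, 1.9, 1.6])
f = np.array([2.9, 2.5, 2.2, 1.8, 1.7])
print(r_squared(y, f), cost_value(y, f))
```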
Fig. 2. Log-log graph for machine-generated text (in blue) and human-generated text (in red), showing that the human slope a_H complies more closely with the Zipfian slope a_Z than the machine slope a_M does.
Fig. 3. Root square demonstration with the distribution mean line \bar{y} (left) and the linear regression line f (right).
IV. COMPLEX PHRASE FEATURES
Complex phrases, which are used flexibly and commonly in human-generated text, are extracted as complex phrase features (Fig. 4):
• Step 1a (Extracting idiom phrase feature I): Idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus. We use a standard idiom corpus³ suggested by Wiktionary with about 5000 distinct phrases. The idioms used in a text may differ from the original idioms due to various word forms, so all words are standardized by their lemmas before matching. This standardization is also applied in the other steps. Additionally, all features in this section are divided by the number of words n in t to normalize the features across documents of various lengths.
• Step 1b (Extracting cliche phrase feature L): Cliche phrases are used more commonly in human-written text than in computer-generated text. Therefore, all cliche phrases are identified in the text t to create a cliche feature L. The cliche phrase corpus used here for matching is inherited from Laura Hayden's corpus⁴ with about 600 phrases.
• Step 1c (Extracting ancient phrase feature A): Other complex phrases, known as ancient (archaic) phrases, also often occur in human-generated text. These archaic phrases are extracted by matching against a common ancient phrase corpus⁵ with about 1500 words. An ancient phrase feature A is measured using the extracted phrases.
• Step 1d (Extracting dialect phrase feature D): Many variants of English can be used in similar contexts; these are known as dialect phrases. Such phrases are identified by extracting contiguous lemmas that appear in a large Yorkshire dialect phrase corpus⁶ with about 4000 phrases.

Fig. 4. Complex phrase feature extraction.

We describe only Step 1a in detail because the four steps in this section are similar.
Extracting idiom phrase features I (Step 1a): Because words appear in many variant forms, they are first standardized by their lemmas. In this step, we use the Stanford parser library [14] to determine the lemma of each word in an input text t. Successive lemmas are combined and matched against each phrase in a candidate idiom phrase list; we use the standard idiom corpus from Wiktionary³ as this candidate list. The idiom feature I is the number of extracted idiom phrases divided by the number of words n, i.e., I = (number of extracted idiom phrases) / n.
³ https://en.wiktionary.org/wiki/Appendix:English_idioms
⁴ http://suspense.net/whitefish/cliche.htm
⁵ http://shakespearestudyguide.com/Archaisms.html
⁶ http://www.yorkshiredialect.com/Dialect_words.htm
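To make Step 1a concrete, the following is a hedged sketch of the lemma-matching idea, not the authors' implementation: lower-cased tokens stand in for the Stanford lemmas, and a toy two-entry idiom set stands in for the Wiktionary corpus.

```python
"""Sketch of Step 1a: count idiom phrases by matching runs of successive
(stand-in) lemmas against an idiom list, then normalize by document length."""
import re

def idiom_feature(text: str, idioms: set, max_len: int = 6) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())   # stand-in for lemmas
    n = len(tokens)
    hits = 0
    for i in range(n):
        for k in range(2, max_len + 1):             # candidate phrase lengths
            phrase = " ".join(tokens[i:i + k])
            if phrase in idioms:
                hits += 1
    return hits / n if n else 0.0                   # feature I = #idioms / n

# Toy usage with a two-entry idiom list.
toy_idioms = {"long time no see", "a hot potato"}
print(idiom_feature("Long time no see, my friend; that topic is a hot potato.",
                    toy_idioms))
```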

Citations
Proceedings ArticleDOI

Wide-Ranging Review Manipulation Attacks: Model, Empirical Study, and Countermeasures

TL;DR: This work proposes and evaluates a new class of attacks on online review platforms based on neural language models at word-level granularity in an inductive transfer-learning framework wherein a universal model is refined to handle domain shift, leading to potentially wide-ranging attacks on review systems.
Journal ArticleDOI

Machine-Generated Text: A Comprehensive Survey of Threat Models and Detection Methods

TL;DR: This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability.
Journal ArticleDOI

Assisting academics to identify computer generated writing

TL;DR: In this article, a case study showed how difficult it is for academics with no knowledge of AAGs to identify such writing, and a survey indicated how a training session can improve the ability to detect AAG writing.
Proceedings ArticleDOI

Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers

TL;DR: While statistical features underperform neural features, statistical features provide additional adversarial robustness that can be leveraged in ensemble detection models; the work also pioneers the use of ΔMAUVE as a proxy measure for human judgment of adversarial text quality.
Posted Content

Detecting Machine-Translated Paragraphs by Matching Similar Words.

TL;DR: A method is developed that matches similar words throughout a paragraph and estimates paragraph-level coherence to identify machine-translated text, achieving high performance and outperforming previous methods.
References

Statistical Machine Translation.

Miles Osborne
TL;DR: Statistical Machine Translation deals with automatically translating sentences from one human language into another (such as English), with models estimated from parallel corpora and also from monolingual corpora (examples of target sentences).
Proceedings Article

A Monolingual Tree-based Translation Model for Sentence Simplification

TL;DR: A Tree-based Simplification Model (TSM) is proposed, which, to the authors' knowledge, is the first statistical simplification model integrally covering splitting, dropping, reordering, and substitution.
Journal ArticleDOI

Statistical machine translation

TL;DR: A tutorial overview of the state of the art of statistical machine translation, which describes the context of the current research and presents a taxonomy of some different approaches within the main subproblems: translation modeling, parameter estimation, and decoding.
Journal ArticleDOI

Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?

TL;DR: This work demonstrates a software method of detecting duplicate and fake publications appearing in scientific conferences and, as a result, in the bibliographic services.
Frequently Asked Questions
Q1. What have the authors stated for future work in "Identifying Machine-Generated Text Using Statistical Analysis"?

In future work, the authors will evaluate their method on other kinds of documents such as novels or news. 
