Proceedings ArticleDOI

Identifying computer-generated text using statistical analysis

TL;DR: This work hypothesizes that human-crafted wording is more consistent than that of a computer, and proposes a statistics-based method to identify computer-generated text that achieves better performance and works consistently in various languages.
Abstract: Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation) and has come to play an important role in daily life. However, computer-generated text may produce confusing information due to translation errors and inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections and product advertisements. Previous methods for detecting computer-generated text typically estimate text fluency, but this may not be useful in the near future due to the development of neural-network-based natural language generation that produces wording close to human-crafted wording. A different approach to detecting computer-generated text is thus needed. We hypothesize that human-crafted wording is more consistent than that of a computer. For instance, Zipf's law states that the most frequent word in human-written text has approximately twice the frequency of the second most frequent word, nearly three times that of the third most frequent word, and so on. We found that this is not true in the case of computer-generated text. We hence propose a method to identify computer-generated text on the basis of statistics. First, the word distribution frequencies are compared with the corresponding Zipfian distributions to extract the frequency features. Next, complex phrase features are extracted because human-generated text contains more complex phrases than computer-generated text. Finally, the higher consistency of the human-generated text is quantified at both the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features. The combination of the frequencies, the complex phrases, and the consistency features was evaluated for 100 English books written originally in English and 100 English books translated from Finnish. The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) compared with the most suitable method for books using parsing tree feature extraction. Evaluation using two other languages (French and Dutch) showed similar results. The proposed method thus works consistently in various languages.

Summary (3 min read)

I. INTRODUCTION

  • Machine-generated text plays a major role in modern life.
  • Moreover, machine-generated text could annoy customers in product advertisements or give viewers incorrect attitudes in politics.
  • These experiments demonstrated that the proposed method works well in various languages.
  • The complex phrase feature extraction is discussed in Section IV.

B. Sentence Level

  • Many researchers have successfully detected machine-generated text using parsing trees at the sentence level.
  • Furthermore, the authors found that human-generated text frequently contains particular words such as spoken forms (e.g., wanna, gonna) or misspelled words (comin, goin, etc.), whereas machine-generated text frequently includes unexpected words created by mistakes of the generators.
  • The authors extend the noise features of their previous method further.
  • The authors extend these features to complex phrases, including idiom, cliche, ancient, and dialect phrases.
  • To compare the proposed method with previous methods, the authors adopted the parsing-based method suggested by Y. Li et al. [8], which calculates distinct parsing features for each sentence of a document.

III. FREQUENCY FEATURES

  • The authors hypothesize that the word frequency distribution of human-written text tends to follow Zipf's law while that of computer-generated text does not.
  • This law asserts that the most frequent word occurs roughly twice as often as the second most frequent word, three times as often as the third, and so forth.
  • Frequency feature extraction is used to estimate how compatible an input document text t is with the Zipfian distribution.
  • The slope a of the fitted regression line is extracted as a feature.
  • Step 2 extracts the information loss of the fit, including the square root R² and the cost value C.

A. Extracting Linear Regression Line Feature (Step 1)

  • Due to word variations in English (such as "has," "have," "had"), the authors first need to normalize the original words in the input text t by their lemmas.
  • The Stanford library [14] is used here to convert word variants to the same lemma.
  • The authors then estimate the compatibility of the lemma distribution with the Zipfian distribution.
  • The log-log graph is then used to show the relationship between these distributions.
  • The distributions of human- and machine-generated text are estimated by two linear regression lines colored blue and red, respectively.

B. Extracting Information Loss (Step 2)

  • The authors quantify the information loss of the linear regression f via two standard metrics: the square root R² and the cost value C.
  • The linear regression line is estimated in the log-log domain as f = ax + b, where a is the slope and b is the y-intercept of the line f (a minimal sketch of this computation is given below).
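A minimal sketch of the slope-feature computation, assuming plain word counts in place of Stanford lemmas and numpy.polyfit for the fit (simplifications for illustration, not the authors' exact pipeline):

```python
import re
from collections import Counter
import numpy as np

def zipf_slope(text: str) -> float:
    """Fit a line to the rank-frequency distribution in log-log space
    and return its slope (the frequency feature a)."""
    words = re.findall(r"[a-z']+", text.lower())            # crude tokenizer (assumption)
    counts = sorted(Counter(words).values(), reverse=True)  # frequencies by rank
    ranks = np.arange(1, len(counts) + 1)
    a, b = np.polyfit(np.log10(ranks), np.log10(counts), deg=1)  # f = a*x + b
    return a

# Human-written text is expected to give a slope close to the Zipfian -1;
# the paper's machine-translated example deviates more (about -1.35).
```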

IV. COMPLEX PHRASE FEATURES

  • The complex phrases, which are flexibly and commonly used in human-generated text, are extracted as complex phrase features (Fig. 4). Step 1a (Extracting idiom phrase feature I): idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus.
  • Therefore, all words are standardized by their lemmas before matching.
  • The cliche phrase corpus used here for matching is inherited from Laura Hayden's corpus of about 600 phrases.
  • An ancient phrase feature A is measured using the extracted phrases.
  • The idiom phrase feature I is the number of extracted idiom phrases divided by the number of words n.

V. CONSISTENCY FEATURES

• Step 1a (Extracting phrasal verb feature P): the number of phrasal verbs is extracted at the sentence level.

• Step 2b (Extracting coreference resolution feature S):

  • Text consistency is also expressed via the coreference resolution relationships.
  • Therefore, the number of coreference resolutions is extracted.
  • This number is also normalized with the number of words n for creating the coreference resolution feature S.

A. Extracting Phrasal Verb Feature P (Step 1a)

  • There are two kinds of phrasal verbs: separable and inseparable ones.
  • For instance, s1 (inseparable phrasal verb): "The terrorists tried to blow up the railroad station." (meaning: explode); s2 (separable phrasal verb): "It rained so they called the soccer game off." (meaning: cancel).
  • These verbs can be identified from the parsing-tree tags.
  • The number of phrasal verbs is estimated from the occurrences of the PRT tag in these parse trees (a rough sketch of this counting is given below).
  • In contrast, machines often generate simpler phrases.
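A rough sketch of the phrasal verb count, using spaCy's "prt" dependency label as a stand-in for the Stanford PRT tag (the library choice and the per-word normalization are assumptions for illustration, not the authors' setup):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def phrasal_verb_feature(text: str) -> float:
    """Approximate the phrasal verb feature P: verb particles per word."""
    doc = nlp(text)
    n_words = sum(1 for tok in doc if not tok.is_punct and not tok.is_space)
    # spaCy marks verb particles ("called ... off", "blow up") with the dependency label "prt"
    n_particles = sum(1 for tok in doc if tok.dep_ == "prt")
    return n_particles / max(n_words, 1)

print(phrasal_verb_feature("It rained so they called the soccer game off."))
```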

B. Extracting Coreference Resolution Feature S (Step 1b)

  • The number of coreference resolution relationships demonstrates the text cohesion.
  • These relationships describe expressions referring to the same entity in the text.
  • The more coreference resolution relationships there are, the more likely the text is human-generated.
  • The authors used the Stanford NLP tool [14] to extract coreference resolution relationships.
  • The number of relationships is used to measure the coreference resolution feature S (a hedged sketch of this computation is given below).
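A hedged sketch of this feature, assuming a Stanford CoreNLP server is already running locally on port 9000 with the coref annotator available; the endpoint, annotator list, JSON layout, and the chain-based counting are assumptions based on the standard CoreNLP server, not details given in the paper:

```python
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumed local CoreNLP server

def coref_feature(text: str) -> float:
    """Coreference resolution feature S: coreference chains per word."""
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
             "outputFormat": "json"}
    resp = requests.post(CORENLP_URL,
                         params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    ann = resp.json()
    n_words = sum(len(sent["tokens"]) for sent in ann["sentences"])
    # One entry per coreference chain; the paper's exact counting of "relationships" may differ.
    n_chains = len(ann.get("corefs", {}))
    return n_chains / max(n_words, 1)
```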

VI. COMBINATION

  • The proposed scheme combines the frequency, complex phrase, and consistency features extracted in Sections III, IV, and V, respectively (cf. Fig. 10).
  • The frequency features F in Step 1a, the complex phrase features X in Step 1b, and the consistency features T in Step 1c are integrated to determine whether the input text t is computer- or human-generated.
  • The features are processed with two popular classification algorithms, logistic regression and support vector machine.
  • The support vector machines were optimized using either the sequential minimal optimization (SMO) algorithm [15] or the stochastic gradient descent (SGD) algorithm.
  • Among the classifiers, the support vector machine optimized with SGD achieved the highest performance in their experiments (a scikit-learn sketch of this setup is given below).
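A minimal sketch of the classification stage using scikit-learn stand-ins; the paper does not name its toolkit, and the feature files here are hypothetical, so this only illustrates the three classifier choices (logistic regression, an SMO-style SVM via libsvm, and an SGD-optimized linear SVM):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical precomputed data: one row per book, columns = frequency (a, R^2, C),
# complex phrase (I, L, A, D), and consistency (P, S) features; y: 1 = machine, 0 = human.
X = np.load("features.npy")
y = np.load("labels.npy")

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (SMO-style, libsvm)": SVC(kernel="linear"),
    "SVM (SGD)": SGDClassifier(loss="hinge", max_iter=1000),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: accuracy = {acc:.3f}")
```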

A. Individual Features

  • The authors collected various books from Project Gutenberg [11], the largest collection of free online books.
  • The ancient feature A achieves the best performance for all three classifiers.
  • It shows that the translations tend to use uncomplicated words.
  • The SVM (SGD) has the highest performance with feature A (accuracy = 89.0%, EER = 10.2%) and is used to create the final classifier for the other experiments.

B. Combination

  • The authors did similar experiments by combining the individual features in three groups: frequency features Q, complex phrase features X, and consistency features T .
  • This method quantifies features for each parsing tree sentence.
  • The authors adapted the method using the average of these features for the whole book.
  • This result indicates the influence of the group features.
  • Integrating the groups effectively improves on the individual group performances.

C. Other Languages

  • The authors performed similar experiments in other languages.
  • The French and Dutch books are also translated into English by Google translation [3].
  • The performance of the proposed method is compared with that of the parsing tree method [8] in Table III.
  • Table III shows that their method works well in other languages.

VIII. CONCLUSION

  • People often use more sophisticated natural language than computers.
  • Furthermore, the consistency of phrases in human text is generally higher than in machine-generated text.
  • Therefore, the authors propose a method to distinguish computer-generated from human-generated text based on statistical analysis.
  • More specifically, the frequency features are first extracted by comparing the word distribution with the Zipfian distribution.
  • The classifier is evaluated with 100 original English books and 100 translated English books from Finnish.


Edinburgh Research Explorer

Identifying Computer-Generated Text Using Statistical Analysis

Citation for published version:
Nguyen-Son, H-Q, Tieu, N-DT, Nguyen, HH, Yamagishi, J & Echizen, I 2018, Identifying Computer-Generated Text Using Statistical Analysis. in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Institute of Electrical and Electronics Engineers (IEEE), Kuala Lumpur, Malaysia, pp. 1504-1511, 12/12/17. https://doi.org/10.1109/APSIPA.2017.8282270

Digital Object Identifier (DOI): 10.1109/APSIPA.2017.8282270
Document Version: Peer reviewed version
Published In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference

Identifying Machine-Generated Text Using Statistical Analysis

Hoang-Quoc Nguyen-Son, Ngoc-Dung T. Tieu, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen

National Institute of Informatics, Tokyo, Japan
The University of Edinburgh, Edinburgh, United Kingdom
The Graduate University for Advanced Studies, Kanagawa, Japan
E-mail: {nshquoc, jyamagis, iechizen}@nii.ac.jp, {dungtieu, nhhuy}@nii.ac.jp; Tel: +81-34-2122516
Abstract—Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation), and such machine-generated text significantly helps our social life. However, machine-generated text may sometimes produce confusing information due to errors or inappropriate wording caused by language processing, which could be a critical issue in presidential elections or product advertisements. Previous methods for detecting such machine-generated text typically estimate text fluency, but this may not be useful in the near future because recently proposed neural-network-based natural language generation produces wording close to human-crafted text.

However, we hypothesize that human writing habits are still more consistent. For instance, Zipf's law states that the most frequent word in human-written text occurs approximately twice as often as the second most frequent word, nearly three times as often as the third most frequent word, and so forth. We found that this is not true in the case of machine-generated text. We hence propose a method to identify machine-generated text on the basis of such statistics. First, word frequency distributions are compared with the Zipfian distribution to extract frequency features. Second, complex phrase features are extracted to show that human-generated text contains more sophisticated phrases than machine-generated text. Finally, the higher consistency of human-generated text is quantified at both the sentence level, using phrasal verbs, and at the paragraph level, using coreference resolution relationships; these are integrated into consistency features.

The combination of the frequency, complex phrase, and consistency features is evaluated on a hundred original English books and a hundred books translated into English from Finnish. The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) than a state-of-the-art method using parsing tree feature extraction. An advantage of our method is that it can efficiently handle large collections of text such as books. Evaluations in two other languages, French and Dutch, showed similar results, demonstrating that the proposed method works consistently in various languages.
I. INTRODUCTION
Machine-generated text plays a major role in modern life. Techniques to generate text automatically, i.e., natural language generation, may partly or entirely replace humans in various applications such as text summarization [1], header creation [2], machine translation [3], and image description [4]. Further, speech interfaces such as Apple Siri, Google Assistant, and Microsoft Cortana also have natural language generation components and may use machine-generated text as well as text crafted by humans.
However, the quality and trustworthiness of such text are difficult to verify. As a result, the information in automatically generated content may be incorrect or inappropriate compared with original content truly written by humans. In the worst cases, untrusted machine-generated information may mislead readers. Moreover, machine-generated text could annoy customers in product advertisements or give viewers incorrect attitudes in politics.¹ Additionally, more formal writings such as scientific papers written by machine, which have in fact been accepted by a few conferences,² may destroy their reputations. We thus need a method to determine whether a text was written by a human or a machine.
Numerous researchers are interested in the task of detecting machine-generated text. At the document level, most methods estimate text fluency [5] or quantify word similarity [6]. At the sentence level, parsing trees are extracted as discriminative features [7][8]. Our previous method extracted two features from informal text at the sentence level: a density feature using an N-gram language model and a noise feature that matches unexpected words (misspelled words, translation errors, etc.) against the original word forms in standard lexica [9]. The drawback of this method, however, is that these unexpected words are easily recognized and corrected by advanced writing-assistant tools in formal text (e.g., books, papers).
Although advanced natural language processing may improve the naturalness and readability of machine-generated text, we hypothesize that human writing habits are still more consistent. For instance, it is known that the word frequency of human-generated text follows the Zipfian distribution [10], which is called "Zipf's law". Additionally, we see that human-generated text commonly uses more complex phrases than computer-generated text, such as idiom phrases ("long time no see"), phrasal verbs ("get rid of"), ancient phrases ("thou"), and cliche phrases ("only time will tell", meaning to become clear over time). Furthermore, the consistency of human-generated text is generally better than that of machine-generated text.

¹ https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
² https://pdos.csail.mit.edu/archive/scigen/
In this paper, we propose a novel method to detect machine-generated text using statistical features at the document level. Our contributions are listed below:

• We evaluate the word frequency distributions of original and machine-generated documents. We find that human-generated text nearly follows the Zipfian distribution whereas machine-generated text does not. Therefore, a few parameters related to the Zipfian distribution are extracted from the text as frequency features.
• We extract complex phrases from the text, including idiom, cliche, ancient, and dialect phrases, by matching successive lemmas against four standard complex phrase corpora, respectively. These extracted phrases are used to calculate complex phrase features.
• We also measure the consistency of the document at the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships. The numbers of phrasal verbs and coreference resolution relationships are considered as consistency features.
• We combine these statistical features, including the frequency, the complex phrase, and the consistency features, to create classifiers that determine whether a document is machine- or human-generated.
We evaluated our proposed method using two hundred books in English and Finnish from Project Gutenberg [11]: one hundred English books are considered human-generated books, and the other hundred, Finnish books translated into English by the Google translation service [3], are treated as machine-generated text. In the experiment, we compared our method with a parsing-tree-based feature extraction method [6] because that method is strongly relevant to ours. The results show that our method achieves higher accuracy (98.0%) and a lower equal error rate (2.9%) than the relevant method. We also performed similar experiments in other languages, French and Dutch, which showed similar results. These experiments demonstrate that the proposed method works well in various languages.
The structure of the paper is as follows: Section II introduces related work. Section III presents frequency feature extraction based on the estimated Zipfian distribution. The complex phrase feature extraction is discussed in Section IV. Thereafter, Section V describes the consistency feature extraction. The classifiers based on the combination of the frequency, complex phrase, and consistency features are described in Section VI. In Section VII, the experiments using original and translated books are presented and analyzed. Finally, Section VIII summarizes the main findings and mentions our future work.
II. RELATED WORK
The detection of machine-generated text is a well-known research problem. The main methods at the document and sentence levels are summarized below.
A. Document Level
Y. Arase and M. Zhou proposed a method that distinguishes machine-generated text from human-generated text [5] based on the "salad phenomenon." This phenomenon means that each phrase of machine-generated text is grammatically correct, but, when the phrases are put together, they are incorrect in terms of collocation [12]. Consequently, the authors estimate the salad phenomenon using an N-gram language model for continuous word sequences and sequential pattern mining for isolated words. This method works well not only at the document level but also at sub-document levels such as sentences or phrases. However, it was evaluated only on machine-translated text from Japanese to English, two languages with completely different word forms.
Other detection methods designed for larger-scale documents are text-similarity-based approaches. For example, C. Labbé and D. Labbé measured the inter-textual similarity of academic papers [13] using word distributions [6]. This approach derives from the abundant duplicated phrase patterns that appear in machine-generated papers. The technique looks at technical terms and phrases only in the corresponding fields (e.g., computer science, physics) because the text similarity of machine-generated papers is nearly uniform, in contrast to that of human-generated papers. However, this characteristic is obviously unsuitable for detecting machine-generated text in the general domain.
B. Sentence Level
Many researchers have successfully detected machine-
generated text using the parsing trees at the sentence level.
For example, J. Chae and A. Nenkova suggested a solution that quantifies text fluency by extracting the main parsing components [7], such as phrase type proportion, phrase type rate, and head noun modifiers. Moreover, they also exploited the use of incomplete sentences, including human-generated headlines and computer-translation errors.

Y. Li et al. proposed another method using the parsing structure [8]. They showed that the parsing trees of human-generated text are more balanced than those of computer-generated text. Based on these findings, the authors extracted several features related to balance, such as right-branching nodes, left-branching nodes, and a branching weight index. The authors additionally showed that the emotion in human-generated text is more abundant than in computer-generated text.
In our previous work [9], we extracted word density features using an N-gram language model on both an internally limited corpus and a huge external corpus. Furthermore, we found that human-generated text frequently contains particular words such as spoken forms (e.g., wanna, gonna) or misspelled words (comin, goin, etc.), whereas machine-generated text frequently includes unexpected words created by mistakes of the generators. These distinguishable words were called noise. We then performed the detection of machine-generated sentences using the density and noise features.
In this paper, we extend the noise features of our previous method further. The previous features consider individual words only, by matching each word with standard lexica. We extend these features to complex phrases, including idiom, cliche, ancient, and dialect phrases. Moreover, several complex phrases, such as phrasal verbs, are separable, so they cannot simply be identified by matching. We therefore propose a method to detect separable complex phrases using parsing tree tags.

To compare the proposed method with previous methods, we adopted the parsing-based method suggested by Y. Li et al. [8], which calculates distinct parsing features for each sentence of a document. The average of the sentence features is then used to construct a classifier. This method is compared with our proposed method, which combines frequency features, complex phrase features, and consistency features.
III. FREQUENCY FEATURES
We hypothesize that the word frequency distribution of human-written text tends to follow Zipf's law while that of computer-generated text does not. This law asserts that the most frequent word occurs roughly twice as often as the second most frequent word, three times as often as the third, and so forth. We use this evidence to distinguish human-generated text from computer-generated text.
Frequency feature extraction is used to estimate how compatible an input document text t is with the Zipfian distribution. The proposed scheme for extracting the frequency features is shown in Fig. 1:

Step 1 (Extracting linear regression line feature): Each word in t is normalized to its lemma. The lemma distribution is calculated and used to estimate a linear regression function f = ax + b that is fitted to the distribution. The slope a of this line is extracted as a feature.

Step 2 (Extracting information loss, including square root R² and cost value C): The quality of the linear regression line f is evaluated by two standard metrics: the standard square root R² and a cost value C that measures the information loss.

The details of each step for extracting the frequency features are described below.
A. Extracting Linear Regression Line Feature (Step 1)
Due to word variations in English (such as "has," "have," "had"), we first need to normalize the original words in the input text t to their lemmas. The Stanford library [14] is used here to convert the variants to the same lemma.

Fig. 1. The scheme for frequency feature extraction: from an input document text t, Step 1 (extracting the linear regression line) produces the slope feature a, and Step 2 (extracting the information loss) produces the square root feature R² and the cost value feature C.

The frequency distribution d_i of each lemma is then calculated, and we estimate the compatibility of the lemma distribution with the Zipfian distribution. According to Zipf's law, the distribution d_i of the i-th most common lemma is proportional to 1/i:

d_i ∝ 1/i.  (1)
Therefore, the lemma distributions d_i are sorted in decreasing order. The log-log graph is then used to show the relationship between these distributions. For instance, the distributions of a human-written book (in blue) and a machine-generated book (in red) are shown in Fig. 2. The linear regression line f for each is then estimated in the log-log domain:

f = ax + b,  (2)

where a is the slope and b is the y-intercept of the line f.

In Fig. 2, the standard Zipfian distribution is shown as a black dotted line with slope a_Z = -1. The distributions of human- and machine-generated text are estimated by two linear regression lines colored blue and red, respectively. The slope of the human distribution, a_H = -1.22, is closer to the slope of the Zipfian distribution (a_Z = -1) than that of the machine distribution (a_M = -1.35). This shows that the compatibility of human-generated text with Zipf's law is better than that of computer-generated text. Therefore, the slope a is considered a major feature for detecting computer-generated text.
B. Extracting Information Loss (Step 2)
We quantify the information loss of the linear regression f via two standard metrics: the square root R² and the cost value C. The first is calculated by

R² = 1 - [Σ_{i=0}^{N-1} (y_i - f_i)²] / [Σ_{i=0}^{N-1} (y_i - ȳ_i)²],  (3)

where N is the number of distinct lemmas, y_i is the distribution of the i-th lemma, f_i is the value of the i-th lemma estimated by the linear regression line f, and ȳ_i is the value of the i-th lemma on the mean distribution line ȳ. These variables are illustrated in Fig. 3.

The other metric used to quantify the information loss is the cost value C, given by

C = (1 / 2N) Σ_{i=0}^{N-1} (y_i - f_i)².  (4)
Fig. 2. Log-log graph (log10 rank vs. log10 d) for machine-generated text (in blue) and human-generated text (in red), demonstrating that the human slope a_H complies with the Zipfian slope a_Z more closely than the machine slope a_M does.
Fig. 3. Root square demonstration with the distribution mean line ȳ (left) and the linear regression line f (right).
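A minimal numpy sketch of the frequency features in Eqs. (2)-(4), fitting the regression on the log-log rank-frequency data and treating ȳ as the mean of the log frequencies (a simplified reading of the definitions above, not the authors' code):

```python
import numpy as np

def frequency_features(lemma_counts):
    """Return (slope a, square root R^2, cost value C) for a list of lemma counts."""
    y = np.log10(np.sort(np.asarray(lemma_counts, dtype=float))[::-1])  # log frequencies by rank
    x = np.log10(np.arange(1, len(y) + 1))                              # log ranks
    a, b = np.polyfit(x, y, deg=1)                                      # f = a*x + b   (Eq. 2)
    f = a * x + b
    n = len(y)
    r2 = 1.0 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)       # Eq. 3
    c = np.sum((y - f) ** 2) / (2.0 * n)                                # Eq. 4
    return a, r2, c

# Example with made-up counts; a human-written book is expected to give a slope near -1.
print(frequency_features([500, 260, 170, 120, 95, 80, 60, 45, 30, 20]))
```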
IV. COMPLEX PHRASE FEATURES
The complex phrases, which are flexibly and commonly used in human-generated text, are extracted as complex phrase features (Fig. 4):

Step 1a (Extracting idiom phrase feature I): Idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus. We use a standard idiom corpus³ suggested by Wikipedia with about 5000 distinct phrases. The use of idioms in a text may differ from the original idioms due to various word forms. Therefore, all words are standardized by their lemmas before matching. This standardization is also applied in the other steps. Additionally, all features in this section are divided by the number of words n in t to normalize them across documents of various lengths.

Fig. 4. Complex phrase feature extraction: from the input document text t, Step 1a extracts the idiom phrase feature I, Step 1b the cliche phrase feature L, Step 1c the ancient phrase feature A, and Step 1d the dialect phrase feature D.
Step 1b (Extracting cliche phrase feature L): Cliche phrases are more commonly used in human-written text than in computer-generated text. Therefore, all cliche phrases are identified in the text t to create a cliche feature L. The cliche phrase corpus used here for matching is inherited from Laura Hayden's corpus⁴ with about 600 phrases.

Step 1c (Extracting ancient phrase feature A): Other complex phrases, known as ancient phrases, also often occur in human text. These archaic phrases are extracted by matching against a common ancient phrase corpus⁵ with about 1500 words. An ancient phrase feature A is measured using the extracted phrases.

Step 1d (Extracting dialect phrase feature D): Many variations of English can be used in similar contexts, known as dialect phrases. Such phrases are identified by extracting contiguous lemmas included in a large Yorkshire dialect phrase corpus⁶ with about 4000 phrases.
We describe only Step 1a in detail because the four steps in this section are similar.

Extracting idiom phrase feature I (Step 1a): There are many variants of words in texts, so these words are standardized by their lemmas. In this step, we use the Stanford parser library [14] to determine the lemma of each word in an input text t. Successive lemmas are combined and matched against each phrase in a candidate idiom phrase list. We utilize the standard idiom corpus suggested by Wikipedia³ as the candidate idiom phrase list. The idiom phrase feature I is the number of extracted idiom phrases divided by the number of words n:

I = (number of extracted idiom phrases) / n.

³ https://en.wiktionary.org/wiki/Appendix:English idioms
⁴ http://suspense.net/whitefish/cliche.htm
⁵ http://shakespearestudyguide.com/Archaisms.html
⁶ http://www.yorkshiredialect.com/Dialect words.htm
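A rough sketch of the idiom-matching step, using spaCy lemmas in place of the Stanford parser and a tiny in-line idiom list standing in for the roughly 5000-phrase Wiktionary corpus (both substitutions are assumptions for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the Stanford lemmatizer

# Hypothetical miniature corpus; the paper matches against the full Wiktionary idiom list.
IDIOMS = [("long", "time", "no", "see"), ("a", "hot", "potato"), ("get", "rid", "of")]

def idiom_feature(text: str) -> float:
    """Idiom phrase feature I: matched idiom phrases per word."""
    lemmas = [tok.lemma_.lower() for tok in nlp(text) if not tok.is_punct]
    n = len(lemmas)
    hits = 0
    for idiom in IDIOMS:                      # slide a window of successive lemmas
        k = len(idiom)
        hits += sum(1 for i in range(n - k + 1) if tuple(lemmas[i:i + k]) == idiom)
    return hits / max(n, 1)

print(idiom_feature("Long time no see! This issue is a hot potato."))
```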

Citations
Proceedings ArticleDOI
03 Nov 2019
TL;DR: This work proposes and evaluates a new class of attacks on online review platforms based on neural language models at word-level granularity in an inductive transfer-learning framework wherein a universal model is refined to handle domain shift, leading to potentially wide-ranging attacks on review systems.
Abstract: User reviews have become a cornerstone of how we make decisions. However, this user-based feedback is susceptible to manipulation as recent research has shown the feasibility of automatically generating fake reviews. Previous investigations, however, have focused on generative fake review approaches that are (i) domain dependent and not extendable to other domains without replicating the whole process from scratch; and (ii) character-level based known to generate reviews of poor quality that are easily detectable by anti-spam detectors and by end users. In this work, we propose and evaluate a new class of attacks on online review platforms based on neural language models at word-level granularity in an inductive transfer-learning framework wherein a universal model is refined to handle domain shift, leading to potentially wide-ranging attacks on review systems. Through extensive evaluation, we show that such model-generated reviews can bypass powerful anti-spam detectors and fool end users. Paired with this troubling attack vector, we propose a new defense mechanism that exploits the distributed representation of these reviews to detect model-generated reviews. We conclude that despite the success of neural models in generating realistic reviews, our proposed RNN-based discriminator can combat this type of attack effectively (90% accuracy).

16 citations

Journal ArticleDOI
TL;DR: This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability.
Abstract: Machine-generated text is increasingly difficult to distinguish from text authored by humans. Powerful open-source models are freely available, and user-friendly tools that democratize access to generative models are proliferating. ChatGPT, which was released shortly after the first edition of this survey, epitomizes these trends. The great potential of state-of-the-art natural language generation (NLG) systems is tempered by the multitude of avenues for abuse. Detection of machine-generated text is a key countermeasure for reducing the abuse of NLG models, and presents significant technical challenges and numerous open problems. We provide a survey that includes 1) an extensive analysis of threat models posed by contemporary NLG systems and 2) the most complete review of machine-generated text detection methods to date. This survey places machine-generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models. While doing so, we highlight the importance that detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability.

13 citations

Journal ArticleDOI
TL;DR: In this article , a case study showed how difficult it is for academics with no knowledge of AAGs to identify this writing, and a survey was used to indicate how a training session can improve the ability of detecting AAG writing.
Abstract: ABSTRACT Authentic writing is an important aspect in education and research. Unfortunately, academic misconduct occurs among students and researchers. Consequently, written articles undergo certain detection measures and most teaching and research institutions use a range of software to detect plagiarism. However, state-of-the-art Automatic Article Generator (AAG) writing powered by Artificial Intelligence provides a new platform for new types of serious academic misconduct that cannot be easily detected and even if they are detected, can be hard to prove. The main objective of this study is to raise awareness of these tools among academics. This paper first explains the features of AAG writing, then investigates whether academics can distinguish AAG writing from human writing and whether raising the awareness of AAG between academics can improve their ability to detect AAG writing. A case study showed how difficult it is for academics with no knowledge of AAGs to identify this writing. A survey was used to indicate how a training session can improve the ability of detecting AAG writing. The results show that raising awareness training increased the academics’ ability to detect AAG writing. Lastly, the possible solutions to mitigate the academic integrity issues associated with AAG writing have been discussed.

10 citations

Proceedings ArticleDOI
02 Mar 2022
TL;DR: While statistical features underperform neural features, statistical features provide additional adversarial robustness that can be leveraged in ensemble detection models, and pioneer the usage of ΔMAUVE as a proxy measure for human judgement of adversarial text quality.
Abstract: The detection of computer-generated text is an area of rapidly increasing significance as nascent generative models allow for efficient creation of compelling human-like text, which may be abused for the purposes of spam, disinformation, phishing, or online influence campaigns. Past work has studied detection of current state-of-the-art models, but despite a developing threat landscape, there has been minimal analysis of the robustness of detection methods to adversarial attacks. To this end, we evaluate neural and non-neural approaches on their ability to detect computer-generated text, their robustness against text adversarial attacks, and the impact that successful adversarial attacks have on human judgement of text quality. We find that while statistical features underperform neural features, statistical features provide additional adversarial robustness that can be leveraged in ensemble detection models. In the process, we find that previously effective complex phrasal features for detection of computer-generated text hold little predictive power against contemporary generative models, and identify promising statistical features to use instead. Finally, we pioneer the usage of ΔMAUVE as a proxy measure for human judgement of adversarial text quality.

8 citations

Posted Content
TL;DR: A method matching similar words throughout the paragraph and estimating the paragraph-level coherence, that can identify machine-translated text is developed that achieves high performance and is efficiently better than previous methods.
Abstract: Machine-translated text plays an important role in modern life by smoothing communication from various communities using different languages. However, unnatural translation may lead to misunderstanding, a detector is thus needed to avoid the unfortunate mistakes. While a previous method measured the naturalness of continuous words using a N-gram language model, another method matched noncontinuous words across sentences but this method ignores such words in an individual sentence. We have developed a method matching similar words throughout the paragraph and estimating the paragraph-level coherence, that can identify machine-translated text. Experiment evaluates on 2000 English human-generated and 2000 English machine-translated paragraphs from German showing that the coherence-based method achieves high performance (accuracy = 87.0%; equal error rate = 13.0%). It is efficiently better than previous methods (best accuracy = 72.4%; equal error rate = 29.7%). Similar experiments on Dutch and Japanese obtain 89.2% and 97.9% accuracy, respectively. The results demonstrate the persistence of the proposed method in various languages with different resource levels.

7 citations


Cites background or methods from "Identifying computer-generated text..."

  • ...Other work [5,9] analyzes the histogram of word distribution from a massive amount of words, particularly suitable for document level....
  • ...While three methods based on word distribution with coreference resolution (coreref) [9], N -gram model [1], and word similarity [10] can directly extract features from a paragraph, the other [6] based on parsing tree only obtains such features from an individual sentence....
  • ...Method LINEAR SGD(SVM) SMO(SVM) ACC EER ACC EER ACC EER Word distribution and coreref [9] 66....
  • ...Intertextual metric [5] Word distribution and coreref [9]...
  • ...In these classifiers, the method based on word distribution and coreference resolution (coreref) [9] attained the lowest performance....

References
Proceedings ArticleDOI
01 Jun 2014
TL;DR: The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis, and it is suggested that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
Abstract: We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.

7,070 citations


"Identifying computer-generated text..." refers methods in this paper

  • ...We use the Stanford NLP tool [14] to extract coreference resolution relationships....
  • ...Extract idiomatic phrase feature I (Step la): Standardization of words to their lemma form in this step is done using the Stanford Parser Library [14], as mentioned above....
  • ...The Stanford NLP library [14] is used to generate a syntax parsing tree for each sentence in the document and to attach a PRT tag to each phrasal verb....
  • ...We used the Stanford Parser Library [14] to convert the various forms of a word to its lemma form....

01 Jan 1999
TL;DR: SMO breaks this large quadratic programming problem into a series of smallest possible QP problems, which avoids using a time-consuming numerical QP optimization as an inner loop and hence SMO is fastest for linear SVMs and sparse data sets.

5,350 citations


"Identifying computer-generated text..." refers methods in this paper

  • ...The SVM algorithm was optimized using either the sequential minimal optimization (SMO) algorithm [15] or the stochastic gradient descent (SGD) algorithm....

Proceedings ArticleDOI
07 Jun 2015
TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to generate natural sentences describing an image, which can be used to automatically describe the content of an image.
Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

5,095 citations

Book
John Platt
08 Feb 1999
TL;DR: In this article, the authors proposed a new algorithm for training Support Vector Machines (SVM) called SMO (Sequential Minimal Optimization), which breaks this large QP problem into a series of smallest possible QP problems.
Abstract: This chapter describes a new algorithm for training Support Vector Machines: Sequential Minimal Optimization, or SMO. Training a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because large matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. For the MNIST database, SMO is as fast as PCG chunking; while for the UCI Adult database and linear SVMs, SMO can be more than 1000 times faster than the PCG chunking algorithm.

5,019 citations

Posted Content
TL;DR: This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

3,426 citations


"Identifying computer-generated text..." refers background in this paper

  • ..., natural language generation, may partly or entirely replace humans in various applications such as text summarization [1], header creation [2], machine translation [3], and image description [4]....

Frequently Asked Questions (13)
Q1. What have the authors stated for future works in "Identifying machine-generated text using statistical analysis" ?

In future work, the authors will evaluate their method on other kinds of documents such as novels or news. 
