Open Access Proceedings Article DOI

Identifying computer-generated text using statistical analysis

TLDR
This work hypothesizes that human-crafted wording is more consistent than that of a computer, and proposes a method to identify computer-generated text on the basis of statistics that achieves better performance and works consistently in various languages.
Abstract
Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation) and has come to play an important role in daily life. However, computer-generated text may produce confusing information due to translation errors and inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections and product advertisements. Previous methods for detecting computer-generated text typically estimate text fluency, but this may not be useful in the near future due to the development of neural-network-based natural language generation that produces wording close to human-crafted wording. A different approach to detecting computer-generated text is thus needed. We hypothesize that human-crafted wording is more consistent than that of a computer. For instance, Zipf's law states that the most frequent word in human-written text has approximately twice the frequency of the second most frequent word, nearly three times that of the third most frequent word, and so on. We found that this is not true in the case of computer-generated text. We hence propose a method to identify computer-generated text on the basis of statistics. First, the word distribution frequencies are compared with the corresponding Zipfian distributions to extract the frequency features. Next, complex phrase features are extracted because human-generated text contains more complex phrases than computer-generated text. Finally, the higher consistency of the human-generated text is quantified at both the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features. The combination of the frequencies, the complex phrases, and the consistency features was evaluated for 100 English books written originally in English and 100 English books translated from Finnish. The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) compared with the most suitable method for books using parsing tree feature extraction. Evaluation using two other languages (French and Dutch) showed similar results. The proposed method thus works consistently in various languages.



Edinburgh Research Explorer
Identifying Computer-Generated Text Using Statistical Analysis
Citation for published version:
Nguyen-Son, H-Q, Tieu, N-DT, Nguyen, HH, Yamagishi, J & Echizen, I 2018, 'Identifying Computer-Generated Text Using Statistical Analysis', in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Institute of Electrical and Electronics Engineers (IEEE), Kuala Lumpur, Malaysia, pp. 1504-1511, 12/12/17. https://doi.org/10.1109/APSIPA.2017.8282270
Digital Object Identifier (DOI):
10.1109/APSIPA.2017.8282270
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s)
and / or other copyright owners and it is a condition of accessing these publications that users recognise and
abide by the legal requirements associated with these rights.
Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer
content complies with UK legislation. If you believe that the public display of this file breaches copyright please
contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and
investigate your claim.
Download date: 10. Aug. 2022

Identifying Machine-Generated Text
Using Statistical Analysis
Hoang-Quoc Nguyen-Son, Ngoc-Dung T. Tieu, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen
National Institute of Informatics, Tokyo, Japan
The University of Edinburgh, Edinburgh, United Kingdom
The Graduate University for Advanced Studies, Kanagawa, Japan
E-mail: {nshquoc, jyamagis, iechizen}@nii.ac.jp, {dungtieu, nhhuy}@nii.ac.jp Tel: +81-34-2122516
Abstract—Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation), and such machine-generated text plays an important role in daily life. However, machine-generated text may contain confusing information due to errors or inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections or product advertisements. Previous methods for detecting machine-generated text typically estimate text fluency, but this may not be useful in the near future because recently proposed neural-network-based natural language generation produces wording close to human-crafted wording.

We instead hypothesize that human writing habits are more consistent than those of a machine. For instance, Zipf's law states that the most frequent word in human-written text occurs approximately twice as often as the second most frequent word, nearly three times as often as the third most frequent word, and so forth. We found that this does not hold for machine-generated text. We hence propose a method to identify machine-generated text on the basis of such statistics. First, word frequency distributions are compared with the Zipfian distribution to extract frequency features. Second, complex phrase features are extracted because human-generated text contains more sophisticated phrases than machine-generated text. Finally, the higher consistency of human-generated text is quantified at both the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features.

The combination of the frequency, complex phrase, and consistency features was evaluated on one hundred original English books and one hundred books translated from Finnish into English. The results show that our method achieves better performance (accuracy = 98.0% and equal error rate = 2.9%) than a state-of-the-art method using parsing tree feature extraction. An advantage of our method is that it can efficiently handle large collections of text such as books. Evaluations in two other languages, French and Dutch, showed similar results, demonstrating that the proposed method works consistently in various languages.
I. INTRODUCTION
Machine-generated text plays a major role in modern life. Techniques for generating text automatically, known as natural language generation, may partly or entirely replace humans in various applications such as text summarization [1], header creation [2], machine translation [3], and image description [4]. Furthermore, speech interfaces such as Apple Siri, Google Assistant, and Microsoft Cortana also have natural language generation components and may use machine-generated text as well as text crafted by humans.

However, the quality and trustworthiness of such text are difficult to verify. As a result, the information in automatically generated content may be incorrect or inappropriate compared with that of original content genuinely written by humans. In the worst cases, untrusted machine-generated information may mislead readers. Moreover, machine-generated text could annoy customers in product advertisements or could give viewers incorrect attitudes in politics¹. Additionally, more formal writing such as scientific papers written by machine, which in fact have been accepted by a few conferences², may damage those conferences' reputations. We thus need a method to determine whether a text was written by a human or a machine.
Numerous researchers have taken an interest in the task of detecting machine-generated text. At the document level, most methods estimate text fluency [5] or quantify word similarity [6]. At the sentence level, parsing trees are extracted as discriminative features [7][8]. Our previous method extracted two features from informal text at the sentence level: a density feature using an N-gram language model and a noise feature that matches unexpected words (misspelled words, translation-error words, etc.) against the original word forms in standard lexica [9]. The drawback of this method, however, is that such unexpected words are easily recognized and corrected by advanced assistant tools in formal text (e.g., books, papers).
Although advanced natural language processing may improve the naturalness and readability of machine-generated text, we hypothesize that human writing habits are still more consistent. For instance, it is known that the word frequency of human-generated text follows the Zipfian distribution [10], known as "Zipf's law". Additionally, we observe that human-generated text commonly uses more complex phrases than computer-generated text, such as idioms ("long time no see"), phrasal verbs ("get rid of"), archaic phrases ("thou"), and cliches ("only time will tell", meaning to become clear over time). Furthermore, the consistency of human-generated text is generally better than that of machine-generated text.

¹ https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
² https://pdos.csail.mit.edu/archive/scigen/
In this paper, we propose a novel method for detecting machine-generated text using statistical features at the document level. Our contributions are listed below:
• We evaluate the word frequency distributions of original and machine-generated documents. We find that human-generated text nearly follows the Zipfian distribution whereas machine-generated text does not. Therefore, a few parameters related to the Zipfian distribution are extracted from the text as frequency features.
• We extract complex phrases from the text, including idioms, cliches, archaic phrases, and dialect phrases, by matching successive lemmas with four standard complex phrase corpora. These extracted phrases are used to calculate complex phrase features.
• We also measure the consistency of the document at the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships. The numbers of phrasal verbs and coreference resolution relationships are used as consistency features.
• We combine these statistical features (the frequency, complex phrase, and consistency features) to build classifiers that determine whether a document is machine- or human-generated (a minimal illustrative sketch follows this list).
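As a concrete illustration of the last contribution, the following is a minimal sketch, not the authors' code: it concatenates the three feature groups for each document and trains a support vector machine, which the paper reports using, on the combined vectors. The data and the scikit-learn setup here are placeholder assumptions; in the actual method the feature values come from the extraction steps described in Sections III to V.

```python
"""Illustrative sketch only: combine frequency, complex-phrase, and
consistency features and train an SVM classifier on synthetic data."""
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def combine_features(freq, phrase, consist):
    """Concatenate the three feature groups into one document vector."""
    return np.concatenate([freq, phrase, consist])

# Synthetic stand-in data: 40 documents, each with 3 frequency features
# (slope a, square root R^2, cost C), 4 complex-phrase features (I, L, A, D),
# and 2 consistency features (phrasal verbs, coreference relationships).
X = np.array([combine_features(rng.normal(size=3),
                               rng.normal(size=4),
                               rng.normal(size=2))
              for _ in range(40)])
y = np.array([0] * 20 + [1] * 20)  # 0 = human-generated, 1 = machine-generated

clf = SVC(kernel="linear")
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())
```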
We evaluated our proposed method using two hundred books in English and Finnish from Project Gutenberg [11]: one hundred English books were treated as human-generated text, and one hundred Finnish books translated into English by the Google translation service [3] were treated as machine-generated text. In the experiments, we compared our method with a parsing-tree-based feature extraction method [6] because it is strongly relevant to ours. The results show that our method achieves higher accuracy (98.0%) and a lower equal error rate (2.9%) than the compared method. We also performed similar experiments in two other languages, French and Dutch, which showed similar results. These experiments demonstrate that the proposed method works well in various languages.
The structure of the paper is as follows. Section II introduces related work. Section III presents the frequency feature extraction based on the estimated Zipfian distribution. The complex phrase feature extraction is discussed in Section IV. Section V then describes the consistency feature extraction. The classifiers based on the combination of the frequency, complex phrase, and consistency features are described in Section VI. In Section VII, the experiments using original and translated books are presented and analyzed. Finally, Section VIII summarizes the main findings and mentions future work.
II. RELATED WORK
The detection of machine-generated text is a well-known research problem. The main methods at the document and sentence levels are summarized below.
A. Document Level
Y. Arase and M. Zhou proposed a method that distinguishes machine-generated text from human-generated text [5] based on the "salad phenomenon". This phenomenon means that each phrase of machine-generated text is grammatically correct but, when the phrases are put together, they are incorrect in terms of collocation [12]. Consequently, the authors estimate the salad phenomenon using an N-gram language model for continuous word sequences and sequential pattern mining for isolated words. This method works well not only at the document level but also at sub-document levels such as sentences or phrases. However, it was only evaluated on machine-translated text from Japanese to English, two languages with completely different word forms.
Other detection methods designed for larger-scale documents are text-similarity-based approaches. For example, C. Labbé and D. Labbé measured the inter-textual similarity of academic papers [13] using word distributions [6]. This approach derives from the abundant duplicated phrase patterns that appear in machine-generated papers. The technique looks only at technical terms and phrases in the corresponding fields (e.g., computer science, physics) because the text similarity of machine-generated papers is nearly uniform, in contrast to that of human-generated papers. However, this characteristic is obviously unsuitable for detecting machine-generated text in the general domain.
B. Sentence Level
Many researchers have successfully detected machine-generated text using parsing trees at the sentence level. For example, J. Chae and A. Nenkova proposed a solution that quantifies text fluency by extracting the main parsing components [7], such as phrase type proportion, phrase type rate, and head noun modifiers. Moreover, they also exploited the use of incomplete sentences, including human-generated headlines and computer-translation errors.
Y. Li et al. proposed another method using the parsing structure [8]. They showed that the parsing trees of human-generated text are more balanced than those of computer-generated text. Based on this finding, the authors extracted several balance-related features such as right-branching nodes, left-branching nodes, and a branching weight index. The authors additionally showed that emotion in human-generated text is more abundant than in computer-generated text.
In our previous work [9], we extracted word density features using an N-gram language model on both an internal limited corpus and a huge external corpus. Furthermore, we found that human-generated text frequently contains particular words such as spoken words (e.g., wanna, gonna) or misspelled words (comin, goin, etc.), whereas machine-generated text frequently includes unexpected words created by mistakes of the generators. These distinguishing words were called noises. We then detected machine-generated sentences using the density and noise features.
In this paper, we further extend the noise features of our previous method. The previous features consider individual words only, matching each word against standard lexica. We extend these features to complex phrases, including idioms, cliches, archaic phrases, and dialect phrases. Moreover, several complex phrases, such as phrasal verbs, are separable, so they cannot be identified by simple matching. We therefore propose a method to detect separable complex phrases using parsing tree tags.

A disadvantage of the parsing-tree-based methods is that they employ characteristics of a single sentence only and thus do not handle the relationships between sentences. We overcome this problem by evaluating the consistency among the sentences of a document; the details of the consistency feature extraction are presented in Section V.

To compare the proposed method with previous methods, we adopted the parsing-based method suggested by Y. Li et al. [8], which calculates distinct parsing features for each sentence of a document. The average of the sentence features is then used to construct a classifier. This method is compared with our proposed method, which combines frequency features, complex phrase features, and consistency features.
III. FREQUENCY FEATURES
We hypothesize that the word frequency distribution of human-written text often follows Zipf's law while that of computer-generated text does not. This law asserts that the most frequent word occurs about twice as often as the second most frequent word, about three times as often as the third most frequent word, and so forth. We use this evidence to distinguish human-generated text from computer-generated text.
Frequency feature extraction estimates how compatible an input document text t is with the Zipfian distribution. The proposed scheme for extracting the frequency features is shown in Fig. 1:
• Step 1 (Extracting the linear regression line feature): Each word in t is normalized to its lemma. The lemma frequency distribution is calculated and used to estimate a linear regression line f = ax + b fitted to the distribution. The slope a of the line is extracted as a feature.
• Step 2 (Extracting information loss, including square root R² and cost value C): The quality of the linear regression line f is evaluated by two standard metrics: the square root R² (coefficient of determination) and a cost value C that measures the information loss.

Fig. 1. The scheme for frequency feature extraction.

The details of each step used to extract the frequency features are described below.
A. Extracting Linear Regression Line Feature (Step 1)
Due to word variations in English (such as "has," "have," "had"), we first normalize the original words in the input text t to their lemmas. The Stanford library [14] is used to convert the variants to the same lemma.

The frequency distribution d_i of the lemmas is then calculated, and we estimate its compatibility with the Zipfian distribution. According to Zipf's law, the distribution d_i of the i-th most common lemma is proportional to 1/i:

$$d_i \propto \frac{1}{i}. \qquad (1)$$
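As a brief check on the reasoning (our addition, not taken from the paper), taking base-10 logarithms of relation (1) with a proportionality constant c shows why a straight-line fit in the log-log domain is appropriate:

$$\log_{10} d_i = -\log_{10} i + \log_{10} c,$$

so an ideal Zipfian distribution appears as a line of slope -1 in the log-log plot, and the fitted slope a measures how closely a text follows Zipf's law.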
The lemma distributions d_i are therefore sorted by rank, and a log-log graph is used to show the relationship between rank and frequency. For instance, the distributions of a human-written book and a machine-generated book are shown in Fig. 2. A linear regression line f is then estimated for each in the log-log domain:

$$f = ax + b, \qquad (2)$$

where a is the slope and b is the y-intercept of the line f.

In Fig. 2, the standard Zipfian distribution is shown as a black dotted line with slope a_Z = -1, and the distributions of human- and machine-generated text are each estimated by a linear regression line. The slope of the human distribution, a_H = -1.22, is closer to the slope of the Zipfian distribution (a_Z = -1) than that of the machine distribution (a_M = -1.35). This shows that human-generated text complies with Zipf's law better than computer-generated text does. Therefore, the slope a is considered a major feature for detecting computer-generated text.
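The slope extraction in Step 1 can be sketched in a few lines. This is an illustrative approximation rather than the authors' implementation: lower-cased tokens stand in for the Stanford lemmas, and numpy.polyfit performs the log-log linear regression.

```python
"""Sketch of Step 1 (assumptions noted above): estimate the log-log
slope a of a document's lemma-frequency distribution."""
import re
from collections import Counter
import numpy as np

def slope_feature(text: str) -> float:
    # Stand-in for lemmatization: lower-cased alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # d_i: frequencies sorted by rank (most frequent first).
    counts = sorted(Counter(tokens).values(), reverse=True)
    ranks = np.arange(1, len(counts) + 1)
    # Fit f = a*x + b in the log10-log10 domain and return the slope a.
    a, b = np.polyfit(np.log10(ranks), np.log10(counts), deg=1)
    return float(a)

# Toy usage; in practice a full book would be passed in. Values near -1
# indicate closer compliance with Zipf's law.
sample = "the cat sat on the mat and the dog sat on the log " * 20
print(slope_feature(sample))
```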
B. Extracting Information Loss (Step 2)
We quantify the information loss of the linear regression f via two standard metrics: the square root R² and the cost value C. The first is calculated as

$$R^2 = 1 - \frac{\sum_{i=0}^{N-1} (y_i - f_i)^2}{\sum_{i=0}^{N-1} (y_i - \bar{y}_i)^2}, \qquad (3)$$

where N is the number of distinct lemmas, y_i is the distribution of the i-th lemma, f_i is the value estimated for the i-th lemma by the linear regression line f, and \bar{y}_i is the value of the i-th lemma on the mean distribution line \bar{y}. These variables are illustrated in Fig. 3.

The other metric used to quantify the information loss is the cost value C, given by

$$C = \frac{1}{2N} \sum_{i=0}^{N-1} (y_i - f_i)^2. \qquad (4)$$
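The two metrics can be transcribed directly from Eqs. (3) and (4). The sketch below is our own illustration, not the authors' code, and assumes y holds the observed (log-scale) lemma frequencies and f the values predicted by the fitted line.

```python
"""Sketch of Step 2: information-loss metrics for the fitted regression line."""
import numpy as np

def r_squared(y: np.ndarray, f: np.ndarray) -> float:
    """Eq. (3): 1 minus the ratio of residual to total sum of squares."""
    return float(1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2))

def cost_value(y: np.ndarray, f: np.ndarray) -> float:
    """Eq. (4): sum of squared residuals scaled by 1 / (2N)."""
    n = len(y)
    return float(np.sum((y - f) ** 2) / (2 * n))

# Toy usage with hypothetical observed and fitted values.
y = np.array([3.0, 2.4, 2.1, 1.9, 1.6])
f = np.array([2.9, 2.5, 2.2, 1.8, 1.7])
print(r_squared(y, f), cost_value(y, f))
```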
Fig. 2. Log-log graph for machine-generated text (in blue) and human-generated text (in red), showing that the human slope a_H complies more closely with the Zipfian slope a_Z than the machine slope a_M does.
Fig. 3. Root square demonstration with the distribution mean line \bar{y} (left) and the linear regression line f (right).
IV. COMPLEX PHRASE FEATURES
Complex phrases, which are used flexibly and commonly in human-generated text, are extracted as complex phrase features (Fig. 4):
• Step 1a (Extracting idiom phrase feature I): Idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus. We use a standard idiom corpus³ suggested by Wiktionary with about 5000 distinct phrases. The idioms used in a text may differ from the original idioms due to various word forms, so all words are standardized by their lemmas before matching. This standardization is also applied in the other steps. Additionally, all features in this section are divided by the number of words n in t to normalize the features across documents of various lengths.
• Step 1b (Extracting cliche phrase feature L): Cliche phrases are used more commonly in human-written text than in computer-generated text. Therefore, all cliche phrases are identified in the text t to create a cliche feature L. The cliche phrase corpus used here for matching is inherited from Laura Hayden's corpus⁴ with about 600 phrases.
• Step 1c (Extracting ancient phrase feature A): Other complex phrases, known as ancient (archaic) phrases, also often occur in human-generated text. These archaic phrases are extracted by matching against a common ancient phrase corpus⁵ with about 1500 words. An ancient phrase feature A is measured using the extracted phrases.
• Step 1d (Extracting dialect phrase feature D): Many variants of English can be used in similar contexts; these are known as dialect phrases. Such phrases are identified by extracting contiguous lemmas that appear in a large Yorkshire dialect phrase corpus⁶ with about 4000 phrases.

Fig. 4. Complex phrase feature extraction.

We describe only Step 1a in detail because the four steps in this section are similar.
Extracting idiom phrase features I (Step 1a): Because words appear in many variant forms, they are first standardized by their lemmas. In this step, we use the Stanford parser library [14] to determine the lemma of each word in an input text t. Successive lemmas are combined and matched against each phrase in a candidate idiom phrase list; we use the standard idiom corpus from Wiktionary³ as this candidate list. The idiom feature I is the number of extracted idiom phrases divided by the number of words n, i.e., I = (number of extracted idiom phrases) / n.
³ https://en.wiktionary.org/wiki/Appendix:English_idioms
⁴ http://suspense.net/whitefish/cliche.htm
⁵ http://shakespearestudyguide.com/Archaisms.html
⁶ http://www.yorkshiredialect.com/Dialect_words.htm
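To make Step 1a concrete, the following is a hedged sketch of the lemma-matching idea, not the authors' implementation: lower-cased tokens stand in for the Stanford lemmas, and a toy two-entry idiom set stands in for the Wiktionary corpus.

```python
"""Sketch of Step 1a: count idiom phrases by matching runs of successive
(stand-in) lemmas against an idiom list, then normalize by document length."""
import re

def idiom_feature(text: str, idioms: set, max_len: int = 6) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())   # stand-in for lemmas
    n = len(tokens)
    hits = 0
    for i in range(n):
        for k in range(2, max_len + 1):             # candidate phrase lengths
            phrase = " ".join(tokens[i:i + k])
            if phrase in idioms:
                hits += 1
    return hits / n if n else 0.0                   # feature I = #idioms / n

# Toy usage with a two-entry idiom list.
toy_idioms = {"long time no see", "a hot potato"}
print(idiom_feature("Long time no see, my friend; that topic is a hot potato.",
                    toy_idioms))
```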

Citations
Proceedings ArticleDOI

Wide-Ranging Review Manipulation Attacks: Model, Empirical Study, and Countermeasures

TL;DR: This work proposes and evaluates a new class of attacks on online review platforms based on neural language models at word-level granularity in an inductive transfer-learning framework wherein a universal model is refined to handle domain shift, leading to potentially wide-ranging attacks on review systems.
Journal ArticleDOI

Machine-Generated Text: A Comprehensive Survey of Threat Models and Detection Methods

TL;DR: This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability.
Journal ArticleDOI

Assisting academics to identify computer generated writing

TL;DR: In this article, a case study showed how difficult it is for academics with no knowledge of AAGs to identify such writing, and a survey indicated how a training session can improve the ability to detect AAG writing.
Proceedings ArticleDOI

Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers

TL;DR: While statistical features underperform neural features, statistical features provide additional adversarial robustness that can be leveraged in ensemble detection models; the work also pioneers the use of ΔMAUVE as a proxy measure for human judgment of adversarial text quality.
Posted Content

Detecting Machine-Translated Paragraphs by Matching Similar Words.

TL;DR: A method is developed that matches similar words throughout a paragraph and estimates paragraph-level coherence to identify machine-translated text, achieving high performance and outperforming previous methods.
References

Statistical Machine Translation.

Miles Osborne
TL;DR: Statistical Machine Translation deals with automatically translating sentences from one human language into another (such as English), with models estimated from parallel corpora and also from monolingual corpora (examples of target sentences).
Proceedings Article

A Monolingual Tree-based Translation Model for Sentence Simplification

TL;DR: A Tree-based Simplification Model (TSM) is proposed, which, to the authors' knowledge, is the first statistical simplification model integrally covering splitting, dropping, reordering, and substitution.
Journal ArticleDOI

Statistical machine translation

TL;DR: A tutorial overview of the state of the art of statistical machine translation, which describes the context of the current research and presents a taxonomy of some different approaches within the main subproblems: translation modeling, parameter estimation, and decoding.
Journal ArticleDOI

Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?

TL;DR: This work demonstrates a software method of detecting duplicate and fake publications appearing in scientific conferences and, as a result, in the bibliographic services.
Frequently Asked Questions
Q1. What have the authors stated for future work in "Identifying Machine-Generated Text Using Statistical Analysis"?

In future work, the authors will evaluate their method on other kinds of documents such as novels or news. 
