Identifying computer-generated text using statistical analysis
Summary (3 min read)
I. INTRODUCTION
- Machine-generated text plays a major role in modern life.
- Moreover, machine-generated text can, for example, annoy customers in product advertisements or give viewers incorrect attitudes in politics.
- These experiments demonstrated that the proposed method works well in various languages.
- The complex phrase feature extraction is discussed in Section IV.
B. Sentence Level
- Many researchers have successfully detected machine-generated text using parsing trees at the sentence level.
- Furthermore, the authors found that human-generated text frequently contains particular words such as spoken forms (e.g., wanna, gonna) or misspelled words (comin, goin, etc.), whereas machine-generated text frequently includes unexpected words created by generator mistakes.
- The authors extend the noise features of their previous method further.
- The authors extend these features to complex phrases, including idioms, clichés, ancient phrases, and dialect.
- To compare the proposed method with previous methods, the authors adopted the parsing based method suggested by Y. Li et al. [8] that calculates distinct parsing features for each sentence of a document.
III. FREQUENCY FEATURES
- The authors hypothesize that the word-frequency distribution of human-written text often follows Zipf's law, while that of computer-generated text does not.
- This law asserts that the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third, and so forth.
- Frequency feature extraction is used to estimate how much an input document text t is compatible with the Zipfian distribution.
- Finally, the slope a of the fitted line is extracted as a feature.
- The information loss of the fit, including the R-squared value R² and the cost value C, is also extracted.
A. Extracting Linear Regression Line Feature (Step 1)
- Due to word variations in English (such as "has," "have," "had"), the authors first need to normalize the original words in the input text t by their lemmas.
- The Stanford library [14] is used here to convert word variants to the same lemma.
- The authors then estimate the compatibility of the lemma distribution with the Zipfian distribution.
- The log-log graph is then used to demonstrate the relationship of these distributions.
- The distributions of human- and machine-generated text are estimated by two linear regression lines colored in blue and red, respectively.
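As an illustration, the rank/frequency fit can be sketched as follows. This is a minimal sketch that assumes the text has already been lemmatized (e.g., by the Stanford library); plain ordinary least squares on the log-log rank/frequency points stands in for whatever fitting routine the authors actually use:

```python
from collections import Counter
import math

def zipf_slope(lemmas):
    """Fit a line to the log-log rank/frequency plot of lemma counts.

    Returns the slope a and intercept b of f = a*x + b, where
    x = log(rank) and f = log(frequency)."""
    freqs = sorted(Counter(lemmas).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    # Ordinary least squares for a single predictor.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b
```

On a perfectly Zipfian sample (frequency proportional to 1/rank) the recovered slope is exactly −1, the reference value the extracted slope feature is compared against.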
B. Extracting Information Loss (Step 2)
- The authors quantify the information loss of the linear regression f via two standard metrics: the R-squared value R² and the cost value C.
- The linear regression lines f are then estimated using the log distribution: f = ax + b, where a is the slope and b is the y-intercept of the line f.
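The two loss metrics can be sketched as below. Note the cost C is taken here to be the mean squared error of the fit; that is an assumption, since the summary does not spell out the paper's exact cost definition:

```python
def fit_loss(xs, ys, a, b):
    """Information loss of the line f = a*x + b on points (xs, ys).

    Returns R2 (coefficient of determination) and C, where C is
    assumed to be the mean squared error of the fit."""
    preds = [a * x + b for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    c = ss_res / len(ys)  # mean squared error as the cost (assumption)
    return r2, c
```

A perfect fit gives R² = 1 and C = 0; the worse a document's lemma distribution matches the fitted Zipfian line, the lower R² and the higher C.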
IV. COMPLEX PHRASE FEATURES
- The complex phrases, which are flexibly and commonly used in human-generated text, are extracted as complex phrase features (Fig. 4): Step 1a (Extracting idiom phrase feature I): idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus.
- Therefore, all words are standardized by their lemmas before matching.
- The cliché corpus used here for matching is inherited from Laura Hayden's corpus of about 600 phrases.
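The corpus-matching step described above can be sketched as follows; the tiny in-memory phrase list is a stand-in for the real idiom and cliché corpora, and the text is assumed to be lemmatized already:

```python
def phrase_feature(lemmas, corpus):
    """Count corpus phrases occurring in the lemmatized text,
    normalized by the number of words n.

    `corpus` is a stand-in for the idiom/cliche corpora; a real run
    would load e.g. an idiom list or Laura Hayden's cliche corpus."""
    n = len(lemmas)
    phrases = [tuple(p.split()) for p in corpus]
    count = 0
    for p in phrases:
        k = len(p)
        # Slide a window of the phrase's length over the text.
        count += sum(1 for i in range(n - k + 1)
                     if tuple(lemmas[i:i + k]) == p)
    return count / n
```

The same routine serves for the idiom feature I, the cliché feature, and the ancient phrase feature A; only the corpus changes.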
- An ancient phrase feature A is measured using the extracted phrases.
- The idiom feature I is the number of extracted idiom phrases divided by the number of words n.
V. CONSISTENCY FEATURES
• Step 1a (Extracting phrasal verb feature P):
• Step 2b (Extracting coreference resolution feature S):
- Text consistency is also expressed via the coreference resolution relationships.
- Therefore, the number of coreference resolutions is extracted.
- This number is also normalized with the number of words n for creating the coreference resolution feature S.
A. Extracting Phrasal Verb Feature P (Step 1a)
- There are two kinds of phrasal verbs: separable and inseparable ones.
- For instance:
  s1 (inseparable phrasal verb): "The terrorists tried to blow up the railroad station." (meaning: explode)
  s2 (separable phrasal verb): "It rained so they called the soccer game off." (meaning: cancel)
- These verbs can be identified from parsing-tree tags.
- The number of phrasal verbs matches the number of PRT tag occurrences in these parse trees.
- In contrast, machines often generate simpler phrases.
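Assuming Penn-style bracketed constituency parses in which phrasal-verb particles are grouped under a PRT node (as the Stanford parser produces), the counting step can be sketched as:

```python
def count_phrasal_verbs(parse_trees):
    """Count PRT (particle) nodes across bracketed constituency parses.

    Each phrasal verb contributes one PRT node, so counting '(PRT'
    occurrences in Penn-style bracketed strings counts phrasal verbs."""
    return sum(tree.count("(PRT") for tree in parse_trees)

def phrasal_verb_feature(parse_trees, n_words):
    # Normalize by document length, as with the other features.
    return count_phrasal_verbs(parse_trees) / n_words
```

For sentence s1 above, "blow up" yields one PRT node, so a single-sentence document of eight words gets feature value 1/8.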
B. Extracting Coreference Resolution Feature S (Step 1b)
- The number of coreference resolution relationships demonstrates the text cohesion.
- These relationships describe expressions referring to the same entity in the text.
- The more coreference resolution relationships a text contains, the more likely it is human-generated.
- The authors used the Stanford NLP tool [14] to extract coreference resolution relationships.
- The number of these relationships is used to measure the coreference resolution feature S.
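The normalization can be sketched as below, assuming the coreference chains have already been produced by a resolver such as the Stanford NLP tool, and counting one relationship per mention that refers back to an earlier mention in its chain (the exact counting rule is an assumption, as the summary does not pin it down):

```python
def coref_feature(chains, n_words):
    """Coreference feature: number of coreference relationships per word.

    `chains` is a list of mention clusters, e.g.
    [["Alice", "she", "her"], ["the dog", "it"]]; every mention after
    the first in a chain counts as one relationship."""
    relationships = sum(len(chain) - 1 for chain in chains
                        if len(chain) > 1)
    return relationships / n_words
```

For the two example chains above in a 20-word text, the feature is (2 + 1) / 20 = 0.15.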
VI. COMBINATION
- The proposed scheme combines the frequency, complex phrase, and consistency features extracted in Sections III, IV, and V, respectively (cf. Fig. 10).
- The frequency features F in Step 1a, the complex phrase features X in Step 1b, and the consistency features T in Step 1c are integrated to determine whether the input text t is computer- or human-generated.
- The features are processed with two popular classification algorithms, logistic regression and support vector machine.
- The support vector machines were optimized using either the sequential minimal optimization (SMO) algorithm [15] or the stochastic gradient descent (SGD) algorithm.
- Among the classifiers, the support vector machine optimized by SGD has achieved the highest performance in their experiments.
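As a minimal stand-in for the SGD-optimized linear SVM (hinge loss with L2 regularization), written from scratch rather than with the authors' toolkit, the final classification step can be sketched as:

```python
import random

def train_svm_sgd(X, y, epochs=200, lr=0.01, lam=0.01):
    """Linear SVM trained with stochastic gradient descent.

    X rows are the concatenated frequency, complex-phrase, and
    consistency features; y labels are +1 (human) / -1 (machine)."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    rng = random.Random(0)  # fixed seed for reproducibility
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Subgradient step on the regularized hinge loss.
            if margin < 1:
                w = [wj + lr * (y[i] * xj - lam * wj)
                     for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

The hyperparameters (epochs, learning rate, regularization strength) are illustrative; the paper reports only that the SGD-optimized SVM outperformed the other classifiers.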
A. Individual Features
- The authors collected various books from Project Gutenberg [11], the largest online collection of free books.
- The ancient feature A achieves the best performance across all three classifiers.
- This shows that the translators tend to use uncomplicated words.
- The SVM (SGD), which has the highest performance with feature A (accuracy = 89.0%, EER = 10.2%), is used to create the final classifier for the other experiments.
B. Combination
- The authors did similar experiments by combining the individual features in three groups: frequency features Q, complex phrase features X, and consistency features T .
- This method quantifies features for each sentence's parsing tree.
- The authors adapted the method by averaging these features over the whole book.
- This result indicates the influence of the group features.
- The group integration efficiently improves the individual group performances.
C. Other Languages
- The authors conducted similar experiments in other languages.
- The French and Dutch books are also translated into English by Google Translate [3].
- The performance of the proposed method is compared with that of the parsing tree method [8], as shown in Table III.
- Table III shows that their method works well in other languages.
VIII. CONCLUSION
- People often use more sophisticated natural languages than computers.
- Furthermore, the consistency of phrases in human text is generally higher than in machine text.
- Therefore, the authors propose a method to distinguish computer-generated from human-generated text based on statistical analysis.
- More specifically, the frequency features are firstly extracted by estimating the word distribution with Zipfian distribution.
- The classifier is evaluated with 100 original English books and 100 translated English books from Finnish.
Frequently Asked Questions (13)
Q2. What are the contributions in "Identifying machine-generated text using statistical analysis" ?
Previous methods for detecting such machine-generated text typically estimate the text fluency, but this may not be useful in the near future because recently proposed neural-network based natural language generation results in improved wording close to human-crafted text. The authors hence propose a method to identify machine-generated text based on such statistics – first, word frequency distributions are compared with the Zipfian distribution to extract frequency features.
Q3. Why do the authors need to normalize the original words in the input text?
Due to word variations in English (such as “has,” “have,” “had”), the authors first need to normalize the original words in the input text t by their lemmas.
Q4. What is the lemma distribution used to estimate?
The lemma distribution is calculated and is used to estimate a linear regression function f = ax + b that is matched to the distribution.
Q5. What is the step for extracting linear regression line features?
A. Extract linear regression line features (Step 1): Due to word variants in English (such as "has," "have," "had"), the authors normalize the original words in the input text by their lemmas.
Q6. Which algorithm was used to optimize the support vector machines?
The support vector machines were optimized using either the sequential minimal optimization (SMO) algorithm [15] or the stochastic gradient descent (SGD) algorithm.
Q7. How do the authors quantify the information loss of the linear regression f?
The authors quantify the information loss of the linear regression f via two standard metrics: the R-squared value R² and the cost value C.
Q8. What is the slope of the human distribution aH?
The slope of the human distribution aH is equal to −1.22, which is closer to the slope of the Zipfian distribution (aZ = −1) than that of the machine one (aM = −1.35).
Q9. What is the proposed scheme for extracting the frequency features?
The proposed scheme for extracting the frequency features is shown in Fig. 1: • Step 1 (Extracting linear regression line feature): each word in t is normalized by its lemma.
Q10. how many idioms are extracted from a text?
The complex phrases, which are flexibly and commonly used in human-generated text, are extracted as complex phrase features (Fig. 4): • Step 1a (Extracting idiom phrase feature I): idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus.
Q11. What is the log-log domain of the linear regression lines?
The linear regression lines f are then estimated in the log-log domain: f = ax + b (Eq. 2), where a is the slope and b is the y-intercept of the line f.
Q12. How are the distributions of humanwritten text estimated?
The distributions of human- and machine-generated text are estimated by two linear regression lines colored in blue and red, respectively.
Q13. how are the distributions of human and computer text estimated?
The distributions of human text and computer text are estimated by two linear regression lines in blue and orange, respectively.