Identifying computer-generated text using statistical analysis
Citations
Wide-Ranging Review Manipulation Attacks: Model, Empirical Study, and Countermeasures
Machine-Generated Text: A Comprehensive Survey of Threat Models and Detection Methods
Assisting academics to identify computer generated writing
Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers
Detecting Machine-Translated Paragraphs by Matching Similar Words
References
Statistical Machine Translation
A Monolingual Tree-based Translation Model for Sentence Simplification
Statistical machine translation
Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?
Related Papers (5)
A Machine Learning Method to Distinguish Machine Translation from Human Translation
Frequently Asked Questions (13)
Q2. What are the contributions in "Identifying computer-generated text using statistical analysis"?
Previous methods for detecting such machine-generated text typically estimate the text fluency, but this may not be useful in the near future because recently proposed neural-network-based natural language generation produces wording close to human-crafted text. The authors hence propose a method to identify machine-generated text based on statistical features: first, word frequency distributions are compared with the Zipfian distribution to extract frequency features.
Q3. Why do the authors need to normalize the original words in the input text?
Due to word variations in English (such as “has,” “have,” “had”), the authors first need to normalize the original words in the input text t by their lemmas.
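A minimal sketch of this normalization step, assuming NLTK's WordNetLemmatizer as the lemmatizer (the paper does not name a specific tool):

# Lemma normalization sketch; requires the NLTK "punkt" and "wordnet" data.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def normalize_to_lemmas(text):
    # Map word variants such as "has", "have", "had" to a single lemma.
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(tok, pos="v") for tok in tokens if tok.isalpha()]

print(normalize_to_lemmas("She has had what they have."))
# ['she', 'have', 'have', 'what', 'they', 'have']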
Q4. What is the lemma distribution used to estimate?
The lemma distribution is calculated and used to estimate a linear regression function f = ax + b that matches the distribution.
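A possible sketch of building that lemma rank-frequency distribution (function and variable names here are illustrative, not from the paper):

from collections import Counter

def lemma_distribution(lemmas):
    # Return (rank, frequency) pairs, most frequent lemma first; these are
    # the points the regression f = ax + b is later fitted to in log-log space.
    freqs = sorted(Counter(lemmas).values(), reverse=True)
    return list(enumerate(freqs, start=1))

print(lemma_distribution(["have", "have", "have", "she", "they", "what"]))
# [(1, 3), (2, 1), (3, 1), (4, 1)]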
Q5. What is the step for extracting linear regression line features?
Step 1 (extracting linear regression line features): due to word variants in English (such as "has," "have," "had"), the authors normalize the original words in the input text by their lemmas.
Q6. Which algorithm was used to optimize the support vector machines?
The support vector machines were optimized using either the sequential minimal optimization (SMO) algorithm [15] or the stochastic gradient descent (SGD) algorithm.
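The paper's toolkit is not named here; as an illustration, scikit-learn exposes both an SMO-style solver (SVC, backed by libsvm) and an SGD-trained linear SVM (SGDClassifier with hinge loss):

# Illustrative only; the feature vectors and labels below are placeholders.
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

X = [[0.95, -1.22, 3.0], [0.89, -1.35, 1.0]]  # e.g. fit quality, slope, phrase counts
y = [0, 1]                                    # 0 = human-written, 1 = machine-generated

smo_svm = SVC(kernel="linear").fit(X, y)         # libsvm's SMO-style optimizer
sgd_svm = SGDClassifier(loss="hinge").fit(X, y)  # stochastic gradient descent

print(smo_svm.predict([[0.9, -1.3, 2.0]]), sgd_svm.predict([[0.9, -1.3, 2.0]]))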
Q7. How do the authors quantify the information loss of the linear regression f?
The authors quantify the information loss of the linear regression f via two standard metrics: the square root of R² and the cost value C.
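A hedged sketch of computing these two quantities; the paper's exact definition of the cost value C is not reproduced here, so the sum of squared residuals is used as a stand-in:

import numpy as np

def fit_quality(x, y, a, b):
    # Quantify the information loss of the line f(x) = a*x + b.
    # Returns (sqrt of R^2, cost C), with C taken here as the sum of squared residuals.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    residuals = y - (a * x + b)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return np.sqrt(max(r_squared, 0.0)), ss_res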
Q8. What is the slope of the human distribution aH?
The slope of the human distribution aH is −1.22, which is closer to the slope of the Zipfian distribution (aZ = −1) than that of the machine distribution (aM = −1.35).
Q9. What is the proposed scheme for extracting the frequency features?
The proposed scheme for extracting the frequency features is shown in Fig. 1. Step 1 (extracting the linear regression line feature): each word in t is normalized by its lemma.
Q10. how many idioms are extracted from a text?
The complex phrases, which are flexibly and commonly used in human-generated text, are extracted as complex phrase features (Fig. 4). Step 1a (extracting idiom phrase feature I): idiom phrases such as "long time no see" or "a hot potato" are extracted from an input text t by matching against an idiom corpus.
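A minimal sketch of the idiom-matching step, using a toy idiom list in place of the idiom corpus referenced in the paper:

# Toy idiom list standing in for the idiom corpus; matching is simple substring search.
IDIOMS = ["long time no see", "a hot potato", "piece of cake"]

def count_idioms(text):
    # Count idiom phrases found in the input text t (idiom phrase feature I).
    lowered = text.lower()
    return sum(lowered.count(idiom) for idiom in IDIOMS)

print(count_idioms("Long time no see! That exam was a piece of cake."))  # 2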
Q11. What is the log-log domain of the linear regression lines?
The linear regression lines f for each distribution are then estimated in the log-log domain as f = ax + b, where a is the slope and b is the y-intercept of the line f.
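A sketch of this fit with NumPy (the particular fitting routine is an assumption; any least-squares line fit in log-log space would do):

import numpy as np

def fit_loglog_line(rank_freq_pairs):
    # Fit f = a*x + b to (rank, frequency) pairs in the log-log domain.
    # Returns (a, b): the slope and the y-intercept of the line f.
    ranks = np.array([r for r, _ in rank_freq_pairs], dtype=float)
    freqs = np.array([f for _, f in rank_freq_pairs], dtype=float)
    a, b = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    return a, b

# A slope a close to the Zipfian slope of -1 is the behaviour reported for human text.
print(fit_loglog_line([(1, 120), (2, 60), (3, 40), (4, 30)]))  # roughly (-1.0, 4.79)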
Q12. How are the distributions of humanwritten text estimated?
The distributions of human- and machine-generated text are estimated by two linear regression lines, colored blue and red, respectively.
Q13. how are the distributions of human and computer text estimated?
The distributions of human text and computer text are estimated by two linear regression lines in blue and orange, respectively.