Author Profiling using Stylistic and N-Gram Features
Radha D¹, Chandra Sekhar P²
¹ Radha D*, Department of CSE, Malla Reddy College of Engineering and Technology, Hyderabad, India.
² Chandra Sekhar P, Department of CSE, GITAM, Visakhapatnam, India.
Manuscript received on September 21, 2019. | Revised Manuscript received on October 05, 2019. | Manuscript published on October 30, 2019. | PP: 3044-3049 | Volume-9 Issue-1, October 2019 | Retrieval Number: A1621109119/2019©BEIESP | DOI: 10.35940/ijeat.A1621.109119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The World Wide Web is growing rapidly, with massive amounts of textual content added primarily through social media sites. Most users are unwilling to provide genuine personal details along with the text they post to these sites. To recover accurate information about authors, researchers established a research area called Authorship Analysis, which infers details about authors by examining their text. Authorship Profiling is one type of Authorship Analysis; it detects demographic characteristics of authors, such as age, gender, location, educational background, native language and personality traits, by examining the writing style of their text. Stylometry is a research area that defines a set of stylometric features, namely word-based, character-based, syntactic, structural and content-based features, for differentiating authors' writing styles. In this work, experiments were conducted with various stylistic features, N-grams and content-based features for gender prediction. These features are used to represent documents as vectors, and the classification algorithms build a model by processing these vectors. Two classification algorithms, Random Forest and Naïve Bayes Multinomial, were used. We concentrated on predicting gender on the PAN 2019 competition Twitter dataset. Our approach achieved higher accuracy than many existing Authorship Profiling approaches.
Keywords: Authorship Analysis, Authorship Profiling, Accuracy, Content based Features, Gender Prediction, N-grams, Stylistic Features.
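The pipeline summarized in the abstract (extract stylistic and N-gram features, represent each document as a vector, train a classifier on those vectors) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the specific stylistic measures, the choice of character trigrams, the function and class names (`stylistic_features`, `char_ngrams`, `TinyMultinomialNB`), the toy texts and the neutral labels `class_a`/`class_b` are all assumptions made for illustration. Only a simplified multinomial Naïve Bayes over N-gram counts is shown; Random Forest is omitted, and the stylistic features are computed but not fed to the classifier.

```python
import math
import re
from collections import Counter


def stylistic_features(text):
    """Return a few illustrative word- and character-based style measures."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "punct_ratio": sum(c in ".,;:!?" for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
    }


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class TinyMultinomialNB:
    """Simplified multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.n_docs = len(labels)
        # Per-class n-gram counts, accumulated over all training documents.
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        def log_score(c):
            denom = self.totals[c] + self.alpha * len(self.vocab)
            score = math.log(self.prior[c] / self.n_docs)
            for gram, k in doc.items():
                if gram in self.vocab:  # n-grams unseen in training are ignored
                    score += k * math.log((self.counts[c][gram] + self.alpha) / denom)
            return score

        return max(self.classes, key=log_score)


# Toy, made-up training texts with neutral placeholder labels.
train_texts = ["aaa aaa aaa", "aaa aa aaa", "zzz zzz", "zz zzz zzz"]
train_labels = ["class_a", "class_a", "class_b", "class_b"]

model = TinyMultinomialNB().fit([char_ngrams(t) for t in train_texts], train_labels)
prediction = model.predict(char_ngrams("aaaa"))
style = stylistic_features("Hi there!")
```

In a real experiment the placeholder labels would be the gender labels of the dataset, and the stylistic measures would be combined with the N-gram counts into a single document vector before training.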