Channel: Dictionary-based text analysis- dealing with length - Data Science Stack Exchange

Dictionary-based text analysis- dealing with length


I am working on an analysis that uses a dictionary-based, text-as-data approach. I have a dataset of 1,200 texts, to which I apply a dictionary of 50 words (I tokenize each text so that every word is one token). The texts vary greatly in length, so I try to account for length in my models. In the first model, I divide the dictionary count in each text by the text's length (k = dictionary count / text length). Because I use a regression model, I then take the square root of k to bring its distribution closer to normal, and I use the result as the dependent variable. In the second model, I do not divide the dictionary count by text length; instead, I include length as a predictor in a linear regression (the dependent variable is still the square root of the dictionary count). The results of the two models differ substantially, especially in terms of statistical significance. I am struggling to decide which model is better, because I could not find papers on this question in my field (political science) or elsewhere. Any suggestions?
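To make the two specifications concrete, here is a minimal sketch of how the two dependent variables could be built per text. The function name `dictionary_features`, the three-word dictionary, and the toy texts are all hypothetical stand-ins, not my actual data or code. The long text is simply the short one repeated four times, so both have the same dictionary rate but different raw counts.

```python
import math
from collections import Counter


def dictionary_features(tokens, dictionary):
    """Return the raw inputs and both candidate dependent variables for one text.

    tokens: list of word tokens (one token per word).
    dictionary: set of dictionary words (hypothetical example below).
    """
    length = len(tokens)
    counts = Counter(tokens)
    # Total number of tokens in the text that match any dictionary word.
    hits = sum(counts[word] for word in dictionary)
    # k = dictionary count / text length; guard against empty texts.
    rate = hits / length if length else 0.0
    return {
        "hits": hits,
        "length": length,
        "sqrt_rate": math.sqrt(rate),  # Model 1: DV = sqrt(count / length)
        "sqrt_hits": math.sqrt(hits),  # Model 2: DV = sqrt(count), length is a predictor
    }


# Hypothetical dictionary and toy texts, for illustration only.
dictionary = {"tax", "budget", "deficit"}
short_text = "the budget raises tax".split()
long_text = short_text * 4  # same dictionary rate, four times the length

short_feats = dictionary_features(short_text, dictionary)
long_feats = dictionary_features(long_text, dictionary)

print(short_feats)
print(long_feats)
```

In this toy case `sqrt_rate` is identical for both texts, while `sqrt_hits` doubles for the longer one. Since sqrt(count / length) = sqrt(count) / sqrt(length), the first model builds a fixed scaling by length into the dependent variable, whereas the second lets the regression estimate an additive linear effect of length, which may be part of why the two models can disagree.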

