Re: Idempotence and "Replication Insensitivity" are equivalent ?

From: Marshall <marshall.spight_at_gmail.com>
Date: 23 Sep 2006 12:34:31 -0700
Message-ID: <1159040071.308160.114180_at_m7g2000cwm.googlegroups.com>


Bob Badour wrote:
> Marshall wrote:
>
> > Amusing idea: write a machine learning program and feed
> > it texts for which one knows the sex of the author, in order
> > to produce a technique for establishing sex from someone's
> > writings. I imagine it could be modestly successful.
>
> Are you suggesting that she uses "that" a lot?

Not so far off, actually. There has been a fair bit of work done in using machine learning techniques to identify authorship, and one technique that is surprisingly effective is looking at the frequency of use of common words. I wouldn't have thought that that technique was worth a damn, but in fact it works quite well. In her novels, Jane Austen consistently uses "the" and "of" in almost exactly a 1:1 ratio, whereas Henry David Thoreau uses "the" more than 2:1 over "of." Throw together enough of these little features and they start to form a kind of textual fingerprint. I wrote some software that could consistently pair up all the Arthur Conan Doyle novels, and all the Jane Austen novels, and correctly distinguish them from Thoreau, Mary Shelley, etc. It had the most difficulty distinguishing between Jane Eyre and Wuthering Heights; the authors of those two novels were sisters, and had grown up and gone to school together. It had no trouble distinguishing between Mary Shelley and Percy Shelley, wife and husband.
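To make the "textual fingerprint" idea concrete, here is a minimal Python sketch. It is not the software mentioned above; the word list, the function names (fingerprint, distance, the_of_ratio) and the use of Euclidean distance are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of very common English words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")

def _tokens(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def fingerprint(text, words=FUNCTION_WORDS):
    """Relative frequency of each chosen word, in the order given by `words`."""
    tokens = _tokens(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in words]

def distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints; smaller means more alike."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

def the_of_ratio(text):
    """Ratio of "the" to "of"; infinite if "of" never occurs."""
    counts = Counter(_tokens(text))
    return counts["the"] / counts["of"] if counts["of"] else float("inf")
```

One would compute a fingerprint per novel (say, from Project Gutenberg plain-text files) and pair each text with its nearest neighbour by distance. A real attempt would presumably want many more features than eight words.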

Oh, and all hail Project Gutenberg as a fine source for online texts, whether for reading or analysis.

> While that might be
> suggestive of male sex if she were an anglophone, I don't know that it
> would mean that much for someone that has a different mother tongue.

That's a good point, actually. While I wouldn't expect that identity detection would be affected by English as a second language, I would expect that sex detection would be highly culturally specific. As far as detecting it by text goes, anyway.

Marshall

Received on Sat Sep 23 2006 - 21:34:31 CEST
