Re: Idempotence and "Replication Insensitivity" are equivalent ?
Date: 23 Sep 2006 12:34:31 -0700
Bob Badour wrote:
> Marshall wrote:
> > Amusing idea: write a machine learning program and feed
> > it texts for which one knows the sex of the author, in order
> > to produce a technique for establishing sex from someone's
> > writings. I imagine it could be modestly successful.
> Are you suggesting that she uses "that" a lot?
Not so far off, actually. There has been a fair bit of work done
in using machine learning techniques to identify authorship, and
one technique that is surprisingly effective is looking at the
frequency of use of common words. I wouldn't have thought
that that technique was worth a damn, but in fact it works
quite well. When one looks at the Jane Austen novels, she
consistently uses "the" and "of" in almost exactly a 1:1 ratio,
whereas Henry David Thoreau uses "the" more than 2:1 over
"of." Throw together enough of these little features and they
start to form a kind of textual fingerprint. I wrote some
software that could consistently pair up all the Arthur
Conan Doyle novels, and all the Jane Austen novels, and
correctly distinguish them from Thoreau, Mary Shelley, etc.
It had the most difficulty distinguishing between Jane Eyre and
Wuthering Heights; the authors of those two novels were
sisters, and had grown up and gone to school together.
It had no trouble distinguishing between Mary Shelley and
Percy Shelley, wife and husband.
Oh, and all hail Project Gutenberg as a fine source for
online texts, whether for reading or analysis.
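
For the curious, a minimal sketch of the common-word fingerprint idea in Python. This is not the software mentioned above, only an illustration under assumptions: the word list in FUNCTION_WORDS, the Euclidean distance measure, and all function names are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Assumed feature set: a handful of common English function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text, words=FUNCTION_WORDS):
    """Return each function word's share of all tokens in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in words]


def ratio(text, a="the", b="of"):
    """Return how often word `a` occurs relative to word `b`."""
    counts = Counter(tokenize(text))
    return counts[a] / counts[b] if counts[b] else math.inf


def distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def nearest(unknown, known):
    """Return the label of the known text whose profile is closest."""
    target = profile(unknown)
    return min(known, key=lambda label: distance(target, profile(known[label])))
```

Here `known` would be a dict mapping an author label to the text of one of that author's works, read from locally saved plain-text files. A real fingerprint would presumably need many more features and whole novels rather than short samples.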
> suggestive of male sex if she were an anglophone, I don't know that it
> would mean that much for someone that has a different mother tongue.
That's a good point, actually. While I wouldn't expect that identity detection would be affected by English as a second language, I would expect that sex detection would be highly culturally specific. As far as detecting it by text goes, anyway.
Marshall
Received on Sat Sep 23 2006 - 21:34:31 CEST