Re: Idempotence and "Replication Insensitivity" are equivalent ?

From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Sat, 23 Sep 2006 20:30:29 GMT
Message-ID: <FPgRg.37439$9u.315922_at_ursa-nb00s0.nbnet.nb.ca>


Marshall wrote:

> Bob Badour wrote:
>

>>Marshall wrote:
>>
>>
>>>Amusing idea: write a machine learning program and feed
>>>it texts for which one knows the sex of the author, in order
>>>to produce a technique for establishing sex from someone's
>>>writings. I imagine it could be modestly successful.
>>
>>Are you suggesting that she uses "that" a lot?

>
>
> Not so far off, actually. There has been a fair bit of work done
> in using machine learning techniques to identify authorship, and
> one technique that is surprisingly effective is looking at the
> frequency of use of common words. I wouldn't have thought
> that that technique was worth a damn, but in fact it works
> quite well. When one looks at the Jane Austen novels, she
> consistently uses "the" and "of" in almost exactly a 1:1 ratio,
> whereas Henry David Thoreau uses "the" more than 2:1 over
> "of." Throw together enough of these little features and they
> start to form a kind of textual fingerprint. I wrote some
> software that could consistently pair up all the Arthur
> Conan Doyle novels, and all the Jane Austen novels, and
> correctly distinguish them from Thoreau, Mary Shelley, etc.
> It had the most difficulty distinguishing between Jane Eyre and
> Wuthering Heights; the authors of those two novels were
> sisters, and had grown up and gone to school together.
> It had no trouble distinguishing between Mary Shelley and
> Percy Shelley, wife and husband.
>
> Oh, and all hail Project Gutenberg as a fine source for
> online texts, whether for reading or analysis.
>
>
>
>>While that might be
>>suggestive of male sex if she were an anglophone, I don't know that it
>>would mean that much for someone that has a different mother tongue.

>
>
> That's a good point, actually. While I wouldn't expect that identity
> detection would be affected by English as a second language,
> I would expect that sex detection would be highly culturally
> specific. As far as detecting it by text goes, anyway.
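
For anyone who wants to try the common-word idea quoted above, here is a minimal sketch in Python. It is not Marshall's software: the word list, the function names, and the choice of a sum-of-absolute-differences distance are all assumptions made for illustration. It computes a "the":"of" style ratio and a small frequency profile that could serve as one piece of a textual fingerprint.

```python
import re
from collections import Counter

# Hypothetical feature list: a handful of very common English words.
# A real experiment would likely use many more.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_ratio(text, a="the", b="of"):
    """Return count(a) / count(b), or None if b never occurs."""
    counts = Counter(tokenize(text))
    return counts[a] / counts[b] if counts[b] else None


def word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each listed word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in words]


def profile_distance(p, q):
    """Sum of absolute differences between two profiles (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(p, q))
```

To compare two texts, build a profile for each with word_profile() and pass both to profile_distance(); under this assumed metric, texts by the same author would be expected to score closer together than texts by different authors.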

I would say that having a foreign mother tongue would almost certainly confound the results for sexing the writer. Consider the frequency of the verb "to be" from a Russian.

Received on Sat Sep 23 2006 - 22:30:29 CEST
