Re: Generating fake databases

From: Joe Thurbon <usenet_at_thurbon.com>
Date: Wed, 12 Oct 2011 07:26:32 +1000
Message-ID: <op.v27huihkq7k8pw_at_the-thurbonss-imac.local>


On Tue, 11 Oct 2011 21:54:42 +1000, Roy Hann <specially_at_processed.almost.meat> wrote:

> Can anyone point me towards any papers, articles or web pages that
> discuss efficient techniques for generating large volumes of completely
> synthetic database content having specified characteristics? I'd
> settle for pointers to any software tools that might exist.
>
> I am not interested in generating mere random values; I want to
> efficiently generate plausible/realistic values for multiple tables
> and the "data" must satisfy my database constraints and have specified
> distributions of key values.
>
> If it is relevant, assume I want to create databases for an SQL DBMS.
>
> Needless to say I've made a stab at Googling for what I want but I
> haven't been able to guess effective search terms.
>

I'm not aware of anything directly relevant. The following may be too far afield, but just in case.

For satisfying the database constraints, it might be worth checking up on Constraint Satisfaction Problem/Programming (CSP).
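To make the CSP suggestion a bit more concrete, here is a minimal sketch in Python of generating one row by backtracking search over column domains. Everything in it (the `solve` function, the column names, the constraints) is made up for illustration and is not taken from any particular tool; a real CSP library would do this far more efficiently.

```python
import random

def solve(domains, constraints, rng, assignment=None):
    """Backtracking search (hypothetical helper): give each column a value
    from its domain so that every constraint whose columns are all
    assigned evaluates to True. Returns a dict, or None if impossible."""
    assignment = dict(assignment or {})
    unassigned = [col for col in domains if col not in assignment]
    if not unassigned:
        return assignment
    col = unassigned[0]
    candidates = list(domains[col])
    rng.shuffle(candidates)  # random order, so repeated calls give varied rows
    for value in candidates:
        assignment[col] = value
        consistent = all(
            check(assignment)
            for cols, check in constraints
            if all(c in assignment for c in cols)
        )
        if consistent:
            result = solve(domains, constraints, rng, assignment)
            if result is not None:
                return result
        del assignment[col]  # undo and try the next candidate
    return None

# Illustrative column domains and CHECK-style constraints (assumptions).
domains = {
    "start_day": range(1, 29),
    "end_day": range(1, 29),
    "status": ["open", "closed"],
}
constraints = [
    (("start_day", "end_day"), lambda row: row["start_day"] <= row["end_day"]),
    (("status", "end_day"), lambda row: row["status"] == "open" or row["end_day"] >= 15),
]

row = solve(domains, constraints, random.Random(0))
```

Each constraint is a pair of the columns it mentions and a predicate, so it is only checked once all of its columns have values. That lets the search reject a bad partial row early instead of generating whole rows and discarding them.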

As for generating data according to distributions, there's literature in the Machine Learning space about doing this, but I'm afraid that I can't remember the journal where I saw it most recently - maybe JML, maybe JetAI, probably around 2005. A lot of the literature covers generating vectors in n-space according to various distributions (Gaussian, Poisson, etc.), but I'm sure I've seen work on generating nominal values, too. Search terms might be 'evaluation learning algorithms synthetic data'.
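As a rough illustration of sampling from specified distributions, the sketch below uses only Python's standard `random` module to draw a numeric column from a Gaussian, a nominal column from weighted categories, and foreign-key values from a skewed (Zipf-like) distribution. The table names, column names, and parameters are all assumptions chosen for the example.

```python
import random

def make_customers(n, rng):
    """Parent table: nominal 'segment' drawn with fixed weights,
    numeric 'age' drawn from a Gaussian and clamped to at least 18."""
    segments = ["retail", "business", "government"]
    weights = [0.70, 0.25, 0.05]
    return [
        {
            "customer_id": i,
            "segment": rng.choices(segments, weights=weights, k=1)[0],
            "age": max(18, round(rng.gauss(40, 12))),
        }
        for i in range(1, n + 1)
    ]

def make_orders(n, customers, rng):
    """Child table: every customer_id is taken from the parent table, so the
    foreign key always holds. Weights of 1/rank give a skewed key
    distribution in which a few customers receive most of the orders."""
    ids = [c["customer_id"] for c in customers]
    weights = [1.0 / rank for rank in range(1, len(ids) + 1)]
    chosen = rng.choices(ids, weights=weights, k=n)
    return [
        {"order_id": i, "customer_id": cid}
        for i, cid in enumerate(chosen, start=1)
    ]

customers = make_customers(50, random.Random(1))
orders = make_orders(500, customers, random.Random(2))
```

Generating the parent table first and sampling child keys from it is one simple way to keep referential integrity while still controlling how the key values are distributed.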

Sorry I can't be more specific.

Cheers,
Joe

Received on Tue Oct 11 2011 - 23:26:32 CEST
