Re: Generating fake databases

From: Derek Asirvadem <derek.asirvadem_at_gmail.com>
Date: Fri, 14 Oct 2011 00:37:43 -0700 (PDT)
Message-ID: <272d41c6-fd4f-4e59-840b-a628753b39bd_at_v7g2000yqf.googlegroups.com>


On Oct 14, 2:48 am, Roy Hann <specia..._at_processed.almost.meat> wrote:

>

> I fear you may be somewhat dazzled by the clarity of your own vision of
> the solution.

(Groan)

Well, the prospects are always brighter when one knows what one wants; then people who feel inclined to help you can actually do so. But never mind, we will drag it out of you, no matter what it takes. We do not have crystal balls, so it may take a bit of to-ing and fro-ing.

  1. Exactly what does "synthetic" mean ? complete rubbish ? any alphanumeric character in a CHAR column ? Printable/unprintable ?
     - what "values" do you want, that will satisfy your "constraints" ?
     - do you have CHECK constraints that specify data value ranges ?
     - Note Erwin's question. Even simpler constraints are as yet unidentified. Exactly what are these "constraints", if they allow synthetic rubbish into the "database" ?
     - You did not answer, have you implemented, do you want, the ACI in ACID (the D is supplied by the DBMS) ? The answer will tell me a lot about what you are seeking.
  2. Since we now know that "plausible/realistic" does not mean plausible/realistic, perhaps you would be kind enough to tell us what that means to you. And why you have the word data in quotation marks (it is all data, synthetic or otherwise, no ?). Please be specific, we need to know the distinction between "synthetic/plausible/realistic" and real/example/source.
  3. Perhaps you can post your "specified characteristics" to which the data must conform. If you want the population mechanism to be driven by this set of control tables, that would be even better.

The next bit has to do with your SQL expertise. Rather than ask questions and receive tidbits of information, allow me to give an example. Since it is from my real life, no doubt it will be dazzling. I realise it is not what you want, because we are clueless re what that is (maybe you and I have something in common there, eh ?), but it is offered as a discussion, in the hope that it may improve your ability to specify your exact requirement. At the least, it will, once again, confirm what you *don't* want, and we will progress one small but light-filled step.

Stress testing situation. I needed a full database, with specific ranges of data distribution. I did not wish to go through the paperwork of obtaining a copy of Prod; besides, it was a new version of the db with about 200 new columns. I wrote a few SQL scripts to load each table (Customer; then OrderSale from Customer; then OrderSaleItem from OrderSale; etc). I split them into parallel sessions, and wrote a couple of shell scripts to manage the lot. I executed them. In a short time, the scripts finished. I ran a couple of sprocs that produce inventory information, stats, etc, and checked populations, data distribution, etc. I was bedazzled by my own brilliance, the database was full, as planned. No posting on international websites asking daft questions that only I can answer; no products; no freeware; no reading papers.

4. Do you have the expertise to do all that (except the bedazzled bit, of course) ?

5. Are there any steps in that, that may be relevant or irrelevant to you ?

6. Probably the most important, in the scripting dept, do you know how to use vectors, and perform projections ?
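
For discussion only, the load order described above (Customer; then OrderSale from Customer; then OrderSaleItem from OrderSale) might be sketched roughly as follows. This is not the original script set. Everything in it is an assumption made for illustration: SQLite (via Python's standard sqlite3 module) stands in for the real DBMS, the column names, the single CHECK constraint, and the functions build_synthetic_db and populations are hypothetical, and "vector" is taken to mean a small table of integers that is cross-joined to a parent table so that each parent row produces N child rows. The parallel sessions and shell scripts are left out.

```python
import sqlite3


def build_synthetic_db(n_customers=100, orders_per_customer=5, items_per_order=3):
    """Load Customer, then OrderSale from Customer, then OrderSaleItem from OrderSale."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        -- Hypothetical schema; Vector is just the integers 1..N.
        CREATE TABLE Vector (N INTEGER PRIMARY KEY);
        CREATE TABLE Customer (
            CustomerId INTEGER PRIMARY KEY,
            Name       TEXT NOT NULL
        );
        CREATE TABLE OrderSale (
            CustomerId INTEGER NOT NULL REFERENCES Customer (CustomerId),
            OrderNo    INTEGER NOT NULL,
            PRIMARY KEY (CustomerId, OrderNo)
        );
        CREATE TABLE OrderSaleItem (
            CustomerId INTEGER NOT NULL,
            OrderNo    INTEGER NOT NULL,
            ItemNo     INTEGER NOT NULL,
            Quantity   INTEGER NOT NULL CHECK (Quantity BETWEEN 1 AND 10),
            PRIMARY KEY (CustomerId, OrderNo, ItemNo),
            FOREIGN KEY (CustomerId, OrderNo)
                REFERENCES OrderSale (CustomerId, OrderNo)
        );
    """)

    # The vector must be at least as long as the largest multiplier used below.
    top = max(n_customers, orders_per_customer, items_per_order)
    conn.executemany("INSERT INTO Vector (N) VALUES (?)",
                     [(n,) for n in range(1, top + 1)])

    # Customer: one row per vector element.
    conn.execute(
        "INSERT INTO Customer (CustomerId, Name) "
        "SELECT N, 'Customer ' || N FROM Vector WHERE N <= ?",
        (n_customers,))

    # OrderSale from Customer: project the parent key, multiply by the vector.
    conn.execute(
        "INSERT INTO OrderSale (CustomerId, OrderNo) "
        "SELECT c.CustomerId, v.N FROM Customer c CROSS JOIN Vector v "
        "WHERE v.N <= ?",
        (orders_per_customer,))

    # OrderSaleItem from OrderSale: same pattern, one level down.
    # Quantity is derived arithmetically so that it stays within the CHECK range.
    conn.execute(
        "INSERT INTO OrderSaleItem (CustomerId, OrderNo, ItemNo, Quantity) "
        "SELECT o.CustomerId, o.OrderNo, v.N, "
        "       1 + (o.CustomerId + o.OrderNo + v.N) % 10 "
        "FROM OrderSale o CROSS JOIN Vector v WHERE v.N <= ?",
        (items_per_order,))

    conn.commit()
    return conn


def populations(conn):
    """Row count per table, the simplest form of 'checking populations'."""
    tables = ("Customer", "OrderSale", "OrderSaleItem")
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}
```

Because each child table is loaded from its parent, the foreign keys are satisfied by construction, and the population of each level is simply the parent population multiplied by the length of the vector slice used. A required data distribution would replace the fixed multipliers and the arithmetic Quantity expression.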

Anyway, contemplate that for a while, and see if you can provide a few more words specifying exactly what you want, in the creation of "synthetic but plausible/realistic" "data" that "satisfies constraints" but nothing else. We will delight in whatever scraps of information we receive from you, and we will jump at the chance of dragging a few more scraps out of you tomorrow.

Received on Fri Oct 14 2011 - 09:37:43 CEST
