Semistructured data

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Thu, 26 Oct 2006 20:58:11 +0300
Message-ID: <Pine.SOL.4.62.0610262017100.15263_at_kruuna.helsinki.fi>


On 2006-10-26, JOG wrote:

> What /exactly/ do you mean by 'semistructured data' Sampo? I see this
> term bandied around a lot, but am yet to see a clear explanation of
> what it /is/.

Buneman defines it as data where the information normally associated with schemas is part of the data itself. I.e. each time you load some data, the metadata within it gets to tell you what the keys, attributes, domains and constraints are, and you don't get a say in the matter.
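To make that concrete, here is a minimal sketch of loading data that carries its own schema. The JSON layout (a "schema" object next to the "rows"), the type names and the function name are all made up for illustration; nothing here comes from Buneman's work. The point is only that the loader learns the attributes and domains from the data instead of being told them in advance.

```python
import json

# Hypothetical mapping from type names found in the data to Python casts.
CASTS = {"int": int, "float": float, "str": str}

def load_self_describing(text):
    """Parse a document whose schema travels with the data itself."""
    doc = json.loads(text)
    schema = doc["schema"]  # assumed layout: {"attribute": "type-name"}
    rows = [
        {name: CASTS[type_name](raw[name]) for name, type_name in schema.items()}
        for raw in doc["rows"]
    ]
    return schema, rows

text = '{"schema": {"id": "int", "name": "str"}, "rows": [{"id": "1", "name": "Ann"}]}'
schema, rows = load_self_describing(text)
```

The receiving code never declares "id" or "name" anywhere; a second document with a different "schema" entry would load through the same function.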

I'd frame the issue a bit differently. Metadata like that is often used when we want to abstract away from some part of the representation but do not want to make the data completely opaque. For example, we may know that a certain part of the data only concerns a specific application, so that our own application can safely ignore it. We still want to be able to store that data so that it can be fed to said application, and to retain what structure it has so that e.g. parsing, buffering and annotation code can be shared. That's how multimedia and object container formats work. In this sense semistructured data is data conforming to a schema that has been coarsened from what a relational database would expect, and that is more weakly typed as a result.

There are also some applications with lots of semantic machinery that lives at a level higher than RM's logical one, or which utilize a fundamentally different data model; those end up being emulated/reified on top of RM. Semistructured data as a concept tries to capture the common features of all of these, to allow for the data determining the schema instead of the other way around, and to bridge the gap between opaque blobs and well structured databases.
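The container-format case can be sketched roughly as follows. The chunk layout (a 4-byte tag, a 4-byte big-endian length, then the payload) and the tag names TITL and XAPP are assumptions invented for this example, loosely in the spirit of tagged chunk formats, not any particular real format. The application understands one chunk type and carries the rest through untouched, while the parsing code is shared by everything.

```python
import struct

HEADER = ">4sI"  # assumed layout: 4-byte tag, 4-byte big-endian payload length
HEADER_SIZE = struct.calcsize(HEADER)

def read_chunks(blob):
    """Split a container into (tag, payload) pairs without interpreting payloads."""
    chunks, pos = [], 0
    while pos < len(blob):
        tag, length = struct.unpack_from(HEADER, blob, pos)
        pos += HEADER_SIZE
        chunks.append((tag, blob[pos:pos + length]))
        pos += length
    return chunks

def write_chunks(chunks):
    """Serialize (tag, payload) pairs back into the container layout."""
    return b"".join(struct.pack(HEADER, tag, len(p)) + p for tag, p in chunks)

def retitle(blob, new_title):
    """Rewrite the one chunk we understand; keep every other chunk as an opaque blob."""
    out = []
    for tag, payload in read_chunks(blob):
        if tag == b"TITL":  # hypothetical tag this application knows about
            payload = new_title.encode("utf-8")
        out.append((tag, payload))
    return write_chunks(out)

original = write_chunks([(b"TITL", b"old"), (b"XAPP", b"\x00\x01\x02")])
updated = retitle(original, "new")
```

The XAPP payload survives the round trip even though this code has no idea what it means, which is the "coarsened schema" in miniature: the structure is typed down to the chunk level and no further.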

> I've heard that it is data that doesn't fit into the RM, but AFAIK all
> data fits in the RM, so I find this highly suspect.

As Bob just showed, everything does fit in the RM. However, current RDBMSs do not necessarily make storage of all kinds of data easy or uniform enough. One nice example is disjunctive data from knowledge bases. Even that fits, but the representations become rather ugly and an RDBMS doesn't really understand them. Lots of extra machinery is needed on top of the database to make sense of logical ORs. So in this case the RDBMS is really used more like a dumb record store: the real semantics of the data are never revealed to it, and the final, disjunction-aware database essentially ends up being emulated on top of the RM.
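A small sketch of what I mean, using SQLite from Python. The encoding is one assumption among many possible ones: rows sharing a clause_id are alternatives of a single disjunction, and the table and function names are invented for the example. The "certainly" check is deliberately naive (it only recognizes clauses with a single alternative and does no inference across clauses), which is enough to show where the semantics end up living.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Rows with the same clause_id are read as "this OR that"; the DBMS only sees plain rows.
conn.execute("CREATE TABLE alternative (clause_id INTEGER, person TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO alternative VALUES (?, ?, ?)",
    [
        (1, "ann", "Helsinki"),                    # ann lives in Helsinki
        (2, "bob", "Espoo"), (2, "bob", "Vantaa"),  # bob lives in Espoo OR Vantaa
    ],
)

def clauses(conn):
    """Regroup the stored rows into disjunctions; this is the layer on top of the database."""
    grouped = {}
    for cid, person, city in conn.execute(
        "SELECT clause_id, person, city FROM alternative"
    ):
        grouped.setdefault(cid, set()).add((person, city))
    return list(grouped.values())

def possibly(conn, person, city):
    """True if the fact appears as an alternative in some clause."""
    return any((person, city) in clause for clause in clauses(conn))

def certainly(conn, person, city):
    """True only if some clause consists of exactly this one fact (naive, no inference)."""
    return any(clause == {(person, city)} for clause in clauses(conn))
```

A plain SELECT for bob and Espoo returns a row just as it would for a definite fact; the difference between "possibly" and "certainly" exists only in the code wrapped around the table.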

-- 
Sampo Syreeni, aka decoy - mailto:decoy_at_iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Thu Oct 26 2006 - 19:58:11 CEST
