Re: Two examples of semi structured data.

From: mAsterdam <mAsterdam_at_vrijdag.org>
Date: Sun, 22 Aug 2004 17:42:18 +0200
Message-ID: <4128bedf$0$10528$e4fe514c_at_news.xs4all.nl>


Jan Hidders wrote:

> mAsterdam wrote:

>>Jan Hidders wrote:
>>>mAsterdam wrote:
>>>
>>>>Now to the document, "Querying Semi-Structured Data".
>>>>When I read texts like: "Some of this data is C<raw>
>>>>data, e.g., images or sound." I infer that the author
>>>>talks about potentially meaningless C<signs>, not about
>>>>C<data>.
>>>
>>>Er, no. Note the "(from a particular viewpoint)" phrase, which is crucial
>>>here. Serge argues that sometimes, from a particular viewpoint of for
>>>example a certain type of user or a certain application, you are really
>>>not interested in any structure that may or may not be hidden in a
>>>particular stream of bits; they are just a stream of bits and that's
>>>it.
>>
>>That's it. No data, right? Signs.
>
> Yes. Or as it is usually taught in academia: "No information, just data."

I am aware of that habit in tech-academia (Wikipedia: "Data on its own has no meaning, only when interpreted by some kind of data processing system does it take on meaning and become information." (1)). Do you happen to know a good source?

In those terms: the authors are talking about potentially informationless data. Well, then let's try to talk about informationbases - hmm... doesn't sound good. Nah, database is fine, C<provided the data means something> - thus departing from this habit.

Because of the widespread habit (which BTW also blurs distinctions of relevancy, or newsworthiness, by having to overload the term information even more), I am forced to define the terms in database context. Oh well, so be it. No big deal, they are close to intuition for many people anyway, also in academia.

One other inconvenience though: I have to be alert to people who adapt to its use and consequently have to struggle through a stage where they confuse storage structures and meaning (see the example about the registry elsewhere in this thread).

>>>That does certainly not exclude the possibility that from *another*
>>>point of view there certainly is some worthwile structure to be
>>>discovered there. Think of for example the payload in a package in some
>>>communication protocol.
>>
>>Signs and packages of signs. Structure yes, data no.
>>
>>>At one level of the protocol stack this is just a
>>>list of bits, at another level the same list of bits may be a certain
>>>service request with some parameters. Whether data is "raw" or not is in
>>>the eye of the beholder, it is not an objective quality of the thing
>>>itself.
>>
>>Put differently: we can store stuff, move it around from
>>one place to another, use structures for that - but
>>without interpretation ("the eye of the beholder") there
>>is no message, no meaning, no communication.

> 
> Yes. Is that a problem? Mostly this is just a matter of 
> definition and therefore meaningless.

No problem. It demarcates areas of interest and layers in communication. When I am _not_ interested in meaning per se, I can focus on emerging formats without worrying too much: http://www.wotsit.org
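
A minimal Python sketch of the layering point, for illustration only. The same three bytes are just signs to the layer that moves them, and a service request to the layer that interprets them. The message layout (a 1-byte opcode followed by a 2-byte big-endian parameter) and the function names are invented for this sketch; they come from neither the paper nor any real protocol.

```python
import struct

# The signs: three bytes, nothing more.
payload = bytes([0x01, 0x00, 0x2A])

def forward(raw: bytes) -> bytes:
    """Store/forward layer: copies the bytes and attaches no meaning to them."""
    return bytes(raw)

def interpret(raw: bytes) -> dict:
    """Application layer: reads the same bytes under a (hypothetical) agreement:
    1-byte opcode, then a 2-byte big-endian parameter."""
    opcode, param = struct.unpack(">BH", raw)
    return {"opcode": opcode, "param": param}

moved = forward(payload)     # structure handled, meaning untouched
request = interpret(moved)   # meaning appears only with the agreement
```

Nothing in forward() depends on the agreement used by interpret(); that independence is what lets one layer ignore meaning.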

The choice of terminology is not innocent, though (see [semi-orthogonal ?], below).

>>>>I don't have to wait very long to verify that the damage of
>>>>this non-choice is done. "We call here C<semi-structured
>>>>data> this data that is (from a particular viewpoint) neither
>>>>raw nor strictly typed, i.e. not table-oriented as in a relational
>>>>model or sorted-graph as in object databases." Well (please
>>>>keep in mind I am making statements of taste, I am *not* refuting
>>>>the author's argument): by lumping together sorted-graphs and tables
>>>>in one category "strictly typed" suddenly all structure _inherent_
>>>>in the data is out of focus.
>>>
>>>Where do you get that idea? The only thing that is said is that if that
>>>data is typed, which basically means that we know its structure, and if
>>
>>... Sorry to interrupt. The only structure we now know is structure
>>imposed on the signs to be stored or forwarded or represented. This
>>structure does not determine meaning, neither is it determined by
>>meaning. Buzzword bingoish: it is orthogonal to meaning.

> 
> No, it is not orthogonal because it can be, and usually is,
> the carrier of meaning. 

[semi-orthogonal ?]
Let's zoom in here:
The store/forward structure carries the signs. Can we or can we not change that structure without affecting the stored/transported signs? Can we change the signs without affecting the conveyed meaning?

Or, in the accepted 'data as potentially meaningless'-terminology: can we change the datastructure without affecting the data?

The latter sounds a lot more difficult.
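
To make the first form of the question concrete, here is a small Python sketch under stated assumptions: the same values are held once row by row and once column by column, and converting between the two loses nothing. The record contents and the helper names to_columns and to_rows are made up for illustration, and the sketch assumes every row has the same keys. It only shows the easy case; it does not settle the question.

```python
# Row-oriented storage: one dict per record.
rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]

def to_columns(rows):
    """Re-store the same values column by column."""
    cols = {}
    for row in rows:
        for key, value in row.items():
            cols.setdefault(key, []).append(value)
    return cols

def to_rows(cols):
    """Rebuild the row-oriented form from the column-oriented one."""
    keys = list(cols)
    return [dict(zip(keys, values)) for values in zip(*cols.values())]

columns = to_columns(rows)   # different storage structure
restored = to_rows(columns)  # same values come back
```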

> I could send a simple string with flat text or I could add
> structure in the form of XML mark-up and then send it to you. 
> If we have agreed before on what this markup means then
> the added structure will add additional meaning.

??? Are you suggesting we can add meaning without changing the agreement?
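
A hedged Python sketch of what is at stake in that question: the markup only yields something to the receiver for tags covered by a prior agreement. The tag names, the sample document, and the agreement table are all hypothetical.

```python
import xml.etree.ElementTree as ET

marked_up = "<post><author>A. Writer</author><year>2004</year></post>"

# The agreement: tag names both parties have given a meaning to (hypothetical).
agreement = {"author": "person who wrote the post", "year": "year of posting"}

def read(xml_text, agreement):
    """Return only what the receiver can interpret under the given agreement."""
    root = ET.fromstring(xml_text)
    understood = {}
    for child in root:
        if child.tag in agreement:
            understood[agreement[child.tag]] = child.text
    return understood

with_agreement = read(marked_up, agreement)
without_agreement = read(marked_up, {})  # same markup, nothing understood
```

In this sketch, adding a new tag to the document changes what read() returns only after the agreement table has been extended as well.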

>>>that is all the structure we are interested in, then it is not considered
>>>semi-structured. On the other hand, if it is completely untyped and
>>>without structure, but we are also not interested (from the chosen
>>>perspective!!) in any hidden structure, then it is also not considered to
>>>be semistructured.
>>
>>Re-introducing meaning after dissmissing it at the
>>start is troublesome: How (and why) does being
>>interested suddenly come into play? How does it
>>relate to the discussed (semi-)structeredness?

> 
> You seem to have a problem with the fact that the meaning of data is not
> an objective property of said data but also depends on those that deal
> with that data. Is that correct? Why does that bother you so much?

It doesn't bother me at all :-)

Ok, ok. I'll give you something more to shoot at. Meaningless nibbles are not the stuff we want to have too much of in a database. Semantic modelling efforts specifically aim to make sure that all data in a database has an agreed-upon meaning. The semantic (aspect of the) model reflects the agreement. The storage model reflects the organisation of the bits/bytes/signs (note: many would call them data (1)) in containers.

>>>>"To completely structure the data often remains an elusive goal"
>>>>I am so out of here - where is the door!
>>>
>>>Don't you think you're overreacting a little? The only thing that is said
>>>is that it is probably not always possible to retrieve all the hidden
>>>structure we are interested in. Nowhere is it said by Serge that we
>>>shouldn't try, or even that we shouldn't try very hard. On the contrary.
>>>And in fact, right now, there is a lot of research being done on that.
>>
>>There is a multitude of entangled structures, present
>>and imposed. The goal is to unravel, unveil them in
>>order to satisfy our curiosity and maybe even do
>>something useful with our findings, building new things
>>- using existing and new structures of course -
>>along the way. "To completely structure" as a goal
>>(albeit elusive) is a sign of seriously
>>overestimating our capability to understand.

> 
> Yes. I don't see the problem here. The fact that we may not be able to
> reach the borders of our universe doesn't imply in any way that we
> shouldn't do space exploration. 

Indeed.

> It's almost as if you have a deep
> psychological need to have everything structured.

Strange. I would think that my stated dislike of this remark:

 >>>>"To completely structure the data often remains an elusive goal"

contradicts that. Let's not engage in remote psycho-analysis on the basis of postings.

I read about semiotics before ever reading anything about databases. Reading about databases I could relate much of what I read to semiotics.

There are some nice 'semiotics for techies' links:

http://home.earthlink.net/~sroof/Abraxas/sar/doubslit.htm
http://www.magarinos.com.ar/SEMIOTIC.HTM#data
http://www.medialab.chalmers.se/people/jmo/essays/unix_semiotics.html
http://www.ida.liu.se/~gorgo/erp/JSGG-OrgSem03.pdf
http://virtual.inesc.pt/dsvis03/papers/13.pdf






footnote:
(1) Using 'bits/bytes/signs' instead avoids both repeating a non-mainstream definition of data over and over, and the alternative: repeatedly alerting people to potential meaninglessness.
Wikipedia:
"Data on its own has no meaning,..." - database data *has* - maybe because it is not on its own.

Received on Sun Aug 22 2004 - 17:42:18 CEST
