Re: XML storing and management

From: <djmcmahon_at_gmail.com>
Date: Thu, 04 Oct 2007 14:58:57 -0700
Message-ID: <1191535137.948633.264530_at_19g2000hsx.googlegroups.com>


On Sep 26, 5:16 am, karsoods53 <karsood..._at_gmail.com> wrote:
> I get XML feeds as input and have to store this data on our server. I
> have worked with databases but new to XML. Can someone tell me how I
> can store and manage this data?

Depends on what you need to do with it. If all you want to do is store messages from the feed, you could just shove the XML into a LOB column along with other columns for the date/time received, etc. No different than if you were receiving, say, jpeg images.

A step up from that would be to extract relevant single-valued scalars as part of intake processing in your host programming language (e.g. Java) and stick them in querieable, indexable columns. For example if every XML message had exactly one <title> element in it, you could extract that as you were receiving a message from the feeds, then put it into a TITLE column as well as putting the message into a LOB column. Again no different than, say, extracting the horizontal and vertical resolution from a jpeg image and storing them in columns along with the jpeg itself.

If you need to be able to run queries over the contents of the XML, and especially if you want the query result sets to yield fragments of the XML messages instead of whole ones, you have a lot more work to do.

Databases such as Oracle's allow you to create XML typed columns which still store the XML in a LOB but let you dig out the content in queries, index it, etc. If you have XML schemas for your feeds, you could even go so far as to specify decompositions of the XML into relational tables. These databases also let you query by XML expressions, extract subsets of the XML, etc.

 (Trouble arises when you need to do this with multi-valued elements, or elements embedded within the hierarchy where multi-valued ancestors might exist.)

FWIW I prefer to do as little as possible with the XML while it's in the database, it's easier to think of it as an opaque object similar to e.g. a jpeg image. (Even a jpeg file has an internal structure, containing information that might be useful in queries, e.g. the resolution, bit depth, whatever.) It's usually not too hard, especially with XML, to extract whatever you think you'll need to support searches and such, and put that into real columns as part of the intake processing in a host language. You can then do any XML/ XQuery style filtering of the internal content of the XML objects as part of host-language post-processing, after using SQL in the usual way to find objects of possible interest. This might or might not fit your application.

If you absolutely must be able to drive queries from internal multivalued  elements, path expressions, etc., then you'll need help from the database in the form of XML extensions such as Oracle's, or if none are available, you'll have to do the relevant decomposition of the XML yourself in the host language and then populate your own "index tables". For example, if your input XML contains <address> elements, and possibly more than one such per message, and you think it's important to index the <zipcode> element underneath <address> to support searches for messages based on zip code, you'll need to have one table with {MSG_ID, MSG_XML_CONTENT} and another table with {ADDRESS_ZIP_CODE, MSG_ID} that you will have to populate as part of intake processing. Received on Thu Oct 04 2007 - 23:58:57 CEST

Original text of this message