Challenge, sharing, optimization and little green men

From: Florent Ramiere <framiere_at_yahoo.fr>
Date: 19 Aug 2002 20:01:48 -0700
Message-ID: <47684b13.0208191901.4c19892e_at_posting.google.com>


Hi,  

    Here is an idea about sharing, storing, and organizing a type of data.

    I wish we could review it and optimize it for huge volume. Even by breaking everything and using a totaly different approach ! :)

    Btw it is not straightforward, (this is probably its most serious problem) so take your time to read the description and to understand what was not explicitly described or explained.

    Comments, even not constructives are more than welcome ! :)

    You can see the database model at
http://www.siteavie.com/dbpattern/pattern.jpg     

Florent Ramiere


This pattern is about sharing, storing, and organizing a type of data.

This pattern is using 5 tables:
 one to store the data (DATA)
 one to link a data to one or more categories (ie DATAFOLDERDATA)  one to store recursive categories of data (ie DATAFOLDER)  one to the store the shared status of a category (ie SHAREDDATAFOLDER)
 one to cache shared data (ie CACHEDSHAREDDATA)

DATA



A data is linked directly to one or more category. Most of the time the data is linked to only one category. So we have a field in the data table that is used to link data to the category.
(DATA.DATAFOLDER_ID=DATAFOLDER.DATAFOLDER_ID)
We save also in the data table the number of categories this data is linked to.
So if this value is superior to 1, then we make the join betwee datadatafolder and datadatafolder.

CATEGORY


A category (a.k.a. folder) is recursive, it can contain one or more categories.
We save for each folder these informations:

    NBELEMENTS: the number of elements linked to this category     FOLDER_ID: the parent folder id
    DEPTH: the actual depth of this category in the whole hierarchy     NBSUBFOLDER: the number of sub categories     MAXSUBDEPTH: maximum sub-depth starting from this category     

This datas are stored in order to ease the client side: no data browsing is needed as every information about the hierarchy is already computed.

SHARING



Each table of this pattern is linked to a user that owns the data. Its folders, and its associated datas can be shared to other users.

Most of the time the users will not share their datas, for lack of GUI ;) or simply for lack of usage.
So we need to lower at the maximum the overhead of sharing management.

A folder contains the following field:

    ISSHARED
    NBSHAREELEMENT      isshared is used to quickly know the status of the folder, shared or not.
If it is, we make joins. If not: we do not. Nb share elements will save the overhead of looking through the sharedatafolder table for the ... number of elements.

When a folder is shared, we fill the SHAREDDATAFOLDER table in order to link a user to this folder.
At the same time we do cache the elements of this folder in the CACHEDSHAREDDATA table.
Linking a data to the user requesting the share.

When a folder is shared, we update its ISSHARED field. We update the ISSHARED field of the shared datas too.

SHARING RIGHTS



we save the sharing rights in of the folders in the DATADATAFOLDER table,
here are the fields:

    CANADD
    CANDELETE
    CANUPDATE
    CANADDELEMENT
    CANUPDATEELEMENT
    CANDELETEELEMENT      SHARING NOTIFICATION



When a user wants to share a folder, it has the possibility to specify which event he wants to be notified. the notifications can be triggered on the following events:
    add             : a sub folder is added
    delete          : a sub folder is deleted
    update          : this folder is updated
    add element     : a data has been added to this folder
    delete element  : a data from this folder has been deleted
    update element  : a data from this folder has been updated
    connect         : a user has now attached to this folder
    disconnect      : a user is no more attached to this folder

For the data in itself, it is simpler, the events are:

    udpate : this data has been updated     delete : this data has been deleted

The triggered events are stored in a specific table to ease its retreival.
These datas should have a TTL in order to avoid incrise of its size.
(tables are SHAREEVENT and SHAREEVENTTYPE)

GARBAGE COLLECTION



Garbage collection is needed when a data is shared: Example:

    PETER is sharing is files
    three friends share PETER's files
    PETER unsubscribes the service
    PETER's friens cannot access anymore the files.     

Same thing for the folders.

So we need to do some "reference counting", when a data is linked we increment its NBREFERENCE field. When a data is unlinked we decrement its NBREFERENCE field.
When this fiels is equal to 0. Then nobody is linked to it. And we can destroy it.

We cannot cascade delete the folders, as DATA may be linked individually.

STATISTICS



We save for each table (if it has some meanings...)
    NBACCESS        : number of access of this information 
    NBSHAREDACCESS  : number of shared access of this information 
    NBUPDATE        : number of update of this information
    LASTACCESS      : the last access of the owner on this data
    LASTSHAREACCESS : the last access of non owner on this data
    CREATIONDATE    : the creation date
    DELETIONDATE    : the deletion date (if flagged as deleted)
    

These datas are not meant to be reliable, and real time acurate, most of the time, worker thread will store them, and dump all theses updates during
the night as batch updates.
The only real overhead I do see is the size of the database, but these datas
are very usefull for the production/maintenance teams.

USER SUPPORT



Each table has a ISDELETED flag.
When a user deletes something, we set ISDELETED to true, and we forget about this element (and its sub elements if any)

IE we NEVER delete anything from our database when the user requests it. We just flag the datas as deleted.

This is important for the customer support, if a user unsuscribed or deleted important data mistakingly, the support team will be able to recover its datas.

Of course, a Daemon would remove every data deleted marked elements periodically after a period of time. Received on Tue Aug 20 2002 - 05:01:48 CEST

Original text of this message