Kenneth Downs

Syndicate content
All things related to database applications, both desktop and web.KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.comBlogger44125
Updated: 52 min ago

The Argument For Normalization

Sun, 2008-10-12 06:47

This week we will review the practical arguments in favor of normalization. The major concern as always on this blog is to examine database decisions in light of how they affect the overall application. The major argument for normalization is very simple: you end up coding less, coding easier, and coding stronger, and you end up with fewer data errors.

Informal Description of Normalization

When I find a programmer who is stuck trying to grasp the concepts of normalization, the problem usually comes down to not being able to see the big picture. The programmer may read any number of rigorous papers on the subject (the Wikipedia article is a good place to start) but still be unable to get the basic point. This leaves the programmer stumbling through table design, second-guessing himself, and then running through a frustrating sequence of redesigns. In the worst case it leads him to conclude normalization may not be worth the effort, at which point he starts writing really crappy applications.

The goal of normalization in simple terms is just this: to store each fact in exactly one place. When you put each fact in only one place, you always know where to go to read it or write it. When facts are repeated in the database, the application programmer has an increased burden to make sure they are all consistent. If he fails to shoulder this burden completely, the database will have inconsistent values for the same facts, leading to emergency phone calls and emails requesting help. These request for help always come at 4:30pm as you are getting ready for a date or an extended vacation.

Once the programmer understands this very simple goal, he usually has an "aha!" moment reading the various articles on normalization. Each rule for normalizing suddenly makes more sense, as it appears as just another way to make sure that there is a place for everything and everything is in its place.

The Programmer's Burden

When a programmer is dealing with a non-normalized database, he is going to run into 4 problems over and over again. Three of these are called "anomalies" and the fourth is "inconsistency" (a fancy way of saying the database holds bad data). Most authors who write on normalization take it for granted that the reader can readily see why the anomalies are bad, but I would like to spell it out here to make it crystal clear. First we will look at the three kinds of anomalies, then we will go to the inconsistency problem, and see exactly how they affect the programmer.

Consider a fledgling programmer who has ready too many Web 2.0 blogs saying that relational databases are bad, and so has not bothered to learn anything about them. In the name of "simplicity", he creates a single table that lists employees, their email addresses, the customers they are assigned to, and the primary email address of each customer. This will lead to three kinds of anomaly, each of which leads to inconsistency.

An Update Anomaly occurs when a fact is stored in multiple locations and a user is able to change one without changing them all. If a user goes to this employee-customer table and changes an employee's email on only one row, and no provision is made to change the others, then the database now has inconsistent values for the employee's email.

  • An Insert Anomaly occurs when it is not actually possible to record a fact. If an employee is hired but not yet assigned to any customers, it is not possible to store his email address!

  • A Delete Anomaly occurs when the user deletes one fact and clobbers some other fact along the way. If an employee goes on leave, so that we must remove (delete) their assignments, then we have lost their email address!

    This non-normalized database requires the programmer to write additional application code to try to intercept and correct these issues. This is the Programmer's Burden in a non-normalized situation, and it gets worse and worse as the program expands.

    The Programmer's Burden also emerges as a continuing stream of complaints from users that "the program is wrong." For every case where the programmer fails to provide exception-handling code, a user will stumble across inconsistent data. The customer says, "it says 'X' on this screen but it says 'Y' on that screen, what's going on?" As far as they are concerned it is a bug (which of course it is) and must be fixed. You can't make money coding new features when you are fixing garbage like that.

    The Basic Argument

    So the basic argument for normalization is: we wish to avoid the Programmer's Burden as completely as possible. We want to spend our time on cool features, not going back over and over to fix features we thought were finished already.

    Special Comment on First Normal Form

    First normal form is different from the others. When a database designer violates the higher normal forms, the result is that a fact is recorded in more than one place. However, when you violate first normal form it results in more than one fact in the same place.

    A basic example would be the same table of employees and customers, where we "solve" the problems listed above by storing only one row for each employee, with a comma-separated list of accounts, like so:

    EMPLOYEE   EMAIL                     CUSTOMERS
    ------------------------------------------------------------
    ARANDOLPH  art@praxis.com            100, 523, 638, 724
    SRUSSELL   sax@overlook.edu          516, 123, 158
    PBOYLE     phyllis@sp-elevataor.com  713, 928, 212
    

    The above scheme increases the Programmer's Burden because now he must decompose the data that comes from the server. In technical terms we say that the value CUSTOMERS is non-atomic, it is not a single fact. Every piece of code that touches that table must break down the list of customers and sometimes reassemble it.

    To see this, consider the basic task of adding a customer for employee Art Randolph. If the tables were set up properly, you would insert into a cross-reference of employees and customers, and duplicates would be trapped by a primary key. But here you must retrieve the list of existing customers, split it up in application code, and check that the value is not repeated. Then you have collapse the list back down and send it up to the server.

    All I can say is, no thanks.

    By The Way, What Is The Right Way?

    Now that we have beat up our fledgeling programmer's lousy employee-customer table, it would be worthwhile to spell out how to do it correctly.

    First off, we always need one table for each kind of thing we are keeping track of. That means we will have a table of employees and a table customers. This solves all of the anomalies and inconsistencies listed above because we put facts about employees in the employees table (like email address) and facts about customers in the customers table.

    This leaves the issue of linking employees to customers. There are three ways to do it:

    1. If each customer gets a team of employees assigned to them, but an employee only ever works for one customer, then put a Foreign Key on the employees table that links to the customers table.
    2. If each employee works on more than one customer, but each customer gets only one employee, then put a foreign key on the customers table that links back to employees.
    3. If an employee can work for more than one customer and vice-versa, make a Cross-reference between customers and employees.
    Conclusion

    This week we have seen a fairly simple argument for normalization, and one that regular readers of this blog have seen before: normalization eliminates unnecessary coding burdens. It is hard enough to get software projects done on time and on budget without imposing additional labor that could be avoided entirely by normalizing.

    I do not mean to imply that normalizing takes no time or is instantly easier than a fear-based retreat into coding your way out of things. It does take time to learn to normalize and it does take time to learn to code an application around normalized tables. In my own experience I passed through the various erroneous mindsets that I make fun of in this blog, and each time I put effort into learning the "right way" then every effort I made after that was forever easier, had fewer bugs, and made my customers more happy. So I am not saying it is free, but I am saying it is one of the best bargains in town.

  • Categories: Development

    The Quest for the Absolute

    Sun, 2008-09-28 12:27

    Today I am taking a huge detour from technical matters to lay out the philosophical groundwork behind this blog. The ideas presented today lie beneath every essay on this site. It is easy to observe that people seem driven to formulate absolute truths to guide their pursuits. Programming is no different, programmers are driven to find the absolutes that will universally guide their efforts. Those absolutes are not that hard to find, if you know the method for seeking them out. Fortunately, we have hundreds and thousands of years of human efforts, both successes and failures, to draw upon when embarking upon the task.

    Absolutes in the Post-Modern Age

    Academics refer to our current stage of history as the "Post-Modern" age. Thinking in the post-modern age is dominated by a deep mistrust of the very concept of absolute truth. Many thinkers have noted that in the post-modern age the only absolute is that there are no absolutes. Now, anybody who has not bothered to read much past what they are handed likely believes much of this without even thinking about it, they may not know that in the history of the human race such thinking is less than 60 years old.

    But that "no absolutes" stuff is all nonsense at best and downright cowardice at worst. If you want an example of an absolute truth, try stepping off the edge of a cliff: even if you do not believe in gravity, gravity believes in you. It is an absolute truth for me that if I do not take care of my customers my life becomes unpleasant. It is a further absolute truth for me that I constantly obvserve programmers proclaiming absolutes (always use relational, always use OO, etc). When I stop observing it, then I suppose it won't be an absolute anymore (and I suppose then it never was?)

    So let us now cheerfully ignore the wailing of those who cry that there are no absolutes, and ask if we might discover some elements of software development strategy that hold true always (ok, maybe mostly always) for the context of database application development.

    Aristotle and Virtue

    Nowadays nobody has to read philosophy much anymore, at least not where I live (in the United States), so most programmers have never heard of a man named Aristotle, who lived about 2500 years ago. This is a shame, because Aristotle had a logical way of thinking about things that would warm the heart of any programmer.

    One of Aristotle's major contributions to civilization was his formulation of what philosophers call "virtue". Philosophers use the term in a technical sense, and they do not use "virtuous" to mean "nice" or "pleasant" or "good-natured." To a philosopher (or at least those that taught me) something is virtuous in Aristotelean terms if if performs its function well. The standard classroom example is that a virtuous table serves the function of a table, and a virtuous table maker is somebody who makes good tables.

    This is a very useful concept for programmers. If we want to speak of a "virtuous" program, we mean simply one that meets its goals. This takes the whole high-minded theory and philosophy stuff back to real down-to-earth terms. (This is why I always preferred Aristotle to Plato).

    In the quest for the absolute, if we let the ancient philosophers guide us, we discover the surprisingly basic idea that our programs should perform their functions well if they are to be called virtuous. This is easy to swallow, easy to understand, and easy to flesh out.

    What is a Virtuous Computer Program?

    A virtuous computer program is one that serves its purpose well, and so we need to flesh out the three purposes that are common to most programs:

    • To meet some institutional or strategic goal of those who sign the checks (or accept the work as charity in some case).
    • To meet the goals of end-users, which almost always comes down to performance and ease-of-use.
    • To provide income for the developers (or meet their own goal of providing charity work for non-profits).

    Notice what is not on the list, things like ensure all data resides in a relational database, or implement all code in strictly object-oriented languages. We are not nearly ready to consider such specific strategies as those, they are completely out of place here in a discussion of the unifying goals of all projects.

    So let's review. So far we know that the absolutes of programming are the pursuit of virtue, which turns out to be a fancy way of saying that the program should perform its functions well, which turns out to mean simply that it should do what the check-signer asked for, in a way that is workable for the end-users, and at a price that keeps the programmer fed.

    This leads us towards strategies for reaching those goals.

    The Virtuous Programming Strategy

    Continuing with the idea that a virtuous program meets is basic goals, we can say that a virtuous strategy smooths the way for a programmer to meet the basic goals. An unvirtuous (or just plain bad) strategy litters the path with obstructions or ends up not meeting the goals of the check-signer, end-users, programmer, or all of the above.

    Before we can begin to formulate a strategy, we must look next at the reality of the programming world. Some of the fundamental realities include (but are not limited to):

    • The end-user or check-signer may not fully understand or be able to articulate their requirements.
    • The programmer may not correctly understand requirements, even when correctly articulated.
    • In a healthy prosperous situation there will be new requirements that interact with established requirements in ways that range from no interaction at all to fiendish incompatibilities.
    • The world will change around you, creating demands that did not exist when the system was created (some of us can still remember when there was no internet).
    • Staff will come and go.
    • ...and so on.

    So even before we begin formulating particular strategies for particular situations, we recognize that our strategy had underlying goals it must facilitate, such as:

    • Being easy to change, both for correcting mistakes and adding features.
    • Being able to maintain and sort out possibly contradictory requirements that arise as the years go by.
    • Requiring little or no "deep magic" that depends on arcance knowledge of employees who may depart.
    • Being able to expect the unexpected (like the explosion of the web etc.)

    Only after we have worked through to this point can we begin to evaluate specific strategies and technologies. We can now begin to ask about the proper context of the database server, where to use object orientation, and if javascript is a good programming language. Anything that responds to our core goals and realities can be considered for use, anything which does not play into the core goals is useless at best and obstructive at worst.

    Future essays (and some past essays) in this series will refer back to these ideas. For example, many developers have observed over the years that if you Minimize Code and Maximize Data then you gain many advantages in terms of development time, robustness, and feature count. Other ideas similar to this will come out over and over in future essays in this series. Conclusion

    The strategies and techniques that you will see on this blog are all aimed at one way or another towards the goals expressed in this essay. At the very beginning comes the goals of the check-signer, the end-users, and the programmer. From there we seek strategies that will satisfy our need to grow, change, correct, and adapt. Only then can we ask about the technologies such as databases and object-oriented languages and see how well they let us meet all of these goals.

    Categories: Development

    Comprehensive Table of Contents

    Sun, 2008-09-21 12:56
    table.toc { border-spacing: 0; border-collapse: collapse; width: 100%; height: 500px; } table.toc td { border: 1px solid #C0C0C0; padding: 3px; width: 33%; } table.toc td h3 { padding-top: 5px; margin: 0px; padding-bottom: 0px; } table.toc td h4 { padding-top: 3px; padding-bottom: 3px; margin: 0px; }

    The list below explains all topics past and future, including slots for essays I plan to revise or add to. This posting replaces our Original Table of Contents, as well as our Master Table Design Patterns List and the List of Future Topics.

    As of this writing, September 2008, this blog is now about 10 months old. I have been able to produce at least one entry on most of the major topics that I hoped to cover, and can now see opportunities to flesh out and expand many topics. On top of that, I have been receiving requests for topics both via email and comments. The table below is a guide to what topics are forthcoming, which have been treated in the past, and which you will likely see in the future.

    Topic Essay Philosophy and Dogma   The Quest For The Absolute The Quest for the Absolute   Technology Trumps Dogma ...future topic...   The Customer is Always Right ...future topic...   I am But a Humble Filing Clerk ...future topic...   Pervasive Security ...future topic...   The Relational School of Thought ...future topic...   The Object-Oriented School of Thought ...future topic... Practical Concepts   Minimize Code, Maximize Data Minimize Code, Maximize Data   Normalization Minimizes Code The Argument For Normalization     First Normal Form First Normal Form     Second Normal Form Second Normal Form     Third Normal Form Third Normal Form and Calculated Values   Denormalization Minimizes Code Denormalization Patterns   Constraints Minimize Code ...future topic...   Data Dictionaries Minimize Code Using a Data Dictionary Database Programming Skills   SQL Commands ...future topic...     SELECT Introduction to Queries       JOINs Part 1 The JOIN is the Cornerstone of Powerful Queries       JOINs Part 2 The Many Forms of JOIN       Group By, Having, Sum, Avg and Count Group By, Having, Sum, Avg, Count       UNION How SQL UNION Affects Table Design     INSERT ...future topic...     UPDATE ...future topic...     DELETE ...future topic...   Performance ...future topic...     Huge Inserts Huge Inserts     Indexes, Denormalization and Performance Pay Me Now or Pay Me Later   Server Side Abilities ...future topic...     Triggers Triggers, Encapsulation and Composition     Stored Procedures ...future topic...     Views ...future topic...     Indexes ...future topic...   ACID Compliance ...future topic...   Transactions ...future topic...   Server-Side Security Introducing Database Security     Table Security ...future topic...     Row-level Security ...future topic...     Column-level Security ...future topic... Table Design   Customer Interviews ...future topic...   Iterative Development Table Structure Changes     Changing Requirements Bad Requirements and Iterative Development   Pattern Recognition ...future topic...   The Power of Primary Keys and Foreign Keys ...future topic...     Primary Keys Primary Keys     Foreign Keys Foreign Keys   The How and Why of Constraints Of Tables and Constraints   Table Design Specifics ...future topic...     Basic Table Types A Sane Approach To Primary Keys       Impermanent Primary Keys Impermanent Primary Keys       DELETE CASCADE and DELETE RESTRICT Different Foreign Keys for Different Primary Keys     Transaction Tables ...future topic...       Limited Transaction Pattern Limited Transaction Pattern     Cross References ...future topic...       Cross-Reference Validation Cross Reference Validation Pattern     Specialized Patterns ...future topic...       Resolutions Resolutions       History Tables History Tables       Sequencing Dependencies Sequencing Dependencies       Secure Password Resets Secure Password Resets     Anti-Patterns ...future topic...       Reverse Foreign Key The Reverse Foreign Key The Larger Application   Review of Various Web Architectures ...future topic...     ORM Why I Do Not Use ORM   Connecting To The Database ...future topic...   Putting Business Logic in The Server ...future topic...   A Lean Web Layer The Web Layer   Dealing With the Browser The Wonderful Awful Browser     Javascript As a Foreign Language Javascript As a Foreign Language The Tools of The Trade   Framework Requirements The Framework And The Database   Build Tool ...future topic...   Deploy Tool ...future topic... Older Essays   Original Introduction Introduction
    Categories: Development

    Advanced Table Design: Secure Password Resets

    Sun, 2008-09-07 12:58

    Most web-based database applications make use of email to allow users to change their passwords. Completing securing this operation can be tricky business, and one of the best ways to do it is to user database server abilities.

    Disclaimer 1: Only As Secure as Email

    We tend to take it for granted today that password reset systems work through email. We reason that if a user can access an email sent by us then they are who they say they are. Obviously this will not be true if a user's email account has been compromised.

    Dealing with the possibility of compromised email accounts is outside the scope of this week's essay. There are other strategies available to reduce that risk, but they will be treated in some future essay.

    Disclaimer 2: Only SSL (HTTPS) of Course!

    It is not much use giving yourself a super-secure email system if you transmit sensitive information over unencrypted connections. Secure Socket Layers (SSL) should always be used when high security is required. For the end-user this means they are going to a site through HTTPS instead of HTTP.

    Password Resets vs. Sending Passwords

    On some low-security systems it is acceptable to send a user his password in an email. This approach is very ill-advised in higher security contexts because we have no control over the user's storage of that email. It could end up anywhere, and anybody might read it.

    When security requirements are higher, it is better to force the user to reset their password. There are several reasons for this, but the important one here is that we do not want to send the actual password in an email. Therefore we must send a link that sends them to a page where they can provide a new password.

    The Requirements

    If we spell out the requirements for a secure password reset system, they are at the very least these:

    1. We must generate some hash and send it to the user, this is how she will identify herself so we can let her change her password.
    2. The hash must expire at some point, since we cannot be sure the user will completely purge out the email (or that he even can, depending on the policy of the email host).
    3. It must be completely impossible for anybody to read the hash, otherwise they could intercept the reset process and set a password for themselves.
    4. Despite requirement 3 just listed, we must somehow verify the hash when the user presents it.
    5. We must be able to change the user's password, which is a priveleged operation, even though the user is not even logged in.

    It is not actually possible to implement these requirements in application code alone (or perhaps I should say is not possible to do it and meet minimum acceptable risk). There are two problems if you try it:

    1. Requirements 3 and 4 cannot be reconciled. If the application is able to read the hash to verify it, then a vulnerability in the application code could lead to compromise. If we implement in application code we have the burden of ensuring practically zero vulnerabilities, while if we go server-side we have no such burden (at least for this feature).
    2. Requirement five requires the application code to connect at a very high privelege level, which could lead to completely unrelated vulnerabilities.
    Implementing In The Database

    The system I will now describe meets all 5 of the requirements listed above while never requiring a priveleged connection to the database. The feature is implemented in an isolated system that cannot touch other systems, and it has no burden to be particularly careful in writing the application code.

    Since a picture is worth a thousand words, here it is:

    The process begins at the top left. The user (Yellow circle) clicks on some "Forgot Password" link and provides an email or account id. This goes to web server which generates an INSERT to the insert-only table of hashes. This insert contains only the user's id, nothing else is needed. There is a trigger on the table that fires on the INSERT. This trigger generates the hash and sends the email to the user.

    The salient features here are that the table is insert-only, which is explained below, and that the trigger operates at super-user level, which is also explained below.

    Once the user receives the link and clicks on it, our process goes over to the right. The user lands on a page and provides a new password (and probably of course must type it in twice). The web server does basic things like making sure the two values match, that the password is long enough, and like that, and then generates an INSERT into a second table. The insert contains the email or account ID, the hash, and the desired new password.

    The magic begins on the INSERT into the second table. An INSERT trigger running at superuser level is allowed to look at the first table and verify the hash and its expiration. If these match, it sets the user's password.

    Simple, really, IMHO.

    Feature 1: Insert Only Tables

    This system depends on creating tables that any unpriveleged user can insert into, but which nobody can SELECT from or UPDATE to or DELETE from.

    This may sound like a joke: "Insert Only Table", something like "Write only memory". But the idea is very simple, if nobody can SELECT from the table then nobody can discover active hashes. If nobody can UPDATE the table then nobody can forge hashes. Finally, if nobody can DELETE from the table then nobody can cause mischief.

    The code for the tables looks like this:

    -- FIRST TABLE
    CREATE TABLE users_pwrequests
    (
      recnum_pwr integer,
      user_id character varying(40),
      md5 character(32),
      ts_ins timestamp without time zone,
    )
    -- NOTE! This syntax is PostgreSQL, there may be
    -- slight variations on other platforms.
    REVOKE ALL ON TABLE users_pwrequests FROM PUBLIC;
    GRANT INSERT ON TABLE users_pwrequests FROM PUBLIC;
    
    -- SECOND TABLE
    CREATE TABLE users_pwverifies
    (
      recnum_pwv integer,
      user_id character varying(40),
      md5 character(32),
      member_password character varying(20),
    )
    REVOKE ALL ON TABLE users_pwverifies FROM PUBLIC;
    GRANT INSERT ON TABLE users_pwverifies FROM PUBLIC;
    
    
    Feature 2: Trigger Security Priveleges

    It is possible on most servers to severely limit a user's allowed actions on a table, but then to provide trigger code that fires on those actions and executes a super-user level. Today's technique depends upon this ability. Trigger code operating at superuser level can look at the insert-only table to verify a hash, and it can also set the user's password.

    This basic ability is what makes triggers so amazing and cool for implementing business logic (see also Triggers and Encapsulation), because there is no way for a user to directly invoke a trigger for his own nefarious purposes, and there is no way for a cracker to avoid the firing of the trigger if he performs an action on a table. Triggers are truly the most powerful example of encapsulation of data and code that is available to today's programmer.

    The first trigger looks something like this ( this is PostgreSQL code, your server will likely require variations) (I have also stripped it down for brevity, it may not work exactly without modification):

    CREATE OR REPLACE FUNCTION users_pwrequests_ins_bef_r_f()
      RETURNS trigger AS
    $BODY$
    DECLARE
        NotifyList text = '';
        ErrorList text = '';
        ErrorCount int = 0;
        AnyInt int;
        AnyRow RECORD;
        AnyChar varchar;
        AnyChar2 varchar;
        AnyChar3 varchar;
        AnyChar4 varchar;
    BEGIN
        -- necessary for an old glitch in pg security
        SET search_path TO public;
    
        -- Only execute if the user's id is valid
        SELECT INTO AnyInt Count(*)
               FROM users WHERE user_id = new.user_id;
        IF AnyInt > 0 THEN 
           SELECT INTO AnyChar email
                  FROM users WHERE user_id = new.user_id;
           -- This lets you put the email itself into 
           -- a table for admin control
           SELECT INTO AnyChar2 variable_value
                  FROM variables
                 WHERE variable = 'PW_EMAILCONTENT';
           -- Also the server is stored in a table
           SELECT INTO AnyChar3 variable_value
                  FROM variables
                 WHERE variable = 'SMTP_SERVER';
                 
           -- This becomes the email FROM Address
           SELECT INTO AnyChar4 variable_value
                  FROM variables
                 WHERE variable = 'EMAIL_FROM';
           IF AnyChar4 IS NULL THEN AnyChar4 = ''; END IF;
           
           -- Very important! Set the md5 hash!
           new.md5 := md5(now()::varchar);
           
           -- Call out to a stored procedure that sends emails
           PERFORM pwmail(AnyChar
              ,'Password Reset Request'
              ,AnyChar2 || new.md5
              ,AnyChar3
              ,AnyChar4);
           EXECUTE ' ALTER ROLE ' || new.user_id || ' NOLOGIN ';
        END IF;    -- 3000 PK/UNIQUE Insert Validation
    
    END; $BODY$
      -- The "SECURITY DEFINER" is crucial, it allows 
      -- the trigger to run as the super-user who 
      -- created it
      LANGUAGE 'plpgsql' VOLATILE SECURITY DEFINER
    

    The second trigger looks like this:

    CREATE OR REPLACE FUNCTION users_pwverifies_ins_bef_r_f()
      RETURNS trigger AS
    $BODY$
    DECLARE
        NotifyList text = '';
        ErrorList text = '';
        ErrorCount int = 0;
        AnyInt int;
        AnyRow RECORD;
        AnyChar varchar;
        AnyChar2 varchar;
        AnyChar3 varchar;
        AnyChar4 varchar;
    BEGIN
        SET search_path TO public;
    
        -- Read the first table to see if the 
        -- link is valid and has not expired
        SELECT INTO AnyInt Count(*)
               FROM users_pwrequests
              WHERE user_id = new.user_id
                AND md5     = new.md5
                AND age(now(),ts_ins) < '20  min';         
        IF AnyInt = 0 THEN                                
            ErrorCount = ErrorCount + 1; 
            ErrorList  = ErrorList || 'user_id,9005,Invalid Link;';
        ELSE 
           -- Magic!  The user's password is set
            EXECUTE 'ALTER ROLE ' ||  new.user_id 
                || ' LOGIN PASSWORD ' 
                || quote_literal(new.member_password);
                
            -- Very important!  Now that we have set it,
            -- erase it so it is not saved to the table
            new.member_password := '';
        END IF;    -- 3000 PK/UNIQUE Insert Validation
    
        IF ErrorCount > 0 THEN
            RAISE EXCEPTION '%',ErrorList;
            RETURN null;
        ELSE
            RETURN new;
        END IF;
    END; $BODY$
      LANGUAGE 'plpgsql' VOLATILE SECURITY DEFINER;
    
    Feature 3: Sending Email From Database Server

    The technique present above requires that your database server be able to send emails. This is not always possible. Postgresql (www.postgresql.org) can do it, and I have to believe the other big guys can as well, but I have not tried it yet personally.

    To send emails through a PostgreSQL server, you must install Perl as an untrusted language, and then install the Perl MAIL package. If anybody wants to know more about that then please leave a comment and I will expand the essay to include that.

    Feature 4: The Empty Column

    There is one more note that should be made. To use this system, you must tell the server the user's desired new password. To do that, you must actually make it part of the INSERT command and therefore you must have a column for it in the 2nd read-only table. However, you certainly do not want to actually save it, so you have the trigger set the password first and then blank out the value, so the final row saved to the table does not actually contain anything. This is noted in the code comments on the second trigger, which is included above.

    Conclusion

    The technique presented today makes full use of database server abilities to create a password reset system that is highly resistant to forgery, interception, and evil-admin meddling. It makes use of a combination of restrictive table security, priveleged trigger code, and sending emails from the database server.

    Categories: Development

    Advanced Algorithm: Sequencing Dependencies

    Mon, 2008-08-25 06:56

    Some database applications require you to perform a series of actions where you know only that some actions must be performed before others. Before you can perform the actions, you must work out a safe sequence that takes into account all of the dependencies. This week in The Database Programmer we will see an algorithm for doing this.

    Examples

    There are many examples where a programmer must work out dependencies before doing something.

    A manufacturing package may track many steps in the manufacture of an item. Some steps cannot be performed until others are complete. A simple system would require the end-user to work out the entire process, but a better system would let the user enter only the dependencies: which processes require others to be complete. In this kind of system the computer can be used to schedule manufacturing tasks.

    All popular Linux distributions have a package installation system in which each package lists its required dependencies. If you want to install a large number of packages in one shot, producing a tangled bunch of related dependencies, today's algorithm can be used to work them all out.

    If you are using a data dictionary to build tables, every foreign key represents a dependency, where the child table requires the parent table to exist before it can be built. Today's algorithm can be used to sequence the tables and build them in order.

    Another database example is generating code to perform calculations. Some calculations will depend on previous calculations, so your code generator must be able to sequence them all so that the calculations are performed in the proper order.

    Big Words: Directed Acyclic Graph

    The examples abvoe are all cases of what mathematicians call a Directed Acyclic Graph. If you do not want to read the entire Wikipedia article, the main points are these:

    • We have a set of items. These can be anything you are keeping track of in your database.
    • Any item may be connected to zero or more other items.
    • The connection is one-way only. So if we say A requires B, we are not saying that B also requires A (in fact it is forbidden).
    • There can be no loops (cycles). If A requires B, B may not require A. Further, if A requires B, and B requires C, C may not require A.

    Whenever I can, I like to point out that it is very useful to read up on the mathematical foundations of certain programming techniques. We can often pick up very useful insights from those who think of these things at the most abstract level. It is also much easier to get advice from the more abstract-minded database people if you are at least marginally familiar with the mathematical terms.

    The Tables

    So now let us proceed to the tables and the code. The tables below show a data dictionary that will be used to generate DDL to build a database:

    Table: TABLES
    
    TABLE       | DESCRIPTION            | SEQUENCE
    ------------+------------------------+---------
    ORDERS      | Sales Orders Headers   |  ?
    ORDER_LINES | Sales order lines      |  ?
    CUSTOMERS   | Customers              |  ?
    ITEMS       | Items                  |  ?
    
    
    Table: DEPENDENCIES
    
    CHILD_TABLE  | PARENT_TABLE
    -------------+---------------
    ORDERS       | CUSTOMERS
    ORDER_LINES  | ORDERS
    ORDER_LINES  | ITEMS
    

    The problem here is knowing the safe order in which to build the tables. If I try to build ORDER_LINES before I have built ITEMS, then I cannot put a foreign key onto ORDER_LINES, because ITEMS is not there. In short, I need to know the value of the SEQUENCE column in the example above.

    The Expected Answer

    The example above is simple enough that we can work it out by hand. This is actually a good idea, because we want to get an idea of what the answer will look like:

    TABLE       | DESCRIPTION            | SEQUENCE
    ------------+------------------------+---------
    ORDERS      | Sales Orders Headers   |  1
    ORDER_LINES | Sales order lines      |  2
    CUSTOMERS   | Customers              |  0
    ITEMS       | Items                  |  0
    

    This answer should be self-explanatory, except maybe for the fact that both CUSTOMERS and ITEMS have the same value. We need to look at that before we can see the code that produces it. Is it OK that two entries have the same value, and how would our program handle that?

    The short answer is that it is perfectly OK and natural for two or more entries to have the same value. All this means is that they can be done in any order relative to each other, so long as they are done before the other entries.

    In terms of the example, where we want to build these tables in a database, it means that:

    • We would query the list of tables and sort by SEQUENCE
    • We would loop through and build each table
    • We don't care about ITEMS and CUSTOMERS having the same value, they get built in whatever which-way the server gives us the list.

    The same concept applies to the other potential examples: manufacturing, software packages, and generating calculations. So long as you follow the sequence, we don't care about items that have the same value.

    Stating the Solution in Plain English

    We are now ready to work out a program that will generate the SEQUENCE column. The basic steps the program must perform are:

    1. Initialize the column to -1. A value of -1 means "Not sequenced."
    2. Update the column to zero for all items that have no dependencies.
    3. Repeat the following action until the affected rows are zero: Update the SEQUENCE column to 1 (then 2, then 3) for all rows that have all of their dependencies sequenced already.
    4. Once the command in step 3 is no longer affecting any rows, check for any rows that have -1, these are involved in circular dependencies and we cannot proceed until the user straightens them out.
    Stating the Solution in Code

    The first step is very easy, we initialize the table with this command:

    UPDATE TABLES SET SEQUENCE = -1;
    

    The next step is also very easy, we mark with a '0' all of the tables that have no dependencies. The basic idea is to find all of the entries that have no entries in DEPENDENCIES.

    UPDATE TABLES SET SEQUENCE = 0
     WHERE NOT EXISTS (SELECT child FROM DEPENDENCIES
                        WHERE child = TABLES.TABLE)
    

    Now for the hard part. We now have to execute a loop. On each pass of the loop we are looking for all items whose dependencies have all been sequenced. We will do this over and over until the command is not affecting any rows. It is important that we cannot exit the loop by testing if all rows are sequenced, because a circular dependency will prevent this from happening and we will have an infinite loop.

    You can control this loop from client code, but I wrote mine as a Postgres stored procedure. This algorithm turns out to be surprisingly complicated. The UPDATE command below may not be all that self-explanatory. What it works out is:

    • Get a list of child tables from the DEPENDENCIES table
    • JOIN through to TABLES to look at the SEQUENCE value of their parents.
    • Group and check that the minimum value is greater than zero, if it is it means all parents are sequenced and the table can be sequenced.
    • Update the SEQUENCE value for the tables we found
    CREATE OR REPLACE FUNCTION zdd.Table_Sequencer() RETURNS void AS
    $BODY$
    DECLARE
        -- Note that rowcount is initialized to be > 0, this makes
        -- the loop work properly
        rowcount integer := 1;
        
        -- This tracks the value we are assigning to SEQUENCE.  We
        -- initialize it to 1 because we already took care of the
        -- the rows that have value 0
        lnSeq integer := 1;
    BEGIN
        while rowcount > 0 LOOP
            UPDATE tables set SEQUENCE = lnSeq
              FROM (SELECT t1.CHILD 
                      FROM DEPENDENCIES t1 
                      JOIN TABLES       t2 ON t1.PARENT = t2.TABLE
                     GROUP BY t1.CHILD
                    HAVING MIN(t2.SEQUENCE) >= 0
                    ) fins
              WHERE TABLES.TABLE = fins.CHILD
                AND TABLES.SEQUENCE = -1;
    
      lnSeq := lnSeq + 1;
      GET DIAGNOSTICS rowcount = ROW_COUNT;
     END LOOP;
     
     RETURN;
    END;
    $BODY$
    LANGUAGE plpgsql;
    

    The stored procedure above will stop executing once the UPDATE command is no longer having any effect. Once that happens, your final step is to make sure that all rows have a valid SEQUENCE value, which is to say that no entry has SEQUENCE of -1. If any of the rows have that value then you have a circular dependency. You must report those rows to the user, and you can also report the dependencies that are causing the loop.

    Conclusion

    Sequencing dependencies is a fundamental algorithm that has a lot of use cases in database applications. It is easy enough to accomplish, but the innermost UPDATE command can be a little puzzling when you first look at it. Once you have mastered this algorithm you are on the way to the "big leagues" of database applications such as ERP, MRP and others.

    Next Essay: Secure Password Resets

    Categories: Development

    Javascript As a Foreign Language

    Sun, 2008-08-03 07:58

    So you know 37 different programming languages, you've programmed moon landers, missiles and toasters, and how could Javascript be any problem? Then you start trying to code up some Javascript and find that it just does not feel right, nothing seems to flow naturally or easily. Your instincts do not seem to guide you. You are not alone, here is your cheatsheet...

    Welcome to the Database Programmer blog. If you are trying to write database applications in 2008 then you most likely bump into Javascript. My hope in this week's essay is to provide a "soft landing" into this beautiful and powerful but somewhat strange language.

    To see the other essays in this series, consult our Complete Table of Contents.

    Contents

    Today's essay is rather long. It covers extremely basic ideas but proceeds directly to very powerful techniques. I have provided a summary here so that you can skip over the material that may already be familiar to you.

    Start Off: Firefox and Firebug

    In case you have been living under a rock for the past few years, let me tell you to do your development in a real web browser, that is, Firefox, and to immediately download the Firebug extension. Firebug more or less does everything you need to debug Javascript, and it has many features you may not even know you need. Do not try to develop Javascript without Firebug.

    In particular, firebug has a "console" object that you can send messages to, such as this:

    console.log("this is so much better than alert()!");
    for(var x = 1; x<10; x++) {
        console.log("We are on "+x);
    }
    
    Execution

    Javascript executes while your page is being loaded, and can be placed anywhere on the page. While I make no claims that the example below is good or bad practice, it does illustrate how Javascript executes.

    <html>
    <head>
    <script>
    // Script is executing as it is encountered,
    // so this variable comes into existence
    // immediately
    var x = 5;  
    
    function square(x) {
        return x * x;
    }
    
    // Now that the square function is defined,
    // we can call it
    var y = square(x);
    </script>
    </head>
    <body>
    <h1 id='h1'>Here is a Javascript Example!</h2>
    
    <script>
    // Script can be embedded directly in the
    // body of your HTML (for better or worse!)
    var h1 = document.getElementById('h1');
    h1.innerHTML = 'I changed the H1 content!';
    
    // This function can be used anywhere downstream
    function changeH1(newText) {
        var h1 = document.getElementById('h1');
        h1.innerHTML = newText;
    }
    </script>
    </body>
    <div>Here is a div of text</div>
    <script>
    changeH1("Changing H1 yet again!");
    </script>
    
    Variable Scope

    Scoping in Javascript is pretty straightforward. If you assign a value to a variable outside of any function it becomes a global. If you explicitly define a variable as "window.x = 5" it becomes a global. If you put the keyword var in front of it before using it it becomes local (and can mask a global variable of the same name). You can use the "var" keyword inside of loops, and many javascript programmers use "var" everywhere. Here is an example.

    <html>
    <head>
    <script>
    // We are executing outside of a function, so
    // both of these are globals:
    var x = 5;
    y = 10;
    
    function example() {
        // Since a global named 'x' exists, and we do
        // not use the "var" keyword, we are re-assigning
        // the global variable
        x = 7;
        
        // Using the "var" keyword makes a local variable,
        // we cannot "see" the global x anymore
        var x = 2;
        alert(x);
        
        // I can still access the global variable to
        // set its value back:
        window.x = 5;
        alert(x);
        alert(window.x);    
    }
    
    </script>
    </head>
    
    Adding Methods to Core Javascript

    Javascript lacks certain functions that are very useful to have, such as trimming spaces from strings. One very cool thing about Javascript is that you can directly add these methods to the core language, by adding functions to the "prototype" object of the core classes. Here is how you add a "trim" function to core Javascript.

    String.prototype.trim = function() {
     return this.replace(/^\s+|\s+$/g,"");
    }
    x = "   abc  ";
    alert('-' + x + '-'); // the dashes let you see the spaces
    alert('-' + x.trim() + '-');  // spaces removed!
    

    When I first saw this trick I dutifully copy-n-pasted it in and it worked, but the syntax looked very perplexing, I could not figure out how to make use of it myself. My brain had not yet wrapped itself around the Javascript mentality. This leads directly to our next concept, that functions are "first class cizitens".

    Functions as First Class Citizens

    You may have heard that Javascript treats functions as "first class citizens" and wondered, "what does that mean?" The best way to explain it in terms of other languages is that you can create functions on the fly and pass them around like variables. This may be a little hard to grasp, so we will go directly to examples.

    // Most languages support this type of function definition
    function square(x) {
        return x * x;
    }
    
    // Javascript gives you a slightly different syntax if
    // you like, which can be extremely powerful
    var square = function(x) {
        return x * x;
    }
    
    // The books usually go on to an example like this, 
    // which frankly did not seem to me to have any purpose:
    y = x;
    alert( y(5) );
    

    The basic idea to get here is that you can do anything with a function that you can do with a variable. There are multiple uses for this, but we have already seen one, namely, the ability to add a method to a previously created class. This is what we did above when we added the "trim()" method to the base "String" class. This means that our approach to building class hierarchies is very different than in other Object-oriented languages like PHP, Foxpro, Delphi, VB and so forth.

    // This example shows two different ways to add methods
    // to HTML elements and make them act more object-oriented.
    
    // Method 1, make a function that makes an INPUT read-only
    // by changing its style and setting a property.  Notice
    // the code refers to "this" as if it were part of an
    // object, see below to see why that works.
    function makeReadOnly() {
        this.className = 'readOnly';
        this.readOnly = true;
    }
    
    // Now attach that function to a DOM element (an HTML INPUT)
    var input = document.getElementById('myInput');
    input.makeReadOnly = makeReadOnly;
    
    // Some other code can now tell the input to go into
    // read only mode:
    function changeModes() {
        var input = document.getElementById('myInput);
        // When this executes, the "this" variable in 
        // the function will refer to "input"
        input.makeReadOnly();
    }
    

    There is another way to do this as well, that really illustrates how to make use of Javascript's native abilities:

    // Method 2 is to defne the function while adding it
    // to the INPUT element.
    var input = document.getElementById('myInput');
    input.makeReadOnly = function() {
        this.className = 'readOnly';
        this.readOnly = true;
    }
    
    // This code works exactly as it did above
    function changeModes() {
        var input = document.getElementById('myInput);
        input.makeReadOnly();
    }
    

    Now that we have introduced this idea, it will come up all over the place in later examples.

    Objects And Classes

    When I first tried to use Javascript I kept looking for the "class" keyword, but it's not there! Believe it or not you use the "function" keyword to create what we would call a class in other languages. Here is an example of how to create and instantiate an object in Javascript:

    // Here is a simple PHP class for an 
    // object that handles a row from a database
    class dbRow {
        var tableName = '';
        var rowId = 0;
        
        function dbRow(tableName,id) {
            this.tableId = table;
            this.fetchRow(id);
        }
        
        function fetchRow(id) {
            # ...more code here
        }
    }
    
    var x = new dbRow('customers',23);
    

    In Javascript we make a function instead of a class:

    function dbRow(tableName,id) {
        // When the object is instantiated, this
        // code runs immediately
        this.tableName = tableName;
        
        // We must define a fetchRow function before
        // we can actually call it....
        this.fetchRow = function(id) {
            // some kind of ajax stuff going on here
        }
        
        // ...and now we can invoke the function
        this.fetchRow(id);
    }
    
    // When this command returns we have a new "dbRow"
    // object.  
    var x = new dbRow('customers',23);
    
    Creating An Object Without a Class

    We can say Javascript is "purely dynamic", by which we mean you can define anything on the fly, including ojects, even if you have no class definition (er, I mean no "function()" definition...). You can explicitly create an object by enclosing the definition in curly braces. Properties and their values are assigned with "name: value" syntax, separated by commas. Since you can do anything with a function that you can do with a variable, the following is a nifty way to create an object:

    var x = {
        propertyName: 'value',
        otherProperty: 'otherValue',
        
        square: function(x) {
            return x * x;
        }
        // Don't put a comma after the last property!
        // It will work in firefox but not in IE!
    }
    
    alert(x.square(5));
    

    This syntax is called "JSON" by the way, for "Javascript Object Notation". If you can get comfortable with JSON you can start to code up some really elegant Javascript.

    Accessing Object Properties and Methods

    You can hardcode references to an object's properties by using the ".property" syntax, but you can also use variables that hold the name of the property.

    // Create an object
    var x = {
        first: 'Sax',
        last: 'Russel',
        
        combine: function() {
            return this.first + ' ' + this.last;
        }
    }
    
    // You can now explicitly access properties
    alert (x.first);
    alert (x.last);
    
    // But you can also have a variable hold the name
    // of the property you want:
    var propName = 'first';
    alert (x[propName]);
    
    // Objects can be nested to any depth, and you can
    // mix hardcoded and variable names.  If we had a
    // complex data dictionary stored on the browser,
    // we might get the caption for a column like this:
    var tableId = 'customers';
    var columnId = 'total_sales';
    var caption = dd[tableId].columns[columnId].caption;
    

    This works also for functions. Assuming the same object as the above, we can invoke functions that are named by other variables:

    var x = { .... repeated from above example };
    
    var methodName = 'combine';
    alert( x[methodName]() );
    
    Iteration

    As a database programmer I write a lot of code that iterates arrays and associative arrays. Iteration tends to be very important to database programmers, as it is the most natural way to loop through rows retrieved from a database, or to loop through the values in a row. Basic iteration of an array looks like this:

    // make an array 
    var myList = [ 'sax', 'anne', 'nirgal', 'frank' ];
    for(var idx in myList) {
        // console.log() requires firebug
        console.log("Index and value: "+idx+", "+myList[idx])
    }
    

    All of the action is in the line "for(var idx in myList)", this structure will loop through the array. On each pass the variable "idx" will contain the array's index number. To actually get the value you need you have to go looking for myList[idx].

    Associate Arrays are a very natural data structure for a database programmer, as they are an easy way to represent a single row retrieved from the database. There is no explicit support for associative arrays in Javascript, but this does not matter because you can use an object and get the same results.

    // Here is an object
    var myObject = {
       first: 'Sax',
       last: 'Russel',
       occupation: 'Physicist'
    }
    // Now we loop through it like an associative array
    for(var key in myObject) {
        console.log("The array key is: " + key);
        console.log("The value is: " + myObject[key]);
    }
    
    JSON and Ajax

    Nowadays everybody is jumping into AJAX with both feet. AJAX can be particularly useful to a database programmer, because you can make AJAX calls that return objects (including code), arrays, and database data.

    I should note that the term "AJAX" itself means something very precise, being "Asynchronous Javascript and XML", while the example I am about to show contains no XML, so my use of the term is not strictly correct. Nevertheless, many people routinely use the term AJAX to mean any round-trip to the browser that fetches some fragment of information without doing an entire page refresh. While this is regrettable, I'm not going to try to buck that trend here.

    That being said, here is a nifty way to use PHP to send a data structure back on an AJAX request:

    # THE PHP CODE:
    function someAjaxHandler() {
        $table = myFrameworkGetPostRetrievalFunction('table');
        $id    = myFrameworkGetPostRetrievalFunction('id');
        $row = myFrameworkRowRetrievalFunction("customers",23);
        
        # This nifty PHP function encodes arrays and objects
        # into JSON, very cool
        echo json_encode($row);
    }
    

    This would be handled in the browser like so:

    function someAjaxResponseHandler() {
        if (this.readyState != 4) return;
        try {
            eval( 'window.requestData ='+this.responseText);
        }
        catch(e) {
            alert("The server response was not parsable JSON!");
            return;
        }
    }
    
    Synchronous AJAX and JSON: S-JSON

    It is pretty safe to say that the asynchronous nature of AJAX is a powerful part of its appeal. The request is sent and the browser remains responsive to the user until the request comes back. This is especially powerful for fetching things in the background while the user works.

    However, in database applications sometimes it is the Right Thing for the browser to stop while fetching data. If a user clicks on [SAVE] on a CRUD screen to save some changes, we actually want the browser to wait until the server tells them that it all went ok (or not). You can do this by setting a flag on your call. I have found this a very powerful approach to writing desktop-in-browser applications:

    function JSON(url) {
        // Create an object
        var browser = navigator.appName;
        if(browser == "Microsoft Internet Explorer"){
            var http = new ActiveXObject("Microsoft.XMLHTTP");
        }
        else {
            var http = new XMLHttpRequest();
        }
    
        // The trick is to pass "false" as the third parameter,
        // which says to not go asynchronously.
    
        http.open('POST' , url, false);
        http.send(null);
        
        // Execution now halts, waiting for the complete
        // return value
    
        // Once execution resumes, we can capture the
        // JSON string sent back by the server and do anything
        // we want with it
        try {
            eval( 'window.requestData ='+http.responseText);
        }
        catch(e) {
            alert("The server response was not parsable JSON!");
            return;
        }
    
        // Processing of the result occurs here...  
    }
    
    jQuery and Friends

    Nowadays we have a growing list of very powerful Javascript libraries available. Some of them are very high level and some of them are low-level.

    One library I will mention by name as being very useful is jQuery. This library provides a wealth of extremely simple and powerful abilities for finding and manipulating the HTML that is in your document. I highly recommend it.

    Closing Thoughts

    Any database programmer working in 2008 is either already required to use Javascript or may find himself facing it soon. Javascript is very flexible and powerful, but is different from the languages we are used to for writing applications, like PHP, Foxpro, Delphi, VB and others.

    Nevertheless, Javascript can do everything you need it to do, you just have to grasp the "javascript mentality" as it were. I have attempted this week in this essay to put into one place all of the facts and tricks that were not so obvious to me from reading books or simply taking a snippet of code and trying to modify it. I hope that you find it useful!

    Next Essay: Sequencing Dependencies

    Categories: Development

    Different Foreign Keys for Different Tables

    Sun, 2008-07-27 09:38

    A foreign key can be used to implement table design patterns that span multiple tables. By choosing how a foreign key handles a DELETE attempt on the parent table, you can structure your table designs to follow two standard patterns.

    Welcome to the Database Programmer blog. This series of essays is for anybody who wants to learn about databases on their own terms. There is a complete Table of Contents, as well as a summary of Table Design Patterns. There is a new essay in this spot each Monday morning.

    A Simple Example of Two Foreign Keys

    Picture a basic shopping cart, with its two basic tables of CART and CART_LINES (or ORDERS and ORDER_LINES if you are more old-fashioned). The table CUSTOMERS is also in there as a parent to CARTS. Our three tables look something like this:

       CUSTOMERS
          |
          |
         /|\
         CART  Cart is child of customers
          |
          |
         /|\
       CART_LINES  Lines is child of Cart
    

    There are two foreign keys here. CART has a foreign key to CUSTOMERS, and CART_LINES has a foreign key to CART, but the two foreign keys should behave very differently.

    Table Types and Table Design Patterns

    In A Sane Approach To Choosing Primary Keys we saw that table design begins with identifying the basic kinds of tables: Reference and Small Master Tables, Large Master Tables, Transactions, and Cross-References. Just as we picked different kinds of primary keys for the different tables, so will we pick different kinds of foreign keys between these tables.

    Deleting a Customer

    Imagine you have a customer who has made 10 orders in 2 years. A system administrator, who is allowed to basically do anything, goes into your admin screens, looks up the customer, and clicks [DELETE]. What should happen?

    The near-universal answer is that the user should be denied the action. An error should come back that says "That customer has orders, cannot delete." We want it this way because we never want to delete any parent row and "orphan" the child rows. Database programmers know from long experience that if you allow the DELETE, your queries will give incorrect results, or you will work extremely hard with lots of weird LEFT JOINS and UNIONS trying to get them to come back correctly.

    This is not an issue of "flexibility", where a more robust system would allow the deletion. This is a basic question of record-keeping. If the customer has orders on file then the customer must be kept on file. Enforcing this rule keeps code clean and simple, and trying to avoid this rule in the name of "flexibility" just makes heaps of work for everybody.

    Going further, the administrator in question, who supposedly can do anything, may not violate the rule. An administrator is simply somebody who can do anything that would not produce bad data. Administrators should not be given the ability to violate the basic structure of the data, they simply have full rights to do anything within the structure of the data.

    The DELETE RESTRICT Foreign Key

    The behavior we want here is called DELETE RESTRICT. On most database servers this is the default behavior for a foreign key. It means that you cannot delete a parent table row if there are matching rows in the child table.

    The DELETE RESTRICT pattern is almost universally used when the child table is a transaction table and the parent table is a master table or reference table.

    The syntax looks something like this:

    -- Most database servers implement DELETE RESTRICT
    -- by default, so this syntax:
    Create table CART (
        customer integer REFERENCES customers
       ,order    integer.....
    )
    
    -- ...is the same as this explicit syntax:
    Create table CART (
        customer integer REFERENCES customers
                         ON DELETE RESTRICT 
       ,order    integer.....
    )
    
    Deleting An Order and DELETE CASCADE

    Now let us say a staff member is on the phone with a customer, enters an order, enters five lines, and then the customers says "forget it" and the user needs to delete the entire order from the CART.

    In this case the user wants to go delete the order, and he expects the computer to also delete the lines. This makes perfect sense, why keep the lines if we don't want the order?

    It may seem strange that in the case of deleting a customer it makes perfect sense to stop the user, but when deleting an order it makes perfect sense to delete the lines as well.

    The difference is that an entry in the CART table is a transaction entry. When a user deletes a transaction they almost always want to automatically delete all of the relevant rows from all child tables as well. The two rules basically are:

    • The user cannot delete a master entry that has transactions.
    • Deleting a transaction means deleting the entire transaction.

    NOTE: By "transaction" here I mean financial transaction or other interaction between master elements. I do not mean a database transaction.

    The syntax for DELETE CASCADE looks something like this:

    -- if the user deletes a row from CART,
    -- do them the favor of deleting all of the
    -- lines as well
    Create table CART_LINES (
        order   integer REFERENCES CART
                        ON DELETE CASCADE
       ,order_line integer....
    )
    
    Conclusion: Different Tables Types, Different Foreign Key Types

    I have said many times in these essays that the foreign key is the only meaningful way to connect data in different tables. This week we have seen that the kind of foreign key you choose depends on what kind of tables you are connecting together. Children of master tables generally get DELETE RESTRICT, and children of transaction tables generally get DELETE CASCADE.

    Next Essay: Javascript as a Foreign Language

    Categories: Development

    History Tables

    Sun, 2008-07-20 09:11

    A history table allows you to use one table to track changes in another table. While the basic idea is simple, a naive implementation will lead to bloat and will be difficult to query. A more sophisticated approach allows easier queries and can produce not just information about single rows, but can also support aggregrate company-wide queries.

    This week in the Database Programmer Blog we return to table design patterns with an essay on history tables. The basic premise of this blog is that good coding skills do not lead magically to good database skills -- you can only make optimal use of a database by understanding it on its own terms. There is a new essay each Monday, and there is a Complete Table of Contents and a List of Table Design Patterns. What to Put Into A History Table

    Naive approaches to history tables usually involve making a complete copy of the original (or new) row when something changes in the source table. This turns out to be of little use, for reasons I will explain below. A much more useful approach is to track only a few columns and to store any combination of old values, new values, and differences. A history table designed this way can be tremendously useful.

    We will start with the example of a sales order table, called ORDERS. The columns we are interested in might look like this:

    ORDER | CUSTOMER | DATE     | LINES  |   TAX |  TOTAL |   PAID | BALANCE
    ------+----------+----------+--------+-------+--------+--------+---------
    1234  | 9876     | 5/1/08   |  48.00 |  5.00 |  53.00 |      0 |   53.00
    2345  | 9876     | 5/3/08   | 150.00 |     0 | 150.00 | 150.00 |       0
    3456  | 5544     | 6/8/08   |  25.00 |  2.60 |  27.60 |  15.00 |   12.60
    4567  | 3377     | 7/3/08   | 125.00 |  7.00 | 132.00 |  50.00 |   82.00
    

    We first have to ask which columns must be copied into history so that we can link the history table back to the ORDERS table. The only column we need for tracking is ORDER (the order number), so the history table will always have an ORDER column.

    We should also assume that the history table will contain at least a timestamp and a column to track the user who made the change, which brings us to a minimum of three columns.

    Finally, it tends to be very useful to track what action caused the history entry, be it an INSERT, UPDATE, or DELETE. This brings us up to four minimum columns.

    Next we ask which columns we will definitely not need. There are two groups of columns we will not need, which are 1) the columns that never change and 2) the columns we do not care about. Columns that do not change are likely to be the CUSTOMER and the DATE column. There is no need to bloat the history table with these valus because we can just get them out of the ORDERS table. The second group, columns we do not care about, are are usually things like ship-to address, maybe an email, and other information. Naturally there is no hard-and-fast rule here, it depends entirely upon the needs of the application.

    So now we know what we definitely need and what we definitely do not need, and we are ready to begin work considering the columns that will change. Not surprisingly, these are usually all about the numbers. Next we will see how to track the numbers.

    Tracking Changes to Numbers

    While it is certainly useful to store one or both of the old and new values for a number, it far more useful to store the change in the value, or the delta. Having this number in the history table makes for some really nifty abilities. If you store all three of the old, new, and delta, then you can more or less find out anything about the ORDER's history with very simple queries.

    So we are now ready to consider what the history table might look like. We will take the case of an order that was entered by user 'sax', updated twice by two other users, and in the end it was deleted by user 'anne'. Our first stab at the history table might look like this:

    ORDER | USER_ID  | ACTION | DATE    | LINES_OLD | LINES_NEW | LINES_DELTA 
    ------+----------+--------+---------+-----------+-----------+-------------
    1234  | sax      | UPD    | 5/1/08  |      0.00 |     48.00 |       48.00
    1234  | arkady   | UPD    | 5/7/08  |     48.00 |     58.00 |       10.00
    1234  | ralph    | UPD    | 6/1/08  |     58.00 |     25.00 |      -33.00
    1234  | anne     | DEL    | 6/4/08  |     25.00 |      0.00 |      -25.00
    

    I should note that if you keep LINES_OLD and LINES_NEW, then strictly speaking you do not need the LINES_DELTA columns. Whether or not you put it in depends on your approach to table design. If you framework allows you to guarantee that it will be correct, then your queries will be that much simpler with the LINES_DELTA column present.

    You may wonder why there is no entry for the original INSERT. This is because you must enter an order before you can enter the lines, so the original value will always be zero. Only when lines start going in does the ORDER get any numbers. This is true for header tables, but it would not be true for detail tables like ORDER_LINES_HISTORY.

    Some of the Obvious Queries

    There are few obvious queries that we can pull from the history table right away. These include the following:

    -- Find the value of of the line items of an
    -- order as of June 1st
    SELECT LINES_NEW 
      FROM ORDERS_HISTORY
     WHERE ORDER = 1234
       AND DATE <= '2008-06-01'
     ORDER BY DATE DESC LIMIT 1;
     
    -- Find the original value of the line items,
    -- and the user who entered it.  
    SELECT LINES_NEW, USER_ID
      FROM ORDERS_HISTORY
     WHERE ORDER = 1234
     ORDER BY date LIMIT 1;
       
    -- Find the users who have worked on an order
    SELECT DISTINCT USER_ID
      FROM ORDERS_HISTORY
     WHERE ORDER = 1234;
    

    Most of queries should be pretty obvious, and there are plenty more that will suggest themselves once you start working with the history tables.

    Queries Involving the Delta

    The real power of the DELTA column comes into play when you are trying to compute back-dated values such as the company's total open balance on June 1, 2008. If you have a naive history table that stores only the old value or only the new value, this is truly a tortuous query to write, but if you have both then it is really quite easy.

    -- Query to calculate the total open balance of all
    -- orders as of a given date
    SELECT SUM(BALANCE_DELTA) 
      FROM ORDERS_HISTORY
     WHERE DATE <= '2008-06-01';
    

    This magical little query works because paid orders will "wash out" of the total. Consider an order that is entered on May 20 for $200.00, and is then paid on May 23rd. It will have +200 entry in the BALANCE_DELTA column, and then it will have a -200.00 entry 3 days later. It will contribute the grand sum of zero to the total.

    But an order entered on May 25th that has not been paid by June 1st will have only a +200 entry in the BALANCE_DELTA column, so it will contribute the correct amount of $200.00 to the balance as of June 1st.

    If the company owner wants a report of his total open balances on each of the past 30 days, you can retrieve two queries and build his report on the client:

    -- Get begin balance at the beginning of the period
    SELECT SUM(BALANCE_DELTA) as BEGIN_BALANCE
      FROM ORDERS_HISTORY
     WHERE DATE < '2008-06-01';
    
    -- Get the total changes for each day.  When you
    -- build the report on the client, add each day's
    -- change amount to the prior day's balance
    SELECT SUM(BALANCE_DELTA) AS BALANCE_DELTA
      FROM ORDERS_HISTORY
     WHERE DATE BETWEEN '2008-06-01' AND '2008-06-30'
     GROUP BY DATE;
    
    Keeping History Tables Clean

    A clean history table is one that contains no unnecessary information. You normally do not want entries going into the history table if nothing relevant changed. So your history table mechanism should examine the columns it is tracking, and only make an entry to the history table if one of the columns of interest actually changed.

    Problems With The Naive History Table

    A very basic history table will usually copy the entire original row from the source table into the history table whenever an INSERT, UPDATE or DELETE occurs. One simple problem is that you end up with bloated history tables. Because they are cluttered with unnecessary repititions, they are difficult to work with by inspection.

    A much more serious technical problem with the naive approach is that it is horribly difficult to produce the queries demonstrated above. You must reproduce the concept of a delta by either running through all of the rows on the client, or you must make a difficult (and often impossible) JOIN of the history table to itself in which you connect each row to the row that came just before it. All I can say is, no thanks, I'll go with the delta.

    History Table Security

    History tables always involve some concept of auditing, that is, keeping track of user actions. This means we need to protect against deliberate falsification of the history tables, which leads to two rules. First, a user must have no ability to directly DELETE rows from the history table, or they could erase the record of changes. Second, the user must have no ability to directly INSERT or UPDATE existing rows, because if they could they can falsify the history. These rules apply to both regular users and system administrators, the administrator must have no privelege to subvert or manipulate the history.

    Since history tables have a tendency to become seriously bloated, there must be some priveleged group that can DELETE from the history tables, which they would do as a periodic purge operation. This group should have no ability to UPDATE the tables, because such priveleges would open a potential hole for subverting the history. Regular system administrators should not be in this group, this should be a special group whose only purpose is to DELETE out of the history tables.

    If you are making use of DELTA columns, then stricly speaking you do not want to purge, but compress history tables. If you want to purge out all entries in 2005, you must replace them with a single entry that contains a SUM of the DELTA columns for all of 2005.

    So to sum up, we have the following security rules for a history table:

    • No system user should be able to DELETE from the history table.
    • No system user should be able to UPDATE the history table.
    • No system user should be able to directly control the INSERT into the history table.
    • A special group must be defined whose only ability is to DELETE from the history table, so that the tables can be purged (or compressed) from time to time.
    Implementation

    As always, you have your choice of implementing the history mechanism in the client code or in the database itself.

    The best performing and most secure method is to implement history tables with triggers on the source table. This is the best way to implement both security and the actual business rules in one encapsulated object (the table). However, if you have no current practices for coding server-side routines, or you do not have a data dictionary system that will generate the code for you, then it may not be practical to go server-side for a single feature.

    Implementing history tables in code has the usual benefit of keeping you in the language and habits you are most familiar with, but it means that you cannot allow access to your database except through your application. I cannot of course make a general rule here, this decision is best made by the design team based on the situation at hand and anticipated future needs.

    Conclusion

    History tables have many uses. Beyond the obvious first use of finding indidivual values at some point in the past, well crafted tables can produce company-wide aggregations like total open balances on a given day, changes in booked orders on a day or in a range of days, and many other queries along those lines. Security is very important to prevent history tables from being subverted.

    NEXT ESSAY: Different Foreign Keys For Different Tables

    Categories: Development

    The Wonderful Awful Browser

    Mon, 2008-07-14 01:31

    When a desktop programmer tries to write database applications for the browser, he faces a great many challenges, both technical and cultural. Both sets of challenges appear because the browser and the web were invented for purposes different than our own. On the technical side we must reinvent huge amounts of functionality that we got "for free" with the old desktop systems of Foxpro, Delphi, VB Classic and so on, and on the cultural side we must wade through mountains of irrelevant or downright damaging advice that is aimed at people working on the next version of Facebook or eBay. In this essay we look at as many of these challenges as I can muster.

    Why A Desktop Developer Would Move to The Web

    When the browser first appeared, it totally lacked the technical powers required to replace desktop applications. Nevertheless, some programmers immediately began to ponder how to move into the world of the browser. The reasons were simple then, are simple now, and have not changed:

    • Far easier deployment -- nothing to install.
    • Worldwide access -- businesses with multiple locations are suddenly much easier to take care of.
    • You could now create a public website and give customers and vendors limited access to certain information.
    • Operating System independence. This is far more of a reality now than we dreamed it might be in the darkest days of Microsoft's Total World Domination, but there were visionaries early on who saw the possibilities.

    So there are many programmers, and I am one of them, who continue to work on the same kinds of applications we did before the web existed, but who now deploy these applications in the browser, for the reasons listed above. Here now is our tale of woe and sorrow!

    The Cultural Divide

    While desktop programmers were scratching their heads and trying to figure out how to fit into this new world, a new generation of programmers was growing up who were perfecting this new platform and developing applications that were undreamed of before. Unfortunately, some of the good advice they dreamed up is either irrelevant or counter-productive to the database programmer who is deploying to the web.

    The driving reality for the database application programmer is that her users are not surfing. They are using a dedicated program written for the purposes of the business they work for. Most of what the browser can do is either not necessary or positively in their way, and the browser lacks productivity tools that they took for granted in "the old system." This fact is central to the cultural divide between application programmers and web programmers.

    The Infamous Back Button Problem

    If I surf over to www.osnews.com and click on an article, when I am finished I click "BACK" once or twice until I'm back to osnews, and then pick another article. But to the application user, who is not surfing but is using a dedicated program, who has clicked "New Patient", typed in the info, and clicked "Save", the back button is a positive menace. It is misleading and dangerous. This has led to who-knows-how-many rants from web programmers telling application programmers, "You don't understand the web, you shouldn't write it that way," in which the desparate application programmer replies only, "but you don't understand, I must have it work this way." The simple fact is that when a user is modifying data in a browser there is no concept of BACK. There may "UNDO" or "REVERT", but once the data is saved it is saved. This is why application programmers resort to trying to hide or disable the button, or why they think they should be able to modify the history (which of course they cannot do because that would be a huge security hole for public sites).

    Ajax only Makes The Back Button Worse

    Picture a user on the customers screen, who then goes to the menu and picks the vendors screen. They work for five minutes on the vendors screen, and their wonderful snappy AJAX application is fetching search results and navigating from row to row and saving changes. Then they decide they made a mistake and hit [BACK] and wham! they're on the customer screen! It seems that the better the applications become, the worse the [BACK] button becomes. In my own shop we have finally decided to have the login program pop up a new window which does not have the [BACK] button or the address bar. This is considered heresy by web programmers (you don't understand the web! they cry) but of course what is true for them is not true for us, and vice-versa.

    This also leads to much work. We must provide for such features as UNDO with no native support in the browser, and worse, with whatever native support the browser does have been intended for something totally different.

    Your Application Must Work Without Javascript

    This is dead wrong for the application programmer. Application programmers have a power that is totally outside the experience of a pure web programmer: we can dictate system requirements to the customer. This led to many unhappy problems before the web, but with Firefox (and firebug!) we now have a platform that is free and robust. We simply install Firefox (or instruct the IT department to do so) and we have a platform that we know will support our application.

    Keyboard Shortcuts

    Nothing illustrates the divide between the web and the desktop like keyboard shortcuts.

    When Windows 95 swept the office world (but before the web really came into its own), programmers developed a new term for applications that required constant use of the mouse: we called them "mousetraps". The worst kind of mousetrap program requires the user to constantly lift their hand from the keyboard and go to the mouse, then back again. This is fatiguing, confusing, and terribly counter-productive for the end user.

    But the real problem is that the browser was born a mousetrap. From the perspective of the desktop programmer, keyboard shortcuts are clearly an afterthought, a "red-headed stepchild" as they say. Native HTML supports only the ACCESSKEY attribute, and recently Firefox was changed so that the default key combination is CTRL+ALT instead of ALT. This small change led me to finally realize that these folks, to put it mildly, have never lived in my world and haven't the slightest clue what my users need. I could expect no help from them on this front.

    The solution for the web programmer is to remember my users are not surfing, they are using a dedicated program. Therefore it is the Right Thing for the application programmer to hijack the CTRL-N key and have it mean "New Patient" (or New Customer, New Vendor, etc) instead of opening a new browser window. Moroever, he must kill the CTRL-N so that it does nothing on a page where there is no [New] button. If he does not, then sometimes CTRL-N will create a new patient, and sometimes it will pop up a new window with my.yahoo.com! So the application programmer confidently rewires the "standard" browser keys and has happier customers for his effort.

    Technical Problems In the Browser No Default Focus

    Have you ever gone to a website where the first thing you must do is log in, but the user id input does not have focus? That is a sure sign that the page was written by a web programmer with no desktop experience. When you put a database program into the browser, you expect the user to be typing constantly, looking things up, adding information, and so forth. So the application programmer must ensure that his first control always receives focus. Call it petty if you like, but without it your program becomes a mousetrap. Perfection comes by concentrating on these small things that either annoy or please users.

    Tabbing Off to Mars

    Default browser behavior is to allow the user to TAB through controls in the order they were created. This can be modified by explicitly assigning TABINDEX attributes to the control. However, when you get to the last control, the browser then Tabs you up to the menu, or the address bar, or anywhere else.

    In a business application, where the user is not surfing the web, this is wrong. Tabbing out of the content area is equilivant to exiting the application, it throws the user into a context that they do not need and (sad to say) do not understand.

    When I first began deploying business apps in the browser I would get calls saying, "it's frozen" or "i'm typing and nothings happening" and other such mysterious claims. Once I observed the users I realized they were "tabbing to Mars", the focus was up on the menu or in the address bar or somewhere else equally irrelevant. So we created the idea of the "Tab Loop", so that when the user hits TAB on the last control it loops back to the first. This completely ended those calls.

    The Tower Of Babel

    Desktop programmers have a luxury undreamed of by web programmers: they can do all or nearly all of their programming in a single language, like classic Visual Basic, Foxpro, or Delphi (or heck, COBOL or 4GL!). Most of these programmers also know SQL, but it is not seen as a burden to learn it, it is just part of the job.

    But when the application programmer moves to the web, he is confronted with at least four systems he must grasp if he is to perform as a master crafstman: (X)HTML, CSS, Javascript, and one or more server languages like Ruby, PHP, Java, etc. These different technologies all have syntaxes and philosophies that are different from each other and from past experience. All I can say is I'm glad I made the effort but I sure as heck hope I never have to make a change that dramatic again.

    Let's Not Talk About State

    When an application programmer moves to the web, he is confronted with the totally alien concept that he cannot maintain state. This idea has been discussed much in past years, and I suspect it may not be the problem it once was, as most of us have long since gotten past it. I did not want to leave it out completely, but it is way too large an issue to discuss in a paragraph and I doubt anyway that I could add to the wisdom that is already out there.

    Lousy Widgets

    The HTML SELECT element stinks. Every serious application programmer either downloads a replacement or writes his own replacement. In the old world of the desktop we did not have to do such things.

    The final piece of the puzzle in my shop was jQuery. The irony of jQuery is that it seems to me its core function is DOM traversal and manipulation, but its elegant simplicity has drawn creative minds to do things like create really nice widgets for entering time. In my shop we finished off our desktop-in-browser framework by using jQuery extensively, this solved lots of our lousy widgets problem.

    Salvation In the New Javascript Frameworks?

    I should mention the new frameworks that are emerging for desktop-in-browser, such as extJS and Dojo, not to mention of course the Yahoo! User! Interface! Library!.

    In my case I set out on the task of doing browser-based applications four years ago, when none of these technologies existed. So I developed my own simple browser-side framework. jQuery let me round off the rough edges, and now I have no need of a third-party framework. So I am afraid I cannot offer any experience in the use of these others.

    But even so, if I could use somebody else's dedicated work and put my efforts elsewhere, that would only be wise, so I continue to watch them closely.

    When Do Application Programmers Accept Advice From Web Programmers?

    Does all of this mean that we application programmers can learn nothing from web programmers? Of course not, that we be arrogant in the extreme. The answer is that when we enter the world of the public website, we must seek the advice and guidance of the experts in that world. So when I write the public portion of the application, the part that is visible to customers, vendors and other trading partners, I have to follow the standards of common practice: the back button has to work, Javascript should be optional, it should work on IE (arg), and so on and so forth.

    Conclusion: When Worlds Collide

    The desktop application programmer who decides to deploy business applications in the browser will face two broad challenges: techical and culturual, both of which stem from the origins of the web, which are so different from the origins of the desktop. The technical challenges tend to center around features of the browser that are either lacking or downright counter-productive, and this is made worse because the advice a programmer will receive comes from a culture whose goals are very different from his own. The database programmer who wants to deploy applications in the browser must be prepared to reproduce a lot of features we took for granted in the desktop, and he must also be prepared to filter through the received wisdom and throw out anything that does not meet the needs of his end-users.

    Addendum: After reading a few comments on ycombinator.com I should probably stress that the job of getting the application into the browser is completely doable (in fact I've done it myself). The browser can now easily handle the job of desktop applications. The OP lists the hazards I and many other have faced and overcome in getting there. A big part of the conclusion is that it is a lot easier to get done if you recognize why it seems so hard: which is that you may be getting advice from people whose goals are very different form yours.

    Next Essay: History Tables

    Categories: Development

    Database Performance: Pay Me Now or Pay Me Later

    Sun, 2008-07-06 08:20

    Many database performance decisions come down to "pay me now or pay me later." Some decisions will produce faster inserts and updates at the cost of slower and more complex reads, while other decisions will slow down inserts and updates but provide faster and easier reads.

    Welcome to the 30th regular post in the Database Programmer blog, there is a new post each Monday morning. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a that is updated each week, and a that is updated whenever a new design pattern is presented.

    Paying With Indexes

    An index speeds up SELECT operations. We are not going to go into any detail about how indexes work, this week we will stick to how indexes affect performance.

    Imagine a table of 10,000 sales orders. You wish to pull out a handful of fields for orders placed on 5/1/08, so you issue this SELECT:

    SELECT customer,order_total 
      FROM orders
     WHERE date = '2008-05-01'
    

    If you do not have an index on the table, then the database server will have to scan every single row in the table to find the rows that match the WHERE clause. On the other hand, if you had an index on the date column, the server would first read the index to find pointers to the correct rows, and then read only the rows you needed. The index itself is optimized by various methods so that only a very few reads are necessary to find the correct values. Most databases support the following syntax:

    CREATE INDEX some_unique_name ON orders (date)
    

    As far as performance goes, an index will slow down write operations (INSERT, UPDATE, and DELETE) because the index must be updated when the write operation occurs. This cost is on top of the write operation itself. (Addendum added July 7: Jochen points out correctly in his comment that this statement oversimplifies things. While it is true that the index must be updated for writes, the index can also dramatically speed up UPDATE and DELETE operations if those operations use a WHERE clause that can benefit from the index.)

    In terms of "pay me now or pay me later", when you regularly add a lot of indexes you are opting to "pay me now." You pay the price of slower writes to get faster reads. If you regularly avoid adding any indexes you are opting to "pay me later." You defer the costs of access to read time to get faster writes.

    I should note that it is not possible to completely avoid indexes, nor is there any value in trying to. For instance, a primary key requires an index because otherwise you have to scan the entire table every time you do an INSERT, which is just plain crazy. Foreign keys benefit from indexes as well for similar reasons.

    Paying With Views

    A "view" is a stored SQL statement that you can SELECT from as if were a table. Imagine we have a table of TEACHERS and a table of COURSES that they are teaching in a particular year. We often need to display a list of courses with the names of the teachers. We can do this with a JOIN, but a view gives us an easier pre-defined way to do this:

    CREATE VIEW courses_teachers AS
    SELECT courses.room,courses.period,courses.teacher
          ,courses.year
          ,teacher.first_name,teacher.last_name
      FROM courses 
      JOIN teachers ON courses.teacher = teachers.teacher
    

    ...which now lets you do the easier SELECT:

    SELECT * FROM courses_teachers WHERE year='2008';
    

    In terms of "pay me now or pay me later" a view is always a "pay me later" decision. It makes for easier coding but the server must go out on every SELECT and gather together the data required.

    The "pay me later" nature of a VIEW meets its greatest extreme when the view contains aggregations. Consider the following view which gives you easy access to customers and their lifetime history of orders and payments:

    CREATE VIEW customers_extended AS
    SELECT customers.*
          ,SUM(orders.order_total) as orders_total
          ,SUM(invoices.balance)   as balance
      FROM customers
      JOIN orders    ON customers.customer = orders.customer
      JOIN invoices  ON customers.customer = invoices.customer
      
    -- Pulling from the VIEW requires a complete read
    -- of relevant ORDERS and INVOICES tables
    SELECT * FROM customers_extended
     WHERE customer = X;
    

    This view is a "pay me later" proposition because every time you issue a SELECT from the view, it will have to scan many rows from the ORDERS and INVOICES tables. The contrasting method is to denormalize which is a "pay me now" approach.

    Paying With Denormalization

    Denormalizing means taking a normalized database and deliberately inserting redundant values. I have an essay on the three Denormalization Patterns that I use myself, which follow these three forms:

    • FETCH operations, where a value such as an item's price is copied from the ITEMS table into the ORDER_LINES table.
    • EXTEND operations, where you take the QTY and the PRICE columns in the ORDER_LINES table and write the EXTENDED_PRICE.
    • AGGREGATE operations, such as writing the total of ORDER_LINES onto the ORDERS table.

    All of these operations fall into the "pay me now" category. When these denormalized columns are put into tables, they add to the the size of the table and increase the cost of write operations. However, when it comes time to SELECT out of the tables the values are all there ready to go, usually with fewer JOINs and lower overall disk activity.

    Extreme Pay Me Now

    In my line of work I deal with line-of-business programs that are commissioned or purchased by businesses to do their daily work. User counts are low and resources are high, because often I will have 10 users on a single server, with access via internet limited to only a few thousand potential customers of which very few are ever on at the same time.

    In this context, I prefer to take the "pay me now" approach to its fullest realization. This means I tend to design my systems so that:

    • Any column a user is likely to filter on has an index.
    • Tables are fully denormalized, containing a wealth of derived values.

    This means that all write operations on my systems are slower than they might otherwise be. However, this is more than acceptable within this context because the server is largely untaxed, and users do not notice the difference between 100ms and 200ms to save a row. So I can pay when the user does not notice and as reward I have very rich reporting and lookup abilities.

    The extreme pay-me-now approach has one more advantage. The wealth of derived values in the database lets end-users find what they are looking for without calling a programmer and asking for a special page or report. Generally the more derived values there are the truer this becomes.

    Extreme Pay Me Later

    The extreme form of pay-me-later is a fully normalized database with no derived values and a minimum of indexes. Calculated values are available either in views, client-side code or both. This type of database is tuned for lots of fast writes because the cost of an INSERT or UPDATE has been kept to an absolute minimum. The database will be slower to perform ad-hoc or one-off queries because the server will have to do table scans whenever a user filters on anything except primary keys and foreign keys.

    The lack of derived values in fully normalized databases also leads to more phone calls and emails asking the programmer to create a report or page that will work out derived values that are not present in the database.

    Conclusion: Know Your Context

    This week we have taken common database technologies such as indexes and views and seen how they affect performance. All of these technologies can be judged in terms of the "pay me now or pay me later" decision.

    Database programmers normally choose to "pay me later" when they must support a large number of simultaneous write operations with a minimum of contention. These situations call for fewer indexes and strict normalization. The trade-off is that ad-hoc or one-off queries will involve more JOINs, more table scans and an increased likelihood the programmer will be called in for special cases.

    When read operations are more common than writes, or when inquiries and reports are likely to be unpredictable, database programmers will choose to "pay me now" by doing more work on the write operation. There will be more indexes and more denormalized values, so that the user is more likely to quickly locate whatever they want without programmer intervention.

    Next Essay: The Wonderful Awful Browser

    Categories: Development

    Database Performance: The Web Layer

    Mon, 2008-06-30 02:25

    A database application is a like a convoy of ships, it is only as fast as the the slowest ship. The three "ships" in a web-based database application are the database itself, the web layer, and the browser. Today we will continue our series on performance by examining how the web layer can efficiently retrieve data from the database.

    Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a that is updated each week, and a that is updated whenever a new design pattern is presented.

    Cost 1: Round Trips

    The first basic cost of retrieving data is the "round trip". Database programmers speak of a "roun