Kenneth Downs

Subscribe to Kenneth Downs feed
All things related to database applications, both desktop and web.KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.comBlogger75125
Updated: 1 hour 16 min ago

Approaches to "UPSERT"

Mon, 2009-06-29 20:33

This week in the Database Programmer we look at something called an "UPSERT", the strange trick where an insert command may magically convert itself into an update if a row already exists with the provided key. This trick is very useful in a variety of cases. This week we will see its basic use, and next week we will see how the same idea can be used to materialize summary tables efficiently.

An UPSERT or ON DUPLICATE KEY...

The idea behind an UPSERT is simple. The client issues an INSERT command. If a row already exists with the given primary key, then instead of throwing a key violation error, it takes the non-key values and updates the row.

This is one of those strange (and very unusual) cases where MySQL actually supports something you will not find in all of the other more mature databases. So if you are using MySQL, you do not need to do anything special to make an UPSERT. You just add the term "ON DUPLICATE KEY UPDATE" to the INSERT statement:

insert into table (a,c,b) values (1,2,3)
    on duplicate key update
     b = 2,
     c = 3

The MySQL command gives you the flexibility to specify different operation on UPDATE versus INSERT, but with that flexibility comes the requirement that the UPDATE clause completely restates the operation.

With the MySQL command there are also various considerations for AUTO_INCREMENT columns and multiple unique keys. You can read more at the MySQL page for the INSERT ... ON DUPLICATE KEY UPDATE feature.

A Note About MS SQL Server 2008

MS SQL Server introduced something like UPSERT in SQL Server 2008. It uses the MERGE command, which is a bit hairy, check it out in this nice tutorial.

Coding a Simpler UPSERT

Let us say that we want a simpler UPSERT, where you do not have to mess with SQL Server's MERGE or rewrite the entire command as in MySQL. This can be done with triggers.

To illustrate, consider a shopping cart with a natural key of ORDER_ID and SKU. I want simple application code that does not have to figure out if it needs to do an INSERT or UPDATE, and can always happily do INSERTs, knowing they will be converted to updates if the line is already there. In other words, I want simple application code that just keeps issuing commands like this:

INSERT INTO ORDERLINES
       (order_id,sku,qty)
VALUES 
       (1234,'ABC',5)

We can accomplish this by a trigger. The trigger must occur before the action, and it must redirect the action to an UPDATE if necessary. Let us look at examples for MySQL, Postgres, and SQL Server.

A MySQL Trigger

Alas, MySQL giveth, and MySQL taketh away. You cannot code your own UPSERT in MySQL because of an extremely severe limitation in MySQL trigger rules. A MySQL trigger may not affect a row in a table different from the row originally affected by the command that fired the trigger. A MySQL trigger attempting to create a new row may not affect a different row.

Note: I may be wrong about this. This limitation has bitten me on several features that I would like to provide for MySQL. I am actually hoping this limitation will not apply for UPSERTs because the new row does not yet exist, but I have not had a chance yet to try.

A Postgres Trigger

The Postgres trigger example is pretty simple, hopefully the logic is self-explanatory. As with all code samples, I did this off the top of my head, you may need to fix a syntax error or two.

CREATE OR REPLACE FUNCTION orderlines_insert_before_F()
RETURNS TRIGGER
 AS $BODY$
DECLARE
    result INTEGER; 
BEGIN
    SET SEARCH_PATH TO PUBLIC;
    
    -- Find out if there is a row
    result = (select count(*) from orderlines
                where order_id = new.order_id
                  and sku      = new.sku
               )

    -- On the update branch, perform the update
    -- and then return NULL to prevent the 
    -- original insert from occurring
    IF result = 1 THEN
        UPDATE orderlines 
           SET qty = new.qty
         WHERE order_id = new.order_id
           AND sku      = new.sku;
           
        RETURN null;
    END IF;
    
    -- The default branch is to return "NEW" which
    -- causes the original INSERT to go forward
    RETURN new;

END; $BODY$
LANGUAGE 'plpgsql' SECURITY DEFINER;

-- That extremely annoying second command you always
-- need for Postgres triggers.
CREATE TRIGGER orderlines_insert_before_T
   before insert
   ON ORDERLINES
   FOR EACH ROW
   EXECUTE PROCEDURE orderlines_insert_before_F();
A SQL Server Trigger

SQL Server BEFORE INSERT triggers are significantly different from Postgres triggers. First of all, they operate at the statement level, so that you have a set of new rows instead of just one. Secondly, the trigger must itself contain an explicit INSERT command, or the INSERT never happens. All of this means our SQL Server example is quite a bit more verbose.

The basic logic of the SQL Server example is the same as the Postgres, with two additional complications. First, we must use a CURSOR to loop through the incoming rows. Second, we must explicitly code the INSERT operation for the case where it occurs. But if you can see past the cruft we get for all of that, the SQL Server exmple is doing the same thing:

CREATE TRIGGER upsource_insert_before
ON orderlines
INSTEAD OF insert
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @new_order_id int;
    DECLARE @new_sku      varchar(15);
    DECLARE @new_qty      int;
    DECLARE @result       int;

    DECLARE trig_ins_orderlines CURSOR FOR 
            SELECT * FROM inserted;
    OPEN trig_ins_orderlines;

    FETCH NEXT FROM trig_ins_orderlines
     INTO @new_order_id
         ,@new_sku
         ,@new_qty;

    WHILE @@Fetch_status = 0 
    BEGIN
        -- Find out if there is a row now
        SET @result = (SELECT count(*) from orderlines
                        WHERE order_id = @new_order_id
                          AND sku      = @new_sku
                      )
    
        IF @result = 1 
        BEGIN
            -- Since there is already a row, do an
            -- update
            UPDATE orderlines
               SET qty = @new_qty
             WHERE order_id = @new_order_id
               AND sku      = @new_sku;
        END
        ELSE
        BEGIN
            -- When there is no row, we insert it
            INSERT INTO orderlines 
                  (order_id,sku,qty)
            VALUES
                  (@new_order_id,@new_sku,@new_qty)
            UPDATE orderlines

        -- Pull the next row
        FETCH NEXT FROM trig_ins_orderlines
         INTO @new_order_id
             ,@new_sku
             ,@new_qty;

    END  -- Cursor iteration

    CLOSE trig_ins_orderlines;
    DEALLOCATE trig_ins_orderlines;

END
A Vague Uneasy Feeling

While the examples above are definitely cool and nifty, they ought to leave a certain nagging doubt in many programmers' minds. This doubt comes from the fact that an insert is not necessarily an insert anymore, which can lead to confusion. Just imagine the new programmer who has joined the team an is banging his head on his desk because he cannot figure out why his INSERTS are not working!

We can add a refinement to the process by making the function optional. Here is how we do it.

First, add a column to the ORDERLINES table called _UPSERT that is a char(1). Then modify the trigger so that the UPSERT behavior only occurs if the this column holds 'Y'. It is also extremely import to always set this value back to 'N' or NULL in the trigger, otherwise it will appear as 'Y' on subsequent INSERTS and it won't work properly.

So our new modified explicit upsert requires a SQL statement like this:

INSERT INTO ORDERLINES
       (_upsert,order_id,sku,qty)
VALUES
       ('Y',1234,'ABC',5)

Our trigger code needs only a very slight modification. Here is the Postgres example, the SQL Server example should be very easy to update as well:

   ...trigger declration and definition above
   IF new._upsert = 'Y'
      result = (SELECT.....);
      _upsert = 'N';
   ELSE
      result = 0;
   END IF;
   
   ...rest of trigger is the same
Conclusion

The UPSERT feature gives us simplified code and fewer round trips to the server. Without the UPSERT there are times when the application may have to query the server to find out if a row exists, and then issue either an UPDATE or an INSERT. With the UPSERT, one round trip is eliminated, and the check occurs much more efficiently inside of the server itself.

The downside to UPSERTs is that they can be confusing if some type of explicit control is not put onto them such as the _UPSERT column.

Next week we will see a concept similar to UPSERT used to efficiently create summary tables.

Categories: Development

The Relational Model

Sun, 2009-04-19 15:50

If you look at any system that was born on and for the internet, like Ruby on Rails, or the PHP language, you find an immense wealth of resources on the internet itself, in endless product web sites, blogs, and forums. But when you look for the same comprehensive information on products or ideas that matured before the web you find it is not there. Relational databases stand out as a product family that matured before the internet, and so their representation in cyberspace is very different from the newer stuff.

The Math Stuff

You may have heard relational theorists argue that the strength of relational databases comes from their solid mathematical foundations. Perhaps you have wondered, what does that mean? And why is that good?

To understand this, we have to begin with Edsger W. Dijkstra, a pioneer in the area of computer science with many accomplishments to his name. Dijkstra believed that the best way to develop a system or program was to begin with a mathematical description of the system, and then refine that system into a working program. When the program completely implemented the math, you were finished.

There is a really huge advantage to this approach. If you start out with a mathematical theory of some sort, which presumably has well known behaviors, then the working program will have all of those behaviors and, put simply, everybody will know what to expect of it.

This approach also reduces time wasted on creative efforts to work out how the program should behave. All those decisions collapse intot he simple drive to make the program mimic the math.

A Particular Bit of Math

It so happens that there is a particular body of math known as Relational Theory, which it seemd to E. F. Codd would be a very nice fit for storing business information. In his landmark 1970 paper A Relational Model of Data for Large Shared Data Banks (pdf) he sets out to show how these mathematical things called "relations" have behaviors that would be ideal for storing business models.

If we take the Dijkstra philosophy seriously, which is to build systems based on well-known mathematical theories, and we take Codd's claim that "Relations" match well to business record-keeping needs, the obvious conclusion is that we should build some kind of "Relational" datastore, and so we get the Relational Database systems of today.

So there in a nutshell is why relational theorists are so certain of the virtues of the relational model, it's behaviors are well-known, and if you can build something that matches them, you will have a very predictable system.

They are Still Talking About It

If you want to know more about the actual mathematics, check out the comp.databases.theory Usenet group, or check out Wikipedia's articles on Relational Algebra and Relational Calculus.

A Practical Downside

The downside to all of this comes whenever the mathematical model describes behaviors that are contrary to human goals or simply irrelevant to them. Examples are not hard to find.

When the web exploded in popularity, many programmers found that their greatest data storage needs centered on documents like web pages rather than collections of atomic values like a customer's discount code or credit terms. They found that relational databases were just not that good at storing documents, which only stands to reason because they were never intended to. In theory the model could be stretched, (if the programmer stretched as well), but the programmers could feel in their bones that the fit was not right, and they began searching for something new.

Another example is that of calculated values. If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory, and there are some very bright theorists who will refuse all requests for assistance in getting that value to work, because you have violated their theory. This is probably the most shameful behavior that relational theorists exhibit - a complete refusal to consider extending the model to better reflect real world needs.

The Irony: There are No Relational Databases

The irony of it all is that when programmers set out to build relational systems, they ran into quite a few practical downsides and a sort of consensus was reached to break the model and create the SQL-based databases we have today. In a truly relational system a table would have quite a few more rules on it than we have in our SQL/TABLE based systems of today. But these rules must have seemed impractical or too difficult to implement, and they were scratched.

There is at least one product out there that claims to be truly relational, that is Dataphor. The Weird Optional Part

Probably the grandest irony in the so-called relational database management systems is that any programmer can completely break the relational model by making bad table designs. If your tables are not normalized, you lose much of the benefits of the relational model, and you are completely free to make lots of non-normalized and de-normalized tables.

Conclusion

I have to admit I have always found the strength of relational databases to be their simplicy and power, and not so much their foundations (even if shaky) in mathematical theory. A modern database is very good at storing data in tabular form, and if you know how to design the tables, you've got a great foundation for a solid application. Going further, I've always found relational theorists to be unhelpful in the extreme in the edge cases where overall application needs are not fully met by the underlying mathematical model. The good news is that the products themselves have all of the power we need, so I left the relational theorists to their debates years ago.

Categories: Development

I Am But a Humble Filing Clerk

Sun, 2009-03-01 18:08

This week we are returning to the series on Philosophy, and we will nail down the role of data and the database in any application that requires such things.

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Review of The Absolute

In the first essay in this series, The Quest For the Absolute, I offered the opinion that all programmers by nature seek absolutes to simplify and inform the development effort. Taking a page from the ancient Greek philosopher Aristotle, I suggested that the best absolute was the quest for the virtuous program, which is to say a program that served its purpose well.

A program that serves its purpose well is one that meets the needs of the check signer, the end-user, and the programmer. The check signer needs some strategic goal to be met, the end-user must be productive, and the programmer must make a living. If a program achieves all of these, it is an ideal virtuous program, and has satisfied the absolute requirements that are true of all programs.

Considering the Decision Maker

Normally we think of a decision maker as some important person who has the power to choose your product or services, or to give her money to your competitor. She makes her decision based on how well she can judge who will meet her strategic needs.

Although the decision maker will have vastly different needs in different situations, and is usually thinking at a high level, she has at least one need that is universal: the simple need to keep and use records. She needs a filing system. All of her stated goals will assume that you both know this unstated goal is down there at the foundation of the entire proposed system.

We programmers often forget this simple fact because computers have been around long enough that we do not remember that in their original forms it was impossible to mistake that computers were just electronic filing systems. Way back when I was a kid the day came when phone bills started arriving with an "IBM Card" slipped into them. You returned the card with your check -- they were moving their files into the electronic age. Then came electronic tickets on airlines -- nothing more than a big filing system. The modern web sites we visit to buy tickets are nothing but an interface to what remains a filing system at its heart.

The Virtuous Programmer

So if we go back to the idea of "virtue" as the Greeks thought of it, which means serving your function well, a virtuous programmer will remember always that he is but humble filing clerk. This is not his entire purpose, but it is the beginning of all other purposes and the foundation that the higher purposes are built upon.

Not Just Relational

This principle is general to all programming. An email server is a program that must receive and store email for later retrieval. What good is an email server that cannot store anything? What good is a camera without its memory card? What good is a mobile phone without its contacts list? What good is a image editing program if it cannot read and write files?

So all programs exist to process data, and the business application programmer knows that in his context this means we are really making giant sexy record-keeping systems. We are the guys that color-code the filing cabinets.

Does Not Mean Relational Is Required

This idea, that we are filing clerks, does not automatically mean we must pick relational databases for the persistence layer -- the question of what filing system to use is a completely different question.

Conclusion

If we begin with the idea that the ideal program meets the needs of decision maker, end-user, and programmer, and if we consider first the needs of the decision maker, then we begin with the universal strategic need to keep good records. The ideal programmer knows this need is at the bottom of all other needs, and remembers always that we are but humble filing clerks.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other philosophy essays are:

Categories: Development

A Comprehensive Database Security Model

Sat, 2009-02-14 13:05

This week I am taking a bit of a departure. Normally I write about things I have already done, but this week I want to speculate a bit on a security model I am thinking of coding up. Basically I have been asking myself how to create a security model for database apps that never requires elevated privileges for code, but still allows for hosts sharing multiple applications, full table security including row level and column level security, and structural immunity to SQL injection.

The Functional Requirements

Let's consider a developer who will be hosting multiple database applications on a server, sometimes instances of the same application for different customers. The applications themselves will have different needs, but they all boil down to this:

  • Some applications will allow surfers to join the site and create accounts for themselves, while others will be private sites where an administrator must make user accounts.
  • Some applications will not contain sensitive data, and so the site owner wants to send forgotten passwords in email -- which means the passwords must be stored in plaintext. Other site owners will need heightened security that disallows storing of passwords in plaintext.
  • In both cases, administrators must of course be able to manage accounts themselves.
  • The system should be structurally immune to SQL injection.
  • It must be possible to have users with the same user id ("Sheilia", "John", etc.) on multiple applications who are actually totally different people.
  • The application code must never need to run at an elevated privelege level for any reason -- not even to create accounts on public sites where users can join up and conduct transactions.
  • It must be possible for the site owners or their agents to directly connect to the database at very least for querying and possibly to do database writes without going through our application.
  • Users with accounts on one app must never be able to sign on to another app on the same server.

These requirements represent the most flexible possible combination of demands that I have so far seen in real life. The question is, can they be met while still providing security? The model I'd like to speculate on today says yes. Informed Paranoia Versus Frightened Ignorance

Even the most naive programmer knows that the internet is not a safe place, but all too often a lot of security advice you find is based on frightened ignorance and takes the form, "never do x, you don't know what might happen." If we are to create a strong security model, we have to do better than this.

Much better is to strive to be like a strong system architect, whose approach is based on informed paranoia. This hypothetical architect knows everybody is out to compromise his system, but he seeks a thorough knowledge of the inner workings of his tools so that he can engineer the vulnerabilities out as much as possible. He is not looking to write rules for the programmer that say "never do this", he is rather looking to make it impossible for the user or programmer to compromise the system.

Two Examples

Let us consider a server hosting two applications, which are called "social" and "finance".

The "social" application is a social networking site with minimal security needs. Most important is that the site owners want members of the general public to sign up, and they want to be able to email forgotten passwords (and we can't talk them out of it) -- so we have to store passwords in plaintext.

The "finance" application is a private site used by employees of a corporation around the world. The general public is absolutely not welcome. To make matters worse however, the corporation's IT department demands to be able to directly connect to the database and write to the database without going through the web app. This means the server will have an open port to the database. Sure it will be protected with SSL and passwords, but we must make sure that only users of "finance" can connect, and only to their own application.

Dispensing With Single Sign-On

There are two ways to handle connections to a database. One model is to give users real database accounts, the other is to use a single account to sign on to the database. Prior to the web coming along, there were proponents of both models in the client/server world, but amongst web developers the single sign-on method is so prevalent that I often wonder if they know there is any other way to do it.

Nevertheless, we must dispense with the single sign-on method at the start, regardless of how many people think that Moses carved it on the third tablet, because it just has too many problems:

  • Single Sign-on is the primary architectural flaw that makes SQL injection possible. As we will see later, using real database accounts makes your site (almost) completely immune to SQL injection.
  • Single Sign-on requires a connection at the maximum privilege level that any system user might have, where the code then decides what it will let a particular user do. This is a complete violation of the requirement that code always run at the lowest possible privilege level.
  • Single Sign-on totally prevents the requirement that authorized agents be allowed to connect to the database and directly read and write values.

So single sign-on just won't work with the requirements listed. This leads us to creating real accounts on the database server.

Real Accounts and Basic Security

When you use a real database account, your code connects to the database using the username and password provided by the user. Anything he is allowed to do your code will be allowed to do, and anything he is not allowed to do will throw and error if your code tries to do it.

This approach meets quite a few of our requirements nicely. A site owner's IT department can connect with the same accounts they use on the web interface -- they have the same privileges in both cases. Also, there is no need to ever have application code elevate its privilege level during normal operations, since no regular users should ever be doing that. This still leaves the issue of how to create accounts, but we will see that below.

A programmer who thinks of security in terms of what code can run will have a very hard time wrapping his head around using real database accounts for public users. The trick to understanding this approach is to forget about code for a minute and to think about tables. The basic fact of database application security is that all security resolves to table permissions. In other words, our security model is all about who can read or write to what tables, it is not about who can run which program.

If we grant public users real database accounts, and they connect with those accounts, the security must be handled within the database itself, and it comes down to:

  • Defining "groups" as collections of users who share permissions at the table level.
  • Deciding which groups are allowed select, insert, update, and delete privileges on which tables.
  • Granting and revoking those privileges on the server itself when the database is built.
  • At very least row-level security will be required, wherein a user can only see and manipulate certain rows in a table. This is how you keep users from using SQL Injection to mess with each other's order history or member profiles.
  • Column security is also very nice to finish off the picture, but we will not be talking about that today as it does not play into the requirements.

Now we can spend a moment and see why this approach eliminates most SQL Injection vulnerabilities. We will imagine a table of important information called SUPERSECRETS. If somebody could slip in a SQL injection exploit and wipe out this table we'd all go to jail, so we absolutely cannot allow this. Naturally, most users would have no privileges on this table -- even though they are directly connected to the database they cannot even see the table exists, let alone delete from it. So if our hypothetical black hat somehow slips in ";delete from supersecrets" and our code fails to trap for it, nothing happens. They have no privlege on that table. On the other side of things, consider the user who is privileged to delete from that table. If this user slips in a ";delete from supersecrets" he is only going to the trouble with SQL Injection to do something he is perfectly welcome to do anyway through the user interface. So much for SQL injection.

To repeat a point made above: row-level security is a must. If you grant members of a social site global UPDATE privileges on the PROFILES table, and you fail to prevent a SQL Injection, all hell could break loose. Much better is the ability to limit the user to seeing only his own row in the PROFILE table, so that once again you have created a structural immunity to SQL injection.

Anonymous Access

Many public sites allow users to see all kinds of information when they are not logged on. The most obvious example would be an eCommerce site that needs read access to the ITEMS table, among others. Some type of anonymous access must be allowed by our hypothetical framework.

For our two examples, the "social" site might allow limited viewing of member profiles, while the "finance" application must show absolutely nothing to the general public.

If we want a general solution that fits both cases, we opt for a deny-by-default model and allow each application to optionally have an anonymous account.

First we consider deny-by-default. This means simply that our databases are always built so that no group has any permissions on any tables. The programmer of the "social" site now has to grant certain permissions to the anonymous account, while the programmer of the "finance" application does nothing - he already has a secure system.

But still the "finance" site is not quite so simple. An anonymous user account with no privileges can still log in, and that should make any informed paranoid architect nervous. We should extend the deny-by-default philosophy so the framework will not create an anonymous account unless requested. This way the programmer of the "finance" application still basically does nothing, while the programmer of the "social" must flip a flag to create the anonymous account.

Virtualizing Users

If we are having real database accounts, there is one small detail that has to be addressed. If the "social" site has a user "johnsmith" and the finance application has a user of the same name, but they are totally different people, we have to let both accounts exist but be totally separate.

The answer here is to alias the accounts. The database server would actually have accounts "finance_johnsmith" and "social_johnsmith". Our login process would simply take the username provided and append the code in front of it when authenticating on the server. 'nuf said on that.

Allowing Public Users To Join

The "social" site allows anybody to join up and create an account. This means that somehow the web application must be able to create accounts on the database server. Yet it must do this without allowing the web code to elevate its privileges, and while preventing the disaster that would ensue if a user on the "social" site somehow got himself an account on the "finance" site.

Believe it or not, this is the easy part! Here is how it works for the "social" site:

  • Create a table of users. The primary key is the user_id which prevents duplication.
  • For the social site, there is a column called PASSWORD that stores the password in plaintext.
  • Allow the anonymous account to INSERT into this table! (Remember though that deny-by-default means that so far this account has no other privileges).
  • Put an INSERT trigger on the table that automatically creates an aliased user account, so that "johnsmith" becomes "social_johnsmith". The trigger also sets the password.
  • A DELETE trigger on the table would delete users if the row is deleted.
  • An UPDATE trigger on the table would update the password if the user UPDATES the table.
  • Row level security is an absolute must. Users must be able to SELECT and UPDATE table, but only their own row. If your database server or framework cannot support row-level security, it's all out the window.

This gives us a system that almost gets us where we need to be: the general public can create acounts, the web application does not need to elevate its privileges, users can set and change their passwords, and no user can see or set anything for any other user. However, this leaves the issue of password recovery.

In order to recover passwords and email them to members of the "social" site, it is tempting to think that the anonymous account must be able to somehow read the users table, but that is no good because then we have a structural flaw where a successful SQL injection would expose user accounts. However, this also turns out to be easy. There are two options:

  • Write a stored procedure that the anonymous user is free to execute, which does not return a password but actually emails it directly from within the database server. This requires your database server be able to send emails. (Postgres can, and I assume SQL Server can, and I don't really know about mySql).
  • Create a table for password requests, allow inserts to it but nothing else. A trigger sends the email. In this approach you can track email recovery requests.

For the "finance" application we cannot allow any of this to happen, so again we go to the deny-by-default idea. All of the behaviors above will not happen unless the programmer sets a flag to turn them on when the database is built.

This does leave the detail of how users of the "finance" application will reset their passwords. For details on how a secure app can still allow password resets, see my posting of Sept 7 2008 Secure Password Resets.

One More Detail on Public Users

We still have one more detail to handle for public users. Presumably a user, having joined up, has more privileges than the anonymous account. So the web application must be able to join them into a group without elevating its privileges. The solution here is the same as for creating the account: there will be a table that the anonymous user can make inserts into (but nothing else), and a trigger will join the user to whatever group is named.

Except for one more detail. We cannot let the user join whatever group they want, only the special group for members. This requirement can be met by defining the idea of a "freejoin" group and also a "solo" group. If the anonymous user inserts into a user-group table, and the requested group is flagged as allowing anybody to join, the trigger will allow it, but for any other group the trigger will reject the insert. The "solo" idea is similar, it means that if a user is in the "members" group, and that group is a "solo" group, they may not join any other groups. This further jails in members of the general public.

Almost Done: User Administration

In the last two sections we saw the idea of a table of users and a cross-reference of users to groups. This turns out to solve another issue we will have: letting administrators manage groups. If we define a group called "user_administrators" and give them total power on these tables, and also give them CRUD screens for them, then we have a user administrator system. This works for both the "social" and the "finance" application.

The triggers on the table have to be slightly different for the two cases, but that is a small exercise to code them up accordingly.

Cross-Database Access

Believe it or not, the system outlined above has met all of our requirements except one. So far we have a system that never requires the web server to have any elevated priveleges within the database, allows members of the public to join some sites while barring them from others, is structurally immune from SQL injection, allows different people on different sites to have the same user id, and allows administrators of both sites to directly manage accounts. Moreover, we can handle both plaintext passwords and more serious reset-only situations.

This leaves only one very thorny issue: cross-database access. The specific database server I use most is PostgreSQL, and this server has a problem (for this scenario) anyway, which is that out-of-the-box, a database account can connect to any database. This does not mean the account has any priveleges on the database, but we very seriously do not want this to happen at all. If a member of the "social" site can connect to the "finance" app, we have a potential vulnerability even if he has zero privileges in that database. We would be much happier if he could not connect at all.

In Postgres there is a solution to this, but I've grown to not like it. In Postgres you can specify that a user can only connect to a database if they are in a group that has the same name as the database. This is easy to set up, but it requires changing the default configuration of Postgres. However, for the sheer challenge of it I'd like to work out how to do it without requiring that change. So far I'm still puzzling this out. I'd also like to know that the approach would work at very least on MS SQL Server and mySql.

Conclusion

Most of what is in this week's essay is not that radical to any informed database veteran. But to web programmers who were unfortunate enough to grow up in the world of relational-databases-must-die nonsense, it is probably hard or impossible to imagine a system where users are connecting with real database accounts. The ironic thing is that the approached described here is far more secure than any single sign-on system, but it requires the programmer to shift thinking away from action-based code-centric models to what is really going on: table-based privileges. Once that hurdle is past, the rest of it comes easy.

Categories: Development

This Application Has Unique Business Rule Needs

Sun, 2009-02-01 15:29

No it does not. If it did, then your customer/employer would be doing something no other human being has ever done, which is unlikely in the extreme. The application may be unique in its particulars, but it is almost certainly extremely common in its patterns. This week we will see how "unique" needs are in fact nothing but common ordinary development projects.

Beginning With the Conclusion

I have had this conversation with many programmers over the past few years, and it always follows the same patterns. The easy part of the argument is showing the programmer that what he thinks is special or unique is in fact common. The much harder part, because it involves the delicate human ego, is showing the programmer that he has not seen this because he is ignorant. This is not fun to do and I myself usually skip it, it's usually not worth the trouble.

Path 1: Details

Occasionally I speak to a programmer who thinks he has a unique situation. His claim begins with the mountain of details he must handle, details which appear to be contradictory, subtle, and overall perplexing. He wonders if some new approach is required to handle them.

In answering this claim, we begin with the easy part, showing that the situation is itself not unique. In short, all worthwhile projects involve mountains of detail, so there is nothing special there. When it comes to the subtleties and the maze of exceptions and special cases, these are common in mature businesses that have evolved this complexity in response to business needs over the years. So again there is nothing unique here, the programmer's situation is again common.

At this point we have to ask how the programmer will deal with this perplexing mountain of detail. If he knows what he is doing, he will give the general answer that he is going to break it down as much as possible into independent smaller problems that can be solved on their own. Since this is nothing more than how all programmers solve complex problems, the entire "uniqueness" claim has completely collapsed. His project is utterly common.

The much harder part of the conversation comes if the programmer does not know how to break down the problem. For instance, if the problem is all about a fiendishly complex pricing system with lots of discounts and pricing levels, and the programmer does not know that he needs to begin with the database, and he further does not want to hear that, well, there is not much I can do for him. He will end up working a lot harder than he needs to, and will probably remain convinced he is dealing with something "unique".

But let's go a little deeper into that example of the complicated pricing system. Why do I claim that he must start with the tables, and that is he is wasting time if he does not? Well, a complete answer is much more than will fit here, and in fact I hit that theme over and over in these essays, but it comes down to:

  • He must have an accurate and precise description of the details that govern the pricing scheme. That is what tables are for.
  • In working out the mechanics of the tables, particularly their primary and foreign keys, he will come to a his most complete understanding of the mechanisms involved.
  • When the tables completely reflect the details he must work with, the code will just about write itself.
  • Lastly, but probably most importantly, the customer will expect to control the pricing system by adjusting the parameters at all levels. Again, that is what tables are for. The user is in control of the pricing system if he can edit the tables (because of course he cannot edit the code).
Path 2: Combinations

Once upon a time we had simple desktop business applications, games, and then this weird new thing, "the web". Now they are all mixed together, as we play games on the internet that are tied into huge databases. Modern applications often combine technologies that used to be comfortably separate. On any particular project, some of the requirements look like they can be met with an RDBMS, some require management and delivery of media such as MP3 or video, and he is told as well he must provide RSS feeds and import data coming in XML format. Perhaps as well there will be stone tablets and papyrus scrolls.

This programmer may believe he is in a unique situation because of this combination of needs. Because no single toolset out there can meet the entire project, perhaps this is something never before seen? But this does not hold up. Just like the argument about complexity, he must break the problem up correctly, and when he has done so he will have a perfectly ordinary project. Though I might add it will also be a very interesting project and probably a lot of fun.

In The End It Is All About Patterns

I have given two examples above taken from my own experience where programmers have claimed to me that they faced some unique situation. There are many other cases, and they always make perfect sense to the person who thinks he has discovered something new. The biggest flaw in the programmer's thinking is failing to distinguish between particulars and patterns.

My claim in this essay is that the patterns of all problems are the same. Somebody has seen it before, somebody has done it before, the answer is out there. The process of analysis and programming is about slotting your particulars in the patterns that have already been established.

In the broadest sense all programs process data, and particular programs break down into broad patterns of data access and manipulation. Sometimes you have a broad range of users putting in data with very little massaging (think twitter) and sometimes you have one group controlling much of the data while others make use of it (think Amazon), and sometimes your data is mostly relational and table based (think any ecommerce or biz app) and sometimes its mostly media (think youtube).

Once you have these broad patterns identified, you can then proceed to make use of established practices within each particular area. What is the best way to provide sensitive data on the web and protect it from unauthorized eyes? Somebody has done it before. What is the best way to track large amounts of media? Somebody has done it before. What is the best way to set up a complex pricing system with lots of discounts and pricing levels? Somebody has done it before. In all cases, your particulars may be different, but the patterns will be the same.

Conclusion: Find the Patterns

Whenever I find myself looking at a situation that appears to be new, I try to tell myself that it may be new to me, but it is not likely to be new to the human race. If it does not appear to follow a well-known pattern then I proceed as if I have not yet recognized the pattern and continue to analyze and break it apart until the pattern emerges. So far it always has.

Categories: Development

Pages