Development

Oracle enhanced adapter 1.4.0 and Readme Driven Development

Raimonds Simanovskis - Mon, 2011-08-08 16:00

I just released Oracle enhanced adapter version 1.4.0 and here is a summary of the main changes.

Rails 3.1 support

Oracle enhanced adapter's GitHub version was already working with the Rails 3.1 betas and release candidates, but it was not explicitly stated anywhere that you should use the git version with Rails 3.1. Therefore I am releasing a new version, 1.4.0, which passes all tests with the latest Rails 3.1 release candidate. As I wrote before, the main change in ActiveRecord 3.1 is that it uses a prepared statement cache and bind variables in many statements (currently in select by primary key, insert and delete statements), which results in better performance and better use of database resources.
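The effect of a statement cache can be sketched in plain Ruby. This is a simplified, hypothetical model for illustration only, not the actual ActiveRecord 3.1 internals:

```ruby
# Simplified model of a prepared statement cache: each distinct SQL
# string with bind placeholders is "prepared" only once and then
# reused with different bind values. Purely illustrative.
class StatementCache
  attr_reader :prepare_count

  def initialize
    @cache = {}
    @prepare_count = 0
  end

  # Prepare on first use, then reuse the cached statement.
  def execute(sql, binds)
    stmt = @cache[sql] ||= prepare(sql)
    stmt.call(binds)
  end

  private

  def prepare(sql)
    @prepare_count += 1
    # A real adapter would ask the database to parse the statement
    # here; we just substitute the binds for demonstration.
    ->(binds) { binds.inject(sql) { |s, b| s.sub("?", b.to_s) } }
  end
end

cache = StatementCache.new
cache.execute("SELECT * FROM posts WHERE id = ?", [1])
cache.execute("SELECT * FROM posts WHERE id = ?", [2])
cache.prepare_count # => 1 (the same SQL is prepared only once)
```

Select by primary key, insert and delete all produce a fixed SQL shape, which is why they benefit most from this kind of caching.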

To track how Oracle enhanced adapter works with different Rails versions and different Ruby implementations, I have set up a Continuous Integration server to run tests on different version combinations. At the moment of writing everything was green :)

Other bug fixes and improvements

Main fixes and improvements in this version are the following:

  • On JRuby I switched from using the old ojdbc14.jar JDBC driver to the latest ojdbc6.jar (on Java 6) or ojdbc5.jar (on Java 5). The JDBC driver can also be placed in the Rails application's ./lib directory.

  • RAW data type is now supported (it is quite often used in legacy databases instead of the nowadays recommended CLOB and BLOB types).

  • rake db:create and rake db:drop can be used to create development or test database schemas.

  • Support for virtual columns is improved (works only on Oracle 11g databases).

  • Default table, index, CLOB and BLOB tablespaces can be specified (if your DBA is insisting on putting everything in separate tablespaces :)).

  • Several improvements for context index additional options and definition dump.

See list of all enhancements and bug fixes

If you want a new feature in Oracle enhanced adapter, the best way is to implement it yourself, write some tests for that feature, and send me a pull request. In this release I have included commits from five new contributors and two existing contributors - so it is not so hard to start contributing to open source!

Readme Driven Development

One of the worst parts of Oracle enhanced adapter so far was that it was quite hard for new users to understand how to start using it and what additional features it provides. There were many posts in this blog, there were wiki pages, and there were posts in discussion forums, but all this information was in different places and some of it was already outdated, so it was hard for new users to know where to begin.

After reading about Readme Driven Development and watching a presentation about Readme Driven Development, I knew that the README of Oracle enhanced adapter was quite bad and should be improved (in my other projects I am already trying to do better, but this was my first one :)).

Therefore I have written a new README for Oracle enhanced adapter which covers the main installation, configuration, usage and troubleshooting tasks that were previously scattered across other posts. If you find that some important information is missing or outdated, please submit patches to the README as well so that it stays up to date and relevant.

If you have any questions, please use the discussion group, report issues at GitHub, or post comments here.

Categories: Development

Application Express 4.1 - #BUTTON_ID# Changed Behaviour

Anthony Rayner - Thu, 2011-07-21 07:28

I just wanted to blog about some changed behaviour that will be landing in APEX 4.1. The change has to do with the #BUTTON_ID# substitution string, available for use within a button template, and we hope it will have minimal impact.

What's Changing?

Prior to 4.1, the value substituted for the #BUTTON_ID# substitution string in a button template depended on the type of button:
  • Region buttons substituted the internal numeric ID of the button, for example '123456789101112'.
  • Item buttons actually failed to substitute this value at all, so you would just get the '#BUTTON_ID#' text.
We had to address and improve this for the new Dynamic Action / Button integration in 4.1, which means that the value substituted for #BUTTON_ID# will now be one of the following:
  1. If the template is used by an item button, the item name is used, for example 'P1_GO'.
  2. If the template is used by a region button and a 'Static ID' (new in 4.1) is defined for the button, this is used, for example 'my_custom_id'.
  3. If the template is used by a region button and no 'Static ID' is defined, an ID in the format 'B' || [Internal Button ID] will be used, for example 'B123456789101112'.
The change in behaviour that could be of potential significance in your applications is #3 above. This was changed so as to be a valid HTML 4.01 identifier (begins with a letter) and to be consistent with how we handle other component IDs in Application Express. 
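The three rules can be summarized in a small Ruby sketch. This is purely illustrative; the method and parameter names are mine, not part of any APEX API:

```ruby
# Resolves the value substituted for #BUTTON_ID# under the APEX 4.1
# rules described above. Names are illustrative only.
def button_id(item_name: nil, static_id: nil, internal_id: nil)
  return item_name if item_name   # 1. item button: the item name
  return static_id if static_id   # 2. region button with a Static ID
  "B#{internal_id}"               # 3. region button: 'B' || internal ID
end

button_id(item_name: "P1_GO")              # => "P1_GO"
button_id(static_id: "my_custom_id")       # => "my_custom_id"
button_id(internal_id: "123456789101112")  # => "B123456789101112"
```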


Will this impact your applications?

If you used a button template that included an ID value substituted by #BUTTON_ID# (you would have had to add it to the button template yourself, as this was never included by default in our themes), and, importantly, you hard-coded that ID elsewhere in your code (for example in custom JavaScript that attaches behaviour to the button), then this change in behaviour will cause that code to stop working.


What can you do about this?

To help you identify region buttons that use a template containing the #BUTTON_ID# substitution string, you can run the following query in your workspace in Application Express 4.0:

    select aapb.application_id,
           aapb.page_id,
           aapb.button_name,
           aapb.label,
           aapb.button_id,
           aapb.button_template,
           aatb.template
      from apex_application_page_buttons aapb,
           apex_applications aa,
           apex_application_temp_button aatb
     where aapb.application_id     = aa.application_id
       and aa.application_id       = aatb.application_id
       and aa.theme_number         = aatb.theme_number
       and aatb.template_name      = aapb.button_template
       and aapb.application_id     = [Your App ID]
       and aapb.button_template    is not null
       and aapb.button_position    = 'Region Position'
       and upper(aatb.template)    like '%#BUTTON_ID#%'
     order by 1,2,3

Or when APEX 4.1 comes out, you'll be able to do this (by virtue of the addition of the BUTTON_TEMPLATE_ID column to the APEX_APPLICATION_PAGE_BUTTONS dictionary view):

    select aapb.application_id,
           aapb.page_id,
           aapb.button_name,
           aapb.label,
           aapb.button_id,
           aapb.button_template,
           aatb.template
      from apex_application_page_buttons aapb,
           apex_application_temp_button aatb
     where aapb.button_template_id = aatb.button_template_id
       and aapb.application_id     = [Your App ID]
       and aapb.button_template    is not null
       and aapb.button_position    = 'Region Position'
       and upper(aatb.template)    like '%#BUTTON_ID#%'
     order by 1,2,3

This will not detect where you may have referenced the ID in other code (such as JavaScript); it just identifies the buttons that could be problematic. You can then review the pages returned by this query to isolate any that could have issues.

The easiest way to fix the issue will be to use the 'Button ID' value returned from the query as the new 'Static ID' for the button. This would mean the button would be rendered with the same ID as prior to release 4.1 and any dependent code would still work.

This will of course be documented in our Release Notes as Changed Behaviour, but I hope this post helps to give a little pre-warning as to whether this may affect you.

Updated: Thank you to Louis-Guillaume Carrier-Bédard for pointing out that the query I (rather stupidly!) initially added does not work in APEX 4.0. My apologies if this caused any inconvenience.
Categories: Development

Recent conference presentations

Raimonds Simanovskis - Thu, 2011-06-02 16:00

I have not posted recently, as I was busy with some new projects; during May I also attended several conferences, and at some of them I gave presentations. Here are the slides from those conferences. If you are interested in any of these topics, ask me to come to you as well and talk about them :)

Agile Riga Day

In March I spoke at Agile Riga Day (organized by Agile Latvia) about my experience and recommendations on how to adopt Agile practices in an iterative style.

RailsConf

In May I travelled to RailsConf in Baltimore, where I hosted the traditional Rails on Oracle Birds of a Feather session and gave an overview of how to contribute to the ActiveRecord Oracle enhanced adapter.

TAPOST

Then I participated in our local Theory and Practice of Software Testing conference, where I promoted the use of Ruby as a test scripting language.

RailsWayCon

And lastly I participated in the Euruko and RailsWayCon conferences in Berlin. At RailsWayCon my first presentation was about multidimensional data analysis with JRuby and the mondrian-olap gem. I also published the mondrian-olap demo project that I used during the presentation.

My second RailsWayCon presentation was about CoffeeScript, Backbone.js and Jasmine, which I have recently been using to build rich web user interfaces. This presentation was quite successful: there were many questions, and many participants were encouraged to try out CoffeeScript and Backbone.js. I also published the demo application that I used for code samples during the presentation.

Next conferences

Now I will take a break from conferences for some time :) But later this year I will attend FrozenRails in Helsinki and present at Oracle OpenWorld in San Francisco. See you there!

Categories: Development

Web 2.0 Solutions with Oracle WebCenter 11g (book review)

Java 2 Go! - Wed, 2011-03-16 20:16
by Fábio Souza

Hello people! This was supposed to be a post to celebrate the new year but, as you all can notice, things didn't happen the way I was expecting (again, haha). Today I will talk...

This is a summary only. Please, visit the blog for full content and more.
Categories: Development

Maintaining One Code Base with Possibly Conflicting Custom Features

Kenneth Downs - Fri, 2011-01-21 21:21

Today's essay deals with the tricky issue of custom features for individual customers who are running instances of your software.

The question comes by way of a regular reader who prefers to remain anonymous, but asks this:

... I work on a large (to me, anyway) application that serves as a client database, ticket system, time-tracking, billing, asset-tracking system. We have some customers using their own instances of the software. Often, those customers want additional fields put in different places (e.g., a priority column on tickets). This results in having multiple branches to account for versions with slight changes in code and in the database. This makes things painful and time-consuming in the long run: applying commits from master to the other branches requires testing on every branch; same with database migrate scripts, which frequently have to be modified.

Is there an easier way? I have thought about the possibility of making things "optional" in the database, such as a column on a table, and hiding its existence in the code when it's not "enabled." This would have the benefit of a single code set and a single database schema, but I think it might lead to more dependence on the code and less on the database -- for example, it might mean constraints and keys couldn't be used in certain cases.

Restating the Question

Our reader asks, is it better to have different code branches or to try to keep a lot of potentially conflicting and optional items mixed in together?

Well, the wisdom of the ages is to maintain a single code branch, including the database schema. I tried exactly once, very early in my career, to fork my own code, and gave up almost within days. When I went to work in larger shops I always arrived in a situation where the decision had already been made to maintain a single branch. Funny thing: since most programmers cannot agree on the color of the sky when they're staring out the window, this is the only decision I have ever seen maintained with absolute unanimity, no matter how many difficulties came out of it.

There is some simple arithmetic as to why this is so. If you have a single feature for a customer that is giving you a headache and you fork the code, you now have to update both code branches for every change and regression test them both, including the feature that caused the headache. But if you keep them combined you only have the one headache feature to deal with. That's why people keep them together.

Two Steps

Making custom features work smoothly is a two-step process. The first step is arguably more difficult than the second, but the second step is absolutely crucial if you have business logic tied to the feature.

Most programmers when confronted with this situation will attempt to make various features optional. I consider this to be a mistake because it complicates code, especially when we get to step 2. By far the better solution is to make features ignorable by anybody who does not want them.

The wonderful thing about ignorable features is that they tend to eliminate the problems with apparently conflicting features. If you can rig the features so anybody can use either or both, you've eliminated the conflict.

Step 1: The Schema

As mentioned above, the first step is arguably more difficult than the second, because it may involve casting requirements differently than they are presented.

For example, our reader asks about a priority column on tickets, asked for by only one customer. This may seem like a conflict because nobody else wants it, but we can dissolve the conflict when we make the feature ignorable. The first step involves doing this at the database or schema level.

But first we should mention that the UI is easy: we might have a control panel where we can make fields invisible, or maybe our users just ignore the fields they are not interested in. Either way works.

The problem is in the database. If the values for priority come from a lookup table, which they should, then we have a foreign key, and we have a problem if we try to ignore it:

  • We can allow nulls in the foreign key, which is fine for the people ignoring it, but
  • This means the people who require it can end up with tickets that have no priority, because allowing nulls does not prevent a user from leaving the value blank.

A simple answer here is to pre-populate your priority lookup table with a value of "Not applicable", perhaps with a hardcoded id of zero. Then we set the default value of TICKET.priority to zero. This means people can safely ignore it because it will always be valid.

Then, for the customer who paid for it, we just go in after the install and delete the default entry. It's a one-time operation, not even worth writing a script for, and it forces them to create a set of priorities before using the system. Further, by leaving the default of zero in there, it forces valid answers because users will be dinged with an FK violation if they do not provide a real priority.
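The same idea can be demonstrated with a small in-memory Ruby sketch, where a hash stands in for the priority lookup table and a raise stands in for the database's foreign key check (all names are hypothetical):

```ruby
# In-memory stand-in for a PRIORITIES lookup table pre-populated with
# a "Not applicable" row (id 0) and a TICKETS.priority default of 0.
PRIORITIES = { 0 => "Not applicable" }
DEFAULT_PRIORITY = 0

# Mimics the foreign key check the database would perform on insert.
def create_ticket(priority: DEFAULT_PRIORITY)
  raise "FK violation: no such priority" unless PRIORITIES.key?(priority)
  { priority: priority }
end

create_ticket                 # ignoring customers are always valid (priority 0)
PRIORITIES[1] = "High"        # the paying customer defines real priorities...
PRIORITIES.delete(0)          # ...and deletes the default entry
create_ticket(priority: 1)    # now only real priorities are accepted
# create_ticket               # would now raise, since id 0 no longer exists
```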

For this particular example, there is no step 2, because the problem is completely solved at the schema level. To see how to work with step 2, I will make up an example of my own.

Step 2: Unconditional Business Logic

To illustrate step 2, I'm going to make up an example that is not really appropriate to our reader's question, frankly because I cannot think of one for that situation.

Let's say we have an eCommerce system, and one of our sites wants customer-level discounts based on customer groups, while another wants discounts based on volume of order -- the more you buy, the deeper the discount. At this point most programmers start shouting in the meeting, "We'll make them optional!" Big mistake, because it makes for lots of work. Instead we will make them ignorable.

Step 1 is to make ignorable features in the schema. Our common code base contains a table of customer groups with a discount percent, and in the customers table we make a nullable foreign key to the customer groups table. If anybody wants to use it, great, and if they want to ignore it, that's also fine. We do the same thing for volume discounts: we make an empty table that lists threshold amounts and discount percents. If anybody wants to use it they fill it in; everybody else leaves it blank.

Now for the business logic, the calculations of these two discounts. The crucial idea here is not to make up conditional logic that tries to figure out whether or not to apply the discounts. It is vastly easier to always apply both discounts, with the discounts coming out zero for those users who have ignored the features.

So for the customer discount, if the customer's entry for customer group is null, it will not match any discount, and you treat this as zero. Same for the sale amount discount: the lookup to see which discount tier the sale qualifies for finds nothing because the table is empty, so it is treated as zero.

So the real trick at the business logic level is not to figure out which feature to use, which leads to complicated conditionals that always end up conflicting with each other, but to always use all features and code them so they have no effect when they are being ignored.
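Here is what that looks like as a Ruby sketch, using a simplified, hypothetical model of the two discounts (in a real system these lookups would be tables and joins):

```ruby
# Both discount features exist for everybody; ignoring a feature just
# means its lookup finds nothing and the discount comes out zero.
GROUP_DISCOUNTS = {}   # customer_group => discount percent
SALE_DISCOUNTS  = []   # [threshold_amount, discount_percent], ascending

def group_discount(customer_group)
  GROUP_DISCOUNTS.fetch(customer_group, 0)  # nil group matches nothing => 0
end

def volume_discount(amount)
  row = SALE_DISCOUNTS.select { |threshold, _| amount >= threshold }.last
  row ? row[1] : 0                          # empty table => 0
end

# Always apply both discounts; no conditionals deciding "which feature".
def total_discount(customer_group, amount)
  group_discount(customer_group) + volume_discount(amount)
end

total_discount(nil, 500)         # => 0 (both features ignored)
GROUP_DISCOUNTS[:wholesale] = 10
SALE_DISCOUNTS << [1000, 5]
total_discount(:wholesale, 1500) # => 15 (both features in use)
```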

Conclusion

Once upon a time almost everybody coding for a living dealt with these situations -- we all wrote code that was going to ship off to live at our customer's site. Nowadays this is less common, but for those of us dealing with it, it is a big deal.

The wisdom of the ages is to maintain a common code base. The method suggested here takes that idea to its most complete implementation, a totally common code base in which all features are active all of the time, with no conditionals or optional features (except perhaps in the UI and on printed reports), and with schema and business logic set up so that features that are being ignored simply have no effect on the user.

Categories: Development

Can You Really Create A Business Logic Layer?

Kenneth Downs - Thu, 2011-01-06 19:21

The past three posts of this little mini-series have gone from a Working definition of business logic to a Rigorous definition of business logic and on to some theorems about business logic. To wrap things up, I'd like to ask the question, is it possible to isolate business logic into a single tier?

Related Reading

There are plenty of opinions out there. For a pretty thorough explanation of how to put everything into the DBMS, check out Toon Koppelaars' description. Mr. Koppelaars has some good material, but you do need to read through his earlier posts to get the definitions of some of his terms. You can also follow his links through to some high quality discussions elsewhere.

Contrasting Mr. Koppelaars' opinion is a piece which does not have nearly the same impact, IMHO, because in Dude, Where's My Business Logic? we get some solid history mixed with normative assertions based on either anecdote or nothing at all. I'm a big believer in anecdote, but when I read a sentence that says, "The database should not have any knowledge of what a customer is, but only of the elements that are used to store a customer." then I figure I'm dealing with somebody who needs to see a bit more of the world.

Starting At the Top: The User Interface

First, let's review that our rigorous definition of business logic includes schema (types and constraints), derived values (timestamps, userstamps, calculations, histories), non-algorithmic compound operations (like batch billing) and algorithmic compound operations, those that require looping in their code. This encompasses everything we might do from the simplest passive things like a constraint that prevents discounts from being over 100% to the most complex hours-long business process, along with everything in between accounted for.

Now I want to start out by using that definition to see a little bit about what is going on in the User Interface. This is not the presentation layer as it is often called but the interaction layer and even the command layer.

Consider an admin interface to a database, where the user is entering or modifying prices for the price list. Now, if the user could enter "Kim Stanley Robinson" as the price, that would be kind of silly, so of course the numeric inputs only allow numeric values. Same goes for dates.

So the foundation of usability for a UI is, at the very least, knowledge and enforcement of types in the UI layer. Don't be scared off that I am claiming the UI is enforcing anything; we'll get to that a little lower down.

Now consider the case where the user is typing in a discount rate for this or that, and a discount is not allowed to be over 100%. The UI really ought to enforce this, otherwise the user's time is wasted when she enters an invalid value, finishes the entire form, and only then gets an error when she tries to save. In the database world we call this a constraint, so the UI needs to know about constraints to better serve the user.

Now this same user is typing a form where there is an entry for US State. The allowed values are in a table in the database, and it would be nice if the user had a drop-down list, and one that was auto-suggesting as the user typed. Of course the easiest way to do something like this is just make sure the UI form "knows" that this field is a foreign key to the STATES table, so it can generate the list using some generic library function that grabs a couple of columns out of the STATES table. Of course, this kind of lookup thing will be happening all over the place, so it would work well if the UI knew about and enforced foreign keys during entry.

And I suppose the user might at some point be entering a purchase order. The purchase order is automatically stamped with today's date. The user might see it, but not be able to change it, so now our UI knows about system-generated values.

Is this user allowed to delete a customer? If not, the button should either be grayed out or not be there at all. The UI needs to know about and enforce some security.

More About Knowing and Enforcing

So in fact the UI layer not only knows the logic but is enforcing it. It is enforcing it for two reasons, to improve the user experience with date pickers, lists, and so forth, and to prevent the user from entering invalid data and wasting round trips.

And yet, because we cannot trust what comes in to the web server over the wire, we have to enforce every single rule a second time when we commit the data.

You usually do not hear people say that the UI enforces business logic. They usually say the opposite. But the UI does enforce business logic. The problem is, everything the UI enforces has to be enforced again. That may be why we often overlook the fact that it is doing so.

The Application and The Database

Now let's go through the stuff the UI is enforcing, and see what happens in the application and the database.

With respect to type, a strongly typed language will throw an error if the type is wrong, and a weakly typed language is wise to put in a type check anyway. Then the DBMS will only allow correctly typed values, so, counting the UI, type is enforced three times.

With respect to lookups like US state, in a SQL database we always let the server do that with a foreign key, if we know what is good for us. That makes double enforcement for lookups.

So we can see where this is going. As we look at constraints and security and anything else that must be right, we find it will be enforced at least twice, and as much as three times.

You Cannot Isolate What Must be Duplicated

By defining First Order Business Logic, the simplest foundation layer, as including things like types and keys and constraints, we find that the enforcement of this First Order stuff is done 2 or 3 times, but never only once.

This more or less leaves in tatters the idea of a "Business Logic Layer" that is in any way capable of handling all business logic all by its lonesome. The UI layer is completely useless unless it is also enforcing as much logic as possible, and even when we leave the Database Server as the final enforcer of First Order Business Logic (types, constraints, keys), it is still often good engineering to do some checks to prevent expensive wasted trips to the server.

So we are wasting time if we sit around trying to figure out how to get the Business Logic "where it belongs", because it "belongs" in at least two places and sometimes three. Herding the cats into a single pen is a fool's errand: it is at once unnecessary, undesirable, and impossible.

Update: Regular reader Dean Thrasher of Infovark summarizes most of what I'm saying here using an apt industry standard term: Business Logic is a cross-cutting concern.

Some Real Questions

Only when we have smashed the concept that Business Logic can exist in serene isolation in its own layer can we start to ask the questions that would actually speed up development and make for better engineering.

Freed of the illusion of a separate layer, when we look at the higher Third and Fourth Order Business Logic, which always require coding, we can decide where they go based either on engineering or the availability of qualified programmers in particular technologies, but we should not make the mistake of believing they are going where they go because the gods would have it so.

But the real pressing question, if we are seeking to create efficient, manageable large systems, is this: how do we distribute the same business logic into 2 or 3 (or more) different places so that it is enforced consistently everywhere? Because a smaller code base is always easier to manage than a large one, and because configuration is always easier than coding, this comes down to meta-data, or if you prefer, a data dictionary. That's the trick that always worked for me.
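As a sketch of the data dictionary idea, assuming hypothetical names throughout, a single metadata definition can generate the enforcement for more than one tier:

```ruby
# One metadata definition per column; both the server-side check and
# the UI hints are generated from it, so each rule is written once even
# though it is enforced in more than one place.
DICTIONARY = {
  discount: { type: :integer, min: 0, max: 100 },
  state:    { type: :string,  lookup: %w[NY CA TX] }
}

# Server-side enforcement derived from the dictionary.
def valid?(column, value)
  rules = DICTIONARY.fetch(column)
  return false if rules[:min] && value < rules[:min]
  return false if rules[:max] && value > rules[:max]
  return false if rules[:lookup] && !rules[:lookup].include?(value)
  true
end

# UI hints derived from the same dictionary (think HTML input
# attributes and drop-down option lists).
def ui_attributes(column)
  rules = DICTIONARY.fetch(column)
  attrs = {}
  attrs[:max]     = rules[:max]    if rules[:max]
  attrs[:options] = rules[:lookup] if rules[:lookup]
  attrs
end

valid?(:discount, 150)   # => false (over 100%)
ui_attributes(:discount) # => { max: 100 }
```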

Is This Only A Matter of Definitions?

Anybody who disagrees with the thesis here has only to say, "Ken, those things are not business logic just because you wrote a blog that says they are. In my world business logic is about code baby!" Well sure, have it your way. After all, the nice thing about definitions is that we can all pick the ones we like.

But these definitions, the theorems I derived on Tuesday, and the multiple-enforcement thesis presented here today should make sense to anybody struggling with where to put the business logic. That struggle and its frustrations come from the mistake of imposing abstract conceptual responsibilities on each tier instead of using the tiers as each is able to get the job done. Databases are wonderful for type, entity integrity (uniqueness), referential integrity, ACID compliance, and many other things. Use them! Code is often better when the problem at hand cannot be solved with a combination of keys and constraints (Fourth Order Business Logic), but even that code can be put into the DB or in the application.

So beware of paradigms that assign responsibility without compromise to this or that tier. It cannot be done. Don't be afraid to use code for doing things that require structured imperative step-wise operations, and don't be afraid to use the database for what it is good for, and leave the arguments about "where everything belongs" to those with too much time on their hands.

Categories: Development

Theorems Regarding Business Logic

Kenneth Downs - Tue, 2011-01-04 16:33

In yesterday's Rigorous Definition of Business Logic, we saw that business logic can be defined in four orders:

  • First Order Business Logic is entities and attributes that users (or other agents) can save, and the security rules that govern read/write access to the entities and attributes.
  • Second Order Business Logic is entities and attributes derived by rules and formulas, such as calculated values and history tables.
  • Third Order Business Logic consists of non-algorithmic compound operations (no structure or looping is required in expressing the solution), such as a month-end batch billing or, for the old-timers out there, a year-end general ledger roll-up.
  • Fourth Order Business Logic consists of algorithmic compound operations. These occur when the action of one step affects the input to future steps. One example is ERP Allocation.
A Case Study

The best way to see if these have any value is to cook up some theorems and examine them with an example. We will take a vastly simplified time billing system, in which employees enter time which is billed once/month to customers. We'll work out some details a little below.

Theorem 1: 1st and 2nd Order, Analysis

The first theorem we can derive from these definitions is that we should look at First and Second Order Schemas together during analysis. This is because:

  • First Order Business Logic is about entities and attributes
  • Second Order Business Logic is about entities and attributes
  • Second Order Business Logic is about values generated from First Order values and, possibly, other Second Order values
  • Therefore, Second Order values are always expressed ultimately in terms of First Order values
  • Therefore, they should be analyzed together

To give the devil his due, ORM does this easily, because it ignores so much database theory (paying a large price in performance for doing so) and considers an entire row, with its first order and second order values together, as being part of one class. This is likely the foundation for the claims of ORM users that they experience productivity gains when using ORM. Since I usually do nothing but bash ORM, I hope this statement will be taken as utterly sincere.

Going the other way, database theorists and evangelists who adhere to full normalization can hobble an analysis effort by refusing to consider 2nd order values because they denormalize the database; sometimes the worst of my own crowd will prevent analysis by trying to keep these out of the conversation. So, assuming I have not pissed off my own friends, let's keep going.

So let's look at our case study of the time billing system. By theorem 1, our analysis of entities and attributes should include both 1st and 2nd order schema, something like this:

 
 INVOICES
-----------
 invoiceid      2nd Order, a generated unique value
 date           2nd Order if always takes date of batch run
 customer       2nd Order, a consequence of this being an
                           aggregation of INVOICE_LINES
 total_amount   2nd Order, a sum from INVOICE_LINES
               
 INVOICE_LINES
---------------
 invoiceid      2nd order, copied from INVOICES
 customer         +-  All three are 2nd order, a consequence
 employee         |   of this being an aggregation of
 activity         +-  employee time entries
 rate           2nd order, taken from ACTIVITIES table
                           (not depicted)
 hours          2nd order, summed from time entries
 amount         2nd order, rate * hours
 
 TIME_ENTRIES
--------------
 employeeid     2nd order, assuming system forces this
                    value to be the employee making
                    the entry
 date           1st order, entered by employee
 customer       1st order, entered by employee
 activity       1st order, entered by employee
 hours          1st order, entered by employee

Now, considering how much of that is 2nd order (almost all of it), the theorem is not only supported by the definition but ought to line up squarely with our experience. Who would want to try to analyze this and claim that all the 2nd order stuff should not be there?
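To make the derivation concrete, here is a Ruby sketch showing how the 2nd order INVOICE_LINES values can be computed entirely from the 1st order TIME_ENTRIES values (simplified: dates are omitted, and a plain hash stands in for the ACTIVITIES rate lookup):

```ruby
# 1st order: values entered by employees.
TIME_ENTRIES = [
  { employee: "anne", customer: "acme", activity: "design", hours: 3 },
  { employee: "anne", customer: "acme", activity: "design", hours: 2 },
  { employee: "bob",  customer: "acme", activity: "coding", hours: 4 }
]

# Stand-in for the ACTIVITIES rate lookup table.
RATES = { "design" => 100, "coding" => 80 }

# 2nd order: every INVOICE_LINES value is derived by aggregating the
# 1st order entries; nothing here is supplied by a user.
def invoice_lines(entries)
  entries.group_by { |e| [e[:customer], e[:employee], e[:activity]] }
         .map do |(customer, employee, activity), group|
    hours = group.sum { |e| e[:hours] }
    rate  = RATES.fetch(activity)
    { customer: customer, employee: employee, activity: activity,
      rate: rate, hours: hours, amount: rate * hours }
  end
end

invoice_lines(TIME_ENTRIES)
# => two lines: anne/design, 5 hours for 500; bob/coding, 4 hours for 320
```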

Theorem 2: 1st and 2nd Order, Implementation

The second theorem we can derive from these definitions is that First and Second Order Business logic require separate implementation techniques. This is because:

  • First Order Business Logic is about user-supplied values
  • Second Order Business Logic is about generated values
  • Therefore, unlike things cannot be implemented with like tools.

Going back to the time entry example, let's zoom in on the lowest table, the TIME_ENTRIES. The employee entering her time must supply customer, date, activity, and hours, while the system forces the value of employeeid. This means that customer and activity must be validated in their respective tables, and hours must be checked for something like <= 24. But for employeeid the system provides the value out of its context. So the two kinds of values are processed in very unlike ways. It seems reasonable that our code would be simpler if it did not try to force both kinds of values down the same validation pipe.
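To make the contrast concrete, here is a minimal, hypothetical sketch in Python. The lookup sets and function name are invented for illustration; in a real system the customer and activity checks would be validations against their respective tables:

```python
# Hypothetical sketch of Theorem 2: the user-supplied (1st Order) values go
# through a validation pipe, while the system-supplied (2nd Order) employeeid
# bypasses it entirely and is forced from the session context.
VALID_CUSTOMERS = {"acme", "initech"}        # stands in for the CUSTOMERS table
VALID_ACTIVITIES = {"consulting", "travel"}  # stands in for the ACTIVITIES table

def save_time_entry(session_employee, form):
    errors = []
    # 1st Order: validate what the user typed
    if form["customer"] not in VALID_CUSTOMERS:
        errors.append("unknown customer")
    if form["activity"] not in VALID_ACTIVITIES:
        errors.append("unknown activity")
    if not 0 < form["hours"] <= 24:
        errors.append("hours out of range")
    if errors:
        return None, errors
    # 2nd Order: forced by the system, never taken from the form
    return dict(form, employeeid=session_employee), []

row, errors = save_time_entry("emp7", {"customer": "acme",
                                       "activity": "travel", "hours": 8})
print(row["employeeid"])  # emp7
```

Note that no validation code is ever written for employeeid; by definition it cannot arrive with a bad value.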

Theorem 3: 2nd and 3rd Order, Conservation of Action

This theorem states that the sum of Second and Third Order Business Logic is fixed:

  • Second Order Business Logic is about generating entities and attributes by rules or formulas
  • Third Order Business Logic is coded compound creation of entities and attributes
  • Given that a particular set of requirements resolves to a finite set of actions that generate entities and values, then
  • The sum of Second Order and Third Order Business Logic is fixed.

In plain English, this means that the more Business Logic you can implement through 2nd Order declarative rules and formulas, the fewer processing routines you have to code. Or, if you prefer, the more processes you code, the fewer declarative rules about entities and attributes you will have.

This theorem may be hard to compare to experience for verification because most of us are so used to thinking in terms of the batch billing as a process that we cannot imagine it being implemented any other way: how exactly am I supposed to implement batch billing declaratively?

Let's go back to the schema above, where we can realize upon examination that the entirety of the batch billing "process" has been detailed in a 2nd Order Schema. If we could somehow add these facts to our CREATE TABLE commands the way we add keys, types, and constraints, batch billing would occur without the batch part.

Consider this. Imagine that a user enters a TIME_ENTRY. The system checks for a matching EMPLOYEE/CUSTOMER/ACTIVITY row in INVOICE_DETAIL, and when it finds the row it updates the totals. But if it does not find one then it creates one! Creation of the INVOICE_DETAIL record causes the system to check for the existence of an invoice for that customer, and when it does not find one it creates it and initializes the totals. Subsequent time entries not only update the INVOICE_DETAIL rows but the INVOICE rows as well. If this were happening, there would be no batch billing at the end of the month because the invoices would all be sitting there ready to go when the last time entry was made.
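One way to picture this is with database triggers. The following is a deliberately simplified sketch using SQLite via Python (hours only, no rates or dates, and all names are hypothetical stand-ins for the tables described above); it is not the author's actual mechanism, just an illustration of the cascade:

```python
import sqlite3

# Each time entry rolls itself up into the detail line and the invoice header
# the moment it is saved, so there is nothing left to "batch" at month end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_entries  (employeeid TEXT, customer TEXT,
                            activity TEXT, hours REAL);
CREATE TABLE invoices      (customer TEXT PRIMARY KEY,
                            total_hours REAL DEFAULT 0);
CREATE TABLE invoice_lines (customer TEXT, employee TEXT, activity TEXT,
                            hours REAL DEFAULT 0,
                            PRIMARY KEY (customer, employee, activity));

CREATE TRIGGER te_insert AFTER INSERT ON time_entries
BEGIN
  -- create the invoice and the detail line only if they do not exist yet
  INSERT OR IGNORE INTO invoices (customer) VALUES (NEW.customer);
  INSERT OR IGNORE INTO invoice_lines (customer, employee, activity)
       VALUES (NEW.customer, NEW.employeeid, NEW.activity);
  -- then roll the new hours up into both levels of aggregation
  UPDATE invoice_lines SET hours = hours + NEW.hours
   WHERE customer = NEW.customer AND employee = NEW.employeeid
     AND activity = NEW.activity;
  UPDATE invoices SET total_hours = total_hours + NEW.hours
   WHERE customer = NEW.customer;
END;
""")

conn.execute("INSERT INTO time_entries VALUES ('emp1','acme','consulting',3)")
conn.execute("INSERT INTO time_entries VALUES ('emp1','acme','consulting',2)")
print(conn.execute("SELECT hours FROM invoice_lines").fetchone()[0])  # 5.0
```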

By the way, I coded something that does this in a pretty straightforward way a few years ago, meaning you could skip the batch billing process and add a few details to a schema that would cause the database to behave exactly as described above. Although the format for specifying these extra features was easy enough (so it seemed to me as the author), the conceptual shift of thinking it required of people was far larger than I initially and naively imagined. Nevertheless, I toil forward, and that is the core idea behind my Triangulum project.

Observation: There Will Be Code

This is not so much a theorem as an observation. This observation is that if your application requires Fourth Order Business Logic then somebody is going to code something somewhere.

An anonymous reader pointed out in the comments to Part 2 that Oracle's MODEL clause may work in some cases. I would assume so, but I would also assume that reality can create complicated Fourth Order cases faster than SQL can evolve. Maybe.

But anyway, the real observation here is that no modern language, either app level or SQL flavor, can express an algorithm declaratively. In other words, no combination of keys, constraints, calculations and derivations, and no known combination of advanced SQL functions and clauses will express an ERP Allocation routine or a Magazine Regulation routine. So you have to code it. This may not always be true, but I think it is true now.

This is in contrast to the example given in the previous section about the fixed total of 2nd and 3rd Order Logic. Unlike that example, you cannot provide enough 2nd order wizardry to eliminate fourth order. (well ok maybe you can, but I haven't figured it out yet myself and have never heard that anybody else is even trying. The trick would be to have a table that you truncate and insert a single row into, a trigger would fire that would know how to generate the next INSERT, generating a cascade. Of course, since this happens in a transaction, if you end up generating 100,000 inserts this might be a bad idea ha ha.)

Theorem 5: Second Order Tools Reduce Code

This theorem rests on the acceptance of an observation, that using meta-data repositories, or data dictionaries, is easier than coding. If that does not hold true, then this theorem does not hold true. But if that observation (my own observation, admittedly) does hold true, then:

  • By Theorem 3, the sum of 2nd and 3rd order logic is fixed
  • By observation, using meta-data that manages schema requires less time than coding,
  • By Theorem 1, 2nd order is analyzed and specified as schema
  • Then it is desirable to specify as much business logic as possible as 2nd order schema, reducing and possibly eliminating manual coding of Third Order programs.

Again we go back to the batch billing example. Is it possible to convert it all to 2nd Order as described above? Well, yes it is, because I've done it. The trick is an extremely counter-intuitive modification to a foreign key that causes a failure to actually generate the parent row that would let the key succeed. To find out more about this, check out Triangulum (not ready for prime time as of this writing).

Conclusions

The major conclusion in all of this is that analysis and design should begin with First and Second Order Business Logic, which means working out schemas, both the user-supplied values and the system-supplied values.

When that is done, what we often call "processes" are layered on top of this.

Tomorrow we will see part 4 of 4, examining the business logic layer, asking, is it possible to create a pure business logic layer that gathers all business logic unto itself?

Categories: Development

Oracle enhanced adapter 1.3.2 is released

Raimonds Simanovskis - Tue, 2011-01-04 16:00

I just released Oracle enhanced adapter version 1.3.2 with latest bug fixes and enhancements.

Bug fixes and improvements

Main fixes and improvements are the following:

  • Previous version 1.3.1 was checking if the environment variable TNS_NAME was set and only then used the provided database connection parameter (in database.yml) as a TNS connection alias, otherwise defaulting to a connection to localhost with the provided database name. This was causing issues in many setups.
    Therefore it is now simplified: if you provide only the database parameter in database.yml, then by default it will be used as a TNS connection alias or TNS connection string.
  • Numeric username and/or password in database.yml will be automatically converted to string (previously you needed to quote them using "...").
  • Database connection pool and JNDI connections are now better supported for JRuby on Tomcat and JBoss application servers.
  • NLS connection parameters are supported via environment variables or in database.yml. For example, if you need to have NLS_DATE_FORMAT in your database session to be DD-MON-YYYY then either you specify nls_date_format: DD-MON-YYYY in database.yml for particular database connection or set ENV['NLS_DATE_FORMAT'] = 'DD-MON-YYYY' in e.g. config/initializers/oracle.rb. You can see the list of all NLS parameters in source code.
    It might be necessary to specify these NLS session parameters only if you work with some existing legacy database which has, for example, some stored procedures that require particular NLS settings. If this is new database just for Rails purposes then there is no need to change any settings.
  • If you have defined foreign key constraints then they are now correctly dumped in db/schema.rb after all table definitions. Previously they were dumped after the corresponding table, which sometimes meant the schema could not be recreated from the schema dump because it tried to load a constraint referencing a table that had not yet been defined.
  • If you are using NCHAR and NVARCHAR2 data types, their values are now correctly quoted with N'...' in SQL statements.
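To illustrate the connection and NLS changes together, a database.yml entry might now look like this (the connection name and all values are purely illustrative):

```yaml
# config/database.yml -- illustrative values
production:
  adapter: oracle_enhanced
  database: PRODDB              # used as TNS connection alias by default
  username: scott
  password: 1234                # numeric values are converted to strings now
  nls_date_format: DD-MON-YYYY  # optional NLS session parameter

```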

Upcoming changes in Rails 3.1

Meanwhile Oracle enhanced adapter is updated to pass all ActiveRecord unit tests in Rails development master branch and also updated according to Arel changes. Arel library is responsible for all SQL statement generation in Rails 3.0.

Rails 3.0.3 is using Arel version 2.0, which was a full rewrite of Arel 1.0 (used in the initial Rails 3.0 version). As a result of this rewrite it is much faster, and ActiveRecord in Rails 3.0.3 is now already a little bit faster than ActiveRecord in Rails 2.3.

There are several improvements in the Rails master branch, planned for the Rails 3.1 version, which are already supported by Oracle enhanced adapter. One improvement is that ActiveRecord will support prepared statement caching (initially for standard simple queries like find by primary key) which will reduce SQL statement parsing time and memory usage (and probably Oracle DBAs will complain less about Rails dynamic SQL generation :)). The other improvement is that ActiveRecord will correctly load included associations with more than 1000 records (which currently fails with an ORA-01795 error).

But I will write more about these improvements when Rails 3.1 is released :)

Install

As always you can install Oracle enhanced adapter on any Ruby platform (Ruby 1.8.7 or Ruby 1.9.2 or JRuby 1.5) with

gem install activerecord-oracle_enhanced-adapter

If you have any questions please use discussion group or report issues at GitHub or post comments here. And the best way to contribute is to fix some issue or create some enhancement and send me a pull request at GitHub.

Categories: Development

Business Logic: From Working Definition to Rigorous Definition

Kenneth Downs - Sun, 2011-01-02 12:05

This is part 2 of a 4 part mini-series that began before the holidays with A Working Definition of Business Logic. Today we proceed to a rigorous definition, tomorrow we will see some theorems, and the series will wrap up with a post on the "business layer."

In the first post, the working definition said that business logic includes at least:

  • The Schema
  • Calculations
  • Processes

None of these was very rigorously defined, kind of a "I'll know it when I see it" type of thing, and we did not talk at all about security. Now the task becomes tightening this up into a rigorous definition.

Similar Reading

Toon Koppelaars has some excellent material along these same lines, and a good place to start is his Helsinki Declaration (IT Version). The articles have a different focus than this series, so they make great contrasting reading. I consider the time spent reading through it very worthwhile.

Definitions, Proofs, and Experience

What I propose below is a definition in four parts. As definitions, they are not supposed to prove anything, but they are definitely supposed to ring true to the experience of any developer who has created or worked on a non-trivial business application. This effort would be a success if we reach some consensus that "at least it's all in there", even if we go on to argue bitterly about which components should be included in which layers.

Also, while I claim the definitions below are rigorous, they are not yet formal. My instinct is that formal definitions can be developed using First Order Logic, which would allow the theorems we will see tomorrow to move from "yeah that sounds about right" to being formally provable.

As for their practical benefit, inasmuch as "the truth shall make you free", we ought to be able to improve our architectures if we can settle at very least what we are talking about when we use the vague term "business logic."

The Whole Picture

What we commonly call "business logic", by which we vaguely mean, "That stuff I have to code up", can in fact be rigorously defined as having four parts, which I believe are best termed orders, as there is a definite precedence to their discovery, analysis and implementation.

  • First Order: Schema
  • Second Order: Derivations
  • Third Order: Non-algorithmic compound operations
  • Fourth Order: Algorithmic compound operations

Now we examine each order in detail.

A Word About Schema and NoSQL

Even "schema-less" databases have a schema, they simply do not enforce it in the database server. Consider: an eCommerce site using MongoDB is not going to be tracking the local zoo's animal feeding schedule, because that is out of scope. No, the code is limited to dealing with orders, order lines, customers, items and stuff like that.

It is in the very act of expressing scope as "the data values we will handle" that a schema is developed. This holds true regardless of whether the datastore will be a filesystem, an RDBMS, a new NoSQL database, or anything else.

Because all applications have a schema, whether the database server enforces it or whether the application enforces it, we need a vocabulary to discuss the schema. Here we have an embarrassment of choices: we can talk about entities and attributes, classes and properties, documents and values, or tables and columns. The choice of "entities and attributes" is likely best because it is as close as possible to an implementation-agnostic language.

First Order Business Logic: Schema

We can define schema, including security, as:

that body of entities and their attributes whose relationships and values will be managed by the application stack, including the authorization of roles to read or write to entities and properties.

Schema in this definition does not include derived values of any kind or the processes that may operate on the schema values; those are higher orders of business logic. This means that the schema actually defines the entire body of values that the application will accept from outside sources (users and other programs) and commit to the datastore. Restating again into even more practical terms, the schema is the stuff users can save themselves.

With all of that said, let's enumerate the properties of a schema.

Type is required for every attribute.

Constraints are limits to the values allowed for an attribute beyond its type. We may have a discount percent that may not exceed 1.0 or 100%.

Entity Integrity is usually thought of in terms of primary keys and the vague statement "you can't have duplicates." We cannot have a list of US States where "NY" is listed 4 times.

Referential Integrity means that when one entity links or refers to another entity, it must always refer to an existing entity. We cannot have some script kiddie flooding our site with sales of items "EAT_ME" and "F***_YOU", because those are not valid items.

The general term 'validation' is not included because any particular validation rule is a combination of any or all of type, constraints, and integrity rules.
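All four properties can be stated directly in DDL. Here is a hypothetical sketch using SQLite via Python (table and column names invented for illustration):

```python
import sqlite3

# Each schema property from the text, expressed as plain DDL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
CREATE TABLE items (
    item_id  TEXT PRIMARY KEY,            -- entity integrity: no duplicates
    price    REAL NOT NULL,               -- type is required for every attribute
    discount REAL NOT NULL DEFAULT 0
             CHECK (discount <= 1.0)      -- constraint beyond the type
);
CREATE TABLE order_lines (
    item_id TEXT NOT NULL REFERENCES items (item_id),  -- referential integrity
    qty     INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO items (item_id, price) VALUES ('WIDGET', 9.99)")
try:
    # the script kiddie's bogus item is rejected by the foreign key
    conn.execute("INSERT INTO order_lines VALUES ('EAT_ME', 1)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```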

Second Order Business Logic: Derived Values

When we speak of derived values, we usually mean calculated values, but some derivations are not arithmetic, so the more general term "derived" is better. Derivations are:

A complete entity or an attribute of an entity generated from other entities or attributes according to a formula or rule.

The definition is sufficiently general that a "formula or rule" can include conditional logic.

Simple arithmetic derived values include things like calculating price * qty, or summing an order total.

Simple non-arithmetic derivations include things like fetching the price of an item to use on an order line. The price in the order is defined as being a copy of the item's price at the time of purchase.

An example of a complete entity being derived is a history table that tracks changes in some other table. This can also be implemented in NoSQL as a set of documents tracking the changes to some original document.

Security also applies to generated values only insofar as who can see them. But security is not an issue for writing these values because by definition they are generated from formulas and rules, and so no outside user can ever attempt to explicitly specify the value of a derived entity or property.

One final point about Second Order Business Logic is that it can be expressed declaratively, if we have the tools, which we do not, at least not in common use. I wrote one myself some years ago and am re-releasing it as Triangulum, but that is a post for another day.

Sorting out First and Second Order

The definitions of First and Second Order Business Logic have the advantage of being agnostic to what kind of datastore you are using, and being agnostic to whether or not the derived values are materialized. (In relational terms, derivations are almost always denormalizing if materialized, so in a fully normalized database they will not be there, and you have to go through the application to get them.)

Nevertheless, these two definitions can right off bring some confusion to the term "schema." Example: a history table is absolutely in a database schema, but I have called First Order Business Logic "schema" and Second Order Business Logic is, well, something else. The best solution here is to simply use the terms First Order Schema and Second Order Schema. An order_lines table is First Order schema, and the table holding its history is Second Order Schema.

The now ubiquitous auto-incremented surrogate primary keys pose another stumbling block. Because they are used so often (and so often because of seriously faulty reasoning, see A Sane Approach To Choosing Primary Keys) they would automatically be considered schema -- one of the very basic values of a sales order, check, etc. But they are system-generated so they must be Second Order, no? Isn't the orderid a very basic part of the schema and therefore First Order? No. In fact, by these definitions, very little if any of an order header is First Order, the tiny fragments that are first order might be the shipping address, the user's choice of shipping method, and payment details provided by the user. The other information that is system-generated, like Date, OrderId, and order total are all Second Order.

Third Order Business Logic

Before defining Third Order Business Logic I would like to offer a simple example: Batch Billing. A consulting company bills by the hour. Employees enter time tickets throughout the day. At the end of the month the billing agent runs a program that, in SQL terms:

  • Inserts a row into INVOICES for each customer with any time entries
  • Inserts a row into INVOICE_LINES that aggregates the time for each employee/customer combination.

This example ought to make clear what I mean by defining Third Order Business Logic as:

A non-algorithmic compound operation.

The "non-algorithmic" part comes from the fact that none of the individual documents, an INVOICE row and its INVOICE_LINES, is dependent on any other. There is no case in which the invoice for one customer will influence the value of the invoice for another. You do not need an algorithm to do the job, just one or more steps that may have to go in a certain order.

Put another way, it is a one-pass set-oriented operation. The fact that it must be executed in two steps is an artifact of how database servers deal with referential integrity, which is that you need the headers before you can put in the detail. In fact, when using a NoSQL database, it may be possible to insert the complete set of documents in one command, since the lines can be nested directly into the invoices.

Put yet a third way, in more practical terms, there is no conditional or looping logic required to specify the operation. This does not mean there will be no looping logic in the final implementation, because performance concerns and locking concerns may cause it to be implemented with 'chunking' or other strategies, but the important point is that the specification does not include loops or step-wise operations because the individual invoices are all functionally independent of each other.
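The two-step batch described above can be sketched as a pair of set-oriented INSERT..SELECT statements, here using SQLite via Python (the minimal schema and sample values are hypothetical):

```python
import sqlite3

# Batch billing as two set-oriented statements, no loops in the specification.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_entries  (employee TEXT, customer TEXT, hours REAL, rate REAL);
CREATE TABLE invoices      (invoiceid INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE invoice_lines (invoiceid INTEGER, employee TEXT,
                            hours REAL, amount REAL);
INSERT INTO time_entries VALUES
  ('alice','acme',3,100), ('alice','acme',2,100), ('bob','acme',4,80);

-- Step 1: headers first, because referential integrity wants them in
--         place before the detail
INSERT INTO invoices (customer)
  SELECT DISTINCT customer FROM time_entries;

-- Step 2: one aggregated line per employee/customer combination
INSERT INTO invoice_lines (invoiceid, employee, hours, amount)
  SELECT i.invoiceid, t.employee, SUM(t.hours), SUM(t.hours * t.rate)
    FROM time_entries t JOIN invoices i ON i.customer = t.customer
   GROUP BY i.invoiceid, t.employee;
""")
for row in conn.execute(
        "SELECT employee, hours, amount FROM invoice_lines ORDER BY employee"):
    print(row)  # ('alice', 5.0, 500.0) then ('bob', 4.0, 320.0)
```

No invoice depends on any other invoice, which is exactly the functional independence that makes this Third Order rather than Fourth.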

I do not want to get side-tracked here, but I have had a working hypothesis in my mind for almost 7 years that Third Order Business Logic, even before I called it that, is an artifact, which appears necessary because of the limitations of our tools. In future posts I would like to show how a fully developed understanding and implementation of Second Order Business Logic can dissolve many cases of Third Order.

Fourth Order Business Logic

We now come to the upper bound of complexity for business logic, Fourth Order, which we label "algorithmic compound operations", and define a particular Fourth Order Business Logic process as:

Any operation where it is possible or certain that there will be at least two steps, X and Y, such that the result of Step X modifies the inputs available to Step Y.

In comparison to Third Order:

  • In Third Order the results are independent of one another, in Fourth Order they are not.
  • In Third Order no conditional or branching is required to express the solution, while in Fourth Order conditional, looping, or branching logic will be present in the expression of the solution.

Let's look at the example of ERP Allocation. In the interest of brevity, I am going to skip most of the explanation of the ERP Allocation algorithm and stick to this basic review: a company has a list of sales orders (demand) and a list of purchase orders (supply). Sales orders come in through EDI, and at least once/day the purchasing department must match supply to demand to find out what they need to order. Here is an unrealistically simple example of the supply and demand they might be facing:

  *** DEMAND ***          *** SUPPLY ***

    DATE    | QTY           DATE    | QTY
------------+-----      ------------+----- 
  3/ 1/2011 |  5          3/ 1/2011 |  3
  3/15/2011 | 15          3/ 3/2011 |  6
  4/ 1/2011 | 10          3/15/2011 | 20
  4/ 3/2011 |  7   

The desired output of the ERP Allocation might look like this:

 *** DEMAND ***      *** SUPPLY ****
    DATE    | QTY |  DATE_IN   | QTY  | FINAL 
------------+-----+------------+------+-------
  3/ 1/2011 |  5  |  3/ 1/2011 |  3   |  no
                  |  3/ 3/2011 |  2   | Yes 
  3/15/2011 | 15  |  3/ 3/2011 |  4   |  no
                  |  3/15/2011 | 11   | Yes
  4/ 1/2011 | 10  |  3/15/2011 |  9   |  no
  4/ 3/2011 |  7  |    null    | null |  no

From this the purchasing agents know that the Sales Order that ships on 3/1 will be two days late, and the Sales Orders that will ship on 4/1 and 4/3 cannot be filled completely. They have to order more stuff.

Now for the killer question: Can the desired output be generated in a single SQL query? The answer is no, not even with Common Table Expressions or other recursive constructs. The reason is that each match-up of a purchase order to a sales order modifies the supply available to the next sales order. Or, to use the definition of Fourth Order Business Logic, each iteration will consume some supply and so will affect the inputs available to the next step.

We can see this most clearly if we look at some pseudo-code:

for each sales order by date {
   while sales order demand not met {
      get earliest purchase order w/qty avail > 0
         break if none
      make entry in matching table
      // This is the write operation that 
      // means we have Fourth Order Business Logic
      reduce available qty of purchase order
   }
   break if no more purchase orders
}
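The pseudo-code above can be turned into a small runnable sketch that reproduces the allocation table from the example (Python; dates kept as plain strings for brevity):

```python
# Fourth Order: matching a purchase order to a sales order modifies the supply
# available to every later sales order.
demand = [("3/01/2011", 5), ("3/15/2011", 15), ("4/01/2011", 10), ("4/03/2011", 7)]
supply = [["3/01/2011", 3], ["3/03/2011", 6], ["3/15/2011", 20]]

matches = []
for d_date, d_qty in demand:
    needed, took_any = d_qty, False
    for po in supply:
        if needed == 0:
            break
        if po[1] == 0:
            continue
        take = min(po[1], needed)
        po[1] -= take   # the write that makes this Fourth Order:
        needed -= take  # it changes the input available to the next iteration
        took_any = True
        matches.append((d_date, po[0], take, needed == 0))
    if not took_any:
        matches.append((d_date, None, None, False))  # demand with no supply

for row in matches:
    print(row)
```

The printed rows line up with the desired output table above, including the late 3/1 order and the two orders that cannot be filled.
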

Conclusions

As stated in the beginning, it is my belief that these four orders should "ring true" with any developer who has experience with non-trivial business applications. Though we may dispute terminology and argue over edge cases, the recognition and naming of the Four Orders should be of immediate benefit during analysis, design, coding, and refactoring. They rigorously establish both the minimum and maximum bounds of complexity while also filling in the two kinds of actions we all take between those bounds. They are datamodel agnostic, and even agnostic to implementation strategies within data models (like the normalize/denormalize debate in relational).

But their true power is in providing a framework of thought for the process of synthesizing requirements into a specification and from there an implementation.

Tomorrow we will see some theorems that we can derive from these definitions.

Categories: Development

A Working Definition of Business Logic, with Implications for CRUD Code

Kenneth Downs - Tue, 2010-12-21 21:25

Update: the Second Post of this series is now available.

Update: the Third Post of this series is now available.

The Wikipedia entry on "Business Logic" has a wonderfully honest opening sentence stating that "Business logic, or domain logic, is a non-technical term... (emphasis mine)". If this is true, that the term is non-technical, or if you like, non-rigorous, then most of us spend the better part of our efforts working on something that does not even have a definition. Kind of scary.

Is it possible to come up with a decent working definition of business logic? It is certainly worth a try. This post is the first in a four part series. The second post is about a more rigorous definition of Business Logic.

This blog has two tables of contents, the Complete Table of Contents and the list of Database Skills.

The Method

In this essay we will pursue a method of finding operations that we can define as business logic with a minimum of controversy, and identify those that can likely be excluded with a minimum of controversy. This may leave a bit of gray area that can be taken up in a later post.

An Easy Exclusion: Presentation Requirements

If we define Presentation Requirements as all requirements about "how it looks" as opposed to "what it is", then we can rule these out. But if we want to be rigorous we have to be clear, Presentation Requirements has to mean things like branding, skinning, accessibility, any and all formatting, and anything else that is about the appearance and not about the actual values fetched from somewhere.

Tables as the Foundation Layer

Database veterans are likely to agree that your table schema constitutes the foundation layer of all business rules. The schema, being the tables, columns, and keys, determines what must be provided and what must be excluded. If these are not business logic, I guess I don't know what is.

What about CouchDB and MongoDB and others that do not require a predefined schema? These systems give up the advantages of a fixed schema for scalability and simplicity. I would argue here that the schema has not disappeared, it has simply moved into the code that writes documents to the database. Unless the programmer wants a nightmare of chaos and confusion, he will enforce some document structure in the code, and so I still think it safe to say that even for these databases there is a schema somewhere that governs what must be stored and what must be excluded.

So we have at least a foundation for a rigorous definition of business rules: the schema, be it enforced by the database itself or by the code, forms the bottom layer of the business logic.

Processes are Business Logic

The next easy addition to our definition of business logic would be processes, where a process can be defined loosely as anything involving multiple statements, can run without user interaction, may depend on parameters tables, and may take longer than a user is willing to wait, requiring background processing.

I am sure we can all agree this is business logic, but as long as we are trying to be rigorous, we might say it is business logic because:

  • It must be coded
  • The algorithm(s) must be inferred from the requirements
  • It is entirely independent of Presentation Requirements
Calculations are Business Logic

We also should be able to agree that calculated values like an order total, and the total after tax and freight, are business logic. These are things we must code for to take user-supplied values and complete some picture.

The reasons are the same as for processes, they must be coded, the formulas must often be inferred from requirements (or forced out of The Explainer at gunpoint), and the formulas are completely independent of Presentation Requirements.

The Score So Far

So far we have excluded "mere" Presentation Requirements, and included three entries I hope so far are non-controversial:

  • Schema
  • Processes
  • Calculations

These are three things that some programmer must design and code. The schema, either in a conventional relational database or in application code. Processes, which definitely must be coded, and calculations, which also have to be coded.

What Have We Left Out?

Plenty. At very least security and notifications. But let's put those off for another day and see how we might handle what we have so far.

For the Schema, I have already mentioned that you can either put it into a Relational database or manage it in application code when using a "NoSQL" database. More than that will have to wait for 2011, when I am hoping to run a series detailing different ways to implement schemas. I'm kind of excited to play around with CouchDB or MongoDB.

For processes, I have a separate post that examines the implications of the stored procedure route, the embedded SQL route, and the ORM route.

This leaves calculations. Let us now see how we might handle calculations.

Mixing CRUD and Processes

But before we get to CRUD, I should state that if your CRUD code involves processes, seek professional help immediately. Mixing processes into CRUD is an extremely common design error, and it can be devastating. It can be recognized when somebody says, "Yes, but when the salesman closes the sale we have to pick this up and move it over there, and then we have to...."

Alas, this post is running long already and so I cannot go into exactly how to solve these, but the solution will always be one of these:

  • Spawning a background job to run the process asynchronously. Easy because you don't have to recode much, but highly suspicious.
  • Examining why it seems necessary to do so much work on what ought to be a single INSERT into a sales table, with perhaps a few extra rows with some details. Much the better solution, but often very hard to see without a second pair of eyes to help you out.

So now we can move on to pure CRUD operations.

Let The Arguments Begin: Outbound CRUD

Outbound CRUD is any application code that grabs data from the database and passes it up to the Presentation layer.

A fully normalized database will, in appropriate cases, require business logic of the calculations variety, otherwise the display is not complete and meaningful to the user. There is really no getting around it in those cases.

However, a database Denormalized With Calculated Values requires no business logic for outbound CRUD, it only has to pick up what is asked for and pass it up. This is the route I prefer myself.

Deciding whether or not to include denormalized calculated values has heavy implications for the architecture of your system, but before we see why, we have to look at inbound CRUD.

Inbound CRUD

Inbound CRUD, in terms of business logic, is the mirror image of outbound. If your database is fully normalized, inbound CRUD should be free of business logic, since it is simply taking requests and pushing them to the database. However, if you are denormalizing by adding derived values, then it has to be done on the way in, so inbound CRUD code must contain business logic code of the calculations variety.

Now let us examine how the normalization question affects system architecture and application code.

Broken Symmetry

As stated above, denormalizing by including derived values forces calculated business logic on the inbound path, but frees your outbound path to be the "fast lane". The opposite decision, not storing calculated values, allows the inbound path to be the "fast lane" and forces the calculations into the outbound path.

The important conclusion is: if you have business logic of the calculation variety in both lanes then you may have some inconsistent practices, and there may be some gain involved in sorting those out.

But the two paths are not perfectly symmetric. Even a fully normalized database will often, sooner or later, commit those calculated values to columns. This usually happens when some definition of finality is met. Therefore, since the inbound path is more likely to contain calculations in any case, the two options are not really balanced. This is one reason why I prefer to store the calculated values and get them right on the way in.

One Final Option

When people ask me if I prefer to put business logic in the server, it is hard to answer without a lot of information about context. But when calculations are involved the answer is yes.

The reason is that calculations are incredibly easy to fit into patterns. The patterns themselves (almost) all follow foreign keys, since the foreign key is the only way to correctly relate data between tables. So you have the "FETCH" pattern, where a price is fetched from the items table to the cart, the "EXTEND" pattern, where qty * price = extended_Price, and various "AGGREGATE" patterns, where totals are summed up to the invoice. There are others, but it is surprising how many calculations fall into these patterns.

Because these patterns are so easy to identify, it is actually conceivable to code triggers by hand to do them, but being an incurable toolmaker, I prefer to have a code generator put them together out of a data dictionary. More on that around the first of the year.
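As a sketch of what such hand-coded triggers look like, here is a minimal version of the FETCH and EXTEND patterns, with an AGGREGATE query at the end. SQLite via Python is used purely for illustration, and the items/cart tables and their columns are invented for the example:

```python
import sqlite3

# Illustrative only: the items/cart tables and column names are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items (sku TEXT PRIMARY KEY, price NUMERIC);
CREATE TABLE cart  (sku TEXT, qty INTEGER, price NUMERIC, extended_price NUMERIC);

-- FETCH: the price comes from items; EXTEND: qty * price = extended_price
CREATE TRIGGER cart_calc AFTER INSERT ON cart
BEGIN
  UPDATE cart
     SET price          = (SELECT price FROM items WHERE sku = NEW.sku),
         extended_price = NEW.qty * (SELECT price FROM items WHERE sku = NEW.sku)
   WHERE rowid = NEW.rowid;
END;
""")
con.execute("INSERT INTO items VALUES ('MAG-50', 4.25)")
con.execute("INSERT INTO cart (sku, qty) VALUES ('MAG-50', 3)")

# AGGREGATE: detail rows summed up to a total
total = con.execute("SELECT SUM(extended_price) FROM cart").fetchone()[0]
print(total)  # 12.75
```

A generator working from a data dictionary would emit essentially this trigger, once per FETCH/EXTEND column, which is why the patterns lend themselves so well to tooling.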

Updates

Update 1: I realize I never made it quite clear that this is part 1, as the discussion so far seems reasonable but is hardly rigorous (yet). Part 2 will be on the way after I've fattened up for the holidays.

Update 2: It is well worth following the link Mr. Koppelaars has put in the comments: http://thehelsinkideclaration.blogspot.com/2009/03/window-on-data-applications.html

Categories: Development

User-Submitted Analysis Topic: Email

Kenneth Downs - Sun, 2010-12-19 12:10

Reader Dean Thrasher of Infovark has submitted a schema for review and analysis as part of my User-Submitted Analysis Request series. Today we are going to take a first look at what he has. Mr. Thrasher and I both hope that any and all readers will benefit from the exercise of publicly reviewing the schema.

This particular analysis request is a great start to the series, because it has to do with email. Everybody uses email so we all understand at a very basic level what data will be handled.

Brief Introduction to User-Submitted Schemas

Mr. Thrasher and I have exchanged a couple of emails, but we have avoided any in-depth discussion. Instead, we want to carry out the conversation on the public blog. So I am not aiming to provide any "from on high" perfect analysis; instead this essay will contain a lot of questions and suggestions, and we will then move into the comments to go forward.

Disclosure: None. We are not paying each other anything, nor have I received any merchandise that would normally carry a licensing fee.

Today's essay is the very first in the User-Submitted Analysis Requests series. If you would like to see an analysis of your schema, follow that link and contact me.

This blog has a Complete Table of Contents and a list of Database Skills.

Brief Description and Starting Point

To get us started, I am going to quote the Infovark Product Page, and then we will see what we want to zoom in on:

Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date.

So we know even before we see anything technical that we are going to have tables of contacts, emails, phones, addresses, appointments and many other things pulled in from email systems, plus the value-add provided by the product.

The Schema As-Is

We are going to start by looking at how the details of a CONTACT are stored. The schema models contacts with a group of cross references, aka many-to-many relationships, like so:

CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES

The first thing we have to note is that there is nothing wrong with this at all. It is fully normalized and so it will be very easy to make sure that database writes will not produce anomalies or bad data.

But, not surprisingly, Mr. Thrasher notes this makes for complicated SELECTS, so we want to ask if perhaps it is over-normalized, are there complications in there that do not need to be there?

Email as A Property of Contact

If I were to follow my own advice, I would first want to identify the master tables. Master tables generally represent real things in the world: people, places, products, services, events.

So my first question is this: is an email address a free-standing entity in its own right that deserves a master table? Or is it instead a property of the CONTACT? I am going to suggest that an email address is a property of a CONTACT, and, since a CONTACT may have more than one email address, they should be stored in a child table of the CONTACTS, more like this:

CONTACTS +----- CONTACTEMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES

Whether or not we make this switch depends not on technical arguments about keys or data types, but on whether this accurately models reality. If in fact email addresses are simply properties of contacts, then this is the simplest way to do it. Going further, the code that imports and reads the data will be easier to code, debug and maintain for two reasons: one, because it is simpler, but more importantly, two, because it accurately models reality and therefore will be easier to think about.

If this proves to be the right way to go, it may be a one-off improvement, or it may repeat itself for Phones, Addresses, and Web Addresses, but we will take that up in the next post in the series.

I am going to proceed as if this change is correct, and ask then how it will ripple through the rest of the system.

Some Specifics on the Email Addresses Table

The EMAILADDRESSES table currently has these columns:

-- SQL Flavor is Firebird
CREATE TABLE EMAILADDRESS (
  ID           INTEGER NOT NULL,
  USERNAME     VARCHAR(64) NOT NULL COLLATE UNICODE_CI,
  HOSTNAME     VARCHAR(255) NOT NULL COLLATE UNICODE_CI,
  DISPLAYNAME  VARCHAR(255) NOT NULL
);

ALTER TABLE EMAILADDRESS
  ADD CONSTRAINT PK_EMAILADDRESS
  PRIMARY KEY (ID);

CREATE GENERATOR GEN_EMAILADDRESS_ID;

SET TERM ^ ;

CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
  IF (NEW.ID IS NULL) THEN
    NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1);
END^

SET TERM ; ^

Suggestion: The first thing I notice is that the complete email itself is not actually stored. So we need to ask Mr. Thrasher what the thinking behind that was. My first instinct is to store that, because it is the original natural value of interest.

Suggestion: The columns USERNAME and HOSTNAME I could go either way on. If they are needed for querying and statistics, it is better to put them in. While this violates 3rd Normal Form and so puts us at risk, the values are supplied AFAIK by a batch import, and so there is only one codepath populating them, and we are likely safe. However, if we DO NOT need to query these values for statistics, and they are only there for convenience at display time, I would likely remove them and generate them on-the-fly in application code. There are some other good reasons to do this that will come up a little further along.

Suggestion: Unless I missed something in the schema sent over, we need a unique constraint on the combination of CONTACTID and USERNAME and HOSTNAME. Or, if we remove USERNAME and HOSTNAME in favor of the original EMAILADDRESS, we need a unique constraint on CONTACTID + EMAILADDRESS.
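A minimal sketch of the suggested constraint, shown here in SQLite through Python rather than Firebird, with a simplified table shape and sample addresses invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified sketch of the table with the suggested unique constraint
con.execute("""
CREATE TABLE EMAILADDRESSES (
  ID           INTEGER PRIMARY KEY,
  CONTACTID    INTEGER NOT NULL,
  EMAILADDRESS TEXT    NOT NULL COLLATE NOCASE,
  UNIQUE (CONTACTID, EMAILADDRESS)
)""")
con.execute("INSERT INTO EMAILADDRESSES (CONTACTID, EMAILADDRESS) "
            "VALUES (1, 'dean@example.com')")

# The same contact with the same address (any case) is rejected...
duplicate_rejected = False
try:
    con.execute("INSERT INTO EMAILADDRESSES (CONTACTID, EMAILADDRESS) "
                "VALUES (1, 'Dean@Example.com')")
except sqlite3.IntegrityError:
    duplicate_rejected = True

# ...but a different contact may still hold the same address
con.execute("INSERT INTO EMAILADDRESSES (CONTACTID, EMAILADDRESS) "
            "VALUES (2, 'dean@example.com')")
print(duplicate_rejected)  # True
```

The case-insensitive collation mirrors the UNICODE_CI collation already on the Firebird columns, so "Dean@Example.com" and "dean@example.com" count as the same address.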

Before We Get To Transactions

We are about to go into Part 2, which is about the other tables that reference EMAILADDRESSES, but before we do let's look at what the two tables would be if we made all changes suggested so far:

 CONTACTS             EMAILADDRESSES 
------------         --------------------
                      ID            (surrogate key)
 CONTACT_ID --------< CONTACT_ID
 other columns...     EMAILADDRESS  
                      LABEL
                      USERNAME      (possibly removed)
                      HOSTNAME      (possibly removed)
                      DISPLAYNAME

You may notice the LABEL column showed up out of nowhere. It was previously in the cross-reference; when the cross-reference went away, it landed in EMAILADDRESSES. LABEL holds values like "work", "home" and so on. It is supplied by whatever system we pull emails from, and so we have no constraints on it or rules about it.

Changing Emails And Transactions

Now we move on from the basic storage of EMAIL addresses to the other tables that reference those addresses. These are things like emails themselves with their lists people sent to/from, and meetings, and presumably other types of transactions as well.

When we look at transactions, which will reference contacts and email addresses, we also have to consider the fact that a CONTACT may change their email address over time. Consider a person working for Vendor A, who moves over to Vendor B. For some of the transactions they will have been at Vendor A, and then going forward they are all at Vendor B. This leads to this very important question:

Do Transactions store details about the CONTACTS as they were at the time of the transaction, or as they are now?

In other words, if a CONTACT moves from one company to another, and you look at a meeting with that person from last year, should it link to where they are now? Or should it be full of information about where they were at the time?

The answer to this question is important because it determines how to proceed on the two final points I would like to raise:

  1. Should the various transactions have a foreign key back to EMAILADDRESSES, or should they simply link back to CONTACTS and contain the EMAILADDRESS itself?
  2. Do we need an integer surrogate key on the EMAILADDRESSES table, especially if we do not link back to it?
First Final Suggestion

So the first of the final two suggestions is: maybe the transactions tables should just link back to CONTACTID and contain a freestanding EMAILADDRESS. The first argument for this is that it preserves the history as it was, and if that is what we want, then this accomplishes it. The second argument is that by putting the actual value instead of an integer key back to some table, we simplify coding by removing a join.
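A minimal sketch of that layout, again in SQLite via Python for illustration; the MEETINGS table and its columns are hypothetical stand-ins for the transaction tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CONTACTS (CONTACT_ID INTEGER PRIMARY KEY, NAME TEXT);
-- The transaction row carries CONTACT_ID plus the email as it was then
CREATE TABLE MEETINGS (MEETING_ID   INTEGER PRIMARY KEY,
                       CONTACT_ID   INTEGER REFERENCES CONTACTS,
                       EMAILADDRESS TEXT);
""")
con.execute("INSERT INTO CONTACTS VALUES (1, 'Dean')")
con.execute("INSERT INTO MEETINGS VALUES (10, 1, 'dean@vendor-a.example')")

# The historical email needs no join at all; current contact details
# need only the single join back to CONTACTS.
row = con.execute("""
  SELECT m.EMAILADDRESS, c.NAME
    FROM MEETINGS m
    JOIN CONTACTS c ON c.CONTACT_ID = m.CONTACT_ID
   WHERE m.MEETING_ID = 10""").fetchone()
print(row)  # ('dean@vendor-a.example', 'Dean')
```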

The arguments against embedding the email address might be basically, "hey, if this is a kind of data warehouse, you are really supposed to be doing the snowflake thing and you don't want to waste space on that value." To which I respond that the engineer always has the choice of trading space for speed. Putting the email in directly is a correct recording of a fact; it takes more space, but eliminates a very common JOIN from many queries, so Mr. Thrasher may choose to make that call.

This also plays back to my question about whether we should have USERNAME and HOSTNAME in the EMAILADDRESSES table. If we start putting email addresses directly into tables, we can also keep putting these other two columns in, which trades again space for speed. We could also skip them and code a parser in the application that generates them on-the-fly as needed.
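Such a parser is a few lines in application code. A sketch, with the function name my own invention:

```python
def split_email(addr: str) -> tuple[str, str]:
    """Derive USERNAME and HOSTNAME on the fly from the stored address."""
    username, _, hostname = addr.rpartition("@")
    if not username or not hostname:
        raise ValueError(f"not a usable email address: {addr!r}")
    # Hostnames are case-insensitive, so normalize; the local part is kept as-is
    return username, hostname.lower()

print(split_email("Dean.Thrasher@Infovark.com"))  # ('Dean.Thrasher', 'infovark.com')
```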

Second Final Suggestion

Now we go all of the way back to the child table and ask a basic question: Why is there is an integer surrogate key there? Integer surrogate keys are useful in many situations, but contrary to what the web generation learned, they are not some kind of required approach in relational databases.

Consider: we need a unique constraint on CONTACTID+EMAILADDRESS anyway, so we have to justify why we would add a new column that does not add value. The reflex answer tends to be "because they join faster," but that ignores the fact that if you use the natural key of CONTACTID+EMAILADDRESS, and put those columns into child tables, you do not need to join at all! If we use the surrogate key and embed it in child tables, then getting the CONTACT information forces two joins: through EMAILADDRESSES to CONTACTS. But if we use the natural key of CONTACTID+EMAILADDRESS, we already have the contact id, which saves a JOIN when we are after CONTACTS details, and, unless we want to know something like LABEL, we do not have to JOIN back to EMAILADDRESSES at all.

Conclusion

Well that's it. As promised, we have a few suggestions and a lot of questions for Mr. Thrasher. Check back in the coming days to see how the various questions work themselves out in the comments.

Categories: Development

Critical Analysis of an Algorithm: Sproc, Embedded SQL, and ORM

Kenneth Downs - Thu, 2010-12-16 22:24

This is a follow-up to yesterday's historical perspective on ORM. In this essay we examine a particular class of business logic and ask what happens if we go server-side, embedded SQL, or ORM.

This blog has two tables of contents, the Complete Table of Contents and the list of Database Skills.

Processes

We are going to look at a process. The term is not well defined, but my own working definition is any operation that has as many of the following properties as I seem to think are important at the time I make the decision:

  • Is categorically not CRUD: requires much more than displaying data to user or sending a single-row operation to the server
  • Involves reading or writing many rows
  • Involves reading or writing from multiple tables
  • Involves multiple passes of the same data
  • Involves no user interaction while executing
  • If not coded to be idempotent can cause huge headaches
  • Will likely take longer than a user is willing to wait (these days 3-10 seconds) and so runs in the background.
  • Depends upon rules tables that control its behavior
A Particular Process: Magazine Regulation

I have a more complete description of this problem here, so this is going to be very short. The system handles magazine deliveries to stores. The shop running the system has thousands of stores and thousands of magazines. Every store has a default quantity of the particular magazines they carry. For one particular magazine, NewsTime, there are 1000 stores that get an average default quantity of 50, requiring 50,000 magazines each week.

Here is the twist. You never get exactly 50,000, no matter what you tell the distributor. Some weeks you get 45,000, others you get 55,000, with any variation in between. So the Regulation Process adjusts the defaults for each store until the delivery amounts equal the on-hand total that was delivered on the truck.

The Naive or Simple Algorithm

In the first pass, we are going to consider an unrealistically simple version of Magazine Regulation, where we have too many magazines and must up the quantities until we're giving out the entire amount on-hand.

Assume a table has already been populated that has the default quantities for each store, where the relevant columns for the moment would be these:

 StoreId   |  MagazineId   |  QTY_DEFAULT | QTY_REGULATED 
-----------+---------------+--------------+---------------
    1      |      50       |      75      |      0        
    2      |      50       |      23      |      0        
    4      |      50       |      48      |      0        
   10      |      50       |      19      |      0        
   17      |      50       |     110      |      0        
   21      |      50       |      82      |      0        

We are told only that the increases must be evenly distributed; we can't just ship 5000 extra magazines to a single store. That makes sense. A simple algorithm to do this would be:

  1. Give each store one additional magazine until you run out of magazines or run out of stores.
  2. Repeat step 1 until you run out of magazines.
The Pseudo-Code

Let's mock something up in pseudo-code that shows the structure of the solution:

magsToDeliver = get total magazines...
magsDefault   = get total of defaults...

-- Outer loop implements rule 2:
-- "repeat until you run out of magazines"
while magsToDeliver > magsDefault {

   -- Inner loop implements rule 1:
   -- "increase each store by 1 until you
   --  run out of stores or run out of magazines"
   for each store getting this magazine {
        if magsToDeliver <= magsDefault break
   
        -- If you want to go nuts, and even allow
        -- the accidental simultaneous execution
        -- of two routines doing the same thing, 
        -- put these lines here instead
        magsToDeliver = get total magazines...
        magsDefault   = get total of defaults...
   
        -- This is the actual job getting done
        qty_regulate  +=1
        magsToDeliver -=1
   }
}
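The pseudo-code above can be made concrete in a few lines. Here is a pure-Python sketch, using the sample store ids and quantities from the table above, with the totals recomputed on every test for clarity rather than cached:

```python
def regulate(defaults: dict[int, int], mags_to_deliver: int) -> dict[int, int]:
    """Naive regulation: hand out one extra copy per store per pass
    until the totals match what is on the truck."""
    regulated = dict(defaults)
    # Outer loop: repeat until we run out of magazines
    while mags_to_deliver > sum(regulated.values()):
        # Inner loop: one extra per store, stopping early at the total
        for store in regulated:
            if mags_to_deliver <= sum(regulated.values()):
                break
            regulated[store] += 1
    return regulated

result = regulate({1: 75, 2: 23, 4: 48}, 150)
print(result)  # {1: 77, 2: 24, 4: 49} -- the 146 defaults plus 4 extras
```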
The Three Methods

Let's characterize what happens with our three choices of stored procedure, embedded SQL, or ORM.

Stored Procedure. Likely the fastest solution, considering all of that row-by-row going on. If that were in app code (ORM or not) we would be making two round trips to the server per iteration.

The really crazy thing about the stored procedure solution is that it is utterly neutral to the ORM question. The entire ORM good-bad debate dissolves because there is no OOP code involved. So this could be the magic solution that ORM lovers and haters could both agree upon.

App Code With Embedded SQL (no ORM). Just about everybody hates this idea these days, but it should be here for completeness, and because there are some advantages. The top of the pseudo-code requires two aggregate pulls, and if you are not afraid of SQL you can pull down the result in one pass, instead of iterating on the client. Further, the innermost operation can be coded in SQL as an "UPDATE deliveries FROM (SELECT TOP 1 deliveryid FROM deliveries...)" so that you get only one round trip per iteration, where ORM will cost two.
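The locate-and-update-in-one-statement idea looks like this in SQLite's dialect, where LIMIT 1 plays the role of TOP 1; the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE deliveries (store_id INTEGER, qty_regulated INTEGER)")
con.executemany("INSERT INTO deliveries VALUES (?, 0)", [(1,), (2,), (4,)])

# One round trip: locate the next store and bump it in a single statement
con.execute("""
  UPDATE deliveries SET qty_regulated = qty_regulated + 1
   WHERE rowid = (SELECT rowid FROM deliveries
                   ORDER BY store_id LIMIT 1)""")
qty = con.execute(
    "SELECT qty_regulated FROM deliveries WHERE store_id = 1").fetchone()[0]
print(qty)  # 1
```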

Any Kind of ORM. By this I mean the code will contain no SQL, and the innermost loop will likely instantiate some "delivery" objects, one after another, increment their qty_regulated property by one, and flush them out. This is twice as expensive as embedded SQL because you have to fetch the row from the database and then write it back, where the embedded SQL can issue a single command that locates and updates the row in a single statement.

Some may argue that I misunderstand ORM here, in that the library may be smart enough to allow the write without the read, and without forcing you to embed SQL. It would have to be something like A) instantiate empty object with key, B) assign value as an expression, like "+=1", C) save. I welcome any such examples and will update the post accordingly if any are forthcoming. I am assuming that no ORM tool I have seen can do this and would be happy to be corrected.

If the ORM forces us to calculate the initial sum of QTY_Default by fetching each row as an object and summing them in the app, we get an extra complete set of round trips. Bummer. But if we say, "Hey, my ORM tool lets me embed SQL in *emergencies*" then perhaps we can embed a query with an aggregate and skip that cost. But oops, we've got some embedded SQL. Leaky abstraction.

The Score So Far

So how does it look so far? All methods have the same number of reads and writes to disk, so we are scoring them on round trips. If "X" is the number of extra magazines to be distributed, and "Y" is the number of stores getting the magazine, we have for round trips:

  • Stored Procedure: 1
  • Embedded SQL: X + 1 (the first pull plus one trip per extra copy of the magazine)
  • ORM, Hypothetical: X + 1 (if the ORM tool can figure out how to do an update without first reading the row to the app)
  • ORM, Best Case: 2X + 1 (if the first pull can be an aggregate without embedding SQL, and two round trips per iteration)
  • ORM, Worst Case: 2X + Y (if the first pull must aggregate app-side and there are two round trips per iteration)

Update: if you want a laugh, check out the image on the Wikipedia page for "Business Logic"; it depicts aggregation occurring on the client side.

This gives us the shape of the engineering decision. With all options reading and updating the same number of rows, it all comes down to round trips. As soon as you go client-side your round trips go way up, and if your ORM tool does not support update without select, then it doubles from there.

Now multiply this across the entire application, every single action in code, every bit of "business logic" with any kind of loop that iterates over rows.

It Gets Worse/Better: Considering SQL Possibilities

If you happen to know much about modern SQL, you may be aware of the amazingly cool SQL RANK() function. If this function is used in the sproc or embedded SQL approaches, you can execute the algorithm with only one loop, in a maximum of N=CEILING((Delivered-Regulated)/Stores) iterations. This will go much faster than the row-by-row, and now those two options are pulling even further ahead of the naive row-by-row methods encouraged by an ORM tool.
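As a sketch of the set-oriented version: with a window function, the even share and the leftover copies can be distributed in one batch UPDATE instead of a row-by-row loop. Shown here in SQLite (which needs version 3.25+ for window functions and 3.33+ for UPDATE ... FROM); the table shape and sample numbers are illustrative:

```python
import sqlite3  # window functions need SQLite 3.25+, UPDATE...FROM needs 3.33+

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE deliveries (store_id INTEGER PRIMARY KEY, qty INTEGER)")
con.executemany("INSERT INTO deliveries VALUES (?, ?)",
                [(1, 75), (2, 23), (4, 48)])
extra = 4  # copies on the truck beyond the defaults

# Every store gets the even share; the first (extra % stores) by rank get one more
con.execute("""
  UPDATE deliveries
     SET qty = qty + shares.base
                   + CASE WHEN shares.rnk <= shares.leftover THEN 1 ELSE 0 END
    FROM (SELECT store_id,
                 RANK() OVER (ORDER BY store_id) AS rnk,
                 ? / COUNT(*) OVER ()            AS base,
                 ? % COUNT(*) OVER ()            AS leftover
            FROM deliveries) AS shares
   WHERE deliveries.store_id = shares.store_id""", (extra, extra))

rows = dict(con.execute("SELECT store_id, qty FROM deliveries"))
print(rows)  # {1: 77, 2: 24, 4: 49} -- 146 defaults plus the 4 extras
```

One statement, one round trip, and the database decides the per-store split, which is exactly the kind of set-oriented work a server does well.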

This ability of SQL will become extremely important, as we are about to blow apart the simplicity of the algorithm.

We Now Return You To The Real World

I have never been paid good money to write an algorithm as simple as the one described above. This is because mature businesses have always refined these simple methods for years or decades, and so the real situation is always more complex.

In a real magazine regulation algorithm, the rules tend to be more like this:

  1. Apply all of these rules whether you are increasing or decreasing the amounts to deliver
  2. Stop applying the rules when delivery amounts have been balanced to what we have on hand, no matter where you are in the process
  3. Always increase/decrease any particular store by exactly one on each iteration, no matter which rule you are working on
  4. Never decrease any store below 2
  5. Decrease any store whose past 3 issues sold less than 60% by 1, unless this would project their sales of this issue above 60%, and prioritize by prior sales % ascending.
  6. If the previous step completes, and we are short of magazines decrease each store by 1 by order of previous sales percents ascending. Repeat until we are in balance.
  7. If all stores are reduced to 2 and we are still short, abort and write error to log.
  8. If after the decreases we have excess magazines, increase any store whose past 3 issues sold more than 70% by 1, unless this would reduce their projected sales of this issue below 70%, and prioritize by prior sales % descending (so the stores with the most sales are handled first in case we don't get to all of them).
  9. If the previous step completes, and we are still in excess, increase each store by 1 in order of previous sales percents descending. Repeat until we are in balance.

This can also get doubled again if we must implement one set of rules when we start out with too few magazines, and another set of rules when we start out with too many.

Well, it's not that hard. It actually comes down to four outer loops in succession: the percent-based reduction, then the by-1 reduction, then the percent-based increase, then the by-1 increase.
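A sketch of that four-pass structure in Python, where the projection rules are reduced to their bare thresholds and the Store shape is invented; only the control flow is the point:

```python
from dataclasses import dataclass

@dataclass
class Store:
    qty: int         # quantity to deliver
    sell_pct: float  # sell-through over the past 3 issues

def regulate_refined(stores: list[Store], on_hand: int) -> None:
    """Four passes in succession: %-based decrease, by-1 decrease,
    %-based increase, by-1 increase. Rule 4 (never below 2) guards
    every decrease; projection checks are simplified to thresholds."""
    passes = [
        (-1, lambda s: s.sell_pct < 0.60, lambda s: s.sell_pct),   # rule 5
        (-1, lambda s: True,              lambda s: s.sell_pct),   # rule 6
        (+1, lambda s: s.sell_pct > 0.70, lambda s: -s.sell_pct),  # rule 8
        (+1, lambda s: True,              lambda s: -s.sell_pct),  # rule 9
    ]
    for delta, eligible, order in passes:
        progress = True
        while sum(s.qty for s in stores) != on_hand and progress:
            if (on_hand - sum(s.qty for s in stores) > 0) != (delta > 0):
                break  # this pass moves in the wrong direction
            progress = False
            for s in sorted(filter(eligible, stores), key=order):
                if sum(s2.qty for s2 in stores) == on_hand:
                    break
                if delta < 0 and s.qty <= 2:
                    continue  # rule 4: never decrease below 2
                s.qty += delta
                progress = True

stores = [Store(75, 0.55), Store(23, 0.80), Store(48, 0.65)]
regulate_refined(stores, 144)  # 146 defaults but only 144 on the truck
print([s.qty for s in stores])  # [73, 23, 48]
```

Here only the weakest-selling store absorbs the shortfall, which is exactly what rule 5 intends.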

But technically the more important matters are these:

  • We now have to grab the sales % for every store for this magazine on their past 3 issues and keep it handy throughout the routine.
  • The rules stated above contain constants like 70%, 60%. These would properly be in some parameter table to allow the user to control them, so those have to be fetched.
  • The loop through the stores is now much different, as we are filtering on prior sales percent for the percent-based passes, and filtering and ordering on prior sales percent for the by 1 passes.
Revisiting the Three Approaches

Now let's see how our three approaches would change.

The Improved Stored Procedure. If we change the sproc to use RANK() and make batch updates, we would pull the prior sales percents into a temp table and apply a covering index to cut the reads from that table in half. Our main loop would then simply join to this temp table and use it for both filtering and ordering.

The Embedded SQL. If we also changed the embedded SQL so it was making batch updates with RANK(), we would also generate a temp table. This option remains the same as the sproc except for where we put the SQL. However, it now has far fewer round trips, and starts to look much more like the sproc in terms of performance.

The ORM Approach. The idea here would be to get those prior sales percents down into an ordered collection and then use them as the basis for the loops. The thorny part is that they must be aggregated and sorted. If we want to avoid all embedded SQL, then the aggregation can be done client-side if we don't mind pulling down 3 times as many rows as are required. The sorting we can pull off if we put the objects into a collection such as an associative array, where the key is the sales percent, then we can use [language of choice]'s built-in sorting (hopefully), and we have escaped the dread evil of embedded SQL.
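A sketch of that client-side route, with the rows and attribute names invented for the example; note we pull down three rows per store just to produce one number each:

```python
from collections import defaultdict

# One row per (store, issue), the way an ORM might hand them back
rows = [
    {"store_id": 1, "sold": 30, "received": 75},
    {"store_id": 1, "sold": 40, "received": 75},
    {"store_id": 1, "sold": 35, "received": 75},
    {"store_id": 2, "sold": 20, "received": 23},
    {"store_id": 2, "sold": 22, "received": 23},
    {"store_id": 2, "sold": 18, "received": 23},
]

# Client-side aggregation over the past 3 issues...
sold, recv = defaultdict(int), defaultdict(int)
for r in rows:
    sold[r["store_id"]] += r["sold"]
    recv[r["store_id"]] += r["received"]

# ...then sort by sales percent using the language's built-in sorting
by_pct = sorted((sold[s] / recv[s], s) for s in sold)
print([s for _, s in by_pct])  # [1, 2] -- ascending sales percent
```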

So we end up where we were, only more so. The sproc remains the fastest, and if we know how to code set-oriented nifty stuff with RANK() then the embedded SQL will run in almost the exact same time. The ORM requires most likely even more round trips and expensive app-side operations that are performed much more efficiently in the db server, unless we are willing to break the abstraction and embed a bit of SQL.

But in the end, if all of that ORM cost kicks a 3-second routine up to 7 seconds, still within what a user will tolerate, and it lets you avoid embedded SQL and keep your paradigm, who am I to judge?

Conclusions

I offer none. There are so many conventions in play regarding where to put your code, what tool you are already using, and so forth, that it is really up to the reader to draw conclusions. I only hope there is enough information here to do so.

Categories: Development

Historical Perspective of ORM and Alternatives

Kenneth Downs - Wed, 2010-12-15 20:55

A couple of years ago I broke my basic rule of sticking to practical how-to and general programming philosophy and wrote Why I Do Not Use ORM. It sure got a lot of hits, and is read every day by people searching for such things as "orm bad" or "why use orm". But I have never been satisfied with that post, so I decided to take another stab from a different angle. There are legitimate problems that led to ORM, and those problems need to be looked at even if we cannot quite agree on what they are or on whether ORM is the answer.

UPDATE: In response to comments below and on reddit.com, I have a new post that gives a detailed analysis of an algorithm implemented as a sproc, in app code with embedded SQL, and in ORM.

Here then, is one man's short history of commercial database application programming, from long before the ORM system, right up to the present.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Way Back Machine

When I began my career the world was a different place. No Web, no Java, and Object Orientation had not yet entered the mainstream. My first application was written on a timeshare system (a MicroVAX), and writing LAN applications made me a good living for a while before I graduated to client/server.

In those days there were three things a programmer (We were not "software engineers" yet, just programmers) had to know. Every programmer I knew wanted to master all of these skills. They were:

  • How to design a database schema for correctness and efficiency.
  • How to code an application that could process data from the database, correctly and efficiently.
  • How to make a good UI, which came down to hotkeys and stuffing the screen with as much info as possible.

In this essay we are going to look at those first two.

My own experience may be somewhat peculiar in that I have never worked on a team where the programmers were separated from the database. (OK, one exception, in my current assignment there is an iron curtain between the two, but happily it is not my problem from where I sit). Coders made tables, and "tablers" wrote code. So this focus on being a good developer by developing both skills may be rare, enjoyed by those who have the same ecumenical background that I enjoyed.

Some Changes That Did Not Matter

Things changed rapidly, but most of those changes did not really affect application development.

When Windows 95 came out, being "almost as good as a Mac", we recoded our DOS apps into Windows apps without too much trouble and life went on as before.

Laser printers replaced dot-matrix for most office use, CPUs kept getting faster (and Windows kept getting slower), each year there were more colors on the screen, disks got bigger and RAM got cheaper.

Only the internet and the new stateless programming required any real adjustment, but it was easy for a database guy because good practice had always been to keep your transactions as short as possible. The stateless thing just kind of tuned that to a sharp edge.

Finally, with the internet, the RDBMS finally lost its place as sole king of the datastore realm, but those new datastores will have to wait for another day, lest we get bogged down.

Enter Object Orientation

Arguably nothing changed programming more than Object Orientation. Certainly not Windows 95, faster graphics or any of those other Moore's Law consequences. I would go so far as to say that even the explosion of the web just produced more programming, and of different kinds of apps, and even that did not come close to the impact of Object Orientation. Disagree if you like, but as it came in, it was new, it was strange, it was beautiful, and we were in love.

Now here is something you may not believe. The biggest question for those of us already successfully developing large applications was: What is it good for? What does it give me that I do not already have? Sure it's beautiful, but what does it do?

User interfaces were for me the easiest first place to see the benefits. When the widgets became classes and objects, and we employed encapsulation, inheritance and composition, the world changed and I don't know anybody who ever looked back.

OOP, Data, and Data Structures

But in the matter of processing data, things were not so clear cut. The biggest reason may have been that all languages back then had specialized data structures that were highly tuned to handling relational data. These worked so well that nobody at first envisioned anything like ActiveRecord because we just did not need it.

With these structures you could write applications that ran processes involving dozens of tables, lasting hours, and never wonder, "Gosh, how do I map this data to my language of choice?" You chose the language you were using precisely because it knew how to handle data!

I would like to throw in just one example to show how OOP was not relevant to getting work done back then. I was once asked to optimize something called "ERP Allocation" that ran once a day, but was taking 26 hours at the largest customer site, obviously a big problem. It turned out there was a call to the database inside of a tightly nested loop, and when I moved the query outside of the loop the results were dramatic. The programmers got the idea and they took over from there. The main point of course is that it was all about how to efficiently use a database. The language was OOP, and the code was in a class, but that had nothing to do with the problem or the solution. Going further, coding a process as data-intensive as this one using ActiveRecord was prima facie absurd to anybody who knew about data and code.
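The fix in cases like this is the same in any language: hoist the query out of the loop and walk plain data instead. Here is a minimal sketch in Python with SQLite (the original system was not Python, and the table and column names here are invented for illustration):

```python
import sqlite3

# Throwaway in-memory table standing in for whatever the loop was hitting.
conn = sqlite3.connect(":memory:")
conn.execute("create table rates (item_id integer primary key, rate real)")
conn.executemany("insert into rates values (?, ?)",
                 [(i, i * 0.5) for i in range(1000)])

items = list(range(1000))

# Slow: one round trip to the database per loop iteration.
def allocate_slow(items):
    total = 0.0
    for item_id in items:
        (rate,) = conn.execute(
            "select rate from rates where item_id = ?", (item_id,)).fetchone()
        total += rate
    return total

# Fast: pull everything needed in one query, then loop over plain data.
def allocate_fast(items):
    rates = dict(conn.execute("select item_id, rate from rates"))
    return sum(rates[item_id] for item_id in items)

assert allocate_slow(items) == allocate_fast(items)
```

Same answer either way; the difference is the number of trips to the server, which is exactly what dominated the 26-hour run.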

Java and the Languages of The Internet

But the web had another impact that was far more important than just switching to stateless programming. This was the introduction of an entirely new family of languages that took over the application space, listed here in no particular order: Perl, PHP, Python, Ruby, and the king of them all: Java.

All of these languages have one thing in common that positively jumps out at a veteran: a complete lack of data structures specialized for handling relational data. So as these languages exploded in popularity with their dismal offerings in data handling, the need to provide something better in that area became rapidly clear.

Java has a special role to play because it was pure OOP from the ground up. Even the whitespace is an object! The impact of Java is very important here because Object Orientation was now the One True Faith, and languages with a more flexible approach were gradually demoted to mere 'scripting' languages. (Of course, proponents will quickly point out that 1/12 of the world's population is now using a single application written in one of those 'scripting' languages.)

So the explosion of languages without decent data handling abilities, coupled with a rise in OOP-uber-alles thinking led us quite naturally to:

The First Premise of ORM: The Design Mismatch

The first premise of ORM is that there is a design mismatch between OOP and Relational, which must be resolved before any meaningful work can be done.

This view is easy to sympathize with, even if you disagree, when you consider the points raised in the above sections, that the languages in play lack any real specialized data structures, and that a certain exclusive truthiness to OOP has arisen that is blind to entire classes of solutions.

So we must grant the ORM crowd their first premise, in modified form. It is not that there is a design mismatch, it is that there is something missing, something that was in older systems that is just not there in the newer languages. Granting that this missing feature is an actual mismatch requires a belief in the Exclusive Truth of OOP, which I do not grant. OOP is like the computer itself, of which Commander Spock said, "Computers make excellent servants, but I have no wish to be servant to a computer."

But anyway, getting back to the story, the race was on to replace what had been lost, and to do it in an OOPy way.

The Second Premise of ORM: Persistence

Fast forward and we soon have an entire family of tools known as Object-Relational-Mappers, or ORM. With them came an old idea: persistence.

The idea has always been around that databases exist to persist the work of the programmer. I thought that myself when I was, oh, about 25 or so. I learned fast that my view of reality was, *cough*, lacking, and that in fact there are two things that are truly real for a developer:

  • The users, who create the paycheck, and
  • The data, which those users seemed to think was supposed to be correct 100% of the time.

From this perspective, the application code suddenly becomes a go-between, the necessary appliance that gets data from the db to the user (who creates the paycheck), and takes instructions back from the user and puts them in the database (correctly, thank you, and don't make the user wait). No matter how beautiful the code was, the user would only ever see the screen (or page nowadays) and you only heard about it if it was wrong. Nobody cares about my code, nobody cares about yours.

However, in the ORM world the idea of a database as the persistence layer now sits on a throne reserved for axiomatic truth. Those who disagree with me on this may say that I have the mistaken perspective of an outsider, to which I could say only that it is this very idea that keeps me an outsider.

But we should not paint the world with a broad brush. Chris Wong writes an excellent blog where he occasionally details how to respect the database while using Hibernate, in this post and this post.

An Alternative World View

There are plenty of alternatives to ORM, but I would contend that they begin with a different world view. Good business recognizes the infinite value of the users as the generators of the Almighty Paycheck, and the database as the permanent record of a job well done.

This worldview forces us into a humble position with respect to our own application code, which is that it is little more than a waiter, carrying orders to the kitchen and food back to the patrons. When we see it this way, the goal becomes to write code that can efficiently get data back and forth. A small handful of library routines can trap SQL injection, validate types, and ship data off to the database. Another set can generate HTML, or, can simply pass JSON data up to those nifty browser client libraries like ExtJS (now "Sencha" for some reason).
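As a rough illustration of that "waiter" layer, here is a minimal sketch in Python: a single helper that runs a parameterized query (leaving the quoting to the driver, which is the standard guard against SQL injection) and hands the rows back as JSON ready for a browser client. The table and helper names are invented:

```python
import json
import sqlite3

def fetch_as_json(conn, sql, params=()):
    """Run a parameterized query and return the rows as a JSON string.

    Parameter binding is left to the driver, which is the standard
    defense against SQL injection; no per-table code is required."""
    conn.row_factory = sqlite3.Row  # rows become name-addressable
    rows = [dict(r) for r in conn.execute(sql, params)]
    return json.dumps(rows)

# Demo with a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (id integer primary key, name text)")
conn.execute("insert into customers values (1, 'Acme')")
payload = fetch_as_json(conn, "select * from customers where id = ?", (1,))
```

One generic routine like this covers every "carry the food to the patron" read in the application; nothing about it knows or cares which table is involved.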

This covers a huge amount of what an application does, if you do not have much in the way of business logic.

But How Do You Handle Business Logic?

I have an entire essay on this about half-written, but in short, it comes down to understanding what business logic really is. Update: This post is now available

The tables themselves are the bottom layer of business logic. The table design itself implements the foundation for all of the business rules. This is why it is so important to get it right. The tables are organized using normalization to have a place for everything and everything in its place, and after that the application code mostly writes itself.

The application code then falls into two areas: value-add and no value-add. There is no value-add when the application simply ships data off to the user or executes a user request to update the database. Those kinds of things should be handled with the lightest possible library that gets the job done.

But the value-add stuff is different, where a user's request requires lookups, possibly computations and so forth. The problem here is that a naive analysis of requirements (particularly the transliteration error; scroll down to "The Customer Does Not Design Tables") will tend to generate many cases of perceived need for value-add where a simpler design can reduce these cases to no value-add. But even when the database has been simplified to pristine perfection, there are jobs that require loops, multiple passes and so forth, which must be made idempotent and robust, and which will always require some extra coding. But if you know what you are doing, these always turn out to be like the ERP Allocation example given above: they are a lot more about the data than the classes.

Another huge factor is where you come down on the normalization debate, particularly on the inclusion of derived values. If you keep derived values out of the database, which is technically correct from a limited perspective, then suddenly the value-add code is much more important because without it your data is incomplete. If you elect to put derived values into your database then value-add code is only required when writing to the database, so huge abstractions meant to handle any read/write situation are unnecessary. (And of course, it is extremely important to keep denormalized values correct.)
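As a sketch of that second case, here is one way to confine the derived-value code to the write path so that reads can trust the stored value as-is. This is Python with SQLite and invented table names; a database trigger would be the more database-centric way to enforce the same rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders (order_id integer primary key,
                     total real not null default 0);
create table order_lines (order_id int, amount real);
""")

def add_line(order_id, amount):
    """The single write path: every change to the lines recomputes the
    derived total in the same transaction, so readers never see it stale."""
    with conn:  # one transaction covers the line and the derived value
        conn.execute("insert into order_lines values (?, ?)",
                     (order_id, amount))
        conn.execute(
            "update orders set total = "
            " (select coalesce(sum(amount), 0) from order_lines"
            "   where order_id = ?)"
            " where order_id = ?",
            (order_id, order_id))

conn.execute("insert into orders (order_id) values (1)")
add_line(1, 10.0)
add_line(1, 2.5)
(total,) = conn.execute(
    "select total from orders where order_id = 1").fetchone()
```

Every read of `orders.total` is now no-value-add code; the value-add lives in one small place on the write side.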

And the Rest of It

This essay hardly covers the entirety of making code and data work together. You still have to synchronize schema changes to code, and I still think a data dictionary is the best D-R-Y way to do that.

I hope this essay shows something of why many programmers are so down on ORM, but much more importantly that there are coherent philosophies out there that begin with a different worldview and deliver what we were all doing before ORM and what we will all still be doing after ORM: delivering data back and forth between user and database.

Categories: Development

The Cost of Round Trips To The Server

Kenneth Downs - Sat, 2010-12-11 13:33

A database is not much without the applications that connect to it, and one of the most important factors that affects the application's performance is how it retrieves data from queries. In this essay we are going to see the effect of round trips on application performance.

This blog has two tables of contents, the Complete Table of Contents and the list of Database Skills.

Pulling 15,000 Rows

The test will pull 15,000 rows from a table. We do it three different ways and see which is faster and by how much.

Getting a Lot of Rows

The script below creates a table and puts 1 million rows into it. We want far more rows in the table than we will actually pull so that we can pull fresh rows on every pass through the test. It is deliberately crafted to spread out the adjacent values of the integer primary key. This is because, inasmuch as we can control what is going on, we want every single row to be on a different page, so that in all tests the cost of retrieving the row is roughly the same and we are measuring only the effect of our retrieval methods.

The script can be run without modification in pgAdmin3, and with slight mods on MS SQL Server.

create table test000 (
    intpk int primary key
   ,filler char(40)
)


--  BLOCK 1, first 5000 rows
--  pgAdmin3: run as pgScript
--  All others: modify as required  
--
declare @x,@y;
set @x = 1;
set @y = string(40,40,1);
while @x <= 5000 begin
    insert into test000 (intpk,filler)
    values ((@x-1)*200 +1,'@y');

    set @x = @x + 1;
end

-- BLOCK 2, put 5000 rows aside 
--
select  * into test000_temp from test000

-- BLOCK 3, Insert the 5000 rows 199 more
--          times to get 1million altogether
--  pgAdmin3: run as pgScript
--  All others: modify as required  
--  
declare @x;
set @x = 1;
while @x <= 199 begin
    insert into test000 (intpk,filler)
    select intpk+@x,filler from test000_temp;

    set @x = @x + 1;
end
Test 1: The Naive Code

The simplest code is a straight loop that pulls 15,000 consecutive rows by sending an explicit query for each one.

# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");

# Program 1, Individual explicit fetches
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 1, using $x1 to $x2";
$timeBegin = microtime(true);
while ($x1 <= $x2) {
    $dbResult = pg_exec("select * from test000 where intpk=$x1");
    $row = pg_fetch_array($dbResult);
    $x1++;
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 1, elapsed time: ".$elapsed;
echo "\n";
Test 2: Prepared Statements

The next command asks the server to prepare a statement, but it still makes 15,000 round trips, executing the prepared statement with a new parameter each time. The code looks like this:

# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");

# Program 2, Individual fetches with prepared statements
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 2, using $x1 to $x2";
$timeBegin = microtime(true);
$dbResult = pg_prepare("test000","select * from test000 where intpk=$1");
while ($x1 <= $x2) {
    $pqResult = pg_execute("test000",array($x1));
    $row = pg_fetch_all($pqResult);
    $x1++;
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 2, elapsed time: ".$elapsed;
echo "\n";
Test 3: A single round trip

This time we issue a single command to retrieve 15,000 rows, then we pull them all down in one shot.

# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");

# Program 3, One fetch, pull all rows
$timeBegin = microtime(true);
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 3, using $x1 to $x2";
$dbResult = pg_exec(
    "select * from test000 where intpk between $x1 and $x2"
);
$allRows = pg_fetch_all($dbResult);
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 3, elapsed time: ".$elapsed;
echo "\n";
Results

I ran this five times in a row, and this is what I got:

 TEST            | ELAPSED TIME
-----------------+----------------
 Naive 15,000    | ~1.800 seconds
 Prepared 15,000 | ~1.150 seconds
 One Round Trip  | ~0.045 seconds

Compared to the naive example, the set-oriented fetch of all 15,000 rows in a single shot ran 40 times faster. This is what set-oriented code does for an application.

While the prepared statement option ran faster than the naive option, the set-oriented example still ran 25 times faster than the repeated prepared statements.

I also re-arranged the order of the tests, and the results were the same.

Does Server or Language Matter?

So this test was done using PHP against PostgreSQL; will other servers and client languages get different results? Given the same hardware, a different client language or server is going to have a different spread, but the shape will be the same. Fetching all rows in a single shot beats the living frack out of round trips inside of loops in any client language against any server.
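For what it's worth, the comparison is easy to reproduce in other languages. The sketch below does the naive and set-oriented versions in Python; it uses in-process SQLite, which has no network round trip at all, so the timing spread will be far smaller than against a real server, but the structure of the two approaches is identical:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("create table test000 (intpk integer primary key, filler text)")
conn.executemany("insert into test000 values (?, ?)",
                 ((i, "x" * 40) for i in range(1, 15001)))

# Naive: one query per row, 15,000 trips into the library.
t0 = time.perf_counter()
naive = [conn.execute("select * from test000 where intpk = ?", (i,)).fetchone()
         for i in range(1, 15001)]
t_naive = time.perf_counter() - t0

# Set-oriented: one query, one fetch of the whole set.
t0 = time.perf_counter()
batch = conn.execute(
    "select * from test000 where intpk between 1 and 15000"
    " order by intpk").fetchall()
t_batch = time.perf_counter() - t0

assert naive == batch  # same rows either way; only the trip count differs
```

Print `t_naive` and `t_batch` on your own hardware; the ratio shrinks without a network in the way, but the set fetch still wins.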

Putting It Into Use

The most obvious conclusion is that any query returning more than 1 row should return all rows as a set. The advantage is so stark with large row counts that it is worthwhile making this the default for our applications, unless we can find a very good reason not to. So what would the objections be?

One objection might go something like, "Ken, I see the numbers, but I know my app very well and we never pull more than 10-20 rows in a pop. I cannot imagine how it would matter at 10-20 rows, and I do not want to recode." This makes sense, so I ran a few more tests with 20 and 100 rows, and found that, on my hardware, you need about 100 rows to see a difference. At 20 rows all three are neck-and-neck, and at 100 the set is pulling 4 times faster than the prepared statement and 6 times faster than the naive statement. So the conclusion is not an absolute after all; some judgment is in order.

Another thing to consider is how many simultaneous reads and writes might be going on at any given time. If your system is known to have simultaneous transactions running regularly, then the complete fetch may be a good idea even if you do some tests for best-guess row count and the tests are inconclusive. The reason is that the test is a single user case, but multiple simultaneous users put a strain on the database, even when they are not accessing the same tables. In this case we want the application to play the "good citizen" and get in and out as quickly as possible to reduce strain on the server, which will improve the performance of the entire application, not just the portions optimized for complete fetches.

Another objection might be, "Well, my code needs to pull from multiple tables, so I cannot really do this. When we do -PROCESS-X- we go row by row and need to pull from multiple tables for each row." In this case you *definitely* need to go set-oriented and pull all associated quantities down in a query with a JOIN or two. Consider this: if on your particular hardware the ratio of naive row-by-row to single fetch is 10, and you must pull from 2 other tables for each row, that means you are really running 30 times slower (the ratio of 10 times 3 reads) than you could be.

A Final Note About PHP, Data Structures, and Frameworks

Back when dinosaurs ruled the Earth and there was no internet (outside of Universities, etc), the languages we used had specialized data structures that were tuned to database use. Compared to those older systems the newer languages born on the internet are more or less starving for such a data structure.

PHP gets by fairly well because its associative array can be used as a passive (non object-oriented) data structure that comes pretty close to what we had before.

I bring this up because the choice of a language and its support for a "fetch all" operation obviously impacts how well the conclusions of this post can be implemented. If your mapping tool has an iterator that absolves you of all knowledge of what is going on under the hood, it may be worthwhile to see if it is doing a complete fetch or a row-by-row.

Categories: Development

Submit Analysis Request to the Database Programmer

Kenneth Downs - Wed, 2010-12-08 17:50

I generally do not reveal too many details about systems I design for customers or employers. This leaves me sometimes in a bind for example material. I either have to simplify it beyond what I would like, or make something up that I have not actually put into Production.

On top of that, one of the key themes of this blog is that table design is a crucial skill, and if the examples I give do not match what you are doing, they may be hard to make use of.

So I would like to invite analysis requests. Go over to the Contact the Author page and drop me an email and tell me about the system you are trying to design or optimize.

There are no rules on the type of system.

The most interesting mini-projects would be those where advice you have been given elsewhere (or here for that matter) does not seem to fit.

I will do my best to reply, even if I have to say no, so that nobody is left wondering.

Remember this blog is one of those hobby/professional things, good for all of us but nobody is getting paid, so if you are in a terrible hurry this might not be the best thing.

Categories: Development

A Case When Table Design is Easy and Predictable

Kenneth Downs - Thu, 2010-12-02 20:36

Good table design is a great foundation for a successful application stack. Table design patterns basically resolve into master tables and transaction tables. When we know a thing or two about the master tables (or entities if you prefer), we can infer a great deal about the transactions.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

A Time Billing System

Imagine we have been asked to recode the company's time-billing system. Because this is for the company we work for, we have some inside knowledge about how things work. We know that:

  • There are, of course, customers.
  • ....and employees who record time
  • Each hour we record goes against a Work Order
  • There are different kinds of jobs, like project management, programming, programming management, and others.

Knowing only this, is it possible to anticipate what the system will look like? A safe answer is "no", on the claim that we will undoubtedly learn more, but this safe answer happens to be wrong. We can in fact anticipate the overall shape of the system, and new information will shift details, but it will not change the shape.

We can anticipate the nature of the transactions if we determine the upper bound of complexity and the combinatorial completeness of the system.

The Upper Bound of Complexity

We can safely assume that the big number to get right is going to be the billing rate. Our employer assumes we will get everything else right, but the billing rate is going to have them chewing their fingernails until they know we understand it and have coded it correctly.

The cool thing is that we already have enough information to establish an upper bound on the complexity of the system by looking at the master tables, where a master table is generally one that lists details about real things like people, places, things, or activities. So far we know (or think we know) about three master tables:

  • Customers
  • Employees
  • Services

Now we define the upper bound of complexity as:

The upper bound of complexity occurs when the billing rate is determined by all three master entities.

In plain English, calculating a billing rate can be as complicated as looking up a rate specific to a customer for a service for an employee but cannot be more complex than that because there are no other entities with which to work.

Combinatorially Complete

We can also anticipate all possible calculations for the billing rate by working through the complete set of combinations of master entities. This would look like the list below. Note that we are not trying to figure out right now which of these is likely to occur, we just want to get them listed out:

  • Each service has a default rate
  • Each customer has a negotiated rate
  • Each employee bills out at a default rate
  • The combination customer-service may have a rate
  • The combination customer-employee may have a rate
  • The combination customer-service-employee may have a rate (this is the upper bound of complexity, all three master entities determine the rate).

Unless we live in a super-simple world where only the first item in the list is present, we will end up dealing with several if not all of the combinations listed above.

Each of these combinations then becomes a table, and we know the billing rate will be determined by a resolution.

New Information

Now comes the big day and we interview with somebody we'll call "The Explainer" who is going to officially explain the billing system. Can he break what we already know? No. At most he can:

  • Make us aware of new master entities, perhaps there are "projects" and "contracts" that get their own billing arrangements.
  • Dispel our notions about some of the combinations by saying, "Oh we never give a customer a default rate, the default rates come out of the services."

Going in Cold

What about the case where we know absolutely nothing about an assignment when we go in to begin the interviews? We can do a good job of thinking on our feet if we draw "The Explainer" towards the master entities. As we gain confidence that we know what the master entities are, we can ask questions to probe Combinatorial Completeness and the Upper Bound of Complexity.

One caveat: This method works for transactions between master entities. When "The Explainer" starts describing something that cannot be recognized as an interaction between master entities, do not try to stuff the problem into this box, it may not fit.

What About the Application?

At this point, we can also anticipate a lot of what the application will look like. We will need maintenance screens for all of the master entities, and a really slick UI will allow for very easy editing of those various cross-reference combination tables. As long as that much is done, we are almost finished, but not yet.

There will be some billing process that pulls the time entries, finds the correct billing rate for each one, and permanently records the invoices. If we use a resolution this task is child's play to code, debug, and maintain.
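As a sketch of what such a resolution might look like, here is a hypothetical most-specific-first lookup in Python. In the real system each entry would live in one of the combination tables listed earlier; all names here are invented for illustration:

```python
def resolve_rate(rates, customer, service, employee):
    """Walk the combinations from most to least specific; first hit wins.
    `rates` maps (customer, service, employee) keys to a rate, with None
    standing in for 'not specified'."""
    for key in [
        (customer, service, employee),   # the upper bound of complexity
        (customer, None,    employee),   # customer-employee rate
        (customer, service, None),       # customer-service rate
        (customer, None,    None),       # customer's negotiated rate
        (None,     None,    employee),   # employee default
        (None,     service, None),       # service default
    ]:
        if key in rates:
            return rates[key]
    raise LookupError("no rate configured")

rates = {
    (None, "programming", None): 100.0,        # service default
    ("Acme", None, None): 90.0,                # negotiated customer rate
    ("Acme", "programming", "alice"): 120.0,   # fully specific override
}
assert resolve_rate(rates, "Acme", "programming", "alice") == 120.0
assert resolve_rate(rates, "Acme", "programming", "bob") == 90.0
assert resolve_rate(rates, "Zeta", "programming", "bob") == 100.0
```

The ordering of the lookup list is itself a business decision "The Explainer" would confirm, but whatever the order, the billing process reduces to one pass of this resolution per time entry.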

Then of course there is the presentation, the actual bill. Depending on the company, these may be delivered as hardcopy or in email. That will of course have to be coded up.

Conclusion

There are two conclusions. First, as originally stated, many transactions can be anticipated when you know what the master entities are.

But secondly, and every bit as important, once the table design is sound, the application pretty much writes itself. On a personal note, this is probably why I do not find application coding as exciting as I once did. Once I realized that the real challenge and satisfaction was in working out the tables, the coding of the app became a bit of a drudge; it requires no judgment as far as business rules are concerned.

Categories: Development

Happy Holidays everyone!

Java 2 Go! - Thu, 2010-12-02 15:57


This is a summary only. Please, visit the blog for full content and more.
Categories: Development

Creating a centered page layout using ADF Faces (only)

Java 2 Go! - Wed, 2010-12-01 14:07
by Eduardo Rodrigues It’s been a long time since I last posted here about my old friend: ADF. But now that I’m working in ADF BC’s awesome development team, I was already feeling the self-pressure...

This is a summary only. Please, visit the blog for full content and more.
Categories: Development

The Really Cool NTILE() Window Function

Kenneth Downs - Tue, 2010-11-30 21:29

If you regularly code queries and have never been introduced to the windowing functions, then you are in for a treat. I've been meaning to write about these for over a year, and now it's time to get down to it.

Support in Major Servers

SQL Server calls these functions Ranking Functions.

PostgreSQL supports a wider range of these functions than MS SQL Server, having added them in version 8.4, and calls them Window Functions.

Oracle's support is broader (by a reading of the docs) than SQL Server or PostgreSQL, and they call them Analytic Functions.

I try to stay away from MySQL, but I did a quick Google on all three terms and came up with a few forum posts asking when and if they will be supported.

The NTILE() Function

In this post we are going to look at NTILE, a cool function that allows you to segment query results into groups and put numbers onto them. The name is easy to remember because it can create any -tile, a percentile, a decile, or anything else. In short, an n-tile. But it is much easier to understand with an example, so let's go right to it.

Finding percentiles

Consider a table of completed sales, perhaps on an eCommerce site. The Sales Manager would like them divided up into quartiles, four equally divided groups, and she wants the average and maximum sale in each quartile. Let's say the company is not exactly hopping, and there are only twelve sales, which is good because we can list them all for the example. If we already had the quartiles provided then the query would be easy, so if we were lucky enough to be starting with this:

 CUSTTYPE | AMOUNT  | QUARTILE
----------+---------+----------
 RETAIL   |   78.00 |   1
 RETAIL   |  234.00 |   1
 DEALER   |  249.00 |   1
 DEALER   |  278.00 |   2
 RETAIL   |  392.00 |   2
 RETAIL   |  498.00 |   2
 DEALER   |  500.00 |   3
 RETAIL   |  738.00 |   3
 DEALER   | 1250.00 |   3
 RETAIL   | 2029.00 |   4
 RETAIL   | 2393.00 |   4
 RETAIL   | 3933.00 |   4

The query would be child's play if we already had the quartile:

Select quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM ORDERS
 GROUP BY quartile
 ORDER BY quartile
The Problem is We Do Not Have Quartile

The problem of course is that we do not usually have handy columns like QUARTILE provided, but we can generate the QUARTILE column during the query by using NTILE.

Select quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
             , ntile(4) over (order by amount) as quartile
          FROM ORDERS
       ) x
 GROUP BY quartile
 ORDER BY quartile

This query will give us what the Sales Manager wants.

Dissecting the Function and The OVER Clause

The NTILE() function takes a single argument, which tells the server how many groups to divide the data into. If there are not an exact number of rows in each group, the server decides which groups will be missing one row. So in an exact case all of your groups have the same count of rows, but when it does not divide evenly, one or more of them will be one row short.

If you pass 100 to NTILE(), you get a percentile. If you pass 10, you get a decile, and so forth.
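The group-sizing rule can be sketched in a few lines of Python. In PostgreSQL and SQL Server, when the row count does not divide evenly, the earlier groups are the ones that get the extra rows:

```python
def ntile(n, rows):
    """Assign group numbers 1..n the way NTILE() does: divide the rows
    as evenly as possible, giving the leftover rows to the first groups."""
    base, extra = divmod(len(rows), n)
    out = []
    for group in range(1, n + 1):
        size = base + (1 if group <= extra else 0)
        out.extend([group] * size)
    return out

# 10 rows into quartiles: the first two groups get 3 rows, the rest get 2.
assert ntile(4, list(range(10))) == [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

With 12 rows, as in the Sales Manager's example, everything divides evenly and each quartile gets exactly 3 rows.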

The magic is in the OVER() clause. This supports two sub-clauses, and the example shows one, the ORDER BY. Quite simply, the ORDER BY clause tells the server how to line up the rows when adding the NTILE values. The clause is very flexible, and has nothing to do with your query's overall ORDER BY clause.

The Second Clause: PARTITION

Now we will pretend the Sales Manager is not satisfied, and wants separate numbers for the two Customer Types. We could do this if the NTILE() function would create two sets of quartiles, one for each Customer Type, like so:

 CUSTTYPE | AMOUNT  | QUARTILE
----------+---------+----------
 DEALER   |  249.00 |   1
 DEALER   |  278.00 |   2
 DEALER   |  500.00 |   3
 DEALER   | 1250.00 |   4
 RETAIL   |   78.00 |   1
 RETAIL   |  234.00 |   1
 RETAIL   |  392.00 |   2
 RETAIL   |  498.00 |   2
 RETAIL   |  738.00 |   3
 RETAIL   | 2029.00 |   3
 RETAIL   | 2393.00 |   4
 RETAIL   | 3933.00 |   4

We can do this by using the PARTITION BY clause, which tells the server to break the rows into groups and apply the NTILE() numbering separately within each group. The new query would be this:

Select custtype
     , quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
             , ntile(4) over (partition by custtype
                                 order by amount) as quartile
          FROM ORDERS
       ) x
 GROUP BY custtype,quartile
 ORDER BY custtype,quartile
Bonus Points: The Median

Now once again the Sales Manager, who is never satisfied, comes down and says that the average is no good, she needs the max and the median sale value within each quartile. To keep it simple, she does not need this broken out by customer type; it can be applied to the entire set.

This is a case where we can use NTILE() twice. The first time we will break all sales up into four groups, to get the quartiles, and then we will break up each quartile into two groups to get the median. The code looks like this:

Select quartile
     , max(case when bitile=1 then amount else 0 end) as medAmount
     , max(amount) as maxAmount
  FROM (
        -- The second pass adds the
        -- 2-tile value we will use to find medians
        SELECT quartile
             , amount
             , ntile(2) over (partition by quartile
                                  order by amount) as bitile
          FROM (
                -- The subquery is necessary
                -- to process all rows and add the quartile column
                SELECT amount
                     , ntile(4) over (order by amount) as quartile
                  FROM ORDERS
               ) x1
       ) x2
 GROUP BY quartile
 ORDER BY quartile

The magic here is that we know we've divided the data evenly into four sets, so the median will be the maximum value half way through each set. In other words, it will be the maximum value when the value of bitile=1 for each quartile.

One More Note About Oracle

Once you get down the basics of the OVER clause, Oracle looks really good, because they support the clause over the largest range of functions, at least going by the respective doc pages for each platform.

Categories: Development

Loops Without Cursors

Kenneth Downs - Mon, 2010-11-29 21:19
Looping Without Cursors

Sometimes you need to process a table row-by-row, and the established approach is to use cursors, which are verbose, slow, and painful to code and use.

The Cursor Example

Here is the basic minimum syntax required to loop through a table and get something done. The SQL flavor is MS SQL Server, but it's not much better in any other flavor.

-- I coded this off the top of my head, there
-- may be a minor syntax error or two

-- Most of this is pseudo-code, but take
-- note that it is ordered on column1
declare someCursorName cursor for
 select column1, column2, column3 
   from anyTable
  ORDER BY column1

-- Have to do this now
open someCursorName

-- Now you need to declare some variables
-- For the example I'm just making everything int
declare @column1 int
      , @column2 int
      , @column3 int

-- Gosh, we're actually about to start the loop!  Finally!
fetch next from someCursorName into @column1,@column2,@column3
while @@fetch_status = 0 begin

   --  If you still remember what you actually wanted
   --  to do inside the loop, code it here:

   -- Repeat this line from the top here again:
   fetch next from someCursorName into @column1,@column2,@column3
end

-- Not done yet, these two lines are crucial
close someCursorName
deallocate someCursorName

Call me petty, but what I hate about that code is that I have to refer to specific columns of interest 3 times (not counting the declarations). You refer to them in the cursor declaration and in the two FETCH commands. With a little clever coding, we can vastly simplify this and do it only once.

Using An Ordered Column

We can execute the same loop without the cursor if one of the columns is ordered and unique. Let us say that column1 is the primary key, and is an auto-incremented integer. So it is ordered and unique. The code now collapses down to:

-- I coded this off the top of my head, there
-- may be a minor syntax error or two

-- We can't get around declaring the vars, so do that
declare @column1 int
      , @column2 int
      , @column3 int

-- If you know a safe value for initialization, you
-- can use the code below.  If this is not 100% 
-- safe, you must query for the value or it must
-- be supplied from some other source
set @column1 = -1

-- BONUS POINTS: Can this become an infinite loop?
while 1 = 1 begin

-- Now we code the query and exit condition
 select TOP 1
        @column1 = column1
      , @column2 = column2
      , @column3 = column3 
   from anyTable
  WHERE column1 > @column1  -- this is what advances the loop
  ORDER BY column1

if @@rowcount = 0 begin
    break
end

        -- Put the actions here        

end
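The same pattern translates directly to any client language. Here is a minimal sketch in Python using sqlite3 (the table and column names follow the article's example but are otherwise assumptions; T-SQL's TOP 1 becomes LIMIT 1, and checking for an empty fetch plays the role of @@rowcount):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anyTable (column1 INTEGER PRIMARY KEY,"
             " column2 INTEGER, column3 INTEGER)")
conn.executemany("INSERT INTO anyTable VALUES (?, ?, ?)",
                 [(1, 10, 100), (2, 20, 200), (3, 30, 300)])

last_key = -1    # safe initializer: all real keys are positive
processed = []
while True:
    # The WHERE clause is what advances the loop
    row = conn.execute(
        "SELECT column1, column2, column3 FROM anyTable"
        " WHERE column1 > ? ORDER BY column1 LIMIT 1",
        (last_key,)).fetchone()
    if row is None:   # no row fetched: the loop is done
        break
    last_key = row[0]
    processed.append(row)   # put the real per-row action here
```

Each iteration issues one indexed single-row query, so the columns of interest are named exactly once, in the SELECT.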

Final Notes

The only requirement for this approach is that you have a unique ordered column. This usually means a unique key or primary key. If "column1" is not unique, the loop will process only the first row in each group of duplicate values and skip the rest.

Also, it is very nice if you know a safe value to use as an initializer. Without that, you must query for the minimum value that matches the condition and then decrement it by one.
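As a sketch of that fallback (again in Python with sqlite3; the table name is the article's example), querying for the minimum and decrementing it by one yields a starting value that the first iteration will advance past:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anyTable (column1 INTEGER PRIMARY KEY,"
             " column2 INTEGER)")
conn.executemany("INSERT INTO anyTable VALUES (?, ?)",
                 [(5, 50), (9, 90), (12, 120)])

# No known-safe constant, so query for one less than the smallest key
start = conn.execute("SELECT MIN(column1) - 1 FROM anyTable").fetchone()[0]
# start is now one below the minimum key, so the first iteration of the
# loop (WHERE column1 > start) fetches the first row
```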

Finally, can this loop become infinite? No. Well, in the extremely unlikely situation that rows are being added to the base table faster than you are processing them, it could go on for a very long time. But if that were happening I'd say there was a separate problem to look at.

It should probably go without saying, but if the loop is going to run very often, the table should be indexed on your unique ordered column. If it is a primary key or you already have a unique constraint, it is not necessary to create an index explicitly because one already exists as part of the key or constraint.

Categories: Development
