Re: SQL Humor

From: Mikito Harakiri <mikharakiri_nospaum_at_yahoo.com>
Date: 19 Aug 2005 16:18:03 -0700
Message-ID: <1124493483.455266.82160_at_g47g2000cwa.googlegroups.com>


Hugo Kornelis wrote:
> On 18 Aug 2005 15:02:56 -0700, Mikito Harakiri wrote:
>
> >Hugo Kornelis wrote:
> >> Many companies use a short mnemonic code for their customers. The use of
> >> a short mnemonic code for products is not uncommon either. Imagine a
> >> customers table with 10,000 customers, a products table with 5,000
> >> products and an orders table with a few million rows. Now would you
> >> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
> >> for ProductCode?
> >
> >But if the lookup table cardinality goes up, then the storage factor of
> >char(n) vs varchar(n) goes down! Perhaps you can convince me that the
> >effect of the two trailing bytes is not as minuscule as I previously
> >thought, but it just can't be significant. Any benchmark demonstrating
> >that the performance degradation is not in single percentage digits is
> >welcome.
>
> Hi Mikito,
>
> Below is the script I used to compare performance. I compared three
> tasks:
>
> 1. Performing lots of single inserts (speed of inserts matters in
> databases that have to process thousands of new rows per second)
> 2. Adding an index (not an everyday task, but there are situations where
> a process can be sped up by adding an index, doing the process, then
> dropping the index again)
> 3. Generating a typical report that involves some joins (the CASE
> expressions I used may look silly, but imagine that I am using a payment
> status instead - also note that I store the results in a temporary table
> first, to make sure that network speed won't influence the results)
>
> You'll find the full code further below. But first the results:
>
> 1. For the inserts, the elapsed time on my machine was 118156 ms with
> char, 124406 ms with varchar. A degradation of over 5 %.
>
> 2. Adding the index took 2840 ms with char, 3236 ms with varchar. A
> performance degradation of almost 14 %.
>
> 3a. Generating the report with the index created in step 2 took 224 ms
> on average for char; 234 ms on average for varchar (I used three
> consecutive executions for both tests, each time starting with a clean
> cache). Performance degradation: 4.5 %.
> 3b. Generating the same report after dropping the supporting index
> showed an even bigger difference: avg 2281 for char; avg 3250 for
> varchar. Degradation: over 40 %.
>
> 1a. To top it off, I repeated the insertion test, but this time AFTER
> creating the extra index. For char, the time taken was 227936 ms; for
> varchar 313110 ms. Performance degradation: 37 %.
>
> So here's your choice: either you keep the index, so the queries degrade
> by "only" 4.5% but inserts slow down by 37%, or you discard the index
> and reverse these figures (inserts slowed down by 5%; queries slowed
> down by 40%).
>
> Or you choose char instead of varchar when you know that the length is
> more or less fixed. :-)
>
>
> Here's the code I used for this benchmark:
>
> -- Set up the tables
> CREATE TABLE Clients
> (ClientID char(6) NOT NULL,
> ClientName varchar(36) NOT NULL,
> PRIMARY KEY (ClientID)
> )
> CREATE TABLE Products
> (ProdID char(5) NOT NULL,
> ProdName varchar(30) NOT NULL,
> PRIMARY KEY (ProdID)
> )
> CREATE TABLE Orders
> (ClientID char(6) NOT NULL,
> ProdID char(5) NOT NULL,
> PRIMARY KEY (ClientID, ProdID),
> FOREIGN KEY (ClientID) REFERENCES Clients,
> FOREIGN KEY (ProdID) REFERENCES Products
> )
> go
> -- Put some rows in the Clients table
> DECLARE @i int, @r int, @NumClients int
> SET @NumClients = 10000
> SET @i = 0
> WHILE @i < @NumClients
> BEGIN
>    SET @r = CAST(@i AS bigint) * 308915776 / @NumClients
>    INSERT Clients (ClientID, ClientName)
>    VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26))
>         + CHAR(65 + ((@r / 676) % 26)) + CHAR(65 + ((@r / 17576) % 26))
>         + CHAR(65 + ((@r / 456976) % 26)) + CHAR(65 + ((@r / 11881376) % 26)),
>          LEFT(CAST(NewID() as VARCHAR(36)), 10 + RAND() * 26))
>    SET @i = @i + 1
> END
> select count(*) from Clients
> go
> -- Put some rows in the Products table
> DECLARE @i int, @r int, @NumProds int
> SET @NumProds = 5000
> SET @i = 0
> WHILE @i < @NumProds
> BEGIN
>    SET @r = CAST(@i AS bigint) * 11881376 / @NumProds
>    INSERT Products (ProdID, ProdName)
>    VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26))
>         + CHAR(65 + ((@r / 676) % 26)) + CHAR(65 + ((@r / 17576) % 26))
>         + CHAR(65 + ((@r / 456976) % 26)),
>          LEFT(CAST(NewID() as VARCHAR(36)), 5 + RAND() * 25))
>    SET @i = @i + 1
> END
> select count(*) from Products
> go
> -- Have each client order a random number of products (20 to 40 random picks; duplicates are skipped)
> DECLARE @Start datetime, @End datetime
> SET @Start = CURRENT_TIMESTAMP
> DECLARE @c int, @i int, @p int, @r int, @NumClients int, @NumProds int,
>         @ClientID char(6), @ProdID char(5), @Amt tinyint
> SET @NumProds = 5000
> SET @NumClients = 10000
> SET @c = 0
> WHILE @c < @NumClients
> BEGIN
>    SET @r = CAST(@c AS bigint) * 308915776 / @NumClients
>    SET @ClientID =
>          CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26))
>        + CHAR(65 + ((@r / 676) % 26)) + CHAR(65 + ((@r / 17576) % 26))
>        + CHAR(65 + ((@r / 456976) % 26)) + CHAR(65 + ((@r / 11881376) % 26))
>    SET @Amt = RAND() * 20 + 20
>    SET @i = 0
>    WHILE @i < @Amt
>    BEGIN
>       SET @r = CAST(RAND() * @NumProds AS bigint) * 11881376 / @NumProds
>       SET @ProdID =
>             CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26))
>           + CHAR(65 + ((@r / 676) % 26)) + CHAR(65 + ((@r / 17576) % 26))
>           + CHAR(65 + ((@r / 456976) % 26))
>       INSERT Orders (ClientID, ProdID)
>       SELECT @ClientID, @ProdID
>       WHERE NOT EXISTS (SELECT *
>                         FROM Orders
>                         WHERE ClientID = @ClientID
>                           AND ProdID = @ProdID)
>       SET @i = @i + 1
>    END
>    SET @c = @c + 1
>    IF @c % 100 = 0
>       PRINT @c
> END
> SET @End = CURRENT_TIMESTAMP
> SELECT 'Generating orders', @Start, @End, DATEDIFF(ms, @Start, @End) AS Elapsed
> select count(*) from Orders
> go
> -- Add an index
> DECLARE @Start datetime, @End datetime
> SET @Start = CURRENT_TIMESTAMP
> CREATE INDEX Orders_ProdID ON Orders(ProdID)
> SET @End = CURRENT_TIMESTAMP
> SELECT 'Adding an index', @Start, @End, DATEDIFF(ms, @Start, @End) AS Elapsed
> go
> -- DROP INDEX Orders.Orders_ProdID
> -- Flush buffers
> CHECKPOINT
> DBCC DROPCLEANBUFFERS
> DBCC FREEPROCCACHE
> go
> -- Perform a typical query
> DECLARE @Start datetime, @End datetime
> SET @Start = CURRENT_TIMESTAMP
> SELECT p.ProdID,
>        COUNT(CASE WHEN LEN(c.ClientName) < 20 THEN 1 END) AS ShortNameOrders,
>        COUNT(CASE WHEN LEN(c.ClientName) > 30 THEN 1 END) AS LongNameOrders,
>        COUNT(c.ClientName) AS TotalOrders
> INTO   #Results
> FROM   Products AS p
> INNER JOIN Orders AS o
>       ON o.ProdID = p.ProdID
> INNER JOIN Clients AS c
>       ON c.ClientID = o.ClientID
> WHERE  p.ProdID LIKE 'D%'
> GROUP BY p.ProdID
> SET @End = CURRENT_TIMESTAMP
> SELECT 'Query with join', @Start, @End, DATEDIFF(ms, @Start, @End) AS Elapsed
> SELECT * FROM #Results
> DROP TABLE #Results
> go
> -- Clean up the mess
> DROP TABLE Orders
> DROP TABLE Products
> DROP TABLE Clients
> go

Hugo,

I was going to perform this test in Oracle, but then I came across the following passage:

Tom...
I was told once that if you only need a single char, use CHAR(1), since using VARCHAR2(1) has overhead due to Oracle having to maintain how long the value is in a VARCHAR2 field.

Is this a load of rubbish or is there some truth to it?

Followup:

that is rubbish.

A char(n) is a varchar2(n) that is blank padded to its maximum length.  The
implementation of a char physically on disk is identical to a varchar2 -- there is a length byte for a char(1) just as there is for a varchar2(1).

And http://www.jlcomp.demon.co.uk/faq/char_vs_varchar2.html makes the same point:

3. As opposed to a widespread misunderstanding, char(1) does *not* use less space in the database than varchar2(1) when the value 'a' is stored, because char(1) also has a length field. This can be proved by the dump() function.
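
For anyone who wants to see it for themselves, something along these lines
should do the trick (a quick sketch, untested; the throwaway table name t
is mine):

CREATE TABLE t (c CHAR(1), v VARCHAR2(1));
INSERT INTO t VALUES ('a', 'a');
SELECT DUMP(c) AS char_dump, DUMP(v) AS varchar2_dump FROM t;
-- both dumps report a single data byte for 'a' (Len=1: 97), and the
-- length information is maintained for the char column just as it is
-- for the varchar2 column
DROP TABLE t;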

I guess this storage quirk doesn't really affect TPC-C benchmarks :-)

It might be interesting, however, to dig down into the insert performance difference you measured. As I mentioned, 2 extra bytes shouldn't really affect the speed of an index unique scan.
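
If I do rerun the insert test, I assume the varchar side of your comparison
is simply the same script with the key columns redeclared, e.g. for Clients
(and likewise varchar(5) for ProdID in Products and Orders):

CREATE TABLE Clients
(ClientID varchar(6) NOT NULL,
 ClientName varchar(36) NOT NULL,
 PRIMARY KEY (ClientID)
)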

BTW, I always use INTEGERs for id columns.
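
Just to illustrate what I mean -- a sketch only, reusing your table names
(ClientCode and ProdCode are column names I made up for the mnemonic codes):

CREATE TABLE Clients
(ClientID   int         NOT NULL,   -- surrogate key
 ClientCode char(6)     NOT NULL,   -- mnemonic code kept as an ordinary attribute
 ClientName varchar(36) NOT NULL,
 PRIMARY KEY (ClientID),
 UNIQUE (ClientCode)
)
CREATE TABLE Products
(ProdID   int         NOT NULL,
 ProdCode char(5)     NOT NULL,
 ProdName varchar(30) NOT NULL,
 PRIMARY KEY (ProdID),
 UNIQUE (ProdCode)
)
CREATE TABLE Orders
(ClientID int NOT NULL,
 ProdID   int NOT NULL,
 PRIMARY KEY (ClientID, ProdID),
 FOREIGN KEY (ClientID) REFERENCES Clients,
 FOREIGN KEY (ProdID) REFERENCES Products
)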
