Re: Squeezing spaces out of a string

From: Ernst-Udo Wallenborn <ernst-udo.wallenborn_at_freenet.de>
Date: 04 Feb 2004 23:18:58 +0100
Message-ID: <m3r7xah4j1.udo_at_no.domain.net>


joe.celko_at_northface.edu (--CELKO--) writes:

> This is a LOT faster than hanging in a loop and it is pure SQL, not
> proprietary, procedural code. But now a math problem for you: let
> col_x be VARCHAR(n). What is the optimal mix of replace function
> calls and what should they look like for the general case of (n)?

What is optimal? Smallest nesting depth? Or lowest number of string operations?

> a) (2->1) repeated log2(n) times?
> b) Is it best to always have (k->1)?
> c) (floor(sqrt(n)) -> 1) as a starter?
> d) Fibonacci series?

Let's consider n=6, so the string may be up to 64 characters long. Then the following series of operations will all result in single-spaced strings (a sketch of how such a series spells out as nested REPLACE calls follows the list):

  1. (2->1),(2->1),(2->1),(2->1),(2->1),(2->1)
  2. (3->1),(3->1),(3->1),(3->1),(2->1)
  3. (55->1),(34->1),(21->1),(13->1),(8->1),(5->1),(3->1),(2->1)
  4. (64->1),(32->1),(16->1),(8->1),(4->1),(2->1)
  5. (65->1),(33->1),(17->1),(9->1),(5->1),(3->1),(2->1)
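
To make the (k->1) notation concrete: each step is one REPLACE of a run of k spaces by a single space, so a whole series is just a stack of nested REPLACE calls around col_x (the column from the quoted question). A minimal sketch; Python is used here only to spell out the SQL text, and the helper name is mine:

  def nested_replace_sql(chain, col="col_x"):
      # Spell out a series of (k->1) space replacements as nested SQL
      # REPLACE calls; the first element of chain is the innermost call.
      expr = col
      for k in chain:
          expr = "REPLACE(" + expr + ", '" + " " * k + "', ' ')"
      return expr

  # series 4: (64->1),(32->1),(16->1),(8->1),(4->1),(2->1)
  print(nested_replace_sql([64, 32, 16, 8, 4, 2]))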

As is easily seen, (2->1) will prune the string in at most log2(2^n) = n rounds. This is not always the lowest number of rounds: (3->1) arrives at a string with at most two consecutive spaces in floor(log3(2^n))+1 rounds (that's 4 rounds here, which reduce a string of 63 spaces to 21, then 7, then 3, then 1 space, and a string of 64 spaces to 22, then 8, then 4, then 2). An additional (2->1) step then removes the remaining consecutive spaces.
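
The round-by-round behaviour assumed above (each REPLACE makes one left-to-right, non-overlapping pass and does not rescan its own output) is the same as Python's str.replace, so the 63 -> 21 -> 7 -> 3 -> 1 trace can be checked with a short, purely illustrative sketch:

  def trace_runs(k, chain):
      # Apply a series of (m->1) replacements to a run of k spaces,
      # one pass per element, and record the run length after each pass.
      s = " " * k
      lengths = [len(s)]
      for m in chain:
          s = s.replace(" " * m, " ")   # one non-overlapping pass
          lengths.append(len(s))
      return lengths

  print(trace_runs(63, [3, 3, 3, 3, 2]))   # [63, 21, 7, 3, 1, 1]
  print(trace_runs(64, [3, 3, 3, 3, 2]))   # [64, 22, 8, 4, 2, 1]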

This poses the question: how many string replacements do the algorithms above need to reduce a string of k spaces (1<=k<=2^n) to a single space? This is easy to evaluate: 1) needs exactly k-1 replacements (each (2->1) replacement removes one space), 2) needs roughly half of that, while 3)-5) need a nearly constant number. In fact, the average numbers of string replacement operations for reducing a string of k spaces, 1<=k<=64, to a single space are (the sketch after the list recomputes these figures):

  1. 31.50
  2. 16.00
  3. 2.39
  4. 3.09
  5. 3.00
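
These averages appear to count every individual substring replacement as one operation, summed over all rounds; under that reading the figure for 1. is exactly (0+1+...+63)/64 = 31.50, because each (2->1) replacement removes exactly one space. A sketch that recomputes the list under that assumption (my reconstruction, not the original code):

  def count_ops(k, chain):
      # Number of substring replacements performed while squeezing a run
      # of k spaces, counting one operation per match actually replaced.
      s = " " * k
      ops = 0
      for m in chain:
          ops += s.count(" " * m)        # matches replaced in this pass
          s = s.replace(" " * m, " ")
      return ops

  chains = {1: [2] * 6,
            2: [3] * 4 + [2],
            3: [55, 34, 21, 13, 8, 5, 3, 2],
            4: [64, 32, 16, 8, 4, 2],
            5: [65, 33, 17, 9, 5, 3, 2]}
  for name in sorted(chains):
      avg = sum(count_ops(k, chains[name]) for k in range(1, 65)) / 64.0
      print(name, round(avg, 2))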

This seems to support the Fibonacci series theory. However, real-life strings don't consist of spaces only, and their spaces are not randomly distributed either. But let's assume they were: construct strings whose characters are ' ' with probability p and not ' ' with probability 1-p. Running through 10,000 randomly created 64-character strings, the five algorithms above need the following numbers of string replacement operations (mean, with standard deviation in parentheses; a reconstruction of the simulation follows the table):

          p=0.1          p=0.25         p=0.5          p=0.75         p=0.9
1)        0.63 (0.84)    3.95 (2.29)   15.74 (4.42)   35.40 (5.36)   51.04 (4.30)
2)        0.57 (0.74)    3.17 (1.67)   10.54 (2.55)   20.38 (2.53)   27.07 (1.85)
3)        0.57 (0.74)    3.13 (1.63)    9.40 (2.17)   13.13 (2.01)   10.22 (2.47)
4)        0.57 (0.73)    3.03 (1.56)    9.01 (2.04)   13.60 (1.98)   11.60 (2.71)
5)        0.57 (0.74)    3.13 (1.64)    9.48 (2.19)   13.78 (2.04)   11.30 (2.63)
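
A table like this should be reproducible with a straightforward Monte Carlo sketch such as the one below (again my reconstruction; the filler character 'x' and the use of Python's random module are assumptions, the operation counting is the same as above):

  import random

  def count_ops_str(s, chain):
      # Same counting as count_ops above, but for an arbitrary string.
      ops = 0
      for m in chain:
          ops += s.count(" " * m)
          s = s.replace(" " * m, " ")
      return ops

  def simulate(chain, p, trials=10000, length=64):
      # Mean and standard deviation of the replacement count over random
      # strings whose characters are ' ' with probability p, 'x' otherwise.
      counts = []
      for _ in range(trials):
          s = "".join(" " if random.random() < p else "x" for _ in range(length))
          counts.append(count_ops_str(s, chain))
      mean = sum(counts) / float(trials)
      sdev = (sum((c - mean) ** 2 for c in counts) / float(trials)) ** 0.5
      return mean, sdev

  print(simulate([55, 34, 21, 13, 8, 5, 3, 2], 0.5))   # series 3 at p=0.5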

The more spaces there are in a string, the worse 1) and 2) perform, for an obvious reason: it is very inefficient to prune a long run of spaces only two or three spaces at a time. The other three are similar, with the Fibonacci series getting a slight edge for space-heavy strings. Why? Well, for a string containing m runs of more than one space, the minimum number of string replacement operations is of course m, achieved by the series:

6) (64->1), (63->1), (62->1), (61->1), ..., (2->1)

at the cost of 63 nested rounds (one per possible run length), instead of O(log(length)). Fibonacci has a few more rounds than the doubling series, and the Fibonacci numbers are denser than the powers of 2, so the probability that a particular run is pruned to ' ' in only very few replacements is higher. So it all comes down to the relative costs of nesting depth vs. string operations.
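
As a small sanity check of that claim, the exhaustive series 6) does use exactly one replacement per run of two or more spaces (sketch under the same counting convention as above; the sample string is arbitrary):

  import re

  def count_ops_str(s, chain):               # as in the sketch above
      ops = 0
      for m in chain:
          ops += s.count(" " * m)
          s = s.replace(" " * m, " ")
      return ops

  series_6 = list(range(64, 1, -1))          # (64->1),(63->1),...,(2->1)
  s = "a   b      c d  e"                    # three runs of 2+ spaces
  runs = sum(1 for _ in re.finditer("  +", s))
  print(count_ops_str(s, series_6), runs)    # prints: 3 3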

-- 
Ernst-Udo Wallenborn
Received on Wed Feb 04 2004 - 23:18:58 CET
