Re: fast wildcard searching on millions of strings

From: Steven Ross <sjrus_at_yahoo.com>
Date: 16 Aug 2001 17:31:24 -0700
Message-ID: <44d7582c.0108161631.63220356_at_posting.google.com>


I suspect that with most good implementations the performance will be limited by the memory copying operation for very large numbers of matches. I haven't tested grep under these circumstances, but it should be able to return at least a thousand matches on very large databases (anything under 2GB or the memory of the machine) in less than a second. If not, it is rather inefficient.

Three questions that need to be asked:
1) Is the database highly dynamic with small (single-item) additions or removals at a time? Block additions and removals are not a problem.
2) Does it have to be larger than the memory of the machine it's on? 3) Does it have to be larger than an unsigned int (4GB)?

1 is okay on it's own; 2 and 3 are okay together, but 1&3 or 1&2 pose problems with any database system.

One additional question: How complex is the approximate matching you're looking for? That may be a major issue for high-performance techniques. The only approximate matching I can think of doing easily and efficiently is looking from the start and end for the closest match, but this won't catch more than 2 character errors at most.

With that said, if you're looking for optimal performance, then I have an ordered hierarchical tree structure designed for normal to high cardinality data that should be able to fit your needs, with some interface code. I've been looking for a good demonstration of its performance capabilities. Your problem seems to be rather high cardinality issue, and I would be interested in testing it for this situation. On a Pentium 80Mhz (slow!) it can carry out 3 million (single-item) lookups per second in a 16MB database. I've tested the core algorithm out to 256MB, where it performs rather linearly (constant look-up time). It does quite well with block lookups also (wildcard searches) taking only twice as much time as for a single item search, though full ? and multi-* wildcard searches will take some extra code.

The code is currently operating on LINUX written in ANSI C, and I've tested it on HP-UX, and Solaris. If you're doing something experimental that requires high performance I'd be interested in collaboration. Depending on your requirements, I think that this could give you much better performance than you requested.

rsit_at_ucsd.edu (rsit) wrote in message news:<381ddd3b.0108151252.134b9d13_at_posting.google.com>...
> Thanks a lot! You pointed me into the right direction. Now I need to
> find a good implementation of it. I am investigatine OpenText's
> solution right now. I couldn't find any other good implementations.
> I looked at Sary and SAM and one other one, but it's docs are in
> japanese.
>
> Don't have time to implement a good solution right now. Need one that
> will do wildcard searching and approximate matching for texts
> gigabytes in length. hrmmmm...anyone have any suggestions?
>
> Thanks,
> Ryan
>
> Kai.Grossjohann_at_CS.Uni-Dortmund.DE (Kai Großjohann wrote in message news:<vafbslltb5t.fsf_at_lucy.cs.uni-dortmund.de>...
> > rsit_at_ucsd.edu (rsit) writes:
> >
> > > Thanks for you post. It's just a large list of domain names and URLs
> > > that I need to search on. You can think of it as a bunch of strings
> > > with an average length of 18 characters. I am not searching or doing
> > > anything with the documents at the URLs. I would like to put
> > > wildcards(* or ?) anywhere on the query string.
> >
> > Hm. Searching with Google for "pat array" gets me to this page:
> >
> > http://www.dcc.uchile.cl/~gnavarro/abstracts/wsp96.1.html
> >
> > Maybe you can find the whole paper from there.
> >
> > > I tried to find information on OpenText, couldn't find it yet. Do you
> > > have any links that could point me in the right direction?
> >
> > The obvious URL seems to be the right one. www.opentext.com.
> > Hm. But I didn't say that OpenText is a company. Sorry.
> >
> > kai
Received on Fri Aug 17 2001 - 02:31:24 CEST

Original text of this message