Re: Internet search engines and databases

From: KenNorth <knorth_at_my-deja.com>
Date: 2000/04/15
Message-ID: <#vRiDBrp$GA.233_at_cpmsnbbsa03>#1/1


> On Fri, 7 Apr 2000 21:11:44 -0700, "KenNorth" wrote:
> >Even if you got every Web site in the world to expose its metadata as XML,
> >you'd still have a problem with semantics. Is my <ID> the same as your
> ><ID>?

Mark Preston wrote:

> With respect, Ken, I disagree for the following reasons.
> It matters not a bit - you will be using their DTD or Schema for their
> data, not yours. It only matters if you want to actually import their
> data into your database

If you are building a search engine that is going to search across thousands of Web sites, your focus is not importing data but scanning it to build indexes. The indexes exist to make queries fast, but they are pointless if each query still has to access a site's DTDs or schemas.

The program that indexes your site has to scan data, including DTDs or schemas. It should have semantic understanding and recognize similarities, enabling the engine to return hits when there are not precise matches. For example, assume Acme made Transporters, but Megacorp came along and bought Acme.

Site A may contain this information:

<WarningID> 2300-01 </WarningID>
<Manufacturer> Acme </Manufacturer>
<Product> Transporter </Product>
<Warning> This model requires regular maintenance. MTBF is 5000 hours when
installed in starships powered by dilithium crystals. Service it before each voyage and perform regular maintenance checks. </Warning>

Site B may contain this information:

<WarningID> 2300-01 </WarningID>
<Manufacturer> Megacorp </Manufacturer>
<Product> Transporter </Product>
<Hazard> This model requires regular maintenance. MTBF is 5000 hours when
installed in starships powered by dilithium crystals. Service it before each voyage and perform regular maintenance checks. </Hazard>

The indexing process should recognize that these two records are the same, even if site A's data and schemas are a version behind site B's.
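
As a rough sketch of what that recognition could look like at indexing time, the Python below maps both records onto one canonical form before comparing them. The lookup tables TAG_SYNONYMS and MANUFACTURER_ALIASES, and the helper names normalize and same_record, are assumptions made up for illustration; a real indexer would need a much richer source of semantic knowledge than two hard-coded dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical knowledge the indexer would need from somewhere:
# which tags mean the same thing, and which company names were merged.
TAG_SYNONYMS = {"Hazard": "Warning"}
MANUFACTURER_ALIASES = {"Acme": "Megacorp"}


def normalize(fragment):
    """Parse a rootless XML fragment into a dict keyed by canonical tag names."""
    # The fragments have no single root element, so wrap them in one.
    root = ET.fromstring("<record>" + fragment + "</record>")
    record = {}
    for element in root:
        tag = TAG_SYNONYMS.get(element.tag, element.tag)
        # Collapse line breaks and repeated spaces inside the element text.
        text = " ".join((element.text or "").split())
        if tag == "Manufacturer":
            text = MANUFACTURER_ALIASES.get(text, text)
        record[tag] = text
    return record


def same_record(fragment_a, fragment_b):
    """True when both fragments normalize to the same canonical record."""
    return normalize(fragment_a) == normalize(fragment_b)


SITE_A = """
<WarningID> 2300-01 </WarningID>
<Manufacturer> Acme </Manufacturer>
<Product> Transporter </Product>
<Warning> This model requires regular maintenance. MTBF is 5000 hours when
installed in starships powered by dilithium crystals. Service it before each voyage and perform regular maintenance checks. </Warning>
"""

SITE_B = """
<WarningID> 2300-01 </WarningID>
<Manufacturer> Megacorp </Manufacturer>
<Product> Transporter </Product>
<Hazard> This model requires regular maintenance. MTBF is 5000 hours when
installed in starships powered by dilithium crystals. Service it before each voyage and perform regular maintenance checks. </Hazard>
"""

print(same_record(SITE_A, SITE_B))  # prints True
```

The point of doing this once, when the site is scanned, is that the index then stores the canonical record and a query never has to consult either site's DTD or schema.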

Received on Sat Apr 15 2000 - 00:00:00 CEST
