Date: Mon, 5 Jan 2009 02:10:13 -0800 (PST)
We are currently thinking of replacing our existing database system,
which is no longer supported.
The data corpus encompasses 4 million records, each with about 30 fields. Half a million records have a full text in PDF format, which we also store as a text field (extracted by a self-made PDF extraction script).
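In essence the extraction script does something like the following minimal sketch; here I use poppler's pdftotext as a stand-in for our actual extractor, and the function name is my own:

```python
import shutil
import subprocess

def extract_pdf_text(pdf_path: str) -> str:
    """Return the plain text of a PDF by shelling out to poppler's pdftotext.

    The trailing "-" argument tells pdftotext to write the extracted
    text to stdout instead of to a file.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (from poppler-utils) is not installed")
    proc = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout
```

The resulting string is what we write into the text field of the record.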
Our DB usage is moderate, about 1000 searches per day. We load-balance with Pound, which distributes queries across 6 different virtual machines (each holding a copy of the database).
The web interface is decoupled from the DB; we use an MVC framework that talks to the DB via an API and retrieves data only. Currently we run on Windows, which could easily be replaced with Linux if need be.
Our existing solution has a built-in thesaurus (the controlled vocabulary is static); in addition, every term carries the number of records currently tagged with it.
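To make the term-count feature concrete, here is a minimal sketch of what the thesaurus maintains (the sample records and vocabulary terms are made up):

```python
from collections import Counter

# Hypothetical sample: each record is tagged with terms from the
# static controlled vocabulary.
records = [
    {"id": 1, "terms": ["hydrology", "climate"]},
    {"id": 2, "terms": ["climate"]},
    {"id": 3, "terms": ["hydrology"]},
]

# For every vocabulary term, count how many records carry that tag.
term_counts = Counter(t for r in records for t in r["terms"])

print(term_counts["hydrology"])  # 2
print(term_counts["climate"])    # 2
```

The new solution would need to keep such counts current as records are added or retagged.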
The new solution should of course perform well; thesaurus
functionality would be nice, as would relevance ranking and
proximity searching, yet these are not a must.
A cost-free solution would be desirable: since we want to open our database to the internet, we might need to add new instances of the DB (virtual machines), and having to pay additional licences would be too expensive for us.
What would you recommend?
1. Our data repository is a large XML file from which we update our database on a weekly basis by means of a self-made update script. Would an XML database be an alternative, especially with regard to performance?
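For context, our update script reads the repository roughly like the following sketch; the element names are invented, and the real dump is read from disk rather than from an in-memory buffer:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical layout of the repository dump; the real structure differs.
xml_data = io.BytesIO(b"""<repository>
  <record id="1"><title>First</title></record>
  <record id="2"><title>Second</title></record>
</repository>""")

records = []
# iterparse streams the file, so a large dump need not fit in RAM at once.
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "record":
        records.append({"id": elem.get("id"),
                        "title": elem.findtext("title")})
        elem.clear()  # free memory held by the already-processed record

print(len(records))  # 2
```

Each parsed record is then upserted into the database, so the weekly run is essentially a full reload.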
2. I was also asked to investigate the possibility of implementing a federated search across 2 to 3 other sources (with different data structures). I assume this would be a different beast and not a feature of the new DB I am looking for. This is not a requirement yet, but what options would I have in that regard?
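As I understand it, a federated search boils down to per-source adapters that map each source's result shape onto a common one and then merge the hits; a minimal sketch (all source names, fields, and scores here are hypothetical):

```python
# Hypothetical adapters: each source returns hits in its own shape.
def search_source_a(query):
    return [{"ttl": "Pound manual", "scr": 0.9}]

def search_source_b(query):
    return [{"name": "Load balancing basics", "rank": 1}]

# Normalizers map each source's shape onto a common {title, score} record.
def normalize_a(hit):
    return {"title": hit["ttl"], "score": hit["scr"]}

def normalize_b(hit):
    # Convert a 1-based rank into a comparable score.
    return {"title": hit["name"], "score": 1.0 / hit["rank"]}

def federated_search(query):
    """Query every source, normalize the hits, and merge by score."""
    hits = [normalize_a(h) for h in search_source_a(query)]
    hits += [normalize_b(h) for h in search_source_b(query)]
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

The hard part in practice is presumably making the scores from unrelated sources comparable at all, which is why I suspect this belongs outside the DB itself.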
Many thanks for your input,
PS: If this group is not the right place, please point me to a proper one.