Re: data cleansing: externally or internally?

From: Robert Klemme <shortcutter_at_googlemail.com>
Date: Fri, 04 Nov 2011 10:23:02 +0100
Message-ID: <9hhp7mFcmiU1_at_mid.individual.net>



On 11/04/2011 07:51 AM, geos wrote:
> there is a big text file with dirty data. a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level

What exactly do you mean by "system level"? Is this an alias for "without database"?

> 2) or in a database
> for the latter case it looks to me that I could use external tables or
> load data into temporary table and then do the cleaning.
>
> I am looking for pros and cons of each variant. my intuition tells me
> that loading into temporary table would give the most flexibility but
> also take additional space. I am not sure about the other methods. I
> would appreciate your opinion about what I should pay attention to when
> choosing the other methods. how are they restricted in terms of
> performance, flexibility and capabilities (eg. multitable loading)? I am
> also interested in good practices and your experience in similar cases
> you can share.

There are a lot of questions to be asked before one can come up with reasonable advice:

Is this a one off task or needs to be done on a regular basis? What data is contained in the file?
What format is the file?
How large is the file?
Where does the data ultimately go (next processing stage)? What does "dirty data" mean? Put differently: what kind of cleanup activity do you need to do?
What other requirements are there (execution speed, development time / cost)?

Cheers

        robert Received on Fri Nov 04 2011 - 04:23:02 CDT

Original text of this message