Re: data cleansing: externally or internally?
From: mhoys <matthias.hoys_at_gmail.com>
Date: Fri, 4 Nov 2011 02:26:08 -0700 (PDT)
Message-ID: <f3c9c130-ccec-4d18-b421-0009c252bcc5_at_gy7g2000vbb.googlegroups.com>
On Nov 4, 7:51�am, geos <g..._at_nowhere.invalid> wrote:
> there is a big text file with dirty data. a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level
> 2) or in a database
> for the latter case it looks to me that I could use external tables or
> load data into temporary table and then do the cleaning.
>
> I am looking for pros and cons of each variant. my intuition tells me
> that loading into temporary table would give the most flexibility but
> also take additional space. I am not sure about the other methods. I
> would appreciate your opinion about what I should pay attention to when
> choosing the other methods. how are they restricted in terms of
> performance, flexibility and capabilities (eg. multitable loading)? I am
> also interested in good practices and your experience in similar cases
> you can share.
>
> thank you,
> geos
>
> --
> NOTE: Follow Up set to comp.databases.oracle.misc
Date: Fri, 4 Nov 2011 02:26:08 -0700 (PDT)
Message-ID: <f3c9c130-ccec-4d18-b421-0009c252bcc5_at_gy7g2000vbb.googlegroups.com>
On Nov 4, 7:51�am, geos <g..._at_nowhere.invalid> wrote:
> there is a big text file with dirty data. a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level
> 2) or in a database
> for the latter case it looks to me that I could use external tables or
> load data into temporary table and then do the cleaning.
>
> I am looking for pros and cons of each variant. my intuition tells me
> that loading into temporary table would give the most flexibility but
> also take additional space. I am not sure about the other methods. I
> would appreciate your opinion about what I should pay attention to when
> choosing the other methods. how are they restricted in terms of
> performance, flexibility and capabilities (eg. multitable loading)? I am
> also interested in good practices and your experience in similar cases
> you can share.
>
> thank you,
> geos
>
> --
> NOTE: Follow Up set to comp.databases.oracle.misc
Hi,
I've always done it with external/temp tables... but that's because
I'm more familiar with (PL/)SQL than with shell scripting. You could
write your cleaning functions/procedures once and put them in a
package for later reuse. And if you know APEX, it's easy to build a
web interface around them. Some things are also more easy to do with
SQL than with shell scripting, for example data aggregations. I'm not
sure about the performance. Maybe Perl might perform better on very
large flat files (this could also be dependent on the type of file
system).
HTH,
Matthias
Received on Fri Nov 04 2011 - 04:26:08 CDT