Re: data cleansing: externally or internally?

From: Cliff <clifford1.buetikofer_at_yahoo.com>
Date: Sun, 6 Nov 2011 07:27:02 -0500
Message-ID: <2011110607270228239-clifford1buetikofer_at_yahoocom>


My 2 cents and IMHO:

Database

        CON

  • additional storage space
  • increased backup time, as you'd basically be making multiple copies of your data as you transform it

  • version control: do not do this in the database, as it's easier to keep track of your transformations via file names than via multiple tables in the database.
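To illustrate tracking transformations via file names, here is a minimal Python sketch, assuming the data has already been exported to a plain text file. The step functions and the numbered file naming scheme are hypothetical, not something from the original thread; the point is only that each step leaves behind its own file.

```python
from pathlib import Path

def strip_whitespace(lines):
    """Trim leading and trailing whitespace from every line."""
    return [line.strip() for line in lines]

def drop_blank(lines):
    """Remove lines that are empty."""
    return [line for line in lines if line]

# Hypothetical transformation steps, applied in this order.
STEPS = [strip_whitespace, drop_blank]

def run_steps(source, workdir):
    """Apply each step in turn and write one numbered file per step."""
    workdir = Path(workdir)
    lines = Path(source).read_text().splitlines()
    written = []
    for number, step in enumerate(STEPS, start=1):
        lines = step(lines)
        out = workdir / f"{number:02d}_{step.__name__}.txt"
        out.write_text("\n".join(lines) + "\n")
        written.append(out)
    return written
```

Diffing two adjacent files (say 01_... against 02_...) then shows exactly what one step did, and deleting the files reclaims the space.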

        PRO

  • If you're transforming current production data being used every day, you know you have the current dataset all of the time. However, I would write some pre-code to verify my assumptions prior to each run.
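As a rough idea of what such pre-code could look like, here is a Python sketch using the standard-library sqlite3 module as a stand-in for the real database. The customers table, its columns and the two rules are made up for illustration; each check is a query that counts violating rows, so a result of 0 means the assumption holds.

```python
import sqlite3

# Hypothetical assumptions about a hypothetical "customers" table.
# Each query counts the rows that VIOLATE the assumption.
CHECKS = {
    "customer_id is never NULL":
        "SELECT COUNT(*) FROM customers WHERE customer_id IS NULL",
    "state codes are two characters":
        "SELECT COUNT(*) FROM customers WHERE LENGTH(state) <> 2",
}

def verify_assumptions(conn, checks):
    """Return a list of (check name, violating row count) for failed checks."""
    failures = []
    for name, sql in checks.items():
        (count,) = conn.execute(sql).fetchone()
        if count != 0:
            failures.append((name, count))
    return failures

def demo_connection():
    """Build a small in-memory table so the sketch can be tried out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "OH"), (2, "Ohio"), (None, "PA")],
    )
    return conn
```

The cleansing run would call verify_assumptions first and stop if the returned list is not empty.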

External file

        CON         

  • transforming external to the database can be more complex if you have multiple decision points in your transformation rules.
  • Some editing tools choke on huge file sizes

        PRO

  • easier to export out of the database and perform multiple small transformations with your favorite tool (sed and awk are mine), and easier to debug.
  • the additional file space used by the transformation iterations is easier to recover by deleting files than by shrinking a tablespace or creating a huge tablespace specifically for this task.
  • Depending on whether this is a one-off or a monthly clean-up script, consider having your script run in 2 modes: one that reports the proposed changes, and one that makes the changes and reports what was changed. That way you can get business sign-off on what your script is actually doing.
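
The two-mode idea could be sketched in Python along these lines. The cleaning rules and the field names (name, state) are hypothetical; the same code path produces the report in both modes, so what the business signs off on is what actually runs.

```python
def clean_row(row):
    """Hypothetical rules: trim the name, trim and upper-case the state."""
    cleaned = dict(row)
    cleaned["name"] = row["name"].strip()
    cleaned["state"] = row["state"].strip().upper()
    return cleaned

def run(rows, apply=False):
    """Return (output_rows, report).

    With apply=False the rows come back untouched and the report lists
    proposed changes; with apply=True the cleaned rows are returned and
    the report lists the changes that were made.
    """
    verb = "changed" if apply else "would change"
    output, report = [], []
    for row_no, row in enumerate(rows, start=1):
        cleaned = clean_row(row)
        for field in row:
            if row[field] != cleaned[field]:
                report.append(
                    f"row {row_no}: {verb} {field} "
                    f"{row[field]!r} -> {cleaned[field]!r}"
                )
        output.append(cleaned if apply else row)
    return output, report
```

Run it once with apply=False, hand the report to the business, then rerun with apply=True and compare the two reports.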
Received on Sun Nov 06 2011 - 06:27:02 CST
