Re: data cleansing: externally or internally?
From: Cliff <clifford1.buetikofer_at_yahoo.com>
Date: Sun, 6 Nov 2011 07:27:02 -0500
Message-ID: <2011110607270228239-clifford1buetikofer_at_yahoocom>
My 2 cents and IMHO:
Database
CON
- additional storage space
- increased backup time, as you'd basically be making multiple copies of your data as you transform it
- version control: do not do this in the database, as it's easier to
keep track of your transformations via file names than via multiple
tables in the database.
PRO
- If you're transforming current production data being used every day, you know you have the current dataset all of the time. However, I would write some pre-code to verify my assumptions prior to each run.
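
To illustrate the "pre-code" idea, here is a minimal sketch of assumption checks that run before a transformation. It is only an example: the table `customers`, its columns, and the two checks are made up, and sqlite3 stands in for whatever database you actually use. Each check is a query that counts the rows violating one assumption, so a result of 0 means the assumption holds.

```python
import sqlite3

def check_assumptions(conn, checks):
    """Run each (description, sql) pair; sql must return a count of violating rows."""
    failures = []
    for description, sql in checks:
        (violations,) = conn.execute(sql).fetchone()
        if violations:
            failures.append((description, violations))
    return failures

# Hypothetical sample data standing in for the production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com")],
)

# Hypothetical assumptions the transformation relies on.
CHECKS = [
    ("email is never NULL",
     "SELECT COUNT(*) FROM customers WHERE email IS NULL"),
    ("id is unique",
     "SELECT COUNT(*) FROM "
     "(SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1)"),
]

failures = check_assumptions(conn, CHECKS)
for description, violations in failures:
    print(f"assumption failed: {description} ({violations} violation(s))")
```

If `failures` is non-empty, I would stop the run there rather than transform data that no longer looks the way the script expects.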
External file
CON
- transforming external to the database can be more complex if you have multiple decision points in your transformation rules.
- Some editing tools choke on huge file sizes
PRO
- easier to export out of the database and perform multiple small transformations with your favorite tool (sed and awk are mine), and easier to debug.
- the additional file space used by the transformation iterations is easier to recover through file manipulation than by shrinking a tablespace or creating a huge tablespace specifically for this task.
- Depending on whether this is a one-off or a monthly clean-up script, consider having your script run in 2 modes: one that reports proposed changes and one that reports the changes that were made. That way you can get business sign-off on what your script is actually doing.
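
A rough sketch of the 2-mode idea, for what it's worth. The cleaning rule (`clean_record`, which just trims and collapses whitespace) and the sample records are made up for illustration; the point is only that the same code path produces the "proposed" report and the "made" report, so what the business signs off on is what actually runs.

```python
def clean_record(line):
    """Hypothetical cleaning rule: trim and collapse runs of whitespace."""
    return " ".join(line.split())

def run_cleanup(lines, apply=False):
    """Return (output_lines, report).

    With apply=False nothing is changed and the report lists proposed changes.
    With apply=True the output is cleaned and the report lists changes made.
    """
    report = []
    output = []
    for number, original in enumerate(lines, start=1):
        cleaned = clean_record(original)
        if cleaned != original:
            verb = "changed" if apply else "would change"
            report.append(f"line {number}: {verb} {original!r} -> {cleaned!r}")
        output.append(cleaned if apply else original)
    return output, report

sample = ["  Acme   Corp ", "Widget Co"]

# Mode 1: proposed changes only, data left untouched.
_, proposed = run_cleanup(sample)
print("\n".join(proposed))

# Mode 2: apply the changes and report what was done.
cleaned, made = run_cleanup(sample, apply=True)
print("\n".join(made))
```

The first report goes to the business for sign-off; the second can be kept alongside the output file as a record of the run.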