Bad Data Costs Big Money, Part 2


What is ETL?

ETL is a type of data integration / migration effort that refers to the three steps (extract, transform, load) used to blend data from multiple sources. It’s also often used to build a data warehouse. We use it mostly to prepare a new Salesforce instance to load data.


Data extraction involves extracting data from homogeneous and/or heterogeneous sources. This data is normally stored in a staging database for processing. Data storage can be many different sources. It can be a local Microsoft SQL Server, Postgres SQL server, MySQL server, local file storage, etc. It also can be cloud based database storage as well.


Data transformation processes data by data cleansing and transforming the data (can be multiple sources) into a proper storage format/structure for the purpose of querying and analysis. This transformation typically comes in as execution scripts that are built to manipulate data in such a way to make it ready to load to a typical system, such as Salesforce. This process is more often than not the most tedious of all three processes. When the systems are dissimilar, this can take a while to get the data in a form that will be accepted in the new system. A properly designed ETL system extracts data form the source systems, enforces data quality and consistency standards, conforms the data so that the separates sources can be used together, and finally delivers data in a presentation-ready format.


The load phase loads the data into the end target, which can be any data store or application. This process is normally process when doing data migration work. Normally we do not override any data in the target system, it will be all new records. However, we have had times where we needed to write over a previous ETL process or update records based upon a criteria we did not have at the time.

In the next blog we will be spending more time in the transformation phase of the ETL process. We will be looking more closely at how we do de-duplication work and how to ensure the target system load is clean of duplicate data.