After fifteen years of system upgrades, data integrations, and changes in data collection regulations, this government agency was suffering from inconsistencies in data formats, missing data relationships, and data that violated rules of integrity.

BACKGROUND

This agency manages the implementation of an act that requires it to provide services to workers and to collect about 500 data items and report them back to the federal government. The federal government places strict formatting rules on these data items and also requires them to respect about 800 integrity rules. The quality of the data is directly related to the performance standards that this agency needs to meet.

SITUATION

Over the years, the agency’s system has been upgraded many times for technological reasons and updated many other times to comply with changes in federal regulations. In addition, it has received thousands of records from other systems that were merged into it. The combination of these factors has caused data inconsistencies that make it difficult for the agency to keep up with the strict federal data reporting guidelines. They needed a way to detect data inconsistencies and clean their data.

SOLUTION

We implemented four mechanisms to profile data and detect data inconsistencies. The first was the Data Profiling Task of SQL Server Integration Services (SSIS) to provide statistics and value distributions for each data item. The second used SQL Server Data Quality Services to build a knowledge base that included domains for each data item together with the DQS component of SSIS to validate data batches. The third used TSQL queries for additional profiling. And the fourth was a TSQL solution for performing the 800 edit checks required by federal regulations on existing data. Finally, we provided recommendations on how to implement data cleansing, some manually and some automatically.

column statistics dqs

SUMMARY

Using several SQL Server data analysis tools, combined with our experience with federal programs, we were able to identify “bad” data and recommend strategies to guide this government agency back into a state of high data quality.