A few months back while working on a data cleansing project using Data Quality Services (DQS), I encountered the same problem that I later saw described by Greg Low on this blog post and confirmed as a bug here. It has to do with the Record Status column that is part of the output of the DQS Cleansing component in SSIS; its value is not being set correctly by the component. In this post I will share a script to work around this bug. In fact, you may use it to implement your own logic for processing status values.
The DQS Cleansing component takes as input a set of data to cleanse against an existing knowledge base. As output, it adds a status column for each input column (each called ColumnName_Status) and one column called Record Status that is supposed to summarize all the individual status columns. As indicated by Mr. Low, the value of this column is supposed to be as follows (evaluated in order):
• “Invalid”: If any of the individual status columns has “Invalid” as its value
• “Corrected”: If any of the individual status columns has “Corrected” as its value
• “New”: If any of the individual status columns has “New” as its value
• “Correct”: If all of the individual status columns have “Correct” as their value
In my particular case, I was “losing” rows with columns that had New values when all of the other status values in the row were Correct, because the Record Status was – incorrectly –being set to Correct by the DQS component. This is the example data flow I created to illustrate the problem, using the DQS Data knowledge base that ships with SQL Server:
Notice the data viewer between the DQS component and the Conditional Split component. With the sample data I set up, I get this:
As you can see, the last row has individual column status values of New and Correct, but the resulting Record Status has a value of Correct instead of New. My solution was to insert a Script component before the Conditional Split component to examine, for each row, the individual column status values and overwrite the value of Record Status.
The work in the Script component will be done in the PreExecute and Input0_ProcessInputRow methods. In PreExecute, I used a one-dimension array of the same length as the number of columns that the DQS component outputs; on it, I mark the index of each of the individual status columns. In Input0_ProcessInputRow, which is called once per row, I examined the status value of each of the individual columns and overwrite the Record Status value of the row if necessary.
First, the class-wide variable declarations:
string[] strInputStatusColumns;
int intNumberOfInputColumns;
In PreExecute, I mark the status columns, identifying them by their suffix. All of the other cells in the array will keep their default value of null.
IDTSInput100 input = ComponentMetaData.InputCollection[0];
intNumberOfInputColumns = input.InputColumnCollection.Count;
strInputStatusColumns = new string[intNumberOfInputColumns];
for (int x = 0; x < intNumberOfInputColumns; x++)
{
if (input.InputColumnCollection[x].Name.EndsWith("_Status"))
{
strInputStatusColumns[x] = input.InputColumnCollection[x].Name;
}
}
Finally, in Input0_ProcessInputRow I implement the desired logic for my situation:
for (int i = 0; i < intNumberOfInputColumns; i++)
{
if (strInputStatusColumns[i] != null)
{
if (inputBuffer[i].ToString() == "New")
{
// this is the Record Status column
inputBuffer[intNumberOfInputColumns - 1] = "New";
}
}
}
I am not exactly correcting the DQS bug, for I had a particular interest in New values, but you may implement your own logic armed now with this sample code.