Missing values are always a headache to prepare data. Let’s consider the way to handle the missing data.
Replace null data with a blank or zero
When there are missing values in string type fields, it is a solution to change null to blank. Or in numeric fields, to change null to 0 is also a solution. For this case, to use the Data Cleansing tool is a popular solution. But the Data Cleansing tool is slow, so to use the CReW Cleanse toll is better solution. Or you can use the Formula tool with IF statement and IsNull or IsEmpty. Or you can also use the Coalesce function in the Formula tool from Designer 2024.1 or later. But I think when there are little data, to use Data Cleansing tool is not a bad solution. However please remember the Data Cleansing tool is slow especially there are much data.
When you use the Data Cleansing tool, the setting is as follows.
If you use the Designer 2024.1 or later, you can use Coalesce function. This function returns the value that is not null value on the left side. In this case, you can use the following expression.
Coalesce([Value], 0)
For that, if the [Value] is Null, it returns 0. If not, it returns [Value]. But for the Designer 2023.2 or before, you can use the expression below.
IF IsNull([Value]) THEN 0 ELSE [Value] ENDIF
When you use the Formula tool for multiple fields, you have to create the expression for each field. In this case, the Multi-Field Formula is the best solution. For example, you use the Coalesce function, you can create the expression in the Multi-Formula tool as follows.
Coalesce([_CurrentField_], 0)
The important point is to use the [_CurrentField_].
Impute by the average or median(for numeric fields)
The Imputation tool is very useful for imputing by the statistical value.
This tool can make the missing values to replace by average, median, mode or user specified value. Moreover, you can replace the specified value instead of the missing value. However, whether or not it is really a good idea to replace missing values with statistical values requires careful consideration, but it is a useful method.
Note that the Imputation tool is only for numeric fields.
Impute by other than average, median, mode or user specified value
The Imputation tool is only by average, median, mode or user specified value. Basically, I think that it is an assumption used to prepare the machine learning data set. But when you want to impute missing values by min or max value, you have to create a simple logic as follows.
To begin with, use the Summarize tool, you make the min or max value and then the Field Appends tool append them that is created by the Summarize tool. After that, replace the null value by the Formula tool. If you want to know the detail, please check the sample workflow. This way can be used for the string type field, too.
Impute by looking at the values of previous and next records
Regarding the time series data, you might want to do the linearly interpolate by looking at the values of previous and next records. In this case, you can use the Multi Row Formula and create the expression that you want how to impute.
For exmaple, when you want to impute by the previous row value, you can use the following expression.
IF IsNull([Value]) THEN
[Row-1:Value]
ELSE [Value] ENDIF
This is a common way. For the categorical field, you can use this option.
Second, ff you want to impute by the mid value between the previous and next values. In this case, you can use the following expression.
IF IsNull([Value]) THEN
([Row-1:Value]+[Row+1:Value])/2
ELSE [Value] ENDIF
However, this expression cannot be used for the consecutive null values. For example, the following case, which is row 2 and 3, row 8 and 9 is consecutive null values, can not be used.
In this case, you have to make a complex logic. I created the corresponding workflow. If you want to know the detail, please check the sample workflow.
Futhermore, if records interval are not constant, the above workflow doesn’t work. So the valid workflow is as follows.
Get rid of the null data
If the null data is not useful, you can filter out those data. For this purpose, you can use the Filter tool.
Conclusion
- I explained how to impute the data when there are missing values.
- The Imputation tool is useful for the some imupute way. If the way doesn’t work, you can use the Coalesce function for Designer 2024.1 or later or IF statement.
- If there are not the value that you want to impute, you have to create the values by Summarize tool or others and append them by the Append Fields tool.
- If you want to do the linearly interpolate, you have to make a complex logic.
Sample Workflow Download
The next post is …
The next post will be about sorting of the fields.
コメント