EventQualityWarnings Configuration
To initialise and run the component two configs are used - general_config.ini
and component’s config (either event_cleaning_quality_warnings.ini
or event_deduplication_quality_warnings.ini
). In general_config.ini
to execute Event Quality Warnings component specify all paths to its corresponding data objects. Example with specified paths for both cases:
[Paths.Silver]
...
# for Event Cleaning Quality Warnings
event_syntactic_quality_metrics_by_column = ${Paths:silver_dir}/event_syntactic_quality_metrics_by_column
event_syntactic_quality_metrics_frequency_distribution = ${Paths:silver_dir}/event_syntactic_quality_metrics_frequency_distribution
event_syntactic_quality_warnings_log_table = ${Paths:silver_dir}/event_syntactic_quality_warnings_log_table
event_syntactic_quality_warnings_for_plots = ${Paths:silver_dir}/event_syntactic_quality_warnings_for_plots
Below there is a description of one of sub component’s config - event_cleaning_quality_warnings.ini
.
Parameters are as follows:
Under [EventQualityWarnings]
config section:
-
input_qm_by_column_path_key - string, key in Paths.Silver section in general config for a path to corresponding Quality Metrics By Column data
-
input_qm_freq_distr_path_key - string, key in Paths.Silver section in general config for a path to corresponding Quality Metrics Frequency Distribution data
-
output_qw_log_table_path_key - string, key in Paths.Silver section in general config for a path to write output Quality Warnings Log Table
-
output_qw_for_plots_path_key - string, key in Paths.Silver section in general config for a path to write output Quality Warnings ForPLots
-
data_period_start - string, format should be “yyyy-MM-dd“ (e.g. 2023-01-08), the date from which start Event Quality Warnings, by now make sure the first day(s) of research period has enough previous data in in Quality Metrics Frequency Fistribution and Quality Metrics By Column
-
data_period_end - string, format should be “yyyy-MM-dd“ (e.g. 2023-01-15), the date till which perform Event Quality Warnings
-
lookback_period - the length of lookback period, represented as string (could be either ‘week' or 'month') which than will get its numeric representation in number of days
-
do_size_raw_data_qw - boolean, whether perform QW checks on
initial_frequency
column in Quality Metrics Frequency Fistribution -
do_size_clean_data_qw - boolean, whether perform QW checks on
final_frequency
column in Quality Metrics Frequency Distribution -
data_size_tresholds - dictionary, with keys as thresholds names and values as corresponding thresholds values for
do_size_raw_data_qw
anddo_size_clean_data_qw
-
do_error_rate_by_date_qw - boolean, whether to perform QW checks on total error rate by
date
-
do_error_rate_by_date_and_cell_qw - boolean, whether to perform QW checks on total error rate by
date
andcell_id
-
do_error_rate_by_date_and_user_qw - boolean, whether to perform QW checks on total error rate by
date
anduser_id
-
do_error_rate_by_date_and_cell_user_qw - boolean, whether to perform QW checks on total error rate by
date
,cell_id
anduser_id
-
error_rate_tresholds - dictionary, with keys as thresholds names and values as corresponding thresholds values for
do_error_rate_by_date_qw
,do_error_rate_by_date_and_cell_qw
,do_error_rate_by_date_and_user_qw
, anddo_error_rate_by_date_and_cell_user_qw
-
error_type_qw_checks - dictionary, where the keys are names of error types (please see
multimno/core/constants/error_types.py
file) and values list of column names on which you want to perform QWs of the this error type. Example: during Event Cleaning three columns are checked for null values, if you want to check error rate ofmissing_value
type for all mentioned columns specify them in the list. Some error types might have None for column names, which means that technically this kind or error do not belong to just one column but several (e.g. forno_location
error three columns are used - cell_id, lat, lon):
error_type_qw_checks = {
'missing_value': ['user_id', 'mcc', 'timestamp'],
'out_of_bounding_box':[None]
}
- thresholds for each error_type & column combination you want to compute QWs - thresholds are combined in groups: each set of thresholds relevant to some error type is a separate config param of type dictionary, where keys are column names, values is another dictionary of structure:
threshold_name:threshold_value
. Example:
missing_value_thresholds = {
'user_id': {"Missing_value_RATE_BYDATE_USER_AVERAGE": 30,
"Missing_value_RATE_BYDATE_USER_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_USER_ABS_VALUE_UPPER_LIMIT": 20
},
'mcc': {"Missing_value_RATE_BYDATE_MCC_AVERAGE": 30,
"Missing_value_RATE_BYDATE_MCC_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_MCC_ABS_VALUE_UPPER_LIMIT": 20
},
'timestamp': {"Missing_value_RATE_BYDATE_TIMESTAMP_AVERAGE": 30,
"Missing_value_RATE_BYDATE_TIMESTAMP_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_TIMESTAMP_ABS_VALUE_UPPER_LIMIT": 20
}
}
out_of_bounding_box_thresholds = {
None: {"Out_of_bbox_RATE_BYDATE_AVERAGE": 30,
"Out_of_bbox_RATE_BYDATE_VARIABILITY": 2,
"Out_of_bbox_RATE_BYDATE_ABS_VALUE_UPPER_LIMIT": 20
}
}
error_type_qw_checks
there are corresponding thresholds. The order of thresholds is important and should be: AVERAGE
, VARIABILITY
, and ABS_VALUE_UPPER_LIMIT
(at least by now all error type QWs follow the same logic and thus their computation is done within one function with ordered threshold arguments). Currently the code supports running QWs on following thresholds:
# possible thresholds in event_cleaning_quality_warnings.ini
missing_value_thresholds
out_of_admissible_values_thresholds
not_right_syntactic_format_thresholds
no_location_thresholds
no_domain_thresholds
out_of_bounding_box_thresholds
deduplication_same_location_thresholds
Configuration example
[EventQualityWarnings]
# keys in Paths.Silver section in general config
input_qm_by_column_path_key = event_syntactic_quality_metrics_by_column
input_qm_freq_distr_path_key = event_syntactic_quality_metrics_frequency_distribution
output_qw_log_table_path_key = event_syntactic_quality_warnings_log_table
output_qw_for_plots_path_key = event_syntactic_quality_warnings_for_plots
# BY NOW make sure that the first day(s) of research period has enough previous data
# of df_qa_by_column and df_qa_freq_distribution
# (e.g. staring from 2023-01-01, if period is a week and start period is 2023-01-08)
data_period_start = 2023-01-01
# you can exceed max(df_qa_by_column.date)
# although you will still get QWs for dates till max(df_qa_by_column.date), including
data_period_end = 2023-01-09
# should be either week or month
lookback_period = week
# SIZE QA
do_size_raw_data_qw = True
do_size_clean_data_qw = True
data_size_tresholds = {
"SIZE_RAW_DATA_BYDATE_VARIABILITY": 3,
"SIZE_RAW_DATA_BYDATE_ABS_VALUE_UPPER_LIMIT": 10000000,
"SIZE_RAW_DATA_BYDATE_ABS_VALUE_LOWER_LIMIT": 0,
"SIZE_CLEAN_DATA_BYDATE_VARIABILITY": 3,
"SIZE_CLEAN_DATA_BYDATE_ABS_VALUE_UPPER_LIMIT": 10000000,
"SIZE_CLEAN_DATA_BYDATE_ABS_VALUE_LOWER_LIMIT": 0,
}
# ERROR RATE QW
do_error_rate_by_date_qw = True
do_error_rate_by_date_and_cell_qw = False
do_error_rate_by_date_and_user_qw = True
do_error_rate_by_date_and_cell_user_qw = True
error_rate_tresholds = {
"TOTAL_ERROR_RATE_BYDATE_OVER_AVERAGE": 30,
"TOTAL_ERROR_RATE_BYDATE_VARIABILITY": 2,
"TOTAL_ERROR_RATE_BYDATE_ABS_VALUE_UPPER_LIMIT": 20,
"ERROR_RATE_BYDATE_BYCELL_OVER_AVERAGE": 30,
"ERROR_RATE_BYDATE_BYCELL_VARIABILITY": 2,
"ERROR_RATE_BYDATE_BYCELL_ABS_VALUE_UPPER_LIMIT": 20,
"ERROR_RATE_BYDATE_BYUSER_OVER_AVERAGE": 30,
"ERROR_RATE_BYDATE_BYUSER_VARIABILITY": 2,
"ERROR_RATE_BYDATE_BYUSER_ABS_VALUE_UPPER_LIMIT": 20,
"ERROR_RATE_BYDATE_BYCELL_USER_OVER_AVERAGE": 30,
"ERROR_RATE_BYDATE_BYCELL_USER_VARIABILITY": 2,
"ERROR_RATE_BYDATE_BYCELL_USER_ABS_VALUE_UPPER_LIMIT": 20,
}
# ERROR TYPE QW
# for each type of error (key), specified the colums you want to check, naming of columns must be oidentical to ColNames
# if you do not want to run qw on some error_type leave the list empty
# None - for no_location and out_of_bounding_box because they do not have more than one column used for this error_type
# for more clarity please check event_cleaning.py
error_type_qw_checks = {
'missing_value': ['user_id', 'mcc', 'timestamp'],
'out_of_admissible_values': ['cell_id', 'mcc', 'mnc', 'plmn', 'timestamp'],
'not_right_syntactic_format': ['timestamp'],
'no_domain': [None],
'no_location':[None],
'out_of_bounding_box':[None],
'same_location_duplicate':[None]
}
# for each dict_error_type_thresholds make sure you specified all relevant columns
# the order of thresholds is important, should be: AVERAGE, VARIABILITY, and ABS_VALUE_UPPER_LIMIT
missing_value_thresholds = {
'user_id': {"Missing_value_RATE_BYDATE_USER_AVERAGE": 30,
"Missing_value_RATE_BYDATE_USER_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_USER_ABS_VALUE_UPPER_LIMIT": 20
},
'mcc': {"Missing_value_RATE_BYDATE_MCC_AVERAGE": 30,
"Missing_value_RATE_BYDATE_MCC_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_MCC_ABS_VALUE_UPPER_LIMIT": 20
},
'mnc': {"Missing_value_RATE_BYDATE_MNC_AVERAGE": 30,
"Missing_value_RATE_BYDATE_MNC_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_MNC_ABS_VALUE_UPPER_LIMIT": 20
},
'plmn': {"Missing_value_RATE_BYDATE_PLMN_AVERAGE": 30,
"Missing_value_RATE_BYDATE_PLMN_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_PLMN_ABS_VALUE_UPPER_LIMIT": 20
},
'timestamp': {"Missing_value_RATE_BYDATE_TIMESTAMP_AVERAGE": 30,
"Missing_value_RATE_BYDATE_TIMESTAMP_VARIABILITY": 2,
"Missing_value_RATE_BYDATE_TIMESTAMP_ABS_VALUE_UPPER_LIMIT": 20
}
}
out_of_admissible_values_thresholds = {
'cell_id': {"Out_of_range_RATE_BYDATE_CELL_AVERAGE": 30,
"Out_of_range_RATE_BYDATE_CELL_VARIABILITY": 2,
"Out_of_range_RATE_BYDATE_CELL_ABS_VALUE_UPPER_LIMIT": 20
},
'mcc': {"Out_of_range_RATE_BYDATE_MCC_AVERAGE": 30,
"Out_of_range_RATE_BYDATE_MCC_VARIABILITY": 2,
"Out_of_range_RATE_BYDATE_MCC_ABS_VALUE_UPPER_LIMIT": 20
},
'mnc': {"Out_of_range_RATE_BYDATE_MNC_AVERAGE": 30,
"Out_of_range_RATE_BYDATE_MNC_VARIABILITY": 2,
"Out_of_range_RATE_BYDATE_MNC_ABS_VALUE_UPPER_LIMIT": 20
},
'plmn': {"Out_of_range_RATE_BYDATE_PLMN_AVERAGE": 30,
"Out_of_range_RATE_BYDATE_PLMN_VARIABILITY": 2,
"Out_of_range_RATE_BYDATE_PLMN_ABS_VALUE_UPPER_LIMIT": 20
},
'timestamp': {"Out_of_range_RATE_BYDATE_TIMESTAMP_AVERAGE": 30,
"Out_of_range_RATE_BYDATE_TIMESTAMP_VARIABILITY": 2,
"Out_of_range_RATE_BYDATE_TIMESTAMP_ABS_VALUE_UPPER_LIMIT": 20
}
}
not_right_syntactic_format_thresholds = {
'timestamp': {"Wrong_type_RATE_BYDATE_TIMESTAMP_AVERAGE": 30,
"Wrong_type_RATE_BYDATE_TIMESTAMP_VARIABILITY": 2,
"Wrong_type_RATE_BYDATE_TIMESTAMP_ABS_VALUE_UPPER_LIMIT": 20
}
}
no_location_thresholds = {
None: {"No_location_RATE_BYDATE_AVERAGE": 30,
"No_location_RATE_BYDATE_VARIABILITY": 2,
"No_location_RATE_BYDATE_ABS_VALUE_UPPER_LIMIT": 20
}
}
no_domain_thresholds = {
None: {"No_domain_RATE_BYDATE_AVERAGE": 30,
"No_domain_RATE_BYDATE_VARIABILITY": 2,
"No_domain_RATE_BYDATE_ABS_VALUE_UPPER_LIMIT": 20
}
}
out_of_bounding_box_thresholds = {
None: {"Out_of_bbox_RATE_BYDATE_AVERAGE": 30,
"Out_of_bbox_RATE_BYDATE_VARIABILITY": 2,
"Out_of_bbox_RATE_BYDATE_ABS_VALUE_UPPER_LIMIT": 20
}
}
deduplication_same_location_thresholds = {
None: {"Deduplication_same_location_RATE_BYDATE_AVERAGE": 30,
"Deduplication_same_location_RATE_BYDATE_VARIABILITY": 2,
"Deduplication_same_location_RATE_BYDATE_ABS_VALUE_UPPER_LIMIT": 20
}
}
clear_destination_directory = True