Skip to content

SemanticCleaning Configuration

To initialise and run the component two configs are used - general_config.ini and event_semantic_cleaning.ini. In general_config.ini all paths to the corresponding data objects shall be specified. Example:

[Paths.Silver]
...
network_data_silver = ${Paths:silver_dir}/mno_network
event_data_silver_flagged = ${Paths:silver_dir}/mno_events_flagged
event_device_semantic_quality_metrics = ${Paths:silver_quality_metrics_dir}/semantic_quality_metrics
...

The expected parameters in event_semantic_cleaning.ini are as follows: - data_period_start: string, format should be the one specified data_period_format (e.g., 2023-01-01 for %Y-%m-%d), the first date for which data will be processed by the component. All dates between this one and the specified in data_period_end will be processed (both inclusive). - data_period_end: string, format should be "yyyy-MM-dd" (e.g., 2023-01-09 for %Y-%m-%d), the last date for which data will be processed by the component. All dates between the specified in data_period_start and this one will be processed (both inclusive). - data_period_format: string, it indicates the format expected in data_period_start and data_period_end. For example, use %Y-%m-%d for the usual "2023-01-09" format separated by -. - semantic_min_distance_m: float, minimum distance (in metres) between two consecutive events above which they will be considered for flagging as suspicious or incorrect location. Example: 10000. - semantic_min_speed_m_s: float, minimum mean speed (in metres per second) between two consecutive events above whihc they will be considered for flagging as suspicious or incorrect location. Example: 55. - do_different_location_deduplication: boolean, True/False. Determines whether to flag duplicates with different location information (cases where a single user has one or more rows with identical timestamp values, but non-identical values in any other columns).

Configuration example

[Spark]
session_name = SemanticCleaning

[SemanticCleaning]
data_period_start = 2023-01-01
data_period_end = 2023-01-09
date_format = %Y-%m-%d

semantic_min_distance_m = 10000
semantic_min_speed_m_s = 55

do_different_location_deduplication = True