Merge pull request #114 from carissalow/feature/doryab_timeDuration_fix

Feature/doryab time duration fix
JulioV 2021-02-02 12:01:47 -05:00 committed by GitHub
commit 8ddb431e9f
3 changed files with 64 additions and 113 deletions

@@ -243,10 +243,10 @@ PHONE_LOCATIONS:
DBSCAN_EPS: 10 # meters
DBSCAN_MINSAMPLES: 5
THRESHOLD_STATIC : 1 # km/h
MAXIMUM_GAP_ALLOWED: 300
MAXIMUM_ROW_GAP: 300
MAXIMUM_ROW_DURATION: 60
MINUTES_DATA_USED: False
SAMPLING_FREQUENCY: 0
CLUSTER_ON: PARTICIPANT_DATASET # PARTICIPANT_DATASET,TIME_SEGMENT
CLUSTER_ON: TIME_SEGMENT # PARTICIPANT_DATASET,TIME_SEGMENT
CLUSTERING_ALGORITHM: DBSCAN #DBSCAN,OPTICS
SRC_FOLDER: "doryab" # inside src/features/phone_locations
SRC_LANGUAGE: "python"

@@ -104,7 +104,8 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
| `[DBSCAN_EPS]` | The maximum distance in meters between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
| `[DBSCAN_MINSAMPLES]` | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point of a cluster. This includes the point itself.
| `[THRESHOLD_STATIC]` | Speed threshold in km/h used to label a row as Static or Moving (rows with a speed below this value are labeled Static).
| `[MAXIMUM_GAP_ALLOWED]` | The maximum gap (in seconds) allowed between any two consecutive rows for them to be considered part of the same displacement. If this threshold is too high, it can throw speed and distance calculations off for periods when the phone was not sensing.
| `[MAXIMUM_ROW_GAP]` | The maximum gap (in seconds) allowed between any two consecutive rows for them to be considered part of the same displacement. If this threshold is too high, it can throw speed and distance calculations off for periods when the phone was not sensing.
| `[MAXIMUM_ROW_DURATION]` | The time difference between any two consecutive rows `A` and `B` is considered the time a participant spent in `A`. If this difference is bigger than `MAXIMUM_ROW_GAP`, it is substituted with `MAXIMUM_ROW_DURATION`.
| `[MINUTES_DATA_USED]` | Set to `True` to include an extra column in the final location feature file containing the number of minutes used to compute the features on each time segment. Use this for quality control purposes; the more data minutes exist for a period, the more reliable its features should be. For fused location, a single minute can contain more than one coordinate pair if the participant is moving fast enough.
| `[SAMPLING_FREQUENCY]` | Expected time difference between any two location rows in minutes. If set to `0`, the sampling frequency will be inferred automatically as the median of all the differences between any two consecutive row timestamps (recommended if you are using `FUSED_RESAMPLED` data). This parameter impacts all the time calculations.
| `[CLUSTER_ON]` | Set this flag to `PARTICIPANT_DATASET` to create clusters based on the entire participant's dataset or to `TIME_SEGMENT` to create clusters based on all the instances of the corresponding time segment (e.g. all mornings).
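For reference, a minimal sketch of the automatic `SAMPLING_FREQUENCY` inference described above (the function name is illustrative; timestamps are assumed to be Unix milliseconds):

```python
import pandas as pd

def infer_sampling_frequency(location_data: pd.DataFrame) -> float:
    # Median time difference, in minutes, between consecutive location rows
    return (location_data["timestamp"].diff() / (1000 * 60)).median()
```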
@@ -121,14 +122,14 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
|averagespeed |km/hr |Average speed in a time segment considering only the instances labeled as Moving.
|varspeed |km/hr |Speed variance in a time segment considering only the instances labeled as Moving.
|circadianmovement |- | \"It encodes the extent to which a person's location patterns follow a 24-hour circadian cycle.\" [Doryab et al.](../../citation#doryab-locations).
|numberofsignificantplaces |places |Number of significant locations visited. It is calculated using the DBSCAN clustering algorithm which takes in EPS and MIN_SAMPLES as parameters to identify clusters. Each cluster is a significant place.
|numberofsignificantplaces |places |Number of significant locations visited. It is calculated using the DBSCAN/OPTICS clustering algorithm which takes in EPS and MIN_SAMPLES as parameters to identify clusters. Each cluster is a significant place.
|numberlocationtransitions |transitions |Number of movements between any two clusters in a time segment.
|radiusgyration |meters |Quantifies the area covered by a participant.
|timeattop1location |minutes |Time spent at the most significant location.
|timeattop2location |minutes |Time spent at the 2nd most significant location.
|timeattop3location |minutes |Time spent at the 3rd most significant location.
|movingtostaticratio | - | Ratio between stationary time and total location sensed time. A lat/long coordinate pair is labelled as stationary if its speed (distance/time) to the next coordinate pair is less than 1km/hr. A higher value represents a more stationary routine. These times are computed by multiplying the number of rows by `[SAMPLING_FREQUENCY]`
|outlierstimepercent | - | Ratio between the time spent in non-significant clusters divided by the time spent in all clusters (total location sensed time). A higher value represents more time spent in non-significant clusters. These times are computed by multiplying the number of rows by `[SAMPLING_FREQUENCY]`
|movingtostaticratio | - | Ratio between stationary time and total location sensed time. A lat/long coordinate pair is labelled as stationary if its speed (distance/time) to the next coordinate pair is less than 1 km/hr. A higher value represents a more stationary routine. These times are computed using the per-row duration column `timeInSeconds` (see the sketch after this table).
|outlierstimepercent | - | Ratio of the time spent in non-significant clusters to the time spent in all clusters (total location sensed time). A higher value represents more time spent in non-significant clusters. These times are computed using the per-row duration column `timeInSeconds`.
|maxlengthstayatclusters |minutes |Maximum time spent in a cluster (significant location).
|minlengthstayatclusters |minutes |Minimum time spent in a cluster (significant location).
|meanlengthstayatclusters |minutes |Average time spent in a cluster (significant location).
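A minimal sketch of how the two duration-based ratios above can be derived from per-row durations, assuming a `timeInSeconds` column (seconds per row) and a `location_label` column where `-1` marks cluster noise (function names are illustrative, not the provider's exact implementation):

```python
import pandas as pd

def moving_to_static_ratio(stationary_rows: pd.DataFrame, all_rows: pd.DataFrame) -> float:
    # Stationary sensed time divided by total sensed time, both in seconds
    return stationary_rows["timeInSeconds"].sum() / all_rows["timeInSeconds"].sum()

def outliers_time_percent(stationary_rows: pd.DataFrame) -> float:
    # Share of sensed time spent in rows not assigned to any significant cluster
    outlier_time = stationary_rows.loc[stationary_rows["location_label"] == -1, "timeInSeconds"].sum()
    return outlier_time / stationary_rows["timeInSeconds"].sum()
```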
@@ -145,4 +146,7 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
For a detailed description of how this is calculated, see [Canzian et al](../../citation#doryab-locations).
**Fine Tuning Clustering Parameters**
Based on an experiment in which we collected fused location data for 7 days with a mean accuracy of 86 and an SD of 350.874635, we determined that `EPS/MAX_EPS`=100 produced clustering results closer to reality. Higher values (>100) missed some significant places, such as a short grocery visit, while lower values (<100) picked up traffic lights and stop signs while driving as significant locations. We recommend you set `EPS` based on the accuracy of your location data (the more accurate your data is, the lower you should be able to set `EPS`).
**Duration Calculation**
To calculate the time duration component for our features, we compute the difference between the timestamps of consecutive rows to take into account sampling-rate variability. If this time difference is larger than a threshold (300 seconds by default), we replace it with a maximum duration (60 seconds by default, i.e., we assume a participant spent at least 60 seconds in their last known location).
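A minimal pandas sketch of this duration calculation, assuming millisecond Unix timestamps and defaults mirroring `MAXIMUM_ROW_GAP` and `MAXIMUM_ROW_DURATION` (the function name is illustrative, not part of the provider):

```python
import pandas as pd

MAXIMUM_ROW_GAP = 300      # seconds
MAXIMUM_ROW_DURATION = 60  # seconds

def add_row_durations(location_data: pd.DataFrame) -> pd.DataFrame:
    # Duration of each row: seconds until the next row (the last row has no successor and stays NaN)
    location_data["timeInSeconds"] = (location_data["timestamp"].diff(-1) * -1) / 1000
    # Gaps longer than MAXIMUM_ROW_GAP are treated as sensing gaps and capped at MAXIMUM_ROW_DURATION
    location_data.loc[location_data["timeInSeconds"] >= MAXIMUM_ROW_GAP, "timeInSeconds"] = MAXIMUM_ROW_DURATION
    return location_data
```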

@@ -13,8 +13,8 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
dbscan_eps = provider["DBSCAN_EPS"]
dbscan_minsamples = provider["DBSCAN_MINSAMPLES"]
threshold_static = provider["THRESHOLD_STATIC"]
maximum_gap_allowed = provider["MAXIMUM_GAP_ALLOWED"]
sampling_frequency = provider["SAMPLING_FREQUENCY"]
maximum_gap_allowed = provider["MAXIMUM_ROW_GAP"]
maximum_row_duration = provider["MAXIMUM_ROW_DURATION"]
cluster_on = provider["CLUSTER_ON"]
clustering_algorithm = provider["CLUSTERING_ALGORITHM"]
@@ -45,7 +45,6 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
if cluster_on == "PARTICIPANT_DATASET":
location_data = cluster_and_label(location_data,clustering_algorithm,threshold_static,**hyperparameters)
location_data = filter_data_by_segment(location_data, time_segment)
elif cluster_on == "TIME_SEGMENT":
location_data = filter_data_by_segment(location_data, time_segment)
location_data = cluster_and_label(location_data,clustering_algorithm,threshold_static,**hyperparameters)
@@ -57,9 +56,6 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
else:
location_features = pd.DataFrame()
if sampling_frequency == 0:
sampling_frequency = getSamplingFrequency(location_data)
if "minutesdataused" in features_to_compute:
for localDate in location_data["local_segment"].unique():
location_features.loc[localDate,"minutesdataused"] = getMinutesData(location_data[location_data["local_segment"]==localDate])
@@ -73,6 +69,7 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
location_features = location_features.reset_index(drop=True)
return location_features
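# Each row's duration in seconds: the (positive) difference between its millisecond timestamp and the next row's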
location_data['timeInSeconds'] = (location_data.timestamp.diff(-1)* -1)/1000
if "locationvariance" in features_to_compute:
location_features["locationvariance"] = location_data.groupby(['local_segment'])['double_latitude'].var() + location_data.groupby(['local_segment'])['double_longitude'].var()
@@ -116,51 +113,54 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
if "radiusgyration" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"radiusgyration"] = radius_of_gyration(stationaryLocations[stationaryLocations['local_segment']==localDate],sampling_frequency)
location_features.loc[localDate,"radiusgyration"] = radius_of_gyration(stationaryLocations[stationaryLocations['local_segment']==localDate])
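# Pre-compute, per segment, the time spent at the top 3 clusters and the length-of-stay statistics (in minutes) from row durations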
preComputedTimeArray = pd.DataFrame()
for localDate in stationaryLocations["local_segment"].unique():
top1,top2,top3,smax,smin,sstd,smean = len_stay_timeattopn(stationaryLocations[stationaryLocations["local_segment"]==localDate],maximum_gap_allowed,maximum_row_duration)
preComputedTimeArray.loc[localDate,"timeattop1"] = top1
preComputedTimeArray.loc[localDate,"timeattop2"] = top2
preComputedTimeArray.loc[localDate,"timeattop3"] = top3
preComputedTimeArray.loc[localDate,"maxlengthstayatclusters"] = smax
preComputedTimeArray.loc[localDate,"minlengthstayatclusters"] = smin
preComputedTimeArray.loc[localDate,"stdlengthstayatclusters"] = sstd
preComputedTimeArray.loc[localDate,"meanlengthstayatclusters"] = smean
if "timeattop1location" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"timeattop1"] = time_at_topn_clusters_in_group(stationaryLocations[stationaryLocations['local_segment']==localDate],1,sampling_frequency)
location_features.loc[localDate,"timeattop1"] = preComputedTimeArray.loc[localDate,"timeattop1"]
if "timeattop2location" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"timeattop2"] = time_at_topn_clusters_in_group(stationaryLocations[stationaryLocations['local_segment']==localDate],2,sampling_frequency)
location_features.loc[localDate,"timeattop2"] = preComputedTimeArray.loc[localDate,"timeattop2"]
if "timeattop3location" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"timeattop3"] = time_at_topn_clusters_in_group(stationaryLocations[stationaryLocations['local_segment']==localDate],3,sampling_frequency)
location_features.loc[localDate,"timeattop3"] = preComputedTimeArray.loc[localDate,"timeattop3"]
if "movingtostaticratio" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"movingtostaticratio"] = (stationaryLocations[stationaryLocations['local_segment']==localDate].shape[0]*sampling_frequency) / (location_data[location_data['local_segment']==localDate].shape[0] * sampling_frequency)
location_features.loc[localDate,"movingtostaticratio"] = (stationaryLocations[stationaryLocations['local_segment']==localDate]['timeInSeconds'].sum()) / (location_data[location_data['local_segment']==localDate]['timeInSeconds'].sum())
if "outlierstimepercent" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"outlierstimepercent"] = outliers_time_percent(stationaryLocations[stationaryLocations['local_segment']==localDate],sampling_frequency)
preComputedmaxminCluster = pd.DataFrame()
for localDate in stationaryLocations['local_segment'].unique():
smax, smin, sstd,smean = len_stay_at_clusters_in_minutes(stationaryLocations[stationaryLocations['local_segment']==localDate],sampling_frequency)
preComputedmaxminCluster.loc[localDate,"maxlengthstayatclusters"] = smax
preComputedmaxminCluster.loc[localDate,"minlengthstayatclusters"] = smin
preComputedmaxminCluster.loc[localDate,"stdlengthstayatclusters"] = sstd
preComputedmaxminCluster.loc[localDate,"meanlengthstayatclusters"] = smean
location_features.loc[localDate,"outlierstimepercent"] = outlier_time_percent_new(stationaryLocations[stationaryLocations['local_segment']==localDate])
if "maxlengthstayatclusters" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"maxlengthstayatclusters"] = preComputedmaxminCluster.loc[localDate,"maxlengthstayatclusters"]
location_features.loc[localDate,"maxlengthstayatclusters"] = preComputedTimeArray.loc[localDate,"maxlengthstayatclusters"]
if "minlengthstayatclusters" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"minlengthstayatclusters"] = preComputedmaxminCluster.loc[localDate,"minlengthstayatclusters"]
location_features.loc[localDate,"minlengthstayatclusters"] = preComputedTimeArray.loc[localDate,"minlengthstayatclusters"]
if "stdlengthstayatclusters" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"stdlengthstayatclusters"] = preComputedmaxminCluster.loc[localDate,"stdlengthstayatclusters"]
location_features.loc[localDate,"stdlengthstayatclusters"] = preComputedTimeArray.loc[localDate,"stdlengthstayatclusters"]
if "meanlengthstayatclusters" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
location_features.loc[localDate,"meanlengthstayatclusters"] = preComputedmaxminCluster.loc[localDate,"meanlengthstayatclusters"]
location_features.loc[localDate,"meanlengthstayatclusters"] = preComputedTimeArray.loc[localDate,"meanlengthstayatclusters"]
if "locationentropy" in features_to_compute:
for localDate in stationaryLocations['local_segment'].unique():
@@ -174,6 +174,21 @@ def doryab_features(sensor_data_files, time_segment, provider, filter_data_by_se
return location_features
def len_stay_timeattopn(locationData,maximum_gap_allowed,maximum_row_duration):
if locationData is None or len(locationData) == 0:
return (None, None, None,None, None, None, None)
calculationDf = locationData[locationData["location_label"] >= 1][['location_label','timeInSeconds']].copy()
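# Cap gaps longer than MAXIMUM_ROW_GAP at MAXIMUM_ROW_DURATION (the phone likely stopped sensing during the gap)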
calculationDf.loc[calculationDf.timeInSeconds >= maximum_gap_allowed,'timeInSeconds'] = maximum_row_duration
timeArray = calculationDf.groupby('location_label')['timeInSeconds'].sum().reset_index()['timeInSeconds'].sort_values(ascending=False)/60
if len(timeArray) > 2:
return (timeArray[0],timeArray[1],timeArray[2],timeArray.max(),timeArray.min(),timeArray.std(),timeArray.mean())
elif len(timeArray)==2:
return (timeArray[0],timeArray[1],None,timeArray.max(),timeArray.min(),timeArray.std(),timeArray.mean())
else:
return (timeArray[0],None,None,timeArray.max(),timeArray.min(),timeArray.std(),timeArray.mean())
def getMinutesData(locationData):
@@ -188,7 +203,6 @@ def distance_to_degrees(d):
def get_all_travel_distances_meters_speed(locationData,threshold,maximum_gap_allowed):
locationData['timeInSeconds'] = (locationData.timestamp.diff(-1)* -1)/1000
lat_lon_temp = locationData[locationData['timeInSeconds'] <= maximum_gap_allowed][['double_latitude','double_longitude','timeInSeconds']]
if lat_lon_temp.empty:
@@ -314,7 +328,6 @@ def rank_count_map(clusters):
labels, counts = tuple(np.unique(clusters, return_counts = True))
sorted_by_count = [x for (y,x) in sorted(zip(counts, labels), reverse = True)]
label_to_rank = {label : rank + 1 for (label, rank) in [(sorted_by_count[i],i) for i in range(len(sorted_by_count))]}
return lambda x: label_to_rank.get(x, -1)
@@ -344,7 +357,7 @@ def number_location_transitions(locationData):
return df[df['boolCol'] == False].shape[0] - 1
def radius_of_gyration(locationData,sampling_frequency):
def radius_of_gyration(locationData):
if locationData is None or len(locationData) == 0:
return None
# Center is the centroid, not the home location
@@ -357,88 +370,25 @@ def radius_of_gyration(locationData,sampling_frequency):
distance = haversine(clusters_centroid.loc[labels].double_longitude,clusters_centroid.loc[labels].double_latitude,
centroid_all_clusters.double_longitude,centroid_all_clusters.double_latitude) ** 2
time_in_cluster = locationData[locationData["location_label"]==labels].shape[0]* sampling_frequency
time_in_cluster = locationData[locationData["location_label"]==labels]['timeInSeconds'].sum()
rog = rog + (time_in_cluster * distance)
time_all_clusters = valid_clusters.shape[0] * sampling_frequency
time_all_clusters = valid_clusters['timeInSeconds'].sum()
if time_all_clusters == 0:
return 0
final_rog = (1/time_all_clusters) * rog
return np.sqrt(final_rog)
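# Summary of the time-weighted computation above (illustrative comment, not additional provider code):
# rog = sqrt((1/T) * sum_c t_c * d_c^2), where t_c is the time (timeInSeconds) spent in significant
# cluster c, d_c is the haversine distance from c's centroid to the centroid of all clusters,
# and T is the total time across all significant clusters.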
def time_at_topn_clusters_in_group(locationData,n,sampling_frequency): # relevant only for global location features since, top3_clusters = top3_clusters_in_group for local
if locationData is None or len(locationData) == 0:
def outlier_time_percent_new(locationData):
if locationData is None or len(locationData)==0:
return None
locationData = locationData[locationData["location_label"] >= 1] # remove outliers/ cluster noise
valcounts = locationData["location_label"].value_counts().to_dict()
sorted_valcounts = sorted(valcounts.items(), key=lambda kv: (-kv[1], kv[0]))
if len(sorted_valcounts) >= n:
topn = sorted_valcounts[n-1]
topn_time = topn[1] * sampling_frequency
else:
topn_time = None
return topn_time
def outliers_time_percent(locationData,sampling_frequency):
if locationData is None or len(locationData) == 0:
return None
clusters = locationData["location_label"]
numoutliers = clusters[(clusters == -1)].sum() * sampling_frequency
numtotal = len(clusters) * sampling_frequency
return (float(-1*numoutliers) / numtotal)
def moving_time_percent(locationData):
if locationData is None or len(locationData) == 0:
return None
lbls = locationData["location_label"]
nummoving = lbls.isnull().sum()
numtotal = len(lbls)
return (float(nummoving) / numtotal)
def len_stay_at_clusters_in_minutes(locationData,sampling_frequency):
if locationData is None or len(locationData) == 0:
return (None, None, None,None)
lenstays = []
count = 0
prev_loc_label = None
for row in locationData.iterrows():
cur_loc_label = row[1]["location_label"]
if np.isnan(cur_loc_label):
continue
elif prev_loc_label == None:
prev_loc_label = int(cur_loc_label)
count += 1
else:
if prev_loc_label == int(cur_loc_label):
count += 1
else:
lenstays.append(count)
prev_loc_label = int(cur_loc_label)
count = 0 + 1
if count > 0: # in case of no transition
lenstays.append(count)
lenstays = np.array(lenstays) * sampling_frequency
if len(lenstays) > 0:
smax = np.max(lenstays)
smin = np.min(lenstays)
sstd = np.std(lenstays)
smean = np.mean(lenstays)
else:
smax = None
smin = None
sstd = None
smean = None
return (smax, smin, sstd, smean)
clustersDf = locationData[["location_label","timeInSeconds"]]
numoutliers = clustersDf[clustersDf["location_label"]== -1]["timeInSeconds"].sum()
numtotal = clustersDf.timeInSeconds.sum()
return numoutliers/numtotal
def location_entropy(locationData):
if locationData is None or len(locationData) == 0:
@@ -447,7 +397,7 @@ def location_entropy(locationData):
clusters = locationData[locationData["location_label"] >= 1] # remove outliers/ cluster noise
if len(clusters) > 0:
# Get percentages for each location
percents = clusters["location_label"].value_counts(normalize=True)
percents = clusters.groupby(['location_label'])['timeInSeconds'].sum() / clusters['timeInSeconds'].sum()
entropy = -1 * percents.map(lambda x: x * np.log(x)).sum()
return entropy
else:
@@ -463,10 +413,7 @@ def location_entropy_normalized(locationData):
num_clusters = len(unique_clusters)
if num_clusters == 0 or len(locationData) == 0 or entropy is None:
return None
elif np.log(num_clusters)==0:
return None
else:
return entropy / num_clusters
def getSamplingFrequency(locationData):
return (locationData.timestamp.diff()/(1000*60)).median()
return entropy / np.log(num_clusters)
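# Summary of the entropy computations above (illustrative comment): with p_i the share of sensed
# time (timeInSeconds) spent in significant cluster i, location entropy is H = -sum_i p_i * log(p_i),
# and the normalized variant returns H / log(k) for k significant clusters (note that k == 1 makes
# log(k) zero).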