Small fixes to timeathome docs, add config validation

feature/plugin_sentimental
JulioV 2021-02-24 17:49:22 -05:00
parent 3d6caea6c4
commit 724027e383
4 changed files with 40 additions and 17 deletions

View File

@@ -5,6 +5,7 @@
- Add logo
- Move Citation page to the Setup section
- Add `config.yaml` validation schema and documentation.
- Add time at home Doryab location feature and home coordinates to location file
## v0.4.3
- Fix bug when any of the rows from any sensor do not belong to a time segment
## v0.4.2

View File

@@ -89,6 +89,7 @@ These features are based on the original implementation by [Doryab et al.](../..
- data/raw/{pid}/phone_locations_raw.csv
- data/interim/{pid}/phone_locations_processed.csv
- data/interim/{pid}/phone_locations_processed_with_datetime.csv
- data/interim/{pid}/phone_locations_processed_with_datetime_with_home.csv
- data/interim/{pid}/phone_locations_features/phone_locations_{language}_{provider_key}.csv
- data/processed/features/{pid}/phone_locations.csv
```
@@ -110,7 +111,7 @@ Parameters description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
| `[SAMPLING_FREQUENCY]` | Expected time difference between any two location rows in minutes. If set to `0`, the sampling frequency will be inferred automatically as the median of all the differences between any two consecutive row timestamps (recommended if you are using `FUSED_RESAMPLED` data); see the sketch after this table. This parameter impacts all the time calculations.
| `[CLUSTER_ON]` | Set this flag to `PARTICIPANT_DATASET` to create clusters based on the entire participant's dataset or to `TIME_SEGMENT` to create clusters based on all the instances of the corresponding time segment (e.g. all mornings).
| `[CLUSTERING_ALGORITHM]` | The original Doryab et al. implementation uses `DBSCAN`; `OPTICS` is also available, with similar (but not identical) clustering results and lower memory consumption.
| `[RADIUS_FOR_HOME]` | The distance from the center of the home location coordinates which can be accepted as part of home.
| `[RADIUS_FOR_HOME]` | All location coordinates within this distance (meters) from the home location coordinates are considered a home stay (see `timeathome` feature).
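A minimal sketch of how this median-based `[SAMPLING_FREQUENCY]` inference could look with pandas; the `timestamp` column name follows RAPIDS' processed location files, but the helper itself is illustrative rather than the exact pipeline code:

```python
import pandas as pd

def infer_sampling_frequency(locations: pd.DataFrame) -> float:
    """Median gap in minutes between consecutive location rows (illustrative sketch).

    Assumes a Unix millisecond `timestamp` column, as in the processed
    location files.
    """
    gaps = locations["timestamp"].sort_values().diff().dropna()
    return gaps.median() / (1000 * 60)  # milliseconds -> minutes
```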
Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
@@ -137,7 +138,7 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
|stdlengthstayatclusters |minutes |Standard deviation of time spent in a cluster (significant location).
|locationentropy |nats |Shannon Entropy computed over the row count of each cluster (significant location); it will be higher the more rows belong to a cluster (i.e. the more time a participant spent at a significant location).
|normalizedlocationentropy |nats |Shannon Entropy computed over the row count of each cluster (significant location) divided by the number of clusters; it will be higher the more rows belong to a cluster (i.e. the more time a participant spent at a significant location).
|timeathome |minutes | Time spent at home which is calculated by filtering the data between 12 am and 6 am, then applying clustering algorithm, finding the center of the biggest cluster and considering it as home coordinates.
|timeathome |minutes | Time spent at home (see Observations below for how we compute home).
!!! note "Assumptions/Observations"
@@ -152,3 +153,6 @@ Features description for `[PHONE_LOCATIONS][PROVIDERS][DORYAB]`:
**Duration Calculation**
To calculate the time duration component for our features, we compute the difference between the timestamps of consecutive rows to take into account sampling rate variability. If this time difference is larger than a threshold (300 seconds by default), we replace it with a maximum duration (60 seconds by default, i.e. we assume a participant spent at least 60 seconds in their last known location).
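A minimal pandas sketch of this rule, using the default thresholds named above; the function name and column handling are illustrative assumptions, not the pipeline implementation:

```python
import pandas as pd

def capped_durations(timestamps_ms: pd.Series,
                     threshold: float = 300,
                     max_duration: float = 60) -> pd.Series:
    """Seconds attributed to each row's last known location (illustrative sketch).

    Computes the gap to the next row and replaces gaps longer than `threshold`
    seconds (including the open-ended last row) with `max_duration` seconds,
    i.e. we assume the participant stayed at least that long at their last
    known location.
    """
    gaps = timestamps_ms.sort_values().diff().shift(-1) / 1000  # ms -> seconds
    return gaps.where(gaps <= threshold, max_duration)
```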
**Home location**
Home is calculated from all of a participant's location data between 12 am and 6 am by applying a clustering algorithm (`DBSCAN` or `OPTICS`) and taking the center of the biggest cluster as the home coordinates for that participant.
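A minimal sketch of this inference with scikit-learn's DBSCAN; the `local_hour` column, the default hyperparameters, and the rough meters-to-degrees conversion are assumptions for illustration, not the pipeline's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def infer_home(locations: pd.DataFrame, eps_meters: float = 10, min_samples: int = 5):
    """Infer home coordinates from night-time location data (illustrative sketch).

    Keeps rows whose local hour falls between 12 am and 6 am, clusters them
    with DBSCAN (eps roughly converted from meters to degrees), and returns
    the mean coordinates of the biggest cluster.
    """
    night = locations[locations["local_hour"].between(0, 5)]
    eps_degrees = np.degrees(eps_meters / 6371000)  # rough meters -> degrees
    labels = DBSCAN(eps=eps_degrees, min_samples=min_samples).fit_predict(
        night[["double_latitude", "double_longitude"]])
    night = night.assign(cluster=labels)
    biggest = night.loc[night["cluster"] != -1, "cluster"].value_counts().idxmax()
    home = night.loc[night["cluster"] == biggest, ["double_latitude", "double_longitude"]].mean()
    return home["double_latitude"], home["double_longitude"]

```

`timeathome` then sums the duration of rows whose distance to these home coordinates is at most `[RADIUS_FOR_HOME]` meters.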

View File

@@ -14,20 +14,6 @@ def distance_to_degrees(d):
d = d / 60
return d
origDf = pd.read_csv(snakemake.input[0])
filteredDf = filterDatafromDf(origDf)
dbscan_eps = snakemake.params["dbscan_eps"]
dbscan_minsamples = snakemake.params["dbscan_minsamples"]
threshold_static = snakemake.params["threshold_static"]
clustering_algorithm = snakemake.params["clustering_algorithm"]
if clustering_algorithm == "DBSCAN":
hyperparameters = {'eps' : distance_to_degrees(dbscan_eps), 'min_samples': dbscan_minsamples}
elif clustering_algorithm == "OPTICS":
hyperparameters = {'max_eps': distance_to_degrees(dbscan_eps), 'min_samples': 2, 'metric':'euclidean', 'cluster_method' : 'dbscan'}
else:
raise ValueError("config[PHONE_LOCATIONS][HOME_INFERENCE][CLUSTERING ALGORITHM] only accepts DBSCAN or OPTICS but you provided ",clustering_algorithm)
def cluster_and_label(df,clustering_algorithm,threshold_static,**kwargs):
"""
@@ -121,6 +107,22 @@ def haversine(lon1,lat1,lon2,lat2):
return (r * 2 * np.arcsin(np.sqrt(a)) * 1000)
# Infer a participant's home location
origDf = pd.read_csv(snakemake.input[0])
filteredDf = filterDatafromDf(origDf)
dbscan_eps = snakemake.params["dbscan_eps"]
dbscan_minsamples = snakemake.params["dbscan_minsamples"]
threshold_static = snakemake.params["threshold_static"]
clustering_algorithm = snakemake.params["clustering_algorithm"]
if clustering_algorithm == "DBSCAN":
hyperparameters = {'eps' : distance_to_degrees(dbscan_eps), 'min_samples': dbscan_minsamples}
elif clustering_algorithm == "OPTICS":
hyperparameters = {'max_eps': distance_to_degrees(dbscan_eps), 'min_samples': 2, 'metric':'euclidean', 'cluster_method' : 'dbscan'}
else:
raise ValueError("config[PHONE_LOCATIONS][HOME_INFERENCE][CLUSTERING_ALGORITHM] only accepts DBSCAN or OPTICS but you provided ",clustering_algorithm)
filteredDf = cluster_and_label(filteredDf,clustering_algorithm,threshold_static,**hyperparameters)
origDf['home_latitude'] = filteredDf[filteredDf['location_label']==1][['double_latitude','double_longitude']].mean()['double_latitude']

View File

@@ -598,6 +598,22 @@ properties:
FUSED_RESAMPLED_TIME_SINCE_VALID_LOCATION:
type: integer
exclusiveMinimum: 0
HOME_INFERENCE:
type: object
required: [DBSCAN_EPS, DBSCAN_MINSAMPLES, THRESHOLD_STATIC, CLUSTERING_ALGORITHM]
properties:
DBSCAN_EPS:
type: integer
exclusiveMinimum: 0
DBSCAN_MINSAMPLES:
type: integer
exclusiveMinimum: 0
THRESHOLD_STATIC:
type: integer
exclusiveMinimum: 0
CLUSTERING_ALGORITHM:
type: string
enum: ["DBSCAN", "OPTICS"]
PROVIDERS:
type: ["null", object]
properties:
@@ -610,7 +626,7 @@ properties:
uniqueItems: True
items:
type: string
enum: [locationvariance,loglocationvariance,totaldistance,averagespeed,varspeed,circadianmovement,numberofsignificantplaces,numberlocationtransitions,radiusgyration,timeattop1location,timeattop2location,timeattop3location,movingtostaticratio,outlierstimepercent,maxlengthstayatclusters,minlengthstayatclusters,meanlengthstayatclusters,stdlengthstayatclusters,locationentropy,normalizedlocationentropy]
enum: [locationvariance,loglocationvariance,totaldistance,averagespeed,varspeed,circadianmovement,numberofsignificantplaces,numberlocationtransitions,radiusgyration,timeattop1location,timeattop2location,timeattop3location,movingtostaticratio,outlierstimepercent,maxlengthstayatclusters,minlengthstayatclusters,meanlengthstayatclusters,stdlengthstayatclusters,locationentropy,normalizedlocationentropy,timeathome]
ACCURACY_LIMIT:
type: integer
exclusiveMinimum: 0
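The schema above uses JSON Schema vocabulary (`type`, `required`, `enum`, `exclusiveMinimum`). Below is a minimal sketch of how a `config.yaml` fragment could be checked against a slice of it with the `PyYAML` and `jsonschema` packages; the actual validation entry point the pipeline uses may differ, and the values shown are illustrative:

```python
import yaml
from jsonschema import validate, ValidationError

# Illustrative fragment of a config; keys mirror the HOME_INFERENCE schema above.
config_fragment = yaml.safe_load("""
HOME_INFERENCE:
  DBSCAN_EPS: 10
  DBSCAN_MINSAMPLES: 5
  THRESHOLD_STATIC: 1
  CLUSTERING_ALGORITHM: DBSCAN
""")

# Corresponding slice of the validation schema (JSON Schema vocabulary).
schema_fragment = yaml.safe_load("""
type: object
properties:
  HOME_INFERENCE:
    type: object
    required: [DBSCAN_EPS, DBSCAN_MINSAMPLES, THRESHOLD_STATIC, CLUSTERING_ALGORITHM]
    properties:
      CLUSTERING_ALGORITHM:
        type: string
        enum: [DBSCAN, OPTICS]
""")

try:
    validate(instance=config_fragment, schema=schema_fragment)
    print("config fragment is valid")
except ValidationError as err:
    print(f"Invalid config.yaml: {err.message}")
```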