Added RAPIDS docs and corrected docs of docs
parent
f22d1834ee
commit
765bb46263
|
@ -1,7 +1,7 @@
|
|||
How to Edit Documentation
|
||||
============================
|
||||
|
||||
The following is a basic guide for editing the documentation for this project. The documentation is rendered using Sphinx_ documentation builder. This guide is intended to be a basic guide that will allow a contributer to start editing the documentation for the MoSHI AWARE Pipeline. The first step is to install Sphinx.
|
||||
The following is a basic guide for editing the documentation for this project. The documentation is rendered using the Sphinx_ documentation builder. This guide is intended to help a contributor start editing the documentation for the RAPIDS Pipeline. The first step is to install Sphinx.
|
||||
|
||||
Mac OS
|
||||
|
||||
|
@ -29,16 +29,16 @@ The ``toctree`` inserts a TOC tree at the current location using the individual
|
|||
|
||||
Thus the directory structure for the above example is shown below::
|
||||
|
||||
|__ index.rst
|
||||
|__ usage
|
||||
|__ introduction.rst
|
||||
|__ installation.rst
|
||||
├── index.rst
|
||||
└── usage
|
||||
├── introduction.rst
|
||||
└── installation.rst
|
||||
|
||||
Once the ``index.rst`` has been edited and content has been added and/or edited, the documentation is built using the following command::
|
||||
|
||||
$ make dirhtml
|
||||
|
||||
This command creates the ``_build`` directory which contains the generated HTML files of the documentation.
|
||||
This command creates the ``_build`` directory, which contains the generated HTML files of the documentation. Note that once you have pushed your changes to the repository, they will be published even if you have not run ``make dirhtml``.
|
||||
|
||||
|
||||
Basic reStructuredText Syntax
|
||||
|
@ -225,13 +225,15 @@ It refers to the section itself, see :ref:`my-reference-label`.
|
|||
|
||||
- Labels that aren’t placed before a section title can still be referenced, but you must give the link an explicit title, using this syntax: ``:ref:`Link title <label-name>```.
|
||||
|
||||
|
||||
**Comments**
|
||||
|
||||
Every explicit markup block which isn’t a valid markup construct (like the footnotes above) is regarded as a comment (ref). For example::
|
||||
Every explicit markup block which isn’t a valid markup construct is regarded as a comment. For example::
|
||||
|
||||
.. This is a comment.
|
||||
.. This is a comment.
|
||||
|
||||
Go to Sphinx_ for more documentation.
|
||||
|
||||
.. _Sphinx: https://www.sphinx-doc.org
|
||||
.. _reStructuredText: https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html
|
||||
|
||||
|
|
|
@ -1,8 +1,10 @@
|
|||
Extracted Features
|
||||
==================
|
||||
|
||||
.. _accelerometer:
|
||||
|
||||
Accelerometer
|
||||
-------------
|
||||
--------------
|
||||
|
||||
Available epochs: daily, morning, afternoon, evening, and night
|
||||
|
||||
|
@ -18,8 +20,10 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- Count exertional activity episodes: count of the episodes of performing exertional activity
|
||||
- Count non-exertional activity episodes: count of the episodes of performing non-exertional activity
|
||||
|
||||
.. _applications_foreground:
|
||||
|
||||
Applications_foreground
|
||||
----------------------
|
||||
-------------------------
|
||||
|
||||
Available epochs: daily, morning, afternoon, evening, and night
|
||||
|
||||
|
@ -28,6 +32,8 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- Time of last use: time of last use all_apps/single_app/single_category/multiple_category in minutes
|
||||
- Frequency entropy: the entropy of the apps frequency for all_apps/single_app/single_category/multiple_category. There is no entropy for single_app.
|
||||
|
||||
.. _battery:
|
||||
|
||||
Battery
|
||||
--------
|
||||
|
||||
|
@ -40,6 +46,8 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- Count charge: number of battery charging episodes
|
||||
- Sum duration charge: total duration of all charging episodes (time the phone was charging)
|
||||
|
||||
.. _bluetooth:
|
||||
|
||||
Bluetooth
|
||||
---------
|
||||
|
||||
|
@ -49,6 +57,8 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- Unique devices (number of unique devices identified by their hardware address -bt_address field)
|
||||
- Count of scans of the most unique device across each participant’s dataset
|
||||
|
||||
.. _calls:
|
||||
|
||||
Calls
|
||||
-----
|
||||
|
||||
|
@ -58,6 +68,8 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- Received: count, count of distinct contacts, mean duration, sum duration, min duration, max duration, std duration, mode duration, entropy duration, time of first call (hours), time of last call (hours), count of most frequent contact.
|
||||
- Missed: count, distinct contacts, time of first call (hours), time of last call (hours), count of most frequent contact.
|
||||
|
||||
.. _google-activity-recognition:
|
||||
|
||||
Google Activity Recognition
|
||||
---------------------------
|
||||
|
||||
|
@ -71,6 +83,8 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- Sum mobile: total duration of episodes of on foot, running, and on bicycle activities
|
||||
- Sum vehicle: total duration of episodes of on vehicle activity
|
||||
|
||||
.. _light:
|
||||
|
||||
Light
|
||||
-----
|
||||
|
||||
|
@ -83,7 +97,9 @@ Available epochs: daily, morning, afternoon, evening, and night
|
|||
- median lux: median ambient luminance in lux units
|
||||
- Std lux: standard deviation of ambient luminance in lux units
|
||||
|
||||
Location (Barnett’s) Fetures
|
||||
.. _location-features:
|
||||
|
||||
Location (Barnett’s) Features
|
||||
-----------------------------
|
||||
|
||||
Available epochs: daily
|
||||
|
@ -105,6 +121,8 @@ Barnett’s location features are based on the concept of flights and pauses. GP
|
|||
- Significant locations. The number of significant locations visited during the day. Significant locations are computed using k-means clustering over pauses found in the whole monitoring period. The number of clusters is found iterating from 1 to 200 stopping until the centroids of two significant locations are within 400 meters of one another.
|
||||
- Significant location entropy. Entropy measurement based on the proportion of time spent at each significant location visited during a day.
|
||||
|
||||
.. _screen:
|
||||
|
||||
Screen
|
||||
------
|
||||
|
||||
|
@ -122,14 +140,18 @@ Notes. An unlock episode is considered as the time between an unlock event and a
|
|||
- Average duration unlock: average duration of unlock episodes
|
||||
- Std duration unlock: standard deviation of the duration of unlock episodes
|
||||
|
||||
.. _sms:
|
||||
|
||||
SMS
|
||||
---
|
||||
----
|
||||
|
||||
Available epochs: daily, morning, afternoon, evening, and night
|
||||
|
||||
- Sent: count, distinct contacts, time first sms, time last sms, count most frequent contact
|
||||
- Received: count, distinct contacts, time first sms, time last sms, count most frequent contact
|
||||
|
||||
.. _fitbit-heart-rate:
|
||||
|
||||
Fitbit: heart rate
|
||||
------------------
|
||||
|
||||
|
@ -151,6 +173,8 @@ Notes. eart rate zones contain 4 zones: out_of_range zone, fat_burn zone, cardio
|
|||
- Length cardio: duration of heart rate in cardio zone in minute
|
||||
- Length peak: duration of heart rate in peak zone in minute
|
||||
|
||||
.. _fitbit-steps:
|
||||
|
||||
Fitbit: steps
|
||||
-------------
|
||||
|
||||
|
|
|
@ -16,6 +16,7 @@ Contents:
|
|||
|
||||
usage/introduction
|
||||
usage/installation
|
||||
usage/snakemake_docs
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
@ -25,6 +26,6 @@ Contents:
|
|||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:caption: Developors
|
||||
:caption: Developers
|
||||
|
||||
develop/documentation
|
|
@ -1,3 +1,5 @@
|
|||
.. _install-page:
|
||||
|
||||
Installation
|
||||
===============
|
||||
|
||||
|
@ -6,7 +8,7 @@ This instructions have been tested on MacOS Catalina and Mojave and Ubuntu 16.04
|
|||
Mac OS (tested on Catalina)
|
||||
----------------------------
|
||||
|
||||
#. Install dependenies (if not installed):
|
||||
#. Install dependencies (Homebrew if not installed):
|
||||
|
||||
- Install brew_ for Mac: ``/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"``
|
||||
|
||||
|
@ -43,6 +45,8 @@ Mac OS (tested on Catalina)
|
|||
- ``snakemake packrat_init``
|
||||
- ``snakemake packrat_restore``
|
||||
|
||||
.. _install-step-8:
|
||||
|
||||
#. Configure the participants to analyze:
|
||||
|
||||
- Create a file per participant in the ``/data/external`` folder, no extension is necessary, its name will be the label for that participant in the pipeline: ``/data/external/pxx``
|
||||
|
@ -55,6 +59,7 @@ Mac OS (tested on Catalina)
|
|||
3a7b0d0a-a9ce-4059-ab98-93a7b189da8a,44f20139-50cc-4b13-bdde-0d5a3889e8f9
|
||||
android
|
||||
|
||||
|
||||
#. Configure the db connection:
|
||||
|
||||
- Create an empty .env file in the root folder
|
||||
|
@ -66,10 +71,19 @@ Mac OS (tested on Catalina)
|
|||
| ``host=MyIP``
|
||||
| ``port=3306``
|
||||
|
||||
|
||||
#. Once all of the installation and configuration steps have been completed, the following command can be run to pull the default sample dataset that comes with this project::
|
||||
|
||||
$ snakemake
|
||||
|
||||
|
||||
This pulls sample data from AWARE_ and processes it with the default rules that come with RAPIDS.
|
||||
|
||||
|
||||
Linux (tested on Ubuntu 16.04)
|
||||
------------------------------
|
||||
|
||||
#. Install dependenies (if not installed):
|
||||
#. Install dependencies (Homebrew - if not installed):
|
||||
|
||||
- ``sudo apt-get install libmariadb-client-lgpl-dev libxml2-dev libssl-dev``
|
||||
- Install brew_ for linux and add the following line to ~/.bashrc: ``export PATH=$HOME/.linuxbrew/bin:$PATH``
|
||||
|
@ -129,12 +143,22 @@ Linux (tested on Ubuntu 16.04)
|
|||
| ``port=3306``
|
||||
|
||||
|
||||
#. Once all of the installation and configuration steps have been completed, the following command can be run to pull the default sample dataset that comes with this project::
|
||||
|
||||
$ snakemake
|
||||
|
||||
This pulls sample data from AWARE_ and processes it with the default rules that come with RAPIDS.
|
||||
|
||||
.. _the-install-note:
|
||||
|
||||
.. note::
|
||||
- Ensure that ``MY_GROUP_NAME`` is the same value for GROUP in the ``DATABASE_GROUP`` variable in the config.yaml file.
|
||||
- Ensure that your list of ``SENSORS`` in the config.yaml file correspond to the sensors used in all rules in the Snakefile file (See Pipeline Structure Section for more information TBD)
|
||||
- Ensure that ``MY_GROUP_NAME`` is the same value for GROUP in the ``DATABASE_GROUP`` variable in the ``config.yaml`` file.
|
||||
- Ensure that your list of ``SENSORS`` in the ``config.yaml`` file corresponds to the sensors used in the ``all`` rule in the ``Snakefile`` file (see :ref:`rapids-structure` for more information).
|
||||
|
||||
|
||||
|
||||
|
||||
.. _bug: https://github.com/Homebrew/linuxbrew-core/issues/17812
|
||||
.. _instructions: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
|
||||
.. _brew: https://docs.brew.sh/Homebrew-on-Linux
|
||||
.. _brew: https://docs.brew.sh/Homebrew-on-Linux
|
||||
.. _AWARE: https://awareframework.com/what-is-aware/
|
|
@ -1,22 +1,40 @@
|
|||
Quick Introduction
|
||||
==================
|
||||
|
||||
The goal of this pipeline is to standardize the data cleaning, featuring extraction, analysis, and evaluation of mobile sensing projects. It leverages Cookiecutter, Snakemake, Sphinx, Scypy, R, and Conda to create an end-to-end reproducible environment that can be published along with research papers.
|
||||
The goal of this pipeline is to standardize the data cleaning, feature extraction, analysis, and evaluation of mobile sensing projects. It leverages Cookiecutter_, Snakemake_, Sphinx_, `SciPy <Scypy_>`_, R_, and Conda_ to create an end-to-end reproducible environment that can be published along with research papers.
|
||||
|
||||
At the moment, mobile data can be collected using different sensing frameworks (Aware, Beiwe) and hardware (Fitbit). The pipeline is agnostic to these data sources and can unify their analysis. The current implementation only handles data collected with Aware. However, it should be easy to extend it to other providers.
|
||||
At the moment, mobile data can be collected using different sensing frameworks (AWARE_, Beiwe_) and hardware (Fitbit_). The pipeline is agnostic to these data sources and can unify their analysis. The current implementation only handles data collected with AWARE_. However, it should be easy to extend it to other providers.
|
||||
|
||||
We recommend reading Snakemake docs, but the main idea behind the pipeline is that every link in the analysis chain is a rule with an input and an output. Input and output (generally) are files, and these files can be manipulated using any programming language (although Snakemake has wrappers for Python, R, and Julia that can make development slightly more comfortable). Snakemake also allows us to spread the execution of rules over multiple cores, which means that a single analysis pipeline can be executed in parallel for all participants in a study without any code changes.
|
||||
We recommend reading the Snakemake_ docs, but the main idea behind the pipeline is that every link in the analysis chain is a rule with an input and an output. Input and output (generally) are files, and these files can be manipulated using any programming language (although Snakemake_ has wrappers for Python_, R_, and Julia_ that can make development slightly more comfortable). Snakemake_ also allows us to spread the execution of rules over multiple cores, which means that a single analysis pipeline can be executed in parallel for all participants in a study without any code changes.
|
||||
|
||||
Available features:
|
||||
|
||||
- SMS
|
||||
- Calls
|
||||
- Bluetooth
|
||||
- Google Activity Recognition
|
||||
- Battery
|
||||
- Location (Barnett's)
|
||||
- Screen
|
||||
- Light
|
||||
- Accelerometer
|
||||
- :ref:`sms`
|
||||
- :ref:`calls`
|
||||
- :ref:`bluetooth`
|
||||
- :ref:`google-activity-recognition`
|
||||
- :ref:`battery`
|
||||
- :ref:`location-features`
|
||||
- :ref:`screen`
|
||||
- :ref:`light`
|
||||
- :ref:`accelerometer`
|
||||
- :ref:`applications_foreground`
|
||||
- :ref:`fitbit-heart-rate`
|
||||
- :ref:`fitbit-steps`
|
||||
|
||||
Applications_foreground
|
||||
|
||||
We are updating these docs constantly, but if you think something needs clarification, feel free to reach out or submit a pull request in GitHub.
|
||||
|
||||
|
||||
.. _Cookiecutter: http://drivendata.github.io/cookiecutter-data-science/
|
||||
.. _Snakemake: https://snakemake.readthedocs.io/en/stable/
|
||||
.. _Sphinx: https://www.sphinx-doc.org/en/master/
|
||||
.. _Scypy: https://www.scipy.org/index.html
|
||||
.. _R: https://www.r-project.org/
|
||||
.. _Conda: https://docs.conda.io/en/latest/
|
||||
.. _AWARE: https://awareframework.com/what-is-aware/
|
||||
.. _Beiwe: https://www.beiwe.org/
|
||||
.. _Fitbit: https://www.fitbit.com/us/home
|
||||
.. _Python: https://www.python.org/
|
||||
.. _Julia: https://julialang.org/
|
|
@ -0,0 +1,244 @@
|
|||
.. _rapids-structure:
|
||||
|
||||
RAPIDS Structure
|
||||
=================
|
||||
|
||||
::
|
||||
|
||||
├── LICENSE
|
||||
├── Makefile <- Makefile with commands like `make data` or `make train`
|
||||
├── README.md <- The top-level README for developers using this project.
|
||||
├── config.yaml <- The configuration settings for the pipeline.
|
||||
├── environment.yml <- Environment settings (channels and dependencies that are installed in the env)
|
||||
├── data
|
||||
│ ├── external <- Data from third party sources.
|
||||
│ ├── interim <- Intermediate data that has been transformed.
|
||||
│ ├── processed <- The final, canonical data sets for modeling.
|
||||
│ └── raw <- The original, immutable data dump.
|
||||
│
|
||||
├── docs <- A default Sphinx project; see sphinx-doc.org for details
|
||||
│
|
||||
├── models <- Trained and serialized models, model predictions, or model summaries
|
||||
│
|
||||
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
|
||||
│ the creator's initials, and a short `-` delimited description, e.g.
|
||||
│ `1.0-jqp-initial-data-exploration`.
|
||||
│
|
||||
├── packrat <- Installed R dependencies. (Packrat is a dependency management system for R)
|
||||
├── references <- Data dictionaries, manuals, and all other explanatory materials.
|
||||
│
|
||||
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
|
||||
│ └── figures <- Generated graphics and figures to be used in reporting.
|
||||
│
|
||||
├── rules
|
||||
│ ├── features <- Rules to process the feature data pulled in to pipeline.
|
||||
│ ├── packrat <- Rules for setting up packrat.
|
||||
│ ├── preprocessing <- Preprocessing rules to clean data before processing.
|
||||
│ ├── analysis <- Analytic rules that are applied to the data.
|
||||
│ └── reports <- Snakefile used to produce reports.
|
||||
│
|
||||
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
|
||||
├── Snakefile <- The root Snakemake file (the equivalent of a Makefile)
|
||||
├── src <- Source code for use in this project. Can be in any language e.g. Python,
|
||||
│ │ R, Julia, etc.
|
||||
│ │
|
||||
│ ├── data <- Scripts to download or generate data. Can be in any language e.g. Python,
|
||||
│ │ R, Julia, etc.
|
||||
│ │
|
||||
│ ├── features <- Scripts to turn raw data into features for modeling. Can be in any language
|
||||
│ │ e.g. Python, R, Julia, etc.
|
||||
│ │
|
||||
│ ├── models <- Scripts to train models and then use trained models to make predictions. Can
|
||||
│ │ be in any language e.g. Python, R, Julia, etc.
|
||||
│ │
|
||||
│ └── visualization <- Scripts to create exploratory and results oriented visualizations. Can be
|
||||
│ in any language e.g. Python, R, Julia, etc.
|
||||
│
|
||||
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org
|
||||
|
||||
|
||||
.. _the-snakefile-file:
|
||||
|
||||
The ``Snakefile`` File
|
||||
----------------------
|
||||
The ``Snakefile`` file (see the actual `Snakefile`_) pulls the entire system together and can be thought of as the menu of RAPIDS, allowing the user to define the sensor data that is desired. The first line in this file identifies the configuration file. Next is a list of included files that define the rules used to pull, clean, process, analyze and report on the data. Last is the ``all`` rule, which lists the sensor data (menu items) that will be processed by the pipeline.
|
||||
|
||||
.. _includes-section:
|
||||
|
||||
Includes
|
||||
"""""""""
|
||||
There are 5 included files in the ``Snakefile`` file.
|
||||
|
||||
- ``packrat.snakefile`` - This file defines the rules to manage the R packages that are used by RAPIDS. (See `packrat`_)
|
||||
- ``preprocessing.snakefile`` - The rules that are used to preprocess the data by cleaning and formatting are contained in this file. (See `preprocessing`_)
|
||||
- ``features.snakefile`` - This file contains the rules that define how the sensor/feature data is processed. (See `features`_)
|
||||
- ``reports.snakefile`` - The file contains the rules that are used to produce the reports. (See `reports`_)
|
||||
|
||||
.. - ``analysis.snakefile`` - The rules that define how the data is analyzed is outlined in this file. (see `analysis <https://github.com/carissalow/rapids/blob/master/rules/analysis.snakefile>`_)
|
||||
|
||||
Includes are relative to the directory of the Snakefile in which they occur. For example, if the above Snakefile resides in the directory ``my/dir``, then Snakemake will search for the include file at ``my/dir/path/to/other/snakefile``, regardless of the working directory.
|
||||
|
||||
.. _rule-all-section:
|
||||
|
||||
``Rule all:``
|
||||
"""""""""""""
|
||||
In RAPIDS the ``all`` rule indirectly specifies the features/sensors that are desired by listing the output files of the pipeline using the ``expand`` function. The ``expand`` function produces every combination of the values given for its variables. Consider the following::
|
||||
|
||||
expand("data/raw/{pid}/{sensor}_raw.csv", pid=config["PIDS"], sensor=config["SENSORS"]),
|
||||
|
||||
If ``config["PIDS"]`` is ``['p01','p02']`` and ``config["SENSORS"]`` is ``['sms', 'calls']``, then the above call would produce::
|
||||
|
||||
["data/raw/p01/sms_raw.csv", "data/raw/p01/calls_raw.csv", "data/raw/p02/sms_raw.csv", "data/raw/p02/calls_raw.csv"]
|
||||
|
||||
Thus, the user of RAPIDS can define all of the desired output files without having to list them manually for every participant in the study. Snakemake then looks for the rule that produces each desired output file and executes that rule. For more information on ``expand`` see `The Expand Function`_.
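The combination step described above can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's implementation: ``expand_paths`` is a hypothetical helper written for this example, and the participant and sensor lists are the ones used in the text.

```python
from itertools import product


def expand_paths(pattern, **wildcards):
    """Hypothetical stand-in for Snakemake's expand(): fill the pattern
    with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


paths = expand_paths(
    "data/raw/{pid}/{sensor}_raw.csv",
    pid=["p01", "p02"],
    sensor=["sms", "calls"],
)
for path in paths:
    print(path)
```

Two participants and two sensors give four paths, so adding a participant to ``PIDS`` adds one file per sensor without touching the ``all`` rule.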
|
||||
|
||||
|
||||
.. _the-env-file:
|
||||
|
||||
The ``.env`` File
|
||||
-------------------
|
||||
The credentials for the database server are placed in the ``.env`` file (remember step 9 on the :ref:`install-page` page). The format of the configuration is shown below::
|
||||
|
||||
[MY_GROUP_NAME]
|
||||
user=MyUSER
|
||||
password=MyPassword
|
||||
host=MyIP
|
||||
port=3306
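The layout above is INI-like (a group header followed by ``key=value`` lines), so it can be read with Python's standard ``configparser``. The sketch below only illustrates the file format; how RAPIDS itself reads the file is not shown here, and ``read_credentials`` is a hypothetical helper.

```python
import configparser

# Same placeholder values as in the example above.
ENV_TEXT = """\
[MY_GROUP_NAME]
user=MyUSER
password=MyPassword
host=MyIP
port=3306
"""


def read_credentials(text, group):
    """Hypothetical helper: return the credentials stored under one group."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[group]
    return {
        "user": section["user"],
        "password": section["password"],
        "host": section["host"],
        "port": section.getint("port"),  # convert the port to an integer
    }


creds = read_credentials(ENV_TEXT, "MY_GROUP_NAME")
print(creds["host"], creds["port"])
```

The group name passed in must match the header in the file exactly, which is why ``MY_GROUP_NAME`` and ``DATABASE_GROUP`` have to agree.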
|
||||
|
||||
|
||||
.. _the-config-file:
|
||||
|
||||
The ``config.yaml`` File
|
||||
------------------------
|
||||
|
||||
The configurations for the pipeline are defined in the ``config.yaml`` (See `config.yaml`_). This contains global settings and variables that are used by the rules. Some of the global variables defined in the ``config.yaml`` file are briefly explained below:
|
||||
|
||||
- ``SENSORS`` - This is a global variable that contains a list of the sensor/feature tables in the database that will be analyzed.
|
||||
- ``PIDS`` - This is the list of the participant IDs to include in the analysis. Create a file for each participant with a matching name ``pXXX`` containing the device_id in the ``data/external/`` directory. (Remember installation :ref:`step 8 <install-step-8>`)
|
||||
- ``DAY_SEGMENTS`` - A variable used to list all of the common day segments.
|
||||
- ``TIMEZONE`` - Time zone variable. Use timezone names from the `List of Timezone`_ and double-check your code; for example, EST is not US Eastern Time.
|
||||
- ``DATABASE_GROUP`` - A variable for the name of the database group that the project uses. (Remember :ref:`Installation Note <the-install-note>`.)
|
||||
- ``DOWNLOAD_DATASET`` - Variable used to store the name of the dataset that will be downloaded for analysis.
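The ``TIMEZONE`` warning above can be checked with Python's standard ``zoneinfo`` module (Python 3.9+, assuming the system time zone database is available). The date below is an arbitrary summer day chosen for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# An arbitrary summer date, when US Eastern Time observes daylight saving.
summer_noon = datetime(2020, 7, 1, 12, 0)

offsets = {}
for name in ("EST", "America/New_York"):
    offset = summer_noon.replace(tzinfo=ZoneInfo(name)).utcoffset()
    offsets[name] = offset.total_seconds() / 3600  # UTC offset in hours
    print(name, offsets[name])
```

``EST`` stays at a fixed UTC-5 all year, while ``America/New_York`` moves to UTC-4 in summer, so timestamps converted with ``EST`` would be off by one hour for part of the year.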
|
||||
|
||||
There are a number of other settings that are specific to the sensor/feature that will be pulled and analyzed by the pipeline. An example of the configuration settings for the :ref:`sms` data is shown below::
|
||||
|
||||
SMS:
|
||||
TYPES : [received, sent]
|
||||
METRICS:
|
||||
received: [count, distinctcontacts, timefirstsms, timelastsms, countmostfrequentcontact]
|
||||
sent: [count, distinctcontacts, timefirstsms, timelastsms, countmostfrequentcontact]
|
||||
DAY_SEGMENTS: *day_segments
|
||||
|
||||
The ``TYPES`` setting defines the types of SMS data that will be analyzed. ``METRICS`` defines the metrics computed for each type of SMS data being analyzed. Finally, ``DAY_SEGMENTS`` lists the day segments (times of day) for which the data is captured.
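As a rough sketch of how these settings fit together, the configuration can be pictured as a nested Python dictionary. The day segment values are an assumption taken from the "Available epochs" lists in the feature docs, and the participant ``p01`` is hypothetical.

```python
# Assumed day segments (not copied from the real config.yaml).
DAY_SEGMENTS = ["daily", "morning", "afternoon", "evening", "night"]
SMS_METRICS = ["count", "distinctcontacts", "timefirstsms",
               "timelastsms", "countmostfrequentcontact"]

config = {
    "DAY_SEGMENTS": DAY_SEGMENTS,
    "SMS": {
        "TYPES": ["received", "sent"],
        "METRICS": {"received": SMS_METRICS, "sent": SMS_METRICS},
        "DAY_SEGMENTS": DAY_SEGMENTS,  # plays the role of the YAML alias *day_segments
    },
}


def metrics_for(sms_type):
    """Look up the metrics for one SMS type, like the lambda in the sms_metrics rule."""
    return config["SMS"]["METRICS"][sms_type]


# One output file per (type, day segment) pair for a hypothetical participant p01.
outputs = [
    f"data/processed/p01/sms_{sms_type}_{segment}.csv"
    for sms_type in config["SMS"]["TYPES"]
    for segment in config["SMS"]["DAY_SEGMENTS"]
]
print(len(outputs), outputs[0], metrics_for("sent"))
```

With two types and five day segments this yields ten files per participant, each computed with the metrics listed for its type.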
|
||||
|
||||
.. _rules-syntax:
|
||||
|
||||
The ``Rules`` Directory
|
||||
------------------------
|
||||
|
||||
The ``rules`` directory contains the ``snakefiles`` that were included in the ``Snakefile`` file. A short description of these files is given in the :ref:`includes-section` section.
|
||||
|
||||
|
||||
Rules
|
||||
""""""
|
||||
|
||||
A Snakemake workflow is defined by specifying rules in a ``Snakefile`` (See the features_ snakefile as an actual example). Rules decompose the workflow into small steps (e.g., the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names. Thus, a rule can consist of a name, input files, output files, and a command to generate the output from the input. The following is the basic structure of a Snakemake rule::
|
||||
|
||||
rule NAME:
|
||||
input: "path/to/inputfile", "path/to/other/inputfile"
|
||||
output: "path/to/outputfile", "path/to/another/outputfile"
|
||||
script: "path/to/somescript.R"
|
||||
|
||||
|
||||
A sample rule from the RAPIDS source code is shown below::
|
||||
|
||||
rule sms_metrics:
|
||||
input:
|
||||
"data/raw/{pid}/messages_with_datetime.csv"
|
||||
params:
|
||||
sms_type = "{sms_type}",
|
||||
day_segment = "{day_segment}",
|
||||
metrics = lambda wildcards: config["SMS"]["METRICS"][wildcards.sms_type]
|
||||
output:
|
||||
"data/processed/{pid}/sms_{sms_type}_{day_segment}.csv"
|
||||
script:
|
||||
"../src/features/sms_metrics.R"
|
||||
|
||||
|
||||
The ``rule`` directive specifies the name of the rule that is being defined. ``params`` defines the additional parameters that need to be set for the rule. In the example immediately above, the parameters will be passed to the script defined in the ``script`` directive of the rule. Instead of a script, a shell command can be called by replacing the ``script`` directive of the rule with a line similar to the following::
|
||||
|
||||
shell: "somecommand {input} {output}"
|
||||
|
||||
Here input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. ``path/to/inputfile path/to/other/inputfile``). It should be noted that rules can be defined without input and output, as seen in the ``packrat`` snakefile. For more information see `Rules documentation`_ and for an actual example see the `packrat`_ snakefile.
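The substitution into the shell command can be pictured with ordinary string formatting. This is a simplified sketch of the behaviour described above, not Snakemake's own code; ``render_shell`` is a hypothetical helper and the paths are the placeholders from the basic rule structure.

```python
def render_shell(template, inputs, outputs):
    """Hypothetical helper: join each file list with spaces and
    substitute it into the {input} and {output} placeholders."""
    return template.format(input=" ".join(inputs), output=" ".join(outputs))


command = render_shell(
    "somecommand {input} {output}",
    inputs=["path/to/inputfile", "path/to/other/inputfile"],
    outputs=["path/to/outputfile", "path/to/another/outputfile"],
)
print(command)
```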
|
||||
|
||||
.. _wildcards:
|
||||
|
||||
Wildcards
|
||||
""""""""""
|
||||
There are times when it would be useful to generalize a rule so that it applies to a number of, e.g., datasets. For this purpose, wildcards can be used. Consider the sample rule from above, repeated below for quick reference::
|
||||
|
||||
rule sms_metrics:
|
||||
input:
|
||||
"data/raw/{pid}/messages_with_datetime.csv"
|
||||
params:
|
||||
sms_type = "{sms_type}",
|
||||
day_segment = "{day_segment}",
|
||||
metrics = lambda wildcards: config["SMS"]["METRICS"][wildcards.sms_type]
|
||||
output:
|
||||
"data/processed/{pid}/sms_{sms_type}_{day_segment}.csv"
|
||||
script:
|
||||
"../src/features/sms_metrics.R"
|
||||
|
||||
If the rule’s output matches a requested file, the substrings matched by the wildcards are propagated to the input and params directives. For example, if another rule in the workflow requires the file ``data/processed/p01/sms_sent_daily.csv``, Snakemake recognizes that the above rule is able to produce it by setting ``pid=p01``, ``sms_type=sent`` and ``day_segment=daily``. Thus, it requests the file ``data/raw/p01/messages_with_datetime.csv`` as input, sets ``sms_type=sent`` and ``day_segment=daily`` in the ``params`` directive, and executes the script ``../src/features/sms_metrics.R``. See the preprocessing_ snakefile for an actual example.
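The matching step can be approximated with a regular expression. The sketch below is an assumption-laden illustration rather than Snakemake's matcher: ``match_wildcards`` is a hypothetical helper, and each wildcard is restricted to characters other than ``/`` and ``_`` to keep the example unambiguous.

```python
import re

OUTPUT_PATTERN = "data/processed/{pid}/sms_{sms_type}_{day_segment}.csv"
INPUT_PATTERN = "data/raw/{pid}/messages_with_datetime.csv"


def match_wildcards(pattern, path):
    """Hypothetical helper: return the wildcard values that make
    `pattern` equal to `path`, or None if the path does not match."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2:  # odd positions hold wildcard names
            regex += f"(?P<{part}>[^/_]+)"
        else:          # even positions hold literal path text
            regex += re.escape(part)
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


wildcards = match_wildcards(OUTPUT_PATTERN, "data/processed/p01/sms_sent_daily.csv")
input_file = INPUT_PATTERN.format(**wildcards)  # unused wildcards are ignored
print(wildcards, input_file)
```

The values recovered from the requested output are reused to build the input path, which is the propagation described in the paragraph above.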
|
||||
|
||||
|
||||
.. _the-data-directory:
|
||||
|
||||
The ``data`` Directory
|
||||
-----------------------
|
||||
|
||||
This directory contains the data files for the project. Its subdirectories are as follows:
|
||||
|
||||
- ``external`` - This directory stores the participant ``pxxx`` files that contain the device_id and the type of device, as well as data from third party sources. (Remember step 8 on the :ref:`install-page` page)
|
||||
- ``raw`` - This directory contains the original, immutable data dump from the sensor database.
|
||||
- ``interim`` - This directory would contain intermediate data that has been transformed but has not been completely analyzed.
|
||||
- ``processed`` - This directory contains the final canonical data sets for modeling.
|
||||
|
||||
|
||||
.. _the-src-directory:
|
||||
|
||||
The ``src`` Directory
|
||||
----------------------
|
||||
|
||||
The ``src`` directory holds all of the scripts used by the pipeline. These scripts can be in any programming language including but not limited to Python_, R_ and Julia_. This directory is organized into the following directories:
|
||||
|
||||
- ``data`` - This directory contains scripts that are used to pull and clean the data to be analyzed. See `data directory`_
|
||||
- ``features`` - This directory contains scripts that deal with processing feature and sensor data. See `features directory`_
|
||||
- ``models`` - This directory contains the model scripts for building and training models. See `models directory`_
|
||||
- ``visualization`` - This directory contains the scripts that visualize the results of the models. See `visualization directory`_
|
||||
|
||||
|
||||
.. _the-report-directory:
|
||||
|
||||
The ``reports`` Directory
|
||||
--------------------------
|
||||
|
||||
This contains the reports of the results of the analysis done by the pipeline.
|
||||
|
||||
.. _Python: https://www.python.org/
|
||||
.. _Julia: https://julialang.org/
|
||||
.. _R: https://www.r-project.org/
|
||||
.. _`List of Timezone`: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
|
||||
.. _`The Expand Function`: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function
|
||||
.. _`example snakefile`: https://github.com/carissalow/rapids/blob/master/rules/features.snakefile
|
||||
.. _packrat: https://github.com/carissalow/rapids/blob/master/rules/packrat.snakefile
|
||||
.. _preprocessing: https://github.com/carissalow/rapids/blob/master/rules/preprocessing.snakefile
|
||||
.. _features: https://github.com/carissalow/rapids/blob/master/rules/features.snakefile
|
||||
.. _reports: https://github.com/carissalow/rapids/blob/master/rules/reports.snakefile
|
||||
.. _`Rules documentation`: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rules
|
||||
.. _`data directory`: https://github.com/carissalow/rapids/tree/master/src/data
|
||||
.. _`features directory`: https://github.com/carissalow/rapids/tree/master/src/features
|
||||
.. _`models directory`: https://github.com/carissalow/rapids/tree/master/src/models
|
||||
.. _`visualization directory`: https://github.com/carissalow/rapids/tree/master/src/visualization
|
||||
.. _`config.yaml`: https://github.com/carissalow/rapids/blob/master/config.yaml
|
||||
.. _`Snakefile`: https://github.com/carissalow/rapids/blob/master/Snakefile
|