.. include:: ../global.inc
.. _decorators.collate:
.. index::
    pair: @collate; Syntax

.. role:: raw-html(raw)
   :format: html

:raw-html:``

.. role:: red

.. seealso::

    * :ref:`@collate ` in the **Ruffus** Manual
    * :ref:`Decorators ` for more decorators

.. |input| replace:: `input`
.. _input: `decorators.collate.input`_
.. |extras| replace:: `extras`
.. _extras: `decorators.collate.extras`_
.. |output| replace:: `output`
.. _output: `decorators.collate.output`_
.. |filter| replace:: `filter`
.. _filter: `decorators.collate.filter`_
.. |matching_regex| replace:: `matching_regex`
.. _matching_regex: `decorators.collate.matching_regex`_
.. |matching_formatter| replace:: `matching_formatter`
.. _matching_formatter: `decorators.collate.matching_formatter`_

########################################################################
collate
########################################################################

**********************************************************************************************************************************
@collate( |input|_, |filter|_, |output|_, [|extras|_,...] )
**********************************************************************************************************************************

**Purpose:**

Use |filter|_ to identify common sets of |input|_\ s which are to be grouped or collated together.

Each set of |input|_\ s that generates identical |output|_ and |extras|_ under the :ref:`formatter` or :ref:`regex` (regular expression) filter is collated into one job.

This is a **many to fewer** operation.

Only out-of-date jobs (determined by comparing input and output files) are re-run.

**Example**:

``regex(r".+\.(.+)$")``, ``"\1.summary"`` creates a separate summary file for each suffix::

    from ruffus import collate, regex

    animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

#. |output|_ and optional |extras|_ parameters are passed to the function after string
   substitution. Non-string values are passed through unchanged.
#. Each collate job consists of the |input|_ files that string substitution maps to
   identical |output|_ and |extras|_.
#. | The above example results in two jobs:
   | ``["a.fish", "b.fish" -> "fish.summary"]``
   | ``["c.mammals", "d.mammals" -> "mammals.summary"]``

**Parameters:**

.. _decorators.collate.input:

* **input** = *tasks_or_file_names*
  can be a:

  #. Task / list of tasks.
     File names are taken from the output of the specified task(s).
  #. (Nested) list of file name strings (as in the example above).
     File names containing ``*[]?`` will be expanded as a |glob|_.
     E.g. ``"a.*" => "a.1", "a.2"``

.. _decorators.collate.filter:
.. _decorators.collate.matching_regex:

* **filter** = *matching_regex*
  is a Python regular expression string, which must be wrapped in
  a :ref:`regex` indicator object.
  See the Python `regular expression (re) `_ documentation for details of regular expression syntax.

.. _decorators.collate.matching_formatter:

* **filter** = *matching_formatter*
  a :ref:`formatter` indicator object, optionally containing a Python `regular expression (re) `_.

.. _decorators.collate.output:

* **output** = *output*
  Specifies the resulting output file name(s) after string substitution.

.. _decorators.collate.extras:

* **extras** = *extras*
  Any extra parameters are passed verbatim to the task function.

  If you are using named parameters, these can be passed as a list, i.e. ``extras= [...]``

  Any extra parameters are consumed by the task function and not forwarded further down the pipeline.

**Example 2**:

Suppose we had the following files::

    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
    pufferfish.fish.animal

and we wanted to end up with three different result files::

    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
        -> mammals.results

    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
        -> reptile.results

    pufferfish.fish.animal
        -> fish.results

This is the ``@collate`` code required::

    from ruffus import collate, regex

    animals = [ "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

    @collate(animals, regex(r"(.+)\.(.+)\.animal"), r"\2.results")
    # \1 = species [cows, horses]
    # \2 = phylogenetic group [mammals, reptile, fish]
    def summarize_animals_into_groups(species_file, result_file):
        " ... more code here"
        pass

See :ref:`@merge ` for an alternative way to summarise files.

.. _decorators.collate_ex:

.. index::
    pair: @collate (Advanced Usage); Syntax
    pair: @collate, inputs(...); Syntax
    pair: @collate, add_inputs(...); Syntax

.. seealso::

    * :ref:`Use of add_inputs(...) | inputs(...) ` in the **Ruffus** Manual

.. |coll_input| replace:: `input`
.. _coll_input: `decorators.collate_ex.input`_
.. |coll_extras| replace:: `extras`
.. _coll_extras: `decorators.collate_ex.extras`_
.. |coll_output| replace:: `output`
.. _coll_output: `decorators.collate_ex.output`_
.. |coll_filter| replace:: `filter`
.. _coll_filter: `decorators.collate_ex.filter`_
.. |coll_matching_regex| replace:: `matching_regex`
.. _coll_matching_regex: `decorators.collate_ex.matching_regex`_
.. |coll_matching_formatter| replace:: `matching_formatter`
.. _coll_matching_formatter: `decorators.collate_ex.matching_formatter`_
.. |coll_replace_inputs| replace:: `replace_inputs`
.. _coll_replace_inputs: `decorators.collate_ex.replace_inputs`_
.. |coll_add_inputs| replace:: `add_inputs`
.. _coll_add_inputs: `decorators.collate_ex.add_inputs`_

**********************************************************************************************************************************
@collate( |coll_input|_, |coll_filter|_, |coll_replace_inputs|_ | |coll_add_inputs|_, |coll_output|_, [|coll_extras|_,...] )
**********************************************************************************************************************************

**Purpose:**

Use |coll_filter|_ to identify common sets of |coll_input|_\ s which are to be grouped or collated together.

Each set of |coll_input|_\ s that generates identical |coll_output|_ and |coll_extras|_ under the :ref:`formatter` or :ref:`regex` (regular expression) filter is collated into one job.

This variant of ``@collate`` allows additional inputs or dependencies to be added dynamically to the task, with optional string substitution.

:ref:`add_inputs` nests the original input parameters in a list before adding additional dependencies.

:ref:`inputs` replaces the original input parameters wholesale.

This is a **many to fewer** operation.

Only out-of-date jobs (determined by comparing input and output files) are re-run.

**Example of** :ref:`add_inputs`

``regex(r".+\.(.+)$")``, ``"\1.summary"`` creates a separate summary file for each suffix.
But we also add date of birth data for each species::

    from ruffus import collate, regex, add_inputs

    animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), add_inputs(r"\1.date_of_birth"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

This results in the following equivalent function calls::

    summarize([ ["shark.fish",  "fish.date_of_birth"   ],
                ["tuna.fish",   "fish.date_of_birth"   ] ], "fish.summary")
    summarize([ ["cat.mammals", "mammals.date_of_birth"],
                ["dog.mammals", "mammals.date_of_birth"] ], "mammals.summary")

**Example of** :ref:`inputs`

Using ``inputs(...)`` will summarise only the dates of birth for each species group::

    from ruffus import collate, regex, inputs

    animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), inputs(r"\1.date_of_birth"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

This results in the following equivalent function calls::

    summarize(["fish.date_of_birth"   ], "fish.summary")
    summarize(["mammals.date_of_birth"], "mammals.summary")

**Parameters:**

.. _decorators.collate_ex.input:

* **input** = *tasks_or_file_names*
  can be a:

  #. Task / list of tasks.
     File names are taken from the output of the specified task(s).
  #. (Nested) list of file name strings (as in the example above).
     File names containing ``*[]?`` will be expanded as a |glob|_.
     E.g. ``"a.*" => "a.1", "a.2"``

.. _decorators.collate_ex.filter:
.. _decorators.collate_ex.matching_regex:

* **filter** = *matching_regex*
  is a Python regular expression string, which must be wrapped in
  a :ref:`regex` indicator object.
  See the Python `regular expression (re) `_ documentation for details of regular expression syntax.

.. _decorators.collate_ex.matching_formatter:

* **filter** = *matching_formatter*
  a :ref:`formatter` indicator object, optionally containing a Python `regular expression (re) `_.

.. _decorators.collate_ex.add_inputs:
.. _decorators.collate_ex.replace_inputs:

* **add_inputs** = *add_inputs*\ (...) or **replace_inputs** = *inputs*\ (...)
  Specifies the resulting |coll_input|_\ (s) to each job.

  Positional parameters must be disambiguated by wrapping the values in :ref:`inputs(...)` or :ref:`add_inputs(...)`.

  Named parameters can be passed the values directly.

  Takes:

  #. Task / list of tasks.
     File names are taken from the output of the specified task(s).
  #. (Nested) list of file name strings.
     Strings will be subject to substitution.
     File names containing ``*[]?`` will be expanded as a |glob|_.
     E.g. ``"a.*" => "a.1", "a.2"``

.. _decorators.collate_ex.output:

* **output** = *output*
  Specifies the resulting output file name(s).

.. _decorators.collate_ex.extras:

* **extras** = *extras*
  Any extra parameters are passed verbatim to the task function.

  If you are using named parameters, these can be passed as a list, i.e. ``extras= [...]``

See :ref:`@collate ` for more straightforward ways to use collate.
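
**Illustration of the grouping rule**

The "many to fewer" grouping described on this page can be sketched in plain Python, without Ruffus. This is only an illustration of the rule "inputs that substitute to the same output belong to one job"; it is not how Ruffus is implemented. The function ``collate_sketch`` and its parameter names are hypothetical, and the sketch assumes that inputs which do not match the regular expression are simply skipped.

```python
import re
from collections import defaultdict


def collate_sketch(input_files, pattern, output_template, add_inputs_template=None):
    """Group input file names by the output name each one substitutes to.

    Hypothetical helper for illustration only; not part of Ruffus.
    """
    compiled = re.compile(pattern)
    jobs = defaultdict(list)
    for name in input_files:
        match = compiled.search(name)
        if match is None:
            # Assumption: non-matching inputs take no part in any job.
            continue
        # Substitute back-references such as \1 into the output template.
        output = match.expand(output_template)
        if add_inputs_template is None:
            jobs[output].append(name)
        else:
            # Mimic add_inputs(...): nest the original input with the extra one.
            jobs[output].append([name, match.expand(add_inputs_template)])
    # Sort each group so the result does not depend on input order.
    return {output: sorted(group) for output, group in jobs.items()}


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
for output, group in collate_sketch(animal_files, r".+\.(.+)$", r"\1.summary").items():
    print(group, "->", output)
```

Passing a fourth argument such as ``r"\1.date_of_birth"`` pairs every original input with the substituted extra file name, which mirrors the nested lists shown in the ``add_inputs`` example.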