See also
- @collate in the Ruffus Manual
- Decorators for more decorators
collate
@collate( input, filter, output, [extras,...] )
Purpose:
Use filter to identify sets of inputs that are to be grouped, or collated, together.
All inputs that generate identical output and extras after substitution with the formatter or regex (regular expression) filter are collated into one job.
This is a many to fewer operation.
Only out of date jobs (comparing input and output files) will be re-run.
- Example:
regex(r".+\.(.+)$"), r"\1.summary" creates a separate summary file for each suffix:

    animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

The output and optional extras parameters are passed to the task function after string substitution. Non-string values are passed through unchanged.
Each collate job consists of the input files that string substitution maps to identical output and extras.
The above example results in two jobs:

    ["a.fish", "b.fish" -> "fish.summary"]
    ["c.mammals", "d.mammals" -> "mammals.summary"]

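The grouping described above can be imitated with the standard library alone. This is only a sketch of the idea, not Ruffus's implementation: `collate_jobs` is a hypothetical helper, and it assumes that inputs which do not match the pattern are skipped and that the pattern matches the whole file name.

```python
import re
from collections import defaultdict


def collate_jobs(input_files, pattern, output_template):
    """Group input files by the output name each one substitutes to.

    Hypothetical helper for illustration; not part of Ruffus.
    """
    regex = re.compile(pattern)
    jobs = defaultdict(list)
    for name in input_files:
        match = regex.search(name)
        if match is None:
            continue  # assumption: non-matching inputs are left out
        # Expand back-references such as \1 in the output template.
        jobs[match.expand(output_template)].append(name)
    return dict(jobs)


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
jobs = collate_jobs(animal_files, r".+\.(.+)$", r"\1.summary")
for output, inputs in jobs.items():
    print(inputs, "->", output)
```

Each key of `jobs` corresponds to one collate job: every input that substitutes to the same output name ends up in the same list.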
Parameters:
- input = tasks_or_file_names
can be a:
- Task / list of tasks.
File names are taken from the output of the specified task(s)
- (Nested) list of file name strings (as in the example above).
- File names containing *[]? will be expanded as a glob. E.g.: "a.*" => "a.1", "a.2"
- filter = matching_regex
is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.
- filter = matching_formatter
is a formatter indicator object, optionally containing a Python regular expression (re).
- output = output
Specifies the resulting output file name(s) after string substitution
- extras = extras
Any extra parameters are passed verbatim to the task function.
If you are using named parameters, these can be passed as a list, i.e. extras=[...]
Any extra parameters are consumed by the task function and not forwarded further down the pipeline.
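The glob expansion of input file names mentioned in the parameter list can be pictured with Python's own glob module. The sketch below is an assumption about the behaviour, not Ruffus code: `expand_inputs` is a hypothetical helper, and the sorted order of the matches is chosen here only to make the result predictable.

```python
import glob
import os
import tempfile


def expand_inputs(names):
    """Expand names containing glob characters; keep other names as given.

    Hypothetical helper for illustration; not part of Ruffus.
    """
    expanded = []
    for name in names:
        if any(ch in name for ch in "*[]?"):
            # Sorted only so that the example output is deterministic.
            expanded.extend(sorted(glob.glob(name)))
        else:
            expanded.append(name)
    return expanded


with tempfile.TemporaryDirectory() as workdir:
    for fname in ("a.1", "a.2", "b.1"):
        open(os.path.join(workdir, fname), "w").close()
    # Escape the directory part so only "a.*" is treated as a pattern.
    pattern = os.path.join(glob.escape(workdir), "a.*")
    found = expand_inputs([pattern, "plain.txt"])
    basenames = [os.path.basename(path) for path in found]
    print(basenames)
```

Names without glob characters, such as "plain.txt" here, are passed through whether or not the file exists.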
Example2:
Suppose we had the following files:

    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
    pufferfish.fish.animal

and we wanted to end up with three different output files:

    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
        -> mammals.results

    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
        -> reptile.results

    pufferfish.fish.animal
        -> fish.results

This is the @collate code required:

    animals = [ "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

    @collate(animals, regex(r"(.+)\.(.+)\.animal"), r"\2.results")
    # \1 = species [cows, horses]
    # \2 = phylogenetics group [mammals, reptile, fish]
    def summarize_animals_into_groups(species_file, result_file):
        " ... more code here"
        pass

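A collated task function receives the list of grouped input files and a single output file name. As a sketch of what the elided body might do, the function below writes one line per input file to the result file. The body (counting lines per species file) is purely hypothetical, and the function is called directly, without the decorator, so that it runs without Ruffus installed.

```python
import os
import tempfile


def summarize_animals_into_groups(species_files, result_file):
    """Write "<species><TAB><line count>" for every input file.

    Hypothetical body: the original example leaves it unspecified.
    """
    with open(result_file, "w") as out:
        for species_file in sorted(species_files):
            species = os.path.basename(species_file).split(".")[0]
            with open(species_file) as src:
                line_count = len(src.read().splitlines())
            out.write(f"{species}\t{line_count}\n")


with tempfile.TemporaryDirectory() as workdir:
    inputs = []
    for name, text in [("snake", "a\nb\n"), ("lizard", "c\n")]:
        path = os.path.join(workdir, f"{name}.reptile.animal")
        with open(path, "w") as handle:
            handle.write(text)
        inputs.append(path)
    result = os.path.join(workdir, "reptile.results")
    # Equivalent to one collate job: many inputs, one output.
    summarize_animals_into_groups(inputs, result)
    with open(result) as handle:
        summary = handle.read()
    print(summary)
```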
See @merge for an alternative way to summarise files.
See also
- Use of add_inputs(...) | inputs(...) in the Ruffus Manual
@collate( input, filter, replace_inputs | add_inputs, output, [extras,...] )
- Purpose:
Use filter to identify sets of inputs that are to be grouped, or collated, together.
All inputs that generate identical output and extras after substitution with the formatter or regex (regular expression) filter are collated into one job.
This variant of @collate allows additional inputs or dependencies to be added dynamically to the task, with optional string substitution.
add_inputs nests the original input parameters in a list before adding additional dependencies.
inputs replaces the original input parameters wholesale.
This is a many to fewer operation.
Only out of date jobs (comparing input and output files) will be re-run.
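The difference between add_inputs and inputs described above can be sketched with the standard library. This is an illustration under stated assumptions, not Ruffus's implementation: `collate_with_inputs` and its `replace` flag are hypothetical, inputs are sorted and duplicate entries dropped only to give a compact, predictable result, and the pattern is assumed to match the whole file name.

```python
import re
from collections import defaultdict


def collate_with_inputs(input_files, pattern, extra_template,
                        output_template, replace):
    """Imitate add_inputs (replace=False) or inputs (replace=True).

    Hypothetical helper for illustration; not part of Ruffus.
    """
    regex = re.compile(pattern)
    jobs = defaultdict(list)
    for name in sorted(input_files):
        match = regex.search(name)
        if match is None:
            continue  # assumption: non-matching inputs are left out
        extra = match.expand(extra_template)
        output = match.expand(output_template)
        # add_inputs: pair the original input with the extra dependency.
        # inputs: keep only the substituted dependency.
        entry = extra if replace else [name, extra]
        if entry not in jobs[output]:
            jobs[output].append(entry)
    return dict(jobs)


animal_files = ["tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"]
args = (animal_files, r".+\.(.+)$", r"\1.date_of_birth", r"\1.summary")
added = collate_with_inputs(*args, replace=False)
replaced = collate_with_inputs(*args, replace=True)
print(added["fish.summary"])
print(replaced["fish.summary"])
```

With `replace=False` each job keeps its original files alongside the extra dependency; with `replace=True` only the substituted dependency remains.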
Example of add_inputs
regex(r".+\.(.+)$"), r"\1.summary" creates a separate summary file for each suffix. But we also add date of birth data for each species:

    animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), add_inputs(r"\1.date_of_birth"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

This results in the following equivalent function calls:

    summarize([ ["shark.fish", "fish.date_of_birth" ],
                ["tuna.fish",  "fish.date_of_birth" ] ], "fish.summary")
    summarize([ ["cat.mammals", "mammals.date_of_birth"],
                ["dog.mammals", "mammals.date_of_birth"] ], "mammals.summary")

Example of inputs
Using inputs(...) will summarise only the dates of birth for each species group:

    animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), inputs(r"\1.date_of_birth"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

This results in the following equivalent function calls:

    summarize(["fish.date_of_birth" ], "fish.summary")
    summarize(["mammals.date_of_birth"], "mammals.summary")

Parameters:
- input = tasks_or_file_names
can be a:
- Task / list of tasks.
File names are taken from the output of the specified task(s)
- (Nested) list of file name strings (as in the example above).
- File names containing *[]? will be expanded as a glob. E.g.: "a.*" => "a.1", "a.2"
- filter = matching_regex
is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.
- filter = matching_formatter
is a formatter indicator object, optionally containing a Python regular expression (re).
- add_inputs = add_inputs(...) or replace_inputs = inputs(...)
Specifies the resulting input(s) to each job.
Positional parameters must be disambiguated by wrapping the values in inputs(...) or add_inputs(...).
Named parameters can be passed the values directly.
Takes:
- Task / list of tasks.
File names are taken from the output of the specified task(s)
- (Nested) list of file name strings.
Strings will be subject to substitution. File names containing *[]? will be expanded as a glob. E.g.: "a.*" => "a.1", "a.2"
- output = output
Specifies the resulting output file name(s).
- extras = extras
Any extra parameters are passed verbatim to the task function.
If you are using named parameters, these can be passed as a list, i.e. extras=[...]
See @collate for more straightforward ways to use collate.