# collate

## @collate( input, filter, output, [extras,...] )

Purpose:

Use filter to identify common sets of inputs which are to be grouped or collated together:

Each set of inputs that generates identical output and extras under the formatter or regex (regular expression) filter is collated into one job.

This is a many to fewer operation.

Only out of date jobs (comparing input and output files) will be re-run.

Example:

regex(r".+\.(.+)$"), "\1.summary" creates a separate summary file for each suffix:

    from ruffus import *

    animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"
    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"),  r'\1.summary')
    def summarize(infiles, summary_file):
        pass

1. output and optional extras parameters are passed to the task function after string substitution. Non-string values are passed through unchanged.

2. Each collate job consists of the input files that string substitution maps to identical output and extras.

3. The above example results in two jobs:
["a.fish", "b.fish" -> "fish.summary"]
["c.mammals", "d.mammals" -> "mammals.summary"]
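The grouping in the example above can be reproduced in plain Python. This is only a sketch of the idea, not how Ruffus implements it: it assumes that substitution behaves like `re.sub` applied to each input name and that inputs which do not match the filter are ignored. `collate_jobs` is a hypothetical helper and is not part of Ruffus.

```python
import re
from collections import defaultdict


def collate_jobs(input_files, pattern, output_template):
    """Group inputs by the output name each one maps to (hypothetical helper)."""
    jobs = defaultdict(list)
    for name in input_files:
        if re.search(pattern, name) is None:
            continue  # assumption: non-matching inputs are ignored
        output = re.sub(pattern, output_template, name)
        jobs[output].append(name)
    return dict(jobs)


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
jobs = collate_jobs(animal_files, r".+\.(.+)$", r"\1.summary")
for output, inputs in sorted(jobs.items()):
    print(inputs, "->", output)
```

Each key of `jobs` corresponds to one job, and its value is the list of inputs that would be passed to the task function together.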

Parameters:

• input = tasks_or_file_names

can be a:

1. Task or list of tasks.
File names are taken from the output of the specified task(s).

2. (Nested) list of file name strings (as in the example above).
File names containing *[]? will be expanded as a glob.

E.g.:"a.*" => "a.1", "a.2"

• filter = matching_regex

A Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

• output = output

Specifies the resulting output file name(s) after string substitution

• extras = extras

Any extra parameters are passed verbatim to the task function

If you are using named parameters, these can be passed as a list, i.e. extras= [...]

Any extra parameters are consumed by the task function and not forwarded further down the pipeline.
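The notes above say that output and string extras go through substitution, that non-string values pass through unchanged, and that inputs are grouped when both output and extras come out identical. A minimal sketch of that rule follows, again assuming `re.sub` semantics. `substitute` and `collate_with_extras` are hypothetical names used for illustration only.

```python
import re
from collections import defaultdict


def substitute(pattern, value, name):
    """Apply regex substitution to strings; pass other values through unchanged."""
    if isinstance(value, str):
        return re.sub(pattern, value, name)
    return value


def collate_with_extras(input_files, pattern, output_template, extras):
    """Group inputs whose substituted (output, extras) pair is identical."""
    jobs = defaultdict(list)
    for name in input_files:
        if re.search(pattern, name) is None:
            continue  # assumption: non-matching inputs are ignored
        output = substitute(pattern, output_template, name)
        job_extras = tuple(substitute(pattern, e, name) for e in extras)
        jobs[(output, job_extras)].append(name)
    return dict(jobs)


files = ["a.fish", "b.fish", "c.mammals"]
jobs = collate_with_extras(files, r".+\.(.+)$", r"\1.summary", [r"\1", 42])
for (output, job_extras), inputs in sorted(jobs.items()):
    print(inputs, output, job_extras)
```

In this sketch the string extra `r"\1"` becomes the suffix of each group, while the integer `42` is left untouched.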

Example 2:

Suppose we had the following files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal

pufferfish.fish.animal


and we wanted to end up with three different result files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal
-> mammals.results

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal
-> reptile.results

pufferfish.fish.animal
-> fish.results


This is the @collate code required:

    animals = [ "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

    @collate(animals, regex(r"(.+)\.(.+)\.animal"),  r"\2.results")
    # \1 = species [cows, horses]
    # \2 = phylogenetic group [mammals, reptile, fish]
    def summarize_animals_into_groups(species_file, result_file):
        " ... more code here"
        pass
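The body of the task function is left open above. As one possible illustration, the sketch below fills it with a hypothetical summary (file name and line count per species file) and calls the function directly on throwaway files, the way the "reptile" job would call it. The file contents and the summary format are assumptions made for the example, and the first parameter is named `species_files` here because it receives a list.

```python
import os
import tempfile


def summarize_animals_into_groups(species_files, result_file):
    """Write one line per species file: its base name and its line count."""
    with open(result_file, "w") as out:
        for path in sorted(species_files):
            with open(path) as handle:
                n_lines = sum(1 for _ in handle)
            out.write(f"{os.path.basename(path)}\t{n_lines}\n")


# Direct call with temporary files, mimicking a single collated job.
with tempfile.TemporaryDirectory() as tmp:
    species = []
    for name, content in [("snake.reptile.animal", "a\nb\n"),
                          ("lizard.reptile.animal", "c\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(content)
        species.append(path)
    result = os.path.join(tmp, "reptile.results")
    summarize_animals_into_groups(species, result)
    with open(result) as handle:
        summary = handle.read()
print(summary)
```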


See @merge for an alternative way to summarise files.

## @collate( input, filter, replace_inputs | add_inputs, output, [extras,...] )

Purpose:

Use filter to identify common sets of inputs which are to be grouped or collated together:

Each set of inputs that generates identical output and extras under the formatter or regex (regular expression) filter is collated into one job.

This variant of @collate allows additional inputs or dependencies to be added dynamically to the task, with optional string substitution.

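inputs(...) replaces the original input parameters wholesale, whereas add_inputs(...) keeps each original input and pairs it with the additional ones (compare the two examples below).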

This is a many to fewer operation.

Only out of date jobs (comparing input and output files) will be re-run.

Example of add_inputs:

regex(r".+\.(.+)$"), "\1.summary" creates a separate summary file for each suffix. But we also add date of birth data for each species:

    animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"
    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), add_inputs(r"\1.date_of_birth"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

This results in the following equivalent function calls:

    summarize([ ["shark.fish",  "fish.date_of_birth"   ],
                ["tuna.fish",   "fish.date_of_birth"   ] ], "fish.summary")
    summarize([ ["cat.mammals", "mammals.date_of_birth"],
                ["dog.mammals", "mammals.date_of_birth"] ], "mammals.summary")

Example of inputs:

Using inputs(...) will summarise only the dates of birth for each species group:

    animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"
    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"),  inputs(r"\1.date_of_birth"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass

This results in the following equivalent function calls:

    summarize(["fish.date_of_birth"   ], "fish.summary")
    summarize(["mammals.date_of_birth"], "mammals.summary")
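The difference between the two modes can be emulated in plain Python. This sketch only mirrors the equivalent function calls listed above. It assumes `re.sub` semantics, sorted inputs, and that identical entries within a job are listed once. `collate_inputs` and its `mode` argument are hypothetical and not part of Ruffus.

```python
import re
from collections import defaultdict


def collate_inputs(input_files, pattern, extra_template, output_template, mode):
    """Emulate the per-job input parameter for mode 'add' or 'replace'."""
    jobs = defaultdict(list)
    for name in sorted(input_files):
        if re.search(pattern, name) is None:
            continue  # assumption: non-matching inputs are ignored
        output = re.sub(pattern, output_template, name)
        extra = re.sub(pattern, extra_template, name)
        if mode == "add":
            entry = [name, extra]      # original input kept, extra input added
        else:
            entry = extra              # original input replaced
        if entry not in jobs[output]:  # assumption: duplicates listed once
            jobs[output].append(entry)
    return dict(jobs)


animal_files = ["tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"]
pattern = r".+\.(.+)$"
add_jobs = collate_inputs(animal_files, pattern, r"\1.date_of_birth",
                          r"\1.summary", "add")
replace_jobs = collate_inputs(animal_files, pattern, r"\1.date_of_birth",
                              r"\1.summary", "replace")
print(add_jobs)
print(replace_jobs)
```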


Parameters:

• input = tasks_or_file_names

can be a:

1. Task or list of tasks.
File names are taken from the output of the specified task(s).

2. (Nested) list of file name strings (as in the example above).
File names containing *[]? will be expanded as a glob.

E.g.:"a.*" => "a.1", "a.2"

• filter = matching_regex

A Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

• replace_inputs | add_inputs = inputs(...) | add_inputs(...)

Specifies the resulting input(s) to each job.

Positional parameters must be disambiguated by wrapping the values in inputs(...) or add_inputs(...).

Named parameters can be passed the values directly.

Takes:

1. Task or list of tasks.
File names are taken from the output of the specified task(s).

2. (Nested) list of file name strings.

Strings will be subject to substitution. File names containing *[]? will be expanded as a glob. E.g. "a.*" => "a.1", "a.2"

• output = output

Specifies the resulting output file name(s).

• extras = extras

Any extra parameters are passed verbatim to the task function

If you are using named parameters, these can be passed as a list, i.e. extras= [...]
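The parameter notes above say that input strings are subject to substitution and that names containing `*[]?` are expanded as a glob. The sketch below illustrates one way those two steps could combine, assuming substitution happens first and the result is then expanded with the standard `glob` module. The ordering, the `expand_input` helper and the file names are assumptions for illustration.

```python
import glob
import os
import re
import tempfile


def expand_input(template, pattern, name, directory):
    """Substitute regex groups into template, then glob-expand any wildcards."""
    substituted = re.sub(pattern, template, name)
    if any(ch in substituted for ch in "*[]?"):
        full_pattern = os.path.join(glob.escape(directory), substituted)
        return sorted(os.path.basename(m) for m in glob.glob(full_pattern))
    return [substituted]  # no wildcard: the name is used as is


with tempfile.TemporaryDirectory() as tmp:
    for fname in ("fish.1", "fish.2", "mammals.1"):
        open(os.path.join(tmp, fname), "w").close()
    expanded = expand_input(r"\1.*", r".+\.(.+)$", "tuna.fish", tmp)
    literal = expand_input(r"\1.date_of_birth", r".+\.(.+)$", "tuna.fish", tmp)
print(expanded, literal)
```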

See @collate for more straightforward ways to use collate.