collate

@collate( input, filter, output, [extras,...] )

Purpose:

Use filter to identify common sets of inputs which are to be grouped or collated together:

Each set of inputs that generates identical output and extras under the formatter or regex (regular expression) filter is collated into one job.

This is a many to fewer operation.

Only out of date jobs (comparing input and output files) will be re-run.
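
The out-of-date rule can be pictured with a small timestamp comparison. This is only a sketch under the assumption that a job is stale when an output is missing or older than the newest input; needs_rerun is a hypothetical helper, not part of Ruffus, and the library's own check may be more elaborate.

```python
import os


def needs_rerun(input_files, output_files):
    """Return True if any output is missing or older than the newest input.

    Hypothetical helper for illustration; expects at least one input file.
    """
    if not all(os.path.exists(f) for f in output_files):
        return True
    newest_input = max(os.path.getmtime(f) for f in input_files)
    oldest_output = min(os.path.getmtime(f) for f in output_files)
    return oldest_output < newest_input
```

A collated job would be re-run as a whole: if any one of its inputs is newer than the summary file, every input in the group is passed to the task function again.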
Example:
regex(r".+\.(.+)$"), "\1.summary" creates a separate summary file for each suffix:
animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"
# summarise by file suffix:
@collate(animal_files, regex(r".+\.(.+)$"),  r'\1.summary')
def summarize(infiles, summary_file):
    pass
The output and optional extras parameters are passed to the task function after string substitution. Non-string values are passed through unchanged.

Each collate job consists of the input files that string substitution maps to identical output and extras.

The above example results in two jobs:

["a.fish", "b.fish" -> "fish.summary"]

["c.mammals", "d.mammals" -> "mammals.summary"]
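
The grouping above can be reproduced without Ruffus using only the standard re module. This is a minimal sketch of the idea, not the library's implementation: collate_by_regex is a hypothetical helper, and skipping inputs that do not match the pattern is an assumption.

```python
import re
from collections import defaultdict


def collate_by_regex(file_names, pattern, output_template):
    """Group file names by the output name that regex substitution yields."""
    regex = re.compile(pattern)
    jobs = defaultdict(list)
    for name in file_names:
        match = regex.search(name)
        if match is None:
            continue  # assumption: inputs that do not match are skipped
        # Expand back-references such as \1 in the output template.
        jobs[match.expand(output_template)].append(name)
    return dict(jobs)


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
jobs = collate_by_regex(animal_files, r".+\.(.+)$", r"\1.summary")
for output, inputs in sorted(jobs.items()):
    print(inputs, "->", output)
```

Each key of the resulting dictionary corresponds to one job, and its value is the list of inputs that job would receive.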
Parameters:

input = tasks_or_file_names

can be a:

Task / list of tasks.

File names are taken from the output of the specified task(s)

(Nested) list of file name strings (as in the example above).

File names containing *[]? will be expanded as a glob.

E.g.:"a.*" => "a.1", "a.2"

filter = matching_regex

is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

filter = matching_formatter

is a formatter indicator object, optionally containing a Python regular expression (re).

output = output

Specifies the resulting output file name(s) after string substitution

extras = extras

Any extra parameters are passed verbatim to the task function

If you are using named parameters, these can be passed as a list, i.e. extras= [...]

Any extra parameters are consumed by the task function and not forwarded further down the pipeline.
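
The glob expansion described for the input parameter can be sketched as follows. To keep the example independent of the file system, it matches against a supplied list of names with fnmatch; expand_names and its available argument are hypothetical, and real code would more likely call glob.glob on each pattern.

```python
import fnmatch
import re

# The characters that mark a file name as a glob pattern: * ? [ ]
GLOB_CHARS = re.compile(r"[*?\[\]]")


def expand_names(specs, available):
    """Expand glob-like specs against a list of known file names."""
    expanded = []
    for spec in specs:
        if GLOB_CHARS.search(spec):
            expanded.extend(sorted(fnmatch.filter(available, spec)))
        else:
            expanded.append(spec)  # plain names pass through unchanged
    return expanded


expanded_names = expand_names(["a.*", "b.fish"], ["a.1", "a.2", "c.3"])
print(expanded_names)
```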

Example 2:

Suppose we had the following files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal

pufferfish.fish.animal

and we wanted to end up with three different output files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal
    -> mammals.results

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal
    -> reptile.results

pufferfish.fish.animal
    -> fish.results

This is the @collate code required:

animals = [     "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

@collate(animals, regex(r"(.+)\.(.+)\.animal"),  r"\2.results")
# \1 = species [cows, horses]
# \2 = phylogenetic group [mammals, reptile, fish]
def summarize_animals_into_groups(species_file, result_file):
    " ... more code here"
    pass
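
As an illustration of what the elided function body might do, the sketch below writes the species part of each input file name into the result file. The body is hypothetical (it only inspects file names and never opens the inputs), and the function is called directly with the arguments the mammals job would receive rather than through the decorator.

```python
import os
import tempfile


def summarize_animals_into_groups(species_files, result_file):
    """Write the species name taken from each input file name, one per line."""
    with open(result_file, "w") as out:
        for name in sorted(species_files):
            species = os.path.basename(name).split(".")[0]
            out.write(species + "\n")


# Direct call with the inputs and output of the mammals job.
work_dir = tempfile.mkdtemp()
result = os.path.join(work_dir, "mammals.results")
summarize_animals_into_groups(
    ["cows.mammals.animal", "horses.mammals.animal", "sheep.mammals.animal"],
    result,
)
with open(result) as handle:
    mammal_lines = handle.read().splitlines()
print(mammal_lines)
```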

See @merge for an alternative way to summarise files.

@collate( input, filter, replace_inputs | add_inputs, output, [extras,...] )

Purpose:

Use filter to identify common sets of inputs which are to be grouped or collated together:

Each set of inputs that generates identical output and extras under the formatter or regex (regular expression) filter is collated into one job.

This variant of @collate allows additional inputs or dependencies to be added dynamically to the task, with optional string substitution.

add_inputs nests the original input parameters in a list before adding additional dependencies.

inputs replaces the original input parameters wholesale.

This is a many to fewer operation.

Only out of date jobs (comparing input and output files) will be re-run.

Example of add_inputs
regex(r".+\.(.+)$"), "\1.summary" creates a separate summary file for each suffix. But we also add date of birth data for each species:
animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"
# summarise by file suffix:
@collate(animal_files, regex(r".+\.(.+)$"),  add_inputs(r"\1.date_of_birth"), r'\1.summary')
def summarize(infiles, summary_file):
    pass
This results in the following equivalent function calls:
summarize([ ["shark.fish",  "fish.date_of_birth"   ],
            ["tuna.fish",   "fish.date_of_birth"   ] ], "fish.summary")
summarize([ ["cat.mammals", "mammals.date_of_birth"],
            ["dog.mammals", "mammals.date_of_birth"] ], "mammals.summary")
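
The nesting performed by add_inputs can be mimicked in plain Python. This is a sketch of the behaviour shown in the calls above, not the Ruffus implementation; collate_add_inputs is a hypothetical helper, and sorting the inputs is an assumption made to reproduce the order shown.

```python
import re
from collections import defaultdict


def collate_add_inputs(file_names, pattern, extra_template, output_template):
    """Pair each input with its substituted extra input, grouped by output."""
    regex = re.compile(pattern)
    jobs = defaultdict(list)
    for name in sorted(file_names):
        match = regex.search(name)
        if match is None:
            continue  # assumption: inputs that do not match are skipped
        output = match.expand(output_template)
        # Nest the original input in a list alongside the added dependency.
        jobs[output].append([name, match.expand(extra_template)])
    return dict(jobs)


add_jobs = collate_add_inputs(
    ["tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"],
    r".+\.(.+)$",
    r"\1.date_of_birth",
    r"\1.summary",
)
for output, inputs in sorted(add_jobs.items()):
    print(inputs, "->", output)
```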
Example of inputs
Using inputs(...) summarises only the dates of birth for each species group:
animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"
# summarise by file suffix:
@collate(animal_files, regex(r".+\.(.+)$"),  inputs(r"\1.date_of_birth"), r'\1.summary')
def summarize(infiles, summary_file):
    pass
This results in the following equivalent function calls:
summarize(["fish.date_of_birth"   ], "fish.summary")
summarize(["mammals.date_of_birth"], "mammals.summary")
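
Replacing the inputs can be sketched the same way. Here collate_replace_inputs is a hypothetical helper, and dropping duplicate replacement names within a job is an assumption chosen so that the result matches the calls shown above.

```python
import re


def collate_replace_inputs(file_names, pattern, replacement_template, output_template):
    """Group by output, keeping only the substituted replacement inputs."""
    regex = re.compile(pattern)
    jobs = {}
    for name in file_names:
        match = regex.search(name)
        if match is None:
            continue  # assumption: inputs that do not match are skipped
        output = match.expand(output_template)
        replacement = match.expand(replacement_template)
        job_inputs = jobs.setdefault(output, [])
        if replacement not in job_inputs:  # assumption: duplicates collapse
            job_inputs.append(replacement)
    return jobs


replace_jobs = collate_replace_inputs(
    ["tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"],
    r".+\.(.+)$",
    r"\1.date_of_birth",
    r"\1.summary",
)
for output, inputs in sorted(replace_jobs.items()):
    print(inputs, "->", output)
```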
Parameters:

input = tasks_or_file_names

can be a:

Task / list of tasks.

File names are taken from the output of the specified task(s)

(Nested) list of file name strings (as in the example above).

File names containing *[]? will be expanded as a glob.

E.g.:"a.*" => "a.1", "a.2"

filter = matching_regex

is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

filter = matching_formatter

is a formatter indicator object, optionally containing a Python regular expression (re).

add_inputs = add_inputs(...) or replace_inputs = inputs(...)

Specifies the resulting input(s) to each job.

Positional parameters must be disambiguated by wrapping the values in inputs(...) or add_inputs(...).

Named parameters can be passed the values directly.

Takes:

Task / list of tasks.

File names are taken from the output of the specified task(s)

(Nested) list of file name strings.

Strings will be subject to substitution. File names containing *[]? will be expanded as a glob. E.g. "a.*" => "a.1", "a.2"

output = output

Specifies the resulting output file name(s).

extras = extras

Any extra parameters are passed verbatim to the task function

If you are using named parameters, these can be passed as a list, i.e. extras= [...]

See @collate for more straightforward ways to use collate.