Chapter 5: Understanding how your pipeline works with pipeline_printout(...)
Note
- Whether you are learning or developing Ruffus pipelines, your best friend is pipeline_printout(...). This shows the exact parameters and files as they are passed through the pipeline.
- We also strongly recommend you use the ruffus.cmdline convenience module, which will take care of all the command line arguments for you. See Chapter 6: Running Ruffus from the command line with ruffus.cmdline.
Printing out which jobs will be run
pipeline_printout(...) takes the same parameters as pipeline_run(...), preceded by an output stream such as sys.stdout, but just prints the tasks which are and are not up-to-date. The verbose parameter controls how much detail is displayed.

Let us take the pipelined code we previously wrote in Chapter 3 More on @transform-ing data and @originate but call pipeline_printout(...) instead of pipeline_run(...). This lists the tasks which will be run in the pipeline:

    >>> import sys
    >>> pipeline_printout(sys.stdout, [second_task])

    ________________________________________
    Tasks which will be run:

    Task = create_initial_file_pairs
    Task = first_task
    Task = second_task
    ________________________________________

To see the input and output parameters of each job in the pipeline, try increasing the verbosity from the default (1) to 3 (see code). This is very useful for checking that the input and output parameters have been specified correctly.
Determining which jobs are out-of-date or not
It is often useful to see which tasks are or are not up-to-date. For example, if we were to run the pipeline in full, and then modify one of the intermediate files, the pipeline would be partially out of date.
Let us start by running the pipeline in full, then modify job1.a.output.1 so that the second task appears out-of-date:

    pipeline_run([second_task])

    # "touch" job1.a.output.1
    open("job1.a.output.1", "w").close()

Run pipeline_printout(...) with a verbosity of 5. This will tell you exactly why second_task(...) needs to be re-run: because job1.a.output.1 has a file modification time after job1.a.output.2 (see the "Job needs update" entry):

    >>> pipeline_printout(sys.stdout, [second_task], verbose = 5)

    ________________________________________
    Tasks which are up-to-date:

    Task = create_initial_file_pairs
    Task = first_task
    ________________________________________

    ________________________________________
    Tasks which will be run:

    Task = second_task
       Job = [job1.a.output.1 -> job1.a.output.2]
    >>> # File modification times shown for out of date files
         Job needs update:
         Input files:
          * 22 Jul 2014 15:29:19.33: job1.a.output.1
         Output files:
          * 22 Jul 2014 15:29:07.53: job1.a.output.2

       Job = [job2.a.output.1 -> job2.a.output.2]
       Job = [job3.a.output.1 -> job3.a.output.2]
    ________________________________________

N.B. At a verbosity of 5, even jobs which are up-to-date in second_task are displayed.
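The timestamp comparison reported above can be illustrated with a small standalone sketch. This is not Ruffus's own implementation, whose checks are more involved. It only mirrors the rule described in this section: a job is out of date if an output file is missing or if any input file was modified after an output file. The function name needs_update and the explicit timestamps are assumptions made for illustration.

```python
import os
import tempfile


def needs_update(input_files, output_files):
    """Return (True, reason) if the job should be re-run, else (False, reason).

    Simplified illustration only: compares file modification times.
    """
    missing = [f for f in output_files if not os.path.exists(f)]
    if missing:
        return True, "Missing output files: " + ", ".join(missing)
    # With no inputs there is nothing that could be newer than the outputs
    newest_input = max(
        (os.path.getmtime(f) for f in input_files), default=float("-inf")
    )
    oldest_output = min(os.path.getmtime(f) for f in output_files)
    if newest_input > oldest_output:
        return True, "Input files are newer than output files"
    return False, "Up to date"


with tempfile.TemporaryDirectory() as tmp_dir:
    input_file = os.path.join(tmp_dir, "job1.a.output.1")
    output_file = os.path.join(tmp_dir, "job1.a.output.2")
    for name in (input_file, output_file):
        open(name, "w").close()

    # Set explicit modification times so the example is deterministic:
    # the input is older (t=1000) than the output (t=2000).
    os.utime(input_file, (1000, 1000))
    os.utime(output_file, (2000, 2000))
    fresh = needs_update([input_file], [output_file])

    # "Touch" the input so that it is now newer than the output
    os.utime(input_file, (3000, 3000))
    stale = needs_update([input_file], [output_file])

print(fresh)
print(stale)
```

Setting the modification times explicitly with os.utime avoids depending on the file system's timestamp resolution, which can otherwise make two files written in quick succession appear to have the same age.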
Verbosity levels
The verbosity levels for pipeline_printout(...) and pipeline_run(...) can be specified from verbose = 0 (print out nothing) to the extreme verbosity of verbose = 6. A verbosity of 10 or above is reserved for the internal debugging of Ruffus.
- level 0 : nothing
- level 1 : Out-of-date Task names
- level 2 : All Tasks (including any task function docstrings)
- level 3 : Out-of-date Jobs in Out-of-date Tasks, no explanation
- level 4 : Out-of-date Jobs in Out-of-date Tasks, with explanations and warnings
- level 5 : All Jobs in Out-of-date Tasks (up-to-date Tasks are only listed by name)
- level 6 : All jobs in All Tasks whether out of date or not
- level 10: logs messages useful only for debugging ruffus pipeline code
Abbreviating long file paths with verbose_abbreviated_path
Pipelines often produce interminable lists of deeply nested filenames. It would be nice to be able to abbreviate this to just enough information to follow the progress.
The verbose_abbreviated_path parameter specifies that pipeline_printout(...) and pipeline_run(...) display only

- the innermost NNN levels of each path (NNN = 1 shows just the file name), or
- the message truncated to a specified MMM characters (to fit onto a line, for example). MMM is specified by setting verbose_abbreviated_path = -MMM, i.e. negative values.

Note that the number of characters specified is just the separate lengths of the input and output parameters, not the entire indented line. You may need to specify a smaller limit than you expect (e.g. 60 rather than 80).

    pipeline_printout(verbose_abbreviated_path = NNN)
    pipeline_run(verbose_abbreviated_path = -MMM)

verbose_abbreviated_path defaults to 2.
For example, given ["aa/bb/cc/dddd.txt", "aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt"]:

    # Original relative paths
    "[aa/bb/cc/dddd.txt, aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt]"

    # Full abspath
    verbose_abbreviated_path = 0
    "[/test/ruffus/src/aa/bb/cc/dddd.txt, /test/ruffus/src/aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt]"

    # Specified level of nested directories
    verbose_abbreviated_path = 1
    "[.../dddd.txt, .../gggg.txt]"

    verbose_abbreviated_path = 2
    "[.../cc/dddd.txt, .../ffff/gggg.txt]"

    verbose_abbreviated_path = 3
    "[.../bb/cc/dddd.txt, .../eeee/ffff/gggg.txt]"

    # Truncated to MMM characters
    verbose_abbreviated_path = -60
    "<???> /bb/cc/dddd.txt, aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt]"
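The effect of positive values can be reproduced with a short standalone sketch. It is an illustration of the abbreviation rule visible in the examples above, not Ruffus's own code. The helper names abbreviate_path and describe are hypothetical, and leaving short paths unchanged is an assumption. The 0 (absolute path) and negative (truncation) cases are deliberately not covered.

```python
def abbreviate_path(path, n_levels=2):
    """Keep only the last n_levels components of a "/"-separated path."""
    if n_levels < 1:
        raise ValueError("this sketch only handles n_levels >= 1")
    parts = path.split("/")
    if len(parts) <= n_levels:
        # Assumption: paths that are already short enough are left untouched
        return path
    return ".../" + "/".join(parts[-n_levels:])


def describe(paths, n_levels=2):
    """Format a list of paths the way the examples above display them."""
    return "[" + ", ".join(abbreviate_path(p, n_levels) for p in paths) + "]"


paths = ["aa/bb/cc/dddd.txt", "aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt"]
for n_levels in (1, 2, 3):
    print(n_levels, describe(paths, n_levels))
```

With these inputs the three printed lines match the abbreviated forms listed for verbose_abbreviated_path = 1, 2 and 3.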
Getting a list of all tasks in a pipeline
If you just want a list of all tasks (Ruffus decorated function names), then you can run pipeline_get_task_names(...).
This doesn’t touch any pipeline code or even check to see if the pipeline is connected up properly.
However, it is sometimes useful to allow users at the command line to choose from a list of possible tasks as a target.
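One way to offer such a choice is sketched below with the standard argparse module. The hard-coded task_names list stands in for the value returned by pipeline_get_task_names() (assumed here to be a list of task name strings), so that the sketch runs without a pipeline being defined. The --target_tasks flag name and the build_parser helper are chosen for illustration only.

```python
import argparse

# Stand-in for the list returned by pipeline_get_task_names();
# hard-coded so that this sketch runs on its own.
task_names = ["create_initial_file_pairs", "first_task", "second_task"]


def build_parser(available_tasks):
    """Build a parser that only accepts known task names as targets."""
    parser = argparse.ArgumentParser(description="Run part of a pipeline")
    parser.add_argument(
        "--target_tasks",
        action="append",          # the flag may be given more than once
        choices=available_tasks,  # anything else is rejected with an error
        default=None,
        help="Task(s) to bring up to date",
    )
    return parser


parser = build_parser(task_names)
args = parser.parse_args(["--target_tasks", "second_task"])
print(args.target_tasks)
```

The parsed list could then be passed on as the target tasks of pipeline_printout(...) or pipeline_run(...). Restricting choices to the known names means a mistyped task is reported before any pipeline code runs.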