Chapter 5: Understanding how your pipeline works with pipeline_printout(...)
- Whether you are learning or developing Ruffus pipelines, your best friend is pipeline_printout(...). It shows the exact parameters and files as they are passed through the pipeline.
- We also strongly recommend you use the Ruffus.cmdline convenience module, which will take care of all the command line arguments for you. See Chapter 6: Running Ruffus from the command line with ruffus.cmdline.
Printing out which jobs will be run
pipeline_printout(...) takes the same parameters as pipeline_run(...) but just prints the tasks which are and are not up-to-date. The verbose parameter controls how much detail is displayed.
Let us take the pipelined code we previously wrote in Chapter 3: More on @transform-ing data and @originate, but call pipeline_printout(...) instead of pipeline_run(...). This lists the tasks which will be run in the pipeline:

    >>> import sys
    >>> pipeline_printout(sys.stdout, [second_task])

    ________________________________________
    Tasks which will be run:

    Task = create_initial_file_pairs
    Task = first_task
    Task = second_task
    ________________________________________

To see the input and output parameters of each job in the pipeline, try increasing the verbosity above the default, for example to verbose = 3.
This is very useful for checking that the input and output parameters have been specified correctly.
Determining which jobs are out-of-date or not
It is often useful to see which tasks are or are not up-to-date. For example, if we were to run the pipeline in full, and then modify one of the intermediate files, the pipeline would be partially out of date.
Let us start by running the pipeline in full, and then modify job1.a.output.1 so that the second task appears out-of-date:

    pipeline_run([second_task])

    # "touch" job1.a.output.1
    open("job1.a.output.1", "w").close()

Run pipeline_printout(...) with a verbosity of 5. This will tell you exactly why second_task(...) needs to be re-run: job1.a.output.1 has a file modification time after that of job1.a.output.2:

    >>> pipeline_printout(sys.stdout, [second_task], verbose = 5)

    ________________________________________
    Tasks which are up-to-date:

    Task = create_initial_file_pairs
    Task = first_task
    ________________________________________

    ________________________________________
    Tasks which will be run:

    Task = second_task
        Job = [job1.a.output.1
            -> job1.a.output.2]
    >>> # File modification times shown for out of date files
          Job needs update:
          Input files:
           * 22 Jul 2014 15:29:19.33: job1.a.output.1
          Output files:
           * 22 Jul 2014 15:29:07.53: job1.a.output.2

        Job = [job2.a.output.1
            -> job2.a.output.2]
        Job = [job3.a.output.1
            -> job3.a.output.2]
    ________________________________________

N.B. At a verbosity of 5, even jobs which are up-to-date in second_task are displayed.
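The timestamp rule behind the "Job needs update" explanation can be sketched in plain Python. This is only an illustration of the comparison described above, not Ruffus's own implementation, which performs additional checks. The helper job_needs_update is a hypothetical name, and the file names are borrowed from the example.

```python
import os
import tempfile


def job_needs_update(input_files, output_files):
    """Hypothetical helper: True if an output is missing or older than an input."""
    if not all(os.path.exists(name) for name in output_files):
        return True
    # Newest input versus oldest output, using file modification times
    newest_input = max(
        (os.path.getmtime(name) for name in input_files), default=float("-inf")
    )
    oldest_output = min(os.path.getmtime(name) for name in output_files)
    return newest_input > oldest_output


# Demonstration in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    input_file = os.path.join(tmp_dir, "job1.a.output.1")
    output_file = os.path.join(tmp_dir, "job1.a.output.2")
    for name in (input_file, output_file):
        open(name, "w").close()

    # Output is newer than input: the job is up to date
    os.utime(input_file, (1000, 1000))
    os.utime(output_file, (2000, 2000))
    up_to_date_result = job_needs_update([input_file], [output_file])

    # "touch" the input so it becomes newer than the output
    os.utime(input_file, (3000, 3000))
    stale_result = job_needs_update([input_file], [output_file])

print(up_to_date_result, stale_result)
```

Setting the modification times explicitly with os.utime keeps the demonstration deterministic; in a real pipeline the times would come from the jobs themselves.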
The verbosity levels for pipeline_printout(...) and pipeline_run(...) can be specified from verbose = 0 (print out nothing) to the extreme verbosity of verbose = 6. A verbosity of 10 or above is reserved for the internal debugging of Ruffus.
- level 0 : nothing
- level 1 : Out-of-date Task names
- level 2 : All Tasks (including any task function docstrings)
- level 3 : Out-of-date Jobs in Out-of-date Tasks, no explanation
- level 4 : Out-of-date Jobs in Out-of-date Tasks, with explanations and warnings
- level 5 : All Jobs in Out-of-date Tasks (up-to-date Tasks are listed by name only)
- level 6 : All jobs in All Tasks whether out of date or not
- level 10: logs messages useful only for debugging ruffus pipeline code
Abbreviating long file paths with verbose_abbreviated_path
Pipelines often produce interminable lists of deeply nested filenames. It would be nice to be able to abbreviate this to just enough information to follow the progress.
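The verbose_abbreviated_path parameter lets you specify either that only the last NNN levels of each path are to be included (verbose_abbreviated_path = NNN), or that the message is to be truncated to a specified MMM characters (to fit onto a line, for example). MMM is specified by setting verbose_abbreviated_path = -MMM, i.e. negative values.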
Note that the number of characters specified is just the separate lengths of the input and output parameters, not the entire indented line. You may need to specify a smaller limit than you expect (e.g. 60 rather than 80):

    pipeline_printout(verbose_abbreviated_path = NNN)
    pipeline_run(verbose_abbreviated_path = -MMM)

    ["aa/bb/cc/dddd.txt", "aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt"]

    # Original relative paths
    "[aa/bb/cc/dddd.txt, aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt]"

    # Full abspath
    verbose_abbreviated_path = 0
    "[/test/ruffus/src/aa/bb/cc/dddd.txt, /test/ruffus/src/aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt]"

    # Specified level of nested directories
    verbose_abbreviated_path = 1
    "[.../dddd.txt, .../gggg.txt]"

    verbose_abbreviated_path = 2
    "[.../cc/dddd.txt, .../ffff/gggg.txt]"

    verbose_abbreviated_path = 3
    "[.../bb/cc/dddd.txt, .../eeee/ffff/gggg.txt]"

    # Truncated to MMM characters
    verbose_abbreviated_path = -60
    "<???> /bb/cc/dddd.txt, aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt]"

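The positive-valued abbreviation shown in the examples above can be mimicked with a few lines of plain Python. This is a minimal sketch under assumptions, not the code Ruffus uses: abbreviate_path and format_paths are hypothetical helpers, paths are assumed to use "/" separators, and returning a short path unchanged when it has no more than the requested number of levels is a guess at reasonable behaviour. Truncation to MMM characters (negative values) is deliberately left out.

```python
def abbreviate_path(path, levels):
    """Hypothetical helper: keep only the last `levels` components of a path."""
    if levels < 1:
        raise ValueError("this sketch only handles positive levels")
    parts = [part for part in path.split("/") if part]
    if len(parts) <= levels:
        # Assumption: nothing to abbreviate, so return the path as given
        return path
    return ".../" + "/".join(parts[-levels:])


def format_paths(paths, levels):
    """Join abbreviated paths in the bracketed style used in the printout."""
    return "[" + ", ".join(abbreviate_path(path, levels) for path in paths) + "]"


paths = ["aa/bb/cc/dddd.txt", "aaa/bbbb/cccc/eeed/eeee/ffff/gggg.txt"]

for levels in (1, 2, 3):
    print(levels, format_paths(paths, levels))
```

With these inputs the three printed lines match the verbose_abbreviated_path = 1, 2 and 3 examples listed above.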
Getting a list of all tasks in a pipeline
If you just want a list of all tasks (the names of Ruffus-decorated functions), you can run pipeline_get_task_names(...).
This doesn’t touch any pipeline code or even check to see if the pipeline is connected up properly.
However, it is sometimes useful to allow users at the command line to choose from a list of possible tasks as a target.
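One way to offer such a choice at the command line is to pass the task names to argparse as the permitted values of an option. The sketch below uses only the standard library: the task_names list is hard-coded as a stand-in for whatever pipeline_get_task_names(...) returns in your pipeline, and both the --target_tasks option name and the parse_target_tasks helper are hypothetical.

```python
import argparse

# Stand-in for the list returned by pipeline_get_task_names(...);
# hard-coded here so the sketch runs without a pipeline.
task_names = ["create_initial_file_pairs", "first_task", "second_task"]


def parse_target_tasks(argv, available_tasks):
    """Hypothetical helper: let the user pick target tasks from a fixed list."""
    parser = argparse.ArgumentParser(description="Choose pipeline target tasks")
    parser.add_argument(
        "--target_tasks",
        nargs="+",
        choices=available_tasks,         # argparse rejects any other name
        default=[available_tasks[-1]],   # assumption: default to the last task
        help="Task(s) to bring up to date",
    )
    return parser.parse_args(argv).target_tasks


chosen = parse_target_tasks(["--target_tasks", "first_task"], task_names)
default_choice = parse_target_tasks([], task_names)
print(chosen, default_choice)
```

The resulting list of names could then be passed as the target tasks of pipeline_printout(...) or pipeline_run(...). A misspelt task name makes argparse exit with a usage message listing the valid choices.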