4

I'm trying to use tee in a for loop as such:

for ea in $(ls *bam)
  do samtools mpileup -f $ref $ea | \
  tee \
  >(java -jar $varscan2 mpileup2indel --output-vcf 1 > vcf/"$(echo $ea | sed s/.bam//)"_mpileup2indel.vcf) \
  >(java -jar $varscan2 mpileup2snp --output-vcf 1 > vcf/"$(echo $ea | sed s/.bam//)"_mpileup2snp.vcf) | \
  tail -n 5
done;

I.e. using the output of samtools mpileup, and piping that into two separate commands. I added the tail -n 5 to prevent the output of samtools mpileup from being printed in its entirety to stdout (I want full output to be used as input to java -jar varscan, however).

This seems to work initially, but the command doesn't seem to complete (file size for each output is smaller than if command was done without tee).

Eventually I get an error that the two java -jar $varscan commands are waiting for an input that never arrives (prior to getting a chance to start the second iteration of the loop).

Is this the best way to accomplish what I'm after, i.e. using the output of the first command in two separate commands (ideally, not recording/printing the output of the first command at all)? Is tee incompatible with for loops?

Thanks in advance.

  • 1
    Besides other problems that may occur, the `tail` should be piped *right after* the command you want to tail, not at the end of the pipe line. Also look at [Why *not* parse `ls` (and what to do instead)?](https://unix.stackexchange.com/questions/128985/why-not-parse-ls-and-what-to-do-instead). – schrodingerscatcuriosity Oct 22 '21 at 17:59
  • Sorry, I don't follow. Could you write the code as it should be? And, I also don't follow why you're referring me to the 'ls parsing discussion'. Thanks. – Rebecca Eliscu Oct 22 '21 at 18:04
  • For completeness, which shell are you using? Also, do you get _one_ error, from one of the two `java -jar $varscan` processes, or do you get an error _from each_ of those two processes (i.e. two error messages)? It would probably be useful if you could edit the actual error messages into your question. – fra-san Oct 22 '21 at 18:05
  • 2
    With `for ea in $(ls *bam)` you are parsing the output of the `ls` command, which for various reasons detailed in the link is considered bad practice. – schrodingerscatcuriosity Oct 22 '21 at 18:06
  • @fra-san I'm running again because I didn't record the error! I'm using bash, btw. Will update when running is complete. – Rebecca Eliscu Oct 22 '21 at 18:21
  • 1
    @schrodigerscatcuriosity I should clarify that I want the output of `samtools mpileup` to be piped in its entirety to my `tee` commands, but not to the stdout. I'm not sure how else to accomplish this. – Rebecca Eliscu Oct 22 '21 at 18:25
  • 2
    As I was saying in a comment which I deleted because it was mis-diagnosing the cause: try using the `-p` option of tee: `| tee -p ...`. –  Oct 22 '21 at 18:30
  • @CocaineMitch Trying right now! Will report back if this works. – Rebecca Eliscu Oct 22 '21 at 19:16
  • 1
    I would expect the problem to occur when piping to `head`, but not when to `tail`. In general anything that exits before EOF and causes `tee` to get SIGPIPE will make it not send the whole stream to files specified as operands (process substitutions in your case). Is `tail -n 5` in your actual code? If not then maybe the actual command exits early. `tee -p` seems a good idea. If `-p` is not supported then try `tee … | { actual_command; cat >/dev/null; }`, so even after `actual_command` exits, `cat` reads the rest of the stream from `tee` and `tee` never gets SIGPIPE. – Kamil Maciorowski Oct 23 '21 at 11:14

1 Answers1

3
  1. quote your variables
  2. don't parse ls
  3. optional-but-recommended: simplify your script and don't repeat yourself. You generate the basename twice with sed and append a different suffix to it each time - it would be better to generate it once - this would reduce the risk of bugs and improve readability (and very slightly improve performance - it's "cheaper" to do something once and re-use the result than it is to do exactly the same operation twice or more).
  4. readability (i.e. the ability to read and understand a program you've written) is one of, if not the most important things when writing code....so, any time that performance is not absolutely critical it is better to prioritise writing your code in a way that makes it easier to understand. This may mean inserting more line breaks or indentation, or breaking down long and complicated commands into shorter, simpler commands. This will help write and debug the script NOW, and also help you understand it when you need to revisit it in X months (or years) time.
for ea in *.bam; do
  bn="$(basename "$ea" .bam)"
  samtools mpileup -f "$ref" "$ea" |
    tee \
      >(java -jar "$varscan2" mpileup2indel --output-vcf 1 > "vcf/${bn}_mpileup2indel.vcf") \
      >(java -jar "$varscan2" mpileup2snp --output-vcf 1 > "vcf/${bn}_mpileup2snp.vcf") |
    tail -n 5
done

Note the varying indentation levels. e.g. tee is indented slightly from samtools, then tee's args are indented from tee, and then the tail is back at the same indentation level as tee. This all helps to understand which args belong to which program, and where in the pipeline (or loop, etc) you are as you're reading it.

BTW, backslashes to continue a line are optional after a pipe character.

or even:

outdir="vcf"

for ea in *.bam; do
  bn="$(basename "$ea" .bam)"
  indel="$outdir/${bn}_mpileup2indel.vcf"
  snp="$outdir/${bn}_mpileup2snp.vcf"

  samtools mpileup -f "$ref" "$ea" |
    tee \
      >(java -jar "$varscan2" mpileup2indel --output-vcf 1 > "$indel") \
      >(java -jar "$varscan2" ​mpileup2snp --output-vcf 1 > "$snp") |
 ​   tail -n 5
done
cas
  • 1
  • 7
  • 119
  • 185
  • BTW, as @zevzek implies in a comment, you may want a `wait` statement after the `tail` (or a `while pgrep -f "$varscan2" >& /dev/null; do sleep 1; done` loop) - this would ensure that you don't have dozens or hundreds of java processes all running in the background at the same time (two for each .bam file). Only the two processes started by the process substitutions would be running during each pass through the loop. – cas Oct 23 '21 at 02:19