I have a very strange situation while trying to use a specific tool (esearch from the NCBI E-utilities suite) in a while loop. This is my input file, a list of strings, one per line:

$ cat transcripts.list 
NR_169596.1
NR_169595.1
NR_169594.1

I want to run the esearch command using each of those strings as an argument, so I do:

$ while read -r line; do echo "Line: $line"; esearch -db nucleotide -query "$line"; done <  transcripts.list 
Line: NR_169596.1
<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <WebEnv>MCID_61bb689d20b59b3e2e2d405d</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

This is a single result, not three, as you can see from the single echo that runs. The same thing works, however, if I use a bad-practice for loop:

$ for line in $(cat transcripts.list); do echo "Line: $line"; esearch -db nucleotide -query "$line"; done
Line: NR_169596.1
<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <WebEnv>MCID_61bb68cabbe98560233344a7</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
Line: NR_169595.1
<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <WebEnv>MCID_61bb68cad05f5825d75e3ace</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
Line: NR_169594.1
<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <WebEnv>MCID_61bb68cb6bdec5435b5a41cb</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

Question: How is this possible? Even if there is some sort of bug in the specific esearch program, that shouldn't affect the looping, so why is the shell exiting after the first iteration? And how can the for work and the while fail? What do they do differently here?


Some more details.

  1. Adding an echo in front of the esearch command makes the loop behave as expected, so this has to be related to the specific esearch command (but how can that break the shell loop?):

    $ while read -r line; do echo esearch -db nucleotide -query "$line"; done <  transcripts.list 
    esearch -db nucleotide -query NR_169596.1
    esearch -db nucleotide -query NR_169595.1
    esearch -db nucleotide -query NR_169594.1
    
  2. There is nothing odd in the list itself; I can reproduce it with different lists and there are no hidden characters:

    $ od -c transcripts.list 
    0000000   N   R   _   1   6   9   5   9   6   .   1  \n   N   R   _   1
    0000020   6   9   5   9   5   .   1  \n   N   R   _   1   6   9   5   9
    0000040   4   .   1  \n
    0000044
    
  3. I get the same behavior in bash and dash, so it can't be related to shell-specific options such as pipefail. In any case, the exit status of the command is 0:

    $ while read -r line; do esearch -db nucleotide -query "$line"; echo "EXIT: $?"; done <  transcripts.list 
    <ENTREZ_DIRECT>
      <Db>nucleotide</Db>
      <WebEnv>MCID_61bb69e71191d1185543b24a</WebEnv>
      <QueryKey>1</QueryKey>
      <Count>1</Count>
      <Step>1</Step>
    </ENTREZ_DIRECT>
    
  4. This is happening on a system running Ubuntu, bash, version 4.4.20(1)-release. You can install the esearch tool with sudo apt install ncbi-entrez-direct, if you want to try this out.

  5. It works as expected in a loop using a different language. For instance, in perl:

    $ perl -ne 'chomp;system("esearch -db nucleotide -query \"$_\"")' transcripts.list 
    <ENTREZ_DIRECT>
      <Db>nucleotide</Db>
      <WebEnv>MCID_61bb6c68d8f66e4bb03f00e8</WebEnv>
      <QueryKey>1</QueryKey>
      <Count>1</Count>
      <Step>1</Step>
    </ENTREZ_DIRECT>
    <ENTREZ_DIRECT>
      <Db>nucleotide</Db>
      <WebEnv>MCID_61bb6c69947ca95fce4d4f0f</WebEnv>
      <QueryKey>1</QueryKey>
      <Count>1</Count>
      <Step>1</Step>
    </ENTREZ_DIRECT>
    <ENTREZ_DIRECT>
      <Db>nucleotide</Db>
      <WebEnv>MCID_61bb6c6a85c14642940393f9</WebEnv>
      <QueryKey>1</QueryKey>
      <Count>1</Count>
      <Step>1</Step>
    </ENTREZ_DIRECT>
    
terdon

1 Answer

This is probably because esearch exhausts its standard input: inside the loop, read and esearch both read from transcripts.list, so esearch consumes the remaining lines and the next read finds nothing left.
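
The effect can be reproduced without network access. The sketch below is only an illustration: `greedy` is a hypothetical stand-in (not part of Entrez Direct) for any command that reads its standard input to the end, which is what esearch appears to do here.

```shell
# Hypothetical stand-in for a command that drains its standard input.
greedy() { cat > /dev/null; }

# Temporary copy of the list from the question.
list=$(mktemp)
printf '%s\n' NR_169596.1 NR_169595.1 NR_169594.1 > "$list"

# read takes the first line; greedy then swallows the rest of the file,
# so the second read hits end-of-file and the loop stops.
seen=$(while read -r line; do echo "Line: $line"; greedy; done < "$list")
echo "$seen"

rm -f "$list"
```

This prints only `Line: NR_169596.1`. The for loop is unaffected because `$(cat transcripts.list)` is expanded before the loop starts, so the loop body's standard input is not the file. The perl version is unaffected too, since perl reads the file named on its command line through its own file handle and leaves standard input alone.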

To fix that, change esearch’s standard input, e.g. esearch < /dev/null.
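
A minimal sketch of that fix, assuming a hypothetical `greedy` function as a stand-in for a command that drains its standard input (it is not part of Entrez Direct):

```shell
# Hypothetical stand-in for a command that drains its standard input.
greedy() { cat > /dev/null; }

# Temporary copy of the list from the question.
list=$(mktemp)
printf '%s\n' NR_169596.1 NR_169595.1 NR_169594.1 > "$list"

# With its stdin pointed at /dev/null, greedy can no longer eat the
# lines that read still needs, so every line is processed.
fixed=$(while read -r line; do echo "Line: $line"; greedy < /dev/null; done < "$list")
echo "$fixed"

rm -f "$list"
```

Applied to the loop in the question, the body would become `esearch -db nucleotide -query "$line" < /dev/null`.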

See "I'm reading a file line by line and running ssh or ffmpeg, only the first line gets processed!" in the Bash FAQ for details.

Kamil Maciorowski
Stephen Kitt
  • Yeah, sorry, I read the link Kamil gave after posting my comment, so I see it basically boils down to "devs be stupid" and some tools consume stdin even when they don't need to. – terdon Dec 16 '21 at 17:04
  • I have actually been bit by the same thing in a more complex context using ssh, but I don't think it's a dupe, per se: [SSH connections running in the background don't exit if multiple connections have been started by the same shell](https://unix.stackexchange.com/q/383501). Of course, I asked both questions so I'm not impartial, feel free to vote to close as a dupe if you think that works (and obviously feel _very_ free to vote to close if you find a better dupe target). – terdon Dec 16 '21 at 17:06
  • Or more generally, as seen at [Why is using a shell loop to process text considered bad practice?](//unix.stackexchange.com/q/169716) (the part about *inside the loop, stdin is redirected so you need to pay attention that the commands in it don't read from stdin*), open the file that the `while read` loop reads from on an fd outside the 0..2 range. – Stéphane Chazelas Dec 16 '21 at 17:41
  • @terdon there's [Why does the while cycle skips and only reads the first line?](https://unix.stackexchange.com/questions/150094/why-does-the-while-cycle-skips-and-only-reads-the-first-line) too – muru Dec 17 '21 at 02:56
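
The file-descriptor approach mentioned in the comments can be sketched as follows. This is an illustration under assumptions, not a tested recipe for esearch: `greedy` is a hypothetical stand-in for a stdin-draining command, and the choice of fd 3 is arbitrary.

```shell
# Hypothetical stand-in for a command that drains its standard input.
greedy() { cat > /dev/null; }

# Temporary copy of the list from the question.
list=$(mktemp)
printf '%s\n' NR_169596.1 NR_169595.1 NR_169594.1 > "$list"

# The list is opened on fd 3 and read reads from fd 3 only, so whatever
# arrives on standard input (here one unrelated line from printf) is all
# that greedy can consume; the list itself stays untouched.
via_fd3=$(printf 'unrelated input\n' |
  while read -r line <&3; do echo "Line: $line"; greedy; done 3< "$list")
echo "$via_fd3"

rm -f "$list"
```

All three lines are processed, and the commands in the loop body keep a usable standard input. In bash, `read -r -u 3 line` is an equivalent way to write the read.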