
I have a runPR.sh script, shown below:

 DIR=/directory/buildagain/bin/Project
 FILELIST=$1

 while read FILE
 do
     echo "Processing ${FILE}..."
     ./makeInp.sh ${FILE} ${FILE} >INP/${FILE}.inp
     ${DIR} -PR INP/${FILE}.inp
 done < ${FILELIST}

For the serial version, I build the program by typing make in /directory/buildagain and then run ./runPR.sh values.txt (values.txt just contains the single line Chain).

EDIT: Here is a small portion of my code.

 int main( int argc, char *argv[ ] )
 {
      MPI_Status status;
      MPI_Init(&argc,&argv); 
      if( strcmp(argv[1],"-PR") == 0 )
           runPR(argc-2, &argv[2]);
      return 0;
 }

 int runPR(int argc, char* argv[])
 { 
      cout<<"run here"<<endl;

      int mynode, totalnodes;
      int sum,startval,endval,accum;
      int master=0;

      MPI_Comm_size(MPI_COMM_WORLD, &totalnodes); // get totalnodes
      MPI_Comm_rank(MPI_COMM_WORLD, &mynode); // get mynode

      PROpt opt;
      Solve* ps = new Solve();
      cout<<"here1"<<endl;

      cout<<"total nodes "<<totalnodes<<endl;
      for(int j=0;j<totalnodes-1;j=j+1){

           cout<<"processor"<<mynode<<"  received from "<<j<<endl;

           ps->getFile(&opt,argv[0]);
      }
 }

By typing mpirun -np 4 ../directory/buildagain/bin/Project -PR INP/Chain.inp, I see "run here", "here1", and "total nodes 1" printed 4 times. But I don't see the cout<<"processor"<<mynode<<"  received from "<<j<<endl; line printed at all, and I would expect total nodes to show 4, not 1. Also, the program just stops. Why is this?

  • I'm pretty sure that mpirun needs an actual executable that is linked against the MPI library. A shell script won't work. Which implementation and version of MPI are you running? Can you show me your line of code with `mpi_init` please? – Otheus May 11 '15 at 22:51
  • The executable is at /directory/buildagain/bin/Project. The `runPR.sh` calls the executable there. I am using openmpi/1.6 – user4352158 May 12 '15 at 01:39
  • I made some changes to the OP in that I posted a small code sample. – user4352158 May 12 '15 at 01:48
  • I cannot reproduce this with gcc 5.1 and OpenMPI 1.8.4. I get the expected behavior based on your code example. How many CPU cores do you have? Can you compile and run simple MPI programs as expected? I won't address the issues with your code logic, as that is off-topic for this site. – casey May 12 '15 at 14:08
  • I'm using gcc/4.7 and openmpi/1.6. Again, I have no problems with a helloworld program and an MWE of my actual code. However, my actual code shows only 1 node for `cout<<"This node="< – user4352158 May 12 '15 at 17:20

1 Answer


After you reported getting output like

total nodes=1

and

This node=0 

printed out 4 times, I concluded you are trying this: mpirun -np 4 script-name.sh. It behaves this way because mpirun launches 4 copies of a shell script, and a shell script doesn't understand MPI communication semantics.

If you can somehow get mpirun to launch a script, then remember that (1) the script runs in the local "head" node's environment, not the remote one; (2) the script must exec your program as its last and final act; and (3) when the program runs, it is in the environment of possibly another node, which may not have access to the files you had on the head node.

So the script should look like this:

PROG="$1"; shift;
OPT="$2"; shift    
for FILE in "$@"
do
     echo "Processing ${FILE}..."
     ./makeInp.sh ${FILE} ${FILE} >INP/${FILE}.inp
done
exec $PROG $OPT "$@"

Within PROG, you'll have to index argv so that each rank picks the argument corresponding to its own node/thread. (Do check that the index doesn't exceed argc, or you'll dereference a NULL pointer.) I don't think there's another/better way.
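
A minimal sketch of what that per-rank indexing might look like is below. This is an illustration only: it assumes runPR is still invoked as runPR(argc-2, &argv[2]) with one input argument per rank, and the bounds check and the myfile variable are mine, not part of the original code.

 // Hypothetical sketch (not the original code): each MPI rank selects
 // the command-line argument that corresponds to its own rank.
 int runPR(int argc, char* argv[])
 {
      int mynode, totalnodes;
      MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
      MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

      if (mynode >= argc) {                    // guard against indexing past argv
           cerr << "rank " << mynode << ": no input argument supplied" << endl;
           return 1;
      }

      const char* myfile = argv[mynode];       // this rank's own input
      cout << "rank " << mynode << " processing " << myfile << endl;
      // ... build the INP/<name>.inp path and solve it here ...
      return 0;
 }
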

  • Typing `mpirun -np 4 ../directory/buildagain/bin/Project -PR INP/Chain.inp` with your changes to the script, I get the same output as before – user4352158 May 12 '15 at 23:25
  • So without the script at all? That's not good. Too much is dependent on your environment. Did you tell me at one point that `mpirun -np 4 helloworld` actually ran 4 threads? If not, maybe your system simply isn't configured for more than one node. (Even then, it would invoke 4 instances on one node...) I'm stumped. – Otheus May 12 '15 at 23:45
  • Also, I noticed a problem with the modified script. I edited the answer to reflect that. – Otheus May 12 '15 at 23:45
  • Typing `mpirun -np 4 ../directory/buildagain/bin/Project -PR INP/Chain.inp` with the modified script, I noticed no new changes. When I type `mpirun -np 4 ./runPR.sh values.txt`, I still get the `mpirun aborted` message – user4352158 May 12 '15 at 23:51
  • Yes, `mpirun -np 4 helloworld` shows the 4 different processes – user4352158 May 12 '15 at 23:52
  • Wait, in the 1st case, you're not even running the shell script. In the 2nd case, it's clear that mpirun won't allow you to launch a shell script (which is what I sort of thought to begin with). Last time I used MPI, btw, it was 1.1 or something. – Otheus May 12 '15 at 23:53
  • Remove from your code the `Solve.new` and `getFile` code. Keep removing code until it is nearly identical to `helloworld`. Eventually you'll see the problem. Good luck. – Otheus May 12 '15 at 23:57
  • Something is wrong with his other code or his environment. I couldn't repro his code reduced to an MCVE (taking out calls he didn't provide code for). – casey May 13 '15 at 22:31
  • To OP: Could it be your execution path/executable is not available on all nodes/hosts where mpirun starts execution? (Or is everything actually local on the same computer node/host?) – Otheus May 15 '15 at 19:57
  • I'm not sure what you mean. How can I check if the executable is available on all nodes when mpirun starts? I am using multiple nodes on the same machine. Also, I was able to get the helloworld example on the PETSc website to work properly: http://www.mcs.anl.gov/petsc/petsc-current/src/sys/examples/tutorials/ex2.c.html – user4352158 May 16 '15 at 20:25
  • We're assuming it's a problem with the environment, not your program per se. If you ran the hello-world example fine, it might be working, for example, because you ran it out of a common shared directory that was available on all MPI nodes. When I say "node", I'm referring to MPI's view of things: it launches each thread/rank on a "node". Often those nodes are other hosts; in your case, not. Depending on configuration, mpirun might be trying to do a remote shell execution to the localhost. Doing so requires the environment be set up so that mpirun can run itself, find its libraries, and find your program. – Otheus May 17 '15 at 08:52
  • It's a mystery to @casey and myself how mpirun will launch your program with exactly 1 node when told to use 4. Either it's your "environment", or there is some mysterious problem within Solve. Eliminate the possibility that it is related to your code by removing `Solve.new` and `ps->getFile` and trying again (a minimal reduction along those lines is sketched after these comments). If the MPI size is still only 1, the problem is (almost certainly) your environment. – Otheus May 17 '15 at 08:55
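
For reference, a reduction of the program to something "nearly identical to helloworld", as suggested in the comments above, might look like the sketch below. The includes and the removal of PROpt/Solve are assumptions on my part; the point is only to check whether MPI_Comm_size reports 4 ranks in this environment.

 #include <mpi.h>
 #include <iostream>
 using namespace std;

 // Stripped-down test: nothing but init, size/rank queries, output, finalize.
 int main(int argc, char* argv[])
 {
      MPI_Init(&argc, &argv);

      int mynode, totalnodes;
      MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
      MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

      cout << "This node=" << mynode << " of total nodes=" << totalnodes << endl;

      MPI_Finalize();
      return 0;
 }

If even this reduced version reports total nodes=1 under mpirun -np 4, the problem is in the environment; if it reports 4, the issue lies in the code that was removed.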