Improving Fortran Do Loop Performance by 25%

SUMMARY

Here’s a way to squeeze more performance out of your code, with an example in Fortran. It’s a neat trick that may be non-obvious to an engineer, though perhaps more obvious to a computer scientist. It applies to a do loop that updates an array at every time step: instead of writing the update into a temporary array and copying it back, you use the program logic to let the two arrays trade roles, so each one keeps its place in memory and serves as the source on every other time step. I got 25% better performance from this simple change (and it scales). Not clear yet? Let me show you.

BACKGROUND
I’m a mediocre Fortran programmer, but I’m learning new tricks through practice, challenges, and professor office hours. Taking Scientific Computing and High Performance Computing Systems right now is really strengthening my programming skills, and I hope to be an intermediate Fortran programmer by the end of the semester. Below is a sample from a code I wrote for a recent class project. The goal of the project was not to write a sequential version, so I feel comfortable posting a piece of the serial version.

CODE DESCRIPTION

The code steps in time from igen = 1 to gmax, computing the neighbor count (‘naybs’) of each cell in the array ‘pop.’ It then updates each cell of ‘pop’ for the next time step based on its neighbor count. But it performs the update in ‘buffer’ and then (unnecessarily, it turns out) copies the data back into ‘pop,’ which is how the sweep proceeds without read/write conflicts. This works, but you may be sacrificing performance without even realizing it. I wrote EXAMPLE 1 below, my instructor said “BUT WHY NOT DO IT BETTER?”, and I went back to my computer and coded EXAMPLE 2.

EXAMPLE 1: Original Code

  do igen = 1, gmax
    naybs = 0
    buffer = 0
! This loop finds neighbor values
    do j = 2, y_limit + 1
      do i = 2, x_limit + 1
        naybs(i,j) = pop(i-1,j-1) + pop(i,j-1) + pop(i+1,j-1)+ &
                     pop(i-1,j  )              + pop(i+1,j  )+ &
                     pop(i-1,j+1) + pop(i,j+1) + pop(i+1,j+1)
        ! Birth
        if (pop(i,j)==0 .and. naybs(i,j)==3) then
            buffer(i,j)=1
        ! Survival
        else if (pop(i,j)==1 .and. &
            (naybs(i,j)==2 .or. naybs(i,j)==3)) then
            buffer(i,j)=1
        ! Death
        else
            buffer(i,j)=0
        end if
      end do
    end do
    ! Copy the updated generation back into 'pop' only after the full
    ! sweep, so the neighbor counts above never mix old and new values
    pop = buffer
  end do

EXAMPLE 2: Improved Code

do igen = 1, gmax
  naybs = 0
  ! (Assumes the border cells of both 'pop' and 'buffer' were zeroed
  ! before this loop)

  ! Odd generations: read the current state from 'pop',
  ! write the next generation into 'buffer'
  if (mod(igen,2)==1) then

! This loop finds neighbor values
    do j = 2, y_limit + 1
      do i = 2, x_limit + 1
        naybs(i,j) = pop(i-1,j-1) + pop(i,j-1) + pop(i+1,j-1)+ &
                     pop(i-1,j  )              + pop(i+1,j  )+ &
                     pop(i-1,j+1) + pop(i,j+1) + pop(i+1,j+1)
        ! Birth
        if (pop(i,j)==0 .and. naybs(i,j)==3) then
            buffer(i,j)=1
        ! Survival
        else if (pop(i,j)==1 .and. &
            (naybs(i,j)==2 .or. naybs(i,j)==3)) then
            buffer(i,j)=1
        ! Death
        else
            buffer(i,j)=0
        end if
      end do
    end do

  ! Even generations: read from 'buffer', write back into 'pop'
  else

! This loop finds neighbor values
    do j = 2, y_limit + 1
      do i = 2, x_limit + 1
        naybs(i,j)=buffer(i-1,j-1)+buffer(i,j-1)+buffer(i+1,j-1)+ &
                   buffer(i-1,j  )             +buffer(i+1,j  )+ &
                   buffer(i-1,j+1)+buffer(i,j+1)+buffer(i+1,j+1)
        ! Birth
        if (buffer(i,j)==0 .and. naybs(i,j)==3) then
            pop(i,j)=1
        ! Survival
        else if (buffer(i,j)==1 .and. &
            (naybs(i,j)==2 .or. naybs(i,j)==3)) then
            pop(i,j)=1
        ! Death
        else
            pop(i,j)=0
        end if
      end do
    end do
  end if
end do

I will say that you do have to write more code, but if that’s your hang-up, then you might not be very interested in performance in the first place.

More importantly, note the logic here. What I’m doing is identifying whether the time step is odd or even (by taking mod of the time step index), and based on that result I update one grid from the other, so the two arrays trade roles every generation. As far as I know, this can be done more elegantly in C by swapping pointers. I don’t yet know how to use pointers effectively in Fortran, but the same trick should be possible here too (see the sketch below). That would give you the performance you desire along with the brevity that everyone likes. This change increased my performance by 25%, which is dramatic if you are in the scientific computing world.
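
For the curious, here is a minimal sketch of what pointer swapping could look like in Fortran 90 and later. The names (grid_a, grid_b, cur, nxt) are mine, and the update itself is elided; the point is only that reassigning two pointers replaces the whole-array copy.

  program pointer_swap_sketch
    implicit none
    integer, parameter :: n = 10
    ! Two buffers; the pointers alternate between them each generation
    integer, target  :: grid_a(n,n), grid_b(n,n)
    integer, pointer :: cur(:,:), nxt(:,:), tmp(:,:)
    integer :: igen

    grid_a = 0
    grid_b = 0
    cur => grid_a   ! holds the current generation
    nxt => grid_b   ! receives the next generation

    do igen = 1, 100
      ! ... compute nxt(i,j) from cur(i,j) and its neighbors ...
      ! Swap roles: no array copy, just pointer reassignment
      tmp => cur
      cur => nxt
      nxt => tmp
    end do
  end program pointer_swap_sketch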

I note that this may be obvious to others, and they might even think that I made it harder on myself in the first place by doing something silly (which I did), but remember that the copy-back version is how I was taught, by multiple people, multiple times. Once I started to get an understanding of how data is held in memory, I started to make more advanced strides in programming. Hope this helps!

Timing an OpenMP run using Fortran

What to expect:

  • How you can time a run in Fortran by calling cpu_time
  • How you can time a run in Fortran (or C) using OpenMP’s omp_get_wtime()

Compiler

I am using the Intel Fortran Compiler, ifort, called Intel Fortran Composer XE 2011 for Linux. It’s version 11.1. You can find the compiler as part of Intel’s Non-Commercial Software Downloads. You can check the version of your ifort with the command

$ ifort -v

The ifort compiler has OpenMP support built in, and OpenMP provides a built-in way to time the run that you are executing. I will also show one way to time the run natively in Fortran.

At the beginning of my code, it looks like the following:

Example Code

       program goodtimes

c$     use omp_lib
       use your_modules

       implicit none

       double precision :: fstart, fend
c      [Declare other variables]
       double precision :: ostart, oend

c      Fortran timing
       call cpu_time (fstart)

c      OpenMP timing
c$     ostart = omp_get_wtime()

c      Start of your meaningful code

c      Middle of your meaningful code

c      End of your meaningful code

c      End Fortran timing
       call cpu_time (fend)

c      End OpenMP timing
c$     oend = omp_get_wtime()

       write(*,*) 'Fortran CPU time elapsed', fend-fstart
c$     write(*,*) 'OpenMP Walltime elapsed', oend-ostart

       end program goodtimes

There are a few things to mention in the loosely-written code above. I wrote it a little Fortran 77-esque, starting statements in the 7th column. The OpenMP conditional compilation sentinel in fixed-form source is ‘c$’, ‘!$’, or ‘*$’; I used ‘c$’ above. In free-form source (Fortran 90 and later), use ‘!$’.

Notice that you must use the ‘omp_lib’ module in order to access the built-in ‘omp_get_wtime’; otherwise, you will get an error. I strongly recommend making your start and end variables double precision, because omp_get_wtime() returns a double precision value, and a lower-precision variable would silently discard digits. It doesn’t matter exactly how you write the double precision declaration (and I don’t necessarily recommend the way I did it above), but you will have a better time with one.

Note that cpu_time reports CPU time (how long the CPU was working on your problem), while omp_get_wtime reports wall-clock time, i.e., the time that would have elapsed if you timed the run from beginning to end with a very precise clock. For a few runs of my application, the CPU time came out to about 90% of the wall time. I also recommend profiling your code (see the gprof walkthrough below), so you can see which regions of your code are time-consuming and direct your OpenMP use to those regions.

Remember to include the ‘-openmp’ flag when compiling, and to set the environment variable ‘OMP_NUM_THREADS’. I typically put a value in the ~/.bashrc file and then source ~/.bashrc.
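
For example (the file name goodtimes.f90 is my assumption; use whatever yours is called):

$ ifort -openmp goodtimes.f90 -o goodtimes
$ export OMP_NUM_THREADS=4
$ ./goodtimes

Without the -openmp flag, the ‘c$’ lines are treated as ordinary comments, so the program still builds and runs but reports only the cpu_time numbers.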

The rest of the code is self-explanatory. Read up on Fortran (or C) and OpenMP tutorials and other documentation for any additional information, or feel free to ask questions below. The C techniques are very similar and straightforward.

Profiling a simple Fortran code with gprof

I finished working through Chapman’s Introduction to Fortran 90/95, and it was a very interesting (and helpful) read. My next step is to work through Chapman’s (no relation?) Using OpenMP, but there are some performance considerations I must first address.

Therefore, I looked into gprof, the GNU profiling tool. It will give me an understanding of how quickly my code runs, and which tasks in the workflow are taking up the most resources. Here is what the ifort man pages say about the compiler flag for gprof profiling (note that I have a 32-bit processor for this test!):

-p
   Architectures: IA-32, Intel® 64 architectures
   Arguments: None
   Default: OFF (files are compiled and linked without profiling)
   Description: This option compiles and links for function profiling
   with gprof(1).
   Alternate Options: Linux and Mac OS X: -pg (only available on
   systems using IA-32 architecture or Intel® 64 architecture), -qp
   (a deprecated option)

That’s interesting, sure! Eventually I want to apply this to a large code, where debugging can be a pain, so for now I’m going to focus on a much simpler test case (taken from Chapman’s Fortran 90/95 book, Example 6-10, pg. 340).

gprof Example with Fortran Code

The example I consider has a function called “ave_value,” which calculates the average value of a function between two points, first_value and last_value. The function being averaged is “my_function,” which is declared external in the test driver program “test_ave_value” and passed to “ave_value.” It’s a very simple program with three .f90 files.
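
Here is a minimal sketch of what the three files might look like. It follows the description above rather than Chapman’s exact listing; in particular, the body of my_function is just an illustrative choice.

! ave_value.f90 -- average of func sampled at n points on [first_value, last_value]
real function ave_value (func, first_value, last_value, n)
   implicit none
   real, external :: func
   real, intent(in) :: first_value, last_value
   integer, intent(in) :: n
   real :: delta, x_sum
   integer :: i
   delta = (last_value - first_value) / real(n - 1)
   x_sum = 0.
   do i = 1, n
      x_sum = x_sum + func(first_value + real(i-1)*delta)
   end do
   ave_value = x_sum / real(n)
end function ave_value

! my_function.f90 -- the function being averaged (illustrative body)
real function my_function (x)
   implicit none
   real, intent(in) :: x
   my_function = 3. * x**2
end function my_function

! test_ave_value.f90 -- driver: passes my_function to ave_value
program test_ave_value
   implicit none
   real, external :: ave_value, my_function
   write (*,*) 'Average value = ', ave_value(my_function, 0., 1., 101)
end program test_ave_value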

I wrote these functions based on the example given in Chapman, and then I compiled them with the following command:

$ ifort -p ave_value.f90 my_function.f90 test_ave_value.f90 -o test_ave_value

As a reminder, the -p flag turns on profiling for gprof, and the -o flag allows me to name the executable.

Now that you have your executable, you can simply run it, as I did:

$ ./test_ave_value

And you’ll notice that it has generated a “gmon.out” file that gprof can interpret to show you your statistics! Each new run overwrites any gmon.out already in the folder, so use caution.
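
If you want to keep the profile from an earlier run, rename it before the next one (the new name is arbitrary):

$ mv gmon.out gmon_run1.out

Now, run gprof to interpret the gmon.out file.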

$ gprof test_ave_value > tav.output

Here, tav.output is simply my name for the redirected gprof output. Now we can view the results of gprof in tav.output, in any competent text editor.

Looking at the Numbers

There is sufficient documentation for understanding the gprof numbers on the GNU website, but I’ll hit some critical points. The output is separated into the Flat Profile and the Call Graph. The Flat Profile conveys how much time your program spent executing each function. The Call Graph conveys how much time was spent in each function and its children.
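
For orientation, the flat profile is a table whose header looks roughly like the following (exact layout varies between gprof versions), with one row per function beneath it:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name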

Visualization of gprof results

A quick way to put a visualization together (per the documentation of gprof2dot):

$ gprof path/to/your/executable | gprof2dot.py | dot -Tpng -o output.png

Here, gprof reads the gmon.out profile produced by your program (which you’ve already compiled, linked with the appropriate flag, and run!). That output is piped to a program called gprof2dot, which converts it into a graph description, and dot renders that as a PNG you can view in any competent image display tool!

Note that if you download gprof2dot, you’ll need to change the permissions to ensure that it’s an executable. I tried to run the non-executable version with

$ ./gprof2dot.py

but it would not execute because the file permissions were not set to executable.
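
Changing the permissions is a one-liner:

$ chmod +x gprof2dot.py

(Alternatively, you can sidestep the permission issue by running the script through the interpreter, e.g., ‘python gprof2dot.py’.)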

Now that I learned this, I’m going to try it on a bigger code. Happy profiling!