Parallel For Loop

In my earlier post on Parallel Programming I have explained about Task and Data Parallelism.In this post we will go into a little more deep into the API and discuss about Parallel For loop in .NET Framework 4.0.The static method For of class System.Threading.Tasks.Parallel class allows us to execute for loops where several iterations might run in parallel.There are couple of overloads of Parallel.For.We will discuss about the simplest one here as shown below:

public static ParallelLoopResult For(int fromInclusive,int toExclusive,Action<int> body)

  • fromInclusive – This is the start index (inclusive)
  • toExclusive – This is the end index (exclusive)
  • body – This is the delegate of type Action<int>that will be invoked once per iteration.The input to this delegate is the counter of the iteration.

Simple implementation of Parallel.For is shown below:

Parallel.For(0, arr.Length, i =>
{
arr[i] = arr[i] + 1;
}
);

I really wanted to test how parallel for loop is behaving compared to simple sequential for loops.I used the code shown below:

private static void ParallelForLoop(int[] arr)
{
     Parallel.For(0, arr.Length, i =>
         {
             arr[i] = arr[i] + 1;
         }
     );
}
private static void SequentialForLoop(int[] arr)
{
     for (int i = 0; i < arr.Length; i++)
     {
         arr[i] = arr[i] + 1;
     }
}

static void Main()
      {

          Stopwatch sw = new Stopwatch();

          int[] arr = new int[100000];
          sw.Start();
          SequentialForLoop(arr);
          sw.Stop();
          Console.WriteLine(sw.ElapsedMilliseconds);

          arr = new int[100000];
          sw.Reset();
          sw.Start();
          ParallelForLoop(arr);
          sw.Stop();
          Console.WriteLine(sw.ElapsedMilliseconds);

          Console.ReadKey();
      }

But to my surprise in my dual core PC Parallel.For was taking more time compared to the sequential loop.Why So?

Regarding Parallel.For, the bad performance you experienced is due to two issues.  First, each iteration requires a delegate invocation.  Usually, this is insignificant, but when the loop body is small, it becomes more noticeable.  Second, Parallel.For’s partitioning scheme is designed for fine-grained load-balancing of larger, more complex workloads.  Each thread worker grabs a relatively small number of iterations to process at a time; this requires synchronization that is, again, more significant when the loop body is small and constant.  Both of these issues cause Parallel.For to perform badly in the following scenario, where parallelism can yield benefit: large number of iterations, small loop body.  Your example (one million add-store operations) fits this category exactly.
We have recently done something about this.  The solution (as you’ve realized in your hand-coded approach) is to use range partitioning and have each worker thread take a much larger chunk of the iteration range.

So I wrote another method where I partitioned the array data based on number processors and invoked Parallel.For iterating once for each partition.This resulted in reduced number of delegate invocation.

private static void ParallelForWithRange(int[] arr)
  {
      int partitions = System.Environment.ProcessorCount;
      int increment = arr.Length / partitions;

      Parallel.For(0, partitions, i =>
      {
          int a = i * increment;
          int b = (i == partitions - 1) ? arr.Length : a + increment;

          for (int j = a; j < b; j++)
          {
              arr[j] = arr[j] + 1;
          }

      });
  }

Now got the desired result.

One point I skipped earlier is the return value of Parallel.For, it is a structure ParallelLoopResult.The IsCompleted property of this structure returns true if all the iterations are successfully completed.

We will discuss about this in more details in next post along with stopping/breaking from a loop.

This entry was posted in Framework & Libraries. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *