Support long runs (over 2^32 iterations)
Add script to benchmark each implementation
Minor changes to input/output
Add simple MPI version