Add script to benchmark each implementation
Make minor changes to input/output
Add simple MPI version