Hide atomic output.
Keep old version for PGI compiler.
Modify for GCC offload implementation.
Add dynamic allocation inside each process.
Add OpenACC with PGI compiler for NVIDIA GPUs.