Move from kernel approach to standard one.
Hide atomic output.
Keep old version for PGI compiler.
Modify for GCC offload implementation.
Add dynamic allocation inside each process.
Add OpenACC with PGI compiler for NVIDIA GPUs.