Changed the three places where populations are updated to run serially instead of in parallel. Now, at least for the first 100 timesteps, the CUDA and Java codes give almost exactly the same results. This suggests that the error is in the parallel implementation of the population update algorithm.