We describe the enablement of the Ludwig lattice Boltzmann parallel fluid dynamics application, designed specifically for complex problems, for massively parallel GPU-accelerated architectures. NVIDIA CUDA is introduced into the existing C/MPI framework, and we have given careful consideration to maintainability in addition to performance. Significant performance gains are realised on each GPU through restructuring of the data layout to allow memory coalescing and the adaptation of key loops to reduce off-chip memory accesses. The halo-swap communication phase has been designed to efficiently utilise many GPUs in parallel: included is the overlapping of several stages using CUDA stream functionality. The new GPU adaptation is seen to retain the good scaling behaviour of the original CPU code, and scales well up to 256 NVIDIA Fermi GPUs (the largest resource tested). The performance on the NVIDIA Fermi GPU is observed to be up to a factor of 4 greater than the (12-core) AMD Magny-Cours CPU (with all cores utilised) for a binary fluid benchmark.