We propose a deep generative model of humans in natural images which keeps 2D pose separated from other latent factors of variation, such as background scene and clothing. In contrast to methods that learn generative models of low-dimensional representations, e.g., segmentation masks and 2D skeletons, our single-stage end-to-end conditional-VAEGAN learns directly on the image space. The flexibility of this approach allows the sampling of people with independent variations of pose and appearance. Moreover, it enables the reconstruction of images conditioned to a given posture, allowing, for instance, pose-transfer from one person to another. We validate our method on the Human3.6M dataset and achieve state-of-the-art results on the ChictopiaPlus benchmark. Our model, named Conditional-DGPose, outperforms the closest related work in the literature. It generates more realistic and accurate images regarding both, body posture and image quality, learning the underlying factors of pose and appearance variation.