NWT: Towards natural audio-to-video generation with representation learning