Behavioural analysis of mice carrying engineered mutations is widely used to identify the roles of specific genes in components of the mammalian behavioural repertoire. The reproducibility and robustness of phenotypic measures have become a concern, undermining the use of mouse genetic models in translational studies. Contributing factors include low statistical power of individual studies, non-standardised behavioural testing, failure to address confounds, and differences in the genetic background of mutant mice. We examined the importance of these factors by applying a statistically robust approach to behavioural data generated in a standardised battery of five behavioural assays from three mouse mutations on 129S5 and C57BL/6J backgrounds. The largest confounding effect was sampling variation, which partially masked the genetic background effect. Our observations suggest that a strong interaction between mutation and genetic background is not necessarily to be expected in innate and learned behaviours in mice. We found that composite measures of innate and learned behaviour were similarly affected by mutations across backgrounds. We determined that, for frequently used group sizes, a single retest of a result significant at the commonly used p < 0.05 threshold yields a reproducibility of 60% between identical experiments. Reproducibility was reduced in the presence of strain differences. We also identified a p-value threshold that maximised reproducibility of mutant phenotypes across strains. This study illustrates the value of standardised approaches for quantitative assessment of behavioural phenotypes and highlights approaches that may improve the translational value of mouse behavioural studies.