Major architectural innovations in the compute node have been introduced in the IBM Blue Gene (R)/Q, including programmable Level 1 (L1) cache data prefetching units to hide memory access latency, hardware support for transactional memory (TM) and speculative execution (SE), an enhanced five-dimensional integrated torus network, and a high-performance quad floating-point SIMD (single-instruction, multiple-data) unit. In this paper, we present the tools and methodology that we used to model, co-design, and validate these new features from early concept phase through design implementation. Early in the design cycle, we made extensive use of an architectural simulator, BGQSim, capable of executing unmodified binary Blue Gene/Q code for single as well as multiple nodes. As the hardware description language for the chip implementation became available, we complemented BGQSim with a cycle-accurate and cycle-reproducible, large-scale field-programmable gate array-based platform, Twinstar, to validate the implementation against performance targets and functional specifications. Through specific examples, we show the effectiveness of these tools in co-developing the hardware and software of Blue Gene/Q, allowing us to meet the design targets at an aggressive project schedule.