The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. In this article, we investigate a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware maintained cache coherence. This mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and extending the system to support multi-level cache hierarchies. Results show that the choice of implementation of multi-level cache hierarchies can have a significant impact on performance.