Description
# Artifact for Falcon: A Scalable Analytical Cache Model

This is the supporting artifact for the Falcon paper. It can be used to replicate all results in the submitted version of the paper (`submission_paper.pdf`), given enough time and appropriate hardware. To be precise, reproducing all the results takes around two weeks.

## Quick Version of the Artifact

To facilitate evaluation, we also provide a "quick" version of the artifact that reproduces the main results. It reproduces all the evaluation figures in the paper, with the following differences:

- Figure 1: we run the models at x values that are multiples of 20 up to 100 instead of multiples of 5. This is sufficient to establish the same trend.
- Figures 6 & 7: we evaluate on the last four benchmark files in the ordering of Figure 6, showing that even on the "worst" inputs in the benchmark, Falcon takes minutes (the baseline models time out after four hours).
- Figure 8: we evaluate on thread counts [1, 2, 4, 8, 12, 16] instead of all counts up to 16. This is sufficient to establish the same trend.

All of the above choices can easily be customized before running the artifact; see "Running the experiments" below.

## Hardware Requirements

### Requirements for the quick version

For hardware measurement, our method requires a machine with an AMD Zen3/Zen4 CPU and access to the `perf_event_open` syscall. Note that many cloud machines disallow this syscall. If such a machine is not available, you can still use the hardware measurement data from our machine and run the rest of the artifact; the only difference is that the accuracy figures will be plotted against measurement data from our machine instead of yours.

For the parallelism experiment, a machine with 16 cores is required. With fewer cores, running on 16 threads will improve performance less, so the speedup will be lower than reported in the paper; other than that, the artifact still works fine on a machine with fewer cores.
### Requirements for the full version

For the full version, a machine with 192 GiB of RAM is required, because the baseline model Haystack that we compare against can sometimes use a large amount of RAM. When the full artifact is run on a machine with insufficient RAM and Haystack runs out of memory on some file, that file is gracefully dropped from Figure 6; the rest of the artifact continues to function normally. In such a scenario it may help system stability to run `./earlyoom.sh` before running the models, though when we tested on a low-RAM machine, we did not find this to be necessary.

It may be difficult to obtain a single machine that satisfies both the high RAM requirement and the requirement of access to the `perf_event_open` syscall, as the latter is often unavailable on cloud machines. We therefore provide the option to run each part on a different machine, as long as one machine with high RAM and another meeting the hardware-measurement requirements are available.

## Software Requirements

The artifact requires [Docker](https://docs.docker.com/get-docker/). We tested with version `24.0.6` on a Linux machine.

## Getting Started

The artifact comes with pre-built binaries. To rebuild from scratch, see "Building from Scratch" below. To set up the artifact and docker image:

1. Extract the provided archive and `cd` into the extracted directory.
2. Load the provided docker image with `docker load -i docker/docker_image.tar`.
3. Run the image with `docker run -v $(pwd):/app -it --security-opt seccomp=docker/seccomp.json falcon-artifact`.

This mounts the project root directory (which should be the current directory) to the VM; changes made in the VM are persisted here. The argument `--security-opt seccomp=docker/seccomp.json` loads a custom security configuration. The only difference from the default configuration is that the `perf_event_open` syscall is permitted, which is required for hardware measurement.
The argument can be omitted if hardware measurement is not needed. If the measurement test assert-fails as described in the next section even though you expect it to work on your system, you can try adding the flag `--privileged`, though we did not need it during testing.

Note that during development, our tool was called `lazystack`, so it is referred to as such in scripts and source code.

### Test-running hardware measurement

To test the hardware measurement, run `examples/measurement-example`. If it succeeds, the output will contain four numbers. On our system, we got:

```
0.18134
1359152
16981
68
```

The first number is the runtime and the next three are cache accesses and misses; none of these numbers is expected to be zero. If any of the last three numbers is zero, your CPU is probably unsupported; in this case, you can use the measurement data from our system (see below for more details). On the other hand, if the `perf_event_open` syscall is not supported, an error like the following is reported:

```
measurement-example: ../src/c-gen-perf-main.cpp:66: read_format disable_and_get_count_group(int) [nr = 3U]: Assertion `data.nr == nr' failed.
Aborted (core dumped)
```

### Test-running the cache models

`cd` into the `experiments` directory and run `python perf-all.py -b polybench-S -c 512,512 --output-suffix test --filter gemm`. This should produce output like the following:

```
root@53846d9af34b:/app/experiments# python perf-all.py -b polybench-S -c 512,512 --output-suffix test --filter gemm
Will output to data/perf-polybench-S-512-512-test.json
Runnning haystack on gemm
Running warping on gemm
Running lazystack on gemm... 78ms
{
  "gemm": {
    "haystack": {
      "L1": 0,
      "L2": 0,
      "L3": 0,
      "accesses": 1352400,
      "capacity": 0,
      "compulsory": 1000,
      "misses": 1000,
      "time": 188.77
    },
    "lazystack": {
      "accesses": 1352400,
      "misses": 1000,
      "misses_L1": 1000,
      "misses_L2": 0,
      "ops": 6,
      "peak_mem": 42892,
      "stack_t": 36.9719,
      "symbolic_t": 16.7869,
      "thresh_t": 8.84229,
      "time": 55.1991,
      "varheur_t": 0.015209
    },
    "warping": {
      "access_level": [
        1352400
      ],
      "accesses": 1352400,
      "miss_level": [
        926
      ],
      "misses": 926,
      "time": 293.0
    }
  }
}
Saving to data/perf-polybench-S-512-512-test.json
```

The main thing to check is that all three models (`haystack`, `lazystack`, and `warping`) ran successfully, each producing a section in the output.

## Running the experiments

To perform experiments, `cd` into the `experiments` directory.

To perform hardware measurement, run `./run_measurement.py`. This may take around 4 hours and should be done on a machine with an AMD Zen3/Zen4 CPU and access to the `perf_event_open` syscall. You should run the docker image with the provided seccomp configuration for this part. Then run `./get_system_cache_conf.sh` on the same machine and note down the two numbers it prints; these are the number of cache lines in the cache and the size of each cache line in bytes.

To run the full version of the model evaluation, run `./run_prediction.sh`. This may take around 2 weeks and requires a machine with 192 GiB of RAM. To run the quick version of the model evaluation, run `./run_prediction_fast.sh`; this may take around 2-3 days. If you want to customize any of the parameters of the quick run described in the introduction of this document, you can modify the environment variables exported in that script; they are well documented.

For more information on manually customizing the benchmark runs beyond the provided shell scripts, see `MoreInfo.md`.
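The "all three models ran" check can be scripted. Below is a minimal sketch; it is our own illustration, not part of the artifact, and is demonstrated on an inline excerpt of the sample output above rather than the real output file:

```python
import json

REQUIRED_MODELS = {"haystack", "lazystack", "warping"}

def check_perf_output(perf):
    """Check that every benchmark entry has a result section for all three models."""
    for bench, models in perf.items():
        missing = REQUIRED_MODELS - models.keys()
        assert not missing, f"{bench}: missing model section(s): {missing}"
        for name in REQUIRED_MODELS:
            # A model that failed mid-run would not report a positive access count.
            assert models[name].get("accesses", 0) > 0, f"{bench}/{name}: no accesses"

# Excerpt of the sample `gemm` output above; in practice, load the real file,
# e.g. json.load(open("data/perf-polybench-S-512-512-test.json")).
sample = json.loads("""
{"gemm": {"haystack":  {"accesses": 1352400, "misses": 1000, "time": 188.77},
          "lazystack": {"accesses": 1352400, "misses": 1000, "time": 55.1991},
          "warping":   {"accesses": 1352400, "misses": 926,  "time": 293.0}}}
""")
check_perf_output(sample)
print("all three models ran")
```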
### Using our hardware measurement data in case of unsupported CPU

If you do not have a supported AMD CPU for the hardware measurement, you can use our hardware measurement data. To do this, simply copy the `perf-perf-1.json` file from `experiments` into the `experiments/data` directory and run the prediction scripts as usual. Our machine has 512 cache lines with a line size of 64 bytes; use those settings when running the prediction. You can then proceed to plotting (see below).

## Plotting and comparing results

To plot the collected data, run `./plot.sh` in the `experiments` directory. This generates a report in `experiments/report/report.pdf`, which lists all the generated figures along with their figure numbers in the submitted paper. You can now compare these figures with those in the submitted paper (`submitted_paper.pdf`) and confirm that the interpretations in the figure captions of the submission continue to hold for the figures in the report generated from the artifact.

If you ran Polybench, you can print the speedups on it with `python print_polybench_speedup.py`. It also prints the speedup reported in the paper, with its line number, for reference.

Finally, in the paper we checked our tool's correctness by comparing its outputs against Haystack. Run `python check.py` to replicate this; it compares against all available Haystack outputs from the artifact.

## Building from Scratch

The artifact comes with pre-built binaries for convenience, but also supports a fresh build. To do so, first run `./clean_all.sh` to delete the pre-built data. To build the binaries used for hardware measurement, run `./setup_measurement.sh`. To build the cache models, run `./setup_prediction.sh`. The binary of our tool is produced at `cmake-build-release/bin/lazystack`, which can be used to run it manually on a given MLIR file.

## Source Code Organization

The source code of our tool is in the directories `include`, `src`, and `lib`.
Some modifications have also been made to the external libraries in `polybase/barvinok` and `polybase/isl`.

The entrypoint into the actual cache model is the `CacheModel::compute` function in `lib/Analysis/CacheModel.cpp`. The key function is `CacheModel::computeSink`, which computes the cache misses at each level for a given sink in the program; this corresponds to the body of the loop in Algorithm 1 in the paper. The dependences are computed by the call to `Lazy::compute`, implemented in `lib/Analysis/Lazy.cpp`. Threshold counting is performed in the call to `ThresholdCounting::compute`, implemented in `lib/Analysis/ThresholdCounting.cpp`.
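For reference, speedup figures like those mentioned under "Plotting and comparing results" can be sanity-checked by hand from the per-model `time` fields in the perf JSON. Below is a minimal sketch using the sample `gemm` numbers shown earlier; note that defining speedup as baseline runtime divided by `lazystack` runtime is our assumption for illustration, and `print_polybench_speedup.py` remains the authoritative computation:

```python
# Per-model runtimes (ms) taken from the sample `gemm` output earlier in this document.
times = {"haystack": 188.77, "lazystack": 55.1991, "warping": 293.0}

# Assumed definition: baseline runtime divided by our tool's (lazystack's) runtime.
for baseline in ("haystack", "warping"):
    speedup = times[baseline] / times["lazystack"]
    print(f"{baseline}: {speedup:.2f}x")
```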
Data Citation
Pitchanathan, A., Grover, K., & Grosser, T. (2024). Artifact for "Falcon: A Scalable Analytical Cache Model". Zenodo. https://doi.org/10.5281/zenodo.10972076
| | |
|---|---|
| Date made available | 15 Apr 2024 |
| Publisher | Zenodo |