0.5 billion events per second time correlated single photon counting using CMOS SPAD arrays

Citation for published version:
Krstajic, N, Poland, S, Levitt, J, Walker, R, Erdogan, A, Ameer-Beg, S & Henderson, R 2015, '0.5 billion events per second time correlated single photon counting using CMOS SPAD arrays' Optics Letters, vol. 40, no. 18, pp. 4305-4308. DOI: 10.1364/OL.40.004305

Digital Object Identifier (DOI):
10.1364/OL.40.004305

Document Version:
Peer reviewed version

Published In:
Optics Letters

0.5 billion events per second time correlated single photon counting using CMOS SPAD arrays

NIKOLA KRSTAJIĆ,1,2,* SIMON POLAND,3 JAMES LEVITT,3 RICHARD WALKER,1,4 AHMET ERDOGAN,1 SIMON AMEER-BEG,3 ROBERT K. HENDERSON1

1Institute for Integrated Micro and Nano Systems, School of Engineering, University of Edinburgh, Edinburgh, UK; 2EPSRC IRC "Hub" in Optical Molecular Sensing & Imaging, MRC Centre for Inflammation Research, Queen’s Medical Research Institute, G7 Little France Crescent, Edinburgh, UK; 3Division of Cancer Studies & Randall Division of Cell and Molecular Biophysics, Guy’s Campus, Kings College, London, UK; 4Photon Force Ltd, Edinburgh

*Corresponding author: n.krstaji@physics.org

We present a digital architecture for fast acquisition of time correlated single photon counting (TCSPC) events from a 32×32 CMOS SPAD array (Megaframe) to the computer memory. Custom firmware was written to transmit event codes from 1024 TCSPC-enabled pixels for fast transfer of TCSPC events. Our 1024 channel TCSPC system is capable of acquiring up to $0.5 \times 10^9$ TCSPC events per second with 16 histogram bins spanning a 14 ns width. Other options include $320 \times 10^6$ TCSPC events per second with 256 histogram bins spanning either a 14 ns or a 56 ns time window. We present a wide-field fluorescence microscopy setup demonstrating fast fluorescence lifetime data acquisition. To the best of our knowledge, this is the fastest direct TCSPC transfer from a single photon counting device to the computer to date. © 2015 Optical Society of America


http://dx.doi.org/10.1364/OL.40.004305

Time correlated single photon counting (TCSPC) is the most accurate technique available for determining fluorescence decays down to sub-nanosecond time ranges. The technique originated in nuclear physics where scintillation decay curves provided clues about the nuclear particles being detected by the scintillating crystal [1]. Over time, TCSPC has expanded into chemistry [2] and a range of biomedical applications [3] including fluorescence lifetime imaging (FLIM) [4,5]. Printed circuit board (PCB) level electronics integration enabled widespread use of TCSPC since the early ’90s when commercial suppliers provided bespoke systems for a range of applications from quantum optics to tissue imaging. We believe that the next technological push will involve integrating single photon detection, timing and processing circuitry on a single semiconductor die. The best contender for this leap is the standard complementary metal oxide semiconductor (CMOS) technology [6], because standard CMOS has already been the key driver in global electronics miniaturization.

A number of CMOS single photon avalanche detector (SPAD) sensors have achieved the low noise, high frame rate and scalability needed for fluorescence imaging applications [7,8]. CMOS SPAD imaging sensors enable highly parallelized TCSPC, but at the same time present a serious data bottleneck. The CMOS SPAD sensor used in this study is the Megaframe (MF32) sensor [7], which has 32 × 32 pixels, 55 ps time resolution, a 10 bit time-to-digital converter (TDC) and a 56 ns time window. Whilst the fill factor of the MF32 sensor used is low (1.5%), our previous work demonstrates optical fill factor amplification to 100% [9,10] by generating 64 fluorescence beamlets which are imaged onto the active area of the SPAD (fill factor amplification is not used in the work presented here). The MF32 interface to the field programmable gate array (FPGA) is limited to 500000 frames per second (fps) and the transfer of this data to the computer has not been optimised. This manuscript addresses the data bottleneck challenge by achieving optimal count rates from the sensor to the computer memory via the FPGA and universal serial bus 3 (USB3).

To the best of our knowledge, we demonstrate the highest TCSPC count rates to date. Prior work [11] achieved 80 million counts per second (Mcps; counts are often referred to as timestamps or time-tags) without taking into account pile-up effects in the lifetime determination. In the present work we demonstrate 500 Mcps. This value is close to the acceptable limit for counts in terms of pulse pile-up (1% for a laser repetition rate of 50 MHz). Such a count rate is likely to cater for larger sensors, since low light widefield fluorescence detection usually acquires 1000-5000 counts per second [12], meaning that a 100 × 100 pixel TCSPC sensor would work well (assuming 100% fill factor). The results presented here are directly applicable to multi-beam scanning required for ultrafast FLIM applications [13].
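The pile-up figure and the larger-sensor estimate quoted above can be checked with a short calculation. This is a minimal sketch that assumes the 500 Mcps total is spread evenly over the 1024 pixels and that the 1000-5000 counts per second figure applies per pixel; neither assumption is stated explicitly in the text, and the function name is illustrative.

```python
def events_per_pixel_per_pulse(total_cps, n_pixels, laser_rep_rate_hz):
    """Average detected events per pixel per laser pulse (even spread assumed)."""
    per_pixel_cps = total_cps / n_pixels
    return per_pixel_cps / laser_rep_rate_hz


# 500 Mcps over 1024 pixels with a 50 MHz laser repetition rate.
pile_up_fraction = events_per_pixel_per_pulse(500e6, 1024, 50e6)
print(f"events per pixel per pulse: {pile_up_fraction:.4f}")

# Hypothetical 100 x 100 pixel sensor at the upper low-light figure
# (5000 counts per second, assumed to be per pixel).
large_sensor_cps = 100 * 100 * 5000
print(f"100 x 100 sensor total: {large_sensor_cps / 1e6:.0f} Mcps")
```

Under these assumptions the per-pixel detection fraction is just under 1%, and the hypothetical larger sensor needs well under the 500 Mcps demonstrated here.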

The measured TCSPC events need to be passed from the sensor to the FPGA, and then from the FPGA to the computer; see Fig. 1. As shown in Fig. 1, the interface to the MF32 sensor consists of clocking the LINECLK line at 8 MHz. This clock places data on the 64 bit DATA_TDC parallel bus. The readout reads two lines of 10 bit TDC values from the sensor for each LINECLK cycle. The DATA_SAMPLE_CLK is set to 80 MHz, which enables the sampling of the 10 bit TDC value over 10 clock cycles to fit the LINECLK 8 MHz rate. 640 bits (64 pixels with 10 bit TDC values) are processed by the MF32 interface module on the FPGA.
For best performance, $64 \times 10$ bit TDC values are compressed to $64 \times 4$ bit TDC values, losing some time resolution but allowing an unprecedented TCSPC event rate. The resulting 256 bits (64 pixels with 4 bit TDC values) need to be sent via USB3 every 125 ns, i.e. once per 8 MHz LINECLK cycle. The total data throughput required is thus 256 megabytes per second (MB/s). The USB3 bus has a maximum data rate of 300-340 MB/s, so it gives ample room to fit 0.5 billion events per second. The USB3 bus is not real-time, so a first in first out (FIFO) buffer is needed to keep communications smooth. A 128 kilobyte (KB) FIFO buffer was implemented with a 128 bit input (FIFO_DATA_IN) and a 32 bit output (FIFO_DATA_OUT). The manufacturer of the PCB (Opal Kelly) provides a dedicated 32 bit bus on the FPGA and FIFO_DATA_OUT is channeled via USB3.
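The compression and throughput arithmetic above can be sketched in software as follows. The text does not say which bits are kept when a 10 bit code is reduced to 4 bits; the sketch assumes bits 7..4 are retained (16 bins across the lower 256 codes), and the names `compress_code`, `pack_word` and `throughput_bytes_per_s` are hypothetical, not the firmware's own.

```python
LINECLK_HZ = 8_000_000   # line clock rate given in the text
PIXELS_PER_WORD = 64     # two lines of 32 pixels per LINECLK cycle
ROWS = 32                # MF32 is a 32 x 32 array


def compress_code(code10, shift=4):
    """Reduce a 10 bit TDC code to 4 bits (assumed scheme: keep bits 7..4)."""
    return (code10 >> shift) & 0xF


def pack_word(codes10):
    """Pack 64 ten-bit codes into 32 bytes, two 4 bit codes per byte."""
    if len(codes10) != PIXELS_PER_WORD:
        raise ValueError("expected 64 TDC codes")
    packed = bytearray()
    for i in range(0, PIXELS_PER_WORD, 2):
        high = compress_code(codes10[i])
        low = compress_code(codes10[i + 1])
        packed.append((high << 4) | low)
    return bytes(packed)


def throughput_bytes_per_s(bits_per_pixel):
    """Bytes per second needed to ship one 64 pixel word per LINECLK cycle."""
    return LINECLK_HZ * PIXELS_PER_WORD * bits_per_pixel // 8


frame_rate = LINECLK_HZ // (ROWS // 2)   # 16 LINECLK cycles per frame
events_per_s = frame_rate * ROWS * ROWS  # one event code per pixel per frame

print(len(pack_word([181] * 64)), "bytes per word")
print(throughput_bytes_per_s(4) // 1_000_000, "MB/s at 4 bits per pixel")
print(throughput_bytes_per_s(10) // 1_000_000, "MB/s at 10 bits per pixel")
print(frame_rate, "frames per second,", events_per_s, "event codes per second")
```

With these numbers the uncompressed 10 bit stream would need 640 MB/s, which is why the 4 bit option is the one that fits within the USB3 budget at the full frame rate.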

The timing diagram shown in Fig. 2 further illustrates the clocking frequencies of the data path. Data from the MF32 sensor is fed into the FIFO over a 128 bit wide bus at 32 MHz. USB3 data is clocked out 32 bits at a time at 100 MHz, thus allowing up to 400 MB/s in principle. However, the USB3 protocol overheads and delays on either the PC or embedded side may hold up the communications, so sustained...
CCD. Three images on the right of Fig. 5 are FLIM images. The lifetime was extracted using iterative deconvolution with Levenberg-Marquardt fitting using custom Matlab scripts derived from DecayFit 1.3 [15,16]. The FLIM images shown in Fig. 5 were acquired in 3 s at 320 Mcps with an 8 bit TDC covering the 14 ns time window and a 100 MHz laser repetition rate. Out of 320 Mcps, 11 × 10^6 events were non-zero values fed into histograms. Over 3 s this resulted in 33000 events per pixel on average. The transfer is organized on a frame-by-frame basis, so if no photon is detected in a given frame’s exposure time then the value for the pixel is written as 0. Pile-up artefacts are avoided by ensuring that the frame rate is <1% of the laser repetition rate. With higher laser power, improved optics and fill factor improvement [9,10,13] this value is expected to scale to 500 Mcps of photon events. We verified full data rates by acquiring the IRF at maximum data rates with >96% non-zero events. The FLIM images will also improve in clarity once full pixel timing calibration is applied, such as integral non-linearity and differential non-linearity corrections [13]. The sample decay from the FLIM image is shown in Fig. 6. The time window is 14 ns wide and 8 bit TDC time-correlated events were used at 55 ps time resolution. The decay pre-pulse appearing at ~2 ns in Fig. 6 is related to the optical setup (either specimen or emission path). The pre-pulse does not appear in any of the IRFs acquired.
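The frame-by-frame convention described above, where a pixel value of 0 means that no photon was detected in that frame, lends itself to a simple histogramming loop. The following is a minimal sketch in pure Python rather than the authors' Matlab routines; the function name and the toy data are illustrative only.

```python
def histogram_frames(frames, n_pixels, n_bins):
    """Accumulate per-pixel TDC histograms, skipping 0 (no photon detected)."""
    histograms = [[0] * n_bins for _ in range(n_pixels)]
    for frame in frames:
        for pixel, code in enumerate(frame):
            if code != 0:
                histograms[pixel][code] += 1
    return histograms


# Toy data: three frames from a hypothetical 2 pixel sensor with 4 bit codes.
frames = [
    [0, 5],  # pixel 0 saw no photon, pixel 1 recorded code 5
    [3, 5],
    [3, 0],
]
histograms = histogram_frames(frames, n_pixels=2, n_bins=16)
print(histograms[0])
print(histograms[1])
```

One consequence of reserving 0 as the empty marker is that bin 0 of every histogram stays empty, so a genuine event with code 0 cannot be distinguished from a frame without a photon.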

We tested data rates under a variety of TDC configurations and the results are outlined in Table 1. We were able to maintain a 320 MB/s data rate, securing count rates of 500 Mcps with the 4 bit TDC. The upper limit is the current readout rate of MF32 from the FPGA, which is 500000 frames per second. We succeeded in maintaining this rate for the 4 bit TDC size and 4 bit single photon count frames. Our tests were performed on 30 × 0.1 s acquisitions. We interleave acquisitions with histogramming to reduce loading on the computer memory. For each 0.1 s of acquisition, histogramming takes approximately 1 s. Our aim is to reduce this by deploying faster histogramming routines, possibly involving graphics processing units (GPUs). Our extensive tests show that 320 MB/s is broadly maintained for 1 hour tests; the standard deviation of the data rate variation is 6 MB/s. We found that USB3 cabling plays a crucial role. The quality of cables varies and we found the Point Grey Research ACC-01-2300 3 m USB3 cable to give more consistent results. As USB3 becomes more widely adopted, cabling is unlikely to remain an issue. The computer architecture deployed is also important. A 64 bit operating system should be used with a minimum of 16 GB of dynamic memory.
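The relationship between the sustained link rate, the TDC width and the achievable count rate can be illustrated with a short calculation. This sketch assumes event codes are packed with no padding or protocol overhead, which is a simplification; `max_event_rate` is a hypothetical helper, not part of the acquisition software.

```python
MAX_FRAME_RATE = 500_000  # MF32 readout limit from the FPGA (frames per second)
N_PIXELS = 1024           # 32 x 32 array


def max_event_rate(link_bytes_per_s, bits_per_event):
    """Event codes per second, limited by either the link or the sensor readout."""
    link_limited = link_bytes_per_s * 8 // bits_per_event
    sensor_limited = MAX_FRAME_RATE * N_PIXELS
    return min(link_limited, sensor_limited)


sustained = 320_000_000  # 320 MB/s sustained rate reported in the text
for bits in (4, 8):
    rate = max_event_rate(sustained, bits)
    print(f"{bits} bit TDC: {rate / 1e6:.0f} Mcps")
```

Under these assumptions the 4 bit mode is limited by the sensor readout (about 0.5 billion event codes per second), while the 8 bit mode is limited by the link at 320 Mcps, which matches the two operating points quoted in the text.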

As expected, the decay for 4 bit TDC transfer is coarser than the decay for 10 bit TDC transfer, as shown in Fig. 7. It should also be noted that for a 100 MHz repetition rate, the 8 bit TDC for a 14 ns time window has the same coarseness as the full 10 bit TDC for a 56 ns time window. The 4 bit TDC transfer has a higher number of photons per bin, because 16 bins cover the 14 ns time window, as opposed to photon counts being spread over 256 bins in 8 bit TDC transfer over the same 14 ns time window. So despite the loss in time resolution, one may obtain a decay curve sooner in 4 bit TDC transfer. The problem is more complex, because both the time resolution and the number of photon events in the histogram affect the lifetime estimation accuracy. Prior work shows promising prospects for histogramming with similar coarseness to 4 bit TDC transfer [17,18].
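The bin widths and the photons-per-bin argument above can be made concrete with a small example. This is an illustrative sketch only: the uniform histogram is toy data, and `bin_width_ps` and `rebin` are hypothetical helper names.

```python
def bin_width_ps(window_ns, tdc_bits):
    """Histogram bin width in picoseconds for a given window and TDC width."""
    return window_ns * 1000 / (1 << tdc_bits)


def rebin(histogram, factor):
    """Sum groups of adjacent bins, e.g. 256 bins down to 16 with factor 16."""
    if len(histogram) % factor != 0:
        raise ValueError("histogram length must be a multiple of factor")
    return [sum(histogram[i:i + factor]) for i in range(0, len(histogram), factor)]


print(bin_width_ps(14, 8))   # 8 bit TDC over 14 ns
print(bin_width_ps(56, 10))  # 10 bit TDC over 56 ns
print(bin_width_ps(14, 4))   # 4 bit TDC over 14 ns

# Toy 8 bit histogram with 10 counts in every bin, rebinned to 4 bit resolution.
fine = [10] * 256
coarse = rebin(fine, 16)
print(len(coarse), coarse[0])
```

The first two widths are identical (about 55 ps), while the 4 bit bins are 875 ps wide and, for the same total number of events, each holds 16 times as many counts as an 8 bit bin.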

We have demonstrated ultra-fast TCSPC event code transfer from the CMOS SPAD array. As mentioned above, our aim is to deploy this sensor in a variety of physics, biology and pre-clinical applications. It is important to note that the most beneficial implementation for MF32 will be the one which amplifies the fill factor, as this ensures a low dark count rate (DCR) due to the small SPAD area whilst reducing photobleaching and maintaining a high count rate. Also, the high count rates demonstrated here indicate what should be expected from future TCSPC sensor cameras over USB3 or similar links. Lastly, top-range FPGA architectures have more than 50 MB of static RAM available on-chip (at the time of writing), allowing fast parallelized read-modify-write histogramming of at least 25000 TCSPC pixels.
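The on-chip histogramming estimate can be sanity-checked with simple arithmetic. The sketch assumes decimal megabytes and 16 bit bin counters; neither is specified in the text, and the helper name is hypothetical.

```python
def histogram_bins_per_pixel(sram_bytes, n_pixels, counter_bytes):
    """Number of histogram bins each pixel can have in on-chip memory."""
    return sram_bytes // (n_pixels * counter_bytes)


bins = histogram_bins_per_pixel(50_000_000, 25_000, counter_bytes=2)
print(bins, "bins per pixel with 16 bit counters")
```

Under these assumptions each of 25000 pixels could hold roughly a thousand 16 bit bins, which is the same order as a full 10 bit TDC histogram.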
References


