So, there's a click-fusion threshold around 2-3 ms: two clicks closer together than that are heard as a single click, as if one of them had been "echo cancelled". This is why the temporal resolution of the auditory system only starts at around 5 ms: anything shorter is cancelled out or fused, depending on how you look at it. That's still an impressively fine time grain for humans to detect consciously.
https://asa.scitation.org/doi/abs/10.1121/1.2770545
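
If you want to get a feel for these intervals, here is a minimal sketch of a click-pair stimulus. The 44.1 kHz sample rate, the 50 ms buffer length, and the `click_pair` helper are my own assumptions for illustration, not something taken from the linked paper.

```python
def click_pair(gap_ms, sample_rate=44100, total_ms=50.0):
    """Return a list of samples: silence with two unit impulses gap_ms apart.

    Hypothetical helper for illustration; gap_ms should be > 0.
    """
    n_total = int(round(total_ms * sample_rate / 1000))
    gap = int(round(gap_ms * sample_rate / 1000))  # gap expressed in samples
    samples = [0.0] * n_total
    samples[0] = 1.0          # leading click
    if gap < n_total:
        samples[gap] = 1.0    # lagging click
    return samples


# Print how many samples separate the two clicks for a few gaps.
for gap_ms in (1, 2, 3, 5, 10):
    stimulus = click_pair(gap_ms)
    print(gap_ms, "ms ->", stimulus.index(1.0, 1), "samples apart")
```

At this sample rate a 2-3 ms gap is only on the order of a hundred samples. To actually listen, the samples could be scaled to 16-bit integers and written out with Python's standard `wave` module.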