While applying the verified findings of Henk's appnote (now put into a PR), I found that using the known-good hardware IO timings breaks the simulation.
The sim timings were already different from the hardware ones before this change, which suggests the timing of the PHY model does not match the hardware. The difference is significant: about 6 core clock cycles, which is around 10ns.
So this issue covers whether we should revisit the PHY model timing. Either there is a fundamental difference between xsim and hw which needs to be accommodated, or we should make the XS3 PHY timings match hardware.
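
For reference, here is a quick sanity check of the "6 core clock cycles is around 10ns" figure. It is only a sketch: the 600 MHz core clock is an assumption (the issue does not state the clock frequency), and `cycles_to_ns` is a hypothetical helper, not something from the repo.

```python
def cycles_to_ns(cycles: int, core_clock_hz: float) -> float:
    """Convert a count of core clock cycles to nanoseconds."""
    return cycles / core_clock_hz * 1e9


# ASSUMPTION: 600 MHz core clock. Not stated in this issue; adjust to the
# actual core clock of the target under test.
ASSUMED_CORE_CLOCK_HZ = 600e6

# 6 cycles is the sim-vs-hardware timing difference quoted above.
offset_ns = cycles_to_ns(6, ASSUMED_CORE_CLOCK_HZ)
print(f"{offset_ns:.1f} ns")
```

If the core clock differs from the assumed value, the offset in ns scales accordingly, so it is probably safer to compare the sim and hw timings in cycles rather than in ns.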