CM5 does not work with ptp4l #39

Open
jclark opened this issue Jan 15, 2025 · 28 comments

Comments

@jclark
Owner

jclark commented Jan 15, 2025

The CM5 doesn't work with ptp4l. Here's how to test without needing any special hardware.

  1. Connect a CM4 and a CM5 directly with a short ethernet cable. When running headless, I do this by having the CM4 use DHCP on eth0, and then on the CM5 plug an ethernet dongle in the USB port and then plug that into my main network; on the CM5 I then use nmcli to configure eth0 with the shared method.
  2. Install and set up chrony on both the CM4 and CM5. Ideally configure both with a common local NTP server (by specifying e.g. server ntp.lan iburst in /etc/chrony/chrony.conf). Make sure chrony is successfully syncing the CM4 and CM5 by checking chronyc sources.
  3. Install linuxptp on both the CM4 and CM5: sudo apt install linuxptp.
  4. On the CM5, run phc2sys to synchronize the PHC to the system clock: sudo phc2sys -q -m -l 6 -c eth0 -s CLOCK_REALTIME -O 0 >phc2sys.log &
  5. Do tail -f phc2sys.log and wait for 30 seconds or so until the sys offsets are consistently small (absolute value < 100).
  6. Start ptp4l on the CM5: sudo ptp4l -i eth0 --tx_timestamp_timeout 100 -l 6 -m -q. After a few seconds it should display selected local clock 2ccf67.fffe.c114d2 as best master, where the first and last parts of the hex number shown correspond to the MAC address of eth0 on the CM5 (2c:cf:67:c1:14:d2 in this case).
  7. Now open a terminal on the CM4 and run ptp4l as a slave: sudo ptp4l -i eth0 --tx_timestamp_timeout 100 -l 6 -m -q -s >ptp4l.log &
  8. Do tail -f ptp4l.log and wait till master offset has settled down (absolute value < 100).
  9. Now do phc_ctl eth0 cmp. This is the difference in nanoseconds between the system clock and the PHC. If everything is working properly this should certainly be less than a millisecond, i.e. 1,000,000. The value I actually see varies all the time. Right now it is 4629122817ns, which is about 4.6 seconds. However, I get wildly different results on different days.
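
The clock identity mentioned in step 6 can be predicted from the MAC address. The sketch below assumes the identity is the MAC with ff:fe inserted in the middle, which matches the example in step 6; the function name mac_to_clock_id is made up for illustration.

```shell
# Derive the clock identity that ptp4l is expected to print for an interface.
# Assumption: identity = first 3 MAC bytes, then ff fe, then last 3 MAC bytes.
mac_to_clock_id() {
    # Strip the colons and lowercase: 2c:cf:67:c1:14:d2 -> 2ccf67c114d2
    hex=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    first=$(printf '%s' "$hex" | cut -c1-6)   # first three bytes
    last=$(printf '%s' "$hex" | cut -c7-12)   # last three bytes
    printf '%s.fffe.%s\n' "$first" "$last"
}

# Usage on the CM5 (reads the MAC from sysfs):
#   mac_to_clock_id "$(cat /sys/class/net/eth0/address)"
```

Comparing this string with the "selected local clock ... as best master" line is a quick way to confirm ptp4l is running on the intended interface.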

This is on an 8GB CM5, with 8 Jan 2025 firmware (as shown by rpi-eeprom-update), kernel 6.6.62+rpt-rpi-2712.

Since the CM4 is known good, this is going to be a hardware, firmware or driver problem on the CM5.

This isn't a problem with phc2sys: you can see the same problem just using phc_ctl eth0 set on the CM5 to do a one time synchronization of the system clock.

@JN19aban

JN19aban commented Jan 15, 2025

@jclark let me ask something. Which bootloader does the CM5 have? I ask because NUMA is enabled in the latest updates, and this might hurt ptp4l, or the opposite: it might need to be enabled. In the last few days almost the same thing happened to me on an RPi 5 (not with ptp4l): I used the new bootloader with an older kernel and the performance was very bad.

https://github.com/raspberrypi/rpi-eeprom/tree/master/firmware-2712/latest

Have a look at this.


But on the CM4 with the latest bootloader (NUMA enabled) and the latest kernel, the whole thing is flying (in terms of system performance; ptp4l performance is unchanged).

@jclark
Owner Author

jclark commented Jan 15, 2025

@JN19aban It looks like NUMA can be turned off by setting SDRAM_BANKLOW=0 with rpi-eeprom-config. I will give that a try.
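
A minimal sketch of how the SDRAM_BANKLOW setting could be scripted rather than edited by hand. The helper set_banklow is hypothetical and only rewrites config text; the rpi-eeprom-config invocations in the comments are an assumed workflow and should be checked against the tool's help output before use.

```shell
# Set (or add) SDRAM_BANKLOW in a bootloader config read from stdin,
# writing the edited config to stdout.
set_banklow() {
    awk -v v="$1" '
        /^SDRAM_BANKLOW=/ { print "SDRAM_BANKLOW=" v; seen = 1; next }
        { print }
        END { if (!seen) print "SDRAM_BANKLOW=" v }
    '
}

# Possible usage on the CM5 (not run here; verify each step on your own board):
#   rpi-eeprom-config > boot.conf               # dump the current config
#   set_banklow 0 < boot.conf > boot-new.conf   # 0 is the value suggested above
#   sudo rpi-eeprom-config --apply boot-new.conf
#   sudo reboot
```

If the key is missing it is appended at the end of the file, so it lands in whichever section comes last; that is assumed to be the right place here.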

@JN19aban

JN19aban commented Jan 15, 2025

@jclark update to the latest bootloader first and then disable NUMA. I see that it has updates for the RAM modules, especially for the 8GB versions.

@JN19aban

It has to be on this release:


BOOTLOADER: up to date
   CURRENT: Wed Jan  8 17:52:48 UTC 2025 (1736358768)
    LATEST: Wed Jan  8 17:52:48 UTC 2025 (1736358768)
   RELEASE: latest (/lib/firmware/raspberrypi/bootloader-2712/latest)
            Use raspi-config to change the release.

@jclark
Owner Author

jclark commented Jan 15, 2025

vcgencmd bootloader_version gives me

2025/01/08 17:52:48
version 97facbf492c43a5b6b0e9719860798b7cebfdebb (release)
timestamp 1736358768
update-time 1736908130
capabilities 0x0000007f

What command were you running?

@jclark
Owner Author

jclark commented Jan 15, 2025

With that the offset is 344096797ns i.e. 0.34s, which is a lot different from before.

Do you have any insight as to why NUMA should be affecting ptp4l so much?

@JN19aban

This command:

sudo rpi-eeprom-update

To update use this:

sudo rpi-eeprom-update -a

But you have to change the bootloader release from stable to latest.

To do this, first run:

sudo apt update
sudo apt upgrade
sudo reboot

1. sudo raspi-config
2. 6 Advanced Options
3. A5 Bootloader Version
4. E1 Latest
5. Select OK, then Finish, and reboot
6. Run sudo rpi-eeprom-update -a
7. Reboot

@jclark
Owner Author

jclark commented Jan 15, 2025

I'm running latest already and get the same output from rpi-eeprom-update as you.

@JN19aban

JN19aban commented Jan 15, 2025

"Do you have any insight as to why NUMA should be affecting ptp4l so much? "

Wrong RAM timings would be my first guess. You also need NUMA enabled in the kernel; otherwise performance is very bad.

That said, I do not know precisely what effect this has on ptp4l, because I do not have a CM5 available yet.

@JN19aban

So was the value you posted before, 344096797ns, with NUMA enabled or disabled?

@jclark
Owner Author

jclark commented Jan 15, 2025

With NUMA disabled

@JN19aban

OK, try with SDRAM_BANKLOW=1

and rerun the test with SDRAM_BANKLOW=2

@jclark
Owner Author

jclark commented Jan 15, 2025

The first result (more than 4 seconds offset) was with latest firmware and nothing added to the firmware config, so NUMA enabled.

@JN19aban

JN19aban commented Jan 15, 2025

In theory it is with SDRAM_BANKLOW=3, but to make sure, run with values 1 and 2.


One more important thing: after all the tests, try SDRAM_BANKLOW=-1 to disable NUMA and the RAM enhancements.

@JN19aban

@jclark I have another thought.

With the NUMA setup (RAM timings etc.) the system is tuned for performance, compatibility, etc.

In theory, the hardware timestamps shouldn't be affected anywhere near as much as the offsets posted previously suggest.

Can you test further whether hardware timestamping is truly enabled in ptp4l and not "faked" somehow? Could the offsets you see come from software timestamps?

Another scenario I am considering is that the new I/O controller (RP1) might somehow affect the performance of the PHY.
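
One way to start on the hardware timestamping question is to look at what the interface advertises. The sketch below filters the output of ethtool -T; the helper name has_hw_timestamping is made up, and the check only shows advertised capabilities, not which timestamps ptp4l actually receives.

```shell
# Read `ethtool -T <iface>` output on stdin; succeed only if both hardware
# transmit and hardware receive timestamping are listed as capabilities.
has_hw_timestamping() {
    caps=$(cat)
    printf '%s\n' "$caps" | grep -q 'hardware-transmit' &&
        printf '%s\n' "$caps" | grep -q 'hardware-receive'
}

# Usage:
#   ethtool -T eth0 | has_hw_timestamping && echo "HW timestamping advertised"
```

A passing check does not rule out wrong timestamps; it only rules out the interface lacking hardware support altogether.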

@jclark
Owner Author

jclark commented Jan 15, 2025

With SDRAM_BANKLOW=1, I get -26605883ns (so -26,605,883ns = -0.03s).
With SDRAM_BANKLOW=2, I get 460105087ns (so 460,105,087ns = 0.46s).
With SDRAM_BANKLOW=-1, I get 640659702ns (so 640,659,702ns = 0.64s).

Note that in all cases the value isn't constant: it changes gradually. So I wonder how reproducible these are.

Going back to no SDRAM_BANKLOW entry, I get 469640908ns (so 469,640,908ns = 0.47s). Going back to SDRAM_BANKLOW=1, I get 523767888ns (so 523,767,888ns = 0.52s). So not reproducible at all. I suspect this is not the issue.
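
The nanosecond-to-second conversions above, and the 1 ms bound from step 9 of the test, can be sketched as two small helpers. The names ns_to_s and offset_ok are illustrative only; the value is passed in by hand, so nothing here depends on the exact output format of phc_ctl.

```shell
# Convert a nanosecond offset to seconds with three decimals.
ns_to_s() {
    awk -v ns="$1" 'BEGIN { printf "%.3f\n", ns / 1e9 }'
}

# Succeed if |offset| is below the 1 ms (1,000,000 ns) bound from step 9.
offset_ok() {
    awk -v ns="$1" 'BEGIN { if (ns < 0) ns = -ns; exit !(ns < 1000000) }'
}

# Usage:
#   ns_to_s -26605883                  # prints -0.027
#   offset_ok 460105087 || echo "offset too large"
```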

@JN19aban

It might be. I will keep an eye out for more news on the RPi 5 / CM5. I will probably have a CM5 in my hands by the 15th of February if all goes well.

I still strongly suspect the two things from my previous comment, though.

@jclark
Owner Author

jclark commented Jan 16, 2025

Wireshark shows clearly that the problem is that the hardware transmit timestamps from the CM5 are incorrect.

After some quality time with bpftrace, I think I know what is going on. The behaviour is a kernel bug. It is a result of three things:

  1. the CM5 has two PTP clocks associated with eth0, each with its own packet timestamper; one is the PHY-level one from drivers/net/phy/bcm-phy-ptp.c and the other is the MAC-level one from drivers/net/ethernet/cadence/macb_ptp.c (neither the CM4 nor the Pi 5 has two PTP clocks)
  2. the macb driver uses a legacy method of controlling hardware timestamping, using ndo_eth_ioctl instead of the newer ndo_hwtstamp_set/_get methods
  3. the code that handles these ioctls (dev_{set,get}_hwtstamp) in net/core/dev_ioctl.c doesn't properly handle the case where there are both MAC-level and PHY-level timestampers but the MAC-level timestamper uses the legacy ndo_eth_ioctl

The overall result is that the hardware transmit timestamping code in the PHY level driver doesn't get called, and I think there is a weird mix of the two PHCs being used, which is why there are wrong timestamps.

In kernel 6.8, the macb driver is updated to use the newer ndo_hwtstamp_set/_get, so this problem should go away. I tried applying the patch for this to the current Raspberry Pi kernel, but I got strange errors from the macb driver ("DMA bus error: HRESP not OK"), and I have absolutely no idea what is causing those.

So a simple workaround is to compile the current kernel with CONFIG_MACB_USE_HWSTAMP commented out in the config, which avoids having the two competing clocks.
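
To see whether a given kernel was built with or without this option, its config can be inspected. A minimal sketch, assuming the kernel exposes its config through /proc/config.gz; the helper name macb_hwstamp_state is hypothetical.

```shell
# Read a kernel config on stdin and report how CONFIG_MACB_USE_HWSTAMP is set.
macb_hwstamp_state() {
    awk '
        /^CONFIG_MACB_USE_HWSTAMP=y/            { state = "enabled" }
        /^# CONFIG_MACB_USE_HWSTAMP is not set/ { state = "disabled" }
        END { print (state ? state : "absent") }
    '
}

# Usage (assumes /proc/config.gz exists; on Raspberry Pi OS this may first
# need `sudo modprobe configs`):
#   zcat /proc/config.gz | macb_hwstamp_state
```

With the workaround applied, the expected result is "disabled" or "absent" rather than "enabled".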

Unfortunately, after fixing this problem, another appears: ptp4l gives the infamous timed out while polling for tx timestamp error, even after increasing tx_timestamp_timeout to a ridiculously large value. Argh!

@JN19aban

JN19aban commented Jan 16, 2025

Does this mean that you see two clocks in /dev/, ptp0 and ptp1?

@jclark
Owner Author

jclark commented Jan 16, 2025

Right: the CM5 has a /dev/ptp0 (like the /dev/ptp0 on the CM4) and a /dev/ptp1 (like the /dev/ptp0 on the Pi5).
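
A small sketch for telling the two devices apart by the name each driver registers in sysfs. The function list_ptp_clocks is illustrative; it takes the sysfs directory as an optional argument so it can be tried against a copy.

```shell
# Print each PTP clock device with its driver-reported name.
list_ptp_clocks() {
    root=${1:-/sys/class/ptp}
    for dir in "$root"/ptp*; do
        # Skip entries without a readable clock_name (including an unmatched glob).
        [ -r "$dir/clock_name" ] || continue
        printf '/dev/%s: %s\n' "$(basename "$dir")" "$(cat "$dir/clock_name")"
    done
}

# Usage on the CM5:
#   list_ptp_clocks
```

The names printed depend on the drivers, so no expected output is given here.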

@JN19aban

JN19aban commented Jan 16, 2025

That is new. So you could, say, use two clocks (for example PTP and SyncE together), once it works correctly of course. That sounds nice so far.


I am also wondering whether it supports partitioning (NPAR), or might in the future.

@lhoward

lhoward commented Jan 17, 2025

It sounds nice but isn't there the issue described here where the kernel can only report one timestamp at a time?

@jclark
Owner Author

jclark commented Jan 18, 2025

I just did an update to next (using sudo rpi-update next), and it is now working. The test above gives a difference of 0.5 milliseconds, which is about right.

uname -a gives

Linux valdeon 6.12.10-v8-16k+ #1840 SMP PREEMPT Fri Jan 17 18:08:09 GMT 2025 aarch64 GNU/Linux

rpi-eeprom-update gives

BOOTLOADER: up to date
   CURRENT: Tue 14 Jan 00:16:48 UTC 2025 (1736813808)
    LATEST: Tue 14 Jan 00:16:48 UTC 2025 (1736813808)
   RELEASE: latest (/lib/firmware/raspberrypi/bootloader-2712/latest)
            Use raspi-config to change the release.
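
For scripts that need to decide whether a board is likely to be affected, a version check on the kernel release string is one rough option. This is only a sketch: the helper kernel_at_least is made up, and a version number alone does not prove that a particular driver change is present in a vendor kernel.

```shell
# Succeed if a kernel release string (as printed by `uname -r`) is at least
# the given major.minor version.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}            # text before the first dot
    rest=${release#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}          # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Usage:
#   kernel_at_least "$(uname -r)" 6 12 || echo "older than the kernel that worked here"
```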

@JN19aban

@jclark So it seems to be fixed in the 6.12.x kernel. Nice.

@lhoward See this https://lwn.net/Articles/859792/

@geerlingguy

Have you raised an issue about this in the Pi kernel repo? They might be willing to backport the fix. Or maybe it will be motivation to get to the next LTS kernel sooner, which I would love for many other reasons!

@jclark
Owner Author

jclark commented Jan 21, 2025

@geerlingguy I posted it on the forum thread where they invited feedback on the 6.12 kernel https://forums.raspberrypi.com/viewtopic.php?t=379745&start=100#p2287594. Hopefully that will put it on their radar and encourage them to move to 6.12 soon.

I don't think the backport would be easy. There's the problem with the MAC timestamper being used. The only fix I see for that is to update the macb driver to the new method for setting hardware timestamps. But when I did that, I got mysterious DMA-related errors. Then there's another problem related to the interrupt register in the bcm_phy_ptp driver, and I have no idea what fixed that.

As of Nov 20th, they were talking about moving to 6.12 in a few months, so I'm hoping it will be soon. But sudo rpi-update next is already pretty easy.

@Waynechen026

Hi @jclark, I have updated to 6.12, but I still cannot use ptp4l on the CM5. The error "timed out while polling for tx timestamp" still occurs. Why is this happening? We are using the TimeProvider TP4100 for time synchronization.

@jclark
Owner Author

jclark commented Feb 12, 2025

@Waynechen026 Are you using the tx_timestamp_timeout option? This is still needed. I usually use 100. If you still get the error, open a new issue with full details of your setup (kernel version, firmware version, logs, hardware etc).
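
The option can also be kept in a ptp4l configuration file instead of on the command line. A minimal sketch, assuming a config file with only a [global] section is sufficient for this setup; the helper write_ptp4l_conf and the file name are made up for illustration.

```shell
# Write a minimal ptp4l config that sets tx_timestamp_timeout to 100.
write_ptp4l_conf() {
    cat > "$1" <<'EOF'
[global]
tx_timestamp_timeout 100
EOF
}

# Usage:
#   write_ptp4l_conf ./ptp4l-cm5.conf
#   sudo ptp4l -f ./ptp4l-cm5.conf -i eth0 -l 6 -m -q
```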
