Sampling skew for IKEA INSPELNING in ElectricalMeasurement class #388

Open
sigurasg opened this issue Mar 4, 2025 · 24 comments · May be fixed by #389

Comments

sigurasg commented Mar 4, 2025

I have a couple of IKEA INSPELNING smart plugs, one of which is hooked up to my espresso machine. When the element is on, it consumes ~1500W, whereas when it's not on, the consumption is <100W. This causes the ac_power_divisor value to toggle between 1 and 10 with the element's duty cycle, which is roughly once every 30s.
This in turn makes it very noticeable that the ac_power_divisor/ac_power_multiplier attributes are not sampled synchronously with the active_power attribute, which frequently produces samples that are off by an order of magnitude. This happens when the wrong divisor/multiplier (divisor in my case) is used to scale the read power value.
The only way to fix this that I can think of is to sample the 2 or 3 attributes that define the instantaneous power measurement in a single ZigBee request in the ElectricalMeasurement class. This will probably also help general performance of HA and the ZigBee network.

This issue may apply more generally to the PolledSensor class, though I don't have other examples of the problem.
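
For reference, the batched read suggested above could look roughly like the sketch below with zigpy (a minimal sketch, assuming cluster is a zigpy handle to the plug's ElectricalMeasurement cluster, 0x0b04; this is not how ZHA currently obtains the values for this device):

# Sketch only: read the raw power and its scaling factors in a single ZCL
# Read Attributes request, so all three values come from the same reply.
async def read_power_snapshot(cluster):
    success, _failure = await cluster.read_attributes(
        ["active_power", "ac_power_multiplier", "ac_power_divisor"]
    )
    # Scale the raw reading with the factors returned in the same reply.
    return (
        success["active_power"]
        * success.get("ac_power_multiplier", 1)
        / success.get("ac_power_divisor", 1)
    )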

puddly (Contributor) commented Mar 4, 2025

Can you record a debug log of the attribute reports?

We set up attribute reporting on the device instead of polling and rely on the device to report changes. Ideally it would report both attributes at once, but it seems it's not able to do this properly.

Is your device running the latest firmware?
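
For context, attribute reporting is configured on the device via ZCL Configure Reporting commands; with zigpy that looks roughly like the sketch below (the intervals and reportable-change values are illustrative only, not ZHA's actual REPORT_CONFIG for this cluster):

# Illustrative sketch: ask the plug to push active_power at most every 5 s,
# at least every 10 min, and whenever it changes by 5 raw units, plus
# divisor changes so rescaling can follow promptly.
async def setup_power_reporting(cluster):
    await cluster.configure_reporting("active_power", 5, 600, 5)
    await cluster.configure_reporting("ac_power_divisor", 5, 600, 1)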

sigurasg (Author) commented Mar 4, 2025

Can you record a debug log of the attribute reports?

If you tell me how, I'm sure I could?

We set up attribute reporting on the device instead of polling and rely on the device to report changes. Ideally it would report both attributes at once but it seems to not be able to do this properly.

Where do I find this code?

Is your device running the latest firmware?

Yes, this is clearly a race.

TheJulianJES (Contributor) commented:

If you tell me how, I'm sure I could?

You can enable debug logs on the integrations page in HA: click the ZHA integration, then "Enable debug logging".
Reproduce the issue; when you deactivate debug logging on that page again, the home-assistant.log file will download.
You can either upload the entire file here if it doesn't contain any personal information, email it to either puddly or me, or just post the relevant sections from the debug logs here.

Where do I find this code?

The IKEA plugs will already match this (notice polling is set to False):

@MULTI_MATCH(
    cluster_handler_names=CLUSTER_HANDLER_ELECTRICAL_MEASUREMENT,
    stop_on_match_group=CLUSTER_HANDLER_ELECTRICAL_MEASUREMENT,
    models={"VZM31-SN", "SP 234", "outletv4", "INSPELNING Smart plug"},
)
class ElectricalMeasurement(PollableSensor):
    """Active power measurement."""

    _use_custom_polling: bool = False

All other plugs will be polled, as they'll match this:

@MULTI_MATCH(
    cluster_handler_names=CLUSTER_HANDLER_ELECTRICAL_MEASUREMENT,
    stop_on_match_group=CLUSTER_HANDLER_ELECTRICAL_MEASUREMENT,
)
class PolledElectricalMeasurement(ElectricalMeasurement):
    """Polled active power measurement."""

    _use_custom_polling: bool = True

sigurasg (Author) commented Mar 4, 2025

If you tell me how, I'm sure I could?

You can enable debug logs on the integration page in HA, click the ZHA integration, then "enable debug logging".

Thanks - will do in the morning.

The IKEA plugs will already match this (notice polling is set to False):

Oh, interesting, so the plug is supposed to push updates regularly? Does the integration specify which attributes should be pushed, or does the device decide?

I guess the other way this could work is if the power attribute is also updated each time a divisor/multiplier update arrives.

sigurasg (Author) commented Mar 5, 2025

OK, it seems that the device is pushing changes, and usually they arrive in the order ac_power_divisor, then active_power, like here:

2025-03-05 07:55:52.710 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: ac_power_divisor=1
2025-03-05 07:55:54.906 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: active_power=1455

but occasionally they arrive swapped, like here:

2025-03-05 07:56:28.254 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: active_power=1052
2025-03-05 07:56:29.804 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: ac_power_divisor=1

which causes an erroneous sample of 105.2W, rather than 1052W.

Would it be reasonable to update the sample value on ac_power_divisor changes as well as on active_power changes?
The only other way I can see to get this more consistent is to consolidate attribute changes by delaying sample updates, or else to explicitly poll the ac_power_divisor and active_power attributes in a batch query.
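
To make the failure mode concrete: the scaled value is just active_power * ac_power_multiplier / ac_power_divisor, so pairing a fresh power reading with a stale divisor shifts the result by exactly a factor of ten (plain Python illustration, no ZHA internals):

def scaled_power(active_power, multiplier, divisor):
    # Scaled value = raw reading * multiplier / divisor.
    return active_power * multiplier / divisor

print(scaled_power(1052, 1, 1))   # 1052.0 W -- divisor paired correctly
print(scaled_power(1052, 1, 10))  # 105.2 W  -- stale divisor from the low-power period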

puddly (Contributor) commented Mar 5, 2025

I was afraid that something like this would be required by the device. It would be best for the device to batch the two into a single attribute report but it seems to sometimes send multiple.

What is the maximum difference you see between the two attribute reports? A second or two is totally manageable but if it's much longer this will unfortunately complicate things.

sigurasg (Author) commented Mar 5, 2025

I was afraid that something like this would be required by the device. It would be best for the device to batch the two into a single attribute report but it seems to sometimes send multiple.

I would say it usually sends individual updates, but sometimes it does batch them:

2025-03-05 07:59:24.560 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: ac_power_divisor=1
2025-03-05 07:59:27.538 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: active_power=15, ac_power_divisor=10
2025-03-05 08:00:35.631 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: active_power=1453, ac_power_divisor=1
2025-03-05 08:00:38.600 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: ac_power_divisor=10

What is the maximum difference you see between the two attribute reports? A second or two is totally manageable but if it's much longer this will unfortunately complicate things.

I don't have a whole lot of data yet, the one sample I was looking at had a delay of 1.55s.
Let me get back to you once I have more data collected.

sigurasg changed the title from "Sampling skew in ElectricalMeasurement class" to "Sampling skew for IKEA INSPELNING in ElectricalMeasurement class" on Mar 5, 2025
sigurasg (Author) commented Mar 5, 2025

Is your device running the latest firmware?

Yes, I have automatic firmware updates enabled; the current firmware version is 0x02040045.

puddly (Contributor) commented Mar 5, 2025

Regarding your second log:

2025-03-05 08:00:35.631 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: active_power=1453, ac_power_divisor=1
2025-03-05 08:00:38.600 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: ac_power_divisor=10

What is the actual power draw of the device when the report is received? Is it 1453W or 145.3W?

sigurasg (Author) commented Mar 5, 2025

Regarding your second log:
What is the actual power draw of the device when the report is received? Is it 1453W or 145.3W?

The element consumes around 1500W when it's on, whereas consumption is otherwise around 1.5W, so here the consumption was 1453W.

Now that I'm staring at this, I wonder if perhaps this might be more or less working as intended. The problem could be that the cycle time of the element is approaching the sampling interval of the power sensor?

Here's an image of the current and power graphs over 30 minutes or so.

[Image: current and power graphs over ~30 minutes]

... time passes ...

The current graph is clearly also missing samples, but it's not nearly as bad as the power graph. It also doesn't have samples that are off by precisely an order of magnitude. It is odd to see that there is one current sample where the power sample is good. Perhaps the firmware is straddling sampling intervals between current and power in those cases?
Current is always reported in mA, by the looks of it.

Apparently the energy metering chip in the INSPELNING is an HLW8110. According to the data sheet, the active power register has these possible update rates:

Active power of channel A, update rate 3.4 Hz, 6.8 Hz, 13.6 Hz, 27.2 Hz

So the sampling rate of the energy metering chip is at least a comfortable factor of 100 faster than the ~30s cycle time of the element. Ideally the firmware would read a consistent set of values at the end of each sampling interval, but maybe that's not what's happening.

IDK, I think if the ac_power_divisor and the active_power attribute changes could be aligned, the power graph would look better for my specific case. I wonder if it might cause errors in the other direction, e.g. where samples would be an order of magnitude too large?

sigurasg (Author) commented Mar 5, 2025

K, so I captured a bunch of data, drilled into the logs and what I found was ... unexpected. The incorrect samples correspond to the cases where active_power and ac_power_divisor are reported in a single packet.

2025-03-05 09:05:33.220 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.aps_data_indication{'status': <Status.SUCCESS: 0>, 'frame_length': 43, 'payload_length': 36, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE: 8>), 'dst_addr': DeconzAddress(address_mode=<AddressMode.NWK: 2>, address=0x0000), 'dst_ep': 1, 'src_addr': DeconzAddress(address_mode=<AddressMode.NWK: 2>, address=0xCFE7), 'src_ep': 1, 'profile_id': 260, 'cluster_id': 2820, 'asdu': b'\x18v\n\x0b\x05)\xb3\x05\x05\x06!\x01\x00', 'reserved1': 0, 'reserved2': 175, 'lqi': 255, 'reserved3': 198, 'reserved4': 241, 'reserved5': 254, 'reserved6': 1, 'rssi': -49} (seq 148)
2025-03-05 09:05:33.220 DEBUG (MainThread) [zigpy.application] Received a packet: ZigbeePacket(timestamp=datetime.datetime(2025, 3, 5, 14, 5, 33, 220852, tzinfo=datetime.timezone.utc), priority=0, src=AddrModeAddress(addr_mode=<AddrMode.NWK: 2>, address=0xCFE7), src_ep=1, dst=AddrModeAddress(addr_mode=<AddrMode.NWK: 2>, address=0x0000), dst_ep=1, source_route=None, extended_timeout=False, tsn=None, profile_id=260, cluster_id=2820, data=Serialized[b'\x18v\n\x0b\x05)\xb3\x05\x05\x06!\x01\x00'], tx_options=<TransmitOptions.NONE: 0>, radius=0, non_member_radius=0, lqi=255, rssi=-49)
2025-03-05 09:05:33.221 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Received ZCL frame: b'\x18v\n\x0b\x05)\xb3\x05\x05\x06!\x01\x00'
2025-03-05 09:05:33.222 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Decoded ZCL frame header: ZCLHeader(frame_control=FrameControl<0x18>(frame_type=<FrameType.GLOBAL_COMMAND: 0>, is_manufacturer_specific=0, direction=<Direction.Server_to_Client: 1>, disable_default_response=1, reserved=0, *is_cluster=False, *is_general=True), tsn=118, command_id=10, *direction=<Direction.Server_to_Client: 1>)
2025-03-05 09:05:33.223 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Decoded ZCL frame: ElectricalMeasurement:Report_Attributes(attribute_reports=[Attribute(attrid=0x050B, value=TypeValue(type=int16s, value=1459)), Attribute(attrid=0x0605, value=TypeValue(type=uint16_t, value=1))])
2025-03-05 09:05:33.223 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Received command 0x0A (TSN 118): Report_Attributes(attribute_reports=[Attribute(attrid=0x050B, value=TypeValue(type=int16s, value=1459)), Attribute(attrid=0x0605, value=TypeValue(type=uint16_t, value=1))])
2025-03-05 09:05:33.224 DEBUG (MainThread) [zigpy.zcl] [0xCFE7:1:0x0b04] Attribute report received: active_power=1459, ac_power_divisor=1
2025-03-05 09:05:33.224 DEBUG (MainThread) [zha.zigbee.cluster_handlers] [0xCFE7:1:0x0b04]: cluster_handler[electrical_measurement] attribute_updated - cluster[Electrical Measurement] attr[active_power] value[1459]
2025-03-05 09:05:33.225 DEBUG (MainThread) [zha.zigbee.cluster_handlers] [0xCFE7:1:0x0b04]: cluster_handler[electrical_measurement] attribute_updated - cluster[Electrical Measurement] attr[ac_power_divisor] value[1]

Note that it seems the active_power attribute update always precedes the ac_power_divisor in the packets, so maybe the power sample is grabbed synchronously with the first attribute update?
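
If that observation holds, one hypothetical mitigation is to apply every attribute carried in a Report_Attributes packet before recomputing the derived value, rather than recomputing on each attribute individually. The names below are illustrative only, not ZHA's actual handler API:

# Sketch only: apply the whole report first, then derive the power once
# from a consistent set of values.
def handle_report(state, attribute_reports):
    for name, value in attribute_reports:  # e.g. [("active_power", 1459), ("ac_power_divisor", 1)]
        state[name] = value
    return (
        state.get("active_power", 0)
        * state.get("ac_power_multiplier", 1)
        / state.get("ac_power_divisor", 1)
    )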

puddly linked a pull request (#389) on Mar 5, 2025 that will close this issue
puddly (Contributor) commented Mar 5, 2025

#389 should fix this problem. I don't currently have a load for my plug that rapidly switches between 0W and 1000W+ so it's a little difficult to test what the correct timeout should be for waiting before emitting a state change.

sigurasg (Author) commented Mar 5, 2025

#389 should fix this problem. I don't currently have a load for my plug that rapidly switches between 0W and 1000W+ so it's a little difficult to test what the correct timeout should be for waiting before emitting a state change.

Thanks!

Also, time to buy an espresso machine, methinks. Mine is an ancient E61 heat exchange model. In addition to testing smart plugs, it also makes great coffee.

puddly (Contributor) commented Mar 5, 2025

A hairdryer worked well enough: 0W -> 1700W at the click of a button!

Give the above PR a try if you have time and let me know if the timing works for you. I've tuned it to a second above what I see with my device:

2025-03-05 13:45:50.651 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: active_power=6

2025-03-05 13:45:51.328 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0702] Attribute report received: current_summ_delivered=57356

2025-03-05 13:45:53.220 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_voltage=1274
2025-03-05 13:45:53.782 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_current=10

# This has no other update and would be emitted alone
2025-03-05 13:45:55.739 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: active_power=5

2025-03-05 13:45:58.238 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_voltage=1275
2025-03-05 13:45:58.754 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_current=10

# These would be batched together (1.1s apart)
2025-03-05 13:46:00.646 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: active_power=819
2025-03-05 13:46:01.722 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: ac_power_divisor=1

2025-03-05 13:46:03.203 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_voltage=1218

2025-03-05 13:46:03.737 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_current=13946

# These would be batched together (0.9s apart)
2025-03-05 13:46:04.700 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: ac_power_divisor=10
2025-03-05 13:46:05.673 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: active_power=2147

2025-03-05 13:46:08.189 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_voltage=1272
2025-03-05 13:46:08.782 DEBUG (MainThread) [zigpy.zcl] [0xD282:1:0x0b04] Attribute report received: rms_current=10

This would effectively delay all reports by 2s.
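
A minimal sketch of that debounce idea (not the actual #389 implementation): restart a short timer on every attribute report and only publish the derived state once reports stop arriving for a settle period.

import asyncio

REPORT_SETTLE_S = 2.0  # illustrative; matches the ~2 s delay discussed above

class DebouncedPower:
    """Collect attribute reports and emit one scaled value after they settle."""

    def __init__(self, publish):
        self._publish = publish  # callback that receives the final value in W
        self._state = {}
        self._timer = None

    def attribute_updated(self, name, value):
        self._state[name] = value
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.get_running_loop().call_later(
            REPORT_SETTLE_S, self._emit
        )

    def _emit(self):
        self._timer = None
        self._publish(
            self._state.get("active_power", 0)
            * self._state.get("ac_power_multiplier", 1)
            / self._state.get("ac_power_divisor", 1)
        )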

Unrelated, but maybe we should change the reporting configuration for rms_voltage; it's somewhat noisy (at least where I live) and reports constantly, even for plugs that aren't in use.

sigurasg (Author) commented Mar 5, 2025

So I captured nearly 2 hours of reports, and scrubbing through them, I believe I never see the ac_power_divisor out of order with active_power except when they're reported together. I'm going to scrub through the data with a script to make sure, but here are the attribute updates I see.

This is obviously only for the INSPELNING; I don't know anything about other smart plugs, alas.

I see a lot of

0:00:20.787000	rms_voltage=1280
0:00:21.190000	rms_current=19
0:00:21.219000	current_summ_delivered=218360
0:00:21.657000	active_power=13
0:00:25.546000	ac_power_divisor=1
0:00:25.763000	rms_voltage=1259
0:00:26.162000	rms_current=11628
0:00:26.632000	active_power=1457
0:00:28.510000	ac_power_divisor=10

but occasionally there are "spurious" divisor updates, like this one:

0:00:59.595000	ac_power_divisor=1
0:01:00.696000	rms_voltage=1258
0:01:02.564000	ac_power_divisor=10
0:01:02.850000	rms_current=2378
0:01:03.660000	active_power=27

puddly (Contributor) commented Mar 5, 2025

ac_power_divisor changing without a corresponding active_power is a little strange.

sigurasg (Author) commented Mar 5, 2025

ac_power_divisor changing without a corresponding active_power is a little strange.

I'll take your word for it. I don't know the first thing about ZigBee. Does it guarantee delivery? Does it guarantee in-order delivery?

puddly (Contributor) commented Mar 5, 2025

Does it guarantee delivery? Does it guarantee in-order delivery?

No for both 😄. The only guarantee we have (somewhat) is that the same packet will not be received twice.

From your log:

0:00:25.546000	ac_power_divisor=1
0:00:26.632000	active_power=1457
0:00:28.510000	ac_power_divisor=10
0:00:31.627000	active_power=13

This sort of timing is not something I accounted for and requires a slight rewrite of the PR, since it may be possible for two ac_power_divisor reports to come in and for the more recent divisor to be used with the older active_power.

sigurasg (Author) commented Mar 5, 2025

Does it guarantee delivery? Does it guarantee in-order delivery?

No for both 😄. The only guarantee we have (somewhat) is that the same packet will not be received twice.

Ok, then there is no guarantee that a property derived from more than one attribute will be correct, unless all the attributes are reported in a single packet.
I still believe that my log has all the attribute updates in-order, save for the aggregate ones. In that case a zero timeout will probably work to re-dispatch the sample acquisition to the event loop?

At least the current_summ_delivered will monotonically creep forward over time, even in the face of packet loss or reordering.

sigurasg (Author) commented Mar 6, 2025

Does it guarantee delivery? Does it guarantee in-order delivery?

No for both 😄. The only guarantee we have (somewhat) is that the same packet will not be received twice.

So, I dove into this rathole, and it seems the protocol has everything needed to provide in-order and guaranteed delivery (as much as any network protocol does), namely frame sequence numbers and confirmations. I couldn't quickly find any information on whether the standard mandates this for e.g. commands and attribute updates, but it would (IMHO) be odd if it didn't?

puddly (Contributor) commented Mar 6, 2025

There are two layers of ACKs, and we already send the ones we can; the others are handled by the firmware. There's no way to actually signal to a device that a packet has been missed unfortunately.

Reordering is something that I didn't really consider, however! Looking at my log, roughly 20 of the last 50,000 packets were received out of order. For the vast majority of devices this has no impact because we match requests to responses via their TSN but I think we could use this information to detect packet loss, do some slight reordering, or inform entities like this one of missed updates.
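
For illustration, a reordering check on the 8-bit sequence counter could look like the sketch below (assuming consecutive reports from a device carry consecutive TSNs modulo 256; this is not existing zigpy code):

def is_out_of_order(last_tsn, new_tsn):
    # Sketch only: treat the 8-bit TSN as a wrapping counter and flag a
    # packet whose TSN is at or "behind" the last one seen from the device.
    if last_tsn is None:
        return False
    delta = (new_tsn - last_tsn) % 256
    return delta == 0 or delta > 128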

sigurasg (Author) commented Mar 6, 2025

There are two layers of ACKs, and we already send the ones we can; the others are handled by the firmware.

I don't suppose you could point me towards the code in question? I'm academically curious about how all of this works now 😃 .

There's no way to actually signal to a device that a packet has been missed unfortunately.

Right, I guess ultimately the only reasonable way to remedy a lost command or attribute update is to start the device over and rebuild all the state?

Reordering is something that I didn't really consider, however! Looking at my log, roughly 20 of the last 50,000 packets were received out of order. For the vast majority of devices this has no impact because we match requests to responses via their TSN but I think we could use this information to detect packet loss, do some slight reordering, or inform entities like this one of missed updates.

Interesting. I have a ConBee II and I just skimmed through the serial protocol docs. Somehow I assumed the gateway would do a lot more than what it does. Seems it just passes valid frames right up the chain. I guess a lot of my questions are answered in the logs I already captured...
My network is teeny, just a couple of thermostats and some smart plugs. Looks like everything is a router, though, so maybe multipath routing occurs here.

puddly (Contributor) commented Mar 6, 2025

I don't suppose you could point me towards the code in question?

Default responses (NWK-level ACKs) are here: https://github.com/zigpy/zigpy/blob/c54c8876e5b1f3fc1ff8a71ba57c25c8bd17436a/zigpy/zcl/__init__.py#L467-L504. Attribute reports set the disable_default_response bit so they aren't sent one.

sigurasg (Author) commented Mar 6, 2025

Default responses (NWK-level ACKs) are here: https://github.com/zigpy/zigpy/blob/c54c8876e5b1f3fc1ff8a71ba57c25c8bd17436a/zigpy/zcl/__init__.py#L467-L504. Attribute reports set the disable_default_response bit so they aren't sent one.

Thanks!
