Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pico W disassociates from AP under some power modes but fails to detect it #2153

Open
edmcman opened this issue Dec 29, 2024 · 10 comments
Open
Assignees

Comments

@edmcman
Copy link

edmcman commented Dec 29, 2024

The high-level problem is that on my AP, under some power modes, when running a http server, the RPI eventually disconnects and becomes unreachable. It never recovers when this happens. I have reproduced this a lot using micropython, and more recently using this example from the pico-examples repository.

Port, board and/or hardware

RP2040 on Raspberry Pi Pico W

Reproduction

Expected behaviour

The test.bash script tests whether the web server remains accessible after progressively longer periods of times with no requests. It is expected that the web server will remain available and the script will print "Test succeeded".

Observed behaviour

When run 20 times with a limit of 2048 second periods, 8 out of the 20 runs failed, meaning that the web server became inaccessible. Here is a breakdown of the power modes during failures:

fgrep -l -R Fail *.log | xargs fgrep Random | awk '{print $4}' | sort | uniq -c
      6 network.WLAN.PM_POWERSAVE
      2 None

And the overall distribution of tested pm values:

      2 0xa11140
      8 network.WLAN.PM_NONE
      1 network.WLAN.PM_PERFORMANCE
      7 network.WLAN.PM_POWERSAVE
      3 None

I am not sure why they are so uneven.

Here is a zip file of my collected logs:
logs-2048.zip

Other Information

When the Pico W becomes inaccessible, the logs contain the following messages repeated very often:

[ 2509685] ASYNC(0000,49,0,0,0)

I believe the 49 message indicates crypto verification failure. Under normal operation, it appears to be correlated with PSK_SUP messages. But as soon as it starts repeating by itself the Pico W is offline.

Also, an asynchronous task that calls WLAN.isconnected() and WLAN.status() shows that the Pico W is connected, even though it does not seem to be:

[snip]
[ 3505634] ASYNC(0000,49,0,0,0)
[ 3507681] ASYNC(0000,49,0,0,0)
[ 3508397] ASYNC(0000,49,0,0,0)
Still checking Wi-Fi connection...
Debug: connected: True status: 3
Actual PM: 0xa11142

High level problem

I think that under some power management modes, the Pico is missing rekeying and disassociate messages. The only way to detect that this situation has happened is to detect crypto failures and rejoin the network. I believe the 49 messages are crypto failures.

One way forward is to respond to crypto failures by rejoining the network, like most other wifi implementations. I did file an issue with the cyw43-driver, since it seems to me like the driver ought to be handling this.

Another way forward might be to change the default power mode in pico-sdk. Here, most of my testing is micropython-centric. I've found the infinite disconnection to happen during both the micro-python default power mode, and the power mode setting of 0xa11140 suggested in the Connecting to the Internet document.

Outside of micropython, the default power mode for the pico_w httpd example disconnects. I think this uses PM_PERFORMANCE from perusing the source code, but I am not sure.

Happy to help in any way that I can. I have enough information that I can simply work around the problem, but I'd prefer to see it fixed so that the next person doesn't have to deal with this.

@peterharperuk peterharperuk self-assigned this Jan 2, 2025
@peterharperuk
Copy link
Contributor

A search suggests "49" is WLC_E_ICV_ERROR, which I would only expect to see if using WEP. So that's a bit odd.

@edmcman
Copy link
Author

edmcman commented Jan 2, 2025

There is definitely no WEP...

Here is a packet capture: https://github.com/edmcman/rpi-temp-ser ... fice2.pcap

Some events:

  1. Association happens at t=4282, the Association ID is 2.
  2. The Pico last responds to a request at t=5332
  3. Disassociation happens at t=6087s.

My interpretation is that my AP is buggy and did not tell the Pico to wake up when it should have via TIM field in beacons. (Oddly it does earlier in the pcap) I think the main Pico bug here is that the Pico should eventually detect that it is disconnected.

@peterharperuk
Copy link
Contributor

My interpretation is that my AP is buggy

You might be right. I will do some testing to see if I can reproduce any issue. But I would guess it's an interop problem. If I could reproduce it I would try editing cyw43_cb_process_async_event to handle the event. You could give this a go if you're feeling keen?

Either just disconnect...

cyw43_cb_tcpip_set_link_down(self, CYW43_ITF_STA);
self->wifi_join_state = 0x0000;

or see if reconnecting works (probably a long shot?).

self->pend_rejoin = true;
self->pend_rejoin_wpa = false;
cyw43_schedule_internal_poll_dispatch(cyw43_poll_func);

@dpgeorge
Copy link

dpgeorge commented Jan 3, 2025

My interpretation is that my AP is buggy and did not tell the Pico to wake up

That would also be my suspicion, that the AP does not work properly with STAs that are sleeping and a group rekey event, eg it's not waiting for the correct DTIM interval to send the rekey to the Pico W.

@edmcman
Copy link
Author

edmcman commented Jan 3, 2025

Thanks @peterharperuk I will give this a shot. I just left for vacation, so I don't have physical access, but I do have remote access to the machine the Pico is plugged into. So I'll see if I can update the firmware remotely.

@edmcman
Copy link
Author

edmcman commented Jan 3, 2025

Great, I have it running with the modified firmware. I started with the "long shot".

@edmcman
Copy link
Author

edmcman commented Jan 3, 2025

First run, and it triggered the bug and appeared to work!

Debug: connected: True status: 3
Actual PM: 0xa11142
Still checking Wi-Fi connection...
Debug: connected: True status: 3
Actual PM: 0xa11142
[ 1567277] ASYNC(0000,49,0,0,0)
[ 1567788] ASYNC(0000,49,0,0,0)
[ 1568093] ASYNC(0000,49,0,0,0)
[ 1568734] ASYNC(0000,ASSOC_REQ_IE,0,0,0)
[ 1568735] ASYNC(0000,AUTH,0,0,0)
[ 1568739] ASYNC(0000,ASSOC_RESP_IE,0,0,0)
[ 1568739] ASYNC(0000,ASSOC,0,0,0)
[ 1568740] ASYNC(0001,LINK,0,0,0)
[ 1568753] ASYNC(0000,PSK_SUP,6,0,0)
[ 1568773] ASYNC(0000,JOIN,0,0,0)
[ 1568773] ASYNC(0000,SET_SSID,0,0,0)
Still checking Wi-Fi connection...
Debug: connected: True status: 3
Actual PM: 0xa11142

I am continuing to run more tests.

@edmcman
Copy link
Author

edmcman commented Jan 5, 2025

So far I've run 45 tests, and none have failed. Previously, about 1/3 tests would fail. So I think it's safe to say this fixes the problem.

Here is the (very minimal) patch against cyw43-driver that I've been using:

diff --git a/src/cyw43_ctrl.c b/src/cyw43_ctrl.c
index a9db363..ad4a0ad 100644
--- a/src/cyw43_ctrl.c
+++ b/src/cyw43_ctrl.c
@@ -389,6 +389,10 @@ void cyw43_cb_process_async_event(void *cb_data, const cyw43_async_event_t *ev)
             // Cannot authenticate
             self->wifi_join_state = WIFI_JOIN_STATE_BADAUTH;
         }
+    } else if (ev->event_type == 49) {
+        self->pend_rejoin = true;
+        self->pend_rejoin_wpa = false;
+        cyw43_schedule_internal_poll_dispatch(cyw43_poll_func);
     } else if (ev->event_type == CYW43_EV_DEAUTH_IND) {
         if (ev->status == 0 && ev->reason == 2) {
             // Deauth, probably because password was wrong; disassociate

@peterharperuk
Copy link
Contributor

Nice. Do you want to submit a PR over here https://github.com/georgerobotics/cyw43-driver ?
We'd then update the sdk to include it.

@edmcman
Copy link
Author

edmcman commented Jan 6, 2025

PR started: georgerobotics/cyw43-driver#130

@dpgeorge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants