It was brought to our attention on Thursday (2023-06-08) that organic traffic had dropped off significantly on some Flux sites. The drop-off started around the time of the proxy switchover (2023-05-25), and traffic had been declining at a steady rate since then.
All the affected sites appeared to be the WP Engine (WPE) hosted ones that had recently been changed to be accessed via subdomain by the proxy, i.e. a blog site such as https://www.adrianflux.co.uk/cult-classics/ is requested by the proxy as https://cult-classics.adrianflux.co.uk and the response is passed back under the original URL.
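For illustration, a minimal sketch of the kind of path-to-subdomain mapping the proxy performs (the mapping table and function below are hypothetical, not the actual proxy code):

```python
# Hypothetical sketch of the path-to-subdomain mapping the proxy performs.
# The mapping table and function names are illustrative, not the real proxy code.
from urllib.parse import urlsplit, urlunsplit

PATH_TO_ORIGIN = {
    "/cult-classics/": "cult-classics.adrianflux.co.uk",  # WPE-hosted blog
}

def to_origin_url(public_url: str) -> str:
    """Rewrite a public www URL to the subdomain the proxy actually requests."""
    parts = urlsplit(public_url)
    for prefix, origin_host in PATH_TO_ORIGIN.items():
        if parts.path.startswith(prefix):
            # Strip the blog prefix but keep the rest of the path.
            new_path = parts.path[len(prefix) - 1:] or "/"
            return urlunsplit((parts.scheme, origin_host, new_path, parts.query, ""))
    return public_url  # not a proxied blog; pass through unchanged

print(to_origin_url("https://www.adrianflux.co.uk/cult-classics/some-post/"))
# -> https://cult-classics.adrianflux.co.uk/some-post/
```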
This proxying arrangement is complex: the proxy has to make sure that all headers are set correctly, and that any links and assets in the page responses are returned under the original URL rather than the proxied one. None of the affected sites posed any problem for a regular user visiting them, but they seemed to be inaccessible to Google(bot). This was further confirmed with tests in various browsers presenting a host of user agents from various locations around the world. Automated checks were also written, and these too were receiving 200 responses, which suggested the sites were behaving normally. The only explanation for the drop in traffic was therefore not that visitors could not reach the sites (they worked), but that Google could not crawl and record them.
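The automated checks were along these lines; a simplified sketch (the URL list and user-agent string here are illustrative, not the actual check script):

```python
# Simplified sketch of the automated availability checks (illustrative only).
# Each site is fetched with a Googlebot-style user agent and the status is reported.
import requests

SITES = ["https://www.adrianflux.co.uk/cult-classics/"]  # example; the real list was longer
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

for url in SITES:
    resp = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA}, timeout=10)
    print(url, resp.status_code)  # these checks consistently returned 200
```

In hindsight, a 200 from a check like this only proves the page is served to a client claiming to be Googlebot; it cannot reproduce a check on the requester's source IP, which is what ultimately turned out to matter.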
The initial thought was that the Link headers might be to blame. They are removed by the proxy because they contain the incorrect (proxied) domains. This was eliminated as a cause, though, as some of the sites that did not experience a drop in traffic did not provide these headers anyway. The hardest part of trying to solve the issue was recreating the error, so that changes could be applied and confirmed as the fix.
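For context, WordPress normally emits Link headers (for REST API discovery, for example), and on the proxied sites these reference the subdomain rather than the public www URL, which is why the proxy drops them. A hypothetical sketch of that filtering, not the proxy's actual code:

```python
# Hypothetical sketch of why the proxy drops the Link headers.
# WordPress emits something like:
#   Link: <https://cult-classics.adrianflux.co.uk/wp-json/>; rel="https://api.w.org/"
# i.e. it references the proxied subdomain, not the public www URL.
PROXIED_HOSTS = ("cult-classics.adrianflux.co.uk",)

def filter_link_header(headers: dict) -> dict:
    """Remove the Link header when it leaks a proxied (subdomain) host."""
    link = headers.get("Link", "")
    if any(host in link for host in PROXIED_HOSTS):
        headers = {k: v for k, v in headers.items() if k != "Link"}
    return headers
```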
Google Search Console was identified as the tool to help diagnose the issue, as by this point we were certain it was a problem with Googlebot crawling the sites. Once the adrianflux.co.uk domain was verified for us (via DNS), it presented a much clearer picture of the issue.
Googlebot seemed to be getting 403 Forbidden responses to its requests. Every part of the proxy was checked, and it was concluded that these responses were not coming from the proxy server itself but from some other part of the (complex) chain. This could then be validated on demand by submitting a sitemap and seeing whether Googlebot could read it. "Couldn't fetch" responses were being returned, confirming that the 403s were happening for these sites. The ability to check the status on demand like this was a great help in the eventual resolution, as we could immediately confirm whether any changes we made were fixing the issue.
WPE was contacted and we worked with them to diagnose the issue on a nominal site (Cult Classics); it was again concluded that the issue was not in that part of the chain. The only logical part remaining was the edge proxy, Cloudflare (CF). As the sites were on the subdomain, the requests were being passed back through CF before the response was returned. The chain is something like:
Browser (request) > Cloudflare > Proxy > Cloudflare > WP Engine > Proxy > Cloudflare > Browser (response).
The 403 responses suggested a firewalling issue in Cloudflare. Initial thoughts were that the Access gateway on the admin back-ends was somehow responsible, so all of these were disabled. Sitemaps were resubmitted, but we were still seeing the "Couldn't fetch" responses.
All of our firewall rules were analysed against the event log, and there did not seem to be any block responses from them. Next, the standard CF managed rules were looked at (there are hundreds), and it turned out that some rules were issuing a block response. The offending rules were:
100201 Anomaly:Header:User-Agent - Fake Google Bot
100201_2 Anomaly:Header:User-Agent - Fake Google Bot
100202 Anomaly:Header:User-Agent - Fake Bing or MSN Bot
100202_2 Anomaly:Header:User-Agent - Fake Bing or MSN Bot
100203 Anomaly:Header:User-Agent - Fake Yandex Bot
100203_2 Anomaly:Header:User-Agent - Fake Yandex Bot
As far as CF was concerned, the legitimate Googlebot was being flagged as a fake one. Analysing the event log showed this was because requests were going via the proxy, so Googlebot appeared to come from an IP address that was not on the known list of Googlebot IP addresses. As such, the CF firewall was blocking the requests and responding with a 403. These rules were disabled and the sitemaps resubmitted, which resulted in the expected "Success" responses.
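To illustrate why the proxied requests looked fake: Google documents that a genuine Googlebot can be verified by a reverse DNS lookup on the requesting IP (which should resolve under googlebot.com or google.com) followed by a forward lookup confirming the same IP. Cloudflare's managed rule logic is its own, but the principle is the same, and a request arriving from the proxy's IP fails any such check. A minimal sketch (the proxy IP below is a placeholder):

```python
# Minimal sketch of forward-confirmed reverse DNS verification of Googlebot.
# Requests relayed by the proxy carry the proxy's IP, so a check like this fails.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]   # forward-confirm
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))   # an address in Google's published crawler range
print(is_real_googlebot("203.0.113.10"))  # placeholder proxy IP -> False
```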
Bypassing these rules resulted in a very quick same-day fix. However, they will be enabled again once we configure other parts of the firewall to make sure that proxied Googlebot requests are not blocked by them. We can use sitemap submissions as a way of confirming that anything we do will not affect things in the future.
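One plausible shape for that firewall configuration is to check the original client IP (which the proxy would need to forward, e.g. in a header) against the crawler ranges Google publishes, and only exempt those requests from the fake-bot rules. A rough sketch under those assumptions; the googlebot.json URL is the one Google documents for its crawler ranges, but verify it before relying on it:

```python
# Rough sketch: identify genuine Googlebot clients via Google's published IP ranges,
# so proxied requests from those clients can be exempted from the fake-bot rules.
# Assumes the proxy forwards the original client IP; the URL is Google's published list.
import ipaddress
import requests

GOOGLEBOT_RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"

def load_googlebot_networks():
    data = requests.get(GOOGLEBOT_RANGES_URL, timeout=10).json()
    networks = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_googlebot_ip(client_ip: str, networks) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

networks = load_googlebot_networks()
print(is_googlebot_ip("66.249.66.1", networks))   # expected True
print(is_googlebot_ip("203.0.113.10", networks))  # non-Google client -> False
```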
It would probably also be a good idea for the SEO team to proactively monitor failed page crawls and alert us to issues like this. These failures would have started coming through immediately at the end of May, but they were not flagged until the organic traffic had dropped off a couple of weeks later. Had they been, all of this work would have happened in the days following the proxy switch, and the drop would probably not have been as large as the one recorded.