This is an open letter to Facebook because their feedback system is very limited (no room for detail) and I’d rather have a public place I can link to with full details. If you work for Facebook, please tell me who I can talk to that has the power to change infrastructure details like this. If you don’t work for Facebook but know a good place I can share this important information, please point me in the right direction.
Facebook web scraper uses empty user agent sometimes, so our server blocks it!
When one server requests a file (like a web page or image) from another server (like Facebook does when it fetches content from our site for use in sharing previews), it normally sends a user agent string that identifies it. Browsers use their name and various version numbers, and good web applications do something similar (WordPress identifies itself, its version and the home URL of the WP site making the request).
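As a rough illustration of what "identifying yourself" means in practice, here is a minimal Python sketch of a fetcher that attaches a user agent to every request it builds. The bot name, the info URL, and the `build_request` helper are made up for this example; this is not how Facebook or WordPress do it.

```python
from urllib.request import Request

# Hypothetical bot name and info URL, used only for illustration.
USER_AGENT = "example-fetcher/1.0 (+https://example.org/bot)"


def build_request(url: str) -> Request:
    """Build a GET request that always carries an identifying User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://example.org/2016/10/26/52981/")
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Routing every outgoing request through one helper like this is the simplest way to make sure none of them goes out anonymous.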
Facebook uses several sensible user agents, like “facebookexternalhit/1.1”, which leads to information about Facebook’s crawler when I search for it on Google.
18.104.22.168 - - [26/Oct/2016:10:04:26 +0000] "GET /2016/10/26/52981/ HTTP/1.1" 206 0.001 36410 "-" "facebookexternalhit/1.1"
You can read about these in the Facebook Crawler documentation.
That’s great, but Facebook ALSO sends requests with no user agent at all!
22.214.171.124 - - [26/Oct/2016:17:56:54 +0000] "GET /2015/11/7731?utm_source=Global+Voices&utm_medium=facebook HTTP/1.1" 200 2.063 28007 "-" "-"
This is terrible. I can’t search for information about the crawler, nor can I even tell it’s Facebook (except that in this case, it shares the IP with the previous request). The Facebook documentation linked above doesn’t mention empty user agent strings, which makes this even worse: following their instructions to whitelist their UAs wouldn’t fix my problem.
Empty user agents are almost always a sign of sloppy programming and usually come from spam/DDoS/garbage bot traffic that is worthless and can be safely discarded. Blocking requests with empty user agents is extremely effective at reducing server load without affecting normal users, because every real browser sends a user agent, and all responsible servers identify themselves when making requests.
We use a server app called fail2ban which scans our logs for patterns and bans IP addresses based on these patterns. Banning IPs that make requests without a user agent is one of our most important filters. NOW THAT FILTER IS BLOCKING FACEBOOK IP ADDRESSES!
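To make the idea concrete, here is a small Python sketch of the kind of check such a filter performs. It is not our actual fail2ban filter; the regular expression assumes the log layout of the two sample lines quoted in this letter, and the function names are invented for the example.

```python
import re

# Assumed log layout, inferred from the two sample lines in this letter:
# ip - - [time] "request" status seconds bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def has_empty_user_agent(line: str) -> bool:
    """True if the line parses and its user agent field is empty or "-"."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return False  # lines that don't fit the assumed layout are ignored
    return match.group("agent") in ("", "-")


def ips_with_empty_user_agent(lines) -> set:
    """Collect the client IPs of all requests that sent no user agent."""
    ips = set()
    for line in lines:
        if has_empty_user_agent(line):
            ips.add(LOG_LINE.match(line.strip()).group("ip"))
    return ips


# The two log lines from this letter, copied as they appear above.
SAMPLE = [
    '18.104.22.168 - - [26/Oct/2016:10:04:26 +0000] "GET /2016/10/26/52981/ HTTP/1.1" 206 0.001 36410 "-" "facebookexternalhit/1.1"',
    '22.214.171.124 - - [26/Oct/2016:17:56:54 +0000] "GET /2015/11/7731?utm_source=Global+Voices&utm_medium=facebook HTTP/1.1" 200 2.063 28007 "-" "-"',
]

print(ips_with_empty_user_agent(SAMPLE))
```

Only the second sample line is flagged, because its last quoted field is "-" rather than a real user agent.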
This is not what Facebook should want. It helps no one, and it means our Facebook integration is subject to random failures.
Notably, the two examples above both came from the same FB IP address, so once fail2ban bans it because of the second request, any further requests like the first one (valid because they have a UA) will fail immediately.
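Here is a simplified Python sketch of that sequence of events. It assumes a ban on the very first offending request that never expires, whereas real fail2ban setups use retry counts and ban times; the IP address (from the documentation range) and the `replay` function are made up for the example.

```python
def replay(requests):
    """Replay (ip, user_agent) pairs in order and record what happens to each."""
    banned = set()
    results = []
    for ip, agent in requests:
        if ip in banned:
            outcome = "blocked"
        elif agent in ("", "-"):
            # The log is scanned after the response is sent, so this
            # request is still served before the ban takes effect.
            banned.add(ip)
            outcome = "served, then banned"
        else:
            outcome = "served"
        results.append((ip, agent, outcome))
    return results


# Hypothetical traffic from one crawler IP: good UA, empty UA, good UA again.
TRAFFIC = [
    ("203.0.113.7", "facebookexternalhit/1.1"),
    ("203.0.113.7", "-"),
    ("203.0.113.7", "facebookexternalhit/1.1"),
]

for ip, agent, outcome in replay(TRAFFIC):
    print(ip, repr(agent), outcome)
```

The third request is identical to the first, yet it is blocked purely because of the anonymous request that came in between.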
FACEBOOK PLEASE FIX! MAKE SURE ALL REQUESTS COMING FROM FACEBOOK HAVE A USER AGENT! THIS SHOULD NOT BE HARD!