This is an open letter to Facebook because their feedback system is very limited (no room for detail) and I’d rather have a public place I can link to with full details. If you work for Facebook, please tell me who I can talk to that has the power to change infrastructure details like this. If you don’t work for Facebook but know a good place I can share this important information, please point me in the right direction.
EDIT 2018-02: Somehow this bug persists! Despite multiple attempts by Facebook to fix it, multiple sysadmins are still seeing FB bots visit with no user agent. Even worse, that same bot seems to have a new bug where it sends dozens or hundreds of requests in a short period, effectively mounting a DDoS attack with no UA accountability! Obviously this is totally unacceptable, and it’s time for Facebook to do whatever it takes to fix this bug. You can read my most recent ticket on the “Facebook for Developers: Support” bug tracker (and leave a comment saying how you believe this needs to be fixed) here: Facebook HTTP media fetcher has no User Agent
EDIT 2017-08: In the end I DID find the correct place to post a technical bug to Facebook, and created a ticket about this problem with the empty user agents. I got a decent reply from them where they accepted that the problem is real and stated their intention to fix it in February 2017. As of August 2017, they claim to have fixed the problem, but unfortunately I was able to find more examples of the problem in my logs after they closed the ticket as fixed. Hopefully it will start working soon, and I’ll update this again. Below is the original post, which assumes the problem persists.
Facebook web scraper uses empty user agent sometimes, so our server blocks it!
When one server requests a file (like a web page or image) from another server (as Facebook does when it fetches content from our site to build a sharing preview), it should always send a user agent that identifies it. Browsers send their name and various version numbers, and good web applications do something similar (WordPress identifies itself, its version, and the home URL of the WP site making the request).
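For illustration, here is what setting an identifying user agent looks like in code; this is a sketch using Python's standard library, and the URL and WordPress-style UA string are made up, not taken from any real site:

```python
import urllib.request

# A well-behaved client sets an identifying User-Agent header before
# making a request (the version number and site URL here are invented
# for illustration, in the style WordPress uses).
req = urllib.request.Request(
    "https://example.com/some-page/",
    headers={"User-Agent": "WordPress/6.4; https://example.com"},
)

# urllib normalizes stored header names to "User-agent" internally.
print(req.get_header("User-agent"))  # WordPress/6.4; https://example.com
```

Any HTTP client library lets you do the equivalent in one line; there is no technical excuse for omitting it.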
Facebook uses several sensible user agents, like “facebookexternalhit/1.1”, which identifies the crawler and turns up information about Facebook crawling when I search for it in Google.
126.96.36.199 - - [26/Oct/2016:10:04:26 +0000] "GET /2016/10/26/52981/ HTTP/1.1" 206 0.001 36410 "-" "facebookexternalhit/1.1"
You can read about these in the Facebook Crawler documentation.
That’s great, but Facebook ALSO sends requests with no user agent at all!
188.8.131.52 - - [26/Oct/2016:17:56:54 +0000] "GET /2015/11/7731?utm_source=Global+Voices&utm_medium=facebook HTTP/1.1" 200 2.063 28007 "-" "-"
This is terrible. I can’t search for information about the crawler, nor can I even tell it’s Facebook (except that, in this case, it shares an IP with the previous request). The Facebook Crawler documentation linked above makes no mention of empty user agent strings, which makes this even worse: following their instructions to whitelist their UAs wouldn’t fix my problem.
Empty user agents are almost always a sign of sloppy programming and usually come from spam/DDoS/garbage bot traffic that is worthless and can be safely discarded. Blocking requests with empty user agents is extremely effective at reducing server load without affecting normal users, because every real person’s browser sends a user agent, and all responsible servers identify themselves when making requests.
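The decision logic behind this kind of blocking is trivial; a minimal sketch (in Python, purely for illustration):

```python
def should_block(user_agent: str) -> bool:
    """Treat an empty or missing User-Agent as junk traffic worth blocking.

    "-" is how nginx/Apache-style access logs render a missing header,
    so a filter reading logs has to treat it the same as an empty string.
    """
    return user_agent.strip() in ("", "-")

print(should_block("facebookexternalhit/1.1"))  # False
print(should_block("-"))                        # True
```

One boolean check, and a huge class of garbage traffic disappears; the only collateral damage should be misbehaving bots.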
We use a server tool called fail2ban, which scans our logs for patterns and bans the IP addresses that match them. Banning IPs that make requests without a user agent is one of our most important filters. NOW THAT FILTER IS BLOCKING FACEBOOK IP ADDRESSES!
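fail2ban works by matching log lines against a `failregex`, with the special `<HOST>` tag capturing the client IP to ban. A minimal sketch of such a filter, assuming the nginx-style log format shown in the examples above (the filename and exact regex are illustrative, not our production config, and should be verified against your own log format with `fail2ban-regex`):

```ini
# /etc/fail2ban/filter.d/nginx-noagent.conf  (hypothetical filter name)
[Definition]
# Ban any client whose log line ends with an empty ("-") user agent,
# regardless of referrer. <HOST> captures the IP address to ban.
failregex = ^<HOST> -.*" \d+ [\d.]+ \d+ "[^"]*" "-"\s*$
ignoreregex =
```

A jail in `jail.local` then points this filter at the web server’s access log and bans matching IPs for a configured period.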
This is not what Facebook should want; it helps no one, and it means our Facebook integration is subject to random failures.
Notably, the two example requests above came from the same FB IP address, so once fail2ban bans that IP because of the second request, any further requests like the first one (valid, because they have a UA) will fail immediately.
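To make the failure mode concrete, here is a sketch (a hypothetical helper, assuming the log format shown above) that extracts the client IP from any line with an empty user agent, which is exactly the signal a fail2ban filter acts on:

```python
import re

# Parse the nginx-style log format used in the examples above:
# IP - - [time] "request" status reqtime bytes "referrer" "user-agent"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ [\d.]+ \d+ "[^"]*" "([^"]*)"\s*$'
)

def empty_ua_ip(line: str):
    """Return the client IP if the line has an empty ('-') user agent, else None."""
    m = LOG_RE.match(line)
    if m and m.group(2) in ("", "-"):
        return m.group(1)
    return None

good = ('126.96.36.199 - - [26/Oct/2016:10:04:26 +0000] '
        '"GET /2016/10/26/52981/ HTTP/1.1" 206 0.001 36410 '
        '"-" "facebookexternalhit/1.1"')
bad = ('188.8.131.52 - - [26/Oct/2016:17:56:54 +0000] '
       '"GET /2015/11/7731?utm_source=Global+Voices&utm_medium=facebook '
       'HTTP/1.1" 200 2.063 28007 "-" "-"')

print(empty_ua_ip(good))  # None
print(empty_ua_ip(bad))   # 188.8.131.52
```

Once the second line triggers a ban, every subsequent request from that IP is refused, including the well-formed `facebookexternalhit/1.1` ones.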
FACEBOOK PLEASE FIX! MAKE SURE ALL REQUESTS COMING FROM FACEBOOK HAVE A USER AGENT! THIS SHOULD NOT BE HARD!