I Hate Screen Scraping
After coding up a quick skeleton, I was surprised to see that none of my initial tests were working – it was almost as if the data wasn’t even there. Checking out the rendered page in Firefox revealed the problem – none of the links I needed to scrape were present. At all. So, I hit up the Yahoo source code to verify that the links were in the original. Yup, right there. Weird!
I tried out a variety of my favorite scraping tools – scRUBYt!, WWW::Mechanize, and even good ol’ cURL. None of these tools could acquire HTML source from the Yahoo server with links intact, even when I provided a valid Firefox User-Agent string.
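A quick way to confirm this kind of failure is to count the anchors in whatever a tool hands back. Here is a minimal Python sketch using only the standard library. `LinkCollector` and `extract_links` are names I made up for illustration, and this is not what any of the tools above do internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all anchors in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Run it once on the page saved from the browser and once on the page a scraping tool fetched. If the second list comes back empty, the links were stripped before the tool ever saw them.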
Next, I dropped down to an even lower level – sniffing packets with my favorite sniffer, Packetyzer, and sending HTTP/1.1 requests directly via telnet.
rich@redbuntu:~$ telnet shopping.yahoo.com 80
Trying 220.127.116.11...
Connected to pdb3.shop.yahoo.akadns.net.
Escape character is '^]'.
GET / HTTP/1.1
Host:shopping.yahoo.com

HTTP/1.1 200 OK
Date: Tue, 15 Apr 2008 17:48:16 GMT
P3P: policyref="http://p3p.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"
Cache-Control: private
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8

a17e
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
etc...
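The telnet session can be approximated in a few lines of Python, which makes it easier to repeat the request with different headers. This is only a sketch: `build_request`, `fetch_raw`, and `split_response` are hypothetical helper names, and the `Connection: close` header is my addition (it was not typed into telnet) so that the server hangs up when it is done.

```python
import socket


def build_request(host, path="/", extra_headers=None):
    """Build a bare-bones HTTP/1.1 GET request as bytes."""
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(host),
        "Connection: close",  # assumption: added so the read loop below terminates
    ]
    for name, value in (extra_headers or {}).items():
        lines.append("{}: {}".format(name, value))
    # An empty line marks the end of the request headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_raw(host, path="/", port=80, extra_headers=None, timeout=10.0):
    """Send the request over a plain TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path, extra_headers))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def split_response(raw):
    """Split a raw response into (header text, body bytes) at the first blank line."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1"), body


# Example (needs network access):
# head, body = split_response(fetch_raw("shopping.yahoo.com"))
# print(head)
```

The body is returned exactly as it came off the wire. With `Transfer-Encoding: chunked`, that includes the hexadecimal chunk sizes, such as the `a17e` in the transcript.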
Those Sneaky Geeks
It sounded crazy, but I realized that Yahoo could be using clients’ compression support to differentiate between bots and actual web browsers. So, I popped open another terminal and tried making a request as I did before with cURL, except with a Firefox User-Agent string and compression enabled.
curl --user-agent "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:18.104.22.168) Gecko/20070713 FireFox/22.214.171.124" --compressed "http://shopping.yahoo.com/path/to/resource" > result.html
Bingo! All of the links were intact – the source was identical to what a standard web browser retrieves. So, it seems that Yahoo has written some sort of Apache module, or perhaps just engineered its application code, to vary the response according to whether or not the client supports compression. This is quite a sneaky way to deter search engine indexing and screen scraping in general, but it works wonderfully. In fact, if I were the engineer who devised this detection method, I’d be pretty proud of myself. Unfortunately, now that we know Yahoo’s secret, our scraping workflow just gains one more step: curl > result.html.
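If you would rather not shell out to curl, the same trick can be sketched in Python with the standard library. curl's `--compressed` flag both advertises compression support through the `Accept-Encoding` request header and decompresses the reply. `urllib.request` does neither on its own, so the sketch does both by hand. The function names and the `USER_AGENT` placeholder are mine, and I have not checked how Yahoo's servers respond to this exact header set.

```python
import gzip
import urllib.request
import zlib

# Placeholder only: substitute the User-Agent string of a real browser.
USER_AGENT = "Mozilla/5.0 (placeholder browser User-Agent)"


def build_compressed_request(url, user_agent=USER_AGENT):
    """Build a request that advertises gzip/deflate support, like curl --compressed."""
    headers = {
        "User-Agent": user_agent,
        "Accept-Encoding": "gzip, deflate",
    }
    return urllib.request.Request(url, headers=headers)


def decode_body(body, content_encoding):
    """Undo the Content-Encoding the server applied, if any."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate fallback
    return body  # identity or an encoding this sketch does not handle


def fetch_compressed(url, timeout=10.0):
    """Fetch a URL with compression enabled and return the decoded body bytes."""
    request = build_compressed_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding", "")
        return decode_body(response.read(), encoding)


# Example (needs network access):
# html = fetch_compressed("http://shopping.yahoo.com/path/to/resource")
```

Saving the result of `fetch_compressed` to a file gives the same starting point as `result.html` from the curl command.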