I want to build a scraper that starts from a seed URL and crawls a site downward from there, pulling any links it finds on each page and then evaluating each new link’s Content-Type in order to find documents, images, or other binary data on those sites (e.g. application/pdf, application/octet-stream, application/msword).
From what I can tell, there doesn’t appear to be a method in the net/http library that stops reading the response once the headers have been received. The reason I’d like to do this is simple: http.Get() (which calls http.DefaultClient.Do() behind the scenes) seems to download the ENTIRE page before evaluating the headers – which becomes a problem if, say, you’re scraping a page for MIME types on a site that hosts tar files or ISOs, but you aren’t interested in the content of those tar files or ISOs in the crawler.
The closest thing I could find was the Header field on http.Response; however, that structure is only populated after the call to http.Get() returns, which defeats the purpose I’m trying to use it for.