Original post

I want to build a scraper that starts from a seed link and crawls a site downward from there, pulling any links it finds on each page and then checking each new link’s Content-Type to find documents, images, or other binary data (e.g. application/pdf, application/octet-stream, application/msword, etc.).

From what I can tell, there doesn’t appear to be a way in the net/http package to stop receiving a response once the headers have arrived. The reason I’d like to do this is simple: as far as I can tell, http.Get() (which calls (*http.Client).Do under the hood) downloads the ENTIRE page before you can evaluate the headers. That becomes a problem if, say, you’re scraping a page for MIME types and it links to tar files or ISOs whose contents the crawler doesn’t care about.

The closest thing I could find is the Response.Header field; however, that structure is only populated after a call to http.Get() returns, which defeats the purpose I’m trying to use it for.

submitted by /u/bmw417