If You Insist On Scraping My Site, At Least Do It Properly

Published: 06 Jul 2023 at 06:00 +0200 Updated: 15 Aug 2023 at 13:21 +0200

Update 15 August 2023: I no longer grant a license to scraping LLMs, so this is now just a guide on how to scrape respectfully.

I read a post by Terence about how silence isn’t consent, in which he mentioned some asshole AI data scraping bot that was bulk downloading his site’s images, straining his servers, and only letting you opt out with non-standard HTTP headers. I’ll save my thoughts on mass data collection for AI for another day (in short, it depends). I don’t think you should hoover the internet, but that won’t stop you (as if you needed my permission). So if you do choose to hoover my site in particular, you should be considerate in doing so by:

identifying yourself through the User-Agent,
not flooding my site with requests, which means waiting at least 30 seconds between one asset and the next, and
learning to use conditional requests (ETag with If-None-Match, Last-Modified with If-Modified-Since) so you only fetch an asset when something changes.
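
For the avoidance of doubt, here is a rough sketch of those three points in Python, using only the standard library. It is one way to do it, not the way. The bot name, contact URL and email address are placeholders I made up, `PoliteFetcher` and `conditional_headers` are hypothetical names, and the in-memory cache is an assumption to keep the example short (a real scraper would persist it between runs).

```python
import time
import urllib.error
import urllib.request

# Placeholder identity (made up for this sketch): say who you are and how to reach you.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot; bot@example.com)"
MIN_DELAY_SECONDS = 30  # wait at least this long between requests


def conditional_headers(cached):
    """Build request headers from the validators saved on a previous fetch."""
    headers = {"User-Agent": USER_AGENT}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


class PoliteFetcher:
    """Hypothetical helper: identifies itself, throttles, and revalidates."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS):
        self.min_delay = min_delay
        self.last_request = None
        self.cache = {}  # url -> {"etag", "last_modified", "body"}; in memory only

    def _wait(self):
        # Sleep until min_delay seconds have passed since the previous request.
        if self.last_request is not None:
            remaining = self.min_delay - (time.monotonic() - self.last_request)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request = time.monotonic()

    def fetch(self, url):
        cached = self.cache.get(url, {})
        request = urllib.request.Request(url, headers=conditional_headers(cached))
        self._wait()
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                body = response.read()
                self.cache[url] = {
                    "etag": response.headers.get("ETag"),
                    "last_modified": response.headers.get("Last-Modified"),
                    "body": body,
                }
                return body
        except urllib.error.HTTPError as err:
            if err.code == 304:
                # 304 Not Modified: nothing changed, so reuse the saved copy.
                return cached["body"]
            raise
```

Calling `PoliteFetcher().fetch(url)` in a loop over your list of URLs is enough: the first pass downloads everything slowly, and later passes mostly get tiny 304 responses back instead of the full asset.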

If you have a good enough reason, I’m open to an agreement where I send you the full site export periodically instead of you scraping my site. I understand the value of scraping and I get that you don’t care what I think; just be considerate when you’re doing it on my site.