kivikakk.ee

the LLM residential botnet scraping thing

People are sometimes sceptical that providers use extremely distributed residential botnets to scrape, but it’s very obvious, particularly once you have done everything you’re “meant to” and several steps beyond: when all the robots.txt blanket disallows and Anubis aren’t enough, and then you have to proceed to straight-up disallowing almost every request from a non-logged-in user? They just keep trying, only no longer from officially tagged sources.

Here is a screenshot of nóssa’s Shynet analytics. Something happened during May 21–22! Let’s zoom in on those days.

Something certainly began at 15:00 or so. 3,528 requests for https://nossa.ee/~talya/beffast/blob/main/fddbdce6962ce338e1a17827107a976f10fed921/README.md, and 3,051 requests for https://nossa.ee/~talya/nix-rosetta-builder/blob/t/97f242ec47d110bb5a2e4294595f44be7641258a/LICENSE. Both these URLs give a 403 if you’re not logged in, and outside me there are a whole two registered users: my wife, and an old Fedi friend. I don’t think either of them is doing this, so we can safely assume that all the bots are getting is 403s.

Note the referrers, which correlate in session numbers very closely to the requested locations. We have https://nossa.ee/~talya/beffast/blob/main/fddbdce6962ce338e1a17827107a976f10fed921/README.md with 3,520 sessions and https://nossa.ee/~talya/nix-rosetta-builder/blob/t/97f242ec47d110bb5a2e4294595f44be7641258a/LICENSE with 3,047 sessions.
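If you wanted to check the same pattern against raw request logs rather than a dashboard, a rough sketch might look like the following. It assumes each request carried its own URL as the Referer, which the near-identical counts suggest but the screenshots alone don’t prove. The `Hit` record and its field names are made up for illustration; this is not Shynet’s schema.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    # Hypothetical log record; field names are assumptions, not Shynet's schema.
    url: str
    referrer: str
    status: int


def self_referred_denials(hits: Iterable[Hit]) -> Counter:
    """Count 403 responses per URL where the Referer equals the requested URL."""
    counts: Counter = Counter()
    for hit in hits:
        # A real browser following a link rarely names the page itself as referrer,
        # so a large count here is a hint (not proof) of scripted traffic.
        if hit.status == 403 and hit.referrer == hit.url:
            counts[hit.url] += 1
    return counts
```

Calling `self_referred_denials(hits).most_common(5)` would then list the URLs with the most self-referred denials, which can be compared against the per-URL request totals.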

These, too, have given 403s for the last month. Hmmm. Did they all just go to sleep for a month and then retry?

Everyone loves desktop Chrome on Windows.

From roughly the middle of the period, requests from networks identified as “Sky UK Limited”, “CABLE-NET-1”, “TWC-20001-PACWEST”, “COMCAST-7922”, “CITYOFWILSONNC”, “EGIHOSTING”, “CHARTER-20115”, “Virgin Media Limited”, “TWC-11426-CAROLINAS”, “OWS”, “UUNET”, “ATT-INTERNET4”, “Shenzhen Katherine Heng Technology Information Co., Ltd.”, “CENTURYLINK-US-LEGACY-QWEST” …

… “BACOM”, “UUNET” again, “EGIHOSTING” again, “COMCAST-7922” again, “Shenzhen Katherine Heng Technology Information Co., Ltd.” again, “CABLE-NET-1” again, “OWS” again, “EGIHOSTING” again, “T-MOBILE-AS21928”, “MEDIACOM-ENTERPRISE-BUSINESS”, “SOCKET”, “GTT Communications Inc.”, “Telmex Colombia S.A.”, “Unknown” (!), “BHN-33363”, “COMCAST-7922” again …

… “OWS” again, “EGIHOSTING” again, “COMCAST-7922” again, “UUNET” again, “TOT Public Company Limited”, “TWC-11426-CAROLINAS” again, “SP-NYJ”, “TWC-11351-NORTHEAST”, “FRONTIER-FRTR”, “Shenzhen Katherine Heng Technology Information Co., Ltd.” again, “CHARTER-20115” again, “Unknown” again, “GTT Communications Inc.” again …

… “COMCAST-7922” again, “CHARTER-20115” again, “T-MOBILE-AS21928” again, “TWC-11426-CAROLINAS” again, “EGIHOSTING” again, “ATT-INTERNET4” again, “MEGAPATH2”, “VALLEYFIBER”, “Shenzhen Katherine Heng Technology Information Co., Ltd.” again, “BACOM” again, “TWC-10796-MIDWEST”, “SPRINTLINK”, “CABLE-NET-1” again, “AS-CMN”, “CENTURYLINK-LEGACY-SAVVIS” …

… zooming back in time about 21 hours, from “EGIHOSTING” again, “COMCAST-7922” again, “VIPNAS1”, “CABLE-NET-1” again, “EASTMSCONNECT-01”, “Shenzhen Katherine Heng Technology Information Co., Ltd.” again, “TWC-11426-CAROLINAS” again, “UUNET” again, “ATT-INTERNET4” again, “OWS” again, “GTT Communications Inc.” again.
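To put a number on “extremely distributed” instead of reading network names off screenshots, one could tally them. This is only a sketch: it assumes you already have one network name per request, and `network_spread` is a hypothetical helper, not part of any analytics tool.

```python
from collections import Counter
from typing import Iterable


def network_spread(networks: Iterable[str]) -> tuple[int, float]:
    """Return (number of distinct networks, share of requests from the busiest one)."""
    counts = Counter(networks)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    busiest = counts.most_common(1)[0][1]
    return len(counts), busiest / total
```

Many distinct networks with a small busiest share would fit the picture above: no single network to block, just a long tail of residential ISPs.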

¯\_(ツ)_/¯
