lm411
AI companies and notably AI scrapers are a cancer that is destroying what's left of the WWW.
I was hit with a pretty substantial botnet "distributed scraping" attack yesterday.
- About 400,000 different IP addresses over about 3 hours
- Mostly residential IP addresses
- Valid and unique user agents and referrers
- Each IP address would make only a few requests with a long delay in between requests
It would hit the server hard until the server became slow to respond, then back off for about 30 seconds, then hit hard again. I was able to block most of the requests with a combination of user agent and referrer patterns, though some legitimate users may be blocked.
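For illustration, here is a minimal nginx sketch of that kind of user-agent and referrer pattern blocking (the patterns are hypothetical placeholders, not the ones actually used against this botnet):

```nginx
# Hypothetical example patterns only -- real rules would be tuned to the
# actual attack traffic observed in the logs.
map $http_user_agent $bad_ua {
    default                                    0;
    "~*python-requests|Scrapy|Go-http-client"  1;
}
map $http_referer $bad_ref {
    default                  0;
    "~*suspicious\.example"  1;
}
server {
    listen 80;
    # Reject matching requests before they reach the backend.
    if ($bad_ua)  { return 403; }
    if ($bad_ref) { return 403; }
}
```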
The attack was annoying, but the even bigger problem is that the data on this website is under license: we have to pay for it, and it's not cheap. We can (barely) pay for it with advertising revenue and some subscriptions.
If everyone gets this data from their "agent" and scrapers, that means no advertising revenue, and soon enough no more website to scrape: jobs lost, nowhere for scrapers to get the data, nowhere for legitimate users to get the data for free, and so on.
oasisbob
Knew it was getting bad, but Meta's facebookexternalhit bot changed their behavior recently.
In addition to pulling responses with huge amplification (40x, at least, for posting a single Facebook post to an empty audience), it's sending us traffic with fbclids in the mix. No idea why.
They're also sending tons of masked traffic from their ASN (and EC2), with a fully deceptive UserAgent.
The weirdest part though is that it's scraping mobile-app APIs associated with the site in high volume. We see a ton of other AI-training focused crawlers do this, but was surprised to see the sudden change in behavior on facebookexternalhit ... happened in the last week or so.
Everyone is nuts these days. Got DoSed by Amazonbot this month too. They refuse to tell me what happened, citing the competitive environment.
dspillett
> it's sending us traffic with fbclids in the mix. No idea why.
The click IDs are likely to make the traffic look more like a human who has clicked a link rather than a bot? That way it gets past simple filters that explicitly let such requests in before bothering to check that the source address of the request seems to be a DC rather than a residential IP.
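A sketch of the kind of naive filter that behavior would exploit (hypothetical logic, not any particular vendor's product):

```javascript
// Hypothetical naive bot filter: requests carrying a click ID are waved
// through as "a human who clicked a shared link" before the source IP is
// ever checked, so bots that append fbclid skip the datacenter-IP test.
function naiveAllow(requestUrl, sourceIp, isDatacenterIp) {
  const url = new URL(requestUrl);
  if (url.searchParams.has("fbclid") || url.searchParams.has("gclid")) {
    return true; // click-ID shortcut: bypasses the IP check entirely
  }
  // Only traffic without a click ID gets the datacenter-IP test.
  return !isDatacenterIp(sourceIp);
}
```

With this ordering, a scraper running from a datacenter gets through simply by adding `?fbclid=...` to every URL it fetches.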
> citing the competitive environment
All the companies are competing to be the biggest inconvenience to everyone else while scraping as much stuff as they can.
oasisbob
> The click IDs are likely to make the traffic look more like a human who has clicked a link rather than a bot?
It's certainly possible. However, the traffic is still coming from Facebook's network with a FB proxy PTR record in DNS. Seems much more likely to fool your typical site owner than a bad actor.
pinkmuffinere
I’ve been sitting on this page for two minutes and it’s still not sure whether I’m a bot lol. What did I do in a past life to deserve this :(
mxmlnkn
After 2 minutes at 150 kHashes on mobile, I finally see the first pixel of the progress bar filling up. Seems like it will take hours or a day to finish. Some estimate would have been nice.
drum55
Ironically, I used an LLM to write a bypass for this ridiculous tool. Doing hashing in a browser makes no sense: Claude's very bad C implementation does tens of megahashes a second and passes all of the challenges nearly instantly. It took about five minutes for Claude to write, and it's not even a particularly fast implementation, but it beats the pants off doing string comparisons on every loop iteration in JavaScript, which is what the Anubis tool does.
    // Simplified version of the check: hash data+nonce and test whether
    // the first requiredZeroBytes bytes of the digest are all zero.
    let nonce = 0;
    for (;;) {
        const hashBuffer = await calculateSHA256(data + nonce);
        const hashArray = new Uint8Array(hashBuffer);
        let isValid = true;
        for (let i = 0; i < requiredZeroBytes; i++) {
            if (hashArray[i] !== 0) {
                isValid = false;
                break;
            }
        }
        if (isValid) break; // found a valid nonce
        nonce++;
    }
It's less proof of work and more just an annoyance to users and a feel-good measure for whoever added it to their site; I can't wait for it to go away. As a bonus, it's based on a misunderstanding of hashcash: because it only tests whole zero bytes (rather than comparing against a finer-grained numeric target, as Bitcoin does, for example), the difficulty isn't granular enough to make sense. Only a couple of the lower settings are reasonably solvable in JavaScript, and the settings for "wait for 90 minutes" and "instantly solved" are only 2 values apart.
Retr0id
I wrote one that uses opencl: https://github.com/DavidBuchanan314/anubis_offload
GeoAtreides
>It's less proof of work and just annoying to users, and feel good to whoever added it to their site,
this is disproved in the posted article:
>And so Anubis was enabled in the tar pit at difficulty 1 (lowest setting) when requests were pouring in 24/7. Before it was enabled, it was getting several hundred-thousand requests each day. As soon as Anubis became active in there, it decreased to about 11 requests after 24 hours, most just from curious humans.
apparently it does more than annoying users and making the site owner feel good (well, i suppose effective bot blocking would make the site owner feel quite good)
bawolff
Shouldn't browsers also have it implemented in C? I assume crypto.subtle isn't written in JS.
yborg
Maybe post your brilliant solution to the commercial companies with hundreds of millions in funding that are running unrestrained bot scraping of the Internet for AI training, instead of complaining about the people desperately trying to rein it in as individuals.
raincole
At this point I wonder if you can post a crypto miner page on HN and people will fall for it.
dheera
I don't get this kHash thing. Do we have captchas mining bitcoin in a distributed fashion for free now?
throw10920
The page says
> Anubis uses a Proof-of-Work scheme in the vein of Hashcash
And if you look up Hashcash on Wikipedia you get https://en.wikipedia.org/wiki/Hashcash which explains how Hashcash works in a fairly straightforward manner (unlike most math pages).
coryrc
On what page? https://gladeart.com/blog/the-bot-situation-on-the-internet-... loaded effectively instantly for me.
prewett
The cynic in me thinks that they're mining bitcoin on our phones… And after it completed, it claimed the page was misconfigured.
luxuryballs
I think we got honeybotted.
salomonk_mur
I'm surprised at the effectiveness of simple PoW to stop practically all activity.
I'll implement Anubis at low difficulty for all my projects and leave a decent llms.txt referenced in my sitemap and robots.txt, so LLMs can still get relevant data from my site while keeping bad bots out. I'm getting thousands of requests from China that have really increased costs; glad the fix seems rather easy.
gruez
>I'm surprised at the effectiveness of simple PoW to stop practically all activity.
It's even dumber than that, because by default anubis whitelists the curl user agent.
curl -H "User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36" "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v7.0-rc5&id2=v7.0-rc4&dt=2"
<!doctype html><html lang="en"><head><title>Making sure you're not a bot!</title><link rel="stylesheet"
vs curl "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v7.0-rc5&id2=v7.0-rc4&dt=2"
<!DOCTYPE html>
<html lang='en'>
<head>
<title>kernel/git/torvalds/linux.git - Linux kernel source tree</title>
functionmouse
shhhh don't tell the bots !
marginalia_nu
Anubis' white lists and block rules are configurable though. The defaults are a bit silly.
xena
The default is to allow non-Mozilla user agents so that existing (good) automation continues to work and so that people stopped threatening to burn my house down. Lovely people in the privacy community.
wolvoleo
It's definitely more than enough to stop me as a human wanting to visit the site, so yeah.
In that case a better solution would be to take the site down altogether.
xboxnolifes
Take down the site entirely because a couple humans get into a fit about it?
wolvoleo
I'm just saying, making visitors wait at least a minute while making their device turn red hot is going to stop 99.9% of your visitors. So at that point, what's the point in trying to serve the content?
jlarocco
The site's down entirely anyway. The silly "proof of work" finishes only to tell me the site is down.
What a waste of time.
simonw
> These bots are almost certainly scraping data for AI training; normal bad actors don't have funding for millions of unique IPs thrown at a page. They probably belong to several different companies. Perhaps they sell their scraped data to AI companies, or they are AI companies themselves. We can't tell, but we can guess since there aren't all that many large AI corporations out there.
Is the theory here that OpenAI, Anthropic, Gemini, xAI, Qwen, Z.ai etc are all either running bad scrapers via domestic proxies in Indonesia, or are buying data from companies that run those scrapers?
I want to know for sure. Who is paying for this activity? What does the marketplace for scraped data look like?
oasisbob
I want more data too.
The root sources of the traffic from residential proxies get murky very quickly.
It's easy to follow the chain partway for some traffic, eg "Why are we receiving all this traffic from Digital Ocean? ... oh, it's their hero client Firecrawl, using a deceptive UserAgent" ... but it still leaves the obvious question about who the Firecrawl client is.
Res proxy traffic is insane these days. There are also plenty of grey-market snowshoe IPs available for the right price, from a handful of ASNs. I regularly see unified crawling missions by unknown agents using 1000+ "clean" IP addresses an hour.
ghywertelling
I bet a lot of companies want to provide search results to AI agents.
NooneAtAll3
> Before it was enabled, it was getting several hundred-thousand requests each day. As soon as Anubis became active in there, it decreased to about 11 requests after 24 hours
I love experimental data like this. So much better than the gut reactions that were spammed when Anubis was first introduced.
wolvoleo
Well yeah, but I also didn't make it through to the actual site. That can't be the idea, right? After 5 seconds of 100% CPU and no progress I gave up.
The idea is to scare off bots and not normal humans.
rz2k
On my computer, with Firefox it uses 14 CPU cores, consumes an extra 35 Watts, and the progress bar barely moves. Is this site mining cryptocurrency?
On Safari or Orion it is merely extremely slow to load.
I definitely wouldn't use any of this on a site that you don't want delisted for cryptojacking.
JeanMarcS
I'm seeing this pattern a lot on Prestashop websites, where thousands, not to say hundreds of thousands, of requests are coming from bots that don't announce themselves in the User-Agent, all from different IPs.
Very annoying. And you can't filter them because they look like legitimate traffic.
On a page with different options (such as color, size, etc.), they'll try all the combinations, eating all the resources.
goodmythical
Looks like they've gone ahead and implemented the easiest fool-proof method of preventing scraping, as the site is currently not loading across multiple devices.
Not even a 404, just not available at all.
LeoPanthera
Is Anubis being set to difficulty 8 on this page supposed to be a joke? I gave up after about 20 seconds.
lucb1e
I think that must be the point they're trying to make, yes
It also drives home that Anubis needs a time estimate, at least on sites that use it not as a "can you run JavaScript" wall but as the actual proof-of-work mechanism it purports to be.
It shows a difficulty of "8" at "794 kilohashes per second", but what does that mean? The 8 must be exponential in something, but in what? If it means 8 leading zero bits, that's 2^8 = 256 expected hashes, solved in a fraction of a second at that rate. If it means 8 zero hex digits, that's 16^8, about 4.3 billion hashes, or roughly 90 minutes. If it means 8 zero bytes, that's 2^64 hashes, which would take millennia. There is no way to figure out the expected wait even if you understand all the text on the page (which most people won't) and can do the mental math (how many people know small powers of 2 by heart?).
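For a rough sense of scale, a sketch of the arithmetic (only the difficulty "8" and the 794 kH/s rate come from the page; which unit the 8 counts is a guess):

```javascript
// Expected time to solve at 794 kH/s under three readings of "difficulty 8":
// leading zero bits, leading zero hex digits, or leading zero bytes.
const hashRate = 794e3; // hashes per second, as reported by the page

function expectedSeconds(bitsOfWork) {
  // On average, 2^bitsOfWork hashes are needed to find a solution.
  return 2 ** bitsOfWork / hashRate;
}

console.log(expectedSeconds(8));     // 8 zero bits: ~0.0003 s, effectively instant
console.log(expectedSeconds(8 * 4)); // 8 zero hex digits: ~5400 s, roughly 90 minutes
console.log(expectedSeconds(8 * 8)); // 8 zero bytes: ~2.3e13 s, hundreds of millennia
```

The hex-digit reading happens to match the "wait for 90 minutes" figure mentioned elsewhere in the thread, which is some evidence for it, but the UI gives the user no way to tell.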
xiconfjs
I waited a minute until my phone got hot.
siva7
So, the elephant in the room: how much of HN is bot-generated? Those who know have every incentive not to share, and those who don't have no way to figure it out. At this point I have to assume that every new account is a bot.
uberman
I've thought about this a bit, and I can't really see why someone would want to write AI content here other than to spam ads, but those are handled quickly. Does anyone see AI content with a clear motivation or agenda here? There are very few rep-based privileges, so that seems like an unlikely motivation as well.
Retr0id
Most of the HN bot accounts I see have a link-to-vibecoded-product in bio, and/or are trying to build up "organic" activity before a Show HN post for the same.
A less publicly-visible motive would be if they were building up accounts to use for paid-upvote schemes.
PowerElectronix
You can automate shilling to drive or at least influence opinion.
siva7
This is a venture-capitalist-driven community that attracts the sleaziest kind of spammers you could think of, under the badge of growth hacking and networking. Besides this very obvious motivation to spam, you have all kinds of nerds here eager to do it just because they can (on one of the most famous tech places, where registration is made as easy as possible).
MeetingsBrowser
The article is about automated web scraping, not bots writing content.
siva7
The commenters here don't care what the article is about when they can't access it, and the question I raised, though not about web scraping, is much more concerning.
Trufa
I felt a vibe change. Some accounts are obvious and some aren't, but it does feel different. The main change I've seen is in downvotes: I don't say very controversial things, yet I've had many comments very quickly downvoted and then slowly upvoted. HN was very slow to downvote in the past (except for obvious trolls/spam). So for me the main worry isn't even the comments, but the invisible bias generated by voting.
Retr0id
> Those who know have every incentive not to share
Why do you say that?
snapetom
I think HN is one of the better ones these days. I have no data to back this up, but the comments aren't like reddit comments. Go into any reddit post on the main subs, and you won't have to scroll very far to get a comment about Trump derailing the whole thing.
Digg's recent shutdown message talked about how bad and aggressive bots were. I'd love to see Kevin and Alex post in depth about lessons learned, Dead Internet, and call out social sites.
lizknope
> The IPs of these bots here actually do not come from datacenters or VPNs most of the time; the overwhelming majority come from residential and mobile networks.
So I started searching for what these residential proxy networks actually are.
https://datadome.co/bot-management-protection/how-proxy-prov...
https://web.archive.org/web/20260329052632/https://gladeart....