
Author Topic: question for admin/mods: wondering reason for increased forum website traffic?  (Read 3561 times)

Offline GNUser

  • Wiki Author
  • Hero Member
  • *****
  • Posts: 1862
Hits from today by useragent   only the top 20
Hi Paul_123. How awful.

I'll just make the point that when most sites do that (or start using any other service that requires Javascript to try and verify humanity) I stop visiting.
It's not like server administrators have great options in this situation.

I have my own http server which I use just for myself, family, and a few friends. The server was getting hammered with tens of thousands of visits an hour, every hour, every day. I don't have premium hardware, so sometimes I couldn't access my own http server >:(

The options I considered were 1) shut off the server, 2) use a Javascript gatekeeper (e.g., Anubis), and 3) put the server on a nonstandard port. I actually tried all three options for a while. Doing without the server was too painful. Anubis worked well but added too much complexity to my otherwise barebones setup. Using a nonstandard port turned out to be the right balance for me.

Using a nonstandard port does not eliminate the problem (some bots are more sophisticated and do port scanning), but it eliminates >50% of bot traffic, bringing the noise down to a tolerable level. Would using a nonstandard port be worth trying for the TCL forum? The problem is that this would prevent a lot of legitimate, human users from finding the forum.
« Last Edit: April 08, 2026, 09:24:03 AM by GNUser »

Offline Paul_123

  • Administrator
  • Hero Member
  • *****
  • Posts: 1571
Based on the port scanning going on, I doubt it. It might slow them down for a couple of days, and it would just frustrate users.

Yesterday's hit was the first botnet that I know of.  Otherwise the bots have been fairly respectful.

Offline mocore

  • Wiki Author
  • Hero Member
  • *****
  • Posts: 795
  • ~.~
  Otherwise the bots have been fairly respectful.

Perhaps a bit in parallel to this topic: I post because I happened to read the above and then the quote below, from https://lists.gnu.org/archive/html/help-guix/2026-04/msg00047.html

The two seem to be vastly differing perspectives.


Quote from: help-guix/2026-04/msg00047
GPTBot alone did 109,552 accesses to my website in march, so I think
they are telling the truth in a very misleading way.

The websites that go into these stats have together about 2000 HTML
documents (https://www.1w6.org has 811, https://www.draketo.de/node has
827 and https://www.draketo.de/ has 296).

99% of these change less than once per year.

If GPTbot crawls them every day, that’s 2000x30 = 60.000 accesses per
month -- which is pretty close to the 109,552 accesses I see.

But I built these websites over 20 years. The oldest articles are from
2007.

A human goes there, reads 1-20 articles and leaves again. Maybe to
return later when there’s a new article (I have RSS feeds).

An LLM goes there and crawls everything. Every day.

There even was a week where GPT tried every possible combination of
search inputs on 1w6.org -- including repeated arguments, likely until
it hit the URL length limit of the server. My log analysis tool needed
days to complete the analysis after that week. And I give thanks to my
hoster that they didn’t boot me then (and that I don’t have to pay for
excess bandwidth).



Offline Paul_123

  • Administrator
  • Hero Member
  • *****
  • Posts: 1571
The different perspective is that I expect some level of scraping. It's just the time we live in. I specifically use a host that allows for unlimited bandwidth. Anything I can do to limit it will be obtrusive to the real users.



Offline Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 12830
Hi Paul_123
... I specifically use a host that allows for unlimited bandwidth.   Anything I can do to limit it will be obtrusive to the real users.
If you are talking about downloading extensions from the repo, then
yes, I would agree with that statement.

But this is a simple forum that's not littered with ads and videos.
Even attachments are limited to 200K in size (and total). How much
bandwidth is really needed for reading the forum?

I lowered the download speed on one of my machines to 1Mbit/sec
and had no trouble navigating the website.

If it's possible to set a speed limit that's comfortable for human
consumption, but less comfortable for bots scraping web pages, it
might be worth considering.

Just a thought.

Offline Paul_123

  • Administrator
  • Hero Member
  • *****
  • Posts: 1571
External bandwidth is never an issue and never what throttles the site. It's the PHP processing and database processes that jam up the CPU. Things are already rate limited for IP addresses and sessions. But a botnet avoids all of these limits.
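To illustrate why a per-IP limit does little against a botnet, here is a minimal Python sketch. It is not the forum's actual limiter: the `PerIpRateLimiter` class, the limit of 10 hits per 60 seconds, and the addresses are all assumptions made up for the example.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_hits requests per IP per window."""

    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return False
        recent.append(now)
        return True


def count_allowed(limiter, requests):
    """requests is an iterable of (ip, timestamp) pairs."""
    return sum(1 for ip, ts in requests if limiter.allow(ip, ts))


# One client sending 1000 requests in the same second: only the first 10 pass.
single = [("10.9.9.9", 0.0)] * 1000
single_allowed = count_allowed(PerIpRateLimiter(10, 60), single)

# A botnet of 1000 distinct addresses sending one request each: all pass.
botnet = [(f"10.0.{i // 250}.{i % 250 + 1}", 0.0) for i in range(1000)]
botnet_allowed = count_allowed(PerIpRateLimiter(10, 60), botnet)

print(single_allowed, botnet_allowed)
```

Both runs put the same 1000 requests on the PHP and database side, but only the single-address case is caught by the limiter.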

Offline CNK

  • Wiki Author
  • Sr. Member
  • *****
  • Posts: 453
I'll just make the point that when most sites do that (or start using any other service that requires Javascript to try and verify humanity) I stop visiting.
It's not like server administrators have great options in this situation.

I have my own http server which I use just for myself, family, and a few friends. The server was getting hammered with tens of thousands of visits an hour, every hour, every day. I don't have premium hardware, so sometimes I couldn't access my own http server >:(

The options I considered were 1) shut off the server, 2) use a Javascript gatekeeper (e.g., Anubis), and 3) put the server on a nonstandard port. I actually tried all three options for a while. Doing without the server was too painful. Anubis worked well but added too much complexity to my otherwise barebones setup. Using a nonstandard port turned out to be the right balance for me.

In my case I was able to identify a common argument in the request URL strings of all the requests coming from the botnet that was making millions of requests per day to my site. By adding a rule to the Apache configuration I blocked requests matching the bot's pattern, and since that prevented the PHP module from being loaded for them, the server was then able to handle all the requests the botnet could send without running out of RAM. I still needed to significantly increase the overall connection limit settings in Apache and the Linux kernel itself, but then it was able to absorb the attack, which continued for a week or two before finally giving up.
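The matching logic behind that kind of rule can be sketched in a few lines of Python. This is only an illustration, not the Apache rule itself: the parameter name `junkarg` and the `should_block` helper are hypothetical, since the actual argument the botnet used isn't given here.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter name; substitute whatever the bot's requests share.
BLOCKED_PARAMS = {"junkarg"}


def should_block(request_target):
    """Return True if the request's query string carries a blocked parameter."""
    query = urlsplit(request_target).query
    # keep_blank_values=True so "?junkarg" and "?junkarg=" are seen as well.
    params = parse_qs(query, keep_blank_values=True)
    return any(name in BLOCKED_PARAMS for name in params)


print(should_block("/index.php?topic=12&junkarg=1"))  # bot-style request
print(should_block("/index.php?topic=12"))            # ordinary request
```

The point of doing this in the web server rather than in the application is that refused requests never reach PHP at all.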

That was with a $1/month VPS, but I was lucky it was a crazy bot using a pointless argument in requested URLs (I guess it was running some idiotic AI-generated code), so I could block it without affecting human (or even sensible crawler) visitors at all. I've read accounts of other people identifying similar ways of blocking bots with web server rules to filter request URLs. Others have blocked impossible or unlikely User-Agents (really old browsers without sufficiently modern HTTPS support to really connect), since some botnets seem to use a pool of random browser User-Agents which isn't up to date. I could have blocked South American and Asian IP addresses, since all the hundreds of thousands of IPs the botnet used seemed to be from there, but I didn't want to. Maybe that would be another option for your personal site though. Others block IPs based on the owners of IP blocks (e.g. cloud/VPS hosting companies).
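As a rough sketch of the "unlikely User-Agent" idea, the Python below flags a Chrome version older than some cutoff. The cutoff of 100 and the helper names are assumptions for illustration only; a real threshold would need to be chosen from what legitimate visitors actually send.

```python
import re

CHROME_RE = re.compile(r"Chrome/(\d+)\.")
MIN_CHROME_MAJOR = 100  # assumed cutoff, purely illustrative


def chrome_major(user_agent):
    """Return the Chrome major version in the User-Agent, or None if absent."""
    match = CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None


def looks_outdated(user_agent):
    """True only when a Chrome version is present and below the cutoff."""
    major = chrome_major(user_agent)
    return major is not None and major < MIN_CHROME_MAJOR


old_ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36")
new_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36")

print(looks_outdated(old_ua), looks_outdated(new_ua))
```

User-Agents without a Chrome token are left alone here, so other browsers and honest crawlers are not affected by this particular check.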

Lots of answers, but I agree no single one is perfect for every situation.

Offline gadget42

  • Hero Member
  • *****
  • Posts: 1042
** WARNING: connection is not using a post-quantum kex exchange algorithm.
** This session may be vulnerable to "store now, decrypt later" attacks.
** The server may need to be upgraded. See https://openssh.com/pq.html
** Also see: post quantum internet 2025 - https://blog.cloudflare.com/pq-2025/

Offline CentralWare

  • Retired Admins
  • Hero Member
  • *****
  • Posts: 847
Good morning, everyone! Sorry I haven't checked in (in quite a bit) but life's other obstacles sometimes get in the way!  ???

Paul_123: Agents...
Until they figure out we're onto them...  use their agent tags as a death-trap:
Code:
156472 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
  43616 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36
The two top scrapers claim to be CHROME + SAFARI at once (the first on a Mac, the second on Windows).
Do some digging and see if there's a REAL browser out there claiming to be Safari AND Chrome; IF NOT, there's the first security trap at our front door.

For a REAL macOS: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15
For a REAL iPhone: Mozilla/5.0 (iPhone; CPU iPhone OS 17_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Mobile/15E148 Safari/604.1
Note: no mention of CHROME anywhere... that's likely a tactic to "please most any website/server" by user-agent.
I haven't looked closely, but Google does ship Chrome for macOS, and if its genuine user-agent carries both tokens, then user-agent traps may not be the ticket.
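Per-agent hit counts like the two listed above can be tallied straight from an access log. A minimal Python sketch, assuming the common "combined" log format where the user-agent is the last quoted field on each line; the sample lines and the `top_user_agents` helper are made up for illustration:

```python
import re
from collections import Counter

# Assumes the User-Agent is the final double-quoted field on the line.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def top_user_agents(lines, n=20):
    """Return the n most frequent User-Agent strings with their hit counts."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)


sample = [
    '10.0.0.1 - - [08/Apr/2026:09:00:00 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "AgentA/1.0"',
    '10.0.0.2 - - [08/Apr/2026:09:00:01 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "AgentB/2.0"',
    '10.0.0.3 - - [08/Apr/2026:09:00:02 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "AgentA/1.0"',
]

for agent, hits in top_user_agents(sample):
    print(hits, agent)
```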

That said, you can instead use cookie batter as bait...

1.   Plant a session cookie that expires in, oh, say 5 seconds - most INTERACTIVE websites use session cookies for even simple tasks like logins

2a. If bot follows suit, wait for around 5 hits within that time frame then self-jail that IP for however long sounds fair - humans can't "read" a web page at 1 page per second - the actual time and logic will have to be tweaked based on TLC.net's server response time to make it truly worthwhile

2b. If bot finds a way AROUND session cookies, do a bounce-test (on landing, if $_SESSION['self_test'] is empty, goto ./test.php, if test.php detects $_SESSION['self_test'] is STILL empty, that's a red flag for bots and humans alike. It's not something people normally can "turn off" in settings or preferences in major browsers.) Note: "self_test" has to be randomized to prevent bots from "learning" that if they want in, they need to tamper with "self_test" in order to come in without issue. Session data is stored SERVER SIDE (the cookie only carries the session ID), so that's unlikely to happen.

3.   Kill a bot's connection for even 15 seconds once this flag's been tripped and you're likely to force it to turn away OR throttle itself. For humans...  pretending their F5 key got stuck... a 15 second ban isn't the end of the world :)

4.   Next comes the three-strike-rule...  trip the above hits-per-second three times in a row within so many minutes and all hits thereafter get redirected (header 301/302) to themselves (127.0.0.1) which "should" in theory actually slow the bot down overall as all of these thousands of sockets hitting us are being redirected...  and now waiting for "localhost" to answer on port 12345.  In theory. :)
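The hit counting, short ban, and three-strike escalation from the steps above can be sketched roughly as follows. This is a Python illustration of the logic only, not PHP for the forum: the `BotJail` class, the 10-minute strike window, and the test address are assumptions, and the session-cookie and bounce-test parts are left out.

```python
from collections import defaultdict, deque


class BotJail:
    """Short bans for bursts of hits, escalating to a redirect on repeat offences."""

    def __init__(self, max_hits=5, window=5.0, ban_seconds=15.0,
                 max_strikes=3, strike_window=600.0):
        self.max_hits = max_hits            # hits within `window` that trip the trap
        self.window = window
        self.ban_seconds = ban_seconds
        self.max_strikes = max_strikes      # trips within `strike_window` to escalate
        self.strike_window = strike_window  # assumed value for "so many minutes"
        self.hits = defaultdict(deque)
        self.strikes = defaultdict(deque)
        self.banned_until = {}
        self.redirected = set()

    def check(self, ip, now):
        """Return 'allow', 'ban' or 'redirect' for a request from ip at time now."""
        if ip in self.redirected:
            return "redirect"
        if now < self.banned_until.get(ip, 0.0):
            return "ban"
        hits = self.hits[ip]
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)
        if len(hits) < self.max_hits:
            return "allow"
        # Trap tripped: reset the hit window and record a strike.
        hits.clear()
        strikes = self.strikes[ip]
        while strikes and now - strikes[0] >= self.strike_window:
            strikes.popleft()
        strikes.append(now)
        if len(strikes) >= self.max_strikes:
            self.redirected.add(ip)
            return "redirect"
        self.banned_until[ip] = now + self.ban_seconds
        return "ban"


jail = BotJail()
# A client requesting a page every half second for 40 seconds.
bot_results = [jail.check("10.0.0.9", i * 0.5) for i in range(80)]
print(bot_results[:5], bot_results[-1])
```

With these numbers a client reading one page every ten seconds never trips the trap, while a client hitting twice a second is banned on its fifth request and ends up in the redirect state after the third trip.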

It's funny, but it's "AI" that brought me here! (Automated Idiocracy)
I was running a scenario through one of M$ LLMs asking what the challenges would be to install vLLM/ollama/etc. onto TinyCore (it laughed, basically telling me it'll be a painful experience) and I remembered a TLC member asking about a year or so ago why we don't have an AI doing our extension builds (as maintainers) - which is somewhat what I'm finagling...  which led me to here.

So I did a little more digging and the LLM actually knows quite a bit about the ins and outs of the OS and the content of the wiki and forum, so yes, there's SOME good that's come from it, but tactfulness and respect from the crawlers is virtually non-existent, so we may have to teach it a few graces, whether it likes it or not.

NOTE: Google crawler isn't overly socket-friendly either, so what keeps the beasts away may also keep the spiders away.