From: Andreas K. <and...@gm...> - 2025-10-05 17:18:54
> Hi all,
>
> When I try to download and unpack a tarball from the core server using
> wget, I get a complaint from gzip: not in gzip format.
> Looking at the downloaded data, it is actually an html page that
> contains (among other things) a remark: "You appear to be a robot".
> Well, yes. This is an automated build script. That could be considered a
> robot. But is that bad?
> I understand this happens due to misbehaving AI bots. But is there
> anything that can be done to make automated tarball downloads for
> legitimate use possible again? It makes no sense to me to have to clone
> a whole repository when I just need a single release. It doesn't look
> like fossil has the concept of a shallow clone similar to git.

I believe it would be best to talk with Richard, as the person managing the repos and the fossil serving them. There might be some way of bypassing the defenses.

Some discussion at https://fossil-scm.org/forum/forumpost/6c9c86ef97

> The alternative is that I manually copy the tarballs to my own site and
> adjust the script to download from there. But that would be quite
> inconvenient.

--
Happy Tcling,
Andreas Kupries <and...@gm...>
<https://core.tcl-lang.org/akupries/>
<https://akupries.tclers.tk/>
Developer @ SUSE Software Solutions Germany GmbH
-------------------------------------------------------------------------------
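[For scripts that hit this failure mode, one defensive pattern is to verify the payload before handing it to tar, so the anti-robot HTML page produces a clear error rather than a confusing gzip complaint. A minimal sketch; the URL and file names are illustrative only, not the actual addresses discussed in this thread:]

```
#!/bin/sh
# Fetch a tarball and verify it is really gzip data before unpacking.
# The URL is illustrative; substitute the actual tarball address.
url="https://core.tcl-lang.org/tcltls/tarball/tcltls.tar.gz"
wget -q -O tcltls.tar.gz "$url" || { echo "download failed" >&2; exit 1; }

# gzip -t tests integrity without extracting; an HTML error page
# ("You appear to be a robot") fails this test immediately.
if ! gzip -t tcltls.tar.gz 2>/dev/null; then
    echo "not in gzip format - server may have served an HTML page" >&2
    exit 1
fi
tar -xzf tcltls.tar.gz
```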
From: Schelte B. <tc...@tc...> - 2025-10-05 21:01:39
On 05/10/2025 19:18, Andreas Kupries wrote:
> I believe it would be best to talk with Richard, as the person
> managing the repos and the fossil serving them. There might be some
> way of bypassing the defenses.
>
> Some discussion at https://fossil-scm.org/forum/forumpost/6c9c86ef97

That is very interesting. Thanks for pointing me to that discussion. I had assumed that this robot defense mechanism was implemented by Cloudflare, and therefore I expected it to be the same for all core repositories. Now that I understand it is a fossil feature, I looked around and found the fossil robot defense settings page. I subsequently found that the setting is slightly different on different repositories.

I initially ran into the problem with the tcltls repository. It has robot-restrict set to:

timelineX,diff,annotate,zip,fileage,file,finfo,reports

The tcl repository, on the other hand, actually allows downloading a tarball with wget. It has this in the robot-restrict property:

timeline,*diff,vpatch,annotate,blame,praise,dir,tree

I suspected the absence of zip here is what allows the tarball download. When I removed that tag from the setting in the tcltls repository, I was indeed able to download a tarball with wget again.

Please let me know if anybody considers this an unacceptable risk. In that case I will restore the original setting and possibly add some pattern to the robot-exception property. Otherwise I would like to keep it this way.

Thanks,
Schelte.
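[A quick client-side way to confirm the effect of such a settings change is to probe each repository and check whether the response is gzip data or an HTML block page. A minimal sketch, assuming illustrative tarball paths:]

```
#!/bin/sh
# Probe each repository: a blocked request returns HTML, an allowed
# one returns gzip data (magic bytes 1f 8b). Paths are illustrative.
for url in \
    "https://core.tcl-lang.org/tcl/tarball/tcl.tar.gz" \
    "https://core.tcl-lang.org/tcltls/tarball/tcltls.tar.gz"
do
    magic=$(wget -q -O - "$url" | head -c 2 | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "$url: tarball served"
    else
        echo "$url: blocked or non-gzip response"
    fi
done
```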
From: Richard H. <dr...@sq...> - 2025-10-06 12:26:42
We can drop the defenses on the TCL servers as much as you want, or as much as you think your Cloudflare frontend can handle. Know this for certain: without defenses, the new AI robots will go through and systematically download tarballs for every single check-in in the complete history of the project. A single tarball takes about 10 seconds of CPU time just to run the zlib compression - not even counting the work needed to assemble the files for the tarball. This will happen multiple times per day, or per hour. The robot download rate seems to be growing.

I have the SQLite repository set up so that it only allows robot-fetches of tarballs that follow this pattern:

```
https://sqlite.org/src/tarball/release/HASH/NAME.tar.gz
```

Here HASH is really any symbolic name for the specific check-in. It can be a SHA3 hash, or it can be a date/time stamp, or a symbolic name. That doesn't matter. But the /release/ that comes immediately before HASH means that the specific check-in must be tagged with "release". My servers (and cache) can handle the load of delivering tarballs of releases.

Y'all want me to set up something similar for TCL? Would that resolve the issue?

Or y'all can do this yourselves (if you have Admin privilege on the repositories) by visiting <https://core.tcl-lang.org/tcl/setup_robot> and entering an appropriate regexp in the "Exceptions to anti-robot restrictions" box. The instructions give you an example, which is in fact the regexp I use for SQLite.

Wait a sec... I'm looking at the robot defenses page for TCL now. It looks like somebody already has robot restrictions for tarballs turned off via the "Do not allow robots access to these pages" GLOB pattern. What URL is not working for your scripts, exactly?

--
D. Richard Hipp
dr...@sq...
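[For repositories without such an exception, a pattern along these lines should reproduce the release-only behaviour described above. This is a hypothetical regexp for the "Exceptions to anti-robot restrictions" box - not the actual SQLite entry, whose exact anchoring and syntax should be taken from the example shown on the setup_robot page itself:]

```
/tarball/release/[^/]+/[^/]+\.tar\.gz
```

[With an exception like this in place, the broader robot-restrict tags, including zip, can stay enabled: tarballs of arbitrary check-ins remain protected, while tagged releases stay scriptable.]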
From: Schelte B. <tc...@tc...> - 2025-10-06 14:11:25
Hello Richard,

Thank you for your answer. I guess you missed my follow-up mail. The problem I initially had was with the tcltls repository.

After Andreas pointed me to the fossil forum discussion, I found that the Tcl repository actually did allow downloading a tarball and that it was missing the "zip" tag in the robot-restrict property. So I removed that tag from the tcltls repository as well, which then allowed the download.

I suppose you are right that it would be prudent to come up with more fine-grained control using the robot-exception property.

Thanks,
Schelte
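[If the robot-exception route is taken for tcltls, a build script would fetch release tarballs through URLs of the form below. A sketch only: the tag name and file name are hypothetical, and the path layout follows the /tarball/release/HASH/NAME.tar.gz pattern described earlier in the thread:]

```
#!/bin/sh
# Illustrative: with a release-only exception in place, this fetch
# keeps working for robots, while tarballs of arbitrary check-ins
# stay blocked. Tag and file names are hypothetical.
wget -q -O tcltls.tar.gz \
    "https://core.tcl-lang.org/tcltls/tarball/release/tls-1-7-22/tcltls.tar.gz"
gzip -t tcltls.tar.gz && echo "release tarball served"
```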
From: Gustaf N. (sslmail) <ne...@wu...> - 2025-10-06 14:39:33
Hi Schelte,

A few weeks ago I had exactly the same problem (automated downloads of versions from branches for testing via GitHub pipelines suddenly failed). So I switched to the GitHub mirror, which works robustly and provides sufficient flexibility to download the archives. I think just the names of the top-level directories changed.

Let me know if you are interested in more details.

All the best
-g
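[For reference, a fetch from the GitHub mirror might look like the sketch below. The repository path and tag name are assumptions based on the usual tcltk mirror layout, and the top-level directory note is likely the renaming mentioned above:]

```
#!/bin/sh
# Fetch a tagged release from the GitHub mirror. Repo path and tag
# are assumed for illustration; adjust to the actual release wanted.
wget -q -O tcl.tar.gz \
    "https://github.com/tcltk/tcl/archive/refs/tags/core-9-0-1.tar.gz"

# GitHub archives unpack into REPO-TAG/ (e.g. tcl-core-9-0-1/), which
# differs from the fossil tarball's top-level directory name.
tar -tzf tcl.tar.gz | head -n 1
```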