File Date Author Commit
 README.md 2024-10-01 dragomerlin dragomerlin [8d4437] README.md: fix python names and packages
 recursive-crawler.py 2024-09-20 dragomerlin dragomerlin [216e42] Explain how to create python venv to be compati...
 save-urls-online-headers.py 2024-09-23 dragomerlin dragomerlin [58b8ed] save-urls-online-headers.py: detect invalid res...
 save-urls-online-waybackpy.py 2024-09-23 dragomerlin dragomerlin [921448] Add save-urls-online-waybackpy.py

Read Me

0. TABLE OF CONTENTS

1. PROJECT NAME

batch-waybackmachine-urlsaver is a project designed to automate the archiving of URL(s) in the Internet Archive Wayback Machine.

2. LOCAL PYTHON VENV

Using a local python virtual environment, separate from the global one, is required on some operating systems and is good practice in any case.
This local venv can be located inside each project's directory, or in another place to be shared by all projects/scripts.

Clone this project repository to a new directory:

$ cd /projects/
$ git clone git://git.code.sf.net/p/batch-waybackmachine-urlsaver/code batch-waybackmachine-urlsaver-code
$ cd batch-waybackmachine-urlsaver-code/

Some operating systems do not allow installing modules/extensions system-wide. This can be forced, but it is not recommended:

$ python3 --version
Python 3.12.6
$ which python3
/opt/homebrew/bin/python3
$ which pip3
/opt/homebrew/bin/pip3
$ pip3 install beautifulsoup4
error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.

    If you wish to install a Python library that isn't in Homebrew,
    use a virtual environment:

    python3 -m venv path/to/venv
    source path/to/venv/bin/activate
    python3 -m pip install xyz

    If you wish to install a Python application that isn't in Homebrew,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. You can install pipx with

    brew install pipx

    You may restore the old behavior of pip by passing
    the '--break-system-packages' flag to pip, or by adding
    'break-system-packages = true' to your pip.conf file. The latter
    will permanently disable this error.

    If you disable this error, we STRONGLY recommend that you additionally
    pass the '--user' flag to pip, or set 'user = true' in your pip.conf
    file. Failure to do this can result in a broken Homebrew installation.

    Read more about this behavior here: <https://peps.python.org/pep-0668/>

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.

So the recommendation is to create a local (isolated) python virtual environment inside the git repository:

$ python3 -m venv .venv

The venv directory should be excluded from git tracking when it lives inside the repository, unless it is already being ignored:

$ printf ".venv/\n" >> .gitignore

Then activate the venv each time a new shell/terminal/console instance is opened, if the local venv is not configured as default:

$ source .venv/bin/activate

To check that the local venv is properly activated, print its current location:

$ echo $VIRTUAL_ENV
/projects/batch-waybackmachine-urlsaver-code/.venv
$ python3 -c "import os; print(os.getenv('VIRTUAL_ENV'))"
/projects/batch-waybackmachine-urlsaver-code/.venv
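
The same check can be made from inside a script. The following is a minimal sketch, not code taken from this project: it relies on the fact that inside a venv sys.prefix differs from sys.base_prefix. The function name in_virtualenv is a hypothetical helper chosen for illustration.

```python
import os
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    # VIRTUAL_ENV is set by the activate script, so it is None when the
    # venv python is called directly without activation.
    print("VIRTUAL_ENV:", os.getenv("VIRTUAL_ENV"))
```

Unlike the VIRTUAL_ENV variable, this check also works when the venv interpreter is started by its full path without sourcing the activate script.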

Install the modules required by this project in the local venv (the python scripts themselves also check for them):

$ python3 -m pip install waybackpy requests beautifulsoup4 urllib3 tqdm
Collecting waybackpy
  Using cached waybackpy-3.0.6-py3-none-any.whl.metadata (9.9 kB)
Collecting requests
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting urllib3
  Using cached urllib3-2.2.3-py3-none-any.whl.metadata (6.5 kB)
Collecting tqdm
  Using cached tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting click (from waybackpy)
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Using cached charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests)
  Using cached idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting certifi>=2017.4.17 (from requests)
  Using cached certifi-2024.8.30-py3-none-any.whl.metadata (2.2 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached soupsieve-2.6-py3-none-any.whl.metadata (4.6 kB)
Using cached waybackpy-3.0.6-py3-none-any.whl (34 kB)
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Using cached beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Using cached urllib3-2.2.3-py3-none-any.whl (126 kB)
Using cached tqdm-4.66.5-py3-none-any.whl (78 kB)
Using cached certifi-2024.8.30-py3-none-any.whl (167 kB)
Using cached charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
Using cached idna-3.10-py3-none-any.whl (70 kB)
Using cached soupsieve-2.6-py3-none-any.whl (36 kB)
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Installing collected packages: urllib3, tqdm, soupsieve, idna, click, charset-normalizer, certifi, requests, beautifulsoup4, waybackpy
Successfully installed beautifulsoup4-4.12.3 certifi-2024.8.30 charset-normalizer-3.3.2 click-8.1.7 idna-3.10 requests-2.32.3 soupsieve-2.6 tqdm-4.66.5 urllib3-2.2.3 waybackpy-3.0.6

$ python3 -m pip install waybackpy requests beautifulsoup4 urllib3 tqdm
Requirement already satisfied: waybackpy in ./.venv/lib/python3.12/site-packages (3.0.6)
Requirement already satisfied: requests in ./.venv/lib/python3.12/site-packages (2.32.3)
Requirement already satisfied: beautifulsoup4 in ./.venv/lib/python3.12/site-packages (4.12.3)
Requirement already satisfied: urllib3 in ./.venv/lib/python3.12/site-packages (2.2.3)
Requirement already satisfied: tqdm in ./.venv/lib/python3.12/site-packages (4.66.5)
Requirement already satisfied: click in ./.venv/lib/python3.12/site-packages (from waybackpy) (8.1.7)
Requirement already satisfied: charset-normalizer<4,>=2 in ./.venv/lib/python3.12/site-packages (from requests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./.venv/lib/python3.12/site-packages (from requests) (3.10)
Requirement already satisfied: certifi>=2017.4.17 in ./.venv/lib/python3.12/site-packages (from requests) (2024.8.30)
Requirement already satisfied: soupsieve>1.2 in ./.venv/lib/python3.12/site-packages (from beautifulsoup4) (2.6)

To deactivate the local venv and return to the global environment, just run deactivate. The output will then look something like this:

$ deactivate
$ echo $VIRTUAL_ENV

$ python3 -c "import os; print(os.getenv('VIRTUAL_ENV'))"
None

3. INDIVIDUAL FILES

recursive-crawler.py is a python3 script that scans the URL given as argument (tested with https://packages-prod.broadcom.com/tools/releases/latest/) and recursively detects all files and directories listed as HTML links. It generates the files crawler_visited_urls.txt and crawler_extracted_urls.txt; the latter can be used as input for the next script.
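
The link detection step of such a crawler can be sketched as follows. This is an illustration only and not the code of recursive-crawler.py: it uses the standard library html.parser instead of beautifulsoup4 so that it runs without extra modules, the names LinkCollector and extract_links are hypothetical, and it assumes that base_url ends with a slash and that directory links end with a slash, as in a typical web server index page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    """Return (directories, files) found in html that live below base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    dirs, files = [], []
    for url in parser.links:
        parts = urlsplit(url)
        # Skip column-sort links (query strings) and in-page anchors.
        if parts.query or parts.fragment:
            continue
        # Stay below the starting URL: drops "../" and external links.
        if not url.startswith(base_url) or url == base_url:
            continue
        (dirs if url.endswith("/") else files).append(url)
    return dirs, files
```

A recursive crawl would call extract_links again for each returned directory, while remembering the visited URLs to avoid loops.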

save-urls-online-waybackpy.py is a python3 script which reads the input text file given as first argument. Each line must be a URL. Successfully saved URLs are deleted from the input file and saved to completed.txt, and the ones which fail are moved to failed.txt. Pressing Q/q makes the script exit safely after the current URL is archived, without waiting for all URLs to be archived. Exiting or pausing with Ctrl+C or Ctrl+Z is also safe.
It uses the module waybackpy to communicate with the Internet Archive servers, but that module offers very little configuration and is prone to shadowbanning.
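
The file bookkeeping described above can be sketched as follows. This is a hedged illustration, not the script's actual code: the function record_result and its parameters are hypothetical, and only the file names completed.txt and failed.txt come from this README.

```python
from pathlib import Path


def record_result(input_path, url, succeeded,
                  completed_path="completed.txt", failed_path="failed.txt"):
    """Append url to the matching result file, then drop it from the input file."""
    # Write the result first: if the script is interrupted between the two
    # steps, the URL is at worst listed twice, never lost.
    target = Path(completed_path if succeeded else failed_path)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(url + "\n")

    # Rewrite the input file without the processed URL and without blank lines.
    source = Path(input_path)
    remaining = [
        line
        for line in source.read_text(encoding="utf-8").splitlines()
        if line.strip() and line.strip() != url
    ]
    source.write_text("".join(line + "\n" for line in remaining), encoding="utf-8")
```

Updating the files after every URL, instead of once at the end, is what would make exiting with Q/q or Ctrl+C safe: the input file always holds only the pending URLs.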

save-urls-online-headers.py is like the previous one, but instead of using the third party module waybackpy, it sends custom crafted headers like those sent by a desktop web browser. The Internet Archive is persistent in not wanting to save pages except from a desktop web browser.

4. ARCHIVING FAILURES

1) Sometimes the Web Archive does not return any bad HTTP code, which makes it look like the page has been correctly archived. However, when actually browsing the returned archived URL, different messages may appear as web text in the DOM object:

<div class="row">
    <div class="col-md-4 col-md-offset-4">
        <h2>Sorry</h2>
        <p>Job failed</p>
        <div class="text-center">
            <a href="/save">Return to Save Page Now</a>
        </div>
    </div>
</div>
<div class="row">
    <div class="col-md-4 col-md-offset-4">
        <h2>Sorry</h2>
        <p>You cannot make more than (200,) captures per day. Please email us at "info@archive.org" if you would like to discuss this more.</p>
        <div class="text-center">
            <a href="/save">Return to Save Page Now</a>
        </div>
    </div>
</div>
<div class="row">
    <div class="col-md-4 col-md-offset-4">
        <h2>Sorry</h2>
        <p>This URL is in the Save Page Now service block list and cannot be captured. Please email us at "info@archive.org" if you would like to discuss this more.</p>
        <div class="text-center">
            <a href="/save">Return to Save Page Now</a>
        </div>
    </div>
</div>
    <noscript>
      <div class="no-script-message">
        The Wayback Machine requires your browser to support JavaScript, please email <a href="mailto:info@archive.org">info@archive.org</a><br/>if you have any questions about this.
      </div>
    </noscript>
    <footer>
      <div id="footerHome">
        <p>
          The Wayback Machine is an initiative of the
          <a href="//archive.org/">Internet Archive</a>,
          a 501(c)(3) non-profit, building a digital library of
          Internet sites and other cultural artifacts in digital form.
          <br>Other <a href="//archive.org/projects/">projects</a> include
          <a href="https://openlibrary.org/">Open Library</a> &amp;
          <a href="https://archive-it.org">archive-it.org</a>.
        </p>
        <p>
          Your use of the Wayback Machine is subject to the Internet Archive's
          <a href="//archive.org/about/terms.php">Terms of Use</a>.
        </p>
      </div>
    </footer>
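
If the response body is inspected, the messages above can be searched for as plain text. This is a minimal sketch and not part of the project's scripts: the names FAILURE_MARKERS and find_failure_marker are hypothetical, and the marker strings are copied from the HTML fragments shown above. A substring match can give a false positive when the archived page itself contains one of these phrases.

```python
# Phrases taken from the "Sorry" pages shown in this README.
FAILURE_MARKERS = (
    "Job failed",
    "You cannot make more than",
    "Save Page Now service block list",
)


def find_failure_marker(html):
    """Return the first known failure phrase found in html, or None."""
    for marker in FAILURE_MARKERS:
        if marker in html:
            return marker
    return None
```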

This is a sample stripped log when trying to archive some URLs with save-urls-online-headers.py. Note that the date and time are missing from the response URL:

$ python3 save-urls-online-headers.py urls.txt
Checking modules availability...
Starting key listener thread for exit command...
Starting jobs, press 'Q'/'q' to exit after any current iteration...
Archiving (1/359985): https://packages-prod.broadcom.com/tools/esx/4.0ep09/rhel6/i686/vmware-open-vm-tools-kmod-8.0.5-989856.el6.i686.rpm
Archived URL: https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/4.0ep09/rhel6/i686/vmware-open-vm-tools-kmod-8.0.5-989856.el6.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/4.0ep09/rhel6/i686/vmware-open-vm-tools-kmod-8.0.5-989856.el6.i686.rpm: Failed to retrieve archived page. Status Code: 520
Exit requested, waiting for current operation to complete...
Sleeping for 30 seconds (1/359984)...:   7%|     | 2/30 [00:02<00:28,  1.01s/s]
Exiting by request.

A valid URL would be something like this:

https://web.archive.org/web/20240923083252/https://packages-prod.broadcom.com/tools/esx/4.0ep09/rhel6/i686/vmware-open-vm-tools-kmod-8.0.5-989856.el6.i686.rpm

The correct way to detect whether a URL was archived, without inspecting the DOM itself, is to check the response URL format. For example, to archive the URL:

https://www.rapidtables.com/web/color/orange-color.html

the correct format is:

https://web.archive.org/web/20240822144645/https://www.rapidtables.com/web/color/orange-color.html

a failed URL is:

https://web.archive.org/save/https://www.rapidtables.com/web/color/orange-color.html

The python script now detects the failure:

$ python3 save-urls-online-headers.py urls.txt
Checking modules availability...
Starting key listener thread for exit command...
Starting jobs, press 'Q'/'q' to exit after any current iteration...
Archiving (1/359981): https://packages-prod.broadcom.com/tools/esx/4.0ep09/rhel6/i686/vmware-open-vm-tools-xorg-utilities-8.0.5-989856.el6.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/4.0ep09/rhel6/i686/vmware-open-vm-tools-xorg-utilities-8.0.5-989856.el6.i686.rpm: Archived URL does not start with 'https://web.archive.org/web/'.
Exit requested, waiting for current operation to complete...
Sleeping for 30 seconds (1/359980)...:  13%|     | 4/30 [00:04<00:26,  1.00s/s]
Exiting by request.
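
The response URL check can be sketched as follows. It is an illustration under assumptions rather than the script's own code: the script's log message only mentions the 'https://web.archive.org/web/' prefix, while this sketch additionally requires the 14-digit date and time seen in the valid examples above. The names ARCHIVED_URL_RE and parse_archived_url are hypothetical.

```python
import re

# Prefix, 14-digit timestamp (YYYYMMDDhhmmss), then the original URL.
ARCHIVED_URL_RE = re.compile(r"^https://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_archived_url(response_url):
    """Return (timestamp, original_url) for a valid archived URL, else None."""
    match = ARCHIVED_URL_RE.match(response_url)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    ok = "https://web.archive.org/web/20240822144645/https://www.rapidtables.com/web/color/orange-color.html"
    bad = "https://web.archive.org/save/https://www.rapidtables.com/web/color/orange-color.html"
    print(parse_archived_url(ok))
    print(parse_archived_url(bad))
```

Response URLs that start with https://web.archive.org/save/ are rejected because they carry no timestamp.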

The problem does not end here, because sending a valid 'Cookie' header with curl does not make it work either (even with a registered account that has made donations). The Wayback Machine requires JavaScript for that kind of unlimited archival, which can be achieved using a fully-fledged desktop web browser. They want a physical person committing the requests to prevent abuse.

2) This is a sample stripped log when trying to archive some URLs with save-urls-online-waybackpy.py:

$ python3 save-urls-online.py urls.txt
Checking modules availability...
Starting key listener thread for exit command...
Starting jobs, press 'Q'/'q' to exit after any current iteration...
Archiving (1/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/headers/vmware-tools-kmod-0-7.4.8-396269.423167.el4.x86_64.hdr
Archived URL: https://web.archive.org/web/20240920084704/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/headers/vmware-tools-kmod-0-7.4.8-396269.423167.el4.x86_64.hdr
Sleeping for 30 seconds (1/364016)...: 100%|█████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (8/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el4.x86_64.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el4.x86_64.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el4.x86_64.rpm.
Response URL:
https://web.archive.org/save/_embed/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el4.x86_64.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 08:53:29 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app52', 'x-ts': '520', 'x-tr': '24', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'BYPASS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (8/364016)...: 100%|█████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (19/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/vmware-tools-nox-7.4.8-396269.423167.el4.x86_64.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/vmware-tools-nox-7.4.8-396269.423167.el4.x86_64.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/vmware-tools-nox-7.4.8-396269.423167.el4.x86_64.rpm.
Response URL:
https://web.archive.org/save/_embed/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel4/x86_64/vmware-tools-nox-7.4.8-396269.423167.el4.x86_64.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 09:02:42 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app53', 'x-ts': '520', 'x-tr': '12', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'BYPASS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (19/364016)...: 100%|████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (20/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/.
Response URL:
https://web.archive.org/save/_embed/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 09:05:04 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app14', 'x-ts': '520', 'x-tr': '12', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=0', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'BYPASS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (20/364016)...: 100%|████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (25/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-7.4.8-396269.423167.el5.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-7.4.8-396269.423167.el5.i686.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-7.4.8-396269.423167.el5.i686.rpm.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-7.4.8-396269.423167.el5.i686.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 09:21:56 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app14', 'x-ts': '520', 'x-tr': '26789', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (25/364016)...: 100%|████| 30/30 [00:30<00:00,  1.01s/s]
Archiving (26/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-common-7.4.8-396269.423167.el5.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-common-7.4.8-396269.423167.el5.i686.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-common-7.4.8-396269.423167.el5.i686.rpm.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-common-7.4.8-396269.423167.el5.i686.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 09:23:32 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app53', 'x-ts': '520', 'x-tr': '1278', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (26/364016)...: 100%|████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (27/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-kmod-7.4.8-396269.423167.el5.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-kmod-7.4.8-396269.423167.el5.i686.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-kmod-7.4.8-396269.423167.el5.i686.rpm.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-kmod-7.4.8-396269.423167.el5.i686.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 09:25:41 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app14', 'x-ts': '520', 'x-tr': '7730', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=0', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (27/364016)...: 100%|████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (28/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-nox-7.4.8-396269.423167.el5.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-nox-7.4.8-396269.423167.el5.i686.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-nox-7.4.8-396269.423167.el5.i686.rpm.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-nox-7.4.8-396269.423167.el5.i686.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 09:37:48 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '232', 'Connection': 'keep-alive', 'x-app-server': 'wwwb-app52', 'x-ts': '404', 'x-tr': '60025', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (28/364016)...: 100%|████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (29/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-display-10.15.0.0-0.396269.423167.el5.i686.rpm
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-display-10.15.0.0-0.396269.423167.el5.i686.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-display-10.15.0.0-0.396269.423167.el5.i686.rpm.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-display-10.15.0.0-0.396269.423167.el5.i686.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 10:02:10 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '232', 'Connection': 'keep-alive', 'x-app-server': 'wwwb-app52', 'x-ts': '404', 'x-tr': '30015', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (29/364016)...: 100%|████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (30/364017): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el5.i686.rpm
Exit requested, waiting for current operation to complete...
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el5.i686.rpm: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el5.i686.rpm.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/open-vm-tools-xorg-drv-mouse-12.4.1.0-0.396269.423167.el5.i686.rpm
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 10:13:34 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '232', 'Connection': 'keep-alive', 'x-app-server': 'wwwb-app14', 'x-ts': '404', 'x-tr': '30019', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=0', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (6/363986)...: 100%|█████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (7/363987): https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/repodata/repomd.xml.asc
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/repodata/repomd.xml.asc: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/repodata/repomd.xml.asc.
Response URL:
https://web.archive.org/save/_embed/https://packages-prod.broadcom.com/tools/esx/3.5latest/rhel5/i686/repodata/repomd.xml.asc
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 19:08:04 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app53', 'x-ts': '520', 'x-tr': '11', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'BYPASS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Sleeping for 30 seconds (1/363839)...: 100%|█████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (2/363840): https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-common_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb
Unexpected error while archiving https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-common_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /save/https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-common_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x107b99160>: Failed to establish a new connection: [Errno 61] Connection refused'))
Sleeping for 30 seconds (2/363839)...: 100%|█████| 30/30 [00:30<00:00,  1.00s/s]
Archiving (3/363840): https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-kmod-2.6.24-16-generic_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb
Failed to archive https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-kmod-2.6.24-16-generic_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb: Tried 8 times but failed to save and retrieve the archive for https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-kmod-2.6.24-16-generic_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb.
Response URL:
https://web.archive.org/save/https://packages-prod.broadcom.com/tools/esx/3.5latest/ubuntu/dists/hardy/main/binary-amd64/vmware-open-vm-tools-kmod-2.6.24-16-generic_7.4.8-0.396269.423167_ubuntu8.04.amd64.deb
Response Header:
{'Server': 'nginx', 'Date': 'Fri, 20 Sep 2024 23:20:50 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'cache-control': 'no-cache', 'x-app-server': 'wwwb-app52', 'x-ts': '520', 'x-tr': '10', 'server-timing': 'TR;dur=0,Tw;dur=0,Tc;dur=1', 'X-location': 'save-sync', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()'}
Exit requested, waiting for current operation to complete...
Sleeping for 30 seconds (3/363839)...:   3%|     | 1/30 [00:01<00:29,  1.01s/s]
Exiting by request.

One archival succeeded and the others failed. By default the module retries 8 times for each URL, which makes the script very slow if the remote server starts throttling or shadowbanning the Wayback Machine.
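
To put the slowness in numbers, a lower bound on the total run time can be estimated from the 30 second pause and the URL count visible in the log above. This is only a rough sketch: the function minimum_run_days is hypothetical, and it ignores the request time and the retries, which only make the run longer.

```python
def minimum_run_days(url_count, pause_seconds=30):
    """Lower bound on the run time in days, counting only the pauses."""
    return url_count * pause_seconds / 86_400  # 86400 seconds per day


if __name__ == "__main__":
    # 364017 is the number of URLs shown in the log above.
    print(f"{minimum_run_days(364_017):.1f} days")  # prints "126.4 days"
```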
