Published January 7, 2020, Page Last Modified July 21, 2023
[Disclaimer: I don’t archive information as much as I used to, though of course you are free to archive away!]
pip install beautifulsoup4
pip install archivenow
python archivify.py input.html
While the script works just fine, the Internet Archive sometimes throws errors when trying to archive a link (403 Forbidden, 502 Server Error: Bad Gateway for url). In these cases, the script leaves that link unchanged and doesn’t add an ‘(a)’ pointing to an archived copy.
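Since errors such as 502 Bad Gateway may be temporary, one option is to retry a few times before giving up on a link. The following is a minimal sketch and not part of the original script: `is_archive_error` and `push_with_retries` are hypothetical helper names, and the archiving call is passed in as a parameter (in the script it could be something like `lambda url: archivenow.push(url, "ia")[0]`) so the logic can be tried without network access. The error check mirrors the one the script uses, and the responses in the demo are made up.

```python
import time
from typing import Callable, Optional


def is_archive_error(result: str) -> bool:
    """Same check the script uses to detect a failed archive attempt."""
    return result.startswith("Error") or "No HTTP" in result


def push_with_retries(
    push: Callable[[str], str],
    url: str,
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[str]:
    """Call push(url) up to `attempts` times; return the archived URL or None."""
    for attempt in range(attempts):
        result = push(url)
        if not is_archive_error(result):
            return result
        # Wait before the next attempt, but not after the last one
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return None


# Demo with a stand-in for the archiving call (made-up responses, no network):
# it fails twice, then succeeds.
responses = iter([
    "Error: 502 Server Error: Bad Gateway",
    "Error: 403 Forbidden",
    "https://web.archive.org/web/2020/https://example.com/",
])
print(push_with_retries(lambda url: next(responses), "https://example.com/"))
```

A `None` result would be treated the same way a failed attempt is treated now: the link is left without an ‘(a)’.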
import sys

from bs4 import BeautifulSoup
from archivenow import archivenow

# The HTML file to process is the first command-line argument
filename = sys.argv[1]
with open(filename, "r") as f_in:
    text = f_in.read()

soup = BeautifulSoup(text, 'html.parser')
for a in soup.find_all('a'):
    link = a.get('href')
    # Skip anchors that have no href
    if not link:
        continue
    # Save link to Internet Archive (push returns a list of results)
    archived_link = archivenow.push(link, "ia")[0]
    print(archived_link)
    # If there was no error archiving
    if not (archived_link.startswith('Error') or 'No HTTP' in archived_link):
        # Add ' (a)' with archived link after linked text.
        # Each insert_after lands directly after the original tag,
        # so the pieces are inserted in reverse order.
        archived_tag = soup.new_tag("a", href=archived_link)
        archived_tag.append("a")
        a.insert_after(")")
        a.insert_after(archived_tag)
        a.insert_after(" (")

# Write contents to new file, e.g. input.html -> input-archived.html
filename_out = filename.split('.')[0] + '-archived.' + filename.split('.')[1]
with open(filename_out, "w") as f_out:
    f_out.write(str(soup))
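The output name in the script is built by splitting the filename on '.', which assumes exactly one dot in the name and no dots in any directory part. If that assumption doesn't hold for your files, something like the following sketch with `pathlib` might be more robust; `archived_name` is a hypothetical helper and not part of the original script.

```python
from pathlib import Path


def archived_name(filename: str) -> str:
    """Insert '-archived' before the final extension of `filename`."""
    path = Path(filename)
    return str(path.with_name(path.stem + "-archived" + path.suffix))


print(archived_name("input.html"))     # input-archived.html
print(archived_name("my.notes.html"))  # my.notes-archived.html
```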