Archivify
Published January 7, 2020, Page Last Modified July 21, 2023
[Disclaimer: I don’t archive information as much as I used to, though of course, you are free to archive away!].
Milan Griffes (a), Alexey Guzey, and I seem to like the (a) notation for archiving links on our websites.
Here’s a short script, archivify.py, I wrote that takes an html page, archives each link with the Internet Archive, and then automatically adds the ‘(a)’ notation afterward!
Essentially, it converts 'I like tacos' to 'I like tacos (a)'.
Installations
Before you run it, first install Beautiful Soup (a) and archivenow (a):
pip install beautifulsoup4
pip install archivenow
Usage
python archivify.py input.html
Error Handling
While the script works just fine, the Internet Archive sometimes throws errors when trying to archive a link (403 Forbidden, 502 Server Error: Bad Gateway for url). In these cases, the script doesn’t add an ‘(a)’.
archivify.py
import sys
from bs4 import BeautifulSoup
from archivenow import archivenow
filename = sys.argv[1]
f_in = open(filename, "r")
text = f_in.read()
soup = BeautifulSoup(text, 'html.parser')
for a in soup.find_all('a'):
link = a.get('href')
# Save link to Internet Archive
archived_link = archivenow.push(link, "ia")[0]
print('archived_link')
# If there was no error archiving
if not (archived_link.startswith('Error') or 'No HTTP' in archived_link):
# Add ' (a)' with archived link after linked text
archived_tag = soup.new_tag("a", href=archived_link)
archived_tag.append("a")
a.insert_after(")")
a.insert_after(archived_tag)
a.insert_after(" (")
# Write contents to new file
filename_out = filename.split('.')[0] + '-archived.' + filename.split('.')[1]
f_out = open(filename_out, "w")
f_out.write(str(soup))
Citation
Zuckerman, Andrew, "Archivify", January 7, 2020, http://andzuck.com/projects/archivify/
Subscribe
Get notified when I write new essays
Comments powered by Talkyard.