Link Updater Revisited

Published: 29 Dec 2018 at 00:00 UTC Updated: 28 Oct 2019 at 00:00 UTC

One random night I was browsing Twitter when someone wondered if there was a tool that could update link locations in HTML whenever an asset got renamed. Being bored at the time, I decided to spend the next three hours of my life writing this link-updater script for a random person I’ve never met.

This project had three parts to it:

1. Watching the hosting directory for any moved files and reporting any changes.
2. Looking through the HTML files in that directory for any references to the moved file.
3. Updating the former references to the moved file so they point to its current location.

To do 1, I used the watchdog library, which can monitor file system events. I had some previous experience with this library from making diary-locker. The documentation was hard to understand and their examples didn’t work, but I figured it out. I might make a tutorial on this library, but you should have something like this when using watchdog:

import os
import time

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

CURRENT_DIR = os.getcwd()

class CustomEventHandler(FileSystemEventHandler):
    # Add the events you want to handle here. List of file system events that can be handled can be found here: https://pythonhosted.org/watchdog/api.html#event-handler-classes
    def on_moved(self, event):
        # handle the moved file event here. event contains a src_path and a dest_path
        pass

def main():
    event_handler = CustomEventHandler()
    observer = Observer()
    observer.schedule(event_handler, CURRENT_DIR, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        # Stop the observer thread cleanly on Ctrl+C.
        observer.stop()
    observer.join()

if __name__ == "__main__":
    main()

Essentially, you create an event handler (aptly named event_handler), which is an instance of a class that inherits from FileSystemEventHandler. This allows us to handle various file system events like created, modified, moved or deleted files and directories.

The observer (aptly named observer) monitors the given directory for the events we want to handle. There’s an optional parameter, recursive, that makes it monitor all subdirectories of that directory as well.

Having found a way to monitor directories, we now have to find all the HTML files in the directory. This was rather trivial thanks to the glob module, which returns files matching a certain pattern. In this case, it would look like this:

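import glob
import os

CURRENT_DIR = os.getcwd()
# "**" only descends into subdirectories when recursive=True is passed.
html_docs = glob.iglob(os.path.join(CURRENT_DIR, "**", "*.html"), recursive=True)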

Here iglob returns an iterator instead of the full list that glob.glob would return, so the matches don’t all have to be collected in memory up front.
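
To make the lazy, recursive search a bit more concrete, here is a minimal sketch that runs against a throwaway directory. The helper name find_html and the file names are made up for illustration; the only real API used is glob.iglob with its recursive parameter.

```python
import glob
import os
import tempfile

def find_html(root):
    """Lazily yield every .html file under root, including subdirectories."""
    # "**" only descends into subdirectories when recursive=True is passed.
    pattern = os.path.join(root, "**", "*.html")
    return glob.iglob(pattern, recursive=True)

# Build a small throwaway site to search (hypothetical file names).
site = tempfile.mkdtemp()
os.makedirs(os.path.join(site, "blog"))
for name in ("index.html", os.path.join("blog", "post.html"), "style.css"):
    open(os.path.join(site, name), "w").close()

# Normalise the matches to forward-slash paths relative to the site root.
found = sorted(
    os.path.relpath(path, site).replace(os.sep, "/") for path in find_html(site)
)
print(found)
```

Only the two HTML files are picked up; style.css is skipped because it doesn’t match the pattern.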

With the HTML files, we move into the most interesting part: finding each instance where the moved file is mentioned and updating the reference.

To do this, I looked through the HTML files and replaced the old links with the new links. I thought I would need Beautiful Soup for this, but str.replace is much simpler and saves a dependency¹.

def on_moved(self, event):
    logging.debug("File moved from %s to %s", event.src_path, event.dest_path)
    # Turn a file system path into a relative, forward-slash link usable in HTML.
    mkuri = lambda x: os.path.relpath(x).replace("\\", "/")
    old_link, new_link = mkuri(event.src_path), mkuri(event.dest_path)
    html_docs = glob.iglob(os.path.join(CURRENT_DIR, "**", "*.html"), recursive=True)
    for doc in html_docs:
        with open(doc, "r") as html:
            body = html.read()
        new_body = body.replace(old_link, new_link)
        # Only rewrite the documents that actually referenced the moved file.
        if new_body != body:
            with open(doc, "w") as html:
                html.write(new_body)

There are a few strange things I did. For one, I created a lambda function, mkuri, which would make a link I could use in HTML by making a relative link to the file. This also meant that I had to replace \\ with / because Windows.
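
For what it’s worth, the same idea can be written as a named function. This is only a sketch: make_link and site_root are names I’m making up here, and unlike mkuri it takes the directory the link should be relative to as an argument instead of relying on the current working directory.

```python
import os

def make_link(path, site_root):
    """Return path as a link relative to site_root, using forward slashes."""
    relative = os.path.relpath(path, site_root)
    # os.sep is "\\" on Windows and "/" elsewhere, so this is a no-op off Windows.
    return relative.replace(os.sep, "/")

# Hypothetical site layout, built with os.path.join so it runs on any OS.
root = os.path.join(os.sep, "site")
link = make_link(os.path.join(root, "img", "logo.png"), root)
print(link)
```

Replacing os.sep rather than a hard-coded backslash keeps the function harmless on systems where backslashes are legal characters in file names.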

Also, I had to rewrite the whole file because I couldn’t find a way to modify just the part I needed.
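
In fairness, rewriting the file is hard to avoid when the old and new links have different lengths. One way to make the rewrite a little safer is to write the new contents to a temporary file and swap it in with os.replace, so a crash halfway through doesn’t leave a half-written document. The sketch below is not what the script does; replace_in_file is a hypothetical helper, it assumes the documents are UTF-8, it doesn’t preserve file permissions, and it does nothing about unsaved changes in an editor.

```python
import os
import tempfile

def replace_in_file(doc, old, new):
    """Replace old with new in doc; return True if the file was rewritten."""
    with open(doc, "r", encoding="utf-8") as f:
        body = f.read()
    new_body = body.replace(old, new)
    if new_body == body:
        return False  # nothing referenced the old link, so leave the file alone
    # Write to a temporary file next to doc, then swap it in as one step.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(doc)))
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(new_body)
    os.replace(tmp_path, doc)
    return True

# Try it on a throwaway page (hypothetical paths).
demo_dir = tempfile.mkdtemp()
page = os.path.join(demo_dir, "index.html")
with open(page, "w", encoding="utf-8") as f:
    f.write('<img src="img/old.png">')

changed = replace_in_file(page, "img/old.png", "assets/new.png")
with open(page, encoding="utf-8") as f:
    updated = f.read()
print(changed, updated)
```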

Conclusion

Looking back at this, it was a fun use of the hours and I wish I did random explorations like this more often. That being said, this doesn’t work properly. While it does notice changes in directory structure just fine, it doesn’t update the references properly. Since I overwrite the entire file rather than just the change, this risks overwriting HTML files that have unsaved changes. At the same time, if I added an asset to the HTML without saving and then moved the asset, the reference wouldn’t be updated.

To fix this, I could try to detect if the file is currently open with unsaved changes. However, I doubt this script is the best way to approach this problem. To do this properly, it would be better to make an extension for an IDE or text editor to do this. That might limit the potential of overwriting unsaved files.

I remember feeling chuffed that I did this in 3 hours with minimal code. However, I missed some vital functionality. Oh well, we live and learn ¯\_(ツ)_/¯.

¹ There’s stuff about regular expressions being a bad way to parse HTML files. I’m ashamed to have done this more than once.