Josh's blog

Ghosts

Scene: The Internet, late 2015. Two computers are talking via HTTP.

Computer 1: “Hello friend! Can I have /?feed=atom please?”

Computer 2: “Yes, here you go. Stay in touch! 😄”

Almost exactly 90 minutes later, they have almost the exact same conversation.


This blog has feeds, in three different flavours. If you care about RSS you might already know this, because the <link> tags in the source of this page describe the URLs where you can find them, and you might be browsing with software that recognises those tags.
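Feed autodiscovery is simple enough to sketch. Here's a toy version in Python; the sample markup is illustrative, not this blog's actual source. It scans a page for the <link rel="alternate"> tags a feed reader looks for:

```python
# Toy feed autodiscovery: collect <link rel="alternate"> tags that point
# at feeds. SAMPLE_PAGE is made up for illustration.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<head>
  <link rel="alternate" type="application/atom+xml" href="/?feed=atom">
  <link rel="alternate" type="application/rss+xml" href="/?feed=rss">
</head>
"""

class FeedLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "alternate" \
                and "xml" in attrs.get("type", ""):
            self.found.append(attrs["href"])

finder = FeedLinkFinder()
finder.feed(SAMPLE_PAGE)
print(finder.found)  # ['/?feed=atom', '/?feed=rss']
```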


They would have had the same conversation another 90 minutes later, but something had changed. After pedantically checking the Domain Name System (as always), Computer 1 was directed to talk to Computer 3 instead.

Computer 1: “Can I have /?feed=atom please?”

Computer 3: “Never heard of it! Go away!”

This conversation repeated every 30 minutes or so, for a while. A while.

Computer 1 misses its friend.


Back in the heyday of RSS, feeds were often aggregated into other feeds. This was a matter of convenience: with so many good RSS feeds available, it was easier to configure your RSS reader to follow just one aggregated feed on a topic that someone else had done the work to curate, as opposed to inputting the URL of each and every feed of interest. (Google Reader would subsequently improve, and then demolish, the feed-following experience.)

Aggregating feeds was a straightforward-seeming software engineering exercise: periodically consume a list of source feeds, process the content, and format it into a single output feed.
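Reduced to a sketch in modern Python, using the feedparser library (today's packaging of the Universal Feed Parser mentioned below), the core of the exercise looks something like this. The source URLs are placeholders, and formatting the merged entries back out as a single feed is left out:

```python
# Minimal feed aggregation sketch: fetch each source feed, merge the
# entries, newest first. Source URLs are placeholders.
import feedparser

SOURCES = [
    "http://example.org/?feed=atom",
    "http://example.net/blog/feed/",
]

def aggregate(sources):
    entries = []
    for url in sources:
        parsed = feedparser.parse(url)
        entries.extend(parsed.entries)
    # struct_time sorts like a tuple, so missing dates sink to the bottom.
    entries.sort(key=lambda e: e.get("published_parsed")
                 or e.get("updated_parsed") or (),
                 reverse=True)
    return entries

for entry in aggregate(SOURCES):
    print(entry.get("title"), "-", entry.get("link"))
```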

One implementation of such an aggregator, in Python, was called Planet Planet. To fetch the feeds, and to handle the assortment of feed formats, Planet Planet used the Universal Feed Parser. You can recognise requests from Universal Feed Parser in a web server log by its User-Agent string.
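Older releases announced themselves with a User-Agent along the lines of UniversalFeedParser/4.1 +http://feedparser.org/, though the exact string varies by version, so it pays to match loosely. A crude scan of an Apache log (the log path here is an assumption) might be:

```python
# Pick Universal Feed Parser requests out of an Apache access log.
with open("/var/log/apache2/access.log") as log:
    for line in log:
        if "UniversalFeedParser" in line:
            print(line.rstrip())
```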

Planet Planet was configured via a config.ini file listing the URLs to ingest as feeds, and was typically run periodically on its host using cron.
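From memory (so treat the field names as approximate), the config paired a [Planet] section with one section per source feed, keyed by the feed's URL:

```ini
[Planet]
name = Planet Example
link = http://planet.example.org/
output_dir = /var/www/planet

[http://example.org/?feed=atom]
name = Josh's blog
```

And a crontab entry to regenerate it every half hour might look like this, with all the paths assumed:

```
*/30 * * * * python /opt/planet/planet.py /etc/planet/config.ini
```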


Computer 1 (tired of this): “Friend, are you there?”

Computer 3: “I’ve been upgraded. Look, have you tried talking to me with HTTPS? It’s better for everybody’s security if you do.”

Computer 1 (to itself): “😔 Well, I tried.”

Computer 1 gives up… for now. It will strike up the conversation again, with perfectly programmed optimism, in almost exactly 30 minutes. Again. And again. And again. And again. And again. And again. And…

For several years.


WordPress is a popular website CMS and blogging platform, implemented in PHP. It was commonly deployed on a LAMP stack (Linux, Apache, MySQL, PHP), though a centralised, hosted form of it also exists. As a blogging platform, it also serves up RSS feeds.

The RSS feed is typically available on a WordPress site at /feed/. There are other ways of requesting the same feed, presumably for compatibility with earlier conventions: /?feed=rss and /?feed=atom.
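A quick way to convince yourself the endpoints are equivalent is to request each one and compare what comes back; example.org here stands in for a real WordPress blog:

```python
# Fetch each feed endpoint variant and show the status and content type.
import urllib.request

for path in ("/feed/", "/?feed=rss", "/?feed=atom"):
    with urllib.request.urlopen("https://example.org" + path) as resp:
        print(path, resp.status, resp.headers.get("Content-Type"))
```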


Computers 4 through N (in chorus, seemingly unprompted): “Hey, I heard you have this file? /wp-content/plugins/latex/cache/tex_174ea3aa1e90a695b482fe253768676b.gif?”

Computers 4 through N: “Hey, I heard you have this file? /wp-content/plugins/latex/cache/tex_aad18c0a88969b4c1bdc3711475796c2.gif?”

Computers 4 through N: “Hey, I heard you have …” etc

Computer 3: “What? No! Buzz off, all of you. Where are you even getting these paths?”

Computer N: “You used to have them…”


After consuming its configured feeds, Planet Planet produces static HTML and RSS output intended to be served by some web server software, often Apache. The text or HTML content of each article is copied to the output largely unmodified. Some potentially dangerous HTML is scrubbed; other tags, like <img>, remain. Such images are not scraped by Planet Planet; rather, their URLs remain pointing wherever they pointed originally, as provided by the source RSS feed.

In other words, Planet Planet outputs hot-links to images hosted elsewhere.
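A toy version of that scrubbing step shows why: strip the obviously dangerous tags, keep the rest, and the <img> tags sail through with their original src attributes intact. (Planet's real scrubbing came via its feed parser's sanitiser, if memory serves; this regex version, and the image path in it, are made up for illustration.)

```python
import re

DANGEROUS = ("script", "style", "iframe", "object", "embed")

def scrub(html):
    # Remove dangerous elements wholesale; leave everything else alone.
    for tag in DANGEROUS:
        html = re.sub(r"<%s\b.*?</%s>" % (tag, tag), "", html,
                      flags=re.DOTALL | re.IGNORECASE)
    return html

article = ('<p>Proof: <img src="http://example.org/wp-content/plugins/'
           'latex/cache/tex_abc.gif"></p><script>alert(1)</script>')
print(scrub(article))
# The <img> survives, still pointing at example.org: a hot-link.
```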

Sometimes, Universal Feed Parser (used by Planet Planet) can't fetch a feed it has been asked to. Being quite old and defunct, Universal Feed Parser doesn't follow 302 redirects, at least not to the same feed available under HTTPS.

In any case, if the feed isn't fetched, Planet Planet will try again later; in the meantime, it keeps a copy of the articles it already had, and uses those.
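That fallback behaviour is easy to sketch too. Assuming a simple JSON cache file per feed (an invented detail, not Planet's actual on-disk format):

```python
# Keep-what-you-had fallback: refresh the cache on success, serve the
# cached entries when a fetch fails.
import json
import feedparser

def fetch_with_fallback(url, cache_path):
    parsed = feedparser.parse(url)
    if parsed.entries:
        # Success: refresh the cache with what we just fetched.
        with open(cache_path, "w") as f:
            json.dump([{"title": e.get("title"), "link": e.get("link")}
                       for e in parsed.entries], f)
        return parsed.entries
    # Fetch failed (or came back empty): serve the articles we already had.
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []
```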

Search engines also pass around rumours of secret URLs that exist, opening to the right sequence of door-knocks and magic passwords, but that, when queried during business hours under the bright light of the full overhead sun, have vanished, like ghosts.


Computer 1 (weary): “The usual…honestly not expecting anything…”

Computer 3 (interrupting): “Hello! You want the Atom feed? Here you go! Nice to hear from you! Let’s be friends! 😄”

Computer 1: “😯 … 🤯”


I’m glad our computers can be friends again.