What's new detection script debug and fix

2024-02-07

This document provides a brief overview of the debugging process I undertook for the JavaScript code responsible for detecting new additions and changes to my website.

The issue arose when I added a function to detect modifications to existing files in addition to detecting new files added. Notably, several files that had no actual changes were mistakenly marked as "changed".

Below is the log from when the error initially occurred:

The "What's New" section malfunctioned; instead of updating the one article that I had updated, the website incorrectly marked old articles as "updated" as well. I wanted to figure out why.

Checked GitHub commit history -> The changes were all related to indexing, for example:
<ol class="chapter"><li class="chapter-item affix "><li class="part-title">Content</li><li class="chapter-item expanded "><a href="../Creation/index.html".
...
Checked if git checkout was related. -> It was not.

The problem was caused by adding a new file, which caused mdbook to re-index during building, and git picked that up.

~~Solution idea 1: In the "What's New" detection code, detect changes beyond 10 lines~~

This approach was unsuccessful because sometimes the re-indexing exceeded 10 lines.

Solution idea 2: In the "What's New" detection code, ignore changes that start with <ol class="chapter">, <a rel="next", and <a rel="prev".

I discovered that GitHub displays the content of the file change in a file.patch section in the JSON file.

I needed to know if file.patch separated changes into different sections or if it just dumped everything together -> file.patch does indeed separate changes between each section with an @@..@@ pattern.

Now, I just need to write a filter that filters out all the reindexing junk and leaves out the actual changes.

Here's the log from the second occurrence:

Once again, after editing one of the posts, the latest updater still failed to recognize the edit.

The problem occurs because the change to one of the posts did not have a file.patch.

I ended up using both the significantChangeIdentifier function and counting the number of file.changes in an OR relationship. This way, both types of changes in the commit metadata are covered.

Edit 2024-02-07, 16:00

Shortly after uploading this document, another bug appeared. This document was once again not recognized by the detector.

It was discovered that in the REST API, there are additional status names like renamed, besides the statuses added and modified, for files in a commit.

By examining the response schema here, we can see that there are the following types of statuses:


"status": {
    "type": "string",
    "enum": [
        "added",
        "removed",
        "modified",
        "renamed",
        "copied",
        "changed",
        "unchanged"
    ],...}

A quick chatGPT conversation result this:

"added": The file was added in the commit. It did not exist in the repository before this commit.

"removed": The file was deleted in the commit. It will not be present in the repository after this commit.

"modified": The file existed before the commit and has been altered in some way (content changed, file permissions, and so on) as part of this commit.

"renamed": The file was renamed in the commit. This status indicates that the file's path or name has changed.

"copied": This status indicates that the file was copied from another file in the commit. This is similar to "added" but specifies that the new file is a copy of an existing file.

"changed": This is a more generic status that can indicate any change not specifically covered by the other statuses, such as changes in permissions or other metadata that doesn't necessarily modify the file's content.

"unchanged": The file was part of the commit but was not altered in any way. This status might be used in contexts where files are explicitly mentioned in a commit without being changed, possibly for tracking or auditing purposes.

Which means that I do not need to cover any additional status besides added, modified, and renamed.

Edit 2024-02-07, 17:23

The new changes lead to another problem: failure to handle the edge case where a file is renamed without any content changes. Ideally, this change should be reflected back on the website.

To fix this, I moved the processing from an array of files to an array of filenames slightly downstream. I added tracker for both file.previous_filenames associated with renamed files and overall duplicates using hashsets. So i can compare and delete later.

Anthony's blog

What's new detection script debug and fix

Edit 2024-02-07, 16:00

Edit 2024-02-07, 17:23