
Prevent indexing of old kedro docs using RTD-Enabled JavaScript script #4516

Merged

DimedS merged 4 commits into main from add-custom-rtd-script on Feb 28, 2025

Conversation

DimedS
Member

@DimedS DimedS commented Feb 24, 2025

Description

To address #3741, in collaboration with the Read the Docs (RTD) team, we have developed a JavaScript script that runs during every documentation build. This script injects a noindex meta tag into all documentation versions—except for the stable versions of Kedro and Kedro-Viz, and kedro-datasets-6.0.0 for Kedro-Datasets—to prevent outdated versions from being indexed by search engines.

Currently, the script is temporarily executed from this storage location, manually enabled by RTD.

Once this PR is merged and the next release is published, we will need to request RTD to manually enable script execution again. In the future, when RTD completes their planned feature update, script selection will be configurable directly from the RTD console.
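The script itself is hosted by Read the Docs and is not reproduced on this page, but the approach it describes can be sketched as follows. This is a minimal illustration, not the real script: the allow-list below is illustrative (the actual one covers the stable versions of Kedro and Kedro-Viz, plus kedro-datasets-6.0.0), and `injectNoindex` is a hypothetical name.

```javascript
// Versions that should remain indexable (illustrative list, not the real one).
const INDEXABLE_VERSIONS = ["stable", "kedro-datasets-6.0.0"];

// Decide whether a given docs version should carry a noindex tag.
function shouldNoindex(version) {
  return !INDEXABLE_VERSIONS.includes(version);
}

// In the browser, inject <meta name="robots" content="noindex"> into <head>
// for any version not on the allow-list. Returns true if a tag was injected.
function injectNoindex(doc, version) {
  if (!shouldNoindex(version)) return false;
  const meta = doc.createElement("meta");
  meta.name = "robots";
  meta.content = "noindex";
  doc.head.appendChild(meta);
  return true;
}
```

During an RTD build, such a script would run on every page with the current version string, leaving only the allow-listed versions visible to search engines.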

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Dmitry Sorokin <[email protected]>
Member

@astrojuanlu astrojuanlu left a comment


Thanks @DimedS 🙏🏼 What a journey...

We share the concern about having to keep track of versions manually in the script. For now, unfortunately, we'll have to move forward with what we have.

Also please ask some of our frontend colleagues to have a look at the JS 😄

@DimedS DimedS requested a review from jitu5 February 24, 2025 14:22
@jitu5
Contributor

jitu5 commented Feb 24, 2025

@DimedS In terms of code it looks good, but considering the article Nok posted (readthedocs/readthedocs.org#10648 (comment)) and Google's docs on the use of noindex (https://developers.google.com/search/docs/crawling-indexing/block-indexing), are we still getting the expected result?

@DimedS
Member Author

DimedS commented Feb 27, 2025

@DimedS In terms of code it looks good, but considering the article Nok posted (readthedocs/readthedocs.org#10648 (comment)) and Google's docs on the use of noindex (https://developers.google.com/search/docs/crawling-indexing/block-indexing), are we still getting the expected result?

Thanks, @jitu5. We considered using the canonical tag but decided to proceed with noindex, as it provides a more robust solution:

  • When you use canonical to point old pages to the stable version, the old pages will likely still be indexed. They will generally rank lower, and the stable version will most likely be shown first, but this method is not 100% reliable: search engines may still decide to rank and display the old version despite the canonical tag.
  • Our current approach aligns with the Google documentation you referenced. We applied a noindex tag to all old versions, ensuring they are not indexed at all. If a page is not indexed, it cannot appear in search rankings.
  • I double-checked in Google Search Console, and the number of indexed pages has significantly decreased, confirming that this approach is working as expected. Some old versions are still indexed, but this is due to our restrictive robots.txt, which prevents search crawlers from accessing those pages and detecting the noindex tag. We plan to resolve this by removing the custom robots.txt before the next release. In fact, perhaps we should remove robots.txt in this PR.
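For reference, the noindex mechanism only works if crawlers can actually fetch the page. The script injects a tag like this into the `<head>` of each old version:

```html
<!-- Takes effect only when the crawler is allowed to fetch the page -->
<meta name="robots" content="noindex">
```

A restrictive robots.txt rule, by contrast, blocks the crawl itself (the path below is purely illustrative):

```
User-agent: *
Disallow: /en/0.18.14/
```

With such a Disallow rule in place, crawlers never fetch the old pages and therefore never see the noindex tag, which is why previously indexed pages can remain in the index until the rule is removed.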

Do you have any additional thoughts, @astrojuanlu?

@DimedS DimedS requested a review from astrojuanlu February 27, 2025 13:24
@jitu5
Contributor

jitu5 commented Feb 27, 2025

(Quoting the question and reply above.)

@DimedS Makes sense, thanks for the detailed info.

Member

@astrojuanlu astrojuanlu left a comment


100 % agreed on what @DimedS said in #4516 (comment) 💯 LGTM!

@DimedS DimedS merged commit 301e84e into main Feb 28, 2025
10 checks passed
@DimedS DimedS deleted the add-custom-rtd-script branch February 28, 2025 13:38
@DimedS DimedS mentioned this pull request Feb 28, 2025
7 tasks

Successfully merging this pull request may close these issues.

Improve SEO and maintenance of documentation versions
3 participants