
Prevent indexing of old kedro docs using RTD-Enabled JavaScript script #4516

Merged

DimedS merged 4 commits into main from add-custom-rtd-script on Feb 28, 2025

Conversation

DimedS
Member

@DimedS DimedS commented Feb 24, 2025

Description

To address #3741, in collaboration with the Read the Docs (RTD) team, we have developed a JavaScript script that runs during every documentation build. This script injects a noindex meta tag into all documentation versions—except for the stable versions of Kedro and Kedro-Viz, and kedro-datasets-6.0.0 for Kedro-Datasets—to prevent outdated versions from being indexed by search engines.

Currently, the script is temporarily executed from this storage location, manually enabled by RTD.

Once this PR is merged and the next release is published, we will need to request RTD to manually enable script execution again. In the future, when RTD completes their planned feature update, script selection will be configurable directly from the RTD console.
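The script itself is hosted by Read the Docs and is not reproduced on this page, but the approach it describes can be sketched as follows. This is a minimal illustration, not the real script: the allow-list below is illustrative (the actual one covers the stable versions of Kedro and Kedro-Viz, plus kedro-datasets-6.0.0), and `injectNoindex` is a hypothetical name.

```javascript
// Versions that should remain indexable (illustrative list, not the real one).
const INDEXABLE_VERSIONS = ["stable", "kedro-datasets-6.0.0"];

// Decide whether a given docs version should carry a noindex tag.
function shouldNoindex(version) {
  return !INDEXABLE_VERSIONS.includes(version);
}

// In the browser, inject <meta name="robots" content="noindex"> into <head>
// for any version not on the allow-list. Returns true if a tag was injected.
function injectNoindex(doc, version) {
  if (!shouldNoindex(version)) return false;
  const meta = doc.createElement("meta");
  meta.name = "robots";
  meta.content = "noindex";
  doc.head.appendChild(meta);
  return true;
}
```

During an RTD build, such a script would run on every page with the current version string, leaving only the allow-listed versions visible to search engines.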

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Dmitry Sorokin <[email protected]>
Member

@astrojuanlu astrojuanlu left a comment


Thanks @DimedS 🙏🏼 What a journey...

We share the concern about having to keep track of versions manually in the script. For now, unfortunately, we'll have to move forward with what we have.

Also please ask some of our frontend colleagues to have a look at the JS 😄

@DimedS DimedS requested a review from jitu5 February 24, 2025 14:22
@jitu5
Contributor

jitu5 commented Feb 24, 2025

@DimedS In terms of code it looks good, but considering the article Nok posted (readthedocs/readthedocs.org#10648 (comment)) and Google's docs on the use of noindex (https://developers.google.com/search/docs/crawling-indexing/block-indexing), are we still getting the expected result?

@DimedS
Member Author

DimedS commented Feb 27, 2025

@DimedS In terms of code it looks good, but considering the article Nok posted (readthedocs/readthedocs.org#10648 (comment)) and Google's docs on the use of noindex (https://developers.google.com/search/docs/crawling-indexing/block-indexing), are we still getting the expected result?

Thanks, @jitu5. We considered using the canonical tag but decided to proceed with noindex, as it provides a more robust solution:

  • When you use canonical to point old pages to the stable version, the old pages will likely still be indexed. They will generally rank lower, and the stable version will most likely be shown first, but this method is not 100% reliable: search engines may still decide to rank and display the old version despite the canonical tag.
  • Our current approach aligns with the Google documentation you referenced. We applied a noindex tag to all old versions, ensuring they are not indexed at all. If a page is not indexed, it cannot appear in search rankings.
  • I double-checked in Google Search Console, and the number of indexed pages has significantly decreased, confirming that this approach is working as expected. Some old versions are still indexed, but this is due to our restrictive robots.txt, which prevents search crawlers from accessing those pages and detecting the noindex tag. We plan to resolve this by removing the custom robots.txt before the next release. In fact, perhaps we should remove robots.txt in this PR.
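For reference, the noindex mechanism only works if crawlers can actually fetch the page. The script injects a tag like this into the `<head>` of each old version:

```html
<!-- Takes effect only when the crawler is allowed to fetch the page -->
<meta name="robots" content="noindex">
```

A restrictive robots.txt rule, by contrast, blocks the crawl itself (the path below is purely illustrative):

```
User-agent: *
Disallow: /en/0.18.14/
```

With such a Disallow rule in place, crawlers never fetch the old pages and therefore never see the noindex tag, which is why previously indexed pages can remain in the index until the rule is removed.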

Do you have any additional thoughts, @astrojuanlu?

@DimedS DimedS requested a review from astrojuanlu February 27, 2025 13:24
@jitu5
Contributor

jitu5 commented Feb 27, 2025

(Quoting the question and reply above.)

@DimedS Makes sense, thanks for the detailed info.

Member

@astrojuanlu astrojuanlu left a comment


100 % agreed on what @DimedS said in #4516 (comment) 💯 LGTM!

@DimedS DimedS merged commit 301e84e into main Feb 28, 2025
10 checks passed
@DimedS DimedS deleted the add-custom-rtd-script branch February 28, 2025 13:38
@DimedS DimedS mentioned this pull request Feb 28, 2025
7 tasks

Successfully merging this pull request may close these issues.

Improve SEO and maintenance of documentation versions
3 participants