Generate general sitemap.xml for projects #5122

Merged
merged 22 commits into from
Feb 19, 2019
Member

@humitos humitos commented Jan 16, 2019

This PR makes Read the Docs generate a general (not project-specific) sitemap.xml, served at the root of each project's domain (/sitemap.xml), based on the discussion in #557

I think it would be good to split this in two phases:

  1. Generate a general sitemap.xml for every project, with no option to customize it (this PR as is)
  2. Check whether the project already generates its own sitemap.xml and, if so, generate a project-specific one using sitemapindex instead of the general one
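
Phase 2 could be sketched roughly as follows. This is only an illustration, not code from this PR: the function name `build_sitemap_index` and both example URLs are hypothetical. Only the `sitemapindex`, `sitemap` and `loc` elements and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a <sitemapindex> document (as a string) listing each sitemap URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    return ET.tostring(root, encoding="unicode")


# Hypothetical URLs: one sitemap produced by the project's own build and
# one produced by Read the Docs.
index_xml = build_sitemap_index([
    "https://example.readthedocs.io/en/latest/sitemap.xml",
    "https://example.readthedocs.io/sitemap-general.xml",
])
```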

Example with a local toy project:

$ http http://fastcgi-for-net.dev.readthedocs.io:8000/sitemap.xml
HTTP/1.0 200 OK
Content-Language: en
Content-Length: 1713
Content-Type: application/xml
Date: Wed, 16 Jan 2019 21:15:44 GMT
Server: WSGIServer/0.2 CPython/3.6.6
Vary: Accept-Language, Cookie

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  
  <url>
    <loc>http://fastcgi-for-net.dev.readthedocs.io:8000/bn/latest/</loc>
    
    <xhtml:link
        rel="alternate"
        hreflang="en"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/en/latest/"/>
    
    <xhtml:link
        rel="alternate"
        hreflang="bn"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/bn/latest/"/>
    
    <lastmod>2019-01-16T21:13:53.325602+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  
  <url>
    <loc>http://fastcgi-for-net.dev.readthedocs.io:8000/bn/stable/</loc>
    
    <xhtml:link
        rel="alternate"
        hreflang="en"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/en/stable/"/>
    
    <xhtml:link
        rel="alternate"
        hreflang="bn"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/bn/stable/"/>
    
    <lastmod>2019-01-16T10:54:16.792524+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
  
  <url>
    <loc>http://fastcgi-for-net.dev.readthedocs.io:8000/bn/make-docs-dir-a-list/</loc>
    
    <xhtml:link
        rel="alternate"
        hreflang="en"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/en/make-docs-dir-a-list/"/>
    
    <xhtml:link
        rel="alternate"
        hreflang="bn"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/bn/make-docs-dir-a-list/"/>
    
    <lastmod>2019-01-16T11:03:10.324661+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  
</urlset>
$ 

@humitos humitos added Needed: tests Tests are required PR: work in progress Pull request is not ready for full review labels Jan 16, 2019
@humitos humitos requested a review from a team January 16, 2019 21:29
@humitos humitos force-pushed the humitos/sitemap-xml branch from 5de800d to 41b6c16 Compare January 16, 2019 21:41
@humitos humitos mentioned this pull request Jan 16, 2019
@humitos humitos force-pushed the humitos/sitemap-xml branch from 41b6c16 to 0ed9952 Compare January 16, 2019 21:55
        yield c

    while True:
        yield 'monthly'
Member Author

I'm using monthly because I think never is too aggressive. If the tag is removed and a branch is created with the same name, we will want bots to revisit this.

NOTE: maybe this should be a comment in the code itself.
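
A sketch of how that rationale could live in the code as a comment. The function body here is assumed from the visible diff fragment and from the `daily`/`weekly`/`monthly` values in the example output, so treat it as an approximation rather than the exact code in this PR.

```python
import itertools


def changefreqs_generator():
    """Yield 'daily', then 'weekly', then 'monthly' forever."""
    for changefreq in ['daily', 'weekly']:
        yield changefreq

    # 'monthly' rather than 'never': if a tag is removed and a branch is
    # later created with the same name, bots should revisit the URL.
    while True:
        yield 'monthly'


# The generator is infinite, so take a finite slice to inspect it.
first_five = list(itertools.islice(changefreqs_generator(), 5))
```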

    for version, priority, changefreq in zip(
            sorted_versions, priorities_generator(), changefreqs_generator()):
        element = {
            'loc': version.get_subdomain_url(),
Member Author

The URL should be properly escaped: https://www.sitemaps.org/protocol.html#escaping

Member Author

I'm not sure if there is something we need to do here, actually.
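
For reference, the escaping the protocol asks for can be sketched with the standard library. `escape_sitemap_url` is a hypothetical helper, not part of this PR. If the sitemap template is rendered with Django's autoescaping enabled (the default), this escaping probably already happens at render time and no extra step is needed.

```python
from xml.sax.saxutils import escape


def escape_sitemap_url(url):
    """Entity-escape the five characters listed by the sitemap protocol."""
    # escape() handles &, < and > itself; the dict adds both quote characters.
    return escape(url, {"'": "&apos;", '"': "&quot;"})


escaped = escape_sitemap_url("http://example.com/docs?a=1&b=2")
```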

@humitos humitos force-pushed the humitos/sitemap-xml branch from 88beea8 to 9c05020 Compare January 17, 2019 11:21
    iteration. After 0.1 is reached, it will keep returning 0.1.
    """
    priorities = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    yield from itertools.chain(priorities, itertools.repeat(0.1))
Member Author

yield from is not valid Python 2 syntax :(

Contributor

Typo on line 288: change change
and line 293: this one i not

Member Author

Fixed!

@humitos humitos removed Needed: tests Tests are required PR: work in progress Pull request is not ready for full review labels Jan 17, 2019
Member Author

humitos commented Jan 17, 2019

This is ready to be merged, although it uses yield from, which is not valid Python 2 syntax. I don't think we want to rewrite it for Python 2 compatibility since we are deprecating Python 2: #4543

(we should probably already remove it from travis and tox)
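
For the record, if Python 2 compatibility were ever needed, the rewrite would be small. This is a sketch of that alternative, not what the PR does: an explicit loop replaces `yield from` and behaves the same on both Python versions.

```python
import itertools


def priorities_generator():
    """Yield 1, 0.9, ..., 0.2, then 0.1 forever, without `yield from`."""
    priorities = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    # An explicit loop is the Python 2 compatible spelling of `yield from`.
    for priority in itertools.chain(priorities, itertools.repeat(0.1)):
        yield priority


first_eleven = list(itertools.islice(priorities_generator(), 11))
```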

@humitos humitos added Needed: documentation Documentation is required PR: work in progress Pull request is not ready for full review labels Jan 20, 2019
@agjohnson agjohnson added this to the 3.1 milestone Jan 22, 2019
@agjohnson agjohnson added the Feature New feature label Jan 25, 2019


@map_project_slug
# TODO: make this cache dependent on the project's slug
Contributor

The docs for cache_page say the cache "is keyed off of the URL". Looking at the code, it looks to me like they mean the fully qualified URL including the host, so I think we're ok.

Member Author
@humitos humitos Feb 4, 2019

When I read that I wasn't sure if by URL it meant the path or the full URL.

Digging a little into the code I found that this line is the one that generates the cache key:

https://github.com/django/django/blob/b9cf764be62e77b4777b3a75ec256f6209a57671/django/utils/cache.py#L314

Which returns the absolute URL as you said. Thanks. We are good!

https://docs.djangoproject.com/en/1.11/ref/request-response/#django.http.HttpRequest.build_absolute_uri
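
To illustrate why keying on the absolute URL is enough here, a deliberately simplified stand-in follows. It is not Django's real key function: `simplified_cache_key`, its key format and the two hostnames are hypothetical. It only shows that hashing the full URL, host included, gives each project's /sitemap.xml its own cache entry.

```python
import hashlib


def simplified_cache_key(absolute_url, method="GET", key_prefix=""):
    """Simplified stand-in: derive a cache key from the absolute URL."""
    url_hash = hashlib.sha256(absolute_url.encode("utf-8")).hexdigest()
    return "cache_page.%s.%s.%s" % (key_prefix, method, url_hash)


# Same path, different host: the keys differ, so the projects do not
# share a cached sitemap.
key_a = simplified_cache_key("http://project-a.readthedocs.io/sitemap.xml")
key_b = simplified_cache_key("http://project-b.readthedocs.io/sitemap.xml")
```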

    context = {
        'versions': versions,
    }
    return render(request, 'sitemap.xml', context, content_type='application/xml')
Contributor

Any reason you went with a template instead of Django's builtin sitemap framework? https://docs.djangoproject.com/en/1.11/ref/contrib/sitemaps/

The framework might have some advantages in case we ever get so large we need to break up sitemaps into multiple files.

Member Author

Ignorance.

That said and after reading the documentation, I'm not sure how to:

  • get the project from inside the Sitemap class (I think it doesn't have access to the request object --where we could check for .slug property, for example)
  • generate different locations for translations that are not based on LANGUAGE variable but on project.translations

Sitemap objects look pretty clean and clear but I'd need some help on those two things to be able to make it work using the framework.


    versions = []
    for version, priority, changefreq in zip(
            sorted_versions, priorities_generator(), changefreqs_generator()):
Contributor

If I'm reading this right:

  • latest will always have priority=1 and changefreq=daily
  • stable will always have priority=0.9 and changefreq=weekly
  • other versions will have decreasing priorities and changefreq=monthly

Is that right?

Wouldn't it be better to just guess at a priority and changefrequency based on the last build date? If there is no last build date (version was never built), we don't include the version.

Member Author

You are reading it right, yes.

Wouldn't it be better to just guess at a priority and changefrequency based on the last build date?

Our idea behind this was that you (as author of the project) want to point your readers to the latest version published first. That's why the priority works like that: latest, stable, v1.5, v1.4, etc in that order... I think that build date is not associated with priority.

Regarding changefreq, I have a similar opinion: I expect the latest versions to change more frequently than v0.1.

I wouldn't complicate the logic for this.

If there is no last build date (version was never built), we don't include the version.

I'm including only active versions, as we also do in the flyout. This could change once more states are implemented: #4001 (comment)
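
The ordering discussed in this thread can be sketched as below. The version slugs are hypothetical and stand in for the already-sorted queryset. The generators are rewritten here so the example is self-contained and may not match the PR's code line for line.

```python
import itertools


def priorities_generator():
    priorities = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    for priority in itertools.chain(priorities, itertools.repeat(0.1)):
        yield priority


def changefreqs_generator():
    for changefreq in itertools.chain(
            ['daily', 'weekly'], itertools.repeat('monthly')):
        yield changefreq


# Hypothetical, already-sorted version slugs (most important first).
sorted_slugs = ['latest', 'stable', 'v1.5', 'v1.4']

# zip() stops at the shortest input, so the infinite generators are safe.
entries = [
    {'slug': slug, 'priority': priority, 'changefreq': changefreq}
    for slug, priority, changefreq in zip(
        sorted_slugs, priorities_generator(), changefreqs_generator())
]
```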

@humitos humitos force-pushed the humitos/sitemap-xml branch from 07f0867 to 9670c85 Compare February 4, 2019 11:29
@humitos humitos removed PR: work in progress Pull request is not ready for full review Needed: documentation Documentation is required labels Feb 4, 2019
Member Author

humitos commented Feb 4, 2019

I added a short docs page for sitemaps to at least communicate that we are doing this automatically, without going too deep into the implementation. Once the sitemap index is implemented, we can extend this page to cover that feature.

@humitos humitos force-pushed the humitos/sitemap-xml branch from 03043d6 to 3de35d0 Compare February 4, 2019 12:29
@humitos humitos requested a review from a team February 4, 2019 12:33
davidfischer previously approved these changes Feb 6, 2019
Contributor

@davidfischer davidfischer left a comment

This change looks good to me.

I will monitor our organic traffic over the next month and make sure there isn't a dip from this change. Based on my understanding of sitemaps, I think this should be positive, but SEO is a bit of a dark art.

I know this is a first step but here are some improvements we can do in future iterations:

  • Have a sub-sitemap per version. Our root sitemap (slug.readthedocs.io/sitemap.xml) could point to version specific sitemaps (slug.readthedocs.io/en/1.x/sitemap.xml) that have entries for each HTML file.
  • We can support user submitted sitemaps. If the user has a sitemap.xml file in their build output, our root sitemap could point to it or maybe it replaces the root sitemap.
  • Use the Django sitemap features. I think this is slightly better than using an XML template if possible because there are a few intricacies for sitemaps (max 50k entries per file, etc.) that we won't hit with this implementation but we might if we expand it.
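
As a rough sketch of the 50k-entries constraint mentioned in the last point: if per-version sitemaps ever list every HTML file, the URLs would need to be split across files. `chunk_urls` and the example URLs are hypothetical. Django's sitemap framework handles this kind of pagination itself, which is part of the argument for using it.

```python
MAX_URLS_PER_SITEMAP = 50000  # per-file limit from the sitemaps.org protocol


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


# Hypothetical page URLs for one very large version.
urls = ["https://example.readthedocs.io/en/1.x/page-%d.html" % n
        for n in range(120000)]
chunks = chunk_urls(urls)
```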

Contributor

jdillard commented Feb 9, 2019

I was looking through the changes and it doesn't seem the sitemapindex is added to the robots.txt file, which would likely make it easier for the search engines to find.

If you think that is worth implementing here is an example of what I am talking about: https://github.com/jdillard/sphinx-sitemap#getting-the-most-out-of-the-sitemap

Member Author

humitos commented Feb 9, 2019

doesn't seem the sitemapindex is added to the robots.txt file, which would likely make it easier for the search engines to find.

I think this is a good addition, and it should be easy to add to the default robots.txt returned by Read the Docs at https://github.com/rtfd/readthedocs.org/blob/799480827f20f04ded7239a6307853c721de39fa/readthedocs/core/views/serve.py#L312
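
A minimal sketch of what that could look like. `default_robots_txt` is a hypothetical helper, and the `User-agent`/`Allow` lines are an assumption about the default content, not copied from serve.py. Only the `Sitemap:` directive is the part under discussion.

```python
def default_robots_txt(sitemap_url):
    """Return a permissive robots.txt body that advertises the sitemap."""
    lines = [
        "User-agent: *",
        "Allow: /",
        # Search engines read this directive to discover the sitemap.
        "Sitemap: %s" % sitemap_url,
    ]
    return "\n".join(lines) + "\n"


robots = default_robots_txt("https://example.readthedocs.io/sitemap.xml")
```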

@humitos humitos force-pushed the humitos/sitemap-xml branch from ab64878 to 6b3cf9f Compare February 12, 2019 09:10
@humitos humitos requested a review from a team February 12, 2019 09:12
@humitos humitos force-pushed the humitos/sitemap-xml branch from 1bb65e3 to 21b0015 Compare February 12, 2019 09:38
ericholscher previously approved these changes Feb 14, 2019
Member

@ericholscher ericholscher left a comment

Looks simple enough. I could see needing to play with the priorities or something over time, but this is definitely better than no sitemap (hopefully :)

@@ -0,0 +1,19 @@
Sitemaps
Member

Is this linked from anywhere? Should be in a toctree somewhere.

Member Author

Yes. I wanted to link it under Feature Documentation but I forgot to do it.

I will move this file to docs/features/sitemaps.rst, and it will be linked automatically in that section.

        raise Http404

    sorted_versions = sort_version_aware(
        project.versions.filter(
Member

Any reason not to use public() on the queryset here?

Member

I believe it also defaults to only active projects, but you can pass only_active=True as well

Member Author

No specific reason. I just changed it to Version.objects.public(project=project, only_active=True)

davidfischer previously approved these changes Feb 14, 2019
@humitos humitos dismissed stale reviews from davidfischer and ericholscher via 6482bac February 18, 2019 11:35
Member Author

humitos commented Feb 18, 2019

I just pushed the changes suggested on feedback. I will merge this PR once tests pass.

Member Author

humitos commented Feb 18, 2019

Mmm... It seems that I can't merge: since I pushed new changes, I need a new approval now:

At least 1 approving review is required by reviewers with write access.

@humitos humitos requested a review from ericholscher February 18, 2019 11:53
Labels
Feature New feature
5 participants