-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate general sitemap.xml for projects #5122
Conversation
5de800d
to
41b6c16
Compare
41b6c16
to
0ed9952
Compare
readthedocs/core/views/serve.py
Outdated
yield c | ||
|
||
while True: | ||
yield 'monthly' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using monthly
because I think never
is too aggressive. If the tag is removed and a branch is created with the same name, we will want bots to revisit this.
NOTE: maybe this should be a comment in the code itself.
for version, priority, changefreq in zip( | ||
sorted_versions, priorities_generator(), changefreqs_generator()): | ||
element = { | ||
'loc': version.get_subdomain_url(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URL should be properly escaped: https://www.sitemaps.org/protocol.html#escaping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is something we need to do here, actually.
88beea8
to
9c05020
Compare
iteration. After 0.1 is reached, it will keep returning 0.1. | ||
""" | ||
priorities = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2] | ||
yield from itertools.chain(priorities, itertools.repeat(0.1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yield from
is not Python2 syntax compatible :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo on line 288: change change
and line 293: this one i not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
This is ready to be merged. Although, it uses (we should probably already remove it from travis and tox) |
readthedocs/core/views/serve.py
Outdated
|
||
|
||
@map_project_slug | ||
# TODO: make this cache dependent on the project's slug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs for cache_page
say it is "is keyed off of the URL". Looking at the code, it does look to me like they mean the fully qualified URL including the host so I think we're ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read that I wasn't sure if by URL it meant the path or the full URL.
Digging a little into the code I found that this line is the one that generates the cache key:
Which returns the absolute URL as you said. Thanks. We are good!
readthedocs/core/views/serve.py
Outdated
context = { | ||
'versions': versions, | ||
} | ||
return render(request, 'sitemap.xml', context, content_type='application/xml') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason you went with a template instead of Django's builtin sitemap framework? https://docs.djangoproject.com/en/1.11/ref/contrib/sitemaps/
The framework might have some advantages in case we ever get so large we need to break up sitemaps into multiple files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignorance.
That said and after reading the documentation, I'm not sure how to:
- get the project from inside the Sitemap class (I think it doesn't have access to the
request
object --where we could check for.slug
property, for example) - generate different locations for translations that are not based on
LANGUAGE
variable but onproject.translations
Sitemap
objects look pretty clean and clear but I'd need some help on those two things to be able to make it work using the framework.
readthedocs/core/views/serve.py
Outdated
|
||
versions = [] | ||
for version, priority, changefreq in zip( | ||
sorted_versions, priorities_generator(), changefreqs_generator()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm reading this right:
latest
will always have priority=1 and changefreq=dailystable
will always have priority=0.9 and changefreq=weekly- other versions will have decreasing priorities and changefreq=monthly
Is that right?
Wouldn't it be better to just guess at a priority and changefrequency based on the last build date? If there is no last build date (version was never built), we don't include the version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are reading it right, yes.
Wouldn't it be better to just guess at a priority and changefrequency based on the last build date?
Our idea behind this was that you (as author of the project) want to point your readers to the latest version published first. That's why the priority works like that: latest, stable, v1.5, v1.4, etc in that order... I think that build date is not associated with priority.
Regarding changefreq, I have a similar opinion: I expect last versions to change more frequently than v0.1.
I wouldn't complicate the logic for this.
If there is no last build date (version was never built), we don't include the version.
I'm including only active
versions, as we do on the flyout also. This could change once more states be implemented: #4001 (comment)
07f0867
to
9670c85
Compare
I added small docs for sitemap to at least communicate that we are doing this automatically, without deep too much into its implementation. Once sitemap index is implemented, we can use this page to extend for that feature. |
03043d6
to
3de35d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks good to me.
I will monitor our organic traffic over the next month and make sure there isn't a dip from this change. Based on my understanding of sitemaps, I this should be positive but SEO is a bit of the dark arts.
I know this is a first step but here are some improvements we can do in future iterations:
- Have a sub-sitemap per version. Our root sitemap (
slug.readthedocs.io/sitemap.xml
) could point to version specific sitemaps (slug.readthedocs.io/en/1.x/sitemap.xml
) that have entries for each HTML file. - We can support user submitted sitemaps. If the user has a sitemap.xml file in their build output, our root sitemap could point to it or maybe it replaces the root sitemap.
- Use the Django sitemap features. I think this is slightly better than using an XML template if possible because there are a few intricacies for sitemaps (max 50k entries per file, etc.) that we won't hit with this implementation but we might if we expand it.
I was looking through the changes and it doesn't seem the sitemapindex is added to the robots.txt file, which would likely make it easier for the search engines to find. If you think that is worth implementing here is an example of what I am talking about: https://github.com/jdillard/sphinx-sitemap#getting-the-most-out-of-the-sitemap |
I think this is a good addition and should be easy to add it to the default |
ab64878
to
6b3cf9f
Compare
1bb65e3
to
21b0015
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks simple enough. I could see needing to play with the priorities or something over time, but this is definitely better than no sitemap (hopefully :)
docs/sitemaps.rst
Outdated
@@ -0,0 +1,19 @@ | |||
Sitemaps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this linked from anywhere? Should be in an toctree somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I wanted to link it under Feature Documentation
but I forgot to do it.
I will move this file under docs/features/sitemaps.rst
and will be linked automatically on that section.
readthedocs/core/views/serve.py
Outdated
raise Http404 | ||
|
||
sorted_versions = sort_version_aware( | ||
project.versions.filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to use public(
on the queryset here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it also defaults to only active projects, but can pass only_active=True
also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No specific reason. I just changed to Version.objects.public(project=project, only_active=True)
6482bac
I just pushed the changes suggested on feedback. I will merge this PR once tests pass. |
Mmm... It seems that I can't merge because as I sent new changes I need a new approval now:
|
This PR makes Read the Docs to generate a general (non specific per project)
sitemap.xml
served at the root of the project/sitemap.xml
based on discussions from #557I think it would be good to split this in two phases:
sitemap.xml
for all the project without option to customize it (this PR as is)sitemap.xml
and instead of generating a general one, generate a specific one for this project usingsitemapindex
Example with a toy project locally,