- 
          
 - 
                Notifications
    
You must be signed in to change notification settings  - Fork 2.3k
 
[singlehtml] add docname to section anchor to make them unique #13739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e6b65fb    to
    5117057      
    Compare
  
    | 
           Hi @gastmaier - I'm a former semi-regular volunteer contributor here, although I have been less active recently. Thanks for the pull request; and sorry that I did not notice the toctree constructor problem, as you mention in #13717. I am reading both #13717 and this PR #13739 to try to understand the different approaches and reasons for them. Also: do you have a test case that we could add under   | 
    
| 
           Hi @jayaddison maybe extending tests/test_builders/test_build_html_tocdepth.py  | 
    
| 
           @gastmaier that sounds perfect, yep! (I'd forgotten about those tests)  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align with purpose
c50ba56    to
    910de47      
    Compare
  
    | 
           Hi, @jayaddison and @akhilsmokie7-cloud  I rebased and added the test to check for duplicated ids. I added the test to the bottom, checking out fe728f4 will fail at as expected, since at f5457f1 On, "the html build also changes the images src path during the write step", this is what I am talking about CI note: is due to 2e51b787680cefdfe56b3438d809e6476600a47e Thanks,  | 
    
910de47    to
    3a92a34      
    Compare
  
            
          
                sphinx/builders/singlehtml.py
              
                Outdated
          
        
      | for docname, secnums in self.env.toc_secnumbers.items(): | ||
| for id, secnum in secnums.items(): | ||
| alias = f'{docname}/{id}' | ||
| alias = f'{docname}{id}' | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of values are possible for the docname and id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also: I guess people shouldn't have written hyperlinks or saved bookmarks with the assumption that these aliases are stable? but, even so - if we change the format, I guess we would break those?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gastmaier in fact: I'm not sure where these / separator characters appear.  What does this code relate to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For singlehtml and at the assemble toctree step, the href is a tuple of docname and refid.
#document-path/to/#id1 to  try to avoid the refid confliction in singlehtml mode problem, which didn't work because it would patch toctree, but the content body still had the non-unique ids.
My pr changes the toctree href format from
#document-path/to/#id1  to #document-path/to#id1 (removes end slash)
and for content ids from
#d1  to #document-path/to#id1 (adds doc prefix to make unique)
the new template is therefore:
#document-{doc}#{id}
direct tuple of docname and refid, without the slash.
These are valid HTML anchors, but do require escaping when manipulating with:
css
#document-test\/extra\#test {color: #f00;}
and javascript
document.querySelector('#document-test\\/extra#test')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
singlehtml.zip
here is a singlehtml build with the patch
        
          
                sphinx/writers/html5.py
              
                Outdated
          
        
      | self.body.append('</dt>') | ||
| 
               | 
          ||
| def visit_section(self, node: section) -> None: | ||
| if self.builder.name == 'singlehtml' and node['ids']: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to use many @property methods in the Sphinx writers, but maybe this singlehtml condition is getting to the point where it makes sense (this is the third potential callsite, I think?).
| 
           Maybe pedantic of me to mention, but: running the test code without the fix in place does confirm that the test case fails (duplication of   | 
    
          
 (I attempted that to reassure myself and to learn slightly more about how the fix works)  | 
    
| 
           Drafting again, I spotted more links using the non-doc-prefixed anchor in the body. are not being prefixed. but their links to it are correct ( I will give yet another try, but this time transversing the pickled to patch all ids early on, instead of patching at the nodes visit. Sample of new new approach:  | 
    
9edcc87    to
    82fae9f      
    Compare
  
    | 
           Applied the ruthless traverse to patch all (ref)?ids early on, instead of patching at the nodes visit. This approach avoids mass overwrite of every docutils method under the sun, e.g. the starttag method for the sneaky explicit ref  The procedure is to patch doctree ( Since the call stack is a little hidden, here is a summary  | 
    
82fae9f    to
    bfa9f06      
    Compare
  
            
          
                sphinx/builders/singlehtml.py
              
                Outdated
          
        
      | if 'refid' in node or 'ids' in node: | ||
| docname = env.path2doc(doc['source']) | ||
| if 'refid' in node: | ||
| node['refid'] = 'document-' + docname + '#' + node['refid'] | ||
| if 'ids' in node: | ||
| node['ids'] = ['document-' + docname + '#' + id for id in node['ids']] | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll plan to do this within the next 24h or so, but I'll ask in case it is something you could do quickly: could you print out two columns of text with the before and after values for these node attributes when building a non-trivial project (easiest/safest choice: Sphinx itself)?
e.g.
refids
before               after
[sample/#foo]        [document-sample#foo]
node_id
sample/#foo          document-sample#foo
The reason I ask: I'd like to inspect the places where the results differ, and in particular how the code changes achieve uniqueness of the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm also wondering whether docutils -- which produces the node objects, if I understand correctly - could help us and allow us to fix this in a more central location; and I hope that viewing the comparison columns may also help to understand whether that is realistic or whether this is some Sphinx-specific quirk)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm also wondering whether
docutils-- which produces thenodeobjects, if I understand correctly - could help us and allow us to fix this in a more central location; and I hope that viewing the comparison columns may also help to understand whether that is realistic or whether this is some Sphinx-specific quirk)
Nope, scratch that - I think that docutils is unaware of the notion of docnames, so whatever is going on here must, I think, be part of Sphinx itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So docutils provides the solved tree to the builder, with each doc being a document.
Sphinx guarantees the ids are unique per doc, the filesystem guarantees the docname is unique (you cannot have two identical paths)
But the builder singlehtml flattens all into the root doc index, loosing the information of the docname, causing non-unique ids after flatting it.
This fix recovers the docname and patches into the id itself.
The sphinx documentation itself, attached below, has conflicts, there are many duplicated id1.
The table requested (attached because it is too long):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much @gastmaier - that makes the problem and fix nice and clear.
I'm reading the comparison file at the moment - in particular I'm interested to find whether any of the before elements included a / delimiter -- I haven't found any so far.  If there are none, then that would completely resolve my concern about breaking any existing hyperlinks containing that character.
Do you have any thoughts about whether we should always include the complete document path prefix? Or whether, for example, it could be omitted for unambiguous/unique IDs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would patch all at the moment, it makes sense to me to store the lost docname information in the id itself, and it is clearer to debug.
For the toctree, before the pr, it would already generate in the format document-<doc>#<id>, so this would need to be assessed as well. That's what #13717 tried to fix, only to uncover the collision issue.
And there are so many visit_* elements that needs to be patched to handle every corner case, that uniforming into a single format early on (after SphinxPostTransform, before other singlehtml patches) seems to be the only reliable approach.
The latex builder does patch at the visit_* elements with the sphinx/writers/latex.py#hypertarget[withdoc=True] method, but I don't see that working with html since it is straight up more convoluted since each visit would require some if builder.name is 'singlehtml'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gastmaier.  I agree.  Also: in my opinion we should highlight this as a hyperlink-breaking change -- for singlehtml builds -- in the changelog notes.
One remaining concern I have: the document- prefix on every target seems fairly verbose.  I am wondering whether we could restore the path delimiter (/) -- and maybe even add a root prefix (/{docname}/#{refid}) -- to reduce the risk of duplication against other anchors declared by projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can give a try, but I need to know exactly the format we want to achieve, as I understood it would be , in the html:
href="#/path/to/doc/#anchor_name"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the same format that I had in mind too, yep 👍
e6a707b    to
    0f7ce18      
    Compare
  
    | 
           Hi @jayaddison applied the new   | 
    
        
          
                sphinx/builders/singlehtml.py
              
                Outdated
          
        
      | def prefix_ids_with_docname(self, tree: nodes.document) -> None: | ||
| # Append docname to refids and ids using format document-<docname>#<id>. | ||
| # Compensates for loss of the pathname section of the href, that | ||
| # ensures uniqueness in the html builder. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about adjusting the name of this method to ensure_fully_qualified_refids, or similar?
Note: when I suggest the terminology fully-qualified there, I'm borrowing it as a name from DNS: https://en.wikipedia.org/wiki/Fully_qualified_domain_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, I see in the codebase similar ensure_ methods that alter the doc without return.
I was going to suggest get_fully_qualified_refids, but it is not a get method so it would be misleading.
39b0657    to
    88ac2dd      
    Compare
  
    | 
           Done minor requested changes  | 
    
        
          
                sphinx/builders/singlehtml.py
              
                Outdated
          
        
      | def ensure_fully_qualified_refids(self, tree: nodes.document) -> None: | ||
| # Append docname to refids and ids using format document-<docname>#<id>. | ||
| # Compensates for loss of the pathname section of the href, that | ||
| # ensures uniqueness in the html builder. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the method renaming! Please also replace the comment with a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
88ac2dd    to
    abdd3de      
    Compare
  
            
          
                sphinx/builders/singlehtml.py
              
                Outdated
          
        
      | doc = node.document | ||
| if doc is None: | ||
| continue | ||
| env = doc.settings.env | ||
| if 'refid' in node or 'ids' in node: | ||
| docname = env.path2doc(doc['source']) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any situation where node.document can be empty?  If not, then we can simplify:
| doc = node.document | |
| if doc is None: | |
| continue | |
| env = doc.settings.env | |
| if 'refid' in node or 'ids' in node: | |
| docname = env.path2doc(doc['source']) | |
| if 'refid' in node or 'ids' in node: | |
| docname = self.env.path2doc(node.document['source']) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy complains
sphinx/builders/singlehtml.py:99:36: error: Value of type "document | None" is not indexable  [index]
I tried the NodeMatcher.findall but got the same result.
With cast
doc = cast(nodes.document, node.document)
I get
sphinx/builders/singlehtml.py:97:19: error: Redundant cast to "document"  [redundant-cast]
assert node.document is not None works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thank you 👍
fee3c39    to
    c455f96      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me - thank you, @gastmaier!
Note that I am not a core contributor here, so my review is not binding and I cannot merge the PR - even so, it looks good to me.
cc @sphinx-doc/developers (does this GitHub team still exist?)
| 
           Ah, I hadn't noticed the failing tests, though - we'll want to fix those too.  | 
    
c455f96    to
    3de2154      
    Compare
  
    | 
           @jayaddison well at least the test added is working.  | 
    
| 
           Hi @AA-Turner can you revisit this one? 😄 edit: force pushed to resolve merge conflict in CHANGES.rst  | 
    
To assert unique ids in singlehtml builder. Signed-off-by: Jorge Marques <[email protected]>
Since the singlehtml aggregates all doc files into a single html page during the write step, and the ids must be unique for proper link anchoring, add test that collects all ids in the page and checks if all ids are unique, by asserting the length of the list against it as a set.
Use doc path to make ids unique with format ``/docname/#id``. Compensates for the loss of the pathname in the href. This will break existing hyperlinks to `singlehtml` HTML documents since it alters the format
Format as ``/docname/#id`` to match other parts.
3de2154    to
    1872ce1      
    Compare
  
    
Purpose
Follow up to #13717, inverting the logic, instead of patching the toctree to yield "#id1" instead of "#document-path/to#id1", have the section id to be docname preffixed, solving non-unique ids in singlehtml.
Allows to remove post Sphinx transforms like in here
Top level overview of current behavior
Approach taken
Based on the LaTeX builder solution.
sphinx/writers/latex.py#hypertarget[withdoc=True]method suffixes docutils id with the docname.In my implementation I edit ids['0'] directly to not have to overwrite the whole
visit_sectionmethod, but I understand if requested to not modify the tree and instead overwrite.On the format #document-test/extra#id1
It is compatible with HTML anchoring, CSS and JavaScript selectors, but require escaping:
Tests
The following tests are relevant:
References