🐛 Add encoding to file writing function #405

claell · 2023-09-21T09:20:37Z

Fixes #403

claell · 2023-09-21T09:22:00Z

Right now, this is just adapted from #395. Not tested, just quickly done in GitHub editor.

MiWeiss

Thanks for the PR. Same as in #395, we would need some tests, primarily to prevent regressions. After those are added, the PR can be merged.

claell · 2023-10-17T10:57:31Z

Sorry, am rather busy right now, so not sure when (or if) I'll have time for that.

MiWeiss · 2023-11-02T19:39:19Z

> Sorry, am rather busy right now, so not sure when (or if) I'll have time for that.

Thanks for the comment. I'll mark the PR to be ready for anyone to continue working on this.

If anyone is willing to do so, or if @claell resumes his work: Please comment here to avoid having two people working on the same thing at the same time.

p.s. Continuing to work on the PR can be done as follows: Add the fork of @claell as additional git remote, pull the branch, and then push it to your own fork and open a new PR, stating that that one is supposed to replace this one.

agriyakhetarpal · 2024-01-12T18:05:14Z

Hi, we're looking to add python-bibtexparser as a dependency for the citations workflow of our Python package soon – I thought it would be nice to contribute as well, perhaps by unblocking a few PRs such as this one.

I am happy to take over from @claell and write a few tests, though the issue is that I fail to reproduce the original issue (#403) at this time:

Using the core functionality, i.e., without any customizations to the formatting options for writing, I am able to write an entry:

import bibtexparser
import bibtexparser.model

# Build a library containing a single entry whose author field holds an umlaut.
bib_library = bibtexparser.Library()
fields = [bibtexparser.model.Field("author", "ö")]
entry = bibtexparser.model.Entry("ARTICLE", "test", fields)
bib_library.add(entry)
# No encoding is passed here, so the platform default applies when writing.
bibtexparser.write_file("my_new_file.bib", bib_library)

i.e., the same MWE as previously reported in #403 (comment), and parsing my_new_file.bib seems to work without issues.

my_new_file.bib

@ARTICLE{test,
	author = {ö}
}

and reading this programmatically:

import bibtexparser
library = bibtexparser.parse_file("my_new_file.bib")

After that, library.entries_dict returns

{'test': Entry(entry_type=`article`, key=`test`, fields=`[Field(key=`author`, value=`ö`, start_line=1)]`, start_line=0)}

which I am able to write to a new .bib file via the bibtexparser.write_file() method without any loss of data or missing umlauts. Has this been fixed, can it no longer be reproduced, or is there a different way of triggering the error that I have overlooked?

MiWeiss · 2024-01-14T13:56:52Z

Hi @agriyakhetarpal

Great, thanks a lot for taking over!

Strange to hear that the problem cannot be reproduced; I am not aware of any changes we merged recently (although I have not checked ;-) ). As we're talking about encoding, I would not be surprised if the behavior was somewhat system-dependent, which could explain the issue at hand.

In either case, I think we should still merge this PR - to keep the interface consistent and generally applicable. While it's not ideal to do so without reproducing the problem above, I'd suggest implementing tests similar to #395 - this should, at the very least, protect us from some regressions. Do you agree?

> I thought it would be nice to contribute as well

That would be amazing. I'll be happy to help you along the way wherever I can (feedback, reviews, ...)

agriyakhetarpal · 2024-01-14T14:58:27Z

> As we're talking about encoding, I would not be surprised if the behavior was somewhat system-dependent, which could explain the issue at hand.
>
> In either case, I think we should still merge this PR - to keep the interface consistent and generally applicable. While it's not ideal to do so without reproducing the problem above, I'd suggest implementing tests similar to #395 - this should, at the very least, protect us from some regressions. Do you agree?
>
> I'll be happy to help you along the way wherever I can (feedback, reviews, ...)

I could not agree more. As far as I know, Python on Windows falls back to the locale encoding (often CP1252) when writing files if no encoding is specified. Maybe the reason I'm not seeing this error is that I'm on macOS?
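To make the platform dependence concrete, here is a minimal standard-library sketch of how the same character lands on disk as different bytes depending on the encoding passed to open(). The helper name written_bytes is made up for illustration, and CP1252 is only an assumed example of a Windows locale default; the actual default varies by system.

```python
import locale
import os
import tempfile


def written_bytes(text, encoding=None):
    """Write text to a temporary file and return the raw bytes on disk."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.bib")
        # encoding=None lets open() fall back to the platform default.
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()


# The locale's preferred encoding, which open() normally uses by default.
print(locale.getpreferredencoding(False))
print(written_bytes("ö", encoding="utf-8"))   # b'\xc3\xb6'
print(written_bytes("ö", encoding="cp1252"))  # b'\xf6'
```

A file written with one of these encodings and read back with the other is where umlauts typically get lost or garbled.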

Is it fine if we discuss things here before I open a PR? I couldn't wrap my head around the source code for the middlewares and customisers yet (very custom engineering, TBF), but I did manage to write a simple test for test_writer.py, as follows.

def test_write_article_with_umlauts():
    entry_block = Entry(
        entry_type="article",
        key="myKey",
        fields=[
            Field(key="title", value='"myTitle"'),
            Field(key="author", value='"Müller, Max"'),
        ],
    )
    library = Library(blocks=[entry_block])
    string = writer.write(library)
    assert string == '@article{myKey,\n\ttitle = "myTitle",\n\tauthor = "Müller, Max"\n}\n'

I would appreciate feedback on this. Afterwards, we should be able to parametrize it with pytest for a few other characters (ö, ä, ë, and other diacritics you can think of!)

Edit: removed some redundant print statements, was just debugging something to stdout

MiWeiss · 2024-01-18T20:56:23Z

Hi @agriyakhetarpal

Your suggested test, as I read it, would not actually test the method targeted in this PR.

Instead, you would have to test the method bibtexparser.write_file and include the newly added encoding parameter. This will write a new bibtex file. You could then check if this file contains the expected content in the expected encoding.
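The check described above could be sketched roughly as follows, using only the standard library. Note that write_text is a hypothetical stand-in for a writer that accepts an encoding argument (such as the write_file variant proposed in this PR), and file_matches and check_roundtrip are illustrative helper names, not part of bibtexparser.

```python
import os
import tempfile


def write_text(path, text, encoding):
    """Hypothetical stand-in for a writer that takes an `encoding` argument."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def file_matches(path, expected_text, encoding):
    """Return True if the file decodes cleanly in `encoding` to the expected text."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        # Strict decoding fails on bytes produced by an incompatible encoding.
        decoded = raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    # Ignore platform-specific line endings when comparing.
    return decoded.replace("\r\n", "\n") == expected_text


def check_roundtrip(text, write_encoding, read_encoding):
    """Write with one encoding and verify the file against another."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "out.bib")
        write_text(path, text, write_encoding)
        return file_matches(path, text, read_encoding)


bibtex = "@article{k,\n\tauthor = {Müller}\n}\n"
print(check_roundtrip(bibtex, "utf-8", "utf-8"))   # True
print(check_roundtrip(bibtex, "cp1252", "utf-8"))  # False
```

Inspecting the raw bytes, rather than reading the file back through the same library, keeps the test from passing merely because writer and reader share the same wrong default.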

agriyakhetarpal · 2024-01-21T13:16:33Z

> Instead, you would have to test the method bibtexparser.write_file and include the newly added encoding parameter. This will write a new bibtex file. You could then check if this file contains the expected content in the expected encoding.

Makes sense. I improved the test so that it parses a given BibTeX string containing umlauts, writes it to a temporary file, and reads it back, as follows:

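import os
import tempfile

from bibtexparser import parse_file, parse_string, write_file


def test_write_file_with_umlauts():
    bibtex_str = """@article{umlauts,
    author = {Müller, Hans},
    title = {A title},
    year = {2014},
    journal = {A Journal}
    }"""
    library = parse_string(bibtex_str)
    # A temporary directory avoids reopening a still-open NamedTemporaryFile,
    # which fails on Windows.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "umlauts.bib")
        # Pass a path rather than an already-open file object, so that
        # write_file opens the file itself with the encoding added in this PR.
        write_file(path, library, encoding="utf-8")
        library = parse_file(path, encoding="utf-8")
    assert library.entries[0]["author"] == "Müller, Hans"
    assert library.entries[0]["title"] == "A title"
    assert library.entries[0]["year"] == "2014"
    assert library.entries[0]["journal"] == "A Journal"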

I opted to use temporary files so as to not create any clutter. Does this look fair enough? I can then write a few more tests or parameterize this as needed, perhaps with a few more characters.
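When picking characters for such a parametrization, it may help to check which ones a given codec can represent at all, since a character outside the codec raises an error on write instead of being garbled. The sketch below uses only the standard library; encodable is an illustrative helper name, and the candidate list is just an example.

```python
def encodable(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


candidates = ["ö", "ä", "ë", "ß", "ő", "ł"]
for encoding in ("ascii", "cp1252", "utf-8"):
    supported = [c for c in candidates if encodable(c, encoding)]
    # !a prints an ASCII-safe repr, so this works on any console encoding.
    print(f"{encoding}: {supported!a}")
```

Characters such as ő or ł, which CP1252 cannot encode, would make useful cases alongside the umlauts, because they exercise the failure path rather than the mojibake path.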

auge · 2025-01-30T06:56:35Z

The problem still exists on Windows (as outlined above).
As a workaround, one can also set an environment variable (using your favorite shell or in the system...):
export PYTHONUTF8=1
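To confirm that the variable actually takes effect, one can inspect sys.flags.utf8_mode (available since Python 3.7). The sketch below starts a fresh interpreter with the variable set, since it is only read at startup; utf8_mode_in_child is an illustrative helper name.

```python
import os
import subprocess
import sys


def utf8_mode_in_child(value):
    """Start a fresh interpreter with PYTHONUTF8=value and report its UTF-8 mode."""
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "1"


# UTF-8 mode of the current process (fixed when the interpreter started).
print(sys.flags.utf8_mode)
# UTF-8 mode of a child process started with PYTHONUTF8=1.
print(utf8_mode_in_child("1"))
```

With UTF-8 mode enabled, open() defaults to UTF-8 regardless of the locale, which is why the workaround sidesteps the problem without changing the library.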

Add encoding to file writing function

88d9946

Fixes sciunto-org#403

claell mentioned this pull request Sep 21, 2023

Writing doesn't work for umlauts (likely UTF-8 formatting problem) #403

Open

2 tasks

MiWeiss requested changes Sep 21, 2023

MiWeiss changed the title ~~Add encoding to file writing function~~ 🐛 Add encoding to file writing function Sep 21, 2023

MiWeiss added good first issue needs help labels Nov 2, 2023
