Normalizing English language tag #3100

Open
adunning opened this issue Dec 5, 2024 · 10 comments

Comments

@adunning

adunning commented Dec 5, 2024

Debug log ID

FH3W5CKW-refs-euc/6.7.263-7

What happened?

The CSL spec indicates that the language field should provide ISO 639-1 language tags (i.e. IETF tags). Hence, pandoc-citeproc follows this to the letter and will only apply title case to items either with no language specified or with the tag en. Unfortunately, Zotero does not normalize this on import, and many items end up with non-IETF tags in the language field, mostly ISO 639-2 codes, which triggers an undesired sentence-case citation. It would be most helpful if BBT could convert ISO 639-2 to ISO 639-1 language codes, and perhaps also normalize strings such as English to en.
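
The requested conversion could be sketched roughly as follows. This is only an illustration in Python, not BBT's code: the two lookup tables and the function name `normalize_language` are hypothetical, and a real converter would need the full ISO 639 code lists rather than these few entries.

```python
# Hypothetical lookup tables with a handful of entries for illustration only.
ISO_639_2_TO_1 = {
    "eng": "en",
    "lat": "la",
    "deu": "de",  # ISO 639-2/T code for German
    "ger": "de",  # ISO 639-2/B code for German
    "fra": "fr",
    "fre": "fr",
}
LANGUAGE_NAMES = {"english": "en", "latin": "la", "german": "de", "french": "fr"}


def normalize_language(value: str) -> str:
    """Map a known ISO 639-2 code or English language name to ISO 639-1.

    Anything not found in the tables is returned unchanged (apart from
    surrounding whitespace), so existing tags such as "en-GB" survive.
    """
    cleaned = value.strip()
    key = cleaned.lower()
    if key in ISO_639_2_TO_1:
        return ISO_639_2_TO_1[key]
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]
    return cleaned


print(normalize_language("eng"))      # en
print(normalize_language("English"))  # en
print(normalize_language("en-GB"))    # en-GB
```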

retorquere added a commit that referenced this issue Dec 6, 2024

github-actions bot commented Dec 6, 2024

🤖 this is your friendly neighborhood build bot announcing test build 6.7.263.7430 ("fixes #3100")

This update may name other issues, but the build just dropped here is for you; it just means problems already fixed in other issues have been folded into the work we are doing here. Install in Zotero by downloading test build 6.7.263.7430, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

@adunning
Author

adunning commented Dec 6, 2024

This works well; thank you! I would suggest two improvements:

  1. It appears that there is no need to change en-US to en, as both are valid IETF codes. Including a region still results in the correct capitalization in Pandoc:
pandoc --citeproc -t plain << EOT
---
references:
- id: example
  author: "Author"
  title: "Example title"
  language: "en-GB"
  issued:
    year: 2024
---

Citation: [@example].
EOT
  2. It is possible to have multiple language codes stored in the Language field, delimited with a space (e.g. eng lat) or occasionally a semicolon (e.g. eng;lat). For examples, add the ISBNs 978-0-19-815039-8 or 978-1-908590-41-1 to a Zotero library. If these codes are converted to IETF tags, this will give the expected result in Pandoc:
pandoc --citeproc -t plain << EOT
---
references:
- id: example
  author: "Author"
  title: "Example title"
  language: "en la"
  issued:
    year: 2024
---

Citation: [@example].
EOT
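
As a rough sketch of the second suggestion, a field holding several codes could be split on spaces or semicolons and each code converted separately. The table and the function name `convert_codes` are hypothetical and cover only the two codes used in the example above.

```python
import re

# Hypothetical table; only the codes from the example are included.
ISO_639_2_TO_1 = {"eng": "en", "lat": "la"}


def convert_codes(field: str) -> str:
    """Split a Language field on whitespace or semicolons and convert each code.

    Codes missing from the table are passed through unchanged.
    """
    parts = [p for p in re.split(r"[;\s]+", field.strip()) if p]
    return " ".join(ISO_639_2_TO_1.get(p.lower(), p) for p in parts)


print(convert_codes("eng lat"))  # en la
print(convert_codes("eng;lat"))  # en la
```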

Many thanks again!

@njbart
Contributor

njbart commented Dec 7, 2024

A word of caution, though: In CSL, the language variable is supposed to hold one single language tag only, and the variable’s current unique role is to switch on conversion of titles to title-case when rendering an item’s metadata if the language tag starts with en (and if, in addition, asked by the CSL style to do so, of course).

From https://docs.citationstyles.org/en/stable/specification.html#appendix-iv-variables (note the singular!):

language
The language of the item;
Should be entered as an ISO 639-1 two-letter language code (e.g. “en”, “zh”), optionally with a two-letter locale code (e.g. “de-DE”, “de-AT”)

The reason the language: "en la" example works as expected is merely because the tag starts with en; with language: "la en", it does not.
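
The behaviour reported in this thread can be summarised in a small predicate. This is an assumption-laden sketch, not Pandoc's actual implementation: the function name `triggers_title_case` is hypothetical, and the rule (no language, or a first subtag equal to en) is inferred only from the observations above.

```python
import re
from typing import Optional


def triggers_title_case(language: Optional[str]) -> bool:
    """Guess whether an item with this language value would be title-cased.

    Assumed rule: title case applies when no language is given, or when
    the first run of letters and digits in the value is "en".
    """
    if language is None or not language.strip():
        return True  # no language specified
    # Look only at the first subtag, ignoring case.
    primary = re.split(r"[^A-Za-z0-9]+", language.strip())[0].lower()
    return primary == "en"


for value in ["en", "en-GB", "en la", "la en", "eng"]:
    print(value, triggers_title_case(value))
```

Under this rule en, en-GB and "en la" count as English, while "la en" and the ISO 639-2 code eng do not, which matches the results described in the thread.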

Unfortunately, there is no CSL variable intended to record the language(s) the content of a work is written in (for this purpose, biblatex has language; confusingly, as CSL’s language equals biblatex’s langid).

@retorquere
Owner

But then there's no benefit to adding en-US over just en.

@njbart
Contributor

njbart commented Dec 7, 2024

I’d still recommend not throwing away information, so I’d always import something like american as en-US rather than just en. (I wouldn’t add information, though, so english should just remain en.)

In any case, keeping language-plus-locale tags is essential when exporting to biblatex, as biblatex can also modify hyphenation, punctuation, and localised terms, all of which might differ between, say, en-US and en-GB, or de-DE and de-AT.
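
A minimal sketch of this "keep what is there, add nothing" approach might look like the following. The table `NAME_TO_TAG` and the function `to_language_tag` are hypothetical; the entries are illustrative and not taken from BBT or babel.

```python
# Hypothetical mapping from language names to tags. Names that imply a
# region keep it; generic names map to a bare language code.
NAME_TO_TAG = {
    "american": "en-US",
    "british": "en-GB",
    "australian": "en-AU",
    "english": "en",  # no region implied, so none is added
    "austrian": "de-AT",
    "german": "de",   # likewise left without a region
}


def to_language_tag(name: str) -> str:
    """Convert a known language name to a tag, leaving other values unchanged."""
    cleaned = name.strip()
    return NAME_TO_TAG.get(cleaned.lower(), cleaned)


print(to_language_tag("american"))  # en-US
print(to_language_tag("english"))   # en
print(to_language_tag("en-GB"))     # en-GB
```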

From the current biblatex manual:

It is highly advisable to always specify american, british, australian, etc. rather than english when loading the babel/polyglossia packages to avoid any possible confusion.

@retorquere
Owner

But that doesn't apply to CSL, right?

@njbart
Contributor

njbart commented Dec 7, 2024

Right. From a CSL (processor) perspective, it currently does not matter if it’s en or en-US.

That being said, en-US is a perfectly valid CSL language tag (see quote from the CSL specs, above), so there’s no reason not to use it.

The OP was about normalising upon import after all, where I would continue to argue that throwing away available information (e.g., by ‘normalising’ from en-US to en) is not a good idea since this very information might be useful, at the very least when exporting to biblatex.

@retorquere
Owner

That is not my understanding - I think the OP was talking about items already in Zotero, and that during that import (from whatever source) the language values end up being a hodgepodge (likely so no information is discarded), and how they could be normalized on CSL export. I don't have CSL import, just export.

The reason I'd prefer to leave it as en is that I currently reuse code I already have, and changing it would be kind of involved.

@adunning
Author

adunning commented Dec 7, 2024

Yes, my aim is purely to export items from Zotero into valid CSL JSON, for use in Pandoc. While currently this only changes whether title case is applied, I plan to see whether language tagging can also be applied to citations, if this field can be normalized reliably.

I hadn't realized that it was against the spec to list more than one language tag. In that case, if more than one is recorded in Zotero, perhaps only the first could be kept?
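
Keeping only the first entry could be as simple as the sketch below. The function name `first_language` is hypothetical, and the delimiters (whitespace and semicolons) are the ones mentioned earlier in this thread.

```python
import re


def first_language(field: str) -> str:
    """Return the first code in a space- or semicolon-delimited Language field."""
    parts = [p for p in re.split(r"[;\s]+", field.strip()) if p]
    return parts[0] if parts else ""


print(first_language("eng lat"))  # eng
print(first_language("eng;lat"))  # eng
```

The result would still need to be converted to a two-letter code afterwards.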

If I can get Pandoc to output language tagging with citations, it could be useful to be able to distinguish between, for example, de-DE and de-AT (as @njbart notes), to control hyphenation. It will make no difference for Pandoc's current functionality. I had mainly assumed that the code could be simplified if it were not concerned with discarding this information – it's probably not worth your time if that's not the case.

@retorquere
Owner

Zotero doesn't really have the concept of multiple languages stored per item. It's a single free-form string.

I can take a look later next week at what I can do about locales. It may in the end be simpler but it's not now. The language normalizer in BBT scripts off of babel's language configs, and I don't recall how much flexibility I kept in that process.
