It turns out "my" bug is already fixed in the development version of LaTeX::ToUnicode. Compare the CPAN version

https://metacpan.org/dist/LaTeX-ToUnicode/source/lib/LaTeX/ToUnicode.pm#L46

with current master:

https://github.com/borisveytsman/BibTeXPerlLibs/blob/5d24b66bd2461f1f3fc4d9a28dd8774ad6c75829/LaTeX-ToUnicode/lib/LaTeX/ToUnicode.pm#L314

The crucial difference is `\w{1,1}` vs. `\w{1,2}`. Given e.g. `Fr\"oding`, the latter greedily matches two letters whenever two are available and tries to look up `$ACCENTS{'"'}{od}` rather than `$ACCENTS{'"'}{o}`, and that lookup fails.

You should be able to install the development version by cloning the GitHub repo, cd-ing into LaTeX-ToUnicode and running `cpanm --force .` The force is needed because a couple of conversion tests (`\$` -> `$` and `\&` -> `&`) still fail. If you don't have cpanm installed, you need to run `cpan App::cpanminus` first.

However, the development version still fails one of the test cases I wrote yesterday: it leaves inputs like `\'{\ae}` and `\"{\ae}`, which TeX is perfectly fine with, as `\'æ \"æ`. The first of these exists precomposed in Unicode, because Danish sometimes uses it, so that is definitely a bug. The second should arguably be replaced with letter + combining mark. I will likely make a couple of pull requests as and when I have time.

/bpj

On Sun, 3 July 2022 at 21:43, BPJ wrote:

> On Sun, 3 July 2022 at 18:55, Paulo Ney de Souza wrote:
>
>> On Sun, Jul 3, 2022 at 5:15 AM BPJ wrote:
>>
>>> It's an upstream bug in LaTeX::ToUnicode.
>>
>> On LaTeX::ToUnicode? I thought you only used BibTeX::Parser.
>
> Which uses LaTeX::ToUnicode in its `cleaned_*` methods.
>
>>> I just had never run into it AFAIK because all the .bib files I had
>>> written myself or downloaded from the libraries I use had used `\"{a}`
>>> rather than `\"a` which doesn't hit the bug. I have located the bug and am
>>> working on a patch. Thanks for discovering this! (There are a lot of
>>> unattended bugs though. Do you want me to send you the patch when it is
>>> ready?)
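To make the `\w{1,1}` vs. `\w{1,2}` difference concrete, here is a minimal Python sketch. It is not the module's actual code: the `ACCENTS` table, the two patterns and the `convert` helper are simplified stand-ins assumed purely for illustration, and Python's `re` is used only because its greedy `{1,2}` quantifier behaves like Perl's in this case.

```python
import re

# Hypothetical, heavily reduced stand-in for the module's accent table:
# accent command -> base letter -> precomposed character.
ACCENTS = {'"': {"a": "ä", "o": "ö", "u": "ü"}}

# Illustrative patterns only; the real regexes in LaTeX::ToUnicode are more involved.
TWO_LETTERS = re.compile(r'\\(["])(\w{1,2})')  # greedy: grabs "od" from \"oding
ONE_LETTER = re.compile(r'\\(["])(\w{1,1})')   # grabs just "o"


def convert(text, pattern):
    """Replace accent commands whose base is found in ACCENTS; leave others untouched."""
    def repl(match):
        accent, base = match.group(1), match.group(2)
        return ACCENTS.get(accent, {}).get(base, match.group(0))
    return pattern.sub(repl, text)


print(convert(r'Fr\"oding', TWO_LETTERS))  # lookup key is "od": no hit, input unchanged
print(convert(r'Fr\"oding', ONE_LETTER))   # lookup key is "o": Fröding
```

With the two-letter pattern the lookup key is `od`, which is not in the table, so the accent command survives unconverted. The one-letter pattern finds `o` and produces the expected `ö`.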
>>
>> I know of some bugs on BibTeX::Parser (and none on LaTeX::ToUnicode).
>
> There are some on the old CPAN RT tracker. I don't know if they are de
> facto fixed.
>
>> It would be nice to have all of them listed on the issues of the project
>> page:
>>
>> https://github.com/borisveytsman/BibTeXPerlLibs/issues
>
> Thanks for the link. It is missing on MetaCPAN. One of the old bugs
> complains about sloppy packaging.
>
>> especially if you are producing a patch.
>
> Well it will be listed when I submit a PR!
>
>> Paulo Ney
>>
>>> On Sun, 3 July 2022 at 07:42, Paulo Ney de Souza wrote:
>>>
>>>> I got interested in another aspect of the posting -- the program
>>>> "cleanbib.pl" by Benct.
>>>>
>>>> I installed it in Ubuntu, and found out it does not process perfectly
>>>> valid TeX code like characters that end or have a space in the middle, or
>>>> that it processes \c{e}, but not the comma-accent on any of the other vowels...
>>>>
>>>> I prepared the torture test below to show the problems:
>>>>
>>>> @Book{hobbit,
>>>>   title = {Les \oe uf de la serpente},
>>>>   address = {Bla\v zi\'c},
>>>>   publisher = {\c{a} \c{e} \c{i} \c{o} \c{u}},
>>>> }
>>>>
>>>> and above all, how does this compare to:
>>>>
>>>> https://ctan.org/tex-archive/support/bibtexperllibs/LaTeX-ToUnicode
>>>>
>>>> Paulo Ney
>>>>
>>>> On Sat, Jul 2, 2022 at 1:03 PM BPJ wrote:
>>>>
>>>>> string.gsub() optionally takes the maximum number of substitutions as
>>>>> a fourth argument, and you can reinsert capture groups in the replacement,
>>>>> so this should be fairly robust:
>>>>>
>>>>> ``````lua
>>>>> string.gsub(title, '%:(%s)', '.%1', 1)
>>>>> ``````
>>>>>
>>>>> On Fri, 1 July 2022 at 18:44, John Carter Wood wrote:
>>>>>
>>>>>> Ah, of course, biblical references. Religious history is one of my
>>>>>> fields, how could I miss that?
>>>>>>
>>>>>> Looking forward to trying this out!
>>>>>>
>>>>>> denis...-NSENcxR/0n0@public.gmane.org wrote on Friday, 1 July 2022 at 18:41:02 UTC+2:
>>>>>>
>>>>>>> A slightly more reliable version:
>>>>>>>
>>>>>>> ```lua
>>>>>>> local stringify = pandoc.utils.stringify
>>>>>>>
>>>>>>> function Meta(m)
>>>>>>>   if m.references ~= nil then
>>>>>>>     for _, el in ipairs(m.references) do
>>>>>>>       -- print(stringify(el.title))
>>>>>>>       -- the extra parentheses keep only gsub's first return value (the string)
>>>>>>>       el.title = pandoc.Str((string.gsub(stringify(el.title), ': ', '. ')))
>>>>>>>       -- print(el.title)
>>>>>>>     end
>>>>>>>   end
>>>>>>>   return m
>>>>>>> end
>>>>>>> ```
>>>>>>>
>>>>>>> (This won’t replace colons in biblical references, e.g. Gen 1:1)
>>>>>>>
>>>>>>> You can test with this file:
>>>>>>>
>>>>>>> ```markdown
>>>>>>> ---
>>>>>>> references:
>>>>>>> - type: book
>>>>>>>   id: doe
>>>>>>>   author:
>>>>>>>   - family: Doe
>>>>>>>     given: Jane
>>>>>>>   issued:
>>>>>>>     date-parts:
>>>>>>>     - - 2022
>>>>>>>   title: 'A book: with a subtitle and a reference to Gen 1:1, but that is not a problem'
>>>>>>>   publisher: 'Whatever press'
>>>>>>> lang: de-De
>>>>>>> ...
>>>>>>>
>>>>>>> test [@doe]
>>>>>>> ```
>>>>>>>
>>>>>>> The filter itself does not cover capitalization. For some reason,
>>>>>>> pandoc or citeproc applies title-case transformation here. I don’t think it
>>>>>>> should though.
>>>>>>>
>>>>>>> *From:* pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org *On behalf of* John Carter Wood
>>>>>>> *Sent:* Friday, 1 July 2022 18:24
>>>>>>> *To:* pandoc-discuss
>>>>>>> *Subject:* Re: Changing colons to full-stops in titles
>>>>>>>
>>>>>>> That's very interesting, thanks! I'll try it out when I get a chance
>>>>>>> in the coming days.
>>>>>>>
>>>>>>> I have thought about this issue of false positives while thinking
>>>>>>> about the option of some kind of filter. But...I think they would be very
>>>>>>> rare. I have a hard time thinking of a title with a colon in it that
>>>>>>> shouldn't -- in this case -- be turned into a dot. At least, I don't
>>>>>>> have anything in my 1,200 references where I can see that that wouldn't
>>>>>>> apply.
>>>>>>>
>>>>>>> Although, of course, I'm sure there are some out there...
>>>>>>>
>>>>>>> Just a question: would this also ensure that the first word after
>>>>>>> the dot is capitalised? Or does that open a new series of problems? :-)
>>>>>>>
>>>>>>> denis...-NSENcxR/0n0@public.gmane.org wrote on Friday, 1 July 2022 at 18:17:02 UTC+2:
>>>>>>>
>>>>>>> Here’s a very simple and absolutely unreliable version of a filter.
>>>>>>> This will replace every colon in a title with a period.
>>>>>>>
>>>>>>> ```lua
>>>>>>> local stringify = pandoc.utils.stringify
>>>>>>>
>>>>>>> function Meta(m)
>>>>>>>   if m.references ~= nil then
>>>>>>>     for _, el in ipairs(m.references) do
>>>>>>>       print(stringify(el.title))
>>>>>>>       -- the extra parentheses keep only gsub's first return value (the string)
>>>>>>>       el.title = pandoc.Str((string.gsub(stringify(el.title), ':', '.')))
>>>>>>>       print(el.title)
>>>>>>>     end
>>>>>>>   end
>>>>>>>   return m
>>>>>>> end
>>>>>>> ```
>>>>>>>
>>>>>>> Question is how this can be made robust enough to avoid false
>>>>>>> positives.
>>>>>>>
>>>>>>> *From:* pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org *On behalf of* John Carter Wood
>>>>>>> *Sent:* Friday, 1 July 2022 17:52
>>>>>>> *To:* pandoc-discuss
>>>>>>> *Subject:* Re: Changing colons to full-stops in titles
>>>>>>>
>>>>>>> Thanks for the suggestions, a couple of which are kind of stretching
>>>>>>> my knowledge of these things, but I see where they're going.
>>>>>>>
>>>>>>> As to JGM's question: I am using a CSL json bibliography, so my
>>>>>>> titles are in a single field. ("title":"Science and religion: new
>>>>>>> perspectives on the dialogue")
>>>>>>>
>>>>>>> The issue is that *most* of the journals / publishers I publish in
>>>>>>> use, as here, the colon. *Some* (mainly German) styles want the period. If
>>>>>>> I were solely interested in either one, I could choose and just enter the
>>>>>>> relevant punctuation in the title field. However, I want to continue saving
>>>>>>> my bibliographic entries with a colon (because that's the most standard one
>>>>>>> for me), but have the option of automatically converting them to a period
>>>>>>> for those cases where I need to. If that makes sense.
>>>>>>>
>>>>>>> Thus: going through denis's options:
>>>>>>>
>>>>>>> 1. I have switched to json bibliographies from bibtex/biblatex as
>>>>>>> they seemed to offer more flexibility (I was running into issues with the
>>>>>>> strange archival references I have to make in my field, and JSON seemed to
>>>>>>> work better in that regard). So this seems to not apply.
>>>>>>>
>>>>>>> 2. Seems to not apply, as I have a single title field
>>>>>>>
>>>>>>> 3. Sounds really interesting, and I use BBT, though it also sounds
>>>>>>> like I would here have to create a separate bibliography file from my
>>>>>>> Zotero database for those publishers/styles that require the dot. This is
>>>>>>> not *too* onerous, as it would at least be automated.
>>>>>>>
>>>>>>> 4. Having a filter that I could simply apply (as part of a pandoc
>>>>>>> command, say) or not apply as relevant seems like the most flexible /
>>>>>>> efficient solution. I don't know lua, but if this is one possible way, then
>>>>>>> I could use it as a (hopefully fairly simple?) way into learning it.
>>>>>>>
>>>>>>> Does this help to clarify my situation?
>>>>>>>
>>>>>>> denis...-NSENcxR/0n0@public.gmane.org wrote on Friday, 1 July 2022 at 17:34:55 UTC+2:
>>>>>>>
>>>>>>> Yes, that’s a known issue...
>>>>>>>
>>>>>>> There are a couple of possible solutions:
>>>>>>>
>>>>>>> 1. use biblatex databases and patch pandoc so it will concat title
>>>>>>> and subtitle fields using periods. (line 667
>>>>>>> https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Citeproc/BibTeX.hs
>>>>>>> )
>>>>>>>
>>>>>>> 2. I think pandoc’s citeproc will just treat every unknown variable
>>>>>>> as a string variable (see
>>>>>>> https://github.com/jgm/citeproc/blob/3f94424db469c804cf2dac2d22dc7a18b614f43e/src/Citeproc/Types.hs#L1054
>>>>>>> and
>>>>>>> https://github.com/jgm/citeproc/blob/3f94424db469c804cf2dac2d22dc7a18b614f43e/src/Citeproc/Types.hs#L901),
>>>>>>> so you should be able to use «subtitle» in styles. (This will give you
>>>>>>> warnings when using the style with Zotero and it won’t work reliably across
>>>>>>> implementations, but anyway ...)
>>>>>>>
>>>>>>> 3. if you’re using Zotero, you can leverage Zotero BBT’s postscript
>>>>>>> feature to manipulate the JSON after exporting.
>>>>>>>
>>>>>>> E.g., this one:
>>>>>>>
>>>>>>>     if (Translator.BetterCSL && item.title) {
>>>>>>>       reference.title = reference.title.replace(/ : /g, '. ')
>>>>>>>     }
>>>>>>>
>>>>>>> Not bullet-proof, but simple. You will want to choose a better
>>>>>>> separator, maybe a double-bar or so.
>>>>>>>
>>>>>>> 4. Doing this with Lua should also be possible...
>>>>>>>
>>>>>>> The question is: do you have the subtitle in a distinct field or is
>>>>>>> it just in the title field?
>>>>>>>
>>>>>>> *From:* pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org *On behalf of* John Carter Wood
>>>>>>> *Sent:* Friday, 1 July 2022 16:39
>>>>>>> *To:* pandoc-discuss
>>>>>>> *Subject:* Changing colons to full-stops in titles
>>>>>>>
>>>>>>> I have one final (for now...) issue in setting up a CSL file (which
>>>>>>> I use with pandoc/citeproc and references in a json file).
>>>>>>>
>>>>>>> I'm not sure whether this is a CSL issue or whether it's an issue
>>>>>>> that can be solved via using a filter (or some other solution) in pandoc,
>>>>>>> but I thought there might be some people here who might have faced a
>>>>>>> similar issue.
>>>>>>>
>>>>>>> The house style here (German-based publisher) wants a *full-stop/period*
>>>>>>> between main title and subtitle in citations / bibliographies;
>>>>>>> US/UK standard is a *colon* between main title and subtitle. And
>>>>>>> reference managers like Zotero -- IIUC -- save titles as single fields (at
>>>>>>> least they are in my version of Zotero). So it doesn't seem like it is
>>>>>>> possible to control what delimiter is used between them via CSL.
>>>>>>>
>>>>>>> I have found various discussions of relevant title/subtitle division
>>>>>>> issues -- some going back quite a few years -- in forums on Zotero:
>>>>>>>
>>>>>>> https://forums.zotero.org/discussion/8077/separate-fields-for-title-and-subtitle/
>>>>>>>
>>>>>>> ...and CSL:
>>>>>>>
>>>>>>> https://discourse.citationstyles.org/t/handling-main-sub-title-splits-citeproc-js/1563/11
>>>>>>>
>>>>>>> However, these were in part discussions among developers about
>>>>>>> *possible* changes, and I'm not sure of the current status of this
>>>>>>> issue or whether there is a way to handle it.
>>>>>>>
>>>>>>> Would it be possible to automate turning colons in titles into
>>>>>>> full-stops via using a filter? If so is there such a filter already around?
>>>>>>> Can this be done via CSL?
>>>>>>>
>>>>>>> Or is this, as of now, impossible?
>>>>>>>
>>>>>>> (Or is there a real simple solution that I have, as usual,
>>>>>>> overlooked...)
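For comparison with the Lua filters quoted earlier in this thread, the same substitution can also be applied to the bibliography file itself before pandoc reads it. The following is only a rough Python sketch under stated assumptions: the bibliography is taken to be a CSL JSON array of objects with an optional string `title` field, and the names `colon_to_period` and `convert_bibliography` are made up for illustration. It rewrites only the first colon that is followed by whitespace and capitalises the letter after it, which also touches on the capitalisation question raised in the thread.

```python
import json
import re

# First colon that is followed by whitespace and then a word character.
# "Gen 1:1" is not affected because no whitespace follows that colon.
_SEPARATOR = re.compile(r":(\s+)(\w)")


def colon_to_period(title):
    """Turn the first 'Title: subtitle' colon into a period and capitalise the subtitle."""
    return _SEPARATOR.sub(
        lambda m: "." + m.group(1) + m.group(2).upper(), title, count=1
    )


def convert_bibliography(csl_json_text):
    """Return CSL JSON text with every string 'title' field rewritten."""
    entries = json.loads(csl_json_text)
    for entry in entries:
        if isinstance(entry.get("title"), str):
            entry["title"] = colon_to_period(entry["title"])
    return json.dumps(entries, ensure_ascii=False, indent=2)


sample = '[{"id": "doe", "title": "Science and religion: new perspectives on the dialogue"}]'
print(convert_bibliography(sample))
```

Because the original file is left untouched, the colon version can stay the master copy and a converted copy can be generated only for the styles that want the period.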
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "pandoc-discuss" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/pandoc-discuss/78df697a-50f5-46d0-b0b8-29a2cbc9509an%40googlegroups.com.
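
Regarding the `\'{\ae}` and `\"{\ae}` cases mentioned at the top of this message: the difference between "exists precomposed" and "letter + combining mark" can be checked with Unicode normalization. This is a small Python sketch using only the standard `unicodedata` module. The helper name `accent` is made up for illustration, and nothing here is taken from LaTeX::ToUnicode itself.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
COMBINING_DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def accent(base, combining_mark):
    """Attach a combining mark to a base letter and compose where Unicode allows."""
    return unicodedata.normalize("NFC", base + combining_mark)


acute_ae = accent("æ", COMBINING_ACUTE)
diaeresis_ae = accent("æ", COMBINING_DIAERESIS)

# One code point: the precomposed LATIN SMALL LETTER AE WITH ACUTE.
print(acute_ae, [unicodedata.name(ch) for ch in acute_ae])
# Two code points: the letter followed by the combining diaeresis.
print(diaeresis_ae, [unicodedata.name(ch) for ch in diaeresis_ae])
```

NFC composition gives the single character U+01FD for the acute case and leaves the diaeresis case as base letter plus combining mark, which matches the two outcomes suggested above.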