* "double emphasis" bug when converting to asciidoc?
@ 2022-01-04 18:02 Frank Bergmann
       [not found] ` <3f7b920b-c982-5be5-fa04-9025e008e518-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Frank Bergmann @ 2022-01-04 18:02 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

I found a strange behaviour when converting some HTML files to asciidoc.

Versions used:
asciidoc 9.1.0
pandoc 2.16.2

Example input:

<!DOCTYPE HTML>
<html>
<head>
<title>Xx</title>
</head>
<body>
<a href="x.htm"><i>Xx</i></a><i>,</i>
</body>
</html>

With "pandoc --wrap=none -f html -t asciidoc" I get this asciidoc output:

link:x.htm[_Xx_]__,__

The double underscores look "suspicious" and with "asciidoc -b docbook;xmllint" I get:

z.xml:10: parser error : Unescaped '<' not allowed in attributes values
<simpara>link:x.htm<emphasis><phrase role="<emphasis>Xx</emphasis>">,</phrase></

The related docbook line which was created by asciidoc:

<simpara>link:x.htm<emphasis><phrase role="<emphasis>Xx</emphasis>">,</phrase></emphasis></simpara>

*Is this a known bug?*

If I add a space before the comma...

<a href="x.htm"><i>Xx</i></a><i> ,</i>

then I get

link:x.htm[_Xx_] _,_

which causes no issue. Adding a space before the emphasis...

<a href="x.htm"><i>Xx</i></a> <i>,</i>

also creates an asciidoc file which can be rendered:

link:x.htm[_Xx_] _,_

Does someone know this? Does a fix already exist?

cheers,
Frank

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3f7b920b-c982-5be5-fa04-9025e008e518%40tuxad.com.
[parent not found: <3f7b920b-c982-5be5-fa04-9025e008e518-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>]
* Re: "double emphasis" bug when converting to asciidoc?
       [not found] ` <3f7b920b-c982-5be5-fa04-9025e008e518-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>
@ 2022-01-04 19:12   ` John MacFarlane
       [not found]     ` <m2v8yzpb5x.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: John MacFarlane @ 2022-01-04 19:12 UTC
To: Frank Bergmann, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I see this in the code:

inlineToAsciiDoc opts (Emph lst) = do
  contents <- inlineListToAsciiDoc opts lst
  isIntraword <- gets intraword
  let marker = if isIntraword then "__" else "_"
  return $ marker <> contents <> marker

So apparently in asciidoc you use __ for intraword emphasis
(I don't use asciidoc myself).

The problem may be that this is a context asciidoc doesn't
consider "intraword."

Anyway, please submit a bug report and link here.
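As a rough illustration of the marker choice in the clause John quotes, here is a Python paraphrase of that Haskell code. It is a sketch only, not pandoc's implementation: the function name `emph` and its boolean parameter are assumptions standing in for the writer's `intraword` state. Assuming that state was set when the comma was written, it reproduces the output from the original report:

```python
def emph(contents: str, intraword: bool) -> str:
    """Mirror the quoted clause: double underscores when flagged intraword."""
    marker = "__" if intraword else "_"
    return marker + contents + marker


# Assumption: the link text is written with intraword = False and the
# following comma with intraword = True, which matches the reported output.
line = "link:x.htm[" + emph("Xx", False) + "]" + emph(",", True)
print(line)  # link:x.htm[_Xx_]__,__
```

Whether asciidoc should treat the position directly after the closing `]` as intraword is exactly the open question in this thread.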
[parent not found: <m2v8yzpb5x.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>]
* Re: "double emphasis" bug when converting to asciidoc?
       [not found]     ` <m2v8yzpb5x.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
@ 2022-01-04 19:50       ` Leonard Rosenthol
       [not found]         ` <CALu=v3JE2PPCY8=agCY9wtvwrMKXAidpSVFN650oc+Hge8J3dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2022-01-05  8:39       ` "double emphasis" bug when converting to asciidoc? Frank Bergmann
  1 sibling, 1 reply; 9+ messages in thread
From: Leonard Rosenthol @ 2022-01-04 19:50 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Frank Bergmann

The AsciiDoc spec on emphasis/italics is here -
https://docs.asciidoctor.org/asciidoc/latest/text/italic/ - and it talks about
the intraword scenario.

Hope that helps.

Leonard
[parent not found: <CALu=v3JE2PPCY8=agCY9wtvwrMKXAidpSVFN650oc+Hge8J3dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: "double emphasis" bug when converting to asciidoc?
       [not found]         ` <CALu=v3JE2PPCY8=agCY9wtvwrMKXAidpSVFN650oc+Hge8J3dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2022-01-05  3:17           ` John MacFarlane
       [not found]             ` <m2pmp6q3ah.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: John MacFarlane @ 2022-01-05 3:17 UTC
To: Leonard Rosenthol, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Frank Bergmann

That's helpful. Please submit a bug report at
https://github.com/jgm/pandoc/issues
[parent not found: <m2pmp6q3ah.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>]
* Re: "double emphasis" bug when converting to asciidoc?
       [not found]             ` <m2pmp6q3ah.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
@ 2022-01-05 17:19               ` Frank Bergmann
  2022-06-25  9:34               ` Translating style sheets in reader on HTML input Frank Bergmann
  1 sibling, 0 replies; 9+ messages in thread
From: Frank Bergmann @ 2022-01-05 17:19 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

thank you both, John and Leonard.
I read the asciidoc specs, did some tests with asciidoc's intraword
emphasis, created a few test cases and will now submit a bug with my
findings.

Frank

--
Frank Bergmann, Pödinghauser Str. 5, D-32051 Herford, Tel. +49-5221-9249753
SAP Hybris & Linux LPIC-3, E-Mail tx2014-VEyjnN4Vo9k@public.gmane.org, USt-IdNr DE237314606
http://tdyn.de/freel -- Redirect to profile at freelancermap
http://www.gulp.de/freiberufler/2HNKY2YHW.html -- Profile at GULP
* Translating style sheets in reader on HTML input
       [not found]             ` <m2pmp6q3ah.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  2022-01-05 17:19               ` Frank Bergmann
@ 2022-06-25  9:34               ` Frank Bergmann
       [not found]                 ` <c09f254c-5ccf-1ed4-97ab-4e6bccbbdcb6-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Frank Bergmann @ 2022-06-25 9:34 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

this time I have some questions.

As far as I understand, Lua scripting does not operate on the actual
input but only on the already translated native format.
What I need is to do some "translations" on the raw HTML input.
(BTW - the actual output here is asciidoc.)

My issue is that the "HTML" input has a lot of styles like these:

<span class="Kursiv">
<span class="FettUnterstrichen">
<p class="Normal_fett">
<p class="rml10_101__Normal_fett">
<p class="rml10_112__Normal_fett">
<p class="rml10_114__Normal_fett">
<p class="rml10_11__Normal_fett">
<p class="rml10_122__Normal_fett">
<p class="rml10_124__Normal_fett">
<p class="rml10_133__Normal_fett">
<p class="rml10_136__Normal_fett">
<p class="rml10_138__Normal_fett">
<p class="rml10_177__Normal_fett">
<span class="Fett">
<span class="FettUnterstrichen">

(Note: kursiv=italic/emphasized, fett=bold, unterstrichen=underline)

Is there a way in pandoc to "translate" styles, e.g. the ones containing
"fett", into a simple HTML tag such as "<b>" before the internal
translation to native and then to the output format?
Can a lua script be used for this?
Or do I need to write a translator of my own and run it BEFORE using pandoc?

(Note: The "HTML" input is coming from Adobe RoboHelp.)

kind regards,
Frank
[parent not found: <c09f254c-5ccf-1ed4-97ab-4e6bccbbdcb6-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>]
* Re: Translating style sheets in reader on HTML input
       [not found]                 ` <c09f254c-5ccf-1ed4-97ab-4e6bccbbdcb6-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>
@ 2022-06-25 11:23                   ` William Lupton
       [not found]                     ` <CAEe_xxh02ZZ_HbZS0cPDZ4rWE+ES5zYJQsa4Uw9_bTBX5aEAVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: William Lupton @ 2022-06-25 11:23 UTC
To: pandoc-discuss

Yes, lua filters operate on the AST (abstract syntax tree).

I think that some pre-processing will be necessary because (AFAIK) the Para
(p) element doesn't retain attributes in the AST.

Here's an example using HTML derived from yours (the p element is wrapped
in a div). Note: I think perhaps the lua div logic could be simpler, but
this seems to work.

% cat kursiv.html
<span class="Kursiv">span-text-with-class</span>
<div class="Normal_fett"><p>para-text-class-from-div</p></div>

% pandoc kursiv.html -L kursiv.lua
<em><span>span-text-with-class</span></em>
<div>
<p><strong>para-text-class-from-div</strong></p>
</div>

% cat kursiv.lua
-- Wrap spans carrying the Kursiv class in emphasis and drop the class.
function Span(span)
    local class, index = span.attr.classes:find('Kursiv')
    if class then
        span.attr.classes:remove(index)
        return pandoc.Emph({span})
    end
end

-- For divs carrying Normal_fett, make the content of each child block strong.
-- Assumes the children are paragraph-like blocks whose content is a list of inlines.
function Div(div)
    local class, index = div.attr.classes:find('Normal_fett')
    if class then
        div.attr.classes:remove(index)
        div.content = div.content:map(
            function(elem)
                elem.content = {pandoc.Strong(elem.content)}
                return elem
            end
        )
        return div
    end
end
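For the pre-processing route William mentions, one option is to rewrite the class-based markup into plain tags before pandoc reads the file. The following is a minimal sketch using only Python's standard library, not a tested RoboHelp converter. The names `STYLE_TAGS`, `tags_for`, `StyleRewriter` and `rewrite_styles` are made up for this example, the substring matching on `kursiv`/`fett`/`unterstrichen` is an assumption based on the class names listed in the question, and the code assumes every `<p>` and `<span>` has an explicit closing tag. Styled elements lose their `class` attribute.

```python
from html.parser import HTMLParser

# Assumed mapping: substring of the class name -> plain HTML tag.
STYLE_TAGS = (("kursiv", "i"), ("fett", "b"), ("unterstrichen", "u"))


def tags_for(class_attr):
    """Return the plain tags implied by a class attribute value."""
    lowered = (class_attr or "").lower()
    return [tag for needle, tag in STYLE_TAGS if needle in lowered]


class StyleRewriter(HTMLParser):
    """Re-emit the input, replacing styled <span>/<p> with plain tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.stack = []  # one (tag, wrappers) entry per open <span>/<p>

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "p"):
            wrappers = tags_for(dict(attrs).get("class"))
            self.stack.append((tag, wrappers))
            if wrappers:
                opening = "".join(f"<{w}>" for w in wrappers)
                # A styled span becomes its wrappers; a styled <p> keeps a bare <p>.
                self.out.append(("<p>" if tag == "p" else "") + opening)
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("span", "p") and self.stack and self.stack[-1][0] == tag:
            _, wrappers = self.stack.pop()
            if wrappers:
                closing = "".join(f"</{w}>" for w in reversed(wrappers))
                self.out.append(closing + ("</p>" if tag == "p" else ""))
                return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_styles(html_text):
    """Return html_text with class-based styling turned into <i>/<b>/<u>."""
    parser = StyleRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

With this, `rewrite_styles('<p class="rml10_101__Normal_fett">x</p>')` gives `<p><b>x</b></p>`, and the rewritten HTML can then be passed to `pandoc -f html -t asciidoc` as before. A parser is used instead of regular expressions so that nested spans close in the right order.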
[parent not found: <CAEe_xxh02ZZ_HbZS0cPDZ4rWE+ES5zYJQsa4Uw9_bTBX5aEAVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Translating style sheets in reader on HTML input
       [not found]                     ` <CAEe_xxh02ZZ_HbZS0cPDZ4rWE+ES5zYJQsa4Uw9_bTBX5aEAVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2022-06-28 11:56                       ` Frank Bergmann
  0 siblings, 0 replies; 9+ messages in thread
From: Frank Bergmann @ 2022-06-28 11:56 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi William,

awesome, thank you!

Frank
* Re: "double emphasis" bug when converting to asciidoc?
       [not found]   ` <m2v8yzpb5x.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  2022-01-04 19:50     ` Leonard Rosenthol
@ 2022-01-05  8:39     ` Frank Bergmann
  1 sibling, 0 replies; 9+ messages in thread
From: Frank Bergmann @ 2022-01-05  8:39 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Frank Bergmann

Hi,

that's awesome. Thank you for the quick and helpful answer, John.
I was not aware of the intraword marking. I'll check the specs to see
whether this is undesired behaviour in asciidoc or in pandoc, and then
submit a bug report accordingly.

cheers,
Frank

On 04.01.22 20:12, John MacFarlane wrote:
> I see this in the code:
>
> inlineToAsciiDoc opts (Emph lst) = do
>   contents <- inlineListToAsciiDoc opts lst
>   isIntraword <- gets intraword
>   let marker = if isIntraword then "__" else "_"
>   return $ marker <> contents <> marker
>
>
> So apparently in asciidoc you use __ for intraword emphasis
> (I don't use asciidoc myself).
>
> The problem may be that this is a context asciidoc doesn't
> consider "intraword."
>
> Anyway, please submit a bug report and link here.
>
> Frank Bergmann <pandoc-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org> writes:
>
>> Hi,
>>
>> I found a strange behaviour when converting some HTML files to asciidoc.
>>
>> Versions used:
>> asciidoc 9.1.0
>> pandoc 2.16.2
>>
>> Example input:
>>
>> <!DOCTYPE HTML>
>> <html>
>> <head>
>> <title>Xx</title>
>> </head>
>> <body>
>> <a href="x.htm"><i>Xx</i></a><i>,</i>
>> </body>
>> </html>
>>
>> With "pandoc --wrap=none -f html -t asciidoc" I get this asciidoc output:
>>
>> link:x.htm[_Xx_]__,__
>>
>> The double underscores look "suspicious" and with "asciidoc -b
>> docbook;xmllint" I get:
>>
>> z.xml:10: parser error : Unescaped '<' not allowed in attributes values
>> <simpara>link:x.htm<emphasis><phrase
>> role="<emphasis>Xx</emphasis>">,</phrase></
>>
>> The related docbook line which was created by asciidoc:
>>
>> <simpara>link:x.htm<emphasis><phrase
>> role="<emphasis>Xx</emphasis>">,</phrase></emphasis></simpara>
>>
>> *Is this a known bug?*
>>
>>
>> If I add a space before comma...
>>
>> <a href="x.htm"><i>Xx</i></a><i> ,</i>
>>
>> then I get
>>
>> link:x.htm[_Xx_] _,_
>>
>> which causes no issue. Also adding a space before the emphasis...
>>
>> <a href="x.htm"><i>Xx</i></a> <i>,</i>
>>
>> create an asciidoc file which can be rendered:
>>
>> link:x.htm[_Xx_] _,_
>>
>>
>>
>> Does someone know this? Does a fix already exist?
>>
>>
>> cheers,
>> Frank

--
Frank Bergmann, Pödinghauser Str. 5, D-32051 Herford, Tel. +49-5221-9249753
SAP Hybris & Linux LPIC-3, E-Mail tx2014-VEyjnN4Vo9k@public.gmane.org, USt-IdNr DE237314606
http://tdyn.de/freel -- Redirect to profile at freelancermap
http://www.gulp.de/freiberufler/2HNKY2YHW.html -- Profile at GULP
end of thread, other threads:[~2022-06-28 11:56 UTC | newest]

Thread overview: 9+ messages
2022-01-04 18:02 "double emphasis" bug when converting to asciidoc? Frank Bergmann
     [not found] ` <3f7b920b-c982-5be5-fa04-9025e008e518-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>
2022-01-04 19:12   ` John MacFarlane
     [not found]     ` <m2v8yzpb5x.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2022-01-04 19:50       ` Leonard Rosenthol
     [not found]         ` <CALu=v3JE2PPCY8=agCY9wtvwrMKXAidpSVFN650oc+Hge8J3dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2022-01-05  3:17           ` John MacFarlane
     [not found]             ` <m2pmp6q3ah.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2022-01-05 17:19               ` Frank Bergmann
2022-06-25  9:34                 ` Translating style sheets in reader on HTML input Frank Bergmann
     [not found]                   ` <c09f254c-5ccf-1ed4-97ab-4e6bccbbdcb6-eSlkCAlw8VwAvxtiuMwx3w@public.gmane.org>
2022-06-25 11:23                     ` William Lupton
     [not found]                       ` <CAEe_xxh02ZZ_HbZS0cPDZ4rWE+ES5zYJQsa4Uw9_bTBX5aEAVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2022-06-28 11:56                         ` Frank Bergmann
2022-01-05  8:39       ` "double emphasis" bug when converting to asciidoc? Frank Bergmann