public inbox archive for pandoc-discuss@googlegroups.com
* Smart quotes recognition
@ 2010-11-18 13:59 Joost Kremers
From: Joost Kremers @ 2010-11-18 13:59 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

Today, I ran the following (German) text through markdown2pdf:

==========

1) Seite 5: "Ausgehend von der "Uniformity of Theta Assignment Hypothesis" (UTAH)
   als Leitmotiv faßt Baker genau die morphologischen Konstruktionen als
   syntaktische Operationen, d.h. als Instanzen syntaktischer X° Bewegung auf,
   die einen "Grammatical Function-Changing"-Prozeß (GFC-Prozeß) indizieren,
   [...]." - Ich glaube, ich verstehe gundlegend worum es geht, doch wäre es
   schön dies nochmal an einem Beispiel zu demonstrieren.

==========

When I examined the pdf output, I noticed something weird: the double quotes
before «Uniformity» had become closing quotes and the space after «der» had
disappeared.

When converting to LaTeX, the output is the following:

==========

\item
Seite 5: ``Ausgehend von der''Uniformity of Theta Assignment
  Hypothesis" (UTAH) als Leitmotiv faßt Baker genau die
  morphologischen Konstruktionen als syntatktische Operationen, d.h.
  als Instanzen syntaktischer X° Bewegung auf, die einen
  ``Grammatical Function-Changing''-Prozeß (GFC-Prozeß) indizieren,
  [\ldots{}]." - Ich glaube, ich verstehe gundlegend worum es geht,
  doch wäre es schön dies nochmal an einem Beispiel zu
  demonstrieren.

==========

I suspect Pandoc keeps track of quotes in order to determine whether a given
quote must be an opening or a closing quote, which obviously leads to wrong
results in this case. Not only is the double quote before «Uniformity»
interpreted as a closing quote and the space before it deleted, but the double
quotes after «Hypothesis» and after the ellipsis are also not converted to '' as
they should be. (Though I don't really understand why this happens...)

Wouldn't it make more sense to determine whether an open or closing quote is
needed by examining the character before and after the quote?

Of course, it would not be a simple matter of determining on which side the
space is: that would fail for the string «"Grammatical Function-Changing"-Prozeß»
in the above example. I guess it would require some sort of character hierarchy,
something along the lines of:

1. letters
2. ,.!?;: 
3. ()-
4. space

(I'm sure I left out other relevant characters.) Now, when the character
following " comes later in this list than the character preceding it, " is
converted into a closing quote, and when the preceding character comes later in
the list, " is converted into an opening quote.
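
The hierarchy above can be sketched in a few lines of Python. This is only an
illustration of the proposal, not anything Pandoc does: the function names
`rank` and `classify_quotes` are made up, and treating the start or end of the
text like a space, and every unlisted character like a letter, are assumptions.

```python
from typing import List

def rank(ch: str) -> int:
    """Position of a character in the proposed list (1 = letters ... 4 = space)."""
    if ch == "" or ch.isspace():
        return 4  # assumption: start/end of text counts as a space
    if ch in "()-":
        return 3
    if ch in ",.!?;:":
        return 2
    return 1      # assumption: letters and anything not listed

def classify_quotes(text: str) -> List[str]:
    """Label each straight double quote by comparing its two neighbours."""
    kinds = []
    for i, ch in enumerate(text):
        if ch != '"':
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if rank(after) > rank(before):
            kinds.append("close")      # following character is later in the list
        elif rank(after) < rank(before):
            kinds.append("open")       # preceding character is later in the list
        else:
            kinds.append("ambiguous")  # same rank: the rule gives no answer
    return kinds

print(classify_quotes('"Grammatical Function-Changing"-Prozeß'))
# ['open', 'close']
print(classify_quotes('"Ausgehend von der "Uniformity'))
# ['open', 'open']
```

Because each quote is judged only by its neighbours, the nested opening quote
before «Uniformity» comes out as an opening quote. Quotes whose neighbours have
the same rank remain undecided under this rule.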

I know one isn't really supposed to use double quotes inside double quotes, but
if I use them anyway, I would like them to work... Perhaps an alternative
implementation could be considered?

Thanks,

Joost



-- 
Joost Kremers
Life has its moments

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to pandoc-discuss+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pandoc-discuss?hl=en.




* Re: Smart quotes recognition
@ 2010-11-21 21:45   ` John MacFarlane
From: John MacFarlane @ 2010-11-21 21:45 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

+++ Joost Kremers [Nov 18 10 14:59 ]:
> I suspect Pandoc keeps track of quotes in order to determine whether a given
> quote must be an opening or a closing quote, which obviously leads to wrong
> results in this case. Not only is the double quote before «Uniformity»
> interpreted as a closing quote and the space before it deleted, but the double
> quotes after «Hypothesis» and after the ellipsis are also not converted to '' as
> they should be. (Though I don't really understand why this happens...)
> 
> Wouldn't it make more sense to determine whether an open or closing quote is
> needed by examining the character before and after the quote?

You're right that pandoc's smart quote parsing works by keeping track
of when you've already got an open single or double quote, and then
waiting for a matching closer.  It does not work by examining the
character before and after the quote. It couldn't, because the parser
doesn't keep track of the previously parsed character.  (And as far
as I can see, it would be difficult to make it do so.)

The present system assumes that you aren't going to have double quotes
within double quotes.  It breaks when you violate the convention
of alternating quote styles. I'm not sure there's much to be done
about this...
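
To illustrate why tracking an open quote breaks on nested double quotes, here
is a deliberately simplified Python sketch. It is not pandoc's parser (which
matches a whole quoted span rather than toggling a flag), and it does not
reproduce the exact LaTeX output shown in the original message; the function
name `toggle_quotes` is hypothetical.

```python
def toggle_quotes(text: str) -> str:
    """Turn straight double quotes into LaTeX `` and '' by alternating
    between open and close, with no look at neighbouring characters."""
    out = []
    inside = False  # True while a double quote is currently open
    for ch in text:
        if ch == '"':
            out.append("''" if inside else "``")
            inside = not inside
        else:
            out.append(ch)
    return "".join(out)

# Alternating quote styles: works as intended.
print(toggle_quotes('say "hi"'))
# say ``hi''

# Double quotes inside double quotes: the inner opener is taken as a closer.
print(toggle_quotes('A "b "c" d" e'))
# A ``b ''c`` d'' e
```

In the second call the quote before `c` was meant as an opener, but since a
quote is already open it is emitted as a closer, and every later quote is
flipped as well.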

You can always use Unicode open and close quote characters in your
markdown document, if you want to use double quotes within double
quotes.
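
The reason this workaround is robust is that each Unicode quote character
already says whether it opens or closes, so no state or context is needed. A
minimal Python sketch of that idea follows; the mapping table and the function
name `unicode_quotes_to_latex` are illustrative assumptions, not a description
of how pandoc handles these characters internally.

```python
# Each Unicode quote carries its own direction, so conversion is a plain lookup.
UNICODE_TO_LATEX = {
    "\u201c": "``",  # LEFT DOUBLE QUOTATION MARK
    "\u201d": "''",  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "`",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
}

def unicode_quotes_to_latex(text: str) -> str:
    """Replace Unicode quote characters with their LaTeX equivalents."""
    return "".join(UNICODE_TO_LATEX.get(ch, ch) for ch in text)

# Double quotes nested inside double quotes stay correctly paired.
print(unicode_quotes_to_latex("\u201ca \u201cb\u201d c\u201d"))
# ``a ``b'' c''
```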

John


* Re: Smart quotes recognition
@ 2010-11-23  9:43       ` Joost Kremers
From: Joost Kremers @ 2010-11-23  9:43 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi John,

[smart quotes]
> You're right that pandoc's smart quote parsing works by keeping track
> of when you've already got an open single or double quote, and then
> waiting for a matching closer.  It does not work by examining the
> character before and after the quote. It couldn't, because the parser
> doesn't keep track of the previously parsed character.  (And as far
> as I can see, it would be difficult to make it do so.)

There's certainly no point in making fundamental changes to how the parser works
just for this particular issue. In fact, I probably wouldn't even have run into
it if I hadn't blindly copied text from an email into a markdown document.

Thanks for your explanation, though. :-)

Joost


-- 
Joost Kremers
Life has its moments

