public inbox archive for pandoc-discuss@googlegroups.com
From: Benct Philip Jonsson <melroch-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	J <lixichen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: A New Feature for Pandoc's Markdown Extension -- No Space with Newline
Date: Wed, 15 Apr 2020 17:23:25 +0200
Message-ID: <09a13fd8-db26-b851-b426-2fa7ad96ecf4@gmail.com>
In-Reply-To: <5fe78fc8-7050-4342-8d5e-1350b9b06794-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 9971 bytes --]

Perl/JSON filter attached.

Take care not to overwrite your original files, as this is barely tested:
I have only tried it on a single line of text with mixed Hanzi/Latin letters.

Usage instructions and installation hints are in the file (below
$DOCUMENTATION).
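
For readers without the attachment, the behaviour discussed in the quoted
thread below (drop a Space or SoftBreak only when the Str before it ends
with, and the Str after it starts with, a CJK character) can be sketched
roughly as follows. This is NOT the attached zapspace-cjk.pl: it is an
illustrative Python sketch, the names CJK_RANGES, is_cjk, is_str, is_gap and
zap are hypothetical, and the code point ranges used to approximate "CJK"
are an assumption. It assumes the current pandoc JSON shape, where inlines
look like {"t":"Str","c":"..."} and {"t":"Space"}.

```python
import json
import sys

# ASSUMPTION: these code point ranges stand in for "CJK"; adjust as needed.
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x3040, 0x30FF),  # hiragana and katakana
    (0x3400, 0x4DBF),  # CJK unified ideographs, extension A
    (0x4E00, 0x9FFF),  # CJK unified ideographs
    (0xF900, 0xFAFF),  # CJK compatibility ideographs
    (0xFF00, 0xFFEF),  # halfwidth and fullwidth forms
)


def is_cjk(ch):
    """True if the single character ch falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def is_str(node):
    """True for a non-empty pandoc Str inline."""
    return (isinstance(node, dict) and node.get("t") == "Str"
            and isinstance(node.get("c"), str) and node["c"] != "")


def is_gap(node):
    """True for a pandoc Space or SoftBreak inline."""
    return isinstance(node, dict) and node.get("t") in ("Space", "SoftBreak")


def zap(node):
    """Return a copy of the JSON tree without gaps between two CJK Strs."""
    if isinstance(node, dict):
        return {key: zap(value) for key, value in node.items()}
    if isinstance(node, list):
        items = [zap(value) for value in node]
        kept = []
        for i, item in enumerate(items):
            if (is_gap(item) and 0 < i < len(items) - 1
                    and is_str(items[i - 1]) and is_str(items[i + 1])
                    and is_cjk(items[i - 1]["c"][-1])
                    and is_cjk(items[i + 1]["c"][0])):
                continue  # gap sandwiched between CJK text: drop it
            kept.append(item)
        return kept
    return node


if __name__ == "__main__" and not sys.stdin.isatty():
    raw = sys.stdin.buffer.read()
    if raw.strip():  # do nothing on empty input
        json.dump(zap(json.loads(raw)), sys.stdout)
```

Saved under a hypothetical name such as zapspace_sketch.py, it would be run
between two pandoc invocations, writing to a new file rather than over the
original:

  pandoc -t json input.md | python3 zapspace_sketch.py | pandoc -f json -o output.docx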

On 2020-04-15 07:57, J wrote:
> Please don't worry about CPAN. Google will help and I am willing to try the
> steps needed. :D
> 
> On Wednesday, April 15, 2020 at 12:07:47 AM UTC+8, BPJ wrote:
>>
>> Are you conversant with perl and CPAN?
>> If not, what operating system(s) do you use (Windows/Mac/Linux)?
>>
>> I ask because if the answer to the first question is no, I may have to
>> guide you through installing some stuff, including perl itself if the
>> answer to the second question is Windows.
>>
>> On Tue 14 Apr 2020 at 16:13, J <lixi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>>> That sounds perfect! Many thanks for your efforts!
>>>
>>> On Tuesday, April 14, 2020 at 1:18:17 PM UTC+8, BP wrote:
>>>>
>>>> A Perl filter which removes Space and SoftBreak elements sandwiched
>>>> between two Str elements which respectively end and start with a
>>>> character with Unicode script property CJK is certainly doable. Will that
>>>> be OK?
>>>>
>>>> /BPJ
>>>>
>>>>
>>>> On Tue 14 Apr 2020 at 02:39, J <lixi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>
>>>>> Thank you very much for your efforts! I wonder if the script can keep
>>>>> the spaces around English words, digits, and punctuation, since my files
>>>>> also contain short groups of English words and numbers?
>>>>>
>>>>> On Tuesday, April 14, 2020 at 3:16:40 AM UTC+8, BP wrote:
>>>>>>
>>>>>> Wow that script is really ancient! I'll try to port it to a Lua filter
>>>>>> tomorrow. It's 9 PM here now and I have been coding or writing for twelve
>>>>>> hours, so I'm quite exhausted.
>>>>>>
>>>>>> Just to be clear, the old script removes all spaces which are next to
>>>>>> a "string" element, i.e. all "words", digits and punctuation alike, and not
>>>>>> just CJK characters. If you are OK with that behavior, porting it to a Lua
>>>>>> filter will be trivial, and Lua is built into Pandoc. Otherwise I'll have
>>>>>> to look into rewriting the Perl script, which may not be quite as trivial.
>>>>>>
>>>>>> /BPJ
>>>>>>
>>>>>> On Mon 13 Apr 2020 at 20:45, J <lixi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>
>>>>>>> Could you help update zapspace.pl to work with pandoc 2.9.2.1? I
>>>>>>> have Chinese markdown files that use spaces to separate groups of words,
>>>>>>> and would like to ignore spaces between Chinese characters before
>>>>>>> converting to Word.
>>>>>>> Many thanks!
>>>>>>>
>>>>>>> On Tuesday, July 16, 2013 at 11:34:32 PM UTC+8, BP Jonsson wrote:
>>>>>>>>
>>>>>>>> 2013-07-15 19:51, John MacFarlane wrote:
>>>>>>>>> +++ Bill Chen (CHEN, Zhechuan) [Jul 15 13 17:16 ]:
>>>>>>>>>>      I have found a way to get this feature working:
>>>>>>>>>>      just add "\n" at the end of the line.
>>>>>>>>>
>>>>>>>>> This would violate the general rule that backslashes before
>>>>>>>>> letters in markdown are just literal backslashes.
>>>>>>>>>
>>>>>>>>> I think that a better approach would be to provide a markdown
>>>>>>>>> extension like the current 'hard_line_breaks':  perhaps
>>>>>>>>> 'ignore_line_breaks'.  'hard_line_breaks' causes line
>>>>>>>>> breaks in a paragraph to be interpreted as hard breaks;
>>>>>>>>> 'ignore_line_breaks' would cause them to be ignored entirely.
>>>>>>>>> (One of these would have to be designated as taking precedence
>>>>>>>>> if both were selected.)
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>
>>>>>>>> The attached perl script, when used as a filter on pandoc's
>>>>>>>> json output, should enable Bill to get what he wants.  I have
>>>>>>>> used an earlier version on Tibetan text with satisfactory
>>>>>>>> results. Someone who knows Haskell could probably write
>>>>>>>> something shorter which interacts with pandoc in a more
>>>>>>>> elegant way, but this script works.
>>>>>>>>
>>>>>>>> The description inside the file reads as follows:
>>>>>>>>
>>>>>>>>          FILE: zapspace.pl
>>>>>>>>
>>>>>>>>         USAGE: pandoc -w json some.markdown | zapspace.pl | pandoc -r json
>>>>>>>>
>>>>>>>>   DESCRIPTION: Takes as input a document in pandoc's json format and
>>>>>>>>                removes all "Space" elements inside any list which also
>>>>>>>>                contains any {"Str":"..."} element, and outputs a
>>>>>>>>                modified json document, which when given as input to
>>>>>>>>                pandoc will produce output suitable for languages which
>>>>>>>>                don't put spaces between words or sentences, with no spaces
>>>>>>>>                inside paragraphs -- unless you insert non-breaking spaces,
>>>>>>>>                see below! --, and notably spaces caused by linebreaks
>>>>>>>>                in the markdown paragraph will be removed.
>>>>>>>>
>>>>>>>>                Additionally it does two things which allow you to
>>>>>>>>                insert whitespace inside paragraph-like elements:
>>>>>>>>
>>>>>>>>                1)  It replaces any non-breaking space (U+00A0) inside a
>>>>>>>>                    "Str" element with ordinary soft spaces (U+0020)
>>>>>>>>                    *if* the "Str" element also contains characters other
>>>>>>>>                    than non-breaking spaces.
>>>>>>>>
>>>>>>>>                    This allows you to insert spaces into your markdown
>>>>>>>>                    paragraphs as non-breaking spaces (in pandoc notation
>>>>>>>>                    a backslash followed by an ordinary space "like\ this")
>>>>>>>>                    and get ordinary spaces in your output.
>>>>>>>>
>>>>>>>>                2)  Preserves any "Str" element which only contains one
>>>>>>>>                    or more non-breaking spaces as is.
>>>>>>>>
>>>>>>>>                    This allows you to put non-breaking spaces between
>>>>>>>>                    words by inserting ordinary whitespace -- which will
>>>>>>>>                    be removed -- on either side of the non-breaking
>>>>>>>>                    spaces "like \  this".
>>>>>>>>                                ^  ^
>>>>>>>>
>>>>>>>>                N.B. that this is *not* done by scanning the JSON text
>>>>>>>>                with regular expressions!  The JSON is loaded into a
>>>>>>>>                perl data structure which is modified and then converted
>>>>>>>>                back into JSON. Precautions are taken not to modify the
>>>>>>>>                structure such that the output will be rejected by
>>>>>>>>                pandoc, nor to modify code elements, but I can't guarantee
>>>>>>>>                that this will remain true with future versions of pandoc,
>>>>>>>>                or that it is true for any input.
>>>>>>>>
>>>>>>>>       OPTIONS: ---
>>>>>>>>  REQUIREMENTS: *   A reasonably recent version of perl.
>>>>>>>>                *   The following CPAN modules:
>>>>>>>>
>>>>>>>>                    -   [JSON::Any](https://metacpan.org/module/JSON::Any)
>>>>>>>>                        +   A JSON 'backend' module like JSON or JSON::XS.
>>>>>>>>                    -   [List::MoreUtils](https://metacpan.org/module/List::MoreUtils)
>>>>>>>>                    -   [autovivification](https://metacpan.org/module/autovivification)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "pandoc-discuss" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to pandoc-...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/pandoc-discuss/35356bdb-9f45-4f0c-bb49-3fb4e2db98a0%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/pandoc-discuss/35356bdb-9f45-4f0c-bb49-3fb4e2db98a0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>
> 
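
The behaviour of the 2013 zapspace.pl, as described in the quoted text
above, can be sketched in the same spirit. Again this is only an
illustrative Python sketch and not the original Perl: the names NBSP,
fix_nbsp and zap_all are hypothetical, it uses the current
{"t":"Str","c":"..."} JSON shape rather than the 2013 {"Str":"..."} shape,
and dropping SoftBreak as well as Space is an assumption (the description
only mentions "Space" elements, which is what line breaks produced at the
time).

```python
NBSP = "\u00a0"


def fix_nbsp(text):
    """Apply the two non-breaking-space rules from the description."""
    if text.strip(NBSP) == "":
        return text                    # rule 2: only NBSPs, keep as is
    return text.replace(NBSP, " ")     # rule 1: mixed content, use U+0020


def zap_all(node):
    """Drop every gap in any list that also holds a Str; fix NBSPs in Strs."""
    if isinstance(node, dict):
        if node.get("t") == "Str" and isinstance(node.get("c"), str):
            return {"t": "Str", "c": fix_nbsp(node["c"])}
        return {key: zap_all(value) for key, value in node.items()}
    if isinstance(node, list):
        items = [zap_all(value) for value in node]
        has_str = any(isinstance(v, dict) and v.get("t") == "Str"
                      for v in items)
        if has_str:
            # ASSUMPTION: SoftBreak is treated like Space here.
            items = [v for v in items
                     if not (isinstance(v, dict)
                             and v.get("t") in ("Space", "SoftBreak"))]
        return items
    return node
```

Because code elements carry their text as a plain string rather than as a
list of Str and Space elements, this sketch leaves them untouched, which
matches the precaution mentioned in the description.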

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/09a13fd8-db26-b851-b426-2fa7ad96ecf4%40gmail.com.

[-- Attachment #2: zapspace-cjk.pl --]
[-- Type: application/x-perl, Size: 2884 bytes --]

Thread overview: 14+ messages
2013-07-15  3:33 Bill Chen (CHEN, Zhechuan)
     [not found] ` <CAFOcPC+oqesoOPbFkiyo_cjAQFW7qGj4oidMU5gn+BnfpWM2aw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-07-15  9:16   ` Bill Chen (CHEN, Zhechuan)
     [not found]     ` <CAFOcPCLzrV1dWji3BNjWopayQXLDebAzFpF0WMwfZ_i8x8d63w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-07-15 17:51       ` John MacFarlane
     [not found]         ` <20130715175101.GA20541-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
2013-07-16 15:34           ` BP Jonsson
     [not found]             ` <51E56808.5000500-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-04-13 18:44               ` J
     [not found]                 ` <35356bdb-9f45-4f0c-bb49-3fb4e2db98a0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-04-13 19:16                   ` BPJ
     [not found]                     ` <CADAJKhDMPQveCFfsDYp1-CJKTTA6EMmWf_M_11edGF8uvEcHJg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2020-04-14  0:39                       ` J
     [not found]                         ` <1beb6ec0-19a5-4da7-b785-ebb7d340c865-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-04-14  5:17                           ` BPJ
     [not found]                             ` <CADAJKhDkCQ-GsQ7-G2_U_SZSx-1zheZAdQizRn-Cjb0jaC92Pw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2020-04-14 14:13                               ` J
     [not found]                                 ` <b3c84390-28d9-4962-909a-43eceab09108-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-04-14 16:07                                   ` BPJ
     [not found]                                     ` <CADAJKhC+k=sdZVJV5GMKM9xZsP_L8KFGqny2f5AZQ6FDXngy6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2020-04-15  5:57                                       ` J
     [not found]                                         ` <5fe78fc8-7050-4342-8d5e-1350b9b06794-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-04-15 15:23                                           ` Benct Philip Jonsson [this message]
2013-07-17 22:47           ` John MacFarlane
     [not found]             ` <20130717224659.GA23839-9Rnp8PDaXcZ2EAH53EmH34tHsfhOvSUSZkel5v8DVj8@public.gmane.org>
2013-07-18  4:38               ` Bill Chen (CHEN, Zhechuan)
