From: Benct Philip Jonsson <melroch-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
J <lixichen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: A New Feature for Pandoc's Markdown Extension -- No Space with Newline
Date: Wed, 15 Apr 2020 17:23:25 +0200
Message-ID: <09a13fd8-db26-b851-b426-2fa7ad96ecf4@gmail.com>
In-Reply-To: <5fe78fc8-7050-4342-8d5e-1350b9b06794-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
[-- Attachment #1: Type: text/plain, Size: 9971 bytes --]
Perl/JSON filter attached.
Take care not to overwrite your original files, as this has barely been
tested: only on a single line of text with mixed Hanzi/Latin letters.
Usage instructions and installation hints are in the file (below
$DOCUMENTATION).
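
For readers without the attachment, the approach described further down the thread (drop a Space or SoftBreak that sits between two Str elements when the first ends and the second starts with a CJK character) can be sketched roughly as below. This is Python rather than the attached Perl, and it is only an illustration: the names `CJK_RANGES`, `is_cjk`, `zap_inlines`, `walk` and `filter_json` are made up for this sketch, and "CJK" is approximated by a handful of Unicode blocks instead of a real script property. It assumes the pandoc 2.x JSON layout in which inlines look like `{"t": "Str", "c": "..."}` and `{"t": "Space"}`.

```python
import json

# Assumption: "CJK" is approximated by these Unicode blocks, not by a true
# script property lookup.
CJK_RANGES = (
    (0x3000, 0x303F),    # CJK symbols and punctuation
    (0x3400, 0x4DBF),    # CJK unified ideographs, extension A
    (0x4E00, 0x9FFF),    # CJK unified ideographs
    (0xF900, 0xFAFF),    # CJK compatibility ideographs
    (0xFF00, 0xFFEF),    # halfwidth and fullwidth forms
    (0x20000, 0x2A6DF),  # CJK unified ideographs, extension B
)


def is_cjk(ch):
    """True if the character falls in one of the assumed CJK blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def is_str(node):
    """True for a non-empty Str inline."""
    return (isinstance(node, dict) and node.get("t") == "Str"
            and isinstance(node.get("c"), str) and node["c"] != "")


def zap_inlines(items):
    """Drop Space/SoftBreak elements sandwiched between CJK-edged Str elements."""
    out = []
    for i, node in enumerate(items):
        is_gap = isinstance(node, dict) and node.get("t") in ("Space", "SoftBreak")
        if is_gap and 0 < i < len(items) - 1:
            prev, nxt = items[i - 1], items[i + 1]
            if (is_str(prev) and is_str(nxt)
                    and is_cjk(prev["c"][-1]) and is_cjk(nxt["c"][0])):
                continue  # skip this space: both neighbours are CJK at the edge
        out.append(node)
    return out


def walk(node):
    """Recurse through the whole JSON tree, filtering every list on the way up."""
    if isinstance(node, list):
        return zap_inlines([walk(item) for item in node])
    if isinstance(node, dict):
        return {key: walk(value) for key, value in node.items()}
    return node


def filter_json(text):
    """Take a pandoc JSON document as a string and return the filtered JSON."""
    return json.dumps(walk(json.loads(text)), ensure_ascii=False)
```

To use something like this between two pandoc runs, one would read the output of `pandoc -t json` from standard input, pass it through `filter_json`, and hand the result to `pandoc -f json`. Spaces next to Latin words, digits and ASCII punctuation are left alone because only Str edges inside the assumed ranges count.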
On 2020-04-15 07:57, J wrote:
> Please don't worry about CPAN. Google will help and I am willing to try the
> steps needed. :D
>
> On Wednesday, April 15, 2020 at 12:07:47 AM UTC+8, BPJ wrote:
>>
>> Are you conversant with perl and CPAN?
>> If not what operating system(s) do you use (Windows/Mac/Linux)?
>>
>> I ask because if the answer to the first question is no I may have to
>> guide you through installing some stuff, including perl itself if the
>> answer to the second question is Windows.
>>
>> On Tue, 14 Apr 2020 at 16:13, J <lixi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>>> That sounds perfect! Many thanks for your efforts!
>>>
>>> On Tuesday, April 14, 2020 at 1:18:17 PM UTC+8, BP wrote:
>>>>
>>>> A Perl filter which removes Space and SoftBreak elements sandwiched
>>>> between two Str elements which respectively end and start with a
>>>> character with Unicode script property CJK is certainly doable. Will that
>>>> be OK?
>>>>
>>>> /BPJ
>>>>
>>>>
>>>> On Tue, 14 Apr 2020 at 02:39, J <lixi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>
>>>>> Thank you very much for your efforts! I wonder if the script can keep
>>>>> the spaces between English words, digits, and punctuation, since my files
>>>>> also contain short groups of English words and numbers written with digits?
>>>>>
>>>>> On Tuesday, April 14, 2020 at 3:16:40 AM UTC+8, BP wrote:
>>>>>>
>>>>>> Wow that script is really ancient! I'll try to port it to a Lua filter
>>>>>> tomorrow. It's 9 PM here now and I have been coding or writing for twelve
>>>>>> hours, so I'm quite exhausted.
>>>>>>
>>>>>> Just to be clear, the old script removes all spaces which are next to
>>>>>> a "string" element, i.e. all "words", digits and punctuation alike, and not
>>>>>> just CJK characters. If you are OK with that behavior, porting it to a Lua
>>>>>> filter will be trivial, and Lua is built into Pandoc. Otherwise I'll have
>>>>>> to look into rewriting the Perl script, which may not be quite as trivial.
>>>>>>
>>>>>> /BPJ
>>>>>>
>>>>>> On Mon, 13 Apr 2020 at 20:45, J <lixi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>
>>>>>>> Could you help to update zapspace.pl to work with pandoc 2.9.2.1? I
>>>>>>> have Chinese markdown files that use spaces to separate groups of words,
>>>>>>> and would like to ignore spaces between Chinese characters before
>>>>>>> converting to Word.
>>>>>>> Many thanks!
>>>>>>>
>>>>>>> On Tuesday, July 16, 2013 at 11:34:32 PM UTC+8, BP Jonsson wrote:
>>>>>>>>
>>>>>>>> 2013-07-15 19:51, John MacFarlane wrote:
>>>>>>>>> +++ Bill Chen (CHEN, Zhechuan) [Jul 15 13 17:16 ]:
>>>>>>>>>> I have found a way to get this feature:
>>>>>>>>>> just add "\n" at the end of the line.
>>>>>>>>>
>>>>>>>>> This would violate the general rule that backslashes before
>>>>>>>>> letters in markdown are just literal backslashes.
>>>>>>>>>
>>>>>>>>> I think that a better approach would be to provide a markdown
>>>>>>>>> extension like the current 'hard_line_breaks': perhaps
>>>>>>>>> 'ignore_line_breaks'. 'hard_line_breaks' causes line
>>>>>>>>> breaks in a paragraph to be interpreted as hard breaks;
>>>>>>>>> 'ignore_line_breaks' would cause them to be ignored entirely.
>>>>>>>>> (One of these would have to be designated as taking precedence
>>>>>>>>> if both were selected.)
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>
>>>>>>>> The attached perl script, when used as a filter on pandoc's
>>>>>>>> json output, should enable Bill to get what he wants. I have
>>>>>>>> used an earlier version on Tibetan text with satisfactory
>>>>>>>> results. Someone who knows Haskell could probably write
>>>>>>>> something shorter which interacts with pandoc in a more
>>>>>>>> elegant way, but this script works.
>>>>>>>>
>>>>>>>> The description inside the file reads as follows:
>>>>>>>>
>>>>>>>> FILE: zapspace.pl
>>>>>>>>
>>>>>>>> USAGE: pandoc -w json some.markdown | zapspace.pl | pandoc -r json
>>>>>>>>
>>>>>>>> DESCRIPTION: Takes as input a document in pandoc's json format and
>>>>>>>> removes all "Space" elements inside any list which also
>>>>>>>> contains any {"Str":"..."} element, and outputs a
>>>>>>>> modified json document, which when given as input to
>>>>>>>> pandoc will produce output suitable for languages which
>>>>>>>> don't put spaces between words or sentences, with no spaces
>>>>>>>> inside paragraphs -- unless you insert non-breaking spaces,
>>>>>>>> see below! --, and notably spaces caused by linebreaks
>>>>>>>> in the markdown paragraph will be removed.
>>>>>>>>
>>>>>>>> Additionally it does two things which allow you to
>>>>>>>> insert whitespace inside paragraph-like elements:
>>>>>>>>
>>>>>>>> 1) It replaces any non-breaking space (U+00A0) inside a
>>>>>>>>    "Str" element with ordinary soft spaces (U+0020)
>>>>>>>>    *if* the "Str" element also contains characters other
>>>>>>>>    than non-breaking spaces.
>>>>>>>>
>>>>>>>>    This allows you to insert spaces into your markdown
>>>>>>>>    paragraphs as non-breaking spaces (in pandoc notation
>>>>>>>>    a backslash followed by an ordinary space "like\ this")
>>>>>>>>    and get ordinary spaces in your output.
>>>>>>>>
>>>>>>>> 2) Preserves any "Str" element which only contains one
>>>>>>>>    or more non-breaking spaces as is.
>>>>>>>>
>>>>>>>>    This allows you to put non-breaking spaces between
>>>>>>>>    words by inserting ordinary whitespace -- which will
>>>>>>>>    be removed -- on either side of the non-breaking
>>>>>>>>    spaces:
>>>>>>>>
>>>>>>>>        "like \  this"
>>>>>>>>             ^  ^
>>>>>>>>
>>>>>>>> N.B. that this is *not* done by scanning the JSON text
>>>>>>>> with regular expressions! The JSON is loaded into a
>>>>>>>> perl data structure which is modified and then converted
>>>>>>>> back into JSON. Precautions are taken not to modify the
>>>>>>>> structure such that the output will be rejected by
>>>>>>>> pandoc, nor to modify code elements, but I can't guarantee
>>>>>>>> that this will remain true with future versions of pandoc,
>>>>>>>> or that it is true for any input.
>>>>>>>>
>>>>>>>> OPTIONS: ---
>>>>>>>>
>>>>>>>> REQUIREMENTS:
>>>>>>>> * A reasonably recent version of perl.
>>>>>>>> * The following CPAN modules:
>>>>>>>>   - [JSON::Any](https://metacpan.org/module/JSON::Any)
>>>>>>>>     + A JSON 'backend' module like JSON or JSON::XS.
>>>>>>>>   - [List::MoreUtils](https://metacpan.org/module/List::MoreUtils)
>>>>>>>>   - [autovivification](https://metacpan.org/module/autovivification)
>>>>>>>>
>>>>>>>>
>>>>>>>>
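
As a rough illustration of the rules in the description above (not the Perl script itself), the per-list behaviour could look something like this in Python. The names `zap_all` and `walk_all` are invented for the sketch. Two things are assumptions rather than part of the 2013 description: the inlines use the newer `{"t": ..., "c": ...}` JSON layout instead of `{"Str":"..."}`, and SoftBreak is treated like Space, since current pandoc represents line breaks inside a paragraph that way.

```python
NBSP = "\u00a0"


def zap_all(items):
    """Apply the zapspace rules to one list from the JSON tree."""
    # Only touch lists that contain at least one Str element.
    if not any(isinstance(n, dict) and n.get("t") == "Str" for n in items):
        return items
    out = []
    for n in items:
        if isinstance(n, dict):
            # Assumption: SoftBreak is dropped along with Space.
            if n.get("t") in ("Space", "SoftBreak"):
                continue
            if n.get("t") == "Str" and isinstance(n.get("c"), str):
                text = n["c"]
                # Rule 1: a Str with other characters gets its NBSPs turned
                # into ordinary spaces. Rule 2: a Str made only of NBSPs
                # is kept exactly as it is.
                if text.strip(NBSP):
                    n = {"t": "Str", "c": text.replace(NBSP, " ")}
        out.append(n)
    return out


def walk_all(node):
    """Recurse through the JSON tree and filter every list."""
    if isinstance(node, list):
        return zap_all([walk_all(item) for item in node])
    if isinstance(node, dict):
        return {key: walk_all(value) for key, value in node.items()}
    return node
```

A Code inline carries its text as a plain string inside a list with no Str element next to it, so in this sketch it passes through unchanged.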
>>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "pandoc-discuss" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to pandoc-...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/pandoc-discuss/35356bdb-9f45-4f0c-bb49-3fb4e2db98a0%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/pandoc-discuss/35356bdb-9f45-4f0c-bb49-3fb4e2db98a0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>
>
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/09a13fd8-db26-b851-b426-2fa7ad96ecf4%40gmail.com.
[-- Attachment #2: zapspace-cjk.pl --]
[-- Type: application/x-perl, Size: 2884 bytes --]
Thread overview: 14+ messages
2013-07-15  3:33 Bill Chen (CHEN, Zhechuan)
2013-07-15  9:16 Bill Chen (CHEN, Zhechuan)
2013-07-15 17:51 John MacFarlane
2013-07-16 15:34 BP Jonsson
2020-04-13 18:44 J
2020-04-13 19:16 BPJ
2020-04-14  0:39 J
2020-04-14  5:17 BPJ
2020-04-14 14:13 J
2020-04-14 16:07 BPJ
2020-04-15  5:57 J
2020-04-15 15:23 Benct Philip Jonsson [this message]
2013-07-17 22:47 John MacFarlane
2013-07-18  4:38 Bill Chen (CHEN, Zhechuan)