Writing custom filter in python to remove non-breaking spaces

public inbox archive for pandoc-discuss@googlegroups.com

* Writing custom filter in python to remove non-breaking spaces
@ 2017-08-02  9:05 Karim Mohammadi
       [not found] ` <3be5ee09-90dc-41ad-a368-9298b965dfaa-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Karim Mohammadi @ 2017-08-02  9:05 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 1360 bytes --]

Hi. I want to convert an HTML file, *file1.html*, to *file1.tex*. The HTML file 
contains *&nbsp;* entities.

How can I write a Python script as a filter to remove the non-breaking spaces 
(&nbsp;)?

This is my code:

    #!/usr/bin/env python
    
    """
    Pandoc filter to removing &nbsp; string from the text
    """
    
    from pandocfilters import toJSONFilter, Para
    
    def debug(content):
        file = open('debug.txt', 'w')
        for item in content:
            file.write("%s\n" % item)
    
    def nbsp(key, value, format, meta):
        uniString = unicode(value, "UTF-8")
        uniString = value.replace("&nbsp;", " ")
    
        return uniString
    
    if __name__ == "__main__":
        toJSONFilter(nbsp)

but calling the command:

pandoc file1.html --filter ./nbsp.py -o file1.tex

gives me some errors.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3be5ee09-90dc-41ad-a368-9298b965dfaa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[-- Attachment #1.2: Type: text/html, Size: 8658 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found] ` <3be5ee09-90dc-41ad-a368-9298b965dfaa-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-08-02 14:09   ` John MacFarlane
       [not found]     ` <20170802140916.GF38349-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
  2017-08-03 22:29   ` Kolen Cheung
  1 sibling, 1 reply; 9+ messages in thread
From: John MacFarlane @ 2017-08-02 14:09 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Probably easier just to preprocess:

sed -e 's/&nbsp;/ /g' input.html | pandoc -f html -t latex
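
On a system without sed, roughly the same preprocessing can be sketched in Python. This is only an illustration: `replace_nbsp`, `preprocess` and the file name `clean.html` are made-up names, and unlike the sed command the pattern also covers the numeric forms of the entity and the literal U+00A0 character.

```python
import re

# Named entity, decimal and hex numeric forms, and the raw character itself.
NBSP = re.compile(r"&(?:nbsp|#160|#xa0);|\u00a0", re.IGNORECASE)

def replace_nbsp(html):
    """Return html with every non-breaking space replaced by an ordinary space."""
    return NBSP.sub(" ", html)

def preprocess(src_path, dst_path):
    """Hypothetical helper: copy src_path to dst_path with nbsp replaced."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = replace_nbsp(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)

# Small in-memory demonstration
sample = "<p>10&nbsp;km and 5&#160;kg</p>"
print(replace_nbsp(sample))
```

Calling `preprocess("file1.html", "clean.html")` and then running `pandoc clean.html -o file1.tex` would stand in for the pipe above.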



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found]     ` <20170802140916.GF38349-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
@ 2017-08-03  4:55       ` Karim Mohammadi
       [not found]         ` <cc13bbea-06e5-422b-bcdd-cd9ba1c4cf95-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Karim Mohammadi @ 2017-08-03  4:55 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 2910 bytes --]

Thanks, but I'm on Windows.

Can you guide me in writing the desired custom filter?


[-- Attachment #1.2: Type: text/html, Size: 6535 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found]         ` <cc13bbea-06e5-422b-bcdd-cd9ba1c4cf95-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-08-03  6:11           ` Andrew Dunning
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Dunning @ 2017-08-03  6:11 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 167 bytes --]

If you're really keen to write a filter, Panflute is the most developed now and you might look at its documentation. But shouldn't John's suggestion work with Cygwin?
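
For anyone who does want a filter, here is a minimal sketch of the idea using only the standard library. It rests on an assumption about pandoc's behaviour: by the time a filter runs, the HTML reader has already decoded `&nbsp;` into the character U+00A0 inside `Str` elements, so a filter should look for that character rather than the literal entity text. The names `replace_nbsp` and `filter_json` are hypothetical; in practice pandocfilters' `toJSONFilter` or panflute's `run_filter` would take care of the traversal and the input/output.

```python
import json

def replace_nbsp(node):
    """Recursively walk a pandoc JSON AST, rewriting Str nodes in place."""
    if isinstance(node, list):
        for child in node:
            replace_nbsp(child)
    elif isinstance(node, dict):
        if node.get("t") == "Str":
            # Only plain text is touched; Code and CodeBlock are other node types.
            node["c"] = node["c"].replace("\u00a0", " ")
        else:
            for child in node.values():
                replace_nbsp(child)
    return node

def filter_json(json_text):
    """Take pandoc's JSON document as text and return the modified JSON text."""
    return json.dumps(replace_nbsp(json.loads(json_text)))

# Hand-written fragment in the shape pandoc uses for a paragraph "10<nbsp>km"
sample = '{"t": "Para", "c": [{"t": "Str", "c": "10\\u00a0km"}]}'
print(filter_json(sample))
```

To use this as a real filter, the script would read the document from standard input and write the result to standard output, for example `sys.stdout.write(filter_json(sys.stdin.read()))`. On Windows it is probably safer to decode and encode as UTF-8 explicitly rather than rely on the console encoding.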

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Writing custom filter in python to remove non-breaking spaces
       [not found] ` <3be5ee09-90dc-41ad-a368-9298b965dfaa-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2017-08-02 14:09   ` John MacFarlane
@ 2017-08-03 22:29   ` Kolen Cheung
       [not found]     ` <f0bb6fae-6104-4efc-840d-34fd19b02840-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Kolen Cheung @ 2017-08-03 22:29 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 142 bytes --]

I actually want to know why you would want to remove that in the first place. It seems you would only want to do that if the source has bugs.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found]     ` <f0bb6fae-6104-4efc-840d-34fd19b02840-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-08-04 18:58       ` Melroch
       [not found]         ` <CADAJKhBEZ8-BdJTQRJE4M2nettrGKhf1xYzqBYs=pe=_DAodpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Melroch @ 2017-08-04 18:58 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 2315 bytes --]

I think the OP might actually want to replace literal or entity nbspaces
with regular spaces. There may well be good reasons to want to
do that.
I think that the OP might be helped by a prefilter written in python. The
following tries to skip any fenced code or code blocks in order to not
replace nbsp entities inside code. It may be thrown by things like
`\~~~strikeout~~` but those are unlikely in practice.

````python
import re
import sys

# Read and write bytes so the script does not depend on the console
# encoding (which is often not UTF-8 on Windows).
txt = sys.stdin.buffer.read().decode('utf-8')
pat = r"""(?isx)
# Match delimited code or code block
(?P<code>
(?P<backtick> \`{1,} ) .*? (?P=backtick)
|
(?P<tilde> \~{3,} ) .*?  (?P=tilde)
)
|
# Match any form of nbsp
( \& (?: nbsp|[#]160|[#]xa0) \; | \u00a0 )
"""

# Keep code unchanged; replace any nbsp with an ordinary space
def rep(m):
    return m.group('code') or " "

sys.stdout.buffer.write(re.sub(pat, rep, txt).encode('utf-8'))
````

fre 4 aug. 2017 kl. 00:30 skrev Kolen Cheung <christian.kolen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:

> I actually want to know why you would want to remove that in the first
> place? It seems only if the source has bugs you would want to do that.

[-- Attachment #2: Type: text/html, Size: 4308 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found]         ` <CADAJKhBEZ8-BdJTQRJE4M2nettrGKhf1xYzqBYs=pe=_DAodpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-08-04 22:03           ` Kolen Cheung
  2017-08-04 22:09           ` Kolen Cheung
  1 sibling, 0 replies; 9+ messages in thread
From: Kolen Cheung @ 2017-08-04 22:03 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 312 bytes --]

Right, I just read his code and he actually means replace, not remove.

This can be done easily in panflute if he wants to. The panflute documentation has a related example.

I think I started a thread here wanting to do a similar thing; I called it unsmart, and someone else called it something like dump-something.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found]         ` <CADAJKhBEZ8-BdJTQRJE4M2nettrGKhf1xYzqBYs=pe=_DAodpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-08-04 22:03           ` Kolen Cheung
@ 2017-08-04 22:09           ` Kolen Cheung
       [not found]             ` <4abb2571-34b8-4e49-a189-05632083aab9-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Kolen Cheung @ 2017-08-04 22:09 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 1127 bytes --]

But I still want to reiterate my original point: one wants to remove or replace non-breaking spaces only if the original source is problematic. I recall that when I processed some doc files, Word seemed to be trying to be too smart, and there were many non-breaking spaces that shouldn't have been there (if my memory serves me well).

Otherwise, replacing non-breaking spaces with ordinary spaces is basically wrong typographically (another example is "1-2" vs "1–2"). In other words, you're destroying information the original author has carefully put in there.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Writing custom filter in python to remove non-breaking spaces
       [not found]             ` <4abb2571-34b8-4e49-a189-05632083aab9-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-08-05  8:45               ` Melroch
  0 siblings, 0 replies; 9+ messages in thread
From: Melroch @ 2017-08-05  8:45 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 2273 bytes --]

With some fonts LibreOffice renders nbspaces way too narrow or too wide,
and maybe Word does the same to the extent that they use the same
algorithms. It is arguably a bug to let the width of the \xa0 character in
the font determine how an nbspace is rendered but there you have it. The
best solution is often to double all nbspaces as too wide spaces may look
better than too narrow ones.
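
The doubling could be sketched like this, assuming the nbspaces are present as literal U+00A0 characters in the text being processed. `double_nbsp` is a made-up name.

```python
NBSP = "\u00a0"

def double_nbsp(text):
    """Replace each non-breaking space with two of them."""
    return text.replace(NBSP, NBSP * 2)

# repr() makes the otherwise invisible characters show up as \xa0
print(repr(double_nbsp("10\u00a0km")))
```

Note that this is not idempotent: running it twice on the same text would quadruple the spaces.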

Den 5 aug 2017 00:10 skrev "Kolen Cheung" <christian.kolen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:

> But I still want to reiterate my original point, only if the original
> source is problematic, one wants to remove/replace the non-breaking space.
> I recall when I processed some doc files, seems like Word has been trying
> to be too smart, there are many non-breaking space that shouldn't be there
> (if my memory serves me well).
>
> Otherwise, replacing non-breaking space to space is basically wrong in
> typography (another example is "1-2" vs "1–2"). In other words, you're
> destroying the information the original author has carefully put in there.

[-- Attachment #2: Type: text/html, Size: 3345 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-08-05  8:45 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-02  9:05 Writing custom filter in python to remove non-breaking spaces Karim Mohammadi
     [not found] ` <3be5ee09-90dc-41ad-a368-9298b965dfaa-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-08-02 14:09   ` John MacFarlane
     [not found]     ` <20170802140916.GF38349-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
2017-08-03  4:55       ` Karim Mohammadi
     [not found]         ` <cc13bbea-06e5-422b-bcdd-cd9ba1c4cf95-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-08-03  6:11           ` Andrew Dunning
2017-08-03 22:29   ` Kolen Cheung
     [not found]     ` <f0bb6fae-6104-4efc-840d-34fd19b02840-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-08-04 18:58       ` Melroch
     [not found]         ` <CADAJKhBEZ8-BdJTQRJE4M2nettrGKhf1xYzqBYs=pe=_DAodpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-08-04 22:03           ` Kolen Cheung
2017-08-04 22:09           ` Kolen Cheung
     [not found]             ` <4abb2571-34b8-4e49-a189-05632083aab9-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-08-05  8:45               ` Melroch
