public inbox archive for pandoc-discuss@googlegroups.com
* [Pandoc 1.12.3.1] Never ending docx to HTML conversion with 100% CPU usage
From: Samuel Viscapi @ 2020-04-08  9:45 UTC
  To: pandoc-discuss



Dear all,

I'm trying to turn a "simple", one-page DOCX document into an HTML page, 
using pandoc 1.12.3.1 on CentOS 7.7 x86_64 (yes, it's fairly old, I know; 
should I try to compile the newest version from source?).

At first the conversion failed because of some UTF-8 related errors, but 
those are well documented in the manual:

https://pandoc.org/MANUAL.html#character-encoding

Now I'm faced with a never-ending process that silently eats ~100% CPU time 
for hours (before eventually being terminated by the system).

My command reads as follows:

iconv -f latin1 -t utf-8 file.docx | pandoc -o output.html

Since the docx file originated on Windows, its character encoding is 
"Western Alphabet", so I hope "latin1" is a close enough approximation.

I tried explicitly telling pandoc the input and output formats (-f 
docx -t html), but to no avail (pandoc: Unknown reader: docx).

I also played with extensions for a while (--from docx+empty_paragraphs, 
--from docx+styles, etc.), with no luck so far.

What am I doing wrong?

Best regards,

Samuel, from CINES (Montpellier, France)

https://www.cines.fr/en/

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/319d4b4b-be6b-439b-931b-a85c5ce4a00b%40googlegroups.com.


* Re: [Pandoc 1.12.3.1] Never ending docx to HTML conversion with 100% CPU usage
From: Jesse Rosenthal @ 2020-04-08 13:30 UTC
  To: Samuel Viscapi, pandoc-discuss

Samuel Viscapi <sviscapi-KK0ffGbhmjU@public.gmane.org> writes:

> I'm trying to turn a "simple", one-page DOCX document into an HTML page, 
> using pandoc 1.12.3.1 on CentOS 7.7 x86_64 (yes, it's fairly old, I know; 
> should I try to compile the newest version from source?).

Way too old -- conversion from docx wasn't introduced until 1.13. 



* Re: [Pandoc 1.12.3.1] Never ending docx to HTML conversion with 100% CPU usage
From: John MacFarlane @ 2020-04-08 16:32 UTC
  To: Jesse Rosenthal, Samuel Viscapi, pandoc-discuss


You can try the linux binary on our GitHub releases page.
It's statically linked and I imagine it will work on your
system.

Jesse Rosenthal <jrosenthal-4GNroTWusrE@public.gmane.org> writes:

> Samuel Viscapi <sviscapi-KK0ffGbhmjU@public.gmane.org> writes:
>
>> I'm trying to turn a "simple", one-page DOCX document into an HTML page, 
>> using pandoc 1.12.3.1 on CentOS 7.7 x86_64 (yes, it's fairly old, I know; 
>> should I try to compile the newest version from source?).
>
> Way too old -- conversion from docx wasn't introduced until 1.13. 



* Re: [Pandoc 1.12.3.1] Never ending docx to HTML conversion with 100% CPU usage
From: Samuel Viscapi @ 2020-04-09  7:42 UTC
  To: pandoc-discuss



Thank you, John and Jesse, for your quick replies; they are really appreciated.

I just gave the binary release from GitHub a go:

iconv -f latin1 -t utf-8 file.docx | /path/to/pandoc-2.9.2.1/bin/pandoc -o output.html

Again, 100% CPU usage for some minutes. What's new is that the process also 
used over 80% of the 8 GB of RAM.

The system had to terminate the process.

output.html was created, but it was empty.

What am I missing?

Best regards,

Samuel


* Re: [Pandoc 1.12.3.1] Never ending docx to HTML conversion with 100% CPU usage
From: Samuel Viscapi @ 2020-04-09  8:33 UTC
  To: pandoc-discuss



Dear all,

Things are working now, thanks for your help.

My previous message has been deleted, though I don't understand why.

Best regards,

Samuel
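
The thread does not record what changed between the last two messages. One plausible explanation, offered as an assumption rather than something confirmed above, is the iconv step: a .docx file is a ZIP archive, so re-encoding it as if it were Latin-1 text alters its bytes, and pandoc treats piped input as Markdown unless told otherwise. The sketch below uses a hypothetical five-byte stand-in for file.docx to show the byte change, then gives the direct invocation as a comment.

```shell
#!/bin/sh
# Hypothetical stand-in for file.docx: the ZIP signature "PK\003\004"
# followed by a single byte above 0x7F (0xE9, written in octal as \351).
tmp=$(mktemp -d)
printf 'PK\003\004\351' > "$tmp/sample.bin"

# iconv rewrites each byte above 0x7F as a two-byte UTF-8 sequence,
# so the result is no longer the same archive.
iconv -f latin1 -t utf-8 "$tmp/sample.bin" > "$tmp/sample.converted"

orig_size=$(wc -c < "$tmp/sample.bin" | tr -d ' ')
conv_size=$(wc -c < "$tmp/sample.converted" | tr -d ' ')
echo "original: $orig_size bytes, after iconv: $conv_size bytes"

# Assumed fix: skip iconv and pass the untouched file by name.
#   pandoc file.docx -o output.html
```

Naming the file on the command line also lets pandoc infer the docx reader from the extension, so no -f flag is needed. This requires pandoc 1.13 or later, as noted earlier in the thread.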

