From: Andrea Valle <valle@di.unito.it>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>
Subject: Re: Doc to ConTeXt [was Re:  HTML to ConTeXt]
Date: Sat, 10 Nov 2007 02:30:36 +0100
Message-ID: <9AE14A44-6B23-450B-B3E5-DEE82341AFBC@di.unito.it>
In-Reply-To: <op.t0sylef3nx1yh1@your-b27fb1c401>


Hi all (Idris in particular, as we are always dealing with the
same problems...),

I just want to share some thoughts about the ol' damn problem of
converting from Word et al. to ConTeXt.

> As I told Andrea: For relatively simple documents (like the kind we use in
> academic journals) it seems we can now
>
> 1) convert doc to odt using OOo
> 2) convert odt to markdown using

As suggested by Idris, I subscribed to the pandoc list, but I have to
say that it is not nearly as active as the ConTeXt list...
So the current support for ConTeXt conversion is not convincing. Besides,
it's always better to get your hands dirty on your own machine...

My problem is to convert a series of academic journals to ConTeXt.
They come from the Humanities, so there is little structure (basically
just body text and footnotes).
Far be it from me to automate everything; I'd just like
the conversion to be faster and more accurate.
(I have no particular interest in figures, as they are few, and not much in
references: they tend to be typographically inconsistent when done
in a WYSIWYG environment, and so are difficult to parse.)
Moreover, as the journal has already been published, we need to work from
the final PDFs.

After wasting my time with an awful PDF-to-HTML converter from
Acrobat, I discovered this, which you may all know already:
http://pdftohtml.sourceforge.net/

The HTML conversion is very good, both in the rendered result and
in the sources, but after some tweaking I got interested in the XML
conversion it also offers.
The XML format essentially encodes the page layout information;
typically each line is an element. In addition, bold and italics are
marked simply as <b> and <i>.
I'm still struggling to get anything really workable out of XML
processing in ConTeXt, so I switched back to Python.
I used an incremental SAX parser with some replacements.
This is today's draft.
Original:
http://www.semiotiche.it/andrea/membrana/02%20imp.pdf

Recomposed (no setup at all, only \enableregime[utf]):
http://www.semiotiche.it/andrea/membrana/02imp.pdf

pdf --> pdftohtml (XML output) --> xml --> python script --> tex --> pdf
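
To make the "xml --> python script --> tex" step concrete, here is a minimal sketch of an incremental SAX pass. It is not my actual script, only an illustration. It assumes the converter writes one <text> element per line inside <page> elements (the root name pdf2xml and the top/left/font attributes in the sample are assumptions about the output), and the names LineHandler, parse_lines and SAMPLE are made up for the example. It maps <b> and <i> to {\bf ...} and {\em ...}.

```python
import xml.sax


class LineHandler(xml.sax.ContentHandler):
    """Collect one ConTeXt-marked-up string per <text> element."""

    # Assumed mapping from the converter's inline tags to ConTeXt groups.
    OPEN = {"b": "{\\bf ", "i": "{\\em "}

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines, in document order
        self._buf = None   # parts of the current line, None outside <text>

    def startElement(self, name, attrs):
        if name == "text":
            self._buf = []
        elif name in self.OPEN and self._buf is not None:
            self._buf.append(self.OPEN[name])

    def endElement(self, name):
        if name == "text" and self._buf is not None:
            self.lines.append("".join(self._buf))
            self._buf = None
        elif name in self.OPEN and self._buf is not None:
            self._buf.append("}")

    def characters(self, content):
        # Text outside a <text> element is ignored.
        if self._buf is not None:
            self._buf.append(content)


def parse_lines(chunks):
    """Feed XML to the parser piece by piece and return the collected lines."""
    handler = LineHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.lines


# Hand-written sample in the assumed layout, not real converter output.
SAMPLE = (
    '<pdf2xml><page number="1">'
    '<text top="10" left="20" font="0">A <b>bold</b> and <i>italic</i> line</text>'
    '<text top="30" left="20" font="0">second line</text>'
    '</page></pdf2xml>'
)

if __name__ == "__main__":
    # Split the input in two to show that feeding is incremental.
    for line in parse_lines([SAMPLE[:40], SAMPLE[40:]]):
        print(line)
```

Note that the sketch does not escape TeX special characters (%, &, #, $, _), which a real conversion would have to handle before writing the .tex file.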

I recovered paragraphs, bold, emphasis, and footnotes, stripping the
end-of-line dashes and reassembling the text with its footnote references.
Not bad as a first step.
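
The dash stripping and footnote reassembly could look roughly like the sketch below. Again, this is an illustration under assumptions, not my script: it assumes every trailing hyphen marks a broken word, and it assumes an earlier pass has already turned footnote references into a hypothetical [[n]] marker. The function names join_lines and insert_footnotes are made up for the example.

```python
import re


def join_lines(lines):
    """Join extracted lines into one paragraph, removing end-of-line hyphens."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            # Assumption: a trailing hyphen always marks a broken word.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


def insert_footnotes(body, notes):
    """Replace the hypothetical [[n]] markers with \\footnote{...} calls."""
    def repl(match):
        key = match.group(1)
        # Leave the marker alone when no footnote text is known for it.
        if key in notes:
            return "\\footnote{" + notes[key] + "}"
        return match.group(0)
    return re.sub(r"\[\[(\d+)\]\]", repl, body)


if __name__ == "__main__":
    para = join_lines(["The conver-", "sion of text[[1]]", "works."])
    print(insert_footnotes(para, {"1": "See the original."}))
```

The obvious weak point is that a genuine compound broken at the line end (say "well-" / "known") gets joined as well, so the result still needs proofreading.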

I guess that you XML gurus could probably do this much more easily and cleanly.
So, I mean, just for my very specific needs, I can probably take
the Word sources, convert them to PDF, and then finally reach ConTeXt as
discussed.

Just some ideas to share with the list

Best

-a-




--------------------------------------------------
Andrea Valle
--------------------------------------------------
CIRMA - DAMS
Università degli Studi di Torino
--> http://www.cirma.unito.it/andrea/
--> andrea.valle@unito.it
--------------------------------------------------


I did this interview where I just mentioned that I read Foucault. Who  
doesn't in university, right? I was in this strip club giving this  
guy a lap dance and all he wanted to do was to discuss Foucault with  
me. Well, I can stand naked and do my little dance, or I can discuss  
Foucault, but not at the same time; too much information.
(Annabel Chong)






___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________

