From: Andrea Valle
To: mailing list for ConTeXt users
Newsgroups: gmane.comp.tex.context
Subject: Re: Doc to ConTeXt [was Re: HTML to ConTeXt]
Date: Sat, 10 Nov 2007 02:30:36 +0100

Hi to all (Idris in particular, as we are always dealing with the same problems...),

I just want to share some thoughts about the ol' damn' problem of converting to ConTeXt from Word et al.

> As I told Andrea: For relatively simple documents (like the kind we use in
> academic journals) it seems we can now
>
> 1) convert doc to odt using OOo
> 2) convert odt to markdown using

As suggested by Idris, I subscribed to the pandoc list, but I have to say that the activity there is not exactly like that on the ConTeXt list... So the actual support for ConTeXt conversion is not convincing. Besides, it's always better to get your hands dirty on your own machine...

My problem is to convert a series of academic journals to ConTeXt. They come from the Humanities, so they have little structure (basically, mainly body text and footnotes). Far be it from me to do everything automatically; I'd just like to be faster and more accurate in the conversion. (I have no particular interest in figures, as there are few, and not much in references: they tend to be typographically inconsistent if done in a WYSIWYG environment, and so are difficult to parse.) Moreover, as the journal has already been published, we need to work from the final PDFs.

After wasting my time with an awful PDF-to-HTML converter by Acrobat, I discovered this, which you may all know:

http://pdftohtml.sourceforge.net/

The HTML conversion is very, very good, both in the resulting rendering and in the sources, but after some tweaking I got interested in the XML conversion it allows. The XML format essentially encodes the page-related information; typically each line is an element. Plus, bold and italics are marked simply as <b> and <i>.

I'm still struggling to understand how to do anything really practical with XML processing in ConTeXt, so I switched back to Python. I used an incremental SAX parser with some replacements. This is today's draft.

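A minimal sketch of the kind of SAX pass described above, not the actual script. It assumes that each line arrives as a <text> element (the element name is an assumption about the converter's XML output; only <b> and <i> are mentioned above). The names LinesToContext, join_lines and convert, and the mapping to {\bf ...} and {\em ...}, are hypothetical choices made for illustration.

```python
import xml.sax


class LinesToContext(xml.sax.ContentHandler):
    """Collect one string per <text> line, mapping <b>/<i> to ConTeXt switches."""

    # Assumed mapping; the replacements used in the real script are not shown.
    MARKUP = {"b": r"{\bf ", "i": r"{\em "}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished lines, one per <text> element
        self.current = None  # pieces of the line being read, or None outside <text>

    def startElement(self, name, attrs):
        if name == "text":
            self.current = []
        elif name in self.MARKUP and self.current is not None:
            self.current.append(self.MARKUP[name])

    def characters(self, content):
        # Ignore whitespace and anything else outside a <text> element.
        if self.current is not None:
            self.current.append(content)

    def endElement(self, name):
        if name == "text":
            self.lines.append("".join(self.current).strip())
            self.current = None
        elif name in self.MARKUP and self.current is not None:
            self.current.append("}")


def join_lines(lines):
    """Glue lines into one paragraph, dropping end-of-line hyphens."""
    out = ""
    for line in lines:
        if out.endswith("-"):
            out = out[:-1] + line  # naive: a genuine hyphen is removed too
        elif out:
            out += " " + line
        else:
            out = line
    return out


def convert(xml_text):
    handler = LinesToContext()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.feed(xml_text)  # incremental API: feed() may be called chunk by chunk
    parser.close()
    return join_lines(handler.lines)


# Hand-written sample in the assumed format, not real converter output.
SAMPLE = """<pdf2xml><page number="1">
<text top="10" left="5" font="0">A <i>short</i> exam-</text>
<text top="25" left="5" font="0">ple with <b>bold</b> text.</text>
</page></pdf2xml>"""

print(convert(SAMPLE))
```

Paragraph detection and footnote reassembly are left out here: they would need the position and font attributes of each line, and the rules for that depend on the layout of the journal.
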
Original:
http://www.semiotiche.it/andrea/membrana/02%20imp.pdf

Recomposed (no setup at all, only \enableregime[utf]):
http://www.semiotiche.it/andrea/membrana/02imp.pdf

pdf --> pdftoxml --> xml --> python script --> tex --> pdf

I recovered paragraphs, bold, emphasis and footnotes, stripping the hyphenation dashes and reassembling the text with footnote references. Not bad as a first step.

I guess that you XML gurus could probably do this much more easily and cleanly. So, just for my very specific needs, I can probably take the Word sources, convert them to PDF and then finally reach ConTeXt as discussed.

Just some ideas to share with the list.

Best

-a-

--------------------------------------------------
Andrea Valle
--------------------------------------------------
CIRMA - DAMS
Università degli Studi di Torino
--> http://www.cirma.unito.it/andrea/
--> andrea.valle@unito.it
--------------------------------------------------

I did this interview where I just mentioned that I read Foucault. Who doesn't in university, right? I was in this strip club giving this guy a lap dance and all he wanted to do was to discuss Foucault with me. Well, I can stand naked and do my little dance, or I can discuss Foucault, but not at the same time; too much information.
(Annabel Chong)
