From: Saji Njarackalazhikam Hameed
Newsgroups: gmane.comp.tex.context
Subject: Re: Doc to ConTeXt [was Re: HTML to ConTeXt]
Date: Sat, 10 Nov 2007 14:44:31 +0900
Message-ID: <20071110054431.GB31273@apcc21.net>
In-Reply-To: <9AE14A44-6B23-450B-B3E5-DEE82341AFBC@di.unito.it>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>

Hi Andrea,

I face a similar issue while organizing large-scale documents prepared by
members of my group (many folks here are not conversant with TeX and write
their documents in Word). My solution was to take their input through a wiki
and convert the HTML to ConTeXt markup using filters written in Ruby (also
see http://wiki.contextgarden.net/HTML_and_ConTeXt). Converting HTML syntax
to ConTeXt syntax is very doable.

If it is of any use, I attach the Ruby filters I use for this purpose.
By the way, I use a Ruby library called "hpricot" to ease some of these
conversions.

saji
...
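Before the full filter, here is a minimal, self-contained sketch of the idea using only plain Ruby string substitution; the sample HTML and the tag set are purely illustrative, and the real filter below does the structural work with Hpricot instead:

```ruby
# Minimal HTML -> ConTeXt substitution with plain String#gsub.
# Illustrative only; the attached filter handles structure via Hpricot.
html = "<h1>Results</h1><p>The <strong>main</strong> finding.</p>"

tex = html.dup
tex.gsub!(%r{<h1>(.*?)</h1>}m)         { "\\section{#{$1}}" }
tex.gsub!(%r{<strong>(.*?)</strong>}m) { "{\\bf #{$1}}" }
tex.gsub!(%r{</?p>}, "\n")

puts tex
# \section{Results}
# The {\bf main} finding.
```

Regex substitution like this is fragile on nested or attribute-laden markup, which is exactly why the attached script uses an HTML parser for the structural elements and keeps gsub! only for the simple, flat tags.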
def scrape_the_page(pagePath,oFile,hFile)

items_to_remove = [
  "#menus",         # menus notice
  "div.markedup",
  "div.navigation",
  "head",           # table of contents
  "hr"
]

doc = Hpricot(open(pagePath))

# this may not be applicable to your case:
# it removes some unnecessary markup from the Wiki pages
@article = (doc/"#container").each do |content|
  # remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }
end

# write the HTML content to file
hFile.write @article.inner_html

# How to replace various syntactic elements using Hpricot

# replace p/*/b elements with \bf
(@article/"p/*/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace p/b elements with \bf
(@article/"p/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace strong elements with \bf
(@article/"strong").each do |ps|
  ps.swap("{\\bf #{ps.inner_html}}")
end

# replace h1 elements with \section
(@article/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end

# replace h2 elements with \subsection
(@article/"h2").each do |h2|
  h2.swap("\\subsection{#{h2.inner_html}}")
end

# replace h3 elements with \subsubsection
(@article/"h3").each do |h3|
  h3.swap("\\subsubsection{#{h3.inner_html}}")
end

# replace h4 elements with \subsubsubsection
(@article/"h4").each do |h4|
  h4.swap("\\subsubsubsection{#{h4.inner_html}}")
end

# replace h5 elements with \subsubsubsubsection
(@article/"h5").each do |h5|
  h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
end

# replace <pre> by the equivalent environment in ConTeXt
(@article/"pre").each do |pre|
  pre.swap("\\startcode\n#{pre.at("code").inner_html}\n\\stopcode")
end

# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference

(@article/"a").each do |a|
  a.swap("\\in[#{a.inner_html}]")
end
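For readers without Hpricot at hand, the same anchor-to-\in conversion can be approximated with a bare regex; this assumes simple `<a>` elements whose text is the figure label, as in the wiki output (the sample string is made up for illustration):

```ruby
# Approximate the Hpricot swap above with a plain regex.
# Assumes <a> elements whose inner text is the ConTeXt reference label.
html = "see <a>fig:flow</a> for details"
tex  = html.gsub(%r{<a[^>]*>(.*?)</a>}m) { "\\in[#{$1}]" }
puts tex
# see \in[fig:flow] for details
```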


# read the 'alt' attribute of the <img> element and
# replace <img> by the equivalent command in ConTeXt
(@article/"p/img").each do |img|
  img_attrs = img.attributes['alt'].split(",")
  # separate the file name from the extension; we have to take care
  # of file names that have a "." embedded in them
  img_src = img.attributes['src'].reverse.sub(/\w+\./,"").reverse
  # puts img_src
  # see if the position of the figure is indicated
  img_pos = "force"
  img_attrs.each do |arr|
    img_pos = arr.gsub("position=","") if arr.match("position=")
  end
  img_attrs.delete("position=#{img_pos}") unless img_pos == "force"
  # see if the array img_attrs contains a referral keyword
  if img_attrs.first.match(/\w+[=]\w+/)
    img_id = " "
  else
    img_id = img_attrs.first
    img_attrs.delete_at(0)
  end
  if img.attributes['title']
    img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
  else
    img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
  end
end # end of converting inside (@article/"p/img")

# Why not search for <table>: if we find a caption, keep it; if not, add an empty one.
# Styling options: here I catch the div element called Col2 and
# format the tex document in two columns.
# Tables: placing them.

# replace <table> by the equivalent command in ConTeXt
(@article/"table").each do |tab|
  if tab.at("caption")
    tab.swap(" \\placetable[split]{#{tab.at("caption").inner_html}}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} ")
  else
    tab.swap(" \\placetable[split]{}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} \n ")
  end
end

# Tables: remove the caption
(@article/"caption").each do |cap|
  cap.swap("\n")
end

# Now we transfer the syntactically altered html to a String object
# and manipulate that object further
newdoc = @article.inner_html

# remove empty space at the beginning
newdoc.gsub!(/^\s+/,"")

# remove or translate all elements we don't need
newdoc.gsub!(/<p>/,"\n")
newdoc.gsub!(/<\/p>/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

# ConTeXt does not mind "_" in figure names and does not recognize \_,
# so I have to catch these and replace \_ with _
# First catch
filter = /\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
if newdoc[filter]
  newdoc.gsub!(filter) { |fString| fString.gsub("\\_","_") }
end
# Second catch
filter2 = /\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
if newdoc[filter2]
  newdoc.gsub!(filter2) { |fString| fString.gsub("\\_","_") }
end
# Third catch: remove \_ inside []
filter3 = /\[\w+\\_\w+\]/
if newdoc[filter3]
  newdoc.gsub!(filter3) { |fString|
    puts fString
    fString.gsub("\\_","_")
  }
end

# remove the comment tags, which we used to embed ConTeXt commands
newdoc.gsub!("<!--","")
newdoc.gsub!("-->","")

# add the full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")
newdoc.gsub!(/<\w+\s*\/>/,"")

# puts newdoc
# open file for output
# outfil = "#{oFile}.tex"
# `rm #{outfil}`
# fil = File.new(outfil,"a")
# puts "Writing #{oFile}"
oFile.write newdoc
end

# imgProps = {}
# img_attrs.each do |arr|
#   imgProps['width'] = arr.gsub("width=","") if arr.match("width=")
#   imgProps['position'] = arr.gsub("position=","") if arr.match("position=")
# end

* Andrea Valle [2007-11-10 02:30:36 +0100]:

> Hi to all (Idris, in particular, as we are always dealing with the same
> problems...),
>
> I just want to share some thoughts about the ol' damn' problem of
> converting to ConTeXt from Word et al.
>
>> As I told Andrea: For relatively simple documents (like the kind we use in
>> academic journals) it seems we can now
>>
>> 1) convert doc to odt using OOo
>> 2) convert odt to markdown using
>
> As suggested by Idris, I subscribed to the pandoc list, but I have to say
> that the activity is not exactly like the one on the ConTeXt list...
> So the actual support for ConTeXt conversion is not convincing. Moreover,
> it's always better to put your hands on your own machine...
>
> My problem is to convert a series of academic journals to ConTeXt. They
> come from the Humanities, so there is little structure (basically, mainly
> body text and footnotes).
> Far from me the idea of automating all of the work; I'd just like to be
> faster and more accurate in conversion.
> (No particular interest in figures, as they are few, and not so much in
> references: they tend to be typographically inconsistent if done
> in a WYSIWYG environment, so they are difficult to parse.)
> Moreover, as the journal has already been published, we need to work with
> the final pdfs.
>
> After wasting my time with an awful pdf-to-html converter by Acrobat, I
> discovered this, which you may all know:
> http://pdftohtml.sourceforge.net/
>
> The html conversion is very good, both in the resulting rendering and in
> the sources, but after some tweaking I got interested in the xml
> conversion it allows.
> The xml format substantially encodes the information related to the page;
> typically each line is an element. Plus, bold and italics are easily
> marked as <b> and <i>.
> I'm still struggling to understand something really operative about XML
> processing in ConTeXt, so I switched back to Python.
> I used an incremental sax parser with some replacements.
> This is today's draft.
> Original:
> http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
>
> Recomposed (no setup at all, only \enableregime[utf]):
> http://www.semiotiche.it/andrea/membrana/02imp.pdf
>
> pdf --> pdftoxml --> xml --> python script --> tex --> pdf
>
> I recovered par, bold, em, and footnotes, stripping dashes and
> reassembling the text with footnote references. Not bad as a first step.
>
> I guess that you xml gurus could probably do it much more easily and
> cleanly. So, I mean, just for my very specific needs, I can probably take
> word sources, convert them to pdf, and then finally reach ConTeXt as
> discussed.
>
> Just some ideas to share with the list.
>
> Best
>
> -a-
>
> --------------------------------------------------
> Andrea Valle
> --------------------------------------------------
> CIRMA - DAMS
> Università degli Studi di Torino
> --> http://www.cirma.unito.it/andrea/
> --> andrea.valle@unito.it
> --------------------------------------------------
>
> I did this interview where I just mentioned that I read Foucault. Who
> doesn't in university, right? I was in this strip club giving this guy a
> lap dance and all he wanted to do was to discuss Foucault with me. Well, I
> can stand naked and do my little dance, or I can discuss Foucault, but not
> at the same time; too much information.
> (Annabel Chong)

--
Saji N.
Hameed
APEC Climate Center                          +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705      saji@apcc21.net
KOREA

___________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________
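Andrea's last step, recovering bold and italics from pdftoxml's XML with an incremental SAX parser, was done in Python; purely as an illustration, the same idea can be sketched in Ruby with the stdlib REXML stream listener. The `b`/`i` tag names follow his description of pdftoxml's output; the sample input string is made up for the sketch:

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Stream through pdftoxml-style XML and emit ConTeXt markup for
# <b> and <i> runs; other tags are dropped, text content is kept.
class TexListener
  include REXML::StreamListener
  MARKUP = { "b" => "{\\bf ", "i" => "{\\em " }

  def initialize(out)
    @out = out
  end

  def tag_start(name, _attrs)
    @out << MARKUP[name] if MARKUP.key?(name)
  end

  def tag_end(name)
    @out << "}" if MARKUP.key?(name)
  end

  def text(t)
    @out << t
  end
end

xml = "<text>A <b>bold</b> and <i>italic</i> line.</text>"
out = +""
REXML::Document.parse_stream(xml, TexListener.new(out))
puts out
# A {\bf bold} and {\em italic} line.
```

A streaming listener like this keeps memory flat on the large per-page XML files pdftoxml produces, at the cost of having to track any state (such as open footnotes) yourself.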
# remaining replacements from the same gsub! chain above
# (table cells, table headers, and inline markup):
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")
newdoc.gsub!("&",'\\\&')
newdoc.gsub!("$",'\\$')
newdoc.gsub!(/