From: Saji Njarackalazhikam Hameed
Newsgroups: gmane.comp.tex.context
Subject: Re: Doc to ConTeXt [was Re: HTML to ConTeXt]
Date: Sat, 10 Nov 2007 14:44:31 +0900
Message-ID: <20071110054431.GB31273@apcc21.net>
In-Reply-To: <9AE14A44-6B23-450B-B3E5-DEE82341AFBC@di.unito.it>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>

Hi Andrea,

I face a similar issue while organizing large-scale documents prepared by
members of my group (many folks here are not conversant with TeX and write
their documents in Word). My solution was to take their input through a wiki
and convert the HTML to ConTeXt markup using filters written in Ruby (also
see http://wiki.contextgarden.net/HTML_and_ConTeXt). Converting HTML syntax
to ConTeXt syntax is very doable.

If it is of any use, I attach the Ruby filters I use for this purpose.
By the way, I use a Ruby library called "hpricot" to ease some of these
conversions.

saji
...
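Before the full filter, here is a minimal, self-contained sketch of the idea using only plain Ruby string substitution; the sample HTML and the tag set are purely illustrative, and the real filter below does the structural work with Hpricot instead:

```ruby
# Minimal HTML -> ConTeXt substitution with plain String#gsub.
# Illustrative only; the attached filter handles structure via Hpricot.
html = "<h1>Results</h1><p>The <strong>main</strong> finding.</p>"

tex = html.dup
tex.gsub!(%r{<h1>(.*?)</h1>}m)         { "\\section{#{$1}}" }
tex.gsub!(%r{<strong>(.*?)</strong>}m) { "{\\bf #{$1}}" }
tex.gsub!(%r{</?p>}, "\n")

puts tex
# \section{Results}
# The {\bf main} finding.
```

Regex substitution like this is fragile on nested or attribute-laden markup, which is exactly why the attached script uses an HTML parser for the structural elements and keeps gsub! only for the simple, flat tags.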
def scrape_the_page(pagePath,oFile,hFile)

items_to_remove = [
  "#menus",         # menus notice
  "div.markedup",
  "div.navigation",
  "head",           # table of contents
  "hr"
]

doc = Hpricot(open(pagePath))

# this may not be applicable to your case:
# it removes some unnecessary markup from the Wiki pages
@article = (doc/"#container").each do |content|
  # remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }
end

# write the HTML content to file
hFile.write @article.inner_html

# How to replace various syntactic elements using Hpricot

# replace p/*/b elements with \bf
(@article/"p/*/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace p/b elements with \bf
(@article/"p/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace strong elements with \bf
(@article/"strong").each do |ps|
  ps.swap("{\\bf #{ps.inner_html}}")
end

# replace h1 elements with \section
(@article/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end

# replace h2 elements with \subsection
(@article/"h2").each do |h2|
  h2.swap("\\subsection{#{h2.inner_html}}")
end

# replace h3 elements with \subsubsection
(@article/"h3").each do |h3|
  h3.swap("\\subsubsection{#{h3.inner_html}}")
end

# replace h4 elements with \subsubsubsection
(@article/"h4").each do |h4|
  h4.swap("\\subsubsubsection{#{h4.inner_html}}")
end

# replace h5 elements with \subsubsubsubsection
(@article/"h5").each do |h5|
  h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
end

# replace <pre> by the equivalent environment in ConTeXt
(@article/"pre").each do |pre|
  pre.swap("\\startcode\n#{pre.at("code").inner_html}\n\\stopcode")
end

# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference

(@article/"a").each do |a|
  a.swap("\\in[#{a.inner_html}]")
end
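For readers without Hpricot at hand, the same anchor-to-\in conversion can be approximated with a bare regex; this assumes simple `<a>` elements whose text is the figure label, as in the wiki output (the sample string is made up for illustration):

```ruby
# Approximate the Hpricot swap above with a plain regex.
# Assumes <a> elements whose inner text is the ConTeXt reference label.
html = "see <a>fig:flow</a> for details"
tex  = html.gsub(%r{<a[^>]*>(.*?)</a>}m) { "\\in[#{$1}]" }
puts tex
# see \in[fig:flow] for details
```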


# read the 'alt' attribute of the <img> element and
# replace <img> by the equivalent command in ConTeXt
(@article/"p/img").each do |img|
  img_attrs = img.attributes['alt'].split(",")
  # separate the file name from the extension; we have to take care
  # of file names that have a "." embedded in them
  img_src = img.attributes['src'].reverse.sub(/\w+\./,"").reverse
  # puts img_src
  # see if the position of the figure is indicated
  img_pos = "force"
  img_attrs.each do |arr|
    img_pos = arr.gsub("position=","") if arr.match("position=")
  end
  img_attrs.delete("position=#{img_pos}") unless img_pos == "force"
  # see if the array img_attrs contains a referral keyword
  if img_attrs.first.match(/\w+[=]\w+/)
    img_id = " "
  else
    img_id = img_attrs.first
    img_attrs.delete_at(0)
  end
  if img.attributes['title']
    img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
  else
    img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
  end
end # end of converting inside (@article/"p/img")

# Why not search for <table>: if we find a caption, keep it; if not, add an empty one.
# Styling options: here I catch the div element called Col2 and
# format the tex document in two columns.
# Tables: placing them.

# replace <table> by the equivalent command in ConTeXt
(@article/"table").each do |tab|
  if tab.at("caption")
    tab.swap(" \\placetable[split]{#{tab.at("caption").inner_html}}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} ")
  else
    tab.swap(" \\placetable[split]{}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} \n ")
  end
end

# Tables: remove the caption
(@article/"caption").each do |cap|
  cap.swap("\n")
end

# Now we transfer the syntactically altered html to a String object
# and manipulate that object further
newdoc = @article.inner_html

# remove empty space at the beginning
newdoc.gsub!(/^\s+/,"")

# remove or translate all elements we don't need
newdoc.gsub!(/<p>/,"\n")
newdoc.gsub!(/<\/p>/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

# ConTeXt does not mind "_" in figure names and does not recognize \_,
# so I have to catch these and replace \_ with _
# First catch
filter = /\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
if newdoc[filter]
  newdoc.gsub!(filter) { |fString| fString.gsub("\\_","_") }
end
# Second catch
filter2 = /\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
if newdoc[filter2]
  newdoc.gsub!(filter2) { |fString| fString.gsub("\\_","_") }
end
# Third catch: remove \_ inside []
filter3 = /\[\w+\\_\w+\]/
if newdoc[filter3]
  newdoc.gsub!(filter3) { |fString|
    puts fString
    fString.gsub("\\_","_")
  }
end

# remove the comment tags, which we used to embed ConTeXt commands
newdoc.gsub!("<!--","")
newdoc.gsub!("-->","")

# add the full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")
newdoc.gsub!(/<\w+\s*\/>/,"")

# puts newdoc
# open file for output
# outfil = "#{oFile}.tex"
# `rm #{outfil}`
# fil = File.new(outfil,"a")
# puts "Writing #{oFile}"
oFile.write newdoc
end

# imgProps = {}
# img_attrs.each do |arr|
#   imgProps['width'] = arr.gsub("width=","") if arr.match("width=")
#   imgProps['position'] = arr.gsub("position=","") if arr.match("position=")
# end

* Andrea Valle [2007-11-10 02:30:36 +0100]:

> Hi to all (Idris, in particular, as we are always dealing with the same
> problems...),
>
> I just want to share some thoughts about the ol' damn' problem of
> converting to ConTeXt from Word et al.
>
>> As I told Andrea: For relatively simple documents (like the kind we use in
>> academic journals) it seems we can now
>>
>> 1) convert doc to odt using OOo
>> 2) convert odt to markdown using
>
> As suggested by Idris, I subscribed to the pandoc list, but I have to say
> that the activity is not exactly like the one on the ConTeXt list...
> So the actual support for ConTeXt conversion is not convincing. Moreover,
> it's always better to put your hands on your own machine...
>
> My problem is to convert a series of academic journals to ConTeXt. They
> come from the Humanities, so there is little structure (basically, mainly
> body text and footnotes).
> Far from me the idea of automating all of the work; I'd just like to be
> faster and more accurate in conversion.
> (No particular interest in figures, as they are few, and not so much in
> references: they tend to be typographically inconsistent if done
> in a WYSIWYG environment, so they are difficult to parse.)
> Moreover, as the journal has already been published, we need to work with
> the final pdfs.
>
> After wasting my time with an awful pdf-to-html converter by Acrobat, I
> discovered this, which you may all know:
> http://pdftohtml.sourceforge.net/
>
> The html conversion is very good, both in the resulting rendering and in
> the sources, but after some tweaking I got interested in the xml
> conversion it allows.
> The xml format substantially encodes the information related to the page;
> typically each line is an element. Plus, bold and italics are easily
> marked as <b> and <i>.
> I'm still struggling to understand something really operative about XML
> processing in ConTeXt, so I switched back to Python.
> I used an incremental sax parser with some replacements.
> This is today's draft.
> Original:
> http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
>
> Recomposed (no setup at all, only \enableregime[utf]):
> http://www.semiotiche.it/andrea/membrana/02imp.pdf
>
> pdf --> pdftoxml --> xml --> python script --> tex --> pdf
>
> I recovered par, bold, em, and footnotes, stripping dashes and
> reassembling the text with footnote references. Not bad as a first step.
>
> I guess that you xml gurus could probably do it much more easily and
> cleanly. So, I mean, just for my very specific needs, I can probably take
> word sources, convert them to pdf, and then finally reach ConTeXt as
> discussed.
>
> Just some ideas to share with the list.
>
> Best
>
> -a-
>
> --------------------------------------------------
> Andrea Valle
> --------------------------------------------------
> CIRMA - DAMS
> Università degli Studi di Torino
> --> http://www.cirma.unito.it/andrea/
> --> andrea.valle@unito.it
> --------------------------------------------------
>
> I did this interview where I just mentioned that I read Foucault. Who
> doesn't in university, right? I was in this strip club giving this guy a
> lap dance and all he wanted to do was to discuss Foucault with me. Well, I
> can stand naked and do my little dance, or I can discuss Foucault, but not
> at the same time; too much information.
> (Annabel Chong)

--
Saji N.
Hameed
APEC Climate Center                          +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705      saji@apcc21.net
KOREA

___________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________
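Andrea's last step, recovering bold and italics from pdftoxml's XML with an incremental SAX parser, was done in Python; purely as an illustration, the same idea can be sketched in Ruby with the stdlib REXML stream listener. The `b`/`i` tag names follow his description of pdftoxml's output; the sample input string is made up for the sketch:

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Stream through pdftoxml-style XML and emit ConTeXt markup for
# <b> and <i> runs; other tags are dropped, text content is kept.
class TexListener
  include REXML::StreamListener
  MARKUP = { "b" => "{\\bf ", "i" => "{\\em " }

  def initialize(out)
    @out = out
  end

  def tag_start(name, _attrs)
    @out << MARKUP[name] if MARKUP.key?(name)
  end

  def tag_end(name)
    @out << "}" if MARKUP.key?(name)
  end

  def text(t)
    @out << t
  end
end

xml = "<text>A <b>bold</b> and <i>italic</i> line.</text>"
out = +""
REXML::Document.parse_stream(xml, TexListener.new(out))
puts out
# A {\bf bold} and {\em italic} line.
```

A streaming listener like this keeps memory flat on the large per-page XML files pdftoxml produces, at the cost of having to track any state (such as open footnotes) yourself.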
# remaining replacements from the same gsub! chain above
# (table cells, table headers, and inline markup):
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")
newdoc.gsub!("&",'\\\&')
newdoc.gsub!("$",'\\$')
newdoc.gsub!(/