Re: Doc to ConTeXt [was Re: HTML to ConTeXt]

From: Saji Njarackalazhikam Hameed <saji@apcc21.net>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>
Subject: Re: Doc to ConTeXt [was Re:  HTML to ConTeXt]
Date: Sat, 10 Nov 2007 14:44:31 +0900
Message-ID: <20071110054431.GB31273@apcc21.net>
In-Reply-To: <9AE14A44-6B23-450B-B3E5-DEE82341AFBC@di.unito.it>

Hi Andrea,

I face a similar issue when organizing large documents
prepared by members of my group (many people here are not familiar
with TeX and write their documents in Word). My solution is to take their
input through a wiki and convert the HTML to ConTeXt markup
using filters written in Ruby (see also
http://wiki.contextgarden.net/HTML_and_ConTeXt). Converting
HTML syntax to ConTeXt syntax is very doable.

If it is of any use, I attach the Ruby filters I use for
this purpose. BTW, I use a Ruby library called "hpricot" to ease
some of these conversions.

saji
...

def scrape_the_page(pagePath,oFile,hFile) 
items_to_remove = [
  "#menus",        #menus notice
  "div.markedup",
  "div.navigation",
  "head",          #table of contents 
  "hr"
  ]

doc=Hpricot(open(pagePath))
# this may not be applicable to your case
# this removes some unnecessary markup from the Wiki pages

@article = (doc/"#container").each do |content|
  #remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }
end 

# Write HTML content to file
hFile.write @article.inner_html

# How to replace various syntactic elements using Hpricot
# replace b elements nested one level below p (p/*/b) with \bf
(@article/"p/*/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace p/b element with \bf
(@article/"p/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace strong element with \bf
(@article/"strong").each do |ps|
  ps.swap("{\\bf #{ps.inner_html}}")
end

# replace h1 element with section
(@article/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end

# replace h2 element with subsection
(@article/"h2").each do |h2|
  h2.swap("\\subsection{#{h2.inner_html}}")
end

# replace h3 element with subsubsection
(@article/"h3").each do |h3|
  h3.swap("\\subsubsection{#{h3.inner_html}}")
end

# replace h4 element with subsubsubsection
(@article/"h4").each do |h4|
  h4.swap("\\subsubsubsection{#{h4.inner_html}}")
end

# replace h5 element with subsubsubsubsection
(@article/"h5").each do |h5|
  h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
end

# replace <pre><code> by equivalent command in context
(@article/"pre").each do |pre|
  pre.swap("\\startcode \n #{pre.at("code").inner_html} \n
  \\stopcode")
end

# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference

(@article/"a").each do |a|
  a.swap("\\in[#{a.inner_html}]")
end

# read the figure options from the 'alt' attribute of the <img> element
# and replace <p><img> by the equivalent command in ConTeXt
(@article/"p/img").each do |img|

  img_attrs=img.attributes['alt'].split(",")

  # separate the file name from the extension
  # have to take care of file names that have a "." embedded in them
  img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
  # puts img_src
  # see if position of figure is indicated
  img_pos="force"
  img_attrs.each do |arr| 
    img_pos=arr.gsub("position=","") if arr.match("position=")
  end
  img_attrs.delete("position=#{img_pos}") unless img_pos=="force" 

  # see if the first entry of img_attrs is a reference label;
  # an empty list or a leading key=value pair means there is none
  if img_attrs.empty? || img_attrs.first.match(/\w+[=]\w+/)
    img_id=" "
  else
    img_id=img_attrs.first
    img_attrs.delete_at(0)
  end

  if img_pos=="force"
    if img.attributes['title']
      img.swap("
      \\placefigure\n 
      [#{img_pos}][#{img_id}] \n 
      {#{img.attributes['title']}} \n 
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}  \n
              ")
    else
      img.swap("
      \\placefigure\n 
      [#{img_pos}] \n
      {none} \n
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} 
              ")
    end
  else
    if img.attributes['title']
      img.swap("
      \\placefigure\n 
      [#{img_pos}][#{img_id}] \n 
      {#{img.attributes['title']}} \n 
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}  \n
              ")
    else
      img.swap("
      \\placefigure\n 
      [#{img_pos}] \n
      {none} \n
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
       \n 
              ")
    end
  end

end # end of converting inside (@article/"p/img")

# why not search for table and if we find caption, keep it ; if not add an empty

# Styling options: Here I catch the div element called Col2 and
# format the tex document in 2 columns

# Tables : placing them
# wrap each <table> in \placetable, using its <caption> when there is one
(@article/"table").each do |tab|
  if tab.at("caption")
  tab.swap("
  \\placetable[split]{#{tab.at("caption").inner_html}}\n
  {\\bTABLE \n
  #{tab.inner_html}
  \\eTABLE} 
             ")
  else
  tab.swap("
   \\placetable[split]{}\n
   {\\bTABLE \n
  #{tab.inner_html}
  \\eTABLE} \n 
            ")
  end
end

# Tables: remove the caption
(@article/"caption").each do |cap|
  cap.swap("\n")
end

# Now we transfer the syntactically altered html to a string Object
# and manipulate that object further

newdoc=@article.inner_html

# remove empty space in the beginning
newdoc.gsub!(/^\s+/,"")

# remove all elements we don't need.
newdoc.gsub!(/^<div.*/,"")
newdoc.gsub!(/^<\/div.*/,"")
newdoc.gsub!(/^<form.*/,"")
newdoc.gsub!(/^<\/form.*/,"")
newdoc.gsub!(/<p>/,"\n")
newdoc.gsub!(/<\/p>/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")
newdoc.gsub!("&",'\\\&')
newdoc.gsub!("$",'\\$')
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

# ConTeXt does not mind "_" in figure file names and does not recognize \_
# there, so I have to catch these and replace \_ with _

# First catch
filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/

if newdoc[filter]
newdoc.gsub!(filter) { |fString| 
fString.gsub("\\_","_") 
}
end

# Second catch
filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/

if newdoc[filter2]
newdoc.gsub!(filter2) { |fString| 
fString.gsub("\\_","_") }
end

# Third catch; remove \_ inside []
filter3=/\[\w+\\_\w+\]/

if newdoc[filter3]
newdoc.gsub!(filter3) { |fString| 
puts fString
fString.gsub("\\_","_") }
end

# remove the comment tag, which we used to embed context commands
newdoc.gsub!("<!--","")
newdoc.gsub!("-->","")

# add full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")

newdoc.gsub!(/<\w+\s*\/>/,"")

#puts newdoc
# open file for output
#outfil="#{oFile}.tex"
#`rm #{outfil}`

#fil=File.new(outfil,"a")
#puts "Writing #{oFile}"
oFile.write newdoc

end
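
The long run of gsub! calls above can also be written as a lookup table.
What follows is only a minimal sketch using plain Ruby (no Hpricot); the
names context_tag_map and html_tags_to_context are made up for
illustration, and the table covers only a few of the tags handled above.
It assumes the tags carry no attributes, as in the filter itself.

```ruby
# Hypothetical table: literal HTML tag => ConTeXt replacement.
# The pairs mirror some of the gsub! calls in the filter above.
def context_tag_map
  {
    "<p>"   => "\n",
    "</p>"  => "\n",
    "<ul>"  => "\\startitemize[1]",
    "</ul>" => "\\stopitemize",
    "<ol>"  => "\\startitemize[n]",
    "</ol>" => "\\stopitemize",
    "<li>"  => "\\item ",
    "</li>" => "\n",
    "<em>"  => "{\\em ",
    "</em>" => "}",
    "<u>"   => "",
    "</u>"  => ""
  }
end

# Replace every known tag in a single pass. Regexp.union escapes the
# literal tags, and the block form of gsub inserts the replacement
# as-is, so its backslashes are not read as back-references.
def html_tags_to_context(html, map = context_tag_map)
  pattern = Regexp.union(map.keys)
  html.gsub(pattern) { |tag| map[tag] }
end
```

Tags that are not in the table are left alone, so the sketch could run
before or after the Hpricot steps without touching their output.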
# imgProps={}
  #       img_attrs.each do |arr| 
  #       imgProps['width']=arr.gsub("width=","") if arr.match("width=")
  #       imgProps['position']=arr.gsub("position=","") if arr.match("position=")
  #       end
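
The commented-out imgProps fragment above hints at collecting the 'alt'
options into a hash. A rough sketch of that idea in plain Ruby follows.
It assumes the alt text is a comma-separated list with an optional
leading reference label, e.g. "fig:sst,position=here,width=8cm" (this
format is inferred from the filter, and the label shown is invented);
parse_img_alt and figure_options are hypothetical names. Unlike the
filter, a missing label is returned as nil rather than " ".

```ruby
# Hypothetical helper: split an alt string into label, position and
# the remaining key=value options.
def parse_img_alt(alt)
  fields = alt.to_s.split(",").map(&:strip).reject(&:empty?)

  # A leading field without "=" is treated as the reference label.
  label = fields.first && !fields.first.include?("=") ? fields.shift : nil

  # Keep only key=value fields; split on the first "=" only.
  pairs = fields.select { |f| f.include?("=") }
  props = pairs.map { |f| f.split("=", 2) }.to_h

  # "force" is the default position used in the filter above.
  position = props.delete("position") || "force"

  { label: label, position: position, options: props }
end

# Rebuild the option list for \externalfigure[...][...].
def figure_options(parsed)
  parsed[:options].map { |key, value| "#{key}=#{value}" }.join(",")
end
```

With a hash in hand, the four nearly identical \placefigure branches
could pick their arguments from parsed[:label] and parsed[:position]
instead of editing the img_attrs array in place.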

* Andrea Valle <valle@di.unito.it> [2007-11-10 02:30:36 +0100]:

> Hi to all (Idris, in particular, as we are always dealing with the same 
> problems... ),
>
> I just want to share some thoughts about the ol' damn' problem of 
> converting to ConTeXt from Word et al.
>
>> As I told Andrea: For relatively simple documents (like the kind we use in
>> academic journals) it seems we can now
>>
>> 1) convert doc to odt using OOo
>> 2) convert odt to markdown using
>
> As suggested by Idris, I subscribed to the pandoc list, but I have to say
> that the activity is not exactly like the one on ConTeXt list...
> So the actual support for ConTeXt conversion is not convincing. More, it's 
> always better to put the hands on your machine...
>
> My problem is to convert a series of academic journals in ConTeXt. They 
> come from the Humanities so little structure (basically, mainly body and
> footnotes).
> Far from me the idea of automatically doing all the stuff, I'd like to be 
> faster and more accurate in conversion.
> (No particular interest in figures, they are few, not so much in 
> references: they tend to be typographically inconsistent if done
> in a WYSIWYG environment, so difficult to parse).
> More, as the journal has already been published we need to work with final
> pdfs.
>
> After wasting my time with an awful pdf to html converter by Acrobat,  I 
> discovered this, you may all know:
> http://pdftohtml.sourceforge.net/
>
> The html  conversion is very very good in resulting rendering and also in 
> sources, but after some tweakings I got interested in the xml conversion it 
> allows.
> The xml format  substantially encodes the infos related to page, typically 
> each line is an element. Plus, there are bold and italics marked easily as 
> <b> and <i>
> I'm still struggling to understand something really operative of XML 
> processing in ConTeXt, so  I switched back to Python.
> I used an incremental sax parser with some replacement.
> This is today's draft.
> Original:
> http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
>
> Recomposed (no setup at all, only \enableregime[utf]):
> http://www.semiotiche.it/andrea/membrana/02imp.pdf
>
> pdf --> pdftoxml --> xml --> python script --> tex --> pdf
>
> I recovered par, bold, em, footnotes,  stripping dashes and reassembling 
> the text with footnote references. Not bad as a first step.
>
> I guess that you xml gurus could probably do much easier and cleaner.
> So, I mean -just for my very specific needs, I can probably take word
> sources, convert to pdf and then finally reach ConTeXt as discussed.
>
> Just some ideas to share with the list
>
> Best
>
> -a-
>
>
>
>
> --------------------------------------------------
> Andrea Valle
> --------------------------------------------------
> CIRMA - DAMS
> Università degli Studi di Torino
> --> http://www.cirma.unito.it/andrea/
> --> andrea.valle@unito.it
> --------------------------------------------------
>
>
> I did this interview where I just mentioned that I read Foucault. Who 
> doesn't in university, right? I was in this strip club giving this guy a 
> lap dance and all he wanted to do was to discuss Foucault with me. Well, I 
> can stand naked and do my little dance, or I can discuss Foucault, but not 
> at the same time; too much information.
> (Annabel Chong)
>
>
>
>

> ___________________________________________________________________________________
> If your question is of interest to others as well, please add an entry to the Wiki!
> 
> maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
> webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
> archive  : https://foundry.supelec.fr/projects/contextrev/
> wiki     : http://contextgarden.net
> ___________________________________________________________________________________

-- 
Saji N. Hameed

APEC Climate Center          				+82 51 668 7470
National Pension Corporation Busan Building 12F         
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705			saji@apcc21.net
KOREA