From: Saji Njarackalazhikam Hameed
Subject: Html to Context using Wiki + hpricot
Date: Wed, 11 Jul 2007 16:33:43 +0900
To: ntg-context@ntg.nl

Hello All,

I wanted to share my recent experience in coordinated document development. In our office we have to produce annual reports, each part of which is contributed by a different member. Previously everybody wrote 'Word' documents, which were then compiled into a larger report. Recently we had the idea of using a Wiki to take the pain out of this process and to make it enjoyable for everyone involved. After looking at various wiki packages we decided to install a brand-new one called Informl (http://informl.folklogic.net). One nice feature is that the edit mode uses twin windows, one for input and the other a realtime preview. In any case, this approach should work with any wiki software. One motivation behind using a Wiki as the front end was to involve people who knew nothing about TeX or ConTeXt.
Secondly, it allowed anyone to participate in the process from anywhere.

Next we used the Ruby library hpricot to retrieve the web pages and filter them into a ConTeXt document. This step was interesting, and I would like to share the code with anybody interested. I am a novice Ruby programmer, so the code may be far from perfect ... nevertheless.

saji

%----------------------------------------------------------------
scan_page.rb = retrieves the html page of interest from the server,
navigates to the links within the main page and constructs a ConTeXt document

#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'scrape_page'

# scans the home page and lists
# all the directories and subdirectories
doc=Hpricot(open("http://190.1.1.24:3010/AnnRep07"))

mainfil="annrep.tex"
`rm #{mainfil}`
fil=File.new(mainfil,"a")
fil.write "\\input context_styles \n"
fil.write "\\starttext \n"
fil.write "\\leftaligned{\\BigFontOne Contents} \n"
fil.write "\\vfill \n"
fil.write "{ \\switchtobodyfont[10pt] "
fil.write "\\startcolumns[n=2,balance=no,rule=off,option=background,frame=off,background=color,backgroundcolor=blue:1] \n"
fil.write "\\placecontent \n"
fil.write "\\stopcolumns \n"
fil.write "}"

chapters= (doc/"p/a.existingWikiWord")

# we need to navigate one more level into the web page;
# let us discover the links for that
chapters.each {|ch|
  chap_link = ch.attributes['href']
  # using inner_html we can create subdirectories
  chap_name = ch.inner_html.gsub(/\s*/,"")
  chap_name_org = ch.inner_html
  # we create chapter directories
  system("mkdir -p #{chap_name}")
  puts chap_name
  # if the chapter name starts with an underscore (_) skip it
  if chap_name.match(/^\_/)
    puts chap_name
    next
  end
  fil.write "\\input #{chap_name} \n"
  chapFil="#{chap_name}.tex"
  `rm #{chapFil}`
  cFil=File.new(chapFil,"a")
  cFil.write "\\chapter{ #{chap_name_org} } \n"
  # we navigate to the sections now
  doc2=Hpricot(open(chap_link))
  sections= (doc2/"p/a.existingWikiWord")
  sections.each {|sc|
    sec_link = sc.attributes['href']
    sec_name = sc.inner_html.gsub(/\s*/,"")
    secFil="#{chap_name}/#{sec_name}.tex"
    `rm #{secFil}`
    sFil=File.new(secFil,"a")
    sechFil="#{chap_name}/#{sec_name}.html"
    `rm #{sechFil}`
    shFil=File.new(sechFil,"a")
    # scrape_the_page(sec_link,"#{chap_name}/#{sec_name}")
    scrape_the_page(sec_link,sFil,shFil)
    cFil.write "\\input #{chap_name}/#{sec_name} \n"
  }
}

fil.write "\\stoptext \n"
%----------------------------------------------------------------

The program calls the function defined in scrape_page.rb, which does most of the filtering.

Function: scrape_page.rb

def scrape_the_page(pagePath,oFile,hFile)

items_to_remove = [
  "#menus",         # menus notice
  "div.markedup",
  "div.navigation",
  "head",           # table of contents
  "hr"
]

doc=Hpricot(open(pagePath))

@article = (doc/"#container").each do |content|
  # remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }
end

# write the raw HTML content to file
hFile.write @article.inner_html

# now replace the various syntactic elements using Hpricot

# replace p/*/b elements with {\bf ...}
(@article/"p/*/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace p/b elements with {\bf ...}
(@article/"p/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace strong elements with {\bf ...}
(@article/"strong").each do |ps|
  ps.swap("{\\bf #{ps.inner_html}}")
end

# replace h1 elements with \section
(@article/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end

# replace h2 elements with \subsection
(@article/"h2").each do |h2|
  h2.swap("\\subsection{#{h2.inner_html}}")
end

# replace h3 elements with \subsubsection
(@article/"h3").each do |h3|
  h3.swap("\\subsubsection{#{h3.inner_html}}")
end

# replace h4 elements with \subsubsubsection
(@article/"h4").each do |h4|
  h4.swap("\\subsubsubsection{#{h4.inner_html}}")
end

# replace h5 elements with \subsubsubsubsection
(@article/"h5").each do |h5|
  h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
end

# replace <pre><code> blocks by the equivalent command in ConTeXt
(@article/"pre").each do |pre|
  pre.swap("\\startcode \n #{pre.at("code").inner_html} \n \\stopcode")
end
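# note: \startcode ... \stopcode is not a ConTeXt built-in; I assume it is
# defined in context_styles (which the generated annrep.tex inputs), e.g. with
# something like \definetyping[code], so the scraped code samples are typeset verbatim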

# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference

(@article/"a").each do |a|
  a.swap("\\in[#{a.inner_html}]")
end
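# for example, a link whose anchor text is "fig:sst" becomes \in[fig:sst];
# this assumes the wiki authors use the ConTeXt label of the figure or table
# as the link text, otherwise the reference will not resolve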


# remove the 'alt' attribute inside the <img> element and
# replace <p><img> by the equivalent command in ConTeXt
(@article/"p/img").each do |img|
  img_attrs=img.attributes['alt'].split(",")
  # separate the file name from the extension;
  # have to take care of file names that have a "." embedded in them
  img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
  # puts img_src
  # see if the position of the figure is indicated
  img_pos="force"
  img_attrs.each do |arr|
    img_pos=arr.gsub("position=","") if arr.match("position=")
  end
  img_attrs.delete("position=#{img_pos}") unless img_pos=="force"
  # see if the array img_attrs contains a referral key word
  if img_attrs.first.match(/\w+[=]\w+/)
    img_id=" "
  else
    img_id=img_attrs.first
    img_attrs.delete_at(0)
  end
  if img_pos=="force"
    if img.attributes['title']
      img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
    else
      img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} ")
    end
  else
    if img.attributes['title']
      img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
    else
      img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
    end
  end
end # end of converting inside (@article/"p/img")
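# to illustrate the convention assumed above (my own made-up example, not a
# real wiki page): an image carrying
#   src="/AnnRep07/Figures/sst.png" title="SST anomalies" alt="fig:sst,position=here,width=8cm"
# is rewritten roughly as
#   \placefigure[here][fig:sst]{SST anomalies}{\externalfigure[/AnnRep07/Figures/sst][width=8cm]}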

# why not search for table and, if we find a caption, keep it; if not, add an empty one
# Styling options: here I catch the div element called Col2 and
# format the tex document in 2 columns
# Tables: placing them

# replace <table> by the equivalent command in ConTeXt
(@article/"table").each do |tab|
  if tab.at("caption")
    tab.swap(" \\placetable[split]{#{tab.at("caption").inner_html}}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} ")
  else
    tab.swap(" \\placetable[split]{}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} \n ")
  end
end

# Tables: remove the caption
(@article/"caption").each do |cap|
  cap.swap("\n")
end

# now we transfer the syntactically altered html to a String object
# and manipulate that object further
newdoc=@article.inner_html

# remove empty space in the beginning
newdoc.gsub!(/^\s+/,"")

# remove or translate all the elements we don't need
newdoc.gsub!(/<p>/,"\n")
newdoc.gsub!(/<\/p>/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")
newdoc.gsub!("&",'\\\&')
newdoc.gsub!("$",'\\$')

# ConTeXt does not mind "_" in figure file names but does not recognize \_,
# so I have to catch these and replace \_ with _
# First catch
filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
if newdoc[filter]
  newdoc.gsub!(filter) { |fString|
    fString.gsub("\\_","_")
  }
end
# Second catch
filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
if newdoc[filter2]
  newdoc.gsub!(filter2) { |fString|
    fString.gsub("\\_","_")
  }
end
# Third catch; remove \_ inside []
filter3=/\[\w+\\_\w+\]/
if newdoc[filter3]
  newdoc.gsub!(filter3) { |fString|
    puts fString
    fString.gsub("\\_","_")
  }
end

# remove the comment markers, which we used to embed ConTeXt commands in the wiki
newdoc.gsub!(/<!\-\-/,"")
newdoc.gsub!(/\-\->/,"")

# add the full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")

# drop any remaining empty (self-closing) tags
newdoc.gsub!(/<\w+\s*\/>/,"")

#puts newdoc

# open file for output
#outfil="#{oFile}.tex"
#`rm #{outfil}`
#fil=File.new(outfil,"a")
#puts "Writing #{oFile}"
oFile.write newdoc

end

# imgProps={}
# img_attrs.each do |arr|
#   imgProps['width']=arr.gsub("width=","") if arr.match("width=")
#   imgProps['position']=arr.gsub("position=","") if arr.match("position=")
# end

% ---- End of scrape_page.rb --
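In case the long listings obscure it, the basic Hpricot pattern both scripts rely on is simply: fetch a page, pick out elements with a CSS selector, and swap each one for the ConTeXt markup you want in its place. A minimal sketch (the URL and selector below are made-up placeholders, not our wiki):

#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'

# fetch the page and parse it with Hpricot
doc = Hpricot(open("http://example.org/SomePage"))

# turn every h1 heading into a ConTeXt \section command
(doc/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end

# the document now carries ConTeXt markup where the h1 elements used to be
puts doc.inner_html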
Saji N. Hameed
APEC Climate Center                               +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705           saji@apcc21.net
KOREA