* Html to Context using Wiki + hpricot
@ 2007-07-11 7:33 Saji Njarackalazhikam Hameed
2007-07-11 8:27 ` luigi scarso
0 siblings, 1 reply; 5+ messages in thread
From: Saji Njarackalazhikam Hameed @ 2007-07-11 7:33 UTC (permalink / raw)
To: ntg-context
Hello All,
I wanted to share my recent experience in coordinated document development.
In our office we have to make annual reports, each part of which is contributed
by a different member. Previously everybody wrote 'Word' documents, which were
compiled into a larger report.
Recently we had the idea to use a Wiki to take the pain out of this process
and to make it enjoyable for everyone involved. After looking around at various
wiki software we decided to install a brand-new one called
Informl (http://informl.folklogic.net). One nice feature of it is that the
edit mode uses twin windows, one for input and the other a realtime
preview window. In any case, this approach could work with any Wiki software.
One motivation behind using a Wiki as front-end was to involve people who knew
nothing about TeX or ConTeXt. Secondly, it allows any person to participate in
the process from anywhere.
Next we used the Ruby library hpricot to retrieve the web document and
filter it into a ConTeXt document. This step was interesting and I
would like to share the code with anybody interested. I am a novice Ruby
programmer, so the code may be far from perfect .. nevertheless.
saji
%----------------------------------------------------------------
scan_page.rb = Retrieves the html page of interest from the server,
navigates to links within the main page and constructs a
ConTeXt document
#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'scrape_page'
# scans the home page and lists
# all the directories and subdirectories
doc=Hpricot(open("http://190.1.1.24:3010/AnnRep07"))
mainfil="annrep.tex"
`rm #{mainfil}`
fil=File.new(mainfil,"a")
fil.write "\\input context_styles \n"
fil.write "\\starttext \n"
fil.write "\\leftaligned{\\BigFontOne Contents} \n"
fil.write "\\vfill \n"
fil.write "{ \\switchtobodyfont[10pt] "
fil.write "\\startcolumns[n=2,balance=no,rule=off,option=background,frame=off,background=color,backgroundcolor=blue:1] \n"
fil.write "\\placecontent \n"
fil.write "\\stopcolumns \n"
fil.write "}"
chapters= (doc/"p/a.existingWikiWord")
# we need to navigate one more level into the web page
# let us discover the links for that
chapters.each {|ch|
chap_link = ch.attributes['href']
# using inner_html we can create subdirectories
chap_name = ch.inner_html.gsub(/\s*/,"")
chap_name_org = ch.inner_html
puts chap_name
# if chapter name starts with underscore (_) skip it,
# before any directory is created for it
next if chap_name.match(/^_/)
# We create chapter directories
system("mkdir -p #{chap_name}")
fil.write "\\input #{chap_name} \n"
chapFil="#{chap_name}.tex"
`rm #{chapFil}`
cFil=File.new(chapFil,"a")
cFil.write "\\chapter{ #{chap_name_org} } \n"
# We navigate to sections now
doc2=Hpricot(open(chap_link))
sections= (doc2/"p/a.existingWikiWord")
sections.each {|sc|
sec_link = sc.attributes['href']
sec_name = sc.inner_html.gsub(/\s*/,"")
secFil="#{chap_name}/#{sec_name}.tex"
`rm #{secFil}`
sFil=File.new(secFil,"a")
sechFil="#{chap_name}/#{sec_name}.html"
`rm #{sechFil}`
shFil=File.new(sechFil,"a")
# scrape_the_page(sec_link,"#{chap_name}/#{sec_name}")
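scrape_the_page(sec_link,sFil,shFil)
# close the section files so their contents are flushed to disk
sFil.close
shFil.close
cFil.write "\\input #{chap_name}/#{sec_name} \n"
}
cFil.close
}
fil.write "\\stoptext \n"
fil.close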
%----------------------------------------------------------------
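The way scan_page.rb derives file names from wiki link labels and skips
pages whose name starts with an underscore can be sketched in isolation.
This is only an illustration: the helper names chapter_file_name and
skip_chapter? and the sample labels are made up here and do not appear in
the scripts.

```ruby
# Hypothetical helpers, not part of scan_page.rb.

# Remove all whitespace from a wiki link label so it can serve as a
# directory name and as the base name of a .tex file.
def chapter_file_name(label)
  label.gsub(/\s+/, "")
end

# Pages whose name starts with an underscore are left out of the report.
def skip_chapter?(label)
  chapter_file_name(label).start_with?("_")
end

# Made-up labels, standing in for the inner_html of the wiki links.
labels = ["Climate Prediction", "_Drafts", "Research Highlights"]
labels.each do |label|
  next if skip_chapter?(label)
  puts "\\input #{chapter_file_name(label)}"
end
```

Checking the underscore rule before any directory or file is created avoids
leaving empty directories behind for skipped pages.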
The program calls scrape_the_page, a function defined in scrape_page.rb
that does most of the filtering
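The string-level part of this filtering, where HTML tags are replaced by
ConTeXt commands, could also be written with a lookup table instead of one
gsub! call per tag. The following is a minimal sketch under that assumption;
TAG_TO_CONTEXT and convert_tags are hypothetical names, and only a few of
the tag pairs handled in scrape_page.rb are included.

```ruby
# Hypothetical lookup table; the tag/command pairs mirror some of the
# substitutions performed in scrape_page.rb.
TAG_TO_CONTEXT = {
  "<ul>"  => "\\startitemize[1]",
  "</ul>" => "\\stopitemize",
  "<ol>"  => "\\startitemize[n]",
  "</ol>" => "\\stopitemize",
  "<li>"  => "\\item ",
  "</li>" => "\n",
  "<em>"  => "{\\em ",
  "</em>" => "}"
}.freeze

# Replace every known tag in one pass over the string.
def convert_tags(html)
  pattern = Regexp.union(TAG_TO_CONTEXT.keys)
  # The block form of gsub inserts its result literally, so the
  # backslashes of the ConTeXt commands need no further escaping.
  html.gsub(pattern) { |tag| TAG_TO_CONTEXT[tag] }
end

puts convert_tags("<ul><li>first <em>item</em></li></ul>")
```

A table like this keeps the tag list in one place, at the cost of handling
only exact tag strings (tags carrying attributes would not match).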
Function: scrape_page.rb
def scrape_the_page(pagePath,oFile,hFile)
items_to_remove = [
"#menus", #menus notice
"div.markedup",
"div.navigation",
"head", #table of contents
"hr"
]
doc=Hpricot(open(pagePath))
@article = (doc/"#container").each do |content|
#remove unnecessary content and edit links
items_to_remove.each { |x| (content/x).remove }
end
# Write HTML content to file
hFile.write @article.inner_html
# How to replace various syntactic elements using Hpricot
# replace p/*/b element (b nested one level deeper inside p) with /bf
(@article/"p/*/b").each do |pb|
pb.swap("{\\bf #{pb.inner_html}}")
end
# replace p/b element with /bf
(@article/"p/b").each do |pb|
pb.swap("{\\bf #{pb.inner_html}}")
end
# replace strong element with /bf
(@article/"strong").each do |ps|
ps.swap("{\\bf #{ps.inner_html}}")
end
# replace h1 element with section
(@article/"h1").each do |h1|
h1.swap("\\section{#{h1.inner_html}}")
end
# replace h2 element with subsection
(@article/"h2").each do |h2|
h2.swap("\\subsection{#{h2.inner_html}}")
end
# replace h3 element with subsubsection
(@article/"h3").each do |h3|
h3.swap("\\subsubsection{#{h3.inner_html}}")
end
# replace h4 element with subsubsubsection
(@article/"h4").each do |h4|
h4.swap("\\subsubsubsection{#{h4.inner_html}}")
end
# replace h5 element with subsubsubsubsection
(@article/"h5").each do |h5|
h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
end
# replace <pre><code> by equivalent command in context
(@article/"pre").each do |pre|
pre.swap("\\startcode \n #{pre.at("code").inner_html} \n \\stopcode")
end
# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference
(@article/"a").each do |a|
a.swap("\\in[#{a.inner_html}]")
end
# read the comma-separated options stored in the 'alt' attribute of <img>
# replace <p><img> by equivalent command in context
(@article/"p/img").each do |img|
img_attrs=img.attributes['alt'].split(",")
# separate the file name from the extension
# have to take care of file names that have a "." embedded in them
img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
# puts img_src
# see if position of figure is indicated
img_pos="force"
img_attrs.each do |arr|
img_pos=arr.gsub("position=","") if arr.match("position=")
end
img_attrs.delete("position=#{img_pos}") unless img_pos=="force"
# see if the array img_attrs contains a reference keyword
if img_attrs.first.match(/\w+[=]\w+/)
img_id=" "
else
img_id=img_attrs.first
img_attrs.delete_at(0)
end
if img_pos=="force"
if img.attributes['title']
img.swap("
\\placefigure\n
[#{img_pos}][#{img_id}] \n
{#{img.attributes['title']}} \n
{\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n
")
else
img.swap("
\\placefigure\n
[#{img_pos}] \n
{none} \n
{\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
")
end
else
if img.attributes['title']
img.swap("
\\placefigure\n
[#{img_pos}][#{img_id}] \n
{#{img.attributes['title']}} \n
{\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n
")
else
img.swap("
\\placefigure\n
[#{img_pos}] \n
{none} \n
{\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
\n
")
end
end
end # end of converting inside (@article/"p/img")
# Tables: search for each table; if it has a caption, keep it; if not, add an empty one
# Styling options: Here I catch the div element called Col2 and
# format the tex document in 2 columns
# Tables : placing them
# replace <table> by equivalent command in context
(@article/"table").each do |tab|
if tab.at("caption")
tab.swap("
\\placetable[split]{#{tab.at("caption").inner_html}}\n
{\\bTABLE \n
#{tab.inner_html}
\\eTABLE}
")
else
tab.swap("
\\placetable[split]{}\n
{\\bTABLE \n
#{tab.inner_html}
\\eTABLE} \n
")
end
end
# Tables: remove the caption
(@article/"caption").each do |cap|
cap.swap("\n")
end
# Now we transfer the syntactically altered html to a string Object
# and manipulate that object further
newdoc=@article.inner_html
# remove empty space in the beginning
newdoc.gsub!(/^\s+/,"")
# remove all elements we don't need.
newdoc.gsub!(/^<div.*/,"")
newdoc.gsub!(/^<\/div.*/,"")
newdoc.gsub!(/^<form.*/,"")
newdoc.gsub!(/^<\/form.*/,"")
newdoc.gsub!(/<p>/,"\n")
newdoc.gsub!(/<\/p>/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")
newdoc.gsub!("&",'\\\&')
newdoc.gsub!("$",'\\$')
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")
# ConTeXt does not mind "_" in figure file names and does not recognize \_ there,
# so I have to catch these and replace \_ with _
# First catch
filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
if newdoc[filter]
newdoc.gsub!(filter) { |fString|
fString.gsub("\\_","_")
}
end
# Second catch
filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
if newdoc[filter2]
newdoc.gsub!(filter2) { |fString|
fString.gsub("\\_","_") }
end
# Third catch; remove \_ inside []
filter3=/\[\w+\\_\w+\]/
if newdoc[filter3]
newdoc.gsub!(filter3) { |fString|
puts fString
fString.gsub("\\_","_") }
end
# remove the comment tag, which we used to embed context commands
newdoc.gsub!("<!--","")
newdoc.gsub!("-->","")
# add full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")
newdoc.gsub!(/<\w+\s*\/>/,"")
#puts newdoc
# open file for output
#outfil="#{oFile}.tex"
#`rm #{outfil}`
#fil=File.new(outfil,"a")
#puts "Writing #{oFile}"
oFile.write newdoc
end
# imgProps={}
# img_attrs.each do |arr|
# imgProps['width']=arr.gsub("width=","") if arr.match("width=")
# imgProps['position']=arr.gsub("position=","") if arr.match("position=")
# end
% ---- End of scrape_page.rb
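The commented-out imgProps lines above suggest collecting the
comma-separated entries of the alt attribute into named properties. A
minimal sketch of that idea follows. The function name parse_img_alt, the
hash keys and the sample alt string are assumptions made for illustration.
Unlike scrape_the_page, which only looks at the first entry for a reference
key, this sketch treats any entry without "=" as the reference.

```ruby
# Hypothetical helper: splits an alt string such as
# "fig:sst,position=here,width=8cm" into position, reference id and
# the remaining options.
def parse_img_alt(alt)
  props = { "position" => "force", "id" => nil, "options" => [] }
  # to_s makes a missing alt attribute (nil) behave like an empty string
  alt.to_s.split(",").map(&:strip).each do |part|
    key, value = part.split("=", 2)
    if value.nil?
      props["id"] ||= key        # bare word: taken as the figure reference
    elsif key == "position"
      props["position"] = value  # overrides the default "force"
    else
      props["options"] << part   # kept as-is for \externalfigure[...]
    end
  end
  props
end

p parse_img_alt("fig:sst,position=here,width=8cm")
```

Returning a hash keeps the default position, the reference and the
remaining options together, so the code that builds \placefigure does not
need to delete entries from the array while reading it.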
--
Saji N. Hameed
APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 saji@apcc21.net
KOREA
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!
maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage : http://www.pragma-ade.nl / http://tex.aanhet.net
archive : https://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___________________________________________________________________________________
* Re: Html to Context using Wiki + hpricot
2007-07-11 7:33 Html to Context using Wiki + hpricot Saji Njarackalazhikam Hameed
@ 2007-07-11 8:27 ` luigi scarso
2007-07-11 9:10 ` Saji Njarackalazhikam Hameed
2007-07-16 8:27 ` Saji Njarackalazhikam Hameed
0 siblings, 2 replies; 5+ messages in thread
From: luigi scarso @ 2007-07-11 8:27 UTC (permalink / raw)
To: mailing list for ConTeXt users
On 7/11/07, Saji Njarackalazhikam Hameed <saji@apcc21.net> wrote:
> Hello All,
>
> I wanted to share my recent experience
Really interesting.
Please put all of this on
wiki.contextgarden.net
--
luigi
* Re: Html to Context using Wiki + hpricot
2007-07-11 8:27 ` luigi scarso
@ 2007-07-11 9:10 ` Saji Njarackalazhikam Hameed
2007-07-11 13:08 ` luigi scarso
2007-07-16 8:27 ` Saji Njarackalazhikam Hameed
1 sibling, 1 reply; 5+ messages in thread
From: Saji Njarackalazhikam Hameed @ 2007-07-11 9:10 UTC (permalink / raw)
To: mailing list for ConTeXt users
Thanks, Luigi... I will do so. Would it be appropriate to put it under
the section "General ConTeXt Documents"? If not, please let me know
where would be a good place to add this
article.
saji
..
* luigi scarso <luigi.scarso@gmail.com> [2007-07-11 10:27:12 +0200]:
> On 7/11/07, Saji Njarackalazhikam Hameed <saji@apcc21.net> wrote:
> > Hello All,
> >
> > I wanted to share my recent experience
> Really interesting .
> Please, put all these on
> wiki.contextgarden.net
>
>
>
> --
> luigi
--
Saji N. Hameed
APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 saji@apcc21.net
KOREA
* Re: Html to Context using Wiki + hpricot
2007-07-11 9:10 ` Saji Njarackalazhikam Hameed
@ 2007-07-11 13:08 ` luigi scarso
0 siblings, 0 replies; 5+ messages in thread
From: luigi scarso @ 2007-07-11 13:08 UTC (permalink / raw)
To: mailing list for ConTeXt users
>Would it be appropriate to put it under
> the section "General ConTeXt Documents"?
Yes
(You can always move it to another category later.)
--
luigi
* Re: Html to Context using Wiki + hpricot
2007-07-11 8:27 ` luigi scarso
2007-07-11 9:10 ` Saji Njarackalazhikam Hameed
@ 2007-07-16 8:27 ` Saji Njarackalazhikam Hameed
1 sibling, 0 replies; 5+ messages in thread
From: Saji Njarackalazhikam Hameed @ 2007-07-16 8:27 UTC (permalink / raw)
To: mailing list for ConTeXt users
Dear All,
I have added an article to the ConTeXt wiki on using a Wiki as a
collaborative medium for making ConTeXt documents. It is a preliminary
version. I will continue to polish it in my spare time. Meanwhile,
comments/suggestions are welcome.
http://wiki.contextgarden.net/HTML_and_ConTeXt
saji
..
* luigi scarso <luigi.scarso@gmail.com> [2007-07-11 10:27:12 +0200]:
> On 7/11/07, Saji Njarackalazhikam Hameed <saji@apcc21.net> wrote:
> > Hello All,
> >
> > I wanted to share my recent experience
> Really interesting .
> Please, put all these on
> wiki.contextgarden.net
>
>
>
> --
> luigi
--
Saji N. Hameed
APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 saji@apcc21.net
KOREA