ntg-context - mailing list for ConTeXt users
* Ugly hack for multiple MSWord docs.
@ 2006-06-13 22:29 John R. Culleton
  2006-06-15 12:50 ` Hans Hagen
  2006-06-15 16:45 ` Bob Kerstetter
  0 siblings, 2 replies; 7+ messages in thread
From: John R. Culleton @ 2006-06-13 22:29 UTC


Frequently I find myself in the position of needing to combine
several MSWord and/or rtf documents into a single file for either
pdftex or Context. I have settled on this strategy. 

1. If necessary, I convert the documents to rtf with Open Office
Writer.
2. I convert the resulting rtf documents to LaTeX using rtf2latex2e.
3. I need to rename some of the LaTeX commands to their plain
TeX or Context equivalents, and simply ignore others. Instead of
editing each and every occurrence, I add the following to my
"macros.tex" file, which heads up the document:
----------------------------------------------------
\def\documentclass{}
\def\newcommand{}
\def\usepackage{}
\def\tab{}
\def\hspace{}
\def\begin{}
\def\end{}
\def\textbf#1{{\bf #1}}
\def\nobreakspace{~}
\def\underline{}
\def\newpage{}
\def\textmd#1{{\rm #1}}
\def\textit#1{{\it #1}}
\def\large{\tfb}
\def\reg{\rm\char174\ }
\def\textregistered{\reg}
------------------------------------------------------

I create a master file that calls in each of the .tex files
and compile the whole goulash. If I missed a LaTeX tag then I add
it to my \defs shown above and recompile until I get a
clean run. Now I have a readable pdf file and can start correcting
the format. 
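
A minimal sketch of generating such a master file, assuming a Context
run and hypothetical file names (ch1.tex, ch2.tex); the function name
build_master is an illustration only, not part of any tool mentioned
here:

```python
def build_master(parts, macros="macros.tex"):
    """Return the text of a master file that inputs each converted part.

    parts  -- names of the .tex files produced by rtf2latex2e (hypothetical)
    macros -- the file holding the nullifying \\def statements
    """
    lines = ["\\input " + macros, "\\starttext"]
    for part in parts:
        # one \input per converted document, in the order given
        lines.append("\\input " + part)
    lines.append("\\stoptext")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_master(["ch1.tex", "ch2.tex"]), end="")
```

The sketch only writes the skeleton; file names containing spaces
would need extra care with \input.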

The scattered LaTeX tags give me hints where centering etc. might
be needed even though the tags are inoperative in Context, thanks
to my nullifying \def statements shown above.  
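
The add-a-\def-and-recompile loop could probably be shortened by
listing the LaTeX tags that have no \def yet. A rough sketch, assuming
the macros use the simple "\def\name" form shown above; the function
name missing_defs and the "known" parameter are hypothetical:

```python
import re

CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")      # any \name in the body
DEFINITION = re.compile(r"\\def\\([A-Za-z]+)")   # \def\name in macros.tex


def missing_defs(macros_text, body_text, known=()):
    """List control words used in body_text that macros_text does not \\def.

    known -- names the engine already provides (assumption: the caller
             supplies these; the sketch knows nothing about TeX itself)
    """
    defined = set(DEFINITION.findall(macros_text)) | set(known)
    used = set(CONTROL_WORD.findall(body_text))
    return sorted(used - defined)


if __name__ == "__main__":
    macros = "\\def\\textbf#1{{\\bf #1}}\n\\def\\newpage{}\n"
    body = "\\textbf{Hi} \\newpage \\centering \\textsc{x}"
    print(missing_defs(macros, body))
```

Commands that plain TeX or Context already define will show up in the
list unless they are passed in via "known", so the output is a list of
candidates rather than a list of errors.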

Someday there will be an elegant solution to the MSWord to
Context problem. For now there is my ugly hack as described here.

-- 
John Culleton


* Re: Ugly hack for multiple MSWord docs.
  2006-06-13 22:29 Ugly hack for multiple MSWord docs John R. Culleton
@ 2006-06-15 12:50 ` Hans Hagen
  2006-06-15 18:35   ` John R. Culleton
  2006-06-15 16:45 ` Bob Kerstetter
  1 sibling, 1 reply; 7+ messages in thread
From: Hans Hagen @ 2006-06-15 12:50 UTC


John R. Culleton wrote:
> Someday there will be an elegant solution to the MSWord to
> Context problem. For now there is my ugly hack as described here.
>   
maybe the word xml output, since that can be parsed

Hans 

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------


* Re: Ugly hack for multiple MSWord docs.
  2006-06-13 22:29 Ugly hack for multiple MSWord docs John R. Culleton
  2006-06-15 12:50 ` Hans Hagen
@ 2006-06-15 16:45 ` Bob Kerstetter
  1 sibling, 0 replies; 7+ messages in thread
From: Bob Kerstetter @ 2006-06-15 16:45 UTC



On Jun 13, 2006, at 5:29 PM, John R. Culleton wrote:

> Frequently I find myself in the position of needing to combine
> several MSWord and/or rtf documents into a single file for either
> pdftex or Context. I have settled on this strategy.
>
>

> <snip>

> Someday there will be an elegant solution to the MSWord to
> Context problem. For now there is my ugly hack as described here.


MEMORY DISCLAIMER: None of the function names in these examples are
the real names in Word or VB for Word. The functions are available in
VB for Word, but it's been some time since I've done this, I don't
have the macros these days and don't really know the real names
anymore. So they are just representative of the functions available.

STYLE COMMENT: These methods should work even if styles are not being  
used. For example the primary heading may be Arial, 18pt, bold and  
not the Heading 1 style. That's okay because you can search for font  
attributes in Word. If the document is not consistent, well, convert  
to text and markup manually. :)



MORE OR LESS CURRENT EXAMPLE

It's not particularly elegant, but I used to convert from MSWord to  
whatever by writing VB find/replace macros based on styles and  
formatting. In newer versions of Word (at least on OS X), Replace has  
a function that includes what you found, plus you can add other text.

Example:

Find: <Heading 1>        %find stuff formatted with heading 1 style

Replace: \subject{WhatItFound}       %replaces what it found and wraps \subject{} around it.


Because Word stores its formatting in the line feed/carriage return,  
for paragraph styles you end up with something like this:

\subject{Some TeX
}

So my last VB find/replace removes the carriage returns globally:

Find: ^p}
Replace: }
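
Outside Word, the same cleanup can be sketched on the saved text. This
assumes the paragraph mark ends up in the text file as CR, LF or CRLF;
the function name close_braces is hypothetical:

```python
import re


def close_braces(text):
    """Pull a closing brace that landed on its own line back up.

    Mirrors the 'Find: ^p}  Replace: }' step on plain text.
    """
    # match CRLF, a lone CR, or a lone LF directly before "}"
    return re.sub(r"(?:\r\n|\r|\n)\}", "}", text)


if __name__ == "__main__":
    print(close_braces("\\subject{Some TeX\n}\n"), end="")
```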


When done with all find/replace functions, save as text.

That's it.


Not being much of a script writer, I record the first find/replace,  
then edit the macro and duplicate the find/replace as needed.

The VB find/replace function has options for starting at the top of  
the file, replacing globally, continuing if nothing is found and that  
sort of thing.

The macro looks something like this:

Find: <Heading 1>        %find stuff formatted with heading 1 style
Replace: \subject{WhatItFound}       %replaces what it found and wraps \subject{} around it.

Find: <Heading 2>        %find stuff formatted with heading 2 style
Replace: \subsubject{WhatItFound}       %replaces what it found and wraps \subsubject{} around it.

Find: <Heading 3>        %find stuff formatted with heading 3 style
Replace: \subsubsubject{WhatItFound}       %replaces what it found and wraps \subsubsubject{} around it.


The above method uses global replacement and it's pretty zippy, for  
Word.



ANOTHER OLDER METHOD

Another method I used before Find/Replace had the <WhatItFound>  
function was to put the found string into a variable, then use that  
variable for the replacement text, plus any TeX control sequences  
wrapped around it.

In summary:

1. Put your finds and replaces in an array:
ArrayFind(0) Heading 1; ArrayReplace(0) \subject{
ArrayFind(1) Heading 2; ArrayReplace(1) \subsubject{
ArrayFind(2) Heading 3; ArrayReplace(2) \subsubsubject{
Note the closing } is missing. It is hardcoded in the replacement code.

2. Find the first array item starting from the top of the document.  
This highlights the text in Word:
Find = $ArrayFind(n)

3. Put the highlighted text into a variable. Maybe you can even strip
the CR's from formatted paragraphs:
$FoundThisStuff = stripCarriageReturns(CurrentSelection)


4. Put the variable and the first replace item in the Word Replace  
function. Note the hard coded closing bracket. And the CR assuming  
you stripped the CR in step 3:
Replace = $ArrayReplace(n)+$FoundThisStuff+"}"+CR

5. Repeatedly use Replace and Find Next until nothing else is found.
Replace and Find Next
.
.
.

6. Repeatedly find the next array item to the end of the array.
n = n + 1
Find = $ArrayFind(n)
.
.
.

7. Save the file as text.
FilesSaveAs using the text option
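
The steps above can be sketched outside Word as follows. This is not
VB and uses no Word functions: the document is stood in for by a list
of (style, text) pairs, and all names (ARRAY_FIND, ARRAY_REPLACE,
convert) are hypothetical:

```python
# Step 1: finds and replaces kept in parallel arrays.
# The closing } is left out here and hard-coded below, as in the summary.
ARRAY_FIND = ["Heading 1", "Heading 2", "Heading 3"]
ARRAY_REPLACE = ["\\subject{", "\\subsubject{", "\\subsubsubject{"]


def convert(paragraphs):
    """paragraphs: list of (style_name, text) pairs standing in for Word."""
    # Step 3: strip the carriage returns from every paragraph up front.
    result = [text.rstrip("\r\n") for _, text in paragraphs]
    for n, style in enumerate(ARRAY_FIND):                 # step 6
        for i, (para_style, _) in enumerate(paragraphs):   # steps 2 and 5
            if para_style == style:
                # Step 4: prefix + found text + hard-coded closing brace.
                result[i] = ARRAY_REPLACE[n] + result[i] + "}"
    # Step 7: one paragraph per line, ready to be saved as text.
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    doc = [("Heading 1", "Intro\r"), ("Normal", "Body text\r")]
    print(convert(doc), end="")
```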


Hum. After thinking about this and typing it in, maybe I should still  
use the OLD method. It appears to be a little easier to manage. Maybe  
a lot easier.
Oh well, not a real programmer.


* Re: Ugly hack for multiple MSWord docs.
  2006-06-15 18:35   ` John R. Culleton
@ 2006-06-15 17:55     ` Hans Hagen
  2006-06-15 22:46       ` John R. Culleton
  0 siblings, 1 reply; 7+ messages in thread
From: Hans Hagen @ 2006-06-15 17:55 UTC


John R. Culleton wrote:
> On Thursday 15 June 2006 08:50, Hans Hagen wrote:
>   
>> John R. Culleton wrote:
>>     
>>> Someday there will be an elegant solution to the MSWord to
>>> Context problem. For now there is my ugly hack as described here.
>>>       
>> maybe the word xml output, since that can be parsed
>>
>> Hans
>>     
> Interesting suggestion. I don't have a copy of MSWord. And my
> clients are naive so that asking them to save in exotic formats
> is likely to be unproductive. 
>
> Open Office does not save as xml. Abiword, however does. In a
>   
hm, open office uses xml as storage format, just save in oo format and 
unzip the file and you will end up with xml files

(however, the xml is typical office xml, complete with tab elements that 
spoil the idea)
> One question: How do I mix in the necessary Context commands such
> as papersize, font selection etc.? What are the rules and no-nos
> for blending Context commands into an xml document?
>   
just set up a style 

Hans 



* Re: Ugly hack for multiple MSWord docs.
  2006-06-15 12:50 ` Hans Hagen
@ 2006-06-15 18:35   ` John R. Culleton
  2006-06-15 17:55     ` Hans Hagen
  0 siblings, 1 reply; 7+ messages in thread
From: John R. Culleton @ 2006-06-15 18:35 UTC


On Thursday 15 June 2006 08:50, Hans Hagen wrote:
> John R. Culleton wrote:
> > Someday there will be an elegant solution to the MSWord to
> > Context problem. For now there is my ugly hack as described here.
>
> maybe the word xml output, since that can be parsed
>
> Hans
Interesting suggestion. I don't have a copy of MSWord. And my
clients are naive so that asking them to save in exotic formats
is likely to be unproductive. 

Open Office does not save as xml. Abiword, however, does. In a
simplistic test case (Now is the time for all good men.)
Abiword saved the document as xml with a little coaxing and
texexec compiled it clean. So at least there is something there
to experiment with. 

Next I will try a real MSWord document, save it as xml from
Abiword, and see what Context does with it.

One question: How do I mix in the necessary Context commands such
as papersize, font selection etc.? What are the rules and no-nos
for blending Context commands into an xml document?

-- 
John Culleton
Books with answers to marketing and publishing questions:
http://wexfordpress.com/tex/shortlist.pdf

Book coaches, consultants and packagers:
http://wexfordpress.com/tex/packagers.pdf


* Re: Ugly hack for multiple MSWord docs.
  2006-06-15 17:55     ` Hans Hagen
@ 2006-06-15 22:46       ` John R. Culleton
  2006-06-19  7:44         ` luigi scarso
  0 siblings, 1 reply; 7+ messages in thread
From: John R. Culleton @ 2006-06-15 22:46 UTC


On Thursday 15 June 2006 13:55, Hans Hagen wrote:
> John R. Culleton wrote:
> > On Thursday 15 June 2006 08:50, Hans Hagen wrote:
> >> John R. Culleton wrote:
> >>> Someday there will be an elegant solution to the MSWord to
> >>> Context problem. For now there is my ugly hack as described here.
> >>
> >> maybe the word xml output, since that can be parsed
> >>
> >> Hans
> >
> > Interesting suggestion. I don't have a copy of MSWord. And my
> > clients are naive so that asking them to save in exotic formats
> > is likely to be unproductive.
> >
> > Open Office does not save as xml. Abiword, however does. In a
>
> hm, open offices uses xml as storage format, just save in oo format and
> unzip the file and you will end up with xml files
>
> (however, the xml is typical office xml, complete with tab elements that
> spoil the idea)

The Abiword xml is neat and parsimonious:

------------------------------------------------------------------

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<book>
<!-- ================================================================================ -->
<!-- This DocBook file was created by AbiWord. -->
<!-- AbiWord is a free, Open Source word processor. -->
<!-- You may obtain more information about AbiWord at www.abisource.com -->
<!-- ================================================================================ -->


	<chapter>
		<title></title>
		<section role="unnumbered">
			<title></title>
			<para>Now is the time for all good men.</para>
		</section>
	</chapter>
</book>
------------------------------------------------
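
Because the markup is this small, it can also be walked by a short
script. The following is only a sketch of that idea, not what texexec
does with the file: it assumes the chapter/section/para layout shown
above, maps role="unnumbered" to \subject as a guess, skips empty
titles, ignores paragraphs outside a section, and does no escaping of
TeX special characters. The function name docbook_to_context is
hypothetical:

```python
import xml.etree.ElementTree as ET


def docbook_to_context(xml_text):
    """Turn a small DocBook tree into Context source text."""
    root = ET.fromstring(xml_text)
    out = ["\\starttext"]
    for chapter in root.iter("chapter"):
        title = (chapter.findtext("title") or "").strip()
        if title:
            out.append("\\chapter{%s}" % title)
        for section in chapter.iter("section"):
            stitle = (section.findtext("title") or "").strip()
            # assumption: unnumbered sections become \subject
            unnumbered = section.get("role") == "unnumbered"
            head = "\\subject" if unnumbered else "\\section"
            if stitle:
                out.append("%s{%s}" % (head, stitle))
            for para in section.iter("para"):
                # paragraph text followed by a blank line
                out.append("".join(para.itertext()).strip())
                out.append("")
    out.append("\\stoptext")
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "<book><chapter><title></title>"
        '<section role="unnumbered"><title></title>'
        "<para>Now is the time for all good men.</para>"
        "</section></chapter></book>"
    )
    print(docbook_to_context(sample))
```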

The Open Office file, once unzipped, is a lot more verbose and a lot
less readable. There are five files in the archive. The file
content.xml does compile correctly via texexec and yields the expected
result. The character count in that file alone is three times
that of the corresponding Abiword xml output shown above.
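
For reference, the unzip step can be scripted. A minimal sketch using
Python's standard zipfile module; the function name read_content_xml
and the file name in the comment are hypothetical, and UTF-8 encoding
of content.xml is an assumption:

```python
import zipfile


def read_content_xml(document):
    """Return (member names, content.xml text) for an Open Office file.

    document -- a path or a file-like object holding the zipped document
    """
    with zipfile.ZipFile(document) as archive:
        names = archive.namelist()
        # assumption: content.xml is stored as UTF-8
        text = archive.read("content.xml").decode("utf-8")
    return names, text


# Usage, with a hypothetical file name:
#   names, xml_text = read_content_xml("chapter1.odt")
#   print(names)          # the files inside the archive
#   print(len(xml_text))  # character count of content.xml
```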

The experiments continue...
-- 
John Culleton


* Re: Ugly hack for multiple MSWord docs.
  2006-06-15 22:46       ` John R. Culleton
@ 2006-06-19  7:44         ` luigi scarso
  0 siblings, 0 replies; 7+ messages in thread
From: luigi scarso @ 2006-06-19  7:44 UTC


It's also true that
"On 8 May 2006, the International Organization for Standardization
(ISO) and the International
Electrotechnical Commission (IEC) approved the OpenDocument Format
(ODF) for release as ISO/IEC 26300"
ODF could become an important xml format in the coming years.

luigi


