TexPaste alpha - my Win application converting Word/HTML to TeX

ntg-context - mailing list for ConTeXt users

* TexPaste alpha - my Win application converting Word/HTML to TeX
       [not found] <mailman.1002.1243443853.3589.ntg-context@ntg.nl>
@ 2009-05-27 22:05 ` Vyatcheslav Yatskovsky
  2009-05-28  7:39   ` Henning Hraban Ramm
  0 siblings, 1 reply; 7+ messages in thread
From: Vyatcheslav Yatskovsky @ 2009-05-27 22:05 UTC (permalink / raw)
  To: ntg-context

Hello,

I'm glad to report that I have made a simple application (sorry, Windows
only at the moment) that converts text from MS Word (or other
editors) or HTML pages (web sites) into TeX.

DOWNLOAD LINK (280 KB):
http://ul.to/hmpy60

At the moment the app recognizes only the following formats/tags:
Bold (<b>), Italic (<i>), Header 1 (<h1>), Header 2 (<h2>), Header 3 (<h3>).
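
A minimal sketch of this kind of tag mapping, in Python. TexPaste's actual output is not shown in this message, so the ConTeXt commands chosen here and the names TAG_TO_CONTEXT and tags_to_context are assumptions made only for illustration.

```python
import re

# Hypothetical mapping from the five recognized tags to ConTeXt templates.
# The target commands are an assumption, not TexPaste's documented output.
TAG_TO_CONTEXT = {
    "b": "{\\bf %s}",
    "i": "{\\it %s}",
    "h1": "\\chapter{%s}",
    "h2": "\\section{%s}",
    "h3": "\\subsection{%s}",
}

_TAG_RE = re.compile(r"<(b|i|h[1-3])>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def tags_to_context(html):
    """Replace <b>, <i>, <h1>, <h2> and <h3> elements with ConTeXt markup."""
    def repl(match):
        inner = tags_to_context(match.group(2))  # convert nested tags first
        return TAG_TO_CONTEXT[match.group(1).lower()] % inner
    return _TAG_RE.sub(repl, html)


print(tags_to_context("<h1>Title</h1> <b>bold <i>both</i></b>"))
```

Like any regular-expression approach, this does not handle a tag nested inside the same tag, and tags with attributes are left untouched.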

It converts NO-BREAK SPACE (U+00A0) into ~, &nbsp; into \enskip, &quot; into ",
&amp; into \&, and &lt; &gt; into < >.
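
The replacements listed above could be sketched as an ordered table, in Python. The names ENTITY_TO_TEX and entities_to_tex are hypothetical, and the space after \enskip is an assumption added so that the control word is terminated before the following text.

```python
# Ordered (source, replacement) pairs taken from the list above.
# "&amp;" is handled last so that input such as "&amp;lt;" is not decoded twice.
ENTITY_TO_TEX = [
    ("\u00a0", "~"),           # NO-BREAK SPACE character
    ("&nbsp;", "\\enskip "),   # trailing space is an assumption
    ("&quot;", '"'),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&amp;", "\\&"),
]


def entities_to_tex(text):
    """Apply the replacements one after another, in table order."""
    for old, new in ENTITY_TO_TEX:
        text = text.replace(old, new)
    return text


print(entities_to_tex("a\u00a0b &amp; &lt;c&gt; &quot;d&quot;&nbsp;e"))
```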

It is UTF-8 ready.

USAGE: copy the desired text fragment from Word or a web page to the clipboard,
click the big "Get..." button, and see the result in the bottom field.
Click "Copy Result" to put the TeX-formatted text back on the clipboard,
and paste it into your editor.

KNOWN ISSUES: Some junk from Word formatting, such as stray tags, tends
to leak through, but at the moment it is easier to delete it manually. And
sorry... the interface is awful.

This is the very first alpha; I want to show it just as a proof of concept
and to get some feedback. Actually, I wrote it for myself to simplify
conversion from Word to TeX. I have some documents to convert
(e.g., lecture notes), and this turns out to be an easy task with my tool :).

Best,
Vyatcheslav
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________


* Re: TexPaste alpha - my Win application converting Word/HTML to TeX
  2009-05-27 22:05 ` TexPaste alpha - my Win application converting Word/HTML to TeX Vyatcheslav Yatskovsky
@ 2009-05-28  7:39   ` Henning Hraban Ramm
  2009-05-28  7:45     ` luigi scarso
  0 siblings, 1 reply; 7+ messages in thread
From: Henning Hraban Ramm @ 2009-05-28  7:39 UTC (permalink / raw)
  To: mailing list for ConTeXt users

On 2009-05-28 at 00:05, Vyatcheslav Yatskovsky wrote:

> I'm glad to report that I have made a simple application (sorry, Windows
> only at the moment) that converts text from MS Word (or other
> editors) or HTML pages (web sites) into TeX.
>
> At the moment the app recognizes only the following formats/tags:
> Bold (<b>), Italic (<i>), Header 1 (<h1>), Header 2 (<h2>), Header 3
> (<h3>).

Sorry for stealing your thread, but it's related...

I just found there's still a collection of my old (2002) Perl scripts at
http://www.fiee.net/texnique/material/fiee-perl.zip
It contains simple converters from HTML, LaTeX and XPress Tags to  
ConTeXt.

This one (2006):
http://www.fiee.net/texnique/material/mab2bib.zip
contains (besides a mab2bib bibliography converter) a simple Python
script that converts between arbitrary encodings. Just rename it from
"utf8_to_latex.py" to e.g. "latin1_to_utf8.py": if both parts of its
file name are encodings known to Python, it'll just work.
A "latex" encoding is included, so "latex_to_utf8.py" can convert cruft
like \c{C} to Ç.
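
The file-name trick can be sketched in a few lines of Python 3. This is an illustration only, not the script from the archive; the function names encodings_from_script_name and recode are hypothetical.

```python
import codecs
import os


def encodings_from_script_name(path):
    """Split a name like 'latin1_to_utf8.py' into (source, target) encodings."""
    stem = os.path.splitext(os.path.basename(path))[0]
    source_enc, sep, target_enc = stem.partition("_to_")
    if not sep:
        raise ValueError("script name must look like SOURCE_to_TARGET.py")
    for enc in (source_enc, target_enc):
        codecs.lookup(enc)  # raises LookupError if Python does not know it
    return source_enc, target_enc


def recode(data, source_enc, target_enc):
    """Decode bytes from the source encoding and re-encode them."""
    return data.decode(source_enc).encode(target_enc)


src, dst = encodings_from_script_name("latin1_to_utf8.py")
print(recode(b"\xc7", src, dst))  # the Latin-1 byte for Ç, re-encoded as UTF-8
```

A "latex" codec is not part of the Python standard library, so a name such as "latex_to_utf8.py" only passes the lookup once a module providing that codec has been imported.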

I guess I should build a new converter suite (there's also an InDesign
Tags to ConTeXt converter somewhere on my hard disk).
But I won't make GUI apps, just scripts.

Greetlings from Lake Constance!
Hraban
---
http://www.fiee.net/texnique/
http://wiki.contextgarden.net
https://www.cacert.org (I'm an assurer)


* Re: TexPaste alpha - my Win application converting Word/HTML to TeX
  2009-05-28  7:39   ` Henning Hraban Ramm
@ 2009-05-28  7:45     ` luigi scarso
  2009-05-28  9:37       ` Piotr Kopszak
  2009-05-29  8:14       ` converters (was: TexPaste alpha) Henning Hraban Ramm
  0 siblings, 2 replies; 7+ messages in thread
From: luigi scarso @ 2009-05-28  7:45 UTC (permalink / raw)
  To: mailing list for ConTeXt users



> I guess I should build a new converter suite (there's also an InDesign Tags
> to ConTeXt converter somewhere on my hard disk).
> But I won't make GUI apps, just scripts.
>
That sounds good!
If in Python, even better!
If only scripts, the best!

Can we have more details?

-- 
luigi


* Re: TexPaste alpha - my Win application converting Word/HTML to TeX
  2009-05-28  7:45     ` luigi scarso
@ 2009-05-28  9:37       ` Piotr Kopszak
  2009-06-08  8:27         ` J.A.J. Pater
  2009-05-29  8:14       ` converters (was: TexPaste alpha) Henning Hraban Ramm
  1 sibling, 1 reply; 7+ messages in thread
From: Piotr Kopszak @ 2009-05-28  9:37 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Hello list,

Inevitably, this is a recurring subject. Here are my 2p. After playing
with all sorts of converters to TeX, LaTeX and HTML, and scraping the
output with Perl to obtain something useful for ConTeXt, I found that
what I really need to preserve from a Word file are italics
and footnotes. To make a long story short: IMHO the only reasonable
way to go is via an XSL stylesheet for OpenOffice. Fortunately you don't
have to develop a new one from scratch, which would be quite a task.
There is an excellent stylesheet converting ODT to MediaWiki by
Bernhard Haumacher, odt2mediawiki.xsl. It took me less than an hour to
adapt it for ConTeXt output. Then you only need to add it as an XML
filter to OpenOffice, and from then on you can convert Word to ConTeXt
straight from OpenOffice as if it were one of its built-in export formats.
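
The adapted stylesheet itself was not posted, so the following is only a rough illustration of the idea (keep italics and footnotes, drop everything else), written in Python with the standard library instead of XSLT. It assumes the caller already knows which ODT text style names mean italic; the function name to_context and that parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of text elements in OpenDocument content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def q(tag):
    """Return the namespace-qualified name that ElementTree uses."""
    return "{%s}%s" % (TEXT_NS, tag)


def to_context(elem, italic_styles):
    """Flatten one ODT paragraph, keeping only italics and footnotes."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == q("note"):
            # Use only the note body; the citation number is dropped.
            body = child.find(q("note-body"))
            paragraphs = body if body is not None else []
            inner = " ".join(to_context(p, italic_styles) for p in paragraphs)
            parts.append("\\footnote{%s}" % inner)
        elif child.tag == q("span") and child.get(q("style-name")) in italic_styles:
            parts.append("{\\it %s}" % to_context(child, italic_styles))
        else:
            parts.append(to_context(child, italic_styles))
        parts.append(child.tail or "")
    return "".join(parts)


sample = (
    '<text:p xmlns:text="%s">See '
    '<text:span text:style-name="T1">Faust</text:span>'
    '<text:note><text:note-citation>1</text:note-citation>'
    '<text:note-body><text:p>Goethe.</text:p></text:note-body></text:note>'
    '.</text:p>' % TEXT_NS
)
print(to_context(ET.fromstring(sample), {"T1"}))
```

In a real document the italic style names would have to be read from the automatic styles section of content.xml, which this sketch leaves out.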

Piotr

2009/5/28 luigi scarso <luigi.scarso@gmail.com>:
>>
>> I guess I should build a new converter suite (there's also an InDesign Tags
>> to ConTeXt converter somewhere on my hard disk).
>> But I won't make GUI apps, just scripts.
>
> That sounds good!
> If in Python, even better!
> If only scripts, the best!
>
> Can we have more details?
>
> --
> luigi

-- 
http://okle.pl

* Re: converters (was: TexPaste alpha)
  2009-05-28  7:45     ` luigi scarso
  2009-05-28  9:37       ` Piotr Kopszak
@ 2009-05-29  8:14       ` Henning Hraban Ramm
  2009-05-29  8:18         ` luigi scarso
  1 sibling, 1 reply; 7+ messages in thread
From: Henning Hraban Ramm @ 2009-05-29  8:14 UTC (permalink / raw)
  To: mailing list for ConTeXt users


On 2009-05-28 at 09:45, luigi scarso wrote:

>> I guess I should build a new converter suite (there's also an
>> InDesign Tags to ConTeXt converter somewhere on my hard disk).
>> But I won't make GUI apps, just scripts.
> That sounds good!
> If in Python, even better!
> If only scripts, the best!
>
> Can we have more details?

Which conversion do you need?

If it's InDesign to ConTeXt, there's always custom programming needed
- e.g. you need to know which ID paragraph style should become which
ConTeXt section. (sample attached)

I'm not good at building parsers and mostly use regular-expression
replacements, so my converters are always limited, and manual cleanup
is necessary - but they save a lot of manual work anyway!
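
The per-project part, deciding which paragraph style becomes which section level, can be kept in a small table. A minimal Python 3 sketch follows; it assumes tagged-text lines of the form <pstyle:NAME>text, and the table STYLE_TO_SECTION and the function convert_paragraph are hypothetical names that would be adapted for each document.

```python
import re

# Hypothetical style table: InDesign paragraph style name -> ConTeXt command.
STYLE_TO_SECTION = {
    "Ü 1.": "chapter",
    "Ü 1.1": "section",
    "Ü 1.1.1": "subsection",
}

_PSTYLE_RE = re.compile(r"^<pstyle:([^>]*)>(.*)$")


def convert_paragraph(line):
    """Turn one tagged line into a ConTeXt heading or plain text."""
    match = _PSTYLE_RE.match(line)
    if not match:
        return line                      # no paragraph style tag at all
    style, text = match.groups()
    command = STYLE_TO_SECTION.get(style)
    if command is None:
        return text                      # unknown style: keep the text only
    return "\\%s{%s}" % (command, text)


print(convert_paragraph("<pstyle:Ü 1.1>Introduction"))
```

Keeping the table separate from the conversion logic means only the dictionary has to change from one InDesign document to the next.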


Greetlings from Lake Constance!
Hraban
---
http://www.fiee.net/texnique/
http://wiki.contextgarden.net
https://www.cacert.org (I'm an assurer)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: latin1_to_utf8.py --]
[-- Type: text/x-python-script; x-unix-mode=0755; x-mac-type=54455854; name="latin1_to_utf8.py", Size: 3874 bytes --]

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Universal text re-encoding
2009-03-10 by Henning Hraban Ramm, fiëe virtuëlle

sourceencoding_to_targetencoding.py [options] sourcefile [targetfile]

Whole directories can be processed as well.

Options:
--filter=extension
--overwrite          (otherwise the original file is backed up)
--hidden             (otherwise hidden files are ignored)
"""

import getopt
import os
import shutil
import sys
try:
    import latex  # optional module that registers a "latex" codec
except ImportError:
    pass

modes = ('filter', 'overwrite', 'hidden')
mode = {}

def help(message=""):
    """Print a message and the usage text, then exit."""
    print(message)
    print(__doc__)
    sys.exit(1)

def backup(datei):
    """Copy the file to the first free name.N.ext and return that path."""
    original = datei
    pfad, datei = os.path.split(datei)
    datei, ext = os.path.splitext(datei)
    count = 0
    while os.path.exists(os.path.join(pfad, "%s.%d%s" % (datei, count, ext))):
        count += 1
    neudatei = os.path.join(pfad, "%s.%d%s" % (datei, count, ext))
    print("Backing up %s as %s" % (original, neudatei))
    shutil.copy(original, neudatei)
    return neudatei

def is_hidden(datei):
    return datei.startswith('.') or os.sep + '.' in datei

def convert(source, target, so_enc, ta_enc):
    """Re-encode a file, or recursively all files of a directory."""
    from_isdir = os.path.isdir(source)
    to_isdir = os.path.isdir(target)
    from_name = os.path.basename(source)

    if not os.path.exists(source):
        help("Source '%s' not found!" % from_name)

    if from_isdir:
        if is_hidden(source) and not mode['hidden']:
            print("Ignoring hidden directory %s" % source)
            return
        if not to_isdir:
            help("If the source is a directory, the target must be a directory too!")
        print("Processing directory %s" % source)
        dateien = os.listdir(source)
        if mode['filter']:
            dateien = [d for d in dateien if d.endswith(mode['filter'])]
        for datei in dateien:
            s = os.path.join(source, datei)
            t = os.path.join(target, datei)
            convert(s, t, so_enc, ta_enc)
    else:
        if is_hidden(from_name) and not mode['hidden']:
            print("Ignoring hidden file %s" % source)
            return
        if to_isdir:
            target = os.path.join(target, from_name)
        if not mode['overwrite']:
            if source == target:
                source = backup(source)
            elif os.path.exists(target):
                backup(target)
        print("Converting %s (%s)\n\tto %s (%s)" % (source, so_enc, target, ta_enc))
        # read everything first, so that source and target may be the same file
        with open(source, encoding=so_enc) as so_file:
            text = so_file.read()
        with open(target, "w", encoding=ta_enc) as ta_file:
            ta_file.write(text)


opts, args = getopt.getopt(sys.argv[1:], "ohf:", ["overwrite", "hidden", "filter="])

if len(args) < 1:
    help("Too few arguments!")

for m in modes:
    mode[m] = False
    for (o, a) in opts:
        if o == '-' + m[0] or o == '--' + m:
            if a:
                print("Mode %s = %s" % (m, a))
            else:
                a = True
                print("Mode %s active" % m)
            mode[m] = a

# read the desired encodings from the script's file name
scriptname = os.path.splitext(os.path.basename(sys.argv[0]))[0]
from_enc, to_enc = scriptname.split("_to_")

from_name = to_name = args[0]
if len(args) > 1:
    to_name = args[1]

convert(from_name, to_name, from_enc, to_enc)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: indtxt2context.py --]
[-- Type: text/x-python-script; x-mac-creator=21526368; x-unix-mode=0644; x-mac-type=54455854; name="indtxt2context.py", Size: 2773 bytes --]

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Convert InDesign tagged text to ConTeXt
"""
import re
import sys

quote = '$&_%'  # characters that must be escaped for TeX

# (pattern, replacement) pairs; they are applied in exactly this order
rePatterns = [
    # paragraph styles
    (r'^<pstyle:Ü 1\.>((\d\.)*\s+)?(.+)$', r'\\chapter{\3}\n'),
    (r'^<pstyle:Ü 1\.1>((\d\.)*\s+)?(.+)$', r'\\section{\3}\n'),
    (r'^<pstyle:Ü 1\.1\.1>((\d\.)*\s+)?(.+)$', r'\\subsection{\3}\n'),
    (r'^<pstyle:Ü 1\.1\.1\.1>((\d\.)*\s+)?(.+)$', r'\\subsubsection{\3}\n'),
    # character styles
    (r'<ct:Bold>(.+?)<ct:>', r'{\\bf \1}'),
    #(r'<cf:Arial>(.*?)<cf:Times New Roman>', r'\\otherfont{\1}'),

    (r'<.*?>', ''),  # delete all other tags

    # lines that start with dotted numbers = section titles
    (r'^\d+\s+(.+)$', r'\\chapter{\1}\n'),
    (r'^\d+\.\d+\.?\s+(.+)$', r'\\section{\1}\n'),
    (r'^\d+\.\d+\.\d+\.?\s+(.+)$', r'\\subsection{\1}\n'),
    (r'^\d+\.\d+\.\d+\.\d+\.?\s+(.+)$', r'\\subsubsection{\1}\n'),

    (r'^(\s*)[–\-·•]\s+', r'\1\\item\t'),  # itemization (lines starting with bullet etc.)
    (r'^(\s*)(\d+)\.?\)\s+', r'\1\\item[\2]\t'),  # itemization (numerical)
    (r'([Zusovz])\.([Baguo])\.', r'\1.\\,\2.'),  # u.a., s.o., o.g., z.B.
    (r'[„"“](.*?)[“”"]', r'\\quotation{\1}'),  # German quotation
    (r'[\'’‚](.*?)[\'’‘]', r'\\quote{\1}'),  # German single quotation
    (r' ([.?!:;])', r'\1'),  # spaces in front of punctuation
    (r'\{\\em\s+\}', ''),  # empty emphasizing
    (r' (%|°)', r'\\,\1'),  # spaces in front of measure units
    (' - ', ' – '),  # en dash
    (r'(\d{4})\s*(\-|–)\s*(\d{4})', r'\1–\3'),  # year numbers

    (' +', ' '),  # multiple spaces
    (r'^\s+$', '\n'),  # make empty lines really empty
]

# collect parameters
if len(sys.argv) > 1:
    sourcename = sys.argv[1]
    if len(sys.argv) > 2:
        targetname = sys.argv[2]
    else:
        targetname = sourcename.replace('.txt', '.tex')
else:
    print("file name?")
    sys.exit()

# compile regular expressions
reres = [(re.compile(p), r) for p, r in rePatterns]

in_item = False  # are we currently inside an itemize block?

# "unicode" encoded InDesign tagged text is UTF-16 big-endian; write UTF-8
with open(sourcename, encoding='utf-16-be') as source, \
        open(targetname, 'w', encoding='utf-8') as target:
    # convert lines
    for line in source:
        for p, r in reres:
            line = p.sub(r, line)
        for c in quote:
            line = line.replace(c, '\\' + c)
        is_item = '\\item' in line
        if is_item and not in_item:
            target.write('\\startitemize[]\n')
        elif in_item and not is_item:
            target.write('\\stopitemize\n')
        in_item = is_item
        target.write(line)
    if in_item:
        # close a list that runs to the end of the file
        target.write('\\stopitemize\n')

print("%s completed" % targetname)


* Re: converters (was: TexPaste alpha)
  2009-05-29  8:14       ` converters (was: TexPaste alpha) Henning Hraban Ramm
@ 2009-05-29  8:18         ` luigi scarso
  0 siblings, 0 replies; 7+ messages in thread
From: luigi scarso @ 2009-05-29  8:18 UTC (permalink / raw)
  To: mailing list for ConTeXt users



On Fri, May 29, 2009 at 10:14 AM, Henning Hraban Ramm <hraban@fiee.net> wrote:

> On 2009-05-28 at 09:45, luigi scarso wrote:
>
>>> I guess I should build a new converter suite (there's also an InDesign Tags
>>> to ConTeXt converter somewhere on my hard disk).
>>> But I won't make GUI apps, just scripts.
>> That sounds good!
>> If in Python, even better!
>> If only scripts, the best!
>>
>> Can we have more details?
>
> Which conversion do you need?
>
> If it's InDesign to ConTeXt, there's always custom programming needed -
> e.g. you need to know which ID paragraph style should become which ConTeXt
> section. (sample attached)
>
> I'm not good at building parsers and mostly use regular-expression
> replacements, so my converters are always limited, and manual cleanup is
> necessary - but they save a lot of manual work anyway!
>

Thank you very much!

-- 
luigi


* Re: TexPaste alpha - my Win application converting Word/HTML to TeX
  2009-05-28  9:37       ` Piotr Kopszak
@ 2009-06-08  8:27         ` J.A.J. Pater
  0 siblings, 0 replies; 7+ messages in thread
From: J.A.J. Pater @ 2009-06-08  8:27 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Hello Piotr

Sorry for the late reply, but could you post your adapted stylesheet
somewhere on the net or to the list?

Thanks,

Adriaan.

> Hello list,
>
> Inevitably, this is a recurring subject. Here are my 2p. After playing
> with all sorts of converters to TeX, LaTeX and HTML, and scraping the
> output with Perl to obtain something useful for ConTeXt, I found that
> what I really need to preserve from a Word file are italics
> and footnotes. To make a long story short: IMHO the only reasonable
> way to go is via an XSL stylesheet for OpenOffice. Fortunately you don't
> have to develop a new one from scratch, which would be quite a task.
> There is an excellent stylesheet converting ODT to MediaWiki by
> Bernhard Haumacher, odt2mediawiki.xsl. It took me less than an hour to
> adapt it for ConTeXt output. Then you only need to add it as an XML
> filter to OpenOffice, and from then on you can convert Word to ConTeXt
> straight from OpenOffice as if it were one of its built-in export formats.
>
> Piotr
>
>
>
>
> 2009/5/28 luigi scarso <luigi.scarso@gmail.com>:
>   
>>> I guess I should build a new converter suite (there's also an InDesign Tags
>>> to ConTeXt converter somewhere on my hard disk).
>>> But I won't make GUI apps, just scripts.
>>>
>> That sounds good!
>> If in Python, even better!
>> If only scripts, the best!
>>
>> Can we have more details?
>>
>> --
>> luigi
