ntg-context - mailing list for ConTeXt users
* ActualText
@ 2009-09-19  0:51 Barry Schwartz
  2009-09-19 16:55 ` ActualText Hans Hagen
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Schwartz @ 2009-09-19  0:51 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Please tell me this isn't in a FAQ. :) Is there support for ActualText
tags so that searching and extraction will work with OpenType fonts
and Unicode? If so, do discretionary hyphens get treated as 00AD
instead of 002D?

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________



* Re: ActualText
  2009-09-19  0:51 ActualText Barry Schwartz
@ 2009-09-19 16:55 ` Hans Hagen
  2009-09-19 17:10   ` ActualText Arthur Reutenauer
  0 siblings, 1 reply; 10+ messages in thread
From: Hans Hagen @ 2009-09-19 16:55 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Barry Schwartz wrote:
> Please tell me this isn't in a FAQ. :) Is there support for ActualText
> tags so that searching and extraction will work with OpenType fonts
> and Unicode? If so, do discretionary hyphens get treated as 00AD
> instead of 002D?

can you explain in more detail what you mean by 'actual text tags'?

concerning searching ... tounicode vectors are added (including 
heuristics for ligatures and such) so searching, cut/paste etc should 
work ok

in order to see a problem with hyphens i need an example font/text 
combination that i can generate on my machine

Hans

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
      tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------

* Re: ActualText
  2009-09-19 16:55 ` ActualText Hans Hagen
@ 2009-09-19 17:10   ` Arthur Reutenauer
  2009-09-19 18:24     ` ActualText Wolfgang Schuster
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Arthur Reutenauer @ 2009-09-19 17:10 UTC (permalink / raw)
  To: Mailing list for ConTeXt users

> can you explain in more detail what you mean by 'actual text tags'?

  He means "ActualText tags" :-)  See the PDF spec section 14.9.4, page 623.
It's a more generic way to support searching than ToUnicode vectors: you just
specify the actual string of underlying Unicode characters.  The PDF spec uses
hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
search for "Drucker".  You can't do that with ToUnicode vectors.
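
To make the mechanism concrete, here is a minimal sketch of how the replacement text for a marked-content span could be built. It assumes the ActualText value is written as a PDF hex string in UTF-16BE with a leading byte order mark; the helper names `actual_text_hex` and `span_open` are made up for illustration and are not part of any TeX engine or PDF library.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FEFF byte order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def span_open(text: str) -> str:
    """Operator that opens a marked-content span carrying the replacement text."""
    return f"/Span <</ActualText {actual_text_hex(text)}>> BDC"


# Operator that closes the span again.
SPAN_CLOSE = "EMC"

# The glyphs shown between these two operators would spell "Druk-ker",
# while a viewer that honours ActualText should search and extract "Drucker".
print(span_open("Drucker"))
print(SPAN_CLOSE)
```

The two printed lines would surround the text-showing operators of the hyphenated word in the page content stream.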

  Anyway, this needs support at the engine level and I don't think there is;
actually it would be nice to add that to LuaTeX.

	Arthur

* Re: ActualText
  2009-09-19 17:10   ` ActualText Arthur Reutenauer
@ 2009-09-19 18:24     ` Wolfgang Schuster
  2009-09-19 20:02       ` ActualText Arthur Reutenauer
  2009-09-19 19:54     ` ActualText Barry Schwartz
  2009-09-19 21:34     ` ActualText Hans Hagen
  2 siblings, 1 reply; 10+ messages in thread
From: Wolfgang Schuster @ 2009-09-19 18:24 UTC (permalink / raw)
  To: Mailing list for ConTeXt users


Am 19.09.2009 um 19:10 schrieb Arthur Reutenauer:

> Anyway, this needs support at the engine level and I don't think  
> there is;
> actually it would be nice to add that to LuaTeX.

Heiko Oberdiek wrote the accsupp package to use ActualText in LaTeX,
so why shouldn't it be possible to use it in LuaTeX (and ConTeXt)?

Wolfgang


* Re: ActualText
  2009-09-19 17:10   ` ActualText Arthur Reutenauer
  2009-09-19 18:24     ` ActualText Wolfgang Schuster
@ 2009-09-19 19:54     ` Barry Schwartz
  2009-09-19 21:50       ` ActualText Hans Hagen
  2009-09-19 21:34     ` ActualText Hans Hagen
  2 siblings, 1 reply; 10+ messages in thread
From: Barry Schwartz @ 2009-09-19 19:54 UTC (permalink / raw)
  To: Mailing list for ConTeXt users

Arthur Reutenauer <arthur.reutenauer@normalesup.org> skribis:
>   He means "ActualText tags" :-)  See the PDF spec section 14.9.4, page 623.
> It's a more generic way to support searching than ToUnicode vectors: you just
> specify the actual string of underlying Unicode characters.  The PDF spec uses
> hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
> search for "Drucker".  You can't do that with ToUnicode vectors.

You also need ActualText tags to mark the difference between a
discretionary hyphen and an explicit hyphen in English, which programs
like Reader use when extracting text. When the hyphen is discretionary
you set the ActualText to U+00AD (soft hyphen) instead of U+002D
(hyphen-minus). (That's mentioned somewhere in the PDF spec.)
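
As a rough sketch of the hyphen case: the same hyphen glyph is shown either way, and only a discretionary one gets wrapped in a span whose ActualText is U+00AD. The function names and the `(-) Tj` show operator are placeholders chosen for illustration; a real backend would emit whatever encoding its font uses.

```python
SOFT_HYPHEN = "\u00ad"


def pdf_text_hex(text: str) -> str:
    """UTF-16BE hex string with byte order mark, as used for PDF text strings."""
    return "<" + (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper() + ">"


def wrap_hyphen(show_hyphen_op: str, discretionary: bool) -> list[str]:
    """Return the content-stream operators for one hyphen glyph.

    An explicit hyphen is emitted unchanged; a discretionary hyphen is
    wrapped so that extraction yields a soft hyphen instead of U+002D.
    """
    if not discretionary:
        return [show_hyphen_op]
    return [
        f"/Span <</ActualText {pdf_text_hex(SOFT_HYPHEN)}>> BDC",
        show_hyphen_op,
        "EMC",
    ]


# "(-) Tj" is a hypothetical operator that shows the hyphen glyph.
for line in wrap_hyphen("(-) Tj", discretionary=True):
    print(line)
```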

Another thing I just thought of that isn't always done is that there
should be explicit space characters between words, including at the
ends of lines, although I'm not sure whether Adobe Reader turns off
its word-boundary heuristics if it sees space characters.

Since what I enjoy doing is making e-books that can be searched
through and, perhaps more importantly, extracted from via the Select
tool, it's important to me to make the search, selection, and
extraction features work. I'll use them myself if I choose, for
instance, to quote from an e-book I made. I've added them in my
(heavily) modified version of ant, but that's in a primitive state, a
long-term project that competes with font-making and e-book-making for
time, and so I'd like to have ConTeXt as well. I like ConTeXt a lot.

Also, I noticed when playing around with the examples from the "Th"
ligature discussion that searching and extraction didn't work with
small caps, though it did work with the ligature. With ActualText tags
these things always work, regardless of the ToUnicode map's
contents. The way Cairo's PDF backend handles this is to use an
ActualText tag for any glyphs that aren't included in the font's
encoding. What I did in my modified ant is to generate a ToUnicode map
from the Adobe glyph naming convention
(http://www.adobe.com/devnet/opentype/archives/glyph.html) and then
put an ActualText tag on anything that happens not to match what you
would get from the ToUnicode mapping.
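
The rule described above (tag only what the ToUnicode map would get wrong) can be sketched as a simple comparison. This is an illustration under assumptions, not the code from the modified ant: the glyph names, the `to_unicode` dictionary and the function name are invented for the example.

```python
def needs_actual_text(glyphs: list[str], source_text: str, to_unicode: dict[str, str]) -> bool:
    """True when extracting via the ToUnicode map would not reproduce source_text."""
    extracted = "".join(to_unicode.get(glyph, "") for glyph in glyphs)
    return extracted != source_text


# Hypothetical ToUnicode contents for a handful of glyphs.
to_unicode = {"T_h": "Th", "e": "e", "hyphen": "-"}

# A ligature that the map already resolves needs no tag.
print(needs_actual_text(["T_h", "e"], "The", to_unicode))
# A discretionary hyphen should extract as U+00AD, so it needs a tag.
print(needs_actual_text(["hyphen"], "\u00ad", to_unicode))
# A glyph missing from the map (here a small cap) needs a tag as well.
print(needs_actual_text(["Asmall"], "a", to_unicode))
```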

(For reasons that were stupid, I once created a lame little C library
to do the mapping from glyph names to Unicode, using a compressed
lookup trie:
http://code.google.com/p/kompostilo/source/browse/#svn/trunk/support-libraries/glyph_name
)


* Re: ActualText
  2009-09-19 18:24     ` ActualText Wolfgang Schuster
@ 2009-09-19 20:02       ` Arthur Reutenauer
  0 siblings, 0 replies; 10+ messages in thread
From: Arthur Reutenauer @ 2009-09-19 20:02 UTC (permalink / raw)
  To: Mailing list for ConTeXt users

> Heiko Oberdiek wrote the accsupp package to use ActualText in LaTeX,
> so why shouldn't it be possible to use it in LuaTeX (and ConTeXt)?

  Right, you don't need additional engine support, you can use \pdfliteral in
pdfTeX, and in LuaTeX as well.  Heiko's package should be quite easy to port to
ConTeXt.

	Arthur

* Re: ActualText
  2009-09-19 17:10   ` ActualText Arthur Reutenauer
  2009-09-19 18:24     ` ActualText Wolfgang Schuster
  2009-09-19 19:54     ` ActualText Barry Schwartz
@ 2009-09-19 21:34     ` Hans Hagen
  2 siblings, 0 replies; 10+ messages in thread
From: Hans Hagen @ 2009-09-19 21:34 UTC (permalink / raw)
  To: Mailing list for ConTeXt users

Arthur Reutenauer wrote:
>> can you explain in more detail what you mean by 'actual text tags'?
> 
>   He means "ActualText tags" :-)  See the PDF spec section 14.9.4, page 623.
> It's a more generic way to support searching than ToUnicode vectors: you just
> specify the actual string of underlying Unicode characters.  The PDF spec uses
> hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
> search for "Drucker".  You can't do that with ToUnicode vectors.
> 
>   Anyway, this needs support at the engine level and I don't think there is;
> actually it would be nice to add that to LuaTeX.

hm, if done with words it's probably doable with an unadapted engine 
(esp when we have a cleaner pdfliteral model, which is on the agenda)

\starttext
     \dorecurse{100}{test }
     % typeset "arthur", but mark "hans" as the text to search for and extract
     \pdfliteral{/Span <</ActualText (hans) >> BDC}arthur\pdfliteral{EMC}
     \dorecurse{100}{test }
\stoptext

not that hard to implement if we add a span around each word

Hans


* Re: ActualText
  2009-09-19 19:54     ` ActualText Barry Schwartz
@ 2009-09-19 21:50       ` Hans Hagen
  2009-09-19 22:17         ` ActualText Barry Schwartz
  0 siblings, 1 reply; 10+ messages in thread
From: Hans Hagen @ 2009-09-19 21:50 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Barry Schwartz wrote:

> Also, I noticed when playing around with the examples from the "Th"
> ligature discussion that searching and extraction didn't work with
> small caps, though it did work with the ligature. With ActualText tags

hm, mkiv has an analyser for names->unicode and afaik small caps should 
work, unless the glyph name cannot be interpreted (as i don't have the 
font i cannot see what happens or what goes wrong here)

> these things always work, regardless of the ToUnicode map's
> contents. The way Cairo's PDF backend handles this is to use an
> ActualText tag for any glyphs that aren't included in the font's
> encoding. What I did in my modified ant is to generate a ToUnicode map
> from the Adobe glyph naming convention
> (http://www.adobe.com/devnet/opentype/archives/glyph.html) and then

thanks for the pointer

> put an ActualText tag on anything that happens not to match what you
> would get from the ToUnicode mapping.

hm, if one knows the character (say c) then why not adapt the tounicode 
vector

Hans


* Re: ActualText
  2009-09-19 21:50       ` ActualText Hans Hagen
@ 2009-09-19 22:17         ` Barry Schwartz
  2009-09-20  0:06           ` ActualText Barry Schwartz
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Schwartz @ 2009-09-19 22:17 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Hans Hagen <pragma@wxs.nl> skribis:
> > put an ActualText tag on anything that happens not to match what you
> > would get from the ToUnicode mapping.
> 
> hm, if one knows the character (say c) then why not adapt the tounicode 
> vector

The same glyph could correspond to different Unicode in the
source. This is exactly what happens normally with hyphens.

In practice what I see with my method is that discretionary hyphens
always get an ActualText, and if the font is older and has names like
"Asmall" or "ffl" (which I don't bother handling specially) then the
substituted stuff gets an ActualText. I could look at the font's
internal encoding the way I think Cairo does, but it doesn't matter a
whole lot.


* Re: ActualText
  2009-09-19 22:17         ` ActualText Barry Schwartz
@ 2009-09-20  0:06           ` Barry Schwartz
  0 siblings, 0 replies; 10+ messages in thread
From: Barry Schwartz @ 2009-09-20  0:06 UTC (permalink / raw)
  To: mailing list for ConTeXt users

Barry Schwartz <chemoelectric@chemoelectric.org> skribis:
> In practice what I see with my method is that discretionary hyphens
> always get an ActualText, and if the font is older and has names like
> "Asmall" or "ffl" (which I don't bother handling specially) then the
> substituted stuff gets an ActualText. I could look at the font's
> internal encoding the way I think Cairo does, but it doesn't matter a
> whole lot.

Oops, "ffl" is in the Adobe Glyph List and so would get put into the
ToUnicode. Something like "ffh" wouldn't, however, but "f_f_h" would
because it can be broken down into parts that are in the AGL.
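
A small sketch of that decomposition, following my reading of the Adobe glyph naming convention: drop any suffix after the first period, split on underscores, then map each component through the AGL or through the `uniXXXX` / `uXXXX` forms. `AGL_EXCERPT` is a tiny stand-in for the real glyph list, and the function names are made up for the example.

```python
import re

# Tiny excerpt standing in for the Adobe Glyph List (an assumption for this
# sketch; the real list has thousands of entries).
AGL_EXCERPT = {"f": "f", "h": "h", "T": "T", "ffl": "\ufb04", "hyphen": "-"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")
_U = re.compile(r"u([0-9A-F]{4,6})")


def _scalar(hex_digits: str) -> str:
    """Convert hex digits to a character, rejecting surrogates and out-of-range values."""
    value = int(hex_digits, 16)
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        return ""
    return chr(value)


def component_to_unicode(component: str) -> str:
    """Map one underscore-separated component; unknown components map to ''."""
    if component in AGL_EXCERPT:
        return AGL_EXCERPT[component]
    match = _UNI.fullmatch(component)
    if match:
        digits = match.group(1)
        return "".join(_scalar(digits[i:i + 4]) for i in range(0, len(digits), 4))
    match = _U.fullmatch(component)
    if match:
        return _scalar(match.group(1))
    return ""


def glyph_name_to_unicode(name: str) -> str:
    """Map a whole glyph name such as 'f_f_h' or 'T_h.liga' to a Unicode string."""
    base = name.split(".", 1)[0]
    return "".join(component_to_unicode(part) for part in base.split("_"))


for name in ("ffl", "ffh", "f_f_h"):
    print(name, repr(glyph_name_to_unicode(name)))
```

With this excerpt, "ffl" and "f_f_h" both resolve to text, while "ffh" yields an empty string and would therefore be a candidate for an ActualText tag.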



