* ActualText
From: Barry Schwartz @ 2009-09-19 0:51 UTC
To: mailing list for ConTeXt users

Please tell me this isn't in a FAQ. :) Is there support for ActualText
tags so that searching and extraction will work with OpenType fonts
and Unicode? If so, do discretionary hyphens get treated as 00AD
instead of 002D?

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________
* Re: ActualText
From: Hans Hagen @ 2009-09-19 16:55 UTC
To: mailing list for ConTeXt users

Barry Schwartz wrote:
> Please tell me this isn't in a FAQ. :) Is there support for ActualText
> tags so that searching and extraction will work with OpenType fonts
> and Unicode? If so, do discretionary hyphens get treated as 00AD
> instead of 002D?

can you explain in more detail what you mean with 'actual text tags'?

concerning searching ... tounicode vectors are added (including
heuristics for ligatures and such) so searching, cut/paste etc should
work ok

in order to see a problem with hyphens i need an example font/text
combination that i can generate on my machine

Hans

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
* Re: ActualText
From: Arthur Reutenauer @ 2009-09-19 17:10 UTC
To: Mailing list for ConTeXt users

> can you explain in more detail what you mean with 'actual text tags'?

He means "ActualText tags" :-) See the PDF spec section 14.9.4, page 623.
It's a more generic way to support searching than ToUnicode vectors: you just
specify the actual string of underlying Unicode characters. The PDF spec uses
hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
search for "Drucker". You can't do that with ToUnicode vectors, because they
map each glyph to a fixed string regardless of context.

Anyway, this needs support at the engine level and I don't think there is any;
actually it would be nice to add that to LuaTeX.

Arthur
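Editor's sketch of the mechanism Arthur describes: an ActualText entry travels in a marked-content span (BDC ... EMC) around the operators that draw the glyphs. The helper name `actual_text_span` is hypothetical, and drawing "Druk-ker" with a single Tj operator is a simplification (a real hyphenated word would be split across two lines). The replacement string is written as a UTF-16BE hex string with a byte order mark, which is one of the text-string encodings PDF accepts.

```python
def actual_text_span(shown_ops: str, actual: str) -> str:
    """Wrap content-stream operators in a /Span carrying an ActualText entry."""
    # UTF-16BE with a leading FE FF byte order mark; writing it as a hex
    # string avoids having to escape parentheses or backslashes.
    payload = b"\xfe\xff" + actual.encode("utf-16-be")
    hex_string = payload.hex().upper()
    return f"/Span <</ActualText <{hex_string}>>> BDC\n{shown_ops}\nEMC"


# The glyphs on the page spell the hyphenated form; extraction should see the word.
span = actual_text_span("(Druk-ker) Tj", "Drucker")
print(span)
```

A viewer that honours ActualText would then report "Drucker" for this span when searching or copying, whatever the ToUnicode map says about the individual glyphs.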
* Re: ActualText
From: Wolfgang Schuster @ 2009-09-19 18:24 UTC
To: Mailing list for ConTeXt users

On 19.09.2009 at 19:10, Arthur Reutenauer wrote:

> Anyway, this needs support at the engine level and I don't think
> there is;
> actually it would be nice to add that to LuaTeX.

Heiko Oberdiek wrote the accsupp package to use ActualText in LaTeX,
so why shouldn't it then be possible to use it in LuaTeX (and ConTeXt)?

Wolfgang
* Re: ActualText
From: Arthur Reutenauer @ 2009-09-19 20:02 UTC
To: Mailing list for ConTeXt users

> Heiko Oberdiek wrote the accsupp package to use ActualText in LaTeX,
> why shouldn't it be then possible to use it in LuaTeX (and ConTeXt)?

Right, you don't need additional engine support, you can use \pdfliteral in
pdfTeX, and in LuaTeX as well. Heiko's package should be quite easy to port
to ConTeXt.

Arthur
* Re: ActualText
From: Barry Schwartz @ 2009-09-19 19:54 UTC
To: Mailing list for ConTeXt users

Arthur Reutenauer <arthur.reutenauer@normalesup.org> wrote:
> He means "ActualText tags" :-) See the PDF spec section 14.9.4, page 623.
> It's a more generic way to support searching than ToUnicode vectors: you just
> specify the actual string of underlying Unicode characters. The PDF spec uses
> hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
> search for "Drucker". You can't do that with ToUnicode vectors.

You also need ActualText tags to mark the difference between a
discretionary hyphen and an explicit hyphen in English, which programs
like Reader use when extracting text. When the hyphen is discretionary
you set the ActualText to U+00AD instead of U+002D. (That's mentioned
somewhere in the PDF spec.)

Another thing I just thought of that isn't always done is that there
should be explicit space characters between words, including at the
ends of lines, although I'm not sure whether Adobe Reader turns off
its word-boundary heuristics if it sees space characters.

Since what I enjoy doing is making e-books that can be searched
through and, perhaps more importantly, extracted from via the Select
tool, it's important to me to make the search, selection, and
extraction features work. I'll use them myself if I choose, for
instance, to quote from an e-book I made. I've added them in my
(heavily) modified version of ant, but that's in a primitive state, a
long-term project that competes with font-making and e-book-making for
time, and so I'd like to have ConTeXt as well. I like ConTeXt a lot.

Also, I noticed when playing around with the examples from the "Th"
ligature discussion that searching and extraction didn't work with
small caps, though it did work with the ligature. With ActualText tags
these things always work, regardless of the ToUnicode map's
contents. The way Cairo's PDF backend handles this is to use an
ActualText tag for any glyphs that aren't included in the font's
encoding. What I did in my modified ant is to generate a ToUnicode map
from the Adobe glyph naming convention
(http://www.adobe.com/devnet/opentype/archives/glyph.html) and then
put an ActualText tag on anything that happens not to match what you
would get from the ToUnicode mapping.

(For reasons that were stupid, I once created a lame little C library
to do the mapping from glyph names to Unicode, using a compressed
lookup trie:
http://code.google.com/p/kompostilo/source/browse/#svn/trunk/support-libraries/glyph_name )
* Re: ActualText
From: Hans Hagen @ 2009-09-19 21:50 UTC
To: mailing list for ConTeXt users

Barry Schwartz wrote:

> Also, I noticed when playing around with the examples from the "Th"
> ligature discussion that searching and extraction didn't work with
> small caps, though it did work with the ligature. With ActualText tags

hm, mkiv has an analyser for names->unicode and afaik small caps should
work, unless the glyph name cannot be interpreted (as i don't have the
font i cannot see what happens or what goes wrong here)

> these things always work, regardless of the ToUnicode map's
> contents. The way Cairo's PDF backend handles this is to use an
> ActualText tag for any glyphs that aren't included in the font's
> encoding. What I did in my modified ant is to generate a ToUnicode map
> from the Adobe glyph naming convention
> (http://www.adobe.com/devnet/opentype/archives/glyph.html) and then

thanks for the pointer

> put an ActualText tag on anything that happens not to match what you
> would get from the ToUnicode mapping.

hm, if one knows the character (say c) then why not adapt the tounicode
vector

Hans
* Re: ActualText
From: Barry Schwartz @ 2009-09-19 22:17 UTC
To: mailing list for ConTeXt users

Hans Hagen <pragma@wxs.nl> wrote:
> > put an ActualText tag on anything that happens not to match what you
> > would get from the ToUnicode mapping.
>
> hm, if one knows the character (say c) then why not adapt the tounicode
> vector

The same glyph could correspond to different Unicode in the
source. This is exactly what happens normally with hyphens: a
ToUnicode vector has one entry per glyph, so the hyphen glyph cannot
map to both U+002D and U+00AD.

In practice what I see with my method is that discretionary hyphens
always get an ActualText, and if the font is older and has names like
"Asmall" or "ffl" (which I don't bother handling specially) then the
substituted stuff gets an ActualText. I could look at the font's
internal encoding the way I think Cairo does, but it doesn't matter a
whole lot.
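Editor's sketch of the rule Barry describes, under stated assumptions: the names `PlacedGlyph` and `needs_actual_text` and the dictionary-shaped ToUnicode map are hypothetical, not taken from ant or ConTeXt. The point it illustrates is that the decision is made per placed glyph by comparing the source text with what the font-wide ToUnicode entry would yield.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"


@dataclass
class PlacedGlyph:
    glyph_name: str   # glyph actually drawn on the page
    source_text: str  # Unicode the source (or the line breaker) asked for


def needs_actual_text(glyph: PlacedGlyph, to_unicode: dict[str, str]) -> bool:
    """True when the font-wide ToUnicode entry disagrees with the source text."""
    return to_unicode.get(glyph.glyph_name) != glyph.source_text


# One ToUnicode entry per glyph: the hyphen glyph can only map to one string.
to_unicode = {"hyphen": "-", "a": "a"}

explicit = PlacedGlyph("hyphen", "-")               # typed hyphen, as in "e-book"
discretionary = PlacedGlyph("hyphen", SOFT_HYPHEN)  # inserted by line breaking
unmapped = PlacedGlyph("Asmall", "a")               # name missing from the map

for g in (explicit, discretionary, unmapped):
    print(g.glyph_name, repr(g.source_text), needs_actual_text(g, to_unicode))
```

Under this rule the explicit hyphen is left to the ToUnicode map, while the discretionary hyphen and the unmapped small-cap glyph each get an ActualText span.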
* Re: ActualText
From: Barry Schwartz @ 2009-09-20 0:06 UTC
To: mailing list for ConTeXt users

Barry Schwartz <chemoelectric@chemoelectric.org> wrote:
> In practice what I see with my method is that discretionary hyphens
> always get an ActualText, and if the font is older and has names like
> "Asmall" or "ffl" (which I don't bother handling specially) then the
> substituted stuff gets an ActualText. I could look at the font's
> internal encoding the way I think Cairo does, but it doesn't matter a
> whole lot.

Oops, "ffl" is in the Adobe Glyph List and so would get put into the
ToUnicode. Something like "ffh" wouldn't, however, but "f_f_h" would
because it can be broken down into parts that are in the AGL.
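Editor's sketch of the name-to-Unicode step, loosely following the Adobe glyph naming convention linked earlier in the thread. It is not Barry's C library: `AGL_SUBSET` is a tiny hypothetical stand-in for the real Adobe Glyph List, the function names are made up for illustration, and details such as surrogate checks are left out. It shows why "ffl" and "f_f_h" resolve while "ffh" does not.

```python
import re

# Tiny stand-in for the Adobe Glyph List; the real list has thousands of entries.
AGL_SUBSET = {"f": "f", "h": "h", "l": "l", "hyphen": "-", "ffl": "\ufb04"}

UNI_NAME = re.compile(r"uni((?:[0-9A-F]{4})+)$")  # uniXXXX, possibly repeated
U_NAME = re.compile(r"u([0-9A-F]{4,6})$")         # uXXXX up to uXXXXXX


def component_to_unicode(component):
    """Map one underscore-free component to a string, or None if unknown."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    m = UNI_NAME.match(component)
    if m:
        digits = m.group(1)
        return "".join(chr(int(digits[i:i + 4], 16))
                       for i in range(0, len(digits), 4))
    m = U_NAME.match(component)
    if m:
        code = int(m.group(1), 16)
        return chr(code) if code <= 0x10FFFF else None
    return None


def glyph_name_to_unicode(name):
    """Resolve a glyph name, or return None when any component is unknown."""
    base = name.split(".", 1)[0]  # drop a suffix such as ".sc"
    parts = [component_to_unicode(c) for c in base.split("_")]
    if None in parts:
        return None
    return "".join(parts)


for name in ("ffl", "ffh", "f_f_h", "Asmall"):
    print(name, repr(glyph_name_to_unicode(name)))
```

Names that come back as None here are the ones that, in the scheme described above, would have no ToUnicode entry and would rely on ActualText instead.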
* Re: ActualText
From: Hans Hagen @ 2009-09-19 21:34 UTC
To: Mailing list for ConTeXt users

Arthur Reutenauer wrote:
>> can you explain in more detail what you mean with 'actual text tags'?
>
> He means "ActualText tags" :-) See the PDF spec section 14.9.4, page 623.
> It's a more generic way to support searching than ToUnicode vectors: you just
> specify the actual string of underlying Unicode characters. The PDF spec uses
> hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
> search for "Drucker". You can't do that with ToUnicode vectors.
>
> Anyway, this needs support at the engine level and I don't think there is;
> actually it would be nice to add that to LuaTeX.

hm, if done with words it's probably doable with an unadapted engine
(esp when we have a cleaner pdfliteral model, which is on the agenda)

\starttext
    \dorecurse{100}{test }
    \pdfliteral{/Span <</ActualText (hans) >> BDC}arthur\pdfliteral{EMC}
    \dorecurse{100}{test }
\stoptext

not that hard to implement if we add a span around each word

Hans