From: Barry Schwartz
Newsgroups: gmane.comp.tex.context
Subject: Re: ActualText
Date: Sat, 19 Sep 2009 14:54:07 -0500
Message-ID: <20090919195407.GA17417@crud.chemoelectric.org>
In-Reply-To: <20090919171034.GB29519@phare.normalesup.org>
To: Mailing list for ConTeXt users

Arthur Reutenauer skribis:
> He means "ActualText tags" :-)  See the PDF spec section 14.9.4, page 623.
> It's a more generic way to support searching than ToUnicode vectors: you
> just specify the actual string of underlying Unicode characters.  The PDF
> spec uses hyphenated "ck" in German as an example: you typeset "Druk-ker"
> but you want to search for "Drucker".  You can't do that with ToUnicode
> vectors.
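
To make the quoted "Druk-ker" case concrete, here is a minimal sketch of how such a replacement could be written into a page content stream. It is modelled on the hyphenation example the quote refers to; the helper names `pdf_text_string` and `actual_text_span` are made up for illustration and do not come from ConTeXt, ant, or any PDF library.

```python
def pdf_text_string(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by a byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def actual_text_span(shown_ops: str, actual: str) -> str:
    """Wrap text-showing operators in a /Span marked-content sequence.

    The glyphs drawn by shown_ops stay visible on the page; a consumer
    that honours ActualText extracts `actual` instead.
    """
    return f"/Span <</ActualText {pdf_text_string(actual)}>> BDC\n{shown_ops}\nEMC"


# "Drucker" broken across lines as "Druk-" / "ker":
# only the "k-" at the line break is replaced, and it is replaced by "c".
stream = "\n".join([
    "(Dru) Tj",
    actual_text_span("(k-) Tj", "c"),
    "(ker) '",
])
print(stream)
```

A reader that follows the marked content would reassemble "Dru" + "c" + "ker", which is the searchable "Drucker".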

You also need ActualText tags to mark the difference between a
discretionary hyphen and an explicit hyphen in English, which programs
like Reader use when extracting text.  When the hyphen is discretionary
you set the ActualText to U+00AD (soft hyphen) instead of U+002D
(hyphen-minus).  (That's mentioned somewhere in the PDF spec.)

Another thing I just thought of, and that isn't always done: there
should be explicit space characters between words, including at the
ends of lines, although I'm not sure whether Adobe Reader turns off its
word-boundary heuristics if it sees space characters.

What I enjoy doing is making e-books that can be searched and, perhaps
more importantly, extracted from via the Select tool, so it's important
to me that search, selection, and extraction work.  I'll use them
myself if I choose, for instance, to quote from an e-book I made.  I've
added ActualText support to my (heavily) modified version of ant, but
that's in a primitive state, a long-term project that competes with
font-making and e-book-making for time, and so I'd like to have ConTeXt
as well.  I like ConTeXt a lot.

Also, I noticed when playing around with the examples from the "Th"
ligature discussion that searching and extraction didn't work with
small caps, though it did work with the ligature.  With ActualText tags
these things always work, regardless of the ToUnicode map's contents.

The way Cairo's PDF backend handles this is to use an ActualText tag
for any glyphs that aren't included in the font's encoding.  What I did
in my modified ant is to generate a ToUnicode map from the Adobe glyph
naming convention
(http://www.adobe.com/devnet/opentype/archives/glyph.html) and then put
an ActualText tag on anything that happens not to match what you would
get from the ToUnicode mapping.
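
The "tag only what the ToUnicode map gets wrong" rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the modified ant: the glyph ids, the contents of `to_unicode`, and the function name `needs_actual_text` are all invented for the example.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the ActualText for a discretionary hyphen

# Hypothetical ToUnicode data: glyph id -> string it maps to.
# Glyph 3 is a "Th" ligature; glyph 6 (a small-cap "a") is deliberately missing.
to_unicode = {1: "T", 2: "h", 3: "Th", 4: "-", 5: "a"}


def needs_actual_text(glyph_ids, intended, to_unicode):
    """Return True when the ToUnicode map alone would not yield `intended`."""
    extracted = "".join(to_unicode.get(g, "") for g in glyph_ids)
    return extracted != intended


# The ligature already extracts correctly, so no tag is required.
print(needs_actual_text([3], "Th", to_unicode))         # False
# A hyphen inserted by line breaking should read back as a soft hyphen.
print(needs_actual_text([4], SOFT_HYPHEN, to_unicode))  # True
# An unmapped small-cap glyph extracts as nothing, so it needs a tag.
print(needs_actual_text([6], "a", to_unicode))          # True
```

Whenever the function returns True, the run would be wrapped in a marked-content span whose ActualText is the intended string.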
(For reasons that were stupid, I once created a lame little C library
to do the mapping from glyph names to Unicode, using a compressed
lookup trie:
http://code.google.com/p/kompostilo/source/browse/#svn/trunk/support-libraries/glyph_name )
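
For comparison, the glyph-name-to-Unicode step can be sketched in a few lines of Python. This is not the C library mentioned above (which used a compressed trie); it uses a plain dictionary, and `AGL_EXCERPT` is a tiny hand-picked subset standing in for the full glyph list. The parsing rules follow one reading of the published glyph naming convention: drop any suffix after the first period, split ligature names on underscores, then try the list, the `uniXXXX` form, and the `uXXXX` form in turn.

```python
import re

# Tiny illustrative subset of the glyph list, not the real table.
AGL_EXCERPT = {"T": "T", "h": "h", "f": "f", "i": "i", "a": "a",
               "hyphen": "-", "space": " "}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni + groups of four hex digits
_U = re.compile(r"u([0-9A-F]{4,6})")         # u + four to six hex digits


def _is_scalar(cp: int) -> bool:
    """True for code points that are in range and not surrogates."""
    return cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def component_to_unicode(name: str) -> str:
    """Map one underscore-free name component to a string ('' if unknown)."""
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    m = _UNI.fullmatch(name)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(_is_scalar, cps)) else ""
    m = _U.fullmatch(name)
    if m and _is_scalar(int(m.group(1), 16)):
        return chr(int(m.group(1), 16))
    return ""


def glyph_name_to_unicode(glyph_name: str) -> str:
    """Map a full glyph name such as 'T_h' or 'a.sc' to its Unicode string."""
    base = glyph_name.split(".", 1)[0]  # drop suffixes like ".sc"
    return "".join(component_to_unicode(c) for c in base.split("_"))


print(glyph_name_to_unicode("T_h"))            # Th
print(glyph_name_to_unicode("a.sc"))           # a
print(repr(glyph_name_to_unicode("uni00AD")))  # '\xad'
```

Names that resolve to nothing, such as `.notdef` or an arbitrary private name, come back as the empty string, which is exactly the case where an ActualText tag has to carry the text instead.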