Message-ID: <520E368D.40604@coherentgraphics.co.uk>
Date: Fri, 16 Aug 2013 15:26:21 +0100
From: John Whitington
To: Armaël Guéneau
CC: caml-list@inria.fr
References: <520CB9C8.7010108@coherentgraphics.co.uk> <520E049C.5020502@ens-lyon.fr> <520E07AD.9010509@coherentgraphics.co.uk> <520E10E1.5020701@ens-lyon.fr>
In-Reply-To: <520E10E1.5020701@ens-lyon.fr>
Subject: Re: [Caml-list] ANN: CamlPDF 1.7

Hi,

Armaël Guéneau wrote:
>> So 0o019 looks like a floating acute in that encoding, followed by a
>> kern of 486/1000 of a point to shift leftward, followed by an 'e'. So,
>> this is an accented character built by composition of glyphs.
>>
>>> For "efficient", with "ffi" being ligated, I get
>>>
>>> Pdfops_TJ (Pdf.Array [Pdf.String "e\014cient"])
>>
>> In the font in use here, character 0o014 appears to be a single glyph
>> for the ffi ligature.
> Yes, ok. How do you know that? I mean, without knowing the displayed text.
> Is there a way, knowing the glyph code (here, 0o019 or 0o014), to convert
> it to something more "readable"? Like, hum, ['] for the floating acute,
> and [ffi] for the ligature.

Sometimes, sometimes not. This is what the Pdftext module does.
Modern PDFs have a /ToUnicode for each font, which is a special data
structure mapping bytes or sequences of bytes directly to Unicode
codepoints or sequences of Unicode codepoints. In the absence of this,
one can fall back to the other parts of the font metadata (or even the
font itself) which might give the encoding in use.

> I tried to copy paste the text from the pdf using evince, and the
> floating acute is indeed rendered separately, but the ligature is
> properly converted to "ffi".
> I guess the interpretation of the glyph code depends on the font, but I
> don't find how to do that with CamlPDF - using glyphnames_of_text just
> returned only "/.notdef"... So it looks like only the ffi part will be
> doable.

I can't really comment more without seeing the PDF and your code...

Thanks,

-- 
John Whitington
Director, Coherent Graphics Ltd
http://www.coherentpdf.com/
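
For illustration only, here is a rough sketch of the lookup order described in the reply above: try a /ToUnicode-style map first, then fall back to the font's encoding and a glyph-name table. This is not CamlPDF's code and does not use its API. The tables, their contents, and the function names are hypothetical. The codes 14 and 19 read the thread's "\014" and "019" as decimal, on the assumption that the values were printed with OCaml string escapes, which are decimal.

```ocaml
(* Hypothetical per-font tables. In a real PDF these would come from the
   font's /ToUnicode stream and /Encoding entry; the entries below are
   illustrative assumptions, not data from the PDF under discussion. *)

(* Character code -> Unicode codepoints, as a /ToUnicode map would give. *)
let to_unicode : (int * int list) list =
  [ (14, [0x66; 0x66; 0x69]) ]  (* ligature glyph -> 'f' 'f' 'i' *)

(* Character code -> glyph name, as an encoding would give. *)
let encoding : (int * string) list =
  [ (14, "/ffi"); (19, "/acute") ]

(* Glyph name -> codepoints, in the spirit of a glyph list. *)
let glyph_list : (string * int list) list =
  [ ("/ffi", [0xFB03]); ("/acute", [0x00B4]) ]

(* Resolve one character code: /ToUnicode first, then the encoding plus
   glyph list, then (an assumption) printable ASCII as itself. Anything
   else becomes U+FFFD, the replacement character. *)
let codepoints_of_code (code : int) : int list =
  match List.assoc_opt code to_unicode with
  | Some cps -> cps
  | None ->
    (match List.assoc_opt code encoding with
     | Some name ->
       (match List.assoc_opt name glyph_list with
        | Some cps -> cps
        | None -> [0xFFFD])
     | None ->
       if code >= 32 && code < 127 then [code] else [0xFFFD])

(* Convert a raw string operand (one byte per character code) to UTF-8. *)
let utf8_of_string (s : string) : string =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
       List.iter
         (fun cp -> Buffer.add_utf_8_uchar b (Uchar.of_int cp))
         (codepoints_of_code (Char.code c)))
    s;
  Buffer.contents b

let () = print_endline (utf8_of_string "e\014cient")
```

With these made-up tables the ligature code is resolved through the /ToUnicode-style map, while the floating acute is only reachable through the encoding fallback. A code missing from every table cannot be resolved at all, which is one way a lookup can end up with nothing better than "/.notdef".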