From: John Whitington
Date: Fri, 16 Aug 2013 12:06:21 +0100
To: Armaël Guéneau
CC: caml-list@inria.fr
Subject: Re: [Caml-list] ANN: CamlPDF 1.7
Message-ID: <520E07AD.9010509@coherentgraphics.co.uk>

Hi,

Armaël Guéneau wrote:
> Hi,
>
> On 15/08/2013 13:21, John Whitington wrote:
>> The first new release of the CamlPDF library for a while is here:
>>
>> http://www.github.com/johnwhitington/camlpdf
>>
>> (Or, shortly, via OPAM.)
> Thanks!
>
> I have been playing with CamlPDF a bit, trying to do text extraction.
> I'm a total novice about the PDF format, so i might be doing it wrong,
> but I was wondering if there were facilities, in CamlPDF, to handle
> diacritics and ligatures.
>
> For example, when reading the PDF operators for "Université", I get
>
> Pdfops_TJ (Pdf.Array [Pdf.String "Universit"; Pdf.String "\019";
> Pdf.Real 486.; Pdf.String "e"])

So character 19 (the "\019" escape is decimal, as OCaml prints it) looks like a floating acute in that encoding, followed by a kern of 486 thousandths of a text-space unit (scaled by the font size) to shift leftward, followed by an 'e'. So, this is an accented character built by composition of glyphs.

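To make the composition concrete, here is a minimal sketch of how such a TJ array could be folded back into UTF-8. It does not use CamlPDF: the `tj_item` type and the function names are hypothetical stand-ins, the "code 19 is an acute accent" mapping is an assumption that holds only for this font's encoding, and all other codes are assumed to be plain ASCII. Unicode puts a combining mark after its base letter, so the accent is held back until the next glyph has been emitted.

```ocaml
(* Hypothetical stand-in for the elements of a TJ array; these are
   not CamlPDF's own types. *)
type tj_item =
  | Str of string   (* character codes to show *)
  | Kern of float   (* adjustment in thousandths of a text-space unit *)

(* Assumption: in this font's encoding, code 19 is a floating acute. *)
let combining_of_code = function
  | 19 -> Some "\xCC\x81"  (* U+0301 COMBINING ACUTE ACCENT in UTF-8 *)
  | _ -> None

(* Emit ordinary characters as they come. A floating accent is kept
   pending and written after the next base character. Kerns only move
   the pen, so they are ignored. An accent with no following base
   character is dropped. *)
let utf8_of_tj items =
  let buf = Buffer.create 16 in
  let pending = ref None in
  let add_base c =
    Buffer.add_char buf c;
    match !pending with
    | Some accent -> Buffer.add_string buf accent; pending := None
    | None -> ()
  in
  List.iter
    (function
      | Kern _ -> ()
      | Str s ->
        String.iter
          (fun c ->
            match combining_of_code (Char.code c) with
            | Some accent -> pending := Some accent
            | None -> add_base c)
          s)
    items;
  Buffer.contents buf

let universite =
  utf8_of_tj [Str "Universit"; Str "\019"; Kern 486.; Str "e"]
```

The result is in decomposed form ('e' followed by U+0301). A real extractor would take the accent mapping from the font's encoding or ToUnicode data rather than hard-coding it.
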
> For "efficient", with "ffi" being ligated, I get
>
> Pdfops_TJ (Pdf.Array [Pdf.String "e\014cient"])

In the font in use here, character 14 (decimal, as in the "\014" escape) appears to be a single glyph for the ffi ligature.

> How can I convert these back, especially the ligature? I tried to use the
> conversion functions of Pdftext, like codepoints_of_text followed by
> utf8_of_codepoints, but that didn't seem to work. It's highly possible
> that I'm also doing it wrong here.

Drop me a note off-list with the example file and I'll take a look.

Text extraction is almost always possible with modern PDFs. With older PDFs, text extraction sometimes isn't possible. You can test this by trying to copy & paste the text in Adobe Reader: if it comes out correct, CamlPDF will be able to extract the text too; if not, it won't.

With Thanks,

-- 
John Whitington
Director, Coherent Graphics Ltd
http://www.coherentpdf.com/
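
P.S. For the ligature case, a rough sketch of the idea in plain OCaml, without CamlPDF. The function names are hypothetical and the "code 14 is ffi" mapping is an assumption about this particular font; in general the mapping has to come from the font's encoding or ToUnicode data.

```ocaml
(* Assumption: in this font only, code 14 is the single ffi glyph. *)
let expansion_of_code = function
  | 14 -> Some "ffi"
  | _ -> None

(* Replace each ligature code by the letters it stands for and copy
   every other byte through unchanged. *)
let expand_ligatures s =
  let buf = Buffer.create (String.length s + 2) in
  String.iter
    (fun c ->
      match expansion_of_code (Char.code c) with
      | Some letters -> Buffer.add_string buf letters
      | None -> Buffer.add_char buf c)
    s;
  Buffer.contents buf

let efficient = expand_ligatures "e\014cient"
```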