Hi,

On 15/08/2013 13:21, John Whitington wrote:

> The first new release of the CamlPDF library for a while is here:
>
> http://www.github.com/johnwhitington/camlpdf
>
> (Or, shortly, via OPAM.)

Thanks!

I have been playing with CamlPDF a bit, trying to do text extraction.
I'm a total novice with the PDF format, so I might be doing it wrong,
but I was wondering whether CamlPDF has facilities to handle
diacritics and ligatures.

For example, when reading the PDF operators for "Université", I get

Pdfops_TJ (Pdf.Array [Pdf.String "Universit"; Pdf.String "\019"; Pdf.Real 486.; Pdf.String "e"])

For "efficient", with "ffi" being ligated, I get

Pdfops_TJ (Pdf.Array [Pdf.String "e\014cient"])

How can I convert these back to readable text, especially the ligature?
I tried the conversion functions of Pdftext, such as codepoints_of_text
followed by utf8_of_codepoints, but that didn't seem to work. It's quite
possible that I'm using them incorrectly too.
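
To make the question concrete: the codes \019 and \014 look like the
acute-accent and "ffi" slots of TeX's OT1 font encoding, so my guess
(unconfirmed) is that the PDF was produced by LaTeX with a Computer Modern
font. Under that assumption, a hand-written table would do the conversion.
The sketch below does not use CamlPDF at all: tj_item is a local stand-in
for the TJ array elements, and ot1_ligature, ot1_accent and decode_tj are
hypothetical helper names, with tables covering only a few OT1 slots.

```ocaml
(* Local stand-in for the elements of a TJ array; not CamlPDF's own type. *)
type tj_item = Str of string | Kern of float

(* Assumed OT1 ligature slots (decimal codes 11 to 15). *)
let ot1_ligature = function
  | '\011' -> Some "ff"
  | '\012' -> Some "fi"
  | '\013' -> Some "fl"
  | '\014' -> Some "ffi"
  | '\015' -> Some "ffl"
  | _ -> None

(* Assumed OT1 accent slots, mapped to UTF-8 combining marks. *)
let ot1_accent = function
  | '\018' -> Some "\xCC\x80" (* U+0300 combining grave accent *)
  | '\019' -> Some "\xCC\x81" (* U+0301 combining acute accent *)
  | _ -> None

(* Decode a TJ array to UTF-8. The accent glyph arrives before its base
   letter, while Unicode combining marks follow the base, so an accent is
   held back until the next non-accent character has been emitted.
   Kerning adjustments are ignored. *)
let decode_tj items =
  let buf = Buffer.create 64 in
  let pending = ref None in
  let add_char c =
    match ot1_accent c with
    | Some mark -> pending := Some mark
    | None ->
      (match ot1_ligature c with
       | Some s -> Buffer.add_string buf s
       | None -> Buffer.add_char buf c);
      (match !pending with
       | Some mark -> Buffer.add_string buf mark; pending := None
       | None -> ())
  in
  List.iter
    (function Str s -> String.iter add_char s | Kern _ -> ())
    items;
  Buffer.contents buf
```

With this, decode_tj [Str "e\014cient"] gives "efficient", and the
"Université" array gives "Universite" followed by a combining acute accent
(decomposed form; composing it into a single "é" would be a separate
normalisation step). A table like this only works for one font encoding,
though, so I'd much rather use the font's own encoding information if
CamlPDF can give me that.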

Armaël