Hi,
On 15/08/2013 13:21, John Whitington wrote:
> The first new release of the CamlPDF library for a while is here:
> http://www.github.com/johnwhitington/camlpdf
> (Or, shortly, via OPAM.)
Thanks!
I have been playing with CamlPDF a bit, trying to do text extraction.
I'm a total novice with the PDF format, so I might be doing it wrong,
but I was wondering whether CamlPDF has facilities to handle
diacritics and ligatures.
For example, when reading the PDF operators for "Université", I get
  Pdfops_TJ (Pdf.Array [Pdf.String "Universit"; Pdf.String "\019"; Pdf.Real 486.; Pdf.String "e"])
For "efficient", where "ffi" is set as a ligature, I get
  Pdfops_TJ (Pdf.Array [Pdf.String "e\014cient"])
How can I convert these back to readable text, especially the ligature?
I tried the conversion functions in Pdftext, such as codepoints_of_text
followed by utf8_of_codepoints, but that didn't seem to work. It's
quite possible that I'm also doing it wrong here.
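To make the desired result concrete, here is a minimal sketch of the conversion in plain OCaml, without CamlPDF. It rests on an assumption the operators above do not confirm: that the font uses the TeX OT1 (Computer Modern) layout, in which codes 11 to 15 are the f-ligatures and code 19 is the acute accent. The type tj_elt and the functions codepoints_of_byte and decode_tj are hypothetical names that stand in for the Pdf.String / Pdf.Real elements of a TJ array. They are not part of any library.

```ocaml
(* Hypothetical stand-in for the elements of a TJ array. *)
type tj_elt =
  | Str of string   (* a shown string, like Pdf.String *)
  | Kern of float   (* a position adjustment, like Pdf.Real *)

(* ASSUMPTION: TeX OT1 font layout. Maps one character code to the
   Unicode code points it stands for. *)
let codepoints_of_byte c =
  match Char.code c with
  | 11 -> [0x66; 0x66]        (* ff *)
  | 12 -> [0x66; 0x69]        (* fi *)
  | 13 -> [0x66; 0x6C]        (* fl *)
  | 14 -> [0x66; 0x66; 0x69]  (* ffi *)
  | 15 -> [0x66; 0x66; 0x6C]  (* ffl *)
  | 19 -> [0x0301]            (* acute accent, as combining U+0301 *)
  | n -> [n]                  (* everything else is kept unchanged *)

(* Decode a TJ array to UTF-8. The accent glyph is drawn before its
   base letter, but a Unicode combining mark must follow the base, so
   accents are held back until the next character has been emitted.
   Kerning adjustments are ignored. *)
let decode_tj elts =
  let out = Buffer.create 64 in
  let pending = ref [] in
  let emit cp = Buffer.add_utf_8_uchar out (Uchar.of_int cp) in
  let flush () =
    List.iter emit (List.rev !pending);
    pending := []
  in
  let handle_char c =
    match codepoints_of_byte c with
    | [cp] when cp = 0x0301 -> pending := cp :: !pending
    | cps -> List.iter emit cps; flush ()
  in
  List.iter
    (function
      | Str s -> String.iter handle_char s
      | Kern _ -> ())
    elts;
  flush ();
  Buffer.contents out
```

With this table, decode_tj [Str "e\014cient"] gives "efficient". The "Université" array gives "Universite" followed by the combining acute U+0301, which is the decomposed form of "é"; obtaining the single precomposed character would need a Unicode normalization step that this sketch leaves out. A real solution would presumably read the mapping from the font's encoding or ToUnicode data instead of hard-coding it.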
Armaël