From: "Sigrid Solveig Haflínudóttir"
To: 9front@9front.org
Date: Mon, 31 May 2021 01:33:34 +0200
Subject: Re: [9front] PDF search bounty
In-Reply-To: <20CB1473-5752-4552-BE6E-B86988738BB6@cpan.org>
Reply-To: 9front@9front.org

Quoth Romano:
> What is included in Sigrid's attempted pdffs? Perhaps that would include search.
>
> On May 30, 2021 10:59:04 PM UTC, Stanley Lieber wrote:
> >On May 30, 2021 4:10:56 PM EDT, binary cat
> >wrote:
> >>What is the state of the $200 bounty on searching through PDFs?
> >>I thought I might give it a shot.
> >>
> >
> >i'm not aware of anyone having done any work on this.
> >
> >sl

Mostly just object extraction. Text, images, etc. Unpacking (gzip, lzw
and so on). The part that is required for pdf2text is not there, but is
allegedly not too complex to implement. Page contents usually are a
bunch of drawing operations that also include parts of text being
placed in specific locations (defined by coordinates X and Y) on the
page.

Search was definitely part of the plan. Further development has been
stalled due to the assumption that it might get accepted as a GSOC
project. Since that did not happen, I will continue as I have free time
(and will) for this.
Noam might do that too.
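
To illustrate the point about page contents being drawing operations with text placed at X and Y coordinates, here is a minimal sketch. It is not pdffs code and not a full pdf2text. It assumes the content stream is already unpacked, that strings are plain literal strings in a Latin-1 compatible encoding (no hex strings, no font encoding tables), and that only the BT, Td/TD, Tm, Tj and TJ operators matter. The names `extract_text` and `search` are made up for this example. Positions are the start of each text line only; the advance caused by glyph widths and the page transformation matrix are ignored.

```python
import re

# Tokens of a (heavily simplified) content stream: literal strings,
# array brackets, names, numbers, operators.
# Assumption: no hex strings and no nested unescaped parentheses.
TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"    # literal string, e.g. (Hello)
    rb"|\[|\]"                  # array delimiters used by TJ
    rb"|/[^\s\[\]()/<>]+"       # name, e.g. /F1
    rb"|[-+]?\d*\.?\d+"         # number
    rb"|[A-Za-z*]+"             # operator, e.g. Td, Tj, T*
)

def unescape(raw):
    """Decode a literal string token, handling only \\( \\) and \\\\."""
    body = re.sub(rb"\\([()\\])", rb"\1", raw[1:-1])
    return body.decode("latin-1")

def extract_text(stream):
    """Return a list of (x, y, text) for every string shown in the stream."""
    out = []
    operands = []
    x = y = 0.0  # start of the current text line
    for m in TOKEN.finditer(stream):
        tok = m.group()
        if tok.startswith(b"("):
            operands.append(unescape(tok))
        elif tok in (b"[", b"]"):
            continue  # array elements simply pile up as operands
        elif tok.startswith(b"/"):
            operands.append(tok)
        elif tok[:1] in b"+-.0123456789":
            operands.append(float(tok))
        else:
            # Anything else is treated as an operator consuming the operands.
            if tok == b"BT":
                x = y = 0.0
            elif tok in (b"Td", b"TD"):
                x += operands[-2]
                y += operands[-1]
            elif tok == b"Tm":
                # Last two of the six matrix entries are the translation.
                x, y = operands[-2], operands[-1]
            elif tok == b"Tj":
                out.append((x, y, operands[-1]))
            elif tok == b"TJ":
                # Numbers inside the array are kerning adjustments; skip them.
                text = "".join(o for o in operands if isinstance(o, str))
                out.append((x, y, text))
            operands = []
    return out

def search(stream, needle):
    """Case-insensitive search; returns the (x, y) of matching strings."""
    needle = needle.lower()
    return [(x, y) for x, y, text in extract_text(stream)
            if needle in text.lower()]

sample = b"""
BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
0 -14 Td
[(Se) -20 (arch)] TJ
ET
"""

for x, y, text in extract_text(sample):
    print(x, y, text)
```

A real implementation would also have to map character codes through the font's encoding or ToUnicode table, and join strings that belong to the same word or line before searching, since a single word is often split across several operations.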