I want to extract bibliographic data from Amazon pages

public inbox archive for pandoc-discuss@googlegroups.com

* I want to extract bibliographic data from Amazon pages
@ 2022-12-10  9:05 Trevor Jenkins
       [not found] ` <C57B5FA0-9810-4234-A8A8-C828D6CF27F6-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Trevor Jenkins @ 2022-12-10  9:05 UTC (permalink / raw)
  To: pandoc-discuss

My current workflow for getting bibliographic data from Amazon’s book listings is failing. I use BibDesk as my primary citation manager, but it does not extract data from Amazon listings, so for that I use a lashed-up scheme built on Zotero. Zotero has a browser add-on which extracts the bibliographic information from these pages. Then in Zotero I have a third-party script that sends that data to BibDesk. This has worked well for a year or more.

However, there are two problems with my method. The first is that the third-party script for extraction from Zotero does not work with the current version of the program. I downgraded Zotero to an earlier version and that restored my workflow. The second is that changes to the browser add-on now appear to be incompatible with that older version, so my workflow is unreliable: it may or may not add the data to Zotero.

As pandoc can process both HTML and BibTeX formats, I wonder whether and how I could harness that capability to finally drop Zotero altogether, as it was only ever meant to be a stopgap anyway. A simplistic

pandoc -f html -t bibtex …

using the specific URL for the book I want to add does not work; I did not expect it to. That leaves me wondering whether a Lua script might be required to do the job. I am not conversant with Lua at all, so my idea is on hold.

Is it possible to get pandoc to do the required extraction and if so what might a Lua script look like?

Regards, Trevor.

<>< Re: deemed!

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/C57B5FA0-9810-4234-A8A8-C828D6CF27F6%40gmail.com.


* Re: I want to extract bibliographic data from Amazon pages
       [not found] ` <C57B5FA0-9810-4234-A8A8-C828D6CF27F6-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2022-12-10 11:44   ` denis.maier-NSENcxR/0n0
       [not found]     ` <03d11be1c7b64ed0b31a56f5eb209f88-NSENcxR/0n0@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: denis.maier-NSENcxR/0n0 @ 2022-12-10 11:44 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Although it might be possible with pandoc, I doubt it is the best tool for the job.
I'd use a language such as Python (or whatever you are most comfortable with) for that. You'll need to do some web scraping, read the relevant parts of the webpage, and output to BibTeX. I bet there's a library for the last part.
For the scraping part you can have a look at what the Zotero importer does: https://github.com/zotero/translators/blob/master/Amazon.js
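
A minimal sketch of the "read the relevant parts, output to BibTeX" steps, using only the Python standard library. The sample HTML, the meta tag names, the citation key, and the helper names (MetaCollector, to_bibtex) are all made up for illustration; a real Amazon listing is laid out differently, so the extraction logic would have to follow what the Zotero translator does.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("name") and attr.get("content"):
            self.fields[attr["name"]] = attr["content"]


def to_bibtex(key, fields):
    """Format a dict of fields as a BibTeX @book entry (no escaping is done)."""
    lines = [f"@book{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical page fragment; a real listing will not look like this.
SAMPLE_HTML = """
<html><head>
<meta name="title" content="An Example Book">
<meta name="author" content="Doe, Jane">
<meta name="publisher" content="Example Press">
<meta name="year" content="2020">
</head></html>
"""

parser = MetaCollector()
parser.feed(SAMPLE_HTML)
entry = to_bibtex("doe2020", parser.fields)
print(entry)
```

The formatter is deliberately naive: it does not escape braces or other special characters in field values, which a dedicated BibTeX library would presumably handle.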


* Re: I want to extract bibliographic data from Amazon pages
       [not found]     ` <03d11be1c7b64ed0b31a56f5eb209f88-NSENcxR/0n0@public.gmane.org>
@ 2022-12-10 12:39       ` denis.maier-NSENcxR/0n0
       [not found]         ` <0394e3cb78574a3b986a66479e6253e8-NSENcxR/0n0@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: denis.maier-NSENcxR/0n0 @ 2022-12-10 12:39 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Oh, and I'm almost certain there must be an existing command-line tool that lets you retrieve bibliographic data if you provide an ISBN or DOI. The data might not come from Amazon, but their data isn't the best anyway.
If you use Emacs, there's org-ref, which contains https://github.com/jkitchin/org-ref/blob/master/org-ref-isbn.el
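
For the ISBN route, a rough Python sketch of getting an ISBN out of a listing URL before handing it to a lookup tool. It assumes that the ten-character code after /dp/ in the URL is the book's ISBN-10, which seems to hold for many print books but not for Kindle editions and other products. The URLs and function names below are made up for illustration; the checksum rules are the standard ISBN-10 and ISBN-13 ones.

```python
import re


def isbn10_from_url(url):
    """Return the ISBN-10-shaped code after /dp/ or /gp/product/, if any."""
    match = re.search(r"/(?:dp|gp/product)/([0-9]{9}[0-9Xx])(?:[/?]|$)", url)
    return match.group(1).upper() if match else None


def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum: weights 10 down to 1, sum divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    digits = [10 if ch == "X" else int(ch) for ch in isbn]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0


def isbn10_to_isbn13(isbn10):
    """Prefix 978 to the first nine digits and recompute the check digit."""
    core = "978" + isbn10[:9]
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    return core + str((10 - total % 10) % 10)


# Hypothetical listing URL; the host name is a placeholder.
url = "https://www.amazon.example/Some-Title/dp/0306406152/ref=sr_1_1"
isbn10 = isbn10_from_url(url)
if isbn10 and is_valid_isbn10(isbn10):
    print(isbn10_to_isbn13(isbn10))
```

Validating the checksum first guards against codes that merely look like an ISBN-10; anything that fails would need the scraping approach instead.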


* Re: I want to extract bibliographic data from Amazon pages
       [not found]         ` <0394e3cb78574a3b986a66479e6253e8-NSENcxR/0n0@public.gmane.org>
@ 2022-12-10 15:08           ` Trevor Jenkins
       [not found]             ` <6A66AECA-AAFF-4195-BA35-039A85E847EE-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Trevor Jenkins @ 2022-12-10 15:08 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 10 Dec 2022, at 12:39, <denis.maier-NSENcxR/0n0@public.gmane.org> wrote:

> Oh, and I'm almost certain there must be an existing command-line tool that lets you retrieve bibliographic data if you provide an ISBN or DOI. The data might not come from Amazon, but their data isn't the best anyway.

There are several, but one needs to know which national agency to go to for the search. Plus, they have no access to my Amazon purchase history.

> If you use Emacs, there's org-ref, which contains https://github.com/jkitchin/org-ref/blob/master/org-ref-isbn.el

I haven’t used Emacs in at least two decades. Thanks for the link to the source. Coupled with the earlier link to the Zotero browser plug-in, I have something to work with.

Regards, Trevor.

<>< Re: deemed!


* Re: I want to extract bibliographic data from Amazon pages
       [not found]             ` <6A66AECA-AAFF-4195-BA35-039A85E847EE-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2022-12-14  9:17               ` denis.maier-NSENcxR/0n0
  0 siblings, 0 replies; 5+ messages in thread
From: denis.maier-NSENcxR/0n0 @ 2022-12-14  9:17 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Another interesting project that might be helpful: https://github.com/mpedramfar/zotra

This runs in Emacs, but uses the Zotero translators.
