idea: Module to automatically extract and insert information from Wikipedia

ntg-context - mailing list for ConTeXt users

* idea: Module to automatically extract and insert information from Wikipedia
@ 2011-11-12 15:19 Paul Menzel
  2011-11-12 16:31 ` Philipp Gesang
  2011-11-13  6:07 ` Aditya Mahajan
  0 siblings, 2 replies; 5+ messages in thread
From: Paul Menzel @ 2011-11-12 15:19 UTC
  To: ntg-context


Dear ConTeXt folks,

I just thought of the following, and I am wondering whether a solution
already exists.

When writing a text that mentions people, I want to add information
about them as footnotes. The first sentence of a Wikipedia article
is usually good enough for that.

A macro `\infofromwikipedia{Donald Knuth}` would be nice: it would fetch
the first sentence of the article and add an entry to the bibliography.

There is even an API to access articles [2]. Besides coding that up I
see the following problems.

1. The output [3] needs to be converted to ConTeXt.
2. An Internet connection would be necessary. But that is just a note
and not a problem.
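
One way to picture the conversion step in item 1 is the following minimal Python sketch. It assumes the extracted sentence is already plain text and only needs ConTeXt's special characters escaped before being wrapped in `\footnote{...}`. The helper names `escape_context` and `as_footnote` are hypothetical, and the choice of `\letterbackslash`, `\letterhat`, `\lettertilde` and `\letterbar` for the awkward characters is an assumption about what is convenient, not part of any existing module.

```python
# Hypothetical helpers for illustration; not part of ConTeXt or any module.
# Map each character that is special in ConTeXt to a safe replacement.
_CONTEXT_SPECIALS = {
    "\\": r"\letterbackslash{}",
    "^": r"\letterhat{}",
    "~": r"\lettertilde{}",
    "|": r"\letterbar{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
}


def escape_context(text: str) -> str:
    """Escape characters one at a time so nothing is escaped twice."""
    return "".join(_CONTEXT_SPECIALS.get(ch, ch) for ch in text)


def as_footnote(sentence: str) -> str:
    """Wrap an already extracted sentence in a ConTeXt footnote."""
    return r"\footnote{" + escape_context(sentence) + "}"


if __name__ == "__main__":
    print(as_footnote("Knuth wrote 100% of TeX & METAFONT."))
```

Escaping character by character, rather than with chained `str.replace` calls, avoids re-escaping the backslashes that earlier replacements introduce.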

Thanks,

Paul

[1] https://en.wikipedia.org/wiki/Donald_Knuth
[2] http://www.mediawiki.org/wiki/API
[3] http://www.mediawiki.org/wiki/API:Data_formats#Output


___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________


* Re: idea: Module to automatically extract and insert information from Wikipedia
  2011-11-12 15:19 idea: Module to automatically extract and insert information from Wikipedia Paul Menzel
@ 2011-11-12 16:31 ` Philipp Gesang
  2011-11-12 16:40   ` Khaled Hosny
  2011-11-13  6:07 ` Aditya Mahajan
  1 sibling, 1 reply; 5+ messages in thread
From: Philipp Gesang @ 2011-11-12 16:31 UTC
  To: mailing list for ConTeXt users



Hi Paul,

On 2011-11-12 16:19, Paul Menzel wrote:
> A macro `\infofromwikipedia{Donald Knuth}` would be nice: it would fetch
> the first sentence of the article and add an entry to the bibliography.
> 
> There is even an API to access articles [2]. Besides coding that up I
> see the following problems.
> 
> 1. The output [3] needs to be converted to ConTeXt.
> 2. An Internet connection would be necessary. But that is just a note
> and not a problem.

you could take this as a starting point:
  <https://bitbucket.org/phg/context-acceptor/>
and implement a function that ignores everything but the first
text paragraph. Autodownload should work for the English WP.
(I’m sorry I have no time to do this myself atm.)

Btw. as “Sentence” is not a markup category of wikitext, there is
no sentence recognition built in ... ymmv.

(Beware that processing wikitext from WP is extremely
complicated because WP relies on special plugins (“templates” and
stuff). So the only way to make sure that a parser accepts any
well-formed WP page would be to include all those plugins. Which
would entail rewriting the PHP code in Lua for use as a ConTeXt
script. And then you’d have to decide for every plugin what its
output should look like in ConTeXt.[0] If you have the time ...)

Good luck
Philipp

[0] Get an impression of how much work this can be at
    http://en.wikipedia.org/wiki/Wikipedia:List_of_templates
    The more important ones are at
    http://en.wikipedia.org/wiki/Category:Infobox_templates
    

> Thanks,
> 
> Paul
> 
> 
> [1] https://en.wikipedia.org/wiki/Donald_Knuth
> [2] http://www.mediawiki.org/wiki/API
> [3] http://www.mediawiki.org/wiki/API:Data_formats#Output




* Re: idea: Module to automatically extract and insert information from Wikipedia
  2011-11-12 16:31 ` Philipp Gesang
@ 2011-11-12 16:40   ` Khaled Hosny
  2011-11-12 17:11     ` Hans Hagen
  0 siblings, 1 reply; 5+ messages in thread
From: Khaled Hosny @ 2011-11-12 16:40 UTC
  To: mailing list for ConTeXt users



On Sat, Nov 12, 2011 at 05:31:23PM +0100, Philipp Gesang wrote:
> (Beware that processing wikitext from WP is extremely
> complicated because WP relies on special plugins (“templates” and
> stuff). So the only way to make sure that a parser accepts any
> well-formed WP page would be to include all those plugins. Which
> would entail rewriting the PHP code in Lua for use as a ConTeXt
> script. And then you’d have to decide for every plugin what its
> output should look like in ConTeXt.[0] If you have the time ...)

I think scraping the MediaWiki-generated HTML would be simpler.
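
A minimal sketch of that scraping idea, using only Python's standard `html.parser`. It assumes the first non-empty `<p>` element of the rendered page is the lead paragraph, which is a guess about the page layout and will not hold for every article (disambiguation pages, pages that start with a hatnote, and so on). The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    # Content of these tags is dropped: footnote markers such as [1]
    # live in <sup>, and inline <style>/<script> is never prose.
    SKIP = {"sup", "style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False
        self.skip_depth = 0
        self.chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return
        if tag == "p":
            self.in_p = True
            self.chunks = []
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.result is not None:
            return
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            # Collapse runs of whitespace; empty paragraphs are ignored.
            text = " ".join("".join(self.chunks).split())
            if text:
                self.result = text

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0 and self.result is None:
            self.chunks.append(data)


def first_paragraph(html: str) -> str:
    """Return the first non-empty paragraph as plain text ('' if none)."""
    parser = FirstParagraph()
    parser.feed(html)
    return parser.result or ""
```

Fetching the page itself (for example with `urllib.request`) is left out on purpose, so the sketch can be tried on a saved HTML file without a network connection.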

Regards,
 Khaled


* Re: idea: Module to automatically extract and insert information from Wikipedia
  2011-11-12 16:40   ` Khaled Hosny
@ 2011-11-12 17:11     ` Hans Hagen
  0 siblings, 0 replies; 5+ messages in thread
From: Hans Hagen @ 2011-11-12 17:11 UTC
  To: mailing list for ConTeXt users

On 12-11-2011 17:40, Khaled Hosny wrote:
> On Sat, Nov 12, 2011 at 05:31:23PM +0100, Philipp Gesang wrote:
>> (Beware that processing wikitext from WP is extremely
>> complicated because WP relies on special plugins (“templates” and
>> stuff). So the only way to make sure that a parser accepts any
>> well-formed WP page would be to include all those plugins. Which
>> would entail rewriting the PHP code in Lua for use as a ConTeXt
>> script. And then you’d have to decide for every plugin what its
>> output should look like in ConTeXt.[0] If you have the time ...)
>
> I think scraping the MediaWiki-generated HTML would be simpler.

Doesn't it also depend on the first line being recognizable as such?
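
To illustrate why recognizing the first sentence is the fragile part, here is a rough Python heuristic, offered as a sketch rather than a solution. It treats `.`, `!` or `?` followed by whitespace and an uppercase letter as a sentence end, unless the period follows a single letter (an initial such as "E.") or a word from a small, hand-picked abbreviation list. Both the list and the function name are assumptions made for this example.

```python
import re

# Hand-picked and deliberately incomplete; an assumption for illustration.
_ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st", "jr", "sr", "ca"}

# Sentence-ending punctuation, then whitespace, then something that looks
# like the start of a new sentence.
_BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z\"“(])")


def first_sentence(paragraph: str) -> str:
    """Return a best guess at the first sentence of a paragraph."""
    for match in _BOUNDARY.finditer(paragraph):
        words = paragraph[: match.start()].split()
        last = words[-1].lstrip("(").lower() if words else ""
        # Skip periods after initials ("E.") and known abbreviations.
        if match.group(1) == "." and (len(last) == 1 or last in _ABBREVIATIONS):
            continue
        return paragraph[: match.end(1)]
    # No boundary found: the paragraph is treated as a single sentence.
    return paragraph.strip()


if __name__ == "__main__":
    text = ("Donald E. Knuth (born January 10, 1938) is an American "
            "computer scientist. He is the author of TAOCP.")
    print(first_sentence(text))
```

The heuristic fails in both directions: a sentence that really ends in a single letter ("... rich in vitamin C. It ...") is not split, and an abbreviation missing from the list causes an early cut.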

Hans


-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------

* Re: idea: Module to automatically extract and insert information from Wikipedia
  2011-11-12 15:19 idea: Module to automatically extract and insert information from Wikipedia Paul Menzel
  2011-11-12 16:31 ` Philipp Gesang
@ 2011-11-13  6:07 ` Aditya Mahajan
  1 sibling, 0 replies; 5+ messages in thread
From: Aditya Mahajan @ 2011-11-13  6:07 UTC
  To: mailing list for ConTeXt users

On Sat, 12 Nov 2011, Paul Menzel wrote:

> I just thought of the following, and I am wondering whether a solution
> already exists.

Not exactly for Wikipedia, but I have an experimental module that pulls
information from the web. I use it to get images from sites like yuml.me
and websequencediagrams.com.

https://github.com/adityam/context-webfilter

See test/ directory for examples.

> When writing a text that mentions people, I want to add information
> about them as footnotes. The first sentence of a Wikipedia article
> is usually good enough for that.
>
> A macro `\infofromwikipedia{Donald Knuth}` would be nice: it would fetch
> the first sentence of the article and add an entry to the bibliography.

This actually requires a more detailed spec. What happens if more than
one person has the same name? For example:

http://en.wikipedia.org/wiki/Wolfgang_Schuster

> There is even an API to access articles [2]. Besides coding that up I
> see the following problems.
>
> 1. The output [3] needs to be converted to ConTeXt.

I don't see anything in the API specs that returns the contents of the
page. My guess is that simply downloading the HTML page and scraping the
main paragraph might be easier. Once the data is retrieved, using ConTeXt
to typeset HTML is fairly easy.

Another option is to just use one of the existing scripts to scrape the
first paragraph/first line from Wikipedia, e.g.,

http://stackoverflow.com/questions/1565347/get-first-lines-of-wikipedia-article
http://query7.com/scrape-the-first-paragraph-image-from-a-wikipedia-entry

and use the filter module to call them.

Aditya
