ntg-context - mailing list for ConTeXt users
* Downloading long urls
@ 2011-01-16  5:12 Aditya Mahajan
  2011-01-21 17:15 ` Aditya Mahajan
  0 siblings, 1 reply; 8+ messages in thread
From: Aditya Mahajan @ 2011-01-16  5:12 UTC
  To: mailing list for ConTeXt users

Hi,

While downloading URLs, ConTeXt sanitizes the filename but does not check 
the length of the URL. So one can end up with a filename that is too long 
for the operating system to handle. For example, the following fails on 
32-bit Linux.

\enabletrackers[resolvers.schemes]
\startluacode
   local report_webfilter = logs.new("thirddata.webfilter")

   local url = 
"http://www.bing.com/search?q=AreallyreallylongstringjusttoseehowthingsworkordontworkAreallyreallylongstringjusttoseehowthingsworkordontworkAreallyreallylongstringjusttoseehowthingsworkordontworkAreallyreallylongstringjusttoseehowthingsworkordontworAreallyreallylongstringjusttoseehowthingsworkordontworkkAreallyreallylongstringjusttoseehowthingsworkordontwork"

   local specification = resolvers.splitmethod(url)

   local file       = resolvers.finders['http'](specification) or ""

   if file and file ~= "" then
     report_webfilter("saving file %s", file)
   else
     report_webfilter("download failed")
   end
\stopluacode

\normalend
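
The length problem can be seen without ConTeXt at all. The following is a 
minimal plain-Lua sketch, not the code in data-sch.lua: it applies the same 
gsub pattern to a URL built like the one above and compares the result with 
a limit of 255 bytes. That limit is an assumption (it is the usual maximum 
filename length on common Linux filesystems), and max_name_length is a name 
made up for this illustration.

```lua
-- Plain Lua sketch; max_name_length is an assumed limit (255 bytes is the
-- usual maximum on common Linux filesystems), not something ConTeXt checks.
local max_name_length = 255

local chunk    = "Areallyreallylongstringjusttoseehowthingsworkordontwork"
local original = "http://www.bing.com/search?q=" .. string.rep(chunk, 6)

-- Same pattern as quoted from data-sch.lua: every run of characters that is
-- not a letter, digit or dot collapses into a single hyphen.
local cleanname = string.gsub(original, "[^%a%d%.]+", "-")

print(cleanname:sub(1, 27))          -- the sanitized prefix of the name
print(#original, #cleanname)         -- sanitizing barely shortens the name
print(#cleanname > max_name_length)  -- true means the name is too long
```

The sanitized name is almost as long as the URL itself, so any sufficiently 
long URL ends up over the limit.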

Is there a robust way to avoid this problem? One possibility is that in 
data-sch.lua instead of

     local cleanname = gsub(original,"[^%a%d%.]+","-")

use

     local cleanname = md5sum(original)

What do you think?

Aditya

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________



* Re: Downloading long urls
  2011-01-16  5:12 Downloading long urls Aditya Mahajan
@ 2011-01-21 17:15 ` Aditya Mahajan
  2011-01-21 22:53   ` Hans Hagen
  0 siblings, 1 reply; 8+ messages in thread
From: Aditya Mahajan @ 2011-01-21 17:15 UTC
  To: mailing list for ConTeXt users

On Sun, 16 Jan 2011, Aditya Mahajan wrote:

> Is there a robust way to avoid this problem? One possibility is that in 
> data-sch.lua instead of
>
>    local cleanname = gsub(original,"[^%a%d%.]+","-")
>
> use

     local cleanname = md5.HEX(original) -- gsub(original,"[^%a%d%.]+","-")

appears to work correctly in my tests. The drawback of this scheme is that 
instead of

    \externalfigure[url ending with .png]

one would have to use

    \externalfigure[url ending with .png][method=png]

But \input 'url ending with .tex' still works.

The other drawback is that the filenames in the cache will be gibberish. 
But on the plus side, you can use long URLs.

Do you think that the drawbacks outweigh the gains?

I need this for the webfilter module, where the URL can get pretty long. I 
can always write my own http_get function, but that would mostly be a 
repetition of data-sch.lua.

Aditya

* Re: Downloading long urls
  2011-01-21 17:15 ` Aditya Mahajan
@ 2011-01-21 22:53   ` Hans Hagen
  2011-01-22  0:20     ` Aditya Mahajan
  0 siblings, 1 reply; 8+ messages in thread
From: Hans Hagen @ 2011-01-21 22:53 UTC
  To: mailing list for ConTeXt users

On 21-1-2011 6:15, Aditya Mahajan wrote:

> local cleanname = md5.HEX(original) -- gsub(original,"[^%a%d%.]+","-")
>
> appears to work correctly in my tests. The drawback of this scheme is
> that instead of
>
> \externalfigure[url ending with .png]
>
> one would have to use
>
> \externalfigure[url ending with .png][method=png]
>
> But \input 'url ending with .tex' still works
>
> The other drawback is the filenames in the cache will be gibberish. But
> on the plus side, you can use long urls.
>
> Do you think that the drawbacks outweigh the gains?

What exactly do you mean by the suffix issue? We can probably 
normalize things a bit. Concerning the gibberish ... we can put a file 
with some info alongside. I need to think a bit about it but indeed it 
makes no sense to have redundant mechanisms.

Hans

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------

* Re: Downloading long urls
  2011-01-21 22:53   ` Hans Hagen
@ 2011-01-22  0:20     ` Aditya Mahajan
  2011-01-23 20:22       ` Hans Hagen
  0 siblings, 1 reply; 8+ messages in thread
From: Aditya Mahajan @ 2011-01-22  0:20 UTC
  To: Hans Hagen; +Cc: mailing list for ConTeXt users

On Fri, 21 Jan 2011, Hans Hagen wrote:

> What exactly do you mean with the suffix issue?

Consider

  \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png]

The current implementation downloads this file as

<path-to-current-cache>/http-contextgarden.files.wordpress.com-2008-08-logo-alt41.png

Then \externalfigure sees a file with a .png extension and includes it 
correctly.

If you follow my suggestion, the file will be downloaded as

<path-to-current-cache>/667816068B899068327DA1EF013B3943

Then \externalfigure sees a file with no extension, assumes that the file 
is a pdf file, and the figure inclusion fails. To correct that, you need 
to add [method=png] to \externalfigure.

> We can probably normalize things a bit.

Agreed. Perhaps the best option would be a file name like

http-contextgarden.files.wordpress.com-667816068B899068327DA1EF013B3943.png

(so normalized base url + md5sum of url + extension). I am not sure whether 
extensions can be determined reliably from URLs. In particular, imagine 
something like

http://www.bing.com/search?q=check+.extension+long+url+so+that+os+filename+limit+exceeds+....

A simple algorithm will assume that everything following the dot is the 
extension, which is certainly not the case here. We could restrict the 
search for the extension to the last 10 or so characters of the URL, but 
there will be cases where such a heuristic fails.
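
To make the difference concrete, here is a rough plain-Lua sketch of two 
ways to guess the extension. Both function names are made up for this 
illustration and neither is part of ConTeXt. The second one looks only at 
the last segment of the path, after dropping the fragment, the query 
string, and the scheme and host part.

```lua
-- Naive approach: everything after the last dot in the whole url.
local function naive_suffix(url)
    return url:match("%.([^%.]*)$") or ""
end

-- Hypothetical helper: search for a suffix only in the last path segment.
local function path_suffix(url)
    local path = url:gsub("#.*$", "")                 -- drop the fragment
    path = path:gsub("%?.*$", "")                     -- drop the query string
    path = path:gsub("^%a[%w%+%.%-]*://[^/]*", "")    -- drop scheme://host
    local segment = path:match("([^/]*)$")            -- last path segment
    return segment:match("%.(%w+)$") or ""
end

local figure = "http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png"
local query  = "http://www.bing.com/search?q=check+.extension+long+url"

print(naive_suffix(figure), path_suffix(figure))
print(naive_suffix(query),  path_suffix(query))
```

For the figure both give png. For the search URL the naive version returns 
part of the query, while the path-based one returns an empty string. This 
still fails when a server delivers, say, a PNG from a URL whose path has no 
suffix at all.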

> Concerning the gibberish ... we can put a file alongside with some info. 
> I need to think a bit about it but indeed it makes no sense to have 
> redundant mechanisms.

Thanks,
Aditya

* Re: Downloading long urls
  2011-01-22  0:20     ` Aditya Mahajan
@ 2011-01-23 20:22       ` Hans Hagen
  2011-01-23 20:34         ` Aditya Mahajan
  0 siblings, 1 reply; 8+ messages in thread
From: Hans Hagen @ 2011-01-23 20:22 UTC
  To: Aditya Mahajan; +Cc: mailing list for ConTeXt users

On 22-1-2011 1:20, Aditya Mahajan wrote:

> A simple algorithm with assume that everything following the dot is the
> extension, while that is certainly not the case here. We can definitely
> restrict the search of extension to the last 10 or so characters of the
> url, but there will be cases when such heuristics will fail.

it's not that complicated ... say that you patch this way:

function schemes.cleanname(specification)
     return (gsub(specification.original,"[^%a%d%.]+","-"))
end

local function fetch(specification)
     local original  = specification.original
     local scheme    = specification.scheme
     local cleanname = schemes.cleanname(specification)

that will be the current method. Now you can experiment with:

\startluacode
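function resolvers.schemes.cleanname(specification)
     local name = specification.original
     -- hash the full url but keep the suffix of the path, so that
     -- suffix based lookups (like figure inclusion) still work
     local hash = file.addsuffix(md5.hex(name),file.suffix(specification.path))
     logs.simple("%s => %s",name,hash)
     return hash
end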
\stopluacode

Just see how that works out
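
The effect on the name length can be tried outside ConTeXt with a small 
plain-Lua sketch. Stock Lua has no md5, so the sketch uses a toy 32-bit 
hash as a stand-in for md5.hex. The names toy_hash and hashed_name are 
made up for this illustration, and the suffix is passed in by hand instead 
of being taken from specification.path.

```lua
-- Stand-in for md5.hex: a djb2-style 32-bit hash, used only so that the
-- sketch runs in plain Lua. It is NOT md5 and collides far more easily.
local function toy_hash(s)
    local h = 5381
    for i = 1, #s do
        h = (h * 33 + s:byte(i)) % 4294967296
    end
    return string.format("%08x", h)
end

-- Hypothetical helper: hash of the url plus the suffix, if there is one.
local function hashed_name(url, suffix)
    local name = toy_hash(url)
    if suffix ~= "" then
        name = name .. "." .. suffix
    end
    return name
end

local short = "http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png"
local long  = "http://www.bing.com/search?q=" .. string.rep("long", 200)

print(hashed_name(short, "png"))
print(#hashed_name(short, "png"), #hashed_name(long, ""))
```

Whatever the length of the URL, the name has a fixed length (plus the 
suffix), which is the point of hashing here.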

Hans


* Re: Downloading long urls
  2011-01-23 20:22       ` Hans Hagen
@ 2011-01-23 20:34         ` Aditya Mahajan
  2011-01-23 20:54           ` Hans Hagen
  0 siblings, 1 reply; 8+ messages in thread
From: Aditya Mahajan @ 2011-01-23 20:34 UTC
  To: Hans Hagen; +Cc: mailing list for ConTeXt users

On Sun, 23 Jan 2011, Hans Hagen wrote:

> On 22-1-2011 1:20, Aditya Mahajan wrote:
>
>> A simple algorithm with assume that everything following the dot is the
>> extension, while that is certainly not the case here. We can definitely
>> restrict the search of extension to the last 10 or so characters of the
>> url, but there will be cases when such heuristics will fail.
>
> it's not that complicated ... say that you patch this way:
>
> function schemes.cleanname(specification)
>    return (gsub(specification.original,"[^%a%d%.]+","-"))
> end
>
> local function fetch(specification)
>    local original  = specification.original
>    local scheme    = specification.scheme
>    local cleanname = schemes.cleanname(specification)
>
> that will be the current method. Now you can experiment with:

Can cleanname be passed as a parameter of the specification? Then we can 
have

local cleanname = specification.cleanname or schemes.cleanname(specification)

This way, I can change the cleanname only for the files that are downloaded 
by my module, without affecting the cleanname for any other command that 
might want to download a file.
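
A rough plain-Lua sketch of the lookup order I have in mind. The schemes 
table below is a mock, and resolve_cleanname and the example URLs and names 
are made up for this illustration; this is not how data-sch.lua is 
currently written.

```lua
-- Mock of the relevant part of the resolver, for illustration only.
local schemes = { }

-- Default behaviour: the sanitizing gsub from the patch quoted above.
function schemes.cleanname(specification)
    return (string.gsub(specification.original, "[^%a%d%.]+", "-"))
end

-- Hypothetical helper: a name given in the specification wins, otherwise
-- fall back to the default cleanname function.
local function resolve_cleanname(specification)
    return specification.cleanname or schemes.cleanname(specification)
end

local default  = { original = "http://example.org/a/b.png" }
local override = {
    original  = "http://example.org/a/b.png",
    cleanname = "my-module-0001.png",
}

print(resolve_cleanname(default))   -- http-example.org-a-b.png
print(resolve_cleanname(override))  -- my-module-0001.png
```

Only a caller that sets the cleanname field gets a different name; every 
other download keeps the default behaviour.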

Aditya

* Re: Downloading long urls
  2011-01-23 20:34         ` Aditya Mahajan
@ 2011-01-23 20:54           ` Hans Hagen
  2011-01-23 21:10             ` Aditya Mahajan
  0 siblings, 1 reply; 8+ messages in thread
From: Hans Hagen @ 2011-01-23 20:54 UTC
  To: Aditya Mahajan; +Cc: mailing list for ConTeXt users

On 23-1-2011 9:34, Aditya Mahajan wrote:
> On Sun, 23 Jan 2011, Hans Hagen wrote:
>
>> On 22-1-2011 1:20, Aditya Mahajan wrote:
>>
>>> A simple algorithm with assume that everything following the dot is the
>>> extension, while that is certainly not the case here. We can definitely
>>> restrict the search of extension to the last 10 or so characters of the
>>> url, but there will be cases when such heuristics will fail.
>>
>> it's not that complicated ... say that you patch this way:
>>
>> function schemes.cleanname(specification)
>> return (gsub(specification.original,"[^%a%d%.]+","-"))
>> end
>>
>> local function fetch(specification)
>> local original = specification.original
>> local scheme = specification.scheme
>> local cleanname = schemes.cleanname(specification)
>>
>> that will be the current method. Now you can experiment with:
>
> Can cleanname be passed as a parameter of the specification? Then we can
> have
>
> local cleanname = specification.cleanname or
> schemes.cleanname(specification)
>
> This way, I can only change the cleanname of the files that are
> downloaded by my module without affecting the cleanname for any other
> command that might want to download a file.

I made this ... as this is rather specialized tuning (that might confuse 
users) it's a directive:

\starttext

\enabletrackers [resolvers.schemes]
\enabledirectives[schemes.cleanmethod=md5]

\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]

\stoptext

currently 'strip' is default but we can decide on md5

Hans


* Re: Downloading long urls
  2011-01-23 20:54           ` Hans Hagen
@ 2011-01-23 21:10             ` Aditya Mahajan
  0 siblings, 0 replies; 8+ messages in thread
From: Aditya Mahajan @ 2011-01-23 21:10 UTC
  To: Hans Hagen; +Cc: mailing list for ConTeXt users

On Sun, 23 Jan 2011, Hans Hagen wrote:

> I made this ... as this is rather specialized tuning (that might confuse 
> users) it's a directive:
>
> \starttext
>
> \enabletrackers [resolvers.schemes]
> \enabledirectives[schemes.cleanmethod=md5]
>
> \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
> \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
> \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
>
> \stoptext
>
> currently 'strip' is default but we can decide on md5

Thanks. I'll test it with my module.

Aditya

end of thread, other threads:[~2011-01-23 21:10 UTC | newest]

Thread overview: 8+ messages
2011-01-16  5:12 Downloading long urls Aditya Mahajan
2011-01-21 17:15 ` Aditya Mahajan
2011-01-21 22:53   ` Hans Hagen
2011-01-22  0:20     ` Aditya Mahajan
2011-01-23 20:22       ` Hans Hagen
2011-01-23 20:34         ` Aditya Mahajan
2011-01-23 20:54           ` Hans Hagen
2011-01-23 21:10             ` Aditya Mahajan
