* Downloading long urls
From: Aditya Mahajan @ 2011-01-16 5:12 UTC (permalink / raw)
To: mailing list for ConTeXt users
Hi,
While downloading URLs, ConTeXt sanitizes the filename but does not check
its length. So one can end up with a filename that is too long for the
operating system to handle. For example, the following fails on 32-bit
Linux.
\enabletrackers[resolvers.schemes]
\startluacode
local report_webfilter = logs.new("thirddata.webfilter")
local url =
"http://www.bing.com/search?q=AreallyreallylongstringjusttoseehowthingsworkordontworkAreallyreallylongstringjusttoseehowthingsworkordontworkAreallyreallylongstringjusttoseehowthingsworkordontworkAreallyreallylongstringjusttoseehowthingsworkordontworAreallyreallylongstringjusttoseehowthingsworkordontworkkAreallyreallylongstringjusttoseehowthingsworkordontwork"
local specification = resolvers.splitmethod(url)
local file = resolvers.finders['http'](specification) or ""
if file and file ~= "" then
    report_webfilter("saving file %s", file)
else
    report_webfilter("download failed")
end
\stopluacode
\normalend
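To make the length problem concrete, here is a minimal plain-Lua sketch (no ConTeXt libraries needed). It applies the same substitution that data-sch.lua uses to a URL built like the one above and compares the result with a per-filename limit. The limit of 255 bytes is an assumption that holds on many common filesystems; it is not stated anywhere in this thread, so check your own system.

```lua
-- Build a long URL similar to the one in the example above.
local query = string.rep("Areallyreallylongstringjusttoseehowthingsworkordontwork", 6)
local url   = "http://www.bing.com/search?q=" .. query

-- Same substitution as the cleanname line in data-sch.lua:
-- every run of characters that is not a letter, digit, or dot becomes "-".
local cleanname = (string.gsub(url, "[^%a%d%.]+", "-"))

-- Assumed limit for a single filename component (hypothetical value,
-- typical for many filesystems but not taken from this thread).
local limit = 255

print("url length:      ", #url)
print("cleanname length:", #cleanname)
print("too long:        ", #cleanname > limit)
```

Sanitizing only replaces characters, so the cleaned name stays roughly as long as the URL itself.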
Is there a robust way to avoid this problem? One possibility is that in
data-sch.lua instead of
local cleanname = gsub(original,"[^%a%d%.]+","-")
use
local cleanname = md5sum(original)
What do you think?
Aditya
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!
maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage : http://www.pragma-ade.nl / http://tex.aanhet.net
archive : http://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___________________________________________________________________________________
* Re: Downloading long urls
From: Aditya Mahajan @ 2011-01-21 17:15 UTC (permalink / raw)
To: mailing list for ConTeXt users
On Sun, 16 Jan 2011, Aditya Mahajan wrote:
> Is there a robust way to avoid this problem? One possibility is that in
> data-sch.lua instead of
>
> local cleanname = gsub(original,"[^%a%d%.]+","-")
>
> use
local cleanname = md5.HEX(original) -- gsub(original,"[^%a%d%.]+","-")
appears to work correctly in my tests. The drawback of this scheme is that
instead of
\externalfigure[url ending with .png]
one would have to use
\externalfigure[url ending with .png][method=png]
But \input 'url ending with .tex' still works.
The other drawback is the filenames in the cache will be gibberish. But on
the plus side, you can use long urls.
Do you think that the drawbacks outweigh the gains?
I need this for the webfilter module, where the url can get pretty long. I
can always write my own http_get function, but that will be mostly
repetition of data-sch.lua
Aditya
* Re: Downloading long urls
From: Hans Hagen @ 2011-01-21 22:53 UTC (permalink / raw)
To: mailing list for ConTeXt users
On 21-1-2011 6:15, Aditya Mahajan wrote:
> local cleanname = md5.HEX(original) -- gsub(original,"[^%a%d%.]+","-")
>
> appears to work correctly in my tests. The drawback of this scheme is
> that instead of
>
> \externalfigure[url ending with .png]
>
> one would have to use
>
> \externalfigure[url ending with .png][method=png]
>
> But \input 'url ending with .tex' still works
>
> The other drawback is the filenames in the cache will be gibberish. But
> on the plus side, you can use long urls.
>
> Do you think that the drawbacks outweigh the gains?
What exactly do you mean by the suffix issue? We can probably
normalize things a bit. Concerning the gibberish ... we can put a file
alongside it with some info. I need to think a bit about it, but indeed
it makes no sense to have redundant mechanisms.
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
| www.pragma-pod.nl
-----------------------------------------------------------------
* Re: Downloading long urls
From: Aditya Mahajan @ 2011-01-22 0:20 UTC (permalink / raw)
To: Hans Hagen; +Cc: mailing list for ConTeXt users
On Fri, 21 Jan 2011, Hans Hagen wrote:
> What exactly do you mean by the suffix issue?
Consider
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png]
The current implementation downloads this file as
<path-to-current-cache>/http-contextgarden.files.wordpress.com-2008-08-logo-alt41.png
\externalfigure then sees a file with a .png extension and includes it
correctly.
If you follow my suggestion, the file will be downloaded as
<path-to-current-cache>/667816068B899068327DA1EF013B3943
\externalfigure then sees a file with no extension, assumes that it is a
PDF file, and the figure inclusion fails. To correct that, you need to add
[method=png] to \externalfigure.
> We can probably normalize things a bit.
Agreed. Perhaps the best option will be a file name like
http-contextgarden.files.wordpress.com-667816068B899068327DA1EF013B3943.png
(so normalized base URL + md5sum of the URL + extension). I am not sure
whether extensions can be determined reliably from URLs. In particular,
imagine something like
http://www.bing.com/search?q=check+.extension+long+url+so+that+os+filename+limit+exceeds+....
A simple algorithm will assume that everything following the dot is the
extension, which is certainly not the case here. We can restrict the
search for the extension to the last 10 or so characters of the URL, but
there will be cases where such heuristics fail.
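One way to avoid being fooled by dots in the query string is to look for the suffix only in the path part of the URL. The following plain-Lua sketch illustrates the idea. The helper name url_suffix and the cut-off of 5 characters are hypothetical choices for illustration and are not part of ConTeXt.

```lua
-- Hypothetical helper: return a lowercase file suffix found in the path
-- part of a URL, or nil when no plausible suffix is present.
local function url_suffix(url)
    -- Strip "scheme://host" so that a dot in the host name is never
    -- mistaken for a suffix separator.
    local rest = url:match("^%a[%w+.-]*://[^/?#]*(.*)$") or url
    -- Drop the query string and fragment, which may contain stray dots.
    local path = (rest:gsub("[?#].*$", ""))
    -- Keep only the last path segment.
    local segment = path:match("([^/]*)$")
    -- Accept a short alphanumeric suffix after the final dot.
    local suffix = segment:match("%.(%w+)$")
    if suffix and #suffix <= 5 then
        return suffix:lower()
    end
    return nil
end

print(url_suffix("http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png"))
print(url_suffix("http://www.bing.com/search?q=check+.extension+long+url"))
```

The first call yields png and the second yields nil, because the only dot after the host sits in the query string. URLs whose content type is not reflected in the path at all would still need an explicit method= hint.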
> Concerning the gibberish ... we can put a file alongside with some info.
> I need to think a bit about it but indeed it makes no sense to have
> redundant mechanisms.
Thanks,
Aditya
* Re: Downloading long urls
From: Hans Hagen @ 2011-01-23 20:22 UTC (permalink / raw)
To: Aditya Mahajan; +Cc: mailing list for ConTeXt users
On 22-1-2011 1:20, Aditya Mahajan wrote:
> A simple algorithm will assume that everything following the dot is the
> extension, while that is certainly not the case here. We can definitely
> restrict the search of extension to the last 10 or so characters of the
> url, but there will be cases when such heuristics will fail.
it's not that complicated ... say that you patch this way:
function schemes.cleanname(specification)
    return (gsub(specification.original,"[^%a%d%.]+","-"))
end
local function fetch(specification)
    local original = specification.original
    local scheme = specification.scheme
    local cleanname = schemes.cleanname(specification)
that will be the current method. Now you can experiment with:
\startluacode
function resolvers.schemes.cleanname(specification)
    local name = specification.original
    local hash = file.addsuffix(md5.hex(name),file.suffix(specification.path))
    logs.simple("%s => %s",name,hash)
    return hash
end
\stopluacode
Just see how that works out
Hans
* Re: Downloading long urls
From: Aditya Mahajan @ 2011-01-23 20:34 UTC (permalink / raw)
To: Hans Hagen; +Cc: mailing list for ConTeXt users
On Sun, 23 Jan 2011, Hans Hagen wrote:
> On 22-1-2011 1:20, Aditya Mahajan wrote:
>
>> A simple algorithm will assume that everything following the dot is the
>> extension, while that is certainly not the case here. We can definitely
>> restrict the search of extension to the last 10 or so characters of the
>> url, but there will be cases when such heuristics will fail.
>
> it's not that complicated ... say that you patch this way:
>
> function schemes.cleanname(specification)
> return (gsub(specification.original,"[^%a%d%.]+","-"))
> end
>
> local function fetch(specification)
> local original = specification.original
> local scheme = specification.scheme
> local cleanname = schemes.cleanname(specification)
>
> that will be the current method. Now you can experiment with:
Can cleanname be passed as a parameter of the specification? Then we can
have
local cleanname = specification.cleanname or schemes.cleanname(specification)
This way, I can change the cleanname only for the files that are downloaded
by my module, without affecting the cleanname for any other command that
might want to download a file.
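The fallback can be tried outside ConTeXt with a small mock-up. In this sketch the schemes table is a stand-in for the real one, and resolve_cleanname and the file name webfilter-0001.png are hypothetical names used only for illustration.

```lua
-- Stand-in for the ConTeXt table of the same name (mock, not the real one).
local schemes = {}

-- Default cleaner: the substitution discussed earlier in this thread.
function schemes.cleanname(specification)
    return (string.gsub(specification.original, "[^%a%d%.]+", "-"))
end

-- Hypothetical helper mirroring the proposed line: a name supplied in the
-- specification wins, otherwise the default cleaner is used.
local function resolve_cleanname(specification)
    return specification.cleanname or schemes.cleanname(specification)
end

local default = resolve_cleanname {
    original = "http://example.org/a/b.png",
}
local override = resolve_cleanname {
    original  = "http://example.org/a/b.png",
    cleanname = "webfilter-0001.png", -- name chosen by the calling module
}

print(default)
print(override)
```

Only callers that set the cleanname field see a different file name; every other download keeps the default behaviour.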
Aditya
* Re: Downloading long urls
From: Hans Hagen @ 2011-01-23 20:54 UTC (permalink / raw)
To: Aditya Mahajan; +Cc: mailing list for ConTeXt users
On 23-1-2011 9:34, Aditya Mahajan wrote:
> On Sun, 23 Jan 2011, Hans Hagen wrote:
>
>> On 22-1-2011 1:20, Aditya Mahajan wrote:
>>
>>> A simple algorithm will assume that everything following the dot is the
>>> extension, while that is certainly not the case here. We can definitely
>>> restrict the search of extension to the last 10 or so characters of the
>>> url, but there will be cases when such heuristics will fail.
>>
>> it's not that complicated ... say that you patch this way:
>>
>> function schemes.cleanname(specification)
>> return (gsub(specification.original,"[^%a%d%.]+","-"))
>> end
>>
>> local function fetch(specification)
>> local original = specification.original
>> local scheme = specification.scheme
>> local cleanname = schemes.cleanname(specification)
>>
>> that will be the current method. Now you can experiment with:
>
> Can cleanname be passed as a parameter of the specification? Then we can
> have
>
> local cleanname = specification.cleanname or
> schemes.cleanname(specification)
>
> This way, I can only change the cleanname of the files that are
> downloaded by my module without affecting the cleanname for any other
> command that might want to download a file.
I made this ... as this is rather specialized tuning (that might confuse
users) it's a directive:
\starttext
\enabletrackers [resolvers.schemes]
\enabledirectives[schemes.cleanmethod=md5]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\stoptext
Currently 'strip' is the default, but we can decide on md5.
Hans
* Re: Downloading long urls
From: Aditya Mahajan @ 2011-01-23 21:10 UTC (permalink / raw)
To: Hans Hagen; +Cc: mailing list for ConTeXt users
On Sun, 23 Jan 2011, Hans Hagen wrote:
> I made this ... as this is rather specialized tuning (that might confuse
> users) it's a directive:
>
> \starttext
>
> \enabletrackers [resolvers.schemes]
> \enabledirectives[schemes.cleanmethod=md5]
>
> \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
> \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
> \externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
>
> \stoptext
>
> currently 'strip' is default but we can decide on md5
Thanks. I'll test it with my module.
Aditya