Hello, authdial() is slow. so I have inspected the code: /sys/src/libauthsrv/authdial.c I suspect netroot causes the problem. int authdial(char *netroot, char *dom) { [snip] if(dom == nil) /* look for one relative to my machine */ return dial(netmkaddr("$auth", netroot, "ticket"), 0, 0, 0); /* look up an auth server in an authentication domain */ p = csgetvalue(netroot, "authdom", dom, "auth", &t); /* if that didn't work, just try the IP domain */ if(p == nil) p = csgetvalue(netroot, "dom", dom, "auth", &t); [snip] for(nt = t; nt != nil; nt = nt->entry) { if(strcmp(nt->attr, "auth") == 0) { p = netmkaddr(nt->val, netroot, "ticket"); rv = dial(p, 0, 0, 0); if(rv >= 0) break; } } [snip] } man authsrv(2) says: the default of netroot in authdial(char *netroot, char *ad) is “/net” and consistent with the synopsis of csgetvalue(char *netroot, char *attr, char *val,….). on the other hand the synopsis of netmkaddr is char* netmkaddr(char *addr, char *defnet, char *defservice) the defnet takes values nil, “tcp”, “il”, etc. as the result, netroot in the first argument of authdial() works only if it is nil, which makes authdial() very slow. I think that should be fixed for example: - p = netmkaddr(nt->val, netroot, "ticket"); rv = dial(p, 0, 0, 0); if(rv >= 0) break; + p = netmkaddr(nt->val, “tcp", "ticket"); rv = dial(p, 0, 0, 0); if(rv >= 0) break; + p = netmkaddr(nt->val, “il", "ticket"); + rv = dial(p, 0, 0, 0); + if(rv >= 0) + break; NOTE that return dial(netmkaddr("$auth", netroot, "ticket"), 0, 0, 0); should be also fixed. Kenji Arisawa
yes, passing netroot to netmkaddr() is wrong. however, i dont see where we pass anything other than netroot=nil to authdial(). i suspect what you'r experiencing is that dial() tried IL first (with !fasttimeout option), which fails and then tries TCP causing the delay. so you changing it to try tcp first gets rid of the delay (but will break the unlikely case where you got an authserver running on IL only as tcp has no fast timeout option). -- cinap
[-- Attachment #1: Type: text/plain, Size: 834 bytes --] On 18 May 2016 at 01:43, arisawa <arisawa@ar.aichi-u.ac.jp> wrote: > - p = netmkaddr(nt->val, netroot, "ticket"); > rv = dial(p, 0, 0, 0); > if(rv >= 0) > break; > > + p = netmkaddr(nt->val, “tcp", "ticket"); > rv = dial(p, 0, 0, 0); > if(rv >= 0) > break; > + p = netmkaddr(nt->val, “il", "ticket"); > + rv = dial(p, 0, 0, 0); > + if(rv >= 0) > + break; > But that's just (eventually) moving the cs search into every application and bound to specific network types. Why is the cs search with "net" so slow? [-- Attachment #2: Type: text/html, Size: 1464 bytes --]
[-- Attachment #1: Type: text/plain, Size: 1448 bytes --] i don't think it's the auth lookup: supermic% time ndb/csquery /net/cs net!luna.nyx.link!ticket /net/tcp/clone 115.36.102.252!567 /net/tcp/clone 2402:6b00:4040:b600::8!567 0.00u 0.00s 0.34r ndb/csquery /net/cs net!luna.nyx.link!ticket supermic% time ndb/csquery /net/cs net!grid.nyx.link!cpu /net/tcp/clone 115.36.102.252!17010 /net/tcp/clone 2402:6b00:4040:b600::9!17010 0.00u 0.00s 0.01r ndb/csquery /net/cs net!grid.nyx.link!cpu supermic% i suspect it's auth negotiations. On Wed, May 18, 2016 at 6:07 AM Charles Forsyth <charles.forsyth@gmail.com> wrote: > > On 18 May 2016 at 01:43, arisawa <arisawa@ar.aichi-u.ac.jp> wrote: > >> - p = netmkaddr(nt->val, netroot, "ticket"); >> rv = dial(p, 0, 0, 0); >> if(rv >= 0) >> break; >> >> + p = netmkaddr(nt->val, “tcp", "ticket"); >> rv = dial(p, 0, 0, 0); >> if(rv >= 0) >> break; >> + p = netmkaddr(nt->val, “il", "ticket"); >> + rv = dial(p, 0, 0, 0); >> + if(rv >= 0) >> + break; >> > > But that's just (eventually) moving the cs search into every application > and bound to specific network types. > Why is the cs search with "net" so slow? > [-- Attachment #2: Type: text/html, Size: 2587 bytes --]
hello,
thanks for the information.
the lookup depends on dns cache but I also don’t think it has a problem.
I think the negotiation process is innocent.
plan9 auth negotiation is fairly simple, that needs only a single step:
C→A: ticket request
A→C: encrypted ticket
I measured the times that are required for the auth server to complete the ticket request.
they are only a few milliseconds.
dialing time from client to auth server is the problem.
> 2016/05/19 2:03、Skip Tavakkolian <skip.tavakkolian@gmail.com> のメール:
>
> i don't think it's the auth lookup:
>
> supermic% time ndb/csquery /net/cs net!luna.nyx.link!ticket
> /net/tcp/clone 115.36.102.252!567
> /net/tcp/clone 2402:6b00:4040:b600::8!567
> 0.00u 0.00s 0.34r ndb/csquery /net/cs net!luna.nyx.link!ticket
> supermic% time ndb/csquery /net/cs net!grid.nyx.link!cpu
> /net/tcp/clone 115.36.102.252!17010
> /net/tcp/clone 2402:6b00:4040:b600::9!17010
> 0.00u 0.00s 0.01r ndb/csquery /net/cs net!grid.nyx.link!cpu
> supermic%
>
> i suspect it's auth negotiations.
>
>
>
> On Wed, May 18, 2016 at 6:07 AM Charles Forsyth <charles.forsyth@gmail.com> wrote:
>
> On 18 May 2016 at 01:43, arisawa <arisawa@ar.aichi-u.ac.jp> wrote:
> - p = netmkaddr(nt->val, netroot, "ticket");
> rv = dial(p, 0, 0, 0);
> if(rv >= 0)
> break;
>
> + p = netmkaddr(nt->val, “tcp", "ticket");
> rv = dial(p, 0, 0, 0);
> if(rv >= 0)
> break;
> + p = netmkaddr(nt->val, “il", "ticket");
> + rv = dial(p, 0, 0, 0);
> + if(rv >= 0)
> + break;
>
> But that's just (eventually) moving the cs search into every application and bound to specific network types.
> Why is the cs search with "net" so slow?
cinap is right but I wonder if we should take care of IL forever.
> 2016/05/18 18:14、cinap_lenrek@felloff.net のメール:
>
> yes, passing netroot to netmkaddr() is wrong. however, i dont see where
> we pass anything other than netroot=nil to authdial().
>
> i suspect what you'r experiencing is that dial() tried IL first (with
> !fasttimeout option), which fails and then tries TCP causing the delay.
>
> so you changing it to try tcp first gets rid of the delay (but will break
> the unlikely case where you got an authserver running on IL only as tcp
> has no fast timeout option).
>
> --
> cinap
>
> cinap is right but I wonder if we should take care of IL forever.
Yes.
Lucio.
[-- Attachment #1: Type: text/plain, Size: 536 bytes --] It's not just il. There are other networks, other protocols, other than tcp and IP. It's one of the advantages of plan 9 that it's easy to try and use new ones. cs could present a different search list when there's only one way to get somewhere. as a quick fix for everything, if you're not using il, remove it from the list in ndb/cs.c, and cs won't suggest it. On 19 May 2016 at 16:43, <lucio@proxima.alt.za> wrote: > > cinap is right but I wonder if we should take care of IL forever. > > Yes. > > Lucio. > > > [-- Attachment #2: Type: text/html, Size: 986 bytes --]
nice idea.
I’ve put an option to cs so that il isn’t listed first.
> 2016/05/20 0:48、Charles Forsyth <charles.forsyth@gmail.com> のメール:
>
> It's not just il. There are other networks, other protocols, other than tcp and IP.
> It's one of the advantages of plan 9 that it's easy to try and use new ones.
> cs could present a different search list when there's only one way to get somewhere.
> as a quick fix for everything, if you're not using il, remove it from the list in ndb/cs.c, and cs won't suggest it.
>
>
> On 19 May 2016 at 16:43, <lucio@proxima.alt.za> wrote:
> > cinap is right but I wonder if we should take care of IL forever.
>
> Yes.
>
> Lucio.
>
>
>
i'm a little confused by the discussion of il + tcp on authdial causing the delay. if ndb/cs returns multiple dial strings, dialmulti function (/sys/src/libc/9sys/dial.c) dials them all in parallel (rfork) and the first one that returns, wins. btw, i found a bug in dial.c (i'll send it in another message); fixing it didn't seem to have any effect but i believe it's a bug anyway. i instrumented the near side of cpu; the setup (including setting up notes forwarding) looks to be less than 2 seconds. the rest can only be attributed to local exportfs startup and remote rc startup. here are three runs: supermic% time ./cpu -h grid.nyx.link -k 'dom=outside.plan9.bell-labs.com' -c pwd dial took: 165268068 ns negotiate auth took: 159047373 ns factotum autenticate took: 699255914 ns lclnoteproc time: 702945750 ns far end setup time: 157257845 ns /usr/fst 0.00u 0.00s 7.58r ./cpu -h grid.nyx.link -k dom=outside.plan9.bell-labs.com ... supermic% time ./cpu -h grid.nyx.link -k 'dom=outside.plan9.bell-labs.com' -c pwd dial took: 158312233 ns negotiate auth took: 155598573 ns factotum autenticate took: 681558002 ns lclnoteproc time: 684863443 ns far end setup time: 151972579 ns /usr/fst 0.00u 0.01s 7.22r ./cpu -h grid.nyx.link -k dom=outside.plan9.bell-labs.com ... supermic% time ./cpu -h grid.nyx.link -k 'dom=outside.plan9.bell-labs.com' -c pwd dial took: 176752321 ns negotiate auth took: 175402567 ns factotum autenticate took: 746787405 ns lclnoteproc time: 749881615 ns far end setup time: 172600395 ns /usr/fst 0.00u 0.03s 8.17r ./cpu -h grid.nyx.link -k dom=outside.plan9.bell-labs.com ... supermic%
can you spot the problem in reap() function in /sys/src/libc/9sys/dial.c? hint: in notedeath() there is a tokenize(exitsts, ...) static int reap(Dest *dp) { char exitsts[2*ERRMAX]; if (outstandingprocs(dp) && await(exitsts, sizeof exitsts) >= 0) { notedeath(dp, exitsts); return 0; } return -1; }
[-- Attachment #1: Type: text/plain, Size: 374 bytes --] On 20 May 2016 at 23:04, Skip Tavakkolian <9nut@9netics.com> wrote: > i'm a little confused by the discussion of il + tcp on authdial > causing the delay. if ndb/cs returns multiple dial strings, dialmulti > function (/sys/src/libc/9sys/dial.c) dials them all in parallel > (rfork) and the first one that returns, wins. > 9front uses the original sequential one [-- Attachment #2: Type: text/html, Size: 722 bytes --]
i think this fix is correct; i'm not sure why tokenize didn't have a problem walking a buffer it expects to be null terminated. supermic% yesterday -d dial.c diff /n/dump/2016/0520/sys/src/libc/9sys/dial.c /sys/src/libc/9sys/dial.c 209a210 > int n; 211c212,213 < if (outstandingprocs(dp) && await(exitsts, sizeof exitsts) >= 0) { --- > if (outstandingprocs(dp) && (n = await(exitsts, sizeof exitsts)) >= 0) { > exitsts[n] = 0; supermic% > can you spot the problem in reap() function in /sys/src/libc/9sys/dial.c? > hint: in notedeath() there is a tokenize(exitsts, ...) > > static int > reap(Dest *dp) > { > char exitsts[2*ERRMAX]; > > if (outstandingprocs(dp) && await(exitsts, sizeof exitsts) >= 0) { > notedeath(dp, exitsts); > return 0; > } > return -1; > }
I am sorry that I didn’t give enough information but I din’t know there is so much difference
on the subject among 9front, 9atom and labs plan9.
> 2016/05/21 7:04、Skip Tavakkolian <9nut@9netics.com> のメール:
>
> i'm a little confused by the discussion of ...
parallel dialing would be fine.
I guess both cinap and erik have examined labs code.
any problem with it?
> 2016/05/21 7:25、Charles Forsyth <charles.forsyth@gmail.com> のメール:
>
>
> On 20 May 2016 at 23:04, Skip Tavakkolian <9nut@9netics.com> wrote:
> i'm a little confused by the discussion of il + tcp on authdial
> causing the delay. if ndb/cs returns multiple dial strings, dialmulti
> function (/sys/src/libc/9sys/dial.c) dials them all in parallel
> (rfork) and the first one that returns, wins.
>
> 9front uses the original sequential one
Hello,
> 2016/05/21 11:25、Skip Tavakkolian <9nut@9netics.com> のメール:
>
> i think this fix is correct; i'm not sure why tokenize didn't have a
> problem walking a buffer it expects to be null terminated.
I agree with skip.
the man wait(2) says:
The underlying system call is await, which fills in the n-
byte buffer s with a textual representation of the pid,
times, and exit string. There is no terminal NUL. The
return value is the length, in bytes, of the data.
thus the typical usage of await is:
Waitmsg*
wait(void)
{
int n, l;
char buf[512], *fld[5];
Waitmsg *w;
n = await(buf, sizeof buf-1);
if(n < 0)
return nil;
buf[n] = '\0';
if(tokenize(buf, fld, nelem(fld)) != nelem(fld)){
werrstr("couldn't parse wait message");
return nil;
}
/sys/src/libc/9sys/dial.c may be fixed as follows:
static int
reap(Dest *dp)
{
char exitsts[2*ERRMAX];
+ int n;
- if (outstandingprocs(dp) && await(exitsts, sizeof exitsts) >= 0) {
+ if (outstandingprocs(dp) && (n = await(exitsts, sizeof exitsts-1)) >= 0) {
+ exitsts[n] = 0;
notedeath(dp, exitsts);
return 0;
}
return -1;
}
probably 2*ERRMAX is enough for await().
I wonder why await() returns non-terminated string?
> static int
> reap(Dest *dp)
> {
> char exitsts[2*ERRMAX];
> + int n;
> - if (outstandingprocs(dp) && await(exitsts, sizeof exitsts) >= 0) {
> + if (outstandingprocs(dp) && (n = await(exitsts, sizeof exitsts-1)) >= 0) {
> + exitsts[n] = 0;
> notedeath(dp, exitsts);
> return 0;
> }
> return -1;
> }
>
> probably 2*ERRMAX is enough for await().
>
> I wonder why await() returns non-terminated string?
>
a better solution is to just use waitmsg (see wait(2)). the parsing rules and sizing are
already implemented there.
- erik
On Fri May 20 21:48:40 PDT 2016, arisawa@ar.aichi-u.ac.jp wrote:
> parallel dialing would be fine.
> I guess both cinap and erik have examined labs code.
> any problem with it?
i don't use this code.
i think the problem with parallel dial is that like the tricks we used to play in
nsec()—it assumes things about the calling environment. for starters, it's going
to break anything that's already forking as it may eat wait messages.
it's been a while since i looked at this, but i think parallel dialing requires a
bit more of a rethink of the dialing code than just hacking in a fork. without
restructuring the system, i think one would be forced into a dial device so that
any forking/threading could escape the current running environment.
the original problem this code was trying to solve was the fact that many connections
can't do ip6, and dialing ip6 slows down the connection quite a bit.
this problem is a little easier to fix. cs needs to do a little more work to filter out
impossible connection types. for example, if one's internet connection is not ip6
capable, then cs should be instructed to not return ip6 addresses.
- erik
On Fri May 20 15:27:56 PDT 2016, charles.forsyth@gmail.com wrote:
> On 20 May 2016 at 23:04, Skip Tavakkolian <9nut@9netics.com> wrote:
>
> > i'm a little confused by the discussion of il + tcp on authdial
> > causing the delay. if ndb/cs returns multiple dial strings, dialmulti
> > function (/sys/src/libc/9sys/dial.c) dials them all in parallel
> > (rfork) and the first one that returns, wins.
> >
>
> 9front uses the original sequential one
so does 9atom.
- erik
> a better solution is to just use waitmsg (see wait(2)). the parsing rules and sizing are > already implemented there. yes, i agree. i changed my patch. diff /n/dump/2016/0519/sys/src/libc/9sys/dial.c /sys/src/libc/9sys/dial.c 167c167 < notedeath(Dest *dp, char *exitsts) --- > notedeath(Dest *dp, Waitmsg *exitsts) 169,170c169 < int i, n, pid; < char *fields[5]; /* pid + 3 times + error */ --- > int n; 173,176c172 < for (i = 0; i < nelem(fields); i++) < fields[i] = ""; < n = tokenize(exitsts, fields, nelem(fields)); < if (n < 4) --- > if (exitsts->pid <= 0) 178,180d173 < pid = atoi(fields[0]); < if (pid <= 0) < return; 182c175 < if (conn->pid == pid && !conn->dead) { /* it's one we know? */ --- > if (conn->pid == exitsts->pid && !conn->dead) { /* it's one we know? */ 187,188c180,182 < strncpy(conn->err, fields[4], sizeof conn->err - 1); < conn->err[sizeof conn->err - 1] = '\0'; --- > n = strlen(exitsts->msg); > assert(n < ERRMAX); > strncpy(conn->err, exitsts->msg, n); 209c203 < char exitsts[2*ERRMAX]; --- > Waitmsg *exitsts = nil; 211c205 < if (outstandingprocs(dp) && await(exitsts, sizeof exitsts) >= 0) { --- > if (outstandingprocs(dp) && (exitsts = wait()) != nil) {
On Sat May 21 14:47:24 PDT 2016, 9nut@9netics.com wrote:
> > a better solution is to just use waitmsg (see wait(2)). the parsing rules and sizing are
> > already implemented there.
>
> yes, i agree. i changed my patch.
i believe this is missing a free.
- erik
also,
209c203
< char exitsts[2*ERRMAX];
---
> Waitmsg *exitsts = nil;
is likely to generate used-and-not-set on amd64.
- erik
> i believe this is missing a free.
>
> - erik
good catch; thanks.
> < char exitsts[2*ERRMAX];
> ---
>> Waitmsg *exitsts = nil;
>
> is likely to generate used-and-not-set on amd64.
>
> - erik
i'm not sure i understand how. can you explain?
On Sat May 21 15:19:06 PDT 2016, 9nut@9netics.com wrote:
> > < char exitsts[2*ERRMAX];
> > ---
> >> Waitmsg *exitsts = nil;
> >
> > is likely to generate used-and-not-set on amd64.
> >
> > - erik
>
> i'm not sure i understand how. can you explain?
since the only code path that uses exitsts sets it unconditionally,
setting a declaration time is going to be optimized out. when the
optimizer elides setting of a variable it emits the set and not used
diagnostic.
gcc doesn't do this with the usual flags, which is why it may be unexpected.
i've included a test program that makes this a little easier to read.
the test program emits the diagnostic "warning: warn.c:9 set and not used: s"
- erik
---
#include <u.h>
#include <libc.h>
void
main(void)
{
char *s;
s = nil;
s = "xyz";
print("s=%s\n", s);
exits("");
}
> also,
>
> 209c203
> < char exitsts[2*ERRMAX];
> ---
> > Waitmsg *exitsts = nil;
>
> is likely to generate used-and-not-set on amd64.
Wouldn't it be "set and not used" instead?
--
David du Colombier
>> >> Waitmsg *exitsts = nil;
i see; it's set but not used before it is assigned to again.
i would expect the compiler to get the hint that it's initialization
-- especially given that the value is 0 and the assignment is right
with the declaration.
> On May 21, 2016, at 10:04 AM, erik quanstrom <quanstro@quanstro.net> wrote:
>
> this problem is a little easier to fix. cs needs to do a little more work to filter out
> impossible connection types. for example, if one's internet connection is not ip6
> capable, then cs should be instructed to not return ip6 addresses.
Yes, exactly. And doesn't/didn't ndb have a tag that was used to indicate whether a system or network used il? (proto=il? I don't have access to a 9labs machine to check at the moment.)
Determining ip6 connectivity shouldn't be that hard. If you have a route, you connect. If you don't, the network stack does a fast ENOROUTE return and the case falls through to the next address family.
IL is a bit trickier, since you can't know in advance if the destination supports it, thus the ndb tags I'm vaguely remembering. And if my memory is correct, the fix for that would be to default 'supports IL' to no, if that's not already the case.
> Yes, exactly. And doesn't/didn't ndb have a tag that was used to indicate whether a system or network used il? (proto=il? I don't have access to a 9labs machine to check at the moment.)
>
> Determining ip6 connectivity shouldn't be that hard. If you have a route, you connect. If you don't, the network stack does a fast ENOROUTE return and the case falls through to the next address family.
>
> IL is a bit trickier, since you can't know in advance if the destination supports it, thus the ndb tags I'm vaguely remembering. And if my memory is correct, the fix for that would be to default 'supports IL' to no, if that's not already the case.
cs uses hard-coded rules. one could argue that it's up to dns to filter out
ip6, which would be straightforward one would only return ip6 when the
path from . to the zone in question is all ip6 accessible. i realize this is
wrong, as the path to the dns server is not the same as the path of the
packets, but, you know, should be a better hurestic than we've got today.
- erik
On Sat May 21 16:05:58 PDT 2016, 9nut@9netics.com wrote:
> >> >> Waitmsg *exitsts = nil;
>
> i see; it's set but not used before it is assigned to again.
>
> i would expect the compiler to get the hint that it's initialization
> -- especially given that the value is 0 and the assignment is right
> with the declaration.
like i mentioned, gcc ignores variables that are set and not used, for
the most part. i think behavior is odd. there's nothing special about
assigning a simple type in a declaration.
for example
char *s = nil;
and
char *s;
s = nil;
are equivalent. gcc ignores both cases if they are not actually used.
one could add an automatic implicit USED in kenc by modifying
the doinit() in dcl.c, but i think that would be a mistake. i find extra
assignments make it harder to read the code, as one has to consider
the assignment might be used when evaluating how a variable is used.
there are some exceptions to the general rule, for non simple types,
where the assignment may change the size. and i'm gnoring some dark
corners of c99 like const int, which are not implemented in plan 9.
- erik
hello skip,
it is too late but I should say many thanks.
> 2016/05/21 7:04、Skip Tavakkolian <9nut@9netics.com> のメール:
>
> supermic% time ./cpu -h grid.nyx.link -k 'dom=outside.plan9.bell-labs.com' -c pwd
> dial took: 165268068 ns
> negotiate auth took: 159047373 ns
> factotum autenticate took: 699255914 ns
> lclnoteproc time: 702945750 ns
> far end setup time: 157257845 ns
> /usr/fst
> 0.00u 0.00s 7.58r ./cpu -h grid.nyx.link -k dom=outside.plan9.bell-labs.com ...
> supermic% time ./cpu -h grid.nyx.link -k 'dom=outside.plan9.bell-labs.com' -c pwd
> dial took: 158312233 ns
> negotiate auth took: 155598573 ns
> factotum autenticate took: 681558002 ns
> lclnoteproc time: 684863443 ns
> far end setup time: 151972579 ns
> /usr/fst
> 0.00u 0.01s 7.22r ./cpu -h grid.nyx.link -k dom=outside.plan9.bell-labs.com ...
> supermic% time ./cpu -h grid.nyx.link -k 'dom=outside.plan9.bell-labs.com' -c pwd
> dial took: 176752321 ns
> negotiate auth took: 175402567 ns
> factotum autenticate took: 746787405 ns
> lclnoteproc time: 749881615 ns
> far end setup time: 172600395 ns
> /usr/fst
> 0.00u 0.03s 8.17r ./cpu -h grid.nyx.link -k dom=outside.plan9.bell-labs.com ...
> supermic%
>
>