9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] 9pfuse and O_APPEND
@ 2008-12-18 23:34 Roman Shaposhnik
  2008-12-18 23:57 ` Russ Cox
  2008-12-19 21:00 ` ron minnich
  0 siblings, 2 replies; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-18 23:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I guess this is mainly a question for Russ: I'm using 9pfuse for a
proof-of-concept project here at Sun and it all works quite
well. My goal is to avoid the 9P2000.u route and use 9P2000
semantics as much as possible, yet allow most of the POSIX
FS functionality to simply work.

In order to do that, I have to extend 9pfuse somewhat. In most
cases my code could be considered "complementary" to the
core of 9pfuse, but there's one case that seems to be common
enough to warrant some discussion and potential
changes to the core.

The question has to do with the O_APPEND flag. POSIX apps
seem to use it quite frequently (most notably, bash uses it
for the most basic of redirections, >>) but 9pfuse doesn't
really have any support for it:

main.c:_fuseopen

        /*
         * Could translate but not standard 9P:
         *      O_DIRECT -> ODIRECT
         *      O_NONBLOCK -> ONONBLOCK
         *      O_APPEND -> OAPPEND
         */
        if(flags){
                fprint(2, "unexpected open flags %#uo", (uint)in->flags);
                replyfuseerrno(m, EACCES);
                return;
        }

So here's my question: is there any consensus on how to best
emulate it?

So far, I see the following choices for myself:
    * follow what v9fs does and emulate it with llseek(...SEEK_END).
      Not ideal, since it doesn't always guarantee POSIX semantics,
      but way better than nothing (see the sketch below this list).
    * emulate per-FID DMAPPEND by letting the server (which I also
      control) accept Qid modifications on wstat. My understanding
      is that existing 9P servers would simply reply with Rerror and
      I can then fall back on llseek, perhaps. Border-line abuse of
      the protocol.
    * reserve a (uint)-1 offset in writes as an indication to append
      to the end of the file. Really seems like an abuse of the
      protocol :-(
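
To make the first option concrete, here is a minimal sketch of the
seek-to-end pattern in plain POSIX terms. The helper name is mine, and
fstat/pwrite stand in for whatever the 9P client actually does -- this
is the pattern, not v9fs's (or 9pfuse's) real code:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Emulate an appending write: re-read the file length, then
     * write there. The window between fstat and pwrite is exactly
     * where concurrent appenders can interleave, which is why this
     * only approximates POSIX O_APPEND semantics. */
    ssize_t
    append_write(int fd, const void *buf, size_t n)
    {
        struct stat st;

        if(fstat(fd, &st) < 0)
            return -1;
        return pwrite(fd, buf, n, st.st_size);
    }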

There's also a way for me to handle the situation the way I intend to
handle the rest of the POSIX goo: have a dedicated tree with a special
aname. But for such a common operation that seems like a bit of
overkill.

Thus, I'd really love to hear suggestions that might help integrate
that bit of code back into 9pfuse proper.

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-18 23:34 [9fans] 9pfuse and O_APPEND Roman Shaposhnik
@ 2008-12-18 23:57 ` Russ Cox
  2008-12-19  0:03   ` ron minnich
  2008-12-19  3:03   ` Roman Shaposhnik
  2008-12-19 21:00 ` ron minnich
  1 sibling, 2 replies; 33+ messages in thread
From: Russ Cox @ 2008-12-18 23:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I would just seek to the end.
That's fine unless you have multiple
programs writing O_APPEND simultaneously,
in which case you are asking for trouble.

Russ


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-18 23:57 ` Russ Cox
@ 2008-12-19  0:03   ` ron minnich
  2008-12-19  3:06     ` Roman Shaposhnik
  2008-12-19  3:03   ` Roman Shaposhnik
  1 sibling, 1 reply; 33+ messages in thread
From: ron minnich @ 2008-12-19  0:03 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Dec 18, 2008 at 3:57 PM, Russ Cox <rsc@swtch.com> wrote:
> I would just seek to the end.
> That's fine unless you have multiple
> programs writing O_APPEND simultaneously,
> in which case you are asking for trouble.
>

yep. The code in nfs clients to support O_APPEND is a wonder to
behold. A nicer combination of rubber bands and paper clips you never
did see.

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-18 23:57 ` Russ Cox
  2008-12-19  0:03   ` ron minnich
@ 2008-12-19  3:03   ` Roman Shaposhnik
  2008-12-19  3:43     ` erik quanstrom
  1 sibling, 1 reply; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19  3:03 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 18, 2008, at 3:57 PM, Russ Cox wrote:
> I would just seek to the end.

Got it. In that case, is there any reason the current version
of 9pfuse doesn't just skip O_APPEND (like it does with
O_LARGEFILE, etc.)? Since 9pfuse revalidates i_size
before writes, that's the best one can do anyway. (*)

The following patch seems to work for me. If there's
any reason for it NOT to be included in the Hg repo,
please let me know:

--- main.c	2008-12-18 18:41:19.000000000 -0800
+++ src/cmd/9pfuse/main.c	2008-12-18 18:03:27.000000000 -0800
@@ -576,7 +576,7 @@
  	flags = in->flags;
  	openmode = flags&3;
  	flags &= ~3;
-	flags &= ~(O_DIRECTORY|O_NONBLOCK|O_LARGEFILE|O_CLOEXEC);
+	flags &= ~(O_DIRECTORY|O_NONBLOCK|O_LARGEFILE|O_CLOEXEC|O_APPEND);
  	if(flags & O_TRUNC){
  		openmode |= OTRUNC;
  		flags &= ~O_TRUNC;

> That's fine unless you have multiple
> programs writing O_APPEND simultaneously,
> in which case you are asking for trouble.


Agreed. Now, here's a bit that I still don't quite
understand: Plan 9 does have DMAPPEND on
a per-Qid basis. Why was it decided not to
have it on a per-Fid basis (which would match
POSIX semantics 100%)?

The way I understand it -- DMAPPEND is just a hint
to the server to *always* ignore the offset in
incoming writes. It seems that ignoring offsets
in writes for the Fids that asked for it wouldn't be
much more difficult, would it?

Thanks,
Roman.

(*) After some close examination of the 2.6.27 kernel I actually
wonder why the v9fs guys do an explicit seek in their open.

P.S. It's not different clients I'm worried about. It's something
like this within a single broken client:

     int fd = open("/tmp/test.txt", O_RDWR|O_APPEND);
     write(fd, "12345", 5);      /* lands at EOF */
     lseek(fd, 1, SEEK_SET);     /* moves the file offset... */
     write(fd, "00000", 5);      /* ...but O_APPEND forces this write
                                    to EOF (offset 5), not offset 1 */



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  0:03   ` ron minnich
@ 2008-12-19  3:06     ` Roman Shaposhnik
  2008-12-19  3:26       ` ron minnich
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19  3:06 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 18, 2008, at 4:03 PM, ron minnich wrote:
> On Thu, Dec 18, 2008 at 3:57 PM, Russ Cox <rsc@swtch.com> wrote:
>> I would just seek to the end.
>> That's fine unless you have multiple
>> programs writing O_APPEND simultaneously,
>> in which case you are asking for trouble.
>>
>
> yep. The code in nfs clients to support O_APPEND is a wonder to
> behold. A nicer combination of rubber bands and paper clips you never
> did see.

It's fun, yes. But I believe this is more of a testament to the
statelessness of NFS, plus the fact that the "end of file" is not a
well-defined offset (unlike the beginning of the file).

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  3:06     ` Roman Shaposhnik
@ 2008-12-19  3:26       ` ron minnich
  2008-12-19  3:59         ` Roman Shaposhnik
  0 siblings, 1 reply; 33+ messages in thread
From: ron minnich @ 2008-12-19  3:26 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Dec 18, 2008 at 7:06 PM, Roman Shaposhnik <rvs@sun.com> wrote:

> It's fun, yes. But I believe this is more of a testament to the
> statelessness of NFS, plus the fact that the "end of file" is not a
> well-defined offset (unlike the beginning of the file).
>

no, it's even worse with stateful systems.

The mistake is doing append mode at the client, not the server.

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  3:03   ` Roman Shaposhnik
@ 2008-12-19  3:43     ` erik quanstrom
  2008-12-19  3:54       ` Roman Shaposhnik
  0 siblings, 1 reply; 33+ messages in thread
From: erik quanstrom @ 2008-12-19  3:43 UTC (permalink / raw)
  To: 9fans

> Agreed. Now, here's a bit that I still don't quite
> understand: Plan9 does have DMAPPEND on
> a per-Qid basis. Why was it decided not to
> have it on a per-Fid basis (which would match
> POSIX semantics 100%)?
>
> The way I understand -- DMAPPEND is just a hint
> to the server to *alway* ignore the offset in
> incoming writes. It seems that ignoring offsets
> in writes for the Fids that asked for it wouldn't be
> much more difficult, would it?

DMAPPEND, for servers that implement it, is not
a hint to the server; it's a write to the end of the file,
whatever offset that might be.

since the end is computed on the file server, multiple
concurrent writers don't cause a problem, since the
fs is in a position to serialize appends to the same
file.
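
a minimal sketch of that, with an invented in-memory File struct
(illustration only, not any real 9p server's code):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    enum { DMAPPEND = 0x40000000 };  /* per-file append bit, as in stat(5) */

    typedef struct File File;
    struct File {
        uint32_t mode;
        uint8_t *data;
        uint64_t len, cap;
        pthread_mutex_t lk;          /* serializes all writers to this file */
    };

    /* handle a Twrite: the server both picks the offset and
     * serializes the append, so concurrent writers are safe */
    int64_t
    filewrite(File *f, uint64_t offset, uint8_t *buf, uint32_t count)
    {
        uint8_t *p;

        pthread_mutex_lock(&f->lk);
        if(f->mode & DMAPPEND)
            offset = f->len;         /* ignore the client's offset entirely */
        if(offset+count > f->cap){
            p = realloc(f->data, (offset+count)*2);
            if(p == NULL){
                pthread_mutex_unlock(&f->lk);
                return -1;
            }
            f->data = p;
            f->cap = (offset+count)*2;
        }
        memcpy(f->data+offset, buf, count);
        if(offset+count > f->len)
            f->len = offset+count;
        pthread_mutex_unlock(&f->lk);
        return count;
    }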

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  3:43     ` erik quanstrom
@ 2008-12-19  3:54       ` Roman Shaposhnik
  2008-12-19  4:13         ` geoff
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19  3:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 18, 2008, at 7:43 PM, erik quanstrom wrote:
>> Agreed. Now, here's a bit that I still don't quite
>> understand: Plan9 does have DMAPPEND on
>> a per-Qid basis. Why was it decided not to
>> have it on a per-Fid basis (which would match
>> POSIX semantics 100%)?
>>
>> The way I understand -- DMAPPEND is just a hint
>> to the server to *alway* ignore the offset in
>> incoming writes. It seems that ignoring offsets
>> in writes for the Fids that asked for it wouldn't be
>> much more difficult, would it?
>
> DMAPPEND, for servers that implement it, is not
> a hint to the server, it's a write to the end of the file,
> whatever offset that might be.

Well, modulo clumsy wording on my part, your explanation
seems to be 100% the same as mine.

> since the end is computed on the file server, multiple
> concurrent writers don't cause a problem. since the
> fs is in a position to serialize appends to the same
> file.

And where does that contradict the idea of having the
equivalent of DMAPPEND set per-FID? Or, more
precisely, per open FID.

Look, one way or the other the server has a *choice*
of ignoring the offset in write(5) messages and always
appending the data. It can do that effectively and safely
(unlike the client). All I'm asking is why the obvious
benefits of asking the server to ignore the offsets
ONLY for a particular open fid were not considered
to be a good thing.

Your reply doesn't answer that question.

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  3:26       ` ron minnich
@ 2008-12-19  3:59         ` Roman Shaposhnik
  2008-12-19 16:44           ` ron minnich
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19  3:59 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 18, 2008, at 7:26 PM, ron minnich wrote:
> On Thu, Dec 18, 2008 at 7:06 PM, Roman Shaposhnik <rvs@sun.com> wrote:
>> Its fun, yes. But I believe this is more of a testament to the
>> statelessness
>> of the NFS
>> plus the fact that the "end of file" is not a well defined offset
>> (unlike
>> beginning of
>> the file).
> no, it's even worse with stateful systems.

Why? Please elaborate.

I can see how a trivial change to 9P's open can lead to a desired
behavior of append-on-the-server. But the open is only there because
9P is stateful.

If you don't have a permanent channel to your data it becomes
very difficult to ask for particular processing of read/writes.

All in all, I don't think I agree with your comment.

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  3:54       ` Roman Shaposhnik
@ 2008-12-19  4:13         ` geoff
  2008-12-19  8:23           ` Russ Cox
  2008-12-19 14:21           ` erik quanstrom
  0 siblings, 2 replies; 33+ messages in thread
From: geoff @ 2008-12-19  4:13 UTC (permalink / raw)
  To: 9fans

The places that DMAPPEND is used most commonly are log files and mail
boxes.  In both cases, I don't want the decision of whether to
truncate or append left to the whims of some program.  I want writes
to append, by god, and DMAPPEND on actual disk-based file servers such
as fossil and fs does that.  (Yes, a malicious program could perhaps
clear DMAPPEND and truncate, but the more likely cause of disaster in
a system without DMAPPEND is someone forgetting to set a hypothetical
`append mode' on a fid or equivalent.  You only have to slip up once
to ruin someone's day.)




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  4:13         ` geoff
@ 2008-12-19  8:23           ` Russ Cox
  2008-12-19 19:49             ` Roman Shaposhnik
  2008-12-19 14:21           ` erik quanstrom
  1 sibling, 1 reply; 33+ messages in thread
From: Russ Cox @ 2008-12-19  8:23 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Append-only and exclusive-use are properties of files
and need to be enforced uniformly across all clients
to be meaningful.  They must be per-file, not per-fd.

Russ


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  4:13         ` geoff
  2008-12-19  8:23           ` Russ Cox
@ 2008-12-19 14:21           ` erik quanstrom
  1 sibling, 0 replies; 33+ messages in thread
From: erik quanstrom @ 2008-12-19 14:21 UTC (permalink / raw)
  To: 9fans

> The places that DMAPPEND is used most commonly are log files and mail
> boxes.

mailboxes are append-only; however,
deleting a message requires rewriting
the mailbox, which an append-only file
doesn't allow.  so a temporary mbox is
written, has its mode tweaked and then
replaces the mbox.  L.mbox is exclusive-open
and locks the whole directory to prevent
accidents.

since each message is in its own file,
mdir uses atomic create(2) (OEXCL)
for delivery.  deleting is trivial.  no
L.mbox required.
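
roughly, in posix terms (the naming scheme here is invented;
O_CREAT|O_EXCL plays the role of plan 9's exclusive create):

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* deliver one message as its own file.  the exclusive create
     * fails if another deliverer picked the same name, so no
     * L.mbox-style lock is needed. */
    int
    deliver(const char *dir, const char *msg, size_t len)
    {
        char path[256];
        int fd, seq;

        for(seq = 0; seq < 1000; seq++){
            snprintf(path, sizeof path, "%s/%ld.%d", dir, (long)time(0), seq);
            fd = open(path, O_WRONLY|O_CREAT|O_EXCL, 0600);
            if(fd < 0)
                continue;            /* name taken; try the next one */
            write(fd, msg, len);
            close(fd);
            return 0;
        }
        return -1;
    }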

back to the subject.  i agree with the
point that mixing append-only and regular
fids would be a disaster.  this is because
(in general) it takes multiple writes to
accomplish one's goal.  in the case of
a mailbox, it would not be safe to be
adding a new message while rewriting
to delete messages.  an exclusive-open
file would make much more sense.

log files are the big exception, of
course: nobody cares if the entries
are reordered, as long as they remain
intact.  and since each entry is smaller
than the iounit (syslog uses a 1k
buffer), they can fit into a single write
and can be ordered.

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  3:59         ` Roman Shaposhnik
@ 2008-12-19 16:44           ` ron minnich
  2008-12-19 19:21             ` Anthony Sorace
  2008-12-19 19:59             ` Roman Shaposhnik
  0 siblings, 2 replies; 33+ messages in thread
From: ron minnich @ 2008-12-19 16:44 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Dec 18, 2008 at 7:59 PM, Roman Shaposhnik <rvs@sun.com> wrote:
> On Dec 18, 2008, at 7:26 PM, ron minnich wrote:
>>
>> On Thu, Dec 18, 2008 at 7:06 PM, Roman Shaposhnik <rvs@sun.com> wrote:
>>>
>>> Its fun, yes. But I believe this is more of a testament to the
>>> statelessness
>>> of the NFS
>>> plus the fact that the "end of file" is not a well defined offset (unlike
>>> beginning of
>>> the file).
>>
>> no, it's even worse with stateful systems.
>


you want to write at EOF. Where is EOF? On Plan 9, with an append
file, the server by definition always knows: it's where the last
write was. So writes go at EOF.

What about writing append files in a stateful FS where it's up to the
client to figure out where the end is?

client by definition knows more than the server. So client has to do this:
1. get metadata in a way that indicates that nobody else gets to
write. Client calls server to get exclusive access to metadata/file.
This can result in server-client callbacks to all other clients. This
is fun to watch on 1000s of nodes. Before the right hacks went in it
could take 30 minutes. I am not making this up. Why? Well, what if
*every* one of the thousands of clients is trying to write at
eof and they're all fighting for the metadata? Congestive collapse,
that's what.
2. Client writes at eof. Since the client has exclusive access at this
point, it's pretty fast.
3. Client releases the metadata lock to the server and hence to the
other thousands of clients.

The 'client write at EOF' is bad for precisely the same reason that
you don't want to use shared memory for locks in a CC-NUMA machine;
you want to send the operation to the data, not move the data to the
operation. Lots of great papers on this over the years ...

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 16:44           ` ron minnich
@ 2008-12-19 19:21             ` Anthony Sorace
  2008-12-19 19:31               ` erik quanstrom
  2008-12-19 19:41               ` ron minnich
  2008-12-19 19:59             ` Roman Shaposhnik
  1 sibling, 2 replies; 33+ messages in thread
From: Anthony Sorace @ 2008-12-19 19:21 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> client by definition knows more than the server.

i assume you mean "knows less"? the server knows where EOF is and
which files to enforce append-only on. your #1 seems to only exist
because the client doesn't have that info.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:21             ` Anthony Sorace
@ 2008-12-19 19:31               ` erik quanstrom
  2008-12-19 19:41               ` ron minnich
  1 sibling, 0 replies; 33+ messages in thread
From: erik quanstrom @ 2008-12-19 19:31 UTC (permalink / raw)
  To: 9fans

On Fri Dec 19 14:24:52 EST 2008, anothy@gmail.com wrote:
> > client by definition knows more than the server.
>
> i assume you mean "knows less"? the server knows where EOF is and
> which files to enforce append-only on. your #1 seems to only exist
> because the client doesn't have that info.

i think it's deeper than that.

if the server has instructions to stick the write at the
end of the file, the server has the ability to prevent
any other writes while executing the append.  doing
this at the client side is hard because, regardless of the
client's knowledge, there can be other clients which also
believe they know things equally well; without some
sort of locking or other shenanigans on the side, there's a race.

the server is almost by definition in a better position
to append than the client.

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:21             ` Anthony Sorace
  2008-12-19 19:31               ` erik quanstrom
@ 2008-12-19 19:41               ` ron minnich
  1 sibling, 0 replies; 33+ messages in thread
From: ron minnich @ 2008-12-19 19:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Dec 19, 2008 at 11:21 AM, Anthony Sorace <anothy@gmail.com> wrote:
>> client by definition knows more than the server.
>
> i assume you mean "knows less"? the server knows where EOF is and
> which files to enforce append-only on. your #1 seems to only exist
> because the client doesn't have that info.

in a stateless system, the client knows more, believe it or not.

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19  8:23           ` Russ Cox
@ 2008-12-19 19:49             ` Roman Shaposhnik
  2008-12-19 19:56               ` erik quanstrom
  2008-12-19 20:02               ` ron minnich
  0 siblings, 2 replies; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19 19:49 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 19, 2008, at 12:23 AM, Russ Cox wrote:
> Append-only and exclusive-use are properties of files
> and need to be enforced uniformly across all clients
> to be meaningful.  They must be per-file, not per-fd.


Two questions:
    1. But before I ask this one: I don't deny that per-file append-only
     is *extremely* useful. My question is a different one: what is
     the danger of N clients accessing the file X in append-only mode
     and M clients accessing it in random-access mode? Could you,
     please, give a concrete scenario?

     2. Could you, please, answer the question in the original email
     of whether the kind of trivial patch I provided (for the real
     thing you also need to handle O_APPEND in fusecreate) would be
     acceptable for inclusion into Hg? I have no problem maintaining
     the extra code on the side, but if the change is deemed *not*
     acceptable, that translates into it being dangerous or not good
     enough. And if that's the case, I'd really appreciate an
     explanation.

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:49             ` Roman Shaposhnik
@ 2008-12-19 19:56               ` erik quanstrom
  2008-12-19 20:10                 ` Roman Shaposhnik
  2008-12-19 20:02               ` ron minnich
  1 sibling, 1 reply; 33+ messages in thread
From: erik quanstrom @ 2008-12-19 19:56 UTC (permalink / raw)
  To: 9fans

> Two questions:
>     1. But before I ask this one: I don't deny that per-file append-only
>      is *extremely* useful. My question is a different one: what is
>      the danger of N clients accessing the file X in append-only mode
>      and M clients accessing it in random-access mode? Could you,
>      please, give a concrete scenario?

credit geoff for bringing this up: upas mailboxes.
suppose you have upas/deliver trying to deliver a message and at
the same time you have upas/fs trying to rewrite the mailbox.
(play along for a bit.  ignore L.mbox and the temporary mbox
tricks.)

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 16:44           ` ron minnich
  2008-12-19 19:21             ` Anthony Sorace
@ 2008-12-19 19:59             ` Roman Shaposhnik
  2008-12-19 20:06               ` erik quanstrom
  2008-12-19 20:18               ` Charles Forsyth
  1 sibling, 2 replies; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19 19:59 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 19, 2008, at 8:44 AM, ron minnich wrote:
> On Thu, Dec 18, 2008 at 7:59 PM, Roman Shaposhnik <rvs@sun.com> wrote:
>> On Dec 18, 2008, at 7:26 PM, ron minnich wrote:
>>>
>>> On Thu, Dec 18, 2008 at 7:06 PM, Roman Shaposhnik <rvs@sun.com>
>>> wrote:
>>>>
>>>> Its fun, yes. But I believe this is more of a testament to the
>>>> statelessness
>>>> of the NFS
>>>> plus the fact that the "end of file" is not a well defined offset
>>>> (unlike
>>>> beginning of
>>>> the file).
>>>
>>> no, it's even worse with stateful systems.
>>
>
> you want to write at EOF. Where is EOF? On Plan 9 on an append file,
> server by definition always knows: it's where the last write was. So
> writes go at EOF.

And how is it different from what I was suggesting: A Fid that makes
*all* writes be at EOF? You want to write at EOF? Easy -- just use
that pre-negotiated Fid that was opened with a (now nonexistent)
DMAPPEND flag added to the mode. You want a random-access
write AT THE SAME TIME? Easy -- just open that very same Qid one
more time and have a Fid that does honor offsets in your writes.
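
Purely as illustration, the dispatch difference on the server side
would be tiny. A hypothetical check in the Twrite path (the per-fid
OAPPEND flag and all the names here are invented for this sketch):

    #include <stdint.h>

    enum {
        DMAPPEND = 0x40000000,   /* the real per-file bit from stat(5) */
        OAPPEND  = 0x4000        /* invented per-fid open flag */
    };

    /* the server, not the client, picks the offset in either case */
    uint64_t
    pick_offset(uint32_t filemode, uint32_t fidflags,
                uint64_t filelen, uint64_t reqoffset)
    {
        if((filemode & DMAPPEND) || (fidflags & OAPPEND))
            return filelen;      /* append: ignore the requested offset */
        return reqoffset;        /* honor the client's offset */
    }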

Once again -- I don't deny that ALSO having ALWAYS-append
files is extremely useful.

All I'm saying is that from where I sit the idea of ALSO having
a way to make append-only Fids seems to be extremely useful
in its own right. And nobody has yet cared to give a concrete explanation
of why it might be a bad idea.

> The 'client write at EOF' is bad for precisely the same reason that
> you don't want to use shared memory for locks in a CC-NUMA machine;
> you want to send the operation to the data, not move the data to the
> operation. Lots of great papers on this over the years ...

That is exactly what I'm suggesting -- have yet another mechanism to
let the server decide where the EOF is.

Thanks,
Roman.

P.S. Am I that incomprehensible? :-(



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:49             ` Roman Shaposhnik
  2008-12-19 19:56               ` erik quanstrom
@ 2008-12-19 20:02               ` ron minnich
  1 sibling, 0 replies; 33+ messages in thread
From: ron minnich @ 2008-12-19 20:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Dec 19, 2008 at 11:49 AM, Roman Shaposhnik <rvs@sun.com> wrote:

> Two questions:
>   1. But before I ask this one: I don't deny that per-file append-only
>    is *extremely* useful. My question is a different one: what is
>    the danger of N clients accesing the file X in append-only mode
>    and M clients accesing it in random access mode? Could you,
>    please, give a concrete scenario?

there are no problems if you don't care about the integrity of the data.

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:59             ` Roman Shaposhnik
@ 2008-12-19 20:06               ` erik quanstrom
  2008-12-19 20:18               ` Charles Forsyth
  1 sibling, 0 replies; 33+ messages in thread
From: erik quanstrom @ 2008-12-19 20:06 UTC (permalink / raw)
  To: 9fans

> And how is it different from what I was suggesting: A Fid that makes
> *all* writes be at EOF? You want to write at EOF? Easy -- just use
> that pre-negotiated Fid that was opened with a (now nonexistent)
> DMAPPEND flag added to the mode. You want a random-access
> write AT THE SAME TIME? Easy -- just open that very same Qid one
> more time and have a Fid that does honor offsets in your writes.

when would this be useful?

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:56               ` erik quanstrom
@ 2008-12-19 20:10                 ` Roman Shaposhnik
  2008-12-19 20:22                   ` erik quanstrom
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-19 20:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 19, 2008, at 11:56 AM, erik quanstrom wrote:
>> Two questions:
>>    1. But before I ask this one: I don't deny that per-file append-
>> only
>>     is *extremely* useful. My question is a different one: what is
>>     the danger of N clients accessing the file X in append-only mode
>>     and M clients accessing it in random-access mode? Could you,
>>     please, give a concrete scenario?
>
> credit geoff for bringing this up: upas mailboxes.
> suppose you have upas/deliver trying to deliver a message and at
> the same time you have upas/fs trying to rewrite the mailbox.
> (play along for a bit.  ignore L.mbox and the temporary mbox
> tricks.)


It is difficult to answer your question without knowing what rewrite
actually does and how mailboxes are structured. But in an imaginary
world where a mailbox is a list of constant-sized blocks (size < iounit),
a bunch of simultaneous appends and rewrites of existing blocks
would work perfectly well.

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 19:59             ` Roman Shaposhnik
  2008-12-19 20:06               ` erik quanstrom
@ 2008-12-19 20:18               ` Charles Forsyth
  2008-12-21  5:08                 ` Roman V. Shaposhnik
  1 sibling, 1 reply; 33+ messages in thread
From: Charles Forsyth @ 2008-12-19 20:18 UTC (permalink / raw)
  To: 9fans

>And nobody has yet cared to give a concrete explanation of why it might be a bad idea.

what's the application you've got in mind?



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 20:10                 ` Roman Shaposhnik
@ 2008-12-19 20:22                   ` erik quanstrom
  0 siblings, 0 replies; 33+ messages in thread
From: erik quanstrom @ 2008-12-19 20:22 UTC (permalink / raw)
  To: 9fans

> It is difficult to answer your question without knowing what rewrite
> actually does and how mailboxes are structured. But in an imaginary
> world where a mailbox is a list of constant sized blocks (size<iounit)
> a bunch of simultaneous appends and rewrites of existing blocks
> would work perfectly well.

(your imaginary mailbox sounds like a wormhole that builds a filesystem
inside a file.)

a mailbox, if you recall from unix, is a bunch of messages concatenated
into a file.  each message is framed by a "From " line and a blank line.

obviously, this is not efficient for big mailboxes.  since i support users
with GB+ mailboxes, i implemented a one-file-per-message scheme
which doesn't require append semantics, though it does use atomic
open.

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-18 23:34 [9fans] 9pfuse and O_APPEND Roman Shaposhnik
  2008-12-18 23:57 ` Russ Cox
@ 2008-12-19 21:00 ` ron minnich
  2008-12-19 21:32   ` Charles Forsyth
  2008-12-21  5:05   ` Roman V. Shaposhnik
  1 sibling, 2 replies; 33+ messages in thread
From: ron minnich @ 2008-12-19 21:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Dec 18, 2008 at 3:34 PM, Roman Shaposhnik <rvs@sun.com> wrote:

>
> So far, I see the following choices for myself:
>   * follow what v9fs does and emulate it with llseek(...SEEK_END). Not
> ideal,
>      since it doesn't always guarantee POSIX semantics, but way better
>      than nothing.

and it won't scale and it can't be guaranteed to work at scale. What
really ought to happen is that the unix world should stop making these
guarantees that cannot be honored. Like that's ever going to happen
:-)

>   * emulate per-FID DMAPPEND by letting the server (which I also control)
> accept Qid
>      modifications on wstat. My understanding is that existing 9P servers
> would simply
>      reply with Rerror and I can then fall back on llseek, perhaps.
> Border-line abuse of
>      the protocol.

is your 9p server ever going to be running on an nfs-mounted
partition? If so, then you can do all the abuse of the protocol you
want, but you can't make any guarantees ...

this is a problem.

>   * reserve a (uint)-1 offset in writes as an indication to append to the end
> of the file. Really
>      seems like an abuse of the protocol :-(

I don't know. This seems the least unlikeable. Right now the use of a -1
offset to the kernel means "write at the current offset". I find myself
unable to remember if that -1 would be seen on the wire -- but I'm sure
the smart people do.

But you still can't guarantee it. That's the problem -- if append mode
is an intrinsic property of the file, honored and managed at the
server that really owns that file, that's a difference in kind from
clients who are doing their best to "write at end". And your 9pfuse
server is really a client in the end. So you may get it almost working
but there are no guarantees.

It's interesting to watch network wires melt trying to manage server
client callbacks so that "posix semantics" work ...

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 21:32   ` Charles Forsyth
@ 2008-12-19 21:29     ` ron minnich
  0 siblings, 0 replies; 33+ messages in thread
From: ron minnich @ 2008-12-19 21:29 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Dec 19, 2008 at 1:32 PM, Charles Forsyth <forsyth@terzarima.net> wrote:
>> if that -1 would be seen on the wire
>
> no. it's just a flag to select the code path that provides the offset,
> and entirely internal (just as well).
>
>

I figured as much. Oh well. Sorry, Roman.

ron



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 21:00 ` ron minnich
@ 2008-12-19 21:32   ` Charles Forsyth
  2008-12-19 21:29     ` ron minnich
  2008-12-21  5:05   ` Roman V. Shaposhnik
  1 sibling, 1 reply; 33+ messages in thread
From: Charles Forsyth @ 2008-12-19 21:32 UTC (permalink / raw)
  To: 9fans

> if that -1 would be seen on the wire

no. it's just a flag to select the code path that provides the offset,
and entirely internal (just as well).



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 21:00 ` ron minnich
  2008-12-19 21:32   ` Charles Forsyth
@ 2008-12-21  5:05   ` Roman V. Shaposhnik
  2008-12-21 14:45     ` erik quanstrom
  1 sibling, 1 reply; 33+ messages in thread
From: Roman V. Shaposhnik @ 2008-12-21  5:05 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, 2008-12-19 at 13:00 -0800, ron minnich wrote:
> >   * emulate per-FID DMAPPEND by letting the server (which I also control)
> > accept Qid
> >      modifications on wstat. My understanding is that existing 9P servers
> > would simply
> >      reply with Rerror and I can then fall back on llseek, perhaps.
> > Border-line abuse of
> >      the protocol.
>
> is your 9p server ever going to be running on an nfs-mounted
> partition?

As with any software -- it would be pretty difficult for me to prevent
somebody from doing that, but in general -- no.

> If so then you can do all the abuse of the protocol you want but you can't make
> any guarantees ...

That's a good point. It's funny, but some bits of software we produce are
also NFS-unfriendly. They will warn you if you are trying to use NFS.
I guess I might do the same.

> But you still can't guarantee it. That's the problem -- if append mode
> is an intrinsic property of the file, honored and managed at the
> server that really owns that file, that's a difference in kind from
> clients who are doing their best to "write at end".

As I have indicated some time ago -- it's NOT a problem for me to have
X clients do random access and Y clients doing DMAPPEND on the same
file. The integrity of the file will NOT be affected. It is protected by
the chunk_size that ALL I/O is coming to me at. The only
thing that I have to guarantee at the server end is that all of the
writes coming from clients asking for DMAPPEND will, in fact, be
written at the offset corresponding to the EOF (whatever that might
be at the moment when the I/O arrives).

There's no "doing their best" involved at the clients from the set Y.
They simply indicate that ALL of their I/O should be DMAPPEND. The
server has ALL the power to make it happen.

In fact, in this rather long thread, nobody yet has come up with
a convincing argument of why the above scenario wouldn't work.

> And your 9pfuse server is really a client in the end.

Just to be clear: 9pfuse is a PURE client in my particular case. I might
decide to enhance it to honor O_APPEND, but that will not make
it less of a client. Of course, whatever I do, I will NOT be emulating
O_APPEND. There will be no "do my best" involved.

> So you may get it almost working but there are no guarantees.

The only thing that can break the guarantee is the server serving the
NFS mounted files.

> It's interesting to watch network wires melt trying to manage server
> client callbacks so that "posix semantics" work ...

Mmm. Yes, but the scheme I've just described will be exactly what
POSIX mandates for O_APPEND. Whether it'll melt the wires remains
to be seen.

Thanks,
Roman.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-19 20:18               ` Charles Forsyth
@ 2008-12-21  5:08                 ` Roman V. Shaposhnik
  0 siblings, 0 replies; 33+ messages in thread
From: Roman V. Shaposhnik @ 2008-12-21  5:08 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, 2008-12-19 at 20:18 +0000, Charles Forsyth wrote:
> >And nobody has yet cared to give a concrete explanation of why it might be a bad idea.
>
> what's the application you've got in mind?

Legacy ones :-( At the moment -- they are homegrown databases. And yes,
as Erik pointed out -- they look more like FS-envy wormholes.

Thanks,
Roman.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-21  5:05   ` Roman V. Shaposhnik
@ 2008-12-21 14:45     ` erik quanstrom
  2008-12-22 10:02       ` roger peppe
  2008-12-25  6:04       ` Roman Shaposhnik
  0 siblings, 2 replies; 33+ messages in thread
From: erik quanstrom @ 2008-12-21 14:45 UTC (permalink / raw)
  To: 9fans

> > is your 9p server ever going to be running on an nfs-mounted
> > partition?
>
> As with any software -- it would be pretty difficult for me to prevent
> somebody from doing that, but in general -- no.

i use "in general" to mean the exact opposite of what
you are saying here; there is a case where it could happen.

> As I have indicated some time ago -- it's NOT a problem for me to have
> X clients do random access and Y clients doing DMAPPEND on the same
> file. The integrity of the file will NOT be affected. It is protected by
> the chunk_size that ALL I/O is coming to me at. The only
> thing that I have to guarantee at the server end is that all of the
> writes coming from clients asking for DMAPPEND will, in fact, be
> written at the offset corresponding to the EOF (whatever that might
> be at the moment when the I/O arrives).

okay, so you're using DMAPPEND like sbrk(2).  how do you avoid
clients caring about the offset of this hunk of the file?
that is, the same problem malloc has in a multi-threaded app
with sbrk.

sounds like distributed shared memory with lazy allocation.

why not put the things you're storing in separate files, or, if there
are an unwieldy number of things, use some sort of clone interface
to create a new $thing that the server carefully maps into the big file?

- erik



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-21 14:45     ` erik quanstrom
@ 2008-12-22 10:02       ` roger peppe
  2008-12-25  6:04       ` Roman Shaposhnik
  1 sibling, 0 replies; 33+ messages in thread
From: roger peppe @ 2008-12-22 10:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sun, Dec 21, 2008 at 2:45 PM, erik quanstrom <quanstro@quanstro.net> wrote:
> okay, so you're using DMAPPEND like sbrk(2).  how do you avoid
> clients caring about the address of this new hunk of memory?^u
> clients caring about the offset of this hunk of the file?
> that is, the same problem malloc has in a multi-threaded app
> with sbrk.

in google's filesystem, i believe a write returns the
offset it actually wrote at. if 9p were extended to do that,
maybe per-fid append semantics might be more useful.
(although if appending writes were rare enough, i suppose
you could write the block with a unique identifier and
scan to find where it actually ended up)
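
a sketch of what that api could look like, with posix file locking
standing in for the server's serialization (9p's Rwrite carries only
a count, so actually returning the offset would need a protocol
change; the names here are made up):

    #include <fcntl.h>
    #include <sys/file.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* append n bytes and report the offset they landed at */
    off_t
    record_append(int fd, const void *buf, size_t n)
    {
        struct stat st;

        if(flock(fd, LOCK_EX) < 0)   /* serialize appenders */
            return -1;
        if(fstat(fd, &st) < 0 || pwrite(fd, buf, n, st.st_size) < 0){
            flock(fd, LOCK_UN);
            return -1;
        }
        flock(fd, LOCK_UN);
        return st.st_size;           /* where this record begins */
    }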



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-21 14:45     ` erik quanstrom
  2008-12-22 10:02       ` roger peppe
@ 2008-12-25  6:04       ` Roman Shaposhnik
  2008-12-25  6:33         ` erik quanstrom
  1 sibling, 1 reply; 33+ messages in thread
From: Roman Shaposhnik @ 2008-12-25  6:04 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Dec 21, 2008, at 6:45 AM, erik quanstrom wrote:
>>> is your 9p server ever going to be running on an nfs-mounted
>>> partition?
>>
>> As with any software -- it would be pretty difficult for me to
>> prevent
>> somebody from doing that, but in general -- no.
>
> i use "in general" to mean the exact opposite of what
> you are saying here; there is a case where it could happen.

Well, it is quite difficult to deny an audience to Mr. Murphy, isn't
it? ;-)

>> As I have indicated some time ago -- it's NOT a problem for me to have
>> X clients do random access and Y clients doing DMAPPEND on the same
>> file. The integrity of the file will NOT be affected. It is
>> protected by
>> the chunk_size that ALL I/O is coming to me at. The only
>> thing that I have to guarantee at the server end is that all of the
>> writes coming from clients asking for DMAPPEND will, in fact, be
>> written at the offset corresponding to the EOF (whatever that might
>> be at the moment when the I/O arrives).
>
> okay, so you're using DMAPPEND like sbrk(2).

That's a pretty good analogy, yes.

> clients caring about the offset of this hunk of the file?
> that is, the same problem malloc has in a multi-threaded app
> with sbrk.

yes. But unlike what happens in malloc's case -- these regions
are *not* for clients to work with. They are for dumping data.
When there's fragmentation you can be given an offset to
write to, but that's mostly an optimization. What really needs
to happen is very simple -- the data needs to be dumped.

> why not put the things you're storing in separate files, or, if there
> are an unwieldy number of things, use some sort of clone interface
> to create a new $thing that the server carefully maps into the big
> file?

These are quite reasonable suggestions. Thanks.

In fact, this entire thread was quite helpful to me, since it made
me realize that my expectation of what constitutes the most common
use for APPEND-like semantics is NOT what most of you guys have in
mind.

That's fair. But let me flip the question a bit: what do you all use
DMAPPEND for? What are the examples of the most appropriate
usage for it in existing Plan 9 software?

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [9fans] 9pfuse and O_APPEND
  2008-12-25  6:04       ` Roman Shaposhnik
@ 2008-12-25  6:33         ` erik quanstrom
  0 siblings, 0 replies; 33+ messages in thread
From: erik quanstrom @ 2008-12-25  6:33 UTC (permalink / raw)
  To: 9fans

> That's fair. But let me flip the question a bit: what do you all use
> DMAPPEND for? What are the examples of the most appropriate
> usage for it in existing Plan 9 software?

i think log files are the canonical use of append-only files.
mbox-style mailboxes also use append-only semantics for delivery
but do other things when deleting mail.  i can't think of any other
examples.

- erik




^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2008-12-25  6:33 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-18 23:34 [9fans] 9pfuse and O_APPEND Roman Shaposhnik
2008-12-18 23:57 ` Russ Cox
2008-12-19  0:03   ` ron minnich
2008-12-19  3:06     ` Roman Shaposhnik
2008-12-19  3:26       ` ron minnich
2008-12-19  3:59         ` Roman Shaposhnik
2008-12-19 16:44           ` ron minnich
2008-12-19 19:21             ` Anthony Sorace
2008-12-19 19:31               ` erik quanstrom
2008-12-19 19:41               ` ron minnich
2008-12-19 19:59             ` Roman Shaposhnik
2008-12-19 20:06               ` erik quanstrom
2008-12-19 20:18               ` Charles Forsyth
2008-12-21  5:08                 ` Roman V. Shaposhnik
2008-12-19  3:03   ` Roman Shaposhnik
2008-12-19  3:43     ` erik quanstrom
2008-12-19  3:54       ` Roman Shaposhnik
2008-12-19  4:13         ` geoff
2008-12-19  8:23           ` Russ Cox
2008-12-19 19:49             ` Roman Shaposhnik
2008-12-19 19:56               ` erik quanstrom
2008-12-19 20:10                 ` Roman Shaposhnik
2008-12-19 20:22                   ` erik quanstrom
2008-12-19 20:02               ` ron minnich
2008-12-19 14:21           ` erik quanstrom
2008-12-19 21:00 ` ron minnich
2008-12-19 21:32   ` Charles Forsyth
2008-12-19 21:29     ` ron minnich
2008-12-21  5:05   ` Roman V. Shaposhnik
2008-12-21 14:45     ` erik quanstrom
2008-12-22 10:02       ` roger peppe
2008-12-25  6:04       ` Roman Shaposhnik
2008-12-25  6:33         ` erik quanstrom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).