[TUHS] why is sum reporting different checksum's between v6 and v7

The Unix Heritage Society mailing list

* [TUHS] why is sum reporting different checksum's between v6 and v7
@ 2015-12-12  0:30 Noel Chiappa
  2015-12-12  1:07 ` Random832
  0 siblings, 1 reply; 8+ messages in thread
From: Noel Chiappa @ 2015-12-12  0:30 UTC (permalink / raw)


    > From: Will Senn

    > I noticed that the sum utility from v6 reports a different checksum
    > than it does using the sum utility from v7 for the same file.
    > ... does anyone know what's going on here?
    > Why is sum reporting different checksum's between v6 and v7?

The two use different algorithms to accumulate the sum (I have added comments
to the relevant portion of the V6 assembler one, to help understand it):

V6:
	mov	$buf,r2		/ Pointer to buffer in R2
    2:	movb	(r2)+,r4	/ Get new byte into R4 (sign extends!)
	add	r4,r5		/ Add to running sum
	adc	r5		/ If overflow, add carry into low end of sum
	sob	r0,2b		/ If any bytes left, go around again

Read the description of MOVB in the PDP-11 Processor manual.
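
For readers who do not read PDP-11 assembler, the loop above can be modelled
in C. This is only a sketch of the five instructions quoted here, not the rest
of the V6 program; the name v6_sum_model is made up for illustration, and the
sign extension is written out by hand on the assumption that MOVB into a
register behaves as the comment says.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the quoted V6 loop: sign-extend each byte,
 * add it to a 16-bit sum, then add the carry-out back into the low end. */
uint16_t v6_sum_model(const unsigned char *buf, size_t n)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* movb (r2)+,r4 : bytes 0200..0377 become 0177600..0177777 */
        uint16_t word = (buf[i] & 0x80) ? (uint16_t)(0xFF00u | buf[i])
                                        : (uint16_t)buf[i];
        uint32_t wide = (uint32_t)sum + word;   /* add r4,r5 */
        sum = (uint16_t)(wide + (wide >> 16));  /* adc r5 */
    }
    return sum;
}
```

Fed the 13 bytes of "Hello, World\n", the model returns 1106, matching the V6
output reported in the thread. For a lone byte 0377 the model returns 65535;
that is a property of the sketch, not a measurement from a real V6 system.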

V7:
	while ((c = getc(f)) != EOF) {
		nbytes++;
		if (sum&01)
			sum = (sum>>1) + 0x8000;
		else
			sum >>= 1;
		sum += c;
		sum &= 0xFFFF;
		}

I'm not clear on some of that, so I'll leave its exact workings as an
exercise, but I'm pretty sure it's not an equivalent algorithm (as in,
something that would produce the same results); it's certainly not
identical. (The right shift is basically a rotate, so it's not a straight sum,
it's more like the Fletcher checksum used by XNS, if anyone remembers that.)
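
The "basically a rotate" reading can be checked with a small C sketch that
uses an explicit 16-bit type. The function name v7_sum_model is invented here,
and fixed-width types stand in for the PDP-11's 16-bit 'unsigned'; it is a
restatement of the loop quoted above, not the V7 source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the quoted V7 loop: rotate the 16-bit sum right
 * by one bit, then add the next byte, discarding any carry. */
uint16_t v7_sum_model(const unsigned char *buf, size_t n)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* low bit wraps around to bit 15, same as (sum>>1) + 0x8000 */
        sum = (uint16_t)((sum >> 1) | ((sum & 1u) << 15));
        sum = (uint16_t)(sum + buf[i]);
    }
    return sum;
}
```

For the 13 bytes of "Hello, World\n" this returns 37264, the value V7 printed
in the original report.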

Among the parts I don't get, for instance, sum is declared as 'unsigned',
presumably 16 bits, so the last line _should_ be a NOP!? Also, with 'c' being
implicitly declared as an 'int', does the assignment sign extend? I have this
vague memory that it does. And does the right shift if the low bit is one
really do what the code seems to indicate it does? I have this bit set that ASR on
the PDP-11 copies the high bit, not shifts in a 0 (check the processor
manual).  That is, of course, assuming that the compiler implements the '>>'
with an ASR, not a ROR followed by a clear of the high bit, or something.
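
Two of these questions can at least be answered for present-day C, which may
or may not match what the V7 compiler emitted. In the sketch below (helper
names are made up), a right shift of an unsigned value fills with 0, and
masking a value that is already 16 bits wide changes nothing.

```c
#include <stdint.h>

/* Logical shift: for an unsigned operand, C shifts a 0 into the top bit. */
uint16_t shr1(uint16_t x) { return (uint16_t)(x >> 1); }

/* The V7 step written as an explicit 16-bit rotate right by one. */
uint16_t ror1(uint16_t x) { return (uint16_t)((x >> 1) | ((x & 1u) << 15)); }

/* Masking to 16 bits is a no-op when the type is already 16 bits wide. */
uint16_t mask16(uint16_t x) { return (uint16_t)(x & 0xFFFFu); }
```

With x = 0x8001, shr1 gives 0x4000 and ror1 gives 0xC000, which is the same
value as (x>>1) + 0x8000 in the quoted code.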
  
	Noel




* [TUHS] why is sum reporting different checksum's between v6 and v7
  2015-12-12  0:30 [TUHS] why is sum reporting different checksum's between v6 and v7 Noel Chiappa
@ 2015-12-12  1:07 ` Random832
  0 siblings, 0 replies; 8+ messages in thread
From: Random832 @ 2015-12-12  1:07 UTC (permalink / raw)

Noel Chiappa writes:
> The two use different algorithms to accumulate the sum (I have added comments
> to the relevant portion of the V6 assembler one, to help understand it):
>
> V6:
> 	mov	$buf,r2		/ Pointer to buffer in R2
>     2:	movb	(r2)+,r4	/ Get new byte into R4 (sign extends!)
> 	add	r4,r5		/ Add to running sum
> 	adc	r5		/ If overflow, add carry into low end of sum
> 	sob	r0,2b		/ If any bytes left, go around again

Interestingly, the SysIII sum.c program, which I assume yields the same
result for this input, appears to go through the whole input
accumulating the sum of all the bytes into a long, then adds the two
halves of the long at the end rather than after every byte. This
suggests that the two programs would give different results for very
large files that overflow a 32-bit value. Of course, that's (16843010
bytes if all of them are 255) well beyond the size of file you're likely
to encounter on a v6 system.

Also, if this sign extends, then its behavior on "negative" (high bit
set) bytes is likely to be very different from the SysIII one, which
uses getc.

Can someone who has V6 up test what the checksum is for a file consisting
of a single byte with the high bit set? On the "modern" implementations it
is the same as the value of the byte [e.g. 255] in both algorithms.
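
The accumulate-then-fold scheme described above might look like the following
in C. This is a sketch written from the description in this message, not
copied from the SysIII source; fold_sum_model is a hypothetical name, and the
bytes are treated as 0..255, as getc would return them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: add every byte (as 0..255) into a 32-bit total,
 * then fold the high half into the low half until it fits in 16 bits. */
uint16_t fold_sum_model(const unsigned char *buf, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += buf[i];                       /* may wrap for huge inputs */
    while (acc >> 16)
        acc = (acc & 0xFFFFu) + (acc >> 16); /* add the two halves */
    return (uint16_t)acc;
}
```

For "Hello, World\n" this gives 1106, and for a single byte 0377 it gives 255,
consistent with the behaviour described for the modern implementations.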


* [TUHS] why is sum reporting different checksum's between v6 and v7
  2015-12-12  1:22 Noel Chiappa
  2015-12-12  1:46 ` Random832
@ 2015-12-12  1:50 ` John Cowan
  1 sibling, 0 replies; 8+ messages in thread
From: John Cowan @ 2015-12-12  1:50 UTC (permalink / raw)


Noel Chiappa scripsit:

> I have this bit set that in C, 'char' is defined to be signed, 

If you mean in PDP-11 C, you're right, char is signed precisely because
MOVB sign extends.  In C in general, char's signedness is
implementation-defined.  Similarly the signedness of a plain int bitfield is
implementation-defined, which means that the portable range of a 1-bit field
is just 0, though there is another value that might be +1 or -1.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
Eric Raymond is the Margaret Mead of the Open Source movement.
          --Bruce Perens, a long time ago




* [TUHS] why is sum reporting different checksum's between v6 and v7
  2015-12-12  1:22 Noel Chiappa
@ 2015-12-12  1:46 ` Random832
  2015-12-12  1:50 ` John Cowan
  1 sibling, 0 replies; 8+ messages in thread
From: Random832 @ 2015-12-12  1:46 UTC (permalink / raw)


Noel Chiappa writes:
> No, I don't think so, depending on the exact details of the implementation. As
> long as when folding the two halves together, you add any carry into the sum,
> you get the same result as doing it into a 16-bit sum.

The issue I was suggesting comes if you've lost carry bits
_before_ folding the two halves together, when you were working
in 32-bit arithmetic.
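
A small arithmetic sketch of that point, assuming a 32-bit accumulator that
silently wraps. The helper names are made up, and the byte total is computed
by multiplication rather than by reading 16 MB of input.

```c
#include <stdint.h>

/* Fold a wide total to 16 bits with end-around carry. */
uint16_t fold16(uint64_t acc)
{
    while (acc >> 16)
        acc = (acc & 0xFFFFu) + (acc >> 16);
    return (uint16_t)acc;
}

/* Checksum of n bytes all equal to b, with no carries lost (64-bit total). */
uint16_t sum_exact(uint64_t n, unsigned b)
{
    return fold16(n * b);
}

/* Same input, but the total wraps at 32 bits before it is folded. */
uint16_t sum_wrapped32(uint64_t n, unsigned b)
{
    return fold16((uint32_t)(n * b));
}
```

With 16843009 bytes of 255 the total is exactly 2^32 - 1 and both functions
return 65535. One byte more and the 32-bit total wraps: sum_exact returns 255
while sum_wrapped32 returns 254, because the carry out of bit 31 is worth 1 in
the folded sum and has been dropped.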

> (If my memory of how
> this all works is correct - the neurons aren't what they used to be,
> especially late in the day... :-)
>
>     > Also, if this sign extends, then its behavior on "negative" (high bit
>     > set) bytes is likely to be very different from the SysIII one, which
>     > uses getc.
>
> I have this bit set that in C, 'char' is defined to be signed

The SysIII sum.c file uses getc and stores the result in an int,
not a char.

I *think* the definition of getc returns positive values the
same as modern systems do, despite the manpage's caution to
check feof because EOF is a "valid integer value":

#define	getc(p)		(--(p)->_cnt>=0? *(p)->_ptr++&0377:_filbuf(p))

_filbuf also has & 0377 in the relevant place.

If getc returns negative values for high-bit characters, on the
other hand, then they would sign-extend to 32 bits when the long
math is done, still yielding different results.





* [TUHS] why is sum reporting different checksum's between v6 and v7
@ 2015-12-12  1:22 Noel Chiappa
  2015-12-12  1:46 ` Random832
  2015-12-12  1:50 ` John Cowan
  0 siblings, 2 replies; 8+ messages in thread
From: Noel Chiappa @ 2015-12-12  1:22 UTC (permalink / raw)


    > From: Random832

    > Interestingly, the SysIII sum.c program, which I assume yields the same
    > result for this input, appears to go through the whole input
    > accumulating the sum of all the bytes into a long, then adds the two
    > halves of the long at the end rather than after every byte.

That's the same hack a lot of TCP/IP checksum routines used on machines with
longer words; add the words, then fold the result into the shorter length at the
end. The one I wrote for the 68K back in '84 did that.

    > This suggests that the two programs would give different results for
    > very large files that overflow a 32-bit value.

No, I don't think so, depending on the exact details of the implementation. As
long as when folding the two halves together, you add any carry into the sum,
you get the same result as doing it into a 16-bit sum. (If my memory of how
this all works is correct - the neurons aren't what they used to be,
especially late in the day... :-)

    > Also, if this sign extends, then its behavior on "negative" (high bit
    > set) bytes is likely to be very different from the SysIII one, which
    > uses getc.

I have this bit set that in C, 'char' is defined to be signed, and
furthermore that when you assign a shorter int to a longer one, the sign is
extended. So if one has a char holding '0200' octal (i.e. -128), assigning it
to a 16-bit int should result in the latter holding '0177600' (i.e. still
-128). So in fact I think they probably act the same.
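
The octal arithmetic in that example can be checked with a short C sketch.
The function names are invented for illustration, and the sign extension is
spelled out by hand so the result does not depend on whether the compiler's
'char' is signed.

```c
#include <stdint.h>

/* Widen a byte to 16 bits as a signed char would be widened:
 * values 0200..0377 are treated as -128..-1. */
uint16_t widen_signed(uint8_t byte)
{
    int16_t wide = (byte & 0x80) ? (int16_t)(byte - 256) : (int16_t)byte;
    return (uint16_t)wide;      /* bit pattern of the 16-bit result */
}

/* Widen a byte after masking with 0377, so the result is always 0..255. */
uint16_t widen_masked(uint8_t byte)
{
    return (uint16_t)(byte & 0377);
}
```

widen_signed(0200) yields 0177600, as stated above, while widen_masked(0200)
stays at 0200. Which of the two a given sum program sees depends on how it
reads its bytes.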

	Noel




* [TUHS] why is sum reporting different checksum's between v6 and v7
  2015-12-12  0:03 Will Senn
  2015-12-12  0:10 ` Clem cole
@ 2015-12-12  0:38 ` Random832
  1 sibling, 0 replies; 8+ messages in thread
From: Random832 @ 2015-12-12  0:38 UTC (permalink / raw)

On 2015-12-12, Will Senn wrote:
> # echo "Hello, World" > hi.txt
> # cat hi.txt
> Hello, World
>
> Then on v6:
> # sum hi.txt
> 1106 1
>
> But on v7:
> # sum hi.txt
> 37264     1

Interestingly, I can get both results on OSX:

% echo "Hello, World" | cksum -o 1
37264 1
% echo "Hello, World" | cksum -o 2
1106 1

Or Ubuntu:
% echo "Hello, World" | sum -r
37264     1
% echo "Hello, World" | sum -s
1106 1

Both of these define the one you got in v7 as a "BSD" algorithm.
So it looks like v7's new algorithm didn't make it into USG
Unix, rather it uses the same one as v6. (According to the OSX
manpage, System V eventually grew a "-r" option to use the newer
algorithm).

The second number is the size in blocks; the block size is 512 bytes for
the "System V" algorithm and 1024 bytes for modern implementations
of the "BSD" algorithm, *but* BUFSIZ (512) for v7.
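
As a rough illustration of the block count, assuming each implementation
rounds the byte count up to a whole number of blocks (the thread does not
spell out the rounding, and the helper name is made up):

```c
/* Hypothetical helper: number of blocks needed to hold nbytes,
 * rounding any partial block up to a full one. */
unsigned long block_count(unsigned long nbytes, unsigned long block_size)
{
    return (nbytes + block_size - 1) / block_size;
}
```

The 13-byte hi.txt comes out as 1 block for either 512-byte or 1024-byte
blocks, which agrees with the "1" in every output shown in the thread. The
two sizes only disagree for larger files, for example 513 bytes.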


* [TUHS] why is sum reporting different checksum's between v6 and v7
  2015-12-12  0:03 Will Senn
@ 2015-12-12  0:10 ` Clem cole
  2015-12-12  0:38 ` Random832
  1 sibling, 0 replies; 8+ messages in thread
From: Clem cole @ 2015-12-12  0:10 UTC (permalink / raw)


A thought. Try recompiling v7 sum on v6.  It's simple enough that the compiler differences should be easy to tease out. 

Sent from my iPhone

> On Dec 11, 2015, at 7:03 PM, Will Senn <will.senn at gmail.com> wrote:
> 
> All,
> 
> While working on the latest episode of my saga about moving files between v6 and v7, I noticed that the sum utility from v6 reports a different checksum than it does using the sum utility from v7 for the same file. To confirm, I did the following on both systems:
> 
> # echo "Hello, World" > hi.txt
> # cat hi.txt
> Hello, World
> 
> Then on v6:
> # sum hi.txt
> 1106 1
> 
> But on v7:
> # sum hi.txt
> 37264     1
> 
> There is no man page for the utility on v6, and it's assembler. On v7, there's a manpage and it's C:
> man sum
> ...
> Sum calculates and prints a 16-bit checksum for the named
>     file, and also prints the number of blocks in the file.
> ...
> 
> A few questions:
> 1. I'll eventually be able to read assembly and learn what the v6 utility is doing the hard way, but does anyone know what's going on here?
> 2. Why is sum reporting different checksum's between v6 and v7?
> 3. Do you know of an alternative to check that the bytes were transferred exactly? I used od and then compared the text representation of the bytes  on the host using diff (other than differences in output between v6 and v7 related to duplicate lines, it worked ok but is clunky).
> 
> Thanks,
> 
> Will




* [TUHS] why is sum reporting different checksum's between v6 and v7
@ 2015-12-12  0:03 Will Senn
  2015-12-12  0:10 ` Clem cole
  2015-12-12  0:38 ` Random832
  0 siblings, 2 replies; 8+ messages in thread
From: Will Senn @ 2015-12-12  0:03 UTC (permalink / raw)


All,

While working on the latest episode of my saga about moving files 
between v6 and v7, I noticed that the sum utility from v6 reports a 
different checksum than it does using the sum utility from v7 for the 
same file. To confirm, I did the following on both systems:

# echo "Hello, World" > hi.txt
# cat hi.txt
Hello, World

Then on v6:
# sum hi.txt
1106 1

But on v7:
# sum hi.txt
37264     1

There is no man page for the utility on v6, and it's assembler. On v7, 
there's a manpage and it's C:
man sum
...
Sum calculates and prints a 16-bit checksum for the named
      file, and also prints the number of blocks in the file.
...

A few questions:
1. I'll eventually be able to read assembly and learn what the v6 
utility is doing the hard way, but does anyone know what's going on here?
2. Why is sum reporting different checksum's between v6 and v7?
3. Do you know of an alternative to check that the bytes were 
transferred exactly? I used od and then compared the text representation 
of the bytes  on the host using diff (other than differences in output 
between v6 and v7 related to duplicate lines, it worked ok but is clunky).

Thanks,

Will




