Re: [TUHS] Who said ...

The Unix Heritage Society mailing list

From: Chris Torek <torek@elf.torek.net>
To: tuhs@tuhs.org
Subject: Re: [TUHS] Who said ...
Date: Wed, 1 Sep 2021 07:48:12 -0700 (PDT)
Message-ID: <202109011448.181EmCiB062091@elf.torek.net>
In-Reply-To: <20210901141638.F064F640CC6@lignose.oclsc.org>

>DEC's diags were far from perfect, but they were a hell of a
>lot better than the largely-nonexistent diags available for
>modern Intel-architecture systems.  I am right now dealing
>with a system that has an intermittent fault, that causes
>the OS to crash in the middle of some device driver every
>so often.  Other identical systems don't, so I don't think
>it's software.

Could be the device itself, corrupting things.

Some problems are just hard to track down.  If you remember
the infamous UDA50/RA81 setup, the original Unix driver was
flaky somehow in that it would "lose" MSCP packets and hang.
I rewrote the thing from scratch to fix that problem.  But
then we got an Emulex board that had a ... different problem.

I hacked on our driver to find it.  The problem turned out
to be that the Emulex hardware (or firmware) would *drop*
16 of the 32 bits of a 32-bit field.  In each MSCP packet,
there was a single 32-bit field where you could store arbitrary
data to be reflected back to you in a reply.  The Unix driver
stored `struct buf *bp` there, if I remember right, and I
originally did as well.

Once I figured out this was being clobbered, I replaced it
with a small integer (index into "outstanding I/O table") with
check bytes.  I'd log the occurrence of corruption, recover the
useful data from the 16 bits that had the right data, and we
would be on our merry way.  There was no obvious pattern here
though.
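
The message does not give the actual field layout, so the following is
only a minimal sketch of the idea under stated assumptions: an 8-bit
table index, a check byte formed by XOR with an arbitrary constant, and
the same 16-bit pattern stored in both halves of the 32-bit field so
that either surviving half is enough. All names here (`tag_encode`,
`tag_decode`, `TAG_CHECK_XOR`) are hypothetical and are not taken from
the original driver.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary check constant, assumed for illustration only. */
#define TAG_CHECK_XOR 0xA5u

enum tag_status { TAG_OK, TAG_RECOVERED, TAG_BAD };

/* Pack an 8-bit table index and its check byte into one 16-bit half. */
static uint16_t tag_half(uint8_t index)
{
    uint8_t check = (uint8_t)(index ^ TAG_CHECK_XOR);
    return (uint16_t)(((unsigned)index << 8) | check);
}

/* Return 1 and store the index if the half's check byte matches. */
static int half_valid(uint16_t half, uint8_t *index)
{
    uint8_t idx = (uint8_t)(half >> 8);
    uint8_t check = (uint8_t)(half & 0xFFu);

    if ((uint8_t)(idx ^ TAG_CHECK_XOR) != check)
        return 0;
    *index = idx;
    return 1;
}

/* Build the 32-bit tag: the same checked half in both 16-bit halves. */
uint32_t tag_encode(uint8_t index)
{
    uint32_t half = tag_half(index);
    return (half << 16) | half;
}

/*
 * Decode a reflected tag.  TAG_OK means both halves are valid and agree,
 * TAG_RECOVERED means exactly one half survived (the caller would log
 * the corruption), TAG_BAD means nothing usable is left.
 */
enum tag_status tag_decode(uint32_t tag, uint8_t *index)
{
    uint8_t hi = 0, lo = 0;
    int hi_ok, lo_ok;

    assert(index != 0);
    hi_ok = half_valid((uint16_t)(tag >> 16), &hi);
    lo_ok = half_valid((uint16_t)(tag & 0xFFFFu), &lo);

    if (hi_ok && lo_ok && hi == lo) {
        *index = hi;
        return TAG_OK;
    }
    if (hi_ok && !lo_ok) {
        *index = hi;
        return TAG_RECOVERED;
    }
    if (lo_ok && !hi_ok) {
        *index = lo;
        return TAG_RECOVERED;
    }
    return TAG_BAD;
}
```

With this layout a half that comes back as all zeros or all ones fails
its check, so the decoder can tell which half to trust. A driver would
then use the recovered index to look up the outstanding request in its
own table rather than trusting a pointer that passed through the
controller.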

Two other sort of related war stories...

 * We had the carry-chain timing bug on our VAX 780 at one point.
   It most consistently hit on the `extzv` instruction in the
   kernel exit() handler, but only about 1 out of every 10 to 100
   thousand occurrences.  So I wrote a user-land program that
   would spin doing that `extzv`.  If the user program crashed,
   the board-set installed in the backplane had the problem, and
   we'd have the DEC service guy cycle through them (in the usual
   "how many flat tires do we have today" dance).

 * The UltraSPARC II CPU had a similar timing bug, I think in the
   register forwarding logic.  The BSD/OS SPARC port had a three
   instruction sequence for setting up the right stack on a
   trap (interrupt, system call, etc.), and it would randomly
   crash with a bizarre value, that I eventually figured out
   was from putting the result that should have gone into one
   of the %l registers, into the %sp register instead.  It only
   happened after a pipeline flush for other purposes and I
   forget what I did to make it happen frequently enough to
   diagnose.

   (Re-ordering the three instructions fixed the problem.)
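
The user-land spin test from the first war story can be sketched in
portable C. This is not the original program, which presumably ran the
VAX `extzv` instruction directly and showed the fault as a crash. Here
a zero-extended bit-field extract is written with shift and mask, and
wrong results are counted instead. Whether a given compiler emits
`extzv` for this code is an assumption that would need checking, and
the names `extract_field` and `spin_extract` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extract `size` bits starting at bit `pos` of `word`, zero-extended.
 * This is roughly what extzv computes for a field that lies inside a
 * single 32-bit longword.
 */
static uint32_t extract_field(uint32_t word, unsigned pos, unsigned size)
{
    uint32_t mask;

    assert(size >= 1 && pos < 32 && pos + size <= 32);
    mask = (size == 32) ? 0xFFFFFFFFu : ((1u << size) - 1u);
    return (word >> pos) & mask;
}

/* Repeat the same extract `iterations` times; count wrong results. */
unsigned long spin_extract(unsigned long iterations)
{
    /* volatile so the load and extract are redone on every pass */
    volatile uint32_t word = 0x12345678u;
    unsigned long failures = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++) {
        if (extract_field(word, 8, 8) != 0x56u)
            failures++;
    }
    return failures;
}
```

On sound hardware `spin_extract` returns 0 however long it runs. A
nonzero count (or a crash) on one board-set but not another points at
the hardware rather than the software, which is the property the
original test relied on.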

Tying back into ZFS etc., if that was on this mailing list: :-)

I had a bad DIMM in an Intel box a while back, that corrupted
data in the kernel buffer pool.  That one was scary, because,
while the memtest86 tests found it, who knows what data it had
already corrupted?

(This is why I want ECC, even in my home systems.)

Chris
