The Unix Heritage Society mailing list
* [TUHS] 211bsd: kernel panic after a 'here document' in tcsh
@ 2017-06-25 16:25 Walter F.J. Mueller
From: Walter F.J. Mueller @ 2017-06-25 16:25 UTC


Hi,

two remarks on the issues around FPSIM and tcsh:

I of course wondered why a line like

    mov     $4..,r0

is accepted by 'as'; I naively expected it to cause an error.
I couldn't locate the 211bsd 'as' manual, so I checked the 7th Edition
manuals, which can be found under

   https://wolfram.schneider.org/bsd/7thEdManVol2/

The assembler manual, see
   https://wolfram.schneider.org/bsd/7thEdManVol2/assembler/assembler.pdf

states

    6.1  Expression operators
         The operators are:
            (blank)     when there is no operator between operands,
                        the effect is exactly the same as if a
                        ‘+’ had appeared.

So the lexer sees two tokens

   $4.    --> number
   .      --> symbol for location counter

and, because the default operator is '+', interprets this as

    mov     $4. + . , r0

which ends up being a number in the 160000 to 177777 range.

So 'as' is not to blame; it works as designed.
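
As a quick check of this reading, here is a small Python sketch of the arithmetic. It assumes that '.' evaluates to the address of the mov instruction itself, taken as 0160744 from the instruction trace in the 2017-06-10 mail; the function name is made up for illustration and is not part of 'as'.

```python
# Sketch: how "mov $4..,r0" can turn into a large constant.
# Assumptions: '.' is the address of the mov instruction (0160744 in the
# trace from the 2017-06-10 mail), and a blank between operands means '+'.

def eval_blank_joined(operands):
    """Evaluate operands joined by the implicit '+', as a 16-bit value."""
    return sum(operands) & 0o177777

location_counter = 0o160744      # assumed value of '.'
value = eval_blank_joined([4, location_counter])
print(oct(value))                # 0o160750
```

The result matches the 160750 seen in r0, which is consistent with (though not proof of) the two-token interpretation.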

Noel Chiappa wrote:
 > I'm fairly amazed that apparently nobody has run across one of these 4 before!
 > (Or, at least, not bothered to report it.)
 > I wonder how long that bug has been in the code?

The answer is: this bug has been in 211bsd all along.
Steven Schultz told me that they simply didn't have a way to
test FPSIM because all machines had an FPP, and the only way of testing
would have been to physically remove the FP11 from an 11/70.


		With best regards,   Walter


* [TUHS] 211bsd: kernel panic after a 'here document' in tcsh
@ 2017-06-10 14:24 Noel Chiappa
From: Noel Chiappa @ 2017-06-10 14:24 UTC


    > From: "Walter F.J. Mueller"

    > the kernel panic after tcsh here documents is understood.

Very nice detective work!

    > The kernel panic is due to a coding error in mch_fpsim.s. ...  After
    > fixing the "$SIGILL." ... and three similar cases

I'm fairly amazed that apparently nobody has run across one of these 4 before!
(Or, at least, not bothered to report it.)

I wonder how long that bug has been in the code?

     Noel


* [TUHS] 211bsd: kernel panic after a 'here document' in tcsh
@ 2017-06-10 12:58 Walter F.J. Mueller
From: Walter F.J. Mueller @ 2017-06-10 12:58 UTC


Hi,

the kernel panic after tcsh here documents is understood.
And fixed, at least on my system.

The essential hint was Johnny's observation that on his system he gets
an "Illegal instruction - core dumped" and no kernel panic.

I'm using a self-built PDP 11/70 on an FPGA, see
   https://github.com/wfjm/w11/
   https://wfjm.github.io/home/w11/
which doesn't have a floating point unit. Therefore the kernel is built
with floating point emulation, thus with
   FPSIM   YES      # floating point simulator

In a kernel with FPSIM activated, the trap handler trap(), see
   http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/trap.c.html
calls fpsim() for each user mode illegal instruction trap. If the
instruction was a floating point instruction, fpsim() emulates it,
returns 0, and trap() simply returns. If not, fpsim() returns the abort
signal type, and trap() calls psignal() with this signal type, which in
general will terminate the offending process.
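
The dispatch just described can be sketched as a toy model in Python. This is not the 2.11BSD source: the function bodies are hypothetical, only SIGILL is modelled, and the test for a floating point opcode (the 17xxxx opcode range) is a simplification.

```python
# Toy model of the trap()/fpsim() dispatch; not the kernel code.
# Names mirror the kernel functions, the bodies are illustrative only.

SIGILL = 4

def is_fp_instruction(opcode):
    """Simplified check: FP11 opcodes live in the 17xxxx range."""
    return (opcode & 0o170000) == 0o170000

def fpsim(opcode):
    """Return 0 if the instruction was emulated, else a signal number."""
    if is_fp_instruction(opcode):
        return 0              # pretend the emulation succeeded
    return SIGILL             # not a floating point instruction

def trap(opcode, psignal):
    """Handle a user mode illegal instruction trap."""
    sig = fpsim(opcode)
    if sig == 0:
        return                # emulated, the process simply continues
    psignal(sig)              # otherwise signal the offending process

delivered = []
trap(0o170011, delivered.append)   # floating point opcode: emulated
trap(0o000045, delivered.append)   # the opcode tcsh tripped over
print(delivered)                   # [4]
```

The point of the model is only that whatever fpsim() returns is handed to psignal() unchecked, so a wrong return value goes straight into the signal code.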

The kernel panic is due to a coding error in mch_fpsim.s. Look in
   http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/mch_fpsim.s.html
the code after label badins:

    badins:                         / Illegal Instruction
            mov     $SIGILL.,r0
            br      2b

The constant SIGILL is defined in assym.h as

    #define SIGILL 4.

Thus after substitution the mov instruction is

            mov     $4..,r0

with *two dots* !!! The 'as' assembler generates from this

            mov #160750,r0

So r0 will contain an invalid signal number, which is returned by fpsim() to
trap(). This signal number is passed to psignal(), which starts with

      mask = sigmask(sig);
      prop = sigprop[sig];

The access to sigprop[sig] resolves to an address in I/O space, which causes
a UNIBUS timeout and, in consequence, the kernel panic.
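
The address computation can be checked with a short Python sketch. It rests on assumptions: sigprop is treated as a byte array (the trace in the P.S. shows a movb), its base is taken to be 013066 (read off the index word in that trace), and kernel data addresses from 0160000 upward are taken to map to the I/O page.

```python
# Sketch of the address computation behind sigprop[sig], in 16-bit arithmetic.
# Assumptions: byte array, base 013066 (interpreted from the trace),
# I/O page mapped at 0160000 and above in kernel data space.

SIGPROP_BASE = 0o013066
IO_PAGE_START = 0o160000

def element_address(base, index):
    """Address of base[index] for a byte array, wrapped to 16 bits."""
    return (base + index) & 0o177777

good = element_address(SIGPROP_BASE, 4)          # intended: SIGILL
bad = element_address(SIGPROP_BASE, 0o160750)    # what fpsim() returned

print(oct(good), good >= IO_PAGE_START)   # 0o13072 False
print(oct(bad), bad >= IO_PAGE_START)     # 0o174036 True
```

The second address, 174036, is the one that shows up with the bus timeout in the trace.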

After fixing the "$SIGILL." to "$SIGILL"  (removing the extraneous '.') and
three similar cases, the kernel doesn't panic anymore; tcsh instead crashes
with an illegal instruction trap.

The question remains why tcsh runs into an illegal instruction. Now that
tcsh leaves a core dump, adb gives the answer

   adb tcsh tcsh.core
     $c
       0172774: _rscan(0176024,0174434) from ~heredoc+0246
       0176040: _heredoc(067676) from ~execute+0234
       0176126: _execute(067040,01512,0,0) from ~execute+03410
       0176222: _execute(066754,01512,0,0) from ~process+01224
       0176274: _process(01) from ~main+06030
       0177414: _main() from start+0104

heredoc(), which is located in OV1, calls rscan(), which is in OV6, with

    rscan(Dv, Dtestq);

where Dtestq is a function pointer to Dtestq(), which, like heredoc(), is in OV1.
rscan(), which has the signature

      rscan(t, f)
           register Char **t;
           void    (*f) ();

uses 'f' in the statement

       (*f) (*p++);

The problem is that
   - heredoc() and Dtestq() are in OV1
   - that's why in the end ~Dtestq is used as the function pointer, as
     for all overlay-internal function invocations
   - rscan() is in OV6; when it's called, the overlay is switched OV1 -> OV6
   - this invalidates the function pointer, which now points to some random
     code location, which happens to hold '000045', causing a trap.
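
The mechanism in the list above can be illustrated with a toy model in Python. Everything here is a sketch: the Machine class, the thunk helper and the base address 001000 are invented for illustration, and an unmapped address simply raises an error instead of executing whatever word happens to be there.

```python
# Toy model of an overlaid text space; not how the 2.11BSD loader works.
# 0174434 is the ~Dtestq value from the adb listing, 001000 is hypothetical.

class Machine:
    def __init__(self, overlays):
        self.overlays = overlays   # overlay number -> {address: function}
        self.current = None        # overlay currently mapped
        self.base = {}             # always-mapped code: {address: function}

    def call(self, address, *args):
        """Call whatever code is visible at 'address' right now."""
        if address in self.base:
            return self.base[address](*args)
        code = self.overlays[self.current].get(address)
        if code is None:
            raise RuntimeError("illegal instruction at %o" % address)
        return code(*args)

    def add_thunk(self, address, overlay, target):
        """Base-segment forwarder: map the overlay, then jump into it."""
        def thunk(*args):
            saved, self.current = self.current, overlay
            try:
                return self.call(target, *args)
            finally:
                self.current = saved
        self.base[address] = thunk

DTESTQ_OV = 0o174434     # plays the role of ~Dtestq (entry point in OV1)
DTESTQ_BASE = 0o001000   # plays the role of _Dtestq (forwarder in the base)

m = Machine({1: {DTESTQ_OV: lambda c: "Dtestq(%s)" % c}, 6: {}})
m.add_thunk(DTESTQ_BASE, 1, DTESTQ_OV)

def rscan(machine, items, f):
    """rscan() lives in overlay 6, so calling it maps OV6 first."""
    machine.current = 6
    return [machine.call(f, item) for item in items]

print(rscan(m, ["a", "b"], DTESTQ_BASE))   # works via the forwarder
try:
    rscan(m, ["a"], DTESTQ_OV)             # stale overlay-internal pointer
except RuntimeError as err:
    print(err)
```

Passing the forwarder address works from any overlay, while the overlay-internal address is only meaningful while OV1 is mapped.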

It is clear that in this context _Dtestq, the forwarder in the base, must
be used and not ~Dtestq, the entry point in the overlay. The generated
code for 'rscan(Dv, Dtestq)' is

       ~heredoc+0230:  mov     $0174434,(sp)         # arg Dtestq: uses ~Dtestq
       ~heredoc+0234:  mov     r5,-(sp)
       ~heredoc+0236:  add     $0177764,(sp)         # arg Dv
       ~heredoc+0242:  jsr     pc,*$_rscan

Since rscan() is very small and only used by heredoc(), I simply moved the
code of rscan() from sh.glob.c (OV6) to sh.dol.c, where heredoc() and
Dtestq() are also defined.

After that tcsh works fine with here documents
   ./tcsh
   cat >x.x <<EOF
   1
   $TERM
   $PWD
   EOF

   cat x.x
     1
     vt100-long
     /usr/src/bin/tcsh

Bottom line
   - fpsim was broken all the time
   - tcsh was broken all the time

I'll convert this into proper patches and send them to Steven, but this will
take some time because I have to tidy up my system to be in a position
again to provide proper and clean patch sets.

             With best regards,       Walter


P.S.: debugging the kernel issue was quite easy because the w11a CPU has
three essential 'built into the CPU' debug tools:
- a 'cpu monitor', which records 144 bits of processor state for the last 256
   instructions or vector fetches, see
     https://github.com/wfjm/w11/blob/master/rtl/w11a/pdp11_dmcmon.vhd
- a 'breakpoint unit', which allows setting instruction or data breakpoints
- an 'ibus monitor', which records the last 512 ibus transactions
After setting a breakpoint on the trap 004/010 handler, an inspection of the
instruction trace gave the essential information. Below is a very condensed
and annotated excerpt

  nc ....pc cprptnzvc ..dsrc ..ddst ..dres      vmaddr vmdata
#
# the "(*f) (*p++)" in tcsh, running into an illegal instruction
#
  15 145210 uu00-.... 000105 173052 000105 w  d 173052 000105 mov r0,(sp)
  25 145212 uu00-.... 173050 174434 174434 w  d 173050 145216 jsr pc,@n(r5)
  19 174434 uu00-.... 000010 173064 000010 r  i 174434 000045 ?000045?
   1 174434 uu00-.... 000012 173064 000012 r  d 000010 000045 !VFETCH 010 RIT
#
# the "mov $SIGILL.,r0" in fpsim(), loading 160750 instead of 000004
#
  17 160744 ku00-n..c 160750 000045 160750 r  i 160746 160750 mov #n,r0
  14 160750 ku00-n..c 160752 160750 160732 r  i 160750 000770 br .-14
#
# the "sigprop[sig]" access in psignal(), which accesses 174036,
# which leads to an external bus (or UNIBUS) timeout and IIT trap
#
  23 161314 ku00-.z.. 000000 147500 000000 w  d 147500 000000 mov r1,n(r5)
   9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
   3 161320 ku00-.z.. 000006 000000 000006 r  d 000004 013066 !VFETCH 004 IIT


* [TUHS] 211bsd: kernel panic after a 'here document' in tcsh
@ 2017-06-07 20:14 Walter F.J. Mueller
From: Walter F.J. Mueller @ 2017-06-07 20:14 UTC


Hi,

a few remarks on the feedback on the kernel panic after a 'here document' in tcsh.

To Michael Kjörling's question:
 > I'm curious whether the same thing happens if you try that in some
 > other shell? (Not sure how widely here documents were supported back
 > then, but I'm asking anyway.)
and Johnny Billquist's remark
 > Not sure if any of the other shells have this.

'here documents' are available and work fine in sh and csh.
They are in fact used, for example in

   /usr/adm/daily     (a /bin/sh script)
     su uucp << EOF
           /etc/uucp/clean.daily
     EOF

   /usr/crash/why     (a /bin/csh script)
     adb -k {unix,core}.$1 << 'EOF'
     version/sn"Backtrace:"n
     $c
     'EOF'

To Michael Kjörling's remark
 > The PC value in the panic report ("pc 161324") strikes me as high
and Johnny Billquist's remark
 > This is in kernel mode, and that is in the I/O page.

211bsd uses split I/D space and uses all 64 kB of I space for code.
The top 8 kB are in fact the overlay area, and the crash happened
in overlay 4 (as indicated by ov 4). With a simple

   nm /unix | sort | grep " 4"

one gets

   161254 t ~psignal 4
   162302 t ~issignal 4

so the crash is just 050 bytes after the entry point of psignal. The
PC address is therefore fine and not the problem. For psignal, look at

   http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal

The crash must be in one of the first lines. psignal is an internal kernel
function, called from

   http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_psignal

and has nothing to do with the libc function psignal

   http://www.retro11.de/ouxr/211bsd/usr/man/cat3/psignal.0.html
   http://www.retro11.de/ouxr/211bsd/usr/src/lib/libc/gen/psignal.c.html
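
The octal arithmetic behind the PC argument above can be checked with a few lines of Python. The values are taken from the panic report and the nm output; the variable names are only for illustration.

```python
# Quick check of the octal arithmetic for the panic PC.
# Values come from the panic report and the nm /unix output in this thread.

PC = 0o161324              # "pc 161324" from the panic report
PSIGNAL_ENTRY = 0o161254   # "~psignal" in overlay 4, from nm /unix
ADDRESS_SPACE = 1 << 16    # 64 kB of I space
OVERLAY_SIZE = 8 * 1024    # top 8 kB hold the overlay area

overlay_start = ADDRESS_SPACE - OVERLAY_SIZE
offset = PC - PSIGNAL_ENTRY

print(oct(overlay_start))     # 0o160000
print(overlay_start <= PC)    # True
print(oct(offset))            # 0o50
```

So the PC lies inside the overlay window and 050 bytes past the psignal entry point, as stated.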

To Johnny Billquist's remark
 > Could you (Walter) try the latest version of 2.11BSD and see if you
 > still get that crash?

It is very interesting that you see a core dump of tcsh rather than a kernel panic.

Whatever tcsh does, it should not lead to a kernel panic, and if it does,
it is primarily a bug in the kernel. It looks like there are two issues,
one in tcsh, and one in the kernel. I have a hunch where this might come from,
but that will take a weekend or two to check.


		With best regards,  Walter


* [TUHS] 211bsd: kernel panic after a 'here document' in tcsh
@ 2017-06-05 23:05 Noel Chiappa
From: Noel Chiappa @ 2017-06-05 23:05 UTC


    > From: Jacob Ritorto

    > Where might one find the list of trap_types

Look in:

  http://minnie.tuhs.org/cgi-bin/utree.pl?file=2.11BSD/sys/pdp/scb.s

which maps from trap vector locations (built into the hardware; consult a
PDP-11 CPU manual for details) to trap type numbers, which are defined here:

  http://minnie.tuhs.org/cgi-bin/utree.pl?file=2.11BSD/sys/pdp/trap.h

and handled here:

  http://minnie.tuhs.org/cgi-bin/utree.pl?file=2.11BSD/sys/pdp/trap.c


    > and cpuerrs?

That just prints the contents of the CPU Error Register; see an appropriate
PDP-11 CPU manual - 11/70, /44, /73, /83 or /84 for what all the bits mean.
Also the "KDJ11-A CPU Module User's Guide", which also documents it.

In theory, there's also a KDJ11-B UG, but it's not online. If anyone has one,
can we please get it scanned? Thanks!

    Noel


* [TUHS] 211bsd: kernel panic after a 'here document' in tcsh
@ 2017-06-05 14:12 Walter F.J. Mueller
From: Walter F.J. Mueller @ 2017-06-05 14:12 UTC


Hi,

I'm using 211bsd (Version 447) and found that a 'here document' in tcsh
leads to a kernel panic. It's absolutely reproducible on my system, both
when I run it on my FPGA PDP-11 and in simh. Just doing

   tcsh
   cat << EOF

is enough, and I get

     ka6 31333 aps 147472
     pc 161324 ps 30004
     ov 4
     cpuerr 20
     trap type 0
     panic: trap
     syncing disks... done

looking at the crash dump gives

   cd /etc/crash
   ./why 4
     Backtrace:
     0147372: _boot(05000,0100) from    ~panic+072
     0147414: _etext(011350) from ~trap+0350
     0147450: ~trap() from call+040
     0147516: _psignal(0101520,0160750) from ~trap+0364
     0147554: ~trap() from call+040

so the crash is in psignal, which is afaik the kernel internal
mechanism to dispatch signals.

Questions:
   1. has anybody seen this before ?
   2. any idea what the reason could be ?


		With best regards, 	Walter



Thread overview: 15+ messages
     [not found] <mailman.1.1497146402.26080.tuhs@minnie.tuhs.org>
2017-06-11 10:25 ` [TUHS] 211bsd: kernel panic after a 'here document' in tcsh Johnny Billquist
2017-06-25 16:25 Walter F.J. Mueller
  -- strict thread matches above, loose matches on Subject: below --
2017-06-10 14:24 Noel Chiappa
2017-06-12 15:26 ` Clem Cole
2017-06-10 12:58 Walter F.J. Mueller
     [not found] <mailman.884.1496866451.3779.tuhs@minnie.tuhs.org>
2017-06-08 22:29 ` Johnny Billquist
2017-06-07 20:14 Walter F.J. Mueller
2017-06-08  7:54 ` Michael Kjörling
     [not found] <mailman.1.1496714401.14870.tuhs@minnie.tuhs.org>
2017-06-06 19:15 ` Johnny Billquist
2017-06-05 23:05 Noel Chiappa
2017-06-05 14:12 Walter F.J. Mueller
2017-06-05 16:16 ` Michael Kjörling
2017-06-05 16:33   ` Ron Natalie
2017-06-05 22:08     ` Jacob Ritorto
2017-06-06 11:43       ` Ron Natalie
