Message-ID: <05774f03-57b5-f524-7a5b-c436237b5d4b@landley.net>
Date: Sat, 17 Feb 2024 07:32:00 -0600
To: Mouse, toybox@lists.landley.net, musl@lists.openwall.com
References: <349f4e17-8027-c521-eeb3-aa69e8f2b5a4@landley.net> <202402170323.WAA04412@Stone.Rodents-Montreal.ORG>
From: Rob Landley
In-Reply-To: <202402170323.WAA04412@Stone.Rodents-Montreal.ORG>
Subject: [musl] Re: [Toybox] Not sure how to debug this one.

On 2/16/24 21:23, Mouse wrote:
>> While grinding away at release prep, I hit a WEIRD one.  The
>> qemu-system-sh4 target got broken [...by...] the commit that changed
>> the stdout buffering type.
>
>> The actual _problem_ is that sigsetjmp() is faulting [...]
> [...]
>> While debugging I made the problem GO AWAY more than once by sticking
>> printfs() and similar into the code, [...]
>
> This smells to me like depending on uninitialized stack trash.

A write-only function that didn't change its behavior when I memset the
structure before calling it? Define "uninitialized".

(Unless you mean an uninitialized variable inside a function written entirely
in assembly, that's part of a C library shipped and used by many people for
many years?)

>> Not siglongjmp, _sigsetjmp_.  Which means it's failing somewhere in:
>>
>> https://git.musl-libc.org/cgit/musl/tree/src/signal/sh/sigsetjmp.s
>
>> And I dunno how to stick a printf into superh assembly code.
>
> The simple way to figure that out is to compile something that uses
> printf and look at the assembly,

$ ccc/sh4-linux-musl-cross/bin/sh4-linux-musl-objdump -d generated/unstripped/toybox | grep -A 60 ':'
0045341c :
  45341c:	86 2f       	mov.l	r8,@-r15
  45341e:	f8 e2       	mov	#-8,r2
  453420:	22 4f       	sts.l	pr,@-r15
  453422:	a8 7f       	add	#-88,r15
  453424:	18 d0       	mov.l	453488 ,r0	! 45622c
  453426:	f3 61       	mov	r15,r1
  453428:	18 71       	add	#24,r1
  45342a:	29 21       	and	r2,r1
  45342c:	43 68       	mov	r4,r8
  45342e:	13 62       	mov	r1,r2
  453430:	18 72       	add	#24,r2
  453432:	7a 11       	mov.l	r7,@(40,r1)
  453434:	04 72       	add	#4,r2
  453436:	58 11       	mov.l	r5,@(32,r1)
  453438:	13 63       	mov	r1,r3
  45343a:	69 11       	mov.l	r6,@(36,r1)
  45343c:	f3 65       	mov	r15,r5
  45343e:	aa f2       	fmov	fr10,@r2
  453440:	44 75       	add	#68,r5
  453442:	bb f2       	fmov	fr11,@-r2
  453444:	20 73       	add	#32,r3
  453446:	13 62       	mov	r1,r2
  453448:	10 72       	add	#16,r2
  45344a:	04 72       	add	#4,r2
  45344c:	f3 64       	mov	r15,r4
  45344e:	8a f2       	fmov	fr8,@r2
  453450:	14 e6       	mov	#20,r6
  453452:	9b f2       	fmov	fr9,@-r2
  453454:	13 62       	mov	r1,r2
  453456:	08 72       	add	#8,r2
  453458:	04 72       	add	#4,r2
  45345a:	6a f2       	fmov	fr6,@r2
  45345c:	04 71       	add	#4,r1
  45345e:	7b f2       	fmov	fr7,@-r2
  453460:	4a f1       	fmov	fr4,@r1
  453462:	5b f1       	fmov	fr5,@-r1
  453464:	12 15       	mov.l	r1,@(8,r5)
  453466:	2c 71       	add	#44,r1
  453468:	11 15       	mov.l	r1,@(4,r5)
  45346a:	60 e1       	mov	#96,r1
  45346c:	fc 31       	add	r15,r1
  45346e:	33 15       	mov.l	r3,@(12,r5)
  453470:	32 25       	mov.l	r3,@r5
  453472:	0b 40       	jsr	@r0
  453474:	14 15       	mov.l	r1,@(16,r5)
  453476:	05 d0       	mov.l	45348c ,r0	! 45500c
  453478:	05 d4       	mov.l	453490 ,r4	! 4cc9ec <__stdout_FILE>
  45347a:	0b 40       	jsr	@r0
  45347c:	83 65       	mov	r8,r5
  45347e:	58 7f       	add	#88,r15
  453480:	26 4f       	lds.l	@r15+,pr
  453482:	0b 00       	rts
  453484:	f6 68       	mov.l	@r15+,r8
  453486:	09 00       	nop
  453488:	2c 62       	extu.b	r2,r2
  45348a:	45 00       	mov.w	r4,@(r0,r0)
  45348c:	0c 50       	mov.l	@(48,r0),r0
  45348e:	45 00       	mov.w	r4,@(r0,r0)
  453490:	ec c9       	and	#-20,r0
  453492:	4c 00       	mov.b	@(r0,r4),r0

That's just the vfprintf() wrapper, which has the actual plumbing for escape
parsing and such, and is of course running its output through the ascii FILE *
infrastructure.

No, I would wind up CALLING the function, meaning set up a call stack, but how
you're supposed to do that in the middle of setjmp() without corrupting the
registers you're supposed to be saving... even manually making a _system_call_
in that context is... I mean it's _documented_ in
https://man7.org/linux/man-pages/man2/syscall.2.html:

       Arch/ABI    Instruction           System  Ret  Ret  Error    Notes
                                         call #  val  val2
       ───────────────────────────────────────────────────────────────────
       superh      trapa #31             r3      r0   r1   -        4, 6

But again, the point is to SAVE those registers, in a defined order, and
there's no WAY to insert something that big into delicate assembly
non-intrusively. This already heisenbugs if my dprintf() is too elaborate.

> either by using -save-temps or
> equivalent or by disassembling the binary.

Did I mention I once stuck print-to-stderr debugging into the uclibc dynamic
loader while doing system bringup on the hexagon architecture? Which couldn't
use any global variables, function calls, or string constants because it hadn't
relocated itself yet so I assembled a message into a char buffer[] on the stack
and did a syscall(_nr_write).

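For anyone who hasn't seen the stack-buffer trick, here is a minimal sketch of
the idea in C. It is not the code from that uclibc bringup: the names
format_dbg() and early_debug() and the "dbg " prefix are made up for
illustration, and it leans on libc's syscall() wrapper, which a loader that
hasn't relocated itself yet could not call (there you'd need a hand-written
trap for the architecture in question). A compiler is also free to turn the
per-character stores into a copy from .rodata, so real pre-relocation code
would have to check the generated assembly.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: build "dbg XXXXXXXX\n" in a caller-supplied buffer
 * (at least 13 bytes) using only immediate character stores, so nothing
 * here refers to a string constant or a global. Returns the length. */
size_t format_dbg(char *buf, unsigned value)
{
    size_t n = 0;
    int shift;

    buf[n++] = 'd';
    buf[n++] = 'b';
    buf[n++] = 'g';
    buf[n++] = ' ';
    /* Eight hex digits, most significant nibble first. */
    for (shift = 28; shift >= 0; shift -= 4) {
        unsigned nib = (value >> shift) & 15;

        buf[n++] = (char)(nib < 10 ? '0' + nib : 'a' + nib - 10);
    }
    buf[n++] = '\n';

    return n;
}

/* Hypothetical helper: format on the stack, then write to fd 2 (stderr).
 * ASSUMPTION: syscall() and SYS_write are usable here; before relocation
 * this call would have to be replaced by a raw trap instruction. */
long early_debug(unsigned value)
{
    char buf[16];
    size_t len = format_dbg(buf, value);

    return syscall(SYS_write, 2, buf, len);
}
```

Calling early_debug(0x180) should put "dbg 00000180" on stderr without
touching stdio at all.
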
Similar to debugging uboot before it relocated itself from NOR flash to sram
(and thus all the locations the linker had provided for symbols outside the
current function and stack were wrong), where debug output was a loop that
wrote a byte at a time to the serial port spinning checking the
ready-for-next-byte status bit. In that case I worked out the constants I
needed to subtract from "string constants" (because a string constant resolves
to a pointer of type char so you can "hello"-0x40800300 and that's a byte
offset). In theory the same technique would apply to function pointers (every
function name is a pointer) but the TYPE of said pointer is sizeof(function)
and doing math on them isn't really a thing, so you need to typecast to char
and then BACK again (and the syntax for function pointer typecasting has too
many parentheses in non-obvious locations, I generally find it easier to
declare a function pointer variable and then (void *) typecast assign to that),
but that doesn't help if the function then tries to call ANOTHER function, as
so many of them do, so... didn't turn out to be very useful.

But that's not what I was asking about HERE. "Magic blob of assembly for
architecture I'm not hugely familiar with is throwing an interrupt, I wonder
why?"

> But, given what sigsetjmp is, sticking a printf in there is likely to
> be more difficult than usual.

Define "usual".

Oh, I forgot to mention that qemu-system-blah also has a -s option to launch a
gdbserver on a port. (Which as with all the classic qemu options is now
described in the --help text as "-s shorthand for -gdb tcp::1234" which is just
sad. And that's _after_ they renamed it from
https://landley.net/notes-2008.html#19-03-2008 when it was apparently -g ?)

I believe qemu -s is emulating a jtag, kgdb is SORT of emulating a jtag, and
then normal gdbserver is providing userspace context debugging. Same protocol,
what differs is what the registers mean and symbol visibility/namespace
context.

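The relocation arithmetic from the uboot story above (subtract a constant from
a string constant, or declare a function pointer variable and assign to it
through a (void *) cast) looks roughly like this in C. This is only a sketch:
RELOC_OFFSET, add_one(), reloc_string() and call_relocated() are hypothetical
names, and the offset is 0 so it runs in an ordinary process instead of in
pre-relocation firmware. Assigning a void * to a function pointer is something
gcc and clang accept, not something ISO C promises.

```c
#include <stdint.h>

/* ASSUMPTION: RELOC_OFFSET is the constant to subtract from link-time
 * addresses to get the addresses the code is really running at. It is 0
 * here so the sketch works unmodified; the 0x40800300 in the text is that
 * kind of value. */
#define RELOC_OFFSET 0

/* Hypothetical function to call through an adjusted pointer. */
static int add_one(int x)
{
    return x + 1;
}

/* A string constant is a char pointer, so the adjustment is a plain byte
 * offset. The math is done on an integer to avoid out-of-bounds pointer
 * arithmetic. */
const char *reloc_string(const char *s)
{
    return (const char *)((uintptr_t)s - RELOC_OFFSET);
}

/* Same adjustment for a function: do the math on an integer, then assign
 * through a void * cast rather than spelling out the function pointer
 * cast syntax. */
int call_relocated(int arg)
{
    int (*fn)(int);

    fn = (void *)((uintptr_t)add_one - RELOC_OFFSET);

    return fn(arg);
}
```

As noted above, this stops helping the moment the called function calls
anything else, because those inner calls still use the unadjusted addresses.
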
This is why having an unstripped "vmlinux" is so useful: it's an ELF kernel
with all the symbols so gdb can load it and give you kernel namespace context.
Even if what you actually RAN is one of the repackaged versions, the linking's
already been done so the memory layout's fixed.

Except on sparc, which RELOCATES ITSELF. No, I don't know why either, but I
broke it back under aboriginal and had to get help debugging it:

https://lkml.org/lkml/2011/11/12/57

I do not always have the relevant domain expertise, which is why I try to ask
people who _do_:

https://lkml.org/lkml/2011/12/14/324

(One of the big goals of aboriginal linux and now mkroot is the ability to
package up a test case that somebody can reproduce on their machine without
needing specific hardware, INCLUDING a portable build environment that lets
them rebuild the provided binaries. Hence self-contained qemu-system builds
built with provided portable toolchains that plug into a build that's both
"do this, here's the output" AND "you don't have to use my wrapper, it should
be obvious what it does".)

> I know a little about Super-H from some Dreamcast hackery I did a while
> back.  I had a look at the .s file you cite - thank you, musl-libc.org,
> for resisting the stampede to try to ram HTTPS down everyone's
> throat[%]! - and, while I can read it, there is too much I don't know
> to really claim to understand it.  I can convert the assembly into
> English, certainly, but I don't know how much that would help
> (especially since it's the machine language, not assembly language, I
> know; the SH assembler I've used is my own, with its own syntax, so I'm
> having to guess at the meaning of some parts).

One of the private email replies that didn't go to the list (so I can't
politely publicly reply to it and maybe get more people who know stuff chiming
in) suggested trying it under qemu-user (which reproduced the issue! MUCH
easier), and provided better debug output: I got a register dump (with a
program counter I can probably use to dig through the sh4-linux-musl-objdump -d
generated/unstripped/toybox output (or readelf -a) and identify the failing
instruction):

Unhandled trap: 0x180
pc=0x3fffe6b0 sr=0x00000001 pr=0x00427c40 fpscr=0x00080000
spc=0x00000000 ssr=0x00000000 gbr=0x004cd9e0 vbr=0x00000000
sgr=0x00000000 dbr=0x00000000 delayed_pc=0x00451644 fpul=0x00000000
r0=0x3fffe6b0 r1=0x00000000 r2=0x00000000 r3=0x000000af
r4=0x00000002 r5=0x00481afc r6=0x407fffd0 r7=0x00000008
r8=0x3fffe6b0 r9=0x00456bb0 r10=0x004cea74 r11=0x3fffe6b0
r12=0x3fffe510 r13=0x00000000 r14=0x00456fd0 r15=0x407ffe88
r16=0x00000000 r17=0x00000000 r18=0x00000000 r19=0x00000000
r20=0x00000000 r21=0x00000000 r22=0x00000000 r23=0x00000000

And that ALSO says it's a trap 0x180 which in qemu:

sh7750_regs.h:#define SH7750_EVT_ILLEGAL_INSTR 0x180 /* General Illegal Instruction */

I boggle. (I also tried backing up in qemu to see where it's generated from,
but alas this is MODERN qemu: the macro defined there is never used in the
code, and the fprintf() is the return code from a function that wraps a
function pointer call for a variable that is never assigned to in the sh
architecture, so probably initialized by a macro I can't grep for. Digging is
ongoing.)

He also pointed me at https://sourceware.org/bugzilla/show_bug.cgi?id=27543
which is interesting, but neither sigsetjmp.s nor the setjmp.S it calls have
those two floating point instructions. (Although it saves floating point
registers by number so... is this a synonym for the same thing? Floating point
flags in weird state throwing an exception that's showing up as illegal
instruction but is actually closer to a division by zero error or overflow or
something? Touched floating point register before setting FPU mode? Dunno.
Hmmm, is any of the code between the start of the function and the failure
point doing floating point math?
There isn't any in toysh, but I can't guarantee libc functions like sprintf()
don't use some, and somehow leave the FPU in a weird state that faults trying
to dump its registers? I'm guessing here...)

> [%] Having HTTP support meant I could just look at the http: version
>     instead of needing to wait until I could use a work machine.
>
>> (The problem with trying to configure the kernel to produce core
>> dumps and compare against the readelf -d output is it's running as
>> PID 1.  [...])
>
> Why is that a problem?  I don't see any statement of what kernel you're
> running under, but I can think of two plausible reasons offhand: (1)
> the kernel refuses to coredump PID 1 under any circumstances or (2)
> there's no writable filesystem to take a coredump on at that point.

The kernel panics immediately upon PID 1 exiting and even if the panic is
deferred until after it's written the core dump instead of a check at the START
of exiting, the writeable filesystem is initramfs which is transient. Best case
scenario would be _if_ the panic happens at the _end_ of exiting (highly
unlikely, but maybe patchable) setting up a network block device and making it
O_DIRECT somehow so the data goes out before the exit without being delayed by
disk cache or nagle or kernel tasklets being asynchronous or anything.

Once upon a time (like 2.0 or something) the kernel continued processing
network packets and such after panic, so setting up your firewall rules and
then intentionally panicking the kernel was considered the most secure way to
set up a Linux router. (Try exploiting a system with NO USERSPACE.) But alas,
the kernel got "improved" so that no longer works. (The theory was freeze file
IO _now_ because we dunno what's corrupted, so flushing caches to disk and/or
network filesystems may make things worse, so STOP EVERYTHING and preserve as
much forensic evidence as possible in case of kernel crash dumps or kgdb or
kexec on panic or similar.
Needing to keep the device you're writing kernel crash dumps to active was, of
course, one of those truly funky sequencing issues the kernel got subtly wrong
for many years, but the plumbing rewrite that gave us sysfs and years of
working on suspend sequencing finally straightened out the dependencies I
think?)

> To address (1), I'd just build a kernel with that test diked out.
>
> To address (2), I'd normally netboot.  If that's not feasible for some
> reason, I'd probably hack on the kernel to remount / read-write before
> starting userland.

A) I believe you can still pass rw on the kernel command line, B) you can run a
dumb little statically linked shim.c as rdinit= to do stuff and then have it
exec() the next PID 1 process, that's fairly standard procedure in this
context.

Don't have to modify the kernel for either, but "file reliably written out as
kernel is in the process of panicking"...

I already mentioned kgdb, right? There's a way to get a serial console out of
it so the kernel itself is acting as your debugger:

https://www.kernel.org/doc/html/v4.14/dev-tools/kgdb.html

There's some sort of unholy sacrifice-a-chicken-in-the-summoning-circle
layering violations going on when this happens, but yes you can panic to a kgdb
console. I've done it! Not recently though. (Is suspending linux with kgdb and
then resuming still RCU timeout city on a modern kernel? Or did they fix that?)

But I only pull out gdb when I'm REALLY annoyed. (Cure worse than the disease.
Can't STAND the user interface...)

> Of course, you said qemu-something, so you are presumably running under
> emulation.  In principle, you could figure this out from emulator
> traces, but that is likely to be both extremely difficult and extremely
> tedious.
>
> But - you said memset-to-zero on the struct ran but didn't stop it from
> failing.  I'd try memset to various other values, to see if you can
> find one that makes it stop crashing.

Except sigsetjmp() is writing to the structure.
The function is not supposed to be reading from the structure. The memset() was
to dirty the memory so I could be sure there wasn't some sort of -EACCES or a
soft fault from stack growth somehow(?) causing a hiccup.

Rob
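
For reference, the memset-then-sigsetjmp experiment discussed above boils down
to something like the following sketch. The function name and the fill values
are made up and this is not the toybox code; it only shows that the buffer
contents going in should not matter, since sigsetjmp() overwrites them. It does
not reproduce the sh4 fault on its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <string.h>

/* Hypothetical test helper: dirty the jump buffer with an arbitrary fill
 * byte, save the context, jump back to it once, and report whether control
 * came back through siglongjmp(). Returns 1 on a successful round trip. */
int sigsetjmp_roundtrip(int fill)
{
    sigjmp_buf env;
    volatile int jumped = 0;   /* volatile so it survives the jump */

    /* Touch every byte first, which also rules out a fault from the
     * buffer's memory not being writable yet. */
    memset(env, fill, sizeof(env));

    if (sigsetjmp(env, 1) == 0) {
        jumped = 1;
        siglongjmp(env, 1);    /* does not return */
    }

    return jumped;
}
```

If sigsetjmp() really is write-only with respect to the buffer, the result
should be the same for a fill of 0, 0xa5, 0xff or anything else.
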