[-- Attachment #1: Type: text/plain, Size: 1118 bytes --] Hi all, I was recently reading the Oracle docs about the ELF, and I came across their chapter about the COPY relocation. They discuraged its use, since with those relocations, a binding exists between importing and exporting module. If the semantics of the imported object changes, then this is an ABI mismatch. So I looked at the musl source code and noticed that COPY relocations are simply processed, and an ABI mismatch is simply accepted. So, since I am of the opinion that detectable errors should be detected, rather than left to fester and spring a hard-to-explain bug on you, usually five minutes before deadline, I wrote the attached patch to add detection for at least a changed size. This won't detect all changes to ABI regarding COPY relocation (e.g.int-->float, or in an array of structs, a change to the struct size and to the array size cancelling each other out), but it should find most of them. Also, I wondered whether COPY relocations are even still in use. But on my system (currently some Ubuntu version) I found over 15000 of the things. Mostly for stdout and stderr, though. Ciao, Markus [-- Attachment #2: 0001-Add-detection-for-changed-size-of-a-COPY-relocation.patch --] [-- Type: text/x-diff, Size: 1631 bytes --] From 75e98f4e4cef2eb2b867062aebc481c3b1f66498 Mon Sep 17 00:00:00 2001 From: Markus Wichmann <nullplan@gmx.net> Date: Wed, 26 Feb 2020 06:09:14 +0100 Subject: [PATCH] Add detection for changed size of a COPY relocation. COPY relocations create an ABI binding between importing and exporting module. Should anything about the object in question change, that would be an ABI change, and therefore incompatible. While the dynamic linker is not capable of detecting all changes, it can detect most of them by detecting a changed size between import and export. Any change is a problem, since the source buffer will be either overread or underread. In any case, if the semantics of the imported object changed, the ABI contract is broken, and it is better to detect this than to silently allow it and inexplicably crash later on. --- ldso/dynlink.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/ldso/dynlink.c b/ldso/dynlink.c index afec985a..618c2cbd 100644 --- a/ldso/dynlink.c +++ b/ldso/dynlink.c @@ -435,6 +435,15 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri else *reloc_addr = (size_t)base + addend; break; case REL_COPY: + if (def.sym && sym->st_size != def.sym->st_size) { + error("Error relocating %s: %s: Size mismatch in COPY" + " relocation (exp %lu, got %lu)", + dso->name, name + sym->st_size + 0ul, + def.sym->st_size + 0ul); + if (runtime) longjmp(*rtld_fail, 1); + continue; + } memcpy(reloc_addr, (void *)sym_val, sym->st_size); break; case REL_OFFSET32: -- 2.17.1
On Wed, Feb 26, 2020 at 06:24:48AM +0100, Markus Wichmann wrote: > Hi all, > > I was recently reading the Oracle docs about the ELF, and I came across > their chapter about the COPY relocation. They discuraged its use, since > with those relocations, a binding exists between importing and exporting > module. If the semantics of the imported object changes, then this is an > ABI mismatch. > > So I looked at the musl source code and noticed that COPY relocations > are simply processed, and an ABI mismatch is simply accepted. So, since > I am of the opinion that detectable errors should be detected, rather > than left to fester and spring a hard-to-explain bug on you, usually > five minutes before deadline, I wrote the attached patch to add > detection for at least a changed size. This won't detect all changes to > ABI regarding COPY relocation (e.g.int-->float, or in an array of > structs, a change to the struct size and to the array size cancelling > each other out), but it should find most of them. > > Also, I wondered whether COPY relocations are even still in use. But on > my system (currently some Ubuntu version) I found over 15000 of the > things. Mostly for stdout and stderr, though. In theory copy relocations should have been gone for everyone who switched to PIE, but then the gcc/binutils folks brought them back as an "optimization" (allowing pc-rel access to external symbols). :-( Anyway, I'm in agreement that we should avoid clobbering memory. The old code did that already by using sym->st_size rather than def.sym->st_size, which is why I hadn't seen this as an issue. But it can over-read (and potentially fault) if the copy reloc has a larger size than the definition, and if it's smaller the library might wrongly access past the end of the copy, reading or clobbering unrelated data. At the very least I think we ought to catch and error on the case where def.sym->st_size>sym->st_size, since we can't honor it and failure to honor it can produce silent memory corruption. I'm less sure about what to do if def.sym->st_size<sym->st-size; this case seems safe and might be desirable not to break (I vaguely recall an intent that it be ok), but if you think there are reasons it's dangerous I'm ok with disallowing it too. I'm having a hard time now thinking of a reason it would really help to support that, anyway. > From 75e98f4e4cef2eb2b867062aebc481c3b1f66498 Mon Sep 17 00:00:00 2001 > From: Markus Wichmann <nullplan@gmx.net> > Date: Wed, 26 Feb 2020 06:09:14 +0100 > Subject: [PATCH] Add detection for changed size of a COPY relocation. > > COPY relocations create an ABI binding between importing and exporting > module. Should anything about the object in question change, that would > be an ABI change, and therefore incompatible. While the dynamic linker > is not capable of detecting all changes, it can detect most of them by > detecting a changed size between import and export. Any change is a > problem, since the source buffer will be either overread or underread. > In any case, if the semantics of the imported object changed, the ABI > contract is broken, and it is better to detect this than to silently > allow it and inexplicably crash later on. > --- > ldso/dynlink.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/ldso/dynlink.c b/ldso/dynlink.c > index afec985a..618c2cbd 100644 > --- a/ldso/dynlink.c > +++ b/ldso/dynlink.c > @@ -435,6 +435,15 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri > else *reloc_addr = (size_t)base + addend; > break; > case REL_COPY: > + if (def.sym && sym->st_size != def.sym->st_size) { > + error("Error relocating %s: %s: Size mismatch in COPY" > + " relocation (exp %lu, got %lu)", > + dso->name, name > + sym->st_size + 0ul, > + def.sym->st_size + 0ul); Missing , after name, and the indent is weird (2 extra levels). I'd either align (spaces) with the opening paren or use just one indent level, and fewer lines. Semantically, st_size is size_t not ulong, so I'd probably lean towards a cast to (size_t) here with %zu. > + if (runtime) longjmp(*rtld_fail, 1); > + continue; > + } > memcpy(reloc_addr, (void *)sym_val, sym->st_size); > break; > case REL_OFFSET32: > -- Otherwise LGTM. Rich
* Rich Felker:
> At the very least I think we ought to catch and error on the case
> where def.sym->st_size>sym->st_size, since we can't honor it and
> failure to honor it can produce silent memory corruption. I'm less
> sure about what to do if def.sym->st_size<sym->st-size; this case
> seems safe and might be desirable not to break (I vaguely recall an
> intent that it be ok), but if you think there are reasons it's
> dangerous I'm ok with disallowing it too. I'm having a hard time now
> thinking of a reason it would really help to support that, anyway.
Unfortunately the Mozilla NSS people disagree that size mismatches for
global symbols are an ABI break. I don't know if this is relevant in
the musl context, but it means that for glibc, we probably can't make
it a hard error.
I want to have better diagnostics for this in glibc, but the current
warning (which is poorly worded at that) is in the
architecture-specific code, and I got side-tracked when I tried to
clean this up the last time.
On Wed, Feb 26, 2020 at 07:38:31PM +0100, Florian Weimer wrote:
> * Rich Felker:
>
> > At the very least I think we ought to catch and error on the case
> > where def.sym->st_size>sym->st_size, since we can't honor it and
> > failure to honor it can produce silent memory corruption. I'm less
> > sure about what to do if def.sym->st_size<sym->st-size; this case
> > seems safe and might be desirable not to break (I vaguely recall an
> > intent that it be ok), but if you think there are reasons it's
> > dangerous I'm ok with disallowing it too. I'm having a hard time now
> > thinking of a reason it would really help to support that, anyway.
>
> Unfortunately the Mozilla NSS people disagree that size mismatches for
> global symbols are an ABI break. I don't know if this is relevant in
> the musl context, but it means that for glibc, we probably can't make
> it a hard error.
>
> I want to have better diagnostics for this in glibc, but the current
> warning (which is poorly worded at that) is in the
> architecture-specific code, and I got side-tracked when I tried to
> clean this up the last time.
Thanks for the feedback. Do you have a source where we could read more
about this? What non-broken behavior do they expect to get when sizes
don't match?
As an aside, I think we should be encouraging distros that are using
PIE to get rid of copy relocations by passing whatever options are
needed (or building gcc with whatever options are needed) to avoid
emitting them in PIE. IIRC I looked this up once but I can't remember
what I found.
Rich
* Rich Felker: > On Wed, Feb 26, 2020 at 07:38:31PM +0100, Florian Weimer wrote: >> * Rich Felker: >> >> > At the very least I think we ought to catch and error on the case >> > where def.sym->st_size>sym->st_size, since we can't honor it and >> > failure to honor it can produce silent memory corruption. I'm less >> > sure about what to do if def.sym->st_size<sym->st-size; this case >> > seems safe and might be desirable not to break (I vaguely recall an >> > intent that it be ok), but if you think there are reasons it's >> > dangerous I'm ok with disallowing it too. I'm having a hard time now >> > thinking of a reason it would really help to support that, anyway. >> >> Unfortunately the Mozilla NSS people disagree that size mismatches for >> global symbols are an ABI break. I don't know if this is relevant in >> the musl context, but it means that for glibc, we probably can't make >> it a hard error. >> >> I want to have better diagnostics for this in glibc, but the current >> warning (which is poorly worded at that) is in the >> architecture-specific code, and I got side-tracked when I tried to >> clean this up the last time. > > Thanks for the feedback. Do you have a source where we could read more > about this? What non-broken behavior do they expect to get when sizes > don't match? There's an NSS bug report: <https://bugzilla.mozilla.org/show_bug.cgi?id=1201900> It seems that the NSS situation is better than what I remembered. > As an aside, I think we should be encouraging distros that are using > PIE to get rid of copy relocations by passing whatever options are > needed (or building gcc with whatever options are needed) to avoid > emitting them in PIE. IIRC I looked this up once but I can't remember > what I found. If I recall correctly, the optimization was a factor when rolling out PIE-by-default in Fedora. I do not know if we can revert it without switching back to fixed-address builds. Even if we did that, the ABI incompatibility will still be there. There is also a similar truncation issue for TLS variables, I think.
On Wed, Feb 26, 2020 at 12:36:21PM -0500, Rich Felker wrote: > In theory copy relocations should have been gone for everyone who > switched to PIE, but then the gcc/binutils folks brought them back as > an "optimization" (allowing pc-rel access to external symbols). :-( > Several quotes about optimization come to mind. Especially ones with the word "premature" in them. I just checked and I saw that there are no copy relocs in my libraries, but my binaries are indeed PIEs. Therefore I have to conclude that gcc compiles PIE and PIC differently. Weren't they meant to be the same, except maybe for the selection of object files during the linker phase? > At the very least I think we ought to catch and error on the case > where def.sym->st_size>sym->st_size, since we can't honor it and > failure to honor it can produce silent memory corruption. I'm less > sure about what to do if def.sym->st_size<sym->st-size; this case > seems safe and might be desirable not to break (I vaguely recall an > intent that it be ok), but if you think there are reasons it's > dangerous I'm ok with disallowing it too. I'm having a hard time now > thinking of a reason it would really help to support that, anyway. > If the application did import sym->st_size bytes, it probably wanted to do something with them. Finding less than that at the symbol and garbage afterwards (and, from the point of view of the dynlinker, you have no way of knowing what counts as garbage and what doesn't) might break the program. Sometimes. That is exactly the kind of rare subtle breakage I added the error for. > Missing , after name, and the indent is weird (2 extra levels). I'd > either align (spaces) with the opening paren or use just one indent > level, and fewer lines. > The missing comma is weird. Must have gone missing as I was cutting that line down to size (originally, I had this as one line, but then it was around 200 columns long, and I thought that was a bit excessive). So then I tried to stay below 80 and you see the result before you. The indent was just vim's default setting. Usually this suits me, but usually I have shiftwidth=4 and expandtab. For musl, it's sw=8 and noet. > Semantically, st_size is size_t not ulong, so I'd probably lean > towards a cast to (size_t) here with %zu. > Isn't on Linux size_t always a ulong? I will admit I only went for ulong because it's easier to write the cast. Ciao, Markus
On Wed, Feb 26, 2020 at 09:12:02PM +0100, Markus Wichmann wrote: > On Wed, Feb 26, 2020 at 12:36:21PM -0500, Rich Felker wrote: > > In theory copy relocations should have been gone for everyone who > > switched to PIE, but then the gcc/binutils folks brought them back as > > an "optimization" (allowing pc-rel access to external symbols). :-( > > > > Several quotes about optimization come to mind. Especially ones with the > word "premature" in them. > > I just checked and I saw that there are no copy relocs in my libraries, > but my binaries are indeed PIEs. Therefore I have to conclude that gcc > compiles PIE and PIC differently. Weren't they meant to be the same, > except maybe for the selection of object files during the linker phase? Libraries inherently can't have copy relocations. Only the main program can. > > At the very least I think we ought to catch and error on the case > > where def.sym->st_size>sym->st_size, since we can't honor it and > > failure to honor it can produce silent memory corruption. I'm less > > sure about what to do if def.sym->st_size<sym->st-size; this case > > seems safe and might be desirable not to break (I vaguely recall an > > intent that it be ok), but if you think there are reasons it's > > dangerous I'm ok with disallowing it too. I'm having a hard time now > > thinking of a reason it would really help to support that, anyway. > > > > If the application did import sym->st_size bytes, it probably wanted to > do something with them. Not really. For example: extern char *strings[]; extern size_t nstrings; This type of thing was common with historical mistakes like sys_errlist, the equivalent for signals, etc., where the application either had an out-of-band way of knowing the runtime-provided size, or it could just assume all indices it got from the library were in-bounds to access the array. It went out of fashion entirely because of copy relocations breaking it. > > Missing , after name, and the indent is weird (2 extra levels). I'd > > either align (spaces) with the opening paren or use just one indent > > level, and fewer lines. > > The missing comma is weird. Must have gone missing as I was cutting that > line down to size (originally, I had this as one line, but then it was > around 200 columns long, and I thought that was a bit excessive). So > then I tried to stay below 80 and you see the result before you. > > The indent was just vim's default setting. Usually this suits me, but > usually I have shiftwidth=4 and expandtab. For musl, it's sw=8 and noet. > > > Semantically, st_size is size_t not ulong, so I'd probably lean > > towards a cast to (size_t) here with %zu. > > Isn't on Linux size_t always a ulong? I will admit I only went for ulong > because it's easier to write the cast. Yes, but the ldso code doesn't use that assumption, or at least it's not intended to. It's used to some extent elsewhere in the syscall glue framework. Rich
On Wed, Feb 26, 2020 at 08:53:03PM +0100, Florian Weimer wrote: > * Rich Felker: > > > On Wed, Feb 26, 2020 at 07:38:31PM +0100, Florian Weimer wrote: > >> * Rich Felker: > >> > >> > At the very least I think we ought to catch and error on the case > >> > where def.sym->st_size>sym->st_size, since we can't honor it and > >> > failure to honor it can produce silent memory corruption. I'm less > >> > sure about what to do if def.sym->st_size<sym->st-size; this case > >> > seems safe and might be desirable not to break (I vaguely recall an > >> > intent that it be ok), but if you think there are reasons it's > >> > dangerous I'm ok with disallowing it too. I'm having a hard time now > >> > thinking of a reason it would really help to support that, anyway. > >> > >> Unfortunately the Mozilla NSS people disagree that size mismatches for > >> global symbols are an ABI break. I don't know if this is relevant in > >> the musl context, but it means that for glibc, we probably can't make > >> it a hard error. > >> > >> I want to have better diagnostics for this in glibc, but the current > >> warning (which is poorly worded at that) is in the > >> architecture-specific code, and I got side-tracked when I tried to > >> clean this up the last time. > > > > Thanks for the feedback. Do you have a source where we could read more > > about this? What non-broken behavior do they expect to get when sizes > > don't match? > > There's an NSS bug report: > > <https://bugzilla.mozilla.org/show_bug.cgi?id=1201900> > > It seems that the NSS situation is better than what I remembered. Good to know. > > As an aside, I think we should be encouraging distros that are using > > PIE to get rid of copy relocations by passing whatever options are > > needed (or building gcc with whatever options are needed) to avoid > > emitting them in PIE. IIRC I looked this up once but I can't remember > > what I found. > > If I recall correctly, the optimization was a factor when rolling out > PIE-by-default in Fedora. I do not know if we can revert it without > switching back to fixed-address builds. I think this is almost surely premature optimization. In almost all cases, if there's software where the performance impact makes a difference it can be avoided by giving the affected global data objects visibility of hidden (if it's not used outside the main program anyway) or protected (if it needs to be externally visible). But on x86_64 and aarch64, and to some extent on 32-bit arm as well, the performance difference of accessing globals via the got vs pc-relative is negligible. > Even if we did that, the ABI incompatibility will still be there. Yes. But fixing it would avoid any bugs from fallout of the full object not being copyable at runtime in any newly build programs. > There is also a similar truncation issue for TLS variables, I think. TLS variables never use copy relocations, except for a short period of time on riscv64, where thankfully it was realized to be a mistake and reverted. So I don't think this issue applies to TLS. Rich
On 2020-02-26, Rich Felker wrote: >On Wed, Feb 26, 2020 at 08:53:03PM +0100, Florian Weimer wrote: >> * Rich Felker: >> >> > On Wed, Feb 26, 2020 at 07:38:31PM +0100, Florian Weimer wrote: >> >> * Rich Felker: >> >> >> >> > At the very least I think we ought to catch and error on the case >> >> > where def.sym->st_size>sym->st_size, since we can't honor it and >> >> > failure to honor it can produce silent memory corruption. I'm less >> >> > sure about what to do if def.sym->st_size<sym->st-size; this case >> >> > seems safe and might be desirable not to break (I vaguely recall an >> >> > intent that it be ok), but if you think there are reasons it's >> >> > dangerous I'm ok with disallowing it too. I'm having a hard time now >> >> > thinking of a reason it would really help to support that, anyway. >> >> >> >> Unfortunately the Mozilla NSS people disagree that size mismatches for >> >> global symbols are an ABI break. I don't know if this is relevant in >> >> the musl context, but it means that for glibc, we probably can't make >> >> it a hard error. >> >> >> >> I want to have better diagnostics for this in glibc, but the current >> >> warning (which is poorly worded at that) is in the >> >> architecture-specific code, and I got side-tracked when I tried to >> >> clean this up the last time. >> > >> > Thanks for the feedback. Do you have a source where we could read more >> > about this? What non-broken behavior do they expect to get when sizes >> > don't match? +1 for a `def.sym->st_size!=sym->st-size` diagnostic. >> There's an NSS bug report: >> >> <https://bugzilla.mozilla.org/show_bug.cgi?id=1201900> >> >> It seems that the NSS situation is better than what I remembered. > >Good to know. There may be instances where the code takes the address of a global variable but does not actually care about the contents (st_size does not matter). > >> > As an aside, I think we should be encouraging distros that are using >> > PIE to get rid of copy relocations by passing whatever options are >> > needed (or building gcc with whatever options are needed) to avoid >> > emitting them in PIE. IIRC I looked this up once but I can't remember >> > what I found. >> >> If I recall correctly, the optimization was a factor when rolling out >> PIE-by-default in Fedora. I do not know if we can revert it without >> switching back to fixed-address builds. > >I think this is almost surely premature optimization. In almost all >cases, if there's software where the performance impact makes a >difference it can be avoided by giving the affected global data >objects visibility of hidden (if it's not used outside the main >program anyway) or protected (if it needs to be externally visible). >But on x86_64 and aarch64, and to some extent on 32-bit arm as well, >the performance difference of accessing globals via the got vs >pc-relative is negligible. clang has an option -mpie-copy-relocations (the name could be improved), which enables direct access (usually PC-relative) for -fPIE. // -target aarch64 -fPIE -O3 adrp x8, :got:var ldr x8, [x8, :got_lo12:var] ldr w0, [x8] ret // -target aarch64 -fPIE -O3 -mpie-copy-relocations adrp x8, var ldr w0, [x8, :lo12:var] ret On x86, with R_X86_64_[REX_]GOTPCRELX, the option is still beneficial. // -O3 -fPIE a.c -Wa,--mrelax-relocations=yes 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7> 3: R_X86_64_REX_GOTPCRELX var-0x4 7: 8b 00 mov (%rax),%eax 9: c3 retq // -O3 -fPIE a.c -mpie-copy-relocations 0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 6 <foo+0x6> 2: R_X86_64_PC32 var-0x4 6: c3 retq Users of -mpie-copy-relocations may compile their applications in a mostly statically linking mode, with only a few definitions in DSOs. If some variables are unfortunately really external, R_*_COPY will be produced by the linker. >> Even if we did that, the ABI incompatibility will still be there. > >Yes. But fixing it would avoid any bugs from fallout of the full >object not being copyable at runtime in any newly build programs. > >> There is also a similar truncation issue for TLS variables, I think. > >TLS variables never use copy relocations, except for a short period of >time on riscv64, where thankfully it was realized to be a mistake and >reverted. So I don't think this issue applies to TLS. > >Rich https://sourceware.org/bugzilla/show_bug.cgi?id=23825 (GCC 10 riscv should have fixed this.) (glibc/sysdeps/riscv/dl-machine.h has a hack supporting R_RISCV_COPY on STT_TLS. No other known ld.so supports it.)
* Rich Felker: >> If I recall correctly, the optimization was a factor when rolling out >> PIE-by-default in Fedora. I do not know if we can revert it without >> switching back to fixed-address builds. > > I think this is almost surely premature optimization. In almost all > cases, if there's software where the performance impact makes a > difference it can be avoided by giving the affected global data > objects visibility of hidden (if it's not used outside the main > program anyway) or protected (if it needs to be externally visible). > But on x86_64 and aarch64, and to some extent on 32-bit arm as well, > the performance difference of accessing globals via the got vs > pc-relative is negligible. I think the main obstacle is that we do not have the required annotations in the header files. LTO only for deciding whether calls are local or not could perhaps solve this. >> There is also a similar truncation issue for TLS variables, I think. > > TLS variables never use copy relocations, except for a short period of > time on riscv64, where thankfully it was realized to be a mistake and > reverted. So I don't think this issue applies to TLS. You are right, the optimization I feared is not actually implemented.