Date: Fri, 20 Dec 2019 16:58:24 +0000
From: Stephane Chazelas <stephane.chazelas@gmail.com>
To: zsh-workers@zsh.org
Subject: Re: zsh converts a floating-point number to string with too much precision
Message-ID: <20191220165824.ufvjtx37xt7dp2dt@chaz.gmail.com>
References: <20191220013711.GA708801@zira.vinc17.org>
In-Reply-To: <20191220013711.GA708801@zira.vinc17.org>

2019-12-20 02:37:11 +0100, Vincent Lefevre:
> With zsh 5.7.1, I get:
>
> zira% echo $((1.1))
> 1.1000000000000001
>
> because zsh seems to first select the precision independently
> from the value, i.e. 17 to be able to convert the string back
> to floating point, preserving the original value, then it
> outputs the closest number in this precision.
>
> Instead, zsh should select the minimum precision so that the
> inverse conversion can give the original value, i.e. it should
> output 1.1 here.

And what should it give for $((1.1000000000000001))?
(hint: 1.1000000000000001 and 1.1 have the same "double" representation)
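You can check that with a small standalone C program (just an
illustration using strtod(), the same conversion zsh does on input;
this is not zsh code):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* both decimal strings are converted to the nearest double */
        double a = strtod("1.1", NULL);
        double b = strtod("1.1000000000000001", NULL);

        printf("same double: %s\n", a == b ? "yes" : "no"); /* yes */
        printf("both print as: %.17g\n", a); /* 1.1000000000000001 */
        return 0;
    }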
See also:
https://unix.stackexchange.com/questions/422122/why-does-0-1-expand-to-0-10000000000000001-in-zsh

Reproduced below for convenience:

════════════════════════════════════════════════════════════════

TL;DR

zsh chooses a decimal representation for the binary "double" numbers it
uses to evaluate floating point arithmetic that preserves their
information fully, i.e. one that is safe to feed back into its
arithmetic expressions. That is done at the expense of cosmetics: it
needs 17 significant digits, and zsh also makes sure the expansion
always includes a . or an e so it's treated as a float on re-input.

That "full-precision" decimal representation could be seen as an
intermediary format between the binary double-precision machine-only
numbers and a human-readable one; an intermediary format that is
understood by all tools that understand decimal representations of
floating point numbers.

In the case of 0.1 as used in an arithmetic expression, it so happens
that the closest 17-digit decimal representation of the
double-precision binary number closest to 0.1 is 0.10000000000000001,
an artefact caused by the limited precision of double-precision
numbers and by rounding.

Other shells privilege the cosmetic aspect and do lose some
information upon conversion to decimal (though they still try to
preserve as much precision as possible within that additional
constraint). Both approaches have their merits and drawbacks, see
below for details.

awk doesn't have this kind of problem as it's not a shell and doesn't
have to translate back and forth constantly between binary and decimal
representations when manipulating floating point numbers.

zsh's approach

zsh, like many other programming languages (including yash and ksh93)
and many tools used from the shell (like awk, printf...) that deal
with floating point numbers, performs arithmetic operations on a
binary representation of those numbers. That's convenient and
efficient because those operations are supported by the C compiler and
on most architectures are done by the processor itself.

zsh uses the double C type for its internal representation of real
numbers. On most architectures (and with most compilers), those are
implemented using IEEE 754 double-precision binary floating point
numbers. Those work a bit like our 1.12e4 engineering notation decimal
numbers, but in binary (base 2) instead of decimal (base 10), with the
mantissa on 53 bits (1 of which is implied), the exponent on 11 bits,
and a sign bit. They generally give you more precision than you'd ever
need.

When evaluating an arithmetic expression like 1. / 10 (which here has
a literal float constant as one of the operands), zsh converts the
operands from their text decimal representation to doubles internally
(using the standard strtod() function) and does the operation, which
results in a new double.

1/10 can be represented in decimal notation as 0.1 or 1e-1, but just
like we can't represent 1/3 in decimal (it would be fine in base 3, 6
or 9), 1/10 cannot be represented exactly in binary (as 10 is not a
power of 2). Like 1/3 is 0.333333[adlib] in decimal, 1/10 is
.0001100110011001100110011001[adlib] or
1.10011001100110011001[adlib]p-4 in binary (where p-4 stands for 2^-4,
the 4 here being in decimal).

As we can only store 52 bits' worth of those 1001..., 1/10 as a double
becomes

1.1001100110011001100110011001100110011001100110011010p-4

(note the rounding in the last 2 digits). That's the closest
representation of 1/10 that we can get with doubles.
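To see that double from C (a standalone sketch, not zsh code; the %a
output shown in the comment is what glibc prints, other libcs may
format it slightly differently):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        double d = strtod("0.1", NULL); /* the same conversion zsh does */

        /* C99 hexadecimal floating point notation: hex mantissa and a
         * power-of-2 exponent.  0x1.999999999999ap-4 is the same number
         * as the binary 1.1001100110011001...1010p-4 above
         * (hex 9 = 1001, hex a = 1010). */
        printf("%a\n", d);
        return 0;
    }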
If we convert that back to decimal, we get:

#         1         2
#12345678901234567890
.1000000000000000055511151231257827021181583404541015625

The double just below it
(1.1001100110011001100110011001100110011001100110011001p-4) is:

.09999999999999999167332731531132594682276248931884765625

and the one just above it
(1.1001100110011001100110011001100110011001100110011011p-4):

.10000000000000001942890293094023945741355419158935546875

Neither of those is as close to 1/10.

Now, zsh is first and foremost a shell, that is, a command line
interpreter. Sooner or later it will need to pass the floating point
number that results from the arithmetic expression to a command. In a
non-shell programming language, you'd pass your double to the function
you want to call, but in a shell you can only pass strings to
commands. You can't pass the raw byte values of your double, as they
may very well contain NUL bytes and in any case the commands would not
know what to do with them. So you need to convert it back to a string
notation that the command understands.

There are notations like the C99 0xc.ccccccccccccccdp-7 hexadecimal
floating point notation that can easily represent an IEEE 754 binary
floating point number, but it's not widely supported yet and is more
generally meaningless to most mortal humans (few people would
recognise 0.1 at first sight above). So the result of the $((...))
arithmetic expansion is actually a floating point number in decimal
notation¹.

Now, .1000000000000000055511151231257827021181583404541015625 is a bit
lengthy, and it's pointless to give that much precision given that
doubles (and so the results of arithmetic expressions) don't carry
that much precision. In effect,
.1000000000000000055511151231257827021181583404541015625,
.100000000000000005551115123125782, or even 0.1 in this case would all
convert back to the same double.

If we truncate (and round) to 15 digits, like yash (which also uses
doubles internally for its floating point arithmetic) does, we do get
our 0.1, but then we also get 0.1 for the two other doubles, so we're
losing information as we can't distinguish those 3 different numbers.
If we truncate to 16 digits, we still get 2 of those different doubles
that yield 0.1. We'd need to keep 17 significant decimal digits not to
lose any of the information stored in an IEEE 754 double-precision
number. As [1]the double-precision Wikipedia article puts it (quoting
a paper by William Kahan, the main architect behind IEEE 754):

    If an IEEE 754 double-precision number is converted to a decimal
    string with at least 17 significant digits, and then converted
    back to double-precision representation, the final result must
    match the original number

Conversely, if we use fewer digits, there are binary double values for
which we won't get back the same double once we convert them back, as
seen in the example above.

That's what zsh does: it chooses to preserve the whole precision of
the binary double format in the decimal representation given as the
result of the arithmetic expansion, so that when it is used again in
something (like awk, or printf "%17f", or zsh's own arithmetic
expressions...) that converts it back to a double, it comes back as
the same double. As seen in the zsh code (already there in 2000 when
floating point support was added to zsh):

    /*
     * Conversion from a floating point expression without using
     * a variable. The best bet in this case just seems to be
     * to use the general %g format with something like the maximum
     * double precision.
     */
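Here is a quick standalone C illustration of those 16- vs 17-digit
figures (not zsh code; nextafter() from <math.h> gives the adjacent
double, so compile with -lm):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        double a = 0.1;               /* the double closest to 1/10 */
        double b = nextafter(a, 1.0); /* the next double above it */
        char buf[32];

        printf("%.16g %.16g\n", a, b); /* 0.1 0.1: the two doubles collapse */
        printf("%.17g %.17g\n", a, b); /* 0.10000000000000001 0.10000000000000002 */

        /* with 17 digits, converting back gives the exact same double */
        snprintf(buf, sizeof buf, "%.17g", a);
        printf("round-trips: %s\n", strtod(buf, NULL) == a ? "yes" : "no"); /* yes */
        return 0;
    }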
You'll also notice that floats that turn out to have no fractional
part are expanded with a . appended, to make sure they're treated as
floats when used again in an arithmetic expression:

$ zsh -c 'echo $((0.5 * 4))'
2.

If it didn't do that and the result was reused in an arithmetic
expression, it would be treated as an integer instead of a float,
which would affect the behaviour of the operations being performed
(for instance 2/4 is an integer division which yields 0, while 2./4 is
a floating point division which yields 0.5).

Now, that choice of the number of significant digits means that, for
that 0.1 input, the
1.1001100110011001100110011001100110011001100110011010p-4 binary
double (the closest one to 0.1) becomes 0.10000000000000001, which
looks bad when shown to a human. It's even worse when the error is in
the other direction, like 0.3 which becomes 0.29999999999999999.

There's also the converse problem that when we pass that number to an
application that supports more precision than doubles do, we're
actually passing along the tiny error between 0.10000000000000001 and
the 0.1 the user typed, and that error then becomes significant:

$ v=$((0.1)) awk 'BEGIN{print ENVIRON["v"] == 0.1}'
1
$ v=$((0.1)) yash -c 'echo "$((v == 0.1))"'
1

OK because awk and yash use doubles just like zsh, but:

$ echo "$((0.1)) == 0.1" | bc
0
$ v=$((0.1)) ksh93 -c 'echo "$((v == 0.1))"'
0

not OK because bc uses arbitrary precision and ksh93 uses extended
precision on my system.

Now, if instead of 0.1 (1/10), the original decimal input had been
0.11111111111111111 (or any other arbitrary approximation of 1/9), the
tables would turn, which shows it's quite hopeless to do equality
comparisons on floats anyway.

The human display artefact problem can be solved by specifying the
precision at the time of display (after you've done all your
calculations using the full precision), for instance by using printf:

$ x=$((1./10)); printf '%s %g\n' $x $x
0.10000000000000001 0.1

(%g is short for %.6g, also the default output format for floats in
awk). That also removes the extra trailing . on integer floats.

yash's (and ksh93's) approach

yash chose to remove the artefacts at the expense of precision: 15 is
the highest number of significant decimal digits that guarantees there
won't be this kind of artefact when converting a number from decimal
to binary and back to decimal, as in our $((0.1)) case.

The fact that information in the binary number is lost upon conversion
to decimal can cause other forms of artefacts:

$ yash -c 'x=$((1./3)); echo "$((x == 1./3)) $((1./3 == 1./3))"'
0 1

Though (in)equality comparisons are generally unsafe with floating
points, here we could expect x and 1./3 to be identical as they are
the result of the exact same operation. Also:

$ yash -c 'x=$((0.5 * 3)); y=$((1.25 * 4)); echo "$((x / y))"'
0.3
$ yash -c 'x=$((0.5 * 6)); y=$((1.25 * 4)); echo "$((x / y))"'
0

(as yash doesn't always include a . or an e in the decimal
representation of a floating point result, the next arithmetic
operation can end up being either an integer operation or a floating
point one). Or:

$ yash -c 'a=$((1e15)); echo $((a*100000))'
1e+20
$ yash -c 'a=$((1e14)); echo $((a*100000))'
-8446744073709551616

($((1e15)) expands to 1e+15 which is taken as a float, while $((1e14))
expands to 100000000000000 which is taken as an integer and causes the
overflow, because we're actually doing an integer multiplication
instead of a float multiplication).

While the artefact problems can be addressed in zsh by reducing the
precision upon display, as seen above, the precision lost in other
shells cannot be recovered.
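As a rough standalone C sketch of what that loss looks like (assuming
a yash-like conversion with %.15g; an illustration, not yash's actual
code), with the equivalent yash command right after it:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        double x = 5.0 / 9.0;  /* what $((5./9)) computes internally */
        char buf[32];

        /* converting to decimal with only 15 significant digits... */
        snprintf(buf, sizeof buf, "%.15g", x);
        printf("expanded as: %s\n", buf); /* 0.555555555555556 */

        /* ...gives a string that converts back to a different double:
         * the 16th and 17th digits of precision are gone for good */
        double y = strtod(buf, NULL);
        printf("same double after round-trip: %s\n", y == x ? "yes" : "no"); /* no */
        printf("%.17g vs %.17g\n", x, y);
        return 0;
    }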
$ yash -c 'printf "%.17g\n" $((5./9))'
0.555555555555556

(still only 15 digits)

In any case, however short you truncate, you can always end up with
artefacts in the results of arithmetic expansions, as errors are
inherent to floating point representations:

$ yash -c 'echo $((10.1 - 10))'
0.0999999999999996

Which is yet another illustration of why you can't really use the
equality operator with floating points:

$ zsh -c 'echo $((10.1 - 10 == 0.1))'
0
$ yash -c 'echo "$((10.1 - 10 == 0.1))"'
0

ksh93

The case of ksh93 is more complex. ksh93 uses long doubles instead of
doubles where available. long doubles are only guaranteed by C to be
at least as large as doubles. In practice, depending on the compiler
and architecture, they are most often either IEEE 754 double precision
(64 bits) like doubles, IEEE 754 quadruple precision (128 bits), or
extended precision (80 bits of precision, but often stored on 128
bits), as when ksh93 is built for GNU/Linux systems running on x86. To
represent them fully and unambiguously in decimal, you need
respectively 17, 36 or 21 significant digits. ksh93 truncates at 18
significant digits.

I can only test on the x86 architecture at the moment, but my
understanding is that on systems where long doubles are like doubles,
you'd get the same kind of artefacts as with zsh (worse, as it uses 18
digits instead of 17). Where long doubles have 80 or 128 bits of
precision, you get the same kind of problems as with yash, except that
the situation is better when interacting with tools that work with
doubles, as ksh93 gives them more precision than they need and
preserves as much precision as they give it.

$ ksh93 -c 'x=$((1./3)); echo "$((x == 1. / 3))"'
0

is still a "problem", but:

$ ksh93 -c 'x=$((1./3)) awk "BEGIN{print ENVIRON[\"x\"] == 1/3}"'
1

is OK.

Where the behaviour is suboptimal though is when typeset -F/-E are
used. In that case, ksh93 truncates to 15 significant digits when
assigning a value to a variable, even if you request a precision
greater than 15:

$ ksh93 -c 'typeset -F21 x; ((x = y = 1./3)); echo "$((x == y))"'
0
$ ksh93 -c 'typeset -F21 x; ((y = 1./3)); x=$y; echo "$((x == y))"'
0

There are also differences in behaviour between ksh93, zsh and yash
when it comes to the handling of the locale's decimal radix character
(whether to use/recognise 3.14 or 3,14), which affects the ability to
feed the result of arithmetic expansions back into arithmetic
expressions. zsh is consistent again in that the result of expansions
can always be used inside arithmetic expressions there, regardless of
the user's locale.

awk

awk is one of those programming languages that are not shells but
handle floating point numbers; the same would apply to perl. Its
variables are not limited to strings and nowadays generally store
numbers internally as binary doubles (gawk also supports arbitrary
precision numbers as an extension).

The conversion to the string decimal notation only happens when
printing a number, as in:

$ awk 'BEGIN {print 0.1}'
0.1

in which case it uses the format specified in the OFMT special
variable (%.6g by default), whose precision can be made arbitrarily
large:

$ awk -v OFMT=%.80g 'BEGIN{print 0.1}'
0.1000000000000000055511151231257827021181583404541015625

or when there is an implicit conversion of a number to a string, as
when a string operator (concatenation, substr(), index()...) is used,
in which case the CONVFMT variable is used instead (except for integer
numbers).
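In C terms, that implicit conversion amounts to something like an
snprintf() with the CONVFMT format (a rough sketch of the idea, not
gawk's actual code; %.6g is CONVFMT's default value). The awk command
just below then demonstrates both OFMT and CONVFMT at once:

    #include <stdio.h>

    int main(void)
    {
        double x = 0.1; /* the binary double held in an awk variable */
        char str[64];

        /* CONVFMT-style conversion: what happens when the number is
         * used where a string is expected (concatenation, substr()...) */
        snprintf(str, sizeof str, "%.6g", x);
        printf("as a string: %s\n", str); /* 0.1 */

        /* OFMT-style conversion: what print does with non-integral numbers */
        printf("printed:     %.6g\n", x); /* 0.1 */
        return 0;
    }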
$ awk -v OFMT=%.0e -v CONVFMT=%.17g 'BEGIN{x=0.1; print x, ""x}'
1e-01 0.10000000000000001

Or when using printf explicitly.

There is usually no problem of precision being lost internally, as
numbers are not converted back and forth between decimal and binary
representations, and on output one can decide how much or how little
precision to give out.

Conclusion

In conclusion, I'll just offer my personal opinion. Shell floating
point arithmetic is not something I use often. Most of the time, it's
through zsh's zcalc autoloadable calculator function, which prints
floats with 6-digit precision anyway, and anything past the first 3
digits after the decimal point is usually just noise for that kind of
usage.

High precision in arithmetic expansions is a necessity. Whether it's
the full precision, or as much precision as possible while avoiding
some of the artefacts, probably doesn't matter that much, especially
considering that nobody is ever going to use a shell to do extensive
floating point calculations. While it does give me comfort to know
that in zsh the round-tripping to decimal is not going to introduce an
extra layer of errors, I find it more important that the result of
expansions can safely be used inside arithmetic expressions, that
floats stay floats, and that a script will keep working when used in a
locale where the decimal radix character is "," for instance.

════════════════════════════════════════════════════════════════

¹ zsh is the only Korn-like shell I know of that can have arithmetic
expansions in bases other than 10, but that's only for integer ones.

References

1. https://en.wikipedia.org/wiki/Double-precision_floating-point_format#cite_ref-whyieee_1-0