Computer Old Farts Forum
* [COFF] Was [TUHS] tabs vs spaces - entab, detab
From: jpl.jpl @ 2021-03-11 20:02 UTC (permalink / raw)


The tab/detab horse was still twitching, so I decided to beat it a little
more.

Doug's claim that tabs saving space was an urban legend didn't ring true,
but betting against Doug is a good way to get poor quick. So I tossed
together a perl script (version run through col -x is at the end of this
note) to measure savings. A simpler script just counted tabs,
distinguishing leading tabs, which I expected to be very common, from
embedded tabs, which I expected to be rare. In retrospect, embedded tabs
are common in (my) C code, separating structure types from the element
names and trailing comments. As Norman pointed out, genuine tabs often
preserve line to line alignment in the presence of small changes. So the
fancier script distinguishes between leading tabs and embedded tabs for
various possible tab stops. Small tab stops keep heavily indented code
lines short, large tab stops can save more space when tabbing past leading
blanks. My coding style uses a "shiftwidth" of 4, which vi turns into spaces
or tabs, with "standard" tabs every 8 columns. My code therefore benefits
most with tabstops every 4 columns. A lot of code is indented 4 spaces,
which saves 3 bytes when replaced by a tab, but there is no saving with
tabstops at 8. Here's the output when run on itself (before it was
detabbed) and on a largish C program:

  /home/jpl/bin/tabsave.pl /home/jpl/bin/tabsave.pl rsort.c
/home/jpl/bin/tabsave.pl, size 1876
  2: Leading 202, Embedded 3, Total 205
  4: Leading 303, Embedded 4, Total 307
  8: Leading 238, Embedded 5, Total 243

rsort.c, size 209597
  2: Leading 13186, Embedded 4219, Total 17405
  4: Leading 19776, Embedded 5990, Total 25766
  8: Leading 16506, Embedded 6800, Total 23306
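
The 4-versus-8 arithmetic above can be checked with a small sketch. It is
in Python rather than Perl, the name tab_savings is made up here, and it is
not the script at the end of this note: it walks forward over tab-stop cells
instead of backward from the end of the line, and it assumes the input has
already been detabbed.

```python
def tab_savings(line: str, stop: int) -> tuple[int, int]:
    """Return (leading, embedded) bytes saved on one detabbed line.

    Hypothetical helper: a run of 2+ blanks that ends exactly at a tab
    stop can be replaced by a single tab, saving (run length - 1) bytes.
    """
    line = line.rstrip()          # trailing whitespace is ignored
    leading = embedded = 0
    # Only cells that the line fully spans can end in a tab.
    for end in range(stop, len(line) + 1, stop):
        cell = line[end - stop:end]
        blanks = len(cell) - len(cell.rstrip(" "))
        if blanks >= 2:           # a lone blank saves nothing as a tab
            if line[:end].strip(" ") == "":
                leading += blanks - 1    # everything so far is blank
            else:
                embedded += blanks - 1
    return leading, embedded
```

For a line indented by 4 spaces, tab_savings("    x", 4) gives (3, 0) and
tab_savings("    x", 8) gives (0, 0), matching the "saves 3 bytes" versus
"no saving" case described above.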

The bytes saved by using tabs, compared to the (detabbed) original size,
are not chump change: for rsort.c about 8%, 12% and 11% with 2, 4 and 8
column tabstops respectively. On ordinary text, savings are totally
unimpressive, usually 0. Your savings may vary. I think the horse is now
officially deceased. -- jpl

===

#!/usr/bin/perl -w

use strict;
my @Tab_stops = ( 2, 4, 8 );

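# For one detabbed line, return the bytes saved (leading, embedded) if
# runs of blanks ending at a multiple of $stop_at were replaced by tabs.
sub check_stop {
    my ($line, $stop_at) = @_;
    my $pos = length($line);
    my ($leading, $embedded) = (0,0);

    while ($pos >= $stop_at) {
        $pos -= ($pos % $stop_at);      # Get to previous tab stop
        my $blanks = 0;
        # Count the run of blanks that ends just before this tab stop
        while ((--$pos >= 0) && (substr($line, $pos, 1) eq ' ')) {
            ++$blanks;
        }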
        if ($blanks > 1) {
            my $full = int($blanks/$stop_at);
            my $partial = $blanks - $full * $stop_at;
            my $savings = (--$partial > 0) ? $partial : 0;
            $savings += $full * ($stop_at - 1);
            if ($pos < 0) {
                $leading += $savings;
            } else {
                $embedded += $savings;
            }
        }
    }
    return ($leading, $embedded);
}

sub dofile {
    my $file = shift;
    my $command = "col -x < $file";
    my $notabsfh;
    unless (open($notabsfh, "-|", $command)) {
        print STDERR "Open failed on '$command': $!\n";
        return;
    }
    my $size = 0;
    my ($leading, $embedded) = (0,0);
    my @savings;
    for (my $i = 0; $i < @Tab_stops; ++$i) { $savings[$i] = [0,0]; }
    while (my $line = <$notabsfh>) {
        my $n = length($line);
        $size += $n;
        $line =~ s/(\s*)$//;
        for (my $i = 0; $i < @Tab_stops; ++$i) {
            my @l_e = check_stop($line, $Tab_stops[$i]);
            for (my $j = 0; $j < @l_e; ++$j) {
                $savings[$i][$j] += $l_e[$j];
            }
        }
    }
    print("$file, size $size\n");
    for (my $i = 0; $i < @Tab_stops; ++$i) {
        print("  $Tab_stops[$i]: ");
        my $l = $savings[$i][0];
        my $e = $savings[$i][1];
        my $t = $l + $e;
        print("Leading $l, Embedded $e, Total $t\n");
    }
    print("\n");
}

sub main {
    for my $file (@ARGV) {
        dofile($file);
    }
}

main();

From: steffen @ 2021-03-11 21:18 UTC (permalink / raw)


John P. Linderman wrote in
 <CAC0cEp9GVsYbjYhsYk2Hjjj90FxYFAia2Luy_vg854NTrV3Hww at mail.gmail.com>:
 |The tab/detab horse was still twitching, so I decided to beat it a little
 |more.
 |
 |Doug's claim that tabs saving space was an urban legend didn't ring true,
 |but betting again Doug is a good way to get poor quick. So I tossed
 |together a perl script (version run through col -x is at the end of this
 |note) to measure savings. A simpler script just counted tabs,
 |distinguishing leading tabs, which I expected to be very common, from
 |embedded tabs, which I expected to be rare. In retrospect, embedded tabs
 |are common in (my) C code, separating structure types from the element
 |names and trailing comments. As Norman pointed out, genuine tabs often
 |preserve line to line alignment in the presence of small changes. So the
 |fancier script distinguishes between leading tabs and embedded tabs for
 |various possible tab stops. Small tab stops keep heavily indented code
 |lines short, large tab stops can save more space when tabbing past leading
 |blanks. My coding style uses "set-width" of 4, which vi turns into spaces
 |or tabs, with "standard" tabs every 8 columns. My code therefore benefits
 |most with tabstops every 4 columns. A lot of code is indented 4 spaces,
 |which saves 3 bytes when replaced by a tab, but there is no saving with
 |tabstops at 8. Here's the output when run on itself (before it was
 |detabbed) and on a largish C program:
 |
 |  /home/jpl/bin/tabsave.pl /home/jpl/bin/tabsave.pl rsort.c
 |/home/jpl/bin/tabsave.pl, size 1876
 |  2: Leading 202, Embedded 3, Total 205
 |  4: Leading 303, Embedded 4, Total 307
 |  8: Leading 238, Embedded 5, Total 243
 |
 |rsort.c, size 209597
 |  2: Leading 13186, Embedded 4219, Total 17405
 |  4: Leading 19776, Embedded 5990, Total 25766
 |  8: Leading 16506, Embedded 6800, Total 23306
 |
 |The bytes saved by using tabs compared to the (detabbed) original size are
 |not chump change, with 2, 4 or 8 column tabstops. On ordinary text, savings
 |are totally unimpressive, usually 0. Your savings may vary. I think the
 |horse is now officially deceased. -- jpl

Not really.  I mean, I do not insist on this, but I looked at the
numbers.  And despite col(1) 2.36.2 reporting the wrong line number
when it fails to digest a LATIN1 character in UTF-8 (should be 7,
gave 11), when I sum up the totals for 8: over an old project with
tests, documentation etc., the result here is 1044401.  This is
without generated data.  I mean, even today I strip whitespace from
shipped code, from generated data, and from documentation that is
run through processors.  This includes mangling internal interface
headers and removing their documentation and comments, except for
the copyright.  You know, this automated pre-release step alone, for
the pretty small open source MUA I maintain, causes this line count
change:
  37 files changed, 2441 insertions(+), 10537 deletions(-)
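
Summing the "8:" totals over many files, as mentioned above, could be done
by filtering the script's report.  A minimal sketch in Python, assuming the
report format shown in the original message; the names REPORT_LINE and
sum_totals are hypothetical and not part of the script:

```python
import re

# Matches report lines such as "  8: Leading 238, Embedded 5, Total 243".
REPORT_LINE = re.compile(
    r"^\s*(\d+): Leading (\d+), Embedded (\d+), Total (\d+)\s*$"
)

def sum_totals(report_lines, stop=8):
    """Add up the 'Total' column of every report line for one tab stop."""
    total = 0
    for line in report_lines:
        m = REPORT_LINE.match(line)
        if m and int(m.group(1)) == stop:
            total += int(m.group(4))
    return total
```

Any iterable of lines works as input, for example an open file holding the
saved report.  Lines that do not look like a report row (file names, blank
lines) are skipped.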

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



From: jpl.jpl @ 2021-03-11 23:31 UTC (permalink / raw)


On Thu, Mar 11, 2021 at 4:18 PM Steffen Nurpmeso <steffen at sdaoden.eu> wrote:

> John P. Linderman wrote in
>  <CAC0cEp9GVsYbjYhsYk2Hjjj90FxYFAia2Luy_vg854NTrV3Hww at mail.gmail.com>:
>  |The tab/detab horse was still twitching, so I decided to beat it a little
>  |more.
>  |
>  |Doug's claim that tabs saving space was an urban legend didn't ring true,
>  |but betting again Doug is a good way to get poor quick.
>


> Not really.  I mean, i do not insist of this, but i looked at the
> numbers.  And despite col(1) 2.36.2 giving the wrong line when
> failing to dig a LATIN1 in UTF-8 (should be 7, gave 11), when
> i sum up the total of 8: in an old project with tests,
> documentation etc. here the output is 1044401.  This is without
> generated data.
>

I'm not certain what you are referring to by "Not Really". But there is a
general issue about the ability of historical commands (like "ed") to
properly handle Unicode. I would expect that many early commands do very
poorly. -- jpl

From: steffen @ 2021-03-12  0:31 UTC (permalink / raw)



John P. Linderman wrote in
 <CAC0cEp93762jEt4BAed-tQ7-vi1YGVEket=B8Z-F-3finWUHbw at mail.gmail.com>:
 |On Thu, Mar 11, 2021 at 4:18 PM Steffen Nurpmeso <steffen at sdaoden.eu> \
 |wrote:
 |> John P. Linderman wrote in
 |>  <CAC0cEp9GVsYbjYhsYk2Hjjj90FxYFAia2Luy_vg854NTrV3Hww at mail.gmail.com>:
 |>|The tab/detab horse was still twitching, so I decided to beat it a little
 |>|more.
 |>|
 |>|Doug's claim that tabs saving space was an urban legend didn't ring true,
 |>|but betting again Doug is a good way to get poor quick.
 |
 |> Not really.  I mean, i do not insist of this, but i looked at the
 |> numbers.  And despite col(1) 2.36.2 giving the wrong line when
 |> failing to dig a LATIN1 in UTF-8 (should be 7, gave 11), when
 |> i sum up the total of 8: in an old project with tests,
 |> documentation etc. here the output is 1044401.  This is without
 |> generated data.
 |
 |I'm not certain what you are referring to by "Not Really". But there is a

I am sorry.  It must have been my bad English and a misunderstanding,
especially about the dead horse that was mentioned in your original
message.

 |general issue about the ability of historical commands (like "ed") to
 |properly handle unicode. I would expect that many early commands do very
 |poorly. -- jpl

Yes.  Of course.  It is one of those days where whatever I do,
I stumble over errors in my own as well as in other people's
software.  Then this script of yours was a nice thing to try (I never
actually did that test), but the find(1) I ran spat out masses of
errors, and it took quite a few attempts to get it done.
I apologise.  Just one of those days.

  $ echo Müßig | iconv -f utf8 -t latin1 | LC_ALL=C col -x
  col: failed on line 1: Invalid or incomplete multibyte or wide character

This should not happen.  (Now also clarified by the POSIX standard.)

  #?1|kent:tmp$ cat <<_EOT | iconv -f utf8 -tlatin1 | LC_ALL=C col -x
  one
  two
  three
  four
  five
  älter
  _EOT
  col: failed on line 11: Invalid or incomplete multibyte or wide character
  one
  two
  three
  four
  five

No fun for me today.

--steffen

