The Unix Heritage Society mailing list
* [TUHS] another conversion of the CSRG BSD SCCS archives to Git
@ 2019-11-16  2:51 Greg A. Woods
  2019-11-29 21:52 ` Steffen Nurpmeso
  0 siblings, 1 reply; 4+ messages in thread
From: Greg A. Woods @ 2019-11-16  2:51 UTC (permalink / raw)
  To: The Unix Heritage Society mailing list

I've been fixing and enhancing James Youngman's git-sccsimport to use
with some of my SCCS archives, and I thought it might be the ultimate
stress test of it to convert the CSRG BSD SCCS archives.

The conversion takes about an hour to run on my old-ish Dell server.

This conversion is unlike others -- there is some mechanical compression
of related deltas into a single Git commit.

https://github.com/robohack/ucb-csrg-bsd

https://github.com/robohack/git-sccsimport

--
					Greg A. Woods <gwoods@acm.org>

Kelowna, BC     +1 250 762-7675           RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>     Avoncote Farms <woods@avoncote.ca>

* Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
  2019-11-16  2:51 [TUHS] another conversion of the CSRG BSD SCCS archives to Git Greg A. Woods
@ 2019-11-29 21:52 ` Steffen Nurpmeso
  2019-12-01  1:25   ` Greg A. Woods
  0 siblings, 1 reply; 4+ messages in thread
From: Steffen Nurpmeso @ 2019-11-29 21:52 UTC (permalink / raw)
  To: The Unix Heritage Society mailing list

Greg A. Woods wrote in <m1iVoBV-0036tPC@more.local>:
 |I've been fixing and enhancing James Youngman's git-sccsimport to use
 |with some of my SCCS archives, and I thought it might be the ultimate
 |stress test of it to convert the CSRG BSD SCCS archives.
 |
 |The conversion takes about an hour to run on my old-ish Dell server.
 |
 |This conversion is unlike others -- there is some mechanical compression
 |of related deltas into a single Git commit.
 |
 |https://github.com/robohack/ucb-csrg-bsd

Thanks for taking the time to produce a CSRG repo that seems to
mimic changesets as they really happened.  As i never made it
there on my own, i switched to yours some weeks ago.  (Mind
you, after doing "gc --aggressive --prune=all" the repository size
has more than halved; that was the final reason to prepare new
repositories on a vhost with good internet connection before
getting this through my flaky wifi here.  Storage and internet
bandwidth and their cost really do not seem to bother anyone
anymore.  I mean no offense, i only noticed it (the
hard way).)

 |https://github.com/robohack/git-sccsimport
 |
 |--
 |     Greg A. Woods <gwoods@acm.org>
 |
 |Kelowna, BC     +1 250 762-7675           RoboHack <woods@robohack.ca>
 |Planix, Inc. <woods@planix.com>     Avoncote Farms <woods@avoncote.ca>
 --End of <m1iVoBV-0036tPC@more.local>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

* Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
  2019-11-29 21:52 ` Steffen Nurpmeso
@ 2019-12-01  1:25   ` Greg A. Woods
  2019-12-02 18:36     ` Steffen Nurpmeso
  0 siblings, 1 reply; 4+ messages in thread
From: Greg A. Woods @ 2019-12-01  1:25 UTC (permalink / raw)
  To: The Unix Heritage Society mailing list

At Fri, 29 Nov 2019 22:52:58 +0100, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
Subject: Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
>
> Greg A. Woods wrote in <m1iVoBV-0036tPC@more.local>:
>  |I've been fixing and enhancing James Youngman's git-sccsimport to use
>  |with some of my SCCS archives, and I thought it might be the ultimate
>  |stress test of it to convert the CSRG BSD SCCS archives.
>  |
>  |The conversion takes about an hour to run on my old-ish Dell server.
>  |
>  |This conversion is unlike others -- there is some mechanical compression
>  |of related deltas into a single Git commit.
>  |
>  |https://github.com/robohack/ucb-csrg-bsd
>
> Thanks for taking the time to produce a CSRG repo that seems to
> mimic changesets as they really happened.  As i never made it
> there on my own, i switched to yours some weeks ago.  (Mind
> you, after doing "gc --aggressive --prune=all" the repository size
> has more than halved; that was the final reason to prepare new
> repositories on a vhost with good internet connection before
> getting this through my flaky wifi here.  Storage and internet
> bandwidth and their cost really do not seem to bother anyone
> anymore.  I mean no offense, i only noticed it (the
> hard way).)

Ah!  I did indeed forget the "git gc" step that many conversion guides
recommend.  I might change the import script to do that automatically,
particularly if it has also initialised the repository in the same run.

Apparently github themselves run it regularly:

	https://stackoverflow.com/a/56020315/816536

Probably they do this by configuring "gc.auto" in each repository,
though I've not found any reference to what they might configure it to.
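
For reference, a minimal sketch of how "gc.auto" can be inspected and set
in a single repository.  The scratch repository and the value 1000 are
assumptions chosen for illustration only; nothing here says what github
actually configures.

```shell
#!/bin/sh
# Sketch only: inspect and set gc.auto in a hypothetical scratch repository.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# gc.auto is the loose-object threshold that triggers "git gc --auto";
# when it is unset, git falls back to its built-in default (6700).
git config --get gc.auto || echo "gc.auto is unset, built-in default applies"

# Illustrative value only: make automatic gc kick in sooner.
git config gc.auto 1000
gc_auto=$(git config --get gc.auto)
echo "gc.auto=$gc_auto"

# Setting it to 0 disables automatic gc altogether.
git config gc.auto 0
gc_auto_disabled=$(git config --get gc.auto)
echo "gc.auto=$gc_auto_disabled"
```

Note that automatic gc never passes "--aggressive" on its own, so it
would not explain a reduction like the one described next.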

However it seems that without the "--aggressive" option, nothing will be
done in this repository.  With it though I go from 316M down to just 71M.
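
The before/after comparison can be reproduced roughly as follows.  This
is only a sketch on a tiny throw-away history (the file name, commit
messages and demo identity are made up), so the numbers will be nowhere
near the 316M and 71M of the real conversion.

```shell
#!/bin/sh
# Sketch only: compare on-disk size before and after an aggressive repack.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Build a small hypothetical history: one file growing over 20 commits.
i=1
while [ "$i" -le 20 ]; do
    echo "line $i" >> file.txt
    git add file.txt
    git -c user.name=demo -c user.email=demo@example.org \
        commit -q -m "delta $i"
    i=$((i + 1))
done

echo "before:"
git count-objects -vH

# The command from the thread: recompute deltas, drop unreachable objects.
git gc --quiet --aggressive --prune=all

echo "after:"
git count-objects -vH

# After gc the reachable objects should live in a pack, not as loose files.
loose=$(git count-objects -v | sed -n 's/^count: //p')
packs=$(git count-objects -v | sed -n 's/^packs: //p')
echo "loose=$loose packs=$packs"
```

In the output, "size-pack" is the figure to compare against the sizes
quoted above.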

I don't see any way to force/tell/ask github to run "git gc --aggressive".

Perhaps I can just delete it from github and immediately re-create it
with the re-packed repository, and in theory all the hashes should stay
the same and any existing clones should be unaffected.  What do you think?
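
The reasoning behind "all the hashes should stay the same" is that an
object name is a hash of the object's content, not of how it is stored
on disk.  A minimal sketch to check this locally, using a hypothetical
one-commit scratch repository:

```shell
#!/bin/sh
# Sketch only: object names are unaffected by repacking.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo "hello" > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.org commit -q -m "initial"

# Record commit and tree names before repacking.
head_before=$(git rev-parse HEAD)
tree_before=$(git rev-parse 'HEAD^{tree}')

git gc --quiet --aggressive --prune=all

# Read them again from the repacked object store.
head_after=$(git rev-parse HEAD)
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "commit: $head_before -> $head_after"
echo "tree:   $tree_before -> $tree_after"

# Integrity check over the repacked object store; aborts on failure.
git fsck --full
```

Whether github keeps the smaller packs after a delete and re-push is a
separate question that this sketch does not answer.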

Note I have some thoughts of re-doing the whole conversion anyway, with
more ideas on how to deal with "removed" files (SCCS files renamed
to the likes of "S.foo") and also including the many files that were
never checked into SCCS, perhaps even on a per-release basis, thus being
able to create release tags that can be checked out to match the actual
releases on the CDs.  But this will not happen quite so soon.

--
					Greg A. Woods <gwoods@acm.org>

Kelowna, BC     +1 250 762-7675           RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>     Avoncote Farms <woods@avoncote.ca>

* Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
  2019-12-01  1:25   ` Greg A. Woods
@ 2019-12-02 18:36     ` Steffen Nurpmeso
  0 siblings, 0 replies; 4+ messages in thread
From: Steffen Nurpmeso @ 2019-12-02 18:36 UTC (permalink / raw)
  To: The Unix Heritage Society mailing list

Hello.

Please excuse the late reply.

Greg A. Woods wrote in <m1ibDzG-0036tPC@more.local>:
 |At Fri, 29 Nov 2019 22:52:58 +0100, Steffen Nurpmeso <steffen@sdaoden.eu> \
 |wrote:
 |Subject: Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
 |> Greg A. Woods wrote in <m1iVoBV-0036tPC@more.local>:
 |>|I've been fixing and enhancing James Youngman's git-sccsimport to use
 |>|with some of my SCCS archives, and I thought it might be the ultimate
 |>|stress test of it to convert the CSRG BSD SCCS archives.
 |>|
 |>|The conversion takes about an hour to run on my old-ish Dell server.
 |>|
 |>|This conversion is unlike others -- there is some mechanical compression
 |>|of related deltas into a single Git commit.
 |>|
 |>|https://github.com/robohack/ucb-csrg-bsd
 |>
 |> Thanks for taking the time to produce a CSRG repo that seems to
 |> mimic changesets as they really happened.  As i never made it
 |> there on my own, i switched to yours some weeks ago.  (Mind
 |> you, after doing "gc --aggressive --prune=all" the repository size
 ...
 |Ah!  I did indeed forget the "git gc" step that many conversion guides
 |recommend.  I might change the import script to do that automatically,
 |particularly if it has also initialised the repository in the same run.
 |
 |Apparently github themselves run it regularly:
 |
 | https://stackoverflow.com/a/56020315/816536
 |
 |Probably they do this by configuring "gc.auto" in each repository,
 |though I've not found any reference to what they might configure it to.

I do not know either, but i have the impression they work with
individual repositories, possibly doing deduplication on the
filesystem level, if at all.  (Some repositories shrink notably,
while others do not.  And i say that because i think bitbucket,
once they added git support, seemed to have used common storage
for the individual git objects, at least i remember a post
pointing to some git object <-> python <-> mercurial library
layer.)

 |However it seems that without the "--aggressive" option, nothing will be
 |done in this repository.  With it though I go from 316M down to just 71M.

It throws away intermediate full data and keeps only the deltas.
It can also throw away reflog info (which i never have used).
I always use it.  Now with my new machine i can even use it for
the BSD repositories etc., whereas before each update added its
own pack, and the normal gc only combined the packs, but did not
resolve the intermediate deltas.  (Note however i have learned git
almost a decade ago, and have not reread the documentation or
technical papers ever since, let alone in full.)

 |I don't see any way to force/tell/ask github to run "git gc --aggressive".

A very compute-intensive task.  Back when i was subscribed to the
git ML around 2011 i witnessed Hamano asking and Jeff King
(from Github by then) responding something like "[it is ok but] gc
aggressive is a pain".  They must have changed the algorithm since
then; it now works much more in main memory and requires much more
of it, too, not truly honouring the provided pack.windowMemory /
pack.threads options (when i last tried).  It has no recovery path
either: for example my old machine could not garbage collect the
FreeBSD repository, i even let it work almost overnight (5+
hours), and it did not make it, whereas my new one can do it in
a few minutes, despite the CPUs not being that much faster; it is
only about the memory (8GB instead of 2GB).
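
For completeness, a sketch of how the options named above are set per
repository before a repack.  The values are illustrative assumptions,
not tuned recommendations, and pack.deltaCacheSize is an additional
knob not mentioned above; as said, they may not bound memory use as
tightly as one would hope.

```shell
#!/bin/sh
# Sketch only: cap the resources "git gc" may use in one repository.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Values are illustrative assumptions, not tuned recommendations.
git config pack.threads 1           # a single delta-search thread
git config pack.windowMemory 256m   # per-thread cap on delta window memory
git config pack.deltaCacheSize 64m  # cap on the cache of computed deltas

window_memory=$(git config --get pack.windowMemory)
threads=$(git config --get pack.threads)
echo "pack.windowMemory=$window_memory pack.threads=$threads"

# An empty repository repacks trivially; this only shows the invocation.
git gc --quiet --aggressive --prune=all
```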

I sometimes think about the fact that a lot of software seems to
lose its capability to run in restricted environments.  Providing
alternative runtime solutions is coding etc. intensive, of course,
and in the way of a rapid development process, too.

 |Perhaps I can just delete it from github and immediately re-create it
 |with the re-packed repository, and in theory all the hashes should stay
 |the same and any existing clones should be unaffected.  What do you think?

From the technical point of view i think this should simply work.  But
there is no need to delete the repository, simply deleting the branch
should be enough.  (Or fooling around with the update-ref that i often
use, as in "update-ref newmaster master" "checkout newmaster",
"branch -D master" (or "update-ref -d master"), then pushing, then
re-renaming newmaster to master, and pushing again, etc.)

Would be interesting to know how github does deduplication.  The
real great ones of Bell Labs/Plan9 developed this venti / fossil
storage with the blockhash-based permanent storage, and despite
all the multimedia the curve of new allocation flattened after
some time.  I would assume github would benefit dramatically from
deduplication.

 |Note I have some thoughts of re-doing the whole conversion anyway, with
 |more ideas on how to deal with "removed" files (SCCS files renamed
 |to the likes of "S.foo") and also including the many files that were
 |never checked into SCCS, perhaps even on a per-release basis, thus being
 |able to create release tags that can be checked out to match the actual
 |releases on the CDs.  But this will not happen quite so soon.

That would be nice; having the real changesets is a real
improvement already, however!  And even Spinellis' Unix history
repository seems not to be perfect even after years, i heard on
some FreeBSD mailing list.
Ciao,

 |     Greg A. Woods <gwoods@acm.org>
 |
 |Kelowna, BC     +1 250 762-7675           RoboHack <woods@robohack.ca>
 |Planix, Inc. <woods@planix.com>     Avoncote Farms <woods@avoncote.ca>
 --End of <m1ibDzG-0036tPC@more.local>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
