* [TUHS] another conversion of the CSRG BSD SCCS archives to Git
From: Greg A. Woods @ 2019-11-16 2:51 UTC
To: The Unix Heritage Society mailing list

I've been fixing and enhancing James Youngman's git-sccsimport to use
with some of my SCCS archives, and I thought it might be the ultimate
stress test of it to convert the CSRG BSD SCCS archives.

The conversion takes about an hour to run on my old-ish Dell server.

This conversion is unlike others -- there is some mechanical compression
of related deltas into a single Git commit.

https://github.com/robohack/ucb-csrg-bsd

https://github.com/robohack/git-sccsimport

--
Greg A. Woods <gwoods@acm.org>

Kelowna, BC +1 250 762-7675 RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Avoncote Farms <woods@avoncote.ca>

* Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
From: Steffen Nurpmeso @ 2019-11-29 21:52 UTC
To: The Unix Heritage Society mailing list

Greg A. Woods wrote in <m1iVoBV-0036tPC@more.local>:
|I've been fixing and enhancing James Youngman's git-sccsimport to use
|with some of my SCCS archives, and I thought it might be the ultimate
|stress test of it to convert the CSRG BSD SCCS archives.
|
|The conversion takes about an hour to run on my old-ish Dell server.
|
|This conversion is unlike others -- there is some mechanical compression
|of related deltas into a single Git commit.
|
|https://github.com/robohack/ucb-csrg-bsd

Thanks for taking the time to produce a CSRG repo that seems to
mimic changesets as they really happened. As i never made it
there on my own, i have switched to yours some weeks ago. (Mind
you, after doing "gc --aggressive --prune=all" the repository size
has more than halved, it was the final reason to prepare new
repositories on a vhost with good internet connection before
getting this through my flaky wifi here. Storage and internet
bandwidth and their cost really do not seem to bother anyone
anymore. I have no offense in mind, i only recognized it (the
hard way).)

|https://github.com/robohack/git-sccsimport
|
|--
| Greg A. Woods <gwoods@acm.org>
|
|Kelowna, BC +1 250 762-7675 RoboHack <woods@robohack.ca>
|Planix, Inc. <woods@planix.com> Avoncote Farms <woods@avoncote.ca>
--End of <m1iVoBV-0036tPC@more.local>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

* Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
From: Greg A. Woods @ 2019-12-01 1:25 UTC
To: The Unix Heritage Society mailing list

At Fri, 29 Nov 2019 22:52:58 +0100, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
Subject: Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
>
> Greg A. Woods wrote in <m1iVoBV-0036tPC@more.local>:
> |I've been fixing and enhancing James Youngman's git-sccsimport to use
> |with some of my SCCS archives, and I thought it might be the ultimate
> |stress test of it to convert the CSRG BSD SCCS archives.
> |
> |The conversion takes about an hour to run on my old-ish Dell server.
> |
> |This conversion is unlike others -- there is some mechanical compression
> |of related deltas into a single Git commit.
> |
> |https://github.com/robohack/ucb-csrg-bsd
>
> Thanks for taking the time to produce a CSRG repo that seems to
> mimic changesets as they really happened. As i never made it
> there on my own, i have switched to yours some weeks ago. (Mind
> you, after doing "gc --aggressive --prune=all" the repository size
> has more than halved, it was the final reason to prepare new
> repositories on a vhost with good internet connection before
> getting this through my flaky wifi here. Storage and internet
> bandwidth and their cost really do not seem to bother anyone
> anymore. I have no offense in mind, i only recognized it (the
> hard way).)

Ah! I did indeed forget the "git gc" step that many conversion guides
recommend. I might change the import script to do that automatically,
particularly if it has also initialised the repository in the same run.
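
For readers who want to try the post-conversion "git gc" step described above, here is a minimal Python sketch. It is not part of git-sccsimport; the function names are hypothetical, and the "--aggressive --prune=all" flags are simply the ones quoted earlier in the thread. It relies on "git count-objects -v", which reports "size-pack" in KiB.

```python
import subprocess


def gc_command(repo, aggressive=True):
    """Build the 'git gc' invocation quoted in the thread (hypothetical helper)."""
    cmd = ["git", "-C", repo, "gc"]
    if aggressive:
        cmd.append("--aggressive")
    cmd.append("--prune=all")
    return cmd


def parse_count_objects(text):
    """Parse 'git count-objects -v' output into a {key: int} dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = int(value)
    return stats


def pack_size_kib(repo):
    """Return the on-disk size of the repository's packs, in KiB."""
    out = subprocess.run(
        ["git", "-C", repo, "count-objects", "-v"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_count_objects(out)["size-pack"]


def repack(repo):
    """Run the gc step and return (pack size before, pack size after) in KiB."""
    before = pack_size_kib(repo)
    subprocess.run(gc_command(repo), check=True)
    return before, pack_size_kib(repo)
```

Calling repack() on a freshly converted repository would run the gc step once and report the pack size before and after, which is an easy way to see whether the step is worth adding to an import script.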

Apparently github themselves run it regularly:

    https://stackoverflow.com/a/56020315/816536

Probably they do this by configuring "gc.auto" in each repository,
though I've not found any reference to what they might configure it to.

However it seems that without the "--aggressive" option, nothing will be
done in this repository. With it though I go from 316M down to just 71M.

I don't see any way to force/tell/ask github to run "git gc --aggressive".

Perhaps I can just delete it from github and immediately re-create it
with the re-packed repository, and in theory all the hashes should stay
the same and any existing clones should be unaffected. What do you think?

Note I have some thoughts of re-doing the whole conversion anyway,
with more ideas on dealing with "removed" files (SCCS files renamed
to the likes of "S.foo") and also including the many files that were
never checked into SCCS, perhaps even on a per-release basis, thus being
able to create release tags that can be checked out to match the actual
releases on the CDs. But this will not happen quite so soon.

--
Greg A. Woods <gwoods@acm.org>

Kelowna, BC +1 250 762-7675 RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Avoncote Farms <woods@avoncote.ca>

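On the question above of whether hashes survive a repack: Git names an object by hashing its type, length, and content, so how the object is stored (loose, or as a delta inside a pack) does not enter into the name. The sketch below illustrates this for blobs only, assuming the default SHA-1 object format; the function name is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content.

    The ID is the SHA-1 of the header "blob <length>\\0" followed by the
    raw bytes, so it depends only on the content, not on pack layout.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The well-known ID of the empty blob.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Commits and trees are named the same way from their own content, which is why repacking alone should leave every commit hash, and therefore existing clones, untouched.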
* Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
From: Steffen Nurpmeso @ 2019-12-02 18:36 UTC
To: The Unix Heritage Society mailing list

Hello.

Please excuse the late reply.

Greg A. Woods wrote in <m1ibDzG-0036tPC@more.local>:
|At Fri, 29 Nov 2019 22:52:58 +0100, Steffen Nurpmeso <steffen@sdaoden.eu> \
|wrote:
|Subject: Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
|> Greg A. Woods wrote in <m1iVoBV-0036tPC@more.local>:
|>|I've been fixing and enhancing James Youngman's git-sccsimport to use
|>|with some of my SCCS archives, and I thought it might be the ultimate
|>|stress test of it to convert the CSRG BSD SCCS archives.
|>|
|>|The conversion takes about an hour to run on my old-ish Dell server.
|>|
|>|This conversion is unlike others -- there is some mechanical compression
|>|of related deltas into a single Git commit.
|>|
|>|https://github.com/robohack/ucb-csrg-bsd
|>
|> Thanks for taking the time to produce a CSRG repo that seems to
|> mimic changesets as they really happened. As i never made it
|> there on my own, i have switched to yours some weeks ago. (Mind
|> you, after doing "gc --aggressive --prune=all" the repository size
 ...
|Ah! I did indeed forget the "git gc" step that many conversion guides
|recommend. I might change the import script to do that automatically,
|particularly if it has also initialised the repository in the same run.
|
|Apparently github themselves run it regularly:
|
| https://stackoverflow.com/a/56020315/816536
|
|Probably they do this by configuring "gc.auto" in each repository,
|though I've not found any reference to what they might configure it to.

I do not know either, but i have the impression they work with
individual repositories, possibly doing deduplication on the
filesystem level, if at all. (Some repositories shrink notably,
while others do not.
And i say that because i think bitbucket, once they added git
support, seemed to have used common storage for the individual git
objects, at least i remember a post pointing to some git object
<-> python <-> mercurial library layer.)

|However it seems that without the "--aggressive" option, nothing will be
|done in this repository. With it though I go from 316M down to just 71M.

It throws away intermediate full data and keeps only the deltas.
It can also throw away reflog info (which i never have used).
I always use it. Now with my new machine i can even use it for
the BSD repositories etc., whereas before each update added its
own pack, and the normal gc only combined the packs, but did not
resolve the intermediate deltas. (Note however i have learned git
almost a decade ago, and have not reread the documentation or
technical papers ever since, let alone in full.)

|I don't see any way to force/tell/ask github to run "git gc --aggressive".

A very compute-intensive task. Back when i was subscribed to the
git ML around 2011 i witnessed Hamano asking and Jeff King (from
Github by then) responding something like "[it is ok but] gc
aggressive is a pain".

They must have changed the algorithm since then, now going much
more over main memory and requiring much more of it, too, not
truly honouring the provided pack.windowMemory / pack.threads
options (when i last tried). It has no recovery path either: for
example my old machine could not garbage collect the FreeBSD
repository, i even let it work almost over night (5+ hours), and
it did not make it, whereas my new one can do it in a few minutes,
despite the CPUs not being that much faster; it is only about the
memory (8GB instead of 2GB).

I sometimes think about the fact that a lot of software seems to
lose its capability to run in restricted environments. Providing
alternative runtime solutions is coding etc. intensive, of course,
and in the way of a rapid development process, too.
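
For anyone wanting to repeat the experiment with pack.windowMemory and pack.threads mentioned above, a minimal sketch of passing them for a single run via git's "-c name=value" option follows. The helper name and the limit values are illustrative assumptions, and the sketch says nothing about whether git actually stays within them, which is exactly what is doubted above.

```python
def constrained_gc_command(repo, window_memory="256m", threads=1):
    """Build a 'git gc' command line with per-invocation pack limits.

    Hypothetical helper: the default limits are placeholders, not
    recommendations, and are only applied for this one invocation.
    """
    return [
        "git", "-C", repo,
        "-c", f"pack.windowMemory={window_memory}",
        "-c", f"pack.threads={threads}",
        "gc", "--aggressive", "--prune=all",
    ]
```

The resulting list can be handed to subprocess.run(..., check=True); watching the process's memory use while it runs would show whether the limits have any effect on a given git version.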

|Perhaps I can just delete it from github and immediately re-create it
|with the re-packed repository, and in theory all the hashes should stay
|the same and any existing clones should be unaffected. What do you think?

From the technical point i think this should simply work. But
no need to delete the repository, simply deleting the branch
should be enough. (Or fooling around with the update-ref that
i often use, as in "update-ref newmaster master" "checkout
newmaster", "branch -D master" (or "update-ref -d master"), then
pushing, then re-renaming newmaster to master, and pushing again,
etc.)

Would be interesting to know how github does deduplication.
The real great ones of Bell Labs/Plan9 developed this venti /
fossil storage with the blockhash-based permanent storage, and
despite all the multimedia the curve of new allocation flattened
after some time. I would assume github would benefit dramatically
from deduplication.

|Note I have some thoughts of re-doing the whole conversion anyway,
|with more ideas on dealing with "removed" files (SCCS files renamed
|to the likes of "S.foo") and also including the many files that were
|never checked into SCCS, perhaps even on a per-release basis, thus being
|able to create release tags that can be checked out to match the actual
|releases on the CDs. But this will not happen quite so soon.

That would be nice; having the real changesets is a real
improvement already and however! And even Spinellis' Unix history
repository seems not to be perfect even after years, i heard on
some FreeBSD ML list.

Ciao,

| Greg A. Woods <gwoods@acm.org>
|
|Kelowna, BC +1 250 762-7675 RoboHack <woods@robohack.ca>
|Planix, Inc. <woods@planix.com> Avoncote Farms <woods@avoncote.ca>
--End of <m1ibDzG-0036tPC@more.local>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)