From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Further dynamic linker optimizations
Date: Tue, 30 Jun 2015 16:04:54 -0400
Message-ID: <20150630200454.GA28127@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Discussion on #musl with Timo Teräs has produced the following
results:

- Moving bloom filter size to struct dso gives 5% improvement in clang
  (built as 110 .so's) start time, simply because of a reduction of
  number of instructions in the hot path. So I think we should apply
  that patch.

- The whole outer for loop in find_sym is the hot path for performance.
  As such, eliminating the lazy calculation of gnu_hash and simply
  doing it before the loop should be a measurable win, just by
  removing the if (!ghm) branch.

- Even the check if (!dso->global) continue; has nontrivial cost.
  Since I want to replace this representation with a separate
  linked-list chain for global dsos anyway (for other reasons) I think
  that's worth prioritizing for performance too.

- We still don't save and reuse the last symbol lookup in do_relocs.
  Doing so could improve performance a lot when the same symbol is
  referenced multiple times from global data. When the only references
  are the GOT (thus only one per symbol), it's not going to help, but
  since it's outside the find_sym dso loop, it should not have
  measurable cost anyway.

- String comparison (dl_strcmp) is costly, but nontrivial to optimize.
  Word-at-a-time optimizations have issues with crossing pages, even
  on archs that don't require aligned access. Probably the right way
  forward here is to get an optimized general strcmp, then add a
  mechanism (function pointer in struct dso? or global?) for the
  dynamic linker to call dl_strcmp when relocating itself but the real
  strcmp later.

- The strength-reduction of remainder operations does not seem to
  provide worthwhile benefits yet, simply because so little of the
  overall time is spent on the division/remainder.

Rich
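
For illustration only, here is a rough sketch of three of the ideas
above: computing the GNU hash once before the dso loop, walking a
chain that contains only global dsos, and reusing the last lookup in
the relocation loop. This is not the actual dynlink.c code. Every
toy_* name, the struct layouts, and the per-symbol stored hash are
hypothetical simplifications; only the hash function itself is the
standard GNU symbol hash.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not musl's real structures. */
struct toy_sym {
    const char *name;
    uint32_t hash;       /* precomputed gnu_hash(name) */
    uintptr_t value;
};

struct toy_dso {
    struct toy_dso *next_global;  /* chain holds global dsos only */
    const struct toy_sym *syms;
    size_t nsyms;
};

struct toy_reloc {
    const char *sym_name;  /* pointer identity stands in for a symbol index */
    uintptr_t *where;
};

/* Standard GNU symbol hash: h = h*33 + c, seeded with 5381. */
static uint32_t gnu_hash(const char *s)
{
    uint32_t h = 5381;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        h = h * 33 + *p;
    return h;
}

static int toy_lookups;  /* counts full lookups, for demonstration only */

static const struct toy_sym *toy_find_sym(const struct toy_dso *head,
                                          const char *name)
{
    /* Hash computed eagerly, so the loop has no "not yet hashed" branch. */
    uint32_t gh = gnu_hash(name);
    toy_lookups++;

    /* No per-dso "is it global?" test: the chain only has global dsos. */
    for (const struct toy_dso *d = head; d; d = d->next_global) {
        for (size_t i = 0; i < d->nsyms; i++) {
            if (d->syms[i].hash == gh && !strcmp(d->syms[i].name, name))
                return &d->syms[i];
        }
    }
    return NULL;
}

static void toy_do_relocs(const struct toy_dso *head,
                          const struct toy_reloc *rel, size_t n)
{
    const char *last_name = NULL;
    const struct toy_sym *last_def = NULL;

    for (size_t i = 0; i < n; i++) {
        /* Reuse the previous result when the same symbol repeats. */
        if (rel[i].sym_name != last_name) {
            last_def = toy_find_sym(head, rel[i].sym_name);
            last_name = rel[i].sym_name;
        }
        if (last_def)
            *rel[i].where = last_def->value;
    }
}
```

The cache check sits outside the dso loop, so when each symbol is
referenced only once it costs a single pointer comparison per
relocation. A real linker would presumably compare symbol table
indices rather than name pointers, and would use the hash table and
bloom filter instead of the linear scan shown here.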