__thread fastpath not being taken

Chris Metcalf cmetcalf at tilera.com
Wed Oct 13 16:53:01 UTC 2010


 In my email quoted below, I found that __tls_get_addr() under the uClibc
version we were running with was always running the slow path.  We were
running an old version from the nptl branch, back at change 22990 on svn. 
Since then, I've ported glibc-2.11.2 to tile.  I looked at the performance
of __tls_get_addr() under glibc, and found that the fast path is correctly
taken in that environment.

I have not looked at what makes this difference, or whether more recent
uClibc's would take the fast path in our environment, but I figured I at
least owed it to the uclibc folks to close the circle on this earlier email.

On 12/9/2009 2:09 PM, Chris Metcalf wrote:
> The __tls_get_addr() routine is called by the compiler for __thread
> accesses (barring optimizations, IE mode, etc).  But I'm confused by how
> the "generation" code flow is supposed to work.
>
> The dlopen() and dlclose() routines, for example, both increment the
> global _dl_tls_generation counter after loading/unloading a shared
> object with TLS entries, and __tls_get_addr() responds to this by having
> each thread call into _dl_update_slotinfo(), passing the index for the
> particular module from which they are getting a __thread variable.  But
> if that module itself is older, i.e. its "gen" count is less than
> _dl_tls_generation, it may well also be less than the dtv[] generation
> counter (i.e. sufficiently up-to-date), and so _dl_update_slotinfo()
> short-cuts out and returns.  But this just means that the next time we
> reference a __thread variable from that module, we will have to take the
> slow path again, and again and again, until finally the thread
> references a __thread variable from a module with a newer generation
> stamp, at which point it will bump up its dtv[] generation count, and
> then no longer need to call into _dl_update_slotinfo().  This seems
> pretty bad.
>
> This is made much more obvious for just normal operation with no dlopen
> calls since the ld.so initialization bumps up _dl_tls_generation in
> _dl_get_ready_to_run(), since that's the first time we call init_tls()
> and therefore the was_tls_init_tp_called variable is false.  So we set
> _dl_tls_generation == 1, but all the actual slotinfo generation counts
> are zero, and we never stop using the slowpath on __tls_get_addr!  This
> is a pretty harsh performance penalty on "__thread" variables.
>
> My proposed fix is something like the following, in dl-tls.c.  (I'm
> running code patched up from the old SVN NPTL branch, but looking at the
> latest GIT NPTL it seems pretty similar here.)  Obviously a real patch
> would re-indent all the code contained within the scope for which I
> removed the "if" test.
>
> This code is all pretty subtle, so it's hard to be sure I haven't missed
> something.  In any case, does this seem plausible?
>
> @@ -715,23 +715,25 @@
>
>    while (idx >= listp->len)
>      {
>        idx -= listp->len;
>        listp = listp->next;
>      }
>
> -  if (dtv[0].counter < listp->slotinfo[idx].gen)
>      {
> -      /* The generation counter for the slot is higher than what the
> -        current dtv implements.  We have to update the whole dtv but
> -        only those entries with a generation counter <= the one for
> -        the entry we need.  */
> -      size_t new_gen = listp->slotinfo[idx].gen;
> +      /* Update the whole dtv, but capture the current generation
> +         counter value to compare against, so we can ignore entries
> +         that are being created as we walk through the list. */
> +      size_t new_gen = _dl_tls_generation;
>        size_t total = 0;
>
> +      /* Barrier to make sure the compiler has read _dl_tls_generation.
> +         FIXME: there should be a portable macro to do this.  */
> +      __asm__ __volatile__("": : :"memory");
> +
>        /* We have to look through the entire dtv slotinfo list.  */
>        listp =  _dl_tls_dtv_slotinfo_list;
>        do
>         {
>           size_t cnt;
>
>           for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)
>
>
>
> I may also have missed something with the need for a compiler barrier,
> but it seems required by the spirit of "ignore incompletely set up entries".
>

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



More information about the uClibc mailing list