__thread fastpath not being taken

Chris Metcalf cmetcalf at tilera.com
Tue Dec 15 19:04:09 UTC 2009


On 12/15/2009 9:07 AM, Carmelo AMOROSO wrote:
> Austin Foxley wrote:
>   
>> On 12/09/2009 11:09 AM, Chris Metcalf wrote:
>>     
>>> My proposed fix is something like the following, in dl-tls.c.  (I'm
>>> running code patched up from the old SVN NPTL branch, but looking at the
>>> latest GIT NPTL it seems pretty similar here.)  Obviously a real patch
>>> would re-indent all the code contained within the scope for which I
>>> removed the "if" test.
>>>       
>> Haven't tried your patch yet, but it looks reasonable. This code is
>> actually pretty much unaltered since it came from glibc, and current
>> glibc still looks like it has it. You might want to send an email to
>> their list about this as well.
>>     
> Chris's comments definitely require me to look again at TLS
> details much more deeply than I did last time.
>
> I'll give it a try, at least on sh.
>   

I have a somewhat cleaned-up version of the patch that also fixes some
apparent race conditions in how additional segments of the slotinfo
array are allocated and chained in; you might as well start with it.
Thanks!  (Sorry, the diff is a Perforce diff and not directly
patch-compatible.  Also note that the "15 cycles" figure in the
description assumes another patch that breaks __tls_get_addr into a
"leaf" fast path with no stack frame that tail-calls into the slow
path -- I can provide that patch too if you like; a sketch of roughly
what that split looks like appears below.  Also, I purposely don't
re-indent a big chunk of code here, but you'd want to do that before
committing this to git, of course.)

Change 89612 by cmetcalf at cmetcalf-main on 2009/12/11 08:26:14

        Bug 6444: improve performance of __thread variable access.

        The compiler generates calls to __tls_get_addr() for __thread
        variables.  The implementation checks a global _dl_tls_generation
        variable to make sure that the thread-specific data structure
        is up to date, and if it isn't, goes to the slow path to
        rebuild the thread-specific data.
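
Roughly, the fast path and the leaf/slow-path split look like this.
This is a sketch only, in the style of the glibc/NPTL sources (types
and macros as in tls.h and dl-tls.h), not the literal code; the helper
name __tls_get_addr_slow is illustrative:

/* Slow path: rebuild this thread's dtv, then resolve the address.  */
extern void *__tls_get_addr_slow (tls_index *ti);   /* illustrative name */

void *
__tls_get_addr (tls_index *ti)
{
  dtv_t *dtv = THREAD_DTV ();

  /* Fast path: this thread's dtv is current and the module's TLS
     block is already allocated.  */
  if (__builtin_expect (dtv[0].counter == _dl_tls_generation, 1))
    {
      void *p = dtv[ti->ti_module].pointer.val;
      if (__builtin_expect (p != TLS_DTV_UNALLOCATED, 1))
        return (char *) p + ti->ti_offset;
    }

  /* Tail-call the slow path; with nothing live across the call, the
     compiler can emit a plain jump, so the fast path needs no stack
     frame of its own.  */
  return __tls_get_addr_slow (ti);
}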

        Unfortunately, the existing code is very conservative about
        when it's willing to rebuild; it only looks at the requested
        module's generation count, and if the thread's dtv is at least
        as new as that module, it skips the rebuild.  This can lead to
        degenerate situations in which we end up always running the
        slow path.  Notably, this is the case at startup, since all
        the modules have generation zero, but the TLS startup code
        increments the global generation to one.
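
To make the degenerate startup case concrete (values illustrative,
following the description above):

/* At startup:
     slotinfo[idx].gen  == 0   for every module loaded at startup
     _dl_tls_generation == 1   after the TLS startup code runs
     dtv[0].counter     == 0   in each newly created thread

   The old test in _dl_update_slotinfo():

     if (dtv[0].counter < listp->slotinfo[idx].gen)   // 0 < 0: false
       ... rebuild the dtv and advance dtv[0].counter ...

   never fires, so dtv[0].counter stays 0, the fast-path test
   dtv[0].counter == _dl_tls_generation (0 == 1) keeps failing, and
   every __thread access takes the slow path.  */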

        To fix this, the change makes the runtime always sync the
        thread-specific data structures up to the current
        _dl_tls_generation value.  To make this safe, we add suitable
        barriers and reorder how the dlopen/dlclose paths touch the
        per-module data structures that the __tls_get_addr code reads.
        The code was originally written to be used this way, so it
        doesn't take much tweaking, but a few places weren't quite
        ready for it.
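
Condensed from the two diffs below, the resulting ordering discipline
pairs up like this (sketch only; see the patch for full context):

/* Writer (dlopen path): stamp and publish the entry, then open the
   gate by bumping the global generation.  */
listp->slotinfo[idx].gen = _dl_tls_generation + 1;  /* stamp entry */
atomic_write_barrier();
listp->slotinfo[idx].map = l;                       /* publish map */
atomic_write_barrier();
++_dl_tls_generation;                               /* open the gate */

/* Reader (__tls_get_addr slow path): capture the gate, then walk.  */
size_t new_gen = _dl_tls_generation;
atomic_read_barrier();
/* ... walk the slotinfo list, ignoring entries whose gen is greater
   than new_gen, since those may still be mid-update ... */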

        This change reduces the typical __thread address-generation
        overhead from about 260 cycles to about 15 cycles.

Affected files ...

... //tilera/main/tools/uclibc/ldso/ldso/dl-tls.c#7 edit
... //tilera/main/tools/uclibc/ldso/libdl/libdl.c#6 edit

Differences ...

==== //tilera/main/tools/uclibc/ldso/ldso/dl-tls.c#7 (text) ====

@@ -29,6 +29,7 @@
 #include <tls.h>
 #include <dl-tls.h>
 #include <ldsodefs.h>
+#include <atomic.h>

 void *(*_dl_calloc_function) (size_t __nmemb, size_t __size) = NULL;
 void *(*_dl_realloc_function) (void *__ptr, size_t __size) = NULL;
@@ -704,13 +705,17 @@
      possible that other threads at the same time dynamically load
      code and therefore add to the slotinfo list.  This is a problem
      since we must not pick up any information about incomplete work.
-     The solution to this is to ignore all dtv slots which were
-     created after the one we are currently interested.  We know that
-     dynamic loading for this module is completed and this is the last
-     load operation we know finished.  */
+     The solution to this is to ensure that we capture the current
+     generation count before walking the list, and ignore any slots
+     which are marked with a higher generation count, since they may
+     still be in the process of being updated.  */
   unsigned long int idx = req_modid;
   struct dtv_slotinfo_list *listp = _dl_tls_dtv_slotinfo_list;
+  size_t new_gen = _dl_tls_generation;

+  /* Make sure we don't read _dl_tls_generation too late. */
+  atomic_read_barrier();
+
   _dl_debug_early ("Updating slotinfo for module %d\n", req_modid);

   while (idx >= listp->len)
@@ -719,13 +724,7 @@
       listp = listp->next;
     }

-  if (dtv[0].counter < listp->slotinfo[idx].gen)
-    {
-      /* The generation counter for the slot is higher than what the
-        current dtv implements.  We have to update the whole dtv but
-        only those entries with a generation counter <= the one for
-        the entry we need.  */
-      size_t new_gen = listp->slotinfo[idx].gen;
+    {  /* Tilera: leave scope to make diffs against our baseline easier. */
       size_t total = 0;

       /* We have to look through the entire dtv slotinfo list.  */
@@ -939,7 +938,7 @@
         the first slot.  */
       _dl_assert (idx == 0);

-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
        _dl_malloc (sizeof (struct dtv_slotinfo_list)
                + TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
       if (listp == NULL)
@@ -963,11 +962,22 @@
       listp->next = NULL;
       _dl_memset (listp->slotinfo, '\0',
              TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+
+      /* Make sure the new slotinfo element is in memory before exposing it. */
+      atomic_write_barrier();
+      prevp->next = listp;
     }

-  /* Add the information into the slotinfo data structure.  */
+  /* Add the information into the slotinfo data structure.  Make sure
+     to barrier as we do it, to present a consistent view to other
+     threads that may be calling __tls_get_addr() as we do this.
+     Barrier after the writes so we know everything is ready for an
+     increment of _dl_tls_generation.  */
+  listp->slotinfo[idx].gen = _dl_tls_generation + 1;
+  atomic_write_barrier();
   listp->slotinfo[idx].map = l;
-  listp->slotinfo[idx].gen = _dl_tls_generation + 1;
+  atomic_write_barrier();
+
   /* ??? ideally this would be done once per call to dlopen.  However there's
      no easy way to indicate whether a library used TLS, so do it here
         instead. */

==== //tilera/main/tools/uclibc/ldso/libdl/libdl.c#6 (text) ====

@@ -34,6 +34,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <stdbool.h>
+#include <atomic.h>

 #include <tls.h>
 #if USE_TLS
@@ -585,9 +586,12 @@
        }

        /* Bump the generation number if necessary.  */
-       if (any_tls && __builtin_expect (++_dl_tls_generation == 0, 0)) {
-               _dl_debug_early("TLS generation counter wrapped! Please report this.");
-               _dl_exit(30);
+       if (any_tls) {
+               atomic_write_barrier();
+               if (__builtin_expect (++_dl_tls_generation == 0, 0)) {
+                       _dl_debug_early("TLS generation counter wrapped! Please report this.");
+                       _dl_exit(30);
+               }
        }

 #endif
@@ -967,6 +971,7 @@
        /* If we removed any object which uses TLS bump the generation counter.  */
        if (any_tls)
        {
+               atomic_write_barrier();
                if (__builtin_expect (++_dl_tls_generation == 0, 0))
                {
                        _dl_debug_early ("TLS generation counter wrapped!  Please report to the uClibc mailing list.\n");



-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


