glibc/backport-x86-Optimizing-memcpy-for-AMD-Zen-architect.patch
Feifei Wang1994 8dee2397c0 x86: backport x86 optimization patches
Backport 3 x86 optimization patches.

The first one 'Reversing calculation of __x86_shared_non_temporal_threshold'
patch reverse calculation of __x86_shared_non_temporal_threshold from
3/4 of the entire shared cache size of a multi-threaded system to
3/4 of one thread's share of the cache size, and then improve the memcpy
performance.

The second patch 'Optimizing memcpy for AMD Zen architecture' recomputing
the shareable cache as 'L3 per CCX(Core-Complex)' and improve performance
for amd.

The third patch 'Add Hygon Dhyana support' fix Hygon Dhyana processor CPU
Vendor ID detection problem in glibc sysdep module.
2024-09-04 06:01:47 +00:00

62 lines
2.0 KiB
Diff

From 8374ca9a2f66ad1b36dbd4b53abba9c692fccee6 Mon Sep 17 00:00:00 2001
From: Sajan Karumanchi <sajan.karumanchi@amd.com>
Date: Tue, 3 Sep 2024 08:23:27 +0000
Subject: [PATCH] backport-x86-Optimizing-memcpy-for-AMD-Zen-architecture
Modifying the shareable cache '__x86_shared_cache_size', which is a
factor in computing the non-temporal threshold parameter
'__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen
architectures.
In the existing implementation, the shareable cache is computed as 'L3
per thread, L2 per core'. Recomputing this shareable cache as 'L3 per
CCX(Core-Complex)' has brought in performance gains.
As per the large bench variant results, this patch also addresses the
regression problem on AMD Zen architectures.
Origin backport: https://sourceware.org/git/?p=glibc.git;a=commit;h=8813b2682e4094e43b0cf1634e99619f1b8b2c62
---
sysdeps/x86/cacheinfo.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 5b43fa78..37a03af0 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -728,7 +728,7 @@ intel_bug_no_cache_info:
threads = 1 << ((ecx >> 12) & 0x0f);
}
- if (threads == 0)
+ if (threads == 0 || cpu_features->family >= 0x17)
{
/* If APIC ID width is not available, use logical
processor count. */
@@ -743,8 +743,22 @@ intel_bug_no_cache_info:
if (threads > 0)
shared /= threads;
- /* Account for exclusive L2 and L3 caches. */
- shared += core;
+ /* Get shared cache per ccx for Zen architectures. */
+ if (cpu_features->family >= 0x17)
+ {
+ unsigned int eax;
+
+ /* Get number of threads share the L3 cache in CCX. */
+ __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx);
+
+ unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1;
+ shared *= threads_per_ccx;
+ }
+ else
+ {
+ /* Account for exclusive L2 and L3 caches. */
+ shared += core;
+ }
}
#ifndef DISABLE_PREFETCHW
--
2.27.0