chacha-armv8-sve.pl - OpenGrok history log for /openssl/crypto/chacha/asm/chacha-armv8-sve.pl

Revision	Date	Author	Comments
# da1c088f	07-Sep-2023	Matt Caswell	Copyright year updates Reviewed-by: Richard Levitte <levitte@openssl.org> Release: yes
# cd7a8e96	11-Jan-2023	fangming.fang	Fix big-endian issue in chacha20 SVE implementation on aarch64 Fixes: #19902 Reviewed-by: Todd Short <todd.short@me.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from https://github.com/openssl/openssl/pull/20028)
# 3f42f41a	19-Jul-2022	Daniel Hu	Improve chacha20 performance on aarch64 by interleaving scalar with SVE/SVE2

    The patch processes one extra block with scalar instructions in parallel with the blocks processed by SVE/SVE2. This is especially helpful when the vector length is only 128 bits.

    The actual performance uplift is complicated and depends on the vector length and the input data size. The SVE/SVE2 implementation does not always perform better than Neon, but it should prevail in most cases.

    On a CPU with 256-bit SVE/SVE2, interleaved processing can handle 9 blocks in parallel (8 blocks by SVE and 1 by scalar). On 128-bit SVE/SVE2 it is 5 blocks.

    An input size that is a multiple of 9 or 5 blocks on the respective CPU can typically be handled at maximum speed.

    Test data for 256-bit and 128-bit SVE/SVE2, obtained by running "openssl speed -evp chacha20 -bytes 576" (and other sizes):

    ----------------------------------+---------------------------------
              256-bit SVE             |          128-bit SVE2
    ----------------------------------|---------------------------------
    Input   576 bytes     512 bytes   |    320 bytes     256 bytes
    ----------------------------------|---------------------------------
    SVE     1716361.91k   1556699.18k |    1615789.06k   1302864.40k
    ----------------------------------|---------------------------------
    Neon    1262643.44k   1509044.05k |    680075.67k    1060532.31k
    ----------------------------------+---------------------------------

    If the input size gets very large, the advantage of SVE/SVE2 over Neon fades out.

    Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>
    Change-Id: Ieedfcb767b9c08280d7c8c9a8648919c69728fab
    Reviewed-by: Tomas Mraz <tomas@openssl.org>
    Reviewed-by: Paul Dale <pauli@openssl.org>
    (Merged from https://github.com/openssl/openssl/pull/18901)
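
The block counts quoted in revision 3f42f41a (9 blocks at 256 bits, 5 blocks at 128 bits) are consistent with one ChaCha20 block per 32-bit vector lane plus one extra scalar block. The sketch below only models that arithmetic; the per-lane mapping is an assumption inferred from the numbers in the commit message, and the function names are hypothetical, not taken from chacha-armv8-sve.pl.

```python
CHACHA_BLOCK_BYTES = 64  # one ChaCha20 block is 64 bytes
LANE_BITS = 32           # ChaCha20 works on 32-bit words


def blocks_per_iteration(vector_bits: int) -> int:
    """Blocks per interleaved pass: one per 32-bit lane (assumed) plus one scalar block."""
    if vector_bits <= 0 or vector_bits % 128 != 0:
        raise ValueError("SVE vector lengths are positive multiples of 128 bits")
    return vector_bits // LANE_BITS + 1


def is_full_stride_size(length_bytes: int, vector_bits: int) -> bool:
    """True if the input is a whole number of interleaved passes."""
    stride = blocks_per_iteration(vector_bits) * CHACHA_BLOCK_BYTES
    return length_bytes > 0 and length_bytes % stride == 0
```

Under this model, 576 bytes is exactly one 9-block pass on 256-bit hardware and 320 bytes is exactly one 5-block pass on 128-bit hardware, which matches the input sizes chosen for the benchmark table in that commit.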
# bcb52bcc	25-May-2022	Daniel Hu	Optimize chacha20 on aarch64 by SVE2

    This patch improves the existing chacha20 SVE patch by using SVE2, an optional architecture feature of aarch64, whose XAR instruction can improve the performance of chacha20.

    Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>
    Reviewed-by: Tomas Mraz <tomas@openssl.org>
    Reviewed-by: Paul Dale <pauli@openssl.org>
    (Merged from https://github.com/openssl/openssl/pull/18522)
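
To illustrate why an XOR-and-rotate instruction suits ChaCha20, the sketch below models the ChaCha quarter round with each XOR followed by a rotate folded into a single helper. It is a scalar Python model, not the perlasm from this file; `xar32` is a hypothetical helper that assumes XAR semantics of "XOR, then rotate right by an immediate", so a left rotation by n is expressed as a right rotation by 32 - n.

```python
MASK32 = 0xFFFFFFFF


def xar32(x: int, y: int, ror: int) -> int:
    """Model of XOR-and-rotate: XOR two 32-bit words, then rotate right by `ror` bits."""
    v = (x ^ y) & MASK32
    return ((v >> ror) | (v << (32 - ror))) & MASK32


def quarter_round(a: int, b: int, c: int, d: int):
    """ChaCha quarter round; every XOR + left-rotate pair becomes one xar32 call."""
    a = (a + b) & MASK32
    d = xar32(d, a, 32 - 16)  # d = (d ^ a) <<< 16
    c = (c + d) & MASK32
    b = xar32(b, c, 32 - 12)  # b = (b ^ c) <<< 12
    a = (a + b) & MASK32
    d = xar32(d, a, 32 - 8)   # d = (d ^ a) <<< 8
    c = (c + d) & MASK32
    b = xar32(b, c, 32 - 7)   # b = (b ^ c) <<< 7
    return a, b, c, d
```

Each quarter round contains four XOR-then-rotate pairs, so fusing each pair into one operation plausibly removes four instructions per quarter round; the commit message itself does not quantify the gain.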
# b1b2146d	07-Feb-2022	Daniel Hu	Acceleration of chacha20 on aarch64 by SVE

    This patch accelerates chacha20 on aarch64 when the Scalable Vector Extension (SVE) is supported by the CPU. Tested on a modern micro-architecture with 256-bit SVE, it has the potential to improve performance by up to 20%.

    The solution takes a hybrid approach. SVE handles the multi-block portion that fits the SVE vector length, and Neon/scalar code processes any tail data.

    Test result:

    With SVE
    type       1024 bytes    8192 bytes    16384 bytes
    ChaCha20   1596208.13k   1650010.79k   1653151.06k

    Without SVE (by Neon/Scalar)
    type       1024 bytes    8192 bytes    16384 bytes
    chacha20   1355487.91k   1372678.83k   1372662.44k

    The assembly code has been reviewed internally by ARM engineer Fangming.Fang@arm.com

    Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>
    Reviewed-by: Tomas Mraz <tomas@openssl.org>
    Reviewed-by: Paul Dale <pauli@openssl.org>
    (Merged from https://github.com/openssl/openssl/pull/17916)
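
The hybrid approach described in revision b1b2146d can be pictured as a split of the input into a vector-sized portion and a tail. The following is a minimal sketch of that split only, assuming one 64-byte block per 32-bit lane (so a 256-bit vector covers 8 blocks per group); `split_hybrid` is a hypothetical name, and the real dispatch logic in the assembly may differ.

```python
CHACHA_BLOCK_BYTES = 64  # one ChaCha20 block is 64 bytes


def split_hybrid(length_bytes: int, vector_bits: int):
    """Return (sve_bytes, tail_bytes): whole vector-sized groups go to SVE, the rest to Neon/scalar."""
    if length_bytes < 0:
        raise ValueError("length must be non-negative")
    if vector_bits <= 0 or vector_bits % 128 != 0:
        raise ValueError("SVE vector lengths are positive multiples of 128 bits")
    # Assumption: one block per 32-bit lane, so a group is (vector_bits / 32) blocks.
    group_bytes = (vector_bits // 32) * CHACHA_BLOCK_BYTES
    sve_bytes = (length_bytes // group_bytes) * group_bytes
    return sve_bytes, length_bytes - sve_bytes
```

With this model, a 1024-byte input on 256-bit SVE is covered entirely by two 512-byte groups, while a 1000-byte input leaves 488 bytes for the Neon/scalar tail path.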