History log of /openssl/crypto/chacha/asm/chacha-armv8-sve.pl (Results 1 – 5 of 5)
Revision Date Author Comments
# da1c088f 07-Sep-2023 Matt Caswell

Copyright year updates


Reviewed-by: Richard Levitte <levitte@openssl.org>
Release: yes


# cd7a8e96 11-Jan-2023 fangming.fang

Fix big-endian issue in chacha20 SVE implementation on aarch64

Fixes: #19902

Reviewed-by: Todd Short <todd.short@me.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged

Fix big-endian issue in chacha20 SVE implementation on aarch64

Fixes: #19902

Reviewed-by: Todd Short <todd.short@me.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/20028)

show more ...


# 3f42f41a 19-Jul-2022 Daniel Hu

Improve chacha20 perfomance on aarch64 by interleaving scalar with SVE/SVE2

The patch will process one extra block by scalar in addition to
blocks by SVE/SVE2 in parallel. This is esp. h

Improve chacha20 perfomance on aarch64 by interleaving scalar with SVE/SVE2

The patch will process one extra block by scalar in addition to
blocks by SVE/SVE2 in parallel. This is esp. helpful in the
scenario where we only have 128-bit vector length.

The actual uplift to performance is complicated, depending on the
vector length and input data size. SVE/SVE2 implementation don't
always perform better than Neon, but it should prevail in most
cases

On a CPU with 256-bit SVE/SVE2, interleaved processing can
handle 9 blocks in parallel (8 blocks by SVE and 1 by Scalar).
on 128-bit SVE/SVE2 it is 5 blocks. Input size that is a multiple
of 9/5 blocks on respective CPU can be typically handled at
maximum speed.

Here are test data for 256-bit and 128-bit SVE/SVE2 by running
"openssl speed -evp chacha20 -bytes 576" (and other size)

----------------------------------+---------------------------------
256-bit SVE | 128-bit SVE2
----------------------------------|---------------------------------
Input 576 bytes 512 bytes | 320 bytes 256 bytes
----------------------------------|---------------------------------
SVE 1716361.91k 1556699.18k | 1615789.06k 1302864.40k
----------------------------------|---------------------------------
Neon 1262643.44k 1509044.05k | 680075.67k 1060532.31k
----------------------------------+---------------------------------

If the input size gets very large, the advantage of SVE/SVE2 over
Neon will fade out.

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>
Change-Id: Ieedfcb767b9c08280d7c8c9a8648919c69728fab

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/18901)

show more ...


# bcb52bcc 25-May-2022 Daniel Hu

Optimize chacha20 on aarch64 by SVE2

This patch improves existing chacha20 SVE patch by using SVE2,
which is an optional architecture feature of aarch64, with XAR
instruction that ca

Optimize chacha20 on aarch64 by SVE2

This patch improves existing chacha20 SVE patch by using SVE2,
which is an optional architecture feature of aarch64, with XAR
instruction that can improve the performance of chacha20.

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/18522)

show more ...


# b1b2146d 07-Feb-2022 Daniel Hu

Acceleration of chacha20 on aarch64 by SVE

This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with

Acceleration of chacha20 on aarch64 by SVE

This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%

The solution takes a hybrid approach. SVE will handle multi-blocks that fit
the SVE vector length, with Neon/Scalar to process any tail data

Test result:
With SVE
type 1024 bytes 8192 bytes 16384 bytes
ChaCha20 1596208.13k 1650010.79k 1653151.06k

Without SVE (by Neon/Scalar)
type 1024 bytes 8192 bytes 16384 bytes
chacha20 1355487.91k 1372678.83k 1372662.44k

The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/17916)

show more ...