标题：High-speed elliptic curve cryptography on the NVIDIA GT200 graphics processing unit
作者：Cui, Shujie ;dl, Johann ;Liu, Zhe ;Xu, Qiuliang
作者机构：[Cui, Shujie ;Cui, Shujie ;Cui, Shujie ;Xu, Qiuliang ] Shandong University, School of Computer Science and Technology, Shunhua Road 1500, Jinan 250101 更多
会议名称：10th International Conference on Information Security Practice and Experience, ISPEC 2014
会议日期：5 May 2014 through 8 May 2014
来源：Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
摘要：This paper describes a high-speed software implementation of Elliptic Curve Cryptography (ECC) for GeForce GTX graphics cards equipped with an NVIDIA GT200 Graphics Processing Unit (GPU). In order to maximize throughput, our ECC software allocates just a single thread per scalar multiplication and aims to launch as many threads in parallel as possible. We adopt elliptic curves in Montgomery as well as twisted Edwards form, both defined over a special family of finite fields known as Optimal Prime Fields (OPFs). All field-arithmetic operations use a radix-224 representation for the operands (i.e. 24 operand bits are contained in a 32-bit word) to comply with the native (24 ×24)-bit integer multiply instruction of the GT200 platform. We implemented the OPF arithmetic without conditional statements (e.g. if-then clauses) to prevent thread divergence and unrolled the loops to minimize execution time. The scalar multiplication on the twisted Edwards curve employs a comb approach if the base point is fixed and uses extended projective coordinates so that a point addition requires only seven multiplications in the underlying OPF. Our software currently supports elliptic curves over 160-bit and 224-bit OPFs. After a detailed evaluation of numerous implementation options and configurations, we managed to launch 2880 threads on the 30 multiprocessors of the GT200 when the elliptic curve has Montgomery form and is defined over a 224-bit OPF. The resulting throughput is 115k scalar multiplications per second (for arbitrary base points) and we achieved a minimum latency of 19.2 ms. In a fixed-base setting with 256 precomputed points, the throughput increases to some 345k scalar multiplications and the latency drops to 4.52 ms. © 2014 Springer International Publishing.