# FPGA Implementation of Point Multiplication on Koblitz Curves Using Kleinian Integers 

V.S. Dimitrov ${ }^{1 \star}$, K.U. Järvinen ${ }^{2}$, M.J. Jacobson, Jr. ${ }^{3}$, W.F. Chan ${ }^{3}$, and Z. Huang ${ }^{1}$<br>${ }^{1}$ Department of Electrical and Computer Engineering, University of Calgary, 2500<br>University Drive NW, Calgary, Alberta, Canada T2N 1N4,<br>(dimitrov,huangzh)@atips.ca<br>${ }^{2}$ Signal Processing Laboratory, Helsinki University of Technology, Otakaari 5A, 02150, Espoo, Finland, kimmo.jarvinen@tkk.fi<br>${ }^{3}$ Department of Computer Science, University of Calgary, 2500 University Drive NW, Calgary, Alberta, Canada T2N 1N4, (chanwf, jacobs)@cpsc.ucalgary.ca


#### Abstract

We describe algorithms for point multiplication on Koblitz curves using multiple-base expansions of the form $k=\sum \pm \tau^{a}(\tau-1)^{b}$ and $k=\sum \pm \tau^{a}(\tau-1)^{b}\left(\tau^{2}-\tau-1\right)^{c}$. We prove that the number of terms in the second type is sublinear in the bit length of k , which leads to the first provably sublinear point multiplication algorithm on Koblitz curves. For the first type, we conjecture that the number of terms is sublinear and provide numerical evidence demonstrating that the number of terms is significantly less than that of $\tau$-adic non-adjacent form expansions. We present details of an innovative FPGA implementation of our algorithm and performance data demonstrating the efficiency of our method.


## 1 Introduction

In 1985, Koblitz [1] and Miller [2] independently proposed the use of the additive finite abelian group of points on elliptic curves defined over a finite field for cryptographic applications. The Koblitz curves [3], or anomalous binary curves, are

$$
\begin{equation*}
E_{a}: y^{2}+x y=x^{3}+a x^{2}+1 \tag{1}
\end{equation*}
$$

defined over $\mathbb{F}_{2}$, where $a \in\{0,1\}$. The number of points on these curves when considered over $\mathbb{F}_{2^{m}}$ can be computed rapidly using a simple recurrence relation, and there are many prime values of $m$ for which the number of points is twice a prime (when $a=1$ ) or four times a prime (when $a=0$ ). Five Koblitz curves are recommended for cryptographic use by NIST [4].

The main advantage of Koblitz curves is that the Frobenius automorphism of $\mathbb{F}_{2}$ acts on points via $\tau(x, y)=\left(x^{2}, y^{2}\right)$ and is essentially free to compute. Because $\tau$ satisfies $\left(\tau^{2}+2\right) P=\mu \tau(P)$ for all points $P \in E_{a}\left(\mathbb{F}_{2^{m}}\right)$ where $\mu=$ $(-1)^{1-a}$, we can consider $\tau$ as a complex number satisfying $\tau^{2}-\mu \tau+2=0$, i.e., $\tau=(\mu+\sqrt{-7}) / 2$. Thus, computing $k P$, where $k \in \mathbb{Z}$ and $P \in E_{a}\left(\mathbb{F}_{2^{m}}\right)$, can

[^0]be done using a representation of $k$ involving powers of $\tau$ instead of the usual binary representation using powers of 2 , yielding a point multiplication algorithm similar to the binary "double-and-add" method in which the point doublings are replaced by applications of the Frobenius [3,5]. Solinas [5] shows how the non-adjacent form (NAF) and window-NAF methods mentioned earlier can be extended to $\tau$-adic expansions. The resulting point multiplication algorithms require on average $\left(\log _{2} k\right) / 3$ point additions or $\left(\log _{2} k\right) /(w+1)$ point additions using width- $w$ window methods requiring precomputations based on $P$. A recent result of Avanzi et. al. [6] reduces this to $\left(\log _{2} k\right) / 4$ at the cost of one additional point halving, but the practicality of this method has not yet been demonstrated.

Recently, double-base integer representations have been used to devise efficient point multiplication algorithms [7-9]. For example, it can be shown that the number of terms of the form $\pm 2^{a} 3^{b}$ required to represent $k$ is bounded by $O(\log k / \log \log k)$. These representations can be computed efficiently and the resulting point multiplication algorithms are the only known methods for which the number of required point additions is sublinear in $\log k$.

In this paper, we extend the double-base idea to $\tau$-adic expansions for point multiplication on Koblitz curves by representing $k$ as a sum of terms $\pm \tau^{a}(\tau-1)^{b}$. Our algorithm requires no precomputations based on the point $P$, no point doublings, and fewer point additions than $\tau$-adic NAF ( $\tau$-NAF) for the five recommended Koblitz curves from [4]. Our algorithm for computing the double-base representation of $k$ is very efficient; it requires only the unsigned $\tau$-adic expansion of $k$ plus a few table-lookups. A precomputed table of optimal representations for a small number of $\tau$-adic integers is required, but these are independent of the multiplier $k$ and the base point $P$. We have developed a novel FPGA implementation of both the conversion and point multiplication algorithms that demonstrates the efficiency of our method.

We conjecture that the average density of our representations is sublinear in $\log k$, and provide extensive numerical evidence showing that the density is lower than that of $\tau$-NAF expansions. Although we do not have a proof that the number of point additions required by our algorithm is sublinear, we provide a proof that sublinearity is obtained using similar expressions involving three bases of the form $\pm \tau^{a}(\tau-1)^{b}\left(\tau^{2}-\tau-1\right)^{c}$. This work represents the first rigorouslyproven sublinear point multiplication algorithm using complex bases.

Avanzi and Sica [10] have reported independently on a provably sublinear point multiplication algorithm using bases $\tau$ and 3 . However, it is not clear how their algorithm performs in practice, and their proof has been shown to have a gap [11].

The remainder of the paper is organized as follows. In Sec. 2 we present our provably sublinear point multiplication algorithm. We present a similar algorithm using only two complex bases in Sec. 3. Although we cannot prove sublinearity for this algorithm, we conjecture that the density of the representations is in fact sublinear, and provide numerical evidence in Subsection 3.2 indicating that the density of our representations is lower than that of $\tau$-NAF representations. A description of our FPGA implementation and numerical data
demonstrating its efficiency are presented in Sec. 4. Finally, we conclude with an outlook on possible directions for further research.

## 2 Multi-dimensional Frobenius Expansions

We start with the following three definitions:
Definition 1. A complex number, $\xi$ of the form $e+f \tau, e, f$-integers is called a Kleinian integer [12].

Definition 2. A Kleinian integer $\omega$ of the form $\omega= \pm \tau^{x}(\tau-1)^{y}, x, y \geq 0$ is called a $\{\tau, \tau-1\}$-Kleinian integer.

Definition 3. A Kleinian integer $\omega$ of the form $\omega= \pm \tau^{x}(\tau-1)^{y}\left(\tau^{2}-\tau-\right.$ $1)^{z}, x, y, z \geq 0$ is called $a\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-Kleinian integer.

The main idea of the new point multiplication algorithm over Koblitz curves is to extend the existing and widely-used $\tau$-NAF expansion of the scalar to a new form which will speed up the computations. The improvements obtained in the paper are based on the following representation, which we will call twodimensional or three-dimensional Frobenius expansions (or $\{\tau, \tau-1\}$-expansion and $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-expansion, for short):

$$
\begin{gather*}
k=\sum_{i=1}^{d} s_{i} \tau^{a_{i}}(\tau-1)^{b_{i}}, \quad s_{i}= \pm 1, \quad a_{i}, b_{i} \in \mathbb{Z}_{\geq 0}  \tag{2}\\
k=\sum_{i=1}^{d} s_{i} \tau^{a_{i}}(\tau-1)^{b_{i}}\left(\tau^{2}-\tau-1\right)^{c_{i}}, \quad s_{i}= \pm 1, \quad a_{i}, b_{i}, c_{i} \in \mathbb{Z}_{\geq 0} . \tag{3}
\end{gather*}
$$

Such representations are clearly highly redundant. If we rearrange the summands in the above formula, then, using two bases, we can represent the scalar $k$ as

$$
\begin{equation*}
k=\sum_{l=1}^{\max \left(b_{i}\right)}(\tau-1)^{l}\left(\sum_{i=1}^{\max \left(a_{i, l}\right)} s_{i, l} \tau^{a_{i, l}}\right) \tag{4}
\end{equation*}
$$

where $\max \left(a_{i, l}\right)$ is the maximal power of $\tau$ that is multiplied by $(\tau-1)^{l}$ in (2). Using three bases, we can represent $k$ as

$$
\begin{equation*}
k=\sum_{l_{2}}^{\max \left(c_{i}\right)}\left(\tau^{2}-\tau-1\right)^{l_{2}} \sum_{l_{1}=1}^{\max \left(b_{i}\right)}(\tau-1)^{l_{1}}\left(\sum_{i=1}^{\max \left(a_{i, l_{1}, l_{2}}\right)} s_{i, l_{1}, l_{2}} \tau^{a_{i, l_{1}, l_{2}}}\right) \tag{5}
\end{equation*}
$$

where $\max \left(a_{i, l_{1}, l_{2}}\right)$ is the maximal power of $\tau$ that is multiplied by $(\tau-1)^{l_{1}}\left(\tau^{2}-\right.$ $\tau-1)^{l_{2}}$ in (3).

Alg. 1 computes $k P$ given a $\{\tau, \tau-1\}$-expansion of $k$. The corresponding algorithm for $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-expansions will be described later, along with

```
Algorithm 1 Point multiplication using \(\{\tau, \tau-1\}\)-expansions.
InPut: \(k, P\)
Output: \(Q=k P\)
    \(P_{0} \leftarrow P\)
    \(Q \leftarrow \mathcal{O}\)
    for \(i=0\) to \(j\) do
        \(S \leftarrow r_{i}(k) P_{i}\) \{One dimensional \(\tau\)-NAF corresponding to \((\tau-1)^{l}\) in (4)\}
        \(P_{i+1} \leftarrow \tau P_{i}-P_{i}\)
        \(Q \leftarrow Q+S\)
```

a proof that the number of point additions is sublinear in $\log k$. Essentially, $k P$ is computed via a succession of one-dimensional $\tau$-adic expansions.

The representation of $k$ given in (4) is the cornerstone of our algorithm, so some comments on it are in order.

1. The multiplications by $\tau-1$ cost one Frobenius mapping (free in our computational model) and one point subtraction. The multiplications by $\tau^{2}-\tau-1$ cost two Frobenius mappings and two point subtractions. Therefore, the total number of point additions/subtractions, $A S(k)$, is given by

$$
A S(k)=d+\max \left(b_{i}\right)-1
$$

in the case of $\{\tau, \tau-1\}$-expansions and

$$
A S(k)=d+\max \left(b_{i}\right) \max \left(c_{i}\right)-1
$$

in the case of $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-expansions. The smallest possible value of $\max \left(b_{i}\right)$ and $\max \left(c_{i}\right), 0$, corresponds to the classical (one-dimensional) $\tau$-NAF expansion, for which it is known that the expected number of point additions/subtractions is $\left(\log _{2} k\right) / 3$. It is clear that by allowing larger values for $\max \left(b_{i}\right)$ and $\max \left(c_{i}\right)$ one would decrease the corresponding number of summands, $d$. Therefore, it is vital to find out the optimal values for $\max \left(b_{i}\right)$ as a function of the size of the scalar.
2. Finding an algorithm that can return a fairly short decomposition of $k$ as the sum of $\{\tau, \tau-1\}$-Kleinian integers is absolutely essential. The most straightforward idea seems to be the greedy algorithm described in Alg. 2. A greedy algorithm for computing $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-expansions is an easy generalization of this algorithm.
The complexity of the greedy algorithm depends crucially on the time spent to find the closest $\{\tau, \tau-1\}$-Kleinian integer to the current Kleinian integer. Unfortunately we were not able to find a significantly more efficient method to do this than precomputing all Kleinian integers $\pm \tau^{x}(\tau-1)^{y}$ for $x, y$ less than certain bounds and finding the closest one using exhaustive search. In the next subsection, we present an efficient algorithm for computing $\{\tau, \tau-1\}$-expansions with slightly more weight than those produced by the greedy algorithm and an algorithm for computing $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-expansions with weight provably sublinear in $\log k$.

```
Algorithm 2 Greedy algorithm for computing \(\{\tau, \tau-1\}\)-expansions.
Input: A Kleinian integer \(\xi=e+f \tau\)
Output: \(\left\{\omega_{1}, \ldots, \omega_{d}\right\}\), a \(\{\tau, \tau-1\}\)-expansion of \(\xi\)
    \(i \leftarrow 1\)
    while \(\xi \neq 0\) do
        Find \(\omega_{i}= \pm \tau^{a_{i}}(\tau-1)^{b_{i}}, a_{i}, b_{i} \geq 0\), the closest \(\{\tau, \tau-1\}\)-Kleinian integer to \(\xi\).
        \(\xi \leftarrow \xi-\omega_{i}\)
    \(i \leftarrow i+1\)
```


### 2.1 Comparison to Double-base Number Systems

The similarities between (2) and the double-base number system (DBNS), in which one represents integers as the sum or difference of numbers of the form $2^{a} 3^{b}, a, b$ non-negative integers (called $\{2,3\}$-integers), are apparent. In the case of DBNS, one can prove the following result:

Theorem 1. Every positive integer, n, can be written as the sum of at most $O(\log n / \log \log n)\{2,3\}$-integers and (one) such representation can be found by using the greedy algorithm.

The key point in proving this theorem is the following result of Tijdeman [13].
Theorem 2. Let $x$ and $y$ be two $\{2,3\}$-integers, $x>y$. Then there exist effectively computable constants, $c_{1}$ and $c_{2}$, such that

$$
\frac{x}{(\log x)^{c_{1}}}<x-y<\frac{x}{(\log x)^{c_{2}}} .
$$

The proof of Theorem 1 uses only the first inequality.
Theorem 2 provides a very accurate description of the difference between two consecutive $\{2,3\}$-integers. More to the point, it can be generalized easily to any set of $\left\{p_{1}, p_{2}, \cdots, p_{s}\right\}$-integers if $p_{s}$ is fixed. The proof depends on the main result of [14] from the theory of linear form in logarithms.

Theorem 3. Let $a_{1}, a_{2}, \cdots, a_{k}$ be nonzero algebraic integers and $b_{1}, b_{2}, \cdots, b_{k}$ rational integers. Assume $a_{1}^{b_{1}} a_{2}^{b_{2}} \cdots a_{k}^{b_{k}} \neq 1$ and $B=\max \left(b_{1}, b_{2}, \cdots, b_{k}\right)$. Then the following inequality holds:

$$
\left|a_{1}^{b_{1}} a_{2}^{b_{2}} \cdots a_{k}^{b_{k}}-1\right| \geq \exp \left(-C(k) \log a_{1} \log a_{2} \cdots \log a_{k}\right)
$$

where $C(k)=\exp \left(4 k+10 k^{3 k+5}\right)$.
The constant $C(k)$ is huge, even in the case of linear forms in two logarithms, approximately $\exp \left(6 \cdot 10^{9}\right)$. By using some results aimed specifically at the case of two logarithms [15], one can reduce $C(k)$ to $\exp \left(10^{7}\right)$, but this is still enormous. However, practical simulations suggest that this constant is likely to be much smaller, perhaps less than 100.

There are two very essential points that are often overlooked in the formulations of the above theorems [16]:

1. the estimates are correct if the algebraic numbers used are real,
2. if the algebraic numbers are complex, then the estimates provided remain unchanged if one of them, say $a_{1}$, has an absolute value strictly greater than absolute values of the other algebraic numbers.

The latter point is what prevents us from applying Tijdeman's Theorem 2 to the case of $a_{1}=\tau, a_{2}=\tau-1$. Thus, we are not in position to trivially extend the proof of Theorem 1 to the case of $\{\tau, \tau-1\}$-expansions of Kleinian integers. Nevertheless, extensive numerical simulations (by using several attempted optimizations of Alg. 2) has led us to the following conjecture:

Conjecture 1. Every Kleinian integer, $\xi=a+b \tau$, can be represented as the sum of at most $O(\log N(\xi) / \log \log N(\xi))\{\tau, \tau-1\}$-Kleinian integers, where $N(\xi)$ is the norm of $\xi$.

A very recent paper by Avanzi and Sica [10] contains a proof that Conjecture 1 is true if one uses $\{\tau, 3\}$-Kleinian integers under the unproven but reasonable assumption that the irrationality measure of $\log _{2} 3$ and $\arg (\tau) / \pi$ is 2. Unfortunately, the proof, even with the assumption on irrationality measures, has a gap [11]. The use of two complex bases, used in this paper, increases the theoretical difficulties in proving the conjecture, but provides much more practical algorithms.

However, in the case of three bases we can prove without any assumptions the following:

Theorem 4. Every Kleinian integer $\zeta=a+b \tau$ can be represented as the sum of at most $O(\log N(\zeta) /(\log \log N(\zeta)))\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-Kleinian integers, such that the largest power of both $\tau-1$ and $\tau^{2}-\tau-1$ is $O\left(\log ^{\alpha} N(\zeta)\right)$ for any real constant $\alpha$ where $0<\alpha<1 / 2$.

Proof. We assume that $b=0$; otherwise, one applies the same proof for the real and imaginary part of $\zeta$, which leads to doubling the implicit constant hidden in the big- $O$ notation.

Let $\alpha$ be a real constant where $0<\alpha<1 / 2$. We determine the $\tau$-adic representation of $a$, the real part of $\zeta$, using digits 0 and 1 . The length of this expansion is $O(\log N(\zeta))$. We break this representation into $\left\lceil\log ^{1-\alpha} N(\zeta)\right\rceil$ blocks, where each block contains $O\left(\log ^{\alpha} N(\zeta)\right)$ digits. Each of these blocks corresponds to a Kleinian integer $c_{i}+d_{i} \tau, i=0,1, \ldots,\left\lceil\log ^{1-\alpha} N(\zeta)\right\rceil$, where the size of both $c_{i}$ and $d_{i}$ is $O\left(\log ^{\alpha} N(\zeta)\right)$. Now, we represent each integer $c_{i}$ and $d_{i}$ in double-base representation using bases 2 and 3 . According to Theorem 1, these numbers will require at most

$$
O\left(\log ^{\alpha} N(\zeta) /\left(\log \log ^{\alpha} N(\zeta)\right)\right)=O\left(\log ^{\alpha} N(\zeta) /(\log \log N(\zeta))\right)
$$

summands of the form $2^{x} 3^{y}$ where $x, y \geq 0$ and $x, y \in O\left(\log ^{\alpha} N(\zeta)\right)$. Using the fact that $2=\tau(1-\tau)$ and $3=1-\tau-\tau^{2}$, we substitute the 2's and 3's in the 2, 3-expansions of $c_{i}$ and $d_{i}$ to obtain $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-Kleinian integer expansions of each $c_{i}+d_{i} \tau, i=0,1, \ldots,\left\lceil\log ^{1-\alpha} N(\zeta)\right\rceil$. To obtain the expansion
of $\zeta=a+b \tau$, we multiply each term of the form $\pm \tau^{x}(\tau-1)^{y}\left(\tau^{2}-\tau-1\right)^{z}$ by $\tau^{i}$ where $i$ is the index of the corresponding block. Note that $x, y, z \in$ $O\left(\log ^{\alpha} N(\zeta)\right)$. Since the number of blocks is $\left\lceil\log ^{1-\alpha} N(\zeta)\right\rceil$ and each block requires $O\left(\log ^{\alpha} N(\zeta) /(\log \log N(\zeta))\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}\right.$-Kleinian integers, we conclude that the overall number of Kleinian integers used to represent $\zeta$ is

$$
O\left(\frac{\log ^{\alpha} N(\zeta)}{\log \log N(\zeta)} \log ^{1-\alpha} N(\zeta)\right)=O\left(\frac{\log N(\zeta)}{\log \log N(\zeta)}\right)
$$

The exponents of $\tau-1$ and $\tau^{2}-\tau-1$ are bounded by $O\left(\log ^{\alpha} N(\zeta)\right)$.

Theorem 4 is in fact constructive and leads to the following sublinear point multiplication algorithm (Algorithm 3).

```
\(\overline{\text { Algorithm } 3}\) Point multiplication algorithm using \(\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}\) -
expansions.
Input: An Kleinian integer \(\zeta\), a point \(P\) on a Koblitz curve, a real constant \(\alpha\) with
    \(0<\alpha<1 / 2\)
Output: \(Q=\zeta P\)
    Compute in succession for \(i=0,1, \ldots,\left\lceil\log ^{\alpha} N(\zeta)\right\rceil\) the points \(P_{i}^{(1)}=(\tau-1) P_{i-1}^{(1)}\)
    and \(P_{i}^{(2)}=\left(\tau^{2}-\tau-1\right) P_{i-1}^{(2)}\) where \(P_{0}^{(1)}=P_{0}^{(2)}=P\).
    Compute the points \(Q_{i_{1}, i_{2}}=P_{i_{1}}^{(1)}+P_{i_{2}}^{(2)}\) for \(i_{i}, i_{2}=0,1, \ldots,\left\lceil\log ^{\alpha} N(\zeta)\right\rceil\).
    Compute a \(\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}\)-expansion of the form (5) using Theorem 4.
    Apply in succession \(\tau\)-NAF based point multiplications based on (5) to compute \(Q\).
```

The analysis of Alg. 3 is simple. Step 1 requires $O\left(\log ^{\alpha} N(\zeta)\right)$ point additions and Step 2 requires $O\left(\log ^{2 \alpha} N(\zeta)\right)$ point additions. Because $\alpha<1 / 2$, the total number of point additions for Steps 1 and 2 is $o(\log N(\zeta))$. According to Theorem 4, Step 3 requires $O(\log N(\zeta) /(\log \log N(\zeta)))$ point additions. The total number of point additions for Alg. 3 is therefore $O(\log N(\zeta) /(\log \log N(\zeta)))$. Thus, one can compute $k P$ in $O(\log k /(\log \log k))$ point additions by computing $\zeta \equiv k\left(\bmod \left(\tau^{m}-1\right) /(\tau-1)\right)$ and applying Alg. 3 to compute $\zeta P$.

Note that the first two steps of Alg. 3 are independent of $k$. If a fixed base point $P$ is to be used, the points $Q_{i_{1}, i_{2}}$ may be precomputed.

The parameter $\alpha$ can be chosen in a variety of ways. The total number of point additions required by all three steps is roughly $\log ^{\alpha} N(\zeta)+\log ^{2 \alpha} N(\zeta)+$ $2 \log N(\zeta) /(\alpha \log \log N(\zeta))$; for $163<N(\zeta)<571$, taking $\alpha$ such that $0.365<$ $\alpha<0.368$ minimizes this quantity. Smaller values of $\alpha$ reduce the number of points $Q_{i_{1}, i_{2}}$ that must be precomputed and stored at the cost of increasing the number of point additions that must be performed in Step 3. On the other hand, larger values of $\alpha$ decrease the number of point additions in Step 3 at the cost of having to precompute and store more points.

## 3 A Practical Blocking Algorithm

Although, as proved in Theorem 4, using $\left\{\tau, \tau-1, \tau^{2}-\tau-1\right\}$-expansions does lead to a sublinear point multiplication algorithm, the resulting algorithm is likely not suitable for practical purposes. Nevertheless, assuming the truth of Conjecture 1, we can devise an efficient algorithm that computes $\{\tau, \tau-1\}$ expansions with sublinear density of Kleinian integers. This algorithm is based on the following theorem.

Theorem 5. Assuming Conjecture 1, every Kleinian integer, $\xi=a+b \tau$, can be represented as the sum of at most $O(\log N(\xi) / \log \log N(\xi))\{\tau, \tau-1\}$-Kleinian integers such that the largest power of $\tau-1$ is $O(\log N(\xi) / \log \log N(\xi))$.

The proof, omitted for brevity, is quite similar to that of Theorem 4. The $\tau$-adic expansions of $a$ and $b$ are broken into $\log \log N(\xi)$ blocks and the conjecture is applied to each block. In fact, this observation gives us an efficient method to compute $\{\tau, \tau-1\}$-expansions with sublinear density under Conjecture 1. The idea, described in Alg. 4, is to apply the blocking strategy described in the proof and compute optimal $\{\tau, \tau-1\}$-expansions for each block.

```
Algorithm 4 Blocking algorithm for computing \(\{\tau, \tau-1\}\)-expansions.
Input: A Kleinian integer \(\xi=e+f \tau\), block size \(w\), precomputed table of the minimal
    \(\{\tau, \tau-1\}\)-expansion of every Kleinian integer \(\sum_{i=0}^{w-1} d_{i} \tau^{i}, d_{i} \in\{0,1\}\)
Output: List \(L\) of \(\{\tau, \tau-1\}\)-Kleinian integers representing \(\{\tau, \tau-1\}\)-expansion of \(\xi\)
    \(L=\emptyset\)
    Compute the \(\tau\)-adic expansion of \(\xi=\sum_{i=0}^{l} d_{i} \tau^{i}, d_{i} \in\{0,1\}\)
    for \(j=0\) to \(\lceil l / w\rceil\) do
        \(\{\) Process blocks of length \(w\}\)
        Find minimal \(\{\tau, \tau-1\}\)-expansion of \(\sum_{i=0}^{w-1} d_{i+j w} \tau^{i}\) from the precomputed table
        Multiply each term of the expansion by \(\tau^{j w}\) and add to \(L\)
```

There are four important points regarding the implementation Alg. 4:

1. All powers of $\tau$ can be reduced modulo $m$, as $\left(\tau^{m}\right) P=P$ for all $P \in E_{a}\left(\mathbb{F}_{2^{m}}\right)$.
2. The bit-string $d_{w-1} \ldots d_{1} d_{0}$ corresponding to any block can be used as an index into the table of minimal $\{\tau, \tau-1\}$-expansions.
3. One can choose the size of the blocks based on available memory. The larger the block size, the lower the density of the $\{\tau, \tau-1\}$-expansions produced.
4. If the block size is not too big, one can precompute the minimal $\{\tau, \tau-1\}$ expansion of every Kleinian integer of the form $\sum_{i=0}^{w-1} d_{i} \tau^{i}, d_{i} \in\{0,1\}$, thereby ensuring as low a density as possible. This precomputation can be done using exhaustive search and need only be done once per elliptic curve, as it does not depend on $k$ nor the base point $P$.


Fig. 1. Expansion of $-104+50 \tau$ using Alg. 4.

### 3.1 Example

Consider the representation of 6465 into a $\{\tau, \tau-1\}$-expansion by using the two different algorithm we have described. Assume that we intend to compute (6465) $P$ for some point $P \in E_{1}\left(\mathbb{F}_{2^{163}}\right)$, so $\tau=(1+\sqrt{-7}) / 2$. As in the case of computing the $\tau$-NAF expansion, we first do a partial reduction of 6465 modulo $\left(\tau^{163}-1\right) /(\tau-1)$ as in [5], yielding $\xi=-104+50 \tau$. The greedy algorithm, Alg. 2, returns

$$
\xi=\tau^{8}(\tau-1)+\tau^{2}(\tau-1)^{2}+\tau^{8}(\tau-1)^{6}
$$

The blocking algorithm, Alg. 4, using a block size $w=7$ returns the same representation.

Fig. 1 illustrates the idea behind the blocking algorithm. The $\tau$-adic expansion of $\xi$ is $\tau^{2}+\tau^{5}+\tau^{9}+\tau^{11}+\tau^{12}+\tau^{13}$. This 14-bit expansion of $\xi$ is broken into two 7 -bit blocks. The right block corresponds to $\tau^{2}+\tau^{5}$, and $\tau^{2}(\tau-1)^{2}$ is its optimal $\{\tau, \tau-1\}$-expansion. The left block corresponds to $\tau^{2}+\tau^{4}+\tau^{5}+\tau^{6}$, and $\tau(\tau-1)+\tau(\tau-1)^{6}$ is its optimal expansion. Finally, multiplying the expression for the left block by $\tau^{7}$ yields the $\{\tau, \tau-1\}$-expansion of $\xi$.

To see the usefulness of this idea, notice that the $\{\tau, \tau-1\}$-expansion obtained is very sparse. Of the 63 possible terms that could occur, assuming $\tau^{8}$ is the maximum power of $\tau$ and $(\tau-1)^{6}$ is the maximum power of $\tau-1$, only three actually occur in the expansion. Furthermore, note that when computing $k P$ using this representation, each power of $\tau-1$ corresponds to a one-dimensional $\tau$-adic expansion, and that each of these is very sparse.

### 3.2 Numerical Evidence

In this section, we present results from software implementations of Alg. 2 and Alg. 4. The objective is to compare the density of the $\{\tau, \tau-1\}$-expansions computed by our algorithms with $\tau$-NAF expansions. Our algorithms and the algorithm for computing the $\tau$-NAF [5] of $k$ were implemented in C, using the GMP library for multi-precision integer arithmetic. Tests were run on an Intel Xeon 2.8 GHz CPU running Linux.

Theorem 5 states that our conversion algorithm outputs expansions of $k$ with sublinear density even if the maximum power of $\tau-1$ is bounded by some constant $\max \left(b_{i}\right)$ as long as any sublinear expansion exists. For practical purposes,
we need to know what value of $\max \left(b_{i}\right)$ gives us minimal weight expansions on average. We computed the average number of point additions required to compute $k P$ using a $\{\tau, \tau-1\}$-expansion of $k$, i.e., the number of non-zero terms in the expansion plus $\max \left(b_{i}\right)-1$. For each size of $k$ between 160 and 600 bits, the optimal value of $\max \left(b_{i}\right)$ starts at 4 and increases to 12 as the bit length of $k$ increases. As shown in Sec. 4.4, $\max \left(b_{i}\right)=3$ turns out to be optimal for our FPGA implementation of point multiplication on $E_{1}\left(\mathbb{F}_{2^{163}}\right)$.

In Table 1 we list the average number of point additions required to compute $k P$ on the five NIST-recommended Koblitz curves [4] when using $\tau$-NAF, our greedy $\{\tau, \tau-1\}$-expansion algorithm (Alg. 2), and our blocking-based Alg. 4 using block lengths of $w=5,10,16$ and $\max \left(b_{i}\right)=6$. In all cases the data are taken as the average over 500000 random values of $k$. Our algorithm requires significantly fewer point additions than $\tau$-NAF in all cases.

Table 1. The average number of point additions required to compute $k P$ for the five Koblitz curves in [4].

|  |  | Alg. 2 | Alg. 4 (blocking) |  |  |
| ---: | ---: | ---: | ---: | ---: | ---: |
| $\log _{2} k$ | $\tau$-NAF | (greedy | $w=5$ | $w=10$ | $w=16$ |
| 163 | 54.25 | 36.37 | 47.86 | 40.00 | 37.22 |
| 233 | 77.59 | 49.31 | 66.23 | 54.96 | 50.76 |
| 283 | 94.25 | 58.64 | 79.37 | 65.66 | 60.49 |
| 409 | 137.12 | 81.84 | 113.64 | 93.63 | 85.68 |
| 571 | 190.25 | 111.90 | 154.98 | 127.21 | 117.04 |

## 4 FPGA Implementation

An FPGA implementation was designed in order to investigate the performance of the new algorithm in practice. The design implements $k P$ on the NIST curve K-163 defined by (1), where $a=1$, over $\mathbb{F}_{2^{163}}$ [4].

As the number of zero coefficients in a $\{\tau, \tau-1\}$-expansion is large, a normal basis $\mathbb{F}_{2^{m}}$ was selected. In a normal basis, an element $A \in \mathbb{F}_{2^{m}}$ is represented as $A=\sum_{i=0}^{m-1} a_{i} \alpha^{2^{i}}$ where $a_{i} \in\{0,1\}$ and $\alpha^{2^{i}} \neq \alpha^{2^{j}}$ for all $i \neq j$ and $\alpha^{2^{m}}=\alpha$. Thus, it is obvious that the squaring operation $A^{2}$ is a cyclic right shift of the bit vector $\left(a_{0}, a_{1}, \ldots, a_{m-1}\right)$ which is fast if implemented in hardware.

Affine coordinates, $\mathcal{A}$, and López-Dahab coordinates, $\mathcal{L D}$ [17], are used for representing points on $E_{a}\left(\mathbb{F}_{2^{m}}\right)$. In $\mathcal{A}$, a point $P$ is represented with two coordinates as $P=(x, y)$, and, in $\mathcal{L D}$, with three coordinates as $P=(X, Y, Z)$. The $\mathcal{L D}$ triple represents the point $\left(X / Z, Y / Z^{2}\right)$ in $\mathcal{A}$ [17]. Three elements $x, y, \bar{y}=x+y \in \mathbb{F}_{2^{m}}$ are required to represent $P$ and $-P$ in $\mathcal{A}$. A point addition in $\mathcal{A}$ is performed as presented, e.g., in [18], and its cost is $\mathrm{I}+2 \mathrm{M}+\mathrm{S}+8 \mathrm{~A}$ where $\mathrm{I}, \mathrm{M}, \mathrm{S}$, and A denote inversion, multiplication, squaring, and addition in $\mathbb{F}_{2^{m}}$, respectively. A point addition in $\mathcal{L D}$ is performed as presented in [19], and it
requires $13 \mathrm{M}+4 \mathrm{~S}+9 \mathrm{~A}$. A point addition $Q+P$, where $Q$ is in $\mathcal{L D}$ and $P$ in $\mathcal{A}$, is called the mixed coordinate point addition, and it requires only $9 \mathrm{M}+5 \mathrm{~S}+9 \mathrm{~A}$ [20]. If the curve is fixed and both $P$ and $-P$ are available, one multiplication and one addition can be omitted resulting $8 \mathrm{M}+5 \mathrm{~S}+8 \mathrm{~A}$. The $\mathcal{A} \mapsto \mathcal{L D}$ mapping does not require any operations in $\mathbb{F}_{2^{m}}$ while $\mathcal{L D} \mapsto \mathcal{A}$ requires $\mathrm{I}+2 \mathrm{M}+\mathrm{S}$. The cost of a Frobenius mapping is 3 S in $\mathcal{L D}$ and 2 S in $\mathcal{A}$. An inversion in $\mathbb{F}_{2^{m}}$ is computed as suggested by Itoh and Tsujii in [21]. The Itoh-Tsujii inversion requires $m-1$ squarings and $\left\lfloor\log _{2}(m-1)\right\rfloor+H_{w}(m-1)-1$ multiplications, where $H_{w}(m-1)$ is the Hamming weight of $m-1$ [21]. Hence, $\mathrm{I}=9 \mathrm{M}+162 \mathrm{~S}$ if $m=163$.

Different coordinates are used in Alg. 1 as follows: the point addition in $\mathcal{A}$ is used in computing $P_{i}$ so that the point addition in mixed coordinates can be used in $S \leftarrow S \pm P_{i}$ computations. Because the results of row multiplications are in $\mathcal{L D}$, the point addition in $\mathcal{L D}$ must be used for $Q \leftarrow Q+S$ computations.

The implementation was designed especially for Xilinx Virtex-II family FPGAs. The implementation includes a field arithmetic processor (FAP) for arithmetic in $\mathbb{F}_{2^{m}}$, control logic for controlling the FAP, and a converter for converting $k$ to a $\{\tau, \tau-1\}$-expansion. The FAP is considered in Sec. 4.1, the control logic in Sec. 4.2, and the conversion unit in Sec. 4.3.

### 4.1 Field Arithmetic Processor (FAP)

The FAP includes a multiplier, a squarer, an adder, a storage element and control logic. A storage element for $m$-bit elements of $\mathbb{F}_{2^{m}}$ is required in order to store points and temporary variables during computation of $k P$. As Xilinx devices offer embedded memory blocks which can be used without consuming logic resources, the storage element is implemented in BlockRAMs. One dual-port BlockRAM can be configured into a $512 \times 36$-bit mode. All $m$ bits of an element must be accessed in parallel in the FAP architecture. Hence, $\left\lceil\frac{m}{36}\right\rceil=5$ BlockRAMs are required. Write and read operations require both one clock cycle, i.e. $\mathrm{W}=\mathrm{R}=1$.

The squarer is a shifter which is capable of performing operations $A^{2^{d}}$, where $A \in \mathbb{F}_{2^{m}}$ and $0 \leq d \leq d_{\max }=2^{6}-1$, Thus, $A^{2^{d}}$ operations can be performed with a cost of S . Addition in $\mathbb{F}_{2^{m}}$ is simply a bitwise exclusive-or (xor). Both squaring and addition are performed in one clock cycle, i.e. $S=A=1$.

Field multiplication is critical for the overall performance. The multiplier is a digit-serial implementation of the Massey-Omura multiplier [22]. In a bit-serial Massey-Omura multiplier, one bit of the output is calculated in one clock cycle and, hence, $m$ cycles are required in total. One bit $c_{i}$ of the result $C=A \times B$ is computed from $A$ and $B$ by using an $F$-function. The $F$-function is field specific, and the same $F$ is used for all output bits $c_{i}$ as follows: $c_{i}=F\left(A_{\lll}, B_{\lll}\right)$, where $\lll i$ denotes cyclical left shift by $i$ bits. [4, 22]

In a digit-serial implementation, $D$ bits are computed in parallel. Hence, $\left\lceil\frac{m}{D}\right\rceil$ cycles are required in one multiplication. In this FAP, $D=24$. The $F$-function is pipelined in order to increase clock frequency by adding one register stage. As loading the operands into the shift registers requires one clock cycle and pipelining increases latency by one clock cycle, the latency is $\mathrm{M}=\left\lceil\frac{163}{24}\right\rceil+2=9$.

### 4.2 Control Logic

Logic controlling the FAP consists of a storage for $k$, a control finite state machine (FSM) and a ROM for control sequences.

The implementation handles $k$ in a coded form. The coding is performed using $\kappa:\{s, d\}$ symbols, where $s \in\{0, \overline{0}, 1, \overline{1}\}$ and $0 \leq d \leq d_{\max } \cdot \overline{0}$ is a symbol reserved for a row change not -0 . Coding is started from the first non-zero signed bit of the first row and it proceeds as follows: $s$ is the signed bit starting a symbol and $d$ is the number of Frobenius mappings following $s$, i.e. the number of consecutive zeros plus one (the Frobenius map associated with the start bit of the next symbol). If the run of consecutive Frobenius maps is longer than $d_{\max }$, the run must be divided into two symbols and, for the latter one, $s=0$. Each $\kappa$, with the exception of the row change symbol, transforms into an operation $S \leftarrow$ $\tau^{d}(S+s P)$ on $E_{a}\left(\mathbb{F}_{2^{m}}\right)$. Let $Z(k)$ denote the maximum number of consecutive Frobenius mappings required by $k$. Then, the number of $\kappa$-symbols, $e$, required to represent $k$, is given by $e \geq H_{w}(k)+j$, with equality if and only if $d_{\max } \geq Z(k)$.

Control FSM takes $\kappa$-symbols as input and, according to $s$ and $d$ of $\kappa$, it sets addresses of the control sequence ROM. The control sequences controlling the FAP consist of successive FAP instructions directly controlling the FAP. There are separate control sequences for $P_{i+1} \leftarrow \tau P_{i}-P_{i}$ computation (Frobenius map and point addition in $\mathcal{A}$ ), point addition and subtraction (point addition in the mixed coordinates), Frobenius map, row change (point addition in $\mathcal{L D}$ ), and $\mathcal{L D} \mapsto \mathcal{A}$ mapping. They are all stored in a ROM implemented in a BlockRAM.

If implemented so that, for each operation, the operands would be first read from the memory, then the operation calculated and finally the result saved to the memory, the latency of an operation would be the latency of the operation ( $M, S$ or $A$ ) plus two clock cycles $(R+W)$. However, different operations can be performed with the same operands without reading the operands more than once, and the operands of the next operation can be read while the previous operation is performed if the result is not required in the next operation. When the control sequences were carefully hand-optimized, different operations have the following latencies: point addition in the mixed coordinates $L_{\mathcal{M}}=98$, the Frobenius map $\mathrm{L}_{F}=6$, row change (point addition in $\mathcal{L D}$ ) $\mathrm{L}_{\mathcal{L D}}=153$, the computation of $P_{i} \mathrm{~L}_{P_{i}}=182$, and the $\mathcal{L D} \mapsto \mathcal{A}$ mapping $\mathrm{L}_{\mathcal{L D} \mapsto \mathcal{A}}=160$. The first point addition of each row is simply $S \leftarrow \pm P_{i}$ and the first row change operation is given by $Q \leftarrow S$. Both of these operations have a latency of $\mathrm{L}_{C}=6$. In the beginning, an initialization including, e.g., the transferring of $P$ into the FAP, needs to be performed. The latency of the initialization is $\mathrm{L}_{I}=10$. Thus, it follows that the latency of the computation of $k P$ becomes

$$
\begin{equation*}
\mathrm{L}_{k P}=\left(H_{w}(k)-(j+1)\right) \mathrm{L}_{\mathcal{M}}+(j+2) \mathrm{L}_{C}+(e-j) \mathrm{L}_{F}+j\left(\mathrm{~L}_{\mathcal{L D}}+\mathrm{L}_{P_{i}}\right)+\mathrm{L}_{I}+\mathrm{L}_{\mathcal{L D} \mapsto \mathcal{A}} \tag{6}
\end{equation*}
$$

and, by assuming that $d_{\max } \geq Z(k)$, i.e. $e=H_{w}(k)+j,(6)$ simplifies to

$$
\begin{equation*}
\mathrm{L}_{k P}=104 H_{w}(k)+243 j+84 \tag{7}
\end{equation*}
$$

### 4.3 Conversion Unit

The conversion unit, which converts an integer $k$ into a $\{\tau, \tau-1\}$-expansion, is a straightforward implementation of Alg. 4, our blocking-based method.

The main part of this unit is an ALU, which has two integer multipliers, each of which makes use of one 18 -bit by 18 -bit embedded multiplier to create 102-bit by 102-bit products. The ALU also includes adders, shifters and the rounding function required by the partial reduction algorithm [5]. The conversion unit uses the ALU and two intermediate registers for reducing every integer $k$ to an equivalent $r_{0}+r_{1} \tau$, then gets the $\tau$-adic expansion by a shift-and-add circuit, which produces one bit per cycle, from the least significant bit to the most significant bit.

For our implementation, we selected a block size of 10 , so every 10 bits of the $\tau$-adic expansion are used as an index into a look-up table. This table has one entry for each possible index $\left(b_{9} b_{8} \ldots b_{0}\right), b_{i} \in\{0,1\}$, where each entry is the optimal $\{\tau, \tau-1\}$-expansion of $\sum_{i=0}^{9} b_{i} \tau^{i}$ allowing a maximum exponent of 6 for $\tau-1$. At most 3 terms of the form $\pm \tau^{a}(\tau-1)^{b}$ are required for each representation, so each entry in the table consists of three tuples of the form $\left(d_{n}, i_{n}, j_{n}\right)$ representing $d_{n} \tau^{i_{n}}(\tau-1)^{j_{n}}$. Hence, each entry requires 27 bits and the whole look-up table requires 27 KB RAM. Note that, according to the data in Sec. 3.2, using a block size of 5 would still give us a significant improvement over $\tau$-NAF and in this case the table would require less than 1 KB .

Because integer operations are slower than the $\mathbb{F}_{2^{m}}$ operations in the FAP, the conversion unit will be the bottleneck if the two units use the same clock. So a dual-port RAM is used in order to separate these units into different clock domains. The look-up results are written into the dual-port RAM using one port, and the ECC processor will read them out from the another port later.

### 4.4 Results

The FPGA design was written in VHDL and implemented on a Xilinx VirtexII XC2V2000-6. The design was synthesized with Synplify Pro 8.0 and Xilinx ISE 6.2 was used for the place \& route. The design (excluding the converter) requires 6,494 slices and 6 BlockRAMs on the XC2V2000-6, and it operates at a maximum clock frequency of $f_{\max }=128 \mathrm{MHz}$. The converter requires 2251 slices, 2 BlockRAMs and 2 multipliers. The maximum clock frequency is 88 MHz . It takes 335 clock cycles, or $3.81 \mu$ s to convert one 163 -bit integer.

Average timings of the design are presented in Table 2. The latency $L_{k P}$ is given by (7), and timings are calculated using $f_{\max }$. The time consumed in the conversion is neglected. Table 2 shows that the best performance is achieved when $j=3$ which is smaller than estimated in Sec. 3.2, because the latencies of point additions differ. In Sec. 3.2, all point additions were assumed equal.

Numerous publications considering implementation of elliptic curve cryptography on FPGAs have been published, e.g., in [23-26]. To the best of the authors' knowledge, the only FPGA-based implementation using $\tau$-NAF expansions was presented by Lutz and Hasan [26] where a $k P$ operation on $E_{1}\left(\mathbb{F}_{2^{163}}\right)$ was reported to require $75 \mu \mathrm{~s}$ on a Xilinx Virtex-E XCV2000E.

Table 2. Performance calculations of the FPGA implementation on a Xilinx Virtex-II XC2V2000-6 with different values of $j . H_{w}(k)$ for $j>0$ are based on empirical data. The numbers of point additions in the mixed coordinates, in $\mathcal{A}$ and in $\mathcal{L D}$ are denoted as $\mathcal{M}, \mathcal{A}$ and $\mathcal{L D}$, respectively.

| $j$ | $H_{w}(k)$ | $\mathcal{M}$ | $\mathcal{A} \mathcal{L D}$ | $\mathrm{L}_{k P}$ Time $(\mu \mathrm{s})$ |  |  |
| :--- | :--- | :--- | :--- | :--- | ---: | ---: |
| 0 | 54.33 | 53.33 | 0 | 0 | 5735 | 44.80 |
| 2 | 39.47 | 36.47 | 2 | 2 | 4675 | 36.52 |
| 3 | 36.18 | 32.18 | 3 | 3 | $\mathbf{4 5 7 6}$ | $\mathbf{3 5 . 7 5}$ |
| 4 | 34.74 | 29.74 | 4 | 4 | 4669 | 36.48 |
| 5 | 33.42 | 27.42 | 5 | 5 | 4775 | 37.30 |
| 6 | 32.22 | 25.22 | 6 | 6 | 4893 | 38.23 |

## 5 Further Work

Our results demonstrate that $\{\tau, \tau-1\}$-expansions lead to a competitive point multiplication algorithm for Koblitz curves when the base point $P$ is not fixed. Nevertheless, there are a number of aspects we are continuing to explore.

The latency of a point multiplication using our FPGA implementation could be significantly reduced at the expense of larger area requirements by computing each row in parallel. This possibility will be studied in the future. In addition, alternative choices of the bases, or even using three bases, may lead to further improvements.

Our point multiplication algorithm does not require any precomputations involving the base point $P$ nor storage of additional points, and hence is wellsuited to applications where $P$ is random. We are investigating the possibility of generalizing window methods, using two-dimensional windows, to our algorithm in order to obtain further improvements when precomputations involving $P$ are permitted.

Although our numerical data suggests that the density of the $\{\tau, \tau-1\}$ expansions obtained by our conversion algorithm is sublinear in the bit length of $k$, we do not yet have a proof of this fact. In addition, our conversion algorithm requires a modest amount of storage. These precomputed quantities are independent of both the base point $P$ and multiplier $k$ and can be viewed as part of the domain parameters. Nevertheless, we continue to search for an efficient memory-free conversion algorithm.

## References

1. Koblitz, N.: Elliptic curve cryptosystems. Math. Comp. 48 (1987) 203-209
2. Miller, V.: Use of elliptic curves in cryptography. In: CRYPTO '85. Volume 218 of Lecture Notes in Computer Science (LNCS). (1986) 417-426
3. Koblitz, N.: CM-curves with good cryptographic properties. In: CRYPTO '91. Volume 576 of LNCS. (1992) 279-287
4. National Institute of Standards and Technology (NIST): Digital signature standard (DSS). Federal Information Processing Standard, FIPS PUB 186-2 (2000)
5. Solinas, J.: Efficient arithmetic on Koblitz curves. Designs, Codes, and Cryptography 19 (2000) 195-249
6. Avanzi, R., Heuberger, C., Prodinger, H.: Minimality of the Hamming weight of the $\tau$-NAF for Koblitz curves and improved combination with point halving. In: SAC 2005. Volume 3897 of LNCS. (2005) 332-344
7. Dimitrov, V., Jullien, G., Miller, W.: An algorithm for modular exponentiation. Inform. Process. Lett. 66 (1998) 155-159
8. Ciet, M., Sica, F.: An analysis of double base number systems and a sublinear scalar multiplication algorithm. In: Mycrypt 2005. Volume 3715 of LNCS. (2005) 171-182
9. Dimitrov, V., Imbert, L., Mishra, P.: Efficient and secure elliptic curve point multiplication using double-base chains. In: ASIACRYPT 2005. Volume 3788 of LNCS. (2005) 59-78
10. Avanzi, R., Sica, F.: Scalar multiplication on Koblitz curves using double bases. Technical Report Number 2006/067, Cryptology ePrint Archive (2006)
11. Sica, F.: Personal communication. (2006)
12. Conway, J., Smith, D.: On quaternions and octonions. AK Peters (2003)
13. Tijdeman, R.: On integers with many small prime factors. Compos. Math. 26 (1973) 319-330
14. Baker, A.: Linear forms in the logarithms of algebraic numbers IV. Mathematica 15 (1968) 204-216
15. Mignotte, M., Waldshmidt, M.: Linear forms in two logarithms and Schneider's method III. In: Annales Fas. Sci. Toulouse. (1990) 43-75
16. Tijdeman, R.: Personal communication. (2006)
17. López, J., Dahab, R.: Improved algorithms for elliptic curve arithmetic in $G F\left(2^{n}\right)$. In: SAC '98. Volume 1556 of LNCS. (1998) 201-212
18. Doche, C., Lange, T.: Arithmetic of elliptic curves. In Cohen, H., Frey, G., eds.: Handbook of Elliptic and Hyperelliptic Curve Cryptography. Chapman \& Hall/CRC (2006) 267-302
19. Higuchi, A., Takagi, N.: A fast addition algorithm for elliptic curve arithmetic in $G F\left(2^{n}\right)$ using projective coordinates. Inform. Process. Lett. 76 (2000) 101-103
20. Al-Daoud, E., Mahmod, R., Rushdan, M., Kilicman, A.: A new addition formula for elliptic curves over $G F\left(2^{n}\right)$. IEEE Trans. Comput. 51 (2002) 972-975
21. Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in $G F\left(2^{m}\right)$ using normal bases. Inform. Comput. 78 (1988) 171-177
22. Wang, C., Troung, T., Shao, H., Deutsch, L., Omura, J., Reed, I.: VLSI architectures for computing multiplications and inverses in $G F\left(2^{m}\right)$. IEEE Trans. Comput. 34 (1985) 709-717
23. Bednara, M., Daldrup, M., von zur Gathen, J., Shokrollahi, J., Teich, J.: Reconfigurable implementation of elliptic curve crypto algorithms. In: IPDPS 2002. (2002) 157-164
24. Leong, P., Leung, K.: A microcoded elliptic curve processor using FPGA technology. IEEE Trans. VLSI Syst. 10 (2002) 550-559
25. Eberle, H., Gura, N., Shantz, S., Gupta, V.: A cryptographic processor for arbitrary elliptic curves over $G F\left(2^{m}\right)$. Technical Report SMLI TR-2003-123, Sun Microsystems, Inc. (2003)
26. Lutz, J., Hasan, A.: High performance FPGA based elliptic curve cryptographic co-processor. In: Proc. of the Int'l Conf. on Information Technology: Coding and Computing. Volume 2. (2004) 486-492

[^0]:    * Chan, Dimitrov and Jacobson are supported in part by NSERC of Canada.

