#! /usr/bin/env perl
# Copyright 2010-2020 The OpenSSL Project Authors. All Rights Reserved.
#
# Licensed under the OpenSSL license (the "License"). You may not use
# this file except in compliance with the License. You can obtain a copy
# in the file LICENSE in the source distribution or at
# https://www.openssl.org/source/license.html

#
# ====================================================================
# Written by Andy Polyakov <[email protected]> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# March, May, June 2010
#
# The module implements "4-bit" GCM GHASH function and underlying
# single multiplication operation in GF(2^128). "4-bit" means that it
# uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two
# code paths: vanilla x86 and vanilla SSE. Former will be executed on
# 486 and Pentium, latter on all others. SSE GHASH features so called
# "528B" variant of "4-bit" method utilizing additional 256+16 bytes
# of per-key storage [+512 bytes shared table]. Performance results
# are for streamed GHASH subroutine and are expressed in cycles per
# processed byte, less is better:
#
#               gcc 2.95.3(*)   SSE assembler   x86 assembler
#
# Pentium       105/111(**)     -               50
# PIII          68 /75          12.2            24
# P4            125/125         17.8            84(***)
# Opteron       66 /70          10.1            30
# Core2         54 /67          8.4             18
# Atom          105/105         16.8            53
# VIA Nano      69 /71          13.0            27
#
# (*)   gcc 3.4.x was observed to generate few percent slower code,
#       which is one of reasons why 2.95.3 results were chosen,
#       another reason is lack of 3.4.x results for older CPUs;
#       comparison with SSE results is not completely fair, because C
#       results are for vanilla "256B" implementation, while
#       assembler results are for "528B";-)
# (**)  second number is result for code compiled with -fPIC flag,
#       which is actually more relevant, because assembler code is
#       position-independent;
# (***) see comment in non-MMX routine for further details;
#
# To summarize, it's >2-5 times faster than gcc-generated code. To
# anchor it to something else SHA1 assembler processes one byte in
# ~7 cycles on contemporary x86 cores. As for choice of MMX/SSE
# in particular, see comment at the end of the file...

# May 2010
#
# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
# The question is how close is it to theoretical limit? The pclmulqdq
# instruction latency appears to be 14 cycles and there can't be more
# than 2 of them executing at any given time. This means that single
# Karatsuba multiplication would take 28 cycles *plus* few cycles for
# pre- and post-processing. Then multiplication has to be followed by
# modulo-reduction. Given that aggregated reduction method [see
# "Carry-less Multiplication and Its Usage for Computing the GCM Mode"
# white paper by Intel] allows you to perform reduction only once in
# a while we can assume that asymptotic performance can be estimated
# as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction
# and Naggr is the aggregation factor.
#
# Before we proceed to this implementation let's have closer look at
# the best-performing code suggested by Intel in their white paper.
# By tracing inter-register dependencies Tmod is estimated as ~19
# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
# processed byte. As implied, this is quite optimistic estimate,
# because it does not account for Karatsuba pre- and post-processing,
# which for a single multiplication is ~5 cycles. Unfortunately Intel
# does not provide performance data for GHASH alone. But benchmarking
# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
# alone resulted in 2.46 cycles per byte out of 16KB buffer. Note that
# the result accounts even for pre-computing of degrees of the hash
# key H, but its portion is negligible at 16KB buffer size.
#
# Moving on to the implementation in question. Tmod is estimated as
# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
# 2.16. How is it possible that measured performance is better than
# optimistic theoretical estimate? There is one thing Intel failed
# to recognize. By serializing GHASH with CTR in same subroutine
# former's performance is really limited to above (Tmul + Tmod/Naggr)
# equation. But if GHASH procedure is detached, the modulo-reduction
# can be interleaved with Naggr-1 multiplications at instruction level
# and under ideal conditions even disappear from the equation. So that
# optimistic theoretical estimate for this implementation is ...
# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
# at least for such small Naggr. I'd argue that (28+Tproc/Naggr)/16,
# where Tproc is time required for Karatsuba pre- and post-processing,
# is more realistic estimate. In this case it gives ... 1.91 cycles.
# Or in other words, depending on how well we can interleave reduction
# and one of the two multiplications the performance should be between
# 1.91 and 2.16. As already mentioned, this implementation processes
# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
# - in 2.02. x86_64 performance is better, because larger register
# bank allows to interleave reduction and multiplication better.
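#
# (Illustrative back-of-the-envelope check, not part of the original
# commentary: with Tmod~13 and Naggr=2 the aggregated estimate above
# is (28+13/2)/16 ~= 2.16 cycles per processed byte, while the
# "detached" estimate (28+Tproc/Naggr)/16 with Tproc~5 comes to
# (28+5/2)/16 ~= 1.91 cycles per processed byte.)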
#
# Does it make sense to increase Naggr? To start with it's virtually
# impossible in 32-bit mode, because of limited register bank
# capacity. Otherwise improvement has to be weighed against slower
# setup, as well as code size and complexity increase. As even
# optimistic estimate doesn't promise 30% performance improvement,
# there are currently no plans to increase Naggr.
#
# Special thanks to David Woodhouse for providing access to a
# Westmere-based system on behalf of Intel Open Source Technology Centre.

# January 2010
#
# Tweaked to optimize transitions between integer and FP operations
# on same XMM register, PCLMULQDQ subroutine was measured to process
# one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on Westmere.
# The minor regression on Westmere is outweighed by ~15% improvement
# on Sandy Bridge. Strangely enough attempt to modify 64-bit code in
# similar manner resulted in almost 20% degradation on Sandy Bridge,
# where original 64-bit code processes one byte in 1.95 cycles.

#####################################################################
# For reference, AMD Bulldozer processes one byte in 1.98 cycles in
# 32-bit mode and 1.89 in 64-bit.

# February 2013
#
# Overhaul: aggregate Karatsuba post-processing, improve ILP in
# reduction_alg9. Resulting performance is 1.96 cycles per byte on
# Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer.

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
push(@INC,"${dir}","${dir}../../perlasm");
require "x86asm.pl";

$output=pop;
open STDOUT,">$output";

&asm_init($ARGV[0],$x86only = $ARGV[$#ARGV] eq "386");

$sse2=0;
for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }

($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
$inp  = "edi";
$Htbl = "esi";

$unroll = 0;    # Affects x86 loop. Folded loop performs ~7% worse
                # than unrolled, which has to be weighed against
                # 2.5x x86-specific code size reduction.

sub x86_loop {
    my $off = shift;
    my $rem = "eax";

    &mov ($Zhh,&DWP(4,$Htbl,$Zll));
    &mov ($Zhl,&DWP(0,$Htbl,$Zll));
    &mov ($Zlh,&DWP(12,$Htbl,$Zll));
    &mov ($Zll,&DWP(8,$Htbl,$Zll));
    &xor ($rem,$rem); # avoid partial register stalls on PIII

    # shrd practically kills P4, 2.5x deterioration, but P4 has
    # MMX code-path to execute. shrd runs tad faster [than twice
    # the shifts, move's and or's] on pre-MMX Pentium (as well as
    # PIII and Core2), *but* minimizes code size, spares register
    # and thus allows to fold the loop...
    if (!$unroll) {
        my $cnt = $inp;
        &mov ($cnt,15);
        &jmp (&label("x86_loop"));
        &set_label("x86_loop",16);
        for($i=1;$i<=2;$i++) {
            &mov (&LB($rem),&LB($Zll));
            &shrd ($Zll,$Zlh,4);
            &and (&LB($rem),0xf);
            &shrd ($Zlh,$Zhl,4);
            &shrd ($Zhl,$Zhh,4);
            &shr ($Zhh,4);
            &xor ($Zhh,&DWP($off+16,"esp",$rem,4));

            &mov (&LB($rem),&BP($off,"esp",$cnt));
            if ($i&1) {
                &and (&LB($rem),0xf0);
            } else {
                &shl (&LB($rem),4);
            }

            &xor ($Zll,&DWP(8,$Htbl,$rem));
            &xor ($Zlh,&DWP(12,$Htbl,$rem));
            &xor ($Zhl,&DWP(0,$Htbl,$rem));
            &xor ($Zhh,&DWP(4,$Htbl,$rem));

            if ($i&1) {
                &dec ($cnt);
                &js (&label("x86_break"));
            } else {
                &jmp (&label("x86_loop"));
            }
        }
        &set_label("x86_break",16);
    } else {
        for($i=1;$i<32;$i++) {
            &comment($i);
            &mov (&LB($rem),&LB($Zll));
            &shrd ($Zll,$Zlh,4);
            &and (&LB($rem),0xf);
            &shrd ($Zlh,$Zhl,4);
            &shrd ($Zhl,$Zhh,4);
            &shr ($Zhh,4);
            &xor ($Zhh,&DWP($off+16,"esp",$rem,4));

            if ($i&1) {
                &mov (&LB($rem),&BP($off+15-($i>>1),"esp"));
                &and (&LB($rem),0xf0);
            } else {
                &mov (&LB($rem),&BP($off+15-($i>>1),"esp"));
                &shl (&LB($rem),4);
            }

            &xor ($Zll,&DWP(8,$Htbl,$rem));
            &xor ($Zlh,&DWP(12,$Htbl,$rem));
            &xor ($Zhl,&DWP(0,$Htbl,$rem));
            &xor ($Zhh,&DWP(4,$Htbl,$rem));
        }
    }
    &bswap ($Zll);
    &bswap ($Zlh);
    &bswap ($Zhl);
    if (!$x86only) {
        &bswap ($Zhh);
    } else {
        &mov ("eax",$Zhh);
        &bswap ("eax");
        &mov ($Zhh,"eax");
    }
}

if ($unroll) {
    &function_begin_B("_x86_gmult_4bit_inner");
    &x86_loop(4);
    &ret ();
    &function_end_B("_x86_gmult_4bit_inner");
}

sub deposit_rem_4bit {
    my $bias = shift;

    &mov (&DWP($bias+0, "esp"),0x0000<<16);
    &mov (&DWP($bias+4, "esp"),0x1C20<<16);
    &mov (&DWP($bias+8, "esp"),0x3840<<16);
    &mov (&DWP($bias+12,"esp"),0x2460<<16);
    &mov (&DWP($bias+16,"esp"),0x7080<<16);
    &mov (&DWP($bias+20,"esp"),0x6CA0<<16);
    &mov (&DWP($bias+24,"esp"),0x48C0<<16);
    &mov (&DWP($bias+28,"esp"),0x54E0<<16);
    &mov (&DWP($bias+32,"esp"),0xE100<<16);
    &mov (&DWP($bias+36,"esp"),0xFD20<<16);
    &mov (&DWP($bias+40,"esp"),0xD940<<16);
    &mov (&DWP($bias+44,"esp"),0xC560<<16);
    &mov (&DWP($bias+48,"esp"),0x9180<<16);
    &mov (&DWP($bias+52,"esp"),0x8DA0<<16);
    &mov (&DWP($bias+56,"esp"),0xA9C0<<16);
    &mov (&DWP($bias+60,"esp"),0xB5E0<<16);
}


$suffix = $x86only ? "" : "_x86";

&function_begin("gcm_gmult_4bit".$suffix);
    &stack_push(16+4+1); # +1 for stack alignment
    &mov ($inp,&wparam(0)); # load Xi
    &mov ($Htbl,&wparam(1)); # load Htable

    &mov ($Zhh,&DWP(0,$inp)); # load Xi[16]
    &mov ($Zhl,&DWP(4,$inp));
    &mov ($Zlh,&DWP(8,$inp));
    &mov ($Zll,&DWP(12,$inp));

    &deposit_rem_4bit(16);

    &mov (&DWP(0,"esp"),$Zhh); # copy Xi[16] on stack
    &mov (&DWP(4,"esp"),$Zhl);
    &mov (&DWP(8,"esp"),$Zlh);
    &mov (&DWP(12,"esp"),$Zll);
    &shr ($Zll,20);
    &and ($Zll,0xf0);

    if ($unroll) {
        &call ("_x86_gmult_4bit_inner");
    } else {
        &x86_loop(0);
        &mov ($inp,&wparam(0));
    }

    &mov (&DWP(12,$inp),$Zll);
    &mov (&DWP(8,$inp),$Zlh);
    &mov (&DWP(4,$inp),$Zhl);
    &mov (&DWP(0,$inp),$Zhh);
    &stack_pop(16+4+1);
&function_end("gcm_gmult_4bit".$suffix);

&function_begin("gcm_ghash_4bit".$suffix);
    &stack_push(16+4+1); # +1 for 64-bit alignment
    &mov ($Zll,&wparam(0)); # load Xi
    &mov ($Htbl,&wparam(1)); # load Htable
    &mov ($inp,&wparam(2)); # load in
    &mov ("ecx",&wparam(3)); # load len
    &add ("ecx",$inp);
    &mov (&wparam(3),"ecx");

    &mov ($Zhh,&DWP(0,$Zll)); # load Xi[16]
    &mov ($Zhl,&DWP(4,$Zll));
    &mov ($Zlh,&DWP(8,$Zll));
    &mov ($Zll,&DWP(12,$Zll));

    &deposit_rem_4bit(16);

    &set_label("x86_outer_loop",16);
    &xor ($Zll,&DWP(12,$inp)); # xor with input
    &xor ($Zlh,&DWP(8,$inp));
    &xor ($Zhl,&DWP(4,$inp));
    &xor ($Zhh,&DWP(0,$inp));
    &mov (&DWP(12,"esp"),$Zll); # dump it on stack
    &mov (&DWP(8,"esp"),$Zlh);
    &mov (&DWP(4,"esp"),$Zhl);
    &mov (&DWP(0,"esp"),$Zhh);

    &shr ($Zll,20);
    &and ($Zll,0xf0);

    if ($unroll) {
        &call ("_x86_gmult_4bit_inner");
    } else {
        &x86_loop(0);
        &mov ($inp,&wparam(2));
    }
    &lea ($inp,&DWP(16,$inp));
    &cmp ($inp,&wparam(3));
    &mov (&wparam(2),$inp) if (!$unroll);
    &jb (&label("x86_outer_loop"));

    &mov ($inp,&wparam(0)); # load Xi
    &mov (&DWP(12,$inp),$Zll);
    &mov (&DWP(8,$inp),$Zlh);
    &mov (&DWP(4,$inp),$Zhl);
    &mov (&DWP(0,$inp),$Zhh);
    &stack_pop(16+4+1);
&function_end("gcm_ghash_4bit".$suffix);


if (!$x86only) {{{

&static_label("rem_4bit");

if (!$sse2) {{ # pure-MMX "May" version...

$S=12; # shift factor for rem_4bit

&function_begin_B("_mmx_gmult_4bit_inner");
# MMX version performs 3.5 times better on P4 (see comment in non-MMX
# routine for further details), 100% better on Opteron, ~70% better
# on Core2 and PIII... In other words effort is considered to be well
# spent... Since initial release the loop was unrolled in order to
# "liberate" register previously used as loop counter. Instead it's
# used to optimize critical path in 'Z.hi ^= rem_4bit[Z.lo&0xf]'.
# The path involves move of Z.lo from MMX to integer register,
# effective address calculation and finally merge of value to Z.hi.
# Reference to rem_4bit is scheduled so late that I had to >>4
# rem_4bit elements. This resulted in 20-45% improvement
# on contemporary µ-archs.
{
    my $cnt;
    my $rem_4bit = "eax";
    my @rem = ($Zhh,$Zll);
    my $nhi = $Zhl;
    my $nlo = $Zlh;

    my ($Zlo,$Zhi) = ("mm0","mm1");
    my $tmp = "mm2";

    &xor ($nlo,$nlo); # avoid partial register stalls on PIII
    &mov ($nhi,$Zll);
    &mov (&LB($nlo),&LB($nhi));
    &shl (&LB($nlo),4);
    &and ($nhi,0xf0);
    &movq ($Zlo,&QWP(8,$Htbl,$nlo));
    &movq ($Zhi,&QWP(0,$Htbl,$nlo));
    &movd ($rem[0],$Zlo);

    for ($cnt=28;$cnt>=-2;$cnt--) {
        my $odd = $cnt&1;
        my $nix = $odd ? $nlo : $nhi;

        &shl (&LB($nlo),4) if ($odd);
        &psrlq ($Zlo,4);
        &movq ($tmp,$Zhi);
        &psrlq ($Zhi,4);
        &pxor ($Zlo,&QWP(8,$Htbl,$nix));
        &mov (&LB($nlo),&BP($cnt/2,$inp)) if (!$odd && $cnt>=0);
        &psllq ($tmp,60);
        &and ($nhi,0xf0) if ($odd);
        &pxor ($Zhi,&QWP(0,$rem_4bit,$rem[1],8)) if ($cnt<28);
        &and ($rem[0],0xf);
        &pxor ($Zhi,&QWP(0,$Htbl,$nix));
        &mov ($nhi,$nlo) if (!$odd && $cnt>=0);
        &movd ($rem[1],$Zlo);
        &pxor ($Zlo,$tmp);

        push (@rem,shift(@rem)); # "rotate" registers
    }

    &mov ($inp,&DWP(4,$rem_4bit,$rem[1],8)); # last rem_4bit[rem]

    &psrlq ($Zlo,32); # lower part of Zlo is already there
    &movd ($Zhl,$Zhi);
    &psrlq ($Zhi,32);
    &movd ($Zlh,$Zlo);
    &movd ($Zhh,$Zhi);
    &shl ($inp,4); # compensate for rem_4bit[i] being >>4

    &bswap ($Zll);
    &bswap ($Zhl);
    &bswap ($Zlh);
    &xor ($Zhh,$inp);
    &bswap ($Zhh);

    &ret ();
}
&function_end_B("_mmx_gmult_4bit_inner");

&function_begin("gcm_gmult_4bit_mmx");
    &mov ($inp,&wparam(0)); # load Xi
    &mov ($Htbl,&wparam(1)); # load Htable

    &call (&label("pic_point"));
    &set_label("pic_point");
    &blindpop("eax");
    &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

    &movz ($Zll,&BP(15,$inp));

    &call ("_mmx_gmult_4bit_inner");

    &mov ($inp,&wparam(0)); # load Xi
    &emms ();
    &mov (&DWP(12,$inp),$Zll);
    &mov (&DWP(4,$inp),$Zhl);
    &mov (&DWP(8,$inp),$Zlh);
    &mov (&DWP(0,$inp),$Zhh);
&function_end("gcm_gmult_4bit_mmx");


# Streamed version performs 20% better on P4, 7% on Opteron,
# 10% on Core2 and PIII...
&function_begin("gcm_ghash_4bit_mmx");
    &mov ($Zhh,&wparam(0)); # load Xi
    &mov ($Htbl,&wparam(1)); # load Htable
    &mov ($inp,&wparam(2)); # load in
    &mov ($Zlh,&wparam(3)); # load len

    &call (&label("pic_point"));
    &set_label("pic_point");
    &blindpop("eax");
    &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

    &add ($Zlh,$inp);
    &mov (&wparam(3),$Zlh); # len to point at the end of input
    &stack_push(4+1); # +1 for stack alignment

    &mov ($Zll,&DWP(12,$Zhh)); # load Xi[16]
    &mov ($Zhl,&DWP(4,$Zhh));
    &mov ($Zlh,&DWP(8,$Zhh));
    &mov ($Zhh,&DWP(0,$Zhh));
    &jmp (&label("mmx_outer_loop"));

    &set_label("mmx_outer_loop",16);
    &xor ($Zll,&DWP(12,$inp));
    &xor ($Zhl,&DWP(4,$inp));
    &xor ($Zlh,&DWP(8,$inp));
    &xor ($Zhh,&DWP(0,$inp));
    &mov (&wparam(2),$inp);
    &mov (&DWP(12,"esp"),$Zll);
    &mov (&DWP(4,"esp"),$Zhl);
    &mov (&DWP(8,"esp"),$Zlh);
    &mov (&DWP(0,"esp"),$Zhh);

    &mov ($inp,"esp");
    &shr ($Zll,24);

    &call ("_mmx_gmult_4bit_inner");

    &mov ($inp,&wparam(2));
    &lea ($inp,&DWP(16,$inp));
    &cmp ($inp,&wparam(3));
    &jb (&label("mmx_outer_loop"));

    &mov ($inp,&wparam(0)); # load Xi
    &emms ();
    &mov (&DWP(12,$inp),$Zll);
    &mov (&DWP(4,$inp),$Zhl);
    &mov (&DWP(8,$inp),$Zlh);
    &mov (&DWP(0,$inp),$Zhh);

    &stack_pop(4+1);
&function_end("gcm_ghash_4bit_mmx");


}} else {{ # "June" MMX version...
# ... has slower "April" gcm_gmult_4bit_mmx with folded
# loop. This is done to conserve code size...
$S=16; # shift factor for rem_4bit

sub mmx_loop() {
# MMX version performs 2.8 times better on P4 (see comment in non-MMX
# routine for further details), 40% better on Opteron and Core2, 50%
# better on PIII... In other words effort is considered to be well
# spent...
    my $inp = shift;
    my $rem_4bit = shift;
    my $cnt = $Zhh;
    my $nhi = $Zhl;
    my $nlo = $Zlh;
    my $rem = $Zll;

    my ($Zlo,$Zhi) = ("mm0","mm1");
    my $tmp = "mm2";

    &xor ($nlo,$nlo); # avoid partial register stalls on PIII
    &mov ($nhi,$Zll);
    &mov (&LB($nlo),&LB($nhi));
    &mov ($cnt,14);
    &shl (&LB($nlo),4);
    &and ($nhi,0xf0);
    &movq ($Zlo,&QWP(8,$Htbl,$nlo));
    &movq ($Zhi,&QWP(0,$Htbl,$nlo));
    &movd ($rem,$Zlo);
    &jmp (&label("mmx_loop"));

    &set_label("mmx_loop",16);
    &psrlq ($Zlo,4);
    &and ($rem,0xf);
    &movq ($tmp,$Zhi);
    &psrlq ($Zhi,4);
    &pxor ($Zlo,&QWP(8,$Htbl,$nhi));
    &mov (&LB($nlo),&BP(0,$inp,$cnt));
    &psllq ($tmp,60);
    &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
    &dec ($cnt);
    &movd ($rem,$Zlo);
    &pxor ($Zhi,&QWP(0,$Htbl,$nhi));
    &mov ($nhi,$nlo);
    &pxor ($Zlo,$tmp);
    &js (&label("mmx_break"));

    &shl (&LB($nlo),4);
    &and ($rem,0xf);
    &psrlq ($Zlo,4);
    &and ($nhi,0xf0);
    &movq ($tmp,$Zhi);
    &psrlq ($Zhi,4);
    &pxor ($Zlo,&QWP(8,$Htbl,$nlo));
    &psllq ($tmp,60);
    &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
    &movd ($rem,$Zlo);
    &pxor ($Zhi,&QWP(0,$Htbl,$nlo));
    &pxor ($Zlo,$tmp);
    &jmp (&label("mmx_loop"));

    &set_label("mmx_break",16);
    &shl (&LB($nlo),4);
    &and ($rem,0xf);
    &psrlq ($Zlo,4);
    &and ($nhi,0xf0);
    &movq ($tmp,$Zhi);
    &psrlq ($Zhi,4);
    &pxor ($Zlo,&QWP(8,$Htbl,$nlo));
    &psllq ($tmp,60);
    &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
    &movd ($rem,$Zlo);
    &pxor ($Zhi,&QWP(0,$Htbl,$nlo));
    &pxor ($Zlo,$tmp);

    &psrlq ($Zlo,4);
    &and ($rem,0xf);
    &movq ($tmp,$Zhi);
    &psrlq ($Zhi,4);
    &pxor ($Zlo,&QWP(8,$Htbl,$nhi));
    &psllq ($tmp,60);
    &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
    &movd ($rem,$Zlo);
    &pxor ($Zhi,&QWP(0,$Htbl,$nhi));
    &pxor ($Zlo,$tmp);

    &psrlq ($Zlo,32); # lower part of Zlo is already there
    &movd ($Zhl,$Zhi);
    &psrlq ($Zhi,32);
    &movd ($Zlh,$Zlo);
    &movd ($Zhh,$Zhi);

    &bswap ($Zll);
    &bswap ($Zhl);
    &bswap ($Zlh);
    &bswap ($Zhh);
}

&function_begin("gcm_gmult_4bit_mmx");
    &mov ($inp,&wparam(0)); # load Xi
    &mov ($Htbl,&wparam(1)); # load Htable

    &call (&label("pic_point"));
    &set_label("pic_point");
    &blindpop("eax");
    &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

    &movz ($Zll,&BP(15,$inp));

    &mmx_loop($inp,"eax");

    &emms ();
    &mov (&DWP(12,$inp),$Zll);
    &mov (&DWP(4,$inp),$Zhl);
    &mov (&DWP(8,$inp),$Zlh);
    &mov (&DWP(0,$inp),$Zhh);
&function_end("gcm_gmult_4bit_mmx");


######################################################################
# Below subroutine is "528B" variant of "4-bit" GCM GHASH function
# (see gcm128.c for details). It provides further 20-40% performance
# improvement over above mentioned "May" version.

&static_label("rem_8bit");

&function_begin("gcm_ghash_4bit_mmx");
{ my ($Zlo,$Zhi) = ("mm7","mm6");
  my $rem_8bit = "esi";
  my $Htbl = "ebx";

    # parameter block
    &mov ("eax",&wparam(0)); # Xi
    &mov ("ebx",&wparam(1)); # Htable
    &mov ("ecx",&wparam(2)); # inp
    &mov ("edx",&wparam(3)); # len
    &mov ("ebp","esp"); # original %esp
    &call (&label("pic_point"));
    &set_label ("pic_point");
    &blindpop ($rem_8bit);
    &lea ($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit));

    &sub ("esp",512+16+16); # allocate stack frame...
    &and ("esp",-64); # ...and align it
    &sub ("esp",16); # place for (u8)(H[]<<4)

    &add ("edx","ecx"); # pointer to the end of input
    &mov (&DWP(528+16+0,"esp"),"eax"); # save Xi
    &mov (&DWP(528+16+8,"esp"),"edx"); # save inp+len
    &mov (&DWP(528+16+12,"esp"),"ebp"); # save original %esp
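
    # A rough map of the resulting stack frame, reconstructed from the
    # offsets used below (illustrative note, not part of the original
    # commentary):
    #
    #   %esp+0            (u8)(Htable[i]<<4), 16 bytes
    #   %esp+16           Htable[i], low halves, 16 entries of 8 bytes
    #   %esp+16+128       Htable[i], high halves
    #   %esp+16+256       (Htable[i]>>4), low halves
    #   %esp+16+256+128   (Htable[i]>>4), high halves
    #   %esp+528          Xi^inp scratch area
    #   %esp+528+16       saved Xi pointer, inp, inp+len, original %esp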

  { my @lo  = ("mm0","mm1","mm2");
    my @hi  = ("mm3","mm4","mm5");
    my @tmp = ("mm6","mm7");
    my ($off1,$off2,$i) = (0,0,);

    &add ($Htbl,128); # optimize for size
    &lea ("edi",&DWP(16+128,"esp"));
    &lea ("ebp",&DWP(16+256+128,"esp"));

    # decompose Htable (low and high parts are kept separately),
    # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
    for ($i=0;$i<18;$i++) {

        &mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16);
        &movq ($lo[0],&QWP(16*$i+8-128,$Htbl)) if ($i<16);
        &psllq ($tmp[1],60) if ($i>1);
        &movq ($hi[0],&QWP(16*$i+0-128,$Htbl)) if ($i<16);
        &por ($lo[2],$tmp[1]) if ($i>1);
        &movq (&QWP($off1-128,"edi"),$lo[1]) if ($i>0 && $i<17);
        &psrlq ($lo[1],4) if ($i>0 && $i<17);
        &movq (&QWP($off1,"edi"),$hi[1]) if ($i>0 && $i<17);
        &movq ($tmp[0],$hi[1]) if ($i>0 && $i<17);
        &movq (&QWP($off2-128,"ebp"),$lo[2]) if ($i>1);
        &psrlq ($hi[1],4) if ($i>0 && $i<17);
        &movq (&QWP($off2,"ebp"),$hi[2]) if ($i>1);
        &shl ("edx",4) if ($i<16);
        &mov (&BP($i,"esp"),&LB("edx")) if ($i<16);

        unshift (@lo,pop(@lo)); # "rotate" registers
        unshift (@hi,pop(@hi));
        unshift (@tmp,pop(@tmp));
        $off1 += 8 if ($i>0);
        $off2 += 8 if ($i>1);
    }
  }

    &movq ($Zhi,&QWP(0,"eax"));
    &mov ("ebx",&DWP(8,"eax"));
    &mov ("edx",&DWP(12,"eax")); # load Xi

    &set_label("outer",16);
  { my $nlo = "eax";
    my $dat = "edx";
    my @nhi = ("edi","ebp");
    my @rem = ("ebx","ecx");
    my @red = ("mm0","mm1","mm2");
    my $tmp = "mm3";

    &xor ($dat,&DWP(12,"ecx")); # merge input data
    &xor ("ebx",&DWP(8,"ecx"));
    &pxor ($Zhi,&QWP(0,"ecx"));
    &lea ("ecx",&DWP(16,"ecx")); # inp+=16
    #&mov (&DWP(528+12,"esp"),$dat); # save inp^Xi
    &mov (&DWP(528+8,"esp"),"ebx");
    &movq (&QWP(528+0,"esp"),$Zhi);
    &mov (&DWP(528+16+4,"esp"),"ecx"); # save inp

    &xor ($nlo,$nlo);
    &rol ($dat,8);
    &mov (&LB($nlo),&LB($dat));
    &mov ($nhi[1],$nlo);
    &and (&LB($nlo),0x0f);
    &shr ($nhi[1],4);
    &pxor ($red[0],$red[0]);
    &rol ($dat,8); # next byte
    &pxor ($red[1],$red[1]);
    &pxor ($red[2],$red[2]);

    # Just like in "May" version modulo-schedule for critical path in
    # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor'
    # is scheduled so late that rem_8bit[] has to be shifted *right*
    # by 16, which is why last argument to pinsrw is 2, which
    # corresponds to <<32=<<48>>16...
    for ($j=11,$i=0;$i<15;$i++) {

        if ($i>0) {
            &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo]
            &rol ($dat,8); # next byte
            &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8));

            &pxor ($Zlo,$tmp);
            &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
            &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4)
        } else {
            &movq ($Zlo,&QWP(16,"esp",$nlo,8));
            &movq ($Zhi,&QWP(16+128,"esp",$nlo,8));
        }

        &mov (&LB($nlo),&LB($dat));
        &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0);

        &movd ($rem[0],$Zlo);
        &movz ($rem[1],&LB($rem[1])) if ($i>0);
        &psrlq ($Zlo,8); # Z>>=8

        &movq ($tmp,$Zhi);
        &mov ($nhi[0],$nlo);
        &psrlq ($Zhi,8);

        &pxor ($Zlo,&QWP(16+256+0,"esp",$nhi[1],8)); # Z^=H[nhi]>>4
        &and (&LB($nlo),0x0f);
        &psllq ($tmp,56);

        &pxor ($Zhi,$red[1]) if ($i>1);
        &shr ($nhi[0],4);
        &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2) if ($i>0);

        unshift (@red,pop(@red)); # "rotate" registers
        unshift (@rem,pop(@rem));
        unshift (@nhi,pop(@nhi));
    }

    &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo]
    &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8));
    &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4)

    &pxor ($Zlo,$tmp);
    &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
    &movz ($rem[1],&LB($rem[1]));

    &pxor ($red[2],$red[2]); # clear 2nd word
    &psllq ($red[1],4);

    &movd ($rem[0],$Zlo);
    &psrlq ($Zlo,4); # Z>>=4

    &movq ($tmp,$Zhi);
    &psrlq ($Zhi,4);
    &shl ($rem[0],4); # rem<<4

    &pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi]
    &psllq ($tmp,60);
    &movz ($rem[0],&LB($rem[0]));

    &pxor ($Zlo,$tmp);
    &pxor ($Zhi,&QWP(16+128,"esp",$nhi[1],8));

    &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2);
    &pxor ($Zhi,$red[1]);

    &movd ($dat,$Zlo);
    &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48

    &psllq ($red[0],12); # correct by <<16>>4
    &pxor ($Zhi,$red[0]);
    &psrlq ($Zlo,32);
    &pxor ($Zhi,$red[2]);

    &mov ("ecx",&DWP(528+16+4,"esp")); # restore inp
    &movd ("ebx",$Zlo);
    &movq ($tmp,$Zhi); # 01234567
    &psllw ($Zhi,8); # 1.3.5.7.
    &psrlw ($tmp,8); # .0.2.4.6
    &por ($Zhi,$tmp); # 10325476
    &bswap ($dat);
    &pshufw ($Zhi,$Zhi,0b00011011); # 76543210
    &bswap ("ebx");

    &cmp ("ecx",&DWP(528+16+8,"esp")); # are we done?
    &jne (&label("outer"));
  }

    &mov ("eax",&DWP(528+16+0,"esp")); # restore Xi
    &mov (&DWP(12,"eax"),"edx");
    &mov (&DWP(8,"eax"),"ebx");
    &movq (&QWP(0,"eax"),$Zhi);

    &mov ("esp",&DWP(528+16+12,"esp")); # restore original %esp
    &emms ();
}
&function_end("gcm_ghash_4bit_mmx");
}}


if ($sse2) {{
######################################################################
# PCLMULQDQ version.

$Xip="eax";
$Htbl="edx";
$const="ecx";
$inp="esi";
$len="ebx";

($Xi,$Xhi)=("xmm0","xmm1"); $Hkey="xmm2";
($T1,$T2,$T3)=("xmm3","xmm4","xmm5");
($Xn,$Xhn)=("xmm6","xmm7");

&static_label("bswap");

sub clmul64x64_T2 { # minimal "register" pressure
    my ($Xhi,$Xi,$Hkey,$HK)=@_;

    &movdqa ($Xhi,$Xi); #
    &pshufd ($T1,$Xi,0b01001110);
    &pshufd ($T2,$Hkey,0b01001110) if (!defined($HK));
    &pxor ($T1,$Xi); #
    &pxor ($T2,$Hkey) if (!defined($HK));
    $HK=$T2 if (!defined($HK));

    &pclmulqdq ($Xi,$Hkey,0x00); #######
    &pclmulqdq ($Xhi,$Hkey,0x11); #######
    &pclmulqdq ($T1,$HK,0x00); #######
    &xorps ($T1,$Xi); #
    &xorps ($T1,$Xhi); #

    &movdqa ($T2,$T1); #
    &psrldq ($T1,8);
    &pslldq ($T2,8); #
    &pxor ($Xhi,$T1);
    &pxor ($Xi,$T2); #
}
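
# Illustrative note (not part of the original commentary): the three
# pclmulqdq instructions in clmul64x64_T2 above compute Xi.lo*H.lo,
# Xi.hi*H.hi and (Xi.lo^Xi.hi)*(H.lo^H.hi); XOR-ing the last product
# with the first two recovers the middle Karatsuba term, whose low
# half is then folded into the upper half of Xi and whose high half
# into the lower half of Xhi by the pslldq/psrldq pair.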

sub clmul64x64_T3 {
# Even though this subroutine offers visually better ILP, it
# was empirically found to be a tad slower than above version.
# At least in gcm_ghash_clmul context. But it's just as well,
# because loop modulo-scheduling is possible only thanks to
# minimized "register" pressure...
    my ($Xhi,$Xi,$Hkey)=@_;

    &movdqa ($T1,$Xi); #
    &movdqa ($Xhi,$Xi);
    &pclmulqdq ($Xi,$Hkey,0x00); #######
    &pclmulqdq ($Xhi,$Hkey,0x11); #######
    &pshufd ($T2,$T1,0b01001110); #
    &pshufd ($T3,$Hkey,0b01001110);
    &pxor ($T2,$T1); #
    &pxor ($T3,$Hkey);
    &pclmulqdq ($T2,$T3,0x00); #######
    &pxor ($T2,$Xi); #
    &pxor ($T2,$Xhi); #

    &movdqa ($T3,$T2); #
    &psrldq ($T2,8);
    &pslldq ($T3,8); #
    &pxor ($Xhi,$T2);
    &pxor ($Xi,$T3); #
}


if (1) { # Algorithm 9 with <<1 twist.
         # Reduction is shorter and uses only two
         # temporary registers, which makes it better
         # candidate for interleaving with 64x64
         # multiplication. Pre-modulo-scheduled loop
         # was found to be ~20% faster than Algorithm 5
         # below. Algorithm 9 was therefore chosen for
         # further optimization...

sub reduction_alg9 { # 17/11 times faster than Intel version
    my ($Xhi,$Xi) = @_;

    # 1st phase
    &movdqa ($T2,$Xi); #
    &movdqa ($T1,$Xi);
    &psllq ($Xi,5);
    &pxor ($T1,$Xi); #
    &psllq ($Xi,1);
    &pxor ($Xi,$T1); #
    &psllq ($Xi,57); #
    &movdqa ($T1,$Xi); #
    &pslldq ($Xi,8);
    &psrldq ($T1,8); #
    &pxor ($Xi,$T2);
    &pxor ($Xhi,$T1); #

    # 2nd phase
    &movdqa ($T2,$Xi);
    &psrlq ($Xi,1);
    &pxor ($Xhi,$T2); #
    &pxor ($T2,$Xi);
    &psrlq ($Xi,5);
    &pxor ($Xi,$T2); #
    &psrlq ($Xi,1); #
    &pxor ($Xi,$Xhi) #
}
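
# Illustrative note (not part of the original commentary): in the 1st
# phase above the psllq 5/1/57 sequence evaluates to X<<57 ^ X<<62 ^
# X<<63 per 64-bit lane, and in the 2nd phase the psrlq 1/5/1 sequence
# to X>>1 ^ X>>2 ^ X>>7; together with the surrounding XORs and the
# pslldq/psrldq split these make up the two-phase modulo-reduction
# used by this (<<1-twisted, bit-reflected) GHASH implementation.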

&function_begin_B("gcm_init_clmul");
    &mov ($Htbl,&wparam(0));
    &mov ($Xip,&wparam(1));

    &call (&label("pic"));
    &set_label("pic");
    &blindpop ($const);
    &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));

    &movdqu ($Hkey,&QWP(0,$Xip));
    &pshufd ($Hkey,$Hkey,0b01001110);# dword swap

    # <<1 twist
    &pshufd ($T2,$Hkey,0b11111111); # broadcast uppermost dword
    &movdqa ($T1,$Hkey);
    &psllq ($Hkey,1);
    &pxor ($T3,$T3); #
    &psrlq ($T1,63);
    &pcmpgtd ($T3,$T2); # broadcast carry bit
    &pslldq ($T1,8);
    &por ($Hkey,$T1); # H<<=1

    # magic reduction
    &pand ($T3,&QWP(16,$const)); # 0x1c2_polynomial
    &pxor ($Hkey,$T3); # if(carry) H^=0x1c2_polynomial

    # calculate H^2
    &movdqa ($Xi,$Hkey);
    &clmul64x64_T2 ($Xhi,$Xi,$Hkey);
    &reduction_alg9 ($Xhi,$Xi);

    &pshufd ($T1,$Hkey,0b01001110);
    &pshufd ($T2,$Xi,0b01001110);
    &pxor ($T1,$Hkey); # Karatsuba pre-processing
    &movdqu (&QWP(0,$Htbl),$Hkey); # save H
    &pxor ($T2,$Xi); # Karatsuba pre-processing
    &movdqu (&QWP(16,$Htbl),$Xi); # save H^2
    &palignr ($T2,$T1,8); # low part is H.lo^H.hi
    &movdqu (&QWP(32,$Htbl),$T2); # save Karatsuba "salt"

    &ret ();
&function_end_B("gcm_init_clmul");

&function_begin_B("gcm_gmult_clmul");
    &mov ($Xip,&wparam(0));
    &mov ($Htbl,&wparam(1));

    &call (&label("pic"));
    &set_label("pic");
    &blindpop ($const);
    &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));

    &movdqu ($Xi,&QWP(0,$Xip));
    &movdqa ($T3,&QWP(0,$const));
    &movups ($Hkey,&QWP(0,$Htbl));
    &pshufb ($Xi,$T3);
    &movups ($T2,&QWP(32,$Htbl));

    &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2);
    &reduction_alg9 ($Xhi,$Xi);

    &pshufb ($Xi,$T3);
    &movdqu (&QWP(0,$Xip),$Xi);

    &ret ();
&function_end_B("gcm_gmult_clmul");

&function_begin("gcm_ghash_clmul");
    &mov ($Xip,&wparam(0));
    &mov ($Htbl,&wparam(1));
    &mov ($inp,&wparam(2));
    &mov ($len,&wparam(3));

    &call (&label("pic"));
    &set_label("pic");
    &blindpop ($const);
    &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));

    &movdqu ($Xi,&QWP(0,$Xip));
    &movdqa ($T3,&QWP(0,$const));
    &movdqu ($Hkey,&QWP(0,$Htbl));
    &pshufb ($Xi,$T3);

    &sub ($len,0x10);
    &jz (&label("odd_tail"));

    #######
    # Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
    #   [(H*Ii+1) + (H*Xi+1)] mod P =
    #   [(H*Ii+1) + H^2*(Ii+Xi)] mod P
    #
    &movdqu ($T1,&QWP(0,$inp)); # Ii
    &movdqu ($Xn,&QWP(16,$inp)); # Ii+1
    &pshufb ($T1,$T3);
    &pshufb ($Xn,$T3);
    &movdqu ($T3,&QWP(32,$Htbl));
    &pxor ($Xi,$T1); # Ii+Xi

    &pshufd ($T1,$Xn,0b01001110); # H*Ii+1
    &movdqa ($Xhn,$Xn);
    &pxor ($T1,$Xn); #
    &lea ($inp,&DWP(32,$inp)); # i+=2

    &pclmulqdq ($Xn,$Hkey,0x00); #######
    &pclmulqdq ($Xhn,$Hkey,0x11); #######
    &pclmulqdq ($T1,$T3,0x00); #######
    &movups ($Hkey,&QWP(16,$Htbl)); # load H^2
    &nop ();

    &sub ($len,0x20);
    &jbe (&label("even_tail"));
    &jmp (&label("mod_loop"));

    &set_label("mod_loop",32);
    &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi)
    &movdqa ($Xhi,$Xi);
    &pxor ($T2,$Xi); #
    &nop ();

    &pclmulqdq ($Xi,$Hkey,0x00); #######
    &pclmulqdq ($Xhi,$Hkey,0x11); #######
    &pclmulqdq ($T2,$T3,0x10); #######
    &movups ($Hkey,&QWP(0,$Htbl)); # load H

    &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
    &movdqa ($T3,&QWP(0,$const));
    &xorps ($Xhi,$Xhn);
    &movdqu ($Xhn,&QWP(0,$inp)); # Ii
    &pxor ($T1,$Xi); # aggregated Karatsuba post-processing
    &movdqu ($Xn,&QWP(16,$inp)); # Ii+1
    &pxor ($T1,$Xhi); #

    &pshufb ($Xhn,$T3);
    &pxor ($T2,$T1); #

    &movdqa ($T1,$T2); #
    &psrldq ($T2,8);
    &pslldq ($T1,8); #
    &pxor ($Xhi,$T2);
    &pxor ($Xi,$T1); #
    &pshufb ($Xn,$T3);
    &pxor ($Xhi,$Xhn); # "Ii+Xi", consume early

    &movdqa ($Xhn,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1
    &movdqa ($T2,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase
    &movdqa ($T1,$Xi);
    &psllq ($Xi,5);
    &pxor ($T1,$Xi); #
    &psllq ($Xi,1);
    &pxor ($Xi,$T1); #
    &pclmulqdq ($Xn,$Hkey,0x00); #######
    &movups ($T3,&QWP(32,$Htbl));
    &psllq ($Xi,57); #
    &movdqa ($T1,$Xi); #
    &pslldq ($Xi,8);
    &psrldq ($T1,8); #
    &pxor ($Xi,$T2);
    &pxor ($Xhi,$T1); #
    &pshufd ($T1,$Xhn,0b01001110);
    &movdqa ($T2,$Xi); # 2nd phase
    &psrlq ($Xi,1);
    &pxor ($T1,$Xhn);
    &pxor ($Xhi,$T2); #
    &pclmulqdq ($Xhn,$Hkey,0x11); #######
    &movups ($Hkey,&QWP(16,$Htbl)); # load H^2
    &pxor ($T2,$Xi);
    &psrlq ($Xi,5);
    &pxor ($Xi,$T2); #
    &psrlq ($Xi,1); #
    &pxor ($Xi,$Xhi) #
    &pclmulqdq ($T1,$T3,0x00); #######

    &lea ($inp,&DWP(32,$inp));
    &sub ($len,0x20);
    &ja (&label("mod_loop"));

    &set_label("even_tail");
    &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi)
    &movdqa ($Xhi,$Xi);
    &pxor ($T2,$Xi); #

    &pclmulqdq ($Xi,$Hkey,0x00); #######
    &pclmulqdq ($Xhi,$Hkey,0x11); #######
    &pclmulqdq ($T2,$T3,0x10); #######
    &movdqa ($T3,&QWP(0,$const));

    &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
    &xorps ($Xhi,$Xhn);
    &pxor ($T1,$Xi); # aggregated Karatsuba post-processing
    &pxor ($T1,$Xhi); #

    &pxor ($T2,$T1); #

    &movdqa ($T1,$T2); #
    &psrldq ($T2,8);
    &pslldq ($T1,8); #
    &pxor ($Xhi,$T2);
    &pxor ($Xi,$T1); #

    &reduction_alg9 ($Xhi,$Xi);

    &test ($len,$len);
    &jnz (&label("done"));

    &movups ($Hkey,&QWP(0,$Htbl)); # load H
    &set_label("odd_tail");
    &movdqu ($T1,&QWP(0,$inp)); # Ii
    &pshufb ($T1,$T3);
    &pxor ($Xi,$T1); # Ii+Xi

    &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi)
    &reduction_alg9 ($Xhi,$Xi);

    &set_label("done");
    &pshufb ($Xi,$T3);
    &movdqu (&QWP(0,$Xip),$Xi);
&function_end("gcm_ghash_clmul");


} else { # Algorithm 5. Kept for reference purposes.

sub reduction_alg5 { # 19/16 times faster than Intel version
    my ($Xhi,$Xi)=@_;

    # <<1
    &movdqa ($T1,$Xi); #
    &movdqa ($T2,$Xhi);
    &pslld ($Xi,1);
    &pslld ($Xhi,1); #
    &psrld ($T1,31);
    &psrld ($T2,31); #
    &movdqa ($T3,$T1);
    &pslldq ($T1,4);
    &psrldq ($T3,12); #
    &pslldq ($T2,4);
    &por ($Xhi,$T3); #
    &por ($Xi,$T1);
    &por ($Xhi,$T2); #

    # 1st phase
    &movdqa ($T1,$Xi);
    &movdqa ($T2,$Xi);
    &movdqa ($T3,$Xi); #
    &pslld ($T1,31);
    &pslld ($T2,30);
    &pslld ($Xi,25); #
    &pxor ($T1,$T2);
    &pxor ($T1,$Xi); #
    &movdqa ($T2,$T1); #
    &pslldq ($T1,12);
    &psrldq ($T2,4); #
    &pxor ($T3,$T1);

    # 2nd phase
    &pxor ($Xhi,$T3); #
    &movdqa ($Xi,$T3);
    &movdqa ($T1,$T3);
    &psrld ($Xi,1); #
    &psrld ($T1,2);
    &psrld ($T3,7); #
    &pxor ($Xi,$T1);
    &pxor ($Xhi,$T2);
    &pxor ($Xi,$T3); #
    &pxor ($Xi,$Xhi); #
}

&function_begin_B("gcm_init_clmul");
    &mov ($Htbl,&wparam(0));
    &mov ($Xip,&wparam(1));

    &call (&label("pic"));
    &set_label("pic");
    &blindpop ($const);
    &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));

    &movdqu ($Hkey,&QWP(0,$Xip));
    &pshufd ($Hkey,$Hkey,0b01001110);# dword swap

    # calculate H^2
    &movdqa ($Xi,$Hkey);
    &clmul64x64_T3 ($Xhi,$Xi,$Hkey);
    &reduction_alg5 ($Xhi,$Xi);

    &movdqu (&QWP(0,$Htbl),$Hkey); # save H
    &movdqu (&QWP(16,$Htbl),$Xi); # save H^2

    &ret ();
&function_end_B("gcm_init_clmul");

&function_begin_B("gcm_gmult_clmul");
    &mov ($Xip,&wparam(0));
    &mov ($Htbl,&wparam(1));

    &call (&label("pic"));
    &set_label("pic");
    &blindpop ($const);
    &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));

    &movdqu ($Xi,&QWP(0,$Xip));
    &movdqa ($Xn,&QWP(0,$const));
    &movdqu ($Hkey,&QWP(0,$Htbl));
    &pshufb ($Xi,$Xn);

    &clmul64x64_T3 ($Xhi,$Xi,$Hkey);
    &reduction_alg5 ($Xhi,$Xi);

    &pshufb ($Xi,$Xn);
    &movdqu (&QWP(0,$Xip),$Xi);

    &ret ();
&function_end_B("gcm_gmult_clmul");

&function_begin("gcm_ghash_clmul");
    &mov ($Xip,&wparam(0));
    &mov ($Htbl,&wparam(1));
    &mov ($inp,&wparam(2));
    &mov ($len,&wparam(3));

    &call (&label("pic"));
    &set_label("pic");
    &blindpop ($const);
    &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));

    &movdqu ($Xi,&QWP(0,$Xip));
    &movdqa ($T3,&QWP(0,$const));
    &movdqu ($Hkey,&QWP(0,$Htbl));
    &pshufb ($Xi,$T3);

    &sub ($len,0x10);
    &jz (&label("odd_tail"));

    #######
    # Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
    #   [(H*Ii+1) + (H*Xi+1)] mod P =
    #   [(H*Ii+1) + H^2*(Ii+Xi)] mod P
    #
    &movdqu ($T1,&QWP(0,$inp)); # Ii
    &movdqu ($Xn,&QWP(16,$inp)); # Ii+1
    &pshufb ($T1,$T3);
    &pshufb ($Xn,$T3);
    &pxor ($Xi,$T1); # Ii+Xi

    &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1
    &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2

    &sub ($len,0x20);
    &lea ($inp,&DWP(32,$inp)); # i+=2
    &jbe (&label("even_tail"));

    &set_label("mod_loop");
    &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
    &movdqu ($Hkey,&QWP(0,$Htbl)); # load H

    &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
    &pxor ($Xhi,$Xhn);

    &reduction_alg5 ($Xhi,$Xi);

    #######
    &movdqa ($T3,&QWP(0,$const));
    &movdqu ($T1,&QWP(0,$inp)); # Ii
    &movdqu ($Xn,&QWP(16,$inp)); # Ii+1
    &pshufb ($T1,$T3);
    &pshufb ($Xn,$T3);
    &pxor ($Xi,$T1); # Ii+Xi

    &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1
    &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2

    &sub ($len,0x20);
    &lea ($inp,&DWP(32,$inp));
    &ja (&label("mod_loop"));

    &set_label("even_tail");
    &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)

    &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
    &pxor ($Xhi,$Xhn);

    &reduction_alg5 ($Xhi,$Xi);

    &movdqa ($T3,&QWP(0,$const));
    &test ($len,$len);
    &jnz (&label("done"));

    &movdqu ($Hkey,&QWP(0,$Htbl)); # load H
    &set_label("odd_tail");
    &movdqu ($T1,&QWP(0,$inp)); # Ii
    &pshufb ($T1,$T3);
    &pxor ($Xi,$T1); # Ii+Xi

    &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi)
    &reduction_alg5 ($Xhi,$Xi);

    &movdqa ($T3,&QWP(0,$const));
    &set_label("done");
    &pshufb ($Xi,$T3);
    &movdqu (&QWP(0,$Xip),$Xi);
&function_end("gcm_ghash_clmul");

}


&set_label("bswap",64);
    &data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
    &data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial
&set_label("rem_8bit",64);
    &data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
    &data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
    &data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E);
    &data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E);
    &data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E);
    &data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E);
    &data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E);
    &data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E);
    &data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE);
    &data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE);
    &data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE);
    &data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE);
    &data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E);
    &data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E);
    &data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE);
    &data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE);
    &data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E);
    &data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E);
    &data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E);
    &data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E);
    &data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E);
    &data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E);
    &data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E);
    &data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E);
    &data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE);
    &data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE);
    &data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE);
    &data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE);
    &data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E);
    &data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
    &data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
    &data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
}} # $sse2

&set_label("rem_4bit",64);
    &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
    &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
    &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
    &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
}}} # !$x86only

&asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
&asm_finish();

close STDOUT or die "error closing STDOUT: $!";

# A question was raised about choice of vanilla MMX. Or rather why wasn't
# SSE2 chosen instead? In addition to the fact that MMX runs on legacy
# CPUs such as PIII, "4-bit" MMX version was observed to provide better
# performance than *corresponding* SSE2 one even on contemporary CPUs.
# SSE2 results were provided by Peter-Michael Hager. He maintains SSE2
# implementation featuring full range of lookup-table sizes, but with
# per-invocation lookup table setup. Latter means that table size is
# chosen depending on how much data is to be hashed in every given call,
# more data - larger table. Best reported result for Core2 is ~4 cycles
# per processed byte out of 64KB block. This number accounts even for
# 64KB table setup overhead. As discussed in gcm128.c we choose to be
# more conservative in respect to lookup table sizes, but how do the
# results compare? Minimalistic "256B" MMX version delivers ~11 cycles
# on same platform. As also discussed in gcm128.c, next in line "8-bit
# Shoup's" or "4KB" method should deliver twice the performance of
# "256B" one, in other words not worse than ~6 cycles per byte. It
# should also be noted that in SSE2 case improvement can be "super-
# linear," i.e. more than twice, mostly because >>8 maps to single
# instruction on SSE2 register. This is unlike "4-bit" case when >>4
# maps to same amount of instructions in both MMX and SSE2 cases.
# Bottom line is that switch to SSE2 is considered to be justifiable
# only in case we choose to implement "8-bit" method...