Efficient approximate square roots and division in Verilog

January 23, 2021

SuperRT uses, for the most part, fairly simple maths operations - integer addition and multiplication - which are natively supported by Verilog (and can be offloaded to DSPs in some cases). There are, however, some places where it inevitably has to do more complex operations - notably divides and square roots (he hardware side avoids the need for any sine/cosine calculations by formatting the incoming data in such a way that the SNES CPU takes those on).

So I needed a way to support these these operations, and to keep the pipeline constraints simple I knew I needed an algorithm that could execute in a fixed number of cycles - so variable-length iterative approaches were out. On the plus side, I could afford to sacrifice a bit of accuracy for speed, and I didn’t need to worry too much about the semantics of invalid/out-of-range inputs as those would (hopefully!) never occur normally.

A quick note: SuperRT was my first FPGA project, and arguably best described as “a massively over-complicated Hello World”. My understanding of Verilog is incomplete at best and most likely downright wrong in at worst, so please take everything I say about it with a pinch of salt. Additions/corrections are gratefully accepted!

I started looking at square root first - for reasons that will become apparent shortly - and after looking around a bit for information on hardware-suitable algorithms and not finding very much guidance, I decided to try some experiments in software first. Based on the “fixed number of cycles” constraint, Newton-Raphson seemed like a good candidate - this is an iterative method where each step can be done with a fixed number of operations, and the number of steps required is determined by the desired output accuracy.

More specifically, this method calculates the reciprocal of the square root of a value val by iteratively computing this formula for step n:

x(n+1) = (x(n) * (1.5 - (val * 0.5 * x(n)^2))

This calculates the reciprocal 1/sqrt(val), but once we have that we can easily calculate the regular square root by multiplying the result by val, thus yielding val / sqrt(val) which by definition is sqrt(val).

Since each step depends on the output of the previous one, we need to “seed” the initial value of x(0) in order to get started. This initial value is a “first guess” at the answer which is then refined with each iteration, and so a precomputed table seemed to be a good approach to take.

I toyed a bit with the idea of using a lookup ROM for this, but it felt like that was probably going to overcomplicate things. Also, a straightforward lookup based on the upper bits of the input value or similar would either need an unreasonably large table or suffer accuracy problems with smaller inputs. Thus, the approach I ended up taking was to use a magnitude-based lookup, where the initial guess is determined by the number of leading zero bits in the input value. That gives us a table which looks like this:

Number of leading zeroes    Mid-value       Square root
30                          6.103516E-05    0.0078125
29                          0.0001831055    0.0135316469341319
28                          0.0003662109    0.0191366386154936
27                          0.0007324219    0.0270632938682637
26                          0.001464844     0.0382732772309872
25                          0.002929688     0.0541265877365274
24                          0.005859375     0.0765465544619743
23                          0.01171875      0.108253175473055
22                          0.0234375       0.153093108923949
21                          0.046875        0.21650635094611
20                          0.09375         0.306186217847897
19                          0.1875          0.433012701892219
18                          0.375           0.612372435695794
17                          0.75            0.866025403784439
16                          1.5             1.22474487139159
15                          3               1.73205080756888
14                          6               2.44948974278318
13                          12              3.46410161513775
12                          24              4.89897948556636
11                          48              6.92820323027551
10                          96              9.79795897113271
9                           192             13.856406460551
8                           384             19.5959179422654
7                           768             27.712812921102
6                           1536            39.1918358845308
5                           3072            55.4256258422041
4                           6144            78.3836717690617
3                           12288           110.851251684408
2                           24576           156.767343538123
1                           49152           221.702503368816
0                           98304           313.534687076247

Source values are in 14 bit fixed-point and signed, and thus 30 is the maximum number of leading zero bits.

The initial square root guesses are calculated as the square root of the mid-point of the range they are used for.

Using this table to get the initial value and then performing two Newton’s method iterations turned out to give just enough accuracy for SuperRT’s purposes - graphing the error (as a percentage of the answer) over the entire input range gives this:

Error rates across the whole range

As can be seen, overall error isn’t great (particularly for large numbers), but at the lower end of the range it’s much more reasonable (<1%) and this is where the majority of SuperRT calculations occur:

Error rates across the 0-256 range

Implemented in Verilog as a big combinatorial function, this solution looks like this:

localparam [31:0] fixedShift = 14; // Left shift factor for fixed point (i.e. how many fractional bits do we have)
localparam signed [31:0] onePointFive = 32'sh00006000; // 1.5f in fixed point

// Fixed-point mul (40-bit internal precision)

function automatic signed [31:0] FixedMul;
	input signed [31:0] x;
	input signed [31:0] y;
	begin
		// Sign-extend out to 40 bits and then truncate again afterwards
		FixedMul = 32'($signed({{8{x[31]}}, x} * {{8{y[31]}}, y}) >>> fixedShift);
	end
endfunction

// Fast fixed-point square root

function automatic signed [31:0] FixedSqrt;
	input signed [31:0] sqrtIn; // Input
	reg signed [31:0] sqrtFirstGuess; // First guess (table-based)
	reg signed [31:0] sqrtIntermediate; // Result of first Newton iteration
	reg signed [31:0] sqrtRcpResult; // Result of second Newton iteration
	begin	
		// First-guess lookup based on number of leading zeroes
		
		if (sqrtIn[30]) sqrtFirstGuess = 32'sh00000034;
		else if (sqrtIn[29]) sqrtFirstGuess = 32'sh00000049;
		else if (sqrtIn[28]) sqrtFirstGuess = 32'sh00000068;
		else if (sqrtIn[27]) sqrtFirstGuess = 32'sh00000093;
		else if (sqrtIn[26]) sqrtFirstGuess = 32'sh000000d1;
		else if (sqrtIn[25]) sqrtFirstGuess = 32'sh00000127;
		else if (sqrtIn[24]) sqrtFirstGuess = 32'sh000001a2;
		else if (sqrtIn[23]) sqrtFirstGuess = 32'sh0000024f;
		else if (sqrtIn[22]) sqrtFirstGuess = 32'sh00000344;
		else if (sqrtIn[21]) sqrtFirstGuess = 32'sh0000049e;
		else if (sqrtIn[20]) sqrtFirstGuess = 32'sh00000688;
		else if (sqrtIn[19]) sqrtFirstGuess = 32'sh0000093c;
		else if (sqrtIn[18]) sqrtFirstGuess = 32'sh00000d10;
		else if (sqrtIn[17]) sqrtFirstGuess = 32'sh00001279;
		else if (sqrtIn[16]) sqrtFirstGuess = 32'sh00001a20;
		else if (sqrtIn[15]) sqrtFirstGuess = 32'sh000024f3;
		else if (sqrtIn[14]) sqrtFirstGuess = 32'sh00003441;
		else if (sqrtIn[13]) sqrtFirstGuess = 32'sh000049e6;
		else if (sqrtIn[12]) sqrtFirstGuess = 32'sh00006882;
		else if (sqrtIn[11]) sqrtFirstGuess = 32'sh000093cd;
		else if (sqrtIn[10]) sqrtFirstGuess = 32'sh0000d105;
		else if (sqrtIn[9]) sqrtFirstGuess = 32'sh0001279a;
		else if (sqrtIn[8]) sqrtFirstGuess = 32'sh0001a20b;
		else if (sqrtIn[7]) sqrtFirstGuess = 32'sh00024f34;
		else if (sqrtIn[6]) sqrtFirstGuess = 32'sh00034417;
		else if (sqrtIn[5]) sqrtFirstGuess = 32'sh00049e69;
		else if (sqrtIn[4]) sqrtFirstGuess = 32'sh0006882f;
		else if (sqrtIn[3]) sqrtFirstGuess = 32'sh00093cd3;
		else if (sqrtIn[2]) sqrtFirstGuess = 32'sh000d105e;
		else if (sqrtIn[1]) sqrtFirstGuess = 32'sh001279a7;
		else sqrtFirstGuess = 32'sh00200000;

		// Newton's method - x(n+1) =(x(n) * (1.5 - (val * 0.5f * x(n)^2))
		
		// First iteration	
		sqrtIntermediate = FixedMul(sqrtFirstGuess, onePointFive - FixedMul(sqrtIn >>> 1, FixedMul(sqrtFirstGuess, sqrtFirstGuess)));
		
		// Second iteration
		sqrtRcpResult = FixedMul(sqrtIntermediate, onePointFive - FixedMul(sqrtIn >>> 1, FixedMul(sqrtIntermediate, sqrtIntermediate)));
		
		// Convert 1/sqrt(x) to sqrt(x)
		FixedSqrt = FixedMul(sqrtRcpResult, sqrtIn);
	end
endfunction

…or, in a pipeline-friendly 4-cycle-latency, 1-cycle-throughput form:

localparam [31:0] fixedShift = 14; // Left shift factor for fixed point (i.e. how many fractional bits do we have)
localparam signed [31:0] onePointFive = 32'sh00006000; // 1.5f in fixed point

// Fixed-point mul (40-bit internal precision)

function automatic signed [31:0] FixedMul;
	input signed [31:0] x;
	input signed [31:0] y;
	begin
		// Sign-extend out to 40 bits and then truncate again afterwards
		FixedMul = 32'($signed({{8{x[31]}}, x} * {{8{y[31]}}, y}) >>> fixedShift);
	end
endfunction

// Pipelined clocked square root

module FixedSqrtClocked(
	input wire clock,
	input wire signed [31:0] sqrtIn, // Input
	output wire signed [31:0] result); // Result

// Cycle 1
always @(posedge clock) begin

	c2_sqrtIn <= sqrtIn;

	// First-guess lookup based on number of leading zeroes

	if (sqrtIn[30]) c2_sqrtFirstGuess <= 32'sh00000034;
	else if (sqrtIn[29]) c2_sqrtFirstGuess <= 32'sh00000049;
	else if (sqrtIn[28]) c2_sqrtFirstGuess <= 32'sh00000068;
	else if (sqrtIn[27]) c2_sqrtFirstGuess <= 32'sh00000093;
	else if (sqrtIn[26]) c2_sqrtFirstGuess <= 32'sh000000d1;
	else if (sqrtIn[25]) c2_sqrtFirstGuess <= 32'sh00000127;
	else if (sqrtIn[24]) c2_sqrtFirstGuess <= 32'sh000001a2;
	else if (sqrtIn[23]) c2_sqrtFirstGuess <= 32'sh0000024f;
	else if (sqrtIn[22]) c2_sqrtFirstGuess <= 32'sh00000344;
	else if (sqrtIn[21]) c2_sqrtFirstGuess <= 32'sh0000049e;
	else if (sqrtIn[20]) c2_sqrtFirstGuess <= 32'sh00000688;
	else if (sqrtIn[19]) c2_sqrtFirstGuess <= 32'sh0000093c;
	else if (sqrtIn[18]) c2_sqrtFirstGuess <= 32'sh00000d10;
	else if (sqrtIn[17]) c2_sqrtFirstGuess <= 32'sh00001279;
	else if (sqrtIn[16]) c2_sqrtFirstGuess <= 32'sh00001a20;
	else if (sqrtIn[15]) c2_sqrtFirstGuess <= 32'sh000024f3;
	else if (sqrtIn[14]) c2_sqrtFirstGuess <= 32'sh00003441;
	else if (sqrtIn[13]) c2_sqrtFirstGuess <= 32'sh000049e6;
	else if (sqrtIn[12]) c2_sqrtFirstGuess <= 32'sh00006882;
	else if (sqrtIn[11]) c2_sqrtFirstGuess <= 32'sh000093cd;
	else if (sqrtIn[10]) c2_sqrtFirstGuess <= 32'sh0000d105;
	else if (sqrtIn[9]) c2_sqrtFirstGuess <= 32'sh0001279a;
	else if (sqrtIn[8]) c2_sqrtFirstGuess <= 32'sh0001a20b;
	else if (sqrtIn[7]) c2_sqrtFirstGuess <= 32'sh00024f34;
	else if (sqrtIn[6]) c2_sqrtFirstGuess <= 32'sh00034417;
	else if (sqrtIn[5]) c2_sqrtFirstGuess <= 32'sh00049e69;
	else if (sqrtIn[4]) c2_sqrtFirstGuess <= 32'sh0006882f;
	else if (sqrtIn[3]) c2_sqrtFirstGuess <= 32'sh00093cd3;
	else if (sqrtIn[2]) c2_sqrtFirstGuess <= 32'sh000d105e;
	else if (sqrtIn[1]) c2_sqrtFirstGuess <= 32'sh001279a7;
	else c2_sqrtFirstGuess <= 32'sh00200000;
end

reg signed [31:0] c2_sqrtIn;
reg signed [31:0] c2_sqrtFirstGuess; // First guess (table-based)

// Cycle 2
always @(posedge clock) begin
	// Newton's method - x(n+1) =(x(n) * (1.5 - (val * 0.5f * x(n)^2))

	// First iteration	
	c3_sqrtIntermediate <= FixedMul(c2_sqrtFirstGuess, sqrt_onePointFive - FixedMul(c2_sqrtIn >> 1, FixedMul(c2_sqrtFirstGuess, c2_sqrtFirstGuess)));

	c3_sqrtIn <= c2_sqrtIn;
end

reg signed [31:0] c3_sqrtIn;
reg signed [31:0] c3_sqrtIntermediate; // Result of first Newton iteration

// Cycle 3
always @(posedge clock) begin
	// Second iteration
	c4_sqrtRcpResult <= FixedMul(c3_sqrtIntermediate, sqrt_onePointFive - FixedMul(c3_sqrtIn >> 1, FixedMul(c3_sqrtIntermediate, c3_sqrtIntermediate)));

	c4_sqrtIn <= c3_sqrtIn;
end

reg signed [31:0] c4_sqrtIn;
reg signed [31:0] c4_sqrtRcpResult; // Result of second Newton iteration

// Cycle 4
always @(posedge clock) begin
	if (c4_sqrtIn == 'h0) begin
		result <= 32'h0;
	end else begin
		// Convert 1/sqrt(x) to sqrt(x)
		result <= FixedMul(c4_sqrtRcpResult, c4_sqrtIn);
	end	
end	

I actually tried two variants of the initial lookup code - the version seen here, and what felt like a more “idiomatic Verilog” version using a case statement, as follows:

unique casez(sqrtIn)
	32'sb?1??????????????????????????????: c2_sqrtFirstGuess <= 32'sh00000034;
	32'sb?01?????????????????????????????: c2_sqrtFirstGuess <= 32'sh00000049;
	32'sb?001????????????????????????????: c2_sqrtFirstGuess <= 32'sh00000068;
	32'sb?0001???????????????????????????: c2_sqrtFirstGuess <= 32'sh00000093;
	32'sb?00001??????????????????????????: c2_sqrtFirstGuess <= 32'sh000000d1;
	32'sb?000001?????????????????????????: c2_sqrtFirstGuess <= 32'sh00000127;
	32'sb?0000001????????????????????????: c2_sqrtFirstGuess <= 32'sh000001a2;
	32'sb?00000001???????????????????????: c2_sqrtFirstGuess <= 32'sh0000024f;
	32'sb?000000001??????????????????????: c2_sqrtFirstGuess <= 32'sh00000344;
	32'sb?0000000001?????????????????????: c2_sqrtFirstGuess <= 32'sh0000049e;
	32'sb?00000000001????????????????????: c2_sqrtFirstGuess <= 32'sh00000688;
	32'sb?000000000001???????????????????: c2_sqrtFirstGuess <= 32'sh0000093c;
	32'sb?0000000000001??????????????????: c2_sqrtFirstGuess <= 32'sh00000d10;
	32'sb?00000000000001?????????????????: c2_sqrtFirstGuess <= 32'sh00001279;
	32'sb?000000000000001????????????????: c2_sqrtFirstGuess <= 32'sh00001a20;
	32'sb?0000000000000001???????????????: c2_sqrtFirstGuess <= 32'sh000024f3;
	32'sb?00000000000000001??????????????: c2_sqrtFirstGuess <= 32'sh00003441;
	32'sb?000000000000000001?????????????: c2_sqrtFirstGuess <= 32'sh000049e6;
	32'sb?0000000000000000001????????????: c2_sqrtFirstGuess <= 32'sh00006882;
	32'sb?00000000000000000001???????????: c2_sqrtFirstGuess <= 32'sh000093cd;
	32'sb?000000000000000000001??????????: c2_sqrtFirstGuess <= 32'sh0000d105;
	32'sb?0000000000000000000001?????????: c2_sqrtFirstGuess <= 32'sh0001279a;
	32'sb?00000000000000000000001????????: c2_sqrtFirstGuess <= 32'sh0001a20b;
	32'sb?000000000000000000000001???????: c2_sqrtFirstGuess <= 32'sh00024f34;
	32'sb?0000000000000000000000001??????: c2_sqrtFirstGuess <= 32'sh00034417;
	32'sb?00000000000000000000000001?????: c2_sqrtFirstGuess <= 32'sh00049e69;
	32'sb?000000000000000000000000001????: c2_sqrtFirstGuess <= 32'sh0006882f;
	32'sb?0000000000000000000000000001???: c2_sqrtFirstGuess <= 32'sh00093cd3;
	32'sb?00000000000000000000000000001??: c2_sqrtFirstGuess <= 32'sh000d105e;
	32'sb?000000000000000000000000000001?: c2_sqrtFirstGuess <= 32'sh001279a7;
	default: c2_sqrtFirstGuess <= 32'sh00200000;
endcase

Both of these produce identical results, but interestingly synthesize to very different RTL - the if statements produce a linear cascade of of tests that broadly mirror the Verilog structure, like this:

RTL netlist for if statement version

Whilst the case statement produces a big network of gates that generate each bit of the output separately:

RTL netlist for case statement version

The two have broadly similar resource requirements (402 ALUTs and 190 registers for the case statement vs 437 ALUTs and 184 registers for the if version) - SuperRT uses the if version for no particular reason other than that it was the one I wrote first.

This code calculates 1/sqrt(val), and from there sqrt(val), but it can also be used to calculate 1/val simply by taking the square of the result, thus giving (1/sqrt(val))^2 and yielding 1/val. Once we have a means of calculating 1/val, division becomes a simple question of taking the reciprocal of the divider and then multiplying.

Just in case anyone happens upon this and wants to use any of the code in this article in their own project (bearing in mind the warning at the top of this page about my Verilog skills!), please feel free to do so - for the sake of avoiding any confusion, I’ll officially put the code snippets here under the 0BSD license:

Copyright © 2021 by Ben Carter ben@shironekolabs.com

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.