Using lazy loading for duplicate subcircuit constraints to cut 95% memory usage for gnark R1CS

When using gnark as a base zksnarks lib for scroll up system, people will face memory issues.

We detect there are two parts using huge memory/storage size.

  1. R1CS size. 2. Pk size.

Here we mainly focus on the R1CS size.

R1CS is the description of an arithmetic circuit using by groth16 (a popular snark proving system). R1CS is composed of many R1C, every single R1C will contain L, R, O LinearExpression, LinearExpression means variable or constants set. The R1C describe that L * R = O, and in solving R1C will be extracted and calculate the right value. F.g. if L,R,O is all solved, then the solving system will constraint them to satisfy the multiplication equation. If any two of them is solved, then the solving system will calculate the unsolved LinearExpressions. if less then two of them is solved, the solving system will not work.

There’re many efforts to reduce memory usage by Consensys team, in the gnark latest release v0.8.0, like in this gnark PR #418 the R1CS compilation will replace the long LinearExpression with a new one, and then using the new one for other R1C in the afterwards compilation. And in the gnark PR #412 the updates for the R1CS core system also mainly focus on R1CS memory usage, re structure the Linear Experssion and Term(item in Linear Expression) fields. These effort has improved, however with the explosion of zk application more complex systems(like zkevm or recursive proving) makes the circuit even more larger. We will need to specifically deal with the large circuit for reducing memory.

Noted that the large circuit always contains loops in R1CS generation, here we call them duplicate subcircuit. For example, the merkle tree in circuit will reuse hash functions for many times, and the hash functions like MiMC will reuse permutation for many times. Thus we come up with an idea to highly compress the duplicate subcircuit constraints to one simple circuit. Here’s how we make it.

  1. record repeatable constraints structure (during this process, we will record the static constraints StaticConstraints, and it will be stored in map by a key defined by user, for looking up the constraint need to be recovered)
  2. record repeatable constraints’s inputs, and pick only the input constrains to store. (GeneralLazyInputs)
  3. lazify and remove all the repeatable r1cs structure, with only lazy inputs left. rebuild levels.
  4. recover the r1c when solving. the input constraints can be easily extracted from store, for others we have a shift recorded the distance from the static constraints, and add the offset shift to the variables to recover.

let’s imagine a giant circuit with inputs i1,i2, assume that v_ is the internal variables generated, c_ is the constant variables used, the circuit which will generate constraints below,

_circuit 1 generated constraints as below_
1. (i1 + c1) * (i1 + c1) = v1
2. (i2 + c2) * (i2 + c2) = v2
3. (v1 + c3) * (v1 + c3) = v3
4. v3 * v3 = v4

n. ...constraints

if later we got the same circuit, with inputs i3,i4, assume that before this time this circuit process, thee existed variables index has been 1000, then the constraints will be

_circuit 2 generated constraints as below_

1. (i3 + c1) * (i3 + c1) = v1001
2. (i4 + c2) * (i4 + c2) = v1002
3. (v1000 + c3) * (v1000 + c3) = v1003
4. v1003 * v1003 = v1004

n. ...constraints

as you can see, there will be many internal variables generated in duplicate circuit, the variables’ born and order are the same, except that they got a shift like the internal variables in the mapped constraint always differ 1000.

so we save the structure static constraints(meanwhile circuit 1 generated constraints) for the first time

and we recover the other by 1. retrieve the input constraints, which contains the input’s term. this is exactly the same with normal constraints 2. retrieve contraints and add shift to the variables. in above case, we have shift 1000, so in constraint 4 we can recover (v3 * v3 = v4 in static constraints) to (v1003 * v1003 = v1004)

Results

mimc-inputs-4 (save 70% r1cs memory usage)

mimc 1 10 100 1000 10000
dump_r1cs_w/o_origin (kb) 127 1,225 12,939 135,828 1,364,706
dump_r1cs_w/o_lazify (kb) 244 1,342 13,056 135,946 1,364,833
dump_r1cs_w/_lazify (kb) 158 479 4,055 43,072 435,292

as you can see, for such circuit we save nearly 70% of memory, and both lazified and non-lazified grow linearly with the mimc hashes num, that’s because we use input constraints to recover true constraints. If the inputs constraints contained is very little, then the memory of r1cs used actually will be very little.

and test for poseidon 12 inputs, which surely the input constraints contained in its static constrains are less.

poseidon-inputs-12 (save 95% r1cs memory usage)

poseidon 1 10 100 1000 10000
dump_r1cs_w/o_origin (kb) 313 2,520 24,602 273,890 N/A
dump_r1cs_w/o_lazify (kb) 546 2,761 24,915 275,049 N/A
dump_r1cs_w/_lazify (kb) 325 450 1,745 16,962 N/A

this time it saves nearly 95% of memory used in r1cs.

Looks very promising. Is this applicable only to R1CS or also can be extended to Plonk?