For the first iteration, use the xvf*ger builtins rather than xvf*gerpp. The xvf*ger forms overwrite the accumulator instead of adding to it, so the accumulators do not need to be zeroed beforehand, which saves a few instructions.
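A minimal sketch of the idea, assuming the single-precision variants (`__builtin_mma_xvf32ger` / `__builtin_mma_xvf32gerpp`) and GCC with `-mcpu=power10`. The helper name `rank_k_4x4`, the data layout, and the scalar fallback are hypothetical and exist only for illustration; the fallback mirrors the same "assign first, accumulate after" structure so the snippet also builds on non-MMA targets. The row order produced by `__builtin_mma_disassemble_acc` is assumed to be row-major here and is worth confirming on the target.

```c
#include <assert.h>
#include <string.h>
#if defined(__MMA__)
#include <altivec.h>
#endif

/* Hypothetical helper: out = sum over i < k of outer(a[4i..4i+3], b[4i..4i+3]).
 * Requires k >= 1. 'out' does not need to be initialised by the caller. */
void rank_k_4x4(const float *a, const float *b, int k, float out[4][4])
{
    assert(k >= 1);
#if defined(__MMA__)
    typedef __vector unsigned char vec_t;
    __vector_quad acc;
    __vector float rows[4];
    vec_t va, vb;

    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    /* First iteration: xvf32ger overwrites acc, so no
     * __builtin_mma_xxsetaccz(&acc) is needed before it. */
    __builtin_mma_xvf32ger(&acc, va, vb);

    /* Remaining iterations accumulate into acc with the pp form. */
    for (int i = 1; i < k; i++) {
        memcpy(&va, a + 4 * i, sizeof va);
        memcpy(&vb, b + 4 * i, sizeof vb);
        __builtin_mma_xvf32gerpp(&acc, va, vb);
    }

    /* Assumption: rows[r] holds row r of the 4x4 result. */
    __builtin_mma_disassemble_acc(rows, &acc);
    memcpy(out, rows, sizeof rows);
#else
    /* Scalar stand-in (not MMA): same structure, for non-POWER10 builds. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[r][c] = a[r] * b[c];          /* first iteration assigns */
    for (int i = 1; i < k; i++)
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                out[r][c] += a[4 * i + r] * b[4 * i + c]; /* later ones add */
#endif
}
```

Compared with zeroing the accumulator and using xvf32gerpp for every iteration, this peels the first iteration out of the loop and drops one xxsetaccz per accumulator.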