nanomachines - CTF CUP 2025

Me and my team have played CTF CUP 2025, we've placed 5th overall in the scoreboard, I've enjoyed the rev/forensics tasks that the event provided, nanomachines turned out to be my favorite task to solve, even though I couldn't do it in time, because I was too sleepy after 30h of non-stop playing. But I've finished reversing it after the event and here's a writeup for it.

Starting point

So we're given an ELF binary from the start, it's a clean one and not packed or anything, you can download it here: click.

Let's jump straight into IDA and open it. main func would look like that:

__int64 __fastcall main(int a1, char **a2, char **a3)
{
  void (__fastcall *v4)(__int64, __int64); // [rsp+0h] [rbp-8h] BYREF

  v4 = real_main;
  return sub_5555555E36D0((__int64)&v4, (__int64)&off_555555642AD0, a1, (__int64)a2, 0);
}

v4 is the function that's really being launched, sub_5555555E36D0 is likely some std thingy, let's jump into real_main. Inside we would see something like that:

void __fastcall real_main(__int64 a1, __int64 a2)
{
  ...
  v2 = fork();
  if ( v2 )
  {
    if ( v2 == -1 )
      sub_555555582D10(); // this just errors out.
    else
      sub_555555565290(v2, a2, v3, v4, v5, v6);
  }
  else
  {
    v7 = sub_555555582C30(a1, a2);
    ...
  }
  ...
}

My addresses might differ from yours initial launch of IDA because I've done some debugging, but the main logic should stay the same. If we go funcs one by one after fork we can see that one is used for error, another launches some big function (we would explore it later), and the 555555582C30 me if we would look inside it's a ptrace traceme wrapper:

__int64 ptrace_traceme()
{
  int *v1; // rax

  if ( ptrace(PTRACE_TRACEME, 0, 0, 0) != -1 )
    return 134;
  v1 = _errno_location();
  return sub_555555582D20(*v1);
}

From that point it should be clear that the program is using nanomites technique to obfuscate the reverse engineering process (the name of the task also hints at that). Let's find all xrefs for ptrace and rename the functions accordingly. Here's the list of which functions use the ptrace and what their inner logic is:

sub_555555582B30 - ptrace_getregs;

sub_555555582BA0 - ptrace_setregs;

sub_555555582BE0 - general ptrace wrapper (I've called it ptrace_proxy);

sub_555555582C30 - ptrace_traceme (as mentioned above);

sub_555555582C60 - ptrace_peekdata;

sub_555555582CD0 - ptrace_pokedata;

After all of those functions are found let's continue to explore. The main function after calling ptrace_traceme looks like a mess (a lot of bcmp to zeros, some operations that result in segmentation fault (writing to a zero addr), etc.). It's clear that it's probably a child process as by itself it does not make a whole much sense, as the program itself does not really segfault if we provide it with input. So let's go to sub_555555565290 which is a huge function we've seen before in one of the branches after the work. It should be the parent function which handles all the mess. If we jump into it it really looks like a mess but we can start carefully analyzing it.

Analyzing parent algorithm

After we've figured it's nanomites program using ptrace to debug itself and found the parent function, we need to understand what the parent does. You can get a general idea by briefly looking at the code (or asking GPT, if you really like). First, it waits for the pid (it's passed as the first arg in the function):

 v7 = waitpid(pid, stat_loc, 0);
  if ( v7 )
  {
    if ( v7 == -1 )
      sub_555555582D10(); // crash out.
    else
      sub_555555582940((__int64)v329, v7, stat_loc[0], v8, v9, v10); // not really significant? (i've not analyzed it)
  }

After that it launches an infinite loop, where it sends PTRACE_CONT to the child, and then gets the registers using our ptrace wrappers functions mentioned above:

 while ( 1 )
  {
    do
    {
LABEL_7:
      ptrace_proxy((__int64)v329, PTRACE_CONT, pid, 0, 0);
      if ( LODWORD(v329[0]) )
        goto LABEL_396;
      stat_loc[0] = 0;
      v13 = waitpid(pid, stat_loc, 0);
    }
    while ( !v13 );
    if ( v13 == -1 )
      break;
    sub_555555582940((__int64)v329, v13, stat_loc[0], v14, v15, v16);
    if ( LOBYTE(v329[0]) == 7 )
      goto LABEL_396;
    if ( LOBYTE(v329[0]) == 2 )
    {
      if ( LODWORD(v329[1]) == 11 || LODWORD(v329[1]) == 5 )
      {
        ptrace_getregs((int *)v329, pid);
        ...
      }
      ...
   }
   ...
}

Basically what it does is it continues the process, checks where has it stopped, and if it stopped because of SIGSEGV or SIGTRAP, it then does some logic, first of all fetching the registers values. Basically v329 + 16 becomes the register, so we can change the type of next variable after it into user_regs_struct, which should make it understandable what it does after. After that it basically matches rax with a lot of different values (0x1001, 0x1002, etc.) which seem to be opcodes and then does some logic based on that. Let's check one of those opcodes for example:

        switch ( ptr.rax )
        {
          case 0x1001uLL:
            sub_555555564FA0((__int64)&first_arg_data, pid, rdi);
            sub_555555564FA0((__int64)&second_arg_data, pid, rsi);
            // more logic...
         ...
        }

First of all it does something strange with rdi/rsi registers, it looks like it's an argument to our VM, so let's see what happens in sub_555555564FA0, the main loop inside looks like that:

  do
  {
    ptrace_peekdata((__int64)v22, a2, a3 + v6 + 8);
    if ( v22[0] )
    {
      v20 = v22[1];
      sub_5555555623C0(aCalledResultUn_2, 43, &v20, &off_5555556428E8, &off_555555642920);
    }
    v7 = read_data_amnt;
    if ( read_data_amnt < 0x80 )
    {
      read_result_1 = read_result;
      if ( read_data_amnt == *((_QWORD *)&read_data + 1) )
      {
        sub_55555556DBE0(&read_data, read_data_amnt);
        v7 = read_data_amnt;
      }
      *(_BYTE *)(read_data + v7) = read_result_1;
      v9 = read_data_amnt + 1;
      read_data_amnt = v9;
      if ( v9 <= 0x7F )
      {
        if ( v9 == *((_QWORD *)&read_data + 1) )
        {
          sub_55555556DBE0(&read_data, v9);
          v9 = read_data_amnt;
        }
        *(_BYTE *)(read_data + v9) = BYTE1(read_result_1);
        v10 = read_data_amnt + 1;
        read_data_amnt = v10;
        if ( v10 <= 0x7F )
        {
             // copy the next byte, and then repeat, basically copying all 8 bytes.
        }
    }
    v6 += 8LL;
  }
  while ( v6 < 120 );

So what the function does, is it gets the data by the provided addr from the child process, and then copies it in chunks of 8 bytes, and does that until v6 (that's being initialized as -8) is less than 120. Basically it just copies 128 bytes from addr a3 in process a2 to a current process memory at a1. Let's call it read_128_bytes_from. And then analyze what happens later in that opcode. As it's written in rust it's really messy to do in static so here I've cheated and opened the IDA debugger and started analyzing the parent like that. Next function we would see is:

            if ( *((_QWORD *)&v324 + 1) )
            {
              sub_55555557BBB0((void **)&first_arg_copy, (__int64)first_arg_data, *((__int64 *)&v324 + 1), 8, v24, v25);
              ...
            }

For now we would skip the meaning of the check it does, we're interested in the function itself, in order to analyze what happens inside of it I've just placed a breakpoint and then watched at what happens in the first argument (as you can see it's already renamed so it gives away what the func does). Basicalyl it takes the second argument as a bytearray and then copies it into first argument, but in a slightly different format, it looks like a home-made bigint thing, It basically stores the integer as { ptr to little-endian array of 128 bytes, limbs_length } and then provides the "length" of the integer as if its limbs are 8 bytes per one limb, here's a code to read it in IDAPython:

def read_int(addr):
    dlen = idc.read_dbg_qword(addr + 8)
    return int.from_bytes(idc.read_dbg_memory(idc.read_dbg_qword(addr), dlen * 8), byteorder='little')

It would be really useful to debug all the other functions too. For now let's rename this function into copy_data_as_int and go further into the opcode, it's clear to see that after reading 128 bytes from rdi and rsi, it copies them as bigints into other buffers. The check mentioned above (&v324 + 1) is basically checking that the int is not a zero (the amount of limbs is not zero). After that stepping in debugger (as in static it's really easy to mess up with branches and how it jumps between them) we can see that the next unknown function it does is sub_55555557A800 like that:

                  sub_55555557A800(
                    (__int64 *)dest,
                    first_arg_copy_bigint.m256i_i64[0],
                    first_arg_copy_bigint.m256i_i64[2],
                    second_arg_cpy_bigint.m256i_i64[0],
                    second_arg_cpy_bigint.m256i_i64[2],
                    v25);

You should've guessed what arguments name are based on copy_data_as_int function and renaming every call to it. As we can see that function does something with our arguments and probably places it in the first argument (dest was automatic name IDA gave to that variable). If we'll go into the function keeping in mind the structure of our home-made bigint structure we should see that it's clearly a multiplication function:

__int64 *__fastcall bigint_mul(__int64 *a1, __int64 a2, char *a3, __int64 a4, unsigned __int64 a5, __int64 a6)
{
  v7 = (_QWORD *)a4;
  v9 = (unsigned __int64)&a3[a5];
  v10 = 8;
  v11 = (unsigned __int64)&a3[a5 + 1];
  if ( &a3[a5] != (char *)-1LL )
  {
    if ( v11 >> 60 )
      sub_5555556120C0((__int64)a1, a2, (__int64)a3, a4, a4, a6);
    if ( 8 * v11 )
    {
      v13 = alloc_mem(8 * v11, 8u);
      v7 = (_QWORD *)a4;
      v10 = v13;
      if ( !v13 )
        sub_555555561D60(8, 8 * v11);
    }
  }
  sub_555555577730(v10, v11, a2, a3, v7, a5);
  if ( v11 )
  {
    if ( *(_QWORD *)(v10 + 8 * v9) )
    {
LABEL_7:
      v14 = v11;
      goto LABEL_8;
    }
    v18 = (__int64)&a3[a5 + 2];
    v19 = 8 * ~v9;
    while ( v19 )
    {
      v14 = v18 - 1;
      v19 += 8;
      if ( *(_QWORD *)(v10 + 8 * v18-- - 16) != 0 )
      {
        if ( v11 < v14 )
          goto LABEL_7;
        if ( v14 < v11 >> 2 && v11 > v14 )
        {
          v21 = 8 * v11;
          if ( v14 )
          {
            v15 = sub_555555574340((void *)v10, v21, 8u, 8 * v14);
            v16 = v14;
            if ( !v15 )
              sub_555555561D60(8, 8 * v14);
            goto LABEL_11;
          }
LABEL_22:
          v15 = 8;
          j_free((void *)v10);
          v14 = 0;
          goto LABEL_10;
        }
LABEL_8:
        v15 = v10;
        v16 = v11;
        goto LABEL_11;
      }
    }
    if ( v11 >= 4 )
      goto LABEL_22;
    v14 = 0;
    goto LABEL_8;
  }
  v14 = 0;
  v15 = v10;
LABEL_10:
  v16 = 0;
LABEL_11:
  result = a1;
  *a1 = v15;
  a1[1] = v16;
  a1[2] = v14;
  return result;
}

If you don't see it, you could just debug it and try all the operations in math you know of, eventually you'll guess correctly (use function read_int I've provided above to play with debugging it), also GPT can easily identify it as a bigint multiplication function, so it should be trivial to understand that it's a bigint_mul. Now let's continue stepping through this opcode further and spot the next unknown operations it does.

Next up it calls sub_555555563430 like that:

 v12 = (__int64 *)dest;
                    if ( !v29 )
                    {
LABEL_125:
                      sub_555555563430((__int64)&v318, (__int64)v12, (unsigned __int64 **)someConst, v23, v24, v25);

As you can see v12 is our dest (multiplication result), it stores the result of this op into v318, and then the second argument is basically "someConst". Why have I named it that way? Well, if we look at where it's first used it's at the very start of the function and it looks like that:

copy_data_as_int(someConst, (__int64)asc_555555618DC5, 128, 8, a5, a6);

So you can see some const data is getting copied as int into someConst, so it's safe to assume it's just some constant. If we jump into the function sub_555555563430, we can easily spot it's a modulo (%) function, again, if you don't see it, it's easy to bruteforce known math operations until one of them eventually gives the same result, or just ask gpt, it would be again successful to understand what the function does. So let's rename it into a bigint_modulo or something similar. By debuggin further we will see that the next main thing it hits is this loop:

                        v141 = v139;
                        for ( i = 0; i < v139; i += 8LL )
                        {
                          v143 = 8;
                          if ( v141 < 8 )
                            v143 = v141;
                          dest[0] = 0;
                          if ( __CFADD__(i, v143) )
                          {
                            v306 = 8;
                            if ( v141 < 8 )
                              v306 = v141;
                            sub_555555562290(i, i + v306, (__int64)&off_555555642938, v129, v130, v131);
                          }
                          if ( i + v143 > v139 )
                          {
                            v304 = 8;
                            if ( v141 < 8 )
                              v304 = v141;
                            sub_555555562210(i + v304, v139, &off_555555642938);
                          }
                          ((void (__fastcall *)(void **, char *))_memmove_avx512_unaligned_erms)(dest, &v136[i]);
                          v144 = ptrace_pokedata(pid, r8 + i, (__int64)dest[0]);
                          if ( v144 != 134 )
                          {
                            v331.m256i_i32[0] = v144;
                            sub_5555555623C0(aCalledResultUn_2, 43, &v331, &off_5555556428E8, &off_555555642950);
                          }
                          v141 -= 8LL;
                        }

What it basically does is writes data by 8 bytes, the data being written is the result of the operation performed by the opcode (you can just see it by reading what it copies in debug mode, for this opcode it would be (arg1 * arg2) % someConst). It copies this data in r8.

So basically the parent peforms different bigint operations, the arguments for which goes into rdi and rsi. The result is being written to r8, and the opcode itself is rax value. This is true for all the opcodes inside of it, but there are some opcodes that only operate with one integer.

Besides that all the opcodes if you would analyze carefully use "someConst" for modulo operations always at the end, which means all the bigint operations are performed in mod N. That was also one of the hints for the task that was given by the author. Now I would leave analyzing all the other bigint operations by debugging/statically analyzing them for the reader and here I'll spoil what each opcode does:

0x1001 - BIGINT_MUL, as we've already analyzed;

0x1002 - BIGINT_ADD;

0x1003 - BIGINT_SUB;

0x1004 - BIGINT_MODINV (pow(arg1, -1, N));

0x1005 - BIGINT_LEGENDRE_SYMBOL - Performs pow(arg1, (N - 1) // 2, N) - which is basically used for checking quadratic residue later on;

0x1006 - BIGINT_SQRT - Performs pow(initial, (N + res) // 4, N) - which is basically a sqrt.

After all of that is analyzed it's really hard to debug the child process, and we won't actually do it. The simplest way to recreate the algo at hand is to write a small IDAPython script to actually dump all the opcodes being called with their arguments, here's what I came up with:

names = {0x1001: 'MULTIPLY', 0x1002: 'ADD', 0x1003: 'SUB', 0x1004: 'MODINV', 0x1005: 'LEGENDRE_SYMBOL', 0x1006: 'SQRT'}

logs = open('logs.txt', 'w')

def printfile(*a):
    logs.write(' '.join(a) + '\n')

def read_num(addr):
    data = idc.read_dbg_memory(idc.read_dbg_qword(addr), 128)
    return int.from_bytes(data, byteorder='little')

def process_bp():
    m = {
        0x555555565516: [(0xA8, 0xF8), (0x80, 0x280)],
        0x5555555657da: [(0x1E0, 0xF8), (0xA8, 0x280)],
        0x555555565626: [(0xA8, 0xF8)],
        0x55555556584b: [(0xA8, 0xF8)],
        0x555555565685: [(0x1E0, 0xF8), (0xA8, 0x280)],
        0x555555565712: [(0x340, 0xF8)]
    }
    ea = idc.get_event_ea()
    rip = idc.read_dbg_qword(idc.get_reg_value('rsp')+0x318)
    r8 = idc.read_dbg_qword(idc.get_reg_value('rsp')+0x208)
    rax = idc.read_dbg_qword(idc.get_reg_value('rsp')+0x358)

    printfile('=' * 30)
    printfile(f'HIT OPCODE {names.get(rax, hex(rax))} AT {hex(rip)}: ')
    for i, (off, off_addr) in enumerate(m[ea]):
        res = read_num(idc.get_reg_value('rsp')+off)
        real_addr = idc.read_dbg_qword(idc.get_reg_value('rsp')+off_addr)
        printfile(f'Argument #{i + 1}: {hex(res)} (addr: {hex(real_addr)})')
    printfile('Output to:', hex(r8))


def main():
    while True:
        process_bp()
        idc.resume_process()
        idc.wait_for_next_event(WFNE_SUSP, -1)

After settings all the breakpoints needed (look at keys in m dict), we can launch the main function and then write logs.close() in IDAPython console, we should get a good logs file (but before that it's also important to make so your flag input passes the quadratic residue check, I was lucky that my test input of ctfcup{test_flggg} was passing it by default). Here's my logs file if you're lazy to do it yourself: click.

So at that point I've started recreating a program by performing the same operations from logs and instead of actually figuring out the logic I've just matched all the results I was getting with logs and replacing them by variables :)

Recreating the checker algorithm

Let's start with a first few lines of code:

==============================
HIT OPCODE MULTIPLY AT 0x55555556a717: 
Argument #1: 0x7d6767676c665f747365747b707563667463 (addr: 0x7fffffffd510)
Argument #2: 0x7d6767676c665f747365747b707563667463 (addr: 0x7fffffffd510)
Output to: 0x7fffffffd1c0
==============================
HIT OPCODE MULTIPLY AT 0x55555556a7e7: 
Argument #1: 0x3d6e24bf5ed9969b517cfd6f3c47631a0c36e95e091ab37d136a12d93737698385cdde49 (addr: 0x7fffffffdb10)
Argument #2: 0x7d6767676c665f747365747b707563667463 (addr: 0x7fffffffd510)
Output to: 0x7fffffffd1c0
==============================
HIT OPCODE MULTIPLY AT 0x55555556a8b7: 
Argument #1: 0x539 (addr: 0x7fffffffda90)
Argument #2: 0x7d6767676c665f747365747b707563667463 (addr: 0x7fffffffd510)
Output to: 0x7fffffffd1c0
==============================
HIT OPCODE ADD AT 0x55555556a994: 
Argument #1: 0x1e1798125c6c388955ebfb5f9bdbce26d62139ae7d5653c7498781b33d03a0ccfdd24588d2f587cb3a70efecf8fd76ccdad7d36c0a3b (addr: 0x7fffffffdb90)
Argument #2: 0x28ef10b0b2522a8872eacdd58ae55142215d90b (addr: 0x7fffffffdc10)
Output to: 0x7fffffffd1c0
==============================
HIT OPCODE LEGENDRE_SYMBOL AT 0x55555556aa29: 
Argument #1: 0x1e1798125c6c388955ebfb5f9bdbce26d62139ae7d5653c7498781b33d03a0ccfdd24817c40092f05d19771ba5dacf7b2febf581e346 (addr: 0x7fffffffd910)
Output to: 0x7fffffffd1c0
==============================
HIT OPCODE SQRT AT 0x55555556aa9a: 
Argument #1: 0x1e1798125c6c388955ebfb5f9bdbce26d62139ae7d5653c7498781b33d03a0ccfdd24817c40092f05d19771ba5dacf7b2febf581e346 (addr: 0x7fffffffd910)
Output to: 0x7fffffffd1c0

In the first opcode at 0x55555556a717 argument #1 is actually our flag as a little-endian number (you could've seen it when debugging the program first time). Then everything else is based on that flag value and only one const that is unknown to us (but if we change the flag we can see it's never changes so we can safely assume it's a const value, plus it's literally 1337, what else would it be). So basically here's what happens if we rewrite this whole construction to python:

N = 0x9e1c0acb7b20f01bc1ddf2870a79332887cfc747ed49cee32f6d5c0976ea37c11276e445564a906ffb31704b0f888e65ac7d9fd51e303587f93b198ff95b89ea1e01feec1a30ecfe9b811d930105bbeacee65838ba6dc72c3fa7f065ffad6d1e7ec2a1e63414bc5e8f95badbb3e65e9fd98b4d9be922daed4636b58214b3900b

flag_int = int.from_bytes(flag.encode(), byteorder='little')
fff = (flag_int * flag_int) % N
fff2 = (fff * flag_int) % N
fff3 = (0x539 * flag_int) % N
initial = (fff2 + fff3) % N
res = pow(initial, (N - 1) // 2, N)  # quadratic residue?

if res != 1:
    print('no')
    exit()

a = pow(initial, (N + res) // 4, N)  # sqrt?

Again this code was written by a simple algorithm:

Perform the operation in the code, save it to a variable (optionally give it a nonsensical name like fff);
Move to the next operation, if it uses some int you've seen before - find in which variable it is and use it, if you didn't see it - treat it as a const;

Then just perform this algo until you replicate all the logic, test it on a few different inputs, and if you see any differences - locate them and fix it (usually happens when the same num is in multiple variables, that can be fixed by just checking each step). It's pretty time-wasting, but it's an easy way to replicate the algorithm.

Next up we would need to actually go back to the child to find a few more things it does

v16 = 31337;
      do
      {
        if ( (v16 & 1) != 0 ) // v16 % 2 == 1
        {
          if ( bcmp(&v53[16], &v26, 0x80u) ) // check if v53 is 0
          {
            if ( bcmp(&s2[16], &v26, 0x80u) ) // check if s2 is 0
            {
              // …

              if ( bcmp(v149, v150, 0x80u) )
              {
                // zero out dest
              }
              else if ( bcmp(v121, v122, 128) )
              {
                // zero out dest
              }
              else
              {
                subroutine((__int64)dest, (__int64)v53);
              }
            }
            else
            {
              // copy v53 to dest;
            }
          }
          else
          {
            // zero out dest;
          }
          memcpy(v53, dest, sizeof(v53));
        }
        subroutine((__int64)dest, (__int64)s2);
        memcpy(s2, dest, sizeof(s2));
        v17 = v16 <= 1;
        v16 >>= 1;
      }
      while ( !v17 );

It's pretty messy but the general idea is that it inits var to 31337, divides it by 2 each loop and loops until it hits 0, if at any step it's odd, then it performs some operations using segfaults (ida does not show it, you need to look at assembly). After that it always calls subroutine to do something. And it basically performs the subroutine on the same buffer (s2 = subroutine(s2)). For the operation inside of v16 % 2 == 1 branch it seems to use two numbers - one being v53, another being dest. The logic itself is not clear for us. But let's recreate it through logs. It's also important to notice s2, dest and v53 seem to be a 3-integer tuple, not a single bigint.

Subroutine function is being called at addr 0x555555569274, and it ends at 0x55555556a30f. If we recreate this function based on the algo above (by just matching logs result), we should get something like that:

def sub_stage(x, y, z):
    ll2 = (x * x) % N
    bb = (y * y) % N
    ee = (bb * bb) % N
    ff = (z * z) % N
    lol = (ff * ff) % N
    vv = (x * bb) % N
    jj = (vv + vv) % N
    kk = (jj + jj) % N
    pp = (ll2 + ll2) % N
    ii = (pp + ll2) % N
    uu = (0x539 * lol) % N
    rr = (ii + uu) % N
    yy = (rr * rr) % N
    oo = (yy - kk) % N
    hh = (oo - kk) % N
    eee = (ee + ee) % N
    jjj = (eee + eee) % N
    jjjj = (jjj + jjj) % N
    iii = (kk - hh) % N
    uuu = (rr * iii) % N
    lll = (uuu - jjjj) % N
    el = (y * z) % N
    ij = (el + el) % N
    return hh, lll, ij

As I'm a reverse guy and dumbass at crypto, I have not figured out it was eliptic curves double operation, but it actually is. But let's not keep it in mind for now, we have another function to analyze. I've basically separated the branch logic into another function. You can see it starts at 0x55555556af34 and ends at 0x55555556c421. Let's once again use the same algorithm of matching input to output. But this function is a little bit harder to separate as it uses two arguments, but by looking at child we can analyze that it basically performs operations between s2 (that's being s2 = subroutine(s2) on each loop cycle) and then itself (v53). So the basic skeleton of the loop from child is something like that:

idk = None
h = 31337

while h != 0:
    if idk is not None and h % 2 == 1:
        idk = sub_stage2(idk, (flag_int, a, res))

    if idk is None:
        idk = (flag_int, a, res)

    flag_int, a, res = sub_stage(flag_int, a, res)
    h >>= 1

So, using our extremely smart algo of matching values in logs, let's recreate the sub_stage2:

def sub_stage2(a, b):
    x1, y1, z1 = a
    x2, y2, z2 = b
    dd = (z1 * z1) % N
    ee = (z2 * z2) % N
    pp = (x1 * ee) % N
    ii = (x2 * dd) % N
    ooo = (y1 * z2) % N
    ooo = (ooo * ee) % N
    bbb = (y2 * z1) % N
    bbb = (bbb * dd) % N
    uuu = (ii - pp) % N
    i = (uuu + uuu) % N
    iii = (i * i) % N
    j = (uuu * iii) % N
    r = (bbb - ooo) % N
    r = (r + r)%N
    v = (pp * iii) % N
    ppp = (r * r) % N
    ppp = (ppp - j) % N
    ppp = (ppp - v) % N
    ppp = (ppp - v) % N
    jj = (v - ppp) % N
    kk = (ooo * j) % N
    ll = (r * jj) % N
    nnn = (ll - kk) % N
    lll = (ooo * j) % N
    nnn = (nnn - lll) % N
    kkk = (z1 + z2) % N
    kkk2 = (kkk * kkk) % N
    kkk3 = (kkk2 - dd) % N
    yyy = (kkk3 - ee) % N
    yyy = (yyy * uuu) % N
    return ppp, nnn, yyy

Now after the loop passes it also performs a couple of operations from logs:

HIT OPCODE MODINV AT 0x55555556ca56: 
Argument #1: 0x8fdd6d1d1afc42a2d986612a9d0add3a4012e262c175cff57acc29113ebfd9aca8695ae84e5c3443c2aa4b8089a25376ebfcf96a0bb6953e1e3881390120d84cdcc210af82f8882626ab8c133f0bc07668808c10ea299922c5adf0a3844e253aa9d54e03e2002864291b69246b2cb11f8f5ebf60caa6eca1c1b60c4e63e40b22 (addr: 0x7fffffffd2c0)
Output to: 0x7fffffffce40
==============================
HIT OPCODE MULTIPLY AT 0x55555556cb2e: 
Argument #1: 0x7dfd034e9c3af63d848e108fe9c25b9efdc0629c8690b4dd3252bcdd4bd0f0ceb90bce0f78f2e83ba91d007b305ee380d889d37d0c7d50ad1ca1176dc5bb1691e4d73e77b709deef2ab89ca0abc668f1cf1ac152ab20181e2b5006e0c8755d78742657707c6b58adff293f0dbf02f75fedfd3cdbe39e1645638e4d187fd322a2 (addr: 0x7fffffffcfc0)
Argument #2: 0x7dfd034e9c3af63d848e108fe9c25b9efdc0629c8690b4dd3252bcdd4bd0f0ceb90bce0f78f2e83ba91d007b305ee380d889d37d0c7d50ad1ca1176dc5bb1691e4d73e77b709deef2ab89ca0abc668f1cf1ac152ab20181e2b5006e0c8755d78742657707c6b58adff293f0dbf02f75fedfd3cdbe39e1645638e4d187fd322a2 (addr: 0x7fffffffcfc0)
Output to: 0x7fffffffce40
==============================
HIT OPCODE MULTIPLY AT 0x55555556cc01: 
Argument #1: 0x169049b653d15f87f39268a3f722b7e36489db11f30a21943f487e7ea0fe5944b116a4f0c98ce8c1da29a9412817e5b3275be995d67a8c7bd12644013db8d91eb881fdf471eaff30fb507b6bb25c64b897af20a00cdd96011a3b393cec019419ca72c14a4a3cea0d750c0ff22a1464289ffb77b7623f69d41f9a3ad1a35eb9c5 (addr: 0x7fffffffcdc0)
Argument #2: 0x7dfd034e9c3af63d848e108fe9c25b9efdc0629c8690b4dd3252bcdd4bd0f0ceb90bce0f78f2e83ba91d007b305ee380d889d37d0c7d50ad1ca1176dc5bb1691e4d73e77b709deef2ab89ca0abc668f1cf1ac152ab20181e2b5006e0c8755d78742657707c6b58adff293f0dbf02f75fedfd3cdbe39e1645638e4d187fd322a2 (addr: 0x7fffffffcfc0)
Output to: 0x7fffffffce40
==============================
HIT OPCODE MULTIPLY AT 0x55555556ccc3: 
Argument #1: 0x5ccf2e0f4546bf4722d1ea426e928517ce0a08700bf58893cba8f3dfdc34519f53179b02693efaa2d509a367f53c5574d64ae156cbe7627f13336ef48c848335b0c39287143634c8a9330da95a3b157770af618e1caf34797fc3a9029116b3c40c9f85c1c384eee33fe9ef5a3e5846876eab9fbcb832d3ccc9849707b4f6318c (addr: 0x7fffffffd1c0)
Argument #2: 0x169049b653d15f87f39268a3f722b7e36489db11f30a21943f487e7ea0fe5944b116a4f0c98ce8c1da29a9412817e5b3275be995d67a8c7bd12644013db8d91eb881fdf471eaff30fb507b6bb25c64b897af20a00cdd96011a3b393cec019419ca72c14a4a3cea0d750c0ff22a1464289ffb77b7623f69d41f9a3ad1a35eb9c5 (addr: 0x7fffffffcdc0)
Output to: 0x7fffffffd040
==============================
HIT OPCODE MULTIPLY AT 0x55555556cd96: 
Argument #1: 0x873ada375d526d995664f2c819bceaf56dfbac32d66eaefa99efa1879aa3c370db4f1c78de8868de2c296f5abad79f10db15e23f40f498a22cd7afcaf210126ce378bbc6d6dc8c0d15d072720056a2a99eb9e3e76b6891edf6b14abaad7d51f4bf3a0d4e7c9b6f31ca554bdc2e78305c20ae0e74c5d216be616253dee54fed19 (addr: 0x7fffffffd240)
Argument #2: 0x5b1178a4c1955cf73db64cb143b3fc1479991cd86cd1e7c36a289413988481783d2c243641f799335bf1e1fdacd385b2085b9a6fb5f08809f90d3a4d627bd409a4902a16231374857c666f52e962cb7e6ad3dbf62000b9cbaa611bcc2e0afb48172c12948a849e4909bbb3212d03d099747e9499275967054498a4b20f8d9548 (addr: 0x7fffffffcd30)
Output to: 0x7fffffffd040

The 3 numbers being used are our result from idk var from skeleton loop, all the other ones are being calculated in those logs:

aa, bb, cc = idk
lol = pow(cc, -1, N)
kk = (lol * lol) % N
kk2 = (kk * lol) % N
kk3 = (aa * kk) % N
kk4 = (bb * kk2) % N

Here's what it results to. After that the child compares the resulting kk3 and kk4 variables to some const ints it has embedded inside its logic.

 v24 = bcmp(s2, src, 0x80u);
      v12 = ptr;
      v14 = v34;
      if ( v24 || (v23 = (void **)v120, bcmp(&s2[8], v120, 0x80u)) )

And we can actually see their values in the same func:

      v119 = xmmword_555555618EBB;
      v118 = xmmword_555555618EAB;
      v117 = xmmword_555555618E9B;
      v116 = xmmword_555555618E8B;
      v115 = xmmword_555555618E7B;
      v114 = xmmword_555555618E6B;
      v113 = xmmword_555555618E5B;
      *(_OWORD *)src = xmmword_555555618E4B;
      v120[0] = xmmword_555555618ECB;
      v120[1] = xmmword_555555618EDB;
      v120[2] = xmmword_555555618EEB;
      v120[3] = xmmword_555555618EFB;
      v120[4] = xmmword_555555618F0B;
      v120[5] = xmmword_555555618F1B;
      v120[6] = xmmword_555555618F2B;
      v120[7] = xmmword_555555618F3B;

Here we can have two options - statically extract those nums, or just do LDPRELOAD trick and replace bcmp function with ours in order to log the data in bcmp and extract it that way. I've chosen the static way, here's IDAPython code to easily extract those two integers:

kk3_want = int.from_bytes(idc.get_bytes(0x555555618E4B, 128), byteorder='little')
kk4_want = int.from_bytes(idc.get_bytes(0x555555618ECB, 128), byteorder='little')

So in the end the program does:

kk3_want = 0x7e668370223bfe527bddf4a30d26e48a533e5cb88cf9b1e169349888818c401a7ed9aeb4b0d4acbb8fd5909046ff8073036fdc343c12622d87bbe93127e15514318edba221a381bf70827fe95096ca5c674da1a40c8d3c8e8f5a63121418a5d4a12b911cfb241aaa358d4fd984e34cfe43f4fce34ba832e23a2d296288680e62
kk4_want = 0x9cc04c587c08455d3b38b057fc85a5015fd5bd0f2488e76911b50ea4b2901443e643a577ffb6f2f3c058e4658a8c306d0554970c028ac5d9862b09de47d698c2186d6b9f94a2cee53e1c4e95a16417f0bc319ee1b9f195fe04a16c3f75b6d8a1be052f14915eafab854617febde8011bb3340ba0c78c776a1d98d8d09fd30432
if kk3 == kk3_want and kk4 == kk4_want:
    print('yes')
else:
    print('no')
    exit()

Solving the final program

Alright, so the whole program looks more or less like that:

flag = 'ctfcup{test_flggg}'
N = 0x9e1c0acb7b20f01bc1ddf2870a79332887cfc747ed49cee32f6d5c0976ea37c11276e445564a906ffb31704b0f888e65ac7d9fd51e303587f93b198ff95b89ea1e01feec1a30ecfe9b811d930105bbeacee65838ba6dc72c3fa7f065ffad6d1e7ec2a1e63414bc5e8f95badbb3e65e9fd98b4d9be922daed4636b58214b3900b


def sub_stage2(a, b):
    x1, y1, z1 = a
    x2, y2, z2 = b
    dd = (z1 * z1) % N
    ee = (z2 * z2) % N
    pp = (x1 * ee) % N
    ii = (x2 * dd) % N
    ooo = (y1 * z2) % N
    ooo = (ooo * ee) % N
    bbb = (y2 * z1) % N
    bbb = (bbb * dd) % N
    uuu = (ii - pp) % N
    i = (uuu + uuu) % N
    iii = (i * i) % N
    j = (uuu * iii) % N
    r = (bbb - ooo) % N
    r = (r + r)%N
    v = (pp * iii) % N
    ppp = (r * r) % N
    ppp = (ppp - j) % N
    ppp = (ppp - v) % N
    ppp = (ppp - v) % N
    jj = (v - ppp) % N
    kk = (ooo * j) % N
    ll = (r * jj) % N
    nnn = (ll - kk) % N
    lll = (ooo * j) % N
    nnn = (nnn - lll) % N
    kkk = (z1 + z2) % N
    kkk2 = (kkk * kkk) % N
    kkk3 = (kkk2 - dd) % N
    yyy = (kkk3 - ee) % N
    yyy = (yyy * uuu) % N
    return ppp, nnn, yyy


def sub_stage(x, y, z):
    ll2 = (x * x) % N
    bb = (y * y) % N
    ee = (bb * bb) % N
    ff = (z * z) % N
    lol = (ff * ff) % N
    vv = (x * bb) % N
    jj = (vv + vv) % N
    kk = (jj + jj) % N
    pp = (ll2 + ll2) % N
    ii = (pp + ll2) % N
    uu = (0x539 * lol) % N
    rr = (ii + uu) % N
    yy = (rr * rr) % N
    oo = (yy - kk) % N
    hh = (oo - kk) % N
    eee = (ee + ee) % N
    jjj = (eee + eee) % N
    jjjj = (jjj + jjj) % N
    iii = (kk - hh) % N
    uuu = (rr * iii) % N
    lll = (uuu - jjjj) % N
    el = (y * z) % N
    ij = (el + el) % N
    return hh, lll, ij


if len(flag) % 8 != 0:
    pad = 8 - (len(flag) % 8)
    flag += '\x00' * pad

flag_int = int.from_bytes(flag.encode(), byteorder='little')
fff = (flag_int * flag_int) % N
fff2 = (fff * flag_int) % N
fff3 = (0x539 * flag_int) % N
initial = (fff2 + fff3) % N
res = pow(initial, (N - 1) // 2, N)  # quadratic residue?

if res != 1:
    print('no')
    exit()

a = pow(initial, (N + res) // 4, N)  # sqrt?

idk = None
h = 31337

while h != 0:
    if idk is not None and h % 2 == 1:
        idk = sub_stage2(idk, (flag_int, a, res))

    if idk is None:
        idk = (flag_int, a, res)

    flag_int, a, res = sub_stage(flag_int, a, res)
    h >>= 1

aa, bb, cc = idk
lol = pow(cc, -1, N)
kk = (lol * lol) % N
kk2 = (kk * lol) % N
kk3 = (aa * kk) % N
kk4 = (bb * kk2) % N


kk3_want = 0x7e668370223bfe527bddf4a30d26e48a533e5cb88cf9b1e169349888818c401a7ed9aeb4b0d4acbb8fd5909046ff8073036fdc343c12622d87bbe93127e15514318edba221a381bf70827fe95096ca5c674da1a40c8d3c8e8f5a63121418a5d4a12b911cfb241aaa358d4fd984e34cfe43f4fce34ba832e23a2d296288680e62
kk4_want = 0x9cc04c587c08455d3b38b057fc85a5015fd5bd0f2488e76911b50ea4b2901443e643a577ffb6f2f3c058e4658a8c306d0554970c028ac5d9862b09de47d698c2186d6b9f94a2cee53e1c4e95a16417f0bc319ee1b9f195fe04a16c3f75b6d8a1be052f14915eafab854617febde8011bb3340ba0c78c776a1d98d8d09fd30432
if kk3 == kk3_want and kk4 == kk4_want:
    print('yes')
else:
    print('no')
    exit()

So at that step we can figure out that it basically maps our x (being flag bytes) to a point on elliptic curve y^2 = x^2 + 0x539 * x (mod N) and then does a scalar multiplication by 31337. sub_stage is doubling on eliptic curve, sub_stage2 is addition on elliptic curve. Last operations are basically turning our point into the affine coordinates. Then it compares those coordinates to predetermined ones. We can also just give this code to ChatGPT and it would solve it pretty easily, but here's my solve to the problem:

from sage.all import *

N = 0x9e1c0acb7b20f01bc1ddf2870a79332887cfc747ed49cee32f6d5c0976ea37c11276e445564a906ffb31704b0f888e65ac7d9fd51e303587f93b198ff95b89ea1e01feec1a30ecfe9b811d930105bbeacee65838ba6dc72c3fa7f065ffad6d1e7ec2a1e63414bc5e8f95badbb3e65e9fd98b4d9be922daed4636b58214b3900b

A = 0x539
kk3_want = 0x7e668370223bfe527bddf4a30d26e48a533e5cb88cf9b1e169349888818c401a7ed9aeb4b0d4acbb8fd5909046ff8073036fdc343c12622d87bbe93127e15514318edba221a381bf70827fe95096ca5c674da1a40c8d3c8e8f5a63121418a5d4a12b911cfb241aaa358d4fd984e34cfe43f4fce34ba832e23a2d296288680e62
kk4_want = 0x9cc04c587c08455d3b38b057fc85a5015fd5bd0f2488e76911b50ea4b2901443e643a577ffb6f2f3c058e4658a8c306d0554970c028ac5d9862b09de47d698c2186d6b9f94a2cee53e1c4e95a16417f0bc319ee1b9f195fe04a16c3f75b6d8a1be052f14915eafab854617febde8011bb3340ba0c78c776a1d98d8d09fd30432

E = EllipticCurve(GF(N), [A, 0])
Q = E( kk3_want, kk4_want )

k_inv = inverse_mod(31337, E.order())
P = k_inv * Q

flag_bytes = int(P[0]).to_bytes((int(P[0]).bit_length()+7)//8, 'little')

print(flag_bytes.decode())

After running it with sage we should get a correct flag.

Conclusion

Got skill issue'd by this challenge, but really enjoyed it nonetheless, thanks to @al5c2rd for a great challenge.