Matching decompilation projects attempt to write source code (C, C++) that compiles to the same binary as the original. All source code is written from scratch.
Amusingly the little bit of decomp I was even slightly following was Lego Island, recently announced as finished, and it isn't present!
The same person is why I knew about the Mario Party 4 decomp also reaching basically the end. According to him, it is the first known finished GameCube game decomp, and it was helped a lot by two things: the game was built without optimizations on, and symbols for some parts of the engine leaked via another game on the same engine.
This decomp.dev site relies on the projects themselves integrating into the API. It's notably missing N64 game decompilations, even though they're the most numerous and the most commonly finished ones, though I'm starting to see traction on that front. Some won't add their project here because they don't want the attention. People tend to assume decomp = PC port, when the reality is that less than half of finished decompilations end up with any ports. But that doesn't stop users from begging for one and annoying the devs. So staying more obscure is a strategy to avoid the headaches of dealing with the general public.
I'm out of the loop, but as a software engineer I'm fascinated by the technique as I understand it (writing new code to reproduce the original binaries).
That sounds unbelievably challenging to me - not in the time consuming sense but the nearly impossible to get right sense. But people are doing it. Are there tools that help or is it really just looking at decompiled code and rewriting it in proper source?
As a side comment, what is the end goal? Is it primarily to produce source code or for the experience? I know Nintendo is very protective of its IP, so I assume any fan projects that crack the code have their days numbered.
As a side comment, what is the end goal?
Usually to enable modding or porting to different systems.
If you have matching binaries then you can make a mod and provide a patch file that people can apply to their ROM so that you aren't distributing the stuff most likely to get you in trouble.
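To show what I mean by a patch file, here is a rough Python sketch. It is my own toy illustration and not any real patch format (the ones people actually use are things like IPS or BPS); the function names and the sample bytes are made up, and I'm assuming the original and modified files are the same size. The point is that the patch only stores the offsets and the new bytes, so it is useless without the original ROM.

```python
from __future__ import annotations


def make_patch(original: bytes, modified: bytes) -> list[tuple[int, bytes]]:
    """Record each run of bytes where `modified` differs from `original`."""
    if len(original) != len(modified):
        raise ValueError("this sketch only handles files of the same size")
    patch = []
    i = 0
    while i < len(modified):
        if original[i] != modified[i]:
            start = i
            # Extend the run until the two files agree again.
            while i < len(modified) and original[i] != modified[i]:
                i += 1
            patch.append((start, modified[start:i]))
        else:
            i += 1
    return patch


def apply_patch(original: bytes, patch: list[tuple[int, bytes]]) -> bytes:
    """Overwrite the recorded offsets in a copy of `original`."""
    out = bytearray(original)
    for offset, data in patch:
        out[offset:offset + len(data)] = data
    return bytes(out)


# Made-up data standing in for a ROM and a modded build of it.
original_rom = bytes(range(16))
modded = bytearray(original_rom)
modded[4:6] = b"\xff\xfe"
modded[10] = 0x00
modded_rom = bytes(modded)

patch = make_patch(original_rom, modded_rom)
rebuilt = apply_patch(original_rom, patch)
```

Here `patch` ends up holding just two small entries, and `rebuilt` is identical to the modded build.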
If you have fully matching C or C++ then it is also very possible to run it through another compiler for another system and make like a PC version of a game that was never released on PC. There is a lot more to porting than just changing your CMake but it can help if that is a goal.
All of this stuff is in super wishy washy legal gray areas and most of it has never actually gone to court anywhere. People have the idea that a "clean room" decompilation (meaning no help from leaked code) is going to be legally fine but I have no idea how true that will be in practice (and will also be location dependent).
Are there tools that help or is it really just looking at decompiled code and rewriting it in proper source?
There are tools, but a lot of the lift is that the same functions are often used across several games on the same platform. You can hash all of the function assembly across a platform and look for matches. When the Legend of Zelda team figures out the function, the Diddy Kong Racing team and the Mario Party team get that function for free.
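To make the hashing idea concrete, here is a rough Python sketch. It is my own illustration, not what any of these teams actually run: the game names, function names, and byte strings are made up, and I'm assuming a plain SHA-256 over the raw bytes. I'd expect real tooling to normalize things like addresses before hashing, since otherwise the same function linked at a different location wouldn't match.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def function_fingerprint(code: bytes) -> str:
    """Hash the raw bytes of one function's machine code."""
    return hashlib.sha256(code).hexdigest()


def find_shared_functions(
    games: dict[str, dict[str, bytes]],
) -> dict[str, list[tuple[str, str]]]:
    """Group (game, function name) pairs whose code hashes identically."""
    by_hash = defaultdict(list)
    for game, functions in games.items():
        for name, code in functions.items():
            by_hash[function_fingerprint(code)].append((game, name))
    # Keep only fingerprints that show up in more than one game.
    return {
        digest: owners
        for digest, owners in by_hash.items()
        if len({game for game, _ in owners}) > 1
    }


# Hypothetical byte strings standing in for functions carved out of two ROMs.
games = {
    "game_a": {
        "func_80001000": b"\x27\xbd\xff\xe8\x03\xe0\x00\x08",
        "func_80002000": b"\x00\x00\x00\x00",
    },
    "game_b": {
        "func_80403000": b"\x27\xbd\xff\xe8\x03\xe0\x00\x08",
        "func_80404000": b"\x24\x02\x00\x01",
    },
}
shared = find_shared_functions(games)
```

In this toy data, `shared` has a single entry pairing `func_80001000` from `game_a` with `func_80403000` from `game_b`, so whoever matches one of them first has effectively matched both.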
You could (and maybe some people do?) just run a script that creates tons of code with a bit of guidance and try to shotgun some of the functions but I have not been involved with that particular aspect.
Generally there are a lot of things done to make the compiler produce the same assembly code and also make it reasonable to do on a larger scale. First of all, the same compiler version must be found - this can range from pretty easy to nearly impossible, depending on how old the binary in question is and whether the compiler has left any metadata in it. You also need to figure out the exact compiler flags that were used for the original. I'm not sure what the exact process for doing that is, but I'd imagine it involves finding some library functions that have known source code and trying things until the compiler outputs the same bytes that are in the binary that's being decompiled (at least, that sounds like the most reasonable approach to me).
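The "try things until the bytes match" idea could look roughly like this in Python. This is only a sketch of my guess at the process, not a description of what any project does. `find_matching_flags` and `fake_compile` are hypothetical names; `fake_compile` is a stand-in so the example runs without a real toolchain, and the flags are just illustrative values. In practice `compile_fn` would invoke the actual compiler on a function whose source is known.

```python
from __future__ import annotations

import itertools
from typing import Callable, Iterable, Optional


def find_matching_flags(
    source: str,
    target: bytes,
    flag_choices: Iterable[Iterable[str]],
    compile_fn: Callable[[str, tuple], bytes],
) -> Optional[tuple]:
    """Try every flag combination until the output equals `target`."""
    for flags in itertools.product(*flag_choices):
        # An empty string means "this flag was not passed".
        active = tuple(f for f in flags if f)
        if compile_fn(source, active) == target:
            return active
    return None


def fake_compile(source: str, flags: tuple) -> bytes:
    """Stand-in compiler: output depends only on the source and the flags."""
    return (source + "|" + " ".join(flags)).encode()


known_source = "int f(void) { return 1; }"
# Pretend these are the bytes found in the original binary.
target_bytes = fake_compile(known_source, ("-O2", "-g"))

found = find_matching_flags(
    known_source,
    target_bytes,
    [["-O0", "-O1", "-O2"], ["", "-g"]],
    fake_compile,
)
```

With the stand-in compiler, `found` comes back as `("-O2", "-g")`. The search space grows multiplicatively with every flag you add, which is presumably why having a library function with known source helps so much.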
Usually there's also some great tooling written to make the process more reasonable, like automatically comparing every decompiled function against the instructions that function has in the original binary - a recent video about the LEGO Island decompilation shows such a tool, and also illustrates some of the frustrations of getting the compiler to generate exactly the same binary. There's also another interesting method for checking whether the decompiled code is at least functionally equivalent: literally patching your versions of the functions into the original binary and forcing it to use them instead (I think it's really cool, because it allows checking if things work even early in the decompilation process - but I'm not sure how popular it is, since it might be tricky to get working).
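For a feel of what the comparison side of that tooling does, here is a small Python sketch using the standard `difflib` module. It is my own simplification and not the tool from the video: `match_report` is a hypothetical name, the instruction listings are made up, and I'm assuming both functions have already been disassembled into one instruction per line.

```python
from __future__ import annotations

import difflib


def match_report(original: list[str], candidate: list[str]) -> tuple[float, list[str]]:
    """Return a similarity percentage and a unified diff of two listings."""
    matcher = difflib.SequenceMatcher(a=original, b=candidate, autojunk=False)
    diff = list(
        difflib.unified_diff(
            original,
            candidate,
            fromfile="original",
            tofile="candidate",
            lineterm="",
        )
    )
    return matcher.ratio() * 100, diff


# Made-up disassembly: the candidate differs in one instruction.
original_asm = ["addiu sp, sp, -24", "sw ra, 20(sp)", "lw ra, 20(sp)", "jr ra"]
candidate_asm = ["addiu sp, sp, -24", "sw ra, 16(sp)", "lw ra, 20(sp)", "jr ra"]

score, diff = match_report(original_asm, candidate_asm)
perfect_score, perfect_diff = match_report(original_asm, original_asm)
```

A listing compared against itself scores 100 with an empty diff, which is the state a matching decomp wants for every function. The one-instruction mismatch above scores lower, and the diff points at the exact line to go stare at.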
Based on my (admittedly limited) reverse engineering experience I can also say that it gets easier over time - eventually you start recognising certain code patterns the compiler generates in different scenarios, and it becomes easier to tell what the original code might have possibly been. So yeah, as you said, it definitely seems impossible at first, but with the right approach and tools it becomes manageable, just very time consuming (and also still very hard).
Extras
Btw I can't talk about decomp without mentioning UM_bullet_ex.cpp. This was originally intended to reverse engineer some of the bullet-related code in Touhou 18, but it kind of went out of control and now there's a pretty sizable chunk of the game implemented there. Still in just one file. How many lines does it have? No idea, GitHub mobile stopped loading lines past 29000 when I tried to scroll to the bottom. Apparently clang hates it and sometimes throws an internal compiler error when attempting to build it. But can you really blame it?