The do’s and don’ts of PlayStation programming

This is a little article I decided to write while reading some good and bad code from various homebrew PlayStation communities.


The less you write/read, the faster it works

On the PlayStation, reading from and writing to RAM are expensive operations, and the cost is very noticeable in critical code, especially loops that populate and draw primitives. When you read, cache data as needed: for example, keep recurring data such as a structure pointer in a local variable, so your code does not turn into a mess of reloads of the same value over and over. As for writes, there are at least a couple of important cases to take into consideration:

  • Repopulation: if you have code that populates a primitive entirely, avoid rewriting it in full unless the primitive is small (say 4 words). Instead, preallocate the necessary primitives, fill in the data that is usually static (like UV coordinates), and update only the parts that change (say XY coordinates).
  • Bulk writes: avoid populating data one structure field at a time; go for writes that cover multiple values instead. This matters because building a couple of words in registers and storing them takes far less time than storing a series of bytes, shorts, etc. For example, if you need to populate RGB values in a primitive, use a cast that initializes all three values in one write (e.g. *(u32*)&p->rgb0 = 0xFF00FF); this can be combined with repopulation for maximum optimization.
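
To make the two points above concrete, here is a minimal sketch in plain C. SpritePacket is a hypothetical 4-word structure that only imitates the layout of a small sprite primitive; it is not the real LibGPU type, and the helper names are made up for illustration. The idea is that static fields get written once, and each per-frame update is a single 32-bit store.

```c
#include <stdint.h>

/* Hypothetical 4-word packet laid out like a small sprite primitive.
   This is NOT the real LibGPU structure, just an illustration. */
typedef struct {
    uint32_t tag;      /* ordering table link, filled in when sorted */
    uint32_t rgbc;     /* r | g << 8 | b << 16 | code << 24 */
    uint32_t xy;       /* x in the low half, y in the high half */
    uint32_t uv_clut;  /* u | v << 8 | clut << 16 */
} SpritePacket;

/* Build the colour word in registers so it can be stored in one write. */
static uint32_t pack_rgbc(uint8_t code, uint8_t r, uint8_t g, uint8_t b)
{
    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16)
         | ((uint32_t)code << 24);
}

/* One-time setup: everything that never changes is written here. */
static void sprite_init(SpritePacket *p, uint8_t code,
                        uint8_t u, uint8_t v, uint16_t clut)
{
    p->tag     = 0;
    p->rgbc    = pack_rgbc(code, 0x80, 0x80, 0x80);
    p->xy      = 0;
    p->uv_clut = (uint32_t)u | ((uint32_t)v << 8) | ((uint32_t)clut << 16);
}

/* Per-frame update: one 32-bit store instead of two 16-bit ones. */
static void sprite_move(SpritePacket *p, int16_t x, int16_t y)
{
    p->xy = (uint32_t)(uint16_t)x | ((uint32_t)(uint16_t)y << 16);
}
```

Note that in this layout the colour word also carries the primitive code in its top byte, so a one-write colour update has to include the code as well; pack_rgbc does that here.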

In all cases, beware of memory alignment. Every access must be performed on an address that is a multiple of the operation size (e.g. a 2-byte read needs an address that is a multiple of 2, a 4-byte read an address that is a multiple of 4). Otherwise the console will fail the operation and throw an exception, which translates to a hard crash if you don’t have an exception handler to catch the case.
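
As a rough illustration of the alignment rule, the sketch below shows a check you could drop into debug code, plus a way to read a 32-bit little-endian value from an arbitrary address using byte loads only. Both helper names are hypothetical, not part of any PlayStation library.

```c
#include <stdint.h>

/* Nonzero when ptr may be accessed with a load/store of `size` bytes.
   `size` is expected to be a power of two (2 or 4 here). */
static int is_aligned(const void *ptr, uintptr_t size)
{
    return ((uintptr_t)ptr & (size - 1)) == 0;
}

/* Read a little-endian 32-bit value from any address.
   Only byte loads are used, so no alignment requirement applies. */
static uint32_t read_u32_unaligned(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The byte-wise read is slower than a single word load, so it is best kept for data you cannot realign, such as fields in a packed file format.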

The scratchpad is your friend, ABUSE IT!

The scratchpad is a small buffer (1KB) of fast memory that the programmer can use for quite a few optimizations. You can use it as a fast-access buffer for decompression code, or similarly for a sorting algorithm. Typically the best usage is with 3D operations, where you need to write a lot of variables and structures while performing operations in loops. Say you have a MATRIX as a local variable: it will be allocated on the stack, which is part of RAM. Ok, we know RAM is slow and bad, right? So instead change the stack to point to the scratchpad, execute your slow code, and expect quite a few speedups. Just make sure whatever runs on that stack fits in 1KB, and of course remember to restore the stack when you’re done.
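
Swapping the stack pointer takes a few lines of assembly, so the sketch below shows a simpler variant of the same idea instead: hot work structures are placed in the scratchpad through a pointer rather than declared as locals. This is a sketch under assumptions. The scratchpad is mapped at 0x1F800000; TARGET_PSX is a hypothetical build flag, and WorkMatrix, ScratchWork and the helpers are illustrative names. Without the flag, a static buffer stands in so the code compiles on any host.

```c
#include <stdint.h>

#define SCRATCHPAD_SIZE 1024u

#ifdef TARGET_PSX                      /* hypothetical build flag */
#define SCRATCHPAD_BASE ((uint8_t *)0x1F800000)
#else                                  /* host fallback so the sketch compiles anywhere */
static uint32_t host_scratch[SCRATCHPAD_SIZE / 4];
#define SCRATCHPAD_BASE ((uint8_t *)host_scratch)
#endif

/* Illustrative matrix type: 3x3 rotation in 4.12 fixed point plus translation. */
typedef struct {
    int16_t m[3][3];
    int32_t t[3];
} WorkMatrix;

/* Everything the hot loop needs, grouped so it can live in the scratchpad. */
typedef struct {
    WorkMatrix local;
    WorkMatrix world;
    int32_t    temp[16];
} ScratchWork;

/* Compile-time check: the work area must fit in 1KB. */
typedef char scratch_work_fits[(sizeof(ScratchWork) <= SCRATCHPAD_SIZE) ? 1 : -1];

static ScratchWork *scratch_work(void)
{
    return (ScratchWork *)SCRATCHPAD_BASE;
}

/* Example of a write-heavy routine working directly on scratchpad data. */
static void set_identity(WorkMatrix *mat)
{
    int r, c;
    for (r = 0; r < 3; r++) {
        for (c = 0; c < 3; c++)
            mat->m[r][c] = (int16_t)((r == c) ? 4096 : 0);  /* 4096 = 1.0 */
        mat->t[r] = 0;
    }
}
```

Unlike the stack swap, this only speeds up the data you place there explicitly, but it needs no assembly and cannot leave the stack in a bad state.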

Organize your data in packages

If your game or program needs to load textures or other small data from disc, don’t fall into the trap of loading a million files just because they are tiny. Always remember the PlayStation CD unit isn’t exactly the fastest, even if it can read 300KB/s in double speed mode. Seeking to a position on the CD is a slow operation that breaks the flow, which means you had better pack your data and, if possible, read it all in one pass to take advantage of those 300KB/s; otherwise the laser will jump back and forth like a spicy mole on a synthesizer. Once all the data is in memory you can process it as necessary and discard whatever isn’t needed anymore.
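
One possible way to pack data is sketched below. The format is made up for this example (it is not a standard PlayStation format): a 32-bit entry count, followed by an offset/size pair per entry, followed by the data itself. Once the whole pack has been read in a single pass into a word-aligned buffer, looking up an entry is just pointer arithmetic.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical pack layout, all fields 32-bit little-endian:
     word 0          : entry count
     words 1..2*count: (offset, size) per entry, offset from pack start
     rest            : raw data
   The buffer is assumed to be 4-byte aligned. */
static const void *pack_get(const void *pack, uint32_t index, uint32_t *size_out)
{
    const uint32_t *words = (const uint32_t *)pack;
    uint32_t count = words[0];
    uint32_t offset, size;

    if (index >= count)
        return NULL;                 /* no such entry */

    offset = words[1 + index * 2];
    size   = words[2 + index * 2];

    if (size_out)
        *size_out = size;
    return (const uint8_t *)pack + offset;
}
```

Keeping every offset a multiple of 4 in the packing tool means entries can be accessed with word loads straight away.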

Running out of space? Overlays to the rescue

If you are not familiar with the term overlay, it’s more or less like a DLL, just without the dynamic part. In other words, overlays are parts of your exe that live as separate binary files and get loaded into a specific region of memory (which can be set dynamically or statically, you decide). The advantage of using overlays comes when you’re running low on resources and need a stable region of RAM to hold some extra code that is loaded on demand. For example, you can code a main menu and a configuration screen as overlays, and both can share the same address in memory.


LibGS is EVIL and you can do better

This is the most common error I’ve found while looking at homebrew code. People tend to overuse LibGS because it’s usually simple to operate and provides a nice abstraction layer. What these people ignore is how utterly slow and big this library really is.

Let’s take TMD as a case to examine: the whole thing is designed to give you instant access to 3D functionality with very little effort. What the samples don’t tell you is that TMD is an extremely limited format which provides almost no flexibility at all, not to mention that the code to display a model is freaking huge and usually comes with a million cases you don’t really need. Sure, you could fall back to HMD, a more advanced container with extended functionality, but that is extremely slow and usually put together from poorly optimized code.

Another flaw of this library lies in its sprites. Don’t bother using its internal sprite handler; again, it’s slow as hell and doesn’t provide much of a real benefit in the end. You wanna scale and rotate? Write optimized code for that and use POLY_FT4 directly. It’s not hard to call sin functions to produce rotations; even rotation matrices do that internally, so you can take advantage of them to produce the rotated vectors for the effect. Don’t want rotations? Use straight sprite primitives then: SPRT/SPRT_8/SPRT_16 already do everything and are extremely fast even to repopulate (3-4 writes at worst). Don’t know how to address a VRAM page? Check how DR_TPAGE primitives work; you can even merge them with any sprite to draw sequentially from different VRAM pages! Plus you can sort all types of primitives, not just sprites or whatever else LibGS offers (which is extremely limiting and dull, to be honest).
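
The scale-and-rotate case can be sketched as follows. The function computes the four screen corners of a rotated, scaled quad, in the top-left, top-right, bottom-left, bottom-right order used for quad vertices 0 to 3. Vec2 and rotated_quad are hypothetical names; the sine and cosine are passed in as 4.12 fixed point values (4096 = 1.0), which is an assumption about what your sin/cos routine returns.

```c
#include <stdint.h>

#define FIX_ONE 4096   /* 1.0 in 4.12 fixed point */

typedef struct { int16_t x, y; } Vec2;   /* illustrative type */

/* Corners of a quad centred at (cx, cy), rotated and scaled.
   s, c  : sine and cosine of the angle, 4.12 fixed point
   scale : 4.12 fixed point, FIX_ONE = original size
   Assumes >> on negative ints is an arithmetic shift (true for GCC). */
static void rotated_quad(Vec2 out[4], int cx, int cy,
                         int half_w, int half_h, int scale, int s, int c)
{
    int hw = (half_w * scale) >> 12;
    int hh = (half_h * scale) >> 12;

    /* Rotated half-width axis (hw, 0) and half-height axis (0, hh). */
    int wx = (hw * c) >> 12, wy = (hw * s) >> 12;
    int hx = (-hh * s) >> 12, hy = (hh * c) >> 12;

    out[0].x = (int16_t)(cx - wx - hx); out[0].y = (int16_t)(cy - wy - hy);
    out[1].x = (int16_t)(cx + wx - hx); out[1].y = (int16_t)(cy + wy - hy);
    out[2].x = (int16_t)(cx - wx + hx); out[2].y = (int16_t)(cy - wy + hy);
    out[3].x = (int16_t)(cx + wx + hx); out[3].y = (int16_t)(cy + wy + hy);
}
```

The results would then be copied into the XY fields of a preallocated POLY_FT4, leaving its UV, CLUT and texture page fields untouched.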

In other words, just write your code around LibGPU. That library provides all the low-level access you will ever need, and you can come up with faster replacements for what LibGS has to offer.

Allocation is important, but don’t overdo it

Another weird habit I’ve seen is the abuse of InitHeap3/malloc3/free3, which apparently stems from some of my code on Opera of the Red Moon (thanks to the guy who recycled it in the first place by stripping all my comments, now the code looks like random gibberish).

So why would you actually use dynamic allocation? Because you want all that data managed automatically, without having to carve out room by hand. Ok, scrap that thought, because malloc comes with a price: a best fit algorithm that has to search for a suitable block on every request. Why avoid it? Because you can use a temp pool to store your data sequentially (i.e. kinda like a stack, but in reverse) and fit new data there, rather than polluting RAM with search requests. The good thing is you can even discard volatile data (textures and sound) without any reallocation. Just drop the volatiles by rewinding to the last allocated pointer you saved, and you’re good to push more data with no problem. You can also use this to cooperate with overlays, so that you don’t accidentally allocate where code may be stored at some point.
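
A temp pool like the one described can be sketched in a few lines. Pool and the pool_* functions are hypothetical names, not part of any PlayStation library. Allocation just bumps a pointer, and discarding volatile data means rewinding to a previously saved mark.

```c
#include <stdint.h>
#include <stddef.h>

/* Sequential pool: data is pushed upwards from base towards end. */
typedef struct {
    uint8_t *base;
    uint8_t *top;   /* next free byte */
    uint8_t *end;
} Pool;

static void pool_init(Pool *p, void *mem, size_t size)
{
    p->base = (uint8_t *)mem;   /* mem is assumed to be 4-byte aligned */
    p->top  = p->base;
    p->end  = p->base + size;
}

/* Bump allocation, rounded up to 4 bytes to keep everything aligned. */
static void *pool_alloc(Pool *p, size_t size)
{
    void *result;
    size = (size + 3u) & ~(size_t)3u;
    if ((size_t)(p->end - p->top) < size)
        return NULL;            /* out of room, no searching involved */
    result = p->top;
    p->top += size;
    return result;
}

/* Remember the current position before loading volatile data... */
static void *pool_mark(const Pool *p)
{
    return p->top;
}

/* ...and rewind to it to drop everything allocated since. */
static void pool_release(Pool *p, void *mark)
{
    p->top = (uint8_t *)mark;
}
```

A typical pattern would be to allocate permanent data first, take a mark, then load per-level textures and sound on top, and release back to the mark when the level ends.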

On a side note, you can also use stack allocation to allocate primitives, as an efficient counterpart to LibGS’ packet allocator. Both work more or less the same really, but with your own code at least you know what the heck is going on.

Float is not what you wanna use

Float variables aren’t any good on the R3000: there is no FPU to operate on them natively, so every float operation is emulated in software. This is going to kill your performance, A LOT. Wanna use something that keeps the CPU fresh and running? Try fixed point math; that’s what the console uses for vectors, angles, and matrices anyway.
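
For reference, here is a minimal fixed point sketch using 12 fractional bits (4096 = 1.0). The fx type and the fx_* helpers are illustrative names, and the code assumes that right-shifting a negative value is an arithmetic shift, which holds for GCC.

```c
#include <stdint.h>

typedef int32_t fx;              /* 20.12 fixed point */
#define FX_SHIFT 12
#define FX_ONE   (1 << FX_SHIFT) /* 4096 = 1.0 */

static fx fx_from_int(int v)
{
    return (fx)(v * FX_ONE);
}

/* Truncates towards negative infinity. */
static int fx_to_int(fx v)
{
    return v >> FX_SHIFT;
}

/* Multiply in 64 bits so the intermediate product cannot overflow. */
static fx fx_mul(fx a, fx b)
{
    return (fx)(((int64_t)a * b) >> FX_SHIFT);
}

/* Pre-scale the dividend to keep the fractional bits; b must not be 0. */
static fx fx_div(fx a, fx b)
{
    return (fx)(((int64_t)a * FX_ONE) / b);
}
```

As a usage example, fx_mul(fx_from_int(3), FX_ONE / 2) gives 6144, which is 1.5 in this format.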