{"id":172,"date":"2016-04-14T16:12:03","date_gmt":"2016-04-14T16:12:03","guid":{"rendered":"http:\/\/appleofeden.de-doc.com\/?p=172"},"modified":"2016-04-22T16:40:41","modified_gmt":"2016-04-22T16:40:41","slug":"the-weird-art-of-optimization","status":"publish","type":"post","link":"https:\/\/classicrebirth.com\/index.php\/2016\/04\/14\/the-weird-art-of-optimization\/","title":{"rendered":"The weird art of optimization"},"content":{"rendered":"<p style=\"text-align: justify;\">In today&#8217;s update we discuss a recent find on PlayStation micro-optimizations and its unexpected results. Full news after the jump.<\/p>\n<p><!--more--><\/p>\n<p style=\"text-align: justify;\">So, the other day I was testing some optimized code to render polygons with the old procedures I&#8217;ve been using since day 1 of programming with Squeeze Bomb. Let&#8217;s see in detail what it does, starting with the macros:<\/p>\n<p>[code language=&#8221;CPP&#8221;]\/\/ a few extra macros for faster code, needs testing<br \/>\n\/\/ these don&#8217;t abuse the stack to store GTE calculation results<br \/>\n#define gte_stotz_m( r0 ) __asm__ volatile (\t\\<br \/>\n\t&quot;mfc2\t%0, $7;&quot;\t\t\\<br \/>\n\t: &quot;=r&quot;( r0 )\t\t\t\\<br \/>\n\t: )<\/p>\n<p>\/\/ FLAG is GTE control register 31, so it is read with cfc2 (mfc2 $31 would return data register LZCR)<br \/>\n#define gte_stflg_m( r0 ) __asm__ volatile (\t\\<br \/>\n\t&quot;cfc2\t%0, $31;&quot;\t\t\\<br \/>\n\t: &quot;=r&quot;( r0 )\t\t\t\\<br \/>\n\t: )<\/p>\n<p>#define gte_stopz_m( r0 ) __asm__ volatile (\t\\<br \/>\n\t&quot;mfc2\t%0, $24;&quot;\t\t\\<br \/>\n\t: &quot;=r&quot;( r0 )\t\t\t\\<br \/>\n\t: )<\/p>\n<p>\/\/ direct access to POLY_GT4.rgb3<br \/>\n#define gte_strr3_gt4( r0 ) __asm__ volatile (\t\\<br \/>\n\t&quot;swc2\t$22, 40( %0 );&quot;\t\\<br \/>\n\t:\t\t\t\t\t\t\\<br \/>\n\t: &quot;r&quot;( r0 )\t\t\t\t\\<br \/>\n\t: &quot;memory&quot; )<\/p>\n<p>\/\/ direct access to POLY_GT4.xy3<br \/>\n#define gte_stsxy_gt4_3( r0 ) __asm__ volatile (\\<br \/>\n\t&quot;swc2\t$14, 0x2C( %0 )&quot;\\<br \/>\n\t:\t\t\t\t\t\t\\<br \/>\n\t: 
&quot;r&quot;( r0 )\t\t\t\t\\<br \/>\n\t: &quot;memory&quot; )[\/code]<\/p>\n<p style=\"text-align: justify;\">The following is the actual rendering code for tris and quads:<\/p>\n<p>[code language=&#8221;cpp&#8221;]void FastTG3L(void *ob, void *packet, CVECTOR *rgb, u32* ot)<br \/>\n{<br \/>\n\tregister u32 i, is, *tag;<br \/>\n#if !CRAZY<br \/>\n\tIFO ifo;<br \/>\n#else<br \/>\n\tregister int otz;<br \/>\n#endif<br \/>\n\tregister POLY_GT3 *sx;<br \/>\n\tconst MD1_TRIANGLES *obj = (const MD1_TRIANGLES*)ob;<br \/>\n\tconst MD1_TRIANGLE *t = (const MD1_TRIANGLE*)obj-&gt;tri_offset;<br \/>\n\tconst SVECTOR *vp = (const SVECTOR*)obj-&gt;vertex_offset;<br \/>\n\tconst SVECTOR *vn = (const SVECTOR*)obj-&gt;normal_offset;<\/p>\n<p>\trgb-&gt;cd = (rgb-&gt;cd &amp; 3) | CODE_PGT3;<br \/>\n\tgte_ldrgb(rgb);<\/p>\n<p>\tsx = (POLY_GT3*)packet;<\/p>\n<p>\tfor (i = 0, is = obj-&gt;tri_count; i &lt; is; t++)<br \/>\n\t{<br \/>\n\t\tPOLY_GT3 *si;<br \/>\n\t\tgte_ldv3(&amp;vp[t-&gt;v0], &amp;vp[t-&gt;v1], &amp;vp[t-&gt;v2]);\t\/* load model vertices *\/<br \/>\n\t\ti++;<br \/>\n\t\tsi = sx;<br \/>\n\t\tgte_rtpt_b();\t\t\t\t\t\/* perspective *\/<\/p>\n<p>#if !CRAZY<br \/>\n\t\tgte_stflg(&amp;ifo.flg);\t\t\t\/* store flag *\/<br \/>\n\t\tif (ifo.flg &amp; GTEFLG_ERROR) { sx += 2; continue; }<br \/>\n#else<br \/>\n\t\tgte_stflg_m(otz);<br \/>\n\t\tif (otz &amp; GTEFLG_ERROR) { sx += 2; continue; }<br \/>\n#endif<br \/>\n\t\tgte_nclip_b();\t\t\t\t\t\/* normal clipping *\/<br \/>\n#if !CRAZY<br \/>\n\t\tgte_stopz(&amp;ifo.otz);\t\t\t\/* return orientation *\/<br \/>\n\t\tif (ifo.otz &lt;= 0) { sx += 2; continue; }<br \/>\n#else<br \/>\n\t\tgte_stopz_m(otz);<br \/>\n\t\tif (otz &lt;= 0) { sx += 2; continue; }<br \/>\n#endif<br \/>\n\t\tgte_stsxy3_gt3(si); \/* store transformed result *\/<br \/>\n\t\tsx += 2;<br \/>\n\t\tgte_nop();<br \/>\n\t\tgte_avsz3_b(); \/* calculate depth *\/<br \/>\n#if !CRAZY<br \/>\n\t\tgte_stotz(&amp;ifo.otz); \/* get depth *\/<br \/>\n\t\tif (!(ifo.otz 
&gt;&gt; 6)) continue;\t\/* skip if it&#8217;s too low or too high *\/<br \/>\n#else<br \/>\n\t\tgte_stotz_m(otz);<br \/>\n\t\tif (!(otz &gt;&gt; 6)) continue;<br \/>\n#endif<\/p>\n<p>\t\tgte_ldv3(&amp;vn[t-&gt;n0], &amp;vn[t-&gt;n1], &amp;vn[t-&gt;n2]);\t\/* set lighting *\/<br \/>\n#if !CRAZY<br \/>\n\t\ttag = &amp;ot[ifo.otz &gt;&gt; 4];<br \/>\n#else<br \/>\n\t\ttag = &amp;ot[otz &gt;&gt; 4];<br \/>\n\t\tsi-&gt;tag = (*tag &amp; 0x00FFFFFF) | 0x09000000;<br \/>\n#endif<br \/>\n\t\tgte_ncct_b();\t\t\t\t\t\t\t\t\/* calculate *\/<br \/>\n\t\tgte_strgb3_gt3(si);\t\t\t\t\t\t\t\/* store rgb values *\/<\/p>\n<p>\t\t\/\/ sort!!<br \/>\n#if !CRAZY<br \/>\n\t\tsi-&gt;tag = (*tag &amp; 0x00FFFFFF) | 0x09000000;<br \/>\n#endif<br \/>\n\t\t*tag = (u32)si &amp; 0x00FFFFFF;<br \/>\n\t}<br \/>\n}<\/p>\n<p>void FastTG4L(void *ob, void *packet, CVECTOR *rgb, u32* ot)<br \/>\n{<br \/>\n\tregister u32 i, is, *tag;<br \/>\n#if !CRAZY<br \/>\n\tIFO ifo;<br \/>\n#else<br \/>\n\tint otz, flg;<br \/>\n#endif<br \/>\n\tregister POLY_GT4 *sx;<br \/>\n\tconst MD1_QUADS *obj = (const MD1_QUADS*)ob;<br \/>\n\tconst MD1_QUAD *q = (const MD1_QUAD*)obj-&gt;quad_offset;<br \/>\n\tconst SVECTOR *vp = (const SVECTOR*)obj-&gt;vertex_offset;<br \/>\n\tconst SVECTOR *vn = (const SVECTOR*)obj-&gt;normal_offset;<\/p>\n<p>\trgb-&gt;cd = (rgb-&gt;cd &amp; 3) | CODE_PGT4;<br \/>\n\tgte_ldrgb(rgb);<\/p>\n<p>\tsx = (POLY_GT4*)packet;<\/p>\n<p>\tfor (i = 0, is = obj-&gt;quad_count; i &lt; is; q++)<br \/>\n\t{<br \/>\n\t\tPOLY_GT4 *si;<br \/>\n\t\tgte_ldv3(&amp;vp[q-&gt;v0], &amp;vp[q-&gt;v1], &amp;vp[q-&gt;v2]);<br \/>\n\t\tsi = sx;<br \/>\n\t\ti++;<br \/>\n\t\tgte_rtpt_b();\t\t\t\/* RotTransPers3 *\/<\/p>\n<p>#if !CRAZY<br \/>\n\t\tgte_stflg(&amp;ifo.flg0);<br \/>\n\t\tif (ifo.flg0 &amp; GTEFLG_ERROR) { sx += 2; continue; }<br \/>\n\t\tgte_nclip_b();\t\t\t\/* NormalClip *\/<br \/>\n\t\tgte_stopz(&amp;ifo.otz);\t\/* back clip *\/<br \/>\n\t\tif (ifo.otz &lt;= 0) { sx += 2; continue; }\t\/* flipped, skip 
*\/<br \/>\n#else<br \/>\n\t\tgte_stflg_m(flg);<br \/>\n\t\tif (flg &amp; GTEFLG_ERROR) { sx += 2; continue; }<br \/>\n\t\tgte_nclip_b();\t\t\t\/* NormalClip *\/<br \/>\n\t\tgte_stopz_m(otz);\t\/* back clip *\/<br \/>\n\t\tif (otz &lt;= 0) { sx += 2; continue; } \/* flipped, skip *\/<br \/>\n#endif<br \/>\n\t\tgte_stsxy3_gt4((u_long *)si); gte_ldv0(&amp;vp[q-&gt;v3]);<br \/>\n\t\tsx += 2;<br \/>\n\t\tgte_nop();<br \/>\n\t\tgte_rtps_b();\t\t\t\/* RotTransPers *\/<br \/>\n#if !CRAZY<br \/>\n\t\tgte_stflg(&amp;ifo.flg);<br \/>\n\t\tif (ifo.flg &amp; GTEFLG_ERROR) continue;<br \/>\n#else<br \/>\n\t\tgte_stflg_m(flg);<br \/>\n\t\tif (flg &amp; GTEFLG_ERROR) continue;<br \/>\n#endif<\/p>\n<p>\t\tgte_stsxy_gt4_3(si);<br \/>\n\t\tgte_avsz4();<br \/>\n#if !CRAZY<br \/>\n\t\tgte_stotz(&amp;ifo.otz);<br \/>\n\t\t\/\/ limit range<br \/>\n\t\tif (!(ifo.otz &gt;&gt; 6)) continue;<br \/>\n#else<br \/>\n\t\tgte_stotz_m(otz);<br \/>\n\t\tif (!(otz &gt;&gt; 6)) continue;<br \/>\n#endif<\/p>\n<p>\t\tgte_ldv3(&amp;vn[q-&gt;n0], &amp;vn[q-&gt;n1], &amp;vn[q-&gt;n2]);<br \/>\n#if !CRAZY<br \/>\n\t\ttag = &amp;ot[ifo.otz &gt;&gt; 4];<br \/>\n#else<br \/>\n\t\ttag = &amp;ot[otz &gt;&gt; 4];<br \/>\n#endif<br \/>\n\t\tgte_ncct_b();<br \/>\n\t\tgte_strgb3_gt4(si);<\/p>\n<p>\t\tgte_ldv0(&amp;vn[q-&gt;n3]);<br \/>\n\t\tsi-&gt;tag = (*tag &amp; 0x00FFFFFF) | 0x0C000000;<br \/>\n\t\tgte_nccs_b();<br \/>\n\t\tgte_strr3_gt4(si);<\/p>\n<p>\t\t\/\/ sort!!<br \/>\n\t\t*tag = (u32)si &amp; 0x00FFFFFF;<br \/>\n\t}<br \/>\n}[\/code]<\/p>\n<p style=\"text-align: justify;\">If you are familiar with inline assembly and how the stack works, you will probably notice two minor differences that can translate to better performance. The macros above change one very stupid behavior of Sony&#8217;s original macros for retrieving GTE registers, which stored the results in memory rather than in CPU registers. 
The code enabled with CRAZY = TRUE is the one that copies results directly into registers, while the other case defaults to stack writes. It&#8217;s not exactly the biggest change ever, but it avoids unnecessary memory accesses, which carry a heavy penalty on the PlayStation.<\/p>\n<p style=\"text-align: justify;\">At first I thought the code wouldn&#8217;t work because Sony made the macros work with memory as the only means of accessing GTE registers, but apparently there is no difference whatsoever in behavior when you use mfc2 (possibly cfc2 too) instead of swc2. I&#8217;m still not sure how much this improves general performance, but it could be enough to prevent future lag. Similarly, the new macros that access the POLY_GT3 and POLY_GT4 diffuse attributes do a little more optimization, even though it&#8217;s not that great; all they do is write straight into the structures rather than creating temporary register values for each attribute.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In today&#8217;s update we discuss a recent find on PlayStation micro-optimizations and its unexpected results. 
Full news after the jump.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-172","post","type-post","status-publish","format-standard","hentry","category-re2proto"],"_links":{"self":[{"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/posts\/172","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/comments?post=172"}],"version-history":[{"count":0,"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/posts\/172\/revisions"}],"wp:attachment":[{"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/media?parent=172"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/categories?post=172"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/classicrebirth.com\/index.php\/wp-json\/wp\/v2\/tags?post=172"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}