<?xml version="1.0"?>
<?xml-stylesheet type="text/xml" href="presenter.xsl"?>
<slideshow>
	<title>Slideshow Title</title>
	<slide>
		<title>SSE and Conway's Game of Life</title>
		<body>
			By Mike Lewis
		</body>
	</slide>
	<slide>
		<title> Outline (for 5 minutes) </title>
		<body>
			<ul>
				<li>What is SSE(2)?</li>
				<li>Making GCC use SSE via intrinsics</li>
				<li>Conway's Game of Life</li>
				<li>Basic Optimizations</li>
				<ul>
					<li>Inlining</li>
					<li>Avoiding branching</li>
					<li>Reducing memory bandwidth</li>
					<ul>
						<li>Not-so-basic bit packing</li>
					</ul>
				</ul>
				<li>Super quick demo</li>
			</ul>
		</body>
	</slide>
	<slide>
		<title> What is SSE? </title>
		<body>
			<ul>
				<li>"Streaming SIMD Extensions"</li>
				<li>Math with 128-bit XMM registers</li>
				<li>SSE == single-precision FP math</li>
				<li>SSE2 == integer math (as well as double-precision FP)</li>
				<li>Lots of math at once</li>
				<ul>
					<li>Sixteen 8-bit integer operations per instruction!</li>
				</ul>
				<li>Lots more too</li>
				<li>AltiVec is comparable</li>
			</ul>
		</body>
	</slide>
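	<slide>
		<title> Sketch: sixteen byte adds in one instruction </title>
		<body>
			A minimal sketch of "lots of math at once", assuming GCC or Clang on x86 with SSE2 available. The helper name <code>add16</code> is made up for illustration; the intrinsics come from <code>emmintrin.h</code>.
			<div class="highlight"><code><![CDATA[
```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add two 16-byte arrays element by element (hypothetical helper).
   _mm_add_epi8 performs all sixteen 8-bit additions at once,
   and each byte wraps around modulo 256 on overflow. */
static void add16(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    /* Unaligned loads, so the arrays may live anywhere in memory. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}
```
]]></code></div>
			On 32-bit x86 this probably needs <code>-msse2</code>; on x86-64, SSE2 is part of the baseline.
		</body>
	</slide>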
	<slide>
		<title> How to get GCC to generate SSE2 instructions </title>
		<body>
			<ul>
				<li>GCC may auto-vectorize some loops by default, depending on your system, flags, and code</li>
				<li>Optimizing by hand with intrinsics is generally faster</li>
			</ul>
			<div class="highlight">
				<code>
#include &lt;emmintrin.h&gt;
				</code>
			</div>
			The IA-32 programmer's handbook is your friend
		</body>
	</slide>
	<slide>
		<title>Conway's Game of Life</title>
		<body>
			Blah
		</body>
	</slide>
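	<slide>
		<title> Sketch: the Life rule without branches </title>
		<body>
			The outline lists "avoiding branching", so here is a minimal sketch of the standard Life rule (a cell is alive next generation with exactly 3 live neighbors, or with 2 if it is already alive) written without an <code>if</code>. The names <code>life_rule</code> and <code>life_rule16</code> are hypothetical, and the sketch assumes one byte per cell holding 0 or 1.
			<div class="highlight"><code><![CDATA[
```c
#include <emmintrin.h>
#include <stdint.h>

/* Scalar version: comparisons yield 0 or 1, so bitwise ops replace if/else. */
static inline uint8_t life_rule(uint8_t alive, uint8_t neighbors) {
    return (uint8_t)((neighbors == 3) | (alive & (neighbors == 2)));
}

/* SSE2 version: 16 cells at once.
   alive holds 0x00 or 0x01 per byte, neighbors holds a count 0..8 per byte. */
static inline __m128i life_rule16(__m128i alive, __m128i neighbors) {
    const __m128i one = _mm_set1_epi8(1);
    /* Compares return 0xFF per matching byte, not 0x01. */
    __m128i is3 = _mm_cmpeq_epi8(neighbors, _mm_set1_epi8(3));
    __m128i is2 = _mm_cmpeq_epi8(neighbors, _mm_set1_epi8(2));
    __m128i next = _mm_or_si128(is3, _mm_and_si128(alive, is2));
    /* Normalize the 0xFF masks back to 0x01 cells. */
    return _mm_and_si128(next, one);
}
```
]]></code></div>
			Counting the neighbors is left out here; it is the part where the data layout matters most.
		</body>
	</slide>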

	<slide>
		<title>
			Inlining
		</title>
		<body>
			<code>static inline void func( void ) { /* body */ }</code>
			Put it in a header if you're going to use it a lot.

			<br/>
			<strong>usually better than macros</strong>
		</body>
	</slide>
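	<slide>
		<title> Sketch: inline function vs. macro </title>
		<body>
			One reason inline functions are usually the safer choice, as a small sketch. The header name <code>life_util.h</code> and the names <code>SQUARE_MACRO</code> and <code>square</code> are hypothetical, chosen only for illustration.
			<div class="highlight"><code><![CDATA[
```c
/* life_util.h (hypothetical header) */
#ifndef LIFE_UTIL_H
#define LIFE_UTIL_H

/* Macro: the argument text is pasted in twice, so an argument with
   side effects (such as i++) would be evaluated twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* static inline: the argument is evaluated exactly once and is
   type-checked. "static" lets every .c file that includes the header
   carry its own copy without link-time clashes. */
static inline int square(int x) {
    return x * x;
}

#endif /* LIFE_UTIL_H */
```
]]></code></div>
			With optimization on, the compiler will typically inline <code>square</code> and generate the same code as the macro.
		</body>
	</slide>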
	<slide>
		<title> Minimizing memory bandwidth </title>
		<body>
			<ul>
				<li>Number of cores growing almost exponentially</li>
				<li>Memory bandwidth and FSB speeds are not</li>
			</ul>
			Solutions (in Life at least):
			<ul>
				<li>Bit-pack data and make the processor do a bit more work</li>
				<li>Align memory on cache lines</li>
				<li>Use explicit aligned loads</li>
			</ul>

		</body>

	</slide>
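	<slide>
		<title> Sketch: cache-line alignment and aligned loads </title>
		<body>
			A minimal sketch of the last two bullets, assuming GCC or Clang (where <code>emmintrin.h</code> also provides <code>_mm_malloc</code>) and a 64-byte cache line. The names <code>CACHE_LINE</code>, <code>alloc_grid</code>, and <code>any_alive</code> are hypothetical.
			<div class="highlight"><code><![CDATA[
```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; with GCC/Clang this also declares _mm_malloc */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

/* Allocate a zeroed buffer starting on a cache-line boundary.
   The size is rounded up to whole cache lines. Release with _mm_free(). */
static uint8_t *alloc_grid(size_t bytes) {
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    uint8_t *p = (uint8_t *)_mm_malloc(rounded, CACHE_LINE);
    if (p != NULL)
        for (size_t i = 0; i < rounded; i++) p[i] = 0;
    return p;
}

/* Return 1 if any byte in the buffer is nonzero, else 0.
   Uses explicit aligned loads, so grid must be 16-byte aligned
   and bytes must be a multiple of 16. */
static int any_alive(const uint8_t *grid, size_t bytes) {
    assert(((uintptr_t)grid % 16) == 0 && bytes % 16 == 0);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        acc = _mm_or_si128(acc, _mm_load_si128((const __m128i *)(grid + i)));
    /* The mask is 0xFFFF only when all 16 accumulated bytes are zero. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(acc, _mm_setzero_si128())) != 0xFFFF;
}
```
]]></code></div>
			<code>_mm_load_si128</code> faults on a misaligned address, which is why the assertion is there; <code>_mm_loadu_si128</code> is the unaligned alternative.
		</body>
	</slide>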
	<slide>
		<title>Not-so-basic bit packing</title>
		<body>
			<ul>
				<li>Not enough time to get into this.</li>
				<li>Packs and unpacks in 3 instructions.</li>
			</ul>
			
<div class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="n">__m128i</span> <span class="nf">unpack</span><span class="p">(</span> <span class="n">__m128i</span> <span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">row</span> <span class="p">)</span> <span class="p">{</span>
    <span class="k">const</span> <span class="n">__m128i</span> <span class="n">onemask</span> <span class="o">=</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span> <span class="mh">0x01</span> <span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">shift</span> <span class="o">=</span> <span class="n">_mm_srli_epi32</span><span class="p">(</span> <span class="n">A</span><span class="p">,</span> <span class="n">row</span> <span class="p">);</span>
    <span class="k">return</span> <span class="n">_mm_and_si128</span><span class="p">(</span> <span class="n">shift</span><span class="p">,</span> <span class="n">onemask</span> <span class="p">);</span>
<span class="p">}</span>

<span class="c">// MAKE SURE A IS IN THE FORM (0xFF) not (0x01)</span>
<span class="c">// ( a | mask ) | ( ~mask &amp; b )</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="n">__m128i</span> <span class="nf">pack</span><span class="p">(</span> <span class="n">__m128i</span> <span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">row</span> <span class="p">)</span> <span class="p">{</span>
    <span class="k">const</span> <span class="n">__m128i</span> <span class="n">onemask</span> <span class="o">=</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span> <span class="mh">0x01</span> <span class="p">);</span>
    <span class="n">A</span> <span class="o">=</span> <span class="n">_mm_and_si128</span><span class="p">(</span> <span class="n">onemask</span><span class="p">,</span> <span class="n">A</span> <span class="p">);</span>
    <span class="k">return</span> <span class="n">_mm_slli_epi32</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">row</span><span class="p">);</span>
<span class="p">}</span>
</code></div>
		</body>
	</slide>
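	<slide>
		<title> Sketch: packing eight rows into one register </title>
		<body>
			A rough sketch of how <code>pack</code> and <code>unpack</code> might be combined, assuming each input row is 16 cells stored one per byte as 0 or 1. The helper <code>pack8</code> is hypothetical and not from the talk's code; it stores row r in bit r of every byte, so eight rows fit in the space of one.
			<div class="highlight"><code><![CDATA[
```c
#include <emmintrin.h>

/* unpack() and pack() as on the previous slide, repeated so this
   block compiles on its own. */
static inline __m128i unpack(__m128i A, int row) {
    const __m128i onemask = _mm_set1_epi8(0x01);
    __m128i shift = _mm_srli_epi32(A, row);   /* move bit `row` down to bit 0 */
    return _mm_and_si128(shift, onemask);     /* keep only bit 0 of each byte */
}

static inline __m128i pack(__m128i A, int row) {
    const __m128i onemask = _mm_set1_epi8(0x01);
    A = _mm_and_si128(onemask, A);            /* 0xFF or 0x01 both become 0x01 */
    return _mm_slli_epi32(A, row);            /* move bit 0 up to bit `row` */
}

/* Hypothetical helper: merge eight one-byte-per-cell rows into a single
   register. Because each byte is masked to 0x01 first and row is 0..7,
   the 32-bit shift never carries a bit into a neighboring byte. */
static inline __m128i pack8(const __m128i rows[8]) {
    __m128i packed = _mm_setzero_si128();
    for (int r = 0; r < 8; r++)
        packed = _mm_or_si128(packed, pack(rows[r], r));
    return packed;
}
```
]]></code></div>
			<code>unpack( packed, r )</code> then recovers row r as 0/1 bytes.
		</body>
	</slide>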
	<slide>
		<title>Stop! It's demo time</title>
	</slide>
	<slide>
		<title>I can has optimizations?</title>
		<body>
			<ul>
				<li>Trac for Life: <a href="http://trac.lolrus.org/life">trac.lolrus.org/life</a></li>
				<li>Best optimization guides in existence: <a href="http://www.agner.org/optimize/">http://www.agner.org/optimize/</a></li>
			</ul>
		</body>
	</slide>
</slideshow>

