GPU basics and technology trends
Chen, J.Y., "GPU technology trends and future requirements," Electron Devices Meeting (IEDM), 2009 IEEE International, pp. 1-6, 7-9 Dec. 2009
IEEE, (Free access link needed)
Fermi GF100 GPU Architecture
Wittenbrink, C.M.; Kilgariff, E.; Prabhu, A., "Fermi GF100 GPU Architecture," IEEE Micro, vol. 31, no. 2, pp. 50-59, March-April 2011
IEEE, cmu
NVIDIA Tesla (lots of detail on GPU architecture)
E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008.
IEEE, ucdenver
NVIDIA GT200
M. Papadopoulou, M. Sadooghi-Alvandi, H. Wong. Micro-benchmarking the GT200 GPU. Computer Group, ECE, University of Toronto, Tech. Rep, 2009
harvard
Comparison between NVIDIA and ATI GPUs
Ying Zhang; Lu Peng; Bin Li; Jih-Kwon Peir; Jianmin Chen, "Architecture comparisons between Nvidia and ATI GPUs: Computation parallelism and data communications," Workload Characterization (IISWC), 2011 IEEE International Symposium on, pp. 205-215, 6-8 Nov. 2011
IEEE, lsu
GPGPU-Sim
A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS ’09.
IEEE, stuffedcow
Barra (another GPU simulator)
S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: a parallel functional simulator for GPGPU. In IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 351–360, 2010.
IEEE, ufmg.br, Barra Google code page
Some precursor work relating to GPGPU-Sim
W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proc. 40th IEEE/ACM Int’l Symp. on Microarchitecture, 2007.
ACM, CMU
NVIDIA PTX ISA: Low-level virtual instruction set for general-purpose computation on NVIDIA GPUs
Link to paper on nvidia.com
Ocelot: A compiler framework to translate from PTX to other SIMD architectures
G. Diamos, A. Kerr, and M. Kesavan, “Translating GPU binaries to tiered SIMD architectures with Ocelot,” Georgia Institute of Technology, CERCS technical report GIT-CERCS-09-01, 2009.
handle.net, gatech
Y. Jiao, H. Lin, P. Balaji, W. Feng. "Power and performance characterization of computational kernels on the GPU," Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), pages 221-228.
IEEE, anl
First paper to deal with both branch divergence and memory divergence
Jiayuan Meng, David Tarjan, and Kevin Skadron. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10). ACM, New York, NY, USA, 235-246.
ACM, virginia.edu
Fung, W.W.L.; Aamodt, T.M., "Thread block compaction for efficient SIMT control flow," High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pp. 25-36, 12-16 Feb. 2011
IEEE, ucb
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems (ASPLOS '11). ACM, New York, NY, USA, 369-380.
ACM, wm.edu
Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, and Sudhakar Yalamanchili. 2011. SIMD re-convergence at thread frontiers. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11). ACM, New York, NY, USA, 477-488.
ACM, gdiamos.net
Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11). ACM, New York, NY, USA, 308-317.
ACM, CMU
Minsoo Rhu and Mattan Erez. 2012. CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). IEEE Press, Piscataway, NJ, USA, 61-71.
ACM, utexas
N. Brunie, S. Collange, G. Diamos. Simultaneous Branch and Warp Interweaving for Sustained GPU Performance. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). IEEE Press, Piscataway, NJ, USA
lyon
Gebhart, Mark, Stephen W. Keckler, Brucek Khailany, Ronny Krashinsky, and William J. Dally. "Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor." MICRO 2012.
utexas
Exception support (e.g. for virtual memory)
Jaikrishnan Menon, Marc De Kruijf, and Karthikeyan Sankaralingam. 2012. iGPU: exception support and speculative execution on GPUs. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). IEEE Press, Piscataway, NJ, USA, 72-83.
ACM
Virtual memory support WITHOUT precise exception handling
Hyesoon Kim. 2012. Supporting virtual memory in GPGPU without supporting precise exceptions. In Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC '12). ACM, New York, NY, USA, 70-71.
ACM, cmu
Decoupling texture memory access from computation
José-María Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis. 2012. Boosting mobile GPU performance with a decoupled access/execute fragment processor. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). IEEE Press, Piscataway, NJ, USA, 84-93.
ACM, upc
Reinforcement learning applied to memory access scheduling
Ipek, E.; Mutlu, O.; Martinez, J.F.; Caruana, R., "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," Computer Architecture, 2008. ISCA '08. 35th International Symposium on, pp. 39-50, 21-25 June 2008
IEEE, cornell
Memory competition between co-located CPU and GPU
Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H. Loh, and Onur Mutlu. 2012. Staged memory scheduling: achieving high performance and scalability in heterogeneous systems. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). IEEE Press, Piscataway, NJ, USA, 416-427.
cmu
Programmable memory controller
Bojnordi, M.N.; Ipek, E., "PARDIS: A programmable memory controller for the DDRx interfacing standards," Computer Architecture (ISCA), 2012 39th Annual International Symposium on, pp. 13-24, 9-13 June 2012
IEEE, rochester
Sven Woop, Jörg Schmittler, and Philipp Slusallek. 2005. RPU: a programmable ray processing unit for realtime ray tracing. In ACM SIGGRAPH 2005 Papers (SIGGRAPH '05), Markus Gross (Ed.). ACM, New York, NY, USA, 434-444.
ACM, uni-sb.de
A single CPU core that switches between an ILP mode (out-of-order, single-thread) and a TLP mode (in-order SMT).
Khubaib, M. Aater Suleman, Milad Hashemi, Chris Wilkerson, Yale N. Patt, "MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP", MICRO 2012
utexas
Splits control flow across two threads, one of which resolves branch outcomes ahead of the other. This is analogous to decoupled access/execute, where one thread computes addresses and issues loads ahead of another thread that consumes the data. Both techniques are relevant to GPUs.
Sheikh, Rami, James Tuck, and Eric Rotenberg. "Control-Flow Decoupling." MICRO 2012.
ncsu
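As a rough illustration of the control-flow decoupling idea in the entry above, the sketch below splits one loop into a part that only computes branch outcomes and a part that only consumes them through a queue. It is a sequential, software-only approximation written for this list, not the mechanism from the paper: the function names, the queue, and the example predicate are all assumptions, and in a real design the two parts would run as separate threads or loops with hardware support.

```python
from collections import deque


def fused_loop(data, threshold):
    """Baseline: the branch predicate and the guarded work are interleaved."""
    total = 0
    for x in data:
        if x > threshold:       # data-dependent branch
            total += x * x      # control-dependent work
    return total


def decoupled_loop(data, threshold):
    """Same result, with branch outcomes computed ahead of their use."""
    branch_queue = deque()      # hypothetical stand-in for a branch queue

    # Leading part: evaluate only the predicates and record the outcomes.
    for x in data:
        branch_queue.append(x > threshold)

    # Trailing part: pop precomputed outcomes instead of re-evaluating
    # the predicate, then do the control-dependent work.
    total = 0
    for x in data:
        if branch_queue.popleft():
            total += x * x
    return total


if __name__ == "__main__":
    sample = [1, 5, 2, 8]
    print(fused_loop(sample, 3), decoupled_loop(sample, 3))
```

The access/execute analogy has the same shape: replace the queue of branch outcomes with a queue of loaded values, filled by the part that computes addresses and drained by the part that does the arithmetic.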