Civil-Comp Proceedings
ISSN 1759-3433
CCP: 90
PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING FOR ENGINEERING
Edited by:
Paper 16

Unified Kernel and User Space Distributed Tracing for Message Passing Analysis

B. Poirier, R. Roy and M. Dagenais

Department of Computer Engineering, École Polytechnique, Montreal, Canada

Full Bibliographic Reference for this paper
B. Poirier, R. Roy, M. Dagenais, "Unified Kernel and User Space Distributed Tracing for Message Passing Analysis", in , (Editors), "Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 16, 2009. doi:10.4203/ccp.90.16
Keywords: performance analysis, event tracing, time synchronization, message passing, operating system.

Summary
Commonly available tracing tools cannot fully trace a distributed system. Whereas profilers record a sampled overview of a system, tracers can record a complete list of events. There are tools for application tracing, kernel tracing and network monitoring, but each of these, taken individually, records events from only one part of a complete system. Some tools, such as DTrace and KTAU, can merge user and kernel space traces. A complete tracer would enable the debugging, monitoring and optimization of distributed systems, grid computing systems and client-server programs from the application level down to the operating system and device driver level.

In this paper we present the design and architecture of a tool to trace an entire distributed system with minimal impact. This is accomplished in two parts: user space and kernel space traces are merged during execution time whereas the resulting distributed traces are synchronized afterwards during a retrospective analysis. Offline trace synchronization includes algorithms based on linear regressions or geometric analysis of offsets of individual messages. To achieve this, we have used the Linux Trace Toolkit Next Generation (LTTng) tracer for the Linux kernel. It has been extended with user space trace points and an MPI tracing library. Time synchronization is based on identifying message exchanges, using the traced TCP events.
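As an illustration of the regression-based offline synchronization mentioned above, the following is a minimal sketch and not the LTTng implementation. It assumes that each matched message yields one pair of timestamps (clock A, clock B), that messages flow in both directions with roughly symmetric network delay, and that the clock of node B is a linear function of the clock of node A. All function names (`fit_clock`, `to_reference`, `make_pairs`) are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (timestamp on clock A, timestamp on clock B)


def fit_clock(pairs: List[Pair]) -> Tuple[float, float]:
    """Least-squares fit of t_b = drift * t_a + offset over matched messages.

    Assumption: delays in the two directions roughly cancel, so the fitted
    line approximates the true relation between the two clocks.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two matched messages")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_a) ** 2 for a, _ in pairs)
    if sxx == 0.0:
        raise ValueError("timestamps on clock A must not all be equal")
    sxy = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    drift = sxy / sxx
    offset = mean_b - drift * mean_a
    return drift, offset


def to_reference(t_b: float, drift: float, offset: float) -> float:
    """Convert a timestamp from clock B to the timebase of clock A."""
    return (t_b - offset) / drift


def make_pairs(drift: float, offset: float, delay: float, count: int) -> List[Pair]:
    """Synthetic exchanges for testing: even messages go A->B, odd ones B->A."""
    pairs: List[Pair] = []
    for i in range(count):
        t = float(i)  # event time as read on clock A
        if i % 2 == 0:
            # Sent by A at t, received by B after the network delay.
            pairs.append((t, drift * (t + delay) + offset))
        else:
            # Sent by B at t (as seen by A), received by A after the delay.
            pairs.append((t + delay, drift * t + offset))
    return pairs
```

With the fitted parameters, every event recorded on node B can be mapped onto the timebase of node A using `to_reference`, which is the sense in which distributed traces are brought to a common timebase. A plain least-squares fit is only one possible choice; it is sensitive to asymmetric delays, which is one reason geometric methods over individual message offsets are also considered.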

We have tested the impact of tracing on the MPIBench communication benchmark and the Dbench filesystem benchmark. The tracer can collect millions of events per second from user and kernel space with an impact on communication times of between 10 and 15%. We then analyzed these traces to calculate clock parameters and synchronized all the events to a common timebase with an estimated standard deviation lower than 130 µs.
