CS205: Projects in Parallel Data Science at SCALE

Spring 2017

About

Extreme-scale data science at the convergence of big data and massively parallel computing is enabling simulation, modelling and real-time analysis of complex natural and social phenomena at unprecedented scales. The aim of the project is to gain practical experience of this interplay by applying parallel computation principles to a compute- and data-intensive problem. Applying interdisciplinary principles and skills of parallel computation and data science from CS205 and other courses, the goal is to construct a novel parallel software solution for an open-ended data science application that requires orders-of-magnitude compute scaling, using Harvard’s supercomputer, Odyssey. Additionally, the project provides an opportunity to apply novel concepts and technologies to create niche applications and research outputs.

Requirements

As a project team (4 to 5 members) you will identify a data science problem, analyse its compute scaling requirements, collect the data, design and implement parallel software, and demonstrate scaled performance of an end-to-end application.

The parallel software solution

  • should be implemented on a heterogeneous distributed-memory architecture with either many-core or multi-core compute nodes, and evaluated on 8 compute nodes (note: each compute node on Odyssey is either a multi-core node with 32 or 64 cores, or has a many-core GPU with hundreds of cores).
  • should be written as a hybrid parallel program in one of
    • MPI + OpenMP
    • MPI + OpenACC (or CUDA)
    • PGAS + X
    • Spark (with GPU acceleration)
  • should have its performance evaluated on large data sets to demonstrate both weak and strong scaling using appropriate metrics (throughput, efficiency, iso-efficiency).
  • should solve a problem with a non-trivial computation graph and hierarchical parallelism.
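
The scaling metrics named above can be computed from measured wall-clock times. The following is a minimal sketch, not a prescribed method: the function names and the timings are hypothetical, and it assumes the usual textbook definitions (strong scaling keeps the total problem size fixed, weak scaling grows the problem size in proportion to the number of processing elements).

```python
def strong_scaling(t1, tp, p):
    """Fixed total problem size.

    t1: runtime on 1 processing element, tp: runtime on p elements.
    Returns (speedup, efficiency), where efficiency = speedup / p.
    """
    speedup = t1 / tp
    return speedup, speedup / p


def weak_scaling_efficiency(t1, tp):
    """Problem size grows in proportion to p, so the ideal runtime stays at t1.

    Returns t1 / tp (1.0 means perfect weak scaling).
    """
    return t1 / tp


def throughput(work_items, seconds):
    """Work completed per second (e.g. records or samples processed)."""
    return work_items / seconds


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by number of compute nodes.
    timings = {1: 120.0, 2: 62.0, 4: 33.0, 8: 19.0}
    for p, tp in timings.items():
        s, e = strong_scaling(timings[1], tp, p)
        print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

Plotting speedup and efficiency against the number of nodes for both experiments gives the weak and strong scaling curves; iso-efficiency analysis then asks how fast the problem size must grow with the number of nodes to hold efficiency constant.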

Advanced concepts and technologies

To create novel parallel software solutions, or to pursue a research-oriented outcome, you can make use of the advanced concepts and technologies explored in the course. Implementations are available as libraries and open source software on which niche applications can be built.

Project Deliverables

  1. Web site (max. 5 pages)
    • Introduction, including a comparison with existing work on the problem.
    • Technical description of the parallel software solution and programming models, with links to the code repository.
    • Application scaling plots (speed-up, throughput, weak and strong scaling).
    • Advanced Features
    • Citations
  2. Software with evaluation data sets and test cases (in a GitHub repo)
  3. Presentation to the students and staff

Project Milestones

The milestones for your final project will be graded at each step according to the grading criteria given below. It is important to adhere to the deadlines, as the late submission policy does not apply to projects.

| Milestone | Deadline |
| --- | --- |
| Project team announcement (sign-up document deposited in Git repo) | 22nd March 2017 |
| Project proposal (1-page web site) | 25th March 2017 |
| Interim progress report (web site populated with preliminary results) | 15th April 2017 |
| Project deliverables (web site, code, README) | 1st May 2017 |
| Project presentation to class (10 mins. + Qs) | 2nd May 2017 |
| Weekly meetings with project supervisors | 20th March to 28th April 2017 |

Project submissions: GitHub and Piazza

  • All submissions are per group. Create your own repository on GitHub with a link to your project web page.
  • All project deliverables, including those for milestones, should be deposited in GitHub repos for peer evaluation.
  • All project-related correspondence should be posted on Piazza.
  • Send project-related emails (meeting schedules) to [email protected] and general queries to Piazza. Email the project supervisors directly only for critical or personal matters.

Project Supervisors

  1. Manju
  2. Charles
  3. Rafael
  4. WeiWeiPan

Project choices and programming environments

  • You can choose any data/computational science problem that you have already worked on in any other course: AC 209a, AC 209b, AM205, AM207, AC297R.
  • Alternatively, your own research work with advanced concepts as above would be suitable to generate research output.
  • Supervisors may offer projects based on their research interests.
  • You can re-use any code from the CS205 homework sets and build your application software on top of it. To gain further credit, however, that code must be extended with additional parallel code that meets the requirements specified above.
  • You can implement the solution in any programming language of your choice (discuss with supervisors).
  • In the interdisciplinary spirit of the subject area and of the CS205 course, projects and project teams should span multiple disciplines.

Research output (optional)

Optionally, the project may take the path of research and generate a research paper as output. In this case the project requirements are:

  1. To implement a parallel algorithm as above, in support of the research problem being addressed.
  2. To generate as final output a technical paper of journal quality, comparable in depth to papers published in leading journals in computational/data science or parallel computing.
  3. To choose a parallel solution, which could range from a novel parallel graph algorithm to optimising a scientific application on Odyssey with new insights, or anything in between these two bounds of theory and experimentation.

Project Grade

The project will be graded on the depth of work undertaken, communication (web site, presentation) and participation.

  • 10%: Project review (Peer and Supervisor meetings).
  • 40%: Project software, README, overall quality (baseline).
  • 30%: Advanced features.
  • 10%: Project web site.
  • 10%: Presentation to class.
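
As a quick illustration of how these weights combine, the sketch below computes a weighted total. The component keys, the function name and the 0 to 100 scoring scale are assumptions made for the example and are not part of the course policy; only the percentages come from the list above.

```python
# Weights in percent, taken from the grading breakdown above.
# The dictionary keys are hypothetical labels chosen for this example.
WEIGHTS = {
    "review": 10,
    "software": 40,
    "advanced": 30,
    "website": 10,
    "presentation": 10,
}


def project_grade(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale).

    scores: dict with exactly the same keys as WEIGHTS.
    Returns the weighted total on the same 0-100 scale.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Hypothetical component scores for one team.
    example = {"review": 90, "software": 80, "advanced": 70,
               "website": 100, "presentation": 85}
    print(project_grade(example))
```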

Project Scale and Scope

To uniformly assess the different projects the following criteria will be applied:

  • A team of 2 will be expected to produce output equivalent to at least 2 homeworks for the baseline outcome.
  • A team of 4 will therefore be expected to produce double that output in qualitative terms (not necessarily code, but features, analysis, evaluation, innovation). A team of 5 should exceed this threshold.
  • Advanced features can be split half on modelling and half on parallel software.

Resources