Large-scale Data Engineering (Winter 2025/26)

Seminar

Time: Mondays 14:00 – 16:00
Place: MAR 0.015/MAR 0.009 and zoom

At the beginning of the semester, students attend introductory lectures on reading scientific papers, finding related work, writing high-quality scientific papers, and giving a high-quality scientific presentation. Each student selects a topic, reads and understands the assigned paper, searches for related work, and writes a short summary of the assigned paper. At the end of the semester, each student gives a slide presentation in front of the class.

Topics

This semester's umbrella topic: Robust and Adaptive Query Processing

Traditionally, database query processing is divided into an optimization phase, which determines an optimal plan for the query, and an execution phase, which executes this plan. During optimization, different logically equivalent plans are enumerated and the plan with the lowest cost with respect to some cost model is chosen. Cost estimation is largely based on estimates of the cardinalities of intermediate results. Unfortunately, these estimates are often far off, resulting in bad query execution plans that may take orders of magnitude longer to execute than the optimal plan. Moreover, additional unknowns further complicate efficient query processing, e.g., unknown properties of base data and input datasets, access to external data sources, query parameters, and the system utilization at run-time. This semester, we deal with a broad range of research papers that address these challenges through (a) improved creation/management of statistics, (b) robust query optimization, and (c) adaptive query processing.
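
As a rough illustration of the problem described above, the following Python sketch enumerates all left-deep join orders for three relations, picks the cheapest one under estimated selectivities, and then re-costs that plan under the true selectivities. Everything in it is an assumption made up for illustration: the relation names R, S, and T, the cardinalities, the selectivity values, and the simple cost model (sum of intermediate result sizes). It is not taken from any of the seminar papers or from a real optimizer.

```python
from itertools import permutations

def plan_cost(order, card, sel):
    """Cost of a left-deep join order: sum of all join result cardinalities."""
    joined = [order[0]]
    size = card[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size *= card[rel]
        # Apply the selectivity of every join predicate between the new
        # relation and the relations joined so far (1.0 = no predicate).
        for prev in joined:
            size *= sel.get(frozenset((prev, rel)), 1.0)
        joined.append(rel)
        cost += size
    return cost

def best_plan(card, sel):
    """Enumerate all left-deep join orders and return the cheapest one."""
    return min(permutations(card), key=lambda order: plan_cost(order, card, sel))

# Hypothetical base table cardinalities.
card = {"R": 1000, "S": 1000, "T": 1000}

# Hypothetical selectivities: the optimizer's estimates vs. the true values.
estimated_sel = {frozenset(("R", "S")): 0.0001, frozenset(("S", "T")): 0.001}
true_sel = {frozenset(("R", "S")): 0.1, frozenset(("S", "T")): 0.0001}

chosen = best_plan(card, estimated_sel)   # what the optimizer picks
optimal = best_plan(card, true_sel)       # what it should have picked

print("chosen plan:", chosen, "true cost:", plan_cost(chosen, card, true_sel))
print("optimal plan:", optimal, "true cost:", plan_cost(optimal, card, true_sel))
```

In this toy setting, the optimizer joins R and S first because it underestimates the size of that join, while the plan that joins S and T first is roughly ten times cheaper under the true selectivities. Real cost models and plan spaces are far richer, but the failure mode is the same.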

List of topics

Submission & deadlines

Topic selection: After the first introductory lecture via a poll in the ISIS course.
Deadline: Oct 31, 23:59.
Notification of assigned topics: Nov 03.
Submission of the summary paper (PDF) by upload in the ISIS course.
Deadline: Jan 12, 23:59.
Submission of the presentation slides (PDF) by upload in the ISIS course.
Deadline: The day before the presentation, 23:59.
Last-minute changes after the submission are permitted.

Schedule

Introductory lectures
Slides will be made available prior to the individual lectures.

Oct 13, MAR 0.015: 01 Structure of Scientific Papers [pdf, pptx]
Oct 20, MAR 0.015: 02 Scientific Reading and Writing [pdf, pptx]
Oct 27, MAR 0.015: 03 Experiments, Reproducibility, and Presentations [pdf, pptx]

Self-organized seminar work

Nov 03 – Jan 19, room FR 768 and zoom: Optional consultation hours to discuss any questions

Student presentations

Jan 26, MAR 0.009, 14:00 – 18:00: Final presentations #1
Feb 02, MAR 0.009, 14:00 – 18:00: Final presentations #2

Project

Time: Mondays 16:00 – 18:00
Place: MAR 0.015/MAR 0.009 and zoom

At the beginning of the semester, students/teams select a project topic from a provided list. Then, they design and implement a high-quality prototype and demonstrate the value of their contribution through extensive tests, experiments, and documentation. The project ends with a presentation and defense of the results in front of the class.

Topics

The topics of the project are independent of the seminar. We will offer tasks in a wide range of components of data management and machine learning systems. Each individual project will be conducted in the context of one of the two systems developed by our group (and other collaborators) as part of our research:

DAPHNE: An open and extensible system infrastructure for integrated data analysis pipelines (mainly written in C++)
Apache SystemDS: An open-source ML system for the end-to-end data science lifecycle (mainly written in Java)

This gives students the chance to make meaningful contributions to free open-source projects. The projects can be done either individually or in teams of up to three students (with the expected amount of work proportional to the team size).

List of topics (updated Oct 22)

Submission & deadlines

Topic selection: After the kick-off meeting via a poll in the ISIS course. Deadline: Oct 31, 23:59.
Notification of assigned topics and teams: Nov 10.
Submission of the initial prototype (source code, tests) as a pull request on the GitHub repository of either DAPHNE or SystemDS (or, in exceptional cases, via email to patrick.damme(æ)tu-berlin.de and the respective project mentor).
Deadline: Jan 18, 23:59.
Submission of the final prototype (source code, tests, docs, experiments) as a pull request on the GitHub repository of either DAPHNE or SystemDS (or, in exceptional cases, via email to patrick.damme(æ)tu-berlin.de and the respective project mentor).
Deadline: Feb 16, 23:59.
Submission of the presentation slides (PDF) by upload in the ISIS course.
Deadline: The day before the presentation, 23:59.
Last-minute changes after the submission are permitted.

Schedule

Introductory lectures
Slides will be made available prior to the individual lectures.

Oct 13, MAR 0.015: Kick-off Meeting [pdf, pptx]
Oct 27, MAR 0.015: Recommendation to attend the seminar at 14:00

Self-organized project work

Nov 10 – Feb 16: Recommended consultations with the project mentor to discuss the initial design, implementation, tests, documentation, experiments, and any questions

Student presentations

Jan 19, MAR 0.009, 16:00 – 18:00: Intermediate presentations
Feb 23, MAR 0.009, 14:00 – 18:00: Final presentations
