Large-scale Data Engineering (Summer 2024)


Conducting research in the areas of data engineering, data management, and machine learning systems requires the ability to work with the scientific literature in these areas as well as to design, implement, and evaluate prototypes. To foster these skills, the DAMS Lab group (FG Big Data Engineering) at TU Berlin offers a seminar and a project on Large-scale Data Engineering as a combined module (12 ECTS), which can be taken by bachelor and master students. Taking both seminar and project is the ideal preparation for a bachelor/master thesis with our group. Alternatively, bachelor students (but not master students) may take the seminar as a separate module (3 ECTS) and the project as a separate module (9 ECTS).

Modules and assigned degree programs
  • Large-scale Data Engineering (module #41086, 12 ECTS): seminar and project
    • Bachelor's and master's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik), M.Sc. Computer Science (Informatik), M.Sc. Computer Engineering, M.Sc. Information Systems Management (Wirtschaftsinformatik), M.Sc. Electrical Engineering (Elektrotechnik)
    • Registration via a poll in the ISIS course. Deadline: Apr 07, 23:59 CEST.
    • Notification of admission (guaranteed participation, place in the queue, or no participation): Apr 08.
  • Seminar Large-scale Data Engineering (module #41095, 3 ECTS): seminar only
    • Bachelor's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik), B.Sc. Media Technology (Medientechnik)
    • Registration organized centrally by Faculty IV via an ISIS meta course. Deadline: Mar 25, 10:00 CET.
    • Notification of admission: centrally by Faculty IV by Apr 05.
  • Project Large-scale Data Engineering (module #41183, 9 ECTS): project only
    • Bachelor's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik)
    • Registration via a poll in the ISIS course. Deadline: Apr 07, 23:59 CEST.
    • Notification of admission (guaranteed participation, place in the queue, or no participation): Apr 08.
    • This module can no longer be taken as a programming practical (Programmierpraktikum). Suggested alternative: Programmierpraktikum Datensysteme via the ISIS meta course.

News

See announcements in the ISIS course.

Seminar

Time: Mondays 14:00 – 16:00
Place: TEL-811 (tba) and Zoom

At the beginning of the semester, students attend introductory lectures on reading scientific papers, finding related work, writing high-quality scientific papers, and giving a high-quality scientific presentation. Each student selects a topic, reads and understands the assigned paper, searches for related work, and writes a short summary of the assigned paper. At the end of the semester, each student gives a slide presentation in front of the group.


Topics

This semester's umbrella topic: Efficiently Combining DB and ML Workloads

Database query processing and ML training and scoring are normally executed in dedicated systems. However, there is a trend towards integrated data analysis pipelines involving both query processing and ML. Unfortunately, the orchestration of existing DB and ML systems is inefficient due to expensive data transfer and missed global optimization potential. This semester, we deal with recent research papers addressing these challenges through: (a) improving the data transfer between DB and ML systems, (b) running one kind of workload on existing software/hardware designed for the other kind of workload, (c) creating entirely new systems supporting both query processing and ML at the same time. These solutions affect all levels of the system stack: from query languages, through optimization and compilation techniques and local/distributed runtime techniques, to the use of multi-core CPUs and hardware accelerators.

List of topics: Separate list of topics (updated Apr 19)

Submission & deadlines
  • Topic selection: After the first introductory lecture via a poll in the ISIS course.
    Deadline: Apr 29, 23:59 CEST.
    Notification of assigned topics: May 06.
  • Submission of the summary paper as a PDF document by email to patrick.damme(æ)tu-berlin.de.
    Deadline: Jun 24, 23:59 CEST.
  • Submission of the presentation slides as a PDF document by email to patrick.damme(æ)tu-berlin.de.
    Deadline: The day before the presentation, 23:59 CEST.
    Last-minute changes after the submission are permitted.

Preliminary Schedule
Introductory lectures
Slides will be made available prior to the individual lectures.
  • Apr 15, MA-005: 01 Structure of Scientific Papers [pdf, pptx]
  • Apr 22: Optional office hour via Zoom
  • Apr 29, MAR-0.003: 02 Scientific Reading and Writing [pdf, pptx]
  • May 06, MAR-0.003: 03 Experiments, Reproducibility, and Presentations [pdf, pptx]
Self-organized seminar work
  • May 06 – Jun 24: Optional office hours to discuss any questions
Final presentations
  • Jul 01, 14:00 – 18:00: Presentation session #1
  • Jul 15, 14:00 – 18:00: Presentation session #2

Project

Time: Mondays 16:00 – 18:00
Place: TEL-811 (tba) and Zoom

At the beginning of the semester, students/teams pick a programming project from a provided list, devise an initial design, and then implement a prototype including documentation, tests, and relevant experiments. The project ends with a presentation of the obtained results in front of the group.


Topics

The topics of the project are independent of the seminar. We will offer tasks in a wide range of components of data management and machine learning systems. Each individual project will be conducted in the context of one of the two systems developed by our group (and other collaborators) as part of our research:

  • Apache SystemDS: An open-source ML system for the end-to-end data science lifecycle (mainly written in Java)
  • DAPHNE: An open and extensible system infrastructure for integrated data analysis pipelines (mainly written in C++)

This gives students the chance to make meaningful contributions to free open-source projects. The projects can be done either individually or in teams of up to three students (with the expected amount of work proportional to the team size).

List of topics: Separate list of topics (updated Apr 19)

Submission & deadlines
  • Topic selection: After the kick-off meeting via a poll in the ISIS course. Deadline: Apr 29, 23:59 CEST.
    Notification of assigned topics and teams: May 06.
  • Submission of the prototype source code, tests, and documentation as a pull request on the GitHub repository of either SystemDS or DAPHNE (or, in exceptional cases, via email to patrick.damme(æ)tu-berlin.de and the respective project mentor).
    Deadline: Jul 29, 23:59 CEST.
  • Submission of the presentation slides as a PDF document by email to patrick.damme(æ)tu-berlin.de and the respective project mentor.
    Deadline: The day before the presentation, 23:59 CEST.
    Last-minute changes after the submission are permitted.

Preliminary Schedule
Introductory lectures
Slides will be made available prior to the individual lectures.
  • Apr 15, MA-005: Kick-off Meeting [pdf, pptx]
  • May 06, MAR-0.003: Recommendation to attend the seminar at 14:00
Self-organized project work
  • May 06 – Jul 22: Optional office hours to discuss the initial design, implementation, tests, documentation, experiments, and any questions
Final presentations
  • Aug 05, 14:00 – 18:00: Presentation session

Organization

People
  • Dr.-Ing. Patrick Damme, patrick.damme(æ)tu-berlin.de
    Lecturer, seminar mentor, project mentor, general contact person
  • Prof. Dr.-Ing. Matthias Boehm; Arnab Phani, M.Sc.; Sebastian Baunsgaard, M.Sc.; Carlos E. Muniz Cuza, M.Sc.; Philipp Ortner, M.Sc.
    Project mentors
Language
  • Both the seminar and the project are held exclusively in English, but questions and other communication in German are fine as well.