Large-scale Data Engineering (Winter 2023/24)


Conducting research in the areas of data engineering, data management, and machine learning systems requires the ability to work with the scientific literature in these areas as well as to design, implement, and evaluate prototypes. To foster these skills, the DAMS Lab group (FG Big Data Engineering) at TU Berlin offers a seminar and a programming project on Large-scale Data Engineering as a combined module (12 ECTS), which can be taken by bachelor and master students. Taking both seminar and project is the ideal preparation for a bachelor/master thesis with our group. Alternatively, bachelor students (but not master students) may take the seminar as a separate module (3 ECTS) and the project as a separate module (9 ECTS).

Modules and assigned degree programs
  • Large-scale Data Engineering (module #41086, 12 ECTS): seminar and project
    • Bachelor's and master's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik), M.Sc. Computer Science (Informatik), M.Sc. Computer Engineering, M.Sc. Information Systems Management (Wirtschaftsinformatik), M.Sc. Electrical Engineering (Elektrotechnik)
    • Registration via the ISIS course. Deadline: Oct 12, 23:59 CEST.
    • Notification of admission (guaranteed participation, place in the queue, or no participation): Oct 13.
  • Seminar Large-scale Data Engineering (module #41095, 3 ECTS): seminar only
    • Bachelor's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik), B.Sc. Media Technology (Medientechnik)
    • Registration organized centrally by Faculty IV via an ISIS meta course. Deadline: Sep 20, 10:00 CEST.
    • Notification of admission: centrally by Faculty IV on Oct 02.
  • Project Large-scale Data Engineering (module #41094, 9 ECTS): project only
    • Bachelor's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik)
    • Registration by email to Patrick Damme (patrick.damme(æ)tu-berlin.de). Deadline: Oct 12, 23:59 CEST.
      Please specify your matriculation number, study program, and whether you want LDE to be credited as a Projekt or a Programmierpraktikum.
    • Notification of admission (guaranteed participation, place in the queue, or no participation): Oct 13.

News

See announcements in the ISIS course.

Seminar

Time: Mondays 14:00 s.t. – 16:00
Place: TEL-811 and Zoom

At the beginning of the semester, students attend introductory lectures on reading scientific papers, finding related work, writing high-quality scientific papers, and giving a high-quality scientific presentation. Each student selects a topic, reads and understands the given paper, searches for related work, and writes a short summary of the assigned paper (6 pages). At the end of the semester, each student gives a slide presentation (15 min talk + 5 min discussion) in front of the group.


Topics

In this semester, we focus on the umbrella topic of Extensible Data Systems:

To meet the requirements of emerging applications as well as to enable the timely adoption of novel techniques and technologies, making database and machine learning systems extensible has been an active research field for decades. Concepts for extensibility and variability have been proposed at all levels of the system stack, from query/programming languages, through the optimizer, down to the execution in distributed environments and on heterogeneous hardware and storage. This seminar takes a tour through some of the most important works in this field.

List of topics: Separate list of topics (updated Oct 20)

Submission & deadlines
  • Topic selection: After the first introductory lecture, please send your ranked list of ca. 5 preferred topics via email to patrick.damme(æ)tu-berlin.de.
    Deadline: Oct 30, 23:59 CET.
    Notification of assigned topics: Nov 06.
  • The summary papers (6 pages) should be written in LaTeX using the ACM acmart template (document class sigconf, double-column), and be submitted as PDF documents by email to patrick.damme(æ)tu-berlin.de.
    Deadline: Jan 15, 23:59 CET (originally Jan 08).
  • The presentation slides should be submitted as PDF documents by email to patrick.damme(æ)tu-berlin.de.
    Deadline: The day before the presentation, 23:59 CET.
    Last-minute changes after the submission are permitted.
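
For orientation, the formatting requirement for the summary paper can be sketched as a minimal LaTeX skeleton. This is only an illustrative starting point, not an official template of the course: the title, author block, section name, and the bibliography file name `references` are placeholders, and whether to add the optional acmart class option `nonacm` (which hides ACM-specific copyright and reference blocks) is an assumption to check with the lecturer.

```latex
% Minimal sketch of a summary paper using the ACM acmart class.
% Placeholders (assumptions): title, author, section name, references.bib.
\documentclass[sigconf]{acmart}  % sigconf = double-column conference format
% Optional (assumption): \documentclass[sigconf,nonacm]{acmart}

\begin{document}

\title{Summary of the Assigned Paper}  % placeholder title
\author{Your Name}                     % placeholder author
\affiliation{%
  \institution{TU Berlin}
  \city{Berlin}
  \country{Germany}}

% In acmart, the abstract must appear before \maketitle.
\begin{abstract}
One-paragraph overview of the assigned paper and its related work.
\end{abstract}

\maketitle

\section{Introduction}
Text of the summary.

\bibliographystyle{ACM-Reference-Format}
\bibliography{references}  % assumes a file references.bib next to the .tex file

\end{document}
```

The PDF produced from such a file is what would be submitted by email.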

Schedule
Introductory lectures
Slides will be made available prior to the individual lectures.
  • Oct 16: 01 Structure of Scientific Papers [pdf, pptx]
  • Oct 23: 02 Scientific Reading and Writing [pdf, pptx]
  • Oct 30: 03 Experiments, Reproducibility, and Presentations [pdf, pptx]
Self-organized seminar work
  • Nov 06 – Jan 08: Optional office hours to discuss any questions
Final presentations
  • Jan 15: Presentation session #0 (canceled)
  • Jan 22: Presentation session #1
  • Jan 29: Presentation session #2
  • Feb 05: Presentation session #3
  • Feb 12: Presentation session #4

Project

Time: Mondays 16:00 s.t. – 18:00
Place: TEL-811 and Zoom

At the beginning of the semester, students/teams pick a programming project from a provided list, devise an initial design, and then implement a prototype including documentation, tests, and relevant experiments. The project ends with a presentation (15 min) of the obtained results.


Topics

The topics of the project are independent of the seminar. We will offer tasks in a wide range of components of data management and machine learning systems. Each individual project will be conducted in the context of one of the two systems developed by our group (and other collaborators) as part of our research:

  • Apache SystemDS: An open-source ML system for the end-to-end data science lifecycle (mainly written in Java)
  • DAPHNE: An open and extensible system infrastructure for integrated data analysis pipelines (mainly written in C++)

This way, students get the chance to make meaningful contributions to free open-source projects.

The projects can be done either individually or in teams of up to three students (with the expected amount of work proportional to the team size).

List of topics: Separate list of topics (updated Oct 20)

Submission & deadlines
  • Topic selection: After the kick-off meeting, please send the following via email to patrick.damme(æ)tu-berlin.de:
    • your ranked list of ca. 5 preferred topics
    • your preference regarding team work (individual/team/open to both)
    • optionally: names of team members (if you already know who you want to work with)
    Deadline: Oct 30, 23:59 CET.
    Notification of assigned topics and teams: Nov 06.
  • The prototype source code, tests, and documentation should be submitted as a pull request on the GitHub repository of either SystemDS or DAPHNE (or, in exceptional cases, via email to patrick.damme(æ)tu-berlin.de).
    Deadline: Feb 19, 23:59 CET.
  • The presentation slides should be submitted as PDF documents by email to patrick.damme(æ)tu-berlin.de and the respective project mentor.
    Deadline: The day before the presentation, 23:59 CET.
    Last-minute changes after the submission are permitted.

Schedule
Introductory lectures
Slides will be made available prior to the individual lectures.
  • Oct 16: Kick-off Meeting [pdf, pptx]
  • Oct 30: Recommendation to attend the seminar at 14:00
Self-organized project work
  • Oct 23 – Feb 12: Optional office hours to discuss the initial design, implementation, and any questions
Final presentations
  • Feb 26, 14:00-18:00: Presentation session #1
  • Mar 04, 14:00-18:00: Presentation session #2

Organization

People
  • Dr.-Ing. Patrick Damme, patrick.damme(æ)tu-berlin.de
    Lecturer, seminar, DAPHNE projects, general contact person
  • Prof. Dr.-Ing. Matthias Boehm
    Apache SystemDS projects
  • Arnab Phani, M.Sc.
    Apache SystemDS projects
  • Sebastian Baunsgaard, M.Sc.
    Apache SystemDS projects
  • Carlos E. Muniz Cuza, M.Sc.
    Apache SystemDS projects
  • Philipp Ortner, M.Sc.
    DAPHNE projects
Language
  • Both seminar and project are held exclusively in English, but questions and other communication in German are fine as well.