Large-scale Data Engineering (Summer 2023)


Conducting research in the areas of data engineering, data management, and machine learning systems requires the ability to work with the scientific literature in these areas as well as to design, implement, and evaluate prototypes. To foster these skills, the DAMS Lab group offers a seminar and a programming project on Large-scale Data Engineering as a combined 12 ECTS module, which can be taken by bachelor and master students. Alternatively, bachelor students (but not master students) may take the seminar (3 ECTS) and the project (9 ECTS) as separate modules. Taking both seminar and project is the ideal preparation for a bachelor/master thesis with our group.

Modules and assigned degree programs
  • Large-scale Data Engineering (module #41086, 12 ECTS): seminar and project
    • bachelor's and master's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik), M.Sc. Computer Science (Informatik), M.Sc. Computer Engineering, M.Sc. Information Systems Management (Wirtschaftsinformatik), M.Sc. Electrical Engineering (Elektrotechnik)
    • registration via email to Patrick Damme (*)
    • graded portfolio exam (25 pts seminar summary paper + 15 pts seminar presentation + 50 pts project implementation/tests/docs + 10 pts project presentation)
  • Seminar Large-scale Data Engineering (module #41095, 3 ECTS): seminar only
    • bachelor's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik), B.Sc. Media Technology (Medientechnik)
    • registration organized centrally by Faculty IV via meta course
    • graded portfolio exam (65 pts summary paper + 35 pts presentation)
  • Project Large-scale Data Engineering (module #41094, 9 ECTS): project only
    • bachelor's programs: B.Sc. Computer Science (Informatik), B.Sc. Computer Engineering (Technische Informatik), B.Sc. Information Systems Management (Wirtschaftsinformatik)
    • registration
      • Programmierpraktikum in B.Sc. Computer Science: registration organized centrally by Faculty IV via meta course
      • otherwise: registration via email to Patrick Damme (*)
    • ungraded portfolio exam (85 pts implementation/tests/docs + 15 pts presentation)

News

  • Jun 13: Exam registration open.
    In case anyone has missed the announcement via the corresponding ISIS course: The exam registration for the modules Large-scale Data Engineering, Seminar Large-scale Data Engineering, and Project Large-scale Data Engineering via Moses/MTS is still open until June 18, 23:59 CEST.
  • Apr 14: Information on the kick-off meeting.
    • Time: Monday, Apr 17, 2023: 14:00 (seminar) and 16:00 (project)
    • Place: hybrid (in-person in TEL 811 and via zoom)
    • Students in the queue are allowed to join, but virtual attendance is recommended due to limited space
    • Students taking only the seminar or only the project need to attend only the respective part of the kick-off
    • Topics: the final lists of topics for seminar and project will be presented at the kick-off; the topic selection takes place asynchronously after the kick-off
  • Apr 14: The registration is closed.
    The participants have been selected in accordance with §48 (3) AllgStuPO. All registered students have been notified about the result (participation, queue, no participation in this semester).

Seminar

Time: Mondays 14:00 - 16:00
Place: TEL 811 and zoom

At the beginning of the semester, students will hear presentations on reading scientific papers, finding related work, writing high-quality scientific papers, and giving high-quality scientific presentations. Each student selects a topic, reads and understands the given paper, searches for related work, and writes a short summary of the assigned paper (6 pages). At the end of the semester, each student gives a slide presentation (15 min talk + 5 min discussion) in front of the group.


Topics

In this semester, we focus on the umbrella topic of Extensible Data Systems:

To meet the requirements of emerging applications as well as to enable the timely adoption of novel techniques and technologies, making database and machine learning systems extensible has been an active research field for decades. Concepts for extensibility and variability have been proposed at all levels of the system stack, from query/programming languages, through the optimizer, down to the execution in distributed environments and on heterogeneous hardware and storage. This seminar takes a tour through some of the most important works in this field.

List of topics: Separate list of topics (updated Apr 28)

Submission & deadlines
  • Topic selection: After the kick-off, please send your list of preferred topics via email to patrick.damme(æ)tu-berlin.de. Topics are assigned on a first-come, first-served basis.
    Deadline: May 1, 23:59 CEST.
  • The summary papers (6 pages) should be written in LaTeX using the ACM acmart template (document class sigconf), and be submitted as PDF documents by email to patrick.damme(æ)tu-berlin.de.
    Deadline: Jun 19, 23:59 CEST.
  • The presentation slides should be submitted as PDF documents by email to patrick.damme(æ)tu-berlin.de.
    Deadline: The day before the presentation, 23:59 CEST.
    Last-minute changes after the submission are permitted.

Schedule
Introductory lectures
Slides will be made available prior to the individual lectures.
  • Apr 17: 01 Structure of Scientific Papers [pdf, pptx]
  • Apr 24: 02 Scientific Reading and Writing [pdf, pptx]
  • May 01: (public holiday)
  • May 08: 03 Experiments, Reproducibility, and Presentations [pdf, pptx] (updated May 8)
Self-organized seminar work
  • optional office hours to discuss any questions
Final presentations
  • Jun 26: oral presentation session #1
  • Jul 03: oral presentation session #2
  • Jul 10: oral presentation session #3
  • Jul 17: oral presentation session #4

Project

Time: Mondays 16:00 - 18:00
Place: TEL 811 and zoom

At the beginning of the semester, students/teams pick a programming project from a provided list, devise an initial design, and then implement a prototype including documentation, tests, and relevant experiments. The project ends with a presentation (15 min) of the obtained results.


Topics

The topics of the project are independent of the seminar. We will offer tasks in a wide range of components of data management and machine learning systems. Each individual project will be conducted in the context of one of the two systems developed by our group (and other collaborators) as part of our research:

  • Apache SystemDS: An open-source ML system for the end-to-end data science lifecycle (mainly written in Java)
  • DAPHNE: An open and extensible system infrastructure for integrated data analysis pipelines (mainly written in C++)

Thereby, students get the chance to make meaningful contributions to free open-source projects.

The projects can be done either individually or in teams of up to three students (with the expected amount of work proportional to the team size).

List of topics: Separate list of topics (updated Apr 28)

Submission & deadlines
  • Topic selection: After the kick-off, please send your list of preferred topics via email to patrick.damme(æ)tu-berlin.de. Feel free to approach us as an individual or as a team. Topics are assigned on a first-come, first-served basis, but we may suggest that you team up with other students interested in the same topic.
    Deadline: May 1, 23:59 CEST.
  • The prototype source code, tests, and documentation should be submitted as a pull request on the GitHub repository of either SystemDS or DAPHNE (or, exceptionally, via email to patrick.damme(æ)tu-berlin.de).
    Deadline: Jul 24 (soft deadline).
  • The presentation slides should be submitted as PDF documents by email to patrick.damme(æ)tu-berlin.de.
    Deadline: The day before the presentation, 23:59 CEST.
    Last-minute changes after the submission are permitted.

Schedule
Kick-off Meeting
  • Apr 17: Kick-off Meeting [pdf, pptx] (updated Apr 18)
Self-organized project work
  • optional office hours to discuss the initial design, implementation, and any questions
Final presentations
  • scheduled individually with each student/team

Organization

Registration
  • To register for the seminar in any bachelor program of Faculty IV or for the Programmierpraktikum in B.Sc. Computer Science, please use the respective meta courses.
  • In all other cases, please send an email to Patrick Damme (patrick.damme(æ)tu-berlin.de) with the following information: your name, matriculation number, degree program, and the module you would like to take. Registrations were accepted until Thursday, April 13, 2023, 23:59 CEST. The registration is closed.
  • The capacity of both the seminar and the project is limited to 20 students. Since we have already received more registrations, students will be admitted in accordance with §48 (3) AllgStuPO. On April 14 we will notify all registered students about the result (guaranteed participation, a place in the queue, or no participation in this semester).
People
  • Dr.-Ing. Patrick Damme, patrick.damme(æ)tu-berlin.de
    Lecturer, seminar, DAPHNE projects, general contact person
  • Prof. Dr.-Ing. Matthias Boehm
    Apache SystemDS projects
  • Arnab Phani, M.Sc.
    Apache SystemDS projects
  • Sebastian Baunsgaard, M.Sc.
    Apache SystemDS projects
  • Carlos E. Muniz Cuza, M.Sc.
    Apache SystemDS projects
Language
  • Both seminar and project are held exclusively in English, but questions or other communication in German are fine as well.