A6 – Computational lexicography

Teachers:

EMLex offers a diverse range of teachers and lecturers from around the globe. This course will be taught by:

Prof. Dr. Stefan Evert

Friedrich Alexander University Erlangen-Nuremberg

Prof. Dr. Ulrich Heid

University of Hildesheim

Dr. Besim Kabashi

Friedrich Alexander University Erlangen-Nuremberg

 

Contents:

Topics covered in this module include:

  1. Foundations of corpus linguistics
    • principles and methods of corpus analysis
    • applications of corpus data in lexicography
    • types of corpora, overview of existing corpora
    • corpus design, representativity, data sources, metadata
  2. Corpus compilation
    • building corpora from online data: web scraping etc.
    • boilerplate removal, normalization, metadata extraction
    • representation and exchange formats
    • online and stand-alone tools for web corpus compilation
    • automatic linguistic annotation (POS, lemma, NER, parsing, …)
    • online and stand-alone tools for linguistic annotation
  3. Searching corpora
    • regular expressions
    • character encodings and the Unicode standard
    • CQP query language for lexico-grammatical patterns
    • practical exercises with Sketch Engine and CQPweb
  4. Quantitative analysis
    • frequency lists and metadata distribution
    • collocations and word sketches
    • keyword analysis
    • lexicographic interpretation of results
    • foundations of statistical inference
  5. Reproducibility
    • research methodology and documentation
    • data management, sustainability of corpus resources

Please see the module description for further information.

 

General information:

Time frame: 22.03.-26.03.2021
Room: Zoom
Evaluation method: participation in a team project with a written report (the teams will be determined at the beginning of the module)
Teaching language: German and English

 

Information on the EMLex 2021 Summer School:

Practical arrangements: Participants will receive a syllabus, relevant literature and suggestions on how to prepare for the course well in advance on the Moodle platform. The sessions are moderated by the lecturer and the guest lecturer. The lessons center on practical computer exercises, to be carried out in small groups (instructions will be given beforehand).

 

Certificate: There are two ways to obtain an EMLex 2021 Summer School certificate: (a) without grade: active participation in practical exercises, class discussions and a team project; (b) with grade: participation in a team project with a written report. The teams will be determined at the beginning of the course.

 

Schedule:

9:00-10:30
    • Monday: Welcome & Introduction (all)
    • Tuesday: Presentation of project ideas + corpus design with discussion (all)
    • Wednesday: Linguistic annotation & pre-processing (Heid)
    • Thursday: Corpus search with CQP queries (Heid/Kabashi)
    • Friday: Final team presentations and discussion (all)
11:00-12:30
    • Monday: Lexicography and text corpora, corpus design (Heid)
    • Tuesday: Collecting corpus data from the Web (Evert/Kabashi)
    • Wednesday: Representation formats, practice with Sketch Engine (Evert/Kabashi)
    • Thursday: Frequencies, collocations, and keywords (Evert)
    • Friday: Final team presentations and discussion (all)
Lunch break
2:00-3:30
    • Monday: Form teams and discuss projects (Kabashi/Evert)
    • Tuesday: Collecting corpus data from the Web
    • Wednesday: Group work on team projects (Heid)
    • Thursday: Group work on team projects (Kabashi)
4:00-5:30
    • Monday: Regular expressions (Evert/Kabashi)
    • Tuesday: Schierholz (A5), 4:15 p.m.
    • Wednesday: Q&A session with instructors (Kabashi/Evert)
    • Thursday: Q&A session with instructors (all)
6:00-7:30
    • Further group work as needed (on three days)