CS 457: Natural Language Processing

Tuesday (3/12)

Topic: Neural Networks for Text Classification

Class Sections:
A: Tuesday/Thursday, 8:15AM-9:30AM
B: Tuesday/Thursday, 9:45AM-11:00AM
All class meetings take place in 75 Shannon Street room 203

Instructor: Professor Laura Biester
You can call me Laura, Professor Laura, or Professor Biester, whichever you are most comfortable with
Email: lbiester@middlebury.edu
Office: 75 Shannon Street Room 214
Drop-In Hours: Monday 4-5, Thursday 2-3, Friday 11-12
One-on-one Appointments: please email me to schedule a one-on-one meeting
Lunch Appointments: book here at least 24 hours in advance
purpose: casual discussion in the dining hall of topics related to CS but not directly related to the course, including but not limited to NLP research, the CS major, working in tech, and graduate school

Anonymous course feedback form

Course Description

From the course catalog: In this course we will explore computational models for processing natural (human) language. We will introduce statistical and algorithmic techniques that are used to classify, generate, and understand language at the syntactic and semantic levels. We will explore applications such as parsing, information extraction, language modeling, and sentiment analysis. Assignments will involve constructing and modifying systems and will incorporate a variety of large corpora. We will also discuss the ethical concerns associated with current methods for collecting and labeling large corpora, and how language technologies might reflect and reinforce social hierarchies. This course fulfills the Responsible Computing requirement for the Computer Science major.

Prerequisites

The prerequisites for this course are CS 200 and CS 201. To be successful in this course, you should:

  • Be a self-sufficient programmer. We will focus on learning new NLP algorithms, and will not spend significant time reviewing basic Python programming.
  • Have some knowledge of conditional probability and Bayes' rule.

Additional mathematics, computer science, and linguistics courses might supplement your knowledge in ways that are helpful in CS 457, but knowledge of topics covered in those courses is not assumed.

If you are concerned about your preparation for this course, please come talk to me!

Learning Objectives

By the end of this course, you will:

  • Be familiar with NLP methods in three key areas: text classification, text generation, and language understanding
  • Be able to effectively use Python libraries that are part of the large ecosystem of tools for NLP
  • Explore various NLP applications, including applications with positive societal impact
  • Be able to identify pros and cons of various data collection and labeling practices that are commonly used in NLP, including data scraping and crowdsourcing
  • Be exposed to ways in which language technology can perpetuate stereotypes and biases related to minoritized groups, and inequity among different languages
  • Demonstrate your ability to engage with recent research papers in NLP

Course Work and Grading

Your grade will be determined by the number of assignments that you complete at a satisfactory level.

To earn an A:
  • Homework: 6 satisfactory homework assignments, including homework 7[1]
  • Final Project: 6 components of the final project are satisfactory
  • Reading: Prepare a 10 minute presentation of one paper for the class (can be done with a partner), submit 3 satisfactory reading reflections

To earn a B:
  • Homework: 5 satisfactory homework assignments, including homework 7
  • Final Project: 5 components of the final project are satisfactory
  • Reading: Submit 3 satisfactory reading reflections

To earn a C:
  • Homework: 4 satisfactory homework assignments, including homework 7
  • Final Project: 4 components of the final project are satisfactory
  • Reading: Submit 2 satisfactory reading reflections

To earn a D:
  • Homework: 3 satisfactory homework assignments, including homework 7
  • Final Project: 3 components of the final project are satisfactory
  • Reading: Submit 1 satisfactory reading reflection
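
As a rough sketch, the thresholds above can be turned into a lookup that reports the letter each category currently maps to. The function and constant names here are hypothetical, and returning None for work that falls below the D row (or that lacks a satisfactory homework 7) is an assumption, since the table does not say what happens in those cases.

```python
# Minimum counts per letter grade, transcribed from the grading table.
HOMEWORK_MIN = {"A": 6, "B": 5, "C": 4, "D": 3}
PROJECT_MIN = {"A": 6, "B": 5, "C": 4, "D": 3}
REFLECTION_MIN = {"A": 3, "B": 3, "C": 2, "D": 1}


def best_grade(count, minimums):
    """Return the highest letter whose minimum is met, or None if below a D."""
    for letter in "ABCD":
        if count >= minimums[letter]:
            return letter
    return None  # assumption: the table does not cover this case


def homework_grade(satisfactory, homework7_satisfactory):
    # Every row of the table requires a satisfactory homework 7.
    if not homework7_satisfactory:
        return None  # assumption: not covered by the table
    return best_grade(satisfactory, HOMEWORK_MIN)


def project_grade(satisfactory_components):
    return best_grade(satisfactory_components, PROJECT_MIN)


def reading_grade(reflections, presented):
    letter = best_grade(reflections, REFLECTION_MIN)
    # Three reflections without a presentation meet the B row, not the A row.
    if letter == "A" and not presented:
        return "B"
    return letter
```

For example, homework_grade(5, True) gives "B", and reading_grade(3, False) also gives "B" because the presentation is what separates the A and B rows for reading.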

All work will receive one of the following grades:

  • S: satisfactory. This work meets all of the assignment’s requirements.
  • R: revision required. You must revise this work to get credit.
  • NC: no credit. Typically assigned to missing work or work that does not reflect an attempt to complete the assignment prior to the due date.

Students whose work maps onto different grades in different categories will have a base grade defined by the category in which their grade is the lowest. If a student has earned the base grade in two categories, they will receive a grade of base grade+. If they have earned the base grade in only one category, they will receive a grade of (base grade + 1)-. If this is confusing in English, please see the following Python program:

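# Read the three category grades and normalize spacing and case.
grades = input("Input your three category grades as a string like 'AAB' or 'CCA': ").strip().upper()
# "A" has the smallest ASCII code, so max() picks the lowest letter grade.
base_grade = max(grades)
if grades.count(base_grade) == 1:  # both other categories were higher
    # Move up one letter (e.g. "B" becomes "A") and attach a minus.
    final_grade = chr(ord(base_grade) - 1) + "-"
else:  # at least two categories are at the base grade
    final_grade = base_grade + "+"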

print(f"The final grade is {final_grade}")

Finally, here are a couple of examples:

  • AAB -> A-
  • BBA -> B+
  • AAC -> B-
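
To check these examples without typing input by hand, the same rule can be wrapped in a function. This is only a sketch: combine_grades is a hypothetical name, the input validation is an addition, and the result for three identical letters (a trailing +) simply mirrors the program above rather than a separately stated policy.

```python
def combine_grades(grades):
    """Combine three category letters (e.g. "AAB") into a final grade."""
    grades = grades.strip().upper()
    if len(grades) != 3 or any(g not in "ABCD" for g in grades):
        raise ValueError("expected three letters from A-D, e.g. 'AAB'")
    # "A" sorts first, so max() returns the lowest of the three grades.
    base = max(grades)
    if grades.count(base) == 1:
        # Only one category is at the base grade: one letter up, with a minus.
        return chr(ord(base) - 1) + "-"
    # Two or more categories are at the base grade.
    return base + "+"


for example in ("AAB", "BBA", "AAC"):
    print(example, "->", combine_grades(example))
# AAB -> A-
# BBA -> B+
# AAC -> B-
```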

Homework

Homework will be assigned weekly in weeks 1-7 and due one week later on Thursday evening. The purpose of the homework assignments is to give you experience implementing the algorithms that we discuss during class and evaluating NLP techniques. Most homework assignments will include both a programming component and a written reflection. A small number of assignments may require problem solving on paper.

Grading

All homework assignments will have both an autograded component and a manually graded component. The autograder might do something as simple as confirming that you submitted the correct files, but it may also test your code on unseen data in cases where the algorithm you are implementing is fully deterministic.

Autograder tests are officially worth 0 points, but they are still important! Because all homework assignments have a manually graded component, passing all autograder tests is necessary but not sufficient to earn a grade of S on an assignment!

Extra Credit

A small number of homework assignments will include an option to compete with your classmates on a leaderboard to see who can create the best model! The purpose of this competition is primarily to deepen your understanding of the topic and to gain bragging rights. However, the winner of each competition[2] will receive a bump of 1/3 letter grade at the end of the semester.[3] Each student may only get extra credit this way once.

Final Project

To synthesize your knowledge, you will complete an NLP research project with a partner during the second half of the semester. Your project is expected to have some novel component (e.g., you should not exactly replicate the model in a single research paper using the same data set and evaluation method), but your contribution can be small relative to what would be expected to publish a research paper.

Writeup

Your final project writeup is expected to include at least the following five components (the sixth component of your project is the presentation):

  • Literature Review
  • Data
  • Methods
  • Results
  • Ethical Considerations, Limitations, and Future Work

Please come speak to me before the project proposal deadline if you do not believe that your intended final project fits this structure.

Project Deadlines

Some project components will be due on Thursdays in lieu of homework towards the end of the semester. This is to ensure that you are on the right track to finish your project by the end of the semester; it will also give you an opportunity to receive feedback and resubmit these project components if necessary (see the revision policy).

The final version of your project is due on the last day of finals (May 21st). There will be an earlier feedback deadline, which is the latest day you can submit components of your final project for feedback with a possible grade of R. Any project work submitted after that deadline will receive a grade of S or NC.

Reading Assignments

Supplemental Reading

Most days in weeks 2-7 will have assigned supplemental reading in addition to required reading. Supplemental reading assignments are typically recent papers published in NLP or adjacent fields. You are expected to engage with this reading by writing reading responses throughout the semester. Visit this page to learn more about the requirements for reading responses.

To earn an A in the class, you must also present a supplemental reading to the class. There are enough presentation slots for everyone, but only as long as students present in the first few weeks, so please sign up as soon as possible. Your presentation may optionally be completed with a classmate.

Expectations of Students

You should expect to spend up to 10 hours per week on work outside of class to be successful in this course. If you find that you are regularly spending more than 10 hours per week on the class, send me an email or stop by drop-in hours to chat.

You are expected to complete all required reading prior to the class in which we will discuss each topic. This will allow us to spend more of our class time on activities related to the topic instead of lecture.

Course Materials

Textbook

The main textbook for this course is the draft of the 3rd edition of Speech and Language Processing by Dan Jurafsky and James H. Martin.

Reading assignments will be available on Perusall. Perusall is a “social annotation platform”, which allows you to ask and answer questions as you read. Participation on Perusall will not contribute to your grade, but I strongly recommend participating to make your reading assignments more engaging!

Additional Materials

All additional materials assigned, which may include PDFs, blogs, audio recordings, or videos, will be freely available.

Python Environment and Computational Resources

All programming assignments for this course should be completed in Python. We are using Python in this course due to the large ecosystem of Python libraries available for NLP and machine learning.

The first few homework assignments for this course will not require external libraries; any Python workflow that you are comfortable with (using Thonny, VS Code, or something else) is appropriate.

Later assignments may require (a) training models for a long period of time, (b) external libraries, or (c) GPUs. I will provide guidance on how to complete these assignments using Middlebury’s Ada HPC cluster and/or Google Colab. You may also use the cluster for your final project; however, please remember that the cluster is a shared resource. For instance, I will ask you to reconsider project ideas that require a week of GPU time to train.

Course Policies

Resources Available to You

We have many resources that can make the learning process easier throughout the course:

  • Professor Drop-In Hours: My drop-in hours are a great place to ask questions! You can ask questions about your homework, the lecture, the CS major, CS research, working in tech… even your general experience at Midd!
  • Ed Message Board: Ask questions about course content and assignments on the Ed message board. Asking questions here allows your classmates to see answers to frequently asked questions. Do not share code for any assignments publicly on the message board. If you need to share code, I recommend that you come to drop-in hours.
  • Email: If you have a question that cannot be asked on a public message board, please send an email to lbiester@middlebury.edu. I will commit to responding to emails from students within 1 business day; I will not respond to emails on the weekend.

Extension Policy

If you are unable to complete a homework assignment or project component by the deadline, submit this form to request a later due date at least 24 hours before the assignment is due. Requests to submit up to 3 days late will be accepted with no questions asked. If you need a longer extension due to extenuating circumstances, please contact me directly.

Exceptions to this policy include: reading responses (you should respond to another reading rather than submitting a response late), leaderboard competitions (you can only submit up to the original deadline), and the deadline for the final project (this cannot be extended past the end of the exam period).

Revision Policy

Any assignment that receives a grade of R can be revised for credit. Revisions are due within 4 weeks of the initial due date or within 2 weeks of receiving feedback on the assignment, whichever is later.[4]
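
The "whichever is later" rule amounts to taking the maximum of two dates. The sketch below uses a hypothetical revision_deadline function and made-up dates; it assumes deadlines are tracked by calendar day and ignores the time of day.

```python
from datetime import date, timedelta


def revision_deadline(initial_due, feedback_received):
    """Later of: 4 weeks after the due date, 2 weeks after feedback."""
    return max(initial_due + timedelta(weeks=4),
               feedback_received + timedelta(weeks=2))


# Hypothetical dates: due on a Thursday, feedback returned three weeks later.
print(revision_deadline(date(2024, 2, 22), date(2024, 3, 14)))  # 2024-03-28
```

If feedback had arrived within two weeks of the due date, the 4-week window would be the later of the two and would set the deadline instead.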

Exceptions to this policy include: in-class presentations and any work that is submitted after the feedback deadline for the final project.

Laptops

You are expected to bring a charged laptop to all class sessions. If you don’t have access to a laptop (even if for just a single class period), please contact me to ask about the availability of the department’s loaner laptops. The CS Department maintains a set of loaner laptops, preinstalled with relevant course tools, for both short-term and longer-term use. Given the small number of machines available (approximately 10), if you anticipate needing a laptop for a longer period (e.g., the entire semester or more), I encourage you to also inquire with the library about loaner equipment and/or Elaine Orozco Hammond about an Opportunity Grant, which can help you to purchase a laptop. Our department commits to meeting the needs of every student, so please don’t hesitate to contact Smith (our ASI) if you need a computer (in any way) for this course.

Collaboration and Outside Resources

On homework assignments and on your final project, you are allowed to work with a partner.[5] We will sometimes have time to start the homework or work on projects in class, so your partner should be someone in your section. You may also discuss your general approach with other classmates, but the code that you write is expected to be written by yourself and your partner.

With proper attribution, you are allowed to use online resources such as StackOverflow and ChatGPT to answer basic Python questions that lead to a few lines of code, for instance “how do you get the key corresponding to the largest value in a dictionary?”[6] You may not use online resources to solve the main problem posed by any assignment.
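
To give a sense of scale, the dictionary question mentioned above resolves to roughly one line of code. The snippet is only an illustration with made-up word counts; key_of_largest_value is a hypothetical hand-written equivalent, not part of any assignment.

```python
counts = {"the": 12, "cat": 3, "sat": 5}  # made-up word counts

# max() iterates over the keys; key=counts.get ranks each key by its value.
most_common = max(counts, key=counts.get)
print(most_common)  # the


def key_of_largest_value(d):
    """Hand-written version: scan the items, keeping the best key so far."""
    best_key = None
    for key, value in d.items():
        if best_key is None or value > d[best_key]:
            best_key = key
    return best_key  # None for an empty dictionary


print(key_of_largest_value(counts))  # the
```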

Disability Access and Accommodation

Every class is made up of learners with different access needs. My goal is for each student in our class to succeed, and to create an accessible learning environment for everyone. Students who have Letters of Accommodation in this class are encouraged to contact me as early in the semester as possible to ensure that such accommodations are implemented in a timely fashion.

For those without Letters of Accommodation, assistance is available to eligible students through the Disability Resource Center (formerly called Student Accessibility Services). All discussions will remain confidential.

Please contact one of the ADA Coordinators at ada@middlebury.edu for more information.

Academic Integrity

As an academic community devoted to the life of the mind, Middlebury requires of every student complete intellectual honesty in the preparation and submission of all academic work. Details of our Academic Honesty, Honor Code, and Related Disciplinary Policies are available in Middlebury’s handbook.

Honor Code Pledge

The Honor Code pledge reads as follows: “I have neither given nor received unauthorized aid on this assignment.” It is the responsibility of the student to write out in full, adhere to, and sign the Honor Code pledge on all examinations, research papers, and laboratory reports. Faculty members reserve the right to require the signed Honor Code pledge on other kinds of academic work.

Tentative Schedule and Topics

Week 1: Syllabus Overview, Tokens, Corpora (Tu), Regular Expressions, Reading Papers (Th)

Week 2: N-Gram Language Models (Tu), Naive Bayes, Classifier Evaluation (Th)

Week 3: Part of Speech Tagging with Hidden Markov Models (Tu), Logistic Regression (Th)

Week 4: Vector Semantics (Tu), Word2Vec (Th)

Week 5: Neural Networks for Text Classification (Tu), Neural Networks for Text Classification (II) (Th)

Week 6: BERT for Text Classification (Tu), BERT for Text Classification (II) (Th)

Week 7: Neural Language Modeling (Tu), Machine Translation (Th)

Week 8: Ethics in NLP: Data (Tu), Ethics in NLP: Models (Th)

Week 9: Language Understanding (Tu), TBD (Th)

Week 10: TBD

Week 11: TBD

Week 12: Project Presentations

Note that the topics for weeks 8-11 are intentionally TBD! The topics for these class periods will be determined by student interests and needs, and may include:

  • Overflow from weeks 1-7
  • NLP topics that we didn’t have time to cover in weeks 1-7 (e.g., topic modeling, parsing)
  • Exploring topics in greater depth (e.g., transformers, RLHF)
  • Guest lectures on special topics
  • Time to work on final projects in class
  • Interactive demos of tools that may be useful for your final projects

See the detailed schedule for more!


  1. You only need to complete 6/7 homework assignments at a satisfactory level to earn an A in the course, with one exception: Homework 7 focuses on responsible computing, and is required of all students because CS 457 satisfies the CS department’s responsible computing requirement. 

  2. In most cases, the winner will be determined exclusively by scores on the leaderboard. Any evidence of an honor code violation related to a leaderboard submission, including evidence that a student has searched for the test set online, will result in disqualification. 

  3. E.g., from an A- to an A. 

  4. I am planning to return assignments to students within two weeks of submission. 

  5. Your partner should be a student in your section. If you are having trouble finding a partner, please let me know. 

  6. Of course, you could pretty easily write your own function to do this!