Inference, Models and Simulation for Complex Systems
CSCI 7000/4830

Instructor: Aaron Clauset
Office: ECOT 743
Office hours: Tuesday, 1:30-3:00pm

Time: Tuesday and Thursday, 11:00am - 12:15pm
Room: ECCR 131

Description
Computers are fundamentally changing how science is done by allowing us to record enormous amounts of data on social, biological, technological and physical systems. This data glut should allow us both to address old questions about complex systems in new ways and to answer fundamentally new kinds of questions. But, our ability to do this relies on the development of new algorithms to automatically extract information about patterns in these huge data sets and to reliably test complex scientific hypotheses. Examples include understanding the evolution of the Internet (at various levels), social behavior on the Web (Facebook, Wikipedia, Twitter, etc.), the role of networks in structuring complex social behaviors and biological functions, the emergence of population-scale patterns from seemingly random individual-level behavior, and forecasting rare but important events in the future. 

This graduate-level topics course will cover a selection of recent developments in computational approaches to doing science with complex systems. It is not a scientific computing course. Topics will include statistical inference, the structure of complex networks, macro-phenomena in biological evolution and in wars and terrorism, simple mathematical models, and simulation techniques for more complicated models. The focus will be on using computational tools (algorithms) to do science (work with data; test hypotheses; build understanding; make predictions).

Prerequisites

Required courses: CSCI 3104 (undergraduate algorithms), two semesters of calculus and basic statistics and probability theory

Recommended courses: CSCI 5454 (graduate algorithms), APPM 3050 (Scientific Computing in Matlab), APPM 3570 (Applied Probability) or APPM 4570 (Statistical Methods)
Note: No biological or social science background is necessary. A quantitative background is mandatory. The concepts and techniques covered in this course depend heavily on basic statistics, scientific programming (Matlab and Mathematica) and calculus (integration and differentiation). Students without sufficient preparation in these areas will struggle to keep up with the lectures and especially with the assignments, each of which has a mathematical and a programming component. Students concerned about their preparation are encouraged to audit the course.

Text
Required (available at the CU Bookstore):
1. Networks: An Introduction by M.E.J. Newman
2. The Sciences of the Artificial (3rd ed.) by H.A. Simon

Optional:
3. All of Statistics by L. Wasserman 
4. Numerical Recipes are recommended. the first two of these are being held on reserve for this course at the CU Library.

Course work and grading
Attendance to the lectures is required. 

The first half of the course will be lecture and problem-set driven. The second half of the class will revolve around crowd-source (student) lectures on advanced topics and a small independent research project. The lecture and project are the major deliverables for the class, and students are expected to spend considerable time outside of class preparing them. There are no written examinations in this class; the lecture and project should be viewed as an oral and a final examination. Many lectures will have an accompanying reading assignment from the required texts or a reading from the primary literature. I'll post the latter readings on the class website. Additional detail on the crowd-sourced lectures is given in the syllabus. 

Problem sets: The four problem sets will each have some mathematical and some programming problems. Programming problems can be done in any reasonable imperative language, but I highly recommend using Matlab, or something similar, which has data analysis and visualization capabilities built-in. Familiarity with Mathematica will be useful for many of the mathematical problems. 

Problem sets will be due roughly two weeks after they are assigned (schedule given below). Solutions must be in PDF format (e.g., typeset using LaTeX), should include all necessary details for me to follow the logic, and must be submitted via email by 11:59pm the day they are due. No late assignments will be accepted. Collaboration is allowed on the problem sets, but you may not copy (in any way) from your collaborators and you must respect University academic policies at all times. To be clear: you may discuss the problems verbally, but you must write up your solutions separately. If you do discuss the problems with someone (and you are encouraged to!), you must then list and describe the extent of your collaboration in your solutions (a footnote is fine). Copying from any source, in any way, including the Web but especially from another student (past or present), is strictly forbidden. If you are unsure about whether something is permitted under these rules, ask me well before the deadline. 

Crowd-source lecture: Students will self-organize into teams of two. Each team will then give a full-length (75 minute) technical lecture that (i) frames a scientific question and motivates an algorithm or model to address it, (ii) describes the algorithm or model in appropriate technical detail (at least at the pseudocode level), (iii) describes its performance or results, and (iv) describes interesting extensions, open question, problems, etc. Students should notify me via email of their team composition by September 13. Each team should then notify me via email by September 15 of their preferences on lecture topics in the form of a complete ranking of the topics listed below. I will match teams to topics using a matching algorithm. (The number of topics and the precise schedule may change slightly during the semester depending on the number of students in the class, etc.) 

Independent project: The assignment here is to develop and complete a small research-style project based around the tools and concepts developed in the course. This independent project should focus on either inference, modeling or simulation. Students will work independently, but are strongly encouraged to get feedback from each other on their ideas. The deliverables here are a short (20 minute) in-class presentation of their results and a 10-page writeup due via email to me by 11:00am on December 15. 

A short project proposal is due via email on October 13. The proposal should be two paragraphs: one explaining any background material, including necessary references to the scientific literature, and a second describing what you're going to do. When choosing a topic, students should be cognizant of the need to make reasonable progress in the 9 weeks remaining in the semester. To help facilitate this, students are strongly encouraged to meet with me outside of class prior to October 13 to discuss potential project topics. 

Grading Grades will be assigned based on a 20% attendance, 30% problem sets, 20% lecture and 30% project division. Undergraduates taking CSCI 4830: Instead of doing both the lecture and the project, undergraduate students may choose to do one or the other. Grades will be assigned based on a 30% attendance, 30% problem sets, and 40% lecture/project division.

Tentative Schedule
Week 1: Overview, probability distributions, likelihood functions, the Poisson process (Lecture 0 and Lecture 1
Week 2: Power-law distributions and power-law tails (Lecture 2 and Lecture 3
Week 3: Model plausibility, model comparison, and two fallacies (Lecture 4 and Lecture 5
Week 4: Time series analysis, random walks (Lecture 6 and Lecture 7
Week 5: Random walk models of macroevolution (Lectures 8 and 9
Week 6: Probabilistic models of terrorism and wars (Lecture 10
Week 7: Measures of network structure, algorithms (Lecture 12a and Lecture 12b
Week 8: Random graph models, citation networks (Lecture 13 and Lecture 14
Week 9: Modular and hierarchical structure, missing information, etc. (Lecture 15 and Lecture 16)
Week 10: (CSL 1) The bootstrap and (2) duplication-mutation models 
Week 11: (3) Privacy and de-anonymizing human data and (4) Levy flights 
Week 12: (5) Regression and (6) metabolic networks 
Week 13: (7) Gaussian mixture models and (8) People are not? particles 
Week 14: Fall break 
Week 15: Project presentations 
Week 16: Project presentations

Problem Sets
Problem set 1 (assigned August 23; due September 13)

  • data set 1A: the frequencies of occurrence of US family names in the 1990 US Census. 
    (From A. Clauset, C.R. Shalizi and M.E.J. Newman (2009) SIAM Review 51, 661.)
  • data set 1B: the intensity of wars from 1816-1980 measured as the number of battle deaths per 10,000 of the combined populations of the warring nations. 
    (From M. Small and J.D. Singer (1982) Sage Publications).
  • data set 1C: the aggregate net worth in US dollars of the richest individuals in the United States in October 2003. 
    (From M.E.J. Newman (2005) Contemporary Physics 46, 323).
  • data set 1D: the numbers of adherents of religious denominations, bodies, and sects, as compiled and published on the web site adherents.com. 
    (From A. Clauset, C.R. Shalizi and M.E.J. Newman (2009) SIAM Review 51, 661.)

 

Problem set 2 (assigned September 16; due September 27)

Problem set 3 (assigned September 30; due October 11)

Problem set 4 (assigned October 11; due October 25)

Supplemental Readings

Week 1: 

 

Week 2: 

 

Week 3: 

 

Week 4: 

 

Week 5: 

 

Week 6: 

 

Weeks 7 and 8: 

 

Resources

LaTeX (general) and TeXShop (Mac)
Matlab license for CU staff (includes student employees)
Mathematica license for CU students
NumPy/SciPy libraries for Python (similar to Matlab)
GNU Octave (similar to Matlab)
Wolfram Alpha (Web interface for simple integration and differentiation)
Machine Learning, Statistical Inference and Induction Notebook (by Cosma Shalizi)
Power Law distributions, etc. Notebook (by Cosma Shalizi)
Some Advice on Process for [Research Projects]
Cytoscape, network visualization software
yEd Graph Editor, network visualization software
Graphviz, network visualization software