This assignment begins to explore the trade-offs between structured data
(stored in a database) and unstructured data (plain text). In
particular, we will explore the "data cleaning" problem, and the ease
(or difficulty) with which we can extract useful information from
unstructured data and query it.

The file dinesafe.xml (available on the class dataset page) contains
information about a large number of food safety inspections for
Toronto-based establishments that prepare and/or sell food: everything
from hot dog vending carts to restaurants to grocery stores to meat
packing plants. We are particularly interested in the
INFRACTION_DETAILS field: it contains information about cases where the
inspector found that the operator and/or employees failed to comply
with the various laws governing the handling and sale of food in
Toronto.

The dataset was acquired from Toronto's DineSafe Inspection and
Disclosure System [1] and covers inspections over the last three years.
The inspection system relies on three laws, which are summarized below:

Ontario Health Protection and Promotion Act, R.S.O 1990, ch. H.7 [2]
----------------------------------------------------------------------
Grants the Ontario Ministry of Health authority to "provide for the
organization and delivery of public health programs and services, the
prevention of the spread of disease and the promotion and protection of
the health of the people of Ontario" (Sec. 3).

Ontario Regulation 562 - Food Premises [3]
----------------------------------------------------------------------
Lists specific rules which must be followed when preparing and serving
food. It covers buildings and equipment (construction, maintenance, and
cleanliness), employee hygiene, and all aspects of food storage and
preparation. The majority of food safety infractions will cite a
specific section of this law.

Toronto Municipal Code, chapter 545 - Licensing [4]
----------------------------------------------------------------------
Section 5.G of this law requires all eating and drinking establishments
to be properly licensed. The business itself, and all food handlers
working there, must obtain proper licenses and present them upon
request.

[1] http://www.toronto.ca/health/dinesafe
[2] http://www.e-laws.gov.on.ca/html/statutes/english/elaws_statutes_90h07_e.htm
[3] http://www.e-laws.gov.on.ca/html/regs/english/elaws_regs_900562_e.htm
[4] http://www.toronto.ca/legdocs/municode/1184_545.pdf

========================================================================

For this assignment, you should use shell tools (sort, grep, sed, wc,
etc.) and write Python code to analyze the data and answer the
following questions:

Q1: How many infractions are recorded in the dataset? How many distinct
infractions are there?

Q2: Of the distinct infractions found, how many arise out of the Health
Protection Act vs. the Municipal Code?

Q3: What irregularities in the data made it harder to answer Q1 and Q2?
How did you work around those irregularities?

Q4: Not every infraction cites a law. Using other information (e.g.
from other infractions in the dataset and the text of the laws
[2][3][4]), how many of those infractions can you match with a specific
section (subsection, paragraph) of a law? Your code should take a
sequence of infraction details as input and output groups of
infractions that fall under the same section of law. For example:

O. Reg. 562/90 Sec. 68(4)
    fail to clean toilets as often as necessary
    fail to clean toilets once a day
    fail to clean washbasins as often as necessary
    fail to clean washbasins once a day
    ...
O. Reg. 562/90 Sec. 23
    fail to store food on pallets
    fail to store food on racks or shelves
    ...

Q5: Which section of the law has the most infractions? How many?

Q6: (BONUS, CHALLENGING) Of inspections that closed a business, what
rule was cited the most often?
How many times? What detailed reasons were given?

Because this is real data, it will be "messy" and potentially ambiguous.
That is part of the "fun." You will find that case-insensitive
comparisons and regular expressions can be extremely useful.

========================================================================
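
As a starting point for the remark about case-insensitive comparisons
and regular expressions, here is a minimal Python sketch that pulls a
cited section out of an infraction detail and groups descriptions by
section. It is an illustration under assumptions, not a reference
solution: the citation shape ("O. Reg 562/90 Sec. 68(4)" with loose
spacing and punctuation), the sample strings, and the helper names
split_citation and group_by_section are all hypothetical. Check the
real INFRACTION_DETAILS values before relying on the pattern.

```python
import re
from collections import defaultdict

# ASSUMPTION: citations look roughly like "O. Reg 562/90 Sec. 68(4)",
# with inconsistent case, dots, and spacing. Adjust after inspecting the data.
CITATION = re.compile(
    r"\bo\.?\s*reg\.?\s*562(?:/90)?\s*,?\s*sec(?:tion|\.)?\s*"
    r"(\d+(?:\s*\(\w+\))*)",
    re.IGNORECASE,
)


def split_citation(detail):
    """Return (description, section); section is None if no citation is found."""
    match = CITATION.search(detail)
    if match is None:
        # Lowercase and collapse whitespace so near-duplicates compare equal.
        return " ".join(detail.lower().split()), None
    # Drop spaces inside the section number: "68 (4)" becomes "68(4)".
    section = re.sub(r"\s+", "", match.group(1))
    # Remove the citation text, then normalize what is left.
    description = detail[:match.start()] + detail[match.end():]
    description = " ".join(description.lower().split()).strip(" -,.;:")
    return description, "O. Reg. 562/90 Sec. " + section


def group_by_section(details):
    """Map each normalized section (or None) to its set of descriptions."""
    groups = defaultdict(set)
    for detail in details:
        description, section = split_citation(detail)
        groups[section].add(description)
    return groups


# Hypothetical inputs, made up to show irregular formatting.
samples = [
    "Fail to clean toilets once a day O. Reg 562/90 Sec. 68(4)",
    "FAIL TO CLEAN TOILETS ONCE A DAY  O.REG 562/90 SEC. 68 (4)",
    "fail to store food on racks or shelves - O. Reg. 562/90 Sec. 23",
    "Fail to clean washbasins once a day",
]

for section, descriptions in group_by_section(samples).items():
    print(section)
    for description in sorted(descriptions):
        print("    " + description)
```

Infractions with no recognizable citation land in the None group, which
is the set Q4 asks you to match against the laws by other means.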