Submitty
diff --git a/.travis.yml b/.travis.yml
Lines changed: 16 additions & 20 deletions
diff --git a/Makefile b/Makefile
Lines changed: 0 additions & 31 deletions
diff --git a/README.md b/README.md
Lines changed: 6 additions & 62 deletions
diff --git a/anonymization/Makefile b/anonymization/Makefile
Lines changed: 28 additions & 0 deletions
diff --git a/anonymization/README.md b/anonymization/README.md
Lines changed: 15 additions & 0 deletions
diff --git a/anonymization/bin/anonymization b/anonymization/bin/anonymization
Lines changed: 16 additions & 0 deletions
diff --git a/anonymization/bin/anonymize_log b/anonymization/bin/anonymize_log
Lines changed: 92 additions & 0 deletions
diff --git a/anonymization/include/config.h b/anonymization/include/config.h
Lines changed: 30 additions & 0 deletions
@@ -1,31 +1,27 @@
-sudo: required
 dist: trusty
+sudo: false
 
 language: c
 
-compiler: gcc
+cache:
+  directories:
+    - ~/.stack/
 
 addons:
   apt:
-    sources:
-      - sourceline: "deb http://downloads.skewed.de/apt/trusty trusty universe"
-      - ubuntu-toolchain-r-test
     packages:
-      - build-essential
-      - pkg-config
-      - flex
-      - bison
-      - libpcre3 
-      - libpcre3-dev
-      - splint
-      - indent
-      - python3
-      - python3-dev
-      - libpython3.4 
-      - python3-pip
-      - python-graph-tool
+      - libgmp-dev
 
 before_install:
- - python3 -m pip install pylint --user
+  - mkdir -p ~/.local/bin
+  - export PATH=$HOME/.local/bin:$PATH
+  - travis_retry curl -L https://www.stackage.org/stack/linux-x86_64 | tar xz --wildcards --strip-components=1 -C ~/.local/bin '*/stack'
 
-script: make
+install:
+  - travis_wait 30 stack --no-terminal --install-ghc build
+  - stack --no-terminal install hlint
+
+script:
+  - stack --no-terminal test --only-dependencies
+  - ./hlint '--ignore=Parse error' src # All parse errors should be caught during compilation, and HLint erroneously throws a parse error
+  - ./hlint '--ignore=Parse error' app # when using constrained types alongside a record-syntax GADT declaration (bug in haskell-src-exts).
@@ -1,66 +1,10 @@
 # Analysis Tools [![Build Status](https://travis-ci.org/Submitty/AnalysisTools.svg?branch=master)](https://travis-ci.org/Submitty/AnalysisTools)
 This repository contains a variety of tools used for source code analysis.
 
-## Dependencies
-* C compiler with Posix standard libraries (tested on GCC 4.9.3)
-* Flex (tested on version 2.5.39)
-* GNU Bison (tested on version 3.0.4)
-* PCRE (tested on version 8.38)
-* Python 3.4 (with development libraries)
-* Eclipse JDT library
-* python-graph-tool
+## Installation
 
-Optionally, if GNU Indent, Splint, and/or Pylint are installed, they will be incorporated into the build process to check for common stylistic and logical errors. It is recommended to install these tools if you wish to make a large contribution.
-
-## Building
-
-If you're on the Submitty Ubuntu 14.04 vagrant box, first run `make ubuntudeps`. Note: You'll need to be root.
-
-Then run `make` in the repository root. This will build a number of executables, described below, to the `bin/` directory. In the interest of providing examples that fully express the usage of the tool, each example uses the complete set of available flags. However, *every* flag is optional, as a rule. Default values can be set at three different levels: at the lowest level of precedence, build-wide values can be set in `include/config.h`. Single values in `include/config.h` can be overwritten for a specific executable by setting `CFLAGS` when calling `make`. Finally, language-specific values and some other runtime configuration used by `bin/plagiarism` can be modified in `config/plagiarism.json` by default (or by passing the `-c config_file.json` flag to `bin/plagiarism`).
-
-### `bin/anonymize`
-Usage example: `cat file.c | ./bin/anonymize -n name_list.txt -t to_replace.csv -r 's/foo/bar/' -a 10`
-
-Anonymizes standard input, writing to standard output. In the example above, case-insensitively redacts all words found in the whitespace-delimited list in `name_list.txt`, replaces any word found in the first column of the two-column CSV file `to_replace.csv` with the corresponding word in the second column, and replaces any string matching regular expression `foo` with the string `bar`. If the `-a <n>` flag is provided, replace only in the first `n` lines of the input.
-
-Multiple instances of each flag may be passed when appropriate. For example, `./bin/anonymize -n name_list.txt -n name_list_2.txt` will replace words found in either list.
-
-### `bin/anonymization`
-Usage example: `./bin/anonymization source_dir -n first_names.txt -n last_names.txt -t rcs_ids.txt -r 's/66[0-9]{7}/REDACTEDRIN/' -a 5 -l 3 source_dir/data/*`. 
-
-Anonymizes all files in the given directory tree. The `-n`, `-t`, `-a`, and `-r` flags have the same meaning as those passed to `bin/anonymize`. One additional flag is also accepted: `-l`, which allows the user to specify a level in the directory tree (as an integer). Any directories at this level will also have their names anonymized according to the CSV files passed using the `-t` flag. Note that only those CSV files will be used for the replacement: simple name lists passed with `-n` will not be used for directory replacement.
-
-Any additional arguments passed after the above flags are treated as files to ignore. To ignore entire directories, ensure the paths are well-formed, i.e. `foo/bar/baz/` (noting the `/` at the end).
-
-Capture the standard output of this run to see the statistics regarding the number of replacements in directory names, file names, and file contents.
-
-### `bin/plagiarism`
-Usage example: `./bin/plagiarism java source_dir`
-
-Checks every source file in the given language contained in `source_dir` (or a subdirectory) against every other source file for possible plagiarism, writing sorted (by percent match) CSV data to standard output.
-
-### `bin/winnow`
-Usage example: `cat tokens | ./bin/winnow -u 30 -l 15`
-
-Applies the winnowing algorithm to a sequence of whitespace-separated integer-valued tokens read from standard input using upper bound `30` and lower bound `15`. Writes the selected fingerprints to standard output.
-
-### `bin/walk`
-Usage example: `./bin/walk -u 30 -l 15 python source_dir`
-
-Tokenizes and winnows each source file in `source_dir`, using upper bound `30` and lower bound `15`. Winnowing output is written in a parallel directory tree as a subdirectory of `.analysis_data/<timestamp>`, where timestamp is a unique integer-valued timestamp. The timestamp used is written to standard output.
-
-### `bin/genpairs`
-Usage example: `./bin/genpairs <timestamp>`
-
-Generates all pairs of files in the directory `.analysis_data/<timestamp>`, writing those pairs to standard output.
-
-### `bin/compare_fingerprints`
-Usage example: `cat pairs | ./bin/genpairs <timestamp>`
-
-Reads pairs from standard input, reads fingerprint data from `.analysis_data/<timestamp>`, and computes match data, writing to standard output.
-
-### `bin/csa`
-Usage example: `./bin/csa student\_structure\_file subgraph\_structure\_file`
-
-### Java Parser
-Arguments: `\[javafile ...\] \[ -t for XML AST \|  -u for class hierarchical data\] -o filename`
+    apt-get install stack
+    stack upgrade --install-ghc
+    git clone https://github.com/Submitty/AnalysisTools
+    cd AnalysisTools
+    stack install
@@ -0,0 +1,28 @@
+CC = gcc
+
+SRCS = $(foreach file,$(wildcard src/*.c),$(notdir $(file)))
+BUILD_DIR = bin
+LIB_DIR = lib_$(CC)
+BINARIES = $(addprefix $(BUILD_DIR)/, $(SRCS:.c=))
+
+CFLAGS_gcc = -Iinclude -I/usr/local/include -O2 -g -Wall -Werror -D_POSIX_C_SOURCE=200809 -D_DEFAULT_SOURCE -Wno-unused-result
+CFLAGS = $(CFLAGS_$(CC))
+LINKER_FLAGS_gcc = -lm -lpcre
+LINKER_FLAGS = $(LINKER_FLAGS_$(CC))
+
+vpath %.c src
+
+.PHONY: all directories clean
+
+all: directories $(BINARIES)
+
+directories: $(BUILD_DIR)
+
+$(BUILD_DIR):
+	mkdir -p $(BUILD_DIR)
+
+$(BUILD_DIR)/%: %.c
+	$(CC) -o $@ $(CFLAGS) $< $(LINKER_FLAGS)
+
+clean:
+	rm $(BINARIES) -f
@@ -0,0 +1,15 @@
+### `bin/anonymize`
+Usage example: `cat file.c | ./bin/anonymize -n name_list.txt -t to_replace.csv -r 's/foo/bar/' -a 10`
+
+Anonymizes standard input, writing to standard output. In the example above, case-insensitively redacts all words found in the whitespace-delimited list in `name_list.txt`, replaces any word found in the first column of the two-column CSV file `to_replace.csv` with the corresponding word in the second column, and replaces any string matching regular expression `foo` with the string `bar`. If the `-a <n>` flag is provided, replace only in the first `n` lines of the input.
+
+Multiple instances of each flag may be passed when appropriate. For example, `./bin/anonymize -n name_list.txt -n name_list_2.txt` will replace words found in either list.
+
+### `bin/anonymization`
+Usage example: `./bin/anonymization source_dir -n first_names.txt -n last_names.txt -t rcs_ids.txt -r 's/66[0-9]{7}/REDACTEDRIN/' -a 5 -l 3 source_dir/data/*`. 
+
+Anonymizes all files in the given directory tree. The `-n`, `-t`, `-a`, and `-r` flags have the same meaning as those passed to `bin/anonymize`. One additional flag is also accepted: `-l`, which allows the user to specify a level in the directory tree (as an integer). Any directories at this level will also have their names anonymized according to the CSV files passed using the `-t` flag. Note that only those CSV files will be used for the replacement: simple name lists passed with `-n` will not be used for directory replacement.
+
+Any additional arguments passed after the above flags are treated as files to ignore. To ignore entire directories, ensure the paths are well-formed, i.e. `foo/bar/baz/` (noting the `/` at the end).
+
+Capture the standard output of this run to see the statistics regarding the number of replacements in directory names, file names, and file contents.
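The flag semantics described in the `bin/anonymize` section of this README can be sketched in a few lines of Python. This is an illustrative model only, not the C implementation: the function name `anonymize`, its parameters, the fixed `"REDACTED"` marker, and the case-sensitive table lookup are all assumptions made for the example.

```python
import re


def anonymize(lines, names=(), table=None, regex_subs=(), first_n=None):
    """Hypothetical model of the -n, -t, -r and -a flags described above."""
    table = table or {}
    name_set = {name.lower() for name in names}

    def replace_word(match):
        word = match.group(0)
        if word in table:                # -t: CSV column 1 -> column 2
            return table[word]
        if word.lower() in name_set:     # -n: case-insensitive redaction
            return "REDACTED"            # simplified marker (assumption)
        return word

    out = []
    for index, line in enumerate(lines):
        if first_n is not None and index >= first_n:
            out.append(line)             # -a <n>: later lines are untouched
            continue
        line = re.sub(r"\w+", replace_word, line)
        for pattern, replacement in regex_subs:   # -r: 's/pattern/replacement/'
            line = re.sub(pattern, replacement, line)
        out.append(line)
    return out


if __name__ == "__main__":
    for text in anonymize(["// Author: Alice (smithj)", "int foo = 1;"],
                          names=["alice"], table={"smithj": "student1"},
                          regex_subs=[("foo", "bar")], first_n=10):
        print(text)
```

Passing several name lists, as in the `-n name_list.txt -n name_list_2.txt` example, would correspond here to concatenating the lists before calling the function.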
@@ -0,0 +1,16 @@
+#!/usr/bin/env python3
+
+"""
+Run anonymization system.
+"""
+
+import sys
+import subprocess
+
+REALARGS = sys.argv
+
+REALARGS[0] = "./bin/anonymize_dirs"
+ANONYMIZE_DIRS = subprocess.Popen(REALARGS, stderr=subprocess.PIPE)
+ANONYMIZE_LOG = subprocess.Popen(["./bin/anonymize_log"],
+                                 stdin=ANONYMIZE_DIRS.stderr)
+ANONYMIZE_LOG.wait()
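The wrapper script above connects the standard error stream of one process to the standard input of another. A minimal, self-contained sketch of that pattern follows; the two child commands are stand-ins launched through `sys.executable` (hypothetical, in place of `./bin/anonymize_dirs` and `./bin/anonymize_log`), and the helper name `run_pipeline` is an assumption.

```python
import subprocess
import sys

# Stand-in producer: writes one log line to stderr.
PRODUCER = [sys.executable, "-c",
            "import sys; sys.stderr.write('Replaced alice with REDACTED0001\\n')"]
# Stand-in consumer: echoes whatever arrives on stdin, upper-cased.
CONSUMER = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_pipeline(producer_cmd, consumer_cmd):
    """Feed the producer's stderr into the consumer's stdin."""
    producer = subprocess.Popen(producer_cmd, stderr=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stderr,
                                stdout=subprocess.PIPE)
    # Close the parent's copy of the pipe so the consumer sees end-of-file
    # as soon as the producer exits.
    producer.stderr.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode()


if __name__ == "__main__":
    print(run_pipeline(PRODUCER, CONSUMER), end="")
```

Closing the parent's handle and waiting on both children is a common precaution with this pattern; whether the wrapper needs it depends on how long the producer runs.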
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+
+"""
+Summarize statistics for anonymization.
+"""
+
+from __future__ import print_function
+import sys
+import argparse
+import json
+import re
+
+from collections import defaultdict
+
+ARGPARSER = argparse.ArgumentParser(\
+        description="Summarize statistics for anonymization.")
+ARGPARSER.add_argument("-c", "--config", type=int)
+ARGPARSER.add_argument("-p", "--plot", action="store_const", const=True)
+ARGS = ARGPARSER.parse_args()
+
+DIRECTORY_NAMES = defaultdict(lambda: 0)
+FILE_NAMES = defaultdict(lambda: defaultdict(lambda: 0))
+INTERNAL_NAMES = defaultdict(lambda: 0)
+IGNORED_NAMES = defaultdict(lambda: defaultdict(lambda: 0))
+IGNORED_FILE_NAMES = defaultdict(lambda: 0)
+
+def print_csv():
+    """
+    Output anonymization statistics in CSV format.
+    """
+    print("=== REPLACEMENTS IN DIRECTORIES ===")
+    for key in DIRECTORY_NAMES.keys():
+        print(key + "," + str(DIRECTORY_NAMES[key]))
+
+    print("=== REPLACEMENTS IN FILENAMES ===")
+    for key in FILE_NAMES.keys():
+        print(key + "," + FILE_NAMES[key]["new"] + "," +
+              str(FILE_NAMES[key]["count"]))
+
+    print("=== REPLACEMENTS IN FILES ===")
+    for (name, new) in INTERNAL_NAMES.keys():
+        print(name + "," + new + "," +
+              str(INTERNAL_NAMES[(name, new)]))
+
+    print("=== IGNORED IN FILENAMES ===")
+    for key in IGNORED_FILE_NAMES.keys():
+        print(key + "," + str(IGNORED_FILE_NAMES[key]))
+
+    print("=== IGNORED IN FILES ===")
+    for key in IGNORED_NAMES.keys():
+        print(key + "," + IGNORED_NAMES[key]["new"] + "," +
+              str(IGNORED_NAMES[key]["count"]))
+
+def print_gnuplot():
+    """
+    Output anonymization statistics in a GNUPlot-readable format.
+    """
+    for i, key in enumerate(INTERNAL_NAMES.keys()):
+        print(i, key, INTERNAL_NAMES[key])
+
+for line in sys.stdin:
+    match = re.search(r"Replaced (\w+) with (\w+)", line)
+    if match:
+
+        INTERNAL_NAMES[(match.group(1), match.group(2))] += 1
+        continue
+    match = re.search(r"Applied substitution in path ([\w|/.]+)", line)
+    if match:
+        DIRECTORY_NAMES[match.group(1)] += 1
+        continue
+    match = re.search(r"Applied substitution in filename: (\w+) for (\w+)",
+                      line)
+    if match:
+        FILE_NAMES[match.group(2)]["new"] = match.group(1)
+        FILE_NAMES[match.group(2)]["count"] += 1
+        continue
+    match = re.search(r"Ignored potential replacement of (\w+) with (\w+)",
+                      line)
+    if match:
+        IGNORED_NAMES[match.group(1)]["new"] = match.group(2)
+        IGNORED_NAMES[match.group(1)]["count"] += 1
+        continue
+    match = re.search(r"Ignored file ([\w|/.]+)",
+                      line)
+    if match:
+        IGNORED_FILE_NAMES[match.group(1)] += 1
+        continue
+
+if ARGS.plot:
+    print_gnuplot()
+else:
+    print_csv()
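The tallying loop in `anonymize_log` can be exercised in isolation. The sketch below reuses three of the regular expressions from the script and counts matches with `collections.Counter`; the sample log lines and the `summarize` helper are assumptions written to fit those expressions, not output captured from the real tool.

```python
import re
from collections import Counter

# Hypothetical log lines shaped to match the script's regular expressions.
SAMPLE_LOG = """\
Replaced alice with REDACTED0001
Replaced alice with REDACTED0001
Applied substitution in path src/student1
Ignored file data/readme.txt
"""


def summarize(lines):
    """Count replacements, rewritten paths, and ignored files."""
    replaced = Counter()
    paths = Counter()
    ignored_files = Counter()
    for line in lines:
        match = re.search(r"Replaced (\w+) with (\w+)", line)
        if match:
            # Keyed by (original, replacement); the value is a plain count.
            replaced[(match.group(1), match.group(2))] += 1
            continue
        match = re.search(r"Applied substitution in path ([\w|/.]+)", line)
        if match:
            paths[match.group(1)] += 1
            continue
        match = re.search(r"Ignored file ([\w|/.]+)", line)
        if match:
            ignored_files[match.group(1)] += 1
    return replaced, paths, ignored_files


if __name__ == "__main__":
    replaced, paths, ignored_files = summarize(SAMPLE_LOG.splitlines())
    for (name, new), count in replaced.items():
        print(name + "," + new + "," + str(count))
```

Because the replacement counter maps a tuple directly to an integer, any consumer of it should index with the tuple alone rather than with a nested `"count"` key.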
@@ -0,0 +1,30 @@
+#ifndef CONFIG_H
+#define CONFIG_H
+
+/*
+ * Length used for character buffers holding paths.
+ */
+#ifndef STRING_LENGTH
+#define STRING_LENGTH 1024
+#endif
+
+/*
+ * The directory in which the system saves metadata for each execution.
+ */
+#ifndef WORKING_DIR
+#define WORKING_DIR ".analysis_data"
+#endif
+
+/*
+ * Pattern used to redact matched names with no provided replacement in
+ * the anonymization tool.
+ */
+#ifndef REDACTION_PATTERN
+#define REDACTION_PATTERN "REDACTED%04u"
+#endif
+
+#ifndef ANONIMIZATION_HASH_BOUND
+#define ANONIMIZATION_HASH_BOUND 9999
+#endif
+
+#endif
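One plausible reading of `REDACTION_PATTERN` and `ANONIMIZATION_HASH_BOUND` is that a name is hashed, reduced modulo the bound, and formatted into the pattern. The Python sketch below illustrates that reading only; the hash function `toy_hash`, the helper `redaction_token`, and the use of the bound as a modulus are assumptions, since the C source that consumes these macros is not part of this diff.

```python
REDACTION_PATTERN = "REDACTED%04u"   # copied from config.h
ANONIMIZATION_HASH_BOUND = 9999      # copied from config.h (spelling as in the source)


def toy_hash(name):
    """Hypothetical stand-in hash; the real hash is not shown in the diff."""
    value = 0
    for char in name.lower():
        value = value * 31 + ord(char)
    return value


def redaction_token(name):
    """Format a zero-padded token, assuming the bound is used as a modulus."""
    return REDACTION_PATTERN % (toy_hash(name) % ANONIMIZATION_HASH_BOUND)


if __name__ == "__main__":
    print(redaction_token("Alice"))
```

Under this assumption the numeric suffix always fits in four digits, and the same name maps to the same token regardless of letter case.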