Commit cf620a9

Refactor installation #23
2 parents fb09dda + b24627b commit cf620a9

65 files changed: 2293 additions, 2404 deletions

.travis.yml

Lines changed: 16 additions & 20 deletions
@@ -1,31 +1,27 @@
-sudo: required
 dist: trusty
+sudo: false
 
 language: c
 
-compiler: gcc
+cache:
+  directories:
+    - ~/.stack/
 
 addons:
   apt:
-    sources:
-      - sourceline: "deb http://downloads.skewed.de/apt/trusty trusty universe"
-      - ubuntu-toolchain-r-test
     packages:
-      - build-essential
-      - pkg-config
-      - flex
-      - bison
-      - libpcre3
-      - libpcre3-dev
-      - splint
-      - indent
-      - python3
-      - python3-dev
-      - libpython3.4
-      - python3-pip
-      - python-graph-tool
+      - libgmp-dev
 
 before_install:
-  - python3 -m pip install pylint --user
+  - mkdir -p ~/.local/bin
+  - export PATH=$HOME/.local/bin:$PATH
+  - travis_retry curl -L https://www.stackage.org/stack/linux-x86_64 | tar xz --wildcards --strip-components=1 -C ~/.local/bin '*/stack'
 
-script: make
+install:
+  - travis_wait 30 stack --no-terminal --install-ghc build
+  - stack --no-terminal install hlint
+
+script:
+  - stack --no-terminal test --only-dependencies
+  - ./hlint '--ignore=Parse error' src # All parse errors should be caught during compilation, and HLint erroneously throws a parse error
+  - ./hlint '--ignore=Parse error' app # when using constrained types alongside a record-syntax GADT declaration (bug in haskell-src-exts).

Makefile

Lines changed: 0 additions & 31 deletions
This file was deleted.

README.md

Lines changed: 6 additions & 62 deletions
@@ -1,66 +1,10 @@
 # Analysis Tools [![Build Status](https://travis-ci.org/Submitty/AnalysisTools.svg?branch=master)](https://travis-ci.org/Submitty/AnalysisTools)
 This repository contains a variety of tools used for source code analysis.
 
-## Dependencies
-* C compiler with Posix standard libraries (tested on GCC 4.9.3)
-* Flex (tested on version 2.5.39)
-* GNU Bison (tested on version 3.0.4)
-* PCRE (tested on version 8.38)
-* Python 3.4 (with development libraries)
-* Eclipse JDT library
-* python-graph-tool
+## Installation
 
-Optionally, if GNU Indent, Splint, and/or Pylint are installed, they will be incorporated into the build process to check for common stylistic and logical errors. It is recommended to install these tools if you wish to make a large contribution.
-
-## Building
-
-If you're on the Submitty Ubuntu 14.04 vagrant box, first run `make ubuntudeps`. Note: You'll need to be root.
-
-Then run `make` in the repository root. This will build a number of executables, described below, to the `bin/` directory. In the interest of providing examples that fully express the usage of the tool, each example uses the complete set of available flags. However, *every* flag is optional, as a rule. Default values can be set at three different levels: at the lowest level of precedence, build-wide values can be set in `include/config.h`. Single values in `include/config.h` can be overwritten for a specific executable by setting `CFLAGS` when calling `make`. Finally, language-specific values and some other runtime configuration used by `bin/plagiarism` can be modified in `config/plagiarism.json` by default (or by passing the `-c config_file.json` flag to `bin/plagiarism`).
-
-### `bin/anonymize`
-Usage example: `cat file.c | ./bin/anonymize -n name_list.txt -t to_replace.csv -r 's/foo/bar/' -a 10`
-
-Anonymizes standard input, writing to standard output. In the example above, case-insensitively redacts all words found in the whitespace-delimited list in `name_list.txt`, replaces any word found in the first column of the two-column CSV file `to_replace.csv` with the corresponding word in the second column, and replaces any string matching regular expression `foo` with the string `bar`. If the `-a <n>` flag is provided, replace only in the first `n` lines of the input.
-
-Multiple instances of each flag may be passed when appropriate. For example, `./bin/anonymize -n name_list.txt -n name_list_2.txt` will replace words found in either list.
-
-### `bin/anonymization`
-Usage example: `./bin/anonymization source_dir -n first_names.txt -n last_names.txt -t rcs_ids.txt -r 's/66[0-9]{7}/REDACTEDRIN/' -a 5 -l 3 source_dir/data/*`.
-
-Anonymizes all files in the given directory tree. The `-n`, `-t`, `-a`, and `-r` flags have the same meaning as those passed to `bin/anonymize`. One additional flag is also accepted: `-l`, which allows the user to specify a level in the directory tree (as an integer). Any directories at this level will also have their names anonymized according to the CSV files passed using the `-t` flag. Note that only those CSV files will be used for the replacement: simple name lists passed with `-n` will not be used for directory replacement.
-
-Any additional arguments passed after the above flags are treated as files to ignore. To ignore entire directories, ensure the paths are well-formed, i.e. `foo/bar/baz/` (noting the `/` at the end).
-
-Capture the standard output of this run to see the statistics regarding the number of replacements in directory names, file names, and file contents.
-
-### `bin/plagiarism`
-Usage example: `./bin/plagiarism java source_dir`
-
-Checks every source file in the given language contained in `source_dir` (or a subdirectory) against every other source file for possible plagiarism, writing sorted (by percent match) CSV data to standard output.
-
-### `bin/winnow`
-Usage example: `cat tokens | ./bin/winnow -u 30 -l 15`
-
-Applies the winnowing algorithm to a sequence of whitespace-separated integer-valued tokens read from standard input using upper bound `30` and lower bound `15`. Writes the selected fingerprints to standard output.
-
-### `bin/walk`
-Usage example: `./bin/walk -u 30 -l 15 python source_dir`
-
-Tokenizes and winnows each source file in `source_dir`, using upper bound `30` and lower bound `15`. Winnowing output is written in a parallel directory tree as a subdirectory of `.analysis_data/<timestamp>`, where timestamp is a unique integer-valued timestamp. The timestamp used is written to standard output.
-
-### `bin/genpairs`
-Usage example: `./bin/genpairs <timestamp>`
-
-Generates all pairs of files in the directory `.analysis_data/<timestamp>`, writing those pairs to standard output.
-
-### `bin/compare_fingerprints`
-Usage example: `cat pairs | ./bin/genpairs <timestamp>`
-
-Reads pairs from standard input, reads fingerprint data from `.analysis_data/<timestamp>`, and computes match data, writing to standard output.
-
-### `bin/csa`
-Usage example: `./bin/csa student\_structure\_file subgraph\_structure\_file`
-
-### Java Parser
-Arguments: `\[javafile ...\] \[ -t for XML AST \| -u for class hierarchical data\] -o filename`
+apt-get install stack
+stack upgrade --install-ghc
+git clone https://github.com/Submitty/AnalysisTools
+cd AnalysisTools
+stack install

anonymization/Makefile

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+CC = gcc
+
+SRCS = $(foreach file,$(wildcard src/*.c),$(notdir $(file)))
+BUILD_DIR = bin
+LIB_DIR = lib_$(CC)
+BINARIES = $(addprefix $(BUILD_DIR)/, $(SRCS:.c=))
+
+CFLAGS_gcc = -Iinclude -I/usr/local/include -O2 -g -Wall -Werror -D_POSIX_C_SOURCE=200809 -D_DEFAULT_SOURCE -Wno-unused-result
+CFLAGS = $(CFLAGS_$(CC))
+LINKER_FLAGS_gcc = -lm -lpcre
+LINKER_FLAGS = $(LINKER_FLAGS_$(CC))
+
+vpath %.c src
+
+.PHONY: all directories clean
+
+all: directories $(BINARIES)
+
+directories: $(BUILD_DIR)
+
+$(BUILD_DIR):
+	mkdir -p $(BUILD_DIR)
+
+$(BUILD_DIR)/%: %.c
+	$(CC) -o $@ $(CFLAGS) $< $(LINKER_FLAGS)
+
+clean:
+	rm $(BINARIES) -f

anonymization/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+### `bin/anonymize`
+Usage example: `cat file.c | ./bin/anonymize -n name_list.txt -t to_replace.csv -r 's/foo/bar/' -a 10`
+
+Anonymizes standard input, writing to standard output. In the example above, case-insensitively redacts all words found in the whitespace-delimited list in `name_list.txt`, replaces any word found in the first column of the two-column CSV file `to_replace.csv` with the corresponding word in the second column, and replaces any string matching regular expression `foo` with the string `bar`. If the `-a <n>` flag is provided, replace only in the first `n` lines of the input.
+
+Multiple instances of each flag may be passed when appropriate. For example, `./bin/anonymize -n name_list.txt -n name_list_2.txt` will replace words found in either list.
+
+### `bin/anonymization`
+Usage example: `./bin/anonymization source_dir -n first_names.txt -n last_names.txt -t rcs_ids.txt -r 's/66[0-9]{7}/REDACTEDRIN/' -a 5 -l 3 source_dir/data/*`.
+
+Anonymizes all files in the given directory tree. The `-n`, `-t`, `-a`, and `-r` flags have the same meaning as those passed to `bin/anonymize`. One additional flag is also accepted: `-l`, which allows the user to specify a level in the directory tree (as an integer). Any directories at this level will also have their names anonymized according to the CSV files passed using the `-t` flag. Note that only those CSV files will be used for the replacement: simple name lists passed with `-n` will not be used for directory replacement.
+
+Any additional arguments passed after the above flags are treated as files to ignore. To ignore entire directories, ensure the paths are well-formed, i.e. `foo/bar/baz/` (noting the `/` at the end).
+
+Capture the standard output of this run to see the statistics regarding the number of replacements in directory names, file names, and file contents.
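
The behaviour this README describes for `bin/anonymize` can be sketched in Python. This is an illustration only, not the C implementation touched by this commit: the function name `anonymize_lines`, its parameters, the fixed `REDACTED` placeholder, the exact-match lookup for the CSV column, and the order in which the rules are applied are all assumptions made for the example.

```python
import re


def anonymize_lines(lines, names=(), replacements=None, regex_subs=(), first_n=None):
    """Hypothetical sketch of the documented -n, -t, -r and -a behaviour."""
    replacements = replacements or {}
    redact = {name.lower() for name in names}
    out = []
    for index, line in enumerate(lines):
        # -a <n>: only the first n lines are rewritten; the rest pass through.
        if first_n is not None and index >= first_n:
            out.append(line)
            continue

        def swap(match):
            word = match.group(0)
            if word in replacements:        # -t: first CSV column -> second column
                return replacements[word]
            if word.lower() in redact:      # -n: case-insensitive redaction
                return "REDACTED"
            return word

        line = re.sub(r"\w+", swap, line)
        for pattern, repl in regex_subs:    # -r: s/pattern/repl/
            line = re.sub(pattern, repl, line)
        out.append(line)
    return out
```

Calling `anonymize_lines(text, names=["alice"], replacements={"jdoe": "user1"}, regex_subs=[("foo", "bar")], first_n=10)` corresponds roughly to the `-n`, `-t`, `-r` and `-a 10` flags in the usage example above.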

anonymization/bin/anonymization

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env python3
+
+"""
+Run anonymization system.
+"""
+
+import sys
+import subprocess
+
+REALARGS = sys.argv
+
+REALARGS[0] = "./bin/anonymize_dirs"
+ANONYMIZE_DIRS = subprocess.Popen(REALARGS, stderr=subprocess.PIPE)
+ANONYMIZE_LOG = subprocess.Popen(["./bin/anonymize_log"],
+                                 stdin=ANONYMIZE_DIRS.stderr)
+ANONYMIZE_LOG.wait()
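
The launcher above connects the standard error of `anonymize_dirs` to the standard input of `anonymize_log`. A minimal, self-contained sketch of the same wiring follows. The two child commands are hypothetical stand-ins built from `sys.executable` so the example runs without the real binaries, and the sample log line is invented for illustration.

```python
import subprocess
import sys

# Stand-in for ./bin/anonymize_dirs: writes one log line to stderr.
PRODUCER = [sys.executable, "-c",
            "import sys; sys.stderr.write('Replaced alice with REDACTED0001\\n')"]
# Stand-in for ./bin/anonymize_log: counts the lines it reads on stdin.
CONSUMER = [sys.executable, "-c",
            "import sys; print(len(sys.stdin.readlines()))"]


def run_pipeline():
    """Pipe the producer's stderr into the consumer's stdin."""
    producer = subprocess.Popen(PRODUCER, stderr=subprocess.PIPE)
    consumer = subprocess.Popen(CONSUMER, stdin=producer.stderr,
                                stdout=subprocess.PIPE, text=True)
    # Close our copy so the producer receives SIGPIPE if the consumer exits early.
    producer.stderr.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.strip()
```

Unlike the committed script, this sketch also waits on the producer and closes the parent's handle on the pipe, which is the pattern the `subprocess` documentation suggests for shell-style pipelines.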

anonymization/bin/anonymize_log

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+
+"""
+Summarize statistics for anonymization.
+"""
+
+from __future__ import print_function
+import sys
+import argparse
+import json
+import re
+
+from collections import defaultdict
+
+ARGPARSER = argparse.ArgumentParser(\
+    description="Summarize statistics for anonymization.")
+ARGPARSER.add_argument("-c", "--config", type=int)
+ARGPARSER.add_argument("-p", "--plot", action="store_const", const=True)
+ARGS = ARGPARSER.parse_args()
+
+DIRECTORY_NAMES = defaultdict(lambda: 0)
+FILE_NAMES = defaultdict(lambda: defaultdict(lambda: 0))
+INTERNAL_NAMES = defaultdict(lambda: 0)
+IGNORED_NAMES = defaultdict(lambda: defaultdict(lambda: 0))
+IGNORED_FILE_NAMES = defaultdict(lambda: 0)
+
+def print_csv():
+    """
+    Output anonymization statistics in CSV format.
+    """
+    print("=== REPLACEMENTS IN DIRECTORIES ===")
+    for key in DIRECTORY_NAMES.keys():
+        print(key + "," + str(DIRECTORY_NAMES[key]))
+
+    print("=== REPLACEMENTS IN FILENAMES ===")
+    for key in FILE_NAMES.keys():
+        print(key + "," + FILE_NAMES[key]["new"] + "," +
+              str(FILE_NAMES[key]["count"]))
+
+    print("=== REPLACEMENTS IN FILES ===")
+    for (name, new) in INTERNAL_NAMES.keys():
+        print(name + "," + new + "," +
+              str(INTERNAL_NAMES[(name, new)]))
+
+    print("=== IGNORED IN FILENAMES ===")
+    for key in IGNORED_FILE_NAMES.keys():
+        print(key + "," + str(IGNORED_FILE_NAMES[key]))
+
+    print("=== IGNORED IN FILES ===")
+    for key in IGNORED_NAMES.keys():
+        print(key + "," + IGNORED_NAMES[key]["new"] + "," +
+              str(IGNORED_NAMES[key]["count"]))
+
+def print_gnuplot():
+    """
+    Output anonymization statistics in a GNUPlot-readable format.
+    """
+    for i, key in enumerate(INTERNAL_NAMES.keys()):
+        print(i, key, INTERNAL_NAMES[key]["count"])
+
+for line in sys.stdin:
+    match = re.search(r"Replaced (\w+) with (\w+)", line)
+    if match:
+
+        INTERNAL_NAMES[(match.group(1), match.group(2))] += 1
+        continue
+    match = re.search(r"Applied substitution in path ([\w|/.]+)", line)
+    if match:
+        DIRECTORY_NAMES[match.group(1)] += 1
+        continue
+    match = re.search(r"Applied substitution in filename: (\w+) for (\w+)",
+                      line)
+    if match:
+        FILE_NAMES[match.group(2)]["new"] = match.group(1)
+        FILE_NAMES[match.group(2)]["count"] += 1
+        continue
+    match = re.search(r"Ignored potential replacement of (\w+) with (\w+)",
+                      line)
+    if match:
+        IGNORED_NAMES[match.group(1)]["new"] = match.group(2)
+        IGNORED_NAMES[match.group(1)]["count"] += 1
+        continue
+    match = re.search(r"Ignored file ([\w|/.]+)",
+                      line)
+    if match:
+        IGNORED_FILE_NAMES[match.group(1)] += 1
+        continue
+
+if ARGS.plot:
+    print_gnuplot()
+else:
+    print_csv()
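
To make the log format that `anonymize_log` expects more concrete, the sketch below applies two of its regular expressions to a few sample lines and tallies the matches the way the main loop does. The patterns are copied from the script; the helper name `tally` and the sample log lines are hypothetical and only illustrate the shape of the input.

```python
import re
from collections import defaultdict

# Patterns copied from anonymize_log.
REPLACED = re.compile(r"Replaced (\w+) with (\w+)")
PATH_SUB = re.compile(r"Applied substitution in path ([\w|/.]+)")


def tally(lines):
    """Count in-file replacements and path substitutions, as the main loop does."""
    internal = defaultdict(int)
    directories = defaultdict(int)
    for line in lines:
        match = REPLACED.search(line)
        if match:
            # Keyed by (old name, new name), like INTERNAL_NAMES.
            internal[(match.group(1), match.group(2))] += 1
            continue
        match = PATH_SUB.search(line)
        if match:
            # Keyed by path, like DIRECTORY_NAMES.
            directories[match.group(1)] += 1
    return dict(internal), dict(directories)
```

Because the in-file counter is keyed by a tuple and holds plain integers, each entry is read back as `internal[(old, new)]` rather than through a nested `"count"` field.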

anonymization/include/config.h

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+#ifndef CONFIG_H
+#define CONFIG_H
+
+/*
+ * Length used for character buffers holding paths.
+ */
+#ifndef STRING_LENGTH
+#define STRING_LENGTH 1024
+#endif
+
+/*
+ * The directory in which the system saves metadata for each execution.
+ */
+#ifndef WORKING_DIR
+#define WORKING_DIR ".analysis_data"
+#endif
+
+/*
+ * Pattern used to redact matched names with no provided replacement in
+ * the anonymization tool.
+ */
+#ifndef REDACTION_PATTERN
+#define REDACTION_PATTERN "REDACTED%04u"
+#endif
+
+#ifndef ANONIMIZATION_HASH_BOUND
+#define ANONIMIZATION_HASH_BOUND 9999
+#endif
+
+#endif
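
One plausible reading of `REDACTION_PATTERN` and `ANONIMIZATION_HASH_BOUND` is that a matched name is hashed, reduced modulo the bound, and formatted into the pattern. The sketch below illustrates that reading only. The diff does not show how the C code combines the two macros, so the modulo step, the CRC32 hash, and the helper name `redact` are all assumptions.

```python
import zlib

REDACTION_PATTERN = "REDACTED%04u"   # value from config.h
ANONIMIZATION_HASH_BOUND = 9999      # value from config.h (spelling as in the header)


def redact(name):
    """Map a name to a stable placeholder. CRC32 is an assumed stand-in hash."""
    digest = zlib.crc32(name.lower().encode("utf-8"))
    bucket = digest % ANONIMIZATION_HASH_BOUND   # assumed use of the bound
    # Python's printf-style formatting accepts %u as an alias for %d.
    return REDACTION_PATTERN % bucket
```

Under these assumptions the same name always maps to the same four-digit placeholder, so repeated occurrences stay consistent across files, while distinct names may collide because only 9999 buckets are available.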
