29 commits
6b28730
using node.id and networkx to construct json out
robobenklein Apr 4, 2022
a0a0215
disable test docker container
robobenklein May 21, 2022
6de9fd6
can export text in graphml
robobenklein May 21, 2022
8de110b
allow prefixing nodes in nx export
robobenklein May 22, 2022
525059b
cleanup before new idea
robobenklein Feb 13, 2023
9a57301
one arg one line
robobenklein Feb 14, 2023
b5cd3cd
initial attempt at node hashing
robobenklein Feb 14, 2023
5b84da5
coords in jsonl output
robobenklein Feb 14, 2023
c48ab87
type of node and x2
robobenklein Mar 10, 2023
c676068
reenable c_sharp and fix the root node ERROR
robobenklein Mar 24, 2023
7d74e10
wip wociterators and nhv1_blobfuncs_v0
robobenklein Apr 10, 2023
d60060a
iterating blobs with filename filter
robobenklein Apr 10, 2023
40fc8dc
write to stdout for file subcommand
robobenklein Apr 10, 2023
47954b4
py3.8
robobenklein Apr 10, 2023
574772e
now recording data outputs
robobenklein Apr 10, 2023
98ac961
manual pool approach
robobenklein Apr 11, 2023
93c7e56
binary_line_iterator
robobenklein Apr 11, 2023
f517e21
forgot log import
robobenklein Apr 11, 2023
8640919
pass specific errors only
robobenklein Apr 11, 2023
3d107dd
can't pickle the BlobResult...
robobenklein Apr 11, 2023
7123a41
retry for the munmap call error
robobenklein Apr 11, 2023
f4fcc45
refactor to blobjob
robobenklein Apr 11, 2023
c87e11f
handling lock failures
robobenklein Apr 14, 2023
7ae9621
fix order of lock and lib check
robobenklein Apr 14, 2023
a13b0d6
increase to 64 processes
robobenklein Apr 14, 2023
bcd8d9c
save more errors
robobenklein Apr 14, 2023
6ca75a8
now using redis to cache BlobStatus
robobenklein Apr 23, 2023
642c5da
counter for BlobStatus from redis
robobenklein Apr 24, 2023
4132d2c
1hr timeout, optimize initial line iter
robobenklein Apr 24, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -127,3 +127,6 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+#
+coredumps/
16 changes: 8 additions & 8 deletions .travis.yml
@@ -16,11 +16,11 @@ install:
 - python setup.py install
 
 script:
-- python tests/test.py -l javascript tests/stuff.js
-- python tests/test.py -l python tests/stuff.py
-- python tests/test.py -l ruby tests/stuff.rb
-- python tests/test.py -l c tests/stuff.c
-- python tests/test.py -l cpp tests/stuff.cpp
-- python tests/test.py -l java tests/stuff.java
-- python tests/test.py -l rust tests/stuff.rs
-- python tests/test.py -l go tests/stuff.go
+- wsyntree-collector file -l javascript tests/stuff.js
+- wsyntree-collector file -l python tests/stuff.py
+- wsyntree-collector file -l ruby tests/stuff.rb
+- wsyntree-collector file -l c tests/stuff.c
+- wsyntree-collector file -l cpp tests/stuff.cpp
+- wsyntree-collector file -l java tests/stuff.java
+- wsyntree-collector file -l rust tests/stuff.rs
+- wsyntree-collector file -l go tests/stuff.go
2 changes: 1 addition & 1 deletion LICENSE.md
@@ -1,4 +1,4 @@
-Copyright (c) 2020-2021;
+Copyright (c) 2020-2023;
 Ben Klein (robobenklein)
 *et al* (`git shortlog -sn`)

Expand Down
15 changes: 9 additions & 6 deletions README.md
@@ -12,11 +12,8 @@ Scales to incredible size: the goal is to build a system capable of storing, par
 WorldSyntaxTree is built upon the following technologies:
 
 - Python: we chose this as it is quick to develop and has a large ecosystem of scientific libraries that we aim to be able to support integration with. In our field of research Python is the most popular to use for quickly wrangling large and complex datasets.
+- [NetworkX](https://networkx.org/): graph wrangling library of choice
 - [Tree-Sitter](https://tree-sitter.github.io/tree-sitter/): this is the integral component to enable us to generate the concrete syntax trees from code quickly while still generating a somewhat useful result even on broken code.
-- [ArangoDB](https://www.arangodb.com/): our choice of database stemmed from the following requirements:
-  - Must be open source (free for use and improvement by all)
-  - Must support our incredibly large data size (many terabytes)
-  - Must have native/serverside graph processing capabilities
 - Git: the outer / top-level structure for the whole tree is based upon Git's structure of repositories, commits, and files, thus we aren't currently exploring other VCS systems (though we might in the far future)
 
 For a full list of libraries used, check the `setup.py` or `requirements.txt`.
@@ -54,9 +51,15 @@ Requirements:
 - Standard development tooling (git, pip, python-dev, setuptools, etc)
 - C++ compiler (ubuntu: `libc++-dev libc++abi-dev`, plus any other dependencies needed for WST to auto-compile Tree-Sitter languages)
 - Python 3.8+
-- Optional: an ArangoDB instance
 
-Install steps:
+Recommended: perform these steps in a python Virtual Environment:
 
+```
+virtualenv -p python3 venv
+source venv/bin/activate # run again to activate the virtualenv again later
+```
+
+Install dependencies and wsyntree itself:
+
 ```
 python -m pip install -r requirements.txt
Expand Down
25 changes: 14 additions & 11 deletions docker-compose.yml
@@ -2,23 +2,26 @@
 version: '3'
 
 services:
-  wst-test:
-    build:
-      context: .
-      dockerfile: docker/testing.dockerfile
-    depends_on:
-      - neo4j
-    restart: "no"
-    command: "wait-for-it neo4j:7687 -- /wst/docker/test-collector.sh"
-    environment:
-      - NEO4J_BOLT_URL=bolt://neo4j:pass@neo4j:7687
+  # wst-test:
+  #   build:
+  #     context: .
+  #     dockerfile: docker/testing.dockerfile
+  #   depends_on:
+  #     - neo4j
+  #   restart: "no"
+  #   command: "wait-for-it neo4j:7687 -- /wst/docker/test-collector.sh"
+  #   environment:
+  #     - NEO4J_BOLT_URL=bolt://neo4j:pass@neo4j:7687
 
   neo4j:
-    image: neo4j:4.2
+    image: neo4j:5
     ports:
       - 127.0.0.1:9784:7474
       - 127.0.0.1:9787:7687
     environment:
       - NEO4J_AUTH=none
+      - "NEO4JLABS_PLUGINS=[\"apoc\"]"
+      - "NEO4J_dbms_security_procedures_unrestricted=apoc.\\*"
+      - "NEO4J_apoc_import_file_enabled=true"
     # volumes:
     #   - wst_neo4j_data:/data
3 changes: 3 additions & 0 deletions requirements.txt
@@ -1,4 +1,5 @@
 tree_sitter@git+https://github.com/utk-se/py-tree-sitter.git@master#egg=tree_sitter
+grand-cypher@git+https://github.com/aplbrain/grand-cypher.git@master#egg=grand-cypher
 python-arango
 coloredlogs
 pygit2
@@ -11,3 +12,5 @@ enlighten==1.9.0
 psutil
 bpython
 orjson>=3.0.0
+networkx[default]
+redis[hiredis]
1 change: 1 addition & 0 deletions setup.py
@@ -51,6 +51,7 @@
         "psutil",
         "bpython",
         "orjson>=3.0.0",
+        "networkx[default]"
     ],
     entry_points={
         'console_scripts': [
Expand Down
7 changes: 6 additions & 1 deletion tests/stuff.rb
@@ -1 +1,6 @@
-puts "Hello World"
+
+def hello_world
+  puts "Hello World"
+end
+
+hello_world
18 changes: 18 additions & 0 deletions woc-support/Dockerfile
@@ -0,0 +1,18 @@
+FROM robobenklein/home:latest
+
+USER root
+# enable sshd
+RUN rm -f /etc/service/sshd/down
+RUN /etc/my_init.d/00_regen_ssh_host_keys.sh
+RUN install_clean python3-dev libghc-bzlib-dev
+ARG UNAME=testuser
+ARG UID=1000
+ARG GID=1000
+RUN groupadd -g $GID -o $UNAME
+RUN useradd -m -u $UID -g $GID -G sudo -o -s /bin/zsh $UNAME
+RUN usermod -aG docker_env $UNAME
+RUN chown -R :docker_env /etc/container_environment /etc/container_environment.sh /etc/container_environment.json
+RUN chmod -R g+rwX /etc/container_environment /etc/container_environment.sh /etc/container_environment.json
+USER $UNAME
+CMD sudo /sbin/my_init
+WORKDIR /home/$UNAME
40 changes: 40 additions & 0 deletions woc-support/docker-compose.yml
@@ -0,0 +1,40 @@
+
+version: '2'
+
+services:
+  wst-jobrunner:
+    build:
+      context: .
+      args:
+        # remember to export these in the shell before build:
+        UNAME: ${USER}
+        GID: ${GID}
+        UID: ${UID}
+    #image: robobenklein/home:latest
+    restart: "no"
+    command: "sudo /sbin/my_init"
+    environment:
+      - OSCAR_TEST=1
+    volumes:
+      - "/da0_data:/da0_data:ro"
+      - "/da1_data:/da1_data:ro"
+      - "/da2_data:/da2_data:ro"
+      - "/da3_data:/da3_data:ro"
+      - "/da4_data:/da4_data:ro"
+      - "/da4_fast:/da4_fast:ro"
+      - "/da5_data:/da5_data:ro"
+      - "/da5_fast:/da5_fast:ro"
+      - "/da7_data:/da7_data:ro"
+      - "/da7_data/WorldSyntaxTree:/da7_data/WorldSyntaxTree" # RW: output in here
+      - "/home/bklein3:/home/bklein3"
+    mem_limit: 512G
+
+  wst-redis:
+    image: redis:7-alpine
+    mem_limit: 64G
+
+  wst-telegraf:
+    image: telegraf:latest
+    volumes:
+      - "./telegraf.conf:/etc/telegraf/telegraf.conf:ro"
+    hostname: ${HOSTNAME:-nohost}-wst-telegraf
8 changes: 4 additions & 4 deletions wsyntree/constants.py
@@ -23,10 +23,10 @@
         "tsrepo": "https://github.com/tree-sitter/tree-sitter-ruby.git",
         "file_ext": "\.rb$",
     },
-    # "csharp": {
-    #     "tsrepo": "https://github.com/tree-sitter/tree-sitter-c-sharp.git",
-    #     "file_ext": "\.cs$",
-    # },
+    "c_sharp": { # name in compiled language uses underscore
+        "tsrepo": "https://github.com/tree-sitter/tree-sitter-c-sharp.git",
+        "file_ext": "\.cs$",
+    },
     "c": {
         "tsrepo": "https://github.com/tree-sitter/tree-sitter-c.git",
         "file_ext": "\.(c|h)$",
Expand Down
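As a rough illustration of how a table like this can map a filename to a language key, here is a minimal sketch. The `LANGS` subset and the `guess_language` helper are hypothetical and are not taken from `wsyntree`. The patterns are written as raw strings so that `\.` reaches the regex engine unchanged instead of being treated as a string escape.

```python
import re

# Hypothetical subset of the language table shown in the diff above.
# Raw strings keep "\." as a regex escape rather than a string escape.
LANGS = {
    "ruby": {"file_ext": r"\.rb$"},
    "c_sharp": {"file_ext": r"\.cs$"},
    "c": {"file_ext": r"\.(c|h)$"},
}


def guess_language(filename):
    """Return the first language key whose file_ext pattern matches, else None."""
    for name, spec in LANGS.items():
        if re.search(spec["file_ext"], filename):
            return name
    return None


print(guess_language("Program.cs"))
print(guess_language("notes.txt"))
```

Because every pattern is anchored with `$`, `Program.cs` matches only the `c_sharp` entry and not the `c` entry.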
3 changes: 3 additions & 0 deletions wsyntree/exceptions.py
@@ -31,6 +31,9 @@ class UnhandledGitFileMode(ValueError, WSTBaseError):
 class DeduplicatedObjectMismatch(ValueError, WSTBaseError):
     pass
 
+class RootTreeSitterNodeIsError(ValueError, WSTBaseError):
+    pass
+
 def isArangoWriteWriteConflict(e: ArangoDocumentInsertError) -> bool:
     """Is an exception a Write-Write conflict?"""
     if isinstance(e, ArangoDocumentInsertError):
Expand Down
59 changes: 59 additions & 0 deletions wsyntree/hashtypes/__init__.py
@@ -0,0 +1,59 @@
+"""
+# Why are there multiple node hash types?
+
+Multiple types are needed because different properties can be included or excluded
+from the hash, e.g. to include or not to include named nodes.
+
+
+"""
+from enum import Enum
+import hashlib
+import functools
+
+from wsyntree import log
+
+import orjson
+import networkx as nx
+
+class WSTNodeHashV1():
+    __included__ = ["named", "type"]
+    def __init__(self, G, node):
+        """"""
+        self._graph = G
+        self._node = node
+        self._nodes = []
+        nodes = nx.dfs_preorder_nodes(G, source=node)
+        for node in nodes:
+            self._nodes.append(node)
+
+    @functools.lru_cache(maxsize=None) # functools.cache added in 3.9
+    def _get_hashable_repr(self):
+        s = bytearray(b"WSTNodeHashV1<")
+        nodedata = list(map(
+            lambda x: {k:v for k,v in x.items() if k in self.__included__},
+            [self._graph.nodes[n] for n in self._nodes]
+        ))
+        # s += ",".join([f"{list(nd.items())}" for nd in nodedata])
+        # we must sort keys here in case a Python version is used that does
+        # not preserve dict ordering
+        s += orjson.dumps(nodedata, option=orjson.OPT_SORT_KEYS)
+        # log.debug(f"{self}, {nodedata}")
+        s += b">"
+        return s
+
+    @property
+    def _node_props(self):
+        return self._graph.nodes[self._node]
+
+    def _get_sha512_hex(self):
+        h = hashlib.sha512()
+        h.update(self._get_hashable_repr())
+        return h.hexdigest()
+
+    def __str__(self):
+        return f"WSTNodeHashV1"
+
+# Once defined here the behavior should not change (stable versions)
+class WSTNodeHashType(Enum):
+    # V1 = WSTNodeHashV1
+    pass
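The hashing idea in `WSTNodeHashV1` is to walk a subtree in depth-first preorder, keep only the `named` and `type` properties, serialize with sorted keys, then SHA-512 the result. It can be tried without NetworkX or orjson using the sketch below. This is a standard-library stand-in, not the PR's implementation: `preorder`, `subtree_hash`, the `SketchNodeHash<` prefix and the sample trees are all assumptions, and the digests will not match those produced by `WSTNodeHashV1` because `json` and `orjson` serialize differently.

```python
import hashlib
import json

INCLUDED = ("named", "type")  # same property names as __included__ in the diff


def preorder(children, root):
    """Yield node ids in depth-first preorder starting at root."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(children.get(node, [])))


def subtree_hash(props, children, root):
    """SHA-512 hex digest over the filtered properties of a subtree."""
    filtered = [
        {k: v for k, v in props[n].items() if k in INCLUDED}
        for n in preorder(children, root)
    ]
    # sort_keys makes the serialization independent of dict insertion order.
    body = json.dumps(filtered, sort_keys=True).encode()
    return hashlib.sha512(b"SketchNodeHash<" + body + b">").hexdigest()


# Hypothetical trees: same shape and node types, different ids and text.
props_a = {1: {"type": "call", "named": True, "text": "f(x)"},
           2: {"type": "identifier", "named": True, "text": "f"}}
props_b = {7: {"type": "call", "named": True, "text": "g(y)"},
           9: {"type": "identifier", "named": True, "text": "g"}}
hash_a = subtree_hash(props_a, {1: [2]}, 1)
hash_b = subtree_hash(props_b, {7: [9]}, 7)
print(hash_a == hash_b)
```

Since node ids and text are excluded from the hashed representation, two subtrees with the same structure and node types produce the same digest, which appears to be the point of restricting the hash to `__included__`.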
4 changes: 2 additions & 2 deletions wsyntree/log.py
@@ -69,15 +69,15 @@ def __exit__(self, type, value, traceback):
         logger.setLevel(self._prev_state)
 
 
-class supress_stdout():
+class suppress_stdout():
     """
     Stops the logger's StreamHandlers temporarily.
 
     Used to prevent output from messing with ncurses views
     or other terminal full-windows views.
 
     Use a 'with' statement:
-    with supress_stdout():
+    with suppress_stdout():
         # your code
 
     """
Expand Down
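The body of `suppress_stdout` is not shown in this hunk. To illustrate the behavior its docstring describes, here is a minimal sketch of one way a context manager can silence a logger's stream handlers temporarily. The class name `quiet_stream_handlers` and the demo logger are hypothetical, and the real class may work differently.

```python
import io
import logging


class quiet_stream_handlers:
    """Hypothetical stand-in: detach StreamHandlers for the duration of a with-block."""

    def __init__(self, logger):
        self._logger = logger
        self._removed = []

    def __enter__(self):
        for handler in list(self._logger.handlers):
            # FileHandler subclasses StreamHandler; leave file logging active.
            if isinstance(handler, logging.StreamHandler) and not isinstance(
                handler, logging.FileHandler
            ):
                self._logger.removeHandler(handler)
                self._removed.append(handler)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for handler in self._removed:
            self._logger.addHandler(handler)
        self._removed.clear()
        return False  # never swallow exceptions


buf = io.StringIO()
demo = logging.getLogger("wst_sketch_demo")
demo.propagate = False
demo.setLevel(logging.INFO)
demo.addHandler(logging.StreamHandler(buf))

demo.info("before")
with quiet_stream_handlers(demo):
    demo.info("hidden")
demo.info("after")
print(buf.getvalue())
```

Restoring the handlers in `__exit__` means terminal output resumes even if the code inside the block raises.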
27 changes: 15 additions & 12 deletions wsyntree/wrap_tree_sitter.py
@@ -4,6 +4,7 @@
 import functools
 import re
 import time
+import warnings
 
 import pebble
 from tree_sitter import Language, Parser, TreeCursor, Node
@@ -18,6 +19,7 @@
 
 class TreeSitterAutoBuiltLanguage():
     def __init__(self, lang):
+        assert lang in wsyntree_langs, f"{lang} not found or not yet available in WorldSyntaxTree"
         self.lang = lang
         self.parser = None
         self.ts_language = None
@@ -42,6 +44,7 @@ def _get_language_repo(self):
             if not repodir.exists():
                 repodir.mkdir(mode=0o770)
                 log.debug(f"cloning treesitter repo for {self}")
+                warnings.warn(f"WorldSyntaxTree cloning parser repo for {self.lang}, this might be slow.")
                 return git.clone_repository(
                     wsyntree_langs[self.lang]["tsrepo"],
                     repodir.resolve()
@@ -56,25 +59,25 @@ def _get_language_repo(self):
 
     def _get_language_library(self):
         try:
-            self.ts_lang_cache_lock.acquire(timeout=300)
             lib = self._get_language_cache_dir() / "language.so"
             repo = self._get_language_repo()
             repodir = self._get_language_repo_path()
-            if not lib.exists():
-                log.warn(f"building library for {self}, this could take a while...")
-                start = time.time()
-                Language.build_library(
-                    str(lib.resolve()),
-                    [repodir]
-                )
-                log.debug(f"library build of {self} completed after {round(time.time() - start)} seconds")
+            with self.ts_lang_cache_lock.acquire(timeout=600):
+                if not lib.exists():
+                    log.warn(f"building library for {self}, this could take a while...")
+                    start = time.time()
+                    Language.build_library(
+                        str(lib.resolve()),
+                        [repodir]
+                    )
+                    log.debug(f"library build of {self} completed after {round(time.time() - start)} seconds")
             return lib
         except filelock.Timeout as e:
-            log.error(f"Failed to acquire lock on TSABL {self}")
+            log.error(f"Failed to acquire lock on TSABL {self} (needed to build language lib)")
             log.debug(f"lock object is {self.ts_lang_cache_lock}")
             raise e
-        finally:
-            self.ts_lang_cache_lock.release()
+        #finally:
+        #    self.ts_lang_cache_lock.release()
 
     def _get_ts_language(self):
         if self.ts_language is not None:
Expand Down
12 changes: 12 additions & 0 deletions wsyntree_collector/__main__.py
@@ -30,6 +30,8 @@
 from .jsonl_collector import WST_JSONLCollector
 from .batch_analyzer import set_batch_analyze_args
 
+from . import commands
+
 
 def analyze(args):
 
@@ -248,6 +250,16 @@ def __main__():
         help="Delete any existing data in the database",
         action="store_true",
     )
+
+    # run for a single file
+    cmd_file = subcmds.add_parser(
+        'file', aliases=[], help="Run WST on a single file")
+    commands.file.set_args(cmd_file)
+
+    cmd_node_hash_v1 = subcmds.add_parser(
+        'node_hash_v1', aliases=['nhv1'], help="Hash syntax nodes (V1)")
+    commands.node_hash_v1.set_args(cmd_node_hash_v1)
+
     args = parser.parse_args()
 
     if args.verbose:
Expand Down
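The subcommand wiring added here follows a common `argparse` pattern: each command module receives its own sub-parser and registers its arguments on it. The sketch below is a self-contained approximation. The `set_file_args` and `set_nhv1_args` functions and the `func` default are assumptions about what a `set_args` helper might do. Only the subcommand names, the `nhv1` alias and the `-l` flag come from the diff.

```python
import argparse


def set_file_args(parser):
    """Hypothetical equivalent of commands.file.set_args."""
    parser.add_argument("path")
    parser.add_argument("-l", "--lang", default=None)
    parser.set_defaults(func=lambda a: f"file:{a.path}:{a.lang}")


def set_nhv1_args(parser):
    """Hypothetical equivalent of commands.node_hash_v1.set_args."""
    parser.add_argument("path")
    parser.set_defaults(func=lambda a: f"nhv1:{a.path}")


def build_parser():
    parser = argparse.ArgumentParser(prog="collector-sketch")
    subcmds = parser.add_subparsers(dest="command", required=True)
    set_file_args(subcmds.add_parser("file", help="Run on a single file"))
    set_nhv1_args(subcmds.add_parser(
        "node_hash_v1", aliases=["nhv1"], help="Hash syntax nodes (V1)"))
    return parser


args = build_parser().parse_args(["nhv1", "tests/stuff.rb"])
print(args.func(args))
```

Storing a callable via `set_defaults(func=...)` lets the entry point dispatch with `args.func(args)` regardless of whether the full name or the alias was typed.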
3 changes: 3 additions & 0 deletions wsyntree_collector/commands/__init__.py
@@ -0,0 +1,3 @@
+
+from . import file
+from . import node_hash_v1