Summer UROP planning #170

@MBJean

Overview

This issue contains preliminary notes on summer UROP work. It condenses the notes and issues accumulated as the last few groups have picked up work on this project. The goal is to produce three or four self-contained issues that the program can tackle during the summer.

First steps

Check out the onboarding and installation documentation for the lab here. We'll spend the first week getting up to speed with the project and setting up all of the required tooling.

Project ideas

Improve the interface

Relevant topics: object-oriented programming, interface design, data structures.

The analysis modules dependency_parsing and dunning do not follow the updated package architecture outlined in issue #157. Let's bring them up to the standard of the rest of the package and, while we're at it, improve their test coverage. Consider whether this package would serve as a suitable upgrade for the existing dependency_parsing dependency tree implementation.

Improve memory usage

Relevant topics: data structures, algorithms, memory.

We could improve the performance and memory usage of our various analysis modules. Here's an assemblage of notes we've gathered over the last few months, left here until we can sort things out:

From issue #157: "I'd encourage us to think, here, about processing load! Initializing our Analyzer functions on medium-to-large corpora takes a really long time, so I think we want to do that as little as possible. For example, the GenderProximityAnalyzer function needs to be initialized for each part-of-speech set; if I want to look for verbs but have initialized based on adjectives, e.g., I have to initialize twice."
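One way to attack the repeated-initialization cost is to do the expensive tagging pass once and filter by part of speech at query time. A minimal sketch under that approach; the class and method names here are hypothetical stand-ins, not the real GenderProximityAnalyzer API:

```python
class ProximityAnalyzer:
    """Sketch: pay the expensive tagging cost once, filter by POS at query time."""

    def __init__(self, tagged_tokens):
        # tagged_tokens: list of (word, pos) pairs. The slow step (tagging the
        # corpus) happens once, before or at construction, and is stored.
        self._tagged = tagged_tokens

    def words_with_pos(self, pos_prefixes):
        # Cheap filter at query time: switching from adjectives ('JJ') to
        # verbs ('VB') no longer requires re-initializing the analyzer.
        return [word for word, pos in self._tagged
                if any(pos.startswith(prefix) for prefix in pos_prefixes)]

analyzer = ProximityAnalyzer([('she', 'PRP'), ('ran', 'VBD'), ('quick', 'JJ')])
analyzer.words_with_pos(['VB'])   # ['ran']
analyzer.words_with_pos(['JJ'])   # ['quick']
```

The design choice is to move the POS set from a constructor argument to a query argument, so one initialization serves every part-of-speech lookup.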

Mentioned in Slack: "feature request: a progress bar for hefty analyses!"
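The natural hook for that progress bar is the loop over documents inside a heavy analysis. A stdlib-only sketch (in practice a library like tqdm does this better; analyze_with_progress is a hypothetical helper, not part of the package):

```python
import sys

def analyze_with_progress(documents, analyze, report=sys.stderr):
    """Run `analyze` over each document, printing a one-line progress counter.
    `report` is any writable stream; stderr keeps results on stdout clean."""
    results = []
    total = len(documents)
    for i, doc in enumerate(documents, 1):
        results.append(analyze(doc))
        # '\r' rewrites the same line rather than scrolling the terminal.
        report.write(f'\r{i}/{total} documents analyzed')
        report.flush()
    report.write('\n')
    return results
```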

Outlined in issue #135: there are two separate kinds of tokenization that occur in the codebase, nltk.word_tokenize and the custom implementation in Document.get_tokenized_text(). We may want to standardize our implementation, provide better memoization, and generally ensure that all analyses that rely on tokenized texts are able to retrieve the tokenized text optimally.
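Whichever tokenizer we standardize on, caching its output on the Document means every analysis reuses a single tokenization pass. A simplified sketch of that memoization; the regex is a stand-in for the real tokenizer (nltk.word_tokenize or the custom implementation), and this class is a hypothetical reduction of the package's Document:

```python
import re

class Document:
    """Sketch: tokenize lazily, once, and hand back the cached result."""

    def __init__(self, text):
        self.text = text
        self._tokens = None  # computed on first request, then reused

    def get_tokenized_text(self):
        if self._tokens is None:
            # Stand-in tokenizer; the real choice of tokenizer is the open
            # question in issue #135 — the caching pattern is independent of it.
            self._tokens = re.findall(r"[a-z']+", self.text.lower())
        return self._tokens
```

Every analysis then calls get_tokenized_text() freely; only the first call pays the tokenization cost.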

Consider which packages/tools might be of use here (this package, vel sim.).

Improve how we manage texts

Relevant topics: interface design, web scraping, data types.

It's likely that we can improve how we work with .txt files in the package as a whole. We've identified one significant problem, for instance: if a user imports a .txt file with no content into a Corpus or Document instance, many of our analyses fail (notes outlining the problem here). We also implicitly rely on the user constructing meaningful .csv metadata files for many of our analysis modules, so at the very least we could help them out by updating our documentation with tips for creating such a file (notes here). It's probably worth spending some time identifying other, similar problems relating to how we format and use .txt files.

We could also revive the long-running attempt to let users load Project Gutenberg texts (issue #41).

Improve the user-facing output

Relevant topics: interface design, data structures, data visualization.

Most of our analysis modules produce Python dictionaries. These are relatively simple to traverse and to convert into other data formats (for instance, pandas DataFrames), but they may not be the ideal format for our users without that additional transformation. It's also likely that our users will want to produce data visualizations based on these analysis modules, much of which is relatively straightforward with something like pandas and matplotlib (see the example in issue #165). How much can and should we streamline that data visualization work for our users? Which of our analysis modules are particularly suitable for creating visualizations?
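As an illustration of how small the gap is, a nested {document label: {word: count}} result flattens into a tidy DataFrame in a few lines. A sketch assuming that result shape (the actual return shapes vary by module, and the helper name is mine):

```python
import pandas as pd

def frequency_dict_to_frame(results):
    """Flatten {document_label: {word: count}} into a tidy, plottable DataFrame."""
    rows = [
        {'document': label, 'word': word, 'count': count}
        for label, counts in results.items()
        for word, count in counts.items()
    ]
    return pd.DataFrame(rows)

df = frequency_dict_to_frame({'austen': {'she': 3, 'he': 1}})
# From here a grouped bar chart is one line, e.g.:
# df.pivot(index='word', columns='document', values='count').plot.bar()
```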

Additionally, there's some inconsistency throughout the package as to which data format we return and when. In many places, for instance, we return dictionaries that would be better represented as Counter instances. This topic is initially outlined in issue #104.
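Switching those returns from dict to Counter is nearly free and buys callers useful behavior for nothing. A sketch of the shape such a return could take (the function name is illustrative):

```python
from collections import Counter

def word_counts(tokens):
    """Return a Counter rather than a plain dict: callers get .most_common(),
    count arithmetic between documents, and a zero default for missing words."""
    return Counter(tokens)

counts = word_counts(['she', 'ran', 'she'])
counts.most_common(1)   # [('she', 2)]
counts['he']            # 0, instead of a KeyError from a plain dict
```

Since Counter subclasses dict, existing callers that treat the return as a plain dictionary keep working.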

General issues

  • The new frequency module does not currently allow the user to find average pronoun counts across documents (called out in issue #165, "Aggregator Functions for gender_frequency.py"). Should we introduce that?
  • Could we create a .summary on Corpus? What would it print to the user?
  • Should help() on a particular class or method be more useful?
  • Document.word_count could use a remove_swords flag.
  • Document.get_wordcount_counter() could use a remove_swords flag.
  • Could we make Document.words_associated() and Document.get_word_windows() more sophisticated? These methods measure only the occurrence of a word immediately after, or within a window of, the target word in the tokenized text, so we're definitely picking up associations across syntactic breaks. While those words undoubtedly have an association with the target word, they're 'less' associated than, say, words in an object relationship with the target word.
  • For many of the Document methods, could we pull those out to the Corpus class and organize the return by Document label? That way the interface would always be the Corpus.
  • Document.update_metadata() would require us to update any cached returns in the analysis modules.
  • Corpus.count_authors_by_gender() takes a string argument to represent gender. Is that standard throughout the analysis modules?
  • Corpus.get_wordcount_counter() could use a remove_swords flag.
  • Corpus.get_field_vals() could be supplanted by a .summary vel sim.
  • Corpus.get_sample_text_passages() returns a list of tuples of the shape Tuple[str(document filename), str(sentence)]. It might be more useful to return a dictionary with the document filename/label (filename minus extension) as the top-level keys. This data structure is very common throughout the analysis modules.
  • dunning.dunn_individual_word_by_corpus() throws ZeroDivisionError if the target word doesn't exist in a Corpus.
  • Any reason metadata_visualizations shouldn't be methods on the Corpus class?
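On the dunning.dunn_individual_word_by_corpus() ZeroDivisionError: the standard Dunning log-likelihood drops zero-count terms rather than dividing or taking log of them, so an absent word can simply score 0. A guarded sketch — parameter names are mine and totals are assumed positive, so this is not the actual signature:

```python
import math

def dunning_log_likelihood(count_a, count_b, total_a, total_b):
    """Dunning log-likelihood for one word across two corpora, guarding the
    zero-count case. count_*: occurrences of the word; total_*: corpus sizes
    in words (assumed > 0)."""
    if count_a == 0 and count_b == 0:
        return 0.0  # word appears in neither corpus: no evidence either way
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    llr = 0.0
    # Skip zero-count terms: lim x->0 of x*log(x/e) is 0, so dropping them
    # is the mathematically correct behavior, not just an error guard.
    if count_a:
        llr += count_a * math.log(count_a / expected_a)
    if count_b:
        llr += count_b * math.log(count_b / expected_b)
    return 2.0 * llr
```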
