Summer UROP planning #170

@MBJean

Overview

This issue contains preliminary notes on summer UROP work. It condenses the notes and issues accumulated as the last few groups have picked up work on this project. The goal is to produce three or four self-contained issues that the program can tackle during the summer.

First steps

Check out the onboarding and installation documentation for the lab here. We'll spend the first week getting up to speed with the project and setting up all of the required tooling.

Project ideas

Improve the interface

Relevant topics: object-oriented programming, interface design, data structures.

The analysis modules dependency_parsing and dunning do not follow the updated package architecture outlined in issue #157. Let's bring them up to the standard of the rest of the package and, while we're at it, improve their test coverage. Consider whether this package would serve as a suitable upgrade for the existing dependency_parsing dependency tree implementation.

Improve memory usage

Relevant topics: data structures, algorithms, memory.

We could improve the performance and memory usage of our various analysis modules. Here's an assemblage of notes we've gathered over the last few months, left here until we can sort things out:

From issue #157: "I'd encourage us to think, here, about processing load! Initializing our Analyzer functions on medium-to-large corpora takes a really long time, so I think we want to do that as little as possible. For example, the GenderProximityAnalyzer function needs to be initialized for each part-of-speech set; if I want to look for verbs but have initialized based on adjectives, e.g., I have to initialize twice."
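One way to attack the repeated-initialization cost is to do the expensive tagging pass once and filter by part of speech at query time. A minimal sketch under that approach; the class and method names here are hypothetical stand-ins, not the real GenderProximityAnalyzer API:

```python
class ProximityAnalyzer:
    """Sketch: pay the expensive tagging cost once, filter by POS at query time."""

    def __init__(self, tagged_tokens):
        # tagged_tokens: list of (word, pos) pairs. The slow step (tagging the
        # corpus) happens once, before or at construction, and is stored.
        self._tagged = tagged_tokens

    def words_with_pos(self, pos_prefixes):
        # Cheap filter at query time: switching from adjectives ('JJ') to
        # verbs ('VB') no longer requires re-initializing the analyzer.
        return [word for word, pos in self._tagged
                if any(pos.startswith(prefix) for prefix in pos_prefixes)]

analyzer = ProximityAnalyzer([('she', 'PRP'), ('ran', 'VBD'), ('quick', 'JJ')])
analyzer.words_with_pos(['VB'])   # ['ran']
analyzer.words_with_pos(['JJ'])   # ['quick']
```

The design choice is to move the POS set from a constructor argument to a query argument, so one initialization serves every part-of-speech lookup.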

Mentioned in Slack: "feature request: a progress bar for hefty analyses!"
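The natural hook for that progress bar is the loop over documents inside a heavy analysis. A stdlib-only sketch (in practice a library like tqdm does this better; analyze_with_progress is a hypothetical helper, not part of the package):

```python
import sys

def analyze_with_progress(documents, analyze, report=sys.stderr):
    """Run `analyze` over each document, printing a one-line progress counter.
    `report` is any writable stream; stderr keeps results on stdout clean."""
    results = []
    total = len(documents)
    for i, doc in enumerate(documents, 1):
        results.append(analyze(doc))
        # '\r' rewrites the same line rather than scrolling the terminal.
        report.write(f'\r{i}/{total} documents analyzed')
        report.flush()
    report.write('\n')
    return results
```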

Outlined in issue #135: there are two separate kinds of tokenization that occur in the codebase, nltk.word_tokenize and the custom implementation in Document.get_tokenized_text(). We may want to standardize our implementation, provide better memoization, and generally ensure that all analyses that rely on tokenized texts are able to retrieve the tokenized text optimally.
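Whichever tokenizer we standardize on, caching its output on the Document means every analysis reuses a single tokenization pass. A simplified sketch of that memoization; the regex is a stand-in for the real tokenizer (nltk.word_tokenize or the custom implementation), and this class is a hypothetical reduction of the package's Document:

```python
import re

class Document:
    """Sketch: tokenize lazily, once, and hand back the cached result."""

    def __init__(self, text):
        self.text = text
        self._tokens = None  # computed on first request, then reused

    def get_tokenized_text(self):
        if self._tokens is None:
            # Stand-in tokenizer; the real choice of tokenizer is the open
            # question in issue #135 — the caching pattern is independent of it.
            self._tokens = re.findall(r"[a-z']+", self.text.lower())
        return self._tokens
```

Every analysis then calls get_tokenized_text() freely; only the first call pays the tokenization cost.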

Consider which packages/tools might be of use here (this package, vel sim.).

Improve how we manage texts

Relevant topics: interface design, web scraping, data types.

It's likely that we can improve how we work with .txt files in the package as a whole. We've identified one significant problem, for instance: if a user imports a .txt file with no content into a Corpus or Document instance, many of our analyses fail (notes outlining the problem here). We also implicitly rely on the user constructing meaningful .csv metadata files for many of our analysis modules, so at the very least we could help them out by updating our documentation with tips for creating such a file (notes here). It's probably worth spending some time identifying other, similar problems relating to how we format and use .txt files.

We could also revive the long-running attempt to let users load Project Gutenberg texts (issue #41).

Improve the user-facing output

Relevant topics: interface design, data structures, data visualization.

Most of our analysis modules produce Python dictionaries. These are relatively simple to traverse and to convert into other data formats (for instance, pandas DataFrames), but they may not be the ideal format for our users without that additional transformation. It's also likely that our users will want to produce data visualizations based on these analysis modules, much of which is relatively straightforward with something like pandas and matplotlib (see the example in issue #165). How much can and should we streamline that data visualization work for our users? Which of our analysis modules are particularly suitable for creating visualizations?
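As an illustration of how small the gap is, a nested {document label: {word: count}} result flattens into a tidy DataFrame in a few lines. A sketch assuming that result shape (the actual return shapes vary by module, and the helper name is mine):

```python
import pandas as pd

def frequency_dict_to_frame(results):
    """Flatten {document_label: {word: count}} into a tidy, plottable DataFrame."""
    rows = [
        {'document': label, 'word': word, 'count': count}
        for label, counts in results.items()
        for word, count in counts.items()
    ]
    return pd.DataFrame(rows)

df = frequency_dict_to_frame({'austen': {'she': 3, 'he': 1}})
# From here a grouped bar chart is one line, e.g.:
# df.pivot(index='word', columns='document', values='count').plot.bar()
```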

Additionally, there's some inconsistency throughout the package as to which data format we return and when. In many places, for instance, we return dictionaries that would be better represented as Counter instances. This topic is initially outlined in issue #104.
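Switching those returns from dict to Counter is nearly free and buys callers useful behavior for nothing. A sketch of the shape such a return could take (the function name is illustrative):

```python
from collections import Counter

def word_counts(tokens):
    """Return a Counter rather than a plain dict: callers get .most_common(),
    count arithmetic between documents, and a zero default for missing words."""
    return Counter(tokens)

counts = word_counts(['she', 'ran', 'she'])
counts.most_common(1)   # [('she', 2)]
counts['he']            # 0, instead of a KeyError from a plain dict
```

Since Counter subclasses dict, existing callers that treat the return as a plain dictionary keep working.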

General issues

  • The new frequency module does not currently allow the user to find average pronoun counts across documents (called out in issue #165, "Aggregator Functions for gender_frequency.py"). Should we introduce that?
  • Could we create a .summary on Corpus? What would it print to the user?
  • Should help() on a particular class or method be more useful?
  • Document.word_count could use a remove_swords flag.
  • Document.get_wordcount_counter() could use a remove_swords flag.
  • Could we make Document.words_associated() and Document.get_word_windows() more sophisticated? These methods measure only the occurrence of a word immediately after, or within a window of, the target word in the tokenized text, so we're definitely picking up associations across syntactic breaks. While those words undoubtedly have an association with the target word, they're 'less' associated than, say, words in an object relationship with the target word.
  • For many of the Document methods, could we pull those out to the Corpus class and organize the return by Document label? That way the interface would always be the Corpus.
  • Document.update_metadata() would require us to update any cached returns in the analysis modules.
  • Corpus.count_authors_by_gender() takes a string argument to represent gender. Is that standard throughout the analysis modules?
  • Corpus.get_wordcount_counter() could use a remove_swords flag.
  • Corpus.get_field_vals() could be supplanted by a .summary vel sim.
  • Corpus.get_sample_text_passages() returns a list of tuples of the shape Tuple[str(document filename), str(sentence)]. It might be more useful to return a dictionary with the document filename/label (filename minus extension) as the top-level keys. This data structure is very common throughout the analysis modules.
  • dunning.dunn_individual_word_by_corpus() throws ZeroDivisionError if the target word doesn't exist in a Corpus.
  • Any reason metadata_visualizations shouldn't be methods on the Corpus class?
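On the dunning.dunn_individual_word_by_corpus() ZeroDivisionError: the standard Dunning log-likelihood drops zero-count terms rather than dividing or taking log of them, so an absent word can simply score 0. A guarded sketch — parameter names are mine and totals are assumed positive, so this is not the actual signature:

```python
import math

def dunning_log_likelihood(count_a, count_b, total_a, total_b):
    """Dunning log-likelihood for one word across two corpora, guarding the
    zero-count case. count_*: occurrences of the word; total_*: corpus sizes
    in words (assumed > 0)."""
    if count_a == 0 and count_b == 0:
        return 0.0  # word appears in neither corpus: no evidence either way
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    llr = 0.0
    # Skip zero-count terms: lim x->0 of x*log(x/e) is 0, so dropping them
    # is the mathematically correct behavior, not just an error guard.
    if count_a:
        llr += count_a * math.log(count_a / expected_a)
    if count_b:
        llr += count_b * math.log(count_b / expected_b)
    return 2.0 * llr
```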
