Move to ElasticSearch and drop Solr/MongoDB #33
Open: markwoodhall wants to merge 157 commits into develop from elastic
As per the logic for a works resource, we should extract the long DOI. This commit also includes minor updates to the transform assertion data.
This commit also includes a regeneration of the journal-based assertion data for minor changes.
Since we now have corpus tests to verify scoring across citation matching, we can get away without asserting against score for all other works-related tests. This is helpful because score does vary per test run using the elastic implementation.
As per other indexes, we will use one shard for the work index; this also happens to be the most similar to the existing Solr setup.
Since the elastic version has no transformation of sub types, we should use common versions during the index phase.
It is possible to just call index-journals, so load-test-journals is a little redundant.
Set cr-funder-registry at the first available opportunity. Port fix to enable starting the core only once per process, rather than once per test fixture.
The indexing was failing due to a self reference in the ingest RDF file. Once that was fixed, the funder route for works was broken, e.g. /funders/100006151/works
Test now tests for the funders/####/works route.
query.clj change to assoc-in is unnecessary.
issue-36 funder route
In order to make this work and avoid excess mapping explosion, I have changed the underlying structure of the coverage index so that coverage by type is actually indexed. From this we can calculate an overview of the coverage, coverage counts by type, and coverage type.
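
The commits above replace per-test score assertions with a corpus-level check on citation matching. As a rough illustration of that idea only (this is not the code in `cayenne.corpus-test`, which is Clojure), a threshold check could be sketched in Python as follows. The corpus entries, DOIs, the stand-in matcher, and the 0.6 threshold are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def match_rate(
    corpus: Iterable[Tuple[str, str]],
    matcher: Callable[[str], Optional[str]],
) -> float:
    """Return the fraction of citations whose top match is the expected DOI."""
    total = 0
    correct = 0
    for citation, expected_doi in corpus:
        total += 1
        result = matcher(citation)
        # DOIs are case-insensitive, so compare in lower case.
        if result is not None and result.lower() == expected_doi.lower():
            correct += 1
    return correct / total if total else 0.0


def assert_within_threshold(rate: float, threshold: float) -> None:
    """Fail when the match rate drops below the agreed threshold."""
    if rate < threshold:
        raise AssertionError(
            f"match rate {rate:.2%} is below threshold {threshold:.2%}"
        )


# Hypothetical three-item corpus and a stand-in matcher that misses one citation.
CORPUS = [
    ("Citation text A", "10.5555/a"),
    ("Citation text B", "10.5555/b"),
    ("Citation text C", "10.5555/c"),
]
FAKE_INDEX = {"Citation text A": "10.5555/A", "Citation text B": "10.5555/b"}

rate = match_rate(CORPUS, FAKE_INDEX.get)
assert_within_threshold(rate, threshold=0.6)
```

Asserting on an aggregate rate rather than on individual scores is what lets the other tests ignore score values that vary between runs.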
  
WIP PR
Purpose
This pull request migrates away from Solr and MongoDB to ElasticSearch.
Highlights
- Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types.
- The configuration for `docker-compose` has been adjusted to start ElasticSearch; all references to Solr and MongoDB have been removed.
- A new "corpus test" has been created, see `cayenne.corpus-test`. This test can work against a corpus of varying size and proves that citation matching is working within a known threshold. An almost identical version of this test is included in a PR to the Solr version so a direct comparison can be made. I've attached a scoring comparison of citation matching below. A more complete comparison can be found here.
- Index settings are configured to closely match Solr; in particular, the number of shards used by the work index matches the Solr production deployment.
  - There is scope to change this in the future, but it is worth keeping in mind that scoring is shard local, so the number of shards directly impacts scoring. In theory this should even out over a large enough corpus.
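
To illustrate why shard-local scoring matters, the sketch below computes the inverse document frequency term of BM25 (the formula used by Lucene's BM25 similarity) for one term, first over a whole corpus and then over the same corpus split across two shards. The document counts are made up, and it is an assumption that the deployment uses BM25 with default per-shard term statistics.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical corpus of 1000 documents in which 10 contain the query term.
single_shard = bm25_idf(doc_count=1000, doc_freq=10)

# The same corpus split unevenly across two shards:
# each shard only sees its own term statistics.
shard_a = bm25_idf(doc_count=500, doc_freq=9)
shard_b = bm25_idf(doc_count=500, doc_freq=1)

print(f"one shard: {single_shard:.3f}")
print(f"shard a:   {shard_a:.3f}")
print(f"shard b:   {shard_b:.3f}")
```

The same term is weighted differently depending on which shard a document lands in. With a large corpus and even routing, the per-shard statistics converge, which is the "evening out" described above.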
Index Structures
Much of the underlying structure for the indexes was already in place in the elastic branch. I have only made changes to this structure where they fixed an issue.
`year` to be non numeric here. The reasons for this are explained in the commit message.
Concerns
The changes in this PR are somewhat wider ranging than just swapping in ElasticSearch; as the highlights above show, there has been a general clean-up and removal of "old code". A large portion of functionality is proven by the passing of existing high-level automated tests. However, there may be untested areas which will require testing after deployment.