feat: add protein_qa generation #73
Closed
9 commits
c8ada4c wip: add protein_qa generation (ChenZiHong-Gavin)
5d5012a refactor: refactor build_kg process (ChenZiHong-Gavin)
e783736 wip: add protein qa pipeline (ChenZiHong-Gavin)
27ab285 merge from main (ChenZiHong-Gavin)
2192ee8 fix: fix lint errors (ChenZiHong-Gavin)
96be73a delete search_mo (ChenZiHong-Gavin)
256acc1 feat: add mo_kg_builder (ChenZiHong-Gavin)
51c12ce chore: downgrade numpy in requirements.txt (ChenZiHong-Gavin)
fa6e32a fix: fix dependencies (ChenZiHong-Gavin)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| read: | ||
| input_file: resources/input_examples/protein_qa_demo.json # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples | ||
| anchor_type: protein # get protein information from chunks | ||
| split: | ||
| chunk_size: 1024 # chunk size for text splitting | ||
| chunk_overlap: 100 # chunk overlap for text splitting | ||
| search: # web search configuration | ||
| enabled: false # whether to enable web search | ||
| search_types: ["google"] # search engine types, support: google, bing, uniprot, wikipedia | ||
| quiz_and_judge: # quiz and test whether the LLM masters the knowledge points | ||
| enabled: false | ||
| partition: # graph partition configuration | ||
| method: anchor_bfs # partition method | ||
| method_params: | ||
| anchor_type: protein # node type to select anchor nodes | ||
| max_units_per_community: 10 # atomic partition, one node or edge per community | ||
| generate: | ||
| mode: protein_qa # atomic, aggregated, multi_hop, cot, vqa | ||
| data_format: ChatML # Alpaca, Sharegpt, ChatML |
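
The diff view above drops the YAML indentation, so the nesting of the keys is not visible. A minimal sketch of how the 19 lines presumably nest, written as a Python dict: the grouping is inferred from key order, and `check_config` is a hypothetical helper that is not part of GraphGen. The rule that the two `anchor_type` values should agree is an assumption made for illustration.

```python
# Assumed nesting of the config shown in the diff; the indentation was lost
# in the diff view, so this structure is an inference, not the file itself.
config = {
    "read": {
        "input_file": "resources/input_examples/protein_qa_demo.json",
        "anchor_type": "protein",
    },
    "split": {"chunk_size": 1024, "chunk_overlap": 100},
    "search": {"enabled": False, "search_types": ["google"]},
    "quiz_and_judge": {"enabled": False},
    "partition": {
        "method": "anchor_bfs",
        "method_params": {"anchor_type": "protein", "max_units_per_community": 10},
    },
    "generate": {"mode": "protein_qa", "data_format": "ChatML"},
}


def check_config(cfg: dict) -> list:
    """Hypothetical sanity checks; returns a list of problem descriptions."""
    problems = []
    split = cfg["split"]
    if split["chunk_overlap"] >= split["chunk_size"]:
        problems.append("chunk_overlap must be smaller than chunk_size")
    # Assumption: the anchor type used when reading should match the one
    # used to pick anchor nodes during partitioning.
    read_anchor = cfg["read"].get("anchor_type")
    part_anchor = cfg["partition"]["method_params"].get("anchor_type")
    if read_anchor != part_anchor:
        problems.append("read.anchor_type and partition anchor_type differ")
    return problems


print(check_config(config))
```

With the values from the diff, both checks pass and the list of problems is empty.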
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,3 @@ | ||
| from .light_rag_kg_builder import LightRAGKGBuilder | ||
| from .mm_kg_builder import MMKGBuilder | ||
| from .mo_kg_builder import MOKGBuilder |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,100 @@ | ||
| import re | ||
| from collections import defaultdict | ||
| from typing import Dict, List, Tuple | ||
| | ||
| from graphgen.bases import Chunk | ||
| from graphgen.templates import PROTEIN_KG_EXTRACTION_PROMPT | ||
| from graphgen.utils import ( | ||
| detect_main_language, | ||
| handle_single_entity_extraction, | ||
| handle_single_relationship_extraction, | ||
| logger, | ||
| split_string_by_multi_markers, | ||
| ) | ||
| | ||
| from .light_rag_kg_builder import LightRAGKGBuilder | ||
| | ||
| | ||
| class MOKGBuilder(LightRAGKGBuilder): | ||
| @staticmethod | ||
| async def scan_document_for_schema( | ||
| chunk: Chunk, schema: Dict[str, List[str]] | ||
| ) -> Tuple[Dict[str, List[dict]], Dict[Tuple[str, str], List[dict]]]: | ||
| """ | ||
| Scan the document chunk to extract entities and relationships based on the provided schema. | ||
| :param chunk: The document chunk to be scanned. | ||
| :param schema: A dictionary defining the entities and relationships to be extracted. | ||
| :return: A tuple containing two dictionaries - one for entities and one for relationships. | ||
| """ | ||
| # TODO: use hard-coded PROTEIN_KG_EXTRACTION_PROMPT for protein chunks, | ||
| # support schema for other chunk types later | ||
| print(chunk.id, schema) | ||
| return {}, {} | ||
| | ||
| async def extract( | ||
| self, chunk: Chunk | ||
| ) -> Tuple[Dict[str, List[dict]], Dict[Tuple[str, str], List[dict]]]: | ||
| """ | ||
| Multi-Omics Knowledge Graph Builder | ||
| Step1: Extract and output a JSON object containing protein information from the given chunk. | ||
| Step2: Get more details about the protein by querying external databases if necessary. | ||
| Step3: Construct entities and relationships for the protein knowledge graph. | ||
| Step4: Return the entities and relationships. | ||
| :param chunk | ||

Suggested change: replace `:param chunk` with `:param chunk: Chunk: The input data chunk containing information to extract protein entities and relationships from.`
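
The four steps listed in the `extract` docstring can be sketched as a small pipeline. The body of `extract` is cut off in this view, so everything below is illustrative only: the `Chunk` dataclass is a stand-in for `graphgen.bases.Chunk`, and `extract_protein_json`, `fetch_details` and `build_graph` are hypothetical helpers, with stubs in place of the LLM call and the external database lookup. Only the return shape follows the signature shown in the diff.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple

Nodes = Dict[str, List[dict]]
Edges = Dict[Tuple[str, str], List[dict]]


@dataclass
class Chunk:
    """Stand-in for graphgen.bases.Chunk; only the fields this sketch needs."""
    id: str
    content: str


async def extract_protein_json(chunk: Chunk) -> dict:
    """Step 1 (stub): a real builder would prompt an LLM here."""
    return json.loads(chunk.content)


async def fetch_details(protein: dict) -> dict:
    """Step 2 (stub): placeholder for an external database lookup."""
    return dict(protein)  # no lookup is performed in this sketch


def build_graph(protein: dict, chunk_id: str) -> Tuple[Nodes, Edges]:
    """Step 3: one node for the protein, plus one node and edge per extra field."""
    nodes: Nodes = {}
    edges: Edges = {}
    name = protein["name"]
    nodes[name] = [{"entity_type": "protein", "source_id": chunk_id}]
    for key, value in protein.items():
        if key == "name":
            continue
        target = str(value)
        nodes.setdefault(target, []).append(
            {"entity_type": key, "source_id": chunk_id}
        )
        edges.setdefault((name, target), []).append(
            {"relation": key, "source_id": chunk_id}
        )
    return nodes, edges


async def extract(chunk: Chunk) -> Tuple[Nodes, Edges]:
    protein = await extract_protein_json(chunk)
    protein = await fetch_details(protein)
    return build_graph(protein, chunk.id)  # Step 4: return both dictionaries


chunk = Chunk(id="chunk-1", content='{"name": "P53", "organism": "Homo sapiens"}')
nodes, edges = asyncio.run(extract(chunk))
```

Keying nodes by entity name and edges by a `(source, target)` tuple mirrors the `Dict[str, List[dict]]` and `Dict[Tuple[str, str], List[dict]]` types in the signature, so results from several chunks could be merged by extending the lists.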
The warning message refers to 'text chunks' but this code path handles all chunk types (both text and multi-modal). The message should be updated to 'No entities or relations extracted from chunks' to accurately reflect the unified processing.
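
The code that emits this warning is not shown in this view, so the following is only a sketch of what the reworded message could look like. The function name `warn_if_empty`, the logger name, and the "both empty" condition are assumptions; only the message text comes from the comment above.

```python
import logging

logger = logging.getLogger("graphgen_sketch")  # hypothetical logger name


def warn_if_empty(nodes: dict, edges: dict) -> bool:
    """Return True and log a warning when nothing was extracted."""
    # Assumption: the warning fires only when both results are empty.
    if not nodes and not edges:
        # Wording proposed in the review: "chunks" rather than "text chunks",
        # since the same path handles text and multi-modal chunks.
        logger.warning("No entities or relations extracted from chunks")
        return True
    return False
```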