Metadata-Version: 2.1
Name: capricorn
Version: 0.1.1
Summary: nlp vocabulary builder and embedding loader
Home-page: https://github.com/WenchenLi/capricorn
Author: Wenchen Li
Author-email: wenchen.li.cs@example.com
License: MIT
Description: 
        
        <div align="center">
          <img src="img_src/Capricorn_icon_sml.png"><br><br>
        </div>
        
        -----------------
        
        # capricorn
        
        capricorn is a lightweight library that helps build a vocabulary from a
        corpus and prepare word embeddings for use by learning models.
        
        1. Build a vocabulary from a corpus.
        2. Load the necessary word embeddings, with word indices consistent with the vocabulary.
        
        ## getting started
        ```python
        import capricorn
        import os
        
        # Specify filepaths
        Vocab_path = "vocab_processor"
        embedding_vector_path = "data/embedding/model.vec"
        
        # Load vocab
        if os.path.isfile(Vocab_path):
            print("Loading Vocabulary ...")
            vocab_processor = capricorn.VocabularyProcessor.restore(Vocab_path)
        
        else:  # build vocab
            print("Building Vocabulary ...")
        
            x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
                      "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
                      "Stansted Could Double Passengers on Deregulation, Times Reports."]
        
            # Build the vocabulary
            max_document_length = 11
            min_freq_filter = 2
        
            vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length, min_frequency=min_freq_filter)
            vocab_processor.fit(x_text)  # fit_transform to get the transformed corpus
            vocab_processor.save(Vocab_path)
            print("vocab_processor saved at:", Vocab_path)
        
        # Build the embedding matrix; its row indices are consistent with the vocab word2index mapping
        embedding_matrix = vocab_processor.prepare_embedding_matrix(embedding_vector_path)
        
        ```
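        
        To illustrate what "consistent word index" means, here is a minimal
        sketch in plain Python. It is not capricorn's implementation: the helper
        `build_embedding_matrix`, the sample `word2index` mapping, and the
        whitespace-separated `word v1 v2 ...` line format are all assumptions
        made for this example.
        
        ```python
        # Illustrative sketch only; not part of the capricorn API.
        def build_embedding_matrix(word2index, vec_lines, dim):
            """Return rows where row i holds the vector of the word with index i."""
            # Words missing from the embedding source keep an all-zero row.
            matrix = [[0.0] * dim for _ in range(len(word2index))]
            for line in vec_lines:
                parts = line.split()
                word, values = parts[0], parts[1:]
                # Skip words outside the vocabulary and malformed lines.
                if word in word2index and len(values) == dim:
                    matrix[word2index[word]] = [float(v) for v in values]
            return matrix
        
        
        # Hypothetical vocabulary (indices 0..n-1) and embedding lines.
        word2index = {"__PAD__": 0, "__UNK__": 1, "cloud": 2, "funds": 3}
        vec_lines = ["funds 0.1 0.2", "cloud 0.3 0.4", "unused 0.5 0.6"]
        
        embedding_matrix = build_embedding_matrix(word2index, vec_lines, dim=2)
        print(embedding_matrix[word2index["cloud"]])  # [0.3, 0.4]
        ```
        
        Only words present in the vocabulary receive a row, so the matrix size
        follows the vocabulary rather than the full embedding file.
        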
        ## user input
        
        By default the library uses the special tokens \_\_UNK__ and \_\_PAD__.
        If an input sequence is shorter than the max_document_length set when
        initializing VocabularyProcessor, it is automatically padded with \_\_PAD__.
        
            
        If you define your own special tokens when initializing the vocabulary, you
        need to pre-process each sequence yourself, namely by adding those tokens
        to the input sequence. For example, if you define \_\_START__ and \_\_END__
        as additional special tokens and max_document_length=11, you have to change
        the original sentence from:
        
        "We like it very much" 
        
        to:
        
        "\_\_START__ \_\_PAD__ \_\_PAD__ We like it very much \_\_PAD__ \_\_PAD__ \_\_END__"
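        
        The pre-processing step above could be sketched as follows. The helper
        `wrap_and_pad` is hypothetical and not part of capricorn. It assumes
        whitespace tokenization and splits the padding evenly between both sides
        of the sentence, as in the example above.
        
        ```python
        # Hypothetical helper; not part of the capricorn API.
        def wrap_and_pad(sentence, max_len, start="__START__", end="__END__", pad="__PAD__"):
            """Add start/end tokens and pad the sentence to exactly max_len tokens."""
            tokens = sentence.split()  # assumption: whitespace tokenization
            n_pad = max_len - len(tokens) - 2  # two slots go to start and end
            if n_pad < 0:
                raise ValueError("sentence is too long for max_len")
            left = n_pad // 2
            right = n_pad - left
            return " ".join([start] + [pad] * left + tokens + [pad] * right + [end])
        
        
        print(wrap_and_pad("We like it very much", max_len=11))
        # __START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__
        ```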
        
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
