label_postprocessing package

Submodules

label_postprocessing.ocr_postprocessing module

label_postprocessing.ocr_postprocessing.correct_transcript(transcript: str) → str

Corrects a transcript by removing non-ASCII characters, multiple non-alphanumeric characters, the pipe character, and other special symbols (such as ° and ‘). Trailing periods are also removed.

Args:
    transcript (str): Input transcript.
Returns:
    str: Corrected transcript.
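
The cleanup steps described above could look roughly like the following. This is a minimal sketch, not the package's implementation: the name `correct_transcript_sketch` is hypothetical, and reading "multiple non-alphanumeric characters" as "collapse repeated symbols to one" is an assumption.

```python
import re


def correct_transcript_sketch(transcript: str) -> str:
    """Illustrative cleanup only; the real function may differ."""
    # Drop every non-ASCII character (this also removes symbols such as the degree sign).
    text = transcript.encode("ascii", errors="ignore").decode("ascii")
    # Remove the pipe character.
    text = text.replace("|", "")
    # Assumption: collapse runs of the same non-alphanumeric symbol to a single one.
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    # Normalize whitespace, then strip trailing periods.
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".").rstrip()
```

For example, under these assumptions `correct_transcript_sketch("Berlin|  25°C...")` yields `"Berlin 25C"`.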

label_postprocessing.ocr_postprocessing.count_mean_token_length(tokens: List[str]) → float

Calculates the mean length of tokens in a list.

Args:
    tokens (list): List of tokens.
Returns:
    float: Mean token length.
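
As a rough illustration of the calculation, the sketch below averages token lengths. The function name is hypothetical, and returning 0.0 for an empty list is an assumption; the documented function may behave differently in that case.

```python
from typing import List


def mean_token_length_sketch(tokens: List[str]) -> float:
    # Assumed behavior: an empty list yields 0.0 instead of raising ZeroDivisionError.
    if not tokens:
        return 0.0
    # Sum of the individual token lengths divided by the number of tokens.
    return sum(len(token) for token in tokens) / len(tokens)
```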

label_postprocessing.ocr_postprocessing.is_empty(transcript: str) → bool

Checks if a transcript is empty.

Args:
    transcript (str): Input transcript.
Returns:
    bool: True if the transcript is empty, False otherwise.

label_postprocessing.ocr_postprocessing.is_nuri(transcript: str) → bool

Checks if a transcript starts with “http”, indicating a Nuri.

Args:
    transcript (str): Input transcript.
Returns:
    bool: True if the transcript is a Nuri, False otherwise.

label_postprocessing.ocr_postprocessing.is_plausible_prediction(transcript: str) → bool

Checks if a transcript is a plausible prediction based on the average token length.

Args:
    transcript (str): Input transcript.
Returns:
    bool: True if the transcript is plausible, False otherwise.
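
A plausibility check of this kind might be sketched as follows. The threshold value, the whitespace tokenization, and the function name are all assumptions made for illustration; the documentation above does not state which threshold or tokenizer the package uses.

```python
# Hypothetical threshold; the package's real value is not documented here.
MIN_MEAN_TOKEN_LENGTH = 2.0


def is_plausible_sketch(transcript: str,
                        min_mean_length: float = MIN_MEAN_TOKEN_LENGTH) -> bool:
    # Assumption: tokens are separated by whitespace.
    tokens = transcript.split()
    if not tokens:
        return False
    mean_length = sum(len(token) for token in tokens) / len(tokens)
    # Very short average tokens usually indicate OCR noise such as "a b c d".
    return mean_length >= min_mean_length
```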

label_postprocessing.ocr_postprocessing.process_ocr_output(ocr_output: str) → None

Processes OCR output by categorizing transcripts as Nuri, empty, plausible, or corrected and saving the results.

Args:
    ocr_output (str): OCR output file path.
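
The categorization step might be sketched as below, working on transcripts already loaded into memory. Everything specific here is an assumption: the `"text"` key, the bucket names, the `"rejected"` bucket for leftovers, the order of the checks, and the 2.0 threshold are hypothetical and are not taken from the package.

```python
from typing import Dict, List

Entry = Dict[str, str]


def categorize_sketch(entries: List[Entry]) -> Dict[str, List[Entry]]:
    # Hypothetical bucket names; "rejected" collects whatever fails the plausibility check.
    buckets: Dict[str, List[Entry]] = {
        "nuri": [], "empty": [], "plausible": [], "rejected": [],
    }
    for entry in entries:
        text = entry["text"]  # assumed key name
        tokens = text.split()
        if text.startswith("http"):
            buckets["nuri"].append(entry)
        elif not tokens:
            buckets["empty"].append(entry)
        elif sum(len(t) for t in tokens) / len(tokens) >= 2.0:  # assumed threshold
            buckets["plausible"].append(entry)
        else:
            buckets["rejected"].append(entry)
    return buckets
```

Each bucket could then be written out separately, for instance with the JSON and CSV helpers documented in this module.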

label_postprocessing.ocr_postprocessing.save_json(transcripts: List[Dict], file_name: str) → None

Saves transcripts as a JSON file.

Args:
    transcripts (list): List of transcripts.
    file_name (str): Name of the output JSON file.
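
Writing a list of transcript dictionaries to JSON can be sketched with the standard library as follows. The function name, the UTF-8 encoding, and the indentation setting are assumptions rather than documented behavior.

```python
import json
from typing import Dict, List


def save_json_sketch(transcripts: List[Dict], file_name: str) -> None:
    # ensure_ascii=False keeps any non-ASCII characters readable in the output file.
    with open(file_name, "w", encoding="utf-8") as handle:
        json.dump(transcripts, handle, ensure_ascii=False, indent=4)
```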

label_postprocessing.ocr_postprocessing.save_transcripts(transcripts: Dict, file_name: str) → None

Saves transcripts as a CSV file.

Args:
    transcripts (dict): Dictionary of transcripts.
    file_name (str): Name of the output CSV file.
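
A CSV export along these lines might look like the sketch below. It assumes the dictionary maps an identifier to its transcript text; the column names `ID` and `Text` and the function name are hypothetical.

```python
import csv
from typing import Dict


def save_transcripts_sketch(transcripts: Dict[str, str], file_name: str) -> None:
    # newline="" lets the csv module control line endings, as its documentation recommends.
    with open(file_name, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "Text"])  # hypothetical header
        for identifier, text in transcripts.items():
            writer.writerow([identifier, text])
```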

label_postprocessing.vocabulary module

label_postprocessing.vocabulary.contains_only_letters(token: str) → bool

Checks if a token consists only of letters.

Args:
    token (str): Token from word_tokenize.
Returns:
    bool: True if token contains only letters, False otherwise.
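
One simple way to express this check is Python's built-in `str.isalpha`, shown in the sketch below. Whether the package uses this method or a regular expression is not stated, so treat the function as an assumption.

```python
def contains_only_letters_sketch(token: str) -> bool:
    # str.isalpha() is True only for non-empty strings made up entirely of letters,
    # including non-ASCII letters such as "ü".
    return token.isalpha()
```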

label_postprocessing.vocabulary.extract_vocabulary(ocr_output: str) → None

Extracts unique words from the transcripts that consist only of letters and are at least 3 characters long. Saves the extracted vocabulary to a CSV file.

Args:
    ocr_output (str): Path to the OCR output file.
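
The filtering rule described above (letters only, at least 3 characters, unique) can be sketched on in-memory transcripts as follows. The function name, the whitespace tokenization, and the sorted return value are assumptions; the documented function reads from a file and writes a CSV instead.

```python
from typing import Iterable, List


def extract_vocabulary_sketch(transcripts: Iterable[str]) -> List[str]:
    vocabulary = set()
    for transcript in transcripts:
        # Assumption: whitespace tokenization stands in for the package's tokenizer.
        for token in transcript.split():
            if token.isalpha() and len(token) >= 3:
                vocabulary.add(token)
    # Sorting makes the output deterministic.
    return sorted(vocabulary)
```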

label_postprocessing.vocabulary.is_punctuation(token: str) → bool

Checks if a token is a punctuation mark.

Args:
    token (str): The token to check for punctuation.
Returns:
    bool: True if the token is a punctuation mark, False otherwise.
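
A punctuation check could be sketched with the standard library's `string.punctuation`. Treating multi-character tokens such as `...` as punctuation is an assumption, as is the function name; only ASCII punctuation is covered.

```python
import string


def is_punctuation_sketch(token: str) -> bool:
    # Require a non-empty token whose characters are all ASCII punctuation.
    return bool(token) and all(char in string.punctuation for char in token)
```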