label_postprocessing package
Submodules
label_postprocessing.ocr_postprocessing module
- label_postprocessing.ocr_postprocessing.correct_transcript(transcript: str) → str
Corrects a transcript by removing non-ASCII characters, sequences of multiple non-alphanumeric characters, the pipe character, and other special symbols (such as ° and ‘). Trailing periods are also removed.
- Args:
transcript (str): Input transcript.
- Returns:
str: Corrected transcript.
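The cleanup steps listed for correct_transcript can be illustrated with a small stand-alone sketch. This is not the package's implementation: the name `correct_transcript_sketch`, the exact regular expression, the order of the steps, and the final whitespace collapse are all assumptions made for illustration.

```python
import re


def correct_transcript_sketch(transcript: str) -> str:
    """Hypothetical re-creation of the documented cleanup steps."""
    # Drop every non-ASCII character (this also removes symbols such as ° and ‘).
    text = transcript.encode("ascii", errors="ignore").decode("ascii")
    # Remove the pipe character.
    text = text.replace("|", "")
    # Remove runs of two or more non-alphanumeric, non-whitespace characters.
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", "", text)
    # Assumption: collapse the whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove any trailing periods.
    return text.rstrip(".").rstrip()


print(correct_transcript_sketch("Leg. Smith | 12°30‘ N --- 1901."))
# prints: Leg. Smith 1230 N 1901
```

Note that dropping non-ASCII characters is lossy: the degree sign in a coordinate disappears along with the noise, so downstream code should not expect to recover it.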
- label_postprocessing.ocr_postprocessing.count_mean_token_length(tokens: List[str]) → float
Calculates the mean length of tokens in a list.
- Args:
tokens (list): List of tokens.
- Returns:
float: Mean token length.
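A minimal sketch of the mean token length calculation, assuming the plain arithmetic mean of the token lengths. The name `mean_token_length_sketch` is hypothetical, and the documentation does not say how an empty list is handled; returning 0.0 is an assumption.

```python
from typing import List


def mean_token_length_sketch(tokens: List[str]) -> float:
    """Hypothetical stand-in: arithmetic mean of the token lengths."""
    # Assumption: an empty list yields 0.0 instead of raising ZeroDivisionError.
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


# (4 + 6 + 1) / 3 characters per token
print(mean_token_length_sketch(["Rosa", "canina", "L"]))
```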
- label_postprocessing.ocr_postprocessing.is_empty(transcript: str) → bool
Checks if a transcript is empty.
- Args:
transcript (str): Input transcript.
- Returns:
bool: True if the transcript is empty, False otherwise.
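The documentation does not define "empty" precisely. The sketch below, with the hypothetical name `is_empty_sketch`, assumes that whitespace-only transcripts also count as empty.

```python
def is_empty_sketch(transcript: str) -> bool:
    """Hypothetical stand-in for the emptiness check."""
    # Assumption: strings containing only whitespace are treated as empty.
    return not transcript.strip()


print(is_empty_sketch("   \n"))
print(is_empty_sketch("Rosa canina"))
```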
- label_postprocessing.ocr_postprocessing.is_nuri(transcript: str) → bool
Checks whether a transcript starts with "http", which marks it as a Nuri.
- Args:
transcript (str): Input transcript.
- Returns:
bool: True if the transcript is a Nuri, False otherwise.
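The prefix test can be sketched as follows. The name `is_nuri_sketch` and the example URL are hypothetical; ignoring leading whitespace and matching case-sensitively are assumptions, since the documentation only states that the transcript starts with "http".

```python
def is_nuri_sketch(transcript: str) -> bool:
    """Hypothetical stand-in: prefix test for 'http'."""
    # Assumption: leading whitespace is ignored; the match is case-sensitive.
    return transcript.lstrip().startswith("http")


print(is_nuri_sketch("https://example.org/specimen/123"))
print(is_nuri_sketch("Rosa canina L."))
```

A plain prefix test also accepts strings such as "httpfoo"; a stricter check would parse the transcript as a URL.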
- label_postprocessing.ocr_postprocessing.is_plausible_prediction(transcript: str) → bool
Checks if a transcript is a plausible prediction based on the average token length.
- Args:
transcript (str): Input transcript.
- Returns:
bool: True if the transcript is plausible, False otherwise.
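The documentation gives neither the tokenizer nor the threshold behind the plausibility check. The sketch below is therefore only an illustration of the idea: `is_plausible_sketch`, whitespace tokenization, and the `min_mean_length` default of 2.0 are all hypothetical choices.

```python
def is_plausible_sketch(transcript: str, min_mean_length: float = 2.0) -> bool:
    """Hypothetical stand-in: accept transcripts whose tokens are long enough on average."""
    # Assumption: tokens are whitespace-separated.
    tokens = transcript.split()
    if not tokens:
        return False
    mean_length = sum(len(token) for token in tokens) / len(tokens)
    # Assumption: OCR noise tends to produce many one-character tokens.
    return mean_length >= min_mean_length


print(is_plausible_sketch("Rosa canina L"))  # mean length 11/3
print(is_plausible_sketch("a b c 1 ."))      # mean length 1.0
```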
- label_postprocessing.ocr_postprocessing.process_ocr_output(ocr_output: str) → None
Processes OCR output, sorting the transcripts into the categories Nuri, empty, plausible, and corrected, and saving them.
- Args:
ocr_output (str): OCR output file path.
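Because the format of the OCR output file and of the saved results is not documented, the following sketch works on an in-memory list instead of a file path and returns the categories rather than saving them. The name `categorize_sketch`, the order of the checks, the 2.0 threshold, the simplified correction step, and the decision to drop implausible transcripts are assumptions.

```python
from typing import Dict, List


def categorize_sketch(
    transcripts: List[str], min_mean_length: float = 2.0
) -> Dict[str, List[str]]:
    """Hypothetical routing of transcripts into the four documented categories."""
    buckets: Dict[str, List[str]] = {
        "nuri": [], "empty": [], "plausible": [], "corrected": [],
    }
    for transcript in transcripts:
        text = transcript.strip()
        if not text:
            buckets["empty"].append(transcript)
        elif text.startswith("http"):
            buckets["nuri"].append(text)
        else:
            tokens = text.split()
            mean_length = sum(len(token) for token in tokens) / len(tokens)
            if mean_length >= min_mean_length:
                buckets["plausible"].append(text)
                # Stand-in for correct_transcript: only strips trailing periods.
                buckets["corrected"].append(text.rstrip("."))
            # Assumption: implausible transcripts are silently dropped.
    return buckets


result = categorize_sketch(["", "https://example.org/1", "Rosa canina L.", "a b ."])
print(result)
```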
label_postprocessing.vocabulary module
- label_postprocessing.vocabulary.contains_only_letters(token: str) → bool
Checks if a token consists only of letters.
- Args:
token (str): Token from word_tokenize.
- Returns:
bool: True if token contains only letters, False otherwise.
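A letters-only check can be sketched with Python's built-in `str.isalpha()`. Whether the package uses this method or a regular expression is not stated; `contains_only_letters_sketch` is a hypothetical name, and treating accented characters as letters is an assumption that follows from `isalpha()` being Unicode-aware.

```python
def contains_only_letters_sketch(token: str) -> bool:
    """Hypothetical stand-in built on str.isalpha()."""
    # isalpha() is True only for non-empty strings whose characters are all letters.
    return token.isalpha()


for token in ["canina", "L.", "1901", "Müller"]:
    print(token, contains_only_letters_sketch(token))
```

Tokenizers usually emit punctuation as separate tokens, so a filter like this is a simple way to keep only word-like tokens when building a vocabulary.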