Rewrite the generation of csv files (splits) into a parser format #38

Open
martinwholtmon opened this issue Mar 16, 2024 · 0 comments
Labels
enhancement New feature or request maybe wontfix This will not be worked on

Owner

martinwholtmon commented Mar 16, 2024

Instead of having a separate script for each dataset, write a parser for each dataset, register it, and use it to process the dataset.

Create a generic class that represents a parser:

import os
from abc import ABC, abstractmethod
from typing import Tuple

import pandas as pd

class Parser(ABC):
    def __init__(self, data_path: str, fold: int, val_split: float):
        self.data_path = data_path
        self.fold = fold  # fold index, available to subclasses in process()
        self.val_split = val_split
        self.classes: dict = self._get_classes()

    @abstractmethod
    def _get_classes(self):
        """Get the class idx and class names.
        
        Returns:
            dict[str, int]: class name, class id
        """

    @abstractmethod
    def process(self):
        """Process the dataset, generating the train,val,test splits"""

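    def split_train_val(
        self, train_data: pd.DataFrame
    ) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Shuffle the rows and split them into train and validation sets.

        The validation set gets int(len(train_data) * val_split) rows.
        """
        total_len = len(train_data)
        val_len = int(total_len * self.val_split)
        train_len = total_len - val_len

        # sample(frac=1) returns all rows in random order
        shuffled = train_data.sample(frac=1).reset_index(drop=True)
        return shuffled.iloc[:train_len], shuffled.iloc[train_len:]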

    def save_csv(self, data: pd.DataFrame, file_name: str):
        data.to_csv(
            os.path.join(self.data_path, file_name),
            sep=" ",
            index=False,
            header=False,
        )

Subclass this base class to create the parser for each dataset. Then create a main that instantiates the correct parser based on the arguments.
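One possible way to wire this up is sketched below. It is only an illustration, not a settled design: the `PARSERS` registry, the `register_parser` decorator, the `ExampleParser` class, and the command-line flags are all hypothetical names introduced here, and the `Parser` class is a trimmed stand-in for the base class above so the snippet runs without pandas.

```python
import argparse
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional, Type


class Parser(ABC):
    """Trimmed stand-in for the Parser base class (no pandas needed here)."""

    def __init__(self, data_path: str, fold: int, val_split: float):
        self.data_path = data_path
        self.fold = fold
        self.val_split = val_split

    @abstractmethod
    def process(self) -> None:
        """Generate the train/val/test splits."""


# Hypothetical registry: maps a dataset name to its parser class.
PARSERS: Dict[str, Type[Parser]] = {}


def register_parser(name: str) -> Callable[[Type[Parser]], Type[Parser]]:
    """Class decorator that records a parser class under `name`."""

    def decorator(cls: Type[Parser]) -> Type[Parser]:
        PARSERS[name] = cls
        return cls

    return decorator


@register_parser("example")
class ExampleParser(Parser):
    """Placeholder dataset parser; a real one would write the CSV splits."""

    def process(self) -> None:
        print(
            f"processing {self.data_path} "
            f"(fold={self.fold}, val_split={self.val_split})"
        )


def main(argv: Optional[List[str]] = None) -> Parser:
    """Pick the registered parser named on the command line and run it."""
    arg_parser = argparse.ArgumentParser(description="Generate dataset splits")
    arg_parser.add_argument("dataset", choices=sorted(PARSERS))
    arg_parser.add_argument("--data-path", required=True)
    arg_parser.add_argument("--fold", type=int, default=1)
    arg_parser.add_argument("--val-split", type=float, default=0.1)
    args = arg_parser.parse_args(argv)

    parser = PARSERS[args.dataset](args.data_path, args.fold, args.val_split)
    parser.process()
    return parser
```

Calling `main(["example", "--data-path", "data/example", "--fold", "2"])` would look up `ExampleParser` in the registry, construct it, and run `process()`. With a decorator-based registry, adding a dataset means adding one decorated class, and `main` itself does not need to change.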

@martinwholtmon martinwholtmon added enhancement New feature or request maybe labels Mar 16, 2024
@martinwholtmon martinwholtmon added the wontfix This will not be worked on label Jun 13, 2024