tianyiwangnova edited this page Feb 4, 2021 · 2 revisions

ReviewMiner

class ReviewMiner(df: pd.DataFrame = None, id_column: str = None, review_column: str = None)


Parameters:

df: pd.DataFrame, default=None

A data frame where each row is a comment/review. The data frame should have at least an ID column that stores the unique IDs of the comments and a review column that stores the actual comment/review text. You can initialize the class without df if you only want to use some of its methods to analyze external datasets, and assign df later with <class>.df = <your_data_frame>.

id_column: str, default=None

the name of the column that stores the unique IDs of the comments.

review_column: str, default=None

the name of the column where the actual comments/reviews are stored.
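
As a rough illustration of the input these parameters describe, the sketch below builds a small data frame with an ID column and a review column. The column names "id" and "comments" and the sample rows are made up for the example, and the constructor call is left as a comment because the import path is not covered on this page.

```python
import pandas as pd

# Hypothetical sample data; the column names "id" and "comments" are illustrative.
reviews_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "comments": [
            "The bedroom was sunny and spacious.",
            None,  # a missing review
            "Friendly staff, but the wardrobe was tiny.",
        ],
    }
)

# Rows with a missing review carry no text to analyze, so drop them up front.
clean_df = reviews_df.dropna(subset=["comments"]).reset_index(drop=True)

# With the library imported, the miner would then be created as documented:
# miner = ReviewMiner(df=clean_df, id_column="id", review_column="comments")
```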


Attributes:

all_negative_sentence: list

All negative sentences in a list; Generated from return_all_negative_sentences.

aspect_mute_list: list, default = ['i']

A list of aspects to be excluded from the analysis. By default it is ['i'] ('I' is a common but unwanted aspect in review data). Users can define their own list with .aspect_mute_list = <your list>, and that list will be appended to ['i']. When aspect_mute_list changes, the visualizations change the next time the related methods are called, but the intermediate output tables (e.g. aspects_opinions_df) do not change.

aspects_opinions_df: pd.DataFrame

A pandas dataframe with 2 columns: aspects and opinions. The aspects column holds all the aspects found in the review data. The opinions column holds the opinion words collected for each aspect. The dataframe is sorted by the number of opinion words in descending order (words can repeat, since every occurrence is collected). Generated by aspect_opinon_for_all_comments.

df_with_aspects_opinions: pd.DataFrame

A pandas dataframe with comment id, reviews and the string version of the aspect_opinion_dict that has the aspects and opinions of the comments. Generated by aspect_opinon_for_all_comments.

negative_comments_by_aspects_dict

A dictionary where keys are the aspects and values are the sentences associated with the aspects. Generated by negative_comments_by_aspects.

top_aspects

The most popular aspects in the reviews. Generated by popular_aspects_view.
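
The muting behaviour described for aspect_mute_list can be sketched in plain Python. This is only an illustration of the idea, not the library's code: the helper names build_mute_list and visible_aspects are hypothetical, and lower-casing the aspects before comparing is an assumption.

```python
DEFAULT_MUTE = ["i"]


def build_mute_list(user_list):
    """Append the user's aspects to the default ['i'] list, skipping duplicates."""
    merged = list(DEFAULT_MUTE)
    for aspect in user_list:
        key = aspect.lower()  # assumption: aspects are compared in lower case
        if key not in merged:
            merged.append(key)
    return merged


def visible_aspects(aspects, mute_list):
    """Return the aspects that should still appear in a visualization."""
    return [a for a in aspects if a.lower() not in mute_list]


mute = build_mute_list(["we", "place"])
all_aspects = ["staff", "i", "place", "bedroom", "we"]
shown = visible_aspects(all_aspects, mute)
```

Filtering only at display time leaves the underlying list untouched, which mirrors the note that the intermediate tables do not change.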


Methods:

one_time_analysis (report_interval: int = None)

One time analysis to display popular aspects and opinions, distribution of sentiment scores of each comment, sentiment scores for common aspects, and aspects with the most negative comments.

  • Parameters:

report_interval: int, default=None

Extracting the aspects and opinions can take a long time if the dataset is very large. During extraction, the function reports progress every report_interval comments. If there are more than 500 comments and no report interval is specified, the function reports progress after every 10% of the comments. If there are no more than 500 comments and no report interval is specified, the function reports only once, when it has finished all the comments.
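
The reporting rule above can be written out as a small decision function. This is a sketch of the documented rule, not the library's implementation; resolve_report_interval and progress_points are hypothetical names, and rounding 10% down with integer division is an assumption.

```python
def resolve_report_interval(n_comments, report_interval=None):
    """Return how many comments pass between progress reports, or None."""
    if report_interval is not None:
        return report_interval  # an explicit interval always wins
    if n_comments > 500:
        return max(1, n_comments // 10)  # every 10% of the comments
    return None  # small dataset: report only once, at the end


def progress_points(n_comments, report_interval=None):
    """List the comment counts at which a progress message would be printed."""
    step = resolve_report_interval(n_comments, report_interval)
    if step is None:
        return [n_comments]
    points = list(range(step, n_comments + 1, step))
    if not points or points[-1] != n_comments:
        points.append(n_comments)  # always report completion
    return points
```

For example, progress_points(250, 100) lists 100, 200 and the final count of 250.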


aspect_extractor(sentence: str)

Extract aspects (noun phrases and nouns) from a sentence

  • Parameters:

sentence: str

The sentence to analyze.

  • Returns:

candidate_aspects: list

a list of aspects in the sentence


aspect_opinion_for_one_comment(comment: str)

Extract aspects and opinions for one comment (which can consist of many sentences)

  • Parameters:

comment: str

The comment to analyze.

  • Returns:

aspect_opinion_dict: dict

A dictionary with the aspects as keys and, as values, the opinion words for each aspect joined into a single space-separated string, e.g. {'bedroom': 'sunny spacious', 'wardrobe': 'beautiful'}


aspect_opinon_for_all_comments(report_interval: int = None)

Extract aspects and opinions for all the comments in a pandas dataframe. The function will drop the rows with nan comments.

  • Parameters:

report_interval: int, default=None

Extracting the aspects and opinions can take a long time if the dataset is very large. During extraction, the function reports progress every report_interval comments. If there are more than 500 comments and no report interval is specified, the function reports progress after every 10% of the comments. If there are no more than 500 comments and no report interval is specified, the function reports only once, when it has finished all the comments.

  • Returns:

df_with_aspects_opinions: pd.DataFrame

A pandas dataframe with comment id, reviews and the string version of the aspect_opinion_dict that has the aspects and opinions of the comments.

aspects_opinions_df: pd.DataFrame

A pandas dataframe with 2 columns: aspects and opinions. The aspects column holds all the aspects found in the review data. The opinions column holds the opinion words collected for each aspect. The dataframe is sorted by the number of opinion words in descending order (words can repeat, since every occurrence is collected).
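
To make the relationship between the per-comment dictionaries and the aspects/opinions table concrete, here is a minimal sketch that merges several aspect_opinion_dict values and sorts the aspects by how many opinion words they collected. It uses plain Python lists rather than a dataframe, and collect_opinions is a hypothetical helper, not a method of the class.

```python
from collections import defaultdict


def collect_opinions(aspect_opinion_dicts):
    """Merge per-comment dicts into (aspect, opinion_words) rows, most words first."""
    collected = defaultdict(list)
    for per_comment in aspect_opinion_dicts:
        for aspect, opinions in per_comment.items():
            # Opinions arrive as one space-separated string; keep every occurrence.
            collected[aspect].extend(opinions.split())
    return sorted(collected.items(), key=lambda row: len(row[1]), reverse=True)


per_comment_dicts = [
    {"bedroom": "sunny spacious", "wardrobe": "beautiful"},
    {"bedroom": "sunny", "staff": "friendly helpful"},
]
rows = collect_opinions(per_comment_dicts)
```

Here "bedroom" comes first with three opinion words, including the repeated "sunny".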


most_popular_opinions(aspect: str, num_top_words: int = 10)

Collect the most popular opinion words for an aspect.

  • Parameters:

aspect: str

The aspect for analyzing.

num_top_words: int, default=10

Numbers of most common opinions to collect.

  • Returns:

aspect_plot: pd.DataFrame

A pandas dataframe showing the most popular opinion words and the proportion of people who used each of them to describe the aspect.
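
One way to read "proportion of people" is the share of comments mentioning the aspect that used a given opinion word. The sketch below computes that share under two assumptions that are not confirmed by this page: each word is counted at most once per comment, and the input is a list of opinion strings, one per comment. top_opinion_shares is a hypothetical name.

```python
from collections import Counter


def top_opinion_shares(opinion_strings, num_top_words=10):
    """Return (word, share_of_comments) pairs for the most common opinion words."""
    n_commenters = len(opinion_strings)
    if n_commenters == 0:
        return []
    counts = Counter()
    for opinions in opinion_strings:
        counts.update(set(opinions.split()))  # count a word once per comment
    return [(word, n / n_commenters) for word, n in counts.most_common(num_top_words)]


# Opinion words four different comments used for the aspect "staff".
staff_opinions = ["friendly helpful", "friendly", "rude", "friendly helpful"]
shares = top_opinion_shares(staff_opinions, num_top_words=2)
```

With this input, "friendly" appears in three of four comments and "helpful" in two of four.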


single_aspect_view(aspect: str, num_top_words: int = 10, change_figsize: bool = True, xticks_rotation: int = 45)

Plot the popular opinions around an aspect. For example, if we are interested in what people say about "staff", we pick the top n words people used to describe staff and calculate, among those who expressed an opinion about "staff", the percentage who used each word. The output is a bar chart that shows the most popular opinion words associated with the aspect and their proportions.

  • Parameters:

aspect: str

The aspect for analyzing.

num_top_words: int, default = 10

Numbers of most common opinions to display.

change_figsize: bool, default = True

Change the figsize or not. If True, the figsize will be num_top_words * 5.

xticks_rotation: int, default = 45

The rotation degree for xticks in the chart.


popular_aspects_view()

Quick plot: single_aspect_view for the top 9 aspects.


sentiment_for_one_comment(comment: str)

Calculate the sentiment score for one comment: the mean of (polarity * subjectivity) over the sentences that have a non-zero polarity.

  • Parameters:

comment: str

The comment (which can consist of multiple sentences).

  • Returns:

result: float

The sentiment score of the comment.
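
The scoring formula can be illustrated without a sentiment library by starting from per-sentence (polarity, subjectivity) pairs. This is a sketch of the stated formula only; comment_sentiment is a hypothetical name, and returning 0.0 when no sentence has a non-zero polarity is an assumption.

```python
def comment_sentiment(sentence_scores):
    """Mean of polarity * subjectivity over sentences with non-zero polarity.

    sentence_scores: iterable of (polarity, subjectivity) pairs, one per sentence.
    """
    products = [pol * subj for pol, subj in sentence_scores if pol != 0]
    if not products:
        return 0.0  # assumption: a comment with no polar sentences scores 0
    return sum(products) / len(products)


# Three sentences: positive, neutral (ignored), and mildly negative.
scores = [(0.5, 1.0), (0.0, 0.25), (-0.25, 1.0)]
result = comment_sentiment(scores)  # (0.5 + -0.25) / 2
```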


overall_sentiment()

Plot the histogram of the sentiment scores for all the comments.


sentiment_for_one_aspect(aspect: str)

Return the average sentiment score for an aspect, i.e. the average of the sentiment scores of the opinion words collected for that aspect.

  • Parameters:

aspect: str

the aspect for analyzing.

  • Returns:

aspect_sent_score: float

the average of the sentiment scores of the opinion words.
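
A minimal sketch of averaging opinion-word scores for an aspect follows. The word scores come from a made-up lexicon rather than a real sentiment model, aspect_sentiment is a hypothetical name, and scoring unknown words as 0.0 is an assumption.

```python
# Hypothetical word-level sentiment scores, for illustration only.
TOY_LEXICON = {"friendly": 0.5, "helpful": 0.25, "rude": -0.75}


def aspect_sentiment(opinion_words, lexicon):
    """Average the sentiment scores of all opinion words collected for an aspect."""
    if not opinion_words:
        return 0.0  # assumption: no opinions means a neutral score
    scores = [lexicon.get(word, 0.0) for word in opinion_words]
    return sum(scores) / len(scores)


staff_words = ["friendly", "helpful", "rude", "friendly"]
staff_score = aspect_sentiment(staff_words, TOY_LEXICON)  # 0.5 / 4
```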


aspects_radar_plot(aspects: list)

Plot the sentiment score radar chart for designated aspects.

  • Parameters:

aspects: list

A list of aspects to visualize.


return_all_negative_sentences()

Return all negative sentences in the reviews data.

  • Returns:

negatives: list

All negative sentences in a list.


negative_comments_by_aspects()

  • Returns:

negative_comments_by_aspects_dict: dict

A dictionary where keys are the aspects and values are the sentences associated with the aspects.


negative_comments_view()

Barplot on the numbers of negative sentences of each aspect


return_negative_comments_of_aspect(aspect: str)

Return all the negative comments related to an aspect in a list

  • Parameters:

aspect: str

The aspect for analyzing.

  • Returns:

A list of all the related negative comments. Returns an empty list if there are no negative comments.
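
The three negative-comment methods revolve around one structure: a dictionary from aspect to negative sentences. The sketch below shows that shape, the per-aspect counts a bar plot would use, and the empty-list fallback. Matching a sentence to an aspect by simple word lookup is an assumption made for the example, not necessarily how the library associates them, and group_by_aspect is a hypothetical name.

```python
import re
from collections import defaultdict


def group_by_aspect(negative_sentences, aspects):
    """Map each aspect to the negative sentences that mention it."""
    grouped = defaultdict(list)
    for sentence in negative_sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        for aspect in aspects:
            if aspect in tokens:  # assumption: plain word match
                grouped[aspect].append(sentence)
    return dict(grouped)


negatives = [
    "The staff was rude.",
    "Terrible bathroom and rude staff.",
    "The bathroom was dirty.",
]
by_aspect = group_by_aspect(negatives, ["staff", "bathroom", "pool"])

# Counts per aspect, the numbers a bar plot of negative sentences would show.
counts = {aspect: len(sentences) for aspect, sentences in by_aspect.items()}

# Looking up an aspect with no negative sentences falls back to an empty list.
pool_negatives = by_aspect.get("pool", [])
```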