Update to new dataset version
sbaltes committed Nov 24, 2020
1 parent ea62c85 commit cd9b3f6
Showing 22 changed files with 947 additions and 254 deletions.
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -5,13 +5,19 @@ All notable changes to the SOTorrent dataset project will be documented in this

## [Upcoming]

* Update import script to use new SQL dumps
* Extract language information from Stack Snippets and link individual snippets to their predecessors
* Update database schema on website
* Add historical user reputation
* Automate import of tables `PostTags` and `PostViews`
* Properly wait for MySQL import to be finished instead of using `sleep`
* Revise table `PostBlockDiff`

## [2020-11-16] - First release based on SO data dump 2020-09-08

* see 2020-08-31

## [2020-08-31] - First release based on SO data dump 2020-06-02

* Update escaping of newline characters (related to [this](https://github.com/sotorrent/db-scripts/issues/19) issue)
* Now using MySQL dumps, newline characters are not escaped anymore in the BigQuery version of the dataset
* This also fixes a bug in the export script (for tables `PostVersionUrl` and `CommentUrl`, column `LinkAnchor` was identical to column `FullMatch`)
35 changes: 0 additions & 35 deletions analysis/merge_log_csvs.sh

This file was deleted.

2 changes: 1 addition & 1 deletion sotorrent/LICENSE.md
@@ -11,7 +11,7 @@ The following tables are identical to the corresponding XML files in the [offici
Legal code can be found below ([source](https://github.com/creativecommons/legalcode/blob/master/by-sa_3.0.txt)).


## Tables PostReferenceGH and GHMatches
## Tables PostReferenceGH, GHMatches, and GHCommits

The tables `PostReferenceGH`, `GHMatches`, and `GHCommits` were retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github), for which the [GitHub Terms of Service](https://help.github.com/articles/github-terms-of-service/) apply.

4 changes: 2 additions & 2 deletions sotorrent/README.md
@@ -20,9 +20,9 @@

## Data

The Stack Overflow data has been extracted from the official [Stack Exchange data dump](https://archive.org/details/stackexchange) released 2020-06-02.
The Stack Overflow data has been extracted from the official [Stack Exchange data dump](https://archive.org/details/stackexchange) released 2020-09-08.

The GitHub references have been retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github) on 2020-11-02 (last updated 2020-10-29 according to table info).
The GitHub references have been retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github) on 2020-11-22 (last updated 2020-11-19 according to table info).

## MySQL Troubleshooting

6 changes: 3 additions & 3 deletions sotorrent/export/export.sh
@@ -2,11 +2,11 @@

sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_06"
sotorrent_db="sotorrent20_09"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
data_path="E:/Temp/" # Cygwin
#data_path="/tmp/" # Linux
data_path="E:/Temp" # Cygwin
#data_path="/tmp" # Linux

rm -f $log_file

24 changes: 12 additions & 12 deletions sotorrent/gh-references/retrieve-gh-references.sh
@@ -1,26 +1,26 @@
#!/bin/bash

project="sotorrent-org"
dataset="gh_so_references_2020_11_02"
sotorrent="2020_08_31"
dataset="gh_so_references_2020_11_22"
sotorrent="2020_11_16"
bucket="sotorrent"
logfile="bigquery.log"

# "Table Info" of table "bigquery-public-data:github_repos.commits"
# Last modified: Nov 19, 2020, 10:41:33 AM
# Number of Rows: 245,405,539
# Table Size: 795.15 GB
#
# Unique Git commits from open source repositories on GitHub, pre-grouped by repositories they appear in.

# "Table Info" of table "bigquery-public-data:github_repos.contents"
# Last Modified: Oct 29, 2020, 6:23:26 PM
# Number of Rows: 268,707,525
# Table Size: 2.3 TB
# Last Modified: Nov 19, 2020, 8:07:20 PM
# Number of Rows: 269,172,700
# Table Size: 2.31 TB
#
# Unique file contents of text files under 1 MiB on the HEAD branch.
# Can be joined to [bigquery-public-data:github_repos.files] table using the id columns to identify the repository and file path.

# "Table Info" of table "bigquery-public-data:github_repos.commits"
# Last modified: Oct 29, 2020, 10:03:37 AM
# Number of Rows: 244,721,848
# Table Size: 793.6 GB
#
# Unique Git commits from open source repositories on GitHub, pre-grouped by repositories they appear in.

# select all source code lines of text files that contain a link to Stack Overflow
bq --headless query --max_rows=0 --destination_table "$project:$dataset.matched_lines" "$(< sql/matched_lines.sql)" >> "$logfile" 2>&1

2 changes: 1 addition & 1 deletion sotorrent/load_sotorrent.sh
@@ -3,7 +3,7 @@
root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_06"
sotorrent_db="sotorrent20_09"
db_init=false
load_so=false
load_gh=false
93 changes: 93 additions & 0 deletions sotorrent/load_sotorrent_csv.sh
@@ -0,0 +1,93 @@
#!/bin/sh

root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_09"
db_init=false
load_so=false
load_gh=false
load_sotorrent=false

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
data_path="E:\/Temp\/" # Cygwin
#data_path="\/tmp\/" # Linux

rm -f $log_file

echo "Available command-line arguments: 'so-dump', 'gh-references', 'complete'."
echo "If called with second parameter 'db-init', a new database is initialized."

if [ "$1" = "so-dump" ]; then
echo "Will only load SO tables." | tee -a "$log_file"
load_so=true
load_gh=false
load_sotorrent=false
elif [ "$1" = "gh-references" ]; then
echo "Will only load GH tables." | tee -a "$log_file"
load_so=false
load_gh=true
load_sotorrent=false
elif [ "$1" = "complete" ]; then
echo "Will load all tables." | tee -a "$log_file"
load_so=true
load_gh=true
load_sotorrent=true
fi

if [ "$2" = "db-init" ] ; then
db_init=true
echo "Creating database..." | tee -a "$log_file"
mysql -u root --password="$root_password" -e "DROP DATABASE IF EXISTS $sotorrent_db;
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE DATABASE $sotorrent_db DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;"

echo "Creating Stack Overflow tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/create_so_tables.sql >> $log_file 2>&1

echo "Adding database user and granting privileges..." | tee -a "$log_file"
mysql -u root --password="$root_password" -e "CREATE USER IF NOT EXISTS 'sotorrent'@'localhost' IDENTIFIED BY '$sotorrent_password';
CREATE USER IF NOT EXISTS 'sotorrent'@'%' IDENTIFIED BY '$sotorrent_password';
GRANT ALL PRIVILEGES ON $sotorrent_db.* TO 'sotorrent'@'localhost';
GRANT ALL PRIVILEGES ON $sotorrent_db.* TO 'sotorrent'@'%';
GRANT FILE ON *.* TO 'sotorrent'@'localhost';
GRANT FILE ON *.* TO 'sotorrent'@'%'; FLUSH PRIVILEGES;"
fi

if [ "$load_so" = true ] ; then
echo "Loading Stack Overflow tables..." | tee -a "$log_file"
sed -e"s/<PATH>/$data_path/g" ./sql/load_so_from_xml.sql > ./sql/load_so_from_xml_absolute_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/load_so_from_xml_absolute_paths.sql >> $log_file 2>&1
rm ./sql/load_so_from_xml_absolute_paths.sql

echo "Creating indices for Stack Overflow tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/create_so_indices.sql >> $log_file 2>&1
fi

if [ "$db_init" = true ] ; then
echo "Creating SOTorrent tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/create_sotorrent_tables.sql >> $log_file 2>&1
fi

if [ "$load_gh" = true ] ; then
echo "Loading GH tables..." | tee -a "$log_file"
sed -e"s/<PATH>/$data_path/g" ./sql/load_gh-references.sql > ./sql/load_gh-references_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/load_gh-references_paths.sql >> $log_file 2>&1
rm ./sql/load_gh-references_paths.sql

echo "Creating indices for GH References tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/create_gh-references_indices.sql >> $log_file 2>&1
fi

if [ "$load_sotorrent" = true ] ; then
echo "Loading SOTorrent tables..." | tee -a "$log_file"
sed -e"s/<PATH>/$data_path/g" ./sql/load_sotorrent.sql > ./sql/load_sotorrent_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/load_sotorrent_paths.sql >> $log_file 2>&1
rm ./sql/load_sotorrent_paths.sql

echo "Creating indices for SOTorrent tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/create_sotorrent_indices.sql >> $log_file 2>&1
fi

echo "Finished." | tee -a "$log_file"
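
The script escapes the slashes in `data_path` by hand because the value is spliced into a `sed` expression that uses `/` as its delimiter. A minimal sketch of an alternative, assuming the path contains no `|`, `&`, or backslash characters: use `|` as the `sed` delimiter so that the path can be written unescaped. The helper name `substitute_path` is hypothetical and not part of the commit.

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit.
# Replaces every <PATH> placeholder in a SQL template with the given path
# and writes the result to stdout.
# Assumption: the path contains no '|', '&', or backslash characters,
# all of which sed would interpret specially in this expression.
substitute_path() {
  path="$1"      # e.g. "/tmp/" or "E:/Temp/" (no escaped slashes needed)
  template="$2"  # e.g. ./sql/load_sotorrent.sql
  sed -e "s|<PATH>|$path|g" "$template"
}
```

With such a helper, a call like `substitute_path "/tmp/" ./sql/load_sotorrent.sql > ./sql/load_sotorrent_paths.sql` could stand in for the `sed -e"s/<PATH>/$data_path/g"` lines, and `data_path` could use the same unescaped form as in `export.sh`.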
6 changes: 3 additions & 3 deletions sotorrent/posttags/bigquery/PostTags.sql
@@ -1,6 +1,6 @@
SELECT temp.PostId AS PostId, tags.Id AS TagId
FROM `sotorrent-org.2020_08_31.Tags` tags
JOIN `sotorrent-org.2020_08_31.PostTagsTemp` temp
FROM `sotorrent-org.2020_11_12.Tags` tags
JOIN `sotorrent-org.2020_11_12.PostTagsTemp` temp
ON tags.TagName = temp.Tag;

=> `sotorrent-org.2020_06_31.PostTags`
=> `sotorrent-org.2020_11_12.PostTags`
File renamed without changes.
29 changes: 0 additions & 29 deletions sotorrent/posttags/export_posttags.sh

This file was deleted.

2 changes: 1 addition & 1 deletion sotorrent/posttags/load_posttags.sh
@@ -3,7 +3,7 @@
root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_06"
sotorrent_db="sotorrent20_09"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
4 changes: 2 additions & 2 deletions sotorrent/postviews/load_postviews.sh
@@ -3,14 +3,14 @@
root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_06"
sotorrent_db="sotorrent20_09"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
data_path="E:\/Temp\/postviews\/" # Cygwin
#data_path="\/tmp\/" # Linux

declare -a datadump_versions=("2016-09-12" "2016-12-15" "2017-03-14" "2017-06-12" "2017-12-01" "2018-03-13" "2018-06-05" "2018-09-05" "2018-12-02" "2019-03-04" "2019-06-03" "2019-09-04" "2019-12-02" "2020-03-02", "2020-06-02")
declare -a datadump_versions=("2016-09-12" "2016-12-15" "2017-03-14" "2017-06-12" "2017-12-01" "2018-03-13" "2018-06-05" "2018-09-05" "2018-12-02" "2019-03-04" "2019-06-03" "2019-09-04" "2019-12-02" "2020-03-02" "2020-06-02" "2020-09-08")

rm -f $log_file
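
The removed `datadump_versions` line had a stray comma after `"2020-03-02"`. Elements of a bash array literal are separated by whitespace and follow the same word rules as command arguments, so a comma after a closing quote becomes part of the element instead of acting as a separator. A minimal sketch that illustrates this with plain function arguments (the function name `show_elements` is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Prints each argument on its own line, wrapped in brackets,
# so that stray characters inside an element become visible.
show_elements() {
  for element in "$@"; do
    printf '[%s]\n' "$element"
  done
}

# The comma directly after the closing quote is glued onto the first word.
show_elements "2020-03-02", "2020-06-02"
# prints:
# [2020-03-02,]
# [2020-06-02]
```

Any later step that builds a file or directory name from such an element would then look for `2020-03-02,` rather than `2020-03-02`, which is presumably why the comma was dropped in the new line.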
