pakkinlau/README.md

👋 Pak Kin, LAU

I would love to connect! Feel free to send me a message at my Gmail account.

I am currently a master's student in mathematics at CUHK and a full-time researcher at CUHK, building big data solutions with Machine Learning, Large Language Models, and Graph Neural Networks for research on large historical archives (image + text).

📚Fields of (research) interest:

  • Graph Neural Networks
  • Knowledge modeling, representation theory of knowledge, and knowledge base schema design
  • Knowledge mining and curation (for personal or multi-user use)

🏘Open Source Packages

Status: Deployment Stage 🐲

  • Scene_graph_benchmark_nvidia: This repository provides a November-2024-reworked implementation of the Scene Graph Benchmark Docker container definition, aimed at tasks like scene graph generation, object detection, and relationship detection. The current implementation is based on NVIDIA's PyTorch container with GPU acceleration, ensuring compatibility with CUDA and cuDNN.

  • Envsync: Brings an apt-get-like experience to Python packages. Using Git hooks, it automatically updates requirements.txt and synchronizes the virtual environment of any Python project.

  • Gites: Aims to replicate the user experience of Google Drive or OneDrive using Git commands. It provides batch actions such as push, pull, and clone.


🏘My Curriculum Development Work

Status: Deployment Stage 🐲

  • Prog-for-humanists-web: I created this collection as a resource for senior-year humanities majors at CUHK. As part of a 3-credit course at the university, it helps them grow their programming skills in data engineering, database management, machine learning, natural language processing, and project deployment. The course aims to give students a solid foundation for developing modern and impactful humanities projects in Python.



🏘Open Source Programming Projects

While my closed source projects are not listed here, here's a glimpse into the projects I'm currently working on, categorized based on their development stage:

Status: Deployment Stage 🐲

Bots

  • Keyword-listening-discord-bot: This project represents my personal endeavor to create a Discord bot for my dedicated server. The bot is designed to monitor all messages within a specific Discord server, requiring the specification of a token and guild ID. Whenever it detects predefined commands or keywords in messages, it responds accordingly. In essence, my Discord bot serves as a helpful tool, providing instant access to information from a manual, aiding the coder members of the server with informative responses to their commands.
  • youtube-chatroom-response: An asynchronous YouTube chatroom response bot written in Python. It lets users customize the patterns to match in the chatroom and then responds automatically.
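
The keyword-matching idea behind both bots can be sketched without any chat library. This is only an illustrative sketch, not the code in either repository: `find_response` and the `responses` mapping are hypothetical names, and a real bot would call such a function from its message handler.

```python
def find_response(message, responses):
    """Return the reply for the first keyword found in `message`, or None.

    `responses` maps a keyword (hypothetical examples below) to a reply.
    Matching is case-insensitive and substring-based.
    """
    lowered = message.lower()
    for keyword, reply in responses.items():
        if keyword.lower() in lowered:
            return reply
    return None


# Hypothetical keyword table, for illustration only.
responses = {
    "!help": "See the manual for the list of commands.",
    "docker": "Check the Docker section of the manual.",
}

print(find_response("How do I install Docker?", responses))
```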

Finance

  • yahoo-finance-scraper: This Python package uses the API of the yfin package to collect data from the Yahoo Finance website. Check out the snapshots in readme.md in this repository for more details.

  • My TradingView profile: Check out my contributions to the traders' community library, which have garnered over 2000 stars on TradingView.

  • package_to_text: generates a two-part text output: 1. ASCII Tree: A visual, indented directory structure. 2. File Contents: A flat listing of each file’s text. The final result is automatically copied to your clipboard, making it convenient to share or paste into large language models (like ChatGPT). It also includes a rough token count estimate for LLM usage.
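
The two-part output described for package_to_text might look roughly like the sketch below. It is an assumption-laden illustration rather than the package's actual code: the function names are hypothetical, the token estimate assumes about four characters per token, and the clipboard step is left out so the sketch runs with the standard library only.

```python
import os


def ascii_tree(root, prefix=""):
    """Return the lines of an indented ASCII tree for directory `root`."""
    lines = []
    entries = sorted(os.listdir(root))
    for i, name in enumerate(entries):
        last = i == len(entries) - 1
        lines.append(prefix + ("`-- " if last else "|-- ") + name)
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Children of the last entry need no vertical guide line.
            lines.extend(ascii_tree(path, prefix + ("    " if last else "|   ")))
    return lines


def file_contents(root):
    """Return a flat listing: one header plus the text of each file."""
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, encoding="utf-8", errors="replace") as fh:
                chunks.append(f"--- {rel} ---\n{fh.read()}")
    return chunks


def package_to_text(root):
    """Combine the tree and the file listing; return (text, rough token count)."""
    tree = "\n".join(ascii_tree(root))
    body = "\n\n".join(file_contents(root))
    text = tree + "\n\n" + body
    approx_tokens = len(text) // 4  # assumption: roughly 4 characters per token
    return text, approx_tokens
```

Calling `package_to_text(".")` returns the combined text and the estimate; copying the text to the clipboard would be a separate step.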

Status: Development stage 🌱

Software engineering

  • Software Engineering Toolbox: Part 1 is completed. It curates several folders of tools so that users can easily diagnose a target directory during software development.

Project management

  • csv2gantt: Aims to provide a tool that converts CSV data files into Gantt charts.

🏘Open Source Gists

Status: Deployment Stage 🐲

Large Language Models

  • poe-langchain: Adapts the LLMs accessible on Poe to LangChain's LLM ecosystem.

Software engineering

  • print_dir_structure.py: Python script that prints a tree-like directory structure (folders + optional files) and automatically copies the output to the clipboard. It supports specifying a target directory or defaults to the current one. Use the --full flag to show files. Requires pyperclip.

Format change

  • gpt2md.py: A Python script that converts the LaTeX math format copied from ChatGPT into the LaTeX format that markdown tools (such as Obsidian) can display.
  • mermaid2md.py: This script automatically converts LLM output into markdown that is free of syntax errors.
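
A conversion of the kind gpt2md.py describes can be sketched as follows. This assumes the copied text uses `\( ... \)` for inline math and `\[ ... \]` for display math, and that the markdown target expects `$...$` and `$$...$$`; the function name is hypothetical and the actual script may handle more cases.

```python
import re


def chatgpt_latex_to_markdown(text):
    """Rewrite \\( \\) and \\[ \\] math delimiters as $ and $$ delimiters."""
    # Display math: \[ ... \]  ->  $$...$$
    text = re.sub(
        r"\\\[(.*?)\\\]",
        lambda m: "$$" + m.group(1).strip() + "$$",
        text,
        flags=re.DOTALL,
    )
    # Inline math: \( ... \)  ->  $...$
    text = re.sub(
        r"\\\((.*?)\\\)",
        lambda m: "$" + m.group(1).strip() + "$",
        text,
        flags=re.DOTALL,
    )
    return text


sample = r"Let \( x^2 \) and \[ \frac{a}{b} \]"
print(chatgpt_latex_to_markdown(sample))
```

The replacement is built in a function rather than a replacement string so that backslashes inside the math (such as `\frac`) are passed through untouched.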

File migration

  • Migrate_to_public_space: This Python script facilitates the management of two spaces: a creation space (a large and private area for work) and a publication space (a smaller area for selected items ready for publishing). Users can specify a list of items to migrate from the creation space to the publication space, streamlining the publishing process.
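
The migration step can be pictured with a short sketch using only the standard library. It is not the script's actual code: `migrate` and its parameters are hypothetical names, and copying (rather than moving) the items is an assumption.

```python
import shutil
from pathlib import Path


def migrate(creation_space, publication_space, items):
    """Copy the listed items from the creation space to the publication space.

    Returns the items that were actually migrated; items missing from the
    creation space are skipped.
    """
    src_root, dst_root = Path(creation_space), Path(publication_space)
    migrated = []
    for item in items:
        src = src_root / item
        if not src.exists():
            continue  # nothing to publish for this entry
        dst = dst_root / item
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            # dirs_exist_ok requires Python 3.8 or later
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
        migrated.append(item)
    return migrated
```

A call such as `migrate("creation", "publication", ["notes", "paper.md"])` would copy the `notes` folder and `paper.md` while leaving everything else in the private space.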

YouTube content download

  • youtube_subtitle: A Python script that outputs the subtitles of a YouTube video. Only the video_id is needed to run the script.
  • downloadYT_whole.py: A Python script that downloads a video without chopping. Only the videoID is needed to run the script.
  • downloadYT_chopping.py: A Python script that downloads a video with chopping. Only the videoID is needed to run the script.

Video editing

  • compress_concate.py: A Python script that compresses and concatenates a series of videos into a single video.

Scraping

  • Home PC activity parser: Automatically parses home PC user activity into a text file for real-time streaming, or into a database for further analytics.

  • JobsDB scraper: A Python script that scrapes job information from JobsDB, a popular recruitment website in Hong Kong. It also counts the skill sets mentioned and produces simple statistics on the collected data.
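
The skill-counting step of the JobsDB scraper can be illustrated in isolation. The sketch below assumes the job descriptions have already been collected as plain strings; the function name, the skill list, and the sample postings are all made up for illustration.

```python
import re
from collections import Counter


def count_skills(job_descriptions, skills):
    """Count how many job descriptions mention each skill at least once."""
    counts = Counter()
    for text in job_descriptions:
        for skill in skills:
            # Lookarounds keep "Java" from matching inside "JavaScript"
            # while still allowing names such as "C++".
            pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[skill] += 1
    return counts


# Made-up sample postings, for illustration only.
postings = [
    "Python and SQL required",
    "Experience with python, C++",
    "JavaScript developer",
]
print(count_skills(postings, ["Python", "SQL", "C++", "Java"]))
```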


🏘My Short-course Development Work

I have discovered a true passion for crafting articles that break down intricate concepts into easily digestible pieces. As a result, I am currently working on developing the following sets of materials for the public. Please remember to cite the source when using these materials.

Status: Development stage 🌱

  • GPU-Environment-Windows-Linux-Docker: When I first ventured into setting up a GPU-accelerated environment with CUDA and PyTorch, the process proved to be daunting and time-consuming. This was due to a couple of factors:

    • Choice Overload: With multiple setup methodologies available, it was unclear which combination of steps would yield a functional environment without trial and error.
    • Component Complexity: The setup involves integrating components from various providers, each independently maintained and with often opaque error messaging, which complicates troubleshooting.

    To help streamline this setup process, I've created a series of Jupyter notebooks that address different setup scenarios:

    • CUDA-GPU Environment: Detailed guides for various CUDA-GPU environment configurations, tailored to circumvent common pitfalls.
    • Docker Approach: Clear, step-by-step instructions for defining a Docker image, designed to be flexible and accommodate updates to underlying components.
    • Demonstration Scripts: Practical examples demonstrating how to utilize Large Language Models (LLMs) in the configured environment, providing a quick start to harnessing their capabilities.

    These resources aim to minimize setup headaches and get you started with a robust data science platform, whether you're working in Windows or Linux, and whether you prefer a native installation or a Dockerized solution.

  • Multilingual-Semantic-Search-Course: Contextual embeddings generated by LLMs are the core technology behind semantic search. This short-course repository walks through the environment setup and the concepts involved, and demonstrates some key steps of using these functions.

  • Big-Data-Integration-and-Processing-Course: This repository serves as a comprehensive guide to mastering big data techniques, featuring:

    • Detailed software setup instructions.
    • Integration Tools: Introduction to MongoDB, a NoSQL database well suited to document integration, alongside strategies for effective data amalgamation.
    • Processing Frameworks: Tutorials and examples on using Apache Spark, the leading platform for large-scale data processing.
    • Final project demonstration: Demonstrates how to integrate, update, query, and delete data in a >1TB textual corpus.
  • BigdataMath: During my learning journey, I've found that I can write effective notes that simplify complex concepts, making them more accessible and comprehensible. Consequently, I'm planning to write a series of articles covering a variety of topics in the mathematics of big data, an area in which I have a strong interest.


🔒 License & Usage Rights

The content of these repositories is available for educational and informational purposes. While I encourage you to explore, learn from, and engage with the material, please respect the following terms:

  • Access: The repositories are publicly accessible and open to everyone.
  • Use: You are free to clone the repositories, run the Jupyter notebooks, and build upon the material for personal and educational use.
  • Distribution: Redistribution of the original or modified content is allowed, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests I endorse you or your use.
  • Commercial Use: Commercial use of the content may be restricted depending on the license chosen. Please review the specific license of each repository to understand what is and isn't allowed.
  • Contribution: If you wish to contribute to the repositories or have any suggestions, feel free to open an issue or a pull request in the respective repository.
