Recently

  • Web scraping 50 thousand documents from different sources for NLP

    In the following article we will explore methods of building your own dataset for NLP. Doing so from the ground up means solving a lot of problems. Publicly available datasets (e.g. on Kaggle) are usually preprocessed: NaN values are properly tagged and out-of-category rows are removed. That is not the case when you start with an empty spreadsheet that you want to populate with data straight from the source. By source I don’t mean an API that responds with nicely formatted JSON key-value pairs. We will learn how to get data directly from websites, store it, and clean it properly while avoiding unneeded losses. In Part II we will continue with NLP methods of exploration. Building a dataset yourself shows the importance (and hardships) of data cleaning. You will discover how much better you understand your data once you know what to remove from it and how to judge what to remove.
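    As a taste of the workflow the article describes, here is a minimal sketch of the scrape-then-clean loop using only Python's standard library. All names here are illustrative (the article's actual code may differ), and a literal HTML string stands in for a fetched page so the example is self-contained:

    ```python
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible text from a page, skipping script/style blocks."""
        def __init__(self):
            super().__init__()
            self.parts = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth and data.strip():
                self.parts.append(data.strip())

    def clean(records):
        """Tag missing-looking fields as None instead of silently dropping rows."""
        return [{k: (None if v in ("", "N/A") else v) for k, v in r.items()}
                for r in records]

    # In practice this HTML would come from an HTTP request to the source site.
    page = "<html><body><script>var x = 1;</script><p>Hello world</p></body></html>"
    parser = TextExtractor()
    parser.feed(page)
    print(parser.parts)          # -> ['Hello world']
    print(clean([{"title": "N/A", "body": "Hello world"}]))
    ```

    The point of `clean` tagging rather than dropping is exactly the "avoiding unneeded losses" idea above: a row with one missing field may still carry usable text in its other fields.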

  • A beginner programmer’s guide to solving easy problems.

    Writing code can be hard. Writing good code is even harder. Being stuck on a seemingly easy task while writing code is the hardest. For this there is a tremendous amount of resources available to help you get unstuck, all at your fingertips. However, even in the information age, where thousands of experienced programmers are willing to help you in their spare time and online communities (both free and paid) exist solely to solve ‘this-one-thing-you-cannot-wrap-your-head-around’, it is still up to you to know what the heck you are trying to do.

  • IndieHackers Statistics, Dataset Exploration Case Study

    The motivation for writing this article was mostly educational. It started as a quest to learn web scraping of JavaScript-based websites and basic data exploration using Python.

subscribe via RSS