Allen's 데이터 맛집
[2] SQL로 데이터 분석을 위한 EDA 본문
이번 프로젝트의 중요 목표 중 하나로, 이전 프로젝트들과 같이 Python을 사용하지 않고 SQL만을 사용하여 EDA를 해보겠습니다. 이를 위해 MSSQL을 사용하기 위해 SSMS(SQL Server Management Studio) 프로그램으로 진행하였습니다.
1.
이번 프로젝트에서 가장 많이 쓰일 컬럼 [Location, date, total_cases, new_cases, total_deaths, population]을 조회하여 데이터들을 파악합니다.
Select Location, date, total_cases, new_cases, total_deaths, population
From PortfolioProject..CovidDeaths
Where continent is not null
order by 1,2
2.
한국에서 코로나에 걸리면 사망할 가능성을 시간의 흐름에 따라 탐색해 봅니다.
Select Location, date, total_cases,total_deaths,
(CONVERT(float, total_deaths) / NULLIF(CONVERT(float, total_cases), 0)) * 100 as DeathPercentage
From PortfolioProject..CovidDeaths
Where location like '%south korea%'
and continent is not null
order by 1,2
3.
나라별 코로나에 감염된 인구의 비율을 시간의 흐름에 따라 탐색해 봅니다.
Select Location, date, Population, total_cases, (total_cases/population)*100 as PercentPopulationInfected
From PortfolioProject..CovidDeaths
order by 1,2
4.
인구 대비 감염률이 가장 높은 국가를 탐색하여 내림차순으로 정렬합니다.
Select Location, Population, MAX(total_cases) as HighestInfectionCount,
Max((total_cases/population))*100 as PercentPopulationInfected
From PortfolioProject..CovidDeaths
Group by Location, Population
order by PercentPopulationInfected desc
5.
국가별 총 사망자 수를 탐색하고 내림차순으로 정렬합니다.
Select Location, MAX(cast(Total_deaths as int)) as TotalDeathCount
From PortfolioProject..CovidDeaths
Where continent is not null
Group by Location
order by TotalDeathCount desc
대륙별 분류
6.
대륙별 총 사망자 수를 탐색하고 내림차순으로 정렬합니다.
Select continent, MAX(cast(Total_deaths as int)) as TotalDeathCount
From PortfolioProject..CovidDeaths
Where continent is not null
Group by continent
order by TotalDeathCount desc
7.
전 세계 총 감염자수와 총 사망자 수를 사용하여 전 세계 사망률을 계산합니다.
Select SUM(new_cases) as total_cases, SUM(cast(new_deaths as int)) as total_deaths,
SUM(cast(new_deaths as int))/SUM(New_Cases)*100 as DeathPercentage
From PortfolioProject..CovidDeaths
where continent is not null
order by 1,2
8.
최소한 1차 이상의 코로나 백신을 접종받은 인구를 탐색해 봅니다.
Select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations
, SUM(CONVERT(bigint,vac.new_vaccinations)) OVER
(Partition by dea.Location Order by dea.location, dea.Date) as RollingPeopleVaccinated
From PortfolioProject..CovidDeaths dea
Join PortfolioProject..CovidVaccinations vac
On dea.location = vac.location
and dea.date = vac.date
where dea.continent is not null
order by 2,3
9.
8번 쿼리를 CTE를 사용하여 Partition By 계산 수행, 최소한 1차 이상의 코로나 백신을 접종받은 인구 비율을 탐색해 봅니다.
With PopvsVac (Continent, Location, Date, Population, New_Vaccinations, RollingPeopleVaccinated)
as
(
Select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations
, SUM(CONVERT(bigint,vac.new_vaccinations)) OVER
(Partition by dea.Location Order by dea.location, dea.Date) as RollingPeopleVaccinated
From PortfolioProject..CovidDeaths dea
Join PortfolioProject..CovidVaccinations vac
On dea.location = vac.location
and dea.date = vac.date
where dea.continent is not null
)
Select *, (RollingPeopleVaccinated/Population)*100 as PercentPopulationVaccinated
From PopvsVac
GITHUB : https://github.com/siilver94/Analyzing-COVID19-Data